Adaptive Task Assignment for Crowdsourced Classification

Chien-Ju Ho, Shahin Jabbari (cjho@cs.ucla.edu, shahin@cs.ucla.edu)
University of California, Los Angeles

Jennifer Wortman Vaughan (jenn@microsoft.com)
Microsoft Research, New York City and University of California, Los Angeles

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013.

Abstract

Crowdsourcing markets have gained popularity as a tool for inexpensively collecting data from diverse populations of workers. Classification tasks, in which workers provide labels (such as "offensive" or "not offensive") for instances (such as "websites"), are among the most common tasks posted, but due to human error and the prevalence of spam, the labels collected are often noisy. This problem is typically addressed by collecting labels for each instance from multiple workers and combining them in a clever way, but the question of how to choose which tasks to assign to each worker is often overlooked. We investigate the problem of task assignment and label inference for heterogeneous classification tasks. By applying online primal-dual techniques, we derive a provably near-optimal adaptive assignment algorithm. We show that adaptively assigning workers to tasks can lead to more accurate predictions at a lower cost when the available workers are diverse.

1. Introduction

Crowdsourcing markets provide a platform for inexpensively harnessing human computation power to solve tasks that are notoriously difficult for computers. In a typical crowdsourcing market, such as Amazon Mechanical Turk, registered users may post their own "microtasks," which are completed by workers in exchange for a small payment, usually around ten cents. A microtask may involve, for example, verifying the phone number of a business, determining whether or not an image contains a tree, or determining (subjectively) [...] (Wais et al., 2010). For classification tasks, this problem can be overcome by collecting labels for each instance from multiple workers and combining these to infer the true label. Indeed, much recent research has focused on developing algorithms for combining labels from heterogeneous labelers (Dekel [...]; Ipeirotis et al., 2010). However, this research has typically focused on the inference problem, sidestepping the question of how to assign workers to tasks by assuming that the learner has no control over the assignment. One exception is the work of Karger et al. (2011a;b), who focus on the situation in which all tasks are homogeneous (i.e., equally difficult and not requiring specialized skills), in which case they show that it is not possible to do better than using a random assignment.

One might expect the assignment to matter more when the tasks are heterogeneous. Classifying images of dogs versus images of cats is likely easier for the average worker than classifying images of Welsh Terriers versus images of Airedale Terriers. It might be necessary to assign more workers to tasks of the latter type to produce high-confidence labels. The assignment can also be important when tasks require specialized skills. A worker who knows little about dogs may not be able to produce high-quality labels for the Terrier task, but may have skills that are applicable elsewhere.
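As a point of reference for the repeated-labeling approach discussed above, the following is a minimal sketch, not an algorithm from this paper: it combines several noisy binary labels per instance by simple majority vote, the baseline that the cited label-inference methods improve upon. The function name and example data are hypothetical.

```python
from collections import defaultdict

def aggregate_majority(labels):
    """Combine repeated binary labels (+1/-1) per instance by majority vote.

    `labels` is a list of (instance_id, worker_label) pairs; ties go to +1.
    """
    votes = defaultdict(int)
    for instance, label in labels:
        votes[instance] += label
    return {inst: (1 if total >= 0 else -1) for inst, total in votes.items()}

# Example: three workers label instance "img_7"; the single erroneous -1 is outvoted.
print(aggregate_majority([("img_7", +1), ("img_7", +1), ("img_7", -1)]))  # {'img_7': 1}
```

Methods like those cited above replace this uniform vote with estimates of each labeler's reliability; the plain vote here is only the starting point.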
We investigate the problem of task assignment and label inference for heterogeneous classification tasks. In our model, a task requester has a set of tasks, each of which consists of an instance for which he would like to infer a binary label. Workers arrive online. The learner must decide which tasks to assign to each worker, and then use the noisy labels produced by the workers to infer the true label for each task. The goal of the learner is to output a set of labels with sufficiently low error while requesting as few labels from workers as possible. Building on online primal-dual methods (Buchbinder [...]) and Ho [...]

[...] (Devanur et al., 2011) to network optimization (Alon et al., 2004) and paging (Bansal et al., 2007). Ho [...]: if the requester could quickly verify the accuracy of labels, he would not need the workers' labels in the first place. In their model, each task may be assigned to a worker only once. In ours, repeated labeling is necessary, since there would be no way to estimate worker quality without it. These differences require a different problem formulation and novel analysis techniques.

Repeated labeling has received considerable empirical attention, dating back to the EM-based algorithm of Dawid & Skene (1979) [...] ([...] 2008; Dekel [...]).

[...] when it is close to 0, the label will be random noise. To model the fact that the requester cannot wait arbitrarily long, we assume that he can only assign tasks to the first m workers who arrive, for some known m. We therefore index workers 1, ..., m. Later we consider an additional m workers who are used for exploration. In addition to assigning tasks to workers, the learning algorithm must produce a final e[...]
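To make the setting described above concrete, here is a minimal simulation sketch, not the paper's algorithm: workers arrive one at a time, the learner chooses a task for each arrival, the worker returns a noisy +1/-1 label whose quality is governed by a skill parameter (near 1: correct, near 0: random noise), and the learner finally infers labels by majority vote. The P(correct) = (1 + skill)/2 parameterization, the helper names, and the toy smallest-margin assignment policy are all assumptions for illustration.

```python
import random

def worker_label(true_label, skill):
    """Return one noisy +1/-1 label. `skill` in [0, 1] is an assumed
    accuracy-like parameter: near 1 the worker reports the true label, near 0
    the report is an unbiased coin flip. The parameterization
    P(correct) = (1 + skill) / 2 is only for illustration."""
    return true_label if random.random() < (1 + skill) / 2 else -true_label

def run_online_assignment(true_labels, worker_skills, pick_task):
    """Toy version of the online protocol: workers 1..m arrive one at a time,
    the learner picks a task for each, observes a noisy label, and finally
    infers each task's label from the accumulated votes (ties go to +1)."""
    votes = {task: 0 for task in true_labels}
    for skill in worker_skills:              # worker arrivals, in order
        task = pick_task(votes)              # the learner's assignment decision
        votes[task] += worker_label(true_labels[task], skill)
    return {task: (1 if v >= 0 else -1) for task, v in votes.items()}

# Toy adaptive policy: give the arriving worker the task with the smallest vote margin.
truth = {"t1": +1, "t2": -1}
skills = [0.9, 0.8, 0.7, 0.6]                # the first m = 4 workers
pred = run_online_assignment(truth, skills, lambda v: min(v, key=lambda t: abs(v[t])))
print(pred)
```

An adaptive policy like the toy one above spends additional labels on tasks whose current votes are least decisive, which matches the intuition from the introduction that harder tasks may need more workers to reach high-confidence labels.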