Crowdsourcing tasks in Linked Data management

Elena Simperl¹, Barry Norton², and Denny Vrandecic³

¹,³ Institute AIFB, Karlsruhe Institute of Technology, Germany
² Ontotext AD, Bulgaria
¹ elena.simperl@kit.edu, ² barry.norton@ontotext.com, ³ denny.vrandecic@kit.edu

Abstract. Many aspects of Linked Data management, including exposing legacy data and applications to semantic formats, designing vocabularies to describe RDF data, identifying links between entities, query processing, and data curation, are necessarily tackled through the combination of human effort with algorithmic techniques. In the literature on traditional data management the theoretical and technical groundwork to realize and manage such combinations is being established. In this paper we build upon and extend these ideas to propose a framework by which human and computational intelligence can co-exist by augmenting existing Linked Data and Linked Service technology with crowdsourcing functionality. Starting from a motivational scenario we introduce a set of generic tasks which may feasibly be approached using crowdsourcing platforms such as Amazon's Mechanical Turk, explain how these tasks can be decomposed and translated into MTurk projects, and roadmap the extensions to SPARQL, D2RQ/R2R and Linked Data browsing that are required to achieve this vision.

1 Introduction

One of the basic design principles in Linked Data is that its usage in applications should be amenable to a high level of automation. Standardized interfaces should allow applications to load data directly from the Web, resolve descriptions of unknown resources, and automatically integrate data sets published by different parties according to various vocabularies. But the actual experience with developing applications that consume Linked Data soon reveals that for many components of a Linked Data application this is hardly the case, and that many aspects of Linked Data management remain, for principled or technical reasons, heavily reliant on human intervention.
This includes exposing legacy data and applications to semantic formats, designing vocabularies to describe RDF data, identifying links between entities, vocabulary mapping, query processing over distributed data sets, and data curation, to name only several of the more prominent examples. In all of these areas, human abilities are indispensable for the resolution of those particular tasks that are acknowledged to be hardly approachable in a systematic, engineering-driven fashion; and also, though to a lesser extent, for those tasks that have been subject to a wide array of techniques that attempt to perform them automatically, but yet require human input to produce training data and validate their results.

In previous work of ours we have extensively discussed the importance of combining human and computational intelligence to handle such inherently human-driven tasks, which, abstracting from their technical flavor in the context of Linked Data, tend to be highly contextual and often knowledge-intensive, thus challenging to fully automate through algorithmic approaches [2, 19]. Instead of aiming at such fully automated solutions, which often do not reach a level of quality required to create useful results and applications,¹ we propose a framework in which such human computation becomes an integral part of existing Linked Data and Linked Service technology as crowdsourcing functionality exposed via platforms such as Amazon's Mechanical Turk.² We argue that the types of tasks that are decisively required to run a Linked Data application can largely be uniformly decomposed, and a formal, declarative description of the domain, scope and purpose of the application can form the basis for the automatic design and seamless operation of crowdsourcing features to overcome the limitations and complement computational methods and techniques.
As a next step, we explain how these tasks can be decomposed and translated into MTurk projects, and roadmap the extensions to SPARQL, D2RQ/R2R and Linked Data browsing that are required to turn the access to human intelligence in the context of specific applications into a commodity.

2 Human intelligence tasks in Linked Data management

Two of the primary advantages claimed for exposing data sets in the form of Linked Data are improvements and uniformity, allowing provision at Web-scale, in data discovery and data integration. In the former case a "follow-your-nose" approach is enabled, wherein links between data sets facilitate browsing through the Web of Data. On the technical level previously undiscovered data is aggregated, and enriches the semantics of known resources (ad hoc integration), by virtue of RDF's uniform data model. True integration across this Web of Data, however, is hampered by the "publish first, refine later" philosophy encouraged by the Linking Open Data movement. While this has resulted in an impressive amount of Linked Data online, quality of the actual data and of the links connecting data sets is som