Google Cluster Computing Faculty Training Workshop
Module III: Nutch

Meta-details
- Built to encourage public search work
  - Open source, w/ pluggable modules
  - Cheap to run, both machines & admins
- Goal: search more pages, with better quality, than any other engine
  - Pretty good ranking
  - Has done ~200M pages, more possible
- Hadoop is a spinoff

Outline
- Nutch design
  - Link database, fetcher, indexer, etc.
- Hadoop support
  - Distributed filesystem, job control

Moving Parts
- Acquisition cycle
  - WebDB
  - Fetcher
- Index generation
  - Indexing
  - Link analysis (maybe)
- Serving results

WebDB
- Contains info on all pages, links
  - Pages: URL, last download, # failures, link score, content hash, ref counting
  - Links: source hash, target URL
- Must always be consistent
- Designed to minimize disk seeks
  - 19 ms seek time x 200M new pages/mo = ~44 days of disk seeks
- Single-disk WebDB was a huge headache

Fetcher
- Fetcher is very stupid. Not a "crawler"
- Pre-MapRed: divide the "to-fetch list" into k pieces, one for each fetcher machine
- URLs for one domain go to the same list, otherwise random
  - "Politeness" w/o inter-fetcher protocols
  - Can observe robots.txt similarly
  - Better DNS, robots caching
  - Easy parallelism
- Two outputs: pages, WebDB edits

WebDB/Fetcher Updates
1. Write down fetcher edits
2. Sort edits (externally, if necessary)
3. Read streams in parallel, emitting new database
4. Repeat for other tables

Indexing
- Iterate through all k page sets in parallel, constructing inverted index
- Creates a "searchable document" of:
  - URL text
  - Content text
  - Incoming anchor text
- Other content types might have different document fields
  - E.g., email has sender/receiver
  - Any searchable field the end user will want
- Uses Lucene text indexer

Link analysis
- A page's relevance depends on both intrinsic and extrinsic factors
  - Intrinsic: page title, URL, text
  - Extrinsic: anchor text, link graph
- PageRank is the most famous of many
- Others include:
  - HITS
  - OPIC
  - Simple incoming link count
- Link analysis is sexy, but its importance is generally overstated

Link analysis (2)
- Nutch performs analysis in WebDB
  - Emit a score for each known page
  - At index time, incorporate score into inverted index
- Extremely time-consuming
  - In our case, disk-consuming too (because we want to use low-memory machines)
- Fast and easy: 0.5 * log(# incoming links)

Query Processing
[Figure: the query "britney" is sent to five index servers, each holding one slice of the document space (Docs 0-1M, 1-2M, 2-3M, 3-4M, 4-5M). Each server returns its matching document IDs, and the per-server results are combined into one result list.]

Administering Nutch
- Admin costs are critical
  - It's a hassle when you have 25 machines
  - Google has 100k, probably more
- Files
  - WebDB content, working files
  - Fetchlists, fetched pages
  - Link analysis outputs, working files
  - Inverted indices
- Jobs
  - Emit fetchlists, fetch, update WebDB
  - Run link analysis
  - Build inverted indices

Administering Nutch (2)
- Admin sounds boring, but it's not!
  - Really
  - I swear
- Large-file maintenance
  - Google File System (Ghemawat, Gobioff, Leung)
  - Nutch Distributed File System
- Job control
  - Map/Reduce (Dean and Ghemawat)
  - Pig (Yahoo Research)
- Data storage (BigTable)

Nutch Distributed File System
- Similar, but not identical, to GFS
- Requirements are fairly strange
  - Extremely large files
  - Most files read once, from start to end
  - Low admin costs per GB
- Equally strange design
  - Write-once, with delete
  - Single file can exist across many machines
  - Wholly automatic failure recovery

NDFS (2)
- Data divided into blocks
- Blocks can be copied, replicated
- Datanodes hold and serve blocks
- Namenode holds metainfo
  - Filename -> block list
  - Block -> datanode location
- Datanodes report in to namenode every few seconds

NDFS File Read
[Diagram: one Namenode and Datanodes 0 through 5]
- Client asks namenode for filename info
- Namenode responds with block list, and location(s) for each block
- Client fetches each block, in sequence, from a datanode
- Example, "crawl.txt":
  - block 33: datanodes 1, 4
  - block 95: datanodes 0, 2
  - block 65: datanodes 1, 4, 5

NDFS Replication
[Diagram: Namenode; Datanode 0 holds blocks 33, 95; Datanode 1 holds 46, 95; Datanode 2 holds 33, 104; Datanode 3 holds 21, 33, 46; Datanode 4 holds 90; Datanode 5 holds 21, 90, 104]
- Always keep at least k copies of each blk
- Imagine datanode 4 dies; one copy of blk 90 is lost
- Namenode loses heartbeat, decrements blk 90's reference count, and asks datanode 5 to replicate blk 90 to datanode 0
- Choosing replication target is tricky

Map/Reduce
- Map/Reduce is a programming model from Lisp (and other places)
  - Easy to distribute across nodes
  - Nice retry/failure semantics
- map(key, val) is run on each item in set
  - emits key/val pairs
- reduce(key, vals) is run for each unique key emitted by map()
  - emits final output
- Many problems can be phrased this way

Map/Reduce (2)
- Task: count words in docs
  - Input consists of (url, contents) pairs
  - map(key=url, val=contents):
    - For each word w in contents, emit (w, "1")
  - reduce(key=word, values=uniq_counts):
    - Sum all "1"s in values list
    - Emit result "(word, sum)"

Map/Reduce (3)
- Task: grep
  - Input consists of (url+offset, single line)
  - map(key=url+offset, val=line):
    - If contents matches regexp, emit (line, "1")
  - reduce(key=line, values=uniq_counts):
    - Don't do anything; just emit line
- We can also do graph inversion, link analysis, WebDB updates, etc.

Map/Reduce (4)
- How is this distributed?
  - Partition input key/value pairs into chunks, run map() tasks in parallel
  - After all map()s are complete, consolidate all emitted values for each unique emitted key
  - Now partition space of output map keys, and run reduce() in parallel
- If map() or reduce() fails, reexecute!

Map/Reduce Job Processing
[Diagram: one JobTracker and TaskTrackers 0 through 5]
- Client submits "grep" job, indicating code and input files
- JobTracker breaks input file into k chunks (in this case, 6) and assigns work to tasktrackers
- After map(), tasktrackers exchange map output to build reduce() keyspace
- JobTracker breaks reduce() keyspace into m chunks; in this case …
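The word-count and grep tasks from the Map/Reduce slides can be sketched on a single machine as follows. This is a minimal illustration, not Nutch or Hadoop code: `run_mapreduce`, the sample URLs, and the sample text are all hypothetical, and the driver runs everything in one process instead of partitioning work across tasktrackers. It also emits the integer 1 where the slides emit the string "1".

```python
import re
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map every record, group by key, reduce."""
    grouped = defaultdict(list)
    # Map phase: each input (key, value) may emit any number of (key, value) pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    # Reduce phase: run once per unique emitted key (sorted for stable output).
    results = []
    for out_key in sorted(grouped):
        results.extend(reduce_fn(out_key, grouped[out_key]))
    return results


def wordcount_map(url, contents):
    # For each word w in contents, emit (w, 1).
    for word in contents.split():
        yield word, 1


def wordcount_reduce(word, counts):
    # Sum all the 1s and emit (word, sum).
    yield word, sum(counts)


def make_grep_map(pattern):
    regex = re.compile(pattern)

    def grep_map(location, line):
        # location stands in for the slides' url+offset key.
        if regex.search(line):
            yield line, 1

    return grep_map


def grep_reduce(line, counts):
    # Nothing to aggregate: just emit the matching line.
    yield line


docs = [
    ("http://example.com/a", "nutch fetches pages"),
    ("http://example.com/b", "nutch indexes pages"),
]
word_counts = dict(run_mapreduce(docs, wordcount_map, wordcount_reduce))

lines = [
    (("http://example.com/a", 0), "nutch fetches pages"),
    (("http://example.com/a", 20), "lucene builds the index"),
    (("http://example.com/b", 0), "nutch indexes pages"),
]
matches = run_mapreduce(lines, make_grep_map(r"nutch"), grep_reduce)

print(word_counts)
print(matches)
```

In a distributed setting the grouping step inside `run_mapreduce` is what the tasktrackers do when they exchange map output, and the sorted key loop corresponds to the partitioned reduce() keyspace.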