Search Engine Technology
Yan Hongfei (闫宏飞)
Network Laboratory, Department of Computer Science, Peking University

December 24, 2004, CERNET2004

Outline
- How search engines work
- Information retrieval: related research and institutions

Web Search Engines
- Definition: a system that lets users submit queries, retrieves a list of web pages relevant to each query, and returns that list in ranked order.
- Ways to create the index
  - Manual indexing
  - Automatic indexing
- System architecture
  - Centralized architecture
  - Distributed architecture

[Figure. Labels: Browsing Services; Search Engine Services; Web Pages; Bag of Words; Two semantics extremes; Two service extremes; ?]

The Three-Stage Workflow of a Search Engine
- Crawling
  - Batch crawling, incremental crawling
  - Crawl targets, crawl strategy
- Preprocessing
  - Keyword extraction
  - Duplicate page elimination
  - Link analysis
  - Indexing
- Serving
  - Query modes and matching
  - Result ranking
  - Document summaries
[Figure. Labels: Crawl; Organize; Serve]

Search Engine System Flow
[Figure]

Tianwang Search Engine System Flow
[Figure]

Distributed Web Crawling System Architecture
[Figure. Labels: coordinator process (node) and fetch process, repeated three times; scheduling module]

Tianwang Storage Format
    version: 1.0                          / version number
    url:                                  / URL
    origin:                               / original URL
    date: Tue, 15 Apr 2003 08:13:06 GMT   / time of harvest
    ip: 162.105.129.12                    / IP address
    unzip-length: 30233                   / if included, the data must be compressed
    length: 18133                         / data length
                                          / a blank line
    XXXXXXXX                              / the following is the data part
    XXXXXXXX
    ...
    XXXXXXXX                              / data end
                                          / insert a new line

Indexes
Choices for accessing data during query evaluation:
- Scan the entire collection
  - Typical in early (batch) retrieval systems
  - Computational and I/O costs are O(characters in collection)
  - Practical only for “small” text collections
  - Large-memory systems make scanning feasible
- Use indexes for direct access
  - Evaluation time is O(query term occurrences in collection)
  - Practical for “large” collections
  - Many opportunities for optimization
- Hybrids: use a small index, then scan a subset of the collection

Indexes
What should the index contain?
- Database systems index primary and secondary keys
  - This is the hybrid approach
  - The index provides fast access to a subset of database records
  - Scan the subset to find the solution set
- IR problem:
  - Cannot predict the keys that people will use in queries
  - Every word in a document is a potential search term
- IR solution: index by all keys (words), i.e. full-text indexes

Index Contents
The contents depend upon the retrieval model.
- Feature presence/absence
  - Boolean
  - Statistical (tf, df, ctf, doclen, maxtf)
  - Often about 10% of the size of the raw data, compressed
- Positional
  - Feature location within the document
  - Granularities include word, sentence, paragraph, etc.
  - Coarse granularities are less precise, but take less space
  - Word-level granularity is about 20-30% of the size of the raw data, compressed

Indexes: Implementation
- Common implementations of indexes
  - Bitmaps
  - Signature files
  - Inverted files
- Common index components
  - Dictionary (lexicon)
  - Postings
    - document ids
    - word positions

Inverted Files
[Figure: no positional data indexed]

Inverted Files
[Figure]

Word-Level Inverted File
[Figure]

Inverted Search Algorithm
1. Find query elements (terms) in the lexicon
2. Retrieve postings for each lexicon entry
3. Manipulate postings according to the retrieval model

Word-Level Inverted File
Query:
1. porridge & pot (BOOL)
2. “porridge pot” (BOOL)
3. porridge pot (VSM)
[Figure. Labels: lexicon; posting]

Outline
- How search engines work
- Information retrieval: related research and institutions

A Brief History of Modern Information Retrieval
- 1945: Vannevar Bush published “As We May Think” in The Atlantic Monthly.
- 1960s: the SMART system was built by Gerard Salton and his students; the Cranfield evaluations were carried out by Cyril Cleverdon.
- 1970s and 1980s: many developments built on the advances of the 1960s.
- 1992: inception of the Text Retrieval Conference (TREC).
- From 1996: the algorithms developed in IR were employed for searching the Web.

Clustering of SIGIR papers by topic vs. year
[Figure]
Topics (one slide each):
- Question answering
- Clustering
- Inverted files & Implementations
- Message understanding & TDT
- Filtering
- Hypertext IR, Multiple evidence
- Probabilistic & Language models
- Distributed IR
- Evaluation
- Topic distillation & Linkage retrieval
- Text categorisation
- Document summarisation
- Cross lingual

Information Retrieval: Related Research and Institutions
- CIIR, University of Massachusetts
- LTI, Carnegie Mellon University
- The Stanford University DB Group
- Microsoft Research Asia
- TREC
- Tianwang Group, Network Laboratory, Peking University

Introduction to Lemur

Lemur Toolkit
- Goal: a research system to facilitate LM (language modeling) and IR research
  - ad hoc retrieval, distributed retrieval, cross-language IR, summarization, filtering, and classification
- Features:
  - Supports indexing of large-scale document collections
  - Builds Simple Language Models
  - Implements retrieval systems based on language models and several other retrieval models
- Implementation:
  - C and C++
  - Unix / Windows
  - Current version: 3.1

MRA: Towards Next Generation Web Search
- From Pages to Blocks
  - Analyze the Web at finer granularity
- From Surface Web to Deep Web
  - Unleash the huge assets of high-value information
- From Unstructured to Structured
  - Provide well-organized results
- From Relevance to Intelligence
  - Contribute knowledge discovery with search
- From Desktop Search to Mobile Search
  - Bridge physical-world search to digital-world search

The Stanford Univ. DB Group
- WebBase
  - Crawling, storage, indexing, and querying of large collections of Web pages
- Digital Libraries
  - Infrastructure and services for creating, disseminating, sharing and managing information

TREC Conference
- Established in 1992 to evaluate large-scale IR
  - Retrieving documents from a gigabyte collection
- Has run continuously since then
  - The TREC 2004 (13th) meeting is in November
- Run by NIST's Information Access Division
- Probably the best-known IR evaluation setting
  - Started with 25 participating organizations in the 1992 evaluation
  - In 2003, there were 93 groups from 22 different countries
- Proceedings available on-line ( )
  - Overview of TREC 2003 at

TREC General Format
- TREC consists of IR research tracks
  - Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP
- Each track works on roughly the same model
  - November: track approved by the TREC community
  - Winter: track members finalize the format for the track
  - Spring: researchers train their systems based on the specification
  - Summer: researchers carry out the formal evaluation
    - Usually a “blind” evaluation: researchers do not know the answers
  - Fall: NIST carries out the evaluation
  - November: group meeting (TREC) to find out:
    - How well your site did
    - How others tackled the problem
- Many tracks are run by volunteers outside of NIST (e.g. Web)
- “Coopetition” model of evaluation
  - Successful approaches are generally adopted in the next cycle

TREC Tracks
[Figure]

Summary of VLC/Web Track evaluation 1996 - 2003
[Figure]

Tianwang Group, PKU
[Figures]

CWT100g Construction Timeline
[Figure]
One small step for me, one giant leap for mankind!

Peking University Yanqiong (燕穹) data sharing as of 2004-12-20
2.5/8.8 = 28.4%

Participating Teams That Submitted Results
[Table]
Note: pooling also includes the retrieval results of five search engines: google, yisou, baidu, sogou, and zhongsou.

Evaluation Results
- Topic distillation
- Navigational search
[Table]
Note: TIANWANG_RUN is for reference only.

Summary
- How search engines work
- Information retrieval: related research and institutions

Thank you!

Vector Space Model
Document d and query q are represented in the vector space as two m-dimensional vectors. The weight in each dimension is TF-IDF, and the similarity of d and q is measured by the cosine of the angle between the two vectors (using the raw tf and idf formulas).
[Equation]

Query and Answer
    Query                        Answer
    1. porridge & pot (BOOL)     d2
    2. “porridge pot” (BOOL)     null
    3. porridge pot (VSM)        d2, d1, d5

CIIR: Center for Intelligent Information Retrieval, UMASS
One of the leading research groups in IR:
- Improving the probabilistic models; first description of a retrieval system based on statistical language models
- Introduced and improved a number of techniques for text and query representation
- Automatically representing databases and combining local searches for DIR
- First high-capacity probabilistic filtering architecture
- Defined and evaluated the first versions of event detection and tracking software
- Earliest research on ranking and representation techniques for Asian languages
- First approaches to information extraction that emphasized learning
- Novel techniques for indexing images and video

CIIR (cont.)
- Research
  - More than 500 journal and refereed conference papers over the past 12 years (52 submissions in 2003)
- Industrial and government collaboration
  - INQUERY: software licensed to nearly 300 sites
- Education
  - 20 Ph.D.s, 29 M.S.
  - 123/145, 34/4 graduate/undergraduate

CIIR (cont.)
- Personnel
  - Faculty: 4 (W. Bruce Croft)
  - Technical personnel: 10
  - Graduate students: 34/10
- Groups
  - IESL: Information Extraction and Synthesis Laboratory
  - IR: Information Retrieval Laboratory
  - MIR: Multimedia Indexing and Retrieval Laboratory
- The CIIR is currently concentrating on the unsolved long-term research problems that underlie effective information retrieval: text representation, query acquisition, retrieval models

LTI: Language Technologies Institute, CMU
- Machine Translation, Natural Language Processing, Speech, and Information Retrieval
- IR projects (Jamie Callan and Yiming Yang)
  - Adaptive Information Filtering
  - Distributed Information Retrieval / Federated Search
  - Email Classification and Prioritization
  - Minerva: Web Mining for Question Answering
  - MuchMore: Translingual Information Retrieval
  - JAVELIN: Open-Domain Question Answering
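
The Tianwang storage format (天网存储格式) slide describes each record as a block of `name: value` header lines, a blank line, and then `length` bytes of data. A minimal sketch of a reader for one such record is shown below. It assumes `\n` line endings and ASCII headers, neither of which the slide states; the function name `parse_tianwang_record` and the sample URL are hypothetical. The slide says the data is compressed when `unzip-length` is present but does not name the compression scheme, so the sketch returns the bytes undecoded.

```python
from typing import Dict, Tuple


def parse_tianwang_record(raw: bytes) -> Tuple[Dict[str, str], bytes]:
    """Split one record into (header fields, data bytes).

    Assumed layout: 'name: value' header lines, one blank line,
    then exactly `length` bytes of data.
    """
    header_blob, sep, rest = raw.partition(b"\n\n")
    if not sep:
        raise ValueError("no blank line between header and data")

    headers: Dict[str, str] = {}
    for line in header_blob.decode("ascii").splitlines():
        # Split at the first colon only, so values such as dates
        # and URLs may contain further colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()

    length = int(headers["length"])
    data = rest[:length]
    if len(data) != length:
        raise ValueError("record is truncated")
    # If 'unzip-length' is present the data is compressed; the scheme
    # is not given on the slide, so no decompression is attempted here.
    return headers, data


sample = (
    b"version: 1.0\n"
    b"url: http://example.com/\n"  # hypothetical URL, for illustration only
    b"date: Tue, 15 Apr 2003 08:13:06 GMT\n"
    b"ip: 162.105.129.12\n"
    b"length: 11\n"
    b"\n"
    b"hello world\n"
)
headers, data = parse_tianwang_record(sample)
print(headers["ip"], data)
```

Reading the data by the declared `length` rather than by searching for a terminator keeps the parser safe when the page body itself contains blank lines.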
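
The Inverted Search Algorithm and Word-Level Inverted File slides can be illustrated with a small sketch of the two Boolean queries. The documents behind the slide's figure did not survive extraction, so the code assumes the six-line "Pease porridge" rhyme that is commonly used for this example; that document set, and the helper names `build_index`, `boolean_and`, and `phrase`, are assumptions rather than the author's implementation.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed document set (not recoverable from the slide's figure).
DOCS: Dict[int, str] = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}

Postings = Dict[int, List[int]]  # doc id -> word positions in that doc


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic runs as terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs: Dict[int, str]) -> Dict[str, Postings]:
    """Build a word-level inverted file: term -> {doc id: [positions]}."""
    index: Dict[str, Postings] = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index: Dict[str, Postings], terms: List[str]) -> List[int]:
    """Steps 1-3 for a Boolean AND: look up each term in the lexicon,
    fetch its postings, and intersect the document ids."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


def phrase(index: Dict[str, Postings], terms: List[str]) -> List[int]:
    """Phrase query: terms must occur at consecutive positions."""
    hits = []
    for doc_id in boolean_and(index, terms):
        starts = index[terms[0]][doc_id]
        if any(
            all(p + k in index[t][doc_id] for k, t in enumerate(terms))
            for p in starts
        ):
            hits.append(doc_id)
    return hits


index = build_index(DOCS)
print(boolean_and(index, ["porridge", "pot"]))  # [2]
print(phrase(index, ["porridge", "pot"]))       # []
```

Under the assumed documents, the results agree with the answers listed in the deck: d2 for `porridge & pot`, and no match (null) for the phrase “porridge pot”. The phrase query needs the word positions stored in the postings, which is what distinguishes a word-level inverted file from one that stores document ids only.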
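
The equation on the Vector Space Model slide was an image and is not present in the text. The slide's description (two m-dimensional TF-IDF vectors compared by the cosine of their angle) corresponds to the standard cosine similarity below. The notation is an assumption: $w_{i,d}$ and $w_{i,q}$ denote the weights of term $i$ in the document and the query, $N$ is the number of documents, and $\mathit{df}_i$ is the number of documents containing term $i$. The logarithmic idf is one common "raw" form and may differ from the variant on the original slide.

```latex
\[
\mathrm{sim}(d, q) = \cos\theta
  = \frac{\sum_{i=1}^{m} w_{i,d}\, w_{i,q}}
         {\sqrt{\sum_{i=1}^{m} w_{i,d}^{2}}\;\sqrt{\sum_{i=1}^{m} w_{i,q}^{2}}},
\qquad
w_{i,x} = \mathit{tf}_{i,x} \cdot \mathit{idf}_{i},
\qquad
\mathit{idf}_{i} = \log\frac{N}{\mathit{df}_{i}}
\]
```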
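
The third query on the Word-Level Inverted File slide, `porridge pot (VSM)`, can be sketched as a cosine ranking over TF-IDF vectors. As before, the six-line "Pease porridge" rhyme is an assumed stand-in for the documents in the lost figure, and the weighting (raw tf times log(N/df)) is one plausible reading of the slide's "raw tf, idf formulas", not a confirmed one. The function name `rank` is hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Assumed document set (not recoverable from the slide's figure).
DOCS: Dict[int, str] = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic runs as terms."""
    return re.findall(r"[a-z]+", text.lower())


def rank(query: str, docs: Dict[int, str]) -> List[Tuple[int, float]]:
    """Return (doc id, cosine score) pairs with score > 0, best first."""
    n = len(docs)
    tfs = {d: Counter(tokenize(t)) for d, t in docs.items()}
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tf in tfs.values() for term in tf)
    idf = {t: math.log(n / df[t]) for t in df}

    def weights(tf: Counter) -> Dict[str, float]:
        # Terms that never occur in the collection get no weight.
        return {t: f * idf[t] for t, f in tf.items() if t in idf}

    def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = weights(Counter(tokenize(query)))
    scored = [(d, cosine(q, weights(tf))) for d, tf in tfs.items()]
    return sorted(
        [(d, s) for d, s in scored if s > 0], key=lambda pair: -pair[1]
    )


ranking = rank("porridge pot", DOCS)
for doc_id, score in ranking:
    print(f"d{doc_id}: {score:.3f}")  # d2, then d1, then d5
```

Under these assumptions the ranking is d2, d1, d5, which matches the order given in the deck. Unlike the Boolean AND, the vector space model also returns d1 and d5, which contain only one of the two query terms, and orders them by how much of each document's weight the query terms account for.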