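Research on a Near-Duplicate Web Page Deduplication Algorithm Based on the Web Page Text Structure Tree (Degree Thesis, Engineering)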

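Research on a Near-Duplicate Web Page Deduplication Algorithm Based on the Web Page Text Structure Tree

Chongqing University Master's Thesis (Academic Degree)

Student: Ya Man (牙漫)
Supervisor: Prof. Xiong Zhongyang (熊忠阳)
Major: Computer Architecture
Discipline category: Engineering
College of Computer Science, Chongqing University
April 2013

Research on Detection and Elimination of Similar Web Pages Based on Text Structure

A Thesis Submitted to Chongqing University in Partial Fulfillment of the Requirement for the Master's Degree of Engineering

By Ya Man
Supervised by Prof. Zhongyang Xiong
Specialty: Computer Architecture
College of Computer Science, Chongqing University, Chongqing, China
April, 2013

Chongqing University Master's Thesis

Chinese Abstract

According to statistics from the Association for Computing Machinery (ACM), duplicate web pages account for about 30%-45% of all web pages. As the number of search engines keeps growing and users expect more from the search experience, search quality has become the bargaining chip with which search engines win users. If a search engine can remove these duplicate pages promptly, the system not only saves a large amount of storage space and indirectly lowers equipment procurement costs, but also improves retrieval quality and access efficiency, raising user satisfaction.

Feature extraction from page body text and large-scale similarity comparison are the key problems in web page deduplication. Traditional algorithms fall into three classes according to their salient characteristics: URL-based deduplication, which can only remove exactly duplicated pages on the basis of their URL addresses; deduplication based on feature-string matching, which has fairly high accuracy but a high time cost; and clustering-based deduplication, which has fairly high recall but fairly low accuracy on some news articles and template-based articles.

Analysis of reposted pages shows that duplicate pages may vary in content while the document format rarely changes; that is, the structure of the page body text stays almost unchanged. Exploiting this characteristic, this thesis proposes two deduplication algorithms based on the text structure tree.

Analysis of duplicate pages shows that long sentences are not representative of a page's topic. When web page collectors apply their modification rules, the longer a sentence is, the more fragile it proves. This thesis improves the deduplication algorithm based on text structure and long sentences, and proposes an algorithm based on the text structure tree and key sentences. The algorithm extracts sentences containing keywords as feature sentences, and the number of feature sentences is determined by the paragraph length, so that the extracted feature sentences summarize the article more completely. Experiments show that the improved algorithm achieves higher deduplication accuracy and recall.

The finer the granularity of a feature item, the less susceptible the hashed feature fingerprint is to interference. Based on this property, this thesis proposes a deduplication algorithm based on the text structure tree and feature strings. First, the algorithm extracts, as feature codes, the first and last Chinese characters of the sentences that contain the page's high-frequency punctuation marks. Second, it uses the Bloom Filter algorithm to obtain the feature fingerprints. Finally, it judges similarity by comparing fingerprints level by level. Experiments show that this algorithm improves recall substantially, most notably when deduplicating small documents, and greatly reduces deduplication time.

Keywords: web page deduplication, text structure tree, key sentence, hierarchical comparison, high-frequency punctuation

ABSTRACT

According to ACM statistics, duplicate web pages account for about 30%-45% of all web pages. As the number of search engines grows and users' requirements rise, search quality has become the key factor by which search engines win users. If duplicate web pages are removed promptly, a search engine not only saves a large amount of storage space, indirectly reducing equipment procurement costs, but also improves retrieval quality and access efficiency, which in turn improves user satisfaction.

The key problems in eliminating duplicate web pages are text feature extraction and large-scale similarity comparison. Traditional deduplication algorithms are generally divided into three categories.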
The first is based on URLs and can remove only exact duplicates such as mirror sites. The second is based on feature-string matching, which has high accuracy but a high time cost. The third is based on clustering; it has high recall, but its accuracy is relatively low for news and template-based texts.

Analysis of near-duplicate web pages shows that duplicate pages may change considerably in content but little in document format. In view of this characteristic, this thesis puts forward two algorithms based on the text structure tree.

Long sentences do not represent the theme of a web page, and the longer a sentence is, the more fragile it is when page collectors apply their modification rules. To improve on the algorithm based on long sentences, this thesis puts forward an algorithm based on the text structure tree and key sentences. The algorithm extracts sentences that contain keywords as key sentences, and the number of key sentences is determined by the length of the paragraph. Experiments show that the improved algorithm effectively avoids these two drawbacks and improves both accuracy and recall.

The smaller the granularity of a feature, the less its hashed fingerprint is affected by interference. Based on this property, an algorithm based on the text structure tree and character strings is proposed. First, it extracts the first and last characters of the sentences in which high-frequency punctuation marks occur. Second, it generates the fingerprint with the Bloom Filter algorithm. Finally, it determines similarity according to the layer fingerprints. Experiments show that this algorithm greatly improves recall, especially for small documents, and greatly reduces deduplication time.

Keywords: elimination of near-duplicate web pages; text structure tree; key sentence; layer fingerprint; high-frequency punctuation

Contents

Chinese Abstract
English Abstract
1 Introduction
1.1 Research Background
1.2 Significance of the Research