Research on Filling Strategies for Query Entrances in the Deep Web

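Call number: TP301/3.554
Classification: Public

Master's Thesis

Research on Filling Strategies for Query Entrances in the Deep Web

Graduate student: Ma Jianhua (马建华)
Supervisor: Professor Yang Xiaojiang (杨晓江)
Department: School of Education Science
First-level discipline: Education
Second-level discipline: Educational Technology
Completion date: March 10, 2009
Defense date:

Declaration of Originality

I solemnly declare that:
1. I have carried out my research in the scientific spirit of "truth-seeking and innovation".
2. This thesis presents the research work I carried out and the results I obtained under the guidance of my supervisor.
3. Apart from cited material, all experiments, data, and related materials in this thesis are genuine.
4. Apart from citations and the acknowledgements, this thesis contains no research results already published or written by other people or institutions.
5. The contributions of others to this research have been stated in the thesis, with thanks expressed.

Graduate student signature:
Date:

Authorization for Use of the Thesis

I fully understand the regulations of Nanjing Normal University on the retention and use of theses. The university has the right to retain the thesis and to submit electronic and paper copies to the relevant national authorities or their designated institutions; to make a small number of copies for non-profit purposes and to allow the thesis to be consulted in the university library; to include the content of the thesis in relevant databases for retrieval; and to compile and publish the thesis title and abstract. Confidential theses are subject to these regulations once declassified.

Graduate student signature:
Date:

Abstract (Chinese)

Most of what search engines currently index is surface web information; for technical reasons, they can index almost none of the information in the deep web. Yet the deep web has many advantages, including large capacity, high quality, and strong specialization, so its significance cannot be ignored. It is therefore necessary to find a way to crawl the deep web, and building a deep web crawler to obtain deep web data is worthwhile. Automatic form filling is an important component of such a crawler.

This thesis first introduces the value of the deep web and the reasons it is hard to search, analyzes and compares the state of research in China and abroad, and introduces HTML forms, the Document Object Model (DOM), extraction methods, ontology knowledge, and similarity computation methods. On this basis, it proposes a strategy for filling deep web entry forms. First, improved heuristic rules identify deep web query entry forms. Next, the nearest-principle algorithm proposed in this thesis extracts the form labels. The extracted labels are standardized before the final matching and filling. Finally, an improved semantics-based similarity matching algorithm matches the deep web form labels against the attributes of the ontology domain knowledge base. In this way, the process of a user filling in a deep web entry form can be simulated.

The thesis closes with an experimental validation of the whole algorithm, using deep web entry forms from the book domain. Query entry forms are identified first; the results show that the heuristic rules summarized in this thesis reach an accuracy of 90.76%. For label extraction, the nearest-principle algorithm reaches an accuracy of 94.23%. Next, the improved semantic similarity algorithm finds the attribute that matches each form label, and the value of the matched attribute is used to fill the form control. The results show a matching success rate of 88.83% and a filling success rate of 95.43%. In other words, the proposed strategy for filling deep web entry forms is effective.

Keywords: deep web, query entrance, form filling

Abstract (English)

At present, for technical reasons, general search engines can index only information on the surface web, not the deep web. However, the deep web has considerable advantages, such as large capacity, high quality, and specialized content, so its importance and influence should not be ignored. It is therefore necessary to find an approach to crawling the deep web, and it is worthwhile to construct a deep web crawler, of which automatic form filling is an essential part, to obtain deep web data.

This thesis first introduces the value of the deep web and the reasons why searching it is difficult, and analyzes and compares the state of research at home and abroad. It also introduces HTML forms, the Document Object Model (DOM), ontology knowledge, and extraction methods.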
On this basis, the author proposes a strategy for filling query entrances of the deep web. First, heuristic rules are used to identify deep web query forms. Second, the nearest-principle algorithm extracts the form labels. The labels are standardized before the forms are filled. Finally, an improved ontology-based similarity matching algorithm matches the form labels with the attributes of the semantic domain knowledge base. In this way, the process of a user filling in deep web forms can be simulated.

At the end of the thesis, the proposed algorithm is verified through experiment, using websites from the book domain. The first step is to identify query entrance forms; the experiment shows that the summarized heuristic rules achieve an accuracy of 90.76%. For label extraction, the nearest-principle algorithm achieves an accuracy of 94.23%. The improved ontology-based similarity matching algorithm is then used to find the attribute that matches each form label, and once a matching attribute is found, its value is used to fill the form control. The results show a matching success rate of 88.83% and a filling success rate of 95.43%. In most cases, the proposed method of automatically filling forms is effective.

Future work, including some new challenges and technological possibilities, is discussed at the end of the thesis.

Key words: deep web, query entrance, form filling

Contents

Abstract (Chinese)
Abstract (English)
Chapter 1 Introduction
  1.1 Introduction to the Deep Web
    1.1.1 Definition of the deep web
    1.1.2 Information value of the deep web
    1.1.3 Why the deep web is hard to search
    1.1.4 Related research on the deep web
  1.2 Main Work of This Thesis
    1.2.1 Research objectives
    1.2.2 Research content