DESIGN AND IMPLEMENTATION OF A WEB SERVICE CRAWLER

Abstract

With the development of the Internet, the demand for interaction between cross-platform programs keeps growing, and Web services were proposed to meet this need effectively: a Web service provides a seamless connection between two cross-platform programs, which reduces the cost of software maintenance and upgrades. A large number of Web services now exist on the Internet, but most of them are scattered across different servers, so users must spend a great deal of time and effort searching the vast Internet for the services they need. It is therefore necessary to design a program that collects the scattered services, stores them uniformly in a local database, and manages and updates them.

This project is a multi-threaded Web service crawler written in Python that crawls breadth-first. The crawler first extracts all URL links from the seed site and places them in a queue; it then visits the URLs in the queue one by one, extracting the new URLs found on each page and appending them to the queue, looping until the queue is empty. Each crawled URL is checked with a regular expression to determine whether it conforms to the WSDL document specification for Web services. For each conforming URL, the corresponding page is visited and, if accessible, downloaded, so that the Web service description (WSDL) documents scattered across the network are all collected locally. Next, the downloaded documents are parsed to extract key information such as the service name, port types, and operations, and this information is stored in a database. Finally, a Web site is developed that displays the crawled Web services by category together with their related information, making them convenient for users to view and read.

This thesis describes the crawler in detail. It begins with the research background and the current state of the field; after introducing the key technologies of the project, it focuses on the design of the Web service crawler, the parsing and storage of WSDL documents, and the design and implementation of the display site. Finally, the whole project is summarized and its future development is discussed.

Keywords: Web services, web crawler, WSDL, Python, service parsing

Contents

1 Introduction
1.1 Research Background
1.2 Research Status
1.2.1 Web Service Search Technology
1.2.2 Search Engine and Web Crawler Technology
1.3 Research Content of the Project
1.4 Organization of the Thesis
1.5 Chapter Summary
2 Overview of Core Technologies
2.1 Web Services
2.2 WSDL Document Technologies
2.2.1 XML
2.2.2 WSDL Documents
2.2.3 xml.etree.ElementTree
2.3 Web Crawlers
2.4 MySQL Database
2.5 Apache Server
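
The breadth-first crawl described in the abstract reduces to a short loop over a FIFO queue. The sketch below is illustrative only: the seed URL, the href-extraction regex, the "?wsdl" URL test, and the page limit are assumptions rather than details taken from the thesis, and the multi-threading mentioned in the abstract is omitted for clarity.

import re
import urllib.request
from collections import deque
from urllib.parse import urljoin

SEED = "http://www.example.com/services"              # hypothetical seed site
WSDL_PATTERN = re.compile(r"\?wsdl$", re.IGNORECASE)  # common, assumed WSDL URL form
LINK_PATTERN = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

def crawl(seed, max_pages=100):
    """Breadth-first crawl: visit pages in discovery order, enqueue new
    links, and collect every URL that looks like a WSDL document."""
    queue = deque([seed])      # FIFO queue makes the traversal breadth-first
    seen = {seed}
    wsdl_urls = []
    while queue and max_pages > 0:
        url = queue.popleft()
        max_pages -= 1
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except OSError:
            continue                        # unreachable page: skip it
        for link in LINK_PATTERN.findall(html):
            absolute = urljoin(url, link)   # resolve relative links
            if absolute in seen:
                continue
            seen.add(absolute)
            if WSDL_PATTERN.search(absolute):
                wsdl_urls.append(absolute)  # candidate WSDL document
            else:
                queue.append(absolute)      # ordinary page: crawl it later
    return wsdl_urls

if __name__ == "__main__":
    for u in crawl(SEED):
        print(u)

The FIFO queue is what distinguishes this from a depth-first crawl: pages are visited in the order they are discovered, level by level outward from the seed.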
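
For the parsing step, the table of contents names xml.etree.ElementTree from the Python standard library. A minimal sketch of pulling the service names, port types, and operations out of a downloaded WSDL 1.1 file could look as follows; the local file name is hypothetical, and real documents may use additional namespaces or a WSDL 2.0 layout.

import xml.etree.ElementTree as ET

# WSDL 1.1 namespace; <definitions> is the document root.
WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}

def parse_wsdl(path):
    """Return (service names, {port type name: [operation names]})."""
    root = ET.parse(path).getroot()
    services = [s.get("name") for s in root.findall("wsdl:service", WSDL_NS)]
    port_types = {}
    for pt in root.findall("wsdl:portType", WSDL_NS):
        ops = [op.get("name") for op in pt.findall("wsdl:operation", WSDL_NS)]
        port_types[pt.get("name")] = ops
    return services, port_types

if __name__ == "__main__":
    names, ports = parse_wsdl("service.wsdl")   # hypothetical downloaded file
    print("services:", names)
    for port, ops in ports.items():
        print(port, "->", ops)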
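
For the storage step the table of contents names MySQL, but the preview gives no schema, so the driver (PyMySQL), the connection settings, and the one-row-per-operation table layout below are assumptions made purely for illustration.

import pymysql

def save_service(service_name, port_type, operations):
    """Insert one row per (service, port type, operation) triple."""
    # Connection parameters and table layout are assumed, not from the thesis.
    conn = pymysql.connect(host="localhost", user="root",
                           password="secret", database="web_services")
    try:
        with conn.cursor() as cur:
            for op in operations:
                cur.execute(
                    "INSERT INTO services (name, port_type, operation) "
                    "VALUES (%s, %s, %s)",
                    (service_name, port_type, op),
                )
        conn.commit()
    finally:
        conn.close()

Chaining the two previous sketches together, each port type returned by parse_wsdl() can be passed to save_service() along with its operation list.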