Design and Implementation of a Web Crawler

2025/6/3 15:13:53

ABSTRACT

The main purpose of this project is to design a subject-oriented web crawler that also meets certain performance requirements, taking into account the diverse demands placed on web crawlers. The crawler uses a breadth-first search strategy. It is multi-threaded, which gives it greater crawling capacity. Connection and read timeouts are set on the crawler's web connections to avoid waiting indefinitely. To meet different needs, the crawler can be given a preset theme so that it crawls a specific topic. The project studies the principles of web crawlers and implements the related functions.

Key words: Web crawler; subject-oriented; multi-threading

Tianjin University, Class of 2007 Undergraduate Graduation Project (Thesis)

Chapter 1 Overview
1.1 Background
1.2 History and Classification of Web Crawlers
1.2.1 History of Web Crawlers
1.2.2 Classification of Web Crawlers
1.3 Development Trends of Web Crawlers

Chapter 2 Related Technical Background
2.1 Definition of a Web Crawler
2.2 Introduction to Web Page Search Strategies
2.2.1 Breadth-First Search Strategy
2.2.2 Best-First Search Strategy
2.3 Relevance Judgment Algorithms

Chapter 3 Analysis and Outline Design of the Web Crawler Model
3.1 Model Analysis of the Web Crawler
3.2 Search Strategy of the Web Crawler
3.3 Topic Relevance Judgment of the Web Crawler
3.4 Outline Design of the Web Crawler

Chapter 4 Design and Implementation of the Web Crawler Model
4.1 Overall Design of the Web Crawler
4.2 Detailed Design of the Web Crawler
4.2.1 Crawling Web Pages
4.2.2 Parsing Web Pages
4.2.3 Judging Relevance
4.2.4 Saving Web Page Information
4.2.5 Database Design and Storage
4.2.6 Multi-threading Implementation
4.2.7 Additional Features
4.2.8 Overall Workflow

Chapter 5 Testing

Chapter 6 Summary and Outlook

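Chapter 1 Overview

1.1 Background

A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names for it include ant, automatic indexer, emulator, and worm.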
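As an illustration of what "according to a set of rules" can mean, the following is a minimal sketch of the breadth-first rule named in the abstract. It is not the thesis's implementation: the function name `crawl_breadth_first`, the `get_links` callback, the `max_pages` limit, and the in-memory `TOY_WEB` graph are all assumptions made for this example, and the toy graph stands in for real page fetching and link extraction.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_breadth_first(
    seed: str,
    get_links: Callable[[str], Iterable[str]],
    max_pages: int = 100,
) -> list[str]:
    """Visit pages level by level starting from seed; return the visit order."""
    queue = deque([seed])  # frontier: URLs discovered but not yet visited
    seen = {seed}          # URLs already queued, so no page is visited twice
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()          # FIFO order is what makes this breadth-first
        order.append(url)
        for link in get_links(url):    # hypothetical link extractor supplied by the caller
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Toy in-memory "web" (an assumption for illustration, not real HTTP fetching).
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}

if __name__ == "__main__":
    print(crawl_breadth_first("a", lambda u: TOY_WEB.get(u, [])))
```

Passing the link extractor in as a callback keeps the traversal rule separate from networking, so the same loop could later be driven by a real fetcher with connection and read timeouts.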

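Web search arose from the need to find content amid the explosive growth of the Internet. As search engines have developed, users' expectations have risen with them, and searching the web has become an everyday activity; the challenge is how to make search engines keep meeting those needs. Early search was implemented through index sites. Once web robots, that is, web crawlers, appeared, the age of the search engine took off.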

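1.2 History and Classification of Web Crawlers

1.2.1 History of Web Crawlers

In the early days of the Internet, there were relatively few websites and information was easy to find. With the Internet's explosive growth, however, ordinary users looking for the material they needed faced something like finding a needle in a haystack, and specialized search sites emerged to meet the public's need for information retrieval.

The ancestor of the modern search engine is Archie, created in 1990 by Alan Emtage, then a student at McGill University in Montreal. Although the World Wide Web did not yet exist, file transfer over the network was already quite frequent, and because large numbers of files were scattered across separate FTP hosts, finding them was very inconvenient, so Alan Emtage developed Archie. Archie's working principle was already close to that of today's search engines: it relied on script programs to automatically search for files on the network and then indexed the relevant information so that users could query it with expressions. Because Archie was so popular with users, System Computing Services at the University of Nevada, inspired by it, developed another very similar search tool in 1993; by then, however, the search tool could retrieve web pages in addition to indexing files.

At the time, the word "robot" was very popular among programmers. A computer robot is a software program that carries out a task without interruption at a speed no human can match. Because the robot programs dedicated to retrieving information crawl around the network like spiders, search engine robots came to be called "spiders". The first robot program used to measure the growth of the Internet was the World Wide Web Wanderer, developed by Matthew Gray.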
