CrawlSpider: Full-Site Data Crawling
Published: 2019-06-15


Introduction:

CrawlSpider is a subclass of Spider. Besides inheriting Spider's features and functionality, it adds more powerful features of its own, the most notable being the LinkExtractor (link extractor). Spider is the base class of all spiders, and it is designed only to crawl the pages listed in start_urls; when you want to keep following the URLs extracted from crawled pages, CrawlSpider is the better fit.
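For contrast, with a plain Spider you would have to follow extracted links by hand. A minimal sketch of that manual style (the next-page XPath below is hypothetical and depends on the target site's markup):

import scrapy

class ManualFollowSpider(scrapy.Spider):
    name = 'manualDemo'
    start_urls = ['http://www.chouti.com/']

    def parse(self, response):
        # ... parse the current page here ...
        # Manually pull out the next-page link and schedule it; the XPath
        # below is a placeholder for whatever the real site uses.
        next_page = response.xpath('//a[@class="next"]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

CrawlSpider's rules mechanism, shown below, automates exactly this follow-the-links loop.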

Usage:

Create a Scrapy project: scrapy startproject projectName

Create the spider file: scrapy genspider -t crawl spiderName www.xxx.com    -- compared with the earlier command, this one adds "-t crawl", which means the generated spider is based on the CrawlSpider class rather than the base Spider class.

Spider file:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ChoutidemoSpider(CrawlSpider):
    name = 'choutiDemo'
    #allowed_domains = ['www.chouti.com']
    start_urls = ['http://www.chouti.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

- The two from ... imports bring in the CrawlSpider-related classes (LinkExtractor, CrawlSpider, Rule).
- The class derives from CrawlSpider, so this spider is rule-driven.
- The rules attribute defines the link-extraction rules.
- parse_item is the parse callback.

The biggest difference between CrawlSpider and Spider is that CrawlSpider has an extra rules attribute, whose job is to define the "extraction actions". rules may contain one or more Rule objects, and each Rule object contains a LinkExtractor object.

LinkExtractor: the link extractor

LinkExtractor(
    allow=r'Items/',       # links matching this regex are extracted; an empty pattern matches everything
    deny=xxx,              # links matching this regex are NOT extracted
    restrict_xpaths=xxx,   # only links inside regions matching this XPath are extracted
    restrict_css=xxx,      # only links inside regions matching this CSS selector are extracted
    deny_domains=xxx,      # domains whose links are never extracted
)

- Purpose: extract the links in the response that match the rules.
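A quick way to check what a LinkExtractor actually matches is to call its extract_links() method on a response, for example inside scrapy shell. A minimal sketch, reusing the allow pattern from above:

from scrapy.linkextractors import LinkExtractor

# inside `scrapy shell http://www.chouti.com/`, `response` already exists
le = LinkExtractor(allow=r'Items/')
for link in le.extract_links(response):
    # each Link object carries the absolute URL and the anchor text
    print(link.url, link.text)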

Rule: the rule parser. It takes the links produced by the link extractor and parses the content of the linked pages according to the specified rule (callback).

Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)

- Parameters:
  - Parameter 1: the link extractor to use.
  - Parameter 2 (callback): the callback that parses data from the pages the extracted links point to.
  - Parameter 3 (follow): whether to keep applying the link extractor to the pages fetched from the extracted links. When callback is None, follow defaults to True.
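To make the follow semantics concrete, a rules tuple often mixes both styles. A sketch (the allow patterns here are hypothetical):

rules = (
    # pagination links: no callback, so follow defaults to True and the
    # crawler keeps discovering new page links on every page it visits
    Rule(LinkExtractor(allow=r'page=\d+')),
    # detail links: parsed by the callback, but not followed any further
    Rule(LinkExtractor(allow=r'/detail/'), callback='parse_item', follow=False),
)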

Overall CrawlSpider crawling flow:

a) The spider first fetches the page content of each start URL.
b) The link extractor pulls links out of that page content according to its extraction rules.
c) The rule parser fetches the pages behind the extracted links and parses them according to the specified rules.
d) The parsed data is packed into items and handed to the pipeline for persistent storage.

Crawling all page data from the 糗图 (funny pictures) board of Qiushibaike

Spider file

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qiubaiBycrawl.items import QiubaibycrawlItem
import re

class QiubaitestSpider(CrawlSpider):
    name = 'qiubaiTest'
    # start URL
    start_urls = ['http://www.qiushibaike.com/']

    # define a link extractor and specify its extraction rule
    page_link = LinkExtractor(allow=r'/8hr/page/\d+/')

    rules = (
        # define a rule parser and bind its parsing logic to the callback
        Rule(page_link, callback='parse_item', follow=True),
    )

    # the parse callback used by the rule parser
    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            # build the item
            item = QiubaibycrawlItem()
            # extract the post author via an XPath expression
            item['author'] = div.xpath('./div/a[2]/h2/text()').extract_first().strip('\n')
            # extract the post content via an XPath expression
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip('\n')
            yield item  # hand the item to the pipeline
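With the spider in place, it is run by the name defined above, from the project root:

scrapy crawl qiubaiTest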

Items file

import scrapy

class QiubaibycrawlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()   # author
    content = scrapy.Field()  # content

Pipeline file

class QiubaibycrawlPipeline(object):

    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        print('Spider started')
        # explicit encoding so Chinese text is written correctly regardless of the OS default
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # persist the item handed over by the spider by writing it to the file
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.fp.close()
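For this pipeline to receive items it must be enabled in settings.py. Assuming the project is named qiubaiBycrawl (as the import in the spider suggests), the entry would look like:

ITEM_PIPELINES = {
    'qiubaiBycrawl.pipelines.QiubaibycrawlPipeline': 300,
}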

Crawling boss (zhipin.com)

Spider file

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bossPro.items import DetailItem, FirstItem

# crawls the job title (listing pages) and the job description (detail pages)
class BossSpider(CrawlSpider):
    name = 'boss'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/c101010100/?query=python%E5%BC%80%E5%8F%91&page=1&ka=page-prev']

    # extract all pagination links
    link = LinkExtractor(allow=r'page=\d+')
    link_detail = LinkExtractor(allow=r'/job_detail/.*?html')
    # /job_detail/f2a47b2f40c53bd41XJ93Nm_GVQ~.html
    # /job_detail/47dc9803e93701581XN80ty7GFI~.html

    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link_detail, callback='parse_detail'),
    )

    # parse the job titles from each pagination page
    def parse_item(self, response):
        li_list = response.xpath('//div[@class="job-list"]/ul/li')
        for li in li_list:
            item = FirstItem()
            job_title = li.xpath('.//div[@class="job-title"]/text()').extract_first()
            item['job_title'] = job_title
            # print(job_title)
            yield item

    # parse the job description from each detail page
    def parse_detail(self, response):
        job_desc = response.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()').extract()
        item = DetailItem()
        job_desc = ''.join(job_desc)
        item['job_desc'] = job_desc
        yield item

Items file

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class DetailItem(scrapy.Item):
    # define the fields for your item here like:
    job_desc = scrapy.Field()

class FirstItem(scrapy.Item):
    # define the fields for your item here like:
    job_title = scrapy.Field()

Pipeline file

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

class BossproPipeline(object):
    f1, f2 = None, None

    def open_spider(self, spider):
        self.f1 = open('a.txt', 'w', encoding='utf-8')
        self.f2 = open('b.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # at any given moment, item is exactly one specific item object
        if item.__class__.__name__ == 'FirstItem':
            job_title = item['job_title']
            self.f1.write(job_title + '\n')
        else:
            job_desc = item['job_desc']
            self.f2.write(job_desc)
        return item

    def close_spider(self, spider):
        # close both output files when the spider finishes
        self.f1.close()
        self.f2.close()
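Checking item.__class__.__name__ works, but comparing classes with isinstance is a little more robust against renames. An equivalent sketch of process_item (the rest of the class stays as above):

from bossPro.items import DetailItem, FirstItem

class BossproPipeline(object):
    # open_spider / close_spider unchanged from above

    def process_item(self, item, spider):
        # dispatch on the concrete item class instead of its name string
        if isinstance(item, FirstItem):
            self.f1.write(item['job_title'] + '\n')
        elif isinstance(item, DetailItem):
            self.f2.write(item['job_desc'])
        return item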

Settings file

# -*- coding: utf-8 -*-

# Scrapy settings for bossPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bossPro'

SPIDER_MODULES = ['bossPro.spiders']
NEWSPIDER_MODULE = 'bossPro.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bossPro (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'bossPro.middlewares.BossproSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'bossPro.middlewares.BossproDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'bossPro.pipelines.BossproPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

 

Reposted from: https://www.cnblogs.com/wanglan/p/10840680.html
