
尚硅谷 Crawler Note 16

I. CrawlSpider

1. Install scrapy

In the terminal: pip install scrapy

2. Create a project

1) Create the project

scrapy startproject project_name

2) Change into the spiders directory

cd project_name\project_name\spiders

3) Generate the spider file

scrapy genspider -t crawl spider_name website_url

4) Run the spider (a combined command sequence follows)

scrapy crawl spider_name
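
For reference, the whole sequence for the 读书网 project used later in this note would look like this (project name demo_dsw and spider name dsw are taken from the code below; the URL passed to genspider is an assumption, since start_urls is edited by hand afterwards anyway):

pip install scrapy
scrapy startproject demo_dsw
cd demo_dsw\demo_dsw\spiders
scrapy genspider -t crawl dsw www.dushu.com
scrapy crawl dsw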

3. Modify the Rule in the generated spider file

Hover the mouse over a page link, then right-click → Inspect:

https://i-blog.csdnimg.cn/direct/a2f95bb7bb0d474dbb04d5e54c4bbf0b.png

Modify the allow regular expression in the Rule. Once modified, links matching the pattern are extracted from the current start URL (start_urls):

allow=r'/book/1104_\d+\.html'

Where:

\d: matches a digit

+: one or more of the preceding

\.: the backslash escapes the '.', so the dot is matched literally
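
A quick standalone check with Python's re module (not part of the spider code) shows what the pattern does and does not match, and why the escaped dot matters:

import re

pattern = r'/book/1104_\d+\.html'

# pagination links match: \d+ covers the page number, \. the literal dot
print(re.search(pattern, '/book/1104_2.html'))    # <re.Match object ...>
print(re.search(pattern, '/book/1104_13.html'))   # <re.Match object ...>

# without the backslash, "\d+html" would require digits immediately followed
# by "html", which never occurs in these URLs, so nothing would be extracted
print(re.search(r'/book/1104_\d+html', '/book/1104_2.html'))  # None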

Modify follow in the Rule. With follow=False, the pages fetched by this rule are handed to the callback but are not searched for further links; with follow=True, the link extractor would also be applied to those pages:

follow = False

book.json has 1440 lines: 1440 ÷ 3 (lines per book) ÷ 40 (books per page) = 12 pages, but the category has 13 pages, so one page is missing.

Reason: the rule's pattern does not cover the first page.

Only pages 2-13 are extracted.

https://i-blog.csdnimg.cn/direct/d183a0ddbc604163ae4dd18b51a198ab.png

https://i-blog.csdnimg.cn/direct/76ef3ae6f9f448a0bb73ec38b7ae5f9f.png

Setting start_urls to the first page:

# first page: 1104_1
start_urls = ["https://www.dushu.com/book/1104_1.html"]

book.json now has 1560 lines, i.e. all 13 pages:

https://i-blog.csdnimg.cn/direct/83503229ffe54c28b2f13a4e3098f9b1.png

4. 读书网 (dushu.com)

1) dsw.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# import the Item class defined in items.py
from demo_dsw.items import DemoDswItem


class DswSpider(CrawlSpider):
    name = "dsw"
    # narrow allowed_domains: keep only the domain
    allowed_domains = ["www.dushu.com"]
    # first page: 1104_1
    start_urls = ["https://www.dushu.com/book/1104_1.html"]

    rules = (Rule(LinkExtractor(allow=r'/book/1104_\d+\.html'),
                  callback="parse_item",
                  follow=False),)

    def parse_item(self, response):
        img_list = response.xpath('//div[@class = "bookslist"]//img')

        for img in img_list:
            # alt holds the book title; data-original holds the lazy-loaded cover URL
            name = img.xpath('./@alt').extract_first()
            src = img.xpath('./@data-original').extract_first()

            # create a book item
            book = DemoDswItem(name=name, src=src)
            # yield it to the pipeline
            yield book


        # Ctrl + Alt + L (PyCharm reformat shortcut): makes book.json easier to view
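
Usage note: book.json is produced by the pipeline shown below in pipelines.py; because it is opened with a relative path, the file appears in whatever directory scrapy crawl dsw is run from.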

2) items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DemoDswItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # pass

    # book title
    name = scrapy.Field()
    # image URL
    src = scrapy.Field()

3) pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class DemoDswPipeline:
    # runs once when the spider opens: create the output file
    def open_spider(self,spider):
        self.fp = open('book.json','w',encoding='utf-8')

    def process_item(self, item, spider):
        # write() only accepts strings, so convert the item first
        self.fp.write(str(item))
        return item

    # runs once when the spider closes
    def close_spider(self,spider):
        self.fp.close()
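
One caveat: writing str(item) with no separators means book.json is not actually valid JSON, just concatenated Python dict text. A minimal alternative sketch (not part of the course code) writes one JSON object per line, using the ItemAdapter import the template already provides; it would be registered in ITEM_PIPELINES just like DemoDswPipeline:

import json

from itemadapter import ItemAdapter


class DemoDswJsonLinesPipeline:
    # hypothetical variant pipeline: one JSON object per line in book.jsonl
    def open_spider(self, spider):
        self.fp = open('book.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ItemAdapter exposes the scrapy Item as a plain dict
        self.fp.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()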

4) settings.py

Uncomment:

ITEM_PIPELINES = {
   "demo_dsw.pipelines.DemoDswPipeline": 300,
}
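
The number 300 is the pipeline's priority: items pass through pipelines in ascending order of this value (customarily 0-1000), so lower numbers run first. With a second, hypothetical pipeline the setting might look like this (DemoDswImgPipeline is an assumed name, not part of this project):

ITEM_PIPELINES = {
   # hypothetical example pipeline, runs first (lower value)
   "demo_dsw.pipelines.DemoDswImgPipeline": 200,
   # runs second
   "demo_dsw.pipelines.DemoDswPipeline": 300,
}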

5. POST requests in scrapy

scrapy_post.py

import json
from typing import Iterable

import scrapy
from scrapy import Request


class ScrapypostSpider(scrapy.Spider):
    name = "scrapyPost"
    allowed_domains = ["fanyi.baidu.com"]
    # POST request:
    # a POST request only makes sense with its form data;
    # without the data the request is pointless,
    # so start_urls is not used here,
    # and the default parse method is not used either
    # start_urls = ["https://fanyi.baidu.com/sug"]

    # def parse(self, response):
    #     pass

    def start_requests(self):
        url = 'https://fanyi.baidu.com/sug'
        data = {
            'kw': 'cat'
        }

        # FormRequest sends the data as an application/x-www-form-urlencoded POST body
        yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_second)

    def parse_second(self, response):
        content = response.text
        # response.text is already a decoded str, so json.loads needs no encoding argument
        obj = json.loads(content)
        print(obj)
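
To try it, run the spider from the spiders directory; ROBOTSTXT_OBEY in settings.py may need to be set to False, otherwise Scrapy may skip the request if Baidu's robots.txt disallows it. The printed obj is Baidu Translate's suggestion response for the keyword 'cat'.

scrapy crawl scrapyPost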