
尚硅谷 Crawler Note 16

I. CrawlSpider

1. Install scrapy

In the terminal: pip install scrapy

2. Create a project

1) Create the project

scrapy startproject project_name

2) Change into the spiders directory

cd project_name\project_name\spiders

3) Generate the spider file

scrapy genspider -t crawl spider_name website_url

4) Run the spider (a combined command sequence follows)

scrapy crawl spider_name
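
For reference, the whole sequence for the 读书网 project used later in this note would look like this (project name demo_dsw and spider name dsw are taken from the code below; the URL passed to genspider is an assumption, since start_urls is edited by hand afterwards anyway):

pip install scrapy
scrapy startproject demo_dsw
cd demo_dsw\demo_dsw\spiders
scrapy genspider -t crawl dsw www.dushu.com
scrapy crawl dsw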

3. Modify the Rule in the generated spider file

Hover the mouse over a page link, then right-click → Inspect:

https://i-blog.csdnimg.cn/direct/a2f95bb7bb0d474dbb04d5e54c4bbf0b.png

Modify the allow regular expression in the Rule. Once modified, links matching the pattern are extracted from the current start URL (start_urls):

allow=r'/book/1104_\d+\.html'

Where:

\d: matches a digit

+: one or more of the preceding

\.: the backslash escapes the '.', so the dot is matched literally
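
A quick standalone check with Python's re module (not part of the spider code) shows what the pattern does and does not match, and why the escaped dot matters:

import re

pattern = r'/book/1104_\d+\.html'

# pagination links match: \d+ covers the page number, \. the literal dot
print(re.search(pattern, '/book/1104_2.html'))    # <re.Match object ...>
print(re.search(pattern, '/book/1104_13.html'))   # <re.Match object ...>

# without the backslash, "\d+html" would require digits immediately followed
# by "html", which never occurs in these URLs, so nothing would be extracted
print(re.search(r'/book/1104_\d+html', '/book/1104_2.html'))  # None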

Modify follow in the Rule. With follow=False, the pages fetched by this rule are handed to the callback but are not searched for further links; with follow=True, the link extractor would also be applied to those pages:

follow = False

book.json has 1440 lines: 1440 ÷ 3 (lines per book) ÷ 40 (books per page) = 12 pages, but the category has 13 pages, so one page is missing.

Reason: the rule's pattern does not cover the first page.

Only pages 2-13 are extracted.

https://i-blog.csdnimg.cn/direct/d183a0ddbc604163ae4dd18b51a198ab.png

https://i-blog.csdnimg.cn/direct/76ef3ae6f9f448a0bb73ec38b7ae5f9f.png

Setting start_urls to the first page:

# first page: 1104_1
start_urls = ["https://www.dushu.com/book/1104_1.html"]

book.json now has 1560 lines, i.e. all 13 pages:

https://i-blog.csdnimg.cn/direct/83503229ffe54c28b2f13a4e3098f9b1.png

4. 读书网 (dushu.com)

1) dsw.py

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# import the Item class defined in items.py
from demo_dsw.items import DemoDswItem


class DswSpider(CrawlSpider):
    name = "dsw"
    # narrow allowed_domains: keep only the domain
    allowed_domains = ["www.dushu.com"]
    # first page: 1104_1
    start_urls = ["https://www.dushu.com/book/1104_1.html"]

    rules = (Rule(LinkExtractor(allow=r'/book/1104_\d+\.html'),
                  callback="parse_item",
                  follow=False),)

    def parse_item(self, response):
        img_list = response.xpath('//div[@class = "bookslist"]//img')

        for img in img_list:
            # alt holds the book title; data-original holds the lazy-loaded cover URL
            name = img.xpath('./@alt').extract_first()
            src = img.xpath('./@data-original').extract_first()

            # create a book item
            book = DemoDswItem(name=name, src=src)
            # yield it to the pipeline
            yield book


        # Ctrl + Alt + L (PyCharm reformat shortcut): makes book.json easier to view
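
Usage note: book.json is produced by the pipeline shown below in pipelines.py; because it is opened with a relative path, the file appears in whatever directory scrapy crawl dsw is run from.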

2) items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class DemoDswItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # pass

    # book title
    name = scrapy.Field()
    # image URL
    src = scrapy.Field()

3) pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class DemoDswPipeline:
    # runs once when the spider opens: create the output file
    def open_spider(self,spider):
        self.fp = open('book.json','w',encoding='utf-8')

    def process_item(self, item, spider):
        # write() only accepts strings, so convert the item first
        self.fp.write(str(item))
        return item

    # runs once when the spider closes
    def close_spider(self,spider):
        self.fp.close()
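
One caveat: writing str(item) with no separators means book.json is not actually valid JSON, just concatenated Python dict text. A minimal alternative sketch (not part of the course code) writes one JSON object per line, using the ItemAdapter import the template already provides; it would be registered in ITEM_PIPELINES just like DemoDswPipeline:

import json

from itemadapter import ItemAdapter


class DemoDswJsonLinesPipeline:
    # hypothetical variant pipeline: one JSON object per line in book.jsonl
    def open_spider(self, spider):
        self.fp = open('book.jsonl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # ItemAdapter exposes the scrapy Item as a plain dict
        self.fp.write(json.dumps(ItemAdapter(item).asdict(), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        self.fp.close()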

4) settings.py

Uncomment:

ITEM_PIPELINES = {
   "demo_dsw.pipelines.DemoDswPipeline": 300,
}
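
The number 300 is the pipeline's priority: items pass through pipelines in ascending order of this value (customarily 0-1000), so lower numbers run first. With a second, hypothetical pipeline the setting might look like this (DemoDswImgPipeline is an assumed name, not part of this project):

ITEM_PIPELINES = {
   # hypothetical example pipeline, runs first (lower value)
   "demo_dsw.pipelines.DemoDswImgPipeline": 200,
   # runs second
   "demo_dsw.pipelines.DemoDswPipeline": 300,
}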

5. POST requests in scrapy

scrapy_post.py

import json
from typing import Iterable

import scrapy
from scrapy import Request


class ScrapypostSpider(scrapy.Spider):
    name = "scrapyPost"
    allowed_domains = ["fanyi.baidu.com"]
    # POST request:
    # a POST request only makes sense with its form data;
    # without the data the request is pointless,
    # so start_urls is not used here,
    # and the default parse method is not used either
    # start_urls = ["https://fanyi.baidu.com/sug"]

    # def parse(self, response):
    #     pass

    def start_requests(self):
        url = 'https://fanyi.baidu.com/sug'
        data = {
            'kw': 'cat'
        }

        # FormRequest sends the data as an application/x-www-form-urlencoded POST body
        yield scrapy.FormRequest(url=url, formdata=data, callback=self.parse_second)

    def parse_second(self, response):
        content = response.text
        # response.text is already a decoded str, so json.loads needs no encoding argument
        obj = json.loads(content)
        print(obj)
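
To try it, run the spider from the spiders directory; ROBOTSTXT_OBEY in settings.py may need to be set to False, otherwise Scrapy may skip the request if Baidu's robots.txt disallows it. The printed obj is Baidu Translate's suggestion response for the keyword 'cat'.

scrapy crawl scrapyPost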