Scrapy

Scrapy is an and collaborative framework for extracting the data you need from websites.

In a fast, simple, yet extensible way.

快速开始

install

  [plaintext]
1
$ pip install scrapy

windows 报错

windows 参见下面的 windows 安装报错

安装解决之后,继续执行

  [plaintext]
1
$ pip install scrapy

版本验证

  [plaintext]
1
2
> scrapy version Scrapy 1.6.0

编写测试文件

  • myspider.py
  [py]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
import scrapy class BlogSpider(scrapy.Spider): name = 'blogspider' start_urls = ['https://blog.csdn.net/ryo1060732496'] next_pages = [] for num in range(2, 22): next_pages.append('https://blog.csdn.net/ryo1060732496/article/list/'+str(num)) def parse(self, response): for href in response.css('.article-list>.article-item-box>h4'): yield {'href': href.css('a').attrib['href']} for next_page in self.next_pages: yield response.follow(next_page, self.parse)
  • run
  [plaintext]
1
scrapy runspider myspider.py

日志信息如下:

如果报错,参考 运行报错 解决之后再次运行即可。

  • 结果日志

看的出来这里用到了多线程,并且提供了大量的日志统计信息。

个人收获

平台效应

前面的各种 request bs 都是一个框架,工具包

但是这里是一个完整的平台。具有

windows 安装报错

报错信息

  [plaintext]
1
2
3
4
5
6
7
8
9
10
running build_ext building 'twisted.test.raiser' extension error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/ ---------------------------------------- Failed building wheel for Twisted Running setup.py clean for Twisted Failed to build Twisted Installing collected packages: queuelib, Twisted, PyDispatcher, pyasn1, pyasn1-modules, service-identity, scrapy Running setup.py install for Twisted ... error

解决

手动安装twisted插件:

在 http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted 用Ctrl+F搜索twisted,下载对应版本。

版本如下:

  [plaintext]
1
2
3
4
5
6
7
8
Twisted‑19.2.0‑cp27‑cp27m‑win32.whl Twisted‑19.2.0‑cp27‑cp27m‑win_amd64.whl Twisted‑19.2.0‑cp35‑cp35m‑win32.whl Twisted‑19.2.0‑cp35‑cp35m‑win_amd64.whl Twisted‑19.2.0‑cp36‑cp36m‑win32.whl Twisted‑19.2.0‑cp36‑cp36m‑win_amd64.whl Twisted‑19.2.0‑cp37‑cp37m‑win32.whl Twisted‑19.2.0‑cp37‑cp37m‑win_amd64.whl

我是Python3.7,64位,即下载 Twisted‑19.2.0‑cp37‑cp37m‑win_amd64.whl

安装 wheel

  [plaintext]
1
pip install wheel

安装 twisted 插件

把下载下来的 Twisted‑19.2.0‑cp37‑cp37m‑win_amd64.whl 放到Python37\Scripts目录,

我的个人目录:

  [plaintext]
1
C:\Users\binbin.hou\AppData\Local\Programs\Python\Python37\Scripts

执行

  [plaintext]
1
pip install .\Twisted-19.2.0-cp37-cp37m-win_amd64.whl

日志如下:

  [plaintext]
1
2
3
4
Processing c:\users\binbin.hou\appdata\local\programs\python\python37\scripts\twisted-19.2.0-cp37-cp37m-win_amd64.whl Requirement already satisfied: incremental>=16.10.1 in c:\users\binbin.hou\appdata\local\programs\python\python37\lib\site-packages (from ... Installing collected packages: Twisted Successfully installed Twisted-19.2.0

运行报错

报错信息

  [plaintext]
1
2
3
File "c:\users\binbin.hou\appdata\local\programs\python\python37\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module> import win32api builtins.ModuleNotFoundError: No module named 'win32api'

解决

原因就是缺少 win32 的包

解决办法:安装pywin32

运行命令

  [plaintext]
1
pip install pywin32

建议使用这种方式

参考资料

安装报错

windows7 安装Scrapy爬虫框架报错解决方案

Windows安装Python组件Scrapy报错的解决方案

运行报错