Scrapy
Scrapy is an and collaborative framework for extracting the data you need from websites.
In a fast, simple, yet extensible way.
快速开始
install
$ pip install scrapy
windows 报错
windows 参见下面的 windows 安装报错
安装解决之后,继续执行
$ pip install scrapy
版本验证
> scrapy version
Scrapy 1.6.0
编写测试文件
- myspider.py
import scrapy
class BlogSpider(scrapy.Spider):
name = 'blogspider'
start_urls = ['https://blog.csdn.net/ryo1060732496']
next_pages = []
for num in range(2, 22):
next_pages.append('https://blog.csdn.net/ryo1060732496/article/list/'+str(num))
def parse(self, response):
for href in response.css('.article-list>.article-item-box>h4'):
yield {'href': href.css('a').attrib['href']}
for next_page in self.next_pages:
yield response.follow(next_page, self.parse)
- run
scrapy runspider myspider.py
日志信息如下:
如果报错,参考 运行报错 解决之后再次运行即可。
- 结果日志
看的出来这里用到了多线程,并且提供了大量的日志统计信息。
个人收获
平台效应
前面的各种 request bs 都是一个框架,工具包
但是这里是一个完整的平台。具有
windows 安装报错
报错信息
running build_ext
building 'twisted.test.raiser' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/
----------------------------------------
Failed building wheel for Twisted
Running setup.py clean for Twisted
Failed to build Twisted
Installing collected packages: queuelib, Twisted, PyDispatcher, pyasn1, pyasn1-modules, service-identity, scrapy
Running setup.py install for Twisted ... error
解决
手动安装twisted插件:
在 http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted 用Ctrl+F搜索twisted,下载对应版本。
版本如下:
Twisted‑19.2.0‑cp27‑cp27m‑win32.whl
Twisted‑19.2.0‑cp27‑cp27m‑win_amd64.whl
Twisted‑19.2.0‑cp35‑cp35m‑win32.whl
Twisted‑19.2.0‑cp35‑cp35m‑win_amd64.whl
Twisted‑19.2.0‑cp36‑cp36m‑win32.whl
Twisted‑19.2.0‑cp36‑cp36m‑win_amd64.whl
Twisted‑19.2.0‑cp37‑cp37m‑win32.whl
Twisted‑19.2.0‑cp37‑cp37m‑win_amd64.whl
我是Python3.7,64位,即下载 Twisted‑19.2.0‑cp37‑cp37m‑win_amd64.whl
安装 wheel
pip install wheel
安装 twisted 插件
把下载下来的 Twisted‑19.2.0‑cp37‑cp37m‑win_amd64.whl
放到Python37\Scripts目录,
我的个人目录:
C:\Users\binbin.hou\AppData\Local\Programs\Python\Python37\Scripts
执行
pip install .\Twisted-19.2.0-cp37-cp37m-win_amd64.whl
日志如下:
Processing c:\users\binbin.hou\appdata\local\programs\python\python37\scripts\twisted-19.2.0-cp37-cp37m-win_amd64.whl
Requirement already satisfied: incremental>=16.10.1 in c:\users\binbin.hou\appdata\local\programs\python\python37\lib\site-packages (from ...
Installing collected packages: Twisted
Successfully installed Twisted-19.2.0
运行报错
报错信息
File "c:\users\binbin.hou\appdata\local\programs\python\python37\lib\site-packages\twisted\internet\_win32stdio.py", line 9, in <module>
import win32api
builtins.ModuleNotFoundError: No module named 'win32api'
解决
原因就是缺少 win32 的包
解决办法:安装pywin32
运行命令
pip install pywin32
建议使用这种方式
参考资料
安装报错
Windows安装Python组件Scrapy报错的解决方案