Scrapy

Scrapy is an open source and collaborative framework for extracting the data you need from websites.

In a fast, simple, yet extensible way.

安装实战

更新 pip

  [plaintext]
1
$ sudo pip install --upgrade pip

日志

  [plaintext]
1
2
3
4
5
6
7
8
9
10
11
12
$ sudo pip install --upgrade pip Password: The directory '/Users/houbinbin/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. The directory '/Users/houbinbin/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. Collecting pip Downloading https://files.pythonhosted.org/packages/0f/74/ecd13431bcc456ed390b44c8a6e917c1820365cbebcb6a8974d1cd045ab4/pip-10.0.1-py2.py3-none-any.whl (1.3MB) 100% |████████████████████████████████| 1.3MB 505kB/s Installing collected packages: pip Found existing installation: pip 9.0.1 Uninstalling pip-9.0.1: Successfully uninstalled pip-9.0.1 Successfully installed pip-10.0.1

安装 scrapy

  [plaintext]
1
$ sudo pip install scrapy

日志

  [plaintext]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
$ sudo pip install scrapy The directory '/Users/houbinbin/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. The directory '/Users/houbinbin/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag. Collecting scrapy Downloading https://files.pythonhosted.org/packages/db/9c/cb15b2dc6003a805afd21b9b396e0e965800765b51da72fe17cf340b9be2/Scrapy-1.5.0-py2.py3-none-any.whl (251kB) 100% |████████████████████████████████| 256kB 309kB/s Collecting w3lib>=1.17.0 (from scrapy) Downloading https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl Collecting six>=1.5.2 (from scrapy) Downloading https://files.pythonhosted.org/packages/67/4b/141a581104b1f6397bfa78ac9d43d8ad29a7ca43ea90a2d863fe3056e86a/six-1.11.0-py2.py3-none-any.whl Collecting cssselect>=0.9 (from scrapy) Downloading https://files.pythonhosted.org/packages/7b/44/25b7283e50585f0b4156960691d951b05d061abf4a714078393e51929b30/cssselect-1.0.3-py2.py3-none-any.whl Collecting parsel>=1.1 (from scrapy) Downloading https://files.pythonhosted.org/packages/bc/b4/2fd37d6f6a7e35cbc4c2613a789221ef1109708d5d4fb9fd5f6f721a43c9/parsel-1.4.0-py2.py3-none-any.whl Collecting service-identity (from scrapy) Downloading https://files.pythonhosted.org/packages/29/fa/995e364220979e577e7ca232440961db0bf996b6edaf586a7d1bd14d81f1/service_identity-17.0.0-py2.py3-none-any.whl Requirement already satisfied: pyOpenSSL in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from scrapy) (0.13.1) Collecting queuelib (from scrapy) Downloading https://files.pythonhosted.org/packages/4c/85/ae64e9145f39dd6d14f8af3fa809a270ef3729f3b90b3c0cf5aa242ab0d4/queuelib-1.5.0-py2.py3-none-any.whl Collecting PyDispatcher>=2.0.5 (from scrapy) Downloading https://files.pythonhosted.org/packages/cd/37/39aca520918ce1935bea9c356bcbb7ed7e52ad4e31bff9b943dfc8e7115b/PyDispatcher-2.0.5.tar.gz Collecting Twisted>=13.1.0 (from scrapy) Downloading https://files.pythonhosted.org/packages/12/2a/e9e4fb2e6b2f7a75577e0614926819a472934b0b85f205ba5d5d2add54d0/Twisted-18.4.0.tar.bz2 (3.0MB) 100% |████████████████████████████████| 3.0MB 925kB/s Collecting lxml (from scrapy) Downloading https://files.pythonhosted.org/packages/18/95/abf8204fbbc9a01e0e156029cd1ee974237b5798b9e84477df6c4fabfbd2/lxml-4.2.1-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.8MB) 100% |████████████████████████████████| 8.8MB 1.8MB/s Collecting attrs (from service-identity->scrapy) Downloading https://files.pythonhosted.org/packages/41/59/cedf87e91ed541be7957c501a92102f9cc6363c623a7666d69d51c78ac5b/attrs-18.1.0-py2.py3-none-any.whl Requirement already satisfied: pyasn1 in /Library/Python/2.7/site-packages (from service-identity->scrapy) (0.3.3) Collecting pyasn1-modules (from service-identity->scrapy) Downloading https://files.pythonhosted.org/packages/e9/51/bcd96bf6231d4b2cc5e023c511bee86637ba375c44a6f9d1b4b7ad1ce4b9/pyasn1_modules-0.2.1-py2.py3-none-any.whl (60kB) 100% |████████████████████████████████| 61kB 42kB/s Collecting zope.interface>=4.4.2 (from Twisted>=13.1.0->scrapy) Downloading https://files.pythonhosted.org/packages/ac/8a/657532df378c2cd2a1fe6b12be3b4097521570769d4852ec02c24bd3594e/zope.interface-4.5.0.tar.gz (151kB) 100% |████████████████████████████████| 153kB 83kB/s Collecting constantly>=15.1 (from Twisted>=13.1.0->scrapy) Downloading https://files.pythonhosted.org/packages/b9/65/48c1909d0c0aeae6c10213340ce682db01b48ea900a7d9fce7a7910ff318/constantly-15.1.0-py2.py3-none-any.whl Collecting incremental>=16.10.1 (from Twisted>=13.1.0->scrapy) Downloading https://files.pythonhosted.org/packages/f5/1d/c98a587dc06e107115cf4a58b49de20b19222c83d75335a192052af4c4b7/incremental-17.5.0-py2.py3-none-any.whl Collecting Automat>=0.3.0 (from Twisted>=13.1.0->scrapy) Downloading https://files.pythonhosted.org/packages/17/6a/1baf488c2015ecafda48c03ca984cf0c48c254622668eb1732dbe2eae118/Automat-0.6.0-py2.py3-none-any.whl Collecting hyperlink>=17.1.1 (from Twisted>=13.1.0->scrapy) Downloading https://files.pythonhosted.org/packages/a7/b6/84d0c863ff81e8e7de87cff3bd8fd8f1054c227ce09af1b679a8b17a9274/hyperlink-18.0.0-py2.py3-none-any.whl Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=4.4.2->Twisted>=13.1.0->scrapy) (18.5) Requirement already satisfied: idna>=2.5 in /Library/Python/2.7/site-packages (from hyperlink>=17.1.1->Twisted>=13.1.0->scrapy) (2.6) matplotlib 1.3.1 requires nose, which is not installed. matplotlib 1.3.1 requires tornado, which is not installed. pyasn1-modules 0.2.1 has requirement pyasn1<0.5.0,>=0.4.1, but you'll have pyasn1 0.3.3 which is incompatible. Installing collected packages: six, w3lib, cssselect, lxml, parsel, attrs, pyasn1-modules, service-identity, queuelib, PyDispatcher, zope.interface, constantly, incremental, Automat, hyperlink, Twisted, scrapy Found existing installation: six 1.4.1 Cannot uninstall 'six'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
  • matplotlib
  [plaintext]
1
$ sudo python -mpip install -U matplotlib --ignore-installed six

更新默认 python

  • 下载
  [plaintext]
1
$ ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  • 配置到环境变量
  [plaintext]
1
$ echo "export PATH=/usr/local/bin:/usr/local/sbin:$PATH" >> ~/.bashrc
  • 刷新
  [plaintext]
1
$ source ~/.bashrc
  • 重新安装
  [plaintext]
1
$ brew install python
  • 如果需要更新
  [plaintext]
1
2
$ brew update $ brew upgrade python

快速开始

hello.py

  [py]
1

输出结果

  [plaintext]
1