Web Scraping Study Notes - CFANZ Programming Community

　　http://py3study.com/Article/part/type_id/3/p/3.html

　　Scrapy documentation (Chinese translation): https://scrapy-chs.readthedocs.io/zh_CN/0.24/intro/overview.html#id1

I. The urllib request module

I. Common urllib usage in Python 2/3

　　Common urllib usage in Python 2/3: https://www.jb51.net/article/130918.htm
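
As a rough illustration of the kind of usage the linked article covers, here is a minimal Python 3 sketch of `urllib.request` (in Python 2 the equivalent calls live in `urllib2`). The helper names `build_request` and `fetch_text`, the `example.com` URL, and the query parameters are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url, params, user_agent="Mozilla/5.0"):
    """Build a GET request with a query string and a custom User-Agent."""
    url = f"{base_url}?{urlencode(params)}"
    return Request(url, headers={"User-Agent": user_agent})


def fetch_text(request, timeout=10):
    """Send the request and decode the body (this performs network I/O)."""
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical URL and parameters, used only to show the request being built.
req = build_request("http://example.com/search", {"q": "scrapy", "page": 1})
print(req.full_url)
```

Calling `fetch_text(req)` would actually send the request. Building and sending are kept separate so the request can be inspected without touching the network.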

II. beautifulsoup

　　beautifulsoup4 documentation (Chinese translation): https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/

III. Using the scrapy crawler framework

I. Installing scrapy

　　1. Installing scrapy on Linux

　　2. Installing scrapy on Windows

　　　　1. Installation steps

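1. First, upgrade pip
    python -m pip install --upgrade pip

2. Install Visual Studio Professional (choose the edition you need)
    https://visualstudio.microsoft.com/zh-hans/
    
3. Install lxml and wheel (download the version you need, or pin a version directly)
    pip install lxml wheel

4. Install scrapy (a specific version can be pinned)
    pip install scrapy
    To pin a version: pip install scrapy==1.5.1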

　　　　2. Fixing errors when installing scrapy with pip

building 'twisted.test.raiser' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools


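　　Solution: download the required package and install it locally.

Download the Twisted .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (pick the build that matches your Python version and operating system)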

pip install C:\Users\wzs\Downloads\Twisted-18.7.0-cp36-cp36m-win_amd64.whl

　　Once this error is resolved, install scrapy again to verify the fix.

pip install scrapy
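
To confirm that the installation succeeded, a small check like the following can help. It is only a sketch: the helper name `check_packages` is hypothetical, the package list is taken from the steps above, and `importlib.metadata` assumes Python 3.8 or later.

```python
import importlib.util
from importlib import metadata


def check_packages(names=("scrapy", "twisted", "lxml")):
    """Map each package name to its installed version, or None if missing."""
    report = {}
    for name in names:
        # find_spec returns None when a top-level module cannot be imported.
        if importlib.util.find_spec(name) is None:
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name.
            report[name] = "unknown"
    return report


report = check_packages()
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Running `scrapy version` on the command line is an equally valid check; the script simply shows all three dependencies at once.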

II. scrapy command-line operations

　　1. Global commands

　　　　1. scrapy -h: view help information

Scrapy 1.5.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

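　　　　2. fetch: show the crawl process

　　The crawl log is shown by default (add the --nolog option to hide it).

Syntax: scrapy fetch [url] 
Precondition: works with or without an existing project 
Example: scrapy fetch "http://www.yetianlian.com/" 
The command has a spider fetch the given URL and shows the whole process on standard output.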

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h              show this help message and exit
--spider=SPIDER         use this spider
--headers               print response HTTP headers instead of body
--no-redirect           do not handle HTTP 3xx status codes and print response
                        as-is

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
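
When `scrapy fetch` needs to be driven from a script, the options listed above can be assembled programmatically. The sketch below is one possible wrapper, not part of Scrapy itself: `build_fetch_command` and `run_fetch` are hypothetical helper names, and only flags shown in the help output (`--nolog`, `--headers`, `--spider`) are used.

```python
import subprocess


def build_fetch_command(url, nolog=True, headers=False, spider=None):
    """Assemble the argument list for `scrapy fetch`."""
    cmd = ["scrapy", "fetch"]
    if nolog:
        cmd.append("--nolog")      # disable logging completely
    if headers:
        cmd.append("--headers")    # print response headers instead of body
    if spider:
        cmd.append(f"--spider={spider}")
    cmd.append(url)
    return cmd


def run_fetch(url, **options):
    """Run the command and return its stdout (requires scrapy on PATH)."""
    result = subprocess.run(
        build_fetch_command(url, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


cmd = build_fetch_command("http://www.yetianlian.com/", headers=True)
print(" ".join(cmd))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with URLs that contain `&` or `?`.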

III. Common scrapy errors

　　1. ModuleNotFoundError: No module named 'win32api'

ModuleNotFoundError: No module named 'win32api'

The win32api module is not installed; installing the package that provides it fixes the error.

pip install -i https://pypi.douban.com/simple pypiwin32

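IV. XPath usage

　　XPath tutorial: http://www.w3school.com.cn/xpath/index.asp

I. Getting started with XPath

　　XPath is a language for finding information in an XML document. It is used to navigate through the elements and attributes of an XML document.

XPath uses path expressions to navigate XML documents
XPath includes a standard function library
XPath is a major element of XSLT
XPath is a W3C standard

　　lxml.etree is the Python library for processing HTML and XML documents

II. Common XPath methods in Python

III. Advanced XPath usage in Python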
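
As a minimal sketch of path expressions in Python, the example below uses the standard library's `xml.etree.ElementTree` rather than lxml, so that it runs without extra packages. ElementTree supports only a subset of XPath; `lxml.etree` elements accept the same `findall()` calls and additionally offer an `xpath()` method for full XPath 1.0. The bookstore document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, made up for this illustration.
XML = """
<bookstore>
  <book category="web">
    <title lang="en">Learning XML</title>
    <price>39.95</price>
  </book>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
</bookstore>
""".strip()

root = ET.fromstring(XML)

# "./book/title": follow child steps starting from the root element.
titles = [t.text for t in root.findall("./book/title")]

# ".//price": match descendants at any depth.
prices = [float(p.text) for p in root.findall(".//price")]

# "[@category='web']": keep only elements whose attribute has this value.
web_titles = [b.findtext("title") for b in root.findall("./book[@category='web']")]

# ElementTree paths cannot select attribute nodes, so read them with .get().
langs = [t.get("lang") for t in root.iter("title")]

print(titles, prices, web_titles, langs)
```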

　　https://www.jianshu.com/p/1575db75670f