The code in this chapter was written on a Mac; there will be differences on Windows.

Setting Up the Development and Test Environment

There are two ways to install the development and test environment on your local machine:

Note: on a Mac, if Docker is not installed locally, you need to edit the Vagrantfile and set d.force_host_vm = true.
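
As a reference, the change might look roughly like the sketch below. The surrounding provider block is an assumption based on the standard Vagrant Docker provider; only the force_host_vm line comes from the note above.

# Vagrantfile (sketch; everything except force_host_vm is assumed)
Vagrant.configure("2") do |config|
  config.vm.provider "docker" do |d|
    # Spin up a host VM even when Docker is not installed natively on the Mac.
    d.force_host_vm = true
  end
end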

Developing the Spider Code

The processing flow of a web crawler:

  1. Specify the start URLs
  2. Specify the URLs to crawl
  3. Specify the items to extract
  4. Output the extracted items

Start URL:

Sample Yelp restaurant pages:

(screenshots: restaurant list page and restaurant detail page)

Items to extract:

  • Basic restaurant information: name, address, phone
  • Restaurant star rating
  • Restaurant reviews

First, use Chrome to find the XPath of one restaurant URL:

//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/ul[2]/li/div/div[1]/div[1]/div/div[2]/h3/span/a/@href'

Then compare it against the XPaths of other restaurant URLs (verified in Firefox) to derive a general XPath:

//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/ul[2]/li/div/div[1]/div[1]/div/div[2]/h3/span/a/@href

(In the future this parsing could be automated, for example with machine learning.)
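
Before putting an XPath into a spider, it is worth checking it interactively. A quick verification in scrapy shell might look like this (the command is standard Scrapy usage; the output is not reproduced here):

$ scrapy shell 'https://www.yelp.ca/search?find_loc=langley,+BC&cflt=chinese'
>>> response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/ul[2]/li/div/div[1]/div[1]/div/div[2]/h3/span/a/@href').extract()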

Screenshot walkthrough:

(Step 1 through Step 4 screenshots omitted)

This yields our first spider for the Yelp site:

import scrapy

class YelpSpider(scrapy.Spider):
    name = 'yelpspider'
    start_urls = ['https://www.yelp.ca/search?find_loc=langley,+BC&cflt=chinese']

    def parse(self, response):
        # Extract the relative URL of every restaurant on the results page.
        urls = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/ul[2]/li/div/div[1]/div[1]/div/div[2]/h3/span/a/@href').extract()
        for url in urls:
            # urljoin() turns the relative href into an absolute URL.
            yield {'url': response.urljoin(url)}

Run the command:

$ scrapy runspider YelpSpider.py -o urls.json

Which produces:

[
{"url": "https://www.yelp.ca/biz/luxe-chinese-seafood-restaurant-langley"},
{"url": "https://www.yelp.ca/biz/golden-river-express-langley"},
{"url": "https://www.yelp.ca/biz/yummy-house-restaurant-langley"},
{"url": "https://www.yelp.ca/biz/empire-garden-chinese-restaurant-langley"},
{"url": "https://www.yelp.ca/biz/chans-palace-langley-2"},
{"url": "https://www.yelp.ca/biz/silver-dragon-restaurant-langley"},
{"url": "https://www.yelp.ca/biz/chop-chop-take-out-langley"},
{"url": "https://www.yelp.ca/biz/dragon-garden-chinese-restaurant-langley"},
{"url": "https://www.yelp.ca/biz/wongs-chinese-seafood-restaurant-langley"},
{"url": "https://www.yelp.ca/biz/oriental-pearl-langley"}
]

We need to collect the links for all restaurants, so find the XPath of the Next link the same way and crawl all the URLs recursively:

import scrapy

class YelpSpider(scrapy.Spider):
    name = 'yelpspider'
    start_urls = ['https://www.yelp.ca/search?find_loc=langley,+BC&cflt=chinese']

    def parse(self, response):
        urls = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/ul[2]/li/div/div[1]/div[1]/div/div[2]/h3/span/a/@href').extract()
        for url in urls:
            yield {'url': response.urljoin(url)}

        # Follow the "Next" pagination link, if there is one.
        next_page = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/div/div/div/div[2]/div/div[last()]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

Output file: urls.json

With all the restaurant URLs in hand, we parse each restaurant page for the data we want to analyze. First, work out the XPaths for the restaurant's basic information:

Restaurant name (name):
//*[@id="wrap"]/div[3]/div/div[1]/div/div[2]/div[1]/div[1]/h1/text()

Restaurant address (address):
//*[@id="wrap"]/div[3]/div/div[1]/div/div[3]/div[1]/div/div[2]/ul/li[1]/div/strong/address//text()

Restaurant phone (phone):
//*[@id="wrap"]/div[3]/div/div[1]/div/div[3]/div[1]/div/div[2]/ul/li[3]/span[3]/text()

Review count (review_counts):
//*[@id="wrap"]/div[3]/div/div[1]/div/div[2]/div[1]/div[2]/div[1]/div[1]/span/span/text()

Star rating (review_stars):
//*[@id="wrap"]/div[3]/div/div[1]/div/div[2]/div[1]/div[2]/div[1]/div[1]/div/meta/@content

The code:

import scrapy

class YelpSpider(scrapy.Spider):
    name = 'yelpspider'
    start_urls = ['https://www.yelp.ca/search?find_loc=langley,+BC&cflt=chinese']

    def parse(self, response):
        urls = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/ul[2]/li/div/div[1]/div[1]/div/div[2]/h3/span/a/@href').extract()
        for url in urls:
            yield scrapy.Request(response.urljoin(url), callback=self.parse_contents)
            # yield {'url': response.urljoin(url)}

        next_page = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/div/div/div/div[2]/div/div[last()]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_contents(self, response):
        # join() merges multiple text nodes; strip('\n') trims stray newlines.
        name = ' '.join(response.xpath('//*[@id="wrap"]/div[3]/div/div[1]/div/div[2]/div[1]/div[1]/h1/text()').extract()).strip('\n')
        address = ' '.join(response.xpath('//*[@id="wrap"]/div[3]/div/div[1]/div/div[3]/div[1]/div/div[2]/ul/li[1]/div/strong/address//text()').extract()).strip('\n')
        phone = ' '.join(response.xpath('//*[@id="wrap"]/div[3]/div/div[1]/div/div[3]/div[1]/div/div[2]/ul/li[3]/span[3]/text()').extract()).strip('\n')
        review_counts = response.xpath('//*[@id="wrap"]/div[3]/div/div[1]/div/div[2]/div[1]/div[2]/div[1]/div[1]/span/span/text()').extract()
        review_stars = response.xpath('//*[@id="wrap"]/div[3]/div/div[1]/div/div[2]/div[1]/div[2]/div[1]/div[1]/div/meta/@content').extract()
        yield {'name': name,
               'address': address,
               'phone': phone,
               'ranking': '',  # placeholder, not populated yet
               'review_counts': review_counts,
               'review_stars': review_stars,
               'url': response.url
               }

Run the command:

$ scrapy runspider YelpSpider.py -o restaurants.json

Output file: restaurants.json

To learn more about XPath, see the w3schools XPath Syntax reference.
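
The XPath features used in this chapter boil down to a handful of patterns. A self-contained refresher using Scrapy's Selector (the HTML snippet is made up for illustration):

from scrapy.selector import Selector

html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
sel = Selector(text=html)

# // searches the whole document; text() selects text nodes.
print(sel.xpath('//a/text()').extract())             # ['A', 'B']
# @href selects an attribute value.
print(sel.xpath('//a/@href').extract())              # ['/a', '/b']
# [last()] picks the last match, as in the Next-link XPath above.
print(sel.xpath('//li[last()]/a/text()').extract())  # ['B']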

Scraping Reviews

Reviews are usually the hardest part of a page to scrape, because they are often generated dynamically: an XPath obtained in the browser may not match the actual page source. That is when scrapy shell becomes the debugging tool of choice.
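
The basic workflow is to open the page in scrapy shell and test XPath expressions against the response Scrapy actually downloads, rather than the DOM the browser renders:

$ scrapy shell 'https://www.yelp.ca/biz/luxe-chinese-seafood-restaurant-langley'
>>> # Test a candidate XPath directly against the downloaded response.
>>> response.xpath('//*[@id="super-container"]/div/div/div[1]/div[3]/div[1]/div[2]/ul/li')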

Analyzing the page in Chrome yields the following XPaths:

Restaurant page (restaurant url):
https://www.yelp.ca/biz/luxe-chinese-seafood-restaurant-langley

Reviews section (reviews_section):
//*[@id="super-container"]/div/div/div[1]/div[3]/div[1]/div[2]/ul/li

Reviewer name (reviews_user):
//*[@id="super-container"]/div/div/div[1]/div[3]/div[1]/div[2]/ul/li[2]/div/div[1]/div/div/div[2]/ul[1]/li[1]/a/text()

Reviewer profile link (reviews_user_url):
//*[@id="super-container"]/div/div/div[1]/div[3]/div[1]/div[2]/ul/li[2]/div/div[1]/div/div/div[2]/ul[1]/li[1]/a/@href

Review date (reviews_date):
//*[@id="super-container"]/div/div/div[1]/div[3]/div[1]/div[2]/ul/li[2]/div/div[2]/div[1]/div/span/meta/@content

Review star rating (reviews_stars):
//*[@id="super-container"]/div/div/div[1]/div[3]/div[1]/div[2]/ul/li[2]/div/div[2]/div[1]/div/div/div/meta/@content

Review text (reviews_content):
//*[@id="super-container"]/div/div/div[1]/div[3]/div[1]/div[2]/ul/li[2]/div/div[2]/div[1]/p

Which gives the following code:

import scrapy

class YelpSpider(scrapy.Spider):
    name = 'yelpspider'
    start_urls = ['https://www.yelp.ca/search?find_loc=langley,+BC&cflt=chinese']

    def parse(self, response):
        urls = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/ul[2]/li/div/div[1]/div[1]/div/div[2]/h3/span/a/@href').extract()
        for url in urls:
            yield scrapy.Request(response.urljoin(url), callback=self.parse_contents)

        next_page = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/div/div/div/div[2]/div/div[last()]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_contents(self, response):
        name = ' '.join(
            response.xpath('//*[@id="wrap"]/div[3]/div/div[1]/div/div[2]/div[1]/div[1]/h1/text()').extract()).strip(
            '\n')

        # Save the downloaded page to disk so it can be inspected offline.
        filename = 'temp_' + ''.join(name.split()) + '.html'
        with open(filename, 'w') as f:
            f.write(response.body)

        reviews_sections = response.xpath('//*[@id="super-container"]/div/div/div[1]/div[3]/div[1]/div[2]/ul/li')

        for sel in reviews_sections:
            reviews_user = sel.xpath('div/div[1]/div/div/div[2]/ul[1]/li[1]/a/text()').extract()
            reviews_user_url = sel.xpath('div/div[1]/div/div/div[2]/ul[1]/li[1]/a/@href').extract()
            reviews_date = sel.xpath('div/div[2]/div[1]/div/span/meta/@content').extract()
            reviews_stars = sel.xpath('div/div[2]/div[1]/div/div/div/meta/@content').extract()
            reviews_contents = sel.xpath('div/div[2]/div[1]/p/text()').extract()
            yield {'name': name,
                   'reviews_user': reviews_user,
                   'reviews_user_url': response.urljoin(''.join(reviews_user_url)),  # extract() returns a list, so join first
                   'reviews_date': reviews_date,
                   'reviews_stars': reviews_stars,
                   'reviews_contents': reviews_contents,
                   'restaurant_url': response.url
                   }

Run the command:

$ scrapy runspider YelpSpider_Reviews.py -o restaurant_reviews.json

The output file contains only part of the information, and testing the XPath in scrapy shell returns an empty result:

>>> response.xpath('//*[@id="super-container"]/div/div/div[1]/div[3]/div[1]/div[2]/ul/li')

[]

To make debugging easier, the code already saves the downloaded page to local disk. Open temp_LuxeChineseSeafoodRestaurant.html in Chrome and find the XPath there:

//*[@id="super-container"]/div[1]/div/div[1]/div[4]/div[1]/div[2]

Verifying it in scrapy shell gives:

>>> response.xpath('//*[@id="super-container"]/div[1]/div/div[1]/div[4]/div[1]/div[2]')

[<Selector xpath='//*[@id="super-container"]/div[1]/div/div[1]/div[4]/div[1]/div[2]' data=u'<div class="review-list">\n              '>]

Updated code:

import scrapy

class YelpSpider(scrapy.Spider):
    name = 'yelpspider'
    start_urls = ['https://www.yelp.ca/search?find_loc=langley,+BC&cflt=chinese']

    def parse(self, response):
        urls = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/ul[2]/li/div/div[1]/div[1]/div/div[2]/h3/span/a/@href').extract()
        for url in urls:
            yield scrapy.Request(response.urljoin(url), callback=self.parse_contents)

        next_page = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/div/div/div/div[2]/div/div[last()]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_contents(self, response):
        name = ' '.join(
            response.xpath('//*[@id="wrap"]/div[3]/div/div[1]/div/div[2]/div[1]/div[1]/h1/text()').extract()).strip(
            '\n')

        # filename = 'temp_' + ''.join(name.split()) + '.html'
        # with open(filename, 'w') as f:
        #     f.write(response.body)

        reviews_sections = response.xpath('//*[@id="super-container"]/div[1]/div/div[1]/div[4]/div[1]/div[2]/ul/li')

        # from scrapy.shell import inspect_response
        # inspect_response(response, self)

        for sel in reviews_sections:

            reviews_user = sel.xpath('div/div[1]/div/div/div[2]/ul[1]/li[1]/a/text()').extract()
            reviews_user_url = sel.xpath('div/div[1]/div/div/div[2]/ul[1]/li[1]/a/@href').extract()
            reviews_date = sel.xpath('div/div[2]/div[1]/div/span/meta/@content').extract()
            reviews_stars = sel.xpath('div/div[2]/div[1]/div/div/div/meta/@content').extract()
            reviews_contents = sel.xpath('div/div[2]/div[1]/p/text()').extract()
            # Debug output to confirm each review block is parsed.
            print(reviews_user)
            print(reviews_contents)
            yield {'name': name,
                   'reviews_user': reviews_user,
                   'reviews_user_url': 'https://www.yelp.ca' + ''.join(reviews_user_url),
                   'reviews_date': reviews_date,
                   'reviews_stars': reviews_stars,
                   'reviews_contents': reviews_contents,
                   'restaurant_url': response.url
                   }

Output file: restaurant_reviews.json

Reviews are paginated as well; the code to handle that follows:

import scrapy

class YelpSpider(scrapy.Spider):
    name = 'yelpspider'
    start_urls = ['https://www.yelp.ca/search?find_loc=langley,+BC&cflt=chinese']

    def parse(self, response):
        urls = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/ul[2]/li/div/div[1]/div[1]/div/div[2]/h3/span/a/@href').extract()
        for url in urls:
            yield scrapy.Request(response.urljoin(url), callback=self.parse_contents)

        next_page = response.xpath('//*[@id="super-container"]/div/div[2]/div[1]/div/div[4]/div/div/div/div[2]/div/div[last()]/a/@href')
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

    def parse_contents(self, response):
        name = ' '.join(
            response.xpath('//*[@id="wrap"]/div[3]/div/div[1]/div/div[2]/div[1]/div[1]/h1/text()').extract()).strip(
            '\n')

        # filename = 'temp_' + ''.join(name.split()) + '.html'
        # with open(filename, 'w') as f:
        #     f.write(response.body)

        reviews_sections = response.xpath('//*[@id="super-container"]/div[1]/div/div[1]/div[4]/div[1]/div[2]/ul/li')

        # from scrapy.shell import inspect_response
        # inspect_response(response, self)

        for sel in reviews_sections:

            reviews_user = sel.xpath('div/div[1]/div/div/div[2]/ul[1]/li[1]/a/text()').extract()
            reviews_user_url = sel.xpath('div/div[1]/div/div/div[2]/ul[1]/li[1]/a/@href').extract()
            reviews_date = sel.xpath('div/div[2]/div[1]/div/span/meta/@content').extract()
            reviews_stars = sel.xpath('div/div[2]/div[1]/div/div/div/meta/@content').extract()
            reviews_contents = sel.xpath('div/div[2]/div[1]/p/text()').extract()

            yield {'name': name,
                   'reviews_user': reviews_user,
                   'reviews_user_url': 'https://www.yelp.ca' + ''.join(reviews_user_url),
                   'reviews_date': reviews_date,
                   'reviews_stars': reviews_stars,
                   'reviews_contents': reviews_contents,
                   'restaurant_url': response.url
                   }

        # filename = 'temp_review_urls.txt'
        # with open(filename, 'a') as f:
        #     f.writelines(response.url + '\n')

        review_next_page = response.xpath('//*[@id="super-container"]/div[1]/div/div[1]/div[4]/div[1]/div[3]/div/div/div[2]/div/div[last()]/a/@href')
        if review_next_page:
            url = response.urljoin(review_next_page[0].extract())
            # filename = 'temp_review_urls.txt'
            # with open(filename, 'a') as f:
            #     f.writelines(url + '\n')
            yield scrapy.Request(url, self.parse_contents)

Output file: restaurant_reviews_pages

If readers have better suggestions for the review-scraping part, feel free to leave a comment.

The book Learning Scrapy recommends debugging with scrapy parse:

$ scrapy parse --spider=basic http://web:9312/properties/property_000001.html

Incremental Crawling

If the data has not changed, doing a full crawl and parse every day wastes a great deal of compute and storage. An incremental crawling strategy improves efficiency:

  • Detect whether restaurants in the area have opened or closed
  • Detect whether a restaurant's basic information has changed, e.g. address or phone
  • Detect whether a restaurant has new reviews

Once the approach is settled, the spider needs to be deployed to the cloud and scheduled to crawl daily.

This is still at the design stage and will be fleshed out later.
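
As a starting point, one simple way to implement the "has anything changed" checks is to fingerprint each scraped item and drop items whose fingerprint was already seen on a previous run. A minimal sketch (the seen_items.json state file and the pipeline class are assumptions, not part of the chapter's code):

import hashlib
import json
import os

from scrapy.exceptions import DropItem

SEEN_FILE = 'seen_items.json'  # hypothetical state file

def fingerprint(item):
    # Stable hash of the item's content; any field change alters it.
    payload = json.dumps(item, sort_keys=True)
    return hashlib.sha1(payload.encode('utf-8')).hexdigest()

class DedupPipeline(object):
    """Drop items identical to ones seen on a previous run."""

    def open_spider(self, spider):
        self.seen = set()
        if os.path.exists(SEEN_FILE):
            with open(SEEN_FILE) as f:
                self.seen = set(json.load(f))

    def close_spider(self, spider):
        with open(SEEN_FILE, 'w') as f:
            json.dump(list(self.seen), f)

    def process_item(self, item, spider):
        fp = fingerprint(dict(item))
        if fp in self.seen:
            raise DropItem('unchanged item')
        self.seen.add(fp)
        return item

The pipeline would be enabled through Scrapy's ITEM_PIPELINES setting; detecting closed restaurants would additionally require comparing the full set of restaurant URLs between runs.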


Deploying the Spider Code

The spider can run on a local machine or be deployed to a cloud platform. Without major architectural changes, a single-machine Scrapy setup is suitable for at most roughly ten million requests per day, with parsing volume on the same order (the exact limit depends on what is parsed, since parsing is compute-intensive).

When your crawler is slow, first determine whether it is bound by CPU, network I/O, or disk I/O. If it is CPU-bound, profile the code to find the hot spots; Scrapy's built-in item loaders use many dynamic language features and are relatively CPU-heavy. If compute is truly scarce, the remaining option is to optimize the code by writing more efficient CSS and XPath selectors (avoid overly general ones), at some cost to development speed. If it is network-bound, test whether your own connection is poor or the remote server is struggling, and check DNS as well. If it is disk-bound, look at the database's disk I/O.
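
Before profiling anything, it can help to know the upper bound of what Scrapy can do on your hardware. Scrapy ships with a built-in benchmark command that crawls a local dummy site as fast as possible; the throughput it reports is hardware-dependent:

$ scrapy bench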

Deploying to a Local Machine

Once Scrapy is installed locally, you can run the crawl commands directly. However, many sites use anti-crawling techniques and will often block your IP and stop answering requests, which also breaks later legitimate requests. So do not crawl too aggressively, and consider measures such as a proxy to keep your IP from being blocked.

Before running the Scrapy crawl command, run:

export http_proxy=http://proxy:port
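
Beyond a proxy, Scrapy's own throttling settings help avoid bans. A sketch of conservative settings (the values are illustrative, not tuned; the contact address is hypothetical):

# settings.py (illustrative values)
DOWNLOAD_DELAY = 2                # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2
AUTOTHROTTLE_ENABLED = True       # back off automatically when the server slows down
ROBOTSTXT_OBEY = True
USER_AGENT = 'mybot (contact@example.com)'  # hypothetical identifier
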
Deploying to a SaaS Platform

Scrapinghub:

$ shub login
Insert your Scrapinghub API Key: <API_KEY>

# Deploy the spider to Scrapy Cloud
$ shub deploy

# Schedule the spider for execution
$ shub schedule blogspider
Spider blogspider scheduled, watch it running here:
https://app.scrapinghub.com/p/26731/job/1/8

# Retrieve the scraped data
$ shub items 26731/1/8
{"title": "Black Friday, Cyber Monday: Are They Worth It?"}
{"title": "Tips for Creating a Cohesive Company Culture Remotely"}
...

Deploying a spider requires a Project ID, which you can find after logging in under Spiders -> Codes & Deploys:

The ID of the project: 73763
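
shub reads the project ID from a scrapinghub.yml file in the project root (created on first deploy). A minimal one, using the ID above, looks roughly like this:

# scrapinghub.yml (minimal sketch)
projects:
  default: 73763
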
Deploying to an IaaS Platform

Crawling is a compute-intensive task, and the elastic resource scheduling of an IaaS platform helps here. AWS EC2 is a good choice. See:

How to crawl a quarter billion webpages in 40 hours

Also of interest: a distributed web crawler built with scrapy, redis, mongodb, and graphite, with a mongodb cluster as the storage layer, redis for distribution, and graphite for displaying crawler status.
