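Lagou Education: Advanced Python Crawling Primer, an Entry-Level Scrapy Framework Case Study - Beijing Haidian Zhongguancun Software Engineer Training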

Prerequisites: install the required packages:
pip install scrapy
[Screenshot: successful installation]

pip install Twisted
[Screenshot: successful installation]

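Contents
• Preface
• 1. Writing the Tenxun.py spider file
• 2. Defining the data fields in items.py
• 3. Writing the items to a file in pipelines.py
• 4. Configuring the crawler in settings.py
• 5. Running the crawler
• Summary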

________________________________________
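Preface
Earlier in our study of crawlers we used requests to fetch and parse web pages and pull out the data we wanted. When a site holds a large amount of data, Scrapy, an asynchronous crawling framework, is a better fit. Below I walk through a hands-on Scrapy example that crawls job postings from Tencent's careers site.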
________________________________________
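1. Writing the Tenxun.py spider file
[Figure 2: screenshot]

This file is the core of the project: the crawling logic is written here. I will publish the source code and walk through it. First, create a Scrapy project with the following command: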
scrapy startproject demoTenXun
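Figure 2 above is a screenshot of the command running successfully. Figure 3 (a screenshot of the browser developer tools, opened with F12) shows the page's data: the job list is "AJAX-rendered", meaning the page loads it from a JSON API instead of embedding it in the HTML, so the spider requests that API directly. Next, create a new file Tenxun.py in the spiders folder under demoTenXun and write the following: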
import json

import scrapy

from demoTenXun.items import DemotenxunItem


class TenXunSpider(scrapy.Spider):
    name = 'Tenxun'  # spider name; this is all you pass when running the crawl
    allowed_domains = ['careers.te***']
    start_urls = ['https://careers.te***/tencentcareer/api/post/Query?timestamp=1602982179339&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=python&pageIndex=1&pageSize=10&language=zh-cn&area=cn']
    offer = 1  # index of the page currently being crawled

    def parse(self, response):
        # The API returns JSON, so parse the response body with json
        datas = json.loads(response.text)
        for data in datas['Data']['Posts']:
            # Create one item per job posting
            item = DemotenxunItem()
            item['RecruitPostName'] = data['RecruitPostName']
            item['Responsibility'] = data['Responsibility']
            item['LastUpdateTime'] = data['LastUpdateTime']
            item['LocationName'] = data['LocationName']
            yield item

        # Move on to the next page once this page's postings are yielded
        self.offer += 1
        # Stop after page 109
        if self.offer <= 109:
            # URL of the next page: same query with a new pageIndex
            next_url = 'https://careers.te***/tencentcareer/api/post/Query?timestamp=1602982179339&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=python&pageIndex={}&pageSize=10&language=zh-cn&area=cn'.format(self.offer)
            yield scrapy.Request(next_url, callback=self.parse)
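The two ideas in the spider above, building a page URL from pageIndex and picking four fields out of the JSON body, can be tried without Scrapy or network access. The following is only a minimal sketch: BASE_URL, build_page_url, extract_posts and the sample response are all made up for illustration, and the real API takes more query parameters than shown here.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint used only for illustration; substitute the real API URL.
BASE_URL = "https://example.com/api/post/Query"

# The four fields the spider copies into each item.
FIELDS = ("RecruitPostName", "Responsibility", "LastUpdateTime", "LocationName")


def build_page_url(page_index, keyword="python", page_size=10):
    """Return the query URL for one page of results."""
    params = {
        "keyword": keyword,
        "pageIndex": page_index,
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    return BASE_URL + "?" + urlencode(params)


def extract_posts(response_text):
    """Parse the JSON body and keep only the fields listed in FIELDS."""
    payload = json.loads(response_text)
    # Tolerate a missing or null "Data" / "Posts" instead of raising.
    posts = (payload.get("Data") or {}).get("Posts") or []
    return [{name: post.get(name) for name in FIELDS} for post in posts]


# A made-up response body with the same shape the spider expects.
sample = json.dumps({
    "Data": {
        "Posts": [
            {
                "RecruitPostName": "Python Engineer",
                "Responsibility": "Build crawlers",
                "LastUpdateTime": "2020-10-18",
                "LocationName": "Shenzhen",
                "PostId": "ignored",
            }
        ]
    }
})

posts = extract_posts(sample)
print(build_page_url(2))
print(posts)
```

Unlike the spider, extract_posts returns an empty list when a page has no postings, which gives a natural stopping condition if the total page count is not known in advance.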
2. Defining the data fields in items.py
The code is as follows (example):
import scrapy


class DemotenxunItem(scrapy.Item):
    # define the fields for your item here like:
    RecruitPostName = scrapy.Field()  # job title
    Responsibility = scrapy.Field()   # job responsibilities
    LastUpdateTime = scrapy.Field()   # last update (posting) time
    LocationName = scrapy.Field()     # job location
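3. Writing the items to a file in pipelines.py
The code is as follows (example):
import codecs
import json


class DemotenxunPipeline:
    def __init__(self):
        # Opened in append mode; each line written is one JSON object,
        # even though the file has a .csv extension
        self.file = codecs.open('tensun.csv', 'a', encoding='GBK')

    def process_item(self, item, spider):
        # Serialize the item as one line of JSON, keeping Chinese text readable
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Scrapy calls this hook when the spider finishes
        self.file.close()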
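A pipeline is an ordinary Python class, so its behavior can be checked by calling its methods by hand. Scrapy calls open_spider and close_spider on a pipeline if they are defined, and process_item once per item. The sketch below is an illustration under assumptions, not the pipeline above: the class name JsonLinesWriterPipeline, the path argument, the file name jobs.jsonl and the UTF-8 encoding are all choices made for this example.

```python
import json
import os
import tempfile


class JsonLinesWriterPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path="jobs.jsonl"):
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called for every item; must return the item for later pipelines.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: flush and close the file.
        self.file.close()


# Drive the pipeline by hand, in the order Scrapy would during a crawl.
path = os.path.join(tempfile.mkdtemp(), "jobs.jsonl")
pipeline = JsonLinesWriterPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"RecruitPostName": "Python 工程师", "LocationName": "深圳"}, spider=None)
pipeline.close_spider(spider=None)

# Read the file back: every line parses as one JSON object.
with open(path, encoding="utf-8") as fh:
    rows = [json.loads(line) for line in fh]
print(rows)
```

Opening the file in open_spider instead of __init__ keeps the file handle's lifetime tied to the crawl, and UTF-8 avoids encoding errors for characters that GBK cannot represent.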
4. Configuring the crawler in settings.py
A few settings need to be changed:
# Uncomment this block
ITEM_PIPELINES = {
    'demoTenXun.pipelines.DemotenxunPipeline': 300,
}
# Add your request headers here
DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0'
}
# Change this to False
ROBOTSTXT_OBEY = False
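Since the spider requests over a hundred pages in a row, it may also be worth slowing the crawl down in settings.py. The fragment below is an optional sketch and not part of the original configuration: the setting names are standard Scrapy settings, but the values are assumptions chosen for illustration.

```python
# settings.py (optional sketch; the values are illustrative assumptions)

# Wait about one second between requests to the same site.
DOWNLOAD_DELAY = 1

# Limit how many requests run in parallel against one domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let Scrapy adjust the delay automatically based on server response times.
AUTOTHROTTLE_ENABLED = True
```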
5. Running the crawler
The general form is:
scrapy crawl <your spider name> (the value of name='' in Tenxun.py)
For this project the command is:
scrapy crawl Tenxun
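Summary
Note: that is everything for today. This article only briefly introduced how to use Scrapy, but Scrapy provides many more features for crawling data quickly and conveniently.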