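First, install a Python environment (obvious, I know; search online for how to do it, and I will add the steps later when I have time). Then install Scrapy with pip and use the scrapy command to create a project: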
    pip install scrapy
    scrapy startproject projectname
projectname is the name of the project you want to create.
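The generated project contains a spiders folder, and that is where the spider files go (the __init__.py file only declares that the folder is a Python package).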
Start by creating a .py file for the spider. Here is the code; the explanation follows.
    import scrapy
    import urllib.parse
    import json
    import re


    class JobScrapy(scrapy.Spider):
        name = '51job'
        allowed_domains = ['www.51job.com', 'search.51job.com']
        start_urls = ['https://search.51job.com/']
        page = 1
        pagesize = 0
        jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
        # List-page URL for the first job type, page 1
        urls = ('https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,'
                + str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99'
                '&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')
        url = "search.51job.com"

        def __init__(self, value, fileName, **kwargs):
            super().__init__(**kwargs)  # let scrapy.Spider finish its own setup
            self.value = value
            self.fileName = fileName
            # Open the JSON output file
            self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

        def parse(self, response):
            # Callback for start_urls: request the list page next
            urls = self.urls
            # dont_filter=True allows a page to be crawled more than once
            yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

        def fond_parse(self, response):
            print(response)
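First, the class itself: it inherits from scrapy.Spider, the base component that Scrapy spiders are built on.
The name attribute is the name of this spider. The name you pass when starting the crawl has to match it, for example scrapy crawl 51job. Because __init__ here also takes value and fileName, those have to be supplied as spider arguments with -a.
The start_urls attribute lists the first page(s) to crawl.
The allowed_domains attribute lists every domain the spider is allowed to crawl; requests to any other domain are filtered out.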
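To make the allowed_domains rule concrete, here is a minimal sketch of that kind of check written with the standard library only. It is a simplified illustration, not Scrapy's own code: the helper name is_allowed and the second test URL are assumptions made for the example.

```python
from urllib.parse import urlparse

# Same values as allowed_domains in the spider above
ALLOWED_DOMAINS = ["www.51job.com", "search.51job.com"]


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical helper for illustration; Scrapy does this filtering internally.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


print(is_allowed("https://search.51job.com/list/000000.html"))  # True
print(is_allowed("https://jobs.51job.com/some/page.html"))      # False
```

The second URL is rejected because its host is neither listed nor a subdomain of a listed entry, so any other subdomain you want to crawl has to be added to allowed_domains.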
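The parse method is the callback for the pages in start_urls, and it is where the response for the home page gets handled. (When I was first learning I crawled the home page; really this URL should just be the target page so the data can be extracted directly, but I have not bothered to change it.) Elements can be located in the response with regular expressions, XPath, and similar methods.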
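As a rough illustration of the regular-expression approach, the sketch below pulls link URLs and titles out of a small HTML snippet. The markup in SAMPLE_HTML, the example.com URLs, and the function name extract_jobs are all invented for the example and do not reflect the real 51job page structure.

```python
import re

# Made-up markup standing in for a job list page
SAMPLE_HTML = """
<div class="el"><a href="https://example.com/job/1.html" title="Python Developer">Python Developer</a></div>
<div class="el"><a href="https://example.com/job/2.html" title="Data Engineer">Data Engineer</a></div>
"""

# Named groups capture the link target and the title attribute
LINK_PATTERN = re.compile(r'<a href="(?P<url>[^"]+)" title="(?P<title>[^"]+)"')


def extract_jobs(html):
    """Return one dict per matched link, with 'url' and 'title' keys."""
    return [match.groupdict() for match in LINK_PATTERN.finditer(html)]


for job in extract_jobs(SAMPLE_HTML):
    print(job["title"], job["url"])
```

Inside a Scrapy callback the same function could be fed response.text; response.xpath(...) is the XPath route and is usually less brittle than a regex when the markup changes.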
    yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
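scrapy.Request is an ordinary request. It defaults to GET, can be switched to POST, and for form submissions there is FormRequest.
dont_filter=True allows a page that has already been requested to be crawled again.
callback is the method that is called with the response.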
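Inside a callback you can keep processing the data or request new pages, for example crawling the detail pages after crawling a list page. I will write up the rest of the processing later.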
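Requesting further list pages from a callback needs a URL per page. In the spider above, the class attribute urls is built once, when the class body runs, with page = 1, so changing self.page afterwards does not change it. A minimal sketch of a helper that rebuilds the URL for any job type and page follows; the names build_list_url, list_page_urls, and total_pages are hypothetical, and where the page count comes from is left open.

```python
# Same URL as the spider's `urls` attribute, with the two variable parts
# (job type code and page number) turned into placeholders
LIST_URL_TEMPLATE = (
    "https://search.51job.com/list/000000,000000,{job_type},00,9,99,+,2,{page}.html"
    "?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99"
    "&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="
)


def build_list_url(job_type, page):
    """Return the list-page URL for one job type code and one page number."""
    return LIST_URL_TEMPLATE.format(job_type=job_type, page=page)


def list_page_urls(job_type, total_pages):
    """Yield the list-page URLs for pages 1..total_pages of one job type."""
    for page in range(1, total_pages + 1):
        yield build_list_url(job_type, page)


for url in list_page_urls("0100", 2):
    print(url)
```

In a callback, each of these URLs would be passed to scrapy.Request(url=..., callback=...) and yielded, the same way parse does for the first page.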
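One more item on the to-do list: the 51job detail pages currently sit behind a slider CAPTCHA. I will look into handling it when I have time, along with spoofing the User-Agent.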