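First, install a Python environment (I'll skip the details; search Baidu for installation instructions, and I'll add them here later when I have time). Then install Scrapy with pip and create a project with the scrapy command:
pip install scrapy
scrapy startproject projectname
projectname is the name of the project you want to create.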
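The generated project structure looks like this (the default layout produced by scrapy startproject):
projectname/
    scrapy.cfg
    projectname/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py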
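Spider files go in the spiders directory (the __init__.py file only marks the folder as a Python package).
Start by creating a .py file for the spider. I'll paste the code first and then explain it step by step.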
import scrapy
import urllib.parse
import json
import re


class JobScrapy(scrapy.Spider):
    name = '51job'
    allowed_domains = ['www.51job.com', 'search.51job.com']
    start_urls = ['https://search.51job.com/']
    page = 1
    pagesize = 0
    jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
    # List-page URL for the first job type and the current page number
    urls = ('https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,'
            + str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99'
            '&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')
    url = "search.51job.com"

    def __init__(self, value, fileName, *args, **kwargs):
        # Let scrapy.Spider finish its own initialization first
        super().__init__(*args, **kwargs)
        self.value = value
        self.fileName = fileName
        self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

    def parse(self, response):
        # Callback for start_urls; it only forwards to the real list page
        urls = self.urls
        # dont_filter=True allows duplicate pages to be crawled
        yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

    def fond_parse(self, response):
        print(response)

    def closed(self, reason):
        # Called by Scrapy when the spider closes; release the output file
        self.fp.close()
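Let's walk through the class first. It inherits from scrapy.Spider, which is the spider component of the Scrapy framework.
The name attribute is the spider's name. The name you pass when starting the crawl must match it (for example, scrapy crawl 51job).
The start_urls attribute lists the first page(s) to crawl.
The allowed_domains attribute lists every domain the spider is allowed to crawl; requests to any other domain are filtered out.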
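To make the filtering rule concrete, here is a rough sketch of the check in plain Python. It only illustrates the idea and is not Scrapy's actual implementation. The helper name is_allowed and the host jobs.51job.com are assumptions made up for this example. A URL passes if its host equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["www.51job.com", "search.51job.com"]
print(is_allowed("https://search.51job.com/list/", allowed))    # True
print(is_allowed("https://jobs.51job.com/some-page", allowed))  # False (hypothetical host)
```

Under this rule, listing the bare domain 51job.com instead would admit every subdomain.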
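The parse method is the default callback for the responses to start_urls. Here it handles the response for the home page (when I was first learning I crawled the home page; this URL should really be the target page itself so the data can be extracted directly, but I haven't bothered to change it). Inside it you can locate elements with regular expressions, XPath, and similar methods.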
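As a small illustration of the regular-expression approach, the sketch below pulls links and titles out of an HTML string. The markup, URLs, and the name LINK_RE are invented for the example; the real 51job pages will almost certainly look different, so treat the pattern as a starting point only.

```python
import re

# Hypothetical HTML fragment; the real page markup may differ.
html = '''
<div class="el"><a href="https://example.com/job/1.html" title="Python Developer">Python Developer</a></div>
<div class="el"><a href="https://example.com/job/2.html" title="Data Engineer">Data Engineer</a></div>
'''

# Capture the link target and the title attribute of each anchor.
LINK_RE = re.compile(r'<a href="(?P<url>[^"]+)" title="(?P<title>[^"]+)"')

jobs = [m.groupdict() for m in LINK_RE.finditer(html)]
for job in jobs:
    print(job["title"], job["url"])
```

Inside a Scrapy callback, response.text would take the place of html. The XPath route would be something like response.xpath('//div[@class="el"]/a/@href').getall(), again assuming this made-up markup.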
yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
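scrapy.Request is an ordinary request and uses GET by default. You can change it to POST, or use FormRequest to submit a form.
dont_filter=True allows duplicate pages to be crawled.
callback is the method that will be called with the response.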
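Inside a callback you can keep processing the data or request new pages, for example crawling the detail pages after a list page. I'll cover the remaining processing in a later post.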
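One way a callback could request further list pages is to build the URL from a template instead of the hard-coded class attribute. This is a minimal sketch under the assumption, taken from the spider code, that the job type and the page number are the only parts of the URL that change. The names LIST_URL, build_list_url, and list_urls are mine, not part of Scrapy.

```python
# URL template with placeholders for the two parts that vary in the spider code.
LIST_URL = (
    "https://search.51job.com/list/000000,000000,{jobtype},00,9,99,+,2,{page}.html"
    "?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99"
    "&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="
)


def build_list_url(jobtype, page):
    """Return the list-page URL for one job type code and one page number."""
    return LIST_URL.format(jobtype=jobtype, page=page)


def list_urls(jobtypes, pages):
    """Yield list-page URLs for every job type, pages 1..pages inclusive."""
    for jobtype in jobtypes:
        for page in range(1, pages + 1):
            yield build_list_url(jobtype, page)


for u in list_urls(["0100", "7700"], 2):
    print(u)
```

In the spider, each generated URL could then be passed to scrapy.Request(url=..., callback=self.fond_parse) from inside a callback.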
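One more item for the to-do list: the 51job detail pages are currently protected by a slider CAPTCHA. I'll look into handling it when I have time, along with spoofing the User-Agent.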