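First, install a Python environment (obviously; search online for how to do it, I will add the steps later when I have time). Then install scrapy with pip and create a project with the scrapy command:
pip install scrapy
scrapy startproject projectname
projectname is the name of the project you want to create.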
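The project structure is as follows (this is the default layout generated by scrapy startproject):
projectname/
    scrapy.cfg
    projectname/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Spider files go in the spiders folder (the __init__.py file only marks the folder as a Python package).
Start by creating a .py file for the spider. Here is the code; the explanation follows.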
import scrapy
import urllib.parse
import json
import re


class JobScrapy(scrapy.Spider):
    name = '51job'
    allowed_domains = ['www.51job.com', 'search.51job.com']
    start_urls = ['https://search.51job.com/']
    page = 1
    pagesize = 0
    jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
    # List page URL for the first job type and the current page
    urls = ('https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,'
            + str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99'
            '&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')
    url = "search.51job.com"

    def __init__(self, value, fileName, **kwargs):
        # Let scrapy.Spider finish its own initialisation first
        super().__init__(**kwargs)
        self.value = value
        self.fileName = fileName
        # Output file for the scraped results
        self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

    def parse(self, response):
        urls = self.urls
        # dont_filter=True allows the same page to be crawled more than once
        yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

    def fond_parse(self, response):
        print(response)
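Let's start with the class itself. It inherits from Spider, which is one of Scrapy's components.
The name attribute is the name of this spider; the name you give when starting the crawl (for example, scrapy crawl 51job) must match it.
The start_urls attribute holds the first page to crawl.
The allowed_domains attribute lists every domain the spider is allowed to crawl; requests to any other domain are filtered out.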
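To make the domain filtering concrete, here is a minimal sketch of the kind of check involved, written with the standard library only. It is not Scrapy's own code: `is_allowed` is a hypothetical helper, and the second URL is made up for illustration. It assumes a host passes if it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host equals an allowed domain or is a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["www.51job.com", "search.51job.com"]

# Host is listed explicitly, so it passes
print(is_allowed("https://search.51job.com/list/000000.html", allowed))

# Hypothetical sibling subdomain: not listed and not a subdomain of a listed entry
print(is_allowed("https://jobs.51job.com/some/page.html", allowed))
```

Under this rule, listing the bare domain `51job.com` would cover every subdomain at once, which is usually simpler than listing them one by one.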
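The parse method is the callback for the responses to start_urls. This is where the home page response is handled (when I was first learning I crawled the home page; this URL should really be the target page itself so the data can be extracted directly, but I have not bothered to change it). Elements can be located with regular expressions, XPath, and similar methods.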
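As a rough illustration of the two approaches, the sketch below pulls data out of a small HTML fragment with a regular expression and with an XPath-style query. The fragment, the class names, and the example.com URLs are all made up. To stay runnable without a live response it uses the standard library's `xml.etree.ElementTree`, which only supports a limited XPath subset and needs well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for response.text
html = """
<div class="list">
  <div class="job"><a href="https://example.com/job/1.html" title="Python Developer">Python Developer</a><span class="salary">10k-15k</span></div>
  <div class="job"><a href="https://example.com/job/2.html" title="Data Engineer">Data Engineer</a><span class="salary">15k-20k</span></div>
</div>
""".strip()

# Regular expression: capture every (href, title) pair
links = re.findall(r'<a href="([^"]+)" title="([^"]+)"', html)

# XPath-style lookup: every salary span inside a job row
root = ET.fromstring(html)
salaries = [span.text for span in root.findall(".//div[@class='job']/span[@class='salary']")]

print(links)
print(salaries)
```

Inside a spider callback the same query would normally go through the response object, for example `response.xpath("//div[@class='job']/span[@class='salary']/text()").getall()`, which also copes with real-world HTML that is not well-formed.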
yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
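scrapy.Request is an ordinary request. It defaults to GET but can be changed to POST, and FormRequest can be used for form submissions.
dont_filter=True lets the request bypass the duplicate filter, so a page that has already been crawled can be crawled again.
callback is the method that is called with the response.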
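Inside the callback you can keep processing the data or request new pages, for example crawling the detail pages after crawling a list page. I will cover that later processing in a follow-up.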
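One way to request further pages from a callback is to generate the list-page URLs up front. The sketch below reuses the URL pattern from the spider; `build_list_url`, `list_urls`, and the `last_page` parameter are hypothetical names introduced here, and how the real last page number is discovered is left open.

```python
# Same URL pattern as the spider's `urls` attribute, with placeholders
LIST_URL = (
    "https://search.51job.com/list/000000,000000,{jobtype},00,9,99,+,2,{page}.html"
    "?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99"
    "&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="
)


def build_list_url(jobtype, page):
    """Fill the job type code and page number into the list-page URL."""
    return LIST_URL.format(jobtype=jobtype, page=page)


def list_urls(jobtypes, last_page):
    """Yield one list-page URL per job type and page, pages 1..last_page."""
    for jobtype in jobtypes:
        for page in range(1, last_page + 1):
            yield build_list_url(jobtype, page)


for url in list_urls(["0100", "7700"], last_page=2):
    print(url)
```

Inside a callback, each generated URL could then be turned into a request with `yield scrapy.Request(url=url, callback=self.fond_parse)`.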
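One more item for the to-do list: the 51job detail pages are currently protected by a slider captcha. I will look into handling it when I have time, along with User-Agent spoofing later on.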