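First install a Python environment (I won't cover that here; search for a guide, and I may add one later when I have time). Then install Scrapy with pip and create a project with the scrapy command: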
    pip install scrapy
    scrapy startproject projectname
projectname is the name of the project you want to create.
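The project structure is as follows (this is the standard layout generated by scrapy startproject):
    projectname/
        scrapy.cfg
        projectname/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py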
Spider files go in the spiders directory (the __init__.py file only marks that folder as a Python package).
First create a .py file for the spider. Here is the code; I'll explain it step by step.
    import scrapy
    import urllib.parse
    import json
    import re


    class JobScrapy(scrapy.Spider):
        name = '51job'
        allowed_domains = ['www.51job.com', 'search.51job.com']
        start_urls = ['https://search.51job.com/']
        page = 1
        pagesize = 0
        jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
        # List-page URL for the first job type, built once when the class is defined
        urls = 'https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,' + \
               str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99&d' \
               'egreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare='
        url = "search.51job.com"

        def __init__(self, value, fileName, *args, **kwargs):
            # Let scrapy.Spider run its own initialisation first
            super().__init__(*args, **kwargs)
            self.value = value
            self.fileName = fileName
            # Output file for the results
            self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

        def parse(self, response):
            # Callback for start_urls: request the job list page
            urls = self.urls
            # dont_filter=True allows duplicate pages to be crawled
            yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

        def fond_parse(self, response):
            print(response)
Start with the class itself: it inherits from scrapy.Spider, which is the spider component of Scrapy.
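The name attribute is the spider's name. The name you give when starting the crawl (for example scrapy crawl 51job) must match it. Spider arguments such as value and fileName are passed on the command line with -a name=value.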
The start_urls attribute lists the first page(s) to crawl.
The allowed_domains attribute lists every domain the spider may crawl; requests to any other domain are filtered out.
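To make the filtering rule concrete, here is a minimal sketch in plain Python of the kind of check involved: a URL passes when its host equals an allowed domain or is a subdomain of one. The helper name is_allowed and the two example URLs are hypothetical, and this only approximates Scrapy's behavior; it is not Scrapy's actual implementation.

```python
from urllib.parse import urlparse

# Same values as allowed_domains in the spider
ALLOWED_DOMAINS = ['www.51job.com', 'search.51job.com']


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


# Hypothetical URLs, used only to exercise the check
print(is_allowed('https://search.51job.com/list/page.html'))   # True
print(is_allowed('https://jobs.51job.com/detail/page.html'))   # False
```

Under this rule a sibling subdomain such as the hypothetical jobs.51job.com above would be filtered. Listing the bare domain 51job.com instead would cover every subdomain.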
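The parse method is the default callback for the responses to start_urls. Here it handles the home page response (when I was starting out I crawled the home page first; the start URL should really be the target page so the data can be extracted directly, but I haven't bothered to change it). Elements can be located with regular expressions, XPath, and similar methods.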
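As a sketch of pointing the start URLs straight at the target pages, the list URL from the spider can be turned into a template and generated per job type and page. The names LIST_URL, JOB_TYPES, and build_list_url are my own for illustration; the URL pattern is copied from the spider code and may no longer match the live site.

```python
# Job type codes taken from the spider's jobtype attribute
JOB_TYPES = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']

# Same URL as the spider's urls attribute, with job type and page as placeholders
LIST_URL = (
    'https://search.51job.com/list/000000,000000,{job_type},00,9,99,+,2,{page}.html'
    '?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99'
    '&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare='
)


def build_list_url(job_type, page=1):
    """Build the list-page URL for one job type and page number."""
    return LIST_URL.format(job_type=job_type, page=page)


# One start URL per job type, all at page 1
start_urls = [build_list_url(job_type) for job_type in JOB_TYPES]
print(start_urls[0])
```

With start URLs like these, parse receives the list page directly and the extra request to the home page is no longer needed.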
yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
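scrapy.Request is an ordinary request that defaults to GET. It can be changed to POST, or you can use FormRequest to submit form data.
dont_filter=True allows duplicate pages to be crawled.
callback is the method that is called with the response.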
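Inside a callback you can process the data further or request new pages, for example crawling a list page and then its detail pages. I'll cover that processing in a later post.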
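A note for a future post: 51job detail pages currently sit behind a slider verification. I'll look into handling that when I have time, along with User-Agent spoofing.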