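First, install a Python environment (installation guides are easy to find online; I will add the steps here when I have time). Then install Scrapy with pip and create a project with the scrapy command:
pip install scrapy
scrapy startproject projectname
Here projectname is the name of the project you want to create.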
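The command generates the standard Scrapy project structure.
Spider files go in the spiders directory (the __init__.py file there only marks the folder as a Python package).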
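For reference, this is the layout that scrapy startproject typically generates. It is reproduced from Scrapy's default project template rather than from the original illustration, and the exact files may differ slightly between Scrapy versions:

```
projectname/
    scrapy.cfg            # deploy configuration
    projectname/          # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider files go here
            __init__.py
```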
Start by creating a .py file in that directory for the spider. Here is the code; the explanation follows.
import scrapy
import urllib.parse
import json
import re


class JobScrapy(scrapy.Spider):
    name = '51job'
    allowed_domains = ['www.51job.com', 'search.51job.com']
    start_urls = ['https://search.51job.com/']
    page = 1
    pagesize = 0
    jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
    # List-page URL for the first job type and the current page number
    urls = ('https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,'
            + str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99'
            '&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')
    url = "search.51job.com"

    def __init__(self, value, fileName, *args, **kwargs):
        # Let scrapy.Spider run its own initialization first
        super().__init__(*args, **kwargs)
        self.value = value
        self.fileName = fileName
        self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

    def parse(self, response):
        urls = self.urls
        # dont_filter=True allows a page to be crawled more than once
        yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

    def fond_parse(self, response):
        print(response)
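Let's start with the class itself: it inherits from scrapy.Spider, the spider component of the Scrapy framework.
The name attribute is the spider's name; the name you give when starting the spider must match it.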
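The start_urls attribute lists the first page(s) to crawl.
The allowed_domains attribute lists every domain the spider is allowed to crawl; requests to any other domain are filtered out.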
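The filtering rule can be illustrated with a few lines of plain Python. This is a simplified sketch of the idea (a host passes if it equals an allowed domain or is a subdomain of one), not Scrapy's actual implementation; is_allowed and the example URLs are made up for illustration:

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = ['www.51job.com', 'search.51job.com']


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


print(is_allowed('https://search.51job.com/list/000000.html'))  # True
print(is_allowed('https://jobs.51job.com/beijing/123.html'))    # False
```

Note that the second URL is rejected: a sibling subdomain is not covered unless it, or the parent domain 51job.com, is listed.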
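The parse method is the default callback for responses to start_urls, so the home page response is handled here. (When I was first learning I crawled the home page; this URL should really be the target page so the data can be extracted directly, but I have not gotten around to changing it.) Elements can be located with regular expressions, XPath, and similar methods.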
yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
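scrapy.Request is an ordinary request and defaults to GET; it can be changed to POST, or you can use FormRequest to submit form data.
dont_filter=True allows a page that has already been requested to be crawled again.
callback is the method that is called with the response.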
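A callback can go on to process the data or request new pages, for example crawling a list page and then the detail pages it links to. I will cover further processing in a later post.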
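One more item for the to-do list: the 51job detail pages are currently protected by a slider CAPTCHA. I will look into handling it when I have time, along with spoofing the User-Agent later on.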