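First, install Python (obviously; search online for how to install it, and I may add instructions here later when I have time). Then install Scrapy with pip and create a project with the scrapy command:

    pip install scrapy
    scrapy startproject projectname

projectname is the name of the project you want to create.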
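The project structure is as follows (this is the standard layout that scrapy startproject generates):

    projectname/
        scrapy.cfg
        projectname/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py

Spider files go in the spiders folder (the __init__.py file only marks the folder as a Python package).
Start by creating a .py file for the spider. Here is the code, which I will explain step by step: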
    import json
    import re
    import urllib.parse

    import scrapy


    class JobScrapy(scrapy.Spider):
        name = '51job'
        allowed_domains = ['www.51job.com', 'search.51job.com']
        start_urls = ['https://search.51job.com/']
        page = 1
        pagesize = 0
        jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
        # List-page URL for the first job type and the current page number.
        urls = ('https://search.51job.com/list/000000,000000,' + jobtype[0]
                + ',00,9,99,+,2,' + str(page)
                + '.html?lang=c&postchannel=0000&workyear=99&cotype=99'
                '&degreefrom=99&jobterm=99&companysize=99&ord_field=0'
                '&dibiaoid=0&line=&welfare=')
        url = "search.51job.com"

        def __init__(self, value, fileName, *args, **kwargs):
            # Let scrapy.Spider finish its own initialisation first.
            super().__init__(*args, **kwargs)
            self.value = value
            self.fileName = fileName
            self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

        def parse(self, response):
            # dont_filter=True allows a page to be crawled more than once.
            yield scrapy.Request(url=self.urls, callback=self.fond_parse, dont_filter=True)

        def fond_parse(self, response):
            print(response)

        def closed(self, reason):
            # Called by Scrapy when the spider finishes; release the output file.
            self.fp.close()
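The urls attribute in the spider is fixed to the first job type and page 1. As a rough sketch of how the same URL pattern could cover the other job-type codes and later pages, the helper below rebuilds it from two parameters. The names build_list_url, JOB_TYPES and QUERY are hypothetical and not part of the original spider, and the URL pattern is simply copied from the code above, so it may no longer match the live site.

```python
# Hypothetical helper (not in the original spider): builds the list-page URL
# for one job-type code and one page number, using the pattern from the spider.
JOB_TYPES = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']

QUERY = ('lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99'
         '&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')


def build_list_url(job_type: str, page: int) -> str:
    """Return the list-page URL for the given job type and page."""
    return ('https://search.51job.com/list/000000,000000,'
            f'{job_type},00,9,99,+,2,{page}.html?{QUERY}')


if __name__ == '__main__':
    # Print the first page of the first two job types as a quick check.
    for code in JOB_TYPES[:2]:
        print(build_list_url(code, 1))
```

Calling build_list_url('0100', 1) reproduces the value of urls in the spider, so the helper could replace the hard-coded string if more job types or pages are needed.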
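Let's start with the class itself. It inherits from scrapy.Spider, which is Scrapy's spider component.
The name attribute is the name of this spider. The name given when starting the crawl (scrapy crawl 51job) must match it. Because __init__ takes value and fileName, they have to be passed on the command line with -a value=... -a fileName=...
The start_urls attribute lists the first pages to crawl.
The allowed_domains attribute lists every domain the spider may crawl. Requests to any other domain are filtered out.
The parse method is the callback for the responses to start_urls. Here it handles the response for the home page (when I was first learning I crawled the home page; the start URL should really be the target page itself so the data can be extracted directly, but I haven't bothered to change it). Elements in the response can be located with regular expressions, XPath, and similar methods.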
yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
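scrapy.Request is an ordinary request. It uses GET by default; the method can be changed to POST, or FormRequest can be used to submit form data.
dont_filter=True bypasses the duplicate filter, so a page that was already requested can be crawled again.
callback is the method that will be called with the response to this request.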
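Inside a callback you can keep processing the data or request new pages, for example crawling the detail pages after crawling a list page. I will cover the remaining processing in a later post.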
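One more item for the to-do list: 51job detail pages are currently protected by a slider CAPTCHA. I will look into handling it when I have time, along with spoofing the User-Agent.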