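First, install a Python environment (I will not cover that here; search for an installation guide, and I will add one later when I have time). Then install Scrapy with pip and create a project with the scrapy command:
pip install scrapy
scrapy startproject projectname
projectname is the name of the project you want to create.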
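The project structure is as follows (this is the default layout generated by scrapy startproject):
projectname/
    scrapy.cfg
    projectname/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py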
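Spider files go in the spiders folder (the __init__.py file there only marks the folder as a Python package).
Start by creating a .py file for the spider. Here is the code first; the explanation follows.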
import json
import re
import urllib.parse

import scrapy


class JobScrapy(scrapy.Spider):
    name = '51job'
    allowed_domains = ['www.51job.com', 'search.51job.com']
    start_urls = ['https://search.51job.com/']
    page = 1
    pagesize = 0
    jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
    # List-page URL for the first job type and the first page.
    urls = ('https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,'
            + str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99'
            '&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')
    url = "search.51job.com"

    def __init__(self, value, fileName, **kwargs):
        # Let scrapy.Spider do its own setup before adding our attributes.
        super().__init__(**kwargs)
        self.value = value
        self.fileName = fileName
        self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

    def parse(self, response):
        urls = self.urls
        # dont_filter=True allows a page to be crawled again even if its URL was already seen.
        yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

    def fond_parse(self, response):
        print(response)
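Let's go through the class first. It inherits from scrapy.Spider, which is the spider component of Scrapy.
The name attribute is the spider's name. The name you pass when starting the crawl must match it, for example scrapy crawl 51job (this spider also expects its two constructor arguments, passed as -a value=<value> -a fileName=<name>).
The start_urls attribute lists the first pages to crawl.
The allowed_domains attribute lists every domain the spider may crawl; requests to any other domain are filtered out.
The parse method is the callback for the responses to start_urls, so this is where the home page response is handled. (When I was first learning I crawled the home page here. This URL should really be the target page itself, so the data could be extracted directly, but I have not bothered to change it.) Inside the callback you can locate elements with regular expressions, XPath, and similar methods.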
yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
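scrapy.Request is an ordinary request. It uses GET by default; you can switch it to POST, or use FormRequest to submit a form.
dont_filter=True lets the request bypass the duplicate filter, so a page that was already crawled can be crawled again.
callback is the method that will be called with the response.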
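Inside a callback you can keep processing the data or request new pages, for example crawling the detail pages after a list page. I will write about the further processing later.
One more topic for a future post: the 51job detail pages are currently protected by a slider captcha. I will look into handling it when I have time, along with spoofing the User-Agent.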
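Requesting new pages from a callback mostly comes down to building the next URLs. Note that in the spider above, urls is built once when the class is defined, so changing page later does not change it. A minimal sketch of building the list-page URL per job type and page instead (build_list_url, list_urls and max_page are names I made up; where the real page count comes from is not shown in this post):

```python
# Same list-page URL as in the spider, with the job type and page number left as slots.
LIST_URL = ('https://search.51job.com/list/000000,000000,{job_type},00,9,99,+,2,{page}.html'
            '?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99'
            '&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')

JOB_TYPES = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']


def build_list_url(job_type, page):
    """Hypothetical helper: list-page URL for one job type and one page."""
    return LIST_URL.format(job_type=job_type, page=page)


def list_urls(job_types, max_page):
    """Yield every list-page URL; max_page is an assumed page count."""
    for job_type in job_types:
        for page in range(1, max_page + 1):
            yield build_list_url(job_type, page)


if __name__ == '__main__':
    for url in list_urls(JOB_TYPES[:2], 2):
        print(url)
```

In a callback, each of these URLs would be passed to scrapy.Request together with the callback that handles the list page.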