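First, set up a Python environment (installation is not covered here; search online for instructions, and I may add a guide later). Then install Scrapy with pip and create a project with the scrapy command: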
pip install scrapy
scrapy startproject projectname
Here projectname is the name of the project you want to create.
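The project structure is as follows (this is the default layout generated by scrapy startproject):

    projectname/
        scrapy.cfg
        projectname/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py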
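Spider files go in the spiders directory (the __init__.py file only marks that folder as a Python package).
Start by creating a .py file for the spider. Here is the code, which is explained step by step afterwards: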
    import json
    import re
    import urllib.parse

    import scrapy


    class JobScrapy(scrapy.Spider):
        name = '51job'
        allowed_domains = ['www.51job.com', 'search.51job.com']
        start_urls = ['https://search.51job.com/']
        page = 1
        pagesize = 0
        jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
        # Listing URL for the first job type, page 1
        urls = ('https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,'
                + str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99'
                '&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')
        url = "search.51job.com"

        def __init__(self, value, fileName, *args, **kwargs):
            # Let scrapy.Spider run its own initialisation first
            super().__init__(*args, **kwargs)
            self.value = value
            self.fileName = fileName
            # Output file; it is never closed in this snippet
            self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

        def parse(self, response):
            # Callback for start_urls: request the listing page next
            urls = self.urls
            # dont_filter=True allows a page to be crawled more than once
            yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

        def fond_parse(self, response):
            print(response)
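Let's start with the class itself. It inherits from scrapy.Spider, the base class for spiders in Scrapy.
The name attribute is the spider's name. The name you pass when starting the crawl (for example, scrapy crawl 51job) must match it.
The start_urls attribute lists the first page(s) to crawl.
The allowed_domains attribute lists every domain the spider may crawl; requests to any other domain are filtered out.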
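To make the effect of allowed_domains concrete, here is a minimal standard-library sketch of a domain check. It is an illustration only, not Scrapy's actual implementation: the function name is_allowed and the test URLs are made up for this example, and it assumes that a host passes if it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse

# Same list as the spider's allowed_domains attribute.
ALLOWED_DOMAINS = ['www.51job.com', 'search.51job.com']


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Hypothetical helper: True if the URL's host is an allowed domain
    or a subdomain of one."""
    host = (urlparse(url).hostname or '').lower()
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)


# Illustrative URLs only; the paths are invented.
print(is_allowed('https://search.51job.com/list/000000.html'))  # True
print(is_allowed('https://jobs.51job.com/some/detail.html'))    # False
print(is_allowed('https://example.com/'))                       # False
```

Note that under this rule a sibling subdomain such as jobs.51job.com does not pass, because only www.51job.com and search.51job.com are listed. Listing the parent domain 51job.com instead would cover all of its subdomains.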
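The parse method is the callback for the responses to start_urls, so this is where the home page response is handled. (When I was first learning I crawled the home page here; start_urls should really point at the target page so the data can be extracted directly, but I have not gone back to change it.) Inside the callback you can locate elements with regular expressions, XPath, and similar methods.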
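As a rough sketch of the regular-expression approach, the snippet below pulls links and titles out of an HTML string using only the standard library. The HTML fragment, the pattern, and the name extract_jobs are all invented for illustration and do not reflect 51job's real markup. Inside a callback, the same pattern could be applied to response.text, and the XPath route would go through response.xpath(...).

```python
import re

# Made-up HTML standing in for response.text; not 51job's real markup.
html = '''
<div class="el"><a href="https://example.com/job/1.html" title="Python Developer">Python Developer</a></div>
<div class="el"><a href="https://example.com/job/2.html" title="Data Engineer">Data Engineer</a></div>
'''

# Capture the href and title attributes of each anchor tag.
LINK_RE = re.compile(r'<a href="([^"]+)" title="([^"]+)"')


def extract_jobs(text):
    """Hypothetical helper: return (url, title) pairs for every matching anchor."""
    return LINK_RE.findall(text)


for url, title in extract_jobs(html):
    print(title, url)
```

Regular expressions are brittle against markup changes such as reordered attributes, so XPath or CSS selectors are usually the more robust choice for anything beyond a quick extraction.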
yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
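scrapy.Request issues an ordinary request, which is a GET by default. You can change it to POST, or use FormRequest to submit form data.
dont_filter=True bypasses the duplicate filter, so a page that has already been requested can be crawled again.
callback is the method that will be called with the response.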
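Inside a callback you can keep processing data or request new pages, for example crawling a list page and then each detail page it links to. Further processing will be covered in a later post.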
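One way a callback might queue further list pages is to generate their URLs from the job type and page number. The sketch below rebuilds the spider's urls string with the standard library. The names build_list_url and list_urls and the last_page parameter are assumptions made for this example; the original code does not show how the page count is determined.

```python
from urllib.parse import urlencode

# Job type codes copied from the spider's jobtype attribute.
JOB_TYPES = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']

# Query parameters copied from the spider's urls attribute, in the same order.
QUERY = {
    'lang': 'c', 'postchannel': '0000', 'workyear': '99', 'cotype': '99',
    'degreefrom': '99', 'jobterm': '99', 'companysize': '99',
    'ord_field': '0', 'dibiaoid': '0', 'line': '', 'welfare': '',
}


def build_list_url(job_type, page):
    """Hypothetical helper: listing URL for one job type and page number."""
    path = f'/list/000000,000000,{job_type},00,9,99,+,2,{page}.html'
    return 'https://search.51job.com' + path + '?' + urlencode(QUERY)


def list_urls(job_types, last_page):
    """Yield listing URLs for every job type, pages 1..last_page (assumed known)."""
    for job_type in job_types:
        for page in range(1, last_page + 1):
            yield build_list_url(job_type, page)


print(build_list_url(JOB_TYPES[0], 1))
```

In the spider, each generated URL would be passed to scrapy.Request together with a callback, in the same way the parse method above does for the first page.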
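One more item for the to-do list: 51job detail pages are currently protected by a slider CAPTCHA. I will look into handling that when I have time, along with spoofing the User-Agent.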