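First, install a Python environment (installation is not covered here; search online for instructions, and I will add them later when I have time). Then install Scrapy with pip and create a project with the scrapy command:
pip install scrapy
scrapy startproject projectname
projectname is the name of the project you want to create.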
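The generated project contains a scrapy.cfg file and a package directory named after the project, which holds items.py, middlewares.py, pipelines.py, settings.py, and a spiders directory.
Spider files go in the spiders directory (the __init__.py file there merely marks the folder as a Python package).
Start by creating a .py file for the spider. Here is the code, explained step by step afterwards: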
import json
import re
import urllib.parse

import scrapy


class JobScrapy(scrapy.Spider):
    name = '51job'
    allowed_domains = ['www.51job.com', 'search.51job.com']
    start_urls = ['https://search.51job.com/']
    page = 1
    pagesize = 0
    # Codes substituted into the list-page URL (only jobtype[0] is used here)
    jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
    # List-page URL; built once at class-definition time, with page = 1
    urls = 'https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,' + \
        str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99&d' \
        'egreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare='
    url = "search.51job.com"

    def __init__(self, value, fileName, **kwargs):
        # Let scrapy.Spider finish its own initialisation
        super().__init__(**kwargs)
        self.value = value
        self.fileName = fileName
        self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

    def parse(self, response):
        # Callback for start_urls: hand off to the list page
        urls = self.urls
        # dont_filter=True lets the same URL be requested more than once
        yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

    def fond_parse(self, response):
        print(response)
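Let's start with the class itself. It inherits from scrapy.Spider, which is Scrapy's spider component.
The name attribute is the spider's name. It is the name you pass when starting the spider, for example scrapy crawl 51job. Because __init__ here takes value and fileName, those are supplied on the command line with -a options.
The start_urls attribute lists the first page(s) to crawl.
The allowed_domains attribute lists every domain the spider may crawl; requests to any other domain are filtered out.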
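To make the filtering rule concrete, here is a minimal sketch of the idea using only the standard library. It is not Scrapy's own implementation; the helper name is_allowed is made up for illustration. It assumes the rule is "the host equals a listed domain or is a subdomain of one".

```python
from urllib.parse import urlparse

# Same list as in the spider above
ALLOWED_DOMAINS = ["www.51job.com", "search.51job.com"]


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


print(is_allowed("https://search.51job.com/list/000000.html"))
print(is_allowed("https://other.example.com/page"))
```

Under this rule a host on a different subdomain of 51job.com would not match either entry, so listing the bare domain 51job.com is the simpler choice if pages on several subdomains need to be crawled.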
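The parse method is the callback for the responses to start_urls, and this is where the home page response is handled. (When I was first learning I crawled the home page; start_urls should really point straight at the target page so the data can be extracted directly, but I have not gone back to change it.) Elements can be located with regular expressions, XPath, and similar methods.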
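As a rough illustration of the regular-expression approach, the sketch below pulls links and titles out of an HTML string. The HTML fragment, the class name "title", and the helper extract_jobs are all hypothetical; the real 51job markup is not shown in this post and will differ.

```python
import re

# Hypothetical HTML fragment, used only for illustration
SAMPLE_HTML = """
<div class="el">
  <a class="title" href="https://example.com/job/1.html">Python Developer</a>
  <a class="title" href="https://example.com/job/2.html">Data Engineer</a>
</div>
"""

# Capture the href and the link text of each matching anchor
LINK_RE = re.compile(r'<a class="title" href="([^"]+)">([^<]+)</a>')


def extract_jobs(html):
    """Return a list of (url, title) tuples found in the HTML text."""
    return [(url, title.strip()) for url, title in LINK_RE.findall(html)]


for url, title in extract_jobs(SAMPLE_HTML):
    print(title, url)
```

Inside a Scrapy callback the same text is available as response.text, and the XPath route is available through response.xpath(...), for example response.xpath('//a[@class="title"]/@href').getall() for the markup assumed here.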
yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
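scrapy.Request is an ordinary request. It defaults to GET; pass method='POST' to send a POST instead, or use FormRequest to submit form data.
dont_filter=True lets the request bypass Scrapy's duplicate filter, so a URL that has already been requested can be crawled again.
callback is the method that will be called with the response.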
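A callback can process the data further or request new pages, for example crawling the detail pages after a list page. I will cover that processing in a later post.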
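One thing to watch when requesting further pages: the class-level urls string in the spider is built once, with page = 1, so changing page later does not change it. A minimal sketch of rebuilding the list URL per job type and page is shown below. The helpers build_list_url and list_urls, and the last_page parameter, are hypothetical; the post does not show how the total page count is obtained.

```python
BASE = "https://search.51job.com/list/000000,000000,"
QUERY = ("lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99"
         "&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=")


def build_list_url(job_type, page):
    """Build the list-page URL for one job-type code and page number."""
    return f"{BASE}{job_type},00,9,99,+,2,{page}.html?{QUERY}"


def list_urls(job_types, last_page):
    """Yield list-page URLs for every job type, pages 1..last_page."""
    for job_type in job_types:
        for page in range(1, last_page + 1):
            yield build_list_url(job_type, page)


for url in list_urls(["0100", "7700"], 2):
    print(url)
```

Inside a callback, each generated URL would be wrapped in a scrapy.Request with the appropriate callback and yielded.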
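One more item for the to-do list: the 51job detail pages are currently protected by a slider CAPTCHA. I will look into handling it when I have time, along with User-Agent spoofing.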