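First install a Python environment (search online for installation instructions; I'll add them here when I have time). Then install Scrapy with pip and create a project with the scrapy command:
pip install scrapy
scrapy startproject projectname
projectname is the name of the project you want to create.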
In the generated project, spider files go in the spiders folder (the __init__.py file there only marks the folder as a Python package).
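For reference, the layout that `scrapy startproject projectname` generates in recent Scrapy versions looks roughly like the following. The exact set of files can vary by version, so treat this as a sketch rather than a guaranteed listing.

```
projectname/
    scrapy.cfg
    projectname/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```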
First create a .py file for the spider. I'll paste the code and then explain it step by step.
import json
import re
import urllib.parse

import scrapy


class JobScrapy(scrapy.Spider):
    name = '51job'  # spider name, used when starting the crawl
    allowed_domains = ['www.51job.com', 'search.51job.com']
    start_urls = ['https://search.51job.com/']
    page = 1
    pagesize = 0
    jobtype = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
    # Listing URL for the first job type and the current page
    urls = ('https://search.51job.com/list/000000,000000,' + jobtype[0] + ',00,9,99,+,2,'
            + str(page) + '.html?lang=c&postchannel=0000&workyear=99&cotype=99'
            '&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=')
    url = "search.51job.com"

    def __init__(self, value, fileName, *args, **kwargs):
        super().__init__(*args, **kwargs)  # let scrapy.Spider finish its own setup
        self.value = value
        self.fileName = fileName
        self.fp = open("Over_" + fileName + ".json", 'w', encoding='utf-8')

    def parse(self, response):
        # Callback for start_urls: request the listing page
        urls = self.urls
        # dont_filter=True allows the same page to be crawled more than once
        yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)

    def fond_parse(self, response):
        print(response)

    def closed(self, reason):
        self.fp.close()  # close the output file when the spider finishes
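The class above builds `urls` once, for `jobtype[0]` and `page = 1`. To request other pages or job types, the same string can be produced by a small helper. The sketch below is one assumed way to do that: `build_list_url` and `LIST_URL_TEMPLATE` are hypothetical names, and the URL template is copied from the `urls` attribute.

```python
# Hypothetical helper (not in the original spider): rebuilds the listing URL
# from the same template as the class attribute `urls`.
LIST_URL_TEMPLATE = (
    "https://search.51job.com/list/000000,000000,{jobtype},00,9,99,+,2,{page}.html"
    "?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99"
    "&companysize=99&ord_field=0&dibiaoid=0&line=&welfare="
)


def build_list_url(jobtype: str, page: int) -> str:
    """Return the listing URL for one job-type code and one page number."""
    return LIST_URL_TEMPLATE.format(jobtype=jobtype, page=page)


JOBTYPES = ['0100', '7700', '7200', '7300', '7800', '7400', '2700', '7900']
# First listing page for every job-type code.
first_pages = [build_list_url(code, 1) for code in JOBTYPES]
```

Inside the spider, a callback could yield one `scrapy.Request` per URL produced this way instead of relying on the single class-level string.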
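Start with the class itself: it inherits from scrapy.Spider, the base class for spiders in Scrapy.
The name attribute is the name of this spider; the name given when starting the crawl must match it.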
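Since `name` is '51job' and `__init__` takes `value` and `fileName`, the spider has to be started by that name with both arguments supplied. One way to do this from Python, assuming the script sits in the project root next to scrapy.cfg, is `scrapy.cmdline.execute`, which accepts the same arguments as the `scrapy` command line. The helper names `crawl_args` and `run` and the argument values `python` and `jobs` are placeholders.

```python
def crawl_args(value: str, file_name: str) -> list:
    """Build the argument list equivalent to:
    scrapy crawl 51job -a value=<value> -a fileName=<file_name>
    """
    return ["scrapy", "crawl", "51job",
            "-a", f"value={value}",
            "-a", f"fileName={file_name}"]


def run() -> None:
    # Imported lazily so that building the argument list does not need Scrapy.
    from scrapy.cmdline import execute
    # Placeholder argument values; execute() exits the process when the crawl ends.
    execute(crawl_args("python", "jobs"))
```

Calling `run()` starts the crawl. Each `-a key=value` pair is passed to the spider's `__init__` as a keyword argument.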
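The start_urls attribute lists the first page(s) to crawl.
The allowed_domains attribute lists the domains the spider may crawl; requests to any other domain are filtered out.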
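Roughly, the filter compares each request's host name with the entries in `allowed_domains` and also lets subdomains of an allowed entry through. The function below is a simplified stand-alone model of that check, not Scrapy's actual implementation; `is_allowed` is a hypothetical name, and `jobs.51job.com` is used only as an example of a host that is not in the list.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = ['www.51job.com', 'search.51job.com']


def is_allowed(url: str, allowed_domains=ALLOWED_DOMAINS) -> bool:
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


print(is_allowed("https://search.51job.com/list/page.html"))  # True
print(is_allowed("https://jobs.51job.com/some-detail.html"))  # False
```

Because only the `www` and `search` hosts are listed, other subdomains are dropped. Listing the bare domain `51job.com` would cover all of its subdomains.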
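The parse method is the callback for the pages in start_urls; this is where the response for the home page is handled. (When I was first learning I crawled the home page. The URL should really be the target page itself so the data can be extracted directly, but I haven't got around to changing it.) Elements in the response can be located with regular expressions, XPath, and similar methods.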
yield scrapy.Request(url=urls, callback=self.fond_parse, dont_filter=True)
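scrapy.Request is an ordinary request, GET by default; it can be changed to POST, or FormRequest can be used to submit a form.
dont_filter=True allows duplicate pages to be crawled, since the request skips the duplicate filter.
callback is the method that is called with the response.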
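Inside a callback you can keep processing data or request new pages, for example crawling the detail pages after crawling a list page. I'll write up the remaining processing later.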
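One more thing to dig into: 51job's detail pages are currently protected by a slider CAPTCHA. I'll look into handling it when I have time, along with spoofing the User-Agent.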
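As a starting point for the User-Agent part, one common approach is to pick a User-Agent string per request from a small pool and send it in the request headers. The sketch below only builds the headers; the strings in `USER_AGENTS` are sample values and `random_headers` is a hypothetical helper. It does nothing about the slider CAPTCHA.

```python
import random

# Sample desktop browser User-Agent strings; replace with ones you trust.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def random_headers() -> dict:
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

The result can be passed as `headers=random_headers()` when building a `scrapy.Request`. A single fixed value can instead be set through the `USER_AGENT` setting in settings.py.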