Is it possible to run another spider from Scrapy spider?(是否可以从 Scrapy spider 运行另一个蜘蛛?)
问题描述
现在我有 2 只蜘蛛,我想做的是
For now I have 2 spiders, what I would like to do is
- Spider
1转到url1并且如果出现url2,用url2<调用蜘蛛2/代码>.也使用管道保存url1的内容. - 蜘蛛
2去url2做点什么.
- Spider
1goes tourl1and ifurl2appears, call spider2withurl2. Also saves the content ofurl1by using pipeline. - Spider
2goes tourl2and do something.
由于两种蜘蛛的复杂性,我想将它们分开.
Due to the complexities of both spiders I would like to have them separated.
我使用 scrapy crawl 的尝试:
def parse(self, response):
p = multiprocessing.Process(
target=self.testfunc())
p.join()
p.start()
def testfunc(self):
settings = get_project_settings()
crawler = CrawlerRunner(settings)
crawler.crawl(<spidername>, <arguments>)
它会加载设置但不会抓取:
It does load the settings but doesn't crawl:
2015-08-24 14:13:32 [scrapy] INFO: Enabled extensions: CloseSpider, LogStats, CoreStats, SpiderState
2015-08-24 14:13:32 [scrapy] INFO: Enabled downloader middlewares: DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, HttpAuthMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-08-24 14:13:32 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-08-24 14:13:32 [scrapy] INFO: Spider opened
2015-08-24 14:13:32 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
文档中有一个关于从脚本启动的示例,但我想做的是在使用 scrapy crawl 命令时启动另一个蜘蛛.
The documentations has a example about launching from script, but what I'm trying to do is launch another spider while using scrapy crawl command.
完整代码
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor
from multiprocessing import Process
import scrapy
import os
def info(title):
print(title)
print('module name:', __name__)
if hasattr(os, 'getppid'): # only available on Unix
print('parent process:', os.getppid())
print('process id:', os.getpid())
class TestSpider1(scrapy.Spider):
name = "test1"
start_urls = ['http://www.google.com']
def parse(self, response):
info('parse')
a = MyClass()
a.start_work()
class MyClass(object):
def start_work(self):
info('start_work')
p = Process(target=self.do_work)
p.start()
p.join()
def do_work(self):
info('do_work')
settings = get_project_settings()
runner = CrawlerRunner(settings)
runner.crawl(TestSpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()
return
class TestSpider2(scrapy.Spider):
name = "test2"
start_urls = ['http://www.google.com']
def parse(self, response):
info('testspider2')
return
我希望是这样的:
- scrapy 抓取测试1(例如,当 response.status_code 为 200 时:)
- 在test1中,调用
scrapy crawl test2
推荐答案
我不会深入给出,因为这个问题真的很老,但我会继续从官方 Scrappy 文档中删除这个片段......你非常接近!哈哈
I won't go in depth given since this question is really old but I'll go ahead drop this snippet from the official Scrappy docs.... You are very close! lol
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider1(scrapy.Spider):
# Your first spider definition
...
class MySpider2(scrapy.Spider):
# Your second spider definition
...
process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
https://doc.scrapy.org/en/latest/topics/实践.html
然后使用回调,你可以在你的蜘蛛之间传递项目做你所说的逻辑函数
And then using callbacks you can pass items between your spiders do do w.e logic functions your talking about
这篇关于是否可以从 Scrapy spider 运行另一个蜘蛛?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:是否可以从 Scrapy spider 运行另一个蜘蛛?
基础教程推荐
- 由Python将MP3转换为MIDI(类型错误:无法加载插件:mtg-Melodia:Melodia) 2022-01-01
- numpy float:比算术运算中内置的慢 10 倍? 2022-01-01
- 在 Celery 工作人员中捕获 Heroku SIGTERM 以优雅地关 2022-01-01
- 用 Python 编写 Fortran 无格式文件 2022-01-01
- 尝试制作WhatsApp机器人 2022-01-01
- 使用生成器和迭代器时 Python 多循环失败 2022-01-01
- Discord.py 缺少必需的参数 2022-01-01
- 将 x 轴刻度更改为自定义字符串 2022-01-01
- pyserial - 可以从线程 a 写入串行端口,是否阻塞从线程 b 读取? 2022-01-01
- 与常规 dict 相比,Python manager.dict() 非常慢 2022-01-01
