Scrapy: CrawlSpider Rules process_links vs process_request vs download middleware [duplicate]

This is less of a "how do I use these?" and more of "when/why would I use these?" type question.

EDIT: This question is a near duplicate of this question, which suggests using a download middleware to filter such requests. Updated my question below to reflect that.

In the Scrapy CrawlSpider documentation, rules accept two callables, process_links and process_request (documentation quoted below for easier reference).

By default Scrapy filters duplicate URLs, but I'm looking to do additional filtering of requests because I get duplicate pages that have multiple distinct URLs linking to them. For example:

URL1 = "http://example.com/somePage.php?id=XYZ&otherParam=fluffyKittens"
URL2 = "http://example.com/somePage.php?id=XYZ&otherParam=scruffyPuppies"

However, these URLs share a common element in the query string; in the example above, it is the id parameter.

I'm thinking it would make sense to use the process_links callable of my spider to filter out duplicate requests.
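A minimal sketch of the kind of process_links filter described here, assuming that the page path plus the id query parameter is enough to identify a page. The class name IdDedupe and the choice of key are hypothetical, not part of Scrapy; the only thing the sketch relies on is that each extracted link exposes a url attribute, as scrapy.link.Link does.

```python
from urllib.parse import parse_qs, urlsplit


class IdDedupe:
    """Hypothetical process_links callable: keeps one link per (host, path, id)."""

    def __init__(self, param="id"):
        self.param = param
        self.seen = set()

    def __call__(self, links):
        kept = []
        for link in links:
            parts = urlsplit(link.url)
            values = parse_qs(parts.query).get(self.param)
            if not values:
                # No id in the query string: leave the link for Scrapy's own dupefilter.
                kept.append(link)
                continue
            key = (parts.netloc, parts.path, values[0])
            if key in self.seen:
                # Same page already queued under a different URL: drop this link.
                continue
            self.seen.add(key)
            kept.append(link)
        return kept
```

An instance could then be passed to a rule, for example Rule(LinkExtractor(), callback="parse_item", process_links=IdDedupe()), where parse_item is a placeholder callback name. Note that the key is recorded when the link is extracted, so a page whose first request later fails will not be reached through one of its other URLs.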

Questions:

  1. Is there some reason why process_request would be better suited to this task?
  2. If not, can you provide an example of when process_request would be more applicable?
  3. Is a download middleware more appropriate than either process_links or process_request? If so, can you provide an example of when process_links or process_request would be a better solution?

Documentation quote:

process_links is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called for each list of links extracted from each response using the specified link_extractor. This is mainly used for filtering purposes.

process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).


Source: https://stackoverflow.com/questions/16040311
Updated: 2019-06-22 04:58

Best answer

  1. No, process_links is the better option here: you are only filtering URLs, and it saves the overhead of creating a Request in process_request just to discard it.

  2. process_request is useful if you want to massage the Request a little before you send it off, say if you want to add a meta argument or perhaps add or remove headers.

  3. You don't need any middleware in your case because the functionality you need is built directly into the Rule. If process_links were not built into the rules, then you would need to create your own middleware.
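To illustrate point 2, here is a rough sketch of a process_request callable that adjusts a request before it is scheduled. The function name tag_request, the meta key source_url, and the header value are made up for illustration. The response parameter is optional because recent Scrapy versions pass the originating response as a second argument, while older ones pass only the request.

```python
def tag_request(request, response=None):
    """Hypothetical process_request callable for a CrawlSpider Rule."""
    if response is not None:
        # Carry extra context to the callback through the request's meta dict.
        request.meta["source_url"] = response.url
    # Add a header before the request is sent.
    request.headers["X-Requested-With"] = "XMLHttpRequest"
    # Returning the request keeps it; returning None would filter it out.
    return request
```

It would be attached to a rule with Rule(LinkExtractor(), callback="parse_item", process_request=tag_request), again with parse_item as a placeholder. Because the Request object already exists at this stage, this hook suits per-request adjustments rather than pure URL filtering.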

2013-04-16
