I'm writing a Scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell, it successfully extracts a list of article URLs with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm sort of at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help would be greatly appreciated.

    from datetime import date

    import scrapy
    from scrapy.contrib.spiders import Rule
    from scrapy.contrib.linkextractors import LinkExtractor

    from ..items import NewsArticle

    with open('urls/debug/nyt.txt') as debug_urls:
        debug_urls = debug_urls.readlines()

    with open('urls/release/nyt.txt') as release_urls:
        release_urls = release_urls.readlines()  # [""]

    today = date.today().strftime('%Y/%m/%d')
    print today


    class NytSpider(scrapy.Spider):
        name = "nyt"
        allowed_domains = [""]
        start_urls = release_urls

        rules = (
            Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
                 callback='parse', follow=True),
        )

        def parse(self, response):
            article = NewsArticle()
            for story in response.xpath('//article[@id="story"]'):
                article['url'] = response.url
                article['title'] = story.xpath(
                    '//h1[@id="story-heading"]/text()').extract()
                article['author'] = story.xpath(
                    '//span[@class="byline-author"]/@data-byline-name'
                ).extract()
                article['published'] = story.xpath(
                    '//time[@class="dateline"]/@datetime').extract()
                article['content'] = story.xpath(
                    '//div[@id="story-body"]/p//text()').extract()
                yield article

I have found the solution to my problem. I was doing two things wrong:

  1. I needed to subclass CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.
  2. When using CrawlSpider, I needed to use a callback function rather than overriding parse. As per the docs, overriding parse breaks CrawlSpider functionality, because CrawlSpider uses parse itself to apply the rules.


