python - Scrapy Spider not Following Links

- January 15, 2012

I'm writing a Scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it extracts a list of article URLs with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm sort of at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.

    from datetime import date

    import scrapy
    from scrapy.contrib.spiders import Rule
    from scrapy.contrib.linkextractors import LinkExtractor

    from ..items import NewsArticle

    with open('urls/debug/nyt.txt') as debug_urls:
        debug_urls = debug_urls.readlines()

    with open('urls/release/nyt.txt') as release_urls:
        release_urls = release_urls.readlines()  # ["http://www.nytimes.com"]

    today = date.today().strftime('%Y/%m/%d')
    print today


    class NYTSpider(scrapy.Spider):
        name = "nyt"
        allowed_domains = ["nytimes.com"]
        start_urls = release_urls

        rules = (
            Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
                 callback='parse', follow=True),
        )

        def parse(self, response):
            article = NewsArticle()
            for story in response.xpath('//article[@id="story"]'):
                article['url'] = response.url
                article['title'] = story.xpath(
                    '//h1[@id="story-heading"]/text()').extract()
                article['author'] = story.xpath(
                    '//span[@class="byline-author"]/@data-byline-name'
                ).extract()
                article['published'] = story.xpath(
                    '//time[@class="dateline"]/@datetime').extract()
                article['content'] = story.xpath(
                    '//div[@id="story-body"]/p//text()').extract()
                yield article

I have found the solution to my problem. I was doing two things wrong:

1. I needed to subclass CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.
2. When using CrawlSpider, I needed to use a separate callback function rather than overriding parse. Per the docs, overriding parse breaks CrawlSpider functionality, because CrawlSpider uses parse itself to apply its rules.
