python - Scrapy Spider not Following Links
I'm writing a Scrapy spider to crawl today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate a link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article URLs with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm sort of at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.
    from datetime import date

    import scrapy
    from scrapy.contrib.spiders import Rule
    from scrapy.contrib.linkextractors import LinkExtractor

    from ..items import NewsArticle

    with open('urls/debug/nyt.txt') as debug_urls:
        debug_urls = debug_urls.readlines()

    with open('urls/release/nyt.txt') as release_urls:
        release_urls = release_urls.readlines()  # ["http://www.nytimes.com"]

    today = date.today().strftime('%Y/%m/%d')
    print today


    class NytSpider(scrapy.Spider):
        name = "nyt"
        allowed_domains = ["nytimes.com"]
        start_urls = release_urls
        rules = (
            Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
                 callback='parse', follow=True),
        )

        def parse(self, response):
            article = NewsArticle()
            for story in response.xpath('//article[@id="story"]'):
                article['url'] = response.url
                article['title'] = story.xpath(
                    '//h1[@id="story-heading"]/text()').extract()
                article['author'] = story.xpath(
                    '//span[@class="byline-author"]/@data-byline-name'
                ).extract()
                article['published'] = story.xpath(
                    '//time[@class="dateline"]/@datetime').extract()
                article['content'] = story.xpath(
                    '//div[@id="story-body"]/p//text()').extract()
                yield article
I have found the solution to my problem. I was doing two things wrong:
- I needed to subclass CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.
- When using CrawlSpider, I needed to use a callback function rather than overriding parse. Per the docs, overriding parse breaks CrawlSpider functionality, because CrawlSpider uses parse itself to apply the rules.