python - Scrapy only outputting an open bracket
I'm trying to scrape the title and URL of Khan Academy pages under the math/science/economics sections. However, it is only outputting an open bracket, and before that it only scraped the start URL.
    from openbar_index.items import OpenBarIndexItem
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class OpenBarSpider(CrawlSpider):
        """
        Scrapes website URLs from educational websites and commits
        URLs/webpage names/text to a document
        """
        name = 'openbar'
        allowed_domains = 'khanacademy.org'
        start_urls = [
            "https://www.khanacademy.org"
        ]

        rules = [
            Rule(SgmlLinkExtractor(allow=['/math/']), callback='parse_item', follow=True),
            Rule(SgmlLinkExtractor(allow=['/science/']), callback='parse_item', follow=True),
            Rule(SgmlLinkExtractor(allow=['/economics-finance-domain/']), callback='parse_item', follow=True)
        ]

        def parse_item(self, response):
            item = OpenBarIndexItem()
            url = response.url
            item['url'] = url
            item['title'] = response.xpath('/html/head/title/text()').extract()
            yield item
Does anyone have an idea why this is happening, or any tips on how to fix it?
The problem is the assignment of allowed_domains. According to the documentation, it must be a list, not a string. With a string, Scrapy potentially filters every result as an offsite request, because it finds no valid domain to match against.
So adding square brackets, as in the following line, should fix it:
allowed_domains = ['khanacademy.org']
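To see why a bare string can cause every request to be dropped, here is a minimal sketch that does not use Scrapy at all. It assumes the offsite filter joins each element of allowed_domains into one host pattern, which is a simplification and not Scrapy's actual code; the helper name build_host_pattern is hypothetical. Iterating over a string yields single characters, so the pattern ends up matching only one-character hosts.

```python
import re


def build_host_pattern(allowed_domains):
    # Hypothetical helper: approximates how an offsite filter might turn
    # allowed_domains into a single host-matching regex (an assumption,
    # not Scrapy's real implementation).
    alternatives = '|'.join(re.escape(d) for d in allowed_domains)
    return re.compile(r'^(.*\.)?(%s)$' % alternatives)


host = 'www.khanacademy.org'

# A string is iterated character by character: 'k', 'h', 'a', ...
as_string = build_host_pattern('khanacademy.org')

# A list is iterated domain by domain, which is what is intended.
as_list = build_host_pattern(['khanacademy.org'])

print(bool(as_string.search(host)))
print(bool(as_list.search(host)))
```

Under this assumption, the string version rejects the host while the list version accepts it, which is consistent with the crawl stopping after the start URL.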