Scrapy Crawl Spider Does Not Download Files?
So I made a crawl spider which crawls this website (https://minerals.usgs.gov/science/mineral-deposit-database/#products), follows every link on that web page, and scrapes the file URLs it finds on the linked pages, but the files never get downloaded.
Solution 1:
Take a close look at the Files Pipeline documentation:
In a Spider, you scrape an item and put the URLs of the desired files into a file_urls field.
You need to store the URLs of the files to download in a field named file_urls, not file.
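To make the difference concrete, here is a minimal before/after of the yielded item (the url variable is a placeholder): the pipeline silently ignores the first form and only downloads the second.

    # Ignored: FilesPipeline never looks at a field called 'file'
    yield {'file': response.urljoin(url)}

    # Downloaded: FilesPipeline fetches every URL listed under 'file_urls'
    yield {'file_urls': [response.urljoin(url)]}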
This minimal spider works for me:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = 'usgs.gov'
    allowed_domains = ['doi.org']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products']

    custom_settings = {
        # Enable the built-in Files Pipeline and tell it where to save files
        'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
        'FILES_STORE': '/my/valid/path/',  # replace with a writable directory
    }

    rules = (
        # Follow only the product links on the start page
        Rule(LinkExtractor(restrict_xpaths='//div[@id="products"]/p/a'), callback='parse_x'),
    )

    def parse_x(self, response):
        yield {
            # The Files Pipeline downloads every URL listed under 'file_urls'
            'file_urls': [
                response.urljoin(u)
                for u in response.xpath('//span[starts-with(@data-url, "/catalog/file/get/")]/@data-url').extract()
            ],
        }
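For what it's worth, when a download succeeds the pipeline also adds a files field to the scraped item with the results, so you can verify it ran. Roughly what you should see in the output (the url, path and checksum values here are illustrative, not real):

    {
        'file_urls': ['https://www.sciencebase.gov/catalog/file/get/...'],
        'files': [{
            'url': 'https://www.sciencebase.gov/catalog/file/get/...',
            'path': 'full/0a1b2c3d4e5f.zip',  # relative to FILES_STORE
            'checksum': 'a716f5c9',           # hash of the downloaded body
        }],
    }

If files never shows up in your items, double-check that FILES_STORE points to a writable directory and that the Files Pipeline is actually enabled in the settings that apply to this spider.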