
Scrapy Crawl Spider Does Not Download Files?

So I made a crawl spider which crawls this website (https://minerals.usgs.gov/science/mineral-deposit-database/#products), follows every link on that web page, and scrapes the file links it finds, but the files are never downloaded.

Solution 1:

Take a close look at the Files Pipeline documentation:

In a Spider, you scrape an item and put the URLs of the desired files into a file_urls field.

You need to store the URLs of the files to download in a field named file_urls, not file.
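To illustrate the difference, here is a minimal sketch of the two item shapes (the URL is a hypothetical placeholder). FilesPipeline only inspects the file_urls key; any other key is passed through untouched, so nothing gets downloaded:

```python
# Ignored by FilesPipeline: 'file' is not a key the pipeline looks at.
item_wrong = {'file': ['https://example.com/report.pdf']}

# Picked up by FilesPipeline: every URL in 'file_urls' is downloaded,
# and the results are written back into a 'files' field on the item.
item_right = {'file_urls': ['https://example.com/report.pdf']}
```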

This minimal spider works for me:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):

    name = 'usgs.gov'
    allowed_domains = ['doi.org']
    start_urls = ['https://minerals.usgs.gov/science/mineral-deposit-database/#products']

    custom_settings = {
        # Enable the built-in files pipeline and tell it where to store downloads.
        'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
        'FILES_STORE': '/my/valid/path/',
    }

    rules = (
        # Follow only the product links and hand each response to parse_x.
        Rule(LinkExtractor(restrict_xpaths='//div[@id="products"]/p/a'), callback='parse_x'),
    )

    def parse_x(self, response):
        # FilesPipeline reads the 'file_urls' field and downloads every URL in it.
        yield {
            'file_urls': [response.urljoin(u) for u in response.xpath('//span[starts-with(@data-url, "/catalog/file/get/")]/@data-url').extract()],
        }