Screen Scraping In Python
Solution 1:
I'd recommend read up on xpath & try this scrapy tutorial. http://doc.scrapy.org/intro/tutorial.html . It is fairly easy to write a spider like this
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
class DmozSpider(BaseSpider):
name = "dmoz.org"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ul/li')
for site in sites:
title = site.select('a/text()').extract()
link = site.select('a/@href').extract()
desc = site.select('text()').extract()
print title, link, desc
Solution 2:
To ease the common tasks associated with screen scraping, a python framework "Scrapy" exists. It will make html, xml parsing painless.
Solution 3:
What you might be experiencing is that you are having trouble parsing content that is dynamically generated with javascript. I wrote a small tutorial on this subject, this might help:
http://koaning.github.io/html/scapingdynamicwebsites.html
Basically what you do is you have the selenium library pretend that it is a firefox browser, the browser will wait until all javascript has loaded before it continues passing you the html string. Once you have this string, you can then parse it with beautifulsoup.
Post a Comment for "Screen Scraping In Python"