
Why Does Text Retrieved From Pages Sometimes Look Like Gibberish?

I'm using urllib and urllib2 in Python to open and read webpages, but sometimes the text I get back is unreadable. For example, if I run this: import urllib text = urllib.urlopen('http

Solution 1:

The gibberish is the actual server response to a request for 'http://tagger.steve.museum/steve/object/141913'. It looks like obfuscated JavaScript that loads the page content when a browser executes it.

To get that content, you have to execute the JavaScript, which is difficult to do from within Python. If you still want to try, take a look at pywebkitgtk.
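Before reaching for a browser engine, it can help to confirm that a response really is script-driven rather than badly decoded text. The following is a minimal sketch of one possible heuristic, not part of the original answer: it compares how many characters sit inside script tags with how much visible text sits outside them. It is written for Python 3 (where urllib and urllib2 were merged into urllib.request); the names looks_script_driven and ratio are hypothetical, the threshold is arbitrary, and it assumes the JavaScript arrives wrapped in an HTML script element.

```python
from html.parser import HTMLParser


class _ScriptVsText(HTMLParser):
    """Counts characters inside <script> tags versus visible text elsewhere."""

    def __init__(self):
        super().__init__()
        self._current = None      # "script", "style", or None
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        size = len(data.strip())
        if self._current == "script":
            self.script_chars += size
        elif self._current is None:
            # Text outside <script> and <style>; stylesheet content is ignored.
            self.text_chars += size


def looks_script_driven(body, ratio=5.0):
    """Return True if script content outweighs visible text by `ratio`.

    Hypothetical helper: a rough heuristic, not a reliable detector.
    """
    parser = _ScriptVsText()
    parser.feed(body)
    parser.close()
    if parser.script_chars == 0:
        return False
    return parser.script_chars >= ratio * max(parser.text_chars, 1)
```

In Python 3 the body could be obtained with urllib.request.urlopen(url).read().decode('utf-8', 'replace') and passed to looks_script_driven. A True result suggests the readable content only appears after the script runs in a browser.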

Solution 2:

You can use Selenium to get the content, because it drives a real browser that executes the JavaScript. Download the Selenium server and the Python client driver, start the server, and then run this:

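# Selenium RC client; requires the Selenium server running on localhost:4444.
from selenium import selenium

s = selenium("localhost", 4444, "*chrome", "http://tagger.steve.museum")
s.start()

# Open the page in the browser so that its JavaScript is executed.
s.open("/steve/object/141913")

# Read the HTML as it stands after the script has run.
text = s.get_html_source()
print(text)

# Close the browser session.
s.stop()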
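The snippet above uses the older Selenium RC client. With the later WebDriver API, the same idea can be sketched as below. This is an illustration under assumptions rather than the answerer's method: it presumes the selenium package, Firefox, and its driver are installed, and fetch_rendered_html is a hypothetical helper name.

```python
def fetch_rendered_html(url):
    """Load `url` in a real browser and return the HTML after scripts ran.

    Hypothetical helper; assumes Firefox and its WebDriver are installed.
    """
    # Imported here so the module still loads where Selenium is missing.
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)              # the browser executes the page's JavaScript
        return driver.page_source    # DOM serialized at the time of the call
    finally:
        driver.quit()                # always close the browser window
```

Calling fetch_rendered_html('http://tagger.steve.museum/steve/object/141913') would return the page source once the initial load has finished. Content that the script fetches later may need an explicit wait before page_source is read.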
