Why Does Text Retrieved From Pages Sometimes Look Like Gibberish?
I'm using urllib and urllib2 in Python to open and read webpages, but sometimes the text I get back is unreadable. For example, if I run this:
import urllib
text = urllib.urlopen('http
Solution 1:
This gibberish is the real server response to a request for 'http://tagger.steve.museum/steve/object/141913'. It looks like obfuscated JavaScript which, when executed by a browser, loads the page content.
To get this content, you need to execute the JavaScript, which can be a really difficult task from within Python. If you still want to do it, take a look at pywebkitgtk.
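Before reaching for a browser engine, it can help to confirm that a response really is of this kind, that is, almost all script and almost no ordinary text. The following is only a rough heuristic sketch, not a reliable detector: the function name `looks_script_rendered` and the 0.8 threshold are made up for illustration, and it only counts characters inside and outside `<script>` tags using the standard library's HTML parser.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used with urllib2


class _TextVsScript(HTMLParser):
    """Count characters inside <script> tags separately from other text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.in_script = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        n = len(data.strip())
        if self.in_script:
            self.script_chars += n
        else:
            self.text_chars += n


def looks_script_rendered(html, threshold=0.8):
    """Return True if at least `threshold` of the text is inside <script> tags.

    Hypothetical helper: the threshold is an arbitrary assumption.
    """
    parser = _TextVsScript()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.text_chars
    if total == 0:
        return False  # nothing to judge
    return parser.script_chars / float(total) >= threshold
```

You would pass it the string you read from the response, for example `looks_script_rendered(text)`. A True result suggests the page builds its content with JavaScript, so a plain urllib fetch will not give you the readable text.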
Solution 2:
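You can use Selenium to get the content, since it drives a real browser that executes the page's JavaScript. Download the Selenium server and the Python client driver, start the server, and then run this (the code uses the older Selenium RC client API):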
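from selenium import selenium  # Selenium RC client (Python 2)

# Connect to the Selenium server running on localhost:4444.
s = selenium("localhost", 4444, "*chrome", "http://tagger.steve.museum")
s.start()
# Load the page in a real browser so its JavaScript runs.
s.open("/steve/object/141913")
# Grab the HTML after the scripts have built the page.
text = s.get_html_source()
print text
s.stop()  # end the browser session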