Why Does Text Retrieved From Pages Sometimes Look Like Gibberish?

November 30, 2023

I'm using urllib and urllib2 in Python to open and read web pages, but sometimes the text I get is unreadable. For example, if I run this: import urllib text = urllib.urlopen('http

Solution 1:

The gibberish is the actual server response to a request for 'http://tagger.steve.museum/steve/object/141913'. It looks like obfuscated JavaScript that loads the page content when a browser executes it.

urllib only downloads the response and never runs scripts, so to get the content you have to execute that JavaScript yourself, which can be a really difficult task within Python. If you still want to do this, take a look at pywebkitgtk.
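Before reaching for a browser engine, it can help to check whether a response really is mostly script. The following is a rough heuristic sketch, not part of the original answer: the names `looks_script_rendered` and `threshold` are hypothetical, the 0.8 cutoff is an arbitrary assumption, and the code is written for Python 3 (`html.parser`) rather than the Python 2 modules used in the question.

```python
from html.parser import HTMLParser


class _ScriptRatioParser(HTMLParser):
    """Counts characters inside <script> tags versus all other text."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not skew the count.
        length = len(data.strip())
        if self._in_script:
            self.script_chars += length
        else:
            self.text_chars += length


def looks_script_rendered(html, threshold=0.8):
    """Return True if at least `threshold` of the character data is script.

    `threshold` is an assumed cutoff; tune it for the pages you fetch.
    """
    parser = _ScriptRatioParser()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.text_chars
    if total == 0:
        return False
    return parser.script_chars / total >= threshold
```

If this returns True for the text you downloaded, the readable content is probably produced in the browser, and a plain HTTP fetch will not be enough.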

Solution 2:

You can use Selenium to get the content. Download the server and the client drivers, start the server, and then run this:

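from selenium import selenium  # legacy Selenium RC client

# Connect to the Selenium server started above (localhost, port 4444).
s = selenium("localhost", 4444, "*chrome", "http://tagger.steve.museum")
s.start()
s.open("/steve/object/141913")
text = s.get_html_source()  # HTML after the browser has run the page's JavaScript
print(text)
s.stop()  # close the browser session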
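The snippet above uses the legacy Selenium RC client. With the WebDriver API found in current Selenium releases, the same idea might look like the sketch below. The helper names `fetch_rendered_html` and `fetch_with_firefox` are hypothetical, and the code assumes the `selenium` package plus Firefox and its driver are installed. Content that the page loads asynchronously may need an explicit wait before `page_source` is read.

```python
def fetch_rendered_html(driver, url):
    """Load `url` in a WebDriver-style browser and return the resulting HTML."""
    try:
        driver.get(url)
        # page_source reflects the DOM after the browser has run the scripts
        # that executed so far.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()


def fetch_with_firefox(url):
    """Convenience wrapper; assumes Firefox and its driver are available."""
    # Imported lazily so the helper above can be used without selenium installed.
    from selenium import webdriver

    return fetch_rendered_html(webdriver.Firefox(), url)
```

Calling `fetch_with_firefox("http://tagger.steve.museum/steve/object/141913")` would open a browser window, load the page, and return its HTML. Passing the driver in as a parameter keeps the browser choice separate from the fetching logic.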
