
Beautiful Soup Not Pulling All The Html Of A Webpage

I'm practicing with BeautifulSoup and trying to pull the image addresses of football player photos from this website: https://www.transfermarkt.com/jordon-ibe/profil/spiel

Solution 1:

The site appears to check whether the request carries a valid User-Agent header, and it rejects requests that don't.

So you need to send the header explicitly, like this:

import urllib3
import certifi

url = 'https://www.transfermarkt.com/jordon-ibe/profil/spieler/195652'
# Verify TLS certificates against certifi's CA bundle.
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
# Send an explicit User-Agent header with the request.
response = http.request('GET', url, headers={'User-Agent': 'Mozilla/5.0'})
print(response.status)

This prints 200. If you leave out the headers argument, the same request returns 404.

Any non-empty User-Agent value (after trimming whitespace) seems to work.
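
Once the page comes back with a 200, the image addresses can be pulled out of the HTML. Here is a minimal sketch of that step, assuming the images you want are plain <img> tags with a src attribute (I haven't checked how the site actually marks up the player photo). It uses only the standard library so it runs without extra packages; the names build_request, ImageSrcCollector, extract_image_sources and fetch_image_sources are made up for this example.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a GET request that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_sources(html):
    """Return the src values of all <img> tags in an HTML string."""
    collector = ImageSrcCollector()
    collector.feed(html)
    return collector.sources


def fetch_image_sources(url):
    """Download a page with a User-Agent header and list its image URLs.

    Assumption: the page is UTF-8; undecodable bytes are replaced.
    urlopen raises urllib.error.HTTPError on a 4xx/5xx response.
    """
    with urlopen(build_request(url), timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_image_sources(html)
```

Calling fetch_image_sources(url) with the url from the snippet above should then give you a list of image addresses to filter. If you'd rather stay with BeautifulSoup, the parsing step would be roughly [img.get('src') for img in BeautifulSoup(html, 'html.parser').find_all('img')]; the important part is still that the request sends a User-Agent.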
