
My Script Parses All The Links Again And Again From An Infinite Scrolling Webpage

I've written a script using Python in combination with Selenium to get all the company links from a webpage which doesn't display all of its links until it is scrolled all the way to the bottom. However, my script parses the same links again and again.

Solution 1:

I don't know Python, but I do know what you are doing wrong. Hopefully you'll be able to figure out the code for yourself ;)

Every time you scroll down, 50 links are added to the page until there are 1000 links. Well, almost: it starts with 20 links, then adds 30, and then adds 50 each time until there are 1000.

The way your code is now, you are printing:

The 1st 20 links.

The 1st 20 again + the next 30.

The 1st 50 + the next 50.

And so on...

What you actually want to do is just scroll down the page until you have all the links on the page and then print them. Hope that helps.

Here's the updated Python code (I've checked it and it works):

from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')

while True:
    # Scroll to the bottom so the page loads the next batch of links
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)
    listElements = driver.find_elements_by_xpath("//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a")
    print(len(listElements))
    if len(listElements) == 1000:
        break

# Print the links only once, after all 1000 have loaded
for item in listElements:
    print(item.get_attribute("href"))

driver.close()

If you want it to work a bit faster, you could swap out the time.sleep(5) for Anderson's wait statement (used in Solution 2 below).
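For example, a minimal sketch of that swap might look like the following. This is my hedged rewrite of the loop above, not the original poster's code: it reuses the same XPath and the 1000-link target, and waits up to 10 seconds per batch instead of always sleeping 5 seconds.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')

# link_xpath is just a local name for the same XPath used above
link_xpath = "//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a"

while True:
    previous_count = len(driver.find_elements_by_xpath(link_xpath))
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Resume as soon as the next batch appears, rather than always sleeping 5 seconds
    WebDriverWait(driver, 10).until(
        lambda d: len(d.find_elements_by_xpath(link_xpath)) > previous_count
    )
    if len(driver.find_elements_by_xpath(link_xpath)) == 1000:
        break

for item in driver.find_elements_by_xpath(link_xpath):
    print(item.get_attribute("href"))

driver.close()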

Solution 2:

You can try the code below:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.common.exceptions import TimeoutException

driver = webdriver.Chrome()
driver.get('http://fortune.com/fortune500/list/')

link_xpath = "//li[contains(concat(' ', @class, ' '), ' small-12 ')]//a"

my_links = []
while True:
    try:
        # Count the links currently on the page (not len(my_links), which grows
        # faster because each pass re-collects every link already on the page)
        current_length = len(driver.find_elements_by_xpath(link_xpath))
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # until() passes the driver to the lambda, so it must accept one argument
        wait(driver, 10).until(lambda d: len(d.find_elements_by_xpath(link_xpath)) > current_length)
        my_links.extend([a.get_attribute("href") for a in driver.find_elements_by_xpath(link_xpath)])
    except TimeoutException:
        # No new links appeared within 10 seconds: we've reached the end of the list
        break

my_links = set(my_links)

This should let you keep scrolling down and collecting new links for as long as more can load. Since each pass re-collects the links that are already on the page, the final set() call keeps only the unique values.
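To finish up, a short follow-on sketch, assuming you also want to print the results and close the browser yourself (the snippet above doesn't do either); sorted() is optional and just makes the output deterministic:

# Print each unique link once, then shut the browser down
for link in sorted(my_links):
    print(link)

driver.close()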
