How To Scrape A Website Which Redirects For Some Time

February 28, 2024

I am trying to scrape a website that shows a DDoS prevention page for about 5 seconds before displaying its content. The website is Koinex. I am using Python 3 and BeautifulSoup. I think I would ne

Solution 1:

The page uses JavaScript to generate a value which is sent to https://koinex.in/cdn-cgi/l/chk_jschl. In return it gets the cookie cf_clearance, which the site checks in order to skip the DDoS page.

The code can generate this value using different parameters and different methods on every request, so it can be easier to use Selenium to get the data:

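from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Firefox()
driver.get('https://koinex.in/')

# wait for the DDoS page to finish and redirect to the real page
time.sleep(8)

# find_elements_by_tag_name() was removed in Selenium 4
tables = driver.find_elements(By.TAG_NAME, 'table')

for item in tables:
    print(item.text)
    #print(item.get_attribute("value"))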

Result

VOLUME     PRICE/ETH
5.2310     64,300.00
0.0930     64,100.00
10.7670    64,025.01
0.0840     64,000.00
0.3300     63,800.00
0.2800     63,701.00
0.4880     63,700.00
0.7060     63,511.00
0.5020     63,501.00
0.1010     63,500.01
1.4850     63,500.00
1.0000     63,254.00
0.0300     63,253.00

VOLUME     PRICE/ETH
1.0000     64,379.00
0.0940     64,380.00
0.9710     64,398.00
0.0350     64,399.00
0.7170     64,400.00
0.3000     64,479.00
5.1650     64,480.35
0.0020     64,495.00
0.2000     64,496.00
9.5630     64,500.00
0.4000     64,501.01
0.0400     64,550.00
0.5220     64,600.00

DATE                    VOLUME     PRICE/ETH
31/12/2017, 12:19:29    0.2770     64,300.00
31/12/2017, 12:19:11    0.5000     64,300.00
31/12/2017, 12:18:28    0.3440     64,025.01
31/12/2017, 12:18:28    0.0750     64,026.00
31/12/2017, 12:17:50    0.0010     64,300.00
31/12/2017, 12:17:47    0.0150     64,300.00
31/12/2017, 12:15:45    0.6720     64,385.00
31/12/2017, 12:15:45    0.2000     64,300.00
31/12/2017, 12:15:45    0.0620     64,300.00
31/12/2017, 12:15:45    0.0650     64,199.97
31/12/2017, 12:15:45    0.0010     64,190.00
31/12/2017, 12:15:45    0.0030     64,190.00
31/12/2017, 12:15:25    0.0010     64,190.00
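The fixed time.sleep(8) is only a guess at how long the DDoS page takes. A minimal sketch of an alternative, assuming the cf_clearance cookie mentioned earlier appears once the check has passed: poll the driver until that cookie exists. The helper name wait_for_cookie and its parameters are hypothetical; it relies only on Selenium's driver.get_cookie(name), which returns None when the cookie is absent.

```python
import time


def wait_for_cookie(driver, name, timeout=30.0, poll=0.5):
    """Poll driver.get_cookie(name) until it returns a cookie.

    Returns the cookie dict, or raises TimeoutError if the cookie
    has not appeared after roughly `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        cookie = driver.get_cookie(name)
        if cookie is not None:
            return cookie
        if time.monotonic() >= deadline:
            raise TimeoutError(f"cookie {name!r} not set within {timeout} s")
        time.sleep(poll)
```

It could be called as wait_for_cookie(driver, 'cf_clearance') in place of the sleep. Because the tables are drawn by JavaScript, a short extra wait for the tables themselves may still be needed afterwards.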

You can also get the HTML from Selenium and use it with BeautifulSoup:

from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')

But Selenium can get data using XPath, CSS selectors and other methods, so mostly there is no need to use BeautifulSoup.

See the Selenium documentation: 4. Locating Elements
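As a rough illustration of locating elements with Selenium alone, the sketch below collects each table as a list of rows. It assumes the tables use ordinary tr/th/td markup, which the original answer does not show, and the function name table_rows is hypothetical. The locator strings 'tag name' and 'css selector' are the values behind By.TAG_NAME and By.CSS_SELECTOR.

```python
def table_rows(driver):
    """Return every table on the page as a list of rows,
    where each row is a list of stripped cell texts."""
    result = []
    for table in driver.find_elements('tag name', 'table'):
        rows = []
        for tr in table.find_elements('tag name', 'tr'):
            # header and data cells of this row, in document order
            cells = [cell.text.strip()
                     for cell in tr.find_elements('css selector', 'th, td')]
            if cells:  # skip rows without any cells
                rows.append(cells)
        result.append(rows)
    return result
```

Calling table_rows(driver) after the page has loaded would give one nested list per table, which keeps volume and price in separate fields instead of one block of text.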

EDIT: This code uses the cookies from Selenium to load the page with requests, and it has no problem with the DDoS page.

The problem is that the page uses JavaScript to display the tables, so you can't get them using requests + BeautifulSoup. But maybe you will find the URLs used by JavaScript to get the data for the tables, and then requests can be useful.

from selenium import webdriver
import time

# --- Selenium ---

url = 'https://koinex.in/'

driver = webdriver.Firefox()
driver.get(url)

time.sleep(8)

#tables = driver.find_elements('tag name', 'table')
#for item in tables:
#    print(item.text)

# --- convert cookies/headers from Selenium to Requests ---

cookies = driver.get_cookies()

for item in cookies:
    print('name:', item['name'])
    print('value:', item['value'])
    print('path:', item['path'])
    print('domain:', item['domain'])
    print('expiry:', item.get('expiry'))  # session cookies have no 'expiry' key
    print('secure:', item['secure'])
    print('httpOnly:', item['httpOnly'])
    print('----')

# convert list of dictionaries into dictionary
cookies = {c['name']: c['value'] for c in cookies}

# it has to be full `User-Agent` used in Browser/Selenium (it can't be short 'Mozilla/5.0')
headers = {'User-Agent': driver.execute_script('return navigator.userAgent')}

# --- requests + BeautifulSoup ---

import requests
from bs4 import BeautifulSoup

s = requests.Session()
s.headers.update(headers)
s.cookies.update(cookies)

r = s.get(url)

print(r.text)

soup = BeautifulSoup(r.text, 'html.parser')
tables = soup.find_all('table')

print('tables:', len(tables))

for item in tables:
    print(item.get_text())
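
If the browser's network tab shows a URL that the page's JavaScript calls to get the table data, the session built above could request it directly. A minimal sketch, assuming such an endpoint exists and returns JSON; the function name fetch_json is hypothetical and no real endpoint of the site is known here.

```python
def fetch_json(session, url, timeout=10):
    """GET `url` with an existing requests.Session and decode the body as JSON."""
    response = session.get(url, timeout=timeout)
    # raise requests.HTTPError on a 4xx/5xx status instead of parsing an error page
    response.raise_for_status()
    return response.json()
```

With the session `s` from the code above it would be used as data = fetch_json(s, data_url), where data_url is whatever address the network tab reveals.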
