
Fetching Lawyer Details From Multiple Links Using Bs4 In Python

I am an absolute beginner at web scraping with Python and know very little about programming in Python. I am just trying to extract the information of the lawyers listed for Tennessee.

Solution 1:

Your first issue is that the class selector you are using already puts you at the level of the a tag.

I use a different selector below and test for URLs which disguise the fact that they belong to the same lawyer. I split those down to the final URL so I can use a set to remove duplicates.

I use a requests.Session for efficiency, re-using the same connection. I append each page's lawyer profiles to a list and flatten that list via a set comprehension to remove any duplicates.

import requests
from bs4 import BeautifulSoup as bs

final = []
with requests.Session() as s:
    res = s.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
    soup = bs(res.content, 'lxml')
    cities = [item['href'] for item in soup.select('#browse_view a')]
    for c in cities:
        r = s.get(c)
        s1 = bs(r.content, 'lxml')
        categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
        for c1 in categories:
            r1 = s.get(c1)
            s2 = bs(r1.content, 'lxml')
            # some hrefs embed the real profile url after a '*'; strip that prefix
            lawyers = [item['href'].split('*')[1] if '*' in item['href'] else item['href']
                       for item in s2.select('.indigo_text .directory_profile')]
            final.append(lawyers)
final_list = {item for sublist in final for item in sublist}
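The normalisation and de-duplication step can be seen in isolation (a minimal sketch with made-up URLs, assuming the `'*'`-prefixed form the site sometimes emits):

```python
# Hypothetical hrefs: the same lawyer can appear with and without a '*' prefix.
hrefs = [
    'https://profiles.superlawyers.com/tennessee/alamo/lawyer/jim-emison/abc.html',
    'https://redirect.example/track*https://profiles.superlawyers.com/tennessee/alamo/lawyer/jim-emison/abc.html',
]

# Normalise each href to the final url, then let a set drop the duplicates.
cleaned = {h.split('*')[1] if '*' in h else h for h in hrefs}
print(cleaned)  # a single profile url
```

Both entries collapse to one profile URL, which is why the full script flattens with a set comprehension rather than a plain list.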

Solution 2:

From another post:

This is occurring because you can't use nth-of-type() on a classed selector; it can only be used on a type selector, like this: table:nth-of-type(4).

Your categories variable is returning an empty list because of that.

The workaround is given in the same post:

categories = [item['href'] for item in s1.select('.three_browse_columns')[1].select('a')]
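To see why indexing into the element list works (a minimal sketch on an inline HTML snippet mirroring the page layout, not the live site):

```python
from bs4 import BeautifulSoup as bs

# Hypothetical markup: two containers sharing the same class, as on the site.
html = """
<div class="three_browse_columns"><a href="/cities">cities</a></div>
<div class="three_browse_columns"><a href="/cat1">cat1</a><a href="/cat2">cat2</a></div>
"""
soup = bs(html, 'html.parser')

# Index into the matched elements instead of using :nth-of-type() with a class.
second_column = soup.select('.three_browse_columns')[1]
categories = [a['href'] for a in second_column.select('a')]
print(categories)  # ['/cat1', '/cat2']
```

Selecting all elements with the class and taking `[1]` picks the second container, which is what `:nth-of-type(2)` was meant to do.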

Solution 3:

I have tried the following:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

res = requests.get('https://attorneys.superlawyers.com/tennessee/', headers = {'User-agent': 'Super Bot 9000'})
soup = bs(res.content, 'lxml')

cities = [item['href'] for item in soup.select('#browse_view a')]
for c in cities:
    r = requests.get(c)
    s1 = bs(r.content, 'lxml')
    categories = [item['href'] for item in s1.select('.three_browse_columns:nth-of-type(2) a')]
    #print(categories)
    for c1 in categories:
        r1 = requests.get(c1)
        s2 = bs(r1.content, 'lxml')
        lawyers = [item['href'] for item in s2.select('#lawyer_0_main a')]
        print(lawyers)

"It is printing not only the profile links but also the about and other associated links, which are not required. I just want the profile links of the lawyers."

"The output is displayed as:"

`['https://profiles.superlawyers.com/tennessee/alamo/lawyer/jim-emison/c99a7c4f-3a42-4953-9260-3750f46ed4bd.html', 'https://www.superlawyers.com/about/selection_process.html']`
`['https://profiles.superlawyers.com/tennessee/alamo/lawyer/jim-emison/c99a7c4f-3a42-4953-9260-3750f46ed4bd.html', 'https://www.superlawyers.com/about/selection_process.html']`
`['https://profiles.superlawyers.com/tennessee/alamo/lawyer/jim-emison/c99a7c4f-3a42-4953-9260-3750f46ed4bd.html', 'https://www.superlawyers.com/about/selection_process.html']`
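One way to keep only the profile links (a sketch, assuming profile URLs always live under profiles.superlawyers.com, as in the output above) is to filter the extracted hrefs:

```python
# Hypothetical hrefs matching the output shown above.
lawyers = [
    'https://profiles.superlawyers.com/tennessee/alamo/lawyer/jim-emison/c99a7c4f-3a42-4953-9260-3750f46ed4bd.html',
    'https://www.superlawyers.com/about/selection_process.html',
]

# Keep only links on the profiles subdomain, dropping the 'about' pages etc.
profiles = [h for h in lawyers if h.startswith('https://profiles.superlawyers.com/')]
print(profiles)  # only the jim-emison profile link remains
```

Alternatively, a more specific selector (as in Solution 1's `.directory_profile`) avoids matching the extra anchors in the first place.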
