Guide To Web Scraping With Python Libraries Selenium & Beautiful Soup


Web scraping is a method for transforming unstructured data on the web into machine-readable, structured data for analysis. In general, web scraping is a complex process, but the Python programming language has made it easy and effective. Python libraries such as Selenium, Beautiful Soup and Pandas are used for web scraping.



Mining Data

Practising any new data-related technology requires well-designed data sets. Many users believe they have to collect their own data, but that is simply not true.

There are hundreds of open data sets accessible, ready to be used and analysed by anyone willing to look for them. Below is a list of some of the most interesting open data websites.

1. US Census Bureau http://www.census.gov/data.html

2. Socrata

3. European Union Open Data Portal http://open-data.europa.eu/en/data/

4. Data.gov.uk http://data.gov.uk/ Data from the UK Government.

5. UNICEF Statistics on women and children worldwide.

Many of these open datasets come from government and public organisations, which bury the data in drill-down links and tables. This often forces users into best-guess navigation to find the specific data they are looking for. Scraping the data with Python and saving it as JSON is what users need to do to get started.

JavaScript Links Raise Complexity

Most of these open dataset websites use JavaScript links, which makes them tough to scrape. Methods that simply download the static HTML with Python libraries will not work without some extensions.
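As a point of comparison, here is a minimal sketch (the URL is a hypothetical placeholder) of what a plain static fetch looks like: only the HTML sent by the server comes back, so any links that JavaScript builds after the page loads never appear in the parsed soup.

import requests
from bs4 import BeautifulSoup

#hypothetical JavaScript-heavy listing page (placeholder URL)
url = "http://example.com/division/sub_division.html"

#fetch only the static HTML; nothing rendered by JavaScript is included
html = requests.get(url).text
soup = BeautifulSoup(html, "html.parser")

#links injected by JavaScript after page load are absent from this static source
links = soup.find_all("a")
print(len(links))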

Here’s Where Selenium Comes In

The Selenium package is used to automate web browser interaction from Python. With Selenium, programming a Python script to automate a web browser is straightforward. Afterwards, those complex JavaScript links are no longer a problem.

Code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

Selenium will now begin a browser session. For Selenium to work, it must access the browser driver. By default, it looks in the same directory as the Python script. Drivers for Chrome, Firefox, Edge, and Safari are available from their respective projects.
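If the driver does not sit next to the script, its location can be passed explicitly when the session is created. Below is a minimal sketch using the older Selenium 3-style constructor that the rest of this article follows; the chromedriver path is a made-up example.

from selenium import webdriver

#hypothetical location of the chromedriver binary (adjust to your setup)
driver_path = "/path/to/chromedriver"

#Selenium 3-style call; newer Selenium versions pass a Service object instead
driver = webdriver.Chrome(executable_path=driver_path)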

The sample code below uses Chrome.

#launch url
url = "http:// website name/division/sub_division.format"

#create a new Chrome session
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(url)

python_button = driver.find_element_by_id('@@@@@@@@@@@@@@@@') #$$$$$
python_button.click() #click $$$$$ link

The python_button.click() call in the code tells Selenium to click the JavaScript link on the page. After arriving at the specified page, Selenium hands the page source over to Beautiful Soup.

Handing It Over To Beautiful Soup

Beautiful Soup is the best way to traverse the DOM (Document Object Model) and scrape the data. After defining an empty list and a counter variable, it is time to ask Beautiful Soup to grab all the links on the page that match a regular expression.

Code:

#Selenium hands the page source to Beautiful Soup
soup_level1 = BeautifulSoup(driver.page_source, '####')

datalist = [] #empty list
x = 0 #counter

#Beautiful Soup grabs all the specified links
for link in soup_level1.find_all('a', id=re.compile("^##file_location##")):

    #Selenium visits each specified page
    python_button = driver.find_element_by_##variable('##path' + str(x))
    python_button.click() #click link

    #Selenium hands off the source of the specific page to Beautiful Soup
    soup_level2 = BeautifulSoup(driver.page_source, '###')

    #Beautiful Soup grabs the HTML table on the page
    table = soup_level2.find_all('table')[0]

    #Giving the HTML table to pandas to put in a dataframe object
    df = pd.read_html(str(table), header=0)

    #Store the dataframe in a list
    datalist.append(df[0])

    #Ask Selenium to click the back button
    driver.execute_script("window.history.go(-1)")

    #increment the counter variable before starting the loop over
    x += 1

Passing On To Pandas

Beautiful Soup passes the findings on to Pandas. Pandas uses its read_html function to read the HTML table data into a dataframe. The dataframe is appended to the previously defined empty list. Before the code block of the loop ends, Selenium needs to click the back button in the browser, so that the next link in the loop is available to click on the specified listing page.
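One detail worth noting is that read_html returns a list of dataframes, one per table found in the HTML, which is why the code appends df[0] rather than df. A small self-contained sketch, using a made-up two-row table:

import pandas as pd

#a tiny made-up HTML table standing in for the table scraped from the page
html = """
<table>
  <tr><th>name</th><th>value</th></tr>
  <tr><td>alpha</td><td>1</td></tr>
  <tr><td>beta</td><td>2</td></tr>
</table>
"""

#read_html returns a list of dataframes, one per <table> in the HTML
frames = pd.read_html(html, header=0)
print(len(frames)) #1
print(frames[0]) #the parsed two-row table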

When the for/in loop has completed, Selenium will have visited every specified title link. Beautiful Soup will have retrieved the table from each page, and Pandas will have stored the data from each table in a dataframe. Each dataframe is an item in the datalist. The individual table dataframes are then merged into one large dataframe, and the data is converted to JSON format.

 

Code:

#loop has completed
#end the Selenium browser session
driver.quit()

#combine all pandas dataframes in the list into one giant dataframe
result = pd.concat([pd.DataFrame(datalist[i]) for i in range(len(datalist))], ignore_index=True)

#convert the pandas dataframe to JSON
json_records = result.to_json(orient='records')

#get current working directory
path = os.getcwd()

#open, write, and close the specified file
f = open(path + "\##specified_path##", "w")
f.write(json_records)
f.close()
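As a quick sanity check (not part of the original walkthrough), the saved records can be read back into a dataframe; the file name below is a hypothetical stand-in for the specified path.

import pandas as pd

#read the JSON records back to confirm the export worked
check = pd.read_json("scraped_records.json", orient="records")

#the row count should match the number of table rows collected across all pages
print(check.shape)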

A Quick Way

The automated web scraping method described above completes quickly. Selenium opens a browser window that users can watch running, which lets developers show users how fast the process is: the script follows a link, fetches the data, goes back, and clicks the next link. It dramatically shortens the process of retrieving data from hundreds of links.


