Scrape Beautifully With Beautiful Soup In Python

Web Scraping is the process of collecting data from the internet by using various tools and frameworks. Sometimes, It is used for online price change monitoring, price comparison, and seeing how well the competitors are doing by extracting data from their websites.

Web Scraping is as old as the internet is, In 1989 World wide web was launched and after four years World Wide Web Wanderer: The first web robot was created at MIT by Matthew Gray, the purpose of this crawler is to measure the size of the worldwide web.

Beautiful Soup is a Python library that is used for web scraping purposes to pull the data out of HTML and XML files. It creates a parse tree from page source code that can be used to extract data in a hierarchical and more readable manner.

THE BELAMY

Sign up for your weekly dose of what's up in emerging technology.

It was first introduced by Leonard Richardson, who is still contributing to this project and this project is additionally supported by Tidelift (a paid subscription tool for open-source maintenance)

Beautiful soup3 was officially released in May 2006, Latest version released by Beautiful Soup is 4.9.2, and it supports Python 3 and Python 2.4 as well.


Download our Mobile App



Advantage

  • Very fast
  • Extremely lenient
  • Parses pages the same way a Browser does
  • Prettify the Source Code

Installation

How to install BeautifulSoup

For installing Beautiful Soup we need Python made framework for the same, and also some other supported or additional frameworks can be installed by given PIP command below:
pip install beautifulsoup4

Other frameworks we need in the future to work with different parser and frameworks:

pip install selenium
pip install requests
pip install lxml
pip install html5lib

Quickstart

A small code to see how BeautifulSoup is faster than any other tools, we are extracting the source code from demoblaze 

from bs4 import BeautifulSoupimport requests  URL = "https://www.demoblaze.com/"r = requests.get(URL)  

soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

Now “.prettify()” is a built-in function provided by the Beautiful Soup module, it gives the visual representation of the parsed URL Source code. i.e. it arranges all the tags in a parse-tree manner with better readability

prettify function

How to locate the data from the source code?

For Excluding unwanted data and scrap reliable information only, we have to inspect the webpage.

 We can open the Inspect tab by doing any of the following in your Web browser:

  • Right Click on Webpage and Select Inspect
  • Or in Chrome, Go to the upper right side of your chrome browser screen and Click on the Menu bar -> More tools -> Developer tools.
  • Ctrl + Shift + i

Now after opening the inspect tab, you can search the element you wish to extract from the webpage.

By just hovering through the webpage, we can select the elements; and corresponding code will be available like shown in the above image.

The title for all the articles is inside Class=”post-article”, and inside that, we have our article title in-between “span” tags.

With this method, we can look into web pages’ backend and explore all the data with just hover and watch functionality provided by Chrome browser Inspect tools.

Let’s Extract Some data !

In this example, we are going to use Selenium for browser automation & source code extraction purposes.

A full tutorial about selenium is available here.

Our purpose is to scrape all the Titles of articles from the Analytics India Magazine homepage.

#importing modules
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')

driver = webdriver.Chrome(chrome_options=options)
source =driver.get('https://analyticsindiamag.com/')
source_code=driver.page_source

soup = BeautifulSoup(source_code,'lxml')
article_block =soup.find_all('div',class_='post-title')

for titles in article_block:
	title =titles.find('span').get_text()
	print(title)

Let’s break down the above code line by line to understand how it can detect those article titles:

  •  First, two lines were to import BeautifulSoup and Selenium.
from selenium import webdriver
from bs4 import BeautifulSoup
  • Then we started the chrome Browser in Incognito, and headless mode means no chrome popup and surfing web URLs; instead, it will boot up the URL in the background.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
  • Then with the help of Selenium driver, we loaded the given URL source code into “source_code”  variable.
source_code=driver.page_source

Note: We can extract given URL source code in many ways, but as we already know about selenium, So it’s easy to move forward with the same tool, and it has other functionalities too like scrolling through the hyperlinks and clicking elements.

  • Passing “source_code” variable into ‘BeautifulSoup’ with specifying the ”lxml” parser we are going to use  for data processing,
  • Now we are using the Beautiful soup function “Find” to find the ‘div’ tag having class ‘post-title’ as discussed above because article titles are inside this div container. 
soup = BeautifulSoup(source_code,'lxml')
article_block =soup.find_all('div',class_='post-title')
  • Now with a simple for loop, we are going to iterate through each article element and again with the help of “Find” we extract all the “span” tags containing title text.
  • “get_text()” is used to trim the pre/post span tags we are getting with each iteration of finding titles. 
for titles in article_block:
	title =titles.find('span').get_text()
	print(title)

After this, you can feed the data for data science work you can use this data to create a world, or maybe you can do text-analysis.

Conclusion

Beautiful Soup is a great tool for extracting very specific information from large unstructured raw Data, and also it is very fast and handy to use.

Its documentation is also very helpful if you want to continue your research.

You learned how to:

  • Install and setup the scraping environment
  • Inspect the website to get elements name
  • Parse the source code in Beautiful Soup to get trimmed results
  • Live example of getting all the published article names from a website.

More Great AIM Stories

Mohit Maithani
Mohit is a Data & Technology Enthusiast with good exposure to solving real-world problems in various avenues of IT and Deep learning domain. He believes in solving human's daily problems with the help of technology.

AIM Upcoming Events

Early Bird Passes expire on 3rd Feb

Conference, in-person (Bangalore)
Rising 2023 | Women in Tech Conference
16-17th Mar, 2023

Conference, in-person (Bangalore)
Data Engineering Summit (DES) 2023
27-28th Apr, 2023

3 Ways to Join our Community

Telegram group

Discover special offers, top stories, upcoming events, and more.

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Subscribe to our Daily newsletter

Get our daily awesome stories & videos in your inbox
AIM TOP STORIES

Do machines feel pain?

Scientists worldwide have been finding ways to bring a sense of awareness to robots, including feeling pain, reacting to it, and withstanding harsh operating conditions.

IT professionals and DevOps say no to low-code

The obsession with low-code is led by its drag-and-drop interface, which saves a lot of time. In low-code, every single process is shown visually with the help of a graphical interface that makes everything easier to understand.

Neuralink elon musk

What could go wrong with Neuralink?

While the broad aim of developing such a BCI is to allow humans to be competitive with AI, Musk wants Neuralink to solve immediate problems like the treatment of Parkinson’s disease and brain ailments.

Understanding cybersecurity from machine learning POV 

Today, companies depend more on digitalisation and Internet-of-Things (IoT) after various security issues like unauthorised access, malware attack, zero-day attack, data breach, denial of service (DoS), social engineering or phishing surfaced at a significant rate.