How to build a web scraping package to extract hyperlinks in 10 minutes using Python?

This article describes the process of building a custom Python package that scrapes data from the web using the built-in functions of BeautifulSoup.

Modern web pages typically contain many hyperlinks. A hyperlink connects one web page to another; it usually appears as underlined text that a reader can click to reach the linked page. This article describes a custom web scraping module built to extract the hyperlinks present in a web page.

Table of Contents

  1. Introduction to Web Scraping
  2. Creating a custom python (py) file
  3. Executing the custom python (py) file
  4. Summary

Introduction to Web Scraping

Web scraping is the process of legally collecting data or information from the web in a required format. Python offers extensive support for data collection over the web through powerful and effective modules and libraries.

Python has several web scraping packages; Selenium, urllib, and BeautifulSoup (bs4) are among the popular ones. In this article, a custom Python package is implemented using the built-in functions of BeautifulSoup to extract the hyperlinks present in a single web page.

Any Python package implemented for collecting data over the web has to adhere to legal data collection by requesting the data from the particular web pages.
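One common courtesy check before requesting a page is to read the site's robots.txt file. The sketch below is not part of the original package; it uses the standard library's urllib.robotparser on a hypothetical robots.txt text, and the helper name is_allowed is an illustrative assumption. A robots.txt file states the site's crawling preferences only, so a site's terms of use still need to be checked separately.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so no network access is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "https://example.com/blog/post"))     # path is not disallowed
print(is_allowed(ROBOTS_TXT, "https://example.com/private/data"))  # path is under /private/
```

For a live site, the same parser can load the file itself with set_url() followed by read().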

Creating a custom python (py) file

A custom Python file can easily be created in Google Colab or in Jupyter. Since Colab is a cloud-based working environment, we start with an ipynb file there.

The first few cells of the ipynb file should contain the import statements for the required libraries. In this article, the custom web scraper is built using BeautifulSoup, and the libraries imported for it are shown below.

from bs4 import BeautifulSoup
import requests,re

Once the required libraries are imported, a user-defined function is created that sends a request to the web page and stores the response in a variable. Only the text of the response returned by the website is then accessed from that variable. The function is shown below.

def original_htmldoc(url):
    response = requests.get(url)  # send an HTTP GET request to the URL
    return response.text  # .text is an attribute holding the response body as a string
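The function above assumes that the request always succeeds. A minimal sketch of a more defensive variant is given here; the name fetch_html and the 10-second timeout are illustrative assumptions and not part of the original script.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as text, or None if the request fails.

    The function name and the default timeout are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as error:
        # Covers malformed URLs, connection errors, timeouts and HTTP errors.
        print(f"Could not fetch {url}: {error}")
        return None
    return response.text
```

Returning None lets the calling code decide whether to stop or to ask the user for another link.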

Custom print and input statements can be added as needed. The print statement used in the web scraping Python package is shown below.

print('Enter a url to scrape for links present in it')

A custom input was also declared, which lets the user enter the link of the required web page through the input() function, as shown below.

url_to_scrape = input('Enter a website link to extract links: ')

The link entered by the user is now passed to the user-defined function shown above, which sends the request, and the returned page content is stored in a variable as shown below.

html_doc = original_htmldoc(url_to_scrape)

BeautifulSoup then parses the page content with the html.parser parser so that the hyperlinks in the web page can be identified, as shown below.

soup = BeautifulSoup(html_doc, 'html.parser')  # parse the HTML so the hyperlinks in the page can be searched

The parsed content is then searched with BeautifulSoup's find_all() method for the hyperlinks in the user-specified web page, and each link's address is read from its href attribute with the get() method. The code is shown below.

for link in soup.find_all('a', attrs={'href': re.compile("https://")}):  # find_all returns a list of every <a> tag whose href contains "https://"
    print(link.get('href'))  # get() reads the value of the tag's href attribute
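BeautifulSoup tests a compiled pattern against each attribute value with the pattern's search() method, so the loop above keeps only links whose href contains "https://". The sketch below reproduces that check with the standard library alone, on a hypothetical list of href values, and then shows one possible way to keep relative links by resolving them with urljoin. The base_url value is an assumed page address.

```python
import re
from urllib.parse import urljoin

# Hypothetical href values as they might appear in a page's <a> tags.
hrefs = [
    "https://example.com/about",
    "http://example.com/old-page",
    "/contact",
    "#top",
]

pattern = re.compile("https://")  # the same pattern used in the loop above

# Reproduce the search() check that BeautifulSoup applies to each href.
matched = [href for href in hrefs if pattern.search(href)]
print(matched)  # only the href that contains "https://"

# Resolving every href against the page address keeps relative links as well.
base_url = "https://example.com/index.html"  # assumed address of the scraped page
resolved = [urljoin(base_url, href) for href in hrefs]
print(resolved)
```

Plain http links, relative paths and in-page anchors are skipped by the original pattern, which may or may not be what a given use case needs.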

When the file is run, the user enters a link at the input prompt, and the script prints the hyperlinks found on that page.

The output lists the hyperlinks present in the page at the link entered by the user. This Python (py) file can therefore be used as a module or run as a script in different instances. Using the py file in a different working instance is described below.
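For a file to work both as an importable module and as a script, its top-level statements are usually moved into functions behind a main guard. The sketch below is one possible layout, not the author's file: the names extract_links and main are assumptions, and the link is read from a command-line argument instead of input() so that importing the file never waits for user input.

```python
import re
import sys

import requests
from bs4 import BeautifulSoup


def extract_links(html_doc):
    """Return the href of every <a> tag whose href contains 'https://'."""
    soup = BeautifulSoup(html_doc, "html.parser")
    anchors = soup.find_all("a", attrs={"href": re.compile("https://")})
    return [anchor.get("href") for anchor in anchors]


def main(url):
    """Fetch the page at url and print the links found in it."""
    response = requests.get(url, timeout=10)
    for link in extract_links(response.text):
        print(link)


# This block runs for `python link_extractor_py.py <url>` but is skipped
# when another file does `import link_extractor_py`.
if __name__ == "__main__":
    if len(sys.argv) > 1:
        main(sys.argv[1])
    else:
        print("Usage: python link_extractor_py.py <url>")
```

With this layout, another notebook or script could import the file and call extract_links() on HTML it already holds, without triggering a request.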

Executing the custom python (py) file

As mentioned earlier, the custom Python (py) file can now be executed in a different working instance. For this article, the file was downloaded as a py file and uploaded to a working directory using the Google Cloud Platform, where it then appears among the directory's files.

Once the custom Python file was available, an ipynb file was opened in the same working directory. First, the drive was mounted to the working environment so that the path to the directory containing the py file could be reached, as shown below.

from google.colab import drive
drive.mount('/content/drive')

If the drive is mounted successfully, the cell prints a confirmation that names the mount point.

Next, the following commands are used to move to the directory containing the Python (py) file.

!ln -s /content/drive/MyDrive/ /mydrive
%cd /content/drive/MyDrive/Colab notebooks/Web_Scrapping

If the commands are used correctly, the output confirms that the working directory has changed to the one containing the Python (py) file.

Once the working directory is set correctly, the Python file can be run with the command shown below to obtain the hyperlinks in any web page the user requires.

!python link_extractor_py.py

When this command is run in a notebook cell, it prompts for the web page whose hyperlinks the user wants to check.

The user enters a web page link at the prompt, and the script prints the hyperlinks present in that web page according to the logic in the Python (py) file.

Summary

This article showed how to create a custom Python (py) file using standard web scraping packages and then run it in different working instances or environments. The result gives the user the flexibility to view the hyperlinks present in a single web page and reach any of them with a click.

Darshan M
Darshan holds a Master's degree in Data Science and Machine Learning and follows the latest trends in both fields every day. He is keen to learn new things, put them into practice, and curate rich content on Data Science, Machine Learning, NLP and AI.