Web scraping is a technique for extracting data from the internet and storing it locally on your system. It collects data from different websites over the Hypertext Transfer Protocol. Web scraping is used by a large number of companies that work on data harvesting, and it is also used to build search engine bots.
Autoscraper is a smart, automatic, fast, and lightweight web scraper for Python that makes web scraping an easy task. It takes a URL or the HTML content of a web page together with a list of sample data that we want to scrape from that page. It is easy because we only need to write a few lines of code, it is blazingly fast because it is lightweight, and it learns the scraping rules and returns similar elements.
In this article, we will explore Autoscraper and see how we can use it to scrape data from the web.
Implementation:
Autoscraper can be installed from the Git repository where it is hosted. Before installing autoscraper, you need to download and install Git for your operating system. Once Git is installed, we can install autoscraper by running the command given below in the command prompt. (The package is also published on PyPI, so pip install autoscraper works as well.)
pip install git+https://github.com/alirezamika/autoscraper.git
- Importing Required Libraries
We only need to import autoscraper, as it alone is sufficient for web scraping.
from autoscraper import AutoScraper
- Defining Web Scraping function
Let us start by defining the URL that will be used to fetch the data, together with the sample data to be fetched. Here I will fetch the titles of different articles on NLP published in Analytics India Magazine.
url = 'https://analyticsindiamag.com/?s=nlp'
category = ["8 Open-Source Tools To Start Your NLP Journey"]
- Initiate AutoScraper
The next step is to create an AutoScraper instance so that we can use it to build the scraper model and perform the web scraping operation.
scraper = AutoScraper()
- Building the Model
This is the final step, where we build the model and display the result of the web scraping.
final = scraper.build(url, category)
print(final)
Here we saw that it returns the titles of articles related to NLP. Similarly, we can retrieve the URLs of the articles by simply passing a sample URL in the category list we defined above.
category = ["https://analyticsindiamag.com/8-open-source-tools-to-start-your-nlp-journey/"]
final = scraper.build(url, category)
print(final)
- Function for Similar Results
Autoscraper allows you to reuse the model you built to fetch similar data from a different URL. For this we use the 'get_result_similar' function. In this step, we will retrieve the URLs of different articles on image processing.
scraper.get_result_similar('https://analyticsindiamag.com/?s=image%20processing')
- Function for Exact Result
Sometimes, instead of similar results, we want the exact result for a query. Autoscraper provides this through the 'get_result_exact' function: if our sample URL/data came from the first link on the page, the exact result will fetch exactly that first link from the mentioned URL.
scraper.get_result_exact('https://analyticsindiamag.com/?s=widgets')
- Saving the Model
Autoscraper allows us to save the model created and load it whenever required.
scraper.save('AIM') #saving the model
scraper.load('AIM') #loading the model
Besides all these functionalities, autoscraper also allows you to define proxy IP addresses so that you can use them to fetch data. We just need to define the proxies and pass them as an argument to the build function, as in the example given below.
proxy = {
"http": 'http://127.0.0.1:8003',
"https": 'https://127.0.0.1:8071',
}
final = scraper.build(url, category, request_args=dict(proxies=proxy))
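Since `request_args` is forwarded to the underlying `requests` call, the same mechanism can carry other request options besides proxies. The values below are placeholders, not working endpoints:

```python
# request_args is passed through to requests, so headers and a
# timeout can be set alongside proxies (placeholder values shown)
request_args = dict(
    proxies={
        "http": "http://127.0.0.1:8003",
        "https": "https://127.0.0.1:8071",
    },
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=10,
)
# final = scraper.build(url, category, request_args=request_args)
print(sorted(request_args))  # → ['headers', 'proxies', 'timeout']
```

Setting a custom User-Agent header in this way is a common courtesy (and sometimes a necessity) when scraping sites that block default client signatures.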
Conclusion:
In this article, we saw how we can use Autoscraper for web scraping by creating a simple and easy-to-use model. We saw the different formats in which data can be retrieved using Autoscraper. We can also save and load the model for later use, which saves time and effort. Autoscraper is powerful, easy to use, and time-saving.