The Internet is a vast hub of data. Whether you need text or image data for your company or for personal research, you can scrape it with Selenium. There are plenty of tools and frameworks for web scraping; today we are going to discuss Selenium, which simply automates browsers. That's it!
This means you can use the browser of your choice to run automated scraping tasks for you.
Selenium was originally developed by Jason Huggins in 2004 as an internal tool at ThoughtWorks. It was mostly used for testing at that time, but now it is widely used for browser automation and, of course, web scraping!
It is available as Selenium WebDriver, Selenium IDE, and Selenium Grid.
Selenium WebDriver is a collection of language-specific bindings that drive a browser. It is used to automate browsers for testing and to scale and distribute scripts.
Browsers supported by Selenium: Chrome, Opera, Firefox, Safari, Internet Explorer
Operating systems supported: Linux, Mac, Windows
Selenium IDE (Integrated Development Environment) is a testing tool that can also be used by people who are not familiar with writing test cases for their websites. It is very easy to use: just add the Selenium IDE extension to your browser, and you are good to go with a pre-built GUI for recording your sessions.
Selenium Grid is used to run parallel test sessions across different web browsers. It is based on a hub-node architecture, where one server acts as the hub and other devices act as nodes, each with its own operating system and remote WebDrivers. Because the hub distributes tests across the nodes to run in parallel, a test suite takes less time to complete.
We are going to use Python, along with ChromeDriver (so your script can drive the Chrome browser) and the Selenium package for Python.
- Chrome Driver
- Selenium package (install using pip)
pip install selenium
To check that ChromeDriver and everything else is set up, use the command:
- If ChromeDriver does not run, add the folder you downloaded it to into your PATH environment variable.
- Never name your Python file "selenium.py". Python would import your own file instead of the Selenium package, and the script would fail with an import error.
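One way to run this kind of check from Python itself is sketched below. It uses only the standard library and assumes the driver binary is called `chromedriver`; the helper name `find_driver` is made up for this example.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver found on PATH, or None if it is missing."""
    return shutil.which(name)


path = find_driver()
if path is None:
    print("chromedriver was not found on PATH; add its folder to the PATH variable")
else:
    print(f"chromedriver found at {path}")
```

If the script reports that the driver is missing, the first note above (adding the download folder to PATH) is the likely fix.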
This code will open the Analytics India Magazine homepage in Chrome.
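from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to the ChromeDriver binary you downloaded
DRIVER_PATH = '/path/to/chromedriver'

# Selenium 4 takes the driver path through a Service object;
# older releases accepted webdriver.Chrome(executable_path=DRIVER_PATH)
driver = webdriver.Chrome(service=Service(executable_path=DRIVER_PATH))
driver.get('https://analyticsindiamag.com/')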
If you don't want to give your ChromeDriver location every time you run a program, just add the driver's location to your PATH environment variable.
The same result is achieved by this program too!
from selenium import webdriver

# No path needed when ChromeDriver is on PATH
driver = webdriver.Chrome()
driver.get('https://analyticsindiamag.com/')
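Other driver attributes and methods you can use:
print(driver.title)            # title of the current page
print(driver.window_handles)   # handles of all open windows and tabs
print(driver.page_source)      # HTML source of the current page
print(driver.current_url)      # URL the browser is currently on
driver.refresh()               # reload the current page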
To scrape a specific piece of data, Selenium offers several handy ways to locate elements. You can locate them by:
- Tag name
- Class name
- CSS selectors
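The three locator strategies above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the helper names, the class name `entry-title`, and the selector `article h2 a` are hypothetical and will need adjusting to the markup of the page you scrape. Selenium is imported inside `scrape_headings` only so that the small helper can be tried without a browser.

```python
def heading_texts(driver, by="tag name", value="h2"):
    """Return the non-empty visible text of every element matching the locator."""
    elements = driver.find_elements(by, value)
    return [el.text.strip() for el in elements if el.text.strip()]


def scrape_headings(url):
    """Open `url` in Chrome and collect headings with three locator strategies."""
    # Imported here so heading_texts() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return {
            "by_tag": heading_texts(driver, By.TAG_NAME, "h2"),
            # "entry-title" is a hypothetical class name
            "by_class": heading_texts(driver, By.CLASS_NAME, "entry-title"),
            # "article h2 a" is a hypothetical CSS selector
            "by_css": heading_texts(driver, By.CSS_SELECTOR, "article h2 a"),
        }
    finally:
        driver.quit()  # always close the browser, even if a lookup fails
```

Calling `scrape_headings('https://analyticsindiamag.com/')` would return a dictionary with one list of headings per strategy; which of the three works best depends on how the page is built.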
Usually, to scrape a specific type of data, we need to find the element bound to that data. Let's say we want to locate all the headings (titles). To find the right element, we can use:
- The Inspect tool, opened by right-clicking on the page in the browser
- Or you can print the source code of the website in your terminal and then decide which element to extract.
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://analyticsindiamag.com/')

# Print the page's HTML so you can look for the elements you need
print(driver.page_source)
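Reading raw page source in a terminal can be overwhelming, so a small summary may help when deciding which element to extract. The sketch below uses only Python's standard `html.parser` module; the names `TagCounter` and `summarize_tags` and the sample HTML are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each (tag, class) pair opens in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a missing class becomes ""
        css_class = dict(attrs).get("class") or ""
        self.counts[(tag, css_class)] += 1


def summarize_tags(html, top=10):
    """Return the `top` most frequent (tag, class) pairs with their counts."""
    parser = TagCounter()
    parser.feed(html)
    return parser.counts.most_common(top)


sample = '<h2 class="title">A</h2><h2 class="title">B</h2><p>text</p>'
print(summarize_tags(sample))
```

In a real session you would pass `driver.page_source` instead of `sample`; tag and class pairs that repeat many times are often good candidates for the element you are after.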
Let's see some examples of what Selenium can do:
We can search for bike images and download them if we want to by making a Google search query like this:
So this is one of the many ways we can use Selenium, from scraping to automating web surfing tasks, extracting images, and generating reports.
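A rough sketch of such a script, under several stated assumptions, might look like the following. The `tbm=isch` URL parameter for Google image search, the `.jpg` file extension, the `bikes` folder, and all function names are assumptions made for this example; Google's markup changes often, and many thumbnails are embedded as `data:` URIs, which this sketch simply skips.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Build a Google image-search URL (the tbm=isch parameter is an assumption)."""
    return "https://www.google.com/search?" + urlencode({"q": query, "tbm": "isch"})


def collect_image_urls(driver, limit=10):
    """Return up to `limit` http(s) image sources from the page the driver has open."""
    urls = []
    for img in driver.find_elements("tag name", "img"):
        if len(urls) >= limit:
            break
        src = img.get_attribute("src")
        if src and src.startswith("http"):  # skip inline data: URIs
            urls.append(src)
    return urls


def download_bike_images(folder="bikes"):
    """Search for bike images in Chrome and save the ones found to `folder`."""
    # Imported here so the helpers above can be tried without Selenium installed.
    import os
    from urllib.request import urlretrieve
    from selenium import webdriver

    os.makedirs(folder, exist_ok=True)
    driver = webdriver.Chrome()
    try:
        driver.get(build_search_url("bikes"))
        for i, url in enumerate(collect_image_urls(driver)):
            # The .jpg extension is a guess; the real format may differ
            urlretrieve(url, os.path.join(folder, f"bike_{i}.jpg"))
    finally:
        driver.quit()
```

Calling `download_bike_images()` would open Chrome, run the search, and save whatever image URLs were found on the first page of results.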
Another thing we can achieve is to automate the whole task of downloading reports from a website by filling in all the details of different users.
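As a hedged illustration of that idea, the sketch below fills in a form once per user and submits it. Everything site-specific here is hypothetical: the URL you pass in, the field names used as dictionary keys, and the `button[type='submit']` selector all depend on the actual website.

```python
def fill_form(driver, fields):
    """Type each value into the input whose name attribute matches the key."""
    for name, value in fields.items():
        box = driver.find_element("name", name)  # "name" is the By.NAME strategy
        box.clear()
        box.send_keys(value)


def download_reports(url, users):
    """Submit the form at `url` once for every dictionary of user details."""
    # Imported here so fill_form() can be tried without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        for user in users:
            driver.get(url)
            fill_form(driver, user)
            # Hypothetical selector for the form's submit button
            driver.find_element(By.CSS_SELECTOR, "button[type='submit']").click()
    finally:
        driver.quit()
```

A call could look like `download_reports(report_url, [{"username": "alice", "year": "2023"}])`, where `report_url` and the keys `username` and `year` stand in for the real address and field names of the form.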
You can find more information about this in the Selenium documentation.