How To Create and Label an Image Dataset Using Web Scraping

In this article, I’ll discuss how to create an image dataset and label it using Python. To create an image dataset, we first acquire images by web scraping (or, more precisely, image scraping) and then label them with the LabelImg annotation tool to generate annotations.

While working on a data science project, the first step is acquiring the data. Usually we browse websites that offer structured datasets we can download and use right away. Even when such a dataset exists, it might not contain enough data, and ML models need a good amount of data to train well. For classification problems, the data must also come with labels. Often, though, no dataset is readily available for a specific problem statement. Suppose we want to build a face mask classifier and, after several web searches, we cannot find a suitable dataset. In such situations, we need to make our own dataset.

In computer vision, much is said about working with images and very little about acquiring them. So I’ll go through this crucial step of making a custom dataset and labelling it.

Web Scraping

Web scraping means extracting data from websites, typically in large amounts, and storing it on a local system. A scraper can access the World Wide Web directly over HTTP/HTTPS or through a web browser.

The most well-known Python library for image scraping is BeautifulSoup (the beautifulsoup4 package), which parses HTML and XML documents. The requests library makes the necessary requests to the webpage. Both packages can be installed with pip (and may already be present in your environment).

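from bs4 import BeautifulSoup  # installed as the beautifulsoup4 package
import requests as rq
import os

# Fetch the page to scrape (the page URL goes inside the quotes)
r2 = rq.get("")
soup = BeautifulSoup(r2.text, "html.parser")
links = []  # will hold the scraped image URLs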

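If we click on any picture on the webpage and open the developer tools, we’ll see that every image URL starts with ‘’: the format is the same up to photos, and then a unique number follows. We therefore select on that common prefix so that all similar images are picked up. The img[src^=""] pattern used here is a CSS attribute selector, where ^= means "starts with"; it is a form of pattern matching, though not strictly a regular expression.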

images = soup.select('img[src^=""]')
for img in images:
    links.append(img['src'])  # keep the URL of each matching image

After this step, we can print the ‘links’ list to see the image links that have been scraped.
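To make the "starts with" matching concrete, here is a minimal sketch of the same idea on an inline HTML snippet. It uses Python's built-in html.parser instead of BeautifulSoup so that it runs without extra packages. The example.com URLs and the ImgSrcCollector class are hypothetical and only serve as illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src of every <img> tag whose src starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        # Same test that the CSS selector img[src^="prefix"] performs
        if src and src.startswith(self.prefix):
            self.links.append(src)


# Hypothetical page content: two photos and one unrelated icon
html = """
<img src="https://example.com/photos/101/a.jpg">
<img src="https://example.com/icons/logo.png">
<img src="https://example.com/photos/202/b.jpg">
"""

collector = ImgSrcCollector("https://example.com/photos/")
collector.feed(html)
links = collector.links
print(links)
```

Only the two URLs under the photos path are kept; the icon is skipped because its src does not share the prefix.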

Making a directory to save the images in:

os.makedirs("jayita_photos", exist_ok=True)

Now we download the images. Only the first 10 are downloaded here to demonstrate the process, but every image on the page can be downloaded the same way. The files are written with ordinary Python file handling.

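for index, img_link in enumerate(links):
    if index < 10:  # download only the first 10 images
        img_data = rq.get(img_link).content
        # Save as 1.jpg, 2.jpg, ... inside the jayita_photos directory
        with open(os.path.join("jayita_photos", str(index + 1) + ".jpg"), "wb") as f:
            f.write(img_data)
    else:
        break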

After successfully running the program, go to the specified file path and you will see that the images are stored there.
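The download step can also be wrapped in a small reusable function. The following is only a sketch under some assumptions: save_images and fetch_bytes are hypothetical helper names, the fetcher uses the standard library's urllib rather than requests, and a link that fails to download is skipped instead of stopping the run.

```python
import os
from urllib.request import urlopen


def fetch_bytes(url, timeout=10):
    """Download a URL and return its raw bytes (standard library only)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def save_images(links, out_dir, fetch=fetch_bytes, limit=10):
    """Save up to `limit` images from `links` into `out_dir`.

    Files are named by their position in the list (1.jpg, 2.jpg, ...).
    Returns the list of file paths that were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for index, link in enumerate(links[:limit], start=1):
        try:
            data = fetch(link)
        except OSError as err:  # urllib's URLError is a subclass of OSError
            print(f"skipping {link}: {err}")
            continue
        path = os.path.join(out_dir, f"{index}.jpg")
        with open(path, "wb") as f:
            f.write(data)
        saved.append(path)
    return saved
```

Passing the fetcher as a parameter keeps the function easy to try out offline; with requests installed, something like fetch=lambda url: rq.get(url, timeout=10).content would plug in the library used in this article.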


Now that we have our images, we need to label them for classification. For this, we’ll use LabelImg, a GUI-based annotation tool. It works with Python 3 and can be installed with pip. It can save annotations in the Pascal VOC format (the format used by ImageNet) and in the YOLO format.

LabelImg opens once its launch command is run in the terminal.

The available options are on the left side, and the image file information is shown on the right side. To label a single image, select ‘Open’; for a directory of images, select ‘Open Dir’, which loads all the images in it. Press ‘a’ to go to the previous image and ‘d’ to go to the next one.

Draw a rectangular box around the object to create the annotation. Press ‘w’ to start drawing a box directly.

After the box is drawn, a window pops up asking for the class name to store for that box.

The class labels provided (koala in this case) are shown on the right side of the window.

After drawing the bounding box and entering the precise class name, it’s important to save the annotation in the chosen format (Pascal VOC or YOLO). Pascal VOC annotations are stored as XML files, while YOLO annotations are stored as plain text files.

Format of the Pascal VOC annotations
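Because a Pascal VOC annotation is plain XML, it can be read back with Python's standard library. The sketch below assumes the usual Pascal VOC layout (a filename element plus one object element per box, each with a name and a bndbox). The sample values and the read_voc helper are hypothetical, not taken from the files generated in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation for one image with a single labelled box
SAMPLE_XML = """<annotation>
  <filename>1.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>koala</name>
    <bndbox>
      <xmin>120</xmin><ymin>80</ymin><xmax>420</xmax><ymax>400</ymax>
    </bndbox>
  </object>
</annotation>"""


def read_voc(xml_text):
    """Return (filename, [(class_name, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = [int(bndbox.findtext(key)) for key in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((obj.findtext("name"), *coords))
    return root.findtext("filename"), boxes


filename, boxes = read_voc(SAMPLE_XML)
print(filename, boxes)
```

For a file on disk, ET.parse(path).getroot() can replace ET.fromstring in the same function.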

Format of the YOLO annotations

The first value (0) is the class id; the remaining four are the bounding box values: x center, y center, width and height, each normalized by the image dimensions.

The class file containing the class names is generated along with the YOLO-format annotations.
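As an illustration of how a YOLO line relates to pixel coordinates, the following sketch converts one annotation line back to a box in pixels and looks up the class name. It assumes the standard YOLO convention (class id, then x center, y center, width and height as fractions of the image size). The function name, the sample line and the image size are hypothetical.

```python
def yolo_to_pixels(line, img_w, img_h, class_names):
    """Convert one YOLO annotation line to (class_name, xmin, ymin, xmax, ymax)."""
    class_id, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    # Values are fractions of the image size, so scale them back to pixels
    xmin = round((xc - w / 2) * img_w)
    ymin = round((yc - h / 2) * img_h)
    xmax = round((xc + w / 2) * img_w)
    ymax = round((yc + h / 2) * img_h)
    return class_names[int(class_id)], xmin, ymin, xmax, ymax


# Class names as they would be listed in the class file, one per line
class_names = ["koala"]

# Hypothetical annotation line for a 640x480 image
box = yolo_to_pixels("0 0.5 0.5 0.25 0.5", 640, 480, class_names)
print(box)
```

The class id is simply the zero-based position of the name in the class file, which is why the list is indexed directly.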


Creating your own image dataset with these steps is helpful when a dataset is not readily available, or when too little data is available and its size needs to be increased. I’ve only shown it for a single class, but the same steps apply to multiple classes, provided all the classes are placed in the same folder.

Jayita Bhattacharyya
Machine learning and data science enthusiast. Eager to learn new technology advances. A self-taught techie who loves to do cool stuff using technology for fun and worthwhile.
