We have already seen many web scrapers that can extract data from websites, but when sites change dynamically over time, it becomes hard to locate and scrape elements. Businesses also work hard to keep web crawlers off their websites. To solve these problems and build a more versatile, multi-purpose tool, Diffbot introduced machine learning and computer vision algorithms, along with public APIs, for extracting information from web pages.
Diffbot was the first company to apply computer vision technology to scraping information from web pages. Instead of writing conditional logic for each element, Diffbot visually parses a website’s pages and returns the important elements.
In 2012, they introduced the Page Classifier API, which automatically sorts web pages into specific categories. Adopting AI in their tools paid off, as they were able to analyze 750,000 web pages from Twitter.
In 2019, they introduced the Knowledge Graph, which automatically extracts data from web pages to build a knowledge base of 2 billion attributes (products, articles, people, companies, and more) and 10 trillion “facts”.
This was a big step forward, because their web crawler could now scrape fine details from websites that other web scraping service providers struggle to reach.
According to a Financial Express report, OpenAI has showcased GPT-3, an advanced AI language model. According to an MIT Technology Review report, Diffbot is working toward a similar goal with a different approach: it vacuums up a large amount of human-written text and extracts facts from it, instead of training a model directly on that text.
You can read more here.
Note: This product is aimed at businesses, so you need a work email to sign up.
Products and services
Diffbot provides four basic services:
- Extract: Automatically extract any article, blog, product, or image from any website without code.
- Crawl: Extract structured data from entire websites, once or on a schedule, using a cloud-based service.
- Search: Use Diffbot Knowledge Graph to search for information on companies, articles, products, and people.
- Enhance: Enrich and manage your existing organization or client & employee data using the Diffbot Knowledge Graph.
After signing up, you get a 14-day free trial that includes 10,000 free credits and access to the Knowledge Graph, the Diffbot cloud dashboard, Excel and Google Sheets integrations, and the developer APIs.
After a successful login, you can see your dashboard:
- On the right side are the products discussed above: Extract, Crawl, Search, and Enhance. The left tab lists Custom APIs for creating custom web crawlers; Diffbot gives users the freedom to build their own web crawlers with a no-code approach.
Diffbot’s Automatic Extraction APIs
Diffbot offers several APIs for extracting data from web pages using computer vision and natural language processing (NLP). They can break a whole page down into different attributes and return the result as JSON.
- The Analyze API is the place to start when you do not know the type of a URL; it uses machine learning to route the page to the appropriate type of extraction.
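As a rough sketch of what a call to the Analyze API could look like over plain HTTP, the snippet below builds the request URL and reads the detected page type from a response. The endpoint `https://api.diffbot.com/v3/analyze` and the `token` and `url` query parameters follow Diffbot's v3 REST conventions as we understand them. The helper names, the `type` key, and the sample response are illustrative assumptions, so check the official documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base endpoint for Diffbot's v3 Analyze API.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_analyze_url(token: str, page_url: str) -> str:
    """Return the GET URL that asks Analyze to classify and extract page_url."""
    query = urlencode({"token": token, "url": page_url})
    return f"{ANALYZE_ENDPOINT}?{query}"


def detected_type(response: dict) -> str:
    """Read the page type from a parsed JSON response (assumes a 'type' key)."""
    return response.get("type", "unknown")


# Hypothetical, heavily trimmed response used only for illustration.
sample_response = {"type": "article", "objects": [{"title": "Example"}]}

request_url = build_analyze_url("YOUR_TOKEN", "https://example.com/post")
print(request_url)
print(detected_type(sample_response))
```

The page URL is percent-encoded so that it survives as a single query parameter. Sending the request needs a valid token and network access, so that step is left out here.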
Page Type APIs
If you know what type of content your URL contains, use one of the page-type specific APIs as follows:
- Article API allows you to extract information about articles, blog posts, and other written content. Diffbot is able to recognize authors and their profile images and links, sentiment, tags based on content, and more.
- Product API allows you to extract data about products, including specs, colors, availability, price, discount offers, reviews, and more.
- Image API allows you to extract information about images, such as dimensions and download URLs.
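The page-type APIs listed above can be addressed in much the same way once you know what a URL contains. The following is a minimal sketch, assuming the v3 endpoint paths `/v3/article`, `/v3/product`, and `/v3/image`; the function name `build_extract_url` is hypothetical and not part of any Diffbot library.

```python
from urllib.parse import urlencode

# Assumed v3 endpoints, one per page type described above.
PAGE_TYPE_ENDPOINTS = {
    "article": "https://api.diffbot.com/v3/article",
    "product": "https://api.diffbot.com/v3/product",
    "image": "https://api.diffbot.com/v3/image",
}


def build_extract_url(page_type: str, token: str, page_url: str) -> str:
    """Build the request URL for one of the page-type specific APIs."""
    try:
        endpoint = PAGE_TYPE_ENDPOINTS[page_type]
    except KeyError:
        raise ValueError(f"unsupported page type: {page_type!r}") from None
    query = urlencode({"token": token, "url": page_url})
    return f"{endpoint}?{query}"


product_url = build_extract_url(
    "product", "YOUR_TOKEN", "https://example.com/item?id=7"
)
print(product_url)
```

To actually send the request, you could pass the URL to `urllib.request.urlopen` and decode the body with `json.load`.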
Custom API
The Custom API can be used to create entirely new custom web scrapers by defining rules. You can also use the Custom API programmatically.
- Opening the Custom API takes you to a new subdomain; there, click Create New.
- Select the API type and the URL for which you want to create a custom web scraping API, then click Create.
Then specify your own rules for extracting data. Custom APIs are not the focus of this demonstration, as we are going to dive deep into the Knowledge Graph.
Diffbot Python API
Note: This API has not been maintained for the last six years.
This API is designed mainly for developers, so you can control the full API from your IDE. However, it is neither maintained nor widely used: it has only 13 stars and 10 forks on GitHub.
pip install diffbot
How to use it:
Copy your unique token from the Diffbot dashboard!
import diffbot
json_result = diffbot.article('https://github.com', token='your token here')
To extract from a specific piece of source code rather than a live page, POST the data (text or HTML) to the API using the text or html arguments:
import diffbot
client = diffbot.Client(token='#')
json_result = client.api('article', 'https://github.com', html='''
<h1>Introducing GitHub Traffic Analytics</h1>
<p>We want to kick off 2014 with a bang, so today we're happy to launch
traffic analytics!</p>
''')
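Once a call returns, the result is JSON that you can pick apart like any dictionary. The sketch below shows one way to pull a few fields out of an article result. The field names (`title`, `author`, `sentiment`, `tags` with a `label` key) and the optional `objects` wrapper are assumptions about the response shape, and the sample data is invented for illustration.

```python
def summarize_article(json_result: dict) -> dict:
    """Pick a few commonly used fields out of an article extraction result.

    Assumes the article either sits in an 'objects' list or is the
    top-level dictionary itself, depending on the API version.
    """
    objects = json_result.get("objects") or [json_result]
    article = objects[0]
    tags = [
        tag.get("label")
        for tag in article.get("tags", [])
        if isinstance(tag, dict)
    ]
    return {
        "title": article.get("title"),
        "author": article.get("author"),
        "sentiment": article.get("sentiment"),
        "tags": tags,
    }


# Hypothetical result, shaped like the assumptions above.
sample_result = {
    "objects": [
        {
            "title": "Introducing GitHub Traffic Analytics",
            "author": "Jane Doe",
            "sentiment": 0.4,
            "tags": [{"label": "GitHub"}, {"label": "Analytics"}],
        }
    ]
}

summary = summarize_article(sample_result)
print(summary["title"], summary["tags"])
```

Using `.get` throughout keeps the helper from failing when a field is missing, which is common with extracted data.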
The Knowledge Graph is a powerful tool: it can search data gathered from across the internet within about a minute and return matching results with customizable entities.
1. Let’s extract all Samsung smartphones from the internet using the Knowledge Graph for our data analysis research!
- Click Knowledge Graph on the left side of the dashboard, then click Search.
- Select the entity type (in our case, the entity is Product), and then from the dropdown menu we can
- Now select the attributes we need in our dataset, such as sentiment, brand, review, URL, price, date, and more.
- Choose the number of rows (1000 in this case) and click Export.
Details of 1000 Samsung smartphones, including price, selling site, category, and more, in just 10 seconds 😄
Download dataset from here
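With the export in hand, a first pass of analysis needs nothing beyond the standard library. The sketch below averages prices per selling site. The column names `name`, `price`, and `site` and the sample rows are hypothetical, so adjust them to match the headers in your actual export.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical stand-in for an exported CSV file.
SAMPLE_EXPORT = """name,price,site
Samsung Galaxy A,200.00,shop-a.example
Samsung Galaxy B,349.50,shop-b.example
Samsung Galaxy C,,shop-a.example
Samsung Galaxy D,280.00,shop-a.example
"""


def average_price_by_site(csv_file) -> dict:
    """Return the mean price per selling site, skipping unusable prices."""
    prices = defaultdict(list)
    for row in csv.DictReader(csv_file):
        raw_price = (row.get("price") or "").strip()
        try:
            prices[row["site"]].append(float(raw_price))
        except ValueError:
            continue  # empty or non-numeric price
    return {site: round(mean(values), 2) for site, values in prices.items()}


averages = average_price_by_site(io.StringIO(SAMPLE_EXPORT))
print(averages)
```

For a real export, replace the `io.StringIO` wrapper with `open("your_export.csv", newline="")`.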
2. In the same way, you can do sentiment analysis on your company's product:
Use case: You are a data scientist, and you are given the task of doing sentiment analysis on your product: what people have been writing about it, how positive or negative its impact on the internet is, and which drawbacks need attention.
One way to answer all these questions is to build a large dataset of all the articles published on the internet about our product, and then run sentiment analysis and research on it.
Our product name is Diffbot. Here we scrape all articles that mention ‘Diffbot’, that is, all the articles written about Diffbot over the years, with attributes like publishers, sentiment, tags, URLs, text (important), and more.
- Using the same process as above, change the entity to Article and use Diffbot as the text in Filters.
- Click Search
- There you have it! A full dataset of all the articles written on the topic “Diffbot”, with sentiment, publisher, title, name, author, and more attributes that you can select from the left tab. When you are satisfied with the dataset, click Export.
Structured dataset ready for sentiment analysis for our data science project 😄
Download dataset from here
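A simple starting point for the analysis is to bucket each article by its sentiment score and count the buckets. The sketch below assumes each exported row carries a numeric `sentiment` value between -1 and 1. The 0.1 threshold, the field name, and the sample rows are illustrative choices rather than anything prescribed by Diffbot.

```python
from collections import Counter


def bucket_sentiment(score: float, threshold: float = 0.1) -> str:
    """Map a score (assumed to lie in [-1, 1]) to a coarse label."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


def sentiment_summary(articles, threshold: float = 0.1) -> dict:
    """Count articles per sentiment bucket, skipping rows without a score."""
    counts = Counter()
    for article in articles:
        score = article.get("sentiment")
        if score in (None, ""):
            continue
        counts[bucket_sentiment(float(score), threshold)] += 1
    return dict(counts)


# Hypothetical rows standing in for the exported article dataset.
sample_articles = [
    {"publisher": "Site A", "sentiment": 0.6},
    {"publisher": "Site B", "sentiment": -0.4},
    {"publisher": "Site C", "sentiment": 0.05},
    {"publisher": "Site A", "sentiment": 0.3},
    {"publisher": "Site D", "sentiment": None},
]

summary = sentiment_summary(sample_articles)
print(summary)
```

From here, grouping the same counts by publisher or by month would show where the negative coverage comes from.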
We looked at Diffbot's different services and tools, walked through a full demonstration of the Diffbot Knowledge Graph with two use cases, and also tried the Python API.
Diffbot is a great tool, as we have seen, and it has maintained its reputation over the years with its AI-powered services while continuing to improve its Knowledge Graph. Thousands of developers at Fortune 500 companies rely on Diffbot daily because of its simplicity and accessibility.
Their research areas are not limited to web scraping: they also work on named entity recognition, relation extraction, sentiment analysis, computer vision, machine learning, distributed systems, and more!