
How To Use OpenBlender – The Leading Data Blending Tool



OpenBlender is an easy-to-use tool for improving the performance of Machine Learning models by augmenting their data with external data from various open sources. It enables data blending across multiple datasets that overlap in time or location, so relevant variables from other datasets can be added to our data quickly.

OpenBlender was introduced by a private organization named Open Blender Inc. The company was founded on Oct 22, 2018, and is headquartered in San Diego, California (USA). Its founders are Antonio Rodriguez Lorenzo, Federico Riveroll and Javier Echevarria.


Installation

To install OpenBlender using Python, run the following command:

pip install OpenBlender

The Time Blend function of OpenBlender enables enriching our dataset (called ‘anchor dataset’) with feature variables from some external datasets (called ‘blend datasets’) using time as the common axis for combining the data.

Practical implementation

Suppose we have a sample Python DataFrame named ‘df’ having data of Walmart sales as follows:

(Image: sample DataFrame)

Import the OpenBlender library:

import OpenBlender

To time-blend an external dataset with df, we first need to convert the ‘date’ column to UNIX timestamps, a timezone-independent format that is uniform throughout the world.

df['timestamp'] = OpenBlender.dateToUnix(df['date'],
                                         date_format = '%d.%m.%Y',
                                         timezone = 'GMT')
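To make the conversion concrete, here is a minimal standard-library sketch of what turning a '%d.%m.%Y' date string into a UNIX timestamp in GMT amounts to. It is not OpenBlender's implementation; the helper name date_to_unix and the sample dates are hypothetical and only serve as an illustration.

```python
from datetime import datetime, timezone


def date_to_unix(date_string, date_format="%d.%m.%Y"):
    """Parse a date string and return seconds since 1970-01-01 00:00 GMT."""
    parsed = datetime.strptime(date_string, date_format)
    # The parsed datetime is naive; interpret it as GMT/UTC before converting.
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


# Hypothetical sample values in the same day.month.year format.
sample_dates = ["05.02.2010", "12.02.2010"]
timestamps = [date_to_unix(d) for d in sample_dates]
print(timestamps)
```

Because the result is a plain count of seconds, two dates one week apart differ by exactly 7 * 86,400 seconds, which is what makes timestamps convenient as a common axis for blending.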

Sort df in ascending order of ‘timestamp’:

df = df.sort_values('timestamp').reset_index(drop = True)

Search for datasets that time-overlap with df using the searchTimeBlends() function:

token = 'YOUR_TOKEN'
search_term = 'gold'
OpenBlender.searchTimeBlends(token, df.timestamp, search_term)

The output of the above code lists the datasets that time-intersect with df, along with each dataset's URL on OpenBlender’s dashboard, its features and its description. For instance:

[{'name': 'Daily Gold Price',
  'url': 'https://www.openblender.io/#/dataset/explore/5d13a3029516295728d6c7e5',
  'id_dataset': '5d13a3029516295728d6c7e5',
  'feature': ['call_opinion',
              'change',
              'price',
              'price_avg',
              'unnamed_0',
              'volume'],
  'num_observations': 1543,
  'Intersection': 100%,
  'Description': 'Daily gold prices'}]

The following ‘blend_source’ dictionary specifies the blend dataset by its ID (the ‘id_dataset’ value in the above output) together with the name of the feature that needs to be added to the anchor dataset.

blend_source = {
    'id_dataset' : 'DATASET_ID',
    'feature' : 'price'
}
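Rather than copying the ID by hand, the dictionary can be built from one entry of the search output. The sketch below assumes the search results are a list of dictionaries with the ‘name’, ‘id_dataset’ and ‘feature’ keys shown in the sample output; the helper make_blend_source is hypothetical and not part of OpenBlender.

```python
def make_blend_source(search_results, dataset_name, feature):
    """Build a blend_source dict from one entry of the search output."""
    for entry in search_results:
        if entry["name"] == dataset_name:
            # Guard against typos in the feature name.
            if feature not in entry["feature"]:
                raise ValueError(f"{feature!r} is not a feature of {dataset_name!r}")
            return {"id_dataset": entry["id_dataset"], "feature": feature}
    raise ValueError(f"no dataset named {dataset_name!r} in the search results")


# Trimmed copy of the sample search output (assumed shape).
search_results = [{
    "name": "Daily Gold Price",
    "id_dataset": "5d13a3029516295728d6c7e5",
    "feature": ["call_opinion", "change", "price", "price_avg", "unnamed_0", "volume"],
}]

blend_source = make_blend_source(search_results, "Daily Gold Price", "price")
print(blend_source)
```

Checking the feature name against the listed features catches a misspelled column before any blend request is made.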

Adding a numerical feature to the anchor dataset:

There are two types of time blends:

  1. Aggregated in Intervals Blend
  • Specify blend_type = ‘agg_in_intervals’
  • Specify ‘count’, ‘avg’ or ‘sum’ as interval_output to aggregate the values of the blend dataset's feature over a particular interval (i.e. ‘interval_size’).
  2. Closest Observation Blend
  • Specify blend_type = ‘closest_observation’
  • It joins the observations from the blend dataset that are closest in time to each observation of the anchor dataset.

The following call performs an aggregated-in-intervals blend:

df_blend = OpenBlender.timeBlend(token = token,
                                 anchor_ts = df.timestamp,
                                 blend_type = 'agg_in_intervals',
                                 direction = 'time_prior',
                                 interval_output = 'avg',
                                 interval_size = 60 * 60 * 24 * 1,
                                 blend_source = blend_source,
                                 missing_values = 'impute')

In the above piece of code, interval_size is given in seconds: 60 * 60 * 24 * 1 (seconds × minutes × hours × days) equals 86,400 seconds, so the data from the blend dataset is aggregated over a period of 1 day.

The ‘direction’ parameter can take one of 3 values (‘prior’, ‘posterior’ or ‘both’), specifying the direction in time, relative to each timestamp value of the anchor, from which blend observations are taken. Note that the call above passes the value as ‘time_prior’.
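The two blend types can be illustrated with a small pure-Python sketch. This is not OpenBlender's implementation: the function names, the inclusive window boundaries, the use of None for an empty window and the sample numbers are all assumptions made for illustration. Blend observations are modelled as (timestamp, value) pairs.

```python
def window_bounds(t, interval_size, direction):
    """Return the (start, end) window around anchor timestamp t."""
    if direction == "prior":
        return t - interval_size, t
    if direction == "posterior":
        return t, t + interval_size
    if direction == "both":
        return t - interval_size, t + interval_size
    raise ValueError(f"unknown direction: {direction!r}")


def agg_in_intervals(anchor_ts, blend_obs, interval_size,
                     direction="prior", output="avg"):
    """Aggregate blend values that fall inside each anchor's window."""
    results = []
    for t in anchor_ts:
        start, end = window_bounds(t, interval_size, direction)
        # Assumption: both window boundaries are inclusive.
        values = [v for ts, v in blend_obs if start <= ts <= end]
        if output == "count":
            results.append(len(values))
        elif output == "sum":
            results.append(sum(values))
        elif output == "avg":
            # None marks an empty window; no imputation is attempted here.
            results.append(sum(values) / len(values) if values else None)
        else:
            raise ValueError(f"unknown output: {output!r}")
    return results


def closest_observation(anchor_ts, blend_obs):
    """Pick, for each anchor timestamp, the blend value nearest in time."""
    return [min(blend_obs, key=lambda obs: abs(obs[0] - t))[1]
            for t in anchor_ts]


DAY = 60 * 60 * 24  # one day expressed in seconds
anchor_ts = [3 * DAY, 5 * DAY]
blend_obs = [(2 * DAY + 100, 10.0), (2 * DAY + 200, 20.0), (4 * DAY + 50, 40.0)]

print(agg_in_intervals(anchor_ts, blend_obs, DAY, "prior", "avg"))  # [15.0, 40.0]
print(closest_observation(anchor_ts, blend_obs))                    # [20.0, 40.0]
```

The aggregated blend summarises everything inside the window, while the closest-observation blend keeps a single value per anchor row, so the first anchor receives 15.0 in one case and 20.0 in the other.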

Now concatenate all the columns of df_blend except the ‘timestamp’ column to the anchor DataFrame df:

import pandas as pd

df_anchor = pd.concat([df, df_blend.loc[:, df_blend.columns != 'timestamp']], axis = 1)

Adding a text feature to the anchor dataset

To join a textual feature column from a blend dataset, the only modification required is to set the ‘interval_output’ parameter of the timeBlend() function to ‘list’ if we want a list of texts, or to ‘text’ if we want the raw text, taken from the blend dataset over the ‘interval_size’ specified.
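The difference between the two outputs can be sketched in plain Python as follows. This is an illustration only: the helper blend_texts, the prior-looking window and the choice of joining texts with a single space for the ‘text’ output are assumptions, not OpenBlender's documented behaviour.

```python
def blend_texts(anchor_ts, blend_obs, interval_size, interval_output="list"):
    """Collect blend texts from the interval before each anchor timestamp."""
    blended = []
    for t in anchor_ts:
        texts = [text for ts, text in blend_obs if t - interval_size <= ts <= t]
        if interval_output == "list":
            blended.append(texts)            # one list of texts per anchor row
        elif interval_output == "text":
            blended.append(" ".join(texts))  # assumed: texts joined by a space
        else:
            raise ValueError(f"unknown interval_output: {interval_output!r}")
    return blended


DAY = 60 * 60 * 24
anchor_ts = [2 * DAY]
blend_obs = [(DAY + 10, "gold rallies"), (DAY + 20, "markets calm"),
             (3 * DAY, "later news")]

print(blend_texts(anchor_ts, blend_obs, DAY, "list"))
print(blend_texts(anchor_ts, blend_obs, DAY, "text"))
```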

Adding some filtered text to the anchor dataset

Suppose we have a column named ‘reviews’ in the blend dataset and we wish to add a separate column to the anchor dataset which shows only the positive reviews from the blend set.

filter_txt = {'name' : 'positive_reviews', 'match_ngrams' : ['good', 'excellent', 'great']}

Now, just add the above dictionary ‘filter_txt’ to the blend_source code as follows:

blend_source = {
    'id_dataset' : 'DATASET_ID',
    'feature' : 'reviews',
    'filter_text' : filter_txt
}

The rest of the procedure remains the same as that for adding a numerical feature, described above.
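To show what such a filter does conceptually, here is a minimal sketch that keeps only the texts containing at least one of the listed n-grams. The helper filter_by_ngrams, the case-insensitive substring matching and the sample reviews are assumptions; OpenBlender's actual matching rules may differ.

```python
def filter_by_ngrams(texts, match_ngrams):
    """Keep only the texts that contain at least one of the given n-grams."""
    lowered = [ngram.lower() for ngram in match_ngrams]
    # Assumption: matching is a case-insensitive substring test.
    return [t for t in texts if any(n in t.lower() for n in lowered)]


filter_txt = {"name": "positive_reviews",
              "match_ngrams": ["good", "excellent", "great"]}

# Hypothetical review texts.
reviews = ["Great value for money",
           "Arrived late and damaged",
           "Good service overall"]

positive_reviews = filter_by_ngrams(reviews, filter_txt["match_ngrams"])
print(positive_reviews)
```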

Adding a text vectorizer to the anchor dataset

Again, the process remains the same except that the blend_source code should now include the ID of the text vectorizer found on the dashboard of OpenBlender. The blend_source code now looks like:

 blend_source = {  
                   'id_textVectorizer':'TEXT_VECTORIZER_ID'
                  } 

The above demo code depicts the process of blending data that overlaps in time.

The blending process can also be carried out using location as the attribute for joining data, by means of the locationBlend() function.

Apart from carrying out time-wise and location-wise blends, OpenBlender provides the following functionalities:

  • Search time blends related to our anchor dataset
  • Search location blends related to our anchor dataset
  • Get details of a dataset available on the dashboard of OpenBlender
  • Create a new dataset
  • Add samples to a dataset
  • Get samples from a dataset
  • Pull samples from external text sources into a dataframe
  • Pull samples from a text vectorizer into a dataframe
  • Convert date/datetime to UNIX timestamp
  • Convert UNIX timestamp to date/datetime

Go through the API documentation to understand the implementation of the above functionalities.

Endnotes

In this article, we got a brief overview of OpenBlender, a simple way of blending relevant data from a wide range of open sources. It can enrich the data behind our ML models with minimal effort and time.

Refer to the following sources to gain A-Z knowledge of the OpenBlender:


Nikita Shiledarbaxi

A zealous learner aspiring to advance in the domain of AI/ML. Eager to grasp emerging techniques to get insights from data and hence explore realistic Data Science applications as well.