
What Forms The Foundation Of Data Science? MSR India Researcher Explains

Ravi Kannan at the MSR India lab

The last century has seen tremendous innovation in the field of mathematics. New theories have been postulated and traditional theorems have been made robust by persistent mathematicians, and we are still reaping the benefits of their exhaustive endeavours as we build intelligent machines. The field of data science is built on some ingenious mathematical and logical hypotheses and tools.

Here we list a few concepts from the book co-authored by Ravi Kannan, Principal Researcher at Microsoft Research India, which form the foundation of data science:

Singular Value Decomposition

Modern data often consists of feature vectors with a large number of features. The conversion of data into vectors is domain specific.

High-dimensional geometry and Linear Algebra are two of the crucial areas which form the mathematical foundations of Data Science.

Length-squared sampling of matrices, singular value decomposition and low-rank approximation are a few techniques that are widely used in data processing.
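To make the first of these concrete, here is a minimal NumPy sketch (my own illustration, not code from the book) of length-squared row sampling: rows of a matrix A are picked with probability proportional to their squared length and rescaled so that the small sample approximates A^T A.

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 10))                      # a tall toy matrix

row_norms_sq = np.sum(A ** 2, axis=1)
probs = row_norms_sq / row_norms_sq.sum()            # length-squared distribution over rows

s = 100                                              # number of rows to sample
idx = rng.choice(len(A), size=s, p=probs)
S = A[idx] / np.sqrt(s * probs[idx])[:, None]        # rescale so that S^T S approximates A^T A

rel_err = np.linalg.norm(S.T @ S - A.T @ A) / np.linalg.norm(A.T @ A)
print(rel_err)                                       # typically a small relative error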

For example, the singular value decomposition finds the best-fitting k-dimensional subspace for k = 1, 2, 3, … for a set of n data points. Here, “best” means minimizing the sum of the squares of the perpendicular distances of the points to the subspace, or, equivalently, maximizing the sum of the squares of the lengths of the projections of the points onto this subspace.
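As a hedged sketch of that statement (again using NumPy rather than any code from the book), the top k right singular vectors span the best-fit subspace, and the sum of squared projection lengths equals the sum of the squares of the top k singular values:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))                        # 100 points in 5 dimensions, one per row

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
V_k = Vt[:k]                                         # top-k right singular vectors span the best-fit subspace

proj_sq = np.sum((A @ V_k.T) ** 2)                   # sum of squared projection lengths (maximized)
perp_sq = np.sum(A ** 2) - proj_sq                   # sum of squared perpendicular distances (minimized)

print(proj_sq, np.sum(s[:k] ** 2))                   # the two numbers agree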

SVD is traditionally used in principal component analysis (PCA). PCA is popularly used for feature extraction and for understanding how strongly the features or properties of the data relate to an outcome.
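A common way to carry this out in practice, sketched below under the usual convention of mean-centring the data first (my own example, not one from the book), is to read the explained variance of each principal component off the singular values:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)       # make one feature nearly redundant

Xc = X - X.mean(axis=0)                              # centre each feature
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

explained_variance_ratio = s ** 2 / np.sum(s ** 2)
print(explained_variance_ratio)                      # the last ratio is small because one feature is nearly redundant

scores = Xc @ Vt[:2].T                               # the data expressed in its top-2 principal components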

Lloyd’s Algorithm

More often than not, data is unstructured, vast and vague. Making sense of it is the job of a data scientist. The simplest, most intuitive way of reducing the complexities in data is to divide it into groups and then deal with them on an individual level. Grouping or gathering data points is traditionally done using clustering methods like k-means. Lloyd’s algorithm is one such method, and it goes as follows:

  1. Start with k centers.
  2. Cluster each point with the center nearest to it.
  3. Find the centroid of each cluster and replace the set of old centers with the centroid.
  4. Repeat the above two steps until the centers converge according to some criterion, such as the k-means score no longer improving.

Lloyd’s algorithm does not necessarily find a globally optimal solution but will find a locally optimal one. An important but unspecified step in the algorithm is its initialization: how the starting k centers are chosen.
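Here is a minimal NumPy sketch of the four steps above; the initialisation (picking k random data points as the starting centers) is one common choice, not a prescription from the book:

import numpy as np

def lloyd(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]   # step 1: start with k centers
    for _ in range(n_iter):
        # step 2: cluster each point with the center nearest to it
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: replace each center with the centroid of its cluster (keep the old center if a cluster is empty)
        new_centers = np.array([points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # step 4: stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in (0.0, 5.0)])  # two well-separated blobs
centers, labels = lloyd(pts, k=2)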

Be it sentiment analysis for recommendation systems or identifying protein sequences in cancer cells, clustering is widely applicable.

Occam’s Razor

A good machine learning model makes predictions about new data after being trained on a database of random examples. The basic goal is to perform as well, or nearly as well, as the best predictor in a family of functions, such as neural networks or decision trees. For a given model and function family, if this goal can be achieved under some reasonable constraints, the family is said to be learnable in the model.

Machine-learning theorists are typically able to transform questions about the learnability of a particular function family into problems that involve analysing various notions of dimension that measure some aspect of the family’s complexity. For example, the appropriate notion for analysing PAC learning is known as the Vapnik–Chervonenkis (VC) dimension, and, in general, results relating learnability to complexity are sometimes referred to as Occam’s-razor theorems.
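As a toy illustration of the VC-dimension idea (my own example, not one from the article or the book), one-dimensional threshold classifiers h_t(x) = 1 if x >= t can realise every labelling of a single point but not every labelling of two points, so their VC dimension is 1:

from itertools import product

def achievable_labelings(points, thresholds):
    # labelings that the rules "predict 1 when x >= t" can produce on the given points
    return {tuple(int(x >= t) for x in points) for t in thresholds}

points = [1.0, 2.0]
thresholds = [0.5, 1.5, 2.5]                 # enough thresholds to cover every distinct behaviour

got = achievable_labelings(points, thresholds)
print(got)                                   # {(1, 1), (0, 1), (0, 0)}
print(set(product([0, 1], repeat=2)) - got)  # {(1, 0)} can never be produced, so two points cannot be shattered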

Occam’s razor is the notion, stated by William of Occam around AD 1320, that in general one should prefer simpler explanations over more complicated ones.

Why should one do this, and can we make a formal claim about why this is a good idea?  What if each of us disagrees about precisely which explanations are simpler than others?

The formal answer is that Occam’s razor is a good policy because simple rules are unlikely to fool us: there are just not that many simple rules, so it is unlikely that a bad one fits the data purely by chance.
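That counting argument can be made quantitative. As a rough sketch (the standard finite-class PAC bound, stated informally rather than as the book’s exact theorem): with |H| candidate rules, roughly (1/ε)(ln|H| + ln(1/δ)) training examples suffice so that, with probability at least 1 − δ, no rule whose true error exceeds ε looks perfect on the training data.

import math

def occam_sample_size(num_rules, eps=0.05, delta=0.01):
    # examples needed so that no rule with true error above eps survives the training set,
    # except with probability at most delta (finite-class PAC / union bound)
    return math.ceil((math.log(num_rules) + math.log(1 / delta)) / eps)

print(occam_sample_size(num_rules=10**6))    # about 369 examples handle a million candidate rules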

As a machine learning model heads to production, all these statistical methods and techniques ultimately come down to one thing: a YES or NO decision.

The book Foundations of Data Science, authored by Avrim Blum, John Hopcroft and Ravindran Kannan, covers other interesting foundational topics such as:

  • Law of large numbers
  • Geometry of high dimensions
  • Matrix operations
  • Random walks in Euclidean space
  • Gradient Descent methods
  • Graph partitioning
  • Bayesian or belief networks

These and many other concepts are supplemented with the intuition behind the math.

Download the free book here
