# Microsoft Introduces Mathematical Framework To Tune Up Attention Architectures

Recently, researchers from Microsoft Research and the University of Montreal introduced a mathematical framework that uses measure theory and integral operators to model attention architectures in neural networks. According to the researchers, the framework is intended to quantify the regularity, in other words the amount of smoothness, of the attention operation.

The attention mechanism is a fundamental building block of neural networks, in the same way as the multi-layer perceptron, the convolutional neural network and the recurrent neural network cell. As part of a network’s architecture, it manages and quantifies the interdependence between the input and output elements, as well as among the input elements themselves.

### Why This Research

Attention has proved to be a powerful component of modern neural networks across a variety of domains, and researchers have been working to improve the architecture. However, the mathematical properties of attention, and in particular the regularity of attention architectures, are still not clearly explained. The new mathematical framework aims to address this gap.

The researchers stated, “In particular, we seek to understand how ‘close’ the outputs of the attention operation are in terms of the closeness of the inputs and the parameters of the attention block.”

### Importance Of Regularity

According to the researchers, the regularity of attention is essential for various reasons. More specifically, regularity impacts self-attention networks’ properties, such as their invertibility and the existence of infinite-depth limits.

The key reasons why regularity is important are:

• First, regularity is a basic property of a function with crucial implications for tasks such as feature learning. Lipschitz regularity, in particular, plays an important role here.
• Second, repeated composition magnifies any irregularity of a function. Since attention is used liberally in very deep architectures, understanding the regularity of this essential building block can shed light on the training and stability of these models.
• Lastly, a precise theory makes it possible to derive testable predictions that lead to improvements, and to analyse experimental results post hoc to better understand why a given behaviour was observed.
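
The role of Lipschitz regularity and of composition can be written down with two standard facts from analysis. The notation below ($f$, $L$, $d_X$, $d_Y$, $L_i$) is illustrative and is not taken from the researchers' paper; it is a sketch of the general idea rather than their specific result.

```latex
% A map f : (X, d_X) -> (Y, d_Y) is Lipschitz continuous with constant L >= 0 if
\[
  d_Y\bigl(f(x), f(x')\bigr) \le L \, d_X(x, x')
  \quad \text{for all } x, x' \in X .
\]

% For a stack of n layers f_1, ..., f_n with Lipschitz constants L_1, ..., L_n,
% the constant of the composition is bounded by the product of the constants:
\[
  \operatorname{Lip}(f_n \circ \cdots \circ f_1) \le \prod_{i=1}^{n} L_i .
\]
```

If every layer has a constant of at most 1, the bound for the whole stack stays at most 1. If each of ten layers has a constant of 2, the bound grows to $2^{10} = 1024$, which is why depth makes the regularity of each block matter.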

### Behind The Framework

The mathematical framework relies heavily on linear transformations of measures, modelled by Markov kernels. The aim of the research is to show that attention is Lipschitz continuous under various assumptions; to do so, the researchers introduced a modelling paradigm for attention based on measure theory and integral operators.
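
In the finite, discrete case, a Markov kernel reduces to a row-stochastic matrix: every row is a probability distribution. The softmax weights of standard dot-product attention form such a matrix, and each output is the expected value of the value vectors under the corresponding row. The sketch below illustrates only this discrete special case, not the researchers' general measure-theoretic construction; the function names (`attention_kernel`, `attend`) and the toy inputs are assumptions made for illustration, and it uses only the standard library to stay dependency-free.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_kernel(queries, keys):
    """Return a row-stochastic matrix: row i is a distribution over the keys."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(qd * kd for qd, kd in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


def attend(queries, keys, values):
    """Each output is the expectation of the values under one kernel row."""
    kernel = attention_kernel(queries, keys)
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
        for row in kernel
    ]


# Toy inputs (hypothetical): two queries attending over three key/value pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

kernel = attention_kernel(queries, keys)
outputs = attend(queries, keys, values)
row_sums = [sum(row) for row in kernel]  # each entry is 1 up to rounding
```

Because every row of `kernel` is non-negative and sums to one, each output is a convex combination of the value vectors, which is the property the kernel view makes explicit.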

They demonstrated that the attention operation is Lipschitz continuous and provided an estimate of its Lipschitz constant. Lipschitz continuity has been used to advance the state of the art in several deep learning areas, such as generative models and robust learning.
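
A Lipschitz constant can also be probed numerically. The sketch below fixes a small set of keys and values, perturbs a single query by a tiny step, and records the largest observed ratio between the change in output and the change in input. This is an assumption-laden illustration: the function `empirical_lipschitz_ratio`, the sampling range for queries and the toy data are all hypothetical, and random sampling only yields a lower bound on the true constant over the sampled region. It is not the estimate derived by the researchers.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Dot-product attention output for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(qd * kd for qd, kd in zip(query, k)) / scale for k in keys]
    )
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in x))


def empirical_lipschitz_ratio(keys, values, trials=200, eps=1e-3, seed=0):
    """Largest observed ||out(q') - out(q)|| / ||q' - q|| over random queries."""
    rng = random.Random(seed)
    dim = len(keys[0])
    worst = 0.0
    for _ in range(trials):
        q = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        d_norm = norm(direction)
        q2 = [a + eps * b / d_norm for a, b in zip(q, direction)]
        step = norm([b - a for a, b in zip(q, q2)])
        change = norm(
            [b - a for a, b in zip(attend(q, keys, values), attend(q2, keys, values))]
        )
        worst = max(worst, change / step)
    return worst


# Toy keys and values (hypothetical), bounded so the ratio stays finite.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
estimate = empirical_lipschitz_ratio(keys, values)
```

The queries are drawn from a bounded box on purpose: the article notes that the Lipschitz results hold under various assumptions, so the sketch restricts itself to a bounded input region rather than claiming anything about unbounded inputs.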

The researchers also assessed the impact of these regularity results on practical applications of attention such as cross-attention, robustness and token-level perturbations in NLP, and sophisticated extensions to the transformer architecture.

### Wrapping Up

The researchers studied how regularity can help certain applications by providing robustness to the learned representations. They showed several benefits of using this mathematical framework:

• The mathematical framework is consistent with the usual definition, and it captures the essential properties of attention.
• The framework showed that the resulting representation is Lipschitz continuous with respect to the output semantic space.
• The framework provides a potential mathematical basis for the robustness of transformers.
• The modelling could be used to derive predictions of the distance between self-attention networks’ contextual embeddings as a function of the context, providing a way to test this robustness hypothesis.
• It could also be used to design better model components, such as input embedding spaces that reduce the regularity mismatch for specific perturbations that are highly irregular.
