“Matrix multiplication is among the most fundamental and compute-intensive operations in machine learning.”
The fundamental operations of any typical neural network can be reduced to a bunch of addition and multiplication operations. Neural networks can be expressed in terms of matrices. Matrix multiplication is one of the most important mathematical operations when it comes to deep neural networks. Be it a convolution operation of a CNN to recognise images or a language model to perform sentiment analysis; these basic arithmetic operations play a huge role. Modern large data sets, too, are often expressed as large matrices. For instance, in the bag-of-words model, each row of the matrix corresponds to one document. Whereas, in the case of image analysis, feature values are represented by a matrix with rows corresponding to images. Even Google TPUs are defined in terms of matrices.TPUs have Tensors, which are multi-dimensional arrays or matrices. Basic math operations performed on tensors, including addition, element-wise multiplication, and matrix multiplication, are inseparable from the world of deep learning.
As matrix multiplication is one of the fundamental processes of a deep neural network, any chance of speeding up this process can cut down long training times, which the DNNs are usually blamed for. Previous works targeted to speed up deep learning have developed high-speed matrix multiplication libraries, designed custom hardware to accelerate matrix multiplication and designed efficient Approximate Matrix Multiplication (AMM) algorithms under various assumptions. AMM, according to Ye et al., makes matrix computation suitable for large-scale datasets.
Recently, researchers Davis Blalock and John Guttag introduced a learning-based algorithm for this task that greatly outperforms existing methods. This method departs from the existing AMM approach as it relies on hashing and table lookups rather than multiply-add operations.
Existing methods try to minimise computation time by approximating linear operations. This challenge usually arises in machine learning when one has a data matrix, say A, whose rows are samples, and a linear operator, say B, which could also be a linear classifier, regressor, or embedding matrix. This new method by Blalock and Guttag, known as MADDNESS, uses a nonlinear preprocessing function and reduces the problem to table lookups.
If applying a trained linear model to new data leads to knowledge of the linear operator, the MADDNESS algorithm does not require any multiply-add operations. “Our method is most closely related to vector quantisation methods used for similarity search. However, instead of using an expensive quantisation function that requires many multiply-adds, we introduce a family of quantisation functions that require no multiply-adds,” said the researchers in their recent work, which was presented at the ICML 2021 conference. The quantisation method, which is key to this algorithm, refers to a classic vector quantisation algorithm used for approximating inner products and Euclidean.
To assess MADDNESS’s effectiveness, the researchers implemented their method and existing algorithms in C++ and Python using a single thread on a Macbook Pro with a 2.6GHz Intel Core i7-4960HQ processor. Regarding the use of CPUs instead of GPUs, the authors stated that trained ML models are commonly deployed on billions of smartphones with just CPUs. This simulated a real-world scenario while also making it easy for researchers in the future to benchmark their work without needing an expensive GPU setup.
On validating the MADDNESS method on hundreds of matrices from diverse domains, Blalock and Guttag observed that their method “often” runs 100× faster than exact matrix products and 10× faster than current approximate methods.
That said, the authors also admit that their method will fall short when applied on state-of-the-art algorithms as convolution operations for SOTA models exploit structure that is not available to general matrix multiply algorithms. To match the performance, modifications have to be done to the MADDNESS approach. However, Blalock and Guttag believe that their method can inspire other research that can eventually lead to the acceleration of convolution operations and other workloads bottlenecked by linear transforms.