The Transformer has become the de-facto model for text understanding and is widely used across artificial intelligence fields, including NLP, computer vision, video and audio processing. However, it comes with a shortfall: its self-attention has quadratic complexity in the input sequence length, which makes it inefficient and compute-heavy on long inputs.
Since the introduction of the Transformer in 2017, many methods have been proposed to accelerate the model, including Longformer, Linformer, BigBird, Reformer and Poolingformer. For instance, BigBird computes sparse attention instead of dense attention: it uses a mix of local attention, global attention at certain positions, and random attention between a certain number of token pairs.
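The sparse pattern described above can be sketched as a boolean attention mask. This is a minimal illustration, not BigBird's actual implementation; the function name, default window size and the counts of global and random tokens are all illustrative choices.

```python
import numpy as np

def bigbird_style_mask(n, window=3, n_global=2, n_random=2, seed=0):
    """Illustrative BigBird-style sparse attention mask (n x n boolean).

    True at (i, j) means position i may attend to position j. The mask
    combines a local sliding window, a few global tokens, and a few
    random links per row, so each row has O(1) True entries rather
    than n of them.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True                                    # local window
        mask[i, rng.choice(n, n_random, replace=False)] = True   # random links
    mask[:n_global, :] = True   # global tokens attend everywhere...
    mask[:, :n_global] = True   # ...and every token attends to them
    return mask
```

Because each row mixes only a bounded number of positions, the attention cost grows linearly with the sequence length instead of quadratically.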
However, sparse attention usually cannot fully model the global context. Linformer takes a different route, exploiting the low-rank structure of the self-attention matrix by computing an approximation: it projects the attention keys and values into low-dimensional matrices whose size is independent of the sequence length. But this approximation is context-agnostic, which may weaken the Transformer's context-modelling ability. Moreover, most of these methods are still not efficient enough when the input sequence is very long.
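The Linformer idea can be sketched in a few lines: the keys and values are projected along the sequence axis down to a fixed length k, so the attention map is n × k rather than n × n. A minimal sketch, assuming random projection matrices in place of the learned ones; all names here are illustrative:

```python
import numpy as np

def linformer_attention(Q, K, V, E, F):
    """Linformer-style low-rank attention sketch.

    Q, K, V: (n, d) query/key/value matrices.
    E, F:    (k, n) projections along the sequence axis, with k << n.
    The attention map becomes (n, k), giving linear cost in n.
    """
    K_proj = E @ K                                   # (k, d)
    V_proj = F @ V                                   # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])     # (n, k), not (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V_proj                          # (n, d)
```

Note that E and F are fixed regardless of what the tokens say, which is exactly the context-agnostic property the Fastformer authors point to as a weakness.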
To solve this problem, a team from Microsoft Research Asia and Tsinghua University has proposed Fastformer, an efficient Transformer variant based on additive attention. The new method achieves effective context modelling with linear complexity.
“In Fastformer, instead of modelling the pairwise interactions between tokens, we first use additive attention mechanisms to model global contexts, and then further transform each token representation based on its interaction with global context representations,” explained the team. That way, Fastformer can achieve effective context modelling with linear complexity.
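The mechanism the team describes can be sketched as follows: additive attention first condenses all queries into one global query vector, that vector interacts with each key via element-wise product, a second additive attention condenses the result into a global key, and each value is then transformed based on its element-wise interaction with that global key. This is a simplified single-head sketch based on the paper's description, not the authors' code; parameter names are illustrative.

```python
import numpy as np

def additive_pool(X, w):
    """Additive attention pooling: score each row of X against the
    learnable vector w, softmax the scores, return the weighted sum.
    Runs in O(n * d) for an (n, d) input."""
    scores = X @ w / np.sqrt(X.shape[-1])
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ X                       # a single (d,) summary vector

def fastformer_layer(Q, K, V, wq, wk, W):
    """Single-head Fastformer attention sketch.

    Q, K, V: (n, d) query/key/value matrices.
    wq, wk:  (d,) additive-attention parameters.
    W:       (d, d) output transform.
    Every step below is O(n * d), so the whole layer is linear in n.
    """
    q = additive_pool(Q, wq)   # global query summarising all tokens
    P = K * q                  # element-wise interaction with each key
    k = additive_pool(P, wk)   # global key summarising the interactions
    U = V * k                  # element-wise interaction with each value
    return U @ W + Q           # transform, then add the queries (residual)
```

There is no n × n matrix anywhere: pairwise token-token attention is replaced by token-to-global-context interactions, which is where the linear complexity comes from.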
Fastformer architecture (Source: arXiv)
The researchers experimented on five benchmark datasets in various tasks, including sentiment classification, topic prediction, news recommendation, and text summarisation.
The datasets used include Amazon (review rating prediction), IMDB (movie review rating prediction), MIND (news recommendation and news topic classification), CNN/DailyMail (text summarisation) and PubMed (text summarisation with much longer documents).
In their experiments, the researchers used GloVe embeddings to initialise the token embedding matrix. Further, to obtain text embeddings in the classification and news recommendation tasks, the team applied an additive attention network to convert the matrix output by Fastformer into a single embedding. Finally, in the news recommendation task, they used Fastformer hierarchically: first to learn news embeddings from news titles, and then to learn user embeddings from the embeddings of historically clicked news.
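The hierarchical setup can be sketched in two pooling levels: each clicked title's token matrix is pooled into a news embedding, and the resulting news embeddings are pooled into a user embedding. The `encode` function below is a stand-in for a trained Fastformer encoder, and all shapes and parameter names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(tokens):
    """Placeholder for a Fastformer encoder: maps (n, d) token embeddings
    to (n, d) contextualised representations. Identity here for brevity."""
    return tokens

def attn_pool(X, w):
    """Additive attention pooling: softmax-weighted sum of the rows of X."""
    scores = X @ w
    a = np.exp(scores - scores.max())
    a /= a.sum()
    return a @ X

d = 8
w_news, w_user = rng.normal(size=d), rng.normal(size=d)  # illustrative params
clicked_titles = [rng.normal(size=(5, d)) for _ in range(3)]  # 3 clicked news

# Level 1: pool each title's tokens into one news embedding
news_embs = np.stack([attn_pool(encode(t), w_news) for t in clicked_titles])
# Level 2: pool the news embeddings into a single user embedding
user_emb = attn_pool(encode(news_embs), w_user)
```

The same encoder-plus-pooling pattern is applied at both levels, only over different units: tokens within a title, then titles within a click history.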
The team used Adam for model optimisation and ran their experiments on an NVIDIA Tesla V100 GPU with 32 GB of memory. The researchers repeated each experiment five times and reported the average performance along with standard deviations.
For the classification tasks, the researchers used accuracy and macro-F1 scores as performance metrics. For the news recommendation task, the team used AUC, MRR, nDCG@5 and nDCG@10. Finally, they used the ROUGE-1, ROUGE-2 and ROUGE-L metrics to evaluate the generated summaries in the text summarisation tasks.
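Of these metrics, nDCG@k is worth unpacking: it sums the relevance of the top-k ranked items with a logarithmic position discount, then normalises by the best achievable score so a perfect ranking scores 1.0. A minimal sketch (function names are illustrative):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance of each of the top-k items,
    discounted by log2(rank + 1) so early positions count more."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the predicted ranking divided by the ideal DCG
    (the DCG of the same relevances sorted best-first)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

For example, ranking the only clicked article second out of two gives nDCG@2 of 1/log2(3) ≈ 0.63, while ranking it first gives 1.0.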
The experiments conducted by the Microsoft Research Asia and Tsinghua University researchers produced quite competitive results in long-text modelling. Furthermore, the results demonstrated that Fastformer was much more efficient than many existing Transformer variants.
Here are some of the key highlights of Fastformer:
- According to the researchers, Fastformer is the most efficient Transformer architecture to date.
- It models the interaction between global contexts and token representations via element-wise product, which helps to fully model context information in a more efficient way.
- Experiments on five datasets show that it is much more efficient than many Transformer models while achieving competitive performance.
In the future, the researchers said, they plan to pre-train Fastformer-based language models to better support NLP tasks that require long-document modelling. Besides this, they plan to explore applying Fastformer to other scenarios, such as e-commerce recommendation and ads CTR prediction, to improve user modelling based on long user behaviour sequences.
Addressing News Recommendation Bias
News recommendation is pivotal for personalised news access. Existing news recommendation methods infer users’ personal interests from their historically clicked news articles and train recommendation models by predicting future news clicks. In other words, news click behaviour is taken as an indicator of user interest.
In reality, click behaviour can also be affected by other factors, such as how news is presented on the online platform. For instance, articles displayed at higher positions and in larger sizes are usually more likely to be clicked. This bias in clicked news can introduce noise into user interest modelling and model training, hurting the personalised news recommendation model.
To solve this issue, the team has also proposed a bias-aware personalised news recommendation method called ‘DebiasRec’, which accounts for bias information for more accurate user interest inference and model training. It includes a bias representation module, a bias-aware user modelling module and a bias-aware click prediction module. “Experiments on two real-world datasets show that our method can effectively improve the performance of news recommendations,” said the researchers.