Google Adds Fast WordPiece Tokenization To TensorFlow

Google’s LinMaxMatch approach speeds up tokenisation and reduces its computational complexity

Google presented ‘Fast WordPiece Tokenization’ at EMNLP 2021, where it developed an improved end-to-end WordPiece tokenisation system. It speeds up the tokenisation process, saving computing resources and reducing overall model latency. Compared with traditional algorithms, the approach is up to 8x faster, as it reduces the computational complexity by an order of magnitude. Google has applied it in a number of systems and has released it in TensorFlow Text.
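
The released tokeniser can be tried directly. Below is a minimal sketch, assuming a recent tensorflow_text release (2.7 or later, where the op is exposed as FastWordpieceTokenizer) and a toy vocabulary in place of a real model’s vocab file:

```python
# Minimal sketch, assuming tensorflow_text >= 2.7; the vocabulary is a toy
# illustration, not one Google ships.
import tensorflow as tf
import tensorflow_text as tf_text

vocab = ["[UNK]", "they", "##'", "##re", "the", "great", "##est"]
tokenizer = tf_text.FastWordpieceTokenizer(
    vocab=vocab,
    token_out_type=tf.string,   # return string pieces instead of ids
    no_pretokenization=True,    # single-word mode: input is pre-split
)
print(tokenizer.tokenize(["greatest"]))  # expected: [[b'great', b'##est']]
```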

Using the greedy longest-match-first strategy to tokenise a single word, WordPiece iteratively picks the longest prefix of the remaining text that matches a word in the model’s vocabulary. This approach, called MaxMatch, has been used in Chinese word segmentation since the 1980s. Despite its wide use in NLP, it is still computationally intensive: with backtracking, the worst-case running time is quadratic in the input word length n. Google has proposed an alternative, LinMaxMatch, whose tokenisation time is strictly linear in n.
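
To make the cost of the greedy strategy concrete, here is a plain-Python sketch of MaxMatch, assuming a BERT-style vocabulary where non-initial pieces carry the ‘##’ prefix (the vocab below is an illustrative assumption). The inner scan over ever-shorter prefixes is what makes the worst case quadratic:

```python
# Greedy longest-match-first (MaxMatch) sketch; quadratic in the worst case
# because each outer step rescans shrinking prefixes of the remaining text.
def max_match(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial piece marker
            if candidate in vocab:
                piece = candidate             # longest match found
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"they", "##'", "##re", "the", "great", "##est"}
print(max_match("greatest", vocab))  # ['great', '##est']
print(max_match("they're", vocab))  # ['they', "##'", '##re']
```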

When trie matching cannot match an input character at a given node, the standard algorithm backtracks to the last character where a token was matched and restarts the trie matching from there, which leads to repetitive and wasteful iterations. Instead of backtracking, Google’s method triggers a ‘failure transition’ in two steps: it first collects the precomputed tokens stored at that node, known as ‘failure pops’, and then follows the precomputed ‘failure link’ to a new node, from which trie matching continues. Since n operations are required just to read the entire input, LinMaxMatch is asymptotically optimal for the MaxMatch problem.
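
The matching loop can be sketched in plain Python. The Node fields below (goto, failure_link, failure_pops) are illustrative names for the paper’s precomputed structures, and the trie is hand-wired for a two-token vocabulary; the real precomputation over a full vocabulary is omitted:

```python
# Sketch of the LinMaxMatch loop: on a dead edge, emit the precomputed
# failure_pops and jump through failure_link instead of backtracking.
from dataclasses import dataclass, field

@dataclass
class Node:
    goto: dict = field(default_factory=dict)   # child edges, keyed by char
    failure_link: "Node" = None
    failure_pops: list = field(default_factory=list)

def lin_max_match(word, root, suffix_root):
    node, tokens = root, []
    for ch in word:
        while ch not in node.goto:
            if node.failure_link is None:
                return ["[UNK]"]              # no transition: unknown word
            tokens.extend(node.failure_pops)  # emit precomputed tokens
            node = node.failure_link          # jump, never backtrack
        node = node.goto[ch]
    while node is not root and node is not suffix_root:
        if node.failure_link is None:
            return ["[UNK]"]
        tokens.extend(node.failure_pops)      # flush tokens at end of input
        node = node.failure_link
    return tokens

# Hand-wired toy trie for the vocabulary {"a", "##b"}.
root, suffix_root = Node(), Node()
root.goto["a"] = Node(failure_pops=["a"], failure_link=suffix_root)
suffix_root.goto["b"] = Node(failure_pops=["##b"], failure_link=suffix_root)

print(lin_max_match("ab", root, suffix_root))  # ['a', '##b']
print(lin_max_match("ax", root, suffix_root))  # ['[UNK]']
```

Each input character is consumed exactly once, and the number of failure transitions is bounded by the input length, which is where the linear bound comes from.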

Existing systems pre-tokenise the input text, splitting it into words on whitespace and punctuation, and then call WordPiece tokenisation on each resulting word. Google has instead proposed an end-to-end WordPiece tokeniser that combines pre-tokenisation and WordPiece into a single, linear-time pass. It uses the LinMaxMatch trie matching and failure transitions, and checks for whitespace and punctuation only among the few input characters that are not handled by the trie loop. This makes it more efficient: it traverses the input only once, performs fewer whitespace/punctuation checks, and skips the creation of intermediate words.
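
In TensorFlow Text, this end-to-end behaviour appears to be the default mode of FastWordpieceTokenizer (no pre-splitting by the caller). A hedged sketch, again with a toy vocabulary chosen so the sample sentence is coverable:

```python
# End-to-end sketch, assuming tensorflow_text >= 2.7: raw text in, WordPiece
# pieces out in a single pass, with whitespace/punctuation handled internally.
import tensorflow as tf
import tensorflow_text as tf_text

vocab = ["[UNK]", "they", "'", "re", "the", "great", "##est"]
tokenizer = tf_text.FastWordpieceTokenizer(vocab=vocab, token_out_type=tf.string)
# Expected pieces along the lines of:
# [[b'they', b"'", b're', b'the', b'great', b'##est']]
print(tokenizer.tokenize(["they're the greatest"]))
```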
