Active Hackathon

Google Adds Fast Wordpiece Tokenization To Tensorflow

Google’s LinMaxMatch approach improves performance, makes computation faster and reduces complexity

Google presented the ‘Fast WordPiece Tokenization’ at EMNLP 2021, where they developed an improved end-to-end WordPiece tokenisation system. It has the capability to speed up the tokenisation process, saving computing resources and reducing the overall model latency. In comparison to traditional algorithms, this approach improved performance up to 8x faster as it reduces the complexity of the computation by order of magnitude. Google has applied this in a number of systems and has released it in TensorFlow Text.

Using the greedy longest-match-first strategy to tokenise a single word, WordPiece iteratively picks the longest prefix of the remaining text that matches a word in the model’s vocabulary. This approach is called MaxMatch and has been used in Chinese word segmentation since the 1980s. Despite its wide use in NLP, it is still computation intensive. Google has proposed an alternative to this – the LinMaxMatch, which has a tokenisation time that is strictly linear with respect to n.


Sign up for your weekly dose of what's up in emerging technology.

In other terms, if tire-matching cannot match an input character for a given node, the standard algorithm backtracks to the last character where a token was matched and restarts the trie matching procedure from there. This results in repetitive and wasteful iterations. Instead of backtracking, Google’s method triggers a ‘failure transition’ in two steps. It first collects the precomputed tokens stored at that node, known as ‘failure pops’ and then follows the precomputed ‘failure link’ to a new node from where the trie matching process continues. As n operations are required to read the entire input, LinMaxMatch is asymptotically optimal for the MaxMatch problem.

As the existing systems pre-tokenise the input text by splitting it into words by whitespace, punctuation and characters, and then call WordPiece tokenisation on each resulting word, Google has proposed an end-to-end WordPiece tokeniser. It combines pre-tokenisation and WordPiece into a single, linear-time pass. It uses the LinMaxMatch trie-matching and failure transitions and only checks for whitespace and punctuation characters among the few input characters that are not handled by the loop. This makes it more efficient as it traverses the input only once, performs fewer whitespace/punctuation checks, and skips the creation of intermediate words.

More Great AIM Stories

Meeta Ramnani
Meeta’s interest lies in finding out real practical applications of technology. At AIM, she writes stories that question the new inventions and the need to develop them. She believes that technology has and will continue to change the world very fast and that it is no more ‘cool’ to be ‘old-school’. If people don’t update themselves with the technology, they will surely be left behind.

Our Upcoming Events

Conference, in-person (Bangalore)
Cypher 2022
21-23rd Sep

Conference, in-person (Bangalore)
Machine Learning Developers Summit (MLDS) 2023
19-20th Jan

Conference, in-person (Bangalore)
Data Engineering Summit (DES) 2023
21st Apr, 2023

3 Ways to Join our Community

Discord Server

Stay Connected with a larger ecosystem of data science and ML Professionals

Telegram Channel

Discover special offers, top stories, upcoming events, and more.

Subscribe to our newsletter

Get the latest updates from AIM

Council Post: Enabling a Data-Driven culture within BFSI GCCs in India

Data is the key element across all the three tenets of engineering brilliance, customer-centricity and talent strategy and engagement and will continue to help us deliver on our transformation agenda. Our data-driven culture fosters continuous performance improvement to create differentiated experiences and enable growth.

Ouch, Cognizant

The company has reduced its full-year 2022 revenue growth guidance to 8.5% – 9.5% in constant currency from the 9-11% in the previous quarter