Facebook recently open-sourced the graph transformer networks (GTN) framework for effectively training graph-based learning models. In this case, GTN will be used in automatic differentiation of weighted finite-state transducers (WFSTs), which is an expressive and powerful graph. With GTN, researchers can easily construct WFSTs, visualize them, and perform operations on them
This framework enables the separation of graphs from operations on them that helps in exploring new structured loss functions and which in turn makes the encoding of prior knowledge on learning algorithms easier. Further, in a paper published by Awni Hannun, Vineel Pratap, Jacob Kahn & Wei-Ning Hsu of the Facebook AI Research, in this regard, proposed a convolutional WFST layer to be used in the interior of a deep neural network for mapping lower-level to higher-level representations.
GTN is written in C++ and has bindings to Python. GTN can be used to express and design sequence-level loss functions. With this framework, Facebook aims to make experimentation with structures in learning algorithms much simpler.
How Does GTN Function
The use of WFST data structure is prevalent among speech recognition, natural language processing, and handwriting recognition applications. WFST, especially in the speech recognition systems, provides a common and natural representation for the hidden Markov models (HMM), context-dependency, grammar, pronunciation dictionaries, and weighted determinization algorithms to optimise time and space requirement. One of the most popular WFST-based products is the Kaldi toolkit for speech recognition which is trained to decode speeches. Kaldi heavily relies on OpenFST, which is an open-source WFST toolkit.
To understand the importance of GTN framework for a WFST graph, we consider a general speech recogniser. A speech recogniser consists of an acoustic model that predicts the letters in the speech, its language model, and also identifies the word that may follow. These models are represented as WFSTs and are trained separately before combining to output the most likely transcription. It is, at this juncture, that the GTN library steps in to train the different models, which in turn provides better results.
Before GTN, the use of the individual graphs at the training time was implicit, and the graph structure needed to be hard-coded in the software. Using GTN, however, researchers can use WFSTs dynamically at the training time, which in turn makes the whole system more efficient. It also helps researchers in constructing WFSTs and visualising them to perform functions on them. The gradients can be computed with respect to any participating graph by a simple call gtn.backward.
GTN’s programming style is similar to other popular frameworks such as PyTorch, in terms of the autograd API, imperative style, and autograd implementation. The only difference lies in replacing the tensors with WFSTs.
GTN allows separating the graphs from the operations on the graph, which gives greater freedom to experiment with a larger base of structures learning algorithms.
GTN is similar to PyTorch, which is an automatic differential framework for tensors. In the case of PyTorch, which is also called a ‘define-by-run framework’, the autograd package provides automatic differentiation for operations on tensors. Every iteration is different which allows dynamic modifications between epochs.
Despite being similar on many counts, GTN provides an edge over PyTorch or any other tensor-based framework. GTN tools can easily experiment to develop a better algorithm. In the case of GTN, the graph structure is suited for encoding suggestive prior knowledge, which can teach the whole system and help in improving the data.
Further, it is expected that in future, the structure of WFSTs, along with learning from the data, can make machine learning models lighter and more accurate. The use of WFST as a substitute for the tensor-based layer in a deep architecture is an interesting proposition. The paper (referenced above) concludes that WFSTs may be more effective than traditional layers for discovering significant representations from data.