Last year, Facebook open-sourced GTN (graph transformer networks), a framework for automatic differentiation with weighted finite-state transducers (WFSTs). To put things in perspective, GTN is to WFSTs what PyTorch is to tensors: a framework for automatic differentiation, but over graphs instead of tensors. GTN can be used to train graph-based machine learning models effectively.
What are WFSTs?
WFSTs are a widely used tool in speech recognition and language processing. A WFST is a graph data structure that combines different sources of information, and it appears in applications such as speech recognition, natural language processing, and handwriting recognition. For example, a standard speech recogniser combines an acoustic model, which predicts the letters or phonemes in a speech snippet, with a language model, which predicts the likelihood of a given word following another. Each of these models can be represented as a WFST; the WFSTs can be trained separately and then composed, so the system outputs the most likely transcription by combining the constraints of both models.
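The combination of constraints can be sketched with a toy example. This is not the GTN API, just an illustration of the idea: each model assigns log-scores, composing them means adding the log-scores along a path, and decoding means picking the best-scoring path the lexicon accepts. The symbols, scores, and word list below are made up for illustration.

```python
import math

# Toy "acoustic" scores: log-probability of each letter at each time step.
# A toy lexicon restricts which letter sequences count as valid words,
# playing the role of a lexicon/language WFST.
acoustic = [
    {"c": math.log(0.6), "k": math.log(0.4)},  # step 1
    {"a": math.log(0.9), "o": math.log(0.1)},  # step 2
    {"t": math.log(0.7), "b": math.log(0.3)},  # step 3
]
lexicon = {"cat", "cot", "kab"}

def best_transcription():
    """Score every word the lexicon accepts under the acoustic model and
    return the highest-scoring one -- a brute-force stand-in for composing
    the two WFSTs and finding the best path (weights add in log space)."""
    def score(word):
        return sum(acoustic[i][ch] for i, ch in enumerate(word))
    return max(lexicon, key=score)

print(best_transcription())  # the path through "cat" has the highest score
```

Here the acoustic model alone slightly prefers "c-a-t" at every step, but even if it did not, the lexicon constraint would rule out letter sequences that form no valid word, which is exactly what composing with a lexicon WFST buys.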
However, this approach has a drawback. Combining learned models with WFSTs only at inference time leads to issues such as exposure bias and label bias, which arise primarily from practical considerations. There had been little prior work on making training with WFSTs tractable, and no implementation supported automatic differentiation through WFST operations in a high-level and efficient manner.
GTN for WFST
To remedy this, the researchers developed a framework for automatic differentiation through operations on WFSTs. This framework can be leveraged to design and experiment with both existing and novel learning methods. A framework for differentiable WFSTs allows a model to learn from training data while incorporating prior knowledge in the best possible way.
Previously, developers had to hard-code the graph structure into the software. With GTN, researchers can instead use WFSTs dynamically at training time, and the whole system can learn and improve from the data more efficiently.
In a recent talk, Awni Hannun, an AI researcher at Facebook and one of the authors of this study, said, “GTNs are a way to perform operations on the graphs coupled with automatic differentiation. In the case of GTNs, instead of tensors, you have graphs, and instead of matrix multiplication, convolution, and pointwise operations, you have different interesting graph operations. Just like the operations you can do on tensors, these graphs are differentiable; it means that you can differentiate the output of the operations with respect to the input.”
With GTN, researchers can easily construct WFSTs, visualise them, and perform operations on them. For example, calling the function gtn.backward computes gradients with respect to any graph participating in the computation.
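The idea behind differentiating through a graph can be shown in a self-contained sketch. This is not the GTN library's implementation, just the underlying maths for the simplest possible case: a graph with one start node, one accept node, and two parallel arcs whose weights are the parameters. The forward score is a log-sum-exp over the accepting paths, and its gradient with respect to each arc weight turns out to be that arc's posterior probability.

```python
import math

def forward_score(weights):
    """Log-sum-exp over all accepting paths of a tiny two-node graph with
    parallel arcs: each arc weight is the log-score of one complete path."""
    m = max(weights)
    return m + math.log(sum(math.exp(w - m) for w in weights))

def backward(weights):
    """Gradient of the forward score w.r.t. each arc weight: the posterior
    probability of taking that arc (a softmax over the arc weights)."""
    s = forward_score(weights)
    return [math.exp(w - s) for w in weights]

grads = backward([1.0, 2.0])
print(grads)  # the gradients form a probability distribution over the arcs
```

Because the gradients are arc posteriors, they sum to one, and the heavier-weighted arc receives the larger gradient, which is the behaviour a training loop needs in order to push probability mass toward better paths.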
GTN’s programming style is comparable to that of popular frameworks such as PyTorch: the imperative style, autograd API, and autograd implementation follow similar design principles. The main difference is that tensors are replaced with WFSTs.
Replacing tensors with graphs allows researchers to encode more useful prior knowledge about a task into a learning algorithm. For example, GTN lets researchers encode the pronunciation of a word as a graph and incorporate that graph into the learning algorithm.
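Encoding a pronunciation as a graph is simple in principle. The sketch below uses a hypothetical minimal `Graph` class, not the actual GTN API, and an example ARPAbet-style pronunciation of the word "data": the pronunciation becomes a left-to-right chain of arcs, one per phoneme, which could then be composed with other graphs during training.

```python
class Graph:
    """Hypothetical minimal graph type: nodes are implicit integers,
    arcs are (source, destination, label) triples."""
    def __init__(self):
        self.arcs = []

    def add_arc(self, src, dst, label):
        self.arcs.append((src, dst, label))

def pronunciation_graph(phones):
    """Encode a word's pronunciation as a linear chain of arcs,
    one arc per phoneme, from node 0 to node len(phones)."""
    g = Graph()
    for i, phone in enumerate(phones):
        g.add_arc(i, i + 1, phone)
    return g

# Example: "data" pronounced d-ey-t-ah becomes a four-arc chain.
g = pronunciation_graph(["d", "ey", "t", "ah"])
print(g.arcs)
```

A word with several valid pronunciations would simply get parallel branches in the same graph, and that structure, unlike a flat tensor encoding, is something the learning algorithm can exploit directly.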
GTNs give the freedom to experiment with a larger design space of structured learning algorithms by separating graphs from operations on graphs. Such creative freedom goes a long way in developing newer and better algorithms.
By using these graphs during training, the whole system learns and improves from data. In the future, the structure of WFSTs combined with learning from data can help machine learning models become more accurate, modular, and lightweight.
Read the full paper here.