Researchers from Salesforce have released a powerful new generative language model that marks a milestone in the history of text generation. CTRL, short for Conditional Transformer Language Model, introduces a feature called control codes that generalises the model to a very wide range of text-generation applications. Control codes govern attributes of the generated text such as style, content, task-specific behaviour, topic, domain, dates, entities and the relationships between entities.
Present text-generation language models are usually trained conditionally in a task-oriented fashion, which limits their ability to generalise. Given an initial text prompt, these models generate text within the field they were trained on, capturing patterns as word vectors or contextualised word vectors. A model trained on one task can be adapted to a new one through transfer learning and subsequent fine-tuning, but the need for a generalised model that can be employed on any task has grown in recent years. Salesforce's CTRL meets that need by introducing control codes, which are supplied along with the text prompt before generation. Control codes are learnt from the structure of the raw training text and give a measure of control over the domain or area of interest of the generated text. With CTRL, human users can generate text in their field of interest simply by supplying a few control codes representing that field.
CTRL is a large-scale model with 1.63 billion parameters, making it the largest publicly released language model at the time of its announcement. It was trained on 140 GB of text from sources such as Wikipedia, Project Gutenberg, Amazon reviews and Reddit. The sources also include a large collection of news data, Europarl and UN data from WMT, question-answer pairs from ELI5, and the MRQA shared task datasets NewsQA, TriviaQA, SearchQA and HotpotQA. CTRL was originally implemented in TensorFlow on top of the original Transformer architecture with a vocabulary of roughly 250,000 tokens. Training was distributed across 256 cores of a Cloud TPU v3 Pod and ran for 800,000 iterations over a period of about two weeks.
How do control codes work in CTRL?
According to CTRL's developers, the idea of control codes evolved from generative models in computer vision, which offer fine control over what is generated.
In present state-of-the-art text generation, the next token is chosen as the highest-probability entry in a distribution over all possible tokens. This probability distribution is computed with the chain rule of probability, accumulating the contribution made by all previous tokens in the sequence.
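This chain-rule factorisation can be written as:

```latex
p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i})
```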
Here, p(x) denotes the probability of generating the sequence x, x_i denotes the token currently under prediction, and n is the length of the sequence.
In other words, the probability calculation depends purely on the previously prompted or previously predicted tokens.
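This autoregressive loop can be sketched in a few lines of Python. The "model" below is a toy stand-in for a Transformer's output layer (an assumption for illustration only); the point is the structure of the loop, in which each step conditions only on the tokens generated so far and greedily picks the highest-probability next token.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a vector of scores.
    e = np.exp(z - z.max())
    return e / e.sum()

def next_token_scores(tokens, vocab_size=5):
    # Toy stand-in for a trained model: deterministic pseudo-random
    # scores that depend only on the tokens seen so far.
    rng = np.random.default_rng(hash(tuple(tokens)) % (2**32))
    return rng.normal(size=vocab_size)

def generate(prompt, length=4):
    tokens = list(prompt)
    for _ in range(length):
        probs = softmax(next_token_scores(tokens))
        tokens.append(int(probs.argmax()))  # greedy: pick the most probable token
    return tokens

print(generate([1, 2]))
```

Real systems often replace the greedy `argmax` with sampling strategies such as temperature scaling or nucleus sampling, but the chain-rule structure is the same.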
In the CTRL conditional model, control codes exert firm control over the prediction of the next token by conditioning the distribution on them.
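Conditioning on a control code changes the factorisation to:

```latex
p(x \mid c) = \prod_{i=1}^{n} p(x_i \mid x_{<i}, c)
```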
Here, c denotes the control codes, other notations being the same as above.
Whether the prompt is human-written or model-generated, CTRL requires a control code in the prescribed format to make an inference. Given identical prompts, different control codes lead the model to generate different texts; given no prompt at all, the model still produces output based on the control codes alone. Control codes can also be combined to attain finer-grained control over generation.
Most control codes in CTRL specify the overall style of the generated text by denoting a specific domain of the training data. Additional control is gained by appending further codes to the domain code. A URL can also serve as an additional control code by specifying 'Links' as the domain code; URLs can encode various features, including domain, subdomain, entities, entity relations and dates.
Complex tasks such as translation and question answering can be initiated simply by providing combinations of control codes, such as source and target languages or a question, along with the domain and other task-specific attributes.
Control codes even enable the model to generate text in a specific language within a specific domain when no training data for that language exists in that domain. This zero-shot code-mixing gives CTRL versatility and robustness on difficult text tasks.
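A few prompt shapes illustrate how the ideas above combine. The domain codes shown (Wikipedia, Reviews, Links, Translation) appear in the CTRL paper, but the example sentences, the rating value, the URL and the exact translation syntax here are illustrative assumptions, not verbatim model inputs:

```python
# Illustrative CTRL-style prompts. The leading word is the control code;
# everything after it is the (hypothetical) prompt text.
prompts = [
    # Domain code alone: the generated text follows that domain's style.
    "Wikipedia Salesforce is an American cloud software company",
    # Domain code plus an attribute code for finer control.
    "Reviews Rating: 4.0 I bought this laptop last month",
    # 'Links' domain code: the URL's parts act as extra conditioning.
    "Links https://www.cnn.com/2019/09/25/business/",
    # Task-style codes for translation (source and target languages).
    "Translation English : German : The weather is nice today.",
]

for p in prompts:
    control_code = p.split()[0]
    print(control_code, "->", p)
```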
Python Implementation of CTRL
Step-1: Enable GPU
Inference on the pre-trained CTRL model needs a GPU. Verify that a GPU is available using the following command.
!nvidia-smi
Output:
Step-2: Clone pre-trained CTRL Model
Download the pre-trained model, its dependencies and the necessary files to the local machine or cloud environment.
!git clone https://github.com/salesforce/ctrl
Output:
Change the directory to proceed with the downloaded contents.
%cd ctrl/
Step-3: Enable low-memory inference
Text generation may consume a lot of memory. The official source-code repository has a separate branch named lower_memory that reduces memory consumption during inference. Check out the lower_memory branch before performing inference.
!git checkout lower_memory
Output:
Step-4: Install TensorFlow-based dependencies
CTRL needs the GPU build of tensorflow, fastBPE for byte-pair-encoding tokenisation, and gsutil for downloading data. The following commands install those packages.
%%bash
pip2 install tensorflow-gpu==1.14
patch -b /usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/keras.py estimator.patch
pip2 install fastBPE
pip2 install gsutil
Step-5: Restore specific checkpoint of CTRL for inference
A few trained checkpoints of the model are available for restoring and running inference. The following command copies one of the officially released checkpoints from cloud storage to the local model directory.
!gsutil -m cp -r gs://sf-ctrl/seqlen256_v1.ckpt .
Output:
Step-6: Sample generation of text
By running the generation.py file against the CTRL model, we can sample the text-generation process. Note that inference requires a CUDA-enabled GPU. The following command generates text for a control-code prompt based on a link; the model extracts the domain, subdomain and other necessary attributes from the keywords available in the link itself. As the model generates text, it prints progressively until the end.
!python2 generation.py --model seqlen256_v1.ckpt/model.ckpt-413000.data-00000-of-00001
Output:
Step-7: Sample generation of text with print_once flag
CTRL supports inference with the print_once flag, which prints the whole text only after generation ends. In this example, the control code is 'Books' and the prompt is 'Books Weary with toil, I haste me to my bed'.
!python2 generation.py --model seqlen256_v1.ckpt/model.ckpt-413000.data-00000-of-00001 --print_once
Output:
Wrapping up
CTRL, the Conditional Transformer Language Model, is trained with control codes so that human users can easily perform text generation, machine translation and other related natural-language tasks. With 1.63 billion parameters, it is the largest publicly released generative language model to date. Control codes let users specify domains, subdomains and any applicable attributes to obtain fine-grained, higher-quality generations. Future improvements to CTRL could come from refining the control codes themselves.