Now Reading
Hands-On Guide To Markov Chain For Text Generation

Hands-On Guide To Markov Chain For Text Generation

Bhoomika Madhukar
W3Schools

Markov chains are a very simple and easy way to create statistical models on a random process. They have been used for quite some time now and mostly find applications in the financial industry and for predictive text generation. Markov chains became popular due to the fact that it does not require complex mathematical concepts or advanced statistics to build it. They are a great way to start learning about probabilistic modelling and data science implementations. 

In this article, we will learn about

  • What are Markov chains?
  • The model
  • Implementation of a predictive text generator using Markov chains. 

What are Markov chains?

Markov chains are random determined processes with a finite set of states that move from one state to another. These sets of transitions from state to state are determined by some probability distribution. 



markov chain

(Source)

Consider the scenario of performing three activities: sleeping, running and eating ice cream. Each node contains the labels and the arrows determine the probability of that event occurring. In the above example, the probability of running after sleeping is 60% whereas sleeping after running is just 10%.

The important feature to keep in mind here is that the next state is entirely dependent on the previous state. The next state is determined on a probabilistic basis. Hence Markov chains are called memoryless. Since they are memoryless these chains are unable to generate sequences that contain some underlying trend. They simply lack the ability to produce content that depends on the context since they cannot take into account the full chain of prior states.

The model 

A Markov chain typically consists of two entities: A transition matrix and an initial state vector. 

As we saw above, the next state in the chain depends on the probability distribution of the previous state. These probabilities are represented in the form of a transition matrix. The transition matrix for the earlier example would look like this.

markov chain

Where S is for sleep, R is for run and I stands for ice cream.

 If the Markov chain has M possible states, the transition matrix would be M x M, such that entry (I, J) is the probability of transitioning from the state I to state J.The rows of the transition matrix should add up to 1 because they are probability distribution and each state will have its own probability. 

The second entity is an initial state vector which is an Mx1 matrix. This matrix describes the probability distribution of M possible values. The entry I mean the probability beginning at the state I. 

We know how to obtain the transitions from one state to another, but we need to be able to find the chances of that transition occurring over multiple steps. To do this, we need to determine the probability of moving from the state I to J over N iterations. Since the transition matrix is given, this can be calculated by raising N to the power of M. For small values of N, this can easily be done with repeated multiplication. 

Implementation of a text generator with Markov chain

Upon understanding the working of the Markov chain, we know that this is a random distribution model. We will use this concept to generate text. I will implement it both using Python code and built-in functions. Let’s get started. 

Coding from scratch

This model is a very simple single-function model. The dataset used for this can be download from this link. Once we have downloaded the data be sure to read the content of the entire dataset once. 

Now we will write a function that performs the text generations. 

import random
def markov_text():
    data_sample = "text.txt"
    text_data = open(data_sample, 'r').read()
    text_data = ''.join([i for i in text_data if not i.isdigit()]).replace("\n", " ").split(' ')
    index = 1
    markov_gen = {}
    word_count = int(input('select the number of words you want to generate'))

Here we have opened our file and written all the sentences into new lines. We will create a dictionary of words in the markov_gen variable based on the number of words you want to generate. 

for character in text_data[index:]:
        key = text_data[index-1]
        if key in markov_gen:
            markov_gen[key].append(character)
        else:
            markov_gen[key] = [character]
        index += 1

Next, we analyse each word in the data file and generate key-value pairs. 

    character1 = random.choice(list(markov_gen.keys()))
    message = character1.capitalize()
    while len(message.split(' ')) < word_count:
        character2 = random.choice(markov_gen[character1])
        character1 = character2
        message += ' ' + character2
    return message

Finally, we will create a range of random choice of words from our dictionary and display the output on the screen. I will give the word count to be 20. 

text generation, Markov Chain

At first glance, this may look like something an actual human being says or types. But looking closely you will notice that it is just a random set of words together. Also, note that this sentence does not appear in the original text file and is generated by our model. 

Using the built-in method

To make the implementation of Markov chains easy, you can make use of the built-in package known as markovify. 

See Also

To install this use the following command 

pip install markovify

We will implement this for the same dataset used above. 

import markovify
with open("/content/gdrive/My Drive/text.txt") as f:
    data = f.read()

Next, you can choose how many sentences you want to generate by assigning the sentence count in the for-loop. I have generated 3 sentences here. 

data_model = markovify.Text(data)
for i in range(3):
    print(data_model.make_sentence())

The Text method is for the generation of random sentences from our data. 

Markov Chain

Again, these sentences are only random. Another option with this package is to choose how many characters should be in the sentences. 

for i in range(3):
    print(data_model.make_short_sentence(280))

Here, it prints 3 sentences with a maximum of 280 characters.

text generation, Markov Chain

Every time the program is run a new output is generated because Markov models are memoryless. We have successfully built a Markov chain text generator using custom and built-in codes. 

Conclusion 

Markov chains are a very simple and easy way to generate text that mimics humans to some extent. But, for effectively generate text, the text corpus needs to be filled with documents that are similar. Simple Markov chains are the building blocks of other, more sophisticated, modelling techniques. These models can be powerful tools for NLP and deep learning as well. 

What Do You Think?

If you loved this story, do join our Telegram Community.


Also, you can write for us and be one of the 500+ experts who have contributed stories at AIM. Share your nominations here.

Copyright Analytics India Magazine Pvt Ltd

Scroll To Top