
How to build an AI model that writes like Chetan Bhagat





I am a huge Chetan Bhagat fan and have read all his novels. Still, I couldn’t get enough of his crazy, fun and emotional writing, and I wanted more. So I decided to build an AI model that writes like him.

How the AI model works

  1. I took excerpts from his novels, so the material falls under fair use.
  2. Then I cleaned the text to make sure each line has only one sentence.
  3. This was added to a dataframe to build a dataset for training.
  4. I took data from the dataset and pre-processed it for training.
  5. Then I fine-tuned a custom GPT-2 model.
  6. Finally, the trained model was loaded and can now generate text like Chetan Bhagat.

Let’s get started.

Dataset

Let’s get the imports first:

from io import StringIO
 
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

First, let’s get Chetan Bhagat’s Half Girlfriend from https://archive.org.
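We can pull the PDF down with a short script. Here is a minimal sketch using requests; the download URL below is a placeholder, so substitute the actual link from the archive.org page:

import requests

# NOTE: this URL is a placeholder -- replace <item-id> with the actual
# archive.org download link for the book.
url = "https://archive.org/download/<item-id>/half-girlfriend.pdf"
response = requests.get(url)
with open("half_girlfriend.pdf", "wb") as f:
    f.write(response.content)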

Now, let’s write a function that makes sure each line contains only one sentence.

# A function to make sure there is only one sentence in each line of a txt file.
import re

def one_sentence_per_line(text):
    # Split on sentence-ending punctuation followed by whitespace,
    # drop empty lines, and re-join with one sentence per line.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return '\n'.join(s.strip() for s in sentences if s.strip())
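A quick sanity check on a made-up sample:

sample = "Madhav met Riya at the basketball court. She smiled. He froze!"
print(one_sentence_per_line(sample))
# Madhav met Riya at the basketball court.
# She smiled.
# He froze!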

After that, let’s write a function to convert the PDF into a text file.

# A function to extract the text from a PDF file.
def pdf_to_txt(pdf_file):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(pdf_file, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    # Run the interpreter over every page and accumulate the text.
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

Now, let’s write a function that cleans the text and saves it to a new clean.txt file.

# Extract, clean, and save the text to a file named "clean.txt".
def clean_txt(pdf_file):
    text = pdf_to_txt(pdf_file)
    text = one_sentence_per_line(text)
    f = open("clean.txt", "w", encoding="utf-8")
    f.write(text)
    f.close()
    return text

At last, let’s call the function.

clean_txt("C:/Users/Chetan/Desktop/test.pdf")

And now we have a clean.txt file with all of the text cleaned and ready to be made into a dataset.


Now, we can finally make the dataset by adding this data into a dataframe:

import pandas as pd

file = open('/content/clean.txt', 'r')
lines = file.readlines()

# Add all the lines to a list.
all_lines = []
for line in lines:
    all_lines.append(line)

# Remove all the empty lines.
all_lines = list(filter(None, all_lines))
# Remove all the lines with only one character.
all_lines = list(filter(lambda x: len(x) > 1, all_lines))
# Strip stray whitespace from each line.
all_lines = list(map(lambda x: x.strip(), all_lines))

# Create a dataframe and save it as a CSV.
df = pd.DataFrame()
df['Dialogue'] = all_lines
df.to_csv('/content/TrainChetan.csv')
file.close()

And now we have the TrainChetan.csv file.

And now it’s a Kaggle dataset. Check it out here.

The dataset has 8003 unique values and is ready for training models.
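As a quick sanity check (assuming the CSV path used above), we can confirm the count:

import pandas as pd

df = pd.read_csv('/content/TrainChetan.csv')
print(df['Dialogue'].nunique())  # should print 8003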

Now, let’s run sentiment analysis on the text to get more training signal. I used TextBlob for sentiment analysis.

!pip install textblob
from textblob import TextBlob

# Polarity runs from -1 (negative) to 1 (positive);
# subjectivity runs from 0 (objective) to 1 (subjective).
df['score'] = df['Dialogue'].apply(lambda x: TextBlob(x).sentiment.polarity)
df['sentiment'] = df['Dialogue'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
df.to_csv('/content/TrainChetan_after_sentiment_analysis.csv')

This gives us a new CSV file with the polarity and subjectivity columns added.

I wasn’t satisfied with the output of the sentiment analysis, so I rewrote it using NLTK libraries.

Let’s get the imports and downloads first:

import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
# Fetch the positive/negative opinion word lists used in the next step.
!gdown --id 1BGTVDjg8EwzUJmSwsmn2sEz9lJD2_3-w
!gdown --id 1xsYC2UF1JAR7BIiNSU4iGbTZytYNzYof

This fetches everything needed: the NLTK tokenizer, stopword, and lemmatizer data, plus the positive and negative word lists.

Now, let’s get the data and preprocess and run sentiment analysis on it.

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

df = pd.read_csv('/content/TrainChetan.csv', usecols=['Dialogue'])
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')

# Lowercase, keep only letters, tokenize, drop stopwords, and lemmatize.
def text_prep(x: str) -> list:
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize

preprocess_tag = [text_prep(i) for i in df['Dialogue']]
df["preprocess_txt"] = preprocess_tag
df['total_len'] = df['preprocess_txt'].map(lambda x: len(x))

# Load the positive/negative word lists downloaded earlier.
file = open('negative-words.txt', 'r', encoding="ISO-8859-1")
neg_words = file.read().split()
file = open('positive-words.txt', 'r', encoding="ISO-8859-1")
pos_words = file.read().split()

# Count positive and negative words per line, then score each line as
# (positives - negatives) / total tokens.
num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg
df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2)
df.to_csv('/content/TrainChetan_after_sentiment_analysis_nltk.csv')
df.head()

Let’s see the output:

As you can see, this sentiment analysis is much more detailed and can be used for training our model.
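To make the scoring rule concrete, here is a toy example on a made-up, already-preprocessed line (assuming 'love' and 'fun' appear in the positive word list and 'boring' in the negative one):

tokens = ['love', 'fun', 'boring', 'delhi']   # a preprocessed line
pos_count = 2   # 'love', 'fun'
neg_count = 1   # 'boring'
score = round((pos_count - neg_count) / len(tokens), 2)
print(score)  # 0.25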

I will train an NLP model with the data from the dataset. I went through a lot of NLP models and decided on one based on GPT-2.

So, we will be using the aitextgen library for this project. Let’s install it.

!pip install -q aitextgen #install the main package

Let’s import Aitextgen:

from aitextgen import aitextgen

Check your GPU (be sure to use GPU runtime if you are doing this in Google Colab or Kaggle Notebook):

!nvidia-smi

Download the 124M GPT-2 Model:

ai = aitextgen(tf_gpt2="124M", to_gpu=True)

Now, let’s read the dataset that I made above:

input_file = pd.read_csv("/content/TrainChetan_after_sentiment_analysis_nltk.csv")
pd.set_option('display.max_colwidth', None)

Now, we need to clean the dataset by removing unwanted characters and keeping only the Dialogue column.

input_file["Dialogue"]  = input_file["Dialogue"].str.replace('(','').str.replace(')','')

Let’s see the shape of the new dataframe.

df = pd.DataFrame(input_file["Dialogue"])
df.shape

We have 8248 individual values.

Now let’s split each line on hyphens and explode the parts into separate rows for better training.

df = df.assign(var1=df['Dialogue'].str.split('-')).explode('var1')
df.var1 = df.var1.str.lstrip()
df.shape
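To see what this step does, here is a quick illustration on a made-up row:

demo = pd.DataFrame({'Dialogue': ['I love Delhi - said Madhav']})
demo = demo.assign(var1=demo['Dialogue'].str.split('-')).explode('var1')
demo.var1 = demo.var1.str.lstrip()
print(demo['var1'].tolist())  # ['I love Delhi ', 'said Madhav']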

Now, we have data ready for training. Let’s save the cleaned text in a .txt file:

df.to_csv("input_text_cleaned.txt", columns=["var1"], header=False, index=False)

Now we can use the above text file to fine-tune the model with the right parameters.

Let’s mount Google Drive to save the model there:

!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
# Mount Google Drive (in Colab) so training checkpoints can be saved there.
gpt2.mount_gdrive()

Time to fine-tune the model to our needs:

run_name = 'ChetanAI'
ai.train('input_text_cleaned.txt',
         run_name=run_name,
         line_by_line=False,
         from_cache=False,
         num_steps=5000,
         generate_every=100,
         save_every=500,
         save_gdrive=True,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1)
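Here, num_steps sets how long training runs, generate_every=100 prints a sample every 100 steps so you can watch the style emerge, save_every=500 checkpoints the model, and save_gdrive=True copies each checkpoint to the mounted Drive. batch_size=1 keeps memory usage low on a single Colab GPU, and fp16=False sticks to full precision.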

It is finally done!

Our custom GPT-2 Trained model is ready. Now, we can directly load the trained model.
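If you come back in a fresh session, here is a minimal sketch for reloading, assuming training saved to aitextgen’s default trained_model folder:

from aitextgen import aitextgen

# Reload the fine-tuned model from the default output folder.
ai = aitextgen(model_folder="trained_model", to_gpu=True)

With the model loaded, let’s generate some text: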

ai.generate(n=3,
            prompt="This is something Chetan would write:",
            batch_size=1,
            max_length=50,
            temperature=1.0,
            top_p=0.9)
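Here n=3 asks for three samples, temperature=1.0 samples at the model’s natural randomness, and top_p=0.9 applies nucleus sampling, restricting each step to the smallest set of tokens whose cumulative probability reaches 0.9.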

The outputs:


For the Colab notebook, go here.



Eeman Majumder

I am a coding enthusiast, mostly focused on AI development. I am studying B.Tech CSE with a specialisation in Artificial Intelligence and Machine Learning. I have worked on artificial intelligence projects like YOLOv5, OCR for SBI, and 48 others.