I am a huge Chetan Bhagat fan and have read all his novels. I couldn’t get enough of Chetan Bhagat and I wanted more of his crazy, fun and emotional writing. So, I decided to build an AI model that writes like him.
How does the AI model work
- I took parts from his novels so it falls under fair use.
- Then, I cleaned the text to make sure each line has only one sentence.
- This was added to a dataframe and I built a dataset for training.
- I took data from the dataset and pre-processed it for training.
- Then, I trained a custom GPT-2 model.
- The model was loaded and can now generate text like Chetan Bhagat.
Let’s get started.
Dataset
Let’s get the imports first:
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
First, let’s get Chetan Bhagat’s Half Girlfriend from https://archive.org.
Now, let’s make a function to tidy the text toward one sentence per line. As written, it only removes blank lines.
# Tidy the text toward one sentence per line.
# Note: this version only drops blank lines; it does not split sentences.
def one_sentence_per_line(text):
    lines = text.splitlines()
    # keep every line that contains something other than whitespace
    new_lines = [line for line in lines if line.strip() != '']
    return '\n'.join(new_lines)
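If you want each line to really hold a single sentence, a rough sketch could look like the one below. It is not part of the original pipeline: `split_into_sentences` is a hypothetical helper, and it assumes a sentence ends with `.`, `!` or `?` followed by whitespace, so abbreviations such as "Mr. Sharma" would be split too early.

```python
import re

# Hypothetical helper, not part of the original pipeline.
# Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def split_into_sentences(text):
    """Return the text with one sentence per line, skipping blank pieces."""
    # Join wrapped lines into one stream of text before splitting.
    flat = ' '.join(line.strip() for line in text.splitlines() if line.strip())
    sentences = [s.strip() for s in _SENTENCE_END.split(flat)]
    return '\n'.join(s for s in sentences if s)

sample = "I looked at her. She smiled!\n\nWhy was I\nso nervous?"
print(split_into_sentences(sample))
# I looked at her.
# She smiled!
# Why was I so nervous?
```

For a novel with lots of dialogue and abbreviations, a proper sentence tokenizer would likely do better than this regex.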
After that, make a new function to extract the text from a PDF file.
# Extract the text of a PDF file and return it as a string.
def pdf_to_txt(pdf_file):
    # pdf_file = "C:/Users/Chetan/Desktop/test.pdf"
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means no page limit
    caching = True
    pagenos = set()  # an empty set means all pages
    # the with-block closes the PDF file even if a page fails to parse
    with open(pdf_file, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
            interpreter.process_page(page)
    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text
Now, let’s make a function to clean the text and make a new clean.txt file.
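# Write the cleaned text to a file named "clean.txt" and return it.
def clean_txt(pdf_file):
    text = pdf_to_txt(pdf_file)
    text = one_sentence_per_line(text)
    # write as UTF-8 so non-ASCII characters do not break on Windows
    with open("clean.txt", "w", encoding="utf-8") as f:
        f.write(text)
    return text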
At last, let’s call the function.
clean_txt("C:/Users/Chetan/Desktop/test.pdf")
And now we have a clean.txt file with all of the text cleaned and ready to be made into a dataset.
Here is the file.
Now, we can finally make the dataset by adding this data into a dataframe:
import pandas as pd

# read all the lines of the cleaned text
with open('/content/clean.txt', 'r', encoding='utf-8') as file:
    all_lines = file.readlines()

# strip surrounding whitespace (spaces, tabs, newlines, form feeds)
all_lines = [line.strip() for line in all_lines]
# remove empty lines and lines with only one character
all_lines = [line for line in all_lines if len(line) > 1]

# create a dataframe
df = pd.DataFrame({'Dialogue': all_lines})
# save the dataframe
df.to_csv('/content/TrainChetan.csv')
And now we have the TrainChetan.csv file:
And now it’s a Kaggle dataset. Check it out here.
The dataset has 8003 unique values and it is ready for training models.
Now, let’s run sentiment analysis on the text to get more data for training. I used TextBlob for sentiment analysis.
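!pip install textblob

from textblob import TextBlob

# polarity ranges from -1.0 (negative) to 1.0 (positive)
df['score'] = df['Dialogue'].apply(lambda x: TextBlob(x).sentiment.polarity)
# note: despite its name, this column stores subjectivity,
# which ranges from 0.0 (objective) to 1.0 (subjective)
df['sentiment'] = df['Dialogue'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
df.to_csv('/content/TrainChetan_after_sentiment_analysis.csv')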
Here is the new csv file after sentiment analysis:
I wasn’t satisfied with the output of the sentiment analysis, so I rewrote it using the NLTK library.
Let’s get the imports and downloads first:
import nltk

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

!gdown --id 1BGTVDjg8EwzUJmSwsmn2sEz9lJD2_3-w
!gdown --id 1xsYC2UF1JAR7BIiNSU4iGbTZytYNzYof
This downloads everything you need onto your system.
Now, let’s load the data, preprocess it and run sentiment analysis on it.
import re

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

df = pd.read_csv('/content/TrainChetan.csv', usecols=['Dialogue'])

lemma = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def text_prep(x: str) -> list:
    # lowercase, keep letters only, tokenize, drop stop words, lemmatize
    corp = str(x).lower()
    corp = re.sub('[^a-zA-Z]+', ' ', corp).strip()
    tokens = word_tokenize(corp)
    words = [t for t in tokens if t not in stop_words]
    lemmatize = [lemma.lemmatize(w) for w in words]
    return lemmatize

preprocess_tag = [text_prep(i) for i in df['Dialogue']]
df["preprocess_txt"] = preprocess_tag
df['total_len'] = df['preprocess_txt'].map(lambda x: len(x))

# load the word lists as sets for fast membership tests
with open('negative-words.txt', 'r', encoding="ISO-8859-1") as file:
    neg_words = set(file.read().split())
with open('positive-words.txt', 'r', encoding="ISO-8859-1") as file:
    pos_words = set(file.read().split())

num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg

# sentiment = (positive words - negative words) / total words
df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2)
df.to_csv('/content/TrainChetan_after_sentiment_analysis_nltk.csv')
df.head()
Let’s see the output:
As you can see, this sentiment analysis is more detailed (it gives word counts and a score per line) and can be used for training our model.
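To make the score concrete, here is a small worked sketch of the same formula, (positive words - negative words) / total words, on a single tokenized line. The word lists here are toy examples for illustration only, and `lexicon_sentiment` is a hypothetical helper. One assumption of mine: an empty token list is scored as a neutral 0.0 instead of dividing by zero.

```python
def lexicon_sentiment(tokens, pos_words, neg_words):
    """Return (positives - negatives) / token count, rounded to 2 places."""
    if not tokens:
        return 0.0  # assumption: treat lines with no tokens as neutral
    pos = sum(1 for t in tokens if t in pos_words)
    neg = sum(1 for t in tokens if t in neg_words)
    return round((pos - neg) / len(tokens), 2)

# Toy word lists for illustration only; the real ones come from the downloaded files.
pos_words = {"love", "happy", "great"}
neg_words = {"sad", "hate", "terrible"}

# two positive words, one negative word, four tokens: (2 - 1) / 4
print(lexicon_sentiment(["love", "great", "girl", "sad"], pos_words, neg_words))  # 0.25
print(lexicon_sentiment([], pos_words, neg_words))  # 0.0
```

A score near 1 means almost every remaining word is positive, a score near -1 means almost every word is negative, and 0 means they balance out or no listed word appears.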
I will train an NLP model with the data from the dataset. I went through a lot of NLP models and decided on a model based on GPT-2.
So, we will be using the aitextgen library for this project. Let’s install it.
!pip install -q aitextgen #install the main package
Let’s import Aitextgen:
from aitextgen import aitextgen
Check your GPU (be sure to use GPU runtime if you are doing this in Google Colab or Kaggle Notebook):
! nvidia-smi
Download the 124M GPT-2 Model:
ai = aitextgen(tf_gpt2="124M", to_gpu=True)
Now, let’s read the dataset that I made above:
# keep the loaded dataframe in input_file, which the next steps use
input_file = pd.read_csv("/content/TrainChetan_after_sentiment_analysis_nltk.csv")
pd.set_option('display.max_colwidth', None)
Now, we need to clean the dataset and remove unwanted columns and spaces.
# regex=False treats '(' and ')' as plain characters, not regex syntax
input_file["Dialogue"] = input_file["Dialogue"].str.replace('(', '', regex=False).str.replace(')', '', regex=False)
Let’s see the shape of the new dataframe.
df = pd.DataFrame(input_file["Dialogue"])
df.shape
We have 8248 individual values.
Now let’s split each line at its hyphens and put every piece on its own row for better training.
df = df.assign(var1=df['Dialogue'].str.split('-')).explode('var1')
df.var1 = df.var1.str.lstrip()
df.shape
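If the split-and-explode step is hard to picture, the sketch below shows roughly the same transformation in plain Python on two made-up lines. `split_and_explode` is a hypothetical helper for illustration, not something pandas provides.

```python
def split_and_explode(lines, sep='-'):
    """Split each line on `sep` and return every piece as its own row."""
    rows = []
    for line in lines:
        for piece in line.split(sep):
            rows.append(piece.lstrip())  # mirrors df.var1.str.lstrip()
    return rows

# Made-up sample lines for illustration.
lines = ["I love you - she said", "No hyphen here"]
print(split_and_explode(lines))
# ['I love you ', 'she said', 'No hyphen here']
```

Note that only leading spaces are removed, so a trailing space can remain, and that hyphenated words such as "half-girlfriend" would also be split in two by this rule.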
Now, we have the data ready for training. Let’s save the cleaned text in a .txt file:
df.to_csv("input_text_cleaned.txt", columns=["var1"], header=False, index=False)
Now we can use the above text file to fine-tune the model with the right parameters.
Let’s mount the Gdrive to save the model there:
!pip install -q gpt-2-simple

import gpt_2_simple as gpt2
gpt2.mount_gdrive()
Time to fine tune the model to our needs:
run_name = 'ChetanAI'

ai.train('input_text_cleaned.txt',
         run_name=run_name,
         line_by_line=False,
         from_cache=False,
         num_steps=5000,
         generate_every=100,
         save_every=500,
         save_gdrive=True,
         learning_rate=1e-3,
         fp16=False,
         batch_size=1)
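To get a feel for what these step settings mean in practice, here is a small arithmetic sketch. It assumes that a sample is generated, or a checkpoint saved, whenever the step count is a multiple of the matching interval; the exact behavior of the library may differ slightly.

```python
num_steps = 5000
generate_every = 100
save_every = 500

# Assumption: an action fires whenever the step count is a multiple of its interval.
sample_steps = [s for s in range(1, num_steps + 1) if s % generate_every == 0]
save_steps = [s for s in range(1, num_steps + 1) if s % save_every == 0]

print(len(sample_steps))  # 50
print(len(save_steps))    # 10
print(save_steps[:3])     # [500, 1000, 1500]
```

So under that assumption, a run like this prints about 50 sample texts along the way and saves the model about 10 times, which makes it easy to stop early if the samples already look good.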
It is finally done!
Our custom-trained GPT-2 model is ready. Now, we can generate text directly from the trained model.
ai.generate(n=3,
            prompt="This is something Chetan would write:",
            batch_size=1,
            max_length=50,
            temperature=1.0,
            top_p=0.9)
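The `temperature` and `top_p` values control how adventurous the generated text is. The sketch below is a simplified illustration of the two ideas in plain Python, not the code the library actually runs. The helper names and the four-token probabilities are made up for the example.

```python
import math

def apply_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose total probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # renormalize so the kept probabilities sum to 1
    return {i: probs[i] / total for i in kept}

# Hypothetical probabilities for a four-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
print(sorted(top_p_filter(probs, top_p=0.9)))  # [0, 1, 2]
```

With `top_p=0.9`, the least likely token in this example is never sampled, which trims off the strangest continuations. Lowering `temperature` below 1.0 pushes more probability onto the top choices, so the text becomes safer but more repetitive.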
The outputs:
For the Colab notebook, go here.