Classification is one of the most common tasks in machine learning: a predictive model is built to assign items to one of several classes. But is it possible to classify attributes of an author from writing such as blogs and articles? So much text is published on the internet as articles, blogs, and the like that it has become difficult to say anything about a writer without knowing them. In this article, we will try to solve this problem by building a classifier that predicts multiple attributes of the author, namely age, gender, astrological sign, and industry, from their texts. The problem is also listed as “Blog Authorship Corpus” on Kaggle.
What will you gain from this article?
- How to solve the blog authorship corpus challenge?
- How to download and load the huge corpus for the task?
- How to do pre-processing of the textual corpus?
- How to build a model that will predict the features of the author?
The data set can be downloaded directly from Kaggle. It consists of the posts of 19,320 bloggers, collected from blogger.com in August 2004, for a total of 681,288 posts and 140 million words. All the bloggers fall into three age groups: 13-17, 23-27, and 33-47. We will use Google Colab for this task, but you can work in any other environment if you prefer.
Implementing Author Feature Prediction
Let us quickly import the required libraries. Use the code below to do so.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
After importing the libraries, we load the data set and print its first 10 rows. A bit of EDA on the data showed that there are no missing values and a total of 681,284 rows and 7 columns.
data = pd.read_csv('blogtext.csv')
data.head(10)
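The loading and EDA step can be tried out on a small scale first. The sketch below is only an illustration: it parses a tiny, made-up CSV with the same seven columns (the rows are hypothetical, not taken from the corpus) and uses the `nrows` argument of `pd.read_csv` to limit how much of a large file is read before checking the shape and the missing values.

```python
import io
import pandas as pd

# Hypothetical stand-in for blogtext.csv with the same seven columns.
csv_text = """id,gender,age,topic,sign,date,text
1,male,15,Student,Leo,"14,May,2004",Hello world
2,female,25,indUnk,Aries,"15,May,2004",Second post
3,male,35,Technology,Libra,"16,May,2004",Third post
"""

# nrows limits how many rows are parsed, which helps with very large files.
sample = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Basic EDA: size of the frame and number of missing values per column.
n_rows, n_cols = sample.shape
missing_per_column = sample.isna().sum()

print(n_rows, n_cols)
print(missing_per_column)
```

With the real file, the same call would take the path to `blogtext.csv` instead of the in-memory buffer.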
We will work with only the first 3,000 rows while building the model, which keeps the experiment small and reduces errors along the way; once the pipeline works, the model can be rebuilt on the whole data set. After selecting the 3,000 rows, we pre-process the text: remove unwanted characters, convert to lowercase, strip surrounding whitespace, split into words, and remove stopwords.
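# Work on a copy of the first 3,000 rows
data_new = data[:3000].copy()
# Replace every non-letter character with a space
data_new['text'] = data_new['text'].str.replace('[^A-Za-z]', ' ', regex=True)
data_new['text'] = data_new['text'].str.lower()
data_new['text'] = data_new['text'].str.strip()
# Split each post into a list of words
data_new['text'] = data_new['text'].str.split()
# Requires a one-time nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# Drop the stopwords and join the remaining words back into one string
data_new['text'] = data_new['text'].apply(lambda words: ' '.join([word for word in words if word not in stop_words]))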
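To see what these cleaning steps do to a single post, here is a minimal sketch on two made-up sentences. The helper name `clean_text` and the short `STOP_WORDS` set are assumptions for illustration only; the article itself uses NLTK's full English stopword list.

```python
import pandas as pd

# Small illustrative stopword set (assumption); the article uses NLTK's English list.
STOP_WORDS = {"the", "is", "a", "and", "to", "of", "i"}

def clean_text(series: pd.Series) -> pd.Series:
    """Keep letters only, lowercase, strip, split, and drop stopwords."""
    cleaned = series.str.replace(r"[^A-Za-z]", " ", regex=True)
    cleaned = cleaned.str.lower().str.strip()
    tokens = cleaned.str.split()
    return tokens.apply(lambda words: " ".join(w for w in words if w not in STOP_WORDS))

posts = pd.Series(["I went to the Park, and it is GREAT!!!", "  2004 was a year of blogs  "])
cleaned_posts = clean_text(posts)
print(cleaned_posts.tolist())
```

Digits and punctuation disappear because they are replaced by spaces before the split, so a year such as 2004 leaves no token behind.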
After stopword removal, the text column contains only the cleaned, lowercase words of each post.
We will now merge the labels so that all the labels for a particular post sit in one column. After merging, each row has a single comma-separated labels column alongside the text. Use the code below to merge the labels.
data_new['age'] = data_new['age'].astype(str)
data_new['labels'] = data_new[['gender', 'age', 'topic', 'sign']].apply(lambda x: ','.join(x), axis=1)
merged_data = data_new.drop(labels=['date', 'gender', 'age', 'topic', 'sign', 'id'], axis=1)
merged_data.head()
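The effect of the merge is easiest to see on a couple of rows. The following sketch uses a made-up two-row frame (the values are hypothetical) and applies the same idea: cast age to a string, join the four label columns with commas, and drop the originals.

```python
import pandas as pd

# Two hypothetical rows with the label columns used in the article.
frame = pd.DataFrame({
    "gender": ["male", "female"],
    "age": [15, 25],
    "topic": ["Student", "indUnk"],
    "sign": ["Leo", "Aries"],
    "text": ["first post", "second post"],
})

# Age must be a string before it can be joined with the other labels.
frame["age"] = frame["age"].astype(str)
label_columns = ["gender", "age", "topic", "sign"]
frame["labels"] = frame[label_columns].apply(lambda row: ",".join(row), axis=1)

# Keep only the text and the merged labels.
merged = frame.drop(columns=label_columns)
print(merged)
```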
We will then define the independent and dependent features, X and y respectively. After defining X and y, we divide the data into training and testing sets. The model will be fitted on the training data and evaluated on the test data. Use the code below to do so.
X = merged_data['text']
merged_data['labels'] = merged_data['labels'].str.lower()
y = merged_data['labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=43)
After splitting the data, we vectorize X by creating a bag of words with a count vectorizer: the vectorizer is fitted on X_train and then used to transform both X_train and X_test. A second count vectorizer is fitted on the labels to collect the set of label classes, which is passed to a multi-label binarizer that will later convert the training and testing labels. Use the code below to do so.
vectorizer = CountVectorizer(min_df=2, ngram_range=(1, 2), stop_words="english")
X_train = vectorizer.fit_transform(X_train)
X_test = vectorizer.transform(X_test)
# Collect the label vocabulary from the full label column
vectorizer_labels = CountVectorizer(min_df=1, ngram_range=(1, 1), stop_words="english")
labels_vector = vectorizer_labels.fit_transform(merged_data['labels'])
label_classes = []
for key in vectorizer_labels.vocabulary_.keys():
    label_classes.append(key)
mlb = MultiLabelBinarizer(classes=label_classes)
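As a rough illustration of what the bag of words looks like, the sketch below fits `CountVectorizer` on three made-up documents. It shows how `ngram_range=(1, 2)` adds word pairs to the vocabulary and how `min_df=2` keeps only terms that occur in at least two documents. Stopword filtering is left out here so that the toy vocabulary is easy to follow.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical documents.
docs = ["blog post about music", "music blog", "post about school"]

# Unigrams and bigrams, no document-frequency filter.
vec_all = CountVectorizer(ngram_range=(1, 2))
matrix_all = vec_all.fit_transform(docs)

# Same setting, but keep only terms seen in at least two documents.
vec_min2 = CountVectorizer(min_df=2, ngram_range=(1, 2))
matrix_min2 = vec_min2.fit_transform(docs)

print(sorted(vec_all.vocabulary_))
print(sorted(vec_min2.vocabulary_))
print(matrix_all.shape, matrix_min2.shape)
```

The result is a sparse document-term matrix with one row per document and one column per retained term.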
Before applying the multi-label binarizer, we need to convert the labels into a format that it accepts, namely a list of labels per sample. The code below does this and transforms both the training and testing labels.
import re
# Turn each comma-separated label string into a list of labels, keeping only word characters
y = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in y]]
labels_trans = mlb.fit(y)
y_train = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in y_train]]
y_train = mlb.transform(y_train)
y_test = [["".join(re.findall(r"\w", f)) for f in lst] for lst in [s.split(",") for s in y_test]]
y_test = mlb.transform(y_test)
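The binarizer step can be pictured with a few made-up label strings. In this minimal sketch the labels are hypothetical and the binarizer learns its classes from the data itself, rather than from a pre-built class list as in the article. Each comma-separated string is split into a list, and `MultiLabelBinarizer` turns the lists into a 0/1 indicator matrix with one column per distinct label.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical merged label strings, one per post.
label_strings = ["male,15,student,leo", "female,25,indunk,aries", "male,25,student,aries"]
label_lists = [s.split(",") for s in label_strings]

mlb_demo = MultiLabelBinarizer()
indicator = mlb_demo.fit_transform(label_lists)

# Columns follow the sorted class names.
print(list(mlb_demo.classes_))
print(indicator)
```

Every row has exactly four ones here, because each post carries one gender, one age, one topic, and one sign.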
After the labels have been transformed using the multi-label binarizer, we define the classifier. We will use OneVsRestClassifier, which follows the one-vs-rest approach, with LogisticRegression as the base classifier. Training may take time when the volume of data is large. After initializing the classifier, we fit it on the training data and check the training accuracy. Use the code below to do so.
Classification Model for Author Feature Prediction
clf = LogisticRegression(solver='lbfgs', max_iter=1000)
clf = OneVsRestClassifier(clf)
clf.fit(X_train, y_train)
print("Training Accuracy:", clf.score(X_train, y_train))
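To make the one-vs-rest idea concrete, here is a small sketch on toy data (an assumption for illustration, unrelated to the blog corpus). `OneVsRestClassifier` fits one independent `LogisticRegression` per label column, and for multi-label targets its `score` method reports subset accuracy, that is, the fraction of samples whose full label set is predicted exactly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy data (assumption): two binary features, and each label copies one feature.
X_demo = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 2)
Y_demo = X_demo.copy()

ovr = OneVsRestClassifier(LogisticRegression(solver="lbfgs", max_iter=1000))
ovr.fit(X_demo, Y_demo)

# One fitted binary classifier per label column.
Y_hat = ovr.predict(X_demo)
subset_accuracy = ovr.score(X_demo, Y_demo)
print(len(ovr.estimators_), subset_accuracy)
```

Because subset accuracy requires every label of a sample to be right at once, it tends to be much lower than per-label metrics on a real multi-label problem.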
After training, we make predictions on the test data and then evaluate the model's performance using several metrics. Use the code below to do so.
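from sklearn.metrics import precision_score
y_pred = clf.predict(X_test)
print("Test Accuracy:" + str(accuracy_score(y_test, y_pred)))
# Multi-label targets need an explicit average; micro is assumed here for the plain F1 and precision
print("F1: " + str(f1_score(y_test, y_pred, average="micro")))
print("F1_macro: " + str(f1_score(y_test, y_pred, average="macro")))
print("Precision: " + str(precision_score(y_test, y_pred, average="micro")))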
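The difference between these metrics is easier to see on a tiny hand-made example. The arrays below are hypothetical indicator matrices, not model output. Micro averaging pools all label decisions together before computing the score, macro averaging computes the score per label and takes the unweighted mean, and `accuracy_score` on multi-label input counts only exact matches of the whole label set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Hypothetical true and predicted label indicators (3 samples, 3 labels).
y_true_demo = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred_demo = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])

subset_acc = accuracy_score(y_true_demo, y_pred_demo)           # exact-match ratio
micro_f1 = f1_score(y_true_demo, y_pred_demo, average="micro")  # pooled over all labels
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")  # mean of per-label F1
micro_precision = precision_score(y_true_demo, y_pred_demo, average="micro")

print(subset_acc, micro_f1, macro_f1, micro_precision)
```

Here only one of the three samples matches exactly, while four of the five predicted labels are correct, which is why subset accuracy is far below the micro-averaged scores.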
Author Feature Predictions by the Model
We will now check a few of the predictions and compare them with the original labels. For the 2 sentences we compared, the model predicted the labels correctly.
print(" Predicted :", y_pred)
print(" Actual :", y_test)
print(" Predicted :", y_pred)
print(" Actual :", y_test)
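Raw indicator rows are hard to read, so one possible way to compare individual predictions is to map them back to label names with `MultiLabelBinarizer.inverse_transform`. The sketch below is an assumption-laden illustration: the labels are made up, and the "prediction" is a copy of the true labels with one label deliberately removed to mimic a miss.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical label sets for two posts.
true_labels = [["male", "15", "student", "leo"], ["female", "25", "indunk", "aries"]]

mlb_demo = MultiLabelBinarizer()
actual = mlb_demo.fit_transform(true_labels)

# Pretend the model missed the sign of the second post.
predicted = actual.copy()
predicted[1, list(mlb_demo.classes_).index("aries")] = 0

# Decode one row at a time back into label names.
decoded_pred = [mlb_demo.inverse_transform(predicted[i:i + 1])[0] for i in range(len(predicted))]
decoded_true = [mlb_demo.inverse_transform(actual[i:i + 1])[0] for i in range(len(actual))]

for pred, true in zip(decoded_pred, decoded_true):
    print("Predicted:", pred)
    print("Actual   :", true)
```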
We can conclude that the model built with OneVsRestClassifier did not achieve very good accuracy, although the 2 predictions we checked were predicted correctly. You can also try building the model with different classifiers to improve its accuracy. Finally, we evaluated the model's performance using metrics such as precision, recall, and F1 score, obtaining an F1 score of 75 and a precision of 82, which are quite satisfactory.