Being able to use artificial intelligence and machine learning to identify and confirm the author of a text, and to analyse writing style across a collection of texts, is a breakthrough. In literary studies, the question of authorship has long been a central concern. A case in point is the analysis of William Shakespeare and John Fletcher’s play Henry VIII.
Czech researcher Petr Plechac recently released a paper titled “Relative contributions of Shakespeare and Fletcher in Henry VIII”, supporting scholar James Spedding’s long-standing theory that Shakespeare’s Henry VIII had more than one author.
The Tech Behind
Plechac developed a machine learning system that determined which portions of the historical play were written by which author. He trained an algorithm on the works of Shakespeare and Fletcher to recognise each author’s style, rhythmic patterns, and word choices. He then applied a rolling regression technique, which breaks the play into parts and repeats the analysis over successive subsamples to study how the writing style changes. The process provided granular evidence of Fletcher’s involvement in writing the last four scenes of the play, lending credence to one side of a lingering debate.
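The rolling idea described above can be illustrated with a minimal sketch. It is not Plechac's system: the classifier here is a stand-in (nearest author profile by Manhattan distance over function-word frequencies), and the word list, the function names `profile` and `rolling_attribution`, and the window sizes are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption).
FUNCTION_WORDS = ("the", "and", "of", "to", "in", "a", "that", "it", "with", "but")


def profile(text):
    """Relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid division by zero on empty input
    counts = Counter(words)
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(p, q):
    """Manhattan distance between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def rolling_attribution(lines, author_profiles, window=5, step=1):
    """Slide a window over the lines and label each window with the
    author whose profile is closest (a stand-in classifier)."""
    results = []
    last_start = max(len(lines) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = " ".join(lines[start:start + window])
        p = profile(chunk)
        best = min(author_profiles, key=lambda name: distance(p, author_profiles[name]))
        results.append((start, best))
    return results
```

In use, `author_profiles` would map each candidate author to `profile(...)` of their known works, and `lines` would hold the disputed play line by line. Because consecutive windows overlap, the label changes gradually where one author's hand gives way to another's.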
The key concept behind author identification is feature engineering, in which the system selects features from the collected text that describe an individual author’s style well enough to distinguish that author from other writers. The most common features used by such systems are:
- Frequency of n-grams
- Usage of function words
- Distribution of word lengths
- Frequency of digits
- Syntax (which helps in analysing the specific token and the style within a text)
- Punctuation marks
- Usage of passive and active voice
- Parts of speech
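Several of the features listed above can be computed with a few lines of standard-library Python. The following is a minimal sketch rather than a production feature extractor: the function name `stylometric_features` and the short `FUNCTION_WORDS` list are assumptions, and syntax, voice, and part-of-speech features are left out because they need a parser or tagger.

```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption).
FUNCTION_WORDS = ("the", "and", "of", "to", "in", "a", "that", "it", "with", "but")


def stylometric_features(text, n=3):
    """Return a dict of simple style features for one text."""
    lowered = text.lower()
    words = re.findall(r"[a-z']+", lowered)
    total_words = len(words) or 1  # avoid division by zero on empty input
    total_chars = len(text) or 1

    # Usage of function words: share of all words taken by each one.
    word_counts = Counter(words)
    function_word_freq = {w: word_counts[w] / total_words for w in FUNCTION_WORDS}

    # Distribution of word lengths: length -> share of words.
    length_counts = Counter(len(w) for w in words)
    word_length_dist = {k: v / total_words for k, v in sorted(length_counts.items())}

    # Frequency of character n-grams (raw counts).
    char_ngrams = Counter(lowered[i:i + n] for i in range(len(lowered) - n + 1))

    # Punctuation marks and digits, as a share of all characters.
    punctuation_rate = sum(ch in string.punctuation for ch in text) / total_chars
    digit_rate = sum(ch.isdigit() for ch in text) / total_chars

    return {
        "function_word_freq": function_word_freq,
        "word_length_dist": word_length_dist,
        "char_ngrams": char_ngrams,
        "punctuation_rate": punctuation_rate,
        "digit_rate": digit_rate,
    }
```

Calling `stylometric_features(text)` on each known work of an author yields feature dictionaries that can be flattened into vectors and passed to whatever classifier is being used.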
Feature engineering for authorship has deep roots in stylometry, the study of linguistic style and of how it varies from one text to another. Researchers believe that habitual human activities stay largely consistent for one individual while varying slightly from one person to another. Similarly, writing style is usually distinguished by the repeated choices of words or text patterns that a writer tends to make subconsciously. These repeated choices are highly individual and are taken to reflect a writer’s style.
As stylometry made its way into the digital world, the technique moved beyond analysing features to focus more on network modelling and advanced mining of data and texts. It also draws on information such as graphics, emotions, colours, and layouts.
The method of network modelling uses information related to the attribution of the document, such as the venue, date, and type of publication, which is then paired with text-based keywords from the document. However, this method does not study the important rhythmic patterns, so it is combined with machine learning techniques to find more author characteristics.
Another approach captures the relational structure in the usage of function words by constructing word adjacency networks (WANs), in which the nodes are function words and the edges record how two function words are used within a certain distance of one another. Each WAN is then interpreted as a Markov chain that assigns transition probabilities to the appearance of two function words in succession. These probabilities are treated as characteristic of an author’s expression.
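A word adjacency network and its Markov-chain reading can be sketched in a few lines. This is a simplified illustration under stated assumptions: edges are plain co-occurrence counts within a fixed window (published WAN methods may weight edges differently), and the `FUNCTION_WORDS` set and the names `build_wan` and `to_markov_chain` are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny function-word set (an assumption).
FUNCTION_WORDS = {"the", "and", "of", "to", "in", "a", "that", "it", "with", "but"}


def build_wan(text, window=5):
    """Count how often function word v appears within `window` words
    after function word u. Returns nested dicts: counts[u][v]."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = defaultdict(lambda: defaultdict(float))
    for i, u in enumerate(words):
        if u not in FUNCTION_WORDS:
            continue
        for v in words[i + 1:i + 1 + window]:
            if v in FUNCTION_WORDS:
                counts[u][v] += 1.0
    return counts


def to_markov_chain(counts):
    """Normalise each node's outgoing edge weights so they sum to 1,
    giving transition probabilities between function words."""
    chain = {}
    for u, row in counts.items():
        total = sum(row.values())
        chain[u] = {v: w / total for v, w in row.items()}
    return chain
```

Two texts can then be compared by measuring how far apart their transition probabilities are. The choice of distance measure is left open here.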
As a science, stylometry falls under the general category of recognition systems, which are often applied to identify suspects or criminals.
The need to identify the author of a particular text and verify its authenticity has been around for many years. Researchers and linguists have long used manual stylometry to find distinct language patterns that help identify the author in disputed cases. However, the process has become more accurate with the involvement of ML and AI.
In 2017, an AI application known as Emma was released for public use; it can read a text and identify the author’s style. CEO Aleksandr Marchenko said in an interview, “To run the check, one needs to upload a text of at least 5,000 words by one author, which is then analysed by Emma to learn the author’s writing style. It can then determine whether all the subsequent texts uploaded.”
He further explained that the application combines natural language processing (NLP) and ML with the techniques of stylometry to extract patterns from the author’s text, most of which cannot be easily detected by the human eye. “More than 50 mathematical parameters stand behind every author’s writing identity. So the issue is of a striking complexity: it’s extremely difficult to define and assess style features of a vexing number of authors, and to implement the extracted knowledge into an NLP technology.”
Apart from verifying authorship and providing insight into the mental state of the author, stylometry has many potential applications in education and literature, digital content forensics, program code authorship, crime prevention, and law enforcement. Plagiarism detection and ghostwriter detection have also benefited from the use of machine learning in stylometry.