Who wrote this text? Authorship attribution with SVM

Authorship attribution is the task of detecting who wrote a given text. In a famous example, researchers unmasked crime writer Robert Galbraith as being, in fact, J.K. Rowling, the author of the Harry Potter books. In this blog post, I will explore and explain the use of the support vector machine (SVM) classification algorithm for authorship attribution.

To do so, I will use a publicly available dataset called the ‘Extended-Brennan-Greenstadt Corpus’. The dataset consists of texts written by 45 different authors, who were recruited through Amazon Mechanical Turk. Each author wrote at least 6,500 words, spread over multiple texts.

 

Drawing a line

In machine learning terminology, authorship attribution is a classification task. To classify basically means to label: which of the 45 authors wrote a given text? Each author would be called a class, and each text would be called an instance of that class. We will let the computer perform the classification task automatically by using the Support Vector Machine (SVM) algorithm.

What an SVM essentially does is try to find a line that separates the instances of two classes by the widest possible margin. A simple illustration is shown below. Here, the instances are dots, and the classes are black and white. The green line, H1, does not separate the classes correctly: on its right side, there are both black and white dots. The blue line, H2, does separate the classes correctly, but it is easy to imagine a new black or white dot appearing on the wrong side of it. The red line, H3, is the best one: it separates the black and white dots by the largest margin.

[Figure: an SVM choosing between three candidate separating lines. Image from Wikimedia Commons, User:ZackWeinberg, based on a PNG version by User:Cyc, ‘Svm separating hyperplanes (SVG)’, CC BY-SA 3.0.]

The SVM tries to find the optimal line by using features of the instances that need to be classified. In the picture above, these features are X1 and X2. While we do not know what they represent (height? income? love for animals?), we can see that the white dots tend to score high on X1 but not on X2, and for the black dots it is the other way around. The SVM learns where the best line lies by looking at lots of examples; in other words, we are ‘training’ our SVM. When the training is done, we can use the SVM to classify new instances, such as a dot whose colour is not yet known.
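To make this a bit more concrete, below is a minimal sketch of training and using an SVM with scikit-learn. The toy feature values and the linear kernel are illustrative assumptions of mine, not the exact setup used in the experiments described further down.

```python
import numpy as np
from sklearn.svm import SVC

# Toy training data: each row is an instance with two features (X1, X2),
# and each label is the class of that instance (0 = black dot, 1 = white dot).
X_train = np.array([[0.2, 0.9], [0.3, 0.8], [0.1, 0.7],   # black dots: low X1, high X2
                    [0.9, 0.2], [0.8, 0.1], [0.7, 0.3]])  # white dots: high X1, low X2
y_train = np.array([0, 0, 0, 1, 1, 1])

# Train a linear SVM: it looks for the separating line with the widest margin.
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)

# Classify a new, unseen dot.
print(svm.predict([[0.75, 0.25]]))  # -> [1], i.e. the 'white' class
```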

 

Features of texts

When it comes to authorship attribution, each text is an instance, and features are properties of the texts, for example the length of a text. When choosing which features to use, it is important to understand the difference between lexical words and function words. Lexical words are words whose meaning can easily be pictured, for example ‘dog’, ‘red’, or ‘running’. Function words are needed to create grammatical sentences, but their meaning is not as easy to picture, for example ‘is’, ‘of’, ‘theirs’. While function words may not seem very informative at first glance, they are in fact known to be very useful for authorship attribution.

In this authorship attribution task, I will be using 236 different features. This may seem like a lot, but such numbers are not uncommon in authorship attribution. I grouped my 236 features into five categories (a short sketch of how a few of them can be computed follows the list):

  1. Text and sentence length (5 features). Examples: total number of words, total number of sentences, average sentence length in characters.
  2. Lexical richness (7 features). Examples: ratio of lexical words to total number of words, ratio of unique words to total number of words (‘type-token ratio’).
  3. Frequency of function words (153 features): how often each of 153 selected function words occurred in a text.
  4. Frequency of parts-of-speech (45 features): how often certain word categories, such as proper nouns (‘John’, ‘London’) or personal pronouns (‘I’, ‘him’) occurred in a text.
  5. Letter frequency (26 features): how often each of the 26 letters of the alphabet occurred in a text.
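As a small illustration of what such features look like in practice, here is a sketch that computes a handful of them (from categories 1, 2, 3 and 5) for a single text. The crude tokenisation and the tiny function word list are simplifications of my own to keep the example short; the real feature set uses 153 function words and a proper part-of-speech tagger for category 4.

```python
import re
import string
from collections import Counter

# A tiny, purely illustrative function word list; the real feature set uses 153.
FUNCTION_WORDS = ["the", "of", "and", "is", "to", "in", "that", "it"]

def extract_features(text):
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    n_words = len(words)

    features = {
        # Category 1: text and sentence length
        "n_words": n_words,
        "n_sentences": len(sentences),
        "avg_sentence_length_chars": len(text) / len(sentences),
        # Category 2: lexical richness (type-token ratio)
        "type_token_ratio": len(set(words)) / n_words,
    }
    # Category 3: relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / n_words
    # Category 5: relative frequency of each letter of the alphabet
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    letter_counts = Counter(letters)
    for letter in string.ascii_lowercase:
        features["letter_" + letter] = letter_counts[letter] / len(letters)
    return features

print(extract_features("The dog ran to the park. It was the red dog of John."))
```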

 

How confused is my SVM?

As explained above, an SVM learns to draw the best line for separating the instances of two classes by looking at lots of training data. I gave the SVM 544 texts to learn from. Once it had figured out how to draw its lines, I tested how well the SVM worked by letting it predict the author of 155 texts it had not seen before. For each prediction that the SVM makes, four different outcomes are possible:

  1. True positive: a text written by author 29 was attributed to author 29
  2. True negative: a text not written by author 29 was not attributed to author 29
  3. False positive: a text not written by author 29 was attributed to author 29
  4. False negative: a text written by author 29 was not attributed to author 29

These outcomes can be visualised in a confusion matrix (see the picture below). The colour legend, running from 0 to 5, indicates how many texts by a given author were attributed to a given author. For example, the yellow square towards the bottom right indicates that five texts written by author 30 were indeed predicted to have been written by author 30 (which is good!). We can also see, in the top right corner, that one text actually written by author 0 was attributed to author 40.

If the SVM had made no mistakes at all, we would see only a bright diagonal line and no squares elsewhere. While this is not the case, the confusion matrix shows that the SVM managed to predict the right author for a given text quite often, especially considering that it had to pick a single author out of 45 options.

[Figure: confusion matrix of actual vs. predicted authors]
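For readers who want to reproduce this kind of plot, a sketch along the following lines should do. The variable names `features` and `labels`, the scaling step and the linear kernel are assumptions of mine that match the description above, not necessarily the exact code behind the figure.

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import ConfusionMatrixDisplay

# `features` is an (n_texts, 236) array of feature values (e.g. built with a
# function like extract_features above); `labels` holds the 45 author ids.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=155, stratify=labels, random_state=0)

# Scale the features and train a linear SVM on the training texts.
scaler = StandardScaler().fit(X_train)
svm = SVC(kernel="linear").fit(scaler.transform(X_train), y_train)

# Predict authors for the unseen test texts and plot the confusion matrix.
y_pred = svm.predict(scaler.transform(X_test))
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
```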

 

From pictures to numbers

The confusion matrix gives us an impression of how well our SVM performed. However, sometimes we want to describe this picture with a few numbers, which makes it easier to compare the outcomes of different SVMs. To this end, three measures are available: recall, precision and F1 score.

‘Recall’ tells us how often a text written by a certain author was indeed attributed to that author. Thus, for each author, recall is calculated as the number of true positives divided by the sum of the true positives and false negatives. ‘Precision’ tells us how many of the texts that the system attributed to a certain author were indeed written by that author. Thus, for each author, precision is calculated as the number of true positives divided by the sum of the true positives and false positives. Finally, the ‘F1 score’ combines the two: it is the harmonic mean of recall and precision.
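In code, these scores can be computed per author and then averaged over the 45 authors. A minimal sketch, reusing the `y_test` and `y_pred` from the sketch above; the unweighted ‘macro’ average shown here is one common choice and an assumption on my part:

```python
from sklearn.metrics import precision_recall_fscore_support

# Per author: recall = TP / (TP + FN), precision = TP / (TP + FP),
# F1 = 2 * precision * recall / (precision + recall).
# average="macro" takes the unweighted mean of these scores over all 45 authors.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0)
print(f"recall={recall:.2f}  precision={precision:.2f}  F1={f1:.2f}")
```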

Recall, precision and F1 are always between 0 (nothing is predicted correctly) and 1 (everything is predicted correctly). For my data, recall was at 0.64, precision was at 0.69, and F1 was at 0.66. While there is obviously room for improvement, do remember that the chance level of making a correct prediction was only 1/45 ≈ 0.02.

 

Ablation analysis

We may wonder how much each of the five feature groups contributed to our results. For example, if the first feature group (text and sentence length) made a useful contribution, we should see our recall, precision and F1 scores drop when we train another SVM that does not make use of this information. To investigate this, we trained five additional SVMs, each of which ignores one complete feature group. The resulting F1 scores are shown below.

The ‘None’ bar shows the F1 score when no feature group is excluded; this is the 0.66 we already saw above. As can be seen, not much changes when feature group B (lexical richness) is excluded. This means that if we do authorship attribution again in the future, we can probably leave out the lexical richness features and still get equally good results. On the other hand, the F1 score drops the most when feature group D (frequency of parts-of-speech) is excluded. Thus, the frequency with which individual authors use nouns, verbs, etc. helps our system decide which text was written by which author.

[Figure: ablation analysis, F1 score per excluded feature group]
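The ablation itself is straightforward to script: train one extra SVM per feature group, each time dropping that group's columns. A sketch, assuming the 236 feature columns are stored in the group order listed above (5, 7, 153, 45 and 26 columns) and reusing the train/test split from the earlier sketch:

```python
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Column spans of the five feature groups, assuming the columns appear in the
# order listed above: A length (5), B lexical richness (7),
# C function words (153), D parts-of-speech (45), E letter frequencies (26).
group_sizes = [("A", 5), ("B", 7), ("C", 153), ("D", 45), ("E", 26)]
spans, start = {}, 0
for name, size in group_sizes:
    spans[name] = range(start, start + size)
    start += size

def f1_without(group, X_train, y_train, X_test, y_test):
    """Macro F1 of an SVM trained with one feature group left out."""
    keep = [i for i in range(X_train.shape[1]) if i not in spans[group]]
    svm = SVC(kernel="linear").fit(X_train[:, keep], y_train)
    return f1_score(y_test, svm.predict(X_test[:, keep]), average="macro")

# X_train, X_test, y_train, y_test come from the earlier train/test split sketch.
for name, _ in group_sizes:
    print(name, round(f1_without(name, X_train, y_train, X_test, y_test), 2))
```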

 

Where to go next?

We have seen that the SVM classification algorithm works quite well for predicting who wrote a given text: our results are well above chance level. Of course, there is still room for improvement. A first step could be to explore the individual contribution of each of the 236 features, similar to what we did for the five feature groups. If we find that certain features do not improve our results, our SVM might be better off if we leave those features out altogether.

 

Want to know more?

Brennan, M., Afroz, S., & Greenstadt, R. (2012). Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security, 15(3), 12:1–12:22. https://doi.org/10.1145/2382448.2382450

Grieve, J. (2007). Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22(3), 251–270. https://doi.org/10.1093/llc/fqm020

 

Want to try it yourself?

Download the Python code here. Get the data here.

 

 
