Can computers replace teachers? Automatically grading exams with vector space models

Below you can see two pictures of me. In the left one, I’m grading a big pile of student papers, feeling stressed and tired. Imagine how nice it would be if I could let my computer do all of this work, so that I can spend more time relaxing with my cat like in the right picture! To this end, I tried to build a system that automatically graded students’ answers to an open-ended exam question.

 

What do students know about Whorf’s language theory?

To build the system and evaluate how well it works, I used real data from 402 first-year Psychology students at Radboud University Nijmegen, who all answered the following exam question:

“Discuss Whorf’s language theory. Include the following terms in your answer: Strong and weak variations on the theory.”

The course lecturer (not myself this time) had graded all of the answers, with a minimum grade of 0 and a maximum grade of 10. The below histogram shows the frequency distribution of the grades he assigned to 321 of the student answers (80% of the dataset), which I will use as my training data. For the other 81 answers (20% of the dataset) I will let my system predict the grades, and then compare them to the grades that the lecturer assigned to these answers.

Histogram of grades
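(As an aside for readers who want to follow along in code: a split like this can be made in a few lines. The snippet below is only an illustration with scikit-learn and placeholder data, not necessarily the exact code I used.)

```python
from sklearn.model_selection import train_test_split

# Placeholder data: in the real project these are the 402 student answers
# and the grades (0-10) that the lecturer assigned to them.
answers = [f"student answer {i}" for i in range(402)]
grades = [i % 11 for i in range(402)]

train_answers, test_answers, train_grades, test_grades = train_test_split(
    answers, grades, test_size=0.2, random_state=0)

print(len(train_answers), len(test_answers))  # 321 training answers, 81 test answers
```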

Permission to work with the above-described data was granted to me by the ethical committee of the Social Science faculty of Radboud University.

 

The ‘perfect’ answer

The approach I’m taking in this project is to compare each student’s answer to a ‘perfect’ reference answer in terms of semantic similarity. Semantic similarity expresses to what extent the meaning of two texts (not necessarily the literal words) is the same.

The reference answer can be considered as the ideal answer to the above question. It was partly copied from the textbook that the students used (Gleitman, Gross & Reisberg, 2011), and partly based on the lecturer’s course notes.

I’ll admit in advance that there can of course be multiple, different ways to answer an exam question perfectly, and I’ll come back to this point in the conclusion. Nevertheless, the ‘perfect answer approach’ seems worth exploring because of its simplicity.

 

Comparing answers with vector space models

To understand the rest of this blog post, you will need a basic understanding of what a vector space is. For a quick introduction, watch the below video until 2:09.

In this project, I will compare three ways of transforming the answers from a running text into vectors of numbers. These transformations will be explained below. Thus, there will be three different vector spaces in which students’ answers can be compared to the reference answer. I will investigate which of these vector spaces predicts the students’ grades the best.

To predict a grade, we measure how similar the vector representing a student’s answer is to the vector representing the reference answer. This similarity is measured as cosine similarity, which for these vectors lies between 0 (when the texts are completely different) and 1 (when the texts share the exact same meaning). The similarity score is then simply multiplied by 10 to arrive at the predicted grade (many other mapping algorithms are possible too). Finally, to evaluate the accuracy of the predictions, I will correlate the grades predicted by each vector space model with the grades assigned by the lecturer, using Spearman’s correlation coefficient.
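To make this concrete, here is a minimal sketch of these two steps in Python (using NumPy and SciPy; the vectors and grades below are made up purely for illustration):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0-1 for non-negative vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up vectors for the reference answer and three student answers
reference = np.array([3, 4, 1, 2])
student_vectors = [np.array([1, 2, 1, 1]),
                   np.array([0, 1, 0, 0]),
                   np.array([3, 3, 1, 2])]

# Step 1: map each similarity score to a grade by multiplying by 10
predicted_grades = [10 * cosine_similarity(v, reference) for v in student_vectors]

# Step 2: evaluate against the (made-up) grades the lecturer assigned
lecturer_grades = [7, 3, 9]
rho, _ = spearmanr(predicted_grades, lecturer_grades)
print(predicted_grades, rho)
```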

 

Model 1: Vectors from word counts

In the first and most simple vector representation, each dimension of the vector reflects how often a particular word occurs in a text. For example, we could count how often the following four words appear in an answer: [“Whorf”, “language”, “determine”, “thought”]. Say a student gives the following answer: “Whorf’s language theory states that the language you speak determines your thoughts.” For this answer, the vector of word counts would be [1, 2, 1, 1]. For the reference answer, the vector would be [3, 4, 1, 2]. Please note that lemmatisation needs to be applied first, so that “Whorfian” becomes “Whorf”, “determines” becomes “determine” and “thoughts” becomes “thought”.

The above example was overly simplified. In reality, the number of dimensions I used was equal to the total number of unique words in all of the answers combined: 1676. That’s pretty difficult to visualise! Nevertheless, because this vector space model is based only on vocabulary counts and is therefore relatively simple, I will call it the baseline model.
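In code, building such count vectors could look roughly like the sketch below (I use scikit-learn’s CountVectorizer here for illustration; note that it only lowercases and tokenises, so lemmatisation would still have to be added with a separate tool):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy example with two (already lemmatised) texts: the reference answer and
# one student answer. Across all real answers combined, the vocabulary
# contains 1676 unique words.
reference = "Whorf language theory state that the language you speak determine your thought"
student = "Whorf claim that language shape thought"

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform([reference, student]).toarray()
print(vectorizer.get_feature_names_out())  # the shared vocabulary
print(vectors)                             # one row of word counts per text

# Cosine similarity between the student answer and the reference answer
ref_vec, stu_vec = vectors[0], vectors[1]
similarity = np.dot(ref_vec, stu_vec) / (np.linalg.norm(ref_vec) * np.linalg.norm(stu_vec))
print("predicted grade:", 10 * similarity)
```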

The below figure shows the grades that the baseline model predicted for the 81 student answers in the test set. The x-axis shows the grade that the model predicted, and the y-axis shows the grade that the lecturer actually assigned (thicker dots indicate multiple answers with the same grades). As can be seen, the results aren’t great. To begin with, the model did not give any of the answers a grade higher than 7, whereas the lecturer awarded many answers a 10. Secondly, many answers got a predicted grade that was very different from the one assigned by the lecturer. Still, there is a positive correlation of ρ = .38 (Spearman’s correlation coefficient), and a linear trend is visible.

Best baseline model (r = .38)

 

Models 2 and 3: Vectors from topics

The two other models I’ll try differ from the first model in what the dimensions of the vector represent. This time, the dimensions are no longer based on how often a word appears in an answer, but on topics: what an answer is about.

For example, say we have a list of five topics: [“Language”, “Thinking”, “Animals”, “Travelling”, “Education”]. The vector representing each answer consists of the percentage of the answer that concerns each of these topics. For example, an answer that concerns the topics of language and thinking, but not the other topics, could look like this: [65, 35, 0, 0, 0], indicating that 65% of the text is about language and 35% about thinking. If we also represent the reference answer as a similar list of percentages, we can again calculate the cosine similarity between a student’s answer and the reference answer, and this time compare them in terms of what they are about.

Of course, the above method requires that a list of topics is established. I did not do this manually, but used two different topic models, called Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA). These models were trained both on the student answers and on the chapter of the students’ Psychology textbook in which Whorf’s language theory is discussed. For reasons of space I won’t explain exactly how these models work, but you can find tutorials about them on YouTube or read my technical report, which is uploaded below.
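To give a flavour of what training such a topic model looks like in practice, here is a rough sketch using the gensim library (the toy documents, the number of topics and the other settings are placeholders; the actual settings are described in the report):

```python
from gensim import corpora, models

# Toy training documents, already tokenised and lemmatised. In the real
# project these are the student answers plus the relevant textbook chapter.
documents = [
    ["whorf", "language", "determine", "thought"],
    ["strong", "weak", "variation", "language", "thinking"],
    ["travel", "education", "animal"],
]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# LDA topic model; for the LSA model one would use models.LsiModel instead
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Topic vector of a new answer: how much of the text belongs to each topic
bow = dictionary.doc2bow(["whorf", "language", "thought"])
print(lda.get_document_topics(bow, minimum_probability=0.0))
```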

These are the results:

Best LSA model (r = .42)

Best LDA model (r = .52)

As you can see, the correlations that these models achieved are higher than that of the baseline model. This could be because the topic-based vector space models compare two answers in terms of their meaning, rather than looking for literal word matches. So, for example, the terms ‘words’ and ‘vocabulary’ would not be recognised as having the same meaning by the baseline model, but they probably would be by the topic models.

The LDA model yields the best results. Still, this model never assigns a grade of 10, which was also a shortcoming of the baseline model (and the LSA model). This could potentially be solved by changing the algorithm that transforms similarity scores into grades, for example by not only multiplying the similarity score by 10 but also adding two extra points. However, because Spearman’s correlation only depends on the ranking of the grades, not on their absolute values, the correlation of ρ = .52 would not change.
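A quick way to convince yourself of this (with made-up numbers) is to check that Spearman’s ρ stays the same under such a rank-preserving change of the mapping:

```python
from scipy.stats import spearmanr

lecturer = [6, 8, 5, 9, 7]                # made-up lecturer grades
similarity = [0.4, 0.6, 0.3, 0.7, 0.5]    # made-up similarity scores

plain = [10 * s for s in similarity]          # similarity * 10
shifted = [10 * s + 2 for s in similarity]    # similarity * 10 + 2 bonus points

rho_plain, _ = spearmanr(lecturer, plain)
rho_shifted, _ = spearmanr(lecturer, shifted)
print(rho_plain, rho_shifted)  # identical, because only the ranking matters
```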

 

So, can computers replace teachers yet?

In conclusion, it is possible to grade students’ answers to exam questions by means of vector space models. The best model resulted in a moderate correlation of .52, which means that the predicted grades are clearly better than chance. However, the above scatterplots also make it obvious that this technique is currently not good enough to actually be implemented in higher education.

Several avenues could still be explored to try to increase prediction accuracy. Currently, the models still punish students’ creativity, which is undesirable. For example, if a student includes an argument in his/her answer about the Inuit having many more words for snow than speakers of European languages, this would make the answer more dissimilar from the reference answer, and therefore a lower grade would be predicted. Another potential improvement would be to use several different reference answers, so that we no longer need to assume that there is only one perfect answer to a question.

Until then, I think I’ll keep grading the students’ exams myself…

 

Want to know more?

Read the detailed report that I submitted to the Text and Multimedia Mining course at Radboud University. It also contains references to other research on automatic grading, and to LDA and LSA topic models.

 

Want to try it yourself?

Download the source code here. Unfortunately, I don’t have permission to share the students’ data.
