Last updated November 11, 2019

Can AI Replace Teachers To Grade Student Essays? A Lesson From US Schools


Published on November 8, 2019

by Vishal Chawla

In countries like the US, artificial intelligence is already being used at large scale to evaluate student essays, saving educational institutions money and time. According to reports, at least 21 US states have deployed some form of automated scoring, from middle school to college level. Students are graded on their essays by AI systems designed by different vendors, including for high-stakes tests like the Graduate Record Examinations (GRE). While educators in the US say they are not going back to relying on human teachers for essay grading, the practice has received major backlash from parents, particularly those in state school systems.

But it’s not all great when it comes to automated grading. Recent reports from students and parents have drawn attention to AI systems that do not evaluate essays correctly, leaving them anguished. Parents say automated grading does nothing to help kids learn from their mistakes. Moreover, many parents have reported that the grading systems can be easily fooled by gibberish sentences that make no sense but contain advanced words, which trick the algorithm into thinking the essay is well written. In other words, an essay can look great from afar yet have no substance.

Another major finding flagged by experts is that the AI is biased against students of certain nationalities and cultures. Based on specific syntax structures and vocabulary, automated grading systems were found to rate essays written by foreign students as poor. This is because the data sets used are not diverse enough to include, and learn from, writing samples of students from foreign and different language backgrounds. This was confirmed by Aoife Cahill, a researcher at Educational Testing Service, a firm which creates such grading systems, in an episode of the podcast Reset.

What About Creativity And Writing Style?

The elephant in the room is the evaluation not just of grammar, syntax and vocabulary but of other creative elements of language, which are difficult to measure, especially at scale. Capturing them would need collaboration from thousands of teachers, so that subtle grading metrics can be recorded and added to the system. It is also possible that no two teachers would grade the same essay in the same manner, owing to their individual judgement and biases.

Other countries like China have also implemented automated scoring with substantial success, reporting accuracy close to that of human assessors, even in how those systems evaluate writing style. With advancements like OpenAI’s release of GPT-2, language models can mimic human language very closely; the largest version of GPT-2 has more than a billion parameters. GPT-2 has impressed creative people like authors and novelists, who have praised it and said the system works. Why, then, are anomalous grading errors occurring in the US? The answer lies in the lack of standards in such software and in the volume of data used to train the AI models. It all boils down to the volume and quality of data, according to experts who build these systems.

What Are The Challenges For Student Grading Systems That Rely On Artificial Intelligence?

The issue is complicated because there is not just one program in use. There are many different algorithms, made by different vendors across the US. Yet they are all built in fundamentally the same way: first, an automated scoring company looks at how human graders score students’ writing. Then the company uses that data to train an algorithm to predict how a human grader would score an essay. Depending on the data, those predictions can contain errors as well as human bias.
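The two-step recipe above can be illustrated with a deliberately naive sketch. It assumes a single hypothetical feature (average word length) and a handful of made-up essays and human scores; real vendors' features and models are not described in the reports. The point is only to show how a predictor trained on surface features can reward advanced-sounding gibberish.

```python
from statistics import mean


def avg_word_length(essay: str) -> float:
    """Hypothetical surface feature: mean number of characters per word."""
    words = essay.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def fit_line(xs, ys):
    """Ordinary least squares for one feature: y ~ slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Step 1: collect essays scored by human graders (made-up data, scale 1 to 6).
human_scored = [
    ("The dog ran to the park and it was fun", 2),
    ("My summer trip was good because we swam a lot", 2),
    ("Public libraries remain essential because they provide equal access to knowledge", 5),
    ("Renewable energy investment reduces emissions while creating sustainable employment", 6),
]

# Step 2: train a model to predict the human score from the feature.
xs = [avg_word_length(text) for text, _ in human_scored]
ys = [score for _, score in human_scored]
slope, intercept = fit_line(xs, ys)


def predict(essay: str) -> float:
    """Predicted score for an unseen essay (not clipped to the 1-6 scale)."""
    return slope * avg_word_length(essay) + intercept


# Long words with no meaning versus a plain but sensible sentence.
gibberish = "Paradigmatic epistemological juxtaposition notwithstanding ubiquitous quintessential"
plain = "Schools should start later so that students can sleep and learn more"

print(f"gibberish: {predict(gibberish):.2f}, plain: {predict(plain):.2f}")
```

Because the toy model has only learned that longer words go with higher human scores, it rates the meaningless sentence above the sensible one, which is the failure mode parents have described.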

The biggest challenge identified by experts is knowing which data is biased, so that it is not fed into the system and amplified by training the algorithm on it. Researchers say the data taken from human graders must follow a fixed standard format to ensure all checkpoints are ticked. Also, to prevent bias, diverse data sets are required to train the system, according to AI testing vendors.
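One simple way to look for this kind of bias is to compare machine scores with human scores separately for each group of students. The sketch below assumes made-up records and hypothetical group labels; it is an illustration of the idea, not a method attributed to any vendor.

```python
from collections import defaultdict
from statistics import mean


def score_gap_by_group(records):
    """Mean of (machine score - human score) for each group.

    records: iterable of (group, human_score, machine_score) tuples.
    A gap far from zero for one group suggests the model treats that
    group's writing differently from how human graders did.
    """
    gaps = defaultdict(list)
    for group, human, machine in records:
        gaps[group].append(machine - human)
    return {group: mean(values) for group, values in gaps.items()}


# Made-up audit data: (hypothetical group label, human score, machine score).
records = [
    ("native", 4, 4), ("native", 5, 5), ("native", 3, 4),
    ("non_native", 4, 3), ("non_native", 5, 4), ("non_native", 3, 3),
]

gaps = score_gap_by_group(records)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
```

In this invented sample the machine scores non-native writers lower than the human graders did, which is the pattern such an audit would be looking for.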

Conclusion

Educators say the incidents reported by parents may be rare occurrences in which the algorithms failed to do what they are supposed to, and that the systems will keep improving over time. Cyndee Carter, evaluation organizer for the Utah Education board, reportedly said the state started carefully, ensuring at first that every machine-graded essay was also read by a human grader.

All things considered, she says the automated scoring system has been great for the state, not only for the money and effort it has saved but also because it enables educators to get test results back in minutes rather than months, she told a publication.

Automated scoring has proved “spot-on”, and Utah now lets machines be the sole judge of most essays. In around 20% of cases, when the system identifies something anomalous, it flags the essay for assessment by a human teacher.
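The routing described above, where most essays are scored by machine and anomalous ones go to a teacher, can be sketched as follows. The reports do not say what counts as anomalous, so the three checks and their thresholds here are hypothetical.

```python
def needs_human_review(
    essay: str,
    min_words: int = 50,
    max_avg_word_len: float = 8.0,
    min_unique_ratio: float = 0.3,
) -> bool:
    """Return True if the essay looks anomalous under hypothetical checks."""
    words = essay.split()
    # Too short to score reliably.
    if len(words) < min_words:
        return True
    # Unusually long words on average, a possible sign of vocabulary stuffing.
    avg_len = sum(len(w) for w in words) / len(words)
    if avg_len > max_avg_word_len:
        return True
    # Heavy repetition: few distinct words relative to essay length.
    unique_ratio = len({w.lower() for w in words}) / len(words)
    return unique_ratio < min_unique_ratio


def route(essay: str) -> str:
    """Decide who grades the essay: a human teacher or the machine alone."""
    return "human" if needs_human_review(essay) else "machine"


print(route("Too short to judge."))  # routed to a human under these checks
```

Tightening or loosening the thresholds changes the share of essays sent to teachers, so a state aiming for a figure like the 20% mentioned above would have to tune them against its own data.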


Vishal Chawla

Vishal Chawla is a senior tech journalist at Analytics India Magazine and writes about AI, data analytics, cybersecurity, cloud computing, and blockchain. Vishal also hosts AIM's video podcast, Simulated Reality, featuring tech leaders, AI experts, and innovative startups of India.