AI In Education — Automatic Essay Scoring

As computer intelligence develops rapidly, powerful tools that could help teachers become more efficient seem to come out almost every week. One of the more sci-fi-sounding tools under examination is automatic computer grading of written essays.

Researchers are apparently well on their way towards getting bots to grade written essays instantly. For stakeholders dealing with huge numbers of essays, such as MOOC providers or states that include essays as part of their standardized tests, the thought of having the grading work done, even partly, by a computer is mesmerizing to say the least.

The big question is just how much of a poet a computer is capable of becoming in order to recognize the small but significant nuances that can mean the difference between a good essay and a great essay. Can it capture the essentials of written communication: reasoning, ethical stance, argumentation, clarity?

Source: Skip Sterling

In 1966, when computers still filled whole rooms, researcher Ellis Page at the University of Connecticut took the first steps towards automatic grading. Page was a true visionary of his generation. Computers were a relatively new thing, and the thought of using them with text input rather than numbers must have seemed extremely novel to Page’s peers. Besides, computers were mainly reserved for the most advanced tasks possible, and access to them was still highly restricted. Using computers to grade essays wasn’t very realistic from either a practical or an economic standpoint.

Today, however, the need for automated computer grading is soaring. Because every essay has to be graded by two teachers, standardized state tests with a written part have become increasingly expensive. This cost has led many states to ditch this important part of their assessment tests.

To counteract this discouraging development, in 2012 the William and Flora Hewlett Foundation sponsored a competition for automatic grading to get things going in the area. A prize of $60,000 was awarded to the solution that could best replicate grading by real teachers on several thousand essay samples.

“We had heard the claim that the machine algorithms are as good as human graders, but we wanted to create a neutral and fair platform to assess the various claims of the vendors. It turns out the claims are not hype,” says Barbara Chow, education program director at the Hewlett Foundation.

Today many standardized tests in lower grades use automatic grading systems with good results. Children’s fate is not entirely in computer hands, however. In most cases, robo-graders only replace one of the two required graders in standardized tests. If the automatic grader’s score diverges strongly from the human grader’s, the essay is flagged and forwarded to another human grader for further assessment. This routine is there to guarantee quality in assessment and at the same time helps improve the auto-grader.
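The flagging routine can be sketched in a few lines of Python. This is only an illustration: the function name `resolve_score`, the one-point tolerance and the choice to average the two scores when they agree are assumptions, not details of any real testing program.

```python
def resolve_score(human_score, machine_score, max_gap=1):
    """Combine one human score and one machine score for an essay.

    Returns the average when the two scores are within `max_gap` points
    of each other, or None to signal that a second human grader is needed.
    Both `max_gap` and the averaging rule are hypothetical choices.
    """
    if abs(human_score - machine_score) > max_gap:
        return None  # flagged: forward the essay to another human grader
    return (human_score + machine_score) / 2


# (human score, machine score) pairs for three made-up essays
essays = {"essay_a": (4, 4), "essay_b": (3, 4), "essay_c": (5, 2)}

for name, (human, machine) in essays.items():
    result = resolve_score(human, machine)
    status = "flagged for a second human" if result is None else f"final score {result}"
    print(name, "->", status)
```

Only the third essay, where the two graders are three points apart, would be sent on to another human.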

Development in automatic grading is also of great interest to MOOC providers. One of the largest obstacles to the spread of online education is the individual assessment of essays. One teacher could potentially provide material for 5,000 students, but it’s impossible for a single teacher to evaluate each student’s work individually. Solving this problem is a big step towards disrupting an education system that some say is broken.

Grading software has dramatically improved over the last few years, and is now advancing and being tested at a college level. One of the big leaders in advancement is EdX, a MOOC provider and a combined initiative of Harvard and MIT towards improving online education.

EdX president Anant Agarwal claims AI grading has more advantages than just freeing up valuable time. The instant feedback made possible by the new technology has a positive impact on learning as well. Today, essay assessments can take days or even weeks to complete, but with instant feedback, students still have their work fresh in memory and can improve weaker parts immediately and more effectively.

To start off the machine learning in the software, teachers have to input graded essays into the system as examples of what is good and what is bad. The software gets better at its job as more essays are entered and can eventually provide specific feedback almost instantly. According to Agarwal, there is still a long way to go, but the quality of grading is fast approaching that of a human teacher. Development of the EdX system is accelerating as more schools join in on the action. As of today, 11 major universities are contributing to the ongoing advancement of the grading software.

Professor Mark Shermis, Dean of the College of Education at the University of Houston, is considered one of the world’s leading experts in automatic grading. He supervised the Hewlett competition back in 2012 and was very impressed by the performance of the participants. 154 different teams took part in the competition and were compared on more than 16,000 essays. The output from the winning team was in 81% agreement with human raters. Shermis’s verdict was predominantly positive, and he says that this technology has a sure place in future educational settings.

Since the competition, research in automatic grading has made good progress. In 2016 two researchers at Stanford presented a report in which they claim to have achieved an agreement of 94.5% on the same dataset as in the Hewlett competition.

Besides, assessment variation between human graders is not something that has been deeply scientifically explored and is more than likely to differ greatly between individuals.
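How “agreement” is counted matters when reading figures like these. The article does not say which measure lies behind the percentages, so the sketch below simply shows two common options: the share of essays where both graders give exactly the same score, and quadratic weighted kappa, which corrects for chance and punishes large disagreements more than small ones. The scores are made up for illustration.

```python
from collections import Counter


def exact_agreement(scores_a, scores_b):
    """Fraction of essays on which both graders gave the same score."""
    return sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)


def quadratic_weighted_kappa(scores_a, scores_b, min_score, max_score):
    """Chance-corrected agreement; 1.0 means perfect agreement."""
    n = len(scores_a)
    span = max_score - min_score  # largest possible disagreement
    observed = Counter(zip(scores_a, scores_b))
    hist_a, hist_b = Counter(scores_a), Counter(scores_b)

    disagreement = 0.0  # weighted disagreement actually observed
    by_chance = 0.0     # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / span ** 2
            disagreement += weight * observed[(i, j)] / n
            by_chance += weight * hist_a[i] * hist_b[j] / (n * n)

    if by_chance == 0:  # both graders gave one and the same score throughout
        return 1.0
    return 1.0 - disagreement / by_chance


human = [1, 2, 3, 4]    # made-up teacher scores on a 1-4 scale
machine = [1, 2, 4, 4]  # made-up machine scores for the same essays

print(exact_agreement(human, machine))
print(quadratic_weighted_kappa(human, machine, 1, 4))
```

The same pair of graders can look quite different depending on the measure, so two headline numbers are only comparable if they were computed the same way.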

Skepticism

Evidently, the technology of automatic grading is on the rise and has come a long way from the first simple tools, which mainly relied on counting words and measuring sentence length, word complexity and structure.

How vendors of automatic essay scoring systems actually come up with their algorithms is hidden deep behind intellectual property protections. However, longtime skeptic Les Perelman, former director of undergraduate writing at MIT, has some of the answers. He has spent the last 10 years inventing ways to trick and ridicule different automated grading software and has more or less started a full-fledged war against the use of these systems.

Over the years he has become a master at understanding their inner workings and weak points. Perelman has on several occasions managed to crack the algorithms behind grading just to prove how easily they can be tricked. His latest contraption, developed with help from MIT undergraduate students, is a program called the Babel Generator (try it, it’s hilarious). The program can generate a complete essay in under a second, based on one to three keywords. Of course, the essay makes absolutely no sense to read, since it is full to the brim with well-articulated nonsense.

The essential problem in this kind of data assessment is called overfitting: a model tuned on a small dataset latches onto the quirks of those particular samples and then predicts poorly on anything new. The grading software must compare essays, understand which parts are great and which are not so great, and then condense this down to a number which constitutes the grade, which in turn must be comparable with the grade of a different essay on a totally different topic. Sounds hard, doesn’t it? That’s because it is. Very hard. But still, not impossible. Google uses similar tactics when working out which texts and images best match different search terms. The difference is that Google uses millions of data samples for its approximations. A single school could, at best, input a few thousand essays. This is like trying to solve a 1000-piece puzzle with just 50 pieces. Sure, some pieces can end up in the right place, but it’s mostly guesswork. Until there is a humongous database of millions and millions of essays, this problem will most likely be hard to work around.
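A toy example makes the danger concrete. The sketch below trains two hypothetical graders on four made-up essays, using word count as the only input: a “memorizer” that copies the score of the closest training essay, and a straight line fitted by least squares. All numbers are invented for illustration and say nothing about any real grading system.

```python
# (word count, teacher's score) for made-up essays
train = [(200, 2), (300, 4), (400, 3), (500, 5)]
test = [(220, 2), (310, 3), (390, 4), (480, 5)]


def memorizer(words):
    """Copy the score of the training essay with the closest word count."""
    closest = min(train, key=lambda pair: abs(pair[0] - words))
    return closest[1]


def fit_line(data):
    """Ordinary least squares for score = slope * words + intercept."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum(
        (x - mean_x) ** 2 for x, _ in data
    )
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)


def line(words):
    """Predict a score from the fitted straight line."""
    return slope * words + intercept


def mean_abs_error(model, data):
    """Average distance between predicted and actual scores."""
    return sum(abs(model(x) - y) for x, y in data) / len(data)


for name, model in [("memorizer", memorizer), ("line", line)]:
    print(name,
          "train error:", round(mean_abs_error(model, train), 2),
          "test error:", round(mean_abs_error(model, test), 2))
```

The memorizer is flawless on the essays it has already seen (error 0.0) but is off by half a point on average on new ones, while the cruder line is worse on the training essays (0.6) and better on the unseen ones (0.32). Looking perfect on a small training set is exactly what overfitting looks like.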

One plausible way around overfitting is to specify a set of rules for the computer to act upon to determine whether a text makes sense or not, since computers can’t read. This solution has worked in many other applications. Right now, auto-grading vendors are throwing everything they’ve got at coming up with these rules. It’s just that it is so hard to come up with a rule that decides the quality of creative work such as essays. Computers have a tendency to solve problems the way they usually do: by counting.

In auto-grading, the grade predictors could, for example, be sentence length, the number of words, the number of verbs, the number of complex words and so on. Do these rules make for a sensible assessment? Not according to Perelman, at least. He says that the prediction rules are often set in a very rigid and limited way, which restricts the quality of these assessments. For example, he has found that:

- A longer essay is considered better than a short one (a coincidence, according to auto-grading advocate and professor Mark D. Shermis)

- Specific words associated with complex thinking, such as ‘moreover’ and ‘however’, lead to better grades

- Towering words such as ‘avarice’ give more points than simple ones such as ‘greed’
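To see why counting-based predictors like these are easy to game, here is a minimal sketch of such a scorer. The feature names, the list of transition words and the weights are all hypothetical; commercial systems keep their actual rules private, so this is not a reconstruction of any of them.

```python
import re

# Hypothetical list of words that "signal" complex thinking
TRANSITIONS = {"moreover", "however", "furthermore", "therefore"}

# Hypothetical weights: every feature simply adds to the score
WEIGHTS = {
    "word_count": 0.1,
    "avg_sentence_length": 0.1,
    "transition_words": 0.5,
    "long_words": 0.3,
}


def extract_features(essay):
    """Reduce an essay to a handful of counts, ignoring its meaning."""
    words = re.findall(r"[a-z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "transition_words": sum(w in TRANSITIONS for w in words),
        "long_words": sum(len(w) >= 8 for w in words),
    }


def naive_score(essay):
    """Weighted sum of the counts above."""
    features = extract_features(essay)
    return sum(WEIGHTS[name] * value for name, value in features.items())


plain = "Greed makes college cost more."
padded = ("Moreover, avarice inflates tuition; however, "
          "avarice inflates tuition relentlessly.")

print(extract_features(plain))
print(extract_features(padded))
print(naive_score(padded) > naive_score(plain))
```

The padded sentence repeats itself and says nothing the plain one does not, yet it wins on every count, which is the kind of weakness Perelman describes.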

In other instances he found examples of rules that were poorly applied or not applied at all. The software could not, for example, determine whether facts were true or false. In a published and automatically graded essay, the task was to discuss the main reasons why a college education is so expensive. Perelman argued that the explanation lies with greedy teaching assistants, who are paid six times as much as a college president and regularly use their complimentary private jets for South Sea vacations.

To avoid the examining eye of Perelman and his peers, most vendors have restricted use of their software while development is still ongoing. Perelman hasn’t yet gotten his hands on the most prominent systems and admits that so far he has only been able to fool a couple of systems.

If we are to believe Perelman’s claims, automatic grading of college-level essays still has a long way to go. But remember that lower-grade essays are already being graded by computers today. Granted, this happens under meticulous supervision by humans, but technological progress can move fast. Considering how much effort is being exerted towards perfecting automatic grading, it is likely we will see a fast expansion in the not too distant future.

About the author: Hubert.ai is a young edtech company based in Stockholm, Sweden. We are working to disrupt teacher feedback by using AI conversational dialog with every student separately. Feedback is then analyzed and compiled down to a few recommendations on how you as a teacher can improve your skills and methods. Are you a teacher and would like to help us in development? Please sign up as a beta tester at our website :]