Automated Engines Score Essays Like Humans
A PARCC study found that computer-assigned essay scores matched human-assigned scores across several performance metrics.
A report from a national testing organization found that the performance of automated scoring engines matches that of human scorers.
The Partnership for Assessment of Readiness for College and Careers (PARCC), a consortium of states working to create a common set of K-12 assessments in mathematics and English language arts aligned with the Common Core State Standards Initiative, recently released a report on the viability of computer-scored essays. The study was conducted in 2014 and published in 2015, but the report was not widely available to the public until the Parent Coalition for Student Privacy and other parties wrote a letter to state commissioners urging its broader release.
Pearson Education and Educational Testing Service (ETS) both participated in the research study, which tested the PARCC’s automated scoring and compared it against human scoring. The joint study included 75 prompts, spanning multiple grade levels and task types. Both Pearson and ETS first trained their scoring engines on the prompts using the corresponding human-scored responses. Then, Pearson and ETS fed an unseen set of student essays to their scoring engines and compared the results to human scores on the same unseen set. Performance was broken down by grade level, trait and type of prompt. The study revealed that, on average, the performance of the automated scoring engines matched that of the human scorers; only on grade three essays did the engines perform slightly below human scorers.
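To make the comparison step concrete, here is a minimal sketch of how machine scores on a held-out set might be checked against human scores. The article does not name the metrics the study used; exact agreement and quadratic weighted kappa are assumed here only because they are common in automated-scoring research. The function names and the sample scores are hypothetical and are not taken from the report.

```python
from collections import Counter


def exact_agreement(human, machine):
    """Fraction of essays on which both scorers gave the same score."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    return sum(h == m for h, m in zip(human, machine)) / len(human)


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Agreement corrected for chance, penalizing large score gaps more."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and equal in length")
    span = max_score - min_score
    if span <= 0:
        raise ValueError("max_score must be greater than min_score")

    n = len(human)
    observed = Counter(zip(human, machine))  # joint counts of score pairs
    human_counts = Counter(human)            # marginal counts per scorer
    machine_counts = Counter(machine)

    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) / span) ** 2
            observed_disagreement += weight * observed[(i, j)] / n
            expected_disagreement += (
                weight * human_counts[i] * machine_counts[j] / (n * n)
            )

    if expected_disagreement == 0:
        # Both scorers gave every essay the same single score; kappa is
        # formally undefined, so this sketch reports full agreement.
        return 1.0
    return 1.0 - observed_disagreement / expected_disagreement


# Hypothetical held-out essays scored on a 1-4 scale (illustrative only).
human_scores = [1, 2, 3, 4]
machine_scores = [1, 2, 3, 3]

print(exact_agreement(human_scores, machine_scores))  # 0.75
print(round(quadratic_weighted_kappa(human_scores, machine_scores, 1, 4), 3))  # 0.875
```

In a study like the one described, such figures would be computed separately for each grade level, trait and prompt type, and compared with the agreement between two human scorers on the same essays.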
In the letter, parents and advocates raised concerns about automated scoring, citing the “inability of computers to assess the creativity and critical thought that the Common Core standards were supposed to demand.” They wanted more information from the PARCC, such as the percentage of computer-scored tests that were re-checked by humans and what happens when machine scores differ significantly from human scores.
Despite the call for more information, several of the PARCC states will start using scoring engines to judge essays this year, according to an online report. This year, about two-thirds of all student essays will be scored automatically, while one-third will be scored by humans. In addition, about 10 percent of all responses will be randomly selected to receive a second score as a precaution. States can still opt to have all essays hand-scored.