Automated Engines Score Essays Like Humans

A PARCC study found that essay scores assigned by computers matched those assigned by humans across various performance metrics.

A report from a national testing organization found that the performance of automated scoring engines matches that of human scorers.

The Partnership for Assessment of Readiness for College and Careers (PARCC), a consortium of states working to create a common set of K-12 assessments in mathematics and English language arts aligned with the Common Core State Standards Initiative, recently released a report on the viability of computer-scored essays. The study was conducted in 2014 and published in 2015, but the report was not widely available to the public until the Parent Coalition for Student Privacy and other parties wrote a letter to state commissioners urging the PARCC to make it more widely available.

Pearson Education and Educational Testing Service (ETS) both participated in the research study, which compared automated scoring for the PARCC assessments against human scoring. The joint study included 75 prompts spanning multiple grade levels and task types. Pearson and ETS first trained their scoring engines on the prompts using the corresponding human-scored responses. They then fed an unseen set of student essays to the engines and compared the resulting scores with human scores for the same set. Performance was broken down by grade level, trait and type of prompt. The study found that, on average, the automated scoring engines performed as well as the human scorers; only on grade three essays did the engines perform slightly below human scoring.
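As a rough illustration of how engine scores on a held-out set might be compared with human scores, here is a minimal Python sketch. The article does not say which metrics the study used, so the exact-agreement and adjacent-agreement rates below are assumptions chosen for simplicity, and the function name and sample scores are hypothetical, not data from the report.

```python
def agreement_rates(human_scores, engine_scores):
    """Return (exact, adjacent) agreement rates between two score lists.

    exact:    share of essays where both scores are identical
    adjacent: share of essays where the scores differ by at most 1 point
    These metrics are assumptions; the report may have used others.
    """
    if not human_scores or len(human_scores) != len(engine_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    pairs = list(zip(human_scores, engine_scores))
    exact = sum(h == e for h, e in pairs) / len(pairs)
    adjacent = sum(abs(h - e) <= 1 for h, e in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical scores for six unseen essays (illustration only).
human = [3, 2, 4, 1, 3, 2]
engine = [3, 2, 3, 1, 4, 0]

exact, adjacent = agreement_rates(human, engine)
print(f"exact agreement: {exact:.2f}, adjacent agreement: {adjacent:.2f}")
```

In a comparison like the one described, the same rates would also be computed between two independent human scorers, so the engine can be judged against the level of agreement humans reach with each other.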

In the letter, parents and advocates raised concerns about automated scoring, citing the “inability of computers to assess the creativity and critical thought that the Common Core standards were supposed to demand.” They wanted more information from the PARCC, such as the percentage of computer-scored tests that were re-checked by humans and what happens when machine scores differ significantly from human scores.

Despite the call for more information, several of the PARCC states will start using scoring engines to judge essays this year, according to an online report. This year, about two-thirds of all student essays will be scored automatically, while one-third will be scored by humans. In addition, about 10 percent of all responses will be randomly selected to receive a second score as a precaution. States can still opt to have all essays hand-scored.

About the Author

Sri Ravipati is Web producer for THE Journal and Campus Technology.
