The state of Texas is transferring part of its high-stakes standardized test scoring process to robots.
The Texas Education Agency is using a natural language processing program, a type of artificial intelligence, to score the written portion of standardized tests administered to students in third grade and above.
Like many AI-related projects, the idea started as a way to reduce the cost of hiring humans.
Texas realized it needed a way to score far more written responses on the State of Texas Assessments of Academic Readiness (STAAR) after a new law required that, starting in the 2022-23 school year, at least 25 percent of questions be open-ended rather than multiple-choice.
Officials say the automated scoring system will save the state millions of dollars that would have been spent on contractors hired to read and score written responses. Only 2,000 graders were needed this spring, compared to 6,000 at the same time last year.
Using technology to grade essays is nothing new. Written responses on the GRE, for example, have long been scored by computers. A 2019 investigation by Vice found that at least 21 states use natural language processing to score students’ written responses on standardized tests.
Still, educators and parents alike felt blindsided by the news that essays by K-12 students would be graded automatically. Clay Robison, a spokesman for the Texas State Teachers Association, said many teachers learned of the change through media coverage.
“I know that the Texas Education Agency did not engage any of our members to ask what they thought about this,” he says. “And apparently they didn’t ask many parents either.”
The shift to using technology to score standardized test responses is raising concerns about fairness and accuracy, as low test scores can impact students, schools, and school districts.
Officials have been keen to stress that the system does not use generative artificial intelligence like the widely known ChatGPT. Rather, the natural language processing program was trained on 3,000 written responses submitted during past tests, along with the parameters human graders used to assign scores. A quarter of the scores it awards are reviewed by a human scorer.
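TEA has not published the engine’s design, so the sketch below is purely hypothetical: a toy scorer trained on past human-graded responses, with roughly a quarter of new machine scores flagged for human review, mirroring the workflow described above. The ridge-regression model, the sample data, and the `score_response` helper are all illustrative assumptions, not TEA’s actual method.

```python
# Illustrative sketch only; TEA's real engine and training data are not public.
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Hypothetical stand-ins for the ~3,000 past responses and their human scores.
past_responses = [
    "The author supports her claim with evidence from the flood records.",
    "i dont know",
]
past_scores = [4, 0]  # assuming a 0-4 rubric scale

# Turn past responses into features and fit a simple regression scorer.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(past_responses)
model = Ridge().fit(features, past_scores)

def score_response(text: str, review_rate: float = 0.25) -> tuple[int, bool]:
    """Score a new response and flag roughly a quarter for human review."""
    predicted = model.predict(vectorizer.transform([text]))[0]
    score = int(round(min(max(predicted, 0.0), 4.0)))  # clamp to rubric range
    needs_human_review = random.random() < review_rate  # ~25% spot-checked
    return score, needs_human_review

score, needs_review = score_response("The evidence shows rainfall increased.")
print(score, needs_review)
```

In a production system the training set would be far larger, and the review queue would more likely target low-confidence or unusual responses rather than a random sample.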
“The very notion that only boilerplate statements can be scored by this engine is not true,” Chris Rozunick, TEA’s director of assessment development, told the Houston Chronicle.
The Texas Education Agency did not respond to EdSurge’s request for comment.
Fairness and accuracy
One question is whether the new system will fairly grade the writing of bilingual children and children learning English. About 20 percent of Texas public school students are learning English, according to federal data, but not all of them are old enough to take standardized tests.
Rocío Raña is CEO and co-founder of LangInnov, a company that uses automated scoring to assess language and literacy in bilingual students and is working on alternative assessments for writing. She has spent much of her career thinking about how to improve educational technology and assessment for bilingual children.
Raña isn’t opposed to the idea of using natural language processing to assess students. She recalls that when she came to the United States as a student 20 years ago, one of her own graduate school entrance exams was graded by a computer.
A red flag for Raña is that, based on publicly available information, Texas does not appear to have spent what she considers a reasonable development timeline of two to five years on the program. That much time, she says, is needed to test and fine-tune its accuracy.
She also notes that natural language processing and other AI programs tend to be trained on the writing of monolingual, white, middle-class people, a profile that does not fit many students in Texas. More than half of the state’s students are Latino, and according to state data, 62 percent are considered economically disadvantaged.
“It’s a good initiative, but maybe they went about it the wrong way,” she says. “You should never build a high-stakes assessment based on ‘I want to save money.’”
Raña said the process requires not only taking the time to develop an automated grading system, but also rolling it out slowly to ensure it works for a diverse student population.
“[That] is difficult for automated systems,” she says. “What always happens is that they’re very discriminatory against people who don’t follow the norm, and in Texas, those people are probably the majority.”
Kevin Brown, executive director of the Texas Association of School Administrators, said the concerns he’s heard from administrators are about the rubrics the automated system uses to grade.
“When you had a human grader, the rubrics used to assess writing rewarded originality of voice,” he says. “Machine-scorable writing may encourage machine-like writing.”
TEA’s Rozunick told the Texas Tribune that the system doesn’t penalize students who answer differently; responses that are truly unique are routed to a human scorer instead.
In theory, bilingual Spanish-speaking or English learner students could have their written responses flagged for human review, allaying fears that the system would lower their scores.
Raña argues that this is itself a form of discrimination, because bilingual children’s writing is evaluated differently from that of children who write only in English.
Raña also found it strange that after Texas added open-ended questions to the test to give students more room for creativity, the state ended up having most of the answers read by computers rather than humans.
The automated scoring program was first used to grade essays for a small number of students who took the STAAR standardized test in December. Brown said he has heard from school administrators that the number of students receiving zero points on written responses has increased sharply.
“Some school districts are alarmed by the high number of zeros their students received,” Brown said. “I think it’s too early to tell whether that’s due to machine grading or not. The bigger question is, if a child wrote an essay and received a zero, how do you explain that to their family?”
A TEA spokesperson told the Dallas Morning News that previous versions of the STAAR test only awarded zeros for blank or nonsensical answers, but the new rubric allows zeros based on content.
High stakes
Concerns about the potential impact of using AI to score standardized tests in Texas can only be understood in the context of the state’s school accountability system, Brown said.
The Texas Education Agency draws on extensive data, including STAAR test results, to assign each district and school a single letter grade from A to F. It’s a system that many people don’t fully understand, and the stakes are high, Brown said. One writer described the exams, and the annual preparation for them, as “a circus full of children’s anxiety.”
TEA can take over a school district that receives an F five times in a row, as happened to the large Houston Independent School District in the fall. That takeover was triggered by failing grades at just one of the district’s 274 schools, and both the superintendent and the elected school board were replaced with state appointees. Since the takeover, there has been a steady stream of news about protests and controversial changes at “low-performing” schools.
“Accountability systems are a source of confusion for school districts and parents because they sometimes feel disconnected from what’s actually happening in the classroom,” Brown said. “So any time you change anything that feeds into the rating, because the accountability [system] is such a blunt instrument, people worry about the change, especially when there is no clear communication about what it is.”
Robison said his organization, which represents teachers and school staff, advocates for repealing the STAAR test entirely, and that adding an opaque automated grading system will do nothing to build trust in state education officials.
“There’s already a lot of mistrust surrounding STAAR, what it stands for and what it’s trying to accomplish,” Robison said. “It’s not an accurate measure of student achievement, and given that most of us were surprised by this, there’s great suspicion that it will lead to further mistrust.”