Is there something weird about California’s standardized test scores?

Last year, 49 percent of California students who took the test scored as meeting the state’s reading and writing standards. This year, that number flatlined at 49 percent. So despite most teachers and students having an additional year to get familiar with the exam, and an additional year of instruction conceivably tailored to improve on student weaknesses identified in the test, California public schools were no better at getting students to master state English standards.

That’s certainly plausible. State English standards have been tougher since California adopted Common Core, a set of academic standards intended to measure a deeper understanding of math and English. And California’s relatively large populations of low-income children and non-English speakers have long suffered from a persistent achievement gap that could make it difficult to significantly bump up test scores from one year to the next.

But it’s not only California experiencing lackluster 2017 scores.

The other 12 states that published scores and administered the same standardized test with the same questions—known as the Smarter Balanced Assessment—all saw their scores dip or remain stagnant. Combined, those states contain a striking ethnic and socioeconomic diversity of students—from West Virginia to Washington state to Hawaii. Vermont’s English scores for students in grades 3 through 8 fell about 4 percentage points.

No state among the 12, despite different education policies at the state and local level, saw progress on the English portion of the test.

“When you see all states losing ground or holding flat, it’s like, ‘Really?’” says Paul Warren, an education researcher at the Public Policy Institute of California. “Everybody had that same problem? That bears some looking into.”

Warren stops well short of declaring that there was something wrong with the test; he just says it demands further scrutiny. And other education experts, while scratching their heads to interpret California and other state results, don’t outright declare the test flawed in some way.

California’s Department of Education has not expressed any dissatisfaction with the test itself in any of its official communication. In a statement last week, State Superintendent of Public Instruction Tom Torlakson said of the results that he was “pleased we retained our gains, but we have much more work to do.” Multiple requests for comment from the department were not answered.

Chris Barron, a spokesman for the Smarter Balanced Assessment Consortium, the organization that designed the test, said he has confidence in this year’s version of the exam.

“At this point, Smarter Balanced has every reason to believe that the spring 2017 scores accurately describe what students knew and were able to do related to the (English language arts) and mathematics content standards,” Barron wrote via email. He added that the organization will conduct its annual review of testing data, which it is still receiving from states.

Why did states using a different test see test scores improve?

While English scores on the Smarter Balanced test plateaued or declined across all states, a different test intended to measure the same thing—how well students are mastering Common Core—yielded a markedly different trend.

Of the six states and one city that administer the Partnership for Assessment of Readiness for College and Careers (PARCC) tests, five saw increases in their English test scores this year over last. That includes what look like fairly significant gains in Colorado and New Jersey and a major leap in Washington, D.C.

“The most surprising thing to me is how different the growth patterns have been for Smarter Balanced versus PARCC,” said David Pearson, an education researcher at the University of California, Berkeley, who helped advise Smarter Balanced on the English component of the test. “The growth patterns on PARCC looks more like what you would expect” over the three years of the test.

Whenever a new standardized test is dropped on students, researchers typically see a trend: Proficiency scores decline markedly from the last year of the old test to the first year of the new test. Then in subsequent years, as students and teachers get familiar with the exam, scores improve.

The first two years of Smarter Balanced and PARCC—both deployed starting in 2015—roughly fit that trend. Many Smarter Balanced states actually saw larger increases in proficiency than PARCC states. California saw a 5 percentage point gain on the English portion of the test between 2015 and 2016.

But the leveling off of Smarter Balanced scores—particularly in contrast to the upward trajectory of scores in PARCC states—raises a red flag.

“To me the only thing that makes any sense is that…. there was something going on with the assessment or administration [of Smarter Balanced] that year that would account for that,” said Pearson.

Barron, spokesman for Smarter Balanced, calls it misleading to compare the two tests.

“PARCC is a different test administered in different states with different achievement standards and should not be directly compared with Smarter Balanced,” he wrote via email.

While differences in English test scores were most striking between the two sets of tests, changes in math scores were more comparable. About half the states in both groups saw very modest increases in their math scores, while some saw no change or small negative changes.

What could explain the results?

The evidence that the 2017 Smarter Balanced tests were off in some way is as of now purely circumstantial. Until a more thorough autopsy of test results is conducted, researchers can only speculate on possible reasons for the trend.

Explanation 1: The test was fine, and most of the improvement we saw between the first and second year of the test was simply a result of students and teachers becoming more comfortable with the exam the second time around.

As much as possible, Smarter Balanced tests are designed to probe a student’s deeper knowledge of core concepts and avoid instructors “teaching to the test.” Now that the shock of a new test, taken for the first time in California on a computer, has worn off, we’re getting a more reliable measure of student achievement and growth.

That explanation makes sense on many levels. But then you would expect the same trend in the PARCC states, since 2017 was also the third year of those tests. Instead, most PARCC states saw growth in scores.

Explanation 2: This year’s test was harder in one way or another, artificially depressing scores. Or last year’s test was too easy, artificially inflating scores.

Another intuitive explanation, and again another less-than-satisfying one. Smarter Balanced test scores take into account the difficulty of the questions students receive. Scores are calibrated based not just on the percentage of questions a student answers correctly, but on the performance of other students and the relative difficulty of the question. It’s of course possible something got mucked up there. But the tests should be carefully designed to avoid such mistakes.

Perhaps more importantly, this year’s Smarter Balanced test looked an awful lot like last year’s. About 70 percent of the pool of possible questions appeared in both years, according to Smarter Balanced.

Explanation 3: A shortage of easy questions, combined with a computer adaptive test, made for funky results.

Unlike the PARCC tests, Smarter Balanced tests are computer adaptive—the difficulty of questions a student receives is in part determined by whether a student answered a preceding question correctly. This design aims to generate a more discerning look at how students at the lower and higher ends of the achievement spectrum are faring.

Doug McRae, a retired educational measurement specialist who used to advise California on standardized tests, speculated that a shortage of easier questions could have suppressed scores. In other words, students missing lots of questions and thus sent down a pathway of simpler questions may have exhausted the supply.

Smarter Balanced says it has no evidence of that, and points to increases in the sheer number of possible questions students could encounter between 2016 and 2017.
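To make McRae’s hypothesis concrete, here is a minimal sketch in Python of how a thin supply of easy questions could play out on an adaptive test. It is a toy model, not Smarter Balanced’s actual algorithm: the 1-to-10 difficulty scale, the pool sizes, the rule that a student answers correctly only when a question is at or below his or her ability, and the function names are all assumptions made for illustration.

```python
def run_adaptive_test(pool, ability, num_questions, start=5):
    """Toy adaptive test (hypothetical, not the real Smarter Balanced engine).

    pool: list of question difficulties on an assumed 1-10 scale.
    Returns a list of (target_difficulty, served_difficulty, correct).
    """
    remaining = sorted(pool)
    target = start
    log = []
    for _ in range(num_questions):
        if not remaining:
            break
        # Serve the unused question closest in difficulty to the target,
        # preferring the easier one when two are equally close.
        served = min(remaining, key=lambda d: (abs(d - target), d))
        remaining.remove(served)
        # Simplifying assumption: a correct answer is guaranteed whenever
        # the question is no harder than the student's ability.
        correct = served <= ability
        log.append((target, served, correct))
        # Step the target up after a correct answer, down after a miss.
        target = min(10, target + 1) if correct else max(1, target - 1)
    return log


def off_target_count(log):
    """Count questions whose difficulty differed from what the test wanted."""
    return sum(1 for target, served, _ in log if served != target)


# Balanced pool: three questions at every difficulty level.
balanced_pool = [d for d in range(1, 11) for _ in range(3)]
# Thin pool: only one question each at the three easiest levels.
thin_pool = [1, 2, 3] + [d for d in range(4, 11) for _ in range(3)]

balanced_log = run_adaptive_test(balanced_pool, ability=2, num_questions=8)
thin_log = run_adaptive_test(thin_pool, ability=2, num_questions=8)

print(off_target_count(balanced_log))  # 0
print(off_target_count(thin_log))      # 4
```

In this toy run, a struggling student working from the thin pool is handed a mismatched question on half of the eight items once the easy ones run out. Whether anything like that happened on the real exam is exactly what remains unverified.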

Where’s the data?

Interestingly, most of the discussion around Smarter Balanced scores does not involve comparing one state’s performance to another’s. In the press releases and in media coverage of test results in participating states, it is rare to see cross-state comparisons, even though states are administering the same test.

That’s partly because there’s no single repository for all the Smarter Balanced data that can be easily broken down by state.

To examine how states stack up against one another, McRae has to go to each state testing website and compile the numbers himself. No one else is really doing that, and the numbers in this story and others rely on McRae’s Good Samaritan grunt work. (You can read his analysis of the scores here.)

Part of the impetus for Smarter Balanced came from the Obama administration’s desire to compare performance across states. But strict data-sharing policies by some states have made those comparisons—and especially detailed comparisons that move beyond top-line estimates—more difficult.

Many other states participating in Common Core administer their own tests, as opposed to Smarter Balanced or PARCC, and thus can’t be easily compared.