I recently did a peer-reviewed deep dive into the test design documentation to figure out why the reported results weren’t showing improvement. I found the flat scores were at least in part by design.
Some results have been hidden because they may be inaccessible to you
Show inaccessible results