They found three main issues:

- Python overfitting
- Language-specific contamination
- Large gaps in multilingual performance

Top Python models struggle with Rust, JavaScript, and Go. This gap is ...