Mass General Brigham's BRIDGE benchmark found that top AI models scored 92% on medical exams but only 44.8% on real-world clinical tasks.