Abstract: The tensor cores in modern GPUs lead to significant performance improvement in matrix multiplication, which is the primary operation in deep learning. However, existing hardware ...