Abstract: Understanding videos, especially aligning them with textual data, presents a significant challenge in computer vision. The advent of vision-language models (VLMs) like CLIP has sparked ...