Abstract: Understanding videos, especially aligning them with textual data, presents a significant challenge in computer vision. The advent of vision-language models (VLMs) like CLIP has sparked ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results