Abstract: Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding, such as image captioning and response generation. As the ...
Some results have been hidden because they may be inaccessible to you
Show inaccessible results