This research shows that a vision-language model (VLM) can be made to pay more attention to the image simply by chaining attention weights.