Shanghai University and Nankai University Reveal an Overlooked but Important Bias Problem in Multimodal Models
2026-02-26 12:23:09

In recent years, Vision-Language Models have made significant progress on multimodal understanding tasks and have gradually become an important technical path toward general artificial intelligence. In practice, however, these models often suffer from high inference overhead and limited efficiency. Researchers typically rely on strategies such as visual token pruning to reduce compute cost, and the attention mechanism is widely treated as the key criterion for measuring how important each piece of visual information is.

Recently, Zeng Dan's team at Shanghai University, together with researchers from Nankai University, examined the problem from the standpoint of attention reliability. They systematically revealed attention biases that are widespread in Vision-Language Models and proposed an attention debiasing method that requires no retraining. The method was validated across multiple mainstream models, pruning strategies, and image and video benchmarks, offering a new approach to deploying multimodal models efficiently and reliably.

Paper title: Attention Debiasing for Token Pruning in Vision Language Models
Paper link: https://arxiv.org/abs/2508.17807
Code link: https://github.com/intcomp/attention-bias

I. Research Significance

In recent years, Vision-Language Models (VLMs) have performed strongly on tasks such as image understanding, visual question answering, and multimodal dialogue, and have gradually become an important technical foundation for general artificial intelligence. When deployed in practice, however, these models face a concrete challenge: inference is expensive and slow.

To improve efficiency, researchers commonly use visual token pruning, which discards unimportant visual information without significantly affecting performance. The attention mechanism is widely used as the core criterion for deciding which visual tokens are more important.
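As an illustration of attention-based pruning in general, the sketch below keeps the visual tokens with the highest attention scores and drops the rest. It is a minimal example, not the procedure of any specific pruning method: the function name `prune_by_attention`, the toy scores, and the assumption that each token has one scalar score are all hypothetical.

```python
import numpy as np

def prune_by_attention(tokens, scores, keep):
    """Keep the `keep` visual tokens with the highest attention scores."""
    # argsort is ascending, so the last `keep` indices are the top scorers.
    top = np.argsort(scores)[-keep:]
    # Sort the indices so the kept tokens preserve their sequence order.
    kept_idx = np.sort(top)
    return tokens[kept_idx], kept_idx

# Toy input (assumed): 6 visual tokens with 4-dim features and one score each.
tokens = np.arange(6 * 4, dtype=float).reshape(6, 4)
scores = np.array([0.05, 0.30, 0.10, 0.25, 0.02, 0.28])

kept, kept_idx = prune_by_attention(tokens, scores, keep=3)
print(kept_idx)  # indices of the retained tokens
```

Because the ranking depends only on the scores, any systematic bias in those scores carries straight through to the set of tokens that survive.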

But Zeng Dan's team at Shanghai University found that attention is not always a reliable indicator of importance. In multimodal models, attention is often affected by several structural biases. These biases are unrelated to the actual semantics, yet they directly sway the pruning result and thereby affect model performance.

To address this problem, the team systematically analyzed how attention behaves in VLMs and proposed an Attention Debiasing method that improves the stability and reliability of several mainstream pruning methods without retraining the model. As the accompanying figure shows, applying the proposed method to existing attention-based pruning methods improves every one of them.

II. Research Background

Intuitively, the attention mechanism is often understood as "where the model is looking", and so it is naturally taken as an expression of semantic importance. The research by Zeng Dan's team shows, however, that in Vision-Language Models attention is often not determined by content alone; it also carries several systematic biases.

Two of these are the most typical:

The first is positional bias (recency bias). The study found that language-to-vision attention keeps increasing with a visual token's position in the sequence; in other words, the model tends to attend more to later tokens. As the figure shows, this usually appears as the model assigning higher attention to the lower part of the image, even when that region contains no key information.
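The link between "later tokens" and "the lower part of the image" follows from how patch tokens are laid out when the patch grid is flattened row by row. The sketch below shows that mapping. Both the row-major ordering and the 24 x 24 grid (576 visual tokens) are assumptions made for illustration and are not stated in the article; `token_to_grid` is a hypothetical helper.

```python
def token_to_grid(index, grid_width):
    """Map a flattened token index to (row, col), assuming row-major order."""
    return divmod(index, grid_width)

GRID = 24  # assumed 24 x 24 patch grid, i.e. 576 visual tokens

first = token_to_grid(0, GRID)         # top-left patch
second_row = token_to_grid(24, GRID)   # first patch of the second row
last = token_to_grid(575, GRID)        # bottom-right patch
print(first, second_row, last)
```

Under this layout, a score that grows with sequence position grows with the row index as well, so a recency bias favors the bottom rows of the image.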

The second is the attention sink phenomenon caused by padding. In real inputs, images often need padding to reach a uniform size, but these regions are semantically blank. Because abnormal activations appear in the hidden states, however, the tokens corresponding to padding can receive relatively high attention and are then wrongly retained. The accompanying figure shows the attention scores and the hidden-state activations of the pad region when it is filled with different values.

More notably, when attention is used to rank tokens for pruning, these biases are not weakened. They are amplified further, and the pruning result ends up deviating from what the semantics actually require.

III. Method

In response, Zeng Dan's team at Shanghai University neither proposed a new pruning algorithm nor modified the model architecture. They started from a more fundamental question: since attention itself is biased, can the attention be corrected first?

The team observed that the bias in attention is not random noise but follows a stable overall trend. They therefore fitted the trend of attention as a function of token position to build a curve that reflects the positional bias, and then used that curve to debias the original attention. This explicitly weakens the position-related component that is unrelated to content and brings attention closer to the true semantic importance. The accompanying figure illustrates the procedure.
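The idea of fitting a positional trend and removing it can be sketched as follows. The article does not specify the functional form of the fitted curve or how the correction is applied, so the low-order polynomial fit and the subtraction used here are assumptions made for illustration, not the authors' implementation; `debias_attention` is a hypothetical name.

```python
import numpy as np

def debias_attention(scores, degree=1):
    """Remove a fitted positional trend from per-token attention scores."""
    positions = np.arange(len(scores))
    # Fit a low-order polynomial to attention as a function of token position.
    coeffs = np.polyfit(positions, scores, deg=degree)
    trend = np.polyval(coeffs, positions)
    # Subtract the trend, then add back the mean to stay on the original scale.
    return scores - trend + scores.mean()

# Toy scores (assumed): a linear rise with position plus a content spike at index 2.
raw = 0.01 * np.arange(10)
raw[2] += 0.05

debiased = debias_attention(raw)
print(int(np.argmax(raw)), int(np.argmax(debiased)))
```

In this toy case the raw scores rank the last token highest purely because of its position, while the detrended scores rank the token with the content spike highest.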

At the same time, the influence of padding tokens is explicitly suppressed at the pruning stage, so that semantically empty regions do not interfere with the pruning ranking. The whole process requires no retraining and does not depend on any particular pruning strategy, so it can be integrated directly into existing methods as a plug-and-play module.
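One simple way to keep padding tokens out of the ranking is to mask their scores before selecting the top tokens. The sketch below assumes a boolean mask marking padding positions is available and sets those scores to negative infinity. The article does not describe the exact suppression mechanism, so this is only one possible realization; `select_tokens` is a hypothetical name.

```python
import numpy as np

def select_tokens(scores, is_padding, keep):
    """Return indices of the top-`keep` non-padding tokens, in sequence order."""
    # Padding positions get -inf so they can never outrank a real token.
    masked = np.where(is_padding, -np.inf, scores)
    # Never keep more tokens than there are non-padding positions.
    keep = min(keep, int((~is_padding).sum()))
    if keep == 0:
        return np.array([], dtype=int)
    top = np.argsort(masked)[-keep:]
    return np.sort(top)

# Toy input (assumed): the two padding tokens carry the highest raw scores.
scores = np.array([0.1, 0.4, 0.2, 0.9, 0.8])
is_padding = np.array([False, False, False, True, True])

kept_idx = select_tokens(scores, is_padding, keep=2)
print(kept_idx)
```

Because the mask is applied only to the scores used for ranking, this step can sit in front of any score-based pruning rule without changing the model.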

IV. Experimental Results

For experimental validation, the team integrated the Attention Debiasing method into six mainstream attention-based pruning methods (FastV, PyramidDrop, SparseVLM, HiMAP, TokenCarve, and iLLaVA) and evaluated them systematically on 10 image understanding benchmarks and 3 video understanding benchmarks, covering several mainstream Vision-Language Models including LLaVA-7B and LLaVA-13B.

The results show that in almost all settings, the pruned models gain consistent and stable performance improvements after attention debiasing, and the effect is especially pronounced when pruning is more aggressive and the token budget is tighter. This indicates that debiasing attention helps the model make more reliable judgments with less information.

In addition, visual analysis of the results shows that the original attention-based pruning methods tend to retain many visual tokens located in the lower part of the image or in padding regions, while key regions closely related to the semantics of the question are easily overlooked. With attention debiasing, the visual regions the model retains concentrate more on the target objects and key details, which effectively reduces interference from irrelevant background. This result directly demonstrates the role of attention debiasing in making pruning more reasonable and more interpretable.

V. Conclusion

The study shows that attention is not inherently equivalent to semantic importance. In Vision-Language Models in particular, attention-based pruning strategies can be misled if the latent structural biases in attention are ignored. With a simple yet effective attention debiasing method, Zeng Dan's team at Shanghai University significantly improved the ability of multimodal models to balance efficiency and reliability.