
In today's post-training stage for large models, DPO (Direct Preference Optimization) has displaced PPO as the industry's method of the moment, thanks to an elegant, efficient design that needs no separately trained reward model. It is widely used to align leading open-source models such as Llama-3 and Mistral.
As the demands on model capability grow stricter, however, DPO's shortcomings are becoming visible.
How can DPO learn to separate signal from noise and accurately identify the critical tokens that actually decide which response wins?
To address this question, researchers from the Institute of Automation of the Chinese Academy of Sciences, ByteDance, Microsoft Research Asia, and the University of Science and Technology Beijing jointly propose a new framework, TI-DPO, in work selected as an ICLR 2026 Oral.

Paper: Token-Importance Guided Direct Preference Optimization
Paper link: https://arxiv.org/abs/2505.19653
Code: https://github.com/gracefulning/TIDPO
Research background and significance
Mainstream methods face two core problems that keep models from achieving truly fine-grained semantic control:
Pain point 1: the sequence-level "binary" trap. Traditional methods still optimize at the coarse sequence level, crudely labeling each response as good or bad. This binary supervision signal is extremely impoverished, because it hides the fact that a high-quality response can still contain flawed tokens. The result is poor fine-tuning in a continuous semantic space, and it can even cause a shift in the sampling distribution (distribution shift).
Pain point 2: "pseudo" importance hijacked by bias. Even when methods move down to the token level, existing importance estimates have problems. Many rely on predicted probabilities or simple weighting, so they directly inherit an inherent flaw of the model architecture: the U-shaped attention bias ("Lost in the Middle"), in which a model naturally over-attends to the first and last tokens and neglects the core semantics in the middle.
Core mechanisms of TI-DPO
The core idea of TI-DPO is that tokens are not created equal, so each should get its own weight. By introducing a hybrid weighting mechanism and a triplet loss, TI-DPO identifies and amplifies the signal from "key tokens" while suppressing noise, which gives more accurate and more stable alignment than standard DPO. It has two core mechanisms:
1. Hybrid weighting mechanism (Hybrid Weighting)
To find out which tokens decide the quality of a response, TI-DPO computes weights by combining a data-driven signal with a structural prior:
Gradient attribution: compute the norm of the gradient of the loss with respect to each token embedding. Put simply, the more a token contributes to the final output, the higher its weight.
Gaussian prior: to counter the U-shaped attention bias common in LLMs (over-attention to the beginning and end), a Gaussian distribution is introduced to push the model to attend to the semantic core in the middle.
The final token weight is a convex combination of the two.
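The weighting described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the mixing coefficient `lam`, the width parameter `sigma_frac`, and the choice to normalize both terms so they sum to 1 are all hypothetical, and the per-token gradient norms are assumed to have been computed elsewhere.

```python
import numpy as np

def gaussian_prior(seq_len, sigma_frac=0.25):
    """Bell-shaped prior over token positions, peaked at the middle.

    sigma_frac (hypothetical) sets the width as a fraction of seq_len.
    """
    pos = np.arange(seq_len)
    center = (seq_len - 1) / 2.0
    sigma = max(sigma_frac * seq_len, 1e-6)
    prior = np.exp(-0.5 * ((pos - center) / sigma) ** 2)
    return prior / prior.sum()

def hybrid_weights(grad_norms, lam=0.5, sigma_frac=0.25):
    """Convex combination of gradient-attribution weights and a Gaussian prior.

    grad_norms: per-token gradient norms of the loss w.r.t. token embeddings
                (assumed to be precomputed).
    lam:        mixing coefficient in [0, 1] (hypothetical name).
    """
    g = np.asarray(grad_norms, dtype=float)
    total = g.sum()
    # Fall back to uniform weights if every gradient norm is zero.
    grad_w = g / total if total > 0 else np.full(len(g), 1.0 / len(g))
    prior_w = gaussian_prior(len(g), sigma_frac)
    return lam * grad_w + (1.0 - lam) * prior_w

# Example: a 5-token response whose last token has the largest gradient norm.
weights = hybrid_weights([0.2, 0.1, 0.1, 0.1, 0.5], lam=0.5)
print(weights, weights.sum())
```

Because both terms are normalized to sum to 1, the combined weights also sum to 1 for any `lam` between 0 and 1, which keeps the weighted loss on a comparable scale across sequences of different lengths.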

The weights are then plugged into a new token-level weighted DPO loss.
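One plausible shape for such a loss is the standard DPO objective with each token's log-probability ratio scaled by its weight. The sketch below assumes exactly that form; the function name, argument names, and the default `beta` are hypothetical, and the paper's exact formulation may differ.

```python
import numpy as np

def weighted_dpo_loss(logp_w, ref_logp_w, weights_w,
                      logp_l, ref_logp_l, weights_l, beta=0.1):
    """Token-weighted DPO loss for a single preference pair (illustrative).

    logp_* / ref_logp_*: per-token log-probabilities under the policy and the
                         frozen reference model.
    weights_*:           per-token importance weights.
    Suffix _w is the preferred response, _l the non-preferred one.
    """
    ratio_w = np.asarray(logp_w, dtype=float) - np.asarray(ref_logp_w, dtype=float)
    ratio_l = np.asarray(logp_l, dtype=float) - np.asarray(ref_logp_l, dtype=float)
    # Weighted sums replace the plain sequence-level log-ratio sums of DPO.
    margin = beta * (np.dot(weights_w, ratio_w) - np.dot(weights_l, ratio_l))
    # -log(sigmoid(margin)) written stably as log(1 + exp(-margin)).
    return float(np.logaddexp(0.0, -margin))

# When policy and reference agree, the margin is 0 and the loss equals log(2).
print(weighted_dpo_loss([-1, -2], [-1, -2], [0.5, 0.5],
                        [-1, -1], [-1, -1], [0.5, 0.5]))
```

With every weight set to 1, the weighted sums reduce to the sequence-level log-ratios and the expression becomes the ordinary DPO loss, which is a convenient sanity check.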

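2. Triplet loss (Triplet Loss)
TI-DPO is no longer satisfied with a black-and-white binary comparison and instead brings in triplet loss, a workhorse of metric learning. During training it sets up three roles, which the case study below names as the Preferred response, the Non-Preferred response, and the Intermediate Response.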
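For readers unfamiliar with triplet loss, here is the generic margin-based form from metric learning. Mapping the intermediate response to the anchor, the preferred response to the positive, and the non-preferred response to the negative is an assumption made for illustration; the article does not say which representation or distance TI-DPO uses, so the vectors, the squared Euclidean distance, and the `margin` default are all hypothetical.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic margin-based triplet loss on vector representations.

    anchor:   representation of the intermediate response (assumed role).
    positive: representation of the preferred response (assumed role).
    negative: representation of the non-preferred response (assumed role).
    """
    a, p, n = (np.asarray(v, dtype=float) for v in (anchor, positive, negative))
    d_pos = np.sum((a - p) ** 2)  # squared distance to the preferred response
    d_neg = np.sum((a - n) ** 2)  # squared distance to the non-preferred response
    # Zero once the anchor is closer to the positive than the negative by `margin`.
    return float(max(d_pos - d_neg + margin, 0.0))

print(triplet_loss([0.0, 0.0], [0.0, 0.0], [3.0, 0.0]))  # anchor already near positive
print(triplet_loss([0.0, 0.0], [1.0, 0.0], [1.0, 0.0]))  # equidistant: margin is violated
```

The loss pulls the anchor toward the positive and pushes it away from the negative, which matches the article's description of moving the model's own output toward the preferred response and away from the non-preferred one.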



TI-DPO loss function: the final optimization objective of TI-DPO is a weighted sum of the two losses above.
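Written out with placeholder symbols, a weighted sum of this kind could take the following form. The notation is an assumption for illustration only: $\mathcal{L}_{\text{DPO-w}}$ stands for the token-level weighted DPO loss, $\mathcal{L}_{\text{triplet}}$ for the triplet loss, and $\gamma \ge 0$ is a hypothetical coefficient that balances the two terms.

```latex
\mathcal{L}_{\text{TI-DPO}} = \mathcal{L}_{\text{DPO-w}} + \gamma \, \mathcal{L}_{\text{triplet}}
```

Setting $\gamma = 0$ in this form would leave only the weighted DPO term, which corresponds to removing the triplet loss in an ablation.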

Experimental results
To test TI-DPO in practice, the research team ran experiments on several mainstream base models, including Llama-3 (8B/3B) and Mistral-7B, and compared it with more than 10 alignment algorithms, including DPO, SimPO, and the recently popular GRPO.
1. Overall capability evaluation
As shown in Figure 1, on the Llama-3.1-8B-Instruct base, TI-DPO reaches an overall average score of 62.3, ahead of GRPO (62.1) and DPO (60.8).

2. Strong results in specific domains
On IFEval (instruction following), TruthfulQA (truthfulness), and HumanEval (code generation), the three tasks that most test attention to detail, TI-DPO substantially outperforms DPO, SimPO, and GRPO.


3. Ablation study: every core component is needed
The ablation results in Table 2 show that all of TI-DPO's core components (the hybrid weighting mechanism, the Gaussian prior, and the triplet loss) are critical to model performance. Removing any one of them causes a significant drop across metrics for general capability, mathematical reasoning, and code generation.

4. Case study: seeing "key tokens" at a glance
To check whether TI-DPO has really learned to focus on what matters, the authors show a heatmap of token weights for a medical consultation example ("What should I do about a headache?").
In the Preferred response (left): the model assigns very high weights (the dark red regions) to "seek medical attention" and "promptly", capturing the "safety first" point.
In the Non-Preferred response (right): the model singles out "painkillers casually", a potentially high-risk suggestion, and gives it a high weight so that it is penalized.
The Intermediate Response reflects the model's own current level: "get more rest, and see a doctor if it gets worse". During generation, TI-DPO steers the model toward the values of the Preferred response and away from the pitfalls of the Non-Preferred one, moving it from coarse to fine-grained behavior.

This offers strong evidence that TI-DPO is not memorizing by rote but has actually picked up on human values.
Summary and contributions
TI-DPO is a solid attempt at moving large-model alignment from coarse sequence-level optimization toward finer token-level control. Rather than judging a response as simply "better" or "worse", it tries to work out the real contribution of each token to value alignment.
The experiments show that on instruction following, truthfulness, and code generation, TI-DPO delivers consistent gains over baselines such as GRPO, which supports the view that using data at a finer granularity is an effective way to improve model capability.
With its strengths in denoising and fine-detail control, TI-DPO gives follow-up RLHF research a new direction worth watching. We look forward to more work on fine-grained value alignment that pushes large models toward greater precision and controllability.