A New Perspective on World Models: From Video Generation Toward General-Purpose World Simulators
2026-03-04 01:48:37

In recent years, video generation and world models have become two of the most closely watched topics in artificial intelligence. From Sora to Kling, video generation models have shown progressively stronger "world consistency" in motion continuity, object interaction, and some physical priors. This has prompted a serious question: can video generation move beyond "realistic short clips" and become a "general-purpose world simulator" usable for reasoning, planning, and control?

At the same time, this line of research is rapidly converging with frontier applications such as embodied AI and autonomous driving, and is regarded as an important path toward artificial general intelligence (AGI).

Behind the surge of interest, however, core questions such as "what counts as a true world model" and "how should a video model's world-simulation ability be judged" remain contested on several fronts. Definitions and taxonomies of world models keep multiplying, and the overlap among theoretical dimensions often confuses researchers and hinders standardization of the field.

To establish a more systematic and clearer lens, the Kuaishou Kling team and the group of Professor Yingcong Chen at The Hong Kong University of Science and Technology (Guangzhou), with PhD students Luozhou Wang and Zhifei Chen as co-first authors, have jointly released a systematic survey that analyzes video world models from a new angle.

The survey aims to bridge the gap between contemporary "stateless" video architectures and classical "state-centric" world model theory. It proposes, for the first time, a taxonomy built on two pillars: State Construction and Dynamics Modeling.

In addition, the survey argues that evaluation should shift from pure "visual fidelity" to "functional benchmarks", and it identifies two key technical frontiers, offering a roadmap for evolving video generation into robust general-purpose world simulators.

Paper title: A Mechanistic View on Video Generation as World Models: State and Dynamics
Paper link: https://arxiv.org/pdf/2601.17067
GitHub link: https://github.com/hit-perfect/Awesome-Video-World-Models

Overview of the survey's structure

Key highlights: what are the survey's main contributions?

Compared with earlier video generation research that focused on visual quality, this survey stands out along several dimensions:

Full-stack perspective: The survey moves beyond a pure "rendering" view and covers the whole pipeline, from theoretical definitions at the bottom, through architecture design in the middle (state construction and dynamics modeling), to functional evaluation at the top, giving an end-to-end understanding of video world models.

Bridging the theory gap: It is the first to map contemporary "stateless" video diffusion architectures onto classical model-based reinforcement learning (MBRL) and control theory, giving world models a solid theoretical footing.

Forward-looking guide: It identifies "persistence" and "causality" as the two core hurdles on the way to general-purpose world simulators, and offers a clear reference path for moving from passive "pixel prediction" to simulators capable of closed-loop interaction and causal intervention.

Coverage of recent work: It reviews in depth the video generation work that emerged in 2024 and 2025, reflecting the current shift from visual fidelity toward physical consistency.

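Core theory

Three cornerstones of world models

The survey first returns to the classical formulation and distills the operation of a world model into three coupled core components, which together form a complete chain from perception to reasoning.

Core operations of world models

Building on these three cornerstones, the survey summarizes the operating mechanism of a world model as two core operations.

How world models are learned

Because world models mainly serve downstream decision-making, the survey groups their acquisition (training) paradigms into two classes according to how tightly they are coupled with the policy model: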

Closed-loop learning (coupled training): The world model and the policy model are trained jointly, and the world model's parameter updates are directly influenced by the policy objective (shared gradients, end-to-end optimization). This paradigm further divides into two structures:

Sequential architecture: The world model and the policy model are separate modules, but they are linked end to end during training. The error signal produced by the policy objective is backpropagated into the world model, so that generated outcomes become more executable and more physically consistent.

Unified architecture: The world model and the policy are merged into a single end-to-end system that jointly optimizes perception, prediction, and action generation within one framework.

Open-loop learning (decoupled training): The world model is treated as an independent simulator obtained by pretraining on large-scale passive data. The policy model may call the world model for "imagination" or planning during its own optimization, but the world model receives no gradient updates from the policy's reward signal or loss function (the model is frozen).
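The practical difference between the two regimes is whether the gradient of the policy objective reaches the world model's parameters. The following is a minimal scalar sketch of that distinction, not the survey's formulation: the linear world model `w * s + a`, the linear policy `theta * s`, the squared-error objective, and all names are assumptions made for illustration.

```python
def train_step(w, theta, s, target, lr, closed_loop):
    """One gradient step on a toy policy objective.

    Hypothetical setup: the world model predicts next_state = w * s + a,
    the policy picks a = theta * s, and the policy objective is
    (next_state - target) ** 2.
    """
    a = theta * s                # policy chooses an action
    pred = w * s + a             # world model "imagines" the next state
    err = pred - target
    grad_theta = 2.0 * err * s   # d(loss)/d(theta)
    grad_w = 2.0 * err * s       # d(loss)/d(w)

    theta = theta - lr * grad_theta
    if closed_loop:
        # Coupled training: the policy loss also updates the world model.
        w = w - lr * grad_w
    # Open-loop (decoupled) training leaves w frozen.
    return w, theta


w_open, theta_open = train_step(0.5, 0.0, 1.0, 2.0, 0.1, closed_loop=False)
w_closed, theta_closed = train_step(0.5, 0.0, 1.0, 2.0, 0.1, closed_loop=True)
```

In the open-loop call only `theta` moves; in the closed-loop call both `theta` and `w` move. In an autograd framework the same switch is usually expressed by freezing or detaching the world model's parameters.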

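The evolution of video models: toward robust world simulators

Modern video generation models already achieve strong visual fidelity and are seen as potential carriers for world models, but compared with the classical world models analyzed above, two key gaps remain.

At the dynamics level, standard models typically use bidirectional attention to "render in one shot" a clip of fixed duration, with no explicit causal progression through time. Recent work injects causality either through causal architecture reformulation (autoregression, causal masking, rolling prediction, and so on) or through causal knowledge integration (using an LMM to impose planning constraints, or unified coupled optimization).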

Core pillars

To chart how video generation models can evolve into robust world models, the survey starts from their internal representation and focuses on how the state is constructed. The "state" is treated as a sufficient statistic of the environment's current configuration, and historical information is folded into a unified representation around it. Only by distilling long-term context into such a state representation can a model maintain consistent memory and coherent simulation over longer horizons.
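One way to make the "sufficient statistic" reading concrete is the following pair of conditions. The notation is ours rather than the survey's: $o_t$ denotes observations (frames), $a_t$ actions or conditioning signals, $s_t$ the state, and $f$ an assumed state-update function.

```latex
p(o_{t+1} \mid o_{1:t}, a_{1:t}) = p(o_{t+1} \mid s_t, a_t),
\qquad
s_t = f(s_{t-1}, o_t, a_{t-1})
```

Under this reading, a state is adequate exactly when conditioning on it predicts the next observation as well as conditioning on the full history does, and it can be maintained recursively without revisiting that history.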

The survey then examines where dynamics behavior in video generation models comes from. It stresses that a model needs to internalize the underlying causal regularities, so that its evolution over time is both physically plausible and logically self-consistent.

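Pillar 1: State Construction

How does a video model "remember" the past, and how does it handle historical information? The survey divides existing state-handling mechanisms into two paradigms, implicit state and explicit state, and analyzes the strengths and weaknesses of each:

Implicit state (memory management)

Explicit state (core representation)

This paradigm internalizes state construction as a compression process inside the model itself. Instead of maintaining an ever-growing buffer of past frames, it continuously distills the historical context into a globally updated latent variable (the state), which serves as a fixed-dimensional, recursively updatable mathematical summary of how the video has evolved.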

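Coupled states: State transition is deeply fused with the generative backbone, so the model "generates and updates at the same time" within a single network. The state usually takes the form of hidden memory inside the network (for example, SSM/RNN/LSTM hidden states or an attention buffer). Historical information can also be encoded into the parameters through online optimization or plasticity, so that the state becomes part of the generator's internal dynamics. Representative work includes TTT [5] and SANA-Video [6].

Decoupled states: The state is separated from the generator's internal activations and is maintained and updated on its own as an independent explicit representation, which the generator reads at each step in order to render. Common routes include a semantic one (using an LLM or similar model to maintain a world description or narrative logic) and a geometric one (using 3D memory such as point clouds or 3D Gaussian splatting, updated iteratively through fusion and back-projection to preserve spatial consistency).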
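The contrast between a growing frame buffer and a fixed-dimensional, recursively updated state can be sketched in a few lines. This is only an illustration: the exponential moving average below is a stand-in for whatever learned update a real model uses, and the function names and the `decay` parameter are hypothetical.

```python
def update_frame_buffer(buffer, frame):
    """Buffer-style memory: keep every past frame, so memory grows with time."""
    buffer.append(list(frame))
    return buffer


def update_recurrent_state(state, frame, decay=0.9):
    """Fixed-size recursive summary: s_t = decay * s_{t-1} + (1 - decay) * x_t."""
    return [decay * s + (1.0 - decay) * x for s, x in zip(state, frame)]


def run(frames, dim):
    """Feed the same frames to both memory styles and return what each keeps."""
    buffer, state = [], [0.0] * dim
    for frame in frames:
        buffer = update_frame_buffer(buffer, frame)
        state = update_recurrent_state(state, frame)
    return buffer, state


frames = [[float(t), float(-t)] for t in range(5)]  # five toy 2-D "frames"
buffer, state = run(frames, dim=2)
```

After the loop, `buffer` holds one entry per frame while `state` still has `dim` entries, which is the property the recursive summary is meant to provide. What is lost in exchange is exact access to any individual past frame.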

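A systematic comparison of implicit and explicit state

The overall trade-off is that implicit state currently offers the more dependable support for high-fidelity video generation, whereas explicit state looks more like the frontier direction toward efficient autonomous agents and world simulation capable of long-horizon reasoning.

Pillar 2: Dynamics Modeling

How can generated video do more than "look right" and actually obey physical laws and temporal logic? The survey identifies two main routes for strengthening causal reasoning: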

Causal architecture reformulation: Starting from the model structure and training objective, this route turns generation from "one-shot rendering" into "prediction in temporal order". Mechanisms such as causal masks prevent future information from leaking, and different training and noise-scheduling strategies reinforce strict temporal dependence. In addition, forcing-style methods simulate the error accumulation and exposure bias of inference during training, narrowing the train-inference gap so that long-horizon rollouts are more stable and more consistent physically and logically. Representative work includes Self-Forcing [7].

Causal knowledge integration: This route brings in large multimodal models with stronger reasoning and knowledge (LMM/VLM/LLM) as a "planner" or "director". The planner first works out the temporal order, actions, and scene logic at a high level, and the video generation model then handles high-fidelity "rendering". More unified frameworks couple understanding and generation more tightly, letting reasoning signals constrain the generation process directly and thereby improving the causal credibility of the resulting dynamics. Representative work includes Owl-1 [8].
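The role of a causal mask can be shown with a small frame-level attention example. It is a minimal sketch in plain Python, assuming one attention score per pair of frames; the function names are illustrative and real models apply the mask over tokens inside each attention layer.

```python
import math


def causal_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (only j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def attention_weights(scores, mask=None):
    """Row-wise softmax over scores; masked-out positions get zero weight.

    With mask=None every frame attends to every frame (bidirectional).
    """
    weights = []
    for i, row in enumerate(scores):
        allowed = [j for j in range(len(row)) if mask is None or mask[i][j]]
        top = max(row[j] for j in allowed)  # subtract max for numerical stability
        exps = {j: math.exp(row[j] - top) for j in allowed}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(len(row))])
    return weights


scores = [[0.0, 1.0, 2.0],
          [0.5, 0.5, 0.5],
          [2.0, 1.0, 0.0]]
bidirectional = attention_weights(scores)
causal = attention_weights(scores, causal_mask(3))
```

In `bidirectional`, the first frame places weight on the two frames that follow it; in `causal`, that weight is exactly zero, so nothing about later frames can influence how an earlier frame is produced.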

Pillar 3: Evaluation

If video generation mostly cares about whether the output looks good, world simulation also has to care about whether it is useful. Traditional metrics such as IS and FVD mainly measure the visual realism of short clips and can no longer tell us whether a model has the capabilities of a "world model": sustained rollout, interactivity, and usefulness for decision-making. The survey therefore argues for moving evaluation from "visual appeal" to "functional benchmarks", and proposes three core evaluation axes:

Quality: Covers basic visual fidelity, short-range temporal coherence, and alignment with text or other conditions. Representative tools such as VBench [9] and VBench++ [10] break this down into finer-grained dimensions: whether the picture is stable, whether the subject stays consistent, and whether the semantics are aligned.

Persistence: Covers the stability and consistency of long-horizon rollouts. It checks whether drift or collapse appears as generation length grows, and it uses memory tasks such as scene re-visitation to test whether a model recovers the correct state when it returns to a previously seen location instead of inventing details. Related evaluations include WCS [11] and reconstruction-consistency tests based on rFID [12].

Causality: As the core capability of world simulation, this axis tests whether a model has truly internalized physical and logical regularities. It includes temporal order and physical validity (for example, ChronoMagic-Bench [13] and Physics-IQ [14]), as well as whether responses to counterfactual interventions are reasonable (for example, whether the world produces different yet self-consistent outcomes, in line with causality, after the action or initial condition is changed). It further extends to agent-in-the-loop task success rate and planning performance (for example, World-in-World [15]).
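The re-visitation idea on the persistence axis can be illustrated with a toy score. The sketch below is not WCS or rFID; it is a hypothetical stand-in that assumes frames are flat lists of floats and that the caller already knows which time steps show the same location.

```python
def revisit_error(frames, revisit_pairs):
    """Mean absolute difference between frames that depict the same place.

    frames: list of equal-length lists of floats (toy stand-ins for images).
    revisit_pairs: list of (first_visit_index, return_visit_index) tuples.
    A lower value means the rollout restored the earlier state more faithfully.
    """
    errors = []
    for first, later in revisit_pairs:
        a, b = frames[first], frames[later]
        errors.append(sum(abs(x - y) for x, y in zip(a, b)) / len(a))
    return sum(errors) / len(errors)


rollout = [
    [0.0, 0.0],  # t=0: location A
    [1.0, 1.0],  # t=1: location B
    [0.0, 0.0],  # t=2: back at A, state restored exactly
    [0.5, 0.5],  # t=3: back at A again, but details have drifted
]
faithful = revisit_error(rollout, [(0, 2)])
drifted = revisit_error(rollout, [(0, 3)])
```

Here `faithful` is smaller than `drifted`, which is the ordering a persistence metric should produce. A practical benchmark would compare frames in a perceptual feature space and would need to account for legitimate changes in the scene between visits.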

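Future research directions

The key to moving video generation toward world simulation is to close the gap on two core capabilities: persistence and causality.

Persistence requires a model to keep a stable, consistent state during long-horizon generation. Implicit state needs to advance from heuristic memory such as fixed windows to learnable information-management mechanisms that filter dynamically. Explicit state needs a better balance between compression efficiency and fidelity of detail.

Causality requires a model to move from statistical correlation to causal mechanism. One route is to improve causal inference through architecture and data design (better disentangling of latent causal factors). The other is to bring in reasoning priors from understanding models to constrain generation, although how to align generation with understanding effectively remains a central challenge.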

Conclusion

In summary, as video generation technology grows explosively across domains, equipping it with the ability to simulate the real world has become an unavoidable challenge. Through its full-stack technical analysis, the survey bridges the gap between video architectures and classical theory and lays out the key path from implicit and explicit state construction to causal dynamics modeling.

The survey gives academia and industry an important reference framework, helping researchers locate their work on the road to general-purpose world simulators.

The team believes that by tackling the challenges listed in the survey, the field can progress from generating visually realistic video to building robust general-purpose world simulators, laying a solid foundation for the long-term development of autonomous driving, embodied AI, and related areas.

References

[1] L. Zhang and M. Agrawala. Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626, 2025.

[2] Z. Xiao et al. Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369, 2025.

[3] X. Wu et al. Corgi: Cached memory guided video generation. arXiv preprint arXiv:2508.16078, 2025.

[4] R. Henschel et al. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 2568–2577, 2025.

[5] K. Dalal et al. One-minute video generation with test-time training. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 17702–17711, 2025.

[6] J. Chen et al. Sana-video: Efficient video generation with block linear diffusion transformer. arXiv preprint arXiv:2509.24695, 2025.

[7] X. Huang et al. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.

[8] Y. Huang et al. Owl-1: Omni world model for consistent long video generation. arXiv preprint arXiv:2412.09600, 2024.

[9] Z. Huang et al. Vbench: Comprehensive benchmark suite for video generative models, 2023.

[10] Z. Huang et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models, 2024.

[11] A. Rakheja et al. World consistency score: A unified metric for video generation quality, 2025.

[12] M. Heusel et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018.

[13] S. Yuan et al. Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation, 2024.

[14] S. Motamed et al. Do generative video models understand physical principles?, 2025.

[15] J. Zhang et al. World-in-world: World models in a closed-loop world, 2025.