
In recent years, video generation and world models have become some of the most closely watched topics in artificial intelligence. From Sora to Kling, video generation models have shown progressively stronger "world consistency" in motion continuity, object interaction, and certain physical priors. This has prompted a serious question: can video generation move beyond "realistic short clips" to become a "general-purpose world simulator" usable for reasoning, planning, and control?
At the same time, this line of research is rapidly converging with frontier applications such as embodied AI and autonomous driving, and is regarded as an important path toward artificial general intelligence (AGI).
Beneath the research boom, however, core questions such as "what counts as a real world model" and "how should the world-simulation ability of a video model be judged" remain contested on several fronts. Definitions and taxonomies of world models keep multiplying, and their overlapping theoretical dimensions often confuse researchers and hold back standardization of the technology.
To establish a more systematic and clearer perspective, the Kuaishou Kling team and the team of Prof. Yingcong Chen at The Hong Kong University of Science and Technology (Guangzhou) (co-first authors: PhD students Luozhou Wang and Zhifei Chen) jointly published a systematic survey that analyzes video world models from a new angle.
The survey aims to bridge the gap between contemporary "stateless" video architectures and classical "state-centric" world model theory. It proposes, for the first time, a taxonomy built on two pillars: State Construction and Dynamics Modeling.
The survey also argues for shifting evaluation from pure "visual fidelity" toward "functional benchmarks", and identifies two key technical frontiers, offering a clear roadmap for video generation to evolve into robust general-purpose world simulators.

Paper title: A Mechanistic View on Video Generation as World Models: State and Dynamics
Paper link: https://arxiv.org/pdf/2601.17067
GitHub link: https://github.com/hit-perfect/Awesome-Video-World-Models
Overview of the survey structure

Highlights: what are the key contributions of this survey?
Compared with earlier video generation research that focused on visual quality, this survey advances the discussion along several dimensions:
Full-Stack Perspective: It moves past a pure "rendering" view and covers the full life cycle, from foundational theoretical definitions, through mid-level architecture design (state construction and dynamics modeling), to high-level functional evaluation, giving an all-round understanding of video world models.
Bridging the Gap: For the first time, it maps contemporary "stateless" video diffusion architectures onto classical model-based reinforcement learning (MBRL) and control theory, giving world models a solid theoretical footing.
Forward-Looking Guide: It identifies persistence and causality as the two core hurdles on the way to general-purpose world simulators, and gives the field a clear reference path for moving from passive "pixel prediction" to simulators capable of closed-loop interaction and causal intervention.
Coverage of Recent Work: It reviews in depth the latest video generation work that emerged in 2024 and 2025, reflecting the current shift from visual fidelity toward physical consistency.
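Core Theory
Three Cornerstones of World Models
The survey first returns to the classical literature and distills the operation of a world model into three coupled core components, which together form a complete chain from perception to reasoning: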

ÌìÏÂÄ£×ӵĽ¹µã²Ù×÷
»ùÓÚǰÎÄÌá³öµÄ¡¸Èý´ó»ùʯ¡¹£¬£¬£¬£¬£¬±¾ÎĽ«ÌìÏÂÄ£×ÓµÄÔËÐлúÖÆ¹éÄÉΪÁ½Ïî½¹µã²Ù×÷£º


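Learning Paradigms for World Models
Because world models mainly serve downstream decision making, the survey groups the ways they are acquired (trained) into two classes, according to how tightly they are coupled with the policy model:
Closed-loop Learning (Coupled Training): The world model and the policy model are trained jointly, and the world model's parameter updates are directly influenced by the policy objective (shared gradients, end-to-end optimization). This paradigm is further divided into two structures.
Sequential Architecture: The world model and the policy model are separate modules but are linked end to end during training. The error signal from the policy objective is backpropagated into the world model, so that generated results better satisfy executability and physical consistency.
Unified Architecture: The world model and the policy are merged into a single end-to-end system that jointly optimizes perception, prediction, and action generation within one framework.
Open-loop Learning (Decoupled Training): The world model is treated as a standalone simulator obtained by pretraining on large-scale passive data. The policy model may call the world model for "imagination" or planning during its own optimization, but the world model receives no gradient updates from the policy's reward signal or loss function (it stays frozen).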
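The difference between coupled and decoupled training comes down to whether the policy loss is allowed to change the world model's parameters. The following is a minimal sketch on a one-dimensional toy problem; the linear world model, the linear policy, and all names (`train`, `w`, `theta`, `target`) are illustrative assumptions, not anything defined in the survey. A real system would also train the world model with its own prediction loss, which is omitted here.

```python
def train(coupled, steps=100, lr=0.05):
    """Toy 1-D setup: world model s' = w * s + a, policy a = theta * s.

    The policy loss is (s' - target)^2. With coupled=True the loss gradient
    also updates the world-model parameter w (closed-loop learning);
    with coupled=False w stays frozen (open-loop learning).
    """
    w, theta = 0.5, 0.0       # hypothetical world-model and policy parameters
    s, target = 1.0, 2.0      # fixed start state and desired next state
    for _ in range(steps):
        a = theta * s                  # policy picks an action
        s_next = w * s + a             # world model "imagines" the outcome
        err = s_next - target
        grad_theta = 2 * err * s       # d loss / d theta
        grad_w = 2 * err * s           # d loss / d w
        theta -= lr * grad_theta       # the policy is always updated
        if coupled:
            w -= lr * grad_w           # policy gradient flows into the world model
    return w, theta


w_open, theta_open = train(coupled=False)
w_closed, theta_closed = train(coupled=True)
print("open-loop:   w =", w_open, "theta =", theta_open)
print("closed-loop: w =", w_closed, "theta =", theta_closed)
```

In the open-loop run `w` keeps its initial value and only the policy adapts. In the closed-loop run both parameters move, which is the "shared gradient" behavior described above.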

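Evolution of Video Models: Toward Robust World Simulators
Modern video generation models already achieve strong visual fidelity and are seen as potential carriers of world models, but they still differ from the classical world models analyzed above in two key respects: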

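At the dynamics level, standard models typically use bidirectional attention to "render in one shot" a clip of fixed duration, without explicit causal progression through time. Recent work injects causality either through causal architecture reformulation (autoregression, causal masking, rolling prediction, and similar techniques) or through causal knowledge integration (using an LMM for planning constraints or unified, coupled optimization).
Core Pillars
To trace the path by which video generation models evolve into robust world models, the survey starts from their internal representation and examines how state is constructed. It treats the "state" as a sufficient statistic of the environment's current configuration, and uses it as the core through which historical information is folded into a unified representation. Only by distilling long-term context into such a state representation can a model maintain consistent memory and coherent simulation over longer horizons.
The survey then analyzes where dynamics come from in video generation models. It stresses that a model must internalize the underlying causal regularities so that its evolution over time is both physically plausible and logically self-consistent.
Pillar 1: State Construction
How does a video model "remember" the past, and how does it handle historical information? The survey divides existing state-handling mechanisms into two paradigms, Implicit State and Explicit State, and analyzes the strengths and weaknesses of each in depth:
Implicit State (Memory Mechanism Management)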
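The "sufficient statistic" view of state can be written compactly. The notation below is an assumption introduced only for illustration and is not taken from the survey: $o_t$ is the observation (frame) at time $t$, $a_t$ the action or condition, $s_t$ the state, and $f$ a state-update function. The first relation says the state is updated recursively; the second says that, once $s_t$ is known, the full history adds nothing to the prediction of the next observation.

```latex
s_t = f(s_{t-1},\, o_t,\, a_{t-1}), \qquad
p\bigl(o_{t+1} \mid o_{1:t},\, a_{1:t}\bigr) = p\bigl(o_{t+1} \mid s_t,\, a_t\bigr)
```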





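Explicit State (Core Representation)
This paradigm internalizes state construction as a compression process inside the model itself. Instead of maintaining an ever-growing buffer of historical frames, the model continually distills the historical context into a globally updated latent variable (the State), which becomes a fixed-dimensional, recursively updatable mathematical summary of how the video has evolved.
Coupled States: State transition is deeply fused with the generative backbone, so the model "generates and updates at the same time" within a single network. The state usually takes the form of hidden memory inside the network (for example SSM, RNN, or LSTM hidden states, or attention buffers). Historical information can also be encoded into the parameters through online optimization or plasticity, so that the state becomes part of the generator's internal dynamics. Representative work includes TTT [5] and SANA-Video [6].
Decoupled States: The state is separated from the generator's internal activations and is maintained and updated on its own as an independent explicit representation, which the generator reads at every step in order to render. Common routes are semantics-oriented (using an LLM or similar model to maintain the world description or narrative logic) and geometry-oriented (using 3D memory such as point clouds or 3D Gaussian splatting, updated iteratively through fusion and back-projection to preserve spatial consistency).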
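The contrast between keeping raw history and keeping a fixed-size summary can be sketched in a few lines. This is an illustration under stated assumptions: frames are plain lists of floats, the class names `SlidingWindowMemory` and `RecurrentState` are hypothetical, and the exponential moving average is only a stand-in for a learned state transition such as an SSM or RNN update.

```python
from collections import deque


class SlidingWindowMemory:
    """Implicit-state style: the most recent raw frames serve as context."""

    def __init__(self, window):
        self.frames = deque(maxlen=window)  # older frames are dropped

    def update(self, frame):
        self.frames.append(frame)

    def context(self):
        return list(self.frames)


class RecurrentState:
    """Explicit-state style: every frame is folded into a fixed-size summary."""

    def __init__(self, dim, decay=0.9):
        self.state = [0.0] * dim
        self.decay = decay

    def update(self, frame):
        # Exponential moving average, a placeholder for a learned transition.
        self.state = [
            self.decay * s + (1.0 - self.decay) * x
            for s, x in zip(self.state, frame)
        ]

    def context(self):
        return self.state


window_memory = SlidingWindowMemory(window=4)
recurrent_state = RecurrentState(dim=3)
for t in range(100):
    frame = [float(t), float(t) * 0.5, 1.0]  # toy 3-value "frame"
    window_memory.update(frame)
    recurrent_state.update(frame)

print(len(window_memory.context()))    # number of stored frames, capped by the window
print(len(recurrent_state.context()))  # state size, independent of video length
```

The window memory forgets everything older than its window, while the recurrent state keeps a lossy trace of the whole history at constant cost. That is the compression-versus-detail trade-off this section describes.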

Systematic Comparison: Implicit State vs. Explicit State



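The overall trade-off: implicit state is currently the more dependable way to support high-fidelity video generation, while explicit state looks more like the frontier direction toward efficient autonomous agents and world simulation capable of long-horizon reasoning.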

Pillar 2: Dynamics Modeling
How can generated video do more than "look right" and actually obey physical laws and temporal logic? The survey identifies two main routes for strengthening causal reasoning:
Causal Architecture Reformulation: Starting from model structure and training objective, this route turns generation from "one-shot rendering" into "prediction in temporal order". Mechanisms such as causal masking prevent future information from leaking, and different training and noise-scheduling strategies reinforce strict temporal dependence. In addition, forcing-style methods simulate the error accumulation and exposure bias of the inference stage, narrowing the train-inference gap so that long-horizon rollouts are more stable, more physically consistent, and more logically coherent. Representative work includes Self-Forcing [7].
Causal Knowledge Integration: This route brings in large multimodal models (LMM, VLM, LLM) with stronger reasoning and knowledge as a "planner" or "director". The large model first plans the timeline, actions, and scene logic at a high level, and the video generation model then handles high-fidelity "rendering". More advanced unified frameworks couple understanding and generation more tightly, letting reasoning signals constrain the generation process directly and thereby improving the causal credibility of the dynamics. Representative work includes Owl-1 [8].
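Causal masking is the simplest of these mechanisms to show concretely. Below is a minimal sketch of a frame-level (block) causal attention mask: tokens inside the same frame may attend to each other, but no token may attend to a later frame. The function name `block_causal_mask` and the token layout (frames stored contiguously, `tokens_per_frame` tokens each) are assumptions made for illustration, not the design of any specific model cited here.

```python
def block_causal_mask(num_frames, tokens_per_frame):
    """Return an n x n boolean mask where mask[q][k] is True if query token q
    may attend to key token k, i.e. k belongs to the same or an earlier frame."""
    n = num_frames * tokens_per_frame
    return [
        [(k // tokens_per_frame) <= (q // tokens_per_frame) for k in range(n)]
        for q in range(n)
    ]


mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

With three frames of two tokens each, the first frame's rows allow only the first two columns, and the last frame's rows allow all six. A bidirectional "one-shot" model corresponds to a mask that is True everywhere.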
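Pillar 3: Evaluation
If video generation cares mostly about whether the output "looks good", world simulation must also care about whether it "works well". Traditional metrics such as IS and FVD mainly measure the visual realism of short clips and can no longer tell us whether a model has the abilities of a "world model": sustained rollout, interactivity, and usefulness for decision making. The survey therefore argues for moving evaluation from "visual aesthetics" to "functional benchmarks", and proposes three core evaluation axes:
Quality: Basic visual fidelity, short-range temporal coherence, and alignment with text or other conditions. Representative tools include VBench [9] and VBench++ [10], which use finer-grained dimensions to break down whether the picture is stable, whether the subject stays consistent, and whether the semantics are aligned.
Persistence: Stability and consistency of long-horizon rollouts. This covers both whether drift or collapse appears as the generated length increases, and memory tasks such as scene re-visitation, which test whether the model restores the correct state when it returns to a previously seen location instead of inventing new details. Related evaluations include WCS [11] and reconstruction-consistency tests based on rFID [12].
Causality: The core capability of world simulation, testing whether the model has truly internalized physical and logical regularities. This includes temporal order and physical validity (for example ChronoMagic-Bench [13] and Physics-IQ [14]) and whether responses under counterfactual intervention are reasonable (for example, after an action or an initial condition is changed, does the world produce different yet self-consistent outcomes that follow causally?). It further extends to agent-in-the-loop task success rate and planning performance (for example World-in-World [15]).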
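To make the re-visitation idea concrete, here is a small sketch of a consistency check: given a rollout and pairs of frame indices known to show the same location, compare the first-visit frame with the re-visit frame. PSNR is used purely as a simple stand-in. This is not the WCS or rFID protocol, and the function names, the flat-list frame format, and the `visits` pairing are all assumptions for illustration.

```python
import math


def psnr(frame_a, frame_b, max_value=1.0):
    """Peak signal-to-noise ratio between two equally sized flat frames."""
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def revisit_consistency(frames, visits):
    """Average PSNR over (first_visit_index, revisit_index) pairs.

    Higher means the model reproduced the earlier view more faithfully
    when the camera returned to the same place.
    """
    scores = [psnr(frames[i], frames[j]) for i, j in visits]
    return sum(scores) / len(scores)


# Toy rollout: frame 2 repeats frame 0 exactly, frame 3 repeats it with drift.
frames = [[0.0, 0.5], [1.0, 1.0], [0.0, 0.5], [0.1, 0.5]]
exact_score = revisit_consistency(frames, [(0, 2)])
drift_score = revisit_consistency(frames, [(0, 3)])
print(exact_score, drift_score)
```

A practical benchmark would need camera poses or scene annotations to decide which frames count as re-visits, and would likely use a perceptual or feature-based distance instead of raw pixel error.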
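Future Research Directions
For video generation to become world simulation, two core capabilities still have to be filled in: persistence and causality.
Persistence requires the model to keep a stable, consistent state throughout long-horizon generation. Implicit state needs to move beyond heuristic memory such as fixed windows toward learnable information-management mechanisms that select content dynamically. Explicit state needs a better balance between compression efficiency and detail fidelity.
Causality requires the model to move from statistical correlation to causal mechanism. One route improves causal inference through architecture and data design (better disentangling of latent causal factors). The other introduces reasoning priors from understanding models to constrain generation, although effectively aligning generation with understanding remains a core challenge.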
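Conclusion
In summary, as video generation technology grows explosively across domains, giving it the ability to simulate the real world has become an unavoidable challenge. Through a full-stack technical analysis, the survey both bridges the gap between video architectures and classical theory and lays out the key path from "implicit and explicit state construction" to "causal dynamics modeling".
The survey gives academia and industry an important reference framework, helping researchers locate their work precisely on the way to general-purpose world simulators.
The team believes that by addressing the challenges listed in the survey, the field can progress from generating visually realistic video to building robust general-purpose world simulators, laying a solid foundation for long-term progress in areas such as autonomous driving and embodied AI.
References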
[1] L. Zhang and M. Agrawala. Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626, 2025.
[2] Z. Xiao et al. Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369, 2025.
[3] X. Wu et al. Corgi: Cached memory guided video generation. arXiv preprint arXiv:2508.16078, 2025.
[4] R. Henschel et al. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 2568–2577, 2025.
[5] K. Dalal et al. One-minute video generation with test-time training. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 17702–17711, 2025.
[6] J. Chen et al. Sana-video: Efficient video generation with block linear diffusion transformer. arXiv preprint arXiv:2509.24695, 2025.
[7] X. Huang et al. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.
[8] Y. Huang et al. Owl-1: Omni world model for consistent long video generation. arXiv preprint arXiv:2412.09600, 2024.
[9] Z. Huang et al. Vbench: Comprehensive benchmark suite for video generative models, 2023.
[10] Z. Huang et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models, 2024.
[11] A. Rakheja et al. World consistency score: A unified metric for video generation quality, 2025.
[12] M. Heusel et al. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2018.
[13] S. Yuan et al. Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation, 2024.
[14] S. Motamed et al. Do generative video models understand physical principles?, 2025.
[15] J. Zhang et al. World-in-world: World models in a closed-loop world, 2025.