A New Perspective on World Models: From Video Generation Toward General-Purpose World Simulators
2026-02-27 18:03:52

In recent years, video generation and world models have become two of the most closely watched topics in artificial intelligence. From Sora to Kling, video generation models have shown progressively stronger "world consistency" in motion continuity, object interaction, and some physical priors. This has prompted serious discussion of whether video generation can be pushed from "realistic short clips" toward a "general-purpose world simulator" usable for reasoning, planning, and control.

At the same time, this line of research is rapidly becoming intertwined with frontier applications such as embodied AI and autonomous driving, and is regarded as an important path toward artificial general intelligence (AGI).

Beneath the surge of research, however, core questions such as "what is a true world model?" and "how should a video model's world-simulation ability be judged?" remain contested along several dimensions. Definitions and taxonomies of world models keep multiplying, and their overlapping theoretical dimensions often confuse researchers and hold back standardization.

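To establish a more systematic and clearer view, the Kuaishou Kling team and Professor Chen Yingcong's group at the Hong Kong University of Science and Technology (Guangzhou) (co-first authors: PhD students Wang Luozhou and Chen Zhifei) have jointly released a systematic survey that analyzes video world models from a new perspective.

The survey aims to bridge the gap between contemporary "stateless" video architectures and classical "state-centric" world model theory. By the authors' account, it is the first to propose a taxonomy built on two pillars: State Construction and Dynamics Modeling.

In addition, the survey argues for shifting evaluation from pure "visual fidelity" to "functional benchmarks", and identifies two key technical frontiers, offering a roadmap for evolving video generation into robust general-purpose world simulators.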

ÂÛÎÄÎÊÌ⣺A Mechanistic View on Video Generation as World Models: State and DynamicsÂÛÎÄÁ´½Ó£ºhttps://arxiv.org/pdf/2601.17067github Á´½Ó£ºhttps://github.com/hit-perfect/Awesome-Video-World-Models

Overview of the survey's structure

Key highlights: what are the survey's main contributions?

Compared with earlier video generation research that focused on visual quality, the survey stands out along several dimensions:

Full-Stack Perspective: the survey moves beyond a pure "rendering" view and covers the full life cycle, from the underlying theoretical definitions, through mid-level architecture design (state construction and dynamics modeling), to top-level functional evaluation, giving an all-round understanding of video world models.

Bridging the Gap: by the authors' account, it is the first to map contemporary "stateless" video diffusion architectures in depth onto classical model-based reinforcement learning (MBRL) and control theory, giving world models a solid theoretical footing.

Forward-Looking Guide: it identifies "persistence" and "causality" as the two core hurdles on the way to general-purpose world simulators, and offers a reference path for moving from passive "pixel prediction" to simulators capable of closed-loop interaction and causal intervention.

Coverage of recent research: it reviews in depth the video generation work that appeared between 2024 and 2025, reflecting the current shift from visual fidelity toward physical consistency.

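Core theory

The three cornerstones of world models

The survey first returns to the classics, distilling the operation of a world model into three coupled core components that form a complete chain from perception to reasoning.

Core operations of world models

Building on these "three cornerstones", the survey summarizes how a world model operates as two core operations.

How world models are learned

Because world models mainly serve downstream decision-making, the survey groups their acquisition (training) paradigms into two categories according to how tightly they are coupled with the policy model: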

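Closed-loop Learning / Coupled Training: the world model and the policy model are trained jointly, and the world model's parameter updates are directly influenced by the policy objective (shared gradients, end-to-end optimization). This paradigm can be further divided into two structures:

Sequential Architecture: the world model and the policy model are separate modules, but they are linked end to end during training. The error signal produced by the policy objective is back-propagated through gradients into the world model, so that generated results better satisfy executability and physical consistency.

Unified Architecture: the world model and the policy are merged into a single end-to-end system that jointly optimizes perception, prediction, and action generation within one framework.

Open-loop Learning / Decoupled Training: the world model is treated as an independent simulator obtained by pre-training on large-scale passive data. The policy model can call the world model for "imagination" or "planning" during its own optimization, but the world model receives no gradient updates from the policy's reward signal or loss function (the model is frozen).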
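The difference between the two paradigms comes down to where the policy gradient is allowed to flow. The following is a minimal sketch under stated assumptions, not the survey's method: the world model is a hypothetical one-parameter predictor `next = w * s + a`, the policy is `a = p * s`, and the policy objective is to drive the predicted next state to zero. All names (`w`, `p`, `train`, `closed_loop`) are illustrative.

```python
def policy_loss_grads(w, p, s):
    """Policy loss = (predicted next state)^2, with next = w*s + p*s.

    Returns (loss, dL/dw, dL/dp). Both gradients are equal here because
    w and p enter the toy prediction symmetrically.
    """
    nxt = w * s + p * s
    loss = nxt ** 2
    grad = 2.0 * nxt * s
    return loss, grad, grad


def train(w, p, states, lr=0.1, closed_loop=True):
    """Run gradient descent on the policy loss over a list of states.

    closed_loop=True  : the policy gradient also updates the world model (coupled).
    closed_loop=False : the world model parameter w stays frozen (decoupled).
    """
    for s in states:
        _, grad_w, grad_p = policy_loss_grads(w, p, s)
        p -= lr * grad_p
        if closed_loop:
            w -= lr * grad_w
    return w, p


states = [1.0] * 20
w_open, p_open = train(0.9, 0.0, states, closed_loop=False)
w_closed, p_closed = train(0.9, 0.0, states, closed_loop=True)
print("open-loop  :", w_open, p_open)
print("closed-loop:", w_closed, p_closed)
```

In the open-loop run `w` keeps its pre-trained value and only the policy adapts; in the closed-loop run the policy objective also moves `w`, which is exactly the coupling (and the risk of pulling the world model away from its pre-trained dynamics) that the two paradigms trade off.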

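The evolution of video models: toward robust world simulators

Modern video generation models already achieve strong visual fidelity and are seen as potential carriers of world models, but two major gaps remain relative to the classical world models analyzed above:

At the level of dynamics, standard models typically use bidirectional attention to "render in one shot" a clip of fixed duration, with no explicit causal progression through time. Recent work injects causality either through causal architecture reformulation (autoregression, causal masking, rolling prediction, and so on) or through causal knowledge integration (using large multimodal models, LMMs, for planning constraints or unified coupled optimization).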
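The contrast between bidirectional and causal attention can be made concrete with the attention mask alone. This is a minimal sketch in plain Python, assuming frame-level attention where `mask[i][j]` says whether frame `i` may look at frame `j`; the function name `attention_mask` is illustrative and not taken from any particular model.

```python
def attention_mask(num_frames, causal):
    """Build a frame-level attention mask.

    mask[i][j] is True when frame i is allowed to attend to frame j.
    Bidirectional: every frame sees every other frame (one-shot rendering).
    Causal: frame i sees only frames 0..i, so no future information leaks.
    """
    return [
        [(j <= i) if causal else True for j in range(num_frames)]
        for i in range(num_frames)
    ]


bidirectional = attention_mask(3, causal=False)
causal = attention_mask(3, causal=True)

for name, mask in (("bidirectional", bidirectional), ("causal", causal)):
    print(name)
    for row in mask:
        print("  ", ["x" if allowed else "." for allowed in row])
```

With the causal mask, the first frame depends on nothing but itself, which is what allows frames to be generated one after another in time order.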

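Core pillars

To chart how video generation models evolve into robust world models, the survey starts from their internal representation and focuses on how state is constructed. It treats the "state" as a sufficient statistic of the environment's current configuration and uses it as the core into which historical information is integrated as a unified representation. Only by distilling long-term context into such a state representation can a model maintain consistent memory and coherent simulation over longer horizons.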
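One common way to write down "the state is a sufficient statistic" is the standard Markov-state condition below. It is offered as a hedged formalization rather than the survey's own notation; the symbols $o_t$ (observation or frame), $a_t$ (action or condition), $s_t$ (state), and $f$ (state update) are assumptions introduced here.

```latex
% Sufficiency: given the state, the full history adds nothing for prediction.
p(o_{t+1} \mid o_{1:t},\, a_{1:t}) \;=\; p(o_{t+1} \mid s_t,\, a_t)

% Recursive construction: the state is updated from the previous state and new input.
s_t \;=\; f(s_{t-1},\, a_{t-1},\, o_t)
```

Under this reading, a model that conditions only on a sliding window of recent frames satisfies the first equation only when nothing outside the window matters for the future.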

The survey then analyzes where dynamics behavior in video generation models comes from, stressing that a model must internalize the underlying causal regularities so that its evolution over time is both physically plausible and logically self-consistent.

Pillar 1: State Construction

How does a video model "remember" the past, and how does it handle historical information? The survey divides existing state-handling mechanisms into two paradigms, implicit state and explicit state, and analyzes the strengths and weaknesses of each:

Implicit state (memory mechanism management)

Explicit state (core representation)

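This paradigm internalizes state construction as the model's own compression process. Instead of maintaining an ever-growing buffer of historical frames, it continually distills the historical context into a globally updated latent variable (the state), which becomes a fixed-dimensional, recursively updatable mathematical summary of how the video has evolved.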
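The idea of a fixed-dimensional, recursively updatable summary can be sketched with the simplest possible recurrence. The example below is an illustration only: it assumes each frame is already reduced to a short feature list and uses an exponential moving average as a stand-in for a learned update such as an SSM or RNN cell. The names `update_state`, `summarize`, and `decay` are hypothetical.

```python
def update_state(state, frame_features, decay=0.9):
    """Fold one frame's features into a fixed-size state.

    Uses an exponential moving average as a stand-in for a learned
    recurrent update; the state size never changes.
    """
    return [decay * s + (1.0 - decay) * x for s, x in zip(state, frame_features)]


def summarize(frames, dim):
    """Compress an arbitrarily long frame sequence into a state of length `dim`."""
    state = [0.0] * dim
    for features in frames:
        state = update_state(state, features)
    return state


short_video = [[1.0, 2.0]] * 5
long_video = [[1.0, 2.0]] * 500

short_state = summarize(short_video, dim=2)
long_state = summarize(long_video, dim=2)
print(len(short_state), len(long_state))  # memory cost is the same for both
```

A frame buffer would grow with the length of the video, whereas here the cost of storing the past is constant; the price is that whatever the update discards cannot be recovered later.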

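Coupled States: state transition is deeply fused with the generation backbone, so the model "generates and updates at the same time" within a single network. The state usually takes the form of hidden memory inside the network (such as SSM/RNN/LSTM hidden states or attention buffers). Historical information can also be encoded into the parameters through online optimization or plasticity, so that the state becomes part of the generator's internal dynamics. Representative work includes TTT [5] and SANA-Video [6].

Decoupled States: the state is separated from the generator's internal activations and is maintained and updated on its own as an independent explicit representation, which the generator reads at each step in order to render. Common routes are semantics-oriented (using an LLM or similar to maintain a world description or narrative logic) and geometry-oriented (using 3D memory such as point clouds or 3D Gaussian splatting, updated iteratively through fusion and back-projection to keep spatial consistency).

Systematic comparison of implicit state vs. explicit state

The overall trade-off: implicit state currently supports high-fidelity video generation more reliably, while explicit state looks more like the frontier direction for efficient autonomous agents and world simulation capable of long-horizon reasoning.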

Pillar 2: Dynamics Modeling

How can generated video go beyond merely "looking right" and actually obey physical laws and temporal logic? The survey identifies two main routes to stronger causal reasoning:

Causal Architecture Reformulation: starting from model structure and training objectives, the generation process is changed from "one-shot rendering" into "prediction in time order". Mechanisms such as causal masking prevent future information from leaking, and different training and noise-scheduling strategies reinforce strict temporal dependence. Forcing-style methods additionally simulate the error accumulation and exposure bias of the inference stage, narrowing the gap between training and inference so that long-horizon rollouts are more stable and more consistent physically and logically. Representative work includes Self-Forcing [7].

Causal Knowledge Integration: large multimodal models (LMM/VLM/LLM) with stronger reasoning and knowledge are brought in as a "planner" or "director". They first plan timing, actions, and scene logic at a high level, and the video generation model then handles high-fidelity "rendering". More tightly unified frameworks couple understanding and generation more closely, letting reasoning signals constrain the generation process directly and thereby making the dynamics more causally credible. Representative work includes Owl-1 [8].
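The exposure bias mentioned above can be seen in a toy setting. The sketch below is not Self-Forcing or any published method; it only illustrates the train/inference gap such methods target. It assumes hypothetical scalar dynamics `true_step` and a slightly biased learned model `model_step`, and compares per-step error when the model is fed ground truth (teacher forcing) against error when it is fed its own outputs (autoregressive rollout).

```python
def true_step(x):
    """Hypothetical ground-truth dynamics: the state advances by 1.0 per step."""
    return x + 1.0


def model_step(x):
    """Hypothetical learned model with a small constant bias of 0.1 per step."""
    return x + 1.1


def teacher_forced_errors(x0, steps):
    """One-step errors when every input is the ground-truth previous state."""
    errors, x = [], x0
    for _ in range(steps):
        target = true_step(x)
        errors.append(abs(model_step(x) - target))
        x = target  # the next input is ground truth, as during training
    return errors


def rollout_errors(x0, steps):
    """Errors when the model consumes its own predictions, as at inference."""
    errors, x_true, x_model = [], x0, x0
    for _ in range(steps):
        x_true = true_step(x_true)
        x_model = model_step(x_model)  # the next input is the model's own output
        errors.append(abs(x_model - x_true))
    return errors


tf_errors = teacher_forced_errors(0.0, 10)
ro_errors = rollout_errors(0.0, 10)
print("teacher-forced last-step error:", round(tf_errors[-1], 3))
print("rollout last-step error       :", round(ro_errors[-1], 3))
```

The teacher-forced error stays at the per-step bias, while the rollout error grows with the horizon. A model evaluated only under teacher forcing therefore looks far more accurate than it is during long generation.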

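Pillar 3: Evaluation

If video generation cares mostly about whether the output "looks good", world simulation must also care about whether it "works well". Traditional metrics such as IS and FVD mainly measure the visual realism of short clips and can hardly tell whether a model has the abilities of a "world model": sustained rollout, interactivity, and usefulness for decision-making. The survey therefore argues for moving evaluation from "visual aesthetics" to "functional benchmarks", and proposes three core evaluation axes:

Quality: covers basic visual fidelity, short-range temporal coherence, and alignment with text or other conditions. Representative tools such as VBench [9] and VBench++ [10] break this down into finer-grained dimensions: whether the picture is stable, whether the subject stays consistent, and whether the semantics are aligned.

Persistence: covers the stability and consistency of long-horizon rollouts. It checks whether drift or collapse appears as generation length grows, and uses memory tasks such as scene re-visitation to test whether the model restores the correct state when it returns to an earlier location instead of inventing details. Related evaluations include WCS [11] and reconstruction-consistency tests based on rFID [12].

Causality: as the core ability of world simulation, this axis tests whether the model has truly internalized physical and logical regularities. It includes temporal order and physical validity (for example ChronoMagic-Bench [13] and Physics-IQ [14]), and whether responses under counterfactual intervention are reasonable (for example, after the action or initial condition is changed, whether the world produces different yet self-consistent outcomes that follow causally). It extends further to agent-in-the-loop task success rate and planning performance (for example World-in-World [15]).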
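The scene re-visitation idea on the persistence axis can be illustrated with a toy score. This is a hypothetical sketch, not WCS, rFID, or any benchmark cited above. It assumes each generated frame is a flat list of pixel values tagged with the location it depicts, and measures how far a frame drifts from the first frame generated at the same location.

```python
def revisit_error(positions, frames):
    """Mean absolute pixel difference between each revisit and the first visit.

    positions[i] labels the location shown in frames[i]. A model with
    perfect memory scores 0.0; returns 0.0 when no location is revisited.
    """
    first_seen = {}
    diffs = []
    for pos, frame in zip(positions, frames):
        if pos in first_seen:
            reference = first_seen[pos]
            diffs.append(sum(abs(a - b) for a, b in zip(frame, reference)) / len(frame))
        else:
            first_seen[pos] = frame
    return sum(diffs) / len(diffs) if diffs else 0.0


path = ["A", "B", "A"]  # the camera leaves location A and then returns to it
consistent = [[0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
drifted = [[0.0, 1.0], [1.0, 1.0], [0.5, 1.0]]

print(revisit_error(path, consistent))
print(revisit_error(path, drifted))
```

Real evaluations compare frames in a learned feature space and must account for viewpoint changes; raw pixel difference is used here only to keep the sketch self-contained.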

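Future research directions

For video generation to become world simulation, two core abilities still need to be filled in: persistence and causality.

Persistence requires the model to keep a stable, consistent state over long-horizon generation. Implicit state needs to move from heuristic memory such as fixed windows to learnable mechanisms that select information dynamically, while explicit state needs a better balance between compression efficiency and detail fidelity.

Causality requires the model to move from statistical correlation to causal mechanism. One route is to improve causal inference through architecture and data design (better disentangling of latent causal factors). The other is to bring in the reasoning priors of understanding models to constrain generation, although aligning generation with understanding effectively remains a core challenge.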

Conclusion

In summary, as video generation technology grows explosively across domains, giving it the ability to simulate the real world has become an unavoidable challenge. Through an end-to-end technical analysis, the survey bridges the gap between video architectures and classical theory, and lays out the key path from "implicit and explicit state construction" to "causal dynamics modeling".

The survey offers academia and industry a reference framework that helps researchers locate their work on the road toward general-purpose world simulators.

The team believes that by addressing the challenges the survey lists, the field can progress from generating visually realistic video to building robust general-purpose world simulators, laying a foundation for long-term progress in areas such as autonomous driving and embodied AI.

References

[1] L. Zhang and M. Agrawala. Packing input frame context in next-frame prediction models for video generation. arXiv preprint arXiv:2504.12626, 2025.

[2] Z. Xiao et al. Worldmem: Long-term consistent world simulation with memory. arXiv preprint arXiv:2504.12369, 2025.

[3] X. Wu et al. Corgi: Cached memory guided video generation. arXiv preprint arXiv:2508.16078, 2025.

[4] R. Henschel et al. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 2568–2577, 2025.

[5] K. Dalal et al. One-minute video generation with test-time training. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 17702–17711, 2025.

[6] J. Chen et al. Sana-video: Efficient video generation with block linear diffusion transformer. arXiv preprint arXiv:2509.24695, 2025.

[7] X. Huang et al. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025.

[8] Y. Huang et al. Owl-1: Omni world model for consistent long video generation. arXiv preprint arXiv:2412.09600, 2024.

[9] Z. Huang et al. Vbench: Comprehensive benchmark suite for video generative models, 2023.

[10] Z. Huang et al. Vbench++: Comprehensive and versatile benchmark suite for video generative models, 2024.

[11] A. Rakheja et al. World consistency score: A unified metric for video generation quality, 2025.

[12] M. Heusel et al. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2018.

[13] S. Yuan et al. Chronomagic-bench: A benchmark for metamorphic evaluation of text-to-time-lapse video generation, 2024.

[14] S. Motamed et al. Do generative video models understand physical principles?, 2025.

[15] J. Zhang et al. World-in-world: World models in a closed-loop world, 2025.