Why Are Large Language Models So Terrible at Video Games?
LLMs remain poor at video games despite rapid progress elsewhere. A new paper by NYU's Julian Togelius examines what this gap reveals about the broader limits of AI reasoning in 2026.
Large language models (LLMs) have improved so quickly that the benchmarks themselves have evolved, adding more complex problems in an effort to challenge the latest models. Yet LLMs haven’t improved across all domains, and one task remains far outside their grasp: They have no idea how to play video games. While some models have managed to beat a handful of games (for example, Gemini 2.5 Pro beat Pokémon Blue in May 2025), these exceptions prove the rule. The eventually victorious AIs completed games far more slowly than a typical human player, made bizarre and often repetitive mistakes, and required custom software to guide their interactions with the game.
Julian Togelius, the director of New York University’s Game Innovation Lab and co-founder of AI game-testing company Modl.ai, explored the implications of LLMs’ limitations in video games in a recent paper. He spoke with IEEE Spectrum about what this lack of video-game skills can tell us about the broader state of AI in 2026.
LLMs have improved rapidly in coding, and your paper frames coding as a kind of well-behaved game. What do you mean by that?
Julian Togelius: Coding is extremely well-behaved in the sense that you have tasks. These…