Any example of that? One would think that predicting what comes next from an image is basically video generation, which isn't perfect but does work to some extent (Veo/Sora/Grok).
You'll see it struggles (https://streamable.com/5doxh2), which is often the case with video gen. You have to carefully describe and orchestrate natural-feeling motion and interactions.
You're welcome to try with any other models but I suspect very similar results.
Physics textbooks are, though, so it should know how they'd work, or at least know that balls don't spontaneously appear and disappear and that gears don't work when they aren't connected.
It is video generation, but succeeding at this task involves detailed reasoning about cause and effect to construct chains of events, and may not be something that can be readily completed by applying "intuitions" gained from "watching" lots of typical movies, where most of the events are stereotypical.