2026-05-11
Source: Interaction Models: A Scalable Approach to Human-AI Collaboration.
AI Secret:
The First Real-Time Model
What’s happening: Thinking Machines Lab, led by former OpenAI CTO Mira Murati, showed its first Interaction Model after 18 months. The point is not a smarter chatbot. It is a break from turn-based AI, where users speak, models wait, then respond. This model listens, watches, interrupts, and reacts while the interaction is still happening.
How this hits reality: Most AI products still sit inside a prompt box. Even strong agents behave like ticket systems: give an instruction, wait, inspect the output, correct it, repeat. Thinking Machines is attacking that loop directly with 200-millisecond micro-turns, live audio, video, search, UI generation, and a second background model for deeper work.
Key takeaway: If this works, the prompt box becomes a temporary interface, not the center. Agents move toward live collaboration layers where users steer, interrupt, and reshape execution in real time.
---
AINews:
[AINews] Thinking Machines' Native Interaction Models - TML-Interaction-Small 276B-A12B - advances SOTA Realtime Voice and kills standard VAD
well done, Team Thinky.
By complete coincidence, on the day we released the talk by Neil Zeghidour (CEO of Gradium, the for-profit spinoff of the vaunted Kyutai Moshi) on what remains to be built for realtime voice, Thinking Machines emerged for only the third time in about a year (despite much drama) to drop Interaction Models: A Scalable Approach to Human-AI Collaboration. TML-Interaction-Small is a 276B-parameter MoE with 12B active parameters, and it immediately advances the state of the art of realtime voice models as Neil had laid out, updating the famously dead GPT-4o “her” demo with far more detailed demos that are presumably far closer to real use:
The full blogpost has lots of demos of the level of continuous interactivity, focusing on streams of “time-aligned microturns” of 200ms each:
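To make the 200ms figure concrete, here is a minimal sketch of slicing an audio buffer into fixed-duration micro-turns. Only the 200ms duration comes from the post; the function name `micro_turns`, the 16 kHz sample rate, and the list-of-samples representation are assumptions for illustration and say nothing about how Thinking Machines actually frames its streams.

```python
from typing import Iterator, List

MICRO_TURN_MS = 200  # duration quoted in the post; everything else is assumed


def micro_turns(
    samples: List[int],
    sample_rate_hz: int = 16_000,  # assumed sample rate, not from the post
    turn_ms: int = MICRO_TURN_MS,
) -> Iterator[List[int]]:
    """Yield consecutive fixed-duration windows of an audio buffer.

    The final window may be shorter if the buffer does not divide evenly.
    """
    per_turn = sample_rate_hz * turn_ms // 1000  # samples per micro-turn
    for start in range(0, len(samples), per_turn):
        yield samples[start:start + per_turn]


# One second of silence at the assumed 16 kHz rate.
one_second = [0] * 16_000
turns = list(micro_turns(one_second))
print(len(turns), len(turns[0]))
```

At the assumed rate, one second of audio yields five micro-turns of 3,200 samples each, which is the granularity at which a model in this style could decide to speak, stay silent, or interrupt.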
Using encoder-free early fusion, with images and audio all processed <200ms, similar to Meta’s Chameleon:
The team shows a number of official benchmarks where the model beats both GPT-Realtime-2 and Gemini 3.1-Flash on basics like BigBench Audio, IFEval, and FD-bench, but the level of interactivity aimed for required making 2 new internal benchmarks for time awareness, simultaneous translation, and visual proactivity:
TimeSpeak: Can the model initiate speech at user-specified times?
Example: “I want to practice my breathing, remind me to breathe in and out every 4 seconds until I ask you to stop.”
CueSpeak: Can the model speak at the appropriate moment?
Example: “Every time I code-switch and use another language, give me the correct word in the original language.”
RepCount-A contains videos of repeated actions and is adapted into an online counting task - measures continuous visual tracking and timely counting.
ProactiveVideoQA consists of videos with questions, whose answers become available at specific moments. Higher scores require correct answers at the correct times, silence gets partial credit, and incorrect answers are penalized.
Charades is a standard temporal action-localization benchmark.
Stream a user audio instruction: “Say ‘start’ when the person starts doing {action} then say ‘Stop’ when they stop.”
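The ProactiveVideoQA description above implies an ordering of outcomes: a correct answer at the right time beats silence, and silence beats a wrong answer. The sketch below shows one way such a rule could be written. The numeric weights, the `Response` type, the exact-match comparison, and the zero score for a correct but mistimed answer are all hypothetical; the post gives only the ordering, not the benchmark's actual metric.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical weights: the post states the ordering, not these values.
CORRECT_ON_TIME = 1.0
SILENCE_CREDIT = 0.25
WRONG_PENALTY = -1.0
CORRECT_BUT_MISTIMED = 0.0  # assumption; the post does not cover this case


@dataclass
class Response:
    text: Optional[str]      # None means the model stayed silent
    time_s: Optional[float]  # when the model spoke, in seconds


def score(resp: Response, answer: str,
          window_start_s: float, window_end_s: float) -> float:
    """Score one response against an answer that is valid in a time window."""
    if resp.text is None:
        return SILENCE_CREDIT  # silence earns partial credit
    if resp.text.strip().lower() != answer.strip().lower():
        return WRONG_PENALTY   # speaking incorrectly is penalized
    on_time = (resp.time_s is not None
               and window_start_s <= resp.time_s <= window_end_s)
    return CORRECT_ON_TIME if on_time else CORRECT_BUT_MISTIMED


print(score(Response("red", 5.0), "Red", 4.0, 6.0))
print(score(Response(None, None), "Red", 4.0, 6.0))
print(score(Response("blue", 5.0), "Red", 4.0, 6.0))
```

Under a rule of this shape, a model that blurts out guesses scores worse than one that stays quiet until the answer is visible, which is the behavior a proactivity benchmark is trying to reward.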
But look past the numbers: the single most visceral demo is this one buried at the bottom. Play the samples and feel the AGI:
The closing notes leave tantalizing hints at Thinky’s roadmap, including an intriguing pairing of background agents with interactive models, which we like a whole lot.




