Microsoft AI Build Week + NVIDIA Physical AI Wave.
Microsoft AI ships its full in-house stack at Build 2026: MAI-Thinking-1 (1T MoE, 97% AIME 2025, matches Opus 4.6 on SWE-Bench Pro), MAI-Voice-2 (15-language expressive TTS), MAI-Image-2.5 (#2 on image-editing leaderboards), MAI-Transcribe-1.5 (#1 in 18 of 43 languages), and Microsoft Discovery GA, the agentic R&D platform that drove the Majorana 2 quantum chip (~1000× reliability, 20s qubit lifetime).
NVIDIA launches Cosmos 3 (open Physical AI omni-model with Reasoner + Generator MoT, tops PAI-Bench), Nemotron 3 Ultra (550B MoE, 55B active, 5.9× throughput vs GLM-5.1, NVFP4), OmniDreams (real-time autonomous-vehicle world model), RTX Spark superchip (Blackwell + Grace, slim laptops), and an open Unitree H2 Plus humanoid reference design (75 DoF total).
MiniMax M3 hits 59% SWE-Bench Pro with 1M context + native multimodality + Sparse Attention (15.6× decode). Alibaba Qwen3.7-Plus (1M context multimodal agent, 79 ScreenSpot Pro). Google Gemma 4 12B (encoder-free native audio-in), Magenta RealTime 2 (2.4B live music, ~200ms latency, MLX support). OpenAI ships ChatGPT Dreaming (background memory synthesis) + GPT-5.5 Instant June refresh.
Ideogram 4.0 (9.3B open-weight, structured prompts, 2K native). Reve 2.0 (Large Layout Model, #2 T2I Arena). Boson AI ships Higgs Audio v3 TTS (4B, 102 languages, zero-shot) + Higgs Avatar v1 (real-time talking head). Baidu NAVA (native audio-visual alignment), Bernini video gen+edit, StreamChar real-time character audio-video.
Research wave: Déjà View looping Transformers for 3D, PaGeR panoramic geometry, MAMMA multi-person mocap, Stable-Layers Flow-GRPO, WavTTS waveform diffusion.
Tooling: EveryInc compound-engineering-plugin (37 skills/51 agents), Baidu LoongForge (5.04× over Megatron), Kasetto (declarative Rust agent manager), learn-claude-code (20-lesson harness curriculum), Odysseus self-hosted, earendil-works/pi toolkit, MisoTTS 8B, synthteam Slack persona plugin, QuantDinger AI quant trading.
Industry: GitHub Copilot moves to usage-based AI Credits ($0.01/credit), Perplexity Personal Computer for Windows (19-model orchestrator), Vals AI Finance Agent v2 benchmark (GPT-5.5 leads at 51.76%), FlashDreams inference library, NVIDIA Cosmos Coalition (Black Forest Labs, Runway, LTX), Suno iOS Notes/Voice Memos integration.
44 launches and research drops that matter for enterprise AI builders—curated, tagged, and ready for your next roadmap sync.
New drops: 44
Unique sources: 31
Key themes: Immersive · Frontier · Agents
Frontier Models & Research
New reasoning systems, world models, and alignment papers.
Memory System · OpenAI
ChatGPT Dreaming (V3 Memory)
New ChatGPT memory architecture that replaces the manually curated saved-memories list with a background synthesis process revising memories over time (e.g. trip "will go" → "went"); enabled by ~5× compute-cost reduction, doubles memory capacity for Plus/Pro.
12B unified, encoder-free multimodal model: vision and audio flow directly into the LLM backbone (first mid-sized Gemma with native audio-in). Runs on 16GB VRAM and nears a 26B MoE on benchmarks at less than half the memory. Apache-2.0 on HF/Kaggle.
Multimodal agent model with 1M-token context, vision + deep reasoning + tool invocation + autonomous iteration; 79.0 on ScreenSpot Pro and 70.3 on Terminal-Bench (top of open-API GUI agents), $0.40/$1.60 per M tokens on Alibaba Cloud Model Studio.
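The listed pricing can be turned into a rough per-request estimate. The sketch below assumes the two figures are the input and output prices respectively (the card does not say which is which); the function name and the example token counts are illustrative only.

```python
# Assumption: $0.40 is the input price and $1.60 the output price, per 1M tokens.
INPUT_USD_PER_M = 0.40
OUTPUT_USD_PER_M = 1.60


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of a single request in US dollars."""
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


# Hypothetical request: a full 1M-token context with a 4,000-token reply.
print(f"${request_cost_usd(1_000_000, 4_000):.4f}")
```

Under that assumption, filling the whole context window dominates the bill, so long-running agent loops that resend the full context each turn would be priced mostly by input tokens.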
Open-weight model combining frontier coding (59.0% on SWE-Bench Pro, ahead of GPT-5.5 and Gemini 3.1 Pro), 1M-token context and native text+image+video input via MiniMax Sparse Attention — ~15.6× faster decode and ~9.7× faster prefill at 1M context vs M2.
Next-gen topological quantum chip co-developed using the Microsoft Discovery agentic-AI research platform — new materials stack delivers ~1000× reliability improvement and a mean qubit lifetime of 20s (peaks near 1 min). Microsoft pulls its scalable-quantum target in to 2029.
Open 550B-parameter MoE (55B active) for long-running agents — hybrid Mamba-Transformer layers, LatentMoE expert routing, multi-token prediction, NVFP4 quantization. ~5.9× throughput of GLM-5.1 on 8K/64K NVFP4 on GB200, lowers task-completion cost ~30%. Weights+data+recipes under Linux Foundation license.
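A quick back-of-the-envelope reading of those parameter counts, as a sketch only: it treats NVFP4 as a plain 4 bits per weight and ignores block scale factors and any layers kept at higher precision, so the memory figure is a lower bound rather than a published number.

```python
TOTAL_PARAMS = 550e9    # from the card
ACTIVE_PARAMS = 55e9    # from the card
BITS_PER_WEIGHT = 4     # assumption: NVFP4 payload only, scale factors ignored


def active_fraction(total: float, active: float) -> float:
    """Share of the parameters used for each token in a sparse MoE."""
    return active / total


def weight_gigabytes(params: float, bits_per_weight: int) -> float:
    """Lower-bound weight storage in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


print(f"active per token: {active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS):.0%}")
print(f"weights at 4 bits: at least {weight_gigabytes(TOTAL_PARAMS, BITS_PER_WEIGHT):.0f} GB")
```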
Microsoft's first in-house reasoning model — sparse MoE with ~1T total / 35B active params and 256K context, trained from scratch on commercially licensed enterprise data with no third-party distillation; 97.0% AIME 2025, 94.5% AIME 2026, matches Claude Opus 4.6 on SWE-Bench Pro.
Behavioral update to ChatGPT's default model: tighter cross-subject answers, more natural conversational tone, personalization pulling from past chats, files, and connected Gmail (Plus/Pro web first). 52.5% fewer hallucinated claims vs GPT-5.3 Instant on high-stakes prompts.
Video, audio, and physics-native generation techniques.
TTS Model · Miso Labs
MisoTTS 8B
8B-parameter RVQ-Transformer TTS with a Llama-3.2-style backbone + 300M audio decoder, using the Mimi codec with 32 codebooks and 2,048-token max length; English-only, ~24GB VRAM at bf16, ships with SilentCipher watermarking.
Unified video generation + editing framework pairing an MLLM-based semantic planner (predicts target semantic embeddings with CoT reasoning) with a DiT renderer in VAE latent space; introduces Segment-Aware 3D RoPE for multi-input handling, SOTA on video gen/edit benchmarks.
Multi-view 3D reconstruction method that recurrently applies a single looped Transformer block to per-view features for K refinement steps, matching or beating much larger feed-forward baselines across five reconstruction benchmarks at a fraction of parameters and compute.
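The core idea of a looped block, one set of weights reused for K refinement steps, can be sketched in a few lines. This is a toy illustration, not the paper's architecture: the residual update and the stand-in `toy_block` are assumptions made for the example.

```python
from typing import Callable, List

Vector = List[float]


def looped_refine(block: Callable[[Vector], Vector], features: Vector, k: int) -> Vector:
    """Apply the same block k times, adding its output back each step.

    Because one block is reused, the parameter count does not grow with k.
    The residual update is an assumption of this sketch.
    """
    for _ in range(k):
        update = block(features)
        features = [f + u for f, u in zip(features, update)]
    return features


def toy_block(x: Vector) -> Vector:
    """Hypothetical stand-in for a Transformer block: pull values toward the mean."""
    avg = sum(x) / len(x)
    return [0.5 * (avg - v) for v in x]


refined = looped_refine(toy_block, [0.0, 4.0], k=3)
print(refined)
```

Raising `k` spends more compute on refinement without adding parameters, which is the trade-off the card describes.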
Single-pass framework that lifts perspective 3D foundation models to panoramas — predicts scale-invariant depth, metric depth, surface normals and sky masks from one perspective or 360° image for unified panoramic geometry estimation.
2.4B-parameter open-weights live music model with ~200ms control latency and 40ms frame size, controllable via MIDI/audio/text. Ships with SpectroStream (48kHz stereo discrete codec), a JAX/MLX Python lib (`magenta-rt`), and a C++ SequenceLayers engine for on-device inference on Apple Silicon.
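The timing figures above relate to each other through simple arithmetic. The sketch below assumes the 40ms frame applies to the 48kHz audio stream and treats the ~200ms latency as exact; neither assumption is confirmed by the card.

```python
FRAME_MS = 40              # from the card
CONTROL_LATENCY_MS = 200   # approximate figure from the card
SAMPLE_RATE_HZ = 48_000    # SpectroStream sample rate from the card


def frames_per_second(frame_ms: int) -> float:
    """Number of model frames generated per second of audio."""
    return 1000 / frame_ms


def latency_in_frames(latency_ms: int, frame_ms: int) -> float:
    """How many frames elapse before a control change is audible."""
    return latency_ms / frame_ms


# Assumption: each frame covers 40ms of 48kHz audio, per channel.
samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000

print(frames_per_second(FRAME_MS), latency_in_frames(CONTROL_LATENCY_MS, FRAME_MS), samples_per_frame)
```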
9.3B-parameter open-weight text-to-image model with structured JSON prompting, explicit bounding-box layout + color-palette controls, multilingual text rendering and native 2K resolution; #1 open model on DesignArena, #2 overall. Apache-2.0 inference code, non-commercial weights.
Open omni-model for Physical AI built on a two-tower Mixture-of-Transformers: Reasoner VLM tower for multimodal observation reasoning + Generator tower that diffuses physics-aware video, sound and action sequences; tops PAI-Bench, Physics-IQ, RoboLab.
Fine-tunes image-layer-decomposition models with Flow-GRPO + LoRA — samples multiple candidate decompositions per image, scores with a VLM on five edit-centric criteria, adds a grid-based side-by-side calibration step; improves layer separation and reconstruction error on Crello.
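The "sample several candidates, score them, compare within the group" step can be illustrated with the usual GRPO-style normalization. This is a generic sketch, not the paper's exact recipe: the scores are made up, and averaging the five criteria with equal weight is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical VLM scores: one row per candidate decomposition of the same image,
# one column per edit-centric criterion (five criteria, each scored in [0, 1]).
candidate_scores = [
    [0.9, 0.8, 0.7, 0.9, 0.7],
    [0.4, 0.5, 0.3, 0.6, 0.2],
    [0.6, 0.6, 0.6, 0.6, 0.6],
]
rewards = [mean(row) for row in candidate_scores]  # assumption: unweighted mean
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.2f}")
```

Candidates that beat their group's average get a positive advantage and are reinforced; below-average ones are pushed down, so no separate value model is needed.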
End-to-end zero-shot TTS that directly models raw 16kHz waveforms with a Diffusion Transformer + flow matching, using non-overlapping patchification and multi-scale mel-spectrogram supervision; lowest WER and highest UTMOS on LJSpeech/LibriSpeech-PC without any pretrained autoencoder or vocoder.
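Non-overlapping patchification simply cuts the raw waveform into fixed-size chunks that become the model's tokens. A minimal sketch, assuming a hypothetical patch size of 160 samples (10ms at 16kHz) and dropping any trailing remainder; the paper's actual patch size and padding policy are not given in the card.

```python
from typing import List, Sequence


def patchify(waveform: Sequence[float], patch_size: int) -> List[List[float]]:
    """Split a waveform into non-overlapping patches, dropping the remainder."""
    usable = len(waveform) - len(waveform) % patch_size
    return [list(waveform[i:i + patch_size]) for i in range(0, usable, patch_size)]


def unpatchify(patches: List[List[float]]) -> List[float]:
    """Concatenate patches back into a flat waveform."""
    return [sample for patch in patches for sample in patch]


SAMPLE_RATE_HZ = 16_000   # from the card
PATCH_SIZE = 160          # hypothetical: 10ms per patch

one_second = [0.0] * SAMPLE_RATE_HZ
patches = patchify(one_second, PATCH_SIZE)
print(len(patches), "patches per second of audio")
```

Because the patches do not overlap, `unpatchify` recovers the waveform exactly (up to the dropped remainder), which is what lets the model work on raw audio without a separate autoencoder or vocoder.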
Real-time streaming character audio-video generator that decouples an LLM orchestrator (producing frame-aligned audio conditions from transcript + history) from a joint audio-video DiT doing bidirectional local denoising; runs real-time on a single H100 via two-stage distillation.
Real-time generative world model for closed-loop autonomous-vehicle simulation, mid/post-trained from Cosmos diffusion to autoregressively generate action-conditioned camera frames; a world-action model post-trained from it beats VLA-based Alpamayo 1.5 on Physical AI NuRec at 1/5 the params.
4B-parameter autoregressive TTS on a Qwen3-4B backbone with interleaved text+audio tokens (8-codebook Higgs Tokenizer at 25fps → 24kHz waveform); 102 languages (85 production-grade <5% WER/CER), zero-shot voice cloning, inline control tokens for emotion (21 types), style, prosody and SFX.
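The tokenizer figures imply a fixed audio-code budget per second of speech. The sketch below only multiplies the numbers in the card; whether the model emits one token per codebook entry or packs them differently is not stated, so "codes" here is deliberately not called "tokens".

```python
CODEBOOKS = 8              # from the card
FRAMES_PER_SECOND = 25     # from the card
SAMPLE_RATE_HZ = 24_000    # from the card


def audio_codes(seconds: float) -> int:
    """Number of codebook entries needed to represent `seconds` of audio."""
    return int(seconds * FRAMES_PER_SECOND) * CODEBOOKS


# Each tokenizer frame decodes to this many waveform samples.
samples_per_frame = SAMPLE_RATE_HZ // FRAMES_PER_SECOND

print(audio_codes(1), "codes per second;", samples_per_frame, "samples per frame")
```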
6.3B-parameter Native Audio-Visual Alignment framework with an Align-then-Fuse MMDiT that first builds audio-video correspondence in an interaction space then conditions joint denoising on external context; produces 720p stereo synchronized audio-video in ~1 min on 8 GPUs with Ulysses sequence parallel.
Expressive TTS supporting 15 languages with zero-shot voice prompting from 5–60s reference audio, granular emotion tags (sad, whispered, excited), and code-switching for pairs like Hindi-English and Spanish-English. Available via Microsoft Foundry, VSCode, Dynamics 365 Contact Center.
Microsoft's first text-to-image + image-to-image model with identity/brand preservation across stylization, pose and layout edits; #2 on image-editing leaderboards above Nano Banana 2. Priced $5/$8/$47 per M text/image-in/image-out tokens (Flash variant: $1.75/$33).
Real-time avatar foundation model that turns a single still image into a talking head — per-frame lip-sync, expressive head motion and facial reactions synced to audio in English, Mandarin, Spanish, French and dozens of other languages. Full pipeline runs on a single H100.
First open humanoid reference design under Isaac GR00T: ~6ft/150lb 31-DoF Unitree H2 chassis paired with dual Sharpa Wave 22-DoF tactile five-finger hands (75 DoF total) on Jetson Thor Blackwell compute; adopted by Ai2, ETH Zurich, Stanford, UCSD.
Microsoft AI's new speech-to-text model with state-of-the-art average WER across 43 languages and #1 in 18 of them. Launched at Build 2026 alongside MAI-Thinking-1, MAI-Image-2.5 and MAI-Voice-2.
Open ecosystem launched alongside Cosmos 3 with Agile Robots, Black Forest Labs, Generalist, LTX, Runway and Skild AI committed to advancing open world-foundation models for Physical AI synthetic-data and policy training.
iOS app gains direct share-sheet integration with Apple Notes (auto-transcribed into lyrics) and Voice Memos (auto-attached to the Create form), lowering friction from idea capture to song generation.
Embodied agents learning to act in complex worlds.
Agent Framework · EveryInc
Compound Engineering Plugin
Cross-platform plugin (Claude Code, Codex, Cursor, Copilot) bundling 37 skills and 51 agents for planning, code review, debugging and documentation around a compound engineering workflow where each unit of work makes the next easier.
Educational repo teaching agent-harness engineering in 20 progressive Python lessons (s01–s20), building from a single bash-tool loop up to multi-agent teams with MCP, planning, subagents, memory and worktree isolation.
Claude Code / Codex plugin that distills public Slack history of a colleague into a structured persona doc, then lets you query a single persona (ask-colleague) or convene a deliberating panel (ask-team); data stays in ~/.synthteam/.
Self-hosted Docker-Compose stack for AI quant trading across crypto (CCXT to 10+ venues), equities (IBKR, Alpaca), and FX (MT5) with vectorized IndicatorStrategy + event-driven ScriptStrategy runtimes, plus a `quantdinger-mcp` MCP server exposing markets, backtests and paper trades to Claude Code/Cursor.
Agentic R&D platform reaches GA at Build 2026: compose specialized agents over a graph-based knowledge engine for materials science, life sciences, semiconductors, energy workflows. Powered the Majorana 2 quantum chip development.
Local agent orchestrator that connects to Word, Excel, PowerPoint, Outlook and on-device files, routing subtasks to whichever of 19 frontier models (Claude, Gemini, GPT, Grok…) fits best; demoed alongside a hybrid local-cloud inference orchestrator at Computex 2026.
Open training framework built on Megatron-LM with heterogeneous tensor/data/recompute parallelism per model component, supporting LLMs, VLMs, diffusion (WAN 2.2) and embodied models (Pi0.5, GR00T); up to 5.04× speedup over Megatron baselines on NVIDIA + Kunlun XPUs.
Rust-based declarative agent environment manager using a YAML config + lock-file (à la Cargo.lock) to install and sync skills, MCPs and commands across 20+ agents (Claude Code, Cursor, Copilot, Gemini CLI) from any Git host.
Self-hosted ChatGPT/Claude alternative (Python/FastAPI + JS PWA) with multi-backend chat (vLLM, llama.cpp, Ollama, OpenRouter), local ChromaDB + fastembed ONNX vector store, deep research, blind model comparison, Docker/Apple Silicon Metal deployment.
New benchmark for multi-step financial-analyst agents covering equity research, credit analysis and corporate finance (DCF inputs, 10-K extraction, segment reconciliation); GPT-5.5 leads at 51.76%, Opus 4.7 at 51.51% — no model clears 52%.
Superchip combining a Blackwell RTX GPU (6,144 CUDA cores, 5th-gen Tensor cores with FP4) and a 20-core Grace CPU over NVLink-C2C; targets 14mm slim ~3lb laptops and small desktops from ASUS/Dell/HP/Lenovo/MSI/Surface this fall.
From June 1, 2026 every Copilot plan moves from PRUs to usage-based "AI Credits" (1 credit = $0.01). Monthly allotments: Pro 1,500 / Pro+ 7,000 / Max 20,000 / Business 1,900 per user / Enterprise 3,900 per user. Completions and Next Edit suggestions still uncounted.
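At one cent per credit, the monthly allotments translate directly into dollar-equivalent budgets. A small sketch using only the figures above; the dictionary keys and function name are illustrative, and it says nothing about how many credits a given request consumes.

```python
CENTS_PER_CREDIT = 1  # 1 credit = $0.01

# Monthly allotments from the announcement (Business and Enterprise are per user).
MONTHLY_CREDITS = {
    "Pro": 1_500,
    "Pro+": 7_000,
    "Max": 20_000,
    "Business": 1_900,
    "Enterprise": 3_900,
}


def credits_to_usd(credits: int) -> float:
    """Convert AI Credits to US dollars using integer cents to avoid rounding drift."""
    return credits * CENTS_PER_CREDIT / 100


for plan, credits in MONTHLY_CREDITS.items():
    print(f"{plan}: {credits:,} credits = ${credits_to_usd(credits):.2f} of usage per month")
```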
Open-source high-performance inference and serving library for interactive autoregressive video and world models, released alongside OmniDreams to support real-time closed-loop simulation workloads.