AI Daily — 2026-05-05

Covering 24 AI news items

🔥 Top Stories

1. GPT-5.5 Instant Rolls Out in ChatGPT with Smarter, Warmer Replies

OpenAI has begun rolling out GPT-5.5 Instant in ChatGPT, promising smarter, clearer, and more personalized answers in a warmer, more natural tone. In response to user feedback, the update also favors more concise replies. Source-twitter

2. Grok 4.3 Debuts on xAI API as Fastest Model

Grok 4.3 is now live on the xAI API, promoted as the fastest and most capable model to date. It tops the ArtificialAnlys leaderboards for agentic tool calling and instruction following and ranks #1 in ValsAI enterprise domains such as case law and corporate finance. The model supports a 1 million token context window and is priced at $1.25 for input and $2.50 for output (the source gives no unit; API prices are usually quoted per million tokens). Source-twitter

3. Heretic 1.3 Released with Reproducible Models and Benchmarks

Heretic 1.3 is now available, introducing reproducible models, an integrated benchmarking system, reduced peak VRAM usage, and broader model support. The release emphasizes transparency amid a crowded ecosystem of forks, noting around 20k GitHub stars and 13 million total downloads, and referencing a competitor allegedly using a plagiarized fork of Heretic. Source-reddit

📰 Featured

LLMs

DeepSeek V4 Pro Matches GPT-5.2 on FoodTruck Benchmark — DeepSeek V4 Pro was run on FoodTruck Bench, a 30-day agentic task with 34 tools and memory, where it tied Grok 4.3 and landed within 3% of GPT-5.2’s median. Coming ten weeks after GPT-5.2, the result suggests the China–US frontier gap is narrowing. DeepSeek also offers a roughly 17× cost advantage over GPT-5.2 (its prices are quoted as 0.435/0.87 per unit; the source does not state the unit). The post describes it as the first Chinese model in the frontier tier for agentic AI. Source-reddit
MTP support in llama.cpp on Strix Halo (PR 22673) — Reddit user Edenar reports testing MTP support in llama.cpp on an AMD Strix Halo setup. They rebuilt the amd-strix-halo-toolboxes with PR 22673 and ran the Qwen3.6-35BA3B-MTP-GGUF model with --spec-type mtp --spec-draft-n-max 3, reaching about 60-80 tokens/s with MTP versus roughly 40 tokens/s without, a 1.5-2× speedup. They note that the GGUF files are similar in size (~36 GB each), call the results impressive, and plan to experiment with Qwen 3.5 122B next. Source-reddit

LLM

Claude launches ready-to-run agent templates for financial services — Claude has introduced ready-to-run agent templates for finance, enabling tasks like pitching, valuation reviews, and month-end close. These templates can be installed as plugins in Cowork and Claude Code or run in production via cookbooks as Managed Agents. The update highlights Claude’s expanding toolset for enterprise finance workflows. Source-twitter
Gemma 4 Gains 3x Speed with Multi-Token Drafters — Google’s Gemma 4 model now runs up to 3x faster thanks to Multi-Token Prediction drafters (MTP drafters). The approach lets Gemma 4 predict multiple tokens at once, increasing output speed without sacrificing quality. The announcement was shared via Google’s developer channel on X (Twitter). Source-twitter
OpenAI Codex rate limits increased tenfold, fans celebrate — A tweet by Arav Jain praises OpenAI developers for boosting Codex rate limits by 10x, signaling strong enthusiasm for the update. The post treats the change as a major improvement for coding tasks, reflecting positive reception to AI tooling enhancements. Source-twitter
Use Qwen3.6 with Pi coding agent, forget the rest — A Reddit user reports that pairing Qwen3.6 35B with a Pi coding agent, exa web search, and an agent-browser extension dramatically improves coding, maintenance, and web research tasks. They claim this setup covers about 80% of their use cases and can even surpass Perplexity for web research, with more complex planning handled by Kimi2.6. Source-reddit
3x LLM Inference Speedup on Google TPUs via Diffusion-style Speculative Decoding — Google Developers Blog describes speeding up LLM inference on Google TPUs using diffusion-style speculative decoding, achieving about 3x faster performance. The technique predicts tokens in a diffusion-inspired way to reduce compute and latency, enabling quicker generative workloads on TPU-backed deployments. Source-reddit
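
Several items above (MTP in llama.cpp, Gemma 4's MTP drafters, and the TPU post) rely on the same general idea: a cheap drafter proposes a few tokens and the main model verifies them. The sketch below is a minimal greedy version of generic speculative decoding, not the diffusion-style or MTP method from any of those sources. Both "models" are hypothetical toy functions invented for illustration, and the names `target_model`, `draft_model`, and `speculative_decode` are assumptions.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # greedy next-token function


def target_model(context: List[Token]) -> Token:
    """Hypothetical 'large' model: a fixed deterministic function of the context."""
    return (sum(context) * 31 + len(context)) % 50


def draft_model(context: List[Token]) -> Token:
    """Hypothetical 'small' drafter: agrees with the target, except that it is
    deliberately wrong whenever the context length is a multiple of 4."""
    guess = target_model(context)
    return (guess + 1) % 50 if len(context) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[Token],
                       n_new: int, k: int = 3) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (tokens, verification rounds)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(out) < goal:
        # 1. The drafter proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal. A real system scores all k + 1
        #    positions in one batched pass; here we just loop over them.
        rounds += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # fix the first mismatch, drop the rest
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all accepted: bonus token
        out.extend(accepted)
    return out[:goal], rounds


if __name__ == "__main__":
    tokens, rounds = speculative_decode(target_model, draft_model, [1, 2, 3], 20)
    print(tokens == greedy_decode(target_model, [1, 2, 3], 20), rounds)
```

Because every accepted token is checked against the target's own greedy choice, the output is identical to plain greedy decoding with the target. Any speedup comes from needing fewer verification rounds than generated tokens, which depends on how often the drafter is right.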

Multimodal

Hermes Agent builds full videos with HeyGen HyperFrames — Hermes Agent can now create complete videos using HeyGen’s official HyperFrames skill. HyperFrames videos are HTML-native, giving the agent full control over the final output and enabling HLS playback. Source-twitter
Map2World Enables Segment-Conditioned 3D World Generation — Map2World presents a framework for 3D world generation conditioned on user-defined segment maps of arbitrary shapes and scales. This approach aims to overcome grid-layout constraints and scale inconsistencies seen in prior methods, delivering more globally coherent scenes. The work advances AI-driven 3D content generation with potential benefits for immersive content creation and autonomous driving simulation. Source-huggingface

Open Source

MolmoAct2: Open Action Reasoning for Real-World Deployment — MolmoAct2 is introduced as a fully open action reasoning model designed for practical deployment in Vision-Language-Action (VLA) robotics systems. The authors argue that current frontier models are closed or expensive, and that reasoning-augmented policies incur prohibitive latency, presenting MolmoAct2 as an accessible, deployable alternative. This work aims to advance open-source, embodied AI for real-world use. Source-huggingface

AI Benchmarking

ProgramBench Tests Rebuilding Large Binaries from Scratch with AI Agents — ProgramBench formalizes a benchmark of 200 tasks where an AI agent must build a complete program from a target executable and usage files, with no internet access or decompilation allowed. It tests language choice, architecture design, and software engineering decisions, backed by 6 million lines of generated behavioral tests that were filtered for quality. The project has open-sourced its GitHub repository, Hugging Face resources, and Docker images, with all results published at programbench.com. Source-reddit
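
To make "behavioral tests" concrete, here is a minimal sketch of differential testing: feed the same inputs to a reference executable and a rebuilt candidate, then compare exit codes and standard output. It is an illustration under assumptions, not ProgramBench's actual harness. The helper names (`run`, `behavioral_diff`) are hypothetical, and the "executables" are tiny Python one-liners standing in for real binaries.

```python
import subprocess
import sys
from typing import List, Tuple


def run(cmd: List[str], stdin_text: str) -> Tuple[int, str]:
    """Run a command with the given stdin; return (exit code, stdout)."""
    proc = subprocess.run(cmd, input=stdin_text, capture_output=True,
                          text=True, timeout=30)
    return proc.returncode, proc.stdout


def behavioral_diff(reference_cmd: List[str], candidate_cmd: List[str],
                    cases: List[str]) -> List[str]:
    """Return the inputs on which the candidate's observable behavior
    (exit code and stdout) differs from the reference."""
    return [case for case in cases
            if run(reference_cmd, case) != run(candidate_cmd, case)]


# Stand-ins for a target executable and two rebuilt candidates (hypothetical).
REFERENCE = [sys.executable, "-c",
             "import sys; print(sys.stdin.read().upper(), end='')"]
GOOD = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())"]
BAD = [sys.executable, "-c",
       "import sys; print(sys.stdin.read().lower(), end='')"]

CASES = ["hello\n", "Mixed Case\n", "123\n"]

if __name__ == "__main__":
    print(behavioral_diff(REFERENCE, GOOD, CASES))  # no differing inputs
    print(behavioral_diff(REFERENCE, BAD, CASES))   # inputs containing letters
```

The candidate is judged only on observable behavior, so it can be written in any language and with any internal architecture, which matches the freedom the benchmark description gives the agent.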

Voice Cloning

OmniVoice Enables One-Shot Voice Cloning, User Amazed — A Reddit post praises OmniVoice for one-shot voice cloning that is very easy to use. The author expresses astonishment, saying it’s everything they’ve dreamed of. They acknowledge that OmniVoice isn’t an LLM but share it for its voice cloning performance. Source-reddit

AI Safety

US Tech Firms to Review AI Models for National Security Before Release — A deal has been struck between the United States and technology companies to review AI models for national security implications before they are released publicly. The arrangement aims to ensure safety and regulatory compliance prior to deployment of new AI systems. Source-reddit

⚡ Quick Bites

Efficient AI Models Spark Worries Over Codex-Claude Migration — The post notes that AI models are very efficient for their capability level and expresses concern about Codex limits. It claims the current Claude-to-Codex migration is being encouraged by an intentional honeymoon period and warns of a potential rug pull. Source-twitter
Sama seeks 5.5 AI feats with massive token budgets — Sam Altman (via the @sama account) asks for examples of impressive things built with 5.5 (presumably GPT-5.5) that weren’t possible with earlier versions. He is especially interested in use cases that relied on extremely large token budgets. Source-twitter
Web2BigTable: Bi-Level Multi-Agent LLM for Internet-Scale Search — Web2BigTable introduces a bi-level, multi-agent LLM framework for scalable web-to-table search. It targets deep reasoning over long search trajectories and structured, schema-aligned aggregation across heterogeneous sources. The framework aims to address both breadth and depth in agentic web search for information extraction. Source-huggingface
CocoIndex: Incremental context engine for long-horizon AI agents — CocoIndex provides an incremental engine that turns diverse data sources (codebases, notes, inboxes, Slack, PDFs, and videos) into live, fresh context for AI agents and LLM apps, recomputing only deltas. It aims to deliver production-ready AI agents quickly with parallel processing and Python tooling. The project is open-source at cocoindex-io/cocoindex with weekly updated examples. Source-github
Running Local LLMs: 200M Tokens in 5 Days, ROI — A Reddit post describes running local LLMs (Qwen-397b) on a 2-Spark cluster with Hermes counting tokens, reaching 200 million tokens in five days. The author estimates ROI using a token price of $1.25 per million, suggesting about $1,250/month in value and a payback window of around six months, while noting caveats about costs and alternatives. Source-reddit
Local 26B LLM Runs Fast on CPU, No GPU Needed — A Reddit user reports running Gemma4 26B locally on a CPU-only machine (i5-8500 with 32GB RAM) at speeds they describe as impressive. They add that 12B models also work well on the same setup, suggesting that CPU-only inference is feasible for fairly large local models. Source-reddit
Can Language Models Learn Skills from Context? — Language models often need to reason over contexts beyond their learned parameters. The piece proposes context learning and inference-time skill augmentation to convert contextual rules into natural-language skills. It notes two major challenges, including the prohibitive cost of manual skill annotation for context learning. Source-huggingface
Excited for voice models changing how we interface with AI — The post expresses excitement about the maturation of voice models. It notes that people are already changing the way they interface with AI, signaling a shift toward voice-enabled interactions and broader adoption of voice as a primary AI modality. Source-twitter
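
The ROI estimate in the local-LLM item above (200 million tokens in five days, valued at $1.25 per million) can be checked with a few lines of arithmetic. This is a rough sketch under stated assumptions: a 30-day month, constant throughput, and a hypothetical hardware cost of $7,500, which is not from the post and was chosen only because it reproduces the quoted six-month payback at $1,250/month. The function names are illustrative.

```python
def monthly_value_usd(tokens: float, days: float, price_per_million: float,
                      days_per_month: float = 30.0) -> float:
    """Extrapolate observed token throughput to a month and price it."""
    tokens_per_month = tokens / days * days_per_month
    return tokens_per_month / 1_000_000 * price_per_million


def payback_months(hardware_cost_usd: float, monthly_value: float,
                   monthly_running_cost_usd: float = 0.0) -> float:
    """Months until the net monthly value covers the hardware cost."""
    net = monthly_value - monthly_running_cost_usd
    if net <= 0:
        return float("inf")  # never pays back
    return hardware_cost_usd / net


if __name__ == "__main__":
    # Straight-line extrapolation of the figures quoted in the post.
    print(monthly_value_usd(200_000_000, 5, 1.25))
    # Payback using the post's own $1,250/month and an assumed $7,500 of hardware.
    print(payback_months(7_500, 1_250))
```

A straight 30-day extrapolation gives $1,500 per month, somewhat above the post's "about $1,250/month", so the author's figure presumably reflects assumptions this summary does not capture, such as less than full utilization. Electricity and other running costs can be passed as `monthly_running_cost_usd`, which lengthens the payback window.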

Generated by AI News Agent | 2026-05-05