AI Daily — 2026-05-02
OpenAI and Anthropic Release New Models Amid Party-Like Previews · AI-Generated Proof Advances Er...
Covering 34 AI news items
🔥 Top Stories
1. OpenAI and Anthropic Release New Models Amid Party-Like Previews
OpenAI unveils a new model release with a festive, playful rollout. Concurrently, Anthropic debuts a research-preview model claiming broad internet access and warning of potential job displacement. The coverage adopts a provocative, meme-driven tone circulating on X. Source-twitter
2. AI-Generated Proof Advances Erdős Problem 1196 and a 60-Year Conjecture
Researchers refined and adapted the proof method from GPT-5.4 Pro to address Erdős Problem 1196 and prove several additional results, including a 60-year-old conjecture by Erdős, Sárközy, and Szemerédi. They argue AI-generated proofs can open new avenues, and they announced the result at the Future of Mathematics Symposium. Source-twitter
3. GLM-5V-Turbo: Native Foundation Model for Multimodal Agents
GLM-5V-Turbo is introduced as a step toward native foundation models for multimodal agents. The paper notes that as foundation models are deployed in real environments, agentic capability must span perception and action across heterogeneous contexts such as images, videos, webpages, documents, and GUIs. It emphasizes integrating multimodal perception as a core component of reasoning, planning, tool use, and execution. Source-huggingface
📰 Featured
Multimodal
- Visual AI Evolves to Agentic World Modeling, Beyond Appearance — Recent visual generation models have advanced photorealism and interactive editing but still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. The authors argue for shifting from mere appearance synthesis to intelligent visual generation grounded in structure, dynamics, domain knowledge, and causal relations, introducing a five-level taxonomy to frame this evolution. Source-huggingface
- RADIO-ViPE Enables Open-Vocabulary Semantic SLAM from Monocular RGB — RADIO-ViPE (Reduce All Domains Into One — Video Pose Engine) is an online semantic SLAM system that enables geometry-aware open-vocabulary grounding by linking natural-language queries to localized 3D regions and objects in dynamic environments. Unlike methods requiring calibrated RGB-D input, RADIO-ViPE operates directly on raw monocular RGB video streams and requires no camera intrinsics, depth sensors, or pose initialization. It relies on tight multi-modal fusion to support open-vocabulary understanding in real time. Source-huggingface
LLM
- Apr 2026 Best Local LLMs: Qwen3.5, Gemma4 Shine — The post highlights ongoing excitement around local, open-weight LLMs with fresh releases like Qwen3.5 and Gemma4, plus claims of state-of-the-art performance from GLM-5.1. It also spotlights accessible options such as Minimax-M2.7 and Bonsai 1-bit models, and invites detailed user setups to navigate benchmarking caveats. Source-reddit
- Qwen3.6-27B with agentic search achieves 95.7% SimpleQA locally — An LDR maintainer reports a fully local LLM setup using an RTX 3090 with an Ollama backend running qwen3.6:27b. The LangGraph_agent, built on LangChain tool-calling with parallel subtopic decomposition (up to 50 iterations), is used and self-graded by the LLM. Benchmark results show Qwen3.6-27B achieving 95.7% SimpleQA (287/300) and 77.0% on xbench-DeepSearch; Qwen3.5-9B at 91.2% and 59.0%, with gpt-oss-20B at 85.4%. Source-reddit
- Researchers debunk frontier-model size claim; GPT-5.5 ~1.5T with wide CI — A viral paper claimed GPT-5.5 has 9.7 trillion parameters and similar claims for frontier models. Researchers Ben Sturgeon and the author investigated and found serious issues in the paper; after correcting the methodology, GPT-5.5 is estimated at about 1.5 trillion parameters with a 90% confidence interval from 256 billion to 8.3 trillion. Source-twitter
- Moondream Inference on Apple Silicon: Local 3-Model Pipeline — An approach for running Moondream inference directly on Apple Silicon without MLX, using a triple-model stack (Whisper, Qwen, Moondream) to enable offline processing with around 1-second latency. The setup supports Mac, screen visibility, and offline HLS playback, showcasing on-device AI capabilities. Source-twitter
- Eywa Introduces Heterogeneous Agentic Framework for Scientific Tasks — The paper presents Eywa, a heterogeneous agentic framework intended to extend language-centric systems to scientific domains. By integrating domain-specific foundation models, Eywa aims to tackle tasks beyond natural language and broaden the applicability of agentic LLMs to scientific problems. Source-huggingface
- Qwen3.6-27B on Windows: 72 tok/s with native vLLM — Windows-native setup for Qwen3.6-27B using a patched vLLM fork delivers 72 tok/s for short prompts on an RTX 3090, 64.5 tok/s for long prompts (~25k tokens), and 53.4 tok/s at 127k context on a single GPU. With two RTX 3090 GPUs, 160k context is achievable. The package includes a portable launcher/installer, requires no admin or Python, and exposes an OpenAI-compatible endpoint at http://127.0.0.1:5001/v1, with a GitHub release for the project. Source-reddit
- Qwen 3.6 Tops Benchmarks, Gemma 4 Shines in Reality — A tester compares Qwen 3.6 and Gemma 4 on 27B/31B vision models locally with vLLM and FP8, finding that benchmarks favor Qwen while real-world performance favors Gemma. The write-up highlights observed behavioral differences, token-burn behavior, and the gap between official benchmarks and practical tasks. Source-reddit
- Warpdrv: Open-source Llama.cpp launcher for local Qwen models — Warpdrv is an open-source launcher for running Llama.cpp-based LLMs locally, enabling two Qwen models to run in parallel with different backends on a high-end setup. The project showcases Qwen 3.6 27b and 35b on hardware including Strix Halo and RTX Pro 5000 Blackwell, using Ubuntu 25.10 and multiple acceleration backends. It includes a built-in model-router, support for opencode and Claude-Code workflows, MCP.json, tool calling, and experimental KV-cache checkpointing, though it does not ship with a llama.cpp build by default. Source-reddit
AI Agents
- Hermes Agent Emerges as Best Local AI Harness (2026) — A social post argues that when running local AI, the agent harness matters more than the model. It praises Hermes Agent for clean tool calls, persistent memory, and sub-agents, claiming it outperforms OpenClaw and other frameworks. The author ranks Hermes as the best general-purpose agent in 2026 and notes its broad out-of-the-box capabilities across multiple hardware setups. Source-twitter
AI
- xAI Voice Cloning Live with 80+ Voices in 28 Languages — xAI has launched voice cloning via its API, allowing developers to create custom voices in under two minutes or choose from a library of 80+ voices across 28 languages. The technology targets voice agents, audiobooks, video game characters, and other applications, with Hermes Agent support coming soon. Source-twitter
- Chinese AI models ~8 months behind leading U.S. models — A commentator claims Chinese AI models lag the leading U.S. models by about eight months. DeepSeek V4 is described as an example, with capabilities behind top U.S. models by roughly eight months. The claim cites nist.gov as the source. Source-twitter
- Anipartment Replicated with Open-Source AI Models — A Reddit post titled Anipartment featured detailed anime-style images of a person relaxing in a fictional apartment. The original post and comments were deleted, but a nested thread later revealed a rough description and an example prompt, which the author then attempted using open-source models ZIT and Klein9B to recreate the detailed visuals. Source-reddit
- Replicating TurboQuant: PROD results lag paper’s claims — Reddit user reports implementing TurboQuant (arXiv:2504.19874) from scratch. The MSE variant performs as expected, but the PROD version achieves only about 95.8% correlation at 4-bit, short of the paper’s claimed >99%. They also observe attention quality degrades despite the correlation, discuss potential causes like variance scaling and bit packing, and question whether results depend on dimensions or setup; code is provided. Source-reddit
World Models
- Stanford Seminar Deepens World Modeling: From Reconstruction to Latent Prediction — Stanford’s latest seminar dives into the evolution of world modeling in AI, examining the shift from traditional reconstruction methods to latent-space prediction. It covers JEPA and World Models, Causal JEPA, the LOWER Model, practical applications, planning, and the future outlook. Source-twitter
Open Source
- Flare-TTS 28M: Open-Source TTS Model Trained on LJSpeech — Reddit user LH-Tech_AI released Flare-TTS 28M, their first text-to-speech model trained from scratch on a single NVIDIA A6000 GPU for around 24 hours and ~300 epochs using the full LJSpeech dataset. The model is free and open-source on Hugging Face, with an audio sample provided. It speaks English and sounds somewhat robotic, but invites experimentation. Source-reddit
LLMs
- Kv Cache Quantization: Performance Gains vs Reliability Risk — A Reddit post discusses quantizing KV cache to FP8 for Qwen-3.6 27B using vLLM on two NVIDIA GeForce RTX 3090 GPUs for long-horizon workloads. The author notes FP8 KV cache introduces subtle errors and reliability issues, while 16-bit KV cache offers better reliability and speed. They question why the community treats KV-cache quantization as a serious optimization. Source-reddit
⚡ Quick Bites
- Prompting AI to act as manager, not coder, sparks multi-agent drama — A Twitter thread describes instructing GPT-5.5 to be a manager rather than a coder and to delegate to sub-agents. Over time, the model ends up coding anyway, highlighting tensions between role instructions and emergent multi-agent behavior, with a nod to Claude Opus 4.7. The piece offers a reflective look at AI governance and delegation dynamics. Source-twitter
- Sam Altman: Smarter AI models trump cheaper or faster — Sam Altman argues that while cheaper and faster AI models are desirable, being smarter remains the top priority. He notes in a tweet that intelligence should drive value more than reducing cost or latency. The remark underscores a continuing industry debate about balancing efficiency with capability. Source-twitter
- Exploratory Sampling Boosts Semantic Diversity in LLMs — Researchers propose Exploratory Sampling (ESamp), a decoding method that explicitly encourages semantic diversity in LLM generation beyond standard stochastic sampling. The approach is motivated by the idea that neural networks make lower-error predictions on inputs similar to those seen before, aiming to expand semantic exploration during test-time generation. Source-huggingface
- simstudioai/sim Launches Open-Source AI Agent Orchestration Platform — Simstudioai/sim is an open-source platform to build, deploy, and orchestrate AI agents. It offers visual workflows, Copilot-assisted node generation and debugging, and supports 1,000+ integrations with LLMs. It also enables vector databases for grounding, and provides cloud-hosted and self-hosted deployment options via sim.ai or Docker. Source-github
- Quadtrix.cpp: CPU-only GPT-style Transformer in C++17 — An independent project, Quadtrix.cpp, implements a GPT-style language model entirely in C++17 with no external dependencies, training on CPU. The model uses hand-written tensor operations and full analytical backpropagation, totaling 0.83M parameters across 4 layers with 4 heads and 200 dimension. It reports a best validation loss around 1.64 after about 76 minutes of training on CPU on a 128-token context window and a 31.4M-character corpus. Source-reddit
- Visualizer for Hugging Face Models Visualizes Model Architectures — hfviewer.com is a new tool that visually explores Hugging Face model architectures. Users paste a model URL to generate an interactive diagram, with examples like Qwen3.6-27B and the Gemma 4 family, and the author invites feedback. Source-reddit
- Tinygrad Driver Tests MoE Speeds on RDMA Cluster — A Reddit post announces plans to test Tinygrad drivers for MoE workloads on a Blackwell + M3 Ultra RDMA cluster with nearly 2 TB RAM. The author invites the community to suggest benchmarks and collaborate on experiments. Source-reddit
- Unsloth fixes Mistral 3.5 inference bug with YaRN parsing — Unsloth, in collaboration with Mistral, released updated GGUFs to fix a Mistral Medium 3.5 inference bug affecting multiple implementations. The issue stemmed from a YaRN parsing quirk impacting several frameworks including transformers and llama.cpp, resolved by changing mscale_all_dim from 1 to 0. Additionally, mmproj files were corrected to generate properly. Source-reddit
- Petdex launches public gallery for Codex pets — Petdex is a public gallery for discovering, sharing, and installing Codex pets using a single curl. Submissions are open, promoting ‘Pets. Now in Codex’ and a /pet command to wake a pet. Source-twitter
- Elon Musk to crash GPT-5.5 event, rumors of a curse — A Twitter post claims Elon Musk will attend a GPT-5.5 event uninvited and cast a disruptive ‘curse’ on the proceedings. The metaphor likens Musk’s alleged appearance to a Sleeping Beauty witch, signaling potential controversy rather than confirmed news. Treat this as speculation about AI industry chatter surrounding GPT-5.5. Source-twitter
- Codex vs Claude Code: 1.7B tokens vs 80M—who hit the limit? — The author claims spending 1.7 billion tokens on Codex Pro 5x and 80 million on Claude Code Max 20x yesterday, and asks which one triggered a usage-limit warning. The post highlights large-scale token usage differences between OpenAI’s Codex and Anthropic’s Claude Code, signaling potential limits for high-volume code-generation tasks. Source-twitter
- Claude rate-limited; user tests DeepSeek V4, costs soar after 10M+ tokens — An AI practitioner reports being rate-limited by Claude and tries DeepSeek V4 for the first time. After running through over 10 million tokens, they express shock at the cost, highlighting potential pricing challenges when scaling LLM usage. The post draws attention to the trade-offs between established models and newer tools in practical workloads. Source-twitter
- Ban phrases on llama.cpp with this script. — Reddit highlights a script to ban phrases in llama.cpp, with setup instructions in the linked README. The GitHub repo llama-cpp-phrase-ban, created by BigStationW, provides tooling to filter phrases used with llama.cpp; the post was submitted by Total-Resort-3120. Source-reddit
Generated by AI News Agent | 2026-05-02