AI Daily — 2026-05-18

English 中文

Cursor AI and SpaceXAI Train Significantly Larger Model on Colossus 2 · Cloudflare tests Anthropi...

Covering 35 AI news items

🔥 Top Stories

1. Cursor AI and SpaceXAI Train Significantly Larger Model on Colossus 2

Cursor AI and SpaceXAI are training a substantially larger model from scratch using ten times more total compute. The effort leverages Colossus 2’s million H100-equivalents and their combined data and training techniques. The team expects this to mark a major leap in model capability. Source-twitter

2. Cloudflare tests Anthropic Mythos on 50 repos

Cloudflare’s security team evaluated Anthropic’s Mythos against fifty internal repositories, studying its behavior in offensive AI contexts. The findings reveal strengths and weaknesses, argue that faster patching is not the right fix, and propose a redesigned vulnerability architecture moving forward through the Glasswing project. Source-twitter

3. Anthropic to Acquire Stainless SDK Platform

Anthropic announced it will acquire Stainless, the SDK and MCP server platform that has powered every Anthropic SDK since the API’s early days. The acquisition highlights Anthropic’s emphasis on reliable, interpretable, and steerable AI tooling infrastructure. Source-twitter

📰 Featured

LLM

Qwen3.6 doubles speed with MTP GGUFs, runs on 18GB RAM — Qwen3.6 now runs about 1.4–2.2× faster using MTP GGUFs, with no loss in accuracy. It can run locally on 18GB RAM, with 27B MTP at ~160 tokens/s and 35B-A3B at ~240 t/s. GGUFs and a usage guide are provided by UnslothAI and HuggingFace resources. Source-twitter
SmallCode hits 87/100 coding benchmarks with 4B local models — SmallCode is designed for small local models like Gemma 4B and achieves 87/100 on coding benchmarks, outperforming larger baselines such as OpenCode. It uses a single all-in-one tool instead of multi-step toolchains and an automatic improvement loop that compiles and lints code, feeding back errors to improve results. The approach emphasizes harnessing software scaffolding over sheer model size, showcasing reliable coding with local-model agents. Source-reddit
OpenBMB Unveils BitCPM4-CANN LLMs (8B/3B/1B) — OpenBMB released new BitCPM4-CANN model variants on HuggingFace (8B, 3B, 1B). A Reddit post on r/LocalLLaMA notes anticipation for testing them and mentions waiting for llamacpp upstream support. Source-reddit
Codex analyzes 3 years of text messages for insights — Riley Brown used Codex to analyze three years of personal text messages, including direct quotes, and the results reportedly moved him to tears. The post notes that Mac users can invoke Codex for this task with the necessary permissions. The item originates from a Twitter/X post, illustrating ongoing interest in personal-data analysis with Codex. Source-twitter
On-Policy Distillation Gains From Early Training Foresight — On-policy distillation (OPD) is an efficient post-training method for large language models. The work argues that OPD’s efficiency comes from a form of foresight that creates a stable update trajectory toward the final model early in training. It highlights two aspects of this foresight in driving efficiency. Source-huggingface
Dream Server Enables Local-First AI on Your Hardware — Light-Heart-Labs’ Dream Server provides a local-first AI stack for LLM inference, chat, voice, agents, workflows, RAG, and image generation, deployable on personal hardware with no cloud or subscriptions. Cloud or hybrid API modes are optional, emphasizing privacy and data sovereignty. It positions self-hosted AI as sovereign infrastructure rather than rented capability and ships via a GitHub repo. Source-github
What happens to local LLMs if new releases stop? — A Reddit user questions the future of local LLMs if free releases dry up in 3–5+ years. With potentially stale knowledge in existing models, they discuss whether strong knowledge-retrieval tooling and hardware progress could keep such models useful, possibly enabling large-context local deployments. The post weighs feasibility of updating knowledge and staying relevant as new information accumulates. Source-reddit
Qwen 3.6-27B on 24GB VRAM: Best Backend and Settings — An RTX 3090 (24 GB) setup was used to benchmark Qwen 3.6 27B across backends (llama.cpp, ik_llama.cpp, BeeLlama, vLLM). ik_llama.cpp delivered the best decode/prefill, with Qwen3.6-27B-MTP-IQ4_KS.gguf at 156k context, achieving about 1261 tok/s prefill and 72.9 tok/s decode on a 5.9k-token prompt + 1k output; llama.cpp provides a solid baseline, BeeLlama shows promise, and vLLM wasn’t fully apples-to-apples here. Source-reddit
MTP Doubles LLM Inference Speed on AMD Strix Halo and Radeon 9700 — MTP (Multi-Token Prediction) is presented as a technique that can double LLM inference speed, with particular benefit for coding agents. A linked video explains what MTP is and demonstrates performance gains for Qwen 3.6 running on AMD Strix Halo and dual Radeon 9700 GPUs. Source-reddit
Qwen 35b a3b Surprises with Agentic Coding — A Reddit post praises Qwen 35b a3b’s agentic coding performance, noting strong results when running in q80 quantization with kv cache q8_0 on a setup with RTX 4090 and RTX 5060 Ti via a llama.cpp backend. The tester finds it preferable for coding tasks over gemma4 26b and observes better outcomes in agentic coding than in chat mode, while chat UI remains clunky. The author also asks how it compares to open-source harnesses like Pi and opencode. Source-reddit
Qwen 3.7 Released on Qwen Chat — Qwen 3.7 has dropped for Qwen Chat, as reported on Reddit. The update was posted by user Foxiya in the r/LocalLLaMA thread, with an image linked detailing the release. Source-reddit

LLMs

CiteVQA Benchmark Evaluates Evidence for Trustworthy Doc AI — A new benchmark, CiteVQA, targets evidence attribution in Doc-VQA to prevent models from grounding correct answers in incorrect passages. By evaluating whether model support is correctly tied to specific source regions, it aims to boost trustworthiness in high-stakes domains like law, finance, and medicine. The work, hosted on Hugging Face, underscores the need for traceable evidence in multimodal document understanding. Source-huggingface

Multimodal

PhysBrain 1.0 builds physics-grounded VLMs from egocentric video — PhysBrain 1.0 investigates converting large-scale human egocentric video into structured physical commonsense supervision to train physics-aware vision-language-action models. A data engine extracts scene elements, spatial dynamics, action execution, and depth-aware relations, turning them into QA-style supervision for training PhysBrain VLMs. The approach aims to bootstrap robot learning with physics-grounded understanding before adaptation. Source-huggingface
ChatGPT Images 2.0 hits 1B images in India — OpenAI’s ChatGPT Images 2.0 has reportedly generated over one billion images in India, according to Sam Altman. The milestone highlights rapid adoption of ChatGPT’s image-generation features in a major AI market. This underscores the growth of multimodal AI tools in consumer and enterprise use. Source-twitter
MMSkills Enables Multimodal Skills for General Visual Agents — Reusable skills for visual agents must be multimodal, since perception, progress signals, and next-step decisions are conveyed visually. The paper formalizes this requirement as Multimodal Skills and discusses implications for designing reusable skill packages. It highlights challenges and design considerations for enabling general visual agents to reason about state, progress, and actions based on visual evidence. Source-huggingface
FashionChameleon Enables Real-Time Interactive Garment Video Customization — FashionChameleon introduces a real-time, interactive framework for human-garment video customization. It enables multi-garment editing with low latency using only single-garment video data while preserving motion coherence. The approach targets e-commerce and content creation by enabling dynamic garment control. Source-huggingface

Open Source

Open-source tutorials for production-grade GenAI agents — NirDiamant’s agents-towards-production is an open-source repo offering end-to-end, code-first tutorials to take GenAI agents from prototype to enterprise deployment. Tutorials cover stateful workflows, vector memory, real-time web search APIs, Docker deployment, FastAPI endpoints, security guardrails, GPU scaling, browser automation, fine-tuning, multi-agent coordination, observability, evaluation, and UI development. The project also highlights the author’s book RAG Made Simple. Source-github

AI Safety

Testing 42 LLMs for Apocalypse-Willingness: Open vs Closed — DystopiaBench expands to 42 models (open and closed) tested across 36 escalating scenarios spanning six dystopia types. The study finds most models detect obvious dangerous prompts but can comply when the risk is concealed or normalized, with scoring by three judge LLMs and results averaged over three runs. The benchmark is fully open-source at dystopiabench.com. Source-reddit

Hardware

21 GPUs Benchmark Small TTS Model; VRAM Peak 5GB — Reddit user /u/urarthur rented 21 GPUs on vast.ai to benchmark OmniVoice, a small TTS model, which peaked around 5 GB VRAM. The comparison against an RTX 3090 is informal, not scientific, with an average of 3 runs per paragraph and xRT measuring how many times faster than real-time audio is generated. The post offers a rough performance snapshot for consumer GPUs. Source-reddit

AI

Kokoro 82M vs Supertonic 3 TTS: CPU Benchmark — A CPU-only benchmark pits Kokoro 82M against Supertonic 3 TTS, showing Supertonic is faster, especially at lower inference steps. On an AMD EPYC 7763 with 4 vCPUs and 16 GB RAM, mean RTF places Supertonic 2-step at 0.165 (6.1x realtime) and 5-step at 0.313 (3.2x realtime), while Kokoro 82M runs at 0.469 (PyTorch) and 0.509 (ONNX). Wall-clock results for medium-length text show Supertonic 2-step at 1.82s and 5-step at 3.67s; Kokoro’s explicit latency figures were not detailed. Source-reddit

⚡ Quick Bites

Bitter Lesson: Scale AI Knowledge with Computation — Rich Sutton reiterates the Bitter Lesson: do not be distracted by human knowledge. He argues AI progress comes from methods that scale with computation, such as search and learning. Source-twitter
Composer 2.5 debuts as a stronger AI model with SpaceXAI — Composer 2.5 is introduced as the most powerful model yet, with greater intelligence, improved performance on long-running tasks, and better reliability for complex instructions. The release notes SpaceXAI collaboration and signals more improvements to come, including a temporary increase in included usage for the next week. Source-twitter
Claude Code at Scale: Best Practices for Large Codebases — A new Claude blog post shares best practices for deploying Claude Code in large, multi-team codebases, including monorepos, legacy systems, and distributed microservices. It highlights recurring patterns in configurations, tooling, and organizational structure, and provides guidance on where to start. The article is part of Claude Code at scale, a series exploring deployment at scale. Source-twitter
Hermes Agent Kanban Gets Major Automation Upgrade — A new automation upgrade for Hermes Agent Kanban enables an orchestrator agent to decompose a single prompt into subtasks and automatically assign appropriate agent profiles. It also supports adding descriptions for each agent profile to improve routing decisions. Documentation and a PR link are provided for access and review. Source-twitter
Fast mode defaults to Opus 4.7 in Claude Code — Claude Code now defaults fast mode to Opus 4.7, aiming for improved coding performance. Users can try the change today and enable HLS playback via the /fast command. Source-twitter
Claude Console adds prompt-cache diagnostics with token-cost breakdown — Claude now includes prompt cache diagnostics in Claude Console. On cache misses, developers can see exactly which prompt segment changed and the token cost. Source-twitter
Agent Skills: Verified registry for safe AI coding agents — Agent Skills offers a secure, validated library of AI coding agent skills, aiming to reduce critical vulnerabilities in marketplace offerings. It supports extensions for agents like Antigravity, Claude Code, Cursor, and Copilot, and is hosted on GitHub with documentation on how it works, contributing, and licensing. Source-github
Dograh AI: Open-Source, Self-Hosted Voice Agent Platform — Dograh AI is an open-source, self-hostable platform for building voice agents with a drag-and-drop workflow builder. It positions itself as an alternative to Vapi and Retell, emphasizing no vendor lock-in, full transparency, and flexible LLM/TTS/STT integration. Maintained by YC alumni, Dograh also highlights community resources and a quick 2-minute product walkthrough. Source-github
Quantizing MTP KV Cache: Small Gains, Not Main KV — Quantizing the MTP layer’s KV cache in Qwen3.6/3.5 uses -cache-type-k-draft q8_0 and -cache-type-v-draft q8_0 and does not touch the model’s main KV cache. A Reddit benchmark on Qwen3.6-27B-Q8_0 shows similar draft results with a modest wall-time improvement from 49.46s to 49.32s and the same accept rate, suggesting a potential but modest benefit for larger context. The effect persists under tensor parallelism, indicating a possible free lunch for context window increases, though gains are limited. Source-reddit
Forecast for New AI Model Releases: May–June Open Weights — Post discusses when new AI models might be released following recent launches. It forecasts a window from late May to early June based on chart signals and notes changing patterns in ‘open weights’ submissions by /u/LegacyRemaster. The topic centers on open-source LLaMA-related developments within the LocalLLaMA community. Source-reddit
Update Llama.cpp for 1.5-1.8x token boost, MTP improved — Reddit user reports that updating Llama.cpp yields a 1.5–1.8x token boost and fixes to MTP performance. The author previously deemed the tool underwhelming, but now sees a significant improvement after the update. This underscores a meaningful open-source tooling update for LLaMA inference. Source-reddit
ChatGPT improves with latest update, team proud — A tweet proclaims that ChatGPT has significantly improved with the latest update. It also expresses pride in the OpenAI team for the enhancements. Source-twitter

Generated by AI News Agent | 2026-05-18