AI Daily — 2026-04-10


Claude for Word Beta Lets AI Draft and Edit in Word Sidebar · MMX-CLI Expands Agents with Seven N...

Covering 31 AI news items

🔥 Top Stories

1. Claude for Word Beta Lets AI Draft and Edit in Word Sidebar

Anthropic’s Claude for Word has entered beta, letting users draft, edit, and revise documents directly from Word’s sidebar. The tool preserves formatting, shows edits as tracked changes, and is available on Team and Enterprise plans. Source-twitter

2. MMX-CLI Expands Agents with Seven New Senses

MMX has announced MMX-CLI, which it bills as the first infrastructure built for AI Agents rather than humans. It adds seven modalities (image, video, voice, music, vision, search, and conversation) via MiniMax’s full-modal stack, enabling Agents to read, think, and write with new capabilities. The tool runs with a single command (mmx) and Agent-native I/O, requires no glue code, is compatible with the existing Token Plan, and includes a two-line setup to give Agents a voice; details are in the GitHub repository. Source-twitter

3. Agent-as-a-Judge Debuts with DevAI Benchmark for AI Agents

Researchers unveil Agent-as-a-Judge, a proof-of-concept framework that evaluates AI agents using human-like step-by-step processes and promises significant cost reductions. The accompanying DevAI benchmark includes 55 automated AI-development tasks and 365 requirements, designed to mirror human evaluations more closely. Early results suggest Agent-as-a-Judge outperforms LLM-as-a-Judge and aligns with human judgments, signaling a meaningful advance in AI evaluation. Source-twitter

📰 Featured

LLM

Mythos Zero-Day Finds Reproduced by GPT5.4 and Opus — A post claims Mythos findings were replicated using GPT5.4 and Opus, with a writeup promised early next week. The authors say they autonomously found Linux kernel zero-days in the last three weeks, noting Mythos is strong at spotting potential code issues, though the ‘scary’ threshold was reached earlier. They frame this as hype for Anthropic’s IPO plans, while emphasizing it is not a new capability. Source-twitter
Kronos: Open-Source Foundation Model for Financial Markets — Kronos is an open-source decoder-only foundation model designed for the language of financial markets, focusing on K-line (OHLCV) sequences and trained on data from over 45 global exchanges. It introduces a two-stage framework with a specialized tokenizer that quantizes continuous OHLCV data into hierarchical discrete tokens to handle high-noise market signals. The project has released an arXiv preprint, fine-tuning scripts, and announced acceptance to AAAI 2026. Source-github
GLM 5.1 Tops Open-Model Code Arena Rankings — GLM 5.1 leads the open models in the Code Arena rankings, a benchmark of LLM coding performance, according to a Reddit post. The result signals strong code-generation capabilities among open models. Source-reddit
NousResearch Unveils Monitor Tool for Claude Background Scripts — NousResearch introduced the Monitor tool that lets Claude create background scripts that wake the agent when needed, eliminating the need for continuous polling. It aims to save tokens and enables actions like following logs and polling PRs via scripts, while allowing work on other tasks in the same session. The update highlights open-source speed and innovation compared to centralized competitors. Source-twitter
Anthropic Leads AI Race as OpenAI Lags Behind — A social post claims OpenAI is not shipping much, with Anthropic appearing to be the sole competitive player. The message paints Meta, Google, Grok, DeepSeek, and Apple as lagging or non-participatory in the AI race. Source-twitter
Rethinking Reasoning SFT Generalization: Optimization, Data, and Capability — A new analysis challenges the view that supervised finetuning memorizes while reinforcement learning generalizes, arguing that cross-domain generalization in reasoning SFT is conditional on optimization dynamics, training data, and base-model capability. The authors note that some failures are under-optimization artifacts, with cross-domain performance first declining before recovering. This reframing reshapes how we assess reasoning SFT and its real-world generalization potential. Source-huggingface
ClawBench Tests AI Agents on 153 Everyday Tasks — ClawBench introduces an evaluation framework to test AI agents on 153 simple tasks, spanning 144 live platforms across 15 categories like purchases, appointments, and job applications. It aims to measure whether AI agents can automate routine online activities beyond inbox management, offering a practical benchmark for real-world automation. The framework is published on HuggingFace. Source-huggingface
Too much detail hurts small models; role + constraints best — An experiment tested common prompting advice across eight models: six local models run via Ollama on an M2 (96 GB) and an RTX 5070 Ti, plus two frontier APIs (GPT-4.1-mini and Claude Haiku 4.5). It found that extra detail harms small models, with the sweet spot at ‘role + constraints’. Examples or edge cases can degrade outputs for models under 3B, while larger models were unaffected. The total API cost was $0.03. Source-reddit
TurboQuant + TriAttention Cut Llama.cpp KV Cache by ~6.8× — A Reddit post reports that combining TurboQuant KV cache compression with TriAttention pruning in llama.cpp on AMD/HIP yields about 6.8× total KV cache reduction (5.1× from TurboQuant, 1.33× from TriAttention). At 131K context, the f16 KV cache is 8.2 GiB and drops to roughly 1.2 GiB with the combination. TurboQuant scores 72.0% on GSM8K (1319 problems) and 28/28 on NIAH up to 64K, and tool calling is reported at 26/26. The NIAH result is TurboQuant-only: TriAttention, which is inspired by an NVIDIA/MIT paper, has not yet been validated for end-to-end retrieval, as the author cautions. Speed overhead is ~1–2%. Source-reddit
Stanford Unveils Meta-Harness: Self-Improving LLM Harness — Stanford researchers introduce Meta-Harness, an outer-loop system that searches over LLM harness code to auto-correct agentic mistakes and improve performance while using less context. It uses an agentic proposer to examine source code, scores, and execution traces of prior candidates to guide improvements. In online text classification, Meta-Harness outperforms a state-of-the-art context-management system by 7.7 points while using four times fewer context tokens. Source-reddit
GGUF Tool Suite Enables Custom High-Quality Quants — A new GGUF-Tool-Suite with documentation and a web UI helps users benchmark and generate GGUF quantized models for ik_llama.cpp and llama.cpp, via CLI or the web interface. The suite claims higher-quality GGUFs than other releases and is already adopted by several users; benchmarking for Kimi-K2.5 and GLM-5.1 is forthcoming. Source-reddit
Gemma 4 vs Qwen3.5: Benchmarking Quantized Local LLMs for Go Coding — Reddit user m3thos compares Gemma 4 and Qwen3.5 in a test of quantized local LLMs run on a low-spec Framework 13 laptop. The setup focuses on quantized models under 40B parameters, including MoE models, and notes that GPT-OSS-20B performs surprisingly well under these constraints. The post highlights the viability of open-source, quantized LLMs for coding tasks on modest hardware. Source-reddit
9B Qwen-based LoRA Enables Autonomous Data Analysis — An open-source effort shows a LoRA trained on a Qwen3.5-9B-based model to perform end-to-end data analysis. The approach uses multi-step trace datasets to enable planning, coding, debugging, visualization, and summarization in a loop until completion. The author claims the LoRA completes 89% of workflows without human intervention, contrasting with a 100% failure rate for the base model. Source-reddit
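
The KV cache figures quoted in the TurboQuant + TriAttention item above can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes the two reductions multiply, and the constant and function names are illustrative rather than taken from the post or from llama.cpp.

```python
# Figures as quoted in the digest item; nothing here is measured.
TURBOQUANT_RATIO = 5.1     # reported KV cache compression factor
TRIATTENTION_RATIO = 1.33  # reported additional reduction from pruning
F16_KV_GIB = 8.2           # reported f16 KV cache size at 131K context


def combined_ratio(*ratios):
    """Multiply independent reduction factors (assumes they compose)."""
    total = 1.0
    for ratio in ratios:
        total *= ratio
    return total


def compressed_size_gib(baseline_gib, ratio):
    """Size after shrinking the baseline by the given reduction factor."""
    return baseline_gib / ratio


total = combined_ratio(TURBOQUANT_RATIO, TRIATTENTION_RATIO)
size = compressed_size_gib(F16_KV_GIB, total)

print(f"combined reduction: {total:.2f}x")        # about 6.8x, as reported
print(f"KV cache at 131K context: {size:.2f} GiB")  # about 1.2 GiB, as reported
```

Under that assumption, the product of the two factors and the resulting cache size both agree with the roughly 6.8× and 1.2 GiB quoted in the post.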

AI Research

NUS Unveils DMax: Aggressive Parallel Decoding for dLLMs — Researchers at the National University of Singapore introduce DMax, a new diffusion language model paradigm that frames decoding as progressive self-refinement to mitigate error accumulation in parallel decoding. The approach includes On-Policy Uniform Training to unify masked and uniform dLLMs and Soft Parallel Decoding to interpolate intermediate states, enabling faster decoding while preserving generation quality. Source-reddit

LLMs

OpenAI Voice Mode Uses Older, Weaker Model, Karpathy Says — Andrej Karpathy argues that OpenAI’s voice mode runs on an older, weaker model, which can mislead users into thinking the AI is smarter than it is. He notes that many people’s impressions come from free-tier or outdated versions, which don’t reflect the capabilities of this year’s state-of-the-art agentic models like Codex and Claude Code. Source-twitter

Embodied AI

HY-Embodied-0.5 Unveils Embodied Foundation Models for Real-World Agents — HY-Embodied-0.5 introduces a family of foundation models tailored for real-world embodied agents. The models aim to bridge general vision-language models with embodied demands by enhancing spatial and temporal visual perception and advanced embodied reasoning for prediction, interaction, and planning. The HY-Embodied-0.5 suite includes two primary variants. Source-huggingface

Open Source

OpenBMB VoxCPM2 Launches Tokenizer-Free Multilingual TTS — OpenBMB releases VoxCPM2, a 2B-parameter tokenizer-free TTS model trained on over 2 million hours of multilingual data. It supports 30 languages, end-to-end diffusion autoregressive synthesis, voice design, and controllable cloning with 48 kHz studio-quality audio, and is built on MiniCPM-4. Source-github
Archon: Open-Source Harness for AI Coding Workflows — Archon describes itself as the first open-source harness and workflow engine for AI coding agents. It lets developers define AI development processes as YAML workflows (planning, implementation, validation, code review, PR creation) and run them deterministically across projects, much as Dockerfiles do for infrastructure or GitHub Actions for CI/CD. The platform aims to tame AI variability by encoding processes and validation gates into a repeatable workflow. Source-github
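
The idea of encoding a process and its validation gates into a repeatable workflow, as described in the Archon item above, can be illustrated with a small sketch. This is not Archon's API or its YAML schema: the `Step` class, `run_workflow` function, and the toy steps are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    """One workflow stage (hypothetical structure, not Archon's)."""
    name: str
    run: Callable[[dict], dict]   # takes shared state, returns updated state
    gate: Callable[[dict], bool]  # validation gate; must pass to continue


def run_workflow(steps, state=None):
    """Run steps in a fixed order, stopping at the first failed gate."""
    state = dict(state or {})
    completed = []
    for step in steps:
        state = step.run(state)
        if not step.gate(state):
            return {"ok": False, "failed_at": step.name,
                    "completed": completed, "state": state}
        completed.append(step.name)
    return {"ok": True, "failed_at": None,
            "completed": completed, "state": state}


# Toy stand-ins for agent calls; a real harness would invoke a model here.
steps = [
    Step("planning",
         lambda s: {**s, "plan": ["add function"]},
         lambda s: bool(s.get("plan"))),
    Step("implementation",
         lambda s: {**s, "code": "def add(a, b): return a + b"},
         lambda s: "def " in s.get("code", "")),
    Step("validation",
         lambda s: {**s, "tests_passed": True},
         lambda s: s.get("tests_passed") is True),
]

result = run_workflow(steps)
print(result["ok"], result["completed"])
```

The step order and the gates are fixed in data, so any variability is confined to what each step produces. A failed gate halts the run and names the stage that failed, rather than letting a bad result flow downstream.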

⚡ Quick Bites

Startup Could Clone Big-Lab Concept; Roadmap Opens Opportunity — An AI-focused tweet praises a cool product concept from a leading AI lab and notes that OpenAI may not continue pushing that direction. The author suggests a startup could clone the idea and invest in care and iteration to make it work. The post argues that big labs’ clear, predictable roadmaps create significant openings for startups to pursue the concept. Source-twitter
Chutes Is Bittensor: Decentralized Team, Smart-Contract Staking — Chutes reiterates its identity as a Bittensor project, emphasizing a decentralized structure with no CEO. The funds are locked in a smart contract that pays staking rewards to team members, and they offer help to subnet teams in implementing similar arrangements. Source-twitter
SkillClaw Enables Collective Skill Evolution with Agentic Evolver — LLM agents rely on reusable skills that remain static after deployment, causing repeated discovery of workflows and failure modes. SkillClaw proposes a mechanism to evolve skills collectively by leveraging signals from diverse user interactions via an Agentic Evolver. This approach aims to turn heterogeneous experiences into shared skill improvements across users. Source-huggingface
NUMINA Aligns Numerals in Text-to-Video Diffusion — NUMINA is a training-free framework that improves numerical alignment in text-to-video diffusion models. It identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout, then conservatively refines this layout and modulates cross-attention to guide regeneration. Source-huggingface
State of LocalLLaMA: Current Status — This post gives a snapshot of the current state of the r/LocalLLaMA community, outlining recent developments and activity. It highlights ongoing work, potential challenges, and continued interest in running LLMs locally. Source-reddit
Real-Time Webcam Image Generation Feels Warmer Than Video Interpolation — A tweet describes a system that generates images directly from a webcam feed in real time, rather than performing video frame interpolation. The author finds the output warmer and more appealing, and notes support for HLS playback. Source-twitter
Hermes Agent Reaches 50k Stars on Open-Source Repo — Teknium announced that the Hermes Agent repository has surpassed 50,000 stars. The post expresses gratitude to everyone who helped build the project. This milestone highlights growing community interest in the Hermes Agent tool. Source-twitter
ArXiv Debuts ‘Neural Computers’ as 2604.06425 — An arXiv preprint titled ‘Neural Computers’ has been released with the identifier 2604.06425. The post points to the arXiv abstract and was shared on Twitter by SchmidhuberAI. No details about the paper’s methods or findings are included in the provided item. Source-twitter
What happened to DeepSeek? — DeepSeek seems to have vanished after a partial comeback that wasn’t fully open-source. A Reddit discussion asks what happened and whether a DeepSeek V4 will appear. There is no public update in the post. Source-reddit
Final voting results for Qwen 3.6 — A Reddit post reports that seven days have passed since voting on Qwen 3.6, with indications that the release will begin soon. The post links to a status by ChujieZheng on X, shared by user jacek2023. Source-reddit

Generated by AI News Agent | 2026-04-10