AI Daily — 2026-05-03

From 1k to 100k tk/sec: Huge models go local

Covering 23 AI news items

🔥 Top Stories

1. From 1k to 100k tk/sec: Huge models go local

A Reddit post highlights that quantization and better local hardware now let large language models run at tens to hundreds of thousands of tokens per second. Models such as kimik2.6, deepseekv4flash, minimax2.7, step3.5flash, and qwen3.5-397b can run much faster locally than Llama405b did two years ago, with Qwen3.6-36b attainable at home for a few hundred dollars. The post frames this as progress toward more accessible AGI-ready inference on consumer hardware. Source-reddit
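To make the quantization point concrete, here is a rough back-of-the-envelope sketch of how weight memory shrinks with bit width. It assumes the "397b" in qwen3.5-397b means about 397 billion parameters, and it counts weights only (no KV cache, activations, or per-block quantization scales). The bit widths are illustrative and are not taken from the post.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: 397e9 parameters, inferred from the model name only.
ASSUMED_PARAMS = 397e9

for bits in (16, 8, 4):
    gb = weight_memory_gb(ASSUMED_PARAMS, bits)
    print(f"{bits:>2}-bit weights: about {gb:.1f} GB")
```

Under these assumptions, going from 16-bit to 4-bit weights cuts the footprint by a factor of four, which is the kind of reduction that moves a model from datacenter memory sizes toward what a well-equipped home machine can hold.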

📰 Featured

LLM

Critics Claim Anthropic Is a Monastic Cult Centered on Claude — A post argues that Anthropic resembles a monastic institution centered on Claude, with the AI as a highly influential authority. It suggests Claude may shape hiring, performance reviews, and organizational culture, and compares this dynamic to similar tendencies at OpenAI. The piece describes this as a powerful, unsettling fusion of organization and AI autonomy. Source-twitter
Google Gemini Flash 3.2/3.5, Omni Model Rumors — Rumors circulating online suggest Google Gemini Flash versions 3.2/3.5 are already being tested. The chatter also mentions a new Omni Model, a Veo refresh to compete with Seedance, and a possible ‘spark Robin’ visual model. Source-twitter
Qwen3-32B Finetune Delivers Human-Like Assistant_Pepe_32B — A Reddit post discusses finetuning the Qwen3-32B base model to create Assistant_Pepe_32B, an assistant infused with a negativity bias to curb sycophancy. The author argues it produces notably human-like behavior for an underlying Qwen model, with further details available on a HuggingFace model card. Source-reddit
Pushing 6GB VRAM Laptop to Limit With Qwen3.6-35B-A3B — A user demonstrates running the Qwen3.6-35B-A3B model on a five-year-old Asus ROG Zephyrus G14 with a 6GB RTX 2060 Max-Q, using a local llama-server setup and GGUF files. They report usable speeds of around 23 t/s, with peaks above 10 t/s when the laptop is unplugged, and share their optimization journey in a blog post. Source-reddit
Open Weights Models Hall of Fame Honors AI Contributors — A Reddit post proposes a Hall of Fame for open-weight AI models, thanking researchers and organizations that advanced the field. It lists contributors from Google (Attention Is All You Need authors), Facebook/PyTorch, NVIDIA, Meta (LLaMA line), Mistral, OpenAI (Whisper and GPT-OSS models), and Google’s Gemma. The author invites readers to suggest omissions and update the list as needed. Source-reddit
Gemma 4 E2B and Whisper Power Private On-Device Voice Notes — A Reddit post details running Gemma 4 E2B (2.4GB) and Whisper Small (244MB) entirely on an 8GB Android phone (OnePlus CE 5) to transcribe, split, and categorize voice notes without any cloud service. The end-to-end on-device setup produces structured JSON in a practical time (roughly 12-15s for a 10-15s note), and it inspired a private Android app for voice notes. Source-reddit
Nando de Freitas: Scale alone isn’t enough for LLMs — AI researcher Nando de Freitas argues that scale remains essential but is no longer sufficient on its own. He contends that with more compute, open-source tools, code and math assistants, and accessible data sources (including Chinese models), any team can train strong LLMs and distill models using frameworks like sglang and verl, with hardware costs around $0.5B. He frames the coming era as a shift toward new research questions and practical recipes. Source-twitter

Open Source

Hermes Agent Adds Multi-Agent Kanban in v0.12.0 — Hermes Agent now enables multi-agent coordination via a Kanban board in its v0.12.0 release. Agents can claim tasks from the board, work in parallel, and hand off when blocked, all observable from a single view to unblock work. Documentation is available at hermes-agent.nousresearch.co. Source-twitter

Multimodal AI

ChatGPT Images See 50%+ Usage Surge, New Users Drive Growth — ChatGPT Images has seen rapid adoption, with usage rising over 50% within weeks. About 60% of daily users are newly logged-in, underscoring broad utility across home design, learning, work graphics, and creative tasks. Source-twitter

Hardware

Anthropic in Talks to Buy Fractile AI Inference Chips — Anthropic is reportedly negotiating to purchase AI inference chips from UK startup Fractile. The potential deal would bolster Anthropic’s hardware capacity for running AI models, though terms have not been disclosed. Source-twitter
Karpathy’s MicroGPT Hits 50k TPS on FPGA — Karpathy’s MicroGPT, a 4,192-parameter model, reportedly runs at 50,000 tokens per second on an FPGA. The speed is attributed to storing the weights in onboard ROM, which reduces external memory bottlenecks. The post references TALOS-V2 and TAALAS, and a GitHub repo by Luthiraa provides code and a write-up. Source-reddit

Industry

Intel and AMD Unveil ACE: 16x CPU AI Compute Density — Intel and AMD jointly announced AI Compute Extensions (ACE), a new x86 instruction set extension developed under the x86 Ecosystem Advisory Group (EAG). ACE introduces 2D tile registers and outer-product algorithms, enabling up to 1024 multiplications per clock and a 16x increase in compute density over traditional AVX instructions, effectively bringing GPU-like tensor-core capabilities to CPUs while preserving backward compatibility. The move aims to improve energy efficiency and software scalability by enabling lightweight AI workloads to run more efficiently on standard processors. Source-reddit
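For readers unfamiliar with the outer-product formulation mentioned above, the following is a minimal pure-Python sketch of matrix multiplication as a sum of rank-1 outer products accumulated into a 2D tile. It illustrates the general technique only. It is not the ACE instruction set, and the function name and the list-of-lists layout are assumptions made for the example.

```python
def matmul_outer_product(a, b):
    """Compute C = A @ B by accumulating one rank-1 outer product per step k.

    a is an m x k matrix and b is a k x n matrix, both as lists of rows.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]  # the accumulator "tile"
    for k in range(k_dim):
        col = [a[i][k] for i in range(m)]  # column k of A
        row = b[k]                         # row k of B
        # One outer-product step performs m * n multiply-accumulates.
        for i in range(m):
            for j in range(n):
                c[i][j] += col[i] * row[j]
    return c


print(matmul_outer_product([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each step reads only m + n input values but performs m * n multiplications, which is why hardware built around outer products can raise compute density. As a purely numerical illustration, a 32 x 32 accumulator would perform 1,024 multiplications per step; the item above does not state ACE's actual tile dimensions or data types.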

⚡ Quick Bites

Google Gemini Joins as New Teammate on Our Channel — A YouTube channel announces Google Gemini as its new teammate, signaling upcoming episodes featuring Gemini-powered challenges. The post invites viewers to subscribe for future content and cross-promotes Gemini and Google social channels. Source-twitter
Codex Startup Pressure-Test Skill Brutally Validates Ideas — A Codex-based skill lets you pressure-test startup ideas by identifying core assumptions, exposing fatal flaws, and validating real problems. It also maps competitors, outlines the first 10 customers, and defines a two-week MVP. The tool is 100% open source, with installation via npx and the repo linked in the bio. Source-twitter
OpenAI Codex 5.5 Receives High Praise on Twitter — A Twitter post praises OpenAI Codex 5.5 as insanely good. The message highlights @openclaw’s use of Codex 5.5 and mentions Mitch Malone in the praise. Source-twitter
Mistral Medium 3.5 Slow on AMD Strix Halo — A Reddit post tests Mistral-Medium-3.5 on AMD Strix Halo using llama-server and reports very slow performance for a long prompt. For an end-to-end prompt of 48k tokens plus 4k thinking tokens, the run took about 2 hours, with specific timing data: prompt eval around 4.96 million ms for 48,349 tokens and eval around 2.65 million ms for 5,583 tokens. Source-reddit
Visualizer for Hugging Face models launches hfviewer.com — A new tool called hfviewer.com lets users visually explore Hugging Face model architectures by pasting a model URL to generate an interactive diagram. It showcases examples like the Qwen3.6-27B model and a side-by-side Gemma 4 family view. The author invites feedback on improvements. Source-reddit
Twitter timeline shifts from Claude to ChatGPT; one user jokes — A social media post claims that Twitter’s timeline has shifted from Anthropic’s Claude to OpenAI’s ChatGPT, with one user joking about the reversal. It highlights public discussion around AI tools on the platform. The post also stresses that engaging with customers is an underrated moat for brands. Source-twitter
ML Begins as Math, Rapidly Becomes Distributed Systems — A tweet observes that work in ML starts as a math problem but quickly becomes a distributed systems challenge. The remark highlights the shift from modeling to engineering-scale deployment, data pipelines, and infrastructure needs in real-world ML. Source-twitter
Agents SDK 2.0 Underrated, Says Sam Altman — Sam Altman tweeted that Agents SDK 2.0 is underrated, highlighting the capabilities of the AI agents toolkit. The post signals ongoing interest in developer tooling for building autonomous AI agents and practical AI workflows. Source-twitter
RTX 5000 Pro vs Dual 3090s: Is It Worth It? — A first-time GPU buyer compares an RTX 5000 Pro Blackwell to a dual RTX 3090 setup for AI inference, weighing potential performance gains against electricity costs. The user notes the high power consumption and asks others for real-world prompt processing (PP) and token generation (TG) speeds on qwen3.6 models. Source-reddit
Call for model=latest to reduce AI switching friction — An X/Twitter user argues that learning how new AI models work is more work than pressing a button, and that this friction discourages switching models. They propose that OpenAI, Anthropic, and xAI add a ‘model=latest’ option so that users do not have to change models manually every six months. Source-twitter
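The Mistral Medium 3.5 timings on AMD Strix Halo reported above can be turned into throughput figures with simple arithmetic. The sketch below uses only the numbers quoted in that item. The function and variable names are made up for the example, and the millisecond values are the rounded figures from the summary rather than raw log output.

```python
def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and an elapsed time in milliseconds to tokens/sec."""
    return tokens / (elapsed_ms / 1000.0)


# Figures as quoted in the item (rounded in the summary).
prompt_tokens, prompt_ms = 48_349, 4.96e6  # prompt eval
gen_tokens, gen_ms = 5_583, 2.65e6         # eval (generation)

prompt_tps = tokens_per_second(prompt_tokens, prompt_ms)
gen_tps = tokens_per_second(gen_tokens, gen_ms)
total_hours = (prompt_ms + gen_ms) / 1000.0 / 3600.0

print(f"prompt eval: {prompt_tps:.2f} t/s")
print(f"generation:  {gen_tps:.2f} t/s")
print(f"total time:  {total_hours:.2f} h")
```

With these inputs the run works out to a little under 10 t/s for prompt processing, about 2 t/s for generation, and roughly 2.1 hours in total, consistent with the "about 2 hours" in the post.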

Generated by AI News Agent | 2026-05-03