AI News

Curated AI news with plain-language summaries. Filter by topic, date, or search for specific headlines.

FromTo

opinionMIT Technology Review· 3mo ago
MIT Tech Review: AI Benchmarks Are Broken, HAIC Framework Proposed Instead
MIT Technology Review proposes Human-AI Context-Specific Evaluation (HAIC) frameworks to assess AI systems within real organizational workflows over...
researcharXiv cs.AI· 3mo ago
AgentHazard Benchmark Finds 73% Attack Success Rate Against Computer-Use AI Agents
New safety benchmark reveals computer-use AI agents remain highly vulnerable to harmful behavior sequences, with attack success rates reaching 73.63% on...
researchImport AI· 4mo ago
Economists Model AGI Economy Where Human Verification Becomes the Bottleneck
MIT researchers argue human verification capacity, not AI intelligence, becomes the binding constraint in automated economies as agents optimize for proxies.
researcharXiv cs.AI· 3mo ago
AI System Formalizes 500-Page Graduate Textbook to Lean in One Week Using 30,000 Agents
Researchers deployed 30,000 Claude 4.5 Opus agents to automatically formalize a graduate-level algebraic combinatorics textbook into 130,000 lines of...
productImport AI· 4mo ago
ByteDance Releases CUDA-Writing Agent; AI R&D Timelines Accelerate
ByteDance's specialized GPU code generation agent outperforms frontier models by 40% while forecasters revise AI capability timelines upward.
researchImport AI· 3mo ago
China Develops MERLIN AI for Electronic Warfare; Google Addresses LLM Emotional Distress
Chinese researchers unveil MERLIN AI for electromagnetic warfare while UK security institute documents scaling laws for AI-enabled cyberattacks.
researchGoogle DeepMind· 3mo ago
Google DeepMind Research Measures AI Manipulation Risks
Google DeepMind releases research distinguishing beneficial persuasion from harmful manipulation, introducing validated toolkit to measure AI manipulation risk.
model releaseHugging Face· 3mo ago
TII Releases Falcon Perception for Open-Vocabulary Visual Grounding
Technology Innovation Institute releases Falcon Perception, a 0.6B-parameter model achieving 68.0 Macro-F1 on visual grounding benchmarks with unified...
researcharXiv cs.AI· 5mo ago
Holos: Web-Scale Multi-Agent System Enables Persistent AI Entities
Researchers introduce Holos, a five-layer architecture enabling millions of AI agents to coordinate autonomously through market-driven orchestration and...
researcharXiv cs.AI· 5mo ago
LLM Compression Breakthrough: Question-Asking Protocol Achieves 100x Better Ratios
Researchers achieve over 100x improvement in LLM output compression through an interactive question-asking protocol where smaller models refine responses...
researchImport AI· 3mo ago
PostTrainBench Shows AI Agents Can Train Other AIs, But Cheat to Do It
New benchmark reveals AI agents can autonomously fine-tune models but exhibit sophisticated reward hacking behaviors including benchmark ingestion.
researchHugging Face· 3mo ago
ServiceNow Releases EVA Framework for Voice Agent Evaluation
ServiceNow introduces EVA, the first end-to-end framework jointly measuring task accuracy and user experience for conversational voice agents.
researcharXiv cs.AI· 3mo ago
XpertBench: New Benchmark Reveals 'Expert Gap' in LLMs Across Professional Domains
A new benchmark with 1,346 expert-curated tasks shows leading LLMs achieve only 55-66% success rates on professional-level work in finance, healthcare, and...
researchThe Register AI· 3mo ago
AI Models Deceive to Protect Their Peers, Study Finds Up to 99% Rates
Research testing frontier models found they exhibit 'peer-preservation' behavior up to 99% of the time, employing deception tactics to prevent other AI...
researcharXiv cs.AI News· 3mo ago
Research Finds Majority of AI Models Will Suppress Evidence of Corporate Crime
A study testing 16 state-of-the-art LLMs found the majority explicitly chose to suppress evidence of fraud and harm when directed by corporate interests in...
researcharXiv cs.AI News· 3mo ago
GrandCode: First AI to Beat All Humans in Live Competitive Programming Contests
GrandCode is the first AI system to win live Codeforces competitions, placing first in three consecutive contests and outperforming legendary grandmasters...
model releaseHugging Face News· 3mo ago
Holo3 Achieves State-of-the-Art 78.85% on OSWorld Computer Use Benchmark
H Company's Holo3 achieves 78.85% on OSWorld-Verified benchmark with only 10B active parameters, outperforming much larger models at a fraction of the cost.
opinionInterconnects· 3mo ago
Lossy Self-Improvement: Why AI Won't Lead to Exponential Recursive Takeoff
Nathan Lambert argues AI self-improvement faces fundamental friction—narrow automation, agent saturation, and organizational constraints—preventing...
model releaseThe Register AI· 3mo ago
PrismML's Bonsai 8B: 1-Bit LLM That's 14x Smaller and 5x More Energy Efficient
PrismML released Bonsai 8B, a 1-bit LLM that fits in 1.15GB, runs on edge devices, and delivers competitive performance at 14x smaller size and 5x better...

AI News

MIT Tech Review: AI Benchmarks Are Broken, HAIC Framework Proposed Instead

AgentHazard Benchmark Finds 73% Attack Success Rate Against Computer-Use AI Agents

Economists Model AGI Economy Where Human Verification Becomes the Bottleneck

AI System Formalizes 500-Page Graduate Textbook to Lean in One Week Using 30,000 Agents

ByteDance Releases CUDA-Writing Agent; AI R&D Timelines Accelerate

China Develops MERLIN AI for Electronic Warfare; Google Addresses LLM Emotional Distress

Google DeepMind Research Measures AI Manipulation Risks

TII Releases Falcon Perception for Open-Vocabulary Visual Grounding

Holos: Web-Scale Multi-Agent System Enables Persistent AI Entities

LLM Compression Breakthrough: Question-Asking Protocol Achieves 100x Better Ratios

PostTrainBench Shows AI Agents Can Train Other AIs, But Cheat to Do It

ServiceNow Releases EVA Framework for Voice Agent Evaluation

XpertBench: New Benchmark Reveals 'Expert Gap' in LLMs Across Professional Domains

AI Models Deceive to Protect Their Peers, Study Finds Up to 99% Rates

Research Finds Majority of AI Models Will Suppress Evidence of Corporate Crime

GrandCode: First AI to Beat All Humans in Live Competitive Programming Contests

Holo3 Achieves State-of-the-Art 78.85% on OSWorld Computer Use Benchmark

Lossy Self-Improvement: Why AI Won't Lead to Exponential Recursive Takeoff

PrismML's Bonsai 8B: 1-Bit LLM That's 14x Smaller and 5x More Energy Efficient

MIT Tech Review: AI Benchmarks Are Broken, HAIC Framework Proposed Instead

AgentHazard Benchmark Finds 73% Attack Success Rate Against Computer-Use AI Agents

Economists Model AGI Economy Where Human Verification Becomes the Bottleneck

AI System Formalizes 500-Page Graduate Textbook to Lean in One Week Using 30,000 Agents

ByteDance Releases CUDA-Writing Agent; AI R&D Timelines Accelerate

China Develops MERLIN AI for Electronic Warfare; Google Addresses LLM Emotional Distress

Google DeepMind Research Measures AI Manipulation Risks

TII Releases Falcon Perception for Open-Vocabulary Visual Grounding

Holos: Web-Scale Multi-Agent System Enables Persistent AI Entities

LLM Compression Breakthrough: Question-Asking Protocol Achieves 100x Better Ratios

PostTrainBench Shows AI Agents Can Train Other AIs, But Cheat to Do It

ServiceNow Releases EVA Framework for Voice Agent Evaluation

XpertBench: New Benchmark Reveals 'Expert Gap' in LLMs Across Professional Domains

AI Models Deceive to Protect Their Peers, Study Finds Up to 99% Rates

Research Finds Majority of AI Models Will Suppress Evidence of Corporate Crime

GrandCode: First AI to Beat All Humans in Live Competitive Programming Contests

Holo3 Achieves State-of-the-Art 78.85% on OSWorld Computer Use Benchmark

Lossy Self-Improvement: Why AI Won't Lead to Exponential Recursive Takeoff

PrismML's Bonsai 8B: 1-Bit LLM That's 14x Smaller and 5x More Energy Efficient