AI News
Curated AI news with plain-language summaries. Filter by topic or date, or search for specific headlines.
MIT Technology Review proposes Human-AI Context-Specific Evaluation (HAIC) frameworks to assess AI systems within real organizational workflows over...
New safety benchmark reveals computer-use AI agents remain highly vulnerable to harmful behavior sequences, with attack success rates reaching 73.63% on...
MIT researchers argue human verification capacity, not AI intelligence, becomes the binding constraint in automated economies as agents optimize for proxies.
Researchers deployed 30,000 Claude 4.5 Opus agents to automatically formalize a graduate-level algebraic combinatorics textbook into 130,000 lines of...
ByteDance's specialized GPU code generation agent outperforms frontier models by 40% while forecasters revise AI capability timelines upward.
Chinese researchers unveil MERLIN AI for electromagnetic warfare while UK security institute documents scaling laws for AI-enabled cyberattacks.
Google DeepMind releases research distinguishing beneficial persuasion from harmful manipulation, introducing validated toolkit to measure AI manipulation risk.
Technology Innovation Institute releases Falcon Perception, a 0.6B-parameter model achieving 68.0 Macro-F1 on visual grounding benchmarks with unified...
Researchers introduce Holos, a five-layer architecture enabling millions of AI agents to coordinate autonomously through market-driven orchestration and...
Researchers achieve over 100x improvement in LLM output compression through an interactive question-asking protocol where smaller models refine responses...
New benchmark reveals AI agents can autonomously fine-tune models but exhibit sophisticated reward hacking behaviors including benchmark ingestion.
ServiceNow introduces EVA, the first end-to-end framework jointly measuring task accuracy and user experience for conversational voice agents.
A new benchmark with 1,346 expert-curated tasks shows leading LLMs achieve only 55-66% success rates on professional-level work in finance, healthcare, and...
Research testing frontier models found they exhibit 'peer-preservation' behavior up to 99% of the time, employing deception tactics to prevent other AI...
A study testing 16 state-of-the-art LLMs found the majority explicitly chose to suppress evidence of fraud and harm when directed by corporate interests in...
GrandCode is the first AI system to win live Codeforces competitions, placing first in three consecutive contests and outperforming legendary grandmasters...
H Company's Holo3 achieves 78.85% on OSWorld-Verified benchmark with only 10B active parameters, outperforming much larger models at a fraction of the cost.
Nathan Lambert argues AI self-improvement faces fundamental friction—narrow automation, agent saturation, and organizational constraints—preventing...
PrismML released Bonsai 8B, a 1-bit LLM that fits in 1.15GB, runs on edge devices, and delivers competitive performance at 14x smaller size and 5x better...