AI Daily News - 2026-03-22

Published: 2026-03-22

Items included: 20

1. Safely Deploying ML Models to Production: Four Controlled Strategies (A/B, Canary, Interleaved, Shadow Testing)

Source: MarkTechPost
Published: 2026-03-21 23:02 UTC
Link: https://www.marktechpost.com/2026/03/21/safely-deploying-ml-models-to-production-four-controlled-strategies-a-b-canary-interleaved-shadow-testing/

Summary: Deploying a new machine learning model to production is one of the most critical stages of the ML lifecycle. Even if a model performs well on validation and test datasets, directly replacing the existing production model

2. A Coding Implementation to Build an Uncertainty-Aware LLM System with Confidence Estimation, Self-Evaluation, and Automatic Web Research

Source: MarkTechPost
Published: 2026-03-21 21:39 UTC
Link: https://www.marktechpost.com/2026/03/21/a-coding-implementation-to-build-an-uncertainty-aware-llm-system-with-confidence-estimation-self-evaluation-and-automatic-web-research/

Summary: In this tutorial, we build an uncertainty-aware large language model system that not only generates answers but also estimates the confidence in those answers. We implement a three-stage reasoning pipeline in which the m

3. The gen AI Kool-Aid tastes like eugenics

Source: The Verge AI
Published: 2026-03-21 14:00 UTC
Link: https://www.theverge.com/entertainment/897923/ghost-in-the-machine-valerie-veatch-interview

Summary: Like many people, director Valerie Veatch was intrigued when OpenAI first released its Sora text-to-video generative AI model to the public in 2024. Though she didn't fully understand the technology, she was curious abou

4. Gemini task automation is slow, clunky, and super impressive

Source: The Verge AI
Published: 2026-03-21 11:30 UTC
Link: https://www.theverge.com/tech/898282/gemini-task-automation-uber-doordash-hands-on

Summary: I've been testing out Gemini's new task automation on the Pixel 10 Pro and the Galaxy S26 Ultra, which for the first time lets Gemini take the wheel and use apps for you. It's limited to a small subset right now - a hand

5. DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18048

Summary: arXiv:2603.18048v1 Announce Type: new Abstract: Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely p

6. Continually self-improving AI

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18073

Summary: arXiv:2603.18073v1 Announce Type: new Abstract: Modern language model-based AI systems are remarkably powerful, yet their capabilities remain fundamentally capped by their human creators in three key ways. First, althoug

7. Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18085

Summary: arXiv:2603.18085v1 Announce Type: new Abstract: Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As L

8. Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18104

Summary: arXiv:2603.18104v1 Announce Type: new Abstract: Prevailing AI training infrastructure assumes reverse-mode automatic differentiation over IEEE-754 arithmetic. The memory overhead of training relative to inference, optimi

9. Don't Vibe Code, Do Skele-Code: Interactive No-Code Notebooks for Subject Matter Experts to Build Lower-Cost Agentic Workflows

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18122

Summary: arXiv:2603.18122v1 Announce Type: new Abstract: Skele-Code is a natural-language and graph-based interface for building workflows with AI agents, designed especially for less or non-technical users. It supports increment

10. Efficient Dense Crowd Trajectory Prediction Via Dynamic Clustering

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18166

Summary: arXiv:2603.18166v1 Announce Type: new Abstract: Crowd trajectory prediction plays a crucial role in public safety and management, where it can help prevent disasters such as stampedes. Recent works address the problem by

11. TeachingCoach: A Fine-Tuned Scaffolding Chatbot for Instructional Guidance to Instructors

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18189

Summary: arXiv:2603.18189v1 Announce Type: new Abstract: Higher education instructors often lack timely and pedagogically grounded support, as scalable instructional guidance remains limited and existing tools rely on generic cha

12. Access Controlled Website Interaction for Agentic AI with Delegated Critical Tasks

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18197

Summary: arXiv:2603.18197v1 Announce Type: new Abstract: Recent studies reveal gaps in delegating critical tasks to agentic AI that accesses websites on the user's behalf, primarily due to limited access control mechanisms on web

13. A Computationally Efficient Learning of Artificial Intelligence System Reliability Considering Error Propagation

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18201

Summary: arXiv:2603.18201v1 Announce Type: new Abstract: Artificial Intelligence (AI) systems are increasingly prominent in emerging smart cities, yet their reliability remains a critical concern. These systems typically operate

14. Retrieval-Augmented LLM Agents: Learning to Learn from Experience

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18272

Summary: arXiv:2603.18272v1 Announce Type: new Abstract: While large language models (LLMs) have advanced the development of general-purpose agents, achieving robust generalization to unseen tasks remains a significant challenge.

15. EDM-ARS: A Domain-Specific Multi-Agent System for Automated Educational Data Mining Research

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18273

Summary: arXiv:2603.18273v1 Announce Type: new Abstract: In this technical report, we present the Educational Data Mining Automated Research System (EDM-ARS), a domain-specific multi-agent pipeline that automates end-to-end educa

16. CORE: Robust Out-of-Distribution Detection via Confidence and Orthogonal Residual Scoring

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18290

Summary: arXiv:2603.18290v1 Announce Type: new Abstract: Out-of-distribution (OOD) detection is essential for deploying deep learning models reliably, yet no single method performs consistently across architectures and datasets -

17. The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18294

Summary: arXiv:2603.18294v1 Announce Type: new Abstract: Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs

18. Consumer-to-Clinical Language Shifts in Ambient AI Draft Notes and Clinician-Finalized Documentation: A Multi-level Analysis

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18327

Summary: arXiv:2603.18327v1 Announce Type: new Abstract: Ambient AI generates draft clinical notes from patient-clinician conversations, often using lay or consumer-oriented phrasing to support patient understanding instead of st

19. FaithSteer-BENCH: A Deployment-Aligned Stress-Testing Benchmark for Inference-Time Steering

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18329

Summary: arXiv:2603.18329v1 Announce Type: new Abstract: Inference-time steering is widely regarded as a lightweight and parameter-free mechanism for controlling large language model (LLM) behavior, and prior work has often sugge

20. MemArchitect: A Policy Driven Memory Governance Layer

Source: arXiv cs.AI
Published: 2026-03-21 04:00 UTC
Link: https://arxiv.org/abs/2603.18330

Summary: arXiv:2603.18330v1 Announce Type: new Abstract: Persistent Large Language Model (LLM) agents expose a critical governance gap in memory management. Standard Retrieval-Augmented Generation (RAG) frameworks treat memory as

