Publication date: 2026-03-22
Entries included: 20
1. Safely Deploying ML Models to Production: Four Controlled Strategies (A/B, Canary, Interleaved, Shadow Testing)
- Source: MarkTechPost
- Published: 2026-03-21 23:02 UTC
- Link: https://www.marktechpost.com/2026/03/21/safely-deploying-ml-models-to-production-four-controlled-strategies-a-b-canary-interleaved-shadow-testing/
Summary: Deploying a new machine learning model to production is one of the most critical stages of the ML lifecycle. Even if a model performs well on validation and test datasets, directly replacing the existing production model
2. A Coding Implementation to Build an Uncertainty-Aware LLM System with Confidence Estimation, Self-Evaluation, and Automatic Web Research
- Source: MarkTechPost
- Published: 2026-03-21 21:39 UTC
- Link: https://www.marktechpost.com/2026/03/21/a-coding-implementation-to-build-an-uncertainty-aware-llm-system-with-confidence-estimation-self-evaluation-and-automatic-web-research/
Summary: In this tutorial, we build an uncertainty-aware large language model system that not only generates answers but also estimates the confidence in those answers. We implement a three-stage reasoning pipeline in which the m
3. The gen AI Kool-Aid tastes like eugenics
- Source: The Verge AI
- Published: 2026-03-21 14:00 UTC
- Link: https://www.theverge.com/entertainment/897923/ghost-in-the-machine-valerie-veatch-interview
Summary: Like many people, director Valerie Veatch was intrigued when OpenAI first released its Sora text-to-video generative AI model to the public in 2024. Though she didn't fully understand the technology, she was curious abou
4. Gemini task automation is slow, clunky, and super impressive
- Source: The Verge AI
- Published: 2026-03-21 11:30 UTC
- Link: https://www.theverge.com/tech/898282/gemini-task-automation-uber-doordash-hands-on
Summary: I've been testing out Gemini's new task automation on the Pixel 10 Pro and the Galaxy S26 Ultra, which for the first time lets Gemini take the wheel and use apps for you. It's limited to a small subset right now - a hand
5. DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18048
Summary: Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely p
6. Continually self-improving AI
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18073
Summary: Modern language model-based AI systems are remarkably powerful, yet their capabilities remain fundamentally capped by their human creators in three key ways. First, althoug
7. Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18085
Summary: Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As L
8. Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18104
Summary: Prevailing AI training infrastructure assumes reverse-mode automatic differentiation over IEEE-754 arithmetic. The memory overhead of training relative to inference, optimi
9. Don't Vibe Code, Do Skele-Code: Interactive No-Code Notebooks for Subject Matter Experts to Build Lower-Cost Agentic Workflows
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18122
Summary: Skele-Code is a natural-language and graph-based interface for building workflows with AI agents, designed especially for less or non-technical users. It supports increment
10. Efficient Dense Crowd Trajectory Prediction Via Dynamic Clustering
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18166
Summary: Crowd trajectory prediction plays a crucial role in public safety and management, where it can help prevent disasters such as stampedes. Recent works address the problem by
11. TeachingCoach: A Fine-Tuned Scaffolding Chatbot for Instructional Guidance to Instructors
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18189
Summary: Higher education instructors often lack timely and pedagogically grounded support, as scalable instructional guidance remains limited and existing tools rely on generic cha
12. Access Controlled Website Interaction for Agentic AI with Delegated Critical Tasks
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18197
Summary: Recent studies reveal gaps in delegating critical tasks to agentic AI that accesses websites on the user's behalf, primarily due to limited access control mechanisms on web
13. A Computationally Efficient Learning of Artificial Intelligence System Reliability Considering Error Propagation
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18201
Summary: Artificial Intelligence (AI) systems are increasingly prominent in emerging smart cities, yet their reliability remains a critical concern. These systems typically operate
14. Retrieval-Augmented LLM Agents: Learning to Learn from Experience
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18272
Summary: While large language models (LLMs) have advanced the development of general-purpose agents, achieving robust generalization to unseen tasks remains a significant challenge.
15. EDM-ARS: A Domain-Specific Multi-Agent System for Automated Educational Data Mining Research
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18273
Summary: In this technical report, we present the Educational Data Mining Automated Research System (EDM-ARS), a domain-specific multi-agent pipeline that automates end-to-end educa
16. CORE: Robust Out-of-Distribution Detection via Confidence and Orthogonal Residual Scoring
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18290
Summary: Out-of-distribution (OOD) detection is essential for deploying deep learning models reliably, yet no single method performs consistently across architectures and datasets -
17. The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18294
Summary: Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs
18. Consumer-to-Clinical Language Shifts in Ambient AI Draft Notes and Clinician-Finalized Documentation: A Multi-level Analysis
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18327
Summary: Ambient AI generates draft clinical notes from patient-clinician conversations, often using lay or consumer-oriented phrasing to support patient understanding instead of st
19. FaithSteer-BENCH: A Deployment-Aligned Stress-Testing Benchmark for Inference-Time Steering
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18329
Summary: Inference-time steering is widely regarded as a lightweight and parameter-free mechanism for controlling large language model (LLM) behavior, and prior work has often sugge
20. MemArchitect: A Policy Driven Memory Governance Layer
- Source: arXiv cs.AI
- Published: 2026-03-21 04:00 UTC
- Link: https://arxiv.org/abs/2603.18330
Summary: Persistent Large Language Model (LLM) agents expose a critical governance gap in memory management. Standard Retrieval-Augmented Generation (RAG) frameworks treat memory as