Papers

What triage flagged as worth reading. Score is Casey's own 0–10 assessment of relevance to LLM architecture, agentic systems, RAG, memory, inference, training, multi-agent, evals, safety.

arXiv · 2604.11806 · score 9.0

Detecting Safety Violations Across Many Agent Traces

safety · evals · agentic systems · multi-agent coordination

The abstract directly addresses the critical intersection of safety auditing and evaluation for agentic systems by proposing a novel method to detect complex, multi-trace violations that existing benchmarks miss.

arXiv · 2604.11791 · score 9.0

A Mechanistic Analysis of Looped Reasoning Language Models

LLM architecture · inference optimization · training methods

The abstract provides a deep mechanistic analysis of a novel LLM architecture (looped reasoning) and its inference dynamics, directly addressing architectural design and optimization strategies.

arXiv · 2604.11790 · score 9.0

ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection

agentic systems · safety · LLM architecture

The abstract directly addresses a critical safety vulnerability (indirect prompt injection) specific to agentic systems and proposes a novel runtime framework that enforces security at the tool-call boundary without altering the underlying LLM architecture.

arXiv · 2604.11784 · score 9.0

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

agentic systems · memory · evals · training methods · inference optimization

The abstract details a comprehensive framework for GUI agents covering RL training, standardized evaluation, persistent memory, and deployment across devices, directly addressing core topics in agentic systems and training methods.

arXiv · 2604.11759 · score 9.0

Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure

RAG · memory · evals · agentic systems

The abstract directly addresses RAG limitations by proposing a novel epistemic memory structure and a new evaluation methodology for agentic systems, though it focuses on knowledge representation rather than core LLM architecture or inference optimization.

arXiv · 2604.11753 · score 9.0

Agentic Aggregation for Parallel Scaling of Long-Horizon Agentic Tasks

agentic systems · inference optimization · multi-agent coordination · LLM architecture

The abstract directly addresses parallel test-time scaling for agentic tasks, proposing a novel aggregation mechanism that optimizes inference efficiency and coordinates multiple trajectories, fitting the core themes of agentic systems and inference optimization.

arXiv · 2604.11716 · score 9.0

SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

agentic systems · LLM architecture · inference optimization · memory · evals

The abstract directly addresses the core challenges of agentic systems in software engineering by proposing a novel memory management strategy (sliding window and reasoning digests) to optimize inference efficiency and context handling.

arXiv · 2305.14314 · score 9.0

QLoRA: Efficient Finetuning of Quantized LLMs

training methods · LLM architecture · inference optimization

Landmark paper on memory-efficient LLM finetuning; highly relevant.

arXiv · 2604.11721 · score 8.0

Evaluating Cooperation in LLM Social Groups through Elected Leadership

multi-agent coordination · agentic systems · evals

The abstract directly addresses multi-agent coordination and agentic systems by proposing and evaluating leadership mechanisms to improve social welfare in LLM simulations.

arXiv · 2604.11699 · score 8.0

Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning

RAG · LLM architecture · training methods · evals

The paper directly addresses RAG for few-shot learning, proposes a novel LLM architecture for legal reasoning, introduces a new evaluation dataset, and discusses training-free methods, though it lacks focus on agentic systems, memory, or safety.

arXiv · 2604.11805 · score 7.0

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

training methods · LLM architecture

The paper focuses on a novel training methodology using physics simulators to enhance LLM reasoning, which directly addresses training methods and implicitly influences architecture capabilities.

arXiv · 2604.11801 · score 7.0

CLSGen: A Dual-Head Fine-Tuning Framework for Joint Probabilistic Classification and Verbalized Explanation

LLM architecture · training methods · evals

The abstract proposes a novel fine-tuning framework and architecture for LLMs to improve probability estimation and evaluation metrics while preserving explanation capabilities.

arXiv · 2604.11748 · score 7.0

LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

LLM architecture · training methods · inference optimization

The paper introduces a novel continuous diffusion architecture for language modeling with specific training innovations, directly addressing LLM architecture and training methods while offering implications for inference optimization.

arXiv · 2604.11741 · score 7.0

Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games

LLM architecture · agentic systems · multi-agent coordination · training methods · evals

The paper directly addresses multi-agent coordination, training methods (GRPO, fine-tuning), and evaluation of VLMs in complex reasoning scenarios, though it focuses less on RAG, memory, or safety.

arXiv · 2604.11703 · score 7.0

DreamKG: A KG-Augmented Conversational System for People Experiencing Homelessness

RAG · LLM architecture · inference optimization

The abstract describes a hybrid RAG system integrating knowledge graphs with LLMs for reliable, context-aware inference, directly addressing architecture and optimization for specific query types.

arXiv · 2604.11666 · score 7.0

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

safety · evals · training methods

The paper introduces a novel safety-focused evaluation benchmark for theory-of-mind and adversarial deception, utilizing reinforcement learning to train models, but does not address architecture, agentic systems, RAG, memory, inference optimization, or multi-agent coordination.