# Research Citation Index

Every principle in this series traces to published research. This index collects all 19 sources cited across the 10 articles.

| Source | Year | Key Finding | Cited In |
|---|---|---|---|
| Vaswani et al., “Attention Is All You Need” | 2017 | Transformer architecture — self-attention computes O(n²) pairwise token relationships; foundational to context window constraints | P2 |
| Zamfirescu-Pereira et al. (CHI), “Why Johnny Can’t Prompt” | 2023 | Non-experts prefer “Do not X” framing; positive + negative constraints together are the strongest prompt structure | P5 |
| Hong et al., MetaGPT | 2023 | Structured artifacts reduce errors ~40% vs. free dialogue in multi-agent systems | P4, P7 |
| Liu et al., “Lost in the Middle” | 2024 | 30%+ accuracy drop when critical information is placed in mid-context positions | P2, P10 |
| Ranjan et al., “One Word Is Not Enough” | 2024 | LLM vocabulary acts as a routing signal in embedding space, activating domain-specific knowledge clusters; superlatives and flattery (“world’s best”) route to motivational/marketing clusters rather than domain expertise | P6, P10 |
| PRISM Persona Framework | 2026 | Accuracy damage from personas scales with length — shorter identities cause less degradation; identities should be the minimum length required, under 50 tokens in practice; alignment-accuracy tradeoff: personas improve instruction-following while degrading factual accuracy on knowledge tasks | P6, P8, P10 |
| MAST, “Why Do Multi-Agent LLM Systems Fail?” | 2024–2025 | 14 failure modes catalogued across communication (4), coordination (5), and quality (5) categories; rubber-stamp approval as the #1 quality failure | P7, P8, P10 |
| Captain Agent, “Adaptive In-Conversation Team Building” | 2024 | Adaptive team composition outperforms static composition by 15–25% across benchmarks | P9 |
| Du et al., “Improving Factuality and Reasoning through Multiagent Debate” | 2024 | Multi-agent debate improves reasoning accuracy on structured problems with verifiable answers | P6 |
| LangChain few-shot prompting research | 2024 | 3 well-chosen examples match 9 in effectiveness; diminishing returns are real for few-shot prompting | P3, P10 |
| Anthropic, “Building Effective Agents” | Dec 2024 | Agent vs. workflow distinction; structured handoffs as the default pattern | P7 |
| Chroma Research, “Context Rot” | 2025 | Degradation of recall as context length increases; retrieval quality as a function of context window fill | P2 |
| Wu et al. (MIT), “On the Emergence of Position Bias in Transformers” | 2025 | Causal masking and RoPE as architectural causes of the U-shaped attention curve — not patchable by prompting | P2 |
| DeepMind et al., “Towards a Science of Scaling Agent Systems” | 2025 | 45% threshold for single-agent sufficiency; effectiveness saturates at 3–4 agents; a 5-agent team costs 7x for 3.1x the output | P9, P10 |
| He et al., “Does Prompt Formatting Have Any Impact on LLM Performance?” | 2025 | Prompt structure accounts for up to 40% of performance variance independent of content | P3, P4 |
| Anthropic, “Effective Context Engineering for AI Agents” | Sep 2025 | Attention budget concept, progressive disclosure as a strategy, context as a finite resource with real cost | P2, P9 |
| Anthropic, Skill Creator guidance | 2025 | “Explain why things are important in lieu of heavy-handed MUSTs” — BECAUSE clauses outperform imperatives | P5, P10 |
| Anthropic, “Harness Design for Long-Running Application Development” | Mar 2026 | Self-evaluation fails — a generator shares its evaluator’s biases; separating generation from evaluation dramatically improves output quality | P6, P10 |
| Vaarta Analytics, “Prompt Engineering Is System Design” | 2026 | Structured atomic checks reduce false negatives; evaluation accuracy at n=19 requirements drops below accuracy at n=5 | P5, P10 |