The Context Window Problem
Large language models have a fundamental limitation: finite context windows. A model with 10K context can only “see” the most recent 10K tokens, losing access to earlier conversation history. Current solutions include:
- Longer context models: Expensive to train and run (128K+ context)
- Summarization: Lossy compression that discards detail
- RAG retrieval: Re-embeds and recomputes attention on every query
The HAT Solution
HAT takes a different approach: exploit known structure.
Unlike general-purpose vector databases that treat all data as unstructured point clouds, AI conversation data has inherent hierarchy:
```
Session (conversation boundary)
└── Document (topic or turn)
    └── Chunk (individual message)
```
HAT exploits this structure to achieve O(log n) queries with 100% recall, without any training or learning.
Core Claim
A 10K context model with HAT achieves 100% recall on 60K+ tokens with 3.1ms latency.
This is validated by our end-to-end experiments integrating HAT with a local LLM.
Why HAT Outperforms HNSW
| Aspect | HNSW | HAT |
|---|---|---|
| Structure | Learned graph | Known hierarchy |
| Time awareness | None | Built-in |
| Insert cost | O(log n) expensive | O(log n) cheap |
| Semantic meaning | None | Native |
| Memory overhead | High (edges) | Low (centroids) |
The key insight:
HNSW: "Here are points, I'll learn to navigate them"
HAT: "I know these are containers. I'll use what I know."
This is the difference between a search engine and a hippocampus.
The Hippocampus Analogy
HAT mirrors human memory architecture:
| Human Memory | HAT Equivalent |
|---|---|
| Working memory (7±2 items) | Current context window |
| Short-term memory | Recent session containers |
| Long-term episodic | HAT hierarchical storage |
| Memory consolidation (sleep) | HAT consolidation phases |
| Hippocampal indexing | Centroid-based routing |
Algorithm
Data Structure
HAT organizes points into a tree with four levels:
```
Global (root)
└── Session (conversation boundaries)
    └── Document (topic groupings)
        └── Chunk (leaf nodes with points)
```
Each non-leaf container maintains:
- Centroid: Mean of descendant embeddings
- Children: Pointers to child containers
- Timestamp: For temporal locality
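As a concrete illustration, a container can be sketched as a small record whose centroid is maintained as a running mean. The class and field names below are hypothetical, not the crate's actual types:

```python
from dataclasses import dataclass, field
import time
import numpy as np

@dataclass
class Container:
    """Hypothetical node at the session, document, or chunk level."""
    centroid: np.ndarray                                        # mean of all descendant embeddings
    children: list["Container"] = field(default_factory=list)   # sub-containers (empty at chunk level)
    points: list[np.ndarray] = field(default_factory=list)      # raw embeddings (chunk level only)
    timestamp: float = field(default_factory=time.time)         # last update, for temporal locality

    def add_point(self, emb: np.ndarray) -> None:
        # Chunk-level insert: store the embedding and update the running mean
        # so the centroid stays the exact mean of the stored points.
        self.points.append(emb)
        self.centroid += (emb - self.centroid) / len(self.points)
        self.timestamp = time.time()
```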
Beam Search Query
```python
def hat_query(query, k, beam_width):
    beam = [root]                                 # start at the global root
    for level in (Session, Document, Chunk):
        candidates = []
        for container in beam:
            for child in container.children:
                # route by cosine similarity between the query and each child centroid
                score = cosine(query, child.centroid)
                candidates.append((child, score))
        # keep only the beam_width best-scoring containers for the next level
        beam = top_b(candidates, beam_width)
    # rank the points in the surviving chunk containers and return the k best
    return top_k(beam, k)
```
Complexity: O(b × d × c), where b is the beam width, d is the tree depth, and c is the fan-out (children per container). For a balanced tree this is O(log n).
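To make this concrete: with the 20 sessions × 5 documents × 10 chunks layout used in the experiments below and a beam width of, say, 3, a query scores 20 session centroids, then 3 × 5 = 15 document centroids, then 3 × 10 = 30 chunk centroids, about 65 similarity computations before any leaf points are touched, instead of a brute-force scan over all 1,000 stored embeddings.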
Experimental Results
HAT vs HNSW on Hierarchical Data
Setup: 1000 points = 20 sessions × 5 documents × 10 chunks, 128 dimensions.
| Metric | HAT | HNSW | Δ |
|---|---|---|---|
| Recall@1 | 100.0% | 76.0% | +24.0% |
| Recall@5 | 100.0% | 72.0% | +28.0% |
| Recall@10 | 100.0% | 70.6% | +29.4% |
| Build Time | 30ms | 2.1s | 70× faster |
| Query Latency | 1.42ms | 0.49ms | HNSW 3× faster |
Key finding: The query latency advantage of HNSW is meaningless at 70% recall.
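For reference, hierarchical data of the shape described in the setup above can be approximated as nested Gaussian clusters. The generator below is an illustrative assumption about what "hierarchical data" means here, not the actual benchmark script:

```python
import numpy as np

def make_hierarchical_points(n_sessions=20, docs_per_session=5,
                             chunks_per_doc=10, dim=128, seed=0):
    """Nested clusters: each session center spawns document centers,
    each document center spawns chunk-level points."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for s in range(n_sessions):
        session_center = rng.normal(0.0, 1.0, dim)
        for d in range(docs_per_session):
            doc_center = session_center + rng.normal(0.0, 0.3, dim)
            for c in range(chunks_per_doc):
                points.append(doc_center + rng.normal(0.0, 0.1, dim))
                labels.append((s, d, c))
    return np.asarray(points), labels

# 20 x 5 x 10 = 1,000 points in 128 dimensions, matching the setup above.
X, labels = make_hierarchical_points()
```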
Scale Analysis
| Points | HAT Build | HNSW Build | HAT R@10 | HNSW R@10 |
|---|---|---|---|---|
| 500 | 16ms | 1.0s | 100% | 55% |
| 1000 | 25ms | 2.0s | 100% | 44.5% |
| 2000 | 50ms | 4.3s | 100% | 67.5% |
| 5000 | 127ms | 11.9s | 100% | 55% |
HAT maintains 100% recall across all tested scales.
End-to-End LLM Integration
Setup: 2000 messages (~60K tokens), sentence-transformers embeddings, gemma3:1b LLM.
| Metric | Value |
|---|---|
| Total tokens | 60,000 |
| Native context sees | 10,000 (16.7%) |
| HAT recall | 100% |
| Retrieval latency | 3.1ms |
| Memory usage | 3.3 MB |
The LLM correctly answers questions about “past” conversations it never saw in its context window.
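A minimal sketch of that retrieval loop, assuming sentence-transformers for embeddings, Ollama for gemma3:1b, that near() returns (id, score) pairs, and a plain dict to map ids back to message text; the glue code and model names are assumptions, not the project's actual integration:

```python
from sentence_transformers import SentenceTransformer
import ollama
from arms_hat import HatIndex

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model (384 dims)
index = HatIndex.cosine(384)
texts = {}                                          # id -> original message text

def remember(message: str) -> None:
    # Embed the message and store it in HAT, keeping the text for later prompting.
    texts[index.add(encoder.encode(message))] = message

def answer(question: str, k: int = 10) -> str:
    # Retrieve the k most relevant past messages via HAT, then prompt the LLM.
    hits = index.near(encoder.encode(question), k=k)
    context = "\n".join(texts[hit_id] for hit_id, _score in hits)
    prompt = f"Relevant past messages:\n{context}\n\nQuestion: {question}"
    reply = ollama.chat(model="gemma3:1b",
                        messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```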
Negative Results: Complexity Doesn’t Help
Subspace Routing (Grassmann geometry):
- Recall: -8.7% vs centroids
- Latency: +11.8%
Learnable Routing Weights:
- Recall: -2% to +4%
- Latency: ~0%
Conclusion: When structure is known, exploit it directly. Centroids are sufficient.
Consolidation Phases
Inspired by sleep-staged memory consolidation:
| Phase | Operations | Time |
|---|---|---|
| Light (α) | Recompute centroids | 9ms/1K |
| Medium (β) | + Merge/split containers | 9ms/1K |
| Deep (δ) | + Prune empty, optimize layout | 9ms/1K |
| Full (θ) | Complete rebuild | 10ms/1K |
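As an illustration of the lightest phase, recomputing centroids is a single bottom-up pass over the tree. The sketch below reuses the hypothetical Container layout from the Algorithm section and is not the crate's internal code:

```python
import numpy as np

def light_consolidation(node):
    """Alpha phase: recompute every centroid bottom-up as the exact mean
    of its descendant embeddings. Returns (centroid, descendant count)."""
    if node.points:                      # chunk level: average the raw embeddings
        node.centroid = np.mean(node.points, axis=0)
        return node.centroid, len(node.points)
    weighted_sum = np.zeros_like(node.centroid)
    total = 0
    for child in node.children:
        centroid, count = light_consolidation(child)
        weighted_sum += centroid * count
        total += count
    if total:                            # guard against empty containers
        node.centroid = weighted_sum / total
    return node.centroid, total
```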
Implementation
HAT is implemented in Rust with Python bindings via PyO3:
```python
from arms_hat import HatIndex

# Create index
index = HatIndex.cosine(1536)

# Add messages with session/document structure
index.new_session()
msg_id = index.add(embedding)

# Query
results = index.near(query_embedding, k=10)

# Persistence
index.save("memory.hat")
```
Contributions
- First index structure to exploit known AI workload hierarchy
- 100% recall vs 70% for HNSW on hierarchical data
- 70× faster construction than HNSW
- Empirical validation that simple centroids outperform geometric sophistication
- End-to-end integration demonstrated with real LLM
Conclusion
HAT functions as an artificial hippocampus for AI systems, enabling long-term episodic memory without retraining. The key insight is that AI conversation data has known structure that can be exploited rather than learned.
Get Started
HAT is available as the arms-hat crate:
- Rust: `cargo add arms-hat` (crates.io | docs.rs)
- Python: `pip install arms-hat` (coming soon)
- Source: github.com/automate-capture/hat
- Landing Page: research.automate-capture.com/hat
Part of the ARMS (Attention Reasoning Memory Store) project at Automate Capture Research.