Tags: Context Extension, Memory Systems, Approximate Nearest Neighbor, Hierarchical Indexing, LLM, Transformers

Hierarchical Attention Tree: Extending LLM Context Through Structural Memory

Automate Capture Research
AI Research Lab

Abstract

We present the Hierarchical Attention Tree (HAT), a novel index structure that extends the effective context of language models by an order of magnitude. A model with 10K native context achieves 100% recall on 60K+ token conversations through hierarchical attention state storage and retrieval, with 3.1ms average latency. Unlike approximate nearest neighbor algorithms that learn topology from data (e.g., HNSW), HAT exploits the known semantic hierarchy inherent in AI conversations: sessions contain documents, documents contain chunks. Our experiments demonstrate 100% recall vs 70% for HNSW on hierarchically-structured data, 70× faster index construction, and that simple centroid-based routing outperforms geometric sophistication.

The Context Window Problem

Large language models have a fundamental limitation: finite context windows. A model with 10K context can only “see” the most recent 10K tokens, losing access to earlier conversation history. Current solutions include:

  • Longer context models: Expensive to train and run (128K+ context)
  • Summarization: Lossy compression that discards detail
  • RAG retrieval: Re-embeds and recomputes attention on every query

The HAT Solution

HAT takes a different approach: exploit known structure.

Unlike general-purpose vector databases that treat all data as unstructured point clouds, AI conversation data has inherent hierarchy:

Session (conversation boundary)
  └── Document (topic or turn)
       └── Chunk (individual message)

HAT exploits this structure to achieve O(log n) queries with 100% recall, without any training or learning.

Core Claim

A 10K context model with HAT achieves 100% recall on 60K+ tokens with 3.1ms latency.

This is validated by our end-to-end experiments integrating HAT with a local LLM.

Why HAT Outperforms HNSW

Aspect             HNSW                  HAT
Structure          Learned graph         Known hierarchy
Time awareness     None                  Built-in
Insert cost        O(log n), expensive   O(log n), cheap
Semantic meaning   None                  Native
Memory overhead    High (edges)          Low (centroids)

The key insight:

HNSW: "Here are points, I'll learn to navigate them"
HAT:  "I know these are containers. I'll use what I know."

This is the difference between a search engine and a hippocampus.

The Hippocampus Analogy

HAT mirrors human memory architecture:

Human Memory                   HAT Equivalent
Working memory (7±2 items)     Current context window
Short-term memory              Recent session containers
Long-term episodic memory      HAT hierarchical storage
Memory consolidation (sleep)   HAT consolidation phases
Hippocampal indexing           Centroid-based routing

Algorithm

Data Structure

HAT organizes points into a tree with four levels:

Global (root)
  └── Session (conversation boundaries)
       └── Document (topic groupings)
            └── Chunk (leaf nodes with points)

Each non-leaf container maintains the following fields (sketched in code after this list):

  • Centroid: Mean of descendant embeddings
  • Children: Pointers to child containers
  • Timestamp: For temporal locality
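
A minimal sketch of such a container, written as a Python dataclass (field and method names are illustrative, not the arms-hat API):

from dataclasses import dataclass, field

import numpy as np

@dataclass
class Container:
    """One non-leaf node of the HAT tree (Session or Document level)."""
    centroid: np.ndarray                           # mean of descendant embeddings
    children: list = field(default_factory=list)   # child containers or chunks
    timestamp: float = 0.0                         # last update, used for temporal locality

    def update_centroid(self):
        # Keep the centroid equal to the mean of the children's centroids.
        self.centroid = np.mean([c.centroid for c in self.children], axis=0)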

Beam Search Query

def hat_query(query, k, beam_width):
    """Beam search from the root down through the Session, Document, and Chunk levels."""
    beam = [root]

    # At each level, score every child of the surviving containers against the
    # query by cosine similarity and keep only the best beam_width of them.
    for level in [Session, Document, Chunk]:
        candidates = []
        for container in beam:
            for child in container.children:
                score = cosine(query, child.centroid)
                candidates.append((score, child))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = [child for _, child in candidates[:beam_width]]

    # The final beam holds chunk containers; rank their points and return the top k.
    scored = [(cosine(query, p.embedding), p) for chunk in beam for p in chunk.points]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [p for _, p in scored[:k]]

Complexity: O(b × d × c), where b is the beam width, d the tree depth, and c the average number of children per container; for a balanced tree this is O(log n) in the number of stored points.
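
As a back-of-envelope illustration, using the 20 × 5 × 10 layout from the experiments below and an assumed beam width of 4 (not a reported setting), a query touches only a small fraction of the centroids:

# Illustrative query cost for a 20-session x 5-document x 10-chunk tree.
sessions, docs_per_session, chunks_per_doc = 20, 5, 10
beam_width = 4  # assumed value for illustration

scored  = sessions                        # level 1: score every session centroid
scored += beam_width * docs_per_session   # level 2: documents under surviving sessions
scored += beam_width * chunks_per_doc     # level 3: chunks under surviving documents

print(scored)  # 80 centroid comparisons, versus 1000 for a brute-force scan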

Experimental Results

HAT vs HNSW on Hierarchical Data

Setup: 1,000 points arranged as 20 sessions × 5 documents × 10 chunks (one point per chunk), with 128-dimensional embeddings.
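
A sketch of how such hierarchically clustered data can be generated (our guess at the shape of the benchmark, not the actual generator used; the noise scales are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
dim, n_sessions, n_docs, n_chunks = 128, 20, 5, 10

points = []
for _ in range(n_sessions):
    session_center = rng.normal(size=dim)                          # broad topic of the session
    for _ in range(n_docs):
        doc_center = session_center + 0.1 * rng.normal(size=dim)   # sub-topic drift
        for _ in range(n_chunks):
            points.append(doc_center + 0.01 * rng.normal(size=dim))  # individual message

points = np.stack(points)  # shape (1000, 128)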

Metric          HAT       HNSW      Δ
Recall@1        100.0%    76.0%     +24.0%
Recall@5        100.0%    72.0%     +28.0%
Recall@10       100.0%    70.6%     +29.4%
Build time      30 ms     2.1 s     70× faster
Query latency   1.42 ms   0.49 ms   HNSW 3× faster

Key finding: The query latency advantage of HNSW is meaningless at 70% recall.

Scale Analysis

Points   HAT Build   HNSW Build   HAT R@10   HNSW R@10
500      16 ms       1.0 s        100%       55%
1000     25 ms       2.0 s        100%       44.5%
2000     50 ms       4.3 s        100%       67.5%
5000     127 ms      11.9 s       100%       55%

HAT maintains 100% recall across all tested scales.

End-to-End LLM Integration

Setup: 2000 messages (~60K tokens), sentence-transformers embeddings, gemma3:1b LLM.

Metric                             Value
Total tokens                       60,000
Tokens visible to native context   10,000 (16.7%)
HAT recall                         100%
Retrieval latency                  3.1 ms
Memory usage                       3.3 MB

The LLM correctly answers questions about “past” conversations it never saw in its context window.
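
A minimal sketch of the retrieval-augmented loop this implies (the encoder model, prompt format, and helper names are our assumptions, not the experiment's code):

from sentence_transformers import SentenceTransformer

# Placeholder encoder; the experiment used sentence-transformers, exact model unspecified.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def answer(question, index, messages, llm, k=10):
    # Embed the question, retrieve the k most relevant stored messages via HAT,
    # and prepend them so the LLM can "see" beyond its native context window.
    query_vec = encoder.encode(question)
    hits = index.near(query_vec, k=k)           # assumed to return ids of stored chunks
    context = "\n".join(messages[i] for i in hits)
    prompt = f"Relevant history:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                          # llm: any callable wrapping gemma3:1b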

Negative Results: Complexity Doesn’t Help

Subspace Routing (Grassmann geometry):

  • Recall: -8.7% vs centroids
  • Latency: +11.8%

Learnable Routing Weights:

  • Recall: -2% to +4%
  • Latency: ~0%

Conclusion: When structure is known, exploit it directly. Centroids are sufficient.

Consolidation Phases

Inspired by sleep-staged memory consolidation:

Phase        Operations                       Time
Light (α)    Recompute centroids              9 ms / 1K
Medium (β)   + Merge/split containers         9 ms / 1K
Deep (δ)     + Prune empty, optimize layout   9 ms / 1K
Full (θ)     Complete rebuild                 10 ms / 1K
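
Conceptually, the light (α) phase is just a bottom-up centroid refresh; a minimal recursive sketch building on the container sketch above (not the arms-hat implementation):

import numpy as np

def light_consolidate(node):
    # Recompute centroids bottom-up so parents reflect recently added chunks.
    if node.children:
        for child in node.children:
            light_consolidate(child)
        node.centroid = np.mean([c.centroid for c in node.children], axis=0)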

Implementation

HAT is implemented in Rust with Python bindings via PyO3:

from arms_hat import HatIndex

# Create a cosine-similarity index for 1536-dimensional embeddings
index = HatIndex.cosine(1536)

# Add messages with session/document structure
index.new_session()
msg_id = index.add(embedding)  # embedding: a 1536-dim vector for one message

# Query
results = index.near(query_embedding, k=10)

# Persistence
index.save("memory.hat")

Contributions

  1. First index structure to exploit known AI workload hierarchy
  2. 100% recall vs 70% for HNSW on hierarchical data
  3. 70× faster construction than HNSW
  4. Empirical validation that simple centroids outperform geometric sophistication
  5. End-to-end integration demonstrated with real LLM

Conclusion

HAT functions as an artificial hippocampus for AI systems, enabling long-term episodic memory without retraining. The key insight is that AI conversation data has known structure that can be exploited rather than learned.


Get Started

HAT is available as the arms-hat crate.

Part of the ARMS (Attention Reasoning Memory Store) project at Automate Capture Research.

Cite this article

Automate Capture Research (2026). Hierarchical Attention Tree: Extending LLM Context Through Structural Memory. Automate Capture Research.