The Context Window Problem
Large language models have a fundamental limitation: finite context windows. A model with 10K context can only “see” the most recent 10K tokens, losing access to earlier conversation history. Current solutions include:
- Longer context models: Expensive to train and run (128K+ context)
- Summarization: Lossy compression that discards detail
- RAG retrieval: Re-embeds and recomputes attention on every query
The HAT Solution
HAT takes a different approach: exploit known structure.
Unlike general-purpose vector databases that treat all data as unstructured point clouds, AI conversation data has inherent hierarchy:
```
Session (conversation boundary)
└── Document (topic or turn)
    └── Chunk (individual message)
```
HAT exploits this structure to achieve O(log n) queries with 100% recall, without any training or learning.
Core Claim
A 10K context model with HAT achieves 100% recall on 60K+ tokens with 3.1ms latency.
This is validated by our end-to-end experiments integrating HAT with a local LLM.
Why HAT Outperforms HNSW
| Aspect | HNSW | HAT |
|---|---|---|
| Structure | Learned graph | Known hierarchy |
| Time awareness | None | Built-in |
| Insert cost | O(log n) expensive | O(log n) cheap |
| Semantic meaning | None | Native |
| Memory overhead | High (edges) | Low (centroids) |
The key insight:
HNSW: "Here are points, I'll learn to navigate them"
HAT: "I know these are containers. I'll use what I know."
This is the difference between a search engine and a hippocampus.
The Hippocampus Analogy
HAT mirrors human memory architecture:
| Human Memory | HAT Equivalent |
|---|---|
| Working memory (7±2 items) | Current context window |
| Short-term memory | Recent session containers |
| Long-term episodic | HAT hierarchical storage |
| Memory consolidation (sleep) | HAT consolidation phases |
| Hippocampal indexing | Centroid-based routing |
Algorithm
Data Structure
HAT organizes points into a tree with four levels:
```
Global (root)
└── Session (conversation boundaries)
    └── Document (topic groupings)
        └── Chunk (leaf nodes with points)
```
Each non-leaf container maintains:
- Centroid: Mean of descendant embeddings
- Children: Pointers to child containers
- Timestamp: For temporal locality
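As a concrete illustration, a container can be sketched as a small record whose centroid is maintained as a running mean. The class and field names below are hypothetical, not the crate's actual types:

```python
from dataclasses import dataclass, field
import time
import numpy as np

@dataclass
class Container:
    """Hypothetical node at the session, document, or chunk level."""
    centroid: np.ndarray                                        # mean of all descendant embeddings
    children: list["Container"] = field(default_factory=list)   # sub-containers (empty at chunk level)
    points: list[np.ndarray] = field(default_factory=list)      # raw embeddings (chunk level only)
    timestamp: float = field(default_factory=time.time)         # last update, for temporal locality

    def add_point(self, emb: np.ndarray) -> None:
        # Chunk-level insert: store the embedding and update the running mean
        # so the centroid stays the exact mean of the stored points.
        self.points.append(emb)
        self.centroid += (emb - self.centroid) / len(self.points)
        self.timestamp = time.time()
```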
Beam Search Query
```python
def hat_query(query, k, beam_width):
    beam = [root]                                 # start at the global root
    for level in (Session, Document, Chunk):
        candidates = []
        for container in beam:
            for child in container.children:
                # route by cosine similarity between the query and each child centroid
                score = cosine(query, child.centroid)
                candidates.append((child, score))
        # keep only the beam_width best-scoring containers for the next level
        beam = top_b(candidates, beam_width)
    # rank the points in the surviving chunk containers and return the k best
    return top_k(beam, k)
```
Complexity: O(b × d × c), where b is the beam width, d is the tree depth, and c is the fan-out (children per container). For a balanced tree this is O(log n).
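To make this concrete: with the 20 sessions × 5 documents × 10 chunks layout used in the experiments below and a beam width of, say, 3, a query scores 20 session centroids, then 3 × 5 = 15 document centroids, then 3 × 10 = 30 chunk centroids, about 65 similarity computations before any leaf points are touched, instead of a brute-force scan over all 1,000 stored embeddings.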
Experimental Results
HAT vs HNSW on Hierarchical Data
Setup: 1000 points = 20 sessions × 5 documents × 10 chunks, 128 dimensions.
| Metric | HAT | HNSW | Δ |
|---|---|---|---|
| Recall@1 | 100.0% | 76.0% | +24.0% |
| Recall@5 | 100.0% | 72.0% | +28.0% |
| Recall@10 | 100.0% | 70.6% | +29.4% |
| Build Time | 30ms | 2.1s | 70× faster |
| Query Latency | 1.42ms | 0.49ms | HNSW 3× faster |
Key finding: The query latency advantage of HNSW is meaningless at 70% recall.
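For reference, hierarchical data of the shape described in the setup above can be approximated as nested Gaussian clusters. The generator below is an illustrative assumption about what "hierarchical data" means here, not the actual benchmark script:

```python
import numpy as np

def make_hierarchical_points(n_sessions=20, docs_per_session=5,
                             chunks_per_doc=10, dim=128, seed=0):
    """Nested clusters: each session center spawns document centers,
    each document center spawns chunk-level points."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for s in range(n_sessions):
        session_center = rng.normal(0.0, 1.0, dim)
        for d in range(docs_per_session):
            doc_center = session_center + rng.normal(0.0, 0.3, dim)
            for c in range(chunks_per_doc):
                points.append(doc_center + rng.normal(0.0, 0.1, dim))
                labels.append((s, d, c))
    return np.asarray(points), labels

# 20 x 5 x 10 = 1,000 points in 128 dimensions, matching the setup above.
X, labels = make_hierarchical_points()
```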
Scale Analysis
| Points | HAT Build | HNSW Build | HAT R@10 | HNSW R@10 |
|---|---|---|---|---|
| 500 | 16ms | 1.0s | 100% | 55% |
| 1000 | 25ms | 2.0s | 100% | 44.5% |
| 2000 | 50ms | 4.3s | 100% | 67.5% |
| 5000 | 127ms | 11.9s | 100% | 55% |
HAT maintains 100% recall across all tested scales.
End-to-End LLM Integration
Setup: 2000 messages (~60K tokens), sentence-transformers embeddings, gemma3:1b LLM.
| Metric | Value |
|---|---|
| Total tokens | 60,000 |
| Native context sees | 10,000 (16.7%) |
| HAT recall | 100% |
| Retrieval latency | 3.1ms |
| Memory usage | 3.3 MB |
The LLM correctly answers questions about “past” conversations it never saw in its context window.
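A minimal sketch of that retrieval loop, assuming sentence-transformers for embeddings, Ollama for gemma3:1b, that near() returns (id, score) pairs, and a plain dict to map ids back to message text; the glue code and model names are assumptions, not the project's actual integration:

```python
from sentence_transformers import SentenceTransformer
import ollama
from arms_hat import HatIndex

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model (384 dims)
index = HatIndex.cosine(384)
texts = {}                                          # id -> original message text

def remember(message: str) -> None:
    # Embed the message and store it in HAT, keeping the text for later prompting.
    texts[index.add(encoder.encode(message))] = message

def answer(question: str, k: int = 10) -> str:
    # Retrieve the k most relevant past messages via HAT, then prompt the LLM.
    hits = index.near(encoder.encode(question), k=k)
    context = "\n".join(texts[hit_id] for hit_id, _score in hits)
    prompt = f"Relevant past messages:\n{context}\n\nQuestion: {question}"
    reply = ollama.chat(model="gemma3:1b",
                        messages=[{"role": "user", "content": prompt}])
    return reply["message"]["content"]
```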
Negative Results: Complexity Doesn’t Help
Subspace Routing (Grassmann geometry):
- Recall: -8.7% vs centroids
- Latency: +11.8%
Learnable Routing Weights:
- Recall: -2% to +4%
- Latency: ~0%
Conclusion: When structure is known, exploit it directly. Centroids are sufficient.
Consolidation Phases
Inspired by sleep-staged memory consolidation:
| Phase | Operations | Time |
|---|---|---|
| Light (α) | Recompute centroids | 9ms/1K |
| Medium (β) | + Merge/split containers | 9ms/1K |
| Deep (δ) | + Prune empty, optimize layout | 9ms/1K |
| Full (θ) | Complete rebuild | 10ms/1K |
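As an illustration of the lightest phase, recomputing centroids is a single bottom-up pass over the tree. The sketch below reuses the hypothetical Container layout from the Algorithm section and is not the crate's internal code:

```python
import numpy as np

def light_consolidation(node):
    """Alpha phase: recompute every centroid bottom-up as the exact mean
    of its descendant embeddings. Returns (centroid, descendant count)."""
    if node.points:                      # chunk level: average the raw embeddings
        node.centroid = np.mean(node.points, axis=0)
        return node.centroid, len(node.points)
    weighted_sum = np.zeros_like(node.centroid)
    total = 0
    for child in node.children:
        centroid, count = light_consolidation(child)
        weighted_sum += centroid * count
        total += count
    if total:                            # guard against empty containers
        node.centroid = weighted_sum / total
    return node.centroid, total
```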
Implementation
HAT is implemented in Rust with Python bindings via PyO3:
```python
from arms_hat import HatIndex

# Create index
index = HatIndex.cosine(1536)

# Add messages with session/document structure
index.new_session()
msg_id = index.add(embedding)

# Query
results = index.near(query_embedding, k=10)

# Persistence
index.save("memory.hat")
```
Contributions
- First index structure to exploit known AI workload hierarchy
- 100% recall vs 70% for HNSW on hierarchical data
- 70× faster construction than HNSW
- Empirical validation that simple centroids outperform geometric sophistication
- End-to-end integration demonstrated with real LLM
Conclusion
HAT functions as an artificial hippocampus for AI systems, enabling long-term episodic memory without retraining. The key insight is that AI conversation data has known structure that can be exploited rather than learned.
Get Started
HAT is available as the arms-hat crate:
- Rust: `cargo add arms-hat` (crates.io | docs.rs)
- Python: `pip install arms-hat` (coming soon)
- Source: github.com/automate-capture/hat
- Landing Page: research.automate-capture.com/hat
Part of the ARMS (Attention Reasoning Memory Store) project at Automate Capture Research.