engineering, architecture

Session Archive: FTS5 Full-Text Search Meets Git Commit Cross-Referencing

How we built a searchable archive of every Claude Code session — with SQLite FTS5, Porter stemming, worker-thread indexing, and automatic cross-referencing against your git history.

Callipso Team · March 9, 2026 · 16 min read

A week ago, the Archive tab was a flat list of Claude Code sessions sorted by date. You could scroll, you could click, you could resume. That was it. If you remembered a session where you debugged a WebSocket race condition three weeks ago but could not remember which project it was in, you scrolled through hundreds of entries hoping to recognize the right one.

Now the Archive has two-layer search, automatic git commit cross-referencing, a unified timeline, and a preview modal with paired prompt/response tabs. This post covers the architecture in detail — the SQLite schema, the indexing pipeline, how commits get linked to sessions, and the caching strategy that makes it fast.

The Data Sources

The Archive draws from three independent data sources that were never designed to work together:

Source 1: Claude Code Session Files
  ~/.claude/history.jsonl                          (fast metadata: sessionId, project, timestamp)
  ~/.claude/projects/[encoded-cwd]/[sessionId].jsonl  (slow full content: every message, tool use, thinking)

Source 2: Git Repositories
  Discovered from terminal CWDs + session CWDs
  git log with format: hash, message, timestamp, author, additions, deletions, files changed

Source 3: Callipso's Own Data
  ~/.callipso/session-archive.json                 (archived sessions with display text, reformulations, IDE info)
  ~/.config/callipso/reformulation-store.json       (voice reformulations per session)
  ~/Library/Application Support/callipso/search-index.sqlite  (FTS5 index)

The challenge is stitching these together. A Claude Code session file knows nothing about git. A git log knows nothing about Claude Code. The connection between "this session produced these commits" exists only in the overlap of timestamps and working directories.

The Archive uses two search layers that run in parallel and merge results.

Layer 1: Metadata Search (in-memory, fast)

When you type in the search box, Layer 1 filters immediately. It matches against display text, first prompt, project path, git branch, model name, and reformulation text. Multi-word queries use AND logic — every word must appear in at least one field.

This runs against data already loaded in the renderer process. No IPC round-trip. Sub-millisecond for hundreds of sessions. Good for finding sessions by project name, branch name, or a keyword you remember from the prompt.
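
A minimal sketch of the AND filter described above. The `SessionMeta` shape and the `filterByMetadata` name are hypothetical, chosen only to mirror the fields listed in this section; the actual renderer code may differ.

```typescript
// Hypothetical shape of the metadata already loaded in the renderer.
interface SessionMeta {
  sessionId: string;
  displayText: string;
  firstPrompt: string;
  projectPath: string;
  gitBranch: string;
  model: string;
  reformulation: string;
}

// Keeps the sessions in which every query word appears in at least one field.
function filterByMetadata(sessions: SessionMeta[], query: string): SessionMeta[] {
  const words = query.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return sessions; // empty query: no filtering
  return sessions.filter((s) => {
    const fields = [
      s.displayText, s.firstPrompt, s.projectPath,
      s.gitBranch, s.model, s.reformulation,
    ].map((f) => f.toLowerCase());
    // AND across words, OR across fields.
    return words.every((w) => fields.some((f) => f.indexOf(w) !== -1));
  });
}
```

Each word may match a different field, so "websocket chat" finds a session whose prompt mentions WebSocket and whose project path contains "chat".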

Layer 2: FTS5 Full-Text Search (SQLite, thorough)

Layer 2 runs 300ms after the user stops typing (debounced). It sends the query to the main process via the search-fulltext IPC channel, which forwards it to a worker thread running the SQLite FTS5 engine.

FTS5 searches the full content of every message in every indexed session — not just metadata, but every prompt you sent and every response Claude gave. Results come back with highlighted snippets showing the matching context.

The two layers merge in the renderer. Layer 1 results appear first (they are instant). Layer 2 results trickle in after the debounce. Duplicates are removed by session ID. FTS5 matches get a snippet badge showing where the match was found in the conversation.
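
The merge step might look like the sketch below. `SearchHit` and `mergeLayers` are illustrative names, and attaching an FTS5 snippet to a session that Layer 1 already returned is an assumption about how the badge is applied.

```typescript
interface SearchHit {
  sessionId: string;
  snippet?: string; // present only for FTS5 matches
}

// Layer 1 hits keep their positions. Layer 2 hits are appended unless the
// session is already listed, in which case only the snippet is attached.
function mergeLayers(layer1: SearchHit[], layer2: SearchHit[]): SearchHit[] {
  const merged: SearchHit[] = layer1.map((h) => ({ sessionId: h.sessionId, snippet: h.snippet }));
  const indexById: Record<string, number | undefined> = Object.create(null);
  merged.forEach((h, i) => { indexById[h.sessionId] = i; });

  for (const hit of layer2) {
    const existing = indexById[hit.sessionId];
    if (existing === undefined) {
      indexById[hit.sessionId] = merged.length;
      merged.push({ sessionId: hit.sessionId, snippet: hit.snippet });
    } else if (merged[existing].snippet === undefined) {
      merged[existing].snippet = hit.snippet; // duplicate session: keep one row
    }
  }
  return merged;
}
```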

The FTS5 Schema

The search index is a SQLite database at ~/Library/Application Support/callipso/search-index.sqlite. Three tables:

-- Track which working directories have been indexed
CREATE TABLE indexed_cwds (
  cwd             TEXT PRIMARY KEY,
  last_indexed_at  INTEGER,
  session_count    INTEGER,
  message_count    INTEGER
);

-- Track individual session files for incremental indexing
CREATE TABLE indexed_files (
  session_id   TEXT PRIMARY KEY,
  cwd          TEXT,
  file_mtime   INTEGER,        -- file modification time (change detection)
  indexed_at   INTEGER
);

-- The FTS5 virtual table
CREATE VIRTUAL TABLE messages_fts USING fts5(
  session_id UNINDEXED,
  role,                         -- 'user' or 'assistant'
  content,                      -- the actual message text
  timestamp UNINDEXED,
  cwd UNINDEXED,
  tokenize='porter unicode61'   -- Porter stemmer + Unicode-aware tokenizer
);

The tokenizer choice matters. porter unicode61 means:

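  • Porter stemming: "debugging," "debugged," and "debugs" all reduce to the stem "debug," so a search for any one of them finds the others. Likewise, "authentication" finds sessions about "authenticate" and "authenticated." Stemming only strips suffixes, so it does not connect a full word to its abbreviation: "authentication" will not find a message that only says "auth," although searching for "auth" finds both because each query word runs as a prefix match. This is critical for developer search, since you rarely remember the exact inflection you used three weeks ago.
  • Unicode61: Handles non-ASCII characters correctly and folds case. By default it treats every non-alphanumeric character as a separator, so snake_case identifiers are split at the underscore (user_id becomes user and id), while camelCase identifiers stay a single lowercased token (getUserId becomes getuserid).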

The UNINDEXED columns (session_id, timestamp, cwd) are stored in the FTS table but not searchable. They are metadata carried alongside the indexed content so we can group and filter results without joining to another table.

The Indexing Pipeline

Indexing runs in a worker thread to avoid blocking the main process. The pipeline:

User clicks "Index [CWD]" or auto-index triggers
  → Main process sends { action: 'index', cwd } to worker thread
    → Worker loads SearchIndex class
      → Scan ~/.claude/projects/[encoded-cwd]/ for .jsonl files
        → For each session file:
            1. Check indexed_files for existing mtime
            2. If mtime unchanged → skip (incremental)
            3. If new or changed → read JSONL line by line
            4. Extract user messages (full text)
            5. Extract assistant messages (text only — skip tool_use blocks, skip thinking)
            6. INSERT INTO messages_fts in a transaction
            7. UPDATE indexed_files with new mtime
        → Update indexed_cwds with counts
      → Post progress: { type: 'progress', done, total, phase }
    → Main process forwards progress to renderer
  → UI updates progress bar in real time

The key design decisions:

Incremental indexing. The indexed_files table stores each session file's mtime. On re-index, unchanged files are skipped entirely. This means re-indexing a directory with 200 sessions but only 3 new ones takes milliseconds, not seconds.

Assistant message filtering. Claude Code responses contain three types of content: text (the actual response), tool_use (function calls with arguments), and thinking (internal reasoning). We index only the text content. Tool use blocks are noisy — they contain raw file paths, JSON payloads, and command outputs that pollute search results. Thinking blocks are similarly noisy. The user's actual question and Claude's actual answer are what you search for.

Transaction batching. All inserts for a single session file happen in one SQLite transaction. This is both a performance optimization (SQLite commits are expensive) and a correctness guarantee (a crash mid-index does not leave partial session data).

Worker thread isolation. SQLite operations — especially initial indexing of thousands of messages — can take seconds. Running this on the main Electron process would freeze the UI. The worker thread communicates via message passing: { action, data } in, { type, result } out. The main process forwards progress events to the renderer via the search-index-progress IPC channel.
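
Two of the decisions above, the mtime check and the assistant-message filter, can be sketched as follows. The content-block shapes are assumptions about the session JSONL format, and `needsIndexing` and `extractIndexableText` are hypothetical helper names, not the SearchIndex class's actual API.

```typescript
// Assumed content-block shapes; real session files may carry more fields.
type ContentBlock =
  | { type: "text"; text: string }
  | { type: "tool_use"; name: string; input: unknown }
  | { type: "thinking"; thinking: string };

interface IndexableMessage {
  role: "user" | "assistant";
  content: string;
}

// Incremental check: index a file only if it is new or its mtime has changed.
// `indexedMtimes` mirrors indexed_files as session_id -> file_mtime.
function needsIndexing(
  sessionId: string,
  fileMtime: number,
  indexedMtimes: Record<string, number>,
): boolean {
  return indexedMtimes[sessionId] !== fileMtime;
}

// Returns the searchable text of one JSONL line, or null if nothing should be indexed.
function extractIndexableText(jsonLine: string): IndexableMessage | null {
  let parsed: any;
  try {
    parsed = JSON.parse(jsonLine);
  } catch (err) {
    return null; // skip malformed or partially written lines
  }
  const msg = parsed && parsed.message;
  if (!msg || (msg.role !== "user" && msg.role !== "assistant")) return null;

  const blocks: ContentBlock[] =
    typeof msg.content === "string"
      ? [{ type: "text", text: msg.content }]
      : Array.isArray(msg.content) ? msg.content : [];

  // Keep text blocks only; tool_use and thinking blocks are dropped.
  const text = blocks
    .filter((b): b is { type: "text"; text: string } => b.type === "text")
    .map((b) => b.text)
    .join("\n")
    .trim();
  return text.length > 0 ? { role: msg.role, content: text } : null;
}
```

A message that consists only of tool calls yields null, so it never reaches messages_fts.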

Query Processing

When the user searches, the FTS5 query goes through sanitization before hitting SQLite:

User types: "websocket race condition"

→ Sanitize: split into words, escape special characters
→ Build FTS5 query: each word becomes a prefix match ("websocket"* AND "race"* AND "condition"*)
→ Execute:
    SELECT session_id, role, timestamp, cwd,
           snippet(messages_fts, 2, '<mark>', '</mark>', '...', 40) AS snippet,
           rank
    FROM messages_fts
    WHERE messages_fts MATCH ?
    ORDER BY rank
    LIMIT 50

The snippet() function is FTS5's built-in context extractor. Arguments:

  • 2 — column index for content (0=session_id, 1=role, 2=content)
  • '<mark>' / '</mark>' — highlight markers for matched terms
  • '...' — ellipsis for truncated context
  • 40 — maximum tokens in the snippet

The result includes the rank (BM25 relevance score), the highlighted snippet, the session ID, and the role (user or assistant). The renderer groups results by session and shows the best snippet per session.

If the query includes a CWD filter (from the workspace bar), an additional AND cwd = ? condition in the WHERE clause narrows results to that project directory.
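
As a rough illustration of the sanitization step, the sketch below quotes each word, doubles any embedded double quote (FTS5's escape for quoted strings), and appends the prefix operator. `buildFtsQuery` and `buildSearchSql` are hypothetical names; the real sanitizer may handle more cases.

```typescript
// Turns raw user input into an FTS5 MATCH expression of AND-ed prefix terms.
// Returns an empty string for blank input; the caller should skip the query then.
function buildFtsQuery(input: string): string {
  return input
    .split(/\s+/)
    .filter((w) => w.length > 0)
    .map((w) => '"' + w.replace(/"/g, '""') + '"*')
    .join(" AND ");
}

// Builds the SELECT shown above, optionally narrowed to one working directory.
// Both values are bound as parameters, never concatenated into the SQL text.
function buildSearchSql(filterByCwd: boolean): string {
  return [
    "SELECT session_id, role, timestamp, cwd,",
    "       snippet(messages_fts, 2, '<mark>', '</mark>', '...', 40) AS snippet,",
    "       rank",
    "FROM messages_fts",
    "WHERE messages_fts MATCH ?" + (filterByCwd ? " AND cwd = ?" : ""),
    "ORDER BY rank",
    "LIMIT 50",
  ].join("\n");
}
```

Quoting every word means FTS5 operators typed by the user, such as NOT or a stray parenthesis, are treated as plain search terms rather than query syntax.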

Git Commit Cross-Referencing

This is the piece that ties sessions to real code changes. The system has three components: repo discovery, commit fetching, and the cross-referencing algorithm.

Repo Discovery

gitLogReader.discoverRepos()
  → Collect CWDs from two sources:
      1. Active terminal CWDs (from the terminal store — what is running right now)
      2. Session CWDs (from ~/.claude/projects/ — every project Claude Code has worked in)
  → For each CWD, run: git rev-parse --show-toplevel
  → Deduplicate by resolved repo root path
  → Return list of unique git repositories

No hardcoded repo paths. If a terminal or session touched a directory, and that directory is inside a git repo, the repo is discovered automatically. This means the timeline updates as you work in new projects — no configuration.
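
The deduplication logic can be sketched independently of git itself. Here `resolveRoot` is a hypothetical stand-in for running git rev-parse --show-toplevel in a directory, and the function signature is an assumption rather than the real gitLogReader API.

```typescript
// Stand-in for `git rev-parse --show-toplevel` run in `cwd`: returns the repo
// root, or null when the directory is not inside a git repository.
type RootResolver = (cwd: string) => string | null;

function discoverRepos(
  terminalCwds: string[],
  sessionCwds: string[],
  resolveRoot: RootResolver,
): string[] {
  const seen: Record<string, boolean> = Object.create(null);
  const repos: string[] = [];
  for (const cwd of terminalCwds.concat(sessionCwds)) {
    const root = resolveRoot(cwd);
    if (root !== null && !seen[root]) {
      seen[root] = true; // several CWDs can resolve to the same repo root
      repos.push(root);
    }
  }
  return repos;
}
```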

Commit Fetching

gitLogReader.getCommits(repoPath, timeRange)
  → Check in-memory cache (keyed by repoPath:timeRange, 30-second TTL)
  → If cache miss:
      → git log --format="%H|%h|%s|%aI|%an|%ae" --shortstat --since=<timeRange>
      → Parse each commit: full hash, short hash, message, ISO timestamp, author name, email
      → Parse shortstat: N files changed, N insertions(+), N deletions(-)
      → Cache result
  → Return GitCommitEntry[]

Time ranges: 7d, 30d, 90d, all (capped at 365 days for performance). The --shortstat flag gives us additions/deletions/files-changed per commit without fetching full diffs.

The 30-second cache prevents hammering git log on every tab switch or filter change. It is short enough that new commits appear within half a minute.
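
Parsing that output is mostly string handling. The sketch below assumes the git log format string shown above; `ParsedCommit` and `parseGitLog` are illustrative names. Because a commit subject can itself contain a pipe, the pattern takes the last three fields from the right and leaves everything in between as the message (an author name containing a pipe would still break it).

```typescript
interface ParsedCommit {
  hash: string;
  shortHash: string;
  message: string;
  timestamp: string; // ISO 8601, as produced by %aI
  authorName: string;
  authorEmail: string;
  filesChanged: number;
  additions: number;
  deletions: number;
}

function parseGitLog(output: string): ParsedCommit[] {
  // hash|short|subject|date|name|email, with the subject allowed to contain "|".
  const headerRe = /^([0-9a-f]{40,64})\|([0-9a-f]+)\|(.*)\|([^|]*)\|([^|]*)\|([^|]*)$/;
  const commits: ParsedCommit[] = [];

  for (const rawLine of output.split("\n")) {
    const line = rawLine.replace(/\r$/, "");
    const header = headerRe.exec(line);
    if (header) {
      commits.push({
        hash: header[1], shortHash: header[2], message: header[3],
        timestamp: header[4], authorName: header[5], authorEmail: header[6],
        filesChanged: 0, additions: 0, deletions: 0,
      });
      continue;
    }
    // A shortstat line belongs to the most recent header; merges may have none.
    const current = commits[commits.length - 1];
    const files = /(\d+) files? changed/.exec(line);
    if (!current || !files) continue;
    const ins = /(\d+) insertions?\(\+\)/.exec(line);
    const del = /(\d+) deletions?\(-\)/.exec(line);
    current.filesChanged = parseInt(files[1], 10);
    current.additions = ins ? parseInt(ins[1], 10) : 0;
    current.deletions = del ? parseInt(del[1], 10) : 0;
  }
  return commits;
}
```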

The Cross-Referencing Algorithm

This is the core of the system. Given a list of sessions and a list of commits, how do you determine which commits came from which session?

There is no explicit link. Claude Code does not tag commits with a session ID. Git does not know about Claude Code. The connection is inferred from two signals: time and place.

crossRefCommitsWithSessions(sessions, commits) → CrossRefResult[]

For each commit:
  For each session:

    // Step 1: Check timestamp overlap (buffered window)
    commitTime = commit.timestamp
    bufferedStart = session.startTime - FIVE_MINUTE_BUFFER
    bufferedEnd = session.endTime + FIVE_MINUTE_BUFFER

    if commitTime is NOT within [bufferedStart, bufferedEnd]:
      → skip (no temporal overlap)

    // Step 2: Check CWD match (first matching rule wins)
    if session.cwd === commit.repoPath:
      → HIGH confidence ("Exact CWD match + timestamp overlap")

    else if session.cwd starts with commit.repoPath:
      → MEDIUM confidence ("Session CWD is subdirectory of repo")

    else if commit.repoPath starts with session.cwd:
      → MEDIUM confidence ("Repo is subdirectory of session CWD")

    // Step 3: Timestamp-only fallback (stricter bounds, no CWD match)
    else if commitTime is within [session.startTime, session.endTime] (no 5-minute buffer):
      → LOW confidence ("Timestamp overlap only, no CWD match")

The 5-minute buffer on each end accounts for two realities: (1) Claude Code might have been thinking for several minutes before making the commit, and (2) the session end time in the JSONL file might be slightly earlier than the actual last action if the hooks did not fire cleanly.

Each CrossRefResult carries:

  • sessionId: which session
  • commitHash: which commit
  • confidence: 'high' | 'medium' | 'low'
  • reason: human-readable explanation ("Exact CWD match + timestamp overlap")
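
For readers who prefer real code to pseudocode, here is one way the algorithm could be written. The `SessionSpan` and `CommitRef` shapes (with epoch-millisecond times) are assumptions, and two details go beyond what this post states: rules are evaluated so that the first match wins, and the subdirectory test compares on a trailing path separator so that /work/app-v2 is not mistaken for a child of /work/app.

```typescript
type Confidence = "high" | "medium" | "low";
interface SessionSpan { sessionId: string; cwd: string; startTime: number; endTime: number; }
interface CommitRef { hash: string; repoPath: string; timestamp: number; }
interface CrossRefResult { sessionId: string; commitHash: string; confidence: Confidence; reason: string; }

const BUFFER_MS = 5 * 60 * 1000;

// True when `child` lies strictly below `ancestor` in the directory tree.
function isSubdirectory(child: string, ancestor: string): boolean {
  const base = ancestor.charAt(ancestor.length - 1) === "/" ? ancestor : ancestor + "/";
  return child.length > base.length && child.slice(0, base.length) === base;
}

function crossRefCommitsWithSessions(sessions: SessionSpan[], commits: CommitRef[]): CrossRefResult[] {
  const results: CrossRefResult[] = [];
  for (const c of commits) {
    for (const s of sessions) {
      // Step 1: buffered timestamp overlap.
      if (c.timestamp < s.startTime - BUFFER_MS || c.timestamp > s.endTime + BUFFER_MS) continue;

      let confidence: Confidence | null = null;
      let reason = "";
      // Step 2: CWD match, strongest rule first.
      if (s.cwd === c.repoPath) {
        confidence = "high"; reason = "Exact CWD match + timestamp overlap";
      } else if (isSubdirectory(s.cwd, c.repoPath)) {
        confidence = "medium"; reason = "Session CWD is subdirectory of repo";
      } else if (isSubdirectory(c.repoPath, s.cwd)) {
        confidence = "medium"; reason = "Repo is subdirectory of session CWD";
      // Step 3: timestamp-only fallback, without the buffer.
      } else if (c.timestamp >= s.startTime && c.timestamp <= s.endTime) {
        confidence = "low"; reason = "Timestamp overlap only, no CWD match";
      }
      if (confidence !== null) {
        results.push({ sessionId: s.sessionId, commitHash: c.hash, confidence, reason });
      }
    }
  }
  return results;
}
```

A commit can be linked to more than one session when sessions overlap in time; the confidence level is what lets the UI rank those candidates.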

The Unified Timeline

The final step merges sessions and commits into a single chronological view:

buildTimeline(sessions, commits, crossRefs)
  → Create ArchiveTimelineItem[] from sessions: { type: 'session', data: SessionDetails }
  → Create ArchiveTimelineItem[] from commits: { type: 'commit', data: GitCommitEntry }
  → Map linkedSessionIds onto each commit (from crossRefs)
  → Sort everything by timestamp (descending)
  → Return merged timeline

Commits that are not linked to any session are grouped separately as "unlinked commits." These are commits made outside of Claude Code — manual edits, other tools, CI/CD automation. Showing them gives a complete picture of what happened in the repo, not just what Claude Code did.

The UI renders sessions and commits with different visual treatments. Sessions show the reformulation text, model, duration, and message count. Commits show the message, author, and stats (files changed, additions, deletions). Linked commits show which session produced them.
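
A simplified sketch of the merge is below. The item shapes are trimmed to the fields needed for sorting and linking, and returning the unlinked commits alongside the timeline (while also keeping them in it) is an assumption about how the grouping is exposed to the UI.

```typescript
interface TimelineSession { sessionId: string; timestamp: number; }
interface TimelineCommit { hash: string; timestamp: number; }
interface CommitLink { sessionId: string; commitHash: string; }

type ArchiveTimelineItem =
  | { type: "session"; timestamp: number; data: TimelineSession }
  | { type: "commit"; timestamp: number; data: TimelineCommit; linkedSessionIds: string[] };

function buildTimeline(
  sessions: TimelineSession[],
  commits: TimelineCommit[],
  crossRefs: CommitLink[],
): { timeline: ArchiveTimelineItem[]; unlinkedCommits: TimelineCommit[] } {
  // Group linked session IDs by commit hash.
  const linksByCommit: Record<string, string[] | undefined> = Object.create(null);
  for (const ref of crossRefs) {
    const list = linksByCommit[ref.commitHash] || [];
    list.push(ref.sessionId);
    linksByCommit[ref.commitHash] = list;
  }

  const timeline: ArchiveTimelineItem[] = [];
  for (const s of sessions) {
    timeline.push({ type: "session", timestamp: s.timestamp, data: s });
  }
  for (const c of commits) {
    timeline.push({
      type: "commit", timestamp: c.timestamp, data: c,
      linkedSessionIds: linksByCommit[c.hash] || [],
    });
  }
  timeline.sort((a, b) => b.timestamp - a.timestamp); // newest first

  const unlinkedCommits = commits.filter((c) => linksByCommit[c.hash] === undefined);
  return { timeline, unlinkedCommits };
}
```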

The Archive UI

The renderer (archive.js, ~500 lines) manages several coordinated views:

Session List with Infinite Scroll

Initial load fetches the first page (20 sessions) from ~/.claude/history.jsonl — fast metadata only. As the user scrolls, subsequent pages load full session details from the per-project JSONL files. This two-stage loading keeps the initial render under 100ms even with thousands of sessions.

Page 1:  history.jsonl scan (FAST — one file, metadata only)
         → Render session cards with title, timestamp, project
Page 2+: Per-project JSONL read (SLOWER — full message parsing)
         → Hydrate with message count, duration, model, todos, git branch
         → SessionCache stores parsed details (24-hour TTL, mtime-aware invalidation)

The SessionCache (~/.callipso/session-cache.json) prevents re-parsing JSONL files on every Archive tab open. Entries expire after 24 hours or when the source file's mtime changes. LRU-like pruning caps the cache at 500 entries.
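
The freshness rule and the pruning step can be expressed in a few lines. This is a sketch under assumptions: the `CacheEntry` fields and the `lastUsed` bookkeeping are hypothetical, since the post describes the pruning only as "LRU-like".

```typescript
interface CacheEntry<T> {
  details: T;
  cachedAt: number;    // epoch ms when the entry was written
  sourceMtime: number; // mtime of the JSONL file at that moment
  lastUsed: number;    // epoch ms of the last read, used for pruning
}

const TTL_MS = 24 * 60 * 60 * 1000;
const MAX_ENTRIES = 500;

// An entry is usable only if it is younger than the TTL and the source
// file has not been modified since it was cached.
function isFresh<T>(entry: CacheEntry<T> | undefined, currentMtime: number, now: number): boolean {
  return entry !== undefined
    && now - entry.cachedAt < TTL_MS
    && entry.sourceMtime === currentMtime;
}

// Drops the least recently used entries until at most `maxEntries` remain.
function pruneCache<T>(cache: Record<string, CacheEntry<T>>, maxEntries: number = MAX_ENTRIES): void {
  const oldestFirst = Object.keys(cache).sort((a, b) => cache[a].lastUsed - cache[b].lastUsed);
  const excess = Math.max(0, oldestFirst.length - maxEntries);
  for (const key of oldestFirst.slice(0, excess)) {
    delete cache[key];
  }
}
```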

Workspace Filter Bar

The bar shows workspace chips for each project directory. Clicking a chip filters sessions to that workspace. Each chip shows a count of how many sessions exist for that project.

Workspaces are discovered from two sources: the active workspace list (what the user has open) and session CWDs (historical projects). Recent and active workspaces appear first.

Preview Modal with Prompt/Response Tabs

Clicking a session opens a preview modal with two tabs:

  • Prompt tab: The reformulation text (if captured via voice) and the first user message.
  • Response tab: Paired prompt/response tuples from the full session. Each pair shows what the user asked and what Claude answered, in order.

The Response tab calls search-get-session-messages IPC, which reads the full JSONL file and extracts paired message tuples. This is loaded on demand — not preloaded for every session in the list.
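
One plausible way to build those tuples is shown below. The pairing rule is an assumption: each user message opens a pair, and every assistant message that follows is folded into that pair's response until the next user message. `ChatMessage` and `pairMessages` are illustrative names.

```typescript
interface ChatMessage { role: "user" | "assistant"; content: string; }
interface PromptResponsePair { prompt: string; response: string; }

function pairMessages(messages: ChatMessage[]): PromptResponsePair[] {
  const pairs: PromptResponsePair[] = [];
  for (const m of messages) {
    if (m.role === "user") {
      pairs.push({ prompt: m.content, response: "" });
    } else if (pairs.length > 0) {
      // Consecutive assistant messages are joined into one response.
      const last = pairs[pairs.length - 1];
      last.response = last.response ? last.response + "\n" + m.content : m.content;
    }
    // Assistant messages that precede any user message are ignored.
  }
  return pairs;
}
```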

FTS5 Indexing Panel

The panel shows unindexed CWDs (directories that appear in Claude Code's history but have not been indexed for full-text search). Each CWD has an "Index" button. Progress is shown in real time: "Indexing... 47/203 sessions (indexing)."

Indexing is cancellable via the search-index-cancel IPC channel, which sends a cancel message to the worker thread.

Settings Panel

Display mode controls which text appears as the session title:

  • firstReformulation — the first voice reformulation (default if using voice)
  • lastReformulation — the most recent reformulation
  • firstPrompt — the first user message
  • lastPrompt — the most recent user message

Time range, model filter, and layout mode (timeline vs. subtabs vs. drawer) are also configurable. Settings persist to ~/.config/callipso/archive-display-settings.json.

The Full Data Flow

Here is the complete pipeline from "user opens Archive tab" to "timeline rendered":

1. Tab activation
   → Load settings from archive-display-settings.json
   → Fetch session summaries (history.jsonl, PAGE_SIZE=20)
   → Fetch session details for page 1 (per-project JSONL, cached)
   → Render session list

2. Timeline load (parallel)
   → Discover git repos (terminal CWDs + session CWDs → git rev-parse)
   → Fetch commits (git log with timeRange, cached 30s)
   → Cross-reference commits with sessions (timestamp + CWD matching)
   → Build unified timeline (sessions + commits, sorted desc)
   → Identify unlinked commits
   → Render timeline with linked badges

3. Search (on user input)
   → Layer 1: In-memory metadata filter (instant)
   → Layer 2: FTS5 query via worker thread (debounced 300ms)
   → Merge results, deduplicate by session ID
   → Mark FTS5 matches with snippet badges
   → Check for unindexed CWDs, show indexing prompt

4. Workspace filter (on chip click)
   → Filter sessions by project path
   → Re-run timeline for filtered set
   → Update chip counts

5. Preview (on session click)
   → Show modal with Prompt tab (from cached session details)
   → On Response tab click: fetch paired messages via IPC (on demand)

The IPC Surface

The Archive system registers 30+ IPC handlers. The most important ones:

| Channel | Type | Purpose |
| --- | --- | --- |
| claude-sessions-summaries | handle | Fast metadata from history.jsonl (limit, offset) |
| claude-sessions-paginated | handle | Full details for a page (limit, offset) |
| claude-session-details | handle | Single session by sessionId |
| search-fulltext | handle | FTS5 query with optional CWD filter |
| search-index-cwd | handle | Start indexing a CWD (worker thread) |
| search-index-status | handle | Indexed CWDs, total messages, DB size |
| archive-timeline | handle | Merged sessions + commits + cross-refs |
| archive-git-repos | handle | Discover repos from terminal + session CWDs |
| archive-git-commits | handle | Commits for discovered repos with timeRange |
| search-get-session-messages | handle | Paired prompt/response tuples for preview |
| archive-session-on-close | send | Auto-archive when terminal closes |
| search-index-progress | send | Real-time indexing progress from worker |

What the Competitors Build

Session archiving and search has exploded in early 2026. We found eight tools in this space:

| Tool | Search Engine | Multi-Provider | Embedded in IDE/Overlay? |
| --- | --- | --- | --- |
| CASS | Tantivy + FTS5 hybrid | 11+ providers | No (standalone TUI) |
| Agent Sessions | Native macOS app | 7+ providers | No (standalone app) |
| Claudex | SQLite FTS5 | Claude Code only | No (web app) |
| ccrider | SQLite FTS5 | Claude Code only | No (Go TUI) |
| Engram | SQLite FTS5 | Agent-agnostic | No (CLI + HTTP + MCP) |
| mnemo | SQLite FTS5 | Multi-provider | No (Go library) |
| Claude-Mem | SQLite FTS5 | Claude Code only | No (Claude Code plugin) |
| Callipso | SQLite FTS5 | Claude Code | Yes (embedded in overlay) |

Seven of the eight tools use SQLite FTS5 in some form. That is not a coincidence: it is the right tool for this problem. Lightweight, embedded, no server, good-enough relevance ranking via BM25.

CASS is the most ambitious (Tantivy for primary search, FTS5 as fallback, support for 11+ providers). Agent Sessions is the most polished standalone app.

Callipso's differentiator is not the search engine — it is the integration. The archive is embedded in the overlay, one tab away from your active terminals. Resume a session from the archive and it launches in the right IDE, in the right project directory, with the right session state. Git cross-referencing shows which commits came from which session without leaving the overlay. The search runs while your agents are still working in the foreground.

No standalone tool can do that. They require switching to a separate application, finding the session, copying context, switching back to your IDE, and manually resuming. The archive being inside the overlay eliminates that context switch.

What We Did Not Build

Multi-provider support. The archive currently reads only Claude Code session files. CASS supports 11 providers. Adding Aider, Codex, or Gemini CLI session formats is a matter of writing additional JSONL parsers — the FTS5 schema is provider-agnostic (session_id, role, content). But we have not done it yet.

Semantic search. CASS supports optional FastEmbed + MiniLM embeddings for semantic similarity. Our search is lexical only — you need to use words that actually appear in the conversation. Semantic search would help for queries like "that session where I fixed the login bug" when the actual conversation used "authentication" and "credentials." We are watching this space but have not committed to an embedding model.

Diff-level cross-referencing. We link commits to sessions by timestamp and CWD overlap. We do not inspect the actual diff content to see if the changed files match the files Claude Code was editing. This would improve confidence scoring but requires reading full git diffs — expensive for repos with large commits.

PR cross-referencing. The GitCommitEntry interface has prNumber, prTitle, and ciStatus fields, but they are not populated yet. Linking commits to PRs would complete the picture from "session → commit → PR → CI status."

What We Learned

  1. FTS5 with Porter stemming is the right default for developer search. We briefly considered trigram indexing for fuzzy matching, but Porter stemming handles the most common case (inflection variants) with no configuration. Developers search for "debug" and expect to find "debugging" — Porter does this out of the box.

  2. Incremental indexing is non-negotiable. The first implementation re-indexed everything on every search. For a directory with 200 sessions, this took 8 seconds. Adding mtime-based change detection made re-indexing take under 50ms for the common case (no new sessions). The indexed_files table adds complexity but makes the system usable.

  3. Cross-referencing by timestamp is surprisingly accurate. The 5-minute buffer catches over 95% of real associations in our testing. False positives are rare because most developers do not have overlapping Claude Code sessions in the same repo. The confidence levels (high/medium/low) let the UI indicate uncertainty without hiding results.

  4. Worker threads are essential for SQLite in Electron. Initial indexing of a large project (500+ sessions, 50,000+ messages) can take 10-20 seconds. On the main process, this freezes the entire overlay. On a worker thread, the UI stays responsive and shows real-time progress. The message-passing overhead is negligible compared to the SQLite I/O.

  5. Two-layer search is better than one fast layer. Layer 1 gives instant feedback on every keystroke. Layer 2 finds results that metadata search misses. The 300ms debounce prevents hammering the SQLite worker on rapid typing. Users see results immediately (Layer 1) and then see better results a moment later (Layer 2) — this feels fast even when the full search takes 500ms.

  6. Session cache invalidation by mtime is simple and correct. We considered LRU eviction, TTL-only expiration, and hash-based invalidation. Mtime comparison is the simplest approach that handles the actual failure mode: a session file being appended to while Claude Code is still running. If the file changed, the cache entry is stale. If it did not, the cache entry is valid. No hashing, no versioning. The 24-hour TTL and the 500-entry cap only bound the cache's age and size.

Fourteen source files. Three data sources. Two search layers. One timeline. Every session you have ever had with Claude Code is now searchable, cross-referenced with your git history, and one click from resumption.
