How RefgetStore integrates with refgenie
RefgetStore is a content-addressed sequence database (from gtars) that stores sequences and sequence collections by digest. Refgenie uses RefgetStore differently depending on whether it's running as a local client or as a server, controlled by a mode system.
Two modes, two relationships to RefgetStore
Remote RefgetStore (S3/HTTP)
┌─────────────────────────────┐
│ sequences + collections │
│ aliases/ncbi.tsv │
│ aliases/insdc.tsv │
│ collections/*.fhr.json │
└──────────┬──────────────────┘
│
┌───────────────┼───────────────┐
▼ ▼
refgenie client (local) refgenieserver
┌──────────────────┐ ┌──────────────────┐
│ SQLite │ │ PostgreSQL │
│ - genomes │ │ - aliases (flat) │
│ - assets │ │ - genomes │
├──────────────────┤ ├──────────────────┤
│ Local RefgetStore │ │ Remote store cache │
│ (owned, on_disk) │ │ (not owned, │
│ - sequences │ │ open_remote) │
│ - aliases/ │ │ - aliases (pulled) │
│ refgenie.tsv │ │ - fhr (pulled) │
└──────────────────┘ └──────────────────┘
| Local client | Server | |
|---|---|---|
| Database | SQLite | PostgreSQL |
| RefgetStore mode | on_disk() — owned, read/write |
open_remote() — not owned, read-only with local cache |
| Alias source of truth | RefgetStore "refgenie" namespace (StoreAliasManager) |
SQL alias table (AliasManager) |
| Writes to store | Yes | No |
| Mode class | LocalMode |
ServerMode |
Mode is determined at Refgenie.__init__() based on whether a refget_store_url is provided.
Client-side: local RefgetStore
The local CLI maintains a RefgetStore at genome_folder/.refget_store/. Refgenie creates and owns this store via LocalMode.
Adding a genome
refgenie genome init --fasta hg38.fa.gz --name hg38 --description "Human GRCh38"
This:
- Loads sequences into the local RefgetStore via
store.add_sequence_collection_from_fasta()and computes the collection digest - Registers the genome in SQLite with the digest
- Writes the alias "hg38" to the RefgetStore's "refgenie" namespace via
StoreAliasManager
The store is the single source of truth for aliases in local mode. Any tool that opens the RefgetStore directly (without refgenie) can resolve genome names.
Retrieving sequences
refgenie getseq hg38 chr1:0-1000
Resolution:
StoreAliasManager.resolve("hg38")→ digest (from RefgetStore "refgenie" namespace)store.get_sequence_by_name(digest, "chr1")→ sequence recordstore.get_substring(seq_digest, 0, 1000)→"ATCG..."
No FASTA file is needed after initialization. Sequences are content-addressable and deduplicated.
Remote genomes
You can also point refgenie at a remote source:
refgenie genome init \
--remote-url http://seqcolapi.databio.org \
--remote-digest abc123... \
--name hg38 --description "Human GRCh38"
Refgenie auto-detects whether the URL points to a seqcolapi server or a static RefgetStore (by checking for rgstore.json via make_source()). Sequences are fetched on-demand when you call getseq and cached in the local RefgetStore.
Server-side: remote RefgetStore with PostgreSQL
The server does not own a local RefgetStore. Instead, it connects to a remote RefgetStore (on S3 or HTTP) via open_remote(), caching data locally in an ephemeral directory. PostgreSQL is the persistent store for aliases, genomes, and assets.
Server startup
The server is configured via the REFGENIE_REFGET_STORE_URL environment variable:
from refgenie.config import REFGENIE_REFGET_STORE_URL
refgenie = Refgenie(refget_store_url=REFGENIE_REFGET_STORE_URL)
When refget_store_url is set, Refgenie creates a ServerMode, and at startup:
ServerMode.create_store()opens the remote RefgetStore with a local cache directorystore.pull_aliases()— syncs all reference alias namespaces from remote to cachestore.pull_fhr()— syncs all FHR provenance metadata from remote to cache
After startup, all alias resolution and metadata lookups are in-process — no per-request remote calls.
Populating the remote RefgetStore
The server reads from the store but never writes to it. Populating the store is a separate admin process:
- Build the RefgetStore from FASTA files (using gtars or
refget store add) - Host it on S3, HTTP, or any static file server
- Point the server at it via the
REFGENIE_REFGET_STORE_URLenvironment variable
Loading metadata into PostgreSQL
The server also creates a RefgetDBAgent backed by PostgreSQL for serving the seqcol REST API:
from refget.agents import RefgetDBAgent
app.state.dbagent = RefgetDBAgent(engine=refgenie.database_engine)
Server-specific metadata (seqcol collections, DRS records) is loaded via refget admin:
# Load a single genome
refget admin load genome.fa --name "Human GRCh38"
# Batch load from a PEP
refget admin load --pep genomes.yaml --fa-root /data/fasta/
# Load + upload FASTA to S3 + create DRS record
refget admin ingest genome.fa --bucket my-refget-bucket --name "Human GRCh38"
# Batch ingest from PEP
refget admin ingest --pep genomes.yaml --fa-root /data/fasta/ --bucket my-refget-bucket
The load command parses each FASTA, computes the seqcol metadata (names, lengths, sequence digests), and stores it in PostgreSQL. The ingest command additionally uploads the FASTA to S3 and creates a DRS access method record.
How alias resolution works on the server
When a client queries an alias (e.g. GET /v4/aliases/hg38):
- Check the SQL alias table (PostgreSQL) — server policy aliases always win
- If not found, check RefgetStore collection aliases across all namespaces (ncbi, insdc, ucsc, etc.)
- If found, return the full genome record: digest + seqcol level 2 + FHR metadata
{
"alias": "hg38",
"digest": "f1b5a389...",
"source": "server",
"collection": {
"names": ["chr1", "chr2", "..."],
"lengths": [248956422, 242193529, "..."],
"sequences": ["SQ.abc...", "SQ.def...", "..."]
},
"fhr": {
"species": "Homo sapiens",
"assembly_accession": "GCF_000001405.40"
}
}
Service info
The server exposes its RefgetStore URL and capabilities in /seqcol/service-info, so clients can discover the backing store and access sequences directly:
{
"id": "org.refgenie.seqcol",
"seqcol": {
"refget_store": {
"enabled": true,
"url": "https://my-bucket.s3.amazonaws.com/refget_store/"
}
}
}
Without a RefgetStore
When REFGENIE_REFGET_STORE_URL is not set, the server operates in local mode:
- Alias resolution uses the RefgetStore "refgenie" namespace (same as client)
- No reference namespace lookups (ncbi, insdc, etc.)
- Asset management works normally
Database configuration
export POSTGRES_HOST=localhost
export POSTGRES_DB=refget
export POSTGRES_USER=postgres
export POSTGRES_PASSWORD=yourpassword
Typical deployment workflow
- Build a RefgetStore from your FASTA files (using
refget store addor gtars directly) - Host it on S3 or any static file server
- Set up PostgreSQL
- Load metadata:
refget admin ingest --pep genomes.yaml --fa-root /data/fasta/ --bucket my-bucket - Configure the server: set
REFGENIE_REFGET_STORE_URLto point at the hosted store - Start the server:
refgenie serve - Verify: visit
/seqcol/service-infoand/v4/aliaseson the running server
How the two sides connect
A refgenie client can connect to a server as a remote source:
refgenie genome init \
--remote-url http://myserver.com/seqcol \
--remote-digest <digest> \
--name hg38 --description "Human GRCh38"
# Sequences are fetched on demand and cached locally
refgenie getseq hg38 chr1:0-1000
The client uses make_source() to auto-detect whether the URL is a seqcolapi server or a static RefgetStore. If the server advertises a refget_store_url in its service-info, the client can use the store directly for sequence retrieval.
See also
- What is RefgetStore? — overview of the storage format
- Adding a FastAPI router — embedding seqcol endpoints in your own app
- RefgetDB Agent — programmatic database operations
- CLI reference — full
refget admincommand reference