Refgenie Configuration
Refgenie uses a flexible configuration system to manage reference genome assets, supporting both local and remote operation modes. This document describes how to configure refgenie, including environment variables, database backends, and advanced options.
Configuration Overview
Refgenie stores its configuration in a database (SQLite or PostgreSQL) and uses a set of environment variables to control paths and behavior. The configuration tracks:
- The genome folder (where assets are stored)
- The genome archive folder (for asset archives)
- Database connection settings
- Subscribed remote servers
- Registered data channels and remotes
A visual overview of the configuration and database model is available in the database model diagram.
Initialization
To initialize refgenie with default settings, simply run:
refgenie1 init
This will create a default configuration in your home directory under .refgenie/ and set up a local SQLite database. You can override the default locations and database settings using environment variables (see below).
Environment Variables
Refgenie supports the following environment variables for configuration:
| Variable | Default | Purpose |
|---|---|---|
| REFGENIE_HOME_PATH | ~/.refgenie | Base directory for all refgenie files |
| REFGENIE_LOG_LEVEL | INFO | Logging verbosity |
| REFGENIE_GENOME_FOLDER | $REFGENIE_HOME_PATH/genomes | Directory for genome assets |
| REFGENIE_GENOME_ARCHIVE_FOLDER | $REFGENIE_HOME_PATH/archives | Directory for asset archives |
| REFGENIE_DB_CONFIG_PATH | $REFGENIE_HOME_PATH/refgenie_db_config.yaml | Path to the database config YAML file (see the next section for its schema) |
You can set these variables in your shell before running refgenie, for example:
export REFGENIE_GENOME_FOLDER=~/my_genomes
export REFGENIE_DB_CONFIG_PATH=~/my_refgenie_db_config.yaml
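The way the defaults in the table derive from REFGENIE_HOME_PATH can be sketched in a few lines of Python. This is only an illustration of the documented defaults, not refgenie's own resolution code; the function name resolve_refgenie_paths is hypothetical.

```python
import os
from pathlib import Path


def resolve_refgenie_paths(env=None):
    """Resolve the documented variables against their documented defaults.

    Hypothetical helper for illustration; refgenie's internal logic may differ.
    """
    env = os.environ if env is None else env
    home = Path(env.get("REFGENIE_HOME_PATH", "~/.refgenie")).expanduser()

    def path_var(name, default):
        # An explicitly set variable wins; otherwise fall back to the default.
        return Path(env.get(name, default)).expanduser()

    return {
        "REFGENIE_HOME_PATH": home,
        "REFGENIE_LOG_LEVEL": env.get("REFGENIE_LOG_LEVEL", "INFO"),
        "REFGENIE_GENOME_FOLDER": path_var("REFGENIE_GENOME_FOLDER", home / "genomes"),
        "REFGENIE_GENOME_ARCHIVE_FOLDER": path_var(
            "REFGENIE_GENOME_ARCHIVE_FOLDER", home / "archives"
        ),
        "REFGENIE_DB_CONFIG_PATH": path_var(
            "REFGENIE_DB_CONFIG_PATH", home / "refgenie_db_config.yaml"
        ),
    }
```

Calling resolve_refgenie_paths() with no argument reads the current shell environment, so it can serve as a quick check of which locations a given set of exports would produce.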
Database Backend
Refgenie supports both SQLite (default) and PostgreSQL for storing metadata. The database connection is configured via a YAML file, whose path is set by REFGENIE_DB_CONFIG_PATH.
Database Configuration YAML Schema
The database backend is configured via a YAML file whose path is set by REFGENIE_DB_CONFIG_PATH. The schema and required fields depend on the database type you choose. Refgenie uses a discriminated union to select the correct configuration model based on the type field in the YAML file:
- If type: sqlite, the SQLite configuration schema is used.
- If type: postgresql, the PostgreSQL configuration schema is used.
This means the type field acts as a discriminator, and the rest of the fields are validated according to the selected backend.
Common Fields
type: The database backend type. Must be either sqlite or postgresql. This is the discriminator for the schema.
SQLite Example
type: sqlite
path: ~/refgenie_db/refgenie # Path to the SQLite database file
- Only type and path are required for SQLite.
- path should be a valid file path where the SQLite database will be stored.
PostgreSQL Example
type: postgresql
name: refgenie # Name of the PostgreSQL database
host: localhost # Hostname of the PostgreSQL server
port: 5432 # Port number
user: postgres # Username
password: mysecretpassword # Password
- All fields shown above are required for PostgreSQL.
- name is the database name (not a file path).
- host, port, user, and password must be specified to connect to the server.
Note: Refgenie will create a default config file if one does not exist. You can edit this file to point to your preferred backend and provide the necessary connection details. The config loader will automatically select the correct schema based on the type field using a discriminated union.
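The type-based schema selection described in this section can be sketched with standard-library dataclasses. This is a minimal illustration under stated assumptions: the class names SQLiteConfig and PostgreSQLConfig and the function parse_database_config are hypothetical, and refgenie's actual models and validation may differ. The field names are the ones from the two YAML examples.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Union


@dataclass(frozen=True)
class SQLiteConfig:  # hypothetical name
    path: str
    type: str = "sqlite"


@dataclass(frozen=True)
class PostgreSQLConfig:  # hypothetical name
    name: str
    host: str
    port: int
    user: str
    password: str
    type: str = "postgresql"


DatabaseConfig = Union[SQLiteConfig, PostgreSQLConfig]

# Maps each discriminator value to the schema that validates the other fields.
_SCHEMAS = {"sqlite": SQLiteConfig, "postgresql": PostgreSQLConfig}


def parse_database_config(raw: Mapping[str, Any]) -> DatabaseConfig:
    """Pick the schema from the `type` discriminator, then check the rest."""
    db_type = raw.get("type")
    if db_type not in _SCHEMAS:
        raise ValueError(f"type must be one of {sorted(_SCHEMAS)}, got {db_type!r}")
    fields = {key: value for key, value in raw.items() if key != "type"}
    try:
        return _SCHEMAS[db_type](**fields)
    except TypeError as exc:
        # The dataclass constructor raises TypeError on missing or unknown fields.
        raise ValueError(f"invalid {db_type} configuration: {exc}") from exc
```

Passing the parsed contents of the SQLite example yields a SQLiteConfig, while a postgresql mapping that lacks host, port, user, or password is rejected with a ValueError.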
Why use a non-local PostgreSQL database?
While SQLite is simple and works well for single-user or small-scale setups, configuring refgenie to use a PostgreSQL database (especially a remote or managed instance) provides several advantages for larger or more demanding environments:
- High performance for concurrent access: PostgreSQL supports multiple simultaneous users and high-throughput operations, making it suitable for large teams or automated pipelines.
- Distributed and multi-user workflows: A remote PostgreSQL server allows multiple users, compute nodes, or cloud services to access and update the same refgenie database, supporting collaborative and distributed workflows.
- Scalability for large organizations: PostgreSQL can efficiently handle large datasets and many assets, making it ideal for institutional or enterprise deployments.
- High availability and reliability: Managed PostgreSQL services (e.g., AWS RDS, Google Cloud SQL) offer automated backups, failover, and monitoring, reducing downtime and risk of data loss.
- Centralized management: A single remote database can serve as the authoritative source of truth for all reference assets across an organization.
For most individual users or small labs, SQLite is sufficient. For production, cloud, or institutional deployments, PostgreSQL is recommended.
Local and Remote Configuration
- Local mode: All assets and metadata are managed on your local machine using the configured database and folders.
- Remote mode: You can subscribe to remote refgenie servers to pull assets or use remote seek functionality. Use the CLI or Python API to add server subscriptions:
refgenie1 subscribe http://refgenomes.databio.org
refgenie1 unsubscribe -s http://refgenomes.databio.org
You can list, add, or remove remote servers and data channels at any time. See the CLI documentation for details.
Inspecting Configuration
You can inspect the current configuration using:
refgenie1 config get
This will print a table with the current settings, including environment-based overrides.
Refgenie configuration
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ genome_folder ┃ version ┃ genome_archive_folder ┃ servers ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ /Users/johndoe/.refgenie/genomes │ 1 │ /Users/johndoe/.refgenie/archives │ ['http://refgenomes.databio.org'] │
└─────────────────────────────────────┴─────────┴──────────────────────────────────────┴───────────────────────────────────┘
Environment-based configuration
log_level: LogLevel.INFO
genome_folder: /Users/johndoe/.refgenie/genomes
genome_archive_folder: /Users/johndoe/.refgenie/archives
database_config_path: /Users/johndoe/.refgenie/refgenie_db_config.yaml
Advanced configuration
Data Channels and Remotes
Refgenie supports registration of data channels (external sources of asset classes and recipes) and remotes (external storage backends, e.g., S3). These are tracked in the configuration database and can be managed via CLI or API.
Advanced: Programmatic Configuration and Customization
When using refgenie as a Python library, users can inject custom objects into the Refgenie class initializer to control its behavior beyond environment variables and config files.
Provide path to database config file
When using refgenie as a Python library, you can specify the path to the database configuration file directly in the Refgenie constructor with database_config_path. This allows you to programmatically set up the database connection without relying on environment variables.
Custom Database Engine
You can provide a custom SQLAlchemy database engine to the Refgenie constructor via the database_engine argument. This allows you to:
- Use a pre-configured or pooled database connection
- Integrate with existing infrastructure or testing environments
Example:
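from refgenie import Refgenie
from sqlalchemy import create_engine

# Substitute your own credentials, host, port, and database name.
engine = create_engine("postgresql://user:password@localhost:5432/dbname")
# Pass the pre-built engine instead of a database config path.
refgenie = Refgenie(database_engine=engine)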
Note that this argument and the database_config_path argument are mutually exclusive. If both are provided, database_config_path will be ignored.
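If the connection details already live in the PostgreSQL config fields described earlier (name, host, port, user, password), a URL for create_engine can be assembled from them. The sketch below is an assumption about how one might do this by hand, not a description of how refgenie builds its own connection; the helper name postgresql_url is hypothetical. Credentials are percent-encoded because characters such as @ or / would otherwise break URL parsing.

```python
from urllib.parse import quote


def postgresql_url(name, host, port, user, password):
    """Build a postgresql:// URL, percent-encoding the credentials.

    Hypothetical helper; the argument names follow the YAML fields.
    """
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{int(port)}/{name}"
    )


url = postgresql_url("refgenie", "localhost", 5432, "postgres", "p@ss/word")
```

The resulting string can then be passed to create_engine in place of a hard-coded URL.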
Custom Server Clients
You can also inject a mapping of server URLs to custom server client objects using the server_clients_mapping argument. Importantly, each client must implement the ServerClient protocol (see refgenie.server.models.ServerClient).
This is useful for:
- Extending or customizing remote server interactions
- Using mock or test clients
Example:
from refgenie import Refgenie
from mymodule import MyCustomServerClient
server_clients = {
    "http://myserver": MyCustomServerClient(),
}
refgenie = Refgenie(server_clients_mapping=server_clients)
╭──────────────────────────────────────── ServerClient Protocol structure ────────────────────────────────────────╮
│ class ServerClient(*args, **kwargs): │
│ │
│ Protocol for the server client. │
│ │
│ server_url = <property object at 0x11b0bf880> │
│ download_with_progress = def download_with_progress(self, operation_id: str, output_path: pathlib.Path, params: │
│ Optional[Dict] = None, url_format_params: Optional[Dict[str, str]] = None, name: │
│ Optional[str] = None) -> pathlib.Path: Download asset at given URL to given filepath, │
│ show progress along the way. │
│ get = def get(self, operation_id: str, params: Optional[Dict] = None, url_format_params: │
│ Optional[Dict[str, str]] = None) -> Union[Dict, str]: Send a GET request to the │
│ specified operation ID. │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
Note: If a provided client does not implement the required protocol, refgenie will raise a ValueError.
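To make the protocol requirement concrete, here is a minimal sketch of a mock client for tests. It does not import from refgenie: ServerClientLike is a local stand-in that mirrors the members printed in the protocol structure above, MockServerClient is a hypothetical class, the str type of server_url is an assumption, and the operation ID used in the usage note is made up for illustration.

```python
from pathlib import Path
from typing import Dict, Optional, Protocol, Union, runtime_checkable


@runtime_checkable
class ServerClientLike(Protocol):
    """Local stand-in mirroring the printed members (not refgenie's class)."""

    @property
    def server_url(self) -> str: ...

    def download_with_progress(
        self,
        operation_id: str,
        output_path: Path,
        params: Optional[Dict] = None,
        url_format_params: Optional[Dict[str, str]] = None,
        name: Optional[str] = None,
    ) -> Path: ...

    def get(
        self,
        operation_id: str,
        params: Optional[Dict] = None,
        url_format_params: Optional[Dict[str, str]] = None,
    ) -> Union[Dict, str]: ...


class MockServerClient:
    """Returns canned responses instead of contacting a server."""

    def __init__(self, server_url: str, responses: Optional[Dict] = None):
        self._server_url = server_url
        self._responses = responses or {}

    @property
    def server_url(self) -> str:
        return self._server_url

    def download_with_progress(
        self, operation_id, output_path, params=None, url_format_params=None, name=None
    ):
        # Write an empty placeholder file rather than downloading anything.
        output_path.write_bytes(b"")
        return output_path

    def get(self, operation_id, params=None, url_format_params=None):
        # Look up the canned response registered for this operation ID.
        return self._responses[operation_id]
```

A structural check such as isinstance(MockServerClient("http://myserver"), ServerClientLike) only confirms that the three members exist, not that their signatures match, so a mock like this should still be exercised against the real Refgenie constructor before being relied on.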
This programmatic approach enables full control for advanced, production, or testing scenarios.
Database Migrations
Refgenie uses Alembic for database schema migrations. For advanced users or developers, see the migrations README for details on managing schema changes.