Skip to content

How to use refgenconf to manage Refgenie assets in a pipeline

Below we present an example use of refgenconf package. It is installed automatically with refgenie (or separately installable with pip install refgenconf). All the asset fetching functionality is impelmented in refgenconf package, so pipelines that just use Python API do not need to depend on refgenie.

Goal

The goal of the code below is to get a path to the refgenie-managed fasta file for a user-specified genome.

Genome FASTA is a part of fasta asset, accessible as a fasta seek key. To retrieve the path this file on the command line one would say: refgenie seek <genome>/fasta. For example:

refgenie seek hg38/fasta

Steps

First, let's set the $REFGENIE environmet variable. It should be set by a pipeline user or the config file path should be provided explictly, e.g. as an input to the pipeline (here shown as user_provided_cfg_path = None -- not provided)

import os
os.environ["REFGENIE"] = "./refgenie.yaml"
user_provided_cfg_path = None
user_provided_genome = "rCRSd"

Next, let's import components of refgenconf that we'll use

from refgenconf import RefGenConf, select_genome_config, RefgenconfError, CFG_ENV_VARS, CFG_FOLDER_KEY
from yacman import UndefinedAliasError

Now, we can use the select_genome_config function to determine the preferred path to the config file. If user_provided_cfg_path is None (not specified) the $REFGENIE environment variable is used.

refgenie_cfg_path = select_genome_config(filename=user_provided_cfg_path, check_exist=False)

The function returns None if none of the above point to a valid path. That's why we raise an aproppriate error below. Obviously, the name of --rfg-config argument depends on pipeline design.

if not refgenie_cfg_path:
    raise OSError(f"Could not determine path to a refgenie genome configuration file."
                  f"Use --rfg-config argument or set '{CFG_ENV_VARS}' environment variable to provide it")

Otherwise it returns a determined path (str). So, we check if it exists and read the object if it does. If it does not, we can initialize the config file

if isinstance(refgenie_cfg_path, str) and os.path.exists(refgenie_cfg_path):
    print(f"Reading refgenie genome configuration file from file: {refgenie_cfg_path}")
    rgc = RefGenConf(filepath=refgenie_cfg_path)
else:
    print(f"File '{refgenie_cfg_path}' does not exist. Initializing refgenie genome configuration file.")
    rgc = RefGenConf(entries={CFG_FOLDER_KEY: os.path.dirname(refgenie_cfg_path)})
    rgc.initialize_config_file(filepath=refgenie_cfg_path)
    rgc.subscribe(urls="http://rg.databio.org:82", reset=True) # subscribe to the desired server, if needed
File '/Users/mstolarczyk/code/refgenie/docs_jupyter/refgenie.yaml' does not exist. Initializing refgenie genome configuration file.

Finally, we try to retrieve the path to out asset of interest and pull from refgenieserver if the retrieval fails.

try:
    fasta = rgc.seek(genome_name=user_provided_genome, asset_name="fasta", tag_name="default",
                                seek_key="fasta")
except (RefgenconfError, UndefinedAliasError):
    print("Could not determine path to chrom.sizes asset, pulling")
    rgc.pull(genome=user_provided_genome, asset="fasta", tag="default")
    fasta = rgc.seek(genome_name=user_provided_genome, asset_name="fasta", tag_name="default",
                                seek_key="fasta")
print(f"Determined path to fasta asset: {fasta}")
Could not determine path to chrom.sizes asset, pulling



Output()


Determined path to fasta asset: /Users/mstolarczyk/code/refgenie/docs_jupyter/alias/rCRSd/fasta/default/rCRSd.fa