Preparing Servable Archives with Refgenie and Snakemake
The refgenie/refgenieserver1 repository contains all the files needed to generate and execute a Snakemake workflow that builds and archives servable reference genome assets. The workflow can be run locally or on a cluster, using whichever execution backend Snakemake is configured with.
By following these instructions, you can efficiently generate and execute a Snakemake workflow for building and archiving servable reference genome assets.
Snakefile Generation
To enhance flexibility, the Snakefile can be dynamically generated from a Jinja2 template using the refgenie generate command:
refgenie1 generate snakefile --help
usage: refgenie generate snakefile [-h] --output-path O [--snakefile-template-path S]
Generate a Snakemake file.
options:
  -h, --help  show this help message and exit
  --output-path O, -o O  Path to save the generated Snakefile.
  --snakefile-template-path S, -s S  Path to the Snakefile template. If not provided, the default template will be used.
refgenie1 generate snakefile --output-path generated.smk
This command uses the available Refgenie configuration to generate the Snakefile. The asset-building rules and the dependencies between them are derived from the recipes currently managed by Refgenie.
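To illustrate how dependencies between asset-building rules translate into a build order, here is a minimal sketch using Python's standard-library graphlib. The ASSET_DEPENDENCIES mapping and the build_order helper are hypothetical and written only for this example. In the actual workflow the dependencies come from the Refgenie recipes and Snakemake resolves the order itself.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: asset group -> asset groups that must exist first.
# Assumption for illustration only; the real relationships are defined by the
# recipes that Refgenie manages.
ASSET_DEPENDENCIES = {
    "fasta": set(),
    "bowtie2_index": {"fasta"},
    "bwa_index": {"fasta"},
}


def build_order(dependencies):
    """Return asset groups in an order that satisfies every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = build_order(ASSET_DEPENDENCIES)
print(order)  # "fasta" comes first; the two index assets follow in either order
```

Snakemake performs the equivalent resolution from the input and output declarations of the generated rules, so this sketch is only a mental model of what the generated Snakefile encodes.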
Usage
To run the workflow, ensure that Snakemake and Refgenie are installed and configured, and that the Snakefile has been generated.
Input Files
Many recipes require input files (e.g. the fasta recipe requires an input FASTA file), which need to be available to the asset-building software. If needed, refer to the "Extra: Downloading recipe input files" section below for more information on how the download process can be streamlined.
Configuration
PEP Configuration
The workflow expects a PEP (Portable Encapsulated Project) configuration file in the ./pep directory. The PEP must contain the following attributes:
- genome_name: The name of the genome to prepare the servable archive for.
- genome_description: A description of the genome.
- species_name: The scientific name of the species (e.g., "Homo sapiens", "Mus musculus").
- fasta_file_path: The path to the FASTA file for the genome.
- asset_groups: A list of asset groups to build for the genome.
Additionally, the PEP must define the asset groups to be built for each genome, which can be specified using a subsample table.
Example PEP:
config.yaml:
pep_version: 2.0.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv
sample_modifiers:
  append:
    fasta_file_path: path
  derive:
    attributes: [fasta_file_path]
    sources:
      path: ${REFGENIE_INPUTS}/{genome_name}.fa
A real-life example of a PEP config file
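The derive modifier in the config above turns the path source template into a concrete FASTA path for each sample. The sketch below mimics that expansion with the standard library only. It does not call the PEP tooling, and the derive_fasta_path helper and the example directory are assumptions made for illustration.

```python
import os


def derive_fasta_path(template, sample):
    """Expand environment variables, then fill {attribute} placeholders."""
    expanded = os.path.expandvars(template)  # resolves ${REFGENIE_INPUTS}
    return expanded.format(**sample)         # resolves {genome_name}


# Illustrative value only; in practice REFGENIE_INPUTS is set in your shell.
os.environ.setdefault("REFGENIE_INPUTS", "/data/refgenie_inputs")

sample = {"genome_name": "hg38"}
print(derive_fasta_path("${REFGENIE_INPUTS}/{genome_name}.fa", sample))
```

This is why the files in the inputs directory have to follow the naming used in the template: a sample with genome_name hg38 resolves to a file called hg38.fa in that directory.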
sample_table.csv:
sample_name,genome_name,genome_description,species_name
rCRSd,rCRSd,The revised Cambridge reference sequence.,Homo sapiens
mm10,mm10,The GCA_000001635.5 sequences for alignment pipelines from NCBI.,Mus musculus
hg38,hg38,The GCA_000001405.15 GRCh38 no-alt analysis set from NCBI.,Homo sapiens
A real-life example of a PEP sample table file
subsample_table.csv:
genome_name,asset_group
rCRSd,fasta
rCRSd,bowtie2_index
rCRSd,bwa_index
dm6,fasta
A real-life example of a PEP subsample table file
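Before running the workflow, it can help to check that the two tables agree. The following sketch, which uses only the csv module and the example rows shown above, groups asset groups by genome and lists genomes that appear in the subsample table but not in the sample table. The helper names are hypothetical, and the PEP tooling may perform its own validation.

```python
import csv
import io
from collections import defaultdict

SAMPLE_TABLE = """sample_name,genome_name,genome_description,species_name
rCRSd,rCRSd,The revised Cambridge reference sequence.,Homo sapiens
mm10,mm10,The GCA_000001635.5 sequences for alignment pipelines from NCBI.,Mus musculus
hg38,hg38,The GCA_000001405.15 GRCh38 no-alt analysis set from NCBI.,Homo sapiens
"""

SUBSAMPLE_TABLE = """genome_name,asset_group
rCRSd,fasta
rCRSd,bowtie2_index
rCRSd,bwa_index
dm6,fasta
"""


def asset_groups_by_genome(subsample_csv):
    """Map each genome name to the asset groups requested for it."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(subsample_csv)):
        groups[row["genome_name"]].append(row["asset_group"])
    return dict(groups)


def unknown_genomes(sample_csv, subsample_csv):
    """List genomes named in the subsample table but absent from the sample table."""
    known = {row["genome_name"] for row in csv.DictReader(io.StringIO(sample_csv))}
    return sorted(set(asset_groups_by_genome(subsample_csv)) - known)


print(asset_groups_by_genome(SUBSAMPLE_TABLE))
print(unknown_genomes(SAMPLE_TABLE, SUBSAMPLE_TABLE))  # ['dm6']
```

With the example tables, dm6 is reported because it has a subsample row but no matching entry in the sample table.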
Environment Variables
In addition to the standard environment variables used to configure Refgenie, the workflow requires the following environment variables:
- REFGENIE_INPUTS: The path to the directory where the input files for the workflow are stored.
  - make sure files in $REFGENIE_INPUTS are named according to the convention used in the PEP
- TEMPLATE_THREADS: The default number of threads to use for template generation. This parameter affects the threads parameter for asset groups whose recipes do not specify a default number of threads.
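As a rough illustration of how these two variables are meant to be used, the sketch below resolves a thread count with TEMPLATE_THREADS as the fallback and lists expected FASTA inputs that are missing from the inputs directory. Both helpers are hypothetical, the final fallback of 1 thread is an assumption, and the file naming follows the {genome_name}.fa convention from the example PEP.

```python
import os


def resolve_threads(recipe_default=None, env=os.environ):
    """Prefer the recipe's own thread count; otherwise use TEMPLATE_THREADS."""
    if recipe_default is not None:
        return int(recipe_default)
    # Assumption: fall back to a single thread when the variable is unset.
    return int(env.get("TEMPLATE_THREADS", 1))


def missing_inputs(genome_names, env=os.environ):
    """Return genomes whose <genome_name>.fa file is not in REFGENIE_INPUTS."""
    inputs_dir = env["REFGENIE_INPUTS"]  # raises KeyError if the variable is unset
    return [
        name
        for name in genome_names
        if not os.path.isfile(os.path.join(inputs_dir, f"{name}.fa"))
    ]


print(resolve_threads(recipe_default=None, env={"TEMPLATE_THREADS": "8"}))  # 8
print(resolve_threads(recipe_default=4, env={"TEMPLATE_THREADS": "8"}))     # 4
```

Running a check like missing_inputs before launching Snakemake surfaces naming mismatches early, instead of as a missing-input error partway through the workflow.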
Running the Workflow
Once configured, you can run the workflow using the following command:
snakemake --jobs <num_cores_if_local/num_parallel_jobs_if_cluster>
snakemake --jobs unlimited --snakefile generated.smk --default-resources slurm_account=<acct> slurm_partition=standard mem_mb=32000 --cores 8 --workflow-profile <path_to_snakemake_dir>
Note: --workflow-profile needs to point to a directory that contains the Snakemake config.yaml, which in turn specifies the executor to be used:
executor: slurm
Recipe software dependencies
The recipes very often require specialized bioinformatics software to build the assets, and this software usually isn't available on your system or head node.
Here are potential ways to manage the software dependencies:
- (recommended) Bulker Refgenie tutorial
  - we use a more recent Bulker Refgenie manifest
- Snakemake: using-environment-modules
- Snakemake: running-jobs-in-containers
snakemake --software-deployment-method apptainer
  - also check the singularity: directive with the snakemake --use-singularity flag
Extra: Using Taskfile
The workflow can also be run using Taskfile for easier management of tasks. To run the workflow using Taskfile, use the following command:
task archive
Extra: Downloading recipe input files
The ./pep/download_recipe_inputs.py script can be used to download the input files for the recipes from the sources specified in the ./pep/recipe_inputs_sources.csv file.
uv run python download_recipe_inputs.py recipe_inputs_sources.csv <output_dir>
Extra: Report Generation
Snakemake can generate a detailed report of the workflow execution, providing a visual overview and verifying that all steps ran as expected. To generate the report, use the following command:
snakemake --report refgenie.html