Buildable assets
Refgenie can build a handful of assets for which we have already created building recipes. Run refgenie list to see every asset refgenie can build:
$ refgenie list
Local recipes: bismark_bt1_index, bismark_bt2_index, bowtie2_index, bwa_index, dbnsfp, ensembl_gtf, ensembl_rb, epilog_index, fasta, feat_annotation, gencode_gtf, hisat2_index, kallisto_index, refgene_anno, salmon_index, star_index, suffixerator_index, tallymer_index
To add a new asset, you'll need to work with us to provide a script that builds it, which we can then incorporate into refgenie. If you have assets that cannot be scripted, or you want to add some other custom asset, you can add custom assets manually and still have them managed by refgenie. We expect this to get much easier in the future.
Below, we go through the assets you can build and how to build them.
Top-level assets you can build
fasta
required files: --files fasta=/path/to/fasta_file (e.g. example_genome.fa.gz)
required parameters: none
required asset: none
required software: samtools
We recommend building the fasta asset first for every genome, because it is the starting point for building many other assets.
Example fasta files:
wget http://big.databio.org/example_data/rCRS.fa.gz
refgenie build rCRS/fasta --files fasta=rCRS.fa.gz
refgenie seek rCRS/fasta
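The fasta recipe relies on samtools, so it can help to confirm the tool is on your PATH before starting a build. A minimal sketch, assuming a POSIX shell; `require_tool` is a hypothetical helper name and not part of refgenie:

```shell
# Hypothetical helper (not part of refgenie):
# succeed only if the named command can be found on PATH.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing required software: $1" >&2
    return 1
}
```

For example, `require_tool samtools && refgenie build rCRS/fasta --files fasta=rCRS.fa.gz` would start the build only when samtools is found.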
blacklist
required files: --files blacklist=/path/to/blacklist_file (e.g. hg38-blacklist.v2.bed.gz)
required parameters: none
required asset: none
required software: none
The blacklist asset contains regions that should be excluded from the analysis of sequencing experiments. The ENCODE blacklist is a comprehensive listing of these regions for several model organisms [^Amemiya2019].
Example blacklist files:
wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
refgenie build hg38/blacklist --files blacklist=hg38-blacklist.v2.bed.gz
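A download that returns a web page instead of the archive is still saved under the `.gz` name, so a quick integrity check before building may save a failed build. A minimal sketch, assuming `gzip` is installed; `is_valid_gzip` is a hypothetical helper name:

```shell
# Hypothetical helper (not part of refgenie):
# succeed only if the file exists and passes gzip's integrity test.
is_valid_gzip() {
    [ -f "$1" ] && gzip -t "$1" 2>/dev/null
}
```

For example, `is_valid_gzip hg38-blacklist.v2.bed.gz && refgenie build hg38/blacklist --files blacklist=hg38-blacklist.v2.bed.gz` would skip the build when the downloaded file is not a real gzip archive.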
refgene_anno
required files: --files refgene=/path/to/refGene_file (e.g. refGene.txt.gz)
required parameters: none
required asset: none
required software: none
The refgene_anno asset is used to produce derived assets including transcription start sites (TSSs), exons, introns, and premature mRNA sequences.
Example refGene annotation files:
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
refgenie build hg38/refgene_anno --files refgene=refGene.txt.gz
gencode_gtf
required files: --files gencode_gtf=/path/to/gencode_file (e.g. gencode.gtf.gz)
required parameters: none
required asset: none
required software: none
The gencode_gtf asset contains all annotated transcripts.
Example GENCODE files:
- hg19 comprehensive gene annotation
- hg38 comprehensive gene annotation
- mm10 comprehensive gene annotation
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz
refgenie build mm10/gencode_gtf --files gencode_gtf=gencode.vM23.annotation.gtf.gz
ensembl_gtf
required files: --files ensembl_gtf=/path/to/ensembl_file (e.g. ensembl.gtf.gz)
required parameters: none
required asset: none
required software: none
The ensembl_gtf asset is used to build other derived assets including a comprehensive TSS annotation and gene body annotation.
Example Ensembl files:
wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
refgenie build hg38/ensembl_gtf --files ensembl_gtf=Homo_sapiens.GRCh38.97.gtf.gz
ensembl_rb
required files: --files gff=/path/to/gff_file (e.g. regulatory_features.gff.gz)
required parameters: none
required asset: none
required software: none
The ensembl_rb asset is used to produce derived assets including feature annotations.
Example Ensembl files:
wget ftp://ftp.ensembl.org/pub/current_regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz
refgenie build hg38/ensembl_rb --files gff=homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz
dbnsfp
required files: --files dbnsfp=/path/to/dbnsfp_file (e.g. dbNSFP4.0a.zip)
required parameters: none
required asset: none
required software: none
The dbnsfp asset is the annotation database for non-synonymous SNPs.
wget ftp://dbnsfp:[email protected]/dbNSFP4.0a.zip
refgenie build test/dbnsfp --files dbnsfp=dbNSFP4.0a.zip
Derived assets you can build
Many of the following derived assets require the corresponding software to build. You can either install that software natively on a case-by-case basis, or build the assets using Docker.
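One way to choose between the two modes is to check whether the required tool is installed and fall back to Docker otherwise. The sketch below only prints the command it would run. `build_cmd` is a hypothetical helper, and the `--docker` flag is assumed to be available on `refgenie build` in your installed version (check `refgenie build --help`).

```shell
# Hypothetical helper (not part of refgenie): print a build command for
# <genome>/<asset>, adding --docker when the required tool is not on PATH.
# Assumption: `refgenie build` accepts --docker; verify with `refgenie build --help`.
build_cmd() {
    genome=$1
    asset=$2
    tool=$3
    if command -v "$tool" >/dev/null 2>&1; then
        echo "refgenie build $genome/$asset"
    else
        echo "refgenie build $genome/$asset --docker"
    fi
}
```

Calling `build_cmd test bowtie2_index bowtie2` prints the native command when bowtie2 is installed and the Docker variant otherwise; run the printed command yourself once it looks right.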
bowtie2_index
required files: none
required parameters: none
required asset: fasta
required software: bowtie2
refgenie build test/bowtie2_index
bismark_bt1_index and bismark_bt2_index
required files: none
required parameters: none
required asset: fasta
required software: bismark
refgenie build test/bismark_bt1_index
refgenie build test/bismark_bt2_index
bwa_index
required files: none
required parameters: none
required asset: fasta
required software: bwa
refgenie build test/bwa_index
hisat2_index
required files: none
required parameters: none
required asset: fasta
required software: hisat2
refgenie build test/hisat2_index
epilog_index
required files: none
required parameters: --params context=CG (Default)
required asset: fasta
required software: epilog
refgenie build test/epilog_index --params context=CG
kallisto_index
required files: none
required parameters: none
required asset: fasta
required software: kallisto
refgenie build test/kallisto_index
salmon_index
required files: none
required parameters: none
required asset: fasta
required software: salmon
refgenie build test/salmon_index
star_index
required files: none
required parameters: none
required asset: fasta
required software: star
refgenie build test/star_index
suffixerator_index
required files: none
required parameters: --params memlimit=8GB (Default)
required asset: fasta
required software: GenomeTools
refgenie build test/suffixerator_index --params memlimit=8GB
tallymer_index
required files: none
required parameters: --params mersize=30 minocc=2 (Default)
required asset: fasta
required software: GenomeTools
refgenie build test/tallymer_index --params mersize=30 minocc=2
feat_annotation
required files: none
required parameters: none
required asset: ensembl_gtf, ensembl_rb
required software: none
The feat_annotation asset includes the following genomic feature annotations: enhancers, promoters, promoter flanking regions, 5' UTR, 3' UTR, exons, and introns.
refgenie build test/feat_annotation
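Since feat_annotation depends on ensembl_gtf and ensembl_rb, both parent assets need to exist before it is built. A dry-run sketch that prints the builds in dependency order, reusing the command forms shown earlier on this page; `feat_annotation_plan` and its file arguments are hypothetical placeholders:

```shell
# Hypothetical dry-run helper (not part of refgenie): print the builds needed
# for feat_annotation, parent assets first.
# Arguments: genome name, Ensembl GTF file, Ensembl Regulatory Build GFF file.
feat_annotation_plan() {
    genome=$1
    gtf=$2
    gff=$3
    echo "refgenie build $genome/ensembl_gtf --files ensembl_gtf=$gtf"
    echo "refgenie build $genome/ensembl_rb --files gff=$gff"
    echo "refgenie build $genome/feat_annotation"
}
```

The function runs nothing itself; it only lists the three commands so you can review them before executing each one in turn.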
[^Amemiya2019]: Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019;9:9354. doi:10.1038/s41598-019-45839-z