Buildable assets
Refgenie can build a handful of assets for which we have already created building recipes. refgenie list lists all assets refgenie can build:
$ refgenie list
Local recipes: bismark_bt1_index, bismark_bt2_index, bowtie2_index, bwa_index, dbnsfp, ensembl_gtf, ensembl_rb, epilog_index, fasta, feat_annotation, gencode_gtf, hisat2_index, kallisto_index, refgene_anno, salmon_index, star_index, suffixerator_index, tallymer_index
If you want to add a new asset, you'll have to work with us to provide a script that can build it, and we can incorporate it into refgenie. If you have assets that cannot be scripted, or you want to add some other custom asset, you may manually add custom assets and still have them managed by refgenie. We expect this will get much easier in the future.
Below, we go through the assets you can build and how to build them.
Top-level assets you can build
fasta
required files: --files fasta=/path/to/fasta_file (e.g. example_genome.fa.gz)
required parameters: none
required asset: none
required software: samtools
We recommend that, for every genome, you first build the fasta asset, because it is the starting point for building many other assets.
Example fasta files:
wget http://big.databio.org/example_data/rCRS.fa.gz
refgenie build rCRS/fasta --files fasta=rCRS.fa.gz
refgenie seek rCRS/fasta
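Because most derived assets require fasta, a typical session builds fasta once and then builds the indexes that depend on it. The following is a minimal sketch of that order, not an official refgenie workflow: build_genome and the REFGENIE variable are hypothetical names introduced here for illustration, and only the refgenie build invocations are taken from this page.

```shell
#!/bin/sh
# Hypothetical helper (not part of refgenie): build the fasta asset for a
# genome first, then any derived assets that depend on it.
# REFGENIE is an assumption of this sketch; override it for a dry run,
# e.g. REFGENIE="echo refgenie".
REFGENIE="${REFGENIE:-refgenie}"

build_genome() {
    genome="$1"      # genome name, e.g. rCRS
    fasta_file="$2"  # path to the FASTA file, e.g. rCRS.fa.gz
    shift 2          # remaining arguments are derived asset names

    # Stop if the fasta build fails, since the derived assets need it.
    $REFGENIE build "$genome/fasta" --files fasta="$fasta_file" || return 1

    for asset in "$@"; do
        $REFGENIE build "$genome/$asset" || return 1
    done
}
```

For example, build_genome rCRS rCRS.fa.gz bowtie2_index bwa_index would build rCRS/fasta and then the two indexes, stopping at the first failure.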
blacklist
required files: --files blacklist=/path/to/blacklist_file (e.g. hg38-blacklist.v2.bed.gz)
required parameters: none
required asset: none
required software: none
The blacklist asset represents regions that should be excluded from sequencing experiments. The ENCODE blacklist represents a comprehensive listing of these regions for several model organisms [^Amemiya2019].
Example blacklist files:
wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
refgenie build hg38/blacklist --files blacklist=hg38-blacklist.v2.bed.gz
refgene_anno
required files: --files refgene=/path/to/refGene_file (e.g. refGene.txt.gz)
required parameters: none
required asset: none
required software: none
The refgene_anno asset is used to produce derived assets including transcription start sites (TSSs), exons, introns, and premature mRNA sequences.
Example refGene annotation files:
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz
refgenie build hg38/refgene_anno --files refgene=refGene.txt.gz
gencode_gtf
required files: --files gencode_gtf=/path/to/gencode_file (e.g. gencode.gtf.gz)
required parameters: none
required asset: none
required software: none
The gencode_gtf asset contains all annotated transcripts.
Example gencode files:
- hg19 comprehensive gene annotation
- hg38 comprehensive gene annotation
- mm10 comprehensive gene annotation
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.annotation.gtf.gz
refgenie build mm10/gencode_gtf --files gencode_gtf=gencode.vM23.annotation.gtf.gz
ensembl_gtf
required files: --files ensembl_gtf=/path/to/ensembl_file (e.g. ensembl.gtf.gz)
required parameters: none
required asset: none
required software: none
The ensembl_gtf asset is used to build other derived assets including a comprehensive TSS annotation and gene body annotation.
Example Ensembl files:
wget ftp://ftp.ensembl.org/pub/release-97/gtf/homo_sapiens/Homo_sapiens.GRCh38.97.gtf.gz
refgenie build hg38/ensembl_gtf --files ensembl_gtf=Homo_sapiens.GRCh38.97.gtf.gz
ensembl_rb
required files: --files gff=/path/to/gff_file (e.g. regulatory_features.gff.gz)
required parameters: none
required asset: none
required software: none
The ensembl_rb asset is used to produce derived assets including feature annotations.
Example Ensembl files:
wget ftp://ftp.ensembl.org/pub/current_regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz
refgenie build hg38/ensembl_rb --files gff=homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz
dbnsfp
required files: --files dbnsfp=/path/to/dbnsfp_file (e.g. dbNSFP4.0a.zip)
required parameters: none
required asset: none
required software: none
The dbnsfp asset is the annotation database for non-synonymous SNPs.
wget ftp://dbnsfp:[email protected]/dbNSFP4.0a.zip
refgenie build test/dbnsfp --files dbnsfp=dbNSFP4.0a.zip
Derived assets you can build
For many of the following derived assets, you will need the corresponding software to build the asset. You can either install software on a case-by-case basis natively, or you can build the assets using docker.
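If you install software natively, it can help to confirm that the required executables are on your PATH before starting a long build. The sketch below is one way to do that with the POSIX command -v builtin. missing_tools is a hypothetical helper, not a refgenie feature, and the executable names you pass to it are assumptions that depend on how each tool is installed on your system.

```shell
#!/bin/sh
# Hypothetical helper (not part of refgenie): print each named executable
# that is not found on PATH. Returns non-zero if anything is missing.
missing_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, missing_tools samtools bwa salmon prints whichever of those names cannot be found and returns a non-zero status if any are missing, so it can guard a build: missing_tools bwa && refgenie build test/bwa_index.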
bowtie2_index
required files: none
required parameters: none
required asset: fasta
required software: bowtie2
refgenie build test/bowtie2_index
bismark_bt1_index and bismark_bt2_index
required files: none
required parameters: none
required asset: fasta
required software: bismark
refgenie build test/bismark_bt1_index
refgenie build test/bismark_bt2_index
bwa_index
required files: none
required parameters: none
required asset: fasta
required software: bwa
refgenie build test/bwa_index
hisat2_index
required files: none
required parameters: none
required asset: fasta
required software: hisat2
refgenie build test/hisat2_index
epilog_index
required files: none
required parameters: --params context=CG (default)
required asset: fasta
required software: epilog
refgenie build test/epilog_index --params context=CG
kallisto_index
required files: none
required parameters: none
required asset: fasta
required software: kallisto
refgenie build test/kallisto_index
salmon_index
required files: none
required parameters: none
required asset: fasta
required software: salmon
refgenie build test/salmon_index
star_index
required files: none
required parameters: none
required asset: fasta
required software: star
refgenie build test/star_index
suffixerator_index
required files: none
required parameters: --params memlimit=8GB (default)
required asset: fasta
required software: GenomeTools
refgenie build test/suffixerator_index --params memlimit=8GB
tallymer_index
required files: none
required parameters: --params mersize=30 minocc=2 (default)
required asset: fasta
required software: GenomeTools
refgenie build test/tallymer_index --params mersize=30 minocc=2
feat_annotation
required files: none
required parameters: none
required asset: ensembl_gtf, ensembl_rb
required software: none
The feat_annotation asset includes the following genomic feature annotations: enhancers, promoters, promoter flanking regions, 5' UTR, 3' UTR, exons, and introns.
refgenie build test/feat_annotation
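Since feat_annotation requires both ensembl_gtf and ensembl_rb, those two assets must exist for the same genome before it can be built. The following sketch chains the three builds in order. build_feat_annotation and the REFGENIE variable are hypothetical names used only for illustration; the refgenie build commands themselves follow the examples given for each asset above.

```shell
#!/bin/sh
# Hypothetical helper (not part of refgenie): build the two required assets,
# then feat_annotation, for one genome.
# REFGENIE is an assumption of this sketch; override it for a dry run,
# e.g. REFGENIE="echo refgenie".
REFGENIE="${REFGENIE:-refgenie}"

build_feat_annotation() {
    genome="$1"  # genome name, e.g. hg38
    gtf="$2"     # Ensembl GTF file
    gff="$3"     # Ensembl Regulatory Build GFF file

    # Each step runs only if the previous one succeeded.
    $REFGENIE build "$genome/ensembl_gtf" --files ensembl_gtf="$gtf" &&
    $REFGENIE build "$genome/ensembl_rb" --files gff="$gff" &&
    $REFGENIE build "$genome/feat_annotation"
}
```

For example, build_feat_annotation hg38 Homo_sapiens.GRCh38.97.gtf.gz homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz uses the example files downloaded in the ensembl_gtf and ensembl_rb sections.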
[^Amemiya2019]: Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep 2019;9:9354. doi:10.1038/s41598-019-45839-z