Computing digests locally
Introduction
This tutorial shows you how to compute GA4GH content-addressable identifiers for sequences and FASTA files using the refget package.
New to GA4GH digests?
See What are GA4GH digests? for background on why content-addressable identifiers matter.
Learning objectives
- Compute a refget sequence digest for a sequence
- Compute a refget sequence collection digest from a FASTA file
- Understand level 1 and level 2 representations
Computing a sequence digest
Use sha512t24u_digest() to compute the GA4GH digest for any string:
from refget.digests import sha512t24u_digest
from refget.store import digest_fasta
from refget.utils import fasta_to_seqcol_dict
sha512t24u_digest('GGAA')
'YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj'
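To make the digest less opaque, here is a minimal standard-library sketch of the convention the name sha512t24u describes: SHA-512, truncated to the first 24 bytes, then base64url-encoded. This is an illustration only, not the refget package's implementation; the function name `sha512t24u_sketch` is hypothetical.

```python
import base64
import hashlib


def sha512t24u_sketch(data):
    """Illustrative GA4GH-style digest: SHA-512, truncated to 24 bytes, base64url."""
    # Assumption: text input is encoded as UTF-8 before hashing.
    if isinstance(data, str):
        data = data.encode("utf-8")
    full = hashlib.sha512(data).digest()  # 64 raw bytes
    truncated = full[:24]                 # keep the first 24 bytes
    # 24 bytes encode to exactly 32 base64url characters, so no '=' padding appears.
    return base64.urlsafe_b64encode(truncated).decode("ascii")


digest = sha512t24u_sketch("GGAA")
print(digest)
```

If refget follows the same convention, the printed value should equal the digest shown above for 'GGAA'; treat a mismatch as a sign that one of the sketch's assumptions differs.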
Computing a collection digest from FASTA
For a FASTA file containing multiple sequences, compute the top-level collection digest:
digest_fasta('../../../test_fasta/base.fa').digest
'XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk'
Getting the full sequence collection (level 2)
To see the complete sequence collection representation with all attributes (names, lengths, sequences, and derived attributes), use fasta_to_seqcol_dict():
Note: sequences contains SQ.-prefixed digests for each sequence. The sorted_* attributes enable order-independent comparison.
fasta_to_seqcol_dict('../../../test_fasta/base.fa')
{'lengths': [8, 4, 4],
 'names': ['chrX', 'chr1', 'chr2'],
 'sequences': ['SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw',
               'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj',
               'SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6'],
 'sorted_name_length_pairs': ['IWFt7HQ4XoMk34U27BKO-4szSRifP6H5',
                              'chDD8A4S8YZKNNctCimHasAA2Dn596SZ',
                              'enZNOGccwFbN9yJ3YZVifFTFCVA9hIpH'],
 'sorted_sequences': ['SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw',
                      'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj',
                      'SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6']}
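The relationship between the levels can be sketched as follows: each level 2 array is serialized to canonical JSON and digested to give a level 1 digest, and the level 1 digests of the inherent attributes are digested once more to give the top-level digest. The code below is a rough standard-library illustration, not the refget implementation. It assumes that `json.dumps` with sorted keys and compact separators is an adequate stand-in for RFC 8785 canonicalization on this simple data, and that `names` and `sequences` are the inherent attributes (in practice this is set by the schema in use). All function names are hypothetical.

```python
import base64
import hashlib
import json


def sha512t24u_sketch(data: bytes) -> str:
    # SHA-512, truncated to 24 bytes, base64url-encoded.
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    # Assumption: approximates RFC 8785 for ASCII strings and small integers.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def level1_from_level2(level2: dict) -> dict:
    # One digest per attribute array.
    return {attr: sha512t24u_sketch(canonical_json(vals)) for attr, vals in level2.items()}


def top_level_digest(level1: dict, inherent=("names", "sequences")) -> str:
    # Assumption: only inherent attributes contribute to the top-level digest.
    subset = {attr: level1[attr] for attr in inherent}
    return sha512t24u_sketch(canonical_json(subset))


def sorted_name_length_pairs(names, lengths) -> list:
    # Digest each {"length", "name"} object, then sort the digests so that
    # the result no longer depends on the order of the sequences.
    digests = [
        sha512t24u_sketch(canonical_json({"length": length, "name": name}))
        for name, length in zip(names, lengths)
    ]
    return sorted(digests)


# Level 2 values copied from the output above.
level2 = {
    "lengths": [8, 4, 4],
    "names": ["chrX", "chr1", "chr2"],
    "sequences": [
        "SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw",
        "SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj",
        "SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6",
    ],
}
level1 = level1_from_level2(level2)
top = top_level_digest(level1)
pairs = sorted_name_length_pairs(level2["names"], level2["lengths"])
print(level1)
print(top)
print(pairs)
```

Whether these values agree with the digests reported by refget depends on the assumptions listed above, so use the sketch to understand the structure, not as a reference implementation.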
Iterating over sequences
For lower-level access to individual sequence metadata (name, length, digests), use the digest_fasta() function, which returns an iterator over each sequence.
Summary
- sha512t24u_digest() computes GA4GH digests for sequences
- digest_fasta() computes collection digests from FASTA files
- fasta_to_seqcol_dict() returns full level 2 sequence collection data