
Refget Python API Documentation

FASTA Processing

digest_fasta

digest_fasta(fasta)

Digest all sequences in a FASTA file and compute collection-level digests.

This function reads a FASTA file and computes GA4GH-compliant digests for each sequence, as well as collection-level digests (Level 1 and Level 2) following the GA4GH refget specification.

Parameters:

Name Type Description Default
fasta Union[str, PathLike]

Path to FASTA file (str or PathLike).

required

Returns:

Type Description
SequenceCollection

Collection containing all sequences with their metadata and computed digests.

Raises:

Type Description
IOError

If the FASTA file cannot be read or parsed.

Example:

from gtars.refget import digest_fasta

collection = digest_fasta("genome.fa")
print(f"Collection digest: {collection.digest}")
print(f"Number of sequences: {len(collection)}")

fasta_to_seqcol_dict

fasta_to_seqcol_dict(fasta_file_path)

Convert a FASTA file into a Sequence Collection dict.

Parameters:

Name Type Description Default
fasta_file_path Union[str, Path]

Path to the FASTA file

required

Returns:

Name Type Description
dict dict

A canonical sequence collection dictionary

Raises:

Type Description
ImportError

If gtars is not installed (required for FASTA processing)

Source code in refget/utils.py
def fasta_to_seqcol_dict(fasta_file_path: Union[str, Path]) -> dict:
    """
    Convert a FASTA file into a Sequence Collection dict.

    Args:
        fasta_file_path: Path to the FASTA file

    Returns:
        dict: A canonical sequence collection dictionary

    Raises:
        ImportError: If gtars is not installed (required for FASTA processing)
    """
    if not GTARS_INSTALLED:
        raise ImportError("fasta_to_seqcol_dict requires gtars. Install with: pip install gtars")

    from gtars.refget import digest_fasta

    fasta_seq_digests = digest_fasta(fasta_file_path)
    seqcol_dict = {
        "lengths": [],
        "names": [],
        "sequences": [],
        "sorted_name_length_pairs": [],
        "sorted_sequences": [],
    }
    for s in fasta_seq_digests.sequences:
        seq_name = s.metadata.name
        seq_length = s.metadata.length
        seq_digest = "SQ." + s.metadata.sha512t24u
        nlp = {"length": seq_length, "name": seq_name}
        snlp_digest = sha512t24u_digest(canonical_str(nlp))
        seqcol_dict["lengths"].append(seq_length)
        seqcol_dict["names"].append(seq_name)
        seqcol_dict["sorted_name_length_pairs"].append(snlp_digest)
        seqcol_dict["sequences"].append(seq_digest)
        seqcol_dict["sorted_sequences"].append(seq_digest)
    seqcol_dict["sorted_name_length_pairs"].sort()
    return seqcol_dict
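
A minimal usage sketch (the FASTA path is a placeholder, gtars must be installed, and the import path follows the source location shown above):

from refget.utils import fasta_to_seqcol_dict

seqcol = fasta_to_seqcol_dict("genome.fa")  # hypothetical local FASTA file
print(seqcol["names"])    # sequence names, in file order
print(seqcol["lengths"])  # corresponding sequence lengths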

compare_seqcols

compare_seqcols(A, B)

Workhorse comparison function

Parameters:

Name Type Description Default
A SeqColDict

Sequence collection A

required
B SeqColDict

Sequence collection B

required

Returns:

Name Type Description
dict dict

A dict following the formal seqcol specification's comparison function return value

Source code in refget/utils.py
def compare_seqcols(A: SeqColDict, B: SeqColDict) -> dict:
    """
    Workhorse comparison function

    Args:
        A: Sequence collection A
        B: Sequence collection B

    Returns:
        dict: Following formal seqcol specification comparison function return value
    """
    # validate_seqcol(A)  # First ensure these are the right structure
    # validate_seqcol(B)
    a_keys = list(A.keys())
    b_keys = list(B.keys())
    a_keys.sort()
    b_keys.sort()

    all_keys = a_keys + list(set(b_keys) - set(a_keys))
    all_keys.sort()
    result = {}

    # Compute lengths of each array; only do this for array attributes
    a_lengths = {}
    b_lengths = {}
    for k in a_keys:
        a_lengths[k] = len(A[k])
    for k in b_keys:
        b_lengths[k] = len(B[k])

    return_obj = {
        "attributes": {"a_only": [], "b_only": [], "a_and_b": []},
        "array_elements": {
            "a_count": a_lengths,
            "b_count": b_lengths,
            "a_and_b_count": {},
            "a_and_b_same_order": {},
        },
    }

    for k in all_keys:
        _LOGGER.debug(k)
        if k not in A:
            result[k] = {"flag": -1}
            return_obj["attributes"]["b_only"].append(k)
            # return_obj["array_elements"]["total"][k] = {"a": None, "b": len(B[k])}
        elif k not in B:
            return_obj["attributes"]["a_only"].append(k)
            # return_obj["array_elements"]["total"][k] = {"a": len(A[k]), "b": None}
        else:
            return_obj["attributes"]["a_and_b"].append(k)
            res = _compare_elements(A[k], B[k])
            # return_obj["array_elements"]["total"][k] = {"a": len(A[k]), "b": len(B[k])}
            return_obj["array_elements"]["a_and_b_count"][k] = res["a_and_b"]
            return_obj["array_elements"]["a_and_b_same_order"][k] = res["a_and_b_same_order"]
    return return_obj
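
For example, a sketch comparing two collections built locally (file paths are placeholders; imports follow the source locations shown above):

from refget.utils import fasta_to_seqcol_dict, compare_seqcols

A = fasta_to_seqcol_dict("genome_a.fa")
B = fasta_to_seqcol_dict("genome_b.fa")
result = compare_seqcols(A, B)
print(result["attributes"]["a_and_b"])            # attributes present in both collections
print(result["array_elements"]["a_and_b_count"])  # shared element counts per attribute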

calc_jaccard_similarities

calc_jaccard_similarities(A, B)

Takes two sequence collections and calculates Jaccard similarities for all attributes

Parameters:

Name Type Description Default
A SeqColDict

Sequence collection A

required
B SeqColDict

Sequence collection B

required

Returns:

Name Type Description
dict dict[str, float]

Jaccard similarities for all attributes

Source code in refget/utils.py
def calc_jaccard_similarities(A: SeqColDict, B: SeqColDict) -> dict[str, float]:
    """
    Takes two sequence collections and calculates Jaccard similarities for all attributes

    Args:
        A: Sequence collection A
        B: Sequence collection B

    Returns:
        dict: Jaccard similarities for all attributes
    """

    def calc_jaccard_similarity(A_B_intersection: int, A_B_union: int) -> float:
        if A_B_union == 0:
            return 0.0
        jaccard_similarity = A_B_intersection / A_B_union
        return jaccard_similarity

    jaccard_similarities = {}

    if (
        "human_readable_names" in A.keys()
    ):  # this can cause issues if key exists but is NoneType when comparing with compare_seqcols()
        del A["human_readable_names"]
    if "human_readable_names" in B.keys():
        del B["human_readable_names"]

    comparison_dict = compare_seqcols(A, B)

    list_a_keys = list(comparison_dict["array_elements"]["a_and_b_count"].keys())

    for key in list_a_keys:
        intersection_seqcol = comparison_dict["array_elements"]["a_and_b_count"].get(key)
        a = comparison_dict["array_elements"]["a_count"].get(key)
        b = comparison_dict["array_elements"]["b_count"].get(key)
        union_seqcol = (
            a + b - intersection_seqcol
    )  # inclusion-exclusion principle for calculating union
        jaccard_similarity = calc_jaccard_similarity(intersection_seqcol, union_seqcol)
        jaccard_similarities.update({key: jaccard_similarity})
    return jaccard_similarities
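
A short sketch, again with placeholder FASTA paths:

from refget.utils import fasta_to_seqcol_dict, calc_jaccard_similarities

A = fasta_to_seqcol_dict("genome_a.fa")
B = fasta_to_seqcol_dict("genome_b.fa")
similarities = calc_jaccard_similarities(A, B)
print(similarities["sequences"])  # 1.0 when both files contain identical sequence sets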

validate_seqcol

validate_seqcol(seqcol_obj, schema=None)

Validate a seqcol object against the seqcol schema. Returns True if valid, raises InvalidSeqColError if not, which enumerates the errors. Retrieve individual errors with exception.errors

Source code in refget/utils.py
def validate_seqcol(seqcol_obj: SeqColDict, schema=None) -> bool:
    """Validate a seqcol object against the seqcol schema.
    Returns True if valid, raises InvalidSeqColError if not, which enumerates the errors.
    Retrieve individual errors with exception.errors
    """
    with open(SEQCOL_SCHEMA_PATH, "r") as f:
        schema = json.load(f)
    validator = Draft7Validator(schema)

    if not validator.is_valid(seqcol_obj):
        errors = sorted(validator.iter_errors(seqcol_obj), key=lambda e: e.path)
        raise InvalidSeqColError("Validation failed", errors)
    return True

validate_seqcol_bool

validate_seqcol_bool(seqcol_obj, schema=None)

Validate a seqcol object against the seqcol schema. Returns True if valid, False if not.

To enumerate the errors, use validate_seqcol instead.

Source code in refget/utils.py
def validate_seqcol_bool(seqcol_obj: SeqColDict, schema=None) -> bool:
    """
    Validate a seqcol object against the seqcol schema. Returns True if valid, False if not.

    To enumerate the errors, use validate_seqcol instead.
    """
    with open(SEQCOL_SCHEMA_PATH, "r") as f:
        schema = json.load(f)
    validator = Draft7Validator(schema)
    return validator.is_valid(seqcol_obj)
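
A small sketch combining the two validators (the FASTA path is a placeholder; imports follow the source locations shown above):

from refget.utils import fasta_to_seqcol_dict, validate_seqcol, validate_seqcol_bool

seqcol = fasta_to_seqcol_dict("genome.fa")
if validate_seqcol_bool(seqcol):
    print("seqcol is valid")
else:
    validate_seqcol(seqcol)  # raises InvalidSeqColError listing the individual errors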

FastAPI Integration

create_refget_router

create_refget_router(sequences=False, collections=True, pangenomes=False, fasta_drs=False, refget_store_url=None)

Create a FastAPI router for the sequence collection API. This router provides endpoints for retrieving and comparing sequence collections. You can choose which endpoints to include by setting the sequences, collections, pangenomes, or fasta_drs flags.

Parameters:

Name Type Description Default
sequences bool

Include sequence endpoints

False
collections bool

Include sequence collection endpoints

True
pangenomes bool

Include pangenome endpoints

False
fasta_drs bool

Include FASTA DRS endpoints

False
refget_store_url str

URL of backing RefgetStore (e.g., s3://bucket/store/)

None

Returns:

Type Description
APIRouter

A FastAPI router with the specified endpoints

Examples:

app.include_router(create_refget_router(fasta_drs=True), prefix="/seqcol")
Source code in refget/router.py
def create_refget_router(
    sequences: bool = False,
    collections: bool = True,
    pangenomes: bool = False,
    fasta_drs: bool = False,
    refget_store_url: str = None,
) -> APIRouter:
    """
    Create a FastAPI router for the sequence collection API.
    This router provides endpoints for retrieving and comparing sequence collections.
    You can choose which endpoints to include by setting the sequences, collections,
    pangenomes, or fasta_drs flags.

    Args:
        sequences (bool): Include sequence endpoints
        collections (bool): Include sequence collection endpoints
        pangenomes (bool): Include pangenome endpoints
        fasta_drs (bool): Include FASTA DRS endpoints
        refget_store_url (str): URL of backing RefgetStore (e.g., s3://bucket/store/)

    Returns:
        (APIRouter): A FastAPI router with the specified endpoints

    Examples:
        ```
        app.include_router(create_refget_router(fasta_drs=True), prefix="/seqcol")
        ```
    """
    # Store config for service-info discovery
    _ROUTER_CONFIG["fasta_drs"] = fasta_drs
    _ROUTER_CONFIG["refget_store_url"] = refget_store_url

    refget_router = APIRouter()
    if sequences:
        _LOGGER.info("Adding sequence endpoints...")
        refget_router.include_router(seq_router)
    if collections:
        _LOGGER.info("Adding collection endpoints...")
        refget_router.include_router(seqcol_router)
    if pangenomes:
        _LOGGER.info("Adding pangenome endpoints...")
        refget_router.include_router(pangenome_router)
    if fasta_drs:
        _LOGGER.info("Adding FASTA DRS endpoints...")
        refget_router.include_router(fasta_drs_router, prefix="/fasta")
    return refget_router
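
A minimal sketch of mounting the router on a FastAPI app (the app object and prefix are illustrative; the import path follows the source location shown above):

from fastapi import FastAPI
from refget.router import create_refget_router

app = FastAPI()
# collection endpoints are on by default; fasta_drs adds the DRS endpoints under /fasta
app.include_router(create_refget_router(fasta_drs=True), prefix="/seqcol")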

Client Classes

The client module provides interfaces for interacting with refget-compliant servers.

SequenceClient

SequenceClient(urls=['https://www.ebi.ac.uk/ena/cram'], raise_errors=None)

Bases: RefgetClient

A client for interacting with a refget sequences API.

Initializes the sequences client.

Parameters:

Name Type Description Default
urls list

A list of base URLs of the sequences API. Defaults to ["https://www.ebi.ac.uk/ena/cram"].

['https://www.ebi.ac.uk/ena/cram']
raise_errors bool

Whether to raise errors or log them. Defaults to None, which will guess.

None

Attributes: urls (list): The list of base URLs of the sequences API.

Source code in refget/clients.py
def __init__(
    self,
    urls: list[str] = ["https://www.ebi.ac.uk/ena/cram"],
    raise_errors: Optional[bool] = None,
) -> None:
    """
    Initializes the sequences client.

    Args:
        urls (list, optional): A list of base URLs of the sequences API. Defaults to ["https://www.ebi.ac.uk/ena/cram"].
        raise_errors (bool, optional): Whether to raise errors or log them. Defaults to None, which will guess.
    Attributes:
        urls (list): The list of base URLs of the sequences API.
    """
    # Remove trailing slashes from input URLs
    self.urls = [url.rstrip("/") for url in urls]
    # If raise_errors is None, set it to True if the client is not being used as a library
    if raise_errors is None:
        raise_errors = __name__ == "__main__"
    self.raise_errors = raise_errors

get_metadata

get_metadata(digest)

Retrieves metadata for a given sequence digest.

Parameters:

Name Type Description Default
digest str

The digest of the sequence.

required

Returns:

Type Description
dict

The metadata.

Source code in refget/clients.py
def get_metadata(self, digest: str) -> Optional[dict]:
    """
    Retrieves metadata for a given sequence digest.

    Args:
        digest (str): The digest of the sequence.

    Returns:
        (dict): The metadata.
    """
    endpoint = f"/sequence/{digest}/metadata"
    return _try_urls(self.urls, endpoint, raise_errors=self.raise_errors)

get_sequence

get_sequence(digest, start=None, end=None)

Retrieves a sequence for a given digest.

Parameters:

Name Type Description Default
digest str

The digest of the sequence.

required
start int

Optional start coordinate of the subsequence, passed as a query parameter.

None
end int

Optional end coordinate of the subsequence, passed as a query parameter.

None

Returns:

Type Description
str

The sequence.

Source code in refget/clients.py
def get_sequence(
    self, digest: str, start: Optional[int] = None, end: Optional[int] = None
) -> Optional[str]:
    """
    Retrieves a sequence for a given digest.

    Args:
        digest (str): The digest of the sequence.

    Returns:
        (str): The sequence.
    """
    query_params = {}
    if start is not None:
        query_params["start"] = start
    if end is not None:
        query_params["end"] = end

    endpoint = f"/sequence/{digest}"
    return _try_urls(self.urls, endpoint, params=query_params, raise_errors=self.raise_errors)
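
A usage sketch against the default EBI ENA endpoint (the digest below is a placeholder):

from refget.clients import SequenceClient

client = SequenceClient()
seq = client.get_sequence("sequence_digest", start=0, end=10)  # subsequence via the API's start/end query parameters
meta = client.get_metadata("sequence_digest")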

SequenceCollectionClient

SequenceCollectionClient(urls=['https://seqcolapi.databio.org'], raise_errors=None)

Bases: RefgetClient

A client for interacting with a refget sequence collections API.

Initializes the sequence collection client.

Parameters:

Name Type Description Default
urls list

A list of base URLs of the sequence collection API. Defaults to ["https://seqcolapi.databio.org"].

['https://seqcolapi.databio.org']
raise_errors bool

Whether to raise errors or log them. Defaults to None, which will guess.

None

Attributes:

Name Type Description
urls list

The list of base URLs of the sequence collection API.

Source code in refget/clients.py
def __init__(
    self,
    urls: list[str] = ["https://seqcolapi.databio.org"],
    raise_errors: Optional[bool] = None,
) -> None:
    """
    Initializes the sequence collection client.

    Args:
        urls (list, optional): A list of base URLs of the sequence collection API. Defaults to ["https://seqcolapi.databio.org"].

    Attributes:
        urls (list): The list of base URLs of the sequence collection API.
    """
    # Remove trailing slashes from input URLs
    self.urls = [url.rstrip("/") for url in urls]
    # If raise_errors is None, set it to True if the client is not being used as a library
    if raise_errors is None:
        raise_errors = __name__ == "__main__"
    self.raise_errors = raise_errors
    self._fasta_client = None
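
A usage sketch against the default server (digests are placeholders):

from refget.clients import SequenceCollectionClient

client = SequenceCollectionClient()
collection = client.get_collection("collection_digest", level=2)
comparison = client.compare("collection_digest_a", "collection_digest_b")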

build_chrom_sizes

build_chrom_sizes(digest)

Build a chrom.sizes file content for a sequence collection.

Format per line: NAME\tLENGTH

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required

Returns:

Type Description
str

String content of the chrom.sizes file

Source code in refget/clients.py
def build_chrom_sizes(self, digest: str) -> str:
    """
    Build a chrom.sizes file content for a sequence collection.

    Format per line: NAME\\tLENGTH

    Args:
        digest (str): The sequence collection digest

    Returns:
        (str): String content of the chrom.sizes file
    """
    collection = self.get_collection(digest, level=2)
    if not collection:
        raise ValueError(f"No collection found for {digest}")

    names = collection["names"]
    lengths = collection["lengths"]

    lines = []
    for name, length in zip(names, lengths):
        lines.append(f"{name}\t{length}")

    return "\n".join(lines) + "\n"

build_fai

build_fai(digest)

Build a complete .fai index file content for a FASTA.

FAI format per line: NAME\tLENGTH\tOFFSET\tLINEBASES\tLINEWIDTH

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required

Returns:

Type Description
str

String content of the .fai file

Source code in refget/clients.py
def build_fai(self, digest: str) -> str:
    """
    Build a complete .fai index file content for a FASTA.

    FAI format per line: NAME\\tLENGTH\\tOFFSET\\tLINEBASES\\tLINEWIDTH

    Args:
        digest (str): The sequence collection digest

    Returns:
        (str): String content of the .fai file
    """
    return self._get_fasta_helper().build_fai(digest, seqcol_client=self)

compare

compare(digest1, digest2)

Compares two sequence collections hosted on the server.

Parameters:

Name Type Description Default
digest1 str

The digest of the first sequence collection.

required
digest2 str

The digest of the second sequence collection.

required

Returns:

Type Description
dict

The JSON response containing the comparison of the two sequence collections.

Source code in refget/clients.py
def compare(self, digest1: str, digest2: str) -> Optional[dict]:
    """
    Compares two sequence collections hosted on the server.

    Args:
        digest1 (str): The digest of the first sequence collection.
        digest2 (str): The digest of the second sequence collection.

    Returns:
        (dict): The JSON response containing the comparison of the two sequence collections.
    """
    endpoint = f"/comparison/{digest1}/{digest2}"
    return _try_urls(self.urls, endpoint)

compare_local

compare_local(digest, local_collection)

Compares a server-hosted sequence collection with a local collection.

Parameters:

Name Type Description Default
digest str

The digest of the server-hosted sequence collection.

required
local_collection dict

A level 2 sequence collection representation.

required

Returns:

Type Description
dict

The JSON response containing the comparison.

Source code in refget/clients.py
def compare_local(self, digest: str, local_collection: dict) -> Optional[dict]:
    """
    Compares a server-hosted sequence collection with a local collection.

    Args:
        digest (str): The digest of the server-hosted sequence collection.
        local_collection (dict): A level 2 sequence collection representation.

    Returns:
        (dict): The JSON response containing the comparison.
    """
    endpoint = f"/comparison/{digest}"
    return _try_urls(self.urls, endpoint, method="POST", json=local_collection)

download_fasta

download_fasta(digest, dest_path=None, access_id=None)

Download the FASTA file to a local path.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
dest_path str

Destination file path. If None, uses object name.

None
access_id str

Specific access method to use. If None, tries all.

None

Returns:

Type Description
str

Path to downloaded file

Raises:

Type Description
ValueError

If no access methods available or specified access_id not found

Source code in refget/clients.py
def download_fasta(self, digest: str, dest_path: str = None, access_id: str = None) -> str:
    """
    Download the FASTA file to a local path.

    Args:
        digest (str): The sequence collection digest
        dest_path (str, optional): Destination file path. If None, uses object name.
        access_id (str, optional): Specific access method to use. If None, tries all.

    Returns:
        (str): Path to downloaded file

    Raises:
        ValueError: If no access methods available or specified access_id not found
    """
    return self._get_fasta_helper().download(digest, dest_path, access_id)

download_fasta_to_store

download_fasta_to_store(digest, store, access_id=None, temp_dir=None)

Download the FASTA file and import it into a RefgetStore.

This method downloads the FASTA file from the DRS endpoint and immediately imports it into the provided RefgetStore, enabling local sequence retrieval by digest without re-downloading.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
store RefgetStore

The RefgetStore instance to import into

required
access_id str

Specific access method to use. If None, tries all.

None
temp_dir str

Directory for temporary download. If None, uses system temp.

None

Returns:

Type Description
str

The collection digest of the imported sequences

Raises:

Type Description
ValueError

If no access methods available or specified access_id not found

ImportError

If gtars/RefgetStore is not available

Example

from refget.store import RefgetStore, StorageMode
from refget.clients import SequenceCollectionClient

store = RefgetStore(StorageMode.Encoded)
client = SequenceCollectionClient()
collection_digest = client.download_fasta_to_store("abc123", store)

# Now you can retrieve sequences by digest from the local store
seq = store.get_substring(sequence_digest, 0, 100)

Source code in refget/clients.py
def download_fasta_to_store(
    self, digest: str, store: "RefgetStore", access_id: str = None, temp_dir: str = None
) -> str:
    """
    Download the FASTA file and import it into a RefgetStore.

    This method downloads the FASTA file from the DRS endpoint and immediately
    imports it into the provided RefgetStore, enabling local sequence retrieval
    by digest without re-downloading.

    Args:
        digest (str): The sequence collection digest
        store (RefgetStore): The RefgetStore instance to import into
        access_id (str, optional): Specific access method to use. If None, tries all.
        temp_dir (str, optional): Directory for temporary download. If None, uses system temp.

    Returns:
        (str): The collection digest of the imported sequences

    Raises:
        ValueError: If no access methods available or specified access_id not found
        ImportError: If gtars/RefgetStore is not available

    Example:
        >>> from refget.store import RefgetStore, StorageMode
        >>> from refget.clients import SequenceCollectionClient
        >>> store = RefgetStore(StorageMode.Encoded)
        >>> client = SequenceCollectionClient()
        >>> collection_digest = client.download_fasta_to_store("abc123", store)
        >>> # Now you can retrieve sequences by digest from the local store
        >>> seq = store.get_substring(sequence_digest, 0, 100)
    """
    return self._get_fasta_helper().download_to_store(digest, store, access_id, temp_dir)

get_attribute

get_attribute(attribute, digest)

Retrieves a specific attribute value by its digest.

Parameters:

Name Type Description Default
attribute str

The attribute name (e.g., "names", "lengths", "sequences").

required
digest str

The level 1 digest of the attribute.

required

Returns:

Type Description
dict

The JSON response containing the attribute value.

Source code in refget/clients.py
def get_attribute(self, attribute: str, digest: str) -> Optional[dict]:
    """
    Retrieves a specific attribute value by its digest.

    Args:
        attribute (str): The attribute name (e.g., "names", "lengths", "sequences").
        digest (str): The level 1 digest of the attribute.

    Returns:
        (dict): The JSON response containing the attribute value.
    """
    endpoint = f"/attribute/collection/{attribute}/{digest}"
    return _try_urls(self.urls, endpoint)

get_collection

get_collection(digest, level=2)

Retrieves a sequence collection for a given digest and detail level.

Parameters:

Name Type Description Default
digest str

The digest of the sequence collection.

required
level int

The level of detail for the sequence collection. Defaults to 2.

2

Returns:

Type Description
dict

The JSON response containing the sequence collection.

Source code in refget/clients.py
def get_collection(self, digest: str, level: int = 2) -> Optional[dict]:
    """
    Retrieves a sequence collection for a given digest and detail level.

    Args:
        digest (str): The digest of the sequence collection.
        level (int, optional): The level of detail for the sequence collection. Defaults to 2.

    Returns:
        (dict): The JSON response containing the sequence collection.
    """
    endpoint = f"/collection/{digest}?level={level}"
    return _try_urls(self.urls, endpoint)

get_fasta

get_fasta(digest)

Get DRS object metadata for a FASTA file.

Parameters:

Name Type Description Default
digest str

The sequence collection digest (which is also the DRS object ID)

required

Returns:

Type Description
dict

DRS object with id, self_uri, size, checksums, access_methods, etc.

Source code in refget/clients.py
def get_fasta(self, digest: str) -> Optional[dict]:
    """
    Get DRS object metadata for a FASTA file.

    Args:
        digest (str): The sequence collection digest (which is also the DRS object ID)

    Returns:
        (dict): DRS object with id, self_uri, size, checksums, access_methods, etc.
    """
    return self._get_fasta_helper().get_object(digest)

get_fasta_index

get_fasta_index(digest)

Get FAI index data for a FASTA file.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required

Returns:

Type Description
dict

Dict with line_bases, extra_line_bytes, offsets

Source code in refget/clients.py
def get_fasta_index(self, digest: str) -> Optional[dict]:
    """
    Get FAI index data for a FASTA file.

    Args:
        digest (str): The sequence collection digest

    Returns:
        (dict): Dict with line_bases, extra_line_bytes, offsets
    """
    return self._get_fasta_helper().get_index(digest)

get_refget_store

get_refget_store(cache_dir)

Get a RefgetStore instance connected to the server's backing store.

Parameters:

Name Type Description Default
cache_dir str

Local directory for caching store data

required

Returns:

Type Description
RefgetStore

RefgetStore instance loaded from remote

Raises:

Type Description
ValueError

If server doesn't have a RefgetStore configured

ImportError

If gtars is not installed

Source code in refget/clients.py
def get_refget_store(self, cache_dir: str) -> "RefgetStore":
    """
    Get a RefgetStore instance connected to the server's backing store.

    Args:
        cache_dir (str): Local directory for caching store data

    Returns:
        (RefgetStore): RefgetStore instance loaded from remote

    Raises:
        ValueError: If server doesn't have a RefgetStore configured
        ImportError: If gtars is not installed
    """
    url = self.get_refget_store_url()
    if not url:
        raise ValueError("Server does not have a RefgetStore configured")

    try:
        from .store import RefgetStore
    except ImportError:
        raise ImportError("gtars is required: pip install gtars")

    return RefgetStore.load_remote(cache_dir, url)
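
A sketch, assuming the server advertises a backing store in its service-info (the cache directory is illustrative):

from refget.clients import SequenceCollectionClient

client = SequenceCollectionClient()
if client.get_refget_store_url():
    store = client.get_refget_store(cache_dir="./refget_cache")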

get_refget_store_url

get_refget_store_url()

Discover RefgetStore URL from service-info if available.

Returns:

Type Description
str

The RefgetStore URL if configured, None otherwise.

Source code in refget/clients.py
def get_refget_store_url(self) -> Optional[str]:
    """
    Discover RefgetStore URL from service-info if available.

    Returns:
        (str): The RefgetStore URL if configured, None otherwise.
    """
    info = self.service_info()
    store_config = info.get("seqcol", {}).get("refget_store", {})
    if store_config.get("enabled"):
        return store_config.get("url")
    return None

is_fasta_drs_enabled

is_fasta_drs_enabled()

Check if FastaDRS endpoints are available.

Returns:

Type Description
bool

True if FastaDRS is enabled, False otherwise.

Source code in refget/clients.py
def is_fasta_drs_enabled(self) -> bool:
    """
    Check if FastaDRS endpoints are available.

    Returns:
        (bool): True if FastaDRS is enabled, False otherwise.
    """
    info = self.service_info()
    return info.get("seqcol", {}).get("fasta_drs", {}).get("enabled", False)

list_attributes

list_attributes(attribute, page=None, page_size=None)

Lists all available values for a given attribute with optional paging support.

Parameters:

Name Type Description Default
attribute str

The attribute to list values for.

required
page int

The page number to retrieve. Defaults to None.

None
page_size int

The number of items per page. Defaults to None.

None

Returns:

Type Description
dict

The JSON response containing the list of available values for the attribute.

Source code in refget/clients.py
def list_attributes(
    self, attribute: str, page: Optional[int] = None, page_size: Optional[int] = None
) -> Optional[dict]:
    """
    Lists all available values for a given attribute with optional paging support.

    Args:
        attribute (str): The attribute to list values for.
        page (int, optional): The page number to retrieve. Defaults to None.
        page_size (int, optional): The number of items per page. Defaults to None.

    Returns:
        (dict): The JSON response containing the list of available values for the attribute.
    """
    params = {}
    if page is not None:
        params["page"] = page
    if page_size is not None:
        params["page_size"] = page_size

    endpoint = f"/list/attributes/{attribute}"
    return _try_urls(self.urls, endpoint, params=params)

list_collections

list_collections(page=None, page_size=None, **filters)

Lists all available sequence collections with optional paging and attribute filtering support.

Parameters:

Name Type Description Default
page int

The page number to retrieve. Defaults to None.

None
page_size int

The number of items per page. Defaults to None.

None
**filters Any

Optional attribute filters (e.g., names="abc123", lengths="def456"). Values should be level 1 digests of the attributes.

{}

Returns:

Type Description
dict

The JSON response containing the list of available sequence collections.

Source code in refget/clients.py
def list_collections(
    self,
    page: Optional[int] = None,
    page_size: Optional[int] = None,
    **filters,
) -> Optional[dict]:
    """
    Lists all available sequence collections with optional paging and attribute filtering support.

    Args:
        page (int, optional): The page number to retrieve. Defaults to None.
        page_size (int, optional): The number of items per page. Defaults to None.
        **filters (Any): Optional attribute filters (e.g., names="abc123", lengths="def456").
                  Values should be level 1 digests of the attributes.

    Returns:
        (dict): The JSON response containing the list of available sequence collections.
    """
    params = {}
    if page is not None:
        params["page"] = page
    if page_size is not None:
        params["page_size"] = page_size
    params.update(filters)

    endpoint = "/list/collection"
    return _try_urls(self.urls, endpoint, params=params)
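
For example (the filter value is a placeholder level 1 digest):

from refget.clients import SequenceCollectionClient

client = SequenceCollectionClient()
collections = client.list_collections(page_size=10)
filtered = client.list_collections(names="names_attribute_digest")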

service_info

service_info()

Retrieves information about the service.

Returns:

Type Description
dict

The service information.

Source code in refget/clients.py
def service_info(self) -> Optional[dict]:
    """
    Retrieves information about the service.

    Returns:
        (dict): The service information.
    """
    endpoint = "/service-info"
    return _try_urls(self.urls, endpoint)

write_chrom_sizes

write_chrom_sizes(digest, dest_path)

Write a chrom.sizes file for a sequence collection.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
dest_path str

Path to write the chrom.sizes file

required

Returns:

Type Description
str

Path to the written file

Source code in refget/clients.py
def write_chrom_sizes(self, digest: str, dest_path: str) -> str:
    """
    Write a chrom.sizes file for a sequence collection.

    Args:
        digest (str): The sequence collection digest
        dest_path (str): Path to write the chrom.sizes file

    Returns:
        (str): Path to the written file
    """
    content = self.build_chrom_sizes(digest)
    with open(dest_path, "w") as f:
        f.write(content)
    return dest_path

write_fai

write_fai(digest, dest_path)

Write a .fai index file for a FASTA.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
dest_path str

Path to write the .fai file

required

Returns:

Type Description
str

Path to the written file

Source code in refget/clients.py
def write_fai(self, digest: str, dest_path: str) -> str:
    """
    Write a .fai index file for a FASTA.

    Args:
        digest (str): The sequence collection digest
        dest_path (str): Path to write the .fai file

    Returns:
        (str): Path to the written file
    """
    return self._get_fasta_helper().write_fai(digest, dest_path, seqcol_client=self)

FastaDrsClient

FastaDrsClient(urls=['https://seqcolapi.databio.org/fasta'], raise_errors=None)

Bases: RefgetClient

A client for interacting with FASTA files via GA4GH DRS endpoints.

Initializes the FASTA DRS client.

Parameters:

Name Type Description Default
urls list

A list of base URLs of the FASTA DRS API. Defaults to ["https://seqcolapi.databio.org/fasta"].

['https://seqcolapi.databio.org/fasta']
raise_errors bool

Whether to raise errors or log them. Defaults to None, which will guess.

None

Attributes:

Name Type Description
urls list

The list of base URLs of the FASTA DRS API.

Source code in refget/clients.py
def __init__(
    self,
    urls: list[str] = ["https://seqcolapi.databio.org/fasta"],
    raise_errors: Optional[bool] = None,
) -> None:
    """
    Initializes the FASTA DRS client.

    Args:
        urls (list, optional): A list of base URLs of the FASTA DRS API.
            Defaults to ["https://seqcolapi.databio.org/fasta"].
        raise_errors (bool, optional): Whether to raise errors or log them.
            Defaults to None, which will guess.

    Attributes:
        urls (list): The list of base URLs of the FASTA DRS API.
    """
    self.urls = [url.rstrip("/") for url in urls]
    if raise_errors is None:
        raise_errors = __name__ == "__main__"
    self.raise_errors = raise_errors

build_fai

build_fai(digest, seqcol_client=None)

Build a complete .fai index file content for a FASTA.

FAI format per line: NAME\tLENGTH\tOFFSET\tLINEBASES\tLINEWIDTH

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
seqcol_client SequenceCollectionClient

SequenceCollectionClient to use. If None, uses parent client or creates one.

None

Returns:

Type Description
str

String content of the .fai file

Source code in refget/clients.py
def build_fai(self, digest: str, seqcol_client: "SequenceCollectionClient" = None) -> str:
    """
    Build a complete .fai index file content for a FASTA.

    FAI format per line: NAME\tLENGTH\tOFFSET\tLINEBASES\tLINEWIDTH

    Args:
        digest (str): The sequence collection digest
        seqcol_client (SequenceCollectionClient, optional): SequenceCollectionClient
            to use. If None, uses parent client or creates one.

    Returns:
        (str): String content of the .fai file
    """
    # Get FAI index data
    index = self.get_index(digest)
    if not index:
        raise ValueError(f"No FAI index for {digest}")

    # Get sequence collection for names/lengths
    if seqcol_client is None:
        # Use parent client if we were created via SequenceCollectionClient.fasta
        if hasattr(self, "_seqcol_client") and self._seqcol_client is not None:
            seqcol_client = self._seqcol_client
        else:
            # Derive seqcol URL from fasta URL (strip /fasta suffix)
            base_urls = [url.rsplit("/fasta", 1)[0] for url in self.urls]
            seqcol_client = SequenceCollectionClient(urls=base_urls)

    collection = seqcol_client.get_collection(digest, level=2)
    if not collection:
        raise ValueError(f"No collection found for {digest}")

    names = collection["names"]
    lengths = collection["lengths"]
    offsets = index["offsets"]
    line_bases = index["line_bases"]
    line_width = line_bases + index["extra_line_bytes"]

    # Build FAI lines
    lines = []
    for name, length, offset in zip(names, lengths, offsets):
        # FAI format: NAME LENGTH OFFSET LINEBASES LINEWIDTH
        lines.append(f"{name}\t{length}\t{offset}\t{line_bases}\t{line_width}")

    return "\n".join(lines) + "\n"

download

download(digest, dest_path=None, access_id=None)

Download the FASTA file to a local path.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
dest_path str

Destination file path. If None, uses object name.

None
access_id str

Specific access method to use. If None, tries all.

None

Returns:

Type Description
str

Path to downloaded file

Raises:

Type Description
ValueError

If no access methods available or specified access_id not found

Source code in refget/clients.py
def download(self, digest: str, dest_path: str = None, access_id: str = None) -> str:
    """
    Download the FASTA file to a local path.

    Args:
        digest (str): The sequence collection digest
        dest_path (str, optional): Destination file path. If None, uses object name.
        access_id (str, optional): Specific access method to use. If None, tries all.

    Returns:
        (str): Path to downloaded file

    Raises:
        ValueError: If no access methods available or specified access_id not found
    """
    drs_obj = self.get_object(digest)
    if not drs_obj or not drs_obj.get("access_methods"):
        raise ValueError(f"No access methods for {digest}")

    # Filter to specific access method if requested
    methods = drs_obj["access_methods"]
    if access_id:
        methods = [m for m in methods if m.get("access_id") == access_id]
        if not methods:
            raise ValueError(f"Access method '{access_id}' not found for {digest}")

    # Find first accessible URL
    for method in methods:
        url = None
        if method.get("access_url"):
            access_url = method["access_url"]
            url = access_url.get("url") if isinstance(access_url, dict) else access_url
        elif method.get("access_id"):
            access_info = self.get_access_url(digest, method["access_id"])
            url = access_info.get("url") if access_info else None

        if url:
            if dest_path is None:
                dest_path = drs_obj.get("name", f"{digest}.fa")

            response = requests.get(url, stream=True)
            response.raise_for_status()
            with open(dest_path, "wb") as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            return dest_path

    raise ValueError(f"No accessible URLs for {digest}")

download_to_store

download_to_store(digest, store, access_id=None, temp_dir=None)

Download the FASTA file and import it into a RefgetStore.

This method downloads the FASTA file from the DRS endpoint and immediately imports it into the provided RefgetStore, enabling local sequence retrieval by digest without re-downloading.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
store RefgetStore

The RefgetStore instance to import into

required
access_id str

Specific access method to use. If None, tries all.

None
temp_dir str

Directory for temporary download. If None, uses system temp.

None

Returns:

Type Description
str

The collection digest of the imported sequences

Raises:

Type Description
ValueError

If no access methods available or specified access_id not found

ImportError

If gtars/RefgetStore is not available

Example

from refget.store import RefgetStore, StorageMode

store = RefgetStore(StorageMode.Encoded)
client = FastaDrsClient()
collection_digest = client.download_to_store("abc123", store)

Source code in refget/clients.py
def download_to_store(
    self, digest: str, store: "RefgetStore", access_id: str = None, temp_dir: str = None
) -> str:
    """
    Download the FASTA file and import it into a RefgetStore.

    This method downloads the FASTA file from the DRS endpoint and immediately
    imports it into the provided RefgetStore, enabling local sequence retrieval
    by digest without re-downloading.

    Args:
        digest (str): The sequence collection digest
        store (RefgetStore): The RefgetStore instance to import into
        access_id (str, optional): Specific access method to use. If None, tries all.
        temp_dir (str, optional): Directory for temporary download. If None, uses system temp.

    Returns:
        (str): The collection digest of the imported sequences

    Raises:
        ValueError: If no access methods available or specified access_id not found
        ImportError: If gtars/RefgetStore is not available

    Example:
        >>> from refget.store import RefgetStore, StorageMode
        >>> store = RefgetStore(StorageMode.Encoded)
        >>> client = FastaDrsClient()
        >>> collection_digest = client.download_to_store("abc123", store)
    """
    import tempfile
    import os

    # Verify store is available
    try:
        from .store import RefgetStore as RefgetStoreClass
    except ImportError:
        raise ImportError("gtars is required for download_to_store functionality")

    # Download to temporary location
    temp_file = None
    try:
        if temp_dir:
            os.makedirs(temp_dir, exist_ok=True)
            temp_file = os.path.join(temp_dir, f"{digest}.fa")
        else:
            # Create a named temporary file
            fd, temp_file = tempfile.mkstemp(suffix=".fa", prefix=f"{digest}_")
            os.close(fd)  # Close the file descriptor

        # Download the FASTA
        downloaded_path = self.download(digest, dest_path=temp_file, access_id=access_id)
        _LOGGER.info(f"Downloaded FASTA to {downloaded_path}")

        # Import into store
        store.import_fasta(downloaded_path)
        _LOGGER.info(f"Imported FASTA into RefgetStore: {digest}")

        return digest

    finally:
        # Clean up temporary file if we created it in system temp
        if temp_file and not temp_dir and os.path.exists(temp_file):
            try:
                os.remove(temp_file)
            except Exception as e:
                _LOGGER.warning(f"Could not remove temporary file {temp_file}: {e}")

get_access_url

get_access_url(digest, access_id)

Get access URL for a specific access method.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
access_id str

The access ID from the access method

required

Returns:

Type Description
dict

Access URL object

Source code in refget/clients.py
def get_access_url(self, digest: str, access_id: str) -> Optional[dict]:
    """
    Get access URL for a specific access method.

    Args:
        digest (str): The sequence collection digest
        access_id (str): The access ID from the access method

    Returns:
        (dict): Access URL object
    """
    endpoint = f"/objects/{digest}/access/{access_id}"
    return _try_urls(self.urls, endpoint, raise_errors=self.raise_errors)

get_index

get_index(digest)

Get FAI index data for a FASTA file.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required

Returns:

Type Description
dict

Dict with line_bases, extra_line_bytes, offsets

Source code in refget/clients.py
def get_index(self, digest: str) -> Optional[dict]:
    """
    Get FAI index data for a FASTA file.

    Args:
        digest (str): The sequence collection digest

    Returns:
        (dict): Dict with line_bases, extra_line_bytes, offsets
    """
    endpoint = f"/objects/{digest}/index"
    return _try_urls(self.urls, endpoint, raise_errors=self.raise_errors)

get_object

get_object(digest)

Get DRS object metadata for a FASTA file.

Parameters:

Name Type Description Default
digest str

The sequence collection digest (which is also the DRS object ID)

required

Returns:

Type Description
dict

DRS object with id, self_uri, size, checksums, access_methods, etc.

Source code in refget/clients.py
def get_object(self, digest: str) -> Optional[dict]:
    """
    Get DRS object metadata for a FASTA file.

    Args:
        digest (str): The sequence collection digest (which is also the DRS object ID)

    Returns:
        (dict): DRS object with id, self_uri, size, checksums, access_methods, etc.
    """
    endpoint = f"/objects/{digest}"
    return _try_urls(self.urls, endpoint, raise_errors=self.raise_errors)

service_info

service_info()

Get DRS service info.

Returns:

Type Description
dict

The service information.

Source code in refget/clients.py
def service_info(self) -> Optional[dict]:
    """
    Get DRS service info.

    Returns:
        (dict): The service information.
    """
    endpoint = "/service-info"
    return _try_urls(self.urls, endpoint)

write_fai

write_fai(digest, dest_path, seqcol_client=None)

Write a .fai index file for a FASTA.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
dest_path str

Path to write the .fai file

required
seqcol_client SequenceCollectionClient

SequenceCollectionClient to use

None

Returns:

Type Description
str

Path to the written file

Source code in refget/clients.py
def write_fai(
    self, digest: str, dest_path: str, seqcol_client: "SequenceCollectionClient" = None
) -> str:
    """
    Write a .fai index file for a FASTA.

    Args:
        digest (str): The sequence collection digest
        dest_path (str): Path to write the .fai file
        seqcol_client (SequenceCollectionClient, optional): SequenceCollectionClient to use

    Returns:
        (str): Path to the written file
    """
    fai_content = self.build_fai(digest, seqcol_client)
    with open(dest_path, "w") as f:
        f.write(fai_content)
    return dest_path

PangenomeClient

Bases: RefgetClient

Agent Classes

Agents provide higher-level abstractions for working with refget data in a PostgreSQL database.

RefgetDBAgent

RefgetDBAgent(engine=None, postgres_str=None, schema=SEQCOL_SCHEMA_PATH, inherent_attrs=DEFAULT_INHERENT_ATTRS, fasta_drs_url_prefix=None)

Bases: object

Primary aggregator agent, interface to all other agents

Parameterize it via these environment variables:

- POSTGRES_HOST
- POSTGRES_PORT
- POSTGRES_DB
- POSTGRES_USER
- POSTGRES_PASSWORD

Source code in refget/agents.py
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
def __init__(
    self,
    engine: Optional[SqlalchemyDatabaseEngine] = None,
    postgres_str: Optional[str] = None,
    schema=SEQCOL_SCHEMA_PATH,
    inherent_attrs: List[str] = DEFAULT_INHERENT_ATTRS,
    fasta_drs_url_prefix: Optional[str] = None,
):  # = "sqlite:///foo.db"
    if engine is not None:
        self.engine = engine
    else:
        if not postgres_str:
            # Configure via environment variables
            POSTGRES_HOST = os.getenv("POSTGRES_HOST")
            POSTGRES_PORT = os.getenv("POSTGRES_PORT")
            POSTGRES_DB = os.getenv("POSTGRES_DB")
            POSTGRES_USER = os.getenv("POSTGRES_USER")
            POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD")
            postgres_str = URL.create(
                "postgresql",
                username=POSTGRES_USER,
                password=POSTGRES_PASSWORD,
                host=POSTGRES_HOST,
                port=int(POSTGRES_PORT) if POSTGRES_PORT else None,
                database=POSTGRES_DB,
            )

        try:
            self.engine = create_engine(postgres_str, echo=False)
        except Exception as e:
            _LOGGER.error(f"Error: {e}")
            _LOGGER.error("Unable to connect to database")
            _LOGGER.error(
                "Please check that you have set the database credentials correctly in the environment variables"
            )
            _LOGGER.error(f"Database engine string: {postgres_str}")
            raise e
    try:
        SQLModel.metadata.create_all(self.engine)
    except Exception as e:
        _LOGGER.error(f"Error: {e}")
        _LOGGER.error("Unable to create tables in the database")
        raise e

    # Read schema
    if schema:
        self.schema_dict = load_json(schema)
        _LOGGER.debug(f"Schema: {self.schema_dict}")
        try:
            self.inherent_attrs = self.schema_dict["ga4gh"]["inherent"]
        except KeyError:
            self.inherent_attrs = inherent_attrs
            _LOGGER.warning(
                f"No 'inherent' attributes found in schema; using defaults: {inherent_attrs}"
            )
    else:
        _LOGGER.warning("No schema provided; using defaults")
        self.schema_dict = None
        self.inherent_attrs = inherent_attrs

    self.__sequence = SequenceAgent(self.engine)
    self.__seqcol = SequenceCollectionAgent(self.engine, self.inherent_attrs, self)
    self.__pangenome = PangenomeAgent(self)
    self.__attribute = AttributeAgent(self.engine)
    self.__fasta_drs = FastaDrsAgent(self.engine, fasta_drs_url_prefix)

calc_similarities

calc_similarities(digestA, digestB)

Calculates the Jaccard similarity between two sequence collections.

This method retrieves two sequence collections using their digests and then computes Jaccard similarities for all attributes.

Parameters:

Name Type Description Default
digestA str

The digest (identifier) for the first sequence collection.

required
digestB str

The digest (identifier) for the second sequence collection.

required

Returns:

Name Type Description
dict dict

The Jaccard similarity score between the two sequence collections for all present and shared attributes.

Source code in refget/agents.py
def calc_similarities(self, digestA: str, digestB: str) -> dict:
    """
    Calculates the Jaccard similarity between two sequence collections.

    This method retrieves two sequence collections using their digests and then
    computes jaccard similarities for all attributes.

    Args:
        digestA (str): The digest (identifier) for the first sequence collection.
        digestB (str): The digest (identifier) for the second sequence collection.

    Returns:
        dict: The Jaccard similarity score between the two sequence collections for all present and shared attributes.

    """
    A = self.seqcol.get(digestA, return_format="level2")
    B = self.seqcol.get(digestB, return_format="level2")
    return calc_jaccard_similarities(A, B)

calc_similarities_seqcol_dicts

calc_similarities_seqcol_dicts(seqcolA, seqcolB)

Calculates the Jaccard similarity between two sequence collections.

This method computes Jaccard similarities between two sequence collections provided directly as dictionaries; no database retrieval is performed.

Parameters:

Name Type Description Default
seqcolA dict

the first sequence collection in dict format.

required
seqcolB dict

the second sequence collection in dict format.

required

Returns:

Name Type Description
dict dict

The Jaccard similarity score between the two sequence collections for all present and shared attributes.

Source code in refget/agents.py
def calc_similarities_seqcol_dicts(self, seqcolA: dict, seqcolB: dict) -> dict:
    """
    Calculates the Jaccard similarity between two sequence collections.

    This method computes Jaccard similarities between two sequence collections
    provided directly as dictionaries; no database retrieval is performed.

    Args:
        seqcolA (dict): the first sequence collection in dict format.
        seqcolB (dict): the second sequence collection in dict format.

    Returns:
        dict: The Jaccard similarity score between the two sequence collections for all present and shared attributes.

    """

    return calc_jaccard_similarities(seqcolA, seqcolB)

truncate

truncate()

Delete all records from the database

Source code in refget/agents.py
def truncate(self) -> int:
    """Delete all records from the database"""

    with Session(self.engine) as session:
        statement = delete(SequenceCollection)
        result1 = session.exec(statement)
        statement = delete(Pangenome)
        result = session.exec(statement)
        statement = delete(NamesAttr)
        result = session.exec(statement)
        statement = delete(LengthsAttr)
        result = session.exec(statement)
        statement = delete(SequencesAttr)
        result = session.exec(statement)
        # statement = delete(SortedNameLengthPairsAttr)
        # result = session.exec(statement)
        statement = delete(NameLengthPairsAttr)
        result = session.exec(statement)
        statement = delete(SortedSequencesAttr)
        result = session.exec(statement)

        session.commit()
        return result1.rowcount

SequenceCollectionAgent

SequenceCollectionAgent(engine, inherent_attrs=None, parent=None)

Bases: object

Agent for interacting with a database of sequence collections

Source code in refget/agents.py, lines 169-177
def __init__(
    self,
    engine: SqlalchemyDatabaseEngine,
    inherent_attrs: Optional[List[str]] = None,
    parent: Optional["RefgetDBAgent"] = None,
) -> None:
    self.engine = engine
    self.inherent_attrs = inherent_attrs
    self.parent = parent

add

add(seqcol, update=False)

Add a sequence collection to the database or update it if it exists

Parameters:

Name Type Description Default
seqcol SequenceCollection

The sequence collection to add

required
update bool

If True, update an existing collection if it exists

False

Returns:

Type Description
SequenceCollection

The added or updated sequence collection
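
Example (a minimal usage sketch; assumes access through an agent at `dba.seqcol` and a SequenceCollection built the same way add_from_dict below builds one)::

seqcol = SequenceCollection.from_dict(seqcol_dict, None)  # seqcol_dict: a canonical seqcol dict (hypothetical)
added = dba.seqcol.add(seqcol, update=False)
print(added.digest)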

Source code in refget/agents.py, lines 260-345
def add(self, seqcol: SequenceCollection, update: bool = False) -> SequenceCollection:
    """
    Add a sequence collection to the database or update it if it exists

    Args:
        seqcol: The sequence collection to add
        update: If True, update an existing collection if it exists

    Returns:
        The added or updated sequence collection
    """
    with Session(self.engine, expire_on_commit=False) as session:
        with session.no_autoflush:
            existing = session.get(SequenceCollection, seqcol.digest)

            if existing and not update:
                return existing

            # Process attributes (create if needed)
            attr_map = {
                "names": (NamesAttr, seqcol.names),
                "sequences": (SequencesAttr, seqcol.sequences),
                "sorted_sequences": (SortedSequencesAttr, seqcol.sorted_sequences),
                "lengths": (LengthsAttr, seqcol.lengths),
                "name_length_pairs": (NameLengthPairsAttr, seqcol.name_length_pairs),
            }

            processed_attrs = {}

            # Create or retrieve attributes
            for attr_name, (attr_class, attr_obj) in attr_map.items():
                attr = session.get(attr_class, attr_obj.digest)
                if not attr:
                    attr = attr_class(**attr_obj.model_dump())
                    session.add(attr)
                processed_attrs[attr_name] = attr

            if existing and update:
                # Update existing collection

                existing_names = [
                    name_model.human_readable_name
                    for name_model in existing.human_readable_names
                ]

                for name_model in seqcol.human_readable_names:
                    if name_model.human_readable_name not in existing_names:

                        new_name = HumanReadableNames(
                            human_readable_name=name_model.human_readable_name,
                            digest=existing.digest,
                        )

                        session.add(new_name)

                        existing.human_readable_names.append(new_name)

                for attr_name, attr in processed_attrs.items():
                    # Update attribute reference
                    setattr(existing, f"{attr_name}_digest", attr.digest)

                    # Update relationship - link this attribute to the existing collection
                    getattr(attr, "collection", []).append(existing)

                # Update transient attributes
                existing.sorted_name_length_pairs_digest = (
                    seqcol.sorted_name_length_pairs_digest
                )

                session.commit()
                return existing
            else:
                # Create new collection
                new_collection = SequenceCollection(
                    digest=seqcol.digest,
                    human_readable_names=seqcol.human_readable_names,
                    sorted_name_length_pairs_digest=seqcol.sorted_name_length_pairs_digest,
                )

                # Link attributes to collection
                for attr in processed_attrs.values():
                    getattr(attr, "collection", []).append(new_collection)

                session.add(new_collection)
                session.commit()
                return new_collection

add_from_dict

add_from_dict(seqcol_dict, update=False)

Add a sequence collection from a seqcol dictionary

Parameters:

Name Type Description Default
seqcol_dict dict

The sequence collection in dictionary form

required
update bool

If True, update an existing collection if it exists

False

Returns:

Type Description
SequenceCollection

The added or updated sequence collection
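
Example (a minimal usage sketch; assumes a local FASTA file and an agent available at `dba.seqcol`)::

from refget.utils import fasta_to_seqcol_dict

seqcol_dict = fasta_to_seqcol_dict("genome.fa")
seqcol = dba.seqcol.add_from_dict(seqcol_dict)
print(seqcol.digest)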

Source code in refget/agents.py, lines 347-361
def add_from_dict(self, seqcol_dict: dict, update: bool = False) -> SequenceCollection:
    """
    Add a sequence collection from a seqcol dictionary

    Args:
        seqcol_dict (dict): The sequence collection in dictionary form
        update (bool): If True, update an existing collection if it exists

    Returns:
        (SequenceCollection): The added or updated sequence collection
    """
    seqcol = SequenceCollection.from_dict(seqcol_dict, self.inherent_attrs)
    _LOGGER.info(f"SeqCol: {seqcol}")
    _LOGGER.debug(f"SeqCol name_length_pairs: {seqcol.name_length_pairs.value}")
    return self.add(seqcol, update)

add_from_fasta_file

add_from_fasta_file(fasta_file_path, update=False, create_fasta_drs=True, human_readable_name=None)

Given a path to a fasta file, load the sequences into the refget database.

Parameters:

Name Type Description Default
fasta_file_path str

Path to the fasta file

required
update bool

If True, update an existing collection if it exists

False
create_fasta_drs bool

If True, create a FastaDrsObject for the FASTA file

True
human_readable_name str

Optional human-readable name for the collection

None

Returns:

Type Description
SequenceCollection

The added or updated sequence collection
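
Example (a minimal usage sketch; the file path and name are placeholders, and `dba.seqcol` is assumed to be an initialized agent)::

seqcol = dba.seqcol.add_from_fasta_file(
    "genome.fa",
    human_readable_name="my-assembly",
)
print(seqcol.digest)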

Source code in refget/agents.py, lines 363-396
def add_from_fasta_file(
    self,
    fasta_file_path: str,
    update: bool = False,
    create_fasta_drs: bool = True,
    human_readable_name: str = None,
) -> SequenceCollection:
    """
    Given a path to a fasta file, load the sequences into the refget database.

    Args:
        fasta_file_path (str): Path to the fasta file
        update (bool): If True, update an existing collection if it exists
        create_fasta_drs (bool): If True, create a FastaDrsObject for the FASTA file
        human_readable_name (str): Optional human-readable name for the collection

    Returns:
       (SequenceCollection): The added or updated sequence collection
    """
    CSC = fasta_to_seqcol_dict(fasta_file_path)
    if human_readable_name:
        CSC["human_readable_names"] = human_readable_name
    seqcol = self.add_from_dict(CSC, update)

    if create_fasta_drs and self.parent and self.parent.fasta_drs:
        drs_obj = FastaDrsObject.from_fasta_file(fasta_file_path, digest=seqcol.digest)
        if self.parent.fasta_drs.url_prefix:
            url = self.parent.fasta_drs.url_prefix + os.path.basename(fasta_file_path)
            drs_obj.access_methods = [
                AccessMethod(type="https", access_url=AccessURL(url=url))
            ]
        self.parent.fasta_drs.add(drs_obj)

    return seqcol

add_from_fasta_file_with_name

add_from_fasta_file_with_name(fasta_file_path, human_readable_name, update=False, create_fasta_drs=True)

Given a path to a fasta file, and a human-readable name, load the sequences into the refget database.

Deprecated: Use add_from_fasta_file(fasta_file_path, human_readable_name=name) instead.

Source code in refget/agents.py, lines 398-415
def add_from_fasta_file_with_name(
    self,
    fasta_file_path: str,
    human_readable_name: str,
    update: bool = False,
    create_fasta_drs: bool = True,
) -> SequenceCollection:
    """
    Given a path to a fasta file, and a human-readable name, load the sequences into the refget database.

    Deprecated: Use add_from_fasta_file(fasta_file_path, human_readable_name=name) instead.
    """
    return self.add_from_fasta_file(
        fasta_file_path,
        update=update,
        create_fasta_drs=create_fasta_drs,
        human_readable_name=human_readable_name,
    )

add_from_fasta_pep

add_from_fasta_pep(pep, fa_root, update=False, create_fasta_drs=True)

Given a PEP project and a root directory containing the fasta files, load the fasta files into the refget database.

Parameters:

Name Type Description Default
pep Project

PEP project object containing sample metadata

required
fa_root str

Root directory containing the fasta files

required
update bool

If True, update existing sequence collections

False
create_fasta_drs bool

If True, create FastaDrsObjects for the FASTA files

True

Returns:

Type Description
dict

A dictionary of the digests of the added sequence collections
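
Example (a minimal usage sketch; assumes a PEP whose samples each define a `fasta` attribute and, optionally, a `sample_name`)::

import peppy

prj = peppy.Project("project_config.yaml")
digests = dba.seqcol.add_from_fasta_pep(prj, fa_root="/data/fasta")
# digests maps each sample's FASTA file name to its collection digest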

Source code in refget/agents.py, lines 417-459
def add_from_fasta_pep(
    self,
    pep: "peppy.Project",
    fa_root: str,
    update: bool = False,
    create_fasta_drs: bool = True,
) -> dict:
    """
    Given a PEP project and a root directory containing the fasta files,
    load the fasta files into the refget database.

    Args:
        pep (peppy.Project): PEP project object containing sample metadata
        fa_root (str): Root directory containing the fasta files
        update (bool): If True, update existing sequence collections
        create_fasta_drs (bool): If True, create FastaDrsObjects for the FASTA files

    Returns:
        (dict): A dictionary of the digests of the added sequence collections
    """

    total_files = len(pep.samples)
    results = {}
    import time

    for i, s in enumerate(pep.samples, 1):
        fa_path = os.path.join(fa_root, s.fasta)
        _LOGGER.info(f"Loading {fa_path} ({i} of {total_files})")

        start_time = time.time()  # Record start time
        if s.sample_name:
            results[s.fasta] = self.add_from_fasta_file_with_name(
                fa_path, s.sample_name, update, create_fasta_drs
            ).digest
        else:
            results[s.fasta] = self.add_from_fasta_file(
                fa_path, update, create_fasta_drs
            ).digest
        elapsed_time = time.time() - start_time  # Calculate elapsed time

        _LOGGER.info(f"Loaded in {elapsed_time:.2f} seconds")

    return results

get

get(digest, return_format='level2', attribute=None, itemwise_limit=None)

Get a sequence collection by digest

Parameters:

Name Type Description Default
digest str

The digest of the sequence collection

required
return_format str

The format in which to return the sequence collection

'level2'
attribute str

Name of an attribute to return, if you just want an attribute

None
itemwise_limit int

Limit the number of items returned in itemwise format

None

Returns:

Type Description
SequenceCollection

The sequence collection (in requested format)
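
Example (a minimal usage sketch; the digest is a placeholder that must exist in the database)::

level2 = dba.seqcol.get("some_digest")                        # level 2 (default)
level1 = dba.seqcol.get("some_digest", return_format="level1")
names = dba.seqcol.get("some_digest", attribute="names")      # just the names list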

Source code in refget/agents.py, lines 179-213
def get(
    self,
    digest: str,
    return_format: str = "level2",
    attribute: Optional[str] = None,
    itemwise_limit: Optional[int] = None,
) -> SequenceCollection | dict | list:
    """
    Get a sequence collection by digest

    Args:
        digest (str): The digest of the sequence collection
        return_format (str): The format in which to return the sequence collection
        attribute (str): Name of an attribute to return, if you just want an attribute
        itemwise_limit (int): Limit the number of items returned in itemwise format

    Returns:
        (SequenceCollection): The sequence collection (in requested format)
    """
    with Session(self.engine) as session:
        statement = select(SequenceCollection).where(SequenceCollection.digest == digest)
        results = session.exec(statement)
        seqcol = results.one_or_none()
        if not seqcol:
            raise ValueError(f"SequenceCollection with digest '{digest}' not found")
        if attribute:
            return getattr(seqcol, attribute).value
        elif return_format == "level2":
            return seqcol.level2()
        elif return_format == "level1":
            return seqcol.level1()
        elif return_format == "itemwise":
            return seqcol.itemwise(itemwise_limit)
        else:
            return seqcol

search_by_attributes

search_by_attributes(filters, offset=0, limit=50)

Search sequence collections by multiple attribute filters (AND logic).

Parameters:

Name Type Description Default
filters dict

Dict of {attribute_name: digest} pairs

required
offset int

Pagination offset

0
limit int

Max results to return

50

Returns:

Type Description
dict

Dict with pagination info and results
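
Example (a minimal usage sketch; `names_digest` is assumed to be the digest of a names attribute already stored in the database)::

page = dba.seqcol.search_by_attributes({"names": names_digest}, offset=0, limit=10)
print(page["pagination"]["total"])
for sc in page["results"]:
    print(sc.digest)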

Source code in refget/agents.py, lines 474-514
def search_by_attributes(self, filters: dict, offset: int = 0, limit: int = 50) -> dict:
    """
    Search sequence collections by multiple attribute filters (AND logic).

    Args:
        filters: Dict of {attribute_name: digest} pairs
        offset: Pagination offset
        limit: Max results to return

    Returns:
        Dict with pagination info and results
    """
    with Session(self.engine) as session:
        # Start with base query
        list_stmt = select(SequenceCollection)
        cnt_stmt = select(func.count(SequenceCollection.digest))

        # Chain .where() for each filter (creates AND logic)
        for attr_name, attr_digest in filters.items():
            # Validate attribute exists to prevent SQL injection
            if attr_name not in ATTR_TYPE_MAP:
                raise ValueError(f"Unknown attribute: {attr_name}")

            # Build WHERE condition dynamically
            digest_column = getattr(SequenceCollection, f"{attr_name}_digest")
            list_stmt = list_stmt.where(digest_column == attr_digest)
            cnt_stmt = cnt_stmt.where(digest_column == attr_digest)

        # Add pagination
        list_stmt = list_stmt.offset(offset).limit(limit)

        # Execute queries
        cnt_res = session.exec(cnt_stmt)
        list_res = session.exec(list_stmt)
        count = cnt_res.one()
        seqcols = list_res.all()

        return {
            "pagination": {"page": offset // limit, "page_size": limit, "total": count},
            "results": seqcols,
        }

SequenceAgent

SequenceAgent(engine)

Bases: object

Agent for interacting with a database of sequences

Source code in refget/agents.py, lines 102-103
def __init__(self, engine: SqlalchemyDatabaseEngine) -> None:
    self.engine = engine

PangenomeAgent

PangenomeAgent(parent)

Bases: object

Agent for interacting with a database of pangenomes

Source code in refget/agents.py, lines 547-549
def __init__(self, parent: "RefgetDBAgent") -> None:
    self.engine = parent.engine
    self.parent = parent

AttributeAgent

AttributeAgent(engine)

Bases: object

Source code in refget/agents.py, lines 632-633
def __init__(self, engine: SqlalchemyDatabaseEngine) -> None:
    self.engine = engine

FastaDrsAgent

FastaDrsAgent(engine, url_prefix=None)

Agent for interacting with a database of FASTA DRS objects

Source code in refget/agents.py, lines 688-690
def __init__(self, engine: SqlalchemyDatabaseEngine, url_prefix: Optional[str] = None) -> None:
    self.engine = engine
    self.url_prefix = url_prefix

add

add(fasta_drs)

Add a FastaDrsObject to the database

Source code in refget/agents.py, lines 702-711
def add(self, fasta_drs: FastaDrsObject) -> FastaDrsObject:
    """Add a FastaDrsObject to the database"""
    with Session(self.engine, expire_on_commit=False) as session:
        with session.no_autoflush:
            existing = session.get(FastaDrsObject, fasta_drs.id)
            if existing:
                return existing
            session.add(fasta_drs)
            session.commit()
            return fasta_drs

add_access_method

add_access_method(digest, access_method)

Add an access method to an existing FastaDrsObject.

Parameters:

Name Type Description Default
digest str

The digest (object_id) of the DRS object

required
access_method AccessMethod

The AccessMethod to add

required

Returns:

Type Description
FastaDrsObject

The updated FastaDrsObject
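
Example (a minimal usage sketch; `dba.fasta_drs` is assumed to be an initialized FastaDrsAgent, and it reuses the AccessMethod/AccessURL models seen in add_from_fasta_file above, with placeholder values)::

method = AccessMethod(
    type="https",
    access_url=AccessURL(url="https://example.org/fasta/genome.fa"),
)
drs_obj = dba.fasta_drs.add_access_method("some_digest", method)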

Source code in refget/agents.py, lines 727-748
def add_access_method(self, digest: str, access_method: AccessMethod) -> FastaDrsObject:
    """
    Add an access method to an existing FastaDrsObject.

    Args:
        digest: The digest (object_id) of the DRS object
        access_method: The AccessMethod to add

    Returns:
        The updated FastaDrsObject
    """
    with Session(self.engine, expire_on_commit=False) as session:
        drs_obj = session.get(FastaDrsObject, digest)
        if not drs_obj:
            raise ValueError(f"FastaDrsObject with id '{digest}' not found")
        # Create a new list to ensure SQLAlchemy detects the change
        current_methods = list(drs_obj.access_methods) if drs_obj.access_methods else []
        current_methods.append(access_method)
        drs_obj.access_methods = current_methods
        session.add(drs_obj)
        session.commit()
        return drs_obj

get

get(digest)

Get a FastaDrsObject by its digest (object_id)

Source code in refget/agents.py, lines 692-700
def get(self, digest: str) -> FastaDrsObject:
    """Get a FastaDrsObject by its digest (object_id)"""
    with Session(self.engine) as session:
        statement = select(FastaDrsObject).where(FastaDrsObject.id == digest)
        results = session.exec(statement)
        response = results.first()
        if not response:
            raise ValueError(f"FastaDrsObject with id '{digest}' not found")
        return response

list_by_offset

list_by_offset(limit=50, offset=0)

List FastaDrsObjects with pagination

Source code in refget/agents.py, lines 713-725
def list_by_offset(self, limit: int = 50, offset: int = 0) -> dict:
    """List FastaDrsObjects with pagination"""
    with Session(self.engine) as session:
        list_stmt = select(FastaDrsObject).offset(offset).limit(limit)
        cnt_stmt = select(func.count(FastaDrsObject.id))
        cnt_res = session.exec(cnt_stmt)
        list_res = session.exec(list_stmt)
        count = cnt_res.one()
        drs_objs = list_res.all()
        return {
            "pagination": {"page": int(offset / limit), "page_size": limit, "total": count},
            "results": drs_objs,
        }

RefgetStore (gtars)

RefgetStore provides high-performance local sequence storage implemented in Rust. It supports:

  • In-memory and on-disk storage with optional compression
  • Remote store access with local caching
  • Sequence retrieval by digest or by collection + name
  • BED file region extraction for batch operations
  • FASTA export for individual sequences or regions

See the RefgetStore tutorial for usage examples.

RefgetStore

RefgetStore(mode)

A global store for GA4GH refget sequences with lazy-loading support.

RefgetStore provides content-addressable storage for reference genome sequences following the GA4GH refget specification. Supports both local and remote stores with on-demand sequence loading.

Attributes:

Name Type Description
cache_path Optional[str]

Local directory path where the store is located or cached. None for in-memory stores.

remote_url Optional[str]

Remote URL of the store if loaded remotely, None otherwise.

Note

Boolean evaluation: RefgetStore follows Python container semantics, meaning bool(store) is False for empty stores (like list, dict, etc.). To check if a store variable is initialized (not None), use if store is not None: rather than if store:.

Example::

store = RefgetStore.in_memory()  # Empty store
bool(store)  # False (empty container)
len(store)   # 0

# Wrong: checks emptiness, not initialization
if store:
    process(store)

# Right: checks if variable is set
if store is not None:
    process(store)

Examples:

Create a new store and import sequences::

from gtars.refget import RefgetStore, StorageMode
store = RefgetStore(StorageMode.Encoded)
store.import_fasta("genome.fa")

Open an existing local store::

store = RefgetStore.open_local("/data/hg38")
seq = store.get_substring("chr1_digest", 0, 1000)

Open a remote store with caching::

store = RefgetStore.open_remote(
    "/local/cache",
    "https://example.com/hg38"
)

Create a new empty RefgetStore.

Parameters:

Name Type Description Default
mode StorageMode

Storage mode - StorageMode.Raw (uncompressed) or StorageMode.Encoded (bit-packed, space-efficient).

required

Example::

store = RefgetStore(StorageMode.Encoded)

disable_persistence

disable_persistence()

Disable disk persistence for this store.

New sequences will be kept in memory only. Existing Stub sequences can still be loaded from disk if local_path is set.

Example::

store = RefgetStore.open_remote("/cache", "https://example.com")
store.disable_persistence()  # Stop caching new sequences

enable_persistence

enable_persistence(path)

Enable disk persistence for this store.

Sets up the store to write sequences to disk. Any in-memory Full sequences are flushed to disk and converted to Stubs.

Parameters:

Name Type Description Default
path Union[str, PathLike]

Directory for storing sequences and metadata.

required

Raises:

Type Description
IOError

If the directory cannot be created or written to.

Example::

store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")
store.enable_persistence("/data/store")  # Flush to disk

export_fasta

export_fasta(collection_digest, output_path, sequence_names=None, line_width=None)

Export sequences from a collection to a FASTA file.

Parameters:

Name Type Description Default
collection_digest str

Collection to export from.

required
output_path Union[str, PathLike]

Path to write FASTA file.

required
sequence_names Optional[List[str]]

Optional list of sequence names to export. If None, exports all sequences in the collection.

None
line_width Optional[int]

Optional line width for wrapping sequences. If None, uses default of 80.

None
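
Example (a minimal usage sketch; assumes `store` is an opened RefgetStore, and the digest and sequence names are placeholders)::

store.export_fasta(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "subset.fa",
    sequence_names=["chr1", "chrM"],
    line_width=60,
)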

export_fasta_by_digests

export_fasta_by_digests(digests, output_path, line_width=None)

Export sequences by their digests to a FASTA file.

Parameters:

Name Type Description Default
digests List[str]

List of sequence digests to export.

required
output_path Union[str, PathLike]

Path to write FASTA file.

required
line_width Optional[int]

Optional line width for wrapping sequences. If None, uses default of 80.

None
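
Example (a minimal usage sketch; the digests are placeholders)::

store.export_fasta_by_digests(
    ["digest_of_seq1", "digest_of_seq2"],
    "selected.fa",
)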

get_collection

get_collection(collection_digest)

Get a collection by digest with all sequences loaded.

Loads the collection and all its sequence data into memory. Use this when you need full access to sequence content.

Parameters:

Name Type Description Default
collection_digest str

The collection's SHA-512/24u digest.

required

Returns:

Type Description
SequenceCollection

The collection with all sequence data loaded.

Raises:

Type Description
IOError

If the collection cannot be loaded.

Example::

collection = store.get_collection("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
for seq in collection.sequences:
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")

get_collection_metadata

get_collection_metadata(collection_digest)

Get metadata for a collection by digest.

Returns lightweight metadata without loading the full collection. Use this for quick lookups of collection information.

Parameters:

Name Type Description Default
collection_digest str

The collection's SHA-512/24u digest.

required

Returns:

Type Description
Optional[SequenceCollectionMetadata]

Collection metadata if found, None otherwise.

Example::

meta = store.get_collection_metadata("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
if meta:
    print(f"Collection has {meta.n_sequences} sequences")

get_seqs_bed_file

get_seqs_bed_file(collection_digest, bed_file_path, output_fasta_path)

Extract sequences for BED regions and write to FASTA.

Parameters:

Name Type Description Default
collection_digest str

Collection digest to look up sequence names.

required
bed_file_path Union[str, PathLike]

Path to BED file with regions.

required
output_fasta_path Union[str, PathLike]

Path to write output FASTA file.

required
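
Example (a minimal usage sketch; the digest and file paths are placeholders)::

store.get_seqs_bed_file(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "regions.bed",
    "regions.fa",
)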

get_seqs_bed_file_to_vec

get_seqs_bed_file_to_vec(collection_digest, bed_file_path)

Extract sequences for BED regions and return as list.

Parameters:

Name Type Description Default
collection_digest str

Collection digest to look up sequence names.

required
bed_file_path Union[str, PathLike]

Path to BED file with regions.

required

Returns:

Type Description
List[RetrievedSequence]

List of retrieved sequence segments.
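
Example (a minimal usage sketch; the digest and BED path are placeholders)::

seqs = store.get_seqs_bed_file_to_vec(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "regions.bed",
)
print(f"Retrieved {len(seqs)} regions")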

get_sequence

get_sequence(digest)

Retrieve a sequence record by its digest (SHA-512/24u or MD5).

Loads the sequence data if not already in memory. Supports lookup by either SHA-512/24u (preferred) or MD5 digest.

Parameters:

Name Type Description Default
digest str

Sequence digest (SHA-512/24u base64url or MD5 hex string).

required

Returns:

Type Description
Optional[SequenceRecord]

The sequence record with data if found, None otherwise.

Example::

record = store.get_sequence("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
if record:
    print(f"Found: {record.metadata.name}")
    print(f"Sequence: {record.decode()[:50]}...")

get_sequence_by_name

get_sequence_by_name(collection_digest, sequence_name)

Retrieve a sequence by collection digest and sequence name.

Looks up a sequence within a specific collection using its name (e.g., "chr1", "chrM"). Loads the sequence data if needed.

Parameters:

Name Type Description Default
collection_digest str

The collection's SHA-512/24u digest.

required
sequence_name str

Name of the sequence within that collection.

required

Returns:

Type Description
Optional[SequenceRecord]

The sequence record with data if found, None otherwise.

Example::

record = store.get_sequence_by_name(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "chr1"
)
if record:
    print(f"Sequence: {record.decode()[:50]}...")

get_sequence_metadata

get_sequence_metadata(seq_digest)

Get metadata for a sequence by digest (no data loaded).

Use this for lightweight lookups when you don't need the actual sequence.

Parameters:

Name Type Description Default
seq_digest str

The sequence's SHA-512/24u digest.

required

Returns:

Type Description
Optional[SequenceMetadata]

Sequence metadata if found, None otherwise.
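
Example (a minimal usage sketch; the digest is a placeholder)::

meta = store.get_sequence_metadata("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
if meta:
    print(f"{meta.name}: {meta.length} bp")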

get_substring

get_substring(seq_digest, start, end)

Extract a substring from a sequence.

Retrieves a specific region from a sequence using 0-based, half-open coordinates [start, end). Automatically loads sequence data if not already cached (for lazy-loaded stores).

Parameters:

Name Type Description Default
seq_digest str

Sequence digest (SHA-512/24u).

required
start int

Start position (0-based, inclusive).

required
end int

End position (0-based, exclusive).

required

Returns:

Type Description
Optional[str]

The substring sequence if found, None otherwise.

Example::

# Get first 1000 bases of chr1
seq = store.get_substring("chr1_digest", 0, 1000)
print(f"First 50bp: {seq[:50]}")

import_fasta

import_fasta(file_path)

Import sequences from a FASTA file into the store.

Reads all sequences from a FASTA file and adds them to the store. Computes GA4GH digests and creates a sequence collection.

Parameters:

Name Type Description Default
file_path Union[str, PathLike]

Path to the FASTA file.

required

Raises:

Type Description
IOError

If the file cannot be read or parsed.

Example::

store = RefgetStore(StorageMode.Encoded)
store.import_fasta("genome.fa")

in_memory classmethod

in_memory()

Create a new in-memory RefgetStore.

Creates a store that keeps all sequences in memory. Use this for temporary processing or when you don't need disk persistence.

Returns:

Type Description
RefgetStore

New empty RefgetStore with Encoded storage mode.

Example::

store = RefgetStore.in_memory()
store.import_fasta("genome.fa")

is_collection_loaded

is_collection_loaded(collection_digest)

Check if a collection is fully loaded.

Returns True if the collection's sequence list is loaded in memory, False if it's only metadata (stub).

Parameters:

Name Type Description Default
collection_digest str

The collection's SHA-512/24u digest.

required

Returns:

Type Description
bool

True if loaded, False otherwise.
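
Example (a minimal usage sketch; the digest is a placeholder)::

digest = "uC_UorBNf3YUu1YIDainBhI94CedlNeH"
if not store.is_collection_loaded(digest):
    collection = store.get_collection(digest)  # loads sequences on demand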

iter_collections

iter_collections()

Iterate over all collections with their sequences loaded.

This loads all collection data upfront and returns a list of SequenceCollection objects with full sequence data.

For browsing without loading data, use list_collections() instead.

Returns:

Type Description
List[SequenceCollection]

List of all collections with loaded sequences.

Example::

for coll in store.iter_collections():
    print(f"{coll.digest}: {len(coll.sequences)} sequences")

iter_sequences

iter_sequences()

Iterate over all sequences with their data loaded.

This ensures all sequence data is loaded and returns a list of SequenceRecord objects with full sequence data.

For browsing without loading data, use list_sequences() instead.

Returns:

Type Description
List[SequenceRecord]

List of all sequences with loaded data.

Example::

for seq in store.iter_sequences():
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")

list_collections

list_collections()

List all collection metadata in the store.

Returns metadata for all collections without loading full collection data. Use this for browsing/inventory operations.

Returns:

Type Description
List[SequenceCollectionMetadata]

List of metadata for all collections.

Example::

for meta in store.list_collections():
    print(f"Collection {meta.digest}: {meta.n_sequences} sequences")

list_sequences

list_sequences()

List all sequence metadata in the store.

Returns metadata for all sequences without loading sequence data. Use this for browsing/inventory operations.

Returns:

Type Description
List[SequenceMetadata]

List of metadata for all sequences in the store.

Example::

for meta in store.list_sequences():
    print(f"{meta.name}: {meta.length} bp")

on_disk classmethod

on_disk(cache_path)

Create or load a disk-backed RefgetStore.

If the directory contains an existing store (rgstore.json), loads it. Otherwise creates a new store with Encoded mode.

Parameters:

Name Type Description Default
cache_path Union[str, PathLike]

Directory path for the store. Created if it doesn't exist.

required

Returns:

Type Description
RefgetStore

RefgetStore (new or loaded from disk).

Example::

store = RefgetStore.on_disk("/data/my_store")
store.import_fasta("genome.fa")
# Store is automatically persisted to disk

open_local classmethod

open_local(path)

Open a local RefgetStore from a directory.

Loads only lightweight metadata and stubs. Collections and sequences remain as stubs until explicitly accessed with get_collection()/get_sequence().

Expects: rgstore.json, sequences.rgsi, collections.rgci, collections/*.rgsi

Parameters:

Name Type Description Default
path Union[str, PathLike]

Local directory containing the refget store.

required

Returns:

Type Description
RefgetStore

RefgetStore with metadata loaded, sequences lazy-loaded.

Raises:

Type Description
IOError

If the store directory or index files cannot be read.

Example::

store = RefgetStore.open_local("/data/hg38_store")
seq = store.get_substring("chr1_digest", 0, 1000)

open_remote classmethod

open_remote(cache_path, remote_url)

Open a remote RefgetStore with local caching.

Loads only lightweight metadata and stubs from the remote URL. Data is fetched on-demand when get_collection()/get_sequence() is called.

By default, persistence is enabled (sequences are cached to disk). Call disable_persistence() after loading to keep only in memory.

Parameters:

Name Type Description Default
cache_path Union[str, PathLike]

Local directory to cache downloaded metadata and sequences. Created if it doesn't exist.

required
remote_url str

Base URL of the remote refget store (e.g., "https://example.com/hg38" or "s3://bucket/hg38").

required

Returns:

Type Description
RefgetStore

RefgetStore with metadata loaded, sequences fetched on-demand.

Raises:

Type Description
IOError

If remote metadata cannot be fetched or cache cannot be written.

Example::

store = RefgetStore.open_remote(
    "/data/cache/hg38",
    "https://refget-server.com/hg38"
)
# First access fetches from remote and caches
seq = store.get_substring("chr1_digest", 0, 1000)
# Second access uses cache
seq2 = store.get_substring("chr1_digest", 1000, 2000)

set_encoding_mode

set_encoding_mode(mode)

Change the storage mode, re-encoding/decoding existing sequences as needed.

When switching from Raw to Encoded, all Full sequences in memory are encoded (2-bit packed). When switching from Encoded to Raw, all Full sequences in memory are decoded back to raw bytes.

Parameters:

Name Type Description Default
mode StorageMode

The storage mode to switch to (StorageMode.Raw or StorageMode.Encoded).

required

Example::

store = RefgetStore.in_memory()
store.set_encoding_mode(StorageMode.Raw)

stats

stats()

Returns statistics about the store.

Returns:

Type Description
dict

dict with keys:

- 'n_sequences': Total number of sequences (Stub + Full)
- 'n_sequences_loaded': Number of sequences with data loaded (Full)
- 'n_collections': Total number of collections (Stub + Full)
- 'n_collections_loaded': Number of collections with sequences loaded (Full)
- 'storage_mode': Storage mode ('Raw' or 'Encoded')
- 'total_disk_size': Total size of all files on disk in bytes

Note

n_collections_loaded only reflects collections fully loaded in memory. For remote stores, collections are loaded on-demand when accessed.

Example::

stats = store.stats()
print(f"Store has {stats['n_sequences']} sequences")
print(f"Collections: {stats['n_collections']} total, {stats['n_collections_loaded']} loaded")

write_store_to_directory

write_store_to_directory(root_path, seqdata_path_template)

Write the store to a directory on disk.

Persists the store with all sequences and metadata to disk using the RefgetStore directory format.

Parameters:

Name Type Description Default
root_path Union[str, PathLike]

Directory path to write the store to.

required
seqdata_path_template str

Path template for sequence files (e.g., "sequences/%s2/%s.seq" where %s2 = first 2 chars of digest, %s = full digest).

required

Example::

store.write_store_to_directory(
    "/data/my_store",
    "sequences/%s2/%s.seq"
)

Digest Functions

Low-level functions for computing GA4GH digests:

sha512t24u_digest

sha512t24u_digest(readable)

Compute the GA4GH SHA-512/24u digest for a sequence.

This function computes the GA4GH refget standard digest (truncated SHA-512, base64url encoded) for a given sequence string or bytes.

Parameters:

Name Type Description Default
readable Union[str, bytes]

Input sequence as str or bytes.

required

Returns:

Type Description
str

The SHA-512/24u digest (32 character base64url string).

Raises:

Type Description
TypeError

If input is not str or bytes.

Example::

from gtars.refget import sha512t24u_digest
digest = sha512t24u_digest("ACGT")
print(digest)  # Output: 'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'

canonical_str

canonical_str(item)

Convert a dict into a canonical string representation

Source code in refget/utils.py, lines 21-25
def canonical_str(item: dict) -> bytes:
    """Convert a dict into a canonical string representation"""
    return json.dumps(
        item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True
    ).encode()
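
Example (a minimal usage sketch; the output follows directly from the json.dumps call above, which sorts keys and strips whitespace)::

canonical_str({"name": "chr1", "length": 248956422})
# b'{"length":248956422,"name":"chr1"}'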