
Refget Python API Documentation

FASTA Processing

digest_fasta

digest_fasta(fasta)

Digest all sequences in a FASTA file and compute collection-level digests.

This function reads a FASTA file and computes GA4GH-compliant digests for each sequence, as well as collection-level digests (Level 1 and Level 2) following the GA4GH refget specification.

Parameters:

Name Type Description Default
fasta Union[str, PathLike]

Path to FASTA file (str or PathLike).

required

Returns:

Type Description
SequenceCollection

Collection containing all sequences with their metadata and computed digests.

Raises:

Type Description
IOError

If the FASTA file cannot be read or parsed.

Example:

from gtars.refget import digest_fasta

collection = digest_fasta("genome.fa")
print(f"Collection digest: {collection.digest}")
print(f"Number of sequences: {len(collection)}")

fasta_to_seqcol_dict

fasta_to_seqcol_dict(fasta_file_path)

Convert a FASTA file into a Sequence Collection dict.

Parameters:

Name Type Description Default
fasta_file_path Union[str, Path]

Path to the FASTA file

required

Returns:

Name Type Description
dict dict

A canonical sequence collection dictionary

Raises:

Type Description
ImportError

If gtars is not installed (required for FASTA processing)

Source code in refget/utils.py
def fasta_to_seqcol_dict(fasta_file_path: Union[str, Path]) -> dict:
    """
    Convert a FASTA file into a Sequence Collection dict.

    Args:
        fasta_file_path: Path to the FASTA file

    Returns:
        dict: A canonical sequence collection dictionary

    Raises:
        ImportError: If gtars is not installed (required for FASTA processing)
    """
    if not GTARS_INSTALLED:
        raise ImportError("fasta_to_seqcol_dict requires gtars. Install with: pip install gtars")

    from gtars.refget import digest_fasta

    fasta_seq_digests = digest_fasta(fasta_file_path)
    seqcol_dict = {
        "lengths": [],
        "names": [],
        "sequences": [],
        "sorted_name_length_pairs": [],
        "sorted_sequences": [],
    }
    for s in fasta_seq_digests.sequences:
        seq_name = s.metadata.name
        seq_length = s.metadata.length
        seq_digest = "SQ." + s.metadata.sha512t24u
        nlp = {"length": seq_length, "name": seq_name}
        snlp_digest = sha512t24u_digest(canonical_str(nlp))
        seqcol_dict["lengths"].append(seq_length)
        seqcol_dict["names"].append(seq_name)
        seqcol_dict["sorted_name_length_pairs"].append(snlp_digest)
        seqcol_dict["sequences"].append(seq_digest)
        seqcol_dict["sorted_sequences"].append(seq_digest)
    seqcol_dict["sorted_name_length_pairs"].sort()
    return seqcol_dict
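
A minimal usage sketch (the FASTA path is a placeholder, gtars must be installed, and the import path follows the source location shown above):

from refget.utils import fasta_to_seqcol_dict

seqcol = fasta_to_seqcol_dict("genome.fa")  # hypothetical local FASTA file
print(seqcol["names"])    # sequence names, in file order
print(seqcol["lengths"])  # corresponding sequence lengths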

compare_seqcols

compare_seqcols(A, B)

Workhorse comparison function

Parameters:

Name Type Description Default
A SeqColDict

Sequence collection A

required
B SeqColDict

Sequence collection B

required

Returns:

Name Type Description
dict dict

A dict following the formal seqcol specification's comparison function return value

Source code in refget/utils.py
def compare_seqcols(A: SeqColDict, B: SeqColDict) -> dict:
    """
    Workhorse comparison function

    Args:
        A: Sequence collection A
        B: Sequence collection B

    Returns:
        dict: Following formal seqcol specification comparison function return value
    """
    # validate_seqcol(A)  # First ensure these are the right structure
    # validate_seqcol(B)
    a_keys = list(A.keys())
    b_keys = list(B.keys())
    a_keys.sort()
    b_keys.sort()

    all_keys = a_keys + list(set(b_keys) - set(a_keys))
    all_keys.sort()
    result = {}

    # Compute lengths of each array; only do this for array attributes
    a_lengths = {}
    b_lengths = {}
    for k in a_keys:
        a_lengths[k] = len(A[k])
    for k in b_keys:
        b_lengths[k] = len(B[k])

    return_obj = {
        "attributes": {"a_only": [], "b_only": [], "a_and_b": []},
        "array_elements": {
            "a_count": a_lengths,
            "b_count": b_lengths,
            "a_and_b_count": {},
            "a_and_b_same_order": {},
        },
    }

    for k in all_keys:
        _LOGGER.debug(k)
        if k not in A:
            result[k] = {"flag": -1}
            return_obj["attributes"]["b_only"].append(k)
            # return_obj["array_elements"]["total"][k] = {"a": None, "b": len(B[k])}
        elif k not in B:
            return_obj["attributes"]["a_only"].append(k)
            # return_obj["array_elements"]["total"][k] = {"a": len(A[k]), "b": None}
        else:
            return_obj["attributes"]["a_and_b"].append(k)
            res = _compare_elements(A[k], B[k])
            # return_obj["array_elements"]["total"][k] = {"a": len(A[k]), "b": len(B[k])}
            return_obj["array_elements"]["a_and_b_count"][k] = res["a_and_b"]
            return_obj["array_elements"]["a_and_b_same_order"][k] = res["a_and_b_same_order"]
    return return_obj
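
For example, a sketch comparing two collections built locally (file paths are placeholders; imports follow the source locations shown above):

from refget.utils import fasta_to_seqcol_dict, compare_seqcols

A = fasta_to_seqcol_dict("genome_a.fa")
B = fasta_to_seqcol_dict("genome_b.fa")
result = compare_seqcols(A, B)
print(result["attributes"]["a_and_b"])            # attributes present in both collections
print(result["array_elements"]["a_and_b_count"])  # shared element counts per attribute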

calc_jaccard_similarities

calc_jaccard_similarities(A, B)

Takes two sequence collections and calculates Jaccard similarities for all attributes

Parameters:

Name Type Description Default
A SeqColDict

Sequence collection A

required
B SeqColDict

Sequence collection B

required

Returns:

Name Type Description
dict dict[str, float]

Jaccard similarities for all attributes

Source code in refget/utils.py
def calc_jaccard_similarities(A: SeqColDict, B: SeqColDict) -> dict[str, float]:
    """
    Takes two sequence collections and calculates Jaccard similarities for all attributes

    Args:
        A: Sequence collection A
        B: Sequence collection B

    Returns:
        dict: Jaccard similarities for all attributes
    """

    def calc_jaccard_similarity(A_B_intersection: int, A_B_union: int) -> float:
        if A_B_union == 0:
            return 0.0
        jaccard_similarity = A_B_intersection / A_B_union
        return jaccard_similarity

    jaccard_similarities = {}

    if (
        "human_readable_names" in A.keys()
    ):  # this can cause issues if key exists but is NoneType when comparing with compare_seqcols()
        del A["human_readable_names"]
    if "human_readable_names" in B.keys():
        del B["human_readable_names"]

    comparison_dict = compare_seqcols(A, B)

    list_a_keys = list(comparison_dict["array_elements"]["a_and_b_count"].keys())

    for key in list_a_keys:
        intersection_seqcol = comparison_dict["array_elements"]["a_and_b_count"].get(key)
        a = comparison_dict["array_elements"]["a_count"].get(key)
        b = comparison_dict["array_elements"]["b_count"].get(key)
        union_seqcol = (
            a + b - intersection_seqcol
    )  # inclusion-exclusion principle for calculating union
        jaccard_similarity = calc_jaccard_similarity(intersection_seqcol, union_seqcol)
        jaccard_similarities.update({key: jaccard_similarity})
    return jaccard_similarities
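
A short sketch, again with placeholder FASTA paths:

from refget.utils import fasta_to_seqcol_dict, calc_jaccard_similarities

A = fasta_to_seqcol_dict("genome_a.fa")
B = fasta_to_seqcol_dict("genome_b.fa")
similarities = calc_jaccard_similarities(A, B)
print(similarities["sequences"])  # 1.0 when both files contain identical sequence sets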

validate_seqcol

validate_seqcol(seqcol_obj, schema=None)

Validate a seqcol object against the seqcol schema. Returns True if valid, raises InvalidSeqColError if not, which enumerates the errors. Retrieve individual errors with exception.errors

Source code in refget/utils.py
def validate_seqcol(seqcol_obj: SeqColDict, schema=None) -> bool:
    """Validate a seqcol object against the seqcol schema.
    Returns True if valid, raises InvalidSeqColError if not, which enumerates the errors.
    Retrieve individual errors with exception.errors
    """
    with open(SEQCOL_SCHEMA_PATH, "r") as f:
        schema = json.load(f)
    validator = Draft7Validator(schema)

    if not validator.is_valid(seqcol_obj):
        errors = sorted(validator.iter_errors(seqcol_obj), key=lambda e: e.path)
        raise InvalidSeqColError("Validation failed", errors)
    return True

validate_seqcol_bool

validate_seqcol_bool(seqcol_obj, schema=None)

Validate a seqcol object against the seqcol schema. Returns True if valid, False if not.

To enumerate the errors, use validate_seqcol instead.

Source code in refget/utils.py
def validate_seqcol_bool(seqcol_obj: SeqColDict, schema=None) -> bool:
    """
    Validate a seqcol object against the seqcol schema. Returns True if valid, False if not.

    To enumerate the errors, use validate_seqcol instead.
    """
    with open(SEQCOL_SCHEMA_PATH, "r") as f:
        schema = json.load(f)
    validator = Draft7Validator(schema)
    return validator.is_valid(seqcol_obj)
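
A small sketch combining the two validators (the FASTA path is a placeholder; imports follow the source locations shown above):

from refget.utils import fasta_to_seqcol_dict, validate_seqcol, validate_seqcol_bool

seqcol = fasta_to_seqcol_dict("genome.fa")
if validate_seqcol_bool(seqcol):
    print("seqcol is valid")
else:
    validate_seqcol(seqcol)  # raises InvalidSeqColError listing the individual errors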

FastAPI Integration

create_refget_router

create_refget_router(sequences=False, collections=True, pangenomes=False, fasta_drs=False, refget_store_url=None)

Create a FastAPI router for the sequence collection API. This router provides endpoints for retrieving and comparing sequence collections. You can choose which endpoints to include by setting the sequences, collections, pangenomes, or fasta_drs flags.

Parameters:

Name Type Description Default
sequences bool

Include sequence endpoints

False
collections bool

Include sequence collection endpoints

True
pangenomes bool

Include pangenome endpoints

False
fasta_drs bool

Include FASTA DRS endpoints

False
refget_store_url str

URL of backing RefgetStore (e.g., s3://bucket/store/)

None

Returns:

Type Description
APIRouter

A FastAPI router with the specified endpoints

Examples:

app.include_router(create_refget_router(fasta_drs=True), prefix="/seqcol")
Source code in refget/router.py
def create_refget_router(
    sequences: bool = False,
    collections: bool = True,
    pangenomes: bool = False,
    fasta_drs: bool = False,
    refget_store_url: str = None,
) -> APIRouter:
    """
    Create a FastAPI router for the sequence collection API.
    This router provides endpoints for retrieving and comparing sequence collections.
    You can choose which endpoints to include by setting the sequences, collections,
    pangenomes, or fasta_drs flags.

    Args:
        sequences (bool): Include sequence endpoints
        collections (bool): Include sequence collection endpoints
        pangenomes (bool): Include pangenome endpoints
        fasta_drs (bool): Include FASTA DRS endpoints
        refget_store_url (str): URL of backing RefgetStore (e.g., s3://bucket/store/)

    Returns:
        (APIRouter): A FastAPI router with the specified endpoints

    Examples:
        ```
        app.include_router(create_refget_router(fasta_drs=True), prefix="/seqcol")
        ```
    """
    # Store config for service-info discovery
    _ROUTER_CONFIG["fasta_drs"] = fasta_drs
    _ROUTER_CONFIG["refget_store_url"] = refget_store_url

    refget_router = APIRouter()
    if sequences:
        _LOGGER.info("Adding sequence endpoints...")
        refget_router.include_router(seq_router)
    if collections:
        _LOGGER.info("Adding collection endpoints...")
        refget_router.include_router(seqcol_router)
    if pangenomes:
        _LOGGER.info("Adding pangenome endpoints...")
        refget_router.include_router(pangenome_router)
    if fasta_drs:
        _LOGGER.info("Adding FASTA DRS endpoints...")
        refget_router.include_router(fasta_drs_router, prefix="/fasta")
    return refget_router
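
A minimal sketch of mounting the router on a FastAPI app (the app object and prefix are illustrative; the import path follows the source location shown above):

from fastapi import FastAPI
from refget.router import create_refget_router

app = FastAPI()
# collection endpoints are on by default; fasta_drs adds the DRS endpoints under /fasta
app.include_router(create_refget_router(fasta_drs=True), prefix="/seqcol")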

Client Classes

The client module provides interfaces for interacting with refget-compliant servers.

SequenceClient

SequenceClient(urls=['https://www.ebi.ac.uk/ena/cram'], raise_errors=None)

Bases: RefgetClient

A client for interacting with a refget sequences API.

Initializes the sequences client.

Parameters:

Name Type Description Default
urls list

A list of base URLs of the sequences API. Defaults to ["https://www.ebi.ac.uk/ena/cram"].

['https://www.ebi.ac.uk/ena/cram']
raise_errors bool

Whether to raise errors or log them. Defaults to None, which will guess.

None

Attributes: urls (list): The list of base URLs of the sequences API.

Source code in refget/clients.py
def __init__(
    self,
    urls: list[str] = ["https://www.ebi.ac.uk/ena/cram"],
    raise_errors: Optional[bool] = None,
) -> None:
    """
    Initializes the sequences client.

    Args:
        urls (list, optional): A list of base URLs of the sequences API. Defaults to ["https://www.ebi.ac.uk/ena/cram"].
        raise_errors (bool, optional): Whether to raise errors or log them. Defaults to None, which will guess.
    Attributes:
        urls (list): The list of base URLs of the sequences API.
    """
    # Remove trailing slashes from input URLs
    self.urls = [url.rstrip("/") for url in urls]
    # If raise_errors is None, set it to True if the client is not being used as a library
    if raise_errors is None:
        raise_errors = __name__ == "__main__"
    self.raise_errors = raise_errors

get_metadata

get_metadata(digest)

Retrieves metadata for a given sequence digest.

Parameters:

Name Type Description Default
digest str

The digest of the sequence.

required

Returns:

Type Description
dict

The metadata.

Source code in refget/clients.py
def get_metadata(self, digest: str) -> Optional[dict]:
    """
    Retrieves metadata for a given sequence digest.

    Args:
        digest (str): The digest of the sequence.

    Returns:
        (dict): The metadata.
    """
    endpoint = f"/sequence/{digest}/metadata"
    return _try_urls(self.urls, endpoint, raise_errors=self.raise_errors)

get_sequence

get_sequence(digest, start=None, end=None)

Retrieves a sequence for a given digest.

Parameters:

Name Type Description Default
digest str

The digest of the sequence.

required
start int

Optional start coordinate of the subsequence, passed as a query parameter.

None
end int

Optional end coordinate of the subsequence, passed as a query parameter.

None

Returns:

Type Description
str

The sequence.

Source code in refget/clients.py
def get_sequence(
    self, digest: str, start: Optional[int] = None, end: Optional[int] = None
) -> Optional[str]:
    """
    Retrieves a sequence for a given digest.

    Args:
        digest (str): The digest of the sequence.

    Returns:
        (str): The sequence.
    """
    query_params = {}
    if start is not None:
        query_params["start"] = start
    if end is not None:
        query_params["end"] = end

    endpoint = f"/sequence/{digest}"
    return _try_urls(self.urls, endpoint, params=query_params, raise_errors=self.raise_errors)
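
A usage sketch against the default EBI ENA endpoint (the digest below is a placeholder):

from refget.clients import SequenceClient

client = SequenceClient()
seq = client.get_sequence("sequence_digest", start=0, end=10)  # subsequence via the API's start/end query parameters
meta = client.get_metadata("sequence_digest")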

SequenceCollectionClient

SequenceCollectionClient(urls=['https://seqcolapi.databio.org'], raise_errors=None)

Bases: RefgetClient

A client for interacting with a refget sequence collections API.

Initializes the sequence collection client.

Parameters:

Name Type Description Default
urls list

A list of base URLs of the sequence collection API. Defaults to ["https://seqcolapi.databio.org"].

['https://seqcolapi.databio.org']
raise_errors bool

Whether to raise errors or log them. Defaults to None, which will guess.

None

Attributes:

Name Type Description
urls list

The list of base URLs of the sequence collection API.

Source code in refget/clients.py
def __init__(
    self,
    urls: list[str] = ["https://seqcolapi.databio.org"],
    raise_errors: Optional[bool] = None,
) -> None:
    """
    Initializes the sequence collection client.

    Args:
        urls (list, optional): A list of base URLs of the sequence collection API. Defaults to ["https://seqcolapi.databio.org"].

    Attributes:
        urls (list): The list of base URLs of the sequence collection API.
    """
    # Remove trailing slashes from input URLs
    self.urls = [url.rstrip("/") for url in urls]
    # If raise_errors is None, set it to True if the client is not being used as a library
    if raise_errors is None:
        raise_errors = __name__ == "__main__"
    self.raise_errors = raise_errors
    self._fasta_client = None
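
A usage sketch against the default server (digests are placeholders):

from refget.clients import SequenceCollectionClient

client = SequenceCollectionClient()
collection = client.get_collection("collection_digest", level=2)
comparison = client.compare("collection_digest_a", "collection_digest_b")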

build_chrom_sizes

build_chrom_sizes(digest)

Build a chrom.sizes file content for a sequence collection.

Format per line: NAME\tLENGTH

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required

Returns:

Type Description
str

String content of the chrom.sizes file

Source code in refget/clients.py
def build_chrom_sizes(self, digest: str) -> str:
    """
    Build a chrom.sizes file content for a sequence collection.

    Format per line: NAME\\tLENGTH

    Args:
        digest (str): The sequence collection digest

    Returns:
        (str): String content of the chrom.sizes file
    """
    collection = self.get_collection(digest, level=2)
    if not collection:
        raise ValueError(f"No collection found for {digest}")

    names = collection["names"]
    lengths = collection["lengths"]

    lines = []
    for name, length in zip(names, lengths):
        lines.append(f"{name}\t{length}")

    return "\n".join(lines) + "\n"

build_fai

build_fai(digest)

Build a complete .fai index file content for a FASTA.

FAI format per line: NAME\tLENGTH\tOFFSET\tLINEBASES\tLINEWIDTH

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required

Returns:

Type Description
str

String content of the .fai file

Source code in refget/clients.py
def build_fai(self, digest: str) -> str:
    """
    Build a complete .fai index file content for a FASTA.

    FAI format per line: NAME\\tLENGTH\\tOFFSET\\tLINEBASES\\tLINEWIDTH

    Args:
        digest (str): The sequence collection digest

    Returns:
        (str): String content of the .fai file
    """
    return self._get_fasta_helper().build_fai(digest, seqcol_client=self)

compare

compare(digest1, digest2)

Compares two sequence collections hosted on the server.

Parameters:

Name Type Description Default
digest1 str

The digest of the first sequence collection.

required
digest2 str

The digest of the second sequence collection.

required

Returns:

Type Description
dict

The JSON response containing the comparison of the two sequence collections.

Source code in refget/clients.py
def compare(self, digest1: str, digest2: str) -> Optional[dict]:
    """
    Compares two sequence collections hosted on the server.

    Args:
        digest1 (str): The digest of the first sequence collection.
        digest2 (str): The digest of the second sequence collection.

    Returns:
        (dict): The JSON response containing the comparison of the two sequence collections.
    """
    endpoint = f"/comparison/{digest1}/{digest2}"
    return _try_urls(self.urls, endpoint)

compare_local

compare_local(digest, local_collection)

Compares a server-hosted sequence collection with a local collection.

Parameters:

Name Type Description Default
digest str

The digest of the server-hosted sequence collection.

required
local_collection dict

A level 2 sequence collection representation.

required

Returns:

Type Description
dict

The JSON response containing the comparison.

Source code in refget/clients.py
def compare_local(self, digest: str, local_collection: dict) -> Optional[dict]:
    """
    Compares a server-hosted sequence collection with a local collection.

    Args:
        digest (str): The digest of the server-hosted sequence collection.
        local_collection (dict): A level 2 sequence collection representation.

    Returns:
        (dict): The JSON response containing the comparison.
    """
    endpoint = f"/comparison/{digest}"
    return _try_urls(self.urls, endpoint, method="POST", json=local_collection)

download_fasta

download_fasta(digest, dest_path=None, access_id=None)

Download the FASTA file to a local path.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
dest_path str

Destination file path. If None, uses object name.

None
access_id str

Specific access method to use. If None, tries all.

None

Returns:

Type Description
str

Path to downloaded file

Raises:

Type Description
ValueError

If no access methods available or specified access_id not found

Source code in refget/clients.py
def download_fasta(self, digest: str, dest_path: str = None, access_id: str = None) -> str:
    """
    Download the FASTA file to a local path.

    Args:
        digest (str): The sequence collection digest
        dest_path (str, optional): Destination file path. If None, uses object name.
        access_id (str, optional): Specific access method to use. If None, tries all.

    Returns:
        (str): Path to downloaded file

    Raises:
        ValueError: If no access methods available or specified access_id not found
    """
    return self._get_fasta_helper().download(digest, dest_path, access_id)

download_fasta_to_store

download_fasta_to_store(digest, store, access_id=None, temp_dir=None)

Download the FASTA file and import it into a RefgetStore.

This method downloads the FASTA file from the DRS endpoint and immediately imports it into the provided RefgetStore, enabling local sequence retrieval by digest without re-downloading.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
store RefgetStore

The RefgetStore instance to import into

required
access_id str

Specific access method to use. If None, tries all.

None
temp_dir str

Directory for temporary download. If None, uses system temp.

None

Returns:

Type Description
str

The collection digest of the imported sequences

Raises:

Type Description
ValueError

If no access methods available or specified access_id not found

ImportError

If gtars/RefgetStore is not available

Example

from refget.store import RefgetStore, StorageMode
from refget.clients import SequenceCollectionClient

store = RefgetStore(StorageMode.Encoded)
client = SequenceCollectionClient()
collection_digest = client.download_fasta_to_store("abc123", store)

# Now you can retrieve sequences by digest from the local store
seq = store.get_substring(sequence_digest, 0, 100)

Source code in refget/clients.py
def download_fasta_to_store(
    self, digest: str, store: "RefgetStore", access_id: str = None, temp_dir: str = None
) -> str:
    """
    Download the FASTA file and import it into a RefgetStore.

    This method downloads the FASTA file from the DRS endpoint and immediately
    imports it into the provided RefgetStore, enabling local sequence retrieval
    by digest without re-downloading.

    Args:
        digest (str): The sequence collection digest
        store (RefgetStore): The RefgetStore instance to import into
        access_id (str, optional): Specific access method to use. If None, tries all.
        temp_dir (str, optional): Directory for temporary download. If None, uses system temp.

    Returns:
        (str): The collection digest of the imported sequences

    Raises:
        ValueError: If no access methods available or specified access_id not found
        ImportError: If gtars/RefgetStore is not available

    Example:
        >>> from refget.store import RefgetStore, StorageMode
        >>> from refget.clients import SequenceCollectionClient
        >>> store = RefgetStore(StorageMode.Encoded)
        >>> client = SequenceCollectionClient()
        >>> collection_digest = client.download_fasta_to_store("abc123", store)
        >>> # Now you can retrieve sequences by digest from the local store
        >>> seq = store.get_substring(sequence_digest, 0, 100)
    """
    return self._get_fasta_helper().download_to_store(digest, store, access_id, temp_dir)

get_attribute

get_attribute(attribute, digest)

Retrieves a specific attribute value by its digest.

Parameters:

Name Type Description Default
attribute str

The attribute name (e.g., "names", "lengths", "sequences").

required
digest str

The level 1 digest of the attribute.

required

Returns:

Type Description
dict

The JSON response containing the attribute value.

Source code in refget/clients.py
def get_attribute(self, attribute: str, digest: str) -> Optional[dict]:
    """
    Retrieves a specific attribute value by its digest.

    Args:
        attribute (str): The attribute name (e.g., "names", "lengths", "sequences").
        digest (str): The level 1 digest of the attribute.

    Returns:
        (dict): The JSON response containing the attribute value.
    """
    endpoint = f"/attribute/collection/{attribute}/{digest}"
    return _try_urls(self.urls, endpoint)

get_collection

get_collection(digest, level=2)

Retrieves a sequence collection for a given digest and detail level.

Parameters:

Name Type Description Default
digest str

The digest of the sequence collection.

required
level int

The level of detail for the sequence collection. Defaults to 2.

2

Returns:

Type Description
dict

The JSON response containing the sequence collection.

Source code in refget/clients.py
def get_collection(self, digest: str, level: int = 2) -> Optional[dict]:
    """
    Retrieves a sequence collection for a given digest and detail level.

    Args:
        digest (str): The digest of the sequence collection.
        level (int, optional): The level of detail for the sequence collection. Defaults to 2.

    Returns:
        (dict): The JSON response containing the sequence collection.
    """
    endpoint = f"/collection/{digest}?level={level}"
    return _try_urls(self.urls, endpoint)

get_fasta

get_fasta(digest)

Get DRS object metadata for a FASTA file.

Parameters:

Name Type Description Default
digest str

The sequence collection digest (which is also the DRS object ID)

required

Returns:

Type Description
dict

DRS object with id, self_uri, size, checksums, access_methods, etc.

Source code in refget/clients.py
def get_fasta(self, digest: str) -> Optional[dict]:
    """
    Get DRS object metadata for a FASTA file.

    Args:
        digest (str): The sequence collection digest (which is also the DRS object ID)

    Returns:
        (dict): DRS object with id, self_uri, size, checksums, access_methods, etc.
    """
    return self._get_fasta_helper().get_object(digest)

get_fasta_index

get_fasta_index(digest)

Get FAI index data for a FASTA file.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required

Returns:

Type Description
dict

Dict with line_bases, extra_line_bytes, offsets

Source code in refget/clients.py
def get_fasta_index(self, digest: str) -> Optional[dict]:
    """
    Get FAI index data for a FASTA file.

    Args:
        digest (str): The sequence collection digest

    Returns:
        (dict): Dict with line_bases, extra_line_bytes, offsets
    """
    return self._get_fasta_helper().get_index(digest)

get_refget_store

get_refget_store(cache_dir)

Get a RefgetStore instance connected to the server's backing store.

Parameters:

Name Type Description Default
cache_dir str

Local directory for caching store data

required

Returns:

Type Description
RefgetStore

RefgetStore instance loaded from remote

Raises:

Type Description
ValueError

If server doesn't have a RefgetStore configured

ImportError

If gtars is not installed

Source code in refget/clients.py
def get_refget_store(self, cache_dir: str) -> "RefgetStore":
    """
    Get a RefgetStore instance connected to the server's backing store.

    Args:
        cache_dir (str): Local directory for caching store data

    Returns:
        (RefgetStore): RefgetStore instance loaded from remote

    Raises:
        ValueError: If server doesn't have a RefgetStore configured
        ImportError: If gtars is not installed
    """
    url = self.get_refget_store_url()
    if not url:
        raise ValueError("Server does not have a RefgetStore configured")

    try:
        from .store import RefgetStore
    except ImportError:
        raise ImportError("gtars is required: pip install gtars")

    return RefgetStore.load_remote(cache_dir, url)
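
A sketch, assuming the server advertises a backing store in its service-info (the cache directory is illustrative):

from refget.clients import SequenceCollectionClient

client = SequenceCollectionClient()
if client.get_refget_store_url():
    store = client.get_refget_store(cache_dir="./refget_cache")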

get_refget_store_url

get_refget_store_url()

Discover RefgetStore URL from service-info if available.

Returns:

Type Description
str

The RefgetStore URL if configured, None otherwise.

Source code in refget/clients.py
def get_refget_store_url(self) -> Optional[str]:
    """
    Discover RefgetStore URL from service-info if available.

    Returns:
        (str): The RefgetStore URL if configured, None otherwise.
    """
    info = self.service_info()
    store_config = info.get("seqcol", {}).get("refget_store", {})
    if store_config.get("enabled"):
        return store_config.get("url")
    return None

is_fasta_drs_enabled

is_fasta_drs_enabled()

Check if FastaDRS endpoints are available.

Returns:

Type Description
bool

True if FastaDRS is enabled, False otherwise.

Source code in refget/clients.py
def is_fasta_drs_enabled(self) -> bool:
    """
    Check if FastaDRS endpoints are available.

    Returns:
        (bool): True if FastaDRS is enabled, False otherwise.
    """
    info = self.service_info()
    return info.get("seqcol", {}).get("fasta_drs", {}).get("enabled", False)

list_attributes

list_attributes(attribute, page=None, page_size=None)

Lists all available values for a given attribute with optional paging support.

Parameters:

Name Type Description Default
attribute str

The attribute to list values for.

required
page int

The page number to retrieve. Defaults to None.

None
page_size int

The number of items per page. Defaults to None.

None

Returns:

Type Description
dict

The JSON response containing the list of available values for the attribute.

Source code in refget/clients.py
def list_attributes(
    self, attribute: str, page: Optional[int] = None, page_size: Optional[int] = None
) -> Optional[dict]:
    """
    Lists all available values for a given attribute with optional paging support.

    Args:
        attribute (str): The attribute to list values for.
        page (int, optional): The page number to retrieve. Defaults to None.
        page_size (int, optional): The number of items per page. Defaults to None.

    Returns:
        (dict): The JSON response containing the list of available values for the attribute.
    """
    params = {}
    if page is not None:
        params["page"] = page
    if page_size is not None:
        params["page_size"] = page_size

    endpoint = f"/list/attributes/{attribute}"
    return _try_urls(self.urls, endpoint, params=params)

list_collections

list_collections(page=None, page_size=None, **filters)

Lists all available sequence collections with optional paging and attribute filtering support.

Parameters:

Name Type Description Default
page int

The page number to retrieve. Defaults to None.

None
page_size int

The number of items per page. Defaults to None.

None
**filters Any

Optional attribute filters (e.g., names="abc123", lengths="def456"). Values should be level 1 digests of the attributes.

{}

Returns:

Type Description
dict

The JSON response containing the list of available sequence collections.

Source code in refget/clients.py
def list_collections(
    self,
    page: Optional[int] = None,
    page_size: Optional[int] = None,
    **filters,
) -> Optional[dict]:
    """
    Lists all available sequence collections with optional paging and attribute filtering support.

    Args:
        page (int, optional): The page number to retrieve. Defaults to None.
        page_size (int, optional): The number of items per page. Defaults to None.
        **filters (Any): Optional attribute filters (e.g., names="abc123", lengths="def456").
                  Values should be level 1 digests of the attributes.

    Returns:
        (dict): The JSON response containing the list of available sequence collections.
    """
    params = {}
    if page is not None:
        params["page"] = page
    if page_size is not None:
        params["page_size"] = page_size
    params.update(filters)

    endpoint = "/list/collection"
    return _try_urls(self.urls, endpoint, params=params)
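
For example (the filter value is a placeholder level 1 digest):

from refget.clients import SequenceCollectionClient

client = SequenceCollectionClient()
collections = client.list_collections(page_size=10)
filtered = client.list_collections(names="names_attribute_digest")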

service_info

service_info()

Retrieves information about the service.

Returns:

Type Description
dict

The service information.

Source code in refget/clients.py
def service_info(self) -> Optional[dict]:
    """
    Retrieves information about the service.

    Returns:
        (dict): The service information.
    """
    endpoint = "/service-info"
    return _try_urls(self.urls, endpoint)

write_chrom_sizes

write_chrom_sizes(digest, dest_path)

Write a chrom.sizes file for a sequence collection.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
dest_path str

Path to write the chrom.sizes file

required

Returns:

Type Description
str

Path to the written file

Source code in refget/clients.py
def write_chrom_sizes(self, digest: str, dest_path: str) -> str:
    """
    Write a chrom.sizes file for a sequence collection.

    Args:
        digest (str): The sequence collection digest
        dest_path (str): Path to write the chrom.sizes file

    Returns:
        (str): Path to the written file
    """
    content = self.build_chrom_sizes(digest)
    with open(dest_path, "w") as f:
        f.write(content)
    return dest_path

write_fai

write_fai(digest, dest_path)

Write a .fai index file for a FASTA.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
dest_path str

Path to write the .fai file

required

Returns:

Type Description
str

Path to the written file

Source code in refget/clients.py
def write_fai(self, digest: str, dest_path: str) -> str:
    """
    Write a .fai index file for a FASTA.

    Args:
        digest (str): The sequence collection digest
        dest_path (str): Path to write the .fai file

    Returns:
        (str): Path to the written file
    """
    return self._get_fasta_helper().write_fai(digest, dest_path, seqcol_client=self)

FastaDrsClient

FastaDrsClient(urls=['https://seqcolapi.databio.org/fasta'], raise_errors=None)

Bases: RefgetClient

A client for interacting with FASTA files via GA4GH DRS endpoints.

Initializes the FASTA DRS client.

Parameters:

Name Type Description Default
urls list

A list of base URLs of the FASTA DRS API. Defaults to ["https://seqcolapi.databio.org/fasta"].

['https://seqcolapi.databio.org/fasta']
raise_errors bool

Whether to raise errors or log them. Defaults to None, which will guess.

None

Attributes:

Name Type Description
urls list

The list of base URLs of the FASTA DRS API.

Source code in refget/clients.py
def __init__(
    self,
    urls: list[str] = ["https://seqcolapi.databio.org/fasta"],
    raise_errors: Optional[bool] = None,
) -> None:
    """
    Initializes the FASTA DRS client.

    Args:
        urls (list, optional): A list of base URLs of the FASTA DRS API.
            Defaults to ["https://seqcolapi.databio.org/fasta"].
        raise_errors (bool, optional): Whether to raise errors or log them.
            Defaults to None, which will guess.

    Attributes:
        urls (list): The list of base URLs of the FASTA DRS API.
    """
    self.urls = [url.rstrip("/") for url in urls]
    if raise_errors is None:
        raise_errors = __name__ == "__main__"
    self.raise_errors = raise_errors

build_fai

build_fai(digest, seqcol_client=None)

Build a complete .fai index file content for a FASTA.

FAI format per line: NAME\tLENGTH\tOFFSET\tLINEBASES\tLINEWIDTH

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
seqcol_client SequenceCollectionClient

SequenceCollectionClient to use. If None, uses parent client or creates one.

None

Returns:

Type Description
str

String content of the .fai file

Source code in refget/clients.py
def build_fai(self, digest: str, seqcol_client: "SequenceCollectionClient" = None) -> str:
    """
    Build a complete .fai index file content for a FASTA.

    FAI format per line: NAME\tLENGTH\tOFFSET\tLINEBASES\tLINEWIDTH

    Args:
        digest (str): The sequence collection digest
        seqcol_client (SequenceCollectionClient, optional): SequenceCollectionClient
            to use. If None, uses parent client or creates one.

    Returns:
        (str): String content of the .fai file
    """
    # Get FAI index data
    index = self.get_index(digest)
    if not index:
        raise ValueError(f"No FAI index for {digest}")

    # Get sequence collection for names/lengths
    if seqcol_client is None:
        # Use parent client if we were created via SequenceCollectionClient.fasta
        if hasattr(self, "_seqcol_client") and self._seqcol_client is not None:
            seqcol_client = self._seqcol_client
        else:
            # Derive seqcol URL from fasta URL (strip /fasta suffix)
            base_urls = [url.rsplit("/fasta", 1)[0] for url in self.urls]
            seqcol_client = SequenceCollectionClient(urls=base_urls)

    collection = seqcol_client.get_collection(digest, level=2)
    if not collection:
        raise ValueError(f"No collection found for {digest}")

    names = collection["names"]
    lengths = collection["lengths"]
    offsets = index["offsets"]
    line_bases = index["line_bases"]
    line_width = line_bases + index["extra_line_bytes"]

    # Build FAI lines
    lines = []
    for name, length, offset in zip(names, lengths, offsets):
        # FAI format: NAME LENGTH OFFSET LINEBASES LINEWIDTH
        lines.append(f"{name}\t{length}\t{offset}\t{line_bases}\t{line_width}")

    return "\n".join(lines) + "\n"

download

download(digest, dest_path=None, access_id=None)

Download the FASTA file to a local path.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
dest_path str

Destination file path. If None, uses object name.

None
access_id str

Specific access method to use. If None, tries all.

None

Returns:

Type Description
str

Path to downloaded file

Raises:

Type Description
ValueError

If no access methods available or specified access_id not found

Source code in refget/clients.py
def download(self, digest: str, dest_path: str = None, access_id: str = None) -> str:
    """
    Download the FASTA file to a local path.

    Args:
        digest (str): The sequence collection digest
        dest_path (str, optional): Destination file path. If None, uses object name.
        access_id (str, optional): Specific access method to use. If None, tries all.

    Returns:
        (str): Path to downloaded file

    Raises:
        ValueError: If no access methods available or specified access_id not found
    """
    drs_obj = self.get_object(digest)
    if not drs_obj or not drs_obj.get("access_methods"):
        raise ValueError(f"No access methods for {digest}")

    # Filter to specific access method if requested
    methods = drs_obj["access_methods"]
    if access_id:
        methods = [m for m in methods if m.get("access_id") == access_id]
        if not methods:
            raise ValueError(f"Access method '{access_id}' not found for {digest}")

    # Find first accessible URL
    for method in methods:
        url = None
        if method.get("access_url"):
            access_url = method["access_url"]
            url = access_url.get("url") if isinstance(access_url, dict) else access_url
        elif method.get("access_id"):
            access_info = self.get_access_url(digest, method["access_id"])
            url = access_info.get("url") if access_info else None

        if url:
            if dest_path is None:
                dest_path = drs_obj.get("name", f"{digest}.fa")

            response = requests.get(url, stream=True)
            response.raise_for_status()
            with open(dest_path, "wb") as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            return dest_path

    raise ValueError(f"No accessible URLs for {digest}")

download_to_store

download_to_store(digest, store, access_id=None, temp_dir=None)

Download the FASTA file and import it into a RefgetStore.

This method downloads the FASTA file from the DRS endpoint and immediately imports it into the provided RefgetStore, enabling local sequence retrieval by digest without re-downloading.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
store RefgetStore

The RefgetStore instance to import into

required
access_id str

Specific access method to use. If None, tries all.

None
temp_dir str

Directory for temporary download. If None, uses system temp.

None

Returns:

Type Description
str

The collection digest of the imported sequences

Raises:

Type Description
ValueError

If no access methods available or specified access_id not found

ImportError

If gtars/RefgetStore is not available

Example

from refget.store import RefgetStore, StorageMode

store = RefgetStore(StorageMode.Encoded)
client = FastaDrsClient()
collection_digest = client.download_to_store("abc123", store)

Source code in refget/clients.py
def download_to_store(
    self, digest: str, store: "RefgetStore", access_id: str = None, temp_dir: str = None
) -> str:
    """
    Download the FASTA file and import it into a RefgetStore.

    This method downloads the FASTA file from the DRS endpoint and immediately
    imports it into the provided RefgetStore, enabling local sequence retrieval
    by digest without re-downloading.

    Args:
        digest (str): The sequence collection digest
        store (RefgetStore): The RefgetStore instance to import into
        access_id (str, optional): Specific access method to use. If None, tries all.
        temp_dir (str, optional): Directory for temporary download. If None, uses system temp.

    Returns:
        (str): The collection digest of the imported sequences

    Raises:
        ValueError: If no access methods available or specified access_id not found
        ImportError: If gtars/RefgetStore is not available

    Example:
        >>> from refget.store import RefgetStore, StorageMode
        >>> store = RefgetStore(StorageMode.Encoded)
        >>> client = FastaDrsClient()
        >>> collection_digest = client.download_to_store("abc123", store)
    """
    import tempfile
    import os

    # Verify store is available
    try:
        from .store import RefgetStore as RefgetStoreClass
    except ImportError:
        raise ImportError("gtars is required for download_to_store functionality")

    # Download to temporary location
    temp_file = None
    try:
        if temp_dir:
            os.makedirs(temp_dir, exist_ok=True)
            temp_file = os.path.join(temp_dir, f"{digest}.fa")
        else:
            # Create a named temporary file
            fd, temp_file = tempfile.mkstemp(suffix=".fa", prefix=f"{digest}_")
            os.close(fd)  # Close the file descriptor

        # Download the FASTA
        downloaded_path = self.download(digest, dest_path=temp_file, access_id=access_id)
        _LOGGER.info(f"Downloaded FASTA to {downloaded_path}")

        # Import into store
        store.import_fasta(downloaded_path)
        _LOGGER.info(f"Imported FASTA into RefgetStore: {digest}")

        return digest

    finally:
        # Clean up temporary file if we created it in system temp
        if temp_file and not temp_dir and os.path.exists(temp_file):
            try:
                os.remove(temp_file)
            except Exception as e:
                _LOGGER.warning(f"Could not remove temporary file {temp_file}: {e}")

get_access_url

get_access_url(digest, access_id)

Get access URL for a specific access method.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
access_id str

The access ID from the access method

required

Returns:

Type Description
dict

Access URL object

Source code in refget/clients.py
def get_access_url(self, digest: str, access_id: str) -> Optional[dict]:
    """
    Get access URL for a specific access method.

    Args:
        digest (str): The sequence collection digest
        access_id (str): The access ID from the access method

    Returns:
        (dict): Access URL object
    """
    endpoint = f"/objects/{digest}/access/{access_id}"
    return _try_urls(self.urls, endpoint, raise_errors=self.raise_errors)

get_index

get_index(digest)

Get FAI index data for a FASTA file.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required

Returns:

Type Description
dict

Dict with line_bases, extra_line_bytes, offsets

Source code in refget/clients.py
def get_index(self, digest: str) -> Optional[dict]:
    """
    Get FAI index data for a FASTA file.

    Args:
        digest (str): The sequence collection digest

    Returns:
        (dict): Dict with line_bases, extra_line_bytes, offsets
    """
    endpoint = f"/objects/{digest}/index"
    return _try_urls(self.urls, endpoint, raise_errors=self.raise_errors)

get_object

get_object(digest)

Get DRS object metadata for a FASTA file.

Parameters:

Name Type Description Default
digest str

The sequence collection digest (which is also the DRS object ID)

required

Returns:

Type Description
dict

DRS object with id, self_uri, size, checksums, access_methods, etc.

Source code in refget/clients.py
def get_object(self, digest: str) -> Optional[dict]:
    """
    Get DRS object metadata for a FASTA file.

    Args:
        digest (str): The sequence collection digest (which is also the DRS object ID)

    Returns:
        (dict): DRS object with id, self_uri, size, checksums, access_methods, etc.
    """
    endpoint = f"/objects/{digest}"
    return _try_urls(self.urls, endpoint, raise_errors=self.raise_errors)

service_info

service_info()

Get DRS service info.

Returns:

Type Description
dict

The service information.

Source code in refget/clients.py
def service_info(self) -> Optional[dict]:
    """
    Get DRS service info.

    Returns:
        (dict): The service information.
    """
    endpoint = "/service-info"
    return _try_urls(self.urls, endpoint)

write_fai

write_fai(digest, dest_path, seqcol_client=None)

Write a .fai index file for a FASTA.

Parameters:

Name Type Description Default
digest str

The sequence collection digest

required
dest_path str

Path to write the .fai file

required
seqcol_client SequenceCollectionClient

SequenceCollectionClient to use

None

Returns:

Type Description
str

Path to the written file

Source code in refget/clients.py
def write_fai(
    self, digest: str, dest_path: str, seqcol_client: "SequenceCollectionClient" = None
) -> str:
    """
    Write a .fai index file for a FASTA.

    Args:
        digest (str): The sequence collection digest
        dest_path (str): Path to write the .fai file
        seqcol_client (SequenceCollectionClient, optional): SequenceCollectionClient to use

    Returns:
        (str): Path to the written file
    """
    fai_content = self.build_fai(digest, seqcol_client)
    with open(dest_path, "w") as f:
        f.write(fai_content)
    return dest_path

PangenomeClient

Bases: RefgetClient

Agent Classes

Agents provide higher-level abstractions for working with refget data in a PostgreSQL database.

RefgetDBAgent

RefgetDBAgent(engine=None, postgres_str=None, schema=SEQCOL_SCHEMA_PATH, inherent_attrs=DEFAULT_INHERENT_ATTRS, fasta_drs_url_prefix=None)

Bases: object

Primary aggregator agent, interface to all other agents

Parameterize it via these environment variables:

- POSTGRES_HOST
- POSTGRES_PORT
- POSTGRES_DB
- POSTGRES_USER
- POSTGRES_PASSWORD

Source code in refget/agents.py
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
def __init__(
    self,
    engine: Optional[SqlalchemyDatabaseEngine] = None,
    postgres_str: Optional[str] = None,
    schema=SEQCOL_SCHEMA_PATH,
    inherent_attrs: List[str] = DEFAULT_INHERENT_ATTRS,
    fasta_drs_url_prefix: Optional[str] = None,
):  # = "sqlite:///foo.db"
    if engine is not None:
        self.engine = engine
    else:
        if not postgres_str:
            # Configure via environment variables
            POSTGRES_HOST = os.getenv("POSTGRES_HOST")
            POSTGRES_PORT = os.getenv("POSTGRES_PORT")
            POSTGRES_DB = os.getenv("POSTGRES_DB")
            POSTGRES_USER = os.getenv("POSTGRES_USER")
            POSTGRES_PASSWORD = os.getenv("POSTGRES_PASSWORD")
            postgres_str = URL.create(
                "postgresql",
                username=POSTGRES_USER,
                password=POSTGRES_PASSWORD,
                host=POSTGRES_HOST,
                port=int(POSTGRES_PORT) if POSTGRES_PORT else None,
                database=POSTGRES_DB,
            )

        try:
            self.engine = create_engine(postgres_str, echo=False)
        except Exception as e:
            _LOGGER.error(f"Error: {e}")
            _LOGGER.error("Unable to connect to database")
            _LOGGER.error(
                "Please check that you have set the database credentials correctly in the environment variables"
            )
            _LOGGER.error(f"Database engine string: {postgres_str}")
            raise e
    try:
        SQLModel.metadata.create_all(self.engine)
    except Exception as e:
        _LOGGER.error(f"Error: {e}")
        _LOGGER.error("Unable to create tables in the database")
        raise e

    # Read schema
    if schema:
        self.schema_dict = load_json(schema)
        _LOGGER.debug(f"Schema: {self.schema_dict}")
        try:
            self.inherent_attrs = self.schema_dict["ga4gh"]["inherent"]
        except KeyError:
            self.inherent_attrs = inherent_attrs
            _LOGGER.warning(
                f"No 'inherent' attributes found in schema; using defaults: {inherent_attrs}"
            )
    else:
        _LOGGER.warning("No schema provided; using defaults")
        self.schema_dict = None
        self.inherent_attrs = inherent_attrs

    self.__sequence = SequenceAgent(self.engine)
    self.__seqcol = SequenceCollectionAgent(self.engine, self.inherent_attrs, self)
    self.__pangenome = PangenomeAgent(self)
    self.__attribute = AttributeAgent(self.engine)
    self.__fasta_drs = FastaDrsAgent(self.engine, fasta_drs_url_prefix)

calc_similarities

calc_similarities(digestA, digestB)

Calculates the Jaccard similarity between two sequence collections.

This method retrieves two sequence collections using their digests and then computes Jaccard similarities for all attributes.

Parameters:

Name Type Description Default
digestA str

The digest (identifier) for the first sequence collection.

required
digestB str

The digest (identifier) for the second sequence collection.

required

Returns:

Name Type Description
dict dict

The Jaccard similarity score between the two sequence collections for all present and shared attributes.

Source code in refget/agents.py
def calc_similarities(self, digestA: str, digestB: str) -> dict:
    """
    Calculates the Jaccard similarity between two sequence collections.

    This method retrieves two sequence collections using their digests and then
    computes jaccard similarities for all attributes.

    Args:
        digestA (str): The digest (identifier) for the first sequence collection.
        digestB (str): The digest (identifier) for the second sequence collection.

    Returns:
        dict: The Jaccard similarity score between the two sequence collections for all present and shared attributes.

    """
    A = self.seqcol.get(digestA, return_format="level2")
    B = self.seqcol.get(digestB, return_format="level2")
    return calc_jaccard_similarities(A, B)

calc_similarities_seqcol_dicts

calc_similarities_seqcol_dicts(seqcolA, seqcolB)

Calculates the Jaccard similarity between two sequence collections.

This method computes Jaccard similarities between two sequence collections provided directly as dictionaries; no database retrieval is performed.

Parameters:

Name Type Description Default
seqcolA dict

the first sequence collection in dict format.

required
seqcolB dict

the second sequence collection in dict format.

required

Returns:

Name Type Description
dict dict

The Jaccard similarity score between the two sequence collections for all present and shared attributes.

Source code in refget/agents.py
def calc_similarities_seqcol_dicts(self, seqcolA: dict, seqcolB: dict) -> dict:
    """
    Calculates the Jaccard similarity between two sequence collections.

    This method computes Jaccard similarities between two sequence collections
    provided directly as dictionaries; no database retrieval is performed.

    Args:
        seqcolA (dict): the first sequence collection in dict format.
        seqcolB (dict): the second sequence collection in dict format.

    Returns:
        dict: The Jaccard similarity score between the two sequence collections for all present and shared attributes.

    """

    return calc_jaccard_similarities(seqcolA, seqcolB)

truncate

truncate()

Delete all records from the database

Source code in refget/agents.py
def truncate(self) -> int:
    """Delete all records from the database"""

    with Session(self.engine) as session:
        statement = delete(SequenceCollection)
        result1 = session.exec(statement)
        statement = delete(Pangenome)
        result = session.exec(statement)
        statement = delete(NamesAttr)
        result = session.exec(statement)
        statement = delete(LengthsAttr)
        result = session.exec(statement)
        statement = delete(SequencesAttr)
        result = session.exec(statement)
        # statement = delete(SortedNameLengthPairsAttr)
        # result = session.exec(statement)
        statement = delete(NameLengthPairsAttr)
        result = session.exec(statement)
        statement = delete(SortedSequencesAttr)
        result = session.exec(statement)

        session.commit()
        return result1.rowcount

SequenceCollectionAgent

SequenceCollectionAgent(engine, inherent_attrs=None, parent=None)

Bases: object

Agent for interacting with a database of sequence collections

Source code in refget/agents.py, lines 169-177
def __init__(
    self,
    engine: SqlalchemyDatabaseEngine,
    inherent_attrs: Optional[List[str]] = None,
    parent: Optional["RefgetDBAgent"] = None,
) -> None:
    self.engine = engine
    self.inherent_attrs = inherent_attrs
    self.parent = parent

add

add(seqcol, update=False)

Add a sequence collection to the database or update it if it exists

Parameters:

Name Type Description Default
seqcol SequenceCollection

The sequence collection to add

required
update bool

If True, update an existing collection if it exists

False

Returns:

Type Description
SequenceCollection

The added or updated sequence collection
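
Example (a minimal usage sketch; assumes access through an agent at `dba.seqcol` and a SequenceCollection built the same way add_from_dict below builds one)::

seqcol = SequenceCollection.from_dict(seqcol_dict, None)  # seqcol_dict: a canonical seqcol dict (hypothetical)
added = dba.seqcol.add(seqcol, update=False)
print(added.digest)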

Source code in refget/agents.py, lines 260-345
def add(self, seqcol: SequenceCollection, update: bool = False) -> SequenceCollection:
    """
    Add a sequence collection to the database or update it if it exists

    Args:
        seqcol: The sequence collection to add
        update: If True, update an existing collection if it exists

    Returns:
        The added or updated sequence collection
    """
    with Session(self.engine, expire_on_commit=False) as session:
        with session.no_autoflush:
            existing = session.get(SequenceCollection, seqcol.digest)

            if existing and not update:
                return existing

            # Process attributes (create if needed)
            attr_map = {
                "names": (NamesAttr, seqcol.names),
                "sequences": (SequencesAttr, seqcol.sequences),
                "sorted_sequences": (SortedSequencesAttr, seqcol.sorted_sequences),
                "lengths": (LengthsAttr, seqcol.lengths),
                "name_length_pairs": (NameLengthPairsAttr, seqcol.name_length_pairs),
            }

            processed_attrs = {}

            # Create or retrieve attributes
            for attr_name, (attr_class, attr_obj) in attr_map.items():
                attr = session.get(attr_class, attr_obj.digest)
                if not attr:
                    attr = attr_class(**attr_obj.model_dump())
                    session.add(attr)
                processed_attrs[attr_name] = attr

            if existing and update:
                # Update existing collection

                existing_names = [
                    name_model.human_readable_name
                    for name_model in existing.human_readable_names
                ]

                for name_model in seqcol.human_readable_names:
                    if name_model.human_readable_name not in existing_names:

                        new_name = HumanReadableNames(
                            human_readable_name=name_model.human_readable_name,
                            digest=existing.digest,
                        )

                        session.add(new_name)

                        existing.human_readable_names.append(new_name)

                for attr_name, attr in processed_attrs.items():
                    # Update attribute reference
                    setattr(existing, f"{attr_name}_digest", attr.digest)

                    # Update relationship - link this attribute to the existing collection
                    getattr(attr, "collection", []).append(existing)

                # Update transient attributes
                existing.sorted_name_length_pairs_digest = (
                    seqcol.sorted_name_length_pairs_digest
                )

                session.commit()
                return existing
            else:
                # Create new collection
                new_collection = SequenceCollection(
                    digest=seqcol.digest,
                    human_readable_names=seqcol.human_readable_names,
                    sorted_name_length_pairs_digest=seqcol.sorted_name_length_pairs_digest,
                )

                # Link attributes to collection
                for attr in processed_attrs.values():
                    getattr(attr, "collection", []).append(new_collection)

                session.add(new_collection)
                session.commit()
                return new_collection

add_from_dict

add_from_dict(seqcol_dict, update=False)

Add a sequence collection from a seqcol dictionary

Parameters:

Name Type Description Default
seqcol_dict dict

The sequence collection in dictionary form

required
update bool

If True, update an existing collection if it exists

False

Returns:

Type Description
SequenceCollection

The added or updated sequence collection
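
Example (a minimal usage sketch; assumes a local FASTA file and an agent available at `dba.seqcol`)::

from refget.utils import fasta_to_seqcol_dict

seqcol_dict = fasta_to_seqcol_dict("genome.fa")
seqcol = dba.seqcol.add_from_dict(seqcol_dict)
print(seqcol.digest)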

Source code in refget/agents.py, lines 347-361
def add_from_dict(self, seqcol_dict: dict, update: bool = False) -> SequenceCollection:
    """
    Add a sequence collection from a seqcol dictionary

    Args:
        seqcol_dict (dict): The sequence collection in dictionary form
        update (bool): If True, update an existing collection if it exists

    Returns:
        (SequenceCollection): The added or updated sequence collection
    """
    seqcol = SequenceCollection.from_dict(seqcol_dict, self.inherent_attrs)
    _LOGGER.info(f"SeqCol: {seqcol}")
    _LOGGER.debug(f"SeqCol name_length_pairs: {seqcol.name_length_pairs.value}")
    return self.add(seqcol, update)

add_from_fasta_file

add_from_fasta_file(fasta_file_path, update=False, create_fasta_drs=True, human_readable_name=None)

Given a path to a fasta file, load the sequences into the refget database.

Parameters:

Name Type Description Default
fasta_file_path str

Path to the fasta file

required
update bool

If True, update an existing collection if it exists

False
create_fasta_drs bool

If True, create a FastaDrsObject for the FASTA file

True
human_readable_name str

Optional human-readable name for the collection

None

Returns:

Type Description
SequenceCollection

The added or updated sequence collection
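
Example (a minimal usage sketch; the file path and name are placeholders, and `dba.seqcol` is assumed to be an initialized agent)::

seqcol = dba.seqcol.add_from_fasta_file(
    "genome.fa",
    human_readable_name="my-assembly",
)
print(seqcol.digest)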

Source code in refget/agents.py, lines 363-396
def add_from_fasta_file(
    self,
    fasta_file_path: str,
    update: bool = False,
    create_fasta_drs: bool = True,
    human_readable_name: str = None,
) -> SequenceCollection:
    """
    Given a path to a fasta file, load the sequences into the refget database.

    Args:
        fasta_file_path (str): Path to the fasta file
        update (bool): If True, update an existing collection if it exists
        create_fasta_drs (bool): If True, create a FastaDrsObject for the FASTA file
        human_readable_name (str): Optional human-readable name for the collection

    Returns:
       (SequenceCollection): The added or updated sequence collection
    """
    CSC = fasta_to_seqcol_dict(fasta_file_path)
    if human_readable_name:
        CSC["human_readable_names"] = human_readable_name
    seqcol = self.add_from_dict(CSC, update)

    if create_fasta_drs and self.parent and self.parent.fasta_drs:
        drs_obj = FastaDrsObject.from_fasta_file(fasta_file_path, digest=seqcol.digest)
        if self.parent.fasta_drs.url_prefix:
            url = self.parent.fasta_drs.url_prefix + os.path.basename(fasta_file_path)
            drs_obj.access_methods = [
                AccessMethod(type="https", access_url=AccessURL(url=url))
            ]
        self.parent.fasta_drs.add(drs_obj)

    return seqcol

add_from_fasta_file_with_name

add_from_fasta_file_with_name(fasta_file_path, human_readable_name, update=False, create_fasta_drs=True)

Given a path to a fasta file, and a human-readable name, load the sequences into the refget database.

Deprecated: Use add_from_fasta_file(fasta_file_path, human_readable_name=name) instead.

Source code in refget/agents.py, lines 398-415
def add_from_fasta_file_with_name(
    self,
    fasta_file_path: str,
    human_readable_name: str,
    update: bool = False,
    create_fasta_drs: bool = True,
) -> SequenceCollection:
    """
    Given a path to a fasta file, and a human-readable name, load the sequences into the refget database.

    Deprecated: Use add_from_fasta_file(fasta_file_path, human_readable_name=name) instead.
    """
    return self.add_from_fasta_file(
        fasta_file_path,
        update=update,
        create_fasta_drs=create_fasta_drs,
        human_readable_name=human_readable_name,
    )

add_from_fasta_pep

add_from_fasta_pep(pep, fa_root, update=False, create_fasta_drs=True)

Given a PEP project and a root directory containing the fasta files, load the fasta files into the refget database.

Parameters:

Name Type Description Default
pep Project

PEP project object containing sample metadata

required
fa_root str

Root directory containing the fasta files

required
update bool

If True, update existing sequence collections

False
create_fasta_drs bool

If True, create FastaDrsObjects for the FASTA files

True

Returns:

Type Description
dict

A dictionary of the digests of the added sequence collections
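
Example (a minimal usage sketch; assumes a PEP whose samples each define a `fasta` attribute and, optionally, a `sample_name`)::

import peppy

prj = peppy.Project("project_config.yaml")
digests = dba.seqcol.add_from_fasta_pep(prj, fa_root="/data/fasta")
# digests maps each sample's FASTA file name to its collection digest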

Source code in refget/agents.py, lines 417-459
def add_from_fasta_pep(
    self,
    pep: "peppy.Project",
    fa_root: str,
    update: bool = False,
    create_fasta_drs: bool = True,
) -> dict:
    """
    Given a PEP project and a root directory containing the fasta files,
    load the fasta files into the refget database.

    Args:
        pep (peppy.Project): PEP project object containing sample metadata
        fa_root (str): Root directory containing the fasta files
        update (bool): If True, update existing sequence collections
        create_fasta_drs (bool): If True, create FastaDrsObjects for the FASTA files

    Returns:
        (dict): A dictionary of the digests of the added sequence collections
    """

    total_files = len(pep.samples)
    results = {}
    import time

    for i, s in enumerate(pep.samples, 1):
        fa_path = os.path.join(fa_root, s.fasta)
        _LOGGER.info(f"Loading {fa_path} ({i} of {total_files})")

        start_time = time.time()  # Record start time
        if s.sample_name:
            results[s.fasta] = self.add_from_fasta_file_with_name(
                fa_path, s.sample_name, update, create_fasta_drs
            ).digest
        else:
            results[s.fasta] = self.add_from_fasta_file(
                fa_path, update, create_fasta_drs
            ).digest
        elapsed_time = time.time() - start_time  # Calculate elapsed time

        _LOGGER.info(f"Loaded in {elapsed_time:.2f} seconds")

    return results

get

get(digest, return_format='level2', attribute=None, itemwise_limit=None)

Get a sequence collection by digest

Parameters:

Name Type Description Default
digest str

The digest of the sequence collection

required
return_format str

The format in which to return the sequence collection

'level2'
attribute str

Name of an attribute to return, if you just want an attribute

None
itemwise_limit int

Limit the number of items returned in itemwise format

None

Returns:

Type Description
SequenceCollection

The sequence collection (in requested format)
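
Example (a minimal usage sketch; the digest is a placeholder that must exist in the database)::

level2 = dba.seqcol.get("some_digest")                        # level 2 (default)
level1 = dba.seqcol.get("some_digest", return_format="level1")
names = dba.seqcol.get("some_digest", attribute="names")      # just the names list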

Source code in refget/agents.py, lines 179-213
def get(
    self,
    digest: str,
    return_format: str = "level2",
    attribute: Optional[str] = None,
    itemwise_limit: Optional[int] = None,
) -> SequenceCollection | dict | list:
    """
    Get a sequence collection by digest

    Args:
        digest (str): The digest of the sequence collection
        return_format (str): The format in which to return the sequence collection
        attribute (str): Name of an attribute to return, if you just want an attribute
        itemwise_limit (int): Limit the number of items returned in itemwise format

    Returns:
        (SequenceCollection): The sequence collection (in requested format)
    """
    with Session(self.engine) as session:
        statement = select(SequenceCollection).where(SequenceCollection.digest == digest)
        results = session.exec(statement)
        seqcol = results.one_or_none()
        if not seqcol:
            raise ValueError(f"SequenceCollection with digest '{digest}' not found")
        if attribute:
            return getattr(seqcol, attribute).value
        elif return_format == "level2":
            return seqcol.level2()
        elif return_format == "level1":
            return seqcol.level1()
        elif return_format == "itemwise":
            return seqcol.itemwise(itemwise_limit)
        else:
            return seqcol

search_by_attributes

search_by_attributes(filters, offset=0, limit=50)

Search sequence collections by multiple attribute filters (AND logic).

Parameters:

Name Type Description Default
filters dict

Dict of {attribute_name: digest} pairs

required
offset int

Pagination offset

0
limit int

Max results to return

50

Returns:

Type Description
dict

Dict with pagination info and results
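
Example (a minimal usage sketch; `names_digest` is assumed to be the digest of a names attribute already stored in the database)::

page = dba.seqcol.search_by_attributes({"names": names_digest}, offset=0, limit=10)
print(page["pagination"]["total"])
for sc in page["results"]:
    print(sc.digest)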

Source code in refget/agents.py, lines 474-514
def search_by_attributes(self, filters: dict, offset: int = 0, limit: int = 50) -> dict:
    """
    Search sequence collections by multiple attribute filters (AND logic).

    Args:
        filters: Dict of {attribute_name: digest} pairs
        offset: Pagination offset
        limit: Max results to return

    Returns:
        Dict with pagination info and results
    """
    with Session(self.engine) as session:
        # Start with base query
        list_stmt = select(SequenceCollection)
        cnt_stmt = select(func.count(SequenceCollection.digest))

        # Chain .where() for each filter (creates AND logic)
        for attr_name, attr_digest in filters.items():
            # Validate attribute exists to prevent SQL injection
            if attr_name not in ATTR_TYPE_MAP:
                raise ValueError(f"Unknown attribute: {attr_name}")

            # Build WHERE condition dynamically
            digest_column = getattr(SequenceCollection, f"{attr_name}_digest")
            list_stmt = list_stmt.where(digest_column == attr_digest)
            cnt_stmt = cnt_stmt.where(digest_column == attr_digest)

        # Add pagination
        list_stmt = list_stmt.offset(offset).limit(limit)

        # Execute queries
        cnt_res = session.exec(cnt_stmt)
        list_res = session.exec(list_stmt)
        count = cnt_res.one()
        seqcols = list_res.all()

        return {
            "pagination": {"page": offset // limit, "page_size": limit, "total": count},
            "results": seqcols,
        }

SequenceAgent

SequenceAgent(engine)

Bases: object

Agent for interacting with a database of sequences

Source code in refget/agents.py, lines 102-103
def __init__(self, engine: SqlalchemyDatabaseEngine) -> None:
    self.engine = engine

PangenomeAgent

PangenomeAgent(parent)

Bases: object

Agent for interacting with a database of pangenomes

Source code in refget/agents.py, lines 547-549
def __init__(self, parent: "RefgetDBAgent") -> None:
    self.engine = parent.engine
    self.parent = parent

AttributeAgent

AttributeAgent(engine)

Bases: object

Source code in refget/agents.py, lines 632-633
def __init__(self, engine: SqlalchemyDatabaseEngine) -> None:
    self.engine = engine

FastaDrsAgent

FastaDrsAgent(engine, url_prefix=None)

Agent for interacting with a database of FASTA DRS objects

Source code in refget/agents.py, lines 688-690
def __init__(self, engine: SqlalchemyDatabaseEngine, url_prefix: Optional[str] = None) -> None:
    self.engine = engine
    self.url_prefix = url_prefix

add

add(fasta_drs)

Add a FastaDrsObject to the database

Source code in refget/agents.py, lines 702-711
def add(self, fasta_drs: FastaDrsObject) -> FastaDrsObject:
    """Add a FastaDrsObject to the database"""
    with Session(self.engine, expire_on_commit=False) as session:
        with session.no_autoflush:
            existing = session.get(FastaDrsObject, fasta_drs.id)
            if existing:
                return existing
            session.add(fasta_drs)
            session.commit()
            return fasta_drs

add_access_method

add_access_method(digest, access_method)

Add an access method to an existing FastaDrsObject.

Parameters:

Name Type Description Default
digest str

The digest (object_id) of the DRS object

required
access_method AccessMethod

The AccessMethod to add

required

Returns:

Type Description
FastaDrsObject

The updated FastaDrsObject
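
Example (a minimal usage sketch; `dba.fasta_drs` is assumed to be an initialized FastaDrsAgent, and it reuses the AccessMethod/AccessURL models seen in add_from_fasta_file above, with placeholder values)::

method = AccessMethod(
    type="https",
    access_url=AccessURL(url="https://example.org/fasta/genome.fa"),
)
drs_obj = dba.fasta_drs.add_access_method("some_digest", method)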

Source code in refget/agents.py, lines 727-748
def add_access_method(self, digest: str, access_method: AccessMethod) -> FastaDrsObject:
    """
    Add an access method to an existing FastaDrsObject.

    Args:
        digest: The digest (object_id) of the DRS object
        access_method: The AccessMethod to add

    Returns:
        The updated FastaDrsObject
    """
    with Session(self.engine, expire_on_commit=False) as session:
        drs_obj = session.get(FastaDrsObject, digest)
        if not drs_obj:
            raise ValueError(f"FastaDrsObject with id '{digest}' not found")
        # Create a new list to ensure SQLAlchemy detects the change
        current_methods = list(drs_obj.access_methods) if drs_obj.access_methods else []
        current_methods.append(access_method)
        drs_obj.access_methods = current_methods
        session.add(drs_obj)
        session.commit()
        return drs_obj

get

get(digest)

Get a FastaDrsObject by its digest (object_id)

Source code in refget/agents.py, lines 692-700
def get(self, digest: str) -> FastaDrsObject:
    """Get a FastaDrsObject by its digest (object_id)"""
    with Session(self.engine) as session:
        statement = select(FastaDrsObject).where(FastaDrsObject.id == digest)
        results = session.exec(statement)
        response = results.first()
        if not response:
            raise ValueError(f"FastaDrsObject with id '{digest}' not found")
        return response

list_by_offset

list_by_offset(limit=50, offset=0)

List FastaDrsObjects with pagination

Source code in refget/agents.py, lines 713-725
def list_by_offset(self, limit: int = 50, offset: int = 0) -> dict:
    """List FastaDrsObjects with pagination"""
    with Session(self.engine) as session:
        list_stmt = select(FastaDrsObject).offset(offset).limit(limit)
        cnt_stmt = select(func.count(FastaDrsObject.id))
        cnt_res = session.exec(cnt_stmt)
        list_res = session.exec(list_stmt)
        count = cnt_res.one()
        drs_objs = list_res.all()
        return {
            "pagination": {"page": int(offset / limit), "page_size": limit, "total": count},
            "results": drs_objs,
        }

RefgetStore (gtars)

RefgetStore provides high-performance local sequence storage implemented in Rust. It supports:

  • In-memory and on-disk storage with optional compression
  • Remote store access with local caching
  • Sequence retrieval by digest or by collection + name
  • BED file region extraction for batch operations
  • FASTA export for individual sequences or regions

See the RefgetStore tutorial for usage examples.

RefgetStore

RefgetStore(mode)

A global store for GA4GH refget sequences with lazy-loading support.

RefgetStore provides content-addressable storage for reference genome sequences following the GA4GH refget specification. Supports both local and remote stores with on-demand sequence loading.

Attributes:

Name Type Description
cache_path Optional[str]

Local directory path where the store is located or cached. None for in-memory stores.

remote_url Optional[str]

Remote URL of the store if loaded remotely, None otherwise.

Note

Boolean evaluation: RefgetStore follows Python container semantics, meaning bool(store) is False for empty stores (like list, dict, etc.). To check if a store variable is initialized (not None), use if store is not None: rather than if store:.

Example::

store = RefgetStore.in_memory()  # Empty store
bool(store)  # False (empty container)
len(store)   # 0

# Wrong: checks emptiness, not initialization
if store:
    process(store)

# Right: checks if variable is set
if store is not None:
    process(store)

Examples:

Create a new store and import sequences::

from gtars.refget import RefgetStore, StorageMode
store = RefgetStore(StorageMode.Encoded)
store.import_fasta("genome.fa")

Open an existing local store::

store = RefgetStore.open_local("/data/hg38")
seq = store.get_substring("chr1_digest", 0, 1000)

Open a remote store with caching::

store = RefgetStore.open_remote(
    "/local/cache",
    "https://example.com/hg38"
)

Create a new empty RefgetStore.

Parameters:

Name Type Description Default
mode StorageMode

Storage mode - StorageMode.Raw (uncompressed) or StorageMode.Encoded (bit-packed, space-efficient).

required

Example::

store = RefgetStore(StorageMode.Encoded)

disable_persistence

disable_persistence()

Disable disk persistence for this store.

New sequences will be kept in memory only. Existing Stub sequences can still be loaded from disk if local_path is set.

Example::

store = RefgetStore.open_remote("/cache", "https://example.com")
store.disable_persistence()  # Stop caching new sequences

enable_persistence

enable_persistence(path)

Enable disk persistence for this store.

Sets up the store to write sequences to disk. Any in-memory Full sequences are flushed to disk and converted to Stubs.

Parameters:

Name Type Description Default
path Union[str, PathLike]

Directory for storing sequences and metadata.

required

Raises:

Type Description
IOError

If the directory cannot be created or written to.

Example::

store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")
store.enable_persistence("/data/store")  # Flush to disk

export_fasta

export_fasta(collection_digest, output_path, sequence_names=None, line_width=None)

Export sequences from a collection to a FASTA file.

Parameters:

Name Type Description Default
collection_digest str

Collection to export from.

required
output_path Union[str, PathLike]

Path to write FASTA file.

required
sequence_names Optional[List[str]]

Optional list of sequence names to export. If None, exports all sequences in the collection.

None
line_width Optional[int]

Optional line width for wrapping sequences. If None, uses default of 80.

None
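
Example (a minimal usage sketch; assumes `store` is an opened RefgetStore, and the digest and sequence names are placeholders)::

store.export_fasta(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "subset.fa",
    sequence_names=["chr1", "chrM"],
    line_width=60,
)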

export_fasta_by_digests

export_fasta_by_digests(digests, output_path, line_width=None)

Export sequences by their digests to a FASTA file.

Parameters:

Name Type Description Default
digests List[str]

List of sequence digests to export.

required
output_path Union[str, PathLike]

Path to write FASTA file.

required
line_width Optional[int]

Optional line width for wrapping sequences. If None, uses default of 80.

None
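
Example (a minimal usage sketch; the digests are placeholders)::

store.export_fasta_by_digests(
    ["digest_of_seq1", "digest_of_seq2"],
    "selected.fa",
)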

get_collection

get_collection(collection_digest)

Get a collection by digest with all sequences loaded.

Loads the collection and all its sequence data into memory. Use this when you need full access to sequence content.

Parameters:

Name Type Description Default
collection_digest str

The collection's SHA-512/24u digest.

required

Returns:

Type Description
SequenceCollection

The collection with all sequence data loaded.

Raises:

Type Description
IOError

If the collection cannot be loaded.

Example::

collection = store.get_collection("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
for seq in collection.sequences:
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")

get_collection_metadata

get_collection_metadata(collection_digest)

Get metadata for a collection by digest.

Returns lightweight metadata without loading the full collection. Use this for quick lookups of collection information.

Parameters:

Name Type Description Default
collection_digest str

The collection's SHA-512/24u digest.

required

Returns:

Type Description
Optional[SequenceCollectionMetadata]

Collection metadata if found, None otherwise.

Example::

meta = store.get_collection_metadata("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
if meta:
    print(f"Collection has {meta.n_sequences} sequences")

get_seqs_bed_file

get_seqs_bed_file(collection_digest, bed_file_path, output_fasta_path)

Extract sequences for BED regions and write to FASTA.

Parameters:

Name Type Description Default
collection_digest str

Collection digest to look up sequence names.

required
bed_file_path Union[str, PathLike]

Path to BED file with regions.

required
output_fasta_path Union[str, PathLike]

Path to write output FASTA file.

required
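
Example (a minimal usage sketch; the digest and file paths are placeholders)::

store.get_seqs_bed_file(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "regions.bed",
    "regions.fa",
)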

get_seqs_bed_file_to_vec

get_seqs_bed_file_to_vec(collection_digest, bed_file_path)

Extract sequences for BED regions and return as list.

Parameters:

Name Type Description Default
collection_digest str

Collection digest to look up sequence names.

required
bed_file_path Union[str, PathLike]

Path to BED file with regions.

required

Returns:

Type Description
List[RetrievedSequence]

List of retrieved sequence segments.
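
Example (a minimal usage sketch; the digest and BED path are placeholders)::

seqs = store.get_seqs_bed_file_to_vec(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "regions.bed",
)
print(f"Retrieved {len(seqs)} regions")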

get_sequence

get_sequence(digest)

Retrieve a sequence record by its digest (SHA-512/24u or MD5).

Loads the sequence data if not already in memory. Supports lookup by either SHA-512/24u (preferred) or MD5 digest.

Parameters:

Name Type Description Default
digest str

Sequence digest (SHA-512/24u base64url or MD5 hex string).

required

Returns:

Type Description
Optional[SequenceRecord]

The sequence record with data if found, None otherwise.

Example::

record = store.get_sequence("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
if record:
    print(f"Found: {record.metadata.name}")
    print(f"Sequence: {record.decode()[:50]}...")

get_sequence_by_name

get_sequence_by_name(collection_digest, sequence_name)

Retrieve a sequence by collection digest and sequence name.

Looks up a sequence within a specific collection using its name (e.g., "chr1", "chrM"). Loads the sequence data if needed.

Parameters:

Name Type Description Default
collection_digest str

The collection's SHA-512/24u digest.

required
sequence_name str

Name of the sequence within that collection.

required

Returns:

Type Description
Optional[SequenceRecord]

The sequence record with data if found, None otherwise.

Example::

record = store.get_sequence_by_name(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "chr1"
)
if record:
    print(f"Sequence: {record.decode()[:50]}...")

get_sequence_metadata

get_sequence_metadata(seq_digest)

Get metadata for a sequence by digest (no data loaded).

Use this for lightweight lookups when you don't need the actual sequence.

Parameters:

Name Type Description Default
seq_digest str

The sequence's SHA-512/24u digest.

required

Returns:

Type Description
Optional[SequenceMetadata]

Sequence metadata if found, None otherwise.
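
Example (a minimal usage sketch; the digest is a placeholder)::

meta = store.get_sequence_metadata("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
if meta:
    print(f"{meta.name}: {meta.length} bp")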

get_substring

get_substring(seq_digest, start, end)

Extract a substring from a sequence.

Retrieves a specific region from a sequence using 0-based, half-open coordinates [start, end). Automatically loads sequence data if not already cached (for lazy-loaded stores).

Parameters:

Name Type Description Default
seq_digest str

Sequence digest (SHA-512/24u).

required
start int

Start position (0-based, inclusive).

required
end int

End position (0-based, exclusive).

required

Returns:

Type Description
Optional[str]

The substring sequence if found, None otherwise.

Example::

# Get first 1000 bases of chr1
seq = store.get_substring("chr1_digest", 0, 1000)
print(f"First 50bp: {seq[:50]}")

import_fasta

import_fasta(file_path)

Import sequences from a FASTA file into the store.

Reads all sequences from a FASTA file and adds them to the store. Computes GA4GH digests and creates a sequence collection.

Parameters:

Name Type Description Default
file_path Union[str, PathLike]

Path to the FASTA file.

required

Raises:

Type Description
IOError

If the file cannot be read or parsed.

Example::

store = RefgetStore(StorageMode.Encoded)
store.import_fasta("genome.fa")

in_memory classmethod

in_memory()

Create a new in-memory RefgetStore.

Creates a store that keeps all sequences in memory. Use this for temporary processing or when you don't need disk persistence.

Returns:

Type Description
RefgetStore

New empty RefgetStore with Encoded storage mode.

Example::

store = RefgetStore.in_memory()
store.import_fasta("genome.fa")

is_collection_loaded

is_collection_loaded(collection_digest)

Check if a collection is fully loaded.

Returns True if the collection's sequence list is loaded in memory, False if it's only metadata (stub).

Parameters:

Name Type Description Default
collection_digest str

The collection's SHA-512/24u digest.

required

Returns:

Type Description
bool

True if loaded, False otherwise.
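
Example (a minimal usage sketch; the digest is a placeholder)::

digest = "uC_UorBNf3YUu1YIDainBhI94CedlNeH"
if not store.is_collection_loaded(digest):
    collection = store.get_collection(digest)  # loads sequences on demand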

iter_collections

iter_collections()

Iterate over all collections with their sequences loaded.

This loads all collection data upfront and returns a list of SequenceCollection objects with full sequence data.

For browsing without loading data, use list_collections() instead.

Returns:

Type Description
List[SequenceCollection]

List of all collections with loaded sequences.

Example::

for coll in store.iter_collections():
    print(f"{coll.digest}: {len(coll.sequences)} sequences")

iter_sequences

iter_sequences()

Iterate over all sequences with their data loaded.

This ensures all sequence data is loaded and returns a list of SequenceRecord objects with full sequence data.

For browsing without loading data, use list_sequences() instead.

Returns:

Type Description
List[SequenceRecord]

List of all sequences with loaded data.

Example::

for seq in store.iter_sequences():
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")

list_collections

list_collections()

List all collection metadata in the store.

Returns metadata for all collections without loading full collection data. Use this for browsing/inventory operations.

Returns:

Type Description
List[SequenceCollectionMetadata]

List of metadata for all collections.

Example::

for meta in store.list_collections():
    print(f"Collection {meta.digest}: {meta.n_sequences} sequences")

list_sequences

list_sequences()

List all sequence metadata in the store.

Returns metadata for all sequences without loading sequence data. Use this for browsing/inventory operations.

Returns:

Type Description
List[SequenceMetadata]

List of metadata for all sequences in the store.

Example::

for meta in store.list_sequences():
    print(f"{meta.name}: {meta.length} bp")

on_disk classmethod

on_disk(cache_path)

Create or load a disk-backed RefgetStore.

If the directory contains an existing store (rgstore.json), loads it. Otherwise creates a new store with Encoded mode.

Parameters:

Name Type Description Default
cache_path Union[str, PathLike]

Directory path for the store. Created if it doesn't exist.

required

Returns:

Type Description
RefgetStore

RefgetStore (new or loaded from disk).

Example::

store = RefgetStore.on_disk("/data/my_store")
store.import_fasta("genome.fa")
# Store is automatically persisted to disk

open_local classmethod

open_local(path)

Open a local RefgetStore from a directory.

Loads only lightweight metadata and stubs. Collections and sequences remain as stubs until explicitly accessed with get_collection()/get_sequence().

Expects: rgstore.json, sequences.rgsi, collections.rgci, collections/*.rgsi

Parameters:

Name Type Description Default
path Union[str, PathLike]

Local directory containing the refget store.

required

Returns:

Type Description
RefgetStore

RefgetStore with metadata loaded, sequences lazy-loaded.

Raises:

Type Description
IOError

If the store directory or index files cannot be read.

Example::

store = RefgetStore.open_local("/data/hg38_store")
seq = store.get_substring("chr1_digest", 0, 1000)

open_remote classmethod

open_remote(cache_path, remote_url)

Open a remote RefgetStore with local caching.

Loads only lightweight metadata and stubs from the remote URL. Data is fetched on-demand when get_collection()/get_sequence() is called.

By default, persistence is enabled (sequences are cached to disk). Call disable_persistence() after loading to keep only in memory.

Parameters:

Name Type Description Default
cache_path Union[str, PathLike]

Local directory to cache downloaded metadata and sequences. Created if it doesn't exist.

required
remote_url str

Base URL of the remote refget store (e.g., "https://example.com/hg38" or "s3://bucket/hg38").

required

Returns:

Type Description
RefgetStore

RefgetStore with metadata loaded, sequences fetched on-demand.

Raises:

Type Description
IOError

If remote metadata cannot be fetched or cache cannot be written.

Example::

store = RefgetStore.open_remote(
    "/data/cache/hg38",
    "https://refget-server.com/hg38"
)
# First access fetches from remote and caches
seq = store.get_substring("chr1_digest", 0, 1000)
# Second access uses cache
seq2 = store.get_substring("chr1_digest", 1000, 2000)

set_encoding_mode

set_encoding_mode(mode)

Change the storage mode, re-encoding/decoding existing sequences as needed.

When switching from Raw to Encoded, all Full sequences in memory are encoded (2-bit packed). When switching from Encoded to Raw, all Full sequences in memory are decoded back to raw bytes.

Parameters:

Name Type Description Default
mode StorageMode

The storage mode to switch to (StorageMode.Raw or StorageMode.Encoded).

required

Example::

store = RefgetStore.in_memory()
store.set_encoding_mode(StorageMode.Raw)

stats

stats()

Returns statistics about the store.

Returns:

Type Description
dict

dict with keys:

- 'n_sequences': Total number of sequences (Stub + Full)
- 'n_sequences_loaded': Number of sequences with data loaded (Full)
- 'n_collections': Total number of collections (Stub + Full)
- 'n_collections_loaded': Number of collections with sequences loaded (Full)
- 'storage_mode': Storage mode ('Raw' or 'Encoded')
- 'total_disk_size': Total size of all files on disk in bytes

Note

n_collections_loaded only reflects collections fully loaded in memory. For remote stores, collections are loaded on-demand when accessed.

Example::

stats = store.stats()
print(f"Store has {stats['n_sequences']} sequences")
print(f"Collections: {stats['n_collections']} total, {stats['n_collections_loaded']} loaded")

write_store_to_directory

write_store_to_directory(root_path, seqdata_path_template)

Write the store to a directory on disk.

Persists the store with all sequences and metadata to disk using the RefgetStore directory format.

Parameters:

Name Type Description Default
root_path Union[str, PathLike]

Directory path to write the store to.

required
seqdata_path_template str

Path template for sequence files (e.g., "sequences/%s2/%s.seq" where %s2 = first 2 chars of digest, %s = full digest).

required

Example::

store.write_store_to_directory(
    "/data/my_store",
    "sequences/%s2/%s.seq"
)

Digest Functions

Low-level functions for computing GA4GH digests:

sha512t24u_digest

sha512t24u_digest(readable)

Compute the GA4GH SHA-512/24u digest for a sequence.

This function computes the GA4GH refget standard digest (truncated SHA-512, base64url encoded) for a given sequence string or bytes.

Parameters:

Name Type Description Default
readable Union[str, bytes]

Input sequence as str or bytes.

required

Returns:

Type Description
str

The SHA-512/24u digest (32 character base64url string).

Raises:

Type Description
TypeError

If input is not str or bytes.

Example::

from gtars.refget import sha512t24u_digest
digest = sha512t24u_digest("ACGT")
print(digest)  # Output: 'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'

canonical_str

canonical_str(item)

Convert a dict into a canonical string representation

Source code in refget/utils.py, lines 21-25
def canonical_str(item: dict) -> bytes:
    """Convert a dict into a canonical string representation"""
    return json.dumps(
        item, separators=(",", ":"), ensure_ascii=False, allow_nan=False, sort_keys=True
    ).encode()
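
Example (a minimal usage sketch; the output follows directly from the json.dumps call above, which sorts keys and strips whitespace)::

canonical_str({"name": "chr1", "length": 248956422})
# b'{"length":248956422,"name":"chr1"}'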