Data Models

The refget package uses Pydantic and SQLModel for data validation and database ORM. These models represent the core data structures for sequence collections, DRS objects, and related metadata.

You only need these data models if you are developing packages that build on the refget Python API.

Model hierarchy

DrsObject (base)
└── FastaDrsObject (table)

SQLModel (base)
├── SequenceCollection (table)
├── Pangenome (table)
├── Sequence (table)
├── AccessMethod
├── AccessURL
└── Checksum

Core Models

SequenceCollection

The primary model representing a GA4GH sequence collection.

SequenceCollection

Bases: SQLModel

A SQLModel/pydantic model that represents a refget sequence collection.

digest `class-attribute` `instance-attribute`

digest = Field(primary_key=True)

Top-level digest of the SequenceCollection.

lengths `class-attribute` `instance-attribute`

lengths = Relationship(back_populates='collection')

Array of sequence lengths.

name_length_pairs `class-attribute` `instance-attribute`

name_length_pairs = Relationship(back_populates='collection')

Array of name-length pairs, representing the coordinate system of the collection.

names `class-attribute` `instance-attribute`

names = Relationship(back_populates='collection')

Array of sequence names.

sequences `class-attribute` `instance-attribute`

sequences = Relationship(back_populates='collection')

Array of sequence digests.

sorted_name_length_pairs_digest `class-attribute` `instance-attribute`

sorted_name_length_pairs_digest = Field()

Digest of the sorted name-length pairs, which identifies the coordinate system regardless of the order of its sequences.

sorted_sequences `class-attribute` `instance-attribute`

sorted_sequences = Relationship(back_populates='collection')

Array of sorted sequence digests.

from_PySequenceCollection `classmethod`

from_PySequenceCollection(gtars_seq_col)

Given a PySequenceCollection object (from Rust bindings), create a SequenceCollection object.

Parameters:

Name	Type	Description	Default
`gtars_seq_col`	`PySequenceCollection`	PySequenceCollection object from Rust bindings.	required

Returns:

Type	Description
`SequenceCollection`	The SequenceCollection object.

Raises:

Type	Description
`ImportError`	If gtars is not installed (required for this conversion)

Source code in refget/models.py

@classmethod
def from_PySequenceCollection(
    cls, gtars_seq_col: "gtarsSequenceCollection"
) -> "SequenceCollection":
    """
    Given a PySequenceCollection object (from Rust bindings), create a SequenceCollection object.

    Args:
       gtars_seq_col (PySequenceCollection): PySequenceCollection object from Rust bindings.

    Returns:
        (SequenceCollection): The SequenceCollection object.

    Raises:
        ImportError: If gtars is not installed (required for this conversion)
    """
    return seqcol_from_gtars(gtars_seq_col)

from_dict `classmethod`

from_dict(seqcol_dict, inherent_attrs=DEFAULT_INHERENT_ATTRS)

Given a dict representation of a sequence collection, create a SequenceCollection object. This is the primary way to create a SequenceCollection object.

Parameters:

Name	Type	Description	Default
`seqcol_dict`	`dict`	Dictionary representation of a canonical sequence collection object	required
`inherent_attrs`	`list`	List of inherent attributes to digest	`DEFAULT_INHERENT_ATTRS`

Returns:

Type	Description
`SequenceCollection`	The SequenceCollection object

Source code in refget/models.py

@classmethod
def from_dict(
    cls, seqcol_dict: dict, inherent_attrs: Optional[list] = DEFAULT_INHERENT_ATTRS
) -> "SequenceCollection":
    """
    Given a dict representation of a sequence collection, create a SequenceCollection object.
    This is the primary way to create a SequenceCollection object.

    Args:
        seqcol_dict (dict): Dictionary representation of a canonical sequence collection object
        inherent_attrs (list, optional): List of inherent attributes to digest

    Returns:
        (SequenceCollection): The SequenceCollection object
    """

    # Validate collated attributes have matching lengths
    cls._validate_collated_attributes(seqcol_dict)

    # validate_seqcol(seqcol_dict)
    level1_dict = seqcol_dict_to_level1_dict(seqcol_dict)
    seqcol_digest = level1_dict_to_seqcol_digest(level1_dict, inherent_attrs)

    # Now, build the actual pydantic models
    sequences_attr = SequencesAttr(
        digest=level1_dict["sequences"], value=seqcol_dict["sequences"]
    )

    names_attr = NamesAttr(digest=level1_dict["names"], value=seqcol_dict["names"])

    lengths_attr = LengthsAttr(digest=level1_dict["lengths"], value=seqcol_dict["lengths"])

    nlp = build_name_length_pairs(seqcol_dict)
    nlp_attr = NameLengthPairsAttr(digest=sha512t24u_digest(canonical_str(nlp)), value=nlp)
    _LOGGER.debug(f"nlp: {nlp}")
    _LOGGER.debug(f"Name-length pairs: {nlp_attr}")

    snlp_digests = []  # sorted_name_length_pairs digests
    for i in range(len(nlp)):
        snlp_digests.append(sha512t24u_digest(canonical_str(nlp[i])))
    snlp_digests.sort()

    # you can build it like this, but instead I'm just building it from the nlp, to save compute
    # snlp = build_sorted_name_length_pairs(seqcol_dict)
    # v = ",".join(snlp)
    snlp_digest_level1 = sha512t24u_digest(canonical_str(snlp_digests))

    # This is now a transient attribute, so we don't need to store it in the database.
    # snlp_attr = SortedNameLengthPairsAttr(digest=snlp_digest_level1, value=snlp_digests)

    sorted_sequences_value = copy(seqcol_dict["sequences"])
    sorted_sequences_value.sort()
    sorted_sequences_digest = sha512t24u_digest(canonical_str(sorted_sequences_value))
    sorted_sequences_attr = SortedSequencesAttr(
        digest=sorted_sequences_digest, value=sorted_sequences_value
    )
    _LOGGER.debug(f"sorted_sequences_value: {sorted_sequences_value}")
    _LOGGER.debug(f"sorted_sequences_digest: {sorted_sequences_digest}")
    _LOGGER.debug(f"sorted_sequences_attr: {sorted_sequences_attr}")

    human_readable_names_list = []
    if "human_readable_names" in seqcol_dict and seqcol_dict["human_readable_names"]:
        # Assuming 'human_readable_names' is a list of strings in the input dictionary
        if isinstance(seqcol_dict["human_readable_names"], list):
            for name_str in seqcol_dict["human_readable_names"]:
                human_readable_names_list.append(
                    HumanReadableNames(human_readable_name=name_str, digest=seqcol_digest)
                )
        # Handle single string input (convert to list)
        elif isinstance(seqcol_dict["human_readable_names"], str):
            human_readable_names_list.append(
                HumanReadableNames(
                    human_readable_name=seqcol_dict["human_readable_names"],
                    digest=seqcol_digest,
                )
            )

    seqcol = SequenceCollection(
        digest=seqcol_digest,
        human_readable_names=human_readable_names_list,
        sequences=sequences_attr,
        sorted_sequences=sorted_sequences_attr,
        names=names_attr,
        lengths=lengths_attr,
        name_length_pairs=nlp_attr,
        sorted_name_length_pairs_digest=snlp_digest_level1,
    )

    _LOGGER.debug(f"seqcol: {seqcol}")

    return seqcol
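The sorted name-length pairs digest built in the source above can be reproduced with the standard library alone. This is a minimal sketch, not the refget implementation: `canonical_str_sketch` and `sha512t24u_sketch` are hypothetical stand-ins for the `canonical_str` and `sha512t24u_digest` helpers used in the source, and compact `json.dumps` output with sorted keys is assumed to match the canonical serialization for plain ASCII names and integer lengths.

```python
import base64
import hashlib
import json


def canonical_str_sketch(obj) -> str:
    """Compact JSON with sorted keys (assumed stand-in for canonical_str)."""
    return json.dumps(obj, separators=(",", ":"), sort_keys=True, ensure_ascii=False)


def sha512t24u_sketch(data) -> str:
    """GA4GH sha512t24u: SHA-512, truncated to 24 bytes, base64url-encoded."""
    if isinstance(data, str):
        data = data.encode("utf-8")
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def sorted_nlp_digest(names, lengths) -> str:
    """Digest each name-length pair, sort the digests, then digest the sorted list."""
    pairs = [{"length": length, "name": name} for name, length in zip(names, lengths)]
    pair_digests = sorted(sha512t24u_sketch(canonical_str_sketch(p)) for p in pairs)
    return sha512t24u_sketch(canonical_str_sketch(pair_digests))


# Reordering the sequences leaves the digest unchanged.
a = sorted_nlp_digest(["chr1", "chr2"], [1000, 2000])
b = sorted_nlp_digest(["chr2", "chr1"], [2000, 1000])
print(a == b)
```

Because the per-pair digests are sorted before the final digest is taken, two collections that share a coordinate system but list their sequences in a different order produce the same value.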

from_fasta_file `classmethod`

from_fasta_file(fasta_file)

Given a FASTA file, create a SequenceCollection object.

Parameters:

Name	Type	Description	Default
`fasta_file`	`str`	Path to a FASTA file	required

Returns:

Type	Description
`SequenceCollection`	The SequenceCollection object

Raises:

Type	Description
`ImportError`	If gtars is not installed (required for FASTA processing)

Source code in refget/models.py

@classmethod
def from_fasta_file(cls, fasta_file: str) -> "SequenceCollection":
    """
    Given a FASTA file, create a SequenceCollection object.

    Args:
        fasta_file (str): Path to a FASTA file

    Returns:
        (SequenceCollection): The SequenceCollection object

    Raises:
        ImportError: If gtars is not installed (required for FASTA processing)
    """
    seqcol = fasta_to_seqcol_dict(fasta_file)
    return cls.from_dict(seqcol)

itemwise

itemwise(limit=None)

Converts object into a list of dictionaries, one for each sequence in the collection.

Source code in refget/models.py

def itemwise(self, limit=None):
    """
    Converts object into a list of dictionaries, one for each sequence in the collection.
    """
    if limit and len(self.sequences.value) > limit:
        raise ValueError(f"Too many sequences to format itemwise: {len(self.sequences.value)}")
    list_of_dicts = []
    for i in range(len(self.lengths.value)):
        list_of_dicts.append(
            {
                "name": self.names.value[i],
                "length": self.lengths.value[i],
                "sequence": self.sequences.value[i],
            }
        )
    return list_of_dicts
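The itemwise conversion is a transposition of the parallel level 2 arrays into one record per sequence. The sketch below illustrates this on a plain dict so it runs without refget installed; `itemwise_sketch` is a hypothetical helper, and the digests are placeholders rather than real sequence digests.

```python
def itemwise_sketch(level2: dict, limit=None) -> list:
    """Transpose parallel arrays into one dict per sequence."""
    n = len(level2["sequences"])
    if limit and n > limit:
        raise ValueError(f"Too many sequences to format itemwise: {n}")
    return [
        {"name": name, "length": length, "sequence": seq}
        for name, length, seq in zip(
            level2["names"], level2["lengths"], level2["sequences"]
        )
    ]


# Placeholder digests, for illustration only.
level2 = {
    "names": ["chr1", "chr2"],
    "lengths": [1000, 2000],
    "sequences": ["SQ.aaa", "SQ.bbb"],
}
rows = itemwise_sketch(level2)
print(rows[0])
```

As in the method above, passing a `limit` smaller than the number of sequences raises `ValueError` instead of returning a truncated list.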

level1

level1()

Converts object into dict of level 1 representation of the SequenceCollection.

Returns attribute digests for most attributes, but returns raw values for passthru attributes. Note: Passthru handling for dict-based construction happens in seqcol_dict_to_level1_dict(). When passthru attributes are added to the database model, return .value instead of .digest here.

Source code in refget/models.py

def level1(self):
    """
    Converts object into dict of level 1 representation of the SequenceCollection.

    Returns attribute digests for most attributes, but returns raw values for passthru attributes.
    Note: Passthru handling for dict-based construction happens in seqcol_dict_to_level1_dict().
    When passthru attributes are added to the database model, return .value instead of .digest here.
    """
    return {
        "lengths": self.lengths.digest,
        "names": self.names.digest,
        "sequences": self.sequences.digest,
        "sorted_sequences": self.sorted_sequences.digest,
        "name_length_pairs": self.name_length_pairs.digest,
        "sorted_name_length_pairs": self.sorted_name_length_pairs_digest,
    }

level2

level2()

Converts object into dict of level 2 representation of the SequenceCollection.

Source code in refget/models.py

def level2(self):
    """
    Converts object into dict of level 2 representation of the SequenceCollection.
    """
    return {
        "lengths": self.lengths.value,
        "names": self.names.value,
        "sequences": self.sequences.value,
        "sorted_sequences": self.sorted_sequences.value,
        "name_length_pairs": self.name_length_pairs.value,
        # sorted_name_length_pairs is transient - only digest stored, not value
    }

FastaDrsObject

A DRS object specialized for FASTA files, storing file metadata and FAI index information.

FastaDrsObject

Bases: DrsObject

A DRS object specialized for FASTA sequence files. Stores file metadata including size, checksums (SHA-256, MD5, and refget sequence collection digest), and creation time. The refget digest serves as the object ID, enabling content-addressable retrieval.

from_fasta_file `classmethod`

from_fasta_file(fasta_file, digest=None)

Given a FASTA file, create a FastaDrsObject populated with the computed file size and checksums.

Parameters:

Name	Type	Description	Default
`fasta_file`	`str`	Path to a FASTA file	required
`digest`	`str`	The refget digest of the sequence collection (optional). If not included, it will be computed	`None`

Returns:

Type	Description
`FastaDrsObject`	The FastaDrsObject object

Raises:

Type	Description
`ImportError`	If gtars is not installed (required for FASTA processing)

Source code in refget/models.py

@classmethod
def from_fasta_file(cls, fasta_file: str, digest: str = None) -> "FastaDrsObject":
    """
    Given a FASTA file, create a FastaDrsObject object,
    return a populated FastaDrsObject with computed size and checksum.

    Args:
        fasta_file (str): Path to a FASTA file
        digest (str): The refget digest of the sequence collection
            (optional). If not included, it will be computed

    Returns:
        (FastaDrsObject): The FastaDrsObject object

    Raises:
        ImportError: If gtars is not installed (required for FASTA processing)
    """
    return create_fasta_drs_object(fasta_file, digest)

to_response

to_response(base_uri=None)

Return a copy of this object with self_uri populated for API response.

Parameters:

Name	Type	Description	Default
`base_uri`	`str`	Base URI for the DRS service (e.g., "drs://seqcolapi.databio.org"). If not provided, returns self unchanged.	`None`

Returns:

Type	Description
`FastaDrsObject`	FastaDrsObject with self_uri populated

Source code in refget/models.py

def to_response(self, base_uri: str = None) -> "FastaDrsObject":
    """
    Return a copy of this object with self_uri populated for API response.

    Args:
        base_uri: Base URI for the DRS service (e.g., "drs://seqcolapi.databio.org")
                 If not provided, returns self unchanged.

    Returns:
        FastaDrsObject with self_uri populated
    """
    if base_uri is None:
        return self

    return self.model_copy(update={"self_uri": f"{base_uri}/{self.id}"})

DrsObject

Base model for GA4GH Data Repository Service (DRS) objects.

DrsObject

Bases: SQLModel

A data object representing a single blob of bytes with metadata, checksums, and access methods. DRS objects are self-contained and provide all information needed for clients to retrieve the data. Conforms to GA4GH Data Repository Service (DRS) specification v1.4.0.

coerce_access_methods `classmethod`

coerce_access_methods(v)

Coerce dicts to AccessMethod objects when loading from JSON.

Source code in refget/models.py

@field_validator("access_methods", mode="before")
@classmethod
def coerce_access_methods(cls, v):
    """Coerce dicts to AccessMethod objects when loading from JSON."""
    if v is None:
        return []
    return [
        AccessMethod.model_validate(item) if isinstance(item, dict) else item for item in v
    ]

coerce_checksums `classmethod`

coerce_checksums(v)

Coerce dicts to Checksum objects when loading from JSON.

Source code in refget/models.py

@field_validator("checksums", mode="before")
@classmethod
def coerce_checksums(cls, v):
    """Coerce dicts to Checksum objects when loading from JSON."""
    if v is None:
        return []
    return [Checksum.model_validate(item) if isinstance(item, dict) else item for item in v]

serialize_access_methods

serialize_access_methods(v)

Serialize AccessMethod objects (or dicts) to dicts for JSON output.

Source code in refget/models.py

@field_serializer("access_methods")
def serialize_access_methods(self, v):
    """Serialize AccessMethod objects (or dicts) to dicts for JSON output."""
    if v is None:
        return []
    return [item.model_dump() if hasattr(item, "model_dump") else item for item in v]

serialize_checksums

serialize_checksums(v)

Serialize Checksum objects (or dicts) to dicts for JSON output.

Source code in refget/models.py

@field_serializer("checksums")
def serialize_checksums(self, v):
    """Serialize Checksum objects (or dicts) to dicts for JSON output."""
    if v is None:
        return []
    return [item.model_dump() if hasattr(item, "model_dump") else item for item in v]

Pangenome

A collection of sequence collections representing a pangenome.

Pangenome

Bases: SQLModel

from_dict `classmethod`

from_dict(pangenome_obj, inherent_attrs=None)

Given a dict representation of a pangenome, create a Pangenome object. This is the primary way to create a Pangenome object.

Parameters:

Name	Type	Description	Default
`pangenome_obj`	`dict`	Dictionary representation of a canonical pangenome object	required

Returns:

Type	Description
`Pangenome`	The Pangenome object

Source code in refget/models.py

@classmethod
def from_dict(cls, pangenome_obj: dict, inherent_attrs: Optional[list] = None) -> "Pangenome":
    """
    Given a dict representation of a pangenome, create a Pangenome object.
    This is the primary way to create a Pangenome object.

    Args:
        pangenome_obj (dict): Dictionary representation of a canonical pangenome object

    Returns:
        (Pangenome): The Pangenome object
    """
    raise NotImplementedError("This method is not yet implemented.")

level1

level1()

Converts object into dict of level 1 representation of the Pangenome.

Source code in refget/models.py

def level1(self):
    """Converts object into dict of level 1 representation of the Pangenome."""
    return {"names": self.names_digest, "collections": self.collections_digest}

level2

level2()

Converts object into dict of level 2 representation of the Pangenome.

Source code in refget/models.py

def level2(self):
    """Converts object into dict of level 2 representation of the Pangenome."""
    return {
        "names": self.names.value.split(","),
        "collections": [x.digest for x in self.collections],
    }

level3

level3()

Converts object into dict of level 3 representation of the Pangenome.

Source code in refget/models.py

def level3(self):
    """Converts object into dict of level 3 representation of the Pangenome."""
    return {
        "names": self.names.value.split(","),
        "collections": [x.level1() for x in self.collections],
    }

level4

level4()

Converts object into dict of level 4 representation of the Pangenome.

Source code in refget/models.py

def level4(self):
    """Converts object into dict of level 4 representation of the Pangenome."""
    return {
        "names": self.names.value.split(","),
        "collections": [x.level2() for x in self.collections],
    }

Sequence

An individual sequence with its digest and content.

Sequence

Bases: SQLModel

Supporting Models

AccessMethod

Describes how to access object bytes (protocol type, URL, region).

AccessMethod

Bases: SQLModel

Describes a method for accessing object bytes, including the protocol type (e.g., https, s3, gs) and either a direct URL or an access_id for the /access endpoint. At least one of access_url or access_id must be provided.

DRS 1.5.0 adds the 'cloud' field to explicitly specify the cloud provider.
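The rule that at least one of access_url or access_id must be provided can be illustrated with a small standard-library stand-in. This is only a sketch of the constraint: `AccessMethodSketch` is a hypothetical dataclass, not the refget `AccessMethod` model, and how the real model enforces the rule is not shown in this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessMethodSketch:
    """Hypothetical stand-in for AccessMethod; not the refget class."""

    type: str  # protocol, e.g. "https", "s3", "gs"
    access_url: Optional[str] = None
    access_id: Optional[str] = None

    def __post_init__(self):
        # Reject objects that give the client no way to reach the bytes.
        if self.access_url is None and self.access_id is None:
            raise ValueError("Provide at least one of access_url or access_id")


ok = AccessMethodSketch(type="https", access_url="https://example.org/genome.fa")

try:
    AccessMethodSketch(type="s3")
    rejected = False
except ValueError:
    rejected = True

print(ok.type, rejected)
```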

AccessURL

A fully resolvable URL with optional headers for authentication.

AccessURL

Bases: SQLModel

A fully resolvable URL that can be used to fetch the actual object bytes. Optionally includes headers (e.g., authorization tokens) required for access.

Checksum

A checksum for data integrity verification.

Checksum

Bases: SQLModel

A checksum for data integrity verification. The type field indicates the hash algorithm (e.g., "sha-256", "md5") and the checksum field contains the hex-string encoded hash value.
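To make the type and checksum fields concrete, the sketch below computes hex-encoded SHA-256 and MD5 values for a file using only `hashlib`. It is an illustration under assumptions: `file_checksums` is a hypothetical helper, the output is a list of plain dicts rather than `Checksum` objects, and it is not necessarily how refget computes these values.

```python
import hashlib
import os
import tempfile


def file_checksums(path: str, chunk_size: int = 1 << 20) -> list:
    """Return DRS-style checksum dicts (type plus hex digest) for a file."""
    hashers = {"sha-256": hashlib.sha256(), "md5": hashlib.md5()}
    with open(path, "rb") as handle:
        # Stream the file so large FASTA files are not read into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return [{"type": kind, "checksum": h.hexdigest()} for kind, h in hashers.items()]


# Demonstrate on a small temporary FASTA file.
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
    tmp.write(b">chr1\nACGT\n")

checksums = file_checksums(tmp.name)
size = os.path.getsize(tmp.name)
os.remove(tmp.name)

print(size, checksums)
```

Reading in chunks lets both hashes be computed in a single pass over the file.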

Response Models

PaginationResult

Pagination metadata for list endpoints.

PaginationResult

Bases: BaseModel

ResultsSequenceCollections

Paginated sequence collection results.

ResultsSequenceCollections

Bases: BaseModel

Sequence collection results with pagination

Similarities

Results from Jaccard similarity calculations.

Similarities

Bases: BaseModel

Model containing the results of similarity calculations.
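For readers unfamiliar with the measure, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The sketch below applies it to two lists of placeholder sequence digests. The `jaccard` function is a hypothetical helper, and treating two empty inputs as 0.0 is an assumption of this sketch, not documented refget behaviour.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty inputs count as no overlap
    return len(set_a & set_b) / len(union)


# Placeholder digests, for illustration only.
collection_a = ["SQ.aaa", "SQ.bbb", "SQ.ccc"]
collection_b = ["SQ.bbb", "SQ.ccc", "SQ.ddd"]

# Two shared digests out of four distinct digests.
print(jaccard(collection_a, collection_b))
```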

Attribute Tables

These models store individual attributes of sequence collections in normalized database tables:

NamesAttr

Bases: SQLModel

LengthsAttr

Bases: SQLModel

SequencesAttr

Bases: SQLModel

NameLengthPairsAttr

Bases: SQLModel

Usage Examples

Creating a SequenceCollection from a FASTA file

from refget.models import SequenceCollection

# From a FASTA file (requires gtars)
seqcol = SequenceCollection.from_fasta_file("genome.fa")

# Access different representations
print(seqcol.digest)  # Top-level digest
print(seqcol.level1())  # Attribute digests
print(seqcol.level2())  # Full arrays
print(seqcol.itemwise())  # Per-sequence dicts

Creating a SequenceCollection from a dictionary

from refget.models import SequenceCollection

seqcol_dict = {
    "names": ["chr1", "chr2"],
    "lengths": [1000, 2000],
    "sequences": ["SQ.abc123...", "SQ.def456..."]
}

seqcol = SequenceCollection.from_dict(seqcol_dict)
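The from_dict source shown earlier begins by validating that collated attributes have matching lengths. If you assemble dictionaries by hand, a similar check can be run beforehand. The following is a minimal sketch of such a check: `check_collated` is a hypothetical helper, and the choice of names, lengths, and sequences as the attributes to compare is an assumption of this sketch.

```python
def check_collated(seqcol_dict: dict, attrs=("names", "lengths", "sequences")) -> None:
    """Raise ValueError if the listed arrays do not all have the same length."""
    sizes = {attr: len(seqcol_dict[attr]) for attr in attrs if attr in seqcol_dict}
    if len(set(sizes.values())) > 1:
        raise ValueError(f"Collated attributes differ in length: {sizes}")


# Placeholder digests, for illustration only.
good = {"names": ["chr1", "chr2"], "lengths": [1000, 2000], "sequences": ["SQ.aaa", "SQ.bbb"]}
bad = {"names": ["chr1", "chr2"], "lengths": [1000], "sequences": ["SQ.aaa", "SQ.bbb"]}

check_collated(good)  # passes silently

try:
    check_collated(bad)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True

print(mismatch_detected)
```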

Creating a FastaDrsObject

from refget.models import FastaDrsObject

# From a FASTA file
drs_obj = FastaDrsObject.from_fasta_file("genome.fa")

# Access DRS metadata
print(drs_obj.id)  # Sequence collection digest
print(drs_obj.size)  # File size in bytes
print(drs_obj.checksums)  # SHA-256, MD5
print(drs_obj.access_methods)  # How to download
