Data Models

The refget package uses Pydantic and SQLModel for data validation and object-relational mapping (ORM). These models represent the core data structures for sequence collections, DRS objects, and related metadata.

Note: You only need these data models if you are developing packages that build on the refget Python API.

Model hierarchy

DrsObject (base)
└── FastaDrsObject (table)

SQLModel (base)
├── SequenceCollection (table)
├── Pangenome (table)
├── Sequence (table)
├── AccessMethod
├── AccessURL
└── Checksum

Core Models

SequenceCollection

The primary model representing a GA4GH sequence collection.

SequenceCollection

Bases: SQLModel

A SQLModel/pydantic model that represents a refget sequence collection.

digest class-attribute instance-attribute
digest = Field(primary_key=True)

Top-level digest of the SequenceCollection.

lengths class-attribute instance-attribute
lengths = Relationship(back_populates='collection')

Array of sequence lengths.

name_length_pairs class-attribute instance-attribute
name_length_pairs = Relationship(back_populates='collection')

Array of name-length pairs, representing the coordinate system of the collection.

names class-attribute instance-attribute
names = Relationship(back_populates='collection')

Array of sequence names.

sequences class-attribute instance-attribute
sequences = Relationship(back_populates='collection')

Array of sequence digests.

sorted_name_length_pairs_digest class-attribute instance-attribute
sorted_name_length_pairs_digest = Field()

Digest of the sorted name-length pairs: a single digest that identifies the collection's coordinate system regardless of sequence order.

sorted_sequences class-attribute instance-attribute
sorted_sequences = Relationship(back_populates='collection')

Array of sorted sequence digests.

from_PySequenceCollection classmethod
from_PySequenceCollection(gtars_seq_col)

Given a PySequenceCollection object (from Rust bindings), create a SequenceCollection object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| gtars_seq_col | PySequenceCollection | PySequenceCollection object from Rust bindings. | required |

Returns:

| Type | Description |
| --- | --- |
| SequenceCollection | The SequenceCollection object. |

Raises:

| Type | Description |
| --- | --- |
| ImportError | If gtars is not installed (required for this conversion) |

Source code in refget/models.py
@classmethod
def from_PySequenceCollection(
    cls, gtars_seq_col: "gtarsSequenceCollection"
) -> "SequenceCollection":
    """
    Given a PySequenceCollection object (from Rust bindings), create a SequenceCollection object.

    Args:
       gtars_seq_col (PySequenceCollection): PySequenceCollection object from Rust bindings.

    Returns:
        (SequenceCollection): The SequenceCollection object.

    Raises:
        ImportError: If gtars is not installed (required for this conversion)
    """
    return seqcol_from_gtars(gtars_seq_col)

from_dict classmethod
from_dict(seqcol_dict, inherent_attrs=DEFAULT_INHERENT_ATTRS)

Given a dict representation of a sequence collection, create a SequenceCollection object. This is the primary way to create a SequenceCollection object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| seqcol_dict | dict | Dictionary representation of a canonical sequence collection object | required |
| inherent_attrs | list | List of inherent attributes to digest | DEFAULT_INHERENT_ATTRS |

Returns:

| Type | Description |
| --- | --- |
| SequenceCollection | The SequenceCollection object |

Source code in refget/models.py
@classmethod
def from_dict(
    cls, seqcol_dict: dict, inherent_attrs: Optional[list] = DEFAULT_INHERENT_ATTRS
) -> "SequenceCollection":
    """
    Given a dict representation of a sequence collection, create a SequenceCollection object.
    This is the primary way to create a SequenceCollection object.

    Args:
        seqcol_dict (dict): Dictionary representation of a canonical sequence collection object
        inherent_attrs (list, optional): List of inherent attributes to digest

    Returns:
        (SequenceCollection): The SequenceCollection object
    """

    # Validate collated attributes have matching lengths
    cls._validate_collated_attributes(seqcol_dict)

    # validate_seqcol(seqcol_dict)
    level1_dict = seqcol_dict_to_level1_dict(seqcol_dict)
    seqcol_digest = level1_dict_to_seqcol_digest(level1_dict, inherent_attrs)

    # Now, build the actual pydantic models
    sequences_attr = SequencesAttr(
        digest=level1_dict["sequences"], value=seqcol_dict["sequences"]
    )

    names_attr = NamesAttr(digest=level1_dict["names"], value=seqcol_dict["names"])

    lengths_attr = LengthsAttr(digest=level1_dict["lengths"], value=seqcol_dict["lengths"])

    nlp = build_name_length_pairs(seqcol_dict)
    nlp_attr = NameLengthPairsAttr(digest=sha512t24u_digest(canonical_str(nlp)), value=nlp)
    _LOGGER.debug(f"nlp: {nlp}")
    _LOGGER.debug(f"Name-length pairs: {nlp_attr}")

    snlp_digests = []  # sorted_name_length_pairs digests
    for i in range(len(nlp)):
        snlp_digests.append(sha512t24u_digest(canonical_str(nlp[i])))
    snlp_digests.sort()

    # you can build it like this, but instead I'm just building it from the nlp, to save compute
    # snlp = build_sorted_name_length_pairs(seqcol_dict)
    # v = ",".join(snlp)
    snlp_digest_level1 = sha512t24u_digest(canonical_str(snlp_digests))

    # This is now a transient attribute, so we don't need to store it in the database.
    # snlp_attr = SortedNameLengthPairsAttr(digest=snlp_digest_level1, value=snlp_digests)

    sorted_sequences_value = copy(seqcol_dict["sequences"])
    sorted_sequences_value.sort()
    sorted_sequences_digest = sha512t24u_digest(canonical_str(sorted_sequences_value))
    sorted_sequences_attr = SortedSequencesAttr(
        digest=sorted_sequences_digest, value=sorted_sequences_value
    )
    _LOGGER.debug(f"sorted_sequences_value: {sorted_sequences_value}")
    _LOGGER.debug(f"sorted_sequences_digest: {sorted_sequences_digest}")
    _LOGGER.debug(f"sorted_sequences_attr: {sorted_sequences_attr}")

    human_readable_names_list = []
    if "human_readable_names" in seqcol_dict and seqcol_dict["human_readable_names"]:
        # Assuming 'human_readable_name' is a list of strings in the input dictionary
        if isinstance(seqcol_dict["human_readable_names"], list):
            for name_str in seqcol_dict["human_readable_names"]:
                human_readable_names_list.append(
                    HumanReadableNames(human_readable_name=name_str, digest=seqcol_digest)
                )
        # Handle single string input (convert to list)
        elif isinstance(seqcol_dict["human_readable_names"], str):
            human_readable_names_list.append(
                HumanReadableNames(
                    human_readable_name=seqcol_dict["human_readable_names"],
                    digest=seqcol_digest,
                )
            )

    seqcol = SequenceCollection(
        digest=seqcol_digest,
        human_readable_names=human_readable_names_list,
        sequences=sequences_attr,
        sorted_sequences=sorted_sequences_attr,
        names=names_attr,
        lengths=lengths_attr,
        name_length_pairs=nlp_attr,
        sorted_name_length_pairs_digest=snlp_digest_level1,
    )

    _LOGGER.debug(f"seqcol: {seqcol}")

    return seqcol
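
The sort-invariant coordinate-system digest built in from_dict can be sketched with the standard library alone. This is a minimal sketch, not the library's code: `sha512t24u` follows the GA4GH definition (SHA-512, truncated to 24 bytes, base64url-encoded), while `canonical_str` here is a simplified stand-in based on `json.dumps`, and the pair layout (a dict with `length` and `name` keys) is an assumption. refget's own `canonical_str` and `build_name_length_pairs` helpers may differ in detail.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_str(obj) -> bytes:
    """Simplified stand-in for canonical JSON (sorted keys, no whitespace)."""
    return json.dumps(obj, separators=(",", ":"), sort_keys=True).encode("utf-8")


def sorted_name_length_pairs_digest(names, lengths) -> str:
    """Digest each pair, sort the digests, then digest the sorted list."""
    # Assumed pair layout: {"length": ..., "name": ...}
    pairs = [{"length": length, "name": name} for name, length in zip(names, lengths)]
    pair_digests = sorted(sha512t24u(canonical_str(pair)) for pair in pairs)
    return sha512t24u(canonical_str(pair_digests))


forward = sorted_name_length_pairs_digest(["chr1", "chr2"], [1000, 2000])
reverse = sorted_name_length_pairs_digest(["chr2", "chr1"], [2000, 1000])
print(forward == reverse)  # True: the digest ignores sequence order
```

Because the per-pair digests are sorted before the final digest is taken, two collections with the same names and lengths in a different order share this value.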

from_fasta_file classmethod
from_fasta_file(fasta_file)

Given a FASTA file, create a SequenceCollection object.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| fasta_file | str | Path to a FASTA file | required |

Returns:

| Type | Description |
| --- | --- |
| SequenceCollection | The SequenceCollection object |

Raises:

| Type | Description |
| --- | --- |
| ImportError | If gtars is not installed (required for FASTA processing) |

Source code in refget/models.py
@classmethod
def from_fasta_file(cls, fasta_file: str) -> "SequenceCollection":
    """
    Given a FASTA file, create a SequenceCollection object.

    Args:
        fasta_file (str): Path to a FASTA file

    Returns:
        (SequenceCollection): The SequenceCollection object

    Raises:
        ImportError: If gtars is not installed (required for FASTA processing)
    """
    seqcol = fasta_to_seqcol_dict(fasta_file)
    return cls.from_dict(seqcol)
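
To show the kind of dictionary that from_fasta_file hands to from_dict, here is a rough pure-Python sketch. The function `fasta_to_dict_sketch` is hypothetical and is not the gtars-backed `fasta_to_seqcol_dict`; it assumes that names come from the first token of each header, that residues are uppercased before digesting, and that sequence digests are written as `SQ.` followed by a sha512t24u digest.

```python
import base64
import hashlib
import os
import tempfile


def sha512t24u(data: bytes) -> str:
    """GA4GH digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def fasta_to_dict_sketch(path: str) -> dict:
    """Hypothetical pure-Python stand-in for the gtars-backed FASTA reader."""
    names, chunks = [], []
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the name is the first whitespace-delimited header token
                header = line[1:].split()
                names.append(header[0] if header else "")
                chunks.append([])
            elif chunks:
                # Assumption: residues are uppercased before digesting
                chunks[-1].append(line.upper())
    seqs = ["".join(parts) for parts in chunks]
    return {
        "names": names,
        "lengths": [len(seq) for seq in seqs],
        "sequences": ["SQ." + sha512t24u(seq.encode("ascii")) for seq in seqs],
    }


# Demo on a throwaway two-record FASTA file
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">chr1 example\nACGT\nacgt\n>chr2\nGG\n")
result = fasta_to_dict_sketch(tmp.name)
os.remove(tmp.name)
print(result["names"], result["lengths"])  # ['chr1', 'chr2'] [8, 2]
```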

itemwise
itemwise(limit=None)

Converts object into a list of dictionaries, one for each sequence in the collection. If limit is set and the collection contains more sequences than limit, a ValueError is raised.

Source code in refget/models.py
def itemwise(self, limit=None):
    """
    Converts object into a list of dictionaries, one for each sequence in the collection.
    """
    if limit and len(self.sequences.value) > limit:
        raise ValueError(f"Too many sequences to format itemwise: {len(self.sequences.value)}")
    list_of_dicts = []
    for i in range(len(self.lengths.value)):
        list_of_dicts.append(
            {
                "name": self.names.value[i],
                "length": self.lengths.value[i],
                "sequence": self.sequences.value[i],
            }
        )
    return list_of_dicts
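
The itemwise transformation turns parallel arrays into one record per sequence. A minimal standalone sketch on plain lists (the free function below is illustrative and is not part of refget):

```python
def itemwise(names, lengths, sequences, limit=None):
    """Zip parallel arrays into one dict per sequence."""
    if limit and len(sequences) > limit:
        raise ValueError(f"Too many sequences to format itemwise: {len(sequences)}")
    return [
        {"name": name, "length": length, "sequence": sequence}
        for name, length, sequence in zip(names, lengths, sequences)
    ]


rows = itemwise(["chr1", "chr2"], [1000, 2000], ["SQ.abc", "SQ.def"])
print(rows[0])  # {'name': 'chr1', 'length': 1000, 'sequence': 'SQ.abc'}
```

Note that `zip` silently truncates to the shortest input, whereas the method above indexes by position; this sketch assumes the three arrays have equal length.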

level1
level1()

Converts object into dict of level 1 representation of the SequenceCollection.

Returns the digest of each attribute. Passthru attributes, which are represented by their raw values rather than by digests, are not yet part of the database model; for dict-based construction they are handled in seqcol_dict_to_level1_dict(). Once passthru attributes are added to the database model, this method should return .value instead of .digest for them.

Source code in refget/models.py
def level1(self):
    """
    Converts object into dict of level 1 representation of the SequenceCollection.

    Returns attribute digests for most attributes, but returns raw values for passthru attributes.
    Note: Passthru handling for dict-based construction happens in seqcol_dict_to_level1_dict().
    When passthru attributes are added to the database model, return .value instead of .digest here.
    """
    return {
        "lengths": self.lengths.digest,
        "names": self.names.digest,
        "sequences": self.sequences.digest,
        "sorted_sequences": self.sorted_sequences.digest,
        "name_length_pairs": self.name_length_pairs.digest,
        "sorted_name_length_pairs": self.sorted_name_length_pairs_digest,
    }

level2
level2()

Converts object into dict of level 2 representation of the SequenceCollection.

Source code in refget/models.py
def level2(self):
    """
    Converts object into dict of level 2 representation of the SequenceCollection.
    """
    return {
        "lengths": self.lengths.value,
        "names": self.names.value,
        "sequences": self.sequences.value,
        "sorted_sequences": self.sorted_sequences.value,
        "name_length_pairs": self.name_length_pairs.value,
        # sorted_name_length_pairs is transient - only digest stored, not value
    }
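
The relationship between the two representations can be sketched as follows: each level 1 entry is a digest of the corresponding level 2 array, and the top-level digest is computed from the level 1 entries of the inherent attributes. This is a sketch under assumptions, not refget's implementation: `canonical_str` is a simplified `json.dumps` stand-in, and the default `("names", "sequences")` is a placeholder, since the actual value of DEFAULT_INHERENT_ATTRS is not shown on this page.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_str(obj) -> bytes:
    """Simplified stand-in for canonical JSON (sorted keys, no whitespace)."""
    return json.dumps(obj, separators=(",", ":"), sort_keys=True).encode("utf-8")


def level1_from_level2(level2: dict) -> dict:
    """Replace each attribute array with the digest of its canonical form."""
    return {attr: sha512t24u(canonical_str(value)) for attr, value in level2.items()}


def top_level_digest(level1: dict, inherent_attrs=("names", "sequences")) -> str:
    """Digest only the inherent attributes of the level 1 dict (assumed default)."""
    inherent = {key: value for key, value in level1.items() if key in inherent_attrs}
    return sha512t24u(canonical_str(inherent))


level2 = {
    "names": ["chr1", "chr2"],
    "lengths": [1000, 2000],
    "sequences": ["SQ.abc", "SQ.def"],
}
level1 = level1_from_level2(level2)
print(sorted(level1))  # ['lengths', 'names', 'sequences']
print(len(top_level_digest(level1)))  # 32
```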

FastaDrsObject

A DRS object specialized for FASTA files, storing file metadata and FAI index information.

FastaDrsObject

Bases: DrsObject

A DRS object specialized for FASTA sequence files. Stores file metadata including size, checksums (SHA-256, MD5, and refget sequence collection digest), and creation time. The refget digest serves as the object ID, enabling content-addressable retrieval.

from_fasta_file classmethod
from_fasta_file(fasta_file, digest=None)

Given a FASTA file, create and return a FastaDrsObject populated with the computed file size and checksums.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| fasta_file | str | Path to a FASTA file | required |
| digest | str | The refget digest of the sequence collection (optional). If not included, it will be computed | None |

Returns:

| Type | Description |
| --- | --- |
| FastaDrsObject | The FastaDrsObject object |

Raises:

| Type | Description |
| --- | --- |
| ImportError | If gtars is not installed (required for FASTA processing) |

Source code in refget/models.py
@classmethod
def from_fasta_file(cls, fasta_file: str, digest: str = None) -> "FastaDrsObject":
    """
    Given a FASTA file, create a FastaDrsObject object,
    return a populated FastaDrsObject with computed size and checksum.

    Args:
        fasta_file (str): Path to a FASTA file
        digest (str): The refget digest of the sequence collection
            (optional). If not included, it will be computed

    Returns:
        (FastaDrsObject): The FastaDrsObject object

    Raises:
        ImportError: If gtars is not installed (required for FASTA processing)
    """
    return create_fasta_drs_object(fasta_file, digest)
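
The file-level metadata described above (size plus SHA-256 and MD5 checksums) can be computed in a single streaming pass. The sketch below uses only the standard library; `file_metadata` is a hypothetical helper and says nothing about how `create_fasta_drs_object` is actually implemented. The `type`/`checksum` keys mirror the Checksum model described on this page.

```python
import hashlib
import os
import tempfile


def file_metadata(path: str, chunk_size: int = 1024 * 1024) -> dict:
    """Compute size and checksums in one streaming pass over the file."""
    sha256, md5 = hashlib.sha256(), hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large FASTA files are never fully in memory
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return {
        "size": os.path.getsize(path),
        "checksums": [
            {"type": "sha-256", "checksum": sha256.hexdigest()},
            {"type": "md5", "checksum": md5.hexdigest()},
        ],
    }


# Demo on a throwaway FASTA file
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
    tmp.write(b">chr1\nACGT\n")
meta = file_metadata(tmp.name)
os.remove(tmp.name)
print(meta["size"])  # 11
```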

to_response
to_response(base_uri=None)

Return a copy of this object with self_uri populated for API response.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| base_uri | str | Base URI for the DRS service (e.g., "drs://seqcolapi.databio.org"). If not provided, returns self unchanged. | None |

Returns:

| Type | Description |
| --- | --- |
| FastaDrsObject | FastaDrsObject with self_uri populated |

Source code in refget/models.py
def to_response(self, base_uri: str = None) -> "FastaDrsObject":
    """
    Return a copy of this object with self_uri populated for API response.

    Args:
        base_uri: Base URI for the DRS service (e.g., "drs://seqcolapi.databio.org")
                 If not provided, returns self unchanged.

    Returns:
        FastaDrsObject with self_uri populated
    """
    if base_uri is None:
        return self

    return self.model_copy(update={"self_uri": f"{base_uri}/{self.id}"})
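
The copy-with-update pattern used by to_response leaves the stored object untouched and returns a new one for the API response. A standard-library sketch of the same idea, using `dataclasses.replace` in place of Pydantic's `model_copy` (the `DrsRecord` class is a hypothetical stand-in for FastaDrsObject):

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DrsRecord:
    """Hypothetical stand-in for FastaDrsObject with only two fields."""

    id: str
    self_uri: Optional[str] = None

    def to_response(self, base_uri: Optional[str] = None) -> "DrsRecord":
        if base_uri is None:
            return self
        # Build a new record; the original instance is left unchanged
        return replace(self, self_uri=f"{base_uri}/{self.id}")


stored = DrsRecord(id="abc123")
response = stored.to_response("drs://seqcolapi.databio.org")
print(response.self_uri)  # drs://seqcolapi.databio.org/abc123
print(stored.self_uri)  # None
```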

DrsObject

Base model for GA4GH Data Repository Service (DRS) objects.

DrsObject

Bases: SQLModel

A data object representing a single blob of bytes with metadata, checksums, and access methods. DRS objects are self-contained and provide all information needed for clients to retrieve the data. Conforms to GA4GH Data Repository Service (DRS) specification v1.4.0.

coerce_access_methods classmethod
coerce_access_methods(v)

Coerce dicts to AccessMethod objects when loading from JSON.

Source code in refget/models.py
@field_validator("access_methods", mode="before")
@classmethod
def coerce_access_methods(cls, v):
    """Coerce dicts to AccessMethod objects when loading from JSON."""
    if v is None:
        return []
    return [
        AccessMethod.model_validate(item) if isinstance(item, dict) else item for item in v
    ]

coerce_checksums classmethod
coerce_checksums(v)

Coerce dicts to Checksum objects when loading from JSON.

Source code in refget/models.py
@field_validator("checksums", mode="before")
@classmethod
def coerce_checksums(cls, v):
    """Coerce dicts to Checksum objects when loading from JSON."""
    if v is None:
        return []
    return [Checksum.model_validate(item) if isinstance(item, dict) else item for item in v]

serialize_access_methods
serialize_access_methods(v)

Serialize AccessMethod objects (or dicts) to dicts for JSON output.

Source code in refget/models.py
@field_serializer("access_methods")
def serialize_access_methods(self, v):
    """Serialize AccessMethod objects (or dicts) to dicts for JSON output."""
    if v is None:
        return []
    return [item.model_dump() if hasattr(item, "model_dump") else item for item in v]

serialize_checksums
serialize_checksums(v)

Serialize Checksum objects (or dicts) to dicts for JSON output.

Source code in refget/models.py
@field_serializer("checksums")
def serialize_checksums(self, v):
    """Serialize Checksum objects (or dicts) to dicts for JSON output."""
    if v is None:
        return []
    return [item.model_dump() if hasattr(item, "model_dump") else item for item in v]
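
Together, the validators and serializers above form a round trip: dicts loaded from JSON become model objects, and model objects become dicts again on output, with None treated as an empty list in both directions. A standard-library sketch of that round trip, using a dataclass as a hypothetical stand-in for the Checksum model (the real code relies on Pydantic's `model_validate` and `model_dump`):

```python
from dataclasses import asdict, dataclass


@dataclass
class ChecksumSketch:
    """Hypothetical stand-in for the Checksum model."""

    type: str
    checksum: str


def coerce_checksums(values):
    """Turn dicts into objects; pass existing objects through unchanged."""
    if values is None:
        return []
    return [ChecksumSketch(**v) if isinstance(v, dict) else v for v in values]


def serialize_checksums(values):
    """Turn objects into dicts; pass existing dicts through unchanged."""
    if values is None:
        return []
    return [asdict(v) if isinstance(v, ChecksumSketch) else v for v in values]


raw = [{"type": "md5", "checksum": "d41d8cd98f00b204e9800998ecf8427e"}]
objects = coerce_checksums(raw)
print(serialize_checksums(objects) == raw)  # True
```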

Pangenome

A collection of sequence collections representing a pangenome.

Pangenome

Bases: SQLModel

from_dict classmethod
from_dict(pangenome_obj, inherent_attrs=None)

Given a dict representation of a pangenome, create a Pangenome object. This is intended to be the primary way to create a Pangenome object, but the method is not yet implemented and currently raises NotImplementedError.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pangenome_obj | dict | Dictionary representation of a canonical pangenome object | required |

Returns:

| Type | Description |
| --- | --- |
| Pangenome | The Pangenome object |

Source code in refget/models.py
@classmethod
def from_dict(cls, pangenome_obj: dict, inherent_attrs: Optional[list] = None) -> "Pangenome":
    """
    Given a dict representation of a pangenome, create a Pangenome object.
    This is the primary way to create a Pangenome object.

    Args:
        pangenome_obj (dict): Dictionary representation of a canonical pangenome object

    Returns:
        (Pangenome): The Pangenome object
    """
    raise NotImplementedError("This method is not yet implemented.")

level1
level1()

Converts object into dict of level 1 representation of the Pangenome.

Source code in refget/models.py
def level1(self):
    """Converts object into dict of level 1 representation of the Pangenome."""
    return {"names": self.names_digest, "collections": self.collections_digest}

level2
level2()

Converts object into dict of level 2 representation of the Pangenome.

Source code in refget/models.py
def level2(self):
    """Converts object into dict of level 2 representation of the Pangenome."""
    return {
        "names": self.names.value.split(","),
        "collections": [x.digest for x in self.collections],
    }

level3
level3()

Converts object into dict of level 3 representation of the Pangenome.

Source code in refget/models.py
def level3(self):
    """Converts object into dict of level 3 representation of the Pangenome."""
    return {
        "names": self.names.value.split(","),
        "collections": [x.level1() for x in self.collections],
    }

level4
level4()

Converts object into dict of level 4 representation of the Pangenome.

Source code in refget/models.py
def level4(self):
    """Converts object into dict of level 4 representation of the Pangenome."""
    return {
        "names": self.names.value.split(","),
        "collections": [x.level2() for x in self.collections],
    }
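
Pangenome levels 2 through 4 differ only in how far each member collection is expanded: its digest, its level 1 dict, or its level 2 dict. A minimal sketch with plain dictionaries, assuming a hypothetical `collections` list whose entries carry precomputed `digest`, `level1`, and `level2` values (the real model reads these from related SequenceCollection objects):

```python
def pangenome_level(names_value: str, collections: list, level: int) -> dict:
    """Return a level 2, 3, or 4 view of a pangenome-like structure."""
    # Each level exposes a different view of the member collections
    key = {2: "digest", 3: "level1", 4: "level2"}[level]
    return {
        "names": names_value.split(","),  # names are stored as one comma-joined string
        "collections": [collection[key] for collection in collections],
    }


collections = [
    {"digest": "digestA", "level1": {"names": "n1"}, "level2": {"names": ["chr1"]}},
]
print(pangenome_level("sampleA", collections, 2))
# {'names': ['sampleA'], 'collections': ['digestA']}
```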

Sequence

An individual sequence with its digest and content.

Sequence

Bases: SQLModel

Supporting Models

AccessMethod

Describes how to access object bytes (protocol type, URL, region).

AccessMethod

Bases: SQLModel

Describes a method for accessing object bytes, including the protocol type (e.g., https, s3, gs) and either a direct URL or an access_id for the /access endpoint. At least one of access_url or access_id must be provided.

DRS 1.5.0 adds the 'cloud' field to explicitly specify the cloud provider.
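
The "at least one of access_url or access_id" rule can be expressed as a construction-time check. The dataclass below is a hypothetical simplification (it stores `access_url` as a plain string rather than an AccessURL object), and whether refget enforces the rule this way is an assumption:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessMethodSketch:
    """Hypothetical, simplified stand-in for AccessMethod."""

    type: str  # protocol, e.g. "https", "s3", "gs"
    access_url: Optional[str] = None
    access_id: Optional[str] = None

    def __post_init__(self):
        # Reject objects that give clients no way to reach the bytes
        if self.access_url is None and self.access_id is None:
            raise ValueError("Provide at least one of access_url or access_id")


direct = AccessMethodSketch(type="https", access_url="https://example.org/genome.fa")
indirect = AccessMethodSketch(type="s3", access_id="primary")
print(direct.access_id is None, indirect.access_url is None)  # True True
```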

AccessURL

A fully resolvable URL with optional headers for authentication.

AccessURL

Bases: SQLModel

A fully resolvable URL that can be used to fetch the actual object bytes. Optionally includes headers (e.g., authorization tokens) required for access.

Checksum

A checksum for data integrity verification.

Checksum

Bases: SQLModel

A checksum for data integrity verification. The type field indicates the hash algorithm (e.g., "sha-256", "md5") and the checksum field contains the hex-string encoded hash value.

Response Models

PaginationResult

Pagination metadata for list endpoints.

PaginationResult

Bases: BaseModel

ResultsSequenceCollections

Paginated sequence collection results.

ResultsSequenceCollections

Bases: BaseModel

Sequence collection results with pagination

Similarities

Results from Jaccard similarity calculations.

Similarities

Bases: BaseModel

Model to contain results from similarities calculations
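
For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal sketch on lists of sequence digests; the function name and the choice to return 0.0 for two empty inputs are assumptions of this example, and the library's own calculation may differ:

```python
def jaccard(a, b) -> float:
    """Jaccard similarity: |A intersect B| / |A union B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen for this sketch when both inputs are empty
    return len(set_a & set_b) / len(union)


print(jaccard(["SQ.a", "SQ.b"], ["SQ.b", "SQ.c"]))  # 0.3333333333333333
```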

Attribute Tables

These models store individual attributes of sequence collections in normalized database tables:

NamesAttr

NamesAttr

Bases: SQLModel

LengthsAttr

LengthsAttr

Bases: SQLModel

SequencesAttr

SequencesAttr

Bases: SQLModel

NameLengthPairsAttr

NameLengthPairsAttr

Bases: SQLModel

Usage Examples

Creating a SequenceCollection from a FASTA file

from refget.models import SequenceCollection

# From a FASTA file (requires gtars)
seqcol = SequenceCollection.from_fasta_file("genome.fa")

# Access different representations
print(seqcol.digest)  # Top-level digest
print(seqcol.level1())  # Attribute digests
print(seqcol.level2())  # Full arrays
print(seqcol.itemwise())  # Per-sequence dicts

Creating a SequenceCollection from a dictionary

from refget.models import SequenceCollection

seqcol_dict = {
    "names": ["chr1", "chr2"],
    "lengths": [1000, 2000],
    "sequences": ["SQ.abc123...", "SQ.def456..."]
}

seqcol = SequenceCollection.from_dict(seqcol_dict)

Creating a FastaDrsObject

from refget.models import FastaDrsObject

# From a FASTA file
drs_obj = FastaDrsObject.from_fasta_file("genome.fa")

# Access DRS metadata
print(drs_obj.id)  # Sequence collection digest
print(drs_obj.size)  # File size in bytes
print(drs_obj.checksums)  # SHA-256, MD5
print(drs_obj.access_methods)  # How to download