Data Models
The refget package uses Pydantic and SQLModel for data validation and database object-relational mapping (ORM). These models represent the core data structures for sequence collections, DRS objects, and related metadata.
Note: Data models are only needed if you want to develop new packages that rely on the refget Python API.
Model hierarchy
DrsObject (base)
└── FastaDrsObject (table)
SQLModel (base)
├── SequenceCollection (table)
├── Pangenome (table)
├── Sequence (table)
├── AccessMethod
├── AccessURL
└── Checksum
Core Models
SequenceCollection
The primary model representing a GA4GH sequence collection.
SequenceCollection
Bases: SQLModel
A SQLModel/pydantic model that represents a refget sequence collection.
digest
class-attribute
instance-attribute
digest = Field(primary_key=True)
Top-level digest of the SequenceCollection.
lengths
class-attribute
instance-attribute
lengths = Relationship(back_populates='collection')
Array of sequence lengths.
name_length_pairs
class-attribute
instance-attribute
name_length_pairs = Relationship(back_populates='collection')
Array of name-length pairs, representing the coordinate system of the collection.
names
class-attribute
instance-attribute
names = Relationship(back_populates='collection')
Array of sequence names.
sequences
class-attribute
instance-attribute
sequences = Relationship(back_populates='collection')
Array of sequence digests.
sorted_name_length_pairs_digest
class-attribute
instance-attribute
sorted_name_length_pairs_digest = Field()
Digest of the sorted name-length pairs. Because the pairs are sorted before digesting, this value identifies the coordinate system regardless of sequence order.
sorted_sequences
class-attribute
instance-attribute
sorted_sequences = Relationship(back_populates='collection')
Array of sorted sequence digests.
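The sort-invariant digest described for sorted_name_length_pairs_digest can be sketched with the standard library alone. This is a minimal illustration based on a reading of the GA4GH sequence collections specification, not refget's own code: the helper names `sha512t24u`, `canonical_json`, and `sorted_name_length_pairs_digest` are hypothetical, and `json.dumps` with sorted keys only approximates RFC 8785 canonicalization for plain strings and integers.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH-style digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(obj) -> bytes:
    """Approximate canonical JSON; adequate for plain strings and integers."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def sorted_name_length_pairs_digest(names, lengths) -> str:
    """Hypothetical helper: digest each pair, sort the digests, digest the sorted list."""
    pair_digests = sorted(
        sha512t24u(canonical_json({"length": length, "name": name}))
        for name, length in zip(names, lengths)
    )
    return sha512t24u(canonical_json(pair_digests))


# The same coordinate system listed in two different orders.
a = sorted_name_length_pairs_digest(["chr1", "chr2"], [1000, 2000])
b = sorted_name_length_pairs_digest(["chr2", "chr1"], [2000, 1000])
print(a == b)
```

Sorting the per-pair digests, rather than the pairs themselves, avoids having to define an ordering over objects; any two collections with the same set of name-length pairs then yield the same value.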
from_PySequenceCollection
classmethod
from_PySequenceCollection(gtars_seq_col)
Given a PySequenceCollection object (from Rust bindings), create a SequenceCollection object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| gtars_seq_col | PySequenceCollection | PySequenceCollection object from Rust bindings. | required |
Returns:
| Type | Description |
|---|---|
| SequenceCollection | The SequenceCollection object. |
Raises:
| Type | Description |
|---|---|
| ImportError | If gtars is not installed (required for this conversion). |
Source code in refget/models.py
from_dict
classmethod
from_dict(seqcol_dict, inherent_attrs=DEFAULT_INHERENT_ATTRS)
Given a dict representation of a sequence collection, create a SequenceCollection object. This is the primary way to create a SequenceCollection object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqcol_dict | dict | Dictionary representation of a canonical sequence collection object. | required |
| inherent_attrs | list | List of inherent attributes to digest. | DEFAULT_INHERENT_ATTRS |
Returns:
| Type | Description |
|---|---|
| SequenceCollection | The SequenceCollection object. |
Source code in refget/models.py
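To illustrate what "inherent attributes to digest" means, the sketch below derives a top-level digest from a dictionary using only a chosen subset of attributes. It is an approximation under stated assumptions, not the body of from_dict: `ASSUMED_INHERENT_ATTRS` is a made-up stand-in for DEFAULT_INHERENT_ATTRS (whose actual contents are not shown here), the helper names are hypothetical, and the sequence digests are placeholders.

```python
import base64
import hashlib
import json

# Assumption for illustration only; the real default lives in refget as DEFAULT_INHERENT_ATTRS.
ASSUMED_INHERENT_ATTRS = ("names", "sequences")


def sha512t24u(data: bytes) -> str:
    """GA4GH-style digest: SHA-512, truncated to 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(obj) -> bytes:
    """Approximate canonical JSON; adequate for plain strings and integers."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def top_level_digest(seqcol_dict: dict, inherent_attrs=ASSUMED_INHERENT_ATTRS) -> str:
    """Digest each inherent attribute, then digest the object of attribute digests."""
    attribute_digests = {
        attr: sha512t24u(canonical_json(seqcol_dict[attr]))
        for attr in inherent_attrs
        if attr in seqcol_dict
    }
    return sha512t24u(canonical_json(attribute_digests))


seqcol_dict = {
    "names": ["chr1", "chr2"],
    "lengths": [1000, 2000],
    "sequences": ["SQ.placeholder1", "SQ.placeholder2"],  # placeholder digests
}
digest = top_level_digest(seqcol_dict)
print(digest)
```

Under this assumed attribute list, editing `lengths` leaves the digest unchanged, while reordering `names` changes it; passing a different `inherent_attrs` changes which attributes contribute to identity.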
from_fasta_file
classmethod
from_fasta_file(fasta_file)
Given a FASTA file, create a SequenceCollection object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fasta_file | str | Path to a FASTA file. | required |
Returns:
| Type | Description |
|---|---|
| SequenceCollection | The SequenceCollection object. |
Raises:
| Type | Description |
|---|---|
| ImportError | If gtars is not installed (required for FASTA processing). |
Source code in refget/models.py
itemwise
itemwise(limit=None)
Converts object into a list of dictionaries, one for each sequence in the collection.
Source code in refget/models.py
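The itemwise transformation amounts to pivoting parallel arrays into one record per sequence. The function below is a hypothetical stand-in that operates on a plain dict of arrays rather than on a SequenceCollection, and it reuses the array keys unchanged; the key names and the set of attributes that refget actually emits may differ.

```python
from typing import Optional


def itemwise(level2: dict, limit: Optional[int] = None) -> list:
    """Hypothetical stand-in: turn parallel arrays into one dict per sequence."""
    keys = list(level2)
    # zip(*arrays) walks the arrays in lockstep, yielding one tuple per sequence.
    rows = [dict(zip(keys, values)) for values in zip(*(level2[k] for k in keys))]
    # limit=None returns every row; otherwise keep only the first `limit` rows.
    return rows if limit is None else rows[:limit]


level2 = {"names": ["chr1", "chr2"], "lengths": [1000, 2000]}
for row in itemwise(level2, limit=1):
    print(row)
```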
level1
level1()
Converts object into dict of level 1 representation of the SequenceCollection.
Returns attribute digests for most attributes, but returns raw values for passthru attributes. Note: Passthru handling for dict-based construction happens in seqcol_dict_to_level1_dict(). When passthru attributes are added to the database model, return .value instead of .digest here.
Source code in refget/models.py
level2
level2()
Converts object into dict of level 2 representation of the SequenceCollection.
Source code in refget/models.py
FastaDrsObject
A DRS object specialized for FASTA files, storing file metadata and FAI index information.
FastaDrsObject
Bases: DrsObject
A DRS object specialized for FASTA sequence files. Stores file metadata including size, checksums (SHA-256, MD5, and refget sequence collection digest), and creation time. The refget digest serves as the object ID, enabling content-addressable retrieval.
from_fasta_file
classmethod
from_fasta_file(fasta_file, digest=None)
Given a FASTA file, create and return a FastaDrsObject populated with the computed file size and checksums.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fasta_file | str | Path to a FASTA file. | required |
| digest | str | The refget digest of the sequence collection (optional). If not provided, it is computed. | None |
Returns:
| Type | Description |
|---|---|
| FastaDrsObject | The FastaDrsObject object. |
Raises:
| Type | Description |
|---|---|
| ImportError | If gtars is not installed (required for FASTA processing). |
Source code in refget/models.py
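The size and checksum part of this metadata can be reproduced with the standard library. The sketch below is not refget's implementation: `file_metadata` is a hypothetical helper, the output layout only mirrors the type/checksum pairs described for Checksum, and the refget sequence collection digest (which requires gtars) is left out.

```python
import hashlib
import os
import tempfile


def file_metadata(path: str, chunk_size: int = 1 << 20) -> dict:
    """Hypothetical helper: compute size plus SHA-256 and MD5 checksums in one pass."""
    sha256, md5 = hashlib.sha256(), hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large FASTA files never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha256.update(chunk)
            md5.update(chunk)
    return {
        "size": os.path.getsize(path),
        "checksums": [
            {"type": "sha-256", "checksum": sha256.hexdigest()},
            {"type": "md5", "checksum": md5.hexdigest()},
        ],
    }


# Demonstrate on a small temporary FASTA file.
with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
    tmp.write(b">chr1\nACGT\n")
meta = file_metadata(tmp.name)
os.remove(tmp.name)
print(meta["size"], meta["checksums"][0]["type"])
```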
to_response
to_response(base_uri=None)
Return a copy of this object with self_uri populated for API response.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| base_uri | str | Base URI for the DRS service (e.g., "drs://seqcolapi.databio.org"). If not provided, returns self unchanged. | None |
Returns:
| Type | Description |
|---|---|
| FastaDrsObject | FastaDrsObject with self_uri populated. |
Source code in refget/models.py
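The copy-with-self_uri behaviour can be pictured with a small dataclass. `DrsStub` is a hypothetical stand-in carrying only two fields, and the `<base_uri>/<id>` layout is an assumption made for illustration; the real method may assemble the URI differently.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class DrsStub:
    """Hypothetical stand-in for FastaDrsObject with only the fields needed here."""

    id: str
    self_uri: Optional[str] = None

    def to_response(self, base_uri: Optional[str] = None) -> "DrsStub":
        if base_uri is None:
            return self  # nothing to populate, so hand back the same object
        # Assumed URI layout: "<base_uri>/<id>".
        return replace(self, self_uri=f"{base_uri.rstrip('/')}/{self.id}")


stored = DrsStub(id="example-digest")
response = stored.to_response("drs://seqcolapi.databio.org")
print(response.self_uri)
```

Returning a copy keeps the stored object free of deployment-specific values, so the same record can be served from different base URIs.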
DrsObject
Base model for GA4GH Data Repository Service (DRS) objects.
DrsObject
Bases: SQLModel
A data object representing a single blob of bytes with metadata, checksums, and access methods. DRS objects are self-contained and provide all information needed for clients to retrieve the data. Conforms to GA4GH Data Repository Service (DRS) specification v1.4.0.
coerce_access_methods
classmethod
coerce_access_methods(v)
Coerce dicts to AccessMethod objects when loading from JSON.
Source code in refget/models.py
coerce_checksums
classmethod
coerce_checksums(v)
Coerce dicts to Checksum objects when loading from JSON.
Source code in refget/models.py
serialize_access_methods
serialize_access_methods(v)
Serialize AccessMethod objects (or dicts) to dicts for JSON output.
Source code in refget/models.py
serialize_checksums
serialize_checksums(v)
Serialize Checksum objects (or dicts) to dicts for JSON output.
Source code in refget/models.py
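The coerce and serialize hooks form a round trip between plain dicts (as stored in JSON) and typed objects. The following sketch shows that round trip with standard-library dataclasses instead of Pydantic validators; `ChecksumStub` and both functions are hypothetical stand-ins, not the methods documented above.

```python
from dataclasses import asdict, dataclass


@dataclass
class ChecksumStub:
    """Hypothetical stand-in for the Checksum model."""

    type: str
    checksum: str


def coerce_checksums(values):
    """Turn dicts into ChecksumStub objects; leave existing objects untouched."""
    return [v if isinstance(v, ChecksumStub) else ChecksumStub(**v) for v in values]


def serialize_checksums(values):
    """Turn ChecksumStub objects (or dicts) into plain dicts for JSON output."""
    return [asdict(v) if isinstance(v, ChecksumStub) else dict(v) for v in values]


raw = [{"type": "md5", "checksum": "d41d8cd98f00b204e9800998ecf8427e"}]
objects = coerce_checksums(raw)
print(serialize_checksums(objects) == raw)
```

Accepting both dicts and objects on each side makes the hooks idempotent, so data that has already been coerced or serialized passes through unchanged.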
Pangenome
A collection of sequence collections representing a pangenome.
Pangenome
Bases: SQLModel
from_dict
classmethod
from_dict(pangenome_obj, inherent_attrs=None)
Given a dict representation of a pangenome, create a Pangenome object. This is the primary way to create a Pangenome object.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pangenome_obj | dict | Dictionary representation of a canonical pangenome object. | required |
Returns:
| Type | Description |
|---|---|
| Pangenome | The Pangenome object. |
Source code in refget/models.py
level1
level1()
Converts object into dict of level 1 representation of the Pangenome.
Source code in refget/models.py
level2
level2()
Converts object into dict of level 2 representation of the Pangenome.
Source code in refget/models.py
level3
level3()
Converts object into dict of level 3 representation of the Pangenome.
Source code in refget/models.py
level4
level4()
Converts object into dict of level 4 representation of the Pangenome.
Source code in refget/models.py
Sequence
An individual sequence with its digest and content.
Sequence
Bases: SQLModel
Supporting Models
AccessMethod
Describes how to access object bytes (protocol type, URL, region).
AccessMethod
Bases: SQLModel
Describes a method for accessing object bytes, including the protocol type (e.g., https, s3, gs) and either a direct URL or an access_id for the /access endpoint. At least one of access_url or access_id must be provided.
DRS 1.5.0 adds the 'cloud' field to explicitly specify the cloud provider.
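The "at least one of access_url or access_id" rule is a cross-field constraint. One way to express it, shown here with a standard-library dataclass rather than the actual SQLModel class, is a check at construction time. `AccessMethodStub` is hypothetical, and `access_url` is simplified to a string where the real model uses an AccessURL object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccessMethodStub:
    """Hypothetical stand-in for AccessMethod with simplified field types."""

    type: str  # protocol, e.g. "https", "s3", "gs"
    access_url: Optional[str] = None
    access_id: Optional[str] = None
    cloud: Optional[str] = None  # optional cloud provider field

    def __post_init__(self):
        # Reject objects that give clients no way to reach the bytes.
        if self.access_url is None and self.access_id is None:
            raise ValueError("Provide at least one of access_url or access_id")


direct = AccessMethodStub(type="https", access_url="https://example.org/genome.fa")
print(direct.type)
```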
AccessURL
A fully resolvable URL with optional headers for authentication.
AccessURL
Bases: SQLModel
A fully resolvable URL that can be used to fetch the actual object bytes. Optionally includes headers (e.g., authorization tokens) required for access.
Checksum
A checksum for data integrity verification.
Checksum
Bases: SQLModel
A checksum for data integrity verification. The type field indicates the hash algorithm (e.g., "sha-256", "md5") and the checksum field contains the hex-string encoded hash value.
Response Models
PaginationResult
Pagination metadata for list endpoints.
PaginationResult
Bases: BaseModel
ResultsSequenceCollections
Paginated sequence collection results.
ResultsSequenceCollections
Bases: BaseModel
Sequence collection results with pagination
Similarities
Results from Jaccard similarity calculations.
Similarities
Bases: BaseModel
Model to contain results from similarities calculations
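For reference, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. The function below is a generic sketch of that formula, not refget's similarity code, and the choice to return 0.0 for two empty inputs is a convention picked for this example.

```python
def jaccard(a, b) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two iterables, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return len(set_a & set_b) / len(union)


# One shared name (chr2) out of three distinct names overall.
score = jaccard(["chr1", "chr2"], ["chr2", "chr3"])
print(score)
```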
Attribute Tables
These models store individual attributes of sequence collections in normalized database tables:
NamesAttr
NamesAttr
Bases: SQLModel
LengthsAttr
LengthsAttr
Bases: SQLModel
SequencesAttr
SequencesAttr
Bases: SQLModel
NameLengthPairsAttr
NameLengthPairsAttr
Bases: SQLModel
Usage Examples
Creating a SequenceCollection from a FASTA file
```python
from refget.models import SequenceCollection

# From a FASTA file (requires gtars)
seqcol = SequenceCollection.from_fasta_file("genome.fa")

# Access different representations
print(seqcol.digest)      # Top-level digest
print(seqcol.level1())    # Attribute digests
print(seqcol.level2())    # Full arrays
print(seqcol.itemwise())  # Per-sequence dicts
```
Creating a SequenceCollection from a dictionary
```python
from refget.models import SequenceCollection

seqcol_dict = {
    "names": ["chr1", "chr2"],
    "lengths": [1000, 2000],
    "sequences": ["SQ.abc123...", "SQ.def456..."]
}
seqcol = SequenceCollection.from_dict(seqcol_dict)
```
Creating a FastaDrsObject
```python
from refget.models import FastaDrsObject

# From a FASTA file
drs_obj = FastaDrsObject.from_fasta_file("genome.fa")

# Access DRS metadata
print(drs_obj.id)              # Sequence collection digest
print(drs_obj.size)            # File size in bytes
print(drs_obj.checksums)       # SHA-256, MD5
print(drs_obj.access_methods)  # How to download
```