Refget Python API Documentation
FASTA Processing
digest_fasta
digest_fasta(fasta)
Digest all sequences in a FASTA file and compute collection-level digests.
This function reads a FASTA file and computes GA4GH-compliant digests for each sequence, as well as collection-level digests (Level 1 and Level 2) following the GA4GH refget specification.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fasta | Union[str, PathLike] | Path to FASTA file (str or PathLike). | required |
Returns:
| Type | Description |
|---|---|
| SequenceCollection | Collection containing all sequences with their metadata and computed digests. |
Raises:
| Type | Description |
|---|---|
| IOError | If the FASTA file cannot be read or parsed. |
Example::
    from gtars.refget import digest_fasta
    collection = digest_fasta("genome.fa")
    print(f"Collection digest: {collection.digest}")
    print(f"Number of sequences: {len(collection)}")
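To make the digest terminology concrete, the following standard-library sketch shows how a GA4GH-style digest (SHA-512, truncated to 24 bytes, base64url-encoded) can be computed for a sequence and for one attribute array. It is an illustration, not the gtars implementation: the helper names are hypothetical, and compact `json.dumps` output is assumed to stand in for full JSON canonicalization, which is only safe for simple arrays of ASCII strings and integers.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """GA4GH-style digest: SHA-512, first 24 bytes, base64url-encoded."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def sequence_digest(sequence: str) -> str:
    """Hypothetical helper: 'SQ.' prefix plus the digest of the uppercased residues."""
    return "SQ." + sha512t24u(sequence.upper().encode("ascii"))


def level1_digest(values: list) -> str:
    """Hypothetical helper: digest one attribute array from its compact JSON form.

    Assumption: compact JSON stands in for canonical JSON here.
    """
    canonical = json.dumps(values, separators=(",", ":"), ensure_ascii=False)
    return sha512t24u(canonical.encode("utf-8"))


# Made-up two-sequence collection, for illustration only.
sequences = {"chr1": "ACGTACGT", "chr2": "ttga"}
names_digest = level1_digest(list(sequences))
lengths_digest = level1_digest([len(s) for s in sequences.values()])
seq_digests = [sequence_digest(s) for s in sequences.values()]
print(names_digest, lengths_digest, seq_digests)
```

Because 24 bytes encode to exactly 32 base64 characters, the digests carry no padding characters.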
fasta_to_seqcol_dict
fasta_to_seqcol_dict(fasta_file_path)
Convert a FASTA file into a Sequence Collection dict.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fasta_file_path | Union[str, Path] | Path to the FASTA file | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict | dict | A canonical sequence collection dictionary |
Raises:
| Type | Description |
|---|---|
| ImportError | If gtars is not installed (required for FASTA processing) |
Source code in refget/utils.py
compare_seqcols
compare_seqcols(A, B)
Workhorse comparison function: compares two sequence collections.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| A | SeqColDict | Sequence collection A | required |
| B | SeqColDict | Sequence collection B | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict | dict | Comparison result, structured as the return value defined for the comparison function in the formal seqcol specification |
Source code in refget/utils.py
calc_jaccard_similarities
calc_jaccard_similarities(A, B)
Takes two sequence collections and calculates Jaccard similarities for all attributes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| A | SeqColDict | Sequence collection A | required |
| B | SeqColDict | Sequence collection B | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict | dict[str, float] | Jaccard similarities for all attributes |
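As a rough illustration of the idea, the sketch below computes a set-based Jaccard similarity (size of the intersection divided by size of the union) for each attribute present in both collections. The helper names and example data are hypothetical, and the library function may handle duplicates, ordering, or missing attributes differently.

```python
def jaccard(a: list, b: list) -> float:
    """Set-based Jaccard similarity of two arrays."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # Assumption: two empty arrays are scored as 0.0
    return len(set_a & set_b) / len(union)


def jaccard_by_attribute(seqcol_a: dict, seqcol_b: dict) -> dict[str, float]:
    """Score every attribute that appears in both collections."""
    shared = sorted(seqcol_a.keys() & seqcol_b.keys())
    return {attr: jaccard(seqcol_a[attr], seqcol_b[attr]) for attr in shared}


# Made-up collections that share two of three entries per attribute.
A = {"names": ["chr1", "chr2", "chr3"], "lengths": [100, 200, 300]}
B = {"names": ["chr1", "chr2", "chrX"], "lengths": [100, 200, 400]}
similarities = jaccard_by_attribute(A, B)
print(similarities)
```

Each attribute here shares two elements out of four distinct ones, so both scores are 0.5.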
Source code in refget/utils.py
validate_seqcol
validate_seqcol(seqcol_obj, schema=None)
Validate a seqcol object against the seqcol schema. Returns True if valid; otherwise raises InvalidSeqColError, which enumerates the errors. Retrieve individual errors with exception.errors.
Source code in refget/utils.py
validate_seqcol_bool
validate_seqcol_bool(seqcol_obj, schema=None)
Validate a seqcol object against the seqcol schema. Returns True if valid, False if not.
To enumerate the errors, use validate_seqcol instead.
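The relationship between the raising validator and its boolean wrapper can be sketched as follows. Everything here is a stand-in: the class and function names are hypothetical, and the toy rule set (three list-valued attributes) replaces the real schema validation, which is not reproduced.

```python
class InvalidCollectionError(Exception):
    """Stand-in exception that carries the individual validation errors."""

    def __init__(self, message: str, errors: list[str]):
        super().__init__(message)
        self.errors = errors


def validate(seqcol_obj: dict) -> bool:
    """Return True if valid; raise with every error collected otherwise."""
    errors = []
    # Toy rules for illustration only, not the real seqcol schema.
    for attr in ("names", "lengths", "sequences"):
        if attr not in seqcol_obj:
            errors.append(f"missing attribute: {attr}")
        elif not isinstance(seqcol_obj[attr], list):
            errors.append(f"attribute is not an array: {attr}")
    if errors:
        raise InvalidCollectionError("validation failed", errors)
    return True


def validate_bool(seqcol_obj: dict) -> bool:
    """Boolean wrapper: swallow the exception and report False instead."""
    try:
        return validate(seqcol_obj)
    except InvalidCollectionError:
        return False


bad = {"names": ["chr1"], "lengths": "100"}
print(validate_bool(bad))
try:
    validate(bad)
except InvalidCollectionError as exc:
    print(exc.errors)
```

The boolean form suits quick filtering, while the raising form is the one to use when the caller needs to report what is wrong.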
Source code in refget/utils.py
FastAPI Integration
create_refget_router
create_refget_router(sequences=False, collections=True, pangenomes=False, fasta_drs=False, compliance=True, refget_store_url=None)
Create a FastAPI router for the sequence collection API. This router provides endpoints for retrieving and comparing sequence collections. You can choose which endpoints to include by setting the sequences, collections, pangenomes, or fasta_drs flags.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | bool | Include sequence endpoints | False |
| collections | bool | Include sequence collection endpoints | True |
| pangenomes | bool | Include pangenome endpoints | False |
| fasta_drs | bool | Include FASTA DRS endpoints | False |
| refget_store_url | str | URL of backing RefgetStore (e.g., s3://bucket/store/) | None |
Returns:
| Type | Description |
|---|---|
| APIRouter | A FastAPI router with the specified endpoints |
Examples:
    from fastapi import FastAPI
    from refget.router import create_refget_router

    app = FastAPI()
    app.include_router(create_refget_router(fasta_drs=True), prefix="/seqcol")
Source code in refget/router.py
Client Classes
The client module provides interfaces for interacting with refget-compliant servers.
SequenceClient
SequenceClient(urls=['https://www.ebi.ac.uk/ena/cram'], raise_errors=None)
Bases: RefgetClient
A client for interacting with a refget sequences API.
Initializes the sequences client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| urls | list | A list of base URLs of the sequences API. Defaults to ["https://www.ebi.ac.uk/ena/cram"]. | ['https://www.ebi.ac.uk/ena/cram'] |
| raise_errors | bool | Whether to raise errors or log them. Defaults to None, which will guess. | None |
Attributes:
| Name | Type | Description |
|---|---|---|
| urls | list | The list of base URLs of the sequences API. |
Source code in refget/clients.py
get_metadata
get_metadata(digest)
Retrieves metadata for a given sequence digest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The digest of the sequence. | required |
Returns:
| Type | Description |
|---|---|
| dict | The metadata. |
Source code in refget/clients.py
get_sequence
get_sequence(digest, start=None, end=None)
Retrieves a sequence for a given digest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The digest of the sequence. | required |
Returns:
| Type | Description |
|---|---|
| str | The sequence. |
Source code in refget/clients.py
SequenceCollectionClient
SequenceCollectionClient(urls=['https://seqcolapi.databio.org'], raise_errors=None)
Bases: RefgetClient
A client for interacting with a refget sequence collections API.
Initializes the sequence collection client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| urls | list | A list of base URLs of the sequence collection API. Defaults to ["https://seqcolapi.databio.org"]. | ['https://seqcolapi.databio.org'] |
Attributes:
| Name | Type | Description |
|---|---|---|
| urls | list | The list of base URLs of the sequence collection API. |
Source code in refget/clients.py
build_chrom_sizes
build_chrom_sizes(digest)
Build a chrom.sizes file content for a sequence collection.
Format per line: NAME\tLENGTH
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
Returns:
| Type | Description |
|---|---|
| str | String content of the chrom.sizes file |
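The NAME\tLENGTH layout can be illustrated with a small standalone sketch that formats a names array and a lengths array into chrom.sizes text. The helper name and the example collection are hypothetical; the client method obtains these arrays from the server by digest, which is not shown here.

```python
def chrom_sizes_text(names: list[str], lengths: list[int]) -> str:
    """Format parallel names/lengths arrays as tab-separated chrom.sizes lines."""
    if len(names) != len(lengths):
        raise ValueError("names and lengths must have the same number of entries")
    return "".join(f"{name}\t{length}\n" for name, length in zip(names, lengths))


# Made-up level 2 style arrays, for illustration only.
collection = {"names": ["chr1", "chr2"], "lengths": [248956422, 242193529]}
text = chrom_sizes_text(collection["names"], collection["lengths"])
print(text, end="")
```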
Source code in refget/clients.py
build_fai
build_fai(digest)
Build a complete .fai index file content for a FASTA.
FAI format per line: NAME\tLENGTH\tOFFSET\tLINEBASES\tLINEWIDTH
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
Returns:
| Type | Description |
|---|---|
| str | String content of the .fai file |
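A minimal sketch of how the five FAI columns could be assembled, assuming the index data has the keys line_bases, extra_line_bytes, and offsets, that offsets is a list aligned with the names array, and that LINEWIDTH equals line_bases plus extra_line_bytes. The helper name and sample values are hypothetical, and the real method may assemble the file differently.

```python
def fai_text(names: list[str], lengths: list[int], fai_index: dict) -> str:
    """Build .fai content: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH per line."""
    offsets = fai_index["offsets"]
    if not (len(names) == len(lengths) == len(offsets)):
        raise ValueError("names, lengths and offsets must align")
    line_bases = fai_index["line_bases"]
    # Assumption: line width in bytes is the bases per line plus the newline bytes.
    line_width = line_bases + fai_index["extra_line_bytes"]
    return "".join(
        f"{name}\t{length}\t{offset}\t{line_bases}\t{line_width}\n"
        for name, length, offset in zip(names, lengths, offsets)
    )


# Made-up FASTA layout: 60 bases per line, Unix newlines, headers ">chr1" and ">chr2".
index = {"line_bases": 60, "extra_line_bytes": 1, "offsets": [6, 134]}
fai = fai_text(["chr1", "chr2"], [120, 50], index)
print(fai, end="")
```

In the sample layout, chr1 starts after its 6-byte header line, and chr2 starts after two 61-byte sequence lines plus its own 6-byte header.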
Source code in refget/clients.py
compare
compare(digest1, digest2)
Compares two sequence collections hosted on the server.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest1 | str | The digest of the first sequence collection. | required |
| digest2 | str | The digest of the second sequence collection. | required |
Returns:
| Type | Description |
|---|---|
| dict | The JSON response containing the comparison of the two sequence collections. |
Source code in refget/clients.py
compare_local
compare_local(digest, local_collection)
Compares a server-hosted sequence collection with a local collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The digest of the server-hosted sequence collection. | required |
| local_collection | dict | A level 2 sequence collection representation. | required |
Returns:
| Type | Description |
|---|---|
| dict | The JSON response containing the comparison. |
Source code in refget/clients.py
download_fasta
download_fasta(digest, dest_path=None, access_id=None)
Download the FASTA file to a local path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
| dest_path | str | Destination file path. If None, uses object name. | None |
| access_id | str | Specific access method to use. If None, tries all. | None |
Returns:
| Type | Description |
|---|---|
| str | Path to downloaded file |
Raises:
| Type | Description |
|---|---|
| ValueError | If no access methods available or specified access_id not found |
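The access_id behavior described above can be sketched as a selection step over a DRS object's access_methods list: with no access_id every method is a candidate, with one only the matching method is used, and ValueError covers both failure cases. The function name and the sample DRS object are hypothetical, and no download is performed.

```python
from typing import Optional


def candidate_access_methods(drs_object: dict, access_id: Optional[str] = None) -> list[dict]:
    """Pick the access methods a download attempt would try, in order."""
    methods = drs_object.get("access_methods") or []
    if not methods:
        raise ValueError("no access methods available")
    if access_id is None:
        return methods  # try all of them
    matches = [m for m in methods if m.get("access_id") == access_id]
    if not matches:
        raise ValueError(f"access_id not found: {access_id}")
    return matches


# Made-up DRS object with two access methods.
drs_object = {
    "id": "abc123",
    "access_methods": [
        {"type": "https", "access_id": "web", "access_url": {"url": "https://example.com/genome.fa"}},
        {"type": "s3", "access_id": "bucket", "access_url": {"url": "s3://example-bucket/genome.fa"}},
    ],
}
print([m["access_id"] for m in candidate_access_methods(drs_object)])
print([m["access_id"] for m in candidate_access_methods(drs_object, "bucket")])
```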
Source code in refget/clients.py
download_fasta_to_store
download_fasta_to_store(digest, store, access_id=None, temp_dir=None)
Download the FASTA file and import it into a RefgetStore.
This method downloads the FASTA file from the DRS endpoint and immediately imports it into the provided RefgetStore, enabling local sequence retrieval by digest without re-downloading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
| store | RefgetStore | The RefgetStore instance to import into | required |
| access_id | str | Specific access method to use. If None, tries all. | None |
| temp_dir | str | Directory for temporary download. If None, uses system temp. | None |
Returns:
| Type | Description |
|---|---|
| str | The collection digest of the imported sequences |
Raises:
| Type | Description |
|---|---|
| ValueError | If no access methods available or specified access_id not found |
| ImportError | If gtars/RefgetStore is not available |
Example
    from refget.store import RefgetStore
    from refget.clients import SequenceCollectionClient

    store = RefgetStore.in_memory()
    client = SequenceCollectionClient()
    collection_digest = client.download_fasta_to_store("abc123", store)

    # Now you can retrieve sequences by digest from the local store
    seq = store.get_substring(sequence_digest, 0, 100)
Source code in refget/clients.py
get_attribute
get_attribute(attribute, digest)
Retrieves a specific attribute value by its digest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attribute | str | The attribute name (e.g., "names", "lengths", "sequences"). | required |
| digest | str | The level 1 digest of the attribute. | required |
Returns:
| Type | Description |
|---|---|
| dict | The JSON response containing the attribute value. |
Source code in refget/clients.py
get_collection
get_collection(digest, level=2)
Retrieves a sequence collection for a given digest and detail level.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The digest of the sequence collection. | required |
| level | int | The level of detail for the sequence collection. Defaults to 2. | 2 |
Returns:
| Type | Description |
|---|---|
| dict | The JSON response containing the sequence collection. |
Source code in refget/clients.py
get_fasta
get_fasta(digest)
Get DRS object metadata for a FASTA file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest (which is also the DRS object ID) | required |
Returns:
| Type | Description |
|---|---|
| dict | DRS object with id, self_uri, size, checksums, access_methods, etc. |
Source code in refget/clients.py
get_fasta_index
get_fasta_index(digest)
Get FAI index data for a FASTA file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
Returns:
| Type | Description |
|---|---|
| dict | Dict with line_bases, extra_line_bytes, offsets |
Source code in refget/clients.py
get_refget_store
get_refget_store(cache_dir)
Get a RefgetStore instance connected to the server's backing store.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cache_dir | str | Local directory for caching store data | required |
Returns:
| Type | Description |
|---|---|
| RefgetStore | RefgetStore instance loaded from remote |
Raises:
| Type | Description |
|---|---|
| ValueError | If server doesn't have a RefgetStore configured |
| ImportError | If gtars is not installed |
Source code in refget/clients.py
get_refget_store_url
get_refget_store_url()
Discover RefgetStore URL from service-info if available.
Returns:
| Type | Description |
|---|---|
| str | The RefgetStore URL if configured, None otherwise. |
Source code in refget/clients.py
is_fasta_drs_enabled
is_fasta_drs_enabled()
Check if FastaDRS endpoints are available.
Returns:
| Type | Description |
|---|---|
| bool | True if FastaDRS is enabled, False otherwise. |
Source code in refget/clients.py
list_attributes
list_attributes(attribute, page=None, page_size=None)
Lists all available values for a given attribute with optional paging support.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attribute | str | The attribute to list values for. | required |
| page | int | The page number to retrieve. Defaults to None. | None |
| page_size | int | The number of items per page. Defaults to None. | None |
Returns:
| Type | Description |
|---|---|
| dict | The JSON response containing the list of available values for the attribute. |
Source code in refget/clients.py
list_collections
list_collections(page=None, page_size=None, **filters)
Lists all available sequence collections with optional paging and attribute filtering support.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| page | int | The page number to retrieve. Defaults to None. | None |
| page_size | int | The number of items per page. Defaults to None. | None |
| **filters | Any | Optional attribute filters (e.g., names="abc123", lengths="def456"). Values should be level 1 digests of the attributes. | {} |
Returns:
| Type | Description |
|---|---|
| dict | The JSON response containing the list of available sequence collections. |
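One plausible way paging and attribute filters could translate into a request is sketched below: unset parameters are dropped and the rest become query-string pairs. The helper name and the /list/collection path are assumptions made for illustration, and the client may build its requests differently.

```python
from urllib.parse import urlencode


def list_collections_url(base_url: str, page=None, page_size=None, **filters) -> str:
    """Build a listing URL, omitting any parameter left as None."""
    params = {"page": page, "page_size": page_size, **filters}
    query = urlencode({key: value for key, value in params.items() if value is not None})
    # Assumption: the listing endpoint lives at /list/collection.
    url = f"{base_url.rstrip('/')}/list/collection"
    return f"{url}?{query}" if query else url


print(list_collections_url("https://seqcolapi.databio.org"))
print(list_collections_url("https://seqcolapi.databio.org", page=0, page_size=10, names="abc123"))
```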
Source code in refget/clients.py
service_info
service_info()
Retrieves information about the service.
Returns:
| Type | Description |
|---|---|
| dict | The service information. |
Source code in refget/clients.py
write_chrom_sizes
write_chrom_sizes(digest, dest_path)
Write a chrom.sizes file for a sequence collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
| dest_path | str | Path to write the chrom.sizes file | required |
Returns:
| Type | Description |
|---|---|
| str | Path to the written file |
Source code in refget/clients.py
write_fai
write_fai(digest, dest_path)
Write a .fai index file for a FASTA.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
| dest_path | str | Path to write the .fai file | required |
Returns:
| Type | Description |
|---|---|
| str | Path to the written file |
Source code in refget/clients.py
FastaDrsClient
FastaDrsClient(urls=['https://seqcolapi.databio.org/fasta'], raise_errors=None)
Bases: RefgetClient
A client for interacting with FASTA files via GA4GH DRS endpoints.
Initializes the FASTA DRS client.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| urls | list | A list of base URLs of the FASTA DRS API. Defaults to ["https://seqcolapi.databio.org/fasta"]. | ['https://seqcolapi.databio.org/fasta'] |
| raise_errors | bool | Whether to raise errors or log them. Defaults to None, which will guess. | None |
Attributes:
| Name | Type | Description |
|---|---|---|
| urls | list | The list of base URLs of the FASTA DRS API. |
Source code in refget/clients.py
build_fai
build_fai(digest, seqcol_client=None)
Build a complete .fai index file content for a FASTA.
FAI format per line: NAME\tLENGTH\tOFFSET\tLINEBASES\tLINEWIDTH
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
| seqcol_client | SequenceCollectionClient | SequenceCollectionClient to use. If None, uses parent client or creates one. | None |
Returns:
| Type | Description |
|---|---|
| str | String content of the .fai file |
Source code in refget/clients.py
download
download(digest, dest_path=None, access_id=None)
Download the FASTA file to a local path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
| dest_path | str | Destination file path. If None, uses object name. | None |
| access_id | str | Specific access method to use. If None, tries all. | None |
Returns:
| Type | Description |
|---|---|
| str | Path to downloaded file |
Raises:
| Type | Description |
|---|---|
| ValueError | If no access methods available or specified access_id not found |
Source code in refget/clients.py
download_to_store
download_to_store(digest, store, access_id=None, temp_dir=None)
Download the FASTA file and import it into a RefgetStore.
This method downloads the FASTA file from the DRS endpoint and immediately imports it into the provided RefgetStore, enabling local sequence retrieval by digest without re-downloading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
| store | RefgetStore | The RefgetStore instance to import into | required |
| access_id | str | Specific access method to use. If None, tries all. | None |
| temp_dir | str | Directory for temporary download. If None, uses system temp. | None |
Returns:
| Type | Description |
|---|---|
| str | The collection digest of the imported sequences |
Raises:
| Type | Description |
|---|---|
| ValueError | If no access methods available or specified access_id not found |
| ImportError | If gtars/RefgetStore is not available |
Example
    from refget.store import RefgetStore

    store = RefgetStore.in_memory()
    client = FastaDrsClient()
    collection_digest = client.download_to_store("abc123", store)
Source code in refget/clients.py
get_access_url
get_access_url(digest, access_id)
Get access URL for a specific access method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
| access_id | str | The access ID from the access method | required |
Returns:
| Type | Description |
|---|---|
| dict | Access URL object |
Source code in refget/clients.py
get_index
get_index(digest)
Get FAI index data for a FASTA file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
Returns:
| Type | Description |
|---|---|
| dict | Dict with line_bases, extra_line_bytes, offsets |
Source code in refget/clients.py
get_object
get_object(digest)
Get DRS object metadata for a FASTA file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest (which is also the DRS object ID) | required |
Returns:
| Type | Description |
|---|---|
| dict | DRS object with id, self_uri, size, checksums, access_methods, etc. |
Source code in refget/clients.py
service_info
service_info()
Get DRS service info.
Returns:
| Type | Description |
|---|---|
| dict | The service information. |
Source code in refget/clients.py
write_fai
write_fai(digest, dest_path, seqcol_client=None)
Write a .fai index file for a FASTA.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The sequence collection digest | required |
| dest_path | str | Path to write the .fai file | required |
| seqcol_client | SequenceCollectionClient | SequenceCollectionClient to use | None |
Returns:
| Type | Description |
|---|---|
| str | Path to the written file |
Source code in refget/clients.py
PangenomeClient
Bases: RefgetClient
Agent Classes
Agents provide higher-level abstractions for working with refget data in a PostgreSQL database.
RefgetDBAgent
RefgetDBAgent(engine=None, postgres_str=None, schema=SEQCOL_SCHEMA_PATH, inherent_attrs=DEFAULT_INHERENT_ATTRS, fasta_drs_url_prefix=None)
Bases: object
Primary aggregator agent, interface to all other agents
Parameterize it via these environment variables:
- POSTGRES_HOST
- POSTGRES_DB
- POSTGRES_USER
- POSTGRES_PASSWORD
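As an illustration of how these four variables could be combined, the sketch below assembles a PostgreSQL connection URL from an environment mapping. The helper name and the plain postgresql:// URL form are assumptions; the agent may construct its connection differently, and the result is only a candidate value for the postgres_str argument shown in the signature.

```python
import os
from urllib.parse import quote


def postgres_url_from_env(env=os.environ) -> str:
    """Hypothetical helper: build a connection URL from the POSTGRES_* variables."""
    required = ("POSTGRES_HOST", "POSTGRES_DB", "POSTGRES_USER", "POSTGRES_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise ValueError(f"missing environment variables: {', '.join(missing)}")
    # Percent-encode credentials so characters such as '@' do not break the URL.
    user = quote(env["POSTGRES_USER"], safe="")
    password = quote(env["POSTGRES_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{env['POSTGRES_HOST']}/{env['POSTGRES_DB']}"


# Made-up settings passed as a plain dict instead of the real environment.
settings = {
    "POSTGRES_HOST": "localhost",
    "POSTGRES_DB": "refget",
    "POSTGRES_USER": "u",
    "POSTGRES_PASSWORD": "p@ss",
}
print(postgres_url_from_env(settings))
```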
Source code in refget/agents.py
calc_similarities
calc_similarities(digestA, digestB)
Calculates the Jaccard similarity between two sequence collections.
This method retrieves two sequence collections using their digests and then computes Jaccard similarities for all attributes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digestA | str | The digest (identifier) for the first sequence collection. | required |
| digestB | str | The digest (identifier) for the second sequence collection. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict | dict | The Jaccard similarity score between the two sequence collections for all present and shared attributes. |
Source code in refget/agents.py
calc_similarities_seqcol_dicts
calc_similarities_seqcol_dicts(seqcolA, seqcolB)
Calculates the Jaccard similarity between two sequence collections.
This method computes Jaccard similarities between two input sequence collection dictionaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqcolA | dict | The first sequence collection in dict format. | required |
| seqcolB | dict | The second sequence collection in dict format. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| dict | dict | The Jaccard similarity score between the two sequence collections for all present and shared attributes. |
Source code in refget/agents.py
truncate
truncate()
Delete all records from the database
Source code in refget/agents.py
SequenceCollectionAgent
SequenceCollectionAgent(engine, inherent_attrs=None, parent=None)
Bases: object
Agent for interacting with a database of sequence collections
Source code in refget/agents.py
add
add(seqcol, update=False)
Add a sequence collection to the database or update it if it exists
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqcol | SequenceCollection | The sequence collection to add | required |
| update | bool | If True, update an existing collection if it exists | False |
Returns:
| Type | Description |
|---|---|
| SequenceCollection | The added or updated sequence collection |
Source code in refget/agents.py
add_from_dict
add_from_dict(seqcol_dict, update=False)
Add a sequence collection from a seqcol dictionary
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seqcol_dict | dict | The sequence collection in dictionary form | required |
| update | bool | If True, update an existing collection if it exists | False |
Returns:
| Type | Description |
|---|---|
| SequenceCollection | The added or updated sequence collection |
Source code in refget/agents.py
add_from_fasta_file
add_from_fasta_file(fasta_file_path, update=False, create_fasta_drs=True, human_readable_name=None)
Given a path to a fasta file, load the sequences into the refget database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fasta_file_path | str | Path to the fasta file | required |
| update | bool | If True, update an existing collection if it exists | False |
| create_fasta_drs | bool | If True, create a FastaDrsObject for the FASTA file | True |
| human_readable_name | str | Optional human-readable name for the collection | None |
Returns:
| Type | Description |
|---|---|
| SequenceCollection | The added or updated sequence collection |
Source code in refget/agents.py
add_from_fasta_file_with_name
add_from_fasta_file_with_name(fasta_file_path, human_readable_name, update=False, create_fasta_drs=True)
Given a path to a fasta file, and a human-readable name, load the sequences into the refget database.
Deprecated: Use add_from_fasta_file(fasta_file_path, human_readable_name=name) instead.
Source code in refget/agents.py
add_from_fasta_pep
add_from_fasta_pep(pep, fa_root, update=False, create_fasta_drs=True)
Given a PEP project and a root directory containing the fasta files, load the fasta files into the refget database.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| pep | Project | PEP project object containing sample metadata | required |
| fa_root | str | Root directory containing the fasta files | required |
| update | bool | If True, update existing sequence collections | False |
| create_fasta_drs | bool | If True, create FastaDrsObjects for the FASTA files | True |
Returns:
| Type | Description |
|---|---|
| dict | A dictionary of the digests of the added sequence collections |
Source code in refget/agents.py
get
get(digest, return_format='level2', attribute=None, itemwise_limit=None)
Get a sequence collection by digest
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The digest of the sequence collection | required |
| return_format | str | The format in which to return the sequence collection | 'level2' |
| attribute | str | Name of an attribute to return, if you just want an attribute | None |
| itemwise_limit | int | Limit the number of items returned in itemwise format | None |
Returns:
| Type | Description |
|---|---|
| SequenceCollection | The sequence collection (in requested format) |
Source code in refget/agents.py
search_by_attributes
search_by_attributes(filters, offset=0, limit=50)
Search sequence collections by multiple attribute filters (AND logic).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| filters | dict | Dict of {attribute_name: digest} pairs | required |
| offset | int | Pagination offset | 0 |
| limit | int | Max results to return | 50 |
Returns:
| Type | Description |
|---|---|
| dict | Dict with pagination info and results |
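The AND logic and offset/limit paging can be illustrated with an in-memory sketch. A collection matches only if every filter pair equals its stored level 1 digest, and the matches are then sliced by offset and limit. The function name, the record layout, and the keys of the returned dict are hypothetical; the agent runs its search against the database.

```python
def search_level1(records: dict, filters: dict, offset: int = 0, limit: int = 50) -> dict:
    """Return digests of records whose attributes match ALL filters, one page at a time."""
    matches = [
        digest
        for digest, attrs in records.items()
        if all(attrs.get(name) == value for name, value in filters.items())
    ]
    return {
        # Assumption: these key names are illustrative, not the agent's actual output.
        "pagination": {"offset": offset, "limit": limit, "total": len(matches)},
        "results": matches[offset : offset + limit],
    }


# Made-up records mapping a collection digest to its level 1 attribute digests.
records = {
    "col1": {"names": "n1", "lengths": "l1"},
    "col2": {"names": "n1", "lengths": "l2"},
    "col3": {"names": "n1", "lengths": "l1"},
}
page = search_level1(records, {"names": "n1", "lengths": "l1"}, offset=0, limit=1)
print(page)
```

Here two collections satisfy both filters, so the total is 2 while the first page holds a single result.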
Source code in refget/agents.py
SequenceAgent
SequenceAgent(engine)
Bases: object
Agent for interacting with database of sequences
Source code in refget/agents.py
PangenomeAgent
PangenomeAgent(parent)
Bases: object
Agent for interacting with database of pangenomes
Source code in refget/agents.py
AttributeAgent
AttributeAgent(engine)
Bases: object
Source code in refget/agents.py
FastaDrsAgent
FastaDrsAgent(engine, url_prefix=None)
Agent for interacting with database of FASTA DRS objects
Source code in refget/agents.py
add
add(fasta_drs)
Add a FastaDrsObject to the database
Source code in refget/agents.py
add_access_method
add_access_method(digest, access_method)
Add an access method to an existing FastaDrsObject.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The digest (object_id) of the DRS object | required |
| access_method | AccessMethod | The AccessMethod to add | required |
Returns:
| Type | Description |
|---|---|
| FastaDrsObject | The updated FastaDrsObject |
Source code in refget/agents.py
get
get(digest)
Get a FastaDrsObject by its digest (object_id)
Source code in refget/agents.py
list_by_offset
list_by_offset(limit=50, offset=0)
List FastaDrsObjects with pagination
Source code in refget/agents.py
RefgetStore (gtars)
RefgetStore provides high-performance local sequence storage implemented in Rust. It supports:
- In-memory and on-disk storage with optional compression
- Remote store access with local caching
- Sequence retrieval by digest or by collection + name
- BED file region extraction for batch operations
- FASTA export for individual sequences or regions
See the RefgetStore tutorial for usage examples.
RefgetStore
A global store for GA4GH refget sequences with lazy-loading support.
RefgetStore provides content-addressable storage for reference genome sequences following the GA4GH refget specification. Supports both local and remote stores with on-demand sequence loading.
Attributes:
| Name | Type | Description |
|---|---|---|
| cache_path | Optional[str] | Local directory path where the store is located or cached. None for in-memory stores. |
| remote_url | Optional[str] | Remote URL of the store if loaded remotely, None otherwise. |
| quiet | bool | Whether the store suppresses progress output. |
| storage_mode | StorageMode | Current storage mode (Raw or Encoded). |
Note
Boolean evaluation: RefgetStore follows Python container semantics, meaning bool(store) is False for empty stores (like list, dict, etc.). To check whether a store variable is initialized (not None), use if store is not None: rather than if store:.
Example::
store = RefgetStore.in_memory() # Empty store
bool(store) # False (empty container)
len(store) # 0
# Wrong: checks emptiness, not initialization
if store:
    process(store)
# Right: checks if variable is set
if store is not None:
    process(store)
Examples:
Create a new store and import sequences::
from gtars.refget import RefgetStore
store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")
Open an existing local store::
store = RefgetStore.open_local("/data/hg38")
seq = store.get_substring("chr1_digest", 0, 1000)
Open a remote store with caching::
store = RefgetStore.open_remote(
    "/local/cache",
    "https://example.com/hg38"
)
is_persisting
property
is_persisting
Whether the store is currently persisting to disk.
Example::
store = RefgetStore.in_memory()
print(store.is_persisting) # False
store.enable_persistence("/data/store")
print(store.is_persisting) # True
quiet
property
quiet
Whether the store is in quiet mode.
storage_mode
property
storage_mode
Current storage mode (Raw or Encoded).
add_collection_alias
add_collection_alias(namespace, alias, digest)
Add a collection alias: namespace/alias maps to collection digest.
add_sequence
add_sequence(sequence, force=False)
Add a sequence to the store without collection association.
The sequence can be created using digest_sequence() and later retrieved by its digest via get_sequence().
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | SequenceRecord | A SequenceRecord created by digest_sequence(). | required |
| force | bool | If True, overwrite an existing sequence. If False (default), skip duplicates. | False |
Raises:
| Type | Description |
|---|---|
| IOError | If the sequence cannot be stored. |
Example::
from gtars.refget import RefgetStore, digest_sequence
store = RefgetStore.in_memory()
seq = digest_sequence(b"ACGTACGT")
store.add_sequence(seq)
retrieved = store.get_sequence(seq.metadata.sha512t24u)
add_sequence_alias
add_sequence_alias(namespace, alias, digest)
Add a sequence alias: namespace/alias maps to sequence digest.
add_sequence_collection
add_sequence_collection(collection, force=False)
Add a pre-built SequenceCollection to the store.
Adds a SequenceCollection (created via digest_fasta() or programmatically) directly to the store without reading from a FASTA file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| collection | SequenceCollection | A SequenceCollection to add. | required |
| force | bool | If True, overwrite existing collections/sequences. If False (default), skip duplicates. | False |
Raises:
| Type | Description |
|---|---|
| IOError | If the collection cannot be stored. |
Example::
from gtars.refget import RefgetStore, digest_fasta
store = RefgetStore.in_memory()
collection = digest_fasta("genome.fa")
store.add_sequence_collection(collection)
add_sequence_collection_from_fasta
add_sequence_collection_from_fasta(file_path, force=False, namespaces=None)
Add a sequence collection from a FASTA file.
Reads a FASTA file, digests the sequences, creates a SequenceCollection, and adds it to the store along with all its sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Union[str, PathLike] | Path to the FASTA file to import. | required |
| force | bool | If True, overwrite existing collections/sequences. If False (default), skip duplicates. | False |
| namespaces | Optional[List[str]] | Optional list of namespace prefixes used to extract aliases from FASTA headers. For example, ["ncbi", "refseq"] scans headers for tokens carrying those prefixes. | None |
Returns:
| Type | Description |
|---|---|
| tuple[SequenceCollectionMetadata, bool] | A tuple containing the SequenceCollectionMetadata for the collection and a bool that is True if the collection was newly added, False if it already existed. |
Raises:
| Type | Description |
|---|---|
| IOError | If the file cannot be read or processed. |
Example::
store = RefgetStore.in_memory()
metadata, was_new = store.add_sequence_collection_from_fasta("genome.fa")
print(f"{'Added' if was_new else 'Skipped'}: {metadata.digest}")
# Extract aliases from FASTA headers
metadata, was_new = store.add_sequence_collection_from_fasta(
    "genome.fa", namespaces=["ncbi", "refseq"]
)
compare
compare(digest_a, digest_b)
Compare two collections by digest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest_a | str | First collection digest. | required |
| digest_b | str | Second collection digest. | required |
Returns:
| Type | Description |
|---|---|
| dict | dict with keys: digests, attributes, array_elements. |
disable_ancillary_digests
disable_ancillary_digests()
Disable computation of ancillary digests.
disable_attribute_index
disable_attribute_index()
Disable indexed attribute lookup, using brute-force scan instead.
disable_encoding
disable_encoding()
Disable encoding, use raw byte storage.
Decodes any existing Encoded sequences in memory.
Example::
store = RefgetStore.in_memory()
store.disable_encoding() # Switch to Raw mode
disable_persistence
disable_persistence()
Disable disk persistence for this store.
New sequences will be kept in memory only. Existing Stub sequences can still be loaded from disk if local_path is set.
Example::
store = RefgetStore.open_remote("/cache", "https://example.com")
store.disable_persistence() # Stop caching new sequences
enable_ancillary_digests
enable_ancillary_digests()
Enable computation of ancillary digests.
enable_attribute_index
enable_attribute_index()
Enable indexed attribute lookup (not yet implemented).
enable_encoding
enable_encoding()
Enable 2-bit encoding for space efficiency.
Re-encodes any existing Raw sequences in memory.
Example::
store = RefgetStore.in_memory()
store.disable_encoding() # Switch to Raw
store.enable_encoding() # Back to Encoded
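To make the space saving of 2-bit encoding concrete, here is a minimal sketch that packs A/C/G/T into two bits per base. The helper names (`encode_2bit`, `decode_2bit`) and the bit layout are assumptions for illustration only; the layout gtars actually uses is not described in this document and may differ, for example in how it handles non-ACGT symbols.

```python
# Illustrative 2-bit packing for A/C/G/T (hypothetical layout, not gtars internals).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode_2bit(seq: str) -> bytes:
    """Pack four bases per byte, first base in the two highest bits."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so decoding always reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def decode_2bit(data: bytes, length: int) -> str:
    """Unpack `length` bases; the length must be stored alongside the packed bytes."""
    bases = []
    for i in range(length):
        shift = 6 - 2 * (i % 4)
        bases.append(BASES[(data[i // 4] >> shift) & 0b11])
    return "".join(bases)


packed = encode_2bit("ACGTACGTAC")
print(len(packed))              # 10 bases fit in 3 bytes
print(decode_2bit(packed, 10))  # round-trips to the original string
```

Raw storage would need one byte per base, so a packing like this uses roughly a quarter of the space for plain nucleotide sequences.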
enable_persistence
enable_persistence(path)
Enable disk persistence for this store.
Sets up the store to write sequences to disk. Any in-memory Full sequences are flushed to disk and converted to Stubs.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, PathLike] | Directory for storing sequences and metadata. | required |
Raises:
| Type | Description |
|---|---|
| IOError | If the directory cannot be created or written to. |
Example::
store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")
store.enable_persistence("/data/store") # Flush to disk
export_fasta
export_fasta(collection_digest, output_path, sequence_names=None, line_width=None)
Export sequences from a collection to a FASTA file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| collection_digest | str | Collection to export from. | required |
| output_path | Union[str, PathLike] | Path to write FASTA file. | required |
| sequence_names | Optional[List[str]] | Optional list of sequence names to export. If None, exports all sequences in the collection. | None |
| line_width | Optional[int] | Optional line width for wrapping sequences. If None, uses default of 80. | None |
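The effect of line_width can be sketched in plain Python. The function below is a hypothetical stand-in (it is not part of gtars) that wraps a sequence at a fixed width under a FASTA header, assuming the documented default of 80.

```python
def format_fasta_record(name: str, sequence: str, line_width: int = 80) -> str:
    """Return one FASTA record with the sequence wrapped at `line_width` characters."""
    lines = [f">{name}"]
    for i in range(0, len(sequence), line_width):
        lines.append(sequence[i:i + line_width])
    return "\n".join(lines) + "\n"


# A 200-base sequence wraps into lines of 80, 80, and 40 bases.
record = format_fasta_record("chr1", "ACGT" * 50, line_width=80)
print(record)
```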
export_fasta_by_digests
export_fasta_by_digests(seq_digests, output_path, line_width=None)
Export sequences by their digests to a FASTA file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seq_digests | List[str] | List of sequence digests to export. | required |
| output_path | Union[str, PathLike] | Path to write FASTA file. | required |
| line_width | Optional[int] | Optional line width for wrapping sequences. If None, uses default of 80. | None |
export_fasta_from_regions
export_fasta_from_regions(collection_digest, bed_file_path, output_file_path)
Export sequences from BED file regions to a FASTA file.
Reads a BED file defining genomic regions and exports the sequences for those regions to a FASTA file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| collection_digest | str | The collection's SHA-512/24u digest. | required |
| bed_file_path | Union[str, PathLike] | Path to BED file defining regions. | required |
| output_file_path | Union[str, PathLike] | Path to write the output FASTA file. | required |
Raises:
| Type | Description |
|---|---|
| IOError | If files cannot be read/written or sequences not found. |
Example::
store.export_fasta_from_regions(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "regions.bed",
    "output.fa"
)
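For readers unfamiliar with BED-driven extraction, the following sketch shows the general idea using only the standard library. It assumes a plain three-column BED layout (chrom, start, end, with 0-based half-open coordinates) and an in-memory dict standing in for the collection; `parse_bed_line` and `extract_regions` are hypothetical helpers, not gtars functions.

```python
from typing import Dict, List, Tuple


def parse_bed_line(line: str) -> Tuple[str, int, int]:
    """Read the first three tab-separated BED columns: chrom, start, end."""
    chrom, start, end = line.split("\t")[:3]
    return chrom, int(start), int(end)


def extract_regions(
    sequences: Dict[str, str], bed_text: str
) -> List[Tuple[str, int, int, str]]:
    """Slice each BED region out of the named sequence (0-based, half-open)."""
    results = []
    for line in bed_text.splitlines():
        # Skip blank lines and common BED header/comment lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = parse_bed_line(line)
        results.append((chrom, start, end, sequences[chrom][start:end]))
    return results


sequences = {"chr1": "ACGTACGTAC", "chr2": "TTTTGGGGCC"}
bed_text = "chr1\t0\t4\nchr2\t4\t8\n"
for chrom, start, end, seq in extract_regions(sequences, bed_text):
    print(f">{chrom}:{start}-{end}\n{seq}")
```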
find_collections_by_attribute
find_collections_by_attribute(attr_name, attr_digest)
Find collections by attribute digest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attr_name | str | Attribute name (names, lengths, sequences, name_length_pairs, sorted_name_length_pairs, sorted_sequences). | required |
| attr_digest | str | The digest to search for. | required |
Returns:
| Type | Description |
|---|---|
| List[str] | List of collection digests that have the matching attribute. |
get_aliases_for_collection
get_aliases_for_collection(digest)
Reverse lookup: find all (namespace, alias) pairs pointing to this collection digest.
get_aliases_for_sequence
get_aliases_for_sequence(digest)
Reverse lookup: find all (namespace, alias) pairs pointing to this sequence digest.
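The alias methods describe a two-way mapping: (namespace, alias) resolves to a digest, and a digest can be looked up in reverse to find every (namespace, alias) pair pointing at it. The class below is a hypothetical in-memory model of that behavior, written only to illustrate the semantics; it is not how gtars stores aliases.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class AliasTable:
    """Toy model of namespace/alias -> digest with reverse lookup."""

    def __init__(self) -> None:
        self._forward: Dict[str, Dict[str, str]] = defaultdict(dict)

    def add(self, namespace: str, alias: str, digest: str) -> None:
        self._forward[namespace][alias] = digest

    def resolve(self, namespace: str, alias: str) -> Optional[str]:
        # Returns None when the alias is unknown, mirroring the *_by_alias getters.
        return self._forward.get(namespace, {}).get(alias)

    def aliases_for(self, digest: str) -> List[Tuple[str, str]]:
        # Reverse lookup: scan every namespace for aliases that map to this digest.
        return [
            (namespace, alias)
            for namespace, aliases in self._forward.items()
            for alias, target in aliases.items()
            if target == digest
        ]

    def remove(self, namespace: str, alias: str) -> bool:
        # True if the alias existed, as documented for the remove_*_alias methods.
        return self._forward.get(namespace, {}).pop(alias, None) is not None


table = AliasTable()
table.add("ucsc", "hg38", "uC_UorBNf3YUu1YIDainBhI94CedlNeH")
table.add("ncbi", "GRCh38", "uC_UorBNf3YUu1YIDainBhI94CedlNeH")
print(table.resolve("ucsc", "hg38"))
print(table.aliases_for("uC_UorBNf3YUu1YIDainBhI94CedlNeH"))
```

The namespace and alias values here ("ucsc", "hg38", and so on) are made-up sample data.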
get_attribute
get_attribute(attr_name, attr_digest)
Get attribute array by digest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| attr_name | str | Attribute name (names, lengths, or sequences). | required |
| attr_digest | str | The digest to search for. | required |
Returns:
| Type | Description |
|---|---|
| Optional[list] | The attribute array, or None if not found. |
get_collection
get_collection(collection_digest)
Get a collection by digest with all sequences loaded.
Loads the collection and all its sequence data into memory. Use this when you need full access to sequence content.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| collection_digest | str | The collection's SHA-512/24u digest. | required |
Returns:
| Type | Description |
|---|---|
| SequenceCollection | The collection with all sequence data loaded. |
Raises:
| Type | Description |
|---|---|
| IOError | If the collection cannot be loaded. |
Example::
collection = store.get_collection("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
for seq in collection.sequences:
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")
get_collection_by_alias
get_collection_by_alias(namespace, alias)
Resolve a collection alias and return the loaded collection.
Returns None if the alias is not found.
get_collection_level1
get_collection_level1(digest)
Get level 1 representation (attribute digests) for a collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | Collection digest. | required |
Returns:
| Type | Description |
|---|---|
| dict | dict with spec-compliant field names (names, lengths, sequences, plus optional name_length_pairs, sorted_name_length_pairs, sorted_sequences). |
get_collection_level2
get_collection_level2(digest)
Get level 2 representation (full arrays, spec format) for a collection.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | Collection digest. | required |
Returns:
| Type | Description |
|---|---|
| dict | dict with names (list[str]), lengths (list[int]), sequences (list[str]). |
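The relationship between the two levels can be sketched as follows: level 2 holds the full arrays, and level 1 holds one digest per array. This sketch assumes each attribute digest is the SHA-512/24u digest of the array's canonical JSON, and approximates canonical JSON with compact, key-sorted `json.dumps` output; `sha512t24u`, `canonical_json`, and `level1_from_level2` are hypothetical helper names, and the sample collection is made up.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """Truncate SHA-512 to 24 bytes and base64url-encode it (32 characters)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical_json(value) -> bytes:
    """Compact, key-sorted JSON; an approximation of canonical JSON for simple values."""
    return json.dumps(
        value, separators=(",", ":"), sort_keys=True, ensure_ascii=False
    ).encode("utf-8")


def level1_from_level2(level2: dict) -> dict:
    """Replace each attribute array with the digest of its canonical form."""
    return {attr: sha512t24u(canonical_json(values)) for attr, values in level2.items()}


level2 = {
    "names": ["chr1", "chr2"],
    "lengths": [8, 4],
    "sequences": ["SQ." + sha512t24u(b"ACGTACGT"), "SQ." + sha512t24u(b"ACGT")],
}
level1 = level1_from_level2(level2)
for attr, digest in level1.items():
    print(f"{attr}: {digest}")
```

Because level 1 is much smaller than level 2, it is the cheaper representation to fetch when only comparing collections attribute by attribute.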
get_collection_metadata
get_collection_metadata(collection_digest)
Get metadata for a collection by digest.
Returns lightweight metadata without loading the full collection. Use this for quick lookups of collection information.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| collection_digest | str | The collection's SHA-512/24u digest. | required |
Returns:
| Type | Description |
|---|---|
| Optional[SequenceCollectionMetadata] | Collection metadata if found, None otherwise. |
Example::
meta = store.get_collection_metadata("uC_UorBNf3YUu1YIDainBhI94CedlNeH")
if meta:
    print(f"Collection has {meta.n_sequences} sequences")
get_collection_metadata_by_alias
get_collection_metadata_by_alias(namespace, alias)
Resolve a collection alias to collection metadata (no data loading).
get_fhr_metadata
get_fhr_metadata(collection_digest)
Get FHR metadata for a collection. Returns None if missing.
get_sequence
get_sequence(digest)
Retrieve a sequence record by its digest (SHA-512/24u or MD5).
Loads the sequence data if not already in memory. Supports lookup by either SHA-512/24u (preferred) or MD5 digest. Automatically strips "SQ." prefix if present (case-insensitive).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | Sequence digest (SHA-512/24u base64url or MD5 hex string), optionally with "SQ." prefix. | required |
Returns:
| Type | Description |
|---|---|
| SequenceRecord | The sequence record with data. |
Raises:
| Type | Description |
|---|---|
| KeyError | If the sequence is not found. |
Example::
record = store.get_sequence("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
print(f"Found: {record.metadata.name}")
# Also works with SQ. prefix
record = store.get_sequence("SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
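The prefix handling described above amounts to a small normalization step before lookup. A minimal sketch, assuming the only transformation is removing a leading "SQ." in any letter case (`normalize_digest` is a hypothetical helper, not a gtars function):

```python
def normalize_digest(digest: str) -> str:
    """Drop a leading "SQ." prefix (case-insensitive); leave other input unchanged."""
    if digest[:3].lower() == "sq.":
        return digest[3:]
    return digest


print(normalize_digest("SQ.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))
print(normalize_digest("sq.aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))
print(normalize_digest("aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2"))
```

All three calls yield the same bare digest, which is why prefixed and unprefixed identifiers can be used interchangeably with the lookup methods.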
get_sequence_by_alias
get_sequence_by_alias(namespace, alias)
Resolve a sequence alias and return the loaded sequence record.
Returns None if the alias is not found.
get_sequence_by_name
get_sequence_by_name(collection_digest, sequence_name)
Retrieve a sequence by collection digest and sequence name.
Looks up a sequence within a specific collection using its name (e.g., "chr1", "chrM"). Loads the sequence data if needed. Automatically strips "SQ." prefix from collection digest if present.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| collection_digest | str | The collection's SHA-512/24u digest, optionally with "SQ." prefix. | required |
| sequence_name | str | Name of the sequence within that collection. | required |
Returns:
| Type | Description |
|---|---|
| SequenceRecord | The sequence record with data. |
Raises:
| Type | Description |
|---|---|
| KeyError | If the sequence is not found. |
Example::
record = store.get_sequence_by_name(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "chr1"
)
print(f"Sequence: {record.decode()[:50]}...")
get_sequence_metadata
get_sequence_metadata(seq_digest)
Get metadata for a sequence by digest (no data loaded).
Use this for lightweight lookups when you don't need the actual sequence. Automatically strips "SQ." prefix from digest if present.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seq_digest | str | The sequence's SHA-512/24u digest, optionally with "SQ." prefix. | required |
Returns:
| Type | Description |
|---|---|
| Optional[SequenceMetadata] | Sequence metadata if found, None otherwise. |
get_sequence_metadata_by_alias
get_sequence_metadata_by_alias(namespace, alias)
Resolve a sequence alias to sequence metadata (no data loading).
get_substring
get_substring(seq_digest, start, end)
Extract a substring from a sequence.
Retrieves a specific region from a sequence using 0-based, half-open coordinates [start, end). Automatically loads sequence data if not already cached (for lazy-loaded stores). Automatically strips "SQ." prefix from digest if present.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seq_digest | str | Sequence digest (SHA-512/24u), optionally with "SQ." prefix. | required |
| start | int | Start position (0-based, inclusive). | required |
| end | int | End position (0-based, exclusive). | required |
Returns:
| Type | Description |
|---|---|
| str | The substring sequence. |
Raises:
| Type | Description |
|---|---|
| KeyError | If the sequence is not found. |
Example::
# Get first 1000 bases of chr1
seq = store.get_substring("chr1_digest", 0, 1000)
print(f"First 50bp: {seq[:50]}")
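The 0-based, half-open convention [start, end) matches Python slicing, which the sketch below uses to show how it differs from 1-based inclusive coordinates such as "chr1:1-4". Both helpers are hypothetical, and the bounds check is an assumption made for the example rather than documented behavior of get_substring.

```python
from typing import Tuple


def substring_half_open(sequence: str, start: int, end: int) -> str:
    """Return sequence[start:end], rejecting out-of-range or reversed coordinates."""
    if not 0 <= start <= end <= len(sequence):
        raise ValueError(f"invalid range [{start}, {end}) for length {len(sequence)}")
    return sequence[start:end]


def one_based_to_half_open(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive coordinates to 0-based half-open ones."""
    return start - 1, end


seq = "ACGTACGT"
print(substring_half_open(seq, 0, 4))                          # first four bases
print(substring_half_open(seq, *one_based_to_half_open(1, 4)))  # same region, 1-based input
print(len(substring_half_open(seq, 2, 6)))                     # length is always end - start
```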
has_ancillary_digests
has_ancillary_digests()
Returns whether ancillary digests are enabled.
has_attribute_index
has_attribute_index()
Returns whether the on-disk attribute index is enabled.
in_memory
classmethod
in_memory()
Create a new in-memory RefgetStore.
Creates a store that keeps all sequences in memory. Use this for temporary processing or when you don't need disk persistence.
Returns:
| Type | Description |
|---|---|
| RefgetStore | New empty RefgetStore with Encoded storage mode. |
Example::
store = RefgetStore.in_memory()
store.add_sequence_collection_from_fasta("genome.fa")
into_readonly
into_readonly()
Convert to a ReadonlyRefgetStore for concurrent read access.
Consumes this store (replacing it with an empty in-memory store) and returns a ReadonlyRefgetStore whose read methods all use &self (no mutable borrow), making it suitable for Arc<ReadonlyRefgetStore> in servers.
Call load_all_collections() or load_collection() before converting, since ReadonlyRefgetStore cannot lazy-load.
Returns:
| Name | Type | Description |
|---|---|---|
| ReadonlyRefgetStore | ReadonlyRefgetStore | An immutable store suitable for concurrent access. |
Example::
store = RefgetStore.open_remote("/cache", "https://example.com")
store.load_all_collections()
readonly = store.into_readonly()
coll = readonly.get_collection("abc123")
is_collection_loaded
is_collection_loaded(collection_digest)
Check if a collection is fully loaded.
Returns True if the collection's sequence list is loaded in memory, False if it's only metadata (stub).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| collection_digest | str | The collection's SHA-512/24u digest. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if loaded, False otherwise. |
iter_collections
iter_collections()
Iterate over all collections with their sequences loaded.
This loads all collection data upfront and returns a list of SequenceCollection objects with full sequence data.
For browsing without loading data, use list_collections() instead.
Returns:
| Type | Description |
|---|---|
| List[SequenceCollection] | List of all collections with loaded sequences. |
Example::
for coll in store.iter_collections():
    print(f"{coll.digest}: {len(coll.sequences)} sequences")
iter_sequences
iter_sequences()
Iterate over all sequences with their data loaded.
This ensures all sequence data is loaded and returns a list of SequenceRecord objects with full sequence data.
For browsing without loading data, use list_sequences() instead.
Returns:
| Type | Description |
|---|---|
| List[SequenceRecord] | List of all sequences with loaded data. |
Example::
for seq in store.iter_sequences():
    print(f"{seq.metadata.name}: {seq.decode()[:20]}...")
list_collection_alias_namespaces
list_collection_alias_namespaces()
List all collection alias namespaces.
list_collection_aliases
list_collection_aliases(namespace)
List all aliases in a collection alias namespace.
list_collections
list_collections(page=0, page_size=100, filters=None)
List collections with pagination and optional attribute filtering.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| page | int | 0-indexed page number. | 0 |
| page_size | int | Number of results per page. | 100 |
| filters | Optional[Dict[str, str]] | Optional attribute filters (AND logic). Keys are attribute names (names, lengths, sequences, name_length_pairs, sorted_name_length_pairs, sorted_sequences), values are digests. | None |
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Dict with "results" (list of SequenceCollectionMetadata) and "pagination" (dict with page, page_size, total). |
Example::
# Get first page of all collections
result = store.list_collections()
for meta in result["results"]:
    print(f"{meta.digest}: {meta.n_sequences} sequences")
print(f"Total: {result['pagination']['total']}")
# Filter by names digest
result = store.list_collections(filters={"names": "abc123"})
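The combination of 0-indexed pages and AND-combined filters can be modeled in a few lines. This is a sketch over plain dicts, not the store's implementation: `list_page` is a hypothetical function, each collection is represented as a dict of attribute digests, and it assumes that "total" counts the matches after filtering.

```python
from typing import Dict, List, Optional


def list_page(
    collections: List[Dict[str, str]],
    page: int = 0,
    page_size: int = 100,
    filters: Optional[Dict[str, str]] = None,
) -> dict:
    """Filter with AND logic, then return one 0-indexed page plus pagination info."""
    wanted = filters or {}
    # A collection matches only if every filtered attribute has the requested digest.
    matches = [c for c in collections if all(c.get(k) == v for k, v in wanted.items())]
    start = page * page_size
    return {
        "results": matches[start:start + page_size],
        "pagination": {"page": page, "page_size": page_size, "total": len(matches)},
    }


demo = [
    {"digest": "c1", "names": "n1", "lengths": "l1"},
    {"digest": "c2", "names": "n1", "lengths": "l2"},
    {"digest": "c3", "names": "n2", "lengths": "l2"},
]
print(list_page(demo, page=1, page_size=2))
print(list_page(demo, filters={"names": "n1", "lengths": "l2"}))
```

The digests in `demo` are short placeholder strings, not real SHA-512/24u values.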
list_fhr_metadata
list_fhr_metadata()
List all collection digests that have FHR metadata.
list_sequence_alias_namespaces
list_sequence_alias_namespaces()
List all sequence alias namespaces.
list_sequence_aliases
list_sequence_aliases(namespace)
List all aliases in a sequence alias namespace.
list_sequences
list_sequences()
List all sequence metadata in the store.
Returns metadata for all sequences without loading sequence data. Use this for browsing/inventory operations.
Returns:
| Type | Description |
|---|---|
| List[SequenceMetadata] | List of metadata for all sequences in the store. |
Example::
for meta in store.list_sequences():
    print(f"{meta.name}: {meta.length} bp")
load_collection_aliases
load_collection_aliases(namespace, path)
Load collection aliases from a TSV file (alias\tdigest per line).
load_fhr_metadata
load_fhr_metadata(collection_digest, path)
Load FHR metadata from a JSON file and attach it to a collection.
load_sequence_aliases
load_sequence_aliases(namespace, path)
Load sequence aliases from a TSV file (alias\tdigest per line).
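The alias TSV format (one alias, a tab, then a digest per line) is simple enough to produce or check with the standard library. The parser below is a hypothetical helper for inspecting such a file before loading it; the alias names are made-up sample data, and skipping blank or short rows is an assumption of this sketch.

```python
import csv
import io
from typing import Dict


def parse_alias_tsv(text: str) -> Dict[str, str]:
    """Parse alias<TAB>digest lines into a dict, skipping blank or short rows."""
    aliases: Dict[str, str] = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 2:
            continue
        alias, digest = row[0], row[1]
        aliases[alias] = digest
    return aliases


tsv_text = (
    "chr1\taKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2\n"
    "\n"
    "chrM\tuC_UorBNf3YUu1YIDainBhI94CedlNeH\n"
)
print(parse_alias_tsv(tsv_text))
```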
on_disk
classmethod
on_disk(cache_path)
Create or load a disk-backed RefgetStore.
If the directory contains an existing store (rgstore.json), loads it. Otherwise creates a new store with Encoded mode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cache_path | Union[str, PathLike] | Directory path for the store. Created if it doesn't exist. | required |
Returns:
| Type | Description |
|---|---|
| RefgetStore | RefgetStore (new or loaded from disk). |
Example::
store = RefgetStore.on_disk("/data/my_store")
store.add_sequence_collection_from_fasta("genome.fa")
# Store is automatically persisted to disk
open_local
classmethod
open_local(path)
Open a local RefgetStore from a directory.
Loads only lightweight metadata and stubs. Collections and sequences remain as stubs until explicitly accessed with get_collection()/get_sequence().
Expects: rgstore.json, sequences.rgsi, collections.rgci, collections/*.rgsi
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, PathLike] | Local directory containing the refget store. | required |
Returns:
| Type | Description |
|---|---|
| RefgetStore | RefgetStore with metadata loaded, sequences lazy-loaded. |
Raises:
| Type | Description |
|---|---|
| IOError | If the store directory or index files cannot be read. |
Example::
store = RefgetStore.open_local("/data/hg38_store")
seq = store.get_substring("chr1_digest", 0, 1000)
open_remote
classmethod
open_remote(cache_path, remote_url)
Open a remote RefgetStore with local caching.
Loads only lightweight metadata and stubs from the remote URL. Data is fetched on-demand when get_collection()/get_sequence() is called.
By default, persistence is enabled (sequences are cached to disk).
Call disable_persistence() after loading to keep only in memory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cache_path | Union[str, PathLike] | Local directory to cache downloaded metadata and sequences. Created if it doesn't exist. | required |
| remote_url | str | Base URL of the remote refget store (e.g., "https://example.com/hg38" or "s3://bucket/hg38"). | required |
Returns:
| Type | Description |
|---|---|
| RefgetStore | RefgetStore with metadata loaded, sequences fetched on-demand. |
Raises:
| Type | Description |
|---|---|
| IOError | If remote metadata cannot be fetched or cache cannot be written. |
Example::
store = RefgetStore.open_remote(
    "/data/cache/hg38",
    "https://refget-server.com/hg38"
)
# First access fetches from remote and caches
seq = store.get_substring("chr1_digest", 0, 1000)
# Second access uses cache
seq2 = store.get_substring("chr1_digest", 1000, 2000)
remove_collection
remove_collection(digest, remove_orphan_sequences=False)
Remove a collection from the store.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| digest | str | The collection's SHA-512/24u digest string. | required |
| remove_orphan_sequences | bool | If True, also remove sequences no longer referenced by any remaining collection. Default: False. | False |
Returns:
| Type | Description |
|---|---|
| bool | True if the collection was found and removed, False if not found. |
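What counts as an orphan is easiest to see in a toy model. The sketch below represents collections as lists of sequence digests and sequences as a dict; the free function `remove_collection` here is a hypothetical stand-in for the method and only illustrates the documented semantics, with short placeholder strings in place of real digests.

```python
from typing import Dict, List


def remove_collection(
    collections: Dict[str, List[str]],
    sequences: Dict[str, str],
    digest: str,
    remove_orphan_sequences: bool = False,
) -> bool:
    """Remove a collection; optionally drop sequences no other collection references."""
    members = collections.pop(digest, None)
    if members is None:
        return False  # collection not found
    if remove_orphan_sequences:
        still_referenced = {s for seqs in collections.values() for s in seqs}
        for seq_digest in members:
            if seq_digest not in still_referenced:
                sequences.pop(seq_digest, None)
    return True


collections = {"colA": ["s1", "s2"], "colB": ["s2", "s3"]}
sequences = {"s1": "ACGT", "s2": "GGCC", "s3": "TTAA"}
print(remove_collection(collections, sequences, "colA", remove_orphan_sequences=True))
print(sorted(sequences))  # s2 survives because colB still references it
```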
remove_collection_alias
remove_collection_alias(namespace, alias)
Remove a single collection alias. Returns True if it existed.
remove_fhr_metadata
remove_fhr_metadata(collection_digest)
Remove FHR metadata for a collection.
remove_sequence_alias
remove_sequence_alias(namespace, alias)
Remove a single sequence alias. Returns True if it existed.
set_encoding_mode
set_encoding_mode(mode)
Change the storage mode, re-encoding/decoding existing sequences as needed.
When switching from Raw to Encoded, all Full sequences in memory are encoded (2-bit packed). When switching from Encoded to Raw, all Full sequences in memory are decoded back to raw bytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mode | StorageMode | The storage mode to switch to (StorageMode.Raw or StorageMode.Encoded). | required |
Example::
store = RefgetStore.in_memory()
store.set_encoding_mode(StorageMode.Raw)
set_fhr_metadata
set_fhr_metadata(collection_digest, metadata)
Set FHR metadata for a collection.
set_quiet
set_quiet(quiet)
Set whether to suppress progress output.
When quiet is True, operations like add_sequence_collection_from_fasta will not print progress messages.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| quiet | bool | Whether to suppress progress output. | required |
Example::
store = RefgetStore.in_memory()
store.set_quiet(True)
store.add_sequence_collection_from_fasta("genome.fa") # No output
stats
stats()
Returns statistics about the store.
Returns:
| Type | Description |
|---|---|
| dict | dict with keys: 'n_sequences' (total number of sequences, Stub + Full); 'n_sequences_loaded' (number of sequences with data loaded, Full); 'n_collections' (total number of collections, Stub + Full); 'n_collections_loaded' (number of collections with sequences loaded, Full); 'storage_mode' (storage mode, 'Raw' or 'Encoded'). |
Note
n_collections_loaded only reflects collections fully loaded in memory. For remote stores, collections are loaded on-demand when accessed.
Example::
stats = store.stats()
print(f"Store has {stats['n_sequences']} sequences")
print(f"Collections: {stats['n_collections']} total, {stats['n_collections_loaded']} loaded")
store_exists
classmethod
store_exists(path)
Check whether a valid RefgetStore exists at the given path.
Returns True if the path contains a store manifest file, indicating the store has been initialized. Returns False if the path does not exist or does not contain a store.
This avoids hardcoding knowledge of the store's internal file format in calling code.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, PathLike] | Path to the store directory. | required |
Returns:
| Type | Description |
|---|---|
| bool | True if a store exists at the path, False otherwise. |
Example::
from gtars.refget import RefgetStore
RefgetStore.store_exists("/data/hg38_store") # True
RefgetStore.store_exists("/tmp/empty") # False
substrings_from_regions
substrings_from_regions(collection_digest, bed_file_path)
Get substrings for BED file regions as a list.
Reads a BED file and returns a list of sequences for each region.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| collection_digest | str | The collection's SHA-512/24u digest. | required |
| bed_file_path | Union[str, PathLike] | Path to BED file defining regions. | required |
Returns:
| Type | Description |
|---|---|
| List[RetrievedSequence] | List of retrieved sequence segments. |
Raises:
| Type | Description |
|---|---|
| IOError | If files cannot be read or sequences not found. |
Example::
sequences = store.substrings_from_regions(
    "uC_UorBNf3YUu1YIDainBhI94CedlNeH",
    "regions.bed"
)
for seq in sequences:
    print(f"{seq.chrom_name}:{seq.start}-{seq.end}")
write
write()
Write the store using its configured paths.
Convenience method for disk-backed stores. Uses the store's own local_path and seqdata_path_template.
Raises:
| Type | Description |
|---|---|
| IOError | If the store cannot be written. |
write_store_to_dir
write_store_to_dir(root_path, seqdata_path_template=None)
Write the store to a directory on disk.
Persists the store with all sequences and metadata to disk using the RefgetStore directory format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| root_path | Union[str, PathLike] | Directory path to write the store to. | required |
| seqdata_path_template | Optional[str] | Optional path template for sequence files (e.g., "sequences/%s2/%s.seq" where %s2 = first 2 chars of digest, %s = full digest). Uses default if not specified. | None |
Example::
store.write_store_to_dir("/data/my_store")
store.write_store_to_dir("/data/my_store", "sequences/%s2/%s.seq")
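To see what a seqdata_path_template produces, the sketch below expands the two documented placeholders for a single digest. `expand_seqdata_path` is a hypothetical helper written for illustration; it assumes a plain textual substitution, which may not match every detail of how the store resolves paths.

```python
def expand_seqdata_path(template: str, digest: str) -> str:
    """Expand %s2 (first two characters of the digest) and %s (the full digest)."""
    # Replace the longer token first so "%s2" is not consumed as "%s" followed by "2".
    return template.replace("%s2", digest[:2]).replace("%s", digest)


path = expand_seqdata_path("sequences/%s2/%s.seq", "aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2")
print(path)  # sequences/aK/aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2.seq
```

Sharding by the first two characters of the digest keeps any single directory from accumulating one file per sequence in large stores.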
Digest Functions
Low-level functions for computing GA4GH digests:
sha512t24u_digest
sha512t24u_digest(readable)
Compute the GA4GH SHA-512/24u digest for a sequence.
This function computes the GA4GH refget standard digest (truncated SHA-512, base64url encoded) for a given sequence string or bytes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| readable | Union[str, bytes] | Input sequence as str or bytes. | required |
Returns:
| Type | Description |
|---|---|
| str | The SHA-512/24u digest (32 character base64url string). |
Raises:
| Type | Description |
|---|---|
| TypeError | If input is not str or bytes. |
Example::
from gtars.refget import sha512t24u_digest
digest = sha512t24u_digest("ACGT")
print(digest) # Output: 'aKF498dAxcJAqme6QYQ7EZ07-fiw8Kw2'
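For readers who want to see what the digest consists of, here is a pure-Python sketch of the same computation using only `hashlib` and `base64`: take the SHA-512 hash, keep the first 24 bytes, and base64url-encode them. The function name `sha512t24u_sketch` is hypothetical, and encoding str input as UTF-8 is an assumption of this example; the gtars function is the one to use in practice.

```python
import base64
import hashlib
from typing import Union


def sha512t24u_sketch(readable: Union[str, bytes]) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (32 characters, no padding)."""
    if isinstance(readable, str):
        readable = readable.encode("utf-8")
    elif not isinstance(readable, bytes):
        raise TypeError("input must be str or bytes")
    truncated = hashlib.sha512(readable).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


# Should match the documented digest for "ACGT" shown in the example above.
print(sha512t24u_sketch("ACGT"))
```

Twenty-four bytes encode to exactly 32 base64 characters, which is why the digest never carries "=" padding.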
canonical_str
canonical_str(item)
Convert a dict into a canonical string representation
Source code in refget/utils.py
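A canonical string representation matters because two dicts with the same content must serialize to identical bytes before they are digested. The sketch below shows one common way to get that property with `json.dumps` (sorted keys, no whitespace). It is an assumption about the general approach, not a copy of the function in refget/utils.py, and `canonical_str_sketch` is a hypothetical name.

```python
import json


def canonical_str_sketch(item: dict) -> bytes:
    """Serialize with sorted keys and compact separators, encoded as UTF-8 bytes."""
    return json.dumps(
        item,
        separators=(",", ":"),
        ensure_ascii=False,
        allow_nan=False,
        sort_keys=True,
    ).encode("utf-8")


a = {"names": ["chr1", "chr2"], "lengths": [8, 4]}
b = {"lengths": [8, 4], "names": ["chr1", "chr2"]}
# Key order in the input does not affect the canonical form.
print(canonical_str_sketch(a))
print(canonical_str_sketch(a) == canonical_str_sketch(b))
```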