import refget
from refget import SequenceCollectionClient
Connecting to a service
In order to use the client, you'll need a remote service API to connect to.
Then, you can create a SequenceCollectionClient
to interact with the service from within Python.
You could use the demo instance at https://seqcolapi.databio.org/
to test things out.
Alternatively, you can start a local demo service by cloning the refget
package (https://github.com/refgenie/refget) and then running
bash deployment/demo_up.sh
This launches a temporary PostgreSQL database, loads it with 6 small demo sequences, and then runs a barebones API service on localhost. For this demo, I'll connect to the localhost service like this:
seqcol_client = SequenceCollectionClient(urls=["http://127.0.0.1:8100"])
seqcol_client
<SequenceCollectionClient> Service ID: org.databio.seqcolapi Service Name: Sequence collections API URLs: http://127.0.0.1:8100
Listing available sequence collections
Now that you have a client connected to a server, you can use this object to query the API. First, check which sequence collections are available on this server:
seqcol_client.list_collections()
{'pagination': {'page': 0, 'page_size': 100, 'total': 6}, 'results': ['XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk', 'QvT5tAQ0B8Vkxd-qFftlzEk2QyfPtgOv', 'Tpdsg75D4GKCGEHtIiDSL9Zx-DSuX5V8', 'UNGAdNDmBbQbHihecPPFxwTydTcdFKxL', 'sv7GIP1K0qcskIKF3iaBmQpaum21vH74', 'aVzHaGFlUDUNF2IEmNdzS_A8lCY0stQH']}
This gives you top-level digests for the collections.
Retrieving a sequence collection
Retrieve a collection using its digest like this:
seqcol_client.get_collection("XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk")
{'lengths': [8, 4, 4], 'names': ['chrX', 'chr1', 'chr2'], 'sequences': ['SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw', 'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj', 'SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6'], 'sorted_sequences': ['SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6', 'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj', 'SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw'], 'name_length_pairs': [{'length': 8, 'name': 'chrX'}, {'length': 4, 'name': 'chr1'}, {'length': 4, 'name': 'chr2'}]}
This gives you the level 2 representation of the sequence collection, which is the canonical, expanded representation. You can also request the more compact level 1 representation, which gives you digests for each of the attributes:
seqcol_client.get_collection("XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk", level=1)
{'lengths': 'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX', 'names': 'Fw1r9eRxfOZD98KKrhlYQNEdSRHoVxAG', 'sequences': '0uDQVLuHaOZi1u76LjV__yrVUIz9Bwhr', 'sorted_sequences': 'KgWo6TT1Lqw6vgkXU9sYtCU9xwXoDt6M', 'name_length_pairs': 'B9MESWM8k-hK_OeQK8bZNAG74pLY0Ujq', 'sorted_name_length_pairs': 'wwE4PUok50YyEF2Ne8BBA5__zk92CZH8'}
These attribute digests are useful because, just as you use a top-level digest to look up a whole collection, you can use them to look up the value of a specific attribute with the get_attribute function.
For example, here we will use the lengths digest to retrieve just the value of this attribute. You can see it matches the expanded version retrieved above:
seqcol_client.get_attribute("lengths", "cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX")
[8, 4, 4]
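To see where a digest like this comes from, here is a minimal sketch of the digest procedure as I understand it from the GA4GH sequence collections specification: serialize the value as canonical JSON (RFC 8785), hash it with SHA-512, keep the first 24 bytes, and base64url-encode them. The names `sha512t24u` and `attribute_digest` are chosen for this example and are not part of the client shown here; `json.dumps` with compact separators only approximates RFC 8785, which is an assumption that holds for simple values such as lists of integers.

```python
import base64
import hashlib
import json


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (32 characters)."""
    truncated = hashlib.sha512(data).digest()[:24]
    return base64.urlsafe_b64encode(truncated).decode("ascii")


def attribute_digest(value) -> str:
    """Digest a JSON-serializable attribute value.

    Assumption: compact json.dumps output stands in for RFC 8785
    canonical JSON, which is adequate for lists of integers or strings.
    """
    canonical = json.dumps(value, separators=(",", ":"), sort_keys=True, ensure_ascii=False)
    return sha512t24u(canonical.encode("utf-8"))


lengths_digest = attribute_digest([8, 4, 4])
print(lengths_digest)
```

Comparing the printed value with the `lengths` digest above is a quick way to check whether a given server follows the same procedure.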
We can also discover which attribute values are available on the server with the list_attributes function, which lists the digests of all stored values of a specific attribute:
seqcol_client.list_attributes("lengths", page_size=3)
{'pagination': {'page': 0, 'page_size': 3, 'total': 3}, 'results': ['cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX', 'x5qpE4FtMkvlwpKIzvHs3a02Nex5tthp', '7-_HdxYiRf-AJLBKOTaJUdxXrUkIXs6T']}
Discovering sequence collections with specific attributes
One of the useful applications of attribute digests is that we can use them to discover other sequence collections that share a specific attribute value.
For example, say we want to find all the collections hosted by this server that have the particular set of sequence lengths [8,4,4].
We can use the list_collections function again, but this time adding some new parameters to specify that we want to retrieve the collections with a specific value for the lengths attribute, like this:
seqcol_client.list_collections(page=1,
page_size=2,
attribute="lengths",
attribute_digest="cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX")
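{'pagination': {'page': 1, 'page_size': 2, 'total': 4}, 'results': ['UNGAdNDmBbQbHihecPPFxwTydTcdFKxL', 'aVzHaGFlUDUNF2IEmNdzS_A8lCY0stQH']}
This lets you identify other sequence collections on the server that share the same lengths attribute. Here, 4 collections match in total, and this page shows two of them.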
Comparing two sequence collections
One of the powerful advanced features of the sequence collections standard is the comparison function, which allows you to get detailed information about how similar two sequence collections are.
In this example, let's compare the two sequence collections returned above, which have identical lengths attributes, to see how they differ. Remember, if they had no differences, they would have the same top-level digest, so we know they differ somehow. The comparison function will give us more information.
seqcol_client.compare(
"UNGAdNDmBbQbHihecPPFxwTydTcdFKxL",
"aVzHaGFlUDUNF2IEmNdzS_A8lCY0stQH")
{'digests': {'a': 'UNGAdNDmBbQbHihecPPFxwTydTcdFKxL', 'b': 'aVzHaGFlUDUNF2IEmNdzS_A8lCY0stQH'}, 'attributes': {'a_only': [], 'b_only': [], 'a_and_b': ['lengths', 'name_length_pairs', 'names', 'sequences', 'sorted_sequences']}, 'array_elements': {'a': {'lengths': 3, 'name_length_pairs': 3, 'names': 3, 'sequences': 3, 'sorted_sequences': 3}, 'b': {'lengths': 3, 'name_length_pairs': 3, 'names': 3, 'sequences': 3, 'sorted_sequences': 3}, 'a_and_b': {'lengths': 3, 'name_length_pairs': 1, 'names': 3, 'sequences': 3, 'sorted_sequences': 3}, 'a_and_b_same_order': {'lengths': True, 'name_length_pairs': True, 'names': False, 'sequences': True, 'sorted_sequences': True}}}
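The raw comparison is dense, so it can help to boil it down to one verdict per attribute. The sketch below is one way to read it, assuming the counts under `a`, `b`, and `a_and_b` are element counts and that `a_and_b_same_order` reports whether the shared elements appear in the same order. `summarize_comparison` is a hypothetical helper, not part of the client, and the verdict wording is my own.

```python
def summarize_comparison(comparison: dict) -> dict:
    """Classify each shared attribute by how much its elements overlap."""
    elements = comparison["array_elements"]
    summary = {}
    for attr in comparison["attributes"]["a_and_b"]:
        n_a = elements["a"][attr]
        n_b = elements["b"][attr]
        n_shared = elements["a_and_b"][attr]
        same_order = elements["a_and_b_same_order"][attr]
        if n_shared == n_a == n_b:
            # Every element is shared; only the order can differ.
            summary[attr] = "identical" if same_order else "same elements, different order"
        elif n_shared == 0:
            summary[attr] = "no shared elements"
        else:
            summary[attr] = f"{n_shared} shared of {n_a} (a) and {n_b} (b)"
    return summary


# Copied from the comparison result shown above (the 'digests' entry is omitted).
comparison = {
    "attributes": {
        "a_only": [],
        "b_only": [],
        "a_and_b": ["lengths", "name_length_pairs", "names", "sequences", "sorted_sequences"],
    },
    "array_elements": {
        "a": {"lengths": 3, "name_length_pairs": 3, "names": 3, "sequences": 3, "sorted_sequences": 3},
        "b": {"lengths": 3, "name_length_pairs": 3, "names": 3, "sequences": 3, "sorted_sequences": 3},
        "a_and_b": {"lengths": 3, "name_length_pairs": 1, "names": 3, "sequences": 3, "sorted_sequences": 3},
        "a_and_b_same_order": {
            "lengths": True,
            "name_length_pairs": True,
            "names": False,
            "sequences": True,
            "sorted_sequences": True,
        },
    },
}

summary = summarize_comparison(comparison)
for attr, verdict in summary.items():
    print(f"{attr}: {verdict}")
```

Read this way, the two collections hold the same sequences and lengths in the same order; they differ in how names are assigned, which is why only one name/length pair is shared.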
Using pydantic models
One of the really cool things about the refget package is that it provides pydantic models for sequence collections and other relevant data types. We can use these objects to analyze and manage sequence collections locally. Let's walk through some of the things you can do with these objects.
The SequenceCollection object gives you convenient ways to work with a sequence collection in Python. From the dictionary representation you retrieve from an API, you can construct a pydantic object like this:
seqcol_dict = seqcol_client.get_collection("XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk")
seqcol = refget.SequenceCollection.from_dict(seqcol_dict)
seqcol
SequenceCollection(digest='XZlrcEGi6mlopZ2uD8ObHkQB1d0oDwKk', sorted_name_length_pairs_digest='wwE4PUok50YyEF2Ne8BBA5__zk92CZH8')
You can use this object to get the sequence collection in several different formats:
seqcol.level2()
{'lengths': [8, 4, 4], 'names': ['chrX', 'chr1', 'chr2'], 'sequences': ['SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw', 'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj', 'SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6'], 'sorted_sequences': ['SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6', 'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj', 'SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw'], 'name_length_pairs': [{'length': 8, 'name': 'chrX'}, {'length': 4, 'name': 'chr1'}, {'length': 4, 'name': 'chr2'}]}
seqcol.level1()
{'lengths': 'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX', 'names': 'Fw1r9eRxfOZD98KKrhlYQNEdSRHoVxAG', 'sequences': '0uDQVLuHaOZi1u76LjV__yrVUIz9Bwhr', 'sorted_sequences': 'KgWo6TT1Lqw6vgkXU9sYtCU9xwXoDt6M', 'name_length_pairs': 'B9MESWM8k-hK_OeQK8bZNAG74pLY0Ujq', 'sorted_name_length_pairs': 'wwE4PUok50YyEF2Ne8BBA5__zk92CZH8'}
seqcol.lengths.digest
'cGRMZIb3AVgkcAfNv39RN7hnT5Chk7RX'
seqcol.itemwise()
[{'name': 'chrX', 'length': 8, 'sequence': 'SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw'}, {'name': 'chr1', 'length': 4, 'sequence': 'SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj'}, {'name': 'chr2', 'length': 4, 'sequence': 'SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6'}]
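The itemwise format is the level 2 representation transposed: one record per sequence instead of one array per attribute. As a sketch of that transformation (not the package's implementation), the plain-Python equivalent zips the parallel arrays together. `to_itemwise` is a hypothetical helper, and the data is copied from the level 2 output above.

```python
def to_itemwise(collection: dict) -> list:
    """Turn parallel level 2 arrays into one record per sequence."""
    return [
        {"name": name, "length": length, "sequence": sequence}
        for name, length, sequence in zip(
            collection["names"], collection["lengths"], collection["sequences"]
        )
    ]


# Subset of the level 2 representation shown above.
level2 = {
    "lengths": [8, 4, 4],
    "names": ["chrX", "chr1", "chr2"],
    "sequences": [
        "SQ.iYtREV555dUFKg2_agSJW6suquUyPpMw",
        "SQ.YBbVX0dLKG1ieEDCiMmkrTZFt_Z5Vdaj",
        "SQ.AcLxtBuKEPk_7PGE_H4dGElwZHCujwH6",
    ],
}

items = to_itemwise(level2)
for item in items:
    print(item)
```

This only works because the arrays in a collection are aligned by position, so the i-th name, length, and sequence all describe the same sequence.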
You can access individual attributes like this:
seqcol.name_length_pairs.digest
'B9MESWM8k-hK_OeQK8bZNAG74pLY0Ujq'
seqcol.name_length_pairs.value
[{'length': 8, 'name': 'chrX'}, {'length': 4, 'name': 'chr1'}, {'length': 4, 'name': 'chr2'}]
Because this is a SQLModel object, you could also use it to create and interact with a database easily. You can find reference documentation in the models section.