Accessions and Cohorts

Accessions

Suppose you want to describe the data generated by an experiment. This can be accomplished with an Accession object. An accession is a simple structure that holds an identifier along with its associated data files and accompanying metadata.

class minus80.Accession(name, files=None, **kwargs)[source]

From Google: Definition (noun): a new item added to an existing collection of books, paintings, or artifacts.

An Accession is an item that exists in an experimental collection.

Most of the time, an accession is interchangeable with a sample. However, the term sample can become confusing when an experiment has multiple samplings from the same sample, e.g. a time course or different tissues.

The minimal amount of information needed is an identifier:

In [1]: x = m80.Accession('Sample1')

In [2]: x
Out[2]: Accession(Sample1, files=set(), {})

Files and metadata can be associated with the accession upon initialization using the files kwarg and any number of additional kwargs for metadata:

In [3]: x = m80.Accession('Sample2',
   ...:     files=['reads1.fastq','reads2.fastq'],
   ...:     gender='male',
   ...:     age=45
   ...: )
   ...: 

In [4]: x
Out[4]: Accession(Sample2, files={'reads1.fastq', 'reads2.fastq'}, {'gender': 'male', 'age': 45})

Once created, data can be accessed from the object in the usual, pythonic way:

In [5]: 'reads1.fastq' in x.files
Out[5]: True

In [6]: x['gender']
Out[6]: 'male'

In [7]: x['age']
Out[7]: 45
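
To make the structure concrete, here is a minimal stand-in written in plain Python. It is not minus80's implementation: the class name SketchAccession and its internals are assumptions, chosen only to mirror the behavior shown above (an identifier, a set of files, and keyword metadata reachable by key).

```python
class SketchAccession:
    """Hypothetical stand-in for illustration; not the minus80 class."""

    def __init__(self, name, files=None, **kwargs):
        self.name = name
        # Files are kept in a set, so duplicates collapse and order is not preserved.
        self.files = set(files) if files is not None else set()
        # Any extra keyword arguments become metadata.
        self.metadata = dict(kwargs)

    def __getitem__(self, key):
        # Enables x['gender'] style lookups on the metadata.
        return self.metadata[key]

    def __setitem__(self, key, value):
        # Enables x['age'] = 46 style updates on the metadata.
        self.metadata[key] = value

    def __repr__(self):
        return f"SketchAccession({self.name}, files={self.files!r}, {self.metadata!r})"


x = SketchAccession(
    "Sample2",
    files=["reads1.fastq", "reads2.fastq"],
    gender="male",
    age=45,
)
print("reads1.fastq" in x.files, x["gender"], x["age"])
```

Storing files in a set matches the `files=set()` shown in the output above and makes membership tests the natural way to ask whether a file belongs to an accession.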

Useful as a single accession is, most experiments consist of more than one. Accessions become powerful when they are analyzed together in an experimental context. A group of accessions is called a Cohort.

Cohorts

A Cohort represents a named set of accessions. A Cohort is a build-once, use-many data structure that is backed by a database on disk. Revisiting the -80C freezer analogy, Accessions are the units that are frozen in and unfrozen from the database, but only once they have been added to a Cohort. Think of Accessions that are added to a Cohort as coming from a master mix: they can be unfrozen and used many times across multiple analyses, and the stored version changes only when it is updated in the Cohort. This is an important concept because multiple instances of the same accession can be generated from the same Cohort (see example below).

You can interact with Accessions within a cohort in a variety of ways.

class minus80.Cohort(name)[source]

A Cohort is a named set of accessions. Once cohorts are created, they are persistent because minus80 stores them on disk.

A Cohort is initialized by name:

In [8]: c = m80.Cohort('experiment1')

In [9]: a1 = m80.Accession('acc1',pval=0.05)

In [10]: a2 = m80.Accession('acc2')

In [11]: a3 = m80.Accession('acc3')

In [12]: c.add_accessions([a1,a2,a3])
Out[12]: 
[Accession(Accession(acc1, files=set(), {'pval': 0.05}), files=set(), {'pval': '0.05', 'AID': 1}),
 Accession(Accession(acc2, files=set(), {}), files=set(), {'AID': 2}),
 Accession(Accession(acc3, files=set(), {}), files=set(), {'AID': 3})]

Note

When an Accession is added to a Cohort, it is assigned an internal integer ID (called AID). AIDs can be useful when short identifiers are needed, but they are not guaranteed to be the same across different cohorts. E.g. a1 might have an AID of 1 in c1 but an AID of 12 in c2. The AID is assigned based on the order in which the accession was added to the Cohort.
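
The order-dependence described in this note can be sketched in a few lines of plain Python. The class SketchAidCohort and its add method are hypothetical and exist only for this illustration; how minus80 actually stores and assigns AIDs is not shown in this section.

```python
class SketchAidCohort:
    """Hypothetical illustration of per-cohort ID assignment; not minus80 code."""

    def __init__(self, name):
        self.name = name
        self._aids = {}  # accession name -> AID within this cohort

    def add(self, accession_name):
        # Assign the next integer the first time a name is seen;
        # adding the same name again returns the existing ID.
        if accession_name not in self._aids:
            self._aids[accession_name] = len(self._aids) + 1
        return self._aids[accession_name]


c1 = SketchAidCohort("c1")
c2 = SketchAidCohort("c2")

for name in ["acc1", "acc2", "acc3"]:
    c1.add(name)
for name in ["acc3", "acc2", "acc1"]:  # same accessions, different order
    c2.add(name)

print(c1.add("acc1"), c2.add("acc1"))  # prints: 1 3
```

Because each cohort counts independently, an AID is only meaningful together with the cohort it came from; the accession name is the safer key when comparing across cohorts.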

Interacting with Cohorts is pythonic:

In [13]: len(c)
Out[13]: 3

In [14]: a1 in c
Out[14]: True

In [15]: for x in c:
   ....:     print(x.name)
   ....: 
acc1
acc2
acc3
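
For readers unfamiliar with how an object supports len(), in, and for loops, the sketch below shows the three special methods involved. SketchContainer is a hypothetical class that holds plain names; it is an assumption made for illustration and says nothing about how minus80.Cohort is implemented internally.

```python
class SketchContainer:
    """Hypothetical container for illustration; not the minus80 class."""

    def __init__(self, names):
        self._names = list(names)

    def __len__(self):
        # Called by len(c)
        return len(self._names)

    def __contains__(self, name):
        # Called by `name in c`
        return name in self._names

    def __iter__(self):
        # Called by `for x in c`
        return iter(self._names)


c = SketchContainer(["acc1", "acc2", "acc3"])
print(len(c), "acc1" in c, list(c))  # prints: 3 True ['acc1', 'acc2', 'acc3']
```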

Accessions are accessed from the Cohort by identifier:

In [16]: a1dup = c['acc1']

In [17]: a1dup
Out[17]: Accession(acc1, files=set(), {'pval': '0.05', 'AID': 1})

Changing the values of an accession does not affect the frozen version in the Cohort. Another instance of the same Accession will have the original value.

In [18]: instance1 = c['acc1']

In [19]: instance1['pval'] = 0.04

In [20]: instance2 = c['acc1']

In [21]: instance1['pval']
Out[21]: 0.04

In [22]: instance2['pval']
Out[22]: '0.05'
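
The freeze and unfreeze behavior can be approximated with Python's built-in sqlite3 module. This is a sketch under stated assumptions, not minus80's storage code: the table layout and the helper names freeze and thaw are hypothetical. The sketch stores metadata values as text, which would also account for the stored pval coming back as the string '0.05' in the session above, although that explanation is itself an assumption.

```python
import sqlite3

# An in-memory database stands in for the on-disk one (assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (name TEXT, key TEXT, val TEXT)")


def freeze(name, **metadata):
    """Store each metadata value as text under the accession name."""
    conn.executemany(
        "INSERT INTO metadata VALUES (?, ?, ?)",
        [(name, key, str(val)) for key, val in metadata.items()],
    )


def thaw(name):
    """Build a brand-new dict from the stored rows on every call."""
    rows = conn.execute(
        "SELECT key, val FROM metadata WHERE name = ?", (name,)
    )
    return dict(rows.fetchall())


freeze("acc1", pval=0.05)

instance1 = thaw("acc1")
instance1["pval"] = 0.04   # edits only this in-memory copy
instance2 = thaw("acc1")   # a fresh copy read from the database

print(repr(instance1["pval"]), repr(instance2["pval"]))  # prints: 0.04 '0.05'
```

Since every lookup reads from the database, in-memory edits never leak between instances; a change becomes visible to later lookups only if it is written back to the stored copy.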

Some other convenience functions are useful for randomized analyses:

In [23]: random_accession = c.random_accession()

In [24]: random_accession
Out[24]: Accession(acc2, files=set(), {'AID': 2})

In [25]: random_accessions = list(c.random_accessions(n=2))

In [26]: random_accessions
Out[26]: 
[Accession(acc3, files=set(), {'AID': 3}),
 Accession(acc2, files=set(), {'AID': 2})]
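
Randomized analyses are easier to repeat when the random draw is seeded. The sketch below uses only Python's standard random module on a list of accession names; it does not call minus80, and whether random_accessions accepts a seed or samples without replacement is not stated in this section.

```python
import random

names = ["acc1", "acc2", "acc3"]

# A dedicated generator with a fixed seed makes the draw repeatable.
rng = random.Random(42)
subset = rng.sample(names, 2)  # two distinct names, drawn without replacement

# Re-creating the generator with the same seed reproduces the same draw.
rng_again = random.Random(42)
repeat = rng_again.sample(names, 2)

print(subset == repeat)  # prints: True
```

The selected names could then be used to look up the corresponding accessions by identifier, as shown earlier with c['acc1'].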