Sequences to Counts:Making Microbiome Data

class: center, middle, inverse, title-slide

# Sequences to Counts:</br>Making Microbiome Data
## 📚EPID 674📚
### Brendan J. Kelly, MD, MS
### Updated: 26 May 2020

---

background-image: url(data:image/png;base64,#svg/dna.svg)
background-size: 500px
background-position: 99% 50%
class: middle, inverse

.pad-left[

### Recap: data embed wet and dry lab choices

### 16S ribosomal RNA (amplicon) sequencing

### Metagenomic sequencing

]

---
background-image: url(data:image/png;base64,#svg/dna.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# Measurement process

---

# Wet Lab Decisions

.pad-left[
- Specimen acquisition:

- how are samples obtained?
    
    - how are samples processed? (e.g., cell-associated bacteria from BAL; difficult-to-lyse Mycobacteria or spores)

- Sequencing library preparation:

- __amplicon sequencing__
    
    - __metagenomic sequencing__

]

---

# Dry Lab Decisions

.pad-left[

- Sequence modality: short reads versus long reads?

- Barcodes and de-multiplexing: Golay codes -- how much error is too much error?

- Paired-end joining: how much overlap and how much error?

- Sequence clustering:

- __amplicon: operational taxonomic units or sequence variants__
    
    - __metagenomic: assignment vs assembly__

]

---
background-image: url(data:image/png;base64,#svg/dna.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# 16S rRNA (amplicon) sequencing

---

# What is amplicon/tag sequencing?

.pad-left[

- Amplicon/tag sequence strategy:

- identify species by marker genes
    
    - 16S rRNA gene for bacteria and archaea
    
    - 18S/ITS for fungi

- DNA “library” produced by PCR using targeted primers

- In contrast, whole-genome “shotgun” metagenomic sequencing targets all DNA present (“random” shearing and amplification to produce DNA library)

]

---

# 16S Ribosomal RNA (rRNA) Gene

.pad-left[

- “Universal” identifier for bacteria:

- Present in every bacterial species (copy number varies)
    
    - Highly conserved (functional RNA in ribosome; not translated)
    
    - Approximately 1500bp long
    
    - Primers land on conserved regions, amplify over variable region or regions

- Sequencing over variable regions resolves source bacterium

]

---

# 16S Ribosomal RNA (rRNA) Gene

.pad-left[

- __Reading:__ Yarza P, Yilmaz P, Pruesse E, Glöckner FO, Ludwig W, Schleifer K-H, Whitman WB, Euzéby J, Amann R, Rosselló-Móra R. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat Rev Microbiol. 2014 Sep;12(9):635–645. Available from: http://dx.doi.org/10.1038/nrmicro3330 PMID: 25118885

- __(optional) Reading:__ David Quammen _The Tangled Tree of Life_ 2019

- __Reflection:__ How does 16S ribosomal RNA (rRNA) gene sequencing compare to the microscopy, selective stains, and biochemical assays used to characterize bacteria and archaea?

]

---
background-image: url(data:image/png;base64,#img/variable_regions_16S.png)
background-size: contain

.footnote[Ashelford _Appl Env Micro_ 2005]

---
background-image: url(data:image/png;base64,#img/map_variable_regions_16S.png)
background-size: contain

.footnote[Yarza P _Nat Rev Micro_ 2014]

---

# How to bin amplicon gene sequences?

.pad-left[

- Operational taxonomic units = __OTUs__

- de-novo OTUs
    
    - reference-based OTUs

- Amplicon sequence variants = __ASVs__

- incorporate sequence error-correction model
    
    - error model choices: DADA2, deblur, etc

]

---

# Operational Taxonomic Units (OTUs)

.pad-left[

- OTUs: clusters of DNA sequences with similarity approximately equal to similarity across a defined species

- Coined by Smith & Sokol _Principles of Numerical Taxonomy_ (1963): avoid definition that refers to established taxa

- 97% similarity threshold commonly applied to 16S:

- Stackebrandt & Goebel _Int J Syst Evol Micro_ 1994: "homology values below about 97.5%... unlikely that two organisms have more than 60 to 70% DNA similarity and hence that they are related at the species level"  
    
    
]

---

# De Novo vs Reference-Based OTUs

.pad-left[

- De novo OTUs: cluster by less than a fixed sequence dissimilarity threshold (97%):

- dataset-dependence: boundaries and membership depend on dataset in which sequences are defined; even with infinite sequencing depth and 0 errors, OTUs depend on relative abundance across samples
    
    - cannot compare DN OTUs defined in two different datasets (Schloss & Westcott _AEM_ 2011; Westcott & Schloss _PeerJ_ 2015; Kopylova _mSystems_ 2016)

- with large number of sequences, clustering de novo OTUs may be prohibitively slow

]

---

# Methods of Forming OTUs

.pad-left[

- DOTUR (2005): multiple alignment with all reads, distance matrix, cluster reads into OTUs based on distance matrix

- CD-Hit (2006): sort reads by length, read by read -- if similar to existing cluster, place there; otherwise create new cluster

- Mothur (2009): replaces DOTUR

- Uclust (2010): like CD-Hit, adopted by the QIIME pipeline

- M-pick (2013): variable cluster size

- Swarm (2014): single-linkage clustering then split per mixture model

]

---

# De Novo vs Reference-Based OTUs

.pad-left[

- Reference-based OTUs: reads sufficiently similar to a sequence in a reference database are recruited into the corresponding OTU (bins are taxonomic):

- OTUs are “properties of a reference database”; reference sequences in the database define the labels (bins)
    
    - valid comparison across data sets is possible, but only if the same reference database is used
    
    - biological variation not in the reference database is lost

- sequences that do not match reference are often discarded

]

---
# Which OTUs are valid?

.pad-left[

- De novo OTUs? De novo OTUs from rarefied reads?

- Reference-based OTUs?

- Which clustering algorithm? Which reference database?

- OTU clusters are defined across the entire study…

- ... __what happens when you re-analyze with new companion specimens?__

]

---

# Amplicon Sequence Variants (ASVs)

.pad-left[

- Do novo process to discriminate biological sequences from sequence reading errors (typically on basis of number of repeated observations of distinct sequences, i.e. sequence abundance):

- cannot be performed independently on each read
    
    - smallest unit of ASV formation: a single sample
    
    - nevertheless, consistent because ASVs "represent a biological reality that exists outside of the data being analyzed”
    
- ASVs from different samples can be validly compared!

]

---

# Amplicon Sequence Variants (ASVs)

.pad-left[

- Progression of development for ASV (error-correction) approach:

- Eren et al _Methods Ecol Evol_ 2013: “supervised” oligotyping -- after alignment, concatenation of nucleotides from information-rich, variable positions in sequencing reads defines an oligotype
    
    - Eren et al _ISME Journal_ 2015:  "unsupervised” oligotyping -- minimum entropy decompensition to partition marker gene datasets iteratively
    
    - Tikhonov et al _ISME Journal_ 2015: “clustering reads into OTUs underexploits quality of modern seq data”
    
]

---

# Amplicon Sequence Variants (ASVs)

.pad-left[

- Commonly used implementations:

- __DADA2__ (Callahan et al _Nature Methods_ 2016): "disentangling biological variation from sequencing errors" -- Poisson error model quantifies rate at which an amplicon read is produced from sequence as a function of sequence composition and quality

- __deblur__ (Amir A et al _mSystems_ 2017): uses Illumina MiSeq/HiSea error profiles to obtain putative error-free sequences

]

---

# Amplicon Sequence Variants (ASVs)

.pad-left[

- Independent inference by sample

- Consistent labels (sequences themselves)

- Computational tractability

- Robust to combining data sets (facilitate meta-analysis and replication)

- Independent from reference data (like de novo OTUs; facilitate studies of new environments)

]

---
background-image: url(data:image/png;base64,#svg/dna.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# Metagenomic sequencing

---

# Metagenomic Sequence Generation

.pad-left[

- Whole-genome “shotgun” metagenomic sequencing:

- sequence all DNA present (not restricted to bacteria/archaea)
    
    - “random” shearing and amplification to produce DNA library

- contrast to amplicon sequencing, which targets a "tag" gene

- Metagenomic sequence reads can be analyzed themselves, or can be joined to form contigs prior to further analysis

]

---

# Metagenomic Sequence Processing

.pad-left[

- Gene-level assignment of raw reads & taxonomic transformation

- how to convert gene assignments into taxonomic assignments?
    
    - different models: Metaphlan2, Kraken, etc

- "Contig" (contiguous sequence) assembly prior to assignment

- how best to assemble contigs?
    
    - if reliable contigs, taxonomic assignments may be more precise

]

---

# Metagenomic Sequence Processing

.pad-left[

- __Reading:__ Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014 Mar 3;15(3):R46. Available from: http://dx.doi.org/10.1186/gb-2014-15-3-r46 PMCID: PMC4053813

- __Reading:__ Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012 Jun 10;9(8):811–814. Available from: http://dx.doi.org/10.1038/nmeth.2066 PMCID: PMC3443552

- __Reading:__ Ghurye JS, Cepeda-Espinoza V, Pop M. Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep;89(3):353–362. Available from: https://www.ncbi.nlm.nih.gov/pubmed/27698619 PMCID: PMC5045144

]

---

# Metagenomic Sequence Processing

.pad-left[

- __Reflection:__ What are the advantages of read assignment? What are the advantages of contig assembly?

- __Reflection:__ How are the accuracy of metagenomic assignments evaluated?

- __Reflection:__ What are the advantages and disadvantages of metagenomic sequencing, versus amplicon sequencing, for your own research?

]

---
class: center, middle, inverse
background-image: url(data:image/png;base64,#svg/conjugation.svg)
background-size: 500px
background-position: 50% 50%

# Questions?
### Post to the discussion board!

---
background-image: url(data:image/png;base64,#svg/bacteria.svg)
background-size: 100px
background-position: 98% 90%
class: center, middle

# Thank you!
#### Slides available: [github.com/bjklab](https://github.com/bjklab/EPID674_002_sequences-to-counts.git)
#### [brendank@pennmedicine.upenn.edu](brendank@pennmedicine.upenn.edu)