class: center, middle, inverse, title-slide

# Sequences to Counts: Making Microbiome Data
## 📚EPID 674📚
### Brendan J. Kelly, MD, MS
### Updated: 26 May 2020

---
background-image: url(data:image/png;base64,#svg/dna.svg)
background-size: 500px
background-position: 99% 50%
class: middle, inverse

.pad-left[

### Recap: data embed wet and dry lab choices

### 16S ribosomal RNA (amplicon) sequencing

### Metagenomic sequencing

]

---
background-image: url(data:image/png;base64,#svg/dna.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# Measurement process

---

# Wet Lab Decisions

.pad-left[

- Specimen acquisition:
    - how are samples obtained?
    - how are samples processed? (e.g., cell-associated bacteria from BAL; difficult-to-lyse Mycobacteria or spores)
- Sequencing library preparation:
    - __amplicon sequencing__
    - __metagenomic sequencing__

]

---

# Dry Lab Decisions

.pad-left[

- Sequence modality: short reads versus long reads?
- Barcodes and de-multiplexing: Golay codes -- how much error is too much error?
- Paired-end joining: how much overlap and how much error?
- Sequence clustering:
    - __amplicon: operational taxonomic units or sequence variants__
    - __metagenomic: assignment vs assembly__

]

---
background-image: url(data:image/png;base64,#svg/dna.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# 16S rRNA (amplicon) sequencing

---

# What is amplicon/tag sequencing?

.pad-left[

- Amplicon/tag sequencing strategy:
    - identify species by marker genes
    - 16S rRNA gene for bacteria and archaea
    - 18S/ITS for fungi
    - DNA “library” produced by PCR using targeted primers
- In contrast, whole-genome “shotgun” metagenomic sequencing targets all DNA present (“random” shearing and amplification to produce DNA library)

]

---

# 16S Ribosomal RNA (rRNA) Gene

.pad-left[

- “Universal” identifier for bacteria:
    - Present in every bacterial species (copy number varies)
    - Highly conserved (functional RNA in ribosome; not translated)
    - Approximately 1500 bp long
- Primers land on conserved regions, amplify over variable region or regions
- Sequencing over variable regions resolves source bacterium

]

---

# 16S Ribosomal RNA (rRNA) Gene

.pad-left[

- __Reading:__ Yarza P, Yilmaz P, Pruesse E, Glöckner FO, Ludwig W, Schleifer K-H, Whitman WB, Euzéby J, Amann R, Rosselló-Móra R. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat Rev Microbiol. 2014 Sep;12(9):635–645. Available from: http://dx.doi.org/10.1038/nrmicro3330 PMID: 25118885
- __(optional) Reading:__ David Quammen _The Tangled Tree: A Radical New History of Life_ 2019
- __Reflection:__ How does 16S ribosomal RNA (rRNA) gene sequencing compare to the microscopy, selective stains, and biochemical assays used to characterize bacteria and archaea?

]

---
background-image: url(data:image/png;base64,#img/variable_regions_16S.png)
background-size: contain

.footnote[Ashelford _Appl Env Micro_ 2005]

---
background-image: url(data:image/png;base64,#img/map_variable_regions_16S.png)
background-size: contain

.footnote[Yarza P _Nat Rev Micro_ 2014]

---

# How to bin amplicon gene sequences?

.pad-left[

- Operational taxonomic units = __OTUs__
    - de novo OTUs
    - reference-based OTUs
- Amplicon sequence variants = __ASVs__
    - incorporate sequence error-correction model
    - error model choices: DADA2, deblur, etc.

]

---

# Operational Taxonomic Units (OTUs)

.pad-left[

- OTUs: clusters of DNA sequences with similarity approximately equal to similarity across a defined species
- Coined by Sokal & Sneath _Principles of Numerical Taxonomy_ (1963): avoid definition that refers to established taxa
- 97% similarity threshold commonly applied to 16S:
    - Stackebrandt & Goebel _Int J Syst Bacteriol_ 1994: "homology values below about 97.5%... unlikely that two organisms have more than 60 to 70% DNA similarity and hence that they are related at the species level"

]

---

# De Novo vs Reference-Based OTUs

.pad-left[

- De novo OTUs: cluster reads that differ by less than a fixed sequence dissimilarity threshold (e.g., 3%, i.e., at least 97% similarity):
    - dataset-dependence: boundaries and membership depend on dataset in which sequences are defined; even with infinite sequencing depth and 0 errors, OTUs depend on relative abundance across samples
    - cannot compare de novo OTUs defined in two different datasets (Schloss & Westcott _AEM_ 2011; Westcott & Schloss _PeerJ_ 2015; Kopylova _mSystems_ 2016)
    - with large number of sequences, clustering de novo OTUs may be prohibitively slow

]

---

# Methods of Forming OTUs

.pad-left[

- DOTUR (2005): multiple alignment with all reads, distance matrix, cluster reads into OTUs based on distance matrix
- CD-HIT (2006): sort reads by length, read by read -- if similar to existing cluster, place there; otherwise create new cluster
- Mothur (2009): replaces DOTUR
- UCLUST (2010): like CD-HIT, adopted by the QIIME pipeline
- M-pick (2013): variable cluster size
- Swarm (2014): single-linkage clustering then split per mixture model

]

---

# De Novo vs Reference-Based OTUs

.pad-left[

- Reference-based OTUs: reads sufficiently similar to a sequence in a reference database are recruited into the corresponding
OTU (bins are taxonomic):
    - OTUs are “properties of a reference database”; reference sequences in the database define the labels (bins)
    - valid comparison across data sets is possible, but only if the same reference database is used
    - biological variation not in the reference database is lost
    - sequences that do not match reference are often discarded

]

---

# Which OTUs are valid?

.pad-left[

- De novo OTUs? De novo OTUs from rarefied reads?
- Reference-based OTUs?
- Which clustering algorithm? Which reference database?
- OTU clusters are defined across the entire study…
- ... __what happens when you re-analyze with new companion specimens?__

]

---

# Amplicon Sequence Variants (ASVs)

.pad-left[

- De novo process to discriminate biological sequences from sequence reading errors (typically on the basis of the number of repeated observations of distinct sequences, i.e. sequence abundance):
    - cannot be performed independently on each read
    - smallest unit of ASV formation: a single sample
    - nevertheless, consistent because ASVs “represent a biological reality that exists outside of the data being analyzed”
- ASVs from different samples can be validly compared!

]

---

# Amplicon Sequence Variants (ASVs)

.pad-left[

- Progression of development for ASV (error-correction) approach:
    - Eren et al _Methods Ecol Evol_ 2013: “supervised” oligotyping -- after alignment, concatenation of nucleotides from information-rich, variable positions in sequencing reads defines an oligotype
    - Eren et al _ISME Journal_ 2015: “unsupervised” oligotyping -- minimum entropy decomposition to partition marker gene datasets iteratively
    - Tikhonov et al _ISME Journal_ 2015: “clustering reads into OTUs underexploits quality of modern seq data”

]

---

# Amplicon Sequence Variants (ASVs)

.pad-left[

- Commonly used implementations:
    - __DADA2__ (Callahan et al _Nature Methods_ 2016): "disentangling biological variation from sequencing errors" -- Poisson error model quantifies rate at which an amplicon read is produced from sequence as a function of sequence composition and quality
    - __deblur__ (Amir A et al _mSystems_ 2017): uses Illumina MiSeq/HiSeq error profiles to obtain putative error-free sequences

]

---

# Amplicon Sequence Variants (ASVs)

.pad-left[

- Independent inference by sample
- Consistent labels (sequences themselves)
- Computational tractability
- Robust to combining data sets (facilitate meta-analysis and replication)
- Independent from reference data (like de novo OTUs; facilitate studies of new environments)

]

---
background-image: url(data:image/png;base64,#svg/dna.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# Metagenomic sequencing

---

# Metagenomic Sequence Generation

.pad-left[

- Whole-genome “shotgun” metagenomic sequencing:
    - sequence all DNA present (not restricted to bacteria/archaea)
    - “random” shearing and amplification to produce DNA library
    - contrast to amplicon sequencing, which targets a "tag" gene
- Metagenomic sequence reads can be analyzed themselves, or can be joined to form contigs prior to further analysis

]

---

# Metagenomic Sequence Processing

.pad-left[

- Gene-level
assignment of raw reads & taxonomic transformation
    - how to convert gene assignments into taxonomic assignments?
    - different models: MetaPhlAn2, Kraken, etc.
- "Contig" (contiguous sequence) assembly prior to assignment
    - how best to assemble contigs?
    - if reliable contigs, taxonomic assignments may be more precise

]

---

# Metagenomic Sequence Processing

.pad-left[

- __Reading:__ Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014 Mar 3;15(3):R46. Available from: http://dx.doi.org/10.1186/gb-2014-15-3-r46 PMCID: PMC4053813
- __Reading:__ Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat Methods. 2012 Jun 10;9(8):811–814. Available from: http://dx.doi.org/10.1038/nmeth.2066 PMCID: PMC3443552
- __Reading:__ Ghurye JS, Cepeda-Espinoza V, Pop M. Metagenomic Assembly: Overview, Challenges and Applications. Yale J Biol Med. 2016 Sep;89(3):353–362. Available from: https://www.ncbi.nlm.nih.gov/pubmed/27698619 PMCID: PMC5045144

]

---

# Metagenomic Sequence Processing

.pad-left[

- __Reflection:__ What are the advantages of read assignment? What are the advantages of contig assembly?
- __Reflection:__ How is the accuracy of metagenomic assignments evaluated?
- __Reflection:__ What are the advantages and disadvantages of metagenomic sequencing, versus amplicon sequencing, for your own research?

]

---
class: center, middle, inverse
background-image: url(data:image/png;base64,#svg/conjugation.svg)
background-size: 500px
background-position: 50% 50%

# Questions?

### Post to the discussion board!

---
background-image: url(data:image/png;base64,#svg/bacteria.svg)
background-size: 100px
background-position: 98% 90%
class: center, middle

# Thank you!

#### Slides available: [github.com/bjklab](https://github.com/bjklab/EPID674_002_sequences-to-counts.git)

#### [brendank@pennmedicine.upenn.edu](brendank@pennmedicine.upenn.edu)
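
---

# Sketch: Dataset-Dependence of De Novo OTUs

The greedy, read-by-read clustering described under Methods of Forming OTUs can be sketched in a few lines of Python. This is a toy illustration, not the CD-HIT or UCLUST implementation: it assumes equal-length reads and scores identity as the fraction of matching positions, whereas real tools rely on alignment or k-mer heuristics. The names `identity`, `mutate`, and `greedy_otus`, and the reads `A`, `B`, and `C`, are hypothetical and chosen only for this example.

```python
# Toy substitution table used only to build example reads (an assumption).
SWAP = {"A": "C", "C": "G", "G": "T", "T": "A"}


def mutate(seq, positions):
    """Return seq with the base at each listed position substituted."""
    bases = list(seq)
    for p in positions:
        bases[p] = SWAP[bases[p]]
    return "".join(bases)


def identity(a, b):
    """Fraction of matching positions between two equal-length reads."""
    if len(a) != len(b):
        raise ValueError("toy identity requires equal-length reads")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.97):
    """Greedy centroid clustering: a read joins the first centroid it matches."""
    centroids = []  # one representative read per OTU
    members = []    # members[i] holds the reads assigned to centroids[i]
    # Longest reads first; Python's sort is stable, so ties keep input order.
    for read in sorted(reads, key=len, reverse=True):
        for i, centroid in enumerate(centroids):
            if identity(read, centroid) >= threshold:
                members[i].append(read)
                break
        else:
            # No existing centroid was similar enough: start a new OTU.
            centroids.append(read)
            members.append([read])
    return centroids, members


A = "ACGT" * 25                # 100-base toy read
B = mutate(A, [0, 1])          # 98% identical to A
C = mutate(A, [0, 1, 2, 3])    # 96% identical to A, 98% identical to B

with_a = greedy_otus([A, B, C])   # two OTUs: {A, B} and {C}
without_a = greedy_otus([B, C])   # one OTU: {B, C}
print(len(with_a[0]), len(without_a[0]))
```

With all three reads present, B is recruited into the OTU seeded by A and C forms its own OTU. When A is absent, B and C fall into a single OTU. The same two reads are therefore split or merged depending on what else is in the dataset, which is the dataset-dependence noted for de novo OTUs.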