class: center, middle, inverse, title-slide

# Measuring the Microbiome: Methods & Tools
## 📚EPID 674📚
### Brendan J. Kelly, MD, MS
### Kyle Bittinger, PhD
### Updated: 25 May 2020

---
background-image: url(data:image/png;base64,#svg/bacteria.svg)
background-size: 500px
background-position: 50% 50%
class: center, middle, inverse

# Welcome!

---

# Where am I? How did I get here?

.pull-left[

<span style="font-size: 1.25rem; margin-left: 10px"> __Epidemiology 674__ </span>

<span style="font-size: 1.25rem; margin-left: 10px"> __Measuring the Microbiome:__ </span>

<span style="font-size: 1.25rem; margin-left: 10px"> __Methods & Tools__ </span>

]

.pull-right[

You study epidemiology...

.pad-left[

or microbiology...

or informatics...

or statistics*

]

]

.footnote[
*Great opportunity to learn from each other, but this will be challenging online.
]

---

# Course Goals

.pad-left[

- How do we measure the microbiome?
- How are microbiome data structured?
- What are the pitfalls and dangers of microbiome data analysis?
- What analytic methods and model frameworks help address those challenges?
- __Reproducibility and rigor in microbiome research (e.g., reproducible reports)__

]

---
background-image: url(data:image/png;base64,#svg/petri_mixed.svg)
background-size: 500px
background-position: 50% 50%
class: center, middle, inverse

# How to measure the microbiome?
---

# "The Microbiome"

.pad-left[

- Loose definition: “the microorganisms in a particular environment” (synonymous with “microbiota”)
- Strict definition: “the combined genetic material of the microorganisms in a particular environment”
    - emphasis on nucleic acid sequencing
    - “omics” often defined by the tools on which they depend

]

---

# Biology/Ecology + Technology

.pad-left[

- Biology/Ecology:
    - previously undetected microbial species & interspecific interactions
    - clinical relevance of these species and interactions
- Technology:
    - [depends on high-throughput ("next generation") sequencing](https://youtu.be/fCd6B5HRaZ8)
    - __NOT__ “unbiased” surveys of microbial communities
    - sequencing error, wet and dry lab decisions...

]

---

# Wet Lab Decisions

.pad-left[

- Specimen acquisition:
    - how are samples obtained?
    - how are samples processed? (e.g., cell-associated bacteria from BAL; difficult-to-lyse Mycobacteria or spores)
- Sequencing library preparation:
    - amplicon sequencing?
    - metagenomic sequencing?

]

---

# Dry Lab Decisions

.pad-left[

- Sequence modality: short reads versus long reads?
- Barcodes and de-multiplexing: Golay codes -- how much error is too much error?
- Paired-end joining: how much overlap and how much error?
- Sequence clustering:
    - amplicon: operational taxonomic units or sequence variants?
    - metagenomic: assignment vs assembly?

]

---
background-image: url(data:image/png;base64,#svg/spreadsheet_cell.svg)
background-size: 500px
background-position: 50% 50%
class: center, middle, inverse

# How are microbiome data structured?
---

# More Data, More Problems

.pad-left[

- Measurement methods and tools differ
- Data embed wet and dry lab choices
- __BUT__ data structures are similar and raise common concerns:
    - sparse count data
    - misclassification risks
    - high-dimensional data (p >> n)

]

---

# Microbiome Data Structure

.pad-left[

- Sparse count data generated by sequence binning/assignment
- Typically 1e2 to 1e4 microbiome variables (more variables than observations)
- Cross-sectional or time-series measures
- Always think on two levels:
    - What's there? __generative model for observed microbiota / genes__
    - How do we know? __measurement model given constraints of our tools__

]

---
background-image: url(data:image/png;base64,#svg/danger-svgrepo-com.svg)
background-size: 500px
background-position: 50% 35%
class: center, middle, inverse

# Pitfalls and dangers...

---

# How do microbiome studies go wrong?

.pad-left[

- "Big data" (p >> n) problems:
    - BSTA 785: Statistical Methods for Genomic Data Analysis
    - BSTA 789: Big Data
    - we will briefly survey commonly used techniques (dimension reduction, supervised and unsupervised clustering strategies) to bolster study design and planning
- Microbiome-specific problems...

]

---

# How do microbiome studies go wrong?

.pad-left[

- What is adequate sampling to estimate composition? ("shallow shotgun")
- How are microbes shared across study participants? ("cage effects")
- How to link sequence data to extant knowledge? (binning ITS data)
- Absolute abundance versus relative abundance? (Gloor et al.)
- How to discriminate sub-species community members? ("strain-level")

]

---
background-image: url(data:image/png;base64,#svg/stacked_barchart.svg)
background-size: 500px
background-position: 50% 50%
class: center, middle, inverse

# ... analysis methods and models

---

# Microbiome Analysis Methods

.pad-left[

- Summary statistics (alpha diversity)
- Distance metrics (beta diversity)
- Mixture models (Dirichlet multinomial mixtures)
- Non-parametric tests (PERMANOVA/adonis)
- Data transformations (centered log-ratio transform)
- __No best method to address all questions!__

]

---
background-image: url(data:image/png;base64,#svg/computer_talk.svg)
background-size: 500px
background-position: 50% 50%
class: center, middle, inverse

# How are we going to do this online?

---

# EPID 674 Approach

.pad-left[

- To understand these decisions, you need to make them… and then make them again another way...
- You can’t learn microbiome data analysis without looking at microbiome data
- You can’t learn R without looking at R code
- Microbiome data demand reproducible research strategies

]

---

# EPID 674 2020 (COVID-19 edition)

.pad-left[

- Class schedule (adjusted for online learning)
- Assignments and evaluations
- Outline of topics covered
- [rstudio.cloud](https://rstudio.cloud) (R & RStudio) to __learn by doing!__
- Recommended resources

]

---

# EPID 674: Class Schedule

.pad-left[

- Classes Tuesdays & Thursdays, May 26 - June 30, 2020
- Each class session:
    1. lecture chunks (prerecorded for your convenience)
    2. readings & Canvas discussion
    3. R coding primers
    4. live discussion 3-4pm on BlueJeans: [https://bluejeans.com/9715046666](https://bluejeans.com/9715046666)
- Complete lectures, readings, & primers __before__ live sessions -- bring questions!

]

---

# EPID 674: Assignments

.pad-left[

- Assignments:
    1. readings & Canvas discussion boards
    2. R code exercises
    3. in-class presentations
    4. final presentation (reproducible analysis report with R Markdown)

]

---

# EPID 674: Evaluation

.pad-left[

- Grades:
    - 40% participation (Canvas & BlueJeans discussion, in-class coding exercises, in-class presentations)
    - 30% homework (independent R code exercises)
    - 30% final presentation (reproducible analysis report with R Markdown)

]

---

# EPID 674: Outline

.pad-left[

- Week 1: sequencing, sequence clusters & microbiome data
- Week 2: taxonomy, phylogeny, binning & (mis)classification
- Week 3: "diversity" as summary statistic or distance
- Week 4: compositional data analysis & mixture models
- Week 5: dimension reduction methods, ordination
- Week 6: planning microbiome studies

]

---

# EPID 674: Resources

.pad-left[

- [rstudio.cloud](https://rstudio.cloud)
    - supported method for assignments & in-class exercises
    - if you want to install R & RStudio locally, please do so!
      (R: download & install from [CRAN](https://cran.r-project.org/))
      (RStudio: download & install from [rstudio.com](https://www.rstudio.com))

]

---

# EPID 674: Resources

.pad-left[

- R programming & “tidyverse”:
    - Wickham’s _R for Data Science_ ([https://r4ds.had.co.nz/](https://r4ds.had.co.nz/))
- Analysis:
    - Legendre & Legendre’s _Numerical Ecology_
    - James, Witten, Hastie & Tibshirani’s _An Introduction to Statistical Learning with Applications in R_ ([http://faculty.marshall.usc.edu/gareth-james/ISL/](http://faculty.marshall.usc.edu/gareth-james/ISL/))

]

---
class: center, middle, inverse
background-image: url(data:image/png;base64,#svg/conjugation.svg)
background-size: 500px
background-position: 50% 50%

# Questions?
### Post to the discussion board!

---
background-image: url(data:image/png;base64,#svg/bacteria.svg)
background-size: 100px
background-position: 98% 90%
class: center, middle

# Thank you!
#### Slides available: [github.com/bjklab](https://www.github.com/bjklab/microbiome-measures_01_data-and-danger)
#### [brendank@pennmedicine.upenn.edu](mailto:brendank@pennmedicine.upenn.edu)