class: center, middle, inverse, title-slide # Alpha Diversity:Summarizing Microbial Communities ## 📚EPID 674📚 ### Brendan J. Kelly, MD, MS ### Updated: 09 June 2020 --- background-image: url(data:image/png;base64,#svg/candyjar.svg) background-size: 500px background-position: 99% 50% class: middle, inverse .pad-left[ ### R/tidyverse recap ### Summarizing OTU tables ### Alpha diversity ### Rarefaction vs rarefying ### R's vegan package ] --- background-image: url(data:image/png;base64,#svg/candyjar.svg) background-size: 500px background-position: 99% 50% class: center, middle, inverse # R/tidyverse recap --- # R: Making New Variables .pad-left[ - `tidyverse` to create new variables - Make new variables & keep old variables: - `mutate()` function - must return same length as input, or length 1 (wll be recycled) - Make a new variable & lose old variables: - `summarise()` function (often follows `group_by()`) - simplest use case: return length of 1 ] --- # R: Making New Variables .pull-left[ ```r # install tidyverse once install.packages('tidyverse') # load tidyverse functions each time you use library(tidyverse) ``` ] .pull-right[ - Code at left installs and loads the `tidyverse` package - The `tidyverse` package includes a set of other packages that permit streamlined data processing - See Hadley Wickham's _R For Data Science_: [https://r4ds.had.co.nz/](https://r4ds.had.co.nz/) ] --- # `mutate()` .pull-left[ ```r # make sure tidyverse loaded library(tidyverse) # load (trimmed) HMP V1-V3 OTU table otu <- read_csv( file = "./data/HMP_OTU_table_longformat.csv.gz", ) otu # show what you've read ``` ] .pull-right[ ``` ## # A tibble: 1,380,480 x 4 ## HMPbodysubsite specimen_id otu_id read_count ## <chr> <dbl> <chr> <dbl> ## 1 Anterior_nares 700014445 OTU_97.1 0 ## 2 Anterior_nares 700014445 OTU_97.10 0 ## 3 Anterior_nares 700014445 OTU_97.100 0 ## 4 Anterior_nares 700014445 OTU_97.1000 0 ## 5 Anterior_nares 700014445 OTU_97.10000 0 ## 6 Anterior_nares 700014445 OTU_97.10001 0 ## 7 Anterior_nares 700014445 OTU_97.10002 0 ## 8 Anterior_nares 700014445 OTU_97.10003 0 ## 9 Anterior_nares 700014445 OTU_97.10004 0 ## 10 Anterior_nares 700014445 OTU_97.10005 0 ## # … with 1,380,470 more rows ``` ] --- # `mutate()` .pull-left[ ```r otu %>% mutate( HMPbodysubsite = gsub("_"," ",HMPbodysubsite) ) ``` ] .pull-right[ ``` ## # A tibble: 1,380,480 x 4 ## HMPbodysubsite specimen_id otu_id read_count ## <chr> <dbl> <chr> <dbl> ## 1 Anterior nares 700014445 OTU_97.1 0 ## 2 Anterior nares 700014445 OTU_97.10 0 ## 3 Anterior nares 700014445 OTU_97.100 0 ## 4 Anterior nares 700014445 OTU_97.1000 0 ## 5 Anterior nares 700014445 OTU_97.10000 0 ## 6 Anterior nares 700014445 OTU_97.10001 0 ## 7 Anterior nares 700014445 OTU_97.10002 0 ## 8 Anterior nares 700014445 OTU_97.10003 0 ## 9 Anterior nares 700014445 OTU_97.10004 0 ## 10 Anterior nares 700014445 OTU_97.10005 0 ## # … with 1,380,470 more rows ``` ] --- # `mutate()` .pull-left[ ```r otu %>% mutate( HMPbodysubsite = gsub("_"," ",HMPbodysubsite), log_reads = log10(read_count) ) ``` ] .pull-right[ ``` ## # A tibble: 1,380,480 x 5 ## HMPbodysubsite specimen_id otu_id read_count log_reads ## <chr> <dbl> <chr> <dbl> <dbl> ## 1 Anterior nares 700014445 OTU_97.1 0 -Inf ## 2 Anterior nares 700014445 OTU_97.10 0 -Inf ## 3 Anterior nares 700014445 OTU_97.100 0 -Inf ## 4 Anterior nares 700014445 OTU_97.1000 0 -Inf ## 5 Anterior nares 700014445 OTU_97.10000 0 -Inf ## 6 Anterior nares 700014445 OTU_97.10001 0 -Inf ## 7 Anterior nares 700014445 OTU_97.10002 0 -Inf ## 8 Anterior nares 700014445 OTU_97.10003 0 -Inf ## 9 Anterior nares 700014445 OTU_97.10004 0 -Inf ## 10 Anterior nares 700014445 OTU_97.10005 0 -Inf ## # … with 1,380,470 more rows ``` ] --- # `summarise()` .pull-left[ ```r otu %>% group_by(HMPbodysubsite) %>% summarise( mean_reads = mean(read_count, na.rm = TRUE)) %>% ungroup() ``` ] .pull-right[ ``` ## # A tibble: 16 x 2 ## HMPbodysubsite mean_reads ## * <chr> <dbl> ## 1 Anterior_nares 0.0958 ## 2 Attached_Keratinized_gingiva 0.121 ## 3 Buccal_mucosa 0.172 ## 4 Hard_palate 0.116 ## 5 Left_Retroauricular_crease 0.0403 ## 6 Mid_vagina 0.267 ## 7 Palatine_Tonsils 0.133 ## 8 Posterior_fornix 0.257 ## 9 Right_Retroauricular_crease 0.0835 ## 10 Saliva 0.213 ## 11 Stool 0.221 ## 12 Subgingival_plaque 0.262 ## 13 Supragingival_plaque 0.260 ## 14 Throat 0.117 ## 15 Tongue_dorsum 0.202 ## 16 Vaginal_introitus 0.247 ``` ] --- # `summarise()` .pull-left[ ```r otu %>% group_by(HMPbodysubsite) %>% summarise( mean_reads = mean(read_count, na.rm = TRUE), sum_reads = sum(read_count, na.rm = TRUE) ) %>% ungroup() ``` ] .pull-right[ ``` ## # A tibble: 16 x 3 ## HMPbodysubsite mean_reads sum_reads ## * <chr> <dbl> <dbl> ## 1 Anterior_nares 0.0958 8262 ## 2 Attached_Keratinized_gingiva 0.121 10458 ## 3 Buccal_mucosa 0.172 14831 ## 4 Hard_palate 0.116 9977 ## 5 Left_Retroauricular_crease 0.0403 3481 ## 6 Mid_vagina 0.267 23043 ## 7 Palatine_Tonsils 0.133 11445 ## 8 Posterior_fornix 0.257 22189 ## 9 Right_Retroauricular_crease 0.0835 7207 ## 10 Saliva 0.213 18336 ## 11 Stool 0.221 19072 ## 12 Subgingival_plaque 0.262 22573 ## 13 Supragingival_plaque 0.260 22404 ## 14 Throat 0.117 10061 ## 15 Tongue_dorsum 0.202 17431 ## 16 Vaginal_introitus 0.247 21354 ``` ] --- background-image: url(data:image/png;base64,#svg/candyjar.svg) background-size: 500px background-position: 99% 50% class: center, middle, inverse # Summarizing OTU tables --- # High Dimensional Microbiome Data .pad-left[ - Typical OTU table orientation in microbiome studies: - 43140 rows - 32 columns ] .pull-left[ ```r # TYPICAL OTU TABLE ORIENTATION IN MICROBIOME STUDIES otu %>% reshape2::acast(otu_id ~ specimen_id, # rows = otu_id, columns = specimen_id value.var = "read_count") %>% .[1:10,1:5] # 43140 ROWS & 32 COLUMNS ``` ] .pull-right[ ``` ## 700013549 700014386 700014403 700014409 700014412 ## OTU_97.1 0 0 0 0 0 ## OTU_97.10 0 0 6 4 1 ## OTU_97.100 0 0 133 7 1 ## OTU_97.1000 0 0 0 0 0 ## OTU_97.10000 0 0 0 0 0 ## OTU_97.10001 0 0 0 0 0 ## OTU_97.10002 0 0 0 0 0 ## OTU_97.10003 0 0 0 0 0 ## OTU_97.10004 0 0 0 0 0 ## OTU_97.10005 0 0 0 0 0 ``` ] --- # High Dimensional Microbiome Data .pad-left[ - Typical species table orientation in ecology studies: - 32 rows - 43140 columns ] .pull-left[ ```r # TYPICAL SPECIES TABLE ORIENTATION IN ECOLOGY STUDIES otu %>% reshape2::acast(specimen_id ~ otu_id, # rows = specimen_id, columns = otu_id value.var = "read_count") %>% .[1:10,1:5] # 32 ROWS & 43140 COLUMNS ``` ] .pull-right[ ``` ## OTU_97.1 OTU_97.10 OTU_97.100 OTU_97.1000 OTU_97.10000 ## 700013549 0 0 0 0 0 ## 700014386 0 0 0 0 0 ## 700014403 0 6 133 0 0 ## 700014409 0 4 7 0 0 ## 700014412 0 1 1 0 0 ## 700014415 0 5 4 0 0 ## 700014418 0 2 0 0 0 ## 700014421 0 3 25 0 0 ## 700014424 0 1 5 0 0 ## 700014427 0 1 0 0 0 ``` ] --- # High Dimensional Microbiome Data .pad-left[ - How to deal with high-dimensional microbiome data? - __Descriptive (e.g., heatmaps and stacked barplots)__ - Test a priori hypotheses regarding specific OTUs/taxa - Reduce dimensions: - single summary statistic (alpha diversity) - pairwise distances (beta diversity) with PCoA or PERMANOVA - community types (mixture modeling) ] --- # Descriptive Plots .pad-left[ - Visualization of OTU table: - typically present counts as a proportion of sample total - choice of sample order can highlight group differences - Limitations: - cannot depict full list of OTUs - space dictates taxonomic level presented ] --- background-image: url(data:image/png;base64,#img/hmp_heatmap.png) background-size: contain --- background-image: url(data:image/png;base64,#img/hmp_barplot.png) background-size: contain --- # High Dimensional Microbiome Data .pad-left[ - How to deal with high-dimensional microbiome data? - Descriptive (e.g., heatmaps and stacked barplots) - __Test a priori hypotheses regarding specific OTUs/taxa__ - Reduce dimensions: - single summary statistic (alpha diversity) - pairwise distances (beta diversity) with PCoA or PERMANOVA - community types (mixture modeling) ] --- # Single-Taxon Hypotheses .pad-left[ - You suspect Bacteroides has a relationship with outcome of interest... - _Bacteroides_ (genus)? - _Bacteroidaceae_ (family)? - _Bacteroidales_ (order)? - _Bacteroidetes_ (class)? - Hypotheses focusing on specific taxa often fail to account for possibility of selection bias from culture. ] --- # High Dimensional Microbiome Data .pad-left[ - How to deal with high-dimensional microbiome data? - Descriptive (e.g., heatmaps and stacked barplots) - Test a priori hypotheses regarding specific OTUs/taxa - __Reduce dimensions:__ - __single summary statistic (alpha diversity)__ - pairwise distances (beta diversity) with PCoA or PERMANOVA - community types (mixture modeling) ] --- background-image: url(data:image/png;base64,#svg/candyjar.svg) background-size: 500px background-position: 99% 50% class: center, middle, inverse # Alpha Diversity --- # Alpha Diversity .pad-left[ - One solution to the p >> n problem: - capture entire community with a single numerical value - richness: what's there? - evenness: how are members distributed? - diversity: richness + evenness ] --- # Alpha Diversity .pad-left[ - Many alpha diversity metrics (weight richness/evenness): - species number, Chao1 (singletons & doubletons) - Shannon diversity: `$$H' = - \sum{ p_{i} * \log_{b}{(p_{i})} }$$` (note: typically natural log or base 2 are used) ] --- # Aside on Information Theory .pad-left[ - Shannon diversity: `$$H' = - \sum{ p_{i} * \log_{b}{(p_{i})} }$$` - Claude Shannon & information entropy: `$$H(p) = - \sum{ p_{i} * \log_{b}{(p_{i})} }$$` "The uncertainty contained in a probability distribution is the average log-probability of an event." (McElreath _Statistical Rethinking, 2nd_ 2020) ] --- # Alpha Diversity: Chao1 .pad-left[ - The Chao1 index estimates the total species __richness__: `$$S_{Chao1} = S_{obs} + \frac{n_{1}^2}{2*n_{2}}$$` (S = number of species; n1 = number singletons; n2 = number doubletons) - Chao1 is particularly useful for data sets skewed toward the low-abundance classes, as is likely to be the case with microbes. ] --- background-image: url(data:image/png;base64,#img/hmp_shannon.png) background-size: contain --- background-image: url(data:image/png;base64,#svg/candyjar.svg) background-size: 500px background-position: 99% 50% class: center, middle, inverse # Rarefaction vs Rarefying --- background-image: url(data:image/png;base64,#img/counting_uncountable.png) background-size: contain --- # Microbes Too Diverse to Count? .pad-left[ - “In any community, the number of types of organisms observed increases with sampling effort until all types are observed.” - “The relationship between the number of types observed and sampling effort gives information about the total diversity of the sampled community.” - “Pattern can be visualized by plotting an accumulation or a rank-abundance curve.” ] --- # Microbes Too Diverse to Count? .pad-left[ - “An accumulation curve is a plot of the cumulative number of types observed versus sampling effort.” - “Because all communities contain a finite number of species, if the surveyors continued to sample, the curves would eventually reach an asymptote at the actual community richness (number of types).” - “The curves contain information about how well the communities have been sampled (i.e., what fraction of the species in the community have been detected).” ] --- background-image: url(data:image/png;base64,#img/accumulation_curves.png) background-size: contain --- # Microbes Too Diverse to Count? .pad-left[ - “The more concave-downward the curve, the better sampled the community.” - “The idea that microbial diversity cannot be estimated comes from the fact that many microbial accumulation curves are linear or close to linear because of high diversity, small sample sizes, or both.” - “Ultimately, microbes—like tropical insects—are too diverse to count exhaustively.” ] --- background-image: url(data:image/png;base64,#img/rarefaction.png) background-size: contain --- # Rarefaction & Rarefying .pad-left[ - “Rarefaction was originally introduced as a method of estimating species richness (i.e., an alternative to parametric or nonparametric estimators like Chao1)" - “Rarefaction compares observed richness among sites, treatments, or habitats that have been unequally sampled. A rarefied curve results from averaging randomizations of the observed accumulation curve…. The variance around the repeated randomizations allows one to compare the observed richness among samples.” ] --- background-image: url(data:image/png;base64,#img/inadmissable.png) background-size: contain --- # Rarefaction & Rarefying .pad-left[ - “Microbiome analysis workflows often begin with an ad hoc library size normalization by random subsampling without replacement, or so-called rarefying…. There is confusion in the literature regarding terminology, and sometimes this normalization approach is conflated with a non-parametric resampling technique — called rarefaction, or individual-based taxon resampling curves — that can be justified for coverage analysis or species richness estimation in some settings.” ] --- # Rarefaction & Rarefying .pad-left[ - “Here we emphasize the distinction between taxon re-sampling curves and normalization by strictly adhering to the terms rarefying or rarefied counts when referring to the normalization procedure, respecting the original definition for rarefaction.” - "Rarefying defined by the following steps: (1) Select a minimum library size... (2) Discard libraries (microbiome samples) that have fewer reads than [minimum library size]... (3) Subsample the remaining libraries without replacement such that they all have [same] size" - “Rarefying is now an exceedingly common precursor to microbiome multivariate workflows that seek to relate sample covariates to sample-wise distance matrices.” ] --- # Rarefaction & Rarefying .pad-left[ - “Despite its current popularity in microbiome analyses __rarefying biological count data is statistically inadmissible__ because it requires the omission of available valid data. This holds even if repeated rarefying trials are compared for stability as previously suggested.” ] --- background-image: url(data:image/png;base64,#svg/candyjar.svg) background-size: 500px background-position: 99% 50% class: center, middle, inverse # R's vegan package --- # `vegan::diversity()` .pull-left[ ```r library(vegan) diversity(x = c(100,0,5,10), index = "shannon", base = 2) diversity(x = c(5,5,5,5), index = "shannon", base = 2) ``` ] .pull-right[ ``` ## [1] 0.6784071 ``` ``` ## [1] 2 ``` ] --- # `vegan::rrarefy()` .pull-left[ ```r library(vegan) rrarefy(x = c(100,0,5,10), sample = 10) rrarefy(x = c(1421,170,205,3607), sample = 1000) ``` ] .pull-right[ ``` ## [,1] [,2] [,3] [,4] ## [1,] 9 0 1 0 ``` ``` ## [,1] [,2] [,3] [,4] ## [1,] 267 35 30 668 ``` ] --- class: center, middle, inverse background-image: url(data:image/png;base64,#svg/conjugation.svg) background-size: 500px background-position: 50% 50% # Questions? ### Post to the discussion board! --- background-image: url(data:image/png;base64,#svg/bacteria.svg) background-size: 100px background-position: 98% 90% class: center, middle # Thank you! #### Slides available: [github.com/bjklab](https://github.com/bjklab/EPID674_002_sequences-to-counts.git) #### [brendank@pennmedicine.upenn.edu](brendank@pennmedicine.upenn.edu)