Alpha Diversity:Summarizing Microbial Communities

class: center, middle, inverse, title-slide

# Alpha Diversity:</br>Summarizing Microbial Communities
## 📚EPID 674📚
### Brendan J. Kelly, MD, MS
### Updated: 09 June 2020

---

background-image: url(data:image/png;base64,#svg/candyjar.svg)
background-size: 500px
background-position: 99% 50%
class: middle, inverse

.pad-left[

### R/tidyverse recap

### Summarizing OTU tables

### Alpha diversity

### Rarefaction vs rarefying

### R's vegan package

]

---
background-image: url(data:image/png;base64,#svg/candyjar.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# R/tidyverse recap

---

# R: Making New Variables

.pad-left[
- `tidyverse` to create new variables

- Make new variables & keep old variables:

- `mutate()` function
    
    - must return same length as input, or length 1 (wll be recycled)

- Make a new variable & lose old variables:

- `summarise()` function (often follows `group_by()`)
    
    - simplest use case: return length of 1

]

---

# R: Making New Variables

.pull-left[

```r
# install tidyverse once 
install.packages('tidyverse')

# load tidyverse functions each time you use
library(tidyverse)
```

]

.pull-right[

- Code at left installs and loads the `tidyverse` package

- The `tidyverse` package includes a set of other packages that permit streamlined data processing

- See Hadley Wickham's _R For Data Science_: [https://r4ds.had.co.nz/](https://r4ds.had.co.nz/)

]

---

# `mutate()`

.pull-left[

```r
# make sure tidyverse loaded
library(tidyverse)

# load (trimmed) HMP V1-V3 OTU table 
otu <- read_csv(
  file = 
    "./data/HMP_OTU_table_longformat.csv.gz",
)

otu # show what you've read
```

]

.pull-right[

```
## # A tibble: 1,380,480 x 4
##    HMPbodysubsite specimen_id otu_id       read_count
##    <chr>                <dbl> <chr>             <dbl>
##  1 Anterior_nares   700014445 OTU_97.1              0
##  2 Anterior_nares   700014445 OTU_97.10             0
##  3 Anterior_nares   700014445 OTU_97.100            0
##  4 Anterior_nares   700014445 OTU_97.1000           0
##  5 Anterior_nares   700014445 OTU_97.10000          0
##  6 Anterior_nares   700014445 OTU_97.10001          0
##  7 Anterior_nares   700014445 OTU_97.10002          0
##  8 Anterior_nares   700014445 OTU_97.10003          0
##  9 Anterior_nares   700014445 OTU_97.10004          0
## 10 Anterior_nares   700014445 OTU_97.10005          0
## # … with 1,380,470 more rows
```

]

---

# `mutate()`

.pull-left[

```r
otu %>%
  mutate(
    HMPbodysubsite =
           gsub("_"," ",HMPbodysubsite)
    )
```

]

.pull-right[

```
## # A tibble: 1,380,480 x 4
##    HMPbodysubsite specimen_id otu_id       read_count
##    <chr>                <dbl> <chr>             <dbl>
##  1 Anterior nares   700014445 OTU_97.1              0
##  2 Anterior nares   700014445 OTU_97.10             0
##  3 Anterior nares   700014445 OTU_97.100            0
##  4 Anterior nares   700014445 OTU_97.1000           0
##  5 Anterior nares   700014445 OTU_97.10000          0
##  6 Anterior nares   700014445 OTU_97.10001          0
##  7 Anterior nares   700014445 OTU_97.10002          0
##  8 Anterior nares   700014445 OTU_97.10003          0
##  9 Anterior nares   700014445 OTU_97.10004          0
## 10 Anterior nares   700014445 OTU_97.10005          0
## # … with 1,380,470 more rows
```

]

---

# `mutate()`

.pull-left[

```r
otu %>%
  mutate(
    HMPbodysubsite =
           gsub("_"," ",HMPbodysubsite),
    log_reads = log10(read_count)
    )
```

]

.pull-right[

```
## # A tibble: 1,380,480 x 5
##    HMPbodysubsite specimen_id otu_id       read_count log_reads
##    <chr>                <dbl> <chr>             <dbl>     <dbl>
##  1 Anterior nares   700014445 OTU_97.1              0      -Inf
##  2 Anterior nares   700014445 OTU_97.10             0      -Inf
##  3 Anterior nares   700014445 OTU_97.100            0      -Inf
##  4 Anterior nares   700014445 OTU_97.1000           0      -Inf
##  5 Anterior nares   700014445 OTU_97.10000          0      -Inf
##  6 Anterior nares   700014445 OTU_97.10001          0      -Inf
##  7 Anterior nares   700014445 OTU_97.10002          0      -Inf
##  8 Anterior nares   700014445 OTU_97.10003          0      -Inf
##  9 Anterior nares   700014445 OTU_97.10004          0      -Inf
## 10 Anterior nares   700014445 OTU_97.10005          0      -Inf
## # … with 1,380,470 more rows
```

]

---

# `summarise()`

.pull-left[

```r
otu %>%
  group_by(HMPbodysubsite) %>%
  summarise(
    mean_reads = mean(read_count, na.rm = TRUE)) %>%
  ungroup()
```

]

.pull-right[

```
## # A tibble: 16 x 2
##    HMPbodysubsite               mean_reads
##  * <chr>                             <dbl>
##  1 Anterior_nares                   0.0958
##  2 Attached_Keratinized_gingiva     0.121 
##  3 Buccal_mucosa                    0.172 
##  4 Hard_palate                      0.116 
##  5 Left_Retroauricular_crease       0.0403
##  6 Mid_vagina                       0.267 
##  7 Palatine_Tonsils                 0.133 
##  8 Posterior_fornix                 0.257 
##  9 Right_Retroauricular_crease      0.0835
## 10 Saliva                           0.213 
## 11 Stool                            0.221 
## 12 Subgingival_plaque               0.262 
## 13 Supragingival_plaque             0.260 
## 14 Throat                           0.117 
## 15 Tongue_dorsum                    0.202 
## 16 Vaginal_introitus                0.247
```

]

---

# `summarise()`

.pull-left[

```r
otu %>%
  group_by(HMPbodysubsite) %>%
  summarise(
    mean_reads = mean(read_count, na.rm = TRUE),
    sum_reads = sum(read_count, na.rm = TRUE)
    ) %>%
  ungroup()
```

]

.pull-right[

```
## # A tibble: 16 x 3
##    HMPbodysubsite               mean_reads sum_reads
##  * <chr>                             <dbl>     <dbl>
##  1 Anterior_nares                   0.0958      8262
##  2 Attached_Keratinized_gingiva     0.121      10458
##  3 Buccal_mucosa                    0.172      14831
##  4 Hard_palate                      0.116       9977
##  5 Left_Retroauricular_crease       0.0403      3481
##  6 Mid_vagina                       0.267      23043
##  7 Palatine_Tonsils                 0.133      11445
##  8 Posterior_fornix                 0.257      22189
##  9 Right_Retroauricular_crease      0.0835      7207
## 10 Saliva                           0.213      18336
## 11 Stool                            0.221      19072
## 12 Subgingival_plaque               0.262      22573
## 13 Supragingival_plaque             0.260      22404
## 14 Throat                           0.117      10061
## 15 Tongue_dorsum                    0.202      17431
## 16 Vaginal_introitus                0.247      21354
```

]

---
background-image: url(data:image/png;base64,#svg/candyjar.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# Summarizing OTU tables

---

# High Dimensional Microbiome Data

.pad-left[

- Typical OTU table orientation in microbiome studies:
    - 43140 rows
    - 32 columns

]

.pull-left[

```r
# TYPICAL OTU TABLE ORIENTATION IN MICROBIOME STUDIES

otu %>%
  reshape2::acast(otu_id ~ specimen_id,
                  # rows = otu_id, columns = specimen_id
                  value.var = "read_count") %>%
     .[1:10,1:5]

# 43140 ROWS & 32 COLUMNS
```

]

.pull-right[

```
##              700013549 700014386 700014403 700014409 700014412
## OTU_97.1             0         0         0         0         0
## OTU_97.10            0         0         6         4         1
## OTU_97.100           0         0       133         7         1
## OTU_97.1000          0         0         0         0         0
## OTU_97.10000         0         0         0         0         0
## OTU_97.10001         0         0         0         0         0
## OTU_97.10002         0         0         0         0         0
## OTU_97.10003         0         0         0         0         0
## OTU_97.10004         0         0         0         0         0
## OTU_97.10005         0         0         0         0         0
```

]

---

# High Dimensional Microbiome Data

.pad-left[

- Typical species table orientation in ecology studies:
    - 32 rows
    - 43140 columns

]

.pull-left[

```r
# TYPICAL SPECIES TABLE ORIENTATION IN ECOLOGY STUDIES

otu %>%
  reshape2::acast(specimen_id ~ otu_id,
                  # rows = specimen_id, columns = otu_id
                  value.var = "read_count") %>%
   .[1:10,1:5]

# 32 ROWS & 43140 COLUMNS
```

]

.pull-right[

```
##           OTU_97.1 OTU_97.10 OTU_97.100 OTU_97.1000 OTU_97.10000
## 700013549        0         0          0           0            0
## 700014386        0         0          0           0            0
## 700014403        0         6        133           0            0
## 700014409        0         4          7           0            0
## 700014412        0         1          1           0            0
## 700014415        0         5          4           0            0
## 700014418        0         2          0           0            0
## 700014421        0         3         25           0            0
## 700014424        0         1          5           0            0
## 700014427        0         1          0           0            0
```

]

---

# High Dimensional Microbiome Data

.pad-left[

- How to deal with high-dimensional microbiome data?

- __Descriptive (e.g., heatmaps and stacked barplots)__
    
- Test a priori hypotheses regarding specific OTUs/taxa

- Reduce dimensions:

- single summary statistic (alpha diversity)
    
    - pairwise distances (beta diversity) with PCoA or PERMANOVA
    
    - community types (mixture modeling)

]

---

# Descriptive Plots

.pad-left[

- Visualization of OTU table:

- typically present counts as a proportion of sample total
    
    - choice of sample order can highlight group differences

- Limitations:

- cannot depict full list of OTUs
    
    - space dictates taxonomic level presented

]

---
background-image: url(data:image/png;base64,#img/hmp_heatmap.png)
background-size: contain

---
background-image: url(data:image/png;base64,#img/hmp_barplot.png)
background-size: contain

---

# High Dimensional Microbiome Data

.pad-left[

- How to deal with high-dimensional microbiome data?

- Descriptive (e.g., heatmaps and stacked barplots)
    
- __Test a priori hypotheses regarding specific OTUs/taxa__

- Reduce dimensions:

- single summary statistic (alpha diversity)
    
    - pairwise distances (beta diversity) with PCoA or PERMANOVA
    
    - community types (mixture modeling)

]

---

# Single-Taxon Hypotheses

.pad-left[

- You suspect Bacteroides has a relationship with outcome of interest...

- _Bacteroides_ (genus)?
    
    - _Bacteroidaceae_ (family)?
    
    - _Bacteroidales_ (order)?
    
    - _Bacteroidetes_ (class)?

- Hypotheses focusing on specific taxa often fail to account for possibility of selection bias from culture.

]

---

# High Dimensional Microbiome Data

.pad-left[

- How to deal with high-dimensional microbiome data?

- Descriptive (e.g., heatmaps and stacked barplots)
    
- Test a priori hypotheses regarding specific OTUs/taxa

- __Reduce dimensions:__

- __single summary statistic (alpha diversity)__
    
    - pairwise distances (beta diversity) with PCoA or PERMANOVA
    
    - community types (mixture modeling)

]

---
background-image: url(data:image/png;base64,#svg/candyjar.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# Alpha Diversity

---

# Alpha Diversity

.pad-left[

- One solution to the p >> n problem:

- capture entire community with a single numerical value
    
    - richness: what's there?
    
    - evenness: how are members distributed?
    
    - diversity: richness + evenness

]

---

# Alpha Diversity

.pad-left[

- Many alpha diversity metrics (weight richness/evenness):

- species number, Chao1 (singletons & doubletons)
    
    - Shannon diversity: 
    
    `$$H' = - \sum{ p_{i} * \log_{b}{(p_{i})} }$$`
    
    (note: typically natural log or base 2 are used)

]

---

# Aside on Information Theory

.pad-left[

- Shannon diversity:

`$$H' = - \sum{ p_{i} * \log_{b}{(p_{i})} }$$`
    
- Claude Shannon & information entropy:

`$$H(p) = - \sum{ p_{i} * \log_{b}{(p_{i})} }$$`
    
"The uncertainty contained in a probability distribution is the average log-probability of an event." (McElreath _Statistical Rethinking, 2nd_ 2020)

]

---

# Alpha Diversity: Chao1

.pad-left[

- The Chao1 index estimates the total species __richness__:

`$$S_{Chao1} = S_{obs} + \frac{n_{1}^2}{2*n_{2}}$$`
    
(S = number of species; n1 = number singletons; n2 = number doubletons)
    
- Chao1 is particularly useful for data sets skewed toward the low-abundance classes, as is likely to be the case with microbes.

]

---
background-image: url(data:image/png;base64,#img/hmp_shannon.png)
background-size: contain

---
background-image: url(data:image/png;base64,#svg/candyjar.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# Rarefaction vs Rarefying

---
background-image: url(data:image/png;base64,#img/counting_uncountable.png)
background-size: contain

---

# Microbes Too Diverse to Count?

.pad-left[

- “In any community, the number of types of organisms observed increases with sampling effort until all types are observed.”

- “The relationship between the number of types observed and sampling effort gives information about the total diversity of the sampled community.”

- “Pattern can be visualized by plotting an accumulation or a rank-abundance curve.”

]

---

# Microbes Too Diverse to Count?

.pad-left[

- “An accumulation curve is a plot of the cumulative number of types observed versus sampling effort.”

- “Because all communities contain a finite number of species, if the surveyors continued to sample, the curves would eventually reach an asymptote at the actual community richness (number of types).”

- “The curves contain information about how well the communities have been sampled (i.e., what fraction of the species in the community have been detected).”

]

---
background-image: url(data:image/png;base64,#img/accumulation_curves.png)
background-size: contain

---

# Microbes Too Diverse to Count?

.pad-left[

- “The more concave-downward the curve, the better sampled the community.”

- “The idea that microbial diversity cannot be estimated comes from the fact that many microbial accumulation curves are linear or close to linear because of high diversity, small sample sizes, or both.”

- “Ultimately, microbes—like tropical insects—are too diverse to count exhaustively.”

]

---
background-image: url(data:image/png;base64,#img/rarefaction.png)
background-size: contain

---

# Rarefaction & Rarefying

.pad-left[

- “Rarefaction was originally introduced as a method of estimating species richness (i.e., an alternative to parametric or nonparametric estimators like Chao1)"

- “Rarefaction compares observed richness among sites, treatments, or habitats that have been unequally sampled. A rarefied curve results from averaging randomizations of the observed accumulation curve…. The variance around the repeated randomizations allows one to compare the observed richness among samples.”

]

---
background-image: url(data:image/png;base64,#img/inadmissable.png)
background-size: contain

---

# Rarefaction & Rarefying

.pad-left[

- “Microbiome analysis workflows often begin with an ad hoc library size normalization by random subsampling without replacement, or so-called rarefying…. There is confusion in the literature regarding terminology, and sometimes this normalization approach is conflated with a non-parametric resampling technique — called rarefaction, or individual-based taxon resampling curves — that can be justified for coverage analysis or species richness estimation in some settings.”

]

---

# Rarefaction & Rarefying

.pad-left[

- “Here we emphasize the distinction between taxon re-sampling curves and normalization by strictly adhering to the terms rarefying or rarefied counts when referring to the normalization procedure, respecting the original definition for rarefaction.”

- "Rarefying defined by the following steps: (1) Select a minimum library size... (2) Discard libraries (microbiome samples) that have fewer reads than [minimum library size]... (3) Subsample the remaining libraries without replacement such that they all have [same] size"

- “Rarefying is now an exceedingly common precursor to microbiome multivariate workflows that seek to relate sample covariates to sample-wise distance matrices.”

]

---

# Rarefaction & Rarefying

.pad-left[

- “Despite its current popularity in microbiome analyses __rarefying biological count data is statistically inadmissible__ because it requires the omission of available valid data. This holds even if repeated rarefying trials are compared for stability as previously suggested.”

]

---
background-image: url(data:image/png;base64,#svg/candyjar.svg)
background-size: 500px
background-position: 99% 50%
class: center, middle, inverse

# R's vegan package

---

# `vegan::diversity()`

.pull-left[

```r
library(vegan)

diversity(x = c(100,0,5,10), index = "shannon", base = 2)

diversity(x = c(5,5,5,5), index = "shannon", base = 2)
```

]

.pull-right[

```
## [1] 0.6784071
```

```
## [1] 2
```

]

---

# `vegan::rrarefy()`

.pull-left[

```r
library(vegan)

rrarefy(x = c(100,0,5,10), sample = 10)

rrarefy(x = c(1421,170,205,3607), sample = 1000)
```

]

.pull-right[

```
##      [,1] [,2] [,3] [,4]
## [1,]    9    0    1    0
```

```
##      [,1] [,2] [,3] [,4]
## [1,]  267   35   30  668
```

]

---
class: center, middle, inverse
background-image: url(data:image/png;base64,#svg/conjugation.svg)
background-size: 500px
background-position: 50% 50%

# Questions?
### Post to the discussion board!

---
background-image: url(data:image/png;base64,#svg/bacteria.svg)
background-size: 100px
background-position: 98% 90%
class: center, middle

# Thank you!
#### Slides available: [github.com/bjklab](https://github.com/bjklab/EPID674_002_sequences-to-counts.git)
#### [brendank@pennmedicine.upenn.edu](brendank@pennmedicine.upenn.edu)