Introduction • dysbiosisR

In human microbiome research, there is interest in measuring “dysbiosis” which in many cases relies on comparing a case group with a control (without known disease). Several methods have been reported in literature and reviewed by Wei S et al., in their 2021 article Determining gut microbial dysbiosis: a review of applied indexes for assessment of intestinal microbiota imbalances. The dysbiosisR provides tools for six approaches covering three different methods: Neighborhood classification, Random forest prediction and Combined alpha-beta diversity.

It is important to note that the term ‘dysbiosis’ is a vaguely defined term in most instances and different methods are employed. We noticed a lack of tools for convenient and reproducible measurement of “dysbiosis” measures/scores/index. To address this gap we accumulated some commonly used approaches into a more standardized manner in an R package we call dysbiosisR.
We would also like to direct the readers and users to debates surrounding the concept of dysbiosis and suggest to interpret these measures by considering its pitfalls.

Load libraries

library(dysbiosisR)
library(ggplot2)
library(microbiome)
library(dplyr)
data("WirbelJ_2018")

Print the available methods.

dysbiosisOverview()
#> You are using dysbiosisR version: [1] '1.0.4'
#> 
#> Following score (s) are currently supported
#> 1. dysbiosisMedianCLV: Lloyd-Price J, Arze CAnanthakrishnan AN et al. (2019)
#> 2. euclideanDistCentroids: AlShawaqfeh MK et al. (2017)
#> 3. cloudStatistic: Montassier E et al. (2018)
#> 4. combinedShannonJSD: Santiago M et al. (2019)
#> 5. dysbiosisOBB: Saffouri GB, Shields-Cutler R et al. (2019)
#> 6. distanceToReferencePlane: Halfvarson J, Brislawn CJ et al. (2017)

Here we use the test data from Wirbel J et al., (2018). The data consists of CRC and controls samples.

Data

WirbelJ_2018
#> phyloseq-class experiment-level object
#> otu_table()   OTU Table:         [ 530 taxa and 125 samples ]
#> sample_data() Sample Data:       [ 125 samples by 24 sample variables ]
#> tax_table()   Taxonomy Table:    [ 530 taxa by 6 taxonomic ranks ]
#> phy_tree()    Phylogenetic Tree: [ 530 tips and 529 internal nodes ]

# check first five sample sums
sample_sums(WirbelJ_2018)[1:5]
#> CCMD11006829ST-21-0 CCMD12232071ST-21-0 CCMD13071240ST-21-0 CCMD13934959ST-21-0 
#>            99.83789            99.98053            99.92165            99.95711 
#> CCMD14479708ST-21-0 
#>            98.87883

These are relative abundance data.

table(sample_data(WirbelJ_2018)$disease)
#> 
#>     CRC healthy 
#>      60      65

There are 60 CRC samples and 65 control/healthy samples.

Dysbiosis Measures

Dysbiosis score based on median community level variation

Calculates median variation in a given sample compared to a reference sample group. Here will calculate median variation for a sample compared to a reference sample set. The user can provide a custom distance matrix. If the user provides a Bray-Curtis dissimilarity matrix, then the resulting score is comparable to the ‘dysbiosis score’ reported in Lloyd-Price J, Arze C, Ananthakrishnan AN et al. (2019)

dist.mat <- phyloseq::distance(WirbelJ_2018, "bray")
# get reference samples
ref.samples <- sample_names(subset_samples(WirbelJ_2018, 
                                           disease == "healthy"))

dysbiosis_1 <- dysbiosisMedianCLV(WirbelJ_2018,
                                  dist_mat = dist.mat,
                                  reference_samples = ref.samples)

The original article identifies highly divergent samples from the reference set using the 90th percentile of the dysbiosis score. This can be identified using the quantile function.

dysbiosis_thres <- quantile(subset(dysbiosis_1, disease == "CRC")$score, 0.9)
normobiosis_thres <- quantile(subset(dysbiosis_1, disease == "CRC")$score, 0.1)

dysbiosis_1 <- dysbiosis_1 |> 
  mutate(isDysbiostic = ifelse(score >= dysbiosis_thres, TRUE, FALSE))

# check in each group how many satisfy the threshold 
dysbiosis_1 |> 
  dplyr::count(disease, isDysbiostic)
#>   disease isDysbiostic  n
#> 1     CRC        FALSE 54
#> 2     CRC         TRUE  6
#> 3 healthy        FALSE 61
#> 4 healthy         TRUE  4

Visualize Dysbiosis Score between CRC and healthy group.

p1 <- plotDysbiosis(df=dysbiosis_1,
                    xvar="disease",
                    yvar="score",
                    colors=c(CRC="brown3", healthy="steelblue"),
                    show_points = FALSE) +
  labs(x="", y="Dysbiosis Score") +
  theme_bw(base_size = 14)
p1

# order the data 
dysbiosis_1$disease <- factor(dysbiosis_1$disease, 
                              levels = c("healthy", "CRC"))

roc_1 <- pROC::roc(as.factor(dysbiosis_1$disease),
                   dysbiosis_1$score ,
                   #direction= ">",
                   plot=TRUE,
                   ci = TRUE,
                   auc.polygon=TRUE,
                   max.auc.polygon=TRUE,
                   print.auc=TRUE)

The AUC for this approach is 0.6469.

Dysbiosis score based on Euclidean distance to group centroids

Calculates difference in euclidean distance (ED) for a sample to group centroids. For example, sample_1 to control centroid minus sample_1 to case centroid. The user can provide a custom distance matrix. This approach was used in AlShawaqfeh MK et al. (2017).

dysbiosis_2 <- euclideanDistCentroids(WirbelJ_2018,
                                      dist_mat = dist.mat,
                                      use_squared = TRUE,
                                      group_col = "disease",
                                      control_label = "healthy",
                                      case_label = "CRC")
dysbiosis_2[1:3,1:3]
#>                     CentroidDist_CRC CentroidDist_healthy CentroidDist_score
#> CCMD11006829ST-21-0        0.2912347            0.3684569         0.07722225
#> CCMD12232071ST-21-0        0.2744104            0.3415175         0.06710710
#> CCMD13071240ST-21-0        0.2172320            0.2697302         0.05249821

Here, the CentroidDist_score can be used as a dysbiosis score.

# order the data 
dysbiosis_2$disease <- factor(dysbiosis_2$disease, 
                              levels = c("healthy", "CRC"))

roc_2 <- pROC::roc(as.factor(dysbiosis_2$disease),
                   dysbiosis_2$CentroidDist_score ,
                   #direction= ">",
                   plot=TRUE,
                   ci = TRUE,
                   auc.polygon=TRUE,
                   max.auc.polygon=TRUE,
                   print.auc=TRUE)

The AUC for this approach is 0.797.

p3 <- plotDysbiosis(df=dysbiosis_2,
                    xvar="disease",
                    yvar="CentroidDist_score",
                    colors=c(CRC="brown3", healthy="steelblue"),
                    show_points = FALSE) +
  labs(x="", y="Dysbiosis Score (Centroid)")
p3

Combined Alpha Beta Diveristy Based Score

The combined alpha-beta diversity approach was used by Santiago M E et al. 2019. This approach uses Shannon diversity as the alpha diversity measure and Jensen–Shannon divergence as the beta diversity measure. The score is mean difference of Shannon diversity between test sample and all references samples multiplied by the mean JSD of the test sample to all reference samples. When calculating this score for reference samples, the sample being used is excluded from calculating means for alpha and beta diversity.

dysbiosis_3 <- combinedShannonJSD(WirbelJ_2018,
                                  reference_samples = ref.samples)
dysbiosis_3[1:3,1:3]
#>                     ShannonJSDScore   study_name          subject_id
#> CCMD11006829ST-21-0      0.19887588 WirbelJ_2018 CCMD11006829ST-21-0
#> CCMD12232071ST-21-0      0.07682596 WirbelJ_2018 CCMD12232071ST-21-0
#> CCMD13071240ST-21-0      0.19646724 WirbelJ_2018 CCMD13071240ST-21-0

p4 <- plotDysbiosis(df=dysbiosis_3,
                    xvar="disease",
                    yvar="ShannonJSDScore",
                    colors=c(CRC="brown3", healthy="steelblue"),
                    show_points = FALSE)+
  labs(x="", y="Shannon-JSD\nDysbiosis Score")
p4

# order the data 
dysbiosis_3$disease <- factor(dysbiosis_3$disease, 
                              levels = c("healthy", "CRC"))

roc_3 <- pROC::roc(as.factor(dysbiosis_3$disease),
                   dysbiosis_3$ShannonJSDScore ,
                   #direction= ">",
                   plot=TRUE,
                   ci = TRUE,
                   auc.polygon=TRUE,
                   max.auc.polygon=TRUE,
                   print.auc=TRUE)

The AUC for this approach is 0.582.

Cloud-based LOcally linear Unbiased Dysbiosis (CLOUD) test

Cloud-based LOcally linear Unbiased Dysbiosis (CLOUD) test is a non-parametric test and returns a measure of dysbiosis. The function was adapted from the original article by Montassier E et al. 2018. Here, a user defines a set of reference samples from which distance of every other sample is calculated. When calculating the CLOUD stats the k is an important parameter specified by argument k_num.

cloud.results <- cloudStatistic(WirbelJ_2018,
                                dist_mat = dist.mat,
                                reference_samples = ref.samples,
                                ndim=-1,
                                k_num=5)
cloud.results[1:3,1:5]
#>                         stats pvals  log2Stats   study_name          subject_id
#> CCMD10032470ST-11-0 0.6673767     1 -0.5834267 WirbelJ_2018 CCMD10032470ST-11-0
#> CCMD10191450ST-11-0 0.6603512     1 -0.5986947 WirbelJ_2018 CCMD10191450ST-11-0
#> CCMD11006829ST-21-0 0.8141492     1 -0.2966349 WirbelJ_2018 CCMD11006829ST-21-0

Here, the log2Stats can be used as a dysbiosis score.

# order the data 
cloud.results$disease <- factor(cloud.results$disease, 
                                levels = c("healthy", "CRC"))

roc_4 <- pROC::roc(as.factor(cloud.results$disease),
                   cloud.results$log2Stats ,
                   #direction= ">",
                   plot=TRUE,
                   ci = TRUE,
                   auc.polygon=TRUE,
                   max.auc.polygon=TRUE,
                   print.auc=TRUE)

The AUC for this approach is 0.631.

p2 <- plotDysbiosis(df=cloud.results,
                    xvar="disease",
                    yvar="log2Stats",
                    colors=c(CRC="brown3", healthy="steelblue"),
                    show_points = FALSE) +
  labs(x="", y="Dysbiosis CLOUD Score")
p2

Random Forest Prediction Based Score

The original article Saffouri GB, Shields-Cutler R et al. 2019 reported a Symptom Index abbreviated as SI. In this approach the feature abundances are used for Random Forest classification and the resulting out of bag (OOB) predicted probability of being classified in disease group is considered as an SI or also dysbiosis index. The dysbiosisOBB function in this package allows for calculating this measure with some level of freedom of tuneRF and random forest parameters via the randomForest R package.

# data are relative abundances summed to 100 or
# near 100 val
dysbiosis.oob <- dysbiosisOBB(WirbelJ_2018,
                              group_col = "disease",
                              case_label = "CRC",
                              seed_value = 1235,
                              add_tuneRF_params = list(ntreeTry=100,
                                                       stepFactor=1.5,
                                                       improve=0.01,
                                                       trace=TRUE,
                                                       dobest=FALSE),
                              ntree = 100,
                              plot_roc = TRUE)
#> mtry = 23  OOB error = 20.8% 
#> Searching left ...
#> mtry = 16    OOB error = 22.4% 
#> -0.07692308 0.01 
#> Searching right ...
#> mtry = 34    OOB error = 19.2% 
#> 0.07692308 0.01 
#> mtry = 51    OOB error = 17.6% 
#> 0.08333333 0.01 
#> mtry = 76    OOB error = 18.4% 
#> -0.04545455 0.01

Visualize

p5 <- plotDysbiosis(df=dysbiosis.oob,
                    xvar="disease",
                    yvar="oob.score",
                    colors=c(CRC="brown3", healthy="steelblue"),
                    show_points = FALSE)+
  labs(x="", y="OOB Dysbiosis Score")
p5

All these measure may be influenced by host factors such as age, gender, BMI, etc. It is advised to investigate these factors when analyzing the dysbiosis measures.

Distance to reference plane

The original article Halfvarson J, Brislawn CJ, Lamendella R et al. 2017 reported a measure based on the deviation from a healthy or reference plane. The plane is fitted in PCoA-space based on user-defined distances between samples, for example a Bray-Curtis dissimilarity matrix. Briefly, a two-dimensional plane is fitted to the first three PCoA-coordinates of samples from healthy subjects using a least-squares method. Next, the distance between a given sample to the plane is calculated, which can be considered as a measure of deviation from normal.

The original implementation of the reference plane calculation was written in Python by the authors of the paper. Our R implementation was benchmarked against the original code and yields identical results.

As with other measures based on a distance matrix, results can vary based on the distance chosen. Here, we applied the Bray-Curtis dissimilarity metric, whereas Halfvarson J et al. used the UniFrac distance.

dtrp.results <- distanceToReferencePlane(
  WirbelJ_2018,
  dist_mat = dist.mat,
  reference_samples = ref.samples)
dtrp.results[1:3,1:5]
#>                      dtrpScore   study_name          subject_id body_site
#> CCMD11006829ST-21-0 0.05128249 WirbelJ_2018 CCMD11006829ST-21-0     stool
#> CCMD12232071ST-21-0 0.22944516 WirbelJ_2018 CCMD12232071ST-21-0     stool
#> CCMD13071240ST-21-0 0.14400941 WirbelJ_2018 CCMD13071240ST-21-0     stool
#>                     study_condition
#> CCMD11006829ST-21-0             CRC
#> CCMD12232071ST-21-0             CRC
#> CCMD13071240ST-21-0             CRC

# order the data 
dtrp.results$disease <- factor(dtrp.results$disease, 
                               levels = c("healthy", "CRC"))
roc_6 <- pROC::roc(as.factor(dtrp.results$disease),
                   dtrp.results$dtrpScore,
                   #direction= ">",
                   plot=TRUE,
                   ci = TRUE,
                   auc.polygon=TRUE,
                   max.auc.polygon=TRUE,
                   grid=TRUE,
                   print.auc=TRUE)

The AUC for this approach is 0.561.

Visualise

p6 <- plotDysbiosis(df=dtrp.results,
                    xvar="disease",
                    yvar="dtrpScore",
                    colors=c(CRC="brown3", healthy="steelblue"),
                    show_points = FALSE)+
  labs(x="", y="Distance to reference plane Score")
p6

Gradient plot

Current dysbiosis measures do not necessarily reveal a yes and no classification of healthy and disease states. This is a challenge in gut microbiome research where individuality plays an important role. To demonstrate this, we can visualize this gradient with plotDysbiosisGradient. We use the results from the the random forest prediction based score and dysbiosisMedianCLV approach along with the dysbiosis_thres and normobiosis_thres values we obtained above.


# define gradient cols
# browns <- c("#723d46", "#86525b", "#996770", "#ad7d86", 
#             "#c1949d", "#d6abb4", "#eac3cc", "#ede5e6")
volcano <- c("#003f5c", "#58508d","#bc5090","#ff6361", "#ffa600")

# RF OOB  
p.obb <- plotDysbiosisGradient(df=dysbiosis.oob,
                               score="oob.score",
                               high_line = 0.5,
                               # low_line = normobiosis_thres,
                               group_var = "disease",
                               group_colors=c("healthy" = "steelblue", 
                                              "CRC"= "brown3"),
                               point_size = 2,
                               bg_colors = rev(volcano),
                               jitter_width = 0.1) +
  labs(y="Dysbiosis Score", subtitle = "RF OOB Probability") +
  # adjust the x and y values to fit to plot
  ggplot2::annotate("text", x = 0.7, y = 0.55,
                    label = "OOB probability\n(0.5)", color="white")

# Median CLV    
p.clv <- plotDysbiosisGradient(df=dysbiosis_1,
                               score="score",
                               high_line = dysbiosis_thres,
                               low_line = normobiosis_thres,
                               group_var = "disease",
                               group_colors=c("healthy" = "steelblue", 
                                              "CRC"= "brown3"),
                               point_size = 2,
                               bg_colors = rev(volcano),
                               jitter_width = 0.1) +
  labs(y="Dysbiosis Score", subtitle = "Median CLV") +
  # adjust the x and y values to fit to plot
  ggplot2::annotate("text", x = 0.6, y = dysbiosis_thres+0.03,
                    label = "Dysbiosis\nThreshold", color="white")+
  ggplot2::annotate("text", x = 0.65, y = normobiosis_thres-0.03,
                    label = "Normobiosis\nThreshold", color="white")

library(patchwork)
p.obb + p.clv + plot_layout(guides = "collect")

Additional notes: The ROC AUC curves generated for a single data set are susceptible to over fitting or under fitting and therefore, the dysbiosis scores should be considered to be data specific. In order to build/use generalized prediction models based on dysbiosis measures and inferred cut-offs, different training and test data must be used. We see the application of these dysbiosis measure more specifically for single data sets where one would like to ‘capture disease associated imbalance’ into a single sample specific score. This score may be used to check for associations with other disease related features.

Caution: The dysbiosis score in itself does not tell anything about the cause or consequence and hence careful interpretation is of paramount importance.

Here, for example we ask how different is the gut microbiota at different CRC stages compared to control gut microbiota? We can look at two measures, alpha diversity and dysbiosis score in different CRC stages.

# get alpha diversity example Shannon diversity  
sample_data(WirbelJ_2018)$shannon <- microbiome::diversity(WirbelJ_2018, "shannon")[,1]

shannon.plot <- meta(WirbelJ_2018) |> 
  mutate(crc.stage = ifelse(is.na(ajcc), "CTRL", ajcc)) |> 
  mutate(crc.stage = factor(crc.stage, levels= c("CTRL", "0", "i", "ii", "iii", "iv"))) |> 
  ggplot(aes(crc.stage, shannon)) +
  geom_boxplot(width=0.5, aes(color=disease))+
  geom_jitter(width = 0.2, aes(color=disease)) +
  theme_bw() +
  labs(x="CRC stages", y="Shannon") +
  scale_color_manual("Disease", values = c("healthy" = "steelblue", 
                                           "CRC"= "brown3"))

# random forest prediction based score  
dys.plot <- dysbiosis.oob |> 
  mutate(crc.stage = ifelse(is.na(ajcc), "CTRL", ajcc)) |> 
  mutate(crc.stage = factor(crc.stage, levels= c("CTRL", "0", "i", "ii", "iii", "iv"))) |> 
  ggplot(aes(crc.stage, oob.score)) +
  geom_boxplot(width=0.5, aes(color=disease))+
  geom_jitter(width = 0.1, aes(color=disease)) +
  theme_bw() +
  labs(x="CRC stages", y="Dysbiosis Score\n(RF OOB)") +
  scale_color_manual("Disease", values = c("healthy" = "steelblue", 
                                              "CRC"= "brown3"))
shannon.plot + dys.plot + plot_layout(guides = "collect")

sessionInfo()
#> R version 4.2.2 (2022-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.5 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] patchwork_1.1.2   dplyr_1.0.10      dysbiosisR_1.0.4  microbiome_1.18.0
#> [5] ggplot2_3.4.0     phyloseq_1.40.0  
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-160           bitops_1.0-7           fs_1.5.2              
#>  [4] usedist_0.4.0          RColorBrewer_1.1-3     rprojroot_2.0.3       
#>  [7] GenomeInfoDb_1.32.4    tools_4.2.2            bslib_0.4.1           
#> [10] utf8_1.2.2             R6_2.5.1               vegan_2.6-4           
#> [13] BiocGenerics_0.42.0    mgcv_1.8-41            colorspace_2.0-3      
#> [16] ggdist_3.2.0           permute_0.9-7          rhdf5filters_1.8.0    
#> [19] ade4_1.7-20            withr_2.5.0            tidyselect_1.2.0      
#> [22] compiler_4.2.2         textshaping_0.3.6      cli_3.4.1             
#> [25] Biobase_2.56.0         desc_1.4.2             labeling_0.4.2        
#> [28] sass_0.4.4             scales_1.2.1           randomForest_4.7-1.1  
#> [31] pkgdown_2.0.6          systemfonts_1.0.4      stringr_1.4.1         
#> [34] digest_0.6.30          rmarkdown_2.18         XVector_0.36.0        
#> [37] pkgconfig_2.0.3        htmltools_0.5.3        highr_0.9             
#> [40] fastmap_1.1.0          rlang_1.0.6            farver_2.1.1          
#> [43] jquerylib_0.1.4        generics_0.1.3         jsonlite_1.8.3        
#> [46] distributional_0.3.1   RCurl_1.98-1.9         magrittr_2.0.3        
#> [49] GenomeInfoDbData_1.2.8 biomformat_1.24.0      Matrix_1.5-1          
#> [52] Rcpp_1.0.9             munsell_0.5.0          S4Vectors_0.34.0      
#> [55] Rhdf5lib_1.18.2        fansi_1.0.3            ape_5.6-2             
#> [58] lifecycle_1.0.3        stringi_1.7.8          pROC_1.18.0           
#> [61] yaml_2.3.6             MASS_7.3-58.1          zlibbioc_1.42.0       
#> [64] rhdf5_2.40.0           Rtsne_0.16             plyr_1.8.8            
#> [67] grid_4.2.2             parallel_4.2.2         crayon_1.5.2          
#> [70] lattice_0.20-45        Biostrings_2.64.1      splines_4.2.2         
#> [73] multtest_2.52.0        knitr_1.41             pillar_1.8.1          
#> [76] igraph_1.3.5           reshape2_1.4.4         codetools_0.2-18      
#> [79] stats4_4.2.2           glue_1.6.2             evaluate_0.18         
#> [82] data.table_1.14.6      vctrs_0.5.1            foreach_1.5.2         
#> [85] gtable_0.3.1           purrr_0.3.5            tidyr_1.2.1           
#> [88] cachem_1.0.6           xfun_0.35              ragg_1.2.4            
#> [91] survival_3.4-0         tibble_3.1.8           iterators_1.0.14      
#> [94] memoise_2.0.1          IRanges_2.30.1         cluster_2.1.4         
#> [97] ellipsis_0.3.2