Automate Microbiome Analysis Using Quarto

Author

Sudarshan A. Shetty

Note

Before you start, please familiarize yourself with RStudio and its RStudio projects.

As someone analyzing microbiome data, you may find yourself frequently working with different data sets and performing the same basic analyses over and over again. These standard tasks, such as quality checking, calculating alpha and beta diversity, and visualizing the composition between groups, can be time-consuming if you’re constantly copy-pasting code or searching for solutions on Stack Overflow or GitHub.

Wouldn’t it be great if you could automate these analyses? That’s where Quarto comes in. Developed by Posit (formerly RStudio), Quarto is the next generation version of Rmarkdown (.rmd) tool that makes it easy to perform these common tasks and more. In fact, if you’re already familiar with Rmarkdown, you’ll find the Quarto document (.qmd) very similar to .rmd. Here you can find the installation details.

Here’s a simple example of what you can do with Quarto: read in a phyloseq object stored as an *.rds file, calculate reads per sample, Shannon diversity, and Observed ASVs, and investigate whether there is a correlation between reads per sample and Shannon diversity, Observed ASVs. I use the data from Fuentes et al. that consists of gut microbiota profiles from older adults with (two time points, L1 and L2) and without (C) influenza-like illness. This information is stored in the column called ILI in the sample_data of the phyloseq object FuentesIliGutData.rds. You can access the data from the GitHub repo of this website.

---
title: "Microbiota Analysis"
author: Sudarshan A. Shetty
date: "2023-01-08"
format:
  html:
    code-tools: true
    code-fold: false
    code-copy: true
    self-contained: true
    toc-location: left
knitr: 
    opts_chunk: 
        collapse: TRUE
        dpi: 150
        message: FALSE
        warning: FALSE
        error: TRUE
editor: visual
---

In this document, we investigate whether the Shannon diversity and Observed species of samples is correlated to their sequencing depth.  

## Load libraries  
```{r}
library(phyloseq)
library(microbiome)
library(ggpubr) # stats with viz
library(patchwork) # combine plots 
```

## Read Data  
```{r}
pseq <- readRDS("data/FuentesIliGutData.rds")
```

## Check Reads  
```{r}
# add reads/sample to sample_data()
sample_data(pseq)$reads.per.sample <- sample_sums(pseq)
# Compare between groups
pseq |> 
    microbiome::meta() |> 
    ggplot(aes(ILI, reads.per.sample)) +
    geom_violin(aes(color=ILI))+
    geom_boxplot(aes(color=ILI), width=0.2)+
    labs(x="", y="Reads/sample")
```

## Alpha diversity  
```{r}
# we can add shannon diversity to sample_data(pseq)
sample_data(pseq)$shannon <- microbiome::diversity(pseq, index="shannon")[,1]
sample_data(pseq)$observed <- microbiome::richness(pseq, index="observed")[,1]
p.shannon <- meta(pseq) |> 
    ggplot(aes(ILI, shannon)) +
    geom_violin(aes(color=ILI, shannon))+
    geom_boxplot(aes(color=ILI, shannon), width=0.2) +
    labs(x="", y="Shannon Diversity")+
    theme_bw()
p.observed <- meta(pseq) |> 
    ggplot(aes(ILI, observed)) +
    geom_violin(aes(color=ILI, shannon))+
    geom_boxplot(aes(color=ILI, shannon), width=0.2) +
    labs(x="", y="Observed ASVs")+
    theme_bw()

(p.shannon | p.observed ) + 
    plot_layout(guides = "collect") +
    plot_annotation(tag_levels = "a")
```

## Correlation  
Check if reads per sample is correlated to Shannon diversity and Observed ASVs. 

```{r}
p.shanon.cor <- meta(pseq) |> 
    ggplot(aes(reads.per.sample, shannon, group=ILI)) +
    geom_point(aes(color=ILI), alpha=0.5) +
    geom_smooth(method = "lm", aes(color=ILI), show.legend = FALSE) +
    ggpubr::stat_cor(method = "spearman", aes_string(color=primary_var), show.legend = FALSE,
                     label.x.npc = 0.5, label.y.npc = 0.3, hjust = 0)+
    labs(x="Reads/sample", y="Shannon Diversity") +
    theme_bw()

p.observed.cor <- meta(pseq) |> 
    ggplot(aes(reads.per.sample, observed, group=ILI)) +
    geom_point(aes(color=ILI), alpha=0.5) +
    geom_smooth(method = "lm", aes(color=ILI), show.legend = FALSE) +
    ggpubr::stat_cor(method = "spearman", aes_string(color=primary_var), show.legend = FALSE,
                     label.x.npc = 0.5, label.y.npc = 0.3, hjust = 0) +
    labs(x="Reads/sample", y="Observed ASVs") +
    theme_bw()

(p.shanon.cor | p.observed.cor) + 
    plot_layout(guides = "collect") +
    plot_annotation(tag_levels = "a")
```

```{r}
sessionInfo()
```

As you may notice this is one of the common tasks in a microbiota profiling study. This can be boring if you’re constantly writing or copying the same code.

But what if you could automate these tasks and save yourself time and effort? In a recent post, Mike Mahoney demonstrated How to use Quarto for Parameterized Reporting. This inspired me to try Quarto for automating my analysis.

One of the handy features of Quarto is the params option, which can be used in the .qmd file to customize your analysis for each project. This option is also available in the Rmarkdown format. Here’s an example of how to use params in .qmd:

---
title: "Microbiota Analysis"
author: Sudarshan A. Shetty
date: "2023-01-08"
format:
  html:
    code-tools: true
    code-fold: false
    code-copy: true
    self-contained: true
    toc-location: left
knitr: 
    opts_chunk: 
        collapse: TRUE
        dpi: 150
        message: FALSE
        warning: FALSE
        error: TRUE
editor: visual

params:
    pseq_path: "data/FuentesIliGutData.rds"
    primary_var: "ILI"
---


```{r}
library(phyloseq)
library(microbiome)
library(ggpubr) # stats with viz
library(patchwork) # combine plots 
```

## Usage of parameters  
The parameters can be passed on simply as follows at the start of the code chunk
```{r}
pseq_path = params$pseq_path
primary_var = params$primary_var
```


## Read Data
```{r}
pseq <- readRDS(pseq_path)
print(pseq)
```
And rest of the analysis...

Did you notice the pseq_path: “data/FuentesIliGutData.rds” and primary_var: ILI variables in the params section of .qmd file? These are floating variables that can be customized for each project. For every new data set, you can change the pseq_path and primary_var and render the .qmd file.

To make it even easier to reuse this analysis, you can save the file for example basic_analysis.qmd in your RProject directory and write a simple function called build_my_report() that processes the .qmd file and renders it as an html report.

Here’s a draft of what the full basic_analysis.qmd might look like.

---
title: "Microbiota Analysis"
author: Sudarshan A. Shetty
date: "2023-01-08"
format:
  html:
    code-tools: true
    code-fold: false
    code-copy: true
    self-contained: true
    toc-location: left
knitr: 
    opts_chunk: 
        collapse: TRUE
        dpi: 150
        message: FALSE
        warning: FALSE
        error: TRUE
editor: visual
params:
    pseq_path: pseq_path
    primary_var: primary_var
---

In this document, we investigate whether the Shannon diversity and Observed ASVs of samples is correlated to their sequencing depth.  
## Load libraries

```{r}
library(phyloseq)
library(microbiome)
library(ggpubr) # stats with viz
library(patchwork) # combine plots 
```

## Define parameters  
```{r}
pseq_path = params$pseq_path
primary_var = params$primary_var
```


## Read Data
```{r}
pseq <- readRDS(pseq_path)
```

## Check Reads
```{r}
# add reads/sample to sample_data()
sample_data(pseq)$reads.per.sample <- sample_sums(pseq)
# Compare between groups

# notice here the use of ase_string instead of aes for ggplot. Since our primary variable 'primary_var' will be in quotes, we can use this handy option in ggplot2  
pseq |> 
    microbiome::meta() |> 
    ggplot(aes_string(primary_var, "reads.per.sample")) +
    geom_violin(aes_string(color=primary_var))+
    geom_boxplot(aes_string(color=primary_var), width=0.2)+
    labs(x="", y="Reads/sample")+
    theme_bw()
```


## Alpha diversity
```{r}
# we can add shannon diversity to sample_data(pseq)
sample_data(pseq)$shannon <- microbiome::diversity(pseq, index="shannon")[,1]
sample_data(pseq)$observed <- microbiome::richness(pseq, index="observed")[,1]
p.shannon <- meta(pseq) |> 
    ggplot(aes_string(primary_var, "shannon")) +
    geom_violin(aes_string(color=primary_var))+
    geom_boxplot(aes_string(color=primary_var), width=0.2) +
    labs(x="", y="Shannon Diversity")+
    theme_bw()
p.observed <- meta(pseq) |> 
    ggplot(aes_string(primary_var, "observed")) +
    geom_violin(aes_string(color=primary_var))+
    geom_boxplot(aes_string(color=primary_var), width=0.2) +
    labs(x="", y="Observed ASVs")+
    theme_bw()

(p.shannon | p.observed ) + 
    plot_layout(guides = "collect") +
    plot_annotation(tag_levels = "a")
```

## Correlation
Check if reads per sample is correlated to Shannon diversity
```{r}

p.shanon.cor <- meta(pseq) |> 
    ggplot(aes_string("reads.per.sample", "shannon", group=primary_var)) +
    geom_point(aes_string(color=primary_var), alpha=0.5) +
    geom_smooth(method = "lm", aes_string(color=primary_var), show.legend = FALSE) +
    ggpubr::stat_cor(method = "spearman", aes_string(color=primary_var), show.legend = FALSE,
                     label.x.npc = 0.5, label.y.npc = 0.3, hjust = 0)+
    labs(x="Reads/sample", y="Shannon Diversity") +
    theme_bw() 

p.observed.cor <- meta(pseq) |> 
    ggplot(aes_string("reads.per.sample", "observed", group=primary_var)) +
    geom_point(aes_string(color=primary_var), alpha=0.5) +
    geom_smooth(method = "lm", aes_string(color=primary_var), show.legend = FALSE) +
    ggpubr::stat_cor(method = "spearman", aes_string(color=primary_var), show.legend = FALSE,
                     label.x.npc = 0.5, label.y.npc = 0.3, hjust = 0) +
    labs(x="Reads/sample", y="Observed ASVs") +
    theme_bw() 


(p.shanon.cor | p.observed.cor) + 
    plot_layout(guides = "collect") +
    plot_annotation(tag_levels = "a")
```

```{r}
sessionInfo()
```

Notice that the ILI variable has been replaced with primary_var in the params and in the plotting aes_string. This allows you to customize the analysis for each project simply by changing the values of these floating variables in your build_my_report() function.

Below is the build_my_report() function.

build_my_report <- function(qmd_file = NULL,
                            pseq_path = NULL,
                            primary_var = NULL) {
    
    {
        # extracting the name of the source .qmd file
        base_name <- qmd_file
        base_name <- gsub(".qmd", "", base_name)
        # render the file 
        quarto::quarto_render(input = qmd_file,
                              # We add the suffix '_report' to output html file
                              output_file = paste0(base_name, "_report.html"),
                              quiet = FALSE,
                              # supply the params arguments 
                              execute_params = list("pseq_path" = pseq_path,
                                                    "primary_var" = primary_var),
                              execute_debug = FALSE,
                              cache = NULL,
                              cache_refresh = FALSE,
                              debug = FALSE)
    }
}

The build_my_report(qmd_file, pseq_path, primary_var) function takes three arguments:

qmd_file is the name of the .qmd template file (e.g. basic_analysis.qmd).
pseq_path is the path to the phyloseq object that you want to analyze.
primary_var is the name of the primary variable you want to investigate. In this case, ILI is the column name in the sample_data() of the FuentesIliGutData phyloseq object.

You can create an RProject and organize your files. Here is one simple example.

Structure of your `RProject`:   
my_project/   
    |----- my_project.Rproj # short description of what it contains.
    |----- basic_analysis.qmd # Template .qmd file.  
    |----- build_my_report.R # R file with function to render report from .qmd file. 
    |----- data/ # A folder where data is stored.  
            |----- FuentesIliGutData.rds

After this you can simply source the function from the file build_my_report.R in your session.

source("build_my_report.R") # source function to run

And run build_my_report() by supplying the necessary files/info to the arguments.

# run function 
build_my_report(qmd_file = "basic_analysis.qmd",
                pseq_path = "data/FuentesIliGutData.rds",
                primary_var = "ILI")

Note

Check the rendered html file in the tab Example Report above!

I think that Quarto is a useful tool for microbiome researchers looking to streamline their workflows and save time on routine analyses. By using the params option in the .qmd file, you can easily customize your analysis for each project, while the ability to store your .qmd file in your RProject directory means you can easily reuse your analysis across multiple projects. The reports can be standardized and shared easily with collaborators. There are several aesthetic and practical settings like hiding codes, defining visualization themes, etc as well as using the knitr::opts_chunk for setting global rules for your documents. Moreover, beyond html, reports can be generated as pdf or word doc files. You can familiarize yourself with Quarto quide and explore the numerous options for automating your data analysis workflows. Give it a try and see how it can change the way you approach your routine microbiome data analytics.