Wastewater and environmental surveillance (WES) is a powerful public health tool that helps us gain insight into the health of communities. By systematically collecting, testing, and analysing wastewater, faecal sludge, and contaminated surface water samples, we can better understand disease trends in the community that contributes to those samples. WES is more efficient than clinical disease surveillance because a single sample can represent hundreds to thousands of people. It is not biased by access to healthcare, healthcare-seeking behavior, or symptoms. For these reasons, WES is particularly well-suited to low-resource settings, where access to healthcare and financial resources is limited. The goal of this analysis is to better understand the leading causes of disease burden in Africa that can be monitored using WES.
The data for this analysis was obtained from the World Health Organization (WHO) Global Health Estimates (GHE) for 2021 (World Health Organization 2024). First, the Africa region top-20 summary table was saved as a .csv file and used to identify which WES-identifiable causes of disease to include in the analysis. Of the top 20, the top 10 causes of DALYs (disability-adjusted life years) for Africa were evaluated for suitability, based on a literature search.
Next, the GHE 2021 Summary Tables by country for all ages were cleaned up in Excel to remove all countries and causes that were not relevant to this project. The single-sex data was also removed. I then transposed columns to rows in Excel to get country-level disease burdens for each of the selected causes. The cells in the GHE 2021 Summary Tables spreadsheet are color-coded to denote data quality and completeness, so I created a new variable called data_quality and assigned a quality level to each country’s data according to its cell color. Lastly, I saved the resulting data as a .csv file. (Note: I completed some of these steps in Excel prior to learning how to do them in R.)
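The Excel steps described above could also be scripted in R. What follows is only a minimal sketch on a made-up table: the objects `ghe_raw` and `quality_lookup`, the country names, and all values are hypothetical placeholders, not the layout of the actual GHE workbook. Cell colors cannot be read from a .csv file, so the sketch assumes the quality ratings have been typed into a small lookup table by hand.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the raw sheet: one row per cause, one column per country.
ghe_raw <- tibble(
  cause     = c("all_cause", "malaria", "stroke"),
  Country_A = c(100, 20, 5),
  Country_B = c(200, 10, 8)
)

# Hypothetical lookup table replacing the spreadsheet's color coding.
quality_lookup <- tibble(
  country      = c("Country_A", "Country_B"),
  data_quality = c("low", "high")
)

# Causes to keep (illustrative subset).
keep_causes <- c("all_cause", "malaria")

ghe_clean <- ghe_raw |>
  filter(cause %in% keep_causes) |>                      # drop irrelevant causes
  pivot_longer(cols = -cause,
               names_to = "country",
               values_to = "DALYs") |>                   # countries become rows
  pivot_wider(names_from = cause,
              values_from = DALYs) |>                    # one column per cause
  left_join(quality_lookup, by = "country")              # attach data_quality
```

The result has one row per country and one column per retained cause, which is the shape the rest of the analysis assumes for ghe_afr.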
I first loaded the packages needed for the analysis, then imported the datasets from .csv files.
library(tidyverse)
library(ggthemes)
library(here)
library(gt)

ghe_afr <- read_csv(here::here("data/raw/ghe2021_daly_bycountry_2021_afr.csv"))
ghe_top20_afr <- read_csv(here::here("data/raw/ghe_top20_DALYs_AFR_2021.csv"))
#FacilityTypes_2024 <- read_csv(here::here("data/raw/JMP_AFR_san_FacilityTypes_2024.csv"))
#ServiceLevels_2024 <- read_csv(here::here("data/raw/JMP_AFR_san_ServiceLevels_2024.csv"))
#FacilityTypes_2021 <- read_csv(here::here("data/raw/JMP_AFR_san_FacilityTypes_2021.csv"))
#ServiceLevels_2021 <- read_csv(here::here("data/raw/JMP_AFR_san_ServiceLevels_2021.csv"))

Once the data was imported, I pivoted the country-level disease burden data so that each observation contained only one “cause” and one measurement of disease burden, “DALYs”.
ghe_afr_long <- ghe_afr |>
  pivot_longer(cols = all_cause:covid19,
               names_to = "cause",
               values_to = "DALYs")

To select WES-identifiable causes of disease to include in the analysis, the top 10 causes of DALYs for Africa, shown in Table 1 below, were considered. A literature search was conducted to identify which of the top 10 causes could be tracked, at least partly, using WES.
ghe_top20_afr |>
  slice_head(n = 11) |>  # the "All Causes" row plus the top 10 causes
  gt() |>
  tab_footnote(footnote = md("Source: Global Health Estimates 2021: Disease burden by Cause, Age, Sex, by Country and by Region, 2000-2021. Geneva, World Health Organization; 2024.")) |>
  tab_style(
    style = cell_text(color = "red"),  # show the selected causes in red
    locations = cells_body(rows = c(2, 3, 5, 6, 8, 9)))

| Rank | Cause | DALYs (000s) | % DALY | DALYs per 100,000 |
|---|---|---|---|---|
| 0 | All Causes | 599504 | 100.0 | 50871.2 |
| 1 | Lower respiratory infections | 51910 | 8.7 | 4404.8 |
| 2 | Malaria | 50101 | 8.4 | 4251.3 |
| 3 | Preterm birth complications | 36729 | 6.1 | 3116.6 |
| 4 | Diarrhoeal diseases | 36503 | 6.1 | 3097.5 |
| 5 | HIV/AIDS | 26945 | 4.5 | 2286.4 |
| 6 | Birth asphyxia and birth trauma | 26762 | 4.5 | 2270.9 |
| 7 | Tuberculosis | 25901 | 4.3 | 2197.9 |
| 8 | COVID-19 | 17566 | 2.9 | 1490.6 |
| 9 | Stroke | 15207 | 2.5 | 1290.4 |
| 10 | Road injury | 13915 | 2.3 | 1180.8 |

Source: Global Health Estimates 2021: Disease burden by Cause, Age, Sex, by Country and by Region, 2000-2021. Geneva, World Health Organization; 2024.

Of the top 10 causes, the following were selected (also shown in red in Table 1), based on evidence from the literature that the pathogens that are at least partly responsible for these diseases can be detected in wastewater, faecal sludge, and/or surface water samples:
Lower respiratory infections (Tiwari et al. 2025)
Malaria (Diamond et al. 2023; Reboud et al. 2019)
Diarrhoeal diseases (Huang et al. 2022; Tubatsi and Kebaabetswe 2022; Saasa et al. 2024)
HIV/AIDS (Wolfe et al. 2024; Alshehri, Birch, and Greaves 2025)
Tuberculosis (Jensen 1954; Mtetwa et al. 2022, 2023)
COVID-19 (Medema et al. 2020; Maposa et al. 2025; Barnes et al. 2023)
Together, these 6 causes were responsible for nearly 35% of all DALYs in Africa in 2021. See Table 2 and Figure 1 for a summary of DALYs from these 6 causes in African countries.
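As a quick arithmetic check of the "nearly 35%" figure, the share can be recomputed from the regional totals in Table 1. This is a minimal sketch: the vector `wes_dalys` and the scalar `all_cause_dalys` are names introduced here for illustration, and the values are copied by hand from Table 1 rather than read from the data files.

```r
# DALYs (000s) for the six selected causes, copied from Table 1 (Africa, 2021).
wes_dalys <- c(
  lri      = 51910,
  malaria  = 50101,
  dd       = 36503,
  hiv_aids = 26945,
  tb       = 25901,
  covid19  = 17566
)

# DALYs (000s) for all causes combined, from the "All Causes" row of Table 1.
all_cause_dalys <- 599504

# Percentage of all DALYs attributable to the six selected causes.
wes_share <- 100 * sum(wes_dalys) / all_cause_dalys
round(wes_share, 1)
```

The result lands just under 35%, in line with the sum of the rounded % DALY column for the same six rows (8.7 + 8.4 + 6.1 + 4.5 + 4.3 + 2.9 = 34.9).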
# Remove the all_cause rows before computing summary statistics
ghe_afr_long_filtered <- ghe_afr_long |>
  filter(cause != "all_cause")

ghe_afr_long_filtered |>
  group_by(cause) |>
  summarise(min = min(DALYs),
            max = max(DALYs),
            mean = mean(DALYs),
            median = median(DALYs),
            sd = sd(DALYs)) |>
  gt() |>
  fmt_number(decimals = 2) |>
  cols_label(
    cause = "Cause",
    min = "Minimum",
    max = "Maximum",
    mean = "Mean",
    median = "Median",
    sd = "Std Deviation")

| Cause | Minimum | Maximum | Mean | Median | Std Deviation |
|---|---|---|---|---|---|
| covid19 | 1.97 | 4,758.44 | 470.60 | 211.03 | 904.91 |
| dd | 0.19 | 12,834.97 | 716.73 | 273.21 | 1,795.12 |
| hiv_aids | 0.01 | 3,697.89 | 503.94 | 169.13 | 798.87 |
| lri | 1.31 | 15,724.62 | 1,045.22 | 473.36 | 2,253.53 |
| malaria | 0.00 | 16,901.82 | 941.62 | 253.74 | 2,457.95 |
| tb | 0.06 | 6,857.44 | 498.38 | 162.28 | 1,052.48 |
ggplot(data = ghe_afr_long_filtered,
       mapping = aes(x = cause,
                     y = DALYs,
                     fill = cause)) +
  geom_boxplot() +
  theme_minimal() +
  theme(legend.position = "none") +
  coord_cartesian(ylim = c(0, 1500)) +  # zoom in; outliers above 1,500 are not shown
  labs(x = "Cause of DALYs",
       y = "DALYs (000s)") +
  theme(axis.title.x = element_text(vjust = -1))
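The boxplot code above zooms the y-axis with coord_cartesian() rather than setting scale limits. A minimal sketch on made-up values (the data frame `toy` is hypothetical, not GHE data) suggests why that choice matters when a few countries have very large DALY counts: coord_cartesian() only changes the visible window, whereas scale limits drop out-of-range values before the box statistics are computed.

```r
library(ggplot2)

# Made-up values with one extreme outlier (illustrative only).
toy <- data.frame(group = "a", value = c(1, 2, 3, 4, 100))

base_plot <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot()

# Zoom: all five values still feed into the box statistics.
zoomed <- base_plot + coord_cartesian(ylim = c(0, 10))

# Scale limits: the value 100 is treated as missing and removed first.
clipped <- base_plot + scale_y_continuous(limits = c(0, 10))

# layer_data() returns the computed statistics; `middle` is the median line.
median_zoomed  <- layer_data(zoomed)$middle
median_clipped <- suppressWarnings(layer_data(clipped)$middle)
```

Comparing `median_zoomed` with `median_clipped` shows that the clipped version reports a lower median, because the outlier was discarded before the statistics were calculated.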
I added variables to the ghe_afr dataset to better understand what percentage of all DALYs in a country could be attributed to causes that could be at least partially monitored using WES. This data is plotted as a bar graph in Figure 2, shown below.
# Total DALYs from the six WES-identifiable causes
ghe_afr_expanded <- ghe_afr |>
  mutate(causes_WES = tb + hiv_aids + dd + malaria + lri + covid19)

# Share of each country's all-cause DALYs that those causes represent
ghe_afr_expanded2 <- ghe_afr_expanded |>
  mutate(causes_WES_percent = 100 * causes_WES / all_cause)

ggplot(data = ghe_afr_expanded2,
       aes(x = reorder(country, causes_WES_percent),
           y = causes_WES_percent,
           fill = country)) +
  geom_col(position = position_dodge(width = 0.8)) +
  theme_minimal() +
  theme(legend.position = "none") +
  theme(axis.text.x = element_text(angle = 90, vjust = .5, hjust = 1)) +
  labs(x = "Country", y = "Percentage of DALYs")
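The percentage calculation behind Figure 2 can also be written against a named vector of causes, which makes it easier to change the list later. This is a sketch under stated assumptions: the tibble `toy`, its country names, and its values are made up for illustration, and `wes_causes` is a helper name introduced here. Only the column names follow the ghe_afr dataset.

```r
library(dplyr)

# Columns to sum; these match the cause columns used in the analysis.
wes_causes <- c("tb", "hiv_aids", "dd", "malaria", "lri", "covid19")

# Made-up example data with the same column names as ghe_afr.
toy <- tibble(
  country   = c("Country_A", "Country_B"),
  all_cause = c(40, 100),
  tb        = c(1, 5),
  hiv_aids  = c(2, 5),
  dd        = c(3, 5),
  malaria   = c(4, 5),
  lri       = c(5, 5),
  covid19   = c(5, 5)
)

toy_pct <- toy |>
  mutate(
    # Row-wise sum over the selected cause columns.
    causes_WES         = rowSums(across(all_of(wes_causes))),
    causes_WES_percent = 100 * causes_WES / all_cause
  ) |>
  arrange(desc(causes_WES_percent))  # highest share first
```

In this toy example, Country_A has 20 of 40 DALYs from the selected causes (50%) and Country_B has 30 of 100 (30%), so Country_A sorts first.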
WHO rates the GHE data for quality and completeness. Of the 54 countries in Africa, 49 have GHE data rated as poor, as shown in Figure 3 below. Using WES data to supplement the GHE data could give countries more information on which to base health policy.
# Order the quality levels from worst to best so the bars plot in that order
qual_levels <- c("very low",
                 "low",
                 "medium",
                 "high")

data_quality_levels <- ghe_afr |>
  mutate(data_quality = factor(data_quality, levels = qual_levels))

ggplot(data = data_quality_levels,
       mapping = aes(x = data_quality,
                     fill = data_quality)) +
  geom_bar() +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(
    stat = "count",
    aes(label = after_stat(count)),  # print the country count above each bar
    vjust = -0.5) +
  labs(x = "Data Quality", y = "# of countries")
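The counts shown on the bars can also be tabulated directly, which is a convenient way to check the figure against the text. This is a minimal sketch on made-up data: the tibble `toy_quality` and its country names are hypothetical, and only the factor levels follow the analysis above.

```r
library(dplyr)

qual_levels <- c("very low", "low", "medium", "high")

# Made-up example: three countries with hand-assigned quality ratings.
toy_quality <- tibble(
  country      = c("Country_A", "Country_B", "Country_C"),
  data_quality = factor(c("low", "low", "high"), levels = qual_levels)
)

# .drop = FALSE keeps levels with zero countries in the output.
quality_counts <- toy_quality |>
  count(data_quality, .drop = FALSE)
```

Keeping the empty levels makes it obvious when a rating category has no countries at all, which a bar chart can hide.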
WES has the potential to help identify communities where infectious diseases are present or increasing (as relevant), so that ministries of health can more efficiently target those communities with public health interventions.
Countries with a higher percentage of DALYs attributed to causes that could be monitored using WES may achieve more impact from WES data.
Using WES data to supplement the GHE data, which is of poor quality in the vast majority of African countries, could give ministries of health more information on which to base health policy.