Code
library(readxl) # import Excel file
library(janitor) # clean variable names
library(tidyverse) # data manipulation, exploration and visualization This report is the capstone project for the DS4OWD-002 course and applies the skills learned in data management and visualisation to a real-world cholera surveillance dataset. This project uses clinical surveillance data from a pragmatic trial conducted in Uvira, Democratic Republic of the Congo, between 2017 and 2021 (Gallandat 2024). The trial evaluated the impact of improved piped water infrastructure on cholera and diarrhoeal disease incidence in an urban, cholera-endemic setting.
One of the datasets analysed here contains individual-level records for 4,556 admissions for acute watery diarrhoea at the cholera treatment centre in Uvira. For each admission, key variables include age, sex, date of admission, residential cluster, rapid diagnostic test (RDT) result, treatment outcome, and trial time indicators. In addition, two related datasets extracted from other sheets of the same Excel workbook provide monthly, cluster-level indicators of piped water service quality and a city-wide measure of water service continuity.
The main objective of this analysis is to describe temporal and spatial patterns in suspected and RDT-confirmed cholera cases in Uvira over the period 2017–2021, as well as to explore how these patterns relate to indicators of piped water service quality and continuity. The analysis focuses on monthly trends in suspected admissions and RDT positivity, heterogeneity between residential clusters, and simple visual comparisons with water service quality and continuity indicators, as a first step towards understanding potential impacts of improvements in piped water services on cholera risk.
To ensure a transparent and reproducible workflow, the analysis followed recommended data-management practices (this course, Wilson (2017), Grolemund (2023)). The raw data were preserved exactly as received and kept in a dedicated directory without modification. All subsequent processing steps were performed on a copy of the raw file, ensuring that the original data remain intact and permanently accessible. The raw data files were also stored in more than one location to provide additional security GitHub repository.
Thus, the data analysed in this report is stored as a xlsx file in the data/processed directory. All analyses were conducted in R within Posit Cloud, using the readxl, janitor, and tidyverse (including lubridate, purrr, ggplot2, tidyr, forcats…) packages for data import, cleaning, manipulation, and visualisation.
library(readxl) # import Excel file
library(janitor) # clean variable names
library(tidyverse) # data manipulation, exploration and visualization Data were imported from the data/raw directory and stored as an object named clinical_raw (clinical surveillance data), index_raw (quality index of the water), and continuity_raw (average proportion of daily hours during which pumps were functioning at the water treatment and pumping station).
clinical_raw <- read_excel(here::here("/cloud/project/data/raw/Dataset_2017-2021_OSF.xlsx"), sheet = 2) # use the second sheet, where clinical surveillance data are stored
index_raw <- read_excel(here::here("/cloud/project/data/raw/Dataset_2017-2021_OSF.xlsx"), sheet = 4) # use the fourth sheet, where quality water data are stored
continuity_raw <- read_excel(here::here("/cloud/project/data/raw/Dataset_2017-2021_OSF.xlsx"), sheet = 7) # use the seventh sheet, where average proportion of daily hours during which pumps were functioning at the water treatment and pumping stationAnd each imported object was briefly inspected using to verify the structure and variable types:
datasets_raw <- list( # all datasets stored in a list
clinical_raw = clinical_raw,
index_raw = index_raw,
continuity_raw = continuity_raw)
glimpse(datasets_raw) # check column names and types of each datasetList of 3
$ clinical_raw : tibble [4,559 × 11] (S3: tbl_df/tbl/data.frame)
..$ id : num [1:4559] 11065 11066 11067 11068 11069 ...
..$ date_admit: POSIXct[1:4559], format: "2017-01-01" "2017-01-01" ...
..$ age : num [1:4559] 67 39 3 3 24 ...
..$ sex : chr [1:4559] "F" "F" "F" "M" ...
..$ cluster : num [1:4559] 14 9 11 9 16 6 11 10 11 11 ...
..$ confirmed : chr [1:4559] "FALSE" "FALSE" "FALSE" "FALSE" ...
..$ outcome : chr [1:4559] "NA" "NA" "discharged" "discharged" ...
..$ month : num [1:4559] 1 1 1 1 1 1 1 1 1 1 ...
..$ year : num [1:4559] 2017 2017 2017 2017 2017 ...
..$ trial_time: logi [1:4559] FALSE FALSE FALSE FALSE FALSE FALSE ...
..$ ctc : chr [1:4559] "ctc" "ctc" "ctc" "ctc" ...
$ index_raw : tibble [60 × 17] (S3: tbl_df/tbl/data.frame)
..$ Month: POSIXct[1:60], format: "2017-01-01" "2017-02-01" ...
..$ 1 : num [1:60] 7.68 6.76 7.68 7.68 6.72 ...
..$ 2 : num [1:60] 15.6 15.6 15.6 15.6 15.6 ...
..$ 3 : num [1:60] 21.4 20.6 20.6 20.5 20.8 ...
..$ 4 : num [1:60] 9.29 9.29 9.29 9.29 9.29 ...
..$ 5 : num [1:60] 6.72 6.72 6.72 6.72 6.72 ...
..$ 6 : num [1:60] 13.5 13.5 13.5 13.5 13.5 ...
..$ 7 : num [1:60] 25 25 25 25 25 25 25 25 25 25 ...
..$ 8 : num [1:60] 23.9 23.9 23.9 23.9 23.9 ...
..$ 9 : num [1:60] 25 25 25 25 25 25 25 25 25 25 ...
..$ 10 : num [1:60] 38.1 37.5 33.3 33.3 32.6 ...
..$ 11 : num [1:60] 24 23.7 23.8 23.5 23.6 ...
..$ 12 : num [1:60] 10.28 10.15 10.28 6.88 6.88 ...
..$ 13 : num [1:60] 29.7 29.9 30 29.8 29.7 ...
..$ 14 : num [1:60] 8.76 8.76 8.76 8.76 8.76 ...
..$ 15 : num [1:60] 18.9 18.9 18.9 18.9 18.9 ...
..$ 16 : num [1:60] 6.72 6.72 6.72 6.72 6.72 ...
$ continuity_raw: tibble [60 × 2] (S3: tbl_df/tbl/data.frame)
..$ Month : POSIXct[1:60], format: "2017-01-01" "2017-02-01" ...
..$ All clusters (city-level): num [1:60] 65.3 60.4 59.7 61.5 60.6 ...
The raw clinical dataset (clinical_raw) was then transformed into an analysis-ready format. This process involved applying consistent and informative variable names, replacing arbitrary missing-value codes with standard NA representations, and standardising variable types (e.g., dates, numeric values, and categorical factors). All cleaning and transformation steps were performed through scripted procedures, each accompanied by concise comments documenting its purpose, ensuring that the full data-preparation pipeline can be rerun, inspected, or adapted. Records with missing cluster, year, or admission date were removed, yielding a minimally cleaned dataset saved as clinical_processed, containing 4,556 complete observations.
The RDT confirmation variable (confirmed) was recoded from the original character values (“TRUE”, “FALSE”, and missing) into a binary indicator (1 = RDT positive, 0 = RDT negative, NA = not tested). The treatment outcome variable (outcome) was converted to a categorical factor corresponding to discharge status, treatment refusal, transfer, and death.
clinical_processed <- clinical_raw |>
janitor::clean_names() |> # ensure safe, consistent column names, more info here: https://cran.r-project.org/web/packages/janitor/vignettes/janitor.html
mutate(
id = as.character(id), # keep id as character (not numeric)
date_admit = as.Date(date_admit), # ensure admission date is a Date
age = as.numeric(age), # age as numeric
sex = factor(sex), # sex as factor (F/M)
cluster = as.integer(cluster), # cluster as integer ID
confirmed = dplyr::case_when(confirmed == "TRUE" ~ 1L,
confirmed == "FALSE" ~ 0L,
TRUE ~ NA_integer_), # anything unexpected -> NA
month = as.integer(month), # month as integer
year = as.integer(year), # year as integer
trial_time = as.numeric(trial_time), # trial time as numeric
ctc = factor(ctc), # ctc as factor
outcome = na_if(outcome, "NA"), # NA in outcome as a character
outcome = factor(outcome), # outcome as a factor
outcome = forcats::fct_recode( # rename for readable purposes
outcome,
"Death" = "death",
"Death on arrival" = "death_on_arrival",
"Discharged" = "discharged",
"Left before discharge" = "left_before_discharged",
"Transferred" = "transferred"
),
outcome = forcats::fct_relevel( # order levels in a meaningful sequence
outcome,
"Discharged",
"Transferred",
"Left before discharge",
"Death",
"Death on arrival")
) |>
filter( # minimal clean: drop rows missing key info
!is.na(cluster),
!is.na(year),
!is.na(date_admit)
)
glimpse(clinical_processed) # quick check of datasetRows: 4,556
Columns: 11
$ id <chr> "11065", "11066", "11067", "11068", "11069", "11070", "1107…
$ date_admit <date> 2017-01-01, 2017-01-01, 2017-01-02, 2017-01-02, 2017-01-03…
$ age <dbl> 67.000000, 39.000000, 3.000000, 3.000000, 24.000000, 14.000…
$ sex <fct> F, F, F, M, F, M, F, M, F, M, M, F, M, F, M, F, F, F, M, M,…
$ cluster <int> 14, 9, 11, 9, 16, 6, 11, 10, 11, 11, 11, 11, 2, 13, 4, 11, …
$ confirmed <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, NA…
$ outcome <fct> NA, NA, Discharged, Discharged, Discharged, Discharged, Dis…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,…
$ trial_time <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ ctc <fct> ctc, ctc, ctc, ctc, ctc, ctc, ctc, ctc, ctc, ctc, ctc, ctc,…
In parallel, the water service quality index sheet (index_raw) was processed to obtain a monthly, cluster-level indicator. Column names were cleaned, the month column was converted to a Date variable, and auxiliary year and month variables were derived. This resulted in index_processed, a processed dataset with one row per month and cluster-level index values expressed on a 0–100 scale.
index_processed <- index_raw |>
janitor::clean_names() |> # digit column names are prefixed with "x"
rename(
month_date = month # rename month column to month_date
) |>
mutate(
month_date = as.Date(month_date), # ensure month_date is a Date
year = year(month_date), # extract year, more info about handling dates here: https://lubridate.tidyverse.org/index.html
month = month(month_date) # extract month (1–12)
)
glimpse(index_processed)Rows: 60
Columns: 19
$ month_date <date> 2017-01-01, 2017-02-01, 2017-03-01, 2017-04-01, 2017-05-01…
$ x1 <dbl> 7.6767, 6.7563, 7.6767, 7.6767, 6.7179, 6.7179, 6.7179, 6.7…
$ x2 <dbl> 15.649, 15.649, 15.649, 15.649, 15.649, 15.649, 15.649, 15.…
$ x3 <dbl> 21.388, 20.612, 20.612, 20.517, 20.767, 21.294, 21.388, 21.…
$ x4 <dbl> 9.2922, 9.2922, 9.2922, 9.2922, 9.2922, 9.2922, 9.2922, 9.2…
$ x5 <dbl> 6.7179, 6.7179, 6.7179, 6.7179, 6.7179, 6.7179, 6.7179, 6.7…
$ x6 <dbl> 13.465, 13.465, 13.465, 13.465, 13.465, 13.465, 13.465, 13.…
$ x7 <dbl> 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,…
$ x8 <dbl> 23.912, 23.912, 23.912, 23.912, 23.912, 23.911, 23.912, 23.…
$ x9 <dbl> 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25, 25,…
$ x10 <dbl> 38.087, 37.500, 33.255, 33.255, 32.637, 34.448, 36.306, 32.…
$ x11 <dbl> 23.960, 23.663, 23.848, 23.515, 23.570, 23.809, 23.796, 24.…
$ x12 <dbl> 10.2820, 10.1510, 10.2820, 6.8784, 6.8784, 6.8162, 6.8784, …
$ x13 <dbl> 29.656, 29.940, 30.022, 29.806, 29.656, 30.719, 30.286, 25.…
$ x14 <dbl> 8.7637, 8.7637, 8.7637, 8.7637, 8.7637, 8.7637, 8.7637, 8.7…
$ x15 <dbl> 18.852, 18.852, 18.852, 18.852, 18.852, 18.852, 18.852, 18.…
$ x16 <dbl> 6.7179, 6.7179, 6.7179, 6.7179, 6.7179, 6.7179, 6.7179, 6.7…
$ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,…
$ month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7,…
The continuity sheet (continuity_raw) was processed in a similar way to derive a city-wide water service continuity indicator. Column names were cleaned, the month column was converted to a Date variable, and year and month variables were added. The continuity variable was explicitly cast to numeric, yielding continuity_processed, a monthly time series of the proportion of daily hours during which pumps were functioning at the main water treatment and pumping station.
continuity_processed <- continuity_raw |>
janitor::clean_names() |>
rename(
month_date = month,
continuity = all_clusters_city_level
) |>
mutate(
month_date = as.Date(month_date),
continuity = as.numeric(continuity), # ensure continuity is numeric
year = year(month_date),
month = month(month_date)
)
glimpse(continuity_processed) Rows: 60
Columns: 4
$ month_date <date> 2017-01-01, 2017-02-01, 2017-03-01, 2017-04-01, 2017-05-01…
$ continuity <dbl> 65.33, 60.35, 59.73, 61.55, 60.65, 58.17, 60.75, 57.78, 55.…
$ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,…
$ month <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7,…
Missing values in the water service indices were not removed at the initial data-processing stage for two main reasons:
NA would discard months that are still informative for most clusters, or months with valid clinical data but incomplete water service information.colSums(is.na(index_processed)) # checking if and where the NA aremonth_date x1 x2 x3 x4 x5 x6
0 3 2 3 3 2 2
x7 x8 x9 x10 x11 x12 x13
2 3 3 2 2 2 2
x14 x15 x16 year month
2 2 2 0 0
colSums(is.na(continuity_processed))month_date continuity year month
0 3 0 0
Instead, missing values were addressed during analysis on a case-by-case basis, by excluding incomplete observations where necessary, while avoiding unnecessary data loss or imputation.
Finallly, all processed datasets (clinical_processed, index_processed, and continuity_processed) were then written to the data/processed directory as CSV files to ensure reproducibility and to separate raw from analysis-ready data.
datasets_processed <- list( # List of processed datasets to be saved
clinical_processed = clinical_processed,
index_processed = index_processed,
continuity_processed = continuity_processed
)
purrr::iwalk( # write each dataset to data/processed as a separate CSV, more info here: https://purrr.tidyverse.org/index.html
datasets_processed,
~ readr::write_csv(
.x,
here::here("data/processed", paste0(.y, ".csv"))
)
)To address questions about temporal and spatial patterns, the individual-level line list was aggregated to the cluster–month level. For each combination of year, month, and residential cluster, the following quantities were computed: the number of suspected cholera admissions, the number of patients tested with RDT, the number of RDT-positive cases, and the proportion of positive tests (RDT positivity). This aggregation provided a compact time series at cluster level and facilitates visualization of trends and heterogeneity.
clinical_monthly <- clinical_processed |>
group_by(year, month, cluster) |>
summarise(
n_admissions = n(), # all suspected cases
n_tested = sum(!is.na(confirmed)), # patients with RDT done
n_confirmed = sum(confirmed == 1, na.rm = TRUE), # RDT-positive cases
rdt_positivity = if_else(n_tested > 0,
n_confirmed / n_tested, NA_real_) # avoid division by 0
) |>
ungroup()A simple set of summary statistics was calculated to describe overall admissions, testing, and RDT positivity across the study period.
clinical_summary_table <- clinical_monthly |>
summarise(
total_admissions = sum(n_admissions, na.rm = TRUE), # total suspected admissions
total_tested = sum(n_tested, na.rm = TRUE), # total RDT tested
total_confirmed = sum(n_confirmed, na.rm = TRUE), # total RDT confirmed
overall_positivity = round((total_confirmed / total_tested)*100, 1) # positivity (%)
)
clinical_summary_table_display <- clinical_summary_table |>
rename(
"Total admissions" = total_admissions, # renaming for readable purposes
"Total tested (RDT)" = total_tested,
"Total confirmed (RDT positive)" = total_confirmed,
"Overall RDT positivity (%)" = overall_positivity
)
knitr::kable(
clinical_summary_table_display,
align = rep("c", ncol(clinical_summary_table_display))
)| Total admissions | Total tested (RDT) | Total confirmed (RDT positive) | Overall RDT positivity (%) |
|---|---|---|---|
| 4556 | 3309 | 1503 | 45.4 |
To link the cluster-level water service quality index with the monthly clinical data and facilitate plotting, the processed index dataset was reshaped from wide to long format. In this structure, each row corresponds to a unique combination of month and cluster, with a single column (index_value) containing the water service quality index.
index_long <- index_processed |>
pivot_longer(
cols = starts_with("x"), # cluster columns (x1, x2, ...)
names_to = "cluster_raw", # original cluster column names
values_to = "index_value" # water service quality index (0–100)
) |>
mutate(
# Extract numeric cluster ID from names like "x1", "x2", ...
cluster = readr::parse_number(cluster_raw),
index_value = as.numeric(index_value)
)
glimpse(index_long)Rows: 960
Columns: 6
$ month_date <date> 2017-01-01, 2017-01-01, 2017-01-01, 2017-01-01, 2017-01-0…
$ year <dbl> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017…
$ month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2…
$ cluster_raw <chr> "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10…
$ index_value <dbl> 7.6767, 15.6490, 21.3880, 9.2922, 6.7179, 13.4650, 25.0000…
$ cluster <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, …
Table 1 summarises overall admissions, testing, and RDT-confirmed cholera cases over the study period. Between 2017 and 2021, a total of 4,556 suspected cholera admissions were recorded, of which 3,309 (≈73%) underwent RDT testing. Among those tested, 1,503 were RDT positive, corresponding to an overall RDT positivity of about 45%, indicating that roughly one third of all suspected admissions met the definition of confirmed cholera.
To describe overall trends in suspected cholera burden, monthly admissions were summed across all clusters. This highlights the timing and magnitude of cholera peaks over the study period.
clinical_monthly_overall <- clinical_monthly |>
group_by(year, month) |>
summarise(
n_admissions = sum(n_admissions, na.rm = TRUE) # total admissions per month
) |>
ungroup() |>
mutate(
date = as.Date(paste(year, month, "01", sep = "-")) # first day of each month
)
ggplot(data = clinical_monthly_overall,
mapping = aes(x = date,
y = n_admissions)) +
geom_line() +
scale_x_date(
date_breaks = "1 year", # labels: one per year
date_labels = "%Y",
date_minor_breaks = "1 month" # minor grid/ticks: every month
) +
labs(
x = "Date of admission",
y = "Number of suspected cholera admissions"
)
As shown in Figure 1, suspected cholera admissions vary substantially over time, with distinct peaks and troughs between 2017 and 2021. The series is characterized by several sharp peaks, separated by periods of much lower admission numbers, suggesting strong seasonality or intermittent outbreak dynamics. The marked reduction in suspected cholera admissions during most of 2021 is consistent with several contextual changes reported in Uvira. A mass oral cholera vaccination campaign was implemented in mid and late 2020 with high administrative coverage and moderate-to-high coverage in subsequent household surveys, which is likely to have reduced cholera incidence during the following year. In addition, a second cholera treatment unit opened in 2019, so some patients with acute watery diarrhoea may have sought care outside the facility captured in this line list. Ongoing improvements in piped water supply and a modest increase in the proportion of households using REGIDESO taps may also have contributed to lower transmission, although this analysis is descriptive and cannot establish causality.
To approximate trends in confirmed cholera, RDT positivity was calculated as the proportion of tested patients with a positive RDT result in each month.
clinical_monthly_overall_pos <- clinical_monthly |>
group_by(year, month) |>
summarise(
n_tested = sum(n_tested, na.rm = TRUE),
n_confirmed = sum(n_confirmed, na.rm = TRUE)
) |>
ungroup() |>
mutate(
date = as.Date(paste(year, month, "01", sep = "-")),
rdt_positivity = if_else(
n_tested > 0,
n_confirmed / n_tested,
NA_real_
)
)
ggplot(data = clinical_monthly_overall_pos,
mapping = aes(x = date,
y = rdt_positivity)) +
geom_line() +
scale_x_date(
date_breaks = "1 year", # labels: one per year
date_labels = "%Y",
date_minor_breaks = "1 month" # minor grid/ticks: every month
) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(x = "Date of admission",
y = "RDT positivity (%)")
Monthly cholera RDT positivity in Uvira fluctuated markedly over the study period, typically ranging between about 20% and 60%, with only a few months approaching 0% (Figure 2). This indicates that, in most months, a substantial proportion of tested patients met the definition of confirmed cholera.
Because the intervention and water distribution network operate at the cluster (neighbourhood) level, it was important to observe how suspected cholera burden varied between clusters.
clinical_cluster_total <- clinical_monthly |>
group_by(cluster) |>
summarise(
n_admissions_total = sum(n_admissions, na.rm = TRUE)
) |>
ungroup()
ggplot(data = clinical_cluster_total,
mapping = aes(x = factor(cluster),
y = n_admissions_total)) +
geom_col() +
scale_y_continuous(expand = expansion(mult = c(0, 0.05))) +
labs(x = "Cluster",
y = "Total suspected cholera admissions (2017–2021)")
Figure 3 indicates substantial variation in total suspected cholera admissions between clusters. A small number of clusters (9, 11,13, 16) account for a disproportionate share of admissions, suggesting that local environmental or infrastructural conditions may concentrate cholera risk in specific neighbourhoods. On the contrary, clusters 7 and 8 hold the lowest number of cholera admissions.
clinical_monthly |>
mutate(
date = as.Date(paste(year, month, "01", sep = "-"))
) |>
ggplot(mapping = aes(x = date, y = n_admissions)) +
geom_line() +
scale_x_date(
date_breaks = "1 year",
date_labels = "%Y"
) +
facet_wrap(~ cluster, scales = "free_y") +
labs(x = "Month of admission",
y = "Number of suspected admissions") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) # text readable
Figure Figure 4 shows that temporal patterns differ across clusters, with some neighborhoods experiencing repeated peaks. In particular, a subset of clusters shows sustained or recurrent transmission, suggesting persistent local risk factors. These patterns are most pronounced in clusters 3, 4, 5, 6, 9, 11, 13, and 16 which show repeated high peaks over multiple years, consistent with sustained or recurrent transmission rather than isolated outbreaks. In contrast, clusters 7 and 8 exhibit low and infrequent admissions, suggesting either genuinely lower risk or possible under-ascertainment. The marked heterogeneity between clusters indicates that cholera risk is highly focal within the study area and that targeted interventions in the high-burden neighbourhoods (particularly cluster 16) could yield gains in reducing overall case burden.
Treatment outcomes were summarised to provide a simple description of clinical severity among admitted patients. As shown in Figure 5, most admitted patients were discharged, while a smaller proportion died or were transferred to another hospital unit. These outcome distributions provide additional context for interpreting temporal and spatial patterns in suspected and confirmed cholera cases.
clinical_outcome_summary <- clinical_processed |> # summary of outcomes with proportions
filter(!is.na(outcome)) |> # exclude cases without defined outcome
count(outcome) |>
mutate(
prop = n / sum(n, na.rm = TRUE), # compute proportion for each outcome
prop_label = scales::percent(prop, accuracy = 1) # formatted label (e.g. "65%")
)
# Manual color palette with clinical meaning
outcome_cols <- c(
"Discharged" = "#4CAF50", # green (favorable outcome)
"Transferred" = "#FFC107", # amber (intermediate/uncertain)
"Left before discharge" = "#FF9800", # orange (risk)
"Death" = "#D32F2F", # red (severe outcome)
"Death on arrival" = "#B71C1C" # darker red (most severe)
)
ggplot(
data = clinical_outcome_summary,
mapping = aes(
x = outcome,
y = prop,
fill = outcome
)
) +
geom_col() +
geom_text(
aes(label = prop_label), # show percentage label above each bar
vjust = -0.3, # vertical offset above bar
size = 3 # text size
) +
scale_y_continuous(
labels = scales::percent_format(accuracy = 1),
expand = expansion(mult = c(0, 0.1)) # add headroom for labels above bars
) +
scale_fill_manual(
values = outcome_cols,
name = "Treatment outcome"
) +
labs(
x = "Treatment outcome",
y = "Proportion of admissions"
) +
theme(axis.text.x = element_text(size = 8))
After examining the spatial distribution of suspected cholera admissions by cluster, an exploratory analysis was undertaken to assess whether clusters with lower cholera burden tended to coincide with higher values of the piped water service quality index. The aim was not to establish causality, but to explore whether any clear visual pattern could be observed between monthly admissions and water service quality across neighborhoods.
Figure 6 shows the relationship between the monthly water service quality index and suspected cholera admissions across clusters. In most clusters, admissions occur over a relatively narrow range of index values and no clear monotonic association is evident, although some high-burden clusters (4, 5, 6, 14, 16) show higher admissions at lower index values. Other clusters (e.g. 3, 9, 11, 13) continue to experience substantial numbers of admissions even at high index values above 20%, indicating that improvements in this service quality indicator alone are not sufficient to eliminate cholera risk. Overall, this exploratory visual analysis does not support a simple linear or threshold relationship between the index and cholera admissions, and no causal conclusions can be drawn from these descriptive data.
clinical_with_index <- clinical_monthly |> # joining monthly clinical admissions with the water service quality index
left_join(
index_long |>
select(year, month, cluster, index_value),
by = c("year", "month", "cluster")
) |>
filter(
!is.na(index_value) # NA values are removed
)
ggplot(
data = clinical_with_index,
mapping = aes(
x = index_value, # water service quality index
y = n_admissions # number of suspected admissions
)
) +
geom_point(alpha = 0.6) + # points with some transparency
geom_smooth(
method = "loess",
se = FALSE
) + # smooth trend line
facet_wrap(~ cluster, scales = "free_y") + # one panel per cluster
labs(
x = "Water service quality index (0–100%)",
y = "Number of suspected cholera admissions per month"
)
To compare monthly suspected cholera admissions across quartiles of the water service quality index, Figure 7 shows that median admissions are relatively similar across categories, with substantial overlap between interquartile ranges. This suggests no clear graded decrease in admissions with higher index values. Months classified in the “High” category include both low and high admission counts, indicating that, in these data, higher values of the index are not consistently associated with lower cholera burden. No simple dose–response relationship can be inferred from this exploratory comparison.
clinical_with_index_cat <- clinical_with_index |> # Create index categories (quartiles)
mutate(
index_cat = cut(
index_value,
breaks = quantile(index_value, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
include.lowest = TRUE,
labels = c("Very low", "Low", "Medium", "High")
)
)
ggplot(
data = clinical_with_index_cat,
mapping = aes(
x = index_cat,
y = n_admissions
)
) +
geom_boxplot() +
labs(
x = "Water service quality index category",
y = "Monthly suspected cholera admissions"
)
Lastly, Figure 8 shows the relationship between monthly continuity of piped water service and the total number of suspected cholera admissions. Overall, there is no clear linear association: high admission counts are observed across a wide range of continuity values. The loess curve suggests a modest decline in admissions between roughly 55% and 65% continuity, but admissions remain highly variable at all levels, and several months with relatively high continuity still coincide with substantial cholera burden. These descriptive patterns do not support a simple dose–response relationship between continuity and cholera admissions, and no causal interpretation can be drawn from this exploratory analysis.
clinical_overall_with_cont <- clinical_monthly_overall |> # joining overall monthly admissions with the continuity indicator
left_join(
continuity_processed |>
select(year, month, continuity), # join on year and month
by = c("year", "month")
) |>
filter(
!is.na(continuity)
)
ggplot(
data = clinical_overall_with_cont,
mapping = aes(
x = continuity,
y = n_admissions
)
) +
geom_point() +
geom_smooth(
method = "loess", # more info here https://r-statistics.co/Loess-Regression-With-R.html
se = FALSE
) +
labs(
x = "Continuity of water service (% of daily hours with functioning pumps)",
y = "Total monthly suspected cholera admissions"
)
Burden and positivity. From 2017–2021, 4,556 suspected cholera admissions were recorded, about three quarters were RDT-tested, and ≈45% of those tested were positive, indicating a substantial confirmed cholera burden.
Temporal and spatial variation. Admissions showed clear peaks and troughs over time and were highly concentrated in a few clusters (especially 9, 11, 13, and 16), while others (7 and 8) had very few cases, consistent with focal transmission.
Treatment outcomes. Most admitted patients were discharged, and deaths or transfers represented a small proportion of recorded outcomes, suggesting substantial morbidity but relatively low recorded in-facility mortality in these data.
Water indicators and cholera burden. Simple visual comparisons did not show a clear monotonic or dose–response relationship between piped water service quality/continuity and cholera admissions; high case counts occurred across a wide range of index and continuity values.
Interpretation and limits. This is a purely descriptive, visually based analysis and cannot establish or exclude causal effects of water supply improvements. These findings should therefore be interpreted alongside the original trial results (Gallandat 2024), where formal modelling did identify inverse associations between water service quality/continuity and cholera incidence.