library(tidyverse)
library(here)
library(dplyr)
library(readr)
library(stringr)
library(countrycode)
library(gt)
library(sf)
library(rnaturalearth)
library(viridis)
library(ggplot2)
library(patchwork)Global Screening of Wastewater Pollution Risk from Treatment Plants
Introduction
Wastewater treatment is a critical public-health and environmental protection measure, yet treatment performance and monitoring remain uneven across countries. SDG Target 6.3 aims, by 2030, to improve water quality by reducing pollution, eliminating dumping, minimizing releases of hazardous chemicals and materials, halving the proportion of untreated wastewater, and substantially increasing recycling and safe reuse globally. Despite progress, a substantial share of domestic wastewater is still not safely treated, including cases where treatment is limited to primary processes or where effluent does not meet applicable discharge standards(Jones et al. 2021). This project uses the HydroWASTE dataset to develop a screening-level “risk score” for wastewater treatment plants (WWTPs), combining treatment level, population served, and receiving-water dilution potential to identify locations where potential impacts may be higher. The results provide a transparent and replicable basis for comparing countries and highlighting where improved data availability, stronger treatment performance, or targeted investment may be most needed
Methods
Dataset
This project uses HydroWASTE, a global, spatially explicit database of 58,502 WWTPs compiled from national and regional sources and harmonized to a consistent global structure. The dataset provides WWTP locations and key operational attributes such as treatment level, population served, treated wastewater discharge, and linkage to the river network via an estimated outfall location. Importantly for this project, HydroWASTE also provides an estimated dilution factor (DF), defined as the ratio of natural discharge in the receiving water body to the WWTP effluent discharge, which can be used as a screening proxy for dilution capacity and potential exposure conditions (Ehalt Macedo et al. 2022).
Data Cleaning and Processing
To create an analysis-ready dataset, the following steps were applied:
Variable selection
Retained only fields required for the screening analysis (treatment level, population served, wastewater discharge, dilution factor/river linkage, and identifiers), and dropped non-essential qualitative fields.
Operational status harmonization
Collapsed the original status categories into Operational and Not operational. To support assessment of both current and potential future risk, planned/under-construction/proposed facilities were treated as Operational, while closed/decommissioned/Not operational plants were treated as Not operational.
Missing values and zeros
Standardized blanks and “Not reported” as missing where applicable. For key numeric fields where zero values likely indicate “unknown/not provided,” 0 values were converted to NA (notably for population served and treated wastewater discharge).
DF and river-network edge cases
Because DF can be blank for coastal/large-lake outfalls and some records are not linked to a mapped river reach, a “df-case” category was created (DF available/computed; coastal/large-lake; not river-linked). This enabled consistent, transparent handling of DF-related missingness during scoring.
After cleaning, the scored dataset was restricted to records classified as operational and with non-missing inputs required to compute the risk score.
raw_data <- read_csv(here::here("data/raw/hydrowaste_raw.csv"))processed_data <- raw_data |>
select(WASTE_ID, COUNTRY, CNTRY_ISO,STATUS, POP_SERVED, WASTE_DIS, LEVEL, DF,HYRIV_ID,RIVER_DIS, COAST_10KM, COAST_50KM) |>
mutate( STATUS = case_when(
STATUS %in% c("Closed","Decommissioned", "Non-Operational") ~ "Not operational",
TRUE ~ "Operational"
),
POP_SERVED = na_if(POP_SERVED, 0),
WASTE_DIS = na_if(WASTE_DIS, 0),
HYRIV_ID = na_if(as.character(HYRIV_ID), ""),
DF = na_if(as.character(DF), ""),
DF = parse_number(DF)) |>
mutate(river_connected = !is.na(HYRIV_ID),
df_case = case_when(
is.na(WASTE_DIS) ~ "df_missing_zero_discharge",
!river_connected & (COAST_10KM == 1 | COAST_50KM == 1) ~ "coastal_or_lake",
!river_connected ~ "not_in_river_network",
river_connected & is.na(DF) ~ "df_missing_other",
TRUE ~ "df_computed"))
write_csv(processed_data,
here::here("data/processed/my-processed-data.csv"))Risk score methodology
A screening level wastewater pollution risk score was constructed to combine three drivers of potential impact: treatment level (pollutant reduction potential), population served (proxy for wastewater load), and dilution factor (DF) (proxy for receiving-water dilution capacity). Each variable was converted to an ordinal sub-score and the overall score was computed additively as:
risk_score= f(treatment level) + log10 (population served) + f(dilution factor).
Country-level results were then summarized using the median risk score and the number of WWTP records to support cross-country comparison and visualization.
#checking the order of LEVEL
processed_data |>
count(LEVEL)# A tibble: 3 × 2
LEVEL n
<chr> <int>
1 Advanced 22855
2 Primary 881
3 Secondary 34766
order_LEVEL <- c("Primary", "Secondary", "Advanced")
processed_data_scored <- processed_data |>
mutate(LEVEL = factor(LEVEL, levels = order_LEVEL)) |>
filter(STATUS == "Operational") |>
filter(!is.na(POP_SERVED)) |>
filter(!is.na(WASTE_DIS)) |>
mutate(treat_score =case_when(
LEVEL == "Primary" ~ 3,
LEVEL == "Secondary" ~ 2,
LEVEL == "Advanced" ~ 1
)) |>
relocate(treat_score, .after = LEVEL) |>
mutate(log10_pop = log10(POP_SERVED)) |>
relocate(log10_pop, .after = POP_SERVED) |>
mutate(df_score_raw = case_when(
is.na(DF) ~ NA_real_,
DF < 10 ~ 3,
DF < 100 ~ 2,
DF < 1000 ~ 1,
TRUE ~ 0
),
df_score = case_when(
df_case == "df_computed" ~ df_score_raw,
df_case == "coastal_or_lake" ~ 0, #assumption: high dilution
df_case == "not_in_river_network" ~ 1, #assumption: neutral
TRUE ~ NA_real_)) |>
relocate(df_case, df_score_raw, df_score, .after = DF) |>
mutate(pop_score = ntile(log10_pop, 4) -1 ) |> #Population score as dataset relative quartiles (0-3)
relocate(pop_score, .after = log10_pop) |>
filter(!is.na(treat_score), !is.na(pop_score), !is.na(df_score)) |> mutate(risk_score = treat_score + pop_score + df_score)Results
Table 1 ranks the top 20 countries with the highest WWTP risk scores based on the median risk score across plants in each country. The highest median score (9.0) is observed for Yemen, Burkina Faso, and Ethiopia (8.75). Because several countries in the top ranks have very small sample sizes (n=1–4), these results should be interpreted as a screening signal (flagging potential concern) rather than a definitive national performance ranking, while countries with larger counts (e.g., China, Jordan) provide more stable estimates.
tab_top_risk <- processed_data_scored |>
mutate(
iso3_clean = recode(CNTRY_ISO, "XXK" = "XKX")) |> #Fixing the error 'Some values were not matched unambiguously: XKX
mutate(
COUNTRY = countrycode(iso3_clean, "iso3c", "country.name", warn = FALSE),
region = countrycode(iso3_clean, "iso3c", "region", warn = FALSE)) |>
group_by(region, COUNTRY, CNTRY_ISO) |>
summarise(
n_wwtp = n(),
median_risk = median(risk_score, na.rm = TRUE),
mean_risk = mean(risk_score, na.rm = TRUE),
.groups = "drop") |> #Dropping groups makes the result easier/safer to use for tables, plotting
filter(!is.na(region)) |>
arrange(desc(median_risk)) |>
slice_head(n=20)
tab_top_risk |>
gt() |>
tab_header(
title = "Top 20 countries by WWTP risk score",
subtitle = "Ranked by median risk score (HydroWASTE, processed dataset)") |>
cols_label(
region = "Region",
COUNTRY = "Country",
CNTRY_ISO = "ISO3",
n_wwtp = "WWTPs (n)",
median_risk = "Median risk score",
mean_risk = "Mean risk score") |>
fmt_number(columns = c(median_risk, mean_risk), decimals = 3) |>
tab_source_note(source_note = "Source: HydroWASTE (processed and scored by author).")| Top 20 countries by WWTP risk score | |||||
| Ranked by median risk score (HydroWASTE, processed dataset) | |||||
| Region | Country | ISO3 | WWTPs (n) | Median risk score | Mean risk score |
|---|---|---|---|---|---|
| Middle East & North Africa | Yemen | YEM | 1 | 9.000 | 9.000 |
| Sub-Saharan Africa | Burkina Faso | BFA | 1 | 9.000 | 9.000 |
| Sub-Saharan Africa | Ethiopia | ETH | 4 | 9.000 | 8.750 |
| East Asia & Pacific | Indonesia | IDN | 16 | 8.000 | 7.562 |
| Sub-Saharan Africa | Congo - Kinshasa | COD | 3 | 8.000 | 7.333 |
| Sub-Saharan Africa | Ghana | GHA | 4 | 8.000 | 8.000 |
| Sub-Saharan Africa | Senegal | SEN | 2 | 8.000 | 8.000 |
| East Asia & Pacific | Mongolia | MNG | 2 | 7.500 | 7.500 |
| Latin America & Caribbean | Venezuela | VEN | 12 | 7.500 | 7.000 |
| Middle East & North Africa | Jordan | JOR | 28 | 7.500 | 7.143 |
| South Asia | Nepal | NPL | 8 | 7.500 | 7.125 |
| Sub-Saharan Africa | Uganda | UGA | 10 | 7.500 | 7.600 |
| East Asia & Pacific | China | CHN | 2184 | 7.000 | 6.601 |
| East Asia & Pacific | Thailand | THA | 23 | 7.000 | 6.391 |
| Europe & Central Asia | Tajikistan | TJK | 5 | 7.000 | 6.800 |
| Latin America & Caribbean | Cuba | CUB | 14 | 7.000 | 5.500 |
| Latin America & Caribbean | Guatemala | GTM | 8 | 7.000 | 7.250 |
| Middle East & North Africa | Lebanon | LBN | 1 | 7.000 | 7.000 |
| Middle East & North Africa | Morocco | MAR | 17 | 7.000 | 6.647 |
| Middle East & North Africa | Palestinian Territories | PSE | 10 | 7.000 | 7.200 |
| Source: HydroWASTE (processed and scored by author). | |||||
Table 2 shows that the lowest risk category is dominated by high-income countries in Europe & Central Asia plus North America and Oceania, with New Zealand and Iceland showing the lowest median risk score of 2.0. Most of the remaining “safest” countries have a median risk score of 3.0, including large datasets such as the United States (13,539 WWTPs), France (2,993), and Italy (2,859), suggesting broadly lower screening-level risk across many facilities rather than a result driven by a handful of plants. A few entries have very small sample sizes (e.g., Fiji, British Virgin Islands, Tonga), so those rankings should be interpreted as indicative given limited WWTP coverage rather than definitive national performance.
top20_safest <- processed_data_scored |>
mutate(
iso3_clean = recode(CNTRY_ISO, "XXK" = "XKX")) |> #Fixing the error 'Some values were not matched unambiguously: XKX
mutate(
COUNTRY = countrycode(iso3_clean, "iso3c", "country.name", warn = FALSE),
region = countrycode(iso3_clean, "iso3c", "region", warn = FALSE)) |>
group_by(region, COUNTRY, CNTRY_ISO) |>
summarise(
n_wwtp = n(),
median_risk = median(risk_score, na.rm = TRUE),
mean_risk = mean(risk_score, na.rm = TRUE),
.groups = "drop") |> #Dropping groups makes the result easier/safer to use for tables, plotting
filter(!is.na(region)) |>
arrange((median_risk), desc(n_wwtp)) |> # if there is a tie in median risk, the one with more WWTPs appears first
slice_head(n=20)
top20_safest |>
gt() |>
tab_header(
title = "Bottom 20 countries by WWTP risk score(safest)",
subtitle = "Ranked by median risk score (HydroWASTE, processed dataset)") |>
cols_label(
region = "Region",
COUNTRY = "Country",
CNTRY_ISO = "ISO3",
n_wwtp = "WWTPs (n)",
median_risk = "Median risk score",
mean_risk = "Mean risk score") |>
fmt_number(columns = c(median_risk, mean_risk), decimals = 3) |>
tab_source_note(source_note = "Source: HydroWASTE (processed and scored by author).")| Bottom 20 countries by WWTP risk score(safest) | |||||
| Ranked by median risk score (HydroWASTE, processed dataset) | |||||
| Region | Country | ISO3 | WWTPs (n) | Median risk score | Mean risk score |
|---|---|---|---|---|---|
| East Asia & Pacific | New Zealand | NZL | 154 | 2.000 | 2.734 |
| Europe & Central Asia | Iceland | ISL | 6 | 2.000 | 2.667 |
| North America | United States | USA | 13539 | 3.000 | 3.227 |
| Europe & Central Asia | France | FRA | 2993 | 3.000 | 3.286 |
| Europe & Central Asia | Italy | ITA | 2859 | 3.000 | 3.352 |
| North America | Canada | CAN | 1225 | 3.000 | 3.043 |
| East Asia & Pacific | Australia | AUS | 791 | 3.000 | 3.601 |
| Europe & Central Asia | Hungary | HUN | 725 | 3.000 | 2.946 |
| Europe & Central Asia | Switzerland | CHE | 667 | 3.000 | 3.033 |
| Europe & Central Asia | Austria | AUT | 623 | 3.000 | 3.255 |
| Europe & Central Asia | Romania | ROU | 532 | 3.000 | 3.489 |
| Europe & Central Asia | Slovakia | SVK | 264 | 3.000 | 3.273 |
| Europe & Central Asia | Sweden | SWE | 251 | 3.000 | 3.155 |
| Europe & Central Asia | Norway | NOR | 124 | 3.000 | 2.855 |
| Europe & Central Asia | Slovenia | SVN | 85 | 3.000 | 3.529 |
| Europe & Central Asia | Latvia | LVA | 58 | 3.000 | 3.000 |
| Europe & Central Asia | Estonia | EST | 35 | 3.000 | 3.314 |
| East Asia & Pacific | Tonga | TON | 2 | 3.000 | 3.000 |
| East Asia & Pacific | Fiji | FJI | 1 | 3.000 | 3.000 |
| Latin America & Caribbean | British Virgin Islands | VGB | 1 | 3.000 | 3.000 |
| Source: HydroWASTE (processed and scored by author). | |||||
Figure 1 visualizes country-level median WWTP risk scores and data coverage simultaneously: warmer colours indicate higher median risk, while larger circles indicate countries with more WWTP records. The map broadly reinforces the table patterns, higher median risk is concentrated across parts of Sub-Saharan Africa, the Middle East & North Africa, and South/East Asia, while North America and much of Europe trend lower. Circle sizes highlight substantial differences in data coverage (e.g., very high coverage in the United States and parts of Europe/China), which is important when interpreting country rankings.
country_risk <- processed_data_scored |>
mutate(iso3_clean = recode(CNTRY_ISO, "XXK" = "XKX")) |>
group_by(iso3_clean) |>
summarise(
n_wwtp = n(),
median_risk = median(risk_score, na.rm = TRUE),
mean_risk = mean(risk_score, na.rm = TRUE),
.groups = "drop")
world <- ne_countries(scale = "medium", returnclass = "sf") |> #Loads the world countries boundary dataset from naturalearth package and returns it as an sf, a special geometry containing country polygons, medium chooses a not so detailed map resolution
select(name, iso_a3, continent, geometry)
world_risk <- world |>
left_join(country_risk, by = c("iso_a3" = "iso3_clean"))
bubbles_sf <- st_point_on_surface(world_risk) |>
st_drop_geometry() |>
bind_cols(st_point_on_surface(world_risk) |>
st_as_sf())
bubbles_sf <- st_point_on_surface(world_risk) |>
mutate(n_wwtp = world_risk$n_wwtp) |>
filter(!is.na(n_wwtp))
ggplot() +
geom_sf(data = world_risk, aes(fill = median_risk),
color = "gray80", linewidth = 0.1) +
geom_sf(
data = bubbles_sf,
aes(size = n_wwtp),
alpha = 0.45 ) +
scale_fill_viridis_c(
option = "C",
na.value = "white",
name = "Median risk") +
scale_size_continuous(
range = c(1, 10),
breaks = c(10, 50, 200, 1000),
name = "WWTP records") +
labs(
title = "Global screening: WWTP risk and data coverage",
subtitle = "Fill shows median risk score; point size reflects number of WWTP observations per country",
caption = "Source: HydroWASTE (processed and scored by author).") +
theme_void() +
theme(legend.position = "right")
Figure 2 shows a clear inverse relationship between treated wastewater discharge and receiving water dilution potential(both on log10scales). As WWTP discharge increases, the estimated DF generally decreases. This pattern implies that higher discharge plants may more often be associated with comparatively lower dilution capacity, which increases screening level pollution risk in the risk score framework as DF is one of the core components. The smoothed lines by treatment level are broadly similar suggesting that at a global screening scale, variation in dilution potential and discharge may be atleast as influential as treatment level in differentiating potential risk.
scatter_data <- processed_data_scored |>
filter(!is.na(WASTE_DIS),WASTE_DIS >0, #As the next step is log, this avids havin log 0s
!is.na(DF), DF >0,
!is.na(LEVEL),
!is.na(POP_SERVED))
scatter_data <- scatter_data |>
mutate(
log10_dis = log10(WASTE_DIS),
log10_df = log10(DF),
log10_pop = log10(POP_SERVED))
ggplot(scatter_data, aes(log10_dis, log10_df, color = LEVEL)) +
geom_point(alpha = 0.05, size = 0.6) +
geom_smooth(se = FALSE, linewidth = 1) +
labs(
title = "WWTP discharge vs dilution factor (log10-transformed)",
subtitle = "Points show individual WWTPs; lines show smoothed trend by treatment level",
x = "log10(Treated wastewater discharge, m³/day)",
y = "log10(Dilution factor, DF)",
color = "Treatment level"
) +
theme_minimal()
Conclusions
Risk patterns align with development context. Higher median risk scores are concentrated in lower-income/less-resourced settings (notably parts of Sub-Saharan Africa and MENA), while lower scores cluster in higher-income settings (Europe, North America, Australia/New Zealand), consistent with broader gaps in wastewater infrastructure and performance(Ravina et al. 2021).
Coverage and reporting remain a limiting factor. Several “highest-risk” countries in ranking are based on very small numbers of HydroWASTE records (e.g., n = 1–4), so results should be treated as indicative signals. This aligns with global SDG 6.3 reporting that domestic wastewater data and safe-treatment reporting are still incomplete and uneven across countries.
This project’s screening metric is not a compliance assessment. The risk score is designed for rapid prioritization (treatment level + population served + dilution potential). It does not replace diagnostics on effluent quality, actual compliance, industrial loads, or site-specific receiving water sensitivity.
In practice, governments and international organizations can use the Top 20 as a shortlist for follow-up (data validation, targeted monitoring, and deeper diagnostics/investment planning) and use the Bottom 20 as a benchmark for comparison while avoiding “safe” claims without additional supporting evidence.