Global Screening of Wastewater Pollution Risk from Treatment Plants

Author

Syson Siima

Published

January 27, 2026

Introduction

Wastewater treatment is a critical public-health and environmental protection measure, yet treatment performance and monitoring remain uneven across countries. SDG Target 6.3 aims, by 2030, to improve water quality by reducing pollution, eliminating dumping, minimizing releases of hazardous chemicals and materials, halving the proportion of untreated wastewater, and substantially increasing recycling and safe reuse globally. Despite progress, a substantial share of domestic wastewater is still not safely treated, including cases where treatment is limited to primary processes or where effluent does not meet applicable discharge standards(Jones et al. 2021). This project uses the HydroWASTE dataset to develop a screening-level “risk score” for wastewater treatment plants (WWTPs), combining treatment level, population served, and receiving-water dilution potential to identify locations where potential impacts may be higher. The results provide a transparent and replicable basis for comparing countries and highlighting where improved data availability, stronger treatment performance, or targeted investment may be most needed

Methods

Dataset

This project uses HydroWASTE, a global, spatially explicit database of 58,502 WWTPs compiled from national and regional sources and harmonized to a consistent global structure. The dataset provides WWTP locations and key operational attributes such as treatment level, population served, treated wastewater discharge, and linkage to the river network via an estimated outfall location. Importantly for this project, HydroWASTE also provides an estimated dilution factor (DF), defined as the ratio of natural discharge in the receiving water body to the WWTP effluent discharge, which can be used as a screening proxy for dilution capacity and potential exposure conditions (Ehalt Macedo et al. 2022).

Data Cleaning and Processing

To create an analysis-ready dataset, the following steps were applied:

Variable selection

Retained only fields required for the screening analysis (treatment level, population served, wastewater discharge, dilution factor/river linkage, and identifiers), and dropped non-essential qualitative fields.

Operational status harmonization

Collapsed the original status categories into Operational and Not operational. To support assessment of both current and potential future risk, planned/under-construction/proposed facilities were treated as Operational, while closed/decommissioned/Not operational plants were treated as Not operational.

Missing values and zeros

Standardized blanks and “Not reported” as missing where applicable. For key numeric fields where zero values likely indicate “unknown/not provided,” 0 values were converted to NA (notably for population served and treated wastewater discharge).

DF and river-network edge cases

Because DF can be blank for coastal/large-lake outfalls and some records are not linked to a mapped river reach, a “df-case” category was created (DF available/computed; coastal/large-lake; not river-linked). This enabled consistent, transparent handling of DF-related missingness during scoring.

After cleaning, the scored dataset was restricted to records classified as operational and with non-missing inputs required to compute the risk score.

library(tidyverse)
library(here)
library(dplyr)
library(readr)
library(stringr)
library(countrycode)
library(gt)
library(sf)
library(rnaturalearth)
library(viridis)
library(ggplot2)
library(patchwork)

raw_data <- read_csv(here::here("data/raw/hydrowaste_raw.csv"))

processed_data <- raw_data |> 
  select(WASTE_ID, COUNTRY, CNTRY_ISO,STATUS, POP_SERVED, WASTE_DIS, LEVEL, DF,HYRIV_ID,RIVER_DIS, COAST_10KM, COAST_50KM) |> 
  mutate( STATUS = case_when(
    STATUS %in% c("Closed","Decommissioned", "Non-Operational") ~ "Not operational",
    TRUE ~ "Operational"
  ),
  POP_SERVED = na_if(POP_SERVED, 0),
  WASTE_DIS = na_if(WASTE_DIS, 0),
  HYRIV_ID = na_if(as.character(HYRIV_ID), ""),
  DF = na_if(as.character(DF), ""),
  DF = parse_number(DF)) |> 
  mutate(river_connected = !is.na(HYRIV_ID),
    df_case = case_when(
      is.na(WASTE_DIS) ~ "df_missing_zero_discharge",
      !river_connected & (COAST_10KM == 1 | COAST_50KM == 1) ~ "coastal_or_lake",
      !river_connected ~ "not_in_river_network",
      river_connected & is.na(DF) ~ "df_missing_other",
      TRUE ~ "df_computed"))
write_csv(processed_data,
          here::here("data/processed/my-processed-data.csv"))

Risk score methodology

A screening level wastewater pollution risk score was constructed to combine three drivers of potential impact: treatment level (pollutant reduction potential), population served (proxy for wastewater load), and dilution factor (DF) (proxy for receiving-water dilution capacity). Each variable was converted to an ordinal sub-score and the overall score was computed additively as:

risk_score= f(treatment level) + log10 (population served) + f(dilution factor).

Country-level results were then summarized using the median risk score and the number of WWTP records to support cross-country comparison and visualization.

#checking the order of LEVEL
processed_data |> 
  count(LEVEL)

# A tibble: 3 × 2
  LEVEL         n
  <chr>     <int>
1 Advanced  22855
2 Primary     881
3 Secondary 34766

order_LEVEL <- c("Primary", "Secondary", "Advanced")

processed_data_scored <- processed_data |> 
  mutate(LEVEL = factor(LEVEL, levels = order_LEVEL)) |> 
  filter(STATUS == "Operational") |> 
  filter(!is.na(POP_SERVED)) |> 
  filter(!is.na(WASTE_DIS)) |> 
  mutate(treat_score =case_when(
    LEVEL == "Primary"   ~ 3,
    LEVEL == "Secondary" ~ 2,
    LEVEL == "Advanced"  ~ 1
  )) |>  
    relocate(treat_score, .after = LEVEL) |> 
  mutate(log10_pop = log10(POP_SERVED)) |> 
  relocate(log10_pop, .after = POP_SERVED) |> 
  mutate(df_score_raw = case_when(
    is.na(DF) ~ NA_real_,
    DF < 10   ~ 3,
    DF < 100  ~ 2,
    DF < 1000 ~ 1,
    TRUE ~ 0
  ),
  df_score = case_when(
    df_case == "df_computed" ~ df_score_raw,
    df_case == "coastal_or_lake" ~ 0, #assumption: high dilution
    df_case == "not_in_river_network" ~ 1, #assumption: neutral
    TRUE ~ NA_real_)) |> 
  relocate(df_case, df_score_raw, df_score, .after = DF) |> 
  mutate(pop_score = ntile(log10_pop, 4) -1 ) |>  #Population score as dataset relative quartiles (0-3)
  relocate(pop_score, .after = log10_pop) |> 
 filter(!is.na(treat_score), !is.na(pop_score), !is.na(df_score)) |> mutate(risk_score = treat_score + pop_score + df_score)

Results

Table 1 ranks the top 20 countries with the highest WWTP risk scores based on the median risk score across plants in each country. The highest median score (9.0) is observed for Yemen, Burkina Faso, and Ethiopia (8.75). Because several countries in the top ranks have very small sample sizes (n=1–4), these results should be interpreted as a screening signal (flagging potential concern) rather than a definitive national performance ranking, while countries with larger counts (e.g., China, Jordan) provide more stable estimates.

tab_top_risk <- processed_data_scored |> 
  mutate(
    iso3_clean = recode(CNTRY_ISO, "XXK" = "XKX")) |>  #Fixing the error 'Some values were not matched unambiguously: XKX
  mutate( 
     COUNTRY = countrycode(iso3_clean, "iso3c", "country.name", warn = FALSE),
    region  = countrycode(iso3_clean, "iso3c", "region", warn = FALSE)) |> 
    group_by(region, COUNTRY, CNTRY_ISO) |> 
  summarise(
    n_wwtp = n(),
    median_risk = median(risk_score, na.rm = TRUE),
    mean_risk = mean(risk_score, na.rm = TRUE),
    .groups = "drop") |>  #Dropping groups makes the result easier/safer to use for tables, plotting
  filter(!is.na(region)) |> 
  arrange(desc(median_risk)) |> 
  slice_head(n=20)

  tab_top_risk |> 
    gt() |> 
   tab_header(
    title = "Top 20 countries by WWTP risk score",
    subtitle = "Ranked by median risk score (HydroWASTE, processed dataset)") |> 
  cols_label(
    region = "Region",
    COUNTRY = "Country",
    CNTRY_ISO = "ISO3",
    n_wwtp = "WWTPs (n)",
    median_risk = "Median risk score",
    mean_risk = "Mean risk score") |> 
  fmt_number(columns = c(median_risk, mean_risk), decimals = 3) |> 
  tab_source_note(source_note = "Source: HydroWASTE (processed and scored by author).")

Table 1

Region	Country	ISO3	WWTPs (n)	Median risk score	Mean risk score
Top 20 countries by WWTP risk score
Ranked by median risk score (HydroWASTE, processed dataset)
Middle East & North Africa	Yemen	YEM	1	9.000	9.000
Sub-Saharan Africa	Burkina Faso	BFA	1	9.000	9.000
Sub-Saharan Africa	Ethiopia	ETH	4	9.000	8.750
East Asia & Pacific	Indonesia	IDN	16	8.000	7.562
Sub-Saharan Africa	Congo - Kinshasa	COD	3	8.000	7.333
Sub-Saharan Africa	Ghana	GHA	4	8.000	8.000
Sub-Saharan Africa	Senegal	SEN	2	8.000	8.000
East Asia & Pacific	Mongolia	MNG	2	7.500	7.500
Latin America & Caribbean	Venezuela	VEN	12	7.500	7.000
Middle East & North Africa	Jordan	JOR	28	7.500	7.143
South Asia	Nepal	NPL	8	7.500	7.125
Sub-Saharan Africa	Uganda	UGA	10	7.500	7.600
East Asia & Pacific	China	CHN	2184	7.000	6.601
East Asia & Pacific	Thailand	THA	23	7.000	6.391
Europe & Central Asia	Tajikistan	TJK	5	7.000	6.800
Latin America & Caribbean	Cuba	CUB	14	7.000	5.500
Latin America & Caribbean	Guatemala	GTM	8	7.000	7.250
Middle East & North Africa	Lebanon	LBN	1	7.000	7.000
Middle East & North Africa	Morocco	MAR	17	7.000	6.647
Middle East & North Africa	Palestinian Territories	PSE	10	7.000	7.200
Source: HydroWASTE (processed and scored by author).

Table 2 shows that the lowest risk category is dominated by high-income countries in Europe & Central Asia plus North America and Oceania, with New Zealand and Iceland showing the lowest median risk score of 2.0. Most of the remaining “safest” countries have a median risk score of 3.0, including large datasets such as the United States (13,539 WWTPs), France (2,993), and Italy (2,859), suggesting broadly lower screening-level risk across many facilities rather than a result driven by a handful of plants. A few entries have very small sample sizes (e.g., Fiji, British Virgin Islands, Tonga), so those rankings should be interpreted as indicative given limited WWTP coverage rather than definitive national performance.

top20_safest <- processed_data_scored |> 
  mutate(
    iso3_clean = recode(CNTRY_ISO, "XXK" = "XKX")) |>  #Fixing the error 'Some values were not matched unambiguously: XKX
  mutate( 
     COUNTRY = countrycode(iso3_clean, "iso3c", "country.name", warn = FALSE),
    region  = countrycode(iso3_clean, "iso3c", "region", warn = FALSE)) |> 
    group_by(region, COUNTRY, CNTRY_ISO) |> 
  summarise(
    n_wwtp = n(),
    median_risk = median(risk_score, na.rm = TRUE),
    mean_risk = mean(risk_score, na.rm = TRUE),
    .groups = "drop") |>  #Dropping groups makes the result easier/safer to use for tables, plotting
  filter(!is.na(region)) |> 
  arrange((median_risk), desc(n_wwtp)) |> # if there is a tie in median risk, the one with more WWTPs appears first 
  slice_head(n=20)

  top20_safest |> 
    gt() |> 
   tab_header(
    title = "Bottom 20 countries by WWTP risk score(safest)",
    subtitle = "Ranked by median risk score (HydroWASTE, processed dataset)") |> 
  cols_label(
    region = "Region",
    COUNTRY = "Country",
    CNTRY_ISO = "ISO3",
    n_wwtp = "WWTPs (n)",
    median_risk = "Median risk score",
    mean_risk = "Mean risk score") |> 
  fmt_number(columns = c(median_risk, mean_risk), decimals = 3) |> 
  tab_source_note(source_note = "Source: HydroWASTE (processed and scored by author).")

Table 2

Region	Country	ISO3	WWTPs (n)	Median risk score	Mean risk score
Bottom 20 countries by WWTP risk score(safest)
Ranked by median risk score (HydroWASTE, processed dataset)
East Asia & Pacific	New Zealand	NZL	154	2.000	2.734
Europe & Central Asia	Iceland	ISL	6	2.000	2.667
North America	United States	USA	13539	3.000	3.227
Europe & Central Asia	France	FRA	2993	3.000	3.286
Europe & Central Asia	Italy	ITA	2859	3.000	3.352
North America	Canada	CAN	1225	3.000	3.043
East Asia & Pacific	Australia	AUS	791	3.000	3.601
Europe & Central Asia	Hungary	HUN	725	3.000	2.946
Europe & Central Asia	Switzerland	CHE	667	3.000	3.033
Europe & Central Asia	Austria	AUT	623	3.000	3.255
Europe & Central Asia	Romania	ROU	532	3.000	3.489
Europe & Central Asia	Slovakia	SVK	264	3.000	3.273
Europe & Central Asia	Sweden	SWE	251	3.000	3.155
Europe & Central Asia	Norway	NOR	124	3.000	2.855
Europe & Central Asia	Slovenia	SVN	85	3.000	3.529
Europe & Central Asia	Latvia	LVA	58	3.000	3.000
Europe & Central Asia	Estonia	EST	35	3.000	3.314
East Asia & Pacific	Tonga	TON	2	3.000	3.000
East Asia & Pacific	Fiji	FJI	1	3.000	3.000
Latin America & Caribbean	British Virgin Islands	VGB	1	3.000	3.000
Source: HydroWASTE (processed and scored by author).

Figure 1 visualizes country-level median WWTP risk scores and data coverage simultaneously: warmer colours indicate higher median risk, while larger circles indicate countries with more WWTP records. The map broadly reinforces the table patterns, higher median risk is concentrated across parts of Sub-Saharan Africa, the Middle East & North Africa, and South/East Asia, while North America and much of Europe trend lower. Circle sizes highlight substantial differences in data coverage (e.g., very high coverage in the United States and parts of Europe/China), which is important when interpreting country rankings.

country_risk <- processed_data_scored |> 
  mutate(iso3_clean = recode(CNTRY_ISO, "XXK" = "XKX")) |> 
  group_by(iso3_clean) |> 
  summarise(
    n_wwtp = n(),
    median_risk = median(risk_score, na.rm = TRUE),
    mean_risk = mean(risk_score, na.rm = TRUE),
    .groups = "drop")

world <- ne_countries(scale = "medium", returnclass = "sf") |>  #Loads the world countries boundary dataset from naturalearth package and returns it as an sf, a special geometry containing country polygons, medium chooses a not so detailed map resolution
  select(name, iso_a3, continent, geometry)

world_risk <- world |> 
  left_join(country_risk, by = c("iso_a3" = "iso3_clean"))

bubbles_sf <- st_point_on_surface(world_risk) |> 
  st_drop_geometry() |>
  bind_cols(st_point_on_surface(world_risk) |> 
              st_as_sf())
bubbles_sf <- st_point_on_surface(world_risk) |>
  mutate(n_wwtp = world_risk$n_wwtp) |>
  filter(!is.na(n_wwtp))

ggplot() +
  geom_sf(data = world_risk, aes(fill = median_risk),
          color = "gray80", linewidth = 0.1) +
  geom_sf(
    data = bubbles_sf,
    aes(size = n_wwtp),
    alpha = 0.45 ) +
  scale_fill_viridis_c(
    option = "C",
    na.value = "white",
    name = "Median risk") +
  scale_size_continuous(
    range = c(1, 10),
    breaks = c(10, 50, 200, 1000),
    name = "WWTP records") +
  labs(
    title = "Global screening: WWTP risk and data coverage",
    subtitle = "Fill shows median risk score; point size reflects number of WWTP observations per country",
    caption = "Source: HydroWASTE (processed and scored by author).") +
  theme_void() +
  theme(legend.position = "right")

Figure 2 shows a clear inverse relationship between treated wastewater discharge and receiving water dilution potential(both on log10scales). As WWTP discharge increases, the estimated DF generally decreases. This pattern implies that higher discharge plants may more often be associated with comparatively lower dilution capacity, which increases screening level pollution risk in the risk score framework as DF is one of the core components. The smoothed lines by treatment level are broadly similar suggesting that at a global screening scale, variation in dilution potential and discharge may be atleast as influential as treatment level in differentiating potential risk.

scatter_data <- processed_data_scored |> 
  filter(!is.na(WASTE_DIS),WASTE_DIS >0, #As the next step is log, this avids havin log 0s
         !is.na(DF), DF >0,
         !is.na(LEVEL),
         !is.na(POP_SERVED))
scatter_data <- scatter_data |>
  mutate(
    log10_dis = log10(WASTE_DIS),
    log10_df  = log10(DF),
    log10_pop = log10(POP_SERVED))
ggplot(scatter_data, aes(log10_dis, log10_df, color = LEVEL)) +
  geom_point(alpha = 0.05, size = 0.6) +
  geom_smooth(se = FALSE, linewidth = 1) +
  labs(
    title = "WWTP discharge vs dilution factor (log10-transformed)",
    subtitle = "Points show individual WWTPs; lines show smoothed trend by treatment level",
    x = "log10(Treated wastewater discharge, m³/day)",
    y = "log10(Dilution factor, DF)",
    color = "Treatment level"
  ) +
  theme_minimal()

Conclusions

Risk patterns align with development context. Higher median risk scores are concentrated in lower-income/less-resourced settings (notably parts of Sub-Saharan Africa and MENA), while lower scores cluster in higher-income settings (Europe, North America, Australia/New Zealand), consistent with broader gaps in wastewater infrastructure and performance(Ravina et al. 2021).
Coverage and reporting remain a limiting factor. Several “highest-risk” countries in ranking are based on very small numbers of HydroWASTE records (e.g., n = 1–4), so results should be treated as indicative signals. This aligns with global SDG 6.3 reporting that domestic wastewater data and safe-treatment reporting are still incomplete and uneven across countries.
This project’s screening metric is not a compliance assessment. The risk score is designed for rapid prioritization (treatment level + population served + dilution potential). It does not replace diagnostics on effluent quality, actual compliance, industrial loads, or site-specific receiving water sensitivity.
In practice, governments and international organizations can use the Top 20 as a shortlist for follow-up (data validation, targeted monitoring, and deeper diagnostics/investment planning) and use the Bottom 20 as a benchmark for comparison while avoiding “safe” claims without additional supporting evidence.

References

Ehalt Macedo, Heloisa, Bernhard Lehner, Jim Nicell, Günther Grill, Jing Li, Antonio Limtong, and Ranish Shakya. 2022. “Distribution and Characteristics of Wastewater Treatment Plants Within the Global River Network.” Earth System Science Data 14 (2): 559–77. https://doi.org/10.5194/essd-14-559-2022.

Jones, Edward R., Michelle T. H. van Vliet, Manzoor Qadir, and Marc F. P. Bierkens. 2021. “Country-Level and Gridded Estimates of Wastewater Production, Collection, Treatment and Reuse.” Earth System Science Data 13 (2): 237–54. https://doi.org/10.5194/essd-13-237-2021.

Ravina, Marco, Sergio Galletta, Augustin Dagbetin, Omama Ahmed Hussein Kamaleldin, Madalitso Mng’ombe, Lameck Mnyenyembe, Alemayehu Shanko, and MariaChiara Zanetti. 2021. “Urban Wastewater Treatment in African Countries: Evidence from the Hydroaid Initiative.” Sustainability 13 (22): 12828. https://doi.org/10.3390/su132212828.