Global Screening of Wastewater Pollution Risk from Treatment Plants

Author

Syson Siima

Published

January 27, 2026

Introduction

Wastewater treatment is a critical public-health and environmental protection measure, yet treatment performance and monitoring remain uneven across countries. SDG Target 6.3 aims, by 2030, to improve water quality by reducing pollution, eliminating dumping, minimizing releases of hazardous chemicals and materials, halving the proportion of untreated wastewater, and substantially increasing recycling and safe reuse globally. Despite progress, a substantial share of domestic wastewater is still not safely treated, including cases where treatment is limited to primary processes or where effluent does not meet applicable discharge standards(Jones et al. 2021). This project uses the HydroWASTE dataset to develop a screening-level “risk score” for wastewater treatment plants (WWTPs), combining treatment level, population served, and receiving-water dilution potential to identify locations where potential impacts may be higher. The results provide a transparent and replicable basis for comparing countries and highlighting where improved data availability, stronger treatment performance, or targeted investment may be most needed

Methods

Dataset

This project uses HydroWASTE, a global, spatially explicit database of 58,502 WWTPs compiled from national and regional sources and harmonized to a consistent global structure. The dataset provides WWTP locations and key operational attributes such as treatment level, population served, treated wastewater discharge, and linkage to the river network via an estimated outfall location. Importantly for this project, HydroWASTE also provides an estimated dilution factor (DF), defined as the ratio of natural discharge in the receiving water body to the WWTP effluent discharge, which can be used as a screening proxy for dilution capacity and potential exposure conditions (Ehalt Macedo et al. 2022).

Data Cleaning and Processing

To create an analysis-ready dataset, the following steps were applied:

Variable selection

Retained only fields required for the screening analysis (treatment level, population served, wastewater discharge, dilution factor/river linkage, and identifiers), and dropped non-essential qualitative fields.

Operational status harmonization

Collapsed the original status categories into Operational and Not operational. To support assessment of both current and potential future risk, planned/under-construction/proposed facilities were treated as Operational, while closed/decommissioned/Not operational plants were treated as Not operational.

Missing values and zeros

Standardized blanks and “Not reported” as missing where applicable. For key numeric fields where zero values likely indicate “unknown/not provided,” 0 values were converted to NA (notably for population served and treated wastewater discharge).

DF and river-network edge cases

Because DF can be blank for coastal/large-lake outfalls and some records are not linked to a mapped river reach, a “df-case” category was created (DF available/computed; coastal/large-lake; not river-linked). This enabled consistent, transparent handling of DF-related missingness during scoring.

After cleaning, the scored dataset was restricted to records classified as operational and with non-missing inputs required to compute the risk score.

library(tidyverse)
library(here)
library(dplyr)
library(readr)
library(stringr)
library(countrycode)
library(gt)
library(sf)
library(rnaturalearth)
library(viridis)
library(ggplot2)
library(patchwork)
raw_data <- read_csv(here::here("data/raw/hydrowaste_raw.csv"))
processed_data <- raw_data |> 
  select(WASTE_ID, COUNTRY, CNTRY_ISO,STATUS, POP_SERVED, WASTE_DIS, LEVEL, DF,HYRIV_ID,RIVER_DIS, COAST_10KM, COAST_50KM) |> 
  mutate( STATUS = case_when(
    STATUS %in% c("Closed","Decommissioned", "Non-Operational") ~ "Not operational",
    TRUE ~ "Operational"
  ),
  POP_SERVED = na_if(POP_SERVED, 0),
  WASTE_DIS = na_if(WASTE_DIS, 0),
  HYRIV_ID = na_if(as.character(HYRIV_ID), ""),
  DF = na_if(as.character(DF), ""),
  DF = parse_number(DF)) |> 
  mutate(river_connected = !is.na(HYRIV_ID),
    df_case = case_when(
      is.na(WASTE_DIS) ~ "df_missing_zero_discharge",
      !river_connected & (COAST_10KM == 1 | COAST_50KM == 1) ~ "coastal_or_lake",
      !river_connected ~ "not_in_river_network",
      river_connected & is.na(DF) ~ "df_missing_other",
      TRUE ~ "df_computed"))
write_csv(processed_data,
          here::here("data/processed/my-processed-data.csv"))

Risk score methodology

A screening level wastewater pollution risk score was constructed to combine three drivers of potential impact: treatment level (pollutant reduction potential), population served (proxy for wastewater load), and dilution factor (DF) (proxy for receiving-water dilution capacity). Each variable was converted to an ordinal sub-score and the overall score was computed additively as:

risk_score= f(treatment level) + log10 (population served) + f(dilution factor).

Country-level results were then summarized using the median risk score and the number of WWTP records to support cross-country comparison and visualization.

#checking the order of LEVEL
processed_data |> 
  count(LEVEL)
# A tibble: 3 × 2
  LEVEL         n
  <chr>     <int>
1 Advanced  22855
2 Primary     881
3 Secondary 34766
order_LEVEL <- c("Primary", "Secondary", "Advanced")

processed_data_scored <- processed_data |> 
  mutate(LEVEL = factor(LEVEL, levels = order_LEVEL)) |> 
  filter(STATUS == "Operational") |> 
  filter(!is.na(POP_SERVED)) |> 
  filter(!is.na(WASTE_DIS)) |> 
  mutate(treat_score =case_when(
    LEVEL == "Primary"   ~ 3,
    LEVEL == "Secondary" ~ 2,
    LEVEL == "Advanced"  ~ 1
  )) |>  
    relocate(treat_score, .after = LEVEL) |> 
  mutate(log10_pop = log10(POP_SERVED)) |> 
  relocate(log10_pop, .after = POP_SERVED) |> 
  mutate(df_score_raw = case_when(
    is.na(DF) ~ NA_real_,
    DF < 10   ~ 3,
    DF < 100  ~ 2,
    DF < 1000 ~ 1,
    TRUE ~ 0
  ),
  df_score = case_when(
    df_case == "df_computed" ~ df_score_raw,
    df_case == "coastal_or_lake" ~ 0, #assumption: high dilution
    df_case == "not_in_river_network" ~ 1, #assumption: neutral
    TRUE ~ NA_real_)) |> 
  relocate(df_case, df_score_raw, df_score, .after = DF) |> 
  mutate(pop_score = ntile(log10_pop, 4) -1 ) |>  #Population score as dataset relative quartiles (0-3)
  relocate(pop_score, .after = log10_pop) |> 
 filter(!is.na(treat_score), !is.na(pop_score), !is.na(df_score)) |> mutate(risk_score = treat_score + pop_score + df_score)

 Results

Table 1 ranks the top 20 countries with the highest WWTP risk scores based on the median risk score across plants in each country. The highest median score (9.0) is observed for Yemen, Burkina Faso, and Ethiopia (8.75). Because several countries in the top ranks have very small sample sizes (n=1–4), these results should be interpreted as a screening signal (flagging potential concern) rather than a definitive national performance ranking, while countries with larger counts (e.g., China, Jordan) provide more stable estimates.

tab_top_risk <- processed_data_scored |> 
  mutate(
    iso3_clean = recode(CNTRY_ISO, "XXK" = "XKX")) |>  #Fixing the error 'Some values were not matched unambiguously: XKX
  mutate( 
     COUNTRY = countrycode(iso3_clean, "iso3c", "country.name", warn = FALSE),
    region  = countrycode(iso3_clean, "iso3c", "region", warn = FALSE)) |> 
    group_by(region, COUNTRY, CNTRY_ISO) |> 
  summarise(
    n_wwtp = n(),
    median_risk = median(risk_score, na.rm = TRUE),
    mean_risk = mean(risk_score, na.rm = TRUE),
    .groups = "drop") |>  #Dropping groups makes the result easier/safer to use for tables, plotting
  filter(!is.na(region)) |> 
  arrange(desc(median_risk)) |> 
  slice_head(n=20)

  tab_top_risk |> 
    gt() |> 
   tab_header(
    title = "Top 20 countries by WWTP risk score",
    subtitle = "Ranked by median risk score (HydroWASTE, processed dataset)") |> 
  cols_label(
    region = "Region",
    COUNTRY = "Country",
    CNTRY_ISO = "ISO3",
    n_wwtp = "WWTPs (n)",
    median_risk = "Median risk score",
    mean_risk = "Mean risk score") |> 
  fmt_number(columns = c(median_risk, mean_risk), decimals = 3) |> 
  tab_source_note(source_note = "Source: HydroWASTE (processed and scored by author).")
Table 1
Top 20 countries by WWTP risk score
Ranked by median risk score (HydroWASTE, processed dataset)
Region Country ISO3 WWTPs (n) Median risk score Mean risk score
Middle East & North Africa Yemen YEM 1 9.000 9.000
Sub-Saharan Africa Burkina Faso BFA 1 9.000 9.000
Sub-Saharan Africa Ethiopia ETH 4 9.000 8.750
East Asia & Pacific Indonesia IDN 16 8.000 7.562
Sub-Saharan Africa Congo - Kinshasa COD 3 8.000 7.333
Sub-Saharan Africa Ghana GHA 4 8.000 8.000
Sub-Saharan Africa Senegal SEN 2 8.000 8.000
East Asia & Pacific Mongolia MNG 2 7.500 7.500
Latin America & Caribbean Venezuela VEN 12 7.500 7.000
Middle East & North Africa Jordan JOR 28 7.500 7.143
South Asia Nepal NPL 8 7.500 7.125
Sub-Saharan Africa Uganda UGA 10 7.500 7.600
East Asia & Pacific China CHN 2184 7.000 6.601
East Asia & Pacific Thailand THA 23 7.000 6.391
Europe & Central Asia Tajikistan TJK 5 7.000 6.800
Latin America & Caribbean Cuba CUB 14 7.000 5.500
Latin America & Caribbean Guatemala GTM 8 7.000 7.250
Middle East & North Africa Lebanon LBN 1 7.000 7.000
Middle East & North Africa Morocco MAR 17 7.000 6.647
Middle East & North Africa Palestinian Territories PSE 10 7.000 7.200
Source: HydroWASTE (processed and scored by author).

Table 2 shows that the lowest risk category is dominated by high-income countries in Europe & Central Asia plus North America and Oceania, with New Zealand and Iceland showing the lowest median risk score of 2.0. Most of the remaining “safest” countries have a median risk score of 3.0, including large datasets such as the United States (13,539 WWTPs), France (2,993), and Italy (2,859), suggesting broadly lower screening-level risk across many facilities rather than a result driven by a handful of plants. A few entries have very small sample sizes (e.g., Fiji, British Virgin Islands, Tonga), so those rankings should be interpreted as indicative given limited WWTP coverage rather than definitive national performance.

top20_safest <- processed_data_scored |> 
  mutate(
    iso3_clean = recode(CNTRY_ISO, "XXK" = "XKX")) |>  #Fixing the error 'Some values were not matched unambiguously: XKX
  mutate( 
     COUNTRY = countrycode(iso3_clean, "iso3c", "country.name", warn = FALSE),
    region  = countrycode(iso3_clean, "iso3c", "region", warn = FALSE)) |> 
    group_by(region, COUNTRY, CNTRY_ISO) |> 
  summarise(
    n_wwtp = n(),
    median_risk = median(risk_score, na.rm = TRUE),
    mean_risk = mean(risk_score, na.rm = TRUE),
    .groups = "drop") |>  #Dropping groups makes the result easier/safer to use for tables, plotting
  filter(!is.na(region)) |> 
  arrange((median_risk), desc(n_wwtp)) |> # if there is a tie in median risk, the one with more WWTPs appears first 
  slice_head(n=20)

  top20_safest |> 
    gt() |> 
   tab_header(
    title = "Bottom 20 countries by WWTP risk score(safest)",
    subtitle = "Ranked by median risk score (HydroWASTE, processed dataset)") |> 
  cols_label(
    region = "Region",
    COUNTRY = "Country",
    CNTRY_ISO = "ISO3",
    n_wwtp = "WWTPs (n)",
    median_risk = "Median risk score",
    mean_risk = "Mean risk score") |> 
  fmt_number(columns = c(median_risk, mean_risk), decimals = 3) |> 
  tab_source_note(source_note = "Source: HydroWASTE (processed and scored by author).")
Table 2
Bottom 20 countries by WWTP risk score(safest)
Ranked by median risk score (HydroWASTE, processed dataset)
Region Country ISO3 WWTPs (n) Median risk score Mean risk score
East Asia & Pacific New Zealand NZL 154 2.000 2.734
Europe & Central Asia Iceland ISL 6 2.000 2.667
North America United States USA 13539 3.000 3.227
Europe & Central Asia France FRA 2993 3.000 3.286
Europe & Central Asia Italy ITA 2859 3.000 3.352
North America Canada CAN 1225 3.000 3.043
East Asia & Pacific Australia AUS 791 3.000 3.601
Europe & Central Asia Hungary HUN 725 3.000 2.946
Europe & Central Asia Switzerland CHE 667 3.000 3.033
Europe & Central Asia Austria AUT 623 3.000 3.255
Europe & Central Asia Romania ROU 532 3.000 3.489
Europe & Central Asia Slovakia SVK 264 3.000 3.273
Europe & Central Asia Sweden SWE 251 3.000 3.155
Europe & Central Asia Norway NOR 124 3.000 2.855
Europe & Central Asia Slovenia SVN 85 3.000 3.529
Europe & Central Asia Latvia LVA 58 3.000 3.000
Europe & Central Asia Estonia EST 35 3.000 3.314
East Asia & Pacific Tonga TON 2 3.000 3.000
East Asia & Pacific Fiji FJI 1 3.000 3.000
Latin America & Caribbean British Virgin Islands VGB 1 3.000 3.000
Source: HydroWASTE (processed and scored by author).

Figure 1 visualizes country-level median WWTP risk scores and data coverage simultaneously: warmer colours indicate higher median risk, while larger circles indicate countries with more WWTP records. The map broadly reinforces the table patterns, higher median risk is concentrated across parts of Sub-Saharan Africa, the Middle East & North Africa, and South/East Asia, while North America and much of Europe trend lower. Circle sizes highlight substantial differences in data coverage (e.g., very high coverage in the United States and parts of Europe/China), which is important when interpreting country rankings.

country_risk <- processed_data_scored |> 
  mutate(iso3_clean = recode(CNTRY_ISO, "XXK" = "XKX")) |> 
  group_by(iso3_clean) |> 
  summarise(
    n_wwtp = n(),
    median_risk = median(risk_score, na.rm = TRUE),
    mean_risk = mean(risk_score, na.rm = TRUE),
    .groups = "drop")

world <- ne_countries(scale = "medium", returnclass = "sf") |>  #Loads the world countries boundary dataset from naturalearth package and returns it as an sf, a special geometry containing country polygons, medium chooses a not so detailed map resolution
  select(name, iso_a3, continent, geometry)

world_risk <- world |> 
  left_join(country_risk, by = c("iso_a3" = "iso3_clean"))

bubbles_sf <- st_point_on_surface(world_risk) |> 
  st_drop_geometry() |>
  bind_cols(st_point_on_surface(world_risk) |> 
              st_as_sf())
bubbles_sf <- st_point_on_surface(world_risk) |>
  mutate(n_wwtp = world_risk$n_wwtp) |>
  filter(!is.na(n_wwtp))

ggplot() +
  geom_sf(data = world_risk, aes(fill = median_risk),
          color = "gray80", linewidth = 0.1) +
  geom_sf(
    data = bubbles_sf,
    aes(size = n_wwtp),
    alpha = 0.45 ) +
  scale_fill_viridis_c(
    option = "C",
    na.value = "white",
    name = "Median risk") +
  scale_size_continuous(
    range = c(1, 10),
    breaks = c(10, 50, 200, 1000),
    name = "WWTP records") +
  labs(
    title = "Global screening: WWTP risk and data coverage",
    subtitle = "Fill shows median risk score; point size reflects number of WWTP observations per country",
    caption = "Source: HydroWASTE (processed and scored by author).") +
  theme_void() +
  theme(legend.position = "right")
Figure 1

Figure 2 shows a clear inverse relationship between treated wastewater discharge and receiving water dilution potential(both on log10scales). As WWTP discharge increases, the estimated DF generally decreases. This pattern implies that higher discharge plants may more often be associated with comparatively lower dilution capacity, which increases screening level pollution risk in the risk score framework as DF is one of the core components. The smoothed lines by treatment level are broadly similar suggesting that at a global screening scale, variation in dilution potential and discharge may be atleast as influential as treatment level in differentiating potential risk.

scatter_data <- processed_data_scored |> 
  filter(!is.na(WASTE_DIS),WASTE_DIS >0, #As the next step is log, this avids havin log 0s
         !is.na(DF), DF >0,
         !is.na(LEVEL),
         !is.na(POP_SERVED))
scatter_data <- scatter_data |>
  mutate(
    log10_dis = log10(WASTE_DIS),
    log10_df  = log10(DF),
    log10_pop = log10(POP_SERVED))
ggplot(scatter_data, aes(log10_dis, log10_df, color = LEVEL)) +
  geom_point(alpha = 0.05, size = 0.6) +
  geom_smooth(se = FALSE, linewidth = 1) +
  labs(
    title = "WWTP discharge vs dilution factor (log10-transformed)",
    subtitle = "Points show individual WWTPs; lines show smoothed trend by treatment level",
    x = "log10(Treated wastewater discharge, m³/day)",
    y = "log10(Dilution factor, DF)",
    color = "Treatment level"
  ) +
  theme_minimal()
Figure 2

Conclusions

  • Risk patterns align with development context. Higher median risk scores are concentrated in lower-income/less-resourced settings (notably parts of Sub-Saharan Africa and MENA), while lower scores cluster in higher-income settings (Europe, North America, Australia/New Zealand), consistent with broader gaps in wastewater infrastructure and performance(Ravina et al. 2021).

  • Coverage and reporting remain a limiting factor. Several “highest-risk” countries in ranking are based on very small numbers of HydroWASTE records (e.g., n = 1–4), so results should be treated as indicative signals. This aligns with global SDG 6.3 reporting that domestic wastewater data and safe-treatment reporting are still incomplete and uneven across countries.

  • This project’s screening metric is not a compliance assessment. The risk score is designed for rapid prioritization (treatment level + population served + dilution potential). It does not replace diagnostics on effluent quality, actual compliance, industrial loads, or site-specific receiving water sensitivity.

  • In practice, governments and international organizations can use the Top 20 as a shortlist for follow-up (data validation, targeted monitoring, and deeper diagnostics/investment planning) and use the Bottom 20 as a benchmark for comparison while avoiding “safe” claims without additional supporting evidence.

References

Ehalt Macedo, Heloisa, Bernhard Lehner, Jim Nicell, Günther Grill, Jing Li, Antonio Limtong, and Ranish Shakya. 2022. “Distribution and Characteristics of Wastewater Treatment Plants Within the Global River Network.” Earth System Science Data 14 (2): 559–77. https://doi.org/10.5194/essd-14-559-2022.
Jones, Edward R., Michelle T. H. van Vliet, Manzoor Qadir, and Marc F. P. Bierkens. 2021. “Country-Level and Gridded Estimates of Wastewater Production, Collection, Treatment and Reuse.” Earth System Science Data 13 (2): 237–54. https://doi.org/10.5194/essd-13-237-2021.
Ravina, Marco, Sergio Galletta, Augustin Dagbetin, Omama Ahmed Hussein Kamaleldin, Madalitso Mng’ombe, Lameck Mnyenyembe, Alemayehu Shanko, and MariaChiara Zanetti. 2021. “Urban Wastewater Treatment in African Countries: Evidence from the Hydroaid Initiative.” Sustainability 13 (22): 12828. https://doi.org/10.3390/su132212828.