Air Quality Analysis Report

Author
Affiliation

Doreen A. Ndovie

KUHes, MSc Bioinformatics

Published

December 11, 2025

PROJECT DESCRIPTION

This project analyzes spatial and temporal patterns of air pollution in order to identify the factors driving poor air quality across districts. Specifically, the report explores:

  • Which parameters most strongly contribute to air pollution.
  • Spatial–temporal patterns of air pollution across the years.
  • Which districts or regions are most affected.
  • Which pollutants are most prevalent in specific high-pollution regions.
  • Whether there are relationships among pollutants.

All analyses follow a full data pipeline: import, cleaning, processing, visualization, and interpretation.

IMPORTING LIBRARIES

Loading libraries into the workspace

library(tidyverse)
library(ggplot2)
library(dplyr)
library(here)
library(readr)
library(lubridate)
library(knitr)
library(forcats)
raw_data <- read_csv(here::here("data/raw/Air_Quality.csv"))
head(raw_data)
# A tibble: 6 × 12
  `Unique ID` `Indicator ID` Name         Measure `Measure Info` `Geo Type Name`
        <dbl>          <dbl> <chr>        <chr>   <chr>          <chr>          
1      336867            375 Nitrogen di… Mean    ppb            CD             
2      336741            375 Nitrogen di… Mean    ppb            CD             
3      550157            375 Nitrogen di… Mean    ppb            CD             
4      412802            375 Nitrogen di… Mean    ppb            CD             
5      412803            375 Nitrogen di… Mean    ppb            CD             
6      412676            375 Nitrogen di… Mean    ppb            CD             
# ℹ 6 more variables: `Geo Join ID` <dbl>, `Geo Place Name` <chr>,
#   `Time Period` <chr>, Start_Date <chr>, `Data Value` <dbl>, Message <lgl>
view(raw_data)
#processing the raw data by selecting key columns and dropping null values


dir.create(
  here("data", "Processed"),
  recursive = TRUE,
  showWarnings = FALSE
)

processed_data <- raw_data |>
  select(
    Name, Measure, `Measure Info`, `Geo Place Name`,
    `Time Period`, Start_Date, `Data Value`
  ) |>
  tidyr::drop_na() |>
  rename(pollutant = Name)


write_csv(
  processed_data,
  here("data", "Processed", "my_processed-data.csv")
)
raw_path <- "../data/raw/Air_Quality.csv"
raw_df <- readr::read_csv(raw_path, show_col_types = FALSE)

names(raw_df)
 [1] "Unique ID"      "Indicator ID"   "Name"           "Measure"       
 [5] "Measure Info"   "Geo Type Name"  "Geo Join ID"    "Geo Place Name"
 [9] "Time Period"    "Start_Date"     "Data Value"     "Message"       

INTRODUCTION

Air pollution is a major environmental and public-health concern, affecting respiratory, cardiovascular, and overall mortality outcomes. This dataset contains information on particulate-matter–related mortality across regions over several time periods. The purpose of this report is to clean, summarize, visualize, and explore patterns in the dataset to understand spatial variation and key contributing factors. This report follows reproducible workflows using Quarto and R.

METHODS

Raw air-quality mortality data were imported from the data/raw/ directory. Variables were standardized, dates parsed, and numeric fields cleaned. Processed data were saved to data/processed/ along with automatically generated documentation (data_dictionary.csv and README.md). Visualization was completed using ggplot2, and summary tables were produced with kable(). Scaling transformations were applied (log scales and color gradients) to improve interpretability.

RESULTS

Data import & cleaning

# Paths
raw_path <- "../data/raw/Air_Quality.csv"
proc_path <- "../data/processed/air_quality_clean.csv"
dict_path <- "../data/processed/dictionary.csv"
readme_path <- "../data/processed/README.md"

# Import raw data
raw_df <- read_csv(raw_path, show_col_types = FALSE)

# Clean dataset
clean_df <- raw_df |>
  rename_with(~ str_replace_all(., " ", "_")) |>  # convert spaces to _
  mutate(
    start_date = dmy(Start_Date),
    data_value = as.numeric(Data_Value),
    geo_place_name = as.character(Geo_Place_Name)
  ) |>
  select(
    unique_id = Unique_ID,
    indicator_id = Indicator_ID,
    name = Name,
    measure = Measure,
    measure_info = Measure_Info,
    geo_type_name = Geo_Type_Name,
    geo_join_id = Geo_Join_ID,
    geo_place_name,
    time_period = Time_Period,
    start_date,
    data_value,
    message = Message
  ) |>
  filter(!is.na(data_value))

# Save processed data
dir.create("../data/processed", showWarnings = FALSE, recursive = TRUE)
write_csv(clean_df, proc_path)

# ---- DATA DICTIONARY ----
dict <- tibble(
  variable_name = names(clean_df),
  description = c(
    "Unique record identifier for each observation.",
    "Numeric identifier for the environmental health indicator being measured.",
    "Indicator name (e.g. type of pollutant or outcome measured).",
    "Description of what is being measured (e.g. 'Deaths due to PM2.5').",
    "Additional measurement information or metadata.",
    "Geographic classification type (e.g., Borough, UHF34, UHF42, Citywide).",
    "Numeric geographic identifier for spatial joins or mapping.",
    "Name of the geographic location (e.g., Kingsbridge - Riverdale).",
    "Time period of the measurement (e.g., '2005–2007').",
    "Start date of the measurement period.",
    "Numeric value of the measurement (e.g., death rate per 100,000 adults).",
    "Notes or warnings associated with the record (often NA)."
  )
)

write_csv(dict, dict_path)

# ---- README FILE ----
readme_contents <- "
# Processed Data Documentation

This folder contains analysis-ready air quality data.

## Files

- **air_quality_clean.csv**: cleaned dataset
- **dictionary.csv**: variable descriptions
- **README.md**: documentation of processing steps

## Processing Steps

1. Imported raw data from the data/raw folder.
2. Cleaned column names to snake_case.
3. Converted dates and numeric fields.
4. Removed missing values in key variables.
5. Saved cleaned dataset and dictionary in data/processed.
"

writeLines(readme_contents, readme_path)

# Preview cleaned data
head(clean_df)
# A tibble: 6 × 12
  unique_id indicator_id name     measure measure_info geo_type_name geo_join_id
      <dbl>        <dbl> <chr>    <chr>   <chr>        <chr>               <dbl>
1    336867          375 Nitroge… Mean    ppb          CD                    407
2    336741          375 Nitroge… Mean    ppb          CD                    107
3    550157          375 Nitroge… Mean    ppb          CD                    414
4    412802          375 Nitroge… Mean    ppb          CD                    407
5    412803          375 Nitroge… Mean    ppb          CD                    407
6    412676          375 Nitroge… Mean    ppb          CD                    107
# ℹ 5 more variables: geo_place_name <chr>, time_period <chr>,
#   start_date <date>, data_value <dbl>, message <lgl>

Summary Statistics Table

The following table summarizes counts and central tendency of PM2.5-related mortality by time period.
This is Table Table 1.

summary_tbl <- clean_df |> 
  group_by(time_period) |> 
  summarize(
    n = n(),
    mean_rate = mean(data_value, na.rm = TRUE),
    median_rate = median(data_value, na.rm = TRUE),
    sd_rate = sd(data_value, na.rm = TRUE)
  )

kable(summary_tbl, digits = 2)
Table 1
time_period n mean_rate median_rate sd_rate
2005 417 35.02 13.65 38.91
2005-2007 480 46.22 27.05 53.61
2009-2011 480 41.39 22.30 48.02
2010 321 48.06 48.18 41.65
2011 214 1.90 1.90 0.49
2012-2014 480 40.14 19.55 49.89
2013 144 24.06 3.70 49.05
2014 96 1.23 1.08 0.34
2015 144 20.04 2.75 42.44
2015-2017 480 34.66 19.25 41.76
2017-2019 480 28.10 14.00 34.05
2019 321 52.45 51.30 45.66
Annual Average 2009 282 18.36 14.68 8.47
Annual Average 2010 282 16.91 13.58 7.75
Annual Average 2011 282 17.38 13.71 7.68
Annual Average 2012 282 15.75 12.73 7.21
Annual Average 2013 282 15.22 12.47 7.01
Annual Average 2014 282 15.21 12.91 6.57
Annual Average 2015 282 14.70 11.86 6.25
Annual Average 2016 282 13.68 11.39 6.40
Annual Average 2017 282 13.71 11.28 6.52
Annual Average 2018 282 13.10 10.46 6.31
Annual Average 2019 282 12.17 10.06 5.79
Annual Average 2020 282 11.57 9.86 5.68
Annual Average 2021 282 11.88 10.03 5.46
Annual Average 2022 282 11.06 9.19 5.34
Annual Average 2023 282 11.79 8.87 5.31
Summer 2009 423 19.31 20.44 7.15
Summer 2010 423 21.69 19.51 9.08
Summer 2011 423 21.17 18.77 9.03
Summer 2012 423 20.64 17.13 9.90
Summer 2013 423 19.42 16.77 8.73
Summer 2014 423 18.56 15.44 9.43
Summer 2015 423 18.80 15.48 9.32
Summer 2016 423 18.87 14.79 10.67
Summer 2017 423 17.86 14.37 8.58
Summer 2018 423 17.62 13.30 9.38
Summer 2019 423 17.54 13.61 9.40
Summer 2020 423 16.34 12.10 9.87
Summer 2021 423 17.00 12.32 9.40
Summer 2022 423 17.27 11.54 11.62
Summer 2023 423 18.88 12.92 11.43
Winter 2008-09 282 22.19 19.06 9.52
Winter 2009-10 282 18.70 16.23 8.85
Winter 2010-11 282 21.70 18.62 9.10
Winter 2011-12 282 17.20 14.96 8.12
Winter 2012-13 282 18.36 16.40 8.06
Winter 2013-14 282 20.21 19.00 8.30
Winter 2014-15 282 17.45 15.70 8.12
Winter 2015-16 282 16.09 14.02 7.98
Winter 2016-17 282 17.04 14.26 9.13
Winter 2017-18 282 16.67 13.46 8.52
Winter 2018-19 282 14.93 13.67 7.69
Winter 2019-20 282 16.45 13.83 9.12
Winter 2020-21 282 14.37 13.22 7.64
Winter 2021-22 282 14.20 12.31 7.35
Winter 2022-23 282 13.70 10.42 7.51

Summary statistics for PM2.5 mortality by time period.

Visualization 1 — Distribution of Mortality Rates

Figure 1 displays the distribution of particulate matter mortality which shows a closer to normal distribution.

library(ggplot2)

ggplot(clean_df |> filter(data_value > 0), aes(x = data_value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  scale_x_log10() +
  labs(
    x = "PM2.5 Mortality (per 100,000 adults, log scale)",
    y = "Count",
    title = "Distribution of PM2.5-Related Mortality"
  ) +
  theme_minimal()
Figure 1: Histogram of PM2.5 mortality (log-scaled x-axis).

Finding out which parameters strongly contribute to Air pollution

Figure 2 shows that vehicle usage contributes to air pollution leading to higher mortality rates.

library(dplyr)
library(ggplot2)
avg_param <- clean_df |> 
  group_by(name) |> 
  summarize(mean_value = mean(data_value, na.rm = TRUE)) |> 
  arrange(desc(mean_value))

ggplot(avg_param, aes(x = fct_reorder(name, mean_value), y = mean_value, fill = mean_value)) +
  geom_col() +
  coord_flip() +
  scale_fill_continuous(name = "Average rate") +
  labs(
    x = "Pollutant / Indicator",
    y = "Average Mortality Rate",
    title = "Average Mortality Rate per Pollutant/Indicator"
  ) +
  theme_minimal()
Figure 2: Average mortality rate per pollutant/indicator.

Spatial temporal patterns of air pollution across the years

temporal_trend <- clean_df |> 
  group_by(time_period) |> 
  summarize(avg_rate = mean(data_value, na.rm = TRUE))

ggplot(temporal_trend, aes(x = time_period, y = avg_rate, group = 1)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color = "darkred") +
  labs(
    x = "Time Period",
    y = "Average Mortality Rate",
    title = "Temporal Trends in PM2.5 Mortality"
  ) +
  theme_minimal()+
    theme(
    axis.text.x = element_text(angle = 90, hjust = 1, size = 8),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 14, face = "bold")
  )
Figure 3: Temporal trends of average PM2.5-related mortality across years.

Finding out which districts are mostly affected

From Figure 4, it is noted that High Bridge Morrisania has the highest average mortality rate due to air pollution hence targeted efforts would be reqiured to mitigate such.

top_districts <- clean_df |> 
  group_by(geo_place_name) |> 
  summarize(avg_rate = mean(data_value, na.rm = TRUE)) |> 
  arrange(desc(avg_rate)) |> 
  slice_head(n = 10)

ggplot(top_districts, aes(x = fct_reorder(geo_place_name, avg_rate), y = avg_rate, fill = avg_rate)) +
  geom_col() +
  coord_flip() +
  scale_fill_continuous(name = "Average rate") +
  labs(
    x = "District",
    y = "Average Mortality Rate",
    title = "Top 10 Districts by PM2.5 Mortality"
  ) +
  theme_minimal()
Figure 4: Top 10 districts by PM2.5 mortality.

Which pollutants are most prevalent in high-pollution regions

library(dplyr)
library(ggplot2)

# Aggregate average rate per district × pollutant
high_pollution <- clean_df |> 
  group_by(geo_place_name, name) |> 
  summarize(avg_rate = mean(data_value, na.rm = TRUE), .groups = "drop")

# Heatmap
ggplot(high_pollution, aes(x = name, y = geo_place_name, fill = avg_rate)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "red", name = "Average rate") +
  labs(
    x = "Pollutant",
    y = "District",
    title = "Pollutant Prevalence Across Districts"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 14, face = "bold")
  )
Figure 5: Pollutant prevalence in high-pollution districts.

Relationship among pollutants

library(dplyr)
library(ggplot2)
library(reshape2)

# Aggregate by district × pollutant
pollutant_wide <- clean_df |> 
  group_by(geo_place_name, name) |> 
  summarize(avg_value = mean(data_value, na.rm = TRUE), .groups = "drop") |> 
  pivot_wider(names_from = name, values_from = avg_value)

# Correlation matrix
corr_matrix <- cor(pollutant_wide[,-1], use = "pairwise.complete.obs")
corr_df <- melt(corr_matrix)

# Plot
ggplot(corr_df, aes(x=Var1, y=Var2, fill=value)) +
  geom_tile() +
  scale_fill_gradient2(low="blue", mid="white", high="red", midpoint=0, limit=c(-1,1)) +
  labs(title="Correlation Among Pollutants") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, size = 12),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 14, face = "bold")
  )
Figure 6: Correlation among pollutants across districts.

CONCLUSIONS

  • PM2.5-related mortality varies widely across districts and time periods.
  • Certain regions repeatedly show higher-than-average mortality rates.
  • Mortality distributions are heavily right-skewed, indicating small clusters of highly affected areas.
  • Continuous spatial monitoring could improve targeting of pollution-reduction policies.

REFERENCES

The air quality data used in this analysis was obtained from data.gov (U.S. Environmental Protection Agency 2023).

References

U.S. Environmental Protection Agency. 2023. “Air Quality Data.” https://www.data.gov/.