Air Quality Analysis Report

Author

Affiliation

Doreen A. Ndovie

KUHes, MSc Bioinformatics

Published

December 11, 2025

PROJECT DESCRIPTION

This project analyzes spatial and temporal patterns of air pollution in order to identify the factors driving poor air quality across districts. Specifically, the report explores:

Which parameters most strongly contribute to air pollution.
Spatial–temporal patterns of air pollution across the years.
Which districts or regions are most affected.
Which pollutants are most prevalent in specific high-pollution regions.
Whether there are relationships among pollutants.

All analyses follow a full data pipeline: import, cleaning, processing, visualization, and interpretation.

IMPORTING LIBRARIES

Loading libraries into the workspace

library(tidyverse)
library(ggplot2)
library(dplyr)
library(here)
library(readr)
library(lubridate)
library(knitr)
library(forcats)

raw_data <- read_csv(here::here("data/raw/Air_Quality.csv"))

head(raw_data)

# A tibble: 6 × 12
  `Unique ID` `Indicator ID` Name         Measure `Measure Info` `Geo Type Name`
        <dbl>          <dbl> <chr>        <chr>   <chr>          <chr>          
1      336867            375 Nitrogen di… Mean    ppb            CD             
2      336741            375 Nitrogen di… Mean    ppb            CD             
3      550157            375 Nitrogen di… Mean    ppb            CD             
4      412802            375 Nitrogen di… Mean    ppb            CD             
5      412803            375 Nitrogen di… Mean    ppb            CD             
6      412676            375 Nitrogen di… Mean    ppb            CD             
# ℹ 6 more variables: `Geo Join ID` <dbl>, `Geo Place Name` <chr>,
#   `Time Period` <chr>, Start_Date <chr>, `Data Value` <dbl>, Message <lgl>

view(raw_data)

#processing the raw data by selecting key columns and dropping null values


dir.create(
  here("data", "Processed"),
  recursive = TRUE,
  showWarnings = FALSE
)

processed_data <- raw_data |>
  select(
    Name, Measure, `Measure Info`, `Geo Place Name`,
    `Time Period`, Start_Date, `Data Value`
  ) |>
  tidyr::drop_na() |>
  rename(pollutant = Name)


write_csv(
  processed_data,
  here("data", "Processed", "my_processed-data.csv")
)

raw_path <- "../data/raw/Air_Quality.csv"
raw_df <- readr::read_csv(raw_path, show_col_types = FALSE)

names(raw_df)

 [1] "Unique ID"      "Indicator ID"   "Name"           "Measure"       
 [5] "Measure Info"   "Geo Type Name"  "Geo Join ID"    "Geo Place Name"
 [9] "Time Period"    "Start_Date"     "Data Value"     "Message"

INTRODUCTION

Air pollution is a major environmental and public-health concern, affecting respiratory, cardiovascular, and overall mortality outcomes. This dataset contains information on particulate-matter–related mortality across regions over several time periods. The purpose of this report is to clean, summarize, visualize, and explore patterns in the dataset to understand spatial variation and key contributing factors. This report follows reproducible workflows using Quarto and R.

METHODS

Raw air-quality mortality data were imported from the data/raw/ directory. Variables were standardized, dates parsed, and numeric fields cleaned. Processed data were saved to data/processed/ along with automatically generated documentation (data_dictionary.csv and README.md). Visualization was completed using ggplot2, and summary tables were produced with kable(). Scaling transformations were applied (log scales and color gradients) to improve interpretability.

RESULTS

Data import & cleaning

# Paths
raw_path <- "../data/raw/Air_Quality.csv"
proc_path <- "../data/processed/air_quality_clean.csv"
dict_path <- "../data/processed/dictionary.csv"
readme_path <- "../data/processed/README.md"

# Import raw data
raw_df <- read_csv(raw_path, show_col_types = FALSE)

# Clean dataset
clean_df <- raw_df |>
  rename_with(~ str_replace_all(., " ", "_")) |>  # convert spaces to _
  mutate(
    start_date = dmy(Start_Date),
    data_value = as.numeric(Data_Value),
    geo_place_name = as.character(Geo_Place_Name)
  ) |>
  select(
    unique_id = Unique_ID,
    indicator_id = Indicator_ID,
    name = Name,
    measure = Measure,
    measure_info = Measure_Info,
    geo_type_name = Geo_Type_Name,
    geo_join_id = Geo_Join_ID,
    geo_place_name,
    time_period = Time_Period,
    start_date,
    data_value,
    message = Message
  ) |>
  filter(!is.na(data_value))

# Save processed data
dir.create("../data/processed", showWarnings = FALSE, recursive = TRUE)
write_csv(clean_df, proc_path)

# ---- DATA DICTIONARY ----
dict <- tibble(
  variable_name = names(clean_df),
  description = c(
    "Unique record identifier for each observation.",
    "Numeric identifier for the environmental health indicator being measured.",
    "Indicator name (e.g. type of pollutant or outcome measured).",
    "Description of what is being measured (e.g. 'Deaths due to PM2.5').",
    "Additional measurement information or metadata.",
    "Geographic classification type (e.g., Borough, UHF34, UHF42, Citywide).",
    "Numeric geographic identifier for spatial joins or mapping.",
    "Name of the geographic location (e.g., Kingsbridge - Riverdale).",
    "Time period of the measurement (e.g., '2005–2007').",
    "Start date of the measurement period.",
    "Numeric value of the measurement (e.g., death rate per 100,000 adults).",
    "Notes or warnings associated with the record (often NA)."
  )
)

write_csv(dict, dict_path)

# ---- README FILE ----
readme_contents <- "
# Processed Data Documentation

This folder contains analysis-ready air quality data.

## Files

- **air_quality_clean.csv**: cleaned dataset
- **dictionary.csv**: variable descriptions
- **README.md**: documentation of processing steps

## Processing Steps

1. Imported raw data from the data/raw folder.
2. Cleaned column names to snake_case.
3. Converted dates and numeric fields.
4. Removed missing values in key variables.
5. Saved cleaned dataset and dictionary in data/processed.
"

writeLines(readme_contents, readme_path)

# Preview cleaned data
head(clean_df)

# A tibble: 6 × 12
  unique_id indicator_id name     measure measure_info geo_type_name geo_join_id
      <dbl>        <dbl> <chr>    <chr>   <chr>        <chr>               <dbl>
1    336867          375 Nitroge… Mean    ppb          CD                    407
2    336741          375 Nitroge… Mean    ppb          CD                    107
3    550157          375 Nitroge… Mean    ppb          CD                    414
4    412802          375 Nitroge… Mean    ppb          CD                    407
5    412803          375 Nitroge… Mean    ppb          CD                    407
6    412676          375 Nitroge… Mean    ppb          CD                    107
# ℹ 5 more variables: geo_place_name <chr>, time_period <chr>,
#   start_date <date>, data_value <dbl>, message <lgl>

Summary Statistics Table

The following table summarizes counts and central tendency of PM2.5-related mortality by time period.
This is Table Table 1.

summary_tbl <- clean_df |> 
  group_by(time_period) |> 
  summarize(
    n = n(),
    mean_rate = mean(data_value, na.rm = TRUE),
    median_rate = median(data_value, na.rm = TRUE),
    sd_rate = sd(data_value, na.rm = TRUE)
  )

kable(summary_tbl, digits = 2)

Table 1

time_period	n	mean_rate	median_rate	sd_rate
2005	417	35.02	13.65	38.91
2005-2007	480	46.22	27.05	53.61
2009-2011	480	41.39	22.30	48.02
2010	321	48.06	48.18	41.65
2011	214	1.90	1.90	0.49
2012-2014	480	40.14	19.55	49.89
2013	144	24.06	3.70	49.05
2014	96	1.23	1.08	0.34
2015	144	20.04	2.75	42.44
2015-2017	480	34.66	19.25	41.76
2017-2019	480	28.10	14.00	34.05
2019	321	52.45	51.30	45.66
Annual Average 2009	282	18.36	14.68	8.47
Annual Average 2010	282	16.91	13.58	7.75
Annual Average 2011	282	17.38	13.71	7.68
Annual Average 2012	282	15.75	12.73	7.21
Annual Average 2013	282	15.22	12.47	7.01
Annual Average 2014	282	15.21	12.91	6.57
Annual Average 2015	282	14.70	11.86	6.25
Annual Average 2016	282	13.68	11.39	6.40
Annual Average 2017	282	13.71	11.28	6.52
Annual Average 2018	282	13.10	10.46	6.31
Annual Average 2019	282	12.17	10.06	5.79
Annual Average 2020	282	11.57	9.86	5.68
Annual Average 2021	282	11.88	10.03	5.46
Annual Average 2022	282	11.06	9.19	5.34
Annual Average 2023	282	11.79	8.87	5.31
Summer 2009	423	19.31	20.44	7.15
Summer 2010	423	21.69	19.51	9.08
Summer 2011	423	21.17	18.77	9.03
Summer 2012	423	20.64	17.13	9.90
Summer 2013	423	19.42	16.77	8.73
Summer 2014	423	18.56	15.44	9.43
Summer 2015	423	18.80	15.48	9.32
Summer 2016	423	18.87	14.79	10.67
Summer 2017	423	17.86	14.37	8.58
Summer 2018	423	17.62	13.30	9.38
Summer 2019	423	17.54	13.61	9.40
Summer 2020	423	16.34	12.10	9.87
Summer 2021	423	17.00	12.32	9.40
Summer 2022	423	17.27	11.54	11.62
Summer 2023	423	18.88	12.92	11.43
Winter 2008-09	282	22.19	19.06	9.52
Winter 2009-10	282	18.70	16.23	8.85
Winter 2010-11	282	21.70	18.62	9.10
Winter 2011-12	282	17.20	14.96	8.12
Winter 2012-13	282	18.36	16.40	8.06
Winter 2013-14	282	20.21	19.00	8.30
Winter 2014-15	282	17.45	15.70	8.12
Winter 2015-16	282	16.09	14.02	7.98
Winter 2016-17	282	17.04	14.26	9.13
Winter 2017-18	282	16.67	13.46	8.52
Winter 2018-19	282	14.93	13.67	7.69
Winter 2019-20	282	16.45	13.83	9.12
Winter 2020-21	282	14.37	13.22	7.64
Winter 2021-22	282	14.20	12.31	7.35
Winter 2022-23	282	13.70	10.42	7.51

Summary statistics for PM2.5 mortality by time period.

Visualization 1 — Distribution of Mortality Rates

Figure 1 displays the distribution of particulate matter mortality which shows a closer to normal distribution.

library(ggplot2)

ggplot(clean_df |> filter(data_value > 0), aes(x = data_value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  scale_x_log10() +
  labs(
    x = "PM2.5 Mortality (per 100,000 adults, log scale)",
    y = "Count",
    title = "Distribution of PM2.5-Related Mortality"
  ) +
  theme_minimal()

Figure 1: Histogram of PM2.5 mortality (log-scaled x-axis).

Finding out which parameters strongly contribute to Air pollution

Figure 2 shows that vehicle usage contributes to air pollution leading to higher mortality rates.

library(dplyr)
library(ggplot2)
avg_param <- clean_df |> 
  group_by(name) |> 
  summarize(mean_value = mean(data_value, na.rm = TRUE)) |> 
  arrange(desc(mean_value))

ggplot(avg_param, aes(x = fct_reorder(name, mean_value), y = mean_value, fill = mean_value)) +
  geom_col() +
  coord_flip() +
  scale_fill_continuous(name = "Average rate") +
  labs(
    x = "Pollutant / Indicator",
    y = "Average Mortality Rate",
    title = "Average Mortality Rate per Pollutant/Indicator"
  ) +
  theme_minimal()

Figure 2: Average mortality rate per pollutant/indicator.

Spatial temporal patterns of air pollution across the years

temporal_trend <- clean_df |> 
  group_by(time_period) |> 
  summarize(avg_rate = mean(data_value, na.rm = TRUE))

ggplot(temporal_trend, aes(x = time_period, y = avg_rate, group = 1)) +
  geom_line(color = "steelblue", size = 1) +
  geom_point(color = "darkred") +
  labs(
    x = "Time Period",
    y = "Average Mortality Rate",
    title = "Temporal Trends in PM2.5 Mortality"
  ) +
  theme_minimal()+
    theme(
    axis.text.x = element_text(angle = 90, hjust = 1, size = 8),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 14, face = "bold")
  )

Figure 3: Temporal trends of average PM2.5-related mortality across years.

Finding out which districts are mostly affected

From Figure 4, it is noted that High Bridge Morrisania has the highest average mortality rate due to air pollution hence targeted efforts would be reqiured to mitigate such.

top_districts <- clean_df |> 
  group_by(geo_place_name) |> 
  summarize(avg_rate = mean(data_value, na.rm = TRUE)) |> 
  arrange(desc(avg_rate)) |> 
  slice_head(n = 10)

ggplot(top_districts, aes(x = fct_reorder(geo_place_name, avg_rate), y = avg_rate, fill = avg_rate)) +
  geom_col() +
  coord_flip() +
  scale_fill_continuous(name = "Average rate") +
  labs(
    x = "District",
    y = "Average Mortality Rate",
    title = "Top 10 Districts by PM2.5 Mortality"
  ) +
  theme_minimal()

Figure 4: Top 10 districts by PM2.5 mortality.

Which pollutants are most prevalent in high-pollution regions

library(dplyr)
library(ggplot2)

# Aggregate average rate per district × pollutant
high_pollution <- clean_df |> 
  group_by(geo_place_name, name) |> 
  summarize(avg_rate = mean(data_value, na.rm = TRUE), .groups = "drop")

# Heatmap
ggplot(high_pollution, aes(x = name, y = geo_place_name, fill = avg_rate)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "white", high = "red", name = "Average rate") +
  labs(
    x = "Pollutant",
    y = "District",
    title = "Pollutant Prevalence Across Districts"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 14, face = "bold")
  )

Figure 5: Pollutant prevalence in high-pollution districts.

Relationship among pollutants

library(dplyr)
library(ggplot2)
library(reshape2)

# Aggregate by district × pollutant
pollutant_wide <- clean_df |> 
  group_by(geo_place_name, name) |> 
  summarize(avg_value = mean(data_value, na.rm = TRUE), .groups = "drop") |> 
  pivot_wider(names_from = name, values_from = avg_value)

# Correlation matrix
corr_matrix <- cor(pollutant_wide[,-1], use = "pairwise.complete.obs")
corr_df <- melt(corr_matrix)

# Plot
ggplot(corr_df, aes(x=Var1, y=Var2, fill=value)) +
  geom_tile() +
  scale_fill_gradient2(low="blue", mid="white", high="red", midpoint=0, limit=c(-1,1)) +
  labs(title="Correlation Among Pollutants") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, hjust = 1, size = 12),
    axis.text.y = element_text(size = 12),
    plot.title = element_text(size = 14, face = "bold")
  )

Figure 6: Correlation among pollutants across districts.

CONCLUSIONS

PM2.5-related mortality varies widely across districts and time periods.
Certain regions repeatedly show higher-than-average mortality rates.
Mortality distributions are heavily right-skewed, indicating small clusters of highly affected areas.
Continuous spatial monitoring could improve targeting of pollution-reduction policies.

REFERENCES

The air quality data used in this analysis was obtained from data.gov (U.S. Environmental Protection Agency 2023).

References

U.S. Environmental Protection Agency. 2023. “Air Quality Data.” https://www.data.gov/.