library(tidyverse)
library(ggplot2)
library(dplyr)
library(here)
library(readr)
library(lubridate)
library(knitr)
library(forcats)Air Quality Analysis Report
PROJECT DESCRIPTION
This project analyzes spatial and temporal patterns of air pollution in order to identify the factors driving poor air quality across districts. Specifically, the report explores:
- Which parameters most strongly contribute to air pollution.
- Spatial–temporal patterns of air pollution across the years.
- Which districts or regions are most affected.
- Which pollutants are most prevalent in specific high-pollution regions.
- Whether there are relationships among pollutants.
All analyses follow a full data pipeline: import, cleaning, processing, visualization, and interpretation.
IMPORTING LIBRARIES
Loading libraries into the workspace
raw_data <- read_csv(here::here("data/raw/Air_Quality.csv"))head(raw_data)# A tibble: 6 × 12
`Unique ID` `Indicator ID` Name Measure `Measure Info` `Geo Type Name`
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 336867 375 Nitrogen di… Mean ppb CD
2 336741 375 Nitrogen di… Mean ppb CD
3 550157 375 Nitrogen di… Mean ppb CD
4 412802 375 Nitrogen di… Mean ppb CD
5 412803 375 Nitrogen di… Mean ppb CD
6 412676 375 Nitrogen di… Mean ppb CD
# ℹ 6 more variables: `Geo Join ID` <dbl>, `Geo Place Name` <chr>,
# `Time Period` <chr>, Start_Date <chr>, `Data Value` <dbl>, Message <lgl>
view(raw_data)#processing the raw data by selecting key columns and dropping null values
dir.create(
here("data", "Processed"),
recursive = TRUE,
showWarnings = FALSE
)
processed_data <- raw_data |>
select(
Name, Measure, `Measure Info`, `Geo Place Name`,
`Time Period`, Start_Date, `Data Value`
) |>
tidyr::drop_na() |>
rename(pollutant = Name)
write_csv(
processed_data,
here("data", "Processed", "my_processed-data.csv")
)raw_path <- "../data/raw/Air_Quality.csv"
raw_df <- readr::read_csv(raw_path, show_col_types = FALSE)
names(raw_df) [1] "Unique ID" "Indicator ID" "Name" "Measure"
[5] "Measure Info" "Geo Type Name" "Geo Join ID" "Geo Place Name"
[9] "Time Period" "Start_Date" "Data Value" "Message"
INTRODUCTION
Air pollution is a major environmental and public-health concern, affecting respiratory, cardiovascular, and overall mortality outcomes. This dataset contains information on particulate-matter–related mortality across regions over several time periods. The purpose of this report is to clean, summarize, visualize, and explore patterns in the dataset to understand spatial variation and key contributing factors. This report follows reproducible workflows using Quarto and R.
METHODS
Raw air-quality mortality data were imported from the data/raw/ directory. Variables were standardized, dates parsed, and numeric fields cleaned. Processed data were saved to data/processed/ along with automatically generated documentation (data_dictionary.csv and README.md). Visualization was completed using ggplot2, and summary tables were produced with kable(). Scaling transformations were applied (log scales and color gradients) to improve interpretability.
RESULTS
Data import & cleaning
# Paths
raw_path <- "../data/raw/Air_Quality.csv"
proc_path <- "../data/processed/air_quality_clean.csv"
dict_path <- "../data/processed/dictionary.csv"
readme_path <- "../data/processed/README.md"
# Import raw data
raw_df <- read_csv(raw_path, show_col_types = FALSE)
# Clean dataset
clean_df <- raw_df |>
rename_with(~ str_replace_all(., " ", "_")) |> # convert spaces to _
mutate(
start_date = dmy(Start_Date),
data_value = as.numeric(Data_Value),
geo_place_name = as.character(Geo_Place_Name)
) |>
select(
unique_id = Unique_ID,
indicator_id = Indicator_ID,
name = Name,
measure = Measure,
measure_info = Measure_Info,
geo_type_name = Geo_Type_Name,
geo_join_id = Geo_Join_ID,
geo_place_name,
time_period = Time_Period,
start_date,
data_value,
message = Message
) |>
filter(!is.na(data_value))
# Save processed data
dir.create("../data/processed", showWarnings = FALSE, recursive = TRUE)
write_csv(clean_df, proc_path)
# ---- DATA DICTIONARY ----
dict <- tibble(
variable_name = names(clean_df),
description = c(
"Unique record identifier for each observation.",
"Numeric identifier for the environmental health indicator being measured.",
"Indicator name (e.g. type of pollutant or outcome measured).",
"Description of what is being measured (e.g. 'Deaths due to PM2.5').",
"Additional measurement information or metadata.",
"Geographic classification type (e.g., Borough, UHF34, UHF42, Citywide).",
"Numeric geographic identifier for spatial joins or mapping.",
"Name of the geographic location (e.g., Kingsbridge - Riverdale).",
"Time period of the measurement (e.g., '2005–2007').",
"Start date of the measurement period.",
"Numeric value of the measurement (e.g., death rate per 100,000 adults).",
"Notes or warnings associated with the record (often NA)."
)
)
write_csv(dict, dict_path)
# ---- README FILE ----
readme_contents <- "
# Processed Data Documentation
This folder contains analysis-ready air quality data.
## Files
- **air_quality_clean.csv**: cleaned dataset
- **dictionary.csv**: variable descriptions
- **README.md**: documentation of processing steps
## Processing Steps
1. Imported raw data from the data/raw folder.
2. Cleaned column names to snake_case.
3. Converted dates and numeric fields.
4. Removed missing values in key variables.
5. Saved cleaned dataset and dictionary in data/processed.
"
writeLines(readme_contents, readme_path)
# Preview cleaned data
head(clean_df)# A tibble: 6 × 12
unique_id indicator_id name measure measure_info geo_type_name geo_join_id
<dbl> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 336867 375 Nitroge… Mean ppb CD 407
2 336741 375 Nitroge… Mean ppb CD 107
3 550157 375 Nitroge… Mean ppb CD 414
4 412802 375 Nitroge… Mean ppb CD 407
5 412803 375 Nitroge… Mean ppb CD 407
6 412676 375 Nitroge… Mean ppb CD 107
# ℹ 5 more variables: geo_place_name <chr>, time_period <chr>,
# start_date <date>, data_value <dbl>, message <lgl>
Summary Statistics Table
The following table summarizes counts and central tendency of PM2.5-related mortality by time period.
This is Table Table 1.
summary_tbl <- clean_df |>
group_by(time_period) |>
summarize(
n = n(),
mean_rate = mean(data_value, na.rm = TRUE),
median_rate = median(data_value, na.rm = TRUE),
sd_rate = sd(data_value, na.rm = TRUE)
)
kable(summary_tbl, digits = 2)| time_period | n | mean_rate | median_rate | sd_rate |
|---|---|---|---|---|
| 2005 | 417 | 35.02 | 13.65 | 38.91 |
| 2005-2007 | 480 | 46.22 | 27.05 | 53.61 |
| 2009-2011 | 480 | 41.39 | 22.30 | 48.02 |
| 2010 | 321 | 48.06 | 48.18 | 41.65 |
| 2011 | 214 | 1.90 | 1.90 | 0.49 |
| 2012-2014 | 480 | 40.14 | 19.55 | 49.89 |
| 2013 | 144 | 24.06 | 3.70 | 49.05 |
| 2014 | 96 | 1.23 | 1.08 | 0.34 |
| 2015 | 144 | 20.04 | 2.75 | 42.44 |
| 2015-2017 | 480 | 34.66 | 19.25 | 41.76 |
| 2017-2019 | 480 | 28.10 | 14.00 | 34.05 |
| 2019 | 321 | 52.45 | 51.30 | 45.66 |
| Annual Average 2009 | 282 | 18.36 | 14.68 | 8.47 |
| Annual Average 2010 | 282 | 16.91 | 13.58 | 7.75 |
| Annual Average 2011 | 282 | 17.38 | 13.71 | 7.68 |
| Annual Average 2012 | 282 | 15.75 | 12.73 | 7.21 |
| Annual Average 2013 | 282 | 15.22 | 12.47 | 7.01 |
| Annual Average 2014 | 282 | 15.21 | 12.91 | 6.57 |
| Annual Average 2015 | 282 | 14.70 | 11.86 | 6.25 |
| Annual Average 2016 | 282 | 13.68 | 11.39 | 6.40 |
| Annual Average 2017 | 282 | 13.71 | 11.28 | 6.52 |
| Annual Average 2018 | 282 | 13.10 | 10.46 | 6.31 |
| Annual Average 2019 | 282 | 12.17 | 10.06 | 5.79 |
| Annual Average 2020 | 282 | 11.57 | 9.86 | 5.68 |
| Annual Average 2021 | 282 | 11.88 | 10.03 | 5.46 |
| Annual Average 2022 | 282 | 11.06 | 9.19 | 5.34 |
| Annual Average 2023 | 282 | 11.79 | 8.87 | 5.31 |
| Summer 2009 | 423 | 19.31 | 20.44 | 7.15 |
| Summer 2010 | 423 | 21.69 | 19.51 | 9.08 |
| Summer 2011 | 423 | 21.17 | 18.77 | 9.03 |
| Summer 2012 | 423 | 20.64 | 17.13 | 9.90 |
| Summer 2013 | 423 | 19.42 | 16.77 | 8.73 |
| Summer 2014 | 423 | 18.56 | 15.44 | 9.43 |
| Summer 2015 | 423 | 18.80 | 15.48 | 9.32 |
| Summer 2016 | 423 | 18.87 | 14.79 | 10.67 |
| Summer 2017 | 423 | 17.86 | 14.37 | 8.58 |
| Summer 2018 | 423 | 17.62 | 13.30 | 9.38 |
| Summer 2019 | 423 | 17.54 | 13.61 | 9.40 |
| Summer 2020 | 423 | 16.34 | 12.10 | 9.87 |
| Summer 2021 | 423 | 17.00 | 12.32 | 9.40 |
| Summer 2022 | 423 | 17.27 | 11.54 | 11.62 |
| Summer 2023 | 423 | 18.88 | 12.92 | 11.43 |
| Winter 2008-09 | 282 | 22.19 | 19.06 | 9.52 |
| Winter 2009-10 | 282 | 18.70 | 16.23 | 8.85 |
| Winter 2010-11 | 282 | 21.70 | 18.62 | 9.10 |
| Winter 2011-12 | 282 | 17.20 | 14.96 | 8.12 |
| Winter 2012-13 | 282 | 18.36 | 16.40 | 8.06 |
| Winter 2013-14 | 282 | 20.21 | 19.00 | 8.30 |
| Winter 2014-15 | 282 | 17.45 | 15.70 | 8.12 |
| Winter 2015-16 | 282 | 16.09 | 14.02 | 7.98 |
| Winter 2016-17 | 282 | 17.04 | 14.26 | 9.13 |
| Winter 2017-18 | 282 | 16.67 | 13.46 | 8.52 |
| Winter 2018-19 | 282 | 14.93 | 13.67 | 7.69 |
| Winter 2019-20 | 282 | 16.45 | 13.83 | 9.12 |
| Winter 2020-21 | 282 | 14.37 | 13.22 | 7.64 |
| Winter 2021-22 | 282 | 14.20 | 12.31 | 7.35 |
| Winter 2022-23 | 282 | 13.70 | 10.42 | 7.51 |
Summary statistics for PM2.5 mortality by time period.
Visualization 1 — Distribution of Mortality Rates
Figure 1 displays the distribution of particulate matter mortality which shows a closer to normal distribution.
library(ggplot2)
ggplot(clean_df |> filter(data_value > 0), aes(x = data_value)) +
geom_histogram(bins = 30, fill = "steelblue", color = "white") +
scale_x_log10() +
labs(
x = "PM2.5 Mortality (per 100,000 adults, log scale)",
y = "Count",
title = "Distribution of PM2.5-Related Mortality"
) +
theme_minimal()
Finding out which parameters strongly contribute to Air pollution
Figure 2 shows that vehicle usage contributes to air pollution leading to higher mortality rates.
library(dplyr)
library(ggplot2)
avg_param <- clean_df |>
group_by(name) |>
summarize(mean_value = mean(data_value, na.rm = TRUE)) |>
arrange(desc(mean_value))
ggplot(avg_param, aes(x = fct_reorder(name, mean_value), y = mean_value, fill = mean_value)) +
geom_col() +
coord_flip() +
scale_fill_continuous(name = "Average rate") +
labs(
x = "Pollutant / Indicator",
y = "Average Mortality Rate",
title = "Average Mortality Rate per Pollutant/Indicator"
) +
theme_minimal()
Spatial temporal patterns of air pollution across the years
temporal_trend <- clean_df |>
group_by(time_period) |>
summarize(avg_rate = mean(data_value, na.rm = TRUE))
ggplot(temporal_trend, aes(x = time_period, y = avg_rate, group = 1)) +
geom_line(color = "steelblue", size = 1) +
geom_point(color = "darkred") +
labs(
x = "Time Period",
y = "Average Mortality Rate",
title = "Temporal Trends in PM2.5 Mortality"
) +
theme_minimal()+
theme(
axis.text.x = element_text(angle = 90, hjust = 1, size = 8),
axis.text.y = element_text(size = 12),
plot.title = element_text(size = 14, face = "bold")
)
Finding out which districts are mostly affected
From Figure 4, it is noted that High Bridge Morrisania has the highest average mortality rate due to air pollution hence targeted efforts would be reqiured to mitigate such.
top_districts <- clean_df |>
group_by(geo_place_name) |>
summarize(avg_rate = mean(data_value, na.rm = TRUE)) |>
arrange(desc(avg_rate)) |>
slice_head(n = 10)
ggplot(top_districts, aes(x = fct_reorder(geo_place_name, avg_rate), y = avg_rate, fill = avg_rate)) +
geom_col() +
coord_flip() +
scale_fill_continuous(name = "Average rate") +
labs(
x = "District",
y = "Average Mortality Rate",
title = "Top 10 Districts by PM2.5 Mortality"
) +
theme_minimal()
Which pollutants are most prevalent in high-pollution regions
library(dplyr)
library(ggplot2)
# Aggregate average rate per district × pollutant
high_pollution <- clean_df |>
group_by(geo_place_name, name) |>
summarize(avg_rate = mean(data_value, na.rm = TRUE), .groups = "drop")
# Heatmap
ggplot(high_pollution, aes(x = name, y = geo_place_name, fill = avg_rate)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "white", high = "red", name = "Average rate") +
labs(
x = "Pollutant",
y = "District",
title = "Pollutant Prevalence Across Districts"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
axis.text.y = element_text(size = 12),
plot.title = element_text(size = 14, face = "bold")
)
Relationship among pollutants
library(dplyr)
library(ggplot2)
library(reshape2)
# Aggregate by district × pollutant
pollutant_wide <- clean_df |>
group_by(geo_place_name, name) |>
summarize(avg_value = mean(data_value, na.rm = TRUE), .groups = "drop") |>
pivot_wider(names_from = name, values_from = avg_value)
# Correlation matrix
corr_matrix <- cor(pollutant_wide[,-1], use = "pairwise.complete.obs")
corr_df <- melt(corr_matrix)
# Plot
ggplot(corr_df, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
scale_fill_gradient2(low="blue", mid="white", high="red", midpoint=0, limit=c(-1,1)) +
labs(title="Correlation Among Pollutants") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, size = 12),
axis.text.y = element_text(size = 12),
plot.title = element_text(size = 14, face = "bold")
)
CONCLUSIONS
- PM2.5-related mortality varies widely across districts and time periods.
- Certain regions repeatedly show higher-than-average mortality rates.
- Mortality distributions are heavily right-skewed, indicating small clusters of highly affected areas.
- Continuous spatial monitoring could improve targeting of pollution-reduction policies.
REFERENCES
The air quality data used in this analysis was obtained from data.gov (U.S. Environmental Protection Agency 2023).