An exploration to understand the structure and trends of an informal faecal sludge collection sector
| Abbreviation | Comment |
|---|---|
| FS | Faecal sludge |
The ‘fslogisticskampala’ dataset is shared on openwashdata by Lars Schöbitz. The goal of this dataset is to provide data on faecal sludge transporting logistics in Kampala, Uganda. The data was collected from 30 March 2015 until 25 June 2015 and contains the collection location and the dislodging location of each trip. After reading the shared dataset and the accompanying report, the author felt it would be interesting to see whether the dataset holds additional information and trends that have not yet been exposed in the narrative. This dataset was chosen for the Data Science for openwashdata course because it contains location data, in the form of GPS coordinates, and multiple tables that can be joined for the analysis. Additionally, it contains a time series, a common aspect of datasets and therefore useful for gaining hands-on experience.
First, the required libraries are imported to be able to read and process the datasets.

```r
library(tidyverse)
library(dplyr)
library(ggplot2)
library(gt)
library(here)
library(gtsummary)
library(lubridate)
library(leaflet)
```

The original set has two tables (“Fslogisticskampala: Data on Faecal Sludge Transporting Logistics in Kampala, Uganda” 2024): one contains information on the trips of the faecal sludge collection trucks, and the second contains metadata on the trucks. First we open trips.csv and save it as trips_raw:
```r
trips_raw <- read_csv(here::here("data/raw/trips.csv"))
```

Table 2.1 shows the dictionary of the trips dataset.
read_csv(here::here("data/processed/trips_dictionary.csv")) |>
gt()| variable_name | variable_type | description |
|---|---|---|
| fid | integer | key |
| numberplate | string | used as foreign key to link to trucks dataset |
| date | date | date in ISO 8601 derived style: year/month/day --> '2025/11/20' |
| time | time | time format |
| lat | double | latitude of collection location |
| lon | double | longitude of collection location |
| plant | string | Name of treatment plant |
From the dictionary it can be seen that the trips data holds information about each recorded trip. The ‘numberplate’ column can be used as the foreign key to link to the trucks dataset, as described below.
The second dataset to import is the trucks data as published on the website. We read trucks.csv and save it as trucks_raw:
```r
trucks_raw <- read_csv(here::here("data/raw/trucks.csv"))
```

Table 2.2 shows the dictionary of the trucks dataset. This is a simpler table containing metadata on the trucks and holds only two variables: the ‘numberplate’, which functions as the key to link to the trips dataset, and the volume of the specific truck in m3.
read_csv(here::here("data/processed/trucks_dictionary.csv")) |>
gt()| variable_name | variable_type | description |
|---|---|---|
| numberplate | string | license plate number used for identifying the truck |
| volume | integer | Volume in m3 |
To aid the analysis, additional tables are generated and imported: a date dimension with a daily granularity containing time-intelligence information for the data collection period, and a matching time dimension.
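The generation code for these dimension tables is not part of the published dataset; as a minimal sketch (hypothetical, deriving only a handful of the dim_date columns shown in Figure 2.1), such a date dimension could be built with lubridate like this:

```r
# Minimal sketch of a date dimension covering the collection period
# (hypothetical; only a subset of the dim_date columns is derived)
dim_date_sketch <- tibble(
  date = seq(ymd("2015-03-30"), ymd("2015-06-25"), by = "day")
) |>
  mutate(
    day           = day(date),
    week_day_name = wday(date, label = TRUE, abbr = FALSE, week_start = 1),
    month         = month(date),
    month_name    = month(date, label = TRUE, abbr = FALSE),
    year          = year(date),
    is_weekend    = wday(date, week_start = 1) > 5
  )
```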
To understand the data, we first assess the raw set by counting all non-NA values for each column (see Table 2.3). In total 5653 records are in the dataset, and it can be seen that no column has NA values.
```r
colSums(!is.na(trips_raw))
```

```
        fid numberplate        date        time         lat         lon 
       5653        5653        5653        5653        5653        5653 
      plant 
       5653 
```
```r
truck_dist <- trips_raw |>
  count(numberplate, name = "n_entries") |>
  arrange(desc(n_entries)) |>
  mutate("Percentage of entries" = round(n_entries / sum(n_entries) * 100, 2))

truck_dist
```

```
# A tibble: 35 × 3
   numberplate n_entries `Percentage of entries`
   <chr>           <int>                   <dbl>
 1 UAT 598T          315                    5.57
 2 UAQ 024L          286                    5.06
 3 UAQ 981Z          285                    5.04
 4 UAV 037R          271                    4.79
 5 UAU 189B          267                    4.72
 6 UAT 845E          254                    4.49
 7 AUS 119X          235                    4.16
 8 UAN 087N          218                    3.86
 9 LG 0366-01        216                    3.82
10 LG 0257-01        198                    3.5 
# ℹ 25 more rows
```
Of the 35 trucks in total, the truck with the highest number of entries accounts for only 5.6% of the entries, so the dataset is spread out enough to represent the truck drivers in Kampala. The trips dataset can be used as the fact table in the analysis.
The second dataset, called ‘trucks’, can be used as a dimension table for the trips data to retrieve metadata on the trucks, in this case only the volume. In total 35 trucks are registered, and each has a reported volume as an attribute:
```r
colSums(!is.na(trucks_raw))
```

```
numberplate      volume 
         35          35 
```
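Before linking the two tables, the foreign-key relationship is worth a quick sanity check. A minimal sketch (assuming the trips_raw and trucks_raw tables loaded above) lists any numberplates that appear in the trips data but not in the trucks data:

```r
# Numberplates in trips without a match in trucks;
# an empty result confirms the foreign key is complete
trips_raw |>
  distinct(numberplate) |>
  anti_join(trucks_raw, by = "numberplate")
```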
After cleaning, the processed trips data is written out as a CSV:
```r
trips <- trips_raw |>
  mutate(numberplate = as.character(numberplate)) |>
  write_csv(here::here("data/processed/trips.csv"))
```

The raw trucks table is read, cleaned with the necessary steps, and stored as trucks for further analysis.
```r
trucks <- trucks_raw |>
  mutate(numberplate = as.character(numberplate)) |>
  mutate(size_cat = case_when(
    volume < 8 ~ "small",
    volume < 12 ~ "medium",
    volume < 40 ~ "large",
    volume < 999 ~ "HUGE!",
    TRUE ~ "unknown"
  )) |>
  write_csv(here::here("data/processed/trucks.csv"))
```

Since dim_date and dim_time are generated for this specific reporting purpose, they are placed directly in the processed folder; no additional cleaning steps are required. The files are read into data frames so they can easily be referenced in the code.
```r
dim_date <- read_csv(here::here("data/processed/dim_date.csv"))
dim_time <- read_csv(here::here("data/processed/dim_time.csv"))
```

First, each cleaned dataset is stored in the ‘data/processed’ folder. This makes for a versatile setup in which cleaning steps can be incorporated into the dataset whenever deemed necessary while analysing the results. After saving the individual tables, they are linked according to a ‘star’ schema based on Kimball and Ross (2013). For the analysis, the trips data is used as the fact table, and the trucks, dim_date and dim_time tables are linked as dimension tables over which the set can be grouped and filtered. Figure 2.1 shows the schematic of the star schema.
```mermaid
erDiagram
trips }o--o| trucks : numberplate
trips }o--o| dim_date : date
trips }o--o| dim_time : time
trucks {
id numberplate
integer volume
string size_cat
}
trips {
id fid
string numberplate
date date
time time
float lat
float lon
string plant
}
dim_date {
date date
integer day
string day_suffix
integer week_day
string week_day_name
string week_day_name_short
string week_dayname_first_letter
integer day_of_year
integer week_of_month
integer week_of_year
integer month
string month_name
string month_name_Short
string month_name_first_letter
integer quarter
string quarter_name
integer year
integer yyyymm
string year_month
binary is_weekend
date first_date_of_month
date last_date_of_month
date first_date_of_week
date last_date_of_week
string year_week
}
dim_time{
integer hour12
integer hour24
integer minute_of_hour
integer second_of_minute
integer elapsed_minutes
integer elapsed_seconds
string am_pm
time hhmmss
binary working_hours
integer part_of_day_sort
string part_of_day
}
```
To generate the star schema, the code below can be executed. It performs a left join of each dimension table onto the trips fact table, generating one big table with 45 variables and 5653 tuples (rows).
```r
df <- trips |>
  left_join(trucks, by = join_by(numberplate)) |>
  left_join(dim_date, by = join_by(date)) |>
  left_join(dim_time, by = join_by(time == time_key))
```

This chapter dives into the results of the dataset. The first visual (Figure 3.3) shows the collection locations and the two treatment sites.
The registered trucks have variable volumes. Figure 3.1 shows the distribution of the volumes of the registered trucks.
```r
size_cat_order <- c("small", "medium", "large", "HUGE!")

df |>
  mutate(size_cat = factor(size_cat, levels = size_cat_order)) |>
  ggplot(aes(y = numberplate,
             x = volume,
             color = size_cat)) +
  geom_point(shape = 4, size = 1, stroke = 2) +
  labs(
    title = "Truck Volume Distribution",
    subtitle = "Plot of all vehicle sizes, with colors showing size category",
    x = "Volume (m3)",
    y = "Registered Trucks",
    color = "Size Category"
  )
```
Looking at the distribution, a large variance can be observed, but different clusters of sizes can also be distinguished. By grouping on typical sizes, it can be expected that different types of transport are linked. For examples of the different sizes, also refer to Strande, Ronteltap, and Brdjanovic (2014). To summarise the results, Figure 3.2 shows a boxplot and Table 3.1 shows statistics for the created groups.
```r
df |>
  ggplot(aes(y = volume,
             x = '')) +
  geom_boxplot() +
  geom_point(aes(color = size_cat)) +
  labs(
    title = "Truck Volume Distribution in a Boxplot",
    subtitle = "Boxplot of all vehicles, with colors showing size category",
    x = "Registered Trucks",
    y = "Volume (m3)",
    color = "Size Category"
  )
```
```r
trucks |>
  mutate(size_cat = factor(size_cat, levels = size_cat_order)) |>
  group_by(size_cat) |>
  summarise(count = n(),
            mean = round(mean(volume), 1),
            median = median(volume),
            min = min(volume),
            max = max(volume),
            sd = round(sd(volume), 1)) |>
  gt() |>
  tab_header(
    title = "Summary table of truck sizes, by size category") |>
  tab_options(
    table.width = pct(80),
    table.align = "center",
    heading.align = "center",
    table.border.top.width = px(2),
    table.border.top.color = "black",
    table.border.bottom.width = px(2),
    table.border.bottom.color = "black"
  )
```

Table 3.1: Summary table of truck sizes, by size category
| size_cat | count | mean | median | min | max | sd |
|---|---|---|---|---|---|---|
| small | 18 | 4.2 | 4.0 | 3 | 6 | 0.9 |
| medium | 5 | 10.0 | 10.0 | 10 | 10 | 0.0 |
| large | 8 | 33.6 | 36.0 | 25 | 37 | 4.8 |
| HUGE! | 4 | 68.5 | 68.5 | 65 | 72 | 4.0 |
Apart from the collection location, the trips dataset also states to which treatment plant the truck went for dislodging: one called Bugolobi and the other Lubigi, where Lubigi is located at the northern ring and Bugolobi in the southern part of downtown. To see whether the distance between a collection location and a treatment plant correlates with the choice of plant, all collection points are mapped in Figure 3.3 and coloured according to the dislodging location. The locations of the treatment plants have been added as a reference.
```r
map <- leaflet(data = df) |>
  addTiles() |>
  addCircleMarkers(~lon, ~lat,
                   popup = ~as.character(plant),
                   color = ~ifelse(plant == "Bugolobi", "red", "blue"),
                   radius = 0.5) |>
  addMarkers(lng = c(32.6071673, 32.5458844),
             lat = c(0.3190139, 0.3472747),
             popup = c("Bugolobi FS treatment plant", "Lubigi FS treatment plant"))

map
```

Based on the map, one can observe that the collection location has some correlation with the dislodging location: collections in the south and east tend to be dislodged at Bugolobi, while collections from the north and west tend to be dislodged at Lubigi. Although some correlation can be observed, the collection location is clearly not the only factor in deciding at which treatment plant to dislodge.
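This observation could be quantified. The sketch below is a rough check, assuming the joined df built above and the plant coordinates used in the map; the haversine_km helper is hypothetical, not part of the dataset. It computes the great-circle distance from each collection point to both plants and counts how often the truck dislodged at the nearer plant:

```r
# Haversine great-circle distance in km (hypothetical helper)
haversine_km <- function(lon1, lat1, lon2, lat2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(pmin(1, a)))
}

df |>
  mutate(
    dist_bugolobi = haversine_km(lon, lat, 32.6071673, 0.3190139),
    dist_lubigi   = haversine_km(lon, lat, 32.5458844, 0.3472747),
    nearest_plant = if_else(dist_bugolobi < dist_lubigi, "Bugolobi", "Lubigi")
  ) |>
  count(dislodged_at_nearest = plant == nearest_plant)
```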
```r
# Color palette for the size categories
pal <- colorFactor(
  palette = c("blue", "orange", "green", "red", "purple"),
  domain = df$size_cat
)

map <- leaflet() |>
  addTiles(group = "OSM") |>
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite")

# One marker layer per size category, so each can be toggled
for (sc in unique(df$size_cat)) {
  map <- map |>
    addCircleMarkers(
      data = df[df$size_cat == sc, ],
      ~lon, ~lat,
      fillColor = pal(sc),
      fillOpacity = 0.7,
      stroke = FALSE,
      radius = 3,
      group = sc
    )
}

map <- map |>
  addMarkers(
    lng = c(32.6071673, 32.5458844),
    lat = c(0.3190139, 0.3472747),
    popup = c(
      "Bugolobi FS treatment plant",
      "Lubigi FS treatment plant"
    ),
    group = "Treatment plants"
  ) |>
  addLayersControl(
    baseGroups = c("OSM", "Satellite"),
    overlayGroups = c(unique(df$size_cat), "Treatment plants"),
    options = layersControlOptions(collapsed = TRUE)
  )

map
```

Does size matter? In Figure 3.4 the collection points are mapped again, but this time the colour is based on the size categories. Although it is hard to conclude from the available data, there seem to be no clear clusters of points for the specific size categories; for real proof, more demographic data would be required. Since distance is apparently not the only factor in choosing a dislodging location, we look further for other drivers. The first option is to check whether the day of the week has an effect.
```r
week_day_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

df |>
  mutate(week_day_name = factor(week_day_name, levels = week_day_order)) |>
  ggplot(aes(x = week_day_name, fill = plant)) +
  geom_bar() +
  facet_wrap(~ plant, scales = "free_y", ncol = 1) +
  labs(
    title = "Collections per Weekday",
    subtitle = "Counts of collections per day of the week",
    x = "Day of Week",
    y = "# of collections",
    fill = "Plant"
  )
```
Figure 3.5 plots the counts of collections per day of the week. Additionally, the colour in the graph indicates at which of the dislodging plants the sludge was dumped. For the Lubigi plant, every day, including the weekend, seems constant. For the Bugolobi plant, one can see that Thursday seems to be the busiest day, followed by two quieter days on Friday and Saturday.
```r
ggplot(df, aes(x = hour24,
               fill = plant)) +
  geom_bar() +
  labs(
    title = "Count of collections per hour",
    subtitle = "Count of collections, with colors showing plant at which the truck dislodged",
    x = "Hour of the day",
    y = "# of entries",
    fill = "Dislodging location"
  )
```
By counting the timestamps and grouping them per hour of the day, one can observe the times at which most collections take place (see Figure 3.6). Clearly, most activity takes place during the daytime and not at night: it starts to get busy from 5:00 in the morning, and there is almost no activity after 18:00. It also looks like both plants are used equally for dislodging over the whole day, so no correlation between collection time and dislodging location can be observed.
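The visual impression that both plants receive a similar share of trips throughout the day can also be checked numerically. A small sketch (using the hour24 column joined in from dim_time above) tabulates the share of each plant per hour:

```r
# Share of dislodgings per plant for each hour of the day
df |>
  count(hour24, plant) |>
  group_by(hour24) |>
  mutate(share = round(n / sum(n), 2)) |>
  ungroup()
```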
```r
df |>
  ggplot(aes(y = numberplate,
             fill = plant)) +
  scale_x_continuous(labels = scales::percent) +
  geom_bar(position = "fill") +
  labs(title = "Percentage of dislodgings at each treatment plant grouped by numberplate")
```
Figure 3.7 seems to bring us closer to the factor that decides at which plant a truck dislodges. From the figure we can observe that every truck goes to only one plant. To prove this statement, Table 3.2 counts the trucks dislodging solely at one location and the trucks dislodging at two or more locations. The table confirms that 15 and 20 trucks dislodge at Bugolobi and Lubigi respectively, and zero trucks are ‘multi plant’ trucks.
```r
plate_types <- df |>
  mutate(
    plant = str_squish(plant),
    numberplate = str_squish(numberplate)
  ) |>
  distinct(numberplate, plant) |>
  group_by(numberplate) |>
  summarise(
    n_plants = n_distinct(plant),
    single_plant = if_else(n_plants == 1, first(plant), NA_character_),
    .groups = "drop"
  ) |>
  mutate(group = if_else(n_plants == 1, single_plant, "multi plant")) |>
  count(group, name = "count") |>
  rename(`single or double` = group)

plate_types |>
  complete(`single or double` = c("Lubigi", "Bugolobi", "multi plant"),
           fill = list(count = 0)) |>
  gt()
```

| single or double | count |
|---|---|
| Bugolobi | 15 |
| Lubigi | 20 |
| multi plant | 0 |
This report was prepared as the capstone project for the Data Science for OpenWashData course. The primary objective of the project was to develop technical proficiency in the R programming language and to gain experience in the publication and communication of analytical results.
During the course of the analysis, substantial effort was devoted to integrating multiple auxiliary data sources in order to explore trends in faecal sludge collection. Upon reflection, however, it became evident that the principal findings could already be derived from the initial dataset without the need for extensive data augmentation.
A key lesson from this project is the importance of conducting a comprehensive exploratory and analytical assessment of the core dataset at an early stage. Such an approach enables researchers to identify whether additional data sources are genuinely required, thereby reducing unnecessary complexity and improving analytical efficiency. Future analyses would benefit from prioritizing early insight generation before expanding the scope of data integration.
The original dataset on faecal sludge logistics was published under a CC BY license. The results of this report are presented here under a similar license.