An exploration to understand the structure and trends of an informal faecal sludge collection sector
| Abbreviation | Comment |
|---|---|
| FS | Faecal sludge |
The ‘fslogisticskampala’ dataset is shared on openwashdata by Lars Schöbitz. The goal of this dataset is to provide data on faecal sludge transporting logistics in Kampala, Uganda. The data was collected from 30 March 2015 until 25 June 2015 and contains the collection location and the dislodging location of each trip. After reading the shared dataset and the accompanying report, the author felt it would be interesting to see whether the dataset holds additional information and trends that have not yet been exposed in the narrative. This dataset was chosen for the Data Science for openwashdata course because it contains location data, in the form of GPS coordinates, and multiple tables that can be joined for the analysis. Additionally, it contains a time series, a common aspect of datasets and therefore useful for gaining hands-on experience.
First, the required libraries are imported to be able to read and process the datasets.

```r
library(tidyverse)
library(dplyr)
library(ggplot2)
library(gt)
library(here)
library(gtsummary)
library(lubridate)
library(leaflet)
```

The original set has two tables (“Fslogisticskampala: Data on Faecal Sludge Transporting Logistics in Kampala, Uganda” 2024): one contains information on the trips of the faecal sludge collection trucks, and the second contains metadata on the trucks. First we open trips.csv and save it as trips_raw:
```r
trips_raw <- read_csv(here::here("data/raw/trips.csv"))
```

Table 2.1 shows the dictionary of the trips dataset.
read_csv(here::here("data/processed/trips_dictionary.csv")) |>
gt()| variable_name | variable_type | description |
|---|---|---|
| fid | integer | key |
| numberplate | string | used as foreign key to link to trucks dataset |
| date | date | date in ISO 8601 derived style: year/month/day --> '2025/11/20' |
| time | time | time format |
| lat | double | latitude of collection location |
| lon | double | longitude of collection location |
| plant | string | Name of treatment plant |
From the dictionary it can be seen that the trips data holds information about each recorded trip. The ‘numberplate’ column can be used as the foreign key to link to the trucks dataset, as described below.
The second dataset to import is the trucks data as published on the website. We read trucks.csv and save it as trucks_raw:
```r
trucks_raw <- read_csv(here::here("data/raw/trucks.csv"))
```

Table 2.2 shows the dictionary of the trucks dataset. This is a simpler table containing metadata on the trucks and holds only two variables: the ‘numberplate’, which functions as the key to link to the trips dataset, and the volume of the specific truck in m3.
read_csv(here::here("data/processed/trucks_dictionary.csv")) |>
gt()| variable_name | variable_type | description |
|---|---|---|
| numberplate | string | license plate number used for identifying the truck |
| volume | integer | Volume in m3 |
To aid the analysis, additional tables are generated and imported: a date dimension with a daily granularity containing time-intelligence information for the data collection period, and a matching time dimension.
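The generation code for these dimension tables is not part of the published dataset; as a minimal sketch (hypothetical, deriving only a handful of the dim_date columns shown in Figure 2.1), such a date dimension could be built with lubridate like this:

```r
# Minimal sketch of a date dimension covering the collection period
# (hypothetical; only a subset of the dim_date columns is derived)
dim_date_sketch <- tibble(
  date = seq(ymd("2015-03-30"), ymd("2015-06-25"), by = "day")
) |>
  mutate(
    day           = day(date),
    week_day_name = wday(date, label = TRUE, abbr = FALSE, week_start = 1),
    month         = month(date),
    month_name    = month(date, label = TRUE, abbr = FALSE),
    year          = year(date),
    is_weekend    = wday(date, week_start = 1) > 5
  )
```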
To understand the data, we first assess the raw set by counting all non-NA values for each column (see Table 2.3). In total 5653 records are in the dataset, and it can be seen that no column has NA values.
```r
colSums(!is.na(trips_raw))
```

```
        fid numberplate        date        time         lat         lon 
       5653        5653        5653        5653        5653        5653 
      plant 
       5653 
```
```r
truck_dist <- trips_raw |>
  count(numberplate, name = "n_entries") |>
  arrange(desc(n_entries)) |>
  mutate("Percentage of entries" = round(n_entries / sum(n_entries) * 100, 2))

truck_dist
```

```
# A tibble: 35 × 3
   numberplate n_entries `Percentage of entries`
   <chr>           <int>                   <dbl>
 1 UAT 598T          315                    5.57
 2 UAQ 024L          286                    5.06
 3 UAQ 981Z          285                    5.04
 4 UAV 037R          271                    4.79
 5 UAU 189B          267                    4.72
 6 UAT 845E          254                    4.49
 7 AUS 119X          235                    4.16
 8 UAN 087N          218                    3.86
 9 LG 0366-01        216                    3.82
10 LG 0257-01        198                    3.5 
# ℹ 25 more rows
```
Of the 35 trucks in total, the truck with the highest number of entries accounts for only 5.6% of the entries, so the dataset is spread out enough to represent the truck drivers in Kampala. The trips dataset can be used as the fact table in the analysis.
The second dataset, called ‘trucks’, can be used as a dimension table for the trips data to retrieve metadata on the trucks, in this case only the volume. In total 35 trucks are registered, and each has a reported volume as an attribute:
```r
colSums(!is.na(trucks_raw))
```

```
numberplate      volume 
         35          35 
```
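Before linking the two tables, the foreign-key relationship is worth a quick sanity check. A minimal sketch (assuming the trips_raw and trucks_raw tables loaded above) lists any numberplates that appear in the trips data but not in the trucks data:

```r
# Numberplates in trips without a match in trucks;
# an empty result confirms the foreign key is complete
trips_raw |>
  distinct(numberplate) |>
  anti_join(trucks_raw, by = "numberplate")
```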
After cleaning, the processed trips data is written out as a CSV:
```r
trips <- trips_raw |>
  mutate(numberplate = as.character(numberplate)) |>
  write_csv(here::here("data/processed/trips.csv"))
```

The raw trucks table is read, cleaned with the necessary steps, and stored as trucks for further analysis.
```r
trucks <- trucks_raw |>
  mutate(numberplate = as.character(numberplate)) |>
  mutate(size_cat = case_when(
    volume < 8 ~ "small",
    volume < 12 ~ "medium",
    volume < 40 ~ "large",
    volume < 999 ~ "HUGE!",
    TRUE ~ "unknown"
  )) |>
  write_csv(here::here("data/processed/trucks.csv"))
```

Since dim_date and dim_time are generated for this specific reporting purpose, they are placed directly in the processed folder; no additional cleaning steps are required. The files are read into data frames so they can easily be referenced in the code.
```r
dim_date <- read_csv(here::here("data/processed/dim_date.csv"))
dim_time <- read_csv(here::here("data/processed/dim_time.csv"))
```

First, each cleaned dataset is stored in the ‘data/processed’ folder. This makes for a versatile setup in which cleaning steps can be incorporated into the dataset whenever deemed necessary while analysing the results. After saving the individual tables, they are linked according to a ‘star’ schema based on Kimball and Ross (2013). For the analysis, the trips data is used as the fact table, and the trucks, dim_date and dim_time tables are linked as dimension tables over which the set can be grouped and filtered. Figure 2.1 shows the schematic of the star schema.
```mermaid
erDiagram
trips }o--o| trucks : numberplate
trips }o--o| dim_date : date
trips }o--o| dim_time : time
trucks {
id numberplate
integer volume
string size_cat
}
trips {
id fid
string numberplate
date date
time time
float lat
float lon
string plant
}
dim_date {
date date
integer day
string day_suffix
integer week_day
string week_day_name
string week_day_name_short
string week_dayname_first_letter
integer day_of_year
integer week_of_month
integer week_of_year
integer month
string month_name
string month_name_Short
string month_name_first_letter
integer quarter
string quarter_name
integer year
integer yyyymm
string year_month
binary is_weekend
date first_date_of_month
date last_date_of_month
date first_date_of_week
date last_date_of_week
string year_week
}
dim_time{
integer hour12
integer hour24
integer minute_of_hour
integer second_of_minute
integer elapsed_minutes
integer elapsed_seconds
string am_pm
time hhmmss
binary working_hours
integer part_of_day_sort
string part_of_day
}
```
To generate the star schema, the code below can be executed. It performs a left join of each dimension table onto the trips fact table, generating one big table with 45 variables and 5653 tuples (rows).
```r
df <- trips |>
  left_join(trucks, by = join_by(numberplate)) |>
  left_join(dim_date, by = join_by(date)) |>
  left_join(dim_time, by = join_by(time == time_key))
```

This chapter dives into the results of the dataset. The first visual (Figure 3.3) shows the collection locations and the two treatment sites.
The registered trucks have variable volumes. Figure 3.1 shows the distribution of the volumes of the registered trucks.
```r
size_cat_order <- c("small", "medium", "large", "HUGE!")

df |>
  mutate(size_cat = factor(size_cat, levels = size_cat_order)) |>
  ggplot(aes(y = numberplate,
             x = volume,
             color = size_cat)) +
  geom_point(shape = 4, size = 1, stroke = 2) +
  labs(
    title = "Truck Volume Distribution",
    subtitle = "Plot of all vehicle sizes, with colors showing size category",
    x = "Volume (m3)",
    y = "Registered Trucks",
    color = "Size Category"
  )
```
Looking at the distribution, a large variance can be observed, but different clusters of sizes can also be distinguished. By grouping on typical sizes, it can be expected that different types of transport are linked. For examples of the different sizes, also refer to Strande, Ronteltap, and Brdjanovic (2014). To summarise the results, Figure 3.2 shows a boxplot and Table 3.1 shows statistics for the created groups.
```r
df |>
  ggplot(aes(y = volume,
             x = '')) +
  geom_boxplot() +
  geom_point(aes(color = size_cat)) +
  labs(
    title = "Truck Volume Distribution in a Boxplot",
    subtitle = "Boxplot of all vehicles, with colors showing size category",
    x = "Registered Trucks",
    y = "Volume (m3)",
    color = "Size Category"
  )
```
```r
trucks |>
  mutate(size_cat = factor(size_cat, levels = size_cat_order)) |>
  group_by(size_cat) |>
  summarise(count = n(),
            mean = round(mean(volume), 1),
            median = median(volume),
            min = min(volume),
            max = max(volume),
            sd = round(sd(volume), 1)) |>
  gt() |>
  tab_header(
    title = "Summary table of truck sizes, by size category") |>
  tab_options(
    table.width = pct(80),
    table.align = "center",
    heading.align = "center",
    table.border.top.width = px(2),
    table.border.top.color = "black",
    table.border.bottom.width = px(2),
    table.border.bottom.color = "black"
  )
```

Table 3.1: Summary table of truck sizes, by size category
| size_cat | count | mean | median | min | max | sd |
|---|---|---|---|---|---|---|
| small | 18 | 4.2 | 4.0 | 3 | 6 | 0.9 |
| medium | 5 | 10.0 | 10.0 | 10 | 10 | 0.0 |
| large | 8 | 33.6 | 36.0 | 25 | 37 | 4.8 |
| HUGE! | 4 | 68.5 | 68.5 | 65 | 72 | 4.0 |
Apart from the collection location, the trips dataset also states to which treatment plant the truck went for dislodging: one called Bugolobi and the other Lubigi, where Lubigi is located at the northern ring and Bugolobi in the southern part of downtown. To see whether the distance between a collection location and a treatment plant correlates with the choice of plant, all collection points are mapped in Figure 3.3 and coloured according to the dislodging location. The locations of the treatment plants have been added as a reference.
```r
map <- leaflet(data = df) |>
  addTiles() |>
  addCircleMarkers(~lon, ~lat,
                   popup = ~as.character(plant),
                   color = ~ifelse(plant == "Bugolobi", "red", "blue"),
                   radius = 0.5) |>
  addMarkers(lng = c(32.6071673, 32.5458844),
             lat = c(0.3190139, 0.3472747),
             popup = c("Bugolobi FS treatment plant", "Lubigi FS treatment plant"))

map
```

Based on the map, one can observe that the collection location has some correlation with the dislodging location: collections in the south and east tend to be dislodged at Bugolobi, while collections from the north and west tend to be dislodged at Lubigi. Although some correlation can be observed, the collection location is clearly not the only factor in deciding at which treatment plant to dislodge.
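This observation could be quantified. The sketch below is a rough check, assuming the joined df built above and the plant coordinates used in the map; the haversine_km helper is hypothetical, not part of the dataset. It computes the great-circle distance from each collection point to both plants and counts how often the truck dislodged at the nearer plant:

```r
# Haversine great-circle distance in km (hypothetical helper)
haversine_km <- function(lon1, lat1, lon2, lat2) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * 6371 * asin(sqrt(pmin(1, a)))
}

df |>
  mutate(
    dist_bugolobi = haversine_km(lon, lat, 32.6071673, 0.3190139),
    dist_lubigi   = haversine_km(lon, lat, 32.5458844, 0.3472747),
    nearest_plant = if_else(dist_bugolobi < dist_lubigi, "Bugolobi", "Lubigi")
  ) |>
  count(dislodged_at_nearest = plant == nearest_plant)
```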
```r
# Color palette for the size categories
pal <- colorFactor(
  palette = c("blue", "orange", "green", "red", "purple"),
  domain = df$size_cat
)

map <- leaflet() |>
  addTiles(group = "OSM") |>
  addProviderTiles(providers$Esri.WorldImagery, group = "Satellite")

# One marker layer per size category, so each can be toggled
for (sc in unique(df$size_cat)) {
  map <- map |>
    addCircleMarkers(
      data = df[df$size_cat == sc, ],
      ~lon, ~lat,
      fillColor = pal(sc),
      fillOpacity = 0.7,
      stroke = FALSE,
      radius = 3,
      group = sc
    )
}

map <- map |>
  addMarkers(
    lng = c(32.6071673, 32.5458844),
    lat = c(0.3190139, 0.3472747),
    popup = c(
      "Bugolobi FS treatment plant",
      "Lubigi FS treatment plant"
    ),
    group = "Treatment plants"
  ) |>
  addLayersControl(
    baseGroups = c("OSM", "Satellite"),
    overlayGroups = c(unique(df$size_cat), "Treatment plants"),
    options = layersControlOptions(collapsed = TRUE)
  )

map
```

Does size matter? In Figure 3.4 the collection points are mapped again, but this time the colour is based on the size categories. Although it is hard to conclude from the available data, there seem to be no clear clusters of points for the specific size categories; for real proof, more demographic data would be required. Since distance is apparently not the only factor in choosing a dislodging location, we look further for other drivers. The first option is to check whether the day of the week has an effect.
```r
week_day_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

df |>
  mutate(week_day_name = factor(week_day_name, levels = week_day_order)) |>
  ggplot(aes(x = week_day_name, fill = plant)) +
  geom_bar() +
  facet_wrap(~ plant, scales = "free_y", ncol = 1) +
  labs(
    title = "Collections per Weekday",
    subtitle = "Counts of collections per day of the week",
    x = "Day of Week",
    y = "# of collections",
    fill = "Plant"
  )
```
Figure 3.5 plots the counts of collections per day of the week. Additionally, the colour in the graph indicates at which of the dislodging plants the sludge was dumped. For the Lubigi plant, every day, including the weekend, seems constant. For the Bugolobi plant, one can see that Thursday seems to be the busiest day, followed by two quieter days on Friday and Saturday.
```r
ggplot(df, aes(x = hour24,
               fill = plant)) +
  geom_bar() +
  labs(
    title = "Count of collections per hour",
    subtitle = "Count of collections, with colors showing plant at which the truck dislodged",
    x = "Hour of the day",
    y = "# of entries",
    fill = "Dislodging location"
  )
```
By counting the timestamps and grouping them per hour of the day, one can observe the times at which most collections take place (see Figure 3.6). Clearly, most activity takes place during the daytime and not at night: it starts to get busy from 5:00 in the morning, and there is almost no activity after 18:00. It also looks like both plants are used equally for dislodging over the whole day, so no correlation between collection time and dislodging location can be observed.
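The visual impression that both plants receive a similar share of trips throughout the day can also be checked numerically. A small sketch (using the hour24 column joined in from dim_time above) tabulates the share of each plant per hour:

```r
# Share of dislodgings per plant for each hour of the day
df |>
  count(hour24, plant) |>
  group_by(hour24) |>
  mutate(share = round(n / sum(n), 2)) |>
  ungroup()
```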
```r
df |>
  ggplot(aes(y = numberplate,
             fill = plant)) +
  scale_x_continuous(labels = scales::percent) +
  geom_bar(position = "fill") +
  labs(title = "Percentage of dislodgings at each treatment plant grouped by numberplate")
```
Figure 3.7 seems to bring us closer to the factor that decides at which plant a truck dislodges. From the figure we can observe that every truck goes to only one plant. To prove this statement, Table 3.2 counts the trucks dislodging solely at one location and the trucks dislodging at two or more locations. The table confirms that 15 and 20 trucks dislodge at Bugolobi and Lubigi respectively, and zero trucks are ‘multi plant’ trucks.
```r
plate_types <- df |>
  mutate(
    plant = str_squish(plant),
    numberplate = str_squish(numberplate)
  ) |>
  distinct(numberplate, plant) |>
  group_by(numberplate) |>
  summarise(
    n_plants = n_distinct(plant),
    single_plant = if_else(n_plants == 1, first(plant), NA_character_),
    .groups = "drop"
  ) |>
  mutate(group = if_else(n_plants == 1, single_plant, "multi plant")) |>
  count(group, name = "count") |>
  rename(`single or double` = group)

plate_types |>
  complete(`single or double` = c("Lubigi", "Bugolobi", "multi plant"),
           fill = list(count = 0)) |>
  gt()
```

| single or double | count |
|---|---|
| Bugolobi | 15 |
| Lubigi | 20 |
| multi plant | 0 |
This report was prepared as the capstone project for the Data Science for OpenWashData course. The primary objective of the project was to develop technical proficiency in the R programming language and to gain experience in the publication and communication of analytical results.
During the course of the analysis, substantial effort was devoted to integrating multiple auxiliary data sources in order to explore trends in faecal sludge collection. Upon reflection, however, it became evident that the principal findings could already be derived from the initial dataset without the need for extensive data augmentation.
A key lesson from this project is the importance of conducting a comprehensive exploratory and analytical assessment of the core dataset at an early stage. Such an approach enables researchers to identify whether additional data sources are genuinely required, thereby reducing unnecessary complexity and improving analytical efficiency. Future analyses would benefit from prioritizing early insight generation before expanding the scope of data integration.
The original dataset on faecal sludge logistics was published under a CC BY license. The results of this report are presented here under a similar license.