library(tidyverse)
library(gt)
library(ggplot2)
library(scales)Land suitability analysis for cowpea production in South Africa using output simulated by the AquaCrop model
Capstone project
1. Introduction
Cowpea is an underutilised indigenous legume crop grown by smallholder farmers in South Africa. When compared to other legumes, cowpea is tolerant to water and heat stress, and thus is resilient to increasing climate variability. According to Karl et al. (2024), cowpea is rich in protein and other essential nutrients (e.g. folate, zinc and iron), making the crop suitable for improving food security in rural communities. Despite the crop’s potential, it remains underutilised and neglected (i.e. under-researched) in South Africa. Hence, there is limited evidence from field experiments that describe the crop’s potential. To fill this knowledge gap, a crop simulation model developed by the Food and Agriculture Organisation (FAO) called AquaCrop (Hsiao et al., 2009; Raes et al., 2009; Steduto et al., 2009), was used to generate simulated data across South Africa for cowpea planted on the 01st of November in each season at 80000 plants per hectare (ha-1).
2. Methods
For each HRZ, AquaCrop generated up to 49 consecutive seasons of data from 1950/51 to 1998/99, from which long-term statistics were calculated for important AquaCrop output variables. The 1st objective was to generate long-term statistics from the seasonal dataset and to store this output in the data/processed folder (part 1; see Section 2a). The 2nd objective was to analyse the statistics to select zones that are deemed suitable for cowpea production (part 2; see Section 2b). This dataset is stored in the data/final folder. For the data analysis, the following R libraries are required:
2a. Generating long-term statistics
The raw dataset contains output simulated by AquaCrop for cowpea using 50 years of input climate and soil data for each of the 5838 homogeneous response zones (HRZs) across South Africa, which generated up to 49 seasons of data (1950/51 to 1998/99) for 22 important output variables (from Rain to WPet).
cpe_raw <- read_csv(here::here("data/raw/subc_5838_cpea_mthk_fro1.csv"))
ncl <- ncol(cpe_raw)
cat("The number of columns in the data frame (raw_cpe) is:", ncl)The number of columns in the data frame (raw_cpe) is: 30
nrw <- nrow(cpe_raw)
cat("The number of rows in the dataframe is:", nrw)The number of rows in the dataframe is: 268994
cat("The first five rows of data are:")The first five rows of data are:
head(cpe_raw, n = 5) |>
gt()| Sea | Day1 | Month1 | Year1 | Rain | ETo | GD | CO2 | Infilt | Runoff | Drain | Upflow | E | E/Ex | Tr | TrW | Tr/Trx | Cycle | TmpStr | ExpStr | StoStr | BioMass | BioRel | HI | Yield | WPet | DayN | MonthN | YearN | subc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 11 | 1950 | 299.8 | 581.4 | 1261.3 | 311.30 | 290.5 | 9.3 | 47.5 | 0 | 100.7 | 29 | 201.3 | 201.3 | 72 | 95 | 1 | 22 | 22 | 4.914 | 58 | 18.2 | 0.895 | 0.30 | 11 | 2 | 1951 | 1 |
| 2 | 1 | 11 | 1951 | 142.2 | 585.4 | 1271.9 | 311.54 | 142.2 | 0.0 | 4.4 | 0 | 88.6 | 20 | 109.1 | 109.1 | 58 | 81 | 0 | 39 | 34 | 2.074 | 29 | 10.3 | 0.214 | 0.11 | 30 | 1 | 1952 | 1 |
| 3 | 1 | 11 | 1952 | 362.1 | 559.9 | 1262.1 | 311.78 | 355.3 | 6.8 | 89.7 | 0 | 141.2 | 46 | 174.6 | 174.6 | 59 | 92 | 1 | 31 | 27 | 4.176 | 53 | 4.7 | 0.194 | 0.06 | 8 | 2 | 1953 | 1 |
| 4 | 1 | 11 | 1953 | 287.0 | 562.4 | 1267.6 | 313.00 | 283.1 | 3.9 | 50.7 | 0 | 128.2 | 40 | 152.9 | 152.9 | 55 | 94 | 1 | 32 | 32 | 3.767 | 46 | 8.1 | 0.307 | 0.11 | 10 | 2 | 1954 | 1 |
| 5 | 1 | 11 | 1954 | 538.1 | 546.6 | 1259.7 | 314.25 | 489.8 | 48.3 | 206.0 | 0 | 129.5 | 38 | 148.3 | 148.3 | 62 | 97 | 1 | 43 | 24 | 3.960 | 46 | 18.4 | 0.730 | 0.26 | 13 | 2 | 1955 | 1 |
cat("The last five rows of data are:")The last five rows of data are:
tail(cpe_raw, n = 5) |>
gt()| Sea | Day1 | Month1 | Year1 | Rain | ETo | GD | CO2 | Infilt | Runoff | Drain | Upflow | E | E/Ex | Tr | TrW | Tr/Trx | Cycle | TmpStr | ExpStr | StoStr | BioMass | BioRel | HI | Yield | WPet | DayN | MonthN | YearN | subc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 45 | 1 | 11 | 1994 | 294.6 | 476.3 | 1263.4 | 359.82 | 270.0 | 24.6 | 102.6 | 0 | 156.8 | 30 | 0.3 | 0.3 | 47 | 73 | 0 | 50 | 27 | 0.007 | 0 | 0.0 | 0.000 | 0.00 | 18 | 1 | 1995 | 5838 |
| 46 | 1 | 11 | 1995 | 221.1 | 446.0 | 1263.6 | 361.72 | 209.6 | 11.5 | 51.7 | 0 | 105.4 | 31 | 81.5 | 81.5 | 57 | 74 | 0 | 40 | 29 | 2.254 | 30 | 10.1 | 0.227 | 0.12 | 20 | 1 | 1996 | 5838 |
| 47 | 1 | 11 | 1996 | 298.1 | 470.8 | 1270.1 | 363.17 | 285.0 | 13.1 | 26.9 | 0 | 113.2 | 33 | 129.5 | 129.5 | 78 | 72 | 0 | 40 | 21 | 3.222 | 44 | 19.6 | 0.633 | 0.26 | 17 | 1 | 1997 | 5838 |
| 48 | 1 | 11 | 1997 | 293.3 | 464.7 | 1259.5 | 365.22 | 264.4 | 28.9 | 52.5 | 0 | 128.6 | 40 | 100.2 | 96.9 | 60 | 75 | 0 | 35 | 29 | 2.695 | 35 | 20.4 | 0.551 | 0.24 | 19 | 1 | 1998 | 5838 |
| 49 | 1 | 11 | 1998 | 486.6 | 445.2 | 1268.7 | 367.54 | 439.5 | 47.1 | 142.6 | 0 | 141.4 | 66 | 208.0 | 208.0 | 78 | 77 | 1 | 11 | 15 | 5.705 | 75 | 18.9 | 1.081 | 0.31 | 22 | 1 | 1999 | 5838 |
Since the number of rows is less than the expected 286062 rows (i.e. 5838 x 49), not all zones have 49 seasonal simulations, which is important to note. In total, there are 17219 rows of “missing” data. This occurs because AquaCrop is not run for seasons and HRZs that are too cold to support economically viable crop production. According to Gerrano et al. (2022), cowpea’s potential yield is about 3 tons per ha (or t ha-1). A total of 162 seasonal yield simulations exceeded 3 t ha-1.
cpe_raw |>
filter(Yield >= 3, is.na = TRUE) |> # select yields >= 3 tons per hectare
# exclude any missing (NA) values
count() # 162 seasonal values with yield > 3 t/haThe highest yield of 3.36 t ha-1 was simulated in HRZ number (subc) 4252 in one season. HRZ number 5767 has 8 seasonal yields of 3 t ha-1 or higher, followed by zones 4567, 5758 and 5764 with 7 seasonal yields exceeding 3 t ha-1.
cpe_raw |>
select(subc, Sea, Yield) |> # select variables of interest
filter(Yield >= 3, is.na = TRUE) |> # select yields >= 3 tons per hectare (t/ha)
# exclude any NA values
group_by(subc) |> # group yields by HRZ number (subc)
summarise(nsea = n(), # count the number of seasons (sea)
ymax = max(Yield) # obtain the largest yield in each HRZ
) |>
arrange(desc(ymax)) |> # arrange maximum yield in descending order, or
gt() |> # output a table with 78 zones
tab_options(
container.height = "300px", # Set a fixed height for vertical scroll
container.width = "200px" # Set a fixed width for horizontal scroll
)| subc | nsea | ymax |
|---|---|---|
| 4252 | 1 | 3.357 |
| 4567 | 7 | 3.346 |
| 5758 | 7 | 3.340 |
| 5764 | 7 | 3.329 |
| 5767 | 8 | 3.322 |
| 3296 | 4 | 3.287 |
| 4585 | 6 | 3.284 |
| 4621 | 3 | 3.283 |
| 4627 | 2 | 3.268 |
| 5683 | 4 | 3.266 |
| 4448 | 1 | 3.248 |
| 4472 | 5 | 3.246 |
| 4579 | 6 | 3.227 |
| 3290 | 1 | 3.206 |
| 4576 | 6 | 3.202 |
| 5371 | 2 | 3.173 |
| 5618 | 1 | 3.172 |
| 5374 | 2 | 3.167 |
| 4465 | 3 | 3.166 |
| 4568 | 4 | 3.161 |
| 5432 | 1 | 3.156 |
| 3323 | 2 | 3.151 |
| 5677 | 1 | 3.143 |
| 1677 | 1 | 3.137 |
| 551 | 1 | 3.132 |
| 3362 | 1 | 3.130 |
| 5494 | 2 | 3.128 |
| 5590 | 1 | 3.124 |
| 5434 | 2 | 3.122 |
| 5617 | 1 | 3.122 |
| 655 | 4 | 3.121 |
| 5755 | 2 | 3.121 |
| 5614 | 3 | 3.120 |
| 4577 | 1 | 3.113 |
| 5753 | 2 | 3.112 |
| 4573 | 1 | 3.111 |
| 661 | 1 | 3.109 |
| 4843 | 2 | 3.105 |
| 5499 | 1 | 3.102 |
| 658 | 2 | 3.101 |
| 5675 | 4 | 3.096 |
| 5279 | 1 | 3.095 |
| 5549 | 1 | 3.094 |
| 587 | 2 | 3.093 |
| 4683 | 2 | 3.091 |
| 5657 | 2 | 3.090 |
| 5497 | 1 | 3.085 |
| 5611 | 2 | 3.084 |
| 5684 | 1 | 3.082 |
| 4628 | 1 | 3.077 |
| 5429 | 1 | 3.072 |
| 4427 | 1 | 3.069 |
| 556 | 1 | 3.067 |
| 4638 | 1 | 3.061 |
| 5686 | 2 | 3.059 |
| 557 | 1 | 3.056 |
| 670 | 1 | 3.051 |
| 5788 | 2 | 3.048 |
| 5495 | 1 | 3.046 |
| 682 | 2 | 3.044 |
| 735 | 1 | 3.042 |
| 3366 | 1 | 3.040 |
| 5654 | 1 | 3.039 |
| 4917 | 1 | 3.038 |
| 4675 | 1 | 3.034 |
| 1701 | 1 | 3.033 |
| 3363 | 3 | 3.033 |
| 3311 | 1 | 3.031 |
| 550 | 1 | 3.029 |
| 547 | 1 | 3.024 |
| 5544 | 1 | 3.021 |
| 742 | 1 | 3.013 |
| 4941 | 1 | 3.007 |
| 760 | 1 | 3.006 |
| 5431 | 1 | 3.005 |
| 3369 | 1 | 3.004 |
| 5594 | 1 | 3.004 |
| 4841 | 1 | 3.002 |
For each season in all HRZs, the following additional variables were calculated:
yield (Y) in kg ha-1 (not t ha-1),
evapotranspiration (ET or crop water use in mm) as the sum of soil water evaporation (E) and crop transpiration (Tr),
crop water productivity (CWP in kg m-3) as the ratio of Y (in kg ha-1) to ET (in m3 ha-1), and
the ratio of harvest index (HI) to the reference harvest index (HIo) expressed as a percentage, where HIo is 26% for cowpea.
HI is used to calculate crop yield from accumulated biomass production. Although AquaCrop outputs crop water productivity (called WPet), it was calculated per season to check the model’s output and the summation of ET.
cpe_new <- cpe_raw |>
mutate("Yield_new" = Yield * 1000, .after = Yield) |> # convert Y to kg/ha
mutate("ET" = E + Tr, .after = Yield_new) |> # sum E & T to ET
mutate("CWP" = Yield_new / (ET * 10), .after = ET) |> # calculate CWP = Y / ET
mutate("HI/HIo" = 100 * HI / 26, .after = HI) # calculate Hi/HIoIt is important to note that certain HRZs have no data because the model crashed during execution and produced no output. This happens when the zone’s climate is far too cold for crop cultivation.
nzo <- cpe_new |>
filter(Sea == 0) |> # if Sea = 0, there is no data, i.e. the model crashed
count()
cat("The number of zones with no yield data is:", round(nzo[[1]], 0))The number of zones with no yield data is: 151
For selected variables such as rainfall (Rain or rfl), crop cycle length (Cycle or ccl), B/Bx (BioRel or bxr), HI/HIo (hxr), yield (or yld) and CWP (or WPet),
cpe_trm <- cpe_new |>
select("subc", "seas" = "Sea", "rfl" = "Rain", "ccl" = "Cycle", "bxr" = "BioRel", "hxr" = "HI/HIo", "yld" = "Yield_new", "cwp" = "WPet")the 1st main objective was to generate long-term statistics (e.g. mean, standard deviation, coefficient of variation, number of zero values, number of missing (NA) values, total number of values) from the seasonal data.
cpe_sta <- cpe_trm |>
group_by(subc) |>
summarise(
# for all variables from rfl to cwp, no need for a loop
across(rfl:cwp,
list(
# calculate the mean
ave = \(x) mean(x, na.rm = TRUE),
# calculate the standard deviation
sdv = \(x) sd(x, na.rm = TRUE),
# calculate the coefficent of variation as a %
cov = \(x) 100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
)),
yld_non = sum(is.na(yld)), # count of missing yield values
yld_noi = sum(is.infinite(yld)), # count of infinite yield values
yld_nom = 49 - n(), # count of missing seasonal yields
yld_noz = sum(yld == 0, na.rm = TRUE), # count of zero yields
yld_nov = sum(yld > 0, na.rm = TRUE), # count of non-zero yields
yld_nos = n() # count of seasons with yield values
)To ensure the statistics were correct, the following were summed, namely the number of:
missing (NA) yield values (yld_non),
infinite (INF) yield values (yld_noi),
missing seasons (yld_nom),
zero yield values (yld_noz), and
non-zero yield values (yld_nov).
If the above values did not sum to 49 seasons in total, then there is an error.
tot <- cpe_sta |>
mutate("yld_chk" = (yld_non + yld_noi + yld_nom + yld_noz + yld_nov)) |>
count(yld_chk)
cat("The number of zones with a sum of", tot[[1]], "values is", tot[[2]])The number of zones with a sum of 49 values is 5838
The number of yield values equal to zero kg ha-1 (yld_noz) or greater than zero kg ha-1 (yld_nov) was also calculated (yld_nzv). Furthermore, the risk of crop failure (RCF) was calculated as a ratio of the number of missing, infinite and zero seasonal yields to the total number of simulations (49), which was then expressed as a percentage.
cpe_sta <- cpe_sta |>
mutate("yld_nzv" = yld_noz + yld_nov, .after = "yld_nov") |>
mutate("rcf_per" = 100 * (yld_non + yld_noi + yld_nom + yld_noz) / 49,
.after = cwp_cov)The analysis-ready statistics were outputted to a .CSV file in the data/processed folder. This was important since the number of rows of data was reduced from 268994 to 5838.
write_csv(cpe_sta, here::here("data/processed/stat_5838_cpea_mthk_fro1.csv"))2b. Identifying suitable cowpea production zones
Cowpea cannot be cultivated in all 5838 HRZs across the country (e.g. in zones that are too dry and/or cold), and thus the goal was to analyse the statistics to identify zones that are deemed suitable for cowpea production.
cpe_sta <- read_csv(here::here("data/processed/stat_5838_cpea_mthk_fro1.csv"))The first step involved identifying variables with missing (i.e. NA) data.
cpe_sta |>
summarise(
# for all variables from rfl_ave to rcf_per, no need for a loop
across(rfl_ave:rcf_per,
list(mis = \(x) sum(is.na(x)) # calculate number of NA values
))) |>
pivot_longer(cols = rfl_ave_mis:rcf_per_mis, # convert table from wide to long format
names_to = "variable",
values_to = "num_na") |>
arrange(num_na) |> # sort table into ascending order
filter(num_na > 0) |> # filter out variables with no NA values
gt() |>
tab_options(
container.height = "300px", # set a fixed height for vertical scroll
container.width = "200px" # set a fixed width for horizontal scroll
)| variable | num_na |
|---|---|
| rfl_sdv_mis | 189 |
| rfl_cov_mis | 189 |
| ccl_sdv_mis | 189 |
| ccl_cov_mis | 189 |
| bxr_sdv_mis | 189 |
| hxr_sdv_mis | 189 |
| yld_sdv_mis | 189 |
| cwp_sdv_mis | 189 |
| bxr_cov_mis | 192 |
| hxr_cov_mis | 362 |
| yld_cov_mis | 362 |
| cwp_cov_mis | 367 |
cnt <- cpe_sta |>
filter(is.na(cwp_cov)) |> # check for NA values in CV of CWP
count() # count the number of NA values
cat("To verify, the number of NA values in CWPcov is", cnt[[1]])To verify, the number of NA values in CWPcov is 367
If the coefficient of variation (CV in %) was missing (NA), values were converted to 700%. Similarly, missing standard deviations were converted to a value of 1200. A subset of the statistics was then created by only considering variables that could be used to identify suitable zones for cowpea production.
cpe_sta_fin <- cpe_sta |>
mutate(across(ends_with("_cov"), \(x) coalesce(x, 700))) |>
mutate(across(ends_with("_sdv"), \(x) coalesce(x, 1200))) |>
select("subc", "rfl_ave", "ccl_ave", "bxr_ave", "hxr_ave",
"yld_nzv", "yld_ave", "cwp_ave", "cwp_cov", "rcf_per")By creating new variables (set to 1 or yes), suitable crop production areas were identified with:
a) above zero seasonally averaged rainfall (RFLAVE) data (i.e. also excludes zones with no data);
cpe_sta_fin$rfl_ave_gt0 <- ifelse(cpe_sta_fin$rfl_ave > 0, 1, 0)
cnt <- cpe_sta_fin |>
filter(rfl_ave_gt0 == 0) |>
count() # 155 zones excluded
cat("Number of zones excluded since RFLave = 0 mm is", cnt[[1]])Number of zones excluded since RFLave = 0 mm is 155
b) a seasonally average HI/HIo ratio (called hxr_ave) above 25% (Lake, 2024):
cpe_sta_fin$hxr_ave_gt25 <- ifelse(cpe_sta_fin$hxr_ave > 25, 1, 0)c) more than 4 seasons of data (else too few data values for calculating reliable statistics);
cpe_sta_fin$yld_nzv_ge4 <- ifelse(cpe_sta_fin$yld_nzv >= 4, 1, 0)d) more than 30 (out of 49) seasons of data, else insufficient data values to calculate reliable long-term statistics according to Steduto et al. (2012);
cpe_sta_fin$yld_nzv_ge30 <- ifelse(cpe_sta_fin$yld_nzv >= 30, 1, 0)e) average yields above 25 kg ha-1 (else unviable for smallholder farmers according to Gerrano et al. (2022)), which helps to exclude zones where CWP is 0 kg m-3 of water use and the RCF is 100%;
cpe_sta_fin$yld_ave_gt25 <- ifelse(cpe_sta_fin$yld_ave > 25, 1, 0)f) low (e.g. < 55%) risk of crop failure (else economically unviable).
cpe_sta_fin$rcf_per_lt55 <- ifelse(cpe_sta_fin$rcf_per < 55, 1, 0)Next, another variable called suitable (0=no; 1=yes) was created from the product of the above-mentioned criteria.
cpe_sta_fin <- transform(cpe_sta_fin, suitable = rfl_ave_gt0 * hxr_ave_gt25 * yld_nzv_ge4 * yld_nzv_ge30 * yld_ave_gt25 * rcf_per_lt55)According to FAO (1976), suitable cropping areas can be classified as highly suitable (S1), moderately suitable (S2) or marginally suitable (S3). Lake (2024) noted that the seasonally averaged HI/HIo ratio (variable called hxr_ave) can be used to classify suitability as follows: < 50% (S3), 50 to 75% (S2) and > 75% (S1). For zones deemed suitable for cowpea production (i.e. suitable=1), another suitability index was created as follows:
cpe_sta_fin <- cpe_sta_fin |>
mutate(suitability =
case_when(
hxr_ave <= 25 ~ "Unsuitable",
hxr_ave > 25 & hxr_ave <= 50 ~ "Marginal",
hxr_ave > 50 & hxr_ave <= 75 ~ "Moderate",
hxr_ave > 75 ~ "High")
)The analysed statistics were then outputted to a .CSV file in the data/final folder.
write_csv(cpe_sta_fin, here::here("data/final/suit_5838_cpea_mthk_fro1.csv"))3. Results
To recap, AquaCrop was run for all HRZs to simulate cowpea planted on the 01st of November in each season at 80 000 plants ha-1. Long-term statistics for important AquaCrop output variables were calculated from the seasonal simulations. These statistics were then used to determine which zones are deemed suitable for rainfed production of cowpea. An analysis of the final dataset is presented in this section.
# read the final .CSV file into a data frame
cpe_sta_fin <- read_csv(here::here("data/final/suit_5838_cpea_mthk_fro1.csv"))The mean annual rainfall across the western parts of South Africa is below 400 mm, which is considered too dry to support economically viable crop production under rainfed conditions. Similarly, the climate is too cold in many HRZs located in mountainous regions, especially those bordering the country of Lesotho. Hence, not all HRZs are suitable for cowpea cultivation. The number of zones eliminated in each step was calculated in two different ways:
tab_sum_eln <- cpe_sta_fin |>
# order the variables from highest to lowest number of zones eliminated
select("hxr_ave_gt25", "rcf_per_lt55", "yld_ave_gt25",
"yld_nzv_ge4", "yld_nzv_ge30", "rfl_ave_gt0", "suitable") |>
summarise(
across(everything(),
list(n = \(x) sum(x)) # calculate the sum of zones not eliminated
)
)
tab_sum_ely <- cpe_sta_fin |>
# order the variables from highest to lowest number of zones eliminated
select("hxr_ave_gt25", "rcf_per_lt55", "yld_ave_gt25",
"yld_nzv_ge4", "yld_nzv_ge30", "rfl_ave_gt0", "suitable") |>
summarise(
across(everything(),
list(n = \(x) 5838 - sum(x)) # calculate the sum of zones eliminated
)
)
tab_sum_all <- bind_rows(tab_sum_ely, tab_sum_eln) |>
mutate(id = 1:n(), .before = hxr_ave_gt25_n)
tab_sum_all |>
gt()| id | hxr_ave_gt25_n | rcf_per_lt55_n | yld_ave_gt25_n | yld_nzv_ge4_n | yld_nzv_ge30_n | rfl_ave_gt0_n | suitable_n |
|---|---|---|---|---|---|---|---|
| 1 | 2452 | 2435 | 1108 | 213 | 373 | 155 | 2567 |
| 2 | 3386 | 3403 | 4730 | 5625 | 5465 | 5683 | 3271 |
The second method is as follows:
rm(tab_sum_all)
tab_sum_all <- tibble(id = c(0, 1))
# order the variables from highest to lowest number of zones eliminated
var_lst = c("hxr_ave_gt25", "rcf_per_lt55", "yld_ave_gt25", "yld_nzv_ge4", "yld_nzv_ge30", "rfl_ave_gt0", "suitable")
for (var_tmp in var_lst) {
# cat("Running code for", var_tmp, "\n")
var_nme = sym(var_tmp)
var_new = sym(paste0(var_nme, "_n"))
tab_sum <- cpe_sta_fin |>
select(!!var_nme) |>
group_by(!!var_nme) |>
summarise(n = n()) |>
rename(!!var_new := all_of("n")) |>
rename(id = !!var_nme)
tab_sum_all <- merge(tab_sum_all, tab_sum, by = "id")
}
tab_sum_all |>
gt()| id | hxr_ave_gt25_n | rcf_per_lt55_n | yld_ave_gt25_n | yld_nzv_ge4_n | yld_nzv_ge30_n | rfl_ave_gt0_n | suitable_n |
|---|---|---|---|---|---|---|---|
| 0 | 2452 | 2435 | 1108 | 213 | 373 | 155 | 2567 |
| 1 | 3386 | 3403 | 4730 | 5625 | 5465 | 5683 | 3271 |
A check was done to ensure the last column in the above tables is correct:
#check variable suitable_n
cpe_sta_fin |>
select(suitable) |>
group_by(suitable) |>
count()# A tibble: 2 × 2
# Groups: suitable [2]
suitable n
<dbl> <int>
1 0 2567
2 1 3271
It is important to that both methods produced the same outcome (i.e. to verify the results are correct). The results are presented in Table 1, which show the majority of zones were eliminated by applying the HI/HIo > 25% criterion (2452 in total), followed by RCF < 55% (2435 zones) and yield > 25 kg ha-1 (1108 zones).
tab_sum_all |>
# gt() formats the table
gt() |>
# give a header with a title and subtitle
tab_header(title = md("**Number of HRZs eliminated by each criterion**"),
subtitle = html("(<b>Cowpea planted on 01 November at 80 000 plants ha<sup>-1</sup></b>)")) |>
# format no. of decimals
# https://gt.rstudio.com/reference/fmt_number.html
fmt_number(decimals = 0, sep_mark = "") |>
# change 0 to N and 1 to Y
sub_values(columns = id, values = 0, replacement = "Y") |>
sub_values(columns = id, values = 1, replacement = "N") |>
cols_label(
id = md("**Eliminated?**<br>(Y/N)"),
rfl_ave_gt0_n = md("**RFL~AVE~**<br> > 0 mm"),
hxr_ave_gt25_n = md("**HI/HI~O~**<br> > 25%"),
# does not do sperscript in rendered HTML
# yld_ave_gt25_n = md("**YLD~AVE~**<br> > 25 kg {{ha^-1}}"),
yld_ave_gt25_n = html("<b>YLD<sub>AVE</sub></b><br> > 25 kg ha<sup>-1</sup>"),
rcf_per_lt55_n = md("**RCF**<br> < 55%"),
suitable_n = md("**Final<br>no. of<br>HRZs**")
) |>
# 1st grouping
tab_spanner(
label = md("**No. of yield values**"),
columns = c(yld_nzv_ge4_n, yld_nzv_ge30_n)
) |>
cols_label(
yld_nzv_ge4_n = "≥ 4",
yld_nzv_ge30_n = "≥ 30"
)| Number of HRZs eliminated by each criterion | |||||||
| (Cowpea planted on 01 November at 80 000 plants ha-1) | |||||||
| Eliminated? (Y/N) |
HI/HIO > 25% |
RCF < 55% |
YLDAVE > 25 kg ha-1 |
No. of yield values
|
RFLAVE > 0 mm |
Final no. of HRZs |
|
|---|---|---|---|---|---|---|---|
| ≥ 4 | ≥ 30 | ||||||
| Y | 2452 | 2435 | 1108 | 213 | 373 | 155 | 2567 |
| N | 3386 | 3403 | 4730 | 5625 | 5465 | 5683 | 3271 |
Of the total 5838 HRZs, 3271 suitable zones were selected for further analysis. Hence, 2567 zones were eliminated since the yare deemed unsuitable for cowpea production.
# filter the dataset to select only suitable zones
cpe_se1 <- cpe_sta_fin |>
filter(suitable == 1)
tot <- nrow(cpe_sta_fin)
se1 <- nrow(cpe_se1)
se0 = tot - se1
cat("The total number of HRZs is", tot, "\n")The total number of HRZs is 5838
cat("Of these,", se1, "are considered suitable (not eliminated) for cowpea production\n")Of these, 3271 are considered suitable (not eliminated) for cowpea production
cat("Hence,", se0, "zones are unsuitable for cowpea production and were therefore eliminated\n")Hence, 2567 zones are unsuitable for cowpea production and were therefore eliminated
3a. Collinearity test
A collinearity test between CWP and Y was performed by creating a scatter plot with a trend line (linear regression) as shown in Figure 1. Since CWP is calculated from Y, the variables should be strongly correlated.
ggplot(data = cpe_se1,
mapping = aes(x = cwp_ave,
y = yld_ave)) +
# create scatter plot
geom_point(size = 0.5) +
# set x-axis label to 3 decimals
scale_x_continuous(labels = label_number(accuracy = 0.001)) +
# label both axes and the plot title
labs(x = expression(paste("Crop water productivity (kg ", m^-3, ")")),
y = expression(paste("Crop yield (kg ", ha^-1, ")")),
title = "Crop yield vs crop water productivity") +
# centre the title label (hjust = 0 is left, 0.5 = center, 1 = right)
theme(plot.title = element_text(hjust = 0.5)) +
# add a trend line
geom_smooth(method = "lm",
formula = y ~ x + 0,
color = "blue",
linetype = "dashed",
size = 1.0,
na.rm = FALSE,
se = FALSE,
show.legend = TRUE) For the trend line, various statistics (e.g. R squared) were calculated. The slope value provides the average water use of cowpea in m3 ha-1 (divide by 10 to convert to mm ha-1).
# extract the two variables to assess for collinarity
reg_dat <- cpe_se1 |>
select("x" = cwp_ave, "y" = yld_ave)
# fit a linear regression model to the data with NO intercept
reg_mod <- lm(y ~ x + 0, data = reg_dat)
reg_sta <- summary(reg_mod)
slp = reg_sta$coefficients["x", "Estimate"]
r2v = reg_sta$r.squared
r2a = reg_sta$adj.r.squared
# print the equation
cat("The regression equation is: Yield =", round(slp, 2), "* CWP\n")The regression equation is: Yield = 3561.58 * CWP
cat("The R-squared value is:", round(r2v, 4), "\n")The R-squared value is: 0.9587
cat("The adjusted R-squared value is:", round(r2a, 4), "\n")The adjusted R-squared value is: 0.9587
If Pearson’s correlation coefficient (r) is above 0.5, the variables are highly correlated. Since the r value approaches 1, Y and CWP are highly correlated, and thus either Y or CWP (but not both) should be used for determining suitable crop production zones.
pcc = cor(reg_dat$x, reg_dat$y, method = 'pearson')
cat("The Pearson correlation coefficent is:", round(pcc, 4), "\n")The Pearson correlation coefficent is: 0.939
# Spearman correlation coefficient
scc = cor(reg_dat$x, reg_dat$y, method = 'spearman')
cat("The Spearman correlation coefficent is:", round(scc, 4), "\n")The Spearman correlation coefficent is: 0.9515
# Kendall correlation coefficient
kcc = cor(reg_dat$x, reg_dat$y, method = 'kendall')
cat("The Kendall correlation coefficent is:", round(kcc, 4), "\n")The Kendall correlation coefficent is: 0.8068
3b. Suitable cowpea production zones
A histogram can be created for each variable (e.g. seasonally averaged yield) showing the full range of values. Figure 2 illustrates the seasonally averaged yield simulations (yld_ave), which represent a right-skewed distribution (i.e. less HRZs with high yields), despite the elimination of many unsuitable zones.
library(ggplot2)
ggplot(data = cpe_se1,
mapping = aes(x = yld_ave)) +
geom_histogram(col = "black",
fill = "white",
breaks = seq(0, 2600, 25)) +
# label both axes and the plot title
labs(x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
y = "Number of HRZs",
title = "Range in simulated cowpea yields") +
# removes default grey background fill
theme_minimal() +
# centre the title label (hjust = 0 is left, 0.5 = center, 1 = right)
theme(plot.title = element_text(hjust = 0.5)) As expected, there are no yields below 25 kg ha-1 as these very low yields were excluded (see Section 2b). However, the largest number of zones have a simulated yield ranging from 50-75 kg per ha. Since cowpea is mainly cultivated by subsistence and small-scale farmers (i.e. low input farming), yields are typically low (25-300 kg ha-1), as simulated by the AquaCrop model.
val <- nrow(cpe_se1 |>
filter(yld_ave > 50 & yld_ave <= 75))
cat("The number of HRZs with an average yield ranging from 50-75 kg per ha is", val, "\n")The number of HRZs with an average yield ranging from 50-75 kg per ha is 111
val <- nrow(cpe_se1 |>
filter(yld_ave > 25 & yld_ave <= 300))
cat("The number of HRZs with an average yield ranging from 25-300 kg per ha is", val, "\n")The number of HRZs with an average yield ranging from 25-300 kg per ha is 724
val <- nrow(cpe_se1 |>
filter(yld_ave > 300))
cat("The number of HRZs with an average yield exceeding 300 kg per ha is", val, "\n")The number of HRZs with an average yield exceeding 300 kg per ha is 2547
Seasonally averaged yields simulated by AquaCrop ranged from 28.6 to 2589 kg ha-1, with a mean and median of 725 and 628 kg ha-1, respectively. The highest averaged yield of 2589 kg ha-1 was simulated for HRZ number 4573.
# useful statistics for the yield variable
cpe_se1 |>
select(yld_ave) |> # select only one variable of interest
filter(!is.na(yld_ave)) |> # exclude any NA values
arrange(yld_ave) |> # arrange in ascending order
summarise( # produce sumamry statistics
Suitable = n(), # returns 3271 zones
Mean = mean(yld_ave), # returns 725 kg per ha
Median = median(yld_ave), # returns 628 kg per ha
SD = sd(yld_ave), # returns 504 kg per ha
CV = 100 * SD / Mean, # returns 69.5 %
Min = min(yld_ave), # returns 28.6 kg per ha
Max = max(yld_ave) # returns 2589 kg per ha
) |>
gt() |>
tab_header(title = md("**Yield statistics across suitable HRZs**"),
subtitle = html("(<b>in kg ha<sup>-1</sup> except for CV in %</b>)")) |>
fmt_number(columns = Suitable, decimals = 0, sep_mark = "") |>
fmt_number(columns = Mean:Max, decimals = 1, sep_mark = "")| Yield statistics across suitable HRZs | ||||||
| (in kg ha-1 except for CV in %) | ||||||
| Suitable | Mean | Median | SD | CV | Min | Max |
|---|---|---|---|---|---|---|
| 3271 | 724.8 | 628.0 | 503.6 | 69.5 | 28.6 | 2589.0 |
cpe_se1 |>
select(subc, yld_ave) |> # select variables of interest
filter(!is.na(yld_ave)) |> # exclude any NA values
arrange(desc(yld_ave)) |> # arrange in descending order
head( n = 1) # returns subc = 4573# A tibble: 1 × 2
subc yld_ave
<dbl> <dbl>
1 4573 2589
The suitability score was used to create a density plot (Figure 3), which shows the largest distribution of lower yield values classified as marginally suitable versus the smallest distribution of higher yield values classified as highly suitable.
fil_col = c("High" = "green",
"Moderate" = "yellow",
"Marginal" = "orange")
lne_col = c("High" = "grey60",
"Moderate" = "grey60",
"Marginal" = "grey60")
# Move Marginally suitable (S3) to bottom of legend using factors and levels
leg_ord = c("Marginal", "Moderate", "High")
ggplot(data = cpe_se1 |>
mutate(suitability = factor(suitability, levels = leg_ord)),
mapping = aes(x = yld_ave,
fill = suitability,
color = suitability)) +
geom_density(alpha = 0.50,
adjust = 0.7) +
scale_x_continuous(breaks = seq(0, 2600, by = 200)) +
scale_fill_manual(name ="Suitability",
values = fil_col,
labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
) +
# legend MUST have the same name
scale_color_manual(name = "Suitability",
values = lne_col,
labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
) +
labs(title = "Density plot of simulated cowpea yields",
x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
y = "Density") +
# removes default grey background fill
theme_minimal() +
# centre the title label (hjust = 0 is left, 0.5 = center, 1 = right)
theme(plot.title = element_text(hjust = 0.5))Another way to view the distributions by suitability is using a ridgeline plot (also called a joy plot), which is useful for visualising the distribution of a continuous variable (yield) across different categories (suitability). In Figure 4, the vertical lines show the median (50th percentile) and the interquartile range (25th to 75th percentile).
library(ggridges)
fil_col = c("High" = "green",
"Moderate" = "yellow",
"Marginal" = "orange")
lne_col = c("High" = "green",
"Moderate" = "orange",
"Marginal" = "brown")
leg_ord = c("Marginal", "Moderate", "High")
ggplot(data = cpe_se1 |>
mutate(suitability = factor(suitability, levels = leg_ord)),
mapping = aes(x = yld_ave,
y = suitability,
fill = suitability,
color = suitability)) +
stat_density_ridges(alpha = 0.1,
scale = 1,
rel_min_height = 0.01,
quantile_lines = TRUE,
quantiles = c(0.25, 0.50, 0.75)) +
scale_x_continuous(breaks = seq(0, 2600, by = 200)) +
scale_fill_manual(name ="Suitability",
values = fil_col,
labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
) +
scale_color_manual(name = "Suitability",
values = lne_col,
labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
) +
labs(title = "Ridgeline plot of simulated cowpea yields",
x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
y = "Density") +
# removes default grey background fill
theme_minimal() +
# centre the title label (hjust = 0 is left, 0.5 = center, 1 = right)
theme(plot.title = element_text(hjust = 0.5))Figure 5 again shows how yield increases from marginally (S3) to highly suitable (S1) zones. The figure also highlights the problem of using discrete thresholds (25, 50 and 75%) to classify a continuous variable (HI/HIo), especially if the thresholds are subjective.
# box and whisker plot
fil_col = c("High" = "green",
"Moderate" = "yellow",
"Marginal" = "orange")
leg_ord = c("High", "Moderate", "Marginal")
leg_lab = c("High (S1)", "Moderate (S2)", "Marginal (S3)")
ggplot(data = cpe_se1 |>
mutate(suitability = factor(suitability, levels = leg_ord)),
mapping = aes(x = yld_ave,
y = suitability,
fill = suitability)
) +
geom_boxplot(alpha = 0.5, outlier.shape = NA) +
scale_x_continuous(breaks = seq(0, 2600, by = 200)) +
scale_y_discrete(labels = leg_lab) +
scale_fill_manual(values = fil_col) +
labs(title = "Box plot of simulated cowpea yields",
x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
y = "Suitability") +
# removes default grey background fill
theme_minimal() +
# no legend
theme(legend.position="none") +
# centre the title label (hjust = 0 is left, 0.5 = center, 1 = right)
theme(plot.title = element_text(hjust = 0.5))From the table of statistics for the seasonally averaged yield grouped by suitability (Table 2), there are more zones classified as S3 (1553 zones), compared to S2 (1320) and S1 (398). The spatially averaged yield is 357, 880 and 1646 kg ha-1 for S3, S2 and S1 zones, respectively. The median yield is 329 (S3), 845 (S2) and 1597 kg ha-1 (S1), as shown in the box plot (Figure 5).
tab_ord = c("Marginal", "Moderate", "High")
var_nme = "yld_ave"
tab_sum <- cpe_se1 |>
# order table as "Marginal", "Moderate", "High" (not alphabetical)
mutate(suitability = factor(suitability, levels = tab_ord)) |>
group_by(suitability) |>
filter(!is.na(var_nme)) |>
select(suitability, var_nme) |>
arrange(var_nme) |>
summarise(
cnt = n(), # number of values
dis = n_distinct(!!sym(var_nme)), # unique number of values
ave = mean(!!sym(var_nme)), # mean of values
p25 = quantile(!!sym(var_nme), probs = 0.25), # 25th percentile
p50 = quantile(!!sym(var_nme), probs = 0.50), # 50th percentile (median)
p75 = quantile(!!sym(var_nme), probs = 0.75), # 75th percentile
min = min(!!sym(var_nme)), # smallest value
max = max(!!sym(var_nme)), # largest value
sdv = sd(!!sym(var_nme)), # standard deviation
cov = 100 * sdv / ave # coefficient of variation (%)
)
# see gt.rstudio.com
tab_sum |>
# gt() formats the table
gt() |>
# give a header with a title and subtitle
tab_header(title = md("**Cowpea yield statistics**"),
subtitle = html("Planted on 01 November at 80 000 plants ha<sup>-1</sup>")
) |>
# format no. of decimals
# https://gt.rstudio.com/reference/fmt_number.html
fmt_number(columns = cnt:dis, decimals = 0, sep_mark = "") |>
fmt_number(columns = ave:cov, decimals = 1, sep_mark = "") |>
cols_label(
suitability = md("**Suitability<br>class**"),
cnt = md("**No. of<br>HRZs**"),
dis = md("**Distinct<br>values**"),
ave = html("<b>Average</b><br>(kg ha<sup>-1</sup>)"),
) |>
# 1st grouping
tab_spanner(
label = md("**Percentiles**"),
columns = c(p25, p50, p75)
) |>
cols_label(
p25 = "{{25^th}}",
p50 = "{{50^th}}",
p75 = "{{75^th}}",
) |>
# 2nd grouping
tab_spanner(
label = html("<b>Range (kg ha<sup>-1</sup>)</br>"),
columns = c(min, max)
) |>
cols_label(
min = "Min",
max = "Max",
) |>
# 3rd grouping
tab_spanner(
label = md("**Spread**"),
columns = c(sdv, cov)
) |>
cols_label(
# sdv = md("SD (kg {{ha^-1}})"), # does not do superscipt in rendered HTML
sdv = html("SD<br>(kg ha<sup>-1</sup>)"),
cov = md("CV<br>(%)"),
)| Cowpea yield statistics | ||||||||||
| Planted on 01 November at 80 000 plants ha-1 | ||||||||||
| Suitability class |
No. of HRZs |
Distinct values |
Average (kg ha-1) |
Percentiles
|
Range (kg ha-1) |
Spread
|
||||
|---|---|---|---|---|---|---|---|---|---|---|
| 25th | 50th | 75th | Min | Max | SD (kg ha-1) |
CV (%) |
||||
| Marginal | 1553 | 1527 | 356.8 | 157.2 | 328.5 | 490.8 | 28.6 | 1200.7 | 241.2 | 67.6 |
| Moderate | 1320 | 1302 | 880.1 | 655.9 | 844.9 | 1089.8 | 52.3 | 1874.4 | 298.0 | 33.9 |
| High | 398 | 397 | 1645.9 | 1395.1 | 1597.3 | 1883.1 | 1052.5 | 2589.0 | 325.5 | 19.8 |
4. Conclusions
- Only 3271 of the 5838 HRZs across South Africa are considered suitable for rainfed production of cowpea.
- Of these, 398 zones are classified as highly suitable, where simulated yields averaged 1646 kg ha-1 (range of 1052-2589 kg ha-1).
- Similarly, 1320 zones are classified as moderately suitable with an average yield of 880 kg ha-1.
- The remaining 1553 zones are deemed marginally suitable, with an average yield of 357 kg ha-1.
- The analysis highlights the importance of eliminating “noise” (especially outliers) when working with large datasets.
- It also highlights the problem of using discrete thresholds to classify continuous variables.
5. References
Reuse
Citation
@online{kunz2025,
author = {Kunz, Richard},
title = {Land Suitability Analysis for Cowpea Production in {South}
{Africa} Using Output Simulated by the {AquaCrop} Model},
date = {2025-12-09},
url = {https://ds4owd-002.github.io/project-kunzrp/},
langid = {en},
abstract = {The AquaCrop model, developed by the Food and Agriculture
Organisation, was run for cowpea planted on the 01\^{}st\^{} of
November at 80000 plants per hectare across 5838 homogeneous
response zones (HRZs) in South Africa. The model generated up to 49
consecutive seasons of data from 1950/51 to 1998/99 for each HRZ,
from which long-term statistics were calculated for important
AquaCrop output variables (part 1). The statistics were then
analysed to determine which HRZs are deemed suitable for rainfed
cultivation of cowpea (part 2). The crop can be grown in 3271 of the
5838 HRZs.}
}