Land suitability analysis for cowpea production in South Africa using output simulated by the AquaCrop model

Capstone project

Published

December 9, 2025

Abstract
The AquaCrop model, developed by the Food and Agriculture Organisation, was run for cowpea planted on the 01st of November at 80000 plants per hectare across 5838 homogeneous response zones (HRZs) in South Africa. The model generated up to 49 consecutive seasons of data from 1950/51 to 1998/99 for each HRZ, from which long-term statistics were calculated for important AquaCrop output variables (part 1). The statistics were then analysed to determine which HRZs are deemed suitable for rainfed cultivation of cowpea (part 2). The crop can be grown in 3271 of the 5838 HRZs.

1. Introduction

Cowpea is an underutilised indigenous legume crop grown by smallholder farmers in South Africa. When compared to other legumes, cowpea is tolerant to water and heat stress, and thus is resilient to increasing climate variability. According to Karl et al. (2024), cowpea is rich in protein and other essential nutrients (e.g. folate, zinc and iron), making the crop suitable for improving food security in rural communities. Despite the crop’s potential, it remains underutilised and neglected (i.e. under-researched) in South Africa. Hence, there is limited evidence from field experiments that describe the crop’s potential. To fill this knowledge gap, a crop simulation model developed by the Food and Agriculture Organisation (FAO) called AquaCrop (Hsiao et al., 2009; Raes et al., 2009; Steduto et al., 2009), was used to generate simulated data across South Africa for cowpea planted on the 01st of November in each season at 80000 plants per hectare (ha-1).


2. Methods

For each HRZ, AquaCrop generated up to 49 consecutive seasons of data from 1950/51 to 1998/99, from which long-term statistics were calculated for important AquaCrop output variables. The 1st objective was to generate long-term statistics from the seasonal dataset and to store this output in the data/processed folder (part 1; see Section 2a). The 2nd objective was to analyse the statistics to select zones that are deemed suitable for cowpea production (part 2; see Section 2b). This dataset is stored in the data/final folder. For the data analysis, the following R libraries are required:

library(tidyverse)
library(gt)
library(ggplot2)
library(scales)

2a. Generating long-term statistics

The raw dataset contains output simulated by AquaCrop for cowpea using 50 years of input climate and soil data for each of the 5838 homogeneous response zones (HRZs) across South Africa, which generated up to 49 seasons of data (1950/51 to 1998/99) for 22 important output variables (from Rain to WPet).

cpe_raw <- read_csv(here::here("data/raw/subc_5838_cpea_mthk_fro1.csv"))
ncl <- ncol(cpe_raw)
cat("The number of columns in the data frame (raw_cpe) is:", ncl)
The number of columns in the data frame (raw_cpe) is: 30
nrw <- nrow(cpe_raw)
cat("The number of rows in the dataframe is:", nrw)
The number of rows in the dataframe is: 268994
cat("The first five rows of data are:")
The first five rows of data are:
head(cpe_raw, n = 5) |>
  gt()
Sea Day1 Month1 Year1 Rain ETo GD CO2 Infilt Runoff Drain Upflow E E/Ex Tr TrW Tr/Trx Cycle TmpStr ExpStr StoStr BioMass BioRel HI Yield WPet DayN MonthN YearN subc
1 1 11 1950 299.8 581.4 1261.3 311.30 290.5 9.3 47.5 0 100.7 29 201.3 201.3 72 95 1 22 22 4.914 58 18.2 0.895 0.30 11 2 1951 1
2 1 11 1951 142.2 585.4 1271.9 311.54 142.2 0.0 4.4 0 88.6 20 109.1 109.1 58 81 0 39 34 2.074 29 10.3 0.214 0.11 30 1 1952 1
3 1 11 1952 362.1 559.9 1262.1 311.78 355.3 6.8 89.7 0 141.2 46 174.6 174.6 59 92 1 31 27 4.176 53 4.7 0.194 0.06 8 2 1953 1
4 1 11 1953 287.0 562.4 1267.6 313.00 283.1 3.9 50.7 0 128.2 40 152.9 152.9 55 94 1 32 32 3.767 46 8.1 0.307 0.11 10 2 1954 1
5 1 11 1954 538.1 546.6 1259.7 314.25 489.8 48.3 206.0 0 129.5 38 148.3 148.3 62 97 1 43 24 3.960 46 18.4 0.730 0.26 13 2 1955 1
cat("The last five rows of data are:")
The last five rows of data are:
tail(cpe_raw, n = 5) |>
  gt()
Sea Day1 Month1 Year1 Rain ETo GD CO2 Infilt Runoff Drain Upflow E E/Ex Tr TrW Tr/Trx Cycle TmpStr ExpStr StoStr BioMass BioRel HI Yield WPet DayN MonthN YearN subc
45 1 11 1994 294.6 476.3 1263.4 359.82 270.0 24.6 102.6 0 156.8 30 0.3 0.3 47 73 0 50 27 0.007 0 0.0 0.000 0.00 18 1 1995 5838
46 1 11 1995 221.1 446.0 1263.6 361.72 209.6 11.5 51.7 0 105.4 31 81.5 81.5 57 74 0 40 29 2.254 30 10.1 0.227 0.12 20 1 1996 5838
47 1 11 1996 298.1 470.8 1270.1 363.17 285.0 13.1 26.9 0 113.2 33 129.5 129.5 78 72 0 40 21 3.222 44 19.6 0.633 0.26 17 1 1997 5838
48 1 11 1997 293.3 464.7 1259.5 365.22 264.4 28.9 52.5 0 128.6 40 100.2 96.9 60 75 0 35 29 2.695 35 20.4 0.551 0.24 19 1 1998 5838
49 1 11 1998 486.6 445.2 1268.7 367.54 439.5 47.1 142.6 0 141.4 66 208.0 208.0 78 77 1 11 15 5.705 75 18.9 1.081 0.31 22 1 1999 5838

Since the number of rows is less than the expected 286062 rows (i.e. 5838 x 49), not all zones have 49 seasonal simulations, which is important to note. In total, there are 17219 rows of “missing” data. This occurs because AquaCrop is not run for seasons and HRZs that are too cold to support economically viable crop production. According to Gerrano et al. (2022), cowpea’s potential yield is about 3 tons per ha (or t ha-1). A total of 162 seasonal yield simulations exceeded 3 t ha-1.

cpe_raw |>
  filter(Yield >= 3, is.na = TRUE) |>   # select yields >= 3 tons per hectare
                                        # exclude any missing (NA) values
    count()                             # 162 seasonal values with yield > 3 t/ha

The highest yield of 3.36 t ha-1 was simulated in HRZ number (subc) 4252 in one season. HRZ number 5767 has 8 seasonal yields of 3 t ha-1 or higher, followed by zones 4567, 5758 and 5764 with 7 seasonal yields exceeding 3 t ha-1.

cpe_raw |>
  select(subc, Sea, Yield) |>          # select variables of interest
  filter(Yield >= 3, is.na = TRUE) |>  # select yields >= 3 tons per hectare (t/ha)
                                       # exclude any NA values
  group_by(subc) |>                    # group yields by HRZ number (subc)
  summarise(nsea = n(),                # count the number of seasons (sea) 
            ymax = max(Yield)          # obtain the largest yield in each HRZ
    ) |>
  arrange(desc(ymax)) |>               # arrange maximum yield in descending order, or
  gt() |>                              # output a table with 78 zones
  tab_options(
    container.height = "300px",        # Set a fixed height for vertical scroll
    container.width = "200px"          # Set a fixed width for horizontal scroll
  )
subc nsea ymax
4252 1 3.357
4567 7 3.346
5758 7 3.340
5764 7 3.329
5767 8 3.322
3296 4 3.287
4585 6 3.284
4621 3 3.283
4627 2 3.268
5683 4 3.266
4448 1 3.248
4472 5 3.246
4579 6 3.227
3290 1 3.206
4576 6 3.202
5371 2 3.173
5618 1 3.172
5374 2 3.167
4465 3 3.166
4568 4 3.161
5432 1 3.156
3323 2 3.151
5677 1 3.143
1677 1 3.137
551 1 3.132
3362 1 3.130
5494 2 3.128
5590 1 3.124
5434 2 3.122
5617 1 3.122
655 4 3.121
5755 2 3.121
5614 3 3.120
4577 1 3.113
5753 2 3.112
4573 1 3.111
661 1 3.109
4843 2 3.105
5499 1 3.102
658 2 3.101
5675 4 3.096
5279 1 3.095
5549 1 3.094
587 2 3.093
4683 2 3.091
5657 2 3.090
5497 1 3.085
5611 2 3.084
5684 1 3.082
4628 1 3.077
5429 1 3.072
4427 1 3.069
556 1 3.067
4638 1 3.061
5686 2 3.059
557 1 3.056
670 1 3.051
5788 2 3.048
5495 1 3.046
682 2 3.044
735 1 3.042
3366 1 3.040
5654 1 3.039
4917 1 3.038
4675 1 3.034
1701 1 3.033
3363 3 3.033
3311 1 3.031
550 1 3.029
547 1 3.024
5544 1 3.021
742 1 3.013
4941 1 3.007
760 1 3.006
5431 1 3.005
3369 1 3.004
5594 1 3.004
4841 1 3.002
 

For each season in all HRZs, the following additional variables were calculated:

  • yield (Y) in kg ha-1 (not t ha-1),

  • evapotranspiration (ET or crop water use in mm) as the sum of soil water evaporation (E) and crop transpiration (Tr),

  • crop water productivity (CWP in kg m-3) as the ratio of Y (in kg ha-1) to ET (in m3 ha-1), and

  • the ratio of harvest index (HI) to the reference harvest index (HIo) expressed as a percentage, where HIo is 26% for cowpea.

HI is used to calculate crop yield from accumulated biomass production. Although AquaCrop outputs crop water productivity (called WPet), it was calculated per season to check the model’s output and the summation of ET.

cpe_new <- cpe_raw |>
  mutate("Yield_new" = Yield * 1000, .after = Yield) |>    # convert Y to kg/ha
  mutate("ET" = E + Tr, .after = Yield_new) |>             # sum E & T to ET
  mutate("CWP" = Yield_new  / (ET * 10), .after = ET) |>   # calculate CWP = Y / ET
  mutate("HI/HIo" = 100 * HI / 26, .after = HI)            # calculate Hi/HIo

It is important to note that certain HRZs have no data because the model crashed during execution and produced no output. This happens when the zone’s climate is far too cold for crop cultivation.

nzo <- cpe_new |>
  filter(Sea == 0) |>         # if Sea = 0, there is no data, i.e. the model crashed
  count()
cat("The number of zones with no yield data is:", round(nzo[[1]], 0))
The number of zones with no yield data is: 151

For selected variables such as rainfall (Rain or rfl), crop cycle length (Cycle or ccl), B/Bx (BioRel or bxr), HI/HIo (hxr), yield (or yld) and CWP (or WPet),

cpe_trm <- cpe_new |>
  select("subc", "seas" = "Sea", "rfl" = "Rain", "ccl" = "Cycle", "bxr" = "BioRel", "hxr" = "HI/HIo", "yld" = "Yield_new", "cwp" = "WPet")

the 1st main objective was to generate long-term statistics (e.g. mean, standard deviation, coefficient of variation, number of zero values, number of missing (NA) values, total number of values) from the seasonal data.

cpe_sta <- cpe_trm |>
  group_by(subc) |>
  summarise(
    # for all variables from rfl to cwp, no need for a loop
    across(rfl:cwp,
           list(                      
      # calculate the mean
      ave = \(x) mean(x, na.rm = TRUE),
      # calculate the standard deviation
      sdv = \(x) sd(x, na.rm = TRUE),
      # calculate the coefficent of variation as a %
      cov = \(x) 100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) 
    )),
    yld_non = sum(is.na(yld)),             # count of missing yield values
    yld_noi = sum(is.infinite(yld)),       # count of infinite yield values
    yld_nom = 49 - n(),                    # count of missing seasonal yields
    yld_noz = sum(yld == 0, na.rm = TRUE), # count of zero yields
    yld_nov = sum(yld > 0, na.rm = TRUE),  # count of non-zero yields
    yld_nos = n()                          # count of seasons with yield values
  )

To ensure the statistics were correct, the following were summed, namely the number of:

  • missing (NA) yield values (yld_non),

  • infinite (INF) yield values (yld_noi),

  • missing seasons (yld_nom),

  • zero yield values (yld_noz), and

  • non-zero yield values (yld_nov).

If the above values did not sum to 49 seasons in total, then there is an error.

tot <- cpe_sta |>
  mutate("yld_chk" = (yld_non + yld_noi + yld_nom + yld_noz + yld_nov)) |>
  count(yld_chk)
cat("The number of zones with a sum of", tot[[1]], "values is", tot[[2]])
The number of zones with a sum of 49 values is 5838

The number of yield values equal to zero kg ha-1 (yld_noz) or greater than zero kg ha-1 (yld_nov) was also calculated (yld_nzv). Furthermore, the risk of crop failure (RCF) was calculated as a ratio of the number of missing, infinite and zero seasonal yields to the total number of simulations (49), which was then expressed as a percentage.

cpe_sta <- cpe_sta |>
  mutate("yld_nzv" = yld_noz + yld_nov, .after = "yld_nov") |>
  mutate("rcf_per" = 100 * (yld_non + yld_noi + yld_nom + yld_noz) / 49,
         .after = cwp_cov)

The analysis-ready statistics were outputted to a .CSV file in the data/processed folder. This was important since the number of rows of data was reduced from 268994 to 5838.

write_csv(cpe_sta, here::here("data/processed/stat_5838_cpea_mthk_fro1.csv"))

2b. Identifying suitable cowpea production zones

Cowpea cannot be cultivated in all 5838 HRZs across the country (e.g. in zones that are too dry and/or cold), and thus the goal was to analyse the statistics to identify zones that are deemed suitable for cowpea production.

cpe_sta <- read_csv(here::here("data/processed/stat_5838_cpea_mthk_fro1.csv"))

The first step involved identifying variables with missing (i.e. NA) data.

cpe_sta |>
  summarise(
    # for all variables from rfl_ave to rcf_per, no need for a loop
    across(rfl_ave:rcf_per,
           list(mis = \(x) sum(is.na(x))       # calculate number of NA values
  ))) |>
  pivot_longer(cols = rfl_ave_mis:rcf_per_mis, # convert table from wide to long format
               names_to = "variable",
               values_to = "num_na") |>
  arrange(num_na) |>                           # sort table into ascending order
  filter(num_na > 0) |>                        # filter out variables with no NA values
  gt() |>
  tab_options(
    container.height = "300px",                # set a fixed height for vertical scroll
    container.width = "200px"                  # set a fixed width for horizontal scroll
  )
variable num_na
rfl_sdv_mis 189
rfl_cov_mis 189
ccl_sdv_mis 189
ccl_cov_mis 189
bxr_sdv_mis 189
hxr_sdv_mis 189
yld_sdv_mis 189
cwp_sdv_mis 189
bxr_cov_mis 192
hxr_cov_mis 362
yld_cov_mis 362
cwp_cov_mis 367
 
cnt <- cpe_sta |>
  filter(is.na(cwp_cov)) |>                  # check for NA values in CV of CWP
  count()                                    # count the number of NA values
cat("To verify, the number of NA values in CWPcov is", cnt[[1]])
To verify, the number of NA values in CWPcov is 367

If the coefficient of variation (CV in %) was missing (NA), values were converted to 700%. Similarly, missing standard deviations were converted to a value of 1200. A subset of the statistics was then created by only considering variables that could be used to identify suitable zones for cowpea production.

cpe_sta_fin <- cpe_sta |>
   mutate(across(ends_with("_cov"), \(x) coalesce(x, 700))) |>
   mutate(across(ends_with("_sdv"), \(x) coalesce(x, 1200))) |>
   select("subc", "rfl_ave", "ccl_ave", "bxr_ave", "hxr_ave",
          "yld_nzv", "yld_ave", "cwp_ave", "cwp_cov", "rcf_per")

By creating new variables (set to 1 or yes), suitable crop production areas were identified with:

a) above zero seasonally averaged rainfall (RFLAVE) data (i.e. also excludes zones with no data);

cpe_sta_fin$rfl_ave_gt0 <- ifelse(cpe_sta_fin$rfl_ave > 0, 1, 0)

cnt <- cpe_sta_fin |>
  filter(rfl_ave_gt0 == 0) |>
  count() # 155 zones excluded
cat("Number of zones excluded since RFLave = 0 mm is", cnt[[1]])
Number of zones excluded since RFLave = 0 mm is 155

b) a seasonally average HI/HIo ratio (called hxr_ave) above 25% (Lake, 2024):

cpe_sta_fin$hxr_ave_gt25 <- ifelse(cpe_sta_fin$hxr_ave > 25, 1, 0)

c) more than 4 seasons of data (else too few data values for calculating reliable statistics);

cpe_sta_fin$yld_nzv_ge4 <- ifelse(cpe_sta_fin$yld_nzv >= 4, 1, 0)

d) more than 30 (out of 49) seasons of data, else insufficient data values to calculate reliable long-term statistics according to Steduto et al. (2012);

cpe_sta_fin$yld_nzv_ge30 <- ifelse(cpe_sta_fin$yld_nzv >= 30, 1, 0)

e) average yields above 25 kg ha-1 (else unviable for smallholder farmers according to Gerrano et al. (2022)), which helps to exclude zones where CWP is 0 kg m-3 of water use and the RCF is 100%;

cpe_sta_fin$yld_ave_gt25 <- ifelse(cpe_sta_fin$yld_ave > 25, 1, 0)

f) low (e.g. < 55%) risk of crop failure (else economically unviable).

cpe_sta_fin$rcf_per_lt55 <- ifelse(cpe_sta_fin$rcf_per < 55, 1, 0)

Next, another variable called suitable (0=no; 1=yes) was created from the product of the above-mentioned criteria.

cpe_sta_fin <- transform(cpe_sta_fin, suitable = rfl_ave_gt0 * hxr_ave_gt25 * yld_nzv_ge4 * yld_nzv_ge30 * yld_ave_gt25 * rcf_per_lt55)

According to FAO (1976), suitable cropping areas can be classified as highly suitable (S1), moderately suitable (S2) or marginally suitable (S3). Lake (2024) noted that the seasonally averaged HI/HIo ratio (variable called hxr_ave) can be used to classify suitability as follows: < 50% (S3), 50 to 75% (S2) and > 75% (S1). For zones deemed suitable for cowpea production (i.e. suitable=1), another suitability index was created as follows:

cpe_sta_fin <- cpe_sta_fin |>
  mutate(suitability = 
    case_when(
      hxr_ave  <= 25 ~ "Unsuitable",
      hxr_ave  >  25 & hxr_ave <= 50 ~ "Marginal",
      hxr_ave  >  50 & hxr_ave <= 75 ~ "Moderate",
      hxr_ave  >  75 ~ "High")
    )

The analysed statistics were then outputted to a .CSV file in the data/final folder.

write_csv(cpe_sta_fin, here::here("data/final/suit_5838_cpea_mthk_fro1.csv"))

3. Results

To recap, AquaCrop was run for all HRZs to simulate cowpea planted on the 01st of November in each season at 80 000 plants ha-1. Long-term statistics for important AquaCrop output variables were calculated from the seasonal simulations. These statistics were then used to determine which zones are deemed suitable for rainfed production of cowpea. An analysis of the final dataset is presented in this section.

# read the final .CSV file into a data frame
cpe_sta_fin <- read_csv(here::here("data/final/suit_5838_cpea_mthk_fro1.csv"))

The mean annual rainfall across the western parts of South Africa is below 400 mm, which is considered too dry to support economically viable crop production under rainfed conditions. Similarly, the climate is too cold in many HRZs located in mountainous regions, especially those bordering the country of Lesotho. Hence, not all HRZs are suitable for cowpea cultivation. The number of zones eliminated in each step was calculated in two different ways:

tab_sum_eln <- cpe_sta_fin |>
   # order the variables from highest to lowest number of zones eliminated
   select("hxr_ave_gt25", "rcf_per_lt55", "yld_ave_gt25",
          "yld_nzv_ge4", "yld_nzv_ge30", "rfl_ave_gt0", "suitable") |>
   summarise(
     across(everything(),
            list(n = \(x) sum(x)) # calculate the sum of zones not eliminated
     )
   )

tab_sum_ely <- cpe_sta_fin |>
   # order the variables from highest to lowest number of zones eliminated
   select("hxr_ave_gt25", "rcf_per_lt55", "yld_ave_gt25",
          "yld_nzv_ge4", "yld_nzv_ge30", "rfl_ave_gt0", "suitable") |>
   summarise(
     across(everything(),
            list(n = \(x) 5838 - sum(x)) # calculate the sum of zones eliminated
     )
   )

tab_sum_all <- bind_rows(tab_sum_ely, tab_sum_eln) |>
  mutate(id = 1:n(), .before = hxr_ave_gt25_n)

tab_sum_all |>
  gt()
id hxr_ave_gt25_n rcf_per_lt55_n yld_ave_gt25_n yld_nzv_ge4_n yld_nzv_ge30_n rfl_ave_gt0_n suitable_n
1 2452 2435 1108 213 373 155 2567
2 3386 3403 4730 5625 5465 5683 3271

The second method is as follows:

rm(tab_sum_all)
tab_sum_all <- tibble(id = c(0, 1))

# order the variables from highest to lowest number of zones eliminated
var_lst = c("hxr_ave_gt25", "rcf_per_lt55", "yld_ave_gt25", "yld_nzv_ge4", "yld_nzv_ge30", "rfl_ave_gt0", "suitable")

for (var_tmp in var_lst) {
# cat("Running code for", var_tmp, "\n")
  var_nme = sym(var_tmp)
  var_new = sym(paste0(var_nme, "_n"))
  tab_sum <- cpe_sta_fin |>
    select(!!var_nme) |>
    group_by(!!var_nme) |>
    summarise(n = n()) |>
    rename(!!var_new := all_of("n")) |>
    rename(id = !!var_nme)
  tab_sum_all <- merge(tab_sum_all, tab_sum, by = "id")
}

tab_sum_all |>
  gt()
id hxr_ave_gt25_n rcf_per_lt55_n yld_ave_gt25_n yld_nzv_ge4_n yld_nzv_ge30_n rfl_ave_gt0_n suitable_n
0 2452 2435 1108 213 373 155 2567
1 3386 3403 4730 5625 5465 5683 3271

A check was done to ensure the last column in the above tables is correct:

#check variable suitable_n
cpe_sta_fin |>
  select(suitable) |>
  group_by(suitable) |>
  count()
# A tibble: 2 × 2
# Groups:   suitable [2]
  suitable     n
     <dbl> <int>
1        0  2567
2        1  3271

It is important to that both methods produced the same outcome (i.e. to verify the results are correct). The results are presented in Table 1, which show the majority of zones were eliminated by applying the HI/HIo > 25% criterion (2452 in total), followed by RCF < 55% (2435 zones) and yield > 25 kg ha-1 (1108 zones).

tab_sum_all |>
  # gt() formats the table 
  gt() |>
  # give a header with a title and subtitle
  tab_header(title = md("**Number of HRZs eliminated by each criterion**"),
             subtitle = html("(<b>Cowpea planted on 01 November at 80 000 plants ha<sup>-1</sup></b>)")) |>
  # format no. of decimals
  # https://gt.rstudio.com/reference/fmt_number.html
    fmt_number(decimals = 0, sep_mark = "") |>
# change 0 to N and 1 to Y
    sub_values(columns = id, values = 0, replacement = "Y") |>
    sub_values(columns = id, values = 1, replacement = "N") |>
    cols_label(
    id = md("**Eliminated?**<br>(Y/N)"),
    rfl_ave_gt0_n = md("**RFL~AVE~**<br> > 0 mm"),
    hxr_ave_gt25_n = md("**HI/HI~O~**<br> > 25%"),
# does not do sperscript in rendered HTML
#   yld_ave_gt25_n = md("**YLD~AVE~**<br> > 25 kg {{ha^-1}}"),
    yld_ave_gt25_n = html("<b>YLD<sub>AVE</sub></b><br> > 25 kg ha<sup>-1</sup>"),
    rcf_per_lt55_n = md("**RCF**<br> < 55%"),
    suitable_n = md("**Final<br>no. of<br>HRZs**")
  ) |>
# 1st grouping
  tab_spanner(
    label = md("**No. of yield values**"),
    columns = c(yld_nzv_ge4_n, yld_nzv_ge30_n)
  ) |>
  cols_label(
    yld_nzv_ge4_n = "≥ 4",
    yld_nzv_ge30_n = "≥ 30"
  )
Table 1: Number of HRZs eliminated by applying the selected criteria
Number of HRZs eliminated by each criterion
(Cowpea planted on 01 November at 80 000 plants ha-1)
Eliminated?
(Y/N)
HI/HIO
> 25%
RCF
< 55%
YLDAVE
> 25 kg ha-1
No. of yield values
RFLAVE
> 0 mm
Final
no. of
HRZs
≥ 4 ≥ 30
Y 2452 2435 1108 213 373 155 2567
N 3386 3403 4730 5625 5465 5683 3271

Of the total 5838 HRZs, 3271 suitable zones were selected for further analysis. Hence, 2567 zones were eliminated since the yare deemed unsuitable for cowpea production.

# filter the dataset to select only suitable zones
cpe_se1 <- cpe_sta_fin |>
  filter(suitable == 1)

tot <- nrow(cpe_sta_fin)
se1 <- nrow(cpe_se1)
se0 = tot - se1

cat("The total number of HRZs is", tot, "\n")
The total number of HRZs is 5838 
cat("Of these,", se1, "are considered suitable (not eliminated) for cowpea production\n")
Of these, 3271 are considered suitable (not eliminated) for cowpea production
cat("Hence,", se0, "zones are unsuitable for cowpea production and were therefore eliminated\n")
Hence, 2567 zones are unsuitable for cowpea production and were therefore eliminated

3a. Collinearity test

A collinearity test between CWP and Y was performed by creating a scatter plot with a trend line (linear regression) as shown in Figure 1. Since CWP is calculated from Y, the variables should be strongly correlated.

ggplot(data = cpe_se1,
       mapping = aes(x = cwp_ave,
                     y = yld_ave)) +
# create scatter plot
  geom_point(size = 0.5) +
  
# set x-axis label to 3 decimals
  scale_x_continuous(labels = label_number(accuracy = 0.001)) +

# label both axes and the plot title  
  labs(x = expression(paste("Crop water productivity (kg ", m^-3, ")")),
       y = expression(paste("Crop yield (kg ", ha^-1, ")")),
       title = "Crop yield vs crop water productivity") +

# centre the title label (hjust = 0 is left, 0.5 = center, 1 = right) 
  theme(plot.title = element_text(hjust = 0.5)) +  
  
# add a trend line
  geom_smooth(method = "lm",
              formula = y ~ x + 0,
              color = "blue",
              linetype = "dashed",
              size = 1.0,
              na.rm = FALSE,
              se = FALSE,
              show.legend = TRUE) 
Figure 1: Relationship between yield and crop water productivity for cowpea planted on 01 November at 80 000 plants per hectare across 3271 suitable HRZs in South Africa

For the trend line, various statistics (e.g. R squared) were calculated. The slope value provides the average water use of cowpea in m3 ha-1 (divide by 10 to convert to mm ha-1).

# extract the two variables to assess for collinarity
reg_dat <- cpe_se1 |>
  select("x" = cwp_ave, "y" = yld_ave)
# fit a linear regression model to the data with NO intercept
reg_mod <- lm(y ~ x + 0, data = reg_dat)
reg_sta <- summary(reg_mod)
slp = reg_sta$coefficients["x", "Estimate"]
r2v = reg_sta$r.squared
r2a = reg_sta$adj.r.squared
# print the equation
cat("The regression equation is: Yield =", round(slp, 2), "* CWP\n")
The regression equation is: Yield = 3561.58 * CWP
cat("The R-squared value is:", round(r2v, 4), "\n")
The R-squared value is: 0.9587 
cat("The adjusted R-squared value is:", round(r2a, 4), "\n")
The adjusted R-squared value is: 0.9587 

If Pearson’s correlation coefficient (r) is above 0.5, the variables are highly correlated. Since the r value approaches 1, Y and CWP are highly correlated, and thus either Y or CWP (but not both) should be used for determining suitable crop production zones.

pcc = cor(reg_dat$x, reg_dat$y, method = 'pearson')
cat("The Pearson correlation coefficent is:", round(pcc, 4), "\n")
The Pearson correlation coefficent is: 0.939 
# Spearman correlation coefficient
scc = cor(reg_dat$x, reg_dat$y, method = 'spearman')
cat("The Spearman correlation coefficent is:", round(scc, 4), "\n")
The Spearman correlation coefficent is: 0.9515 
# Kendall correlation coefficient
kcc = cor(reg_dat$x, reg_dat$y, method = 'kendall')
cat("The Kendall correlation coefficent is:", round(kcc, 4), "\n")
The Kendall correlation coefficent is: 0.8068 

3b. Suitable cowpea production zones

A histogram can be created for each variable (e.g. seasonally averaged yield) showing the full range of values. Figure 2 illustrates the seasonally averaged yield simulations (yld_ave), which represent a right-skewed distribution (i.e. less HRZs with high yields), despite the elimination of many unsuitable zones.

library(ggplot2)
ggplot(data = cpe_se1,
       mapping = aes(x = yld_ave)) +
 
  geom_histogram(col = "black",
                 fill = "white",
                 breaks = seq(0, 2600, 25)) +

# label both axes and the plot title
  labs(x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
       y = "Number of HRZs",
       title = "Range in simulated cowpea yields") +

# removes default grey background fill
  theme_minimal() +
    
# centre the title label (hjust = 0 is left, 0.5 = center, 1 = right)
  theme(plot.title = element_text(hjust = 0.5)) 
Figure 2: Histogram of yield variation across 3271 homogeneous response zones (HRZs) deemed suitable for cowpea production

As expected, there are no yields below 25 kg ha-1 as these very low yields were excluded (see Section 2b). However, the largest number of zones have a simulated yield ranging from 50-75 kg per ha. Since cowpea is mainly cultivated by subsistence and small-scale farmers (i.e. low input farming), yields are typically low (25-300 kg ha-1), as simulated by the AquaCrop model.

val <- nrow(cpe_se1 |>
            filter(yld_ave > 50 & yld_ave <= 75))
cat("The number of HRZs with an average yield ranging from 50-75 kg per ha is", val, "\n")
The number of HRZs with an average yield ranging from 50-75 kg per ha is 111 
val <- nrow(cpe_se1 |>
            filter(yld_ave > 25 & yld_ave <= 300))
cat("The number of HRZs with an average yield ranging from 25-300 kg per ha is", val, "\n")
The number of HRZs with an average yield ranging from 25-300 kg per ha is 724 
val <- nrow(cpe_se1 |>
            filter(yld_ave > 300))
cat("The number of HRZs with an average yield exceeding 300 kg per ha is", val, "\n")
The number of HRZs with an average yield exceeding 300 kg per ha is 2547 

Seasonally averaged yields simulated by AquaCrop ranged from 28.6 to 2589 kg ha-1, with a mean and median of 725 and 628 kg ha-1, respectively. The highest averaged yield of 2589 kg ha-1 was simulated for HRZ number 4573.

# useful statistics for the yield variable
cpe_se1 |>
  select(yld_ave) |>            # select only one variable of interest
  filter(!is.na(yld_ave)) |>    # exclude any NA values
  arrange(yld_ave) |>           # arrange in ascending order
  summarise(                    # produce sumamry statistics
    Suitable = n(),             # returns 3271 zones
    Mean = mean(yld_ave),       # returns  725 kg per ha
    Median = median(yld_ave),   # returns  628 kg per ha
    SD = sd(yld_ave),           # returns  504 kg per ha 
    CV = 100 * SD / Mean,       # returns 69.5 %  
    Min = min(yld_ave),         # returns 28.6 kg per ha
    Max = max(yld_ave)          # returns 2589 kg per ha
  ) |>
  gt() |>
    tab_header(title = md("**Yield statistics across suitable HRZs**"),
               subtitle = html("(<b>in kg ha<sup>-1</sup> except for CV in %</b>)")) |>
    fmt_number(columns = Suitable, decimals = 0, sep_mark = "") |>
    fmt_number(columns = Mean:Max, decimals = 1, sep_mark = "")
Yield statistics across suitable HRZs
(in kg ha-1 except for CV in %)
Suitable Mean Median SD CV Min Max
3271 724.8 628.0 503.6 69.5 28.6 2589.0
cpe_se1 |>
  select(subc, yld_ave) |>   # select variables of interest
  filter(!is.na(yld_ave)) |> # exclude any NA values
  arrange(desc(yld_ave)) |>  # arrange in descending order
  head( n = 1)               # returns subc = 4573
# A tibble: 1 × 2
   subc yld_ave
  <dbl>   <dbl>
1  4573    2589

The suitability score was used to create a density plot (Figure 3), which shows the largest distribution of lower yield values classified as marginally suitable versus the smallest distribution of higher yield values classified as highly suitable.

fil_col = c("High" = "green",
            "Moderate" = "yellow",
            "Marginal" = "orange")
lne_col = c("High" = "grey60",
            "Moderate" = "grey60",
            "Marginal" = "grey60")

# Move Marginally suitable (S3) to bottom of legend using factors and levels
leg_ord = c("Marginal", "Moderate", "High")

ggplot(data = cpe_se1 |>
       mutate(suitability = factor(suitability, levels = leg_ord)),
       mapping = aes(x = yld_ave,
                     fill = suitability,
                     color = suitability)) +
  geom_density(alpha = 0.50,
               adjust = 0.7) +
 
  scale_x_continuous(breaks = seq(0, 2600, by = 200)) +
 
  scale_fill_manual(name ="Suitability",
                    values = fil_col,
                    labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
                    ) +

  # legend MUST have the same name
  scale_color_manual(name = "Suitability",
                     values = lne_col,
                     labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
                     ) +
 
  labs(title = "Density plot of simulated cowpea yields",
       x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
       y = "Density") +

  # removes default grey background fill
  theme_minimal() +
    
  # centre the title label (hjust = 0 is left, 0.5 = center, 1 = right)
  theme(plot.title = element_text(hjust = 0.5))
Figure 3: Density plot of simulated cowpea yields across suitable growing areas in South Africa

Another way to view the distributions by suitability is using a ridgeline plot (also called a joy plot), which is useful for visualising the distribution of a continuous variable (yield) across different categories (suitability). In Figure 4, the vertical lines show the median (50th percentile) and the interquartile range (25th to 75th percentile).

library(ggridges)
fil_col = c("High" = "green",
            "Moderate" = "yellow",
            "Marginal" = "orange")
lne_col = c("High" = "green",
            "Moderate" = "orange",
            "Marginal" = "brown")
leg_ord = c("Marginal", "Moderate", "High")

ggplot(data = cpe_se1 |>
       mutate(suitability = factor(suitability, levels = leg_ord)),
       mapping = aes(x = yld_ave,
                     y = suitability,
                     fill = suitability,
                     color = suitability)) +
  stat_density_ridges(alpha = 0.1,
                      scale = 1,
                      rel_min_height = 0.01,
                      quantile_lines = TRUE,
                      quantiles = c(0.25, 0.50, 0.75)) +

  scale_x_continuous(breaks = seq(0, 2600, by = 200)) +

  scale_fill_manual(name ="Suitability",
                    values = fil_col,
                    labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
                    ) +
  scale_color_manual(name = "Suitability",
                     values = lne_col,
                     labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
                     ) +

  labs(title = "Ridgeline plot of simulated cowpea yields",
       x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
       y = "Density") +

  # removes default grey background fill
  theme_minimal() +
    
  # centre the title label (hjust = 0 is left, 0.5 = center, 1 = right) 
  theme(plot.title = element_text(hjust = 0.5))
Figure 4: Ridgeline plot of simulated cowpea yields across suitable growing areas in South Africa

Figure 5 again shows how yield increases from marginally (S3) to highly suitable (S1) zones. The figure also highlights the problem of using discrete thresholds (25, 50 and 75%) to classify a continuous variable (HI/HIo), especially if the thresholds are subjective.

# box and whisker plot
fil_col = c("High" = "green",
            "Moderate" = "yellow",
            "Marginal" = "orange")
leg_ord = c("High", "Moderate", "Marginal")
leg_lab = c("High (S1)", "Moderate (S2)", "Marginal (S3)")

ggplot(data = cpe_se1 |>
       mutate(suitability = factor(suitability, levels = leg_ord)),
       mapping = aes(x = yld_ave,
                     y = suitability,
                     fill = suitability)
       ) +

  geom_boxplot(alpha = 0.5, outlier.shape = NA) +

  scale_x_continuous(breaks = seq(0, 2600, by = 200)) +  

  scale_y_discrete(labels = leg_lab) +
  
  scale_fill_manual(values = fil_col) +

  labs(title = "Box plot of simulated cowpea yields",
       x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
       y = "Suitability") +

  # removes default grey background fill
  theme_minimal() +

  # no legend
  theme(legend.position="none") +

  # centre the title label (hjust = 0 is left, 0.5 = center, 1 = right) 
  theme(plot.title = element_text(hjust = 0.5))
Figure 5: Box plot of simulated cowpea yields across suitable growing areas in South Africa

From the table of statistics for the seasonally averaged yield grouped by suitability (Table 2), there are more zones classified as S3 (1553 zones), compared to S2 (1320) and S1 (398). The spatially averaged yield is 357, 880 and 1646 kg ha-1 for S3, S2 and S1 zones, respectively. The median yield is 329 (S3), 845 (S2) and 1597 kg ha-1 (S1), as shown in the box plot (Figure 5).

tab_ord = c("Marginal", "Moderate", "High")
var_nme = "yld_ave"
tab_sum <- cpe_se1 |>
# order table as "Marginal", "Moderate", "High" (not alphabetical)
  mutate(suitability = factor(suitability, levels = tab_ord)) |>
  group_by(suitability) |>
  filter(!is.na(var_nme)) |>
  select(suitability, var_nme) |>
  arrange(var_nme) |>  
  summarise(
    cnt = n(),                                    # number of values  
    dis = n_distinct(!!sym(var_nme)),             # unique number of values
    ave = mean(!!sym(var_nme)),                   # mean of values
    p25 = quantile(!!sym(var_nme), probs = 0.25), # 25th percentile
    p50 = quantile(!!sym(var_nme), probs = 0.50), # 50th percentile (median)
    p75 = quantile(!!sym(var_nme), probs = 0.75), # 75th percentile
    min = min(!!sym(var_nme)),                    # smallest value
    max = max(!!sym(var_nme)),                    # largest value
    sdv = sd(!!sym(var_nme)),                     # standard deviation
    cov = 100 * sdv / ave                         # coefficient of variation (%)
  )

# see gt.rstudio.com
tab_sum |>
  # gt() formats the table 
  gt() |>
  # give a header with a title and subtitle
  tab_header(title = md("**Cowpea yield statistics**"),
            subtitle = html("Planted on 01 November at 80 000 plants ha<sup>-1</sup>")
  ) |>
  # format no. of decimals
  # https://gt.rstudio.com/reference/fmt_number.html
  fmt_number(columns = cnt:dis, decimals = 0, sep_mark = "") |>
  fmt_number(columns = ave:cov, decimals = 1, sep_mark = "") |>
  cols_label(
    suitability = md("**Suitability<br>class**"),
    cnt = md("**No. of<br>HRZs**"),
    dis = md("**Distinct<br>values**"),
    ave = html("<b>Average</b><br>(kg ha<sup>-1</sup>)"),
  ) |>
  # 1st grouping
  tab_spanner(
    label = md("**Percentiles**"),
    columns = c(p25, p50, p75)
  ) |>
  cols_label(
    p25 = "{{25^th}}",
    p50 = "{{50^th}}",      
    p75 = "{{75^th}}",      
  ) |>
  # 2nd grouping
  tab_spanner(
    label = html("<b>Range (kg ha<sup>-1</sup>)</br>"),
    columns = c(min, max)
  ) |>
  cols_label(
    min = "Min",
    max = "Max",      
  ) |>
  # 3rd grouping
  tab_spanner(
    label = md("**Spread**"),
    columns = c(sdv, cov)
  ) |>
  cols_label(
#   sdv = md("SD (kg {{ha^-1}})"), # does not do superscipt in rendered HTML
    sdv = html("SD<br>(kg ha<sup>-1</sup>)"),
    cov = md("CV<br>(%)"),      
  )
Table 2: Useful statistics describing the range in cowpea yield (kg/ha) across three suitability classes
Cowpea yield statistics
Planted on 01 November at 80 000 plants ha-1
Suitability
class
No. of
HRZs
Distinct
values
Average
(kg ha-1)
Percentiles
Range (kg ha-1)
Spread
25th 50th 75th Min Max SD
(kg ha-1)
CV
(%)
Marginal 1553 1527 356.8 157.2 328.5 490.8 28.6 1200.7 241.2 67.6
Moderate 1320 1302 880.1 655.9 844.9 1089.8 52.3 1874.4 298.0 33.9
High 398 397 1645.9 1395.1 1597.3 1883.1 1052.5 2589.0 325.5 19.8

4. Conclusions

  • Only 3271 of the 5838 HRZs across South Africa are considered suitable for rainfed production of cowpea.
  • Of these, 398 zones are classified as highly suitable, where simulated yields averaged 1646 kg ha-1 (range of 1052-2589 kg ha-1).
  • Similarly, 1320 zones are classified as moderately suitable with an average yield of 880 kg ha-1.
  • The remaining 1553 zones are deemed marginally suitable, with an average yield of 357 kg ha-1.
  • The analysis highlights the importance of eliminating “noise” (especially outliers) when working with large datasets.
  • It also highlights the problem of using discrete thresholds to classify continuous variables.

5. References

FAO. (1976). A framework for land evaluation. FAO. https://www.fao.org/3/x5310e/x5310e00.htm
Gerrano, A. S., Lubinga, M. H., & Bairu, M. W. (2022). Genetic resources management, seed production constraints and trade performance of orphan crops in southern africa: A case of cowpea. South African Journal of Botany, 146, 340–347. https://doi.org/10.1016/j.sajb.2021.11.007
Hsiao, T. C., Lee, H., Steduto, P., Basilio, R. L., Raes, D., & Fereres, E. (2009). AquaCrop – the FAO crop model to simulate yield response to water: III. Parameterization and testing for maize. Agronomy Journal, 101(3), 448–459. https://doi.org/10.2134/agronj2008.0218s
Karl, K., MacCarthy, D., Porciello, J., Chimwaza, G., Fredenberg, E., Freduah, B. S., Guarin, J., Mendez Leal, E., Kozlowski, N., Narh, S., Sheikh, H., Valdivia, R., Wesley, G., Deynze, A. van, Zonneveld, M. van, & Yang, M. (2024). Opportunity crop profiles for the vision for adapted crops and soils (VACS) in africa (p. 78). Center for Climate Systems Research, Columbia University. https://academiccommons.columbia.edu/doi/10.7916/7msa-yy32
Lake, S. (2024). A novel approach to mapping areas suitable for rainfed production of bambara nut and cowpea [Master’s thesis]. Centre for Water Resources Research, School of Agricultural, Earth; Environmental Sciences, University of KwaZulu-Natal.
Raes, D., Steduto, P., Hsiao, T. C., & Fereres, E. (2009). AquaCrop – the FAO crop model to simulate yield response to water: II. Main algorithms and software description. Agronomy Journal, 101(3), 438–447. https://doi.org/10.2134/agronj2008.0140s
Steduto, P., Hsiao, T. C., Fereres, E., & Raes, D. (2012). Crop yield response to water (p. 500). Food; Agriculture Organisation (FAO) of the United Nations. http://www.fao.org/docrep/016/i2800e/i2800e00.htm
Steduto, P., Hsiao, T. C., Raes, D., & Fereres, E. (2009). AquaCrop – the FAO crop model to simulate yield response to water: I. Concepts and underlying principles. Agronomy Journal, 101(3), 426–437. https://doi.org/10.2134/agronj2008.0139s

Reuse

CC-BY-4.0

Citation

BibTeX citation:
@online{kunz2025,
  author = {Kunz, Richard},
  title = {Land Suitability Analysis for Cowpea Production in {South}
    {Africa} Using Output Simulated by the {AquaCrop} Model},
  date = {2025-12-09},
  url = {https://ds4owd-002.github.io/project-kunzrp/},
  langid = {en},
  abstract = {The AquaCrop model, developed by the Food and Agriculture
    Organisation, was run for cowpea planted on the 01\^{}st\^{} of
    November at 80000 plants per hectare across 5838 homogeneous
    response zones (HRZs) in South Africa. The model generated up to 49
    consecutive seasons of data from 1950/51 to 1998/99 for each HRZ,
    from which long-term statistics were calculated for important
    AquaCrop output variables (part 1). The statistics were then
    analysed to determine which HRZs are deemed suitable for rainfed
    cultivation of cowpea (part 2). The crop can be grown in 3271 of the
    5838 HRZs.}
}
For attribution, please cite this work as:
Kunz, R. (2025, December 9). Land suitability analysis for cowpea production in South Africa using output simulated by the AquaCrop model. Capstone Project. https://ds4owd-002.github.io/project-kunzrp/