Land suitability analysis for cowpea production in South Africa using output simulated by the AquaCrop model

Capstone project

Author

Affiliation

Richard Kunz

Centre for Water Resources Research, University of KwaZulu-Natal, Pietermaritzburg

Published

December 9, 2025

Abstract

The AquaCrop model, developed by the Food and Agriculture Organisation, was run for cowpea planted on the 01^st of November at 80000 plants per hectare across 5838 homogeneous response zones (HRZs) in South Africa. The model generated up to 49 consecutive seasons of data from 1950/51 to 1998/99 for each HRZ, from which long-term statistics were calculated for important AquaCrop output variables (part 1). The statistics were then analysed to determine which HRZs are deemed suitable for rainfed cultivation of cowpea (part 2). The crop can be grown in 3271 of the 5838 HRZs.

1. Introduction

Cowpea is an underutilised indigenous legume crop grown by smallholder farmers in South Africa. When compared to other legumes, cowpea is tolerant to water and heat stress, and thus is resilient to increasing climate variability. According to Karl et al. (2024), cowpea is rich in protein and other essential nutrients (e.g. folate, zinc and iron), making the crop suitable for improving food security in rural communities. Despite the crop’s potential, it remains underutilised and neglected (i.e. under-researched) in South Africa. Hence, there is limited evidence from field experiments that describe the crop’s potential. To fill this knowledge gap, a crop simulation model developed by the Food and Agriculture Organisation (FAO) called AquaCrop (Hsiao et al., 2009; Raes et al., 2009; Steduto et al., 2009), was used to generate simulated data across South Africa for cowpea planted on the 01^st of November in each season at 80000 plants per hectare (ha^-1).

2. Methods

For each HRZ, AquaCrop generated up to 49 consecutive seasons of data from 1950/51 to 1998/99, from which long-term statistics were calculated for important AquaCrop output variables. The 1^st objective was to generate long-term statistics from the seasonal dataset and to store this output in the data/processed folder (part 1; see Section 2a). The 2^nd objective was to analyse the statistics to select zones that are deemed suitable for cowpea production (part 2; see Section 2b). This dataset is stored in the data/final folder. For the data analysis, the following R libraries are required:

library(tidyverse)
library(gt)
library(ggplot2)
library(scales)

2a. Generating long-term statistics

The raw dataset contains output simulated by AquaCrop for cowpea using 50 years of input climate and soil data for each of the 5838 homogeneous response zones (HRZs) across South Africa, which generated up to 49 seasons of data (1950/51 to 1998/99) for 22 important output variables (from Rain to WPet).

cpe_raw <- read_csv(here::here("data/raw/subc_5838_cpea_mthk_fro1.csv"))
ncl <- ncol(cpe_raw)
cat("The number of columns in the data frame (raw_cpe) is:", ncl)

The number of columns in the data frame (raw_cpe) is: 30

nrw <- nrow(cpe_raw)
cat("The number of rows in the dataframe is:", nrw)

The number of rows in the dataframe is: 268994

cat("The first five rows of data are:")

The first five rows of data are:

head(cpe_raw, n = 5) |>
  gt()

Sea	Day1	Month1	Year1	Rain	ETo	GD	CO2	Infilt	Runoff	Drain	E	E/Ex	Tr	TrW	Tr/Trx	Cycle	TmpStr	ExpStr	StoStr	BioMass	BioRel	HI	Yield	WPet	DayN	MonthN	YearN	subc
1	1	11	1950	299.8	581.4	1261.3	311.30	290.5	9.3	47.5	100.7	29	201.3	201.3	72	95	1	22	22	4.914	58	18.2	0.895	0.30	11	2	1951	1
2	1	11	1951	142.2	585.4	1271.9	311.54	142.2	0.0	4.4	88.6	20	109.1	109.1	58	81	0	39	34	2.074	29	10.3	0.214	0.11	30	1	1952	1
3	1	11	1952	362.1	559.9	1262.1	311.78	355.3	6.8	89.7	141.2	46	174.6	174.6	59	92	1	31	27	4.176	53	4.7	0.194	0.06	8	2	1953	1
4	1	11	1953	287.0	562.4	1267.6	313.00	283.1	3.9	50.7	128.2	40	152.9	152.9	55	94	1	32	32	3.767	46	8.1	0.307	0.11	10	2	1954	1
5	1	11	1954	538.1	546.6	1259.7	314.25	489.8	48.3	206.0	129.5	38	148.3	148.3	62	97	1	43	24	3.960	46	18.4	0.730	0.26	13	2	1955	1

cat("The last five rows of data are:")

The last five rows of data are:

tail(cpe_raw, n = 5) |>
  gt()

Sea	Day1	Month1	Year1	Rain	ETo	GD	CO2	Infilt	Runoff	Drain	E	E/Ex	Tr	TrW	Tr/Trx	Cycle	TmpStr	ExpStr	StoStr	BioMass	BioRel	HI	Yield	WPet	DayN	MonthN	YearN	subc
45	1	11	1994	294.6	476.3	1263.4	359.82	270.0	24.6	102.6	156.8	30	0.3	0.3	47	73	0	50	27	0.007	0	0.0	0.000	0.00	18	1	1995	5838
46	1	11	1995	221.1	446.0	1263.6	361.72	209.6	11.5	51.7	105.4	31	81.5	81.5	57	74	0	40	29	2.254	30	10.1	0.227	0.12	20	1	1996	5838
47	1	11	1996	298.1	470.8	1270.1	363.17	285.0	13.1	26.9	113.2	33	129.5	129.5	78	72	0	40	21	3.222	44	19.6	0.633	0.26	17	1	1997	5838
48	1	11	1997	293.3	464.7	1259.5	365.22	264.4	28.9	52.5	128.6	40	100.2	96.9	60	75	0	35	29	2.695	35	20.4	0.551	0.24	19	1	1998	5838
49	1	11	1998	486.6	445.2	1268.7	367.54	439.5	47.1	142.6	141.4	66	208.0	208.0	78	77	1	11	15	5.705	75	18.9	1.081	0.31	22	1	1999	5838

Since the number of rows is less than the expected 286062 rows (i.e. 5838 x 49), not all zones have 49 seasonal simulations, which is important to note. In total, there are 17219 rows of “missing” data. This occurs because AquaCrop is not run for seasons and HRZs that are too cold to support economically viable crop production. According to Gerrano et al. (2022), cowpea’s potential yield is about 3 tons per ha (or t ha^-1). A total of 162 seasonal yield simulations exceeded 3 t ha^-1.

cpe_raw |>
  filter(Yield >= 3, is.na = TRUE) |>   # select yields >= 3 tons per hectare
                                        # exclude any missing (NA) values
    count()                             # 162 seasonal values with yield > 3 t/ha

The highest yield of 3.36 t ha^-1 was simulated in HRZ number (subc) 4252 in one season. HRZ number 5767 has 8 seasonal yields of 3 t ha^-1 or higher, followed by zones 4567, 5758 and 5764 with 7 seasonal yields exceeding 3 t ha^-1.

cpe_raw |>
  select(subc, Sea, Yield) |>          # select variables of interest
  filter(Yield >= 3, is.na = TRUE) |>  # select yields >= 3 tons per hectare (t/ha)
                                       # exclude any NA values
  group_by(subc) |>                    # group yields by HRZ number (subc)
  summarise(nsea = n(),                # count the number of seasons (sea) 
            ymax = max(Yield)          # obtain the largest yield in each HRZ
    ) |>
  arrange(desc(ymax)) |>               # arrange maximum yield in descending order, or
  gt() |>                              # output a table with 78 zones
  tab_options(
    container.height = "300px",        # Set a fixed height for vertical scroll
    container.width = "200px"          # Set a fixed width for horizontal scroll
  )

subc	nsea	ymax
4252	1	3.357
4567	7	3.346
5758	7	3.340
5764	7	3.329
5767	8	3.322
3296	4	3.287
4585	6	3.284
4621	3	3.283
4627	2	3.268
5683	4	3.266
4448	1	3.248
4472	5	3.246
4579	6	3.227
3290	1	3.206
4576	6	3.202
5371	2	3.173
5618	1	3.172
5374	2	3.167
4465	3	3.166
4568	4	3.161
5432	1	3.156
3323	2	3.151
5677	1	3.143
1677	1	3.137
551	1	3.132
3362	1	3.130
5494	2	3.128
5590	1	3.124
5434	2	3.122
5617	1	3.122
655	4	3.121
5755	2	3.121
5614	3	3.120
4577	1	3.113
5753	2	3.112
4573	1	3.111
661	1	3.109
4843	2	3.105
5499	1	3.102
658	2	3.101
5675	4	3.096
5279	1	3.095
5549	1	3.094
587	2	3.093
4683	2	3.091
5657	2	3.090
5497	1	3.085
5611	2	3.084
5684	1	3.082
4628	1	3.077
5429	1	3.072
4427	1	3.069
556	1	3.067
4638	1	3.061
5686	2	3.059
557	1	3.056
670	1	3.051
5788	2	3.048
5495	1	3.046
682	2	3.044
735	1	3.042
3366	1	3.040
5654	1	3.039
4917	1	3.038
4675	1	3.034
1701	1	3.033
3363	3	3.033
3311	1	3.031
550	1	3.029
547	1	3.024
5544	1	3.021
742	1	3.013
4941	1	3.007
760	1	3.006
5431	1	3.005
3369	1	3.004
5594	1	3.004
4841	1	3.002

For each season in all HRZs, the following additional variables were calculated:

yield (Y) in kg ha^-1 (not t ha^-1),
evapotranspiration (ET or crop water use in mm) as the sum of soil water evaporation (E) and crop transpiration (Tr),
crop water productivity (CWP in kg m^-3) as the ratio of Y (in kg ha^-1) to ET (in m³ ha^-1), and
the ratio of harvest index (HI) to the reference harvest index (HIo) expressed as a percentage, where HIo is 26% for cowpea.

HI is used to calculate crop yield from accumulated biomass production. Although AquaCrop outputs crop water productivity (called WPet), it was calculated per season to check the model’s output and the summation of ET.

cpe_new <- cpe_raw |>
  mutate("Yield_new" = Yield * 1000, .after = Yield) |>    # convert Y to kg/ha
  mutate("ET" = E + Tr, .after = Yield_new) |>             # sum E & T to ET
  mutate("CWP" = Yield_new  / (ET * 10), .after = ET) |>   # calculate CWP = Y / ET
  mutate("HI/HIo" = 100 * HI / 26, .after = HI)            # calculate Hi/HIo

It is important to note that certain HRZs have no data because the model crashed during execution and produced no output. This happens when the zone’s climate is far too cold for crop cultivation.

nzo <- cpe_new |>
  filter(Sea == 0) |>         # if Sea = 0, there is no data, i.e. the model crashed
  count()
cat("The number of zones with no yield data is:", round(nzo[[1]], 0))

The number of zones with no yield data is: 151

For selected variables such as rainfall (Rain or rfl), crop cycle length (Cycle or ccl), B/Bx (BioRel or bxr), HI/HIo (hxr), yield (or yld) and CWP (or WPet),

cpe_trm <- cpe_new |>
  select("subc", "seas" = "Sea", "rfl" = "Rain", "ccl" = "Cycle", "bxr" = "BioRel", "hxr" = "HI/HIo", "yld" = "Yield_new", "cwp" = "WPet")

the 1^st main objective was to generate long-term statistics (e.g. mean, standard deviation, coefficient of variation, number of zero values, number of missing (NA) values, total number of values) from the seasonal data.

cpe_sta <- cpe_trm |>
  group_by(subc) |>
  summarise(
    # for all variables from rfl to cwp, no need for a loop
    across(rfl:cwp,
           list(                      
      # calculate the mean
      ave = \(x) mean(x, na.rm = TRUE),
      # calculate the standard deviation
      sdv = \(x) sd(x, na.rm = TRUE),
      # calculate the coefficent of variation as a %
      cov = \(x) 100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) 
    )),
    yld_non = sum(is.na(yld)),             # count of missing yield values
    yld_noi = sum(is.infinite(yld)),       # count of infinite yield values
    yld_nom = 49 - n(),                    # count of missing seasonal yields
    yld_noz = sum(yld == 0, na.rm = TRUE), # count of zero yields
    yld_nov = sum(yld > 0, na.rm = TRUE),  # count of non-zero yields
    yld_nos = n()                          # count of seasons with yield values
  )

To ensure the statistics were correct, the following were summed, namely the number of:

missing (NA) yield values (yld_non),
infinite (INF) yield values (yld_noi),
missing seasons (yld_nom),
zero yield values (yld_noz), and
non-zero yield values (yld_nov).

If the above values did not sum to 49 seasons in total, then there is an error.

tot <- cpe_sta |>
  mutate("yld_chk" = (yld_non + yld_noi + yld_nom + yld_noz + yld_nov)) |>
  count(yld_chk)
cat("The number of zones with a sum of", tot[[1]], "values is", tot[[2]])

The number of zones with a sum of 49 values is 5838

The number of yield values equal to zero kg ha^-1 (yld_noz) or greater than zero kg ha^-1 (yld_nov) was also calculated (yld_nzv). Furthermore, the risk of crop failure (RCF) was calculated as a ratio of the number of missing, infinite and zero seasonal yields to the total number of simulations (49), which was then expressed as a percentage.

cpe_sta <- cpe_sta |>
  mutate("yld_nzv" = yld_noz + yld_nov, .after = "yld_nov") |>
  mutate("rcf_per" = 100 * (yld_non + yld_noi + yld_nom + yld_noz) / 49,
         .after = cwp_cov)

The analysis-ready statistics were outputted to a .CSV file in the data/processed folder. This was important since the number of rows of data was reduced from 268994 to 5838.

write_csv(cpe_sta, here::here("data/processed/stat_5838_cpea_mthk_fro1.csv"))

2b. Identifying suitable cowpea production zones

Cowpea cannot be cultivated in all 5838 HRZs across the country (e.g. in zones that are too dry and/or cold), and thus the goal was to analyse the statistics to identify zones that are deemed suitable for cowpea production.

cpe_sta <- read_csv(here::here("data/processed/stat_5838_cpea_mthk_fro1.csv"))

The first step involved identifying variables with missing (i.e. NA) data.

cpe_sta |>
  summarise(
    # for all variables from rfl_ave to rcf_per, no need for a loop
    across(rfl_ave:rcf_per,
           list(mis = \(x) sum(is.na(x))       # calculate number of NA values
  ))) |>
  pivot_longer(cols = rfl_ave_mis:rcf_per_mis, # convert table from wide to long format
               names_to = "variable",
               values_to = "num_na") |>
  arrange(num_na) |>                           # sort table into ascending order
  filter(num_na > 0) |>                        # filter out variables with no NA values
  gt() |>
  tab_options(
    container.height = "300px",                # set a fixed height for vertical scroll
    container.width = "200px"                  # set a fixed width for horizontal scroll
  )

variable	num_na
rfl_sdv_mis	189
rfl_cov_mis	189
ccl_sdv_mis	189
ccl_cov_mis	189
bxr_sdv_mis	189
hxr_sdv_mis	189
yld_sdv_mis	189
cwp_sdv_mis	189
bxr_cov_mis	192
hxr_cov_mis	362
yld_cov_mis	362
cwp_cov_mis	367

cnt <- cpe_sta |>
  filter(is.na(cwp_cov)) |>                  # check for NA values in CV of CWP
  count()                                    # count the number of NA values
cat("To verify, the number of NA values in CWPcov is", cnt[[1]])

To verify, the number of NA values in CWPcov is 367

If the coefficient of variation (CV in %) was missing (NA), values were converted to 700%. Similarly, missing standard deviations were converted to a value of 1200. A subset of the statistics was then created by only considering variables that could be used to identify suitable zones for cowpea production.

cpe_sta_fin <- cpe_sta |>
   mutate(across(ends_with("_cov"), \(x) coalesce(x, 700))) |>
   mutate(across(ends_with("_sdv"), \(x) coalesce(x, 1200))) |>
   select("subc", "rfl_ave", "ccl_ave", "bxr_ave", "hxr_ave",
          "yld_nzv", "yld_ave", "cwp_ave", "cwp_cov", "rcf_per")

By creating new variables (set to 1 or yes), suitable crop production areas were identified with:

a) above zero seasonally averaged rainfall (RFL_AVE) data (i.e. also excludes zones with no data);

cpe_sta_fin$rfl_ave_gt0 <- ifelse(cpe_sta_fin$rfl_ave > 0, 1, 0)

cnt <- cpe_sta_fin |>
  filter(rfl_ave_gt0 == 0) |>
  count() # 155 zones excluded
cat("Number of zones excluded since RFLave = 0 mm is", cnt[[1]])

Number of zones excluded since RFLave = 0 mm is 155

b) a seasonally average HI/HIo ratio (called hxr_ave) above 25% (Lake, 2024):

cpe_sta_fin$hxr_ave_gt25 <- ifelse(cpe_sta_fin$hxr_ave > 25, 1, 0)

c) more than 4 seasons of data (else too few data values for calculating reliable statistics);

cpe_sta_fin$yld_nzv_ge4 <- ifelse(cpe_sta_fin$yld_nzv >= 4, 1, 0)

d) more than 30 (out of 49) seasons of data, else insufficient data values to calculate reliable long-term statistics according to Steduto et al. (2012);

cpe_sta_fin$yld_nzv_ge30 <- ifelse(cpe_sta_fin$yld_nzv >= 30, 1, 0)

e) average yields above 25 kg ha^-1 (else unviable for smallholder farmers according to Gerrano et al. (2022)), which helps to exclude zones where CWP is 0 kg m^-3 of water use and the RCF is 100%;

cpe_sta_fin$yld_ave_gt25 <- ifelse(cpe_sta_fin$yld_ave > 25, 1, 0)

f) low (e.g. < 55%) risk of crop failure (else economically unviable).

cpe_sta_fin$rcf_per_lt55 <- ifelse(cpe_sta_fin$rcf_per < 55, 1, 0)

Next, another variable called suitable (0=no; 1=yes) was created from the product of the above-mentioned criteria.

cpe_sta_fin <- transform(cpe_sta_fin, suitable = rfl_ave_gt0 * hxr_ave_gt25 * yld_nzv_ge4 * yld_nzv_ge30 * yld_ave_gt25 * rcf_per_lt55)

According to FAO (1976), suitable cropping areas can be classified as highly suitable (S1), moderately suitable (S2) or marginally suitable (S3). Lake (2024) noted that the seasonally averaged HI/HIo ratio (variable called hxr_ave) can be used to classify suitability as follows: < 50% (S3), 50 to 75% (S2) and > 75% (S1). For zones deemed suitable for cowpea production (i.e. suitable=1), another suitability index was created as follows:

cpe_sta_fin <- cpe_sta_fin |>
  mutate(suitability = 
    case_when(
      hxr_ave  <= 25 ~ "Unsuitable",
      hxr_ave  >  25 & hxr_ave <= 50 ~ "Marginal",
      hxr_ave  >  50 & hxr_ave <= 75 ~ "Moderate",
      hxr_ave  >  75 ~ "High")
    )

The analysed statistics were then outputted to a .CSV file in the data/final folder.

write_csv(cpe_sta_fin, here::here("data/final/suit_5838_cpea_mthk_fro1.csv"))

3. Results

To recap, AquaCrop was run for all HRZs to simulate cowpea planted on the 01^st of November in each season at 80 000 plants ha^-1. Long-term statistics for important AquaCrop output variables were calculated from the seasonal simulations. These statistics were then used to determine which zones are deemed suitable for rainfed production of cowpea. An analysis of the final dataset is presented in this section.

# read the final .CSV file into a data frame
cpe_sta_fin <- read_csv(here::here("data/final/suit_5838_cpea_mthk_fro1.csv"))

The mean annual rainfall across the western parts of South Africa is below 400 mm, which is considered too dry to support economically viable crop production under rainfed conditions. Similarly, the climate is too cold in many HRZs located in mountainous regions, especially those bordering the country of Lesotho. Hence, not all HRZs are suitable for cowpea cultivation. The number of zones eliminated in each step was calculated in two different ways:

tab_sum_eln <- cpe_sta_fin |>
   # order the variables from highest to lowest number of zones eliminated
   select("hxr_ave_gt25", "rcf_per_lt55", "yld_ave_gt25",
          "yld_nzv_ge4", "yld_nzv_ge30", "rfl_ave_gt0", "suitable") |>
   summarise(
     across(everything(),
            list(n = \(x) sum(x)) # calculate the sum of zones not eliminated
     )
   )

tab_sum_ely <- cpe_sta_fin |>
   # order the variables from highest to lowest number of zones eliminated
   select("hxr_ave_gt25", "rcf_per_lt55", "yld_ave_gt25",
          "yld_nzv_ge4", "yld_nzv_ge30", "rfl_ave_gt0", "suitable") |>
   summarise(
     across(everything(),
            list(n = \(x) 5838 - sum(x)) # calculate the sum of zones eliminated
     )
   )

tab_sum_all <- bind_rows(tab_sum_ely, tab_sum_eln) |>
  mutate(id = 1:n(), .before = hxr_ave_gt25_n)

tab_sum_all |>
  gt()

id	hxr_ave_gt25_n	rcf_per_lt55_n	yld_ave_gt25_n	yld_nzv_ge4_n	yld_nzv_ge30_n	rfl_ave_gt0_n	suitable_n
1	2452	2435	1108	213	373	155	2567
2	3386	3403	4730	5625	5465	5683	3271

The second method is as follows:

rm(tab_sum_all)
tab_sum_all <- tibble(id = c(0, 1))

# order the variables from highest to lowest number of zones eliminated
var_lst = c("hxr_ave_gt25", "rcf_per_lt55", "yld_ave_gt25", "yld_nzv_ge4", "yld_nzv_ge30", "rfl_ave_gt0", "suitable")

for (var_tmp in var_lst) {
# cat("Running code for", var_tmp, "\n")
  var_nme = sym(var_tmp)
  var_new = sym(paste0(var_nme, "_n"))
  tab_sum <- cpe_sta_fin |>
    select(!!var_nme) |>
    group_by(!!var_nme) |>
    summarise(n = n()) |>
    rename(!!var_new := all_of("n")) |>
    rename(id = !!var_nme)
  tab_sum_all <- merge(tab_sum_all, tab_sum, by = "id")
}

tab_sum_all |>
  gt()

id	hxr_ave_gt25_n	rcf_per_lt55_n	yld_ave_gt25_n	yld_nzv_ge4_n	yld_nzv_ge30_n	rfl_ave_gt0_n	suitable_n
0	2452	2435	1108	213	373	155	2567
1	3386	3403	4730	5625	5465	5683	3271

A check was done to ensure the last column in the above tables is correct:

#check variable suitable_n
cpe_sta_fin |>
  select(suitable) |>
  group_by(suitable) |>
  count()

# A tibble: 2 × 2
# Groups:   suitable [2]
  suitable     n
     <dbl> <int>
1        0  2567
2        1  3271

It is important to that both methods produced the same outcome (i.e. to verify the results are correct). The results are presented in Table 1, which show the majority of zones were eliminated by applying the HI/HIo > 25% criterion (2452 in total), followed by RCF < 55% (2435 zones) and yield > 25 kg ha^-1 (1108 zones).

tab_sum_all |>
  # gt() formats the table 
  gt() |>
  # give a header with a title and subtitle
  tab_header(title = md("**Number of HRZs eliminated by each criterion**"),
             subtitle = html("(<b>Cowpea planted on 01 November at 80 000 plants ha<sup>-1</sup></b>)")) |>
  # format no. of decimals
  # https://gt.rstudio.com/reference/fmt_number.html
    fmt_number(decimals = 0, sep_mark = "") |>
# change 0 to N and 1 to Y
    sub_values(columns = id, values = 0, replacement = "Y") |>
    sub_values(columns = id, values = 1, replacement = "N") |>
    cols_label(
    id = md("**Eliminated?**<br>(Y/N)"),
    rfl_ave_gt0_n = md("**RFL~AVE~**<br> > 0 mm"),
    hxr_ave_gt25_n = md("**HI/HI~O~**<br> > 25%"),
# does not do sperscript in rendered HTML
#   yld_ave_gt25_n = md("**YLD~AVE~**<br> > 25 kg {{ha^-1}}"),
    yld_ave_gt25_n = html("<b>YLD<sub>AVE</sub></b><br> > 25 kg ha<sup>-1</sup>"),
    rcf_per_lt55_n = md("**RCF**<br> < 55%"),
    suitable_n = md("**Final<br>no. of<br>HRZs**")
  ) |>
# 1st grouping
  tab_spanner(
    label = md("**No. of yield values**"),
    columns = c(yld_nzv_ge4_n, yld_nzv_ge30_n)
  ) |>
  cols_label(
    yld_nzv_ge4_n = "≥ 4",
    yld_nzv_ge30_n = "≥ 30"
  )

Table 1: Number of HRZs eliminated by applying the selected criteria

Eliminated? (Y/N)	HI/HI_O > 25%	RCF < 55%	YLD_AVE > 25 kg ha^-1	No. of yield values		RFL_AVE > 0 mm	Final no. of HRZs
Number of HRZs eliminated by each criterion
(Cowpea planted on 01 November at 80 000 plants ha^-1)
Eliminated? (Y/N)	HI/HI_O > 25%	RCF < 55%	YLD_AVE > 25 kg ha^-1	≥ 4	≥ 30	RFL_AVE > 0 mm	Final no. of HRZs
Y	2452	2435	1108	213	373	155	2567
N	3386	3403	4730	5625	5465	5683	3271

Of the total 5838 HRZs, 3271 suitable zones were selected for further analysis. Hence, 2567 zones were eliminated since the yare deemed unsuitable for cowpea production.

# filter the dataset to select only suitable zones
cpe_se1 <- cpe_sta_fin |>
  filter(suitable == 1)

tot <- nrow(cpe_sta_fin)
se1 <- nrow(cpe_se1)
se0 = tot - se1

cat("The total number of HRZs is", tot, "\n")

The total number of HRZs is 5838

cat("Of these,", se1, "are considered suitable (not eliminated) for cowpea production\n")

Of these, 3271 are considered suitable (not eliminated) for cowpea production

cat("Hence,", se0, "zones are unsuitable for cowpea production and were therefore eliminated\n")

Hence, 2567 zones are unsuitable for cowpea production and were therefore eliminated

3a. Collinearity test

A collinearity test between CWP and Y was performed by creating a scatter plot with a trend line (linear regression) as shown in Figure 1. Since CWP is calculated from Y, the variables should be strongly correlated.

ggplot(data = cpe_se1,
       mapping = aes(x = cwp_ave,
                     y = yld_ave)) +
# create scatter plot
  geom_point(size = 0.5) +
  
# set x-axis label to 3 decimals
  scale_x_continuous(labels = label_number(accuracy = 0.001)) +

# label both axes and the plot title  
  labs(x = expression(paste("Crop water productivity (kg ", m^-3, ")")),
       y = expression(paste("Crop yield (kg ", ha^-1, ")")),
       title = "Crop yield vs crop water productivity") +

# centre the title label (hjust = 0 is left, 0.5 = center, 1 = right) 
  theme(plot.title = element_text(hjust = 0.5)) +  
  
# add a trend line
  geom_smooth(method = "lm",
              formula = y ~ x + 0,
              color = "blue",
              linetype = "dashed",
              size = 1.0,
              na.rm = FALSE,
              se = FALSE,
              show.legend = TRUE)

Figure 1: Relationship between yield and crop water productivity for cowpea planted on 01 November at 80 000 plants per hectare across 3271 suitable HRZs in South Africa

For the trend line, various statistics (e.g. R squared) were calculated. The slope value provides the average water use of cowpea in m³ ha^-1 (divide by 10 to convert to mm ha^-1).

# extract the two variables to assess for collinarity
reg_dat <- cpe_se1 |>
  select("x" = cwp_ave, "y" = yld_ave)
# fit a linear regression model to the data with NO intercept
reg_mod <- lm(y ~ x + 0, data = reg_dat)
reg_sta <- summary(reg_mod)
slp = reg_sta$coefficients["x", "Estimate"]
r2v = reg_sta$r.squared
r2a = reg_sta$adj.r.squared
# print the equation
cat("The regression equation is: Yield =", round(slp, 2), "* CWP\n")

The regression equation is: Yield = 3561.58 * CWP

cat("The R-squared value is:", round(r2v, 4), "\n")

The R-squared value is: 0.9587

cat("The adjusted R-squared value is:", round(r2a, 4), "\n")

The adjusted R-squared value is: 0.9587

If Pearson’s correlation coefficient (r) is above 0.5, the variables are highly correlated. Since the r value approaches 1, Y and CWP are highly correlated, and thus either Y or CWP (but not both) should be used for determining suitable crop production zones.

pcc = cor(reg_dat$x, reg_dat$y, method = 'pearson')
cat("The Pearson correlation coefficent is:", round(pcc, 4), "\n")

The Pearson correlation coefficent is: 0.939

# Spearman correlation coefficient
scc = cor(reg_dat$x, reg_dat$y, method = 'spearman')
cat("The Spearman correlation coefficent is:", round(scc, 4), "\n")

The Spearman correlation coefficent is: 0.9515

# Kendall correlation coefficient
kcc = cor(reg_dat$x, reg_dat$y, method = 'kendall')
cat("The Kendall correlation coefficent is:", round(kcc, 4), "\n")

The Kendall correlation coefficent is: 0.8068

3b. Suitable cowpea production zones

A histogram can be created for each variable (e.g. seasonally averaged yield) showing the full range of values. Figure 2 illustrates the seasonally averaged yield simulations (yld_ave), which represent a right-skewed distribution (i.e. less HRZs with high yields), despite the elimination of many unsuitable zones.

library(ggplot2)
ggplot(data = cpe_se1,
       mapping = aes(x = yld_ave)) +
 
  geom_histogram(col = "black",
                 fill = "white",
                 breaks = seq(0, 2600, 25)) +

# label both axes and the plot title
  labs(x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
       y = "Number of HRZs",
       title = "Range in simulated cowpea yields") +

# removes default grey background fill
  theme_minimal() +
    
# centre the title label (hjust = 0 is left, 0.5 = center, 1 = right)
  theme(plot.title = element_text(hjust = 0.5))

Figure 2: Histogram of yield variation across 3271 homogeneous response zones (HRZs) deemed suitable for cowpea production

As expected, there are no yields below 25 kg ha^-1 as these very low yields were excluded (see Section 2b). However, the largest number of zones have a simulated yield ranging from 50-75 kg per ha. Since cowpea is mainly cultivated by subsistence and small-scale farmers (i.e. low input farming), yields are typically low (25-300 kg ha^-1), as simulated by the AquaCrop model.

val <- nrow(cpe_se1 |>
            filter(yld_ave > 50 & yld_ave <= 75))
cat("The number of HRZs with an average yield ranging from 50-75 kg per ha is", val, "\n")

The number of HRZs with an average yield ranging from 50-75 kg per ha is 111

val <- nrow(cpe_se1 |>
            filter(yld_ave > 25 & yld_ave <= 300))
cat("The number of HRZs with an average yield ranging from 25-300 kg per ha is", val, "\n")

The number of HRZs with an average yield ranging from 25-300 kg per ha is 724

val <- nrow(cpe_se1 |>
            filter(yld_ave > 300))
cat("The number of HRZs with an average yield exceeding 300 kg per ha is", val, "\n")

The number of HRZs with an average yield exceeding 300 kg per ha is 2547

Seasonally averaged yields simulated by AquaCrop ranged from 28.6 to 2589 kg ha^-1, with a mean and median of 725 and 628 kg ha^-1, respectively. The highest averaged yield of 2589 kg ha^-1 was simulated for HRZ number 4573.

# useful statistics for the yield variable
cpe_se1 |>
  select(yld_ave) |>            # select only one variable of interest
  filter(!is.na(yld_ave)) |>    # exclude any NA values
  arrange(yld_ave) |>           # arrange in ascending order
  summarise(                    # produce sumamry statistics
    Suitable = n(),             # returns 3271 zones
    Mean = mean(yld_ave),       # returns  725 kg per ha
    Median = median(yld_ave),   # returns  628 kg per ha
    SD = sd(yld_ave),           # returns  504 kg per ha 
    CV = 100 * SD / Mean,       # returns 69.5 %  
    Min = min(yld_ave),         # returns 28.6 kg per ha
    Max = max(yld_ave)          # returns 2589 kg per ha
  ) |>
  gt() |>
    tab_header(title = md("**Yield statistics across suitable HRZs**"),
               subtitle = html("(<b>in kg ha<sup>-1</sup> except for CV in %</b>)")) |>
    fmt_number(columns = Suitable, decimals = 0, sep_mark = "") |>
    fmt_number(columns = Mean:Max, decimals = 1, sep_mark = "")

Suitable	Mean	Median	SD	CV	Min	Max
Yield statistics across suitable HRZs
(in kg ha^-1 except for CV in %)
3271	724.8	628.0	503.6	69.5	28.6	2589.0

cpe_se1 |>
  select(subc, yld_ave) |>   # select variables of interest
  filter(!is.na(yld_ave)) |> # exclude any NA values
  arrange(desc(yld_ave)) |>  # arrange in descending order
  head( n = 1)               # returns subc = 4573

# A tibble: 1 × 2
   subc yld_ave
  <dbl>   <dbl>
1  4573    2589

The suitability score was used to create a density plot (Figure 3), which shows the largest distribution of lower yield values classified as marginally suitable versus the smallest distribution of higher yield values classified as highly suitable.

fil_col = c("High" = "green",
            "Moderate" = "yellow",
            "Marginal" = "orange")
lne_col = c("High" = "grey60",
            "Moderate" = "grey60",
            "Marginal" = "grey60")

# Move Marginally suitable (S3) to bottom of legend using factors and levels
leg_ord = c("Marginal", "Moderate", "High")

ggplot(data = cpe_se1 |>
       mutate(suitability = factor(suitability, levels = leg_ord)),
       mapping = aes(x = yld_ave,
                     fill = suitability,
                     color = suitability)) +
  geom_density(alpha = 0.50,
               adjust = 0.7) +
 
  scale_x_continuous(breaks = seq(0, 2600, by = 200)) +
 
  scale_fill_manual(name ="Suitability",
                    values = fil_col,
                    labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
                    ) +

  # legend MUST have the same name
  scale_color_manual(name = "Suitability",
                     values = lne_col,
                     labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
                     ) +
 
  labs(title = "Density plot of simulated cowpea yields",
       x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
       y = "Density") +

  # removes default grey background fill
  theme_minimal() +
    
  # centre the title label (hjust = 0 is left, 0.5 = center, 1 = right)
  theme(plot.title = element_text(hjust = 0.5))

Figure 3: Density plot of simulated cowpea yields across suitable growing areas in South Africa

Another way to view the distributions by suitability is using a ridgeline plot (also called a joy plot), which is useful for visualising the distribution of a continuous variable (yield) across different categories (suitability). In Figure 4, the vertical lines show the median (50^th percentile) and the interquartile range (25^th to 75^th percentile).

library(ggridges)
fil_col = c("High" = "green",
            "Moderate" = "yellow",
            "Marginal" = "orange")
lne_col = c("High" = "green",
            "Moderate" = "orange",
            "Marginal" = "brown")
leg_ord = c("Marginal", "Moderate", "High")

ggplot(data = cpe_se1 |>
       mutate(suitability = factor(suitability, levels = leg_ord)),
       mapping = aes(x = yld_ave,
                     y = suitability,
                     fill = suitability,
                     color = suitability)) +
  stat_density_ridges(alpha = 0.1,
                      scale = 1,
                      rel_min_height = 0.01,
                      quantile_lines = TRUE,
                      quantiles = c(0.25, 0.50, 0.75)) +

  scale_x_continuous(breaks = seq(0, 2600, by = 200)) +

  scale_fill_manual(name ="Suitability",
                    values = fil_col,
                    labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
                    ) +
  scale_color_manual(name = "Suitability",
                     values = lne_col,
                     labels = c("Marginal (S3)", "Moderate (S2)", "High (S1)"),
                     ) +

  labs(title = "Ridgeline plot of simulated cowpea yields",
       x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
       y = "Density") +

  # removes default grey background fill
  theme_minimal() +
    
  # centre the title label (hjust = 0 is left, 0.5 = center, 1 = right) 
  theme(plot.title = element_text(hjust = 0.5))

Figure 4: Ridgeline plot of simulated cowpea yields across suitable growing areas in South Africa

Figure 5 again shows how yield increases from marginally (S3) to highly suitable (S1) zones. The figure also highlights the problem of using discrete thresholds (25, 50 and 75%) to classify a continuous variable (HI/HIo), especially if the thresholds are subjective.

# box and whisker plot
fil_col = c("High" = "green",
            "Moderate" = "yellow",
            "Marginal" = "orange")
leg_ord = c("High", "Moderate", "Marginal")
leg_lab = c("High (S1)", "Moderate (S2)", "Marginal (S3)")

ggplot(data = cpe_se1 |>
       mutate(suitability = factor(suitability, levels = leg_ord)),
       mapping = aes(x = yld_ave,
                     y = suitability,
                     fill = suitability)
       ) +

  geom_boxplot(alpha = 0.5, outlier.shape = NA) +

  scale_x_continuous(breaks = seq(0, 2600, by = 200)) +  

  scale_y_discrete(labels = leg_lab) +
  
  scale_fill_manual(values = fil_col) +

  labs(title = "Box plot of simulated cowpea yields",
       x = expression(paste("Seasonally averaged yield (kg ", ha^-1, ")")),
       y = "Suitability") +

  # removes default grey background fill
  theme_minimal() +

  # no legend
  theme(legend.position="none") +

  # centre the title label (hjust = 0 is left, 0.5 = center, 1 = right) 
  theme(plot.title = element_text(hjust = 0.5))

Figure 5: Box plot of simulated cowpea yields across suitable growing areas in South Africa

From the table of statistics for the seasonally averaged yield grouped by suitability (Table 2), there are more zones classified as S3 (1553 zones), compared to S2 (1320) and S1 (398). The spatially averaged yield is 357, 880 and 1646 kg ha^-1 for S3, S2 and S1 zones, respectively. The median yield is 329 (S3), 845 (S2) and 1597 kg ha^-1 (S1), as shown in the box plot (Figure 5).

tab_ord = c("Marginal", "Moderate", "High")
var_nme = "yld_ave"
tab_sum <- cpe_se1 |>
# order table as "Marginal", "Moderate", "High" (not alphabetical)
  mutate(suitability = factor(suitability, levels = tab_ord)) |>
  group_by(suitability) |>
  filter(!is.na(var_nme)) |>
  select(suitability, var_nme) |>
  arrange(var_nme) |>  
  summarise(
    cnt = n(),                                    # number of values  
    dis = n_distinct(!!sym(var_nme)),             # unique number of values
    ave = mean(!!sym(var_nme)),                   # mean of values
    p25 = quantile(!!sym(var_nme), probs = 0.25), # 25th percentile
    p50 = quantile(!!sym(var_nme), probs = 0.50), # 50th percentile (median)
    p75 = quantile(!!sym(var_nme), probs = 0.75), # 75th percentile
    min = min(!!sym(var_nme)),                    # smallest value
    max = max(!!sym(var_nme)),                    # largest value
    sdv = sd(!!sym(var_nme)),                     # standard deviation
    cov = 100 * sdv / ave                         # coefficient of variation (%)
  )

# see gt.rstudio.com
tab_sum |>
  # gt() formats the table 
  gt() |>
  # give a header with a title and subtitle
  tab_header(title = md("**Cowpea yield statistics**"),
            subtitle = html("Planted on 01 November at 80 000 plants ha<sup>-1</sup>")
  ) |>
  # format no. of decimals
  # https://gt.rstudio.com/reference/fmt_number.html
  fmt_number(columns = cnt:dis, decimals = 0, sep_mark = "") |>
  fmt_number(columns = ave:cov, decimals = 1, sep_mark = "") |>
  cols_label(
    suitability = md("**Suitability<br>class**"),
    cnt = md("**No. of<br>HRZs**"),
    dis = md("**Distinct<br>values**"),
    ave = html("<b>Average</b><br>(kg ha<sup>-1</sup>)"),
  ) |>
  # 1st grouping
  tab_spanner(
    label = md("**Percentiles**"),
    columns = c(p25, p50, p75)
  ) |>
  cols_label(
    p25 = "{{25^th}}",
    p50 = "{{50^th}}",      
    p75 = "{{75^th}}",      
  ) |>
  # 2nd grouping
  tab_spanner(
    label = html("<b>Range (kg ha<sup>-1</sup>)</br>"),
    columns = c(min, max)
  ) |>
  cols_label(
    min = "Min",
    max = "Max",      
  ) |>
  # 3rd grouping
  tab_spanner(
    label = md("**Spread**"),
    columns = c(sdv, cov)
  ) |>
  cols_label(
#   sdv = md("SD (kg {{ha^-1}})"), # does not do superscipt in rendered HTML
    sdv = html("SD<br>(kg ha<sup>-1</sup>)"),
    cov = md("CV<br>(%)"),      
  )

Table 2: Useful statistics describing the range in cowpea yield (kg/ha) across three suitability classes

Suitability class	No. of HRZs	Distinct values	Average (kg ha^-1)	Percentiles			Range (kg ha^-1)		Spread
Cowpea yield statistics
Planted on 01 November at 80 000 plants ha^-1
Suitability class	No. of HRZs	Distinct values	Average (kg ha^-1)	25^th	50^th	75^th	Min	Max	SD (kg ha^-1)	CV (%)
Marginal	1553	1527	356.8	157.2	328.5	490.8	28.6	1200.7	241.2	67.6
Moderate	1320	1302	880.1	655.9	844.9	1089.8	52.3	1874.4	298.0	33.9
High	398	397	1645.9	1395.1	1597.3	1883.1	1052.5	2589.0	325.5	19.8

4. Conclusions

Only 3271 of the 5838 HRZs across South Africa are considered suitable for rainfed production of cowpea.
Of these, 398 zones are classified as highly suitable, where simulated yields averaged 1646 kg ha^-1 (range of 1052-2589 kg ha^-1).
Similarly, 1320 zones are classified as moderately suitable with an average yield of 880 kg ha^-1.
The remaining 1553 zones are deemed marginally suitable, with an average yield of 357 kg ha^-1.
The analysis highlights the importance of eliminating “noise” (especially outliers) when working with large datasets.
It also highlights the problem of using discrete thresholds to classify continuous variables.

5. References

FAO. (1976). A framework for land evaluation. FAO. https://www.fao.org/3/x5310e/x5310e00.htm

Gerrano, A. S., Lubinga, M. H., & Bairu, M. W. (2022). Genetic resources management, seed production constraints and trade performance of orphan crops in southern africa: A case of cowpea. South African Journal of Botany, 146, 340–347. https://doi.org/10.1016/j.sajb.2021.11.007

Hsiao, T. C., Lee, H., Steduto, P., Basilio, R. L., Raes, D., & Fereres, E. (2009). AquaCrop – the FAO crop model to simulate yield response to water: III. Parameterization and testing for maize. Agronomy Journal, 101(3), 448–459. https://doi.org/10.2134/agronj2008.0218s

Karl, K., MacCarthy, D., Porciello, J., Chimwaza, G., Fredenberg, E., Freduah, B. S., Guarin, J., Mendez Leal, E., Kozlowski, N., Narh, S., Sheikh, H., Valdivia, R., Wesley, G., Deynze, A. van, Zonneveld, M. van, & Yang, M. (2024). Opportunity crop profiles for the vision for adapted crops and soils (VACS) in africa (p. 78). Center for Climate Systems Research, Columbia University. https://academiccommons.columbia.edu/doi/10.7916/7msa-yy32

Lake, S. (2024). A novel approach to mapping areas suitable for rainfed production of bambara nut and cowpea [Master’s thesis]. Centre for Water Resources Research, School of Agricultural, Earth; Environmental Sciences, University of KwaZulu-Natal.

Raes, D., Steduto, P., Hsiao, T. C., & Fereres, E. (2009). AquaCrop – the FAO crop model to simulate yield response to water: II. Main algorithms and software description. Agronomy Journal, 101(3), 438–447. https://doi.org/10.2134/agronj2008.0140s

Steduto, P., Hsiao, T. C., Fereres, E., & Raes, D. (2012). Crop yield response to water (p. 500). Food; Agriculture Organisation (FAO) of the United Nations. http://www.fao.org/docrep/016/i2800e/i2800e00.htm

Steduto, P., Hsiao, T. C., Raes, D., & Fereres, E. (2009). AquaCrop – the FAO crop model to simulate yield response to water: I. Concepts and underlying principles. Agronomy Journal, 101(3), 426–437. https://doi.org/10.2134/agronj2008.0139s

Reuse

CC-BY-4.0

Citation

BibTeX citation:

@online{kunz2025,
  author = {Kunz, Richard},
  title = {Land Suitability Analysis for Cowpea Production in {South}
    {Africa} Using Output Simulated by the {AquaCrop} Model},
  date = {2025-12-09},
  url = {https://ds4owd-002.github.io/project-kunzrp/},
  langid = {en},
  abstract = {The AquaCrop model, developed by the Food and Agriculture
    Organisation, was run for cowpea planted on the 01\^{}st\^{} of
    November at 80000 plants per hectare across 5838 homogeneous
    response zones (HRZs) in South Africa. The model generated up to 49
    consecutive seasons of data from 1950/51 to 1998/99 for each HRZ,
    from which long-term statistics were calculated for important
    AquaCrop output variables (part 1). The statistics were then
    analysed to determine which HRZs are deemed suitable for rainfed
    cultivation of cowpea (part 2). The crop can be grown in 3271 of the
    5838 HRZs.}
}

For attribution, please cite this work as:

Kunz, R. (2025, December 9). Land suitability analysis for cowpea production in South Africa using output simulated by the AquaCrop model. Capstone Project. https://ds4owd-002.github.io/project-kunzrp/