Collecting the CWGC Data - Part 3

Introduction

This is the third post in a series that explores using R to collect the Commonwealth War Graves Commission (CWGC) dataset.

It’s a good time to note that data scraping is an inherently trial-and-error process, and the best way forward is often unclear. During the course of this project I went down a number of dead ends, so you will invariably need to approach your particular problem from several angles before you find one that works or scales appropriately.

Partitioning the Problem

By the end of the last post, I had produced a breakdown of the number of war dead commemorated in each country, which I load here:

# Load data from previous post
load(file = "path/to/file/Summary_data_countries_commemorated.rda")

I previously discussed how the website’s spreadsheet download option permits a maximum of 80,000 rows, so results for any country with fewer graves than this can be downloaded without further issues. Sorting the data and taking the top 10 values:

# Order data by number of records (descending)
cwgc_summary <- cwgc_summary[order(-cwgc_summary$sum),]

# Show the top 10 values
knitr::kable(head(cwgc_summary, 10), row.names = FALSE, caption = "Number of Records by Country and War")
Table 1: Number of Records by Country and War

commemorated                     war-1    war-2      sum
-----------------------------  -------  -------  -------
France                          538841    46260   585101
United Kingdom                  134962   183130   318092
Belgium                         197351    10397   207748
Civilian War Dead                    0    68887    68887
India                            18788    43342    62130
Iraq                             51427     3002    54429
Italy                             3809    45555    49364
Egypt                            14908    28568    43476
Myanmar                            112    38388    38500
Turkey (including Gallipoli)     36800       30    36830

Only France, the United Kingdom and Belgium exceeded the 80,000-record threshold, and it was simple to develop a loop to download the graves data for the countries below it. The website does not support a search across multiple countries, so the loop downloaded one spreadsheet per country. I also had to remove the ‘Civilian War Dead’ category, since it could not be downloaded alongside the countries using the same method.

# Filter data frame for countries with fewer than 80,000 records
cwgc_set_1 <- cwgc_summary[cwgc_summary$sum < 80000,]

# Discard countries with zero graves
cwgc_set_1 <- cwgc_set_1[cwgc_set_1$sum != 0,]

# Get vector of countries 
cwgc_set_1 <- as.character(cwgc_set_1$commemorated)

# Remove 'Civilian War Dead' category - it is not a country and is handled separately
cwgc_set_1 <- cwgc_set_1[cwgc_set_1 != "Civilian War Dead"]

I denoted Set 1 as the list of 141 countries whose results each fit into a single spreadsheet download, and Set 2 as the three countries with more than 80,000 records. Data for countries in Set 2 had to be split up in some manner before downloading. The two sets required different approaches and are dealt with separately in the next sections.

# Filter data frame for countries with an individual sum > 80,000
cwgc_set_2 <- cwgc_summary[cwgc_summary$sum > 80000,]

# Get vector of countries 
cwgc_set_2 <- as.character(cwgc_set_2$commemorated)

Show Set 1

cwgc_set_1
##   [1] "India"                                
##   [2] "Iraq"                                 
##   [3] "Italy"                                
##   [4] "Egypt"                                
##   [5] "Myanmar"                              
##   [6] "Turkey (including Gallipoli)"         
##   [7] "Germany"                              
##   [8] "Singapore"                            
##   [9] "Netherlands"                          
##  [10] "Canada"                               
##  [11] "Greece"                               
##  [12] "Israel and Palestine (including Gaza)"
##  [13] "Australia"                            
##  [14] "South Africa"                         
##  [15] "Tunisia"                              
##  [16] "Thailand"                             
##  [17] "Papua New Guinea"                     
##  [18] "Kenya"                                
##  [19] "Libya"                                
##  [20] "Malta"                                
##  [21] "Tanzania"                             
##  [22] "Nigeria"                              
##  [23] "Malaysia"                             
##  [24] "Iran"                                 
##  [25] "New Zealand"                          
##  [26] "Indonesia"                            
##  [27] "Algeria"                              
##  [28] "Sri Lanka"                            
##  [29] "Sierra Leone"                         
##  [30] "Ghana"                                
##  [31] "Japan"                                
##  [32] "Lebanese Republic"                    
##  [33] "Sudan"                                
##  [34] "Pakistan"                             
##  [35] "Bangladesh"                           
##  [36] "Syria"                                
##  [37] "Poland"                               
##  [38] "United States of America"             
##  [39] "Denmark"                              
##  [40] "Lesotho"                              
##  [41] "Eritrea"                              
##  [42] "Norway"                               
##  [43] "Gibraltar"                            
##  [44] "New Caledonia"                        
##  [45] "Zimbabwe"                             
##  [46] "Namibia"                              
##  [47] "Russian Federation"                   
##  [48] "Somalia (including Somaliland)"       
##  [49] "Austria"                              
##  [50] "Serbia"                               
##  [51] "Uganda"                               
##  [52] "Ethiopia"                             
##  [53] "Yemen"                                
##  [54] "Cyprus"                               
##  [55] "Seychelles"                           
##  [56] "Malawi"                               
##  [57] "Madagascar"                           
##  [58] "Zambia"                               
##  [59] "Mozambique"                           
##  [60] "Gambia"                               
##  [61] "Jamaica"                              
##  [62] "Bulgaria"                             
##  [63] "Iceland"                              
##  [64] "Czech Republic"                       
##  [65] "Hungary"                              
##  [66] "Romania"                              
##  [67] "Botswana"                             
##  [68] "Sweden"                               
##  [69] "Trinidad and Tobago"                  
##  [70] "Bermuda"                              
##  [71] "Switzerland"                          
##  [72] "Spain"                                
##  [73] "Fiji"                                 
##  [74] "Mauritius"                            
##  [75] "Eswatini"                             
##  [76] "Maldives"                             
##  [77] "Georgia"                              
##  [78] "Morocco"                              
##  [79] "Bahamas"                              
##  [80] "Faroe Islands"                        
##  [81] "Azores"                               
##  [82] "Portugal"                             
##  [83] "Belize"                               
##  [84] "Albania"                              
##  [85] "Azerbaijan"                           
##  [86] "Cameroon"                             
##  [87] "St. Lucia"                            
##  [88] "Falkland Islands"                     
##  [89] "Latvia"                               
##  [90] "Congo (Democratic Republic)"          
##  [91] "Congo"                                
##  [92] "Luxembourg"                           
##  [93] "Barbados"                             
##  [94] "Brazil"                               
##  [95] "Guyana"                               
##  [96] "Senegal"                              
##  [97] "St. Helena and Ascension Island"      
##  [98] "St. Vincent"                          
##  [99] "Argentina"                            
## [100] "Estonia"                              
## [101] "Panama"                               
## [102] "Djibouti"                             
## [103] "Liberia"                              
## [104] "Netherlands Antilles"                 
## [105] "Samoa"                                
## [106] "Equatorial Guinea"                    
## [107] "British Indian Ocean Territory"       
## [108] "Cape Verde"                           
## [109] "Uruguay"                              
## [110] "Madeira"                              
## [111] "Peru"                                 
## [112] "Chile"                                
## [113] "Mauritania"                           
## [114] "Norfolk Island"                       
## [115] "Solomon Islands"                      
## [116] "Chad"                                 
## [117] "Grenada"                              
## [118] "Society Islands"                      
## [119] "Cuba"                                 
## [120] "Dominica"                             
## [121] "Mali"                                 
## [122] "Puerto Rico"                          
## [123] "Tonga"                                
## [124] "Antigua"                              
## [125] "Croatia"                              
## [126] "Guinea"                               
## [127] "Lithuania"                            
## [128] "Monaco"                               
## [129] "Oman"                                 
## [130] "Philippines"                          
## [131] "San Marino"                           
## [132] "St. Christopher and Nevis"            
## [133] "Ukraine"                              
## [134] "Vanuatu"                              
## [135] "Costa Rica"                           
## [136] "Guatemala"                            
## [137] "Honduras"                             
## [138] "Martinique"                           
## [139] "Nepal"                                
## [140] "Saudi Arabia"                         
## [141] "Togo"


Show Set 2

cwgc_set_2
## [1] "France"         "United Kingdom" "Belgium"

Downloading Set 1

Using a process similar to the one I covered previously, I downloaded the spreadsheets by looping through all the countries in Set 1. First I defined the URL, set up an HTML session and retrieved the search form fields. I started out intending to use the html_form() function, but I quickly ran into issues trying to set the parameters.

# URL to be scraped
war_dead_url <- "https://www.cwgc.org/find/find-war-dead"

# Set up HTML session for scraping
pgsession <- html_session(war_dead_url)

# Parse form fields from URL
page_form <- html_form(pgsession)[[3]]

When I submitted the form for a single country, I wasn’t getting the expected response back from the website. Because I wanted the results from both wars (WWI and WWII), I simply omitted the ‘War’ variable, on the (incorrect) assumption that this would include results from both wars. However, the filled form seems to have defaulted to searching for something else altogether, and my failure to set the form values correctly produced the unexpected output.

# Set form fields with country to be queried
# Not working as expected
filled_form <- set_values(page_form,
                          "countryCommemoratedAt" = 'India')

Inspecting the form’s HTML elements in Chrome showed that the two check-boxes used to set the war had the same ID, and hunting around on the Internet revealed that the rvest package cannot handle multiple form fields with the same name. I struggled with the submit_form() function and my efforts to set the check-box form fields correctly were unsuccessful. This appears to be a known issue with rvest, but there seems to be little appetite to resolve it: a git branch with a fix has been waiting to be merged for over three years. I should have used Selenium! [Edit: since I originally worked on this project, the CWGC website has been updated and the elements now have different IDs.]
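
For what it’s worth, the Selenium route would look something like the sketch below, using the RSelenium package. I never ran this against the old site, so the CSS selectors are hypothetical placeholders - you would need to inspect the live page for the real ones.

# A sketch only - the selectors below are assumptions, not taken from the CWGC site
library(RSelenium)

# Start a browser session and grab the client
driver <- rsDriver(browser = "firefox", port = 4445L)
remote <- driver$client

# Navigate to the search page
remote$navigate("https://www.cwgc.org/find/find-war-dead")

# Selenium addresses elements directly on the page, so duplicate IDs are not
# the obstacle they are for rvest - tick both 'War' check-boxes in turn
remote$findElement(using = "css selector", "input[name='war'][value='1']")$clickElement()
remote$findElement(using = "css selector", "input[name='war'][value='2']")$clickElement()

# Submit the search form
remote$findElement(using = "css selector", "button[type='submit']")$clickElement()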

All was not lost, however: a simple alternative is to generate the request URLs manually, based on the string pattern produced by a manual search. Frankly, this approach would have saved me a whole lot of heartache and time. A quick search on the website showed the URL pattern to emulate:

# URL pattern to reproduce
# https://www.cwgc.org/find/find-war-dead/results?country=Germany

# Construct URL, using Germany as an example country
country <- "Germany"
baseurl <- "https://www.cwgc.org/find/find-war-dead/results?country="
mainurl <- paste0(baseurl, country)
mainurl <- gsub(" ", "%2B", mainurl) # Encode any spaces in the country name

The next step was to retrieve the hyperlink for downloading the spreadsheet. I ran a manual search on the website, setting the Country drop-down to a random country and leaving the remaining search fields blank. Once this had run, I used Chrome DevTools to copy the XPath from the ‘Download Results’ button and retrieved the corresponding href attribute value.

Figure 1. Form search fields and names

# Read HTML contents
dat <- read_html(mainurl)

# Get contents of HTML element
dat_html <- html_nodes(dat,
                       xpath = "/html/body/main/section[2]/div/div[1]/div/div/div/div/a")

# Get attribute
dat_attr <- html_attr(dat_html, "href")

The attribute is a relative hyperlink, so I had to stitch on the preceding part of the URL:

# Concatenate URL and relative URL
download_address <- paste0("https://www.cwgc.org", dat_attr)

Finally, I downloaded the file using the download.file() function, which (unsurprisingly) downloads a file from a URL. The file is saved to the current working directory - if in any doubt, use getwd() to check where you are.

# Download file from URL - the curl method removes blank rows in the output
download.file(download_address, destfile = "India.csv", method = "curl")

At this point I had figured out the mechanics of downloading a single country’s casualties and I simply wrapped the preceding code snippets in a loop to step through the countries in Set 1. I also created a sub-directory called ‘Set1’ for the results.

# Set path for downloaded files
path <- "./Set1/"

# URL to be scraped
war_dead_url <- "https://www.cwgc.org/find/find-war-dead"

# Set up HTML session for scraping
pgsession <- html_session(war_dead_url)

# Loop to step through countries 
for (country in cwgc_set_1){
  
  # Console output
  cat("Working on", country, "\n")
  
  # Construct URL
  baseurl <- "https://www.cwgc.org/find/find-war-dead/results?country="
  mainurl <- paste(baseurl, country, sep = "")
  mainurl <- gsub(" ", "%2B", mainurl)
  
  # Read HTML contents
  dat <- read_html(mainurl)
  
  # Get contents of HTML element with XPath option
  dat_html <- html_nodes(dat, 
                         xpath = "/html/body/main/section[2]/div/div[1]/div/div/div/div/div[2]/a")
  
  # Get href attribute
  dat_attr <- html_attr(dat_html, "href")
  
  # Concatenate URL and relative URL
  download_address <- paste0("https://www.cwgc.org", dat_attr)
  
  # Set up file name for each country
  file_name <- paste0(path, country, ".csv")
  
  # Console output
  cat("Downloading", country, "\n")
  
  # Download file from URL
  # The method parameter depends on your OS, so pick accordingly
  download.file(download_address, destfile = file_name, method = "curl")
  
  # Wait between 100 and 200 seconds before next page request
  Sys.sleep(sample(seq(100, 200, 1), 1))
}

Once this was done, I downloaded the ‘Civilian War Dead’ category manually, since coding it was more effort than it was worth.

Merging Set 1

The final task was to read in all the *.csv files and save them into a *.rda object.

# Set path for downloaded files
path <- "./Set1/"

# List of *.csv files 
csv_files <- list.files(path, pattern = "\\.csv$")

# Initialise object for output
out_file <- c()

# Loop through each file
for (my_file in csv_files){
  
  # Set up file name
  file_name <- paste0(path, my_file)
  
  # Read in file
  dat <- read.csv(file_name, header = TRUE, sep = ",")
  
  # Console output for progress indicator
  cat(my_file, " \n")
  
  # Bind results
  out_file <- rbind(out_file, dat)
}

# Save output to file
save(out_file, file = "./path/to/file/Set_1_cwgc_graves.rda")

A quick check showed that the number of rows in the output data frame matched the expected number of records for all countries in Set 1.
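
For the record, the check looked something like this - a sketch, assuming cwgc_summary, cwgc_set_1 and out_file are still in the workspace, and that the manually downloaded ‘Civilian War Dead’ file was saved into ./Set1/ alongside the rest.

# Expected record count for Set 1, including the manual download
set_1_names <- c(cwgc_set_1, "Civilian War Dead")
expected <- sum(cwgc_summary$sum[cwgc_summary$commemorated %in% set_1_names])

# TRUE if every record was captured
nrow(out_file) == expected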

Downloading Set 2

Set 2 consists of France, the United Kingdom and Belgium, and the challenge was to search the website in a way that produced results with no more than 80,000 records each. First, a recap of the record counts in Set 2:

# Graves per country in Set 2
knitr::kable(cwgc_summary[cwgc_summary$commemorated %in% cwgc_set_2,],
             row.names = FALSE, caption = "Records in Set 2 Countries")
Table 2: Records in Set 2 Countries

commemorated      war-1    war-2      sum
--------------  -------  -------  -------
France           538841    46260   585101
United Kingdom   134962   183130   318092
Belgium          197351    10397   207748

The easiest way to proceed was to split these countries’ war dead into chunks using the war and date ranges. I spent a little time searching manually to figure out the date splits, since I didn’t want to do it programmatically given the potential load on the website. There were four chunks of data that needed to be split: France WWI, Belgium WWI, UK WWI and UK WWII. I summarised the date splits in a spreadsheet and imported the dates.

# Read in date splits
dates <- read.csv(file = ".././data/Dates.csv",
                  sep = ",",
                  header = FALSE,
                  stringsAsFactors = FALSE)

# Add column names
colnames(dates) <- c("Country", "Start", "End")

# Show table
knitr::kable(dates, row.names = FALSE, caption = "Set 2 Date Splits")
Table 3: Set 2 Date Splits

Country          Start        End
---------------  -----------  -----------
France           01/07/1914   31/12/1914
France           01/01/1915   30/06/1915
France           01/07/1915   31/12/1915
France           01/01/1916   30/06/1916
France           01/07/1916   31/08/1916
France           01/09/1916   31/12/1916
France           01/01/1917   30/04/1917
France           01/05/1917   31/08/1917
France           01/09/1917   31/12/1917
France           01/01/1918   30/06/1918
France           01/07/1918   30/09/1918
France           01/10/1918   31/12/1918
France           01/01/1919   31/12/1919
France           01/07/1939   31/12/1947
United Kingdom   01/07/1914   31/12/1917
United Kingdom   01/01/1918   31/12/1919
Belgium          01/07/1914   30/06/1916
Belgium          01/07/1916   30/09/1917
Belgium          01/10/1917   31/12/1919
Belgium          01/07/1939   31/12/1947
United Kingdom   01/07/1939   31/12/1941
United Kingdom   01/01/1942   31/12/1943
United Kingdom   01/01/1944   31/12/1947

I wanted to use the submit_form() method with the website form but could not get it to work, so once again I generated the request URLs manually. A test search using the Date of Death option showed the structure of the URL to replicate. The loop draws on code discussed previously, so it should be familiar enough. The files were downloaded into a directory called ‘Set2’.

# URL structure to replicate
#.../find-war-dead/results?country=Belgium&dateFrom=01-01-1915&dateTo=01-01-1916

# URL to be scraped
war_dead_url <- "https://www.cwgc.org/find/find-war-dead"

# Set up HTML session for scraping
pgsession <- html_session(war_dead_url)

# Read contents of webpage into XML document
webpage <- read_html(pgsession)

# Loop through dates and country
for(i in 1:nrow(dates)){
  
  # Get country
  country <- dates$Country[i]
  
  # Replace spaces with '+' in country name
  country <- gsub(" ", "+", country)
  
  # Get start date
  start_date <- dates$Start[i]
  # Change '/' to '-'
  start_date <- gsub("/", "-", start_date)

  # Get end date
  end_date <- dates$End[i]
  # Change '/' to '-'
  end_date <- gsub("/", "-", end_date)

  # Set path
  path <- "./Set2/"
  
  # Set up file name
  file_name <- paste0(path, country, start_date, end_date, ".csv")
  
  # Construct URL
  part_1 <- "https://www.cwgc.org/find/find-war-dead/results?"
  part_2 <- paste0("country=", country)
  part_3 <- paste0("&dateFrom=", start_date)
  part_4 <- paste0("&dateTo=", end_date)
  request_URL <- paste0(part_1, part_2, part_3, part_4)
  
  # Get contents of webpage
  webpage <- read_html(request_URL)
  
  # Get 'Download Results' element
  dat_html <- html_nodes(webpage, 
                         xpath = "/html/body/main/section[2]/div/div[1]/div/div/div/div/a")
  
  # Get href attribute
  dat_attr <- html_attr(dat_html, "href")
  
  # Concatenate URL and relative URL
  download_address <- paste0("https://www.cwgc.org", dat_attr)
  
  # Download file from URL
  download.file(download_address, file_name)
  
  # Wait between 30 and 60 seconds before next page request
  Sys.sleep(sample(seq(30, 60, 1), 1))
}

Merging Set 2

As before, we can read in all the *.csv files, merge them and save the result to a *.rda object. (This duplicates the Set 1 merge code; see the helper-function sketch after the code block.)

# Set path for downloaded files
path <- "./Set2/"

# List of *.csv files 
csv_files <- list.files(path, pattern = "\\.csv$")

# Initialise object for output
out_file <- c()

# Loop through each file
for (my_file in csv_files){
  
  # Set up file name
  file_name <- paste0(path, my_file)
  
  # Read in file
  dat <- read.csv(file_name, header = TRUE, sep = ",")
  
  # Console output for progress indicator
  cat(my_file, " \n")
  
  # Bind results
  out_file <- rbind(out_file, dat)
}

# Save output to file
save(out_file, file = "./path/to/file/Set_2_cwgc_graves.rda")
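
As an aside, this merge step is identical to the one for Set 1 and could be factored into a small helper function. Here is a sketch - merge_csv_dir is a hypothetical name, and the behaviour matches the loops above minus the progress output.

# Hypothetical helper: read and row-bind every *.csv file in a directory
merge_csv_dir <- function(path) {
  csv_files <- list.files(path, pattern = "\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(csv_files, read.csv, header = TRUE))
}

# Equivalent to the loop above
out_file <- merge_csv_dir("./Set2/")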

I checked whether the number of rows in the output data frame equalled the expected number of records for the countries in Set 2. The numbers did not match precisely: the totals were out by ~280. When I dug into the cause, I found that some additional records had been added to the CWGC database during my analysis, so my summary data no longer matched the downloaded data exactly. In addition, there are graves of German and Russian soldiers with no recorded date of death, which the date-of-death searches do not pick up. These factors more or less accounted for the difference, so I was satisfied that the numbers reconciled.
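
The reconciliation itself amounted to something like this (a sketch, assuming cwgc_summary, cwgc_set_2 and out_file are still in memory):

# Expected record count for Set 2 according to the summary data
expected <- sum(cwgc_summary$sum[cwgc_summary$commemorated %in% cwgc_set_2])

# Difference between expected and downloaded records - roughly 280 in my case
expected - nrow(out_file)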

Combining the Results

The last step was to combine the data from Set 1 and Set 2.

# Load data from file
load(file = "./path/to/file/Set_1_cwgc_graves.rda")
set1 <- out_file

# Load data from file
load(file = "./path/to/file/Set_2_cwgc_graves.rda")
set2 <- out_file

# Row bind
cwgc_graves <- rbind(set1, set2)

# Save output to file
save(cwgc_graves, file = "./path/to/file/All_cwgc_graves.rda")
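
As a final sanity check, a row count on the combined data frame should land in the region of 1.7 million, give or take the reconciliation differences noted above.

# Total records collected across both sets
nrow(cwgc_graves)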

Conclusion

This post has demonstrated how to automatically search for and download the 1.7 million grave records from the CWGC website.
