Collecting the CWGC Data - Part 4
Introduction
This is the fourth post in a series that explores data collection using R to collect the Commonwealth War Graves Commission (CWGC) dataset.
Previously I showed how to download the details of the 1.7 million war dead commemorated by the CWGC. Unfortunately, when I opened one of the downloaded spreadsheets I found a fundamental problem that, frankly, I should have spotted earlier. The data did not include the service branch (Army, Navy etc.) or the Commonwealth country (Australia, Canada etc.) that each person served with. Whilst both are viable search terms on the website, neither variable was included in the results. At this point I had not given serious thought to the visualisations I wanted to develop, but exploring the data by service branch or by country served are two obvious ones. I needed to figure out a way of filling in these fields, so it was back to the drawing board.
Summary by Branch and Country
The two missing fields (‘Country served’ & ‘Service branch’) could be determined by using these as download options and then labelling the data appropriately, but this presented all the familiar problems of chunking the data into sets of fewer than 80,000 records. I didn’t want to have to do all of this again, so I made a summary of the data according to these two search options to see what the options were. This draws heavily on the code from the previous posts, so head back there for a full discussion if any of the functions are unfamiliar.
# URL to be scraped
war_dead_url <- "https://www.cwgc.org/find/find-war-dead"
# Set up HTML session for scraping
pgsession <- html_session(war_dead_url)
# Read contents of webpage into XML document
webpage <- read_html(pgsession)
# Nodes with 'Served in (Army etc.)' tick boxes
dat_html <- html_nodes(webpage,
                       xpath = "/html/body/main/section[2]/div/form/div[4]/div[2]")
# Tick box input elements
dat_html <- html_nodes(dat_html, "input")
# List of service branches
branch_service <- html_attr(dat_html, "value")
# Nodes with 'Served with (country)' tick boxes
dat_html <- html_nodes(webpage,
                       xpath = "/html/body/main/section[3]/div/form/div[4]/div[1]/div")
# Tick box input elements
dat_html <- html_nodes(dat_html, "input")
# List of countries served with
countries_served <- html_attr(dat_html, "value")
# Initialise data object for results
# (column names must match those produced inside the loop,
# otherwise the rbind() further down fails)
results <- data.frame(branch = character(),
                      served = character(),
                      num_records = numeric(),
                      war = numeric())
Instead of trying to submit a form using the check boxes, I manually produced the request URLs by copying the string pattern that a browser produces.
# URL to replicate
# The '%2B' in the URL is how a '+' is encoded
# .../results?war=1&servedWith=American&servedIn=Air%2Bforce
# Loop for both wars
for (war in c(1, 2)) {
  # Loop for all service branches
  for (branch in branch_service) {
    # Loop for all countries served
    for (served in countries_served) {
      # Replace spaces in country served name
      served <- gsub(" ", "+", served)
      # Replace spaces in service branch name
      branch <- gsub(" ", "+", branch)
      # Construct URL
      part_1 <- "https://www.cwgc.org/find/find-war-dead/results?"
      part_2 <- paste0("war=", war)
      part_3 <- paste0("&servedWith=", served)
      part_4 <- paste0("&servedIn=", branch)
      request_URL <- paste0(part_1, part_2, part_3, part_4)
      # Get contents of webpage
      webpage <- read_html(request_URL)
      # Get HTML node
      dat_html <- html_nodes(webpage,
                             xpath = "/html/body/main/section[2]/div")
      # Get attributes
      dat_attr <- html_attrs(dat_html)
      # Unlist to get character vector
      dat_attr <- unlist(dat_attr)
      # Get the second element and convert to numeric format
      num_records <- as.numeric(dat_attr[2])
      # Copy results to dataframe
      result_new <- data.frame(branch, served, num_records, war)
      # Bind results
      results <- rbind(results, result_new)
      # Wait between 10 and 20 seconds before next page request
      Sys.sleep(sample(seq(10, 20, 1), 1))
    }
  }
}
# Save output
save(results, file = "./Data/Summary_data_service_branch_countries_served.rda")
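As an aside, the gsub() substitutions above work because the site encodes spaces in query values as '+'. A more general alternative is base R's utils::URLencode. The sketch below assumes the same URL pattern as the loop; the helper name build_query is my own, not part of the original code.

```r
# Hypothetical helper: build a CWGC results URL from raw labels
build_query <- function(war, served, branch) {
  base <- "https://www.cwgc.org/find/find-war-dead/results?"
  # URLencode(reserved = TRUE) percent-encodes spaces as '%20';
  # swap these for '+' to match the browser's form-style encoding
  enc <- function(x) gsub("%20", "+", utils::URLencode(x, reserved = TRUE))
  paste0(base, "war=", war, "&servedWith=", enc(served), "&servedIn=", enc(branch))
}

build_query(1, "United Kingdom", "Air force")
# "https://www.cwgc.org/find/find-war-dead/results?war=1&servedWith=United+Kingdom&servedIn=Air+force"
```

The advantage over plain gsub() is that any other reserved characters in a label (an ampersand, say) would also be encoded correctly.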
Summary Results
The results from the previous section of code guided my next steps. The top ten combinations of branch, country served and war by total records were:
# Show the top 10 values (sorted by record count)
knitr::kable(head(results[order(-results$num_records), ], 10),
             row.names = FALSE,
             caption = "Summary of Records by Branch and Country Served")
| branch | served | num_records | war |
|---|---|---:|---|
| Army | United+Kingdom | 759355 | 1 |
| Army | United+Kingdom | 210287 | 2 |
| Air+force | United+Kingdom | 84835 | 2 |
| Army | Indian | 79322 | 2 |
| Army | Indian | 71539 | 1 |
| Civilian+War+Dead+1939-1945 | United+Kingdom | 68887 | 2 |
| Army | Canadian | 64426 | 1 |
| Army | Australian | 61668 | 1 |
| Navy | United+Kingdom | 59181 | 2 |
| Navy | United+Kingdom | 48387 | 1 |
There were 136 non-zero combinations overall, and 3 of these had more than 80,000 records, which had to be dealt with separately. The combinations with fewer than 80,000 rows could be downloaded directly, labelled with the appropriate service branch and country, and merged with the data obtained previously.
First, I retrieved the records for the 133 combinations. These were downloaded to a folder called Set 3. The code should be familiar enough.
# Filter out combinations with 0 records
results <- results[results$num_records != 0, ]
# Order data by records (descending)
results <- results[order(-results$num_records),]
# Exclude first three rows - these have more than 80,000 records
results <- results[4:nrow(results),]
# Loop through war, country served and branch
for (i in 1:nrow(results)) {
  # Get row details
  my_results <- results[i, ]
  # Get branch
  branch <- my_results[, 1]
  # Get country served
  served <- my_results[, 2]
  # Get war
  war <- my_results[, 4]
  # Set path and filename
  path <- "./Set3/"
  # Set up file name
  file_name <- paste0(path, war, "-", branch, "-", served, ".csv")
  # Construct URL
  part_1 <- "https://www.cwgc.org/find/find-war-dead/results?"
  part_2 <- paste0("war=", war)
  part_3 <- paste0("&servedWith=", served)
  part_4 <- paste0("&servedIn=", branch)
  request_URL <- paste0(part_1, part_2, part_3, part_4)
  # Contents of webpage
  webpage <- read_html(request_URL)
  # Get 'Download Results' element
  dat_html <- html_nodes(webpage,
                         xpath = "/html/body/main/section[2]/div/div[1]/div/div/div/div/a")
  # Get href attribute
  dat_attr <- html_attr(dat_html, "href")
  # Concatenate URL and relative URL
  download_address <- paste0("https://www.cwgc.org", dat_attr)
  # Download file from URL
  download.file(download_address, file_name)
  # Wait between 30 and 60 seconds before next page request
  Sys.sleep(sample(seq(30, 60, 1), 1))
}
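One robustness tweak worth considering for a run of 133 requests: download.file() throws an error on a failed request, which would abort the loop part-way through. Wrapping the call in tryCatch() lets the loop log the failure and carry on. This is a sketch of my own, not part of the original script; safe_download is a hypothetical wrapper name.

```r
# Hypothetical wrapper: attempt a download, retry a few times,
# log any final failure and keep going rather than aborting
safe_download <- function(url, dest, attempts = 3) {
  for (k in seq_len(attempts)) {
    ok <- tryCatch({
      download.file(url, dest, quiet = TRUE)
      TRUE
    }, error = function(e) FALSE, warning = function(w) FALSE)
    if (ok) return(TRUE)
    Sys.sleep(5)  # brief pause before retrying
  }
  message("Failed after ", attempts, " attempts: ", url)
  FALSE
}
```

Failed combinations could then be collected in a vector and re-run at the end, instead of restarting the whole loop.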
Labelling Branch and Country
With respect to the third largest category, the records for ‘Air Force’ and ‘United Kingdom’ were split by date before downloading and labelling. This could be done programmatically, but it was simpler to do it manually. Once this was done, a ‘Service Branch’ and ‘Country Served’ label was added to each record based on the file name, and the result was merged with the previously downloaded data.
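For reference, a programmatic split might look like the sketch below. Be warned that the dateFrom/dateTo parameter names are a guess of mine, not taken from the site; the real names would need to be copied from a browser request, as was done for the other query parameters.

```r
# Hypothetical: 'dateFrom'/'dateTo' are guessed parameter names -
# copy the real ones from a browser request before using this
base <- paste0("https://www.cwgc.org/find/find-war-dead/results?",
               "war=2&servedWith=United+Kingdom&servedIn=Air+force")
years <- 1939:1947
urls <- paste0(base, "&dateFrom=01-01-", years, "&dateTo=31-12-", years)
head(urls, 2)
```

Each year-sized chunk is comfortably under the 80,000-record download limit, so the same download loop as before could then be applied to the generated URLs.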
# Set path to downloaded files
path <- "./Set3/"
# List of *.csv files
csv_files <- list.files(path, pattern = "\\.csv$")
# Initialise object for output
out_file <- c()
# Loop through each file
for (my_file in csv_files) {
  # Console output for progress indicator
  cat(my_file, " \n")
  # Split filename ('war-branch-served.csv') to get labels
  # NB: the 'Civilian+War+Dead+1939-1945' branch contains a hyphen,
  # so its labels split incorrectly; this is corrected further down
  branch <- unlist(strsplit(my_file, "-"))[2]
  served <- unlist(strsplit(my_file, "-"))[3]
  # Set path and file name
  file_name <- paste0(path, my_file)
  # Read in file
  dat <- read.csv(file_name, header = TRUE, sep = ",")
  # Add in branch and country served category
  dat$branch <- branch
  dat$served <- served
  # Remove ".csv" suffix from column
  dat$served <- gsub("\\.csv$", "", dat$served)
  # Drop all variables except ID, branch and country served
  dat <- dat[, c(1, 19, 20)]
  # Bind results
  out_file <- rbind(out_file, dat)
}
# Load saved results
load("./path/to/file/All_cwgc_graves.rda")
# Merge data with previous results
final_cwgc <- merge(cwgc_graves, out_file, all.x = TRUE)
Any remaining records were labelled ‘Army’ and ‘United Kingdom’ by default (759,355 and 210,287 for WWI and WWII respectively).
# Add in missing labels
final_cwgc$branch[is.na(final_cwgc$branch)] <- "Army"
final_cwgc$served[is.na(final_cwgc$served)] <- "United+Kingdom"
# Coerce to date
final_cwgc$date_of_death <- as.Date(final_cwgc$date_of_death, '%d/%m/%Y')
# Add in war label
final_cwgc$war[final_cwgc$date_of_death < "1930-01-01"] <- 1
final_cwgc$war[is.na(final_cwgc$war)] <- 2
# Remove empty column
final_cwgc$X <- NULL
# The 'Civilian+War+Dead+1939-1945' file name contains a hyphen,
# so the filename split left the served label as '1945';
# rename that level to 'Civilian'
# Coerce to factor
final_cwgc$served <- as.factor(final_cwgc$served)
levels(final_cwgc$served)[levels(final_cwgc$served) == "1945"] <- "Civilian"
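As an optional extra step (my addition, not part of the original workflow), the '+' characters that the labels inherited from the request URLs could be swapped back to spaces before the categories are used in plots:

```r
# Optional tidy-up: swap the URL-style '+' back to spaces in a
# label column; factors are converted to plain character first
clean_labels <- function(x) gsub("\\+", " ", as.character(x))

clean_labels("United+Kingdom")  # "United Kingdom"
# Applied to the data this would be:
# final_cwgc$served <- clean_labels(final_cwgc$served)
# final_cwgc$branch <- clean_labels(final_cwgc$branch)
```

I have left the '+' characters in place in this post so that the labels match the summary table shown earlier.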
Data Cleaning
The last thing to do was to tidy up the downloaded data and to make sure the data types were set correctly. The current data types can be seen in the Environment panel in Rstudio, or can be listed using:
# List the data types
knitr::kable(sapply(final_cwgc, class), col.names = "CLASS")
Many of the fields were character type, so I wasn’t expecting much insight from a summary, but setting the correct types makes the summary statistics a bit more useful.
# Coerce to character
final_cwgc$id <- as.character(final_cwgc$id)
final_cwgc$initials <- as.character(final_cwgc$initials)
final_cwgc$servicenumberexport <- as.character(final_cwgc$servicenumberexport)
final_cwgc$gravereference <- as.character(final_cwgc$gravereference)
final_cwgc$additionalinformation <- as.character(final_cwgc$additionalinformation)
# Coerce to date
final_cwgc$date_of_death <- as.Date(final_cwgc$date_of_death, "%d/%m/%Y")
final_cwgc$date_of_death2 <- as.Date(final_cwgc$date_of_death2, "%d/%m/%Y")
# Clean up variable names
colnames(final_cwgc) <- c("Id", "Surname", "Forename", "Initials", "Age",
"Honours", "DoD_1", "DoD_2", "Rank", "Regiment",
"Secondary_Regiment", "Unit_Ship_Squadron",
"Country_Commemorated", "Service_Number",
"Cemetery_Memorial", "Grave_Reference",
"Additional_Info", "Branch", "Served", "War")
# Save output to file
save(final_cwgc, file = "./path/to/file/All_cwgc_graves_with_served_and_branch.rda")
Data Summary
The standard R function for producing summary statistics is summary(), but the console output can be quite cluttered in my view. The Hmisc package has a describe() function that presents the information more clearly, but either one will do the job.
# Get a data summary for a few select fields
Hmisc::describe(final_cwgc[,c(1,2)])
## final_cwgc[, c(1, 2)]
##
## 2 Variables 1731865 Observations
## --------------------------------------------------------------------------------
## Id
## n missing distinct Info Mean Gmd .05 .10
## 1731865 0 1731865 1 2039274 2152080 115036 249114
## .25 .50 .75 .90 .95
## 554405 1601316 2506800 2954580 3138144
##
## lowest : 18500 18501 18502 18503 18504
## highest: 75463959 75463960 75463961 75463962 75463990
##
## Value 0.0e+00 1.0e+06 2.0e+06 3.0e+06 4.0e+06 5.0e+06 6.0e+06 8.0e+06
## Frequency 383024 419085 491314 382383 17492 27 11875 16576
## Proportion 0.221 0.242 0.284 0.221 0.010 0.000 0.007 0.010
##
## Value 7.5e+07
## Frequency 10089
## Proportion 0.006
##
## For the frequency table, variable is rounded to the nearest 1000000
## --------------------------------------------------------------------------------
## Surname
## n missing distinct
## 1731865 0 194494
##
## lowest : AVEYARD BAIN BANTING BARR BROOKS
## highest: ZUCKERBRODT ZUKEMURA CHIREN ZULA ABDUL ZUMBROCK ZWOLSKI
## --------------------------------------------------------------------------------
I won’t spend much time discussing the output, but it provides a useful overview of the data. A nice feature of this function is that it reports the number of observations (‘n’) alongside the ‘missing’ and ‘distinct’ counts for each variable. The data fields are self-explanatory for the most part and have been mentioned previously. There are two dates of death, which seem to represent a possible date range where the precise date of death is unknown. The CWGC data is almost entirely categorical, so a numeric summary is only informative for the Age variable. The mean age at death is 27.8 years, the lowest is 1 year and the highest is 100 years, with the latter two values reflecting the ages of civilians killed in the Second World War. I’ll explore this further in another post.
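The age figures above came from the describe() output, but the same headline numbers can be pulled with base summary(), assuming Age arrives as text and needs coercing first. A small illustration on dummy values (these stand in for final_cwgc$Age; any non-numeric entries become NA):

```r
# Age is held as text in the raw download; coerce before summarising.
# Dummy values stand in for final_cwgc$Age here
age_raw <- c("27", "1", "100", "Unknown")
age <- suppressWarnings(as.numeric(age_raw))  # non-numeric entries become NA
summary(age)
round(mean(age, na.rm = TRUE), 1)  # 42.7
```

Using na.rm = TRUE matters here, since any unparseable ages would otherwise make the mean NA.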
Conclusion
I finally have a viable dataset that I can start to explore. It’s a truism that 80% of an analyst’s effort is spent preparing the data, and that certainly seemed to be the case with this project. This post has shown how to add the ‘service branch’ and ‘country served’ categories to the CWGC data that I downloaded previously. The next post will show how to summarise and explore the data with some initial visualisations.