Collecting the CWGC Data - Part 4
Introduction
This is the fourth post in a series that explores data collection using R to collect the Commonwealth War Graves Commission (CWGC) dataset.
Previously I showed how to download the details of the 1.7 million war dead commemorated by the CWGC. Unfortunately, when I opened one of the downloaded spreadsheets I found a fundamental problem that, frankly, I should have spotted earlier. The data did not include the service branch (Army, Navy etc.) or the Commonwealth country (Australia, Canada etc.) that each person served with. Whilst both are viable search terms on the website, neither variable was included in the results. At this point I had not given serious thought to the visualisations I wanted to develop, but exploring the data by service branch or by country served are two obvious ones. I needed to figure out a way of filling in these fields, so it was back to the drawing board.
Summary by Branch and Country
The two missing fields (‘Country served’ & ‘Service branch’) could be determined by using these as download options and then labelling the data appropriately, but this presented all the familiar problems of chunking the data into sets of fewer than 80,000 records. I didn’t want to have to do all of this again, so I made a summary of the data according to these two search options to see what the options were. This draws heavily on the code from the previous posts, so head back there for a full discussion if any of the functions are unfamiliar.
# URL to be scraped
war_dead_url <- "https://www.cwgc.org/find/find-war-dead"
# Set up HTML session for scraping
pgsession <- html_session(war_dead_url)
# Read contents of webpage into XML document
webpage <- read_html(pgsession)
# Nodes with 'Served in (Army etc.)' tick boxes
dat_html <- html_nodes(webpage,
                       xpath = "/html/body/main/section[2]/div/form/div[4]/div[2]")
# Tick box input elements
dat_html <- html_nodes(dat_html, "input")
# List of service branches
branch_service <- html_attr(dat_html, "value")
# Nodes with 'Served with (country)' tick boxes
dat_html <- html_nodes(webpage,
                       xpath = "/html/body/main/section[3]/div/form/div[4]/div[1]/div")
# Tick box input elements
dat_html <- html_nodes(dat_html, "input")
# List of countries served with
countries_served <- html_attr(dat_html, "value")
# Initialise data object for results
# (column names must match those produced inside the loop,
# otherwise the rbind() further down fails)
results <- data.frame(branch = character(),
                      served = character(),
                      num_records = numeric(),
                      war = numeric())
Instead of trying to submit a form using the check boxes, I manually produced the request URLs by copying the string pattern that a browser produces.
# URL to replicate
# The '%2B' in the URL is how a '+' is encoded
# .../results?war=1&servedWith=American&servedIn=Air%2Bforce
# Loop for both wars
for (war in c(1, 2)) {
  # Loop for all service branches
  for (branch in branch_service) {
    # Loop for all countries served
    for (served in countries_served) {
      # Replace spaces in country served name
      served <- gsub(" ", "+", served)
      # Replace spaces in service branch name
      branch <- gsub(" ", "+", branch)
      # Construct URL
      part_1 <- "https://www.cwgc.org/find/find-war-dead/results?"
      part_2 <- paste0("war=", war)
      part_3 <- paste0("&servedWith=", served)
      part_4 <- paste0("&servedIn=", branch)
      request_URL <- paste0(part_1, part_2, part_3, part_4)
      # Get contents of webpage
      webpage <- read_html(request_URL)
      # Get HTML node
      dat_html <- html_nodes(webpage,
                             xpath = "/html/body/main/section[2]/div")
      # Get attributes
      dat_attr <- html_attrs(dat_html)
      # Unlist to get character vector
      dat_attr <- unlist(dat_attr)
      # Get the second element and convert to numeric format
      num_records <- as.numeric(dat_attr[2])
      # Copy results to dataframe
      result_new <- data.frame(branch, served, num_records, war)
      # Bind results
      results <- rbind(results, result_new)
      # Wait between 10 and 20 seconds before next page request
      Sys.sleep(sample(seq(10, 20, 1), 1))
    }
  }
}
# Save output
save(results, file = "./Data/Summary_data_service_branch_countries_served.rda")
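As an aside, the gsub() substitutions above work because the site encodes spaces in query values as '+'. A more general alternative is base R's utils::URLencode. The sketch below assumes the same URL pattern as the loop; the helper name build_query is my own, not part of the original code.

```r
# Hypothetical helper: build a CWGC results URL from raw labels
build_query <- function(war, served, branch) {
  base <- "https://www.cwgc.org/find/find-war-dead/results?"
  # URLencode(reserved = TRUE) percent-encodes spaces as '%20';
  # swap these for '+' to match the browser's form-style encoding
  enc <- function(x) gsub("%20", "+", utils::URLencode(x, reserved = TRUE))
  paste0(base, "war=", war, "&servedWith=", enc(served), "&servedIn=", enc(branch))
}

build_query(1, "United Kingdom", "Air force")
# "https://www.cwgc.org/find/find-war-dead/results?war=1&servedWith=United+Kingdom&servedIn=Air+force"
```

The advantage over plain gsub() is that any other reserved characters in a label (an ampersand, say) would also be encoded correctly.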
Summary Results
The results from the previous section of code guided my next steps. The top ten combinations of branch, country served and war by total records were:
# Show the top 10 values (sorted by record count)
knitr::kable(head(results[order(-results$num_records), ], 10),
             row.names = FALSE,
             caption = "Summary of Records by Branch and Country Served")
| branch | served | num_records | war |
|---|---|---:|---|
| Army | United+Kingdom | 759355 | 1 |
| Army | United+Kingdom | 210287 | 2 |
| Air+force | United+Kingdom | 84835 | 2 |
| Army | Indian | 79322 | 2 |
| Army | Indian | 71539 | 1 |
| Civilian+War+Dead+1939-1945 | United+Kingdom | 68887 | 2 |
| Army | Canadian | 64426 | 1 |
| Army | Australian | 61668 | 1 |
| Navy | United+Kingdom | 59181 | 2 |
| Navy | United+Kingdom | 48387 | 1 |
There were 136 non-zero combinations overall, and 3 of these had more than 80,000 records, which had to be dealt with separately. The combinations with fewer than 80,000 rows could be downloaded directly, labelled with the appropriate service branch and country, and merged with the data obtained previously.
First, I retrieved the records for the 133 combinations. These were downloaded to a folder called Set 3. The code should be familiar enough.
# Filter out combinations with 0 records
results <- results[results$num_records != 0, ]
# Order data by records (descending)
results <- results[order(-results$num_records),]
# Exclude first three rows - these have more than 80,000 records
results <- results[4:nrow(results),]
# Loop through war, country served and branch
for (i in 1:nrow(results)) {
  # Get row details
  my_results <- results[i, ]
  # Get branch
  branch <- my_results[, 1]
  # Get country served
  served <- my_results[, 2]
  # Get war
  war <- my_results[, 4]
  # Set path and filename
  path <- "./Set3/"
  # Set up file name
  file_name <- paste0(path, war, "-", branch, "-", served, ".csv")
  # Construct URL
  part_1 <- "https://www.cwgc.org/find/find-war-dead/results?"
  part_2 <- paste0("war=", war)
  part_3 <- paste0("&servedWith=", served)
  part_4 <- paste0("&servedIn=", branch)
  request_URL <- paste0(part_1, part_2, part_3, part_4)
  # Contents of webpage
  webpage <- read_html(request_URL)
  # Get 'Download Results' element
  dat_html <- html_nodes(webpage,
                         xpath = "/html/body/main/section[2]/div/div[1]/div/div/div/div/a")
  # Get href attribute
  dat_attr <- html_attr(dat_html, "href")
  # Concatenate URL and relative URL
  download_address <- paste0("https://www.cwgc.org", dat_attr)
  # Download file from URL
  download.file(download_address, file_name)
  # Wait between 30 and 60 seconds before next page request
  Sys.sleep(sample(seq(30, 60, 1), 1))
}
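One robustness tweak worth considering for a run of 133 requests: download.file() throws an error on a failed request, which would abort the loop part-way through. Wrapping the call in tryCatch() lets the loop log the failure and carry on. This is a sketch of my own, not part of the original script; safe_download is a hypothetical wrapper name.

```r
# Hypothetical wrapper: attempt a download, retry a few times,
# log any final failure and keep going rather than aborting
safe_download <- function(url, dest, attempts = 3) {
  for (k in seq_len(attempts)) {
    ok <- tryCatch({
      download.file(url, dest, quiet = TRUE)
      TRUE
    }, error = function(e) FALSE, warning = function(w) FALSE)
    if (ok) return(TRUE)
    Sys.sleep(5)  # brief pause before retrying
  }
  message("Failed after ", attempts, " attempts: ", url)
  FALSE
}
```

Failed combinations could then be collected in a vector and re-run at the end, instead of restarting the whole loop.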
Labelling Branch and Country
With respect to the third largest category, the records for ‘Air Force’ and ‘United Kingdom’ were split by date before downloading and labelling. This could be done programmatically, but it was simpler to do it manually. Once this was done, a ‘Service Branch’ and ‘Country Served’ label was added to each record based on the file name, and the result was merged with the previously downloaded data.
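For reference, a programmatic split might look like the sketch below. Be warned that the dateFrom/dateTo parameter names are a guess of mine, not taken from the site; the real names would need to be copied from a browser request, as was done for the other query parameters.

```r
# Hypothetical: 'dateFrom'/'dateTo' are guessed parameter names -
# copy the real ones from a browser request before using this
base <- paste0("https://www.cwgc.org/find/find-war-dead/results?",
               "war=2&servedWith=United+Kingdom&servedIn=Air+force")
years <- 1939:1947
urls <- paste0(base, "&dateFrom=01-01-", years, "&dateTo=31-12-", years)
head(urls, 2)
```

Each year-sized chunk is comfortably under the 80,000-record download limit, so the same download loop as before could then be applied to the generated URLs.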
# Set path to downloaded files
path <- "./Set3/"
# List of *.csv files
csv_files <- list.files(path, pattern = "\\.csv$")
# Initialise object for output
out_file <- c()
# Loop through each file
for (my_file in csv_files) {
  # Console output for progress indicator
  cat(my_file, " \n")
  # Split filename ('war-branch-served.csv') to get labels
  # NB: the 'Civilian+War+Dead+1939-1945' branch contains a hyphen,
  # so its labels split incorrectly; this is corrected further down
  branch <- unlist(strsplit(my_file, "-"))[2]
  served <- unlist(strsplit(my_file, "-"))[3]
  # Set path and file name
  file_name <- paste0(path, my_file)
  # Read in file
  dat <- read.csv(file_name, header = TRUE, sep = ",")
  # Add in branch and country served category
  dat$branch <- branch
  dat$served <- served
  # Remove ".csv" suffix from column
  dat$served <- gsub("\\.csv$", "", dat$served)
  # Drop all variables except ID, branch and country served
  dat <- dat[, c(1, 19, 20)]
  # Bind results
  out_file <- rbind(out_file, dat)
}
# Load saved results
load("./path/to/file/All_cwgc_graves.rda")
# Merge data with previous results
final_cwgc <- merge(cwgc_graves, out_file, all.x = TRUE)
Any remaining records were labelled ‘Army’ and ‘United Kingdom’ by default (759,355 and 210,287 for WWI and WWII respectively).
# Add in missing labels
final_cwgc$branch[is.na(final_cwgc$branch)] <- "Army"
final_cwgc$served[is.na(final_cwgc$served)] <- "United+Kingdom"
# Coerce to date
final_cwgc$date_of_death <- as.Date(final_cwgc$date_of_death, '%d/%m/%Y')
# Add in war label
final_cwgc$war[final_cwgc$date_of_death < "1930-01-01"] <- 1
final_cwgc$war[is.na(final_cwgc$war)] <- 2
# Remove empty column
final_cwgc$X <- NULL
# The 'Civilian+War+Dead+1939-1945' file name contains a hyphen,
# so the filename split left the served label as '1945';
# rename that level to 'Civilian'
# Coerce to factor
final_cwgc$served <- as.factor(final_cwgc$served)
levels(final_cwgc$served)[levels(final_cwgc$served) == "1945"] <- "Civilian"
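As an optional extra step (my addition, not part of the original workflow), the '+' characters that the labels inherited from the request URLs could be swapped back to spaces before the categories are used in plots:

```r
# Optional tidy-up: swap the URL-style '+' back to spaces in a
# label column; factors are converted to plain character first
clean_labels <- function(x) gsub("\\+", " ", as.character(x))

clean_labels("United+Kingdom")  # "United Kingdom"
# Applied to the data this would be:
# final_cwgc$served <- clean_labels(final_cwgc$served)
# final_cwgc$branch <- clean_labels(final_cwgc$branch)
```

I have left the '+' characters in place in this post so that the labels match the summary table shown earlier.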
Data Cleaning
The last thing to do was to tidy up the downloaded data and to make sure the data types were set correctly. The current data types can be seen in the Environment panel in Rstudio, or can be listed using:
# List the data types
knitr::kable(sapply(final_cwgc, class), col.names = "CLASS")
Many of the fields were character type, so I wasn’t expecting much insight from a summary, but setting the correct types makes the summary statistics a bit more useful.
# Coerce to character
final_cwgc$id <- as.character(final_cwgc$id)
final_cwgc$initials <- as.character(final_cwgc$initials)
final_cwgc$servicenumberexport <- as.character(final_cwgc$servicenumberexport)
final_cwgc$gravereference <- as.character(final_cwgc$gravereference)
final_cwgc$additionalinformation <- as.character(final_cwgc$additionalinformation)
# Coerce to date
final_cwgc$date_of_death <- as.Date(final_cwgc$date_of_death, "%d/%m/%Y")
final_cwgc$date_of_death2 <- as.Date(final_cwgc$date_of_death2, "%d/%m/%Y")
# Clean up variable names
colnames(final_cwgc) <- c("Id", "Surname", "Forename", "Initials", "Age",
"Honours", "DoD_1", "DoD_2", "Rank", "Regiment",
"Secondary_Regiment", "Unit_Ship_Squadron",
"Country_Commemorated", "Service_Number",
"Cemetery_Memorial", "Grave_Reference",
"Additional_Info", "Branch", "Served", "War")
# Save output to file
save(final_cwgc, file = "./path/to/file/All_cwgc_graves_with_served_and_branch.rda")
Data Summary
The standard R function for producing summary statistics is summary(), but the console output can be quite cluttered in my view. The Hmisc package has a describe() function that presents the information more clearly, but either one will do the job.
# Get a data summary for a few select fields
Hmisc::describe(final_cwgc[,c(1,2)])
## final_cwgc[, c(1, 2)]
##
## 2 Variables 1731865 Observations
## --------------------------------------------------------------------------------
## Id
## n missing distinct Info Mean Gmd .05 .10
## 1731865 0 1731865 1 2039274 2152080 115036 249114
## .25 .50 .75 .90 .95
## 554405 1601316 2506800 2954580 3138144
##
## lowest : 18500 18501 18502 18503 18504
## highest: 75463959 75463960 75463961 75463962 75463990
##
## Value 0.0e+00 1.0e+06 2.0e+06 3.0e+06 4.0e+06 5.0e+06 6.0e+06 8.0e+06
## Frequency 383024 419085 491314 382383 17492 27 11875 16576
## Proportion 0.221 0.242 0.284 0.221 0.010 0.000 0.007 0.010
##
## Value 7.5e+07
## Frequency 10089
## Proportion 0.006
##
## For the frequency table, variable is rounded to the nearest 1000000
## --------------------------------------------------------------------------------
## Surname
## n missing distinct
## 1731865 0 194494
##
## lowest : AVEYARD BAIN BANTING BARR BROOKS
## highest: ZUCKERBRODT ZUKEMURA CHIREN ZULA ABDUL ZUMBROCK ZWOLSKI
## --------------------------------------------------------------------------------
I won’t spend much time discussing the output, but it provides a useful overview of the data. A nice feature of this function is that it reports the number of observations (‘n’) alongside the ‘missing’ and ‘distinct’ counts for each variable. The data fields are self-explanatory for the most part and have been mentioned previously. There are two dates of death, which seem to represent a possible date range where the precise date of death is unknown. The CWGC data is almost entirely categorical, so a numeric summary is only informative for the Age variable. The mean age at death is 27.8 years, the lowest is 1 year and the highest is 100 years, with the latter two values reflecting the ages of civilians killed in the Second World War. I’ll explore this further in another post.
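The age figures above came from the describe() output, but the same headline numbers can be pulled with base summary(), assuming Age arrives as text and needs coercing first. A small illustration on dummy values (these stand in for final_cwgc$Age; any non-numeric entries become NA):

```r
# Age is held as text in the raw download; coerce before summarising.
# Dummy values stand in for final_cwgc$Age here
age_raw <- c("27", "1", "100", "Unknown")
age <- suppressWarnings(as.numeric(age_raw))  # non-numeric entries become NA
summary(age)
round(mean(age, na.rm = TRUE), 1)  # 42.7
```

Using na.rm = TRUE matters here, since any unparseable ages would otherwise make the mean NA.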
Conclusion
I finally have a viable dataset that I can start to explore. It’s a truism that 80% of an analyst’s effort is spent preparing the data, and that certainly seemed to be the case with this project. This post has shown how to add the ‘service branch’ and ‘country served’ categories to the CWGC data that I downloaded previously. The next post will show how to summarise and explore the data with some initial visualisations.