Collecting the CWGC Data - Part 2

Introduction

This is the second post in a series that explores how to use R to collect the Commonwealth War Graves Commission (CWGC) dataset.

Once I had extracted the list of countries from the website, I proceeded to get the number of graves and memorials for each country, with a breakdown for both WWI and WWII. I defined a data frame to store the results and two loops to step through each war and each country.

# Load data from previous post
load(file = "path/to/your/files/countries_commemorated.rda")

# Initialise data object for results (the column names match those
# generated inside the loop so that rbind() binds without errors)
results <- data.frame(commemorated = character(),
                      num_records = numeric(),
                      war = numeric())
# Loop for both wars
for (war in c(1,2)){
  # Loop for all countries commemorated
  for (commemorated in commemorated_data){
    # ... get number of graves (detailed below)
  }
}

The number of records for each country was obtained from the main URL, and an HTML session was set up for the duration of the scraping. The html_form() function was used to extract the form structure from the page. Note that the examples here use the pre-1.0 rvest API; in rvest 1.0 and later, html_session(), set_values() and submit_form() were renamed to session(), html_form_set() and session_submit().

# Load rvest for the session, form and scraping functions
library(rvest)

# URLs to be scraped
memorial_url <- "https://www.cwgc.org/find/find-cemeteries-and-memorials"
war_dead_url <- "https://www.cwgc.org/find/find-war-dead"

# Set up html session for scraping
pgsession <- html_session(war_dead_url)

# Parse form fields from URL
page_form <- html_form(pgsession)

# Inspect output
View(page_form)

The output was a list of lists, and the third element contained the search fields for the form (LastName, Initials, FirstName and so on). Specifically, I used the 'countryCommemoratedAt' and 'War' fields for the automated search.

Figure 1. Form search fields and names
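To see the available field names without opening the viewer, you can also print them directly from the form object's fields element - a quick sketch:

# List the field names of the third form on the page
page_form <- html_form(pgsession)[[3]]
names(page_form$fields)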

Submitting a Form

After selecting the third element, I used the set_values() function to set the values in the form, where the variables war and commemorated were supplied by the loops sketched earlier. Lastly, the form was submitted and the results retrieved.

# Parse the form from the page
page_form <- html_form(pgsession)[[3]]

# Set form fields with country and war to be queried
filled_form <- set_values(page_form,
                          "countryCommemoratedAt" = commemorated,
                          "War" = war)

# Submit filled form to server and get results
dat <- submit_form(session = pgsession,
                   form = filled_form)

The parsed HTML response contained a count of the number of records, which was ultimately what I was trying to determine, but first I needed to figure out which element had the relevant data. I did a manual search on the website by selecting a country and war (for instance, selecting ‘France’ and ‘WWI’). From the result, I could copy the XPath for the element containing the number of records.

Figure 2. HTML element Id and XPath from https://www.cwgc.org

Once I had the XPath, I was able to retrieve the relevant HTML elements and attribute values using the functions described in the first post:

# Get contents of HTML element with XPath option
dat_html <- html_nodes(dat, xpath = "/html/body/main/section[2]/div")

# Get attributes
dat_attr <- html_attrs(dat_html)

# Unlist to get character vector
dat_attr <- unlist(dat_attr)

# Get the second element and convert to numeric format
num_records <- as.numeric(dat_attr[2])
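Relying on a fixed attribute position is fragile if the page layout changes, so a small guard is worthwhile - a minimal sketch, assuming the count lives in the second attribute as above:

# Guard against a missing or non-numeric attribute (sketch)
num_records <- if (length(dat_attr) >= 2) {
  suppressWarnings(as.numeric(dat_attr[2]))
} else {
  NA_real_
}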

At this point, I had the number of graves commemorated for the given country and war, and all that remained was to copy the result into a small data frame and row-bind it to the results object initialised earlier.

# Copy results to a one-row data frame (keep strings as characters)
result_new <- data.frame(commemorated, num_records, war,
                         stringsAsFactors = FALSE)

# Bind results by row
results <- rbind(results, result_new)

It's probably not necessary in this case, considering the low volume of page requests, but if you're concerned about overloading the website, you can introduce a delay between loop iterations using the Sys.sleep() command:

# Wait between 10 and 20 seconds before next page request
Sys.sleep(sample(seq(10, 20, 1), 1))
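Equivalently, the delay can be drawn from a continuous range rather than whole seconds:

# Wait a random (non-integer) interval between 10 and 20 seconds
Sys.sleep(runif(1, min = 10, max = 20))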

Putting it all Together

# Load the rvest package and the country list from the previous post
library(rvest)
load(file = "path/to/your/files/countries_commemorated.rda")

# URL to be scraped
war_dead_url <- "https://www.cwgc.org/find/find-war-dead"

# Initialise data object for results (column names match those
# generated inside the loop so rbind() binds without errors)
results <- data.frame(commemorated = character(),
                      num_records = numeric(),
                      war = numeric())

# Loop for both wars
for (war in c(1, 2)){
  # Loop for all countries commemorated
  for (commemorated in commemorated_data){

    # Set up HTML session for scraping
    pgsession <- html_session(war_dead_url)

    # Parse form fields from URL
    page_form <- html_form(pgsession)[[3]]

    # Set form fields with country and war
    filled_form <- set_values(page_form,
                              "countryCommemoratedAt" = commemorated,
                              "War" = war)

    # Submit query and get results
    dat <- submit_form(session = pgsession,
                       form = filled_form)

    # Get HTML node
    dat_html <- html_nodes(dat, xpath = "/html/body/main/section[2]/div")

    # Get attributes
    dat_attr <- html_attrs(dat_html)

    # Unlist to get character vector
    dat_attr <- unlist(dat_attr)

    # Get the second element and convert to numeric format
    num_records <- as.numeric(dat_attr[2])

    # Copy results to a one-row data frame
    result_new <- data.frame(commemorated, num_records, war,
                             stringsAsFactors = FALSE)

    # Bind results
    results <- rbind(results, result_new)

    # Wait between 10 and 20 seconds before next page request
    Sys.sleep(sample(seq(10, 20, 1), 1))
  }
}
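One caveat: a single failed request would abort the whole run. If that's a concern, the submission step could be wrapped in tryCatch() - a minimal sketch that skips to the next country on error:

# Sketch: skip this iteration if the request fails
dat <- tryCatch(
  submit_form(session = pgsession, form = filled_form),
  error = function(e) NULL
)
if (is.null(dat)) next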

Number of Graves by Country

I then calculated the total number of graves commemorated per country across both wars. First, I converted the data frame from long format to wide format - if you're not sure what this means, the tidyr documentation has a helpful explanation. This split the results for each war into their own columns, and a total column was then derived by summing the two values.

# Load tidyr for the long-to-wide conversion
library(tidyr)

# Convert to wide format
results <- spread(results, key = war, value = num_records, sep = "-")

# Add sum of graves for both WWI and WWII
results$sum <- results$`war-1` + results$`war-2`
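If you're on tidyr 1.0 or later, pivot_wider() supersedes spread() - an equivalent sketch:

# Equivalent conversion using the newer tidyr API
results <- pivot_wider(results,
                       names_from = war,
                       values_from = num_records,
                       names_prefix = "war-")
results$sum <- results$`war-1` + results$`war-2`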

At this point, I checked that the results tallied with the total number of graves commemorated by the CWGC (1,741,938), but I was short 68,887 records. Comparing the results against the website showed I had omitted the category for civilians (‘Civilian War Dead’) which was not included in the list of countries that was used for the loop. I added this in manually and re-scraped the data.
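A quick way to quantify the discrepancy is to compare the scraped total against the published figure:

# Compare scraped total with the figure published by the CWGC
cwgc_total <- 1741938
sum(results$sum) - cwgc_total  # -68887 before adding the civilian category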

# Add in additional category for civilians
commemorated_data <- c(commemorated_data, "Civilian War Dead")

# Rerun the code as before

Once this was done, I verified that my results matched the CWGC totals and saved the results to a *.rda file for later use.

# Copy object
cwgc_summary <- results

# Save results
save(cwgc_summary, file = "./Data/Summary_data_countries_commemorated.rda")

The final output gives a table of the number of graves in each country for each war:

# View first 10 results
knitr::kable(head(cwgc_summary, 10), row.names = FALSE, caption = "Graves per Country and War")
Table 1: Graves per Country and War
commemorated    war-1   war-2     sum
------------    -----   -----   -----
Albania             0      48      48
Algeria            19    2051    2070
Antigua             0       2       2
Argentina           2      13      15
Australia        3498    9969   13467
Austria             7     577     584
Azerbaijan         47       0      47
Azores              0      52      52
Bahamas             2      58      60
Bahrain             0       0       0

Conclusion

This post demonstrated how to use simple web-scraping techniques to get a breakdown of the number of graves commemorated by the CWGC in each country. The following post will show how this information is used to download the data for further analysis.
