post-image

Scraping data from STATEC's public tables


A lot of open data is available in Luxembourg’s open data portal, but sometimes, it is not very easy to download. In the video below, I give you an example of such data and show how you can use rvest to get the data easily.

After watching the video, take a look at the code below. This code does two things; first it scrapes the data, and then it puts the data in a tidy format fur further processing.

So to summarize the idea of the video; instead of clicking the buttons to download each year’s data (which you would have to do 15 times), it is easier to simple turn off javascript and then scrape the html version of the table. It would be possible, albeit with much more effort, to scrape the tables with javascript enabled, by using a tool such as phantomjs. But since we have the possibility to view the table in html, why not take advantage of it?

To scrape the data, you will need first to install the rvest and then load it (and let’s also load the other needed packages)

library(rvest)
library(dplyr)
library(tidyr)
library(purrr)
library(janitor)

Now, using rvest::read_html(), we can download the whole html page:

page_unemp <- read_html("http://www.statistiques.public.lu/stat/TableViewer/tableViewHTML.aspx?ReportId=12950&IF_Language=eng&MainTheme=2&FldrName=3&RFPath=91")

Now, we need to extract the table from the html page, and we do this by using rvest::html_nodes() and by providing this function with the name of the class of the object we’re interested in, namely, the table.

page_unemp %>%
  html_nodes(".b2020-datatable") %>% .[[1]] %>% html_table(fill = TRUE) -> data_raw


head(data_raw)
##                          X1                         X2      X3      X4
## 1                      Year                       Year    2001    2002
## 2             Specification                       Year    2001    2002
## 3 Grand Duchy of Luxembourg  Total employed population 180,084 182,004
## 4 Grand Duchy of Luxembourg     of which: Wage-earners 162,407 164,277
## 5 Grand Duchy of Luxembourg of which: Non-wage-earners  17,677  17,727
## 6 Grand Duchy of Luxembourg                 Unemployed   5,393   6,773
##        X5      X6      X7      X8      X9     X10     X11     X12     X13
## 1    2003    2004    2005    2006    2007    2008    2009    2010    2011
## 2    2003    2004    2005    2006    2007    2008    2009    2010    2011
## 3 183,419 186,325 187,380 192,095 197,486 202,203 204,127 207,923 214,094
## 4 165,509 168,214 169,194 174,045 179,176 183,705 185,369 188,983 194,893
## 5  17,910  18,111  18,186  18,050  18,310  18,498  18,758  18,940  19,201
## 6   8,359   9,426  10,653  10,297   9,670  11,496  14,816  15,567  16,159
##       X14     X15     X16     X17     X18      X19
## 1    2012    2013    2014    2015    2016     2017
## 2    2012    2013    2014    2015    2016 Measures
## 3 219,168 223,407 228,423 233,130 236,100  246,410
## 4 199,741 203,535 208,238 212,530 215,430  225,230
## 5  19,427  19,872  20,185  20,600  20,670   21,180
## 6  16,963  19,287  19,362  18,806  18,185   17,283

As you can see, we got the data in quite a nice format, but it still needs to be cleaned a bit. Let’s do this.

First, let’s use the first row as the header of the data set and then remove it:

colnames(data_raw) <- data_raw[2, ]
colnames(data_raw)[1:2] <- c("division", "variable")
data_raw <- data_raw[-c(1,2), ]
head(data_raw)
##                    division                   variable    2001    2002
## 3 Grand Duchy of Luxembourg  Total employed population 180,084 182,004
## 4 Grand Duchy of Luxembourg     of which: Wage-earners 162,407 164,277
## 5 Grand Duchy of Luxembourg of which: Non-wage-earners  17,677  17,727
## 6 Grand Duchy of Luxembourg                 Unemployed   5,393   6,773
## 7 Grand Duchy of Luxembourg          Active population 185,477 188,777
## 8 Grand Duchy of Luxembourg  Uneimployment rate (in %)    2.91    3.59
##      2003    2004    2005    2006    2007    2008    2009    2010    2011
## 3 183,419 186,325 187,380 192,095 197,486 202,203 204,127 207,923 214,094
## 4 165,509 168,214 169,194 174,045 179,176 183,705 185,369 188,983 194,893
## 5  17,910  18,111  18,186  18,050  18,310  18,498  18,758  18,940  19,201
## 6   8,359   9,426  10,653  10,297   9,670  11,496  14,816  15,567  16,159
## 7 191,778 195,751 198,033 202,392 207,156 213,699 218,943 223,490 230,253
## 8    4.36    4.82    5.38    5.09    4.67    5.38    6.77    6.97    7.02
##      2012    2013    2014    2015    2016 Measures
## 3 219,168 223,407 228,423 233,130 236,100  246,410
## 4 199,741 203,535 208,238 212,530 215,430  225,230
## 5  19,427  19,872  20,185  20,600  20,670   21,180
## 6  16,963  19,287  19,362  18,806  18,185   17,283
## 7 236,131 242,694 247,785 251,936 254,285  263,693
## 8    7.18    7.95    7.81    7.46    7.15     6.55

This is starting to look nice, but we need to replace the “,” with “.” and then convert the columns to numeric.

data_raw %>%
  map_df(function(x)(gsub(",", ".", x = x))) %>%
  mutate_at(vars(matches("\\d{4}")), as.numeric
            ) -> clean_unemp

head(clean_unemp)
## # A tibble: 6 x 19
##   division variable `2001` `2002` `2003` `2004` `2005` `2006` `2007` `2008`
##   <chr>    <chr>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Grand D… Total e… 180.   182.   183.   186.   187.   192.   197.   202.  
## 2 Grand D… of whic… 162.   164.   166.   168.   169.   174.   179.   184.  
## 3 Grand D… of whic…  17.7   17.7   17.9   18.1   18.2   18.0   18.3   18.5 
## 4 Grand D… Unemplo…   5.39   6.77   8.36   9.43  10.7   10.3    9.67  11.5 
## 5 Grand D… Active … 185.   189.   192.   196.   198.   202.   207.   214.  
## 6 Grand D… Uneimpl…   2.91   3.59   4.36   4.82   5.38   5.09   4.67   5.38
## # ... with 9 more variables: `2009` <dbl>, `2010` <dbl>, `2011` <dbl>,
## #   `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>, `2016` <dbl>,
## #   Measures <chr>

This line: map_df(function(x)(gsub(",", ".", x = x))) calls purrr::map_df(), which maps a function to each column of a data frame. The function in question is function(x)(gsub(",", ".", x = x)), which is an anonymous function (meaning it does not have a name) wrapped around gsub. This function looks for the string “,” and replaces it with “.” in a single column of the data frame. But because we’re mapping this function to all the columns of the data frame with purrr::map_df(), this substitution happens in each column. We’ not done yet, because these columns are still holding characters. We need to convert each column to a numeric vector and this is what happens in the next line, mutate_at(vars(matches("\\d{4}")), as.numeric). Each column that contains exactly for digits (hence the "\\d{4}") is converted to numeric with dplyr::mutate_at().

Now, one last step to really have the data in a nice format:

clean_unemp %>% 
    gather(key=year, value, -division, -variable) %>%
    spread(variable, value) %>%
    clean_names(
           ) -> clean_unemp

head(clean_unemp)
## # A tibble: 6 x 8
##   division year  active_populati… of_which_non_wa… of_which_wage_e…
##   <chr>    <chr> <chr>            <chr>            <chr>           
## 1 Beaufort 2001  688              85               568             
## 2 Beaufort 2002  742              85               631             
## 3 Beaufort 2003  773              85               648             
## 4 Beaufort 2004  828              80               706             
## 5 Beaufort 2005  866              96               719             
## 6 Beaufort 2006  893              87               746             
## # ... with 3 more variables: total_employed_population <chr>,
## #   uneimployment_rate_in_percent <chr>, unemployed <chr>

By using tidyr::gather() and then tidyr::spread() we get a nice data set where each column is a variable and each row is an observation. I advise you run the above code line by line and try to understand what each function does. We finish by cleaning the names of the variables with janitor::clean_names() and that’s it.

Don’t hesitate to follow us on twitter @rdata_lu and to subscribe to our youtube channel.
You can also contact us if you have any comments or suggestions. See you for the next post!

Back to blog