This is part 1 of a three-part blog post. This post presents the Luxembourgish language as well as the literary work I am going to analyze using the R programming language. Part 2 deals with preparing the data for analysis, and finally part 3 is the analysis. Hope you enjoy!
Luxembourg and the Luxembourgish language
Luxembourg is a small European country, squeezed between France, Belgium and Germany. Over the course of its history, it’s been invaded over and over by either France or Prussia (later Germany). It eventually became a state under the personal possession of William I of the Netherlands in 1815, with a… Prussian garrison to guard its capital, Luxembourg City, from further French invasions. After the Belgian Revolution, the purely French-speaking part of the country was ceded to Belgium in 1839, and the Luxembourgish-speaking part became what is known today as the Grand-Duchy of Luxembourg. What’s a Grand-Duchy, you might wonder?
Luxembourg is the only remaining Grand-Duchy in the world. A Grand-Duchy is like a Kingdom, but instead of a King, we have a Grand Duke. The current monarch, and head of state, is Henri. Luxembourg is a constitutional monarchy, so the head of government is the prime minister, Xavier Bettel. As you can imagine, Luxembourg’s history has had a very important impact on the languages we speak today in the country; there are three official languages: French, German, and Luxembourgish. Unlike other countries with several official languages, Luxembourg does not have a French-, German-, or Luxembourgish-speaking part. In Luxembourg, you use one of the three languages based on context.
For example, the laws are all written in French, and French is mostly the language used for official or formal written correspondence. German has traditionally been the language of the press and the police. And finally, Luxembourgish is the language Luxembourgers use to speak with one another. This means that on a given day, most people here might switch between these three languages; of course, add English to the pile, which is rapidly growing in the country due to all the English-speaking expats that come here to work (coughbrexitcough).
There is also a sizable Portuguese community in Luxembourg, so you’ll hear a lot of Portuguese on the streets too, as well as Italian. Around 50% of the inhabitants of Luxembourg are foreign born, mostly from other EU countries. The Italians, Portuguese and a lot of others have immigrated to Luxembourg since the 60s to work in the metallurgic sector, and later, in the construction sector. The children of these immigrants usually speak five languages: their mother tongue (say, Portuguese), the three official languages of the country, and finally English.
You might wonder what Luxembourgish sounds like. Here is a video of our Prime Minister talking in Luxembourgish:
Here is another video of him speaking French:
Here he’s speaking German:
And here English:
In the English video, you might notice the typical accent Luxembourgers have when speaking English :)
The text we’re analysing
The text I’ll be analyzing is called Renert oder de Fuuss am Frack an a Maansgréisst, published in 1872 by Michel Rodange. My high school was named after Michel Rodange by the way! Renert is a fable featuring a sly fox as the main character, called Renert. He gets in trouble because of his shenanigans and gets sentenced to death by the Lion King. However, through further lies and deceptions, he manages to escape. After some tribulations, he proves his worth to the King by winning a duel against the wolf and becomes an aristocrat. Because it was written in the 19th century, the way some words are written may be different from how we write them in modern Luxembourgish, which might create some problems when analyzing the text.
Now starts the technical part. If you’re only interested in the results, you can skip to part 3!
Scraping the data
First of all, let’s load (or install if you don’t have them) the needed packages:
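A minimal sketch of that step, assuming only the two packages named in this post (tidyverse and xml2); the exact package list is an assumption on my part, and the install line is left commented out on purpose:

```r
# Assumed package list: only tidyverse and xml2 are named in this post.
needed <- c("tidyverse", "xml2")

# Check which packages are already installed, without attaching them.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Install whatever is missing (uncomment to run; needs a CRAN mirror).
# install.packages(needed[!installed])

# Attach the packages that are available.
for (pkg in needed[installed]) {
  library(pkg, character.only = TRUE)
}
```

`installed` is a named logical vector, so printing it shows at a glance which packages still need installing.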
I download the text using read_html() from the xml2 package (which is installed along with the tidyverse) and then find the node that interests me, in this case the one with the class mw-parser-output. Then I extract the text from this node, and split it on the \n character, to get a big vector where each element is a line of text. I also remove the first 24 lines, which are mostly blank. Let’s take a look at the first five lines:
##  "'T stung Alles an der Bléi," "An d'Villercher di songen"
##  "Hir Lidder spéit a fréi." "Du rifft de Léiw, de Kinnek,"
##  "All Déier op e Fest"
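The scraping steps described above can be sketched offline as follows. This is only an illustration under stated assumptions: `toy_html` is a made-up stand-in for the downloaded page (the real code reads a URL instead), the XPath expression is my guess at how to select the mw-parser-output node, and `n_skip` is 2 here rather than 24 because the toy page only has two junk lines.

```r
library(xml2)

# Toy stand-in for the downloaded page (hypothetical content, not the real page).
toy_html <- paste0(
  "<html><body><div class='mw-parser-output'>",
  "Navigation\nEdit\nAn d'Villercher di songen\nsecond verse line\nthird verse line",
  "</div></body></html>"
)

# read_html() also accepts a literal HTML string, which keeps this sketch offline.
page <- read_html(toy_html)

# Select the node whose class is mw-parser-output.
node <- xml_find_first(page, "//div[@class='mw-parser-output']")

# Extract the text and split it into one element per line.
all_lines <- strsplit(xml_text(node), "\n", fixed = TRUE)[[1]]

# Drop the leading junk lines (24 in the post, 2 in this toy example).
n_skip <- 2
renert_raw <- all_lines[-seq_len(n_skip)]

head(renert_raw, 5)
```

Splitting with `fixed = TRUE` treats "\n" as a literal newline rather than a regular expression, which is all that is needed here.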
The Renert is divided into 14 songs, so I’d like to create a list with 14 elements, where each element is the text of a song. Every song is titled “First Song”, “Second Song”, and so on (in Luxembourgish), so I first check which lines contain the word Gesank, which identifies the start of a song.
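Here is a small sketch of that lookup on a toy vector. The contents of `renert_raw` below are invented for illustration, and base R's grep() is just one way to do this; the original code may well use a different function.

```r
# Toy stand-in for the scraped text: three "songs" with made-up lines.
renert_raw <- c(
  "Gesank I", "verse a", "verse b",
  "Gesank II", "verse c",
  "Gesank III", "verse d", "verse e"
)

# Positions of the lines that contain the word "Gesank".
indices <- grep("Gesank", renert_raw, fixed = TRUE)

indices
```

On this toy input the title lines sit at positions 1, 4 and 6.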
indices contains the indices of where the songs start. So I need to create the indices of where the songs end. If you think about it, the first song ends where the second song begins, minus 1. So I create a new vector of indices, by first removing the index for the first song, subtracting 1, and then adding the index for the last line (using length(renert_raw)).
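The index arithmetic can be checked on a toy example. The data is invented, and the names `indices_end` and `songs` are mine, not necessarily the ones used in the original code; the final split with Map() is one possible way to get the list of songs.

```r
# Toy stand-in for the scraped text: three "songs" with made-up lines.
renert_raw <- c(
  "Gesank I", "verse a", "verse b",
  "Gesank II", "verse c",
  "Gesank III", "verse d", "verse e"
)

# Start positions, as found by searching for "Gesank".
indices <- c(1, 4, 6)

# Each song ends one line before the next one starts;
# the last song ends on the last line of the text.
indices_end <- c(indices[-1] - 1, length(renert_raw))

# Pair each start with its end and cut the text into one element per song.
songs <- Map(function(start, end) renert_raw[start:end], indices, indices_end)

length(songs)
```

Here the songs end on lines 3, 5 and 8, so `songs` has three elements, and the second one holds "Gesank II" and "verse c".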