Wakeipedia: Experimenting with Finnegans Wake Data

Who knew that Finnegans Wake would one day be reduced to cells in a spreadsheet?

For a long time, I wanted to experiment with Finnegans Wake and data visualization. A recent assignment gave me the chance, so I quickly got to work. The first thing I had to do was figure out my dataset. Finnegans Wake is fiction and relies heavily on an idioglossia of Joyce’s own design, so it is tough to pinpoint distinct data points for the book. This meant that I had to take a step back and look at the book from a very removed perspective to start: what better dataset for this book than its own lexicon and word frequencies? Studying the Wake in the past had introduced me to a couple of online tools: Fweet, which is a search engine for the book, and Finwake, which is an online annotated version. However, the most useful source turned out to be Eric Rosenbloom’s Concordance of Finnegans Wake, which he apparently compiled in the late 1990s. Data constraints run throughout this project, because no major datasets have really been constructed for the Wake.

My first task was to scrape the entire concordance that Rosenbloom developed and turn it into a manipulable dataset. I copied all the data into an Excel spreadsheet, cleaned up the parentheticals using a replace function, and separated the numbers using Excel’s “Text to Columns” feature. Now I had my base data: 61,597 unique utterances that occur throughout the Wake, along with how frequently each one appears. With this in hand, we can do very basic, primary visualizations.
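For anyone who would rather script this step than do it by hand in Excel, here is a minimal sketch of the same cleanup in Python. It assumes each concordance line looks like a word followed by its count in parentheses; that line shape, the function name parse_concordance, and the sample lines are all my own placeholders, not Rosenbloom’s actual format or data.

```python
import re

# Assumed line shape (hypothetical): a word followed by its count in
# parentheses, e.g. "alpha (12)". Adjust the pattern to the real file.
ENTRY = re.compile(r"^(?P<word>\S+)\s*\((?P<count>\d+)\)\s*$")


def parse_concordance(lines):
    """Return (word, frequency) pairs, skipping lines that do not match."""
    rows = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            rows.append((match.group("word"), int(match.group("count"))))
    return rows


# Placeholder lines, not real concordance data.
sample = ["alpha (12)", "beta (3)", "not an entry", "gamma(1)"]
rows = parse_concordance(sample)
print(rows)
```

Lines that do not fit the assumed shape are silently skipped, so it is worth comparing the number of parsed rows against the number of input lines before trusting the result.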

Using Tableau, I imported the full dataset to play around with visualizing the most commonly used phrases in the book. Unfortunately, 60,000-plus entries were too much to visualize, so I scaled the dataset down to phrases that are uttered 10 or more times. This left around 1,500 entries, which is much more manageable for the software. From that, I played with both polygon visualizations and bubble visualizations:
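The same cut can be made before the data ever reaches Tableau. This is a rough sketch, assuming the dataset is a list of (word, frequency) pairs; the function names and sample rows are hypothetical.

```python
def filter_by_min_frequency(rows, minimum=10):
    """Keep only the (word, frequency) pairs that occur at least `minimum` times."""
    return [(word, count) for word, count in rows if count >= minimum]


def top_entries(rows, n):
    """Return the n most frequent pairs, breaking ties alphabetically."""
    return sorted(rows, key=lambda row: (-row[1], row[0]))[:n]


# Placeholder rows, not real concordance counts.
sample_rows = [("alpha", 12), ("beta", 3), ("gamma", 10), ("delta", 250)]
frequent = filter_by_min_frequency(sample_rows)
print(top_entries(frequent, 2))
```

Keeping the threshold as a parameter makes it easy to try other cutoffs if 10 turns out to be too coarse or too fine for the visualization.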

Polygon Visualization
Bubble Visualization

Unfortunately, this isn’t really more than filtering data in Excel, but it helped me reach some small conclusions more easily. For example, the author-corrected version of Finnegans Wake sits at around 628 pages. Based on the data visualized, “the” appears on average 19 times per page. This shows the importance of the declarative in Joyce’s work, considering that he was trying to tell a certain history in his book of the night. Obviously you need to weed through the conjunctions, prepositions, pronouns and more, but you can find meaning even in the most basic of visualizations such as this one. Still, I needed to dig a bit deeper and scale down my dataset to get more information out of a first visual run through the Wake.

Now I tried a different method: I would filter based on character count to play around with Joyce’s portmanteaus. Words in the English language average around 5 characters (give or take a margin of error of 2), so to weed out most normal instances, I set a range of 15-30 characters when building the next set. After filtering and copying, I managed to construct a small dataset of character counts and frequency values. The one obvious variable is that an ordinary word like “acknowledgement” would make the cut because it is 15 characters long. To eliminate this, I had to go through the values manually and count only the portmanteaus of Joyce’s creation. From rockbysuckerassousyoceanal to alljawbreakical, Joyce showed a negative correlation between the number of characters in a portmanteau and the number of portmanteaus of that length. This is just one example of how using data for Finnegans Wake is an alternative way of finding answers to this still unsolved mystery of a book. Understanding Joyce’s choices in constructing the Wake is one of the building blocks that can’t be ignored.
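The length filter and the count of words at each length can be sketched in a few lines of Python. The two Wake words come from the paragraph above; “straightforward” stands in for an ordinary 15-character English word that slips through, and the function names are hypothetical.

```python
from collections import Counter


def long_words(words, shortest=15, longest=30):
    """Keep words whose character count falls inside the chosen range."""
    return [word for word in words if shortest <= len(word) <= longest]


def count_by_length(words):
    """Map each character count to the number of words of that length."""
    return dict(sorted(Counter(len(word) for word in words).items()))


words = [
    "rockbysuckerassousyoceanal",  # 26 characters
    "alljawbreakical",             # 15 characters
    "straightforward",             # ordinary English word, also 15 characters
    "river",                       # too short, filtered out
]
candidates = long_words(words)
length_counts = count_by_length(candidates)
print(length_counts)
```

The filter only narrows the field. Ordinary long words still pass, so the manual check for genuine portmanteaus remains necessary before plotting length against count.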

Portmanteau Frequency

I also wanted to take a dive into the languages used in the book, so I needed to consult outside data sources. Joyce was a polyglot, and this book uses an absurd number of languages. What would the Wake look like stacked against a language’s lexicon? To find out, I would need to find these datasets on the internet (and see whether they exist at all). Of course, another variable is that nobody has a “complete” lexicon of any language, because too many factors, such as the addition of new words, render it incomplete. What I had to work with were an English lexicon and a French lexicon. I went with these two because the former is Joyce’s native tongue and the latter was one of his more advanced languages. Joyce lived in Paris for many years and wrote parts of the book while he was there.

Excel Crash

The first lexicon I tackled was English. This set contained over 300,000 entries and was a lot for my computer to handle. I placed the data next to the book’s unique values and used the Excel formula =IF(ISERROR(MATCH(X1,$Y:$Y,0)),"",X1), where X is the column holding your words and Y is the column holding your lexicon, to fill a new column with the matches. This calculation took a good 15 minutes and almost crashed my computer in the process. However, it led to the finding that there are 26,777 unique English entries in the Wake. The one thing this count leaves out is place names and the names of people, but it is still an incredible find: fewer than half of the Wake’s unique entries (26,777 of 61,597) are in “English.” After this, I used the same process to find unique French entries. I ran the calculation, found 5,189 unique records, and then ran those against the English records to filter out potential cognates. I did this by using conditional formatting to highlight duplicate values, sorting on the new formatting, and deleting the duplicate entries. This left only 495 entries. Again, we have to be skeptical of the dataset because these lexicons are subject to change and potentially lack information. Still, I trust the counts because any deviation from a dictionary word is most likely a portmanteau and so would not match a lexicon entry.
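For comparison, here is a sketch of the same lookup in Python, where set membership does the work of MATCH and avoids rescanning the whole lexicon for every word. The function names and the tiny word lists are hypothetical stand-ins for the real files, and lowercasing both sides is my assumption, meant to mirror the fact that Excel’s MATCH ignores case.

```python
def match_lexicon(words, lexicon):
    """Return the words that appear in the lexicon, ignoring case."""
    lexicon_set = {entry.lower() for entry in lexicon}
    return [word for word in words if word.lower() in lexicon_set]


def drop_shared(french_matches, english_matches):
    """Remove French matches that were also matched as English."""
    english_set = {word.lower() for word in english_matches}
    return [word for word in french_matches if word.lower() not in english_set]


# Placeholder lists standing in for the Wake's unique entries and the lexicons.
wake_words = ["table", "Chat", "maison", "zzyzx"]
english_lexicon = ["table", "chat", "house"]
french_lexicon = ["table", "chat", "maison"]

english_matches = match_lexicon(wake_words, english_lexicon)
french_matches = match_lexicon(wake_words, french_lexicon)
french_only = drop_shared(french_matches, english_matches)
print(len(english_matches), len(french_matches), len(french_only))
```

The last step corresponds to deleting the highlighted duplicates: whatever remains in french_only matched the French lexicon but not the English one.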

Amount of Entries Bar
Amount of Entries Pie

Unique English and French entries (once cognates are filtered out) don’t even make up half the work: 27,272 of the 61,597 unique entries, or roughly 44 percent. This data was fun to work with, but some factors had to be neglected simply because of the amount of data and the methodologies involved. For example, the real meat of the portmanteaus needs much more time to be explored because of how many languages and meanings they encompass. French and English were two great options for comparison, but that’s not where the book ends. The Wake is rumored to have over 70 different languages in it, and the English and French lexicons were simply the best and most thorough options I could find.

However, I didn’t end it there

I decided to build one last dataset for a small and fun project: the thunderwords. The thunderwords of Finnegans Wake are ten 100-character words (the last being 101 characters) placed throughout the book. Each word can be broken into bits taken from various languages, and again, these have their variables as well. Words 6, 8 and 10 were all standardized for a single language, and word 5 had many unrecognizable parts. One last variable is that I based the collection of the meanings on Adam Harvey’s YouTube video series “Don’t Panic: It’s Only Finnegans Wake” and on Finwake. For this small visualization, I decided to create a map showing the countries that the words were drawn from. For example, if one of the thunderwords contained 7 different languages, that could be attributed to 7 different countries on the map. For this map, I downloaded a 2014 world boundaries shapefile that contained administrative boundaries for countries. I then merged the data based on the admin boundary names (simple country names) and the column I built named “Area.”

After the data was merged, I set my basemap and chose my visual factors: “country,” “value” (the number of times a thunderword is derived from that country’s mother tongue), and one example from the table of a piece of a thunderword that relates to that country. Considering that it’s Joyce’s book of the night, I wanted to give the map a darker feel in terms of aesthetics. The map is a small way of looking at which countries played heavily in the formation of the thunderwords.
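The merge on country name and the three display fields described above can be sketched without any mapping software. Everything in this example is a placeholder: the fragment rows are invented to show the shape of the table, the boundary records stand in for the shapefile’s attribute table, and the function name merge_counts is hypothetical.

```python
from collections import Counter

# Hypothetical rows: (thunderword number, fragment, Area). Not real data.
fragments = [
    (1, "fragment-a", "Ireland"),
    (1, "fragment-b", "France"),
    (2, "fragment-c", "France"),
]

# Stand-in for the shapefile's attribute table, keyed by simple country name.
boundaries = [{"name": "France"}, {"name": "Ireland"}, {"name": "Italy"}]


def merge_counts(boundaries, fragments):
    """Attach a fragment count and one example fragment to each country."""
    counts = Counter(area for _, _, area in fragments)
    examples = {}
    for _, fragment, area in fragments:
        examples.setdefault(area, fragment)  # keep the first example seen
    return [
        {
            "country": record["name"],
            "value": counts.get(record["name"], 0),
            "example": examples.get(record["name"]),
        }
        for record in boundaries
    ]


merged = merge_counts(boundaries, fragments)
for row in merged:
    print(row)
```

Countries with no matching “Area” value come through with a value of 0 and no example, which makes it easy to spot names that failed to join because of spelling differences between the two tables.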

For the data used in this project, please see my GitHub repository
