# #SWDchallenge: artisanal data

This post is my entry for the #SWDchallenge: artisanal data. It also visualizes two common measures of word importance: tf and tf-idf. Variations of tf-idf (term frequency-inverse document frequency) are often used by search engines and text-based recommender systems.

I use the transcript of the “learning dataviz” episode of #SWDpodcast, where 12 data visualization professionals share their stories and recommendations.

I find a lot of wisdom and inspiration in this episode. The Storytelling with Data (SWD) book had been on my shelf for over a year, and it was this podcast episode that sparked my interest in data visualization, inspired me to read the book in a few days, and got me participating in data viz challenges. So I thought I would enjoy spending more time with the episode by analysing the words in it.

The analysis was performed in R, a free software environment for statistical computing and graphics, mainly using the tidyverse packages, including ggplot2 for visualization.

# text cleaning

Text cleaning involves five steps:

1. Extract 12 interviews from the transcript.
2. Replace contractions. For example: “It’s a fascinating field” becomes “It is a fascinating field”.
3. Split sentences into single words.
4. Lemmatize words.
5. Remove stop words, the most common[1] words in the English language. Skipping this step may lead to conclusions like the one on the map below:

Here is a quote from Jeffrey Shaffer’s interview to illustrate steps 3-5:

library(tidyverse)
library(tidytext)

sentence <- "I learned R and started doing visualizations in R"
# SMART stop words from the tm package, keeping "r" because it names the R language
stop_words <- setdiff(tm::stopwords(kind = "SMART"), "r")

example <- tibble(sentence) %>%
  unnest_tokens(output = "word", input = sentence) %>%  # step 3: one word per row
  mutate(word_lemma = textstem::lemmatize_words(word),  # step 4: lemmatize
         stop_word = word %in% stop_words)              # step 5: flag stop words

example
## # A tibble: 9 x 3
##   word           word_lemma    stop_word
##   <chr>          <chr>         <lgl>
## 1 i              i             TRUE
## 2 learned        learn         FALSE
## 3 r              r             FALSE
## 4 and            and           TRUE
## 5 started        start         FALSE
## 6 doing          do            TRUE
## 7 visualizations visualization FALSE
## 8 in             in            TRUE
## 9 r              r             FALSE
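
Step 2 is not covered by the example above. A minimal base R sketch of contraction replacement, assuming a tiny hand-made lookup table, might look like this. The `contractions` vector and the `replace_contractions()` helper are hypothetical names for illustration, not the code used for this post:

```r
# Hypothetical lookup table: contraction -> expanded form (illustration only)
contractions <- c("it's" = "it is", "i'm" = "i am", "don't" = "do not")

replace_contractions <- function(text, lookup = contractions) {
  for (pattern in names(lookup)) {
    # \\b restricts the match to whole words; matching ignores case
    text <- gsub(paste0("\\b", pattern, "\\b"), lookup[[pattern]], text,
                 ignore.case = TRUE, perl = TRUE)
  }
  text
}

replace_contractions("It's a fascinating field")
# "it is a fascinating field"
```

The replacement comes out in lower case, which does not matter here because the words are lowercased in step 3 anyway.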

If you are already familiar with tf (term frequency), idf (inverse document frequency) and tf-idf (term frequency-inverse document frequency) then you may skip this section and go directly to the visual comparison of these two measures of word importance.

A few definitions in the context of this analysis:

• term is a single word (in general, a term may be a combination of two words, a sentence, etc.)
• document is an interview - any single interview from the “learning dataviz” episode.
• collection of documents is a collection of 12 interviews in the episode.

Term frequency (tf) measures how frequently a term occurs in a document. Since every document is different in length, it is often divided by the document length[2]:

$\text{tf}(t) = \frac{\text{Number of times term t appears in a document}}{\text{Total number of terms in the document}}.$

For example, if the word “learn” is used 8 times and the interview contains 540 words then

$\text{tf}(learn)=8/540\approx0.015.$
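
As a rough illustration, the same calculation in base R could look like the sketch below. The `term_frequency()` helper is a hypothetical name, and the token vector is simply the non-stop-word lemmas from the example sentence in the text cleaning section:

```r
# tf for every distinct token: count divided by the document length
term_frequency <- function(tokens) {
  table(tokens) / length(tokens)
}

# Lemmas left from "I learned R and started doing visualizations in R"
tokens <- c("learn", "r", "start", "visualization", "r")
tf <- term_frequency(tokens)
tf[["r"]]          # 2 of 5 tokens: 0.4

# The "learn" example from the text: 8 occurrences in 540 words
round(8 / 540, 3)  # 0.015
```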

When used to measure word importance, the shortcoming of tf is that it is calculated using a single document in isolation from other documents in the collection. Notice that a word’s tf may be different for each document within a collection.

Inverse document frequency (idf) measures how important a term is in the context of other documents. Idf is computed as the logarithm of the number of documents in the collection divided by the number of documents where the specific term appears:

$\text{idf}(t) = \log_e\left(\frac{\text{Total number of documents}}{\text{Number of documents with term t in it}}\right).$

Here is the relationship between the number of documents with term $$t$$ in it and the term’s idf, assuming there are 12 documents in the collection:
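
The relationship can also be tabulated directly from the formula. This is a small base R sketch with variable names of my choosing:

```r
n_docs <- 12                          # documents in the collection
docs_with_term <- 1:n_docs            # the term may appear in 1 to 12 of them
idf <- log(n_docs / docs_with_term)   # log() is the natural logarithm in R

data.frame(docs_with_term, idf = round(idf, 3))
# idf falls from 2.485 (term in 1 document) to 0 (term in all 12 documents)
```

Calling `plot(docs_with_term, idf, type = "b")` on these vectors draws the decreasing curve.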

The “learning dataviz” episode of #SWDpodcast is about learning data visualization. Naturally, the word “learn” appears at least once in each of the 12 interviews. Does this mean that the word is important? For describing the whole episode, yes. For describing any particular interview from this episode, no, at least as far as idf is concerned, because all interviewees use the word “learn”.

The word “learn” appears in all 12 interviews, so its idf is zero:

$\text{idf}(learn)=\log_e(12/12)=\log_e(1)=0.$

This means that the word “learn” is completely unimportant for describing any specific interview in the context of the other 11 interviews. Other words with zero importance here are “data”, “make”, “thing”, “work” and “year”.

Notice that the idf of each term is the same for all documents in a collection.

Term frequency-inverse document frequency (tf-idf) is the product of tf and idf:

$\text{tfidf}(t) = \text{tf}(t) \times \text{idf}(t).$

No matter how high $$\text{tf}(learn)$$ is, if $$\text{idf}(learn)=0$$ then $$\text{tfidf}(learn)$$ is also zero. Conversely, if a word is mentioned in only one of the 12 interviews, its tf is multiplied by $$\log_e(12/1)\approx2.48$$. This is what happened to the word “copy”: its importance rank rose from #40, measured by tf, to #3, measured by tf-idf.

Tf-idf gives the highest weight to words that are “common locally and rare globally”[3]. “Common locally” refers to the tf component, while “rare globally” refers to the idf component.
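
To see the three measures working together, here is a minimal base R sketch on a made-up collection of three tiny documents. The tokens and the `tf_idf()` helper are hypothetical and are not the podcast data or the code behind this post:

```r
# Toy collection: each element holds one document's cleaned tokens
docs <- list(
  a = c("learn", "data", "copy", "copy"),
  b = c("learn", "data", "fun"),
  c = c("learn", "chart", "data", "data")
)

tf_idf <- function(docs) {
  n_docs <- length(docs)
  vocab <- unique(unlist(docs))
  # Number of documents that contain each word
  doc_freq <- sapply(vocab, function(w) sum(sapply(docs, function(d) w %in% d)))
  idf <- log(n_docs / doc_freq)
  # One row per (document, word) pair
  do.call(rbind, lapply(names(docs), function(name) {
    counts <- table(docs[[name]])
    words <- names(counts)
    tf <- as.numeric(counts) / length(docs[[name]])
    data.frame(document = name, word = words, tf = tf,
               idf = unname(idf[words]), tf_idf = tf * unname(idf[words]),
               stringsAsFactors = FALSE)
  }))
}

result <- tf_idf(docs)
result[order(-result$tf_idf), ]
```

“learn” and “data” occur in every toy document, so their tf-idf is zero everywhere, while “copy”, frequent in document `a` and absent elsewhere, comes out on top.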

# tf-idf: the effect of idf on tf visualised

Next I use the most important words, as measured by tf and tf-idf, from Andy Cotgreave’s interview in the “learning dataviz” episode of #SWDpodcast. The plot below visualises changes in ranks: slopes go from tf ranks on the left to tf-idf ranks on the right. Ranks are positions relative to other words; rank 1 means the highest importance. There are 271 unique words, so the lowest possible rank is 271.

Tf gives the highest weight to words that are simply used most frequently. It is a local measure in the sense that it ignores other interviews in the same podcast episode.

Multiplication of tf by idf, which gives tf-idf, diminishes the importance of words that are common across all interviews, such as “data” and “learn”. At the same time it amplifies the importance of rare words, such as “copy” and “fun”.

The word “copy” is used only in Andy’s interview, in the context of recommending Austin Kleon’s book “Steal Like an Artist”. The word “fun” is used in only two other interviews. Andy says “… get out of your comfort zone and have fun. Have fun. My gosh, you are allowed to have fun …”. Let’s do it!

Below is the data used for plotting. There are 12 interviews. Andy Cotgreave’s interview contains 540 words, 271 of them unique. There are six words with zero tf-idf (three of them in the table below); they share tf-idf ranks from 266 to 271.

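| word | count | total words | tf | documents with word | idf | tf-idf | rank by tf | rank by tf-idf |
|---|---|---|---|---|---|---|---|---|
| data | 21 | 540 | 0.039 | 12 | 0.000 | 0.000 | 1 | 266 |
| tableau | 11 | 540 | 0.020 | 5 | 0.875 | 0.018 | 2 | 1 |
| work | 10 | 540 | 0.019 | 12 | 0.000 | 0.000 | 3 | 270 |
| book | 9 | 540 | 0.017 | 11 | 0.087 | 0.001 | 4 | 246 |
| learn | 8 | 540 | 0.015 | 12 | 0.000 | 0.000 | 6 | 267 |
| good | 8 | 540 | 0.015 | 11 | 0.087 | 0.001 | 5 | 248 |
| amaze | 5 | 540 | 0.009 | 2 | 1.792 | 0.017 | 12 | 2 |
| fun | 4 | 540 | 0.007 | 3 | 1.386 | 0.010 | 24 | 6 |
| week | 4 | 540 | 0.007 | 2 | 1.792 | 0.013 | 32 | 5 |
| single | 4 | 540 | 0.007 | 2 | 1.792 | 0.013 | 27 | 4 |
| copy | 3 | 540 | 0.006 | 1 | 2.485 | 0.014 | 40 | 3 |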
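
The tf, idf and tf-idf columns can be re-derived from the counts. The base R sketch below does this for five of the words, with the counts copied from the data above; the data frame and its column names are my own:

```r
# Counts for Andy Cotgreave's interview, copied from the data above
words <- data.frame(
  word = c("data", "tableau", "amaze", "fun", "copy"),
  count = c(21, 11, 5, 4, 3),
  docs_with_word = c(12, 5, 2, 3, 1),
  stringsAsFactors = FALSE
)
total_words <- 540   # words in the interview
n_docs <- 12         # interviews in the episode

words$tf <- words$count / total_words
words$idf <- log(n_docs / words$docs_with_word)
words$tf_idf <- words$tf * words$idf

# Sort by tf-idf, highest first: tableau, amaze, copy, fun, data
words[order(-words$tf_idf), ]
```

Only five words are included, so the row order reproduces their relative order but not the full ranks out of 271.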

References:

1. Stop words from the SMART information retrieval system, available in the tm package, served as the basis. They were supplemented with the names of the podcast host and guests. One stop word was removed: the letter “r”, which one of the podcast guests uses to refer to the R software. It is definitely worth knowing what a prepackaged stop word list contains!

3. I heard the phrase “common locally and rare globally” for the first time here and I liked this definition a lot.