Chapter 15: Unsupervised Machine Learning

Latent Dirichlet Allocation (LDA)

In a previous section, I first explored how the sentiment in the SOTU addresses evolved over the 20th century. Then, we looked at decade-specific vocabulary. Paired with prior knowledge of what happened throughout the 20th century, this was enough to gain some insight. However, another approach to inferring meaning from text is to search it for topics.

The two main assumptions of LDA are as follows:

  • Every document is a mixture of topics.
  • Every topic is a mixture of words.

Hence, individual documents need not be distinct in terms of their content. They can be related – if they contain the same topics. This is more in line with how natural language is actually used.
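To make these two assumptions concrete, here is a purely illustrative toy sketch (the numbers are made up, not estimated from any corpus): each topic is a probability distribution over the vocabulary, and each document has its own mix of topics.

library(tibble)

# Every topic is a mixture of words: one probability distribution over the vocabulary per topic
topic_word <- tribble(
  ~topic, ~love, ~heart, ~car, ~road,
  1,      0.50,  0.40,   0.05, 0.05,   # a "romance" topic
  2,      0.05,  0.05,   0.50, 0.40    # a "driving" topic
)

# Every document is a mixture of topics: a song can be 70 % topic 1 and 30 % topic 2
doc_topic <- tribble(
  ~doc_id, ~topic_1, ~topic_2,
  1,       0.7,      0.3,
  2,       0.1,      0.9
)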

The following graphic depicts a flowchart of text analysis with the tidytext package.

Text analysis flowchart

What becomes evident is that the actual topic modeling does not happen within tidytext. For this, the text needs to be transformed into a document-term matrix and then passed on to the topicmodels package (Grün et al. 2020), which takes care of the modeling process. Thereafter, the results are turned back into a tidy format using broom, so that they can be visualized with ggplot2. The following analysis focuses on Taylor Swift lyrics and partially borrows from this blog post by Julia Silge.
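For reference, the classic route from the flowchart would look roughly like the following minimal sketch (illustrative only, using a made-up toy tibble tidy_counts; the analysis below deviates from this route and uses the stm package together with cast_sparse() instead).

library(tidyverse)
library(tidytext)
library(topicmodels)

# A tiny made-up tidy word-count tibble (doc_id, word, n), just to show the mechanics
tidy_counts <- tribble(
  ~doc_id, ~word,   ~n,
  1,       "love",   3,
  1,       "heart",  2,
  2,       "car",    4,
  2,       "road",   2,
  3,       "love",   1,
  3,       "road",   3,
  4,       "heart",  2,
  4,       "car",    1
)

dtm <- tidy_counts |> cast_dtm(doc_id, word, n)           # tidy tibble -> document-term matrix
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1989))   # the modeling happens in topicmodels
tidy(lda_fit, matrix = "beta")                            # back into a tidy format for ggplot2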

Document-term matrix

To search for the topics prevalent in the individual songs with LDA, we first need to transform the tidy tibble into a document-term matrix. This can be achieved with cast_dtm(); since we will use the stm package below, we instead cast the data into a sparse matrix with cast_sparse().

needs(taylor, tidyverse, tidytext, SnowballC, stm, spacyr)

taylor_album_songs |> glimpse()
Rows: 240
Columns: 29
$ album_name          <chr> "Taylor Swift", "Taylor Swift", "Taylor Swift", "T…
$ ep                  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ album_release       <date> 2006-10-24, 2006-10-24, 2006-10-24, 2006-10-24, 2…
$ track_number        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ track_name          <chr> "Tim McGraw", "Picture To Burn", "Teardrops On My …
$ artist              <chr> "Taylor Swift", "Taylor Swift", "Taylor Swift", "T…
$ featuring           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ bonus_track         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ promotional_release <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ single_release      <date> 2006-06-19, 2008-02-03, 2007-02-19, NA, NA, NA, N…
$ track_release       <date> 2006-06-19, 2006-10-24, 2006-10-24, 2006-10-24, 2…
$ danceability        <dbl> 0.580, 0.658, 0.621, 0.576, 0.418, 0.589, 0.479, 0…
$ energy              <dbl> 0.491, 0.877, 0.417, 0.777, 0.482, 0.805, 0.578, 0…
$ key                 <int> 0, 7, 10, 9, 5, 5, 2, 8, 4, 2, 2, 8, 7, 4, 10, 5, …
$ loudness            <dbl> -6.462, -2.098, -6.941, -2.881, -5.769, -4.055, -4…
$ mode                <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,…
$ speechiness         <dbl> 0.0251, 0.0323, 0.0231, 0.0324, 0.0266, 0.0293, 0.…
$ acousticness        <dbl> 0.57500, 0.17300, 0.28800, 0.05100, 0.21700, 0.004…
$ instrumentalness    <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, …
$ liveness            <dbl> 0.1210, 0.0962, 0.1190, 0.3200, 0.1230, 0.2400, 0.…
$ valence             <dbl> 0.425, 0.821, 0.289, 0.428, 0.261, 0.591, 0.192, 0…
$ tempo               <dbl> 76.009, 105.586, 99.953, 115.028, 175.558, 112.982…
$ time_signature      <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
$ duration_ms         <int> 232107, 173067, 203040, 199200, 239013, 207107, 24…
$ explicit            <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ key_name            <chr> "C", "G", "A#", "A", "F", "F", "D", "G#", "E", "D"…
$ mode_name           <chr> "major", "major", "major", "major", "major", "majo…
$ key_mode            <chr> "C major", "G major", "A# major", "A major", "F ma…
$ lyrics              <list> [<tbl_df[55 x 4]>], [<tbl_df[33 x 4]>], [<tbl_df[…
tidy_taylor <- taylor_album_songs |>
    unnest(lyrics) |> 
    group_by(album_name, track_name, album_release) |>
    summarize(text = str_c(lyric, collapse = " ")) |> 
    rowid_to_column("doc_id") |> 
    mutate(doc_id = doc_id |> as.integer())
`summarise()` has grouped output by 'album_name', 'track_name'. You can
override using the `.groups` argument.
spacy_initialize(model = "en_core_web_sm")
successfully initialized (spaCy Version: 3.8.3, language model: en_core_web_sm)
preprocessed_taylor <- spacy_parse(tidy_taylor) |> 
  entity_consolidate() |> 
  filter(pos != "PUNCT") |> 
  filter(str_length(token) > 1) |> 
  anti_join(get_stopwords(), by = join_by(lemma == word)) |> 
  mutate(lemma = str_to_lower(lemma)) |> 
  add_count(doc_id, lemma) |> 
  filter(!str_detect(lemma, "'")) |> 
  mutate(doc_id = doc_id |> as.integer()) |> 
  arrange(doc_id)

taylor_sparse <- preprocessed_taylor |>  
  select(doc_id, lemma, n) |>
  cast_sparse(doc_id, lemma, n)

A sparse matrix is similar to a document-term matrix in that it contains documents (rows) and terms (columns) and specifies how often a term appears in a document; it merely stores the many zero entries more efficiently.

taylor_sparse |> as.matrix() %>% .[1:5, 1:5]
  flamingo pink sunrise_boulevard_clink clink young
1        1    1                       1     1     1
2        0    0                       0     0     0
3        0    0                       0     0     0
4        0    0                       0     0     0
5        0    0                       0     0     3
taylor_sparse |> dim()
[1]  240 3674

Inferring the number of topics

We need to tell the model in advance how many topics we assume to be present in the documents. Since we have not read (or listened to) all the Taylor Swift songs (if we had, we would hardly need a topic model), we cannot make an educated guess about how many topics are in there.

Making guesses

One approach might be to just provide it with wild guesses on how many topics might be in there and then try to make sense of them afterward.

needs(broom)

taylor_lda_k13 <- stm(taylor_sparse, K = 13, verbose = FALSE)

The tidy() function from the broom package (Robinson 2020) brings the stm output back into a tidy format. It consists of three columns: the topic, the term, and beta, which is the probability that the term stems from this topic.

beta_taylor_k13 <- tidy(taylor_lda_k13) # gives word distributions per topic
gamma_taylor_k13 <- tidy(taylor_lda_k13, matrix = "gamma") # gives topic probabilities per document

Now, we can wrangle it a bit, and then visualize it with ggplot2.

top_terms_k13 <- beta_taylor_k13 |>
  group_by(topic) |>
  slice_max(beta, n = 5, with_ties = FALSE) |>
  ungroup() |>
  arrange(topic, -beta)

top_terms_k13 |>
  mutate(topic = factor(topic),
         term = reorder_within(term, beta, topic)) |>
  ggplot(aes(term, beta, fill = topic)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~topic, scales = "free", ncol = 2) +
  coord_flip()

ggsave("tylor_k13_beta.png", width = 7, height = 10, bg = "white")

Raw betas won’t get us very far: they just show us the most frequent words per topic, and those words may well be frequent in other topics, too. Therefore, we need metrics that also take a word’s exclusivity into account. stm’s labelTopics() reports several of them: FREX balances a word’s frequency within a topic against its exclusivity to that topic, Lift down-weights words that are frequent across the whole corpus, and Score is a related log-frequency-based weighting.
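To build an intuition for what these metrics do, here is a minimal lift-like re-ranking of the tidy betas from above (an illustrative approximation, not the exact formula stm uses): a term’s within-topic probability is divided by its average probability across all topics, so that words which are frequent everywhere get down-weighted.

# Illustrative lift-like ranking: distinctive terms per topic instead of merely frequent ones
beta_distinct_k13 <- beta_taylor_k13 |>
  group_by(term) |>
  mutate(beta_overall = mean(beta)) |>              # how probable is the term across all topics?
  ungroup() |>
  mutate(distinctiveness = beta / beta_overall) |>  # values far above 1: the term is specific to this topic
  group_by(topic) |>
  slice_max(distinctiveness, n = 5, with_ties = FALSE) |>
  ungroup() |>
  arrange(topic, -distinctiveness)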

labelTopics(taylor_lda_k13, topics = 1:13)
Topic 1 Top Words:
     Highest Prob: red, like, snow, i, beach, remember, come 
     FREX: red, snow, beach, style, weird, pocketful, rare 
     Lift: james_dean, slick, condition, rosy, request, hindsight, a_few_moon_ago 
     Score: red, beach, snow, style, weird, pocketful, passionate 
Topic 2 Top Words:
     Highest Prob: yet, clear, wood, uh, i, go, huh 
     FREX: wood, yet, clear, delicate, huh, happiness, uh 
     Lift: glorious, discover, wood, last_december, furniture, airplane, brake 
     Score: yet, wood, clear, happiness, huh, the_great_war, furniture 
Topic 3 Top Words:
     Highest Prob: oh, i, like, come, know, just, never 
     FREX: getaway, twenty_-_two, oh, trouble, lavender, rain, grow 
     Lift: anthem, proudly, revolution, everythi, ing, glaad, surreal 
     Score: trouble, oh, getaway, twenty_-_two, bedroom, lucky, lavender 
Topic 4 Top Words:
     Highest Prob: wish, i, know, never, like, give, now 
     FREX: wish, live, give, sign, wall, long, singing 
     Lift: 2_am, crooked, fightin, misty_morning, special, superstar, morning_loneliness 
     Score: wish, crooked, singing, aimee, drew, teardrop, the_very_first_night 
Topic 5 Top Words:
     Highest Prob: run, i, want, call, one, say, like 
     FREX: run, call, london, man, dear, darling, ring 
     Lift: mile, depressed, guide, twenty_year, homecoming, march, hopeless 
     Score: run, throat, london, dear, splash, reader, sprinkler 
Topic 6 Top Words:
     Highest Prob: love, go, i, say, know, back, can 
     FREX: la, today, love, beautiful, way, silence, smile 
     Lift: flamingo, sunrise_boulevard_clink, clink, aquamarine, swimmin, handprints, cement 
     Score: la, fairytale, today, fallin, hopin, december, tide 
Topic 7 Top Words:
     Highest Prob: go, shake, i, just, think, can, say 
     FREX: shake, hate, finally, breathe, fake, clean, nice 
     Lift: faker, peaceful, applause, sabotage, moth, certain, sweetness 
     Score: shake, lightnin, mm_-_mm, finally, hater, evermore, haha 
Topic 8 Top Words:
     Highest Prob: i, just, make, like, look, get, karma 
     FREX: karma, bye, gorgeous, usin, boyfriend, di, belong 
     Lift: dash, bye, tale, agree, directly, exhaust, anti 
     Score: karma, bye, gorgeous, florida, boyfriend, usin, drug 
Topic 9 Top Words:
     Highest Prob: know, i, now, like, well, just, get 
     FREX: know, closure, bless, well, everybody, man, killin 
     Lift: fox, attitude, temper, picket, april_29th, proposition, version 
     Score: bless, closure, killin, 18_hour_ago, heartbeat, empire, shoulda 
Topic 10 Top Words:
     Highest Prob: see, daylight, new_york, i, dance, welcome, now 
     FREX: daylight, new_york, welcome, maroon, waitin, dance, feeling 
     Lift: daylight, aglow, kaleidoscope, new_york_welcome, soundtrack, unforgiven, 20_-_year 
     Score: daylight, new_york, welcome, maroon, aglow, soundtrack, waitin 
Topic 11 Top Words:
     Highest Prob: ooh, i, never, get, ever, like, wanna 
     FREX: ooh, starlight, wonderland, end, cornelia_street, hoo, whoa 
     Lift: important, wonderland, mystify, eeh, 19, memories, stained 
     Score: ooh, starlight, wonderland, cornelia_street, hoo, eeh, whoa 
Topic 12 Top Words:
     Highest Prob: stay, i, think, hey, get, now, like 
     FREX: stay, hey, ha, mad, blood, follow, alive 
     Lift: este, lettin, aside, og, d.o.c., tlc, od 
     Score: hey, ha, blood, stay, mad, solve, dramatic 
Topic 13 Top Words:
     Highest Prob: time, last, i, know, say, still, ask 
     FREX: last, mr., bet, ask, babe, fiction, timeless 
     Lift: careful, ba, sweeter, proud, familiarity, breed, contempt 
     Score: ho, mr., fiction, timeless, last, bet, electric 

Or look at which documents belong to which topics the most:

gamma_prep <- gamma_taylor_k13 |>
  arrange(topic, -gamma) |> 
  left_join(tidy_taylor, by = c("document" = "doc_id")) |>  
  select(topic, document, gamma, album_name, track_name) |> 
  mutate(topic = factor(topic),
         term = reorder_within(document, gamma, topic))

gamma_tbl <- gamma_prep |> 
  group_by(topic) |>
  slice_max(gamma, n = 5, with_ties = FALSE) |> 
  ungroup() |>
  select(topic, track_name) |> 
  mutate(id = rep(1:5, times = 13)) |>
  pivot_wider(names_from = topic, values_from = track_name, names_prefix = "Topic_")

Okay, still tricky. Let’s check which topics co-occur within albums. (This graph is taken from Julia Silge’s blog post linked above; I take zero credit for it.)

gamma_prep |> 
  ggplot(aes(gamma, topic, fill = topic)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  facet_wrap(vars(album_name)) +
  labs(x = expression(gamma))

Okay, we see some differences now, especially that certain topics are more prevalent in certain albums (for instance, topic 9 in folklore and topic 12 in Fearless). However, this is still not perfect. Let’s start from scratch and do it properly.

More elaborate methods

LDA offers a couple of parameters to tune, but the most crucial one probably is K, the number of topics. The stm package offers the searchK() function, which allows us to train multiple models with different numbers of topics and then compare them based on several diagnostics.

determine_k <- searchK(taylor_sparse, K = seq(5, 40, by = 1), N = 20)
plot(determine_k)

This graph can be read as follows:

  • Held-out likelihood (top left, higher is better) measures how well our model predicts unseen data. We should look for where the curve starts plateauing – topics beyond this point only offer diminishing returns, as the main patterns in the data have already been captured.
  • Residuals (top right, lower is better) assess model fit. The curve should be relatively stable, with gradual improvement as K increases. Large spikes indicate convergence problems or model instability at that K value, so we should avoid those values.
  • Semantic coherence (bottom left, higher is better) evaluates whether the top words in each topic tend to co-occur in the same documents. Topics with high coherence are generally more interpretable. This metric typically decreases as you add more topics, reflecting the trade-off between granularity and interpretability.
  • Lower bound (bottom right, higher is better) is a log-likelihood measure that always improves with more topics, so it is less useful for model selection on its own.

Based on these metrics, we might choose K = 20 as a good balance between model fit and interpretability, mainly due to the semantic coherence and the held-out likelihood starting to plateau around this point.
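If you want to inspect the exact numbers instead of relying on the base-R plot() output, the diagnostics live in determine_k$results. A small sketch (assuming the column names K, heldout, residual, and semcoh that current stm versions return; the columns may be stored as list columns, hence the unlisting):

searchk_metrics <- determine_k$results |>
  as_tibble() |>
  mutate(across(everything(), ~ as.numeric(unlist(.x)))) |>  # flatten potential list columns
  select(K, heldout, residual, semcoh) |>
  pivot_longer(-K, names_to = "metric", values_to = "value")

searchk_metrics |>
  ggplot(aes(K, value)) +
  geom_line() +
  facet_wrap(vars(metric), scales = "free_y")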

Sense-making

Now, the harder part begins: making sense of the different topics. In LDA, words can appear in multiple topics, so topics are not perfectly distinguishable by their vocabulary. Also, as the number of topics grows, plotting all of them becomes less and less instructive. Let’s do this with our optimized K:

taylor_lda_k20 <- stm(taylor_sparse, K = 20, verbose = FALSE, seed = 1989)

beta_taylor_k20 <- tidy(taylor_lda_k20) # gives word distributions per topic
gamma_taylor_k20 <- tidy(taylor_lda_k20, matrix = "gamma") # gives topic probabilities per document

top_terms_k20 <- beta_taylor_k20 |>
  group_by(topic) |>
  slice_max(beta, n = 5, with_ties = FALSE) |>
  ungroup() |>
  arrange(topic, -beta)

top_terms_k20 |>
  #filter(topic %in% 1:10) |> 
  mutate(topic = factor(topic),
         term = reorder_within(term, beta, topic)) |>
  ggplot(aes(term, beta, fill = topic)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(~topic, scales = "free", ncol = 3) +
  coord_flip()

labelTopics(taylor_lda_k20, topics = 1:20)
Topic 1 Top Words:
     Highest Prob: red, like, remember, i, well, love, know 
     FREX: red, rare, stair, wind, hair, remember, sacred 
     Lift: scarf, disposition, upstate, all_these_day, patriarchy, three_month, refrigerator 
     Score: red, rare, scarf, stair, prayer, wind, sacred 
Topic 2 Top Words:
     Highest Prob: yet, clear, wood, i, go, can, good 
     FREX: wood, yet, clear, delicate, happiness, decide, cool 
     Lift: four_year, bike, wood, discover, last_december, furniture, airplane 
     Score: yet, wood, clear, happiness, delicate, whether, decide 
Topic 3 Top Words:
     Highest Prob: oh, i, like, just, yeah, never, know 
     FREX: getaway, oh, twenty_-_two, lucky, lavender, car, work 
     Lift: anthem, proudly, glaad, surreal, damned, the_1950s, viral 
     Score: getaway, oh, twenty_-_two, lucky, lavender, uh, romantic 
Topic 4 Top Words:
     Highest Prob: wish, i, know, never, can, time, like 
     FREX: wish, live, singing, drew, guitar, question, another 
     Lift: 2_am, special, superstar, fifteen_second_later, clappin, 19, memories 
     Score: wish, singing, crooked, drew, aimee, teardrop, one_day 
Topic 5 Top Words:
     Highest Prob: run, go, i, know, see, take, one 
     FREX: run, begin, london, dear, romeo, sprinkler, fireplace 
     Lift: romeo, juliet, guide, sprinkler, fireplace, gon, hello_"_little_do_i_know 
     Score: begin, sprinkler, fireplace, london, splash, guide, reader 
Topic 6 Top Words:
     Highest Prob: love, go, way, can, know, i, la 
     FREX: la, love, today, way, beautiful, fairytale, fallin 
     Lift: incredible, breathless, rollercoaster, tennessee, enjoy, highgate, enchant 
     Score: la, fairytale, fallin, today, hopin, fancy, foolish 
Topic 7 Top Words:
     Highest Prob: go, shake, i, hate, nice, break, thing 
     FREX: shake, nice, hate, fake, mm_-_mm, exist, catch 
     Lift: cruisin, heartbreakers, faker, off, peaceful, finance, sweetness 
     Score: shake, mm_-_mm, nice, hater, evermore, tiger, indifference 
Topic 8 Top Words:
     Highest Prob: i, make, karma, look, just, like, time 
     FREX: karma, gorgeous, usin, boyfriend, di, florida, belong 
     Lift: tock, waves, agree, directly, exhaust, anti, karma 
     Score: karma, gorgeous, florida, boyfriend, usin, drug, belong 
Topic 9 Top Words:
     Highest Prob: know, i, now, well, get, like, just 
     FREX: everybody, closure, know, killin, young, yes, fifteen 
     Lift: version, waistline, vulture, circlin, clouds, fox, baby_(_baby 
     Score: closure, everybody, killin, fifteen, fearless, yes, infidelity 
Topic 10 Top Words:
     Highest Prob: new_york, welcome, still, i, dance, like, never 
     FREX: new_york, worship, welcome, blind, maroon, bright, innocent 
     Lift: aglow, kaleidoscope, new_york_welcome, new_york_city, false, worship, incense 
     Score: new_york, welcome, maroon, worship, problem, soundtrack, blind 
Topic 11 Top Words:
     Highest Prob: ooh, ever, i, like, get, never, dance 
     FREX: ooh, starlight, ever, tie, hoo, whoa, eeh 
     Lift: the_red_sea, anxious, icon, platonic, mega, shroud, devils 
     Score: ooh, starlight, hoo, eeh, tie, shove, hum 
Topic 12 Top Words:
     Highest Prob: stay, hey, think, now, get, time, blood 
     FREX: hey, stay, mad, blood, alive, solve, dead 
     Lift: aside, og, d.o.c., tlc, od, id, pov 
     Score: hey, stay, blood, mad, solve, problem, alive 
Topic 13 Top Words:
     Highest Prob: time, last, i, get, go, know, just 
     FREX: wonderland, last, style, fiction, timeless, ho, babe 
     Lift: ba, james_dean, classic, style, slick, seen, ten_foot 
     Score: ho, wonderland, fiction, december, style, last, timeless 
Topic 14 Top Words:
     Highest Prob: wanna, i, like, daylight, love, see, now 
     FREX: daylight, wanna, somewhere, reputation, fortnight, game, else 
     Lift: afterglow, 20_-_year, daylight, paris, throughout, the_great_war, plague 
     Score: daylight, reputation, fortnight, wanna, the_great_war, paris, somewhere 
Topic 15 Top Words:
     Highest Prob: want, call, i, like, one, know, get 
     FREX: bless, call, want, prophecy, except, august, midnight 
     Lift: footprint, boyish, architect, 17, 16th_avenue_get, lyrical, indigo 
     Score: bless, prophecy, august, chaos, revelry, heartbeat, except 
Topic 16 Top Words:
     Highest Prob: ah, ha, i, think, just, say, now 
     FREX: ah, ha, finally, glitch, prove, buy, clean 
     Lift: careful, este, lettin, system, glitch, five_second_later, fastenin 
     Score: ha, ah, finally, prove, glitch, este, lettin 
Topic 17 Top Words:
     Highest Prob: man, can, i, get, just, know, still 
     FREX: man, fuck, window, polish, a_thousand, wreck, might 
     Lift: wakin, dollar, a_thousand, familiarity, breed, contempt, basement 
     Score: man, wreck, fuck, a_thousand, polish, wakin, wherever 
Topic 18 Top Words:
     Highest Prob: come, like, can, i, snow, oh, feel 
     FREX: snow, beach, come, help, jump, hallelujah, star 
     Lift: three_hundred, revolution, laughing, person, eighteen, applause, moth 
     Score: beach, snow, weird, pocketful, help, hallelujah, impossible 
Topic 19 Top Words:
     Highest Prob: i, never, see, now, know, right, trouble 
     FREX: trouble, mr., cornelia_street, different, bet, easy, sign 
     Lift: dreamer, mystify, jacket, archer, prey, stop_(_stop, hedge 
     Score: trouble, mr., cornelia_street, bet, different, shame, whenever 
Topic 20 Top Words:
     Highest Prob: say, i, go, back, oh, baby, away 
     FREX: rain, back, forever, say, bye, everything, wrong 
     Lift: unarmed, halfway, fold, fray, bye, dresser, hesitation 
     Score: bye, forever, rain, bedroom, everything, baby, bleedin 
gamma_prep <- gamma_taylor_k20 |>
  arrange(topic, -gamma) |> 
  left_join(tidy_taylor, by = c("document" = "doc_id")) |>  
  mutate(album_name = fct_reorder(album_name, album_release)) |> # order albums chronologically for the facets below
  select(topic, document, gamma, album_name, track_name) |> 
  mutate(topic = factor(topic),
         term = reorder_within(document, gamma, topic))

gamma_tbl <- gamma_prep |> 
  group_by(topic) |>
  slice_max(gamma, n = 5, with_ties = FALSE) |> 
  ungroup() |>
  select(topic, track_name) |> 
  mutate(id = rep(1:5, times = 20)) |>
  pivot_wider(names_from = topic, values_from = track_name, names_prefix = "Topic_")

gamma_prep |> 
  ggplot(aes(gamma, topic, fill = topic)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  facet_wrap(vars(album_name)) +
  labs(x = expression(gamma))

Incorporating metadata

Structural Topic Models offer a framework for incorporating metadata into topic models. In particular, you can have metadata affect the topical prevalence, i.e., the frequency with which a certain topic is discussed can vary depending on some observed, non-textual property of the document. On the other hand, the topical content, i.e., the terms that constitute a topic, may also vary depending on certain covariates.

Structural Topic Models are implemented in R via the dedicated stm package, which we have already been using above; its documentation provides an overview of the workflow and the functions that facilitate it.
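In our case, the album could serve as such a covariate. Here is a minimal sketch of how it could be passed to stm() (this is not the model estimated above; it assumes that the rows of tidy_taylor line up with the documents in taylor_sparse, which they do here because both are ordered by doc_id):

taylor_stm_meta <- stm(
  taylor_sparse,
  K = 20,
  prevalence = ~ album_name,  # topical prevalence may shift from album to album
  # content = ~ album_name,   # optionally, the words a topic uses could shift as well
  data = tidy_taylor,         # one row of metadata per document, in document order
  seed = 1989,
  verbose = FALSE
)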

In the following example, I will stick with the Taylor Swift lyrics to run you through the process: estimateEffect() regresses the documents’ topic proportions on metadata – here, the album a song appears on – and tidy() turns the regression results into a tibble that we can filter for significant effects.

effects <- estimateEffect(1:20 ~ album_name, taylor_lda_k20, tidy_taylor)

effects |> tidy() |> filter(p.value < 0.05)
# A tibble: 5 × 6
  topic term                                estimate std.error statistic p.value
  <int> <chr>                                  <dbl>     <dbl>     <dbl>   <dbl>
1     6 (Intercept)                           0.124     0.0520      2.39  0.0176
2     9 album_namefolklore                    0.155     0.0776      2.00  0.0464
3    12 (Intercept)                           0.108     0.0423      2.55  0.0113
4    20 (Intercept)                           0.0780    0.0373      2.09  0.0373
5    20 album_nameFearless (Taylor's Versi…   0.110     0.0498      2.22  0.0276
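The tidy() output above only lists the coefficients with p-values below 0.05. stm also ships a plot method for estimateEffect objects; a minimal sketch, picking topic 9 because it showed up as folklore-heavy above:

# Point estimates (with confidence intervals) of topic 9's prevalence per album
plot(effects,
     covariate = "album_name",
     topics = 9,
     method = "pointestimate")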

Seeded Topic Models

Another flavor of topic model is the seeded topic model. Seeded topic models give you more control over the topics that are actually “worth finding,” since you can predetermine (some of) the words that make up certain topics. We will again use the Taylor Swift corpus from before; this time, we need it in the format of a document-feature matrix.

needs(quanteda, seededlda)

taylor_dfm <- preprocessed_taylor |> 
  cast_dfm(doc_id, lemma, n)

Also, we need to define our topics in a dictionary. I got some inspiration from this Reddit post.

dict <- dictionary(
  list(
    heartbreak = c("tears", "goodbye", "breakup", "cry", "hurt", "pain", "lonely", "miss", "sorry", "regret"),
    romance = c("love", "kiss", "boyfriend", "forever", "marry", "romantic", "heart", "darling", "lover", "beautiful"),
    revenge = c("revenge", "burn", "reputation", "mad", "blame", "fight", "deserve", "payback", "worse"),
    nostalgia = c("remember", "back", "fifteen", "young", "memories", "childhood", "innocence", "time", "old"),
    empowerment = c("shake", "stand", "strong", "brave", "myself", "independent", "rise", "confidence", "unstoppable"),
    fame = c("camera", "star", "spotlight", "stage", "dress", "carpet", "crown", "audience", "performance"),
    small_town = c("small", "town", "truck", "porch", "summer", "seventeen", "high", "school", "hometown", "street"),
    celebration = c("dance", "party", "night", "champagne", "spark", "glitter", "magic", "celebrate", "midnight", "shine"),
    friendship = c("friend", "girl", "squad", "together", "best", "laugh", "call", "close", "trust", "loyal"),
    seasons = c("snow", "winter", "fall", "autumn", "spring", "cold", "rain", "december", "leaves", "season"),
    cat = c("karma", "cat", "attack", "hunter", "purr")
  )
)

Then we can train the model. We will again aim for 20 topics overall; since our dictionary defines 11 topics, we set residual = 9, which adds 9 residual topics to the 11 seeded ones.

lda_res <- textmodel_seededlda(taylor_dfm, 
                               dict, 
                               residual = 9, 
                               batch_size = 0.01, 
                               auto_iter = TRUE,
                               verbose = TRUE)
Fitting LDA with 20 topics
 ...initializing
 ...Gibbs sampling in up to 2000 iterations
 ......iteration 100 elapsed time: 0.49 seconds (delta: -0.07%)
 ...computing theta and phi
 ...complete

Let’s have a look at the words in the topics:

topic_words <- lda_res |> 
  pluck("phi") |> 
  t() |> 
  as_tibble(rownames = NA) |> 
  rownames_to_column("term") |> 
  pivot_longer(-term) |> 
  group_by(name) |> 
  slice_max(value, n = 10) 
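topic_words now holds the ten highest-probability terms per seeded (and residual) topic. A quick sketch to eyeball them, reusing the reorder_within() pattern from above:

# Plot the top terms per seeded/residual topic, mirroring the beta plots above
topic_words |>
  ungroup() |>
  mutate(name = factor(name),
         term = reorder_within(term, value, name)) |>
  ggplot(aes(term, value, fill = name)) +
  geom_col(show.legend = FALSE) +
  scale_x_reordered() +
  facet_wrap(vars(name), scales = "free", ncol = 4) +
  coord_flip()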

Let’s check out the strength of the topics in the particular documents, i.e., songs:

docs <- rownames(taylor_dfm) |> 
  enframe(name = NULL, value = "doc_id") |> 
  bind_cols(lda_res$theta |> as_tibble())

strongest_belongings <- docs |> 
  pivot_longer(-doc_id, names_to = "topic") |> 
  group_by(topic) |> 
  slice_max(value, n = 5) |> 
  left_join(tidy_taylor |> mutate(doc_id = doc_id |> as.character()))
Joining with `by = join_by(doc_id)`
needs(plotly)
boxplot <- docs |> 
  left_join(tidy_taylor |> mutate(doc_id = doc_id |> as.character())) |> 
  select(-starts_with("other")) |> 
  pivot_longer(heartbreak:cat, names_to = "topic", values_to = "value") |> 
  mutate(album_name = factor(album_name) |> fct_reorder(album_release),
         label = case_when(
          value > 0.7 ~ track_name,
          TRUE ~ ""
         )) |> 
  ggplot(aes(value, topic, fill = topic, text = track_name)) +
    geom_point(alpha = 0.7, show.legend = FALSE) +
    facet_wrap(vars(album_name)) +
    labs(x = expression(theta)) +
    theme_minimal()
Joining with `by = join_by(doc_id)`
ggplotly(boxplot, tooltip = "text") %>%
  layout(hovermode = "closest")

This is just a first glimpse into the capabilities of seeded topic models. Of course, you can now go further, adapt the seed words, etc., and finally visualize the topics, just the way we did above.

Further readings

References

Grün, Bettina, Kurt Hornik, David Blei, John Lafferty, Xuan-Hieu Phan, Makoto Matsumoto, Nishimura Takuji, and Shawn Cokus. 2020. “Topicmodels: Topic Models.”
Robinson, David. 2020. “Broom: Convert Statistical Analysis Objects into Tidy Data Frames.”