

A part of the corporate hiring process is devising job advertisements. While creating these descriptions, it is important to know the skills one is looking for in a candidate and be familiar with the subject matter.

In the technology sector, for instance, new processes, tools, and languages emerge regularly; hence, keeping up to date requires regular research. During these research phases, it helps to know which words and phrases we need to google.

Manual search for these words may require blindly poking around in numerous articles, blog posts and papers, only to find a few which are truly relevant. We are going to attempt to automate a part of this process by writing a script which will hopefully help us obtain useful and currently trending buzzwords in a given field.

The basic idea

Our goal is to obtain current buzzwords from online articles. We will pick a platform which has new articles and collect them to extract significant and trending words.

First of all, we need a number of documents on a certain topic, referred to here as a corpus. The more documents, the better. We will use Data Science as the example field when implementing our method.

Before we can work further with the corpus, it makes sense to get rid of words which do not contribute to the content. Words such as the, a, this, that, which, ... are known as stop-words and can be removed from the text without losing vital information.
If our aim were to retrieve the meaning of these texts, we would stop filtering here. Yet our aim is to find words which bring us new information, which is why we continue filtering. In our case, we are strictly looking for important nouns in the text. Verbs, adjectives and adverbs would be essential for deriving the meaning of the texts, as they describe the actions and properties of the subjects and objects, but we filter them out. Even certain nouns (in the case of Data Science, for example: data, programming, code or language) do not bring anything new to the table. In contrast, finding words such as Python or R would be very useful.
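
The filtering idea can be sketched in a few lines of Python. This is only a minimal illustration: the two word lists are hypothetical and deliberately tiny, and the sketch does not detect nouns at all. A real run would presumably use a complete stop-word list and a part-of-speech tagger.

```python
import re

# Hypothetical, deliberately tiny word lists (assumptions for illustration only).
STOP_WORDS = {"the", "a", "an", "this", "that", "which", "is", "are",
              "and", "of", "in", "to", "for", "with"}
DOMAIN_WORDS = {"data", "programming", "code", "language"}


def filter_tokens(text):
    """Lowercase the text, split it into word tokens and drop uninformative ones."""
    tokens = re.findall(r"[a-z][a-z0-9+#]*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in DOMAIN_WORDS]


sample = "This is the code: Python and R are a language for programming with data"
print(filter_tokens(sample))  # ['python', 'r']
```

Only the two tokens we would consider buzzwords survive; everything else is either a stop-word or a noun that is too generic for the field.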

This filtering leaves us with a small percentage (roughly 1-10%) of the words we began with. We can now extract groups of significant words from these results using methods we will explain later. Ideally, each of these groups adheres to one topic that is related to Data Science and occurred in the articles we collected, e.g. Artificial Intelligence or Machine Learning. As we do not know how many different topics are present in the articles, we can experiment with the number until the results look reasonable.[1]

In order to build these word groups, i.e. topics, we are going to implement a method called Latent Dirichlet Allocation (LDA). This method is primarily used to extract topics from a corpus. The basic idea is to generate a table with topics as rows and words as columns, in which each entry is the probability of drawing a certain word when sampling from a certain topic. The result for each topic is then a list of its most probable words.[2]

Table 1: probability of each word per topic

topic       bat    kick   ball
soccer      0      0.4    0.6
baseball    0.4    0      0.6

For example, in table 1, if we were to sample words from the topic soccer, we would obtain the word kick 40% of the time and the word ball the other 60%. Listing the most probable words of each topic gives the following two topics:

soccer: ball, kick
baseball: ball, bat
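
Reading the topics off such a table amounts to sorting each row by probability. The following sketch assumes the probabilities of table 1, with the baseball row taken as 0.4, 0 and 0.6 so that it sums to one; the helper name `top_words` is our own.

```python
# Topic-word probabilities: one row per topic, each row sums to 1.
topic_word = {
    "soccer": {"bat": 0.0, "kick": 0.4, "ball": 0.6},
    "baseball": {"bat": 0.4, "kick": 0.0, "ball": 0.6},
}


def top_words(word_probs, n=2):
    """Return up to n words of one topic, most probable first, skipping zeros."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, prob in ranked[:n] if prob > 0]


for topic, word_probs in topic_word.items():
    print(f"{topic}: {', '.join(top_words(word_probs))}")
# soccer: ball, kick
# baseball: ball, bat
```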

How we obtain the table is explained in the next section with the help of an example consisting of two documents.

In our case, in order to find relevant words in the corpus, our plan is to apply this method assuming we already know the topics (or at least have some prior knowledge of them [3]). After applying LDA, our idea is to interpret the probability entries as a measure of how relevant a word is to a certain topic and thus obtain an ordered list of relevant words for each topic.

Simplified Latent Dirichlet Allocation

Let’s say we have the following two documents at hand:

document1: Gotham Joker
document2: Luthor Lois

These two documents consist of one separate topic each: Batman (1) and Superman (0).
Our goal is to find which of these four words adheres to which topic.
Let us randomly assign topics to each word:

Gotham  Joker  Luthor  Lois
0       1      0       0

This of course is wrong as Gotham clearly does not belong to Superman (0). To obtain the correct topic assignment for each word, we are going to apply LDA.
Because solving the general LDA equation is intractable[4], the general practice is to apply an iterative technique called Gibbs sampling [1]. We will work out this example by hand and hopefully obtain the correct topic-word assignment. The example is created based on [2] and [3].

Before the first iteration there are some additional parameters we need to set:
\[ \eta = \frac{\text{No. of words belonging to topic } t}{\text{Total no. of words in the documents}} \]
The \( \eta \) parameter controls the distribution of words per topic. A large value for \( \eta \) would mean that each topic is made up of most of the words in the corpus. With a low \( \eta \), each topic consists of only a part of them, i.e. the overlap between topics is minimal. In our case, there is no overlap between the two topics, hence \( \eta \) is set well below one.

\[ \alpha = \frac{\text{No. of words in the document belonging to topic } t}{\text{Total no. of words in the documents}} \]
The \( \alpha \) parameter controls the distribution of topics per document. A large value for \( \alpha \) would mean that each document is made up of multiple topics. With a low \( \alpha \), each document consists of a small number of topics, i.e. the overlap between documents regarding topics is minimal. In our case, there is no overlap between the two documents, hence \( \alpha \) is also set well below one. In the calculations below we use \( \eta = \alpha = 0.1 \).

Now that these parameters are set, we are going to go through each word of each document and calculate the (conditional) probability for a word \( w \) to be assigned to a certain topic \( t \) using the formula[5]:

\[
p(w = t) \propto \frac{\text{No. of times word } w \text{ is assigned to topic } t + \eta}{\text{Total no. of words in topic } t + \eta \times \text{No. of distinct words}}
\times \frac{\text{No. of words in the document assigned to topic } t + \alpha}{\text{Total no. of words in the document} + \alpha \times \text{No. of topics}}
\]
The first factor measures how strongly the word is tied to the topic, the second how strongly the topic is present in the document containing the word.
We will also construct two count tables from the topic assignment. The topic-word count table shows how often each word is assigned to each topic:

topic       Gotham  Joker  Luthor  Lois
Superman    1       0      1       1
Batman      0       1      0       0

The document-topic count table shows how many words of each document are assigned to each topic:

document    Superman  Batman
document1   1         1
document2   2         0
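
Both count tables follow mechanically from the documents and the random assignment 0, 1, 0, 0. A small sketch of this bookkeeping is given below; the variable names are our own, and keying the assignment by word only works because every word occurs exactly once in this toy corpus.

```python
from collections import Counter

TOPICS = ["Superman", "Batman"]  # topic 0 and topic 1
documents = {"document1": ["Gotham", "Joker"], "document2": ["Luthor", "Lois"]}
# Random initial topic assignment from the text.
assignment = {"Gotham": 0, "Joker": 1, "Luthor": 0, "Lois": 0}

topic_word = Counter()  # (topic, word) -> how often the word is assigned to the topic
doc_topic = Counter()   # (document, topic) -> how many words of the document have the topic
for doc, words in documents.items():
    for word in words:
        topic = assignment[word]
        topic_word[(topic, word)] += 1
        doc_topic[(doc, topic)] += 1

# Print one row per topic, then one row per document.
for t, name in enumerate(TOPICS):
    print(name, [topic_word[(t, w)] for w in assignment])
for doc in documents:
    print(doc, [doc_topic[(doc, t)] for t in range(len(TOPICS))])
```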

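As an example we will do this for the word Gotham in document1.
Gibbs sampling involves sampling one topic assignment while keeping all other current topic assignments fixed; hence, we must first remove the current assignment of Gotham (topic number 0, i.e. Superman) from our count tables. We obtain:

topic       Gotham  Joker  Luthor  Lois
Superman    0       0      1       1
Batman      0       1      0       0

document    Superman  Batman
document1   0         1
document2   2         0

So we use the formula for \( p(w = t)\) above[6], with four distinct words and two topics. For topic number 0, i.e. Superman, two words (Luthor, Lois) are assigned to the topic, and the one remaining word in document1 (Joker) belongs to the other topic:

\[ p_0 = \frac{0 + 0.1}{2 + 0.1 \cdot 4} \times \frac{0 + 0.1}{1 + 0.1 \cdot 2} \approx 0.0035 \]
Similarly for topic number 1, i.e. Batman, one word (Joker) is assigned to the topic, and it is located in document1:
\[ p_1 = \frac{0 + 0.1}{1 + 0.1 \cdot 4} \times \frac{1 + 0.1}{1 + 0.1 \cdot 2} \approx 0.0655 \]
Hence the normalized probabilities are given by:
\[ p(\text{Gotham} = 0) = \frac{p_0}{p_0 + p_1} \approx 0.05 \]
\[ p(\text{Gotham} = 1) = \frac{p_1}{p_0 + p_1} \approx 0.95 \]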
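
The same arithmetic can be checked with a short script. It is a sketch under stated assumptions: it uses the standard collapsed Gibbs update, in which the first factor is built from the topic-word counts and \( \eta \) and the second from the document-topic counts and \( \alpha \), with \( \eta = \alpha = 0.1 \), and it plugs in the counts that remain once Gotham's assignment is removed (Joker = 1, Luthor = 0, Lois = 0). The function and argument names are our own.

```python
ETA = ALPHA = 0.1          # hyperparameter value used in the worked example
N_WORDS, N_TOPICS = 4, 2   # distinct words and topics in the toy corpus


def unnormalized(word_in_topic, words_in_topic, doc_in_topic, words_in_doc):
    """Topic-word factor times document-topic factor for one candidate topic."""
    word_factor = (word_in_topic + ETA) / (words_in_topic + ETA * N_WORDS)
    doc_factor = (doc_in_topic + ALPHA) / (words_in_doc + ALPHA * N_TOPICS)
    return word_factor * doc_factor


# Topic 0 (Superman): holds Luthor and Lois, none of them in document1.
p0 = unnormalized(word_in_topic=0, words_in_topic=2, doc_in_topic=0, words_in_doc=1)
# Topic 1 (Batman): holds Joker, which is the other word of document1.
p1 = unnormalized(word_in_topic=0, words_in_topic=1, doc_in_topic=1, words_in_doc=1)

total = p0 + p1
print(round(p0 / total, 2), round(p1 / total, 2))
```

Dividing by `total` is the normalization step; only the ratio of the two values matters for the coin flip.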
Now if we flip a biased coin with these probabilities to choose the topic of the word Gotham, we get, in our case, the topic assignment 1, i.e. Batman; hence, our topic-assignment table now looks like

Gotham  Joker  Luthor  Lois
1       1      0       0

which makes much more sense. We repeat this procedure for every remaining word in each document (Joker, Luthor, Lois) and obtain the same, unchanged topic assignment. Finally, we calculate
\[ \frac{\text{No. of times word } w \text{ is assigned to topic } t + \eta}{\text{Total no. of words in topic } t + \eta \times \text{No. of distinct words}} \]
for each word and topic to obtain the table required for our purposes:

topic       Gotham  Joker  Luthor  Lois
Superman    0.042   0.042  0.458   0.458
Batman      0.458   0.458  0.042   0.042
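
These entries can be reproduced from the final topic-word counts. The sketch below assumes \( \eta = 0.1 \) and the final assignment Gotham = 1, Joker = 1, Luthor = 0, Lois = 0; the function name is hypothetical.

```python
ETA = 0.1
WORDS = ["Gotham", "Joker", "Luthor", "Lois"]
# Topic-word counts for the final assignment (Gotham=1, Joker=1, Luthor=0, Lois=0).
counts = {
    "Superman": {"Gotham": 0, "Joker": 0, "Luthor": 1, "Lois": 1},
    "Batman": {"Gotham": 1, "Joker": 1, "Luthor": 0, "Lois": 0},
}


def topic_word_probabilities(counts, eta):
    """Smooth each row of counts with eta and normalize it so that it sums to 1."""
    table = {}
    for topic, row in counts.items():
        denominator = sum(row.values()) + eta * len(row)
        table[topic] = {word: (n + eta) / denominator for word, n in row.items()}
    return table


phi = topic_word_probabilities(counts, ETA)
for topic, row in phi.items():
    print(topic, [round(row[w], 3) for w in WORDS])
# Superman [0.042, 0.042, 0.458, 0.458]
# Batman [0.458, 0.458, 0.042, 0.042]
```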

After which we are done. This result gives us the following topics:

Batman: Gotham, Joker
Superman: Luthor, Lois

If we were trying to find relevant words for the topic Batman, we would have obtained Gotham and Joker, which is not bad at all.
Now that we have a basic understanding of the LDA algorithm, we are ready to implement this on real data.
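
Before moving to real data, the whole by-hand procedure (remove an assignment, compute the topic probabilities, flip the biased coin, update the counts) can be sketched as one loop. This is a minimal collapsed Gibbs sampler written for the toy example, not an optimized or definitive implementation; the function name `gibbs_lda`, its defaults and the number of sweeps are assumptions of ours.

```python
import random


def gibbs_lda(docs, n_topics, eta=0.1, alpha=0.1, sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # Random initial topic for every word position.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(n_topics)]
    topic_total = [0] * n_topics
    doc_topic = [[0] * n_topics for _ in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            topic_word[t][w] += 1
            topic_total[t] += 1
            doc_topic[d][t] += 1

    for _ in range(sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the count tables.
                t = z[d][i]
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                doc_topic[d][t] -= 1
                # Unnormalized probability of each topic for this word.
                weights = [
                    (topic_word[k][w] + eta) / (topic_total[k] + eta * len(vocab))
                    * (doc_topic[d][k] + alpha) / (len(doc) - 1 + alpha * n_topics)
                    for k in range(n_topics)
                ]
                # Flip the biased coin and record the new assignment.
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                topic_word[t][w] += 1
                topic_total[t] += 1
                doc_topic[d][t] += 1

    # Smoothed topic-word probabilities from the final counts.
    phi = [
        {w: (topic_word[k][w] + eta) / (topic_total[k] + eta * len(vocab)) for w in vocab}
        for k in range(n_topics)
    ]
    return z, phi


docs = [["Gotham", "Joker"], ["Luthor", "Lois"]]
z, phi = gibbs_lda(docs, n_topics=2)
for k, row in enumerate(phi):
    print(k, sorted(row, key=row.get, reverse=True)[:2])
```

Because the sampler is random, the topic numbers are arbitrary labels and a single run on such a tiny corpus need not end in the clean split from the worked example; on a real corpus one would run many more sweeps and inspect the resulting word lists.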

[1] Darling, William M. "A theoretical and practical implementation tutorial on topic modeling and Gibbs sampling." Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. 2011.
[2] Steyvers, Mark, and Tom Griffiths. "Probabilistic topic models." Handbook of latent semantic analysis 427.7 (2007): 424-440.
[3] Lettier. "Your Easy Guide to Latent Dirichlet Allocation." Medium. medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
[4] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Advances in neural information processing systems. 2002.
[5] Chang, Jonathan, and David Blei. "Relational topic models for document networks." Artificial Intelligence and Statistics. 2009.

  1. Quantities like Model-Perplexity and Log-Likelihood can be used to quantify the results.

  2. More accurately, the goal is to estimate the topic-word distribution \( \Phi \sim Dir(\eta)\) and the document-topic distribution \( \Theta \sim Dir(\alpha)\). These represent the word distribution for a topic and the topic distribution for a document respectively. \( Dir()\) refers to the Dirichlet distribution.

  3. Known as semi-supervised LDA [5].

  4. LDA is a generative process described by a joint distribution. Finding the variables \( \Theta, \Phi, \mathbf{z} \) for the observed words \( \mathbf{w} \) (where \( w_i \) represents the word for topic assignment \( z_i \)) amounts to reversing this process: learning the posterior distribution of the latent variables in the model given the observed data, i.e. solving the equation \( p(\Theta,\Phi,\mathbf{z} | \mathbf{w},\alpha,\eta) = \frac{p(\Theta,\Phi,\mathbf{z},\mathbf{w} | \alpha,\eta)}{p(\mathbf{w} | \alpha, \eta)}\) [1].

  5. Derivation in [1], pages 4-6.

  6. We left out the conditionals for simplicity's sake.