Digging through Jeopardy! Questions using Python and Natural Language Processing

Jonathan Hernandez
5 min readFeb 17, 2022
Image: https://www.thoughtco.com/jeopardy-past-and-present-history-1396954

*** Work done originally in January 2019 ***

While taking a Web Analytics course, my class had to find a corpus of text to analyze and find the most common words. After doing some digging, I came across an interesting data set: a list of Jeopardy! questions. A Reddit user had posted a link to a data set of over 200,000 Jeopardy! questions asked throughout the show’s history.

Link to the reddit post: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/

I decided to work with this dataset as I too was curious about the kinds of questions that appear on the show. I could imagine the text processing I wanted to do was similar (hopefully) to how IBM’s Watson analyzed questions. Eager and excited, I began coding away.

The programming language Python has libraries and functions for working with text, and it is one of my favorite programming languages (Python FTW!!!). I definitely recommend learning how to code, and Python is a fun and easy language to get started with, in my opinion.

Before I get carried away talking about how wonderful the language is, my approach was as follows:

  • For this assignment, and to make things easier, consider only words with 2 or more letters, with no digits or special characters allowed.
  • Load libraries to extract the data set and use Natural Language Processing functions.
  • Take every question in the data set and join them into one big string separated by spaces.
  • Tokenize the string
  • Set every word to lowercase (in NLP, this is called normalizing). This also helps with the word counts, since I don’t want to count, for example, ‘Country’ and ‘country’ differently.
  • Use regular expressions to filter out the corpus of text based on my first bullet point.
  • Print out the top 100 most common words and their frequency and finally make a plot of them.

Loading the data set and the required libraries/packages:

import pandas as pd
import nltk
import re
import enchant
import matplotlib.pyplot as plt

# read in the dataset and print how many questions there are
jeopardy = pd.read_json("JEOPARDY_QUESTIONS1.json")
print("Number of questions:", len(jeopardy))
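Before going any further, it helps to peek at what the DataFrame actually holds. The column names in this quick check follow the Reddit JSON dump (category, question, answer, and so on); treat them as an assumption if you are working from a different export:

# quick look at the loaded data; column names assume the Reddit JSON dump
print(jeopardy.shape)                 # (number of questions, number of columns)
print(jeopardy.columns.tolist())      # expect columns such as 'category', 'question', 'answer'
print(jeopardy["question"].iloc[0])   # the text of the first question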

Combine all the questions as one giant corpus of questions (one big string):

d = enchant.Dict("en_US")  # English dictionary, used later to filter out non-words

# pull the question text out of the DataFrame and join the list of strings into one
questions = jeopardy["question"].tolist()
corpus = " ".join(questions)
tokens = nltk.word_tokenize(corpus)  # tokenize the string

Normalize the words in the corpus and store the number of words:

words = [w.lower() for w in tokens]
n_words = len(words)  # total token count, used later for relative frequencies

Applying the rule to only allow words with 2 or more letters, and filtering out numbers, special characters, and any words that are not in the English dictionary, we get a much smaller number of words to work with:

words_only = [w for w in words if re.search(r"^[a-z]{1,}[^\W|\d]+$", w)]
words_only = [w for w in words_only if d.check(w)]  # keep only words valid in the English language
n_unique_words = len(set(words_only))
print("Number of unique words:", n_unique_words)

Number of unique words: 37824

So in over 28 years of Jeopardy! questions, there are about 38000 unique words that appear in the corpus.
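If you are curious how that filter behaves on individual tokens, here is a tiny sanity check. The sample words below are hypothetical (not pulled from the corpus), and the dictionary results depend on the enchant word list installed on your machine:

# hypothetical sample tokens, just to illustrate the filtering rule
samples = ["country", "the", "a", "r2d2", "don't", "1492"]
for w in samples:
    keep = bool(re.search(r"^[a-z]{1,}[^\W|\d]+$", w)) and d.check(w)
    print(w, "->", "kept" if keep else "dropped")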

Let’s now see the frequency distribution of these words in the text and see the most popular words.

Python’s nltk (Natural Language Toolkit) library has a handy FreqDist class whose most_common(n) method outputs the top n most frequent words in a given corpus.

I will print out the top 100 in this case:

# build a frequency distribution over the filtered words
fdist = nltk.FreqDist(words_only)

# 100 most frequent words sorted by count
most_freq_100 = fdist.most_common(100)
for word, frequency in most_freq_100:  # iterate and print
    print(word, frequency)

the 159660
of 113472
this 106279
in 80984
to 50357
for 35403
is 34621
was 29775
on 23269
it 20561
from 17957
with 17247
that 15959
by 15778
his 15589
as 15555
these 13977
he 12793
you 12678
one 11842
an 11547
at 11365
name 11153
or 10274
first 9942
are 8576
its 8213
who 7633
city 7338
here 7020
be 6710
has 6111
and 6022
country 5954
her 5953
man 5522
named 5426
called 5368
state 5289
have 5219
about 5128
can 5051
but 4913
when 4894
seen 4860
film 4756
new 4745
not 4736
like 4731
clue 4677
type 4556
were 4370
up 4331
she 4216
made 4044
your 3995
crew 3982
which 3978
title 3931
used 3881
had 3874
known 3670
world 3605
after 3591
into 3570
out 3515
do 3474
also 3431
no 3426
word 3286
only 3274
all 3253
him 3199
became 3163
said 3152
president 3132
may 3125
years 3058
novel 2987
played 2986
wrote 2955
over 2937
my 2913
they 2875
capital 2862
king 2715
their 2714
term 2656
than 2612
war 2594
part 2590
book 2541
last 2517
island 2510
show 2489
most 2480
won 2478
been 2402
famous 2369
french 2362

We see that the most frequent word is ‘the’, followed by ‘of’, and so on. The frequencies keep decreasing, but at a slower and slower rate.

If we plot the relative frequency of each of these words against its rank, we see that there is an inverse relationship going on:

# compute the relative frequency and round it to 4 decimal places
frequency = [x[1] for x in most_freq_100]  # raw counts of the top 100 words
rel_frequency = [round(float(x) / n_words, 4) for x in frequency]

plt.plot(rel_frequency)
plt.xlabel("Rank of words in corpus", color="w")
plt.ylabel("Relative Frequency of Words", color="w")
plt.title("Zipf Plot of 100 Most Common Words", color="w")
plt.show()
Plot of relative frequencies of the top 100 words in the corpus. As the rank of a word increases, its relative frequency decreases, showing an inversely proportional relationship. This behavior is known as Zipf’s Law.

Sorry for the lack of axis labels in the graph above. The y-axis (vertical) is the relative frequency and the x-axis (horizontal) is the rank of the word.

Zipf’s Law states that the frequency of a word is inversely proportional to its rank in the frequency list. In mathematical notation:

f ∝ 1/r

Where r is the rank of the word (1 for the most common word, 2 for the second most common, and so on) and f is its relative frequency.
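A quick way to eyeball Zipf’s Law is to redo the plot on log-log axes, where an inverse power law shows up as a roughly straight line. The sketch below is my own addition and simply reuses the most_freq_100 and n_words variables from the code above:

# sketch: Zipf plot on log-log axes, reusing most_freq_100 and n_words from above
ranks = range(1, len(most_freq_100) + 1)
rel_freqs = [count / float(n_words) for _, count in most_freq_100]
plt.loglog(ranks, rel_freqs, marker="o", linestyle="none")
plt.xlabel("Rank of word (log scale)")
plt.ylabel("Relative frequency (log scale)")
plt.title("Zipf Plot on log-log axes")
plt.show()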

So what have I demonstrated, you might ask? By using computer programming and some Natural Language Processing, we can gain insight into text corpora and explore what a corpus reveals. In the case of these Jeopardy! questions, we’ve seen that the most common word was ‘the’, followed by ‘of’ and then ‘this’, and so on. Similar or more advanced methods are how one would analyze text from, say, Twitter feeds or customer reviews of a product.

I hope you enjoyed reading the post and learned something new 🙂 Let me know your thoughts in the comments section below. Did you enjoy reading this post? Could this type of problem be solved in a different manner? What other things would you have liked to see in this post?

You can find the files and data used for this project by clicking here:
