Thanks to the magic of DVR, I got re-addicted to Jeopardy! over winter break. I quickly burned through what must have been 50 or more saved-up episodes in a matter of days. I was disappointed to find that I hadn’t improved much by the end of this trivia binge. Sure, my fact-recall latency improved a little, but I still struggled with the same topics over and over again. I still mix up plotlines between Macbeth and Hamlet, can’t distinguish a Manet from a Monet, and can’t “name that Broadway musical” to save my life. On the other hand, I nail most geography questions, do pretty well in A.D. history, and am generally good enough to solve word puzzles. What I realized is that I had hit a ceiling in my trivia knowledge and that only a targeted effort at filling in the gaps would help me improve. Inspired by Jeopardy! legend Roger Craig’s knowledge tracking system, I decided to set out and build my own trivia knowledge training engine.
Obtaining the data from J! Archive and putting it all into a local database was a simple enough task. Each record consists of just four fields: the clue’s text, the clue’s answer, the clue’s category, and the clue’s dollar value. My goal is to use these clues to systematically quiz myself on the topics that both need the most improvement and maximize expected payoff. The difficulty is figuring out how to cluster categories together based on an overall topic. I first tackled the problem using a variation on k-means clustering but found that it was too difficult to compute a meaningful similarity metric between two questions using just the category and the clue text. I decided to dive deep into the data and wrangle it myself.
Before I describe my solution, I just want to point out that my background in machine learning is pretty basic and limited to my own tinkering. If the process I describe below is already a popular technique with a name, please email me so I can edit this post in order to appear as less of an idiot. It’s somewhat similar to Naive Bayes classification, but not quite the same.
I first determined what kind of topic clusters I was looking for and how many I needed. I wanted to encapsulate categories like STATE CAPITALS and BODIES OF WATER under the GEOGRAPHY topic and encapsulate categories like TV STARS and THE 70th OSCARS under POP CULTURE. I realized that I only needed a few topics so that my targeted training could be more practical. I decided on the following twelve topics: WORD PLAY, SCIENCE, LITERATURE, HISTORY, GEOGRAPHY, RELIGION, THEATRE, ART, MUSIC, SPORTS, POP CULTURE, and FOOD. I then spent the next three hours associating categories with their respective topics. It was tremendously tedious, but using regular expressions to extract similar categories in bulk made the job a little easier. I also made sure that the top 200 most frequently appearing categories were manually assigned to their respective topics. By the end of this process, I had manually mapped about 9000 of the 25000 total unique categories. I updated my models with this new labeled data and got to work on an algorithm to fill in the other 16000 records.
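To give a rough idea of what the regular-expression pass might look like, here is a minimal Python sketch. The patterns, the `TOPIC_PATTERNS` table, and the `label_by_pattern` helper are all hypothetical names made up for illustration; the actual expressions used are not listed in this post. The sketch assumes category names are upper-case strings and leaves anything ambiguous for manual review.

```python
import re

# Hypothetical patterns for two of the twelve topics (illustration only).
TOPIC_PATTERNS = {
    "GEOGRAPHY": re.compile(r"\b(?:CAPITALS?|RIVERS?|ISLANDS?|BODIES OF WATER)\b"),
    "POP CULTURE": re.compile(r"\b(?:TV|OSCARS?|MOVIES?)\b"),
}


def label_by_pattern(categories, patterns=TOPIC_PATTERNS):
    """Map each category to a topic when exactly one topic pattern matches it."""
    labeled = {}
    for category in categories:
        hits = [topic for topic, pattern in patterns.items()
                if pattern.search(category)]
        # Categories matching zero or several topics are skipped so that
        # they can be reviewed by hand instead of being guessed at.
        if len(hits) == 1:
            labeled[category] = hits[0]
    return labeled


labeled = label_by_pattern([
    "STATE CAPITALS", "BODIES OF WATER", "TV STARS",
    "THE 70th OSCARS", "POTENT POTABLES", "TV CAPITALS",
])
```

In this toy run, POTENT POTABLES matches nothing and TV CAPITALS matches two topics, so both are left unlabeled.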
With the labeled data in hand, I figured that the best way to cluster the rest was to work from the answers. For each answer A, I find the topic T most frequently associated with it, provided that T clears a certain frequency threshold. I then take each category C corresponding to A which has not already been manually mapped to a topic and map it to T. This gives me, for each answer, a single mapping like GEOGRAPHY: [RIVER SOURCES, ON A MAP SOMEWHERE, ... ]. I then combine these mappings across all answers into twelve mega-mappings, one for each topic. For example, if the answer ‘Japan’ mapped to GEOGRAPHY: [list A] (where A contains all the unlabeled categories associated with the answer ‘Japan’) and ‘California’ mapped to GEOGRAPHY: [list B], then they would be combined into GEOGRAPHY: [list C = list A + list B], where C is just B appended to A and might contain duplicate categories. Finally, I compute the frequency distribution of each of the twelve lists and filter out categories which appear with a frequency below a certain cutoff. What I’m left with is a mapping of each topic to the categories most frequently associated with answers that have a high probability of corresponding to that topic.
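The steps above can be sketched in Python roughly as follows. This is a reconstruction under assumptions, not the original code: the function name `propagate_topics`, the `(answer, category)` tuple layout, and the two parameters `min_share` (the per-answer topic threshold) and `min_count` (the final cutoff, expressed here as a raw count) are all made up for illustration, and their default values are arbitrary.

```python
from collections import Counter, defaultdict


def propagate_topics(clues, labeled, min_share=0.6, min_count=2):
    """Suggest topics for unlabeled categories via the answers they share.

    clues   -- iterable of (answer, category) pairs
    labeled -- dict mapping manually labeled categories to topics
    """
    topic_votes = defaultdict(Counter)       # answer -> topic counts
    unlabeled_by_answer = defaultdict(list)  # answer -> unlabeled categories
    for answer, category in clues:
        if category in labeled:
            topic_votes[answer][labeled[category]] += 1
        else:
            unlabeled_by_answer[answer].append(category)

    # Merge the per-answer lists into one tally per topic (duplicates count).
    candidates = defaultdict(Counter)        # topic -> category counts
    for answer, votes in topic_votes.items():
        topic, hits = votes.most_common(1)[0]
        if hits / sum(votes.values()) < min_share:
            continue  # no topic dominates this answer strongly enough
        candidates[topic].update(unlabeled_by_answer.get(answer, []))

    # Keep only categories that show up often enough for a topic.
    return {
        topic: sorted(c for c, n in counts.items() if n >= min_count)
        for topic, counts in candidates.items()
    }


clues = [
    ("Japan", "ISLANDS"), ("Japan", "ASIAN NATIONS"),
    ("Japan", "ON A MAP SOMEWHERE"),
    ("California", "U.S. STATES"), ("California", "ON A MAP SOMEWHERE"),
    ("California", "RIVER SOURCES"),
    ("Hamlet", "SHAKESPEARE"), ("Hamlet", "THE BARD"),
]
labeled = {
    "ISLANDS": "GEOGRAPHY", "ASIAN NATIONS": "GEOGRAPHY",
    "U.S. STATES": "GEOGRAPHY", "SHAKESPEARE": "LITERATURE",
}
result = propagate_topics(clues, labeled)
```

In this toy data, ON A MAP SOMEWHERE is suggested by both ‘Japan’ and ‘California’ and so survives the cutoff for GEOGRAPHY, while RIVER SOURCES and THE BARD each appear only once and are filtered out. Note that nothing in this sketch stops a category from surviving under two topics; how to break such ties is left open.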
It’s quite simple and it all worked beautifully.
-Perform multiple iterations of the process so that it uses the newly labeled topics to then in turn generate more new labels.
-Use the exact same method for sub-topic classification.