Wednesday, January 16, 2008

Text Mining of Political Speech

Introduction:


Commentators analyzed the 2005 State of the Union address and how it differed from the 2004 address in the softness and strength of the words used by the president.


Objective:
Given a set of political speeches, the objective is to correctly assign each speech to the period in which it was delivered.
A secondary objective is to investigate which data mining classification method works best for classifying political speech.


Challenges:
Semantic relationships: terms that refer to the same or a similar concept. Some terms are semantically equivalent, but different ones were preferred in different periods of time.
Differences in the style of political speech across the administrations of the 20th century.
Associative relationships: terms that are closely related but are not semantically or conceptually equivalent. These terms usually appear within the same context.



Approach:

Document Classification
The data set consists of 102 State of the Union documents covering the period from 1901 to 2000.


Supervised learning using predefined classes.

The first run grouped the documents by the year in which they were delivered.
Build a data set of ten classes, each covering a consecutive ten-year period:
1901 - 1910
1911 - 1920
….
1991 - 2000






The second run grouped the documents based on historical knowledge.
Build a data set of three classes based on historical periods:

  1. War time (1914-1919) & (1939-1945)

  2. Cold War time (1946-1990)

  3. Peace time (1901-1913) & (1991-2000)
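The two labeling schemes can be sketched as small lookup functions. This is only an illustration: the function names and class labels are hypothetical, and the years 1920-1938 are left unassigned because the list above does not place them in any of the three classes.

```python
from typing import Optional

def decade_class(year: int) -> str:
    """Map a year in 1901-2000 to its ten-year class, e.g. 1905 -> '1901-1910'."""
    if not 1901 <= year <= 2000:
        raise ValueError(f"year {year} is outside the 1901-2000 data set")
    start = ((year - 1) // 10) * 10 + 1
    return f"{start}-{start + 9}"

# Year ranges copied from the three historical classes listed above.
# The label strings are hypothetical names chosen for this sketch.
HISTORICAL_PERIODS = {
    "war": [(1914, 1919), (1939, 1945)],
    "cold_war": [(1946, 1990)],
    "peace": [(1901, 1913), (1991, 2000)],
}

def historical_class(year: int) -> Optional[str]:
    """Return the historical class for a year, or None if the year is unassigned."""
    for label, ranges in HISTORICAL_PERIODS.items():
        if any(low <= year <= high for low, high in ranges):
            return label
    # 1920-1938 falls here: the post does not assign these years to a class.
    return None
```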

Term Representation:
Build a model based on the bag-of-words representation scheme using the 102 documents.
The model is built using a stop list of 524 common words (like "the", "of", "is"), which are excluded. The resulting dictionary consists of 17,200 relevant words.
The training and testing documents are selected randomly (50/50) from the whole data set.
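As a rough illustration of the representation step, the sketch below builds a bag of words with a stop list and draws a random 50/50 split. It is not Rainbow's own code: the tokenizer is a simplifying assumption, the stop list holds only the three example words quoted above rather than the full 524, and the function names are hypothetical.

```python
import random
import re
from collections import Counter

# Only the three example words from the post; the real stop list had 524 entries.
STOP_WORDS = {"the", "of", "is"}

def bag_of_words(text, stop_words=STOP_WORDS):
    """Count lowercase alphabetic tokens, skipping stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

def split_half(doc_ids, seed=0):
    """Shuffle document ids and split them 50/50 into training and testing sets."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return ids[:half], ids[half:]
```

With 102 documents, `split_half(range(102))` yields 51 training and 51 testing ids; the fixed seed is there only to make the sketch repeatable.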
Classification Algorithms Used:



  • Naive Bayes classifier

  • TF*IDF Scheme

  • K-Nearest Neighbor
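To make the first of these methods concrete, here is a minimal multinomial Naive Bayes sketch over bag-of-words counts. It assumes add-one (Laplace) smoothing and ignores words not seen in training; Rainbow's own implementation may differ in both respects, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs is a list of (label, Counter of word counts) pairs."""
    class_docs = Counter()               # number of training documents per class
    class_words = defaultdict(Counter)   # word counts pooled per class
    vocab = set()
    for label, bag in docs:
        class_docs[label] += 1
        class_words[label].update(bag)
        vocab.update(bag)
    return class_docs, class_words, vocab

def classify(model, bag):
    """Return the class with the highest log posterior for one document."""
    class_docs, class_words, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_words = sum(class_words[label].values())
        for word, count in bag.items():
            if word not in vocab:
                continue  # assumption: unseen words are ignored
            # Add-one smoothing keeps unseen (class, word) pairs above zero.
            p = (class_words[label][word] + 1) / (total_words + len(vocab))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```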

Tools Used:
Rainbow, from Carnegie Mellon University
Bow (or libbow) is a library of C code for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).



Rainbow was developed to run on Unix; however, it works on Linux too.
Rainbow supports several classification methods:
Naïve Bayes (Default method)
K-Nearest Neighbor
TFIDF
Probabilistic Indexing
For this project, Rainbow was run on a Pentium II 400 MHz machine with 128 MB of RAM.


Analysis and Results:
The first run used the decade-based categorization.
Rainbow prints one line per test document in the following format:
directory/filename TrueClass TopPredictedClass:score1 2ndPredictedClass:score2 ..
Example:
The output was then used to build the confusion matrix to check the performance of the classifier.
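The step from classifier output to confusion matrix can be sketched as follows. The parser assumes whitespace-separated fields in the format shown above and class names without spaces; the helper names are hypothetical, and any sample lines fed to it would be made-up illustrations rather than results from this project.

```python
from collections import defaultdict

def parse_result_line(line):
    """Split one output line into (filename, true class, top class, top score)."""
    fields = line.split()
    filename, true_class = fields[0], fields[1]
    # The third field is the top prediction, written as Class:score.
    top_class, _, score = fields[2].rpartition(":")
    return filename, true_class, top_class, float(score)

def confusion_matrix(lines):
    """Count documents as matrix[true_class][predicted_class]."""
    matrix = defaultdict(lambda: defaultdict(int))
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, true_class, predicted, _ = parse_result_line(line)
        matrix[true_class][predicted] += 1
    return matrix

def accuracy(matrix):
    """Fraction of documents on the diagonal of the confusion matrix."""
    correct = sum(row.get(label, 0) for label, row in matrix.items())
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total if total else 0.0
```

Rows of the matrix are true classes and columns are predicted classes, so off-diagonal counts show which periods the classifier confuses with each other.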

Naïve Bayes Classifier Decade-Based Analysis:



TF*IDF Decade-Based Analysis:


KNN Decade-Based Analysis:





Conclusion:
Text mining techniques can work well for classifying documents within a subject as broad as political speech.
The Naïve Bayes classifier and TF-IDF performed better than KNN.


Future Work:
Use the data set from the State of the Union speeches to test other political speeches, e.g. inauguration addresses and the president’s weekly radio address.
Build a timeline classifier that can be used to classify political documents.


References:
M. Montes-y-Gómez, A. Gelbukh, A. López-López, Text Mining as a Social Thermometer, Text Mining Workshop at the 16th International Joint Conference on Artificial Intelligence (IJCAI'99), Stockholm, Sweden, July 31 - August 6, 1999, pp. 103-107.
Politics & Commentary from the NPR.org website
Tracing a Common Theme in State of the Union Addresses
http://www.npr.org/templates/story/story.php?storyId=4485068
Reviewing Bush's Address from a Theatrical Perspective
http://www.npr.org/templates/story/story.php?storyId=4485068
Translating the State of the Union Lexicon
http://www.npr.org/templates/story/story.php?storyId=4475701
4. Libbow Source Code
http://www2.cs.cmu.edu/~mccallum/bow/src/
5. Y. H. Li and A. K. Jain, Classification of Text Documents, Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan.
6. The American Presidency Project http://www.presidency.ucsb.edu/sou.php
