These are web-accessible demos and tools relating to (mostly
unpublished) older research.
Maximum Subsequence Segmentation
Extracts article text from HTML documents using maximum
subsequence segmentation with supervised scoring (trigram and
most-recent-unclosed-tag features), as described in the
paper. There is also a corresponding rudimentary web
service.
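The segmentation step itself can be sketched as follows. This is a minimal illustration, not the demo's actual code: it assumes each token has already been given a score (positive for likely article text, negative otherwise) and then finds the contiguous run of tokens with the largest total. The function name `max_subsequence` and the hand-picked scores are hypothetical; in the real system the scores would come from the supervised model.

```python
from typing import List, Sequence, Tuple


def max_subsequence(scores: Sequence[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the largest sum.

    `end` is exclusive. If no score is positive, the empty run (0, 0, 0.0)
    is returned.
    """
    best_start, best_end, best_sum = 0, 0, 0.0
    run_start, run_sum = 0, 0.0
    for i, score in enumerate(scores):
        # A run whose total has dropped to zero or below cannot help any
        # later run, so start a fresh run at the current token.
        if run_sum <= 0:
            run_start, run_sum = i, 0.0
        run_sum += score
        if run_sum > best_sum:
            best_start, best_end, best_sum = run_start, i + 1, run_sum
    return best_start, best_end, best_sum


# Hypothetical tokens and scores, chosen by hand for illustration only.
tokens: List[str] = ["<div>", "Home", "|", "Login", "<p>", "The", "council",
                     "voted", "today", "</p>", "<div>", "Copyright"]
scores: List[float] = [-0.4, -0.3, -0.4, -0.3, 0.1, 0.4, 0.45,
                       0.4, 0.3, 0.05, -0.4, -0.45]

start, end, total = max_subsequence(scores)
article_tokens = tokens[start:end]
```

The scan is linear in the number of tokens, so the cost of extraction is dominated by scoring rather than by the segmentation.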
http://took.cs.uiuc.edu/MSS/
Wikipedia Template Tag Histograms
View the number of times a particular value has appeared in each
template field in a given article (this most notably includes
infoboxes, which are implemented as templates in Wikipedia).
This is sometimes useful in manually estimating the degree and
quality of "contentiousness" for certain articles and attributes;
the data is taken from a 2009 English Wikipedia dump.
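A histogram of this kind could be built roughly as shown here. This is only a sketch under stated assumptions: it assumes the counts are taken over successive revisions of one article, and it uses a simple line-based regular expression (`FIELD_RE`, a hypothetical helper) that handles only one `| field = value` pair per line. Nested templates and multi-line values would need a proper wikitext parser. The revision texts are invented for illustration.

```python
import re
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable

# Matches lines of the form "| field = value" (one pair per line only).
FIELD_RE = re.compile(
    r"^[ \t]*\|[ \t]*([^=|\n]+?)[ \t]*=[ \t]*(.*?)[ \t]*$",
    re.MULTILINE,
)


def field_histograms(revisions: Iterable[str]) -> DefaultDict[str, Counter]:
    """Count how often each value appears in each template field."""
    histograms: DefaultDict[str, Counter] = defaultdict(Counter)
    for text in revisions:
        for field, value in FIELD_RE.findall(text):
            histograms[field][value] += 1
    return histograms


# Invented revisions of a single article's infobox.
revisions = [
    "{{Infobox settlement\n| name = Springfield\n| population = 30,000\n}}",
    "{{Infobox settlement\n| name = Springfield\n| population = 30,720\n}}",
    "{{Infobox settlement\n| name = Springfield\n| population = 30,000\n}}",
]

histograms = field_histograms(revisions)
```

A field whose counts are spread over several competing values (here, `population`) is the kind of attribute one might flag as contentious, while a field with a single value (`name`) is stable.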
http://took.cs.uiuc.edu/TemplateTags/
Deleted Wikipedia Articles
A MediaWiki installation hosting pages that were permanently
deleted from Wikipedia in 2009 and 2010, primarily for lack of
notability. These may be useful for Wikipedia research
(observing the type and quality of content that Wikipedia deemed
even less notable than its vast library of popular culture
minutiae), or simply for entertainment and curiosity value.
http://took.cs.uiuc.edu/wiki/
Artist Frequency * Inverse Document Frequency
This is a proof-of-concept demonstration of a method for finding
the similarity between TV series. Instead of counting the
number of mentions of all terms (as per standard TF-IDF), AF-IDF
looks at the set of people associated with each series and weights
them by the number of episodes in which they appeared (relative to
their total number of appearances in all TV series). The results
often correspond surprisingly well with human judgement.
Known bug: selecting "Use all cast and crew" will raise an
exception because a required data file has since been archived.
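The AF-IDF weighting described above can be sketched as follows. The weight of a person in a series is their episode count in that series divided by their total episode count across all series, as stated above. How the demo turns these weights into a similarity score is not specified here, so cosine similarity is used purely as an assumption; the function names and the toy data are likewise hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict

# series -> {person: number of episodes of that series they appeared in}
Appearances = Dict[str, Dict[str, int]]


def afidf_vectors(appearances: Appearances) -> Dict[str, Dict[str, float]]:
    """Weight each person by their share of appearances in each series."""
    totals: Dict[str, int] = defaultdict(int)
    for people in appearances.values():
        for person, episodes in people.items():
            totals[person] += episodes
    return {
        series: {p: n / totals[p] for p, n in people.items()}
        for series, people in appearances.items()
    }


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors (an assumed choice)."""
    dot = sum(w * b[p] for p, w in a.items() if p in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


# Invented toy data for illustration only.
data: Appearances = {
    "Series A": {"writer_1": 10, "actor_1": 20},
    "Series B": {"writer_1": 10, "actor_2": 15},
    "Series C": {"actor_3": 30},
}

vectors = afidf_vectors(data)
sim_ab = cosine(vectors["Series A"], vectors["Series B"])
sim_ac = cosine(vectors["Series A"], vectors["Series C"])
```

In this toy data, Series A and Series B share a writer and so receive a nonzero similarity, while Series A and Series C share no one and score zero. A person who works on many series contributes less to any one of them, which plays the role that inverse document frequency plays in TF-IDF.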
http://took.cs.uiuc.edu/afidf/