Jeff Pasternack

Computer Science Ph.D. / University of Illinois at Urbana-Champaign

Demos

These are web-accessible demos and tools relating to (mostly unpublished) older research.


Maximum Subsequence Segmentation

Extracts article text from HTML documents using maximum subsequence segmentation with supervised scoring (trigram and most-recent-unclosed-tag features), as described in the paper.  There is also a corresponding rudimentary web service.

http://took.cs.uiuc.edu/MSS/
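
The segmentation step can be illustrated with a minimal sketch. It assumes each token of the page has already been given a score, positive when it looks like article text and negative otherwise. The scores below are made up for illustration and do not come from the trained trigram and tag-feature scorer. The function name max_subsequence is also hypothetical. The contiguous run of tokens with the largest total score is then taken as the article.

```python
from typing import Sequence, Tuple


def max_subsequence(scores: Sequence[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) for the contiguous run with the largest sum.

    The end index is exclusive. If no run has a positive sum, the empty
    run (0, 0, 0.0) is returned.
    """
    best_start = best_end = 0
    best_sum = 0.0
    cur_start = 0
    cur_sum = 0.0
    for i, score in enumerate(scores):
        cur_sum += score
        if cur_sum <= 0:
            # A run whose sum is not positive can never help a later run,
            # so start over at the next token.
            cur_sum = 0.0
            cur_start = i + 1
        elif cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i + 1
    return best_start, best_end, best_sum


# Hypothetical tokens and scores, for illustration only.
tokens = ["<div>", "Home", "|", "About", "<p>", "The", "quick", "fox", "</p>", "Login"]
scores = [-1.0, -0.8, -0.9, -0.7, 0.2, 0.9, 0.8, 0.9, 0.1, -1.0]

start, end, total = max_subsequence(scores)
article_tokens = tokens[start:end]  # ["<p>", "The", "quick", "fox", "</p>"]
```

This runs in a single pass over the scores, so the cost of segmentation is linear in the number of tokens once scoring is done.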


Wikipedia Template Tag Histograms

View the number of times a particular value has appeared in each template field in a given article (this most notably includes infoboxes, which are implemented as templates in Wikipedia).  This is sometimes useful in manually estimating the degree and quality of "contentiousness" for certain articles and attributes; the data is taken from a 2009 English Wikipedia dump.

http://took.cs.uiuc.edu/TemplateTags/


Deleted Wikipedia Articles

A MediaWiki installation hosting pages that were permanently deleted from Wikipedia in 2009 and 2010, primarily for lack of notability.  These may be useful for Wikipedia research (observing the type and quality of content that Wikipedia deemed even less notable than its vast library of popular culture minutiae), or simply for entertainment and curiosity value.

http://took.cs.uiuc.edu/wiki/


Artist Frequency * Inverse Document Frequency

This is a proof-of-concept demonstration of a method for finding the similarity between TV series.  Instead of counting the number of mentions of all terms (as per standard TF-IDF), AF-IDF looks at the set of people associated with each series and weights them by the number of episodes in which they appeared (relative to their total number of appearances in all TV series). The results often correspond surprisingly well with human judgement.

Known bug: selecting "Use all cast and crew" will result in an exception since a requisite data file has since been archived.

http://took.cs.uiuc.edu/afidf/
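
The weighting described above can be sketched as follows. This is only an illustration under stated assumptions. Each person's weight in a series is taken to be their episode count in that series divided by their episode count across all series. Comparing two series by cosine similarity is an assumption made here, since the description does not say how the weighted sets are compared. The series names, people, and counts are invented, and the function names afidf_weights and cosine are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict

# series name -> person -> number of episodes of that series they appeared in
Credits = Dict[str, Dict[str, int]]


def afidf_weights(credits: Credits) -> Dict[str, Dict[str, float]]:
    """Weight each person in a series by their share of all their appearances."""
    totals: Dict[str, int] = defaultdict(int)
    for people in credits.values():
        for person, episodes in people.items():
            totals[person] += episodes
    return {
        series: {p: n / totals[p] for p, n in people.items() if n > 0}
        for series, people in credits.items()
    }


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse weight vectors (an assumed choice)."""
    dot = sum(w * b[p] for p, w in a.items() if p in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented data, for illustration only.
credits: Credits = {
    "Show A": {"alice": 10, "bob": 5},
    "Show B": {"alice": 10, "carol": 8},
    "Show C": {"dave": 3},
}

weights = afidf_weights(credits)
sim_ab = cosine(weights["Show A"], weights["Show B"])  # shared person: alice
sim_ac = cosine(weights["Show A"], weights["Show C"])  # no shared people
```

Under this weighting, someone who appears in only one series counts fully toward it, while someone spread across many series contributes less to each.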