Jeff Pasternack

Computer Science Ph.D. / University of Illinois at Urbana-Champaign

Software

This is a partial collection of the software I have written.  Some of these packages (e.g. Quadruple and BetterStreams) are general-purpose libraries with broad applications, while others are implementations of algorithms from my publications.  Please contact me if you are interested in licensing options beyond those listed here.

This list will expand as I find time to package additional code.


Quadruple Floating Point Library

A signed 128-bit floating point data type with 64 effective bits of precision (vs. 53 for the built-in Double type) and a 64-bit exponent (vs. 11 bits for Doubles).  With greater precision and far greater range, Quads are especially useful when dealing with very large or very small values, such as those in probabilistic models.  Adopting a larger fixed precision rather than an arbitrary-precision type (such as Java's BigDecimal) means that, while Quads are still slower than built-in arithmetic, the penalty is an order of magnitude or less, so they remain feasible in many math-heavy applications.

For example, on an Intel Core i5-2410M laptop, a billion multiplications take 17 seconds with Double values, 135 seconds with Quad values using the overloaded "*" operator, and just 76 seconds using the Multiply() method (equivalent to "*="), which is less than five times the cost of native Double arithmetic.  (The higher overhead of "*" is due to the poor inlining logic of the .Net compiler/JIT optimizer.)  By comparison, the commonly used workaround for multiplication underflow and overflow, summing logarithms, takes 130 seconds.  In addition to being faster and more precise than log arithmetic, Quads also simplify code by eliminating the need to remember which variables hold logarithms and to convert back and forth between logarithms and ordinary values.
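To make the underflow problem concrete, here is a minimal Python sketch (not the Quadruple library's C# code or API).  It multiplies one hundred small probabilities three ways: naively with doubles, by summing logarithms, and with a hypothetical ScaledFloat class that keeps a double mantissa next to a separate, unbounded integer exponent.  ScaledFloat and its method names are assumptions made purely for illustration; the sketch shows why a wider exponent avoids underflow, not how Quads are actually implemented.

```python
import math


def product_naive(values):
    """Multiply with ordinary doubles; underflows to 0.0 for tiny products."""
    result = 1.0
    for v in values:
        result *= v
    return result


def log_product(values):
    """The usual workaround: return the log of the product as a sum of logs."""
    return sum(math.log(v) for v in values)


class ScaledFloat:
    """Hypothetical illustration: value = mantissa * 2**exponent.

    The exponent is a Python int, so it cannot overflow or underflow.
    """

    def __init__(self, value=1.0):
        self.mantissa, self.exponent = math.frexp(value)

    def multiply(self, value):
        # Renormalise after every step so the mantissa stays in [0.5, 1).
        mantissa, shift = math.frexp(self.mantissa * value)
        self.mantissa = mantissa
        self.exponent += shift

    def log(self):
        return math.log(self.mantissa) + self.exponent * math.log(2.0)


probabilities = [1e-5] * 100  # true product is 1e-500, far below double range

naive = product_naive(probabilities)
log_sum = log_product(probabilities)

scaled = ScaledFloat()
for p in probabilities:
    scaled.multiply(p)
```

Here the naive product collapses to zero, while the log sum and the scaled value both still represent the true product; the scaled value does so without converting any operand to a logarithm.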

The Quadruple library is written in C# and targets .Net 4.0; it should also be easily portable to .Net 2.0, and to similar languages (such as Java), with straightforward modifications.

License: LGPL
Download: http://quadruple.codeplex.com/



Maximum Subsequence Segmentation

This .Net 2.0+ library implements supervised maximum subsequence segmentation for extracting article text (including model training) with unigram and trigram features, exposed through an easy-to-use EasyMSS class.  There are also many other classes for additional features, algorithm variants, etc., but these are sparsely documented.  A readme file and example C# code using the library are included with the package, along with a serialized model file trained on the 24K examples in the paper.

If you are using a non-.Net language and wish to use the supervised model from our paper to extract text, consider using the web service instead.
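As a rough illustration of the core idea only (not the EasyMSS API, and not the trained model), the Python sketch below finds the contiguous run of tokens with the largest total score.  The tokens and per-token scores are made up for the example; it assumes that some classifier has already assigned positive scores to tokens that look like article text and negative scores to the rest.

```python
def max_subsequence(scores):
    """Return (start, end, total) for the contiguous slice scores[start:end]
    with the largest sum, found in a single linear pass."""
    best_start, best_end, best_total = 0, 0, 0.0
    run_start, run_total = 0, 0.0
    for i, score in enumerate(scores):
        if run_total <= 0:
            # A non-positive running sum can never help; start a new run here.
            run_start, run_total = i, score
        else:
            run_total += score
        if run_total > best_total:
            best_start, best_end, best_total = run_start, i + 1, run_total
    return best_start, best_end, best_total


# Hypothetical tokens and scores, invented for this example.
tokens = ["Home", "Login", "Storm", "hits", "coast", "today", "Share", "Copyright"]
scores = [-1.0, -2.0, 1.5, 0.5, 2.0, 1.0, -1.5, -3.0]

start, end, total = max_subsequence(scores)
article = tokens[start:end]
```

With these scores the selected run is the four middle tokens, so the navigation and footer tokens on either side are dropped.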

License: Proprietary (Academic Research Only)
Download: /media/1124/msslibrary.zip
Publication: Extracting Article Text from the Web with Maximum Subsequence Segmentation



BetterStreams

The BetterStreams library is a collection of three classes that aid in manipulating streams.  AsyncStream and BetterBufferedStream wrap existing streams to improve I/O performance.  AsyncStream provides simple, fast asynchronous I/O via the standard Read() and Write() methods without the overhead and complexity of BeginRead/EndRead and BeginWrite/EndWrite, while BetterBufferedStream is similar to System.IO.BufferedStream but with much more efficient seeks and "peeks".  Finally, the static AlternateStreams class adds the ability to manipulate NTFS alternate data streams (ADS), which are a convenient way to add metadata to a file or to create compound storage (multiple data streams in a single file).  Requires the .Net Framework 2.0 (or later).
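For readers unfamiliar with "peeking", the short Python sketch below shows the general idea using Python's standard io.BufferedReader rather than BetterBufferedStream, whose API is not reproduced here.  A peek returns upcoming bytes from the buffer without advancing the stream position, so the same bytes are still available to the next read.

```python
import io

# An in-memory stream standing in for a file or network stream.
raw = io.BytesIO(b"HEADpayload")
buffered = io.BufferedReader(raw)

# peek() may return more or fewer bytes than requested, so slice the result.
peeked = buffered.peek(4)[:4]
position_after_peek = buffered.tell()  # the position has not moved

header = buffered.read(4)  # reads the same four bytes that were peeked
rest = buffered.read()     # reads everything that remains
```

This pattern is handy for inspecting a header or magic number before deciding how to parse the rest of a stream.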

The Academic/30-day Evaluation license permits use in non-commercial, academic research at no cost, and use for evaluation purposes for up to 30 days.  It does not permit redistribution.

License: Proprietary (Academic/30-day Evaluation)
Download: /media/850/betterstreamsinstaller.msi



HTMLTools

Lightweight HTML parser and tokenizer.  HTMLTools breaks an HTML document into tokens corresponding to tags and words.  A key feature is that each token's offset in the original HTML is recorded, allowing you to find the token corresponding to a particular character offset (and vice-versa).
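The offset bookkeeping can be sketched in a few lines of Python.  This is an illustration of the idea only, not HTMLTools' API: the regular expression, the tokenize() and token_at() helpers, and the (offset, text) token layout are all assumptions made for the example, and the pattern ignores complications such as comments, scripts, and ">" inside attribute values.

```python
import bisect
import re

# Simplified pattern: a token is either a whole tag or a run of non-space text.
TOKEN_PATTERN = re.compile(r"<[^>]*>|[^<\s]+")


def tokenize(html):
    """Return a list of (start_offset, text) pairs in document order."""
    return [(m.start(), m.group()) for m in TOKEN_PATTERN.finditer(html)]


def token_at(tokens, offset):
    """Return the index of the token covering a character offset, or None
    if the offset falls between tokens (e.g. on whitespace)."""
    starts = [start for start, _ in tokens]
    i = bisect.bisect_right(starts, offset) - 1
    if i < 0:
        return None
    start, text = tokens[i]
    return i if offset < start + len(text) else None


html = "<p>Hello <b>world</b></p>"
tokens = tokenize(html)

index = token_at(tokens, 13)        # offset 13 is the "o" in "world"
start, text = tokens[index]         # and back: token index to offset
```

Because each token keeps its starting offset, the mapping works in both directions: a binary search takes a character offset to its token, and the stored offset takes a token back to its position in the original HTML.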

License: Proprietary (Academic Research Only)
Download: /media/1162/htmltools.zip



Wiki Access Objects

.Net library for reading, parsing, analyzing, storing, and writing MediaWiki corpora (such as Wikipedia).

License: (not yet available)
Download: (not yet available)
Publication: The Wikipedia Corpus