This is a partial collection of the software I have
written. Some packages (e.g. Quadruple and BetterStreams) are
general-purpose libraries with broad applications, while others are
implementations of algorithms from my publications. Please
contact me if you are interested in licensing
options beyond those listed here.
This list will expand as I find time to package additional
code.
Quadruple Floating Point Library
Signed 128-bit floating-point data type, with 64 effective bits
of precision (vs. 53 for the built-in Double type) and a 64-bit
exponent (vs. 11 bits for Doubles). With greater precision and far
greater range, Quads are especially useful when dealing with very
large or very small values, such as those in probabilistic
models. Adopting a larger fixed precision rather than an
arbitrary precision type (such as Java's BigDecimal) means that,
while still slower than built-in arithmetic, the penalty is only an
order of magnitude or less and thus still feasible in many
math-heavy applications. For example, on an Intel Core
i5-2410M laptop, a billion multiplications take 17
seconds with Double values, 135 seconds with Quad values using the
overloaded "*" operator, and just 76 seconds using the Multiply()
method (equivalent to "*="), which is less than five times the cost of
native Double arithmetic. The higher overhead of "*" is due to the
poor inlining logic of the .Net compiler/JIT optimizer. By
comparison, summing logarithms, the commonly used workaround for
multiplication underflow and overflow, takes 130
seconds. In addition to being faster and more precise than
log arithmetic, Quads also simplify code by eliminating the need to
remember which variables hold logarithms and to convert back and
forth between logarithms and ordinary values.
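The underflow problem and the two ways around it can be illustrated with a small experiment. The sketch below is written in Python for brevity and is not the Quadruple library's implementation; `product_scaled` is a hypothetical helper that keeps an ordinary double significand alongside a separate, unbounded integer exponent, which only loosely mirrors the idea of widening the exponent field.

```python
import math

def product_naive(values):
    """Multiply plain doubles; tiny products silently underflow to 0.0."""
    result = 1.0
    for v in values:
        result *= v
    return result

def product_log(values):
    """Log-arithmetic workaround: returns log(product), not the product.
    Assumes every value is strictly positive."""
    return sum(math.log(v) for v in values)

def product_scaled(values):
    """Hypothetical wide-exponent product.

    Returns (sig, exp) with product == sig * 2**exp, where sig stays in
    [0.5, 1) and exp is an ordinary Python int, so it cannot overflow.
    """
    sig, exp = 0.5, 1  # 0.5 * 2**1 == 1.0
    for v in values:
        m, e = math.frexp(v)               # v == m * 2**e
        sig, shift = math.frexp(sig * m)   # renormalize the significand
        exp += e + shift
    return sig, exp

probabilities = [1e-5] * 100               # true product is 1e-500
naive = product_naive(probabilities)       # underflows
log_product = product_log(probabilities)
sig, exp = product_scaled(probabilities)
```

Here `naive` ends up as zero, while both `log_product` and the `(sig, exp)` pair still describe the true product. The pair can be multiplied further without converting to and from logarithms, which is the convenience the text attributes to Quads.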
The Quadruple library is written in C# and targets .Net 4.0; it
should also be easily portable to .Net 2.0 and similar languages
(such as Java) with straightforward modifications.
License:
LGPL
Download:
http://quadruple.codeplex.com/
Maximum Subsequence Segmentation
This .Net 2.0+ library implements article text extraction (and
model training) by supervised maximum subsequence segmentation,
with unigram and trigram features, via an easy-to-use EasyMSS
class. There are also many other classes used for additional
features, algorithm variants, etc., but these are sparsely
documented. A readme file and example C# code using the
library are included with the package, along with a serialized model
file trained upon the 24K examples in the paper.
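For readers unfamiliar with the term, the maximum subsequence step can be sketched as follows. This is a generic illustration in Python, not the library's EasyMSS API: the function name, the tokens, and the per-token scores are all made up, and the sketch simply assumes that some trained model has already given each token a positive score if it looks like article text and a negative score otherwise.

```python
def max_subsequence(scores):
    """Return (start, end, total) for the contiguous run scores[start:end]
    with the largest sum (Kadane's algorithm). An empty run, reported as
    (0, 0, 0.0), is returned when no score is positive."""
    best_sum, best_start, best_end = 0.0, 0, 0
    run_sum, run_start = 0.0, 0
    for i, score in enumerate(scores):
        if run_sum <= 0:
            # A non-positive prefix can never help; start a new run here.
            run_sum, run_start = score, i
        else:
            run_sum += score
        if run_sum > best_sum:
            best_sum, best_start, best_end = run_sum, run_start, i + 1
    return best_start, best_end, best_sum

# Invented example: navigation and footer tokens score negative,
# article tokens mostly score positive.
tokens = ["Home", "Login", "Storm", "hits", "the", "coast", "Copyright", "2009"]
scores = [-0.4, -0.3, 0.2, 0.4, -0.1, 0.3, -0.5, -0.2]

start, end, total = max_subsequence(scores)
article = tokens[start:end]
```

The selected run may include an occasional negatively scored token (such as "the" above) when the tokens around it outweigh it, which is what makes the result a single contiguous block.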
If you are using a non-.Net language and wish to use the
supervised model from our paper to extract text, consider using the
web service instead.
License:
Proprietary (Academic Research Only)
Download:
/media/1124/msslibrary.zip
Publication:
Extracting Article Text from the Web with Maximum Subsequence Segmentation
BetterStreams
The BetterStreams library is a collection of three classes that
aid in manipulating streams. AsyncStream and
BetterBufferedStream wrap existing streams to improve I/O
performance. AsyncStream provides simple, fast asynchronous
I/O via the standard Read() and Write() methods without the
overhead and complexity of BeginRead/EndRead and
BeginWrite/EndWrite, while BetterBufferedStream is similar to
System.IO.BufferedStream but with much more efficient seeks and
"peeks". Finally, the static AlternateStreams class adds the
ability to manipulate NTFS alternate data streams (ADS), which are a
great way to add metadata to a file or create compound storage
(multiple data streams in a single file). Requires the .Net
Framework 2.0 (or later).
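The general idea of asynchronous I/O behind an ordinary blocking read call can be sketched as a read-ahead wrapper. The Python class below is a hypothetical illustration only; it is not AsyncStream's implementation or interface, and the class name, chunk size, and queue depth are assumptions chosen for the example.

```python
import io
import queue
import threading

class ReadAheadStream:
    """Hypothetical wrapper: a background thread reads chunks from the
    underlying stream ahead of the consumer, which calls a plain read()."""

    def __init__(self, raw, chunk_size=64 * 1024, max_chunks=4):
        self._raw = raw
        self._chunk_size = chunk_size
        self._chunks = queue.Queue(maxsize=max_chunks)  # bounds memory use
        self._pending = b""   # unread remainder of the current chunk
        self._eof = False
        worker = threading.Thread(target=self._fill, daemon=True)
        worker.start()

    def _fill(self):
        # Runs on the worker thread. An empty chunk signals end of stream.
        # Error propagation to the reader is omitted in this sketch.
        while True:
            chunk = self._raw.read(self._chunk_size)
            self._chunks.put(chunk)  # blocks while the queue is full
            if not chunk:
                return

    def read(self, size):
        """Return up to `size` bytes; fewer only at end of stream."""
        parts = []
        remaining = size
        while remaining > 0:
            if not self._pending:
                if self._eof:
                    break
                self._pending = self._chunks.get()
                if not self._pending:
                    self._eof = True
                    break
            take = self._pending[:remaining]
            self._pending = self._pending[len(take):]
            parts.append(take)
            remaining -= len(take)
        return b"".join(parts)

data = bytes(range(256)) * 1000
stream = ReadAheadStream(io.BytesIO(data), chunk_size=1000)
received = b""
while True:
    block = stream.read(777)
    if not block:
        break
    received += block
```

With a slow underlying stream, the worker can fetch the next chunks while the caller is still processing the previous ones, yet the calling code stays a simple loop over `read()`.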
The Academic/30-day Evaluation license permits use in
non-commercial, academic research at no cost, and use for
evaluation purposes for up to 30 days. It does not permit
redistribution.
License:
Proprietary (Academic/30-day Evaluation)
Download:
/media/850/betterstreamsinstaller.msi
HTMLTools
Lightweight HTML parser and tokenizer. HTMLTools breaks an
HTML document into tokens corresponding to tags and words. A
key feature is that each token's offset in the original HTML is
recorded, allowing you to find the token corresponding to a
particular character offset (and vice versa).
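Offset-preserving tokenization of this kind can be sketched in a few lines. The Python below is an assumption-laden illustration rather than HTMLTools' API: the regular expression, the tuple layout, and the function names are invented for the example, and real HTML (comments, scripts, attribute values containing ">") needs more careful handling.

```python
import bisect
import re

# A tag is "<...>"; a word is any run of characters that is neither
# whitespace nor "<". Deliberately simplistic.
TOKEN_RE = re.compile(r"<[^>]*>|[^<\s]+")

def tokenize(html):
    """Return a list of (start, end, kind, text) tuples in document order."""
    tokens = []
    for match in TOKEN_RE.finditer(html):
        text = match.group()
        kind = "tag" if text.startswith("<") else "word"
        tokens.append((match.start(), match.end(), kind, text))
    return tokens

def token_at(tokens, starts, offset):
    """Return the index of the token covering `offset`, or None if the
    offset falls between tokens (e.g. on whitespace)."""
    i = bisect.bisect_right(starts, offset) - 1
    if i >= 0 and offset < tokens[i][1]:
        return i
    return None

html = "<p>Hello <b>big</b> world</p>"
tokens = tokenize(html)
starts = [t[0] for t in tokens]   # sorted, so binary search applies

index = token_at(tokens, starts, 13)   # offset 13 lies inside "big"
start, end, kind, text = tokens[index]
```

Because each token stores its start and end offsets, going from a token back to its position in the original HTML is a direct lookup, and going from an offset to a token is a binary search over the start positions.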
License:
Proprietary (Academic Research Only)
Download:
/media/1162/htmltools.zip
Wiki Access Objects
.Net library for reading, parsing, analyzing, storing and
writing MediaWiki corpora (such as Wikipedia).
License:
(not yet available)
Download:
(not yet available)
Publication:
The Wikipedia Corpus