| Abstract | | Wikipedia, the popular online encyclopedia,
has in just six years grown from an adjunct to the now-defunct
Nupedia to over 31 million pages and 429 million revisions in 256
languages and spawned sister projects such as Wiktionary and
Wikisource. Available under the GNU Free Documentation
License, it is an extraordinarily large corpus with broad scope and
constant updates. Its articles are largely consistent in
structure and organized into category hierarchies.
However, the wiki method of collaborative editing creates
challenges that must be addressed. Wikipedia's accuracy is
frequently questioned, and systemic bias leaves quality and
coverage uneven; even the mix of English dialects found side by side
can trip up the unwary with differences in semantics,
diction and spelling. This paper examines Wikipedia from a
research perspective, providing basic background knowledge and an
understanding of its strengths and weaknesses. We also solve
a technical challenge posed by the sheer volume of text made
available (1.04 TB for the English version) by means of a simple,
easily implemented dictionary compression algorithm that permits
time-efficient random access to the data with a twenty-eight-fold
reduction in size. |
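
The abstract does not spell out the compression algorithm, so the following is only a minimal sketch of the general idea of dictionary compression with random access, not the paper's method. It assumes a word-level dictionary, variable-length integer codes, and a per-record offset table; the names `compress`, `read_record`, `encode_varint` and `decode_varints` are hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Words and the gaps between them, so joining the tokens restores the text exactly.
TOKEN = re.compile(r"\w+|\W+")


def encode_varint(n):
    """Encode a non-negative integer in 7-bit groups, low group first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(data):
    """Decode a byte string of concatenated varints into a list of integers."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


def compress(records):
    """Return (vocab, blob, offsets) for a list of text records."""
    tokenized = [TOKEN.findall(r) for r in records]
    counts = Counter(tok for toks in tokenized for tok in toks)
    # Frequent tokens get small ids and therefore short codes.
    vocab = [tok for tok, _ in counts.most_common()]
    ids = {tok: i for i, tok in enumerate(vocab)}
    blob, offsets = bytearray(), [0]
    for toks in tokenized:
        for tok in toks:
            blob += encode_varint(ids[tok])
        offsets.append(len(blob))  # where the next record starts
    return vocab, bytes(blob), offsets


def read_record(vocab, blob, offsets, i):
    """Decode record i alone, without decoding any other record."""
    chunk = blob[offsets[i]:offsets[i + 1]]
    return "".join(vocab[k] for k in decode_varints(chunk))
```

Because each record is encoded independently against a shared dictionary, fetching one article costs a single slice and a decode of that slice only, which is what makes random access cheap. For scale, the twenty-eight-fold reduction quoted in the abstract would bring 1.04 TB down to roughly 37 GB; this sketch makes no claim to reach that ratio.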