
ANC Second Release Frequency Data

The Data

There are two versions of the frequency data files, one sorted by lemma and the other sorted by frequency count. The files are available as zip archives or UTF-8 text files.



Written & Spoken

File Format

The frequency files consist of four columns separated by TAB characters. The four columns are:

  1. Word - the word as it appears in the text.
  2. Lemma - the word's lemma.
  3. POS - the Penn part of speech tag for the word.
  4. Count - the number of occurrences in the second release.
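
As a rough illustration of the four-column layout described above, the following sketch reads tab-separated rows into named tuples. The names `FreqEntry` and `read_frequency_rows` are hypothetical, the sample rows are invented, and the sketch assumes that any row whose last column is not an integer (for example a header line, if the files have one) can simply be skipped.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class FreqEntry(NamedTuple):
    """One row of a frequency file (hypothetical helper type)."""
    word: str
    lemma: str
    pos: str
    count: int


def read_frequency_rows(handle: TextIO) -> Iterator[FreqEntry]:
    """Yield one FreqEntry per well-formed, TAB-separated row."""
    # QUOTE_NONE keeps quote characters in tokens such as '' or `` intact.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 4:
            continue  # blank or malformed line
        word, lemma, pos, count = row
        try:
            yield FreqEntry(word, lemma, pos, int(count))
        except ValueError:
            continue  # assumed non-data row, e.g. a header


# Invented sample data, not taken from the actual release.
sample = io.StringIO("dogs\tdog\tNNS\t12\nran\trun\tVBD\t7\n")
entries = list(read_frequency_rows(sample))
```

For a real file, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is whichever text file was downloaded.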

Token Counts

Frequency counts are also available for word types, that is, the surface forms of words as they appear in the text, without regard to part of speech or lemma. Each file contains three columns:

  1. Token - the word as it appears in the text.
  2. Count - the number of times the token appears.
  3. Ratio - the frequency ratio for the word.

There are 239,208 unique tokens (types) in the second release and 22,164,985 tokens in total, for an overall type-token ratio of 0.010792.
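
The ratio quoted above is the number of unique tokens divided by the total number of tokens. A minimal check of that arithmetic, with variable names chosen only for this illustration:

```python
# Figures as stated for the second release.
unique_tokens = 239_208
total_tokens = 22_164_985

# Type-token ratio: distinct surface forms divided by all tokens.
type_token_ratio = unique_tokens / total_tokens
formatted = f"{type_token_ratio:.6f}"
print(formatted)  # 0.010792
```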


The frequency information includes counts for every token that the part of speech tagger assigned a tag. Tokens such as the possessive 's are therefore counted as words. The frequency counts were generated by reading the standoff annotation files for the Penn part of speech tags to obtain the lemma, the part of speech, and the start and end offsets of each word in the text. Each occurrence of a word was then extracted from the content using those offsets and stored as a triple { type, lemma, part of speech }. Identical triples were then counted to obtain the frequency counts.
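
The counting procedure described above can be sketched roughly as follows. This is not the code used to produce the released files: the `Annotation` record and the `count_triples` function are hypothetical, and the sketch assumes the standoff annotations have already been parsed into (start, end, lemma, part of speech) records with character offsets into the text.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Annotation(NamedTuple):
    """A parsed standoff annotation (hypothetical structure)."""
    start: int  # offset of the first character of the word
    end: int    # offset just past the last character
    lemma: str
    pos: str


def count_triples(
    text: str, annotations: Iterable[Annotation]
) -> Counter:
    """Count occurrences of each (type, lemma, part of speech) triple."""
    counts: Counter = Counter()
    for ann in annotations:
        surface = text[ann.start:ann.end]  # word as it appears in the text
        counts[(surface, ann.lemma, ann.pos)] += 1
    return counts


# Invented example text and annotations.
text = "dogs chase dogs"
annotations = [
    Annotation(0, 4, "dog", "NNS"),
    Annotation(5, 10, "chase", "VBP"),
    Annotation(11, 15, "dog", "NNS"),
]
triple_counts = count_triples(text, annotations)
```

Because the surface form is sliced out of the text by offset, any tokenization error in the annotations carries straight through to the counts.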

Known Problems

The accuracy of the frequency counts depends on the accuracy of the tokenization. We note the following issues: