There are two versions of the frequency data files, one sorted by lemma and the other sorted by frequency count. The files are available as zip archives or UTF-8 text files.
The frequency files consist of four columns separated by TAB characters. The four columns are:
Frequency counts are also available for word types, that is, the surface form of the word as it appears in the text without considering part of speech or lemma. Each file contains three columns:
There are 239,208 unique tokens (types) in the second release and 22,164,985 tokens in total, for an overall type-token ratio of 0.010792.
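The ratio quoted above is the number of unique types divided by the total number of tokens. The following minimal sketch reproduces that arithmetic from the two counts given; the variable names are illustrative only and are not taken from the corpus tools.

```python
# Counts as stated for the second release.
unique_types = 239_208
total_tokens = 22_164_985

# Type-token ratio: unique types divided by total tokens.
type_token_ratio = unique_types / total_tokens

# Rounded to six decimal places, as in the text.
print(f"{type_token_ratio:.6f}")
```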
The frequency information includes counts for every token that the part of speech tagger assigned a tag. Tokens such as the possessive 's are therefore counted as words. The frequency counts were generated by reading the standoff annotation files for the Penn part of speech tags to obtain the lemma, the part of speech, and the start and end offsets of each word in the text. The surface form of the word was then extracted from the content at those offsets and stored in the triple { type, lemma, part of speech }. Occurrences of each unique triple were then counted to obtain the frequency counts.
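The counting procedure described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used to build the released files: the standoff annotations are assumed to have already been parsed into (start, end, lemma, part of speech) tuples, and the function name count_triples and the sample sentence are hypothetical.

```python
from collections import Counter

def count_triples(text, annotations):
    """Count occurrences of each (type, lemma, part of speech) triple.

    `annotations` is assumed to be an iterable of
    (start, end, lemma, pos) tuples, where start and end are
    character offsets into `text` (end exclusive).
    """
    counts = Counter()
    for start, end, lemma, pos in annotations:
        surface = text[start:end]  # the word type as it appears in the text
        counts[(surface, lemma, pos)] += 1
    return counts

# Hypothetical sample data, tagged with Penn part of speech tags.
text = "the dogs bark and the dog barks"
annotations = [
    (0, 3, "the", "DT"),
    (4, 8, "dog", "NNS"),
    (9, 13, "bark", "VBP"),
    (14, 17, "and", "CC"),
    (18, 21, "the", "DT"),
    (22, 25, "dog", "NN"),
    (26, 31, "bark", "VBZ"),
]

counts = count_triples(text, annotations)
for (surface, lemma, pos), n in counts.most_common():
    print(surface, lemma, pos, n, sep="\t")
```

Because the surface form is sliced from the text by offset, any error in the offsets or in the tokenization carries straight through to the counts.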
The accuracy of the frequency counts depends on the accuracy of the tokenization. We note the following issues: