


What's New

The Open ANC

The open portion of the ANC (approximately 15 million words of text, with annotations) is now available for download.

2nd Release Frequency Counts

Frequency counts for the second release are now available and can be downloaded here.

New Annotations Available

Both sets of annotations can be downloaded from our annotations page.

Manually Annotated Subcorpus

The ANC, in collaboration with the FrameNet project, WordNet, and Columbia University, has received a grant from the National Science Foundation to produce a balanced sub-corpus of the ANC that is manually annotated for WordNet senses and FrameNet frames, and validated for word and sentence boundaries, part of speech, noun chunks, and verb chunks.

ANC in UIMA

The ANC has been awarded an IBM UIMA Innovation Grant to port the ANC to UIMA and provide information with all ANC annotations that conform to UIMA Type Definitions.

SIGANN shared corpus

The newly formed Association for Computational Linguistics Special Interest Group for Annotations (SIGANN) is publishing a 40K-word "sharable corpus" consisting of texts drawn from the ANC 2nd Release, with the intent of gathering as many annotations of the corpus as possible.

ANC in the news

The ANC has been written up in national newspapers.

The American National Corpus (ANC) project is creating a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward. The ANC will provide the most comprehensive picture of American English ever created, and will serve as a resource for education, linguistic and lexicographic research, and technology development.

When completed, the ANC will contain a core corpus of at least 100 million words, comparable across genres to the British National Corpus (BNC). The corpus will also include an "opportunistic" component of potentially several hundred million words, chosen to provide both the broadest and largest selection of texts (and, where available, annotations) possible.

ANC Status

The ANC has so far released 22 million words of American English, which are available from the Linguistic Data Consortium; please consult the LDC Catalog entry.

Contribute to the ANC

Contribute Texts

Do you have public domain (or "sharealike") texts in American English produced in or after 1990? You can upload all or parts of this data to be included in the ANC.

Authors may consult the frequently asked questions page to learn more about how their data will be used and why they should consider contributing their work to the ANC.

Contribute annotations and derived data

If you have annotated any part of the ANC for linguistic features of any kind, or produced linguistic information derived from it, please contribute the annotations to the ANC for free distribution to, and use by, anyone who has the ANC data.

Coming Soon

ANC annotations in the format specified for the Linguistic Annotation Format developed by ISO TC37 SC4, and a version of the ANC Tool that handles data in this format.

New output options for the ANC Tool, initially including output formats for the Natural Language Processing Toolkit and UIMA.

Future Releases

The ANC is working with annotation projects that are generating layers of annotation for some or all of the following: Penn Treebank-style syntactic annotations, PropBank, NomBank, TimeML, and opinion annotations. The data and annotations from these projects will be added to the ANC.

Acknowledgements

The American National Corpus project has received support from the ANC Consortium, the TalkBank project, the Department of Chinese, Translation, and Linguistics at the City University of Hong Kong, and the National Science Foundation.

The ANC also acknowledges the following, who have provided software and/or support for ANC development:

GATE