The Open ANC includes over 14 million words from the Second Release that can be freely distributed. Please see the OANC license for more details.
The OANC includes the following data from the ANC Second Release:
Spoken
| Name | Domain | No. files | No. words |
| charlotte | face to face | 93 | 198,295 |
| switchboard | telephone | 2,307 | 3,019,477 |
| Spoken Totals | | 2,410 | 3,217,772 |
Written
| Name | Domain | No. files | No. words |
| 911 report | government, technical | 17 | 281,093 |
| berlitz | travel guides | 179 | 1,012,496 |
| biomed | technical | 837 | 3,349,714 |
| eggan | fiction | 1 | 61,746 |
| icic | letters | 245 | 91,318 |
| oup | non-fiction | 45 | 330,524 |
| plos | technical | 252 | 409,280 |
| slate | journal | 4,531 | 4,238,808 |
| verbatim | journal | 32 | 582,384 |
| web data | government | 285 | 1,048,792 |
| Written Totals | | 6,424 | 11,406,155 |
| Corpus Totals | | 8,832 | 14,623,927 |
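As a quick arithmetic check, the word counts listed for each component can be summed and compared with the stated totals. The sketch below is only an illustration in Python: the dictionary names are invented here, and it covers the word counts only, not the file counts.

```python
# Word counts copied from the tables above (component name -> number of words).
spoken = {"charlotte": 198_295, "switchboard": 3_019_477}
written = {
    "911 report": 281_093,
    "berlitz": 1_012_496,
    "biomed": 3_349_714,
    "eggan": 61_746,
    "icic": 91_318,
    "oup": 330_524,
    "plos": 409_280,
    "slate": 4_238_808,
    "verbatim": 582_384,
    "web data": 1_048_792,
}

spoken_words = sum(spoken.values())
written_words = sum(written.values())
corpus_words = spoken_words + written_words

# Print the three sums with thousands separators for comparison with the tables.
print(f"spoken:  {spoken_words:,}")
print(f"written: {written_words:,}")
print(f"corpus:  {corpus_words:,}")
```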
The file organization and encoding conventions for the OANC are the same as in the ANC Second Release. Please consult the Second Release document encoding conventions for a full description.
The OANC data is distributed with the following annotations:
All annotations were originally produced automatically using GATE's ANNIE system. Some of the texts in the OANC include manually validated sentence boundaries (the list of texts validated for sentence boundaries is here). Note that the validated sentence boundaries are not included in the ANC Second Release.
All ANC annotations are in stand-off format; that is, each annotation type is stored in a separate file and linked to the primary data, which is contained in a plain text (UTF-8) file. Annotations are represented as a graph of feature structures according to the specifications of the ISO Linguistic Annotation Format (LAF) (Ide and Romary 2007; Ide and Suderman 2006).
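The general idea of stand-off annotation can be sketched in a few lines of Python. This is a simplified illustration, not the actual LAF serialization or the ANC file layout: the `Annotation` class, the `resolve` function, the character-offset scheme, and the feature names are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A hypothetical stand-off annotation: a labelled span over the primary text."""
    start: int   # offset of the first character (inclusive)
    end: int     # offset just past the last character (exclusive)
    label: str   # annotation type, e.g. "tok" or "s"
    features: dict = field(default_factory=dict)


def resolve(primary: str, ann: Annotation) -> str:
    """Return the stretch of primary text that an annotation points at."""
    return primary[ann.start:ann.end]


# The primary data is never modified; annotations only refer to it by offset.
primary_text = "The corpus is free."

# Each annotation type would live in its own file; here they are separate lists.
tokens = [
    Annotation(0, 3, "tok", {"pos": "DT"}),
    Annotation(4, 10, "tok", {"pos": "NN"}),
    Annotation(11, 13, "tok", {"pos": "VBZ"}),
    Annotation(14, 18, "tok", {"pos": "JJ"}),
]
sentences = [Annotation(0, 19, "s")]

for tok in tokens:
    print(resolve(primary_text, tok), tok.features["pos"])
```

Because the text and the annotations are kept apart, a new annotation layer can be added, or an existing one replaced, without touching the primary data.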
A version of all, or part, of the ANC data with annotations merged in-line can be generated using the ANC Tool. Several output options are provided, including XML and non-XML formats that can be input to a variety of other software.
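Conceptually, merging stand-off annotations in-line means walking through the primary text and emitting opening and closing markup at the annotated offsets. The Python sketch below shows that idea only; it is not the ANC Tool's algorithm or output format. The `merge_inline` function and the `(start, end, tag)` span tuples are hypothetical, and the spans are assumed to be properly nested.

```python
from xml.sax.saxutils import escape


def merge_inline(primary, spans):
    """Insert <tag>...</tag> markup into primary text.

    spans is a list of (start, end, tag) tuples with character offsets.
    Assumption: spans are properly nested (no crossing boundaries).
    """
    events = []
    for start, end, tag in spans:
        length = end - start
        # Sort key: position first, then closing tags (0) before opening tags (1).
        # At a shared position, longer spans open first and shorter spans close first.
        events.append((start, 1, -length, f"<{tag}>"))
        events.append((end, 0, length, f"</{tag}>"))
    events.sort()

    out, cursor = [], 0
    for pos, _, _, markup in events:
        out.append(escape(primary[cursor:pos]))  # escape &, <, > in the text itself
        out.append(markup)
        cursor = pos
    out.append(escape(primary[cursor:]))
    return "".join(out)


text = "The corpus is free."
spans = [(0, 19, "s"), (0, 3, "tok"), (4, 10, "tok")]
print(merge_inline(text, spans))
```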
Please Note: The OANC is distributed with UTF-8 and UTF-16 character encoded text files, while the ANC Second Release uses UTF-16 only. All of the software tools provided by the ANC assume UTF-16 as the default character encoding.
Be sure to specify the correct character encoding for the text files when processing the OANC with any of the ANC tools.
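The same caution applies when reading the text files with your own scripts. The following Python sketch reads a file with an explicitly stated encoding and, if none is given, guesses from a byte-order mark. The `read_text` function is hypothetical, and the fallback to UTF-8 for files without a byte-order mark is an assumption rather than a documented property of the OANC files.

```python
import codecs


def read_text(path, encoding=None):
    """Read a text file, using the given encoding or guessing from a byte-order mark."""
    with open(path, "rb") as f:
        raw = f.read()
    if encoding is None:
        if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            encoding = "utf-16"     # the utf-16 codec reads and removes the BOM
        elif raw.startswith(codecs.BOM_UTF8):
            encoding = "utf-8-sig"  # removes the UTF-8 BOM
        else:
            encoding = "utf-8"      # assumption: files without a BOM are UTF-8
    return raw.decode(encoding)
```

A UTF-16 file without a byte-order mark cannot be recognised this way, so in that case the encoding should be passed explicitly, for example `read_text(path, "utf-16-le")`.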
The OANC is a community resource that is freely available for download. Please see the OANC license for details.
We ask that you provide us with any of the following that may have resulted from your use of the OANC, which we will make freely available to the user community on this website:
Download the Open ANC as a self installing jar file. (316 MB) See below for installation instructions.
Download the Open ANC as a zip file. (326 MB)
Download the Open ANC as a zip file. (475 MB), or
Download the Open ANC as a gzipped tar file (463 MB)
The OANC will unpack to approximately 4.8 GB.
Download the ANC Tool (2.5 MB) (required to process the standoff annotations)
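Since the unpacked corpus needs roughly 4.8 GB, it can be worth checking the free space on the target drive before extracting. A minimal Python sketch follows; the `has_room` function is hypothetical, and "4.8 GB" is read here as decimal gigabytes, which is an assumption.

```python
import shutil

# "Approximately 4.8 GB", interpreted as decimal gigabytes (an assumption).
REQUIRED_BYTES = 4_800_000_000


def has_room(target_dir, required=REQUIRED_BYTES):
    """Return True if the filesystem holding target_dir has at least `required` free bytes."""
    return shutil.disk_usage(target_dir).free >= required


# Example use (the archive name is a placeholder for whichever file was downloaded):
# if has_room("."):
#     shutil.unpack_archive("OANC.zip", "OANC")
print(has_room("."))
```

`shutil.unpack_archive` handles both zip and gzipped tar archives, so the same check works for either download.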
The Java installers are executable jar files that can be used to install the Open ANC and the ANC Tool. On most operating systems you should be able to double-click on the .jar file. If that does not work, open a command prompt (Windows), shell (Linux), or terminal window (Mac OS X) and run the command:
java -jar OANC-installer.jar
Installation Notes
File dialog boxes in Java are implemented slightly differently on different platforms. For instance, the "Open File" dialog box in Mac OS X does not allow the user to create a directory from within the dialog. Therefore on Mac OS X, users must do one of the following: