GrAF: The Graph Annotation Framework

NOTE: These pages are still under construction and the information they contain may be incomplete or inaccurate.

Table of Contents

  1. Introduction
  2. The Abstract Model
    1. The Model for Graphs
    2. The Model for Annotations
    3. Anchors and Regions
  3. Java quick start page
  4. Open Issues
  5. Frequently Asked Questions
  6. Latest Code
  7. References


Welcome to the Developer's Wiki for the Graph Annotation Framework. This is the place to find help with GrAF/XML and the GrAF Java API.

Graphs have increasingly been recognized as the underlying data model for linguistic annotations, either explicitly (Annotation Graphs, typed feature structures) or implicitly (XML, Lisp). GrAF therefore provides a data model for linguistic annotations based on basic graph theory. Give any computational linguist a paper napkin and a pencil and they will be able to sketch out a simple diagram with a few labelled circles and lines that illustrates their favourite annotation format. GrAF is intended to be the literal representation of these paper napkin sketches.

What GrAF Is

GrAF is a data model based on graph theory that can be used to represent linguistic annotations. The GrAF data model is similar to most other document models with a few exceptions:

  1. Nodes in the graph do not represent annotations. Nodes and annotations are different kinds of objects in GrAF: nodes are simply placeholders for zero or more annotations.
  2. Edges in the graph are first-class citizens of the data model. In many data models the edges between annotations are implied by the nesting of tags (XML, Lisp) or by listing children by reference (W3C DOM, GATE, UIMA). In GrAF the edges are explicitly represented as objects and may be annotated in the same way as nodes.
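
The two points above can be sketched in Java as follows. This is a minimal illustration and not the GrAF Java API: the class names SketchAnnotation, SketchNode, SketchEdge, and NodeEdgeDemo are hypothetical, and an annotation is reduced to a bare label.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical classes for illustration only; not the GrAF Java API.

/** An annotation, reduced here to a bare label. */
class SketchAnnotation {
    final String label;
    SketchAnnotation(String label) { this.label = label; }
}

/** A node is only a placeholder: it holds zero or more annotations. */
class SketchNode {
    final String id;
    final List<SketchAnnotation> annotations = new ArrayList<>();
    SketchNode(String id) { this.id = id; }
}

/** An edge is an explicit object and can carry annotations just like a node. */
class SketchEdge {
    final SketchNode from;
    final SketchNode to;
    final List<SketchAnnotation> annotations = new ArrayList<>();
    SketchEdge(SketchNode from, SketchNode to) { this.from = from; this.to = to; }
}

class NodeEdgeDemo {
    /** Builds a two-node graph fragment with one annotated edge. */
    static SketchEdge build() {
        SketchNode np = new SketchNode("n1");
        np.annotations.add(new SketchAnnotation("NP"));
        SketchNode noun = new SketchNode("n2");   // no annotations yet: still a valid node
        SketchEdge edge = new SketchEdge(np, noun);
        edge.annotations.add(new SketchAnnotation("head"));
        return edge;
    }
}
```

Note that the second node exists without any annotation, and that the edge carries its own annotation rather than being implied by nesting.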

GrAF consists of three parts:

  1. An abstract data model.
  2. An API for manipulating the data model.
  3. A simple XML serialization of the data model.

What GrAF Is Not

GrAF is not an annotation format (the F stands for Framework). GrAF is also not an annotation "scheme"--it does not provide or require any particular linguistic labels. Rather, GrAF is used to represent annotation structure--it provides the means to associate annotations with data, annotations with other annotations, and parts of annotations with other parts of annotations.

The Abstract Model

The abstract data model is divided into three parts: the model for graphs, the model for annotations, and the model for the artifact being annotated. One of the main design goals of GrAF was to maintain a clear distinction between these models with limited interactions amongst them.

The Model for Graphs

The GrAF model for graphs is taken straight from the textbooks.

A graph G = (V,E) consists of a set of vertices V(G) and a set of edges E(G).
A vertex v in V(G) is a terminal point in the graph G. Vertices are also referred to as nodes.
An edge e in E(G) is an ordered pair [AB] of vertices in V(G). The order of the vertices determines the direction of the edge. A is called the edge source and B is called the edge destination.
A graph G' = (V',E') is a subgraph of G = (V,E) iff V(G') is a subset of V(G) and for every edge e' in E(G') there is a corresponding edge e in E(G).
Two vertices A and B in V(G) are neighbors iff an edge ei = [AB] or ej = [BA] exists in E(G).
A path P is a sequence of directed edges P = {e1, e2, ... en} from E(G) such that the destination of ei is the source of ei+1.
Two vertices A and B in V(G) are connected iff there is a path P from A to B in G, or there is a third vertex C in V(G) such that there is a path Pa from C to A and a path Pb from C to B in G.
Connected Component
A connected component in G is a subgraph G' of G such that every pair of vertices in G' is connected.
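
As a rough illustration of the definitions above, the following Java sketch models a directed graph with string vertices and checks the neighbor and connected relations. The class DirectedGraphSketch and its method names are hypothetical and are not part of the GrAF Java API. The connected check treats A and B as interchangeable (a path in either direction counts), which is one reading of the definition.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the definitions above; vertices are plain strings.
class DirectedGraphSketch {
    // Maps each vertex to the destinations of its outgoing edges.
    private final Map<String, List<String>> out = new HashMap<>();

    void addVertex(String v) {
        if (!out.containsKey(v)) out.put(v, new ArrayList<String>());
    }

    /** Adds the directed edge [source destination]. */
    void addEdge(String source, String destination) {
        addVertex(source);
        addVertex(destination);
        out.get(source).add(destination);
    }

    private boolean hasEdge(String a, String b) {
        List<String> dests = out.get(a);
        return dests != null && dests.contains(b);
    }

    /** A and B are neighbors iff the edge [AB] or [BA] exists. */
    boolean neighbors(String a, String b) {
        return hasEdge(a, b) || hasEdge(b, a);
    }

    /** All vertices reachable from start by a path of one or more edges. */
    Set<String> reachableFrom(String start) {
        Set<String> seen = new HashSet<>();
        Deque<String> todo = new ArrayDeque<>();
        todo.push(start);
        while (!todo.isEmpty()) {
            List<String> dests = out.get(todo.pop());
            if (dests == null) continue;
            for (String next : dests) {
                if (seen.add(next)) todo.push(next);
            }
        }
        return seen;
    }

    /** Connected: a path between A and B, or a third vertex C with paths to both. */
    boolean connected(String a, String b) {
        if (reachableFrom(a).contains(b) || reachableFrom(b).contains(a)) return true;
        for (String c : out.keySet()) {
            if (c.equals(a) || c.equals(b)) continue;
            Set<String> fromC = reachableFrom(c);
            if (fromC.contains(a) && fromC.contains(b)) return true;
        }
        return false;
    }
}
```

For example, with edges [CA] and [CB], the vertices A and B are not neighbors but are connected through C.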

The Model for Annotations

An annotation is a labelled feature structure.
Feature Structure
A feature structure is an Attribute-Value Graph (AVG) as described by ISO 24610-1 Feature Structure Representation.
A feature is a mapping from a string (the name) to a value. A feature value may be a string (a simple feature value) or it may be another feature structure (a complex feature value).
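
A minimal Java sketch of this model is shown below, assuming a feature structure can be stored as a map from feature names to values. The classes FeatureStructureSketch, LabelledAnnotationSketch, and FeatureDemo, as well as the feature names used, are hypothetical and are not taken from the GrAF Java API or from ISO 24610-1.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; class and method names are not taken from the GrAF Java API.

/** A feature structure: feature names mapped to simple or complex values. */
class FeatureStructureSketch {
    // A value is either a String (simple) or a FeatureStructureSketch (complex).
    private final Map<String, Object> features = new LinkedHashMap<>();

    FeatureStructureSketch set(String name, String value) {
        features.put(name, value);
        return this;
    }

    FeatureStructureSketch set(String name, FeatureStructureSketch value) {
        features.put(name, value);
        return this;
    }

    Object get(String name) { return features.get(name); }
}

/** An annotation is a labelled feature structure. */
class LabelledAnnotationSketch {
    final String label;
    final FeatureStructureSketch features;

    LabelledAnnotationSketch(String label, FeatureStructureSketch features) {
        this.label = label;
        this.features = features;
    }
}

class FeatureDemo {
    /** Builds a "token" annotation with one simple and one complex feature. */
    static LabelledAnnotationSketch token() {
        FeatureStructureSketch agreement = new FeatureStructureSketch()
                .set("number", "singular")
                .set("person", "3");
        FeatureStructureSketch fs = new FeatureStructureSketch()
                .set("pos", "NN")               // simple feature value
                .set("agreement", agreement);   // complex feature value
        return new LabelledAnnotationSketch("token", fs);
    }
}
```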

Anchors and Regions

Anchors and regions are used to segment the artifact being annotated.

An anchor is an immutable position in the artifact being annotated. What an anchor is will depend on the media being annotated. For example, text anchors may be character offsets, audio anchors may be time offsets, and image anchors may be Cartesian coordinates. The only assumption that GrAF makes about anchors is that they have a natural ordering. The Java API also assumes that anchors know how to serialize themselves to/from a string representation (the value of an XML attribute).
A region is the area defined by one or more anchors. The number of anchors required to bound a region will depend on the media being annotated. For convenience every region also defines start and end anchors, where the start anchor is the smallest anchor as determined by their natural ordering and the end anchor is the largest anchor as determined by their natural ordering. Points (or instants) will have the same start and end anchors.
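
For text, these two definitions might look like the following Java sketch. It assumes anchors are character offsets. The classes CharAnchorSketch and RegionSketch are hypothetical and do not reproduce the GrAF Java API.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch for text; not the GrAF Java API.

/** A text anchor: an immutable character offset with a natural ordering. */
final class CharAnchorSketch implements Comparable<CharAnchorSketch> {
    final long offset;
    CharAnchorSketch(long offset) { this.offset = offset; }

    @Override public int compareTo(CharAnchorSketch other) {
        return Long.compare(offset, other.offset);
    }

    /** Serializes to a string, e.g. for use as an XML attribute value. */
    @Override public String toString() { return Long.toString(offset); }

    /** Reads an anchor back from its string representation. */
    static CharAnchorSketch parse(String s) {
        return new CharAnchorSketch(Long.parseLong(s.trim()));
    }
}

/** A region bounded by one or more anchors. */
final class RegionSketch {
    private final List<CharAnchorSketch> anchors;

    RegionSketch(CharAnchorSketch... anchors) {
        if (anchors.length == 0) {
            throw new IllegalArgumentException("at least one anchor required");
        }
        this.anchors = Arrays.asList(anchors.clone());
    }

    /** The smallest anchor by natural ordering. */
    CharAnchorSketch start() { return Collections.min(anchors); }

    /** The largest anchor by natural ordering. */
    CharAnchorSketch end() { return Collections.max(anchors); }
}
```

A region built from a single anchor represents a point, so its start and end anchors compare as equal.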

GrAF provides a default ordering for regions based on the start and end anchors, but this behaviour can be overridden for specific media types. The default ordering for regions can be expressed in pseudo-code as:

compare(Region A, Region B) {
    /* Check which region starts first */
    if A.start < B.start then
        A comes first
    else if B.start < A.start then
        B comes first
    /* Start anchors are equal, so the longest region comes first */
    else if B.end < A.end then
        A comes first
    else if A.end < B.end then
        B comes first
    else
        the regions are equal
}
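
The pseudo-code above could be rendered in Java as a Comparator, as in the sketch below. It assumes regions with integer start and end anchors. SpanSketch, SpanOrder, and SpanOrderDemo are hypothetical names and not the GrAF Java API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical region with integer start and end anchors; not the GrAF Java API.
final class SpanSketch {
    final int start;
    final int end;
    SpanSketch(int start, int end) { this.start = start; this.end = end; }
    @Override public String toString() { return "[" + start + "," + end + "]"; }
}

/** Default ordering: earlier start first; on equal starts, the longer region first. */
final class SpanOrder implements Comparator<SpanSketch> {
    @Override public int compare(SpanSketch a, SpanSketch b) {
        if (a.start != b.start) {
            return Integer.compare(a.start, b.start);   // smaller start comes first
        }
        return Integer.compare(b.end, a.end);           // larger end (longer region) comes first
    }
}

class SpanOrderDemo {
    /** Sorts three sample regions with the default ordering. */
    static List<SpanSketch> sorted() {
        List<SpanSketch> spans = new ArrayList<>(Arrays.asList(
                new SpanSketch(5, 9), new SpanSketch(0, 4), new SpanSketch(0, 20)));
        Collections.sort(spans, new SpanOrder());
        return spans;
    }
}
```

Sorting the sample regions yields [0,20], [0,4], [5,9]: when two regions start at the same anchor, the enclosing region sorts before the shorter one.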


Note: Regions define the parts of the artifact being annotated; anchors themselves cannot be annotated. Anchors are used to define regions, and the regions are then annotated by linking them to nodes.


Currently there is a Java API for GrAF, and a Python API is in the works. The following projects are also being developed:

A compact syntax for GrAF. XML is extremely verbose and introduces megabytes of redundancy. GrAF/CS is a character-based representation of the GrAF data model intended to reduce the character bloat. Think of the dot format used by GraphViz.
A Simple API for GrAF. This is intended to be to GrAF what SAX is to XML. Regardless of the GrAF syntax, developers need an easy way to parse GrAF documents without jumping through a lot of hoops. The GrAF/SAG API is intended to give developers an easy way to process GrAF annotations without having to walk through a maze of nodes and edges.
Last modified on May 3, 2012 10:55:51 PM