get_annotations_by_type.WebAnno_XMI: Import custom annotations from files exported by WebAnno
In petereckley/webannor: Import text annotation files exported by WebAnno


Import custom annotation layers exported by the annotation software WebAnno v3.0.0 in UIMA CAS XMI format.

get_annotations_by_type.WebAnno_XMI(file, type)

get_annotations.WebAnno_XMI(file)

get_annotations.WebAnno_annotator(zipfile)

get_annotations.WebAnno_document(path)

get_annotations.WebAnno_project(path)

`file`	File containing the annotations
`type`	String containing the name of the XMI attribute that holds the annotation.
`zipfile`	Zipfile containing the annotations
`path`	Path containing the annotations

The different functions provide different entry points in the directory structure exported by WebAnno, from the innermost XMI file, to the outermost project folder.

Note the sentence_id, which was encoded as the name of the .txt file input to WebAnno, is not recorded in any of these attributes. I think the 'sofa' (Subject of Analysis) number is just the order of the file within the WebAnno corpus, which is generally a subset of the documents in our overall corpus, since we don't manually annotate them all.

These functions will not work with other WebAnno export formats. They may work with other versions of WebAnno, but this has not been tested.

A dataframe containing all custom annotations visible from the specified entry point, or an empty dataframe (no rows or columns) if no custom annotations are visible.
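Because an empty result has neither rows nor columns, calling code may want to test for it before touching any columns. A minimal sketch in base R follows; has_annotations is a hypothetical helper and is not part of this package.

```r
# Hypothetical helper (not part of the package): TRUE only when the
# returned object is a data frame with at least one row and one column.
has_annotations <- function(annotations) {
  is.data.frame(annotations) &&
    nrow(annotations) > 0 &&
    ncol(annotations) > 0
}

# An empty data frame of the kind described above (no rows or columns).
empty_result <- data.frame()

# A made-up non-empty result, for illustration only.
some_result <- data.frame(begin = 0L, end = 5L, value = "positive")

if (!has_annotations(empty_result)) {
  message("No custom annotations visible from this entry point.")
}
```

Checking up front avoids errors from selecting columns that do not exist in the empty case.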

get_annotations_by_type.WebAnno_XMI: Start from the XMI file for a single annotator, and get all custom annotations of the specified type.
get_annotations.WebAnno_XMI: Start from the XMI file for a single annotator, and get all custom annotations of types listed in the unexported list .annotation_type_attribute_names contained in this package.
get_annotations.WebAnno_annotator: Start from the zipfile for a single annotator of a single document and get all annotations of the default types.
get_annotations.WebAnno_document: Start from the folder for a single document and get all annotations of the default types from all annotators.
get_annotations.WebAnno_project: Start from the outer WebAnno project directory (not zipped) and get all annotations of the default types for all documents and annotators.

The core functionality is contained in get_annotations_by_type.WebAnno_XMI. We extract the data in a slightly hacky way, using XPath expressions based on hardwired knowledge of the layers to be extracted and their representation in XMI. The XPath expressions may need to be tweaked for different WebAnno annotation types (layers). (A more robust approach would be to use the UIMA framework, but that would be considerably more complicated. Also, UIMA is implemented in Java, and R bindings were not available at the time of writing.)
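The XPath approach described above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the package's actual code: the inline XMI fragment, the namespace URI http://example.org/custom, the begin/end attributes and the helper extract_by_attribute are all made up for illustration. Only the element name custom:Credibility and the use of the XML package are taken from the notes on this page.

```r
library(XML)

# Made-up XMI fragment; the namespace URI and attribute values are
# assumptions for this sketch, not real WebAnno output.
xmi_text <- '<?xml version="1.0" encoding="UTF-8"?>
<xmi:XMI xmlns:xmi="http://www.omg.org/XMI"
         xmlns:custom="http://example.org/custom" xmi:version="2.0">
  <custom:Credibility xmi:id="10" sofa="1" begin="0" end="5" sentiment="positive"/>
  <custom:Credibility xmi:id="11" sofa="1" begin="6" end="12" topic="economy"/>
  <custom:Credibility xmi:id="12" sofa="1" begin="13" end="20" sentiment="negative"/>
</xmi:XMI>'

# Hypothetical helper: select elements that carry the requested attribute
# and return their character offsets and attribute value as a data frame.
extract_by_attribute <- function(doc, attribute_name, ns) {
  xpath <- sprintf("//custom:Credibility[@%s]", attribute_name)
  nodes <- getNodeSet(doc, xpath, namespaces = ns)
  if (length(nodes) == 0) {
    return(data.frame())  # empty data frame: no rows, no columns
  }
  data.frame(
    begin = as.integer(sapply(nodes, xmlGetAttr, "begin")),
    end   = as.integer(sapply(nodes, xmlGetAttr, "end")),
    value = sapply(nodes, xmlGetAttr, attribute_name),
    stringsAsFactors = FALSE
  )
}

doc <- xmlParse(xmi_text, asText = TRUE)
ns  <- c(custom = "http://example.org/custom")

sentiments <- extract_by_attribute(doc, "sentiment", ns)
missing    <- extract_by_attribute(doc, "stance", ns)
```

The element name is hardwired into the XPath string here, which mirrors the limitation noted in the TO DO list; passing it in as a parameter would be a small change.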

Our layer names may differ from the naming in WebAnno. To abstract the code from the names used in WebAnno, the attribute names used for the layers are stored in .annotation_type_attribute_names (not exported) in the parent environment of the functions, rather than hardwired in the code.
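One way to picture this indirection is a lookup from our type names to the attribute names found in the XMI. The sketch below is illustrative only: the real contents and structure of .annotation_type_attribute_names are not shown on this page, so the mapping values and the helper resolve_attribute_name are assumptions.

```r
# Hypothetical mapping from our preferred type names (the names) to the
# attribute names as they might appear in the XMI (the values).
type_attribute_names <- c(sentiment = "Sentiment", topic = "Topic")

# Hypothetical helper: translate one of our type names into the XMI
# attribute name, failing loudly on unknown types.
resolve_attribute_name <- function(type, mapping = type_attribute_names) {
  if (!is.character(type) || length(type) != 1 || !(type %in% names(mapping))) {
    stop("Unknown annotation type: ", type, call. = FALSE)
  }
  unname(mapping[type])
}

topic_attribute <- resolve_attribute_name("topic")
```

Keeping the mapping in one place means a renamed layer in WebAnno only requires changing a single entry rather than editing the extraction code.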

The other functions form nested wrappers around this. get_annotations.WebAnno_XMI calls get_annotations_by_type.WebAnno_XMI iteratively over the default types hardwired in the parent environment of that function (not exported). get_annotations.WebAnno_document and get_annotations.WebAnno_project include doc_id in their output. This is taken from the filename of the unannotated document input to WebAnno, which is preserved in the folder name at an intermediate level in the project directory structure. doc_id can therefore be used to keep track of external identifiers for the texts fed into WebAnno, which are not otherwise known to WebAnno.
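The wrapping pattern can be sketched in base R as follows: per-document results are tagged with an identifier taken from the folder name and then stacked. This is not the package's implementation; combine_with_doc_id, the folder names and the example data are hypothetical, and the sketch assumes all non-empty results share the same columns.

```r
# Hypothetical helper: given a named list of per-document data frames
# (names are the document folder names), add a doc_id column and stack.
combine_with_doc_id <- function(results_by_doc) {
  # Drop documents with no visible annotations (empty data frames).
  non_empty <- Filter(function(d) nrow(d) > 0, results_by_doc)
  if (length(non_empty) == 0) {
    return(data.frame())
  }
  tagged <- Map(function(d, id) {
    d$doc_id <- id
    d[, c("doc_id", setdiff(names(d), "doc_id")), drop = FALSE]
  }, non_empty, names(non_empty))
  combined <- do.call(rbind, tagged)
  rownames(combined) <- NULL
  combined
}

# Made-up per-document results, for illustration only.
results <- list(
  "doc1.txt" = data.frame(type = c("sentiment", "topic"),
                          value = c("positive", "economy"),
                          stringsAsFactors = FALSE),
  "doc2.txt" = data.frame(),
  "doc3.txt" = data.frame(type = "sentiment", value = "negative",
                          stringsAsFactors = FALSE)
)

all_annotations <- combine_with_doc_id(results)
```

Filtering out empty results first keeps rbind from failing on data frames with mismatched columns.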

TO DO:

Consider whether my function naming convention is the most appropriate, given that it's not really an S3 generic. I could create classes for the different WebAnno file types, but provide a minimal implementation (essentially just passing through file / path names, without checking whether the files / paths are well-formed, though it could check existence).
Add examples that aren't subject to copyright.
Write unit tests.
Add error handling, e.g. for bad file | zipfile | path arguments.
Add a warning if the result contains no annotations.
Consider removing the suppressWarnings(), or dealing only with particular types of warnings at that stage, rather than a catch-all.
Separate the function logic that processes the XML from the logic that provides an XML source, which need not be a file (e.g. a network connection).
Consider whether to use the xml2 package rather than the XML package.
Disentangle our hardwired names for annotation types, from get_annotations_by_type.WebAnno_XMI, so that function can be called with the name as appears in the XMI, and a wrapper takes care of converting that name to our preferred default type name. This will make the function useable for arbitrary layers without having to edit .annotation_type_attribute_names.
Make the element type a parameter, instead of hardwired to "custom:Credibility" to facilitate extraction of other types of annotations.

## Not run: 
sentiments <- get_annotations_by_type.WebAnno_XMI("temp/webanno/out/admin.xmi", "sentiment")
## End(Not run)
## Not run: 
topics <- get_annotations_by_type.WebAnno_XMI("temp/webanno/out/admin.xmi", "topic")
## End(Not run)