
brat is a tool for collaborative annotation, but it can also be used for visualization; for example, the Stanford CoreNLP project uses brat in its online demo.

I needed to visualize a number of documents with dependency annotation, so I spent yesterday writing a script to convert my annotation to the brat format, and messing with the configuration to get the converted files to display in a local server. It turns out that there is almost no need to write any configuration files, except to customize the colors of the displayed annotations: brat will happily show whatever annotation you give it as long as it’s in the right format.

format

The format is simple: a two- or three-column tab-separated file. brat supports several types of annotation, but I only needed two of them: entities and relations. In my case, entities are the part-of-speech tags, and relations are the dependency relations between words. The first column of every annotation is an id. An annotation id is a letter (‘T’ for text or entity annotations, ‘R’ for relations, etc.) followed by a number that is unique within its annotation type (so ‘T1’ and ‘R1’ can exist in the same file).

The format for an entity annotation is

<ID><tab><type><space><begin><space><end><tab><text>

where “begin” and “end” are character offsets from the beginning of the text file, and “end” is one character past the end of the span. This caused me hours of pain yesterday, because for some reason I had it in my head that it should be the last character in the span, even though it makes more sense the other way (it’s how Python’s list slices work, after all). “text” is the text contained within the span.
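To make the offset convention concrete, here is a minimal sketch in Python. The function name `entity_lines` and the sample sentence are my own illustration, not part of brat, and the sketch assumes whitespace-separated tokens with exactly one POS tag per token.

```python
def entity_lines(text, tags):
    """Build brat entity lines for whitespace-separated tokens.

    `tags` holds one POS tag per token, in order. Offsets count characters
    from the start of `text`, and `end` is one past the last character.
    (Hypothetical helper for illustration; not part of brat.)
    """
    lines = []
    pos = 0
    for i, (token, tag) in enumerate(zip(text.split(), tags), start=1):
        begin = text.index(token, pos)  # search from where the last token ended
        end = begin + len(token)        # exclusive end, like a Python slice
        assert text[begin:end] == token
        lines.append(f"T{i}\t{tag} {begin} {end}\t{token}")
        pos = end
    return lines


text = "She reads books"
for line in entity_lines(text, ["PRP", "VBZ", "NNS"]):
    print(line)
```

Because “end” is exclusive, `text[begin:end]` gives back exactly the token, which is a handy sanity check to leave in a conversion script.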

A relation annotation is

<ID><tab><type><space>Arg1:<entity_id_1><space>Arg2:<entity_id_2>

where “entity_id_1” and “entity_id_2” are the ids of two entity annotations to be connected by the relation. When displayed, the arrow will point from the first entity to the second; which one is the head and which is the daughter (of a dependency relation) is essentially arbitrary but should be consistent (obviously).

For POS tags, “type” would be “NN”, “VBZ”, etc. and for dependency relations “nsubj”, “dobj”, etc.
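As a rough sketch of the relation side, the Python snippet below builds relation lines from dependency triples. The function name `relation_lines`, the triple layout, and the example parse of a hypothetical three-word sentence (“She reads books”, with entity ids T1 to T3) are assumptions made for illustration. It puts the head in Arg1, which is just one of the two consistent choices.

```python
def relation_lines(deps):
    """Build brat relation lines from dependency triples.

    `deps` is a list of (label, head, dependent) triples, where head and
    dependent are the numbers of existing entity ids (1 means T1, etc.).
    The arrow is drawn from Arg1 to Arg2; here Arg1 is the head.
    (Hypothetical helper for illustration; not part of brat.)
    """
    return [
        f"R{i}\t{label} Arg1:T{head} Arg2:T{dep}"
        for i, (label, head, dep) in enumerate(deps, start=1)
    ]


# "reads" (T2) is the head of both "She" (T1) and "books" (T3).
deps = [("nsubj", 2, 1), ("dobj", 2, 3)]
for line in relation_lines(deps):
    print(line)
```

The entity and relation lines go together in one annotation file: brat reads them from a “.ann” file that sits next to a “.txt” file with the same base name containing the raw text.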

conversion script

You can see the script to output annotation files for brat here.

running the server

Just run “python standalone.py” from the directory where you unpacked the brat archive. The brat website has more detailed instructions. Alternatively, depending on the web server settings, you can just drop the brat directory somewhere that is served and it will automatically be accessible through the CGI interface.

Of course, in a serious installation that will actually be used for annotation by multiple people, you’d need to set up permissions and everything, but for a fairly quick and easy visualization solution it’s not necessary.

debugging tip

If something isn’t working and all you get is a lot of unhelpful errors from the JavaScript, enable debug output by changing the line in “config.py” in the brat directory from “Debug = False” to “Debug = True”. That should output much more helpful error messages.
