A beginning with no end

That’s the peril of a historically successful, productive research program. We get locked in to a model; there is the appeal of being able to use solid, established protocols to gather lots of publishable data, and to keep on doing it over and over. It’s real information, and useful, but it also propagates the illusion of comprehension. We are not motivated to step away from the busy, churning machine of data gathering and rethink our theories.

I read the above quote in this blog post by PZ Myers. That post is about the concept of the gene in biology, but it made me think of the use of corpora like the Penn Treebank to measure performance in part-of-speech tagging and parsing.


Things That Annoy Me About Scala, #n: scalac



Okay, it’s really one thing that just keeps coming back to piss me off more: why aren’t ‘-deprecation’, ‘-feature’ etc. on by default?! When would it make sense NOT to enable them?! The compiler still bitches at you if you leave them off, except that it won’t tell you why, because that would be too easy, no you have to rerun the damn thing with the option that should have been on before! It’s no consolation that I can add it to the sbt configuration, because as soon as I start another project it’s off again trying to figure out what to tell the compiler so it will shut the fuck up and do what I want it to do.

Oh yeah, and the ‘-language:’ options. I understand why some features should be enabled explicitly, except that you don’t have to enable them! The compiler warns you that you need to enable feature X, and then happily finishes compiling as if you had enabled it! What was the point of bothering me, then?? It just seems insulting; I’d be perfectly happy (well, happier) to be given an error, just like when you forget an include, but it’s frustrating to be told to enable a feature and then have it enabled for me. (This second problem is likely because the ‘-language:’ features are fairly recent additions; hopefully Scala 2.11 will make these errors instead of warnings.)

It really seems like this should be so much easier. I’d like to be more forgiving, but it’s 2AM and I’ve been jumping through bullshit hoops for X hours trying to figure out the right compiler options to make the compiler happy for no good reason because my code compiles fine, it’s just the stupid warning messages that are so frustrating that at this point I would happily say ‘screw Scala’ and go use Julia or something if I didn’t depend on some Scala libraries.

Julia and NLP



Julia is a fairly new programming language, targeted toward scientific (i.e. statistical) computing. It is dynamically-typed and high-performance (close to C and C++, faster than Python and R), and has some neat features like multiple dispatch and macros that make it a lot like Lisp.

I went through the manual and played around with the REPL—the language is very intuitive; I like the type system and the multiple dispatch (a method is a specific instantiation of a function for a particular set of argument types, and which method is run for a given function call is determined at run time from the types of all of the arguments—methods are not owned by any object); overall, it seems very promising, both as an interesting and unique programming language, and as a tool for computer science.
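To make the dispatch idea concrete, here is a toy sketch in Python (not Julia). The `Generic` class and the `combine` function are made up for illustration, and the lookup only matches argument types exactly; Julia itself picks the most specific applicable method and takes subtypes into account, which this sketch does not attempt.

```python
class Generic:
    """A function whose implementation is chosen from the types of all arguments."""

    def __init__(self, name):
        self.name = name
        self.methods = {}  # maps a tuple of argument types to an implementation

    def register(self, *types):
        def decorator(impl):
            self.methods[types] = impl
            return impl
        return decorator

    def __call__(self, *args):
        key = tuple(type(a) for a in args)
        if key not in self.methods:
            raise TypeError(f"no method of {self.name} for argument types {key}")
        return self.methods[key](*args)


combine = Generic("combine")

@combine.register(int, int)
def _(a, b):
    return a + b

@combine.register(str, int)
def _(s, n):
    return s * n

@combine.register(int, str)
def _(n, s):
    return f"{n}:{s}"

print(combine(2, 3))      # uses the (int, int) method
print(combine("ab", 2))   # uses the (str, int) method
print(combine(2, "ab"))   # uses the (int, str) method
```

Note that none of the implementations belongs to a class; the function `combine` owns all of them, and the choice depends on both arguments rather than just the first.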

There doesn’t seem to be much use of Julia for Natural Language Processing as of yet, although there is one paper on Latent Dirichlet Allocation that used Julia to implement their algorithm. However, it seems to be gaining traction in the academic/scientific community (I heard about Julia from a student in our lab who was recommending it). It still lacks libraries compared to other languages, but that will only be solved through early adopters creating libraries for themselves.

‘Because’ as a Preposition



Seen on Twitter: “English Has a New Preposition, Because Internet” (theatlantic.com)

In brief, someone at the Atlantic noticed some discussion of the construction “because X”, where X can be basically anything. This usage is (you probably know, because this is the Internet) common on The Internet; Mark Liberman at Language Log wrote about it last year. The story is that ‘because’ has become a preposition, where it used to be limited to introducing a dependent clause as a subordinating conjunction or complementizer and being used with ‘of’ to introduce a prepositional phrase.

However, I believe that the Cambridge Grammar of the English Language (Huddleston and Pullum, 2002) already classified ‘because’ as a preposition — they have a discussion of why they disagree with some traditional categories, including subordinating conjunctions. I don’t have a copy available so I can’t look up exactly what they say, but see the Wikipedia article on Prepositions for references. I like their analysis because of reasons — I’m a wannabe iconoclast, and also it made sense to me when I read it.

My main beef, though, is that I am not so sure we should consider the ‘because X’ construction grammatical. Sure, it’s widely used on the Internet, but so is lolspeak, for example. It’s entirely possible that this usage will persist long enough to become available for unselfconscious use, but for now (for me, at least) it seems to require a conscious deviation from normal language use, very similar to saying something like, “i can haz prepositional ‘because’?”.

Update: I found a discussion of CGEL’s explanation of ‘because’ et al as prepositions here: http://english-jack.blogspot.jp/2007/05/bain-on-prepositions.html

Software Licenses, Open Source, and Github



tl;dr—use the OSI-approved MIT License

Oh yeah, and IANAL (I Am Not A Lawyer—obviously).


(Background: I am working on a project for which I am planning to develop some non-trivial software, which I will host on Github as much for the convenience as to make my code available to others, although that is also one motivation.)

I can’t find it now, but not too long ago I was reading an article discussing open source licenses that showed that a large majority (almost all?) of the repositories on Github had no license information. Github has acted to address this by adding an option to automatically add a license to a new repository, and offers several options. The author of the article said that they thought the option would only cause more problems because users will not consider the effects of different licenses and just pick one, probably the GPL as it is the most well-known. Version incompatibility between GPL versions, and more generally incompatibility between various open source licenses offered as options by Github has the potential to cause more fragmentation as users pick a license without due consideration.

The upshot is that Github users need to be more informed and proactive about the way we release our software—just because Github makes it so easy doesn’t mean we’re done as soon as we type git push origin master.

It was thus that I embarked on a journey to learn what I should know about open source software licenses and make a decision on what license to use. After wandering adrift on the ocean of the internet for approximately 1000x as long as I initially intended, I have come to some conclusions which I have recorded here to try to make myself feel better about how long it took to get here.

Conclusions First (Or Second, Rather)

My conclusion is: use a permissive license listed as compatible with the GPLv2 by the FSF. For myself, I am choosing to use the MIT License approved by the OSI. This license is compatible with the GNU GPL, and source code licensed under it can be combined with source code licensed under the GPLv2 (with or without an “any later version” statement) or GPLv3[1].

My rationale is that it is the least restrictive and most compatible of the commonly-used FOSS licenses, along with the Simplified or 2-clause BSD license, which I believe it is essentially equivalent to. It is mostly arbitrary which of the two to pick, but “the BSD license” is ambiguous and refers to two or three different licenses[2], while “the MIT license”—although it can also refer to the X11 license—I think most commonly refers to the Expat License, and that is what the OSI approved as the MIT License. Github supports using the OSI-approved MIT license, among other open source licenses. Thus I think that the MIT License is the clearest, easiest, low-cost option going forward for licensing a program as open source, for both the developer and the user.

As for the GNU GPL, I think it makes sense to use it if it serves your purpose, but it deserves careful consideration. I don’t think it makes sense for the GPL (any version) to necessarily be the default option for FOSS, especially given the incompatibility between the GPLv2 and GPLv3. If your goal is to make it easy for people to modify and share your software, then it makes more sense to choose a permissive license like the MIT license. If your goal is to ensure that modifications and derivatives of your software remain open source—a perfectly valid goal—then it makes sense to use the GNU GPLv3 (or, if you use source code which is licensed under the GPLv2 without the “any later version” statement then you have to use GPLv2 as well).

Other Issues


The biggest concern in developing and distributing free or open source software is copyright. Copyright applies as soon as a work is created, without need for any application or process, and software has been copyrightable since 1983. Thus anytime someone writes a piece of code they own the copyright for that code, which means that they can sue anyone else who uses that code without their permission. Software licenses exist to grant others permission to use source code in various ways.

Patents are a separate issue. A patent covers a claimed invention and requires an application to the USPTO; the invention must show originality, yadda yadda, for the patent to be granted. The salient point here is that if a patent is granted, it applies to all instantiations of that invention regardless of whether the creator/manufacturer is even aware of the patent (consequences can be worse if they did know). Thus any source code anywhere could be violating someone’s patent somewhere, and there is no way to ensure that no patents apply other than reading all patents in existence. This is literally impossible, as the number of software patents is in the hundreds of thousands[3]. Yeah, the software patent system in the United States is massively fucked up, not to mince words.

Anyway, corporations and universities often acquire patents, and they may also have reasons to want to open-source software which may be related to those patents. Many open source licenses don’t mention patents; I don’t think it is clear what that means for patents that may affect software which is released under those licenses by the owners of the patents. My guess is that most organizations would be rather leery of this uncertainty; thus the Apache License 2.0 contains a provision granting permission to use patents in the software and preventing people who modify and redistribute the software from suing other people for those patents. The Educational Community License 2.0 limits the patent license to only those patents held by the programmers and not necessarily all the patents owned by their employing organization (roughly); this was apparently meant to let research universities use the license without having to do a large amount of administrative work.

This is all irrelevant to me, however, and I suspect to most of Github as well. If you do belong to an organization that owns patents, check out the FSF’s list of licenses, and consult a lawyer.

Public Domain

Public domain refers to works that, for whatever reason, are not copyrighted. The most common use refers to works on which the copyright has expired; works produced by the US government are also in the public domain (within the US).

Rather than participate in all of this capitalist, legalist mumbo-jumbo that is copyright and licensing and all that, one alternative that seems attractive is to dedicate your source code into the public domain, thus allowing anyone to use it however they want. The capitalist international legal system turns out not to be so simple, however, in that not all countries even allow public domain dedications, and they may be revocable and ultimately Things Are Not That Simple. Unfortunately.

You can try it if you want to; I feel you, sister, believe me. Apparently The Unlicense is popular, recently, though you might do better with the CC0 dedication as it is more legally tuned and tested. I have a feeling that those might hurt wider adoption somewhat, although the SQLite database is in the public domain, so go figure.

Final Things

I want to mention the CRAPL, by Matt Might at the University of Utah. It’s quite funny, and serious as well: it addresses real needs unique to academic research. On the other hand, I don’t know how it would stand up in court. I do like it, though.

And one last thing, which I did not address in the original version of this post, and that is applying a license to your code. I still need to read it, so I will just point to the Software Freedom Law Center’s guidelines on this topic.

Really really super duper finally: the issue of documentation! (which I also forgot about, of course) The upshot: use the Creative Commons CC BY-SA (here) for documentation (see section 8 here, and the rest of that page too while you’re at it)—it’s what Wikipedia would want!

[1] http://www.gnu.org/licenses/license-list.html#Expat
[2] http://en.wikipedia.org/wiki/BSD_license
[3] http://www.groklaw.net/article.php?story=20130715054823358

Some Thoughts on the Stanford Dependencies



The Stanford dependencies provide a representation of grammatical relations between words in a sentence. They have been designed to be easily understood and effectively used by people who want to extract textual relations. …

Stanford Dependencies

Stanford dependencies are a dependency-based representation that is semantically-oriented; it tries to make the more semantically “important” word the head in a grammatical relation / dependency arc. It provides a basic representation that is similar to other dependency representations as well as a transformed representation that further surfaces semantic relations by, among other transformations, collapsing prepositions out of the graph and replacing them with a single arc from the head of the preposition to its daughter. See the link above for more explanation and an example.
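As a rough illustration of the preposition collapsing described above, here is a small Python sketch. It is not the Stanford Parser's implementation: the function name `collapse_prepositions` and the representation of arcs as (label, head, dependent) triples over plain word strings are assumptions made for the example, and it assumes each word appears only once in the sentence.

```python
def collapse_prepositions(arcs):
    """Replace prep(head, P) + pobj(P, X) with a single arc prep_P(head, X).

    arcs is a list of (label, head, dependent) triples over word strings.
    """
    # Collect the object(s) of each preposition.
    objects = {}
    for label, head, dep in arcs:
        if label == "pobj":
            objects.setdefault(head, []).append(dep)

    # Prepositions that are themselves attached to some head by a prep arc.
    attached = {dep for label, _, dep in arcs if label == "prep"}

    collapsed = []
    for label, head, dep in arcs:
        if label == "prep" and dep in objects:
            # One collapsed arc per object of the preposition.
            for obj in objects[dep]:
                collapsed.append((f"prep_{dep.lower()}", head, obj))
        elif label == "pobj" and head in attached:
            continue  # folded into the collapsed arc above
        else:
            collapsed.append((label, head, dep))
    return collapsed


basic = [
    ("nsubj", "arose", "much"),
    ("prep", "much", "of"),
    ("pobj", "of", "profit"),
    ("det", "profit", "the"),
]
for arc in collapse_prepositions(basic):
    print(arc)
```

The preposition “of” disappears as a node and survives only in the label `prep_of`, which is what makes the relation between the two content words directly visible.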

My Thoughts

Disclaimer: None of the below is meant to disparage the undeniable usefulness of the Stanford dependencies or the Stanford Parser. I use the Stanford Parser in my own research; these are some thoughts I had while reviewing the output.

The output of the Stanford Parser suffers from errors and inconsistencies. Errors are a problem for all parsers (and NLP in general), but I have seen a pattern of errors that may be specific to the Stanford Parser implementation. Inconsistency is a more central problem for the Stanford dependency representation, because of its status as a de facto standard implemented in the Stanford Parser.

Error Analysis

I had occasion to manually review the dependency output of the Stanford Parser on two documents from the Gigaword corpus. Below is an analysis of the errors I noted (my research is focused on verbs, so I ignored errors that didn’t affect a relation governing a verb; also this is based on my non-expert manual analysis, so keep your salt-shaker close to hand).

Out of a total of 87 sentences, I noted errors in 27; discounting 1 sentence which was incorrectly split by the Punkt sentence tokenizer and 5 sentences which were in all caps that leaves an error rate of .24 (21 / 87). Of these errors, 2 sentences were likely affected by excessive length, as the parse result for these was completely incorrect.
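The arithmetic behind that figure can be checked with a few lines of Python. The variable names are just labels for the counts given above; the second rate, which also removes the six discounted sentences from the denominator, is an alternative computed only for comparison and is not the figure used in this post.

```python
total_sentences = 87
sentences_with_errors = 27
badly_split = 1   # incorrectly split by the sentence tokenizer
all_caps = 5      # sentences written entirely in capitals

# Errors that remain after discounting the tokenizer and all-caps cases.
counted_errors = sentences_with_errors - badly_split - all_caps

# Rate as reported above: counted errors over all 87 sentences.
error_rate = counted_errors / total_sentences

# Alternative for comparison only: drop the discounted sentences
# from the denominator as well.
alt_rate = counted_errors / (total_sentences - badly_split - all_caps)

print(counted_errors, round(error_rate, 2), round(alt_rate, 2))
```

Either way the rate is roughly a quarter of the sentences reviewed.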

Errors by dependency type:

Wrong Label
Correct Wrong Head root ccomp xcomp advcl other
root 3 1
ccomp 1 2
xcomp 4
advcl 2 2 2 2
other 2 (8) 7 2 3

The root word was misidentified in 7 sentences, including 1 where the sentence had no verb and 2 where the copula was identified as the root (the copula should be analyzed as a dependent of its syntactic complement).

The next most common type of error was confusion between adjuncts and complements of clauses. Adverbial clauses modifying another clause were labeled as complements 4 times, while a clausal complement was labeled as an adjunct once. There seems to be a bias toward treating clauses as complements, even when the governing verb does not normally take a clausal argument.

My goal is to use the Stanford Parser output to identify verbs which have a high probability of describing events in a narrative; I want to include verbs which contribute significantly to the semantic content of a document’s narrative. The apparent tendency to identify non-complements as complements, and to a lesser extent the misidentification of root words, lowers my confidence in the results. Is a ~.88 “success” rate (complements vs. adjuncts vs. root words) “good enough”? I need a better idea of what performance is good enough.


Because of the way Stanford dependencies are implemented (in the Stanford Parser), there is some inconsistency in that some constructions are headed by content words while others retain their syntactic head word as head. Stanford dependencies are implemented by pattern-matching on phrase-structure parse trees—each construction must be implemented separately. This inevitably means that there will be some constructions which are not covered. Of course, the implementation can be improved gradually.

A related problem is the question of how far to go in transforming the parse tree. Some constructions may require an even more radical transformation than is currently done, in order to surface a content word as head (I show an example below). How far is too far? The answer is up to the implementers of the Stanford Parser.

The last related problem is that of performance. There is no gold standard data—the “gold standard” is the output of the Stanford Parser on gold standard phrase-structure trees. This means we (users) can’t evaluate the Stanford dependencies output. I am sure the researchers at Stanford do their own checking, but I have not seen any numbers for coverage or accuracy of the tree transformations. Some gold-standard data was apparently created for a BioNLP task, but I don’t know how useful that domain-specific data would be.


So the question remains, should we be satisfied with the Stanford dependencies? I certainly intend to use them—they offer a substantial improvement in usefulness over either phrase-structure trees or standard dependencies. Still, there appears to be room to consider alternate approaches, if they have advantages over the Stanford dependencies approach. All this has certainly made me think, and I have a few ideas of my own. I’m not sure whether I will have time to explore them, though, which seems rather appropriate, I feel—or ironic, maybe.

An Example

Consider the following parser output:


The phrase “much of the first-quarter profit” is parsed as an NP headed by the adverb “much”. I think “much” is a noun, pronoun, or possibly a (pre-)determiner in this case, but this is a problem with the Penn Treebank, which only tags “much” as RB or JJ. The relation nsubj(arose, much) doesn’t convey much (haha) meaning, though. It would be nice if “profit” were the head, perhaps with an amod(profit, much_of) relation, but this is a radical transformation of the original syntax, and it is not clear what the best output here is.

I think that the “most correct” parse here is:

(NP (NN much)
    (PP (IN of)
        (NP (DT the)
            (NN first-quarter)
            (NN profit))))

and a more semantic-oriented analysis could be:

(NP (PRE-DET much of) (DT the) (NN profit))

Well, it should be something like the above, anyway.

No Attributes for Relation Annotations in Brat



brat supports “attribute” annotations, which are extra properties attached to other annotations. I wanted to add an attribute to dependency edges to note specific dependencies, but unfortunately attributes are not shown on relation annotations in the current version (1.3). It’s possible to add them in the configuration files, but the client-side javascript doesn’t do anything with them, apparently.

It’s not a dealbreaker, but it would be nice to have.

Visualizing Annotations With Brat



brat is a tool for collaborative annotation, but it can also be used for visualization; for example, the Stanford CoreNLP project uses brat in their online demo.

I needed to visualize some number of documents with dependency annotation, so I spent yesterday writing a script to convert my annotation to the brat format, and messing with the configuration to get the converted files to display in a local server. It turns out that there is almost no need to write any configuration files, except to customize the colors of the displayed annotations — brat will happily show whatever annotation you give it as long as it’s in the right format.


The format is simple: a two- or three-column tab separated file. There are four types of annotation, but I only needed two of them: entities and relations. In my case, entities are the part-of-speech tags, and relations are the dependency relations between words. The first column for every annotation is an id. An annotation id is a letter (‘T’ for text or entity annotations, ‘R’ for relations etc.) followed by a unique number (within an annotation type, so ‘T1’ and ‘R1’ can exist in the same file).

The format for an entity annotation is

id<tab>type begin end<tab>text

where “begin” and “end” are character offsets from the beginning of the text file, and “end” is one character past the end of the span. This caused me hours of pain yesterday, because for some reason I had it in my head that it should be the last character in the span, even though it makes more sense the other way (it’s how Python’s list slices work, after all). “text” is the text contained within the span.

A relation annotation is

id<tab>type Arg1:entity_id_1 Arg2:entity_id_2

where “entity_id_1” and “entity_id_2” are the ids of two entity annotations to be connected by the relation. When displayed, the arrow will point from the first entity to the second; which one is the head and which is the daughter (of a dependency relation) is essentially arbitrary but should be consistent (obviously).

For POS tags, “type” would be “NN”, “VBZ”, etc. and for dependency relations “nsubj”, “dobj”, etc.
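Here is a minimal Python sketch of producing annotation lines in this format. It is not the conversion script mentioned below; the function name `to_brat` and its input layout (token list, POS tag list, and arcs given as zero-based token indices) are assumptions for illustration, and it assumes the text file is just the tokens joined by single spaces. Relation arguments are written in brat's `Arg1:`/`Arg2:` standoff form.

```python
def to_brat(tokens, tags, arcs):
    """Build the text and the annotation lines for one sentence.

    tokens: list of word strings
    tags:   one POS tag per token
    arcs:   (label, head_index, dependent_index) triples, zero-based
    """
    text = " ".join(tokens)
    lines = []

    offset = 0
    for i, (token, tag) in enumerate(zip(tokens, tags), start=1):
        begin = offset
        end = offset + len(token)  # one past the last character, like a Python slice
        lines.append(f"T{i}\t{tag} {begin} {end}\t{token}")
        offset = end + 1           # skip the single space between tokens

    for j, (label, head, dep) in enumerate(arcs, start=1):
        # The arrow is drawn from the first entity (head) to the second (dependent).
        lines.append(f"R{j}\t{label} Arg1:T{head + 1} Arg2:T{dep + 1}")

    return text, "\n".join(lines) + "\n"


text, ann = to_brat(["Profit", "arose"], ["NN", "VBD"], [("nsubj", 1, 0)])
print(text)
print(ann, end="")
```

Because “end” is exclusive, `text[begin:end]` always gives back the annotated span, which is a quick sanity check worth running on converted files.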

conversion script

You can see the script to output annotation files for brat here.

running the server

Just run “python standalone.py” from the brat directory where you unpacked the brat archive. The brat website has more detailed instructions. Depending on the web server settings, you can just drop the brat directory somewhere that is served and it will automatically be accessible through the CGI interface.

Of course in a serious installation that will actually be used for annotation by multiple people, you’d need to set up permissions and everything, but for a fairly quick and easy visualization solution it’s not necessary.

debugging tip

If you need to debug something because it’s not working and you get a lot of unhelpful errors from the javascript, enable debug output by changing the line in “config.py” in the brat directory from “Debug = False” to “Debug = True”. That should output much more helpful error messages.

Unescaping HTML (harder than it looks)



I have a dataset that consists of aggregated blog posts, saved in an XML format — meaning that the original content of each blog (HTML) is saved inside an XML document, and therefore has been ‘escaped’, so that, for example, all of the ‘<‘s have been converted into ‘&lt;’s.

So far, so good, except that what I really need is the actual text of each post. Okay, I think. It’s just HTML; surely Python has an easy way to convert HTML entities back to the original text. Alas, it was not to be. There is an undocumented method of the HTMLParser class called ‘unescape’ that ostensibly does what I want it to (see http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/), but it barfed a UnicodeDecodeError at me and since I have no idea where any non-ASCII characters might be coming from and no idea what is going on inside this undocumented method, I took the path of least resistance and looked elsewhere.
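For reference, Python 3.4 and later ship a documented `html.unescape` function in the standard library. The following is only a sketch of how the same two-step job (unescape, then pull out the text) might look with it; the `TextExtractor` class and the `escaped_html_to_text` function are names made up for this example, and I have not run it against the dataset described here.

```python
import html
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def escaped_html_to_text(line):
    # Step 1: undo the XML escaping, e.g. '&lt;p&gt;' becomes '<p>'.
    markup = html.unescape(line)
    # Step 2: parse the recovered HTML and keep only the text.
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(escaped_html_to_text("&lt;p&gt;Fish &amp;amp; chips&lt;/p&gt;"))
```

Since these functions work on `str` rather than bytes, the input would need to be decoded (for example as UTF-8) when the file is read.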

Elsewhere turned out to be Nokogiri, a Ruby gem for working with XML and HTML. Here’s the script; it simply parses each line into HTML, and then parses the resulting HTML and extracts the text:

require 'nokogiri'

ARGF.each do |line|
 html = Nokogiri::HTML.fragment(line, 'UTF-8').text
 puts Nokogiri::HTML.fragment(html, 'UTF-8').text