The Stanford dependencies provide a representation of grammatical relations between words in a sentence. They have been designed to be easily understood and effectively used by people who want to extract textual relations. …
Stanford dependencies are a dependency-based representation that is semantically-oriented; it tries to make the more semantically “important” word the head in a grammatical relation / dependency arc. It provides a basic representation that is similar to other dependency representations as well as a transformed representation that further surfaces semantic relations by, among other transformations, collapsing prepositions out of the graph and replacing them with a single arc from the head of the preposition to its daughter. See the link above for more explanation and an example.
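To make the collapsing step concrete, here is a minimal sketch in Python. It is not the Stanford Parser's implementation: the function name `collapse_prepositions` and the triple format are my own assumptions, and words are identified by their surface string only, whereas real output attaches a token index to each word. It relies only on the basic `prep` and `pobj` relations and the collapsed `prep_<word>` naming.

```python
from typing import Dict, List, Tuple

# A dependency arc as (relation, governor, dependent). Hypothetical format:
# real parser output also carries token indices, omitted here for brevity.
Dep = Tuple[str, str, str]


def collapse_prepositions(deps: List[Dep]) -> List[Dep]:
    """Replace each prep(head, p) + pobj(p, obj) pair with prep_p(head, obj)."""
    # Map every preposition to the objects it governs.
    objects: Dict[str, List[str]] = {}
    for rel, gov, dep in deps:
        if rel == "pobj":
            objects.setdefault(gov, []).append(dep)

    # Prepositions that are themselves attached to a head via a prep arc.
    attached = {dep for rel, _, dep in deps if rel == "prep"}

    collapsed: List[Dep] = []
    for rel, gov, dep in deps:
        if rel == "prep" and dep in objects:
            # One arc from the preposition's head straight to its daughter.
            for obj in objects[dep]:
                collapsed.append((f"prep_{dep.lower()}", gov, obj))
        elif rel == "pobj" and gov in attached:
            # Already folded into a prep_<word> arc above.
            continue
        else:
            collapsed.append((rel, gov, dep))
    return collapsed


# Hand-written basic dependencies for "much of the profit arose".
basic = [
    ("nsubj", "arose", "much"),
    ("prep", "much", "of"),
    ("pobj", "of", "profit"),
    ("det", "profit", "the"),
]
collapsed = collapse_prepositions(basic)
```

The preposition disappears as a node and survives only in the relation name, which is what makes the collapsed form convenient for extracting relations between content words.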
Disclaimer: None of the below is meant to disparage the undeniable usefulness of the Stanford dependencies or the Stanford Parser. I use the Stanford Parser in my own research; these are some thoughts I had while reviewing the output.
The output of the Stanford Parser suffers from errors and inconsistencies. Errors are a problem for all parsers (and NLP in general), but I have seen a pattern of errors that may be specific to the Stanford Parser implementation. Inconsistency is a more central problem for the Stanford dependency representation, because of its status as a de facto standard implemented in the Stanford Parser.
I had occasion to manually review the dependency output of the Stanford Parser on two documents from the Gigaword corpus. Below is an analysis of the errors I noted (my research is focused on verbs, so I ignored errors that didn’t affect a relation governing a verb; also this is based on my non-expert manual analysis, so keep your salt-shaker close to hand).
Out of a total of 87 sentences, I noted errors in 27. Discounting 1 sentence that was incorrectly split by the Punkt sentence tokenizer and 5 sentences that were in all caps, that leaves 21 sentences with parser errors, for an error rate of .24 (21 / 87). Of these errors, 2 sentences were likely affected by excessive length, as the parse result for these was completely incorrect.
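The arithmetic behind that figure, spelled out as a quick check. The variable names are mine. The second rate is not a number from my review notes; it only shows what the figure would be if the 6 discounted sentences were also dropped from the denominator.

```python
total_sentences = 87   # sentences reviewed
flagged = 27           # sentences in which I noted an error
bad_split = 1          # mis-split by the Punkt sentence tokenizer
all_caps = 5           # all-caps sentences

# Errors attributable to the parser itself.
parser_errors = flagged - bad_split - all_caps

# Rate as reported above: discounted sentences stay in the denominator.
reported_rate = parser_errors / total_sentences

# Assumption for comparison only: discounted sentences removed entirely.
alternative_rate = parser_errors / (total_sentences - bad_split - all_caps)

print(f"{parser_errors} errors, {reported_rate:.2f} reported, "
      f"{alternative_rate:.2f} with discounted sentences excluded")
```

Either way the rate is roughly one sentence in four.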
Errors by dependency type:
The root word was misidentified in 7 sentences, including 1 where the sentence had no verb and 2 where the copula was identified as the root (the copula should be analyzed as a dependent of its syntactic complement).
The next most common type of error was confusion between adjuncts and complements of clauses. Adverbial clauses modifying another clause were labeled as complements 4 times, while a clausal complement was labeled as an adjunct once. There seems to be a bias toward treating clauses as complements, even when the governing verb does not normally take a clausal argument.
My goal is to use the Stanford Parser output to identify verbs which have a high probability of describing events in a narrative; I want to include verbs which contribute significantly to the semantic content of a document’s narrative. The apparent tendency to identify non-complements as complements, and to a lesser extent the misidentification of root words, lowers my confidence in the results. Is a ~.88 “success” rate (complements vs. adjuncts vs. root words) “good enough”? I need a better idea of what performance is good enough.
The way Stanford dependencies are implemented (in the Stanford Parser) leads to some inconsistency: some constructions are headed by content words, while others retain their syntactic head word as head. Stanford dependencies are produced by pattern-matching on phrase-structure parse trees, so each construction must be implemented separately. This inevitably means that there will be some constructions which are not covered. Of course, the implementation can be improved gradually.
A related problem is the question of how far to go in transforming the parse tree. Some constructions may require an even more radical transformation than is currently done, in order to surface a content word as head (I show an example below). How far is too far? The answer is up to the implementers of the Stanford Parser.
The last related problem is that of performance. There is no gold standard data—the “gold standard” is the output of the Stanford Parser on gold standard phrase-structure trees. This means we (users) can’t evaluate the Stanford dependencies output. I am sure the researchers at Stanford do their own checking, but I have not seen any numbers for coverage or accuracy of the tree transformations. Some gold-standard data was apparently created for a BioNLP task, but I don’t know how useful that domain-specific data would be.
So the question remains, should we be satisfied with the Stanford dependencies? I certainly intend to use them—they offer a substantial improvement in usefulness over either phrase-structure trees or standard dependencies. Still, there appears to be room to consider alternate approaches, if they have advantages over the Stanford dependencies approach. All this has certainly made me think, and I have a few ideas of my own. I’m not sure whether I will have time to explore them, though, which seems rather appropriate, I feel—or ironic, maybe.
Consider the following parser output:
The phrase “much of the first-quarter profit” is parsed as an NP headed by the adverb “much”. I think “much” is a noun, pronoun, or possibly a (pre-)determiner in this case, but this is a problem with the Penn Treebank, which only tags “much” as RB or JJ. The relation nsubj(arose, much) doesn’t convey much (haha) meaning, though. It would be nice if “profit” were the head, perhaps with an amod(profit, much_of) relation, but this is a radical transformation of the original syntax, and it is not clear what the best output here is.
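As a rough sketch of what such a re-heading could look like on dependency triples (this is my own illustration, not something the Stanford Parser does): the function `rehead_quantifier` and its `quantifiers` parameter are hypothetical, the input arcs other than nsubj(arose, much) are written by hand rather than copied from parser output, and words are matched by surface string only.

```python
from typing import Dict, List, Tuple

Dep = Tuple[str, str, str]  # (relation, governor, dependent)


def rehead_quantifier(deps: List[Dep],
                      quantifiers: Tuple[str, ...] = ("much",)) -> List[Dep]:
    """Promote the noun in 'QUANT of NOUN' to head; demote QUANT to a modifier."""
    # Find prep_of(quantifier, noun) arcs: the noun will replace the quantifier.
    promote: Dict[str, str] = {}
    for rel, gov, dep in deps:
        if rel == "prep_of" and gov in quantifiers:
            promote[gov] = dep

    out: List[Dep] = []
    for rel, gov, dep in deps:
        if rel == "prep_of" and gov in promote:
            # prep_of(much, profit) becomes amod(profit, much_of).
            out.append(("amod", dep, f"{gov}_of"))
        elif dep in promote:
            # Any arc pointing at the quantifier now points at the noun.
            out.append((rel, gov, promote[dep]))
        else:
            out.append((rel, gov, dep))
    return out


# Hand-written collapsed arcs for the phrase discussed above (an assumption).
arcs = [
    ("nsubj", "arose", "much"),
    ("prep_of", "much", "profit"),
    ("det", "profit", "the"),
]
reheaded = rehead_quantifier(arcs)
```

A rule like this has to list its quantifiers explicitly, which is exactly the construction-by-construction coverage problem described earlier.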
I think that the “most correct” parse here is:
(NP (NN much) (PP (IN of) (NP (DT the) (NN first-quarter) (NN profit))))
and a more semantically oriented analysis could be:
(NP (PRE-DET much of) (DT the) (NN profit))
Well, it should be something like the above, anyway.
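The difference between the two bracketings can be shown with a toy head rule. This is only a sketch under my own assumptions: `parse` is a minimal bracket reader and `np_head` picks the rightmost noun among an NP's immediate children, which is far simpler than the head rules a real dependency converter uses.

```python
import re
from typing import List, Optional, Tuple, Union

Tree = Tuple[str, List[Union["Tree", str]]]  # (label, children)


def parse(s: str) -> Tree:
    """Read one bracketed tree, e.g. '(NP (DT the) (NN profit))'."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def node() -> Tree:
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        children: List[Union[Tree, str]] = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word under a tag
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def np_head(tree: Tree) -> Optional[str]:
    """Toy rule: the head is the rightmost NN* among the immediate children."""
    _, children = tree
    nouns = [" ".join(c[1]) for c in children
             if isinstance(c, tuple) and c[0].startswith("NN")]
    return nouns[-1] if nouns else None


syntactic = parse("(NP (NN much) (PP (IN of) (NP (DT the) "
                  "(NN first-quarter) (NN profit))))")
semantic = parse("(NP (PRE-DET much of) (DT the) (NN profit))")

print(np_head(syntactic))  # head under the first bracketing
print(np_head(semantic))   # head under the second bracketing
```

Under the first bracketing the rule returns "much", because "profit" is buried inside the PP; under the second it returns "profit". The same pattern gives a content-word head only once the tree itself has been restructured.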