
I have a dataset of aggregated blog posts saved in XML. The original content of each post (HTML) is stored inside an XML document and has therefore been ‘escaped’, so that, for example, every ‘<’ has been converted into ‘&lt;’.

So far, so good, except that what I really need is the actual text of each post. Okay, I think. It’s just HTML; surely Python has an easy way to convert HTML entities back to the original text. Alas, it was not to be. There is an undocumented method of the HTMLParser class called ‘unescape’ that ostensibly does what I want (see http://fredericiana.com/2010/10/08/decoding-html-entities-to-text-in-python/), but it barfed a UnicodeDecodeError at me. Since I have no idea where any non-ASCII characters might be coming from, and no idea what is going on inside this undocumented method, I took the path of least resistance and looked elsewhere.

Elsewhere turned out to be Nokogiri, a Ruby gem for working with XML and HTML. Here’s the script. It parses each line once to undo the XML escaping and recover the original HTML, then parses that HTML a second time and extracts the text:

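require 'nokogiri'
ARGF.each do |line|
  # Pass 1 undoes the XML escaping; pass 2 strips the tags from the recovered HTML.
  html = Nokogiri::HTML.fragment(line, 'UTF-8').text
  puts Nokogiri::HTML.fragment(html, 'UTF-8').text
end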
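To see why two passes are needed, here is a rough sketch that traces one made-up line through both steps using only Ruby's standard library. The helper names `recover_html` and `extract_text` and the sample string are mine, not part of the dataset or of Nokogiri. The regex tag-stripping is a crude stand-in for a real HTML parser, and `CGI.unescapeHTML` only knows a handful of named entities plus numeric ones, so treat this as an illustration rather than a replacement for the script above.

```ruby
require 'cgi'

# Pass 1: undo the XML escaping to recover the post's original HTML.
def recover_html(escaped_line)
  CGI.unescapeHTML(escaped_line)
end

# Pass 2: drop the tags (crudely, with a regex), then decode the entities
# that belonged to the HTML itself.
def extract_text(html)
  CGI.unescapeHTML(html.gsub(/<[^>]*>/, ''))
end

# Hypothetical input line: an HTML paragraph containing '&amp;',
# escaped once more when it was stored inside the XML document.
sample = '&lt;p&gt;Fish &amp;amp; chips&lt;/p&gt;'

html = recover_html(sample)
puts html                # <p>Fish &amp; chips</p>
puts extract_text(html)  # Fish & chips
```

After the first pass the line is ordinary HTML again, tags and entities included, which is why a single round of unescaping is not enough to get at the plain text.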