OK. This is a better way: define a custom TreeWalker which converts named entities to utf-8 as it goes. This avoids having to do an extra tree traversal in sanitize_rexml, AND avoids the trainwreck that is html5/inputstream.rb.
My REXML::Element.to_ncr (and REXML::Element.to_utf8) is horribly slow. For long documents, it proves more efficient to serialize to a string, apply String.to_ncr (or String.to_utf8) and then Sanitize the string.
Apparently, the form_spam_protect plugin only works with HTTP POST, not GET.
Unsafe operations (save and file-upload) should be POSTs anyway.
Fixed.
Also, two broken tests fixed. Only two Unit Tests now fail: both are minor bugs in XHTMLDiff.
In preparation for adding new tests, let's fix the existing ones.
3 Unit tests and one Functional test still fail.
* Two unit tests are bugs in xhtmldiff
* One is a bug in Maruku
* A file upload functional test fails, for reasons that escape me.