Completely removed the html5lib sanitizer.
Fixed the string-handling to work in both
Ruby 1.8.x and 1.9.2. There are still,
inexplicably, two functional tests that
fail. But the rest seems to work quite well.
Apply the same methodology, as in Revision 432,
to the category chunk-handler. This completes
the replacement of all the code that looks like
if string.is_utf8?
do something
else
complain
end
with code that looks like
string.purify
do something
Another request from the old (and apparently defunct) Instiki Bug Tracker:
allow single letter WikiLinks, e.g. "[[a]]". Requested by a Japanese user.
Fixed.
Previously,
<nowiki>[[!include foo]]</nowiki>
would produce some garbage, like
chunk18226682includechunk
instead of the desired rendered text,
[[!include foo]]
Fixed.
The html5lib sanitizer does not necessarily produce well-formed output.
Take some "bad" input, wrap it in a <nowiki> tag and -- bingo! -- you get
ill-formed output.
Fixed. (Though, probably, one should fix the html5lib sanitizer, instead.)
The new sanitizer seems to work well (cuts the time required
to produce the Instiki Atom feed in half). Our strategy is to
use HTML5lib for <nowiki> content, but to use the new sanitizer
for content that has been processed by Maruku (and hence is
well-formed).
The one broken unit test won't affect us (since it dealt with
very malformed HTML).
Start work (which may not pan out) on a new sanitizer. Right now, it passes
all but 1 of the HTML5lib Sanitizer's unit tests. But it doesn't do much
of anything to ensure well-formedness. This is not an issue for Maruku-processed
content, but it is a concern for <nowiki> blocks.
(One solution would be to use the HTML5lib parser on <nowiki> blocks.)
In any case, this baby is 3 times as fast as the HTML5lib sanitizer.
Previously, used a regexp to find and convert named entities in the content.
Now use a more efficient algorithm.
Similar tweak for converting NCRs before checking whether text is valid utf-8.