Some more tests from Clint Ruoho. The main branch of Instiki (and, I guess,
the old sanitizer) are vulnerable.
Also: under Ruby 1.8.x, CGI.unescapeHTML screws up horribly decoding NCRs
which represent high-bit ASCII characters. UTF-8 agrees with 7-bit ASCII,
but CGI.unescapeHTML doesn't seem to know that they disagree for i>127.
CMyApp is a WikiWord (at least, on other Wiki systems, like TWiki).
Should allow that here
Also, choose a more obscure name for the thread-local variable tracking
included chunks.
Use "Thread.current[:included_by]" instead of the Class variable,
"@@included_by".
The former will work on some newfangled multi-threaded Webserver stack,
which uses separate threads to handle multiple simlutaneous requests
(one request/thread). Dunno that the rest of the application is
thread-safe, but using a class variable, in this context, probably isn't.
Thanks to Sam Ruby for the suggestion.
Another request from the old (and apparently defunct) Instiki Bug Tracker:
allow single letter WikiLinks, e.g. "[[a]]". Requested by a Japanese user.
Fixed.
Another very amusing 3-year old bug from the main Instiki Bug Tracker
(don't they ever fix anything?): the chunk-handling code was supposed
to prevent recursive [[!include ...]] statements. Alas, instead of
actually preventing them it would -- when it encountered a recursive
include -- churn away until Rails ran out of stack space.
Fixed.
Previously,
<nowiki>[[!include foo]]</nowiki>
would produce some garbage, like
chunk18226682includechunk
instead of the desired rendered text,
[[!include foo]]
Fixed.
By default, Rails will cache
example.com/mywiki/show/SomePage
and
www.example.com/mywiki/show/SomePage
In Instiki, this just leads to stale cached pages and frustration.
Fix that behaviour.
Be a little gentler in recovering from Instiki::ValidationErrors, when saving a page.
Previously, we threw away all the user's changes upon the redirect. Now we attempt
to salvage what he wrote.
Links to a published web should be to the 'publish' action, not to the
'show' action. Previously, the published status of the source, not the target
was used.
Also, correct display of the Navigation Links for the 'published' action.
The html5lib sanitizer does not necessarily produce well-formed output.
Take some "bad" input, wrap it in a <nowiki> tag and -- bingo! -- you get
ill-formed output.
Fixed. (Though, probably, one should fix the html5lib sanitizer, instead.)
Updated to Rails 2.2.2.
Added a couple more Ruby 1.9 fixes, but that's pretty much at a standstill,
until one gets Maruku and HTML5lib working right under Ruby 1.9.
The new sanitizer seems to work well (cuts the time required
to produce the Instiki Atom feed in half). Our strategy is to
use HTML5lib for <nowiki> content, but to use the new sanitizer
for content that has been processed by Maruku (and hence is
well-formed).
The one broken unit test won't affect us (since it dealt with
very malformed HTML).
Start work (which may not pan out) on a new sanitizer. Right now, it passes
all but 1 of the HTML5lib Sanitizer's unit tests. But it doesn't do much
of anything to ensure well-formedness. This is not an issue for Maruku-processed
content, but it is a concern for <nowiki> blocks.
(One solution would be to use the HTML5lib parser on <nowiki> blocks.)
In any case, this baby is 3 times as fast as the HTML5lib sanitizer.
The "optimization" of using arrays instead of regexps to
implement to_utf8 and is_utf8? (and their brethren) is
actually no faster. Go back to the logically-clearer implementation.
Previously, used a regexp to find and convert named entities in the content.
Now use a more efficient algorithm.
Similar tweak for converting NCRs before checking whether text is valid utf-8.
Upgraded to Rails 2.0.2, except that we maintain
vendor/rails/actionpack/lib/action_controller/routing.rb
from Rail 1.2.6 (at least for now), so that Routes don't change. We still
get to enjoy Rails's many new features.
Also fixed a bug in Chunk-handling: disable WikiWord processing in tags (for real this time).
OK. This is a better way: define a custom TreeWalker which converts named entities to utf-8 as it goes. This avoids having to do an extra tree traversal in sanitize_rexml, AND avoids the trainwreck that is html5/inputstream.rb.
My REXML::Element.to_ncr (and REXML::Element.to_utf8) is horribly slow. For long documents, it proves more efficient to serialize to a string, apply String.to_ncr (or String.to_utf8) and then Sanitize the string.
The file upload dialog asks for a description of the image or file to be uploaded. Use this as the default alt-text for the image and as a title attribute for a file link.
The URIChunk and LocalURICunk handlers were
1) Slow
2) Buggy (prone to produce ill-formed pages in edge cases)
3) Of dubious utility
So I ditched them. No auto-linked URLs, but who cares?
Fixed a bug in the HTML5lib tokenizer (affects S5 slideshows).
Some miscellaneous code cleanup. In particular, don't bother with zapping control characters;
instead, rely on is_utf8? method to raise an exception (which we do anyway).