
20 commits

Author SHA1 Message Date
Jacques Distler
d3e79ea84a Make truncate() Unicode-aware 2009-12-14 17:41:28 -06:00
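What "Unicode-aware" truncation means can be sketched as follows. This is only an illustration, not the code from this commit: the helper name `truncate_chars` and its parameters are hypothetical. The point is to count and cut by characters rather than bytes, so a multi-byte UTF-8 sequence is never split.

```ruby
# Hypothetical helper (not Instiki's actual truncate()).
# Cuts on character boundaries, so the result stays valid UTF-8.
def truncate_chars(text, length = 30, omission = '...')
  chars = text.chars.to_a                 # characters, not bytes
  return text if chars.length <= length
  # Reserve room for the omission marker within the length budget.
  keep = [length - omission.chars.to_a.length, 0].max
  chars.first(keep).join + omission
end

sample = "h\u00E9llo w\u00F6rld"          # contains two 2-byte characters
short  = truncate_chars(sample, 8)
# A byte-oriented cut can land inside a character and break the encoding:
broken = sample.byteslice(0, 2)
```

Here `short` keeps five characters plus the omission marker, while `broken` ends in half of a two-byte sequence and is no longer valid UTF-8.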
Jacques Distler
faac8951a3 More Ruby 1.9 String Encoding Fun 2009-12-08 08:50:01 -06:00
Jacques Distler
171c12d2c1 Efficiency
This version of String#purify
is 12% faster, under Ruby 1.9,
than before.
2009-12-05 10:50:58 -06:00
Jacques Distler
34b63a8375 Fix a Ruby 1.9 Character Encoding Bug
Wow, this stuff is complicated!
Some things really want to be UTF-8;
others really want to be byte strings.
2009-12-01 12:03:15 -06:00
Jacques Distler
a6429f8c22 Ruby 1.9 Compatibility
Completely removed the html5lib sanitizer.
Fixed the string-handling to work in both
Ruby 1.8.x and 1.9.2. There are still,
inexplicably, two functional tests that
fail. But the rest seems to work quite well.
2009-11-30 16:28:18 -06:00
Jacques Distler
371aab6f96 Sync with Latest itex2MML and MathML::Entities
Support the latest changes in
http://www.w3.org/TR/2009/WD-xml-entity-names-20091117/
2009-11-18 12:04:07 -06:00
Jacques Distler
e0df6c8a6a Updated Tests and Sanitizer Fixes for Revision 439 2009-09-25 15:59:43 -05:00
Jacques Distler
b438bc64f6 Update More MathML Entity Mappings
Bring up-to-date with Editor's copy of
XML Entity definitions for Characters
(W3C Working Draft 13 September 2009)
http://www.w3.org/2003/entities/2007doc/overview.html
2009-09-25 14:34:22 -05:00
Jacques Distler
31ed55f055 Update MathML Entity Mappings
Update list of XHTML+MathML named entities
to match
http://www.w3.org/TR/2008/WD-xml-entity-names-20080721/
2009-09-24 16:21:22 -05:00
Jacques Distler
7185af32fc Fix an Eyesore
That just looked sloppy. I blame copy/paste.
2009-09-09 15:01:25 -05:00
Jacques Distler
3ff68ef42f Don't Expand NCRs
That operation is not idempotent (among other defects).
Instead, just check that the NCRs correspond to valid utf-8.
(Reported by Andrew Stacey)
2009-09-09 09:16:00 -05:00
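A minimal sketch of checking NCRs without expanding them, assuming the XML 1.0 character ranges are the intended notion of validity. The names `NCR`, `valid_codepoint?` and `ncrs_valid?` are hypothetical and are not taken from this commit. Expansion is not idempotent because `&#38;#60;` expands to `&#60;`, and a second pass turns that into `<`; a pure check leaves the text untouched, so it can be run any number of times.

```ruby
# Hypothetical sketch; not the code from this commit.
# Matches decimal (&#233;) and hexadecimal (&#xE9;) character references.
NCR = /&#(?:x([0-9a-fA-F]+)|([0-9]+));/

# Assumption: "valid" means a code point allowed by the XML 1.0 Char production.
def valid_codepoint?(cp)
  return false if cp > 0x10FFFF                      # beyond Unicode
  return false if (0xD800..0xDFFF).cover?(cp)        # surrogates
  return false if cp == 0xFFFE || cp == 0xFFFF       # non-characters
  return false if cp < 0x20 && ![0x9, 0xA, 0xD].include?(cp)
  true
end

# Returns true when every NCR in the text names a valid code point.
# The text itself is never modified.
def ncrs_valid?(text)
  text.scan(NCR).all? do |hex, dec|
    valid_codepoint?(hex ? hex.hex : dec.to_i)
  end
end
```

For example, `ncrs_valid?("caf&#233;")` holds, while a reference to a surrogate such as `&#xD800;` is rejected.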
Jacques Distler
116255dc0d Purify Categories
Apply the same methodology as in Revision 432
to the category chunk-handler. This completes
the replacement of all the code that looks like

  if string.is_utf8?
    do something
  else
    complain
  end

with code that looks like

  string.purify
  do something
2009-09-07 20:38:09 -05:00
Jacques Distler
c79fef9c01 Clean, rather than Complain
Previously, if the user tried to submit content which was
malformed utf-8, Instiki would complain loudly to him.

A slightly more user-friendly approach was suggested by
the latest Rails 2.3.4, and a conversation with Sam Ruby
(who suggested some improvements).

Now, instead of complaining, we remove the offending bytes,
leaving a well-formed utf-8 string, which we pretend is what
the user meant to submit.
2009-09-07 16:02:36 -05:00
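The "remove the offending bytes" approach can be sketched as below. This is an assumption-laden illustration, not the implementation from this commit: it uses `String#scrub`, which only exists in Ruby 2.1 and later and so could not have been used in 2009, and the free-standing method name `purify` merely echoes the `String#purify` mentioned in these commit messages.

```ruby
# Hypothetical sketch of "clean, rather than complain".
# Treats the input as UTF-8 and drops any bytes that do not form
# a well-formed sequence, returning a valid UTF-8 string.
def purify(str)
  utf8 = str.dup.force_encoding(Encoding::UTF_8)  # do not mutate the caller's string
  utf8.scrub('')                                  # replace invalid bytes with nothing
end

dirty = "abc\xFFdef".b        # raw bytes containing a stray 0xFF
clean = purify(dirty)
```

The cleaned string has the stray byte removed and reports `valid_encoding?` as true; input that is already well-formed passes through unchanged.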
Jacques Distler
52c1f74ecc Add a couple of XSS tests.
Some more tests from Clint Ruoho. Both the main branch of Instiki and, I guess,
the old sanitizer are vulnerable.

Also: under Ruby 1.8.x, CGI.unescapeHTML screws up horribly when decoding NCRs
which represent characters above 127. UTF-8 agrees with 7-bit ASCII, but
CGI.unescapeHTML doesn't seem to know that UTF-8 and single-byte encodings
disagree for i>127.
2009-01-05 16:25:27 -06:00
Jacques Distler
a503e2b8ac Gentler
Be a little gentler in recovering from Instiki::ValidationErrors, when saving a page.
Previously, we threw away all the user's changes upon the redirect. Now we attempt
to salvage what he wrote.
2008-12-17 00:07:21 -06:00
Jacques Distler
2e81ca2d30 Rails 2.2.2
Updated to Rails 2.2.2.
Added a couple more Ruby 1.9 fixes, but that's pretty much at a standstill,
until one gets Maruku and HTML5lib working right under Ruby 1.9.
2008-11-24 15:53:39 -06:00
Jacques Distler
ca1e8de89c Minor Cleanups
Remove a no-longer-needed function.
' -> &#39;
Fix regexp for tag chunk.
2008-05-22 02:46:45 -05:00
Jacques Distler
f6508de6dd Whoops!
In some circumstances, the new Sanitizer was double-escaping text nodes.
Fixed (with unit test).
2008-05-21 14:14:43 -05:00
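The double-escaping symptom is easy to reproduce in isolation. The sketch below is not the sanitizer's code; `escape_text` and `HTML_ESCAPES` are hypothetical names for a plain text-node escaper, used only to show what happens when the same text passes through escaping twice.

```ruby
# Hypothetical text-node escaper, for illustration only.
HTML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

def escape_text(text)
  text.gsub(/[&<>"]/) { |ch| HTML_ESCAPES[ch] }
end

raw   = 'a < b & c'
once  = escape_text(raw)    # what a text node should contain
twice = escape_text(once)   # the bug: the ampersands of the entities get escaped again
```

A regression test of the kind mentioned above would assert that the output for `raw` equals `once` and never contains `&amp;lt;`.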
Jacques Distler
45405fc97e New Sanitizer Goes Live
The new sanitizer seems to work well (cuts the time required
to produce the Instiki Atom feed in half). Our strategy is to
use HTML5lib for <nowiki> content, but to use the new sanitizer
for content that has been processed by Maruku (and hence is
well-formed).

The one broken unit test won't affect us (since it dealt with
very malformed HTML).
2008-05-21 02:06:31 -05:00
Jacques Distler
800880f382 Rough In New Sanitizer
Start work (which may not pan out) on a new sanitizer. Right now, it passes
all but 1 of the HTML5lib Sanitizer's unit tests. But it doesn't do much
of anything to ensure well-formedness. This is not an issue for Maruku-processed
content, but it is a concern for <nowiki> blocks.

(One solution would be to use the HTML5lib parser on <nowiki> blocks.)

In any case, this baby is 3 times as fast as the HTML5lib sanitizer.
2008-05-20 17:02:10 -05:00