CRYPTO-GRAM (http://www.counterpane.com/crypto-gram.html) is Bruce
"Applied Cryptography" Schneier's excellent email newsletter, and one of
the few things I bother to read every month. It's full of interesting
articles, including this one...
> Security Risks of Unicode
>
> Unicode is an international character set. Like ASCII, it provides a
> standard correspondence between the binary numbers that computers
> understand and the letters, digits, and punctuation that people
> understand. But unlike ASCII, it seeks to provide a code for every
> character in every language in the world. To do this requires more than
> 256 characters, the 8-bit ASCII character set; default Unicode uses 16-bit
> characters, and there are rules to extend even that.
>
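An aside of my own, not part of the newsletter: strictly speaking, ASCII is a 7-bit code with 128 characters, and the 256 figure describes its 8-bit extensions. The "rules to extend" 16-bit Unicode are, as far as I can tell, the UTF-16 surrogate pairs, where a code point above U+FFFF is split across two 16-bit units. A minimal Python sketch, in which to_surrogates is a name I made up for illustration:

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000          # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)         # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits go in the low surrogate
    return high, low

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range
high, low = to_surrogates(ord(clef))
print(hex(high), hex(low))
# Python's own UTF-16 encoder should produce the same two 16-bit units.
print(clef.encode("utf-16-be").hex())
```

So "one character" may be one 16-bit unit or two, which is one more thing a filter has to get right.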
> I don't know if anyone has considered the security implications of this.
>
> Remember all those input validation attacks that were based on replacing
> characters with alternate representations, or that explored alternative
> delimiters? For example, there was a hole in the IRIX Web server: if you
> could replace spaces with tabs you could fool the parser, and you could use
> hexadecimal escapes, strange quoting, and nulls to defeat input validation.
>
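To make that class of attack concrete, here is a small Python sketch of my own. It is not the IRIX bug itself, whose details aren't given here; naive_is_safe and handle are hypothetical names, and the assumption is a filter that inspects the raw input before a later layer decodes the escapes.

```python
from urllib.parse import unquote

def naive_is_safe(raw: str) -> bool:
    """Hypothetical filter: rejects spaces and '..' in the raw, still-escaped text."""
    return " " not in raw and ".." not in raw

def handle(raw: str) -> str:
    """Hypothetical upper layer that decodes %XX escapes after the check."""
    if not naive_is_safe(raw):
        raise ValueError("rejected")
    return unquote(raw)

# The escaped dots pass the filter, then turn back into "../" once decoded.
print(repr(handle("%2e%2e/secret")))
# A tab separates words just as well as the space the filter looks for.
print(repr(handle("ls\t-l")))
```

Checking after decoding instead of before, and treating tab as white space, would close both gaps in this toy case.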
> The Unicode specification includes all sorts of complicated new escape
> sequences. They have things called UTF-8 and UTF-16, which allow several
> possible representations of various character codes, several different
> places where control-characters pop through, a scheme for placing
> diacriticals and accents in separated characters (looking very much like an
> escape), and hundreds of brand new punctuation characters and otherwise
> nonalphabetic characters.
>
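A quick illustration of the "several possible representations" point, again my own sketch rather than anything from the newsletter. It uses Python's standard unicodedata module; same_after_nfc is a hypothetical helper name.

```python
import unicodedata

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # plain "e" followed by COMBINING ACUTE ACCENT

def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# The two strings render alike but differ code point by code point.
print(composed == decomposed)
print(same_after_nfc(composed, decomposed))
# The accent is a separate nonspacing mark (category Mn) that modifies the "e".
print(unicodedata.category("\u0301"))
# Even the byte lengths differ once encoded as UTF-8.
print(len(composed.encode("utf-8")), len(decomposed.encode("utf-8")))
```

A blocklist that compares raw strings would treat these as two different words, while a layer that normalizes later would treat them as the same one.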
> The philosophy behind the Unicode spec is to provide all possible useful
> characters for applications that are 8- or 16-bit clean. This is
> admirable, but it is nearly impossible to filter a Unicode character stream
> to decide what is "safe" in some application and what is not.
>
> What happens when:
>
> - We start attaching semantics to the new characters as delimiters, white
> space, etc? With thousands of characters and new characters being added
> all the time, it will be extremely difficult to categorize all the possible
> characters consistently, and where there is inconsistency, there tend to
> be security holes.
>
> - Somebody uses "modifier" characters in an unexpected way?
>
> - Somebody uses UTF-8 or UTF-16 to encode a conventional character in a
> novel way to bypass validation checks?
>
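Two of those questions can be sketched in a few lines of Python. This is my own illustration under stated assumptions: ascii_has_space and lenient_utf8_decode are hypothetical, and the lenient decoder deliberately skips the shortest-form check that strict UTF-8 decoders (Python's included) enforce.

```python
def ascii_has_space(text: str) -> bool:
    """Hypothetical validator that only knows the ASCII white space characters."""
    return any(ch in " \t\r\n\v\f" for ch in text)

field = "rm\u00a0-rf"  # NO-BREAK SPACE between the two words
print(ascii_has_space(field))  # the validator sees no white space
print(field.split())           # str.split() does treat U+00A0 as white space

def lenient_utf8_decode(data: bytes) -> str:
    """Decode 1- and 2-byte UTF-8 sequences WITHOUT rejecting overlong forms."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and (data[i + 1] & 0xC0) == 0x80:
            # 110xxxxx 10yyyyyy: no check that the value really needs two bytes
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported or malformed sequence")
    return "".join(out)

payload = b"..\xc0\xaf..\xc0\xafetc"  # 0xC0 0xAF is an overlong encoding of "/"
print(b"/" in payload)                # a byte-level check finds no slash
print(lenient_utf8_decode(payload))   # the lenient decoder reconstructs the slashes
try:
    payload.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoder refuses:", exc.reason)
```

In both halves the validator and the layer above it disagree about what a character means, which is exactly the inconsistency the article worries about.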
> With the ASCII character set, we could carefully study a small selection of
> characters, categorize them clearly, and make relatively straightforward
> decisions about the nature of each character. And even here, there have
> been mistakes (forgetting about tabs, multicharacter control-sequence
> snafus, etc). Still, a careful designer can figure out a safe way to deal
> with any possible character that can come off an untrusted wire by
> elimination if necessary.
>
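The "by elimination" approach is easy to sketch for ASCII. The following Python is my own minimal example, assuming a field that should contain only letters, digits, hyphens, and underscores; the names ALLOWED_BYTES and is_safe are hypothetical.

```python
import string

# Every byte value the field may contain; anything else is refused outright.
ALLOWED_BYTES = frozenset(
    (string.ascii_letters + string.digits + "-_").encode("ascii")
)

def is_safe(raw: bytes) -> bool:
    """Accept only non-empty input made entirely of allowed bytes."""
    return len(raw) > 0 and all(b in ALLOWED_BYTES for b in raw)

print(is_safe(b"report_2000"))                 # plain identifier
print(is_safe(b"a\tb"))                        # tab is not on the list
print(is_safe(b"..\xc0\xaf"))                  # neither are dots or high bytes
print(is_safe("caf\u00e9".encode("utf-8")))    # nor any non-ASCII text
```

The list is short enough to audit by eye, which is the property that gets lost once the input alphabet grows to thousands of characters.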
> With Unicode, we probably won't be able to get a consistent definition of
> what to accept, what is a delimiter under what circumstance, or how to
> handle arbitrary streams safely. It's just a matter of time before simple
> validators pass data and upper-layer software, trying to be helpful, attaches
> magic-character semantics, and we have a brand-new variety of security holes.
>
> Unicode is just too complex to ever be secure.
>
> Unicode:
> http://www.unicode.org