[Imap-protocol] Character encoding question
guenther+imap at sendmail.com
Wed Nov 2 12:16:35 PDT 2011
On Wed, 2 Nov 2011, Jeff Mckay wrote:
> Thanks for your comments. I'm still a bit confused. Let me clarify what
> I am seeing in these two examples. In the first, one of the characters
> in question is "lower case o with acute" which is supposed to be xF3 in
> ISO-8859-2 and xC3 xB3 in UTF-8. The imap server represents this as
> ampersand followed by AMP followed by a dash (I am writing out the
> description so it does not get interpreted incorrectly somewhere). If I
> take the AMP and run it through a base64 decoder, I get xF3.
No, when you run APM (not AMP) through a base64 decoder you get *two*
bytes, in hex 00 F3. This is the big-endian UTF-16 representation
of "lower case o with acute".
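A quick way to check this is the minimal Python sketch below. The variable names are only illustrative. It assumes the usual "=" padding, which IMAP's modified UTF-7 omits, has to be restored before a standard base64 decoder will accept the chunk.

```python
import base64

chunk = "APM"  # the text between & and - in the first example

# Modified UTF-7 drops the "=" padding; put it back for the decoder.
padded = chunk + "=" * (-len(chunk) % 4)
raw = base64.b64decode(padded)

print(raw.hex(" "))             # 00 f3
print(raw.decode("utf-16-be"))  # ó
```

The leading 00 is the NUL byte that is easy to lose when the decoded output is viewed as text.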
> In the second example, we have the letters Temp/New followed by a couple
> Chinese characters that I don't know the names of. The two Chinese
> characters are represented in imap by ampersand followed by bUuL1Q and
> the closing dash. When I base64 decode this I end up with x6D x4B x8B
> xD5. This appears to be big-endian UTF-16.
Yep. This is *exactly* what is specified by RFC 2152 ("UTF-7"), as
modified by RFC 3501.
> I have to byte-reverse each 2 byte sequence, but then I can convert it
> to UTF8 (my target) and see the Chinese characters.
Uh, I think you mean you do a conversion from the UTF-16BE to the UTF-8
that your display routines expect, right?
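As a sketch of that conversion, assuming Python's standard codecs and the four bytes quoted in the second example, no manual byte reversal is needed; the codec is simply told that the input is big-endian:

```python
raw = bytes.fromhex("6d4b8bd5")  # the four bytes from the second example

text = raw.decode("utf-16-be")   # interpret as big-endian UTF-16
utf8 = text.encode("utf-8")      # re-encode for a UTF-8 display routine

print(text)           # 测试
print(utf8.hex(" "))  # e6 b5 8b e8 af 95
```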
> I could also take the original data and stick a + in front of it (ending
> up with +bUuL1Q) and convert this from UTF7 to UTF8 and end up with
> valid characters. This last part I really don't understand - if it is
> base64 encoded, how is that valid UTF7?
Please go read RFC 2152 again. base64 encoding is a step in generating
UTF-7 encoded text.
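To illustrate why the "+" trick works, here is a small sketch using Python's built-in "utf-7" codec. It assumes a single chunk whose base64 part contains no ",", since RFC 3501 replaces "/" with "," in the base64 alphabet and plain UTF-7 decoders will not understand that.

```python
imap_chunk = "&bUuL1Q-"  # as sent by the IMAP server

# Swap the IMAP shift character "&" for the RFC 2152 shift character "+".
utf7_chunk = "+" + imap_chunk[1:]

# The standard UTF-7 codec does the base64 and UTF-16BE steps internally.
text = utf7_chunk.encode("ascii").decode("utf-7")

print(text)  # 测试
```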
> Anyway, I don't seem to have an algorithm that will work on both of
> these examples, and no way to detect which one I should use. Obviously
> I am totally confused about what I am doing, but any further insight
> would be appreciated.
I think you lost track of the NUL byte in the first example, and from that
ended up thinking a different conversion was necessary. The rules are
consistent. For a given &.....- chunk:
  1. strip the & and - delimiters
  2. base64 decode the ..... part
  3. convert that from UTF-16BE to whatever encoding you want to use