[Imap-protocol] Character encoding question

Philip Guenther guenther+imap at sendmail.com
Wed Nov 2 12:16:35 PDT 2011


On Wed, 2 Nov 2011, Jeff Mckay wrote:

> Thanks for your comments. I'm still a bit confused. Let me clarify what
> I am seeing in these two examples. In the first, one of the characters
> in question is "lower case o with acute" which is supposed to be xF3 in
> ISO-8859-2 and xC3 xB3 in UTF-8. The imap server represents this as
> ampersand followed by AMP followed by a dash (I am writing out the
> description so it does not get interpreted incorrectly somewhere). If I
> take the AMP and run it through a base64 decoder, I get xF3.


No, when you run APM (not AMP) through a base64 decoder you get *two*
bytes, 00 F3 in hex. This is the big-endian UTF-16 representation
of "lower case o with acute".
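
To see both bytes for yourself, here is a minimal sketch, assuming Python 3
and its standard base64 module. The "=" is added only because Python's
decoder expects padding, which the IMAP form leaves off.

```python
import base64

# "APM" is 3 base64 characters, so one "=" completes the 4-character group.
raw = base64.b64decode("APM=")
print(raw.hex())                 # 00f3 (two bytes, not one)

# Read the two bytes as one big-endian UTF-16 code unit: U+00F3.
ch = raw.decode("utf-16-be")
print(ch.encode("utf-8").hex())  # c3b3
```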



> In the second example, we have the letters Temp/New followed by a couple
> Chinese characters that I don't know the names of. The two Chinese
> characters are represented in imap by ampersand followed by bUuL1Q and
> the closing dash. When I base64 decode this I end up with x6D x4B x8B
> xD5. This appears to be big-endian UTF-16.


Yep. This is *exactly* what is specified by RFC 2152 ("UTF-7"), as
modified by RFC 3501.



> I have to byte-reverse each 2 byte sequence, but then I can convert it
> to UTF8 (my target) and see the Chinese characters.


Uh, I think you mean you do a conversion from the UTF-16BE to the UTF-8
that your display routines expect, right?
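
As a rough illustration of why no manual byte reversal is needed, the
sketch below (Python 3 assumed, variable names are mine) decodes the four
bytes directly as UTF-16BE and shows that swapping each pair and decoding
as little-endian gives the same text.

```python
# The four bytes obtained from base64-decoding bUuL1Q.
raw = bytes.fromhex("6d4b8bd5")

# Name the byte order explicitly instead of swapping bytes by hand.
text = raw.decode("utf-16-be")
utf8 = text.encode("utf-8")
print(utf8.hex())  # e6b58be8af95

# Swapping each 2-byte pair and decoding as UTF-16LE is equivalent.
swapped = b"".join(raw[i:i + 2][::-1] for i in range(0, len(raw), 2))
same = swapped.decode("utf-16-le") == text
```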



> I could also take the original data and stick a + in front of it (ending
> up with +bUuL1Q) and convert this from UTF7 to UTF8 and end up with
> valid characters. This last part I really don't understand - if it is
> base64 encoded, how is that valid UTF7?


Please go read RFC 2152 again. Base64 encoding is one of the steps in
generating UTF-7 encoded text, so a + followed by base64 data is exactly
what a UTF-7 decoder expects.
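
For what it's worth, a stock UTF-7 decoder accepts that form directly. A
small sketch, assuming Python 3's built-in "utf-7" codec, with the closing
"-" kept so the shift sequence is explicitly terminated:

```python
# "+" opens a base64 run of UTF-16BE code units; "-" closes it.
text = b"+bUuL1Q-".decode("utf-7")
utf8 = text.encode("utf-8")
```

This only covers plain RFC 2152 UTF-7. IMAP mailbox names use "&" as the
shift character and "," in place of "/", so a plain UTF-7 decoder is not a
drop-in replacement for them.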



> Anyway, I don't seem to have an algorithm that will work on both of
> these examples, and no way to detect which one I should use. Obviously
> I am totally confused about what I am doing, but any further insight
> would be appreciated.


I think you lost track of the NUL byte in the first example, and from that
ended up thinking a different conversion was necessary. The rules are
consistent. For a given &.....- chunk:
    1. strip the & and - delimiters
    2. base64 decode the ..... part (RFC 3501's modified base64 uses ","
       in place of "/" and omits "=" padding, so undo both first)
    3. convert the result from UTF-16BE to whatever encoding you want to use
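
The steps above can be sketched as follows. This is a minimal illustration
in Python 3, not a hardened implementation: the function name
decode_modified_utf7 is my own, the mailbox name in the usage lines is
assumed from the description in this thread, and malformed input is simply
left to raise an exception.

```python
import base64


def decode_modified_utf7(name: str) -> str:
    """Decode an IMAP mailbox name (RFC 3501 modified UTF-7) to str."""
    out = []
    i = 0
    while i < len(name):
        if name[i] != "&":
            # Outside a shift sequence, characters stand for themselves.
            out.append(name[i])
            i += 1
            continue
        # Step 1: find the closing "-" and strip both delimiters.
        end = name.index("-", i)  # ValueError if the chunk is unterminated
        chunk = name[i + 1:end]
        if chunk == "":
            # "&-" is the escape for a literal ampersand.
            out.append("&")
        else:
            # Step 2: undo the IMAP tweaks ("," for "/", no padding),
            # then base64 decode.
            b64 = chunk.replace(",", "/")
            b64 += "=" * (-len(b64) % 4)
            raw = base64.b64decode(b64)
            # Step 3: the bytes are UTF-16BE code units.
            out.append(raw.decode("utf-16-be"))
        i = end + 1
    return "".join(out)


first = decode_modified_utf7("&APM-")
second = decode_modified_utf7("Temp/New&bUuL1Q-")
```

Both examples go through the same code path; calling .encode("utf-8") on
either result gives the UTF-8 bytes for display.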


Philip Guenther


