[Imap-protocol] Character encoding question

Jeff Mckay jeff.mckay at comaxis.com
Wed Nov 2 14:45:00 PDT 2011


You're right - I understand now and have my code working. Thanks for
your help.

Philip Guenther wrote:

> On Wed, 2 Nov 2011, Jeff Mckay wrote:
>
>> Thanks for your comments. I'm still a bit confused. Let me clarify what
>> I am seeing in these two examples. In the first, one of the characters
>> in question is "lower case o with acute" which is supposed to be xF3 in
>> ISO-8859-2 and xC3 xB3 in UTF-8. The imap server represents this as
>> ampersand followed by AMP followed by a dash (I am writing out the
>> description so it does not get interpreted incorrectly somewhere). If I
>> take the AMP and run it through a base64 decoder, I get xF3.
>>
>
> No, when you run APM (not AMP) through a base64 decoder you get *two*
> bytes, in hex 00 F3. This is the big-endian UTF-16 representation
> of "lower case o with acute".
>
>> In the second example, we have the letters Temp/New followed by a couple
>> Chinese characters that I don't know the names of. The two Chinese
>> characters are represented in imap by ampersand followed by bUuL1Q and
>> the closing dash. When I base64 decode this I end up with x6D x4B x8B
>> xD5. This appears to be big-endian UTF-16.
>>
>
> Yep. This is *exactly* what is specified by RFC 2152 ("UTF-7"), as
> modified by RFC 3501.
>
>> I have to byte-reverse each 2 byte sequence, but then I can convert it
>> to UTF8 (my target) and see the Chinese characters.
>>
>
> Uh, I think you mean you do a conversion from the UTF-16BE to the UTF-8
> that your display routines expect, right?
>
>> I could also take the original data and stick a + in front of it (ending
>> up with +bUuL1Q) and convert this from UTF7 to UTF8 and end up with
>> valid characters. This last part I really don't understand - if it is
>> base64 encoded, how is that valid UTF7?
>>
>
> Please go read RFC 2152 again. base64 encoding is a step in generating
> UTF-7 encoded text.
>
>> Anyway, I don't seem to have an algorithm that will work on both of
>> these examples, and no way to detect which one I should use. Obviously
>> I am totally confused about what I am doing, but any further insight
>> would be appreciated.
>>
>
> I think you lost track of the NUL byte in the first example, and from that
> ended up thinking a different conversion was necessary. The rules are
> consistent. For a given &.....- chunk:
>   1. strip & - delimiters
>   2. base64 decode the ..... part
>   3. convert that from UTF-16BE to whatever encoding you want to use
>
> Philip Guenther
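
The three steps Philip lists can be sketched in Python roughly as follows.
This is a minimal sketch, not a complete RFC 3501 implementation. The
function name decode_modified_utf7 and the full mailbox string
"Temp/New&bUuL1Q-" are illustrative assumptions (the thread gives only the
encoded chunks, not the exact mailbox name). The sketch also handles two
details from RFC 3501 section 5.1.3 that the thread does not mention: "&-"
stands for a literal ampersand, and the modified base64 alphabet uses ","
in place of "/" and omits "=" padding.

```python
import base64
import re


def decode_modified_utf7(name: str) -> str:
    """Decode an IMAP mailbox name from modified UTF-7 to a Python str.

    Hypothetical helper for illustration; assumes well-formed input.
    """

    def decode_chunk(match: re.Match) -> str:
        # Step 1: the regex has already stripped the & and - delimiters.
        payload = match.group(1)
        if payload == "":
            # "&-" encodes a literal ampersand (RFC 3501, not in the thread).
            return "&"
        # Step 2: base64 decode. Restore "/" and the padding that
        # modified base64 leaves out, so the standard decoder accepts it.
        b64 = payload.replace(",", "/")
        b64 += "=" * (-len(b64) % 4)
        raw = base64.b64decode(b64)
        # Step 3: the decoded bytes are UTF-16BE; convert to text.
        return raw.decode("utf-16-be")

    return re.sub(r"&([A-Za-z0-9+,]*)-", decode_chunk, name)


if __name__ == "__main__":
    # First example: APM decodes to bytes 00 F3, i.e. U+00F3.
    print(decode_modified_utf7("&APM-").encode("utf-8").hex())
    # Second example: bUuL1Q decodes to bytes 6D 4B 8B D5.
    print(decode_modified_utf7("Temp/New&bUuL1Q-"))
```

Both examples go through the same code path. The first one only looks
different because its decoded bytes begin with 00, which is easy to lose
when inspecting the base64 output by eye.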

