[Imap-protocol] [noob] fetch envelope charset?

Mark Crispin mrc+imap at panda.com
Sun Nov 20 09:34:59 PST 2011


Timo gave a very good, albeit brief answer. Here is a more detailed one.

On Sun, 20 Nov 2011, Petite Abeille wrote:

> Given a fetch envelope command, what character set encoding can the
> response be in?


The only IMAP texts which are not ASCII are BODY[] parts other than the
message header. Everything else is in ASCII. An extension may relax this
requirement; but in general all implementations must assume ASCII and must
certainly handle an ASCII-only IMAP world.

This means that UTF-8 personal names and message subjects are transmitted
as MIME encoded-words, as there is no other way to represent these in
ASCII.
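
As a rough illustration (not something the IMAP specification prescribes),
here is a minimal Python sketch of how non-ASCII text becomes an ASCII-only
MIME encoded-word and how a client turns it back into text. It assumes
Python's standard email.header module; the sample text and variable names
are made up for the example.

```python
from email.header import Header, decode_header, make_header

# Hypothetical sample text containing one non-ASCII character.
text = "Hóla!"

# Encode as a MIME encoded-word; the result contains only ASCII.
encoded = Header(text, "iso-8859-1").encode()
print(encoded)
print(encoded.isascii())

# A client that receives the encoded-word decodes it back into text.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```

The encoded form is what travels in header-derived IMAP data; the decoding
step is the client's job.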


> For example, say the subject is originally encoded as
> =?iso-8859-1?Q?H=F3la!?=, which is Hóla! in UTF-8.



> Does it have to be the original value (=?iso-8859-1?Q?H=F3la!?=)?


It may.


> Could it be the literal UTF-8 value (Hóla!)?


Not without an extension. You cannot assume that to be the case.


> Could it be the q-encoded UTF-8 value (=?UTF-8?Q?H=C3=B3la!?=)?


Possibly. IMAP does not prevent server implementations from transforming
the text of header fields into a canonical form. In fact, it encourages
this practice (e.g., taking multi-line subject fields with continuation
and rendering them as a single line) in servers.
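
To make the continuation-line case concrete, here is a small Python sketch
of header unfolding as defined by the email message format: each line break
that is immediately followed by a space or tab is removed. The function name
unfold and the sample subject are hypothetical, and what a given server
actually does when it canonicalizes is implementation-specific.

```python
import re


def unfold(header_value: str) -> str:
    """Remove every line break that is followed by a space or tab."""
    return re.sub(r"\r?\n(?=[ \t])", "", header_value)


# Hypothetical folded Subject value as it might appear in a stored message.
folded = "This is a long subject\r\n that continues on a second line"
print(unfold(folded))
```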

However, servers may NOT canonicalize BODY[] parts. Those MUST be as in
the message.


> Could it be the UTF-7 encoded UTF-8 value (H+APM-la+ACE-)?


It can only be that if it is encoded within a MIME encoded-word, e.g.
something like
=?UTF-7?Q?H+APM-la+ACE-?=
However, UTF-7 has been deprecated for many years and should not be used.
That example should demonstrate the pointlessness of using UTF-7.
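
For comparison, a short Python sketch that decodes the UTF-7 encoded-word
above next to the ISO-8859-1 form from the question. It assumes Python's
standard email.header module and built-in utf-7 codec; decode_word is a
hypothetical helper that only handles a header consisting of a single
encoded-word.

```python
from email.header import decode_header


def decode_word(word: str) -> str:
    """Decode a header value that consists of one MIME encoded-word."""
    raw, charset = decode_header(word)[0]
    return raw.decode(charset)


words = [
    "=?UTF-7?Q?H+APM-la+ACE-?=",
    "=?iso-8859-1?Q?H=F3la!?=",
]
for word in words:
    print(len(word), decode_word(word))
```

Both forms decode to the same text, and the UTF-7 one still needs the
encoded-word wrapper.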


> I'm really confused about what character set encoding IMAP is expecting
> and where. Is there perhaps a FAQ to complement the RFC that summarizes
> what applies where?


The IMAP specification, on page 5, says:

Characters are 7-bit US-ASCII unless otherwise specified.

Barring explicit text that allows a non-ASCII CHARSET (either in the base
specification or via some negotiated extension), this means that:

(1) ALL message and MIME header texts MUST be in ASCII. Any non-ASCII
characters must be represented using MIME encoded-words. This is a
requirement of the email header and MIME specifications. Usenet
netnews violated this rule, but Usenet is moribund.

(2) If, and ONLY if, the message body text is 8-bit non-ASCII in the
actual message, it is permitted for BODY[] fetches to transmit
8-bit in that character set. This is a specific exemption to what
is otherwise an ASCII-only rule.

(3) If the actual message body text is 7-bit with BASE64 or
QUOTED-PRINTABLE encoding of a non-ASCII character set, then a
BODY[] fetch MUST be transmitted in that form. The server MUST
NOT decode it into 8-bit for you or otherwise transform it from
the EXACT representation in the actual message.

The corollary to this is that if such transformation is desired,
it MUST be done by the mail delivery system (SMTP receiver) so
that the "actual message" is in this form to the IMAP server. At
that point, the IMAP server can send the transformed form under
(2) above.
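
Since the server sends body parts exactly as stored, undoing the transfer
encoding falls to the client. The following is a minimal Python sketch of
that client-side step, assuming the client already knows the part's transfer
encoding and charset (for example from a BODYSTRUCTURE fetch). The function
name decode_part and the sample bytes are hypothetical.

```python
import base64
import quopri


def decode_part(raw: bytes, transfer_encoding: str, charset: str) -> str:
    """Turn the bytes of a fetched body part into text."""
    enc = transfer_encoding.lower()
    if enc == "quoted-printable":
        raw = quopri.decodestring(raw)
    elif enc == "base64":
        raw = base64.b64decode(raw)
    # "7bit", "8bit" and "binary" need no transfer decoding.
    return raw.decode(charset)


# Case (3): 7-bit on the wire, decoded by the client.
print(decode_part(b"H=F3la!", "quoted-printable", "iso-8859-1"))
print(decode_part(b"SPNsYSE=", "base64", "iso-8859-1"))

# Case (2): the stored message is 8-bit and arrives as 8-bit.
print(decode_part("Hóla!".encode("iso-8859-1"), "8bit", "iso-8859-1"))
```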

Extensions to IMAP MAY relax these rules. Don't assume this unless both
client AND server agree to this extension, and both implementations comply
with the added rules of the extension.

To put it simply and more brutally:

All the email world is still based in ASCII.

We still have to do disgusting kludges to do non-ASCII text in
email.

It's taken 20 years, and progress to do non-ASCII in email still
moves at a snail's pace.

It's taken 15 years, and progress to standardize on UTF-8 and
abolish all other character sets (including ASCII) still moves at
a snail's pace.

There are reasons why this is the case. They are not good reasons.
Several unfortunate decisions, made 20 years ago (and bitterly
opposed by me and a few others...sadly we lost that debate),
created those reasons.

One of those decisions, in particular (and especially opposed by
me), made IMAP's ASCII-only rule necessary. Otherwise, IMAP could
have deployed UTF-8 as its one and only character set many years
ago.

This is not the place to discuss that history. The individuals
responsible are not in this group to be flamed over their idiocy
and be told "I told you so." Nor do I wish to be flamed over
decisions that were not mine, and especially not the consequences.

Life sucks.

Nobody is more unhappy about this situation than I am.

We do the best that we can, given what we have.

-- Mark --

http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.

