[Imap-protocol] Cyrus and RFC5255
mrc+imap at panda.com
Tue Nov 1 09:33:40 PDT 2011
On Tue, 1 Nov 2011, Bron Gondwana wrote:
>> RFC 5255 explicitly requires that you apply i;unicode-casemap in searches
>> as part of level 1 compliance.
> The response when I mentioned it to our project manager was "it's often nice
> not to worry about a vs å when searching - and have it find both".
i;unicode-casemap is designed to be a simple collator/comparator that even
a baby programmer can implement correctly. It is not intended to be
something that people can fork off into all sorts of random non-interoperable
variants.
It also formalized, and moderately amended, what Cyrus has done from its
inception in searching Unicode strings.
You will probably need to define a different comparator for that purpose
(e.g., i;unicode-casemap-ignore-diacriticals). Beyond that, you will
quickly find yourself in a swamp filled with alligators (or crocodiles if
you prefer). Even the modest step of an "ignore-diacriticals" comparator
will get you wet above the knee.
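To make the "wet above the knee" point concrete, here is a minimal sketch of what a naive ignore-diacriticals comparison might look like. Everything in it is an assumption for illustration: the function names are made up, no such comparator is registered, and it uses Python's str.casefold() rather than the titlecase mapping that i;unicode-casemap specifies. It decomposes each string, drops combining marks, and compares what is left.

```python
import unicodedata


def strip_diacritics_key(text: str) -> str:
    """Build a comparison key: decompose, drop combining marks, fold case.

    Hypothetical helper, not part of any RFC or registered comparator.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    # combining() returns a non-zero class for combining marks such as U+030A.
    base = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    return base.casefold()


def ignore_diacritics_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case and decomposable diacritics."""
    return strip_diacritics_key(a) == strip_diacritics_key(b)


if __name__ == "__main__":
    print(ignore_diacritics_equal("a", "å"))   # å decomposes to a + ring above
    print(ignore_diacritics_equal("o", "ø"))   # ø has no decomposition mapping
```

The first comparison matches and the second does not, because ø is encoded as a letter in its own right with no decomposition. A user who expects "a finds å" will likely expect "o finds ø" as well, and this approach alone cannot deliver that, before locale even enters the picture.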
If you want to get into the type of matching you are talking about, you
will wind up needing to do a full-fledged implementation of i18n collation
and comparison, which more likely than not includes locale sensitivity.
This is not something to be half-assed or hackish on. There are standards
and rules; and in some cases these are enforced in national laws.
I strongly urge you, BEFORE embarking upon such a project, to get involved
with the various groups involved with i18n collation and comparison and
seek their advice.
I did not do i;unicode-casemap in a vacuum; I sought their advice and
after their screams of anguished horror, these guys gave good advice which
I took seriously and acted upon. One of the things that was important to
them was that, while (reluctantly) accepting the "we need something that
even a baby programmer can implement", they wanted to draw the line and
say "do this, or do it right."
With this said, I don't particularly object to ignore-diacriticals
searching; but I also note that the concept is locale-dependent. In some
languages, the diacritical form indicates accent or sound; in others it is
a completely unrelated character (and the latter group already is
infuriated by i;unicode-casemap).
CJK is another part of the swamp. For example, U+5FB3 徳 and U+5FB7 德
are fundamentally the same character; they have the same meaning and
differ only by an added stroke in the Chinese/Korean form that the
Japanese form lacks. Yet at least one Chinese character set has both
forms. Adult CJK native speakers would say that the two should match in
search; and many would have to have that one stroke difference pointed out
to them before they'd notice it.
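A short sketch of why this case is not solved by Unicode normalization: both forms are ordinary unified ideographs with no decomposition mapping, so NFKC leaves them distinct. Matching them needs a separate variant table. The table below is a hypothetical one-entry stand-in invented for this example; a real one would have to come from the language folks, not from a programmer's guess.

```python
import unicodedata

JAPANESE_FORM = "徳"
TRADITIONAL_FORM = "德"

# Hypothetical variant table with a single entry, for illustration only.
VARIANT_FOLD = {JAPANESE_FORM: TRADITIONAL_FORM}


def normalized_equal(a: str, b: str) -> bool:
    """Compare after NFKC normalization only."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


def fold_variants(text: str) -> str:
    """Map each character through the variant table, leaving others as-is."""
    return "".join(VARIANT_FOLD.get(ch, ch) for ch in text)


def variant_equal(a: str, b: str) -> bool:
    """Compare after folding known variants to one representative form."""
    return fold_variants(a) == fold_variants(b)


if __name__ == "__main__":
    print(normalized_equal(JAPANESE_FORM, TRADITIONAL_FORM))  # normalization alone
    print(variant_equal(JAPANESE_FORM, TRADITIONAL_FORM))     # with the table
```

Note that folding both sides to one representative makes the match symmetric by construction, so a table like this cannot express the one-way equivalences that turn up elsewhere in CJK.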
But that's just a simple case. CJK is full of these, and most are far more
complicated. There are lots of cases where the equivalency is one way;
that is, A is equivalent to B, but B is NOT equivalent to A (or worse is
SOMETIMES equivalent to A). At this point, the swamp reptiles are over
The bottom line is that, whatever you do, seek the advice of the language
folks. Your implementation will have to be tempered by realism; but at
least you can avoid a mistake. Undoing a mistake is far more costly.
-- Mark --
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.