Much to my surprise, I just learned that some English-language documents use the ö character. I need to know, when sorting words in an English-language document, where is ö placed?
- before A?
- after Z ?
- before O ?
- after O ?
Also, I am curious about whether there is a capital ö. If there is, then the justification for ö would not exclusively be to break diphthongs. Why else use ö?
TL;DR: Ignore diacritics when sorting English — except to break ties.
When sorting English text (though not the text of various other languages), letters with and without diacritics are not treated as different unless a tie-break is needed because the entries are otherwise letter-for-letter identical. When two entries differ only in their diacritics, as with resume versus résumé, the one with diacritics follows the one without them.
This means that should an Ö occur in English text, it is to be treated just like a regular O unless there is also a version without the diacritic, as might occur with coöperate and cooperate or co-operate. Those three all have the same letters in them as far as English is concerned, differing only in non-letters.
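As a rough illustration of that rule, here is a minimal Python sketch using only the standard library. The function name `english_key` is my own, and the tie-break is a simplification (it only records whether any diacritic is present); it is not the full Unicode Collation Algorithm.

```python
import unicodedata

def english_key(word):
    """Two-level key: base letters first, diacritics only as a tie-breaker."""
    decomposed = unicodedata.normalize("NFD", word)
    # Primary level: letters only; combining marks, hyphens, and spaces drop out.
    primary = "".join(ch for ch in decomposed if ch.isalpha()).casefold()
    # Secondary level: an entry with no diacritics (False) precedes one with them (True).
    has_marks = any(unicodedata.combining(ch) for ch in decomposed)
    return (primary, has_marks)

words = ["résumé", "zebra", "coöperate", "resume", "cooperate", "co-operate", "apple"]
print(sorted(words, key=english_key))
# ['apple', 'cooperate', 'co-operate', 'coöperate', 'resume', 'résumé', 'zebra']
```

Because "cooperate" and "co-operate" get identical keys here, their relative order simply follows the input order; a fuller implementation would add a further level to separate them.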
Note that only letters are supposed to be considered when sorting text. One ignores non-letters such as spaces, hyphens, dashes, or full stops, when sorting English text. That means that the five imagined book-titles below are correctly sorted as text in this fashion:
Little Red, More Blue
Little Red Mushrooms
Little, Red Rider
Little Red Riding Hood
Little Red Tent
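To check that ordering mechanically, a small Python sketch can compare on letters alone. The helper name `letters_only_key` is hypothetical, and the sketch ignores the diacritic tie-break described earlier.

```python
def letters_only_key(title):
    """Keep only the letters, lowercased, so spaces and punctuation are ignored."""
    return "".join(ch for ch in title if ch.isalpha()).casefold()

titles = [
    "Little Red Tent",
    "Little, Red Rider",
    "Little Red Mushrooms",
    "Little Red, More Blue",
    "Little Red Riding Hood",
]

for title in sorted(titles, key=letters_only_key):
    print(title)
```

A plain `sorted(titles)` compares the space and comma characters too, so it yields a different order from the letters-only one.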
Diacritics are rare in English words, occurring only in loanwords or words marked for poetic meter, as in learnèd or Faërie. Being so rare in the first place, the ordering of different diacritics against each other (like é vs. è vs. ê vs. ë) may not be especially well defined in older dictionaries. The need for distinguishing those is rare enough that I haven’t yet found a single example of this in the OED.
In modern sorting of text, the default ordering of diacritics is provided by the Unicode Collation Algorithm’s DUCET (the Default Unicode Collation Element Table). However, that default is routinely tailored per locale. English needs no tailoring because the default works just fine for us unaltered, but many other languages do need it.
For example, French, Spanish, and German phonebook ordering all have their own overrides in this regard. An English dictionary would place piñata before Pinochet. A Spanish dictionary, however, would swap them and list Pinochet before piñata. This is because, unlike vowels with accent marks, n and ñ are considered completely separate letters in Spanish, collating in that order, whereas in English they are not.
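The piñata and Pinochet contrast can be sketched in Python with two primary-level keys. Everything here is illustrative: `strip_marks`, `english_primary`, and `spanish_primary` are names I made up, and real software would normally rely on a locale-tailored collator rather than hand-rolled keys.

```python
import unicodedata

SPANISH_ALPHABET = "abcdefghijklmnñopqrstuvwxyz"
SPANISH_RANK = {letter: i for i, letter in enumerate(SPANISH_ALPHABET)}

def strip_marks(text, keep=""):
    """Lowercase and remove diacritics, except on characters listed in `keep`."""
    out = []
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.extend(c for c in decomposed if not unicodedata.combining(c))
    return "".join(out)

def english_primary(word):
    # English: ñ is just an n carrying a diacritic.
    return strip_marks(word)

def spanish_primary(word):
    # Spanish: ñ is a separate letter ranked right after n.
    return [SPANISH_RANK[ch] for ch in strip_marks(word, keep="ñ") if ch in SPANISH_RANK]

pair = ["Pinochet", "piñata"]
print(sorted(pair, key=english_primary))  # ['piñata', 'Pinochet']
print(sorted(pair, key=spanish_primary))  # ['Pinochet', 'piñata']
```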
In a German phonebook, on the other hand, someone whose surname is Föhn is sorted exactly as though the name were spelled Foehn, with forenames breaking any ties.
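A minimal Python sketch of that phonebook convention follows. The forenames and the surname Fohn are invented for the example, and `phonebook_key` is a hypothetical helper covering only the umlaut expansions, not every detail of German phonebook ordering.

```python
import unicodedata

# Phonebook convention: an umlauted vowel sorts as the vowel followed by e.
UMLAUT_EXPANSIONS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})

def phonebook_key(entry):
    """Sort key for a (surname, forename) pair; the forename breaks surname ties."""
    surname, forename = entry

    def fold(name):
        # NFC first so a decomposed o + diaeresis becomes the single character ö.
        return unicodedata.normalize("NFC", name).casefold().translate(UMLAUT_EXPANSIONS)

    return (fold(surname), fold(forename))

entries = [("Föhn", "Ute"), ("Fohn", "Karl"), ("Foehn", "Zoe"), ("Foehn", "Anna")]
print(sorted(entries, key=phonebook_key))
# [('Foehn', 'Anna'), ('Föhn', 'Ute'), ('Foehn', 'Zoe'), ('Fohn', 'Karl')]
```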
Modern English has no such special needs apart from the unusual case of sorting people’s names, although words spelled with the typographic ligatures æ and œ must be treated as if spelled with two characters instead of one, that is, as though those were ae and oe respectively. That means that Cæsar sorts as though it were Caesar and œuvre as though it were oeuvre.
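In code, the ligature rule amounts to expanding æ and œ before comparing. This is a small Python sketch with a made-up word list; the explicit table is used because, as far as I know, Unicode normalization does not decompose these two characters on its own.

```python
LIGATURES = str.maketrans({"æ": "ae", "œ": "oe", "Æ": "AE", "Œ": "OE"})

def ligature_key(word):
    """Expand the typographic ligatures, then compare case-insensitively."""
    return word.translate(LIGATURES).casefold()

words = ["œuvre", "odd", "office", "Cæsar", "Cadiz", "Cairo"]
print(sorted(words, key=ligature_key))
# ['Cadiz', 'Cæsar', 'Cairo', 'odd', 'œuvre', 'office']
```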
Languages Beyond Modern English
That’s for Modern English. Old English had its own alphabet, which at least at one point was ordered this way:
A, Æ, B, C, D, Ð, E, F, Ᵹ/G, H, I, L, M, N, O, P, R, S, T, Þ, U, Ƿ/W, X, Y
Notice how in Old English just like in today’s Icelandic, Æ was a real letter in its own right falling between A and B. It was not just a typographic ligature as in Cæsar; it’s an actual lexical ligature, which means it’s its own letter, not just a typesetting matter. So you really have to know the language of the text you are sorting to know what rules to apply: Modern English equates æ to ae, but Old English did not, while German phonebooks equate ä to ae. Sorting can be hard if you let it be.
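To make the difference concrete, here is a Python sketch that ranks letters by the Old English order listed above and contrasts it with the Modern English treatment of æ. The three sample words and both helper names are my own choices, and characters outside the listed alphabet are simply skipped, which is only one of several possible policies.

```python
OLD_ENGLISH_ORDER = "aæbcdðefghilmnoprstþuwxy"
OE_RANK = {letter: i for i, letter in enumerate(OLD_ENGLISH_ORDER)}

# Treat insular g and wynn as variants of g and w, per the Ᵹ/G and Ƿ/W pairs above.
VARIANTS = str.maketrans({"ᵹ": "g", "ƿ": "w"})

def old_english_key(word):
    """Rank each letter by the Old English alphabet; æ is its own letter after a."""
    folded = word.casefold().translate(VARIANTS)
    return [OE_RANK[ch] for ch in folded if ch in OE_RANK]

def modern_english_key(word):
    """Modern English rule: æ is only a ligature for ae."""
    return word.casefold().replace("æ", "ae")

words = ["bat", "æsc", "an"]
print(sorted(words, key=old_english_key))     # ['an', 'æsc', 'bat']
print(sorted(words, key=modern_english_key))  # ['æsc', 'an', 'bat']
```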
Sometimes questions arise about what to do with letters outside a normal alphabet, which in Old English could have arisen if there had been some Greek loanword in it with a K or a Z, or a word with an “and” symbol like & (the “et” scribal abbreviation) or ⁊ (the Tironian “et”, still used in Irish).
What to do here really depends on your goal in the collation. Analphabetics like numbers and punctuation do not count for dictionary or library sorts, but you still have to do something with them; otherwise you won’t get a deterministic result. Unicode does provide an ordering for these, which is normally consulted only when all the alphabetics are the same and the strings differ only in their analphabetics. That ordering does not necessarily “make sense”, though, because there is no agreed-upon ordering for non-numeric analphabetics that everyone can just rattle off.
As for non-Latin letters, this really depends on what you are sorting. The OED, for example, transliterates the rare Greek letters that may appear in headwords into corresponding Latin ones for sorting purposes. That way β-nornicotine is sorted as though that Greek beta were spelled with a Latin B instead; it does not count as “beta”, just as “b”.
Sorting of Names
Personal names are special in many languages, not just in German, and this includes English. The sorting of English books by the author’s name, for example, has a special rule for where to place surnames beginning with Mc- or Mac- when, and only when, those are patronymic prefixes. They go after the L-authored books and before any other M-authored books, with Mc and Mac counting as identical. This is because you should not have to know which way somebody spelled it.
So for example, here are some hypothetical names sorted under this special “patronymic” rule:
Larry Ladd, Louis L’Amour, Douglas MacArthur,
The problem with that is that you do have to distinguish the surnames that start with Mc- or Mac- because they were originally patronymic from the ones that do so for reasons other than patronymics, which is why Macedon and Macey aren’t in the Mc/Mac set that follows L and precedes M. See Wikipedia’s list of Scottish writers for more examples of this peculiar name-sorting strategy.
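The patronymic rule could be sketched in Python along these lines. Since a program cannot tell from spelling alone whether a Mc/Mac prefix is patronymic, the sketch assumes the caller supplies that set by hand; the surnames McBride and Mabry are hypothetical additions for the demonstration, and `surname_key` is my own name.

```python
import re

def surname_key(surname, patronymic_surnames):
    """Place patronymic Mc/Mac surnames after all L names and before other M names."""
    letters = "".join(ch for ch in surname if ch.isalpha()).casefold()
    if surname in patronymic_surnames:
        # Mc and Mac count as identical, so spell both as "mac".
        return (1, re.sub(r"^ma?c", "mac", letters))
    # Bucket 0: surnames in A through L; bucket 2: ordinary M surnames and later.
    return (0 if letters < "m" else 2, letters)

# Assumption: this set is curated by a person, not derived from the spelling.
PATRONYMIC = {"MacArthur", "McBride"}

surnames = ["Macey", "MacArthur", "Ladd", "McBride", "L’Amour", "Macedon", "Mabry"]
print(sorted(surnames, key=lambda name: surname_key(name, PATRONYMIC)))
# ['Ladd', 'L’Amour', 'MacArthur', 'McBride', 'Mabry', 'Macedon', 'Macey']
```

Note that Mabry lands after the Mc/Mac group even though it would precede MacArthur in a plain alphabetical sort.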
For how to sort text in various non-English languages, see also:
- the Icelandic alphabet
- the Swedish alphabet
- the Spanish alphabet
- the Welsh alphabet
- the Irish alphabet
- the Hungarian alphabet
- the German alphabet
Some of those languages count certain digraphs and trigraphs as single letters for alphabetic sorting. An obvious case is that before the reform of 1994, Spanish counted CH as its own letter after C and before D, causing chocolate to come after color, not before it. Although this was “officially” altered for Spanish, many other languages still have digraphs (or worse) that are expected to sort as their own letters.
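Digraph-as-letter sorting can be sketched in Python by splitting each word into alphabet units before ranking them. The list below is my reconstruction of the traditional Spanish order with CH and LL as letters, and the sketch assumes lowercase-able input without accented vowels (any other character raises a `KeyError`).

```python
import re

TRADITIONAL_SPANISH = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {unit: i for i, unit in enumerate(TRADITIONAL_SPANISH)}

# Match the digraphs first so that "ch" is one unit rather than "c" followed by "h".
UNIT = re.compile("ch|ll|.")

def traditional_spanish_key(word):
    """Rank each alphabet unit, treating ch and ll as single letters."""
    return [RANK[unit] for unit in UNIT.findall(word.casefold())]

words = ["chocolate", "color", "dedo", "cama"]
print(sorted(words, key=traditional_spanish_key))  # ['cama', 'color', 'chocolate', 'dedo']
print(sorted(words))                               # ['cama', 'chocolate', 'color', 'dedo']
```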
English, fortunately, has no such digraph issue the way, say, Welsh still does, and so sorting English text is mostly a straightforward affair.