Case study: How to make an open e with a tilde over it

How to make an open e with a tilde over it (ɛ̃), or any other letter with (almost) any diacritical mark

I was once asked how to produce the letter open e with a tilde over it. ЮNIK works well with «predefined» Unicode letters. But while the open e (ɛ) is already defined, there is no predefined open e with a tilde (ɛ̃) yet. What to do then?

The solution is to use combining diacritical marks. These are Unicode characters defined in the range from U+0300 to U+036F (http://unicode.org/charts/PDF/U0300.pdf; the chart also contains references to diacritical marks in other parts of Unicode). Here we need the so-called combining tilde ̃ (U+0303). You put the combining tilde directly after the letter to be modified. To do so, you can use ЮNIK or any other method of entering Unicode characters.
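As a minimal sketch of this rule, the following Python snippet builds the open e with a tilde from its two parts. The variable names are chosen for illustration only; the code points are those of the open e (U+025B) and of the combining tilde (U+0303).

```python
import unicodedata

OPEN_E = "\u025B"           # LATIN SMALL LETTER OPEN E
COMBINING_TILDE = "\u0303"  # the combining tilde from the text

# The mark goes directly after the letter it modifies.
open_e_tilde = OPEN_E + COMBINING_TILDE

# The result is still two code points, even though capable
# software draws it as a single glyph.
print(len(open_e_tilde))
for ch in open_e_tilde:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

The same concatenation works for any other base letter and any other mark from the U+0300 to U+036F range.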

This method works not only with the combining tilde but also with other diacritical marks, both those familiar from widely used languages, like ̈ or ̂ , and ones that are exotic, at least to non-linguists, like ̐ or ̜ . Combining diacritical marks are a very flexible and powerful mechanism: you can put the tilde and many other marks over or below any character.

There are limits as well. In general, whether and to what extent it works depends on the particular file format, the particular application, or both. For example, one and the same mail program can send mail in various formats; in some you'll get what you want, in others not. Many important applications work quite well, e.g. Open/LibreOffice, MS Office 2010, Firefox, Opera and IE 8, at least under Windows 7.

Diacritical marks in a non-UTF-8 HTML file

By default, using Unicode means saving HTML files in UTF-8. For some reason you might not want to do so. For example, you have a very long text in Russian, where every Cyrillic character takes two bytes in UTF-8 instead of the one byte it takes in Windows-1251. Your long text also contains two or maybe five diacritical marks. It would be wasteful to almost double the file size just to show them. HTML has long provided a way to write any character using ASCII characters only. While editing the HTML source, you simply enter the character's decimal Unicode number in a special pattern: &#<code>;, or, if you use the hexadecimal number: &#x<code>;. The combining tilde (U+0303) then looks like this: &#x303;, and the open e with a tilde over it: &#x25B;&#x303;.
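If the HTML source is generated by a program rather than typed by hand, the same substitution can be automated. The sketch below assumes Python and its built-in "xmlcharrefreplace" error handler, which writes decimal references; the function names and the sample sentence are made up for illustration.

```python
def to_cp1251_html(text: str) -> bytes:
    """Encode text as Windows-1251; characters that the code page
    lacks are written as decimal references of the form &#NNN;."""
    return text.encode("cp1251", errors="xmlcharrefreplace")


def hex_reference(ch: str) -> str:
    """Hexadecimal reference for one character, e.g. &#x303;."""
    return f"&#x{ord(ch):X};"


# Hypothetical sample: Russian text followed by open e + combining tilde.
sample = "Русский текст и \u025B\u0303"
encoded = to_cp1251_html(sample)

print(encoded)                        # Cyrillic as single bytes, then two references
print(len(encoded), "bytes as Windows-1251 with references")
print(len(sample.encode("utf-8")), "bytes as UTF-8")
print(hex_reference("\u0303"))
```

The saving only pays off for long texts with few such characters, since every reference costs several bytes. The page must of course still declare Windows-1251 as its encoding.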

You can stress any letter

Stress marks are not a problem as long as you use Latin letters. For every Latin vowel there is a version with an acute accent: á, é, í, ó, ú, ý. But if you needed to stress letters of another alphabet, such as Cyrillic, you used to run into trouble. Combining diacritical marks solve this problem: а́, е́, и́, о́, у́, ы́, э́, ю́, я́.
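A small sketch of how stressing a letter could be automated, assuming the combining acute accent (U+0301) is used as the stress mark. The helper name stress and the sample word are hypothetical, and positions are counted in code points, so the input word should not already contain combining marks.

```python
COMBINING_ACUTE = "\u0301"  # combining acute accent


def stress(word: str, position: int) -> str:
    """Return word with a stress mark on the letter at the
    0-based index position (counted in code points)."""
    if not 0 <= position < len(word):
        raise IndexError("position is outside the word")
    # The mark is inserted directly after the letter it belongs to.
    return word[: position + 1] + COMBINING_ACUTE + word[position + 1 :]


stressed = stress("молоко", 5)  # stress on the last letter
print([f"U+{ord(c):04X}" for c in stressed])
```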

A bit of computing history: A tilde is not equal to a tilde

The technique described above may seem a little mysterious, because you can also produce a tilde in another way. If you switch to the French (or any other, in this sense similar) keyboard layout, you can use so-called dead keys to make Latin letters with diacritical marks. First you press a dead key; for a tilde this is the combination AltGr (Right Alt) + É/2. At this moment nothing happens: no new character appears and the caret doesn't move. This is why such keys are called dead. Then you press the key A, O or N. As a result, one letter appears (ã, õ or ñ) and the caret moves one position forward. This method works well with the additional Latin letters of French or Spanish. But you can't make ỹ with it, even though ỹ is a Unicode letter (capital: U+1EF8, small: U+1EF9).

From the user's point of view, there are two different methods of making letters with diacritical marks. The first method, dead keys, was introduced many years ago in a 16-bit version of Windows. To follow the explanation below, you need to know that in Windows characters are transported by so-called messages, which are sent to or from an application (or within it). When you type on the keyboard, the active application, e.g. a word processor, receives messages carrying the letters. It usually processes one message of type WM_CHAR per letter. In 16-bit Windows such a message could only carry a one-byte value containing a character. Therefore Windows used sets of one-byte characters at that time, known as Windows/ANSI code pages. One byte can distinguish only 256 different characters. For this reason you had to switch between code pages if you wanted more: Windows-1250, 1251, 1252, etc.
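The effect of one-byte code pages can be reproduced with Python's codecs "cp1250", "cp1251" and "cp1252", which correspond to the Windows code pages named above. This is only an illustration: the helper fits_in_code_page is hypothetical, and the byte 0xF1 was picked as an arbitrary example.

```python
SAMPLE_BYTE = bytes([0xF1])

# One and the same byte stands for a different letter in each code page.
for code_page in ("cp1250", "cp1251", "cp1252"):
    letter = SAMPLE_BYTE.decode(code_page)
    print(code_page, "0xF1 ->", f"U+{ord(letter):04X}")


def fits_in_code_page(ch: str, code_page: str) -> bool:
    """True if the code page has a single byte for this character."""
    try:
        return len(ch.encode(code_page)) == 1
    except UnicodeEncodeError:
        return False


print(fits_in_code_page("\u00F1", "cp1252"))  # n with tilde
print(fits_in_code_page("\u1EF9", "cp1252"))  # y with tilde
```

The second check fails because Windows-1252 has no byte for y with a tilde, which fits the observation that the dead key method cannot produce it.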

The tilde character was already defined with the code 0x7E. It could (and still can) be typed as an ordinary character. In this case the application receives a WM_CHAR message carrying the tilde and knows that it is a normal character to be shown. But using the tilde as a dead key leads to a different result, so the application has to know how to interpret the tilde: as an ordinary tilde or as something special. To solve this problem Microsoft introduced a special message type, WM_DEADCHAR. This message is sent to the application when you type AltGr + É/2 on a French keyboard, or in similar situations. The message still carries the ordinary tilde, a 1-byte character with the code 0x7E. But when the application receives a WM_DEADCHAR message, it knows that it has to treat the tilde differently: not simply showing it (which it may still do in order to signal that something is happening), but first waiting for the next message of type WM_CHAR. The operating system generates that message if the combination of the tilde (as a dead key) and the letter is a known one, like ñ. This method works not only with the tilde but also with some other diacritical marks, like the acute or the grave accent. It is limited to letters already in use at that time. For this reason you can't use it to make ỹ, let alone more complicated symbols.
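The message sequence can be imitated with a toy model in Python. This is not the Windows API: the event kinds "dead" and "char" merely stand in for WM_DEADCHAR and WM_CHAR, the composition table is a small hypothetical excerpt, and the roles of the operating system and the application are merged into one function. The fallback for unknown pairs (the mark followed by the letter) is an assumption of this sketch.

```python
# Hypothetical excerpt of a table of known combinations:
# (dead key character, base letter) -> predefined letter.
COMPOSITIONS = {
    ("~", "a"): "\u00E3",  # a with tilde
    ("~", "o"): "\u00F5",  # o with tilde
    ("~", "n"): "\u00F1",  # n with tilde
    ("`", "a"): "\u00E0",  # a with grave accent
}


def type_keys(events):
    """Turn a list of (kind, character) pairs into text.

    kind is "dead" (stand-in for WM_DEADCHAR) or "char"
    (stand-in for WM_CHAR).
    """
    output = []
    pending = None  # dead key waiting for its base letter
    for kind, ch in events:
        if kind == "dead":
            pending = ch  # nothing is shown yet, the caret does not move
            continue
        if pending is not None:
            composed = COMPOSITIONS.get((pending, ch))
            # Assumed fallback: unknown pairs yield mark plus letter.
            output.append(composed if composed else pending + ch)
            pending = None
        else:
            output.append(ch)
    return "".join(output)


known = type_keys([("char", "a"), ("dead", "~"), ("char", "n")])
unknown = type_keys([("dead", "~"), ("char", "y")])
print([f"U+{ord(c):04X}" for c in known])
print(unknown)
```

The model shows the limitation described above: a combination that is missing from the table never becomes a single letter.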

The two message types were needed not only because of the limited capacity of Windows messages at that time, but also for another reason: Unicode didn't exist yet. There was only one tilde character, which had to serve different purposes: as a normal character and as a diacritical mark.

Nowadays neither limit applies. Unicode defines several variants of many symbols. For example, there are at least three different tildes: the usual tilde, the combining tilde and the small tilde. So you no longer need the WM_DEADCHAR message, or the dead key method, at all. All diacritical marks are available as «normal» Unicode characters, which can be carried by normal WM_CHAR messages, now capable of transporting 2-byte values.
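The three tildes can be told apart by their code points. The following sketch uses Python's unicodedata module to print the name, general category and combining class of each; the list name TILDES is chosen for illustration.

```python
import unicodedata

# The usual tilde, the combining tilde and the small tilde.
TILDES = ["\u007E", "\u0303", "\u02DC"]

for ch in TILDES:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )
```

Only the combining tilde has a non-zero combining class, which tells software to attach it to the preceding character.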

Some more technical background: pure input method vs. stream of characters

The dead key method and the method using combining diacritical marks work in different ways. The first is a pure input technique. As such it is part of a particular operating system, in this case Windows. It need not work in other operating systems like Linux or Unix. If you use this method, the data segment or file receives the code of a single letter with a diacritical mark. This letter must be predefined. You can't make combinations which aren't defined yet.
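Which combinations are predefined can be checked with Unicode normalization, a mechanism the text does not mention but which illustrates the point. In the sketch below, the NFC form replaces a letter plus mark with a predefined letter wherever one exists; the helper names are hypothetical.

```python
import unicodedata

COMBINING_TILDE = "\u0303"


def compose(base: str, mark: str) -> str:
    """NFC form of base + mark: a single predefined letter if
    Unicode has one, otherwise the unchanged sequence."""
    return unicodedata.normalize("NFC", base + mark)


def code_points(text: str) -> list:
    return [f"U+{ord(c):04X}" for c in text]


for base in ("n", "y", "\u025B"):
    print(code_points(base), "->", code_points(compose(base, COMBINING_TILDE)))
# ['U+006E'] -> ['U+00F1']
# ['U+0079'] -> ['U+1EF9']
# ['U+025B'] -> ['U+025B', 'U+0303']
```

The open e keeps its separate tilde because no predefined letter exists for this pair.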

The second method puts the focus elsewhere. The input itself works as usual: you type one character after another. If you save the file as plain text, it contains nothing but Unicode characters, normal letters as well as diacritical marks. It is the application's job to recognize characters that have to be processed in a special way. The application may be supported in this by the operating system (see Microsoft's Uniscribe APIs) or not. If a suitable application reads a file in order to show it on the screen and encounters a combining mark, it places this mark over or below the preceding letter. Such an application can run on any operating system, and the file can be in any standard format: text, HTML, ODT, etc. Since Unix/Linux treat keyboard input as a file, the platform independence of the second method is even stronger: in both cases you have a stream of characters. The most important difference from the original Microsoft method is that you can make any combination of a letter and a diacritical mark, including those that aren't defined or can't be processed yet. Such a combination may become meaningful later without any change to the saved file.
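What an application has to do with such a stream can be sketched in a few lines of Python. The function group_marks is a hypothetical, simplified stand-in for real text segmentation (which follows more detailed Unicode rules): it attaches every character with a non-zero combining class to the letter before it.

```python
import unicodedata


def group_marks(text: str) -> list:
    """Split a stream of characters into units of one base
    character plus the combining marks that follow it."""
    clusters = []
    for ch in text:
        if unicodedata.combining(ch) and clusters:
            clusters[-1] += ch   # the mark belongs to the preceding letter
        else:
            clusters.append(ch)  # an ordinary character starts a new unit
    return clusters


stream = "\u025B\u0303a\u0301b"  # open e + tilde, a + acute accent, b
for cluster in group_marks(stream):
    print([f"U+{ord(c):04X}" for c in cluster])
```

The function does not need to know in advance which combinations exist, so it handles the open e with a tilde the same way as any other pair.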

A. Rumyantsev, 2012