Character encoding on remote connections – strange accents
When files are moved between different operating systems, or stored in a common file system such as AFS, you may sometimes find that characters such as ÅÄÖ are shown incorrectly.
A character encoding determines which binary sequence is used to represent each letter or other character. Many different encodings have been used over the years. CSC's Unix systems have traditionally used “Latin-1” (ISO-8859-1), which contains the letters used in western European languages. Other operating systems have used other encodings, e.g. “Mac Roman” on Mac OS, “CP-1252” on MS Windows, or “CP-437” on MS DOS. All of these are extensions of ASCII (basically, American letters, digits and punctuation), which means that ASCII characters are displayed correctly whichever of them is used. Accented letters, however, are encoded differently. In particular, the Swedish letters ÅÄÖ are not displayed correctly when the wrong encoding is assumed.
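As a rough illustration, the following shell sketch prints the bytes used for the letter Å in three encodings. It assumes that iconv and od are installed and that your iconv recognises the encoding names shown (the spellings vary between systems). The helper name bytes_of is made up for this example.

```shell
# Hypothetical helper: show the bytes of "Å" in a given encoding.
# The input is written as octal escapes for the UTF-8 bytes (C3 85),
# so the result does not depend on the encoding of this script.
bytes_of() {
    printf '\303\205' | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
}

echo "UTF-8:      $(bytes_of UTF-8)"
echo "ISO-8859-1: $(bytes_of ISO-8859-1)"
echo "CP437:      $(bytes_of CP437)"
```

UTF-8 needs two bytes (c3 85) where latin1 needs one (c5), and CP-437 uses a different single byte (8f). This is why the same file looks different depending on which encoding the viewer assumes.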
These days, most operating systems can use UTF-8, but you may need to configure your applications to use it. To do so you choose a locale, which defines many settings specific to a language and region, for example:
- Number formatting (e.g. using “1 234,5” or “1,234.5”)
- Date and time formatting
- String collation (i.e. sort order, so that “ångström” is sorted under A in English but Å in Swedish)
The locale is written as «language»_«variant».«encoding», e.g. “en_US.UTF-8” (American English, UTF-8) or “en_GB.ISO8859-1” (British English, latin-1).
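The naming scheme can be taken apart with ordinary shell parameter expansion. This is only a sketch: the function name split_locale is invented here, and it assumes the full three-part form shown above (names such as “C” or “sv_SE” without an encoding are not handled).

```shell
# Hypothetical helper: split a locale name into its three parts.
# Assumes the full language_variant.encoding form.
split_locale() {
    name=$1
    encoding=${name#*.}      # text after the first "."
    rest=${name%%.*}         # text before the first "."
    language=${rest%%_*}     # text before the first "_"
    variant=${rest#*_}       # text after the first "_"
    echo "language=$language variant=$variant encoding=$encoding"
}

split_locale en_US.UTF-8
split_locale en_GB.ISO8859-1
```

To see which locale names are actually installed on a system, run locale -a.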
Wikipedia's explanation of latin1 (external link)
Wikipedia's explanation of locales (external link)
Converting a file
To convert the contents of a file, you can open it in a locale-aware editor and “save as...” a different encoding, or use the iconv command-line tool:
iconv -f iso8859-1 -t utf-8 < original.txt > new.txt
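Running the conversion on a file that is already UTF-8 encodes the accented letters a second time. The sketch below guards against that by first checking whether the file is valid UTF-8. The check is a heuristic (a latin1 file can be valid UTF-8 by coincidence, although this is rare in practice), and the function name and the sample file are made up for this example.

```shell
# Hypothetical helper: print a latin1 file as UTF-8, but leave it
# untouched if it is already valid UTF-8 (plain ASCII counts as valid).
latin1_to_utf8() {
    if iconv -f utf-8 -t utf-8 < "$1" > /dev/null 2>&1; then
        cat "$1"
    else
        iconv -f iso8859-1 -t utf-8 < "$1"
    fi
}

# Sample input: "smörgås" in latin1, written with octal escapes.
tmp=$(mktemp -d)
printf 'sm\366rg\345s\n' > "$tmp/original.txt"

latin1_to_utf8 "$tmp/original.txt" > "$tmp/new.txt"
# A second run does not change anything.
latin1_to_utf8 "$tmp/new.txt" > "$tmp/again.txt"
cmp "$tmp/new.txt" "$tmp/again.txt" && echo "second run left the file unchanged"
```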
When logging in remotely (with SSH), you can normally configure your local settings to be forwarded. Unfortunately, not all SSH servers support this. Currently (as of November 2010), CSC's Solaris SSH server does not permit forwarding of environment variables, which is needed for this to work. The relevant locales (en_US.UTF-8, sv_SE.UTF-8) are available on Solaris, and you can set them manually, but they won't be used by default.
Problem: ÅÄÖ shown as ���
Your application uses latin1 characters, but your terminal (or editor) tries to display them as UTF-8. Configure your application to use UTF-8 (see below), or change your terminal settings to use ISO-8859-1.
Problem: ÅÄÖ shown as Ã…Ã„Ã–
Your application outputs UTF-8, but the bytes are displayed as latin1, so each accented letter appears as two characters. Configure your application to use ISO-8859-1 (see below), or change your terminal settings to use UTF-8.
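This effect can be reproduced from the shell. The sketch below assumes that iconv and od are available. The helper names misread_as_latin1, undo_misread and hex are invented for this example. The letters are written as octal escapes for their UTF-8 bytes so that the result does not depend on the encoding of the script itself.

```shell
# Hypothetical helpers. misread_as_latin1 treats its (UTF-8) input as
# latin1 and re-encodes it as UTF-8, which is in effect what a latin1
# display does. undo_misread reverses that step.
misread_as_latin1() { iconv -f ISO-8859-1 -t UTF-8; }
undo_misread()      { iconv -f UTF-8 -t ISO-8859-1; }

# Print standard input as a string of hex digits.
hex() { od -An -tx1 | tr -d ' \n'; echo; }

# "ÅÄÖ" as UTF-8 bytes: c3 85, c3 84, c3 96
printf '\303\205\303\204\303\226' | hex
# After the misreading, every letter has become two characters.
printf '\303\205\303\204\303\226' | misread_as_latin1 | hex
# Reversing the step restores the original six bytes.
printf '\303\205\303\204\303\226' | misread_as_latin1 | undo_misread | hex
```

The reverse step is only safe when you know that the text was misread exactly once. Applied to correctly encoded text, it would convert the text to latin1.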
Problem: ÅÄÖ shown as ï¿½ï¿½ï¿½
Your application is printing U+FFFD, the Unicode replacement character (�, usually displayed as a question mark on inverted background). Its UTF-8 bytes are then read as if they were latin1 and converted to UTF-8 again, so each U+FFFD (three bytes in UTF-8) is shown as three characters. Check the settings for all applications, including the terminal window, to ensure that they all agree on which encoding to use.
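To check whether a file or a program's output already contains replacement characters, you can look for the byte sequence EF BF BD. The following is a sketch only: count_replacement_chars is a made-up name, and it relies on grep supporting the -o option (GNU and BSD grep do).

```shell
# Hypothetical helper: count U+FFFD characters (bytes EF BF BD in UTF-8)
# on standard input. LC_ALL=C makes grep compare raw bytes.
count_replacement_chars() {
    LC_ALL=C grep -oF "$(printf '\357\277\275')" | wc -l | tr -d ' '
}

# Three replacement characters, as an application might print for "ÅÄÖ".
printf '\357\277\275\357\277\275\357\277\275\n' | count_replacement_chars

# The same bytes misread as latin1 and re-encoded as UTF-8:
# each U+FFFD turns into the three characters "ï¿½".
printf '\357\277\275' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1
```

Once a letter has been replaced by U+FFFD, the original letter is lost. Fixing the encoding settings prevents new damage, but the affected text has to be restored from the original source.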
Select locale (application settings)
If your application is locale-aware (most are, but some legacy CSC applications are not), then you can select the locale by setting the LC_ALL environment variable:
export LC_ALL=en_US.UTF-8 ## bash
setenv LC_ALL en_US.UTF-8 ## tcsh
and then run your application. To only configure the character encoding, change the LC_CTYPE environment variable instead.
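Note that LC_ALL overrides LC_CTYPE, so LC_CTYPE only takes effect when LC_ALL is unset or empty. The sketch below spells out the order in which the variables are consulted for the character encoding. The function name effective_ctype is made up for this example, and it only inspects the environment. It does not check that the locale is installed.

```shell
# Hypothetical helper: report which locale name decides the character
# encoding. Precedence: LC_ALL first, then LC_CTYPE, then LANG.
effective_ctype() {
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "${LC_CTYPE:-}" ]; then
        echo "$LC_CTYPE"
    elif [ -n "${LANG:-}" ]; then
        echo "$LANG"
    else
        echo "C"   # the default when nothing is set
    fi
}

effective_ctype
```

To see the encoding that the system actually selects, run locale charmap.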
You can also select which locale to use when you log in locally, but this may cause trouble when you use a different operating system. We recommend that you use the default settings and re-configure the applications instead.
Configuring terminal encoding
Ubuntu
The encoding used by Gnome's terminal can be changed under Terminal and then Set Character Encoding, but unless you have previously done so, you first need to add the “Western (ISO-8859-1)” encoding.
Mac OS X
The default setting for Terminal.app is to use UTF-8. This can be changed by going to Terminal then Preferences… then Advanced.
The default for X11.app's xterm is to use latin1. You can change this by editing the startup sequence for X11, but it's easier to just use Terminal.app.
MS Windows
PuTTY's settings can be changed under Window then Translation in the configuration dialog.
CSC's Windows computers currently run SSH Secure Shell from Tectia (formerly SSH Communications Security Corp). It is not UTF-8 aware, and will default to using latin1 encoding.