l10n: String Length and Verbosity across Languages

A few months ago I was discussing the truncation plague on Firefox OS with @kaze, and he came out with a sentence that left me doubtful:

according to the desktop metrics I had, French is the least compact locale and Chinese is the most compact one

So I had to check it somehow ;-) (in Italy they would call me Saint Thomas for being skeptical).

The basic idea was simple: use Silme to analyze all locales available in Mozilla l10n repositories, comparing string lengths between English and another language.

Here’s the resulting Python script (beware my slowly improving programming skills) and a table with the results (data can be sorted by clicking on column headers).

Sample and Reference

I’m using mozilla-beta as a reference, and comparing each locale against en-GB. Why not en-US? The reason is simple: en-US strings are scattered across the entire mozilla-central repository, so I would have to resort to tricks like the ones Transvision uses in order to create a pseudo en-US string-only repository. Using en-GB leads to less precise results (see below), but for the sake of this analysis I considered it an acceptable compromise.

I’m not checking all folders, only the main ones (‘browser’, ‘dom’, ‘mail’, ‘mobile’, ‘netwerk’, ‘security’, ‘services’, ‘suite’, ‘toolkit’, ‘webapprt’). This still generates an archive of almost 18,000 strings for locales translating all products, so it seems a decent sample.

Caveats and Weird Results

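Averages are calculated on the difference of each single string from en-GB, both in characters and in percentage. For example:

String 1: en-GB 2 characters, locale X 4 characters -> +2 characters, +100%
String 2: en-GB 8 characters, locale X 4 characters -> -4 characters, -50%
Average for locale X: -1 characters, +25% (sum of differences divided by total number of items).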

Not sure if this is the best choice, but I couldn’t think of an alternative. Note also that I’m ignoring single character strings (access keys, shortcuts).
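
The averaging described above can be sketched in a few lines of Python. This is not the actual script: the function name `average_difference` and the list-of-pairs input are assumptions made only for illustration.

```python
def average_difference(pairs):
    """Average length difference for a list of (reference, localized) strings.

    Returns (average difference in characters, average difference in percent).
    Hypothetical helper: the input format is an assumption for this sketch.
    """
    char_diffs = []
    pct_diffs = []
    for reference, localized in pairs:
        if len(reference) <= 1:
            # Ignore single character strings (access keys, shortcuts)
            continue
        diff = len(localized) - len(reference)
        char_diffs.append(diff)
        pct_diffs.append(100.0 * diff / len(reference))
    if not char_diffs:
        return 0.0, 0.0
    return sum(char_diffs) / len(char_diffs), sum(pct_diffs) / len(pct_diffs)


# Same numbers as the example: 2 -> 4 characters and 8 -> 4 characters,
# plus a single character string that gets skipped
pairs = [("ab", "abcd"), ("abcdefgh", "abcd"), ("F", "D")]
print(average_difference(pairs))  # (-1.0, 25.0)
```

Note how the two averages can have opposite signs: short strings weigh a lot on percentages and very little on character counts.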

In the table you’ll see a global column (average results) and “buckets”, with string groups based on en-GB original length. Too bad these groups are often unreliable because of the “concatenation conundrum”, where one string could be created by concatenating 3 different labels.

Typical example to create a sentence with a link (note that concatenation should always be avoided):

sentence.before = Hey, this is a
sentence.link = very interesting link
sentence.after = .

In Italian this could be localized as

sentence.before = Ehi, questo
sentence.link = link
sentence.after = è veramente interessante.

Do you see what just happened here? Words moved from one string to another, so a length comparison based on groups just became less interesting, for both averages and maximum/minimum differences.
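
To put numbers on it, here is a small Python sketch using the strings of the example above. The dictionaries and the `percent_diff` helper are assumptions made for illustration, and the "whole sentence" length is simply the sum of the three pieces (spaces between pieces are ignored).

```python
# The three pieces of the concatenated sentence, English and Italian
en = {
    "sentence.before": "Hey, this is a",
    "sentence.link": "very interesting link",
    "sentence.after": ".",
}
it = {
    "sentence.before": "Ehi, questo",
    "sentence.link": "link",
    "sentence.after": "è veramente interessante.",
}


def percent_diff(reference, localized):
    """Length difference of localized vs reference, in percent."""
    return 100.0 * (len(localized) - len(reference)) / len(reference)


# Comparing string by string gives wild swings in both directions
for key in en:
    print(key, round(percent_diff(en[key], it[key]), 1))

# Comparing the pieces as a whole tells a very different story
en_total = sum(len(value) for value in en.values())
it_total = sum(len(value) for value in it.values())
whole_sentence_diff = 100.0 * (it_total - en_total) / en_total
print("whole sentence", round(whole_sentence_diff, 1))
```

The link piece shrinks by roughly 80% and the final piece explodes, while the sentence as a whole is only about 11% longer in Italian.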

Anyhow, here’s a good image (graph based on global difference in percentage) that I’d like to call “Why using English as a reference for designing UI may not be a great idea”.

Length Comparison - mozilla-beta

Why not use Gaia directly?

This sounds like a good idea: we have a real en-US repository, and we don’t have concatenations. But there are some disadvantages as well:

  • Most locales already did at least two rounds of QA, so a lot of strings have already been (heavily) shortened to fit in the UI. So data could be less useful and interesting.
  • Several locales are incomplete on gaia-l10n. For this very reason I excluded all locales with fewer than 1,000 translated strings.

Here’s the same table for Gaia. And, again, a similar graph based on global difference in percentage.

Length Comparison - Gaia

Fun facts:

  • We know that en-GB is 0.16% longer than en-US, at least on Gaia.
  • A simple word such as “OK” (2 characters) can become as long as “Kulungile” (9) in Xhosa, or “Ceart ma-thà” (12) in Scottish Gaelic.

Once upon a time there was a string freeze… pt.2

Since it probably looks like my favorite hobby is whining without a reason, let’s check what happened so far (always an optimist…) in this cycle.

Broken strings in Mozilla Beta

  • Bug 797036 – Update updater strings and icon
  • Bug 803344 – poor discoverability of the enable/disable menu item for Social API

Landing strings in Beta means that we did something wrong before (haste in pushing forward features that probably weren’t ready, “we need this in ESR”, etc.).

Broken strings in Mozilla Aurora

Obviously the two changesets landed on beta, plus:

  • Bug 795691 – b2g fixes for the web console actors
  • Bug 800373 – Change marketplace strings to ‘Firefox Marketplace’

Add to that several changesets adding or removing strings in both beta and aurora (e.g. Bug 803630 or Bug 760951) and you’ll get the picture.

Bug 797036 is a good example of how badly we have been working on the l10n side lately:

  1. changes land on central on Oct 02 16:34:08 (end of cycle is only 6 days ahead)
  2. the day after I wrote a comment in the bug about the bad review (that’s pure luck, I don’t work on localization every day, and there are very few localizers doing this kind of check on central)
  3. nobody reacts, bad strings move to aurora and we need to break string freeze

For starters, a better review process could have avoided all this.

Once upon a time there was a string freeze…

Nine months ago I wrote this post. Are things better now? Not at all, they keep getting worse.

When people asked me “how can you be happy with the rapid release cycle?”, I always answered “because finally I have a clear schedule”. Now imagine how I feel about the rapid release cycle.

I’m not a developer, and I’m not an engineer either, but guess what: if you’re breaking things every single cycle, you’re doing it wrong. I think it would be a good time to start thinking about it, maybe before localizers start giving up.

l10n Memo for the Next Meeting

My personal short memo for the next meeting, even if I’m sure Axel is already on this:

  • Aurora is supposed to be string frozen, so that localizers have a full cycle to update their localization, test their work and sign off on the best changeset available for Beta. This worked quite well for 5 releases, so why did everything go wrong this time? We’re just a couple of days away from the end of this cycle (Firefox 10 release, Jan 30th), a backout on toolkit broke everything [1] and then a bug on devtools added even more confusion.
  • Being a Mozilla localizer already requires an awful amount of technical skills, please don’t even think of adding more stuff on top of that (“why can’t we or localizers just retrieve the previous string from hg blame?”).
  • Working on two different repositories is painful (see Native Fennec). I realized that I can’t transplant changesets around because they often change more strings than I need, so I have to move text around manually. I’m scared of what will happen when I merge my work from central to aurora.

[1] Thanks to our l10n logic this is not literally true, since products fall back to the English string. From my point of view, this still means “breaking things”: exposing a partially translated UI means lowering the quality of our work, and that’s not something I like to do.

Dear reviewer,

I’m aware that l10n can be a nuisance for a lot of developers – and some localizers (e.g. me) can be a real pain in the *** – but when reviewing a patch that involves a change to existing strings you only have a short and quick checklist to follow:

  • Does the patch fix a typo or does it make a substantial change to the string? In the first case just fixing the string is fine, in the latter case you need to change the string ID, since not all localization tools (or localizers who simply use a text editor) can catch this kind of change.
  • Are you changing a string ID? Always check if there’s an associated access key and maintain the relation STRINGID.label <-> STRINGID.accesskey (again, localization tools rely on this kind of structure).
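
The second check is easy to automate, at least in a rough form. Below is a minimal Python sketch; the function name `orphaned_accesskeys` and the idea of feeding it a plain list of string IDs are assumptions for illustration, and it only covers the STRINGID.label / STRINGID.accesskey naming.

```python
def orphaned_accesskeys(string_ids):
    """Return access key IDs without a matching STRINGID.label.

    Hypothetical helper: it only understands the
    STRINGID.label <-> STRINGID.accesskey convention.
    """
    ids = set(string_ids)
    suffix = ".accesskey"
    orphans = []
    for string_id in sorted(ids):
        if string_id.endswith(suffix):
            base = string_id[: -len(suffix)]
            if base + ".label" not in ids:
                orphans.append(string_id)
    return orphans


# The label ID was changed to "save2.label", the access key was forgotten
ids = ["save2.label", "save.accesskey", "open.label", "open.accesskey"]
print(orphaned_accesskeys(ids))  # ['save.accesskey']
```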

Once in a while a mistake can happen, but three times in a few days seems a bit above average ;-)

Many of the Mozilla localizers have met @Moz08, the Mozilla summit in Whistler
Happy localizers in Whistler ’08 (source Tristan Nitot)

Bad Localization Example (Java on OS X)

This is the dialog window that appears when you try to run a Java Applet on Mac OS X 10.5.7 with the latest Java update (I’m running Java 1.5.0_19 according to this test).

java_firefox

Take a look at the checkbox:

  • In Italian it’s “l’accesso” (definite article+noun), not “laccesso”. The same error appears in the first label, so I suppose they have some difficulties dealing with apostrophes. This problem was already there before the Java update.
  • Applet’s name and author are gone, replaced by {0} and {1} (this started with the last Java update).

Here are my questions:

  • Who is to blame for this window? Sun (as I suppose) or Apple? Sure it’s not Mozilla’s fault, since the same thing happens with Safari 4.
  • Is this happening only with the Italian localization of OS X? Are other locales affected as well?
  • How can we try to fix that, since someone will think for sure that this is our (Mozilla localizers) fault?


I Hate Accesskeys

As usual, before the final release we’re doing a lot of QA work on our localized Firefox builds, and this includes a careful check on accesskeys. There are two different issues with accesskeys:

  • use of a character not available in the label. For example: using “F” as accesskey for “Shiretoko” creates a label “Shiretoko (F)”. This can easily happen if you update the label and forget to correct the corresponding accesskey.
  • duplicated accesskeys (two or more labels with the same scope share the same accesskey).
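
Both issues can be described in a few lines of Python. This is only a sketch: the `(scope, label, accesskey)` tuples are a made-up input format, and treating access keys as case-insensitive is an assumption.

```python
from collections import defaultdict


def check_accesskeys(entries):
    """Check a list of (scope, label, accesskey) tuples.

    Returns (missing, duplicates):
    - missing: entries whose access key is not a character of the label
    - duplicates: {(scope, key): [labels]} for keys shared within a scope
    Case-insensitive matching is an assumption of this sketch.
    """
    missing = []
    labels_by_key = defaultdict(list)
    for scope, label, key in entries:
        if key.lower() not in label.lower():
            missing.append((scope, label, key))
        labels_by_key[(scope, key.lower())].append(label)
    duplicates = {
        scope_key: labels
        for scope_key, labels in labels_by_key.items()
        if len(labels) > 1
    }
    return missing, duplicates


entries = [
    ("menu", "Shiretoko", "F"),  # "F" is not available in the label
    ("menu", "Save", "S"),
    ("menu", "Search", "s"),     # same access key as "Save"
]
missing, duplicates = check_accesskeys(entries)
print(missing)     # [('menu', 'Shiretoko', 'F')]
print(duplicates)  # {('menu', 's'): ['Save', 'Search']}
```

The hard part, of course, is not the check itself but knowing which labels actually share the same scope.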

In the last 24 hours we found two duplicated accesskeys in the Italian build: the first one is quite hidden (you have to check for updates in the Extension manager and then click on the “More information” button), while the second one is located in the main window (Toolbar search). This last issue affects the en-US build (see bug 498840) and probably also other locales.

accesskey

I think that we should really start to think about accesskeys and how to introduce automated tests.

The first step should be to create a standard naming convention (it doesn’t even need to be mandatory, but it would make things easier): right now you can find accesskeys named like “label_accesskey”, “labelaccesskey” or “label.accesskey”. With a convention in place, checking for characters not available in the label shouldn’t be a problem.
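
Until such a convention exists, a script has to guess. Here is a rough Python sketch that maps the three naming styles above back to a common base ID; the regular expression and the function name `base_id_for` are assumptions, and real files certainly contain names that don’t fit.

```python
import re

# Matches "_accesskey", ".accesskey" or a bare "accesskey" suffix
ACCESSKEY_SUFFIX = re.compile(r"[._]?accesskey$", re.IGNORECASE)


def base_id_for(accesskey_id):
    """Strip the access key suffix from an ID, or return None
    if the ID doesn't look like an access key (hypothetical helper)."""
    if not ACCESSKEY_SUFFIX.search(accesskey_id):
        return None
    return ACCESSKEY_SUFFIX.sub("", accesskey_id)


for name in ("label_accesskey", "labelaccesskey", "label.accesskey"):
    print(name, "->", base_id_for(name))  # all three give "label"
```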

The real challenge would be to find accesskey conflicts – using different tables to store all the accesskeys with the same scope – in particular in pop-up menus. Have you ever tried to select different parts of a web page (create a selection with images, links, images with links, text, etc.) and check how the context menu changes? Doing this kind of check manually is simply crazy ;-)


Survey for Ubiquity localization

How can we localize this set of commands in Italian (see Mitcho’s post)?

1. search HELLO
2. search HELLO with google
3. translate HELLO from English to French
4. lookup the weather for PLACE
5. shop for SHOES with Amazon
6. email HELLO to Bill
7. email HELLO to ADDRESS
8. map PLACE
9. find HELLO
10. tab to HELLO or switch to HELLO tab

1. cerca HELLO

Italian uses the same order as English, so this one is easy.

2. cerca HELLO su google

First minor problem (see also this comment): do you search “on” Google, “with” Google or “in” Google? The form “cerca su” (“search on”) is probably the most used nowadays. Note that the object is placed between the verb (cerca) and the preposition (su).

3. traduci HELLO da inglese a francese

Removing the definite articles makes it feel a little less natural (“da inglese a francese” instead of “dall’inglese al francese”), but it still sounds good.

4. controlla meteo di PLACE
5. compra SHOES su Amazon

Same order and structure as English, we just need to find the most appropriate verbs (for example, do you “check” the weather or “display” weather conditions?).

6. email HELLO to Bill
7. email HELLO to ADDRESS

These two commands are quite problematic to localize:

  • we don’t have a single Italian verb for “to email”
  • you can “send (or write) an email to someone”, the tricky part is to include the object (HELLO)

If “HELLO” is an object (like a map, a selection or a link), the structure “send this by email to someone” is ok:

invia HELLO per e-mail a Bill/ADDRESS

What if HELLO is a piece of text, like “email «good luck» to Bill”? In this case the proposed structure sounds weird, but honestly I can’t find a better one (any suggestions out there?).

6. invia HELLO per e-mail a Bill
7. invia HELLO per e-mail a ADDRESS

8. cerca mappa di PLACE

Since we don’t have a single verb equivalent to “to map”, we can use something like “search map of”.

9. trova HELLO

Same order as English.

10. passa alla scheda HELLO

“Tab to HELLO” is almost impossible to translate, while “switch to HELLO tab” has a different order in Italian (equivalent to “switch to tab HELLO”).

This is the final result, hopefully with room for improvement on 6 and 7:

1. cerca HELLO
2. cerca HELLO su google
3. traduci HELLO da inglese a francese
4. controlla meteo di PLACE
5. compra SHOES su Amazon
6. invia HELLO per e-mail a Bill
7. invia HELLO per e-mail a ADDRESS
8. cerca mappa di PLACE
9. trova HELLO
10. passa alla scheda HELLO

1. search this with google
2. translate this to French
3. bookmark this tab

The only problem in these 3 commands is the lack of a single verb for “bookmark”, which can be changed to “add to bookmarks”. The correct form is “aggiungi questo ai segnalibri” (“add this to bookmarks”).

1. cerca questo con google
2. traduci questo in francese
3. aggiungi questo ai segnalibri


Localizer: Follow That Address!

A brief follow-up to the previous post: after the discussion held in mozilla.dev.l10n, a new pseudo account has been created in Bugzilla (see bug 484645) to track changes that affect the localization process at an earlier stage.

If you’re a localizer, it’s probably a good idea to follow that account: in Bugzilla’s Preferences, open the Email Preferences panel and add community@localization.bugs to the User Watching list.


Why l10n should be involved in UI redesign

Take a look at this mock-up of the new Privacy panel: looks great, doesn’t it? But for me it’s just an l10n nightmare.

mockup

When you localize software, you have two possibilities (at least in Italian):

  • be informal and use the second-person singular
  • be formal and use passive forms and third-person singular

The second one is the obvious choice for professional translations, and we chose this path for our localization. This means that you should also try to avoid software personification: actions are done by (or with) the software, and the software’s name shouldn’t be used as the subject of a sentence.

“Firefox will” is a bad choice for another reason: many languages don’t use auxiliary verbs to create future forms, so how can I translate that? Ok, I could try to find a suitable auxiliary verb, for example “deve” (must). “Firefox must: remember history/never remember history”. And there I’m stuck again: in negative forms, the “not” should go before the auxiliary verb:

  • Firefox deve salvare la cronologia (Firefox must remember history)
  • Firefox non deve mai salvare la cronologia (Firefox must never remember history)

The purpose of this rant is: please try to involve l10n in UI redesign, and try to land these massive changes before a string freeze.
