Best practices in internationalizing text with lots of markup?

Best practices in internationalizing text with lots of markup? - java

I'm working on a web project that will (hopefully) be available in several languages one day (I say "hopefully" because while we only have an English language site planned today, other products of my company are multilingual and I am hoping we are successful enough to need that too).
I understand that the best practice (I'm using Java, Spring MVC, and Velocity here) is to put all text that the user will see in external files, and refer to them in the UI files by name, such as:
#in messages_en.properties:
welcome.header = Welcome to AppName!
#in the markup
<title>#springMessage("welcome.header")</title>
But, having never had to go through this process on a project myself before, I'm curious what the best way to deal with this is when you have some segments of the UI that are heavy on markup, such as:
<p>We are excited to announce that Company1 has been acquired by
Division X,
a fast-growing division of Company 2, Inc.
(Nasdaq: BLAH), based in...
One option I can think of would be to store this "low-level" of markup in messages.properties itself for the message - but this seems like the worst possible option.
Other options that I can think of are:
Store each non-markup inner fragment in messages.properties, such as acquisitionAnnounce1, acquisitionAnnounce2, acquisitionAnnounce3. This seems very tedious though.
Break this message into more reusable components, such as Company1.name, Company2.name, Company2.ticker, etc., as each of these is likely reused in many other messages. This would probably account for 80% of the words in this particular message.
Are there any best practices for dealing with internationalizing text that is heavy with markup such as this? Do you just have to bite down and bear the pain of breaking up every piece of text? What is the best solution from any projects you've personally dealt with?

Typically if you use a template engine such as Sitemesh or Velocity you can manage these smaller HTML building blocks as subtemplates more effectively.
By so doing, you can incrementally boil down the strings which are the purely internationalized ones into groups and make them relevant to those markup subtemplates. Having done this sort of work using templates for an app which spanned multi-languages in the same locale, as well as multiple locales, we never ever placed markup in our message bundles.
I'd suggest that a key good practice would be to avoid placing markup (even at a low-level as you put it) inside message properties files at all costs! The potential this has for unleashing hell is not something to be overlooked - biting the bullet and breaking things up correctly, is far less of a pain than having to manage many files with scattered HTML markup. Its important you can visualise markup as holistic chunks and scattering that everywhere would make everyday development a chore since:
You would lose IDE color highlighting and syntax validation
High possibility that one locale file or another can easily be missed when changes to designs / markup filter down
Breaking things down (to a realistic point, eg logical sentence structures but no finer) is somewhat hard work upfront but worth the effort.
Regarding string breakdown granularity, here's a sample of what we did:
comment.atom-details=Subscribe To Comments
comment.username-mandatory=You must supply your name
comment.useremail-mandatory=You must supply your email address
comment.email.notification=Dear {0}, the comment thread you are watching has been updated.
comment.feed.title=Comments on {0}
comment.feed.title.default=Comments
comment.feed.entry.title=Comment on {0} at {1,date,medium} {2,time,HH:mm} by {3}
comment.atom-details=Suscribir a Comentarios
comment.username-mandatory=Debes indicar tu nombre
comment.useremail-mandatory=Debes indicar tu direcci\u00f3n de correo electr\u00f3nico
comment.email.notification=La conversaci\u00f3n que estas viendo ha sido actualizada
comment.feed.title=Comentarios sobre {0}
comment.feed.title.default=Comentarios
comment.feed.entry.title=Comentarios sobre {0} a {1,date,medium} {2,time,HH:mm} por {3}
So you can do interesting things with how you string replace in the message bundle which may also help you preserve it's logical meaning but allow you to manipulate it mid sentence.

As others have said, please never split the strings into segments. You will cause translators grief as they have to coerce their language syntax to the ad-hoc rules you inadvertently create. Often it will not be possible to provide a grammatically correct translation, especially if you reuse certain segments in different contexts.
Do not remove the markup, either.
Please do not assume professional translators work in Notepad :) Computer-aided translation (CAT) tools, such as the Trados suite, know about markup perfectly well. If the tagging is HTML, rather than some custom XML format, no special preparation is required. Trados will protect the tags from accidental modification, while still allowing changes where necessary. Note that certain elements of tags often need to be localized, e.g. alt text or some query strings, so just stripping all the markup won't do.
Best of all, unless you're working on a zero-budget personal project, consider contacting a localization vendor. Localization is a service just like web design. A competent vendor will help you pick the optimal solution/format for your project and guide you through the preparation of the source material and incorporating the localized result. And of course they and their translators will have all the necessary tools. (Full disclosure: I am a translator / localization specialist. And don't split up strings :)

First off, don't split up your strings. This makes it much harder for localizers to translate text because they can't see the entire string to translate.
I would probably try to use placeholders around the links:
Division X
That's how I did it when I was localizing a site into 30 languages. It's not perfect, but it works.
I don't think it's possible (or easy) to remove all markup from strings, you need to have a way to insert the urls and any extra markup.

You should avoid breaking up your strings. Not only does this become a nightmare to translate, but it also makes grammatical assumptions which may not be correct in the target language.
While placeholders can be helpful for many things, I would not recommend using placeholders for URLs. This allows you to customize the URL for different locales. After all, no sense sending them to an English language page when their locale is Argentine Spanish!

Related

Java typed i18n (java)

I'd like to know if it's possible (and with which tooling) to do typesafe i18n in Java. Maybe it's not clear so here are some details, assuming we use something based on MessageFormat
1) Translate using typesafe parameters
I'd like to avoid having an interface like String translate(Object key,Object... values) where the values are untyped. It should be impossible to call with a bad parameter type.
Note I'm fine specifying the typing of all the keys. The solution I'm looking for should be scalable and should not increase the backend startup time significantly.
2) It should be known at compile time which keys are still used
I don't want my translation keys base to be like many websites' CSS, growing and growing forever and everybody being frightened to remove keys because we don't know easily if they are still useful or not.
In JS/React land there is babel-plugin-react-intl which permit to extract at compile time the translation keys that are still found in the code. Then we can diff these keys with our translation backend/SaaS and delete the unused keys automatically. Is there anything close to that experience in Java land?
I'm looking for:
any trick you have that could make i18n more manageable in Java regarding these 2 problems I have
current tooling that might help me solve the problem
hints on how to implement something custom if tooling does not exist
Also, is Enum suitable to store a huge fixed list of translation keys?

Translation keys are an open ended domain. For a closed domain an enum would do.
Having something like enums or constant lists likely causes a growth of different enums, constants classes.
And then there is the very important perspective of the translating business:
you would want at least one glossary (not needing translation occurrences), structurally equal phrases grouped,
comments maybe on ambivalent terms and usages (button/menu). This can reduce
the time costs and improve the quality. There also are things like online-help.
Up till now XML, like simple docbook / translation memory (tmx/xliff/...), was sufficient for that. And the tooling
including different forms of evaluation was done ourselves.
I hope a more professional answer will be given, but my answer might shed some light
on the desired functionality:
translation centric: as that needs the most work.
version control: some text lists involved.
checking tools: what you mentioned, integrity, missing, almost equal.

Tool for creating own rules for word lemmatization and similar tasks

I'm doing a lot of natural language processing with a bit unsusual requirements. Often I get tasks similar to lemmatization - given a word (or just piece of text) I need to find some patterns and transform the word somehow. For example, I may need to correct misspellings, e.g. given word "eatin" I need to transform it to "eating". Or I may need to transform words "ahahaha", "ahahahaha", etc. to just "ahaha" and so on.
So I'm looking for some generic tool that allows to define transormation rules for such cases. Rules may look something like this:
{w}in -> {w}ing
aha(ha)+ -> ahaha
That is I need to be able to use captured patterns from the left side on the right side.
I work with linguists who don't know programming at all, so ideally this tool should use external files and simple language for rules.
I'm doing this project in Clojure, so ideally this tool should be a library for one of JVM languages (Java, Scala, Clojure), but other languages or command line tools are ok too.
There are several very cool NLP projects, including GATE, Stanford CoreNLP, NLTK and others, and I'm not expert in all of them, so I could miss the tool I need there. If so, please let me know.
Note, that I'm working with several languages and perform very different tasks, so concrete lemmatizers, stemmers, misspelling correctors and so on for concrete languages do not fit my needs - I really need more generic tool.
UPD. It seems like I need to give some more details/examples of what I need.
Basically, I need a function for replacing text by some kind of regex (similar to Java's String.replaceAll()) but with possibility to use caught text in replacement string. For example, in real world text people often repeat characters to make emphasis on particular word, e.g. someoone may write "This film is soooo boooring...". I need to be able to replace these repetitive "oooo" with only single character. So there may be a rule like this (in syntax similar to what I used earlier in this post):
{chars1}<char>+{chars2}? -> {chars1}<char>{chars2}
that is, replace word starting with some chars (chars1), at least 3 chars and possibly ending with some other chars (chars2) with similar string, but with only a single . Key point here is that we catch on a left side of a rule and use it on a right side.

I am not an expert in NLP, but I believe Snowball might be of interest to you. Its a language to represent stemming algorithms. Its stemmer is used in the Lucene search engine.

I've found http://userguide.icu-project.org/transforms/general to be useful as well for some general pattern/transform tasks like this, ignore the stuff about transliteration, its nice for doing a lot of things.
You can just load up rules from a file into a String and register them, etc.
http://userguide.icu-project.org/transforms/general/rules

Camel case in web resources

What is you opinion on using camel case for web resources?
I am coming from a Java background where camel case is second nature, but still when naming web resources, such as html, css, javascript camel case does not feel right.
(e.g. http://localhost/application/editUserForm.html vs http://localhost/application/edit/user/form.html)
Any comments, suggestions are welcome!

The main consideration on naming schemes would be impact on SEO. From my understanding, Google (and presumably other engines) can 'read' amalgamated words in a single string, so camel case should be OK, as would a single case-insensitive string. Splitting the scheme by directory using rewrites would be clearer for less capable spiders. One piece of advice Google give is to use hyphens (-) rather than underscores (_), but that's not relevant here.
If you expect a real person to ever have to type the full address, using something easy to read would be a bonus in order to minimise error.

I don't find anything wrong with such naming.
My personal preference is to name web resources with -, like edit-user.jsp. I think it's more question of personal taste. I don't like _. - makes easier to visually find separate words in browser address bar (at least for me). And as far as I saw - is pretty common.

It might open the door to problems with 2 the same resources that differ in casing, if you are deploying your site on a windows environment (either for development or hosting).
But if you avoid 'double' filenames like that it's more or less taste.
A naming-scheme like this http://localhost/application/edit/user/form.htm does show the seperate words better, and might be easier to parse as something to do with "user".

I don't find camel casing very appealing. While it is the convention for Java and we should follow it when doing java, we don't have to when naming other stuff.
It's not that we find it tedious inserting a separator between every two words, but the underscore is really hard to type, it requires two pinky fingers to stretch very far. Unfortunately underscore is favored by language designers in identifiers.
Whoever invent the next programming language, please use '/' as name space separator, and '.' as word separator, so instead of
java.beans.beancontext.BeanContext.getResourceAsStream()
we have
java/beans/bean.context/Bean.Context/get.resource.as.stream()
wait... '/' is already used for division. never mind.

Java localization best practices

I have a Java application with server and Swing client. Now I need to localize the user interface and possibly also some of the data needs to be locale specific. There are few things in specific I would like to hear your opinions on.
How should I distribute the localized strings for the UI into properties files? In my application there are several views and each has several panels. Should I have one localization file per language for each panel or view or should I keep all translations for one language in the same file? I'm currently leaning towards one file per view and language, but I'm not sure how I should handle some domain specific terms which appear in many places. Having the same translation on several files does not sound too good.
The server throws some exceptions that contain a message that should be displayed to the user. I could get the selected locale from the session and handle the localization at the server, but I feel it would be more elegant to keep all localization files at the client. I have been thinking about sending only a localization key from the server with some kind of placeholders for error specific information, which would be sent with the exception. Then the client could construct the message based on the localization key and replace the placeholders with the error specific information. Does that sound like a good way to handle it, or are there other options? Typically my exception messages contain some additional information that changes for each case. It could be for example "A user with username Khilon already exists", in which case the string in the properties file would be something like "A user with username {0} already exists".
The localization of the data is the area that is the most unclear to me. As I'm not sure if it will be ever required, I have so far not planned it very much. The database part sounds straightforward enough, you basically just need an additional table for the strings and a column to tell for which locale the string is. Though I'm not sure if it would be best to have a localization table for each data table (eg Product and Product_names), or could I use one table for localization strings for all the data tables. The really tricky part is how to handle the UI, as to some degree it would be required for an user to enter text for an object in multiple languages. In practice this could mean for example, that a worker in Finland would give the object a name in Finnish and English, and then a worker in another country could translate it to her own language. If any of you has done something similar, I'd be happy to hear how you did it.
I'm very grateful to everybody who can share their experiences with me.
P.S. If you happen to know any exceptionally good websites or books on the subject, I would be happy to hear of them. I have of course done some googling and read some articles about localization, but nothing mind blowing yet.

Actually, what you are talking about is Internationalization (i18n), not Localization (L10n).
From my experience, you are on the right path.
ad 1). One properties file per view and locale (not necessary language, as you may want to use different translations for certain languages depending on country, i.e. using different strings for British an American English thus different locales) is the right approach. Since applications tend to evolve, it could save a good deal of money when you want to modify just one view (as translators will charge you even for something they won't touch - they will have to actually find strings that need to be updated/newly translated). It would be also easier to use with Translation Memory tools if you do it right (new strings at the end of the file all the time).
ad 2). The best idea is to sent out only the resource key from server or other process; other approach could be attaching a resource key and possibly the data (i.e. numeric value) using delimiters, so the message could be recreated and reformatted into local language.
ad 3). I have seen several approaches to localizing Databases, but the best (and it is not only my opinion, but also IEEE members) is to store resource keys and recreate the data on client side using appropriate locale. Of course this goes for pre-installed data, if you let users to enter the data, other issues will arose... There is no silver bullet, one need to think what works best in his/her context. I would lean to including a foreign key column that will identify the language, but it really depends on kind of data that will be stored.
Unfortunately i18n doesn't end here, please remember about correctly formatting dates and numbers so that they will be understandable for people using your program. And also, if you happen to have some list of strings, the sorting order should also depend on locale (it's called collation).
Sun used to have (now our beloved Oracle) has quite good i18n trail which you can find here: http://download.oracle.com/javase/tutorial/i18n/index.html .
If you want to read good book on the subject of i18n and L10n, that will save you years of learning these topics (although not necessary will teach you how to program it), there is great book from Microsoft Press: "Developing International Software" - http://www.amazon.com/Developing-International-Software-Dr/dp/0735615837 . It still relevant, although quite old.

1) I usually keep everything in one file and use names that signify where the properties are used. For example, I prefix with things like "view" and "menu"
view.add_request.title
view.add_request.contact_information.sectionheader
view.add_request.contact_information.first_name.label
view.add_request.contact_information.last_name.label
menu.admin.user_management.add_user.label
menu.admin.user_management.add_role.label
2) Yes, passing around the key makes things simpler and makes the server code easier to test. It also avoids having to pass locale information to the server to have it decide on a language for the client. Its a thick client, so let it handle the localization.
3) I haven't localized data before (usually just labels, and static UI verbage), but I would probably lean towards having a single table with all the localized strings and locales to start with (just to keep it simple). I'm not sure what you're asking about in reference to the UI, but I would suggest you make sure that whatever character-set you're using allows all the languages you want to support. Make sure you read Joel Spolsky's article entitled: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Natural language parsing, practical example

I am looking to use a natural language parsing library for a simple chat bot. I can get the Parts of Speech tags, but I always wonder. What do you do with the POS. If I know the parts of the speech, what then?
I guess it would help with the responses. But what data structures and architecture could I use.

A part-of-speech tagger assigns labels to the words in the input text. For example, the popular Penn Treebank tagset has some 40 labels, such as "plural noun", "comparative adjective", "past tense verb", etc. The tagger also resolves some ambiguity. For example, many English word forms can be either nouns or verbs, but in the context of other words, their part of speech is unambiguous.
So, having annotated your text with POS tags you can answer questions like: how many nouns do I have?, how many sentences do not contain a verb?, etc.
For a chatbot, you obviously need much more than that. You need to figure out the subjects and objects in the text, and which verb (predicate) they attach to; you need to resolve anaphors (which individual does a he or she point to), what is the scope of negation and quantifiers (e.g. every, more than 3), etc.
Ideally, you need to map you input text into some logical representation (such as first-order logic), which would let you bring in reasoning to determine if two sentences are equivalent in meaning, or in an entailment relationship, etc.
While a POS-tagger would map the sentence
Mary likes no man who owns a cat.
to such a structure
Mary/NNP likes/VBZ no/DT man/NN who/WP owns/VBZ a/DT cat/NN ./.
you would rather need something like this:
SubClassOf(
ObjectIntersectionOf(
Class(:man)
ObjectSomeValuesFrom(
ObjectProperty(:own)
Class(:cat)
)
)
ObjectComplementOf(
ObjectSomeValuesFrom(
ObjectInverseOf(ObjectProperty(:like))
ObjectOneOf(
NamedIndividual(:Mary)
)
)
)
)
Of course, while POS-taggers get precision and recall values close to 100%, more complex automatic processing will perform much worse.
A good Java library for NLP is LingPipe. It doesn't, however, go much beyond POS-tagging, chunking, and named entity recognition.

Natural language processing is wide and deep, with roots going back at least to the 60s. You could start reading up on computational linguistics in general, natural language generation, generative grammars, Markov chains, chatterbots and so forth.
Wikipedia has a short list of libraries which I assume you might have seen. Java doesn't have a long tradition in NLP, though I haven't looked at the Stanford libraries.
I doubt you'll get very impressive results without diving fairly deeply into linguistics and grammar. Not everybody's favourite school subject (or so I've heard reported -- loved'em meself!).

I'll skip a lot many details and keep this simple. Parts of Speech tagging help you to create an parse tree out of a sentence. Once you have this, you try to make out a meaning as unambiguously as possible. The result of this parsing step will greatly aid you to frame a suitable response for you chatterbot.

Once you have part of speech tags you can extract, for example, all nouns, so you know roughly what things or objects someone is talking about.
To give you an example:
Someone says "you can open a new window." When you have the POS tags you know they are not talking about a can (as in container, jar etc., which would even make sense in the context of open), but a window. You'll also know that open is a verb.
With this information, your chat bot can generate a much better reply that will have nothing to do with can openers etc.
Note: You don't need a parser to get POS tags. A simple POS tagger is enough. A parser will give you even more information (e.g. what is the subject, what the object of the sentence?)

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.