Java localization best practices

I have a Java application with a server and a Swing client. Now I need to localize the user interface, and possibly some of the data needs to be locale specific as well. There are a few specific things I would like to hear your opinions on.
How should I distribute the localized strings for the UI into properties files? In my application there are several views and each has several panels. Should I have one localization file per language for each panel or view, or should I keep all translations for one language in the same file? I'm currently leaning towards one file per view and language, but I'm not sure how I should handle some domain-specific terms which appear in many places. Having the same translation in several files does not sound too good.
The server throws some exceptions that contain a message that should be displayed to the user. I could get the selected locale from the session and handle the localization at the server, but I feel it would be more elegant to keep all localization files at the client. I have been thinking about sending only a localization key from the server, with some kind of placeholders for error-specific information, which would be sent with the exception. Then the client could construct the message based on the localization key and replace the placeholders with the error-specific information. Does that sound like a good way to handle it, or are there other options? Typically my exception messages contain some additional information that changes for each case. It could be for example "A user with username Khilon already exists", in which case the string in the properties file would be something like "A user with username {0} already exists".
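To make this concrete, here is roughly what I have in mind - a minimal sketch, where the class and key names are just placeholders:

import java.text.MessageFormat;
import java.util.Locale;
import java.util.ResourceBundle;

// Server side: the exception carries only a key and the variable parts.
// (For a real client/server split the argument values must be serializable.)
class LocalizableException extends RuntimeException {
    private final String messageKey;
    private final Object[] arguments;

    LocalizableException(String messageKey, Object... arguments) {
        super(messageKey);
        this.messageKey = messageKey;
        this.arguments = arguments;
    }

    String getMessageKey() { return messageKey; }
    Object[] getArguments() { return arguments; }
}

// Client side: look the key up in the local bundle and fill the placeholders.
class ErrorMessages {
    // errors.properties would contain e.g.:
    // error.user.exists=A user with username {0} already exists
    static String format(LocalizableException e, Locale locale) {
        ResourceBundle bundle = ResourceBundle.getBundle("errors", locale);
        return MessageFormat.format(bundle.getString(e.getMessageKey()), e.getArguments());
    }
}

The server would then throw new LocalizableException("error.user.exists", "Khilon"), and the client would render it in whatever locale the user has selected.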
The localization of the data is the area that is most unclear to me. As I'm not sure it will ever be required, I have so far not planned it very much. The database part sounds straightforward enough: you basically just need an additional table for the strings and a column telling which locale the string is for. Though I'm not sure whether it would be best to have a localization table for each data table (e.g. Product and Product_names), or whether I could use one table for the localization strings of all the data tables. The really tricky part is how to handle the UI, as to some degree a user would be required to enter text for an object in multiple languages. In practice this could mean, for example, that a worker in Finland would give the object a name in Finnish and English, and then a worker in another country could translate it to her own language. If any of you has done something similar, I'd be happy to hear how you did it.
I'm very grateful to everybody who can share their experiences with me.
P.S. If you happen to know any exceptionally good websites or books on the subject, I would be happy to hear of them. I have of course done some googling and read some articles about localization, but nothing mind-blowing yet.

Actually, what you are talking about is Internationalization (i18n), not Localization (L10n).
From my experience, you are on the right path.
ad 1). One properties file per view and locale (not necessarily language, as you may want to use different translations for certain languages depending on the country, e.g. different strings for British and American English, thus different locales) is the right approach. Since applications tend to evolve, it could save a good deal of money when you want to modify just one view (translators will charge you even for strings they won't touch - they have to actually find the strings that need to be updated or newly translated). It would also be easier to use with Translation Memory tools if you do it right (new strings at the end of the file, all the time).
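For illustration, loading could then look like this (the bundle and key names are invented; a shared bundle for the domain-specific terms you mention keeps them translated only once):

import java.util.Locale;
import java.util.ResourceBundle;

public class ViewBundles {
    public static void main(String[] args) {
        Locale locale = new Locale("en", "GB");
        // One bundle per view: i18n/UserView.properties,
        // i18n/UserView_en_GB.properties, i18n/UserView_fi_FI.properties, ...
        ResourceBundle view = ResourceBundle.getBundle("i18n.UserView", locale);
        System.out.println(view.getString("user.view.title"));
        // Domain-specific terms live in one shared bundle, translated once
        // and referenced from every view that needs them.
        ResourceBundle domain = ResourceBundle.getBundle("i18n.DomainTerms", locale);
        System.out.println(domain.getString("term.product"));
    }
}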
ad 2). The best idea is to send out only the resource key from the server or other process; another approach could be attaching the resource key and possibly the data (e.g. a numeric value) using delimiters, so the message can be recreated and reformatted into the local language.
ad 3). I have seen several approaches to localizing databases, but the best (and it is not only my opinion, but also that of IEEE members) is to store resource keys and recreate the data on the client side using the appropriate locale. Of course this goes for pre-installed data; if you let users enter the data, other issues will arise... There is no silver bullet, one needs to think about what works best in his/her context. I would lean towards including a foreign key column that identifies the language, but it really depends on the kind of data that will be stored.
Unfortunately i18n doesn't end here. Please remember to correctly format dates and numbers so that they are understandable to people using your program. And if you happen to have lists of strings, the sort order should also depend on the locale (this is called collation).
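For example (a minimal sketch):

import java.text.Collator;
import java.text.DateFormat;
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.Locale;

public class LocaleAwareOutput {
    public static void main(String[] args) {
        Locale fi = new Locale("fi", "FI");
        // Numbers and dates: use the locale's formats, never concatenate by hand.
        System.out.println(NumberFormat.getNumberInstance(fi).format(1234567.89));
        System.out.println(DateFormat.getDateInstance(DateFormat.MEDIUM, fi).format(new Date()));
        // Collation: locale-aware sort order for lists of strings.
        String[] names = {"Öljy", "Auto", "Älä"};
        Arrays.sort(names, Collator.getInstance(fi));
        System.out.println(Arrays.toString(names));
    }
}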
Sun (now our beloved Oracle) has a quite good i18n trail, which you can find here: http://download.oracle.com/javase/tutorial/i18n/index.html .
If you want to read a good book on the subject of i18n and L10n that will save you years of learning these topics (although it won't necessarily teach you how to program them), there is a great book from Microsoft Press: "Developing International Software" - http://www.amazon.com/Developing-International-Software-Dr/dp/0735615837 . It is still relevant, although quite old.

1) I usually keep everything in one file and use names that signify where the properties are used. For example, I prefix with things like "view" and "menu":
view.add_request.title
view.add_request.contact_information.sectionheader
view.add_request.contact_information.first_name.label
view.add_request.contact_information.last_name.label
menu.admin.user_management.add_user.label
menu.admin.user_management.add_role.label
2) Yes, passing around the key makes things simpler and makes the server code easier to test. It also avoids having to pass locale information to the server to have it decide on a language for the client. It's a thick client, so let it handle the localization.
3) I haven't localized data before (usually just labels and static UI verbiage), but I would probably lean towards having a single table with all the localized strings and locales to start with (just to keep it simple). I'm not sure what you're asking about in reference to the UI, but I would suggest you make sure that whatever character set you're using allows all the languages you want to support. Make sure you read Joel Spolsky's article entitled: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
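If you do go with the single table, a lookup with a fallback locale could look something like this (the table and column names here are invented, not a standard schema):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class LocalizedText {
    // Assumed table: localized_text(entity_type, entity_id, locale, text)
    private static final String QUERY =
        "SELECT text FROM localized_text "
      + "WHERE entity_type = ? AND entity_id = ? AND locale = ?";

    public static String lookup(Connection conn, String entityType, long entityId,
                                String locale, String fallbackLocale) throws SQLException {
        String text = query(conn, entityType, entityId, locale);
        // Fall back to a default locale when no translation exists yet.
        return text != null ? text : query(conn, entityType, entityId, fallbackLocale);
    }

    private static String query(Connection conn, String entityType, long entityId,
                                String locale) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(QUERY)) {
            ps.setString(1, entityType);
            ps.setLong(2, entityId);
            ps.setString(3, locale);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}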

Related

Outputting a data-driven generated graphic which can be modified by the user

I'm trying to develop a system whereby clients can input a series of plant-related data which can then be queried against a database to find a suitable list of plants.
These plants then need to be displayed in a graphic output, putting tall plants at the back and small plants at the front of a flower bed. I already have the algorithm for this in mind, but my question to you is: what would be the best software to use that:
1) Allows a user to enter in data
2) Queries a database to return suitable results
3) Outputs the data into a systemised graphic (simple rectangle with dots representing plants)
and the final step is an "if possible" and something I've not yet completely considered:
4) Allow users to move these dots using their mouse to reposition if wanted
--
I know PHP can produce graphic outputs, and I assume you could probably mix this in with a bit of jQuery which would allow the user to move the dots. Would this work well or could other software (such as Java or __) produce a better result?
Thanks and apologies if this is in the wrong section of Stack!
Your question is a bit vague. To answer it directly: any general-purpose programming language these days is able to do what you want with the right libraries - be it C/C++, Java, PHP+JavaScript, Python, Ruby, and many others.
With Java in particular, you'll probably want to use the Swing toolkit for the GUI.
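For example, a bare-bones sketch of the "rectangle with dots" output in Swing (the plant coordinates are made up):

import java.awt.Color;
import java.awt.Dimension;
import java.awt.Graphics;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.SwingUtilities;

public class FlowerBedPanel extends JPanel {
    // Hypothetical data: x/y positions in bed coordinates, tall plants at the back.
    private final int[][] plants = {{40, 20}, {120, 30}, {80, 90}};

    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);
        g.drawRect(10, 10, getWidth() - 20, getHeight() - 20); // the flower bed
        g.setColor(Color.GREEN.darker());
        for (int[] p : plants) {
            g.fillOval(10 + p[0], 10 + p[1], 8, 8); // one dot per plant
        }
    }

    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Flower bed");
            FlowerBedPanel panel = new FlowerBedPanel();
            panel.setPreferredSize(new Dimension(300, 200));
            frame.add(panel);
            frame.pack();
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        });
    }
}

Your step 4 (dragging the dots) would then be a MouseMotionListener that updates a plant's coordinates and calls repaint().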
If you know PHP+JavaScript exclusively, it's probably best for your project to stick to what you know. If, however, you see this more as a learning opportunity than a project that needs to be done NOW, you could take the time to learn a new language in the process.
As to which language to learn, everyone has a different opinion, obviously, but generally speaking, the higher-level a language is, the faster it is to prototype in.
EDIT
If you need this for a website, however, you'll need to use something web-based - that is, you'll necessarily have two programs: one that runs server-side, the other in the client (browser). On the server side you could very well use PHP, JSP (JavaServer Pages), Python or Ruby. On the client side you'll be limited to JavaScript+DOM (maybe HTML5), a Java applet, or something Flash-based.

String analysis and classification

I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: Money amount, Date, Comment and Section (e.g. Car, Salary, Computer, Food, ...)
The sections are created "on the fly". When the user enters a new section, it will be added to the Section JComboBox for further selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section?
My first thought was to do it with a neural network and learn from the input if the user selects another section.
But my problem is, I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the Sourceforge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem, and it requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc...
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80% satisfying feature, while aiming for 90% could cost a week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
String comment = paramInput.toUpperCase(Locale.ROOT); // normalize case so "Fuel" matches too
if (comment.contains("FUEL")) {
    // do the fuel functionality, e.g. preselect the "Car" section
}
In a simple app, if you will only have some specific sections, you can take the string from the comment, check whether it contains certain keywords, and set the Section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top matched category as a "best guess".
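A rough sketch of that idea (assuming Lucene 8.x - the API moves around between major versions - and with invented keyword lists):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SectionGuesser {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            // One document per section, indexed under its keywords.
            Document car = new Document();
            car.add(new StringField("section", "Car", Field.Store.YES));
            car.add(new TextField("keywords", "fuel petrol service garage tires", Field.Store.NO));
            writer.addDocument(car);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // The user's comment becomes the query; the top hit is the best guess.
            TopDocs hits = searcher.search(
                new QueryParser("keywords", new StandardAnalyzer()).parse("Fuel"), 1);
            if (hits.scoreDocs.length > 0) {
                System.out.println(searcher.doc(hits.scoreDocs[0].doc).get("section"));
            }
        }
    }
}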
P.S. Neural Network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement String matching I wouldn't even know where to start.
It seems to me that the following will do:
raw word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch"
a list of the most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so in theory a fit for a neural net, but why not go straight for the numerical best fit.
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'which of the provided pre-trained classes does the input belong to?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available to produce better results and a more accurate decision boundary / representation of the function that maps inputs to outputs. This also assumes that you know all possible classes before you start and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections beforehand,
an easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is then to find a list of each Section that the typed keywords belong to, suggest multiple Sections and let the user pick one. The feedback you get from user selections would enable better suggestions in the future. Another would be to calculate a Bayesian probability - the probability that this word belongs to Section X given the previous stored mappings - for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. The probability calculations will need to be updated as you gather more information, of course; perhaps this could be done with every new addition in a background thread.
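A minimal sketch of the counting approach (the class is invented; the counts would be rebuilt from your stored comment -> Section mappings):

import java.util.HashMap;
import java.util.Map;

public class SectionSuggester {
    // counts.get(word).get(section) = how often 'word' appeared in a comment
    // that the user filed under 'section'.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    public void learn(String comment, String section) {
        for (String word : comment.toLowerCase().split("\\W+")) {
            counts.computeIfAbsent(word, w -> new HashMap<>())
                  .merge(section, 1, Integer::sum);
        }
    }

    // Score each section by summing P(section | word) over the typed words.
    public String suggest(String comment) {
        Map<String, Double> scores = new HashMap<>();
        for (String word : comment.toLowerCase().split("\\W+")) {
            Map<String, Integer> bySection = counts.get(word);
            if (bySection == null) continue; // unknown word, skip it
            int total = bySection.values().stream().mapToInt(Integer::intValue).sum();
            bySection.forEach((section, n) ->
                scores.merge(section, n / (double) total, Double::sum));
        }
        return scores.entrySet().stream()
                     .max(Map.Entry.comparingByValue())
                     .map(Map.Entry::getKey)
                     .orElse(null);
    }
}

Call learn() whenever the user saves an entry and suggest() while they type; the non-word list above would be applied to the split words first.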

Is creating a unique HTML file for each article a good practice?

Sorry for the poor topic name, I could not think of anything better ;)
I am working on a news broadcast website project, and the stakeholder asked me to create a unique HTML file for each article and save it on disk instead of using a DBMS like MySQL, so that users can access the file directly and no computing will be needed - hence no bottleneck in that case.
And I did so.
My question is: is this (what he asked me) a good and popular practice in programming?
What are the pros and cons?
Thank you all, and sorry for my poor English writing :P
If you have a template and can generate these pages automatically, it can be a good practice. Like you say, it saves your server from having to generate the page; it only needs to serve the plain file.
And if you need to change the layout, or need to edit an article, you can just regenerate the page.
It is quite common, although lots of pages always have some dynamic content, like a date, user info or other session- or time-specific data. In that case you cannot cache the entire page. Of course you can combine both: have dynamic index pages and a dynamic front page, and only cache the actual articles themselves. But I read in your question that that is what you've done now.
Pros:
Faster retrieval of pages
Less load on your webserver
Less load on your database server
Cons:
Need to do some extra work to update the cache when an article is modified
Cannot have any dynamic content in the page
There probably isn't a problem at all. Most webservers are able to serve large amounts of dynamic pages (premature optimization is the root of all evil).
There are other ways to speed things up, that don't have the above cons. You could cache query results in Memcache and/or use APC cache to speed up your PHP code and decrease disk I/O.
But there are web hosting companies dedicated entirely to serving static content. That static content can be served from memory too, making it even faster than APC-cached dynamic content, so if you really, really, really need the performance, yes, this is the way to go. But I seriously doubt you do.
Static pages are good for small websites. If you have the chance, go for it but if you need complex operations, dynamic page structure should be the way to go.
For an article site, I'd go with dynamic pages since the concept is dynamic (You'll need to update the site, add new articles, maybe add new features like commenting, user activity etc).
It is easier to add/delete/edit an article directly from an admin panel; with static pages, you'd have to find your way through the HTML code.
The list would go on and on...
Without a half-decent templating system, you'd have to store the full article AND the page layout and styles in the one file.
This means it'd be difficult to update the look and feel across all the published articles, and if you wanted to query the article list and return a subset (such as those from a specific author or in a specific category), you'd be a bit stuck too.
If you think of it as a replacement for your database: no, that's not good practice. You lose a lot of information, editing pages later will be harder, and so will setting up indexed search functions, ...
If you think of it as a caching solution: then yes, this is good practice and also a common technique. But think about how to do the caching and when to replace the files with new versions, and only do it if you have few write accesses and a lot of read accesses to your pages (which is typical for an article site ^^)
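As a sketch of the caching variant (in Java for illustration; the paths and the render step are made up):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ArticleCache {
    private final Path cacheDir = Paths.get("cache/articles");

    // Call this whenever an article is created or edited in the admin panel:
    // re-render it through the template and overwrite the static file.
    public void regenerate(long articleId, String title, String body) throws IOException {
        Files.createDirectories(cacheDir);
        Files.write(cacheDir.resolve(articleId + ".html"),
                    render(title, body).getBytes(StandardCharsets.UTF_8));
    }

    // Stand-in for a real template engine; a real version must escape the content.
    private String render(String title, String body) {
        return "<html><head><title>" + title + "</title></head><body>"
             + body + "</body></html>";
    }
}

The web server then serves cache/articles/123.html directly, and the file simply gets overwritten on each edit.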
Definitely not a common practice, and I would not do it this way, especially not for the stated reason of avoiding a bottleneck. You won't have any bottleneck there, nor any performance problem. How many unique visitors is your site likely to get? Hundreds of thousands?
In fact, reading from the disk is more likely to be a problem. DB operations can be optimized, cached in memory, etc. - the DB server performs various optimizations. On the other hand, you read the file each time (or handle the caching yourself).
The usual and preferred way to do it is:
store and load content from DB
have a template (header + footer) for the page, and only insert the content
have an admin panel with an editor (as rich as possible) where you can modify the content of the article
I started out asking myself why a stakeholder might be asking you to implement a system this way. Why would he / she care, as long as your system meets the requirements? There are two possible answers to this:
The stakeholder is a bit of a control freak; e.g. an ex-techie who likes to interfere with what his developers do.
The stakeholder has had a bad experience in the past; e.g. with a previous system where the content was "locked into" a database with an unwieldy front end that made life hell for the users.
From this standpoint, how would you address the problem? My take is that you need to get to the bottom of why the stakeholder is asking for this. Does he have some genuine concern? Can you address that concern in the system design?
The bottom line is that "is this best practice" is not the overriding criterion here. Arguably, "what the customer wants" or "what the customer needs" are more important.
What I think you need to do is:
Find out what the stakeholder's real concern is.
Discuss with him / her (and other stakeholders) the design options that will address those concerns. Present them with the alternatives and an honest assessment of their implications, and involve them in the decision making.

Best practices in internationalizing text with lots of markup?

I'm working on a web project that will (hopefully) be available in several languages one day (I say "hopefully" because while we only have an English language site planned today, other products of my company are multilingual and I am hoping we are successful enough to need that too).
I understand that the best practice (I'm using Java, Spring MVC, and Velocity here) is to put all text that the user will see in external files, and refer to them in the UI files by name, such as:
#in messages_en.properties:
welcome.header = Welcome to AppName!
#in the markup
<title>#springMessage("welcome.header")</title>
But, having never had to go through this process on a project myself before, I'm curious what the best way to deal with this is when you have some segments of the UI that are heavy on markup, such as:
<p>We are excited to announce that Company1 has been acquired by
<a href="...">Division X</a>,
a fast-growing division of <a href="...">Company 2, Inc.</a>
(Nasdaq: BLAH), based in...
One option I can think of would be to store this "low-level" markup in messages.properties itself for the message - but this seems like the worst possible option.
Other options that I can think of are:
Store each non-markup inner fragment in messages.properties, such as acquisitionAnnounce1, acquisitionAnnounce2, acquisitionAnnounce3. This seems very tedious though.
Break this message into more reusable components, such as Company1.name, Company2.name, Company2.ticker, etc., as each of these is likely reused in many other messages. This would probably account for 80% of the words in this particular message.
Are there any best practices for dealing with internationalizing text that is heavy with markup such as this? Do you just have to bite down and bear the pain of breaking up every piece of text? What is the best solution from any projects you've personally dealt with?
Typically if you use a template engine such as Sitemesh or Velocity, you can manage these smaller HTML building blocks as subtemplates more effectively.
By doing so, you can incrementally boil the purely internationalized strings down into groups and make them relevant to those markup subtemplates. Having done this sort of work using templates for an app which spanned multiple languages in the same locale, as well as multiple locales, we never placed markup in our message bundles.
I'd suggest that a key good practice would be to avoid placing markup (even at a low level, as you put it) inside message properties files at all costs! The potential this has for unleashing hell is not something to overlook - biting the bullet and breaking things up correctly is far less of a pain than having to manage many files with scattered HTML markup. It's important that you can visualize markup as holistic chunks, and scattering it everywhere would make everyday development a chore, since:
You would lose IDE color highlighting and syntax validation
High possibility that one locale file or another can easily be missed when changes to designs / markup filter down
Breaking things down (to a realistic point, e.g. logical sentence structures but no finer) is somewhat hard work upfront, but worth the effort.
Regarding string breakdown granularity, here's a sample of what we did:
comment.atom-details=Subscribe To Comments
comment.username-mandatory=You must supply your name
comment.useremail-mandatory=You must supply your email address
comment.email.notification=Dear {0}, the comment thread you are watching has been updated.
comment.feed.title=Comments on {0}
comment.feed.title.default=Comments
comment.feed.entry.title=Comment on {0} at {1,date,medium} {2,time,HH:mm} by {3}
And the Spanish equivalents:
comment.atom-details=Suscribir a Comentarios
comment.username-mandatory=Debes indicar tu nombre
comment.useremail-mandatory=Debes indicar tu direcci\u00f3n de correo electr\u00f3nico
comment.email.notification=La conversaci\u00f3n que estas viendo ha sido actualizada
comment.feed.title=Comentarios sobre {0}
comment.feed.title.default=Comentarios
comment.feed.entry.title=Comentarios sobre {0} a {1,date,medium} {2,time,HH:mm} por {3}
So you can do interesting things with how you perform string replacement in the message bundle, which may also help you preserve its logical meaning while allowing you to manipulate it mid-sentence.
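For example, the comment.feed.entry.title pattern above is filled in with java.text.MessageFormat, which applies the locale's own date and time formats to the {1,date,...} and {2,time,...} arguments (the bundle base name "comments" is assumed here):

import java.text.MessageFormat;
import java.util.Date;
import java.util.Locale;
import java.util.ResourceBundle;

public class FeedTitles {
    public static void main(String[] args) {
        Locale es = new Locale("es");
        ResourceBundle bundle = ResourceBundle.getBundle("comments", es);
        MessageFormat fmt = new MessageFormat(bundle.getString("comment.feed.entry.title"), es);
        Date now = new Date();
        // {0}=post title, {1}=date, {2}=time, {3}=author
        System.out.println(fmt.format(new Object[]{"My Post", now, now, "Ana"}));
        // prints something like: Comentarios sobre My Post a 5 mar 2011 14:30 por Ana
    }
}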
As others have said, please never split the strings into segments. You will cause translators grief as they have to coerce their language syntax to the ad-hoc rules you inadvertently create. Often it will not be possible to provide a grammatically correct translation, especially if you reuse certain segments in different contexts.
Do not remove the markup, either.
Please do not assume professional translators work in Notepad :) Computer-aided translation (CAT) tools, such as the Trados suite, know about markup perfectly well. If the tagging is HTML, rather than some custom XML format, no special preparation is required. Trados will protect the tags from accidental modification, while still allowing changes where necessary. Note that certain elements of tags often need to be localized, e.g. alt text or some query strings, so just stripping all the markup won't do.
Best of all, unless you're working on a zero-budget personal project, consider contacting a localization vendor. Localization is a service just like web design. A competent vendor will help you pick the optimal solution/format for your project and guide you through the preparation of the source material and incorporating the localized result. And of course they and their translators will have all the necessary tools. (Full disclosure: I am a translator / localization specialist. And don't split up strings :)
First off, don't split up your strings. This makes it much harder for localizers to translate text because they can't see the entire string to translate.
I would probably try to use placeholders around the links, so the anchor markup stays out of the translated string:
We are excited to announce that Company1 has been acquired by {0}Division X{1}, ...
(with {0} and {1} replaced by the link's opening and closing tags at render time)
That's how I did it when I was localizing a site into 30 languages. It's not perfect, but it works.
I don't think it's possible (or easy) to remove all markup from strings; you need to have a way to insert the URLs and any extra markup.
You should avoid breaking up your strings. Not only does this become a nightmare to translate, but it also makes grammatical assumptions which may not be correct in the target language.
While placeholders can be helpful for many things, I would not recommend using placeholders for URLs. Keeping the URL inside the localized string allows you to customize it for different locales. After all, there's no sense sending users to an English-language page when their locale is Argentine Spanish!

Designing Address validation for app

I am planning to design address validation for users registering in my app, possibly validating by zipcode and state.
Any idea how to handle addresses from around the globe?
Do I need to insert all the zipcodes into the database and then validate the address against them? Any suggestions for the implementation?
Thanks and Welcome :)
Krisp
Since there is no international standard for zip codes, and a list of all zip codes in the world would be out of date before you were finished putting it together, I suggest a smaller approach:
Identify the countries that you will have to handle most and develop separate validation rules for each of them. Make certain that with this you handle the vast majority of your users (e.g. 95% or 98%). For all the other countries, just accept what they enter without further validation.
There are so many different address formats in the world that it is just not worth the effort (if at all possible) to handle them all.
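A sketch of that per-country approach (the patterns below are simplified and cover only a few example countries; anything unknown falls through to "accept"):

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class PostalCodeValidator {
    private static final Map<String, Pattern> RULES = new HashMap<>();
    static {
        // Simplified rules for the highest-volume countries.
        RULES.put("US", Pattern.compile("\\d{5}(-\\d{4})?"));
        RULES.put("CA", Pattern.compile("[A-Za-z]\\d[A-Za-z] ?\\d[A-Za-z]\\d"));
        RULES.put("GB", Pattern.compile("[A-Za-z]{1,2}\\d[A-Za-z\\d]? ?\\d[A-Za-z]{2}"));
        RULES.put("FI", Pattern.compile("\\d{5}"));
    }

    public static boolean isValid(String countryCode, String postalCode) {
        Pattern rule = RULES.get(countryCode);
        // No rule for this country: accept rather than reject a good address.
        return rule == null || rule.matcher(postalCode.trim()).matches();
    }
}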
There is MASSIVE variance among address and postal code formats, such that there is no "standard" way of doing this. See "Frank's Compulsive Guide to Postal Addresses"...
How much/what kind of validation do you really need? If the user is entering their shipping address, for example, they're more likely than you to know what particular format their local postal/shipping provider needs. Just give them a multiline textarea to enter it. If you need parts of it to calculate shipping costs, request just the information you need (City/Country, for example)
Postal Codes can actually be a headache because in some places they can represent very tiny areas as opposed to the US where they often represent relatively large areas (except in a big city where they may represent a few blocks).
Look at Canada, their postal codes can actually represent very very tiny areas. Two stores on opposite sides of the street often have different Canadian postal codes. Also in a list of Canadian businesses, when merging the list it is not uncommon to see the same address with a slightly different postal code. This just indicates that a lot of people get it wrong. On a customer basis I don't know how realistic it is that they actually get their exact zip code right.
http://www.columbia.edu/kermit/postal-ca.html
Basically it seems that each apartment or business dwelling may get their own zip code, which would make sense based upon what I have seen with Canadian business addresses.
The other point is that this is just Canada. Each European country will have its own address/postal code, so will Australia, Russia, etc... If you really want to do address verification, this is a major project.
To actually verify the address you need to verify the postal code, city, and street. In the US the census releases the TIGER database files, which often have a list of streets. But for other countries I don't know how you can get a list of streets. It may be best to look into a commercial package (maybe one of the GIS packages, although a lot of them only offer detailed addresses for the US/Canada and sometimes a few European countries).
Perfect address validation can't simply be dropped into an already developed application, but validation of the zip/postal code can be done per country.
Please check the regexes in the 'supplementalData.xml' file from the source XML files.
By parsing the XML you can find the corresponding postal-code regular expression for the country code passed at run-time, and check whether the input matches that country's pattern.
I have found another answer on this:
please refer to the wiki link: http://en.wikipedia.org/wiki/List_of_postal_codes.
Here you can find the zip-code patterns of most countries, from which you can write regexes and maintain them in a database; this would help you validate zip codes easily, and it is an optimized approach!
As many users have mentioned previously, verifying international addresses is basically impossible because there are no standards across countries and many countries don't have the resources for their postal system. Technically speaking, even in the United States, the USPS is struggling.
At a minimum you can offer address verification on a per-country basis. One of the easiest countries to support, and one where you get a lot of coverage, is the USA. To do this you need to connect to some kind of address verification web service. There are several companies which offer web services for this. One thing to be careful of is ensuring that each provider has geo-distribution of their API, so that any outages on their part don't flow back to you and kill your application. Beyond that, just make sure the results are CASS certified.
In the interest of full disclosure, I'm the founder of SmartyStreets. We have an address verification web service API called LiveAddress. You're more than welcome to contact me personally if you have any questions.
