How to use single quotes with MessageFormat - java

On my current project, we are using properties files for strings. Those strings are then "formatted" using MessageFormat. Unfortunately, MessageFormat handles single quotes in a way that becomes a bit of a hindrance in languages such as French, which use a lot of apostrophes.
For instance, suppose we have this entry
login.userUnknown=User {0} does not exist
When this gets translated into French, we get:
login.userUnknown=L'utilisateur {0} n'existe pas
MessageFormat does not like this...
And I do not like the following, i.e. having to double the quotes:
login.userUnknown=L''utilisateur {0} n''existe pas
The reason I don't like it is that it causes spellchecking errors everywhere.
Question: I am looking for an alternative to the instruction below, one that does not need doubled quotes but still uses positional placeholders ({0}, {1}…). Is there anything else that I can use?
MessageFormat.format(Messages.getString("login.userUnknown"), username);

No, there is no other way; according to the Javadoc, this is how it is supposed to be done:
A single quote itself must be represented by doubled single quotes '' throughout a String
As a workaround, you could do it programmatically using replace("'", "''"). For this particular use case you could also use the typographic apostrophe character ’ instead, which would actually be more correct than a straight single quote.
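To make the behaviour concrete, here is a minimal sketch of the three variants discussed above: the raw French pattern, the pattern with quotes doubled programmatically, and the pattern written with the typographic apostrophe (U+2019). The class and method names are made up for illustration.

```java
import java.text.MessageFormat;

class ApostropheDemo {
    // Doubles every straight apostrophe so MessageFormat treats it as a literal character.
    static String escapeQuotes(String pattern) {
        return pattern.replace("'", "''");
    }

    // Formats the pattern exactly as it was read from the properties file.
    static String formatRaw(String pattern, Object... args) {
        return MessageFormat.format(pattern, args);
    }

    // Formats the pattern after doubling its apostrophes.
    static String formatEscaped(String pattern, Object... args) {
        return MessageFormat.format(escapeQuotes(pattern), args);
    }

    public static void main(String[] args) {
        String pattern = "L'utilisateur {0} n'existe pas";
        // The text between the two apostrophes is taken as quoted, so {0} is not filled in.
        System.out.println(formatRaw(pattern, "bob"));
        // Doubled apostrophes come out as single literal apostrophes.
        System.out.println(formatEscaped(pattern, "bob"));
        // U+2019 is an ordinary character as far as MessageFormat is concerned.
        System.out.println(formatRaw("L\u2019utilisateur {0} n\u2019existe pas", "bob"));
    }
}
```

Note that blindly doubling quotes would break any pattern that deliberately uses quotes to escape braces (for example '{'), so this only suits bundles where apostrophes are always meant literally.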

Probably too late for you, but someone else might find this useful: Instead of Java's MessageFormat, use ICU (International Components for Unicode) (or rather its Java port ICU4J). It's basically a set of tools and data to support you in internationalizing your application. And among those tools is their own version of MessageFormat. It's very similar (maybe even backwards compatible) and can handle single quotes exactly like you want it. It can even handle doubled/escaped single quotes so you can try it as a drop-in replacement for Java's MessageFormat without having to unescape your single quotes first.

Related

Using Resourcebundle Properties in Javascript

I work on a Java EE web application that uses a combination of Dojo and plain javascript for the front-end.
We've discovered that when ResourceBundle properties are used in javascript, in some cases they end up breaking code.
Specifically, this happens when the properties contain quotes (single and double) & escape sequences (\n, \s ...).
The solution seems to be to include extra escape characters. For instance, \n needs to be preceded by one more backslash (\\n) when used in a JS alert to render the line break correctly, and quotes, if not escaped, truncate the content prematurely for obvious reasons.
Our solution to the above issues so far has been to put the extra escape characters in the property files themselves. But this is something that we would like to move away from.
It seems like this might be a widespread problem and I'd like to hear from the experts on how you might have solved this problem.
Current Usage: key=A newline is represented with \\n and this \" is within quotes \".
Envisioned Usage : key=A newline is represented with \n and this " is within quotes ".
PS: We typically use the <fmt:message> tag to access these values in the front end and for use in javascript.
Consider using StringEscapeUtils from Apache Commons Lang. It has a method to escape input like yours.
http://commons.apache.org/lang/api-2.5/org/apache/commons/lang/StringEscapeUtils.html#escapeJava(java.lang.String)
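If pulling in the library is not an option, the idea can be sketched by hand. The following is not the Commons Lang implementation, just a minimal illustration (with a hypothetical class name) of escaping a property value at output time so that the properties file itself can stay unescaped. It assumes the value is written inside a quoted JavaScript string literal.

```java
class JsStringEscaper {
    // Escapes a value so it can be placed between quotes in a JavaScript string literal.
    static String escapeForJsString(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\'': out.append("\\'");  break;
                case '\n': out.append("\\n");  break;
                case '\r': out.append("\\r");  break;
                case '\t': out.append("\\t");  break;
                // Escaping the slash keeps "</script>" from ending an inline script block.
                case '/':  out.append("\\/");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The value as it would come out of the bundle: a real newline and real quotes.
        String value = "A newline is represented with \n and this \" is within quotes \".";
        System.out.println("alert(\"" + escapeForJsString(value) + "\");");
    }
}
```

A helper like this could be called from a custom tag or EL function wrapping the bundle lookup, so translators never see the extra backslashes.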

How do you escape HTML attribute values in Java without the Owasp Library?

I've been using Apache's StringEscapeUtils for HTML entities, but if you want to escape HTML attribute values, is there a standard way to do this? I guess that using the escapeHtml function won't cut it, since otherwise why would the Owasp Encoder interface have two different methods to cope with this?
Does anyone know what is involved in escaping HTML attributes vs. entities and what to do about attribute encoding in the case that you don't have the Owasp library to hand?
It looks like this is Rule #2 of the Owasp XSS Prevention Cheat Sheet. Note the bit where it says:
Properly quoted attributes can only be escaped with the corresponding quote
Therefore, I guess so long as the attributes are correctly bounded with double or single quotes and you escape these (i.e. double quote (") becomes &quot; and single quote (') becomes &#x27; (or &#39;)) then you should be ok. Note that Apache's StringEscapeUtils.escapeHtml will be insufficient for this task since it does not escape the single quote ('); you should use String's replaceAll method to do this.
Otherwise, if the attribute is written unquoted, as in <div attr=some_value>, then you need to follow the recommendation on that page and...
escape all characters with ASCII values less than 256 with the &#xHH; format (or a named entity if available) to prevent switching out of the attribute
Not sure if there is a non-Owasp standard implementation of this though. However, I guess it's good practice not to write attributes in this manner anyway!
Note that this is only valid when you are putting in standard attribute values; if the attribute is an href or a JavaScript event handler, then it's a different story. For examples of possible XSS attacks that can occur from unsafe code inside event handler attributes see: http://ha.ckers.org/xss.html.
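For the quoted-attribute case, the escaping described above can be sketched without any library. This is only an illustration under the assumptions stated in the answer: the attribute is always wrapped in single or double quotes, and it is a plain attribute (not an href, style, or event handler). The class and method names are hypothetical.

```java
class HtmlAttributeEscaper {
    // Escapes a value for use inside a quoted HTML attribute.
    // Not sufficient for unquoted attributes, URLs, or JavaScript event handlers.
    static String escapeQuotedAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#x27;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String title = "O'Reilly \"<b>\" & co";
        System.out.println("<div title=\"" + escapeQuotedAttribute(title) + "\">");
    }
}
```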

Regex: what is InCombiningDiacriticalMarks?

The following code is a well-known way to convert accented characters into plain text:
Normalizer.normalize(text, Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
I replaced my "hand made" method with this one, but I need to understand the "regex" part of the replaceAll:
1) What is "InCombiningDiacriticalMarks"?
2) Where is it documented (along with similar properties)?
Thanks.
\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK7, you will be able to write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. It is documented here in UAX#44: “The Unicode Character Database”.
What it means is that the code point falls within a particular range, a block, that has been allocated to use for the things by that name. This is a bad approach, because there is no guarantee that the code point in that range is or is not any particular thing, nor that code points outside that block are not of essentially the same character.
For example, there are Latin letters in the \p{Latin_1_Supplement} block, like é, U+00E9. However, there are things that are not Latin letters there, too. And of course there are also Latin letters all over the place.
Blocks are nearly never what you want.
In this case, I suspect that you may want to use the property \p{Mn}, a.k.a. \p{Nonspacing_Mark}. All the code points in the Combining_Diacriticals block are of that sort. There are also (as of Unicode 6.0.0) 1087 Nonspacing_Marks that are not in that block.
That is almost the same as checking for \p{Bidi_Class=Nonspacing_Mark}, but not quite, because that group also includes the enclosing marks, \p{Me}. If you want both, you could say [\p{Mn}\p{Me}] if you are using a default Java regex engine, since it only gives access to the General_Category property.
You’d have to use JNI to get at the ICU C++ regex library the way Google does in order to access something like \p{BC=NSM}, because right now only ICU and Perl give access to all Unicode properties. The normal Java regex library supports only a couple of standard Unicode properties. In JDK7 though there will be support for the Unicode Script property, which is just about infinitely preferable to the Block property. Thus you can in JDK7 write \p{Script=Latin} or \p{SC=Latin}, or the short-cut \p{Latin}, to get at any character from the Latin script. This leads to the very commonly needed [\p{Latin}\p{Common}\p{Inherited}].
Be aware that that will not remove what you might think of as “accent” marks from all characters! There are many it will not do this for. For example, you cannot convert Đ to D or ø to o that way. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.
Another place where the \p{Mn} thing fails is of course enclosing marks like \p{Me}, obviously, but also there are \p{Diacritic} characters which are not marks. Sadly, you need full property support for that, which means JNI to either ICU or Perl. Java has a lot of issues with Unicode support, I’m afraid.
Oh wait, I see you are Portuguese. You should have no problems at all then if you only are dealing with Portuguese text.
However, you don’t really want to remove accents, I bet, but rather you want to be able to match things “accent-insensitively”, right? If so, then you can do so using the ICU4J (ICU for Java) collator class. If you compare at the primary strength, accent marks won’t count. I do this all the time because I often process Spanish text. I have an example of how to do this for Spanish sitting around here somewhere if you need it.
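As a rough illustration of the \p{Mn} suggestion and its limits, the sketch below decomposes the text and removes nonspacing marks using only the standard library. The class name is made up. It shows that precomposed accented letters are handled, while letters such as Đ and ø, which have no canonical decomposition, pass through unchanged.

```java
import java.text.Normalizer;

class AccentStripper {
    // Decomposes the text (NFD) and then removes every nonspacing mark (General_Category Mn).
    static String stripMarks(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        // U+00E3 (a with tilde) decomposes into 'a' plus a combining tilde, which is removed.
        System.out.println(stripMarks("S\u00E3o Jo\u00E3o"));
        // U+0110 (D with stroke) and U+00F8 (o with stroke) do not decompose, so nothing changes.
        System.out.println(stripMarks("\u0110\u00F8"));
    }
}
```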
Took me a while, but I fished them all out:
Here's a regex that should include all the zalgo chars, including ones that fall outside the 'normal' range.
([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62])
Hope this saves you some time.

Java String quote delimiter

Is there any way in Java to use a special delimiter at the start and the end of a String to avoid having to backslash all of the quotes within that String?
i.e. not have to do this:
String s = "Quote marks like this \" are just the best, here are a few more \" \" \"";
No, there is no such option. Sorry.
No - there's nothing like C#'s verbatim string literals or Groovy's slashy strings, for example.
On the other hand, it's the kind of feature which may be included in the future. It's not like it would require any fundamental changes in the type system. I'd be hugely surprised for it to make it into Java 7 this late in the day though, and I haven't seen any suggestions that it'll be in Java 8... so you're in for a long wait :(
The only way to achieve this is to put your strings in some other file and read them from Java, for instance from a resource bundle.
It's not possible as of now, and may never be in the future either.
If you can tell us what you are looking for and why you want this kind of feature, we can definitely suggest some more alternatives.
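The "put it in another file" suggestion could look roughly like the sketch below. A StringReader stands in for the real properties file so the example is self-contained, and the class, method, and key names are made up for illustration. The point is that double quotes need no escaping inside a properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class QuotedTextFromProperties {
    // Parses text in .properties syntax and returns the value stored under the given key.
    static String load(String propertiesText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // In a real file this line would simply read:
        // quote=Quote marks like this " are just the best
        String fileContent = "quote=Quote marks like this \" are just the best";
        System.out.println(load(fileContent, "quote"));
    }
}
```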

Should I use java.text.MessageFormat for localised messages without placeholders?

We are localising the user-interface text for a web application that runs on Java 5, and have a dilemma about how we output messages that are defined in properties files - the kind used by java.util.Properties.
Some messages include a placeholder that will be filled using java.text.MessageFormat. For example:
search.summary = Your search for {0} found {1} items.
MessageFormat is annoying, because a single quote is a special character, despite being common in English text. You have to type two for a literal single quote:
warning.item = This item''s {0} is not valid.
However, three-quarters of the application's 1000 or so messages do not include a placeholder. This means that we can output them directly, avoiding MessageFormat, and leave the single quotes alone:
help.url = The web page's URL
Question: should we use MessageFormat for all messages, for consistent syntax, or avoid MessageFormat where we can, so most messages do not need escaping?
There are clearly pros and cons either way.
Note that the API documentation for MessageFormat acknowledges the problem and suggests a non-solution:
The rules for using quotes within message format patterns unfortunately have shown to be somewhat confusing. In particular, it isn't always obvious to localizers whether single quotes need to be doubled or not. Make sure to inform localizers about the rules, and tell them (for example, by using comments in resource bundle source files) which strings will be processed by MessageFormat.
Just write your own implementation of MessageFormat without this annoying feature. You may look at the code of SLF4J Logger.
They have their own version of a message formatter, which can be used as follows:
logger.debug("Temperature set to {}. Old temperature was {}.", t, oldT);
Empty placeholders could be used with the default ordering, and numbered ones for localization cases where different languages permute words or parts of sentences.
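A home-grown formatter along these lines does not need much code. The following is a minimal sketch, not SLF4J's implementation: it supports only numbered placeholders such as {0} and {1}, leaves single quotes completely alone, and has none of MessageFormat's number, date, or choice formatting. The class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainPlaceholderFormat {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\d+)\\}");

    // Replaces {n} with String.valueOf(args[n]); quotes have no special meaning.
    static String format(String pattern, Object... args) {
        Matcher m = PLACEHOLDER.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1));
            // Placeholders without a matching argument are left in the output as-is.
            String replacement = index < args.length ? String.valueOf(args[index]) : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("This item's {0} is not valid.", "name"));
        System.out.println(format("L'utilisateur {0} n'existe pas", "bob"));
    }
}
```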
In the end we decided to side-step the single quote problem by always using ‘curly’ quotes:
warning.item = This item\u2019s {0} is not valid.
Use the ` character instead of ' for quoting. We use it all the time without problems.
Use MessageFormat only when you need it; otherwise it only bloats the code and adds no extra value.
In my opinion, consistency is important for this sort of thing. Properties files and MessageFormat already have lots of limitations. If you find these troubling you could "compile" your properties files to generate properly-formed ones. But I'd say go with using MessageFormat everywhere. This way, as you maintain the code you don't need to worry about which strings are formatted and which aren't. It becomes simpler to deal with, since you can hand off message processing to a library and not worry about the details at a high level.
Another alternative: when loading the properties file, just wrap the InputStream in a FilterInputStream that doubles up every single quote.
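One way that filter might be written is sketched below. It is an illustration under a few assumptions: the file is loaded through Properties.load(InputStream), which reads ISO-8859-1 so the byte 0x27 is always an apostrophe; every loaded message is later passed through MessageFormat; and the file contains no quotes that are already doubled. The class and helper names are made up.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

class QuoteDoublingInputStream extends FilterInputStream {
    // True when an apostrophe was just returned and its duplicate is still owed.
    private boolean pendingQuote;

    QuoteDoublingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pendingQuote) {
            pendingQuote = false;
            return '\'';
        }
        int b = super.read();
        if (b == '\'') {
            pendingQuote = true;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        int count = 0;
        while (count < len) {
            // After the first byte, stop rather than block if nothing more is ready.
            if (count > 0 && !pendingQuote && in.available() <= 0) {
                break;
            }
            int b = read();
            if (b < 0) {
                break;
            }
            buf[off + count++] = (byte) b;
        }
        return count == 0 ? -1 : count;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset would lose track of the pending quote
    }

    // Loads a properties file with every apostrophe doubled, ready for MessageFormat.
    static Properties loadForMessageFormat(InputStream in) {
        Properties props = new Properties();
        try (InputStream doubled = new QuoteDoublingInputStream(in)) {
            props.load(doubled);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

With this in place, translators can write L'utilisateur {0} n'existe pas in the file, and the loaded value is the doubled form that MessageFormat expects.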
