BreakIterator API Java

The documentation for BreakIterator.getWordInstance() shows an overload that takes a Locale parameter, presumably because results can differ between locales for the factory methods (getWordInstance, getLineInstance, getSentenceInstance, getCharacterInstance).
But when I omit this parameter, I get the same results as when I call it with any Locale returned by getAvailableLocales().
Is there some pattern, String, or Locale that actually causes these methods to give different results?

I believe all "western" languages have the same rules.
A cursory scan shows that locale th (Thai) has its own rules, given in the file /sun/text/resources/th/WordBreakIteratorData_th inside .../jre/lib/ext/localedata.jar.
It's a binary file, so I don't know what it says, and even if I could understand the file, not knowing Thai, I still wouldn't understand it.
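One way to check this empirically is to collect the boundary offsets for the same text under every available locale and report the ones that differ from a reference. The sketch below is only an illustration: the class name BreakBoundaries, the helper wordBoundaries, and the sample text are assumptions, and which locales (if any) differ depends on the JDK and its installed locale data.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper for comparing BreakIterator output across locales. */
final class BreakBoundaries {

    /** Returns every word-boundary offset the given locale's iterator reports for the text. */
    static List<Integer> wordBoundaries(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<Integer> offsets = new ArrayList<>();
        // first() returns the first boundary; next() returns DONE when exhausted.
        for (int pos = it.first(); pos != BreakIterator.DONE; pos = it.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // The sample text is an assumption; pass your own as the first argument.
        String text = args.length > 0 ? args[0] : "Hello world";
        List<Integer> reference = wordBoundaries(text, Locale.ROOT);
        System.out.println("reference (Locale.ROOT): " + reference);
        for (Locale locale : BreakIterator.getAvailableLocales()) {
            List<Integer> offsets = wordBoundaries(text, locale);
            if (!offsets.equals(reference)) {
                System.out.println(locale + " differs: " + offsets);
            }
        }
    }
}
```

Running it with a Thai sentence as the argument is the obvious experiment given the answer above, since Thai is written without spaces between words.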

Related

How to find implicit usages of toString() method for specific type

Recently I changed all places in my code where passwords were processed from String to a new Password class. The Password.toString() method now just prints [********]. If I want to get the password I have to call Password.getPassword(). This way I can be sure that no password will ever accidentally be written into log files.
During the change I missed some lines that look like this:
String.format( "user:%s", password );
Before my change, password was of type String, so it was formatted as desired. After my change, password is rendered as [********]. That's exactly what I intended, but now I would like to find all these places automatically.
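For reference, the situation described above can be reproduced with a minimal sketch. The Password class here is a guess at the shape of the class in the question (only toString() and getPassword() are named there); the field name and constructor are assumptions.

```java
/** Minimal stand-in for the Password class described in the question. */
final class Password {
    private final String value; // assumed field name

    Password(String value) {
        this.value = value;
    }

    /** The only way to obtain the real password. */
    String getPassword() {
        return value;
    }

    /** Masked on purpose, so the password never leaks into logs. */
    @Override
    public String toString() {
        return "[********]";
    }

    public static void main(String[] args) {
        Password password = new Password("s3cret");
        // %s calls toString() on its argument implicitly, so this compiles
        // without any explicit toString() call for a tool to search for.
        System.out.println(String.format("user:%s", password));
        // The explicit accessor yields the real value.
        System.out.println(String.format("user:%s", password.getPassword()));
    }
}
```

The implicit call happens inside the formatter at runtime, which is why a text search for toString() does not find these places.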
I tried findbugs with the fb-contrib plugin (ITU_INAPPROPRIATE_TOSTRING_USE) but it did not find the implicit use of Password.toString() within String.format().
Does anybody know of another findbugs contrib project or any other static code analyzer that could find these code places?
ITU only flags uses of toString() where you depend on the content of the returned value, say by parsing it for some value or using it to build up some other string. The detector is there to tell you that you shouldn't rely on the contents of toString(), since in many cases it should be considered 'debugging' output (as you see here).
There would be way too many false positives if it flagged every use of toString(), since calling it is always valid.
And no, I don't know of any such detector, although it would be pretty trivial to write one.

Purpose of String.toLowerCase() with default locale?

Java has two overloads each for String.toLowerCase and toUpperCase. One of the overloads takes a Locale as a parameter while the other one takes no parameters and uses the default locale (Locale.getDefault()).
The parameterless variants might not work as expected because case conversion respects internationalization, and the default locale is system dependent. Most notably, the lower case i is converted to an upper case dotted İ in the Turkish locale.
What is the purpose of these methods? Do the parameterless variants have any legitimate use? Or perhaps they were just a design mistake? (Not unlike several I/O APIs that use the system default character encoding by default.)
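The Turkish case is easy to reproduce. In the sketch below the class name and the sample strings are assumptions; Locale.ROOT is used as the locale-neutral comparison.

```java
import java.util.Locale;

/** Demonstrates how the Turkish locale changes case conversion of 'i' and 'I'. */
final class TurkishCaseDemo {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    public static void main(String[] args) {
        // Locale-neutral conversion: i <-> I
        System.out.println("title".toUpperCase(Locale.ROOT));
        System.out.println("TITLE".toLowerCase(Locale.ROOT));

        // Turkish conversion: i -> dotted capital I (U+0130), I -> dotless small i (U+0131)
        System.out.println("title".toUpperCase(TURKISH));
        System.out.println("TITLE".toLowerCase(TURKISH));

        // The parameterless overloads follow whatever the default locale happens to be.
        System.out.println(Locale.getDefault() + ": " + "title".toUpperCase());
    }
}
```

Code that compares identifiers, protocol keywords, or file extensions case-insensitively usually wants the Locale.ROOT form, because its result does not change with the machine's settings.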
I think they're just convenience methods that will work most of the time, since apps that really need I18n are probably a small minority in the universe of Java apps in the world.
If you hardcode a Unix path for a File name in a Java program and try to run it on a Windows box, you will also get wrong results, and it's not Java's fault.
I guess that's an implementation of the write once, run anywhere principle.
It makes sense, because you can provide the default locale at JVM startup as one of the runtime parameters.
Furthermore, the Java runtime has a bunch of similar formatting classes for dates and numbers (SimpleDateFormat, NumberFormat, etc.).
Several blog posts suggest that default locales and charsets indeed were a design mistake and have no meaningful use.

Possible to dynamically customize locale in Java?

I have a situation in which a client, for obscure reasons, wants a specific locale to be in place, with one modification: month names that are lower case in that locale should be shown in upper case (which is not a standard variant of the locale in question). I already have SimpleDateFormat code in place referencing an instance of Locale.
My question is whether it is possible to dynamically construct an instance of Locale based on a designated country code, but with specifically given modifications. Or, alternatively, whether it is possible to build a Locale instance from scratch, specifying all details at runtime, such that a SimpleDateFormat referencing it would change its casing of months accordingly.
Thanks in advance.
The Javadoc for LocaleServiceProvider should get you started.
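If the goal is only upper-case month names, a lighter alternative to a LocaleServiceProvider implementation may be enough: copy the locale's DateFormatSymbols, upper-case the month arrays, and hand the modified symbols to SimpleDateFormat. Note that this is a different technique from the one the answer points to. The class name, helper names, pattern, and sample date below are assumptions.

```java
import java.text.DateFormatSymbols;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

/** Sketch: a SimpleDateFormat whose month names are upper-cased for a given locale. */
final class UpperCaseMonths {

    /** Builds a formatter that uses the locale's symbols with upper-cased month names. */
    static SimpleDateFormat upperCaseMonthFormat(String pattern, Locale locale) {
        // getInstance returns a fresh copy, so modifying it affects only this formatter.
        DateFormatSymbols symbols = DateFormatSymbols.getInstance(locale);
        symbols.setMonths(upperCase(symbols.getMonths(), locale));
        symbols.setShortMonths(upperCase(symbols.getShortMonths(), locale));
        return new SimpleDateFormat(pattern, symbols);
    }

    private static String[] upperCase(String[] names, Locale locale) {
        String[] result = new String[names.length];
        for (int i = 0; i < names.length; i++) {
            // Upper-case with the same locale so language-specific rules apply.
            result[i] = names[i].toUpperCase(locale);
        }
        return result;
    }

    public static void main(String[] args) {
        Date date = new GregorianCalendar(2011, Calendar.JANUARY, 15).getTime();
        System.out.println(upperCaseMonthFormat("d MMMM yyyy", Locale.FRENCH).format(date));
    }
}
```

This changes only formatters built through the helper; code that constructs its own SimpleDateFormat from the Locale is unaffected, which is where a provider-based approach would be needed.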

Parsing Joda-Time Partials

I'd like to produce Partials from Strings, but can't find anything in the API that supports that. Obviously, I can write my own parser outside of the Joda-Time framework and create the Partials, but I can't imagine that the API doesn't already have the ability to do this.
Use of threeten (JSR-310) would be an acceptable solution, but it doesn't seem to support Partials. I don't know whether that is due to its alpha status, or whether the Partial concept is handled in a different manner, which I haven't discovered.
What is the best way to convert a String (2011, 02/11, etc) into a Partial?
I've extended DateTimeParserBucket. My extended class intercepts calls to the saveField() methods, and stores the field type and value before delegating to super. I've also implemented a method that uses those stored field values to create a Partial.
I'm able to pass my bucket instance to DateTimeParser.parseInto(), and then ask it to create the Partial.
It works, but I can't say I'm impressed with Joda-Time - given that it doesn't support parsing Partials out of the box. The lack of DateTimeFormatter.parsePartial(String) is a glaring omission.
You have to start by defining the valid formats for the Partials you will accept. There is no class which will just take text and infer the best possible match for a Partial; that is far too subjective, depending on locale, user preference, etc. So there's no way around making a list of all the valid input formats. It will be very difficult to make these mutually exclusive, so they need priorities. For example, you might want MM/dd and MM/yy to both be valid formats. If I give you the string 02/11, which one should have priority?
Once you've determined exactly the valid formats, you should use DateTimeFormat.forPattern to create a DateTimeFormatter for each one. Then you can use each formatter to try to parseInto a MutableDateTime. Then, go through each field in the MutableDateTime and transfer the value into a Partial.
Unfortunately, there is no better way to handle this in the Joda library.
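The "list of formats in priority order" idea can be sketched as follows. To keep the example free of external dependencies it uses java.time (the API that JSR-310 became) rather than Joda-Time, so it is not the Joda code the answer describes: a parsed TemporalAccessor that carries only the fields present in the input stands in for a Partial. The class name, method name, and the three patterns are assumptions.

```java
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.TemporalAccessor;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Sketch: try a prioritized list of formats and keep only the fields that were parsed. */
final class PartialParser {

    // Order expresses priority: for "02/11", MM/dd is tried (and wins) before MM/uu.
    private static final List<DateTimeFormatter> FORMATS = Arrays.asList(
            DateTimeFormatter.ofPattern("uuuu"),   // year only, e.g. 2011
            DateTimeFormatter.ofPattern("MM/dd"),  // month and day
            DateTimeFormatter.ofPattern("MM/uu")); // month and two-digit year

    /** Returns the fields from the first format that accepts the whole text. */
    static Optional<TemporalAccessor> parsePartial(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(format.parse(text));
            } catch (DateTimeParseException e) {
                // This format does not match; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parsePartial("2011"));
        System.out.println(parsePartial("02/11"));
        System.out.println(parsePartial("not a date"));
    }
}
```

The caller can ask the result which fields it supports (for example ChronoField.YEAR or ChronoField.MONTH_OF_YEAR) and read only those.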
The ISODateTimeFormat class allows partial printing. As you say, there is no parsing method on DateTimeFormatter (although you can parse to a LocalDate and interpret that).
ThreeTen/JSR-310 has the DateTimeFields class, which replaces Partial. Parsing of partials into a CalendricalMerger is supported; however, that may not be convertible back into a DateTimeFields yet.

How to get string.format to complain at compile time

The compiler has access to the format string AND the required types and parameters. So I assume there would be some way to indicate missing parameters for the varargs, even if only for a subset of cases. Is there some way for Eclipse or another IDE to indicate that the varargs passed might cause a problem at runtime?
It looks as if FindBugs can solve your problem. There are some warning categories related to format strings.
http://www.google.com/search?q=%2Bjava+%2Bprintf+%2Bfindbugs
http://findbugs.sourceforge.net/bugDescriptions.html#VA_FORMAT_STRING_MISSING_ARGUMENT
The Java compiler doesn't have any built-in semantic knowledge of String.format parameters, so it can't check them at compile time. For all it knows, String is just another class and String.format is just another method, and the given format string is just another string like any other.
But yeah, I feel your pain, having come across these same problems in the past couple of days. What they ought to have done is make it 'less careful' about the number of parameters, and just leave trailing %s markers un-replaced.
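To make the failure mode concrete: a call with too few arguments compiles cleanly and only fails when it runs, with a MissingFormatArgumentException. The class and helper names in this sketch are assumptions.

```java
import java.util.MissingFormatArgumentException;

/** Shows that argument-count mismatches in String.format surface only at runtime. */
final class FormatArityDemo {

    /** Returns true when formatting fails because the format string needs more arguments. */
    static boolean failsAtRuntime(String format, Object... args) {
        try {
            String.format(format, args);
            return false;
        } catch (MissingFormatArgumentException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // Compiles fine: to the compiler this is just a String plus varargs.
        System.out.println(failsAtRuntime("%s and %s", "a"));
        // Enough arguments: no exception.
        System.out.println(failsAtRuntime("%s and %s", "a", "b"));
        // Surplus arguments are silently ignored by the formatter.
        System.out.println(failsAtRuntime("%s", "a", "b"));
    }
}
```

This is the gap a static analyzer such as the FindBugs check linked above fills, since it inspects the literal format string against the argument list.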
