Java has two overloads each for String.toLowerCase and toUpperCase. One of the overloads takes a Locale as a parameter while the other one takes no parameters and uses the default locale (Locale.getDefault()).
The parameterless variants might not work as expected because case conversion respects internationalization, and the default locale is system dependent. Most notably, the lower case i is converted to an upper case dotted İ in the Turkish locale.
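For example, this little program demonstrates the surprise (the class name is just for illustration):

    import java.util.Locale;

    public class TurkishCase {
        public static void main(String[] args) {
            Locale turkish = new Locale("tr", "TR");
            System.out.println("i".toUpperCase(turkish));        // İ (dotted capital I)
            System.out.println("I".toLowerCase(turkish));        // ı (dotless small i)
            System.out.println("i".toUpperCase(Locale.ENGLISH)); // I
        }
    }

Run under a Turkish default locale, the parameterless variants behave like the first two lines.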
What is the purpose of these methods? Do the parameterless variants have any legitimate use? Or perhaps they were just a design mistake? (Not unlike several I/O APIs that use the system default character encoding by default.)
I think they're just convenience methods that will work most of the time, since apps that really need i18n are probably a small minority of all the Java apps in the world.
If you hardcode a Unix path as a File name in a Java program and try to run it on a Windows box, you will also get wrong results, and that's not Java's fault.
I guess that's an application of the "write once, run anywhere" principle.
It makes sense, because you can provide the default locale at JVM startup as one of the runtime parameters.
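For instance, here is a minimal sketch (the class name is hypothetical) of both ways to control the default locale:

    // Started with explicit locale settings, e.g.:
    //   java -Duser.language=tr -Duser.country=TR LocaleDemo
    import java.util.Locale;

    public class LocaleDemo {
        public static void main(String[] args) {
            System.out.println(Locale.getDefault());
            // The default can also be changed programmatically at runtime:
            Locale.setDefault(new Locale("tr", "TR"));
            System.out.println("TITLE".toLowerCase()); // "tıtle" under the Turkish default
        }
    }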
Furthermore, the Java runtime has a bunch of similarly locale-sensitive formatting classes for dates and numbers (SimpleDateFormat, NumberFormat, etc.).
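Their parameterless factories are just as system dependent as toLowerCase(). A small sketch:

    import java.text.NumberFormat;
    import java.util.Locale;

    public class DefaultLocaleFormats {
        public static void main(String[] args) {
            double d = 1234567.89;
            // Uses Locale.getDefault(): output varies by system.
            System.out.println(NumberFormat.getInstance().format(d));
            // Explicit locales give reproducible output:
            System.out.println(NumberFormat.getInstance(Locale.US).format(d));      // 1,234,567.89
            System.out.println(NumberFormat.getInstance(Locale.GERMANY).format(d)); // 1.234.567,89
        }
    }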
Several blog posts suggest that default locales and charsets indeed were a design mistake and have no meaningful use.
I'm reading Java: A Beginner's Guide by Schildt, and in the I/O chapter, when using the FileWriter class, he uses the constructor FileWriter(String filename, Charset charset). For the charset he passes System.console().charset().
However, VSCode tells me that the method charset() is undefined for the Console object...
Is there a way to obtain the charset of the console?
This book is showing you how to do it the modern way. Whilst java-the-ecosystem makes a point of not changing everything every other year, unlike some other programming-language ecosystems, this did change: System.console().charset() was introduced in JDK 17, and UTF-8 became the default charset in JDK 18. [1]
Starting with JDK 18, this is the 'model' that the JVM uses to deal with charset encoding issues:
Every constructor and method in Java that ends up converting bytes to characters, or vice versa, necessarily uses a charset (you need one to convert between bytes and characters; you can't not have one), and they all have an overload, in all versions of Java: you can either specify no charset, in which case 'the default' is used, or you can specify an explicit one. Starting with JDK 18, all these methods use UTF-8 as that default, whether your host OS uses UTF-8 as its default charset or not. This is the change: prior to JDK 18, 'the default charset' meant the host OS charset; from JDK 18 on, it means UTF-8.
In the unlikely case that you want to write data in the charset of the host OS, there is System.console().charset() (added in JDK 17). Therefore, new FileWriter("file.txt") writes in UTF-8, whereas new FileWriter("file.txt", System.console().charset()) writes in the host OS charset, whatever that might be. On Linux that is usually UTF-8, so there is no difference; on Windows it is more often something like Cp1252.
That's a change. Prior to JDK 18, it worked differently:
The use-the-default-charset methods/constructors, such as new FileWriter(fileName), used the host OS charset. If you wanted UTF-8, you had to write new FileWriter(fileName, StandardCharsets.UTF_8).
There was no System.console().charset() method; it didn't exist at all.
As an exception, all methods in the newer files API (java.nio.file) have always defaulted to UTF-8, even on these older JDKs. (Both points are sketched below.)
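A small sketch of both points, assuming at least JDK 11 (where the FileWriter charset overload appeared; the file names are illustrative):

    import java.io.FileWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class CharsetDemo {
        public static void main(String[] args) throws Exception {
            // Passing the charset explicitly behaves identically on every JDK:
            try (Writer w = new FileWriter("utf8.txt", StandardCharsets.UTF_8)) {
                w.write("héllo");
            }
            // The java.nio.file API defaults to UTF-8 regardless of JDK version:
            try (Writer w = Files.newBufferedWriter(Path.of("nio.txt"))) {
                w.write("héllo");
            }
        }
    }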
Clearly, then, the JDK you told VSCode to use predates JDK 17: it is telling you that System.console().charset() does not exist, and that method was only introduced in JDK 17.
Unfortunately (I raised this point on the OpenJDK mailing lists; I don't think any OpenJDK team member cares, as nobody has taken any action on it), this means it is impossible to write code that works both pre- and post-JDK 17 without jumping through some crazy hoops, if you want to write data explicitly in the host OS charset.
As you said, you're a beginner, so presumably you don't particularly care about writing data in the host OS charset in a way that compiles on both pre- and post-JDK 17.
Thus, pick a solution. Either one will work:
Upgrade to JDK 17 or newer.
Use new FileWriter(fileName), without adding that charset.
I'd pick option #1 - if the book did this here, it is likely to use other things introduced in JDK 17 as well.
NB: For a beginner's book, worrying about charset encoding is a bizarre choice, especially considering that they evidently thought it was okay to use the obsolete FileWriter, presumably to keep things simple. It's like someone explaining how to drive a car taking a moment to explain how a carburettor works (way too much detail; you don't need to know that when learning to drive until much later) while handwaving away how to fuel the car, which is relevant much earlier. A bizarre choice - consider it one minor demerit against this book, and know that this charset malarkey is not what you should be worrying about right now, if your aim is to learn Java.
[1] Congratulations - you are using a rather modern book. That's good news - a common problem is that tutorials are 20 years behind the times and end up showing you outdated ways of doing things. "Too new" is a better problem to have than "too old".
The documentation for BreakIterator.getWordInstance() offers an overload taking a Locale parameter, presumably because different locales can produce different results for the factory methods (getWordInstance, getLineInstance, getSentenceInstance, getCharacterInstance).
But when I omit this parameter, I still get the same results as when calling it with any Locale from getAvailableLocales().
Is there some pattern, String, or Locale which actually causes these methods to give different results?
I believe all "western" languages have the same rules.
A cursory scan shows that the locale th (Thai) has its own rules, given in the file /sun/text/resources/th/WordBreakIteratorData_th inside .../jre/lib/ext/localedata.jar.
It's a binary file, so I don't know what it says, and even if I could understand the file, not knowing Thai, I still wouldn't understand it.
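A quick way to experiment; the sample string is Thai for "hello" and is written, as Thai is, without spaces, so word boundaries must come from the locale's dictionary data rather than from whitespace:

    import java.text.BreakIterator;
    import java.util.Locale;

    public class ThaiBreaks {
        public static void main(String[] args) {
            String thai = "สวัสดีครับ";
            BreakIterator it = BreakIterator.getWordInstance(new Locale("th"));
            it.setText(thai);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                System.out.println(thai.substring(start, end));
            }
        }
    }

On a JDK that ships the Thai break data, this should print more than one chunk; swapping in Locale.US should change the result.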
I was frustrated recently by this question, where the OP wanted to change the format of the output depending on a feature of the number being formatted.
The natural mechanism would be to construct the format dynamically, but because PrintStream.format takes a String instead of a CharSequence, the construction must end by building a String.
It would have been so much more natural and efficient to build a class implementing CharSequence that provides the dynamic format on the fly, without having to create yet another String.
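Something like this sketch, for instance (all names here are mine, not from any library): a CharSequence that zero-pads a number on the fly, never materializing an intermediate String until someone insists on one:

    // Hypothetical example: a lazily zero-padded view of a non-negative number.
    final class ZeroPadded implements CharSequence {
        private final String digits;
        private final int width;

        ZeroPadded(long value, int width) {
            this.digits = Long.toString(value); // sketch ignores negative values
            this.width = Math.max(width, digits.length());
        }

        @Override public int length() { return width; }

        @Override public char charAt(int index) {
            int pad = width - digits.length();
            return index < pad ? '0' : digits.charAt(index - pad);
        }

        @Override public CharSequence subSequence(int start, int end) {
            return toString().subSequence(start, end); // good enough for a sketch
        }

        @Override public String toString() {
            return new StringBuilder(this).toString();
        }
    }

A Writer.append(CharSequence) can consume this without an extra copy; the irony is that PrintStream.format cannot.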
This seems to be a common theme in the Java libraries, where the default is to require a String even though immutability is not a requirement. I am aware that keys in Maps and Sets should generally be immutable, for obvious reasons, but as far as I can see String is used far too often where a CharSequence would suffice.
There are a few reasons.
In a lot of cases, immutability is a functional requirement. For example, you've identified that a lot of collections / collection types will "break" if an element or key is mutated.
In a lot of cases, immutability is a security requirement. For instance, in an environment where you are running untrusted code in a sandbox, any case where untrusted code could pass a StringBuilder instead of a String to trusted code is a potential security problem1 (a short sketch of the hazard follows the footnote below).
In a lot of cases, the reason is backwards compatibility. The CharSequence interface was introduced in Java 1.4; Java APIs that predate Java 1.4 do not use it. Furthermore, changing a preexisting method that uses String to use CharSequence risks binary compatibility issues; i.e. it could prevent old compiled Java code from running on a newer JVM.
In the remaining cases it could simply be "too much work, too little time". Making changes to existing standard APIs involves a lot of effort to make sure that the change will be acceptable to everyone (e.g. checking for the issues above) and convincing everyone that it will all be OK. Work has to be prioritized.
So while you find this frustrating, it is unavoidable.
1 - This would leave the Java API designer with an awkward choice. Does he/she write the API to make (expensive) defensive copies whenever it is passed a mutable "string", possibly changing the semantics of the API (from the user's perspective!)? Or does he/she label the API as "unsafe for untrusted code" ... and hope that developers notice / understand?
Of course, when you are designing your own APIs for your own reasons, you can make the call that security is not an issue. The Java API designers are not in that position. They need to design APIs that work for everyone. Using String is the simplest / least risky solution.
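To make footnote 1 concrete, here is a hedged sketch of the check-then-use hazard (openFile is a hypothetical trusted operation, not a real API):

    public class ToctouSketch {
        static void openIfAllowed(CharSequence path) {
            // A malicious caller passes a StringBuilder and mutates it from
            // another thread between the two toString() snapshots below.
            if (path.toString().startsWith("/tmp/")) {  // check one snapshot
                openFile(path.toString());              // use a later snapshot
            }
        }
        static void openFile(String path) {
            System.out.println("opening " + path);      // stub for the sketch
        }
        public static void main(String[] args) {
            openIfAllowed(new StringBuilder("/tmp/ok"));
        }
    }

The defensive-copy fix is to call toString() once and use that single snapshot for both the check and the use.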
See http://docs.oracle.com/javase/6/docs/api/java/lang/CharSequence.html
Do you notice the part that explains that it has been around since 1.4? Previously, all the API methods used String (which has been around since 1.0).
I have a situation in which a client, for obscure reasons, wants a specific locale to be in place, except for the modification that month names which are lower case in that locale should be shown in upper case (which is not a standard variant of the locale in question). I already have SimpleDateFormat code in place referencing an instance of Locale.
My question is whether it is possible to dynamically construct an instance of Locale based on a designated country code, but with specific modifications? Or, alternatively, whether it is possible to build a Locale instance from scratch, specifying all details at runtime, such that a SimpleDateFormat referencing it would change its casing of months accordingly?
Thanks in advance.
The Javadoc for LocaleServiceProvider should get you started.
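If a full provider is more than you need, a lighter-weight sketch using DateFormatSymbols may suffice (the Danish locale and the pattern here are illustrative assumptions, not taken from your code):

    import java.text.DateFormatSymbols;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;

    public class UpperCaseMonths {
        public static void main(String[] args) {
            Locale locale = new Locale("da", "DK"); // Danish month names are lower case
            DateFormatSymbols symbols = DateFormatSymbols.getInstance(locale);
            String[] months = symbols.getMonths();
            for (int i = 0; i < months.length; i++) {
                months[i] = months[i].toUpperCase(locale);
            }
            symbols.setMonths(months);
            SimpleDateFormat format = new SimpleDateFormat("d. MMMM yyyy", symbols);
            System.out.println(format.format(new Date())); // e.g. "5. JANUAR 2015"
        }
    }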
I am looking for a way to add more Locales to the Locales available in Java 1.6. But the Locales I want to create have neither ISO-3166 country codes nor ISO-639 language codes. Is there any way to do this anyway? The Locales I want to add differ only in the language names, but the smaller an ethnic group is, the more picky it gets about its identity ;-)
So I thought about extending an existing Locale, something like
    class UserDefinedLocale extends Locale {
        UserDefinedLocale(Locale parentLocale) { ... }
    }
but java.util.Locale is final, which makes it especially hard to hack around...
So, is the idea that the list of Java Locales is exhaustive? Am I the first to miss some more Locales?
Read the javadoc for java.util.Locale.
It says :
"Create a Locale object using the constructors in this class: "
It also says :
"Because a Locale object is just an identifier for a region, no validity check is performed when you construct a Locale"
It also says :
"A Locale is the mechanism for identifying the kind of object (NumberFormat) that you would like to get. The locale is just a mechanism for identifying objects, not a container for the objects themselves"
And finally, the javadoc for the getAvailableLocales() method says :
"The returned array represents the union of locales supported by the Java runtime environment and by installed LocaleServiceProvider implementations"
So you just have to invent a language code which is not in the standard list, and use it as an identifier for your locale.
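For example (ISO 639 reserves the range qaa through qtz for local use, so a code from that range will never collide with a real language):

    import java.util.Locale;

    public class PrivateLocale {
        public static void main(String[] args) {
            Locale myLocale = new Locale("qaa");
            System.out.println(myLocale.getLanguage()); // "qaa"
            // Lookups such as NumberFormat.getInstance(myLocale) fall back to
            // defaults until a LocaleServiceProvider for "qaa" is installed.
        }
    }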
See this answer:
...You can plug in support for additional locales via the SPIs (described here). For example, to provide a date formatter for a new locale, you would do it by implementing a DateFormatProvider service. You might be able to do this by decorating an existing implementation - I'd have a look at the ICU4J library to see if it provides the support you want.
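A minimal sketch of such a provider, registered by listing the class name in a META-INF/services/java.text.spi.DateFormatProvider file (the locale code and date patterns are placeholders):

    import java.text.DateFormat;
    import java.text.SimpleDateFormat;
    import java.text.spi.DateFormatProvider;
    import java.util.Locale;

    public class MyDateFormatProvider extends DateFormatProvider {
        private static final Locale MY_LOCALE = new Locale("qaa");

        @Override public Locale[] getAvailableLocales() {
            return new Locale[] { MY_LOCALE };
        }
        @Override public DateFormat getDateInstance(int style, Locale locale) {
            return new SimpleDateFormat("yyyy-MM-dd");
        }
        @Override public DateFormat getTimeInstance(int style, Locale locale) {
            return new SimpleDateFormat("HH:mm:ss");
        }
        @Override public DateFormat getDateTimeInstance(int dateStyle, int timeStyle, Locale locale) {
            return new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        }
    }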