In Java the Locale defines how people expect to see things presented (such as currency formats, the names of the months, and the day on which a week starts).
When parsing the name of a Month (with a DateTimeFormatter) it starts to become tricky.
If you use Locale.US or Locale.ENGLISH then September has the short form Sep.
If you use Locale.UK then September also has the short form Sep in Java 11 ... but when you try Java 17 then it has Sept (because of changes at the Unicode CLDR end for which I asked if this was correct).
The effect is that my tests started failing when trying to build with Java 17.
The reason my current code uses Locale.UK instead of Locale.ENGLISH is because in Java Locale.ENGLISH is actually not just English but also the non-ISO American way of defining a week (they use Sunday as the first day of the week). I want to have it the ISO way.
Simply:
WeekFields.ISO = WeekFields.of(Locale.UK) = WeekFields[MONDAY,4]
WeekFields.of(Locale.ENGLISH) = WeekFields.of(Locale.US) = WeekFields[SUNDAY,1]
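The comparison above can be checked with a few lines of code. This is only a minimal sketch: the class name WeekFieldsDemo and the helper describe are made up for illustration, and the values printed for Locale.UK and Locale.ENGLISH depend on the locale data of the JDK in use.

```java
import java.time.temporal.WeekFields;
import java.util.Locale;

// Hypothetical demo class, not part of the original code base.
class WeekFieldsDemo {
    // Renders a WeekFields instance as "FIRSTDAY,minimalDaysInFirstWeek".
    static String describe(WeekFields weekFields) {
        return weekFields.getFirstDayOfWeek() + "," + weekFields.getMinimalDaysInFirstWeek();
    }

    public static void main(String[] args) {
        System.out.println("ISO     = " + describe(WeekFields.ISO));
        System.out.println("UK      = " + describe(WeekFields.of(Locale.UK)));
        System.out.println("US      = " + describe(WeekFields.of(Locale.US)));
        System.out.println("ENGLISH = " + describe(WeekFields.of(Locale.ENGLISH)));
    }
}
```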
So starting with Java 17 I have not yet been able to find a built in Locale that works correctly.
In my mind I have to take either the Locale.ENGLISH and change the WeekFields or take the Locale.UK and change the shortname of the month September to what I need.
My question is how do I do this (in Java 17)?
Or is there a better way to fix this?
Update 1:
I already got feedback from the people at Unicode indicating that the change for en_GB to use Sept instead of Sep is a bugfix because that is the way it should be abbreviated in the UK.
So it seems I will need not just a parser that accepts "Sep" but one that will accept a mix of "Sept" and "Sep" for English.
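One way such a mixed parser might look is sketched below. It does not rely on the month names of any Locale: the abbreviations are supplied explicitly through DateTimeFormatterBuilder.appendText(TemporalField, Map), and two optional alternatives are tried in turn, one with "Sept" and one with "Sep". The class name LenientMonthParser, the helper dayMonthYear and the simplified dd/MMM/yyyy layout are assumptions made for this illustration; they are not taken from my actual code.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: accepts both "Sep" and "Sept", independent of locale data.
class LenientMonthParser {
    private static final String[] SHORT_MONTHS = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"};

    // Builds a dd/MMM/yyyy formatter that uses the given abbreviation for September.
    private static DateTimeFormatter dayMonthYear(String september) {
        Map<Long, String> months = new HashMap<>();
        for (int i = 0; i < SHORT_MONTHS.length; i++) {
            months.put((long) (i + 1), SHORT_MONTHS[i]);
        }
        months.put(9L, september);
        return new DateTimeFormatterBuilder()
            .appendValue(ChronoField.DAY_OF_MONTH, 2)
            .appendLiteral('/')
            .appendText(ChronoField.MONTH_OF_YEAR, months)
            .appendLiteral('/')
            .appendValue(ChronoField.YEAR, 4)
            .toFormatter(Locale.ROOT);
    }

    // The "Sept" alternative is tried first; if it fails, the "Sep" one is tried.
    static final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
        .appendOptional(dayMonthYear("Sept"))
        .appendOptional(dayMonthYear("Sep"))
        .toFormatter(Locale.ROOT);

    static LocalDate parse(String text) {
        return LocalDate.parse(text, FORMATTER);
    }
}
```

Because the month texts are fixed in the formatter, the result should not change when the CLDR data changes between Java versions. The price is that the formatter only understands English abbreviations.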
Update 2:
I have tweaked my code so that, in case of a parse exception, it tries to change what is assumed to be the input ("Sep") into what the currently selected locale expects. This does not cover all cases, but it covers enough of them for my specific situation.
For those interested: my commit.
I found a way of handling this by using SPI.
I'm documenting it here as a possibility that may work for others (it does not work for my context).
As an experiment I created a class:
package nl.basjes.parse.httpdlog.dissectors.locale;
import java.util.Locale;
import java.util.spi.CalendarDataProvider;
import static java.util.Calendar.MONDAY;
public class CalendarDataProviderISO8601 extends CalendarDataProvider {
    // "en" with the variant "ISO": English texts, ISO-8601 week rules.
    public static final Locale ENGLISH_ISO = new Locale("en", "", "ISO");

    @Override
    public int getFirstDayOfWeek(Locale locale) {
        return MONDAY;
    }

    @Override
    public int getMinimalDaysInFirstWeek(Locale locale) {
        return 4;
    }

    @Override
    public Locale[] getAvailableLocales() {
        return new Locale[]{ENGLISH_ISO};
    }
}
and a file ./src/main/resources/META-INF/services/java.util.spi.CalendarDataProvider with
nl.basjes.parse.httpdlog.dissectors.locale.CalendarDataProviderISO8601
Because this is just a variant over the regionless "English" it will take everything from "English" and put the above class over it.
Although this works I cannot use it.
The problem is that although JEP 252 (http://openjdk.java.net/jeps/252) states "The default lookup order will be CLDR, COMPAT, SPI", the current reality is that SPI has been removed from this list in this change because of the deprecation of the Extension Mechanism.
So to use this construct the class must be in the classpath at startup and the commandline option -Djava.locale.providers=CLDR,COMPAT,SPI must be passed to the JVM.
Given that my library ( https://github.com/nielsbasjes/logparser/ ) is also used in situations (like Apache Flink/Beam/Drill/Pig) where classes are shipped in a more dynamic way (serialized and transported to an already running JVM) to multiple machines this construct cannot be used.
I currently do not know of a dynamic way of doing something like this in Java.
Related
How can I use an unsupported Locale (eg. ar-US) in JAVA 11 when I output a number via String.format()?
In Java 8 this worked just fine (try jdoodle, select JDK 1.8.0_66):
Locale locale = Locale.forLanguageTag("ar-US");
System.out.println(String.format(locale, "Output: %d", 120));
// Output: 120
Since Java 11 the output is in Eastern Arabic numerals (try jdoodle, use default JDK 11.0.4):
Locale locale = Locale.forLanguageTag("ar-US");
System.out.println(String.format(locale, "Output: %d", 120));
// Output: ١٢٠
It seems this problem comes from the switch in the Locale Data Providers from JRE to CLDR (source: Localization Changes in Java 9 by @mcarth). Here is a list of supported locales: JDK 11 Supported Locales
UPDATE
I updated the question's example to ar-US, as my previous example didn't make sense. The idea is to have a format which makes sense in the given country. In the example that would be the United States (US).
The behavior conforms to the CLDR being treated as the preferred Locale. To confirm this, the same snippet in Java-8 could be executed with
-Djava.locale.providers=CLDR
If you step back to look at the JEP 252: Use CLDR Locale Data by Default, the details follow :
The default lookup order will be CLDR, COMPAT, SPI, where COMPAT
designates the JRE's locale data in JDK 9. If a particular provider
cannot offer the requested locale data, the search will proceed to the
next provider in order.
So, in short if you really don't want the default behaviour to be that of Java-11, you can change the order of lookup with the VM argument
-Djava.locale.providers=COMPAT,CLDR,SPI
What might help further is understanding more about picking the right language using CLDR!
I'm sure I'm missing some nuance, but the problem is with your tag, so fix that. Specifically:
ar-EN makes no sense. That's short for:
language = arabic
country = ?? nobody knows.
EN is not a country. en is certainly a language code (for english), but the second part in a language tag is for country, and EN is not a country. (for context, there is en-GB for british english and en-US for american english).
Thus, this is as good as ar (as in, language = arabic, not tied to any particular country). Even if you did tie it to some country, that is mostly immaterial here; that would affect things like 'what is the first day of the week' ,'which currency symbol is to be presumed' and 'should temperatures be stated in Kelvin or Fahrenheit' perhaps. It has no bearing on how to show digits, because that's all based on language.
And the language is Arabic, thus ١٢٠ is what you get when you try ar as a language tag when printing the number 120. The problem is that you expect this to return "120", which is a bizarre wish [1], combined with the fact that Java, unfortunately, shipped for a long time with a bug that made it act in this bizarre fashion, thinking that rendering the number 120 in Arabic is best done with "120", which is wrong.
So, with that context, in order of preference:
Best solution
Find out why your system ends up with ar-EN and nevertheless expects '120', and fix this. Also fix ar-EN in general; EN is not a country.
More generally, 'unsupported locale' isn't really a thing. the ar part is supported, and it's the only relevant part of the tag for rendering digits.
Alternatives
The most likely best answer if the above is not possible is to explicitly work around it. Detect the tag yourself, and write code that will just respond with the result of formatting this number using Locale.ENGLISH instead, guaranteeing that you get Output: 120. The rest seems considerably worse: You could try to write a localization provider which is a ton of work, or you can try to tell java to use the JRE version of the provider, but that one is obsoleted and will not be updated, so you're kicking the can down the road and setting yourself up for a maintenance burden later.
1.) Given that the JRE variant actually printed 120, and you're also indicating you want this, I get that nagging feeling I'm missing some political or historical info and the expectation that ar-EN results in rendering the number 120 as "120" is not so crazy. I'd love to hear that story if you care to provide it!
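The explicit workaround described under Alternatives could be sketched roughly as follows. The class LatinDigitFormat and its methods are hypothetical names, and the rule "any locale whose language is ar gets Locale.ENGLISH for formatting" is an assumption chosen for this example; a real application would have to decide for itself which tags to detect.

```java
import java.util.Locale;

// Hypothetical helper: formats with Locale.ENGLISH whenever the requested language is Arabic.
class LatinDigitFormat {
    // Picks the locale that is actually used for formatting.
    static Locale formattingLocale(Locale requested) {
        return "ar".equals(requested.getLanguage()) ? Locale.ENGLISH : requested;
    }

    static String format(Locale requested, String pattern, Object... args) {
        return String.format(formattingLocale(requested), pattern, args);
    }

    public static void main(String[] args) {
        Locale locale = Locale.forLanguageTag("ar-US");
        System.out.println(format(locale, "Output: %d", 120));
    }
}
```

This keeps the decision in one place, so it can be removed again once the tag itself has been fixed.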
Recently I have been challenged by quite an "easy" problem. Suppose that there are sentences (saved in a String), and I need to find out if there is any date in this String. The challenge is that the date can be in a lot of different formats. Some examples are shown in the list:
June 12, 1956
London, 21st October 2014
13 October 1999
01/11/2003
Worth mentioning that these are contained in one string. So as an example it can be like:
String s = "This event took place on 13 October 1999.";
My question in this case would be how can I detect that there is a date in this string. My first approach was to search for the word "event", and then try to localize the date. But with more and more possible formats of the date this solution is not very beautiful. The second solution that I tried is to create a list for months and search. This had good results but still misses the cases when the date is expressed all in digits.
One solution which I have not tried till now is to design regular expressions and try to find a match in the string. Not sure how much this solution might decrease the performance.
What could be a good solution that I should probably consider? Did anybody face a similar problem before and what solutions did you find?
One thing is for sure: there is no time component, so the only interesting part is the date.
Using the natty.joestelmach.com library
Natty is a natural language date parser written in Java. Given a date expression, natty will apply standard language recognition and translation techniques to produce a list of corresponding dates with optional parse and syntax information.
import com.joestelmach.natty.*;
import java.util.Date;
import java.util.List;
List<Date> dates = new Parser().parse("Start date 11/30/2013 , end date Friday, Sept. 7, 2013").get(0).getDates();
System.out.println(dates.get(0));
System.out.println(dates.get(1));
//output:
//Sat Nov 30 11:14:30 BDT 2013
//Sat Sep 07 11:14:30 BDT 2013
You are after Named Entity Recognition. I'd start with Stanford NLP. The 7 class model includes date, but the online demo struggles and misses the "13". :(
Natty mentioned above gives a better answer.
If it's only one String you could use the Regular Expression as you mentioned. Having to find the different date format expressions. Here are some examples:
Regular Expressions - dates
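As a rough illustration of the regular expression route, the sketch below covers only the four formats listed in the question ("June 12, 1956", "21st October 2014", "13 October 1999" and "01/11/2003"). The class DateFinder and the method findDates are hypothetical names, and the patterns assume English month names written in full; any other format would need its own alternative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: finds date-like substrings in free text.
class DateFinder {
    private static final String MONTH =
        "(?:January|February|March|April|May|June|July|August"
        + "|September|October|November|December)";

    private static final Pattern DATE = Pattern.compile(
        "\\b(?:" + MONTH + " \\d{1,2}, \\d{4}"              // June 12, 1956
        + "|\\d{1,2}(?:st|nd|rd|th)? " + MONTH + " \\d{4}"  // 21st October 2014, 13 October 1999
        + "|\\d{2}/\\d{2}/\\d{4})\\b");                     // 01/11/2003

    // Returns every substring of the text that looks like one of the supported formats.
    static List<String> findDates(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = DATE.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }
}
```

The pattern is compiled once and reused, so the matching cost per string stays small. It only detects the shape of a date and does not check that the date actually exists.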
In case it's a document or a big text, you will need a parser. You could use a Lexical analysis approach.
Depending on the project using an external library as mentioned in some answers might be a good idea. Sometimes it's not an option.
I've done this before with good precision and recall. You'll need GATE and its ANNIE plugin.
1. Use GATE UI tool to create a .GAPP file that will contain your processing resources.
2. Use the .GAPP file to use the extracted Date annotation set.
Step 2 can be done as follows:
Corpus corpus = Factory.newCorpus("Gate Corpus");
Document gateDoc = Factory.newDocument("This event took place on 13 October 1999.");
corpus.add(gateDoc);
File pluginsHome = Gate.getPluginsHome();
File ANNIEPlugin = new File(pluginsHome, "ANNIE");
File AnnieGapp = new File(ANNIEPlugin, "Test.gapp");
CorpusController AnnieController = (CorpusController) PersistenceManager.loadObjectFromFile(AnnieGapp);
AnnieController.setCorpus(corpus);
AnnieController.execute();
Later you can see the extracted annotations like this:
AnnotationSetImpl ann = (AnnotationSetImpl) gateDoc.getAnnotations();
System.out.println("Found annotations of the following types: "+ gateDoc.getAnnotations().getAllTypes());
I'm sure you can do it easily with the inbuilt annotation set Date. It is also very extensible.
To enhance the annotation set Date, create a lenient annotation rule in JAPE, say 'DateEnhanced', from the inbuilt ANNIE annotation Date to include certain kinds of dates like "9/11", and use a chain of Java regexes on the R.H.S. of the 'DateEnhanced' JAPE rule to filter out unwanted outputs (if any).
I want to retrieve dates and other temporal entities from a set of Strings. Can this be done without parsing the string for dates in Java, as most parsers deal with a limited scope of input patterns? The input here is entered manually and is hence ambiguous.
Inputs can be like:
12th Sep |mid-March |12.September.2013
Sep 12th |12th September| 2013
Sept 13 |12th, September |12th,Feb,2013
I've gone through many answers on finding date in Java but most of them don't deal with such a huge scope of input patterns.
I've tried using the SimpleDateFormat class and its parse() functions, checking whether parsing fails, which would mean the input is not a date. I've tried using regex, but I'm not sure it fits this scenario. I've also used ClearNLP to annotate the dates, but it doesn't give a reliable annotation set.
The closest approach to getting these values could be using a Chain of responsibility as mentioned below. Is there a library that has a set of patterns for dates that I could use?
A clean and modular approach to this problem would be to use a chain,
every element of the chain tries to match the input string against a regex,
if the regex matches the input string then you can convert the input string to something that can feed a SimpleDateFormat to convert it to the data structure you prefer (Date? or a different temporal representation that better suits your needs) and return it; if the regex doesn't match, the chain element just delegates to the next element in the chain.
The responsibility of every element of the chain is just to test the regex against the string, give a result or ask the next element of the chain to give it a try.
The chain can be created and composed easily without having to change the implementation of every element of the chain.
In the end the result is the same as in @KirkoR's response, with a 'bit' (:D) more code but a modular approach. (I prefer the regex approach to the try/catch one.)
Some reference: https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern
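A minimal sketch of such a chain is shown below. Note that it swaps SimpleDateFormat for java.time (DateTimeFormatter and LocalDate), which is a substitution made for this example and not what the text above prescribes. The class DateLink, the factory defaultChain and the two patterns are hypothetical, and "01/11/2003" is assumed to be day first.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical chain element: tests one regex, then converts or delegates.
class DateLink {
    private final Pattern regex;
    private final DateTimeFormatter formatter;
    private final DateLink next; // null marks the end of the chain

    DateLink(String regex, String pattern, DateLink next) {
        this.regex = Pattern.compile(regex);
        this.formatter = DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH);
        this.next = next;
    }

    Optional<LocalDate> parse(String input) {
        if (regex.matcher(input).matches()) {
            try {
                return Optional.of(LocalDate.parse(input, formatter));
            } catch (DateTimeParseException e) {
                // The shape matched but the value is not a valid date; let the next element try.
            }
        }
        return next == null ? Optional.empty() : next.parse(input);
    }

    // Composes the chain; adding a format means adding one more element here.
    static DateLink defaultChain() {
        DateLink numeric = new DateLink("\\d{2}/\\d{2}/\\d{4}", "dd/MM/uuuu", null);
        return new DateLink("\\d{1,2} [A-Za-z]+ \\d{4}", "d MMMM uuuu", numeric);
    }
}
```

A call such as DateLink.defaultChain().parse("13 October 1999") walks the elements in order and returns an empty Optional when none of them recognises the input.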
You could just implement support for all the pattern possibilities you can think of, then document that ... OK, these are all patterns my module supports. You could then throw some RuntimeException for all the other possibilities.
Then ... in an iterative way you can keep running your module over the input data, and keep adding support for more date formats until it stops raising any RuntimeException.
I think that's the best you can do here if you want to keep it reasonably simple.
Yes! I've finally extracted all sorts of dates/temporal values that can be as generic as :
mid-March | Last Month | 9/11
To as specific as:
11/11/11 11:11:11
This finally happened because of awesome libraries from GATE and JAPE
I've created a more lenient annotation rule in JAPE, say 'DateEnhanced', to include certain kinds of dates like "9/11" or "11TH, February- 2001", and used a chain of Java regexes on the R.H.S. of the 'DateEnhanced' JAPE rule to filter out some unwanted outputs.
I can recommend a very nice implementation for your problem, unfortunately in Polish: http://koziolekweb.pl/2015/04/15/throw-to-taki-inny-return/
You can use google translator:
https://translate.google.pl/translate?sl=pl&tl=en&js=y&prev=_t&hl=en&ie=UTF-8&u=http%3A%2F%2Fkoziolekweb.pl%2F2015%2F04%2F15%2Fthrow-to-taki-inny-return&edit-text=
The code there looks really nice:
private static Date convertStringToDate(String s) {
if (s == null || s.trim().isEmpty()) return null;
ArrayList<String> patterns = Lists.newArrayList(YYYY_MM_DD_T_HH_MM_SS_SSS,
YYYY_MM_DD_T_HH_MM_SS
, YYYY_MM_DD_T_HH_MM
, YYYY_MM_DD);
for (String pattern : patterns) {
try {
return new SimpleDateFormat(pattern).parse(s);
} catch (ParseException e) {
// This pattern did not match; fall through and try the next one.
}
}
return new Date(Long.valueOf(s));
}
mark.util.DateParser dp = new DateParser();
ParsePositionEx parsePosition = new ParsePositionEx(0);
Date startDate = dp.parse("12.September.2013", parsePosition);
System.out.println(startDate);
output: Thu Sep 12 17:18:18 IST 2013
mark.util.DateParser is part of a library which is used by the DateNormalizer PR. So in the JAPE file, we just have to import it.
Our Wicket app needs separate UI language and number/date format locales (e.g. UI in english, Number and date format: German) per user.
If you set the session locale to say Locale.GERMAN, you get both german number and date format AND german resources (e.g. MyForm_de.properties).
We worked around this by setting the session locale to the number and date locale and then use a custom ComponentStringResourceLoader to load strings (return super.loadStringResource(clazz, key, language != null ? new Locale(language) : locale, style, variation)). However, it looks like strings are being cached because if I log on as different users, I start getting a mixture of languages.
Does anyone know how to control the caching (assuming that is causing the problem)? Note: I don't want to prevent caching (since that would presumably hurt performance). I guess I want to override the caching behavior so it works correctly with our custom resource loader.
Or is there a better approach altogether to solving this problem?
Here's the code we used for the custom StringResourceLoader.
ComponentStringResourceLoader myComponentStringResourceLoader = new ComponentStringResourceLoader() {
@Override
public String loadStringResource(Class<?> clazz, String key, Locale locale, String style, String variation) {
return super.loadStringResource(clazz, key, getLoggedOnUser().getUILanguageLocale(), style, variation);
}
};
getResourceSettings().getStringResourceLoaders().add(0, myComponentStringResourceLoader);
Here's the code to set the session locale (used for number and date formatting).
getSession().setLocale(getLoggedOnUser().getNumberAndDateLocale());
You can use Session's locale for i18n of the labels and either override #getLocale() or #getConverter() for the components which should use the different locale for dates. I guess you talk about TextField which needs to render its value in German locale. If so, just create GermanTextField that always returns Locale.GERMAN in its #getLocale().
I have searched throughout the site but I think I have a slightly different issue and could really do with some help before I either have heart failure or burn the computer.
I dynamically generate a list of month names (in the form June 2011, July 2011) and obviously I want this to be locale sensitive: hence I use the simple date format object as follows:
//the actual locale name is dependent on UI selection
Locale localeObject=new Locale("pl");
// intended to return full month name - in local language.
DateFormat dtFormat = new SimpleDateFormat("MMMM yyyy",localeObject);
//this bit just sets up a calendar (used for other bits but here to illustrate the issue
String systemTimeZoneName = "GMT";
TimeZone systemTimeZone=TimeZone.getTimeZone(systemTimeZoneName);
Calendar mCal = new GregorianCalendar(systemTimeZone); //"gmt" time
mCal.getTime(); //current date and time
but if I do this:
String value=dtFormat.format(mCal.getTime());
this "should" return the localized version of the month name. In Polish the word "September" is "Wrzesień" -- note the accent on the n. However, all I get back is "Wrzesie?"
What am I doing wrong?
Thanks to all - I accept now that it's a presentation issue - but how can I "read" the result from dtFormat safely - I added some comments below ref using getBytes etc. - this worked in other situations, I just can't seem to get access to the string result without messing it up
-- FINAL Edit; for anyone that comes across this issue
The answer was on BalusC's blog : http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html#DevelopmentEnvironment
Basically the DTformat object was returning UTF-8 and was being automatically transformed back to the system default character set when I read it into a string
so this code worked for me
new String(dtFormat.format(mCal.getTime()).getBytes("UTF-8"),"ISO-8859-1");
thank you very much for the assistance
Your problem has nothing to do with SimpleDateFormat - you're just doing the wrong thing with the result.
You haven't told us what you're doing with the string afterwards - how you're displaying it in the UI - but that's the problem. You can see that it's fetching a localized string; it's only the display of the accented character which is causing a problem. You would see exactly the same thing if you had a string constant in there containing the same accented character.
I suggest you check all the encodings used throughout your app if it's a web app, or check the font you're displaying the string in if it's a console or Swing app.
If you examine the string in the debugger I'm sure you'll see it's got exactly the right characters - it's just how they're getting to the user which is the problem.
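The symptom can be reproduced without any date formatting at all, which supports the point that the string itself is fine. In the sketch below (the class EncodingDemo and the method roundTrip are made-up names) the same text is pushed through two charsets. ISO-8859-1 has no code for ń, so that character is replaced by a question mark, while UTF-8 preserves it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: shows where the '?' comes from.
class EncodingDemo {
    // Encodes the text to bytes and decodes it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String month = "Wrzesie\u0144"; // "Wrzesień", written with an escape to avoid source-encoding issues
        System.out.println(roundTrip(month, StandardCharsets.ISO_8859_1).equals("Wrzesie?"));
        System.out.println(roundTrip(month, StandardCharsets.UTF_8).equals(month));
    }
}
```

Whichever step between the formatter and the user behaves like the first call (a response encoding, a console code page, a file writer using the platform default) is the place to fix.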
In my tests, dtFormat.format(mCal.getTime()) returns
październik 2011
new SimpleDateFormat(0,0,localeObject).format(mCal.getTime()) returns:
poniedziałek, 3 październik 2011 14:26:53 EDT