This is the problem:
I have some .csv files with travel information, where the dates appear as strings (one line per travel):
"All Mondays from January-May and October-December. All days from June To September"
"All Fridays from February to June"
"Monday, Friday and Saturday and Sunday from 10 January to 30 April"
"from 01 of November to 30 April. All days except fridays from 2 to 24 of november and sunday from 2 to 30 of december"
"All sundays from 02 december to 28 april"
"5, 12, 20 of march, 11, 18 of april, 2, 16, 30 of may, 6, 13, 27 june"
"All saturdays from February to June, and from September to December"
"1 to 17 of december, 1 to 31 of january"
"All mondays from February to november"
I must parse the strings into Dates and keep them in an array for each travel.
The problem is that I don't know how to do it. Even my university teachers told me that they don't know how :S. I can't find or create a pattern using http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
After parsing them I have to search for all travels between two dates.
But how? How can I parse them? Is it possible?
This requires Natural Language Processing (NLP); see Wikipedia for an account:
http://en.wikipedia.org/wiki/Natural_language_processing.
Your problem as stated is very hard. There are many ways of representing a single date, and your examples include ranges of dates and formulae for generating dates. It sounds as if you have a limited subset of language - frequent use of "all", "from", etc.
If you are in control of the language (i.e. these are being generated by humans who comply with your documentation) then you have a chance of formalising it (although it will take a lot of work - months). If you are not in charge of it, then every time a new phrase appears you will have to add it to the specs.
I suggest you go through the file and look for stock phrases such as "All [weekdayname]s [from | between | until | before]" or "in [January | February ...]", then substitute tokens for them in the phrases. If you find this covers all the cases, you may be able to extract the particular phrases. But if you have anaphora like "next Tuesday" it will be much harder.
You're in the domain of NLP (Natural Language Processing), where what is possible or impossible is fuzzy. From a quick Google search, I've found that the Natty Date Parser might be useful for you.
For more theoretical background on NLP, you might be interested in the Natural Language Processing course from Stanford University on Coursera (at the moment the course is not open for enrolment, but the lectures are available for free).
You can also use a set of strict regular expressions that would match only one of your possible cases and apply them from the most restrictive to the most relaxed.
The first thing I would define to attack your problem is what you expect as an output of your method, since in some cases it's a single date, in some cases an interval, in some others multiple intervals.
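As a minimal sketch of the strict-regex idea, the class below handles exactly one rule, "All <weekday>s from <month> to <month>", and fixes the output type as a list of dates. The class name TravelDateSketch, the method parse and the year parameter are assumptions made for illustration (the sample strings carry no year). Ranges that wrap around the year end, and every other phrase shape, are deliberately not handled; further rules would be tried in order from most to least restrictive.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.Month;
import java.time.temporal.TemporalAdjusters;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class TravelDateSketch {

    // One strict rule: "All <weekday>s from <month> to <month>".
    private static final Pattern WEEKDAY_IN_MONTH_RANGE = Pattern.compile(
            "all\\s+(\\w+?)s\\s+from\\s+(\\w+)\\s+to\\s+(\\w+)",
            Pattern.CASE_INSENSITIVE);

    // The year is an assumption supplied by the caller.
    static List<LocalDate> parse(String text, int year) {
        Matcher m = WEEKDAY_IN_MONTH_RANGE.matcher(text.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("No rule matches: " + text);
        }
        // valueOf throws IllegalArgumentException for unknown English names.
        DayOfWeek day = DayOfWeek.valueOf(m.group(1).toUpperCase(Locale.ROOT));
        Month first = Month.valueOf(m.group(2).toUpperCase(Locale.ROOT));
        Month last = Month.valueOf(m.group(3).toUpperCase(Locale.ROOT));

        LocalDate end = LocalDate.of(year, last, 1)
                .with(TemporalAdjusters.lastDayOfMonth());
        LocalDate current = LocalDate.of(year, first, 1)
                .with(TemporalAdjusters.nextOrSame(day));

        // Step week by week from the first matching weekday to the end of the range.
        List<LocalDate> dates = new ArrayList<>();
        while (!current.isAfter(end)) {
            dates.add(current);
            current = current.plusWeeks(1);
        }
        return dates;
    }

    public static void main(String[] args) {
        System.out.println(parse("All Fridays from February to June", 2013));
    }
}
```

Once every travel is reduced to a list of dates like this, searching for travels between two dates becomes a plain comparison on LocalDate values.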
Related
I'm trying to extract data from sentences like this:
"every day before 27 march"
"mon, wed, sun except 29,30 march, 1,2 april"
"weekdays after 20 march"
"weekends before 3 april"
"1, 5 , 7 april"
and other combinations ...
Is there any standard solution for this problem?
This is not free-form literary text. It's just a server response with a well-known answer structure.
http://s13.postimg.org/6gjtuzyo7/image.jpg Not so many combinations, I think.
As we are talking about NLP for date/time, you should definitely check out the prettytime nlp library, written in Java. It has served us very well.
Recently I looked into the Documentation for SimpleDateFormat and noticed some inconsistencies (in my opinion) in how they handle the letters for parsing.
For example, look at these representations:
M: Month in year
D: Day in year
d: Day in month
"x in year" is a bigger timespan than "x in month" and therefore gets the uppercase letter, so this makes perfect sense to me.
But then there is
w: Week in year
W: Week in month
Here, the letters are swapped, which is totally counter-intuitive in my opinion. It seems like these two should be the other way around, to conform to the "pattern" mentioned above.
Another example are the different hour-representations:
H: Hour in day (0-23)
k: Hour in day (1-24)
K: Hour in am/pm (0-11)
h: Hour in am/pm (1-12)
I kinda get the idea. Uppercase letters for hours starting with 0, lowercase letters for hours starting with 1.
Here, both lowercase letters should be swapped, because shouldn't the same letters belong to the same category? (H/h for hour in day, K/k for hour in am/pm)
So my question is this: Is there a reason behind this seemingly counter-intuitive representation?
The only reason I could think of is, that some of these pattern letters were added at a later time and they couldn't change the already existing ones, because of downwards compatibility. But other than that, it doesn't make much sense to me.
Citation:
"The only reason I could think of is, that some of these pattern
letters were added at a later time and they couldn't change the
already existing ones, because of downwards compatibility."
Your suspicion is correct. But you cannot blame (only) the Sun and Oracle designers for that. They simply took the whole thing over from Taligent (now merged into IBM). And IBM itself is one of the leading companies behind the Unicode consortium, which defined the CLDR standard. In that standard all these pattern symbols were defined (indeed in a totally inconsistent manner, explainable only by historical development).
Worse, the inconsistencies in CLDR don't stop: recently we got a NARROW variant in addition to SHORT, LONG etc. That means if you want the shortest possible representation of a month as a single letter, you need to specify the pattern symbol MMMMM (5 letters, because the single letter M is already reserved for the numerical short form).
Another note: SimpleDateFormat does not even strictly follow CLDR. For example, Oracle defined the pattern symbol "u" as the ISO day number of the week (1 = Monday, ..., 7 = Sunday) in Java 7, although CLDR had already introduced the same symbol earlier as the proleptic ISO year. And Java 8 deviates again: it invents new symbols not known in CLDR, but otherwise tries to follow CLDR more closely.
We have already remarkable differences using pattern languages (compare Java-6, Java-7, Java-8, pure CLDR and Joda-Time). And I fear this will never stop.
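To make the difference concrete, here is a small sketch that formats the same day with the letter "u" in SimpleDateFormat (Java 7 and later) and in Java 8's DateTimeFormatter, and also prints the four hour letters at midnight. The class name PatternLetterDemo and its helper methods are made up for this illustration, and the date (Monday 2012-10-22, 00:00 UTC) is arbitrary.

```java
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class, not part of any library.
final class PatternLetterDemo {

    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Monday 2012-10-22 at 00:00 UTC, chosen only for illustration.
    static Date mondayMidnightUtc() {
        Calendar cal = new GregorianCalendar(UTC, Locale.ROOT);
        cal.clear();
        cal.set(2012, Calendar.OCTOBER, 22, 0, 0, 0);
        return cal.getTime();
    }

    // Formats with the old java.text API, pinned to UTC.
    static String legacy(String pattern, Date date) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ROOT);
        format.setTimeZone(UTC);
        return format.format(date);
    }

    // Formats with the Java 8 java.time API.
    static String modern(String pattern, LocalDate date) {
        return DateTimeFormatter.ofPattern(pattern, Locale.ROOT).format(date);
    }

    public static void main(String[] args) {
        Date date = mondayMidnightUtc();
        System.out.println(legacy("u", date));                        // 1 (day number of week, Monday)
        System.out.println(modern("u", LocalDate.of(2012, 10, 22)));  // 2012 (year)
        System.out.println(legacy("H k K h", date));                  // 0 24 0 12
    }
}
```

The same letter yields a weekday number in one API and a year in the other, which is why patterns cannot be copied between them blindly.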
I need to represent a time interval as a localized string like this: 10 hours 25 minutes 1 second, depending on the Locale.
It is pretty easy to implement by hand in English:
String hourStr = hours == 1 ? "hour" : "hours"; // and so on for the other units
But I need some "out-of-the-box" Java (maybe Java 8) mechanism that follows the rules of different languages.
Does Java have it, or do I need to implement it myself for each Locale used in the app?
Look at Joda-Time. It supports the languages English, Danish, Dutch, French, German, Japanese, Polish, Portuguese and Spanish with version 2.5.
Period period = new Period(new LocalDate(2013, 4, 11), LocalDate.now());
PeriodFormatter formatter = PeriodFormat.wordBased(Locale.GERMANY);
System.out.println(formatter.print(period)); // output: 1 Jahr, 2 Monate und 3 Wochen
formatter = formatter.withLocale(Locale.ENGLISH);
System.out.println(formatter.print(period)); // output: 1 Jahr, 2 Monate und 3 Wochen (bug???)
formatter = PeriodFormat.wordBased(Locale.ENGLISH);
System.out.println(formatter.print(period)); // output: 1 year, 2 months and 3 weeks
You might want to adjust the punctuation characters, however. To do this you might need to copy and edit the messages resource files in your classpath, which have this format (here the English variant):
PeriodFormat.space=\
PeriodFormat.comma=,
PeriodFormat.commandand=,and
PeriodFormat.commaspaceand=, and
PeriodFormat.commaspace=,
PeriodFormat.spaceandspace=\ and
PeriodFormat.year=\ year
PeriodFormat.years=\ years
PeriodFormat.month=\ month
PeriodFormat.months=\ months
PeriodFormat.week=\ week
PeriodFormat.weeks=\ weeks
PeriodFormat.day=\ day
PeriodFormat.days=\ days
PeriodFormat.hour=\ hour
PeriodFormat.hours=\ hours
PeriodFormat.minute=\ minute
PeriodFormat.minutes=\ minutes
PeriodFormat.second=\ second
PeriodFormat.seconds=\ seconds
PeriodFormat.millisecond=\ millisecond
PeriodFormat.milliseconds=\ milliseconds
Since version 2.5 it might also be possible to apply complex regular expressions to model more complex plural rules. Personally I find this user-unfriendly, and regular expressions might not be sufficient for languages like Arabic (my first impression). There are also other limitations with localization; see this pull request under debate.
Side note: Java 8 is definitely not able to do localized duration formatting.
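As a rough illustration of that last point: java.time.Duration only prints the ISO-8601 form, so any wording has to be built by hand. The sketch below does this for English only, using the simple singular/plural rule from the question. The class name EnglishDurationText is invented for this example, and the approach does not generalize to languages with more complex plural rules.

```java
import java.time.Duration;

// Hypothetical helper, English only; not a localized solution.
final class EnglishDurationText {

    // Appends "s" unless the count is exactly 1 (English rule only).
    private static String unit(long count, String singular) {
        return count + " " + singular + (count == 1 ? "" : "s");
    }

    // Splits the duration into hours, minutes and seconds using Java 8 methods.
    static String format(Duration d) {
        long hours = d.toHours();
        long minutes = d.toMinutes() % 60;
        long seconds = d.getSeconds() % 60;
        return unit(hours, "hour") + " " + unit(minutes, "minute") + " " + unit(seconds, "second");
    }

    public static void main(String[] args) {
        Duration d = Duration.ofHours(10).plusMinutes(25).plusSeconds(1);
        System.out.println(d);          // PT10H25M1S
        System.out.println(format(d));  // 10 hours 25 minutes 1 second
    }
}
```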
UPDATE from 2015-08-26:
With version 4.3 of my library Time4J (available in Maven Central), the following more powerful solution is possible, which currently supports 45 languages:
import static net.time4j.CalendarUnit.*;
import static net.time4j.ClockUnit.*;
// the input for creating the duration (in Joda-Time called Period)
IsoUnit[] units = {YEARS, MONTHS, DAYS, HOURS, MINUTES, SECONDS};
PlainTimestamp start = PlainDate.of(2013, 4, 11).atTime(13, 45, 21);
PlainTimestamp end = SystemClock.inLocalView().now();
// create the duration
Duration<?> duration = Duration.in(units).between(start, end);
// print the duration (here not abbreviated, but with full unit names)
String s = PrettyTime.of(Locale.US).print(duration, TextWidth.WIDE);
System.out.println(s);
// example output: 1 year, 5 months, 7 days, 3 hours, 25 minutes, and 49 seconds
Why is Time4J better for your problem?
It has a more expressive way to say in which units a duration should be calculated.
It supports 45 languages.
It supports the sometimes complex plural rules of languages, including right-to-left scripts like Arabic, without any need for manual configuration.
It supports locale-dependent list patterns (usage of comma, space or words like "and").
It supports 3 different text widths: WIDE, ABBREVIATED (SHORT) and NARROW.
The interoperability with Java-8 is better because Java-8-types like java.time.Period or java.time.Duration are understood by Time4J.
Assuming Brasilia GMT -0300: DST on 21/10/2012 at 00:00:00, when the clock should be advanced by one hour
Java
new Date(2012 - 1900, 9, 21, 0, 0, 0)
Sun Oct 21 01:00:00 BRST 2012
Chrome/FireFox (console)
new Date(2012, 9, 21, 0, 0 ,0)
Sat Oct 20 2012 23:00:00 GMT-0300 (Hora oficial do Brasil)
The result in Java is what I was expecting, but the result in JS I can not understand. I found this post where bjornd says
This is an absolutely correct behavior
but didn't explain why this behavior is OK.
My question is:
Why is JS returning a date one hour in the past?
P.S. I know Date is marked for "deprecation", but I'm using GWT; Date is my only option.
Basically, that answer was incorrect as far as I can see. I'm not entirely happy with the Java version, even.
Fundamentally, you're trying to construct a local date/time which never happened. Translating from local time to UTC is always tricky, as there are three possibilities:
Unambiguous mapping, which in most time zones is the case for all but two hours per year
Ambiguous mapping, during a backward transition, where the same local time period occurs twice (e.g. local time goes 12:59am, 1am, ... 1:59am, 1am, ... 1:59am, 2am)
"Gap" mapping, where a local time period simply doesn't exist (e.g. local time goes 12:59am, 2am, 2:01am)
Brazil moves its clocks forward at midnight, so local time actually goes:
October 20th 11:58pm
October 20th 11:59pm
October 21st 01:00am
October 21st 01:01am
The local time you've asked for simply never happened. It looks like Java is just assuming you want to roll it forward... whereas JavaScript is getting confused :( The JavaScript result would be more understandable (but still incorrect) if you were asking for midnight at the start of February 16th 2013, for example - where the clocks would have gone back to 11pm on the 15th. 12am on the 16th is unambiguous, as it can only happen after the "second" 11pm-11:59pm on the 15th.
A good date/time API (in my very biased view) would force you to say how you want to handle ambiguity and gaps when you do the conversion.
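For illustration only, and assuming Java 8's java.time is available (which may not hold under GWT), the three cases above can be told apart by asking the zone rules how many offsets are valid for a local date/time. The class name GapCheck and the enum Kind are invented for this sketch. Note that ZonedDateTime.of does not force a decision: it silently moves a time in a gap later by the length of the gap.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.List;

// Hypothetical helper, not part of any library.
final class GapCheck {

    enum Kind { UNAMBIGUOUS, AMBIGUOUS, GAP }

    // Zero valid offsets means a gap, one means unambiguous, two means an overlap.
    static Kind classify(LocalDateTime local, ZoneId zone) {
        List<ZoneOffset> offsets = zone.getRules().getValidOffsets(local);
        switch (offsets.size()) {
            case 0:
                return Kind.GAP;
            case 1:
                return Kind.UNAMBIGUOUS;
            default:
                return Kind.AMBIGUOUS;
        }
    }

    public static void main(String[] args) {
        ZoneId saoPaulo = ZoneId.of("America/Sao_Paulo");
        LocalDateTime midnight = LocalDateTime.of(2012, 10, 21, 0, 0);

        System.out.println(classify(midnight, saoPaulo));          // GAP
        // java.time shifts a local time in a gap forward by the gap length.
        System.out.println(ZonedDateTime.of(midnight, saoPaulo));  // 01:00 local time on the same day
    }
}
```

A caller that wants strict behaviour could check classify first and reject or adjust the input explicitly instead of relying on the default resolution.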
I had a test for object Calendar:
for (int i = 0; i < 11; i++)
System.out.println(calendar.get(i));
output:
1
2011
6
28
2
6
187
4
1
1
10
My question is: how can that happen? The calendar.get() API seems to have the same kind of tricky problems.
In Calendar.get(i) i represents a field such as ERA, YEAR, MONTH, etc..
For example, calendar.get(1) is the same as calendar.get(Calendar.YEAR) and so on.
I don't see what the problem is. The documentation states that you get the specific values for whatever field ID you provide.
You would normally use the field constants to get specific values (like DAY_OF_MONTH or MONTH), but any integer will do provided it's within the range 0 to FIELD_COUNT - 1.
The field IDs are documented here (though this may change in future) so your specific values are:
ID  Value  Description
--  -----  -----------
 0      1  Era (BC/AD for Gregorian).
 1   2011  Year.
 2      6  Month (zero-based).
 3     28  Week-of-year.
 4      2  Week-of-month.
 5      6  Date/day-of-month.
 6    187  Day-of-year.
 7      4  Day-of-week.
 8      1  Day-of-week-in-month.
 9      1  AM/PM selector.
10     10  Hour.
That's July 6, 2011 AD, somewhere between 10:00:00 PM and 10:59:59 PM inclusive. The minutes and seconds values are field IDs 12 and 13 and your code doesn't print them out, hence the uncertainty on the time.
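The mapping can be checked with a short sketch that builds a calendar for the same day and reads fields both by raw integer and by named constant. The class name CalendarFieldDemo is made up, and the minutes and seconds (15 and 30) are invented, since the original output does not include them.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo class, not part of any library.
final class CalendarFieldDemo {

    // 2011-07-06 22:15:30 UTC; minutes and seconds are arbitrary.
    static Calendar sample() {
        Calendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.US);
        cal.clear();
        cal.set(2011, Calendar.JULY, 6, 22, 15, 30);
        return cal;
    }

    public static void main(String[] args) {
        Calendar cal = sample();
        // The raw integer and the named constant address the same field.
        System.out.println(cal.get(1) + " == " + cal.get(Calendar.YEAR));      // 2011 == 2011
        System.out.println(cal.get(Calendar.MONTH) == Calendar.JULY);          // true (zero-based month 6)
        System.out.println(cal.get(Calendar.DAY_OF_WEEK) == Calendar.WEDNESDAY); // true (value 4)
        System.out.println(cal.get(Calendar.AM_PM) == Calendar.PM);            // true (value 1)
        System.out.println(cal.get(Calendar.HOUR) + ":" + cal.get(Calendar.MINUTE)); // 10:15
    }
}
```

Using the named constants keeps the intent readable, whereas looping over raw integers as in the question hides which field each number belongs to.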
The API provided by java.util.Calendar is not very well designed as your confusion illustrates. However take a look at the JavaDoc for get(). The int value is meant to represent the field you want to get the value of. See all of the members listed at that JavaDoc described as "Field number ..." such as YEAR. So calendar.get(Calendar.YEAR) would equal 2011.
The Calendar class is overkill for many common date-related scenarios. The history is roughly as follows: the Date class was found to have many deficiencies with respect to manipulating date objects, hence the Calendar class was introduced. However, the Calendar class has proved to be an over-engineered solution for many common date-related scenarios.
Read the Javadoc for better understanding of the Calendar class.