I'm trying to extract data from sentences like this:
"every day before 27 march"
"mon, wed, sun except 29,30 march, 1,2 april"
"weekdays after 20 march"
"weekends before 3 april"
"1, 5 , 7 april"
and other combinations ...
Is there any standard solution for this problem?
This is not free-form literature. It's just a server response with a well-known structure.
http://s13.postimg.org/6gjtuzyo7/image.jpg Not so many combinations, I think.
Since we are talking about NLP for date/time, you should definitely check out the PrettyTime NLP library, written in Java. It has served us well.
The RSS specification states that date-time strings should be RFC 822 compliant. Example:
Sat, 07 Sep 2002 00:00:01 GMT.
I can use:
TimeStamp timestamp = TimeStamp.parse("Sat, 07 Sep 2002 00:00:01 GMT")
to store the RFC 822 formatted string into Derby. However, I have one feed that formats its dates like so:
Thu, 22 Dec 2016 09:50:06 PST
Not RFC 822/ISO 8601 compliant, right? What format is this? How can I make it RFC 822 compliant, and how could I anticipate such cases without having a DateTimeFormatter for every non-compliant date-time string that a feed may throw at me?
The format "Thu, 22 Dec 2016 09:50:06 PST" is NOT RFC-822-compliant, but for a different reason than you think. The zone name PST is explicitly supported in RFC-822 (section 5) and stands for GMT-08:00. However, the format given above is not compliant because it uses a four-digit year instead of a two-digit year. This strong limitation was later relaxed in RFC-1123, which says:
All mail software SHOULD use 4-digit years in dates, to ease the
transition to the next century.
Another question is how to support it in Java. The best way is to write your own formatter with a custom pattern in all cases where Java-8 support is not sufficient. In DateTimeFormatter.RFC_1123_DATE_TIME, Java-8 supports only the first format (with GMT) but not PST. Its Javadoc says:
RFC-1123 updates RFC-822 changing the year from two digits to four.
This implementation requires a four digit year. This implementation
also does not handle North American or military zone names, only 'GMT'
and offset amounts.
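A minimal sketch of the "custom formatter" idea, using only java.time: try the built-in RFC 1123 formatter first, then fall back to a custom pattern whose `z` letter accepts zone abbreviations. The class name `FeedDateParser` and the fallback pattern are assumptions made for illustration, not part of any feed specification. Keep in mind that zone abbreviations are ambiguous in general, so the fallback relies on the JDK's own mapping of "PST".

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: tries a short list of formatters in order.
final class FeedDateParser {
    private static final List<DateTimeFormatter> FORMATTERS = Arrays.asList(
            // Built-in formatter: handles 'GMT' and numeric offsets only.
            DateTimeFormatter.RFC_1123_DATE_TIME,
            // Assumed fallback pattern: 'z' parses short zone names such as PST.
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss z", Locale.US));

    static Instant parse(String text) {
        for (DateTimeFormatter formatter : FORMATTERS) {
            try {
                return ZonedDateTime.parse(text, formatter).toInstant();
            } catch (DateTimeParseException e) {
                // Not this format; try the next formatter.
            }
        }
        throw new IllegalArgumentException("Unsupported date-time format: " + text);
    }

    public static void main(String[] args) {
        System.out.println(parse("Sat, 07 Sep 2002 00:00:01 GMT"));
        System.out.println(parse("Thu, 22 Dec 2016 09:50:06 PST"));
    }
}
```

New feed formats can then be supported by appending one more formatter to the list rather than by changing the parsing logic.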
Update:
Meanwhile I have released a new version of my library Time4J-v.22 which supports North American time zone abbreviations, too. It is available as the constant ChronoFormatter.RFC_1123. Example of usage:
Instant instant =
ChronoFormatter.RFC_1123.parse("Thu, 22 Dec 2016 09:50:06 PST").toTemporalAccessor();
Only military zones are not yet supported, and I don't intend to go that way because the RFC-822 spec contains a sign error for all military zones (with the only exception of "Z"). So you can either use this simple and performant way (adding an extra dependency) or write a parser composed of many different formats based on instances of java.time.format.DateTimeFormatter.
I am having problems generating a regex for a range of dates.
For example, take the range [2015-11-17, 2017-10-05]. How can I use a regex to validate whether a given date belongs to that range?
My second question: is it possible to have a generic regex that I can reuse for several date ranges, replacing only a few values in the regex with the new bounds, so that it keeps validating dates against the new range? Thanks in advance for your help =)
Do not use Regex
As the comments state, Regex is not appropriate for a range of dates, nor any span of time. Regex is intended to be “dumb” in the sense of looking only at the syntax of the text not the semantics (the meaning).
java.time
Use the java.time framework built into Java 8 and later.
Parse your strings into LocalDate objects.
LocalDate start = LocalDate.parse( "2015-11-17" );
Compare by calling the isEqual, isBefore, and isAfter methods.
Note that we commonly use the Half-Open approach in date-time work where the beginning is inclusive while the ending is exclusive.
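As a minimal sketch of that comparison, assuming the closed range from the question is converted to a half-open one by moving the end one day forward (the class and method names are made up for illustration):

```java
import java.time.LocalDate;

// Hypothetical helper illustrating a half-open range check with java.time.
final class DateRangeCheck {
    // Start is inclusive, end is exclusive.
    static boolean isInHalfOpenRange(LocalDate date, LocalDate start, LocalDate endExclusive) {
        return !date.isBefore(start) && date.isBefore(endExclusive);
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.parse("2015-11-17");
        // The question's range ends on 2017-10-05 inclusive,
        // so the exclusive end is the following day.
        LocalDate endExclusive = LocalDate.parse("2017-10-05").plusDays(1);

        System.out.println(isInHalfOpenRange(LocalDate.parse("2016-06-01"), start, endExclusive)); // true
        System.out.println(isInHalfOpenRange(LocalDate.parse("2017-10-06"), start, endExclusive)); // false
    }
}
```

The same method works for any other range: only the two boundary values change, which also answers the "generic" part of the question.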
These issues are covered already in many other Questions and Answers on Stack Overflow. So I have abbreviated my discussion here.
Just for completeness: You can actually use regular expressions to recognize any finite set of strings, such as a specific date range, however it would be more of an academic exercise than an actual recommended usage. However, if you happen to be programming some arcane hardware it could actually be necessary.
Assuming the input is always a valid date in the given format, the regex for your example could consist of:
2015-11-1[7-9] - 2015 November 17 to 19
2015-11-[23][0-9] - 2015 November 20 to 30
2015-12.* - 2015 December
2016.* - all dates of 2016
2017-0[1-9].* - 2017 January to September
2017-10-0[1-5] - 2017 October 1 to 5
Make a disjunction of these using | (a|b|c|...), apply the escaping rules of the regex implementation you use, and then you have your date checker. If the input is not guaranteed to be a valid date it gets a bit more complicated but is still possible.
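For illustration only, here is a sketch of such a disjunction in Java for the closed range [2015-11-17, 2017-10-05] from the question. It assumes the input is already a valid date in yyyy-MM-dd form; the class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical regex-based checker; assumes the input is a valid yyyy-MM-dd date.
final class DateRangeRegex {
    static final Pattern IN_RANGE = Pattern.compile(
            "2015-11-1[7-9]"             // 2015 November 17 to 19
            + "|2015-11-[23][0-9]"       // 2015 November 20 to 30
            + "|2015-12-[0-9]{2}"        // 2015 December
            + "|2016-[0-9]{2}-[0-9]{2}"  // all dates of 2016
            + "|2017-0[1-9]-[0-9]{2}"    // 2017 January to September
            + "|2017-10-0[1-5]");        // 2017 October 1 to 5

    static boolean isInRange(String isoDate) {
        // matches() requires the whole string to match one alternative.
        return IN_RANGE.matcher(isoDate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isInRange("2015-11-17")); // true
        System.out.println(isInRange("2017-10-06")); // false
    }
}
```

Every new range needs a freshly derived set of alternatives, which is the practical reason this stays an academic exercise.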
I need to represent a time interval as a localized string like this: 10 hours 25 minutes 1 second, depending on the Locale.
It is pretty easy to implement by hand in English:
String hourStr = hours == 1 ? "hour" : "hours" etc.
But I need some "out-of-the-box" Java (maybe Java 8) mechanism that follows the rules of different languages.
Does Java have it, or do I need to implement it myself for each Locale used in the app?
Look at Joda-Time. It supports the languages English, Danish, Dutch, French, German, Japanese, Polish, Portuguese and Spanish with version 2.5.
Period period = new Period(new LocalDate(2013, 4, 11), LocalDate.now());
PeriodFormatter formatter = PeriodFormat.wordBased(Locale.GERMANY);
System.out.println(formatter.print(period)); // output: 1 Jahr, 2 Monate und 3 Wochen
formatter = formatter.withLocale(Locale.ENGLISH);
System.out.println(formatter.print(period)); // output: 1 Jahr, 2 Monate und 3 Wochen (bug???)
formatter = PeriodFormat.wordBased(Locale.ENGLISH);
System.out.println(formatter.print(period)); // output: 1 year, 2 months and 3 weeks
You might want to adjust the punctuation characters, however. To do this you might need to copy and edit the messages resource files in your classpath, which have this format (here the English variant):
PeriodFormat.space=\
PeriodFormat.comma=,
PeriodFormat.commandand=,and
PeriodFormat.commaspaceand=, and
PeriodFormat.commaspace=,
PeriodFormat.spaceandspace=\ and
PeriodFormat.year=\ year
PeriodFormat.years=\ years
PeriodFormat.month=\ month
PeriodFormat.months=\ months
PeriodFormat.week=\ week
PeriodFormat.weeks=\ weeks
PeriodFormat.day=\ day
PeriodFormat.days=\ days
PeriodFormat.hour=\ hour
PeriodFormat.hours=\ hours
PeriodFormat.minute=\ minute
PeriodFormat.minutes=\ minutes
PeriodFormat.second=\ second
PeriodFormat.seconds=\ seconds
PeriodFormat.millisecond=\ millisecond
PeriodFormat.milliseconds=\ milliseconds
Since version 2.5 it might also be possible to apply regular expressions to model more complex plural rules. Personally I see it as user-unfriendly, and regular expressions might not be sufficient for languages like Arabic (my first impression). There are also other limitations with localization, see this pull request in debate.
Side note: Java 8 is definitely not able to do localized duration formatting.
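To illustrate what the standard library can do by hand, here is a sketch using java.text.MessageFormat with a choice sub-format. It covers only English-style singular/plural and is not a localized solution: every language would need its own hand-written pattern, and a choice on numeric limits cannot express the plural rules of languages such as Arabic. The class name and the pattern are assumptions made for this example.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical English-only formatter built on MessageFormat's choice sub-format.
final class EnglishDurationText {
    // Each element picks the singular form for exactly 1 and the plural form otherwise.
    private static final String PATTERN =
            "{0,choice,0#{0,number,integer} hours|1#{0,number,integer} hour|1<{0,number,integer} hours} "
            + "{1,choice,0#{1,number,integer} minutes|1#{1,number,integer} minute|1<{1,number,integer} minutes} "
            + "{2,choice,0#{2,number,integer} seconds|1#{2,number,integer} second|1<{2,number,integer} seconds}";

    static String format(long hours, long minutes, long seconds) {
        return new MessageFormat(PATTERN, Locale.ENGLISH)
                .format(new Object[] {hours, minutes, seconds});
    }

    public static void main(String[] args) {
        System.out.println(format(10, 25, 1)); // 10 hours 25 minutes 1 second
    }
}
```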
UPDATE from 2015-08-26:
With version v4.3 of my library Time4J (available in Maven Central), the following more powerful solution is possible, which currently supports 45 languages:
import static net.time4j.CalendarUnit.*;
import static net.time4j.ClockUnit.*;
// the input for creating the duration (in Joda-Time called Period)
IsoUnit[] units = {YEARS, MONTHS, DAYS, HOURS, MINUTES, SECONDS};
PlainTimestamp start = PlainDate.of(2013, 4, 11).atTime(13, 45, 21);
PlainTimestamp end = SystemClock.inLocalView().now();
// create the duration
Duration<?> duration = Duration.in(units).between(start, end);
// print the duration (here not abbreviated, but with full unit names)
String s = PrettyTime.of(Locale.US).print(duration, TextWidth.WIDE);
System.out.println(s);
// example output: 1 year, 5 months, 7 days, 3 hours, 25 minutes, and 49 seconds
Why is Time4J better for your problem?
It has a more expressive way to say in which units a duration should be calculated.
It supports 45 languages.
It supports the sometimes complex plural rules of languages, including right-to-left scripts like Arabic, without any need for manual configuration.
It supports locale-dependent list patterns (usage of comma, space or words like "and")
It supports 3 different text widths: WIDE, ABBREVIATED (SHORT) and NARROW
The interoperability with Java-8 is better because Java-8-types like java.time.Period or java.time.Duration are understood by Time4J.
This is the problem:
I have some .csv files with travels info, and the dates appear like strings (each line for one travel):
"All Mondays from January-May and October-December. All days from June To September"
"All Fridays from February to June"
"Monday, Friday and Saturday and Sunday from 10 January to 30 April"
"from 01 of November to 30 April. All days except fridays from 2 to 24 of november and sunday from 2 to 30 of december"
"All sundays from 02 december to 28 april"
"5, 12, 20 of march, 11, 18 of april, 2, 16, 30 of may, 6, 13, 27 june"
"All saturdays from February to June, and from September to December"
"1 to 17 of december, 1 to 31 of january"
"All mondays from February to november"
I must parse the strings to Dates, and keep them into an array for each travel.
The problem is that I don't know how to do it. Even my university teachers told me that they don't know how to do so :S. I can't find/create a pattern using http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
After parsing them I have to search for all travels between two dates.
But how? How do I parse them? Is it even possible?
This requires Natural Language Processing (NLP) , see Wikipedia for an account:
http://en.wikipedia.org/wiki/Natural_language_processing.
Your problem as stated is very hard. There are many ways of representing a single date, and your examples include ranges of dates and formulae for generating dates. It sounds as if you have a limited subset of language - frequent use of "all", "from", etc.
If you are in control of the language (i.e. these are being generated by humans who comply with your documentation) then you have a chance of formalising it (although it will take a lot of work - months). If you are not in charge of it, then every time a new phrase appears you will have to add it to the specs.
I suggest you go through the file and look for stock phrases such as "All [weekdayname]s [from | between | until | before]" or "in [January | February ...]". Then substitute these in the phrases. If you find this covers all the cases, you may be able to extract particular phrases. But if you have anaphora like "next Tuesday" it will be much harder.
You're in the domain of NLP (Natural Language Processing), where what is possible or impossible is fuzzy. From a quick Google search, I've found that the Natty Date Parser might be useful for you.
For more theoretical background on NLP, you might be interested in the Natural Language Processing course of Stanford University on Coursera (at the moment the course is not open for enrolment, but the lectures are available for free).
You can also use a set of strict regular expressions that would match only one of your possible cases and apply them from the most restrictive to the most relaxed.
The first thing I would define to attack your problem is what you expect as an output of your method, since in some cases it's a single date, in some cases an interval, in some others multiple intervals.
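A rough sketch of the "strict patterns, most restrictive first" idea: each rule is a regular expression for one known phrase shape, and the first rule that matches decides how the phrase would be handled. The rule names, the patterns, and the class itself are assumptions made for illustration; they cover only a few of the sample lines and only classify a phrase, they do not yet produce dates.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

// Hypothetical classifier: reports which known phrase shape a line has.
final class TravelPhraseClassifier {
    private static final String MONTH =
            "(?:january|february|march|april|may|june|july|august|september|october|november|december)";
    private static final String WEEKDAY =
            "(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)s?";

    // Insertion order is the matching order: most restrictive rule first.
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("WEEKDAY_BETWEEN_DATES", Pattern.compile(
                "all " + WEEKDAY + " from \\d{1,2} " + MONTH + " to \\d{1,2} " + MONTH));
        RULES.put("WEEKDAY_BETWEEN_MONTHS", Pattern.compile(
                "all " + WEEKDAY + " from " + MONTH + " to " + MONTH));
        // Most relaxed rule: the phrase merely mentions a month somewhere.
        RULES.put("MENTIONS_MONTH", Pattern.compile(".*\\b" + MONTH + "\\b.*"));
    }

    static Optional<String> classify(String phrase) {
        String normalized = phrase.trim().toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
            if (rule.getValue().matcher(normalized).matches()) {
                return Optional.of(rule.getKey());
            }
        }
        return Optional.empty(); // unknown shape: needs a new rule or manual review
    }

    public static void main(String[] args) {
        System.out.println(classify("All sundays from 02 december to 28 april"));
        System.out.println(classify("All Fridays from February to June"));
        System.out.println(classify("next Tuesday"));
    }
}
```

Phrases that fall through to the relaxed rule or match nothing are the ones worth logging, since they show which shapes still need a strict rule of their own.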
I had a test for object Calendar:
for (int i = 0; i < 11; i++)
System.out.println(calendar.get(i));
output:
1
2011
6
28
2
6
187
4
1
1
10
My question is how can that happen? There are also the same tricky problems for api calendar.get()
In Calendar.get(i) i represents a field such as ERA, YEAR, MONTH, etc..
For example, calendar.get(1) is the same as calendar.get(Calendar.YEAR) and so on.
I don't see what the problem is. The documentation states that you get the specific values for whatever field ID you provide.
You would normally use the field constants to get specific values (like DAY_OF_MONTH or MONTH), but any integer will do provided it's within the range 0..FIELD_COUNT-1.
The field IDs are documented here (though this may change in future) so your specific values are:
ID  Value  Description
--  -----  -----------
 0      1  Era (BC/AD for Gregorian).
 1   2011  Year.
 2      6  Month (zero-based).
 3     28  Week-of-year.
 4      2  Week-of-month.
 5      6  Date/day-of-month.
 6    187  Day-of-year.
 7      4  Day-of-week.
 8      1  Day-of-week-in-month.
 9      1  AM/PM selector.
10     10  Hour.
That's July 6, 2011 AD, somewhere between 10:00:00 PM and 10:59:59 PM inclusive. The minutes and seconds values are field IDs 12 and 13 and your code doesn't print them out, hence the uncertainty on the time.
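A small sketch showing the same values read through the named constants instead of raw integers. The calendar is pinned to 6 July 2011, 22:30 UTC with Locale.US so the result does not depend on the machine's settings; the minutes value is an assumption, since the question's output does not include it.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical demo of Calendar.get() with named field constants.
final class CalendarFieldDemo {
    static Calendar sample() {
        // Fixed zone and locale keep the field values reproducible.
        Calendar calendar = new GregorianCalendar(TimeZone.getTimeZone("UTC"), Locale.US);
        calendar.clear();
        calendar.set(2011, Calendar.JULY, 6, 22, 30, 0); // minutes are assumed
        return calendar;
    }

    public static void main(String[] args) {
        Calendar calendar = sample();
        // Calendar.YEAR is the constant 1, so both calls read the same field.
        System.out.println(calendar.get(Calendar.YEAR) == calendar.get(1));
        System.out.println("YEAR         = " + calendar.get(Calendar.YEAR));
        System.out.println("MONTH        = " + calendar.get(Calendar.MONTH) + " (zero-based)");
        System.out.println("DAY_OF_MONTH = " + calendar.get(Calendar.DAY_OF_MONTH));
        System.out.println("DAY_OF_YEAR  = " + calendar.get(Calendar.DAY_OF_YEAR));
        System.out.println("HOUR         = " + calendar.get(Calendar.HOUR));
        System.out.println("HOUR_OF_DAY  = " + calendar.get(Calendar.HOUR_OF_DAY));
        System.out.println("MINUTE       = " + calendar.get(Calendar.MINUTE));
    }
}
```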
The API provided by java.util.Calendar is not very well designed as your confusion illustrates. However take a look at the JavaDoc for get(). The int value is meant to represent the field you want to get the value of. See all of the members listed at that JavaDoc described as "Field number ..." such as YEAR. So calendar.get(Calendar.YEAR) would equal 2011.
The Calendar class is overkill for many common date-related scenarios. The history is roughly as follows: the Date class was found to have many deficiencies when it comes to manipulating date objects, hence the Calendar class was introduced. However, the Calendar class has proved to be an over-engineered solution for many common date-related scenarios.
Read the Javadoc for better understanding of the Calendar class.