Validate correct date time format fetched from csv - java

I need to read a CSV file whose timestamp column contains values in several different formats. Each value can be in any of the 5 formats listed below. I need to detect the format of each row's value and parse it accordingly.
Please suggest how to validate and parse it. Thanks in advance.
public static final String DEFAULT_DATE_FORMAT_PATTERN = "yyyy-MM-dd";
public static final String DEFAULT_DATE_TIME_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss.SSS";
public static final String DATE_TIME_MINUTES_ONLY_FORMAT_PATTERN = "yyyy-MM-dd HH:mm";
public static final String DATE_TIME_WITHOUT_MILLIS_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss";
The fifth format is epoch time in milliseconds.

What you need is a formatter with optional parts. A pattern can contain square brackets to denote an optional part, for example HH:mm[:ss]. The formatter then is required to parse HH:mm, and tries to parse the following text as :ss, or skips it if that fails. yyyy-MM-dd[ HH:mm[:ss[.SSS]]] would then be the pattern.
There is only one issue here: when you try to parse a string with the pattern yyyy-MM-dd (so without time part) using LocalDateTime::parse, it will throw a DateTimeParseException with the message Unable to obtain LocalDateTime from TemporalAccessor. Apparently, at least one time part must be available to succeed.
Luckily, we can use a DateTimeFormatterBuilder to build a pattern, instructing the formatter to use some defaults if information is missing from the parsed text. Here it is:
DateTimeFormatter formatter = new DateTimeFormatterBuilder()
.appendPattern("yyyy-MM-dd[ HH:mm[:ss[.SSS]]]")
.parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
.parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
.parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
.toFormatter();
LocalDateTime dateTime = LocalDateTime.parse(input, formatter);
Tests:
String[] inputs = {
"2020-10-22", // OK
"2020-10-22 14:55", // OK
"2020-10-22T14:55", // Fails: incorrect format
"2020-10-22 14:55:23", // OK
"2020-10-22 14:55:23.9", // Fails: incorrect fraction of second
"2020-10-22 14:55:23.91", // Fails: incorrect fraction of second
"2020-10-22 14:55:23.917", // OK
"2020-10-22 14:55:23.9174", // Fails: incorrect fraction of second
"2020-10-22 14:55:23.917428511" // Fails: incorrect fraction of second
};
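To check inputs like the ones above yourself, a small harness along these lines may help. It is only a sketch: the class name OptionalPartsDemo and the helper tryParse are made up for illustration, and the formatter is the one built earlier in this answer.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;
import java.util.Optional;

// Hypothetical helper class, not part of any library.
final class OptionalPartsDemo {

    // Same formatter as above: the time parts are optional and default to 0.
    static final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
            .appendPattern("yyyy-MM-dd[ HH:mm[:ss[.SSS]]]")
            .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
            .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
            .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
            .toFormatter();

    // Returns the parsed value, or an empty Optional if the text is rejected.
    static Optional<LocalDateTime> tryParse(String input) {
        try {
            return Optional.of(LocalDateTime.parse(input, FORMATTER));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String[] inputs = {
            "2020-10-22", "2020-10-22 14:55", "2020-10-22T14:55", "2020-10-22 14:55:23.9"
        };
        for (String input : inputs) {
            String result = tryParse(input).map(Object::toString).orElse("rejected");
            System.out.println(input + " -> " + result);
        }
    }
}
```

Returning an Optional keeps the validation and the parsing in one step, so the caller decides what to do with rows that match none of the variants.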
And what about epoch in milli?
Well, this cannot be parsed directly by the DateTimeFormatter. But what's more: an epoch in milli has an implicit timezone: UTC. The other patterns lack a timezone. So an epoch is a fundamentally different piece of information. One thing you could do is assume a timezone for the inputs missing one.
However, if you nevertheless want to parse the instant, you could try to parse it as a long using Long::parseLong, and if it fails, then try to parse with the formatter. Alternatively, you could use a regular expression (like -?\d+ or something) to try to match the instant, and if it does, then parse as instant, and if it fails, then try to parse with the abovementioned formatter.
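The regex variant described above could look roughly like this. It is a sketch under assumptions: the class name TimestampParser and the assumedZone parameter are invented here, and the zone to assume for the zone-less formats is a decision only you can make for your data.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class TimestampParser {

    // An optional minus sign followed by digits only is treated as epoch milliseconds.
    private static final Pattern EPOCH_MILLIS = Pattern.compile("-?\\d+");

    // The formatter with optional parts from above.
    private static final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
            .appendPattern("yyyy-MM-dd[ HH:mm[:ss[.SSS]]]")
            .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
            .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
            .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
            .toFormatter();

    // assumedZone is applied only to the inputs that carry no zone information.
    // Throws DateTimeParseException for unknown formats and
    // NumberFormatException for digit strings that do not fit into a long.
    static Instant parse(String text, ZoneId assumedZone) {
        String trimmed = text.trim();
        if (EPOCH_MILLIS.matcher(trimmed).matches()) {
            return Instant.ofEpochMilli(Long.parseLong(trimmed));
        }
        return LocalDateTime.parse(trimmed, FORMATTER).atZone(assumedZone).toInstant();
    }

    public static void main(String[] args) {
        System.out.println(parse("1603378523917", ZoneOffset.UTC));
        System.out.println(parse("2020-10-22 14:55:23.917", ZoneOffset.UTC));
    }
}
```

Converting everything to Instant gives all five formats one common result type, at the price of having to pick a zone for the four formats that do not carry one.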

The brute force approach:
simply try your 4 formats, one after the other to parse the incoming string
if parsing throws an exception, try the next one
if parsing passes, well, that format just matched
Of course, if we are talking about larger tables, that is quite inefficient. Possible optimisations:
obviously, the different patterns have subtle differences, so you could use indexOf() checks first. Like: if the value to be parsed contains no ':' char, then it can only be the first pattern.
you can look at your data manually to figure the actual distribution of patterns that are used. then you adapt the order of patterns to try to the likelihood of the pattern being used in your data
Alternatively: you could define your own regex. Since all of your formats are purely numeric and share the same yyyy-MM-dd prefix, I think it shouldn't be too hard to write up a single regex that covers all your cases.
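A minimal sketch of the brute-force approach with the indexOf() pre-check might look like this. The class name BruteForceParser is hypothetical, and the order of the patterns is only a guess that you would adapt to the distribution in your data.

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class BruteForceParser {

    // Date-time patterns in the order they are tried (assumed order, adapt to your data).
    private static final List<DateTimeFormatter> DATE_TIME_FORMATS = List.of(
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS"),
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"),
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm"));

    private static final DateTimeFormatter DATE_ONLY = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    static LocalDateTime parse(String text) {
        // Cheap pre-check: without a ':' only the date-only pattern can match.
        // A malformed date-only value makes LocalDate.parse throw DateTimeParseException.
        if (text.indexOf(':') < 0) {
            return LocalDate.parse(text, DATE_ONLY).atStartOfDay();
        }
        for (DateTimeFormatter format : DATE_TIME_FORMATS) {
            try {
                return LocalDateTime.parse(text, format);
            } catch (DateTimeParseException e) {
                // This pattern did not match; fall through to the next one.
            }
        }
        throw new IllegalArgumentException("Unsupported timestamp format: " + text);
    }

    public static void main(String[] args) {
        System.out.println(parse("2020-10-22"));
        System.out.println(parse("2020-10-22 14:55:23"));
    }
}
```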

Related

LocalDateTime (or any other suitable class) with time as optional parameter

I am trying to parse an incoming string which might contain time or not. Both the following dates should be accepted
"2022-03-03" and "2022-03-03 15:10:05".
Every DateTimeFormatter that I know of will fail in one of the two cases. This is one answer I found, but I don't know whether the time part can be made optional there.
ISO_DATE_TIME.format() to LocalDateTime with optional offset
The idea is if the time part is not present I should set it to the end of the day, so the time part should be 23:59:59.
Any help is appreciated. Thanks!
Well, you could utilize a DateTimeFormatterBuilder to specify defaults for missing fields:
private static LocalDateTime parse(String str) {
DateTimeFormatter formatter = new DateTimeFormatterBuilder()
.appendPattern("uuuu-MM-dd[ HH:mm:ss]")
.parseDefaulting(ChronoField.HOUR_OF_DAY, 23)
.parseDefaulting(ChronoField.MINUTE_OF_HOUR, 59)
.parseDefaulting(ChronoField.SECOND_OF_MINUTE, 59)
.toFormatter();
return LocalDateTime.parse(str, formatter);
}
The pattern specifies the format the formatter will try to parse. Note that the square brackets ([]) delimit optional parts. Everything between them will be either completely consumed, or entirely discarded.
With parseDefaulting you can specify the default values for when fields are missing. In your case, if the user provides only the date, the hour-of-day, minute-of-hour and second-of-minute fields are missing, which is why you need to provide defaults for them.
Example
System.out.println(parse("2022-03-03"));
System.out.println(parse("2022-03-03 15:10:05"));
System.out.println(parse("2025"));
Outputs the following:
2022-03-03T23:59:59
2022-03-03T15:10:05
Exception in thread "main" java.time.format.DateTimeParseException: Text '2025' could not be parsed at index 4

Parsing PDF date using Java DateTimeFormatter

I'm trying to parse the date format used in PDFs. According to this page, the format looks as follows:
D:YYYYMMDDHHmmSSOHH'mm'
Where all components except the year are optional. I assume this means the string can be cut off at any point, since, for example, specifying a year and an hour without specifying a month and a day seems kind of pointless to me. Also, it would make parsing pretty much impossible.
As far as I can tell, Java does not support zone offsets containing single quotes. Therefore, the first step would be to get rid of those:
D:YYYYMMDDHHmmSSOHHmm
The resulting Java date pattern should then look like this:
['D:']uuuu[MM[dd[HH[mm[ss[X]]]]]]
And my overall code looks like this:
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("['D:']uuuu[MM[dd[HH[mm[ss[X]]]]]]");
TemporalAccessor temporalAccessor = formatter.parseBest("D:20020101",
ZonedDateTime::from,
LocalDateTime::from,
LocalDate::from,
Month::from,
Year::from
);
I would expect that to result in a LocalDate object but what I get is java.time.format.DateTimeParseException: Text 'D:20020101' could not be parsed at index 2.
I've played around a bit with that and found out that everything works fine with the optional literal at the beginning but as soon as I add optional date components, I get an exception.
Can anybody tell me what I'm doing wrong?
Thanks in advance!
I've found a solution:
String dateString = "D:20020101120000+01'00'";
String normalized = dateString.replace("'", "");
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("['D:']ppppy[ppM[ppd[ppH[ppm[pps[X]]]]]]");
TemporalAccessor temporalAccessor = formatter.parseBest(normalized,
OffsetDateTime::from,
LocalDateTime::from,
LocalDate::from,
YearMonth::from,
Year::from
);
As it seems, the length of the components is ambiguous and parsing of the date without any separators thus failed.
When specifying a padding, the length of each component is clearly stated and the date can therefore be parsed.
At least that's my theory.

Unable to parse optional microseconds in localTime

I am receiving a timestamp in the format HHmmss followed by milliseconds and microseconds. The microseconds after the '.' are optional.
For example: "timestamp ":"152656375.489991" is 15:26:56:375.489991.
Below code is throwing exceptions:
final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
.appendPattern("HHmmssSSS")
.appendFraction(ChronoField.MICRO_OF_SECOND, 0, 6, true)
.toFormatter();
LocalTime.parse(dateTime,FORMATTER);
Can someone please help me with a DateTimeFormatter to get a LocalTime in Java?
Here is the stacktrace from the exception from the code above:
java.time.format.DateTimeParseException: Text '152656375.489991' could not be parsed: Conflict found: NanoOfSecond 375000000 differs from NanoOfSecond 489991000 while resolving MicroOfSecond
at java.base/java.time.format.DateTimeFormatter.createError(DateTimeFormatter.java:1959)
at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1894)
at java.base/java.time.LocalTime.parse(LocalTime.java:463)
at com.ajax.so.Test.main(Test.java:31)
Caused by: java.time.DateTimeException: Conflict found: NanoOfSecond 375000000 differs from NanoOfSecond 489991000 while resolving MicroOfSecond
at java.base/java.time.format.Parsed.updateCheckConflict(Parsed.java:329)
at java.base/java.time.format.Parsed.resolveTimeFields(Parsed.java:462)
at java.base/java.time.format.Parsed.resolveFields(Parsed.java:267)
at java.base/java.time.format.Parsed.resolve(Parsed.java:253)
at java.base/java.time.format.DateTimeParseContext.toResolved(DateTimeParseContext.java:331)
at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:1994)
at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1890)
... 3 more
There are many options, depending on the possible variations in the strings you need to parse.
1. Modify the string so you need no formatter
String timestampString = "152656375.489991";
timestampString = timestampString.replaceFirst(
"^(\\d{2})(\\d{2})(\\d{2})(\\d{3})(?:\\.(\\d*))?$", "$1:$2:$3.$4$5");
System.out.println(timestampString);
LocalTime time = LocalTime.parse(timestampString);
System.out.println(time);
The output from this snippet is (the modified string first, then the parsed time):
15:26:56.375489991
15:26:56.375489991
The replaceFirst() call modifies your string into 15:26:56.375489991, the default format for LocalTime (ISO 8601) so it can be parsed without any explicit formatter. For this I am using a regular expression that may not be too readable. (…) enclose groups that I use as $1, $2, etc., in the replacement string. (?:…) denotes a non-capturing group, that is, cannot be used in the replacement string. I put a ? after it to specify that this group is optional in the original string.
This solution accepts from 1 through 6 decimals after the point and also no fractional part at all.
2. Use a simpler string modification and a formatter
I want to modify the string so I can use this formatter:
private static DateTimeFormatter fullParser
= DateTimeFormatter.ofPattern("HHmmss.[SSSSSSSSS][SSS]");
This requires the point to be after the seconds rather than after the milliseconds. So move it three places to the left:
timestampString = timestampString.replaceFirst("(\\d{3})(?:\\.|$)", ".$1");
LocalTime time = LocalTime.parse(timestampString, fullParser);
15:26:56.375489991
Again I am using a non-capturing group, this time to say that after the (captured) group of three digits must come either a dot or the end of the string.
3. The same with a more flexible parser
The formatter above specifies that there must be either 9 or 3 digits after the decimal point, which may be too rigid. If you want to accept something in between too, a builder can build a more flexible formatter:
private static DateTimeFormatter fullParser = new DateTimeFormatterBuilder()
.appendPattern("HHmmss")
.appendFraction(ChronoField.NANO_OF_SECOND, 3, 9, true)
.toFormatter();
I think that this would be my favourite approach, again depending on the exact requirements.
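Put together, the string modification from option 2 and the builder-based formatter could be combined as in the following sketch. The class name CompactTimeParser is invented for the example; the regex and the formatter are the ones shown above.

```java
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;

// Hypothetical helper class, not part of any library.
final class CompactTimeParser {

    // HHmmss followed by a decimal point and 3 to 9 fraction digits.
    private static final DateTimeFormatter PARSER = new DateTimeFormatterBuilder()
            .appendPattern("HHmmss")
            .appendFraction(ChronoField.NANO_OF_SECOND, 3, 9, true)
            .toFormatter();

    static LocalTime parse(String timestamp) {
        // Move the decimal point from after the milliseconds to after the seconds,
        // or insert one before the milliseconds if the string has no point at all.
        String moved = timestamp.replaceFirst("(\\d{3})(?:\\.|$)", ".$1");
        return LocalTime.parse(moved, PARSER);
    }

    public static void main(String[] args) {
        System.out.println(parse("152656375.489991"));
        System.out.println(parse("152656375"));
    }
}
```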
4. Parse only a part of the string
"There is no problem so big and awful that it cannot simply be run away from" (Linus in Peanuts, from memory)
If you can live without the microseconds, ignore them:
private static DateTimeFormatter partialParser
= DateTimeFormatter.ofPattern("HHmmssSSS");
To parse only the part of the string up to the point using this formatter:
TemporalAccessor parsed
= partialParser.parse(timestampString, new ParsePosition(0));
LocalTime time = LocalTime.from(parsed);
15:26:56.375
As you can see it has ignored the part from the decimal point, which I wouldn’t find too satisfactory.
What went wrong in your code?
Your 6 digits after the decimal point denote nanoseconds. Microseconds would have been only 3 decimals after the milliseconds. To use appendFraction() to parse these you would have needed a TemporalField of nano of millisecond. The ChronoField enum offers nano of day and nano of second, but not nano of milli. TemporalField is an interface, so in theory we could develop our own nano of milli class for the purpose. I tried to develop a class implementing TemporalField once, but gave up, I couldn’t get it to work.
Links
Wikipedia article: ISO 8601
Regular expressions in Java - Tutorial

Can I switch between formats for DateTimeFormatterBuilder?

I'm using DateTimeFormatterBuilder() to parse the JSON data I take in (the commented code), which comes in one of two formats. To decide which format to use, I'm using a REGEX to find any pair of square brackets, [], with anything inside (" .*? "). After choosing the correct format, I would parse the new value into another JSON object.
The problem is, my program either does not correctly choose which format to use (either a REGEX and method error), or doesn't format it correctly (formatting error), not sure which, and sends an error back (bottom of code), instead.
However, this is only for data that has square brackets. Data without square brackets gets processed correctly. I'm wondering if there are any solutions/suggestions to fix this?
// 2018-11-28T13:09:00.2-04:00
def utcDateFormatter = new DateTimeFormatterBuilder()
.appendPattern("yyyy-MM-dd'T'HH:mm:ss")
.appendFraction(ChronoField.MILLI_OF_SECOND, 0, 3, true)
.appendPattern("xxx")
.toFormatter()
// 2018-11-28T13:09:00.528-08:00[America/New_York]
def utcDateFormatterWithZone = new DateTimeFormatterBuilder()
.appendPattern("yyyy-MM-dd'T'HH:mm:ss")
.appendFraction(ChronoField.MILLI_OF_SECOND, 0, 3, true)
.appendPattern("xxx'['VV']'")
.toFormatter()
if (json.beginDateTime.find("\\[.*?\\]") == true) {
object.setDate(LocalDateTime.parse("${json.beginDateTime}", utcDateFormatterWithZone).format(outFormatter))
} else {
object.setDate(LocalDateTime.parse("${json.beginDateTime}", utcDateFormatter).format(outFormatter))
}
Error: Text '2019-09-26T15:01:07.941-05:00[America/New_York]' could not be parsed, unparsed text found at index 29
This is built-in: DateTimeFormatter.ISO_ZONED_DATE_TIME
This can be done a lot more easily. The built-in DateTimeFormatter.ISO_ZONED_DATE_TIME matches both of your formats.
String stringWithoutZoneId = "2018-11-28T13:09:00.2-04:00";
String stringWithZoneId = "2018-11-28T13:09:00.528-08:00[America/New_York]";
LocalDateTime parsedWithoutZoneId = LocalDateTime.parse(
stringWithoutZoneId, DateTimeFormatter.ISO_ZONED_DATE_TIME);
System.out.println(parsedWithoutZoneId);
LocalDateTime parsedWithZoneId = LocalDateTime.parse(
stringWithZoneId, DateTimeFormatter.ISO_ZONED_DATE_TIME);
System.out.println(parsedWithZoneId);
Output from this snippet is:
2018-11-28T13:09:00.200
2018-11-28T13:09:00.528
Use the offset too
A word of warning, though: Are you sure you want to ignore the offsets in the strings? With those offsets the strings represent unambiguous points in time. What you get from parsing into LocalDateTime are datetimes belonging at different unknown offsets. I can’t see how you can reliably use them for anything useful.
Consider parsing into ZonedDateTime. The one-arg ZonedDateTime.parse will even do this without any explicit formatter. Then either store these ZonedDateTime directly in your objects or convert to Instant and store those. An Instant represents a point in time. If you cannot change the type stored, you will probably want to convert your ZonedDateTime to UTC (or another agreed-upon time zone), then convert to LocalDateTime. All of this said without knowing your real requirements, so I could be wrong, only I think not.
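As a rough sketch of that suggestion (the class and method names are made up, and I have used -05:00 with America/New_York so that offset and zone agree for a November date, which is my assumption and not your data):

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical helper class, not part of any library.
final class OffsetAwareParsing {

    // Keeps the point in time: the offset in the string is honoured.
    static Instant toInstant(String text) {
        return ZonedDateTime.parse(text).toInstant();
    }

    // Only if the stored type must stay LocalDateTime: normalise to UTC first.
    static LocalDateTime toUtcLocalDateTime(String text) {
        return ZonedDateTime.parse(text)
                .withZoneSameInstant(ZoneOffset.UTC)
                .toLocalDateTime();
    }

    public static void main(String[] args) {
        String withoutZoneId = "2018-11-28T13:09:00.2-04:00";
        String withZoneId = "2018-11-28T13:09:00.528-05:00[America/New_York]";
        System.out.println(toInstant(withoutZoneId));
        System.out.println(toInstant(withZoneId));
        System.out.println(toUtcLocalDateTime(withoutZoneId));
    }
}
```

With every value normalised to UTC, the resulting LocalDateTime values are at least comparable with each other, which is not the case when the offsets are simply dropped.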
What went wrong in your code?
@daggett is correct: CharSequence.find returns a string, so for your if statement to work you would have needed:
if (json.beginDateTime.find("\\[.*?\\]") != null) {
A String can never be equal to true, so the formatter without zone was always chosen.
Link
Documentation of DateTimeFormatter.ISO_ZONED_DATE_TIME

Invalid format issue parsing string to JodaTime

String dateString = "20110706 1607";
DateTimeFormatter dateStringFormat = DateTimeFormat.forPattern("YYYYMMDD HHMM");
DateTime dateTime = dateStringFormat.parseDateTime(dateString);
Resulting stacktrace:
Exception in thread "main" java.lang.IllegalArgumentException: Invalid format: "201107206 1607" is malformed at " 1607"
at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:644)
at org.joda.time.convert.StringConverter.getInstantMillis(StringConverter.java:65)
at org.joda.time.base.BaseDateTime.<init>(BaseDateTime.java:171)
at org.joda.time.DateTime.<init>(DateTime.java:168)
......
Any thoughts? If I truncate the string to 20110706 with pattern "YYYYMMDD" it works, but I need the hour and minute values as well. What's odd is that I can convert a Jodatime DateTime to a String using the same pattern "YYYYMMDD HHMM" without issue
Thanks for looking
Look at your pattern - you're specifying "MM" twice. That can't possibly be right. That would be trying to parse the same field (month in this case) twice from two different bits of the text. Which would you expect to win? You want:
DateTimeFormat.forPattern("yyyyMMdd HHmm")
Look at the documentation for DateTimeFormat to see what everything means.
Note that although calling toString with that pattern will produce a string, it won't produce the string you want it to. I wouldn't be surprised if the output even included "YYYY" and "DD" due to the casing, although I can't test it right now. At the very least you'd have the month twice instead of the minutes appearing at the end.
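The same case sensitivity of pattern letters applies to java.time. The following sketch deliberately uses java.time's DateTimeFormatter instead of Joda-Time, so that it runs without an extra dependency; the class name PatternCaseDemo is made up, and the pattern is the corrected one from above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical demo class using java.time instead of Joda-Time.
final class PatternCaseDemo {

    // Lowercase yyyy = year, uppercase MM = month, lowercase dd = day of month,
    // uppercase HH = hour of day, lowercase mm = minute of hour.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMdd HHmm");

    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text, FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parse("20110706 1607"));
    }
}
```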
