Parsing PDF date using Java DateTimeFormatter - java

I'm trying to parse the date format used in PDFs. According to this page, the format looks as follows:
D:YYYYMMDDHHmmSSOHH'mm'
Where all components except the year are optional. I assume this means the string can be cut off at any point as i.e. specifying a year and an hour without specifying a month and a day seems kind of pointless to me. Also, it would make parsing pretty much impossible.
As far as I can tell, Java does not support zone offsets containing single quotes. Therefore, the first step would be to get rid of those:
D:YYYYMMDDHHmmSSOHHmm
The resulting Java date pattern should then look like this:
['D:']uuuu[MM[dd[HH[mm[ss[X]]]]]]
And my overall code looks like this:
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("['D:']uuuu[MM[dd[HH[mm[ss[X]]]]]]");
TemporalAccessor temporalAccessor = formatter.parseBest("D:20020101",
ZonedDateTime::from,
LocalDateTime::from,
LocalDate::from,
Month::from,
Year::from
);
I would expect that to result in a LocalDate object but what I get is java.time.format.DateTimeParseException: Text 'D:20020101' could not be parsed at index 2.
I've played around a bit with that and found out that everything works fine with the optional literal at the beginning but as soon as I add optional date components, I get an exception.
Can anybody tell me what I'm doing wrong?
Thanks in advance!

I've found a solution:
String dateString = "D:20020101120000+01'00'";
String normalized = dateString.replace("'", "");
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("['D:']ppppy[ppM[ppd[ppH[ppm[pps[X]]]]]]");
TemporalAccessor temporalAccessor = formatter.parseBest(normalized,
OffsetDateTime::from,
LocalDateTime::from,
LocalDate::from,
YearMonth::from,
Year::from
);
As it seems, the length of the components is ambiguous and parsing of the date without any separators thus failed.
When specifying a padding, the length of each component is clearly stated and the date can therefore be parsed.
At least that's my theory.

Related

Date time conversion in Java8

I tried searching across the web, but unable to find a suitable answer and hence posting here.
I am calling another API which gives me the date-time like
"2022-02-05T17:13:20-06:00[America/Chicago]"
I would like to convert this to a format like
"2022-02-05T17:13:20.000Z" (I am unsure what the milli-second will turn out as)
Could someone help me achieve this?
I am unable to get any example for this specific conversion scenario!!!
Regards,
Sriram
For the sample values given getting to a UTC timestamp should be something like
ZonedDateTime.parse("2022-02-05T17:13:20-06:00[America/Chicago]").toInstant().toString()
The datetime value including the timezone is the canonical representation of a ZonedDateTime and therefore can be parsed as such.
Instant is a UTC timestamp that prints in the form of 2022-02-05T23:13:20Z
You could take more influence on the behavior using a DateTimeFormatter - but since both input and output seem to be standard formats it does not seem necessary.
Here below is a code snippet that I was able to generate....
But, I am unsure of that is correct. Please let me know if you feel anything is incorrect.
String requestTime = "2022-02-05T17:13:20-06:00[America/Chicago]";
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssXXX'['VV']'");
ZonedDateTime zonedDateTime = ZonedDateTime.parse(requestTime, formatter);
zonedDateTime.format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss'Z'"));
System.out.println(zonedDateTime.format(DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")));

Validate correct date time format fetched from csv

I need to read one csv file which has different time format in one timestamp column. It can be anything from below mentioned 5 formats. I need to match the fetched date and parse accordingly on each row.
Please suggest how to validate ad parse it. thanks in advance.
public static final String DEFAULT_DATE_FORMAT_PATTERN = "yyyy-MM-dd";
public static final String DEFAULT_DATE_TIME_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss.SSS";
public static final String DATE_TIME_MINUTES_ONLY_FORMAT_PATTERN = "yyyy-MM-dd HH:mm";
public static final String DATE_TIME_WITHOUT_MILLIS_FORMAT_PATTERN = "yyyy-MM-dd HH:mm:ss";
Epoch in milli
What you need is a formatter with optional parts. A pattern can contain square brackets to denote an optional part, for example HH:mm[:ss]. The formatter then is required to parse HH:mm, and tries to parse the following text as :ss, or skips it if that fails. yyyy-MM-dd[ HH:mm[:ss[.SSS]]] would then be the pattern.
There is only one issue here – when you try to parse a string with the pattern yyyy-MM-dd (so without time part) using LocalDateTime::parse, it will throw a DateTimeFormatException with the message Unable to obtain LocalDateTime from TemporalAccessor. Apparently, at least one time part must be available to succeed.
Luckily, we can use a DateTimeFormatterBuilder to build a pattern, instructing the formatter to use some defaults if information is missing from the parsed text. Here it is:
DateTimeFormatter formatter = new DateTimeFormatterBuilder()
.appendPattern("yyyy-MM-dd[ HH:mm[:ss[.SSS]]]")
.parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
.parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
.parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
.toFormatter();
LocalDateTime dateTime = LocalDateTime.parse(input, formatter);
Tests:
String[] inputs = {
"2020-10-22", // OK
"2020-10-22 14:55", // OK
"2020-10-22T14:55", // Fails: incorrect format
"2020-10-22 14:55:23",
"2020-10-22 14:55:23.9", // Fails: incorrect fraction of second
"2020-10-22 14:55:23.91", // Fails: incorrect fraction of second
"2020-10-22 14:55:23.917", // OK
"2020-10-22 14:55:23.9174", // Fails: incorrect fraction of second
"2020-10-22 14:55:23.917428511" // Fails: incorrect fraction of second
};
And what about epoch in milli?
Well, this cannot be parsed directly by the DateTimeFormatter. But what's more: an epoch in milli has an implicit timezone: UTC. The other patterns lack a timezone. So an epoch is a fundamentally different piece of information. One thing you could do is assume a timezone for the inputs missing one.
However, if you nevertheless want to parse the instant, you could try to parse it as a long using Long::parseLong, and if it fails, then try to parse with the formatter. Alternatively, you could use a regular expression (like -?\d+ or something) to try to match the instant, and if it does, then parse as instant, and if it fails, then try to parse with the abovementioned formatter.
The brute force approach:
simply try your 4 formats, one after the other to parse the incoming string
if parsing throws an exception, try the next one
if parsing passes, well, that format just matched
Of course, if we are talking about larger tables, that is quite inefficient. Possible optimisations:
obviously, the different patterns have subtle differences, so you could use indexOf() checks first. Like: if the value to be parsed contains no ':' char, then it can only be the first pattern.
you can look at your data manually to figure the actual distribution of patterns that are used. then you adapt the order of patterns to try to the likelihood of the pattern being used in your data
Alternatively: you could define your own regex. The only thing that makes it slightly ugly is the fact that your input uses month names, not month number. But I think it shouldn't be too hard to write up a single regex that covers all your cases.

java.time.format.DateTimeParseException: could not be parsed at index 0

I am trying to convert a String to timestamp.
my string contains time and time zone ('05:03:05.875+02:00') but I get the following error:
error
java.time.format.DateTimeParseException: Text '05:03:05.875+02:00'
could not be parsed at index 0
Code
String timewithZone= "05:03:05.875+02:00";
DateTimeFormatter formatter=DateTimeFormatter.ISO_OFFSET_DATE_TIME;
final ZonedDateTime a2=ZonedDateTime.parse(timewithZone,formatter);
String timewithZone = "05:03:05.875+02:00";
OffsetTime time = OffsetTime.parse(timewithZone);
System.out.println("Parsed into " + time);
This outputs
Parsed into 05:03:05.875+02:00
Your string contains a time and an offset, but no date. This conforms nicely, neither with an ISO_OFFSET_DATE_TIME nor a ZonedDateTime, but with an OffsetTime, a seldom used class that I think is there exactly because such a thing sometimes occurs in XML.
There is also an ISO_OFFSET_TIME formatter built in, but since this is the default format for OffsetTime we do not need to specify it.
It is failing because your string does not have the date related tokens. Check the example in the official documentation. In order to make it work you will need to add year/month/day data:
String timewithZone= "2018-07-3T05:03:05.875+02:00";
You cannot use the DateTimeFormatter.ISO_OFFSET_DATE_TIME because your date format does not adhere to that.
In fact, you don't even have a real date because there is no date component, only a time component.
You could define your own SimpleDateFormat instance with the format you're receiving, but you'll have to handle the fact that this data isn't really all that useful without date information. For example, that offset doesn't really tell us that much, because it might be in some region's Daylight Savings Time (DST) or not. And this heavily depends on on what actual DATE it is, not just what time.
You'll have to find out what the provider of this data even means with this, because right now you simply don't have enough information to really parse this into a proper date.
If this data just means a simple time stamp, used for for example saying "our worldwide office hours lunch time is at 12:30" then you could us the LocalTime class. That would be one of the few cases where a time string like this without a date is really useful. I have a sneaking suspicion this is not your scenario though.
A workaround may be
DateTimeFormatter formatter = DateTimeFormatter.ofPattern("HH:mm:ss.SSSz");
String str = "05:03:05.875+02:00";
LocalTime time = LocalTime.parse(str, formatter);
System.out.println(time);
I tried the code in the format what you are getting,
Output :
05:03:05.875
Is this the one you are looking for?

SimpleDateFormat leniency leads to unexpected behavior

I have found that SimpleDateFormat::parse(String source)'s behavior is (unfortunatelly) defaultly set as lenient: setLenient(true).
By default, parsing is lenient: If the input is not in the form used by this object's format method but can still be parsed as a date, then the parse succeeds.
If I set the leniency to false, the documentation said that with strict parsing, inputs must match this object's format. I have used paring with SimpleDateFormat without the lenient mode and by mistake, I had a typo in the date (letter o instead of number 0). (Here is the brief working code:)
// PASSED (year 199)
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("dd.mm.yyyy");
System.out.println(simpleDateFormat.parse("03.12.199o"));
simpleDateFormat.setLenient(false);
System.out.println(simpleDateFormat.parse("03.12.199o")); //WTF?
In my surprise, this has passed and no ParseException has been thrown. I'd go further:
// PASSED (year 1990)
String string = "just a String to mess with SimpleDateFormat";
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("dd.mm.yyyy");
System.out.println(simpleDateFormat.parse("03.12.1990" + string));
simpleDateFormat.setLenient(false);
System.out.println(simpleDateFormat.parse("03.12.1990" + string));
Let's go on:
// FAILED on the 2nd line
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("dd.mm.yyyy");
System.out.println(simpleDateFormat.parse("o3.12.1990"));
simpleDateFormat.setLenient(false);
System.out.println(simpleDateFormat.parse("o3.12.1990"));
Finally, the exception is thrown: Unparseable date: "o3.12.1990". I wonder where is the difference in the leniency and why the last line of my first code snippet has not thrown an exception? The documentation says:
With strict parsing, inputs must match this object's format.
My input clearly doesn't strictly match the format - I expect this parsing to be really strict. Why does this (not) happen?
Why does this (not) happen?
It’s not very well explained in the documentation.
With lenient parsing, the parser may use heuristics to interpret
inputs that do not precisely match this object's format. With strict
parsing, inputs must match this object's format.
The documentation does help a bit, though, by mentioning that it is the Calendar object that the DateFormat uses that is lenient. That Calendar object is not used for the parsing itself, but for interpreting the parsed values into a date and time (I am quoting DateFormat documentation since SimpleDateFormat is a subclass of DateFormat).
SimpleDateFormat, no matter if lenient or not, will accept 3-digit year, for example 199, even though you have specified yyyy in the format pattern string. The documentation says about year:
For parsing, if the number of pattern letters is more than 2, the year
is interpreted literally, regardless of the number of digits. So using
the pattern "MM/dd/yyyy", "01/11/12" parses to Jan 11, 12 A.D.
DateFormat, no matter if lenient or not, accepts and ignores text after the parsed text, like the small letter o in your first example. It objects to unexpected text before or inside the text, as when in your last example you put the letter o in front. The documentation of DateFormat.parse says:
The method may not use the entire text of the given string.
As I indirectly said, leniency makes a difference when interpreting the parsed values into a date and time. So a lenient SimpleDateFormat will interpret 29.02.2019 as 01.03.2019 because there are only 28 days in February 2019. A strict SimpleDateFormat will refuse to do that and will throw an exception. The default lenient behaviour can lead to very surprising and downright inexplicable results. As a simple example, giving the day, month and year in the wrong order: 1990.03.12 will result in August 11 year 17 AD (2001 years ago).
The solution
VGR already in a comment mentioned LocalDate from java.time, the modern Java date and time API. In my experience java.time is so much nicer to work with than the old date and time classes, so let’s give it a shot. Try a correct date string first:
DateTimeFormatter dateFormatter = DateTimeFormatter.ofPattern("dd.mm.yyyy");
System.out.println(LocalDate.parse("03.12.1990", dateFormatter));
We get:
java.time.format.DateTimeParseException: Text '03.12.1990' could not
be parsed: Unable to obtain LocalDate from TemporalAccessor:
{Year=1990, DayOfMonth=3, MinuteOfHour=12},ISO of type
java.time.format.Parsed
This is because I used your format pattern string of dd.mm.yyyy, where lowercase mm means minute. When we read the error message closely enough, it does state that the DateTimeFormatter interpreted 12 as minute of hour, which was not what we intended. While SimpleDateFormat tacitly accepted this (even when strict), java.time is more helpful in pointing out our mistake. What the message only indirectly says is that it is missing a month value. We need to use uppercase MM for month. At the same time I am trying your date string with the typo:
DateTimeFormatter dateFormatter = DateTimeFormatter.ofPattern("dd.MM.yyyy");
System.out.println(LocalDate.parse("03.12.199o", dateFormatter));
We get:
java.time.format.DateTimeParseException: Text '03.12.199o' could not
be parsed at index 6
Index 6 is where is says 199. It objects because we had specified 4 digits and are only supplying 3. The docs say:
The count of letters determines the minimum field width …
It would also object to unparsed text after the date. In short it seems to me that it gives you everything that you had expected.
Links
DateFormat.setLenient documentation
Oracle tutorial: Date Time explaining how to use java.time.
Leniency is not about whether the entire input matches but whether the format matches. Your input can still be 3.12.1990somecrap and it would work.
The actual parsing is done in parse(String, ParsePosition) which you could use as well. Basically parse(String) will pass a ParsePosition that is set up to start at index 0 and when the parsing is done the current index of that position is checked.
If it's still 0 the start of the input didn't match the format, not even in lenient mode.
However, to the parser 03.12.199 is a valid date and hence it stops at index 8 - which isn't 0 and thus the parsing succeeded. If you want to check whether everything was parsed you'd have to pass your own ParsePosition and check whether the index is matches to the length of the input.
If you use setLenient(false) it will still parse the date till the desired pattern is meet. However, it will check the output date is a valid date or not. In your case, 03.12.199 is a valid date, so it will not throw an exception. Lets take an example to understand where the setLenient(false) different from setLenient(true)/default.
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("dd.MM.yyyy");
System.out.println(simpleDateFormat.parse("31.02.2018"));
The above will give me output: Sat Mar 03 00:00:00 IST 2018
But the below code throw ParseException as 31.02.2018 is not a valid/possible date:
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("dd.MM.yyyy");
simpleDateFormat.setLenient(false);
System.out.println(simpleDateFormat.parse("31.02.2018"));

Invalid format issue parsing string to JodaTime

String dateString = "20110706 1607";
DateTimeFormatter dateStringFormat = DateTimeFormat.forPattern("YYYYMMDD HHMM");
DateTime dateTime = dateStringFormat.parseDateTime(dateString);
Resulting stacktrace:
Exception in thread "main" java.lang.IllegalArgumentException: Invalid format: "201107206 1607" is malformed at " 1607"
at org.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:644)
at org.joda.time.convert.StringConverter.getInstantMillis(StringConverter.java:65)
at org.joda.time.base.BaseDateTime.<init>(BaseDateTime.java:171)
at org.joda.time.DateTime.<init>(DateTime.java:168)
......
Any thoughts? If I truncate the string to 20110706 with pattern "YYYYMMDD" it works, but I need the hour and minute values as well. What's odd is that I can convert a Jodatime DateTime to a String using the same pattern "YYYYMMDD HHMM" without issue
Thanks for looking
Look at your pattern - you're specifying "MM" twice. That can't possibly be right. That would be trying to parse the same field (month in this case) twice from two different bits of the text. Which would you expect to win? You want:
DateTimeFormat.forPattern("yyyyMMdd HHmm")
Look at the documentation for DateTimeFormat to see what everything means.
Note that although calling toString with that pattern will produce a string, it won't produce the string you want it to. I wouldn't be surprised if the output even included "YYYY" and "DD" due to the casing, although I can't test it right now. At the very least you'd have the month twice instead of the minutes appearing at the end.

Categories