Extracting dates from string - java

I have a list with file names that look roughly like this: Gadget1-010912000000-020912235959.csv, i.e. they contain two dates indicating the timespan of their data.
The user enters a date format and a file format:
File Format in this case: *GADGET*-*DATE_FROM*-*DATE_TO*.csv
Date format in this case: ddMMyyHHmmss
What I want to do is extract the three values from the file name using the given file and date format.
My problem is: since the date format can differ heavily (hours, minutes and seconds can be separated by a colon, dates by a dot, ...), I don't quite know how to create a fitting regular expression.

You can use a regular expression to remove the non-digit characters and then parse the value.
DateFormat dateFormat = new SimpleDateFormat("ddMMyyHHmmss");
String[] fileNameDetails = "Gadget1-010912000000-020912235959".split("-");
/* Remove every non-digit character. If there are none, the string is unchanged. */
String date = fileNameDetails[1].replaceAll("[^0-9]", "");
try {
    // Parse the cleaned string, not the raw token.
    Date from = dateFormat.parse(date);
} catch (ParseException e) {
    // The token did not match the date format.
    e.printStackTrace();
}
Hope it helps.

SimpleDateFormat solves your issue. You can define the format with commas, spaces and whatever and simply parse according to the format:
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
So you map your format (e.g. ddMMyyHHmmss) to a corresponding SimpleDateFormat:
SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
Date x = format.parse("010912000000");
If the format changes, you simply change the pattern passed to SimpleDateFormat.
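To tie this back to the file-name question, here is one possible sketch: turn the user's file format into a regular expression with named groups, then hand the two date groups to SimpleDateFormat. The class and method names (FileNameParser, toPattern, extract) are made up for illustration. It assumes the template contains all three placeholders exactly as written in the question and that the separator between fields (here a dash) does not occur inside the date format itself.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameParser {

    // Converts a template such as "*GADGET*-*DATE_FROM*-*DATE_TO*.csv" into a regex.
    // Literal parts are quoted; each *PLACEHOLDER* becomes a lazy named group.
    static Pattern toPattern(String fileFormat) {
        StringBuilder regex = new StringBuilder();
        Matcher m = Pattern.compile("\\*([A-Z_]+)\\*").matcher(fileFormat);
        int last = 0;
        while (m.find()) {
            regex.append(Pattern.quote(fileFormat.substring(last, m.start())));
            // Group names may only contain letters and digits, so drop the underscore.
            regex.append("(?<").append(m.group(1).replace("_", "")).append(">.+?)");
            last = m.end();
        }
        regex.append(Pattern.quote(fileFormat.substring(last)));
        return Pattern.compile(regex.toString());
    }

    // Returns {gadget, dateFrom, dateTo} as raw strings, or null if the name does not fit.
    static String[] extract(String fileFormat, String fileName) {
        Matcher m = toPattern(fileFormat).matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group("GADGET"), m.group("DATEFROM"), m.group("DATETO") };
    }

    public static void main(String[] args) throws ParseException {
        String[] parts = extract("*GADGET*-*DATE_FROM*-*DATE_TO*.csv",
                                 "Gadget1-010912000000-020912235959.csv");
        SimpleDateFormat dateFormat = new SimpleDateFormat("ddMMyyHHmmss");
        dateFormat.setLenient(false); // reject impossible values such as day 32
        Date from = dateFormat.parse(parts[1]);
        Date to = dateFormat.parse(parts[2]);
        System.out.println(parts[0] + " covers " + from + " to " + to);
    }
}
```

Because the date text is captured as an opaque string and parsed afterwards, the regex never has to know what the date format looks like; only SimpleDateFormat does.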

You can use a series of date-time formats, trying each until one works.
You may need to order the formats to prioritize matches.
For example, with Joda time, you can use DateTimeFormat.forPattern() and DateTimeFormatter.getParser() for each of a series of patterns. Try DateTimeParser.parseInto() until one succeeds.
One nice thing about this approach is that it is easy to add and remove patterns.
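The same idea can be sketched with the JDK's own java.time classes instead of Joda-Time (a deliberate swap so the example needs no extra library). The pattern list and the class name MultiFormatParser are illustrative assumptions; put the preferred patterns first.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

class MultiFormatParser {

    // Candidate patterns in priority order (example values, not a complete list).
    static final List<DateTimeFormatter> FORMATS = List.of(
            DateTimeFormatter.ofPattern("ddMMyyHHmmss"),
            DateTimeFormatter.ofPattern("dd.MM.yy HH:mm:ss"),
            DateTimeFormatter.ofPattern("dd.MM.yy.HH:mm:ss"));

    // Tries each pattern in turn and returns the first successful result.
    static Optional<LocalDateTime> parse(String text) {
        for (DateTimeFormatter format : FORMATS) {
            try {
                return Optional.of(LocalDateTime.parse(text, format));
            } catch (DateTimeParseException e) {
                // This pattern did not fit; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(parse("010912000000"));      // matched by the first pattern
        System.out.println(parse("01.09.12.00:00:00")); // matched by the third pattern
        System.out.println(parse("not a date"));        // no pattern fits
    }
}
```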

Use the Pattern and Matcher classes.
Look at this example:
String inputDate = "01.09.12.00:00:00";
Pattern pattern = Pattern.compile(
"([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})");
Matcher matcher = pattern.matcher(inputDate);
// group() throws if there was no match, so check the result of find() first.
if (matcher.find()) {
    StringBuilder cleanStr = new StringBuilder();
    for (int i = 1; i <= matcher.groupCount(); i++) {
        cleanStr.append(matcher.group(i));
    }
    SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
    Date x = format.parse(cleanStr.toString());
    System.out.println(x.toString());
}
The most important part is this line:
Pattern pattern = Pattern.compile(
"([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})");
Here you define the regexp and mark groups with parentheses, so ([0-9]{2}) marks a group. Each group is followed by an expression for a possible delimiter, [\\.]{0,1}, which in this case means 0 or 1 dot. You can allow more delimiters by listing them in the character class, for example [\\.\\-/]{0,1}.
Then you run matcher.find(), which returns true if the pattern matches. After that, matcher.group(int) returns the groups one by one. Note that the index of the first group is 1.
Then I build the clean date String with a StringBuilder and parse it.
Cheers,
Michal

Related

Strip date from String with junk data - Java

I need to know if there is any way to strip just the date from text like the examples below using Java. I am looking for something generic, but couldn't find any help, as the input varies.
Some example inputs:
This time is Apr.19,2021 Cheers
or
19-04-2021 Cheers
or
This time is 19-APR-2021
I have seen some code which works for trailing junk characters but couldn't find anything if the date is in between the string and if it varies to different formats.

We could use String#replaceAll here for a regex one-liner approach:
String[] inputs = new String[] {
"This time is Apr.19,2021 Cheers",
"19-04-2021 Cheers",
"This time is 19-APR-2021",
"Hello 2021-19-Apr World"
};
for (String input : inputs) {
String date = input.replaceAll(".*(?<!\\S)(\\S*\\b\\d{4}\\b\\S*).*", "$1");
System.out.println(date);
}
This prints:
Apr.19,2021
19-04-2021
19-APR-2021
2021-19-Apr

If you assume a "date" is any series of letter/number/dot/comma/dash chars that ends with a 4-digit "word", match that and replace it with a blank to delete it:
str = str.replaceAll("\\b[A-Za-z0-9.,-]+\\b\\d{4}\\b", "");
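If the goal is to keep the date rather than delete it, the same assumption can be used with Matcher.find() instead of replaceAll. A minimal sketch follows; the class and method names are made up for illustration, and year-first inputs such as 2021-19-Apr are not covered by this pattern because it requires the 4-digit part to come last.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateTokenFinder {

    // Same pattern as above: date-like characters ending in a 4-digit "word".
    private static final Pattern DATE_TOKEN =
            Pattern.compile("\\b[A-Za-z0-9.,-]+\\b\\d{4}\\b");

    // Returns the first matching token, or an empty Optional if there is none.
    static Optional<String> extractDate(String text) {
        Matcher m = DATE_TOKEN.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractDate("This time is Apr.19,2021 Cheers"));
        System.out.println(extractDate("19-04-2021 Cheers"));
        System.out.println(extractDate("This time is 19-APR-2021"));
    }
}
```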

how can I select date from text in java? [duplicate]

How can I select a date from text in Java? For example, I have strings such as 2007-01-12abcd and absc2008-01-31, and I need just the dates: 2007-01-12 and 2008-01-31 (without the text). I used a Matcher in my code, but it is not working.
for (int i=0; i < list.size(); i++) {
Pattern compiledPattern = Pattern.compile("((?:19|20)[0-9][0-9])-(0?[1-9]|1[012])-(0?[1-9]|[12][0-9]|3[01])", Pattern.CASE_INSENSITIVE);
Matcher matcher = compiledPattern.matcher(list.get(i));
if (matcher.find() == true) {
new_list.add(list.get(i));
}
}

I would keep things simple and just search on the following regex pattern:
\d{4}-\d{2}-\d{2}
It is fairly unlikely that anything which is not a date in your text already would match to this pattern.
Sample code:
String input = "2007-01-12abcd, absc2008-01-31";
String pattern = "\\d{4}-\\d{2}-\\d{2}";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
while (m.find()) {
System.out.println(m.group(0));
}
This prints:
2007-01-12
2008-01-31
By the way, your regex pattern can't be completely correct anyway, because it doesn't handle odd edge cases such as leap years, where February has 29 instead of 28 days.
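One way to handle those edge cases without a more complicated regex is to let the regex find candidates and let java.time decide whether each candidate is a real calendar date. A small sketch under that assumption (the class name IsoDateExtractor is hypothetical):

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IsoDateExtractor {

    private static final Pattern CANDIDATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}");

    // Collects every yyyy-MM-dd substring that is also a valid calendar date.
    static List<LocalDate> extract(String input) {
        List<LocalDate> dates = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(input);
        while (m.find()) {
            try {
                // LocalDate.parse uses the strict ISO-8601 format by default.
                dates.add(LocalDate.parse(m.group()));
            } catch (DateTimeParseException e) {
                // Looks like a date but is not one (for example 2007-02-30); skip it.
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        // prints [2007-01-12, 2008-01-31]
        System.out.println(extract("2007-01-12abcd, absc2008-01-31, 2007-02-30"));
    }
}
```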

I haven't written the code, but I think I can help. First, I'm assuming that the dates in the string are already in the right format (the numbers are in the right order and the dates are separated by commas). Go through the string character by character with a for-each loop. If the current char is a letter such as a, b or c, don't add it to the result string; otherwise, add it. If the character is a comma, add the result string to the list and start a new one. Do the same after the last character. This might not be the best way to do it, but I'm fairly sure it works.
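A rough sketch of that character-by-character idea, assuming the entries are comma-separated as described. The class name LetterStripper is invented here, and trimming the surrounding whitespace is an addition that the description above does not mention.

```java
import java.util.ArrayList;
import java.util.List;

class LetterStripper {

    // Drops letters, splits on commas, and returns the remaining pieces.
    static List<String> stripLetters(String input) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (c == ',') {
                // A comma ends the current entry.
                result.add(current.toString().trim());
                current.setLength(0);
            } else if (!Character.isLetter(c)) {
                current.append(c);
            }
        }
        // The last entry has no trailing comma, so add it explicitly.
        result.add(current.toString().trim());
        return result;
    }

    public static void main(String[] args) {
        // prints [2007-01-12, 2008-01-31]
        System.out.println(stripLetters("2007-01-12abcd, absc2008-01-31"));
    }
}
```

Note that this keeps any digits or punctuation that are not part of a date, so it only works when letters are the only junk.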

Regex that covers multiple date formats

What regex to choose to cover all the following scenarios:
Basically I have to extract prefix and suffix.
prefix.YYYY-MM-DD-HH-MM-SS.suffix
YYYY-MM-DD is mandatory.
HH-MM-SS is optional. (It could be HH only or HH-MM or HH-MM-SS)
Samples:
"test1.2020-03-07-00.test.com",
"test2.2020-03-06-16.test2.test1.com",
"test3.2020-03-06-16-13-40.test2.test1.com",
"test4.2020-03-06-16-13.test.com",
"test5.ext.2020-03-11-17-57.test1.com"
"test6.ext.2020-03-11.test1.test2.test3.com"
I use this regex but it fails:
Pattern.compile(".\\d{4}-\\d{2}-\\d{2}(-\\d{2}-\\d{2}-\\d{2})?.*?");

Here is one solution:
(.+)\.\d{4}(?:-\d{2}){2,5}\.(.+)
(.+) capturing group for the prefix.
\. literal dot.
\d{4} 4 digits.
(?:-\d{2}){2,5} non-capturing group for literal dash followed by 2 digits,
repeated at least 2 times and at most 5 times.
\. literal dot.
(.+) capturing group for the suffix.
For example:
var pattern = Pattern.compile("(.+)\\.\\d{4}(?:-\\d{2}){2,5}\\.(.+)");
var matcher = pattern.matcher("test1.2020-03-07-00.test.com");
if(matcher.matches())
{
String prefix = matcher.group(1);
String suffix = matcher.group(2);
System.out.println("prefix: " + prefix);
System.out.println("suffix: " + suffix);
}
Output:
prefix: test1
suffix: test.com

First remember that . period is a special regex pattern matching any character, so to specifically match a period, you need to escape it as \.
You said yourself that the time part "could be HH only or HH-MM or HH-MM-SS", so you shouldn't expect (-\\d{2}-\\d{2}-\\d{2})? to match that. Since you don't need to capture it, use a (?:...) non-capturing group, and nest them: (?:-\\d{2}(?:-\\d{2}(?:-\\d{2})?)?)?. Better yet, since the 3 parts are the same, use (?:-\\d{2}){0,3}
You said "I have to extract prefix and suffix", so you should add that to the pattern.
Pattern p = Pattern.compile("^(.*?)\\.(\\d{4}(?:-\\d{2}){2,5})\\.(.*)$");
for (String s : new String[] { "test1.2020-03-07-00.test.com",
"test2.2020-03-06-16.test2.test1.com",
"test3.2020-03-06-16-13-40.test2.test1.com",
"test4.2020-03-06-16-13.test.com",
"test5.ext.2020-03-11-17-57.test1.com",
"test6.ext.2020-03-11.test1.test2.test3.com" }) {
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.printf("prefix = '%s', date = '%s', suffix = '%s'%n",
m.group(1), m.group(2), m.group(3));
} else {
System.out.printf("NO MATCH: '%s'%n", s);
}
}
Output
prefix = 'test1', date = '2020-03-07-00', suffix = 'test.com'
prefix = 'test2', date = '2020-03-06-16', suffix = 'test2.test1.com'
prefix = 'test3', date = '2020-03-06-16-13-40', suffix = 'test2.test1.com'
prefix = 'test4', date = '2020-03-06-16-13', suffix = 'test.com'
prefix = 'test5.ext', date = '2020-03-11-17-57', suffix = 'test1.com'
prefix = 'test6.ext', date = '2020-03-11', suffix = 'test1.test2.test3.com'
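If the captured date string also needs to become a date-time value, one option is a java.time formatter with nested optional sections for the hour, minute and second, defaulting any missing part to zero. This is only a sketch: the class name StampedNameParser and the choice to default missing time fields to 0 are assumptions, not part of the question.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StampedNameParser {

    // Same pattern as in the answer above: prefix, date, suffix.
    private static final Pattern NAME =
            Pattern.compile("^(.*?)\\.(\\d{4}(?:-\\d{2}){2,5})\\.(.*)$");

    // Square brackets mark optional sections; missing time fields default to 0.
    private static final DateTimeFormatter STAMP = new DateTimeFormatterBuilder()
            .appendPattern("uuuu-MM-dd[-HH[-mm[-ss]]]")
            .parseDefaulting(ChronoField.HOUR_OF_DAY, 0)
            .parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
            .parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
            .toFormatter();

    // Extracts the date part of the name and parses it.
    static LocalDateTime timestampOf(String name) {
        Matcher m = NAME.matcher(name);
        if (!m.matches()) {
            throw new IllegalArgumentException("No timestamp in: " + name);
        }
        return LocalDateTime.parse(m.group(2), STAMP);
    }

    public static void main(String[] args) {
        System.out.println(timestampOf("test3.2020-03-06-16-13-40.test2.test1.com"));
        System.out.println(timestampOf("test6.ext.2020-03-11.test1.test2.test3.com"));
    }
}
```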

I would suggest a different approach. Finding an appropriate regex would be very difficult, if not impossible. I dealt with the problem of parsing a date from any possible format that is not known in advance, and I came up with an idea. Of course, there is no 100% solution to this problem, but here is what I did. I created a properties file that contains a list of currently supported formats. When a String needs to be parsed, it is tried against each mask in turn until the date parses successfully or the masks run out. The pros of the idea:
1. Since the file is external, it can be updated with additional formats at any time without changing the code.
2. The file can be customized per customer, with the preferred formats placed first. For example, for US-based customers you would place US formats first (such as MM-dd-yyyy) and European formats after that, and vice versa for European-based customers. So when a date such as 07-08-2000 comes in, it is parsed as July 8th for US-based customers but as August 7th for European customers. In short: flexibility.
For more details read my article on the topic - Java 8 java.time package: parsing any string to date
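To make the ordering effect concrete, here is a small sketch of the idea, not the article's actual code. It reads the formats from properties text with keys format.1, format.2, ... (a hypothetical naming scheme, needed because java.util.Properties does not preserve order) and shows the same input parsed differently depending on which format comes first. A real setup would load the text from an external file instead of a string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class ConfiguredDateParser {

    private final List<DateTimeFormatter> formats = new ArrayList<>();

    // Reads keys format.1, format.2, ... in numeric order until one is missing.
    ConfiguredDateParser(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        for (int i = 1; props.containsKey("format." + i); i++) {
            formats.add(DateTimeFormatter.ofPattern(props.getProperty("format." + i)));
        }
    }

    // Tries each configured format in order; the first one that parses wins.
    LocalDate parse(String text) {
        for (DateTimeFormatter format : formats) {
            try {
                return LocalDate.parse(text, format);
            } catch (DateTimeParseException e) {
                // Try the next configured format.
            }
        }
        throw new IllegalArgumentException("Unsupported date format: " + text);
    }

    public static void main(String[] args) {
        ConfiguredDateParser us = new ConfiguredDateParser("format.1=MM-dd-yyyy\nformat.2=dd-MM-yyyy\n");
        ConfiguredDateParser eu = new ConfiguredDateParser("format.1=dd-MM-yyyy\nformat.2=MM-dd-yyyy\n");
        System.out.println(us.parse("07-08-2000")); // prints 2000-07-08
        System.out.println(eu.parse("07-08-2000")); // prints 2000-08-07
    }
}
```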

Splitting string into two strings with regex

This question was asked several times before but I couldn't find an answer to my question:
I need to split a string into two strings. The first part is a date and the second is text. This is what I've got so far:
String test = "24.12.17 18:17 TestString";
String[] testSplit = test.split("\\d{2}.\\d{2}.\\d{2} \\d{2}:\\d{2}");
System.out.println(testSplit[0]); // "24.12.17 18:17" <-- Does not work
System.out.println(testSplit[1].trim()); // "TestString" <-- works
I can extract "TestString" but I lose the date. Is there a better (or even simpler) way? Help is highly appreciated!

Skip regex; Use three strings
You are working too hard. No need to include the date and the time together as one. Regex is tricky, and life is short.
Just use the plain String::split for three pieces, and re-assemble the date-time.
String[] pieces = "24.12.17 18:17 TestString".split( " " ) ; // Split into 3 strings.
LocalDate ld = LocalDate.parse( pieces[0] , DateTimeFormatter.ofPattern( "dd.MM.uu" ) ) ; // Parse the first string as a date value (`LocalDate`).
LocalTime lt = LocalTime.parse( pieces[1] , DateTimeFormatter.ofPattern( "HH:mm" ) ) ; // Parse the second string as a time-of-day value (`LocalTime`).
LocalDateTime ldt = LocalDateTime.of( ld , lt ) ; // Reassemble the date with the time (`LocalDateTime`).
String description = pieces[2] ; // Use the last remaining string.
ldt.toString(): 2017-12-24T18:17
description: TestString
Tip: If you have any control over that input, switch to using standard ISO 8601 formats for date-time values in text. The java.time classes use the standard formats by default when generating/parsing strings.

You want to match only the separator. By matching the date, you consume it (it's thrown away).
Use a look behind, which asserts but does not consume:
test.split("(?<=^.{14}) ");
This regex means "split on a space that is preceded by 14 characters after the start of input".
Your test code now works:
String test = "24.12.17 18:17 TestString";
String[] testSplit = test.split("(?<=^.{14}) ");
System.out.println(testSplit[0]); // "24.12.17 18:17" <-- works
System.out.println(testSplit[1].trim()); // "TestString" <-- works

If your string is always in this format (and is formatted well), you do not even need to use a regex. Just split at the second space using .substring and .indexOf:
String test = "24.12.17 18:17 TestString";
int idx = test.indexOf(" ", test.indexOf(" ") + 1);
System.out.println(test.substring(0, idx));
System.out.println(test.substring(idx).trim());
If you want to make sure your string starts with a datetime value, you may use a matching approach to match the string with a pattern containing 2 capturing groups: one will capture the date and the other will capture the rest of the string:
String test = "24.12.17 18:17 TestString";
String pat = "^(\\d{2}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2})\\s(.*)";
Matcher matcher = Pattern.compile(pat, Pattern.DOTALL).matcher(test);
if (matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2).trim());
}
Details:
^ - start of string
(\\d{2}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2}) - Group 1: a datetime pattern (xx.xx.xx xx:xx-like pattern)
\\s - a whitespace (if it is optional, add * after it)
(.*) - Group 2 capturing any 0+ chars up to the end of string (. will match line breaks, too, because of the Pattern.DOTALL flag).

How to parse a date from a URL format?

My database contains URLs stored as text fields and each URL contains a representation of the date of a report, which is missing from the report itself.
So I need to parse the date from the URL field to a String representation such as:
2010-10-12
2007-01-03
2008-02-07
What's the best way to extract the dates?
Some are in this format:
http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html
http://e.com/data/invoices/2010/09/invoices-report-thursday-september-2-2010.html
http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-15-2010.html
http://e.com/data/invoices/2010/09/invoices-report-monday-september-13th-2010.html
http://e.com/data/invoices/2010/08/invoices-report-monday-august-30th-2010.html
http://e.com/data/invoices/2009/05/invoices-report-friday-may-8th-2009.html
http://e.com/data/invoices/2010/10/invoices-report-wednesday-october-6th-2010.html
http://e.com/data/invoices/2010/09/invoices-report-tuesday-september-21-2010.html
Note the inconsistent use of th following the day of the month in cases such as these two:
http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-15-2010.html
http://e.com/data/invoices/2010/09/invoices-report-monday-september-13th-2010.html
Others are in this format (with three hyphens before the date starts, no year at the end and an optional use of invoices- before report):
http://e.com/data/invoices/2010/09/invoices-report---wednesday-september-1.html
http://e.com/data/invoices/2010/09/invoices-report---thursday-september-2.html
http://e.com/data/invoices/2010/09/invoices-report---wednesday-september-15.html
http://e.com/data/invoices/2010/09/invoices-report---monday-september-13.html
http://e.com/data/invoices/2010/08/report---monday-august-30.html
http://e.com/data/invoices/2009/05/report---friday-may-8.html
http://e.com/data/invoices/2010/10/report---wednesday-october-6.html
http://e.com/data/invoices/2010/09/report---tuesday-september-21.html

You want a regex like this:
"^http://e.com/data/invoices/(\\d{4})/(\\d{2})/\\D+(\\d{1,2})"
This exploits that everything up through the /year/month/ part of the URL is always the same, and that no number follows till the day of the month. After you have that, you don't care about anything else.
The first capture group is the year, the second the month, and the third the day. The day might not have a leading zero; convert it from string to integer and format as needed, or check the string length and, if it is not two, prepend "0".
As an example:
import java.util.regex.*;
class URLDate {
public static void
main(String[] args) {
String text = "http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html";
String regex = "http://e.com/data/invoices/(\\d{4})/(\\d{2})/\\D+(\\d{1,2})";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
if (m.find()) {
int count = m.groupCount();
System.out.format("matched with groups:\n");
for (int i = 0; i <= count; ++i) {
String group = m.group(i);
System.out.format("\t%d: %s\n", i, group);
}
} else {
System.out.println("failed to match!");
}
}
}
gives the output:
matched with groups:
0: http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html
1: 2010
2: 09
3: 1
(Note that to use Matcher.matches() instead of Matcher.find(), you would have to make the pattern eat the entire input string by appending .*$ to the pattern.)
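To go from the three capture groups to the requested yyyy-MM-dd strings, the groups can be fed to LocalDate, whose toString() already pads month and day with leading zeros. A short sketch under the same URL assumptions as above; the class name UrlDateExtractor is made up, and the dot in e.com is escaped here so it only matches a literal dot.

```java
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlDateExtractor {

    private static final Pattern URL_DATE = Pattern.compile(
            "^http://e\\.com/data/invoices/(\\d{4})/(\\d{2})/\\D+(\\d{1,2})");

    // Returns the report date encoded in the URL as an ISO-8601 string.
    static String isoDate(String url) {
        Matcher m = URL_DATE.matcher(url);
        if (!m.find()) {
            throw new IllegalArgumentException("No date in: " + url);
        }
        LocalDate date = LocalDate.of(
                Integer.parseInt(m.group(1)),   // year from the path
                Integer.parseInt(m.group(2)),   // month from the path
                Integer.parseInt(m.group(3)));  // day from the file name
        return date.toString();
    }

    public static void main(String[] args) {
        // prints 2010-09-01
        System.out.println(isoDate(
                "http://e.com/data/invoices/2010/09/invoices-report-wednesday-september-1st-2010.html"));
        // prints 2009-05-08
        System.out.println(isoDate(
                "http://e.com/data/invoices/2009/05/report---friday-may-8.html"));
    }
}
```

LocalDate.of also rejects impossible combinations (for example day 31 in a 30-day month) with a DateTimeException, which can help catch URLs whose path and file name disagree.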
