Regex that covers multiple date formats - java

What regex to choose to cover all the following scenarios:
Basically I have to extract prefix and suffix.
prefix.YYYY-MM-DD-HH-MM-SS.suffix
YYYY-MM-DD is mandatory.
HH-MM-SS is optional. (It could be HH only or HH-MM or HH-MM-SS)
Samples:
"test1.2020-03-07-00.test.com",
"test2.2020-03-06-16.test2.test1.com",
"test3.2020-03-06-16-13-40.test2.test1.com",
"test4.2020-03-06-16-13.test.com",
"test5.ext.2020-03-11-17-57.test1.com"
"test6.ext.2020-03-11.test1.test2.test3.com"
I use this regex but it fails:
Pattern.compile(".\\d{4}-\\d{2}-\\d{2}(-\\d{2}-\\d{2}-\\d{2})?.*?");

Here is one solution:
(.+)\.\d{4}(?:-\d{2}){2,5}\.(.+)
(.+) capturing group for the prefix.
\. literal dot.
\d{4} 4 digits.
(?:-\d{2}){2,5} non-capturing group for literal dash followed by 2 digits,
repeated at least 2 times and at most 5 times.
\. literal dot.
(.+) capturing group for the suffix.
For example:
var pattern = Pattern.compile("(.+)\\.\\d{4}(?:-\\d{2}){2,5}\\.(.+)");
var matcher = pattern.matcher("test1.2020-03-07-00.test.com");
if(matcher.matches())
{
String prefix = matcher.group(1);
String suffix = matcher.group(2);
System.out.println("prefix: " + prefix);
System.out.println("suffix: " + suffix);
}
Output:
prefix: test1
suffix: test.com

First remember that . period is a special regex pattern matching any character, so to specifically match a period, you need to escape it as \.
You said yourself that the time part "could be HH only or HH-MM or HH-MM-SS", so you shouldn't expect (-\\d{2}-\\d{2}-\\d{2})? to match that. Since you don't need to capture it, use a (?:...) non-capturing group, and nest them: (?:-\\d{2}(?:-\\d{2}(?:-\\d{2})?)?)?. Better yet, since the 3 parts are the same, use (?:-\\d{2}){0,3}
You said "I have to extract prefix and suffix", so you should add that to the pattern.
Pattern p = Pattern.compile("^(.*?)\\.(\\d{4}(?:-\\d{2}){2,5})\\.(.*)$");
for (String s : new String[] { "test1.2020-03-07-00.test.com",
"test2.2020-03-06-16.test2.test1.com",
"test3.2020-03-06-16-13-40.test2.test1.com",
"test4.2020-03-06-16-13.test.com",
"test5.ext.2020-03-11-17-57.test1.com",
"test6.ext.2020-03-11.test1.test2.test3.com" }) {
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.printf("prefix = '%s', date = '%s', suffix = '%s'%n",
m.group(1), m.group(2), m.group(3));
} else {
System.out.printf("NO MATCH: '%s'%n", s);
}
}
Output
prefix = 'test1', date = '2020-03-07-00', suffix = 'test.com'
prefix = 'test2', date = '2020-03-06-16', suffix = 'test2.test1.com'
prefix = 'test3', date = '2020-03-06-16-13-40', suffix = 'test2.test1.com'
prefix = 'test4', date = '2020-03-06-16-13', suffix = 'test.com'
prefix = 'test5.ext', date = '2020-03-11-17-57', suffix = 'test1.com'
prefix = 'test6.ext', date = '2020-03-11', suffix = 'test1.test2.test3.com'

I would suggest a different approach. Finding an appropriate regex would be very difficult, if not impossible. I dealt with the problem of parsing a date from any format that is not known in advance, and I came up with an idea. Of course, there is no 100% solution to this issue, but here is what I did. I created a property file that contains a list of currently supported formats. When a String needs to be parsed, attempts are made consecutively with each mask until the date is parsed successfully or you run out of masks. The advantages of this approach:
1. Since the file is external, it can be updated with additional formats at any time without changing the code.
2. The file can be customized per customer by placing the preferred formats first. For example, for US-based customers you would place US formats first (such as MM-dd-yyyy) and European formats after them, and vice versa for European-based customers. So when a date such as 07-08-2000 comes in, it would be parsed as July 8th for US-based customers but as August 7th for European customers. In short: flexibility.
For more details read my article on the topic - Java 8 java.time package: parsing any string to date
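As a minimal sketch of the try-each-mask idea, assuming java.time and a hard-coded list standing in for the property file (the class name, the method name, and the masks themselves are illustrative and not taken from the article):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

class MaskListDateParser {
    // Tries each mask in order and returns the first successful parse.
    // The order of the masks decides how ambiguous input is interpreted.
    static Optional<LocalDate> parse(String text, List<String> masks) {
        for (String mask : masks) {
            try {
                return Optional.of(LocalDate.parse(text, DateTimeFormatter.ofPattern(mask)));
            } catch (DateTimeParseException e) {
                // This mask did not fit; fall through to the next one.
            }
        }
        return Optional.empty(); // ran out of masks
    }

    public static void main(String[] args) {
        // Hypothetical per-customer orderings; in the approach above they would come from a file.
        List<String> usFirst = List.of("MM-dd-uuuu", "dd-MM-uuuu");
        List<String> europeFirst = List.of("dd-MM-uuuu", "MM-dd-uuuu");
        System.out.println(parse("07-08-2000", usFirst));
        System.out.println(parse("07-08-2000", europeFirst));
    }
}
```

With the US-first list the ambiguous input resolves to July 8th, and with the Europe-first list to August 7th, which is the flexibility described above.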

Related

Get substring between "first two" occurrences of a character

I have a String:
String thestra = "/aaa/bbb/ccc/ddd/eee";
Every time, in my situation, for this String, a minimum of two slashes will be present without fail.
And I am getting the /aaa/ like below, which is the subString between "FIRST TWO occurrences" of the char / in the String.
System.out.println("/" + thestra.split("\\/")[1] + "/");
It solves my purpose but I am wondering if there is any other elegant and cleaner alternative to this?
Please notice that I need both slashes (leading and trailing) around aaa. i.e. /aaa/
You can use indexOf, which accepts a second argument for an index to start searching from:
int start = thestra.indexOf("/");
int end = thestra.indexOf("/", start + 1) + 1;
System.out.println(thestra.substring(start, end));
Whether or not it's more elegant is a matter of opinion, but at least it doesn't find every / in the string or create an unnecessary array.
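The indexOf approach relies on the guarantee that two slashes are present. If that guarantee ever does not hold, a guarded variant could look like the sketch below; the class and method names are made up for illustration, and returning an Optional is just one possible way to signal a missing delimiter.

```java
import java.util.Optional;

class FirstSegment {
    // Returns the text from the first delimiter through the second one (both included),
    // or an empty Optional when fewer than two delimiters are present.
    static Optional<String> between(String s, char delimiter) {
        int start = s.indexOf(delimiter);
        if (start < 0) {
            return Optional.empty();
        }
        int end = s.indexOf(delimiter, start + 1);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(s.substring(start, end + 1));
    }

    public static void main(String[] args) {
        System.out.println(between("/aaa/bbb/ccc/ddd/eee", '/').orElse("(no segment)"));
    }
}
```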
Scanner::findInLine returning the first match of the pattern may be used:
String thestra = "/aaa/bbb/ccc/ddd/eee";
System.out.println(new Scanner(thestra).findInLine("/[^/]*/"));
Output:
/aaa/
Use Pattern and Matcher from java.util.regex.
Pattern pattern = Pattern.compile("/.*?/");
Matcher matcher = pattern.matcher(thestra);
if (matcher.find()) {
String match = matcher.group(0); // output
}
Pattern.compile("/.*?/")
.matcher(thestra)
.results()
.map(MatchResult::group)
.findFirst().ifPresent(System.out::println);
You can test this variant :)
With best regards, Fr0z3Nn
Every time, in my situation, for this String, minimum two slashes will be present
If that is guaranteed, split at each / while keeping the delimiters, and take the first three substrings.
String str = String.format("%s%s%s",(thestra.split("((?<=\\/)|(?=\\/))")));
You could also match the leading forward slash, then use a negated character class [^/]* to optionally match any character except / and then match the trailing forward slash.
String thestra = "/aaa/bbb/ccc/ddd/eee";
Pattern pattern = Pattern.compile("/[^/]*/");
Matcher matcher = pattern.matcher(thestra);
if (matcher.find()) {
System.out.println(matcher.group());
}
Output
/aaa/
One of the many ways can be replacing the string with group#1 of the regex, [^/]*(/[^/].*?/).* as shown below:
public class Main {
public static void main(String[] args) {
String thestra = "/aaa/bbb/ccc/ddd/eee";
String result = thestra.replaceAll("[^/]*(/[^/].*?/).*", "$1");
System.out.println(result);
}
}
Output:
/aaa/
Explanation of the regex:
[^/]* : Not the character, /, any number of times
( : Start of group#1
/ : The character, /
[^/]: Not the character, /
.*?: Any character any number of times (lazy match)
/ : The character, /
) : End of group#1
.* : Any character any number of times
Updated the answer as per the following valuable suggestion from Holger:
Note that to the Java regex engine, the / has no special meaning, so there is no need for escaping here. Further, since you're only expecting a single match (the .* at the end ensures this), replaceFirst would be more idiomatic. And since there was no statement about the first / being always at the beginning of the string, prepending the pattern with either .*? or [^/]* would be a good idea.
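A short sketch of that suggestion, using replaceFirst with the same pattern and an input that does not start with a slash. The class name, method name, and the second sample string are assumptions for illustration.

```java
class FirstSlashSegment {
    // [^/]* skips anything before the first slash, group 1 captures the first "/.../" segment,
    // and the trailing .* consumes the rest so the whole input is replaced by group 1.
    // If the pattern does not match, replaceFirst returns the input unchanged.
    static String extract(String s) {
        return s.replaceFirst("[^/]*(/[^/].*?/).*", "$1");
    }

    public static void main(String[] args) {
        System.out.println(extract("/aaa/bbb/ccc/ddd/eee"));
        System.out.println(extract("prefix/aaa/bbb"));
    }
}
```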
I am surprised nobody mentioned using Path as of Java 7.
String thestra = "/aaa/bbb/ccc/ddd/eee";
String path = Paths.get(thestra).getName(0).toString();
System.out.println("/" + path + "/");
/aaa/
String thestra = "/aaa/bbb/ccc/ddd/eee";
System.out.println(thestra.substring(0, thestra.indexOf("/", 2) + 1));

Splitting string into two strings with regex

This question was asked several times before but I couldn't find an answer to my question:
I need to split a string into two strings. First part is date and the second string is text. This is what i got so far:
String test = "24.12.17 18:17 TestString";
String[] testSplit = test.split("\\d{2}.\\d{2}.\\d{2} \\d{2}:\\d{2}");
System.out.println(testSplit[0]); // "24.12.17 18:17" <-- Does not work
System.out.println(testSplit[1].trim()); // "TestString" <-- works
I can extract "TestString" but i miss the date. Is there any better (or even simpler) way? Help is highly appreciated!
Skip regex; Use three strings
You are working too hard. No need to include the date and the time together as one. Regex is tricky, and life is short.
Just use the plain String::split for three pieces, and re-assemble the date-time.
String[] pieces = "24.12.17 18:17 TestString".split( " " ) ; // Split into 3 strings.
LocalDate ld = LocalDate.parse( pieces[0] , DateTimeFormatter.ofPattern( "dd.MM.uu" ) ) ; // Parse the first string as a date value (`LocalDate`).
LocalTime lt = LocalTime.parse( pieces[1] , DateTimeFormatter.ofPattern( "HH:mm" ) ) ; // Parse the second string as a time-of-day value (`LocalTime`).
LocalDateTime ldt = LocalDateTime.of( ld , lt ) ; // Reassemble the date with the time (`LocalDateTime`).
String description = pieces[2] ; // Use the last remaining string.
See this code run live at IdeOne.com.
ldt.toString(): 2017-12-24T18:17
description: TestString
Tip: If you have any control over that input, switch to using standard ISO 8601 formats for date-time values in text. The java.time classes use the standard formats by default when generating/parsing strings.
You want to match only the separator. By matching the date, you consume it (it's thrown away).
Use a look behind, which asserts but does not consume:
test.split("(?<=^.{14}) ");
This regex means "split on a space that is preceded by 14 characters after the start of input".
Your test code now works:
String test = "24.12.17 18:17 TestString";
String[] testSplit = test.split("(?<=^.{14}) ");
System.out.println(testSplit[0]); // "24.12.17 18:17" <-- works
System.out.println(testSplit[1].trim()); // "TestString" <-- works
If your string is always in this format (and is formatted well), you do not even need to use a regex. Just split at the second space using .substring and .indexOf:
String test = "24.12.17 18:17 TestString";
int idx = test.indexOf(" ", test.indexOf(" ") + 1);
System.out.println(test.substring(0, idx));
System.out.println(test.substring(idx).trim());
See the Java demo.
If you want to make sure your string starts with a datetime value, you may use a matching approach to match the string with a pattern containing 2 capturing groups: one will capture the date and the other will capture the rest of the string:
String test = "24.12.17 18:17 TestString";
String pat = "^(\\d{2}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2})\\s(.*)";
Matcher matcher = Pattern.compile(pat, Pattern.DOTALL).matcher(test);
if (matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2).trim());
}
See the Java demo.
Details:
^ - start of string
(\\d{2}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2}) - Group 1: a datetime pattern (xx.xx.xx xx:xx-like pattern)
\\s - a whitespace (if it is optional, add * after it)
(.*) - Group 2 capturing any 0+ chars up to the end of string (. will match line breaks, too, because of the Pattern.DOTALL flag).
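If the captured date is needed as a value rather than as text, Group 1 can be handed to java.time. The following is only a sketch: the class and field names are invented for illustration, and the dd.MM.uu HH:mm mask assumes the input really is day.month.two-digit-year.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DatedLine {
    // Same two-group pattern as above: group 1 is the datetime, group 2 is the rest.
    static final Pattern LINE =
            Pattern.compile("^(\\d{2}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2})\\s(.*)", Pattern.DOTALL);
    // Assumed layout of the datetime text.
    static final DateTimeFormatter STAMP = DateTimeFormatter.ofPattern("dd.MM.uu HH:mm");

    // Returns {datetime text, remaining text}, or null if the input does not start with a datetime.
    static String[] split(String input) {
        Matcher m = LINE.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2).trim() };
    }

    public static void main(String[] args) {
        String[] parts = split("24.12.17 18:17 TestString");
        LocalDateTime when = LocalDateTime.parse(parts[0], STAMP);
        System.out.println(when + " | " + parts[1]);
    }
}
```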

How to use Substring when String length is not fixed everytime

I have string something like :
SKU: XP321654
Quantity: 1
Order date: 01/08/2016
The SKU length is not fixed , so my function sometime returns me the first or two characters of Quantity also which I do not want to get. I want to get only SKU value.
My Code :
int index = Content.indexOf("SKU:");
String SKU = Content.substring(index, index+15);
If the SKU has one or two more digits, I cannot get all of it because I have set the limit to 15. If I use index + 16 to handle a longer SKU, then for a shorter SKU it also returns some characters of Quantity.
How can I solve this? Is there a way to avoid using a fixed character count as the limit?
The last character of my SKU will always be a digit, so is there anything else I can use to get only the SKU, up to its last digit?
Using .substring is simply not the way to process such things. What you need is a regex (or regular expression):
Pattern pat = Pattern.compile("SKU\\s*:\\s*(\\S+)");
String sku = null;
Matcher matcher = pat.matcher(Content);
if(matcher.find()) { //we've found a match
sku = matcher.group(1);
}
//do something with sku
Unescaped the regex is something like:
SKU\s*:\s*(\S+)
you are thus looking for a pattern that starts with SKU then followed by zero or more \s (spacing characters like space and tab), followed by a colon (:) then potentially zero or more spacing characters (\s) and finally the part in which you are interested: one or more (that's the meaning of +) non-spacing characters (\S). By putting these in brackets, these are a matching group. If the regex succeeds in finding the pattern (matcher.find()), you can extract the content of the matching group matcher.group(1) and store it into a string.
You can potentially improve the regex further if, for instance, you know more about what a SKU looks like. For instance, if it consists only of uppercase letters and digits, you can replace \S with [0-9A-Z], so the pattern becomes:
Pattern pat = Pattern.compile("SKU\\s*:\\s*([0-9A-Z]+)");
EDIT: for the quantity data, you could use:
Pattern pat2 = Pattern.compile("Quantity\\s*:\\s*(\\d+)");
int qt = -1;
Matcher matcher = pat2.matcher(Content);
if(matcher.find()) { //we've found a match
qt = Integer.parseInt(matcher.group(1));
}
or see this jdoodle.
You know you can just refer to the length of the string right ?
String s = "SKU: XP321654";
String sku = s.substring(4, s.length()).trim();
I think using a regex is clearly overkill in this case, it is way way simpler than this. You can even split the expression although it's a bit less efficient than the solution above, but please don't use a regex for this !
String sku = "SKU: XP321654".split(":")[1].trim();
1: You have to split your input into lines (for example, split on \n).
2: Once you have your line, search for : and take the remainder of the line (using the String length, as mentioned in Dici's answer).
Depending on how exactly the string contains new lines, you could do this:
public static void main(String[] args) {
String s = "SKU: XP321654\r\n" +
"Quantity: 1\r\n" +
"Order date: 01/08/2016";
System.out.println(s.substring(s.indexOf(": ") + 2, s.indexOf("\r\n")));
}
Just note that this 1-liner has several restrictions:
The SKU property has to be first. If not, then modify the start index appropriately to search for "SKU: ".
The new lines might be separated otherwise, \R is a regex for all the valid new line escape characters combinations.
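Both restrictions can be lifted by splitting on \R and looking up the line by its key. This is a minimal sketch; the class and method names are hypothetical, and it assumes every property sits on its own line in "key: value" form.

```java
class KeyValueLines {
    // Returns the trimmed text after "key:" on the first line whose key matches,
    // or null if no line has that key.
    static String valueOf(String content, String key) {
        // \R matches any line break sequence (\n, \r\n, \r, ...), so the separator style does not matter.
        for (String line : content.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon >= 0 && line.substring(0, colon).trim().equals(key)) {
                return line.substring(colon + 1).trim();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // SKU is deliberately not the first line, and the line breaks are mixed.
        String content = "Quantity: 1\nSKU: XP321654\r\nOrder date: 01/08/2016";
        System.out.println(valueOf(content, "SKU"));
    }
}
```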

Java regex of string

I want to parse strings to get fields from them. The format of the string (which come from a dataset) is as so (the -> represents a tab, and the * represents a space):
Date(yyyymmdd)->Date(yyyymmdd)->*City,*State*-->Description
I am only interested in the 1st date and the State. I tried regex like this:
String txt="19951010 19951011 Red City, WI Description";
String re1="(\\d+)"; // Integer Number 1
String re2=".*?"; // Non-greedy match on filler
String re3="(?:[a-z][a-z]+)"; // Uninteresting: word
String re4=".*?"; // Non-greedy match on filler
String re5="(?:[a-z][a-z]+)"; // Uninteresting: word
String re6=".*?"; // Non-greedy match on filler
String re7="((?:[a-z][a-z]+))"; // Word 1
Pattern p = Pattern.compile(re1+re2+re3+re4+re5+re6+re7,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
String int1=m.group(1);
String word1=m.group(2);
System.out.print("("+int1.toString()+")"+"("+word1.toString()+")"+"\n");
}
It works fine if the city has two words (Red City); then the State is extracted properly. But if the City only has one word, it does not work. I can't figure it out. I don't need to use regex and am open to any other suggestions. Thanks.
Problem:
Your problem is that each component of your current regex essentially matches a number or [a-z] word, separated by anything that isn't [a-z], which includes commas. So your parts for a two word city are:
Input:
19951010 19951011 Red City, WI Description
Your components:
String re1="(\\d+)"; // Integer Number 1
String re2=".*?"; // Non-greedy match on filler
String re3="(?:[a-z][a-z]+)"; // Uninteresting: word
String re4=".*?"; // Non-greedy match on filler
String re5="(?:[a-z][a-z]+)"; // Uninteresting: word
String re6=".*?"; // Non-greedy match on filler
String re7="((?:[a-z][a-z]+))"; // Word 1
What they match:
re1: "19951010"
re2: " 19951011 "
re3: "Red" (stops at non-letter, e.g. whitespace)
re4: " "
re5: "City" (stops at non-letter, e.g. the comma)
re6: ", " (stops at word character)
re7: "WI"
But with a one-word city:
Input:
19951010 19951011 Pittsburgh, PA Description
What they match:
re1: "19951010"
re2: " 19951011 "
re3: "Pittsburgh" (stops at non-letter, e.g. the comma)
re4: ","
re5: "PA" (stops at non-letter, e.g. whitespace)
re6: " " (stops at word character)
re7: "Description" (but you want this to be the state)
Solution:
You should do two things. First, simplify your regex a bit; you are going kind of crazy specifying greedy vs. reluctant, etc. Just use greedy patterns. Second, think about the simplest way to express your rules.
Your rules really are:
Date
A bunch of characters that aren't a comma (including second date and city name).
A comma.
State (one word).
So build a regex that sticks to that. You can, as you are doing now, take a shortcut by skipping the second number, but note that you do lose support for cities that start with numbers (which probably won't happen). Also, you don't care about the city. So, e.g.:
String re1 = "(\\d+)"; // match first number
String re2 = "[^,]*"; // skip everything that's not a comma
String re3 = ","; // skip the comma
String re4 = "[\\s]*"; // skip whitespace
String re5 = "([a-z]+)"; // match letters (state)
String regex = re1 + re2 + re3 + re4 + re5;
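Putting the five components together, a minimal runnable sketch might look like this. The class name, the method name, and the tab-separated sample inputs are assumptions for illustration; the CASE_INSENSITIVE flag is carried over from the question's code so that [a-z]+ can match an uppercase state.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateStateExtractor {
    // first number, anything up to the comma, the comma, optional whitespace, then the state letters
    static final Pattern LINE =
            Pattern.compile("(\\d+)[^,]*,[\\s]*([a-z]+)", Pattern.CASE_INSENSITIVE);

    // Returns {firstDate, state}, or null if the line does not fit the rules.
    static String[] extract(String txt) {
        Matcher m = LINE.matcher(txt);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "19951010\t19951011\t Red City, WI \tDescription",
            "19951010\t19951011\t Pittsburgh, PA \tDescription"
        };
        for (String s : samples) {
            String[] fields = extract(s);
            System.out.println("(" + fields[0] + ")(" + fields[1] + ")");
        }
    }
}
```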
There are other options as well, but I personally find regular expressions to be very straightforward for things like this. You could use various combinations of split(), as other posters have detailed. You could directly look for commas and whitespace with indexOf() and pull out substrings. You could even convince a Scanner or perhaps a StringTokenizer or StreamTokenizer to work for you. However, regular expressions exist to solve problems like this and are a good tool for the job.
Here is an example with StringTokenizer:
StringTokenizer t = new StringTokenizer(txt, " \t");
String date = t.nextToken();
t.nextToken(); // skip second date
t.nextToken(","); // change delimiter to comma and skip city
t.nextToken(" \t"); // back to whitespace and skip comma
String state = t.nextToken();
Still, I feel a regex expresses the rules more cleanly.
By the way, for future debugging, sometimes it helps to just print out all of the capture groups, this can give you insight into what is matching what. A good technique is to put every component of your regex in a capture group temporarily, then print them all out.
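A small helper along those lines, as a sketch only (the class and method names are hypothetical). It lists group 0, the whole match, followed by every numbered group, and the pattern in main wraps each component of the comma-based regex in its own group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDump {
    // Builds one line per capture group, starting with group 0 (the entire match).
    static String dump(Pattern p, String input) {
        Matcher m = p.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i <= m.groupCount(); i++) {
            sb.append(i).append(": [").append(m.group(i)).append("]\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Every component is temporarily placed in its own capture group.
        Pattern p = Pattern.compile("(\\d+)([^,]*)(,)(\\s*)([a-z]+)", Pattern.CASE_INSENSITIVE);
        System.out.print(dump(p, "19951010 19951011 Pittsburgh, PA Description"));
    }
}
```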
No need to be so complex with this. You can split on whitespace!
// s is your string
String[] first = s.split("\\s*,\\s*");
String[] firstHalf = first[0].split("\\s+");
String[] secondHalf = first[1].split("\\s+");
String date = firstHalf[0];
String state = secondHalf[0];
Now you have your date and your state! Do with them what you want.

Extracting dates from string

I have a list with file names that look roughly like this: Gadget1-010912000000-020912235959.csv, i.e. they contain two dates indicating the timespan of their data.
The user enters a date format and a file format:
File Format in this case: *GADGET*-*DATE_FROM*-*DATE_TO*.csv
Date format in this case: ddMMyyHHmmss
What I want to do is extracting the three values out of the file name with the given file and date format.
My problem is: since the date format can differ heavily (hours, minutes and seconds can be separated by a colon, dates by a dot, ...), I don't quite know how to create a fitting regular expression.
You can use a regular expression to remove the non-digit characters, and then parse the value.
DateFormat dateFormat = new SimpleDateFormat("ddMMyyHHmmss");
String[] fileNameDetails = "Gadget1-010912000000-020912235959".split("-");
// Strip every non-digit character; a string without any is left unchanged.
String date = fileNameDetails[1].replaceAll("[^0-9]", "");
try {
    // Parse the cleaned string rather than the raw file name part.
    Date parsed = dateFormat.parse(date);
} catch (ParseException e) {
    // The cleaned string does not match ddMMyyHHmmss.
}
Hope it helps.
SimpleDateFormat solves your issue. You can define the format with commas, spaces and whatever and simply parse according to the format:
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
So you map your format (e.g ddMMyyHHmmss) to a corresponding SimpleDateFormat.
SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
Date x = format.parse("010912000000");
If the format changes, you simply change the SimpleDateFormat
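Applied to the file names from the question, that could look like the sketch below. It is only an illustration under several assumptions: the class and method names are invented, the file format is fixed to GADGET-DATE_FROM-DATE_TO.csv, and the date format itself must not contain a dash, because the name is split on "-".

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

class FileNameDates {
    // Returns {dateFrom, dateTo} parsed with the user-entered date format.
    static Date[] parseRange(String fileName, String dateFormat) {
        String base = fileName.endsWith(".csv")
                ? fileName.substring(0, fileName.length() - 4)
                : fileName;
        String[] parts = base.split("-");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected GADGET-FROM-TO: " + fileName);
        }
        SimpleDateFormat format = new SimpleDateFormat(dateFormat);
        format.setLenient(false); // reject values such as month 13 instead of rolling them over
        try {
            return new Date[] { format.parse(parts[1]), format.parse(parts[2]) };
        } catch (ParseException e) {
            throw new IllegalArgumentException("Date does not match " + dateFormat, e);
        }
    }

    public static void main(String[] args) {
        Date[] range = parseRange("Gadget1-010912000000-020912235959.csv", "ddMMyyHHmmss");
        System.out.println(range[0] + " to " + range[1]);
    }
}
```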
You can use a series of date-time formats, trying each until one works.
You may need to order the formats to prioritize matches.
For example, with Joda time, you can use DateTimeFormat.forPattern() and DateTimeFormatter.getParser() for each of a series of patterns. Try DateTimeParser.parseInto() until one succeeds.
One nice thing about this approach is that it is easy to add and remove patterns.
Use Pattern and Matcher class.
Look at the example:
String inputDate = "01.09.12.00:00:00";
Pattern pattern = Pattern.compile(
"([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})");
Matcher matcher = pattern.matcher(inputDate);
matcher.find();
StringBuilder cleanStr = new StringBuilder();
for(int i = 1; i <= matcher.groupCount(); i++) {
cleanStr.append(matcher.group(i));
}
SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
Date x = format.parse(cleanStr.toString());
System.out.println(x.toString());
The most important part is line
Pattern pattern = Pattern.compile(
"([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})[:]{0,1}([0-9]
Here you define the regexp and mark groups in parentheses, so ([0-9]{2}) marks a group. Next comes the expression for possible delimiters, [\\.]{0,1}, in this case 0 or 1 dot, but you can allow more possible delimiters, for example [\\.|\]{0,1}.
Then you run matcher.find() which returns true if pattern matches. And then using matcher.group(int) you can get group by group. Note that index of first group is 1.
Then I construct clean date String using StringBuilder. And then parse date.
Cheers,
Michal
