Splitting string into two strings with regex - java

This question was asked several times before but I couldn't find an answer to my question:
I need to split a string into two strings. The first part is a date and the second part is text. This is what I have so far:
String test = "24.12.17 18:17 TestString";
String[] testSplit = test.split("\\d{2}.\\d{2}.\\d{2} \\d{2}:\\d{2}");
System.out.println(testSplit[0]); // "24.12.17 18:17" <-- Does not work
System.out.println(testSplit[1].trim()); // "TestString" <-- works
I can extract "TestString" but I lose the date. Is there a better (or even simpler) way? Help is highly appreciated!

Skip regex; Use three strings
You are working too hard. No need to include the date and the time together as one. Regex is tricky, and life is short.
Just use the plain String::split for three pieces, and re-assemble the date-time.
String[] pieces = "24.12.17 18:17 TestString".split( " " ) ; // Split into 3 strings.
LocalDate ld = LocalDate.parse( pieces[0] , DateTimeFormatter.ofPattern( "dd.MM.uu" ) ) ; // Parse the first string as a date value (`LocalDate`).
LocalTime lt = LocalTime.parse( pieces[1] , DateTimeFormatter.ofPattern( "HH:mm" ) ) ; // Parse the second string as a time-of-day value (`LocalTime`).
LocalDateTime ldt = LocalDateTime.of( ld , lt ) ; // Reassemble the date with the time (`LocalDateTime`).
String description = pieces[2] ; // Use the last remaining string.
See this code run live at IdeOne.com.
ldt.toString(): 2017-12-24T18:17
description: TestString
Tip: If you have any control over that input, switch to using standard ISO 8601 formats for date-time values in text. The java.time classes use the standard formats by default when generating/parsing strings.
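To illustrate the tip, here is a minimal sketch that assumes the input could be changed to carry an ISO 8601 date-time (the input string and the class name IsoParseSketch are hypothetical, not from the question). With that format, LocalDateTime.parse needs no explicit formatter:

```java
import java.time.LocalDateTime;

class IsoParseSketch {
    // Parses an ISO 8601 date-time string. No DateTimeFormatter is passed
    // because LocalDateTime.parse(CharSequence) uses ISO_LOCAL_DATE_TIME by default.
    static LocalDateTime parseIso(String text) {
        return LocalDateTime.parse(text);
    }

    public static void main(String[] args) {
        // Hypothetical ISO 8601 variant of the input: date-time first, then the text.
        String[] pieces = "2017-12-24T18:17 TestString".split(" ", 2);
        LocalDateTime ldt = parseIso(pieces[0]);
        System.out.println(ldt);       // 2017-12-24T18:17
        System.out.println(pieces[1]); // TestString
    }
}
```

The limit argument of 2 in split keeps any further spaces inside the text part.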

You want to match only the separator. By matching the date, you consume it (it's thrown away).
Use a look behind, which asserts but does not consume:
test.split("(?<=^.{14}) ");
This regex means "split on a space that is preceded by 14 characters after the start of input".
Your test code now works:
String test = "24.12.17 18:17 TestString";
String[] testSplit = test.split("(?<=^.{14}) ");
System.out.println(testSplit[0]); // "24.12.17 18:17" <-- works
System.out.println(testSplit[1].trim()); // "TestString" <-- works

If your string is always in this format (and is formatted well), you do not even need to use a regex. Just split at the second space using .substring and .indexOf:
String test = "24.12.17 18:17 TestString";
int idx = test.indexOf(" ", test.indexOf(" ") + 1);
System.out.println(test.substring(0, idx));
System.out.println(test.substring(idx).trim());
See the Java demo.
If you want to make sure your string starts with a datetime value, you may use a matching approach to match the string with a pattern containing 2 capturing groups: one will capture the date and the other will capture the rest of the string:
String test = "24.12.17 18:17 TestString";
String pat = "^(\\d{2}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2})\\s(.*)";
Matcher matcher = Pattern.compile(pat, Pattern.DOTALL).matcher(test);
if (matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2).trim());
}
See the Java demo.
Details:
^ - start of string
(\\d{2}\\.\\d{2}\\.\\d{2} \\d{2}:\\d{2}) - Group 1: a datetime pattern (xx.xx.xx xx:xx-like pattern)
\\s - a whitespace (if it is optional, add * after it)
(.*) - Group 2 capturing any 0+ chars up to the end of string (. will match line breaks, too, because of the Pattern.DOTALL flag).


Strip date from String with junk data - Java

I need to know if there is any way to strip just the date from text like the samples below using Java. I am trying to find something generic, but couldn't find any help, because the input varies.
Some example inputs:
This time is Apr.19,2021 Cheers
or
19-04-2021 Cheers
or
This time is 19-APR-2021
I have seen some code which works for trailing junk characters but couldn't find anything if the date is in between the string and if it varies to different formats.
We could use String#replaceAll here for a regex one-liner approach:
String[] inputs = new String[] {
"This time is Apr.19,2021 Cheers",
"19-04-2021 Cheers",
"This time is 19-APR-2021",
"Hello 2021-19-Apr World"
};
for (String input : inputs) {
String date = input.replaceAll(".*(?<!\\S)(\\S*\\b\\d{4}\\b\\S*).*", "$1");
System.out.println(date);
}
This prints:
Apr.19,2021
19-04-2021
19-APR-2021
2021-19-Apr
If you assume a "date" is any series of letter/number/dot/comma/dash chars that ends with a 4-digit "word", match that and replace it with a blank to delete it:
str = str.replaceAll("\\b[A-Za-z0-9.,-]+\\b\\d{4}\\b", "");
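As a rough sketch of how that one-liner behaves on the sample inputs (the class name StripDateSketch and the extra whitespace cleanup are additions for illustration, not part of the answer):

```java
class StripDateSketch {
    // Deletes a "date" as defined above (a run of letters, digits, dots, commas
    // or dashes that ends in a 4-digit word), then tidies leftover whitespace.
    static String stripDate(String str) {
        String withoutDate = str.replaceAll("\\b[A-Za-z0-9.,-]+\\b\\d{4}\\b", "");
        // Assumption: collapsing the gap left behind is wanted; drop this step if not.
        return withoutDate.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "This time is Apr.19,2021 Cheers",
            "19-04-2021 Cheers",
            "This time is 19-APR-2021"
        };
        for (String input : inputs) {
            System.out.println(stripDate(input));
        }
    }
}
```

Note that a year-first value such as 2021-19-Apr does not end with the 4-digit word, so this pattern leaves it in place.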

Regex that covers multiple date formats

What regex to choose to cover all the following scenarios:
Basically I have to extract prefix and suffix.
prefix.YYYY-MM-DD-HH-MM-SS.suffix
YYYY-MM-DD is mandatory.
HH-MM-SS is optional. (It could be HH only or HH-MM or HH-MM-SS)
Samples:
"test1.2020-03-07-00.test.com",
"test2.2020-03-06-16.test2.test1.com",
"test3.2020-03-06-16-13-40.test2.test1.com",
"test4.2020-03-06-16-13.test.com",
"test5.ext.2020-03-11-17-57.test1.com"
"test6.ext.2020-03-11.test1.test2.test3.com"
I use this regex but it fails:
Pattern.compile(".\\d{4}-\\d{2}-\\d{2}(-\\d{2}-\\d{2}-\\d{2})?.*?");
Here is one solution:
(.+)\.\d{4}(?:-\d{2}){2,5}\.(.+)
(.+) capturing group for the prefix.
\. literal dot.
\d{4} 4 digits.
(?:-\d{2}){2,5} non-capturing group for literal dash followed by 2 digits,
repeated at least 2 times and at most 5 times.
\. literal dot.
(.+) capturing group for the suffix.
For example:
var pattern = Pattern.compile("(.+)\\.\\d{4}(?:-\\d{2}){2,5}\\.(.+)");
var matcher = pattern.matcher("test1.2020-03-07-00.test.com");
if(matcher.matches())
{
String prefix = matcher.group(1);
String suffix = matcher.group(2);
System.out.println("prefix: " + prefix);
System.out.println("suffix: " + suffix);
}
Output:
prefix: test1
suffix: test.com
First remember that . period is a special regex pattern matching any character, so to specifically match a period, you need to escape it as \.
You said yourself that the time part "could be HH only or HH-MM or HH-MM-SS", so you shouldn't expect (-\\d{2}-\\d{2}-\\d{2})? to match that. Since you don't need to capture it, use a (?:...) non-capturing group, and nest them: (?:-\\d{2}(?:-\\d{2}(?:-\\d{2})?)?)?. Better yet, since the 3 parts are the same, use (?:-\\d{2}){0,3}
You said "I have to extract prefix and suffix", so you should add that to the pattern.
Pattern p = Pattern.compile("^(.*?)\\.(\\d{4}(?:-\\d{2}){2,5})\\.(.*)$");
for (String s : new String[] { "test1.2020-03-07-00.test.com",
"test2.2020-03-06-16.test2.test1.com",
"test3.2020-03-06-16-13-40.test2.test1.com",
"test4.2020-03-06-16-13.test.com",
"test5.ext.2020-03-11-17-57.test1.com",
"test6.ext.2020-03-11.test1.test2.test3.com" }) {
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.printf("prefix = '%s', date = '%s', suffix = '%s'%n",
m.group(1), m.group(2), m.group(3));
} else {
System.out.printf("NO MATCH: '%s'%n", s);
}
}
Output
prefix = 'test1', date = '2020-03-07-00', suffix = 'test.com'
prefix = 'test2', date = '2020-03-06-16', suffix = 'test2.test1.com'
prefix = 'test3', date = '2020-03-06-16-13-40', suffix = 'test2.test1.com'
prefix = 'test4', date = '2020-03-06-16-13', suffix = 'test.com'
prefix = 'test5.ext', date = '2020-03-11-17-57', suffix = 'test1.com'
prefix = 'test6.ext', date = '2020-03-11', suffix = 'test1.test2.test3.com'
I would suggest a different approach. Finding an appropriate regex would be very difficult, if not impossible. I dealt with the problem of parsing a date from any possible format that is not known in advance, and I came up with an idea. Of course, there is no 100% solution to this issue, but here is what I did. I created a property file that contains a list of currently supported formats. When a String needs to be parsed, attempts are made consecutively with each mask until the date is parsed successfully or the masks run out. The pros of the idea:
1. Since the file is external, it can be updated with additional formats without any need to change the code.
2. The file can be customized on a per-customer basis, placing the preferred formats first. For example, for US-based customers you would place US formats first (such as MM-dd-yyyy) and European formats after that, and vice versa for European-based customers. So when a date such as 07-08-2000 comes in, it is parsed as July 8th for US-based customers but as August 7th for European customers. In short: flexibility.
For more details read my article on the topic - Java 8 java.time package: parsing any string to date
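A minimal sketch of the try-each-mask idea follows. It is not the code from the article: the class name MultiFormatDateParser and the in-memory pattern lists are assumptions, standing in for the formats that the answer loads from a property file.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.List;
import java.util.Optional;

class MultiFormatDateParser {
    // Tries each pattern in order and returns the first successful parse.
    // The order of the list encodes the per-customer preference.
    static Optional<LocalDate> parse(String text, List<String> patterns) {
        for (String pattern : patterns) {
            try {
                return Optional.of(LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern)));
            } catch (DateTimeParseException e) {
                // This mask did not fit; fall through to the next one.
            }
        }
        return Optional.empty(); // ran out of masks
    }

    public static void main(String[] args) {
        // Hypothetical mask lists; in the answer these come from a property file.
        List<String> usFirst = List.of("MM-dd-uuuu", "dd-MM-uuuu");
        List<String> euFirst = List.of("dd-MM-uuuu", "MM-dd-uuuu");
        System.out.println(parse("07-08-2000", usFirst)); // Optional[2000-07-08]
        System.out.println(parse("07-08-2000", euFirst)); // Optional[2000-08-07]
    }
}
```

An unambiguous value such as 25-12-2000 still parses with the US-first list, because the first mask rejects month 25 and the second mask is tried.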

java regex capturing 2 numbers

I'm looking for a way to capture the year and the last number of a string, e.g. "01/02/2017,546.12,24.2,". My problem: so far I only get "Found value: 2017" and "Found value: null". I'm not able to capture group(2). Thanks
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.Scanner;
public class Bourse {
public static void main( String args[] ) {
Scanner clavier = new Scanner(System.in);
// String to be scanned to find the pattern.
String line = clavier.nextLine();
String pattern = "(?<=\\/)(\\d{4})|(\\d+(?:\\.\\d{1,2}))(?=,$)";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.find( )) {
System.out.println("Found value: " + m.group(1) );
System.out.println("Found value: " + m.group(2) );
} else {
System.out.println("NO MATCH");
}
}
}
Try this one:
(\\d{2}\\.?\\d{2})
\\d{2} - exactly two digits
\\.? - optional dot
\\d{2} - exactly two digits
If I understood you correctly you're looking for 4 digits, which could be separated by dot.
Your requirements are not very clear, but this works for me to simply grab the year and the last decimal value:
Pattern pattern = Pattern.compile("[0-9]{2}/[0-9]{2}/([0-9]{4}),[^,]+,([0-9.]+),");
String text = "01/02/2017,546.12,24.2,";
Matcher matcher = pattern.matcher(text);
if (matcher.find()) {
String year = matcher.group(1);
String lastDecimal = matcher.group(2);
System.out.println("Year "+year+"; decimal "+lastDecimal);
}
I don't know whether you're deliberately using lookbehind and lookahead, but I think it's simpler to explicitly specify the full date pattern and consume the value between two explicit comma characters. (Obviously if you need the comma to remain in play you can replace the final comma with a lookahead.)
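For the variant mentioned in parentheses, a small sketch could look like this (the class name LookaheadCommaSketch is made up for illustration). The only change to the pattern is that the final comma moves into a lookahead, so it is asserted but not consumed:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadCommaSketch {
    // Same pattern as above, except the trailing comma is a lookahead (?=,).
    static final Pattern PATTERN =
            Pattern.compile("[0-9]{2}/[0-9]{2}/([0-9]{4}),[^,]+,([0-9.]+)(?=,)");

    // Returns a matcher positioned on the first match, or null if there is none.
    static Matcher match(String text) {
        Matcher matcher = PATTERN.matcher(text);
        return matcher.find() ? matcher : null;
    }

    public static void main(String[] args) {
        String text = "01/02/2017,546.12,24.2,";
        Matcher m = match(text);
        if (m != null) {
            System.out.println("Year " + m.group(1) + "; decimal " + m.group(2));
            // The match ends before the final comma, which stays available
            // for any matching that continues from m.end().
            System.out.println("Match ends at index " + m.end());
        }
    }
}
```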
By the way, I'm not a fan of the \d shorthand because in many languages this will match all digit characters from the entire Unicode character space, when usually only matching of ASCII digits 0-9 is desired. (Java does only match ASCII digits when \d is used, but I still think it's a bad habit.)
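A short sketch of the \d behaviour in Java, for anyone who wants to check it (the class and method names are hypothetical). By default \d is ASCII-only; the Pattern.UNICODE_CHARACTER_CLASS flag widens it to Unicode decimal digits:

```java
import java.util.regex.Pattern;

class DigitClassSketch {
    // Default Java behaviour: \d is equivalent to [0-9].
    static boolean isDigitDefault(String s) {
        return Pattern.compile("\\d").matcher(s).matches();
    }

    // With UNICODE_CHARACTER_CLASS, \d matches any Unicode decimal digit.
    static boolean isDigitUnicode(String s) {
        return Pattern.compile("\\d", Pattern.UNICODE_CHARACTER_CLASS).matcher(s).matches();
    }

    public static void main(String[] args) {
        String arabicIndicThree = "\u0663"; // ARABIC-INDIC DIGIT THREE
        System.out.println(isDigitDefault("7"));              // true
        System.out.println(isDigitDefault(arabicIndicThree)); // false
        System.out.println(isDigitUnicode(arabicIndicThree)); // true
    }
}
```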
Parse, not regex
Regex is overkill here.
Just split the string on the comma-delimiter.
String input = "01/02/2017,546.12,24.2,";
String[] parts = input.split( "," );
Parse each element into a meaningful object rather than treating everything as text.
For a date-only value, the modern approach uses the java.time.LocalDate class built into Java 8 and later.
// Parse the first element, a date-only value.
DateTimeFormatter f = DateTimeFormatter.ofPattern( "dd/MM/uuuu" );
LocalDate localDate = null;
String inputDate = parts[ 0 ] ;
try
{
localDate = LocalDate.parse( inputDate , f );
} catch ( DateTimeException e )
{
System.out.println( "ERROR - invalid input for LocalDate: " + parts[ 0 ] );
}
For numbers with decimals where accuracy matters, avoid the floating-point types and instead use BigDecimal. Given your class name “Bourse”, I assume the numbers relate to money, so accuracy matters. Always use BigDecimal for money matters.
// Loop the numbers
List < BigDecimal > numbers = new ArrayList <>( parts.length );
for ( int i = 1 ; i < parts.length ; i++ )
{ // Start index at 1, skipping over the first element (the date) at index 0.
String s = parts[ i ];
if ( null == s )
{
continue;
}
if ( s.isEmpty( ) )
{
continue;
}
BigDecimal bigDecimal = new BigDecimal( parts[ i ] );
numbers.add( bigDecimal );
}
Extract your two desired pieces of information: the year, and the last number.
Consider passing around a Year object in your code rather than a mere integer to represent the year. This gives you type-safety and makes your code more self-documenting.
// Goals: (1) Get the year of the date. (2) Get the last number.
Year year = Year.from( localDate ); // Where possible, use an object rather than a mere integer to represent the year.
int y = localDate.getYear( );
BigDecimal lastNumber = numbers.get( numbers.size( ) - 1 ); // Fetch last element from the List.
Dump to console.
System.out.println("input: " + input );
System.out.println("year.toString(): " + year );
System.out.println("lastNumber.toString(): " + lastNumber );
See this code run live at IdeOne.com.
input: 01/02/2017,546.12,24.2,
year.toString(): 2017
lastNumber.toString(): 24.2

Java regex of string

I want to parse strings to get fields from them. The format of the string (which come from a dataset) is as so (the -> represents a tab, and the * represents a space):
Date(yyyymmdd)->Date(yyyymmdd)->*City,*State*-->Description
I am only interested in the 1st date and the State. I tried regex like this:
String txt="19951010 19951011 Red City, WI Description";
String re1="(\\d+)"; // Integer Number 1
String re2=".*?"; // Non-greedy match on filler
String re3="(?:[a-z][a-z]+)"; // Uninteresting: word
String re4=".*?"; // Non-greedy match on filler
String re5="(?:[a-z][a-z]+)"; // Uninteresting: word
String re6=".*?"; // Non-greedy match on filler
String re7="((?:[a-z][a-z]+))"; // Word 1
Pattern p = Pattern.compile(re1+re2+re3+re4+re5+re6+re7,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
String int1=m.group(1);
String word1=m.group(2);
System.out.print("("+int1.toString()+")"+"("+word1.toString()+")"+"\n");
}
It works fine if the city has two words (Red City): then the State is extracted properly. But if the City has only one word, it does not work. I can't figure it out. I don't need to use regex and am open to any other suggestions. Thanks.
Problem:
Your problem is that each component of your current regex essentially matches a number or [a-z] word, separated by anything that isn't [a-z], which includes commas. So your parts for a two word city are:
Input:
19951010 19951011 Red City, WI Description
Your components:
String re1="(\\d+)"; // Integer Number 1
String re2=".*?"; // Non-greedy match on filler
String re3="(?:[a-z][a-z]+)"; // Uninteresting: word
String re4=".*?"; // Non-greedy match on filler
String re5="(?:[a-z][a-z]+)"; // Uninteresting: word
String re6=".*?"; // Non-greedy match on filler
String re7="((?:[a-z][a-z]+))"; // Word 1
What they match:
re1: "19951010"
re2: " 19951011 "
re3: "Red" (stops at non-letter, e.g. whitespace)
re4: " "
re5: "City" (stops at non-letter, e.g. the comma)
re6: ", " (stops at word character)
re7: "WI"
But with a one-word city:
Input:
19951010 19951011 Pittsburgh, PA Description
What they match:
re1: "19951010"
re2: " 19951011 "
re3: "Pittsburgh" (stops at non-letter, e.g. the comma)
re4: ", "
re5: "PA" (stops at non-letter, e.g. whitespace)
re6: " " (stops at word character)
re7: "Description" (but you want this to be the state)
Solution:
You should do two things. First, simplify your regex a bit; you are going kind of crazy specifying greedy vs. reluctant, etc. Just use greedy patterns. Second, think about the simplest way to express your rules.
Your rules really are:
Date
A bunch of characters that aren't a comma (including second date and city name).
A comma.
State (one word).
So build a regex that sticks to that. You can, as you are doing now, take a shortcut by skipping the second number, but note that you do lose support for cities that start with numbers (which probably won't happen). Also, you don't care about the city. So, e.g.:
String re1 = "(\\d+)"; // match first number
String re2 = "[^,]*"; // skip everything that's not a comma
String re3 = ","; // skip the comma
String re4 = "[\\s]*"; // skip whitespace
String re5 = "([a-z]+)"; // match letters (state)
String regex = re1 + re2 + re3 + re4 + re5;
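Put together, the components above might be used as follows. This is a sketch with a hypothetical class name (DateStateExtractor). Note that [a-z]+ only matches an upper-case state such as "WI" because the pattern is compiled with Pattern.CASE_INSENSITIVE, as in the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateStateExtractor {
    // re1..re5 from above, concatenated. CASE_INSENSITIVE lets [a-z]+ match "WI".
    private static final Pattern PATTERN =
            Pattern.compile("(\\d+)[^,]*,[\\s]*([a-z]+)", Pattern.CASE_INSENSITIVE);

    // Returns {date, state}, or null when the line does not fit the rules.
    static String[] extract(String txt) {
        Matcher m = PATTERN.matcher(txt);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] lines = {
            "19951010 19951011 Red City, WI Description",
            "19951010 19951011 Pittsburgh, PA Description"
        };
        for (String line : lines) {
            String[] fields = extract(line);
            System.out.println("(" + fields[0] + ")(" + fields[1] + ")");
        }
    }
}
```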
There are other options as well, but I personally find regular expressions to be very straightforward for things like this. You could use various combinations of split(), as other posters have detailed. You could directly look for commas and whitespace with indexOf() and pull out substrings. You could even convince a Scanner or perhaps a StringTokenizer or StreamTokenizer to work for you. However, regular expressions exist to solve problems like this and are a good tool for the job.
Here is an example with StringTokenizer:
StringTokenizer t = new StringTokenizer(txt, " \t");
String date = t.nextToken();
t.nextToken(); // skip second date
t.nextToken(","); // change delimiter to comma and skip city
t.nextToken(" \t"); // back to whitespace and skip comma
String state = t.nextToken();
Still, I feel a regex expresses the rules more cleanly.
By the way, for future debugging, sometimes it helps to just print out all of the capture groups, this can give you insight into what is matching what. A good technique is to put every component of your regex in a capture group temporarily, then print them all out.
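As a sketch of that debugging technique (class and method names are hypothetical), every component of the original regex is wrapped in its own capture group and all groups are printed. Run against the one-word-city input, it shows the last group landing on "Description" instead of the state:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDumpSketch {
    // Returns the text captured by every group of the first match (empty list if no match).
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(input);
        List<String> result = new ArrayList<>();
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                result.add(m.group(i));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The question's re1..re7, each temporarily turned into a capture group.
        String debugRegex = "(\\d+)(.*?)([a-z][a-z]+)(.*?)([a-z][a-z]+)(.*?)([a-z][a-z]+)";
        List<String> captured = groups(debugRegex, "19951010 19951011 Pittsburgh, PA Description");
        for (int i = 0; i < captured.size(); i++) {
            System.out.println("group " + (i + 1) + ": \"" + captured.get(i) + "\"");
        }
    }
}
```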
No need to be so complex with this. You can split on the comma first and then on whitespace!
// s is your string
String[] first = s.split("\\s*,\\s*");
String[] firstHalf = first[0].split("\\s+");
String[] secondHalf = first[1].split("\\s+");
String date = firstHalf[0];
String state = secondHalf[0];
Now you have your date and your state! Do with them what you want.

Extracting dates from string

I have a list with file names that look roughly like this: Gadget1-010912000000-020912235959.csv, i.e. they contain two dates indicating the timespan of their data.
The user enters a date format and a file format:
File Format in this case: *GADGET*-*DATE_FROM*-*DATE_TO*.csv
Date format in this case: ddMMyyHHmmss
What I want to do is extracting the three values out of the file name with the given file and date format.
My problem is: since the date format can differ heavily (hours, minutes and seconds can be separated by a colon, dates by a dot, ...), I don't quite know how to create a fitting regular expression.
You can use a regular expression to remove non digits characters, and then parse value.
DateFormat dateFormat = new SimpleDateFormat("ddMMyyHHmmss");
String[] fileNameDetails = ("Gadget1-010912000000-020912235959").split("-");
/* Remove all non-digit characters. If there are none, the original string is kept. */
String date = fileNameDetails[1].replaceAll("[^0-9]", "");
try {
Date parsed = dateFormat.parse(date); // parse the cleaned string, not the raw one
} catch (ParseException e) {
e.printStackTrace(); // the cleaned value does not fit ddMMyyHHmmss
}
Hope it helps.
SimpleDateFormat solves your issue. You can define the format with commas, spaces and whatever and simply parse according to the format:
http://docs.oracle.com/javase/6/docs/api/java/text/SimpleDateFormat.html
So you map your format (e.g ddMMyyHHmmss) to a corresponding SimpleDateFormat.
SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
Date x = format.parse("010912000000");
If the format changes, you simply change the SimpleDateFormat
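Applied to the file name from the question, the same idea might look like the sketch below. It swaps in java.time (DateTimeFormatter and LocalDateTime) in place of SimpleDateFormat, and it assumes the file format is always GADGET-DATE_FROM-DATE_TO plus an extension, with a date format that contains no dash. The class name FileNameDatesSketch is hypothetical.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class FileNameDatesSketch {
    // Returns {from, to} parsed from a name like "Gadget1-010912000000-020912235959.csv".
    // Assumption: the three parts are separated by dashes and the date pattern has none.
    static LocalDateTime[] parseRange(String fileName, String datePattern) {
        DateTimeFormatter formatter = DateTimeFormatter.ofPattern(datePattern);
        String base = fileName.substring(0, fileName.lastIndexOf('.')); // drop the extension
        String[] parts = base.split("-"); // {gadget, dateFrom, dateTo}
        return new LocalDateTime[] {
            LocalDateTime.parse(parts[1], formatter),
            LocalDateTime.parse(parts[2], formatter)
        };
    }

    public static void main(String[] args) {
        LocalDateTime[] range = parseRange("Gadget1-010912000000-020912235959.csv", "ddMMyyHHmmss");
        System.out.println("from: " + range[0]);
        System.out.println("to:   " + range[1]);
    }
}
```

If the user-supplied date format may itself contain dashes, splitting on "-" is no longer safe and the file format would need to be turned into a pattern with explicit groups instead.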
You can use a series of date-time formats, trying each until one works.
You may need to order the formats to prioritize matches.
For example, with Joda time, you can use DateTimeFormat.forPattern() and DateTimeFormatter.getParser() for each of a series of patterns. Try DateTimeParser.parseInto() until one succeeds.
One nice thing about this approach is that it is easy to add and remove patterns.
Use Pattern and Matcher class.
Look at the example:
String inputDate = "01.09.12.00:00:00";
Pattern pattern = Pattern.compile(
"([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})");
Matcher matcher = pattern.matcher(inputDate);
matcher.find();
StringBuilder cleanStr = new StringBuilder();
for(int i = 1; i <= matcher.groupCount(); i++) {
cleanStr.append(matcher.group(i));
}
SimpleDateFormat format = new SimpleDateFormat("ddMMyyHHmmss");
Date x = format.parse(cleanStr.toString());
System.out.println(x.toString());
The most important part is line
Pattern pattern = Pattern.compile(
"([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[\\.]{0,1}([0-9]{2})[:]{0,1}([0-9]{2})[:]{0,1}([0-9]
Here you define the regexp and mark groups in parentheses, so ([0-9]{2}) marks a group. Between the groups is an expression for possible delimiters, [\\.]{0,1}, which in this case means 0 or 1 dot. You can allow more delimiters by adding them to the character class, for example [\\.\\-]{0,1} to also accept a dash.
Then you run matcher.find() which returns true if pattern matches. And then using matcher.group(int) you can get group by group. Note that index of first group is 1.
Then I construct clean date String using StringBuilder. And then parse date.
Cheers,
Michal
