Parsing String into Map using regular expressions in Java

I'm trying to parse input such as this:
VAR1: 7, VAR2: [1,2,3], VAR3: value1=1,value2=2, TIMEZONE: GMT+5, TIME: 17:15:00
into a Map:
{VAR1=7, VAR2=[1,2,3], VAR3=value1=1,value2=2, TIMEZONE=GMT+5, TIME=17:15:00}
So variables are separated by commas (,) and their values come after a colon (:). They're not always in caps; I wrote them like this to make it more obvious which are names of variables and which are values. Also, whitespace can appear anywhere around names or in values.
The problem is that commas can appear in values, as in VAR2 or VAR3, and colons can appear in values, as in TIME.
I tried splitting string like this to get values out:
final String regex = ",?\\s*(\\w+)\\s*:\\s*";
final String[] values = inputString.split(regex);
and it works as long as inputString doesn't contain any time variables with colons in its value. Otherwise it returns this as values:
[, 7, [1,2,3], value1=1,value2=2, GMT+5, , , 00]
instead of:
[7, [1,2,3], value1=1,value2=2, GMT+5, 17:15:00]
I suspect that it matches the last colon in TIME rather than the first one located after variable's name separating it from its value.
I tried making the colon optional with ",?\s*(\w+)\s*:?\s" but this returned:
[, :, , : [, , , ], :, =, , =, , :, +, , :, :, :]
Which is nonsense.
I would appreciate any ideas to improve regex.

Assuming that a variable name cannot start with a digit, the colons in the date/time are not a problem. I have more issues with the commas in the values.
Here's how I solved the problem:
String input = "VAR1: 7, VAR2: [1,2,3], VAR3: value1=1,value2=2, TIMEZONE: GMT+5, TIME: 17:15:00";
Pattern re = Pattern.compile(
"^\\s*(\\p{Alpha}\\p{Alnum}*)\\s*:\\s*(\\S*)(?:,\\s*(\\p{Alpha}\\p{Alnum}*\\s*:.*))?$");
Matcher matcher = re.matcher(input);
while (matcher.matches()) {
String name = matcher.group(1);
String value = matcher.group(2);
String tail = matcher.group(3);
System.out.println(name + ": " + value);
if (tail == null) {
break;
}
matcher = re.matcher(tail);
}
Result:
VAR1: 7
VAR2: [1,2,3]
VAR3: value1=1,value2=2
TIMEZONE: GMT+5
TIME: 17:15:00
UPDATE:
It also works with:
Pattern re = Pattern.compile(
"^\\s*(\\w+)\\s*:\\s*(\\S*)(?:,\\s*(\\w+\\s*:.*))?\\s*$");

Possible solution:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "(.+?):\\s?(.+?)(?:,\\W|$)";
final String string = "VAR1: 7, VAR2: [1,2,3], VAR3: value1 =1,value2=2, TIMEZONE: GMT+5, TIME: 17:15:00";
final Pattern pattern = Pattern.compile(regex);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
Just collect the results in a map to obtain what you asked for.
Regex explanation:
(.+?): Captures your keys (example: VAR1)
:: Matches the : symbol literally
\s?: Matches an optional whitespace character
(.+?): Captures your values (example: 7)
(?:,\W|$): Matches a comma followed by a non-word character such as a space (these two symbols together are our actual separator) OR the end of the string
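Collecting the matches into a map could look like the sketch below. The class name FindMapParser and the parse method are made up for illustration, and a LinkedHashMap is assumed so that the keys keep their input order. It also assumes that no value contains a comma followed by a non-word character, since the regex treats that as the separator.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindMapParser {
    // Same regex as in the answer: lazy key, colon, optional whitespace, lazy value,
    // then either ",<non-word char>" or the end of the input.
    private static final Pattern ENTRY = Pattern.compile("(.+?):\\s?(.+?)(?:,\\W|$)");

    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>(); // keeps insertion order
        Matcher matcher = ENTRY.matcher(input);
        while (matcher.find()) {
            // trim() drops stray whitespace around names and values
            result.put(matcher.group(1).trim(), matcher.group(2).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "VAR1: 7, VAR2: [1,2,3], VAR3: value1 =1,value2=2, TIMEZONE: GMT+5, TIME: 17:15:00";
        // prints {VAR1=7, VAR2=[1,2,3], VAR3=value1 =1,value2=2, TIMEZONE=GMT+5, TIME=17:15:00}
        System.out.println(parse(input));
    }
}
```

The time value survives intact because its colons come after the first colon of the entry, and the lazy value only stops at a comma followed by a non-word character or at the end of the string.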

Related

RegEx for matching special patterns

I'm trying to match a String like this: 62.00|LQ+2*2,FP,MD*3 "Description"
where the two decimal digits are optional, each user is identified by two characters, and a user can be followed by
(\+[\d]+)? or (\*[\d]+)? or none, or both, or both in different order
like:
LQ*2+4 | LQ+4*2 | LQ*2 | LQ+8 | LQ
Description is also optional
What I have tried is this:
Pattern.compile("^(?<number>[\\d]+(\\.[\\d]{2})?)\\|(?<users>([A-Z]{2}){1}(((\\+[\\d]+)?(\\*[\\d]+)?)|((\\+[\\d]+)?(\\*[\\d]+)?))((,[A-Z]{2})(((\\+[\\d]+)?(\\*[\\d]+)?)|((\\+[\\d]+)?(\\*[\\d]+)?)))*)(\\s\\\"(?<message>.+)\\\")?$");
I need to get all the users so I can split them by ',' and then further regex my way into it. But I cannot grab anything out of it. The desired output from
62.00|LQ+2*2,FP,MD*3 "Description"
Should be:
62.00
LQ+2*2,FP,MD*3
Description
Accepted inputs should be of these kind:
62.00|LQ+2*2,FP,MD*3
30|LQ "Burgers"
35.15|LQ*2,FP+2*4,MD*3+4 "Potatoes"
35.15|LQ,FP,MD

The following regex should match the inputs you described precisely:
^(\d+(?:\.\d{1,2})?)\|([a-zA-Z]{2}(?:(?:\+\d+(?:\*\d+)?)|(?:\*\d+(?:\+\d+)?))?(?:,[a-zA-Z]{2}(?:(?:\+\d+(?:\*\d+)?)|(?:\*\d+(?:\+\d+)?))?)*)(?: +(.+))?$
Here group1 will contain the number, which can have one or two optional decimal digits; group2 will contain the comma-separated users as you described in your post; and group3 will contain the optional description, if present.
Explanation of regex:
^ - Start of string
(\d+(?:\.\d{1,2})?) - Matches the number, which can optionally have one or two digits after the decimal point, and captures it in group1
\| - Matches literal | present in your input after the number
([a-zA-Z]{2}(?:(?:\+\d+(?:\*\d+)?)|(?:\*\d+(?:\+\d+)?))?(?:,[a-zA-Z]{2}(?:(?:\+\d+(?:\*\d+)?)|(?:\*\d+(?:\+\d+)?))?)*) - Matches two letters optionally followed by either + and a number (with an optional * and a number after it) or * and a number (with an optional + and a number after it), then zero or more comma-separated users of the same form, and captures the whole list in group2
(?: +(.+))? - Matches the optional description, including its surrounding quotes, and captures it in group3
$ - Marks end of input
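A minimal sketch of using this regex from Java follows. The class name OrderLineParser, the USER constant and the parse method are hypothetical. The pattern is assembled from a repeated USER fragment but is meant to be character for character the regex shown above. Stripping the quotes from group3 is an extra step assumed here, because the asker wants the description without them.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderLineParser {
    // One user: two letters plus an optional +n, *n, +n*m or *n+m suffix.
    private static final String USER =
            "[a-zA-Z]{2}(?:(?:\\+\\d+(?:\\*\\d+)?)|(?:\\*\\d+(?:\\+\\d+)?))?";

    private static final Pattern LINE = Pattern.compile(
            "^(\\d+(?:\\.\\d{1,2})?)\\|(" + USER + "(?:," + USER + ")*)(?: +(.+))?$");

    // Returns {number, users, description}, or null if the line does not match.
    // The description element is null when the line has no description.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String description = m.group(3);
        if (description != null && description.length() >= 2
                && description.startsWith("\"") && description.endsWith("\"")) {
            // group3 still carries the quotes; drop them
            description = description.substring(1, description.length() - 1);
        }
        return new String[] { m.group(1), m.group(2), description };
    }

    public static void main(String[] args) {
        // prints [62.00, LQ+2*2,FP,MD*3, Description]
        System.out.println(Arrays.toString(parse("62.00|LQ+2*2,FP,MD*3 \"Description\"")));
    }
}
```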

I'm guessing that we have several optional groups here, which should not be a problem. The problem I'm having is that I'm not quite sure what the range of our inputs would be and what the desired outputs might be.
RegEx 1
If we are just matching everything, which is my guess, we might start with something similar to:
[0-9]+(\.[0-9]{2})?\|[A-Z]{2}[+*]?([0-9]+)?[+*]?([0-9]+)?,[A-Z]{2},[A-Z]{2}[+*]?([0-9]+)?(\s+"Description")?
Here, we simply add a ? after every sub-expression that we wish to make optional, then we use character classes and quantifiers and work from left to right to cover all inputs.
If we want to capture, we simply wrap any part that we want captured in a capturing group ().
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "[0-9]+(\\.[0-9]{2})?\\|[A-Z]{2}[+*]?([0-9]+)?[+*]?([0-9]+)?,[A-Z]{2},[A-Z]{2}[+*]?([0-9]+)?(\\s+\"Description\")?";
final String string = "62.00|LQ+2*2,FP,MD*3 \"Description\"\n"
+ "62|LQ+2*2,FP,MD*3 \"Description\"\n"
+ "62|LQ+2*2,FP,MD*3\n"
+ "62|LQ*2,FP,MD*3\n"
+ "62|LQ+8,FP,MD*3\n"
+ "62|LQ,FP,MD";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
RegEx 2
If we wish to output the three groups that are listed:
([0-9]+(\.[0-9]{2})?)\|([A-Z]{2}[+*]?([0-9]+)?[+*]?([0-9]+)?,[A-Z]{2},[A-Z]{2}[+*]?([0-9]+)?)(\s+"Description")?
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "([0-9]+(\\.[0-9]{2})?)\\|([A-Z]{2}[+*]?([0-9]+)?[+*]?([0-9]+)?,[A-Z]{2},[A-Z]{2}[+*]?([0-9]+)?)(\\s+\"Description\")?";
final String string = "62.00|LQ+2*2,FP,MD*3 \"Description\"\n"
+ "62|LQ+2*2,FP,MD*3 \"Description\"\n"
+ "62|LQ+2*2,FP,MD*3\n"
+ "62|LQ*2,FP,MD*3\n"
+ "62|LQ+8,FP,MD*3\n"
+ "62|LQ,FP,MD";
final String subst = "$1\n$3\n$7"; // Java replacement strings reference groups with $n, not \n-style backreferences
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
// The substituted value will be contained in the result variable
final String result = matcher.replaceAll(subst);
System.out.println("Substitution result: " + result);
RegEx 3
Based on updated desired output, this might work:
([0-9]+(\.[0-9]{2})?)\|((?:[A-Z]{2}[+*]?([0-9]+)?[+*]?([0-9]+)?,?)(?:[A-Z]{2}[+*]?([0-9]+)?[*+]?([0-9]+)?,?[A-Z]{2}?[*+]?([0-9]+)?[+*]?([0-9]+)?)?)(\s+"(.+?)")?

Java Regex. group excluding delimiters

I'm trying to split my string using regex. It should include even zero-length matches before and after every delimiter. For example, if the delimiter is ^ and my string is ^^^, I expect to get 4 zero-length groups.
I cannot use just regex = "([^\\^]*)" because it will include extra zero-length matches after every true match between delimiters.
So I have decided to match non-delimiter symbols that follow the beginning of the line or a delimiter. It works perfectly on https://regex101.com/ (I'm sorry, I couldn't find a share option on this website to share my example), but in IntelliJ IDEA it skips one match.
So, now my code is:
final String regex = "(^|\\^)([^\\^]*)";
final String string = "^^^^";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find())
System.out.println("[" + matcher.start(2) + "-" + matcher.end(2) + "]: \"" + matcher.group(2) + "\"");
and I expect 5 empty-string matches. But I have only 4:
[0-0]: ""
[2-2]: ""
[3-3]: ""
[4-4]: ""
The question is: why does it skip the [1-1] match, and how can I fix it?

Your regex matches either the start of the string or a ^ (capturing that into Group 1) and then any 0+ chars other than ^ into Group 2. When the first match is found (at the start of the string), Group 1 holds an empty string (as it matched the start of the string) and Group 2 also holds an empty string (as the first char is ^ and [^^]* can match an empty string before a non-matching char). The whole match is zero-length, so the regex engine advances the regex index by one position before searching again. So, after the first match, the regex index is moved from the start of the string to the position after the first ^. Then the second match is found: the second ^ and the empty string after it. Hence, the first ^ is never matched; it is skipped.
The solution is a simple split one:
String[] result = string.split("\\^", -1);
The second argument, a negative limit, makes the method keep all trailing empty strings in the resulting array.
See a Java demo:
String str = "^^^^";
String[] result = str.split("\\^", -1);
System.out.println("Number of items: " + result.length);
for (String s: result) {
System.out.println("\"" + s+ "\"");
}
Output:
Number of items: 5
""
""
""
""
""

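If the match offsets are needed, which split does not provide, one possible alternative is to keep the Matcher and move the "start of input or delimiter" test into a lookbehind, so that the delimiter is never consumed. This is only a sketch under that assumption; the class name CaretFields and the describe method are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaretFields {
    // A field is a run of non-^ chars located at the start of the input or right after a ^.
    // The lookbehind checks the position without consuming the delimiter.
    private static final Pattern FIELD = Pattern.compile("(?<=^|\\^)[^^]*");

    // Returns one line per field in the form [start-end]: "text"
    static List<String> describe(String input) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = FIELD.matcher(input);
        while (matcher.find()) {
            lines.add("[" + matcher.start() + "-" + matcher.end() + "]: \"" + matcher.group() + "\"");
        }
        return lines;
    }

    public static void main(String[] args) {
        // For "^^^^" this prints five empty fields, [0-0] through [4-4].
        for (String line : describe("^^^^")) {
            System.out.println(line);
        }
    }
}
```
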
Parsing a string with [3:0] substring in it

I want to store two numbers from a string into two distinct variables - for example, var1 = 3 and var2 = 0 from "[3:0]". I have the following code snippet:
String myStr = "[3:0]";
if (myStr.trim().matches("\\[(\\d+)\\]")) {
// Do something.
// If execution gets here, I want to store 3 and 0 in different variables or an array
}
Is it possible doing this with split and regular expressions?

Don't call trim(). Enhance your regex instead.
Your regex is missing the pattern for : and the second number, and you don't need to escape the ].
To capture the matched numbers, you need the Matcher:
String myStr = " [3:0] ";
Matcher m = Pattern.compile("\\s*\\[(\\d+):(\\d+)]\\s*").matcher(myStr);
if (m.matches())
System.out.println(m.group(1) + ", " + m.group(2));
Output
3, 0

You can use replaceAll and split:
String myStr = "[3:0]";
if (myStr.trim().matches("\\[\\d+:\\d+\\]")) {
String[] numbers = myStr.replaceAll("[\\[\\]]","").split(":");
}
Moreover, your regex to match the String should be \\[\\d+:\\d+\\]. If you want to avoid trim, you can add \\s* at the start and end to match the optional spaces. But trim is not bad.
EDIT
As suggested by Andreas in comments,
String myStr = "[3:0]";
String regExp = "\\[(\\d+):(\\d+)\\]";
Pattern pattern = Pattern.compile(regExp);
Matcher matcher = pattern.matcher(myStr.trim());
if(matcher.find()) {
int a = Integer.parseInt(matcher.group(1));
int b = Integer.parseInt(matcher.group(2));
System.out.println(a + " : " + b);
}
OUTPUT
3 : 0

Without any regular expressions you could do this:
// this will remove the brackets [ and ] and just leave "3:0"
String numberString = myStr.trim().replace("[", "").replace("]", "");
// this will split the string in everything before the : and everything after the : (so two values as an array)
String[] numbers = numberString.split(":");
// get the first value and parse it as a number "3" will become a simple 3
int firstNumber = Integer.parseInt(numbers[0]) ;
// get the second value and parse it from "0" to a plain 0
int secondNumber = Integer.parseInt(numbers[1]);
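Be careful when parsing numbers, depending on your input string and what other possibilities there might be: Integer.parseInt throws a NumberFormatException if a part is empty, contains anything other than digits and an optional sign (including surrounding spaces), or does not fit in an int. Leading zeros are accepted, so both "3:12" and "3:02" are OK.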
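A guarded version of the steps above might look like the following sketch. The class name BracketPair and the parse method are hypothetical, and turning every malformed input into an IllegalArgumentException is just one possible design choice.

```java
class BracketPair {
    // Parses text such as "[3:0]" into {3, 0} without regular expressions.
    static int[] parse(String text) {
        // remove the brackets and surrounding whitespace, leaving "3:0"
        String inner = text.trim().replace("[", "").replace("]", "");
        String[] parts = inner.split(":");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected two numbers in: " + text);
        }
        try {
            return new int[] {
                Integer.parseInt(parts[0].trim()),
                Integer.parseInt(parts[1].trim())
            };
        } catch (NumberFormatException e) {
            // parseInt rejects empty, non-numeric, or out-of-range parts
            throw new IllegalArgumentException("Not a number in: " + text, e);
        }
    }

    public static void main(String[] args) {
        int[] pair = parse("[3:0]");
        System.out.println(pair[0] + ", " + pair[1]); // prints 3, 0
    }
}
```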

In case you don't need to validate the input and simply want to get the numbers from it, you could find indexOf(":") and take the substrings you are interested in, which are:
from the position after [ (which is at position 0) till :
and from the position after : till ] (which is at position length of string - 1)
Your code can look like
String text = "[3:0]";
int colonIndex = text.indexOf(':');
String first = text.substring(1, colonIndex);
String second = text.substring(colonIndex + 1, text.length() - 1);

Find words in string surrounded by "[" and "]":

I need help with a simple task in java. I have the following sentence:
Where Are You [Employee Name]?
your have a [Shift] shift..
I need to extract the strings that are surrounded by [ and ] signs.
I was thinking of using the split method with " " parameter and then find the single words, but I have a problem using that if the phrase I'm looking for contains: " ". using indexOf might be an option as well, only I don't know what is the indication that I have reached the end of the String.
What is the best way to perform this task?
Any help would be appreciated.

Try the regex \[(.*?)\] to match the words.
\[ : escaped [ for a literal match, as it is a meta char (the same applies to \]).
(.*?) : match everything in a non-greedy way and capture it in group 1.
Sample code:
Pattern p = Pattern.compile("\\[(.*?)\\]");
Matcher m = p.matcher("Where Are You [Employee Name]? your have a [Shift] shift.");
while(m.find()) {
System.out.println(m.group(1)); // group(1) is the text between the brackets; m.group() would include them
}

Here is a Java regular expression that extracts the text between two brackets, including whitespace:
import java.util.regex.*;
class Main
{
public static void main(String[] args)
{
String txt="[ Employee Name ]";
String re1=".*?";
String re2="( )";
String re3="((?:[a-z][a-z]+))"; // Word 1
String re4="( )";
String re5="((?:[a-z][a-z]+))"; // Word 2
String re6="( )";
Pattern p = Pattern.compile(re1+re2+re3+re4+re5+re6,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
String ws1=m.group(1);
String word1=m.group(2);
String ws2=m.group(3);
String word2=m.group(4);
String ws3=m.group(5);
System.out.print("("+ws1.toString()+")"+"("+word1.toString()+")"+"("+ws2.toString()+")"+"("+word2.toString()+")"+"("+ws3.toString()+")"+"\n");
}
}
}
If you want to ignore whitespace, remove the "( )" groups.

This is a Scanner-based solution:
Scanner sc = new Scanner("Where Are You [Employee Name]? your have a [Shift] shift..");
for (String s; (s = sc.findWithinHorizon("(?<=\\[).*?(?=\\])", 0)) != null;) {
System.out.println(s);
}
output
Employee Name
Shift

Use a StringBuilder (I assume you don't need synchronization).
As you suggested, indexOf() with your square bracket delimiters will give you a starting index and an ending index. Use substring(startIndex + 1, endIndex) to get exactly the string you want; the end index of substring is exclusive, so it must not be reduced by one.
I'm not sure what you meant by the end of the String, but indexOf("[") is the start and indexOf("]") is the end.
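The indexOf approach can be sketched as a loop. indexOf returns -1 when nothing more is found, which is the signal that the end of the string has been reached. The class name BracketScanner and the extract method are made up for this example, and nested brackets are assumed not to occur.

```java
import java.util.ArrayList;
import java.util.List;

class BracketScanner {
    // Collects every piece of text enclosed in [ and ], without the brackets.
    static List<String> extract(String text) {
        List<String> found = new ArrayList<>();
        int open = text.indexOf('[');
        while (open >= 0) {                        // -1 means no further '['
            int close = text.indexOf(']', open + 1);
            if (close < 0) {
                break;                             // unmatched '[': stop scanning
            }
            // substring's end index is exclusive, so 'close' itself is left out
            found.add(text.substring(open + 1, close));
            open = text.indexOf('[', close + 1);
        }
        return found;
    }

    public static void main(String[] args) {
        // prints [Employee Name, Shift]
        System.out.println(extract("Where Are You [Employee Name]? your have a [Shift] shift.."));
    }
}
```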

That's pretty much the use case for a regular expression.
Try "(\\[[\\w ]*\\])" as your expression.
Pattern p = Pattern.compile("(\\[[\\w ]*\\])");
Matcher m = p.matcher("Where Are You [Employee Name]? your have a [Shift] shift..");
if (m.find()) {
String found = m.group();
}
What does this expression do?
First it opens a group (
Then it defines the starting point for that group: \[ matches [. Since [ itself is a metacharacter in regular expressions, it has to be escaped with \, which in turn is reserved in Java Strings and has to be escaped by another \.
Then it defines the body of the group, [\w ]*. Here the regex brackets [] define a character class containing \w (any letter, digit or underscore) and a blank. The * means zero or more characters from that class.
Then it defines the endpoint of the group, \]
and closes the group )

Punctuation Regex in Java

First, I read the documentation at
http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
I want to find any punctuation character EXCEPT #',& but I don't quite understand how.
Here is my code:
public static void main( String[] args )
{
// String to be scanned to find the pattern.
String value = "#`~!#$%^";
String pattern = "\\p{Punct}[^#',&]";
// Create a Pattern object
Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);
// Now create matcher object.
Matcher m = r.matcher(value);
if (m.find()) {
System.out.println("Found value: " + m.groupCount());
} else {
System.out.println("NO MATCH");
}
}
Result is NO MATCH.
Is there any mismatch ?
Thanks
MRizq

You're matching two characters, not one. Using a (negative) lookahead should solve the task:
(?![#',&])\\p{Punct}
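As a rough sketch of how the lookahead version could be used, the following collects every match into a list. The class name PunctFinder and the find method are hypothetical. Without any flags, \p{Punct} is the ASCII (POSIX) punctuation class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PunctFinder {
    // The negative lookahead rejects #, ', the comma and & before \p{Punct} consumes one char.
    private static final Pattern PUNCT = Pattern.compile("(?![#',&])\\p{Punct}");

    static List<String> find(String value) {
        List<String> matches = new ArrayList<>();
        Matcher m = PUNCT.matcher(value);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // prints [`, ~, !, $, %, ^]
        System.out.println(find("#`~!#$%^"));
    }
}
```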

You may use character subtraction here:
String pat = "[\\p{Punct}&&[^#',&]]";
The whole pattern represents a character class, [...], that contains a \p{Punct} POSIX character class, the && intersection operator and [^...] negated character class.
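A Unicode modifier might be necessary if you plan to also match all Unicode punctuation:
String pat = "(?U)[\\p{Punct}&&[^#',&]]";
              ^^^^
The pattern matches any punctuation (with \p{Punct}) except #, ', the comma (,) and &. Note that with (?U), \p{Punct} follows the Unicode punctuation category, so ASCII symbols such as $, ^, ` and ~ are no longer matched.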
If you need to exclude more characters, add them to the negated character class. Just remember to always escape -, \, ^, [ and ] inside a Java regex character class/set. E.g. adding a backslash and - might look like "[\\p{Punct}&&[^#',&\\\\-]]" or "[\\p{Punct}&&[^#',&\\-\\\\]]".
Java demo:
String value = "#`~!#$%^，";
String pattern = "(?U)[\\p{Punct}&&[^#',&]]";
Pattern r = Pattern.compile(pattern); // Create a Pattern object
Matcher m = r.matcher(value); // Now create matcher object.
while (m.find()) {
System.out.println("Found value: " + m.group());
}
Output:
Found value: !
Found value: %
Found value: ，
