How to split a string which contains multiple key value pairs - java

I have a string:
Single line : Some text
Multi1: multi (Va1) Multi2 : multi (Va2) Multi3 : multi (Val3)
Dots....20/12/2013 (EOY)
and I am trying to retrieve all the key value pairs. My first attempt
(Single line|Multi[0-9]{1}|Dots)( *:? [.] *| *:? )(.)
seems to work but does not handle multiple key value pairs on one line. Is there any way to achieve this?

Try this:
String text = "Single line : Some text\r\n" +
        "Multi1: multi (Va1) Multi2 : multi (Va2) Multi3 : multi (Val3)\r\n" +
        "Dots....20/12/2013 (EOY)";
Pattern pattern = Pattern.compile("(\\p{Alnum}[\\p{Alnum}\\s/]+?)\\s?(:|\\.+)\\s?(\\p{Alnum}[\\p{Alnum}\\s/]+?)(?=($|\\()|(\\s\\())", Pattern.MULTILINE);
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.println(matcher.group(1) + "-->" + matcher.group(3));
}
Output:
Single line-->Some text
Multi1-->multi
Multi2-->multi
Multi3-->multi
Dots-->20/12/2013
Explanation:
I am limiting the keys and values to "starts with alphanumeric",
"contains any number of alphanumerics, spaces or slashes".
I am limiting the separator to "optional space, :, optional space" or
"optional space, any number of consecutive dots, optional space".
I am using groups 1 and 3 to define the key and value in the
Pattern.
Group 2 is used to provide alternate separators as above.
Finally, the Pattern is delimited at the end, either with a new
line, or with an open round bracket, or, with a space followed by an
open round bracket.
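Note that I repeated the bracket alternatives inside the lookahead, but that is not strictly necessary: Java accepts quantifiers inside a lookahead (only lookbehind has length restrictions), so the final group could be shortened to (?=$|\s?\().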
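If the pairs are needed later rather than just printed, the same loop can fill a map. This is only a minimal sketch: the class name KeyValuePairs and the helper parse are made up for illustration, the pattern is copied unchanged from the answer above, and \n is used as the line separator.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KeyValuePairs {
    // Same pattern as in the answer above, copied unchanged.
    private static final Pattern PAIR = Pattern.compile(
        "(\\p{Alnum}[\\p{Alnum}\\s/]+?)\\s?(:|\\.+)\\s?(\\p{Alnum}[\\p{Alnum}\\s/]+?)(?=($|\\()|(\\s\\())",
        Pattern.MULTILINE);

    // Hypothetical helper: collects every key/value match in the order found.
    static Map<String, String> parse(String text) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher matcher = PAIR.matcher(text);
        while (matcher.find()) {
            // Group 1 is the key, group 3 the value; group 2 is the separator.
            pairs.put(matcher.group(1), matcher.group(3));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String text = "Single line : Some text\n"
            + "Multi1: multi (Va1) Multi2 : multi (Va2) Multi3 : multi (Val3)\n"
            + "Dots....20/12/2013 (EOY)";
        parse(text).forEach((key, value) -> System.out.println(key + "-->" + value));
    }
}
```

A LinkedHashMap keeps the pairs in the order they appear in the text; note that a repeated key would overwrite the earlier value.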

You can use this pattern:
public static void main(String[] args) {
    String s = "Single line : Some text\n"
             + "Multi1: multi (Va1) Multi2 : multi (Va2) "
             + "Multi3 : multi (Val3)\n"
             + "Dots....20/12/2013 (EOY)";
    String wd = "[^\\s.:]+(?:[^\\S\\n]+[^\\s.:]+)*";
    Pattern p = Pattern.compile("(?<key>" + wd + ")"
                              + "\\s*(?::|\\.+)\\s*"
                              + "(?<value>" + wd + "(?:\\s*\\([^)]+\\))?)"
                              + "(?!\\s*:)(?=\\s|$)");
    Matcher m = p.matcher(s);
    while (m.find()) {
        System.out.println(m.group("key")+"->"+m.group("value"));
    }
}

I don't recall the exact syntax, but I think it's something like this:
while (matcher.find()) {
    String match = matcher.group();
}
The goal here is that you need to iterate over the current line and tell it "while you are still finding stuff, return to me the string on this line that matched." Since you have multiple matches on the same line, it should keep pulling out findings for you. Here is the JavaDoc for Matcher as a reference.
This is sadly another reason why Java is really not well-suited for this sort of thing, and before anyone downmods me understand I say that as a criticism of the Java APIs here, not the language.

Related

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic to help future readers with the same issue. I probably haven't found a solution yet because I can't phrase the problem properly. I'll edit the title, or close the question, once someone has helped me figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I'm interested in:
A version name, which is 3.1.build and something else later
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the gap in between is a run of \t characters followed by spaces).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that the second part contains only digits and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and spaces, there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (only the relevant part is shown here; please visit the fiddle link for a complete, runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
    Matcher mVersion = pVersion.matcher(str);
    Matcher mBuildId = pBuildId.matcher(str);
    while(mVersion.find()) {
        System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
    }
    while (mBuildId.find()) {
        System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
    }
}
The issue I'm facing
The above code works, pretty much. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have an idea how to exclude the first empty match on the build-id for samples 3 and 4, while keeping all the other matches?
Or is there a better way to write the regexes themselves (I'm open to improvements, I'm no regex expert)?
Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them would be to trim() your string to get rid of potential trailing whitespace and then split on the remaining whitespace (which should now appear only in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
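Applied to the four sample inputs from the question, the trim-and-split idea could look like the sketch below. The class name SplitOnWhitespace and the helper parse are hypothetical, and the length check is an added assumption about how malformed input should be handled.

```java
import java.util.Arrays;
import java.util.List;

public class SplitOnWhitespace {
    // Hypothetical helper: returns {version, buildId} for one input string.
    static String[] parse(String data) {
        // trim() drops leading/trailing whitespace; \s+ also covers \t and \n.
        String[] parts = data.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected two parts but got " + parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> possibilities = Arrays.asList(
            "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345",
            "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345",
            "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ",
            "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ");
        for (String str : possibilities) {
            String[] parts = parse(str);
            System.out.println("Version found: \"" + parts[0] + "\"");
            System.out.println("Build-id found: \"" + parts[1] + "\"");
        }
    }
}
```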
(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It will capture 2 groups. One will capture the first section (v3.1.build.dev.12345.team), the second gets the last section (12345-12345-cici-12345)
It breaks down like this: (^v\w.+) ensures that the string starts with a v followed by a word character, then .+ grabs as many characters as it can on that line. Note that . also matches spaces and tabs, so the regex engine backtracks until the rest of the pattern fits, and group 1 can end up with trailing whitespace; use \S+ instead of .+ (or trim the group) if that matters. \s+ matches one or more whitespace characters (spaces, tabs, newlines). (\d+-\d+-\w+-\d+) reads the build id, ensuring that it conforms to your specified formatting. Note that this will still read in the dashes, making it easier for you to split the string afterwards to get the information you need. If you want, you could even make the four parts their own capture groups, making it even easier to get your info.
Then it ends with \s* just to make sure it doesn't get messed up by trailing white space. It uses * instead of + because we don't want it to break if there's no trailing white space.
I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
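One detail worth checking with this pattern is what group 1 actually contains, because .+ also matches tabs and spaces. The sketch below compares the pattern as posted with a variant that uses \S+ for the version; the class name VersionGroups, the constant names and the helper firstGroup are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionGroups {
    // Pattern as posted in the answer above.
    static final Pattern ORIGINAL =
        Pattern.compile("(^v\\w.+)\\s+(\\d+-\\d+-\\w+-\\d+)\\s*");
    // Assumed variant: \S+ cannot run into the whitespace after the version.
    static final Pattern TIGHTER =
        Pattern.compile("^(v\\S+)\\s+(\\d+-\\d+-\\w+-\\d+)\\s*");

    // Hypothetical helper: returns group 1 of the first match, or null.
    static String firstGroup(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
        // With .+ the five tabs stay in group 1; only the last space goes to \s+.
        System.out.println("[" + firstGroup(ORIGINAL, str1) + "]");
        // With \S+ group 1 is just the version string.
        System.out.println("[" + firstGroup(TIGHTER, str1) + "]");
    }
}
```

Calling trim() on group 1 of the original pattern would give the same result as the tighter variant for these inputs.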
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Other {
    static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
    static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
    static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
    static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
    static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
    static Pattern pattern = Pattern.compile(patternStr);

    public static void main(String[] args) {
        List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
        for (String str : possibilities) {
            Matcher matcher = pattern.matcher(str);
            if (matcher.find()) {
                System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
                System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
                System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
            } else {
                System.out.println("Pattern NOT found");
            }
            System.out.println();
        }
    }
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] basically means match everything except for [\\s]. .+ worked well in your case, but all it is really saying is match anything that isn't empty - even a whitespace. This is not necessarily bad, but would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + is shorthand for {1,} (one or more occurrences), * means zero or more occurrences, and ? means zero or one.
The parentheses denote groups. The entire match is group 0. When you add parentheses, the order of the opening parentheses from left to right gives group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can have both whitespace and non-whitespace, as long as it doesn't begin with whitespace.
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
Without that, your regex is saying: match a single whitespace character followed by one or more characters other than a line break. In samples 3 and 4 (assuming there is more than one trailing space after the version, as the question suggests), the first whitespace is followed by more spaces and then the \n, so the first match consists of those spaces only, which prints as "" after replaceAll.
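The difference between the two build-id patterns can be checked by listing the raw matches. This is a sketch under one assumption: the input has two trailing spaces before the \n (the question only says "some trailing spaces"). The class name BuildIdMatches and the helper matches are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BuildIdMatches {
    // Hypothetical helper: returns every match of regex in input, untrimmed.
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed input: two spaces after the version, then a newline.
        String str3 = "v3.1.build.dev.12345.team  \n12345-12345-cici-12345  ";
        // One whitespace, then .+ : the two trailing spaces form a match of their own.
        System.out.println(matches("[\\s].+", str3).size());
        // \s+ swallows the spaces and the newline, so only one match remains.
        System.out.println(matches("[\\s]+.+", str3).size());
    }
}
```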
TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture groups capture the version and the build id respectively. Here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
    System.out.println("Analyzing \"" + str + "\"");
    Matcher matcher = pattern.matcher(str);
    while(matcher.find()) {
        System.out.println("Version found: \"" + matcher.group(1) + "\"");
        System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
    }
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how the Matcher class interprets the . character: it DOES NOT match newlines, so it stops matching just before the \n. To change that, you need to add the flag Pattern.DOTALL using Pattern.compile(String regex, int flags).
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

java regex- get specific index where not have a specific word before

I'm trying to add double quotes to an XML string, only in specific places.
Here is an example of the XML content:
<opr:sec name=display>
<opr:fld name=fieldName>Value1</opr:fld>
<opr:fld name=someName>value2</opr:fld>
I need to add double quotes like this: name="fieldName", and the field names are different on each line.
The opening double quote is simple, using the name= that comes before it.
But for the closing double quote, I thought of using the > sign, but I need to avoid the one after fld at the end of the line.
How do I match a character that does not have a specific text before it?
Here is a simpler way to do what you want.
Use this regex :
name=([^>]*)>
And replace it by :
name="$1">
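In Java, that regex and replacement can be passed to String.replaceAll. The sketch below assumes, as in the sample, that name is the last attribute before the closing >; the class name QuoteNameAttribute and the helper quoteName are made up for illustration.

```java
public class QuoteNameAttribute {
    // Hypothetical helper: wraps the value after name= in double quotes.
    static String quoteName(String line) {
        // [^>]* stops at the first '>', so the closing tag's '>' is never touched.
        return line.replaceAll("name=([^>]*)>", "name=\"$1\">");
    }

    public static void main(String[] args) {
        String[] lines = {
            "<opr:sec name=display>",
            "<opr:fld name=fieldName>Value1</opr:fld>",
            "<opr:fld name=someName>value2</opr:fld>"
        };
        for (String line : lines) {
            System.out.println(quoteName(line));
        }
    }
}
```

If another attribute could follow name on the same tag, [^>]* would swallow it too, so the character class would need to exclude whitespace as well.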
You can use capturing groups: split your line into 3 groups and reconstruct it from the pieces:
String line = "<opr:fld name=fieldName>Value1</opr:fld>";
String regex = "(.*name=)(.*)(>.*>)";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(line);
// group() throws IllegalStateException unless matches() succeeded, so check it first
if (matcher.matches()) {
    String result = matcher.group(1) + "\"" + matcher.group(2) + "\"" + matcher.group(3);
    System.out.println(result);
}

Why doesn't the regex match

I have written the following code:
public static void main(String[] args) {
    // String to be scanned to find the pattern.
    String line = "'7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'\n"
                + "'7859','1194','FIRM','21'";
    String pattern = "^'*','*','*','*'$";
    // Create a Pattern object
    Pattern r = Pattern.compile(pattern);
    // Now create matcher object.
    Matcher m = r.matcher(line);
    if (m.find()) {
        System.out.println("Found value: " + m.group(0));
        System.out.println("Found value: " + m.group(1));
    } else {
        System.out.println("NO MATCH");
    }
}
It always returns NO MATCH.
Expected result: 2 rows.
What am I doing wrong?
There are several problems in your code:
A single "star" (*) in a regex matches the character it follows 0-N times. In your code, '* means "match a single quote 0-N times", so '*' means "0-N single quotes followed by another single quote".
Also, the "star" quantifier is "greedy" by default, meaning it will eat as many matching chars as possible; with .* that can include the closing quote of your groups. In your case, you may want to set it to "reluctant" mode (by appending a ? to it: *?), so that it matches only the text inside the single quotes.
The lines must be matched one by one, so the initial multi-line string must be split on the line-separator character (\n). Unless you use the multi-line match option, but I think this is not what you want here.
group(0) is the entire match; capturing groups are numbered from 1, so the groups would be numbered 1 to 4 in your case. Your pattern has no parentheses at all, so m.group(1) would throw an IndexOutOfBoundsException.
Here is your code, corrected as explained above :
public static void main(String[] args) {
    String line = "'7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'\n" +
                  "'7859','1194','FIRM','21'";
    Pattern r = Pattern.compile("'(.*?)','(.*?)','(.*?)','(.*?)'");
    String[] lines = line.split("\n");
    for (String l : lines) {
        System.out.println("Line : " + l);
        Matcher m = r.matcher(l);
        if (m.find()) {
            System.out.println("Found value: " + m.group(1));
            System.out.println("Found value: " + m.group(2));
            System.out.println("Found value: " + m.group(3));
            System.out.println("Found value: " + m.group(4));
        } else {
            System.out.println("NO MATCH");
        }
    }
}
And here is the result :
Line : '7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'
Found value: 7858
Found value: 1194
Found value: FSP,FRB,FWF,FBVS,FRRC
Found value: 15
Line : '7859','1194','FIRM','21'
Found value: 7859
Found value: 1194
Found value: FIRM
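Found value: 21
"^'*','*','*','*'$" does not match anything because '* matches as many ' characters as possible, not arbitrary text. It does not match what you want.
Also, the ^ and $ won't work here: without Pattern.MULTILINE they only anchor at the start and end of the whole input, not of each line.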
I think that this regex is what you need:
'[0-9A-Z,]*','[0-9A-Z,]*','[0-9A-Z,]*','[0-9A-Z,]*'
Here I have added the character class [0-9A-Z,] to match digits, uppercase letters and commas. I think that this will give you what you need.
You could try with this expression:
(?<=^|[\r\n]+)'([^']*)','([^']*)','([^']*)','([^']*)'(?=[\r\n]+|$)
Breakdown:
(?<=^|[\r\n]+) is a positive look-behind checking for either the start of the input or a sequence of linebreak characters. Note that older Java versions (Java 8, for example) reject an unbounded quantifier such as + inside a look-behind; [\r\n] without the + is enough there
'([^']*)' matches and captures one of your groups. You could use '(.*?)' (i.e. a reluctant quantifier) instead but the former version is safer since it won't match if your input lines contain more than 4 groups
(?=[\r\n]+|$) is a positive look-ahead checking your groups are followed by either a sequence of linebreak characters or the end of the input sequence
I also made the following assumptions about your code:
Your input contains multiple lines which you can't or don't want to split (otherwise String[] lines = input.split("[\\r\\n]+") would be better).
A matching line always consists of 4 groups which you want to access using group(1) etc.
Your groups can contain any character except a single quote. If a group is only allowed to contain certain characters (e.g. digits), it would be safer to reflect that in the expression (e.g. '[0-9]+')
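As an alternative to the look-behind and look-ahead pair, the multi-line flag mentioned earlier lets ^ and $ anchor at every line. The sketch below rests on the same assumptions (four groups per line, no single quotes inside a group); the class name QuotedRows and the helper parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuotedRows {
    // With MULTILINE, ^ and $ match at the start and end of each line.
    private static final Pattern ROW = Pattern.compile(
        "^'([^']*)','([^']*)','([^']*)','([^']*)'$", Pattern.MULTILINE);

    // Hypothetical helper: one String[4] per matching line.
    static List<String[]> parse(String input) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = ROW.matcher(input);
        while (m.find()) {
            rows.add(new String[] { m.group(1), m.group(2), m.group(3), m.group(4) });
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = "'7858','1194','FSP,FRB,FWF,FBVS,FRRC','15'\n"
                     + "'7859','1194','FIRM','21'";
        for (String[] row : parse(input)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```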

How to use multiple different patterns?

How can I check a string against multiple regex patterns rather than a single one? It works when I try one pattern, but I need several, and my attempt doesn't work.
When I run this code with a single pattern, I get whichever of them (time or price) is in the String, but when I combine them I get no output.
Thanks for your help.
Here is my code:
String line = "This order was places for QT 30.00$ ! OK? and time is 2:45";
String pattern = "\\d+[.,]\\d+.[$]"+"\\d:\\d\\d";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(line);
if (m.find( )) {
    System.out.println("Found value: " + m.group(0) );
} else {
    System.out.println("NO MATCH");
}
The "+" operator does not separate patterns - it concatenates strings.
What you can do is provide a pattern that accepts characters in between the two groups.
String pattern = "(\\d+[.,]\\d+.[$]).*(\\d:\\d\\d)";
The parentheses above are optional. If you include them, you can get the matched price and time as separate strings:
if (m.find( )) {
    System.out.println("Found value: " + m.group(1) + " with time: " + m.group(2));
}
EDIT:
Just noticed your comment that you're looking for OR, not AND.
You can do that with an expression of the form X | Y:
String pattern = "\\d+[.,]\\d+.[$]|\\d:\\d\\d";
This will match either a price or a time, whichever occurs first. You can get the match with m.group(0).
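To collect both the price and the time with the OR pattern, find() can simply be called in a loop, since each call continues after the previous match. A minimal sketch, with the class name PriceOrTime and the helper findAll made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceOrTime {
    // The OR pattern from the answer above: a price or a time.
    private static final Pattern PRICE_OR_TIME =
        Pattern.compile("\\d+[.,]\\d+.[$]|\\d:\\d\\d");

    // Hypothetical helper: returns every match, left to right.
    static List<String> findAll(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = PRICE_OR_TIME.matcher(line);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "This order was places for QT 30.00$ ! OK? and time is 2:45";
        System.out.println(findAll(line)); // prints [30.00$, 2:45]
    }
}
```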

Regular expression help in java

I am lost when it comes to building regex strings. I need a regular expression that does the following.
I have the following strings:
[~class:obj]
[~class|class2|more classes:obj]
[!class:obj]
[!class|class2|more classes:obj]
[?method:class]
[text]
A string can contain several of the above. An example string would be "[if] [!class:obj]"
I want to know what is in between the [] and have it broken into match groups. For example, the first match group would be the symbol if present (~|!|?), next what is before the : so that could be class or class|class2|etc., then what is on the right of the : up to but not including the ]. There may be no : and nothing before it, just something between the [].
So, how would I go about writing this regex? And is it possible to give the match groups names so I know what was matched?
This is for a java project.
If you're sure enough of your inputs, you can probably use something like /\[(\~|\!|\?)?(?:((?:[^:\]]*?)+):)?([^\]]+?)\]/. (to translate that into Java, you'll want to escape the backslashes and use quotation marks instead of forward slashes)
Here are some web sites that might be helpful:
http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
http://txt2re.com/index.php3?s=Test+test+june+2011+test&submit=Show+Matches
http://www.regexplanet.com/simple/
I believe that this should work:
/\[(.*?)(?:\|(.*?))*\]/
Also:
[a-z]*
Try this code
// assumes: import java.util.regex.*; import static java.util.Arrays.asList;
final Pattern
    outerP = Pattern.compile("\\[.*?\\]"),
    innerP = Pattern.compile("\\[([~!?]?)([^:]*):?(.*)\\]");
for (String s : asList(
        "[~class:obj]",
        "[if][~class:obj]",
        "[~class|class2|more classes:obj]",
        "[!class:obj]",
        "[!class|class2|more classes:obj]",
        "[?method:class]",
        "[text]"))
{
    final Matcher outerM = outerP.matcher(s);
    System.out.println("Input: " + s);
    while (outerM.find()) {
        final Matcher m = innerP.matcher(outerM.group());
        if (m.matches()) System.out.println(
            m.group(1) + ";" + m.group(2) + ";" + m.group(3));
        else System.out.println("No match");
    }
}
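The question also asks whether the groups can be named. Java 7 and later support named groups with the (?<name>...) syntax, read back with Matcher.group(String). The sketch below is one possible pattern, not the one from the answers above; the group names symbol, classes and target, the class name BracketTokens and the helper parse are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BracketTokens {
    // Optional symbol, optional "classes:" prefix, then the rest up to ']'.
    private static final Pattern TOKEN = Pattern.compile(
        "\\[(?<symbol>[~!?])?(?:(?<classes>[^:\\]]*):)?(?<target>[^\\]]+)\\]");

    // Hypothetical helper: one "symbol;classes;target" string per bracket.
    static List<String> parse(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // group(name) returns null when an optional group did not take part.
            tokens.add(orEmpty(m.group("symbol")) + ";"
                     + orEmpty(m.group("classes")) + ";"
                     + m.group("target"));
        }
        return tokens;
    }

    private static String orEmpty(String s) {
        return s == null ? "" : s;
    }

    public static void main(String[] args) {
        System.out.println(parse("[if] [!class:obj]"));
        System.out.println(parse("[~class|class2|more classes:obj]"));
    }
}
```

The classes group is left as one string such as class|class2|more classes; splitting it on \\| afterwards gives the individual class names.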
