Extracting SQL data using regular expression - java

My data is badly formed, as shown below. I need to extract the text before and after the dot using a regular expression. This is what I am using, but I am not getting the exact data.
String rightHeading=null;
String leftHeading=null;
String formulaData="ifnull(\"Content Status\".\"Week Of Quarter\",0)";
Matcher matcher = Pattern.compile("(\"?([^()]*?)\"?)\\.(\"?([##$%><{}\\w ]*)\"?)").matcher(formulaData);
while (matcher.find())
{
String Column_Data=matcher.group(0);
String[] pieces = Column_Data.split("\\.");
rightHeading=pieces[0].replace("\"", "");
leftHeading=pieces[1].replace("\"", "");
System.out.println(rightHeading+ ": "+leftHeading);
}//while
Output which I got is:
ifnullContent Status.Week Of Quarter,0)
Expected output:
Content Status.Week Of Quarter

Below is my solution for your problem, along with the output that it produces.
String formulaData="(100*(FILTER(\"Fact - Bookings\".\"$ Total Gross Bookings\" USING (\"Booking Date\".\"Year\" = VALUEOF(\"CUR_YEAR\"))) - FILTER(Fact - Bookings.$ Total Gross BookingsData USING \"Booking Date\".\"Year\" = VALUEOF(\"PREV_YEAR\") AND \"Booking Date\".Sortable Number <= VALUEOF(\"PRV_YEAR_TD\") ) ) / FILTER(Fact - Bookings.$TotalGrossBookingsUsage \" USING \"Booking Date\".\"Year\" = VALUEOF(\"PREV_YEAR\") AND \"Booking Date\".\"Sortable Number\" <= VALUEOF(\"PRV_YEAR_TD\") ) )";
String p1 = "(\"(\\w*\\s*-*)*?\"\\.\".*?\")|((?:\\()((\\w*\\s*-*)*?\\.\\$\\w+))|(\"(\\w*\\s*-*)*?\"\\.(\\w+\\s+)+)";
Pattern p = Pattern.compile(p1);
Matcher m = p.matcher(formulaData);
while(m.find())
{
System.out.println(m.group(0).replaceAll("\"|\\(|\\)", ""));
}
Outputs:
Fact - Bookings.$ Total Gross Bookings
Booking Date.Year
Fact - Bookings.$ Total Gross BookingsData
Booking Date.Year
Booking Date.Sortable Number
Fact - Bookings.$TotalGrossBookingsUsage
Booking Date.Year
Booking Date.Sortable Number
As you can see, I didn't actually use a horrifically complex regex to solve your problem. This is because your input is far too varied to use this tool effectively.
The fact that your table.field pairs sometimes had $ or " symbols inside them made the data very inconsistent. Regular expressions find it hard to deal with this level of complexity, so I think my solution (in this example) is workable.
However, in future if you have any control over your data input, please try to sanitize it and make it as consistent as possible.
EDIT
Since that didn't work out for you, I've gone and changed my code snippet to use a regular expression.

Matcher matcher = Pattern.compile("([\\w[\\$##\\-^&]\\w\\[\\]' $]+)\\.([\\w\\[\\]' $]+)").matcher(formulaData);
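// lookingAt() always starts at the beginning of the region, so call it once;
// looping on it would never advance past a successful match.
if (matcher.lookingAt()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end());
System.out.println(" Found: " + matcher.group());
}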
lookingAt() is more suitable here as per the requirements. As the documentation describes it:
lookingAt() Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
Like the matches method, this method always starts at the beginning of the region; unlike that method, it does not require that the entire region be matched.
If the match succeeds then more information can be obtained via the start, end, and group methods.
Hope this helps.
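For the input in the question, where both the table name and the column name are wrapped in double quotes, a much simpler pattern may be enough. The following is only a sketch under that assumption: the names QuotedPairExtractor and extract are made up for illustration, and unquoted pairs are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedPairExtractor {
    // Matches "table"."column" where both parts are wrapped in double quotes.
    private static final Pattern QUOTED_PAIR =
            Pattern.compile("\"([^\"]+)\"\\.\"([^\"]+)\"");

    // Returns every quoted pair as table.column, with the quotes removed.
    static List<String> extract(String formula) {
        List<String> pairs = new ArrayList<>();
        Matcher m = QUOTED_PAIR.matcher(formula);
        while (m.find()) { // find() advances through the input on each call
            pairs.add(m.group(1) + "." + m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String formulaData = "ifnull(\"Content Status\".\"Week Of Quarter\",0)";
        System.out.println(extract(formulaData)); // [Content Status.Week Of Quarter]
    }
}
```

Because the quotes act as delimiters, the groups already hold the two headings and no split on the dot is needed.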

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic and not helpful to future readers with the same issue. That is probably because I can't phrase it properly, which is also why I haven't found anything to solve my issue yet. I will edit the title, or just close the question, once someone has helped me figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I am interested in:
A version name, which is 3.1.build followed by something else
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the gap in between is some \t characters first, followed by some spaces).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that the second part contains only digits and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and spaces, there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (only the relevant part is shown here; please visit the fiddle link for a complete, runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, pretty much. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or is there some better way to write the regexes themselves (I'm open to improvements, as I'm not a regex expert)?
Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them is to trim() your string to get rid of any trailing whitespace and then split on whitespace (which should now occur only in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It will capture 2 groups. One will capture the first section (v3.1.build.dev.12345.team), the second gets the last section (12345-12345-cici-12345)
It breaks down like this: (^v\w.+) ensures that the string starts with a v followed by a word character, then .+ matches any characters except line terminators. Because .+ is greedy, it backtracks only as far as needed for the rest of the pattern to match, so when several tabs or spaces separate the two parts on one line, group 1 can end with some of that whitespace and may need a trim(). \s+ matches one or more whitespace characters (spaces, tabs, newlines). (\d+-\d+-\w+-\d+) reads in the build id, ensuring that it conforms to your specified format. Note that this still captures the dashes, making it easy to split the string afterwards to get the pieces you need. If you want, you could even make these their own capture groups, making it even easier to get your info.
Then it ends with \s* just to make sure it doesn't get messed up by trailing white space. It uses * instead of + because we don't want it to break if there's no trailing white space.
I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
public class Other {
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] basically means match everything except for [\\s]. .+ worked well in your case, but all it is really saying is match anything that isn't empty - even a whitespace. This is not necessarily bad, but would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + means one or more occurrences (the same as {1,}), * means zero or more occurrences, and ? means 0 or 1 occurrences.
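A quick way to check what each quantifier accepts is to test it against whole strings. This is a small sketch; the class name QuantifierDemo and the helper fullMatch are invented for illustration, and Pattern.matches requires the entire input to match.

```java
import java.util.regex.Pattern;

class QuantifierDemo {
    // True only when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("a+", ""));       // false: + needs at least one
        System.out.println(fullMatch("a+", "aaa"));    // true
        System.out.println(fullMatch("a{1,}", "aaa")); // true: same meaning as +
        System.out.println(fullMatch("a*", ""));       // true: * allows zero
        System.out.println(fullMatch("a?", "a"));      // true: ? allows zero or one
        System.out.println(fullMatch("a?", "aa"));     // false
        System.out.println(fullMatch("a{1,2}", "aa")); // true
        System.out.println(fullMatch("a{1,2}", "aaa"));// false: more than two
    }
}
```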
The parentheses denote groups. The entire match is group 0. When you add parentheses, their order from left to right determines group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can have both whitespace and non-whitespace, as long as it doesn't begin with whitespace.
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
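Without that, your regex is saying: match a single whitespace character followed by one or more characters other than a line terminator. When several spaces precede the \n, the first match is just that run of spaces, which stops at the \n and becomes an empty string once you strip the whitespace; the build-id only shows up in the second match.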
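The difference between the two build-id patterns can be reproduced with a short sketch. It assumes, as the question states, that several spaces precede the newline in samples 3 and 4; the class name BuildIdMatchDemo and the helper buildIds are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BuildIdMatchDemo {
    // Collects every match of the regex, with all whitespace stripped from it.
    static List<String> buildIds(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group().replaceAll("\\s", ""));
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed shape of sample 3: three spaces before the newline and at the end.
        String input = "v3.1.build.dev.12345.team   \n12345-12345-cici-12345   ";
        // One whitespace, then .+ stops at the newline: the first match is only spaces.
        System.out.println(buildIds("[\\s].+", input));  // [, 12345-12345-cici-12345]
        // \s+ also consumes the newline, so there is a single match.
        System.out.println(buildIds("[\\s]+.+", input)); // [12345-12345-cici-12345]
    }
}
```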
TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture-groups capture the version and build respectively, here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how . is interpreted by default: the . DOES NOT match line terminators such as \n, so it stops matching just before the \n. To change that you need to add the flag Pattern.DOTALL using Pattern.compile(String regex, int flags).
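The default behaviour of the dot and the effect of the flag can be seen in isolation with a tiny sketch (the class name DotAllDemo and the method dotMatchesNewline are invented for illustration):

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Checks whether "a.b" matches the three characters a, newline, b.
    static boolean dotMatchesNewline(int flags) {
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    public static void main(String[] args) {
        System.out.println(dotMatchesNewline(0));              // false: . skips \n by default
        System.out.println(dotMatchesNewline(Pattern.DOTALL)); // true: DOTALL lets . match \n
    }
}
```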
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

Regex pattern in java fails but works fine otherwise

I've implemented quite a complicated pattern to match all occurrences of a ship set number. It works perfectly fine with global, case-insensitive comparison.
I use the following code to implement the same thing in Java but it doesn't match. Should Java regex be implemented differently?
int i = 0;
while (i < elementsArray.size()) {
System.out.println("List element:"+elementsArray.get(i));
String theRegex = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
if (elementsArray.get(i).matches(theRegex)) {
System.out.println("RESULT:");
String shipsets = "";
String thePattern = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
Pattern pattern = Pattern.compile(thePattern);
Matcher matcher = pattern.matcher(elementsArray.get(i));
if (matcher.find()) {
shipsets = matcher.group(0);
}
System.out.println("text==========" + shipsets);
}
i++;
}
Here is a simplification of your code which should work, assuming that your regex works correctly in Java. From my preliminary investigations, it does seem to match many of the use cases in your link. You don't need to use String.matches() because you are already using a Matcher, which will check whether or not you have a match.
List<String> elementsArray = new ArrayList<String>();
elementsArray.add("Shipset Number 323");
elementsArray.add("meh");
elementsArray.add("SS NO. : 34");
elementsArray.add("Mary had a little lamb");
elementsArray.add("Ship Set #2, #33 to #4.");
for (int i=0; i < elementsArray.size(); ++i) {
System.out.println("List element:"+elementsArray.get(i));
String shipsets = "";
String thePattern = "(?i)(([Ss]{2}|Ship\\s*(set))\\s*(\\#|Number|No\\.)?\\s*([:=\\-\\n\\'\\s])?\\s*\\d+\\s*(\\W*\\d+\\W?\\s*(to|and)?|(to|and)\\s*\\d+)*)";
Pattern pattern = Pattern.compile(thePattern);
Matcher matcher = pattern.matcher(elementsArray.get(i));
if (matcher.find()) {
shipsets = matcher.group(0);
System.out.println("Found a match at element " + i + ": " + shipsets);
}
}
You can see in the output below, that the three ship test strings all matched, and the controls "meh" and "Mary had a little lamb" did not match.
Output:
List element:Shipset Number 323
Found a match at element 0: Shipset Number 323
List element:meh
List element:SS NO. : 34
Found a match at element 2: SS NO. : 34
List element:Mary had a little lamb
List element:Ship Set #2, #33 to #4.
Found a match at element 4: Ship Set #2, #33 to #4.
In my opinion your problems are caused by:
Usage of matches() in if(elementsArray.get(i).matches(theRegex)): matches() returns true only if the whole string matches the regex, so it will succeed in many cases from your example, but it will fail with: SS#1,SS#5,SS#6, SS1, SS2, SS3, SS4, etc. You can simulate this situation by adding ^ at the beginning and $ at the end of the regex. Check how it matches HERE. So it would be a better solution to use matcher.find() instead of String.matches(), as in Tim Biegeleisen's answer.
Usage of if(matcher.find()) instead of while(matcher.find()): in some strings you want to retrieve more than one result, so you should call matcher.find() multiple times to get all of them. An if acts only once, so you will get only the first matched fragment from the given string. To retrieve all of them, use a loop: matcher.find() returns false when it cannot find another match in the given String, which ends the loop.
Check this out. This is Tim Biegeleisen's solution with a small change (while instead of if).
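Both points can be illustrated with a short sketch. Note that the pattern below is a hypothetical, heavily simplified stand-in for the ship-set regex in the question, and the names ShipSetFinder, matchesWhole and findAll are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShipSetFinder {
    // Simplified stand-in pattern (an assumption, not the question's regex).
    private static final Pattern SHIP_SET = Pattern.compile("(?i)SS\\s*#?\\s*\\d+");

    // matches() succeeds only if the entire text is one ship-set reference.
    static boolean matchesWhole(String text) {
        return SHIP_SET.matcher(text).matches();
    }

    // find() in a loop returns every occurrence inside the text.
    static List<String> findAll(String text) {
        List<String> results = new ArrayList<>();
        Matcher matcher = SHIP_SET.matcher(text);
        while (matcher.find()) {
            results.add(matcher.group());
        }
        return results;
    }

    public static void main(String[] args) {
        String text = "SS#1,SS#5,SS#6";
        System.out.println(matchesWhole(text)); // false: the commas are not covered
        System.out.println(findAll(text));      // [SS#1, SS#5, SS#6]
    }
}
```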

How to capture all nested matches?

I was trying to answer a question recently and while attempting to solve it, I ran into a question of my own.
Given the following code
private void regexample(){
String x = "a3ab4b5";
Pattern p = Pattern.compile("(\\D+(\\d+)\\D+){2}");
Matcher m = p.matcher(x);
while(m.find()){
for(int i=0;i<=m.groupCount();i++){
System.out.println("Group " + i + " = " + m.group(i));
}
}
}
And the output
Group 0 = a3ab4b
Group 1 = b4b
Group 2 = 4
Is there any straight-forward way I'm missing to get the value 3? The pattern should look for two occurrences of (\\D+(\\d+)\\D+) back-to-back, and a3a is part of the match. I realize I can change expression to (\\D+(\\d+)\\D+) and then look for all matches, but that isn't technically the same thing. Is the only way to do a double search? ie: Use the given pattern to match the string and then search again for each count of the outer group?
I guessed that the first values were overwritten with the second, but as I'm not that great with regex, I was hoping there was something I was missing.
In Java (as in most regex engines), a repeated capturing group keeps only the text of its last iteration, so the earlier occurrences cannot be read back from the same group. You could use something like this:
Pattern.compile("(\\D+(\\d+)\\D+)(\\D+(\\d+)\\D+)");
Now, there are four groups instead of two, so you will get the values you expected.
This question deals with a similar problem.
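The "double search" mentioned in the question is another option when the repetition count is not fixed: match the whole repeated pattern first, then run a second, simpler pattern over the matched text. The sketch below assumes only the numbers are wanted; the names RepeatedGroupDemo and numbersInMatch are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    private static final Pattern OUTER = Pattern.compile("(\\D+(\\d+)\\D+){2}");
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // First pass: find the repeated structure. Second pass: pull each number out of it.
    static List<String> numbersInMatch(String input) {
        List<String> numbers = new ArrayList<>();
        Matcher outer = OUTER.matcher(input);
        if (outer.find()) {
            Matcher inner = NUMBER.matcher(outer.group());
            while (inner.find()) {
                numbers.add(inner.group());
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(numbersInMatch("a3ab4b5")); // [3, 4]
    }
}
```

The trailing 5 is not reported because it lies outside the text matched by the outer pattern (a3ab4b).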

Java - Regex Match Multiple Words

Let's say that you want to match a string with the following regex:
".when is (\w+)." - I am trying to get the event after 'when is'
I can get the event with matcher.group(index), but this doesn't work if the event is something like Veteran's Day, since it is two words. I am only able to get the first word after 'when is'.
What regex should I use to get all of the words after 'when is'?
Also, let's say I want to capture someone's birthday, like
'when is * birthday'
How do I capture all of the text between is and birthday with regex?
You could try this:
^when is (.*)$
This will find a string that starts with when is and capture everything else to the end of the line.
The regex defines one capture group; group 0 is always the entire match. You can access them like so:
String line = "when is Veteran's Day.";
Pattern pattern = Pattern.compile("^when is (.*)$");
Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
System.out.println("group 0: " + matcher.group(0));
System.out.println("group 1: " + matcher.group(1));
}
And the output should be:
group 0: when is Veteran's Day.
group 1: Veteran's Day.
If you want to allow whitespace to be matched, you should explicitly allow whitespace.
([\w\s]+)
However, roydukkey's solution will work if you want to capture everything after when is.
Don't use regular expressions when you don't need to!! Although the theory of regular expressions is beautiful in that a string can do code operations for you, compiling and running a pattern adds overhead that simple use cases don't need.
If you are trying to get the word after "when is", ending at a space, you could do something like this (fullString is the input text):
String start = "when is ";
String end = " ";
int startLocation = fullString.indexOf(start) + start.length();
String afterStart = fullString.substring(startLocation);
// indexOf returns -1 when there is no further space, so fall back to the rest of the string
int endLocation = afterStart.indexOf(end);
String word = endLocation >= 0 ? afterStart.substring(0, endLocation) : afterStart;
If you know the last word is Day, you can set end = "Day" and add the length of that string to the end index of the second substring.
You can express this as a character class and include spaces in it: when is ([\w ]+).
\w only includes word characters, which doesn't include spaces. Use [\w ]+ instead.
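Two details may be worth a sketch: an event such as Veteran's Day contains an apostrophe, which neither \w nor a space covers, and the birthday case needs a reluctant quantifier so the capture stops before " birthday". The names WhenIsParser, event and beforeBirthday are hypothetical, and the sample sentence about Aunt Mary is invented for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhenIsParser {
    // Word characters, apostrophes and spaces after "when is ".
    private static final Pattern EVENT = Pattern.compile("when is ([\\w' ]+)");
    // Reluctant .+? captures as little as possible before " birthday".
    private static final Pattern BIRTHDAY = Pattern.compile("when is (.+?) birthday");

    static Optional<String> event(String line) {
        Matcher m = EVENT.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    static Optional<String> beforeBirthday(String line) {
        Matcher m = BIRTHDAY.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(event("when is Veteran's Day."));                // Optional[Veteran's Day]
        System.out.println(beforeBirthday("when is Aunt Mary's birthday")); // Optional[Aunt Mary's]
    }
}
```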

Split/tokenize/scan a string being aware of quotation marks

Is there a default/easy way in Java to split strings while taking quotation marks or other symbols into account?
For example, given this text:
There's "a man" that live next door 'in my neighborhood', "and he gets me down..."
Obtain:
There's
a man
that
live
next
door
in my neighborhood
and he gets me down
Something like this works for your input:
String text = "There's \"a man\" that live next door "
+ "'in my neighborhood', \"and he gets me down...\"";
Scanner sc = new Scanner(text);
Pattern pattern = Pattern.compile(
"\"[^\"]*\"" +
"|'[^']*'" +
"|[A-Za-z']+"
);
String token;
while ((token = sc.findInLine(pattern)) != null) {
System.out.println("[" + token + "]");
}
The above prints (as seen on ideone.com):
[There's]
["a man"]
[that]
[live]
[next]
[door]
['in my neighborhood']
["and he gets me down..."]
It uses Scanner.findInLine, where the regex pattern is one of:
"[^"]*" # double quoted token
'[^']*' # single quoted token
[A-Za-z']+ # everything else
No doubt this doesn't work 100% always; cases where quotes can be nested etc will be tricky.
References
regular-expressions.info/Character class
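The expected output in the question has the surrounding quotes removed. One possible variation, sketched here with a plain Matcher and one capture group per alternative, returns the token text without its quotes. The names QuoteAwareTokenizer and tokenize are made up for illustration, and nested or unbalanced quotes are still not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteAwareTokenizer {
    // Group 1: double-quoted text, group 2: single-quoted text, group 3: a bare word.
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]*)\"|'([^']*)'|([A-Za-z']+)");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(m.group(1)); // contents of "..."
            } else if (m.group(2) != null) {
                tokens.add(m.group(2)); // contents of '...'
            } else {
                tokens.add(m.group(3)); // unquoted word
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "There's \"a man\" that live next door "
                + "'in my neighborhood', \"and he gets me down...\"";
        for (String token : tokenize(text)) {
            System.out.println(token);
        }
    }
}
```

Punctuation inside the quotes (the trailing dots in the last token) is kept as-is; trimming it would need an extra step.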
That's doubtful. Based on your logic, you would need to differentiate between an apostrophe and single quotes, i.e. There's versus 'in my neighborhood'.
You'd have to develop some kind of pairing logic if you wanted what you have above. I'm thinking regular expressions. Or some kind of two part parse.
