keeping data that regex expression parses - java

I have a regex pattern that matches exactly the data I need to parse. Unfortunately, the split method deletes the desired data and hands the garbage back to me. Normally I would just try another regex doing the opposite, but it's not quite as simple as it sounds. It must be in Java, as this section is part of a much bigger program/package.
Pattern p = Pattern.compile("/^\{\?|\:|\=|\||(\-configurationFile)|(isUsingRESTDescription)|(\restURL)=(\s|\w|\.|\-|\:|\/|\;|\[|\]|\'|\})\r/g");
This is the string I'm parsing (there are carriage returns after each section):
SearchResult::getBleh(): {BLEHID=BLEH blehLastmoddate=1-Jul-11 bleh=BLEH; Beh description=blehbleh BlEh=bleh1231bleh bLeH=bleh-blehbleh 1 media=http://bleh.com/13 Date=22-May-12 name=[]}
String[] items = p.split(input);
The above gives me the opposite of what I want.
You'd think someone would have had this problem. Help would be appreciated :).

Use capture groups. You can read about them in the javadoc for Pattern.
An example:
Pattern p = Pattern.compile("[^/]*/([^/]*)/.*");
Matcher m = p.matcher("foo/bar/input");
if (m.find()) {
    String captured = m.group(1); // This equals "bar"
    String matched = m.group(0); // This equals "foo/bar/input"
}
With a few exceptions (such as non-capturing groups, written (?:...), and lookarounds), anything enclosed in parentheses in a Pattern is a capture group. The Matcher numbers the capture groups in the order in which their opening parentheses appear. Group 0 is always the entire matched region.
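Applied to input like the one in the question, the same idea could look roughly like the sketch below. The pattern and the names KeyValueExtractor and extract are my own assumptions, not the asker's real format: it assumes each key is a run of word characters followed by =, and that a value runs until the next key, a closing brace, or the end of the input.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueExtractor {
    // Group 1 is the key, group 2 is the value. The value is matched reluctantly
    // and stops just before the next " key=", a closing brace, or the end of input.
    // (Assumed format; a value that itself contains '}' would be cut short.)
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=(.*?)(?=\\s+\\w+=|\\}|$)");

    static Map<String, String> extract(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        // find() keeps the matched text instead of throwing it away like split().
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "{BLEHID=BLEH bleh=BLEH; Beh name=[]}";
        System.out.println(extract(input));
    }
}
```

The point is only that find() plus groups returns what the pattern matched, which is the opposite of what split() gives you.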

Related

Extract all occurrences of pattern K and check if string matches "K*" in 1 pass

For a given input string and a given pattern K, I want to extract every occurrence of K (or some part of it (using groups)) from the string and check that the entire string matches K* (as in it consists of 0 or more K's with no other characters).
But I would like to do this in a single pass using regular expressions. More specifically, I'm currently finding the pattern using Matcher.find, but this is not strictly required.
How would I do this?
I already found a solution (and posted an answer), but would like to know if there is specific regex or Matcher functionality that addresses / can address this issue, or simply if there are better / different ways of doing it. But, even if not, I still think it's an interesting question.
Example:
Pattern: <[0-9]> (a single digit in <>)
Valid input: <1><2><3>
Invalid inputs:
<1><2>a<3>
<1><2>3
Oh look, a flying monkey!
<1><2><3
Code to do it in 2 passes with matches:
boolean products(String products)
{
    String regex = "(<[0-9]>)";
    Pattern pAll = Pattern.compile(regex + "*");
    if (!pAll.matcher(products).matches())
        return false;
    Pattern p = Pattern.compile(regex);
    Matcher matcher = p.matcher(products);
    while (matcher.find())
        System.out.println(matcher.group());
    return true;
}
1. Defining the problem
Since it is not clear what to output when the whole string does not match the pattern K*, I will redefine the problem to make the output clear in that case.
Given any pattern K:
- Check that the string has the pattern K*.
- If the string has the pattern K*, then split the string into non-overlapping tokens that match K.
- If only a prefix of the string matches the pattern K*, then pick the prefix that is chosen by K*+ [1], and split the prefix into tokens that match K.
[1] I don't know if there is any way to get the longest prefix that matches K*. Of course, you can always remove the last character one by one and test against K* until it matches, but that is obviously inefficient.
Unless specified otherwise, everything I write below follows the problem description above. Note that the 3rd bullet point of the problem is there to resolve the ambiguity about which prefix to take.
2. Repeated capturing group in .NET
The problem above can be solved if we have the solution to the problem:
Given a pattern (K)*, which is a repeated capturing group, get the captured text for all the repetitions, instead of only the last repetition.
In the case where the string has pattern K*, by matching against ^(K)*$, we can get all tokens that match pattern K.
In the case where the string only has prefix that matches K*, by matching against ^(K)*, we can get all tokens that match pattern K.
This is the case in .NET regex, since it keeps all the captured text for a repeated capturing group.
However, since we are using Java, we don't have access to such feature.
3. Solution in Java
Checking that the string has the pattern K* can always be done with Matcher.matches()/String.matches(), since the engine will do full-blown backtracking on the input string to somehow "unify" K* with the input string. The hard part is splitting the input string into tokens that match the pattern K.
If K* is equivalent to K*+
If the pattern K has the property:
For all strings [2], K* is equivalent to K*+, i.e. the way the input string is split up into tokens that match the pattern K is the same.
[2] You can define this condition for only the input strings you are operating on, but ensuring this precondition is not easy. When you define it for all strings, you only need to analyze your regex to check whether the condition holds or not.
Then a one-pass solution to the problem can be constructed. You can repeatedly use Matcher.find() on the pattern \GK, and check that the last match found ends right at the end of the string. This is similar to your current solution, except that you do the boundary check in code.
The + after the quantifier * in K*+ makes the quantifier possessive. A possessive quantifier prevents the engine from backtracking, which means each repetition is always the first possible match for the pattern K. We need this property so that the solution \GK has an equivalent meaning, since it also returns the first possible match for the pattern K.
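A minimal sketch of that one-pass \GK loop, assuming K* is equivalent to K*+ for the pattern you pass in. The class and method names are made up for illustration, and returning an empty Optional for a non-matching input is my own choice, not something the problem statement requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchoredTokenizer {
    // Returns the tokens if the whole input consists of repetitions of k,
    // otherwise an empty Optional.
    static Optional<List<String>> tokenize(String k, String input) {
        // \G anchors each match at the end of the previous one
        // (or at the start of the input for the first match).
        Matcher m = Pattern.compile("\\G(?:" + k + ")").matcher(input);
        List<String> tokens = new ArrayList<>();
        int lastEnd = 0;
        while (m.find()) {
            if (m.end() == m.start()) {
                break; // guard: a K that matches the empty string makes no progress
            }
            tokens.add(m.group());
            lastEnd = m.end();
        }
        // The boundary check is done in code: the last match must end at the end of the input.
        return lastEnd == input.length() ? Optional.of(tokens) : Optional.empty();
    }
}
```

Unlike the two-pass version, the tokens are collected before we know whether the whole string is valid, so they are only handed back once the final boundary check passes.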
If K* is NOT equivalent to K*+
Without the property above, we need 2 passes to solve the problem. The first pass calls Matcher.matches()/String.matches() on the pattern K*. On the second pass:
- If the string does not match the pattern K*, we repeatedly use Matcher.find() on the pattern \GK until no more matches can be found. This works because of how we defined which prefix to take when the input string does not match the pattern K*.
- If the string matches the pattern K*, repeatedly using Matcher.find() on the pattern \GK(?=K*$) is one solution. This results in redundant work matching the rest of the input string, though.
Note that this solution is universally applicable for any K. In other words, it also applies for the case where K* is equivalent to K*+ (but we will use the better one-pass solution for that case instead).
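The two-pass variant could be sketched like this. The names are hypothetical, and I use \z instead of $ in the lookahead so that a trailing line terminator is not silently ignored; that is my own substitution, not part of the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TwoPassTokenizer {
    // Returns the tokens of the whole input if it matches K*,
    // otherwise the tokens of the prefix that K*+ would consume.
    static List<String> tokenize(String k, String input) {
        String group = "(?:" + k + ")";
        // First pass: does the whole input match K*?
        boolean whole = Pattern.matches(group + "*", input);
        // Second pass: when the input matches, the lookahead forces each token
        // to leave a remainder that still matches K* up to the end of the input.
        String step = whole
                ? "\\G" + group + "(?=" + group + "*\\z)"
                : "\\G" + group;
        Matcher m = Pattern.compile(step).matcher(input);
        List<String> tokens = new ArrayList<>();
        while (m.find()) {
            if (m.end() == m.start()) {
                break; // guard against a K that matches the empty string
            }
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

With K = a|ab, where K* and K*+ differ, the input "aab" is split into "a" and "ab", while the plain \GK loop would stop after "a", "a".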
Here is an additional answer to the already accepted one. Below is an example code snippet that only goes through the pattern once with m.find(), which is similar to your one pass solution, but will not parse non-matching lines.
import java.util.regex.*;
class test{
    public static void main(String args[]){
        String t = "<1><2><3>";
        Pattern pat = Pattern.compile("(<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*)");
        Matcher m = pat.matcher(t);
        while (m.find()) {
            System.out.println("Matches!");
            System.out.println(m.group());
        }
    }
}
The regex explained:
(<\\d>) -- This is your K pattern as defined above
?= -- positive lookahead (checks what is ahead of K)
(<\\d>)* -- matches K 0 or more times
$ -- end of line
?<= -- positive lookbehind (checks what is behind K)
^ -- beginning of line
(<\\d>)* -- followed by 0 or more Ks
Regular expressions are beautiful things.
Edit: As pointed out to me by @nhahtdh, this is just an implemented version of the answer. In fact, the implementation above can be improved with the knowledge in the answer: (<\\d>)(?=(<\\d>)*$)(?<=^(<\\d>)*) can be changed to \\G<\\d>(?=(<\\d>)*$).
Below is a one-pass solution using Matcher.start and Matcher.end.
boolean products(String products)
{
    String regex = "<[0-9]>";
    Pattern p = Pattern.compile(regex);
    Matcher matcher = p.matcher(products);
    int lastEnd = 0;
    while (matcher.find())
    {
        if (lastEnd != matcher.start())
            return false;
        System.out.println(matcher.group());
        lastEnd = matcher.end();
    }
    if (lastEnd != products.length())
        return false;
    return true;
}
The only disadvantage is that it will print out (or process) all values prior to finding invalid data.
For example, products("<1><2>a<3>"); will print out:
<1>
<2>
prior to returning false (because up until there the string is valid).
Either having this happen or having to store all of them temporarily seems to be unavoidable.
String t = "<1><2><3>";
Pattern pat = Pattern.compile("(<\\d>)*");
Matcher m = pat.matcher(t);
if (m.matches()) {
    //String[] tt = t.split("(?<=>)"); // Look behind on '>'
    String[] tt = t.split("(?<=(<\\d>))"); // Look behind on K
}

Extract substring after a certain pattern

I have the following string:
http://xxx/Content/SiteFiles/30/32531a5d-b0b1-4a8b-9029-b48f0eb40a34/05%20%20LEISURE.mp3?&mydownloads=true
How can I extract the part after 30/? In this case, it's 32531a5d-b0b1-4a8b-9029-b48f0eb40a34. I have other strings that share the same part up to 30/, and after that each string has a different id, up to the next /, which is the part I want.
You can do it like this (note that this prints everything after 30/, not only the id):
String s = "http://xxx/Content/SiteFiles/30/32531a5d-b0b1-4a8b-9029-b48f0eb40a34/05%20%20LEISURE.mp3?&mydownloads=true";
System.out.println(s.substring(s.indexOf("30/")+3, s.length()));
The split function of the String class won't help you in this case, because it discards the delimiter, and that's not what we want here. You need a pattern that looks behind. The lookbehind syntax is:
(?<=X)Y
which identifies any Y that is preceded by an X.
So in your case you need this pattern:
(?<=30/).*
Compile the pattern, match it against your input, find the match, and fetch it:
String input = "http://xxx/Content/SiteFiles/30/32531a5d-b0b1-4a8b-9029-b48f0eb40a34/05%20%20LEISURE.mp3?&mydownloads=true";
Matcher matcher = Pattern.compile("(?<=30/).*").matcher(input);
if (matcher.find()) {
    System.out.println(matcher.group());
}
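Since .* runs to the end of the string, the pattern above also returns the file name and the query string. If only the id up to the next / is wanted, a sketch along these lines may be closer. The class and method names are hypothetical, and anchoring the lookbehind on /30/ rather than 30/ is my own assumption, made so that a segment like 130/ does not match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentAfterPrefix {
    // (?<=/30/) asserts that "/30/" comes immediately before the match;
    // [^/]+ then consumes characters only up to the next slash.
    private static final Pattern ID = Pattern.compile("(?<=/30/)[^/]+");

    // Returns the path segment after "/30/", or null if the URL has none.
    static String extractId(String url) {
        Matcher m = ID.matcher(url);
        return m.find() ? m.group() : null;
    }
}
```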
Just for this one, or do you want a generic way to do it?
String[] out = mystring.split("/");
return out[out.length - 2];
I think the / is definitely the delimiter you are searching for.
I can't see the problem you are talking about Alex
EDIT : Ok, Python got me with indexes.
Regular expression is the answer I think. However, how the expression is written depends on the data (url) format you want to process. Like this one:
Pattern pat = Pattern.compile("/Content/SiteFiles/30/([a-z0-9\\-]+)/.*");
Matcher m = pat.matcher("http://xxx/Content/SiteFiles/30/32531a5d-b0b1-4a8b-9029-b48f0eb40a34/05%20%20LEISURE.mp3?&mydownloads=true");
if (m.find()) {
System.out.println(m.group(1));
}

Split number string on java using regex

I want to use a regex in Java to split a string of digits.
I tested the regex with an online regex tester, and it is right.
But in Java it gives the wrong result.
Pattern pattern = Pattern.compile("[\\\\d]{1,4}");
String[] results = pattern.split("123456");
// I expect 2 results ["1234","56"]
// Actual results is ["123456"]
Am I missing anything?
I know this question is boring, but I want to solve this problem.
Answer
Pattern pattern = Pattern.compile("[\\d]{1,4}");
String[] results = pattern.split("123456");
// Results length is 0
System.out.println(results.length);
does not work. I have tried it. It returns nothing in the results.
Please try it before answering.
Sincerely thank the people who helped me.
Solution:
Pattern pattern = Pattern.compile("([\\d]{1,4})");
Matcher matcher = pattern.matcher("123456");
List<String> results = new ArrayList<String>();
while (matcher.find()) {
    results.add(matcher.group(1));
}
Output: 2 results, ["1234","56"]
Pattern pattern = Pattern.compile("[\\\\d]{1,4}")
Too many backslashes; try [\\d]{1,4} (you only have to escape them once, so the backslash in front of the d becomes \\). The pattern you wrote is actually the regex [\\d]{1,4} (a literal backslash or a literal d, one to four times).
When Java decided to add regular expressions to the standard library, they should have also added a regular expression literal syntax instead of shoe-horning it over Strings (with the unreadable extra escaping and no compile-time syntax checking).
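To make the two behaviours from the question concrete, here is a small sketch (the class and method names are made up). With four backslashes nothing matches, so split returns the input unchanged; with two, every digit is consumed as a delimiter, only empty pieces remain, and split drops them all. Collecting the matches with find() is what actually yields the chunks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SplitVersusFind {
    // Collects the matches themselves instead of the text between them.
    static List<String> chunks(String digits) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("\\d{1,4}").matcher(digits);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Four backslashes in source: the regex is [\\d], a class holding a
        // backslash and the letter d, so no digit is ever a delimiter.
        System.out.println(Arrays.toString(Pattern.compile("[\\\\d]{1,4}").split("123456"))); // prints [123456]
        // Two backslashes: the regex is \d, so the whole input is delimiters and
        // every remaining piece is an empty string that split() discards.
        System.out.println(Pattern.compile("\\d{1,4}").split("123456").length); // prints 0
        System.out.println(chunks("123456")); // prints [1234, 56]
    }
}
```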
You can't do it in one method call, because you can't specify a capturing group for the split, which would be needed to break up into four char chunks.
It's not "elegant", but you must first insert a character to split on, then split:
String[] results = "123456".replaceAll("....", "$0,").split(",");
Here's the output:
System.out.println(Arrays.toString(results)); // prints [1234, 56]
Note that you don't need to use Pattern etc because String has a split-by-regex method, leading to a one-line solution.

Is this Regex incorrect? No matches found

I'm trying to parse through a string formatted like this, except with more values:
Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value
The Regex
((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))
In the actual string, there are about double the number of key/value pairs, but I'm keeping it short for brevity. I have them in parentheses so I can refer to them as groups. The keys are stored as constants, and they will always be the same. The problem is that it never finds a match, which doesn't make sense (unless the regex is wrong).
Judging by your comment above, it sounds like you're creating the Pattern and Matcher objects and associating the Matcher with the target string, but you aren't actually applying the regex. That's a very common mistake. Here's the full sequence:
String regex = "Key1=(.*),Key2=(.*)"; // etc.
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(targetString);
// Now you have to apply the regex:
if (m.find())
{
    String value1 = m.group(1);
    String value2 = m.group(2);
    // etc.
}
Not only do you have to call find() or matches() (or lookingAt(), but nobody ever uses that one), you should always call it in an if or while statement--that is, you should make sure the regex actually worked before you call any methods like group() that require the Matcher to be in a "matched" state.
Also notice the absence of most of your parentheses. They weren't necessary, and leaving them out makes it easier to (1) read the regex and (2) keep track of the group numbers.
Looks like you'd do better to do:
String[] pairs = data.split(",");
Then parse the key/value pairs one at a time
Your regex is working for me...
If you are always getting an IllegalStateException, I would say that you are trying to do something like:
matcher.group(1);
without having invoked the find() method.
You need to call that method before any attempt to fetch a group (or you will be in an illegal state to call the group() method)
Give this a try:
String test = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
Pattern pattern = Pattern.compile("((Key1)=(.*)),((Key2)=(.*)),((Key3)=(.*)),((Key4)=(.*)),((Key5)=(.*)),((Key6)=(.*)),((Key7)=(.*))");
Matcher matcher = pattern.matcher(test);
if (matcher.find()) {
    System.out.println(matcher.group(1));
}
It's not wrong per se, but it requires a lot of backtracking which might cause the regular expression engine to bail. I would try a split as suggested elsewhere, but if you really need to use a regular expression, try making it non-greedy.
((Key1)=(.*?)),((Key2)=(.*?)),((Key3)=(.*?)),((Key4)=(.*?)),((Key5)=(.*?)),((Key6)=(.*?)),((Key7)=(.*?))
To understand why it requires so much backtracking, understand that for
Key1=(.*),Key2=(.*)
applied to
Key1=x,Key2=y
Java's regular expression engine matches the first (.*) to x,Key2=y and then tries stripping characters off the right until it can get a match for the rest of the regular expression: ,Key2=(.*). It effectively ends up asking,
Does "" match ,Key2=(.*), no so try
Does "y" match ,Key2=(.*), no so try
Does "=y" match ,Key2=(.*), no so try
Does "2=y" match ,Key2=(.*), no so try
Does "y2=y" match ,Key2=(.*), no so try
Does "ey2=y" match ,Key2=(.*), no so try
Does "Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes so the first .* is "x" and the second is "y".
EDIT:
In Java, the non-greedy (reluctant) quantifier changes things so that it starts off trying to match nothing and then builds from there.
Does "x,Key2=y" match ,Key2=(.*), no so try
Does ",Key2=y" match ,Key2=(.*), yes.
So when you've got 7 keys, it doesn't need to unmatch 6 of them, which involves unmatching 5, which involves unmatching 4, .... It can do its job in one forward pass over the input.
I'm not going to say that there's no regex that will work for this, but it's most likely more complicated to write (and more importantly, read, for the next person that has to deal with the code) than it's worth. The closest I'm able to get with a regex is if you append a terminal comma to the string you're matching, i.e, instead of:
"Key1=value1,Key2=value2"
you would append a comma so it's:
"Key1=value1,Key2=value2,"
Then, the regex that got me the closest is: "(?:(\\w+?)=(\\S+?),)?+"...but this doesn't quite work if the values have commas, though.
You can try to continue tweaking that regex from there, but the problem I found is that there's a conflict between the behavior of greedy and reluctant quantifiers. You'd have to specify a capturing group for the value that is greedy with respect to commas up to the last comma prior to a non-capturing group consisting of word characters followed by the equals sign (the next key)...and this last non-capturing group would have to be optional, in case you're matching the last value in the sequence, and maybe itself reluctant. Complicated.
Instead, my advice is just to split the string on "=". You can get away with this because presumably the values aren't allowed to contain the equals sign character.
Now you'll have a bunch of substrings, each of which is a run of characters that make up a value, then the last comma in the substring, followed by a key. You can easily find the last comma in each substring using String.lastIndexOf(',').
Treat the first and last substrings specially (because the first one does not have a prepended value and the last one has no appended key) and you should be in business.
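Roughly, the split-on-"=" idea could be sketched as follows. The names are made up, and it assumes keys contain no commas and values contain no equals signs, as discussed above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EqualsSplitParser {
    static Map<String, String> parse(String data) {
        // Limit -1 keeps a trailing empty value such as "Key7=".
        String[] parts = data.split("=", -1);
        Map<String, String> result = new LinkedHashMap<>();
        String key = parts[0]; // the first substring is just the first key
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i];
            if (i == parts.length - 1) {
                result.put(key, part); // the last substring is just the last value
            } else {
                // Everything before the last comma is the value (commas included),
                // everything after it is the next key.
                int comma = part.lastIndexOf(',');
                if (comma < 0) {
                    throw new IllegalArgumentException("No key found after value: " + part);
                }
                result.put(key, part.substring(0, comma));
                key = part.substring(comma + 1);
            }
        }
        return result;
    }
}
```

For example, parse("Key1=a,b,Key2=c") keeps the comma inside the first value and yields Key1 -> "a,b" and Key2 -> "c".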
If you know you always have 7, the hack-of-least resistance is
^Key1=(.+),Key2=(.+),Key3=(.+),Key4=(.+),Key5=(.+),Key6=(.+),Key7=(.+)$
Try it out at http://www.fileformat.info/tool/regex.htm
I'm pretty sure there is a better way to parse this that goes through .find() rather than .matches(), which I would recommend because it lets you move down the string one key=value pair at a time. That moves you into the whole "greedy" evaluation discussion.
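A sketch of what walking the string with find() might look like. The pattern and names are my own assumptions: keys are word characters and values contain no commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PairWalker {
    // One key=value pair per match: group 1 is the key, group 2 the value.
    // [^,]* cannot run past a comma, so no backtracking across pairs is needed.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=([^,]*)");

    static Map<String, String> parse(String data) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(data);
        while (m.find()) { // each call resumes where the previous match ended
            result.put(m.group(1), m.group(2));
        }
        return result;
    }
}
```

This also works for any number of pairs, so the regex does not have to list Key1 through Key7 explicitly.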
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. - Jamie Zawinski
The simplest solution is the most robust.
final String data = "Key1=value,Key2=value,Key3=value,Key4=value,Key5=value,Key6=value,Key7=value";
final String[] pairs = data.split(",");
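for (final String pair: pairs)
{
    // Limit of 2 keeps any further '=' characters inside the value
    // and keeps an empty value as "" instead of dropping it.
    final String[] keyValue = pair.split("=", 2);
    final String key = keyValue[0];
    final String value = keyValue[1];
}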

how to encode String into Pattern and retrieve the String

Question closed because I misunderstood the situation. To show my stupidity though, I'll not remove what I wrote.
I'd like to encode a piece of string into Pattern, and get the string back.
I tried:
String s = buff.readLine();
Pattern p = new Pattern(s);
and use the following to retrieve my string
System.out.println(p.toString());
But it didn't work; the output is just "package name@(some random things)"... I tried Pattern p = Pattern.compile(s);
but I got an error from the compiler.
Well I just tried this:
Pattern p = Pattern.compile("Hello");
System.out.println( p.toString() );
And it worked, printing out 'Hello'.
Are you importing the java.util.regex.Pattern class?
The javadoc for Pattern#toString() seems to indicate that the source of the complete regex is only returned since java 1.5. However, Pattern#pattern() does not have a since tag, so it is presumably available since the class was introduced (java 1.4). Try System.out.println(p.pattern());
You're using a regex Pattern object to store and retrieve a String. This makes no sense. A Pattern is not used for storing Strings; it is a compiled regular expression, used for searching other strings. Let me give you an example of the use of a Pattern.
We really have 2 objects when using regular expressions in Java: Pattern and Matcher.
Pattern = a regular expression.
Matcher = the result of applying the Pattern to a String, including any matches found.
Here is an example of Pattern and Matcher. We'll search for two pairs of digits separated by a colon, as in a time, e.g. 12:42:
long timeL;
Pattern pattern = Pattern.compile(".*([1234567890]{2}:[1234567890]{2}).*");
Matcher matcher = pattern.matcher("Match me! 12:42 Match me!");
if (matcher.matches()) {
    String timeStr = matcher.group(1);
    System.out.println("Just the time: "+timeStr);
    System.out.println("The entire String: "+matcher.group(0));
    String[] timeParts = timeStr.split("[:]");
    int hours = Integer.parseInt(timeParts[0]);
    int minutes = Integer.parseInt(timeParts[1]);
    timeL = (hours*60*60*1000) + (minutes*60*1000);
    System.out.println(timeL);
}
After we've applied the Pattern to the String and gotten a Matcher, we ask whether the Matcher actually has a match. You'll notice that we then request group 1, which is the part matched by the parentheses in .*([1234567890]{2}:[1234567890]{2}).*
Group 0 would be the entire match, which here is the whole String given.
So, I hope you understand why it's extremely weird to be using a Pattern to store a String.
