Java Regular Expression Not Finding My Char Class - java

Very simply, idParser as seen below is not finding the number in my passedUrl string.
Here is the LogCat output for the Log.d calls:
01-05 11:27:48.532: D/WEBVIEW_REGEX(29447): Parsing: http://mymobisite.com/cat.php?id=33
01-05 11:27:48.532: D/WEBVIEW_REGEX(29447): idParse: No Matches Found.
And here's the block causing trouble:
Log.d("WEBVIEW_REGEX", "Parsing: "+passableUrl.toString());
Matcher idParser = Pattern.compile("[0-9]{5}|[0-9]{4}|[0-9]{3}|[0-9]{2}|[0-9]{1}").matcher(passableUrl);
if(idParser.groupCount() > 0)
Log.d("WEBVIEW_REGEX", "idParse: " + idParser.group());
else Log.d("WEBVIEW_REGEX", "idParse: No Matches Found.");
Note: I'm getting a bit sloppy at this point. I've tried a bunch of different syntaxes (all verified as working at http://www.regextester.com/index2.html in all three modes), and I've even looked up the documentation (http://docs.oracle.com/javase/tutorial/essential/regex/char_classes.html). This is starting to get on my last nerve.
Using .find() instead of the group() calls just yields "false". Can someone help me understand why I can't get this regular expression to work?
Cheers!

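The problem is that groupCount() doesn't do what you think it does: it returns the number of capturing groups in the pattern (zero here, since the pattern has no parentheses), not the number of matches found. You should instead call idParser.find(), which attempts the match and returns true when it succeeds. Like this: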
if(idParser.find())
Log.d("WEBVIEW_REGEX", "idParse: " + idParser.group());
else Log.d("WEBVIEW_REGEX", "idParse: No Matches Found.");
You could also simplify the pattern a bit, using \d{1,5} instead:
Matcher idParser = Pattern.compile("\\d{1,5}").matcher(passableUrl);
Full example:
String passableUrl = "http://mymobisite.com/cat.php?id=33";
Matcher idParser = Pattern.compile("\\d{1,5}").matcher(passableUrl);
if (idParser.find())
System.out.println("idParse: " + idParser.group());
else
System.out.println("idParse: No Matches Found.");
Outputs:
idParse: 33

There are no ( ) parentheses in the pattern, hence zero groups.
Groups are numbered from left to right by their opening (. Matcher.group(1) is the first group; Matcher.group() is the entire match. You need find() to move to the first match. As others have indicated, there are simpler patterns, such as "\\d+$", which matches one or more digits at the end of the string.
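To make the difference between groupCount() and find() concrete, here is a minimal sketch. The class name GroupCountDemo, the helper extractId, and the pattern id=(\d+) are illustrative assumptions, not part of the original code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupCountDemo {
    // Returns the digits that follow "id=", or null when the URL has none.
    static String extractId(String url) {
        Matcher m = Pattern.compile("id=(\\d+)").matcher(url);
        // find() must succeed before group() may be called.
        return m.find() ? m.group(1) : null;
    }

    // Number of capturing groups declared in the pattern itself.
    static int groupsIn(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String url = "http://mymobisite.com/cat.php?id=33";
        System.out.println(groupsIn("\\d{1,5}"));   // 0: no parentheses, no groups
        System.out.println(groupsIn("id=(\\d+)"));  // 1: one pair of parentheses
        System.out.println(extractId(url));         // 33
    }
}
```

Note that groupCount() gives the same answer whether or not the input matches, which is why it cannot be used as a "did it match" test.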

Related

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic to help future readers with the same issue. I probably haven't found anything to solve it yet because I can't phrase it properly. I'll modify the title, or just close the question, once someone has helped me figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I'm interested in:
A version name, which is 3.1.build followed by something else
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the spaces in between are some \t first, and some whitespaces then).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that the second part contains only digits and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and spaces, there's just a newline).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (only the relevant part is shown here; please visit the fiddle link for a complete, runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, for the most part. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just "", and the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or some better way to write the regexs themselves (I'm open to improvements, not a big expert of regex)?
Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them is to trim() your string to get rid of any leading or trailing whitespace and then split on the remaining whitespace (which should now occur only in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
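As a runnable sketch of the trim-and-split approach over the four sample inputs (the class name SplitDemo, the helper parse, and the length check are my own additions for illustration):

```java
import java.util.Arrays;
import java.util.List;

class SplitDemo {
    // Returns {version, buildId}; assumes exactly two whitespace-separated fields.
    static String[] parse(String data) {
        // \s also covers line separators such as \n and \r.
        String[] parts = data.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected 2 fields, got " + parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> possibilities = Arrays.asList(
            "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345",
            "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345",
            "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ",
            "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ");
        for (String str : possibilities) {
            String[] fields = parse(str);
            System.out.println("Version found: \"" + fields[0] + "\"");
            System.out.println("Build-id found: \"" + fields[1] + "\"");
        }
    }
}
```

The length check is there only so that unexpected input fails loudly instead of silently producing a wrong build-id.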
(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It will capture 2 groups. One will capture the first section (v3.1.build.dev.12345.team), the second gets the last section (12345-12345-cici-12345)
It breaks down like this: (^v\w.+) ensures that the string starts with a v followed by a word character, then captures as many further characters as it can. Note that . matches any character except a line terminator, spaces and tabs included, so this group ends where it does only because the engine backtracks until the rest of the pattern fits, and it can pick up trailing whitespace (use \S+ in place of .+ to exclude it). \s+ matches one or more whitespace characters (spaces, tabs, newlines). (\d+-\d+-\w+-\d+) reads in the build id, ensuring that it conforms to your specified format. Note that this still captures the dashes, so you can split the string afterwards to get the individual parts. If you want, you could even make each part its own capture group, making it even easier to get your info.
Then it ends with \s* just to make sure it doesn't get messed up by trailing white space. It uses * instead of + because we don't want it to break if there's no trailing white space.
I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
public class Other {
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] basically means match everything except for [\\s]. .+ worked well in your case, but all it is really saying is match anything that isn't empty - even a whitespace. This is not necessarily bad, but would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + is shorthand for {1,} (one or more occurrences), * means zero or more occurrences, and ? means zero or one.
The parentheses denote groups. The entire match is group 0. When you add parentheses, their order from left to right gives group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can contain both whitespace and non-whitespace, as long as it doesn't begin with whitespace.
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
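Without that, your regex is saying: match a single whitespace character followed by one or more characters that are not line terminators. In samples 3 and 4 the first whitespace is followed by more spaces and then a \n, which . does not match, so the first match is just that run of spaces (an empty string once the whitespace is stripped); the second match starts at the \n and contains the build-id.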
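The difference between the two build-id patterns can be checked with a short sketch. It assumes a sample with several trailing spaces before the \n (the question says the real input has some); the class name BuildIdMatches and the helper allMatches are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BuildIdMatches {
    // Collects every match of regex in input, with whitespace stripped as in the question.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group().replaceAll("\\s", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed input: three spaces before the newline and after the build-id.
        String sample = "v3.1.build.dev.12345.team   \n12345-12345-cici-12345   ";
        // Original pattern: the run of spaces before \n is a match of its own.
        System.out.println(allMatches("[\\s].+", sample));   // [, 12345-12345-cici-12345]
        // With + on the whitespace class, the spaces and the \n are consumed together.
        System.out.println(allMatches("[\\s]+.+", sample));  // [12345-12345-cici-12345]
    }
}
```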
TLDR;
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture-groups capture the version and build respectively, here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how the regex engine interprets the .: by default, . DOES NOT match line terminators, so it stops matching just before the \n. To make it match them you need to pass the flag Pattern.DOTALL via Pattern.compile(String regex, int flags).
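A small sketch of that default behavior, using a two-line input chosen purely for illustration (the class and method names are assumptions):

```java
import java.util.regex.Pattern;

class DotDemo {
    // True when ".+" can cover the whole input with default flags.
    static boolean matchesByDefault(String input) {
        return Pattern.compile(".+").matcher(input).matches();
    }

    // True when ".+" can cover the whole input with DOTALL enabled.
    static boolean matchesWithDotAll(String input) {
        return Pattern.compile(".+", Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "first\nsecond";
        System.out.println(matchesByDefault(twoLines));   // false: . stops at \n
        System.out.println(matchesWithDotAll(twoLines));  // true: . also matches \n
    }
}
```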
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id

Java - matcher re-reading words

I'm trying to create a lexical analyzer for Delphi using java. Here's the sample code:
String[] keywords={"array","as","asm","begin","case","class","const","constructor","destructor","dispinterface","div","do","downto","else","end","except","exports","file","finalization","finally","for","function","goto","if","implementation","inherited","initialization","inline","interface","is","label","library","mod","nil","object","of","out","packed","procedure","program","property","raise","record","repeat","resourcestring","set","shl","shr","string","then","threadvar","to","try","type","unit","until","uses","var","while","with"};
String[] relation={"=","<>","<",">","<=",">="};
String[] logical={"and","not","or","xor"};
Matcher matcher = null;
for(int i=0;i<keywords.length;i++){
matcher=Pattern.compile(keywords[i]).matcher(line);
if(matcher.find()){
System.out.println("Keyword"+"\t\t"+matcher.group());
}
}
for(int i1=0;i1<logical.length;i1++){
matcher=Pattern.compile(logical[i1]).matcher(line);
if(matcher.find()){
System.out.println("logic_op"+"\t\t"+matcher.group());
}
}
for(int i2=0;i2<relation.length;i2++){
matcher=Pattern.compile(relation[i2]).matcher(line);
if(matcher.find()){
System.out.println("relational_op"+"\t\t"+matcher.group());
}
}
So, when I run the program, it works, but it re-reads certain words and reports them as two tokens. For example, record is a keyword, but the program scans it again and also finds the logical operator or inside rec"or"d. How can I prevent words from being re-read? Thanks!
Add \b to your regular expressions for breaks between words. So:
Pattern.compile("\\b" + keywords[i] + "\\b")
will ensure that the characters on either side of your word aren't word characters (letters, digits, or underscores).
This way "record" will only match "record", not "or".
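Here is a minimal sketch of the effect of \b on the record / or case. The helper names are hypothetical, and Pattern.quote is an extra precaution I added so that the word is always treated literally:

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Plain substring-style search, as in the original loop.
    static boolean containsSubstring(String line, String s) {
        return Pattern.compile(Pattern.quote(s)).matcher(line).find();
    }

    // Search that only accepts the word when it stands on its own.
    static boolean containsWord(String line, String word) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b").matcher(line).find();
    }

    public static void main(String[] args) {
        String line = "type point = record x: integer end;";
        System.out.println(containsSubstring(line, "or"));  // true: found inside "record"
        System.out.println(containsWord(line, "or"));       // false: no standalone "or"
        System.out.println(containsWord(line, "record"));   // true
    }
}
```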
As mentioned in answer by EvanM, you need to add a \b word boundary matcher before and after the keyword, to prevent substring matching within a word.
For better performance, you should also use the | logical regex operator to match one of many values, instead of creating multiple matchers, so you only have to scan the line once, and only have to compile one regex.
You can even combine the 3 different kinds of token you are looking for in a single regex, and use capture groups to differentiate them, so you only have to scan the line once in total.
Like this:
String regex = "\\b(array|as|asm|begin|case|class|const|constructor|destructor|dispinterface|div|do|downto|else|end|except|exports|file|finalization|finally|for|function|goto|if|implementation|inherited|initialization|inline|interface|is|label|library|mod|nil|object|of|out|packed|procedure|program|property|raise|record|repeat|resourcestring|set|shl|shr|string|then|threadvar|to|try|type|unit|until|uses|var|while|with)\\b" +
"|(=|<[>=]?|>=?)" +
"|\\b(and|not|or|xor)\\b";
for (Matcher m = Pattern.compile(regex).matcher(line); m.find(); ) {
if (m.start(1) != -1) {
System.out.println("Keyword\t\t" + m.group(1));
} else if (m.start(2) != -1) {
System.out.println("relational_op\t\t" + m.group(2));
} else {
System.out.println("logic_op\t\t" + m.group(3));
}
}
You can even optimize it further by combining keywords with common prefixes, e.g. as|asm could become asm?, i.e. as optionally followed by m. Will make the keyword list less readable, but would perform better.
In the code above, I did that for the relational operators, to show how, and also to fix a matching error in the original code, where >= in the line would show up 3 times as =, >, >= in that order, a problem similar to the sub-keyword problem described in the question.
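The relational-operator part can be tried in isolation. This sketch reuses the alternation =|<[>=]?|>=? from the combined regex; the class name RelationalOps and the sample lines are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RelationalOps {
    // One scan of the line; each operator is reported once, at its full length.
    static List<String> relationalOps(String line) {
        List<String> ops = new ArrayList<>();
        Matcher m = Pattern.compile("=|<[>=]?|>=?").matcher(line);
        while (m.find()) {
            ops.add(m.group());
        }
        return ops;
    }

    public static void main(String[] args) {
        System.out.println(relationalOps("if a >= b then"));  // [>=]
        System.out.println(relationalOps("while x <> y do")); // [<>]
        System.out.println(relationalOps("a <= b = c"));      // [<=, =]
    }
}
```

Because find() resumes after the end of the previous match, the = inside >= is never visited a second time.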

How to capture all nested matches?

I was trying to answer a question recently and while attempting to solve it, I ran into a question of my own.
Given the following code
private void regexample(){
String x = "a3ab4b5";
Pattern p = Pattern.compile("(\\D+(\\d+)\\D+){2}");
Matcher m = p.matcher(x);
while(m.find()){
for(int i=0;i<=m.groupCount();i++){
System.out.println("Group " + i + " = " + m.group(i));
}
}
}
And the output
Group 0 = a3ab4b
Group 1 = b4b
Group 2 = 4
Is there any straight-forward way I'm missing to get the value 3? The pattern should look for two occurrences of (\\D+(\\d+)\\D+) back-to-back, and a3a is part of the match. I realize I can change expression to (\\D+(\\d+)\\D+) and then look for all matches, but that isn't technically the same thing. Is the only way to do a double search? ie: Use the given pattern to match the string and then search again for each count of the outer group?
I guessed that the first values were overwritten with the second, but as I'm not that great with regex, I was hoping there was something I was missing.
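In Java (as in most regex engines), a repeated capturing group keeps only the text from its last iteration, so it is not possible to capture multiple occurrences of the same group. You could instead write the group out twice, like this: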
Pattern.compile("(\\D+(\\d+)\\D+)(\\D+(\\d+)\\D+)");
Now, there are four groups instead of two, so you will get the values you expected.
This question deals with a similar problem.
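A sketch comparing the two patterns on the input from the question (the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {
    // Quantified group: group 2 holds only the digits from the last iteration.
    static String lastIterationOnly(String s) {
        Matcher m = Pattern.compile("(\\D+(\\d+)\\D+){2}").matcher(s);
        return m.find() ? m.group(2) : null;
    }

    // Group written out twice: groups 2 and 4 hold the digits of each occurrence.
    static List<String> bothNumbers(String s) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(\\D+(\\d+)\\D+)(\\D+(\\d+)\\D+)").matcher(s);
        if (m.find()) {
            out.add(m.group(2));
            out.add(m.group(4));
        }
        return out;
    }

    public static void main(String[] args) {
        String x = "a3ab4b5";
        System.out.println(lastIterationOnly(x)); // 4
        System.out.println(bothNumbers(x));       // [3, 4]
    }
}
```

Writing the group out explicitly only works for a fixed, known number of repetitions; for a variable count, the double search described in the question is still needed.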

Java Regex does not match

I know that this kind of question is asked very often, but
I can't figure out why this RegEx does not match.
I want to check whether there is an "M" at the beginning of the line.
Finally, I want the path at the end of the line.
This is why startsWith() doesn't fit my needs.
line = "M 72208 70779 koj src\com\company\testproject\TestDomainf1.java";
if (line.matches("^(M?)(.*)$")) {}
I've also tried the other way out:
Pattern p = Pattern.compile("(M?)");
Matcher m = datePatt.matcher(line);
if (m.matches()) {
System.out.println("yay!");
}
if (line.matches("(M?)(.*)")) {}
Thanks
The correct regex would be simply
line.matches("M.*")
since the matches method enforces that the whole input sequence must match. However, this is such a simple problem that I wonder if you really need a regex for it. A plain
line.startsWith("M")
or
line.length() > 0 && line.charAt(0) == 'M'
or even just
line.indexOf('M') == 0
will work for your requirement.
Performance?
If you are also interested in performance, my second and third options win in that department, whereas the first one may easily be the slowest option: it must first compile the regex, then evaluate it. indexOf has the problem that its worst case is scanning the whole string.
UPDATE
In the meantime you have completely restated your question and made it clear that the regex is what you really need. In this case the following should work:
Matcher m = Pattern.compile("M.*?(\\S+)").matcher(line);
System.out.println(m.matches()? m.group(1) : "no match");
Note, this only works if the path doesn't contain spaces. If it does, then the problem is much harder.
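As a self-contained sketch of that pattern (the class name ModifiedPath and the helper pathOf are illustrative; the backslashes in the sample line are doubled so that the Java string literal compiles):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ModifiedPath {
    // Returns the last whitespace-free token of a line starting with M, else null.
    static String pathOf(String line) {
        Matcher m = Pattern.compile("M.*?(\\S+)").matcher(line);
        // matches() requires the whole line to fit, so group 1 must run to the end.
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String line = "M 72208 70779 koj src\\com\\company\\testproject\\TestDomainf1.java";
        System.out.println(pathOf(line));             // src\com\company\testproject\TestDomainf1.java
        System.out.println(pathOf("A 1 2 koj a.txt")); // null: does not start with M
    }
}
```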
You don't need a regex for that. Just use String#startsWith(String)
if (line.startsWith("M")) {
// code here
}
OR else use String#toCharArray():
if (line.length() > 0 && line.toCharArray()[0] == 'M') {
// code here
}
EDIT: After your edited requirement to get path from input string.
You still can avoid regex and have your code like this:
String path="";
if (line.startsWith("M"))
path = line.substring(line.lastIndexOf(' ')+1);
System.out.println(path);
OUTPUT:
src\com\company\testproject\TestDomainf1.java
You can use this pattern to check whether an M character appears at the beginning of the string:
if (line.matches("M.*"))
But for something this simple, you can just use this:
if (line.length() > 0 && line.charAt(0) == 'M')
Why not do this
line.startsWith("M");
String str = "M 72208 70779 kij src/com/knapp/testproject/TestDomainf1.java";
if (str.startsWith("M")) {
// code here
}
If you need the path, you can split the string (I guess that \t is the separator) and take the last field:
String[] tabS = "M 72208 70779 kij src\\com\\knapp\\testproject\\TestDomainf1.java".split("\t");
String path = tabS[tabS.length-1];

Regular expression help in java

I am lost when it comes to building regex strings. I need a regular expression that does the following.
I have the following strings:
[~class:obj]
[~class|class2|more classes:obj]
[!class:obj]
[!class|class2|more classes:obj]
[?method:class]
[text]
A string can contain several of the above. An example string would be "[if] [!class:obj]"
I want to know what is between the [] and have it broken into match groups. For example, the first match group would be the symbol, if present (~|!|?); next, what is before the :, which could be class or class|class2|etc...; then what is to the right of the :, stopping before the ]. There may be no : and nothing before it, just something between the [].
So, how would I go about writing this regex? And is it possible to give the match group names so I know what it matched?
This is for a java project.
If you're sure enough of your inputs, you can probably use something like /\[(\~|\!|\?)?(?:((?:[^:\]]*?)+):)?([^\]]+?)\]/ (to translate that into Java, you'll want to escape the backslashes and use quotation marks instead of forward slashes).
Here are some web sites that might be helpful:
http://www.cis.upenn.edu/~matuszek/General/RegexTester/regex-tester.html
http://txt2re.com/index.php3?s=Test+test+june+2011+test&submit=Show+Matches
http://www.regexplanet.com/simple/
I believe that this should work:
/\[(.*?)(?:\|(.*?))*\]/
Also:
[a-z]*
Try this code
final Pattern
outerP = Pattern.compile("\\[.*?\\]"),
innerP = Pattern.compile("\\[([~!?]?)([^:]*):?(.*)\\]");
for (String s : asList(
"[~class:obj]",
"[if][~class:obj]",
"[~class|class2|more classes:obj]",
"[!class:obj]",
"[!class|class2|more classes:obj]",
"[?method:class]",
"[text]"))
{
final Matcher outerM = outerP.matcher(s);
System.out.println("Input: " + s);
while (outerM.find()) {
final Matcher m = innerP.matcher(outerM.group());
if (m.matches()) System.out.println(
m.group(1) + ";" + m.group(2) + ";" + m.group(3));
else System.out.println("No match");
}
}
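On the question of naming the match groups: Java 7 and later support named capturing groups with the (?<name>...) syntax, read back with Matcher.group(String). The following is only a sketch under that assumption; the group names symbol, left and right, the pattern itself, and the class name are my own choices rather than anything from the answers above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupsDemo {
    // symbol: optional ~, ! or ?; left: optional text before ':'; right: the rest.
    private static final Pattern TAG = Pattern.compile(
        "\\[(?<symbol>[~!?])?(?:(?<left>[^:\\]]*):)?(?<right>[^\\]]*)\\]");

    // A group that did not take part in the match is reported as null.
    private static String orEmpty(String s) {
        return s == null ? "" : s;
    }

    // Returns one "symbol;left;right" entry per bracketed item in the input.
    static List<String> describe(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TAG.matcher(input);
        while (m.find()) {
            out.add(orEmpty(m.group("symbol")) + ";"
                + orEmpty(m.group("left")) + ";"
                + orEmpty(m.group("right")));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(describe("[if] [!class:obj]"));
        System.out.println(describe("[~class|class2|more classes:obj]"));
        System.out.println(describe("[?method:class]"));
    }
}
```

Unlike the two-pattern approach above, this scans the input once with a single pattern; the class list before the : would still need a separate split on | if the individual class names are required.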
