Java - matcher re-reading words

I'm trying to create a lexical analyzer for Delphi using Java. Here's the sample code:
String[] keywords={"array","as","asm","begin","case","class","const","constructor","destructor","dispinterface","div","do","downto","else","end","except","exports","file","finalization","finally","for","function","goto","if","implementation","inherited","initialization","inline","interface","is","label","library","mod","nil","object","of","out","packed","procedure","program","property","raise","record","repeat","resourcestring","set","shl","shr","string","then","threadvar","to","try","type","unit","until","uses","var","while","with"};
String[] relation={"=","<>","<",">","<=",">="};
String[] logical={"and","not","or","xor"};
Matcher matcher = null;
for(int i=0;i<keywords.length;i++){
matcher=Pattern.compile(keywords[i]).matcher(line);
if(matcher.find()){
System.out.println("Keyword"+"\t\t"+matcher.group());
}
}
for(int i1=0;i1<logical.length;i1++){
matcher=Pattern.compile(logical[i1]).matcher(line);
if(matcher.find()){
System.out.println("logic_op"+"\t\t"+matcher.group());
}
}
for(int i2=0;i2<relation.length;i2++){
matcher=Pattern.compile(relation[i2]).matcher(line);
if(matcher.find()){
System.out.println("relational_op"+"\t\t"+matcher.group());
}
}
So, when I run the program, it works, but it re-reads certain words and reports them as two tokens. For example, record is a keyword, but the program then also finds the logical operator or inside it (rec"or"d). How can I prevent words from being re-read like this? Thanks!

Add \b to your regular expressions for breaks between words. So:
Pattern.compile("\\b" + keywords[i] + "\\b")
will ensure that the characters on either side of your word aren't word characters (letters, digits, or underscores).
This way the pattern for "or" no longer matches inside "record"; it only matches "or" as a separate word.
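As a minimal sketch of the word-boundary idea, the following wraps a keyword in \b on both sides. The class name, the helper wholeWordMatches, and the input line are made up for illustration; Pattern.quote is an extra precaution in case a token contains regex metacharacters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Returns every occurrence of `word` in `line` that stands on its own,
    // i.e. is not surrounded by other word characters.
    static List<String> wholeWordMatches(String word, String line) {
        // Pattern.quote keeps any regex metacharacters in `word` literal.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b");
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(line);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "record x or y"; // hypothetical input line
        System.out.println(wholeWordMatches("or", line));     // only the standalone "or"
        System.out.println(wholeWordMatches("record", line)); // the keyword itself
    }
}
```

With the boundaries in place, "or" is found once in that line, and not a second time inside "record".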

As mentioned in answer by EvanM, you need to add a \b word boundary matcher before and after the keyword, to prevent substring matching within a word.
For better performance, you should also use the | logical regex operator to match one of many values, instead of creating multiple matchers, so you only have to scan the line once, and only have to compile one regex.
You can even combine the 3 different kinds of token you are looking for in a single regex, and use capture groups to differentiate them, so you only have to scan the line once in total.
Like this:
String regex = "\\b(array|as|asm|begin|case|class|const|constructor|destructor|dispinterface|div|do|downto|else|end|except|exports|file|finalization|finally|for|function|goto|if|implementation|inherited|initialization|inline|interface|is|label|library|mod|nil|object|of|out|packed|procedure|program|property|raise|record|repeat|resourcestring|set|shl|shr|string|then|threadvar|to|try|type|unit|until|uses|var|while|with)\\b" +
"|(=|<[>=]?|>=?)" +
"|\\b(and|not|or|xor)\\b";
for (Matcher m = Pattern.compile(regex).matcher(line); m.find(); ) {
if (m.start(1) != -1) {
System.out.println("Keyword\t\t" + m.group(1));
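} else if (m.start(2) != -1) {
// group 2 is the relational-operator alternative (=|<[>=]?|>=?)
System.out.println("relational_op\t\t" + m.group(2));
} else {
// group 3 is the logical-operator alternative (and|not|or|xor)
System.out.println("logic_op\t\t" + m.group(3));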
}
}
You can even optimize it further by combining keywords with common prefixes, e.g. as|asm could become asm?, i.e. as optionally followed by m. This makes the keyword list less readable, but it would perform better.
In the code above, I did that for the relational operators, to show how, and also to fix a matching error in the original code, where >= in the line would show up 3 times as =, >, >= in that order, a problem similar to the sub-keyword problem asked about in the question.
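To make the >= problem concrete, here is a small sketch that compares scanning the line once per operator (as in the question) with a single combined pattern. The class and method names and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelationalOpDemo {
    // One scan per operator, as in the question: overlapping operators are all reported.
    static List<String> scanSeparately(String line) {
        String[] relation = {"=", "<>", "<", ">", "<=", ">="};
        List<String> found = new ArrayList<>();
        for (String op : relation) {
            Matcher m = Pattern.compile(Pattern.quote(op)).matcher(line);
            while (m.find()) {
                found.add(m.group());
            }
        }
        return found;
    }

    // One scan with a combined pattern: each operator is consumed exactly once.
    static List<String> scanCombined(String line) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("=|<[>=]?|>=?").matcher(line);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "a >= b"; // hypothetical input
        System.out.println(scanSeparately(line)); // [=, >, >=]
        System.out.println(scanCombined(line));   // [>=]
    }
}
```

The combined pattern works because each match consumes its characters, so the = that belongs to >= is never visited again.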

Related

How not to match the first empty string in this regex?

(Disclaimer: the title of this question is probably too generic and not helpful to future readers having the same issue. Probably, it's just because I can't phrase it properly that I've not been able to find anything yet to solve my issue... I engage in modifying the title, or just close the question once someone will have helped me to figure out what the real problem is :) ).
High level description
I receive an input string that contains two pieces of information I'm interested in:
A version name, which is 3.1.build and something else later
A build id, which is somenumbers-somenumbers-eitherwordsornumbers-somenumbers
I need to extract them separately.
More details about the inputs
I have an input which may come in 4 different ways:
Sample 1: v3.1.build.dev.12345.team 12345-12345-cici-12345 (the spaces in between are some \t first, and some whitespaces then).
Sample 2: v3.1.build.dev.12345.team 12345-12345-12345-12345 (this is very similar to the first example, except that in the second part, we only have numbers and -, no alphabetic characters).
Sample 3:
v3.1.build.dev.12345.team
12345-12345-cici-12345
(the above is very similar to sample 1, except that instead of \t and whitespaces, there's just a new line).
Sample 4:
v3.1.build.dev.12345.team
12345-12345-12345-12345
(same as above, with only digits and dashes in the second line).
Please note that in sample 3 and sample 4, there are some trailing spaces after both strings (not visible here).
To sum up, these are the 4 possible inputs:
String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
My code currently
I have written the following code to extract the information I need (only the relevant part is shown here; please visit the fiddle link for a complete and runnable example):
String versionPattern = "^.+[\\s]";
String buildIdPattern = "[\\s].+";
Pattern pVersion = Pattern.compile(versionPattern);
Pattern pBuildId = Pattern.compile(buildIdPattern);
for (String str : possibilities) {
Matcher mVersion = pVersion.matcher(str);
Matcher mBuildId = pBuildId.matcher(str);
while(mVersion.find()) {
System.out.println("Version found: \"" + mVersion.group(0).replaceAll("\\s", "") + "\"");
}
while (mBuildId.find()) {
System.out.println("Build-id found: \"" + mBuildId.group(0).replaceAll("\\s", "") + "\"");
}
}
The issue I'm facing
The above code works, pretty much. However, in Sample 3 and Sample 4 (those where the build-id is separated from the version by a \n), I'm getting two matches: the first is just a "", the second is the one I want.
I don't feel this code is stable, and I think I'm doing something wrong with the regex pattern to match the build-id:
String buildIdPattern = "[\\s].+";
Does anyone have some ideas in order to exclude the first empty match on the build-id for sample 3 and 4, while keeping all the other matches?
Or some better way to write the regexs themselves (I'm open to improvements, not a big expert of regex)?

Based on your description it looks like your data is in form
NonWhiteSpaces whiteSpaces NonWhiteSpaces (optionalWhiteSpaces)
and you want to get only NonWhiteSpaces parts.
This can be achieved in numerous ways. One of them would be to trim() your string to get rid of potential trailing whitespace and then split on whitespace (which should now only occur in the middle of the string). Something like
String[] arr = data.trim().split("\\s+");// \s also represents line separators like \n \r
String version = arr[0];
String buildID = arr[1];
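A runnable sketch of the trim-and-split approach, applied to the four sample strings from the question. The class name, the parse helper, and the length check are additions for illustration, not part of the original answer.

```java
import java.util.Arrays;
import java.util.List;

public class TrimSplitDemo {
    // Splits the input into [version, buildId]; rejects anything that does not
    // consist of exactly two whitespace-separated parts.
    static String[] parse(String data) {
        String[] parts = data.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected two parts but got " + parts.length);
        }
        return parts;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList(
                "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345",
                "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345",
                "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ",
                "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ");
        for (String sample : samples) {
            String[] parts = parse(sample);
            System.out.println("Version: \"" + parts[0] + "\", build-id: \"" + parts[1] + "\"");
        }
    }
}
```

The explicit length check is a design choice: it turns unexpected input into an error instead of an ArrayIndexOutOfBoundsException later on.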

(^v\w.+)\s+(\d+-\d+-\w+-\d+)\s*
It will capture 2 groups. One will capture the first section (v3.1.build.dev.12345.team), the second gets the last section (12345-12345-cici-12345)
It breaks down like this: (^v\w.+) ensures that the string starts with a v followed by a word character, then .+ matches any further characters except line terminators. Note that . also matches spaces and tabs, so when several whitespace characters separate the two parts, the greedy .+ can swallow some of that whitespace into group 1; call trim() on the group, or use \S+ instead of .+, if that matters. \s+ matches one or more whitespace characters (spaces, tabs, newlines). (\d+-\d+-\w+-\d+) reads in the build id, ensuring that it conforms to your specified format. Note that this still includes the dashes, making it easy to split the string afterwards to get the pieces you need. If you want, you could even make the pieces their own capture groups, making it even easier to get your info.
Then it ends with \s* just to make sure it doesn't get messed up by trailing white space. It uses * instead of + because we don't want it to break if there's no trailing white space.

I think this would be strong for production (aside from the fact that the strings cannot begin with any white-space - which is fixable, but I wasn't sure if it's what you're going for).
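import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Other {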
static String patternStr = "^([\\S]{1,})([\\s]{1,})(.*)";
static String str1 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-cici-12345";
static String str2 = "v3.1.build.dev.12345.team\t\t\t\t\t 12345-12345-12345-12345";
static String str3 = "v3.1.build.dev.12345.team \n12345-12345-cici-12345 ";
static String str4 = "v3.1.build.dev.12345.team \n12345-12345-12345-12345 ";
static Pattern pattern = Pattern.compile(patternStr);
public static void main(String[] args) {
List<String> possibilities = Arrays.asList(str1, str2, str3, str4);
for (String str : possibilities) {
Matcher matcher = pattern.matcher(str);
if (matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1).replaceAll("\\s", "") + "\"");
System.out.println("Some whitespace found: \"" + matcher.group(2).replaceAll("\\s", "") + "\"");
System.out.println("Build-id found: \"" + matcher.group(3).replaceAll("\\s", "") + "\"");
} else {
System.out.println("Pattern NOT found");
}
System.out.println();
}
}
}
Imo, it looks very similar to your original code. In case the regex doesn't look familiar to you, I'll explain what's going on.
Capital S in [\\S] basically means match everything except for [\\s]. .+ worked well in your case, but all it is really saying is match anything that isn't empty - even a whitespace. This is not necessarily bad, but would be troublesome if you ever had to modify the regex.
{1,} simply means one or more occurrences. {1,2}, to give another example, would be 1 or 2 occurrences. FYI, + is shorthand for {1,} (one or more occurrences), * means zero or more occurrences, and ? means zero or one.
The parentheses denote groups. The entire match is group 0. When you add parentheses, the order from left to right represent group 1 .. group N. So what I did was combine your patterns using groups, separated by one or more occurrences of whitespace. (.*) is used for group 3, since that group can contain both whitespace and non-whitespace, as long as it doesn't begin with whitespace.
If you have any questions feel free to ask. For the record, your current code is fine if you just add '+' to the buildId pattern: [\\s]+.+.
Without that, your regex is saying: match a single whitespace character followed by one or more characters that are not line terminators. When more than one whitespace character precedes the newline, the first match therefore stops right before the newline and contains only whitespace, which your replaceAll then turns into the empty string.
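The effect of the extra + can be checked with a small experiment. This sketch assumes an input with two spaces before the newline (the string "v1  \n42 " is made up; the question only says that there are trailing spaces), and the class and helper names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyMatchDemo {
    // Collects every match of `regex` in `input`, with whitespace stripped
    // the same way the question's code does it.
    static List<String> strippedMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group().replaceAll("\\s", ""));
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "v1  \n42 "; // assumed: two spaces before the newline
        // Original pattern: the first match is just the two spaces, which strip to "".
        System.out.println(strippedMatches("[\\s].+", input));  // [, 42]
        // With the added +, the whitespace run (including \n) is consumed in one match.
        System.out.println(strippedMatches("[\\s]+.+", input)); // [42]
    }
}
```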

TL;DR
Use the pattern ^(v\\S+)\\s+(\\S+), where the capture groups capture the version and build id respectively. Here's the complete snippet:
String unitPattern ="^(v\\S+)\\s+(\\S+)";
Pattern pattern = Pattern.compile(unitPattern);
for (String str : possibilities) {
System.out.println("Analyzing \"" + str + "\"");
Matcher matcher = pattern.matcher(str);
while(matcher.find()) {
System.out.println("Version found: \"" + matcher.group(1) + "\"");
System.out.println("Build-id found: \"" + matcher.group(2) + "\"");
}
}
Fiddle to try it.
Nitty Gritties
Reason for the empty lines in the output
It's because of how the regex engine interprets the . metacharacter: by default, . DOES NOT match line terminators, so it stops matching just before the \n. To change that, you need to add the flag Pattern.DOTALL using Pattern.compile(String regex, int flags).
An attempt
But even with Pattern.DOTALL, you'll still not be able to match, because of the way you have defined the pattern. A better approach is to match the full build and version as a unit and then extract the necessary parts.
^(v\\S+)\\s+(\\S+)
This does the trick, where:
^(v\\S+) defines the starting of the unit and also captures version information
\\s+ matches the tabs, new line, spaces etc
(\\S+) captures the final contiguous build id
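The default behaviour of . and the effect of Pattern.DOTALL can be seen in isolation with a tiny sketch; the class name, the helper, and the pattern a.b are chosen purely for illustration.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Checks whether "a.b" matches the three-character string a, newline, b
    // under the given flags.
    static boolean dotMatchesNewline(int flags) {
        return Pattern.compile("a.b", flags).matcher("a\nb").matches();
    }

    public static void main(String[] args) {
        System.out.println(dotMatchesNewline(0));              // false: . skips line terminators
        System.out.println(dotMatchesNewline(Pattern.DOTALL)); // true: . matches \n as well
    }
}
```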

Java Regex needed

I need regex that will fail only for below patterns and pass for everything else.
RXXXXXXXXXX (X are digits)
XXX.XXX.XXX.XXX (IP address)
I have basic knowledge of regex but not sure how to achieve this one.
For the first part, I know how to write a regex that does not start with R, but I'm not sure how to allow any number of digits except 10.
^[^R][0-9]{10}$ - it will do the !R thing but not sure how to pull off the not 10 digits part.

Well, simply define a regex:
Pattern p = Pattern.compile("R[0-9]{10} ((0|1|)[0-9]{1,2}|2([0-4][0-9]|5[0-5]))(\\.((0|1|)[0-9]{1,2}|2([0-4][0-9]|5[0-5]))){3}");
Matcher m = p.matcher(theStringToMatch);
if(!m.matches()) {
//do something, the test didn't pass thus ok
}
Or a jdoodle.
EDIT:
Since you actually wanted two possible patterns to filter out, change the pattern to:
Pattern p = Pattern.compile("(R[0-9]{10})|(((0|1|)[0-9]{1,2}|2([0-4][0-9]|5[0-5]))(\\.((0|1|)[0-9]{1,2}|2([0-4][0-9]|5[0-5]))){3})");
If you want to match the entire string with find() (so that the string should start and end with the pattern), place ^ in front and $ at the end of the pattern. With matches(), as used above, the whole string already has to match.
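A compact sketch of the same idea wrapped in a helper method. The names RejectFilterDemo, OCTET, REJECTED and isAllowed are hypothetical, and it assumes that "IP address" means a dotted IPv4 address with each part in the range 0-255 (leading zeros accepted).

```java
import java.util.regex.Pattern;

public class RejectFilterDemo {
    // One octet in the range 0-255 (assumption: leading zeros such as "007" are accepted).
    private static final String OCTET = "(?:25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})";

    // Either R followed by exactly ten digits, or four dot-separated octets.
    private static final Pattern REJECTED = Pattern.compile(
            "R[0-9]{10}" + "|" + OCTET + "(?:\\." + OCTET + "){3}");

    // matches() requires the whole input to match, so no ^ or $ is needed.
    static boolean isAllowed(String input) {
        return !REJECTED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("R1234567890")); // false: R plus ten digits
        System.out.println(isAllowed("192.168.0.1")); // false: IPv4 address
        System.out.println(isAllowed("R123"));        // true: too few digits
        System.out.println(isAllowed("256.1.1.1"));   // true: 256 is not a valid octet
    }
}
```

Negating the result in Java is usually easier to read and maintain than trying to express "anything but these patterns" inside the regex itself.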

This should work:
!string.matches("R\\d{10}|(\\d{3}\\.){3}\\d{3}");
The \d means any digit, the braces say how many times the preceding element is repeated, and the \. means the period character. Parentheses indicate a grouping. In a Java string literal, each backslash has to be doubled.
Here's a good reference on java regex with examples.
http://www.vogella.com/tutorials/JavaRegularExpressions/article.html

Regex is not meant to validate every kind of input. You could, but sometimes it is not the right approach (similar to using a wrench as a hammer: it could do it but is not meant for it).
Split the string in two parts, by the space, then validate each:
String foo = "R1234567890 255.255.255.255";
String[] stringParts = foo.split(" ");
Pattern p = Pattern.compile("^[^R][0-9]{10}$");
Matcher m = p.matcher(stringParts[0]);
if (m.matches()) {
//the first part is valid
//start validating the IP
String[] ipParts = stringParts[1].split("\\.");
for (String ip : ipParts) {
int ipPartValue = Integer.parseInt(ip);
if (!(ipPartValue >= 0 && ipPartValue <= 255)) {
//error...
}
}
}

Regex pattern for split

I would like to resolve this problem.
, comma : split terms
" double quote : String value (ignore special char)
[] array
For instance:
input : a=1,b="1,2,3",c=[d=1,e="1,2,3"]
expected output:
a=1
b="1,2,3"
c=[d=1,e="1,2,3"]
But I could not get above result.
I have written the code below:
String line = "a=1,b=\"1,2,3\",c=[d=1,e=\"1,11\"]";
String[] tokens = line.split(",(?=(([^\"]*\"){2})*[^\"]*$)");
for (String t : tokens)
System.out.println("> " + t);
and my output is:
a=1
b="1,2,3"
c=[d=1
e="1,11"]
What do I need to change to get the expected output? Should I stick to a regular expression or might another solution be more flexible and easier to maintain?

This regex does the trick:
",(?=(([^\"]*\"){2})*[^\"]*$)(?=([^\\[]*?\\[[^\\]]*\\][^\\[\\]]*?)*$)"
It works by adding a look-ahead for matching pairs of square brackets after the comma - if you're inside a square-bracketed term, of course you won't have balanced brackets following.
Here's some test code:
String line = "a=1,b=\"1,2,3\",c=[d=1,e=\"1,11\"]";
String[] tokens = line.split(",(?=(([^\"]*\"){2})*[^\"]*$)(?=([^\\[]*?\\[[^\\]]*\\][^\\[\\]]*?)*$)");
for (String t : tokens)
System.out.println(t);
Output:
a=1
b="1,2,3"
c=[d=1,e="1,11"]

I know the question is nearly a year old, but... this regex is much simpler:
\[[^]]*\]|"[^"]*"|(,)
The leftmost branch of the | matches [complete brackets]
The next side of the | matches "strings like this"
The right side captures commas to Group 1, and we know they are the right commas because they weren't matched by the expressions on the left
All we need to do is split on Group 1
Splitting on Group 1 Captures
You can do it like this (see the output at the bottom of the online demo):
String subject = "a=1,b=\"1,2,3\",c=[d=1,e=\"1,11\"]";
Pattern regex = Pattern.compile("\\[[^]]*\\]|\".*?\"|(,)");
Matcher m = regex.matcher(subject);
StringBuffer b= new StringBuffer();
while (m.find()) {
if(m.group(1) != null) m.appendReplacement(b, "##SplitHere##");
else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0))); // quote so that $ and \ in the match stay literal
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("##SplitHere##");
for (String split : splits) System.out.println(split);
This is a two-step split: first, we replace the commas with something distinctive, such as ##SplitHere##; then we split the resulting string on that marker.
Pros and Cons
The main benefit of this technique is that it is extremely easy to understand and maintain. If you suddenly decide to exclude commas {inside , curlies}, you just add another OR branch to the left of the regex: \{[^{}]*\} (in Java, the outer braces have to be escaped)
When you are familiar with it, you can use it in many contexts
In this case, the main drawback is that we proceed in two steps as we replace before splitting. In my view, with modern processors that's irrelevant. Maintainable code is much more important.
Reference
This technique has many applications. It is fully explained in these two links.
How to match (or replace) a pattern except in situations s1, s2, s3...
Article about matching a pattern unless...
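The Group 1 idea can also drive a single-pass split without the intermediate ##SplitHere## marker, by cutting the input at the position of each captured comma. This is a sketch of that variant, not the answer's own code; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Group1SplitDemo {
    // Bracketed terms and quoted strings are matched (and skipped) by the first
    // two branches; only commas outside of them end up in group 1.
    private static final Pattern SEPARATOR =
            Pattern.compile("\\[[^\\]]*\\]|\"[^\"]*\"|(,)");

    static List<String> splitTopLevel(String subject) {
        List<String> parts = new ArrayList<>();
        Matcher m = SEPARATOR.matcher(subject);
        int start = 0;
        while (m.find()) {
            if (m.group(1) != null) {          // a real separator comma
                parts.add(subject.substring(start, m.start()));
                start = m.end();
            }
        }
        parts.add(subject.substring(start));   // remainder after the last comma
        return parts;
    }

    public static void main(String[] args) {
        String subject = "a=1,b=\"1,2,3\",c=[d=1,e=\"1,11\"]";
        for (String part : splitTopLevel(subject)) {
            System.out.println(part);
        }
    }
}
```

Like the original, this assumes brackets are not nested and quotes are not escaped inside strings.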

Regex to match only commas not in parentheses?

I have a string that looks something like the following:
12,44,foo,bar,(23,45,200),6
I'd like to create a regex that matches the commas, but only the commas that are not inside of parentheses (in the example above, all of the commas except for the two after 23 and 45). How would I do this (Java regular expressions, if that makes a difference)?

Assuming that there can be no nested parens (otherwise, you can't use a Java Regex for this task because recursive matching is not supported):
Pattern regex = Pattern.compile(
", # Match a comma\n" +
"(?! # only if it's not followed by...\n" +
" [^(]* # any number of characters except opening parens\n" +
" \\) # followed by a closing parens\n" +
") # End of lookahead",
Pattern.COMMENTS);
This regex uses a negative lookahead assertion to ensure that the next following parenthesis (if any) is not a closing parenthesis. Only then the comma is allowed to match.
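As a usage sketch, the same lookahead can be passed to String.split in its compact, comment-free form. The class and method names are hypothetical, and the no-nested-parentheses assumption from above still applies.

```java
import java.util.Arrays;

public class CommaSplitDemo {
    // Splits on commas that are not followed by a closing parenthesis
    // without an opening one in between, i.e. commas outside (...).
    static String[] splitOutsideParens(String s) {
        return s.split(",(?![^(]*\\))");
    }

    public static void main(String[] args) {
        String input = "12,44,foo,bar,(23,45,200),6";
        System.out.println(Arrays.toString(splitOutsideParens(input)));
        // [12, 44, foo, bar, (23,45,200), 6]
    }
}
```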

Paul, resurrecting this question because it had a simple solution that wasn't mentioned. (Found your question while doing some research for a regex bounty quest.)
Also the existing solution checks that the comma is not followed by a parenthesis, but that does not guarantee that it is embedded in parentheses.
The regex is very simple:
\(.*?\)|(,)
The left side of the alternation matches complete set of parentheses. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right commas because they were not matched by the expression on the left.
In this demo, you can see the Group 1 captures in the lower right pane.
You said you want to match the commas, but you can use the same general idea to split or replace.
To match the commas, you need to inspect Group 1. This full program's only goal in life is to do just that.
import java.util.*;
import java.io.*;
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) throws java.lang.Exception {
String subject = "12,44,foo,bar,(23,45,200),6";
Pattern regex = Pattern.compile("\\(.*?\\)|(,)");
Matcher regexMatcher = regex.matcher(subject);
List<String> group1Caps = new ArrayList<String>();
// put Group 1 captures in a list
while (regexMatcher.find()) {
if(regexMatcher.group(1) != null) {
group1Caps.add(regexMatcher.group(1));
}
} // end of building the list
// What are all the matches?
System.out.println("\n" + "*** Matches ***");
if(group1Caps.size()>0) {
for (String match : group1Caps) System.out.println(match);
}
} // end main
} // end Program
Here is a live demo
To use the same technique for splitting or replacing, see the code samples in the article in the reference.
Reference
How to match pattern except in situations s1, s2, s3
How to match a pattern unless...

I don’t understand this obsession with regular expressions, given that they are unsuited to most tasks they are used for.
String beforeParen = longString.substring(0, longString.indexOf('(')) + longString.substring(longString.indexOf(')') + 1);
int firstComma = beforeParen.indexOf(',');
while (firstComma != -1) {
/* do something. */
firstComma = beforeParen.indexOf(',', firstComma + 1);
}
(Of course this assumes that there always is exactly one opening parenthesis and one matching closing parenthesis coming somewhere after it.)

Find ASCII "arrows" in text

I'm trying to find all the occurrences of "Arrows" in text, so in
"<----=====><==->>"
the arrows are:
"<----", "=====>", "<==", "->", ">"
This works:
String[] patterns = {"<=*", "<-*", "=*>", "-*>"};
for (String p : patterns) {
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
}
but this doesn't:
String p = "<=*|<-*|=*>|-*>";
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
No idea why. It often reports "<" instead of "<====" or similar.
What is wrong?

Solution
The following program compiles to one possible solution to the question:
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class A {
public static void main( String args[] ) {
String p = "<=+|<-+|=+>|-+>|<|>";
Matcher m = Pattern.compile(p).matcher(args[0]);
while (m.find()) {
System.out.println(m.group());
}
}
}
Run #1:
$ java A "<----=====><<---<==->>==>"
<----
=====>
<
<---
<==
->
>
==>
Run #2:
$ java A "<----=====><=><---<==->>==>"
<----
=====>
<=
>
<---
<==
->
>
==>
Explanation
An asterisk (*) matches zero or more occurrences of the preceding character. A plus (+) matches one or more occurrences of the preceding character. Thus <-* matches < whereas <-+ matches <- and any extended version (such as <--------).

When you match "<=*|<-*|=*>|-*>" against the string "<---", it matches the first part of the pattern, "<=*", because * includes zero or more. Java quantifiers are greedy, but alternation is ordered: the engine takes the first alternative that matches at the current position and does not go on to check whether a later alternative would give a longer match.
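The ordered-alternation behaviour is easy to reproduce. This sketch runs the question's pattern and the + based pattern from the solution above against the question's sample string; the class name and the findAll helper are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationOrderDemo {
    // Collects all successive matches of `regex` in `input`.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "<----=====><==->>";
        // <=* succeeds with zero '=' at the first '<', so "<----" is cut down to "<".
        System.out.println(findAll("<=*|<-*|=*>|-*>", s));      // [<, =====>, <==, ->, >]
        // With +, <=+ fails at "<----" and the engine moves on to <-+.
        System.out.println(findAll("<=+|<-+|=+>|-+>|<|>", s));  // [<----, =====>, <==, ->, >]
    }
}
```

Simply reordering the alternatives of the * version would not help: whichever of <=* and <-* comes first still wins with zero repetitions.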

Your first solution will match everything that you are looking for because you send each pattern into matcher one at a time and they are then given the opportunity to work on the target string individually.
Your second attempt will not work in the same manner because you are putting in a single pattern with multiple expressions OR'ed together, and there are precedence rules for the OR'd string, where the leftmost alternative will be attempted first. If it matches, no matter how minimal the match, find() will report that match and continue on from there.
See Thangalin's response for a solution that will make the second work like the first.

for <======= you need <=+ as the regex. <=* will match zero or more ='s which means it will always match the zero case hence <. The same for the other cases you have. You should read up a bit on regexs. This book is FANTASTIC:
Mastering Regular Expressions

Your provided regex pattern String almost works for your example "<----=====><==->>": it finds "=====>", "<==", "->" and ">", but reports "<" instead of "<----":
String p = "<=*|<-*|=*>|-*>";
Matcher A = Pattern.compile(p).matcher(s);
while (A.find()) {
System.out.println(A.group());
}
It is also broken for some other examples pointed out in the answers: input "<-" yields "<", whereas "<=" yields "<=" as it should. This is because the <=* alternative is tried first and already succeeds with zero = characters, so <-* never gets a chance.
