Java Regex finding operators - java

I'm trying to use regex to get numbers and operators from a string containing an expression. It finds the numbers but it doesn't find the operators. After every match (number or operator) at the beginning of the string, it truncates the expression in order to find the next one.
String expression = "23*12+11";
Pattern intPattern;
Pattern opPattern;
Matcher intMatch;
Matcher opMatch;
intPattern = Pattern.compile("^\\d+");
intMatch = intPattern.matcher(expression);
opPattern = Pattern.compile("^[-+*/()]+");
opMatch = opPattern.matcher(expression);
while ( ! expression.isEmpty()) {
System.out.println("New expression: " + expression);
if (intMatch.find()) {
String inputInt = intMatch.group();
System.out.println(inputInt);
System.out.println("Found at index: " + intMatch.start());
expression = expression.substring(intMatch.end());
intMatch = intPattern.matcher(expression);
System.out.println("Truncated expression: " + expression);
} else if (opMatch.find()) {
String nextOp = opMatch.group();
System.out.println(nextOp);
System.out.println("Found at index: " + opMatch.start());
System.out.println("End index: " + opMatch.end());
expression = expression.substring(opMatch.end());
opMatch = opPattern.matcher(expression);
System.out.println("Truncated expression: " + expression);
} else {
System.out.println("Last item: " + expression);
break;
}
}
The output is
New expression: 23*12+11
23
Found at index: 0
Truncated expression: *12+11
New expression: *12+11
Last item: *12+11
As far as I have been able to investigate there is no need to escape the special characters *, + since they are inside a character class. What's the problem here?

First, your debugging output is confusing, because it's exactly the same in both branches. Add something to distinguish them, such as an a and b prefix:
System.out.println("a.Found at index: " + intMatch.start());
Your problem is that you're not resetting both matchers to the updated string. At the end of both branches in your if-else (or just once, after the entire if-else block), you need to do this:
intMatch = intPattern.matcher(expression);
opMatch = opPattern.matcher(expression);
One last thing: Since you're creating a new matcher over and over again via Pattern.matcher(s), you might want to consider creating the matcher only once, with a dummy-string, at the top of your code
//"": Unused string so matcher object can be reused
intMatch = Pattern.compile(...).matcher("");
and then resetting it in each loop iteration
intMatch.reset(expression);
You can implement the reusable Matchers like this:
//"": Unused to-search strings, so the matcher objects can be reused.
Matcher intMatch = Pattern.compile("^\\d+").matcher("");
Matcher opMatch = Pattern.compile("^[-+*/()]+").matcher("");
String expression = "23*12+11";
while ( ! expression.isEmpty()) {
System.out.println("New expression: " + expression);
intMatch.reset(expression);
opMatch.reset(expression);
if(intMatch.find()) {
...
The
Pattern *Pattern = ...
lines can be removed from the top, and the
*Match = *Pattern.matcher(expression)
lines can be removed from both if-else branches.
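Put together, the reusable-matcher version could look like the sketch below. The class name ExpressionTokenizer and the tokenize method are made up for illustration, and the tokens are collected into a list instead of being printed step by step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExpressionTokenizer {
    // Splits an expression into integer and operator tokens by repeatedly
    // matching at the start of the remaining input.
    static List<String> tokenize(String expression) {
        // "": unused placeholder input so the matcher objects can be reused
        Matcher intMatch = Pattern.compile("^\\d+").matcher("");
        Matcher opMatch = Pattern.compile("^[-+*/()]+").matcher("");
        List<String> tokens = new ArrayList<>();
        while (!expression.isEmpty()) {
            // Point BOTH matchers at the current (truncated) expression
            intMatch.reset(expression);
            opMatch.reset(expression);
            if (intMatch.find()) {
                tokens.add(intMatch.group());
                expression = expression.substring(intMatch.end());
            } else if (opMatch.find()) {
                tokens.add(opMatch.group());
                expression = expression.substring(opMatch.end());
            } else {
                // Neither pattern matches: keep the remainder as a last item
                tokens.add(expression);
                break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("23*12+11")); // [23, *, 12, +, 11]
    }
}
```

Because both matchers are reset at the top of every iteration, neither one can be left looking at a stale copy of the expression.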

Your main problem is that when you find an int or an operator, you reassign only intMatch or only opMatch. So after an int is found, the operator matcher still tries to match against the old version of expression. You need to place these lines in both of your positive cases:
intMatch = intPattern.matcher(expression);
opMatch = opPattern.matcher(expression);
But instead of using two Patterns and repeatedly re-creating expression, you could use a single regex that finds ints or operators and places them in different groups. Something like:
String expression = "23*12+11";
Pattern p = Pattern.compile("(\\d+)|([-+*/()]+)");
Matcher m = p.matcher(expression);
while (m.find()){
if (m.group(1)==null){//group 1 is null so match must come from group 2
System.out.println("operator found: "+m.group(2));
}else{
System.out.println("integer found: "+m.group(1));
}
}
Also if you don't need to separately handle integers and operators you can just split on places before and after operators using look-around mechanisms
String expression = "23*12+11";
for (String s : expression.split("(?<=[-+*/()])|(?=[-+*/()])"))
System.out.println(s);
Output:
23
*
12
+
11

Try this one
Note: you have missed the modulus operator %
String expression = "2/3*1%(2+11)";
Pattern pt = Pattern.compile("[-+*/()%]");
Matcher mt = pt.matcher(expression);
int lastStart = 0;
while (mt.find()) {
if (lastStart != mt.start()) {
System.out.println("number:" + expression.substring(lastStart, mt.start()));
}
lastStart = mt.start() + 1;
System.out.println("operator:" + mt.group());
}
if (lastStart != expression.length()) {
System.out.println("number:" + expression.substring(lastStart));
}
output
number:2
operator:/
number:3
operator:*
number:1
operator:%
operator:(
number:2
operator:+
number:11
operator:)

Related

Search substring in a string using regex

I'm trying to search for a set of words, contained within an ArrayList (terms_1pers), inside a string. Since the precondition is that there should be no letters before or after the search word, I thought of using a regular expression.
I just don't know what I'm doing wrong with the matches method. In the code below, if no match is found, it writes to an external file.
String url = csvRecord.get("url");
String text = csvRecord.get("review");
String var = null;
for(String term : terms_1pers)
{
if(!text.matches("[^a-z]"+term+"[^a-z]"))
{
var="true";
}
}
if(!var.equals("true"))
{
bw.write(url+";"+text+"\n");
}
In order to find regex matches, you should use the regex classes. Pattern and Matcher.
String term = "term";
ArrayList<String> a = new ArrayList<String>();
a.add("123term456"); //true
a.add("A123Term5"); //false
a.add("term456"); //true
a.add("123term"); //true
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) );
// group(0) is the whole match; the term is the first capturing group, so use group(1)
}
else System.out.println("No match for: " + term);
}
In the example above, we create an instance of Pattern (https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) to find matches in the text you are matching against.
Note that I adjusted the regex a bit. The choice in this code excludes all letters A-Z and the lowercase versions from the initial matching part. It also allows for situations where there are no characters at all before or after the match term. If you need to have something there, use + instead of *. I also used ^ and $ to anchor the start and end of the text, so the whole text must consist of only these three parts. If this doesn't fit your use case, you may need to adjust.
To demonstrate using this with a variety of different terms:
ArrayList<String> terms = new ArrayList<String>();
terms.add("term");
terms.add("the book is on the table");
terms.add("1981 was the best year ever!");
ArrayList<String> a = new ArrayList<String>();
a.add("123term456");
a.add("A123Term5");
a.add("the book is on the table456");
a.add("1##!231981 was the best year ever!9#");
for (String term: terms) {
Pattern p = Pattern.compile("^[^A-Za-z]*(" + term + ")[^A-Za-z]*$");
for(String text : a) {
Matcher m = p.matcher(text);
if (m.find()) {
System.out.println("Found: " + m.group(1) + " in " + text);
// group(0) is the whole match; the term is the first capturing group, so use group(1)
}
else System.out.println("No match for: " + term + " in " + text);
}
}
Output for this is:
Found: term in 123term456
No match for: term in A123Term5
No match for: term in the book is on the table456....
In response to the question about making String term case insensitive, here's a way to build the regex string by using java.lang.Character to generate alternatives for the upper- and lowercase form of each letter.
String term = "This iS the teRm.";
String matchText = "123This is the term.";
StringBuilder str = new StringBuilder();
str.append("^[^A-Za-z]*(");
for (int i = 0; i < term.length(); i++) {
char c = term.charAt(i);
if (Character.isLetter(c))
str.append("(" + Character.toLowerCase(c) + "|" + Character.toUpperCase(c) + ")");
else str.append(c);
}
str.append(")[^A-Za-z]*$");
System.out.println(str.toString());
Pattern p = Pattern.compile(str.toString());
Matcher m = p.matcher(matchText);
if (m.find()) System.out.println("Found!");
else System.out.println("Not Found!");
This code outputs two lines, the first line is the regex string that's being compiled in the Pattern. "^[^A-Za-z]*((t|T)(h|H)(i|I)(s|S) (i|I)(s|S) (t|T)(h|H)(e|E) (t|T)(e|E)(r|R)(m|M).)[^A-Za-z]*$" This adjusted regex allows for letters in the term to be matched regardless of case. The second output line is "Found!" because the mixed case term is found within matchText.
There are several things to note:
matches requires a full string match, so [^a-z]term[^a-z] will only match a string like :term.. You need to use .find() to find partial matches
If you pass a literal string to a regex, you need to Pattern.quote it, or if it contains special chars, it will not get matched
To check if a word has some pattern before or after or at the start/end, you should either use alternations with anchors (like (?:^|[^a-z]) or (?:$|[^a-z])) or lookarounds, (?<![a-z]) and (?![a-z]).
To match any letter just use \p{Alpha} or - if you plan to match any Unicode letter - \p{L}.
The var variable is more logical to set to Boolean type.
Fixed code:
String url = csvRecord.get("url");
String text = csvRecord.get("review");
Boolean var = false;
for(String term : terms_1pers)
{
Matcher m = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
// If the search must be case insensitive use
// Matcher m = Pattern.compile("(?i)(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})").matcher(text);
if(!m.find())
{
var = true;
}
}
if (!var) {
bw.write(url+";"+text+"\n");
}
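To try the lookaround approach in isolation, here is a minimal self-contained sketch. The class name TermSearch, the method containsTerm, and the sample strings are assumptions made for illustration; only the regex construction comes from the fixed code above.

```java
import java.util.regex.Pattern;

public class TermSearch {
    // Returns true if term occurs in text with no letter directly before or after it.
    static boolean containsTerm(String text, String term) {
        // Pattern.quote makes regex metacharacters in term match literally
        Pattern p = Pattern.compile("(?<!\\p{L})" + Pattern.quote(term) + "(?!\\p{L})");
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsTerm("123term456", "term"));   // true
        System.out.println(containsTerm("A123Term5", "term"));    // false (case differs)
        System.out.println(containsTerm("determine", "term"));    // false (letters around it)
        System.out.println(containsTerm("cost: $5.00", "$5.00")); // true (metacharacters quoted)
    }
}
```

Compiling the pattern once per term outside any per-record loop would avoid repeated compilation when many texts are checked.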
You did not consider the case where the text before and after the term may itself contain letters,
so adding .* at the front and end should solve your problem.
for(String term : terms_1pers)
{
if( text.matches(".*[^a-zA-Z]+" + term + "[^a-zA-Z]+.*") )
{
var="true";
break; //exit the loop
}
}
if(!"true".equals(var)) // null-safe: var stays null when no term matched
{
bw.write(url+";"+text+"\n");
}

Count regex matches with streams

I am trying to count the number of matches of a regex pattern with a simple Java 8 lambdas/streams based solution. For example for this pattern/matcher :
final Pattern pattern = Pattern.compile("\\d+");
final Matcher matcher = pattern.matcher("1,2,3,4");
There is the method splitAsStream which splits the text on the given pattern instead of matching the pattern. Although it's elegant and preserves immutability, it's not always correct :
// count is 4, correct
final long count = pattern.splitAsStream("1,2,3,4").count();
// count is 0, wrong
final long count = pattern.splitAsStream("1").count();
I also tried (ab)using an IntStream. The problem is I have to guess how many times I should call matcher.find() instead of until it returns false.
final long count = IntStream
.iterate(0, i -> matcher.find() ? 1 : 0)
.limit(100)
.sum();
I am familiar with the traditional solution while (matcher.find()) count++; where count is mutable. Is there a simple way to do that with Java 8 lambdas/streams ?
To use Pattern::splitAsStream properly you have to invert your regex. That means instead of \\d+ (which would split on every number) you should use \\D+. This gives you every number in your String.
final Pattern pattern = Pattern.compile("\\D+");
// count is 4
long count = pattern.splitAsStream("1,2,3,4").count();
// count is 1
count = pattern.splitAsStream("1").count();
The rather contrived language in the javadoc of Pattern.splitAsStream is probably to blame.
The stream returned by this method contains each substring of the input sequence that is terminated by another subsequence that matches this pattern or is terminated by the end of the input sequence.
If you print out all of the matches of 1,2,3,4 you may be surprised to notice that it is actually returning the commas, not the numbers.
System.out.println("[" + pattern.splitAsStream("1,2,3,4")
.collect(Collectors.joining("!")) + "]");
prints [!,!,!,]. The count is 4 and not 3 because the split also yields an empty leading string before the first number.
Obviously this also explains why "1" gives 0 because there are no strings between numbers in the string.
A quick demo:
private void test(Pattern pattern, String s) {
System.out.println(s + "-[" + pattern.splitAsStream(s)
.collect(Collectors.joining("!")) + "]");
}
public void test() {
final Pattern pattern = Pattern.compile("\\d+");
test(pattern, "1,2,3,4");
test(pattern, "a1b2c3d4e");
test(pattern, "1");
}
prints
1,2,3,4-[!,!,!,]
a1b2c3d4e-[a!b!c!d!e]
1-[]
You can extend AbstractSpliterator to solve this:
static class SpliterMatcher extends AbstractSpliterator<Integer> {
private final Matcher m;
public SpliterMatcher(Matcher m) {
super(Long.MAX_VALUE, NONNULL | IMMUTABLE);
this.m = m;
}
@Override
public boolean tryAdvance(Consumer<? super Integer> action) {
boolean found = m.find();
if (found)
action.accept(m.groupCount());
return found;
}
}
final Pattern pattern = Pattern.compile("\\d+");
Matcher matcher = pattern.matcher("1");
long count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 1
matcher = pattern.matcher("1,2,3,4");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 4
matcher = pattern.matcher("foobar");
count = StreamSupport.stream(new SpliterMatcher(matcher), false).count();
System.out.println("Count: " + count); // 0
In short, you have a stream of Strings and a String pattern: how many of those strings match the pattern?
final String myString = "1,2,3,4";
Long count = Arrays.stream(myString.split(","))
.filter(str -> str.matches("\\d+"))
.count();
where the first line could be any other way of obtaining a stream, such as List<String>.stream(), ...
Am I wrong ?
Java 9
You may use Matcher#results() to get hold of all matches:
Stream<MatchResult>    results()
Returns a stream of match results for each subsequence of the input sequence that matches the pattern. The match results occur in the same order as the matching subsequences in the input sequence.
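A minimal sketch of counting with Matcher#results(), assuming Java 9 or later. The class name MatchCount and the helper countMatches are made up for illustration.

```java
import java.util.regex.Pattern;

public class MatchCount {
    // Counts the non-overlapping matches of pattern in input (requires Java 9+).
    static long countMatches(Pattern pattern, String input) {
        return pattern.matcher(input).results().count();
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        System.out.println(countMatches(digits, "1,2,3,4")); // 4
        System.out.println(countMatches(digits, "1"));       // 1
        System.out.println(countMatches(digits, "foobar"));  // 0
    }
}
```

Each call creates a fresh Matcher, so no mutable counter or shared matcher state is needed.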
Java 8 and lower
Another simple solution based on using a reverse pattern:
String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
Here, all non-digits are removed from the start and end of the string, and then the string is split on non-digit sequences without reporting any trailing empty elements (since 0 is passed as the limit argument to split).
See this demo:
String pattern = "\\D+";
System.out.println("1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1,2,3".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);// => 3
System.out.println("hz 1".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("1 hz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length); // => 1
System.out.println("xxx 1 223 zzz".replaceAll("^" + pattern + "|" + pattern + "$", "").split(pattern, 0).length);//=>2

Parsing a string with [3:0] substring in it

I want to store two numbers from a string into two distinct variables - for example, var1 = 3 and var2 = 0 from "[3:0]". I have the following code snippet:
String myStr = "[3:0]";
if (myStr.trim().matches("\\[(\\d+)\\]")) {
// Do something.
// If it enter the here, here I want to store 3 and 0 in different variables or an array
}
Is it possible doing this with split and regular expressions?
Don't call trim(). Enhance your regex instead.
Your regex is missing the pattern for : and the second number, and you don't need to escape the ].
To capture the matched numbers, you need the Matcher:
String myStr = " [3:0] ";
Matcher m = Pattern.compile("\\s*\\[(\\d+):(\\d+)]\\s*").matcher(myStr);
if (m.matches())
System.out.println(m.group(1) + ", " + m.group(2));
Output
3, 0
You can use replaceAll and split
String myStr = "[3:0]";
if(myStr.trim().matches("\\[\\d+:\\d+\\]")) {
String[] numbers = myStr.replaceAll("[\\[\\]]","").split(":");
}
Moreover, your regex to match the String should be \\[\\d+:\\d+\\]. If you want to avoid trim, you can add \\s* at the start and end to match optional spaces. But trim is not bad.
EDIT
As suggested by Andreas in comments,
String myStr = "[3:0]";
String regExp = "\\[(\\d+):(\\d+)\\]";
Pattern pattern = Pattern.compile(regExp);
Matcher matcher = pattern.matcher(myStr.trim());
if(matcher.find()) {
int a = Integer.parseInt(matcher.group(1));
int b = Integer.parseInt(matcher.group(2));
System.out.println(a + " : " + b);
}
OUTPUT
3 : 0
Without any regular expressions you could do this:
// this will remove the brackets [ and ] and just leave "3:0"
String numberString = myStr.trim().replace("[", "").replace("]", "");
// this will split the string in everything before the : and everything after the : (so two values as an array)
String[] numbers = numberString.split(":");
// get the first value and parse it as a number "3" will become a simple 3
int firstNumber = Integer.parseInt(numbers[0]) ;
// get the second value and parse it from "0" to a plain 0
int secondNumber = Integer.parseInt(numbers[1]);
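Be careful when parsing numbers, depending on your input string and what other possibilities there might be: Integer.parseInt accepts leading zeros (so "3:02" yields 3 and 2), but it throws a NumberFormatException for empty or non-numeric parts (e.g. ":0" or "3:a").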
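A hedged sketch of the same no-regex approach with the parsing guarded. The class name RangeParser, the method parse, and the choice to return null for malformed input are assumptions made for illustration.

```java
public class RangeParser {
    // Parses text such as "[3:0]" into two ints; returns null when the input is malformed.
    static int[] parse(String text) {
        // Strip the brackets, leaving e.g. "3:0"
        String inner = text.trim().replace("[", "").replace("]", "");
        String[] parts = inner.split(":");
        if (parts.length != 2) {
            return null; // missing colon or missing number
        }
        try {
            return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
        } catch (NumberFormatException e) {
            return null; // empty or non-numeric part
        }
    }

    public static void main(String[] args) {
        int[] r = parse(" [3:0] ");
        System.out.println(r[0] + ", " + r[1]);    // 3, 0
        int[] z = parse("[3:02]");
        System.out.println(z[0] + ", " + z[1]);    // 3, 2 (leading zero is accepted)
        System.out.println(parse("[3:a]") == null); // true
    }
}
```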
If you don't need to validate the input and simply want to get the numbers from it, you could find indexOf(":") and take the substrings you are interested in, which are:
from just after [ (which is at position 0) up to :
and from just after : up to ] (which is at position length of string - 1)
Your code could look like
String text = "[3:0]";
int colonIndex = text.indexOf(':');
String first = text.substring(1, colonIndex);
String second = text.substring(colonIndex + 1, text.length() - 1);

How to Determine if a String starts with exact number of zeros?

How can I know if my string exactly starts with {n} number of leading zeros?
For example below, the conditions would return true but my real intention is to check if the string actually starts with only 2 zeros.
String str = "00063350449370";
if (str.startsWith("00")) { // true
...
}
You can do something like:
if ( str.startsWith("00") && ! str.startsWith("000") ) {
// ..
}
This will make sure that the string starts with "00", but not a longer string of zeros.
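You can try this regex, which uses a negative lookahead so that only the character right after the two zeros must not be 0:
boolean res = str.matches("00(?!0).*");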
How about?
final String zeroes = "00";
final int zeroesLength = zeroes.length();
boolean exact = str.startsWith(zeroes) && (str.length() == zeroesLength || str.charAt(zeroesLength) != '0');
Slow but:
if (str.matches("(?s)0{3}([^0].*)?")) {
This uses the (?s) DOTALL option to let . also match line breaks.
0{3} matches exactly 3 zeros; use 0{2} to require 2.
How about using a regular expression?
0{n}(?!0).*
where n is the number of leading '0's you want and (?!0) rejects one more zero right after them (later zeros are still allowed). You can utilise the Java regex API to check if the input matches the expression:
Pattern pattern = Pattern.compile("0{2}(?!0).*"); // n = 2 here
Matcher matcher = pattern.matcher(input);
if (matcher.matches()) {
// code
}
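For an arbitrary n, the check can be wrapped in a small helper. This is a sketch only: the class name LeadingZeros and the method startsWithExactlyNZeros are hypothetical, and it uses a negative lookahead so that zeros later in the string do not cause a false negative.

```java
import java.util.regex.Pattern;

public class LeadingZeros {
    // True if str starts with exactly n zeros, i.e. the character after them (if any) is not '0'.
    static boolean startsWithExactlyNZeros(String str, int n) {
        // (?s) lets . match line breaks; (?!0) rejects an additional leading zero
        Pattern p = Pattern.compile("(?s)0{" + n + "}(?!0).*");
        return p.matcher(str).matches();
    }

    public static void main(String[] args) {
        System.out.println(startsWithExactlyNZeros("00063350449370", 2)); // false
        System.out.println(startsWithExactlyNZeros("00063350449370", 3)); // true
        System.out.println(startsWithExactlyNZeros("0012305", 2));        // true
    }
}
```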
You can use a regular expression to evaluate the String value:
String str = "00063350449370";
String pattern = "[0]{2}[1-9]{1}[0-9]*"; // [0]{2}[1-9]{1} starts with 2 zeros, followed by a non-zero value, and maybe some other numbers: [0-9]*
if (Pattern.matches(pattern, str))
{
// DO SOMETHING
}
There might be a better regular expression to resolve this, but this should give you a general idea how to proceed if you choose the regular expression path.
The long way
String TestString = "0000123";
Pattern p = Pattern.compile("\\A0+(?=\\d)");
Matcher matcher = p.matcher(TestString);
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(" Group: " + matcher.group());
}
You're probably better off with a small for loop though
int leadZeroes;
for (leadZeroes=0; leadZeroes<TestString.length(); leadZeroes++)
if (TestString.charAt(leadZeroes) != '0')
break;
System.out.println("Count of Leading Zeroes: " + leadZeroes);

I need to get a substring from a java string Tokenizer

I need to get a substring from a Java StringTokenizer.
My input string is: Pizza-1*Nutella-20*Chicken-65*
StringTokenizer productsTokenizer = new StringTokenizer("Pizza-1*Nutella-20*Chicken-65*", "*");
do
{
try
{
int pos = productsTokenizer .nextToken().indexOf("-");
String product = productsTokenizer .nextToken().substring(0, pos+1);
String count= productsTokenizer .nextToken().substring(pos, pos+1);
System.out.println(product + " " + count);
}
catch(Exception e)
{
}
}
while(productsTokenizer .hasMoreTokens());
My output must be:
Pizza 1
Nutella 20
Chicken 65
I need the product value and the count value in separate variables so I can insert those values into the database.
I hope you can help me.
You could use String.split() as
String[] products = "Pizza-1*Nutella-20*Chicken-65*".split("\\*");
for (String product : products) {
String[] prodNameCount = product.split("\\-");
System.out.println(prodNameCount[0] + " " + prodNameCount[1]);
}
Output
Pizza 1
Nutella 20
Chicken 65
You invoke the nextToken() method 3 times. That will get you 3 different tokens
int pos = productsTokenizer .nextToken().indexOf("-");
String product = productsTokenizer .nextToken().substring(0, pos+1);
String count= productsTokenizer .nextToken().substring(pos, pos+1);
Instead you should do something like:
String token = productsTokenizer .nextToken();
int pos = token.indexOf("-");
String product = token.substring(...);
String count= token.substring(...);
I'll let you figure out the proper indexes for the substring() method.
Also instead of using a do/while structure it is better to just use a while loop:
while(productsTokenizer .hasMoreTokens())
{
// add your code here
}
That is, don't assume there is a token.
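One way the pieces above could fit together is sketched below. The class name ProductParser and the method parse are made up for illustration, and skipping tokens without a hyphen is an assumption about how malformed input should be handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class ProductParser {
    // Turns "Pizza-1*Nutella-20*" into lines like "Pizza 1", "Nutella 20".
    static List<String> parse(String input) {
        List<String> lines = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(input, "*");
        while (tokenizer.hasMoreTokens()) {
            // Call nextToken() once per iteration and reuse the result
            String token = tokenizer.nextToken();
            int pos = token.indexOf('-');
            if (pos < 0) {
                continue; // no hyphen: skip this token (assumed handling)
            }
            String product = token.substring(0, pos); // text before the hyphen
            String count = token.substring(pos + 1);  // text after the hyphen
            lines.add(product + " " + count);
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : parse("Pizza-1*Nutella-20*Chicken-65*")) {
            System.out.println(line);
        }
    }
}
```

The product and count strings are available as separate variables inside the loop, which is where a database insert would go.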
An alternative answer you may want to use if your input grows:
// find all strings that match START or '*' followed by the name (matched),
// a hyphen and then a positive number (not starting with 0)
Pattern p = Pattern.compile("(?:^|[*])(\\w+)-([1-9]\\d*)");
String products = "Pizza-1*Nutella-20*Chicken-65*"; // input string from the question
Matcher finder = p.matcher(products);
while (finder.find()) {
// possibly check if the new match directly follows the previous one
String product = finder.group(1);
int count = Integer.valueOf(finder.group(2));
System.out.printf("Product: %s , count %d%n", product, count);
}
Some people dislike regex, but this is a good application for it. All you need is "(\\w+)-(\\d+)\\*" as your pattern. Here's a toy example:
String template = "Pizza-1*Nutella-20*Chicken-65*";
String pattern = "(\\w+)-(\\d+)\\*";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(template);
while(m.find())
{
System.out.println(m.group(1) + " " + m.group(2));
}
To explain this a bit more, "(\\w+)-(\\d+)\\*" looks for a (\\w+), which is any run of at least 1 character from [A-Za-z0-9_], followed by a -, followed by a number \\d+, where the + means at least one digit, followed by a *, which must be escaped. The parentheses capture what's inside of them. There are two sets of capturing parentheses in this regex, so we reference them by group(1) and group(2) as seen in the while loop, which prints:
Pizza 1
Nutella 20
Chicken 65
