Java pattern matcher find multiple strings - java

I am using Pattern.compile() to find if a text string contains two other strings. But it needs to be in one regex pattern.
For example the string must have "StringOne" and "StringTwo" in it.
I could do Pattern.compile("(StringOne StringTwo|StringTwo StringOne)"), but both strings are quite long and I want to see if I can compress it.
If I do "(StringOne )?StringTwo( StringOne)?" it would match "StringTwo" and "StringOne StringTwo StringOne".

Use this regex:
^(?=.*\\bStringOne\\b)(?=.*\\bStringTwo\\b)
This uses two look-aheads anchored to the start of the input to assert that both strings appear somewhere in it, in either order.
Edit:
Added word boundaries \b to ends of strings to prevent matches of one string within another, although this was not a stated requirement of the question.
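A minimal sketch of how that pattern could be used from Java. The class and method names are made up for illustration. Because the pattern only asserts and consumes no characters, it is checked with find() rather than matches(). Note that . does not match line terminators unless Pattern.DOTALL is set, so multi-line input would need that flag.

```java
import java.util.regex.Pattern;

class ContainsBoth {
    // Both lookaheads start at position 0, so the order of the two words does not matter.
    private static final Pattern BOTH =
            Pattern.compile("^(?=.*\\bStringOne\\b)(?=.*\\bStringTwo\\b)");

    static boolean containsBoth(String text) {
        // find(), not matches(): the pattern is zero-width and never consumes the input.
        return BOTH.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsBoth("StringTwo then StringOne"));
        System.out.println(containsBoth("only StringOne here"));
    }
}
```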

There is the question of speed.
You could probably use lookaheads to accomplish this, but it's costly speed-wise. Lookaheads are really expensive on long strings.
If the strings are long, the faster approach would be to do two separate matches.
If you really need a single pattern, use your original approach: StringOne StringTwo|StringTwo StringOne
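For comparison, the "two separate matches" idea might look like the sketch below. It assumes plain substring checks are acceptable (no word boundaries), and the class and method names are invented here. Whether it is actually faster for your data is something to measure rather than assume.

```java
class TwoChecks {
    // Two independent substring searches; no regex engine involved.
    static boolean containsBoth(String text, String first, String second) {
        return text.contains(first) && text.contains(second);
    }

    public static void main(String[] args) {
        System.out.println(containsBoth("StringTwo StringOne", "StringOne", "StringTwo"));
    }
}
```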

Related

Why is my String array length 3 instead of 2?

I'm trying to understand regex. I wanted to make a String[] using split to show me how many letters are in a given string expression?
import java.util.*;
import java.io.*;
public class Main {
    public static String simpleSymbols(String str) {
        String result = "";
        String[] alpha = str.split("[\\+\\w\\+]");
        int alphaLength = alpha.length;
        // System.out.print(alphaLength);
        String[] charCount = str.split("[a-z]");
        int charCountLength = charCount.length;
        System.out.println(charCountLength);
        return result; // the method is declared to return a String, so it must return one
    }
}
My input string is "+d+=3=+s+". I split the string to count the number of letters in string. The array length should be two but I'm getting three. Also, I'm trying to make a regex to check the pattern +b+, with b being any letter in the alphabet? Is that correct?
So, a few things pop out to me:
First, your regex looks correct. If you're ever worried about how your regex will perform, you can use https://regexr.com/ to check it out. Just put your regex on the top and enter your string in the bottom to see if it is matching correctly
Second, upon close inspection, I see you're using the split function. While it is convenient for quickly splitting strings, you need to be careful as to what you are splitting on. In this case, you're removing all of the strings that you were initially looking at, which would make it impossible to find. If you print it out, you would notice that the following shows (for an input string of +d+=3=+s+):
+
+=3=+
+
Which shows that you accidentally cut out what you were looking to find in the first place. Now, there are several ways of fixing this, depending on what your criteria are.
Now, if what you wanted was just to separate on all +s and it doesn't matter that you find only what is directly bounded by +s, then split works awesome. Just do str.split("\\+") (the + has to be escaped, because on its own it is a regex quantifier and split would throw a PatternSyntaxException). Because the input starts with a +, the returned array begins with an empty string, followed by these elements (for +d+=3=+s+):
d
=3=
s
However, you can see that this poses a few problems. First, it doesn't strip out the =3= that we don't want, and second, it does not truly give us values that are surrounded by a +_+ format, where the underscore represents the string/char you're looking for.
Seeing as you're using \\w, you intend to find word characters that are surrounded by +s. Keep in mind that \\w also matches digits and underscores, so if you're just looking to find one letter, I would suggest using a character class like [a-z] or [a-zA-Z] to be more specific. If you want to find runs of several characters, you can add a * (0 or more) or a + (1 or more) after the class to dictate what exactly you're looking for.
I won't give you the answer outright, but I'll give you a clue as to what to move towards. Try using a pattern and a matcher to find the regex that you listed above and then if you find a match, make sure to store it somewhere :)
Also, for future reference, you should always start a function name with a lower case, at least in Java. Only constants and class names should start in a capital :)
I am trying to use split to count the number of letters in that string. The array length should be two, but I'm getting three.
The regex passed to split is used as the delimiter, and delimiters do not appear in the result. In your case str.split("[a-z]") uses every lowercase letter as a delimiter, which cuts your input into three substrings: "+", "+=3=+" and "+" (the pieces before, between and after the d and the s).
If you really want to count the letters using split, you would need something like str.split("[^a-z]+"), but because the input starts with a delimiter the result still begins with an empty string ("", "d", "s"), so the length is off by one again. I would recommend using java.util.regex.Matcher.find() to find all the letters instead.
Also, I'm trying to make a regex to check the pattern +b+, with b being any letter in the alphabet? Is that correct?
Similarly, check the functions in "java.util.regex.Matcher".
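To make the Matcher suggestion concrete, here is a rough sketch. The class and method names are invented for this example, and it assumes lowercase ASCII letters only, as in the question. One caveat: in input like "+a+b+" the second letter is missed, because the first match already consumed the + they share; a lookahead such as \\+([a-z])(?=\\+) would be one way around that.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterFinder {
    // Counts every lowercase letter by repeatedly calling find().
    static int countLetters(String str) {
        Matcher m = Pattern.compile("[a-z]").matcher(str);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Collects each single letter that has a literal + directly on both sides.
    static List<String> lettersBetweenPluses(String str) {
        Matcher m = Pattern.compile("\\+([a-z])\\+").matcher(str);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1)); // group(1) is the letter without the surrounding +s
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(countLetters("+d+=3=+s+"));
        System.out.println(lettersBetweenPluses("+d+=3=+s+"));
    }
}
```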

Java Regex - Trying to use regex to get an array of "function" arguments

I have a bunch of strings representing mathematical functions (which could be nested and have any number of arguments), and I want to be able to use regex to return an array of strings, each string being an argument of the outer-most function. Here's an example:
"f1(f2(x),f3(f4(f5(x,y,z))),f(f(1)))"
I would want a regex pattern that I could use to somehow get an array of all the arguments of f1, which in this case are the strings "f2(x)", "f3(f4(f5(x,y,z)))", and "f(f(1))". There will be no spaces in the input string.
Thank you very much to anyone who can help.
I don't think this can be done with regexes alone.
This would probably require being able to identify balanced parentheses -- for example, once we've parsed f1(f2(x), the next character could either be a ) or a , -- and that's a canonical example of something that can't be done with regexes, but requires a more sophisticated parser.
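As a sketch of what such a parser could look like without any regex: walk the characters, keep a parenthesis depth counter, and cut at commas only when the depth is zero. The names below are hypothetical, and the code assumes well-formed input with no spaces, as stated in the question.

```java
import java.util.ArrayList;
import java.util.List;

class ArgSplitter {
    // Returns the arguments of the outermost call, e.g. the arguments of f1 in "f1(a,g(b,c))".
    static List<String> topLevelArgs(String call) {
        int open = call.indexOf('(');
        int close = call.lastIndexOf(')');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("not a function call: " + call);
        }
        List<String> args = new ArrayList<>();
        int depth = 0;        // nesting depth inside the outermost parentheses
        int start = open + 1; // start index of the argument currently being read
        for (int i = open + 1; i < close; i++) {
            char c = call.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
            } else if (c == ',' && depth == 0) {
                // A comma at depth 0 separates two arguments of the outer function.
                args.add(call.substring(start, i));
                start = i + 1;
            }
        }
        if (start < close) {
            args.add(call.substring(start, close)); // last argument, if any
        }
        return args;
    }

    public static void main(String[] args) {
        System.out.println(topLevelArgs("f1(f2(x),f3(f4(f5(x,y,z))),f(f(1)))"));
    }
}
```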

java regular expression for String.contains

I'm looking for how to create a regular expression, which is 100% equivalent to the "contains" method in the String class. Basically, I have thousands of phrases that I'm searching for, and from what I understand it is much better for performance reasons to compile the regular expression once and use it multiple times, vs calling "mystring.contains(testString)" over and over again on different "mystring" values, with the same testString values.
Edit: to expand on my question... I will have many thousands of "testString" values, and I don't want to have to convert those to a format that the regular expression mechanism understands. I just want to be able to directly pass in a phrase that users enter, and see if it is found in whatever value "mystring" happens to contain. "testString" will never change its value, but there will be thousands of them, so that is why I was thinking of creating the matcher object and re-using it over and over etc. (Obviously my regexp skills are not up to snuff)
You can use the LITERAL flag when compiling your pattern to tell the engine you're using a literal string, e.g.:
Pattern p = Pattern.compile(yourString, Pattern.LITERAL);
But are you really sure that doing that and then reusing the result is faster than just String#contains? Enough to make the complexity worth it?
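For illustration, a small sketch of the LITERAL approach. The class and method names are made up, and nothing here claims it beats String#contains; that still needs a benchmark on your data.

```java
import java.util.regex.Pattern;

class LiteralContains {
    // Compiles the phrase once; metacharacters such as . or ( are treated as plain text.
    static Pattern literal(String phrase) {
        return Pattern.compile(phrase, Pattern.LITERAL);
    }

    // find() looks for the phrase anywhere in the text, like String#contains.
    static boolean contains(Pattern phrase, String text) {
        return phrase.matcher(text).find();
    }

    public static void main(String[] args) {
        Pattern p = literal("a.b(c");
        System.out.println(contains(p, "xx a.b(c xx"));
        System.out.println(contains(p, "xx aXb(c xx"));
    }
}
```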
Well you could use Pattern.quote to get a "piece of regular expression" for each input string. Do any of your terms contain line breaks? If so, that could at least make life slightly trickier, though far from impossible.
Anyway, you'd basically just join the quoted terms together as:
Pattern pattern = Pattern.compile("quoted1|quoted2|quoted3|...");
You might want to use Guava's Joiner to easily join the quoted strings together, although obviously it's not terribly hard to do manually.
However, I would try this and then test whether it's actually more efficient than just calling contains. Have you already got a benchmark which shows that contains is too slow?
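A sketch of the quote-and-join idea using only the JDK (Java 8 streams instead of Guava's Joiner). The class and method names are hypothetical. Note the assumptions: the combined pattern tells you that some phrase occurs, not which one (Matcher.group() would give the matched text), and an empty phrase list would produce an empty pattern that matches everything.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AnyPhrase {
    // Quotes each phrase so it is matched literally, then joins them with | (alternation).
    static Pattern anyOf(List<String> phrases) {
        String alternation = phrases.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    public static void main(String[] args) {
        Pattern p = anyOf(Arrays.asList("1+1", "foo.bar"));
        System.out.println(p.matcher("what is 1+1?").find());
        System.out.println(p.matcher("fooXbar and 11").find());
    }
}
```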

Java string: classes or packages with advanced functions?

I am doing string manipulations and I need more advanced functions than the original ones provided in Java.
For example, I'd like to return a substring between the (n-1)th and nth occurrence of a character in a string.
My question is, are there classes already written by users which perform this function, and many others for string manipulations? Or should I dig on stackoverflow for each particular function I need?
Check out the Apache Commons class StringUtils, it has plenty of interesting ways to work with Strings.
http://commons.apache.org/lang/api-2.3/index.html?org/apache/commons/lang/StringUtils.html
Have you looked at the regular expression API? That's usually your best bet for doing complex things with strings:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
Along the lines of what you're looking to do, you can traverse the string against a pattern (in your case a single character) and match everything in the string up to but not including the next instance of the character as what is called a capture group.
It's been a while since I've written a regex, but if you were looking for the character A for instance, then I think you could use the regex A([^A]*) and keep matching that string. The stuff in the parenthesis is a capturing group, which I reference below. To match it, you'd use the matcher method on pattern:
http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html#matcher%28java.lang.CharSequence%29
On the Matcher instance, you'd keep calling find() in a loop and, each time it returns true, read group(1) to get what was captured between the parentheses. (Don't call matches() first: it only succeeds if the entire string matches the pattern.) You could use a counter in your loop to stop at the (n-1)th occurrence of the letter.
Lastly, Pattern provides flags you can pass in to indicate things like case insensitivity, which you may need.
If I've made some mistakes here, then someone please correct me. Like I said, I don't write regexes every day, so I'm sure I'm a little bit off.
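Here is one way the find()/group(1) loop could be written for the "between the (n-1)th and nth occurrence" case. This is a sketch, not the only way to do it: the class and method names are invented, it uses a lazy group plus a lookahead instead of the A([^A]*) pattern above so that the nth occurrence is required to exist, and it returns null when the string has fewer than n occurrences.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BetweenOccurrences {
    // Text between the (n-1)th and nth occurrence of delimiter (n >= 2), or null if there is none.
    static String between(String text, char delimiter, int n) {
        if (n < 2) {
            throw new IllegalArgumentException("n must be at least 2");
        }
        String d = Pattern.quote(String.valueOf(delimiter));
        // delimiter, shortest possible run of characters, then a delimiter that is not consumed
        Pattern p = Pattern.compile(d + "(.*?)(?=" + d + ")", Pattern.DOTALL);
        Matcher m = p.matcher(text);
        int seen = 0;
        while (m.find()) {
            seen++;
            if (seen == n - 1) {
                return m.group(1); // the k-th match holds the text after the k-th delimiter
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(between("a,b,c,d", ',', 2));
        System.out.println(between("a,b,c,d", ',', 3));
    }
}
```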

Pattern match numbers/operators

Hey, I've been trying to figure out why this regular expression isn't matching correctly.
List l_operators = Arrays.asList(Pattern.compile("(\\d+)").split(rtString.trim()));
The input string is "12+22+3"
The output I get is -- [,+,+]
There's a match at the beginning of the list which shouldn't be there? I really can't see it and I could use some insight. Thanks.
Well, technically, there is an empty string in front of the first delimiter (first sequence of digits). If you had, say a line of CSV, such as abc,def,ghi and another one ,jkl,mno you would clearly want to know that the first value in the second string was the empty string. Thus the behaviour is desirable in most cases.
For your particular case, you need to deal with it manually, or refine your regular expression somehow. Like this for instance:
Pattern p = Pattern.compile("\\d+");
Matcher m = p.matcher(rtString);
if (m.find()) {
    // Skip past the first number so the remaining text no longer starts with a delimiter.
    List<String> l_operators = Arrays.asList(p.split(rtString.substring(m.end()).trim()));
    // ...
}
Ideally, however, you should be using a parser for these types of strings. You can't, for instance, deal with parentheses in expressions using just regular expressions.
That's the behavior of split in Java. You just have to take it (and deal with it) or use another library to split the string. I personally try to avoid Java's split.
An example of one alternative is to look at Splitter from Google Guava.
Try Guava's Splitter.
Splitter.onPattern("\\d+").omitEmptyStrings().split(rtString)
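If pulling in Guava is not an option, roughly the same effect can be sketched with the JDK alone (Java 8 or later) by splitting and then dropping empty tokens. The class and method names below are made up for this example.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class Operators {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Splits on runs of digits and discards empty strings, similar to omitEmptyStrings().
    static List<String> operators(String expression) {
        return DIGITS.splitAsStream(expression.trim())
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(operators("12+22+3"));
    }
}
```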
