String#contains using Pattern

Suppose I want to make an exact clone of String#contains(CharSequence s): boolean using Java regex and Pattern. Would the following calls be identical?
input.contains(s);
and
Pattern.compile(".*" + Pattern.quote(s) + ".*").matcher(input).matches();
Similarly, would the following code have the same functionality?
Pattern.compile(Pattern.quote(s)).matcher(input).find();
I presume that the regex search is less performant, but only by a constant factor. Is this correct? Is there any way to optimize the regular expressions to mimic contains?
The reason I'm asking is that I have a piece of code that is written around Pattern, and it seems wasteful to create a separate piece of code that uses contains. On the other hand, I don't want the two code paths to produce different results, even minor ones. Are there any Unicode-related differences, for instance?

If you need to write a .contains like method based on Pattern, you should choose the Matcher#find() version:
Pattern.compile(Pattern.quote(s)).matcher(input).find()
If you want to use .matches(), you should bear in mind that:
.* does not match line breaks by default, so you need either the (?s) inline modifier at the start of the pattern or the Pattern.DOTALL option.
The .* at the pattern start will cause too much backtracking and you may get a stack overflow exception, or the code execution might just freeze.
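To make the line-break difference concrete, here is a minimal sketch comparing the variants from the question. The class name ContainsDemo and its method names are made up for illustration; only Pattern, Matcher and String#contains come from the JDK.

```java
import java.util.regex.Pattern;

final class ContainsDemo {
    // Mirrors input.contains(s): find() looks for the quoted literal anywhere.
    static boolean containsViaFind(String input, String s) {
        return Pattern.compile(Pattern.quote(s)).matcher(input).find();
    }

    // matches() without DOTALL: '.' stops at line terminators.
    static boolean containsViaMatches(String input, String s) {
        return Pattern.compile(".*" + Pattern.quote(s) + ".*").matcher(input).matches();
    }

    // matches() with DOTALL: '.' also consumes line terminators.
    static boolean containsViaMatchesDotAll(String input, String s) {
        return Pattern.compile(".*" + Pattern.quote(s) + ".*", Pattern.DOTALL)
                .matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "first line\nsecond line";
        System.out.println(input.contains("second"));                  // true
        System.out.println(containsViaFind(input, "second"));          // true
        System.out.println(containsViaMatches(input, "second"));       // false
        System.out.println(containsViaMatchesDotAll(input, "second")); // true
    }
}
```

The find() version needs no flags, which is one more reason to prefer it.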

There are two ways to check whether a String contains a match for a Pattern:
return Pattern.compile(Pattern.quote(s)).asPredicate().test(input);
or
return Pattern.compile(Pattern.quote(s)).matcher(input).find();
There is no need to match on .* at all: it only matches whatever surrounds the actual result and is pure overhead.

This is just to share how I decided to solve this little conundrum. I've redesigned my library to take a Predicate instead of a Pattern, like this:
public static Set<String> findAll() {
return find(input -> true);
}
public static Set<String> findSubstring(String s) {
return find(input -> input.contains(s));
}
public static Set<String> findPattern(Pattern p) {
return find(p.asPredicate());
}
public static Set<String> findCaseInsensitiveSubstring(String s) {
return find(Pattern.compile(Pattern.quote(s), Pattern.CASE_INSENSITIVE).asPredicate());
}
private static Set<String> find(Predicate<String> matcher) {
var testInput = Set.of("some", "text", "to", "test");
return testInput.stream().filter(matcher).collect(Collectors.toSet());
}
public static void main(String[] args) {
System.out.println(findAll());
System.out.println(findSubstring("t"));
System.out.println(findPattern(Pattern.compile("^[^s]")));
System.out.println(findCaseInsensitiveSubstring("T"));
}
where I've used all the comments and answers given up to now.
Note that there is also Pattern#asMatchPredicate() in case matching is required instead, e.g. for a function matchPattern.
Of course above is just a demonstration, not the actual functions in my solution.
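For anyone wondering about the difference between the two predicate factories, here is a small sketch. The class and constant names are invented for the example; asPredicate() exists since Java 8 and asMatchPredicate() since Java 11.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class PredicateDemo {
    static final Pattern DIGITS = Pattern.compile("\\d+");

    // asPredicate() behaves like matcher(input).find()
    static final Predicate<String> CONTAINS_DIGITS = DIGITS.asPredicate();

    // asMatchPredicate() behaves like matcher(input).matches()
    static final Predicate<String> ONLY_DIGITS = DIGITS.asMatchPredicate();

    public static void main(String[] args) {
        System.out.println(CONTAINS_DIGITS.test("abc123")); // true
        System.out.println(ONLY_DIGITS.test("abc123"));     // false
        System.out.println(ONLY_DIGITS.test("123"));        // true
    }
}
```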

Related

Java Truth OR assertion

I would like to check with Java Truth assertion library if any of the following statements is satisfied:
assertThat(strToCheck).startsWith("a");
assertThat(strToCheck).contains("123");
assertThat(strToCheck).endsWith("#");
In other words, I am checking whether strToCheck starts with a, OR contains the substring 123, OR ends with #, i.e. whether any of the three conditions applies. The assertions above are just an example.
Is there a way to do the logical OR assertion with Truth?
I know with Hamcrest, we could do something like:
assertThat(strToCheck, anyOf(startsWith("a"), new StringContains("123"), endsWith("#")));

assertTrue(strToCheck.startsWith("a") || strToCheck.contains("123") || strToCheck.endsWith("#"));
You can do what you asked for with this single line only.

Why not use a regular expression to solve this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String strToCheck = "afoobar123barfoo#";
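// Alternation gives OR semantics: starts with "a", contains "123", or ends with "#".
Pattern pattern = Pattern.compile("^a|123|#\\z");
Matcher matcher = pattern.matcher(strToCheck);
boolean matchFound = matcher.find();
// matchFound is now true if any of the three conditions applies.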
}
}

All the ways of doing this with Truth currently either are very clumsy or don't produce as informative a failure message as we'd aim for. See this comment on issue 991, which mentions some possible future enhancements, but this is never going to be something that Truth is as good at as Hamcrest is.
If I were writing a test that needed this, I would probably write something like:
boolean valid =
strToCheck.startsWith("a")
|| strToCheck.contains("123")
|| strToCheck.endsWith("#");
if (!valid) {
assertWithMessage(
"expected to be a valid <some description of what kind of string you expect>"
+ "\nbut was: %s", strToCheck)
.fail();
}
And then I'd extract that to a method if it's going to be commonly needed.

Going to flip this on its head, since you're talking about testing.
You should be explicit about what you're asserting, and not so wide-open about it.
For instance, it sounds like you're expecting something like:
a...123#
a123#
a
#
123
...but you may only actually care about one of those cases.
So I would encourage you to explicitly validate only one of each. Even though Hamcrest allows you to find any match, this too feels like an antipattern; you should be more explicit about what it is you're expecting given a set of strings.

How do I determine if a string is not a regular expression?

I am trying to improve the performance of some code. It looks something like this:
public boolean isImportant(String token) {
for (Pattern pattern : patterns) {
if (pattern.matcher(token).find()) return true;
}
return false;
}
What I noticed is that many of the Patterns seem to be simple string literals with no regular expression constructs. So I want to simply store these in a separate list (importantList) and do an equality test instead of performing a more expensive pattern match, such as follows:
public boolean isImportant(String token) {
if (importantList.contains(token)) return true;
for (Pattern pattern : patterns) {
if (pattern.matcher(token).find()) return true;
}
return false;
}
How do I programmatically determine if a particular string contains no regular expression constructs?
Edit:
I should add that the answer doesn't need to be performance-sensitive (i.e. regular expressions can be used). I'm mainly concerned with the performance of isImportant() because it's called millions of times, while the initialization of the patterns is only done once.

I normally hate answers that say this but...
Don't do that.
It probably won't make the code run faster; in fact, it might even cause the program to take more time.
If you really need to optimize your code, there are likely much more effective places to look.

It's going to be difficult. You can check for the non-presence of any regex metacharacters; that should be a good approximation:
Pattern regex = Pattern.compile("[$^()\\[\\]{}.*+?|\\\\]");
Matcher regexMatcher = regex.matcher(subjectString);
boolean regexIsLikely = regexMatcher.find();
Whether it's worth it is another question. Are you sure a regex match is slower than a list lookup (especially since you'll be doing a regex match after that in many cases anyway)? I'd bet it's much faster to just keep the regex match.
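If you do want to try it, the split could look roughly like the sketch below. The class ImportantTokens and its fields are hypothetical, not from the question. It also assumes whole-token matching (matches() rather than find()), because a set lookup is an equality test and would not be equivalent to a substring search.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

final class ImportantTokens {
    // Any of these characters suggests the source string is meant as a regex.
    private static final Pattern META = Pattern.compile("[$^()\\[\\]{}.*+?|\\\\]");

    private final Set<String> literals = new HashSet<>();
    private final List<Pattern> patterns = new ArrayList<>();

    // Done once: sort each source into the literal set or the pattern list.
    ImportantTokens(List<String> sources) {
        for (String source : sources) {
            if (META.matcher(source).find()) {
                patterns.add(Pattern.compile(source));
            } else {
                literals.add(source);
            }
        }
    }

    // Called many times: cheap hash lookup first, regex matching only as a fallback.
    boolean isImportant(String token) {
        if (literals.contains(token)) return true;
        for (Pattern pattern : patterns) {
            if (pattern.matcher(token).matches()) return true;
        }
        return false;
    }
}
```

Measure before and after; whether this actually pays off depends on how many of the sources end up in the literal set.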

There is no way to determine this, as every regex pattern is nothing but a string. Furthermore, there is almost no performance difference, as regex engines are smart nowadays, and I'm pretty sure that if the pattern and source lengths are the same, an equality check is the first thing that will be done.

This is wrong:
for (Pattern pattern : patterns)
You should create one big regex that ORs all the patterns; then you only match once for each input.
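A rough sketch of how the patterns could be combined, assuming they are available as source strings. CombinedPattern and combine are names invented here. One caveat: numbered backreferences inside the individual sources would no longer point at the right groups after joining, so this only works as-is for sources without them.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class CombinedPattern {
    // Wrap each source in a non-capturing group so that an alternation inside
    // one source cannot leak into its neighbours, then join them with '|'.
    static Pattern combine(List<String> sources) {
        String joined = sources.stream()
                .map(source -> "(?:" + source + ")")
                .collect(Collectors.joining("|"));
        return Pattern.compile(joined);
    }

    public static void main(String[] args) {
        Pattern any = combine(List.of("^a", "123", "#$"));
        System.out.println(any.matcher("x123y").find()); // true
        System.out.println(any.matcher("xyz").find());   // false
    }
}
```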

select a word from a section of string?

I'm trying to find out if there are any methods in Java which would help me achieve the following.
I want to pass a method a parameter like below
"(hi|hello) my name is (Bob|Robert). Today is a (good|great|wonderful) day."
I want the method to select one of the words inside each pair of parentheses, separated by '|', and return the full string with one word randomly selected from each group. Does Java have any methods for this, or would I have to code this myself using character-by-character checks in loops?

You can parse it by regexes.
The regex would be \(\w+(\|\w+)*\); in the replacement you just split the argument on the '|' and return the random word.
Something like
import java.util.concurrent.ThreadLocalRandom;
import java.util.regex.*;
public final class Replacer {
//aText: "(hi|hello) my name is (Bob|Robert). Today is a (good|great|wonderful) day."
//returns, for example: "hello my name is Bob. Today is a wonderful day."
public static String getEditedText(String aText){
StringBuffer result = new StringBuffer();
Matcher matcher = fINITIAL_A.matcher(aText);
while ( matcher.find() ) {
matcher.appendReplacement(result, getReplacement(matcher));
}
matcher.appendTail(result);
return result.toString();
}
private static final Pattern fINITIAL_A = Pattern.compile(
"\\((\\w+(\\|\\w+)*)\\)",
Pattern.CASE_INSENSITIVE
);
//aMatcher.group(1): "hi|hello"
//words: ["hi", "hello"]
//returns: "hi" or "hello", chosen at random
private static String getReplacement(Matcher aMatcher){
//split takes a regex, so the '|' must be escaped
String[] words = aMatcher.group(1).split("\\|");
int index = ThreadLocalRandom.current().nextInt(words.length);
return words[index];
}
}
(Note that this code is written just to illustrate the idea and has not been tested thoroughly.)

Maybe this helps.
Pass the three strings "hi|hello", "Bob|Robert" and "good|great|wonderful" as arguments to the method.
Inside the method, split each string into an array, e.g.
String[] firstStringArray = thatString.split("\\|");
(the | must be escaped because split takes a regex), and do the same for the other two.
Then pick a random element from each array.

As far as I know, Java doesn't have any method to do this directly.
You have to write code for it yourself or use a regex.

I don't think Java has anything that will do what you want directly. Personally, instead of doing things based on regexps or characters, I would make a method something like:
String madLib(Set<String> greetings, Set<String> names, Set<String> dispositions)
{
// pick randomly from each of the sets and insert into your background string
}
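The answer leaves the body as a comment; one possible way to fill it in is sketched below. This is a guess at the intent, not the answerer's code: it assumes List instead of Set (so an element can be picked by index), hard-codes the sentence from the question, and takes a Random parameter that the original signature does not have.

```java
import java.util.List;
import java.util.Random;

final class MadLib {
    // Pick one element uniformly at random.
    static String pick(List<String> options, Random random) {
        return options.get(random.nextInt(options.size()));
    }

    // Fill the fixed background sentence with one random choice per slot.
    static String madLib(List<String> greetings, List<String> names,
                         List<String> dispositions, Random random) {
        return String.format("%s my name is %s. Today is a %s day.",
                pick(greetings, random), pick(names, random), pick(dispositions, random));
    }

    public static void main(String[] args) {
        System.out.println(madLib(
                List.of("hi", "hello"),
                List.of("Bob", "Robert"),
                List.of("good", "great", "wonderful"),
                new Random()));
    }
}
```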

There is no direct support for this. And you should ideally not try a low level solution.
You should search for 'random sentence generator'. The way you are writing
`(Hi|Hello)`
etc. is called a grammar. You have to write a parser for the grammar. Again there are many solutions for writing parsers. There are standard ways to specify grammar. Look for BNF.
The parser and generator problems have been solved many times over, and the interesting part of your problem will be writing the grammar.

Java does not provide any ready-made method for this. You can either use a regex as described by Penartur or write your own method that splits the string and picks random words. The StringTokenizer class can help if you follow the second approach.

Regular expression performance in Java -- better few complex or many simple?

I am doing some fairly extensive string manipulations using regular expressions in Java. Currently, I have many blocks of code that look something like:
Matcher m = Pattern.compile("some pattern").matcher(text);
StringBuilder b = new StringBuilder();
int prevMatchIx = 0;
while (m.find()) {
b.append(text.substring(prevMatchIx, m.start()));
String matchingText = m.group(); //sometimes group(n)
//manipulate the matching text
b.append(matchingText);
prevMatchIx = m.end();
}
text = b.toString()+text.substring(prevMatchIx);
My question is which of the two alternatives is more efficient (primarily time, but space to some extent):
1) Keep many existing blocks as above (assuming there isn't a better way to handle such blocks -- I can't use a simple replaceAll() because the groups must be operated on).
2) Consolidate the blocks into one big block. Use a "some pattern" that is the combination of all the old blocks' patterns using the |/alternation operator. Then, use if/else if within the loop to handle each of the matching patterns.
Thank you for your help!

If the order in which the replacements are made matters, you would have to be careful when using technique #1. Allow me to give an example: if I want to format a String so it is suitable for inclusion in XML, I have to first replace all & with &amp; and then make the other replacements (like < to &lt;). Using technique #2, you would not have to worry about this because you are making all the replacements in one pass.
In terms of performance, I think #2 would be quicker because you would be doing fewer String concatenations. As always, you could implement both techniques and record their speed and memory consumption to find out for certain. :)
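To illustrate the ordering problem with technique #1, here is a small sketch (the class and method names are made up for the example). Escaping & last re-escapes the ampersands that the earlier steps introduced.

```java
final class EscapeOrder {
    // Correct order: '&' first, so ampersands introduced by later steps are left alone.
    static String ampersandFirst(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wrong order: the '&' of each "&lt;" or "&gt;" gets escaped a second time.
    static String ampersandLast(String s) {
        return s.replace("<", "&lt;").replace(">", "&gt;").replace("&", "&amp;");
    }

    public static void main(String[] args) {
        System.out.println(ampersandFirst("a < b & c")); // a &lt; b &amp; c
        System.out.println(ampersandLast("a < b & c"));  // a &amp;lt; b &amp; c
    }
}
```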

I'd suggest caching the patterns and having a method that uses the cache.
Patterns are expensive to compile so at least you will only compile them once and there is code reuse in using the same method for each instance. Shame about the lack of closures though as that would make things a lot cleaner.
private static Map<String, Pattern> patterns = new HashMap<String, Pattern>();
static Pattern findPattern(String patStr) {
if (! patterns.containsKey(patStr))
patterns.put(patStr, Pattern.compile(patStr));
return patterns.get(patStr);
}
public interface MatchProcessor {
public void process(String field);
}
public static void processMatches(String text, String pat, MatchProcessor processor) {
Matcher m = findPattern(pat).matcher(text);
int startInd = 0;
while (m.find(startInd)) {
processor.process(m.group());
startInd = m.end();
}
}
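Since Java 8 the closure complaint no longer applies, so the same idea can be written more compactly. The following is only a sketch of that variant, not a drop-in replacement: it swaps the MatchProcessor interface for java.util.function.Consumer and the HashMap for a ConcurrentHashMap, and the class name PatternCache is invented here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternCache {
    private static final Map<String, Pattern> PATTERNS = new ConcurrentHashMap<>();

    // Compile each distinct pattern string at most once.
    static Pattern findPattern(String patStr) {
        return PATTERNS.computeIfAbsent(patStr, Pattern::compile);
    }

    // Hand every match of pat in text to the given callback.
    static void processMatches(String text, String pat, Consumer<String> processor) {
        Matcher m = findPattern(pat).matcher(text);
        while (m.find()) {
            processor.accept(m.group());
        }
    }

    public static void main(String[] args) {
        processMatches("a1b22c333", "\\d+", System.out::println);
    }
}
```

Calling find() without an index continues after the previous match, so the explicit startInd bookkeeping is not needed.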

Last time I was in your position I used a product called jflex.
Java's regex doesn't provide the traditional O(N log M) performance guarantees of true regular expression engines (for input strings of length N, and patterns of length M). Instead it inherits from its perl roots exponential time for some patterns. Unfortunately these pathological patterns, while rare in normal use, are all too common when combining regexes as you propose to do (I can attest to this from personal experience).
Consequently, my advice is to either:
a) pre-compile your patterns as "static final Pattern" constants, so they will be initialized once during class initialization (<clinit>); or
b) switch to a lexer package such as jflex, which will provide a more declarative, and far more readable, syntax to approach this sort of cascading/sequential regex processing; and
c) seriously consider using a parser generator package. My current favourite is Beaver, but CUP is also a good option. Both of these are excellent tools and I highly recommend both of them, and as they both sit on top of jflex you can add them as/when you need them.
That being said, if you haven't used a parser-generator before and you are in a hurry, it will be easier to get up to speed with JavaCC. Not as powerful as Beaver/CUP but its parsing model is easier to understand.
Whatever you do, please don't use Antlr. It is very fashionable, and has great cheerleaders, but its online documentation sucks, its syntax is awkward, its performance is poor, and its scannerless design makes several common simple cases painful to handle. You would be better off using an abomination like sablecc(v1).
Note: Yes I have used everything I have mentioned above, and more besides; so this advice comes from personal experience.

First, does this need to be efficient? If not, don't bother -- complexification won't help code maintainability.
Assuming it does, doing them separately is usually the most efficient. This is especially true if the expressions contain large blocks of literal text: without alternation the engine can use these to speed up matching, whereas with alternation they can't help at all.
If performance is really critical, you can code it several ways and test with sample data.

Option #2 is almost certainly the better way to go, assuming it isn't too difficult to combine the regexes. And you don't have to implement it from scratch, either; the lower-level API that replaceAll() is built on (i.e., appendReplacement() and appendTail()), is also available for your use.
Taking the example that @mangst used, here's how you might process some text to be inserted into an XML document:
import java.util.regex.*;
public class Test
{
public static void main(String[] args)
{
String test_in = "One < two & four > three.";
Pattern p = Pattern.compile("(&)|(<)|(>)");
Matcher m = p.matcher(test_in);
StringBuffer sb = new StringBuffer(); // (1)
while (m.find())
{
String repl = m.start(1) != -1 ? "&amp;" :
m.start(2) != -1 ? "&lt;" :
m.start(3) != -1 ? "&gt;" : "";
m.appendReplacement(sb, ""); // (2)
sb.append(repl);
}
m.appendTail(sb);
System.out.println(sb.toString());
}
}
In this very simple example, all I need to know about each match is which capture group participated in it, which I find out by means of the start(n) method. But you can use the group() or group(n) method to examine the matched text, as you mentioned in the question.
Note (1) As of JDK 1.6, we have to use a StringBuffer here because StringBuilder didn't exist yet when the Matcher class was written. Overloads of appendReplacement() and appendTail() that accept a StringBuilder were added later, in Java 9.
Note (2) appendReplacement(StringBuffer, String) processes the String argument to replace any $n sequence with the contents of the n'th capture group. We don't want that to happen, so we pass it an empty string and then append() the replacement string ourselves.
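On Java 9 or later, the same single-pass replacement can be sketched with Matcher.replaceAll(Function), which removes the manual appendReplacement() loop. This is an alternative formulation rather than the code above; the class name XmlEscaper and the simplified pattern [&<>] are choices made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlEscaper {
    private static final Pattern SPECIAL = Pattern.compile("[&<>]");

    static String escape(String input) {
        return SPECIAL.matcher(input).replaceAll(match -> {
            String replacement;
            switch (match.group()) {
                case "&": replacement = "&amp;"; break;
                case "<": replacement = "&lt;"; break;
                default:  replacement = "&gt;"; break;
            }
            // The returned string is still scanned for $n and backslashes,
            // so quote it to have it inserted literally.
            return Matcher.quoteReplacement(replacement);
        });
    }

    public static void main(String[] args) {
        System.out.println(escape("One < two & four > three."));
        // One &lt; two &amp; four &gt; three.
    }
}
```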

Is there a regular expression for finding/replacing the common start of all lines in a chunk of text?

Imagine this string:
if(editorPart instanceof ITextEditor){
ITextEditor editor = (ITextEditor)editorPart;
selection = (ITextSelection) editor.getSelectionProvider().getSelection();
}else if( editorPart instanceof MultiPageEditorPart){
//this would be the case for the XML editor
selection = (ITextSelection) editorPart.getEditorSite().getSelectionProvider().getSelection();
}
I can see, visually, that the "common" start in each of these lines is two tab characters. Is there a regular expression that would replace -- only at the beginning of each line (including the first and last line), this common start, such that after the regex I'd end up with that same string, only essentially un-indented?
I can't simply search for "two tabs" in this case because there might be two tabs elsewhere in the text but not at the start of a line.
I've implemented this functionality with a different method but thought it'd be a fun regex challenge, if it's possible at all

The ^ symbol in a regular expression matches the beginning of a line (in multiline mode). So:
s/^\t\t//gm
would remove two tabs at the beginning of each line.

In general (i.e. if you want to match an arbitrary prefix, not necessarily two tabs), there may or may not be a way. It depends on which regular expression engine you're using. I would imagine that maybe something roughly like this might work:
\B^(.+).*?$(?:^\1.*?$)+\E
Note that I've probably got the regex syntax wrong; just think of it as regex pseudocode of sorts (here \B stands for beginning of string, ^ for beginning of line, $ for end of line, and \E for end of string).
But this really isn't a job I would do with a regular expression. A simple character-by-character parser seems much better suited.
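For comparison, a character-by-character version might look like the sketch below. It is only one way to do it: the class and method names are invented, lines are assumed to be separated by \n, and the prefix is not restricted to whitespace, so a single empty line reduces the common prefix to nothing.

```java
final class Unindent {
    // Length of the longest prefix shared by all lines.
    static int commonPrefixLength(String[] lines) {
        if (lines.length == 0) return 0;
        String first = lines[0];
        int length = first.length();
        for (String line : lines) {
            int limit = Math.min(length, line.length());
            int i = 0;
            while (i < limit && line.charAt(i) == first.charAt(i)) i++;
            length = i;
        }
        return length;
    }

    // Strip that shared prefix from the start of every line.
    static String removeCommonPrefix(String text) {
        String[] lines = text.split("\n", -1);
        int n = commonPrefixLength(lines);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) sb.append('\n');
            sb.append(lines[i].substring(n));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeCommonPrefix("\t\tif (x) {\n\t\t\ty();\n\t\t}"));
    }
}
```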

Not in one regex. You need to make two passes: matches() to find the longest common prefix, then replaceAll() to remove it. Here's my best solution:
import java.util.regex.*;
public class Test
{
public static void main(String[] args) throws Exception
{
String target =
"\t\tif(editorPart instanceof ITextEditor){\n"
+ "\t\t\tITextEditor editor = (ITextEditor)editorPart;\n"
+ "\t\t\tselection = (ITextSelection) fee.fie().fum();\n"
+ "\t\t}else if( editorPart instanceof MultiPageEditorPart){\n"
+ "\t\t\t//this would be the case for the XML editor\n"
+ "\t\t\tselection = (ITextSelection) fee.fie().foe().fum();\n"
+ "\t\t}";
System.out.printf("%n%s%n", target);
Pattern p = Pattern.compile("^(\\s+).*+(?:\n\\1.*+)*+");
Matcher m = p.matcher(target);
if (m.matches())
{
String indent = m.group(1);
String result = target.replaceAll("(?m)^" + indent, "");
System.out.printf("%n%s%n", result);
}
}
}
Of course, this assumes (as Jonathan Leffler hinted at in his comment to your question) that the target string is not part of a larger string, and you're only removing whitespace. Without those assumptions the task becomes a lot more complex.

It's absolutely possible. As everyone points out, I'd never inflict this on a real project, though.
My answer, if you're curious, is here. I tried writing it in perl, but it doesn't support variable-length lookbehinds.
EDIT: Fixed it! The linked code now works. If you'd like hints, just comment -- I don't want to give it away if you want to solve it yourself, though.
