I would like to check with Java Truth assertion library if any of the following statements is satisfied:
assertThat(strToCheck).startsWith("a");
assertThat(strToCheck).contains("123");
assertThat(strToCheck).endsWith("#");
In other words, I am checking whether strToCheck starts with a, OR contains the substring 123, OR ends with #, i.e. whether any of the 3 conditions applies. The assertions above are just an example.
Is there a way to do the logical OR assertion with Truth?
I know with Hamcrest, we could do something like:
assertThat(strToCheck, anyOf(startsWith("a"), new StringContains("123"), endsWith("#")));
assertTrue(strToCheck.startsWith("a") || strToCheck.contains("123") ||strToCheck.endsWith("#"));
You can do what you asked for with this single line only.
Why not use a regular expression to solve this:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String[] args) {
String strToCheck = "afoobar123barfoo#";
Pattern pattern = Pattern.compile("^a|123|#$"); // alternation: starts with a, OR contains 123, OR ends with #
Matcher matcher = pattern.matcher(strToCheck);
boolean matchFound = matcher.find();
//matchFound now contains a true/false value.
}
}
All the ways of doing this with Truth currently either are very clumsy or don't produce as informative a failure message as we'd aim for. See this comment on issue 991, which mentions some possible future enhancements, but this is never going to be something that Truth is as good at as Hamcrest is.
If I were writing a test that needed this, I would probably write something like:
boolean valid =
strToCheck.startsWith("a")
|| strToCheck.contains("123")
|| strToCheck.endsWith("#");
if (!valid) {
assertWithMessage(
"expected to be a valid <some description of what kind of string you expect>"
+ "\nbut was: %s", strToCheck)
.fail();
}
And then I'd extract that to a method if it's going to be commonly needed.
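A minimal sketch of what that extracted method could look like, written in plain Java so it runs without Truth on the classpath. The class and method names are hypothetical, and in a Truth-based test the `throw` would presumably be replaced by the `assertWithMessage(...).fail()` call:

```java
// Hypothetical helper: the names are illustrative and not part of Truth.
final class StringChecks {
    private StringChecks() {}

    /** True if the string starts with "a", contains "123", or ends with "#". */
    static boolean isValid(String s) {
        return s.startsWith("a") || s.contains("123") || s.endsWith("#");
    }

    /** Fails with a descriptive message when none of the three conditions holds. */
    static void assertValid(String s) {
        if (!isValid(s)) {
            throw new AssertionError(
                "expected a string that starts with \"a\", contains \"123\", or ends with \"#\""
                    + "\nbut was: " + s);
        }
    }
}
```

A test would then call `StringChecks.assertValid(strToCheck);` and get the full failure message when all three conditions are false.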
Going to flip this on its head, since you're talking about testing.
You should be explicit about what you're asserting, and not so wide-open about it.
For instance, it sounds like you're expecting something like:
a...123#
a123#
a
#
123
...but you may only actually care about one of those cases.
So I would encourage you to explicitly validate only one of each. Even though Hamcrest allows you to find any match, this too feels like an antipattern; you should be more explicit about what it is you're expecting given a set of strings.
If I wanted to make an exact clone of String#contains(CharSequence s): boolean in Java regex using Pattern, would the following calls be identical?
input.contains(s);
and
Pattern.compile(".*" + Pattern.quote(s) + ".*").matcher(input).matches();
Similarly, would the following code have the same functionality?
Pattern.compile(Pattern.quote(s)).matcher(input).find();
I presume that the regex search is less performant, but only by a constant factor. Is this correct? Is there any way to optimize the regular expressions to mimic contains?
The reason that I'm asking is that I have a piece of code that is written around Pattern and it seems wasteful to create a separate piece of code that uses contains. On the other hand, I don't want different test results - even minor ones - for each code. Are there any Unicode related differences, for instance?
If you need to write a .contains like method based on Pattern, you should choose the Matcher#find() version:
Pattern.compile(Pattern.quote(s)).matcher(input).find()
If you want to use .matches(), you should bear in mind that:
.* will not match line breaks by default, so you need the (?s) inline modifier at the start of the pattern or the Pattern.DOTALL option
The .* at the pattern start will cause too much backtracking and you may get a stack overflow exception, or the code execution might just freeze.
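The line-break caveat can be seen in a small comparison. This is only a sketch; the class and method names are made up for illustration:

```java
import java.util.regex.Pattern;

final class ContainsViaRegex {
    // Equivalent of input.contains(s): quote the literal and search with find().
    static boolean viaFind(String input, String s) {
        return Pattern.compile(Pattern.quote(s)).matcher(input).find();
    }

    // matches() with .* on both sides; '.' does not match line terminators by default.
    static boolean viaMatches(String input, String s) {
        return Pattern.compile(".*" + Pattern.quote(s) + ".*").matcher(input).matches();
    }

    // Same as viaMatches, but DOTALL lets '.' match line terminators as well.
    static boolean viaMatchesDotAll(String input, String s) {
        return Pattern.compile(".*" + Pattern.quote(s) + ".*", Pattern.DOTALL)
                .matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "foo\nbar";
        System.out.println(input.contains("bar"));           // true
        System.out.println(viaFind(input, "bar"));           // true
        System.out.println(viaMatches(input, "bar"));        // false
        System.out.println(viaMatchesDotAll(input, "bar"));  // true
    }
}
```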
There are 2 ways to see if a String matches a Pattern:
return Pattern.compile(Pattern.quote(s)).asPredicate().test(input);
or
return Pattern.compile(Pattern.quote(s)).matcher(input).find();
There is no need to match on .* as well. It would only match whatever surrounds the actual result and add overhead.
This is just to share how I decided to solve this little conundrum. I've redesigned my library to take a predicate instead of a Pattern, like this:
public static Set<String> findAll() {
return find(input -> true);
}
public static Set<String> findSubstring(String s) {
return find(input -> input.contains(s));
}
public static Set<String> findPattern(Pattern p) {
return find(p.asPredicate());
}
public static Set<String> findCaseInsensitiveSubstring(String s) {
return find(Pattern.compile(Pattern.quote(s), Pattern.CASE_INSENSITIVE).asPredicate());
}
private static Set<String> find(Predicate<String> matcher) {
var testInput = Set.of("some", "text", "to", "test");
return testInput.stream().filter(matcher).collect(Collectors.toSet());
}
public static void main(String[] args) {
System.out.println(findAll());
System.out.println(findSubstring("t"));
System.out.println(findPattern(Pattern.compile("^[^s]")));
System.out.println(findCaseInsensitiveSubstring("T"));
}
where I've used all the comments and answers given up to now.
Note that there is also Pattern#asMatchPredicate() in case matching is required instead, e.g. for a function matchPattern.
Of course above is just a demonstration, not the actual functions in my solution.
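To make the difference between the two predicate flavours concrete, here is a small sketch (the class and constant names are hypothetical). `asPredicate()` uses find() semantics, while `asMatchPredicate()`, available from Java 11, uses matches() semantics:

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class PredicateKinds {
    static final Pattern TE = Pattern.compile("te");

    // find() semantics: true if the pattern occurs anywhere in the input.
    static final Predicate<String> CONTAINS_TE = TE.asPredicate();

    // matches() semantics (Java 11+): true only if the whole input matches.
    static final Predicate<String> IS_TE = TE.asMatchPredicate();

    public static void main(String[] args) {
        System.out.println(CONTAINS_TE.test("text")); // true
        System.out.println(IS_TE.test("text"));       // false
        System.out.println(IS_TE.test("te"));         // true
    }
}
```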
I'm trying to extract variables from code statements and "if" conditions. I have a regex to do that, but mymatcher.find() doesn't return any matched values.
I don't know what is wrong.
Here is my code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class test {
public static void main(String[] args) {
String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("^[a-zA-Z_$][a-zA-Z_$0-9]*$");
Matcher mymatcher = mypattern.matcher(test);
while (mymatcher.find()) {
String find = mymatcher.group(1) ;
System.out.println("variable:" + find);
}
}
}
You need to remove the ^ and $ anchors, which assert positions at the start and end of the string respectively, and use mymatcher.group(0) instead of mymatcher.group(1), because you do not have any capturing groups in your regex:
String test="x=y+z/n-10+my5th_integer+201";
Pattern mypattern = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");
Matcher mymatcher = mypattern.matcher(test);
while (mymatcher.find()) {
String find = mymatcher.group(0) ;
System.out.println("variable:" + find);
}
See IDEONE demo, the results are:
variable:x
variable:y
variable:z
variable:n
variable:my5th_integer
Usually processing source code with just a regex simply fails.
If all you want to do is pick out identifiers (we discuss variables further below) you have some chance with regular expressions (after all, this is how lexers are built).
But you probably need a much more sophisticated version than what you have, even with corrections as suggested by other authors.
A first problem is that if you allow arbitrary statements, they often have keywords that look like identifiers. In your specific example, "if" looks like an identifier. So your matcher either has to recognize identifier-like substrings and subtract away known keywords, or the regex itself must express the idea that an identifier has a basic shape but cannot look like any of a specific list of keywords. (The latter is called a subtractive regex, and isn't found in most regex engines. It looks something like:
[a-zA-Z_$][a-zA-Z_$0-9]* - (if | else | class | ... )
Our DMS lexer generator [see my bio] has subtractive regex because this is extremely useful in language-lexing).
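Without a subtractive regex, the first option (match identifier-shaped substrings, then subtract known keywords) can be sketched in plain Java as below. The class name and the deliberately incomplete keyword list are assumptions for illustration; this still ignores strings, comments and the other complications discussed here:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdentifierScan {
    private static final Pattern IDENT = Pattern.compile("[a-zA-Z_$][a-zA-Z_$0-9]*");

    // Deliberately incomplete keyword list, for illustration only.
    private static final Set<String> KEYWORDS =
            Set.of("if", "else", "class", "while", "return");

    /** Returns identifier-shaped substrings, in order, minus the known keywords. */
    static List<String> identifiers(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = IDENT.matcher(code);
        while (m.find()) {
            String candidate = m.group();
            if (!KEYWORDS.contains(candidate)) {
                found.add(candidate);
            }
        }
        return found;
    }
}
```

For example, `IdentifierScan.identifiers("if (x > y) z = x; else z = y;")` yields `[x, y, z, x, z, y]`.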
This gets more complex if the "keywords" are not always keywords, that is,
they can be keywords only in certain contexts. The Java "keyword" enum is just that: if you use it in a type context, it is a keyword; otherwise it is an identifier; C# is similar. Now the only way to know
if a purported identifier is a keyword is to actually parse the code (which is how you detect the context that controls its keyword-ness).
Next, identifiers in Java allow a variety of Unicode characters (Latin1, Russian, Chinese, ...) A regexp to recognize this, accounting for all the characters, is a lot bigger than the simple "A-Z" style you propose.
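One way to avoid spelling out the character ranges by hand is to lean on the `\p{javaJavaIdentifierStart}` and `\p{javaJavaIdentifierPart}` properties of java.util.regex, which delegate to the corresponding `Character.isJavaIdentifier...` methods. The sketch below (hypothetical class name) only addresses the character-set problem; keywords, strings, comments and Unicode escapes remain unhandled:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeIdentifiers {
    // These properties delegate to Character.isJavaIdentifierStart and
    // Character.isJavaIdentifierPart, so non-ASCII letters are covered.
    private static final Pattern IDENT =
            Pattern.compile("\\p{javaJavaIdentifierStart}\\p{javaJavaIdentifierPart}*");

    /** Returns every identifier-shaped substring, in order of appearance. */
    static List<String> identifiers(String code) {
        List<String> found = new ArrayList<>();
        Matcher m = IDENT.matcher(code);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```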
For Java, you need to defend against string literals containing what appear to be variable names. Consider the (funny-looking but valid) statement:
a = "x=y+z/n-10+my5th_integer+201";
There is only one identifier here. A similar problem occurs with comments
that contain content that looks like statements:
/* Tricky:
a = "x=y+z/n-10+my5th_integer+201";
*/
For Java, you need to worry about Unicode escapes, too. Consider this valid Java statement:
\u0061 = \u0062; // means "a=b;"
or nastier:
a\u006bc = 1; // means "akc=1;" not "abc=1;"!
Pushing this, without Unicode character decoding, you might not even
notice a string. The following is a variant of the above:
a = \u0022x=y+z/n-10+my5th_integer+201";
To extract identifiers correctly, you need to build (or use) the equivalent of a full Java lexer, not just a simple regex match.
If you don't care about being right most of the time, you can try your regex. Usually regex-applied-to-source-code-parsing ends badly, partly because of the above problems (e.g., oversimplification).
You are lucky in that you are trying to do this for Java. If you had to do this for C#, a very similar language, you'd have to handle interpolated strings, which allow expressions inside strings. The expressions themselves can contain strings... it's turtles all the way down. Consider the C# (version 6) statement:
a = $"x+{y*$"z=${c /* p=q */}"[2]}*q" + b;
This contains the identifiers a, b, c and y. Every other "identifier" is actually just a string or comment character. PHP has similar interpolated strings.
To extract identifiers from this, you need something that understands the nesting of string elements. Lexers usually don't do recursion (our DMS lexers handle this, for precisely this reason), so to process this correctly you usually need a parser, or at least something that tracks nesting.
You have one other issue: do you want to extract just variable names?
What if the identifier represents a method, type, class or package?
You can't figure this out without having a full parser and full Java name and type resolution, and you have to do this in the context in which the statement is found. You'd be amazed how much code it takes to do this right.
So, if your goals are simpleminded and you don't care if it handles these complications, you can get by with a simple regex to pick out things
that look like identifiers.
If you want to do it well (e.g., use this in some production code), the single regex will be a total disaster. You'll spend your life explaining to users what they cannot type, and that never works.
Summary: because of all the complications, usually processing source code with just a regex simply fails. People keep re-learning this lesson. It is one of the key reasons that lexer generators are widely used in language processing tools.
I'd like to validate an expected response against the actual response from an API as part of my unit tests.
So I have this:
Expectation:
disqualified because taxpayer can be claimed as dependent
I'd like to assert against this:
Actual:
disqualified because you can be claimed as a dependent
The actual text is an API response from a compiled binary.
Since these pieces of text are close enough, I'd like this assertion in my test to pass:
Assert.assertEquals(titleText.toLowerCase(), t.toLowerCase(), "Did not match!");
This assertion obviously did not pass, because the two pieces of text are not equal.
I tried this instead:
Assert.assertTrue(titleText.toLowerCase().contains(t.toLowerCase()));
But the test still fails. How can I make this test pass? Any suggestions would be much appreciated.
You will have to write a comparator of your own.
class OwnComparator {
public static boolean checkThoseForEquality(String str1, String str2) {
// your logic goes here
}
}
Use it with
assertTrue(OwnComparator.checkThoseForEquality(titleText, t));
You can even move toLowerCase() from code to this comparator, making your general code shorter.
I don't think it is reasonable to expect TestNG to have implemented exactly what you mean by "almost equal".
The best approach would probably be to assert those parts that are predictable, using for example regex:
Assert.assertTrue(titleText
.matches("(?i)disqualified because \\w+ can be claimed as (a )?dependent"));
The regex term \w+ means "a series of 1 or more word characters".
Note also how regex also lets you match case insensitively via the (?i) ignore-case flag, so avoiding the toLowerCase() call.
Or
If "close enough" really is good enough, use a Levenshtein distance test, an implementation of which you can find in Apache commons-lang's StringUtils.getLevenshteinDistance() method:
Assert.assertTrue(StringUtils.getLevenshteinDistance(titleText, t) < 10);
Well, if you are sure this fits you, you could use Levenshtein distance (or some other string matching score) to compare the strings.
// Checks that at most 20% of the string is different, using the Levenshtein distance score
assertTrue(StringUtils.getLevenshteinDistance(titleText.toLowerCase(), t.toLowerCase()) < titleText.length() * 0.2);
Don't use regex. Regex can help you only if you know the text patterns in advance and if the patterns are relatively simple.
You have to choose a metric of similarity that best fits your needs. The most famous is Levenshtein distance (or edit distance). You can find it, for example, in Apache Commons. If you need something more complex, you can use fuzzy string search or even full-text search (Lucene etc.).
But you probably won't find any of those in assertion/matcher libraries; use a dedicated tool to do the comparison.
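For readers who do not want to pull in a library just for this, the edit-distance idea can be sketched without dependencies. The class and method names below are hypothetical, and the choice of threshold is an assumption that depends entirely on what "close enough" means for the test at hand:

```java
final class FuzzyCompare {
    /** Classic dynamic-programming Levenshtein distance, using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // substitute or keep
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Hypothetical "close enough" check: case-insensitive, at most maxEdits edits. */
    static boolean closeEnough(String expected, String actual, int maxEdits) {
        return levenshtein(expected.toLowerCase(), actual.toLowerCase()) <= maxEdits;
    }
}
```

In a test this could be used as `Assert.assertTrue(FuzzyCompare.closeEnough(titleText, t, 10));`, with the limit tuned to the messages being compared.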
I am creating a regular expression to evaluate if an IP address is a valid multicast address. This validation is occurring in real time while you type (if you type an invalid / out of range character it is not accepted) so I cannot simply evaluate the end result against the regex. The problem I am having with it is that it allows for a double period after each group of numbers (224.. , 224.0.., 224.0.0.. all show as valid).
The code below is a static representation of what's happening. Somehow 224.. is showing as a legal value. I've tested this regex online (non-java'ized: ^2(2[4-9]|3\d)(.(25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)){3}$ ) and it works perfectly and does not accept the invalid input I'm describing.
Pattern p = Pattern.compile("^2(2[4-9]|3\\d)(\\.(25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)){3}$");
Matcher m = p.matcher("224..");
if (!m.matches() && !m.hitEnd()) {
System.out.println("Invalid");
} else {
System.out.println("Valid");
}
It seems that the method m.hitEnd() is evaluating to true whenever I input 224.. which does not make sense to me.
If someone could please look this over and make sure I'm not making any obvious mistake and maybe explain why hitEnd() is returning true in this case I'd appreciate it.
Thanks everyone.
After doing some evaluating myself (after discovering this was on Android), I realized that the same code responds differently on Dalvik than it does on a regular JVM.
The code is:
Pattern p = Pattern.compile("^2(2[4-9]|3\\d)(\\.(25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)){3}$");
Matcher m = p.matcher("224..");
if (!m.matches() && !m.hitEnd()) {
System.out.println("Invalid");
} else {
System.out.println("Valid");
}
This code (albeit modified a bit), prints Valid on Android and Invalid on the JVM.
I do not know how you tested your regex, but it does not look correct according to your description.
Your regex requires all 4 sections of digits. There is no chance it will match 224..
Only [0-1] and \d are marked with a question mark and are therefore optional.
So, without dealing with the details of which specific digits are permitted, I'd suggest something like this:
^\\d{1,3}\\.(\\d{0,3}\\.)?(\\d{0,3}\\.)?(\\d{0,3}\\.)?$
And you do not have to use hitEnd(): the $ at the end is enough. And do not use matches(). Use find() instead. matches() is like find() but adds ^ and $ automatically.
I just tested out your code and m.hitEnd() evaluates to false for me, and I am receiving invalid...
So I'm not really sure what the problem here is?
I reported bug 20625 in Dalvik. In the interim, you don't need to use hitEnd(), having the $ suffix should be sufficient.
public void testHitEnd() {
String text = "b";
String pattern = "^aa$";
Matcher matcher = Pattern.compile(pattern).matcher(text);
assertFalse(matcher.matches());
assertFalse(matcher.hitEnd());
}
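For reference, the "valid so far" check from the question can be sketched as below for a standard JVM, where hitEnd() reports whether the matcher ran off the end of the input (so more typing could still produce a match). The class and method names are hypothetical; per the answers above, Dalvik at the time did not behave this way:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MulticastInput {
    private static final Pattern MULTICAST = Pattern.compile(
            "^2(2[4-9]|3\\d)(\\.(25[0-5]|2[0-4]\\d|[0-1]?\\d?\\d)){3}$");

    /**
     * True if the text typed so far is a complete multicast address, or could
     * still become one with more input (the matcher hit the end of the text).
     */
    static boolean acceptableSoFar(String typed) {
        Matcher m = MULTICAST.matcher(typed);
        return m.matches() || m.hitEnd();
    }
}
```

On a standard JVM, `acceptableSoFar("224.")` is true because a digit could still follow, while `acceptableSoFar("224..")` is false because the match fails at the second period before the end of the input is reached.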
I am doing some fairly extensive string manipulations using regular expressions in Java. Currently, I have many blocks of code that look something like:
Matcher m = Pattern.compile("some pattern").matcher(text);
StringBuilder b = new StringBuilder();
int prevMatchIx = 0;
while (m.find()) {
b.append(text.substring(prevMatchIx, m.start()));
String matchingText = m.group(); //sometimes group(n)
//manipulate the matching text
b.append(matchingText);
prevMatchIx = m.end();
}
text = b.toString()+text.substring(prevMatchIx);
My question is which of the two alternatives is more efficient (primarily time, but space to some extent):
1) Keep many existing blocks as above (assuming there isn't a better way to handle such blocks -- I can't use a simple replaceAll() because the groups must be operated on).
2) Consolidate the blocks into one big block. Use a "some pattern" that is the combination of all the old blocks' patterns using the |/alternation operator. Then, use if/else if within the loop to handle each of the matching patterns.
Thank you for your help!
If the order in which the replacements are made matters, you would have to be careful when using technique #1. Allow me to give an example: If I want to format a String so it is suitable for inclusion in XML, I have to first replace all & with &amp; and then make the other replacements (like < to &lt;). Using technique #2, you would not have to worry about this because you are making all the replacements in one pass.
In terms of performance, I think #2 would be quicker because you would be doing fewer String concatenations. As always, you could implement both techniques and record their speed and memory consumption to find out for certain. :)
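The ordering hazard with sequential replacements is easy to reproduce. A small sketch (hypothetical class and method names) of the XML case:

```java
final class EscapeOrder {
    // Wrong order: '<' is escaped first, then the '&' of the new "&lt;" gets escaped again.
    static String wrongOrder(String s) {
        return s.replace("<", "&lt;").replace("&", "&amp;");
    }

    // Right order: escape '&' before any replacement introduces new '&' characters.
    static String rightOrder(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;");
    }

    public static void main(String[] args) {
        System.out.println(wrongOrder("a<b&c")); // a&amp;lt;b&amp;c
        System.out.println(rightOrder("a<b&c")); // a&lt;b&amp;c
    }
}
```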
I'd suggest caching the patterns and having a method that uses the cache.
Patterns are expensive to compile so at least you will only compile them once and there is code reuse in using the same method for each instance. Shame about the lack of closures though as that would make things a lot cleaner.
private static Map<String, Pattern> patterns = new HashMap<String, Pattern>();
static Pattern findPattern(String patStr) {
if (! patterns.containsKey(patStr))
patterns.put(patStr, Pattern.compile(patStr));
return patterns.get(patStr);
}
public interface MatchProcessor {
public void process(String field);
}
public static void processMatches(String text, String pat, MatchProcessor processor) {
Matcher m = findPattern(pat).matcher(text);
int startInd = 0;
while (m.find(startInd)) {
processor.process(m.group());
startInd = m.end();
}
}
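On Java 8 and later, the same caching idea can be written more compactly, and made safe for concurrent use, with ConcurrentHashMap.computeIfAbsent. This is a sketch under that assumption; the class and method names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

final class PatternCache {
    private static final Map<String, Pattern> CACHE = new ConcurrentHashMap<>();

    private PatternCache() {}

    /** Compiles each distinct regex string at most once and reuses the Pattern. */
    static Pattern get(String regex) {
        return CACHE.computeIfAbsent(regex, Pattern::compile);
    }
}
```

Pattern instances are immutable and safe to share between threads; only the Matcher objects created from them must stay thread-confined.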
Last time I was in your position I used a product called jflex.
Java's regex doesn't provide the traditional O(N log M) performance guarantees of true regular expression engines (for input strings of length N, and patterns of length M). Instead, it inherits exponential time for some patterns from its Perl roots. Unfortunately these pathological patterns, while rare in normal use, are all too common when combining regexes as you propose to do (I can attest to this from personal experience).
Consequently, my advice is to either:
a) pre-compile your patterns as "static final Pattern" constants, so they will be initialized once during <clinit>; or
b) switch to a lexer package such as jflex, which will provide a more declarative, and far more readable, syntax to approach this sort of cascading/sequential regex processing; and
c) seriously consider using a parser generator package. My current favourite is Beaver, but CUP is also a good option. Both of these are excellent tools and I highly recommend both of them, and as they both sit on top of jflex you can add them as/when you need them.
That being said, if you haven't used a parser-generator before and you are in a hurry, it will be easier to get up to speed with JavaCC. Not as powerful as Beaver/CUP but its parsing model is easier to understand.
Whatever you do, please don't use Antlr. It is very fashionable, and has great cheerleaders, but its online documentation sucks, its syntax is awkward, its performance is poor, and its scannerless design makes several common simple cases painful to handle. You would be better off using an abomination like sablecc(v1).
Note: Yes I have used everything I have mentioned above, and more besides; so this advice comes from personal experience.
First, does this need to be efficient? If not, don't bother -- complexification won't help code maintainability.
Assuming it does, doing them separately is usually the most efficient. This is especially true if there are large blocks of text in the expressions: without alternation, the engine can use them to speed up matching; with alternation, they can't help at all.
If performance is really critical, you can code it several ways and test with sample data.
Option #2 is almost certainly the better way to go, assuming it isn't too difficult to combine the regexes. And you don't have to implement it from scratch, either; the lower-level API that replaceAll() is built on (i.e., appendReplacement() and appendTail()), is also available for your use.
Taking the example that @mangst used, here's how you might process some text to be inserted into an XML document:
import java.util.regex.*;
public class Test
{
public static void main(String[] args)
{
String test_in = "One < two & four > three.";
Pattern p = Pattern.compile("(&)|(<)|(>)");
Matcher m = p.matcher(test_in);
StringBuffer sb = new StringBuffer(); // (1)
while (m.find())
{
String repl = m.start(1) != -1 ? "&amp;" :
m.start(2) != -1 ? "&lt;" :
m.start(3) != -1 ? "&gt;" : "";
m.appendReplacement(sb, ""); // (2)
sb.append(repl);
}
m.appendTail(sb);
System.out.println(sb.toString());
}
}
In this very simple example, all I need to know about each match is which capture group participated in it, which I find out by means of the start(n) method. But you can use the group() or group(n) method to examine the matched text, as you mentioned in the question.
Note (1) We have to use a StringBuffer here because StringBuilder didn't exist yet when the Matcher class was written. The StringBuilder overloads of appendReplacement() and appendTail() were only added in Java 9.
Note (2) appendReplacement(StringBuffer, String) processes the String argument to replace any $n sequence with the contents of the n'th capture group. We don't want that to happen, so we pass it an empty string and then append() the replacement string ourselves.
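On Java 9 and later, the same single-pass replacement can be sketched with Matcher.replaceAll(Function<MatchResult, String>), which hides the appendReplacement()/appendTail() loop. The class and method names below are hypothetical. The function's result is still treated as a replacement string, so it is passed through Matcher.quoteReplacement() to keep any `$` or `\` literal:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlEscape {
    private static final Pattern SPECIAL = Pattern.compile("[&<>]");

    private XmlEscape() {}

    /** Escapes &, < and > in a single pass over the text (requires Java 9+). */
    static String escape(String text) {
        return SPECIAL.matcher(text)
                .replaceAll(match -> Matcher.quoteReplacement(entityFor(match.group())));
    }

    // Maps one matched character to its XML entity.
    private static String entityFor(String ch) {
        switch (ch) {
            case "&": return "&amp;";
            case "<": return "&lt;";
            default:  return "&gt;"; // the pattern only matches &, < or >
        }
    }
}
```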