Regular Expressions match randomly instead of around quotes in Java

Regular Expressions match randomly instead of around quotes in Java - java

I am writing a program in Java, using Regular expressions, and have run into an error. What I am trying to do, is basically make a programming language, and parse it line by line. Where I am going wrong, is when it tries to find any strings. The thing is, is that I have to have it in the order of identifiers, strings, then integers, but I can have the identifiers find strings. Strings are defined by having double quotes around them. Here is where I have a test, and my expression: here, or here, if you do not want to go to the link:
[^"]([^\W][a-zA-Z0-9]+)[^"]
I cannot show my Java code, because it is all over the place, with the way I programmed it. It should just be the expression, and that's it.

It would be helpful if you can explain more what exactly you are trying to match. E.g. give some example texts and what your expression currently outputs for them.
At the moment I think you are trying to match Strings, text that is surrounded by ". For example foofoo"text123"barbar and your desired output is text123.
If defining a regular expression in Java, you need to escape special characters like ". Here is a Java-usable version for the Regex you have provided:
Pattern pattern = Pattern.compile("[^\"]([^\\W][a-zA-Z0-9]+)[^\"]");
You may then use the Pattern object together with a Matcher object to find your text. Here's the Java-Doc for Pattern.
Here is a Pattern that matches text surrounded by ":
Pattern pattern = Pattern.compile("\"[^\"]*\"");

Related

java 8 regular expression for meta characters [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 3 years ago.
Trying to write a regular expression to check if the sentence as metacharacters "I need to make payment of $50 for the purchase, should i use CASH|CC". In this sentence i need to identify if metacharacters are present.
\\\\$ or ^(\\\\$)\\$. What is the right syntax for Pattern.matches("^([\\\\$]$)", text); to identify the special characters. I don't need to replace just identify if the sentence contains these characters.

If you want to know whether a string contains meta characters, you can use some like this:
boolean hasIt = sentence.chars().anyMatch(c -> "\\.[]{}()*+?^$|".indexOf(c) >= 0);
By not using the Regex engine, you don’t need to quote the characters which have a special meaning to it.
Using Pattern.matches creates three unnecessary obstacles to the task. First, you have to quote all characters correctly, then, you need a regex construct to turn the characters into alternatives, e.g. [abc] or a|b|c, third, matches checks whether the entire string matches the pattern, rather than contains an occurrences, so you’d need something like .*pattern.* to make matches to behave like find, if you insist on it.
Which leads to the xy-problem of this task. It’s not clear which metacharacters you actually want to check and why you need this information in the first place.
If you want to search for this sentence within another text, just use Pattern.compile(sentence, Pattern.LITERAL) to disable interpretation of meta characters. Or Pattern.quote(sentence) when you want to assemble a pattern containing the sentence.
But if you don’t want to search for it, this information has no relevance. Note that “Is this a meta character?” may lead to a different answer than “Does it need quoting?”. Even this tutorial combines these questions in a misleading way. At two close places it names the metacharacters and describes the quoting syntax, leading to the wrong impression that all of these characters need quoting.
For example, - only has a special meaning within a character class, so if there is no character class, which you detect by the presence of [, the - does not imply the presence of metacharacters. But while - truly needs quoting within the character class, the characters = and ! are metacharacters only in a certain context, which requires a metacharacter, so they never require quoting.
But if you are trying to check for a metacharacter to decide whether to use the Regex engine or to perform a plain text search, e.g. via String.indexOf, you are performing premature optimization. This is not only a waste of development effort, optimizing before you even have an actual code you could measure often leads to the opposite result. Performing a pattern matching using the Regex engine with a string containing no metacharacters can lead to a more efficient search than a plain indexOf on the String. In the reference implementation, the Regex engine uses the Boyer Moore algorithm while the plaintext search methods on String use a naive search.

Edit: As mentioned by commenters Andreas and Holger, the meta characters used by regular expressions are sometimes depending on a syntactical subdefinition, like character classes, specific sequences (lookahead, lookbehind,...) and are therefore not intrinsically metacaracters per se. Some are only meta characters in a specific context. However the answer provided here will include all possible meta characters, with the exception of the operators that only become meta characters when prefixed by \. However, this means, that sometimes characters will be matched, in locations where they are not actually meta characters.
This question has half the answer: List of all special characters that need to be escaped in a regex
You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The Java regular expression system exposes no character class for it's own special characters (regrettably).
Special constructs (named-capturing and non-capturing)
(?X) X, as a named-capturing group
(?:X) X, as a non-capturing group
(?idmsuxU-idmsuxU) Nothing, but turns match flags i d m s u x U on - off
(?idmsux-idmsux:X) X, as a non-capturing group with the given flags i d m s u x on - off
(?=X) X, via zero-width positive lookahead
(?!X) X, via zero-width negative lookahead
This block alone contains a lot (though not all) of the meta characters. The last two rows of the citation I had ot leave out, because the character sequences confused the parser of this page.
I would suggest the following:
public static final Pattern META_CHARS = Pattern.compile("[\\\\\\]\\[(){}\\-!$?*+<>\\:\\.\\=\\,\\|^]");
But be aware, that this list might very well be incomplete, and that this contains typical characters such as , and . which are part of the regex syntax. So you probably got a lot of escaping to do...
From there you can:
Matcher metaDetector = META_CHARS.matcher(stringToTest);
if (metaDetector.find()) {
// this is the found meta character...
String metaCharacter = metaDetector.group(0);
System.out.print(metaCharacter);
}
And if you want to find all meta characters, then make a while out of if in the above code snippet. If you do, for the line "I need to make \\payment{[ of $50 for !!the purc\"hase, sh###ould i use CASH|CC." you receive \{[$!!,|., which is correct, as # and " are not meta characters in regex.
As Andreas correctly mentions, the exact pattern can be reduced to "[\\\\\\]\\[(){}^$?*+.|]", because this will tell you, whether or not at least one meta character is present. However this might miss some meta characters, if multiple are present. If this is not important, then the shorter chain is sufficient.

Finding whole word only in Java string search

I'm running into the problem of finding a searched pattern within a larger pattern in my Java program. For example, I'll try and find all for loops, but will stumble upon formula. Most of the suggestions I've found talk about using regular expression searches like
String regex = "\\b"+keyword+"\\b";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(searchString);
or some variant of this. The issue I'm running into is that I'm crawling through code, not a book-like text where there are spaces on either side of every word. For example, this will miss for(, which I would like to find. Is there another clever way to find whole words only?
Edit: Thanks for the suggestions. How about cases in which there the keyword starts on the first entry of the string? For example,
class Vec {
public:
...
};
where I'm searching for class (or alternatively public). The patterns suggested by Thanga, Austin Lee, npinti, and Kai Iskratsch do not work in this case. Any ideas?

In your case, the issue is that the \b flag will look for punctuation marks, white spaces and the beginning or end of the string. An opening bracket does not fall within any of these categories, and is thus omitted.
The easiest way to fix this would be to replace "\\b"+keyword+"\\b" with "[\\b(]"+keyword+"[\\b)]".
In regex syntax, the square brackets denote a set of which the regex engine will attempt to match any character it contains.
As per this previous SO question, it would seem that \b and [\b] are not the same. Whilst \b represents a word boundary, [\b] represents a backspace character. To fix this, simply replace "\\b"+keyword+"\\b" with "(\b|\()"+keyword+"(\b|\))".

Regex should match 0 or more chars. The below code change will fix the issue
String regex = ".*("+keyword+").*";

You could modify your regex to search for multiple characters afterwords, for example
[^\w]+"for"+[^\w] using the Pattern class in Java.
For your reference:
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Basically you will have to adapt your regex to all the possible patterns it can find. But considering your actually dealing with code, you are better of building a parser/tokenizer for that language, or using one that already exists. Then all you have to do is run through the tokens to find the the ones you want.

Regex which matches a string containing at least the specified characters

I have a huge dictionary which I'm trying to look through using a regex. What I would like to do is to find all the words in the dictionary which contain at least one occurrences of each character I provide in no particular order.
Right now I can find words which only contain the specified characters but like I said that is not exactly what I want.
Example:
I want at least one occurrence of each of the following characters {b, a, d}
astring.matches(regex)
I would expect words like:
badder,
baddest,
baffled
Notice they all contain at least one occurence of each character but in no particular order and other characters are present in the strings.
Anyone know how to do this? Other suggestions are also welcome!

You need a series of look-aheads:
^(?=.*b)(?=.*a)(?=.*d).*
which is a pain to construct. However, you can ease the pain by using regex to build it:
String regex = "^" + "bad".replaceAll(".", "(?=.*$0)") + ".*";
If using repeatedly with String.matches(), you would be better to use the following code, because every call to String.matches() compiles the regex again (there is no caching):
// do this once
Pattern pattern = Pattern.compile(regex);
// reuse the pattern many times
if (pattern.matcher(input).matches())

You can use a lookahead to do this if it's available
(?=.*b)(?=.*a)(?=.*d)
However this is quite inefficient. Any reason you can't use multiple String.indexOf checks?

Regular Expression - Return all matches as a single match

I'm working with a piece of code that applies a regex to a string and returns the first match. I don't have access to modify the code to return all matches, nor do I have the ability to implement alternative code.
I have the following example target string:
usera,userb,,userc,,userd,usere,userf,
This is a list of comma delimited usernames joined from multiple sources, some of which were blank resulting in two commas in some places. I'm trying to write a regex that will return all of the comma delimited usernames except for specific values.
For example, consider the following expression:
[^,]\w{1,},(?<!(userb|userc|userd),)
This results in three matches:
usera,
usere,
userf,
Is there any way to get these results as a single match, instead of a match collection, e.g. a single match having the text 'usera,usere,userf,' ?
If I could write code in any language this would be trivial, but I'm limited to input of only the target string and the pattern, and I need a single match that has all items except for the ones I'm omitting. I'm not sure if this is even possible, everything I've ever done with regex's involves processing multiple items in a match collection.
Here is an example in Regex Coach. This image shows that there are the three matches I want, but my requirement is to have the text in a single match, not three separate matches.
EDIT1:
To clarify this ticket is specifically intended to solve the use case using only regular expression syntax. Solving this problem in code is trivial but solving it using only a regex was the requirement given the fact that the executing code is part of a 3rd party product that I didn't want to reverse engineer, wrap, or replace.

Is there any way to get these results as a single match, instead of a match collection, e.g. a single match having the text 'usera,usere,userf,'?
No. Regex matches are consecutive.
A regular expression matches a (sub)string from start to finish. You cannot drop the middle part, this is not how regex engines work. But you can apply the expression again to find another matching substring (incremental search - that's what Regex Coach does). This would result in a match collection.
That being said, you could also just match everything you don't want to keep and remove it, e.g.
,(?=[\s,]+)|(userb|userc|userd)[\s,]*
http://rubular.com/r/LOKOg6IeBa

Differences in RegEx syntax between Python and Java

I have a working regex in Python and I am trying to convert to Java. It seems that there is a subtle difference in the implementations.
The RegEx is trying to match another reg ex. The RegEx in question is:
/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)
One of the strings that it is having problems on is: /\s+/;
The reg ex is not supposed to be matching the ending ;. In Python the RegEx works correctly (and does not match the ending ;, but in Java it does include the ;.
The Question(s):
What can I do to get this RegEx working in Java?
Based on what I read here there should be no difference for this RegEx. Is there somewhere a list of differences between the RegEx implementations in Python vs Java?

Java doesn't parse Regular Expressions in the same way as Python for a small set of cases. In this particular case the nested ['s were causing problems. In Python you don't need to escape any nested [ but you do need to do that in Java.
The original RegEx (for Python):
/(\\.|[^[/\\\n]|\[(\\.|[^\]\\\n])*])+/([gim]+\b|\B)
The fixed RegEx (for Java and Python):
/(\\.|[^\[/\\\n]|\[(\\.|[^\]\\\n])*\])+/([gim]+\b|\B)

The obvious difference b/w Java and Python is that in Java you need to escape a lot of characters.
Moreover, you are probably running into a mismatch between the matching methods, not a difference in the actual regex notation:
Given the Java
String regex, input; // initialized to something
Matcher matcher = Pattern.compile( regex ).matcher( input );
Java's matcher.matches() (also Pattern.matches( regex, input )) matches the entire string. It has no direct equivalent in Python. The same result can be achieved by using re.match( regex, input ) with a regex that ends with $.
Java's matcher.find() and Python's re.search( regex, input ) match any part of the string.
Java's matcher.lookingAt() and Python's re.match( regex, input ) match the beginning of the string.
For more details also read Java's documentation of Matcher and compare to the Python documentation.
Since you said that isn't the problem, I decided to do a test: http://ideone.com/6w61T
It looks like java is doing exactly what you need it to (group 0, the entire match, doesn't contain the ;). Your problem is elsewhere.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.