Regex Lookahead and Lookbehinds: followed by this or that - java

I'm trying to write a regular expression that checks ahead to make sure there is either a whitespace character OR an opening parenthesis after the words I'm searching for.
Also, I want it to look back and make sure it is preceded by either a non-Word (\W) or nothing at all (i.e. it is the beginning of the statement).
So far I have,
"(\\W?)(" + words.toString() + ")(\\s | \\()"
However, this also matches the stuff at either ends - I want this pattern to match ONLY the word itself - not the stuff around it.
I'm using Java flavor Regex.

As you tagged your question yourself, you need lookarounds:
String regex = "(?<=\\W|^)(" + Pattern.quote(words.toString()) + ")(?= |[(])";
(?<=X) means "preceded by X"
(?<!X) means "not preceded by X"
(?=X) means "followed by X"
(?!X) means "not followed by X"
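To see that lookarounds keep the surrounding characters out of the match, here is a minimal sketch. The class name LookaroundDemo, the method findWord, and the sample input are made up for illustration; only the regex itself comes from the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundDemo {
    // Collects "matchedText@startOffset" for every occurrence of `word` that is
    // preceded by a non-word character (or the start of input) and followed by
    // a space or an opening parenthesis.
    static List<String> findWord(String word, String input) {
        String regex = "(?<=\\W|^)" + Pattern.quote(word) + "(?= |[(])";
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            // m.group() is the word alone; the lookarounds consume nothing.
            hits.add(m.group() + "@" + m.start());
        }
        return hits;
    }

    public static void main(String[] args) {
        // "sprint" fails the lookbehind, "printer" fails the lookahead.
        System.out.println(findWord("print", "print(x); sprint (y); print z; printer"));
    }
}
```

Only the two standalone occurrences of print are reported, and each match is exactly five characters long.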

What about the word itself: will it always start with a word character (i.e., one that matches \w)? If so, you can use a word boundary for the leading condition.
"\\b" + theWord + "(?=[\\s(])"
Otherwise, you can use a negative lookbehind:
"(?<!\\w)" + theWord + "(?=[\\s(])"
I'm assuming the word is either quoted like so:
String theWord = Pattern.quote(words.toString());
...or doesn't need to be.
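The difference between the two leading conditions shows up when the word starts with a non-word character. A small sketch, where the class LeadingBoundaryDemo, its method names, and the sample word "#if" are assumptions chosen for illustration:

```java
import java.util.regex.Pattern;

class LeadingBoundaryDemo {
    // Leading condition expressed as a word boundary.
    static boolean foundWithWordBoundary(String word, String input) {
        String regex = "\\b" + Pattern.quote(word) + "(?=[\\s(])";
        return Pattern.compile(regex).matcher(input).find();
    }

    // Leading condition expressed as "not preceded by a word character".
    static boolean foundWithLookbehind(String word, String input) {
        String regex = "(?<!\\w)" + Pattern.quote(word) + "(?=[\\s(])";
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // A word that starts with a word character: both variants find it.
        System.out.println(foundWithWordBoundary("if", "if (x)"));
        System.out.println(foundWithLookbehind("if", "if (x)"));
        // A word that starts with '#': there is no word boundary between the
        // start of input and '#', so only the lookbehind variant finds it.
        System.out.println(foundWithWordBoundary("#if", "#if (x)"));
        System.out.println(foundWithLookbehind("#if", "#if (x)"));
    }
}
```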

If you don't want a group to be captured by the matching, you can use the special construct (?:X)
So, in your case:
"(?:\\W?)(" + words.toString() + ")(?:\\s|\\()"
You will only have two groups then: group(0) for the whole match (which still includes the surrounding characters, because non-capturing groups consume input) and group(1) for the word you are looking for.
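A short sketch of what group(0) and group(1) contain with this approach. The class GroupDemo, the helper firstMatch, and the sample input are hypothetical, and the pattern is written without literal spaces around the | so that the alternation matches a bare whitespace character or parenthesis:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Returns {group(0), group(1)} for the first match, or null if there is none.
    static String[] firstMatch(String word, String input) {
        Pattern p = Pattern.compile("(?:\\W?)(" + Pattern.quote(word) + ")(?:\\s|\\()");
        Matcher m = p.matcher(input);
        if (!m.find()) {
            return null;
        }
        // group(0) is everything the pattern consumed; group(1) is the word only.
        return new String[] { m.group(0), m.group(1) };
    }

    public static void main(String[] args) {
        String[] groups = firstMatch("foo", "call foo(1)");
        System.out.println("[" + groups[0] + "] [" + groups[1] + "]");
    }
}
```

With this pattern the overall match still covers the leading space and the parenthesis, so read the word from group(1) rather than group().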

Related

Java Regex to match comma delimited list of interfaces

I'm trying to write a Java regex to match a comma delimited list of interfaces. Something like:
Runnable, Serializable, List, Map
There can be zero or more entries in the list. A trailing comma is invalid. Space is optional. I came up with the following, which gets me to one or more entries, and then check for empty:
String validName = "[a-zA-Z_][a-zA-Z0-9_]*";
String regex = validName + "\\s*(,\\s*" + validName + ")*";
if (s.matches(regex) || s.trim().isEmpty())
...
But is there a way to include the "zero entries" condition into the regex?
To make a pattern optional, use a group with a ? quantifier set to it:
String regex = "(?:" + validName + "\\s*(?:,\\s*" + validName + ")*)?";
// note the added (?: at the start and )? at the end
if (s.matches(regex)) {
// ....
}
The ? greedy quantifier matches one or zero occurrences of the pattern it is applied to. (greedy means it prefers to get 1 occurrence rather than 0)
The (?: character sequence opens a non-capturing group. I.E. this is how you use "normal parentheses" for logically grouping sections of regular expressions.
You may add \\s* subpatterns at the start/end of the pattern to allow leading/trailing whitespace.
Try regex = "^$|" + regex;
^$ means "nothing between the beginning of the input and the end of the input". ^$| means "either match nothing or match whatever matches the rest of the regex".
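Putting the optional-group answer together with the suggested leading/trailing \\s*, a validator could look roughly like this. The class InterfaceListValidator and the method isValid are names invented for this sketch; the pattern is the one from the answer with whitespace allowed at both ends.

```java
import java.util.regex.Pattern;

class InterfaceListValidator {
    private static final String VALID_NAME = "[a-zA-Z_][a-zA-Z0-9_]*";

    // The whole list is wrapped in an optional non-capturing group, so the
    // empty string (zero entries) is accepted by the same pattern.
    private static final Pattern LIST = Pattern.compile(
            "\\s*(?:" + VALID_NAME + "\\s*(?:,\\s*" + VALID_NAME + ")*)?\\s*");

    static boolean isValid(String s) {
        // matches() requires the pattern to cover the entire input.
        return LIST.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("Runnable, Serializable, List, Map"));
        System.out.println(isValid(""));
        System.out.println(isValid("Runnable,"));
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which String.matches would do.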

Regex replace word while preserving spaces/punctuation

I am trying to go through a document and change all instances of a name using regular expressions in Java. My code looks something like this:
Pattern replaceWordPattern = Pattern.compile("(^|\\s)" + replaceWord + "^|\\W");
followed by:
String line = matcher.replaceAll("Alice");
The problem is that this does not preserve the spaces or punctuation or other non-word characters that followed. If I had "Jack jumped" it becomes "Alicejumped". Does anyone know a way to fix this?
\W consumes the space after the replaceWord. Replace ^|\\W with the word boundary \\b, which does not consume any characters. Consider doing the same for the first delimiter group, as I suspect you do not want to consume anything there either.
Pattern replaceWordPattern = Pattern.compile("\\b" + replaceWord + "\\b");
If the semantics of word boundaries do not suit you, consider using lookahead and lookbehind constructs, which do not consume input either.
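A minimal sketch of the word-boundary version in use. The class NameReplacer and the method replaceName are made-up names; Pattern.quote and Matcher.quoteReplacement are added here as a precaution in case either name contains regex or replacement metacharacters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameReplacer {
    // Replaces whole-word occurrences of oldName and leaves the surrounding
    // spaces and punctuation untouched, because \b consumes nothing.
    static String replaceName(String text, String oldName, String newName) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(oldName) + "\\b");
        return p.matcher(text).replaceAll(Matcher.quoteReplacement(newName));
    }

    public static void main(String[] args) {
        System.out.println(replaceName("Jack jumped. Jackson, not Jack!", "Jack", "Alice"));
    }
}
```

"Jackson" is left alone because there is no word boundary between "Jack" and "son".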
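You're missing parentheses around the second alternation, and the end-of-input anchor there should be $ rather than ^:
Pattern replaceWordPattern = Pattern.compile("(^|\\s)" + replaceWord + "($|\\W)");
Both groups still consume the delimiters, so put them back in the replacement, e.g. matcher.replaceAll("$1Alice$2").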

Regex and lookahead : java

I'm trying to remove punctuation except dots (to keep the sentence structure) from a String with regex
Actually, I have no clue how it works; I just wrote this:
public static String removePunctuation(String s){
s = s.replaceAll("(?!.)\\p{Punct}" , " ");
return s;
}
I found that we could use a "negative lookahead" for this kind of problem, but when I run this code, it doesn't erase anything. The negative lookahead cancelled out the \p{Punct} regex.
The unescaped dot matches anything (except newlines). You need at least
s = s.replaceAll("(?!\\.)\\p{Punct}" , " ");
but for that sort of thing I'd much rather use a character class (within which the dot is no longer a metacharacter and therefore doesn't need to be escaped):
s = s.replaceAll("[^\\P{Punct}.]" , " ");
Explanation:
[^abc] matches any character that's not an a, b, or c.
[^\P{Punct}] matches any character that's "not a not a" punctuation character, effectively matching identically to \p{Punct}.
[^\P{Punct}.] therefore matches any character that's a punctuation character except a dot.
The . character has special meaning in regular expressions. It essentially means 'any character except new lines' (unless the DOTALL flag is specified, in which case it means 'any character'), so your pattern will match 'any punctuation character that is also a new line character'. In other words, it will never match anything.
If you want it to mean a literal . character, you need to escape it like this:
s = s.replaceAll("(?!\\.)\\p{Punct}" , " ");
Or wrap it in a character class, like this:
s = s.replaceAll("(?![.])\\p{Punct}" , " ");
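The lookahead form and the negated character class form can be compared side by side. The class PunctuationStripper and its method names are invented for this sketch; the two patterns are the ones given in the answers above.

```java
class PunctuationStripper {
    // Negative lookahead: a punctuation character that is not a literal dot.
    static String viaLookahead(String s) {
        return s.replaceAll("(?!\\.)\\p{Punct}", " ");
    }

    // Negated class: anything that is neither non-punctuation nor a dot.
    static String viaCharClass(String s) {
        return s.replaceAll("[^\\P{Punct}.]", " ");
    }

    public static void main(String[] args) {
        String input = "Hello, world! It's 3.14.";
        System.out.println(viaLookahead(input));
        System.out.println(viaCharClass(input));
    }
}
```

Both calls replace the comma, the exclamation mark, and the apostrophe with a space and keep the two dots.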

java performance issue - regular expression VS internal String method

I'm having the following issue:
I have some string somewhere in my application that I want to check - the check is whether this string contains a character that is different than " "(white space), /n and /r
For example:
" g" - Contains
" /n " - Not Contains
" " - Not Contains
I want to do it in a reg expression, but I don't want to use the common pattern .*[a-zA-Z0-9]+.* . Instead, I want something like .*[!" ""/n"/r"]. (every character that is different than " " "/r" and "n").
My problems are that
I don't know if this pattern is valid (the above isn't working)
I'm not sure if it would be much faster than using the regular String methods.
Firstly, you mean \n and \r, and in Java this means escaping the backslash as well with \\n and \\r.
Secondly, if you merely mean to catch any non-whitespace, just use the pattern \\S or [^\\s] (searched for with find() rather than matches()). \S is non-whitespace, \s is whitespace, and [^<charset>] means "match anything that isn't one of these."
Thirdly, if this is a repeated check, be sure to only compile the regex once then use it multiple times.
Fourthly, follow the usual strategy for profiling. First, is this on a critical path in your application? If so, benchmark it yourself.
Here's something that does exactly what you want, but it will likely be faster to go over the characters yourself:
Pattern NOT_WHITESPACE_DETECTOR = Pattern.compile("[^ \\n\\r]");
Matcher m = NOT_WHITESPACE_DETECTOR.matcher(" \n \r bla ");
if (m.find()) {
//string contains a non-white-space
}
Also note that the definition of whitespace in Java is much wider than the one you specified, and even then there are whitespace characters in Unicode that Java doesn't detect (there are libraries that do, however).
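For comparison, here is a sketch of the regex check next to a plain loop over the characters. The class NonBlankCheck and its method names are hypothetical, and no timing claim is made here; benchmark both in your own application before choosing.

```java
import java.util.regex.Pattern;

class NonBlankCheck {
    // Compiled once and reused for every call.
    private static final Pattern NOT_BLANK = Pattern.compile("[^ \\n\\r]");

    static boolean viaRegex(String s) {
        return NOT_BLANK.matcher(s).find();
    }

    // Same check without a regex: stop at the first character that is not a
    // space, a line feed, or a carriage return.
    static boolean viaLoop(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != ' ' && c != '\n' && c != '\r') {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(viaRegex(" g") + " " + viaLoop(" g"));
        System.out.println(viaRegex(" \n ") + " " + viaLoop(" \n "));
    }
}
```

Note that a tab counts as "contains" in both versions, since only the space, \n, and \r were listed in the question.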

Regular expression to match strings enclosed in square brackets or double quotes

I need 2 simple reg exps that will:
Match if a string is contained within square brackets ([] e.g [word])
Match if string is contained within double quotes ("" e.g "word")
\[\w+\]
"\w+"
Explanation:
The \[ and \] escape the special bracket characters to match their literals.
The \w means "any word character", usually considered same as alphanumeric or underscore.
The + means one or more of the preceding item.
The " are literal characters.
NOTE: If you want to ensure the whole string matches (not just part of it), prefix with ^ and suffix with $.
And next time, you should be able to answer this yourself, by reading regular-expressions.info
Update:
Ok, so based on your comment, what you appear to be wanting to know is if the first character is [ and the last ] or if the first and last are both " ?
If so, these will match those:
^\[.*\]$ (or ^\\[.*\\]$ in a Java String)
^".*"$ (or ^\".*\"$ in a Java String)
However, unless you need to do some special checking with the centre characters, you can simply do:
if ( MyString.startsWith("[") && MyString.endsWith("]") )
and
if ( MyString.startsWith("\"") && MyString.endsWith("\"") )
which I suspect would be faster than a regex.
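Both variants can be sketched in Java as follows. The class DelimiterCheck and its method names are made up for this example. The length check in the plain versions is an addition of this sketch: without it, a string consisting of a single " would pass the startsWith/endsWith test, while the regex requires two separate delimiters.

```java
import java.util.regex.Pattern;

class DelimiterCheck {
    private static final Pattern BRACKETED = Pattern.compile("^\\[.*\\]$");
    private static final Pattern QUOTED = Pattern.compile("^\".*\"$");

    static boolean isBracketedRegex(String s) {
        return BRACKETED.matcher(s).find();
    }

    static boolean isQuotedRegex(String s) {
        return QUOTED.matcher(s).find();
    }

    // Plain String methods; length >= 2 ensures the opening and closing
    // delimiters are two distinct characters.
    static boolean isBracketedPlain(String s) {
        return s.length() >= 2 && s.startsWith("[") && s.endsWith("]");
    }

    static boolean isQuotedPlain(String s) {
        return s.length() >= 2 && s.startsWith("\"") && s.endsWith("\"");
    }

    public static void main(String[] args) {
        System.out.println(isBracketedRegex("[word]") + " " + isBracketedPlain("[word]"));
        System.out.println(isQuotedRegex("\"word\"") + " " + isQuotedPlain("\"word\""));
    }
}
```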
Important issues that may make this hard/impossible in a regex:
Can [] be nested (e.g. [foo [bar]])? If so, then a traditional regex cannot help you. Perl's extended regexes can, but it is probably better to write a parser.
Can [, ], or " appear escaped (e.g. "foo said \"bar\"") in the string? If so, see How can I match double-quoted strings with escaped double-quote characters?
Is it possible for there to be more than one instance of these in the string you are matching? If so, you probably want to use the non-greedy quantifier modifier (i.e. ?) to get the smallest string that matches: /(".*?"|\[.*?\])/g
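The effect of the non-greedy quantifier in the last point can be seen in a Java translation of that pattern. The class DelimitedExtractor, its method names, and the sample input are assumptions made for this sketch; nesting and escaped delimiters are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimitedExtractor {
    // Non-greedy: each match stops at the nearest closing delimiter.
    private static final Pattern LAZY = Pattern.compile("\".*?\"|\\[.*?\\]");
    // Greedy: each match runs to the last closing delimiter on the line.
    private static final Pattern GREEDY = Pattern.compile("\".*\"|\\[.*\\]");

    private static List<String> collect(Pattern p, String s) {
        List<String> found = new ArrayList<>();
        Matcher m = p.matcher(s);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    static List<String> extractLazy(String s) {
        return collect(LAZY, s);
    }

    static List<String> extractGreedy(String s) {
        return collect(GREEDY, s);
    }

    public static void main(String[] args) {
        String input = "say \"hi\" to [bob] and \"amy\"";
        System.out.println(extractLazy(input));   // three separate matches
        System.out.println(extractGreedy(input)); // one match spanning all three
    }
}
```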
Based on comments, you seem to want to match things like "this is a "long" word"
#!/usr/bin/perl
use strict;
use warnings;
my $s = 'The non-string "this is a crazy "string"" is bad (has own delimiter)';
print $s =~ /^.*?(".*").*?$/, "\n";
Are they two separate expressions?
\[[A-Za-z]+\]
\"[A-Za-z]+\"
If they are in a single expression:
[[\"]+[a-zA-Z]+[]\"]+
Remember that in .net you'll need to escape the double quotes " by ""
