Match exact word using regex in java

I need a regular expression to match an exact word.
For example:
Given the string "draft guidance allerg Excellence", I want to search for allerg, so I wrote \ballerg\b. That gives me an exact match. But when I pass the string "draft guidance 12=allerg Excellence", it also returns true, which is wrong.
Which regular expression do I need to match only exact words?

The \b boundary would normally handle this situation, even in your case of "draft guidance 12=allerg Excellence"; however, you're saying that the = is part of the word (in normal English, this is not the case).
I'm assuming then that by "whole word", you mean a word that is surrounded by a space or normal sentence punctuation. In this case, a regex such as the following should work:
(?:^|[\s\.;\?\!,])allerg(?:$|[\s\.;\?\!,])
You can, obviously, add or remove characters as needed.
Regex Explained:
(?: # non-capturing group
^ # beginning of string
| [\s\.;\?\!,] # OR a space, period, and other misc. punctuation
)
allerg # string to match
(?: # non-capturing group
$ # end of string
| [\s\.;\?\!,] # OR a space, period, and other misc. punctuation
)
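
As a rough Java sketch of the difference (the class and method names here are made up for illustration and do not come from the question), the delimiter-based pattern can be compared with \b on the two sample strings. In Java source every backslash of the regex has to be doubled, and the punctuation inside [...] does not strictly need escaping:

```java
import java.util.regex.Pattern;

class ExactWordDemo {
    // \b treats '=' as a non-word character, so there is still a boundary before "allerg".
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\ballerg\\b");

    // The word must be preceded and followed by start/end of input,
    // whitespace, or one of the listed punctuation characters.
    static final Pattern DELIMITED =
            Pattern.compile("(?:^|[\\s.;?!,])allerg(?:$|[\\s.;?!,])");

    static boolean foundWithBoundary(String text) {
        return WORD_BOUNDARY.matcher(text).find();
    }

    static boolean foundDelimited(String text) {
        return DELIMITED.matcher(text).find();
    }

    public static void main(String[] args) {
        String plain = "draft guidance allerg Excellence";
        String withEquals = "draft guidance 12=allerg Excellence";

        System.out.println(foundWithBoundary(plain));      // true
        System.out.println(foundWithBoundary(withEquals)); // true (the unwanted hit)
        System.out.println(foundDelimited(plain));         // true
        System.out.println(foundDelimited(withEquals));    // false
    }
}
```

Because the delimiters are part of the match, find() is used here only to test for presence; extracting the word itself would need a capturing group or lookarounds.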

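If I understood the question correctly, you want to match the word "allerg". A word is enclosed by whitespace characters, and "=allerg" has the "=" character, which you don't want to match.
To match the word "allerg" you can use the following regex:
\s+allerg\s+
Note that this requires whitespace on both sides, so it does not match when the word is at the very start or end of the string.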

Related

RegEx matching $1 tokens [duplicate]

Imagine you are trying to pattern match "stackoverflow".
You want the following:
this is stackoverflow and it rocks [MATCH]
stackoverflow is the best [MATCH]
i love stackoverflow [MATCH]
typostackoverflow rules [NO MATCH]
i love stackoverflowtypo [NO MATCH]
I know how to parse out stackoverflow if it has spaces on both sides using:
/\s(stackoverflow)\s/
Same if it's at the start or end of a string:
/^(stackoverflow)\s/
/\s(stackoverflow)$/
But how do you specify "space or end of string" and "space or start of string" using a regular expression?

You can use any of the following:
\b #A word break and will work for both spaces and end of lines.
(^|\s) #the | means or. () is a capturing group.
/\b(stackoverflow)\b/
Also, if you don't want to include the space in your match, you can use lookbehind/aheads.
(?<=\s|^) #to look behind the match
(stackoverflow) #the string you want. () optional
(?=\s|$) #to look ahead.

(^|\s) would match space or start of string and ($|\s) for space or end of string. Together it's:
(^|\s)stackoverflow($|\s)

Here's what I would use:
(?<!\S)stackoverflow(?!\S)
In other words, match "stackoverflow" if it's not preceded by a non-whitespace character and not followed by a non-whitespace character.
This is neater (IMO) than the "space-or-anchor" approach, and it doesn't assume the string starts and ends with word characters like the \b approach does.
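
A small Java sketch of this lookaround approach, run against the five sample lines from the question. The helper names wholeToken and contains are hypothetical, and Pattern.quote is only there so that the search word is treated literally:

```java
import java.util.regex.Pattern;

class WholeTokenDemo {
    // Hypothetical helper: wraps a literal word in the two negative lookarounds.
    static Pattern wholeToken(String word) {
        return Pattern.compile("(?<!\\S)" + Pattern.quote(word) + "(?!\\S)");
    }

    static boolean contains(String text, String word) {
        return wholeToken(word).matcher(text).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "this is stackoverflow and it rocks",
            "stackoverflow is the best",
            "i love stackoverflow",
            "typostackoverflow rules",
            "i love stackoverflowtypo"
        };
        for (String sample : samples) {
            // Prints MATCH for the first three lines and NO MATCH for the last two.
            String verdict = contains(sample, "stackoverflow") ? "MATCH" : "NO MATCH";
            System.out.println(sample + " [" + verdict + "]");
        }
    }
}
```

Since lookarounds consume nothing, the match itself is exactly the word, with no surrounding space to trim.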

\b matches at word boundaries (without actually matching any characters), so the following should do what you want:
\bstackoverflow\b

Replace substring of text matching regexp

I have text that looks like something like this:
1. Must have experience in Java 2. Team leader...
I want to render this in HTML as an ordered list. Now adding the </li> tag to the end is simple enough:
s = replace(s, ". ", "</li>");
But how do I go about replacing the 1., 2. etc with <li>?
I have the regular expression \d*\.$ which matches a number followed by a period, but the problem is that this is only a substring, so matching 1. Must have experience in Java 2. Team leader against \d*\.$ returns false.

Code
\d+\.\s+(.*?)\s*(?=\d+\.\s+|$)
Replace
<li>$1</li>\n
Results
Input
1. Must have experience in Java 2. Team leader...
Output
<li>Must have experience in Java</li>
<li>Team leader...</li>
Explanation
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
(.*?) Capture any character any number of times, but as few as possible, into capture group 1
\s* Match any number of whitespace characters
(?=\d+\.\s+|$) Positive lookahead ensuring either of the following matches
\d+\.\s+
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
$ Assert position at the end of the line
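
In Java this could be applied with String#replaceAll, roughly as sketched below. The class and method names are invented for the example; the pattern and replacement are the ones shown above, with backslashes doubled for the string literal:

```java
class ListItemDemo {
    // Wraps the text after each "N. " marker in <li>...</li>, one item per line.
    static String toListItems(String text) {
        return text.replaceAll(
                "\\d+\\.\\s+(.*?)\\s*(?=\\d+\\.\\s+|$)",
                "<li>$1</li>\n");
    }

    public static void main(String[] args) {
        String input = "1. Must have experience in Java 2. Team leader...";
        System.out.print(toListItems(input));
        // <li>Must have experience in Java</li>
        // <li>Team leader...</li>
    }
}
```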

But how do I go about replacing the 1., 2. etc with <li>?
You can use String#replaceAll, which accepts a regex, instead of replace:
s = s.replaceAll("\\d+\\.\\s", "<li>");
Note
You don't need to use $ at the end of your regex.
You have to escape the dot . because it means any character in regex.
You can use \s for one whitespace character, \s* for zero or more, or \s+ for one or more.

We want
<ol>
<li>one</li>
<li>two</li>
</ol>
This can be done as:
s = s.replaceAll("(?s)(\\d+\\.)\\s+(.*?\\.)\\s*", "<li>$2</li></ol>");
s = s.replaceFirst("<li>", "<ol><li>");
s = s.replaceAll("(?s)</li></ol><li>", "</li>\n<li>");
The trick is to first add </li></ol> with a spurious </ol> that should only remain after the last list item.
(?s) is the DOTALL notation, causing . to also match line breaks.
In case of more than one numbered list this will not do. Also it assumes one single sentence per list item.
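
A different way to get the same kind of output, swapped in here as an alternative to the chained replace calls, is to loop over the items with a Matcher and build the list by hand. This is only a sketch: the class and method names are made up, the item pattern is the lookahead-based one from the earlier answer in this thread, and the item text is not HTML-escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedListDemo {
    // One list item: "N. " followed by text, up to the next "N. " marker or the end.
    static final Pattern ITEM =
            Pattern.compile("\\d+\\.\\s+(.*?)\\s*(?=\\d+\\.\\s+|$)", Pattern.DOTALL);

    static String toOrderedList(String text) {
        StringBuilder html = new StringBuilder("<ol>\n");
        Matcher m = ITEM.matcher(text);
        while (m.find()) {
            // group(1) is the item text without the number and surrounding whitespace.
            html.append("<li>").append(m.group(1)).append("</li>\n");
        }
        return html.append("</ol>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toOrderedList("1. Must have experience in Java 2. Team leader..."));
    }
}
```

Unlike the replace-based version, this does not require each item to end with a period, but it still treats any "digits, period, whitespace" sequence inside an item as the start of a new one.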

Analysing a more complex regex

In a previous question that I asked,
String split in java using advanced regex
someone gave me a fantastic answer to my problem (as described in the above link),
but I never managed to fully understand it. Can somebody help me? The regex I was given
is this:
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+"
I can understand some basic things, but there are parts of this regex that, even after
thoroughly searching Google, I could not find explained, like the question mark preceding the s at the
start, or how exactly the second parenthesis works with the question mark and the equals sign at the start. Is it also possible to expand it and make it work with other types of quotes, like “ ” for example?
Any help is really appreciated.

"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+" Explained;
(?s) # This equals a DOTALL flag in regex, which allows the `.` to match newline characters. As far as I can tell from your regex, it's superfluous.
(?= # Start of a lookahead: it checks ahead in the input, but matches "an empty string" (1)
(([^\"]+\"){2})* # This group is repeated any amount of times, including none. I will explain the content in more detail.
([^\"]+\") # This is looking for one or more occurrences of a character that is not `"`, followed by a `"`.
{2} # Repeat 2 times. When combined with the previous group, it is looking for 2 occurrences of text followed by a quote. In effect, this means it is looking for an even number of `"`.
[^\"]* # Matches any character which is not a double quote sign. This means literally _any_ character, including newline characters without enabling the DOTALL flag
$ # The lookahead actually inspects until end of string.
) # End of lookahead
\\s+ # Matches one or more whitespace characters, including spaces, tabs and so on
The lookahead described above makes \\s+ match only the whitespace that is not between two " characters, as in this string:
text that has a "string in it".
When used with String.split, this splits the string into [text, that, has, a, "string in it".]
Whitespace only matches if an even number of " follow it, so with an unbalanced quote only the spaces after the quote are split on:
text that nearly has a "string in it.
This splits the string into [text that nearly has a "string, in, it.]
(1) When I say that the lookahead matches "an empty string", I mean that it consumes nothing: it only looks ahead from the current position in the input and checks a condition. The actual match is made by \\s+, which follows the lookahead.
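
To see the regex in action, here is a minimal Java sketch that feeds it to String.split. The class, constant, and method names are invented for the example; the regex string is the one from the question, unchanged:

```java
import java.util.Arrays;
import java.util.List;

class QuoteAwareSplitDemo {
    // Whitespace that has an even number of double quotes between it and the end of input.
    static final String OUTSIDE_QUOTES = "(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+";

    static List<String> split(String text) {
        return Arrays.asList(text.split(OUTSIDE_QUOTES));
    }

    public static void main(String[] args) {
        System.out.println(split("text that has a \"string in it\"."));
        // [text, that, has, a, "string in it".]
    }
}
```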

The (?s) part is an embedded flag expression, enabling the DOTALL mode, which means the following:
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
The (?=expr) is a look-ahead expression. This means that the regex looks to match expr, but then moves back to the same point before continuing with the rest of the evaluation.
In this case, it means that the regex matches any \\s+ occurrence that is followed by an even number of ", then by non-" characters until the end ($). In other words, it checks that there is an even number of " ahead.
It can definitely be expanded to other quotes too. The only problem is the ([^\"]+\"){2} part, that will probably have to be made to use a back-reference (\n) instead of the {2}.
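
For the typographic quotes “ ” specifically, one possible variant (a sketch that does not come from any of the answers here, and that assumes the quotes are balanced and never nested) is to spell out the opening and closing characters separately instead of counting pairs. The class and method names are hypothetical, and the quotes are written as Unicode escapes to avoid source-encoding problems:

```java
import java.util.Arrays;
import java.util.List;

class CurlyQuoteSplitDemo {
    // \u201C is the opening quote, \u201D the closing quote.
    // The lookahead requires the rest of the input to consist of complete
    // opening...closing pairs plus unquoted text, so whitespace inside a pair is skipped.
    static final String OUTSIDE_CURLY_QUOTES =
            "\\s+(?=(?:[^\u201C\u201D]*\u201C[^\u201D]*\u201D)*[^\u201C\u201D]*$)";

    static List<String> split(String text) {
        return Arrays.asList(text.split(OUTSIDE_CURLY_QUOTES));
    }

    public static void main(String[] args) {
        // Splits into three parts: say, the quoted phrase, now.
        System.out.println(split("say \u201Chello world\u201D now"));
    }
}
```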

This is fairly simple..
Concept
It splits at \s+ whenever there is an even number of " ahead.
For example:
Hello hi "Hi World"
     ^  ^   ^
     |  |   |->will not split here since there is an odd number of " ahead
     ----
       |
       |->split here because there is an even number of " ahead
Grammar
\s matches a \n or \r or space or \t
+ is a quantifier which matches the previous character or group one or more times
[^\"] would match anything except "
(x){2} would match x 2 times
a(?=bc) would match a if it is followed by bc
(?=ab)a would first check for ab from the current position and then return to that position. It then matches a. (?=ab)c would not match c.
With (?s) (single-line mode), . would match newlines. So in this case there is no need for (?s), since the regex contains no . characters.
I would use
\s+(?=([^"]*"[^"]*")*[^"]*$)

Regex for "* word"

Any Regex masters out there? I need a regular expression in Java that matches:
"RANDOMSTUFF SPECIFICWORD"
Including the quotation marks.
Thus I need
to match the first quote,
RANDOMSTUFF (any number of words with spaces between preceding SPECIFICWORD)
SPECIFICWORD (a specific word which I won't specify here.)
and the ending quote.
I don't want to match things such as:
RANDOMSTUFF SPECIFICWORD
"RANDOMSTUFF NOTTHESPECIFICWORD"
"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF"

\".*\sSPECIFICWORD\"
If you don't want to allow quotes in between, use \"[^"]*\sSPECIFICWORD\"
. matches any character
* says 0 or more of the preceding character (in this case, 0 or more of any characters)
\s matches any whitespace character
SPECIFICWORD will be treated as a string literal, assuming there are no special characters (escape them if there are)
\" matches the quote
[^"] means any character except a quote (the ^ is what makes it 'except')
Regexes are powerful expressions and are applicable across virtually any language, so it would be a good thing to become comfortable with using them.
EDIT:
As several other posters have pointed out, adding ^ to the beginning and $ to the end will only match if the entire line matches.
^ matches the beginning of the line
$ matches the end of the line
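
In Java, a sketch of this check might look as follows. The class and method names are made up, SPECIFICWORD is used as a literal stand-in for the real word, and the no-inner-quotes variant of the pattern is used. Matcher.matches() already requires the whole input to match, so the ^ and $ anchors are not needed with it:

```java
import java.util.regex.Pattern;

class QuotedPhraseDemo {
    // Opening quote, any non-quote text, one whitespace character, the word, closing quote.
    static final Pattern QUOTED = Pattern.compile("\"[^\"]*\\sSPECIFICWORD\"");

    static boolean isMatch(String input) {
        return QUOTED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMatch("\"RANDOMSTUFF SPECIFICWORD\""));                 // true
        System.out.println(isMatch("RANDOMSTUFF SPECIFICWORD"));                     // false
        System.out.println(isMatch("\"RANDOMSTUFF NOTTHESPECIFICWORD\""));           // false
        System.out.println(isMatch("\"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF\"")); // false
    }
}
```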

^.*\s+SPECIFICWORD"$
'^' matches 'from the start of the line'
.* matches anything
\s+ matches 'any amount of whitespace, but at least some'
SPECIFICWORD" is a string literal
$ means 'this is the end of the line'
Note that ^ and $ are not always 'line'-based; in Java they match the start and end of the whole input by default, and you have to enable 'multiline' mode (Pattern.MULTILINE) to make them match at the start and end of each line.

Will this pattern be matched on a line-by-line basis, or will it be searched for within a larger text? If the former, you can add anchors to ensure that it matches the whole string.
^(\".*\sSPECIFICWORD\")$
This says: at the start of the line, look for a double quote followed by zero or more arbitrary characters, followed by a single whitespace character, followed by the specific word, followed by a double quote at the end of the string.
Optionally, there are excellent tools for designing regex patterns and seeing what they match in real time.
Here are a couple of examples:
http://gskinner.com/RegExr/
http://regex101.com/r/zC3fM1

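Try:
\"[\w\s]*\sSPECIFICWORD\"
Works like this:
\" matches opening quote
[\w\s]* matches zero or more of the characters from the following sets:
[a-zA-Z_0-9] (\w part)
[ \t\n\x0B\f\r] (\s part)
\s matches the whitespace that must separate SPECIFICWORD from the preceding text (so NOTTHESPECIFICWORD is rejected)
SPECIFICWORD matches the SPECIFICWORD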
\" matches closing quote

Regex pattern matching not working recursively

I want to implement pattern matching for input of the form
(a+b)(c-or*or/d)..............
repeated any number of times.
I use the following pattern, but it is not working recursively.
It only reads the first group.
Pattern pattern;
String regex="(([0-9]*)([+,-,/,*])([0-9]*)*)";
pattern=Pattern.compile(regex);
Matcher match = pattern.matcher(userInput);

The regular expression you need to match that sort of sequence is this:
\s*-?\d+(?:\s*[-+/*]\s*-?\d+)+\s*
Let's break that down into its component parts!
\s* # Optional space
-? # Optional minus sign
\d+ # Mandatory digits
(?: # Start sub-regex
\s* # Optional space
[-+*/] # Mandatory single arithmetic operator
\s* # Optional space
-? # Optional minus sign
\d+ # Mandatory digits
)+ # End sub-regex: want one or more matches of it
\s* # Optional space
(If you don't want to match spaces, remove all of those \s* and be aware that it will surprise users quite a lot.)
Now, when encoding the above as a String literal in Java (before compilation) you need to be careful to escape each of the \ characters in it:
String regex="\\s*-?\\d+(?:\\s*[-+/*]\\s*-?\\d+)+\\s*";
The other thing to be aware of is that this doesn't pull the regular expression apart into pieces for Java to parse and build an expression evaluation tree from; it just (with the rest of your code) matches the whole string or not. (Even putting in capturing parentheses wouldn't help a lot; when put inside some form of repetition, they only report the last place in the string where they matched.) The simplest way of doing that properly would be to use a parser generator like Antlr (it would also let you do things like parenthesized subexpressions, managing operator precedence, etc.).
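
A short Java sketch of using that string literal as a whole-input validator (the class and method names are hypothetical; only the regex itself comes from the answer):

```java
import java.util.regex.Pattern;

class ArithmeticSequenceDemo {
    // A number followed by one or more "operator number" pairs, with optional spaces.
    static final Pattern SEQUENCE =
            Pattern.compile("\\s*-?\\d+(?:\\s*[-+/*]\\s*-?\\d+)+\\s*");

    static boolean isSequence(String input) {
        // matches() succeeds only if the entire input fits the pattern.
        return SEQUENCE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSequence("1 + 2 * 3")); // true
        System.out.println(isSequence("-4/-2"));     // true
        System.out.println(isSequence("1 +"));       // false: operator without right operand
        System.out.println(isSequence("42"));        // false: at least one operator is required
    }
}
```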

You will need an expression like this:
[0-9]+-[0-9]+[\/*+-][0-9]+[\/*+-][0-9]+[\/*+-][0-9]+[\/*+-][0-9]+
You have to match the whole expression. You cannot match part of the expression and do a second search, because the pattern is repeated.
Note: in Ruby, \ is the escape character for /, so you can omit it in C# or replace it with another character.

The pattern
\((\d|[\+\-\/\\\*\^%!]+|(or|and) *)+\)
Options: ^ and $ match at line breaks
Match the character “(” literally «\(»
Match the regular expression below and capture its match into backreference number 1 «(\d|[\+\-\/\\\*\^%!]+|(or|and) *)+»
   Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «+»
   Match either the regular expression below (attempting the next alternative only if this one fails) «\d»
      Match a single digit 0..9 «\d»
   Or match regular expression number 2 below (attempting the next alternative only if this one fails) «[\+\-\/\\\*\^%!]+»
      Match a single character present in the list below «[\+\-\/\\\*\^%!]+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
         A + character «\+»
         A - character «\-»
         A / character «\/»
         A \ character «\\»
         A * character «\*»
         A ^ character «\^»
         One of the characters “%!” «%!»
   Or match regular expression number 3 below (the entire group fails if this one fails to match) «(or|and) *»
      Match the regular expression below and capture its match into backreference number 2 «(or|and)»
         Match either the regular expression below (attempting the next alternative only if this one fails) «or»
            Match the characters “or” literally «or»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «and»
            Match the characters “and” literally «and»
      Match the character “ ” literally « *»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “)” literally «\)»
The calculation algorithm
To parse and process the input string you have to use a stack.
Regards
Cylian
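
The concept the answer points to is not reproduced here, so the following is only one common way a stack-based calculation can be sketched in Java: a value stack and an operator stack, with higher-precedence operators applied first. Everything in it (class name, method names, the restriction to non-negative integers and the operators + - * /) is an assumption made for illustration, and malformed input is not validated.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class StackCalculator {
    static int precedence(char op) {
        return (op == '+' || op == '-') ? 1 : 2;
    }

    // Pops one operator and two operands, pushes the result back.
    static void applyTop(Deque<Long> values, Deque<Character> ops) {
        char op = ops.pop();
        long right = values.pop();
        long left = values.pop();
        switch (op) {
            case '+': values.push(left + right); break;
            case '-': values.push(left - right); break;
            case '*': values.push(left * right); break;
            case '/': values.push(left / right); break;
            default: throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    static long evaluate(String expr) {
        Deque<Long> values = new ArrayDeque<>();
        Deque<Character> ops = new ArrayDeque<>();
        int i = 0;
        while (i < expr.length()) {
            char c = expr.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < expr.length() && Character.isDigit(expr.charAt(i))) {
                    i++;
                }
                values.push(Long.parseLong(expr.substring(start, i)));
            } else if (c == '(') {
                ops.push(c);
                i++;
            } else if (c == ')') {
                // Resolve everything back to the matching opening parenthesis.
                while (!ops.isEmpty() && ops.peek() != '(') {
                    applyTop(values, ops);
                }
                ops.pop(); // discard the '('
                i++;
            } else {
                // Apply pending operators of equal or higher precedence first.
                while (!ops.isEmpty() && ops.peek() != '('
                        && precedence(ops.peek()) >= precedence(c)) {
                    applyTop(values, ops);
                }
                ops.push(c);
                i++;
            }
        }
        while (!ops.isEmpty()) {
            applyTop(values, ops);
        }
        return values.pop();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("(1+2)*(3-1)")); // 6
        System.out.println(evaluate("2+3*4"));       // 14
    }
}
```

Unary minus and implicit multiplication between adjacent groups such as (a+b)(c-d) are deliberately left out; supporting them would need extra tokenizing rules.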

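Your expression doesn't escape special characters such as +, ( and ).
Try this:
/(\(\d+[-+\/*]\d+\))+/
The outer (...)+ repeats the whole parenthesized group one or more times. (\G cannot be used for this: it anchors a match at the end of the previous match rather than repeating the pattern.)
I changed your [0-9]* to \d+, which I think is more correct.
I removed the commas from your character class; inside [...] every character is taken literally, so no separator is needed.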
