Analysing a more complex regex

In a previous question that I asked,
String split in java using advanced regex
someone gave me a fantastic answer to my problem (as described in the above link),
but I never managed to fully understand it. Can somebody help me? The regex I was given
is this:
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+"
I can understand some basic things, but there are parts of this regex that I could not find even after
thoroughly searching Google, like the question mark preceding the s at the
start, or how exactly the second parenthesis works with the question mark and the equals sign at the start. Is it also possible to expand it to work with other types of quotes, like “ ” for example?
Any help is really appreciated.

"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+" Explained;
(?s) # This equals a DOTALL flag in regex, which allows the `.` to match newline characters. As far as I can tell from your regex, it's superfluous.
(?= # Start of a lookahead. It checks what follows the current position but consumes nothing, i.e. it matches "an empty string" (1)
(([^\"]+\"){2})* # This group is repeated any number of times, including none. I will explain the content in more detail.
([^\"]+\") # This is looking for one or more occurrences of a character that is not `"`, followed by a `"`.
{2} # Repeat 2 times. When combined with the previous group, it is looking for 2 occurrences of text followed by a quote. In effect, this means it is looking for an even number of `"`.
[^\"]* # Matches zero or more characters that are not a double quote. This means literally _any_ other character, including newline characters, even without the DOTALL flag
$ # End of input, so the lookahead inspects everything up to the end of the string.
) # End of lookahead
\\s+ # Matches one or more whitespace characters, including spaces, tabs and so on
Put together, the lookahead succeeds only at positions that are followed by an even number of ", so \\s+ matches only whitespace that is not between two ". Take this string:
text that has a "string in it".
When used with String.split, it is split into [text, that, has, a, "string in it".]
With an unbalanced quote, the whitespace before the lone " has an odd number of " ahead and is not matched; only the whitespace after it is:
text that nearly has a "string in it.
This is split into [text that nearly has a "string, in, it.]
(1) When I say that the lookahead matches "an empty string", I mean that it consumes nothing: it only looks ahead from the current position and checks a condition. The actual match is made by the \\s+ that follows the lookahead.
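If you want to check the behaviour yourself, here is a minimal sketch that feeds the two example strings to String.split with the regex from the question. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class QuoteAwareSplitDemo {
    // The regex from the question, written as a Java string literal.
    static final String REGEX = "(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+";

    // Hypothetical helper: split on whitespace that has an even number of " ahead.
    static String[] split(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        // Balanced quotes: the quoted part stays in one piece.
        System.out.println(Arrays.toString(split("text that has a \"string in it\".")));
        // prints [text, that, has, a, "string in it".]

        // A single quote: only whitespace after the quote has an even count (zero) ahead.
        System.out.println(Arrays.toString(split("text that nearly has a \"string in it.")));
        // prints [text that nearly has a "string, in, it.]
    }
}
```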

The (?s) part is an embedded flag expression, enabling the DOTALL mode, which means the following:
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
The (?=expr) is a look-ahead expression. This means that the regex looks to match expr, but then moves back to the same point before continuing with the rest of the evaluation.
In this case, it means that the regex matches any \\s+ occurrence that is followed by an even number of " and then by non-" characters until the end ($). In other words, it checks that there is an even number of " ahead.
It can definitely be expanded to other quotes too. The only problem is the ([^\"]+\"){2} part, which will probably have to use a back-reference (\n) instead of the {2}.
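For the curly quotes “ ” specifically, the opening and closing characters differ, so one possible adaptation (not the back-reference approach mentioned above) is to require complete open/close pairs ahead. This is only a sketch: it assumes the quotes are balanced and never nested, and the class and method names are made up.

```java
import java.util.Arrays;

class CurlyQuoteSplitDemo {
    // \u201C is the left curly quote, \u201D the right one.
    // Split on whitespace that is followed only by complete open/close pairs.
    static final String REGEX =
        "\\s+(?=(?:[^\u201C\u201D]*\u201C[^\u201C\u201D]*\u201D)*[^\u201C\u201D]*$)";

    // Hypothetical helper wrapping String.split.
    static String[] split(String input) {
        return input.split(REGEX);
    }

    public static void main(String[] args) {
        // The space inside the curly-quoted part is followed by a closing
        // quote with no opening partner, so it is not a split point.
        System.out.println(Arrays.toString(split("say \u201Chello there\u201D now")));
    }
}
```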

This is fairly simple..
Concept
It splits at \s+ whenever there is an even number of " ahead.
For example:
Hello hi "Hi World"
     ^  ^   ^
     |  |   |->will not split here since there is an odd number of " ahead
     ----
      |
      |->split here because there is an even number of " ahead
Grammar
\s matches a whitespace character such as \n, \r, space or \t
+ is a quantifier which matches the previous character or group one or more times
[^\"] would match any character except "
(x){2} would match x 2 times
a(?=bc) would match a only if it is followed by bc
(?=ab)a would first check for ab from the current position and then return to that position. It then matches a. (?=ab)c would never match, because the character at that position must be a
With (?s) (single-line mode) . would match newlines. In this case there is no need for (?s), since the regex contains no .
I would use
\s+(?=([^"]*"[^"]*")*[^"]*$)
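A small sketch comparing this version with the regex from the question. As far as I can tell the two differ when a quoted string is empty, because [^"]+ in the original demands at least one character before every quote. The class and method names are made up for illustration.

```java
import java.util.Arrays;

class SimplifiedSplitDemo {
    // Regex from the question.
    static final String ORIGINAL = "(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+";
    // Regex suggested in this answer.
    static final String SIMPLIFIED = "\\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)";

    // Hypothetical helper wrapping String.split.
    static String[] split(String regex, String input) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split(SIMPLIFIED, "Hello hi \"Hi World\"")));
        // prints [Hello, hi, "Hi World"]

        // An empty quoted string: only the simplified version splits before it.
        System.out.println(Arrays.toString(split(SIMPLIFIED, "a \"\" b")));
        // prints [a, "", b]
        System.out.println(Arrays.toString(split(ORIGINAL, "a \"\" b")));
        // prints [a "", b]
    }
}
```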

Related

Replace substring of text matching regexp

I have text that looks like something like this:
1. Must have experience in Java 2. Team leader...
I want to render this in HTML as an ordered list. Now adding the </li> tag to the end is simple enough:
s = replace(s, ". ", "</li>");
But how do I go about replacing the 1., 2. etc with <li>?
I have the regular expression \d*\.$ which matches a number with a period, but the problem is is that is a substring so matching 1. Must have experience in Java 2. Team leader with \d*\.$ returns false.

Code
\d+\.\s+(.*?)\s*(?=\d+\.\s+|$)
Replace
<li>$1</li>\n
Results
Input
1. Must have experience in Java 2. Team leader...
Output
<li>Must have experience in Java</li>
<li>Team leader...</li>
Explanation
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
(.*?) Capture any character any number of times, but as few as possible, into capture group 1
\s* Match any number of whitespace characters
(?=\d+\.\s+|$) Positive lookahead ensuring that either of the following matches next
\d+\.\s+
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
$ Assert position at the end of the line
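In Java this could look roughly like the sketch below, using String.replaceAll with the pattern and replacement from above. The class and method names are invented for the example, and the backslashes are doubled because the regex sits in a Java string literal.

```java
class OrderedListDemo {
    // Hypothetical helper: turn "1. foo 2. bar" into one <li> line per item.
    static String toListItems(String text) {
        return text.replaceAll(
            "\\d+\\.\\s+(.*?)\\s*(?=\\d+\\.\\s+|$)", // number, dot, then the item text in group 1
            "<li>$1</li>\n");                        // $1 refers to capture group 1
    }

    public static void main(String[] args) {
        System.out.print(toListItems("1. Must have experience in Java 2. Team leader..."));
        // prints:
        // <li>Must have experience in Java</li>
        // <li>Team leader...</li>
    }
}
```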

But how do I go about replacing the 1., 2. etc with <li>?
You can use String#replaceAll, which accepts a regex, instead of replace:
s = s.replaceAll("\\d+\\.\\s", "<li>");
Note
You don't need to use $ at the end of your regex.
You have to escape the dot . because it means any character in regex
You can use \s for one whitespace character, \s* for zero or more, or \s+ for one or more

We want
<ol>
<li>one</li>
<li>two</li>
</ol>
This can be done as:
s = s.replaceAll("(?s)(\\d+\\.)\\s+(.*?\\.)\\s*", "<li>$2</li></ol>");
s = s.replaceFirst("<li>", "<ol><li>");
s = s.replaceAll("(?s)</li></ol><li>", "</li>\n<li>");
The trick is to first add </li></ol> after every item, with a spurious </ol> that should only remain after the last list item.
(?s) is the DOTALL notation, causing . to also match line breaks. The item text is matched reluctantly (.*?) so that each item stops at its first period instead of running on to the last one.
If there is more than one numbered list this will not do. It also assumes a single sentence, ending in a period, per list item.
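Here is a quick sketch of the three steps on a toy input. Note that it uses a reluctant .*? for the item text so that each item ends at its first period; the class name, method name and sample input are assumptions made for the example.

```java
class OrderedListWrapDemo {
    // Hypothetical helper applying the three replacement steps in order.
    static String toOrderedList(String s) {
        // Step 1: wrap every numbered sentence, closing the list after each one.
        s = s.replaceAll("(?s)(\\d+\\.)\\s+(.*?\\.)\\s*", "<li>$2</li></ol>");
        // Step 2: open the list before the first item.
        s = s.replaceFirst("<li>", "<ol><li>");
        // Step 3: drop every </ol> that is directly followed by another item.
        s = s.replaceAll("(?s)</li></ol><li>", "</li>\n<li>");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(toOrderedList("1. one. 2. two."));
        // prints:
        // <ol><li>one.</li>
        // <li>two.</li></ol>
    }
}
```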

How to match ^(d+) in a particular text using regex

For example I have text like below :
case1:
(1) Hello, how are you?
case2:
Hi. (1) How're you doing?
Now I want to match the text which starts with (\d+).
I have tried the following regex but nothing is working.
^[\(\d+\)], ^\(\d+\).

[] define a character class: they match any single one of the characters you specify inside the brackets, so [\(\d+\)] matches only one character unless it is followed by a quantifier.
The second regexp will work: ^\(\d+\), so check your code.
Also check that there is no space in front of the first parenthesis, or add \s* in front.
EDIT: Also, Java can be tricky with escapes depending on whether the regexp you type is used directly as a regexp or is first a string literal. You may need to double escape your escapes.

In Java you have to escape the parentheses, so "\\(\\d+\\)" should match (1) in case one and two. Adding ^ as you did, "^\\(\\d+\\)" will match only case1.
You have to use double backslashes within a Java string. Consider this:
"\n" give you [line break]
"\\n" give you [backslash][n]

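To see the difference between the anchored and unanchored pattern, here is a minimal sketch using Matcher.find() on the two cases from the question. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class LeadingNumberDemo {
    // Anchored: "(digits)" must be at the very start of the input.
    static final Pattern ANCHORED = Pattern.compile("^\\(\\d+\\)");
    // Unanchored: "(digits)" may appear anywhere.
    static final Pattern ANYWHERE = Pattern.compile("\\(\\d+\\)");

    static boolean startsWithNumber(String s) {
        return ANCHORED.matcher(s).find();
    }

    static boolean containsNumber(String s) {
        return ANYWHERE.matcher(s).find();
    }

    public static void main(String[] args) {
        String case1 = "(1) Hello, how are you?";
        String case2 = "Hi. (1) How're you doing?";
        System.out.println(startsWithNumber(case1)); // prints true
        System.out.println(startsWithNumber(case2)); // prints false
        System.out.println(containsNumber(case2));   // prints true
    }
}
```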
If you are going to downvote my post, at least comment to tell me WHY it's not useful.
I believe Java's Regex Engine supports Positive Lookbehind, in which case you can use the following regex:
(?<=[(][0-9]{1,9999}[)]\s?)\b.*$
Which matches:
The literal text (
Any digit [0-9], between 1 and 9999 times {1,9999}
The literal text )
A space, between 0 and 1 times \s?
A word boundary \b
Any character, between 0 and unlimited times .*
The end of a string $

Regex for "* word"

Any Regex masters out there? I need a regular expression in Java that matches:
"RANDOMSTUFF SPECIFICWORD"
Including the quotation marks.
Thus I need
to match the first quote,
RANDOMSTUFF (any number of words with spaces between preceding SPECIFICWORD)
SPECIFICWORD (a specific word which I won't specify here.)
and the ending quote.
I don't want to match things such as:
RANDOMSTUFF SPECIFICWORD
"RANDOMSTUFF NOTTHESPECIFICWORD"
"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF"

\".*\sSPECIFICWORD\"
If you don't want to allow quotes in between, use \"[^"]*\sSPECIFICWORD\"
. matches any character
* says 0 or more of the preceding character (in this case, 0 or more of any characters)
\s matches any whitespace character
SPECIFICWORD will be treated as a string literal, assuming there are no special characters (escape them if there are)
\" matches the quote
[^"] means any character except a quote (the ^ is what makes it 'except')
Regexes are powerful expressions and are applicable across virtually any language, so it would be a good thing to become comfortable with using them.
EDIT:
As several other posters have pointed out, adding ^ to the beginning and $ to the end will only match if the entire line matches.
^ matches the beginning of the line
$ matches the end of the line
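As a rough check against the examples in the question, the anchored variant without inner quotes could be tried like this in Java. The class and method names are invented, and SPECIFICWORD stands in for the real word.

```java
import java.util.regex.Pattern;

class QuotedWordDemo {
    // Opening quote, any non-quote characters, one whitespace, the word, closing quote.
    static final Pattern QUOTED = Pattern.compile("^\"[^\"]*\\sSPECIFICWORD\"$");

    // Hypothetical helper: true if the whole line has the wanted shape.
    static boolean isMatch(String line) {
        return QUOTED.matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(isMatch("\"RANDOMSTUFF SPECIFICWORD\""));                 // prints true
        System.out.println(isMatch("RANDOMSTUFF SPECIFICWORD"));                     // prints false
        System.out.println(isMatch("\"RANDOMSTUFF NOTTHESPECIFICWORD\""));           // prints false
        System.out.println(isMatch("\"RANDOMSTUFF SPECIFICWORD MORERANDOMSTUFF\"")); // prints false
    }
}
```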

^.*\s+SPECIFICWORD"$
'^' matches 'from the start of the line'
.* matches anything
\s+ matches 'any amount of whitespace, but at least some'
SPECIFICWORD" is a string literal
$ means 'this is the end of the line'
Note that ^ and $ are not always 'line'-based: in Java they match the start and end of the whole input by default, and you have to enable 'multiline' mode (Pattern.MULTILINE, or (?m) inside the regex) for them to match at the start and end of each line.

Will this string be matched on a line-by-line basis or will it be found within the text? If it is line by line, you can add anchors to ensure that it matches the whole string.
^(\".*\sSPECIFICWORD\")$
Saying, at the start of the line, look for a double quote followed by zero or more random characters followed by a single whitespace, followed by the specific word, followed by a double quote at the end of the string.
Optionally, there are excellent tools for designing regex patterns and seeing what they match in real time.
Here are a couple of examples:
http://gskinner.com/RegExr/
http://regex101.com/r/zC3fM1

Try:
\"[\w\s]*SPECIFICWORD\"
Works like this:
\" matches opening quote
[\w\s]* matches zero or more of the characters from the following sets:
[a-zA-Z_0-9] (\w part)
[ \t\n\x0B\f\r] (\s part)
SPECIFICWORD matches the SPECIFICWORD
\" matches closing quote

Match exact word using regex in java

I need a regular expression to match an exact word.
For example:
There is a string "draft guidance allerg Excellence" and I want to search for allerg, so I have written \ballerg\b. It gives me an exact match. But when I pass the string "draft guidance 12=allerg Excellence" it also returns true, which is wrong.
Which regular expression do I need to match only exact words?

The \b boundary would normally handle this situation, even in your case of "draft guidance 12=allerg Excellence"; however, you're saying that the = is part of the word (in normal English, this is not the case).
I'm assuming then that by "whole word", you mean a word that is surrounded by a space or normal sentence punctuation. In this case, a regex such as the following should work:
(?:^|[\s\.;\?\!,])allerg(?:$|[\s\.;\?\!,])
You can, obviously, add or remove characters as needed.
Regex Explained:
(?: # non-capturing group
^ # beginning of string
| [\s\.;\?\!,] # OR a space, period, and other misc. punctuation
)
allerg # string to match
(?: # non-capturing group
$ # end of string
| [\s\.;\?\!,] # OR a space, period, and other misc. punctuation
)
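A small sketch comparing \b with the delimiter-based pattern on the two strings from the question. The class and method names are made up, and I dropped the backslashes inside the character class since . ; ? ! , are literal there anyway.

```java
import java.util.regex.Pattern;

class WholeWordDemo {
    // Plain word boundaries: '=' is a non-word character, so a boundary exists after it.
    static final Pattern WORD_BOUNDARY = Pattern.compile("\\ballerg\\b");
    // Only start/end of input, whitespace or sentence punctuation may surround the word.
    static final Pattern DELIMITED =
        Pattern.compile("(?:^|[\\s.;?!,])allerg(?:$|[\\s.;?!,])");

    static boolean foundWithBoundary(String s) {
        return WORD_BOUNDARY.matcher(s).find();
    }

    static boolean foundDelimited(String s) {
        return DELIMITED.matcher(s).find();
    }

    public static void main(String[] args) {
        String plain = "draft guidance allerg Excellence";
        String withEquals = "draft guidance 12=allerg Excellence";
        System.out.println(foundWithBoundary(plain));      // prints true
        System.out.println(foundWithBoundary(withEquals)); // prints true
        System.out.println(foundDelimited(plain));         // prints true
        System.out.println(foundDelimited(withEquals));    // prints false
    }
}
```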

If I understood the question correctly, you want to match the word "allerg". A word is enclosed by whitespace characters, and "=allerg" has the "=" character, which you don't want to match.
To match the word "allerg" you can use the following regex:
\s+allerg\s+

regex for specific digit prefix

I am trying to write the following regex rules, but couldn't find a solution.
I am sorry if I didn't make it clear. I want a different regex for each rule. I am using Java.
The rule should fail for all digit inputs that start with the prefix '1900' or '1901'.
(190011 - fail, 190111 - fail, 41900 - success...)
The rule should succeed for all digit inputs with the prefix '*'
A different regex for each rule (I am not looking for a combination of both of them together)

Does this RE fit the purpose?
'\A(\*|(?!190[01])).*'
\A means 'the beginning of the string'. I think it's the same in Java's regexes.
EDIT
\A : "from the very beginning of the string ....". In Python (which is what I know, in fact) this can be omitted if we use the function match(), which always starts at the very beginning, instead of search(), which searches everywhere in a string. If you want the regex to analyze each line from its beginning, replace it with ^ in multiline mode.
(...|...) : ".... there must be one of the two following options : ....."
\* : "...the first option is one character only, a star; ..." . As a star is a special character meaning 'zero or more times what is before' in regexes, it must be escaped to mean a literal star.
(?!190[01]) : "... the second option isn't a pattern that must be found and possibly captured but a pattern that must be absent (still after the very beginning). ...". The two characters ?! are what says 'there must not be the following characters'. The pattern not to be found is 4 digit characters long, '1900' or '1901' .
(?!.......) is a negative lookahead assertion. All kinds of assertion begin with (? : directly after an opening parenthesis, ? loses its usual meaning as a quantifier, which is why assertions are always written with parentheses.
If \* has matched, one character has been consumed. By contrast, if the assertion is verified, the first 4 characters of the string have not been consumed: the regex engine has scanned ahead up to the 4th character to check them and then come back to its initial position, which here is the very beginning of the string.
If you want the two-option part (...|...) not to be a capturing group, write ?: just after the first paren, giving '\A(?:\*|(?!190[01])).*'
.* : After the beginning pattern (one star matched, or an assertion verified) the regex engine goes on and matches all the characters until the end of the line. If the string has newlines and you want the regex to match all the characters until the end of the string, and not only of a line, you must specify that . matches newlines too (in Python with re.DOTALL; in Java with Pattern.DOTALL or (?s)), or replace .* with (.|\r|\n)*
I finally understand that you apparently want to match strings composed of digit characters. If so, the RE must be changed to '\A(?:\*|(?!190[01]))\d*' . This RE also matches empty strings. If you do not want empty strings to match, put \d+ in place of \d* . If you want a match only for strings with at least one digit, even after the star when it begins with a star, then use '\A(?:\*|(?!190[01]))(?=\d)\d*'
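Since the question is about Java, here is a rough sketch of the \d+ variant using Matcher.matches(), which requires the whole input to match. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class PrefixRuleDemo {
    // Either a leading star, or a start that is not followed by 1900/1901; then digits only.
    static final Pattern RULE = Pattern.compile("\\A(?:\\*|(?!190[01]))\\d+");

    // Hypothetical helper: matches() requires the entire input to match the pattern.
    static boolean accepts(String input) {
        return RULE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(accepts("190011")); // prints false
        System.out.println(accepts("190111")); // prints false
        System.out.println(accepts("41900"));  // prints true
        System.out.println(accepts("*123"));   // prints true
    }
}
```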

For the first rule, you should use a combo regex with two captures, one to capture the 1900/1901-prefixed case, and one to capture the rest. Then you can decide whether the string should succeed or fail by examining the two captures:
(190[01]\d+)|(\d+)
Or just a simple 190[01]\d+ and negate your logic.
Regexes are not really very good at excluding something.
You may exclude a prefix using negative look-behind, but it won't work in this case because the prefix is itself a stream of digits.
You seem to be trying to exclude 1-900/901 phone numbers in the US. If the number of digits is definite, you can use a negative look-behind to exclude this prefix while matching the remaining exact number digits.
For the second rule, simply:
\*\d+
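A minimal sketch of the "negate your logic" idea with one method per rule, using String.matches (which tests the whole input). The method names are made up, and the extra check that the input is all digits is my own assumption about what "digit inputs" means.

```java
class SeparateRulesDemo {
    // Rule 1: all digits (assumption), and NOT a 1900/1901-prefixed number.
    static boolean passesRuleOne(String input) {
        return input.matches("\\d+") && !input.matches("190[01]\\d+");
    }

    // Rule 2: a literal star followed by digits.
    static boolean passesRuleTwo(String input) {
        return input.matches("\\*\\d+");
    }

    public static void main(String[] args) {
        System.out.println(passesRuleOne("190011")); // prints false
        System.out.println(passesRuleOne("41900"));  // prints true
        System.out.println(passesRuleTwo("*41900")); // prints true
        System.out.println(passesRuleTwo("41900"));  // prints false
    }
}
```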
