How do I match escaped characters in groups in Java RegEx - java

I've recently been working on a command-line project in Java and I need to parse commands. I'm having trouble matching this particular command:
15.00|GR,LQ,MD "Uber"
where the amount can be an integer or a decimal with exactly two fraction digits. I need to collect all the information in groups. "Uber" is an optional description.
Here is what I have tried:
Pattern.compile("ˆ([\\d]+(\\.[\\d]{2})?\\|([A-Z]{2}){1})(,[A-Z]{2})*\\s(\\\".+\\\")?$");
What I expect is to get the number, the two-letter user codes, and optionally the description too.

Your regex analyzed:
"ˆ([\\d]+(\\.[\\d]{2})?\\|([A-Z]{2}){1})(,[A-Z]{2})*\\s(\\\".+\\\")?$"
First, let's un-escape the Java string literal into the actual regex string:
ˆ([\d]+(\.[\d]{2})?\|([A-Z]{2}){1})(,[A-Z]{2})*\s(\".+\")?$
Now let's split it apart:
ˆ Incorrect character 'ˆ', should be '^'
Match start of input, but your input starts with '['
(
[\d]+ The '[]' is superfluous, use '\d+'
(\.[\d]{2})? Don't capture this, use '(?:X)?'
\|
([A-Z]{2}){1} The '{1}' is superfluous, and don't capture just this
) You're capturing too much. Move back to before '\|'
(,[A-Z]{2})* Will only capture last ',XX'.
Use a capture group around all the letters, then split that on ','
\s
(\".+\")? No need to escape '"', and only capture the content
$ Match end of input, but your input ends with ']'
So, cleaned up it will be:
^\[
(
\d+
(?:\.[\d]{2})?
)
\|
(
[A-Z]{2}
(?:,[A-Z]{2})*
)
\s
(?:"(.+)")?
\]$
Joined back together:
^\[(\d+(?:\.[\d]{2})?)\|([A-Z]{2}(?:,[A-Z]{2})*)\s(?:"(.+)")?\]$
With input [15.00|GR,LQ,MD "Uber"] that will capture:
15.00 - The full number
GR,LQ,MD - Use split(",") to get array { "GR", "LQ", "MD" }
Uber - Just the text without the quotes
See Demo on regex101.com.
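As a minimal sketch of how the joined regex above could be used from Java: the class and method names (CommandParser, parse) are made up for illustration, and the input is assumed to include the surrounding square brackets.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommandParser {
    // The joined regex from above, with backslashes doubled for a Java string literal.
    private static final Pattern COMMAND = Pattern.compile(
            "^\\[(\\d+(?:\\.[\\d]{2})?)\\|([A-Z]{2}(?:,[A-Z]{2})*)\\s(?:\"(.+)\")?\\]$");

    /** Returns {amount, codes, description}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = COMMAND.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // group(3) is null when the optional description is absent
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("[15.00|GR,LQ,MD \"Uber\"]");
        System.out.println(parts[0]);                             // the full number
        System.out.println(Arrays.toString(parts[1].split(","))); // the individual codes
        System.out.println(parts[2]);                             // the description without quotes
    }
}
```

Note that the whitespace before the optional description is still required by this regex, so a command without a description needs a trailing space before the closing bracket.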

The first character is a ˆ and not a ^. Besides that, you should change your first group to ([\d]+(\.[\d]{2})?) to get only 15.00 and not 15.00|GR.
The full example would look like this:
Pattern.compile("^([\\d]+(\\.[\\d]{2})?)\\|(([A-Z]{2})(,[A-Z]{2})*)\\s(\".+\")?$");

There are 2 main issues.
The ˆ character is a circumflex accent, not a ^ caret.
You're not including the square brackets in the regex.
A possible solution could be like this
Pattern.compile("^\\[(?<number>[\\d]+(?>\\.[\\d]{2})?)\\|(?<codes>(?>[A-Z]{2},?)+)(?>\\s\\\"(?<comment>.+)\\\")?\\]$");
This solution also has named capturing groups which makes it nicer to specify which group you want to get value from. https://regex101.com/r/HEboNf/2
All three of the 2 letter codes are grouped in a single capturing group, you can split them in your code on the comma.
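A rough sketch of reading the named groups from that pattern in Java. The class and method names here (NamedGroupParser, parse) are hypothetical, and the sample input is assumed to carry the square brackets.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupParser {
    // Same pattern as in the answer above, with named groups number, codes and comment.
    private static final Pattern COMMAND = Pattern.compile(
            "^\\[(?<number>[\\d]+(?>\\.[\\d]{2})?)\\|(?<codes>(?>[A-Z]{2},?)+)(?>\\s\\\"(?<comment>.+)\\\")?\\]$");

    /** Returns {number, codes, comment}, or null if the line does not match. */
    static String[] parse(String line) {
        Matcher m = COMMAND.matcher(line);
        if (!m.matches()) {
            return null;
        }
        // The comment group is null when no description was given
        return new String[] { m.group("number"), m.group("codes"), m.group("comment") };
    }

    public static void main(String[] args) {
        String[] parts = parse("[15.00|GR,LQ,MD \"Uber\"]");
        System.out.println(parts[0]);
        for (String code : parts[1].split(",")) {
            System.out.println(code); // one two-letter code per line
        }
        System.out.println(parts[2]);
    }
}
```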


Replace substring of text matching regexp

I have text that looks like something like this:
1. Must have experience in Java 2. Team leader...
I want to render this in HTML as an ordered list. Now adding the </li> tag to the end is simple enough:
s = replace(s, ". ", "</li>");
But how do I go about replacing the 1., 2. etc with <li>?
I have the regular expression \d*\.$ which matches a number followed by a period, but the problem is that the number is only a substring of the text, so matching 1. Must have experience in Java 2. Team leader with \d*\.$ returns false.
Code
See regex in use here
\d+\.\s+(.*?)\s*(?=\d+\.\s+|$)
Replace
<li>$1</li>\n
Results
Input
1. Must have experience in Java 2. Team leader...
Output
<li>Must have experience in Java</li>
<li>Team leader...</li>
Explanation
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
(.*?) Capture any character any number of times, but as few as possible, into capture group 1
\s* Match any number of whitespace characters
(?=\d+\.\s+|$) Positive lookahead ensuring that either of the following matches
\d+\.\s+
\d+ Match one or more digits
\. Match the dot character . literally
\s+ Match one or more whitespace characters
$ Assert position at the end of the line
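In Java, the regex and replacement above could be applied with String#replaceAll. This is only a sketch; the class and method names (OrderedListDemo, toListItems) are invented for the example.

```java
class OrderedListDemo {
    /** Wraps each "N. text" item in <li>...</li>, one item per line. */
    static String toListItems(String text) {
        // Backslashes are doubled because the regex lives in a Java string literal.
        return text.replaceAll("\\d+\\.\\s+(.*?)\\s*(?=\\d+\\.\\s+|$)", "<li>$1</li>\n");
    }

    public static void main(String[] args) {
        String s = "1. Must have experience in Java 2. Team leader...";
        // Add the surrounding <ol> tags separately
        System.out.print("<ol>\n" + toListItems(s) + "</ol>\n");
    }
}
```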
But how do I go about replacing the 1., 2. etc with <li>?
You can use String#replaceAll, which accepts a regex, instead of replace:
s = s.replaceAll("\\d+\\.\\s", "<li>");
Note
You don't need to use $ at the end of your regex.
You have to escape the dot . because it means any character in regex.
You can use \s for one whitespace character, \s* for zero or more, or \s+ for one or more.
We want
<ol>
<li>one</li>
<li>two</li>
</ol>
This can be done as:
s = s.replaceAll("(?s)(\\d+\\.)\\s+(.*?\\.)\\s*", "<li>$2</li></ol>");
s = s.replaceFirst("<li>", "<ol><li>");
s = s.replaceAll("(?s)</li></ol><li>", "</li>\n<li>");
The trick is to first add </li></ol> with a spurious </ol> that should only remain after the last list item.
(?s) is the DOTALL notation, causing . to also match line breaks.
In case of more than one numbered list this will not do. Also it assumes one single sentence per list item.

How to match ^(d+) in a particular text using regex

For example I have text like below :
case1:
(1) Hello, how are you?
case2:
Hi. (1) How're you doing?
Now I want to match the text which starts with (\d+).
I have tried the following regex but nothing is working.
^[\(\d+\)], ^\(\d+\).
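[] is a character class: it matches any single one of the characters you specify inside the brackets, and can be followed by a quantifier. So ^[\(\d+\)] matches just one character at the start, not the whole (1).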
The second regexp will work: ^\(\d+\), so check your code.
Check also so there's no space in front of the first parenthesis, or add \s* in front.
EDIT: Also, Java can be tricky with escapes, depending on whether the regexp you type is used directly as a regexp or is first a string literal. You may need to double escape your escapes.
In Java you have to escape parentheses, so "\\(\\d+\\)" should match (1) in case one and two. Adding ^ as you did, "^\\(\\d+\\)" will match only case1.
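A small sketch of the difference between the anchored and unanchored versions, using Matcher#find. The class and method names (AnchorDemo, containsNumberTag, startsWithNumberTag) are made up for illustration.

```java
import java.util.regex.Pattern;

class AnchorDemo {
    // "\\(" in the Java literal becomes \( in the regex: a literal parenthesis
    private static final Pattern ANYWHERE = Pattern.compile("\\(\\d+\\)");
    private static final Pattern AT_START = Pattern.compile("^\\(\\d+\\)");

    static boolean containsNumberTag(String s) {
        return ANYWHERE.matcher(s).find();
    }

    static boolean startsWithNumberTag(String s) {
        return AT_START.matcher(s).find();
    }

    public static void main(String[] args) {
        String case1 = "(1) Hello, how are you?";
        String case2 = "Hi. (1) How're you doing?";
        System.out.println(containsNumberTag(case1) + " " + startsWithNumberTag(case1));
        System.out.println(containsNumberTag(case2) + " " + startsWithNumberTag(case2));
    }
}
```

find() is used rather than matches() because matches() would require the whole string to fit the pattern.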
You have to use double backslashes within a Java string. Consider this:
"\n" give you [line break]
"\\n" give you [backslash][n]
If you are going to downvote my post, at least comment to tell me WHY it's not useful.
I believe Java's Regex Engine supports Positive Lookbehind, in which case you can use the following regex:
(?<=[(][0-9]{1,9999}[)]\s?)\b.*$
Which matches:
The literal text (
Any digit [0-9], between 1 and 9999 times {1,9999}
The literal text )
A space, between 0 and 1 times \s?
A word boundary \b
Any character, between 0 and unlimited times .*
The end of a string $

How to make a regular expression that matches tokens with delimiters and separators?

I want to be able to write a regular expression in java that will ensure the following pattern is matched.
<D-05-hello-87->
For the letter D, this can be either 'D' or 'E' in capital letters, and only one of these letters, once.
The two numbers you see must always be a 2 digit decimal number, not 1 or 3 numbers.
The string must start and end with '<' and '>' and contain '-' to separate the parts within.
The message in the middle 'hello' can be any character but must not be more than 99 characters in length. It can contain white spaces.
Also, this pattern will be repeated, so the expression needs to recognise the individual patterns within a long string of these patterns and ensure they follow this structure. E.g
So far I have tried this:
([<](D|E)[-]([0-9]{2})[-](.*)[-]([0-9]{2})[>]\z)+
But the problem is (.*) which sees anything after it as part of any character match and ignores the rest of the pattern.
How might this be done? (Using Java reg ex syntax)
Try making it non-greedy or negation:
(<([DE])-([0-9]{2})-(.*?)-([0-9]{2})>)
Live Demo: http://ideone.com/nOi9V3
Update: tested and working
<([DE])-(\d{2})-(.{1,99}?)-(\d{2})>
See it working: http://rubular.com/r/6Ozf0SR8Cd
You should not wrap -, < and > in [ ]
Assuming that you want to stop at the first dash, you could use [^-]* instead of .*. This will match all non-dash characters.
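A minimal sketch of scanning a long string for repeated tokens with the non-greedy regex above. The class and method names (TokenScanner, messages) and the sample input are assumptions made for the example; the sample tokens have no dash directly before '>', matching the regex as written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenScanner {
    private static final Pattern TOKEN =
            Pattern.compile("<([DE])-(\\d{2})-(.{1,99}?)-(\\d{2})>");

    /** Collects the message part (group 3) of every token found in the input. */
    static List<String> messages(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            result.add(m.group(3));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(messages("<D-05-hello-87><E-12-hi there-34>"));
    }
}
```

Because find() skips over text that does not match, this only extracts well-formed tokens; it does not by itself prove that the whole string consists of nothing but valid tokens.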

Analysing a more complex regex

In a previous question that I asked,
String split in java using advanced regex
someone gave me a fantastic answer to my problem (as described in the above link)
but I never managed to fully understand it. Can somebody help me? The regex I was given
is this:
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+"
I can understand some basic things, but there are parts of this regex that even after
thoroughly searching Google I could not find, like the question mark preceding the s at the
start, or how exactly the second parenthesis works with the question mark and the equals sign at the start. Is it also possible to expand it to work with other types of quotes, like “ ” for example?
Any help is really appreciated.
"(?s)(?=(([^\"]+\"){2})*[^\"]*$)\\s+" Explained;
(?s) # This equals a DOTALL flag in regex, which allows the `.` to match newline characters. As far as I can tell from your regex, it's superfluous.
(?= # Start of a lookahead. It checks ahead in the input, but matches "an empty string" (1)
(([^\"]+\"){2})* # This group is repeated any amount of times, including none. I will explain the content in more detail.
([^\"]+\") # This is looking for one or more occurrences of a character that is not `"`, followed by a `"`.
{2} # Repeat 2 times. When combined with the previous group, it is looking for 2 occurrences of text followed by a quote. In effect, this means it is looking for an even number of `"`.
[^\"]* # Matches any number of characters that are not a double quote. This means literally _any_ other character, including newline characters, even without the DOTALL flag
$ # The lookahead actually inspects until end of string.
) # End of lookahead
\\s+ # Matches one or more whitespace characters, including spaces, tabs and so on
That complicated lookahead, with its group repeated in pairs, makes the regex match only whitespace that is not in between two ". Take this string:
text that has a "string in it".
When used with String.split, this splits the string into [text, that, has, a, "string in it".]
Whitespace is only matched when an even number of " follows it. If the quote is never closed, as in the following, the spaces before the lone " have an odd number of " ahead and are not matched, while the spaces after it have none ahead and are:
text that nearly has a "string in it.
Splitting the string into [text that nearly has a "string, in, it.]
(1) When I say that the lookahead matches "an empty string", I mean that it consumes nothing: it only looks ahead from the current point in the input and checks a condition. The actual match is made by \\s+, which follows the lookahead.
The (?s) part is an embedded flag expression, enabling the DOTALL mode, which means the following:
In dotall mode, the expression . matches any character, including a line terminator. By default this expression does not match line terminators.
The (?=expr) is a look-ahead expression. This means that the regex looks to match expr, but then moves back to the same point before continuing with the rest of the evaluation.
In this case, it means that the regex matches any \\s+ occurrence that is followed by an even number of ", then followed by non-" until the end ($). In other words, it checks that there are an even number of " ahead.
It can definitely be expanded to other quotes too. The only problem is the ([^\"]+\"){2} part, that will probably have to be made to use a back-reference (\n) instead of the {2}.
This is fairly simple..
Concept
It splits at \s+ whenever there is an even number of " ahead.
For example:
Hello hi "Hi World"
     ^  ^   ^
     |  |   |-> will not split here since there is an odd number of " ahead
     ----
      |
      |-> split here because there is an even number of " ahead
Grammar
\s matches a \n or \r or space or \t
+ is a quantifier which matches previous character or group 1 to many times
[^\"] would match anything except "
(x){2} would match x 2 times
a(?=bc) would match if a is followed by bc
(?=ab)a would first check for ab from the current position and then return to that position. It then matches a. (?=ab)c would never match, because the character at that position must be a, not c
With (?s) (singleline mode) . would match newlines. In this case there is no need for (?s), since the regex contains no .
I would use
\s+(?=([^"]*"[^"]*")*[^"]*$)
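A short sketch of using this last regex with String#split in Java. The class and method names (QuoteAwareSplit, split) are made up for the example.

```java
import java.util.Arrays;

class QuoteAwareSplit {
    /** Splits on whitespace that is followed by an even number of double quotes. */
    static String[] split(String s) {
        // Inside a Java literal each " in the regex is written as \"
        return s.split("\\s+(?=([^\"]*\"[^\"]*\")*[^\"]*$)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("text that has a \"string in it\".")));
        System.out.println(Arrays.toString(split("Hello hi \"Hi World\"")));
    }
}
```

The lookahead rescans the rest of the string at every candidate whitespace position, so for very long inputs a hand-written tokenizer may be the better choice.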

What does this Regular expression do?

I'm in the process of converting a program from Perl to Java. I have come across the line
my ($title) = ($info{$host} =~ /^\s*\(([^\)]+)\)\s*$/);
I'm not very good with regular expressions but from what I can tell this is matching something in the string $info{$host} against the regular expression ^\s*\(([^\)]+)\)\s*$ and assigning the match to $title.
My problem is that I have no clue what the regular expression is doing and what it will match. Any help would be appreciated.
Thanks
The regular expression matches a string that contains exactly one pair of matching parentheses (actually, one opening and one matching closing parenthesis, but inside any number of further opening parentheses may occur).
The string may begin and end with whitespace characters, but no others. Inside the parentheses, however, arbitrary characters may occur (at least one).
The following strings should match it:
(abc)
(()
(ab)
By the way, you may simply use the regular expression as-is in Java (after escaping the backslashes), using the Pattern class.
It will match a bunch of leading whitespace, followed by a left paren, followed by some text not including a right paren, followed by a right paren, followed by some more whitespace.
Matches:
(some stuff)
Fails:
(some stuff
some stuff)
(some stuff) asadsad
OK, step by step:
/ - opens the regex
^ - the beginning of the string
\s* - zero or more whitespace characters
\( - an actual ( character
( - begin a capture group
[^\)]+ - one or more characters that are not ); the ^ at the start of the class negates it
) - end the capture group
\) - an actual ) character
\s* - zero or more whitespace characters
$ - the end of the string
/ - closes the regex
So we are looking for strings like " (some title) ", and the text between the parentheses is what gets captured.
my ($title) = ($info{$host} =~ /^\s*\(([^\)]+)\)\s*$/);
First, m// in list context returns the captured matches. my ($title) puts the right hand side in list context. Second, $info{$host} is matched against the following pattern:
/^ \s* \( ( [^\)]+) \) \s* $/x
Yes, I used the x flag so I could insert some spaces. ^\s* skips any leading whitespace. Then we have an escaped parenthesis (therefore no capture group is created). Then we have a capture group containing [^\)]+. That character class can be better written as [^)] because the right parenthesis is not special in a character class; it means anything but a right parenthesis.
If there are one or more characters other than a closing parenthesis following the opening parenthesis, followed by a closing parenthesis, optionally surrounded on either side by whitespace, that sequence of characters is captured and put into $title.
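For the Java port, something along these lines might work. This is a sketch only: the class and method names (TitleExtractor, extractTitle) are hypothetical, and returning null for a non-match is an assumption standing in for Perl leaving $title undefined.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TitleExtractor {
    // Same regex as the Perl line, with backslashes doubled for the Java literal.
    private static final Pattern TITLE = Pattern.compile("^\\s*\\(([^)]+)\\)\\s*$");

    /** Returns the text between the parentheses, or null if the input does not match. */
    static String extractTitle(String info) {
        Matcher m = TITLE.matcher(info);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractTitle("  (some stuff)  "));
        System.out.println(extractTitle("(some stuff) asadsad"));
    }
}
```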
