match repeated blocks separated by ~ - java

To match the following text:
text : SS~B66\88~PRELIMINARY PAGES\M01~HEADING PAGES
It has this format: <code1>~<description1>\<code2>~<description2>\<code3>~<description3>....<codeN>~<descriptionN>
I used this regex: [A-Z0-9 ]+~[A-Z0-9 ]+(?:\\[A-Z0-9 ]+~[A-Z0-9 ]+)+
So:
case 1. SS~B66\88~PRELIMINARY PAGES\M01~HEADING PAGES (Match: OK)
case 2. SS~B66\88~PRELIMINARY PAGES~HEADING PAGES (No Match: OK because I removed the code 'M01')
case 3. SS~B66~PRELIMINARY PAGES\M01~HEADING PAGES (No Match: OK because I removed the code '88')
More examples:
SS~B66\88~MEKLKE\M01~MOIIE
B~A310\0~PRELIM#INARY\00-00~HEADING
My problem is that <code> and <description> can contain any type of character, so I replaced my regex with:
.+~.+(?:\\.+~.+)+ , but this new regex also matches case 2 and case 3.
Thank you for your help.

Instead of using [A-Z0-9 ] which would not match all the allowed chars, or .+ which would match too much, you can use a negated character class [^~\\] matching any char except \ and ~ to set the boundaries for the matched parts.
^[^~]+~[^~\\]+(?:\\[^~]+~[^~\\]+)+$
^ Start of string
[^~]+~ Match any char other than ~, then match ~
[^~\\]+ Repeat matching 1+ times any char other than ~ and \
(?: Non capture group
\\[^~]+~[^~\\]+ Match \ and a ~ between other chars than ~ before and ~ \ after
)+ Close the group and repeat 1 or more times to match at least a \
$ End of string
Regex demo (The demo contains \n to not cross the newlines in the example data)
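As a minimal sketch of how the anchored pattern above behaves in Java, the snippet below runs it against the three cases from the question plus one of the extra examples. The names CodeDescriptionCheck and isValid are made up for illustration; note that every regex backslash has to be doubled inside the Java string literal.

```java
import java.util.regex.Pattern;

class CodeDescriptionCheck {
    // Regex: ^[^~]+~[^~\\]+(?:\\[^~]+~[^~\\]+)+$
    private static final Pattern BLOCKS =
            Pattern.compile("^[^~]+~[^~\\\\]+(?:\\\\[^~]+~[^~\\\\]+)+$");

    // True when the whole text is code~description blocks joined by backslashes.
    static boolean isValid(String text) {
        return BLOCKS.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("SS~B66\\88~PRELIMINARY PAGES\\M01~HEADING PAGES")); // case 1: true
        System.out.println(isValid("SS~B66\\88~PRELIMINARY PAGES~HEADING PAGES"));      // case 2: false
        System.out.println(isValid("SS~B66~PRELIMINARY PAGES\\M01~HEADING PAGES"));     // case 3: false
        System.out.println(isValid("B~A310\\0~PRELIM#INARY\\00-00~HEADING"));           // true
    }
}
```

Since matches() already tests the entire input, the ^ and $ anchors are redundant here; they are kept only so the pattern stays identical to the one explained above.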

Related

Regex to extract hashtags with two dot-separated parts

I'm trying to create a regular expression in order to extract some text from strings. I want to extract text from urls or normal text messages e.g.:
endpoint/?userId=#someuser.id
OR
Hi #someuser.name, how are you?
And from both I want to extract exactly #someuser.name from the message and #someuser.id from the url. There might be many of those strings to extract from the url and messages.
My regular expression currently looks like this:
(#[^\.]+?\.)([^\W]\w+\b)
It works fine, except for one case that I don't know how to handle, e.g.:
Those strings SHOULD NOT be matched: # .id, #.id. There must be at least one character between # and .. One or more spaces between those characters should not be matched.
How can I do it using my current regex?
You may use
String regex = "#[^.#]*[^.#\\s][^#.]*\\.\\w+";
See the regex demo.
Details
# - a # symbol
[^.#]* - zero or more chars other than . and #
[^.#\\s] - any char but ., # and whitespace
[^#.]* - zero or more chars other than . and #
\. - a dot
\w+ - 1+ word chars (letters, digits or _).
Java demo:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String s = "# #.id\nendpoint/?userId=#someuser.id\nHi #someuser.name, how are you?";
String regex = "#[^.#]*[^.#\\s][^#.]*\\.\\w+";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
// print every full match found in the input
while (matcher.find()) {
    System.out.println(matcher.group(0));
}
Output:
#someuser.id
#someuser.name
You can try the following regex:
#(\w+)\.(\w+)
demo
Notes:
remove the parentheses if you do not want to capture any group.
in your Java regex string you need to escape every \
this gives #(\\w+)\\.(\\w+)
if the id is only made of digits you can replace the second \w with [0-9]
if the username includes characters other than letters, digits and underscore, you have to change \w into a character class listing all the allowed characters explicitly.
Code sample:
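import java.util.regex.Matcher;
import java.util.regex.Pattern;
String input = "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id, #.id.";
Matcher m = Pattern.compile("#(\\w+)\\.(\\w+)").matcher(input);
// print every full match found in the input
while (m.find()) {
    System.out.println(m.group());
}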
output:
#someuser.id
#someuser.name
The redefined requirements are:
We search for pattern #A.B
A can be anything, but it may not consist only of whitespace and it may not contain # or .
B can only be regular ASCII letters or digits
Converting those requirements into a (possible) regex:
#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+
Explanation:
#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+ # The entire capture for the Java-Matcher:
# # A literal '#' character
[^.#]+ # Followed by 1 or more characters which are NOT '.' nor '#'
( \\.) # Followed by a '.' character
(?<! ) # Which is NOT preceded by (negative lookbehind):
# # A literal '#'
\\s+ # With 1 or more whitespaces
[A-Za-z0-9]+ # Followed by 1 or more alphanumeric characters
# (PS: \\w+ could be used here if '_' is allowed as well)
Test code:
String input = "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id #.id %^*##*(.H(#EH Ok, # some spaces here .but none here #$p€©ï#l.$p€©ï#l that should do it..";
System.out.println("Input: \""+ input + '"');
System.out.println("Outputs: ");
java.util.regex.Matcher matcher = java.util.regex.Pattern.compile("#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+")
.matcher(input);
while(matcher.find())
System.out.println('"'+matcher.group()+'"');
Try it online.
Which outputs:
Input: "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id #.id %^*##*(.H(#EH Ok, # some spaces here .but none here #$p€©ï#l.$p€©ï#l that should do it.."
Outputs:
"#someuser.id"
"#someuser.name"
"#*(.H"
"# some spaces here .but"
#(\w+)[.](\w+)
results in two capture groups, e.g.
endpoint/?userId=#someuser.id -> group(1)=someuser and group(2)=id
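To illustrate the group numbering, here is a small sketch; HashtagGroups and parts are hypothetical names chosen for this example. In java.util.regex, group(0) is always the whole match, and the two parenthesised parts are group(1) and group(2).

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagGroups {
    private static final Pattern TAG = Pattern.compile("#(\\w+)[.](\\w+)");

    // Returns {whole match, first part, second part} for the first hit,
    // or an empty array when the input contains no #A.B token.
    static String[] parts(String input) {
        Matcher m = TAG.matcher(input);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        // prints [#someuser.id, someuser, id]
        System.out.println(Arrays.toString(parts("endpoint/?userId=#someuser.id")));
        // prints [] because neither "# .id" nor "#.id" has a word char right after #
        System.out.println(Arrays.toString(parts("# .id, #.id")));
    }
}
```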

high-level regular expression with not

Hi regular expression experts,
I have the following text
<[~UNKNOWN:a-z\.]> <[~UNKNOWN:A-Z\-0-9]> <[~UNKNOWN:A-Z\]a-z]
And the following reg expr
\[\~[^\[\~\]]*\]
It works fine for the 1st and 2nd group in the text but not for the 3rd one.
The 1st group is
[~UNKNOWN:a-z\.]
The 2nd is
[~UNKNOWN:A-Z\-0-9]
and the 3rd one is
[~UNKNOWN:A-Z\]a-z]
However the reg exp finds the following text
[~UNKNOWN:A-Z\]
I understand why and I know that I have to add the following rule to the reg exp:
starting with '[' and '~' characters and ending with ']' UNLESS there is a '\' in front of ']'. So I should add a NOT expression but not sure how.
Could anybody please help?
Thanks,
V.
Why not simply:
<([^>]+)>?
Regex Demo
This should work (first line pattern, second line your pattern (ignore whitespace), third line my changes):
\[\~(?:[^\[\~\]]|(?<=\\)\])*(?<!\\)\]
\[\~ [^\[\~\]] * \]
(?: |(?<=\\)\]) (?<!\\)
Your regex:
\[\~ # Literal characters [~
[^ # Character group, NONE of the following:
\[\~\] # [ or ~ or ]
]* # 0 or more of this character group
\] # Followed by ]
Your pattern in words: [~, everything in between, up to the next ], as long as there is no [ or ~ or ] in there.
My pattern, only relevant changes explained:
\[\~
(?: # Non capturing group
[^\[\~\]]
| # OR
(?<=\\)\] # ], preceded by \
)*
(?<!\\)\] # ], not preceded by \
In words: Same as yours, plus ] may be contained if it is preceded by \, and the closing ] may not be preceded by \
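A minimal Java sketch of the pattern above, applied to the text from the question. EscapedBracketTokens and findTokens are illustrative names, not part of the original answer, and the regex backslashes are doubled for the Java string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedBracketTokens {
    // Regex: \[\~(?:[^\[\~\]]|(?<=\\)\])*(?<!\\)\]
    private static final Pattern TOKEN =
            Pattern.compile("\\[\\~(?:[^\\[\\~\\]]|(?<=\\\\)\\])*(?<!\\\\)\\]");

    // Collects every [~...] token; a ] preceded by \ does not end the token.
    static List<String> findTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "<[~UNKNOWN:a-z\\.]> <[~UNKNOWN:A-Z\\-0-9]> <[~UNKNOWN:A-Z\\]a-z]";
        // prints the three tokens, the last one being [~UNKNOWN:A-Z\]a-z]
        for (String token : findTokens(text)) {
            System.out.println(token);
        }
    }
}
```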

Java regex to extract text sequences across multiple lines

Given an excerpt of text like
Preface (optional, up to multiple lines)
Main : sequence1
sequence2
sequence3
sequence4
Epilogue (optional, up to multiple lines)
which Java regular expression could be used to extract all the sequences (i.e. sequence1, sequence2, sequence3, sequence4 above)? For example, a Matcher.find() loop?
Each "sequence" is preceded by and may also contain 0 or more white spaces (including tabs).
The following regex
(?m).*Main(?:[ |t]+:(?:[ |t]+(\S+)[\r\n])+
only yields the first sequence (sequence1).
You may use the following regex:
(?m)(?:\G(?!\A)[^\S\r\n]+|^Main\s*:\s*)(\S+)\r?\n?
Details:
(?m) - multiline mode on
(?:\G(?!\A)[^\S\r\n]+|^Main\s*:\s*) - either of the two:
\G(?!\A)[^\S\r\n]+ - end of the previous successful match (\G(?!\A)) and then 1+ horizontal whitespaces ([^\S\r\n]+, can be replaced with [\p{Zs}\t]+ or [\s&&[^\r\n]]+)
| - or
^Main\s*:\s* - start of a line, Main, 0+ whitespaces, :, 0+ whitespaces
(\S+) - Group 1 capturing 1+ non-whitespace symbols
\r?\n? - an optional CR and an optional LF.
See the Java code below:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String p = "(?m)(?:\\G(?!\\A)[^\\S\r\n]+|^Main\\s*:\\s*)(\\S+)\r?\n?";
String s = "Preface (optional, up to multiple lines)...\nMain : sequence1\n sequence2\n sequence3\n sequence4\nEpilogue (optional, up to multiple lines)";
Matcher m = Pattern.compile(p).matcher(s);
// Group 1 holds one sequence per iteration: sequence1 through sequence4
while (m.find()) {
    System.out.println(m.group(1));
}

Get equations from string

I have a string in following pattern
( var1=:key1:'any_value including space and 'quotes'' AND/OR var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )
I want to get following result from this.
:key1:'any_value including space and 'quotes''
:key2:'any_value...'
:key3:'any_value...'
Could any one please suggest the pattern/RE for the same ?
Failed attempts :
First I can split it by AND/OR and again split the further strings on : and so on, but looking for single RE/Pattern which can do this.
You can use this regex with negated pattern to match your data:
":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))"
RegEx Demo
Breakup:
: # match a literal :
[^:]+ # match 1 or more characters that are not :
: # match a literal :
' # match a literal '
.*? # match 0 or more of any characters (non-greedy)
' # match a literal '
(?=\s*(?:AND(?:/OR)?|\))) # lookahead to assert there is AND/OR at the end or closing )
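The breakup above can be tried out with the sketch below. KeyValueExtractor and extract are hypothetical names, and the input assumes the separator appears literally as AND/OR, as in the question's sample; the lookahead as written accepts AND or AND/OR but not a bare OR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValueExtractor {
    private static final Pattern PAIR =
            Pattern.compile(":[^:]+:'.*?'(?=\\s*(?:AND(?:/OR)?|\\)))");

    // Returns every :key:'value' chunk; the lazy .*? keeps extending until the
    // closing quote is followed by AND, AND/OR or the final parenthesis.
    static List<String> extract(String expression) {
        List<String> pairs = new ArrayList<>();
        Matcher m = PAIR.matcher(expression);
        while (m.find()) {
            pairs.add(m.group());
        }
        return pairs;
    }

    public static void main(String[] args) {
        String input = "( var1=:key1:'any_value including space and 'quotes'' AND/OR "
                + "var2=:key2:'any_value...' AND/OR var3=:key3:'any_value...' )";
        // prints the three :keyN:'...' chunks, one per line
        for (String pair : extract(input)) {
            System.out.println(pair);
        }
    }
}
```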
I think this would work for your circumstances.
Unless you know parsing of quotes, there is not much else you could do.
Raw: (?<==)(?:(?!\s*AND/OR).)+
Quoted: "(?<==)(?:(?!\\s*AND/OR).)+"
Expanded:
(?<= = ) # A '=' behind
(?:
(?! \s* AND/OR ) # Not 'AND/OR' in front
.
)+

Regex - trying to match / \ | , or newline, error says invalid escape character

I'm actually trying to split a string on any of the following :
/
\
|
,
\n
Here's the regex I'm using, which gives the 'invalid escape character' error :
String delims = "[\\\\\|\\/\\n,]+";
String[] list1 = str1.split(delims);
I've tried a few more versions of this, trying to get the number of \'s right. What's the right way to do this?
"[/\\|\n,\\\\]+"
Some of these you need to double escape
/ matches /
\\| matches |
\n matches new line
, matches ,
\\\\ matches \
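To create a \ literal in the regex engine you need to write it as four \ in the Java string literal, so your pattern has one \ too many: the fifth backslash starts \|, which is not a valid escape sequence in a Java string.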
"[\\\\\|\\/\\n,]+";
1234^
here
Also, you don't need to escape / in the Java regex engine, and you don't need to pass \n as \\n (a literal \n is also accepted), so you can try:
String delims = "[\\\\|/\n,]+";
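For a quick check, the following sketch splits a sample string on all five delimiters with that character class. DelimiterSplit and the sample input are assumptions made for illustration only.

```java
import java.util.Arrays;

class DelimiterSplit {
    // Character class: backslash (written as four \ in Java), |, /, newline, comma
    private static final String DELIMS = "[\\\\|/\n,]+";

    static String[] split(String text) {
        return text.split(DELIMS);
    }

    public static void main(String[] args) {
        // The Java literal below is the text a/b\c|d,e followed by a newline and f
        String sample = "a/b\\c|d,e\nf";
        System.out.println(Arrays.toString(split(sample))); // prints [a, b, c, d, e, f]
    }
}
```

Because of the trailing +, a run of adjacent delimiters is treated as a single separator, so no empty strings appear between them.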
