Java regex to extract text sequences across multiple lines


Given an excerpt of text like
Preface (optional, up to multiple lines)
Main : sequence1
sequence2
sequence3
sequence4
Epilogue (optional, up to multiple lines)
which Java regular expression could be used to extract all the sequences (i.e. sequence1, sequence2, sequence3, sequence4 above)? For example, a Matcher.find() loop?
Each "sequence" is preceded by and may also contain 0 or more white spaces (including tabs).
The following regex
(?m).*Main(?:[ |\t]+:(?:[ |\t]+(\S+)[\r\n])+
only yields the first sequence (sequence1).

You may use the following regex:
(?m)(?:\G(?!\A)[^\S\r\n]+|^Main\s*:\s*)(\S+)\r?\n?
Details:
(?m) - multiline mode on
(?:\G(?!\A)[^\S\r\n]+|^Main\s*:\s*) - either of the two:
\G(?!\A)[^\S\r\n]+ - end of the previous successful match (\G(?!\A)) and then 1+ horizontal whitespaces ([^\S\r\n]+, can be replaced with [\p{Zs}\t]+ or [\s&&[^\r\n]]+)
| - or
^Main\s*:\s* - start of a line, Main, 0+ whitespaces, :, 0+ whitespaces
(\S+) - Group 1 capturing 1+ non-whitespace symbols
\r?\n? - an optional CR and an optional LF.
See the Java code below:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String p = "(?m)(?:\\G(?!\\A)[^\\S\r\n]+|^Main\\s*:\\s*)(\\S+)\r?\n?";
String s = "Preface (optional, up to multiple lines)...\nMain : sequence1\n sequence2\n sequence3\n sequence4\nEpilogue (optional, up to multiple lines)";
Matcher m = Pattern.compile(p).matcher(s);
while (m.find()) {
    System.out.println(m.group(1)); // prints sequence1 through sequence4, one per line
}

Related

Regex pattern matching is getting timed out

I want to split an input string based on a regex pattern using the Pattern.split(String) API. The regex uses lookaheads and lookbehinds. It is supposed to split on a delimiter (,) and needs to ignore the delimiter if it is enclosed in double quotes ("x,y").
The regex is - (?<!(?<!\Q\\E)\Q\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\E]*[\Q"\E]*)*[^\Q"\E]*$)
The input string for which this split call is getting timed out is -
"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I read that lookaround techniques are heavy and can cause timeouts if the string is too long. If I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to detect the delimiters and split.
With the above string, the pattern also fails to detect the first delimiter at [STIFFENER]","QH20426AD3. But if I remove the backslashes enclosing [\"BOLT,HI-JOK\"] at the end of the string, the regex detects it.
I am not very experienced with lookarounds in regex. Can someone please give hints about how I can optimize this regex and avoid timeouts?
Any pointers or article links are appreciated!

If you want to split on a comma only when it is followed by a string that runs from an opening to a closing double quote, you can use:
,(?="[^"\\]*(?:\\.[^"\\]*)*")
The pattern matches:
, Match a comma
(?= Positive lookahead
"[^"\\]* Match " and 0+ times any char except " or \
(?:\\.[^"\\]*)*" Optionally repeat matching \ followed by any char (the .) to allow escapes, then again any chars other than " and \, and finally match the closing "
) Close lookahead
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts)
System.out.println(part);
Output
""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"

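As an alternative to splitting, the quoted-field sub-pattern from the answer above can be used directly in a Matcher.find() loop, so no lookahead is needed at all. This is only a minimal sketch: it assumes every field is wrapped in double quotes, as in the sample input, and the class and method names (QuotedFieldSplitter, fields) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedFieldSplitter {
    // One double-quoted field: opening quote, then runs of ordinary characters
    // or backslash-escaped characters, then the closing quote.
    private static final Pattern FIELD =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

    // Collects every quoted field in order; the commas between fields are skipped.
    static List<String> fields(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened, hypothetical input with the same shape as the question's string.
        String line = "\"\",\"a,b [X]\",\"c [\\\"BOLT,HI-JOK\\\"]\"";
        fields(line).forEach(System.out::println);
    }
}
```

Each field is consumed once from left to right, so commas and escaped quotes inside a field never trigger a second scan of the rest of the string.
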
Regex to extract hashtags with two dot-separated parts

I'm trying to create a regular expression in order to extract some text from strings. I want to extract text from urls or normal text messages e.g.:
endpoint/?userId=#someuser.id
OR
Hi #someuser.name, how are you?
From both I want to extract exactly #someuser.name from the message and #someuser.id from the URL. There might be many such strings to extract from the URLs and messages.
My regular expression currently looks like this:
(#[^\.]+?\.)([^\W]\w+\b)
It works fine, except for one case that I don't know how to handle:
These strings SHOULD NOT be matched: # .id, #.id. There must be at least one character between # and .. One or more spaces alone between those characters should not be matched.
How can I do it using my current regex?

You may use
String regex = "#[^.#]*[^.#\\s][^#.]*\\.\\w+";
Details
# - a # symbol
[^.#]* - zero or more chars other than . and #
[^.#\s] - any char but ., # and whitespace
[^#.]* - zero or more chars other than . and #
\. - a dot
\w+ - 1+ word chars (letters, digits or _).
Java demo:
String s = "# #.id\nendpoint/?userId=#someuser.id\nHi #someuser.name, how are you?";
String regex = "#[^.#]*[^.#\\s][^#.]*\\.\\w+";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(0));
}
Output:
#someuser.id
#someuser.name

You can try the following regex:
#(\w+)\.(\w+)
Notes:
remove the parentheses if you do not want to capture any group.
in your Java regex string you need to escape every \
this gives #(\\w+)\\.(\\w+)
if the id is made only of digits you can replace the second \w with [0-9]
if the username includes characters other than letters, digits and underscore you have to replace \w with a character class that explicitly lists all the authorised characters.
Code sample:
String input = "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id, #.id.";
Matcher m = Pattern.compile("#(\\w+)\\.(\\w+)").matcher(input);
while (m.find()) {
System.out.println(m.group());
}
output:
#someuser.id
#someuser.name

The redefined requirements are:
We search for pattern #A.B
A can be anything except # or ., but it may not consist of whitespace only
B can only be regular ASCII letters or digits
Converting those requirements into a (possible) regex:
#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+
Explanation:
#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+ # The entire capture for the Java-Matcher:
# # A literal '#' character
[^.#]+ # Followed by 1 or more characters which are NOT '.' nor '#'
( \\.) # Followed by a '.' character
(?<! ) # Which is NOT preceded by (negative lookbehind):
# # A literal '#'
\\s+ # With 1 or more whitespaces
[A-Za-z0-9]+ # Followed by 1 or more alphanumeric characters
# (PS: \\w+ could be used here if '_' is allowed as well)
Test code:
String input = "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id #.id %^*##*(.H(#EH Ok, # some spaces here .but none here #$p€©ï#l.$p€©ï#l that should do it..";
System.out.println("Input: \""+ input + '"');
System.out.println("Outputs: ");
java.util.regex.Matcher matcher = java.util.regex.Pattern.compile("#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+")
.matcher(input);
while(matcher.find())
System.out.println('"'+matcher.group()+'"');
Which outputs:
Input: "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id #.id %^*##*(.H(#EH Ok, # some spaces here .but none here #$p€©ï#l.$p€©ï#l that should do it.."
Outputs:
"#someuser.id"
"#someuser.name"
"##*(.H"
"# some spaces here .but"

#(\w+)[.](\w+)
yields two capturing groups, e.g.
endpoint/?userId=#someuser.id -> group(1)=someuser and group(2)=id
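A minimal Java sketch of reading those two groups is shown here. Note that Matcher.group(0) is the whole match, so the captured parts are group(1) and group(2). The class and method names (HashtagGroups, firstTag) are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagGroups {
    private static final Pattern TAG = Pattern.compile("#(\\w+)[.](\\w+)");

    // Returns {user, field} for the first #user.field match, or null if there is none.
    static String[] firstTag(String text) {
        Matcher m = TAG.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = firstTag("endpoint/?userId=#someuser.id");
        System.out.println(parts[0]); // someuser
        System.out.println(parts[1]); // id
    }
}
```

Because \w+ needs at least one word character right after #, inputs such as "# .id" and "#.id" produce no match.
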

How to capture multiple groups in regex?

I am trying to capture following word, number:
stxt:usa,city:14
I can capture usa and 14 using:
stxt:(.*?),city:(\d.*)$
However, when the text is:
stxt:usa
the regex does not work. I tried to add an "or" condition using |, but it did not work either:
stxt:(.*?),|city:(\d.*)$

You may use
(stxt|city):([^,]+)
See the regex demo (note the \n added only for the sake of the demo, you do not need it in real life).
Pattern details:
(stxt|city) - either the substring stxt or the substring city (Group 1); you may add \b before the ( to match only a whole word
: - a colon
([^,]+) - 1 or more characters other than a comma (Group 2).
Java demo:
String s = "stxt:usa,city:14";
Pattern pattern = Pattern.compile("(stxt|city):([^,]+)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(1));
System.out.println(matcher.group(2));
}
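If the key/value pairs are needed together rather than printed, the same pattern can fill a map. The sketch below is one possible way to do it; the names KeyValuePairs and parse are illustrative, and a LinkedHashMap is used only to keep the pairs in input order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeyValuePairs {
    private static final Pattern PAIR = Pattern.compile("(stxt|city):([^,]+)");

    // Maps each key found (stxt or city) to the text after its colon.
    static Map<String, String> parse(String text) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            result.put(m.group(1), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("stxt:usa,city:14")); // {stxt=usa, city=14}
        System.out.println(parse("stxt:usa"));         // {stxt=usa}
    }
}
```

Since every pair is matched independently, the input with only stxt:usa works without any change to the pattern.
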

Looking at your string, you could also find the word/digits after the colon.
:(\w+)
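A short sketch of that simpler pattern in a find loop follows, assuming the values consist only of word characters as in the examples. ColonValues and values are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColonValues {
    private static final Pattern VALUE = Pattern.compile(":(\\w+)");

    // Returns the word that follows each colon, in order of appearance.
    static List<String> values(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = VALUE.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(values("stxt:usa,city:14")); // [usa, 14]
        System.out.println(values("stxt:usa"));         // [usa]
    }
}
```
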

Extraction of subsequences that end with point and space by regular expression

Hi,
I want to extract the sub-sentences of this sentence with a regular expression:
it learn od fg network layout. kdsjhuu ddkm networ.12kfdf. learndfefe layout. learn sdffsfsfs. sddsd learn fefe.
I couldn't write a correct regular expression for Pattern.compile.
This is my expression: ([^(\\.\\s)]*)([^.]*\\.)
Actually, I need a way to write "read everything except \\.\\s".
Expected sub-sentences:
it learn od fg network layout.
kdsjhuu ddkm networ.12kfdf.
learndfefe layout.
learn sdffsfsfs.
sddsd learn fefe.

Just split your string with regex "\\. "
String[] arr= str.split("\\. ");
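Note that splitting on "\\. " removes the dot from every piece except the last one. If the trailing dots should be kept, one possible variant (not part of the answer above) is to split on whitespace that is preceded by a dot, using a lookbehind. The class and method names below are hypothetical.

```java
class SentenceSplitter {
    // Splits on whitespace that directly follows a dot; the dot itself is
    // only looked at by the lookbehind, so it stays with the sentence.
    static String[] sentences(String text) {
        return text.split("(?<=\\.)\\s+");
    }

    public static void main(String[] args) {
        String str = "it learn od fg network layout. kdsjhuu ddkm networ.12kfdf. "
                + "learndfefe layout. learn sdffsfsfs. sddsd learn fefe.";
        for (String sentence : sentences(str)) {
            System.out.println(sentence);
        }
    }
}
```

The dot inside networ.12kfdf is not followed by whitespace, so it does not cause a split.
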

You can use this pattern with the find method:
Pattern p = Pattern.compile("[^\\s.][^.]*(?:\\.(?!\\s|\\z)[^.]*)*\\.?");
Matcher m = p.matcher(yourText);
while(m.find()) {
System.out.println(m.group(0));
}
Pattern details:
[^\\s.] # all that is not a whitespace (to trim) or a dot
[^.]* # all that is not a dot (zero or more times)
(?: # open a non-capturing group
\\. (?!\\s|\\z) # dot not followed by a whitespace or the end of the string
[^.]* #
)* # close and repeat the group as needed
\\.? # an optional dot (allow to match a sentence at the end
# of the string even if there is no dot)

Regular expression for no whitespaces on the first position

Example accepted:
This is a try!
And this is the second line!
Example not accepted:
   this is a try with initial spaces
and this the second line
So, I need:
no string made only by whitespaces " "
no string where first char is whitespace
new lines are ok; only the first character cannot be a new line
I was using
^(?=\s*\S).*$
but that pattern can't allow new lines.

You can try this regex
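^(?!\s*$|\s).*$
Details:
\s*$ - inside the negative lookahead: rejects a string made only of whitespace
\s - inside the negative lookahead: rejects a string whose first character is whitespace
.* - matches everything else
You need to use single-line (DOTALL) mode so that . also matches line breaks, and the matches method so that the whole string is checked.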
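In Java, single-line mode corresponds to the Pattern.DOTALL flag. The sketch below shows one way to combine that flag with Matcher.matches(); the class and method names (LeadingWhitespaceCheck, isValid) are assumptions made for the example.

```java
import java.util.regex.Pattern;

class LeadingWhitespaceCheck {
    // DOTALL lets . match line terminators, so .* can span several lines.
    private static final Pattern VALID =
            Pattern.compile("^(?!\\s*$|\\s).*$", Pattern.DOTALL);

    // True when the text is not blank and does not start with whitespace.
    static boolean isValid(String text) {
        return VALID.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("This is a try!\nAnd this is the second line!")); // true
        System.out.println(isValid("  this is a try with initial spaces"));          // false
        System.out.println(isValid("   "));                                          // false
    }
}
```
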

"no string made only by whitespaces" is the same to "no string where first char is whitespace" as it also begins with white space.
You have to set Pattern.MULTILINE, which makes ^ and $ also match at the beginning and end of each line, not only of the entire string:
"^\\S.+$"

I'm not a Java guy, but a solution in Python could look like this here:
In [1]: import re
In [2]: example_accepted = 'This is a try!\nAnd this is the second line!'
In [3]: example_not_accepted = ' This is a try with initial spaces\nand this the second line'
In [4]: pattern = re.compile(r"""
....: ^ # matches at the beginning of a string
....: \S # matches any non-whitespace character
....: .+ # matches one or more arbitrary characters
....: $ # matches at the end of a string
....: """,
....: flags=re.MULTILINE|re.VERBOSE)
In [5]: pattern.findall(example_accepted)
Out[5]: ['This is a try!', 'And this is the second line!']
In [6]: pattern.findall(example_not_accepted)
Out[6]: ['and this the second line']
The key part here is the flag re.MULTILINE. With this flag enabled, ^ and $ do not only match at the beginning and end of a string, but also at the beginning and end of lines which are separated by newlines. I'm sure there is something equivalent for Java as well.
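The Java counterpart of re.MULTILINE is the Pattern.MULTILINE flag. A rough Java equivalent of the findall calls above could look like the sketch below; NonBlankLeadingLines and lines are illustrative names only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonBlankLeadingLines {
    // MULTILINE makes ^ and $ match at the start and end of every line.
    private static final Pattern LINE = Pattern.compile("^\\S.+$", Pattern.MULTILINE);

    // Returns every line that starts with a non-whitespace character.
    static List<String> lines(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(lines("This is a try!\nAnd this is the second line!"));
        System.out.println(lines(" This is a try with initial spaces\nand this the second line"));
    }
}
```

As in the Python session, the first input yields both lines and the second yields only the line without leading whitespace.
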
