Regular expression for no whitespaces on the first position - java

Example accepted:
This is a try!
And this is the second line!
Example not accepted (the first line starts with whitespace):
 this is a try with initial spaces
and this the second line
So, I need:
no string made up only of whitespace, such as " "
no string whose first character is whitespace
newlines are OK; only the first character cannot be a newline
I was using
^(?=\s*\S).*$
but that pattern does not allow newlines.

You can try this regex:
^(?!\s*$|\s).*$
The parts are:
(?!\s*$ rejects a string made only of whitespace
|\s) rejects a string whose first character is whitespace
.*$ matches everything else
You need single-line mode (Pattern.DOTALL in Java) so that . also matches newlines, and you need to use the matches method so that the whole string is tested.
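As a rough Java sketch of the check described above: it assumes that "singleline mode" means Pattern.DOTALL, and the class name FirstCharCheck and the helper isAccepted are hypothetical names chosen only for illustration.

```java
import java.util.regex.Pattern;

final class FirstCharCheck {
    // DOTALL lets . match line terminators, so .* can span several lines.
    private static final Pattern ACCEPTED =
            Pattern.compile("^(?!\\s*$|\\s).*$", Pattern.DOTALL);

    // matches() tests the whole input rather than searching for a substring.
    static boolean isAccepted(String text) {
        return ACCEPTED.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("This is a try!\nAnd this is the second line!")); // true
        System.out.println(isAccepted(" this is a try with initial spaces\nand this the second line")); // false
        System.out.println(isAccepted("   ")); // false
    }
}
```

Because \s also covers line terminators, a string that starts with a newline is rejected by the same lookahead branch that rejects a leading space.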

"no string made only by whitespaces" is the same to "no string where first char is whitespace" as it also begins with white space.
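You have to set Pattern.MULTILINE, which makes ^ and $ match at the beginning and end of each line as well, not only at the beginning and end of the entire string:
"^\\S.+$"
Note that .+ requires at least one more character after the first one; use .* if single-character lines should match too.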

I'm not a Java guy, but a solution in Python could look like this here:
In [1]: import re
In [2]: example_accepted = 'This is a try!\nAnd this is the second line!'
In [3]: example_not_accepted = ' This is a try with initial spaces\nand this the second line'
In [4]: pattern = re.compile(r"""
....: ^ # matches at the beginning of a string
....: \S # matches any non-whitespace character
....: .+ # matches one or more arbitrary characters
....: $ # matches at the end of a string
....: """,
....: flags=re.MULTILINE|re.VERBOSE)
In [5]: pattern.findall(example_accepted)
Out[5]: ['This is a try!', 'And this is the second line!']
In [6]: pattern.findall(example_not_accepted)
Out[6]: ['and this the second line']
The key part here is the flag re.MULTILINE. With this flag enabled, ^ and $ do not only match at the beginning and end of a string, but also at the beginning and end of lines which are separated by newlines. I'm sure there is something equivalent for Java as well.
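A possible Java counterpart of the Python session above, as a minimal sketch: Pattern.MULTILINE plays the role of re.MULTILINE, and a find() loop stands in for findall. The class name NonIndentedLines and its find helper are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NonIndentedLines {
    // MULTILINE makes ^ and $ match at line boundaries, not only at the
    // start and end of the whole input.
    private static final Pattern LINE = Pattern.compile("^\\S.+$", Pattern.MULTILINE);

    // Collects every line that starts with a non-whitespace character.
    static List<String> find(String text) {
        List<String> lines = new ArrayList<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            lines.add(m.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        String accepted = "This is a try!\nAnd this is the second line!";
        String notAccepted = " This is a try with initial spaces\nand this the second line";
        System.out.println(find(accepted));    // [This is a try!, And this is the second line!]
        System.out.println(find(notAccepted)); // [and this the second line]
    }
}
```

As in the Python version, this finds the acceptable lines rather than validating the whole string; to reject an input, compare the number of matches with the number of lines, or use a whole-string check instead.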

Related

Regex pattern matching is getting timed out

I want to split an input string on a regex pattern using the Pattern.split(String) API. The regex uses lookaheads and lookbehinds. It is supposed to split on a delimiter (,) but ignore the delimiter when it is enclosed in double quotes ("x,y").
The regex is - (?<!(?<!\Q\\E)\Q\\E)\Q,\E(?=(?:[^\Q"\E]*(?<=\Q,\E)\Q"\E[[^\Q,\E|\Q"\E] | [\Q"\E]]+[^\Q"\E]*[^\Q\\E]*[\Q"\E]*)*[^\Q"\E]*$)
The input string for which this split call is getting timed out is -
"","1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]","QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"
I read that lookaround techniques are heavy and can cause timeouts if the string is too long. If I remove the backslashes around [\"BOLT,HI-JOK\"] at the end of the string, the regex is able to detect the delimiters and split.
With the string above, the pattern also fails to detect the first delimiter, at [STIFFENER]","QH20426AD3. Again, if I remove the backslashes around [\"BOLT,HI-JOK\"], the regex is able to detect it.
I am not very experienced with lookarounds in regex. Can someone please give hints on how I can optimize this regex and avoid timeouts?
Any pointers or article links are appreciated!

If you want to split on a comma only when what follows it is a complete field, from an opening double quote to its closing double quote:
,(?="[^"\\]*(?:\\.[^"\\]*)*")
The pattern matches:
, Match a comma
(?= Positive lookahead
"[^"\\]* Match " and 0+ times any char except " or \
(?:\\.[^"\\]*)*" Optionally repeat matching \ followed by any char (the escaped char), then again any chars other than " and \, and finally the closing "
) Close lookahead
String string = "\"\",\"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]\",\"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\\\"BOLT,HI-JOK\\\"]\"\n";
String[] parts = string.split(",(?=\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\")");
for (String part : parts)
System.out.println(part);
Output
""
"1114356033020-0011,- [BRACKET],1114356033020-0017,- [FRAME],1114356033020-0019,- [CLIP],1114356033020-0001,- [FRAME ASSY],1114356033020-0013,- [GUSSET],1114356033020-0015,- [STIFFENER]"
"QH20426AD3 [RIVET,SOL FL HD],UY510AE3L [NUT,HEX],PO41071B0 [SEALING CMPD],LL510A3-10 [\"BOLT,HI-JOK\"]"

Regex to extract hashtags with two dot-separated parts

I'm trying to create a regular expression in order to extract some text from strings. I want to extract text from urls or normal text messages e.g.:
endpoint/?userId=#someuser.id
OR
Hi #someuser.name, how are you?
From both I want to extract exactly #someuser.name from the message and #someuser.id from the URL. There might be many such strings to extract from the URLs and messages.
My regular expression currently looks like this:
(#[^\.]+?\.)([^\W]\w+\b)
It works fine except for one case that I don't know how to handle:
These strings SHOULD NOT be matched: # .id, #.id. There must be at least one character between # and the dot, and one or more spaces alone between them should not count.
How can I do it using my current regex?

You may use
String regex = "#[^.#]*[^.#\\s][^#.]*\\.\\w+";
Details
# - a # symbol
[^.#]* - zero or more chars other than . and #
[^.#\s] - any char but ., # and whitespace
[^#.]* - zero or more chars other than . and #
\. - a dot
\w+ - 1+ word chars (letters, digits or _).
Java demo:
String s = "# #.id\nendpoint/?userId=#someuser.id\nHi #someuser.name, how are you?";
String regex = "#[^.#]*[^.#\\s][^#.]*\\.\\w+";
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(s);
while (matcher.find()){
System.out.println(matcher.group(0));
}
Output:
#someuser.id
#someuser.name

You can try the following regex:
#(\w+)\.(\w+)
Notes:
remove the parentheses if you do not want to capture any groups.
in your Java regex string you need to escape every \, which gives #(\\w+)\\.(\\w+)
if the id is made only of digits, you can replace the second \w with [0-9]
if the username includes characters other than letters, digits and underscore, you have to change \w into a character class that explicitly lists all the allowed characters.
Code sample:
String input = "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id, #.id.";
Matcher m = Pattern.compile("#(\\w+)\\.(\\w+)").matcher(input);
while (m.find()) {
System.out.println(m.group());
}
output:
#someuser.id
#someuser.name

The redefined requirements are:
We search for pattern #A.B
A can be anything except only whitespace, and it may not contain # or .
B can only be regular ASCII letters or digits
Converting those requirements into a (possible) regex:
#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+
Explanation:
#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+ # The entire capture for the Java-Matcher:
# # A literal '#' character
[^.#]+ # Followed by 1 or more characters which are NOT '.' nor '#'
( \\.) # Followed by a '.' character
(?<! ) # Which is NOT preceded by (negative lookbehind):
# # A literal '#'
\\s+ # With 1 or more whitespaces
[A-Za-z0-9]+ # Followed by 1 or more alphanumeric characters
# (PS: \\w+ could be used here if '_' is allowed as well)
Test code:
String input = "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id #.id %^*##*(.H(#EH Ok, # some spaces here .but none here #$p€©ï#l.$p€©ï#l that should do it..";
System.out.println("Input: \""+ input + '"');
System.out.println("Outputs: ");
java.util.regex.Matcher matcher = java.util.regex.Pattern.compile("#[^.#]+((?<!#\\s+)\\.)[A-Za-z0-9]+")
.matcher(input);
while(matcher.find())
System.out.println('"'+matcher.group()+'"');
Which outputs:
Input: "endpoint/?userId=#someuser.id Hi #someuser.name, how are you? # .id #.id %^*##*(.H(#EH Ok, # some spaces here .but none here #$p€©ï#l.$p€©ï#l that should do it.."
Outputs:
"#someuser.id"
"#someuser.name"
"##*(.H"
"# some spaces here .but"

#(\w+)[.](\w+)
yields two capturing groups, e.g.
endpoint/?userId=#someuser.id -> group(1)=someuser and group(2)=id (in Java, group(0) is the whole match, #someuser.id)
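To make the group numbering concrete, here is a small sketch using java.util.regex, where group(0) is the entire match and capturing groups are numbered from 1. The class name HashtagGroups and the helper firstTag are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashtagGroups {
    private static final Pattern TAG = Pattern.compile("#(\\w+)[.](\\w+)");

    // Returns {whole match, part before the dot, part after the dot},
    // or null when the text contains no such hashtag.
    static String[] firstTag(String text) {
        Matcher m = TAG.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(0), m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = firstTag("endpoint/?userId=#someuser.id");
        System.out.println(parts[0]); // #someuser.id
        System.out.println(parts[1]); // someuser
        System.out.println(parts[2]); // id
        System.out.println(firstTag("# .id, #.id") == null); // true
    }
}
```

Since \w+ needs at least one word character directly after #, the inputs "# .id" and "#.id" produce no match.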

high-level regular expression with not

Hi regular expression experts,
I have the following text
<[~UNKNOWN:a-z\.]> <[~UNKNOWN:A-Z\-0-9]> <[~UNKNOWN:A-Z\]a-z]
And the following reg expr
\[\~[^\[\~\]]*\]
It works fine for the 1st and 2nd group in the text but not for the 3rd one.
The 1st group is
[~UNKNOWN:a-z\.]
The 2nd is
[~UNKNOWN:A-Z\-0-9]
and the 3rd one is
[~UNKNOWN:A-Z\]a-z]
However the reg exp finds the following text
[~UNKNOWN:A-Z\]
I understand why, and I know that I have to add the following rule to the reg exp:
it starts with the '[' and '~' characters and ends with ']' UNLESS there is a '\' in front of the ']'. So I should add a NOT expression, but I am not sure how.
Could anybody please help?
Thanks,
V.

Why not simply:
<([^>]+)>?

This should work (first line: the full pattern; second line: your pattern, ignoring whitespace; third line: my changes):
\[\~(?:[^\[\~\]]|(?<=\\)\])*(?<!\\)\]
\[\~   [^\[\~\]]           *       \]
    (?:         |(?<=\\)\]) (?<!\\)
Your regex:
\[\~ # Literal characters [~
[^ # Character group, NONE of the following:
\[\~\] # [ or ~ or ]
]* # 0 or more of this character group
\] # Followed by ]
Your pattern in words: [~, everything in between, up to the next ], as long as there is no [ or ~ or ] in there.
My pattern, with only the relevant changes explained:
\[\~
(?: # Non capturing group
[^\[\~\]]
| # OR
(?<=\\)\] # ], preceded by \
)*
(?<!\\)\] # ], not preceded by \
In words: Same as yours, plus ] may be contained if it is preceded by \, and the closing ] may not be preceded by \
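A short Java sketch that runs the adjusted pattern against the text from the question. The class name BracketTokens and the find helper are illustrative only, and the pattern is assumed to be used with java.util.regex, which is why every backslash is doubled in the string literal.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BracketTokens {
    // [~ ... ] where a ] may appear inside only when it is preceded by a backslash.
    private static final Pattern TOKEN =
            Pattern.compile("\\[\\~(?:[^\\[\\~\\]]|(?<=\\\\)\\])*(?<!\\\\)\\]");

    // Returns every [~...] token found in the text, in order.
    static List<String> find(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "<[~UNKNOWN:a-z\\.]> <[~UNKNOWN:A-Z\\-0-9]> <[~UNKNOWN:A-Z\\]a-z]";
        // Prints the three tokens, the last one including the escaped \]
        find(text).forEach(System.out::println);
    }
}
```

One limitation of the lookbehind approach: it treats any ] after a backslash as escaped, so a token ending in an escaped backslash followed by the closing ] would not be recognised.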

Java regex to extract text sequences across multiple lines

Given an excerpt of text like
Preface (optional, up to multiple lines)
Main : sequence1
sequence2
sequence3
sequence4
Epilogue (optional, up to multiple lines)
which Java regular expression could be used to extract all the sequences (i.e. sequence1, sequence2, sequence3, sequence4 above)? For example, a Matcher.find() loop?
Each "sequence" is preceded by and may also contain 0 or more white spaces (including tabs).
The following regex
(?m).*Main(?:[ |t]+:(?:[ |t]+(\S+)[\r\n])+
only yields the first sequence (sequence1).

You may use the following regex:
(?m)(?:\G(?!\A)[^\S\r\n]+|^Main\s*:\s*)(\S+)\r?\n?
Details:
(?m) - multiline mode on
(?:\G(?!\A)[^\S\r\n]+|^Main\s*:\s*) - either of the two:
\G(?!\A)[^\S\r\n]+ - end of the previous successful match (\G(?!\A)) and then 1+ horizontal whitespaces ([^\S\r\n]+, can be replaced with [\p{Zs}\t]+ or [\s&&[^\r\n]]+)
| - or
^Main\s*:\s* - start of a line, Main, 0+ whitespaces, :, 0+ whitespaces
(\S+) - Group 1 capturing 1+ non-whitespace symbols
\r?\n? - an optional CR and an optional LF.
See the Java code below:
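import java.util.regex.Matcher;
import java.util.regex.Pattern;
// The \r and \n in the literal below are real CR/LF characters; the regex engine treats them as literal characters.
String p = "(?m)(?:\\G(?!\\A)[^\\S\r\n]+|^Main\\s*:\\s*)(\\S+)\r?\n?";
String s = "Preface (optional, up to multiple lines)...\nMain : sequence1\n sequence2\n sequence3\n sequence4\nEpilogue (optional, up to multiple lines)";
Matcher m = Pattern.compile(p).matcher(s);
// Group 1 holds one sequence per match: sequence1, sequence2, sequence3, sequence4
while(m.find()) {
System.out.println(m.group(1));
}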

Extraction of subsequences that end with point and space by regular expression

Hi,
I want to extract the sub-sentences of this sentence with a regular expression:
it learn od fg network layout. kdsjhuu ddkm networ.12kfdf. learndfefe layout. learn sdffsfsfs. sddsd learn fefe.
I couldn't write a correct regular expression for Pattern.compile.
This is my expression: ([^(\\.\\s)]*)([^.]*\\.)
Actually, I need a way of writing "read everything except \\.\\s".
Expected sub-sentences:
it learn od fg network layout.
kdsjhuu ddkm networ.12kfdf.
learndfefe layout.
learn sdffsfsfs.
sddsd learn fefe.

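Just split your string on the regex "\\. " (a dot followed by a space):
String[] arr = str.split("\\. ");
Note that split removes the delimiter, so every piece except the last one loses its trailing dot.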
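If the trailing dots should stay attached to each sub-sentence, one option is to split on the space alone and only assert the dot with a lookbehind. This is a sketch of that variation rather than the answer's exact code; the class name SentenceSplitter is made up for the example.

```java
import java.util.Arrays;
import java.util.List;

final class SentenceSplitter {
    // Splits on a space that is preceded by a dot. The lookbehind consumes
    // nothing, so the dot stays with the sentence in front of it.
    static List<String> split(String text) {
        return Arrays.asList(text.split("(?<=\\.) "));
    }

    public static void main(String[] args) {
        String text = "it learn od fg network layout. kdsjhuu ddkm networ.12kfdf. "
                + "learndfefe layout. learn sdffsfsfs. sddsd learn fefe.";
        // Prints the five sub-sentences, each with its final dot.
        split(text).forEach(System.out::println);
    }
}
```

A dot that is not followed by a space, as in networ.12kfdf, does not trigger a split, which matches the expected sub-sentences in the question.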

You can use this pattern with the find method:
Pattern p = Pattern.compile("[^\\s.][^.]*(?:\\.(?!\\s|\\z)[^.]*)*\\.?");
Matcher m = p.matcher(yourText);
while(m.find()) {
System.out.println(m.group(0));
}
Pattern details:
[^\\s.] # all that is not a whitespace (to trim) or a dot
[^.]* # all that is not a dot (zero or more times)
(?: # open a non-capturing group
\\. (?!\\s|\\z) # dot not followed by a whitespace or the end of the string
[^.]* #
)* # close and repeat the group as needed
\\.? # an optional dot (allows matching a sentence at the end
# of the string even if there is no dot)

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.
