Regex match a word that starts with a string - java

How can I write a regex that matches only the single word that starts with big?
I have tried to build a regex from a start and an end string, with big as the starting string and \s (whitespace) as the ending string.
Consider this line: You are my big-big-big friend and also a brother
When I use the regex below, I get big-big-bigfriendandalsoabrother as the result:
(.big.*\s)
But I am expecting big-big-big. The word can be at the start or at the end of the line. I want a regex that matches the full word that starts with big.
Help would be appreciated.

The following regex may be used:
(?<!\S)big\S*
Details:
(?<!\S) - a negative lookbehind that makes sure there is either the start of the string or a whitespace char immediately to the left of the current location
big - a literal substring
\S* - any 0 or more chars other than whitespace chars
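As a rough sketch of how this pattern could be applied in Java (the class and method names here are made up for illustration and are not from the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BigWordFinder {
    // (?<!\S) : nothing but whitespace (or the start of the string) to the left
    // big     : the literal prefix
    // \S*     : the rest of the word, up to the next whitespace
    private static final Pattern BIG_WORD = Pattern.compile("(?<!\\S)big\\S*");

    // Hypothetical helper: collects every whitespace-delimited word starting with "big".
    static List<String> findBigWords(String input) {
        List<String> words = new ArrayList<>();
        Matcher m = BIG_WORD.matcher(input);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [big-big-big]
        System.out.println(findBigWords("You are my big-big-big friend and also a brother"));
    }
}
```

Note the doubled backslashes: inside a Java string literal, \S has to be written as \\S.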

You can use the Regex
(?!\s)big\S*
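It matches the sample line, with one caveat noted at the end.
Explanation:
(?!\s)
A negative lookahead: the character at the current position must not be whitespace. It looks forward, not backward, so it says nothing about what precedes big (and since the next character has to be b anyway, it is effectively redundant here).
big
Will find the literal text big
\S*
Will find any character that's NOT a whitespace, 0 or more times
So:
(?!\s)big\S*
Finds big, followed by anything that's not a whitespace, until it hits a whitespace. Because nothing is checked on the left, it also matches in the middle of a word (for example biguous inside ambiguous); use the lookbehind (?<!\S) instead if that matters.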
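To see how the lookahead and lookbehind variants differ, here is a small comparison sketch. The test sentence and the helper findAll are my own additions, not part of either answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaroundComparison {
    // Hypothetical helper: returns every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String input = "an ambiguous big-big-big word";
        // Lookahead only inspects the next char, so "big" inside "ambiguous" matches too.
        System.out.println(findAll("(?!\\s)big\\S*", input));  // [biguous, big-big-big]
        // Lookbehind rejects "big" when a non-whitespace char precedes it.
        System.out.println(findAll("(?<!\\S)big\\S*", input)); // [big-big-big]
    }
}
```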

Related

RegEx matching $1 tokens [duplicate]

Imagine you are trying to pattern match "stackoverflow".
You want the following:
this is stackoverflow and it rocks [MATCH]
stackoverflow is the best [MATCH]
i love stackoverflow [MATCH]
typostackoverflow rules [NO MATCH]
i love stackoverflowtypo [NO MATCH]
I know how to parse out stackoverflow if it has spaces on both sides using:
/\s(stackoverflow)\s/
The same goes if it's at the start or end of a string:
/^(stackoverflow)\s/
/\s(stackoverflow)$/
But how do you specify "space or end of string" and "space or start of string" using a regular expression?
You can use any of the following:
\b #A word break and will work for both spaces and end of lines.
(^|\s) #the | means or. () is a capturing group.
/\b(stackoverflow)\b/
Also, if you don't want to include the space in your match, you can use lookbehind/aheads.
(?<=\s|^) #to look behind the match
(stackoverflow) #the string you want. () optional
(?=\s|$) #to look ahead.
(^|\s) would match space or start of string and ($|\s) for space or end of string. Together it's:
(^|\s)stackoverflow($|\s)
Here's what I would use:
(?<!\S)stackoverflow(?!\S)
In other words, match "stackoverflow" if it's not preceded by a non-whitespace character and not followed by a non-whitespace character.
This is neater (IMO) than the "space-or-anchor" approach, and it doesn't assume the string starts and ends with word characters like the \b approach does.
\b matches at word boundaries (without actually matching any characters), so the following should do what you want:
\bstackoverflow\b
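For a Java flavour of the whitespace-lookaround answer, a minimal sketch might look like this (the class and method names are illustrative only), checked against the five sample lines from the question:

```java
import java.util.regex.Pattern;

class WholeTokenMatcher {
    // Not preceded and not followed by a non-whitespace character.
    private static final Pattern TOKEN =
            Pattern.compile("(?<!\\S)stackoverflow(?!\\S)");

    // Hypothetical helper: true if the token occurs as a whitespace-delimited word.
    static boolean containsToken(String input) {
        return TOKEN.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "this is stackoverflow and it rocks",
            "stackoverflow is the best",
            "i love stackoverflow",
            "typostackoverflow rules",
            "i love stackoverflowtypo"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + (containsToken(s) ? "MATCH" : "NO MATCH"));
        }
    }
}
```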

Regex to split String on pattern but with a minimum number of characters

I want to split a long text stored in a String variable following those rules:
Split on a dot (.)
The Substrings should have a minimum length of 30 (for example).
Take this example:
"The boy ate the apple. The sun is shining high in the sky. The answer to life the universe and everything is forty two, said the big computer."
let's say the minimum length I want is 30.
The result splits obtained would be:
"The boy ate the apple. The sun is shining high in the sky."
"The answer to life the universe and everything is forty two, said the big computer."
I don't want to take "The boy ate the apple." as a split because it's less than 30 characters.
2 ways I thought of:
Loop through all the characters and add them to a String builder. And whenever I reach a dot (.) I check if my String builder is more than the minimum I split it, otherwise I continue.
Split on all dots (.), and then loop through the splits. if one of the Splitted strings is smaller than the minimum, I concatenate it with the one after.
But I am looking if this can be done directly by using a Regex to split and test the minimum number of characters before a match.
Thanks
Instead of using split, you could also match your values using a capturing group.
To make the dot also match a newline you could use Pattern.DOTALL
\s*(.{30}[^.]*\.|.+$)
In Java:
String regex = "\\s*(.{30}[^.]*\\.|.+$)";
Explanation
\s* Match 0+ times a whitespace character
( Capturing group
.{30} Match any character 30 times
[^.]* Match 0+ times not a dot using a negated character class
\. Match a literal dot
| Or
.+$ Match 1+ times any character until the end of the string.
) Close capturing group
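A possible way to run this pattern over the example text in Java is sketched below. The class name and the chunk method are assumptions for illustration, and the minimum length of 30 is hard-coded as in the regex above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceChunker {
    // DOTALL lets '.' also match line breaks, as suggested in the answer.
    private static final Pattern CHUNK =
            Pattern.compile("\\s*(.{30}[^.]*\\.|.+$)", Pattern.DOTALL);

    // Hypothetical helper: returns chunks that end on a dot and span at least 30 chars,
    // plus whatever shorter remainder is left at the end of the text.
    static List<String> chunk(String text) {
        List<String> chunks = new ArrayList<>();
        Matcher m = CHUNK.matcher(text);
        while (m.find()) {
            chunks.add(m.group(1)); // group 1 excludes the leading whitespace
        }
        return chunks;
    }

    public static void main(String[] args) {
        String text = "The boy ate the apple. The sun is shining high in the sky. "
                + "The answer to life the universe and everything is forty two, "
                + "said the big computer.";
        for (String c : chunk(text)) {
            System.out.println(c);
        }
    }
}
```

On the example text this yields the two chunks listed in the question, because the first 30 characters already run past the first dot.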
Instead of using the split method, try matching with the following regexp: \S.{29,}?[.]
This should do the job:
"\W*+(.{30,}?)\W*\."
Test: https://regex101.com/r/aavcme/3
\W*+ possessively consumes as many non-word characters as possible, which trims the spaces between sentences
. matches any character (I guess you want to match any kind of character in your sentences)
{30,} asserts the minimum length of the match (30)
? means "as few as possible"
\. matches the dot separating the sentences (assuming that you always have a dot at the end of a sentence, even the last one)

Java Pattern regex search between strings

Given the following strings (stringToTest):
G2:7JAPjGdnGy8jxR8[RQ:1,2]-G3:jRo6pN8ZW9aglYz[RQ:3,4]
G2:7JAPjGdnGy8jxR8[RQ:3,4]-G3:jRo6pN8ZW9aglYz[RQ:3,4]
And the Pattern:
Pattern p = Pattern.compile("G2:\\S+RQ:3,4");
if (p.matcher(stringToTest).find())
{
// Match
}
For string 1 I DON'T want to match, because RQ:3,4 is associated with the G3 section, not G2, and I want string 2 to match, as RQ:3,4 is associated with G2 section.
The problem with the current regex is that it's searching too far and reaching the RQ:3,4 eventually in case 1 even though I don't want to consider past the G2 section.
It's also possible that the stringToTest might be (just one section):
G2:7JAPjGdnGy8jxR8[RQ:3,4]
The strings 7JAPjGdnGy8jxR8 and jRo6pN8ZW9aglYz are variable length hashes.
Can anyone help me with the correct regex to use, to start looking at G2 for RQ:3,4 but stopping if it reaches the end of the string or -G (the start of the next section).
You may use this regex with a negative lookahead in between:
G2:(?:(?!G\d+:)\S)*RQ:3,4
RegEx Details:
G2:: Match literal text G2:
(?: Start a non-capture group
(?!G\d+:): Assert that we don't have a G<digit>: ahead of us
\S: Match a non-whitespace character
)*: End non-capture group. Match 0 or more of this
RQ:3,4: Match literal text RQ:3,4
In Java use this regex:
String re = "G2:(?:(?!G\\d+:)\\S)*RQ:3,4";
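A short sketch of how this could be checked against the three sample strings from the question (the class and method names are invented for the example):

```java
import java.util.regex.Pattern;

class SectionMatcher {
    // After "G2:", consume non-whitespace chars one at a time, but never step
    // onto a position where the next section header (G<digits>:) begins.
    private static final Pattern G2_WITH_RQ =
            Pattern.compile("G2:(?:(?!G\\d+:)\\S)*RQ:3,4");

    // Hypothetical helper: true only if RQ:3,4 belongs to the G2 section.
    static boolean g2HasRq34(String stringToTest) {
        return G2_WITH_RQ.matcher(stringToTest).find();
    }

    public static void main(String[] args) {
        System.out.println(g2HasRq34("G2:7JAPjGdnGy8jxR8[RQ:1,2]-G3:jRo6pN8ZW9aglYz[RQ:3,4]")); // false
        System.out.println(g2HasRq34("G2:7JAPjGdnGy8jxR8[RQ:3,4]-G3:jRo6pN8ZW9aglYz[RQ:3,4]")); // true
        System.out.println(g2HasRq34("G2:7JAPjGdnGy8jxR8[RQ:3,4]"));                            // true
    }
}
```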
The problem is that \S matches any non-whitespace char and the regex engine parses the text from left to right. Once it finds G2: it grabs all non-whitespace chars to the right (since \S+ is a greedy subpattern) and then backtracks to find the rightmost occurrence of RQ:3,4.
In a general case, you may use
String regex = "G2:(?:(?!-G)\\S)*RQ:3,4";
See the regex demo. (?:(?!-G)\S)* is a tempered greedy token that will match 0+ occurrences of a non-whitespace char that does not start a -G substring.
If the hyphen is only possible in front of the next section, you may subtract - from \S:
String regex = "G2:[^\\s-]*RQ:3,4"; // using a negated character class
String regex = "G2:[\\S&&[^-]]*RQ:3,4"; // using character class subtraction
See this regex demo. [^\\s-]* will match 0 or more chars other than whitespace and -.
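The difference between the original greedy pattern and the two character-class variants can be seen in a small comparison. This is only a sketch; the helper found is an illustrative name, and it assumes, as stated above, that a hyphen appears only in front of the next section:

```java
import java.util.regex.Pattern;

class NegatedClassDemo {
    // Hypothetical helper: true if regex matches anywhere in input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String s1 = "G2:7JAPjGdnGy8jxR8[RQ:1,2]-G3:jRo6pN8ZW9aglYz[RQ:3,4]";

        // Original pattern: \S+ runs past "-G3:" and finds RQ:3,4 in the wrong section.
        System.out.println(found("G2:\\S+RQ:3,4", s1));            // true (unwanted)
        // Negated class: stops at the first hyphen.
        System.out.println(found("G2:[^\\s-]*RQ:3,4", s1));        // false
        // Intersection: "non-whitespace AND not a hyphen", same effect.
        System.out.println(found("G2:[\\S&&[^-]]*RQ:3,4", s1));    // false
    }
}
```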
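Try to use [^[] instead of \S in this regex: G2:[^[]*\[RQ:3,4 (in Java, escape the bracket inside the class as well, i.e. G2:[^\[]*\[RQ:3,4, because an unescaped [ inside a character class opens a nested class there)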
[^[] means any character but [
(considering that strings like this: G2:7JAP[jGd]nGy8[]R8[RQ:3,4] are not possible)

regex capture includes too much

I have a string from which I would like to capture everything after and including a colon, up to (but excluding) whitespace or a parenthesis.
Why does the following regex include the parenthesis in the match?
:(.*?)[\(\)\s] or also :(.+?)[\)\s] (non-greedy) does not work.
Example input: WHERE t.operator_id = :operatorID AND (t.merchant_id = :merchantID) AND t.readerApplication_id = :readerApplicationID AND t.accountType in :accountTypes
It should extract :operatorID, :merchantID, :readerApplicationID, :accountTypes.
But for the second match my regexes extract :merchantID)
What is wrong and why?
Even if I use a stricter character class in the capture, it does not work: :([a-zA-z0-9_]+?)[\)\(\s]
Put your conditional "followed by space or paren" as a lookahead, so that it sees but doesn't match. Right now you are explicitly matching parentheses with [\(\)\s]:
:(.+?)(?=[\s\(\)])
https://regex101.com/r/im8KWF/1/
Or, use the built-in \b "word boundary", which is also a "zero-width" assertion meaning the same thing*:
:(.+?)\b
https://regex101.com/r/FnnzGM/3/
*Definition of word boundary from regular-expressions.info:
There are three different positions that qualify as word boundaries:
Before the first character in the string, if the first character is a
word character. After the last character in the string, if the last
character is a word character. Between two characters in the string,
where one is a word character and the other is not a word character.
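Here is a sketch of the lookahead approach in Java, run on the example input from the question. The class and method names are made up, and the |$ alternative inside the lookahead is my own addition (not part of the answer) so that a parameter at the very end of the string, such as :accountTypes, is also found:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedParameterExtractor {
    // The lookahead checks for whitespace, a parenthesis, or end of input
    // without consuming it, so the delimiter never ends up in the match.
    private static final Pattern PARAM = Pattern.compile(":(.+?)(?=[\\s()]|$)");

    // Hypothetical helper: returns each parameter including its leading colon.
    static List<String> extract(String query) {
        List<String> params = new ArrayList<>();
        Matcher m = PARAM.matcher(query);
        while (m.find()) {
            params.add(m.group()); // m.group(1) would give the name without ':'
        }
        return params;
    }

    public static void main(String[] args) {
        String query = "WHERE t.operator_id = :operatorID AND (t.merchant_id = :merchantID) "
                + "AND t.readerApplication_id = :readerApplicationID AND t.accountType in :accountTypes";
        // prints [:operatorID, :merchantID, :readerApplicationID, :accountTypes]
        System.out.println(extract(query));
    }
}
```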

Regex to accept only alphabets and spaces and disallowing spaces at the beginning and the end of the string

I have the following requirements for validating an input field:
It should only contain alphabets and spaces between the alphabets.
It cannot contain spaces at the beginning or end of the string.
It cannot contain any other special character.
I am using following regex for this:
^(?!\s*$)[-a-zA-Z ]*$
But this is allowing spaces at the beginning. Any help is appreciated.
For me the only logical way to do this is:
^\p{L}+(?: \p{L}+)*$
At the start of the string there must be at least one letter. (I replaced your [a-zA-Z] with the Unicode property for letters, \p{L}.) Then there can be a space followed by at least one letter; this part can be repeated.
\p{L}: any kind of letter from any language. See regular-expressions.info
The problem with ^(?!\s*$) in your expression is that this lookahead only fails when nothing but whitespace remains until the end of the string, so a string with leading whitespace followed by letters still passes. If you want to disallow leading whitespace, just remove the end-of-string anchor inside the lookahead: ^(?!\s)[-a-zA-Z ]*$. That still allows the string to end with whitespace; to prevent that, add a lookbehind at the end of the string: ^(?!\s)[-a-zA-Z ]*(?<!\s)$. But I think a lookaround is not needed for this task.
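A minimal Java sketch of the \p{L} pattern follows. The class name, method name and test values are illustrative assumptions; non-ASCII letters are written as Unicode escapes to keep the source file encoding-neutral:

```java
import java.util.regex.Pattern;

class LetterNameValidator {
    // One or more letters, then any number of " letters" groups:
    // no leading, trailing, or doubled spaces can get through.
    private static final Pattern LETTERS_AND_SINGLE_SPACES =
            Pattern.compile("^\\p{L}+(?: \\p{L}+)*$");

    // Hypothetical helper: validates the whole input string.
    static boolean isValid(String input) {
        return LETTERS_AND_SINGLE_SPACES.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("John Smith"));           // true
        System.out.println(isValid("Jos\u00e9 \u00c1ngel")); // true, \p{L} covers accented letters
        System.out.println(isValid(" John"));                // false, leading space
        System.out.println(isValid("John "));                // false, trailing space
    }
}
```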
This should work if you use it with String.matches method. I assume you want English alphabet.
"[a-zA-Z]+(\\s+[a-zA-Z]+)*"
Note that \s will allow all kinds of whitespace characters. In Java, it would be equivalent to
[ \t\n\x0B\f\r]
Which includes horizontal tab (09), line feed (10), vertical tab (11), form feed (12), carriage return (13), space (32).
If you want to specifically allow only space (32):
"[a-zA-Z]+( +[a-zA-Z]+)*"
You can further optimize the regex above by making the capturing group ( +[a-zA-Z]+) non-capturing (with String.matches you are not going to be able to get the words individually anyway). It is also possible to change the quantifiers to make them possessive, since there is no point in backtracking here.
"[a-zA-Z]++(?: ++[a-zA-Z]++)*+"
Try this:
^(((?<!^)\s(?!$)|[-a-zA-Z])*)$
This expression uses negative lookahead and negative lookbehind to disallow spaces at the beginning or at the end of the string, and requiring the match of the entire string.
I think the problem is there's a ? before the negation of white spaces, which means it is optional
This should work:
[a-zA-Z]{1}([a-zA-Z\s]*[a-zA-Z]{1})?
at least one sequence of letters, then optional string with spaces but always ends with letters
I don't know if the words in your accepted string can be separated by more than one space. If they can:
^[a-zA-Z]+(( )+[a-zA-Z]+)*$
If they can't:
^[a-zA-Z]+( [a-zA-Z]+)*$
String must start with letter (or few letters), not space.
String can contain few words, but every word beside first must have space before it.
Hope I helped.
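For the ASCII-only variants in this thread, the following sketch uses String.matches with the possessive, space-only pattern quoted in one of the answers. The class and method names are assumptions for illustration:

```java
class AsciiNameValidator {
    // String.matches requires the whole input to match, so no ^ or $ is needed.
    // Possessive quantifiers (++, *+) avoid pointless backtracking.
    private static final String LETTERS_AND_SPACES = "[a-zA-Z]++(?: ++[a-zA-Z]++)*+";

    // Hypothetical helper: letters separated by one or more plain spaces (U+0020).
    static boolean isValid(String input) {
        return input.matches(LETTERS_AND_SPACES);
    }

    public static void main(String[] args) {
        System.out.println(isValid("John Smith"));   // true
        System.out.println(isValid("John  Smith"));  // true, several spaces between words are allowed
        System.out.println(isValid(" John"));        // false
        System.out.println(isValid("John "));        // false
        System.out.println(isValid("John\tSmith"));  // false, a tab is not a plain space
    }
}
```

If only a single space between words should be accepted, replacing " ++" with a single space in the pattern would be one way to tighten it.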
