I'm trying to write a Java regexp that matches a string in this format:
AXXXXYYYYB
Where XXXX is a string that terminates at the 20th character or the second space, whichever comes first, and YYYY is a string that terminates at the 20th character or the first space, whichever comes first.
And I need XXXX and YYYY to be the first and second capture groups.
I can get it to work terminating at the first space in XXXX with this:
^A([^ ]{1,20}) ?([^ ]{1,20})B$
But I can't figure out the rule that would terminate at the 20th character or the second space.
Also, I don't care if either capture group ends up with an extra leading or trailing space.
Sample input -> output:
MR SMITH BROOKLYN -> "MR SMITH" and "BROOKLYN" (separated at second space)
MR SMYTHE-JONES BRONX -> "MR SMYTHE-JONES" and "BRONX" (separated at second space)
12345678901234567890QUEENS -> "12345678901234567890" and "QUEENS" (separated at 20th character)
1234567890 1234567890QUEENS -> "1234567890 123456789" and "0QUEENS" (separated at 20th character)
1234567890 1234567890STATEN ISLAND -> "1234567890 123456789" and "0STATEN" (separated at 20th character, then separated at space)
^([^ ]+[ ][^ ]+)[ ](.*)$|(.{20})(.*)$
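You can try this and grab the captures. Depending on which alternative matches, the two parts end up in groups 1 and 2 or in groups 3 and 4.
1) ([^ ]+[ ][^ ]+)[ ](.*) will break on the second space.
2) (.{20})(.*) will break at 20 characters.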
See demo.
http://regex101.com/r/gT6kI4/4
This is my solution, which makes use of lookbehind:
"([^ ]*(?:[ ][^ ]*)?)(?<!.{21})[ ]?([^ ]{0,20})"
([^ ]*(?:[ ][^ ]*)?)(?<!.{21}) matches and captures the first part, which must be strictly less than 21 characters long and contain at most one space. Because the quantifiers are greedy, it always tries the longest possible string first (matching past the first space first) and shortens it when the lookbehind rejects it. The lookbehind only allows the matcher to proceed when there are not 21 characters to match before the current position, which means the part in front is at most 20 characters long.
Since the first part can end with space, I need to match it with [ ]?.
Then, since the second part can't contain any space (since it breaks at the first space), it can simply be matched and captured by ([^ ]{0,20}).
Note that this solution assumes there is no line separator character in the input string.
There is a caveat: the first part may contain a trailing space if that space is the first space and it is the 20th character. You can prevent that with a small change, replacing the * inside the optional group with +:
"([^ ]*(?:[ ][^ ]+)?)(?<!.{21})[ ]?([^ ]{0,20})"
Demo on ideone
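A minimal Java sketch for trying the lookbehind pattern against the sample inputs from the question. The class and method names are made up for illustration, and it assumes find() is run on the bare text (without the A/B wrappers), so only the first match in each input is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SecondSpaceSplit {
    // Lookbehind pattern from this answer (the variant with + in the optional group).
    private static final Pattern PARTS =
            Pattern.compile("([^ ]*(?:[ ][^ ]+)?)(?<!.{21})[ ]?([^ ]{0,20})");

    /** Returns the two captured parts of the first match found in the input. */
    static String[] split(String input) {
        Matcher m = PARTS.matcher(input);
        if (!m.find()) {
            throw new IllegalArgumentException("no match in: " + input);
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] samples = {
            "MR SMITH BROOKLYN",
            "MR SMYTHE-JONES BRONX",
            "12345678901234567890QUEENS",
            "1234567890 1234567890QUEENS",
            "1234567890 1234567890STATEN ISLAND"
        };
        for (String sample : samples) {
            String[] parts = split(sample);
            System.out.println("\"" + parts[0] + "\" and \"" + parts[1] + "\"");
        }
    }
}
```

Using find() rather than matches() is what lets the last sample stop at "0STATEN" and ignore the rest of the input.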
I don't think this could be done using one regex pattern.
I suggest running this pattern first:
^(.{20})(.*)$
If sub-pattern no. 1 contains more than one space, treat that as a failure and run this pattern instead:
^(\S+\s\S+)\s(.*)$
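The two-step idea could be sketched in Java roughly like this. The class name, the helper countSpaces, and the choice to also fall through to the second pattern when the input is shorter than 20 characters are my assumptions, not part of the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TwoStepSplit {
    private static final Pattern BY_LENGTH = Pattern.compile("^(.{20})(.*)$");
    private static final Pattern BY_SPACE = Pattern.compile("^(\\S+\\s\\S+)\\s(.*)$");

    /** Returns {first, second}, or null when neither pattern matches. */
    static String[] split(String input) {
        Matcher byLength = BY_LENGTH.matcher(input);
        // Accept the 20-character cut only if it contains at most one space.
        if (byLength.matches() && countSpaces(byLength.group(1)) <= 1) {
            return new String[] { byLength.group(1), byLength.group(2) };
        }
        // Assumption: inputs shorter than 20 characters also end up here.
        Matcher bySpace = BY_SPACE.matcher(input);
        if (bySpace.matches()) {
            return new String[] { bySpace.group(1), bySpace.group(2) };
        }
        return null;
    }

    // Hypothetical helper: counts plain space characters in the string.
    private static long countSpaces(String s) {
        return s.chars().filter(c -> c == ' ').count();
    }

    public static void main(String[] args) {
        String[] parts = split("MR SMYTHE-JONES BRONX");
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```

Note that, unlike the requirement in the question, the second part here is whatever remains of the input; it is not cut at the first space or at 20 characters.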
This is my first question. Nice to e-meet everyone.
I have created the following regex pattern in Java 8 (this is just a simplified example of what I actually have in my code - for the sake of clarity):
(?<!a)([0-9])\,([0-9])(?!a)|(?<!b)([0-9]) ([0-9])(?!b)|(?<!c)([0-9])([0-9])(?!c)
so in general it consists of three alternatives:
1st one matches two single digits separated with a comma, for example:
1,1
2,0
4,5
2nd one matches two single digits separated with a space, for example:
1 1
2 0
4 5
3rd one matches two single digits in a row, for example:
11
20
45
Each alternative uses lookarounds and their content has to be slightly different for each one of them - that's why I couldn't just put everything together like that:
([0-9])[, ]?([0-9])
Each of the matched digits is enclosed in a capturing group and now I have a second line to 'call out' these captured numbers like this:
(?<!n)($1 $2|$3 $4|$5 $6)(?!n)
So at the end I need to match a text that would have the same digits separated with single space and not surrounded by 'n'. So if any of the examples shown above would be matched by the pattern from the 1st line, the 2nd line pattern should match these:
1 1
2 0
4 5
11 11
22 00
44 55
And not any of these:
n1 1
2,0
45
asd asd asd
The problem is the following: it returns a match even if I do not have these captured digits in the tested text, but I do have space in it... So here I do not get match and that is correct:
aaaaaaaaa
bbbbbbbbb
aasdfasdf
but here I get a match on the following things (most apparently because there is a space/spaces):
abc abc
q w r t y
as df
Does anyone know if this is normal that despite the fact that the characters in capturing groups are not captured by the 1st line, the 'non capturing group' part (so a single space) will be matched and therefore the whole pattern returns match, as if a capturing group could be a zero-length match in the second line if nothing is captured by the first line? Thanks in advance for any comment on this.
Your regex matches whitespace because the resulting pattern for the 1,1 string is (?<!n)(1 1| | )(?!n), and it can match a space that is neither preceded nor followed by n.
When a replacement backreference does not match any string in a .replaceAll/.replaceFirst it is assigned an empty string (it is assigned null when using .find() / .matches()), and thus you still get the blank alternatives in the resulting pattern.
You may leverage this functionality AND the fact that each alternative has exactly two capturing groups by concatenating replacement backreferences in the string replacement pattern, getting rid of the alternations altogether:
SEARCH: (?<!a)([0-9]),([0-9])(?!a)|(?<!b)([0-9]) ([0-9])(?!b)|(?<!c)([0-9])([0-9])(?!c)
REPLACE: (?<!n)$1$3$5 $2$4$6(?!n)
Note how the backreferences are concatenated: all backreferences to odd groups come first, then all backreferences to even groups are placed in a no-alternative pattern.
See the regex demo.
Note that even if the number of groups is different across the alternatives you may just add "fake" empty groups to each of them, and this approach will still work.
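As a rough illustration in plain Java, assuming the two-line search/replace workflow can be emulated with replaceAll on a string that holds only the source digits (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

final class BackrefPatternBuilder {
    private static final Pattern SEARCH = Pattern.compile(
            "(?<!a)([0-9]),([0-9])(?!a)"
            + "|(?<!b)([0-9]) ([0-9])(?!b)"
            + "|(?<!c)([0-9])([0-9])(?!c)");

    /**
     * Builds the second-line pattern from the digits in the source text.
     * Groups from alternatives that did not take part in the match
     * contribute nothing to the replacement, so only one odd and one
     * even group end up in the result.
     */
    static String buildSecondPattern(String source) {
        return SEARCH.matcher(source).replaceAll("(?<!n)$1$3$5 $2$4$6(?!n)");
    }

    public static void main(String[] args) {
        String second = buildSecondPattern("4,5");
        System.out.println(second);
        // The built pattern no longer has blank alternatives that match a lone space.
        System.out.println(Pattern.compile(second).matcher("abc abc").find());
        System.out.println(Pattern.compile(second).matcher("x 4 5 y").find());
    }
}
```

Because there is no alternation left in the built pattern, a text such as "abc abc" can only match if it really contains the two digits separated by a space.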
I want to split a long text stored in a String variable following those rules:
Split on a dot (.)
The Substrings should have a minimum length of 30 (for example).
Take this example:
"The boy ate the apple. The sun is shining high in the sky. The answer to life the universe and everything is forty two, said the big computer."
let's say the minimum length I want is 30.
The result splits obtained would be:
"The boy ate the apple. The sun is shining high in the sky."
"The answer to life the universe and everything is forty two, said the big computer."
I don't want to take "The boy ate the apple." as a split because it's less than 30 characters.
2 ways I thought of:
Loop through all the characters and add them to a StringBuilder. Whenever I reach a dot (.), I check whether the StringBuilder is longer than the minimum; if so I split there, otherwise I continue.
Split on all dots (.), and then loop through the pieces. If one of the pieces is shorter than the minimum, I concatenate it with the one after it.
But I am looking if this can be done directly by using a Regex to split and test the minimum number of characters before a match.
Thanks
Instead of using split, you could also match your values using a capturing group.
To make the dot also match a newline you could use Pattern.DOTALL
\s*(.{30}[^.]*\.|.+$)
In Java:
String regex = "\\s*(.{30}[^.]*\\.|.+$)";
Explanation
\s* Match 0+ times a whitespace character
( Capturing group
.{30} Match any character 30 times
[^.]* Match 0+ times not a dot using a negated character class
\. Match a dot literally
| Or
.+$ Match 1+ times any character until the end of the string.
) Close capturing group
Regex demo | Java demo
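A small sketch of how the matches could be collected in Java, using the example text from the question. The class and method names are only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MinLengthSentences {
    // DOTALL lets the dot match line terminators as well.
    private static final Pattern CHUNK =
            Pattern.compile("\\s*(.{30}[^.]*\\.|.+$)", Pattern.DOTALL);

    /** Collects group 1 of every match, i.e. each chunk without leading whitespace. */
    static List<String> chunks(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = CHUNK.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The boy ate the apple. The sun is shining high in the sky. "
                + "The answer to life the universe and everything is forty two, "
                + "said the big computer.";
        for (String chunk : chunks(text)) {
            System.out.println(chunk);
        }
    }
}
```

The minimum length is hard-coded as 30 in the pattern; building the regex string from a variable would be the obvious next step.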
Instead of using the split method, try matching with the following regexp: \S.{29,}?[.]
Demo
This should do the job:
"\W*+(.{30,}?)\W*\."
Test: https://regex101.com/r/aavcme/3
\W*+ consumes as many non-word characters as possible (possessively), which trims the spaces between sentences
. matches any character (I guess you want to match any kind of character in your sentences)
{30,} asserts the minimum length of the match (30)
? means "as few as possible"
\. matches the dot separating the sentences (assuming that you always have a dot at the end of a sentence, even the last one)
I have the following regex in Java:
String regex = "[^\\s\\p{L}\\p{N}]";
Pattern p = Pattern.compile(regex);
String phrase = "Time flies: \"when you're having fun!\" Can't wait, 'until' next summer :)";
String delimited = p.matcher(phrase).replaceAll("");
Right now this regex removes every character that is not whitespace, a letter, or a number.
Input: Time flies: "when you're having fun!" Can't wait, 'until' next summer :)
Output: Time flies when youre having fun Cant wait until next summer
Problem is, I want to maintain the single quotes on words, such as you're, can't, etc. But want to remove single quotes that are at the end of a sentence, or surround a word, such as 'hello'.
This is what I want:
Input: Time flies: "when you're having fun!" Can't wait, 'until' next summer :)
Output: Time flies when you're having fun Can't wait until next summer
How can I update my current regex to be able to do this? I need to keep the \p{L} and \p{N} as it has to work for more than one language.
Thanks!
This should do what you want, or come close:
String regex = "[^\\s\\p{L}\\p{N}']|(?<=(^|\\s))'|'(?=($|\\s))";
The regex has three alternatives separated by |. It will match:
Any character that is not a space, letter, number, or quote mark.
A quote mark, if it is preceded by the beginning of the line or a space (therefore, a quote mark at the beginning of a word). This uses positive lookbehind.
A quote mark, if it is followed by the end of the line or a space (therefore, a quote mark at the end of the word). This uses positive lookahead.
It works on the example you give. Where it might not work the way you want is if you have a word with a quote mark on one side, but not the other: "'Tis a shame that we couldn't visit James' house". Since the lookahead/behind only look at the character right before and after the quote, and doesn't look ahead to see if (say) the quote mark at the beginning of the word is followed by a quote mark at the end of the word, it will delete the quote marks on 'Tis and James'.
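Here is a short sketch that applies the regex to both the question's phrase and the 'Tis/James' example, so the limitation is easy to see. The class and method names are made up.

```java
import java.util.regex.Pattern;

final class QuoteCleaner {
    // Removes punctuation, plus quote marks at the start or end of a word.
    private static final Pattern STRIP =
            Pattern.compile("[^\\s\\p{L}\\p{N}']|(?<=(^|\\s))'|'(?=($|\\s))");

    static String clean(String phrase) {
        return STRIP.matcher(phrase).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean(
                "Time flies: \"when you're having fun!\" Can't wait, 'until' next summer :)"));
        // Quotes that belong to the word but sit at its edge are removed too.
        System.out.println(clean("'Tis a shame that we couldn't visit James' house"));
    }
}
```

Since the trailing ":)" is removed but the space before it is not, the first result may need a trim() if the trailing whitespace matters.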
The code:
String s = "a12ij";
System.out.println(Arrays.toString(s.split("\\d?")));
The output is [a, , , i, j], which confuses me. If the expression is greedy, shouldn't it try and match as much as possible, thereby splitting on each digit? I would assume that the output should be [a, , i, j] instead. Where is that extra empty character coming from?
The pattern you're using only matches one digit a time:
\d match a digit [0-9]
? matches between zero and one time (greedy)
Since you have more than one digit, it splits on each of them individually. You can match more than one digit at a time in a few different ways; here are a couple:
\d+?
\d match a digit [0-9]
+? matches between one and unlimited times (lazy)
Or you could just do:
\d+
\d match a digit [0-9]
+ matches between one and unlimited times (greedy)
The greedy version is likely the closest to what you want, although that is unclear. (With nothing after it in the pattern, the lazy version still stops after a single digit.)
Explanation:
Since the token \d is using the ? quantifier the regex engine is telling your split function to match a digit between zero and one time. So that must include all of your characters (zero), as well as each digit matched (once).
You can picture it something like this:
a,1,2,i,j // each character represents (zero) and is split
| |
a, , ,i,j // digit 1 and 2 are each matched (once)
Digit 1 and 2 were matched but not captured — so they are tossed out, however, the comma still remains from the split, and is not removed basically producing two empty strings.
If you're specifically looking to have your result as a, ,i,j then I'll give you a hint. You'll want to (capture the \d digits as a group between one and unlimited times +) followed by the greedy quantifier ?. I recommend visiting one of the popular regex sites that allow you to experiment with patterns and quantifiers; it's also a great way to learn and can teach you a lot!
↳ The solution can be found here
The javadoc for split() is not clear on what happens when a pattern can match the empty string. My best guess here is the delimiters found by split() are what would be found by successive find() calls of a Matcher. The javadoc for find() says:
This method starts at the beginning of this matcher's region, or, if a
previous invocation of the method was successful and the matcher has
not since been reset, at the first character not matched by the
previous match.
So if the string is "a12ij" and the pattern matches either a single digit or an empty string, then find() should find the following:
Empty string starting at position 0 (before a)
The string "1"
The string "2"
Empty string starting at position 3 (before i). This is because "the first character not matched by the previous match" is the i.
Empty string starting at position 4 (before j).
Empty string starting at position 5 (at the end of the string).
So if the matches found are the substrings denoted by the x, where an x under a blank means the match is an empty string:
a 1 2 i j
x x x x x x
Now if we look at the substrings between the x's, they are "a", "", "", "i", "j" as you are seeing. (The substring before the first empty string is not returned, because the split() javadoc says "A zero-width match at the beginning however never produces such empty leading substring." [Note that this may be new behavior with Java 8.] Also, split() doesn't return empty trailing substrings.)
I'd have to look at the code for split() to confirm this behavior. But it makes sense looking at the Matcher javadoc and it is consistent with the behavior you're seeing.
MORE: I've confirmed from the source that split() does rely on Matcher and find(), except for an optimization for the common case of splitting on a one-known-character delimiter. So that explains the behavior.
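To see this for yourself, a small sketch like the following lists the offsets of each find() match next to the split() result. It assumes Java 8 or later for the leading-empty-string behavior, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SplitTrace {
    /** Lists every match of the regex as "start-end" offsets, in find() order. */
    static List<String> matchOffsets(String regex, String input) {
        List<String> offsets = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            offsets.add(m.start() + "-" + m.end());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String s = "a12ij";
        // Equal start and end offsets denote an empty match.
        System.out.println(matchOffsets("\\d?", s));
        System.out.println(Arrays.toString(s.split("\\d?")));
        // Greedy one-or-more: the two digits form a single delimiter.
        System.out.println(Arrays.toString(s.split("\\d+")));
        // Optional group of digits, as hinted in the other answer.
        System.out.println(Arrays.toString(s.split("(\\d+)?")));
    }
}
```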
My problem is that I want to remove a character from some parts of a String, but I do not know how to restrict the removal.
Example:
A computer is a general purpose device that can be\n
programmed to carry out a finite set of\n
millions to billions of times more capable.\n
\n
In this era mechanical analog computers were used\n
for military applications.\n
1.1 Limited-function early computers\n
1.2 First general-purpose computers\n
1.3 Stored-program architecture\n
1.4 Semiconductors and\n
This example is the content of my string. What I want is to remove the \n at the end of lines 1 and 2 above, but not the \n from line 5 onwards. How do I remove some \n without removing the others? My goal is to turn the string into paragraphs without a \n after each line. In the example, the first 3 lines can be one paragraph and the later lines are in bullet form. In other words, I do not want to remove the \n after bulleted lines.
The real contents of the string is dynamic.
I have tried String.replaceAll("\n", " "), but clearly that does not work because it removes every \n. I have also thought of using a regex to detect alphanumeric characters, but that would remove some letters after the \n.
Try using this regex: -
str = str.replaceAll("(.+)(?<!\\.)\n(?!\\d)", "$1 ");
System.out.println(str);
This will replace your \n if it is not preceded by a dot - termination of a paragraph, and it is not followed by a digit, for when it is followed by a bulleted point. (like, your \n in first bullet point is followed by a 1.2. So, it will not be replaced.).
(.+) at the start, ensures that you are not replacing a blank line.
This will work for the string you have shown.
Explanation: -
(.+) -> A capture group, capturing anything, occurring at least once.
(?<!\\.) -> This is called negative-look-behind. It matches the string following it, only if that string is not preceded by a dot(.) given in the negative-look-behind pattern.
For e.g.: - You don't need to replace \n after the line: - millions to billions of times more capable.\n.
(?!\\d) -> This is called negative -look-ahead. It matches string behind it, only if that string is not followed by a digit (\\d) given in the negative-look-ahead pattern.
For e.g.: - In your bulleted points, computers\n is followed by 1.2. where 1 is a digit. So, you don't want to replace that \n.
Now, $1 represents the group captured in the pattern match. Since you just want to replace "\n", the rest of the match is kept as it is, while "\n" is replaced with a space.
So, $1 stands for the 1st group - (.+)
Note that look-ahead and look-behind are zero-width assertions: they do not capture or consume any text.
For More Details, follow these links: -
http://docs.oracle.com/javase/tutorial/essential/regex/
http://docs.oracle.com/javase/tutorial/essential/regex/quant.html
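For completeness, a self-contained sketch that runs the replaceAll call from this answer on a shortened version of the sample text. The class and method names are made up, and the text deliberately has no trailing \n, because a final \n not preceded by a dot would also be turned into a space.

```java
final class ParagraphJoiner {
    /** Joins wrapped lines, keeping \n after a dot, before a digit, and on blank lines. */
    static String join(String text) {
        return text.replaceAll("(.+)(?<!\\.)\n(?!\\d)", "$1 ");
    }

    public static void main(String[] args) {
        String text =
                "A computer is a general purpose device that can be\n"
                + "programmed to carry out a finite set of\n"
                + "millions to billions of times more capable.\n"
                + "\n"
                + "In this era mechanical analog computers were used\n"
                + "for military applications.\n"
                + "1.1 Limited-function early computers\n"
                + "1.2 First general-purpose computers";
        System.out.println(join(text));
    }
}
```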
I suspect your requirement is to remove the \n of lines 1 and 2.
What you can do is as below:
split your string into segments,
String[] array = yourString.split("\n");
then concatenate the segments, adding \n back after every line except lines 1 and 2 (array indices start at 0):
array[0] + " " + array[1] + " " + array[2] + "\n" + array[3] + "\n" // and so forth