Regular expression - multiline in Java - java

I have an arbitrary string, e.g.
String multiline=`
This is my "test" case
with lines
\section{new section}
Another incorrect test"
\section{next section}
With some more "text"
\subsection{next section}
With some more "text1"
`
I use LaTeX and I want to replace the quotes with those used in books, similar to ,, and ´´. For this I need to replace each opening quote with \glqq and each closing quote with \grqq, within each group which starts with \.?section.
If I try the following
String pattern1 = "(^\\\\.?section\\{.+\\})[\\s\\S]*(\\\"(.+)\\\")";
Pattern p = Pattern.compile(pattern1, Pattern.MULTILINE);
Matcher m = p.matcher(testString);
System.out.println(p.matcher(testString).find()); //true
while (m.find()) {
for (int i = 0; i < 4; i++) {
System.out.println("Index: " + i);
System.out.println(m.group(i).replaceAll("\"([\\w]+)\"", "\u00AB$1\u00BB"));
}
}
I get as a result on the console
true
Index: 0
\section{new section}
Another incorrect test"
\section{next section}
With some more «text1»
Index: 1
\section{new section}
Index: 2
«text1»
Index: 3
text1
My problems with the current approach:
The first valid match ("text") isn't found. I guess it has to do with the multiline flag and incorrect grouping of \section{. The quote matching should be restricted to a group which starts with \section and ends at the next \.?section. How do I make this correct?
Even when the text is found properly, how do I get a complete string with the replacements?

You may match all text between section and the next section or the end of the string, and replace every "..." substring inside it with «...».
Here is the Java snippet (see demo):
String s = "This is my \"test\" case\nwith lines\n\\section{new section}\nAnother incorrect test\"\n\\section{next section}\nWith some more \"text\"\n\\subsection{next section}\nWith some more \"text1\"";
StringBuffer result = new StringBuffer();
Matcher m = Pattern.compile("(?s)section.*?(?=section|$)").matcher(s);
while (m.find()) {
String out = m.group(0).replaceAll("\"([^\"]*)\"", "«$1»");
m.appendReplacement(result, Matcher.quoteReplacement(out));
}
m.appendTail(result);
System.out.println(result.toString());
Output:
This is my "test" case
with lines
\section{new section}
Another incorrect test"
\section{next section}
With some more «text»
\subsection{next section}
With some more «text1»
The pattern means:
(?s) - Pattern.DOTALL embedded flag option
section - a section substring
.*? - any 0+ chars, as few as possible
(?=section|$) - a positive lookahead that requires a section substring or end of string to appear immediately to the right of the current location.
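The same chunk-by-chunk idea can be wrapped in a reusable method. The sketch below is only an illustration: the class and method names are made up, and it assumes the LaTeX macros wanted in the question are \glqq and \grqq (written here as \grqq{} so that following text is not swallowed by the macro).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionQuotes {
    // A chunk runs from one "section" keyword to the next one, or to the end of input.
    private static final Pattern CHUNK = Pattern.compile("(?s)section.*?(?=section|$)");
    // A pair of straight double quotes with no quote character in between.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Hypothetical helper: replaces "..." with \glqq ...\grqq{} inside every chunk.
    static String replaceQuotes(String input) {
        StringBuffer result = new StringBuffer();
        Matcher m = CHUNK.matcher(input);
        while (m.find()) {
            // In a replacement string, backslash and $ are special, so each LaTeX
            // backslash is escaped; $1 refers to the text between the quotes.
            String chunk = QUOTED.matcher(m.group()).replaceAll("\\\\glqq $1\\\\grqq{}");
            // quoteReplacement keeps the backslashes of the chunk literal.
            m.appendReplacement(result, Matcher.quoteReplacement(chunk));
        }
        // Copy whatever follows the last chunk unchanged.
        m.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        String s = "Intro \"kept\"\n\\section{one}\nSome \"text\" here";
        System.out.println(replaceQuotes(s));
    }
}
```

Text before the first section keyword is never part of a chunk, so quotes there are left alone, just as in the output above.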

Related

Java Regex. group excluding delimiters

I'm trying to split my string using a regex. It should include even zero-length matches before and after every delimiter. For example, if the delimiter is ^ and my string is ^^^, I expect to get 4 zero-length groups.
I cannot use just regex = "([^\\^]*)" because it will include extra zero-length matches after every true match between delimiters.
So I have decided to match non-delimiter symbols that follow either the beginning of the line or a delimiter. It works perfectly on https://regex101.com/ (I'm sorry, I couldn't find a share option on this website to share my example), but in IntelliJ IDEA it skips one match.
So, now my code is:
final String regex = "(^|\\^)([^\\^]*)";
final String string = "^^^^";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find())
System.out.println("[" + matcher.start(2) + "-" + matcher.end(2) + "]: \"" + matcher.group(2) + "\"");
and I expect 5 empty-string matches. But I have only 4:
[0-0]: ""
[2-2]: ""
[3-3]: ""
[4-4]: ""
The question is why does it skip [1-1] match and how can I fix it?
Your regex matches either the start of the string or a ^ (capturing that into Group 1) and then any 0+ chars other than ^ into Group 2. When the first match is found (at the start of the string), Group 1 holds an empty string (as it is the start of the string) and Group 2 also holds an empty string (as the first char is ^ and [^^]* can match an empty string before a non-matching char). The whole match is zero-length, so the regex engine advances its search position by one char to avoid matching at the same place again. So, after the first match, the search position is moved from the start of the string to the position after the first ^. Then the second match is found: the second ^ and the empty string after it. Hence, the first ^ is never matched; it is skipped.
The solution is a simple split one:
String[] result = string.split("\\^", -1);
The negative second argument (the limit) makes the method keep all trailing empty strings in the resulting array; with the default limit of 0 they are discarded.
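To make the effect of the limit visible, and to show that a Matcher can also be used if the delimiter is checked without being consumed, here is a small sketch. The class and method names are hypothetical, and the lookbehind pattern is an alternative of my own, not something taken from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaretFields {
    // Each field must be preceded by the start of input or by a '^' delimiter.
    // The lookbehind only checks the delimiter and does not consume it, so the
    // search position never jumps over a field.
    private static final Pattern FIELD = Pattern.compile("(?<=^|\\^)[^\\^]*");

    // Hypothetical helper: collects every (possibly empty) field.
    static List<String> fields(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = FIELD.matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Matcher-based variant
        System.out.println(fields("^^^^").size());
        // Negative limit: trailing empty strings are kept
        System.out.println("^^^^".split("\\^", -1).length);
        // Default limit of 0: trailing empty strings are dropped
        System.out.println("^^^^".split("\\^").length);
    }
}
```

For plain splitting, split with a negative limit stays the shorter option; the Matcher variant is mainly useful when the match positions are needed as well.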
See a Java demo:
String str = "^^^^";
String[] result = str.split("\\^", -1);
System.out.println("Number of items: " + result.length);
for (String s: result) {
System.out.println("\"" + s+ "\"");
}
Output:
Number of items: 5
""
""
""
""
""

Regex java a regular expression to extract string except the last number

How can I extract all characters from a string except the last number (if one exists) in Java? I found how to extract the last number in a string using the regex [0-9.]+$, but I want the opposite.
Examples :
abd_12df1231 => abd_12df
abcd => abcd
abcd12a => abcd12a
abcd12a1 => abcd12a
What you might do is match, from the start of the string ^, one or more word characters \w+ followed by a non-digit \D:
^\w+\D
As suggested in the comments, you could expand the characters you want to match using a character class ^[\w-]+\D or if you want to match any character you could use a dot ^.+\D
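As a quick check of the ^\w+\D idea against the examples from the question, here is a minimal sketch (the class and method names are made up for illustration):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrefixBeforeTrailingDigits {
    // Greedy \w+ backtracks until the character after it is a non-digit,
    // so the match ends at the last non-digit word character.
    private static final Pattern PREFIX = Pattern.compile("^\\w+\\D");

    // Hypothetical helper: returns the matched prefix, or null if there is no match.
    static String prefix(String s) {
        Matcher m = PREFIX.matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = { "abd_12df1231", "abcd", "abcd12a", "abcd12a1" };
        for (String s : samples) {
            System.out.println(s + " => " + prefix(s));
        }
    }
}
```

Note that this pattern needs at least two characters and at least one non-digit, so a digits-only or one-character input yields no match at all; that case has to be handled separately.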
If you want to remove one or more digits at the end of the string, you may use
s = s.replaceFirst("[0-9]+$", "");
See the regex demo
To also remove floats, use
s = s.replaceFirst("[0-9]*\\.?[0-9]+$", "");
See another regex demo
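Details of the pattern (?s)^(.*?)\d*\.?\d+$ (used further below to match everything before the last number):
(?s) - a Pattern.DOTALL inline modifier
^ - start of string
(.*?) - Capturing group #1: any 0+ chars, including line break chars because of (?s), as few as possible
\d*\.?\d+ - an integer or float value
$ - end of string.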
Java demo:
List<String> strs = Arrays.asList("abd_12df1231", "abcd", "abcd12a", "abcd12a1", "abcd12a1.34567");
for (String str : strs)
System.out.println(str + " => \"" + str.replaceFirst("[0-9]*\\.?[0-9]+$", "") + "\"");
Output:
abd_12df1231 => "abd_12df"
abcd => "abcd"
abcd12a => "abcd12a"
abcd12a1 => "abcd12a"
abcd12a1.34567 => "abcd12a"
To actually match a substring from start till the last number, you may use
(?s)^(.*?)\d*\.?\d+$
See the regex demo
Java code:
String s = "abc234 def1.566";
Pattern pattern = Pattern.compile("(?s)^(.*?)\\d*\\.?\\d+$");
Matcher matcher = pattern.matcher(s);
if (matcher.find()){
System.out.println(matcher.group(1));
}
With this Regex you could capture the last digit(s)
\d+$
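You could save those digits and remove them from the string. Note that string.replace(lastDigit, "") removes every occurrence of that digit sequence, not only the trailing one, so it is safer to cut the string at the position where the match starts.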
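Since String.replace replaces every occurrence of its target, a string such as 1a1 would lose both digits with that approach. A sketch that uses the position of the \d+$ match instead (the class and method names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingDigits {
    // One or more digits at the end of the string.
    private static final Pattern LAST_DIGITS = Pattern.compile("\\d+$");

    // Hypothetical helper: cuts the string where the trailing digits start.
    static String stripTrailingDigits(String s) {
        Matcher m = LAST_DIGITS.matcher(s);
        // If there are no trailing digits, return the input unchanged.
        return m.find() ? s.substring(0, m.start()) : s;
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingDigits("abcd12a1"));
        System.out.println(stripTrailingDigits("1a1"));
        // For comparison: replace removes both occurrences of "1".
        System.out.println("1a1".replace("1", ""));
    }
}
```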

Java regex to match after start of previous match [duplicate]

How can I extract overlapping matches from an input using String.split()?
For example, if trying to find matches to "aba":
String input = "abababa";
String[] parts = input.split(???);
Expected output:
[aba, aba, aba]
String#split will not give you overlapping matches, because any given part of the string ends up in at most one element of the resulting array, never in two.
You should use Pattern and Matcher classes here.
You can use this regex:
Pattern pattern = Pattern.compile("(?=(aba))");
Then use the Matcher#find method to get all the overlapping matches, and print group(1) for each.
The regex matches every empty string that is followed by aba and captures aba into group 1. Since a lookahead is a zero-width assertion, it does not consume the text it matches, and hence you get all the overlapping matches.
String input = "abababa";
String patternToFind = "aba";
Pattern pattern = Pattern.compile("(?=" + patternToFind + ")");
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(patternToFind + " found at index: " + matcher.start());
}
Output:
aba found at index: 0
aba found at index: 2
aba found at index: 4
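To get the [aba, aba, aba] result asked for in the question, the captured group can be collected into a list. A minimal sketch, assuming the search text is a literal string (hence Pattern.quote); the class and method names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OverlappingMatches {
    // Hypothetical helper: returns every occurrence of a literal, overlaps included.
    static List<String> findOverlapping(String input, String literal) {
        // The capture group sits inside the lookahead, so the match itself is
        // empty but the text can still be read back with group(1).
        Pattern p = Pattern.compile("(?=(" + Pattern.quote(literal) + "))");
        Matcher m = p.matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findOverlapping("abababa", "aba"));
    }
}
```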
I would use indexOf.
for(int i = text.indexOf(find); i >= 0; i = text.indexOf(find, i + 1))
System.out.println(find + " found at " + i);
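A self-contained version of the indexOf loop might look like this. The names are hypothetical, and the guard for an empty search string is my own addition, because indexOf with an empty string never returns -1 and the loop would not terminate.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfOverlaps {
    // Hypothetical helper: start indices of all occurrences, overlaps included.
    static List<Integer> indicesOf(String text, String find) {
        List<Integer> hits = new ArrayList<>();
        if (find.isEmpty()) {
            return hits; // an empty search string would loop forever below
        }
        // Restart the search one character after the previous hit,
        // not after its end, so overlapping occurrences are found too.
        for (int i = text.indexOf(find); i >= 0; i = text.indexOf(find, i + 1)) {
            hits.add(i);
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(indicesOf("abababa", "aba"));
    }
}
```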
This is not a correct use of split(). From the javadocs:
Splits this string around matches of the given regular expression.
Seems to me that you are not trying to split the string but to find all matches of your regular expression in the string. For this you would have to use a Matcher, and some extra code that loops on the Matcher to find all matches and then creates the array.

Subtle Java Regular Expressions

String str = "1234545";
String regex = "\\d*";
Pattern p1 = Pattern.compile(regex);
Matcher m1 = p1.matcher(str);
while (m1.find()) {
System.out.print(m1.group() + " found at index : ");
System.out.print(m1.start());
}
The output of this program is 1234545 found at index : 0 found at index : 7.
My question is:
Why is there a space printed when there is actually no space in str?
The space printed after the first index 0 comes from the string literal " found at index : " that you print. It was supposed to come after the matched string; however, in this case the match is empty.
Here is what's going on: the first match consumes all digits in the string, leaving zero characters for the following match. However, the following match succeeds, because the asterisk * in your expression allows matching empty strings.
To avoid this confusion in the future, add delimiter characters around the actual match, like this:
System.out.print("'" + m1.group() + "' at index : ");
Now you would see an empty pair of single quotes, showing that the match was empty.
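If the empty match is not wanted at all, one option (my suggestion, not part of the original code) is to use \d+ so that at least one digit is required. The sketch below lists the matches of both patterns with quotes around them; the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {
    // Hypothetical helper: describes every match as 'text' at index N.
    static List<String> describe(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add("'" + m.group() + "' at index " + m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        // \d* also matches the empty string at the end of the input.
        System.out.println(describe("\\d*", "1234545"));
        // \d+ needs at least one digit, so there is no empty match.
        System.out.println(describe("\\d+", "1234545"));
    }
}
```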

Pattern: how subtract matched character in character class?

Is it possible to subtract a matched character in a character class?
The Java docs have examples of character classes with subtraction:
[a-z&&[^bc]] - a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]] - a through z, and not m through p: [a-lq-z] (subtraction)
I want to write pattern, which matches two pairs of word characters, when pairs are not the same:
1) "aaaa123" - should NOT match
2) "aabb123" - should match "aabb" part
3) "aa--123" - should NOT match
I am close to success with following pattern:
([\w])\1([\w])\2
but of course it does not work in case 1, so I need to subtract the match of first group. But when I try to do this:
Pattern p = Pattern.compile("([\\w])\\1([\\w&&[^\\1]])\\2");
I am getting an exception:
Exception in thread "main" java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 17
([\w])\1([\w&&[^\1]])\2
^
at java.util.regex.Pattern.error(Pattern.java:1713)
So it seems this does not work with group backreferences, only with explicitly listed characters. The following pattern compiles with no problems:
Pattern p = Pattern.compile("([\\w])\\1([\\w&&[^a]])\\2");
Is there any other way to write such pattern?
Use
Pattern p = Pattern.compile("((\\w)\\2(?!\\2))((\\w)\\4)");
Your characters will be in groups 1 and 3.
This works by using a negative lookahead to make sure that the character following the first pair is different from the character in that pair.
You are using the wrong tool for the job. By all means use a regex to detect pairs of character pairs, but you can just use != to test whether the characters within the pairs are the same. Seriously, there is no reason to do everything in a regular expression - it makes for unreadable, non-portable code and brings you no benefit other than "looking cool".
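A sketch of what that could look like: the regex only finds two pairs, and the comparison is done in Java. The class and method names are hypothetical, and restarting the search one character after a rejected match is my own addition so that an input like aaaabb is still handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DistinctPairs {
    // Two consecutive pairs of word characters; the pairs may still be equal.
    private static final Pattern TWO_PAIRS = Pattern.compile("(\\w)\\1(\\w)\\2");

    // Hypothetical helper: first run of two different pairs, or null if none.
    static String firstDistinctPairs(String s) {
        Matcher m = TWO_PAIRS.matcher(s);
        int from = 0;
        while (m.find(from)) {
            // The "pairs must differ" rule is checked in plain Java.
            if (!m.group(1).equals(m.group(2))) {
                return m.group();
            }
            // Rejected: retry one character further so that a candidate
            // overlapping the rejected match is not missed.
            from = m.start() + 1;
        }
        return null;
    }

    public static void main(String[] args) {
        for (String s : new String[] { "aaaa123", "aabb123", "aa--123", "aaaabb" }) {
            System.out.println(s + " ==> " + firstDistinctPairs(s));
        }
    }
}
```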
Try this
String regex = "(\\w)\\1(?!\\1)(\\w)\\2";
Pattern pattern = Pattern.compile(regex);
(?!\\1) is a negative lookahead; it ensures that the text captured by \\1 does not follow at that position.
My test code
String s1 = "aaaa123";
String s2 = "aabb123";
String s3 = "aa--123";
String s4 = "123ccdd";
String[] s = { s1, s2, s3, s4 };
String regex = "(\\w)\\1(?!\\1)(\\w)\\2";
for(String a : s) {
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(a);
if (matcher.find())
System.out.println(a + " ==> Success");
else
System.out.println(a + " ==> Failure");
}
The output
aaaa123 ==> Failure
aabb123 ==> Success
aa--123 ==> Failure
123ccdd ==> Success
