Java Regex Matcher not giving expected result - java

I have the following code.
String _partsPattern = "(.*)((\n\n)|(\n)|(.))";
static final Pattern partsPattern = Pattern.compile(_partsPattern);
String text= "PART1: 01/02/03\r\nFindings:no smoking";
Matcher match = partsPattern.matcher(text);
while (match.find()) {
System.out.println( match.group(1));
return; // I only care about the first match for this purpose
}
Output: PART1: 01/02/0
I was expecting PART1: 01/02/03. Why is the 3 at the end of the first line missing from my result?

The problem with your regex is that . does not match line separators such as \r or \n, so (.*) stops just before \r. The last part of your regex,
(.*)((\n\n)|(\n)|(.))
    ^^^^^^^^^^^^^^^^^
is required, and none of its alternatives can match \r. The regex therefore backtracks: (.*) gives up its last character, the 3, which is then matched by (.) instead of ending up in group 1.
If you don't want to include these line separators in your match, just use the pattern "(.*)$" with the Pattern.MULTILINE flag, which makes $ match at the end of each line (that is, just before a standard line separator such as \r, \r\n or \n, without including it in the match).
So try with
String _partsPattern = "(.*)$"; // parentheses are not required now
final Pattern partsPattern = Pattern.compile(_partsPattern,Pattern.MULTILINE);
Another approach would be to change your regex to something like (.*)((\r\n)|(\n)|(.)) or (.*)((\r?\n)|(.)), but I am not sure what the purpose of the last (.) would be (I would probably remove it). This is just a variation of your original regex.
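A minimal sketch of the behaviour described above, using the sample text from the question. The class name FirstLineDemo and the helper firstGroup are made up for this illustration; they are not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FirstLineDemo {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "PART1: 01/02/03\r\nFindings:no smoking";

        // Original pattern: (.*) stops before \r, then has to give the "3" to (.)
        System.out.println(firstGroup("(.*)((\n\n)|(\n)|(.))", 0, text)); // PART1: 01/02/0

        // MULTILINE makes $ match just before the \r\n line separator
        System.out.println(firstGroup("(.*)$", Pattern.MULTILINE, text)); // PART1: 01/02/03
    }
}
```

Since only the first match is needed, a single find() call is enough here and no loop is required.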

Works for me, giving "PART1: 01/02/03 ". So my guess is that in the real code you read the text, maybe with BufferedReader.readLine, and erroneously strip a carriage return + line feed. Far-fetched, but I cannot imagine otherwise. (readLine strips the line terminator itself.)

Related

Java Regex - Remove Non-Alphanumeric characters except line breaks

I'm trying to remove all the non-alphanumeric characters from a String in Java but keep the carriage returns. I have the following regular expression, but it keeps joining words before and after a line break.
[^\\p{Alnum}\\s]
How would I be able to preserve the line breaks or convert them into spaces so that I don't have words joining?
An example of this issue is shown below:
Original Text
and refreshingly direct
when compared with the hand-waving of Swinburne.
After Replacement:
and refreshingly directwhen compared with the hand-waving of Swinburne.
You may add these chars to the regex, not \s, as \s matches any whitespace:
String reg = "[^\\p{Alnum}\n\r]";
Or, you may use character class subtraction:
String reg = "[\\P{Alnum}&&[^\n\r]]";
Here, \P{Alnum} matches any non-alphanumeric and &&[^\n\r] prevents a LF and CR from matching.
A Java test:
String s = "&&& Text\r\nNew line".replaceAll("[^\\p{Alnum}\n\r]+", "");
System.out.println(s);
// => Text
//    Newline
Note that there are more line break characters than LF and CR. Since Java 8, the \R construct matches any style of line break; it is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].
So, to exclude matching any line breaks, you may use
String reg = "[^\\p{Alnum}\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]+";
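A small sketch of the two options mentioned in the question, keeping the line breaks or turning them into spaces. The class and method names (StripNonAlnum, strip, breaksToSpaces) are only illustrative, and \R assumes Java 8 or later.

```java
public class StripNonAlnum {
    // Removes every character that is neither ASCII alphanumeric nor LF/CR.
    static String strip(String s) {
        return s.replaceAll("[\\P{Alnum}&&[^\n\r]]+", "");
    }

    // Replaces each line break (of any style, \r\n counts as one) with a single space.
    static String breaksToSpaces(String s) {
        return s.replaceAll("\\R", " ");
    }

    public static void main(String[] args) {
        // The \r\n between the two lines survives, the "&&& " and the space do not.
        System.out.println(strip("&&& Text\r\nNew line"));

        // Prints "direct when"
        System.out.println(breaksToSpaces("direct\r\nwhen"));
    }
}
```

Note that strip also removes ordinary spaces, exactly like the patterns above; add a space to the excluded set if words on the same line should stay separated.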
You can use the regex [^A-Za-z0-9\\n\\r], for example:
String result = str.replaceAll("[^a-zA-Z0-9\\n\\r]", "");
Example
Input
aaze03.aze1654aze987 */-a*azeaze\n hello *-*/zeaze+64\nqsdoi
Output
aaze03aze1654aze987aazeaze
hellozeaze64
qsdoi
I made a mistake with my code. I was reading in a file line by line and building the String, but didn't add a space at the end of each line. Therefore there were no actual line breaks to replace.
That's a perfect case for Guava's CharMatcher:
String input = "and refreshingly direct\n\rwhen compared with the hand-waving of Swinburne.";
String output = CharMatcher.javaLetterOrDigit().or(CharMatcher.whitespace()).retainFrom(input);
Output will be:
and refreshingly direct
when compared with the handwaving of Swinburne

Java replaceAll remove spaces from empty lines

I'm trying to remove all spaces from lines in a block of text which contain nothing but spaces, leaving the line breaks in place.
I tried the following:
str = " text\n \n \n text";
str = str
.replaceAll("\\A +\\n", "\n")
.replaceAll("(\\n +\\n)", "\n\n")
.replaceAll("\\n +\\Z", "\n");
I was expecting the output to be
" text\n\n\n text"
but instead it was
" text\n\n \n text"
The space in the third line of the block had not been removed. What am I doing wrong here?
Use the MULTILINE flag, so that ^ and $ match the beginning and end of each line. The problem with your regex is that each match consumes the trailing newline character, so the next search starts after it and cannot use that newline as the start of the following match.
str.replaceAll("(?m)^ +$", "")
You need to match lines that contain only horizontal whitespace, and the Pattern.MULTILINE modifier is required for the ^ and $ anchors to match the start and end of each line (its embedded option is (?m)). Use
String str = " text\n \n \n text";
str = str.replaceAll("(?m)^[\\p{Zs}\t]+$", "");
Details:
(?m) - Multiline mode
^ - start of line
[\\p{Zs}\t]+ - 1 or more horizontal whitespaces
$ - end of line.
An alternative to [\p{Zs}\t] is a pattern that matches any whitespace except vertical whitespace. In Java, character class subtraction can be handy: [\s&&[^\r\n]], where \s matches any whitespace and &&[^\r\n] excludes carriage return and newline characters from it. A full pattern would look like .replaceAll("(?Um)^[\\s&&[^\r\n]]+$", ""), where the U flag makes \s Unicode-aware.
Use anchors:
str = str.replaceAll("(?m)^[^\\S\\n]+$", "");
Where ^ and $ match respectively the start and the end of a line when the multiline flag (?m) is switched on.
The problem with your pattern is that it matches \\n on both sides of the horizontal whitespace (plain spaces in your pattern): replaceAll("(\\n +\\n)", "\n\n"). Because of that, you can't get contiguous matches, since the same character can't be matched twice.
Note: if you need to handle Windows or old Mac line endings, also add \\r to the character class (so that it is excluded just like \\n).
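To make the difference concrete, here is a short sketch that runs the chained replacements from the question next to an anchored version. The class and method names are invented for the example, and the anchored pattern already excludes \r as suggested in the note.

```java
public class BlankLineDemo {
    // The three chained replacements from the question.
    static String chained(String s) {
        return s.replaceAll("\\A +\\n", "\n")
                .replaceAll("(\\n +\\n)", "\n\n")
                .replaceAll("\\n +\\Z", "\n");
    }

    // Anchored version: matches only the whitespace, never the line breaks around it.
    static String anchored(String s) {
        return s.replaceAll("(?m)^[^\\S\\r\\n]+$", "");
    }

    // Makes line breaks visible when printing.
    static String show(String s) {
        return s.replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String str = " text\n \n \n text";
        System.out.println(show(chained(str)));  //  text\n\n \n text  (third line keeps its space)
        System.out.println(show(anchored(str))); //  text\n\n\n text
    }
}
```

In the chained version the first "\n \n" match uses up the newline that the second blank line would need as its own starting point, which is why only every other blank line gets cleaned.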

JAVA REGEX find the correct pattern

I tried to use regex. I have this pattern
STACK
blabla
OVER
blabla
STACK
vlvlv
OVER
and there may be another line at the end.
I wrote this pattern, which seems to work on regex-testing sites but doesn't work in Java.
"^(STACK(\n[^\n]+\n)OVER(\n[^\n]+(\n)?)?)+$"
What is the right pattern?
Thanks
Assuming that you want to check whether your entire input can be matched by the regex, you can use something like
String data =
"STACK\r\n" +
"blabla\r\n" +
"OVER\r\n" +
"blabla\r\n" +
"STACK\r\n" +
"vlvlv\r\n" +
"OVER";
String regex ="(^STACK$((\r?\n|\r).+(\r?\n|\r))^OVER$((\r?\n|\r).+(\r?\n|\r)?)?+)+";
Pattern p = Pattern.compile(regex,Pattern.MULTILINE);
Matcher m = p.matcher(data);
System.out.println(m.matches());
I added the Pattern.MULTILINE flag so that ^ and $ match the start and end of each line rather than, as by default, the start and end of the entire input.
Also, to require that STACK and OVER are the only words on their lines, I surrounded them with ^ and $.
Another thing your regex did not allow for is that the line separator can also be \r\n or \r, so I changed it to accept those.
The last thing I did was change [^\n] to . since they mean almost the same thing (the dot does not match \r, while [^\n] does).
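The following sketch compares the pattern from the question with the one above on the same data, once with \n and once with \r\n line endings. That the original input used Windows line endings is only an assumption; the question does not say. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class LineSeparatorDemo {
    // Pattern from the question: only accepts \n as line separator.
    static final String ORIGINAL = "^(STACK(\n[^\n]+\n)OVER(\n[^\n]+(\n)?)?)+$";

    // Pattern from the answer: accepts \n, \r\n and \r, compiled with MULTILINE.
    static final Pattern TOLERANT = Pattern.compile(
            "(^STACK$((\r?\n|\r).+(\r?\n|\r))^OVER$((\r?\n|\r).+(\r?\n|\r)?)?+)+",
            Pattern.MULTILINE);

    static boolean original(String s) {
        return Pattern.matches(ORIGINAL, s);
    }

    static boolean tolerant(String s) {
        return TOLERANT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String unix = "STACK\nblabla\nOVER\nblabla\nSTACK\nvlvlv\nOVER";
        String windows = unix.replace("\n", "\r\n");

        System.out.println(original(unix));    // true
        System.out.println(original(windows)); // false: \r is in the way of \n
        System.out.println(tolerant(unix));    // true
        System.out.println(tolerant(windows)); // true

        // The dot does not match \r, while [^\n] does.
        System.out.println("a\r".matches(".+"));     // false
        System.out.println("a\r".matches("[^\n]+")); // true
    }
}
```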

Escape sequence in regex parsed by Pattern.LITERAL

Given the following snippet:
Pattern pt = Pattern.compile("\ndog", Pattern.LITERAL);
Matcher mc = pt.matcher("\ndogDoG");
while(mc.find())
{
System.out.printf("I have found %s starting at the " +
"index %s and ending at the index %s%n",mc.group(),mc.start(),mc.end());
}
The output will be:
I have found
dog starting at the index 0 and ending at the index 4.
This means the \n was interpreted even though I specified Pattern.LITERAL, about which this link says:
Pattern.LITERAL Enables literal parsing of the pattern. When this flag
is specified then the input string that specifies the pattern is
treated as a sequence of literal characters. Metacharacters or escape
sequences in the input sequence will be given no special meaning.
However, the output of the above snippet shows that the \n escape sequence is interpreted; it is not treated literally.
Why does this happen, given that the tutorial says it should not?
I know \n is a line terminator, however it's still an escape sequence character.
"however it's still an escape sequence character."
No, it's not. It's a newline character. You can do:
char c = '\n';
Your output is therefore expected.
Note that if you compile a pattern with:
Pattern.compile("\n")
then the pattern is the newline character itself (the Java compiler has already turned \n into that single character).
BUT if you compile with:
Pattern.compile("\\n")
then the pattern is the regex escape sequence \n. Both happen to match the same thing.
Pattern.LITERAL cares about regex escapes, not Java string literals.
Therefore, it treats \\n as a backslash followed by n (instead of the regex token for newline), but the string "\n" already contains the line feed character itself, so the flag makes no difference there.
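A quick way to see the distinction is to compile both spellings with and without the flag. This is just a sketch; the class name LiteralFlagDemo and the helper found are made up.

```java
import java.util.regex.Pattern;

public class LiteralFlagDemo {
    // Hypothetical helper: true if the pattern occurs anywhere in the input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String withNewline = "a\nb";      // contains a real line feed
        String withBackslashN = "a\\nb";  // contains a backslash followed by n

        // Regex escape \n (two chars in the pattern) matches a line feed...
        System.out.println(found("\\n", 0, withNewline));                    // true
        // ...but with LITERAL it only matches a backslash followed by n.
        System.out.println(found("\\n", Pattern.LITERAL, withNewline));      // false
        System.out.println(found("\\n", Pattern.LITERAL, withBackslashN));   // true

        // Java string escape "\n" is already a line feed before Pattern sees it.
        System.out.println(found("\n", Pattern.LITERAL, withNewline));       // true
    }
}
```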

groovy or java: how to retrieve a block of comments using regex from /** ***/?

This might be a piece of cake for java experts. Please help me out:
I have a block of comments in my program like this:
/*********
block of comments - line 1
line 2
.....
***/
How could I retrieve "block of comments" using regex?
thanks.
Something like this should do:
String str =
"some text\n"+
"/*********\n" +
"block of comments - line 1\n" +
"line 2\n"+
"....\n" +
"***/\n" +
"some more text";
Pattern p = Pattern.compile("/\\*+(.*?)\\*+/", Pattern.DOTALL);
Matcher m = p.matcher(str);
if (m.find())
System.out.println(m.group(1));
(DOTALL says that the . in the pattern should also match new-line characters)
Prints:
block of comments - line 1
line 2
....
Pattern regex = Pattern.compile("/\\*[^\\r\\n]*[\\r\\n]+(.*?)[\\r\\n]+[^\\r\\n]*\\*+/", Pattern.DOTALL);
This works because comments can't be nested in Java.
It is important to use a reluctant quantifier (.*?) or we will match everything from the first comment to the last comment in a file, regardless of whether there is actual code in-between.
/\* matches /*.
[^\r\n]* matches whatever else is on the rest of this line.
[\r\n]+ matches one or more line-break characters.
(.*?) captures as few characters as possible.
[\r\n]+ matches one or more line-break characters.
[^\r\n]* matches any characters on the line of the closing */.
\*+/ matches one or more * followed by /.
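As a rough check of the points above, the sketch below applies that pattern to the sample text and then shows what happens when the reluctant quantifier is swapped for a greedy one. The class name, the helper firstGroup and the two-comment sample are assumptions added for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentBodyDemo {
    // Hypothetical helper: group 1 of the first DOTALL match, or null if none.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String str = "some text\n/*********\nblock of comments - line 1\n"
                + "line 2\n....\n***/\nsome more text";

        // Captures only the lines between the opening and closing comment lines.
        String body = firstGroup(
                "/\\*[^\\r\\n]*[\\r\\n]+(.*?)[\\r\\n]+[^\\r\\n]*\\*+/", str);
        System.out.println(body);

        // Two comments with code in between: reluctant stops at the first */
        String two = "/* first */\nint x = 1;\n/* second */";
        System.out.println(firstGroup("/\\*+(.*?)\\*+/", two)); // " first "
        // Greedy runs on to the last */ and swallows the code as well.
        System.out.println(firstGroup("/\\*+(.*)\\*+/", two));
    }
}
```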
Not sure about the multi-line issues, but if it were all on one line, you could do this:
^\/\*+.*\*+\/$
That breaks down to:
^ start of a line
\/\*+ start of a comment, one or more *'s (both characters escaped)
.* any number of characters
\*+\/ end of a comment, one or more *'s (both characters escaped)
$ end of a line
By the way, it's "regex" not "regrex" :)
