I'm trying to remove all spaces from lines in a block of text which contain nothing but spaces, leaving the line breaks in place.
I tried the following:
str = " text\n \n \n text";
str = str
.replaceAll("\\A +\\n", "\n")
.replaceAll("(\\n +\\n)", "\n\n")
.replaceAll("\\n +\\Z", "\n");
I was expecting the output to be
" text\n\n\n text"
but instead it was
" text\n\n \n text"
The space in the third line of the block had not been removed. What am I doing wrong here?
Use the MULTILINE flag, so that ^ and $ match at the beginning and end of each line. The problem with your regex is that it consumes the newline character that ends each match, so the next match attempt starts past that newline and cannot use it as the leading \n of the following line.
str = str.replaceAll("(?m)^ +$", "");
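To make the difference concrete, here is a minimal sketch that runs the chained replacements from the question next to the multiline version on the same input. The class and method names are made up for illustration.

```java
class BlankLineDemo {
    // The three chained replacements from the question, unchanged.
    static String original(String s) {
        return s.replaceAll("\\A +\\n", "\n")
                .replaceAll("(\\n +\\n)", "\n\n")
                .replaceAll("\\n +\\Z", "\n");
    }

    // The multiline version: ^ and $ are zero-width, so no newline is consumed.
    static String multiline(String s) {
        return s.replaceAll("(?m)^ +$", "");
    }

    // Makes line breaks visible when printing.
    static String visible(String s) {
        return s.replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String str = " text\n \n \n text";
        System.out.println(visible(original(str)));  // " text\n\n \n text"
        System.out.println(visible(multiline(str))); // " text\n\n\n text"
    }
}
```

Because the anchors do not consume characters, two adjacent whitespace-only lines can both match.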
You need to match lines with horizontal spaces only and the Pattern.MULTILINE modifier is required for the ^ and $ anchors to match start and end of lines respectively (its embedded option is (?m)). Use
String str = " text\n \n \n text";
str = str.replaceAll("(?m)^[\\p{Zs}\t]+$", "");
See the Java demo.
Details:
(?m) - Multiline mode
^ - start of line
[\\p{Zs}\t]+ - 1 or more horizontal whitespaces
$ - end of line.
An alternative to [\p{Zs}\t] is a pattern to match any whitespace excluding vertical whitespace symbols. In Java, character class subtraction can be handy: [\s&&[^\r\n]] where [\s] matches any whitespace and &&[^\r\n] excludes a carriage return and newline characters from it. A full pattern would look like .replaceAll("(?Um)^[\\s&&[^\r\n]]+$", "").
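As a rough check of the two character classes, the sketch below applies both patterns to a line made of a space, a tab and a no-break space (U+00A0). The class name and the sample input are assumptions chosen for illustration.

```java
class HorizontalWhitespaceDemo {
    // Unicode space separators (category Zs) plus the tab character.
    static String byCategory(String s) {
        return s.replaceAll("(?m)^[\\p{Zs}\t]+$", "");
    }

    // Any whitespace except CR and LF; (?U) makes \s Unicode-aware.
    static String byIntersection(String s) {
        return s.replaceAll("(?Um)^[\\s&&[^\r\n]]+$", "");
    }

    public static void main(String[] args) {
        // Second line holds a space, a tab and a no-break space.
        String input = "a\n \t\u00A0\nb";
        System.out.println(byCategory(input).equals("a\n\nb"));
        System.out.println(byIntersection(input).equals("a\n\nb"));
    }
}
```

Both calls should leave the two line breaks in place and empty the middle line.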
Use anchors:
str = str.replaceAll("(?m)^[^\\S\\n]+$", "");
Where ^ and $ match respectively the start and the end of a line when the multiline flag (?m) is switched on.
The problem with your pattern is that you put \\n on both sides of the horizontal whitespace in replaceAll("(\\n +\\n)", "\n\n") (plain spaces in your pattern). Matches cannot overlap, so the trailing \\n of one match cannot also serve as the leading \\n of the next, and you can't get contiguous results.
Note: if needed, add \\r to the character class (to exclude it, like \\n) if you want to take Windows or old Mac line endings into account.
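A small sketch of the variant with \\r added to the class, run against Windows-style line endings. The class name and the sample string are illustrative assumptions.

```java
class CrLfBlankLineDemo {
    // [^\S\r\n] means: whitespace, but neither CR nor LF.
    static String clean(String s) {
        return s.replaceAll("(?m)^[^\\S\\r\\n]+$", "");
    }

    public static void main(String[] args) {
        String windows = " text\r\n \r\n \r\n text";
        // Show CR and LF as visible escapes.
        String shown = clean(windows).replace("\r", "\\r").replace("\n", "\\n");
        System.out.println(shown); // " text\r\n\r\n\r\n text"
    }
}
```

The CR LF pairs are left intact; only the spaces on the whitespace-only lines are removed.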
Related
I want to use Java String.replaceAll(regex, "") to remove all leading spaces from each line of a multi-line text string and, if possible, remove all carriage returns as well. What should the regex be?
Converting my comment to answer so that solution is easy to find for future visitors.
You may use regex replacement in Java:
str = str.replaceAll("(?m)^\\s+|\\s+$", "");
RegEx Details:
(?m): Enable MULTILINE mode so that ^ and $ are matched in every line.
^\\s+: Match 1+ whitespaces at line start
|: OR
\\s+$: Match 1+ whitespaces before line end
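A quick sketch of this replacement in use. Keep in mind that \\s also matches line break characters, so on input containing empty lines the pattern can remove more than leading and trailing spaces. The second method, which restricts the class to spaces and tabs, is my own assumption about what is often wanted and is not part of the answer above.

```java
class LineTrimDemo {
    // The pattern from the answer: whitespace at line start or before line end.
    static String trimLines(String s) {
        return s.replaceAll("(?m)^\\s+|\\s+$", "");
    }

    // Assumed variant: only spaces and tabs, so line breaks are never touched.
    static String trimHorizontal(String s) {
        return s.replaceAll("(?m)^[ \t]+|[ \t]+$", "");
    }

    public static void main(String[] args) {
        System.out.println(trimLines("  a  \n  b").equals("a\nb"));
        // The empty line between a and b survives with the horizontal-only class.
        System.out.println(trimHorizontal("  a  \n\n  b\t").equals("a\n\nb"));
    }
}
```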
I have a String like
"this is line 1\n\n\nthis is line 2\n\n\nthis is line 3\t\t\tthis is line 3 also"
What I want to do is remove repeated specific characters like "\n", "\t" from this text.
"this is line 1\nthis is line 2\nthis is line 3\tthis is line 3 also"
I tried some regular expressions, but they didn't work for me.
text = text.replace("/[^\\w\\s]|(.)\\1/gi", "");
Is there any regex for this?
If you only need to remove specific whitespace chars, \s won't help as it will overmatch, i.e. it will also match spaces, hard spaces, etc.
You may put the chars in a character class, wrap the class in a capturing group, and use a backreference to the captured value to match its repetitions. Then replace the whole match with the Group 1 value:
.replaceAll("([\n\t])\\1+", "$1")
See the regex demo.
Details
([\n\t]) - Group 1 (referred to with \1 from the pattern and $1 from the replacement pattern): a character class matching either a newline or a tab
\1+ - one or more repetitions of the value in Group 1.
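Applied to the string from the question, the replacement can be sketched like this (the class and method names are illustrative):

```java
class CollapseRepeatsDemo {
    // Collapses runs of the same character (newline or tab) into one.
    static String collapse(String s) {
        return s.replaceAll("([\n\t])\\1+", "$1");
    }

    public static void main(String[] args) {
        String text = "this is line 1\n\n\nthis is line 2\n\n\nthis is line 3\t\t\tthis is line 3 also";
        String expected = "this is line 1\nthis is line 2\nthis is line 3\tthis is line 3 also";
        System.out.println(collapse(text).equals(expected));
    }
}
```

Since the backreference repeats the captured character, a newline followed by a tab is not collapsed; only runs of one and the same character are.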
I would use Guava's CharMatcher:
CharMatcher.javaIsoControl().removeFrom(myString)
I'm trying to remove all the non-alphanumeric characters from a String in Java but keep the carriage returns. I have the following regular expression, but it keeps joining words before and after a line break.
[^\\p{Alnum}\\s]
How would I be able to preserve the line breaks or convert them into spaces so that I don't have words joining?
An example of this issue is shown below:
Original Text
and refreshingly direct
when compared with the hand-waving of Swinburne.
After Replacement:
and refreshingly directwhen compared with the hand-waving of Swinburne.
You may add these chars to the regex, not \s, as \s matches any whitespace:
String reg = "[^\\p{Alnum}\n\r]";
Or, you may use character class subtraction:
String reg = "[\\P{Alnum}&&[^\n\r]]";
Here, \P{Alnum} matches any non-alphanumeric and &&[^\n\r] prevents a LF and CR from matching.
A Java test:
String s = "&&& Text\r\nNew line".replaceAll("[^\\p{Alnum}\n\r]+", "");
System.out.println(s);
// => Text
//    Newline
Note that there are more line break chars than LF and CR. Since Java 8, the \R construct matches any Unicode line break sequence; it is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029].
So, to exclude matching any line breaks, you may use
String reg = "[^\\p{Alnum}\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]+";
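The sketch below runs the class-subtraction form and the extended line break list side by side. The class name and the second sample input are assumptions for illustration. Note that all three variants in this answer also remove ordinary spaces, because a space is not alphanumeric.

```java
class KeepLineBreaksDemo {
    // Class subtraction: non-alphanumerics except LF and CR.
    static String bySubtraction(String s) {
        return s.replaceAll("[\\P{Alnum}&&[^\n\r]]+", "");
    }

    // Negated class that also spares the other line break characters.
    static String keepAllLineBreaks(String s) {
        return s.replaceAll(
            "[^\\p{Alnum}\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]+", "");
    }

    public static void main(String[] args) {
        String s = "&&& Text\r\nNew line";
        System.out.println(bySubtraction(s).equals("Text\r\nNewline"));
        System.out.println(keepAllLineBreaks(s).equals("Text\r\nNewline"));
        // U+2028 (line separator) is kept by the extended list.
        System.out.println(keepAllLineBreaks("a-b\u2028c!").equals("ab\u2028c"));
    }
}
```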
You can use this regex [^A-Za-z0-9\\n\\r], for example:
String result = str.replaceAll("[^a-zA-Z0-9\\n\\r]", "");
Example
Input
aaze03.aze1654aze987 */-a*azeaze\n hello *-*/zeaze+64\nqsdoi
Output
aaze03aze1654aze987aazeaze
hellozeaze64
qsdoi
I made a mistake with my code. I was reading in a file line by line and building the String, but didn't add a space at the end of each line. Therefore there were no actual line breaks to replace.
That's a perfect case for Guava's CharMatcher:
String input = "and refreshingly direct\n\rwhen compared with the hand-waving of Swinburne.";
String output = CharMatcher.javaLetterOrDigit().or(CharMatcher.whitespace()).retainFrom(input);
Output will be:
and refreshingly direct
when compared with the handwaving of Swinburne
My code is as follows:
public class Test {
static String REGEX = ".*([ |\t|\r\n|\r|\n]).*";
static String st = "abcd\r\nefgh";
public static void main(String args[]){
System.out.println(st.matches(REGEX));
}
}
The code outputs false. In all other cases it matches as expected, but I can't figure out what the problem is here.
You need to remove the character class.
static String REGEX = ".*( |\t|\r\n|\r|\n).*";
You can't put \r\n inside a character class. If you do, it is treated as two separate items, \r and \n, so the class matches either \r or \n (and the | inside a class is a literal pipe, not alternation). You already know that .* won't match line breaks, so .* matches the first part and the character class then matches a single character, i.e. \r. The following character is \n, which the final .* cannot match, so matches(), which requires the whole input to match, returns false.
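The difference can be checked with a short sketch that runs both patterns against the string from the question. The class and method names are made up for illustration.

```java
class CharClassVsGroupDemo {
    // Pattern from the question: \r\n sits inside a character class.
    static boolean withCharClass(String s) {
        return s.matches(".*([ |\t|\r\n|\r|\n]).*");
    }

    // Fixed pattern: alternation in a group, so \r\n is matched as a sequence.
    static boolean withGroup(String s) {
        return s.matches(".*( |\t|\r\n|\r|\n).*");
    }

    public static void main(String[] args) {
        String st = "abcd\r\nefgh";
        System.out.println(withCharClass(st)); // false
        System.out.println(withGroup(st));     // true
    }
}
```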
UPDATE:
Based on your comments, you need something like this:
.*(?:[ \r\n\t].*)+
EXPLANATION:
In plain words, it is a regex that matches a line, then 1 or more lines. Or, just a multiline text.
.* - 0 or more characters other than a newline
(?:[ \r\n\t].*)+ - a non-capturing group that matches 1 or more times a sequence of
[ \r\n\t] - either a space, or a \r or \n or \t
.* - 0 or more characters other than a newline
See demo
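A minimal sketch of how this pattern behaves with matches(); the sample inputs and the class name are assumptions for illustration.

```java
class MultiSegmentDemo {
    // One segment, then one or more "whitespace char + segment" repetitions.
    static boolean hasSeparator(String s) {
        return s.matches(".*(?:[ \r\n\t].*)+");
    }

    public static void main(String[] args) {
        System.out.println(hasSeparator("abcd\r\nefgh")); // true
        System.out.println(hasSeparator("ab cd"));        // true
        System.out.println(hasSeparator("abcd"));         // false
    }
}
```

For "abcd\r\nefgh" the group runs twice: once for \r followed by an empty .*, and once for \n followed by efgh.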
Original answer
You can fix your pattern 2 ways:
String REGEX = ".*(?:\r\n|[ \t\r\n]).*";
This way we match either \r\n sequence, or any character in the character class.
Or, since the character class only matches 1 character, we can add + after it to match 1 or more:
String REGEX = ".*[ \t\r\n]+.*";
See IDEONE demo
Note that it is not a good idea to use single characters in alternations, as it decreases performance.
Also note that capturing groups should not be overused. If you do not plan to use the contents of the group, use non-capturing groups ((?:...)), or remove them.
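As a sanity check, the two fixed patterns can be compiled once and compared on a few inputs. This is only a sketch; the class name, constant names and sample strings are illustrative.

```java
import java.util.regex.Pattern;

class WhitespaceAlternativesDemo {
    // Non-capturing alternation: the \r\n sequence or a single class character.
    static final Pattern ALTERNATION = Pattern.compile(".*(?:\r\n|[ \t\r\n]).*");
    // Quantified character class: one or more whitespace characters.
    static final Pattern QUANTIFIED = Pattern.compile(".*[ \t\r\n]+.*");

    static boolean byAlternation(String s) {
        return ALTERNATION.matcher(s).matches();
    }

    static boolean byQuantifier(String s) {
        return QUANTIFIED.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abcd\r\nefgh", "abcd\refgh", "abcd"}) {
            // Prints the two results for each sample input.
            System.out.println(byAlternation(s) + " " + byQuantifier(s));
        }
    }
}
```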
I have the following code.
String _partsPattern = "(.*)((\n\n)|(\n)|(.))";
static final Pattern partsPattern = Pattern.compile(_partsPattern);
String text= "PART1: 01/02/03\r\nFindings:no smoking";
Matcher match = partsPattern.matcher(text);
while (match.find()) {
System.out.println( match.group(1));
return; //I just care on the first match for this purpose
}
Output: PART1: 01/02/0
I was expecting PART1: 01/02/03. Why is the 3 at the end of my text not included in the result?
The problem with your regex is that . does not match line separators such as \r or \n, so (.*) stops before \r. The last part of your regex
(.*)((\n\n)|(\n)|(.))
    ^^^^^^^^^^^^^^^^^
is required, and none of its alternatives can match \r. The engine therefore backtracks: (.*) gives up its last character, the 3, which is then matched by (.).
If you don't want to include these line separators in your match, just use the pattern "(.*)$" with the Pattern.MULTILINE flag, which makes $ match at the end of each line (it matches before standard line separators such as \r, \r\n or \n without including them in the match).
So try with
String _partsPattern = "(.*)$"; //parenthesis are not required now
final Pattern partsPattern = Pattern.compile(_partsPattern,Pattern.MULTILINE);
Another approach would be to change your regex to something like (.*)((\r\n)|(\n)|(.)) or (.*)((\r?\n)|(.)), but I am not sure what the purpose of the last (.) would be (I would probably remove it). It is just a variation of your original regex.
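The three patterns discussed here can be compared with a small sketch that returns group 1 of the first match. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstLineDemo {
    // Returns group 1 of the first match, or null when nothing matches.
    static String firstGroup(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Pattern from the question.
    static String original(String text) {
        return firstGroup(Pattern.compile("(.*)((\n\n)|(\n)|(.))"), text);
    }

    // (.*)$ with MULTILINE: $ matches before the line separator.
    static String multiline(String text) {
        return firstGroup(Pattern.compile("(.*)$", Pattern.MULTILINE), text);
    }

    // Variation that accepts an optional \r before \n.
    static String crlfAware(String text) {
        return firstGroup(Pattern.compile("(.*)((\r?\n)|(.))"), text);
    }

    public static void main(String[] args) {
        String text = "PART1: 01/02/03\r\nFindings:no smoking";
        System.out.println(original(text));  // PART1: 01/02/0
        System.out.println(multiline(text)); // PART1: 01/02/03
        System.out.println(crlfAware(text)); // PART1: 01/02/03
    }
}
```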
Works, giving "PART1: 01/02/03 ". So my guess is that in the real code you read the text maybe with a Reader.readLine and erroneously strip a carriage return + linefeed. Far fetched but I cannot imagine otherwise. (readLine strips the newline itself.)