Java Regex - Remove Non-Alphanumeric characters except line breaks - java

I'm trying to remove all the non-alphanumeric characters from a String in Java but keep the carriage returns. I have the following regular expression, but it keeps joining words before and after a line break.
[^\\p{Alnum}\\s]
How would I be able to preserve the line breaks or convert them into spaces so that I don't have words joining?
An example of this issue is shown below:
Original Text
and refreshingly direct
when compared with the hand-waving of Swinburne.
After Replacement:
and refreshingly directwhen compared with the hand-waving of Swinburne.

You may add these chars to the regex, not \s, as \s matches any whitespace:
String reg = "[^\\p{Alnum}\n\r]";
Or, you may use character class subtraction:
String reg = "[\\P{Alnum}&&[^\n\r]]";
Here, \P{Alnum} matches any non-alphanumeric and &&[^\n\r] prevents a LF and CR from matching.
A Java test:
String s = "&&& Text\r\nNew line".replaceAll("[^\\p{Alnum}\n\r]+", "");
System.out.println(s);
// => Text
Newline
Note that there are more line break chars than LF and CR. In Java 8, \R construct matches any style linebreak and it matches \u000D\u000A|\[\u000A\u000B\u000C\u000D\u0085\u2028\u2029\].
So, to exclude matching any line breaks, you may use
String reg = "[^\\p{Alnum}\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]+";

You can use this regex [^A-Za-z0-9\\n\\r] for example :
String result = str.replaceAll("[^a-zA-Z0-9\\n\\r]", "");
Example
Input
aaze03.aze1654aze987 */-a*azeaze\n hello *-*/zeaze+64\nqsdoi
Output
aaze03aze1654aze987aazeaze
hellozeaze64
qsdoi

I made a mistake with my code. I was reading in a file line by line and building the String, but didn't add a space at the end of each line. Therefore there were no actual line breaks to replace.

That's a perfect case for Guava's CharMatcher:
String input = "and refreshingly direct\n\rwhen compared with the hand-waving of Swinburne.";
String output = CharMatcher.javaLetterOrDigit().or(CharMatcher.whitespace()).retainFrom(input);
Output will be:
and refreshingly direct
when compared with the handwaving of Swinburne

Related

Java Replace Unicode Characters in a String

I have a string which contains multiple unicode characters. I want to identify all these unicode characters, ex: \ uF06C, and replace it with a back slash and four hexa digits without "u" in it.
Example:
Source String: "add \uF06Cd1 Clause"
Result String: "add \F06Cd1 Clause"
How can achieve this in Java?
Edit:
Question in link Java Regex - How to replace a pattern or how to is different from this as my question deals with unicode character. Though it has multiple literals, it is considered as one single character by jvm and hence regex won't work.
The correct way to do this is using a regex to match the entire unicode definition and use group-replacement.
The regex to match the unicode-string:
A unicode-character looks like \uABCD, so \u, followed by a 4-character hexnumber string. Matching these can be done using
\\u[A-Fa-f\d]{4}
But there's a problem with this:
In a String like "just some \\uabcd arbitrary text" the \u would still get matched. So we need to make sure the \u is preceeded by an even number of \s:
(?<!\\)(\\\\)*\\u[A-Fa-f\d]{4}
Now as an output, we want a backslash followed by the hexnum-part. This can be done by group-replacement, so let's get start by grouping characters:
(?<!\\)(\\\\)*(\\u)([A-Fa-f\d]{4})
As a replacement we want all backlashes from the group that matches two backslashes, followed by a backslash and the hexnum-part of the unicode-literal:
$1\\$3
Now for the actual code:
String pattern = "(?<!\\\\)(\\\\\\\\)*(\\\\u)([A-Fa-f\\d]{4})";
String replace = "$1\\\\$3";
Matcher match = Pattern.compile(pattern).matcher(test);
String result = match.replaceAll(replace);
That's a lot of backslashes! Well, there's an issue with java, regex and backslash: backslashes need to be escaped in java and regex. So "\\\\" as a pattern-string in java matches one \ as regex-matched character.
EDIT:
On actual strings, the characters need to be filtered out and be replaced by their integer-representation:
StringBuilder sb = new StringBuilder();
for(char c : in.toCharArray())
if(c > 127)
sb.append("\\").append(String.format("%04x", (int) c));
else
sb.append(c);
This assumes by "unicode-character" you mean non-ASCII-characters. This code will print any ASCII-character as is and output all other characters as backslash followed by their unicode-code. The definition "unicode-character" is rather vague though, as char in java always represents unicode-characters. This approach preserves any control-chars like "\n", "\r", etc., which is why I chose it over other definitions.
Try using String.replaceAll() method
s = s.replaceAll("\u", "\");

How to remove the space at the begin of read Stream via BufferReader

I have try follow snippet to read file
mFileReader = new FileReader(mAutoUpgradeFile);
mBufferedReader = new BufferedReader(mFileReader);
mStringBuilder = new StringBuilder();
String line=null;
while( null != (line=mBufferedReader.readLine()) ){
mStringBuilder.append(line);
Log.d(TAG,"read line ="+line+"\n");
}
The content of mAutoUpgradeFile
aaaaa
bbbbbb
ccccccccc
dddddddddd
Output via readLine
read line =space aaaaa
read line = bbbbbb
update 2015-12-18 first time
print the ASCII of the first byte of line
It's same line below:
What's 65279 mean in this situation?
update 2015-12-18 Second time
65279 may be Introduced by UTF-8 + BOM format
change encoding format to UTF-8 + ! BOM format
This issue resolved
Thanks everybody.
String#trim() removes any leading and trailing whitespace from a string.
Returns a copy of the string, with leading and trailing whitespace omitted.
Besides you can use String#replaceFirst(String regex, String replacement)
Replaces the first substring of this string that matches the given regular expression with the given replacement.
And use a regex like :
line= line.replaceFirst("^ ", "");
Using String#trim() method or line.replaceAll("^\\s+|\\s+$", "") will trim the whitespace from both the end.
For left side trim:
line.replaceAll("^\\s+", "");
For right right trim:
line.replaceAll("\\s+$", "");
If the "xxx" trace prints out the first character read from the line, then you have a line which begins by the unicode char 65279 (which I don't know what represents), so it is not actually a blank nor whitespace character.
If you know by certain which character ranges you want to allow when reading a line, I recommend you to trim the leading ocurrences that slip out of that ranges.
For example: To trim all the leading characters out of A-Z, a-z, and 0-9:
line=line.replaceFirst("[^a-zA-Z0-9]*", "");

Java Regex Matcher not giving expected result

I have the following code.
String _partsPattern = "(.*)((\n\n)|(\n)|(.))";
static final Pattern partsPattern = Pattern.compile(_partsPattern);
String text= "PART1: 01/02/03\r\nFindings:no smoking";
Matcher match = partsPattern.matcher(text);
while (match.find()) {
System.out.println( match.group(1));
return; //I just care on the first match for this purpose
}
Output: PART1: 01/02/0
I was expecting PART1: 01/02/03 why is the 3 at the end of my text not matching in my result.
Problem with your regex is that . will not match line separators like \r or \n so your regex will stop before \r and since last part of your regex
(.*)((\n\n)|(\n)|(.))
^^^^^^^^^^^^^^^
is required and it can't match \r last character will be stored in (.).
If you don't want to include these line separators in your match just use "(.*)$"; pattern with Pattern.MULTILINE flag to make $ match end of each line (it will represent standard line separators like \r or \r\n or \n but will not include them in match).
So try with
String _partsPattern = "(.*)$"; //parenthesis are not required now
final Pattern partsPattern = Pattern.compile(_partsPattern,Pattern.MULTILINE);
Other approach would be changing your regex to something like (.*)((\r\n)|(\n)|(.)) or (.*)((\r?\n)|(.)) but I am not sure what would be the purpose of last (.) (I would probably remove it). It is just variation of your original regex.
Works, giving "PART1: 01/02/03 ". So my guess is that in the real code you read the text maybe with a Reader.readLine and erroneously strip a carriage return + linefeed. Far fetched but I cannot imagine otherwise. (readLine strips the newline itself.)

Regex required to update a character

I have a String : testing<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing
I want to replace the character s with some other character sequence suppose : <b>X</b> but i want the character sequence s to remain intact i.e. regex should not update the character s with a previous character as "<".
I used the JAVA code :
String str = testing<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing;
str = str.replace("s[^<]", "<b>X</b>");
The problem is that the regex would match 2 characters, s and following character if it is not ">" and Sting.replace would replace both the characters. I want only s to be replaced and not the following character.
Any help would be appreciated. Since i have lots of such replacements i don't want to use a loop matching each character and updating it sequentially.
There are other ways, but you could, for example, capture the second character and put it back:
str = str.replaceAll("s([^<])", "<b>X\\1</b>");
Looks like you want a negative lookahead:
s(?!<)
String str = "testing<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing;";
System.out.println(str.replaceAll("s(?!<)", "<b>X</b>"));
output:
te<b>X</b>ting<b>s<b>tringwit<b>h</b>nomean<b>s</b>ing;
Use look arounds to assert, but not capture, surrounding text:
str = str.replaceAll("s(?![^<]))", "whatever");
Or, capture and put back using a back reference $1:
str = str.replaceAll("s([^<])", "whatever$1");
Note that you need to use replaceAll() (which use regex), rather than replace() (which uses plain text).

How to replace a special character with single slash

I have a question about strings in Java. Let's say, I have a string like so:
String str = "The . startup trace ?state is info?";
As the string contains the special character like "?" I need the string to be replaced with "\?" as per my requirement. How do I replace special characters with "\"? I tried the following way.
str.replace("?","\?");
But it gives a compilation error. Then I tried the following:
str.replace("?","\\?");
When I do this it replaces the special characters with "\\". But when I print the string, it prints with single slash. I thought it is taking single slash only but when I debugged I found that the variable is taking "\\".
Can anyone suggest how to replace the special characters with single slash ("\")?
On escape sequences
A declaration like:
String s = "\\";
defines a string containing a single backslash. That is, s.length() == 1.
This is because \ is a Java escape character for String and char literals. Here are some other examples:
"\n" is a String of length 1 containing the newline character
"\t" is a String of length 1 containing the tab character
"\"" is a String of length 1 containing the double quote character
"\/" contains an invalid escape sequence, and therefore is not a valid String literal
it causes compilation error
Naturally you can combine escape sequences with normal unescaped characters in a String literal:
System.out.println("\"Hey\\\nHow\tare you?");
The above prints (tab spacing may vary):
"Hey\
How are you?
References
JLS 3.10.6 Escape Sequences for Character and String Literals
See also
Is the char literal '\"' the same as '"' ?(backslash-doublequote vs only-doublequote)
Back to the problem
Your problem definition is very vague, but the following snippet works as it should:
System.out.println("How are you? Really??? Awesome!".replace("?", "\\?"));
The above snippet replaces ? with \?, and thus prints:
How are you\? Really\?\?\? Awesome!
If instead you want to replace a char with another char, then there's also an overload for that:
System.out.println("How are you? Really??? Awesome!".replace('?', '\\'));
The above snippet replaces ? with \, and thus prints:
How are you\ Really\\\ Awesome!
String API links
replace(CharSequence target, CharSequence replacement)
Replaces each substring of this string that matches the literal target sequence with the specified literal replacement sequence.
replace(char oldChar, char newChar)
Returns a new string resulting from replacing all occurrences of oldChar in this string with newChar.
On how regex complicates things
If you're using replaceAll or any other regex-based methods, then things becomes somewhat more complicated. It can be greatly simplified if you understand some basic rules.
Regex patterns in Java is given as String values
Metacharacters (such as ? and .) have special meanings, and may need to be escaped by preceding with a backslash to be matched literally
The backslash is also a special character in replacement String values
The above factors can lead to the need for numerous backslashes in patterns and replacement strings in a Java source code.
It doesn't look like you need regex for this problem, but here's a simple example to show what it can do:
System.out.println(
"Who you gonna call? GHOSTBUSTERS!!!"
.replaceAll("[?!]+", "<$0>")
);
The above prints:
Who you gonna call<?> GHOSTBUSTERS<!!!>
The pattern [?!]+ matches one-or-more (+) of any characters in the character class [...] definition (which contains a ? and ! in this case). The replacement string <$0> essentially puts the entire match $0 within angled brackets.
Related questions
Having trouble with Splitting text. - discusses common mistakes like split(".") and split("|")
Regular expressions references
regular-expressions.info
Character class and Repetition with Star and Plus
java.util.regex.Pattern and Matcher
In case you want to replace ? with \?, there are 2 possibilities: replace and replaceAll (for regular expressions):
str.replace("?", "\\?")
str.replaceAll("\\?","\\\\?");
The result is "The . startup trace \?state is info\?"
If you want to replace ? with \, just remove the ? character from the second argument.
But when I print the string, it prints
with single slash.
Good. That's exactly what you want, isn't it?
There are two simple rules:
A backslash inside a String literal has to be specified as two to satisfy the compiler, i.e. "\". Otherwise it is taken as a special-character escape.
A backslash in a regular expresion has to be specified as two to satisfy regex, otherwise it is taken as a regex escape. Because of (1) this means you have to write 2x2=4 of them:"\\\\" (and because of the forum software I actually had to write 8!).
String str="\\";
str=str.replace(str,"\\\\");
System.out.println("New String="+str);
Out put:- New String=\
In java "\\" treat as "\". So, the above code replace a "\" single slash into "\\".

Categories