I need to strip out a few invalid characters from a string and wrote the following code as part of a StringUtil library:
public static String removeBlockedCharacters(String data) {
    if (data == null) {
        return data;
    }
    return data.replaceAll("(?i)[<|>|\u003C|\u003E]", "");
}
I have a test file illegalCharacters.txt with one line in it:
hello \u003c here < and > there
I run the following unit test:
@Test
public void testBlockedCharactersRemoval() throws IOException {
    checkEquals(StringUtil.removeBlockedCharacters("a < b > c\u003e\u003E\u003c\u003C"), "a b c");
    log.info("Procesing from string directly: " + StringUtil.removeBlockedCharacters("hello \u003c here < and > there"));
    log.info("Procesing from file to string: " + StringUtil.removeBlockedCharacters(FileUtils.readFileToString(new File("src/test/resources/illegalCharacters.txt"))));
}
I get:
INFO - 2010-09-14 13:37:36,111 - TestStringUtil.testBlockedCharactersRemoval(36) | Procesing from string directly: hello here and there
INFO - 2010-09-14 13:37:36,126 - TestStringUtil.testBlockedCharactersRemoval(37) | Procesing from file to string: hello \u003c here and there
I am VERY confused: as you can see, the code properly strips out the '<', '>', and '\u003c' if I pass a string containing these values but it fails to strip out '\u003c' if I read from a file containing the same string.
My questions, so that I stop losing hair over it, are:
Why do I get this behavior?
How can I change my code to properly strip \u003c in all cases?
Thanks
hello \u003c here < and > there
The \u003c in an ASCII file won't do it; you need to put the actual Unicode character in a Unicode-encoded text file.
When you compile your source file, the very first thing that happens--before any lexing or parsing--is that the Unicode escapes, \u003C and \u003E, get converted to the actual characters, < and >. So your code is really:
return data.replaceAll("(?i)[<|>|<|>]", "");
When you compile the code for the test against the string literal, the same thing happens; the test string that you wrote as:
"a < b > c\u003e\u003E\u003c\u003C"
...is really:
"a < b > c>><<"
But when you read the test string from a file, no such conversion occurs; you end up trying to match the six-character sequence \u003c with the single character, <. If you really want to match \u003C and \u003E, your code should look like this:
return data.replaceAll("(?i)(?:<|>|\\\\u003C|\\\\u003E)", "");
If you use one backslash, the Java compiler interprets it as a Unicode escape and converts it to < or >.
If you use two backslashes, the regex compiler interprets it as a Unicode escape and thinks you want to match a < or >.
If you use three backslashes, the Java compiler turns it into \< or \>, the regex compiler ignores the backslash, and it tries to match < or >.
So, to match a raw Unicode escape sequence, you have to use four backslashes to match the one backslash in the escape sequence.
Notice that I changed your brackets, too. [<|>] is a character class that matches <, | or >; what you want is an alternation.
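To see the two kinds of input side by side, here is a minimal sketch. The class name BlockedCharsSketch is made up for illustration; the pattern is the one from the corrected code above, just compiled once up front.

```java
import java.util.regex.Pattern;

class BlockedCharsSketch {
    // Matches a literal < or >, or the six-character text of their Unicode
    // escapes (backslash, u, 0, 0, 3, C or E) as it would appear in a file.
    private static final Pattern BLOCKED =
            Pattern.compile("(?i)(?:<|>|\\\\u003C|\\\\u003E)");

    static String removeBlockedCharacters(String data) {
        if (data == null) {
            return null;
        }
        return BLOCKED.matcher(data).replaceAll("");
    }

    public static void main(String[] args) {
        // The doubled backslash stops the compiler from translating the escape,
        // so this string holds the same six characters a text file would.
        String asReadFromFile = "hello \\u003c here < and > there";
        System.out.println(removeBlockedCharacters(asReadFromFile));
    }
}
```

Note that only the blocked characters are removed; the spaces around them stay, so the result contains double spaces.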
Looks to me that the problem isn't with your escaping, but with the fact that you have unicode data you're trying to parse.
Have you tried using the two argument version of readFileToString, replacing your readFileToString(File) call with readFileToString(File, Encoding)?
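For what it's worth, here is a rough sketch of reading a file with an explicit encoding. It uses java.nio from the standard library rather than Commons IO so that it runs on its own; the class and method names are made up for illustration. It also shows that decoding bytes with a charset does not interpret escape text: a backslash-u sequence stored in the file comes back as the same six characters.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetReadSketch {
    // Reads the whole file, decoding its bytes as UTF-8 regardless of the platform default.
    static String readUtf8(Path file) {
        try {
            return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes the text to a temporary file as UTF-8 and reads it back.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("charset-sketch", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return readUtf8(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String back = roundTrip("hello \\u003c here");
        // The backslash is still there: no escape processing happened while reading.
        System.out.println(back.contains("\\"));
    }
}
```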
I have my data which needs to be cleaned up before further processing in various other applications. In this process one of the downstream applications only allows a certain range of Unicode characters. The following is the regex I'm using to strip out the invalid Unicode characters.
/[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/
However, I'm still having issues getting the regex to work in Java. Is there a special way to treat the above regex, since it contains a range of Unicode characters?
UPDATE:
This is how I tested it; it didn't seem to work the way @Andreas suggested:
public void testStripUnicode() {
String doc = "{\"fields\":{\"field1\":\"unicode char '\\u000b'\",\"field2\":[\"unicode char '\\u0003'\"]}}";
String stripped = DocumentCleaner.clean(doc);
System.out.println(doc);
System.out.println(stripped);
}
doc
{"fields":{"field1":"unicode char '\u000b'","field2":["unicode char '\u0003'"]}}
stripped-doc
{"fields":{"field1":"unicode char '\u000b'","field2":["unicode char '\u0003'"]}}
Should be fine, just drop the slashes / and double the backslashes \:
String regex = "[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]";
String stripped = value.replaceAll(regex, "");
Or if you do it repeatedly, you can parse the regular expression once, up front:
// Prepare regular expression
Pattern p = Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]");
// Use regular expression
String stripped = p.matcher(value).replaceAll("");
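A small sketch of the precompiled version in use. The class name is made up for illustration. It also hints at one possible explanation for the UPDATE in the question: the pattern removes a real control character, but the six-character text of an escape (which is what a doubled backslash in a Java literal produces) consists of ordinary allowed characters and is left alone.

```java
import java.util.regex.Pattern;

class InvalidCharStripSketch {
    // Keeps tab, LF, CR and the two allowed ranges; every other character is removed.
    private static final Pattern INVALID =
            Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]");

    static String strip(String value) {
        return INVALID.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        // A real vertical-tab character (U+000B) is removed.
        System.out.println(strip("a" + (char) 0x0B + "b"));
        // The text "backslash, u, 0, 0, 0, b" is six ordinary characters and stays.
        System.out.println(strip("a\\u000bb"));
    }
}
```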
Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.
Since Java 1.5, yes:
Pattern.quote("$5");
The difference between Pattern.quote and Matcher.quoteReplacement was not clear to me until I saw the following example:
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));
It may be too late to respond, but you can also use Pattern.LITERAL, which would ignore all special characters while formatting:
Pattern.compile(textToFormat, Pattern.LITERAL);
I think what you're after is \Q$5\E. Also see Pattern.quote(s) introduced in Java5.
See Pattern javadoc for details.
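To compare the options mentioned so far, here is a minimal sketch; the class and method names are made up for illustration. It searches for the user text "$5" with Pattern.quote, with the Pattern.LITERAL flag, and unquoted.

```java
import java.util.regex.Pattern;

class LiteralMatchSketch {
    // Quotes the user text so every character is matched literally.
    static boolean findWithQuote(String haystack, String userText) {
        return Pattern.compile(Pattern.quote(userText)).matcher(haystack).find();
    }

    // Same effect using the LITERAL flag instead of quoting.
    static boolean findWithLiteralFlag(String haystack, String userText) {
        return Pattern.compile(userText, Pattern.LITERAL).matcher(haystack).find();
    }

    // Unquoted, for comparison: "$5" then means "end of input followed by a 5".
    static boolean findUnquoted(String haystack, String regex) {
        return Pattern.compile(regex).matcher(haystack).find();
    }

    public static void main(String[] args) {
        String text = "it costs $5 today";
        System.out.println(findWithQuote(text, "$5"));
        System.out.println(findWithLiteralFlag(text, "$5"));
        System.out.println(findUnquoted(text, "$5"));
    }
}
```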
First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, then it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
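A tiny sketch of what group references in the replacement text do, using a made-up example that swaps two words:

```java
class GroupReferenceSketch {
    // In the replacement, $1 and $2 refer to the first and second capturing groups
    // of the search pattern, so the two words change places.
    static String swapWords(String twoWords) {
        return twoWords.replaceAll("(\\w+) (\\w+)", "$2 $1");
    }

    public static void main(String[] args) {
        System.out.println(swapWords("John Smith"));
    }
}
```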
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
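Here is a self-contained sketch of the before and after. The class and method names and the sample message are assumptions for illustration; the two replaceAll() calls follow the ones above, with the optional backslash before the slash left out.

```java
import java.util.regex.Matcher;

class QuoteReplacementSketch {
    // Safe version: the user input is quoted, so $ and \ are inserted literally.
    static String fillQuoted(String msg, String userInput) {
        return msg.replaceAll("<userInput />", Matcher.quoteReplacement(userInput));
    }

    // Returns true if the unquoted call fails because the user input contains
    // a reference to a group that the search pattern does not have.
    static boolean unquotedFails(String msg, String userInput) {
        try {
            msg.replaceAll("<userInput />", userInput);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String msg = "Total: <userInput /> due";
        System.out.println(unquotedFails(msg, "$3.50"));
        System.out.println(fillQuoted(msg, "$3.50"));
    }
}
```

If the search text is also a plain literal, String.replace(CharSequence, CharSequence) sidesteps both the pattern and the replacement syntax.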
To build a protected pattern, you can put a backslash in front of every character except digits and letters (the replacement string for that is "\\\\$1"). After that you can substitute your own special symbols into the protected pattern, so that it works as a real pattern of your own design rather than as plain quoted text, and without any special symbols supplied by the user.
public class Test {
    public static void main(String[] args) {
        String str = "y z (111)";
        String p1 = "x x (111)";
        String p2 = ".* .* \\(111\\)";
        p1 = escapeRE(p1);
        p1 = p1.replace("x", ".*");
        System.out.println( p1 + "-->" + str.matches(p1) );
        //.*\ .*\ \(111\)-->true
        System.out.println( p2 + "-->" + str.matches(p2) );
        //.* .* \(111\)-->true
    }
    public static String escapeRE(String str) {
        //Pattern escaper = Pattern.compile("([^a-zA-z0-9])");
        //return escaper.matcher(str).replaceAll("\\\\$1");
        return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
    }
}
Pattern.quote("blabla") works nicely.
Pattern.quote() works nicely. It encloses the text in "\Q" and "\E", and it also handles any "\E" that already appears in the text.
However, if you need to do a real regular expression escaping(or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
This prints: Some/s/wText\*/\,\*\*
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: "+ Pattern.quote(someText));
System.out.println("Full escape: "+someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
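As a rough check of both approaches, the sketch below verifies that the escaped form of a string still matches the original string. The class and method names are made up, and the character class (one escaped backslash, and \s for whitespace) is my assumption about what the full-escape code is meant to cover.

```java
import java.util.regex.Pattern;

class EscapeComparisonSketch {
    // Puts a backslash in front of each regex metacharacter, comma and whitespace character.
    static String escapeEach(String text) {
        return text.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0");
    }

    // True if the Pattern.quote form of the text matches the text itself.
    static boolean quoteMatches(String text) {
        return Pattern.matches(Pattern.quote(text), text);
    }

    // True if the character-by-character escaped form matches the text itself.
    static boolean escapeMatches(String text) {
        return Pattern.matches(escapeEach(text), text);
    }

    public static void main(String[] args) {
        String someText = "Some\\E/s/wText*/,**";
        System.out.println("Pattern.quote: " + Pattern.quote(someText));
        System.out.println("Full escape: " + escapeEach(someText));
        System.out.println(quoteMatches(someText) && escapeMatches(someText));
    }
}
```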
^(Negation) symbol is used to match something that is not in the character group.
I need to validate a Java string that we send to commzgate (a third party) as an SMS. Our SMS fails when the string contains some invalid or non-readable characters, so I need to check for them first. Basically, I need a regular expression check in Java that tells me whether my string contains any of the following characters:
€ [ \ ] ^ { | } ~
Any suggestions? Also, when I try to put these characters in my Java file, Eclipse refuses to save it and shows a non-UTF-8 character alert, so every time I have to remove the € and save. Does that mean my validation is incomplete?
Thanks
You appear to have included an answer in your question - you can use a regular expression to check for these characters and replace/remove them as required. As you are hitting a problem with Eclipse on a non-UTF-8 character (the Euro symbol) you could instead use the unicode character U+20AC (which should be it).
The regexp is /[€\[\\\]^{|}~]/g
A simpler pattern is possible, but this one works.
For example in js
var a = "€[\\]{}^|~"
var reg = /[€\[\\\]^{|}~]/g
a.replace(reg, "") //output ""
Use simply:
^[^€\\[\\]\\^{|}~\\\\]*$
// €  [  ]  ^{|}~   \ < literals
Which matches the start and end of string, any characters (or none) provided they aren't in the (escaped) character class. The ^ inside the character class here indicates that it should not match.
Note the doubled backslashes: inside a Java string literal, each backslash that the regex needs must itself be escaped.
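Putting this together with the earlier advice about the Euro sign, a check in Java could be sketched as follows. The class and method names are made up for illustration. The Euro sign is written as the escape \u20AC in the string literal, so the source file itself can stay plain ASCII.

```java
import java.util.regex.Pattern;

class SmsCharCheckSketch {
    // Character class containing the euro sign, [ \ ] ^ { | } and ~.
    // The euro sign is written as a Unicode escape so the source file can stay ASCII.
    private static final Pattern BLOCKED =
            Pattern.compile("[\u20AC\\[\\\\\\]^{|}~]");

    // True if the text contains at least one of the blocked characters.
    static boolean containsBlocked(String text) {
        return BLOCKED.matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsBlocked("hello world"));
        System.out.println(containsBlocked("array[0]"));
    }
}
```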
I have created a text file on a Windows system, where I think the default encoding is ANSI, and the contents of the file look like this:
This is\u2019 a sample text file \u2014and it can ....
I saved this file using the default Windows encoding, although other encodings such as UTF-8 and UTF-16 were also available.
Now I want to write a simple Java function that takes an input string and replaces each of these Unicode characters with a corresponding ASCII character.
e.g :- \u2019 should be replaced with "'"
\u2014 should be replaced with "-" and so on.
Observation :
When i created a string literal like this
String s = "This is\u2019 a sample text file \u2014and it can ....";
My code works fine, but when I read the text from the file it does not work. I am aware that Java Strings use UTF-16 encoding.
Below is the code that I am using to read the input file.
FileReader fileReader = new FileReader(new File("C:\\input.txt"));
BufferedReader bufferedReader = new BufferedReader(fileReader);
String record = bufferedReader.readLine();
I also tried using the InputStream and setting the Charset to UTF-8 , but still the same result.
Replacement code :
public static String removeUTFCharacters(String data){
    for(Entry<String,String> entry : utfChars.entrySet()){
        data=data.replaceAll(entry.getKey(), entry.getValue());
    }
    return data;
}
Map :
utfChars.put("\u2019","'");
utfChars.put("\u2018","'");
utfChars.put("\u201c","\"");
utfChars.put("\u201d","\"");
utfChars.put("\u2013","-");
utfChars.put("\u2014","-");
utfChars.put("\u2212","-");
utfChars.put("\u2022","*");
Can anybody help me in understanding the concept and solution to this problem.
Match the escape sequence \uXXXX with a regular expression. Then use a replacement loop to replace each occurrence of that escape sequence with the decoded value of the character.
Because Java string literals use \ to introduce escapes, the sequence \\ is used to represent \. Also, the Java regex syntax treats the sequence \u specially (to represent a Unicode escape). So the \ has to be escaped again, with an additional \\. So, in the pattern, "\\\\u" really means, "match \u in the input."
To match the numeric portion, four hexadecimal characters, use the pattern \p{XDigit}, escaping the \ with an extra \. We want to easily extract the hex number as a group, so it is enclosed in parentheses to create a capturing group. Thus, "(\\p{XDigit}{4})" in the pattern means, "match 4 hexadecimal characters in the input, and capture them."
In a loop, we search for occurrences of the pattern, replacing each occurrence with the decoded character value. The character value is decoded by parsing the hexadecimal number. Integer.parseInt(m.group(1), 16) means, "parse the group captured in the previous match as a base-16 number." Then a replacement string is created with that character. The replacement string must be escaped, or quoted, in case it is $, which has special meaning in replacement text.
String data = "This is\\u2019 a sample text file \\u2014and it can ...";
Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher m = p.matcher(data);
StringBuffer buf = new StringBuffer(data.length());
while (m.find()) {
    String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
    m.appendReplacement(buf, Matcher.quoteReplacement(ch));
}
m.appendTail(buf);
System.out.println(buf);
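On Java 9 or later, the same loop can be written more compactly with Matcher.replaceAll(Function). The sketch below is one way to combine it with the ASCII mapping from the question; the class and method names are made up, and only a few of the map entries are included.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapeDecodeSketch {
    // A backslash, the letter u, and four hex digits captured as group 1.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // Step 1: turn each escape sequence in the text into the character it names.
    static String decode(String data) {
        return ESCAPE.matcher(data).replaceAll(
                m -> Matcher.quoteReplacement(
                        String.valueOf((char) Integer.parseInt(m.group(1), 16))));
    }

    // Step 2: map a few typographic characters to ASCII (a subset of the question's map).
    static String toAscii(String data) {
        return data.replace('\u2019', '\'')
                   .replace('\u2018', '\'')
                   .replace('\u2014', '-')
                   .replace('\u2013', '-');
    }

    public static void main(String[] args) {
        String fromFile = "This is\\u2019 a sample text file \\u2014and it can ...";
        System.out.println(toAscii(decode(fromFile)));
    }
}
```

Using String.replace here rather than replaceAll keeps the second step free of regex syntax altogether.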
If you can use another library, you can use Apache Commons Text:
https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html
String dirtyString = "Colocaci\\u00F3n";
String cleanString = StringEscapeUtils.unescapeJava(dirtyString);
//cleanString = "Colocación"
As a bit of background to the problem: I load a text file and assign a phrase from it to randomPirateWord, then replace the letters of that phrase with *'s, and that works correctly. However, when I ask the user to guess a letter, it doesn't work correctly: if they guess incorrectly the code works fine, but if they guess a letter correctly the code fails. I have put the error message below the code:
if (!escape.equalsIgnoreCase("m")){
System.out.print(" Type the letter you want to guess: ");
char letter = input.nextLine().charAt(0);
if(m.getRandomPirateWord().contains(letter+"")){
System.out.println(m.getRandomPirateWord().replaceAll("*",letter+""));
}
Error message:
Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*
^
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.sequence(Pattern.java:2090)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.lang.String.replaceAll(String.java:2162)
at uk.ac.aber.dcs.pirate_hangman.TextBasedGame.runTextBasedGame(TextBasedGame.java:45)
at uk.ac.aber.dcs.pirate_hangman.Application.runApplication(Application.java:19)
at uk.ac.aber.dcs.pirate_hangman.Main.main(Main.java:6)
Use String#replace() instead of String#replaceAll(). The latter uses a regex pattern for the replacement, where * is a metacharacter and needs to be escaped.
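A short sketch of the two working options next to the failing call. The class and method names and the sample strings are made up for illustration.

```java
import java.util.regex.PatternSyntaxException;

class LiteralStarSketch {
    // replace() treats "*" as plain text, so no escaping is needed.
    static String viaReplace(String masked, char letter) {
        return masked.replace("*", String.valueOf(letter));
    }

    // replaceAll() takes a regex, so the star has to be escaped.
    static String viaEscapedRegex(String masked, char letter) {
        return masked.replaceAll("\\*", String.valueOf(letter));
    }

    // Returns true if a bare "*" is rejected as a regex, as in the stack trace above.
    static boolean rawStarIsRejected() {
        try {
            "***".replaceAll("*", "a");
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(viaReplace("** **", 'a'));
        System.out.println(viaEscapedRegex("** **", 'a'));
        System.out.println(rawStarIsRejected());
    }
}
```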
Use the following. You have to escape the * character, since the replaceAll() method takes a regular expression as its first argument:
replaceAll("\\*",letter+"")
Check out below Example.
import java.io.*;
public class Test {
    public static void main(String args[]) {
        String Str = new String("Welcome to Tutorialspoint.com");
        System.out.print("Return Value :");
        System.out.println(Str.replaceAll("(.*)Tutorials(.*)",
                "AMROOD"));
    }
}
Syntax of method :
public String replaceAll(String regex, String replacement)
regex -- the regular expression to which this string is to be matched.
replacement -- the string which would replace found expression.
OUTPUT: Return Value :AMROOD
Reference : Regex replacement
The problem is that * is a reserved character in regexes, so you need to escape it.
replaceAll("\\*",letter+"")
In a regular expression, which is the first parameter of the ‘replaceAll’ and ‘replaceFirst’ methods, the ‘*’ symbol is a quantifier metacharacter (zero or more of the preceding element), so it cannot stand on its own at the start of a pattern.
1. Using the ‘replace’ method: This is the better choice if you want to replace a string literal and not a pattern.
2. Escaping the ‘*’ symbol: If you do need a regular expression, quote the literal part of the pattern as shown below. (Matcher.quoteReplacement only protects ‘$’ and ‘\’ in the replacement string; it does not help with the pattern.)
String pattern = java.util.regex.Pattern.quote("*");