Unicode Replacement with ASCII

Unicode Replacement with ASCII - java

I have created a text file on windows system where I think default encoding style is ANSI and contents of the file looks like this :
This is\u2019 a sample text file \u2014and it can ....
I saved this file using the default encoding style of windows though there were encoding styles were also available like UTF-8,UTF-16 etc.
Now I want to write a simple java function where I will pass some input string and replace all of the unicodes with the corresponding ascii value.
e.g :- \u2019 should be replaced with "'"
\u2014 should be replaced with "-" and so on.
Observation :
When i created a string literal like this
String s = "This is\u2019 a sample text file \u2014and it can ....";
My code is working fine , but when I am reading it from the file it is not working. I am aware that in Java String uses UTF-16 encoding .
Below is the code that I am using to read the input file.
FileReader fileReader = new FileReader(new File("C:\\input.txt"));
BufferedReader bufferedReader = new BufferedReader(fileReader)
String record = bufferedReader.readLine();
I also tried using the InputStream and setting the Charset to UTF-8 , but still the same result.
Replacement code :
public static String removeUTFCharacters(String data){
for(Entry<String,String> entry : utfChars.entrySet()){
data=data.replaceAll(entry.getKey(), entry.getValue());
}
return data;
}
Map :
utfChars.put("\u2019","'");
utfChars.put("\u2018","'");
utfChars.put("\u201c","\"");
utfChars.put("\u201d","\"");
utfChars.put("\u2013","-");
utfChars.put("\u2014","-");
utfChars.put("\u2212","-");
utfChars.put("\u2022","*");
Can anybody help me in understanding the concept and solution to this problem.

Match the escape sequence \uXXXX with a regular expression. Then use a replacement loop to replace each occurrence of that escape sequence with the decoded value of the character.
Because Java string literals use \ to introduce escapes, the sequence \\ is used to represent \. Also, the Java regex syntax treats the sequence \u specially (to represent a Unicode escape). So the \ has to be escaped again, with an additonal \\. So, in the pattern, "\\\\u" really means, "match \u in the input."
To match the numeric portion, four hexadecimal characters, use the pattern \p{XDigit}, escaping the \ with an extra \. We want to easily extract the hex number as a group, so it is enclosed in parentheses to create a capturing group. Thus, "(\\p{XDigit}{4})" in the pattern means, "match 4 hexadecimal characters in the input, and capture them."
In a loop, we search for occurrences of the pattern, replacing each occurrence with the decoded character value. The character value is decoded by parsing the hexadecimal number. Integer.parseInt(m.group(1), 16) means, "parse the group captured in the previous match as a base-16 number." Then a replacement string is created with that character. The replacement string must be escaped, or quoted, in case it is $, which has special meaning in replacement text.
String data = "This is\\u2019 a sample text file \\u2014and it can ...";
Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher m = p.matcher(data);
StringBuffer buf = new StringBuffer(data.length());
while (m.find()) {
String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
m.appendReplacement(buf, Matcher.quoteReplacement(ch));
}
m.appendTail(buf);
System.out.println(buf);

If you can use another library, you can use apache commons
https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html
String dirtyString = "Colocaci\u00F3n";
String cleanString = StringEscapeUtils.unescapeJava(dirtyString);
//cleanString = "Colocación"

Related

how to get rid of #011 characters in java

I am getting this string while I got the content over JMSQ. While printing I see the following line. I see those are vertical tab characters in XML. But how should I get rid of them.
#011#011#011<xeh:eid>dljfl</xeh:eid>
I have tried
replaceAll("[\\x0B]", "");
but it's not working.

Just do this:
String a = "#011#011#011<xeh:eid>dljfl</xeh:eid>";
String a_wo_vt_chars = a.replaceAll("#011", "");

"#011#011#011<xeh:eid>dljfl</xeh:eid>".replaceAll("#011", "") works fine, results in <xeh:eid>dljfl</xeh:eid>
According to the Pattern javadoc, \xhh stands for "the character with hexadecimal value 0xhh". But I guess in your string literal, #011 is just literal characters.
If I try to replicate the vertical tab in a string literal, it works with \\x0B:
"\u000b\u000b\u000b<xeh:eid>dljfl</xeh:eid>".replaceAll("\\x0B", "")
But maybe we are reading it wrong. While #0B is 11, #11 might be 17...

When #011 represents the hexvalue for the char you can use
a.replaceAll("\\u0011", "");
// or
a.replaceAll("\\x11", "");
But if #011 represents the octal value the use
a.replaceAll("\\011", "")
Also see Unicode Regular Expressions

Unable to strip invalid unicode characters java

I have my data which needs to be cleaned up before further processing in various other applications. In this process one of the downstream applications only allows a certain range of Unicode characters. The following is the regex I'm using to strip out the invalid Unicode characters.
/[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/
However, I'm still having issues getting the regex to work in Java. Is there a special way to treat the above regex, since it contains a range of Unicode characters?
UPDATE:
This is how I tested the same and didn't seem to get it to work with the way suggested by #Andreas :
public void testStripUnicode() {
String doc = "{\"fields\":{\"field1\":\"unicode char '\\u000b'\",\"field2\":[\"unicode char '\\u0003'\"]}}";
String stripped = DocumentCleaner.clean(doc);
System.out.println(doc);
System.out.println(stripped);
}
doc
{"fields":{"field1":"unicode char '\u000b'","field2":["unicode char '\u0003'"]}}
stripped-doc
{"fields":{"field1":"unicode char '\u000b'","field2":["unicode char '\u0003'"]}}

Should be fine, just drop the slashes / and double the backslashes \:
String regex = "[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]";
String stripped = value.replaceAll(regex, "");
Or if you do it repeatedly, you can parse the regular expression once, up front:
// Prepare regular expression
Pattern p = Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]");
// Use regular expression
String stripped = p.matcher(value).replaceAll("");

Escape the regex character from a Searching String [duplicate]

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.

Since Java 1.5, yes:
Pattern.quote("$5");

Difference between Pattern.quote and Matcher.quoteReplacement was not clear to me before I saw following example
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));

It may be too late to respond, but you can also use Pattern.LITERAL, which would ignore all special characters while formatting:
Pattern.compile(textToFormat, Pattern.LITERAL);

I think what you're after is \Q$5\E. Also see Pattern.quote(s) introduced in Java5.
See Pattern javadoc for details.

First off, if
you use replaceAll()
you DON'T use Matcher.quoteReplacement()
the text to be substituted in includes a $1
it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stracktrace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.

To have protected pattern you may replace all symbols with "\\\\", except digits and letters. And after that you can put in that protected pattern your special symbols to make this pattern working not like stupid quoted text, but really like a patten, but your own. Without user special symbols.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}

Pattern.quote("blabla") works nicely.
The Pattern.quote() works nicely. It encloses the sentence with the characters "\Q" and "\E", and if it does escape "\Q" and "\E".
However, if you need to do a real regular expression escaping(or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\\\\\^$|#\\\\s]", "\\\\$0"));
This method returns: Some/\s/wText*/\,**
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: "+ Pattern.quote(someText));
System.out.println("Full escape: "+someText.replaceAll("[-\\[\\]{}()*+?.,\\\\\\\\^$|#\\\\s]", "\\\\$0"));

^(Negation) symbol is used to match something that is not in the character group.
This is the link to Regular Expressions
Here is the image info about negation:

Hard time with escape character

I need to strip out a few invalid characters from a string and wrote the following code part of a StringUtil library:
public static String removeBlockedCharacters(String data) {
if (data==null) {
return data;
}
return data.replaceAll("(?i)[<|>|\u003C|\u003E]", "");
}
I have a test file illegalCharacter.txt with one line in it:
hello \u003c here < and > there
I run the following unit test:
#Test
public void testBlockedCharactersRemoval() throws IOException{
checkEquals(StringUtil.removeBlockedCharacters("a < b > c\u003e\u003E\u003c\u003C"), "a b c");
log.info("Procesing from string directly: " + StringUtil.removeBlockedCharacters("hello \u003c here < and > there"));
log.info("Procesing from file to string: " + StringUtil.removeBlockedCharacters(FileUtils.readFileToString(new File("src/test/resources/illegalCharacters.txt"))));
}
I get:
INFO - 2010-09-14 13:37:36,111 - TestStringUtil.testBlockedCharactersRemoval(36) | Procesing from string directly: hello here and there
INFO - 2010-09-14 13:37:36,126 - TestStringUtil.testBlockedCharactersRemoval(37) | Procesing from file to string: hello \u003c here and there
I am VERY confused: as you can see, the code properly strips out the '<', '>', and '\u003c' if I pass a string containing these values but it fails to strip out '\u003c' if I read from a file containing the same string.
My questions, so that I stop loosing hair over it, are:
Why do I get this behavior?
How can I change my code to properly strip \u003c in all occasions?
Thanks

hello \u003c here < and > there
the \u003c in an ASCII file won't do it, you need to put the actual Unicode character in a Unicode encoded text file.

When you compile your source file, the very first thing that happens--before any lexing or parsing--is that the Unicode escapes, \u003C and \u003E, get converted to the actual characters, < and >. So your code is really:
return data.replaceAll("(?i)[<|>|<|>]", "");
When you compile the code for the test against the string literal, the same thing happens; the test string that you wrote as:
"a < b > c\u003e\u003E\u003c\u003C"
...is really:
"a < b > c>><<"
But when you read the test string from a file, no such conversion occurs; you end up trying to match the six-character sequence \u003c with the single character, <. If you really want to match \u003C and \u003E, your code should look like this:
return data.replaceAll("(?i)(?:<|>|\\\\u003C|\\\\u003E)", "");
If you use one backslash, the Java compiler interprets it as a Unicode escape and converts it to < or >.
If you use two backslashes, the regex compiler interprets it as a Unicode escape and thinks you want to match a < or >.
If you use three backslashes, the Java compiler turns it into \< or \>, the regex compiler ignores the backslash, and it tries to match < or >.
So, to match a raw Unicode escape sequence, you have to use four backslashes to match the one backslash in the escape sequence.
Notice that I changed your brackets, too. [<|>] is a character class that matches <, | or >; what you want is an alternation.

Looks to me that the problem isn't with your escaping, but with the fact that you have unicode data you're trying to parse.
Have you tried using the two argument version of readFileToString, replacing your readFileToString(File) call with readFileToString(File, Encoding)?
Resources
FileUtils

How to escape text for regular expression in Java?

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.

Since Java 1.5, yes:
Pattern.quote("$5");

Difference between Pattern.quote and Matcher.quoteReplacement was not clear to me before I saw following example
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));

It may be too late to respond, but you can also use Pattern.LITERAL, which would ignore all special characters while formatting:
Pattern.compile(textToFormat, Pattern.LITERAL);

I think what you're after is \Q$5\E. Also see Pattern.quote(s) introduced in Java5.
See Pattern javadoc for details.

First off, if
you use replaceAll()
you DON'T use Matcher.quoteReplacement()
the text to be substituted in includes a $1
it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stracktrace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.

To have protected pattern you may replace all symbols with "\\\\", except digits and letters. And after that you can put in that protected pattern your special symbols to make this pattern working not like stupid quoted text, but really like a patten, but your own. Without user special symbols.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}

Pattern.quote("blabla") works nicely.
The Pattern.quote() works nicely. It encloses the sentence with the characters "\Q" and "\E", and if it does escape "\Q" and "\E".
However, if you need to do a real regular expression escaping(or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\\\\\^$|#\\\\s]", "\\\\$0"));
This method returns: Some/\s/wText*/\,**
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: "+ Pattern.quote(someText));
System.out.println("Full escape: "+someText.replaceAll("[-\\[\\]{}()*+?.,\\\\\\\\^$|#\\\\s]", "\\\\$0"));

^(Negation) symbol is used to match something that is not in the character group.
This is the link to Regular Expressions
Here is the image info about negation:

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Unicode Replacement with ASCII - java

Related

how to get rid of #011 characters in java

Unable to strip invalid unicode characters java

Escape the regex character from a Searching String [duplicate]

Hard time with escape character

How to escape text for regular expression in Java?

Categories

Resources