Unable to strip invalid unicode characters java


I have my data which needs to be cleaned up before further processing in various other applications. In this process one of the downstream applications only allows a certain range of Unicode characters. The following is the regex I'm using to strip out the invalid Unicode characters.
/[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/
However, I'm still having issues getting the regex to work in Java. Is there a special way to treat the above regex, since it contains a range of Unicode characters?
UPDATE:
This is how I tested it; the approach suggested by @Andreas did not seem to work:
public void testStripUnicode() {
String doc = "{\"fields\":{\"field1\":\"unicode char '\\u000b'\",\"field2\":[\"unicode char '\\u0003'\"]}}";
String stripped = DocumentCleaner.clean(doc);
System.out.println(doc);
System.out.println(stripped);
}
doc
{"fields":{"field1":"unicode char '\u000b'","field2":["unicode char '\u0003'"]}}
stripped-doc
{"fields":{"field1":"unicode char '\u000b'","field2":["unicode char '\u0003'"]}}

Should be fine, just drop the slashes / and double the backslashes \:
String regex = "[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]";
String stripped = value.replaceAll(regex, "");
Or if you do it repeatedly, you can parse the regular expression once, up front:
// Prepare regular expression
Pattern p = Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]");
// Use regular expression
String stripped = p.matcher(value).replaceAll("");
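
For reference, a minimal sketch of the precompiled-pattern approach. The class and method names (`InvalidCharStripper`, `strip`) are illustrative assumptions and are not the question's `DocumentCleaner`. One thing it makes visible: the pattern removes real control characters, whereas a Java literal written with a doubled backslash, as in the question's test string, holds the six ordinary characters backslash, `u`, `0`, `0`, `0`, `b`. All six are inside the allowed range, so they are left untouched.

```java
import java.util.regex.Pattern;

final class InvalidCharStripper {
    // Allowed: tab, LF, CR, U+0020..U+D7FF and U+E000..U+FFFD. Anything else is removed.
    private static final Pattern INVALID =
            Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]");

    static String strip(String value) {
        return INVALID.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        // Real control characters U+000B and U+0003 are stripped.
        String realControlChars = "a\u000bb\u0003c";
        System.out.println(strip(realControlChars)); // abc

        // Escape text (a backslash followed by u000b) is ordinary printable text and is kept.
        String escapeText = "a\\u000bb";
        System.out.println(strip(escapeText).equals(escapeText)); // true
    }
}
```

If the input really contains escape text rather than control characters, it would have to be decoded (or matched as text) before this pattern can have any effect.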

Related

Remove backslash before forward slash

Context: GoogleBooks API returing unexpected thumbnail url
OK, so I found the reason for the problem I had in that question.
The URL returned from the Google Books API looked something like this:
http:/\/books.google.com\/books\/content?id=0DwKEBD5ZBUC&printsec=frontcover&img=1&zoom=5&source=gbs_api
Going to that URL returned an error, but if I replaced each "\/" with "/" I got the proper URL.
Is there something like a Java/Kotlin regex that would change http:/\/books.google.com\/ to http://books.google.com/?
(I know a bit of regex in Python, but I'm clueless in Java/Kotlin.)
Thank you.

You can use triple-quoted string literals (that act as raw string literals where backslashes are treated as literal chars and not part of string escape sequences) + kotlin.text.replace:
val text = """http:/\/books.google.com\/books\/content?id=0DwKEBD5ZBUC&printsec=frontcover&img=1&zoom=5&source=gbs_api"""
print(text.replace("""\/""", "/"))
Output:
http://books.google.com/books/content?id=0DwKEBD5ZBUC&printsec=frontcover&img=1&zoom=5&source=gbs_api
NOTE: you will need to double the backslashes in the regular string literal:
print(text.replace("\\/", "/"))
If you need to use this "backslash + slash" pattern in a regex you will need 2 backslashes in the triple-quoted string literal and 4 backslashes in a regular string literal:
print(text.replace("""\\/""".toRegex(), "/"))
print(text.replace("\\\\/".toRegex(), "/"))
NOTE: There is no need to escape / forward slash in a Kotlin regex declaration as it is not a special regex metacharacter and Kotlin regexps are defined with string literals, not regex literals, and thus do not need regex delimiters (/ is often used as a regex delimiter char in environments that support this notation).

You could match the protocol, and then replace the backslash followed by a forward slash by a forward slash only
https?:\\?/\\?/\S+
Pattern in Java
String regex = "https?:\\\\?/\\\\?/\\S+";
For example in Java:
String regex = "https?:\\\\?/\\\\?/\\S+";
String string = "http:/\\/books.google.com\\/books\\/content?id=0DwKEBD5ZBUC&printsec=frontcover&img=1&zoom=5&source=gbs_api";
if (string.matches(regex)) {
System.out.println(string.replace("\\/", "/"));
}
Output
http://books.google.com/books/content?id=0DwKEBD5ZBUC&printsec=frontcover&img=1&zoom=5&source=gbs_api

I had the same problem, and my URL was:
String url="https:\\/\\/www.dailymotion.com\\/cdn\\/H264-320x240\\/video\\/x83iqpl.mp4?sec=zaJEh8Q2ahOorzbKJTOI7b5FX3QT8OXSbnjpCAnNyUWNHl1kqXq0D9F8iLMFJ0ocg120B-dMbEE5kDQJN4hYIA";
I solved it with this code:
url = url.replace("\\/", "/");

how to get rid of #011 characters in java

I get this string when I receive content over JMSQ. When I print it, I see the following line. I believe those are vertical tab characters in the XML. How can I get rid of them?
#011#011#011<xeh:eid>dljfl</xeh:eid>
I have tried
replaceAll("[\\x0B]", "");
but it's not working.

Just do this:
String a = "#011#011#011<xeh:eid>dljfl</xeh:eid>";
String a_wo_vt_chars = a.replaceAll("#011", "");

"#011#011#011<xeh:eid>dljfl</xeh:eid>".replaceAll("#011", "") works fine, results in <xeh:eid>dljfl</xeh:eid>
According to the Pattern javadoc, \xhh stands for "the character with hexadecimal value 0xhh". But I guess in your string literal, #011 is just literal characters.
If I try to replicate the vertical tab in a string literal, it works with \\x0B:
"\u000b\u000b\u000b<xeh:eid>dljfl</xeh:eid>".replaceAll("\\x0B", "")
But maybe we are reading it wrong. While #0B is 11, #11 might be 17...

If #011 represents the hex value of the char, you can use
a.replaceAll("\\u0011", "");
// or
a.replaceAll("\\x11", "");
But if #011 represents the octal value, then use
a.replaceAll("\\011", "");
Also see Unicode Regular Expressions
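
Which of these applies depends on what the string actually contains, so it can help to try each reading side by side. The sketch below is only an illustration; the class and method names are hypothetical.

```java
final class ControlCharDemo {
    // Regex octal escape: 011 in octal is decimal 9, the horizontal tab.
    static String stripOctal011(String s) {
        return s.replaceAll("\\011", "");
    }

    // Regex hex escape: 0B in hex is decimal 11, the vertical tab.
    static String stripHex0B(String s) {
        return s.replaceAll("\\x0B", "");
    }

    // The four literal characters '#', '0', '1', '1'. No regex is needed for this.
    static String stripLiteralHash011(String s) {
        return s.replace("#011", "");
    }

    public static void main(String[] args) {
        System.out.println(stripOctal011("\t\t<xeh:eid>dljfl</xeh:eid>"));
        System.out.println(stripHex0B("\u000b<xeh:eid>dljfl</xeh:eid>"));
        System.out.println(stripLiteralHash011("#011#011#011<xeh:eid>dljfl</xeh:eid>"));
    }
}
```

All three calls print `<xeh:eid>dljfl</xeh:eid>`, but each one removes a different thing, and a pattern written for one reading leaves the other two untouched.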

Escape the regex character from a Searching String [duplicate]

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.

Since Java 1.5, yes:
Pattern.quote("$5");

Difference between Pattern.quote and Matcher.quoteReplacement was not clear to me before I saw following example
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));

It may be too late to respond, but you can also use Pattern.LITERAL, which would ignore all special characters while formatting:
Pattern.compile(textToFormat, Pattern.LITERAL);

I think what you're after is \Q$5\E. Also see Pattern.quote(s) introduced in Java5.
See Pattern javadoc for details.
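
A short sketch of the difference quoting makes for the "$5" case from the question. The class and method names are made up for the illustration.

```java
import java.util.regex.Pattern;

final class QuoteDemo {
    // Treats userInput as literal text by wrapping it with Pattern.quote.
    static boolean containsLiteral(String haystack, String userInput) {
        return Pattern.compile(Pattern.quote(userInput)).matcher(haystack).find();
    }

    // Compiles userInput as a regex, so "$5" means "end of input, then a 5".
    static boolean containsAsRegex(String haystack, String userInput) {
        return Pattern.compile(userInput).matcher(haystack).find();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.quote("$5"));                  // \Q$5\E
        System.out.println(containsLiteral("price: $5", "$5"));   // true
        System.out.println(containsAsRegex("price: $5", "$5"));   // false
    }
}
```

Pattern.LITERAL, mentioned above, should behave the same way for this input, since it also makes the whole pattern string literal.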

First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, then it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
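
The before and after can be reproduced in a few lines. This is a sketch with hypothetical names (`ReplacementDemo`, `fill`, `fillUnsafe`); the template text is invented for the example.

```java
import java.util.regex.Matcher;

final class ReplacementDemo {
    // Safe: '$' and '\' in the user input are treated as literal text.
    static String fill(String template, String userInput) {
        return template.replaceAll("<userInput />", Matcher.quoteReplacement(userInput));
    }

    // Unsafe: "$3" in the user input is read as a reference to capture group 3.
    static String fillUnsafe(String template, String userInput) {
        return template.replaceAll("<userInput />", userInput);
    }

    public static void main(String[] args) {
        String msg = "You paid <userInput />.";
        System.out.println(fill(msg, "$3.50")); // You paid $3.50.
        try {
            fillUnsafe(msg, "$3.50");
        } catch (IndexOutOfBoundsException e) {
            // The pattern has no groups, so the group reference cannot be resolved.
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

When the placeholder is a fixed string rather than a pattern, `String.replace` would avoid both the regex and the replacement syntax altogether.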

To build a protected pattern, you can backslash-escape every character except digits and letters. After that, you can insert your own special symbols into the protected pattern, so that it works as a real pattern of your own design rather than as plain quoted text, and without any special symbols supplied by the user.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-Z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}

Pattern.quote("blabla") works nicely. It encloses the text in "\Q" and "\E", and it also handles any "\E" that already occurs inside the text.
However, if you need real regular expression escaping (or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
This prints: Some/s/wText\*/\,\*\*
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: "+ Pattern.quote(someText));
System.out.println("Full escape: "+someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));

The ^ (negation) symbol is used at the start of a character group to match anything that is not in that group.

Unicode Replacement with ASCII

I created a text file on a Windows system, where I think the default encoding is ANSI. The contents of the file look like this:
This is\u2019 a sample text file \u2014and it can ....
I saved this file using the default Windows encoding, although other encodings such as UTF-8 and UTF-16 were also available.
Now I want to write a simple Java function that takes an input string and replaces each Unicode character with a corresponding ASCII character.
e.g. \u2019 should be replaced with "'"
\u2014 should be replaced with "-" and so on.
Observation:
When I create a string literal like this
String s = "This is\u2019 a sample text file \u2014and it can ....";
my code works fine, but when I read the text from the file it does not work. I am aware that a Java String uses UTF-16 encoding.
Below is the code that I am using to read the input file.
FileReader fileReader = new FileReader(new File("C:\\input.txt"));
BufferedReader bufferedReader = new BufferedReader(fileReader);
String record = bufferedReader.readLine();
I also tried using an InputStream and setting the Charset to UTF-8, but got the same result.
Replacement code :
public static String removeUTFCharacters(String data){
for(Entry<String,String> entry : utfChars.entrySet()){
data=data.replaceAll(entry.getKey(), entry.getValue());
}
return data;
}
Map :
utfChars.put("\u2019","'");
utfChars.put("\u2018","'");
utfChars.put("\u201c","\"");
utfChars.put("\u201d","\"");
utfChars.put("\u2013","-");
utfChars.put("\u2014","-");
utfChars.put("\u2212","-");
utfChars.put("\u2022","*");
Can anybody help me understand the concept and the solution to this problem?

Match the escape sequence \uXXXX with a regular expression. Then use a replacement loop to replace each occurrence of that escape sequence with the decoded value of the character.
Because Java string literals use \ to introduce escapes, the sequence \\ is used to represent \. Also, the Java regex syntax treats the sequence \u specially (to represent a Unicode escape). So the \ has to be escaped again, with an additional \\. So, in the pattern, "\\\\u" really means, "match \u in the input."
To match the numeric portion, four hexadecimal characters, use the pattern \p{XDigit}, escaping the \ with an extra \. We want to easily extract the hex number as a group, so it is enclosed in parentheses to create a capturing group. Thus, "(\\p{XDigit}{4})" in the pattern means, "match 4 hexadecimal characters in the input, and capture them."
In a loop, we search for occurrences of the pattern, replacing each occurrence with the decoded character value. The character value is decoded by parsing the hexadecimal number. Integer.parseInt(m.group(1), 16) means, "parse the group captured in the previous match as a base-16 number." Then a replacement string is created with that character. The replacement string must be escaped, or quoted, in case it is $, which has special meaning in replacement text.
String data = "This is\\u2019 a sample text file \\u2014and it can ...";
Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher m = p.matcher(data);
StringBuffer buf = new StringBuffer(data.length());
while (m.find()) {
String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
m.appendReplacement(buf, Matcher.quoteReplacement(ch));
}
m.appendTail(buf);
System.out.println(buf);
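
To tie this back to the question's replacement map, the decoding loop can run first and the map second. The following is a sketch under two assumptions: the file really contains the six-character escape text rather than the characters themselves, and only two of the question's map entries are needed for the demonstration. The names `EscapeDecoder`, `decode` and `toAscii` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapeDecoder {
    // Matches a backslash, the letter u, and exactly four hex digits (captured).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // Subset of the question's map: typographic character -> ASCII stand-in.
    private static final Map<String, String> ASCII = new LinkedHashMap<>();
    static {
        ASCII.put("\u2019", "'");
        ASCII.put("\u2014", "-");
    }

    // Step 1: turn escape text into the characters it names.
    static String decode(String data) {
        Matcher m = ESCAPE.matcher(data);
        StringBuffer buf = new StringBuffer(data.length());
        while (m.find()) {
            String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            m.appendReplacement(buf, Matcher.quoteReplacement(ch));
        }
        m.appendTail(buf);
        return buf.toString();
    }

    // Step 2: map the decoded characters to ASCII with literal (non-regex) replacement.
    static String toAscii(String data) {
        String result = decode(data);
        for (Map.Entry<String, String> e : ASCII.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String fromFile = "This is\\u2019 a sample text file \\u2014and it can ...";
        System.out.println(toAscii(fromFile)); // This is' a sample text file -and it can ...
    }
}
```

Using `String.replace` in the second step avoids regex handling entirely, which matters for map entries such as `*` or `$` that have special meaning in `replaceAll`.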

If you can use another library, you can use Apache Commons Text:
https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html
String dirtyString = "Colocaci\\u00F3n";
String cleanString = StringEscapeUtils.unescapeJava(dirtyString);
//cleanString = "Colocación"

.replaceAll() method not working correctly

As background: I load a text file and pick a phrase from it as the randomPirateWord, then replace its letters with *'s, and that works correctly. The problem comes when I ask the user to guess a letter. If they guess incorrectly the code works fine, but if they guess a letter correctly it fails. The error message is below the code:
if (!escape.equalsIgnoreCase("m")){
System.out.print(" Type the letter you want to guess: ");
char letter = input.nextLine().charAt(0);
if(m.getRandomPirateWord().contains(letter+"")){
System.out.println(m.getRandomPirateWord().replaceAll("*",letter+""));
}
Error message:
Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*
^
at java.util.regex.Pattern.error(Pattern.java:1924)
at java.util.regex.Pattern.sequence(Pattern.java:2090)
at java.util.regex.Pattern.expr(Pattern.java:1964)
at java.util.regex.Pattern.compile(Pattern.java:1665)
at java.util.regex.Pattern.<init>(Pattern.java:1337)
at java.util.regex.Pattern.compile(Pattern.java:1022)
at java.lang.String.replaceAll(String.java:2162)
at uk.ac.aber.dcs.pirate_hangman.TextBasedGame.runTextBasedGame(TextBasedGame.java:45)
at uk.ac.aber.dcs.pirate_hangman.Application.runApplication(Application.java:19)
at uk.ac.aber.dcs.pirate_hangman.Main.main(Main.java:6)

Use String#replace() instead of String#replaceAll(). The latter uses a regex pattern for replacement, where * is a meta-character and needs to be escaped.
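
A small sketch of both options side by side, with hypothetical names. Like the code in the question, it replaces every asterisk with the guessed letter; it does not attempt the game logic of revealing only the matching positions.

```java
import java.util.regex.Matcher;

final class AsteriskReplaceDemo {
    // String.replace treats both arguments as plain text, so '*' needs no escaping.
    static String withReplace(String masked, char letter) {
        return masked.replace("*", String.valueOf(letter));
    }

    // String.replaceAll compiles its first argument as a regex, so '*' must be escaped.
    static String withReplaceAll(String masked, char letter) {
        return masked.replaceAll("\\*", Matcher.quoteReplacement(String.valueOf(letter)));
    }

    public static void main(String[] args) {
        System.out.println(withReplace("a**", 'r'));    // arr
        System.out.println(withReplaceAll("a**", 'r')); // arr
    }
}
```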

Use the following. You have to escape the * character, since the replaceAll() method takes a regular expression as its first argument:
replaceAll("\\*",letter+"")

Check out the example below.
import java.io.*;
public class Test{
public static void main(String args[]){
String Str = new String("Welcome to Tutorialspoint.com");
System.out.print("Return Value :" );
System.out.println(Str.replaceAll("(.*)Tutorials(.*)",
"AMROOD" ));
}
}
Syntax of method :
public String replaceAll(String regex, String replacement)
regex -- the regular expression to which this string is to be matched.
replacement -- the string which would replace found expression.
OUTPUT: AMROOD
Reference : Regex replacement

The problem is that * is a reserved character in regexes, so you need to escape it.
replaceAll("\\*",letter+"")

The ‘*’ symbol is a quantifier metacharacter in the regular expression, which is the first parameter of the ‘replaceAll’ and ‘replaceFirst’ methods, so it cannot stand there on its own.
1. Using the ‘replace’ method: this is the better choice if you want to replace a string literal and not a pattern.
2. Escaping the ‘*’ symbol: if you need to use a regular expression, quote the literal part of the pattern with Pattern.quote. Matcher.quoteReplacement only protects ‘$’ and ‘\’ in the replacement string, so it belongs on the second argument, not on the pattern:
String pattern = java.util.regex.Pattern.quote("*");
String replaceValue = java.util.regex.Matcher.quoteReplacement("$100");
