How to get rid of #011 characters in Java

I get this string when I receive content over JMSQ. When I print it, I see the following line. I believe those are vertical tab characters in the XML. How can I get rid of them?
#011#011#011<xeh:eid>dljfl</xeh:eid>
I have tried
replaceAll("[\\x0B]", "");
but it's not working.

Just do this:
String a = "#011#011#011<xeh:eid>dljfl</xeh:eid>";
String a_wo_vt_chars = a.replaceAll("#011", "");

"#011#011#011<xeh:eid>dljfl</xeh:eid>".replaceAll("#011", "") works fine, results in <xeh:eid>dljfl</xeh:eid>
According to the Pattern javadoc, \xhh stands for "the character with hexadecimal value 0xhh". But I guess in your string literal, #011 is just literal characters.
If I try to replicate the vertical tab in a string literal, it works with \\x0B:
"\u000b\u000b\u000b<xeh:eid>dljfl</xeh:eid>".replaceAll("\\x0B", "")
But maybe we are reading it wrong. While #0B is 11, #11 might be 17...

If #011 represents the hex value of the character, you can use
a.replaceAll("\\u0011", "");
// or
a.replaceAll("\\x11", "");
But if #011 represents the octal value, then use
a.replaceAll("\\011", "")
Also see Unicode Regular Expressions
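To make the two readings above concrete, here is a minimal sketch. It assumes #011 is either the literal four characters in the string, or an escaped rendering of octal 011 (decimal 9, which is the horizontal tab rather than the vertical tab). Where the string was escaped is a guess on my part, and the class and method names are made up for illustration.

```java
class TabMarkerCleaner {
    // Case 1: the string really contains the four characters '#', '0', '1', '1'.
    // String.replace works on literal text, so no regex escaping is involved.
    static String stripLiteralMarkers(String s) {
        return s.replace("#011", "");
    }

    // Case 2: the string contains real tab characters (octal 011 = decimal 9).
    // In a Java regex, a backslash followed by 0 and octal digits is an octal escape.
    static String stripRealTabs(String s) {
        return s.replaceAll("\\011", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLiteralMarkers("#011#011#011<xeh:eid>dljfl</xeh:eid>"));
        System.out.println(stripRealTabs("\t\t\t<xeh:eid>dljfl</xeh:eid>"));
    }
}
```

Both calls print <xeh:eid>dljfl</xeh:eid>. If neither variant changes your string, dumping the individual char values (for example with s.chars()) should show what is actually in there.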

Related

Unable to strip invalid unicode characters java

I have my data which needs to be cleaned up before further processing in various other applications. In this process one of the downstream applications only allows a certain range of Unicode characters. The following is the regex I'm using to strip out the invalid Unicode characters.
/[^\u0009\u000a\u000d\u0020-\uD7FF\uE000-\uFFFD]/
However, I'm still having issues getting the regex to work in Java. Is there a special way to treat the above regex, since it contains a range of Unicode characters?
UPDATE:
This is how I tested it, and I couldn't get it to work the way @Andreas suggested:
public void testStripUnicode() {
String doc = "{\"fields\":{\"field1\":\"unicode char '\\u000b'\",\"field2\":[\"unicode char '\\u0003'\"]}}";
String stripped = DocumentCleaner.clean(doc);
System.out.println(doc);
System.out.println(stripped);
}
doc
{"fields":{"field1":"unicode char '\u000b'","field2":["unicode char '\u0003'"]}}
stripped-doc
{"fields":{"field1":"unicode char '\u000b'","field2":["unicode char '\u0003'"]}}
Should be fine, just drop the slashes / and double the backslashes \:
String regex = "[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]";
String stripped = value.replaceAll(regex, "");
Or if you do it repeatedly, you can parse the regular expression once, up front:
// Prepare regular expression
Pattern p = Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]");
// Use regular expression
String stripped = p.matcher(value).replaceAll("");
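One possible reason the test in the question shows no change (an assumption, since DocumentCleaner is not shown): the test string contains the six-character text backslash, u, 0, 0, 0, b rather than a real control character, so there is nothing for the regex to strip. A small sketch of the difference follows; InvalidCharStripper is a hypothetical name.

```java
import java.util.regex.Pattern;

class InvalidCharStripper {
    // Compiled once; keeps tab, LF, CR and the two allowed BMP ranges.
    private static final Pattern INVALID =
            Pattern.compile("[^\\u0009\\u000a\\u000d\\u0020-\\uD7FF\\uE000-\\uFFFD]");

    static String clean(String value) {
        return INVALID.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        // Real control characters: vertical tab (0x0B) and ETX (0x03).
        String realControls = "a" + (char) 0x0B + "b" + (char) 0x03 + "c";
        // Escaped text: a backslash followed by the letters u000b, all printable.
        String escapedText = "a\\u000bb";

        System.out.println(clean(realControls));                    // prints abc
        System.out.println(clean(escapedText).equals(escapedText)); // prints true
    }
}
```

If the downstream application also rejects the escaped form, the escapes would have to be decoded (or removed as text) before or instead of running this regex.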

Escape the regex character from a Searching String [duplicate]

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.
Since Java 1.5, yes:
Pattern.quote("$5");
The difference between Pattern.quote and Matcher.quoteReplacement was not clear to me until I saw the following example:
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));
It may be too late to respond, but you can also use Pattern.LITERAL, which makes the compiler treat every character of the pattern as a literal:
Pattern.compile(textToFormat, Pattern.LITERAL);
I think what you're after is \Q$5\E. Also see Pattern.quote(s) introduced in Java5.
See Pattern javadoc for details.
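As a rough side-by-side of the options mentioned so far, the sketch below searches for the user input "$5" three ways: quoted with Pattern.quote, compiled with Pattern.LITERAL, and unescaped. The class and method names are only illustrative.

```java
import java.util.regex.Pattern;

class LiteralSearch {
    // Pattern.quote wraps the input so that it is matched literally.
    static boolean containsQuoted(String text, String userInput) {
        return Pattern.compile(Pattern.quote(userInput)).matcher(text).find();
    }

    // Pattern.LITERAL tells the compiler to treat the whole pattern as literal text.
    static boolean containsLiteralFlag(String text, String userInput) {
        return Pattern.compile(userInput, Pattern.LITERAL).matcher(text).find();
    }

    // Without escaping, "$" means end of input, so "$5" cannot match here.
    static boolean containsUnescaped(String text, String userInput) {
        return Pattern.compile(userInput).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "It costs $5 today";
        System.out.println(containsQuoted(text, "$5"));      // prints true
        System.out.println(containsLiteralFlag(text, "$5")); // prints true
        System.out.println(containsUnescaped(text, "$5"));   // prints false
    }
}
```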
First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
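Here is a small self-contained sketch of that before and after, in case it helps. The message text and the class name TemplateFiller are invented for the example; only replaceAll() and Matcher.quoteReplacement() come from the answer above.

```java
import java.util.regex.Matcher;

class TemplateFiller {
    // Safe: "$" and "\" in the user's text are treated as plain characters.
    static String fill(String msg, String userInput) {
        return msg.replaceAll("<userInput />", Matcher.quoteReplacement(userInput));
    }

    // Unsafe: "$3" in the user's text is read as a reference to group 3.
    static String fillUnsafe(String msg, String userInput) {
        return msg.replaceAll("<userInput />", userInput);
    }

    public static void main(String[] args) {
        String msg = "You entered: <userInput />";
        System.out.println(fill(msg, "$3.50")); // prints You entered: $3.50
        try {
            fillUnsafe(msg, "$3.50");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("Unquoted replacement failed: " + e.getMessage());
        }
    }
}
```

If the search text is also fixed literal text, String.replace(CharSequence, CharSequence) avoids regex handling on both sides and may be the simpler choice.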
To get a protected pattern, you can escape every character except digits and letters with a backslash ("\\\\" in the replacement string). After that, you can insert your own special symbols into the protected pattern, so that it works as a real pattern of your own rather than as plain quoted text, with no special symbols coming from the user.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-Z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}
Pattern.quote("blabla") works nicely.
Pattern.quote() works nicely. It encloses the text in "\Q" and "\E", and it also takes care of any "\E" that already appears in the text.
However, if you need real regular expression escaping (or custom escaping), you can use this code:
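String someText = "Some/s/wText*/,**";
// Puts a backslash before regex metacharacters, ',', '#', '-' and whitespace.
// The character class uses \\\\ for a literal backslash and \\s for whitespace;
// doubling them again would also escape the letter 's'.
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
This prints: Some/s/wText\*/\,\*\*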
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: "+ Pattern.quote(someText));
System.out.println("Full escape: "+someText.replaceAll("[-\\[\\]{}()*+?.,\\\\^$|#\\s]", "\\\\$0"));
The ^ (negation) symbol, placed at the start of a character class, is used to match anything that is not in that class.
This is the link to Regular Expressions

Unicode Replacement with ASCII

I have created a text file on a Windows system, where I think the default encoding is ANSI, and the contents of the file look like this:
This is\u2019 a sample text file \u2014and it can ....
I saved this file using the default Windows encoding, although other encodings such as UTF-8 and UTF-16 were also available.
Now I want to write a simple Java function that takes an input string and replaces each of these Unicode characters with the corresponding ASCII character,
e.g. \u2019 should be replaced with "'",
\u2014 should be replaced with "-", and so on.
Observation:
When I create a string literal like this
String s = "This is\u2019 a sample text file \u2014and it can ....";
my code works fine, but when I read the text from the file it does not work. I am aware that Java Strings use UTF-16 encoding.
Below is the code that I am using to read the input file.
FileReader fileReader = new FileReader(new File("C:\\input.txt"));
BufferedReader bufferedReader = new BufferedReader(fileReader);
String record = bufferedReader.readLine();
I also tried using the InputStream and setting the Charset to UTF-8 , but still the same result.
Replacement code :
public static String removeUTFCharacters(String data){
for(Entry<String,String> entry : utfChars.entrySet()){
data=data.replaceAll(entry.getKey(), entry.getValue());
}
return data;
}
Map :
utfChars.put("\u2019","'");
utfChars.put("\u2018","'");
utfChars.put("\u201c","\"");
utfChars.put("\u201d","\"");
utfChars.put("\u2013","-");
utfChars.put("\u2014","-");
utfChars.put("\u2212","-");
utfChars.put("\u2022","*");
Can anybody help me understand the concept and the solution to this problem?
Match the escape sequence \uXXXX with a regular expression. Then use a replacement loop to replace each occurrence of that escape sequence with the decoded value of the character.
Because Java string literals use \ to introduce escapes, the sequence \\ is used to represent \. Also, the Java regex syntax treats the sequence \u specially (to represent a Unicode escape). So the \ has to be escaped again, with an additional \\. So, in the pattern, "\\\\u" really means, "match \u in the input."
To match the numeric portion, four hexadecimal characters, use the pattern \p{XDigit}, escaping the \ with an extra \. We want to easily extract the hex number as a group, so it is enclosed in parentheses to create a capturing group. Thus, "(\\p{XDigit}{4})" in the pattern means, "match 4 hexadecimal characters in the input, and capture them."
In a loop, we search for occurrences of the pattern, replacing each occurrence with the decoded character value. The character value is decoded by parsing the hexadecimal number. Integer.parseInt(m.group(1), 16) means, "parse the group captured in the previous match as a base-16 number." Then a replacement string is created with that character. The replacement string must be escaped, or quoted, in case it is $, which has special meaning in replacement text.
String data = "This is\\u2019 a sample text file \\u2014and it can ...";
Pattern p = Pattern.compile("\\\\u(\\p{XDigit}{4})");
Matcher m = p.matcher(data);
StringBuffer buf = new StringBuffer(data.length());
while (m.find()) {
String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
m.appendReplacement(buf, Matcher.quoteReplacement(ch));
}
m.appendTail(buf);
System.out.println(buf);
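To get all the way to the ASCII output the question asks for, the decoding loop can be combined with a lookup table. The following is only a sketch under that assumption: the table repeats the mappings from the question, and the names UnicodeToAscii, decodeEscapes and toAscii are invented here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeToAscii {
    // Matches a backslash, the letter u, and exactly four hex digits (captured).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u(\\p{XDigit}{4})");

    // Same mappings as in the question, keyed by the decoded character.
    private static final Map<Character, String> ASCII = new HashMap<>();
    static {
        ASCII.put((char) 0x2019, "'");
        ASCII.put((char) 0x2018, "'");
        ASCII.put((char) 0x201C, "\"");
        ASCII.put((char) 0x201D, "\"");
        ASCII.put((char) 0x2013, "-");
        ASCII.put((char) 0x2014, "-");
        ASCII.put((char) 0x2212, "-");
        ASCII.put((char) 0x2022, "*");
    }

    // Step 1: turn each escape sequence found in the text into the real character.
    static String decodeEscapes(String data) {
        Matcher m = ESCAPE.matcher(data);
        StringBuffer buf = new StringBuffer(data.length());
        while (m.find()) {
            String ch = String.valueOf((char) Integer.parseInt(m.group(1), 16));
            m.appendReplacement(buf, Matcher.quoteReplacement(ch));
        }
        m.appendTail(buf);
        return buf.toString();
    }

    // Step 2: replace known punctuation with its ASCII stand-in, keep everything else.
    static String toAscii(String data) {
        StringBuilder out = new StringBuilder();
        for (char c : decodeEscapes(data).toCharArray()) {
            String replacement = ASCII.get(c);
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "This is\\u2019 a sample text file \\u2014and it can ...";
        System.out.println(toAscii(line)); // prints This is' a sample text file -and it can ...
    }
}
```

Because the lookup runs on decoded characters, the same method handles both the escaped text read from the file and a string that already contains the real characters.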
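If you can use another library, you can use Apache Commons Text:
https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html
// The backslash is doubled so that the string holds the escape sequence as text;
// with a single backslash the compiler would decode it before unescapeJava runs.
String dirtyString = "Colocaci\\u00F3n";
String cleanString = StringEscapeUtils.unescapeJava(dirtyString);
//cleanString = "Colocación"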

Give an example of using Cyrillic in a Java regex

How do I make a regex from a Cyrillic string? I want to use it somehow like this:
String.replaceAll("Кириллица","")
Of course it doesn't work. What should I do to make it work?
OK, I see that the method works, but it doesn't work for me. How can I check why the method has no effect?
...
Hm, I tried s1 = s1.replaceAll("[\\p{InCyrillic}]", ""); on the string I get through the sockets. It works great: all Cyrillic characters disappear, including the word "Экзамен". But if I try s1 = s1.replaceAll("Экзамен", ""), nothing happens.
However, s1 = s1.replaceAll("Экзамен", "") worked in the same program for a static string defined in the program. I guess the problem may be a wrong charset, but I still can't understand what I am doing wrong. The charset of the string is windows-1251. I tried experimenting with the charset in the program (it is a JSP now), using these calls:
System.setProperty("file.encoding", "windows-1251");
response.setCharacterEncoding("windows-1251");
I also tried converting the string from one charset to another, and nothing changed.
It might be clearer if you showed your result for @Henry's answer.
I suppose the issue is in the characters or in the encoding.
You can check whether the string is really Cyrillic with this code:
String s1 = "Экзaмен";
s1 = s1.replaceAll("[\\p{InCyrillic}]", "");
System.out.println(s1);
The code removes all Cyrillic characters, so whatever is left shows you the characters that are not what they appear to be.
If your result is something like "a", "e", or "ae", it means your string contains Latin characters that look similar to Cyrillic ones, so you should replace using this regex
s1 = s1.replaceAll("Экз[aa]м[ee]н", "");
where, inside each pair of brackets, one character is the Cyrillic letter and the other is its Latin look-alike.
If your result is "Экзaмен", the issue is in the encoding, and I hope this link will help you:
How to determine if a String contains invalid encoded characters
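Another way to spot look-alike letters is to ask Java which script each character belongs to. The sketch below uses Character.UnicodeScript from the standard library; the class name ScriptInspector and its methods are made up for the example, and the strings are written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.lang.Character.UnicodeScript;

class ScriptInspector {
    // True if any code point of s belongs to the given script.
    static boolean containsScript(String s, UnicodeScript script) {
        return s.codePoints().anyMatch(cp -> UnicodeScript.of(cp) == script);
    }

    // Lists every code point with its script, for eyeballing the string.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp ->
                sb.append(String.format("U+%04X %s; ", cp, UnicodeScript.of(cp))));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The same word twice: fully Cyrillic, then with a Latin 'a' as 4th letter.
        String cyrillic = "\u042D\u043A\u0437\u0430\u043C\u0435\u043D";
        String mixed = "\u042D\u043A\u0437a\u043C\u0435\u043D";

        System.out.println(containsScript(cyrillic, UnicodeScript.LATIN)); // prints false
        System.out.println(containsScript(mixed, UnicodeScript.LATIN));    // prints true
        System.out.println(describe(mixed));
    }
}
```

If describe() shows neither CYRILLIC nor LATIN where you expect letters, that points to the encoding problem rather than to look-alike characters.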
Just tried this:
String s1 = "Введение в специальность (Б.3.2.1-ПиКО)60,3Экзамен";
String s2 = s1.replaceAll("Экзамен", "");
System.out.println(s2);
The output is:
Введение в специальность (Б.3.2.1-ПиКО)60,3

Unable to split a string

I have a string
Mr praneel PIDIKITI
When I use this regular expression
String[] nameParts = name.split("\\s+");
instead of getting three parts I am only getting two, Mr and Praneel PIDIKITI.
I am unable to split the second string. Does anyone know what could be the problem?
I even used split(" ");.
The problem is I used replaceAll("\\<.*?>", " ").trim(); to convert html into this string and then I am using name.split("\\s+"); to get the name value.
I think it must be something other than space (some special character).
Your code should work. I suspect your input. There could be a non-printable junk character between praneel and PIDIKITI. For example,
String name = "Mr praneel" + (char)1 +"PIDIKITI";
String[] nameParts = name.split("\\s+");
for(String s : nameParts)
System.out.println(s);
Are you sure that there is no junk character between Praneel and PIDIKITI?
Remove non printable characters like this:
// remove non printable characters excluding white space characters
name = name.replaceAll("[^\\p{Print}\\s]","");
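Note that removing the junk character glues the two words together, so the split still yields two parts. An alternative sketch, assuming the stray character should act as a separator, is to split on it directly. Since the text came from HTML, a non-breaking space (U+00A0) is a plausible culprit, but that is a guess; NameSplitter is a made-up name.

```java
import java.util.Arrays;

class NameSplitter {
    // Splits on ordinary whitespace, Unicode separators such as the
    // non-breaking space (category Z), and ASCII control characters.
    static String[] splitName(String name) {
        return name.split("[\\s\\p{Z}\\p{Cntrl}]+");
    }

    public static void main(String[] args) {
        String withControlChar = "Mr praneel" + (char) 1 + "PIDIKITI";
        String withNbsp = "Mr praneel" + (char) 0xA0 + "PIDIKITI";

        System.out.println(Arrays.toString(splitName(withControlChar))); // prints [Mr, praneel, PIDIKITI]
        System.out.println(Arrays.toString(splitName(withNbsp)));        // prints [Mr, praneel, PIDIKITI]
    }
}
```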
If you're parsing HTML, may I recommend JSoup? It's a good HTML parser for Java.
