How do I build a regex from a Cyrillic string? I want to use it somehow like this:
String.replaceAll("Кириллица","")
Of course it doesn't work. What should I do to make it work?
OK, I see that the method itself works, but it doesn't work for me. How can I find out why the call has no effect?
...
Hm, I tried s1 = s1.replaceAll("[\\p{InCyrillic}]", ""); on the string I receive through a socket. It works great: all Cyrillic characters disappear, including the word "Экзамен". But if I try s1 = s1.replaceAll("Экзамен", ""); nothing happens.
However, s1 = s1.replaceAll("Экзамен", ""); did work in the same program for a static string defined in the program itself. I guess the problem may be a wrong charset, but I still can't see what I am doing wrong. The charset of the string is windows-1251. I tried experimenting with the charset in the program (it is a JSP now), using the methods
System.setProperty("file.encoding", "windows-1251");
response.setCharacterEncoding("windows-1251");
and I tried converting the string from one charset to another. Nothing changes.
It might be clearer if you showed the result you get with @Henry's answer.
I suspect the issue is in the characters themselves or in the encoding.
You can check whether the string is really all Cyrillic with this code:
String s1 = "Экзaмен";
s1 = s1.replaceAll("[\\p{InCyrillic}]", "");
System.out.println(s1);
This code removes all Cyrillic characters, so whatever is left over is either not Cyrillic or wrongly encoded.
If the result is something like "a", "e", or "ae", your string contains Latin characters that look like Cyrillic ones, so you should replace using this regex:
s1 = s1.replaceAll("Экз[аa]м[еe]н", "");
where each character class holds the Cyrillic letter followed by its Latin look-alike: [аa] is Cyrillic а (U+0430) and Latin a (U+0061), [еe] is Cyrillic е (U+0435) and Latin e (U+0065).
If the result is the unchanged input string, the issue is in the encoding, and I hope this link will help you:
How to determine if a String contains invalid encoded characters
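To see which of the two cases applies, it can help to print every character with its code point and script. The following is only a minimal sketch: the class and method names are made up for illustration, and the sample word is written with Unicode escapes (with one deliberately Latin "a" in the middle) so the example does not depend on the source file encoding.

```java
import java.util.stream.Collectors;

final class CodePointDump {
    // Renders each code point as "U+XXXX(SCRIPT)", e.g. "U+0430(CYRILLIC)".
    static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X(%s)", cp, Character.UnicodeScript.of(cp)))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Six Cyrillic letters plus a Latin "a" (U+0061) in fourth position.
        String suspicious = "\u042D\u043A\u0437" + "a" + "\u043C\u0435\u043D";
        System.out.println(describe(suspicious));
    }
}
```

Any entry marked LATIN inside a word that should be Cyrillic is a look-alike character; entries that are neither LATIN nor CYRILLIC point to a decoding problem instead.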
Just tried this:
String s1 = "Введение в специальность (Б.3.2.1-ПиКО)60,3Экзамен";
String s2 = s1.replaceAll("Экзамен", "");
System.out.println(s2);
The output is:
Введение в специальность (Б.3.2.1-ПиКО)60,3
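Since the same literal works for a string defined in the program but not for the socket data, one possibility (an assumption, not something the question confirms) is that the literal in the JSP source and the bytes from the socket are decoded with different charsets. A hedged sketch of two defensive steps: decode the socket bytes with an explicit charset, and spell the search word with Unicode escapes so it no longer depends on the source file encoding. The names LiteralEncoding, decode and strip are hypothetical.

```java
import java.nio.charset.Charset;

final class LiteralEncoding {
    // Assumption: the socket delivers windows-1251 bytes, as stated in the question.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // The word from the question, all seven letters Cyrillic, written as escapes
    // so that the compiled value is independent of the source file encoding.
    static final String EXAM = "\u042D\u043A\u0437\u0430\u043C\u0435\u043D";

    // Decode raw bytes with an explicit charset instead of the platform default.
    static String decode(byte[] raw) {
        return new String(raw, CP1251);
    }

    // Plain-text removal; no regex is needed for a fixed word.
    static String strip(String s) {
        return s.replace(EXAM, "");
    }

    public static void main(String[] args) {
        // Stand-in for bytes read from the socket.
        byte[] raw = ("60,3" + EXAM).getBytes(CP1251);
        System.out.println(strip(decode(raw)));
    }
}
```

In a JSP, declaring the page encoding so that it matches the actual file encoding addresses the same mismatch from the other side.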
I receive this string when I read content from JMSQ. When I print it I see the following line. I believe those are vertical tab characters in the XML, but how do I get rid of them?
#011#011#011<xeh:eid>dljfl</xeh:eid>
I have tried
replaceAll("[\\x0B]", "");
but it's not working.
Just do this:
String a = "#011#011#011<xeh:eid>dljfl</xeh:eid>";
String a_wo_vt_chars = a.replaceAll("#011", "");
"#011#011#011<xeh:eid>dljfl</xeh:eid>".replaceAll("#011", "") works fine, results in <xeh:eid>dljfl</xeh:eid>
According to the Pattern javadoc, \xhh stands for "the character with hexadecimal value 0xhh". But I guess in your string literal, #011 is just literal characters.
If I try to replicate the vertical tab in a string literal, it works with \\x0B:
"\u000b\u000b\u000b<xeh:eid>dljfl</xeh:eid>".replaceAll("\\x0B", "")
But maybe we are reading it wrong. While #0B is 11, #11 might be 17...
If #011 represents the hex value of the character, you can use
a.replaceAll("\\u0011", "");
// or
a.replaceAll("\\x11", "");
But if #011 represents the octal value, then use
a.replaceAll("\\011", "");
Also see Unicode Regular Expressions
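The cases discussed above can be put side by side in a small sketch. It assumes three possible inputs: a real vertical tab, a real horizontal tab (octal 011), and the four printable characters "#011" (some loggers print control characters that way, but whether that applies here is an assumption). The helper names are hypothetical.

```java
final class ControlCharEscapes {
    // Removes real control characters (U+0000-U+001F and U+007F), including tab and vertical tab.
    static String stripControls(String s) {
        return s.replaceAll("\\p{Cntrl}", "");
    }

    // Removes the printable text "#" followed by three octal digits.
    static String stripHashEscapes(String s) {
        return s.replaceAll("#[0-7]{3}", "");
    }

    public static void main(String[] args) {
        String realVt = "\u000B<xeh:eid>dljfl</xeh:eid>";   // one vertical tab character
        String realTab = "\t<xeh:eid>dljfl</xeh:eid>";      // one horizontal tab character
        String printed = "#011<xeh:eid>dljfl</xeh:eid>";    // four printable characters

        System.out.println(realVt.replaceAll("\\x0B", ""));   // hex escape for vertical tab
        System.out.println(realTab.replaceAll("\\011", ""));  // octal escape for horizontal tab
        System.out.println(stripControls(realVt + realTab));
        System.out.println(stripHashEscapes(printed));
    }
}
```

If none of the targeted patterns match, printing the code points of the first few characters shows which case the real input falls into.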
I refer to the following web site:
http://coderstoolbox.net/string/#!encoding=xml&action=encode&charset=us_ascii
Choosing "URL", "Encode", and "US-ASCII", the input is converted to the desired output.
How do I produce the same output with Java code?
Thanks in advance.
I used this and it seems to work fine.
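import java.util.regex.Pattern;
import java.util.stream.Collectors;

public static String encode(String input) {
    // Letters and digits pass through unchanged; everything else is escaped.
    Pattern doNotReplace = Pattern.compile("[a-zA-Z0-9]");
    return input.chars().mapToObj(c -> {
        String ch = String.valueOf((char) c);
        if (doNotReplace.matcher(ch).matches()) {
            return ch;
        }
        // Upper-case and zero-pad only the hex digits, so unescaped letters keep their case.
        return c < 256 ? String.format("%%%02X", c) : String.format("%%U%04X", c);
    }).collect(Collectors.joining());
}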
PS: I'm using 256 as the threshold for the U prefix. Code points below 256 (ASCII and Latin-1) fit in two hex digits and need no prefix; anything above gets the prefix U.
Alternate option:
There is a built-in Java class (java.net.URLEncoder) that does URL encoding, but it works a little differently: it targets HTML form encoding, so it replaces the space character with + instead of %20, and it leaves a few other characters (., -, *, _) unescaped. See if it helps:
String encoded = URLEncoder.encode(input, "US-ASCII");
Hope this helps!
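If %20 is needed for spaces, one common workaround is to post-process the URLEncoder result. This is a sketch under assumptions: it encodes as UTF-8 rather than US-ASCII (with US-ASCII, characters outside ASCII cannot be represented), and the class name PercentEncode is made up. Replacing + afterwards is safe because a literal plus sign in the input has already been turned into %2B.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class PercentEncode {
    static String encode(String input) {
        try {
            // URLEncoder produces form encoding: spaces become '+'.
            String formEncoded = URLEncoder.encode(input, "UTF-8");
            // Any '+' left in the output stands for a space, so swap it for %20.
            return formEncoded.replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("a b+c"));
    }
}
```

The output can still differ from other encoders for characters such as * and ~, so compare it against the target format before relying on it.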
You can use ESAPI.encoder().encodeForURL(linkString) from the OWASP ESAPI library.
For more details on percent-encoding, see https://en.wikipedia.org/wiki/Percent-encoding
Please comment if that does not satisfy your requirement or if you face any other issue.
Thanks
I have the following string, in which the special character Â appears as a hidden character. I want to remove only the Â from this string: ~IQBAL~KARACHI¦~~~~~~~~~~~.
I've tried this code:
responseMessageUTF.replaceAll("\\P{InBasic_Latin}", "");
but this also removes the ¦ character. Is there any way to remove only the Â character and not the ¦ character?
I have a simple one-liner that removes every non-ASCII character. I tested it with your character, Â, as well. Note that ¦ (U+00A6) is also outside the ASCII range, so this pattern removes it too.
String myString = "~KARACHI¦~~~~~~";
String result = myString.replaceAll("[^\\x00-\\x7F]","");
System.out.println(result);
You have to match the character in the right encoding (UTF-8):
Code example:
String blub = " ~KARACHI¦~~~~~~";
System.out.println(blub);
System.out.println(blub.replaceAll(new String("Â".getBytes("UTF-8"), "UTF-8"), ""));
Output:
~KARACHI¦~~~~~~
~KARACHI¦~~~~~~
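To remove only Â and keep ¦, a literal replace of that one character is enough. A stray Â in front of ¦ is also the typical symptom of UTF-8 bytes having been decoded as ISO-8859-1, in which case re-decoding repairs the string at the source. Both are sketched below; the position of Â in the sample string and the ISO-8859-1 hypothesis are assumptions, and the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

final class StrayCircumflex {
    // Removes only U+00C2 (A with circumflex); every other character is kept.
    static String dropStrayA(String s) {
        return s.replace("\u00C2", "");
    }

    // Undoes a UTF-8-read-as-ISO-8859-1 mix-up by re-decoding the original bytes.
    // Only valid if that mix-up is really what happened.
    static String redecode(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical input: U+00C2 directly before the broken bar U+00A6.
        String broken = "~KARACHI\u00C2\u00A6~~";
        System.out.println(dropStrayA(broken));
        System.out.println(redecode(broken));
    }
}
```

The first method treats the symptom; if the re-decoding variant gives the same result on real data, fixing the charset where the message is read is the cleaner solution.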
I'm having trouble in concatenating pieces of text mixing Western and Arabic chars.
I've a list of tokens like this:
-LRB-
دریای
مازندران
-RRB-
,
I use the following procedure to concatenate this list of tokens:
String str = "";
for (String tok : tokens) {
str += tok + " ";
}
This is the output of my procedure:
-LRB- دریای مازندران -RRB- ,
As can be seen, the position of the Arabic words is inverted.
How can I solve this (maybe suggesting to Java to ignore the information about text direction)?
EDIT
Actually, it seems that my problem was a false problem.
Now I've a new one. I need to wrap each word inside a string like this (word *) so that my output will be like this:
(word1 *)(word2 *)(word3 *)...
The procedure that I use is the following:
String str = "";
for (String tok : tokens) {
str += "(" + tok + " *)";
}
However, the result that I got is this:
(-LRB- *)(دریای *)(مازندران *)(-RRB- *)(, *)
instead of:
(-LRB- *)(دریای)(* مازندران *)(-RRB- *)(, *)
** EDIT2 **
Actually, I've discovered that my problem is not a problem. I wrote my string to a file and opened it with nano (in the console), and it was correctly concatenated.
So the problem was due to the Eclipse console (and also gedit) which --let's say-- incorrectly rendered the string.
Anyway, thanks for your help!
The output is correct, and if you are presenting this text to an Arabic-speaking user you should not override the directionality of the text. Arabic is written from right to left. When you concatenate two Arabic strings together, the first will appear to the right of the second. This is controlled by the BiDi algorithm, the details of which are covered in http://www.unicode.org/reports/tr9/.
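The difference between stored (logical) order and displayed order can be checked in code. This sketch uses java.text.Bidi from the standard library; the two right-to-left tokens from the question are written as Unicode escapes, and the class and method names are made up for illustration.

```java
import java.text.Bidi;

final class LogicalOrder {
    // True if 'first' is stored before 'second' in memory, whatever a console shows.
    static boolean storedInOrder(String text, String first, String second) {
        return text.indexOf(first) >= 0 && text.indexOf(first) < text.indexOf(second);
    }

    // True if the text mixes left-to-right and right-to-left runs.
    static boolean isMixed(String text) {
        return new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT).isMixed();
    }

    public static void main(String[] args) {
        String first = "\u062F\u0631\u06CC\u0627\u06CC";                     // first RTL token
        String second = "\u0645\u0627\u0632\u0646\u062F\u0631\u0627\u0646";  // second RTL token
        String joined = "-LRB- " + first + " " + second + " -RRB- ,";

        System.out.println(storedInOrder(joined, first, second));
        System.out.println(isMixed(joined));
    }
}
```

A string that is stored in order but reported as mixed will be reordered only at display time, by whatever renders it.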
First, I would suggest using StringBuilder instead of raw String concatenation; your garbage collector will be a lot happier. Second, without seeing the input or how your StringTokenizer is set up, my guess is that you are having problems tokenizing the string properly.
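A rough sketch of the StringBuilder suggestion, applied to both loops from the question. The class and method names are hypothetical, and note one small difference: String.join does not leave the trailing space that the original loop produces.

```java
import java.util.Arrays;
import java.util.List;

final class JoinTokens {
    // Space-separated concatenation without a trailing space.
    static String joinWithSpaces(List<String> tokens) {
        return String.join(" ", tokens);
    }

    // Wraps every token as "(token *)" using a single StringBuilder.
    static String wrap(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String tok : tokens) {
            sb.append('(').append(tok).append(" *)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("-LRB-", "word1", "word2", "-RRB-", ",");
        System.out.println(joinWithSpaces(tokens));
        System.out.println(wrap(tokens));
    }
}
```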
I have a string
Mr praneel PIDIKITI
When I use this regular expression
String[] nameParts = name.split("\\s+");
instead of getting three parts I am only getting two, Mr and Praneel PIDIKITI.
I am unable to split the second string. Does anyone know what could be the problem?
I even used split(" ");.
The problem is that I used replaceAll("\\<.*?>", " ").trim(); to convert HTML into this string, and then I use name.split("\\s+"); to get the name parts.
I think it must be something other than space (some special character).
Your code should work. I suspect your input. There could be a non-printable junk character between Praneel and PIDIKITI. For example,
String name = "Mr praneel" + (char)1 +"PIDIKITI";
String[] nameParts = name.split("\\s+");
for(String s : nameParts)
System.out.println(s);
Are you sure that there is no junk character between Praneel and PIDIKITI?
Remove non-printable characters like this:
// remove non-printable characters, keeping whitespace characters
name = name.replaceAll("[^\\p{Print}\\s]","");
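Another candidate for the "something other than space" is a non-breaking space (U+00A0), which is common in text that came from HTML. By default \s in Java matches only ASCII whitespace, so split("\\s+") does not split on it. This is only a guess about the actual input; the sketch below, with a made-up class name, shows a split that also accepts Unicode separator characters.

```java
final class SplitOnAnySpace {
    // Splits on ASCII whitespace and on Unicode separators (category Z, which includes U+00A0).
    static String[] splitName(String name) {
        return name.trim().split("[\\s\\p{Z}]+");
    }

    public static void main(String[] args) {
        // Hypothetical input with a non-breaking space before the surname.
        String name = "Mr praneel\u00A0PIDIKITI";
        System.out.println(name.split("\\s+").length);
        System.out.println(splitName(name).length);
    }
}
```

Unlike deleting the odd character, splitting on it keeps the two name parts separate.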
If you're parsing HTML, may I recommend jsoup? It's a good HTML parser for Java.