How to remove the Â character from a string in java?


I have the following string, which contains a stray Â special character. I want to remove only the Â from this string: ~IQBAL~KARACHIÂ¦~~~~~~~~~~~
I've tried this code:
responseMessageUTF.replaceAll("\\P{InBasic_Latin}", "");
but this is also replacing the ¦ character. Is there any way to remove only the Â character and not the ¦ character?

I have a simple one-liner that removes every non-ASCII character (anything outside \x00-\x7F). I tested it with your character, Â. Note that it also removes ¦, which is not an ASCII character either.
String myString = "~KARACHIÂ¦~~~~~~";
String result = myString.replaceAll("[^\\x00-\\x7F]","");
System.out.println(result);

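You have to use the right encoding: the Â in your source file must be read by the compiler as the same character that is in the string (for example, save and compile the file as UTF-8). Then a plain literal replacement is enough:
Code example:
String blub = "~KARACHIÂ¦~~~~~~";
System.out.println(blub);
// replace() takes a literal string, so no regex escaping is needed
System.out.println(blub.replace("Â", ""));
Output:
~KARACHIÂ¦~~~~~~
~KARACHI¦~~~~~~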
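One possible explanation for the stray Â, stated here as an assumption because the origin of the string is not shown: ¦ (U+00A6) is encoded in UTF-8 as the two bytes C2 A6, and decoding those bytes as ISO-8859-1 yields exactly Â¦. The sketch below shows two ways to clean such a string. The class and method names are made up for illustration, and the re-decoding variant only applies if the text really was UTF-8 read as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeFix {
    // Removes only the stray U+00C2 (Â); every other character, including ¦, is kept.
    static String stripStrayA(String s) {
        return s.replace("\u00C2", "");
    }

    // Assumption: the text was UTF-8 but was decoded as ISO-8859-1.
    // Turning the chars back into bytes and decoding them as UTF-8 undoes that.
    static String redecode(String s) {
        byte[] raw = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // \u00C2 is Â and \u00A6 is ¦, written as escapes to avoid source-encoding issues
        String garbled = "~IQBAL~KARACHI\u00C2\u00A6~~~";
        System.out.println(stripStrayA(garbled));
        System.out.println(redecode(garbled));
    }
}
```

If the re-decoding variant applies, fixing the charset at the point where the bytes are first turned into a String is usually cleaner than repairing the text afterwards.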

Related

how to get rid of #011 characters in java

I get this string when I read the content over JMSQ. When I print it, I see the following line. I think those are vertical tab characters in the XML, but how do I get rid of them?
#011#011#011<xeh:eid>dljfl</xeh:eid>
I have tried
replaceAll("[\\x0B]", "");
but it's not working.

Just do this:
String a = "#011#011#011<xeh:eid>dljfl</xeh:eid>";
String a_wo_vt_chars = a.replaceAll("#011", "");

"#011#011#011<xeh:eid>dljfl</xeh:eid>".replaceAll("#011", "") works fine, results in <xeh:eid>dljfl</xeh:eid>
According to the Pattern javadoc, \xhh stands for "the character with hexadecimal value 0xhh". But I guess in your string literal, #011 is just literal characters.
If I try to replicate the vertical tab in a string literal, it works with \\x0B:
"\u000b\u000b\u000b<xeh:eid>dljfl</xeh:eid>".replaceAll("\\x0B", "")
But maybe we are reading it wrong. While 0x0B is 11 (the vertical tab), 0x11 would be 17, and 011 read as octal would be 9 (the horizontal tab).

If #011 represents the hex value of the character, you can use
a = a.replaceAll("\\u0011", "");
// or
a = a.replaceAll("\\x11", "");
But if #011 represents the octal value (character 9, the horizontal tab), then use
a = a.replaceAll("\\011", "");
Also see Unicode Regular Expressions
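Since it is not clear from the question whether the string holds the four literal characters #011 or a real control character, a cautious cleanup can handle both. This is only a sketch: the class and method names are invented, and treating #011 as an octal escape for the tab character is an assumption about how the text was logged.

```java
public class ControlCharCleanup {
    static String clean(String s) {
        return s.replace("#011", "")            // the literal four-character marker
                .replaceAll("[\\t\\x0B]", "");  // a real tab (0x09) or vertical tab (0x0B)
    }

    public static void main(String[] args) {
        System.out.println(clean("#011#011#011<xeh:eid>dljfl</xeh:eid>"));
        System.out.println(clean("\t\u000B<xeh:eid>dljfl</xeh:eid>"));
    }
}
```

Both calls print the bare element. Note that the result of replace and replaceAll has to be used or assigned, because Java strings are immutable.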

Regex for replacing all newlines that are not after ';'

Could someone help me with a regex that matches every \n except a \n that comes right after a ;?
Example:
test
test1
test2;
test
test1;
test
test1;
will be changed to
testtest1test2;
testtest1;
testtest1;

This regex can be used to find those newlines: (?<!;)\n What it means is basically a newline that is not preceded by a ;. If carriage returns can be present in your document, adding only \r? before \n is not quite enough: in ;\r\n the character before \n is \r rather than ;, so the bare \n would still match. Use (?<!;)(?<!;\r)\r?\n in that case.
Simply replace the matches with "" (an empty string) to remove the newlines.

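You can use:
import java.io.File;
import java.util.Scanner;
// read the complete file into a string (new Scanner(File) throws FileNotFoundException)
String data = new Scanner(new File("file.txt")).useDelimiter("\\Z").next();
// remove all newlines that aren't preceded by a semicolon; the second lookbehind
// stops the \n of a ";\r\n" line ending from matching on its own
String repl = data.replaceAll("(?<!;)(?<!;\\r)(\\r?\\n)+", "");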

Using a negative lookbehind, this should work:
Search for: (?<!;)\n
Replace with: '' (an empty string)
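The lookbehind approach from the answers can be checked against the example in the question with a small program. This is a sketch: the class and method names are invented, and the second lookbehind is an addition that keeps Windows line endings after a semicolon intact.

```java
public class JoinLines {
    // Removes every line break that does not directly follow a semicolon.
    // (?<!;)    rejects "\n" or "\r\n" right after ';'
    // (?<!;\r)  rejects the bare "\n" inside ";\r\n"
    static String joinUnterminated(String text) {
        return text.replaceAll("(?<!;)(?<!;\\r)\\r?\\n", "");
    }

    public static void main(String[] args) {
        String input = "test\ntest1\ntest2;\ntest\ntest1;\ntest\ntest1;\n";
        System.out.print(joinUnterminated(input));
        // prints:
        // testtest1test2;
        // testtest1;
        // testtest1;
    }
}
```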

give an example of using cyrillic in regex java

How do I build a regex from a Cyrillic string? I want to use it somehow like this:
String.replaceAll("Кириллица","")
Of course it doesn't work. What should I do to make it work?
OK, I see that the method works in general, but it doesn't work for me. How can I check why the method has no effect?
...
Hm, I tried to use s1 = s1.replaceAll("[\\p{InCyrillic}]", ""); on the string I receive through a socket. It works great: all Cyrillic characters disappear, including the word "Экзамен". But if I try s1=s1.replaceAll("Экзамен",""), nothing happens.
Yet s1=s1.replaceAll("Экзамен","") worked in the same program for a static string defined in the program itself. I guess the problem may be a wrong charset, but I still can't understand what I am doing wrong. The charset of the string is windows-1251. I tried to experiment with the charset in the program (it is a JSP now), using
System.setProperty("file.encoding", "windows-1251");
response.setCharacterEncoding("windows-1251");
and I tried converting the string from one charset to another. Nothing changes.

It might be clearer if you showed your result for the case in @Henry's answer.
I suppose the issue is in the characters or in the encoding.
To check whether the string really is Cyrillic, you can use this code:
String s1 = "Экз\u0061мен"; // \u0061 is the Latin letter a, not the Cyrillic а
s1 = s1.replaceAll("[\\p{InCyrillic}]", "");
System.out.println(s1);
The code removes all Cyrillic characters, so whatever is left over is not Cyrillic.
If your result is something like "a", "e", or "ae", your string contains Latin characters that look like Cyrillic ones, so you should replace using this regex:
s1 = s1.replaceAll("Экз[\u0430a]м[\u0435e]н", "");
where \u0430 is the Cyrillic а and a is the Latin a, and likewise \u0435 is the Cyrillic е and e is the Latin e.
If your result is still "Экзaмен", the issue is in the encoding, and I hope this link will help you:
How to determine if a String contains invalid encoded characters
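To see which characters are actually in the string, it can help to print each code point together with its Unicode script. The following is a diagnostic sketch with invented class and method names. It relies only on Character.UnicodeScript from the standard library.

```java
import java.util.stream.Collectors;

public class ScriptCheck {
    // True if any letter in s belongs to the Latin script.
    static boolean hasLatinLetters(String s) {
        return s.codePoints()
                .filter(Character::isLetter)
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.LATIN);
    }

    // Lists every code point with its script, e.g. "U+0430:CYRILLIC U+0061:LATIN".
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X:%s", cp, Character.UnicodeScript.of(cp)))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // "Экзамен" with a Latin a (U+0061) in place of the Cyrillic а (U+0430)
        String suspicious = "\u042D\u043A\u0437\u0061\u043C\u0435\u043D";
        System.out.println(hasLatinLetters(suspicious)); // true
        System.out.println(dump(suspicious));
    }
}
```

Running dump on the string received from the socket and on the literal used in replaceAll shows at a glance whether the two really contain the same characters.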

Just tried this:
String s1 = "Введение в специальность (Б.3.2.1-ПиКО)60,3Экзамен";
String s2 = s1.replaceAll("Экзамен", "");
System.out.println(s2);
The output is:
Введение в специальность (Б.3.2.1-ПиКО)60,3
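If the literal works in one program but not in another, the charset used to decode the incoming bytes, or the encoding the source file (here a JSP) is compiled with, may differ. The sketch below is an illustration under assumptions: the byte array stands in for data read from the socket, the class and method names are invented, and it assumes the JVM provides the windows-1251 charset. Writing the word with \u escapes keeps the source-file encoding out of the picture.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDecode {
    // "Экзамен" written with escapes so the source-file encoding cannot interfere
    static final String WORD = "\u042D\u043A\u0437\u0430\u043C\u0435\u043D";

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        byte[] wire = WORD.getBytes(cp1251); // stands in for bytes read from the socket

        String wrong = decode(wire, StandardCharsets.ISO_8859_1); // wrong charset
        String right = decode(wire, cp1251);                      // matching charset

        System.out.println(wrong.replace(WORD, "").isEmpty()); // false: nothing matched
        System.out.println(right.replace(WORD, "").isEmpty()); // true: the word was removed
    }
}
```

Changing the file.encoding system property at runtime is generally not a reliable fix. Passing the charset explicitly where bytes become a String, as above, avoids depending on the platform default.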

Unable to split a string

I have a string
Mr praneel PIDIKITI
When I use this regular expression
String[] nameParts = name.split("\\s+");
instead of getting three parts I am only getting two, Mr and Praneel PIDIKITI.
I am unable to split the second string. Does anyone know what could be the problem?
I even used split(" ");.
The problem is I used replaceAll("\\<.*?>", " ").trim(); to convert html into this string and then I am using name.split("\\s+"); to get the name value.
I think it must be something other than space (some special character).

Your code should work. I suspect your input. There could be a non printable junk character between Praneel and PIDIKITI. For example,
String name = "Mr praneel" + (char)1 +"PIDIKITI";
String[] nameParts = name.split("\\s+");
for (String s : nameParts)
    System.out.println(s);
Are you sure that there is no junk character between Praneel and PIDIKITI?
Remove non-printable characters like this:
// remove non-printable characters, excluding whitespace characters
// (\p{Print} covers printable ASCII only, so non-ASCII characters are removed as well)
name = name.replaceAll("[^\\p{Print}\\s]","");

If you're parsing HTML, may I recommend JSoup? It's a good HTML parser for Java.
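Because the string was produced from HTML, one plausible culprit (an assumption, since the actual input is not shown) is a no-break space, U+00A0, which is what &nbsp; decodes to. By default \s in a Java regex does not match it. The sketch below, with invented names, treats U+00A0 as a separator as well.

```java
public class NameSplitter {
    // Whitespace as matched by \s, plus the no-break space U+00A0
    private static final String SEPARATORS = "[\\s\\u00A0]+";

    static String[] splitName(String name) {
        // Drop leading separators first so the result has no empty first element
        return name.replaceFirst("^" + SEPARATORS, "").split(SEPARATORS);
    }

    public static void main(String[] args) {
        String name = "Mr praneel\u00A0PIDIKITI";
        System.out.println(name.split("\\s+").length);  // 2
        System.out.println(splitName(name).length);     // 3
    }
}
```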

replace \n and \r\n with <br /> in java

This has been asked several times for several languages but I can't get it to work.
I have a string like this
String str = "This is a string.\nThis is a long string.";
And I'm trying to replace the \n with <br /> using
str = str.replaceAll("(\r\n|\n)", "<br />");
but the \n is not getting replaced.
I tried to use this RegEx Tool to verify and I see the same result: the input string does not have a match for "(\r\n|\n)". What am I doing wrong?

It works for me.
public class Program
{
    public static void main(String[] args) {
        String str = "This is a string.\nThis is a long string.";
        str = str.replaceAll("(\r\n|\n)", "<br />");
        System.out.println(str);
    }
}
Result:
This is a string.<br />This is a long string.
Your problem is somewhere else.

For me, this worked:
rawText.replaceAll("(\\\\r\\\\n|\\\\n)", "\\\n");
Tip: use regex tester for quick testing without compiling in your environment

A little more robust version of what you're attempting:
str = str.replaceAll("(\r\n|\n\r|\r|\n)", "<br />");
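On Java 8 or later, the regex linebreak matcher \R is a compact alternative to listing the variants by hand. It treats \r\n as a single break and also matches \n and \r, as well as less common break characters such as U+0085, U+2028, and U+2029. A minimal sketch with invented names:

```java
public class BrConverter {
    // \R matches any line break sequence; "\r\n" counts as one break
    static String toBr(String s) {
        return s.replaceAll("\\R", "<br />");
    }

    public static void main(String[] args) {
        System.out.println(toBr("This is a string.\nThis is a long string."));
        // prints: This is a string.<br />This is a long string.
    }
}
```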

Since my account is new I can't up-vote Nino van Hooff's answer. If your strings are coming from a Windows-based source such as an ASPX-based server, this solution does work:
rawText.replaceAll("(\\\\r\\\\n|\\\\n)", "<br />");
At first this looked like a weird character set issue, but the quadruple backslashes make the regex match a literal backslash followed by r or n. So the strings evidently contain the escape sequences \r\n and \n as plain text rather than real line-break characters.
Again, under most circumstances "(\\r\\n|\\n)" should work, but if your strings contain literal escape sequences, try the above.
Just an FYI: I tried everything to correct the issue I was having replacing those line endings. At first I thought it was a failed conversion from Windows-1252 to UTF-8, but that didn't work either. This solution is what finally did the trick. :)

It works for me. The Java code works exactly as you wrote it. In the tester, the input string should be:
This is a string.
This is a long string.
...with a real linefeed. You can't use:
This is a string.\nThis is a long string.
...because it treats \n as the literal sequence backslash 'n'.
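The difference between a real line break and the two-character text backslash-n is easy to reproduce in Java itself. In the sketch below (class and method names are invented), realToBr handles actual line-break characters and literalToBr handles escape sequences that arrived as plain text.

```java
public class EscapedNewlines {
    // Replaces real CR LF or LF characters
    static String realToBr(String s) {
        return s.replaceAll("\r\n|\n", "<br />");
    }

    // Replaces the literal text \r\n or \n (a backslash followed by a letter).
    // String.replace is not regex-based, so no extra escaping is needed.
    static String literalToBr(String s) {
        return s.replace("\\r\\n", "<br />").replace("\\n", "<br />");
    }

    public static void main(String[] args) {
        String real = "line one\nline two";      // contains a real linefeed
        String literal = "line one\\nline two";  // contains a backslash and an n

        System.out.println(realToBr(real));       // line one<br />line two
        System.out.println(realToBr(literal));    // unchanged: line one\nline two
        System.out.println(literalToBr(literal)); // line one<br />line two
    }
}
```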

That should work, but don't kill yourself trying to figure it out. Just use 2 passes.
str = str.replaceAll("(\r\n)", "<br />");
str = str.replaceAll("(\n)", "<br />");
Disclaimer: this is not very efficient.

This should work too. You can put in two slashes, so that the escapes are interpreted by the regex engine instead of the Java compiler:
str = str.replaceAll("(\\r\\n|\\n)", "<br />");
In this Reference, there is an example which shows
private final String REGEX = "\\d"; // a single digit
I have used two slashes in many of my projects and it seems to work fine!

