Remove Arabic non-alpha-numeric characters from a string in Java

How can I remove all non-alpha-numeric Arabic characters from a string in Java?

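Use the regex [^A-Za-z0-9 ] with replaceAll. It matches every character that is not an ASCII letter (A to Z, a to z), an ASCII digit (0 to 9), or a space, so replacing the matches with "" leaves only those characters. Note that this removes Arabic letters as well.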

Here is the complete answer, which keeps ASCII letters, ASCII digits, spaces, and the Arabic range from U+0623 to U+0652:
// Requires java.util.regex.Pattern; `string` is the input text.
// Everything outside the listed ranges is matched and removed.
Pattern pattern = Pattern.compile("[^A-Za-z0-9\\u0623-\\u0652 ]");
String normalizedString = pattern.matcher(string).replaceAll("");
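As a hedged alternative to listing character ranges by hand, Java's regex engine also understands Unicode general categories. The sketch below (the class and method names are made up for illustration) keeps letters and decimal digits of any script plus the space character, and drops everything else, including Arabic punctuation such as the Arabic comma. It assumes that combining marks (harakat) may be dropped too, since they belong to category Mn rather than L; add \p{Mn} to the character class if they should stay.

```java
final class ArabicFilter {
    // Hypothetical helper, not taken from the answers in this thread.
    // \p{L} is any letter, \p{Nd} is any decimal digit; the trailing space keeps spaces.
    static String keepLettersAndDigits(String input) {
        return input.replaceAll("[^\\p{L}\\p{Nd} ]", "");
    }

    public static void main(String[] args) {
        // An Arabic word, an Arabic comma (U+060C), then ASCII text
        String sample = "\u0645\u0631\u062D\u0628\u0627\u060C world! 123";
        System.out.println(keepLettersAndDigits(sample));
    }
}
```

The Arabic comma and the exclamation mark are punctuation, so both are removed while the Arabic letters, the Latin letters, the digits, and the spaces survive.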

I tried multiple solutions and none of them worked completely, including every solution in this thread and those in the question "how could i remove arabic punctuation form a String in java".
Since nothing else worked, I wrote a method that retains only Arabic characters and removes everything else:
public static String findArabicString(String s) {
StringBuilder finalValue = new StringBuilder();
if (null != s) {
for (int i = 0; i < s.length();) {
int c = s.codePointAt(i);
if ((c >= 0x0600 && c <= 0x06E0))
finalValue.append((char) c);
i += Character.charCount(c);
}
}
System.out.println(finalValue.toString());
return finalValue.toString();
}
The method can be customized as required. For example, to retain spaces as well as Arabic characters, only the test condition needs a slight change:
public static String findArabicString(String s) {
StringBuilder finalValue = new StringBuilder();
if (null != s) {
for (int i = 0; i < s.length();) {
int c = s.codePointAt(i);
// 32 is unicode for white space
if ((c >= 0x0600 && c <= 0x06E0) || c == 32)
finalValue.append((char) c);
i += Character.charCount(c);
}
}
System.out.println(finalValue.toString());
return finalValue.toString();
}
I hope this helps anyone facing a similar issue.

To remove Arabic characters from a string (everything in the Unicode Arabic block), you can use the method below:
public void removeArabicChars() {
String input = "This string contains Arabic characters هذا النص يحتوي على حروف عربية";
String output = input.replaceAll("\\p{InArabic}", "");
System.out.println(output);
}

Related

Removing supplementary characters from a Java string [duplicate]

I have a Java string that contains supplementary characters (characters in the Unicode standard whose code points are above U+FFFF). These characters could for example be emojis. I want to remove those characters from the string, i.e. replace them with the empty string "".
How do I remove supplementary characters from a string?
How do I remove characters from an arbitrary code point range? (For example all characters in the range 1F000–1FFFF)?

There are a couple of approaches. As regex replace is expensive, maybe do:
String basic(String s) {
StringBuilder sb = new StringBuilder();
for (char ch : s.toCharArray()) {
if (!Character.isLowSurrogate(ch) && !Character.isHighSurrogate(ch)) {
sb.append(ch);
}
}
return sb.length() == s.length() ? s : sb.toString();
}

You can get a character's unicode value by simply converting it to an int.
Therefore, you'll want to do the following:
Convert your String to a char[], or loop over each character in the String using String.charAt().
Check whether the Unicode value is one you want to remove.
If so, leave the character out of the result.
This is just to start you off; if you're still struggling, I can try to type out a whole example.
Good luck!
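The steps above can be sketched as follows. This is a minimal illustration, not the answerer's code: the class and method names are hypothetical, and it iterates over code points rather than chars so that a supplementary character (stored as two chars) is tested and removed as one unit.

```java
import java.util.function.IntPredicate;

final class CodePointFilter {
    // Hypothetical helper: returns s without the code points for which shouldRemove is true.
    static String removeIf(String s, IntPredicate shouldRemove) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()                               // one int per code point, pairs already combined
         .filter(cp -> !shouldRemove.test(cp))       // keep only what should not be removed
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String withEmoji = "a\uD83D\uDE0Ab";         // 'a', U+1F60A, 'b'
        System.out.println(removeIf(withEmoji, Character::isSupplementaryCodePoint));
        System.out.println(removeIf(withEmoji, cp -> cp >= 0x1F000 && cp <= 0x1FFFF));
    }
}
```

Passing the test as a predicate keeps one loop for both cases in the question: all supplementary characters, or an arbitrary code point range.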

Here is a code snippet that collects the characters whose code points lie strictly between 60 and 100:
public class Test {
public static void main(String[] args) {
new Test().go();
}
private void go() {
String s = "ABC12三￮";
StringBuilder ret = new StringBuilder();
// Step by code point so a surrogate pair is never split.
for (int i = 0; i < s.length(); ) {
int cp = s.codePointAt(i);
System.out.println("code point: " + cp);
if (cp > 60 && cp < 100) {
ret.appendCodePoint(cp);
}
i += Character.charCount(cp);
}
System.out.println("result: " + ret);
}
}
the result:
code point: 65
code point: 66
code point: 67
code point: 49
code point: 50
code point: 19977
code point: 65518
result: ABC
Hope this helps.

Java strings are UTF-16 encoded. The String type has a codePointAt() method for retrieving a decoded codepoint at a given char (codeunit) index.
So, you can do something like this, for instance:
String removeSupplementaryChars(String s)
{
int len = s.length();
if (len == 0)
return "";
StringBuilder sb = new StringBuilder(len);
int i = 0;
do
{
if (s.codePointAt(i) <= 0xFFFF)
sb.append(s.charAt(i));
i = s.offsetByCodePoints(i, 1);
}
while (i < len);
return sb.toString();
}
Or this:
String removeCodepointsinRange(String s, int lower, int upper)
{
int len = s.length();
if (len == 0)
return "";
StringBuilder sb = new StringBuilder(len);
int i = 0;
do
{
int cp = s.codePointAt(i);
if ((cp < lower) || (cp > upper))
sb.appendCodePoint(cp);
i = s.offsetByCodePoints(i, 1);
}
while (i < len);
return sb.toString();
}

Java: Display unicode chars as chars when printing string [duplicate]

I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example:
"\u0048\u0065\u006C\u006C\u006F World"
should become
"Hello World"
I know that when I print the first string it already shows Hello World. My problem is that I read file names from a file and then search for them. The file names in the file are escaped with Unicode encoding, and when I search for the files I can't find them, since the search looks for a file with \uXXXX in its name.

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.
import org.apache.commons.lang.StringEscapeUtils;
@Test
public void testUnescapeJava() {
String sJava="\\u0048\\u0065\\u006C\\u006C\\u006F";
System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
}
output:
StringEscapeUtils.unescapeJava(sJava):
Hello

Technically doing:
String myString = "\u0048\u0065\u006C\u006C\u006F World";
automatically converts it to "Hello World", so I assume you are reading the string in from some file. To convert it to "Hello", you'll have to parse the text into the separate Unicode escapes (take each \uXXXX and keep just XXXX), then call Integer.parseInt(XXXX, 16) to get the numeric value, and then cast that to char to get the actual character.
Edit: Some code to accomplish this:
String str = myString.split(" ")[0];
str = str.replace("\\","");
String[] arr = str.split("u");
String text = "";
for(int i = 1; i < arr.length; i++){
int hexVal = Integer.parseInt(arr[i], 16);
text += (char)hexVal;
}
// Text will now have Hello

You can use StringEscapeUtils from Apache Commons Lang, i.e.:
String Title = StringEscapeUtils.unescapeJava("\\u0048\\u0065\\u006C\\u006C\\u006F");

This simple method will work for most cases, but it trips up on something like "\u005Cu0048", which should decode to the string "\u0048" but actually decodes to "H": the first pass produces "\u0048" as the working string, which then gets processed again by the while loop.
static final String decode(final String in)
{
String working = in;
int index;
index = working.indexOf("\\u");
while(index > -1)
{
int length = working.length();
if(index > (length-6))break;
int numStart = index + 2;
int numFinish = numStart + 4;
String substring = working.substring(numStart, numFinish);
int number = Integer.parseInt(substring,16);
String stringStart = working.substring(0, index);
String stringEnd = working.substring(numFinish);
working = stringStart + ((char)number) + stringEnd;
index = working.indexOf("\\u");
}
return working;
}

Shorter version:
public static String unescapeJava(String escaped) {
if(escaped.indexOf("\\u")==-1)
return escaped;
String processed="";
int position=escaped.indexOf("\\u");
while(position!=-1) {
if(position!=0)
processed+=escaped.substring(0,position);
String token=escaped.substring(position+2,position+6);
escaped=escaped.substring(position+6);
processed+=(char)Integer.parseInt(token,16);
position=escaped.indexOf("\\u");
}
processed+=escaped;
return processed;
}

StringEscapeUtils from org.apache.commons.lang3 library is deprecated as of 3.6.
So you can use their new commons-text library instead:
compile 'org.apache.commons:commons-text:1.9'
OR
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.9</version>
</dependency>
Example code:
org.apache.commons.text.StringEscapeUtils.unescapeJava(escapedString);

With Kotlin you can write your own extension function for String
fun String.unescapeUnicode() = replace("\\\\u([0-9A-Fa-f]{4})".toRegex()) {
String(Character.toChars(it.groupValues[1].toInt(radix = 16)))
}
and then
fun main() {
val originalString = "\\u0048\\u0065\\u006C\\u006C\\u006F World"
println(originalString.unescapeUnicode())
}

It's not totally clear from your question, but I'm assuming you are saying that you have a file where each line is a filename, and each filename looks something like this:
\u0048\u0065\u006C\u006C\u006F
In other words, the characters in the file of filenames are \, u, 0, 0, 4, 8 and so on.
If so, what you're seeing is expected. Java only translates \uXXXX sequences in string literals in source code (and when reading in stored Properties objects). When you read the contents of your file, you will have a string consisting of the characters \, u, 0, 0, 4, 8 and so on, and not the string Hello.
So you will need to parse that string to extract the 0048, 0065, etc. pieces and then convert them to chars and make a string from those chars and then pass that string to the routine that opens the file.
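The difference between an escape in source code and the same six characters read from a file can be seen in a small sketch. The class and method names are made up for illustration, and the second literal uses a doubled backslash purely to imitate what a file reader would return.

```java
final class EscapeDemo {
    // The compiler decodes the escape inside a literal: this is the one-character string "H".
    static String fromLiteral() {
        return "\u0048";
    }

    // The doubled backslash yields six separate characters,
    // which is what a line read from the file of filenames contains.
    static String asReadFromFile() {
        return "\\u0048";
    }

    public static void main(String[] args) {
        System.out.println(fromLiteral().length());
        System.out.println(asReadFromFile().length());
    }
}
```

Only the second form needs to be parsed at run time; the first never reaches the program as an escape at all.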

For Java 9+, you can use the replaceAll method of the Matcher class that accepts a replacement function.
private static final Pattern UNICODE_PATTERN = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");
public static String unescapeUnicode(String escaped) {
// quoteReplacement keeps a decoded '\' or '$' from being treated as replacement syntax
return UNICODE_PATTERN.matcher(escaped).replaceAll(r -> Matcher.quoteReplacement(String.valueOf((char) Integer.parseInt(r.group(1), 16))));
}
public static void main(String[] args) {
String originalMessage = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
String unescapedMessage = unescapeUnicode(originalMessage);
System.out.println(unescapedMessage);
}
I believe the main advantage of this approach over unescapeJava by StringEscapeUtils (besides not using an extra library) is that you can convert only the unicode characters (if you wish), since the latter converts all escaped Java characters (like \n or \t). If you prefer to convert all escaped characters the library is really the best option.

Updates regarding answers suggesting using The Apache Commons Lang's:
StringEscapeUtils.unescapeJava() - it was deprecated,
Deprecated.
as of 3.6, use commons-text StringEscapeUtils instead
The replacement is Apache Commons Text's StringEscapeUtils.unescapeJava()

Just wanted to contribute my version, using regex:
private static final String UNICODE_REGEX = "\\\\u([0-9A-Fa-f]{4})";
private static final Pattern UNICODE_PATTERN = Pattern.compile(UNICODE_REGEX);
...
// The doubled backslashes stand in for text read from a file; a single-backslash escape in a literal is already decoded by the compiler.
String message = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
Matcher matcher = UNICODE_PATTERN.matcher(message);
StringBuffer decodedMessage = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(
decodedMessage, Matcher.quoteReplacement(String.valueOf((char) Integer.parseInt(matcher.group(1), 16))));
}
matcher.appendTail(decodedMessage);
System.out.println(decodedMessage.toString());

I wrote a fast, error-tolerant solution:
public static final String decode(final String in) {
int p1 = in.indexOf("\\u");
if (p1 < 0)
return in;
StringBuilder sb = new StringBuilder(in.length());
// Keep any text that precedes the first escape
sb.append(in, 0, p1);
while (true) {
int p2 = p1 + 6;
if (p2 > in.length()) {
sb.append(in.subSequence(p1, in.length()));
break;
}
try {
int c = Integer.parseInt(in.substring(p1 + 2, p1 + 6), 16);
sb.append((char) c);
p1 += 6;
} catch (Exception e) {
sb.append(in.subSequence(p1, p1 + 2));
p1 += 2;
}
int p0 = in.indexOf("\\u", p1);
if (p0 < 0) {
sb.append(in.subSequence(p1, in.length()));
break;
} else {
sb.append(in.subSequence(p1, p0));
p1 = p0;
}
}
return sb.toString();
}

One easy way I know of uses JSONObject:
try {
JSONObject json = new JSONObject();
json.put("string", myString);
String converted = json.getString("string");
} catch (JSONException e) {
e.printStackTrace();
}

Fast
fun unicodeDecode(unicode: String): String {
val stringBuffer = StringBuilder()
var i = 0
while (i < unicode.length) {
if (i + 1 < unicode.length)
if (unicode[i].toString() + unicode[i + 1].toString() == "\\u") {
val symbol = unicode.substring(i + 2, i + 6)
val c = Integer.parseInt(symbol, 16)
stringBuffer.append(c.toChar())
i += 5
} else stringBuffer.append(unicode[i])
i++
}
return stringBuffer.toString()
}

UnicodeUnescaper from Apache Commons Text does exactly what you want, and ignores any other escape sequences.
String input = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
String output = new UnicodeUnescaper().translate(input);
assert("Hello World".equals(output));
assert("\u0048\u0065\u006C\u006C\u006F World".equals(output));
Where input would be the string you are reading from a file.

Try:
private static final Charset UTF_8 = Charset.forName("UTF-8");
private String forceUtf8Coding(String input) { return new String(input.getBytes(UTF_8), UTF_8); }

Actually, I wrote an open-source library that contains some utilities. One of them converts a Unicode sequence to a String and vice versa. I found it very useful. Here is the quote about the Unicode converter from the article about this library:
Class StringUnicodeEncoderDecoder has methods that can convert a
String (in any language) into a sequence of Unicode characters and
vice versa. For example a String "Hello World" will be converted into
"\u0048\u0065\u006c\u006c\u006f\u0020 \u0057\u006f\u0072\u006c\u0064"
and may be restored back.
Here is the link to entire article that explains what Utilities the library has and how to get the library to use it. It is available as Maven artifact or as source from Github. It is very easy to use. Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison

Here is my solution...
String decodedName = JwtJson.substring(startOfName, endOfName);
StringBuilder builtName = new StringBuilder();
int i = 0;
while ( i < decodedName.length() )
{
if ( decodedName.substring(i).startsWith("\\u"))
{
i=i+2;
builtName.append(Character.toChars(Integer.parseInt(decodedName.substring(i,i+4), 16)));
i=i+4;
}
else
{
builtName.append(decodedName.charAt(i));
i = i+1;
}
};

I found that many of the answers did not address the issue of "Supplementary Characters". Here is the correct way to support it. No third-party libraries, pure Java implementation.
http://www.oracle.com/us/technologies/java/supplementary-142654.html
public static String fromUnicode(String unicode) {
String str = unicode.replace("\\", "");
String[] arr = str.split("u");
StringBuffer text = new StringBuffer();
for (int i = 1; i < arr.length; i++) {
int hexVal = Integer.parseInt(arr[i], 16);
text.append(Character.toChars(hexVal));
}
return text.toString();
}
public static String toUnicode(String text) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < text.length(); i++) {
int codePoint = text.codePointAt(i);
// Skip over the second char in a surrogate pair
if (codePoint > 0xffff) {
i++;
}
String hex = Integer.toHexString(codePoint);
sb.append("\\u");
for (int j = 0; j < 4 - hex.length(); j++) {
sb.append("0");
}
sb.append(hex);
}
return sb.toString();
}
@Test
public void toUnicode() {
System.out.println(toUnicode("😊"));
System.out.println(toUnicode("🥰"));
System.out.println(toUnicode("Hello World"));
}
// output:
// \u1f60a
// \u1f970
// \u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
@Test
public void fromUnicode() {
System.out.println(fromUnicode("\\u1f60a"));
System.out.println(fromUnicode("\\u1f970"));
System.out.println(fromUnicode("\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u0057\\u006f\\u0072\\u006c\\u0064"));
}
// output:
// 😊
// 🥰
// Hello World

@NominSim
There may be other characters after an escape, so I detect the escape by its length.
private String forceUtf8Coding(String str) {
str = str.replace("\\","");
String[] arr = str.split("u");
StringBuilder text = new StringBuilder();
for(int i = 1; i < arr.length; i++){
String a = arr[i];
String b = "";
if (arr[i].length() > 4){
a = arr[i].substring(0, 4);
b = arr[i].substring(4);
}
int hexVal = Integer.parseInt(a, 16);
text.append((char) hexVal).append(b);
}
return text.toString();
}

An alternative is the chars() method, available on CharSequence since Java 8. It iterates over the UTF-16 char values, and any char that is a surrogate is passed through uninterpreted. Note that the escapes in the literal below are decoded by the compiler, not by chars(). It can be used as:
String myString = "\u0048\u0065\u006C\u006C\u006F World";
myString.chars().forEach(a -> System.out.print((char)a));
// would print "Hello World"

Solution for Kotlin:
val sourceContent = File("test.txt").readText(Charset.forName("windows-1251"))
val result = String(sourceContent.toByteArray())
Kotlin uses UTF-8 everywhere as default encoding.
Method toByteArray() has default argument - Charsets.UTF_8.

Check and extract a number from a String in Java

I'm writing a program where the user enters a String in the following format:
"What is the square of 10?"
I need to check that there is a number in the String
and then extract just the number.
If I use .contains("\\d+") or .contains("[0-9]+"), the program can't find a number in the String, no matter what the input is, while .matches("\\d+") only works when the String contains nothing but digits.
What can I use as a solution for finding and extracting?
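The behaviour described in the question follows from the methods themselves: String.contains() looks for the literal text and does not interpret regex syntax, and String.matches() must match the entire string. Matcher.find() searches for a match anywhere in the input, which covers both checking and extracting. A minimal sketch, with hypothetical class and method names, assuming the number is a run of ASCII digits small enough to fit in an int:

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumberExtractor {
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Returns the first run of digits as an int, or empty if there is none.
    // Assumption: the run fits in an int; otherwise parseInt throws NumberFormatException.
    static OptionalInt firstNumber(String s) {
        Matcher m = DIGITS.matcher(s);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group()))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        OptionalInt n = firstNumber("What is the square of 10?");
        if (n.isPresent()) {
            int value = n.getAsInt();
            System.out.println(value + " squared = " + (value * value));
        }
    }
}
```

Returning an OptionalInt makes the "no number found" case explicit instead of relying on an empty string or a sentinel value.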

try this
str.matches(".*\\d.*");

If you want to extract the first number out of the input string, you can do-
public static String extractNumber(final String str) {
if(str == null || str.isEmpty()) return "";
StringBuilder sb = new StringBuilder();
boolean found = false;
for(char c : str.toCharArray()){
if(Character.isDigit(c)){
sb.append(c);
found = true;
} else if(found){
// If we already found a digit before and this char is not a digit, stop looping
break;
}
}
return sb.toString();
}
Examples:
For input "123abc", the method above will return 123.
For "abc1000def", 1000.
For "555abc45", 555.
For "abc", will return an empty string.

I think this is faster than a regex.
public final boolean containsDigit(String s) {
boolean containsDigit = false;
if (s != null && !s.isEmpty()) {
for (char c : s.toCharArray()) {
if (containsDigit = Character.isDigit(c)) {
break;
}
}
}
return containsDigit;
}

s = s.replaceAll("[a-zA-Z]", "") removes all ASCII letters
s = s.replaceAll("[0-9]", "") removes all digits
If you apply both replacements, you are left with a string of only the special characters.
If you want to extract only the digits from a String: s = s.replaceAll("[^0-9]", "")
If you want to extract only the letters from a String: s = s.replaceAll("[^a-zA-Z]", "")
Happy coding :)

The code below is enough to check whether a String contains a digit in Java:
Pattern p = Pattern.compile("[0-9]");
Matcher m = p.matcher("Here is your string");
if (m.find()) {
// group() returns the digit that find() just matched
System.out.println("Found digit " + m.group());
}

I could not find a single correct pattern among the answers.
Here is a small and simple solution:
String regex = "(.)*(\\d)(.)*";
Pattern pattern = Pattern.compile(regex);
String msg = "What is the square of 10?";
boolean containsNumber = pattern.matcher(msg).matches();

Pattern p = Pattern.compile("([A-Z].*[0-9])");
Matcher m = p.matcher("TEST 123");
boolean b = m.find();
System.out.println(b);

The solution I went with looks like this:
Pattern numberPat = Pattern.compile("\\d+");
Matcher matcher1 = numberPat.matcher(line);
Pattern stringPat = Pattern.compile("What is the square of", Pattern.CASE_INSENSITIVE);
Matcher matcher2 = stringPat.matcher(line);
if (matcher1.find() && matcher2.find())
{
int number = Integer.parseInt(matcher1.group());
pw.println(number + " squared = " + (number * number));
}
I'm sure it's not a perfect solution, but it suited my needs. Thank you all for the help. :)

Try the following pattern:
.matches("[a-zA-Z ]*\\d+.*")

The code snippet below tells whether the String contains a digit:
str.matches(".*\\d.*")
or
str.matches(".*[0-9].*")
For example:
String str = "abhinav123";
str.matches(".*\\d.*") or str.matches(".*[0-9].*") will return true
str = "abhinav";
str.matches(".*\\d.*") or str.matches(".*[0-9].*") will return false

As I was redirected here searching for a method to find digits in string in Kotlin language, I'll leave my findings here for other folks wanting a solution specific to Kotlin.
Finding out if a string contains digit:
val hasDigits = sampleString.any { it.isDigit() }
Finding out if a string contains only digits:
val hasOnlyDigits = sampleString.all { it.isDigit() }
Extract digits from string:
val onlyNumberString = sampleString.filter { it.isDigit() }

public String hasNums(String str) {
char[] nums = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' };
char[] toChar = new char[str.length()];
for (int i = 0; i < str.length(); i++) {
toChar[i] = str.charAt(i);
for (int j = 0; j < nums.length; j++) {
if (toChar[i] == nums[j]) { return str; }
}
}
return "None";
}

You can try this
String text = "ddd123.0114cc";
String numOnly = text.replaceAll("\\p{Alpha}","");
try {
double numVal = Double.valueOf(numOnly);
System.out.println(text +" contains numbers");
} catch (NumberFormatException e){
System.out.println(text + " does not contain numbers");
}

As you don't only want to look for a number but also extract it, you should write a small function doing that for you. Go letter by letter till you spot a digit. Ah, just found the necessary code for you on stackoverflow: find integer in string. Look at the accepted answer.

.matches(".*\\d+.*") only works for numbers but not other symbols like // or * etc.

ASCII is at the start of UNICODE, so you can do something like this:
(x >= 97 && x <= 122) || (x >= 65 && x <= 90) // 97 == 'a' and 65 = 'A'
I'm sure you can figure out the other values...
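Filling in the remaining range, the check above might look like the sketch below. The class and method names are hypothetical; character literals are used instead of the raw numbers so the ranges are easier to read, and the digit range ('0' to '9', 48 to 57) is the one the answer leaves to the reader.

```java
final class AsciiCheck {
    // True only for ASCII letters and ASCII digits; any other code point,
    // including Arabic letters and punctuation, returns false.
    static boolean isAsciiAlphanumeric(int x) {
        return (x >= 'a' && x <= 'z')      // 97..122
            || (x >= 'A' && x <= 'Z')      // 65..90
            || (x >= '0' && x <= '9');     // 48..57
    }

    public static void main(String[] args) {
        System.out.println(isAsciiAlphanumeric('q'));
        System.out.println(isAsciiAlphanumeric('?'));
    }
}
```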

How to convert a string with Unicode encoding to a string of letters

I have a string with escaped Unicode characters, \uXXXX, and I want to convert it to regular Unicode letters. For example:
"\u0048\u0065\u006C\u006C\u006F World"
should become
"Hello World"
I know that when I print the first string it already shows Hello world. My problem is I read file names from a file, and then I search for them. The files names in the file are escaped with Unicode encoding, and when I search for the files, I can't find them, since it searches for a file with \uXXXX in its name.

The Apache Commons Lang StringEscapeUtils.unescapeJava() can decode it properly.
import org.apache.commons.lang.StringEscapeUtils;
#Test
public void testUnescapeJava() {
String sJava="\\u0048\\u0065\\u006C\\u006C\\u006F";
System.out.println("StringEscapeUtils.unescapeJava(sJava):\n" + StringEscapeUtils.unescapeJava(sJava));
}
output:
StringEscapeUtils.unescapeJava(sJava):
Hello

Technically doing:
String myString = "\u0048\u0065\u006C\u006C\u006F World";
automatically converts it to "Hello World", so I assume you are reading in the string from some file. In order to convert it to "Hello" you'll have to parse the text into the separate unicode digits, (take the \uXXXX and just get XXXX) then do Integer.ParseInt(XXXX, 16) to get a hex value and then case that to char to get the actual character.
Edit: Some code to accomplish this:
String str = myString.split(" ")[0];
str = str.replace("\\","");
String[] arr = str.split("u");
String text = "";
for(int i = 1; i < arr.length; i++){
int hexVal = Integer.parseInt(arr[i], 16);
text += (char)hexVal;
}
// Text will now have Hello

You can use StringEscapeUtils from Apache Commons Lang, i.e.:
String Title = StringEscapeUtils.unescapeJava("\\u0048\\u0065\\u006C\\u006C\\u006F");

This simple method will work for most cases, but would trip up over something like "u005Cu005C" which should decode to the string "\u0048" but would actually decode "H" as the first pass produces "\u0048" as the working string which then gets processed again by the while loop.
static final String decode(final String in)
{
String working = in;
int index;
index = working.indexOf("\\u");
while(index > -1)
{
int length = working.length();
if(index > (length-6))break;
int numStart = index + 2;
int numFinish = numStart + 4;
String substring = working.substring(numStart, numFinish);
int number = Integer.parseInt(substring,16);
String stringStart = working.substring(0, index);
String stringEnd = working.substring(numFinish);
working = stringStart + ((char)number) + stringEnd;
index = working.indexOf("\\u");
}
return working;
}

Shorter version:
public static String unescapeJava(String escaped) {
if(escaped.indexOf("\\u")==-1)
return escaped;
String processed="";
int position=escaped.indexOf("\\u");
while(position!=-1) {
if(position!=0)
processed+=escaped.substring(0,position);
String token=escaped.substring(position+2,position+6);
escaped=escaped.substring(position+6);
processed+=(char)Integer.parseInt(token,16);
position=escaped.indexOf("\\u");
}
processed+=escaped;
return processed;
}

StringEscapeUtils from org.apache.commons.lang3 library is deprecated as of 3.6.
So you can use their new commons-text library instead:
compile 'org.apache.commons:commons-text:1.9'
OR
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.9</version>
</dependency>
Example code:
org.apache.commons.text.StringEscapeUtils.unescapeJava(escapedString);

With Kotlin you can write your own extension function for String
fun String.unescapeUnicode() = replace("\\\\u([0-9A-Fa-f]{4})".toRegex()) {
String(Character.toChars(it.groupValues[1].toInt(radix = 16)))
}
and then
fun main() {
val originalString = "\\u0048\\u0065\\u006C\\u006C\\u006F World"
println(originalString.unescapeUnicode())
}

It's not totally clear from your question, but I'm assuming you saying that you have a file where each line of that file is a filename. And each filename is something like this:
\u0048\u0065\u006C\u006C\u006F
In other words, the characters in the file of filenames are \, u, 0, 0, 4, 8 and so on.
If so, what you're seeing is expected. Java only translates \uXXXX sequences in string literals in source code (and when reading in stored Properties objects). When you read the contents you file you will have a string consisting of the characters \, u, 0, 0, 4, 8 and so on and not the string Hello.
So you will need to parse that string to extract the 0048, 0065, etc. pieces and then convert them to chars and make a string from those chars and then pass that string to the routine that opens the file.

For Java 9+, you can use the new replaceAll method of Matcher class.
private static final Pattern UNICODE_PATTERN = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");
public static String unescapeUnicode(String unescaped) {
return UNICODE_PATTERN.matcher(unescaped).replaceAll(r -> String.valueOf((char) Integer.parseInt(r.group(1), 16)));
}
public static void main(String[] args) {
String originalMessage = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
String unescapedMessage = unescapeUnicode(originalMessage);
System.out.println(unescapedMessage);
}
I believe the main advantage of this approach over unescapeJava by StringEscapeUtils (besides not using an extra library) is that you can convert only the unicode characters (if you wish), since the latter converts all escaped Java characters (like \n or \t). If you prefer to convert all escaped characters the library is really the best option.

Updates regarding answers suggesting using The Apache Commons Lang's:
StringEscapeUtils.unescapeJava() - it was deprecated,
Deprecated.
as of 3.6, use commons-text StringEscapeUtils instead
The replacement is Apache Commons Text's StringEscapeUtils.unescapeJava()

Just wanted to contribute my version, using regex:
private static final String UNICODE_REGEX = "\\\\u([0-9a-f]{4})";
private static final Pattern UNICODE_PATTERN = Pattern.compile(UNICODE_REGEX);
...
String message = "\u0048\u0065\u006C\u006C\u006F World";
Matcher matcher = UNICODE_PATTERN.matcher(message);
StringBuffer decodedMessage = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(
decodedMessage, String.valueOf((char) Integer.parseInt(matcher.group(1), 16)));
}
matcher.appendTail(decodedMessage);
System.out.println(decodedMessage.toString());

I wrote a performanced and error-proof solution:
public static final String decode(final String in) {
int p1 = in.indexOf("\\u");
if (p1 < 0)
return in;
StringBuilder sb = new StringBuilder();
while (true) {
int p2 = p1 + 6;
if (p2 > in.length()) {
sb.append(in.subSequence(p1, in.length()));
break;
}
try {
int c = Integer.parseInt(in.substring(p1 + 2, p1 + 6), 16);
sb.append((char) c);
p1 += 6;
} catch (Exception e) {
sb.append(in.subSequence(p1, p1 + 2));
p1 += 2;
}
int p0 = in.indexOf("\\u", p1);
if (p0 < 0) {
sb.append(in.subSequence(p1, in.length()));
break;
} else {
sb.append(in.subSequence(p1, p0));
p1 = p0;
}
}
return sb.toString();
}

one easy way i know using JsonObject:
try {
JSONObject json = new JSONObject();
json.put("string", myString);
String converted = json.getString("string");
} catch (JSONException e) {
e.printStackTrace();
}

Fast
fun unicodeDecode(unicode: String): String {
val stringBuffer = StringBuilder()
var i = 0
while (i < unicode.length) {
if (i + 1 < unicode.length)
if (unicode[i].toString() + unicode[i + 1].toString() == "\\u") {
val symbol = unicode.substring(i + 2, i + 6)
val c = Integer.parseInt(symbol, 16)
stringBuffer.append(c.toChar())
i += 5
} else stringBuffer.append(unicode[i])
i++
}
return stringBuffer.toString()
}

UnicodeUnescaper from Apache Commons Text does exactly what you want, and ignores any other escape sequences.
String input = "\\u0048\\u0065\\u006C\\u006C\\u006F World";
String output = new UnicodeUnescaper().translate(input);
assert("Hello World".equals(output));
assert("\u0048\u0065\u006C\u006C\u006F World".equals(output));
Where input would be the string you are reading from a file.

try
private static final Charset UTF_8 = Charset.forName("UTF-8");
private String forceUtf8Coding(String input) {return new String(input.getBytes(UTF_8), UTF_8))}

Actually, I wrote an Open Source library that contains some utilities. One of them is converting a Unicode sequence to String and vise-versa. I found it very useful. Here is the quote from the article about this library about Unicode converter:
Class StringUnicodeEncoderDecoder has methods that can convert a
String (in any language) into a sequence of Unicode characters and
vise-versa. For example a String "Hello World" will be converted into
"\u0048\u0065\u006c\u006c\u006f\u0020 \u0057\u006f\u0072\u006c\u0064"
and may be restored back.
Here is the link to entire article that explains what Utilities the library has and how to get the library to use it. It is available as Maven artifact or as source from Github. It is very easy to use. Open Source Java library with stack trace filtering, Silent String parsing Unicode converter and Version comparison

Here is my solution...
String decodedName = JwtJson.substring(startOfName, endOfName);
StringBuilder builtName = new StringBuilder();
int i = 0;
while ( i < decodedName.length() )
{
if ( decodedName.substring(i).startsWith("\\u"))
{
i=i+2;
builtName.append(Character.toChars(Integer.parseInt(decodedName.substring(i,i+4), 16)));
i=i+4;
}
else
{
builtName.append(decodedName.charAt(i));
i = i+1;
}
};

I found that many of the answers did not address the issue of "Supplementary Characters". Here is the correct way to support it. No third-party libraries, pure Java implementation.
http://www.oracle.com/us/technologies/java/supplementary-142654.html
public static String fromUnicode(String unicode) {
String str = unicode.replace("\\", "");
String[] arr = str.split("u");
StringBuffer text = new StringBuffer();
for (int i = 1; i < arr.length; i++) {
int hexVal = Integer.parseInt(arr[i], 16);
text.append(Character.toChars(hexVal));
}
return text.toString();
}
public static String toUnicode(String text) {
StringBuffer sb = new StringBuffer();
for (int i = 0; i < text.length(); i++) {
int codePoint = text.codePointAt(i);
// Skip over the second char in a surrogate pair
if (codePoint > 0xffff) {
i++;
}
String hex = Integer.toHexString(codePoint);
sb.append("\\u");
for (int j = 0; j < 4 - hex.length(); j++) {
sb.append("0");
}
sb.append(hex);
}
return sb.toString();
}
@Test
public void toUnicode() {
System.out.println(toUnicode("😊"));
System.out.println(toUnicode("🥰"));
System.out.println(toUnicode("Hello World"));
}
// output:
// \u1f60a
// \u1f970
// \u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
@Test
public void fromUnicode() {
System.out.println(fromUnicode("\\u1f60a"));
System.out.println(fromUnicode("\\u1f970"));
System.out.println(fromUnicode("\\u0048\\u0065\\u006c\\u006c\\u006f\\u0020\\u0057\\u006f\\u0072\\u006c\\u0064"));
}
// output:
// 😊
// 🥰
// Hello World
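
Note that an escape such as \u1f60a is not a standard Java escape, because \u is always followed by exactly four hex digits. As a hedged alternative, the sketch below escapes each UTF-16 code unit separately, so a supplementary character is written as a surrogate pair. The class name Utf16Escaper and its method names are made up for illustration, and unescape assumes that every \u is followed by four valid hex digits (otherwise Integer.parseInt throws NumberFormatException).

```java
public class Utf16Escaper {

    // Writes every UTF-16 code unit as \uXXXX (always four hex digits).
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    // Reverses escape(); characters that are not part of a \uXXXX
    // sequence are copied through unchanged.
    public static String unescape(String escaped) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < escaped.length()) {
            if (escaped.startsWith("\\u", i) && i + 6 <= escaped.length()) {
                String hex = escaped.substring(i + 2, i + 6);
                sb.append((char) Integer.parseInt(hex, 16));
                i += 6;
            } else {
                sb.append(escaped.charAt(i));
                i++;
            }
        }
        return sb.toString();
    }
}
```

With this scheme, U+1F60A is written as the two escapes \ud83d\ude0a, which can be decoded without guessing where one escape ends and the next begins.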

@NominSim
There may be other characters after an escape, so I detect them by length:
private String forceUtf8Coding(String str) {
str = str.replace("\\","");
String[] arr = str.split("u");
StringBuilder text = new StringBuilder();
for(int i = 1; i < arr.length; i++){
String a = arr[i];
String b = "";
if (arr[i].length() > 4){
a = arr[i].substring(0, 4);
b = arr[i].substring(4);
}
int hexVal = Integer.parseInt(a, 16);
text.append((char) hexVal).append(b);
}
return text.toString();
}

An alternative is to use chars(), which String has provided since Java 8 (via CharSequence), to iterate over the string's char values. Any char that is part of a surrogate pair is passed through uninterpreted. For example:
String myString = "\u0048\u0065\u006C\u006C\u006F World";
myString.chars().forEach(a -> System.out.print((char)a));
// would print "Hello World"
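
As a small sketch of the difference between chars() and codePoints(): chars() streams UTF-16 code units, while codePoints() streams whole code points, so the two disagree for supplementary characters. The class and method names below are hypothetical.

```java
public class CodePointDemo {

    // Number of UTF-16 code units (same as String.length()).
    public static int countCodeUnits(String s) {
        return (int) s.chars().count();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    public static int countCodePoints(String s) {
        return (int) s.codePoints().count();
    }

    // Rebuilds the string one code point at a time.
    public static String rebuild(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        String smiley = "\uD83D\uDE0A"; // U+1F60A as a surrogate pair
        System.out.println(countCodeUnits(smiley));  // 2
        System.out.println(countCodePoints(smiley)); // 1
    }
}
```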

Solution for Kotlin:
val sourceContent = File("test.txt").readText(Charset.forName("windows-1251"))
val result = String(sourceContent.toByteArray())
Kotlin uses UTF-8 everywhere as default encoding.
Method toByteArray() has default argument - Charsets.UTF_8.

Extract digits from a string in Java

I have a Java String object. I need to extract only digits from it. I'll give an example:
"123-456-789" I want "123456789"
Is there a library function that extracts only digits?
Thanks for the answers. Before I try these, do I need to install any additional libraries?

You can use regex and delete non-digits.
str = str.replaceAll("\\D+","");

Here's a more verbose solution. Less elegant, but probably faster:
public static String stripNonDigits(
final CharSequence input /* inspired by seh's comment */){
final StringBuilder sb = new StringBuilder(
input.length() /* also inspired by seh's comment */);
for(int i = 0; i < input.length(); i++){
final char c = input.charAt(i);
if(c > 47 && c < 58){
sb.append(c);
}
}
return sb.toString();
}
Test Code:
public static void main(final String[] args){
final String input = "0-123-abc-456-xyz-789";
final String result = stripNonDigits(input);
System.out.println(result);
}
Output:
0123456789
BTW: I did not use Character.isDigit(ch) because it accepts many other characters besides 0-9.
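
To illustrate the point about Character.isDigit, here is a minimal sketch using the Arabic-Indic digit three (U+0663), which isDigit accepts but an explicit ASCII range check rejects. The class and method names are invented for this example.

```java
public class DigitCheckDemo {

    // True only for the ASCII digits '0' to '9'.
    public static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Keeps ASCII digits and drops every other char.
    public static String keepAsciiDigits(CharSequence input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (isAsciiDigit(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char arabicIndicThree = '\u0663';
        System.out.println(Character.isDigit(arabicIndicThree)); // true
        System.out.println(isAsciiDigit(arabicIndicThree));      // false
        System.out.println(keepAsciiDigits("12\u0663-4"));       // 124
    }
}
```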

public String extractDigits(String src) {
StringBuilder builder = new StringBuilder();
for (int i = 0; i < src.length(); i++) {
char c = src.charAt(i);
if (Character.isDigit(c)) {
builder.append(c);
}
}
return builder.toString();
}

Using Google Guava:
CharMatcher.inRange('0','9').retainFrom("123-456-789")
UPDATE:
Using Precomputed CharMatcher can further improve performance
CharMatcher ASCII_DIGITS=CharMatcher.inRange('0','9').precomputed();
ASCII_DIGITS.retainFrom("123-456-789");

input.replaceAll("[^0-9?!\\.]","")
This keeps the decimal points instead of removing them (as written, the character class also keeps ? and !).
E.g., for the input 445.3kg the output will be 445.3.
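
If only digits and the decimal point should survive, a narrower character class may be enough. The following is a sketch under that assumption; the class name DecimalFilter and its method are hypothetical, and the result is not validated as a number (an input such as 1.2.3 passes through unchanged).

```java
public class DecimalFilter {

    // Removes every character that is not an ASCII digit or a dot.
    public static String keepDigitsAndDots(String input) {
        return input.replaceAll("[^0-9.]", "");
    }

    public static void main(String[] args) {
        System.out.println(keepDigitsAndDots("445.3kg")); // 445.3
    }
}
```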

Using Google Guava:
CharMatcher.DIGIT.retainFrom("123-456-789");
CharMatcher is plug-able and quite interesting to use, for instance you can do the following:
String input = "My phone number is 123-456-789!";
String output = CharMatcher.is('-').or(CharMatcher.DIGIT).retainFrom(input);
output == 123-456-789

public class FindDigitFromString
{
public static void main(String[] args)
{
String s=" Hi How Are You 11 ";
String s1=s.replaceAll("[^0-9]+", "");
//*replacing all the value of string except digit by using "[^0-9]+" regex.*
System.out.println(s1);
}
}
Output: 11

Use regular expression to match your requirement.
String num,num1,num2;
String str = "123-456-789";
String regex ="(\\d+)";
Matcher matcher = Pattern.compile( regex ).matcher( str);
while (matcher.find( ))
{
num = matcher.group();
System.out.print(num);
}

Inspired by Sean Patrick Floyd's code, I rewrote it slightly to get the best performance I could:
public static String stripNonDigitsV2( CharSequence input ) {
if (input == null)
return null;
if ( input.length() == 0 )
return "";
char[] result = new char[input.length()];
int cursor = 0;
CharBuffer buffer = CharBuffer.wrap( input );
while ( buffer.hasRemaining() ) {
char chr = buffer.get();
if ( chr > 47 && chr < 58 )
result[cursor++] = chr;
}
return new String( result, 0, cursor );
}
I ran a performance test on a very long String containing few digits. Compared with the method above, the results are:
The original code is 25.5% slower
The Guava approach is 2.5-3 times slower
The regular expression with D+ is 3-3.5 times slower
The regular expression with only D is 25+ times slower
By the way, the results depend on how long the string is. With a string that contains only 6 digits, Guava is 50% slower and the regexp is 1 times slower.
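
Part of the cost of String.replaceAll is that it compiles the regular expression on every call. A common way to avoid that is to compile the Pattern once and reuse it, sketched below. The class name PrecompiledDigits is hypothetical, and no timing is claimed here; how much this helps would have to be measured for your own inputs.

```java
import java.util.regex.Pattern;

public class PrecompiledDigits {

    // Compiled once and shared; Pattern instances are safe to reuse.
    private static final Pattern NON_DIGITS = Pattern.compile("[^0-9]+");

    // Removes every run of characters that are not ASCII digits.
    public static String stripNonDigits(CharSequence input) {
        return NON_DIGITS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripNonDigits("0-123-abc-456-xyz-789")); // 0123456789
    }
}
```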

Using Kotlin and Lambda expressions you can do it like this:
val digitStr = str.filter { it.isDigit() }

You can use str.replaceAll("[^0-9]", "");

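I have adapted the code for phone numbers such as +9 (987) 124124.
It skips a u followed by four characters (a Unicode escape) and keeps every character from ( to 9, which includes +, (, ) and -.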
public static String stripNonDigitsV2( CharSequence input ) {
if (input == null)
return null;
if ( input.length() == 0 )
return "";
char[] result = new char[input.length()];
int cursor = 0;
CharBuffer buffer = CharBuffer.wrap( input );
int i=0;
while ( i< buffer.length() ) { //buffer.hasRemaining()
char chr = buffer.get(i);
if (chr=='u'){
i=i+5;
// stop if the skipped escape runs past the end of the input
if (i >= buffer.length()) break;
chr=buffer.get(i);
}
if ( chr > 39 && chr < 58 )
result[cursor++] = chr;
i=i+1;
}
return new String( result, 0, cursor );
}

Code:
public class saasa {
public static void main(String[] args) {
// TODO Auto-generated method stub
String t="123-456-789";
t=t.replaceAll("-", "");
System.out.println(t);
}
}

import java.util.*;
public class FindDigits{
public static void main(String []args){
FindDigits h=new FindDigits();
h.checkStringIsNumerical();
}
void checkStringIsNumerical(){
String h="hello 123 for the rest of the 98475wt355";
for(int i=0;i<h.length();i++) {
if(h.charAt(i)!=' '){
System.out.println("Is this '"+h.charAt(i)+"' is a digit?:"+Character.isDigit(h.charAt(i)));
}
}
}
void checkStringIsNumerical2(){
String h="hello 123 for 2the rest of the 98475wt355";
for(int i=0;i<h.length();i++) {
char chr=h.charAt(i);
if(chr!=' '){
if(Character.isDigit(chr)){
System.out.print(chr) ;
}
}
}
}
}
