Unicode to string conversion in Java - java

I am building a toy language. The syntax \#0061 is supposed to convert the given Unicode code point to a character:
String temp = yytext().substring(2);
After that I tried to append '\u' to the string and noticed that this generated an error.
I also tried "\\" + "u" + temp; this does not do any conversion.
I am basically trying to convert Unicode to a character by supplying only '0061' to a method. Help.

Strip the '#' and use Integer.parseInt("0061", 16) to convert the hex digits to an int. Then cast to a char.
(If you had implemented the lexer by hand, an alternative would be to do the conversion on the fly as your lexer matches the Unicode literal. But on rereading the question, I see that you are using a lexer generator ... good move!)

I am basically trying to convert Unicode to a character by supplying only '0061' to a method, help.
char fromUnicode(String codePoint) {
return (char) Integer.parseInt(codePoint, 16);
}
You need to handle bad inputs and such, but that will work otherwise.
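As a minimal sketch of what handling bad inputs could look like: the class and method names below are hypothetical, and the choice to throw IllegalArgumentException is an assumption, not something the answer prescribes. Returning a String instead of a char also lets code points above U+FFFF through, since those need two chars.

```java
final class CodePointParser {
    // Hypothetical helper: turns hex digits such as "0061" into a String holding that code point.
    static String fromHex(String hex) {
        int cp;
        try {
            cp = Integer.parseInt(hex, 16);
        } catch (NumberFormatException e) {
            // Covers null, empty and non-hex input.
            throw new IllegalArgumentException("Not a hex number: " + hex, e);
        }
        if (!Character.isValidCodePoint(cp)) {
            // Covers negative values and anything above U+10FFFF.
            throw new IllegalArgumentException("Not a Unicode code point: " + hex);
        }
        // toChars yields one char for the BMP and a surrogate pair otherwise.
        return new String(Character.toChars(cp));
    }

    public static void main(String[] args) {
        System.out.println(fromHex("0061")); // a
    }
}
```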

You need to convert the particular codepoint to a char. You can do that with a little help of regex:
String string = "blah #0061 blah";
Matcher matcher = Pattern.compile("\\#((?i)[0-9a-f]{4})").matcher(string);
while (matcher.find()) {
int codepoint = Integer.valueOf(matcher.group(1), 16);
string = string.replaceAll(matcher.group(0), String.valueOf((char) codepoint));
}
System.out.println(string); // blah a blah
Edit as per the comments, if it is a single token, then just do:
String string = "0061";
char c = (char) Integer.parseInt(string, 16);
System.out.println(c); // a
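One caveat with the loop above is that String.replaceAll treats '$' and '\' in the replacement specially, so a decoded #0024 would make it fail. A possible alternative, assuming Java 9 or later, is Matcher.replaceAll with a function. The class name and pattern constant are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashEscapes {
    // '#' followed by exactly four hex digits, captured as group 1.
    private static final Pattern HASH_ESCAPE = Pattern.compile("#([0-9a-fA-F]{4})");

    // Replaces every #XXXX in the input with the char it names (needs Java 9+).
    static String decode(String input) {
        return HASH_ESCAPE.matcher(input).replaceAll(m -> {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps a decoded '$' or '\' from being read as a group reference.
            return Matcher.quoteReplacement(String.valueOf(c));
        });
    }

    public static void main(String[] args) {
        System.out.println(decode("blah #0061 blah")); // blah a blah
    }
}
```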

\uXXXX is an escape sequence. It is converted into the actual character value by the compiler, before execution; it is not "evaluated" in any way at runtime.
What you probably want to do is define a mapping from your #XXXX syntax to Unicode code points and cast them to char.
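The difference can be seen in a small sketch (class and method names are made up for illustration): concatenating a backslash and "u" at run time just produces six ordinary characters, while parsing the hex digits produces the character itself.

```java
final class EscapeDemo {
    // Builds backslash, 'u' and the digits at run time; no escape processing happens here.
    static String concatenated(String hex) {
        return "\\" + "u" + hex;
    }

    // Hypothetical mapping from the digits of a #XXXX token to the char they name.
    static char decoded(String hex) {
        return (char) Integer.parseInt(hex, 16);
    }

    public static void main(String[] args) {
        System.out.println(concatenated("0061").length()); // 6
        System.out.println(decoded("0061"));               // a
    }
}
```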

Related

Java - Remove only the first backslash

Small Java question regarding how to remove only the first backslash please.
I have a string which looks like this:
String s = "\\u6df1\\u5733";
Please note, there are two backslashes, and multiple occurrences.
Hence, when this is displayed, the visual result is:
\深\圳
I would like to just remove any extra backslashes, having a result like this:
深圳
So far, I have tried this:
String s = "\\u6df1\\u5733";
String ss = s.replaceAll("\\", "");
But it is still not working.
What is the correct solution please in order to get 深圳 from "\\u6df1\\u5733" please?
Thank you
Try this.
String s = "\\u6df1\\u5733";
Pattern UNICODE_ESCAPE = Pattern.compile("\\\\u[0-9a-f]+", Pattern.CASE_INSENSITIVE);
String ss = UNICODE_ESCAPE.matcher(s).results()
.map(x -> new String(Character.toChars(Integer.parseInt(x.group().substring(2), 16))))
.collect(Collectors.joining());
System.out.println(ss);
UNICODE_ESCAPE.matcher(s).results() returns a stream of MatchResult.
x.group().substring(2) extracts the hexadecimal part "xxxx" from "\\uxxxx".
Integer.parseInt(..., 16) converts it to an integer value that is a code point.
Character.toChars() converts it to an array of char.
new String(...) converts it to a String, and .collect(Collectors.joining()) concatenates all of them.
output:
深圳
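The stream version keeps only the matched escapes, so any text between them is dropped. If the surrounding text should survive, a variant along these lines might work. It is only a sketch: the class name is hypothetical, and it assumes each escape has exactly four hex digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UnicodeUnescaper {
    // A literal backslash, then 'u', then exactly four hex digits (group 1).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // Copies the text before the match, then the decoded char.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out); // copies whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("\\u6df1\\u5733")); // 深圳
    }
}
```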
Going by this output:
\深\圳
you actually have two unicode characters each preceded by one backslash.
In a Java string literal, that would look like this:
String s = "\\\u6df1\\\u5733";
If you want to remove the backslashes (\\) and leave the unicode character codes (e.g. \u6df1), then you just need replace.
String ss = s.replace("\\", "");
replaceAll won't work for this, because it requires a regular expression as its first argument.
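To make the replace versus replaceAll difference concrete, here is a small sketch with illustrative names. A lone backslash is not a valid regular expression, so replaceAll needs the backslash escaped once for the regex and once more for the Java literal.

```java
import java.util.regex.PatternSyntaxException;

final class BackslashRemoval {
    // Literal replacement: the Java literal "\\" is a single backslash character.
    static String viaReplace(String s) {
        return s.replace("\\", "");
    }

    // Regex replacement: the regex needs an escaped backslash, written "\\\\" in Java source.
    static String viaReplaceAll(String s) {
        return s.replaceAll("\\\\", "");
    }

    public static void main(String[] args) {
        String s = "\\\u6df1\\\u5733"; // backslash, U+6DF1, backslash, U+5733
        System.out.println(viaReplace(s));    // 深圳
        System.out.println(viaReplaceAll(s)); // 深圳
        try {
            s.replaceAll("\\", "");
        } catch (PatternSyntaxException e) {
            System.out.println("a lone backslash is not a valid regex");
        }
    }
}
```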

Replace special characters in a string with their UTF-8 encoded character java?

I want to convert only the special characters to their UTF-8 equivalent character.
For example given a String: Abcds23#$_ss, it should get converted to Abcds23353695ss.
The following is how I did the above conversion:
The UTF-8 value of # is 23 in hexadecimal and 35 in decimal. The UTF-8 value of $ is 24 in hexadecimal and 36 in decimal. The UTF-8 value of _ is 5f in hexadecimal and 95 in decimal.
I know we have the String.replaceAll(String regex, String replacement) method, but I want to replace specific characters with their specific UTF-8 equivalents.
How do I do this in Java?
I don't know how you define "special characters", but this function should give you an idea:
public static String convert(String str)
{
StringBuilder buf = new StringBuilder();
for (int index = 0; index < str.length(); index++)
{
char ch = str.charAt(index);
if (Character.isLetterOrDigit(ch))
buf.append(ch);
else
buf.append(str.codePointAt(index));
}
return buf.toString();
}
@Test
public void test()
{
Assert.assertEquals("Abcds23353695ss", convert("Abcds23#$_ss"));
}
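Note that codePointAt returns the Unicode code point, not UTF-8 bytes. The two coincide for ASCII characters such as #, $ and _, which is why the expected output above works out, but they differ for anything else. A short sketch (names are illustrative) showing the difference:

```java
import java.nio.charset.StandardCharsets;

final class CodePointVsUtf8 {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String hash = "#";
        String eAcute = "\u00e9"; // e with acute accent
        System.out.println(hash.codePointAt(0) + " / " + utf8Length(hash));     // 35 / 1
        System.out.println(eAcute.codePointAt(0) + " / " + utf8Length(eAcute)); // 233 / 2
    }
}
```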
The following uses Java 8 or above. It keeps a Unicode code point (symbol) if it is a letter or digit and pure ASCII (< 128), and otherwise outputs the numerical value of the code point as a string.
static String convert(String str) {
int[] cps = str.codePoints()
.flatMap((cp) ->
Character.isLetterOrDigit(cp) && cp < 128
? IntStream.of(cp)
: String.valueOf(cp).codePoints())
.toArray();
return new String(cps, 0, cps.length);
}
String.codePoints() yields an IntStream, flatMap merges the per-code-point IntStreams into a single flattened stream, and toArray collects it into an array, from which we can construct a new String. Entirely Unicode safe.
The conversion cannot be undone without delimiters.
On Unicode:
Unicode numbers symbols, called code points, from 0 upwards, into the 3 byte range.
To be coded (formatted) in bytes there exist UTF-8 (multi-byte), UTF-16LE and UTF-16BE (sequences of 2-byte units) and UTF-32 (code points as-is, more or less).
Java string constants in a .class file are stored in a modified form of UTF-8. A String is composed of UTF-16 chars, and a String can give code points as above. So Java by design uses Unicode for text.
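The remark about delimiters can be illustrated with two different inputs that produce the same output. The convert method below is a re-implementation of the same idea for this sketch, not the answer's exact code.

```java
final class AmbiguityDemo {
    // Keeps ASCII letters and digits, writes every other code point as its decimal value.
    static String convert(String str) {
        StringBuilder sb = new StringBuilder();
        str.codePoints().forEach(cp -> {
            if (cp < 128 && Character.isLetterOrDigit(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(cp); // append(int) writes the decimal digits
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("#5"));  // 355
        System.out.println(convert("355")); // 355
    }
}
```

Since both inputs map to 355, the original text cannot be recovered from the output alone.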

Java Char to its unicode hexadecimal string representation and vice-versa

I need to generate the hexadecimal code of Java characters into strings, and parse those strings again later. I found here that parsing can be performed as following:
char c = "\u041f".toCharArray()[0];
I was hoping for something more elegant like Integer.valueOf() for parsing.
How about generating the hexadecimal unicode properly?
This will generate a hex string representation of the char:
char ch = 'ö';
String hex = String.format("%04x", (int) ch);
And this will convert the hex string back into a char:
int hexToInt = Integer.parseInt(hex, 16);
char intToChar = (char)hexToInt;
After doing some deeper reading, the javadoc says the Character methods based on char parameters do not support all unicode values, but those taking code points (i.e., int) do.
Hence, I have been performing the following test:
int codePointCopyright = Integer.parseInt("00A9", 16);
System.out.println(Integer.toHexString(codePointCopyright));
System.out.println(Character.isValidCodePoint(codePointCopyright));
char[] toChars = Character.toChars(codePointCopyright);
System.out.println(toChars);
System.out.println();
int codePointAsian = Integer.parseInt("20011", 16);
System.out.println(Integer.toHexString(codePointAsian));
System.out.println(Character.isValidCodePoint(codePointAsian));
char[] toCharsAsian = Character.toChars(codePointAsian);
System.out.println(toCharsAsian);
and I am getting:
Therefore, I should not talk about char in my question, but rather about array of chars, since Unicode characters can be represented with more than one char. On the other side, an int covers it all.
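Following that conclusion, a round trip based on int code points rather than char could look like this. It is a sketch with hypothetical helper names, and the zero-padding to four digits is just one possible formatting choice.

```java
final class CodePointHex {
    // Formats a code point as lowercase hex, padded to at least four digits.
    static String toHex(int codePoint) {
        return String.format("%04x", codePoint);
    }

    // Parses hex digits back into a String of one or two chars.
    static String fromHex(String hex) {
        return new String(Character.toChars(Integer.parseInt(hex, 16)));
    }

    public static void main(String[] args) {
        String s = fromHex("20011");
        System.out.println(s.length());                      // 2 (a surrogate pair)
        System.out.println(s.codePointCount(0, s.length())); // 1
        System.out.println(toHex(s.codePointAt(0)));         // 20011
        System.out.println(toHex(0xA9));                     // 00a9
    }
}
```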
On String level:
The following uses not char but int, say for Chinese, but is also adequate for chars.
int cp = "\u041f".codePointAt(0);
String s = new String(Character.toChars(cp));
On native2ascii level:
If you want to convert back and forth between \uXXXX and Unicode characters, use StringEscapeUtils from Apache commons-lang:
String t = StringEscapeUtils.escapeJava(s + "ö");
System.out.println(t);
On the command line, native2ascii can convert files back and forth between u-escaped and, say, UTF-8.

Removing backslashes in the numbers sequence

What regular expression can extract the digit sequence from an input string that contains backslashes and other non-digit characters? For example, from
"12\34a56ss7890"
I need to get
1234567890
Assuming you have this in a String, you could do something like:
string = string.replaceAll("\\D", "");
This removes all non-digit characters from your String.
str.replaceAll("[^\\d]", "");
Footnote: I'm not a Java developer, but the regex itself should be correct.
Sorry for adding another answer, but this would not fit into a comment.
I think this is because of the \34. If I call System.out.print("12\34a56ss7890"); I get the output 12a56ss7890. This is because \34 is treated as an escape sequence in a Java string literal. You can work around this by first calling this method on your InputStream:
private InputStreamReader replaceBackSlashes() throws Exception {
FileInputStream fis = new FileInputStream(new File("PATH TO A FILE"));
Scanner in = new Scanner(fis, "UTF-8");
ByteArrayOutputStream out = new ByteArrayOutputStream();
while (in.hasNext()) {
String nextLine = in.nextLine().replace("\\", "");
out.write(nextLine.getBytes());
out.write("\n".getBytes());
}
return new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
}
BTW: Sorry for my edit, but there was a little mistake in the code.
After calling this method, convert your InputStream to a String and then call this on the String:
string = string.replaceAll("\\D", "");
This should hopefully work now :)
String num;
String str =" 12\34a56ss7890";
str= str.replace("\34", "34");
String regex = "[\\d]+";
Matcher matcher = Pattern.compile( regex ).matcher( str);
while (matcher.find( ))
{
num = matcher.group();
System.out.print(num);
}
Replace \34 with 34 and match the rest using a regular expression.
Use a regular expression.
String num;
String str =" 12\34a56ss7890";
str= str.replace("\34", "34");
String regex = "[\\d]+";//match only digits.
Matcher matcher = Pattern.compile( regex ).matcher( str);
while (matcher.find( ))
{
num = matcher.group();
System.out.print(num);
}
The following example:
String a ="1\2sas";
String b ="1\\2sas";
System.out.println(a.replaceAll("[a-zA-Z\\\\]",""));
System.out.println(b.replaceAll("[a-zA-Z\\\\]",""));
gives output:
1X
12
where X is not an X but a little rectangle: the symbol shown when the control displaying the text does not know how to draw a character, a so-called non-printable character.
This is because in String a the "\2" part is interpreted as a single escaped character (an octal escape, equal to "\u0002"), similar to "\n" or "\t". You can see this in a debugger (I tried it using NetBeans).
Since the first argument of a replaceAll method is passed to [Pattern.compile](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#replaceAll(java.lang.String, java.lang.String)) it needs to be escaped twice as opposed to String literal (like b).
So if the String "12\34a56ss7890" looks like this on screen you have printed it out like this:
System.out.println("12\\34a56ss7890");
which is solved in the second example.
However if the literal is given as "12\34a56ss7890" then I think you can't handle it with a single regexp, because if the backslash is followed by a number it gets interpreted as \u0000 - \u0009, so the best I can think of is a very ugly solution:
str.replaceAll("\u0000","0").replaceAll("\u0001","1") ... .replaceAll("\u0009","9").replaceAll("[^\\d]", "")
The first ten replacements (\u0000-\u0009) might be rewritten as a for loop to make it look elegant.
+1 for an EXCELLENT question :)
EDIT:
Actually, if a backslash is followed by more than one digit, they are all interpreted as a single character: up to three digits after a backslash are consumed, and a fourth digit is treated as an ordinary digit.
Therefore, my solution is not generally correct, but could be extended to be. I would recommend Robin's solution below as it is far more efficient.
The sequence \34 is an octal escape in the string 12\34a56ss7890, so you could use:
str.replaceAll("\034", "34").replaceAll("\\D", "")
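A short sketch of what the compiler does with the two spellings of the literal (class and method names are illustrative). With a single backslash the 3 and 4 are gone before any regex runs, which is why "\\D" alone cannot recover them.

```java
final class OctalEscapeDemo {
    // Removes every character that is not a digit.
    static String digitsOnly(String s) {
        return s.replaceAll("\\D", "");
    }

    public static void main(String[] args) {
        String octal = "12\34a56ss7890";    // the compiler turns \34 into the single char U+001C
        String literal = "12\\34a56ss7890"; // a real backslash followed by '3' and '4'
        System.out.println((int) octal.charAt(2)); // 28
        System.out.println(digitsOnly(octal));     // 12567890
        System.out.println(digitsOnly(literal));   // 1234567890
    }
}
```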

Java Regexp to Match ASCII Characters

What regex would match any ASCII character in java?
I've already tried:
^[\\p{ASCII}]*$
but found that it didn't match lots of things that I wanted (like spaces, parentheses, etc...). I'm hoping to avoid explicitly listing all 128 ASCII characters in a format like:
^[a-zA-Z0-9!@#$%^*(),.<>~`[]{}\\/+=-\\s]*$
The first try was almost correct
"^\\p{ASCII}*$"
I have never used \\p{ASCII} but I have used ^[\\u0000-\\u007F]*$
If you only want the printable ASCII characters you can use ^[ -~]*$ - i.e. all characters between space and tilde.
https://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart
For JavaScript it'll be /^[\x00-\x7F]*$/.test('blah')
I think the question is about getting the ASCII characters from a raw string that has both ASCII and special characters...
public String getOnlyASCII(String raw) {
Pattern asciiPattern = Pattern.compile("\\p{ASCII}*$");
Matcher matcher = asciiPattern.matcher(raw);
String asciiString = null;
if (matcher.find()) {
asciiString = matcher.group();
}
return asciiString;
}
The above method drops the leading non-ASCII part and returns the trailing run of ASCII characters. Thanks to @Oleg Pavliv for the pattern.
For example:
raw = ��+919986774157
asciiString = +919986774157
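The patterns from the answers above can be compared side by side in a small sketch. The class and method names are hypothetical, and the sample inputs are made up for illustration. Note that stripNonAscii removes non-ASCII characters anywhere in the string, which differs from getOnlyASCII, which only keeps the ASCII tail.

```java
final class AsciiCheck {
    // True if every character is in the range U+0000 to U+007F.
    static boolean isAscii(String s) {
        return s.matches("\\p{ASCII}*");
    }

    // True if every character lies between space and tilde.
    static boolean isPrintableAscii(String s) {
        return s.matches("[ -~]*");
    }

    // Removes every non-ASCII character, wherever it occurs.
    static String stripNonAscii(String s) {
        return s.replaceAll("\\P{ASCII}", "");
    }

    public static void main(String[] args) {
        System.out.println(isAscii("a (b) c!"));            // true
        System.out.println(isAscii("caf\u00e9"));           // false
        System.out.println(isPrintableAscii("tab\there"));  // false
        System.out.println(stripNonAscii("\u00e9+919986774157")); // +919986774157
    }
}
```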
