Java Char to its unicode hexadecimal string representation and vice-versa - java

I need to generate the hexadecimal code of Java characters as strings, and parse those strings again later. I found here that parsing can be performed as follows:
char c = "\u041f".toCharArray()[0];
I was hoping for something more elegant like Integer.valueOf() for parsing.
How about generating the hexadecimal unicode properly?

This will generate a hex string representation of the char:
char ch = 'ö';
String hex = String.format("%04x", (int) ch);
And this will convert the hex string back into a char:
int hexToInt = Integer.parseInt(hex, 16);
char intToChar = (char)hexToInt;

After some deeper reading, I found that the Javadoc says the Character methods that take char parameters do not support all Unicode values, but those taking code points (i.e., int) do.
So I ran the following test:
int codePointCopyright = Integer.parseInt("00A9", 16);
System.out.println(Integer.toHexString(codePointCopyright));
System.out.println(Character.isValidCodePoint(codePointCopyright));
char[] toChars = Character.toChars(codePointCopyright);
System.out.println(toChars);
System.out.println();
int codePointAsian = Integer.parseInt("20011", 16);
System.out.println(Integer.toHexString(codePointAsian));
System.out.println(Character.isValidCodePoint(codePointAsian));
char[] toCharsAsian = Character.toChars(codePointAsian);
System.out.println(toCharsAsian);
and I am getting:
Therefore, I should not talk about char in my question, but rather about arrays of chars, since a single Unicode character can be represented by more than one char. On the other hand, an int covers them all.
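Building on that observation, here is a minimal sketch of a hex round trip that works on int code points instead of char. The class and method names (CodePointHex, toHex, fromHex) are made up for illustration; only the Integer, Character and String calls come from the JDK.

```java
class CodePointHex {
    // Formats a code point as at least four uppercase hex digits, e.g. 0xA9 -> "00A9".
    static String toHex(int codePoint) {
        return String.format("%04X", codePoint);
    }

    // Parses hex digits into a String of one or two chars;
    // supplementary code points such as 20011 become a surrogate pair.
    static String fromHex(String hex) {
        int codePoint = Integer.parseInt(hex, 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + hex);
        }
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String s = fromHex("20011");
        System.out.println(s.length());                      // chars (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // code points
        System.out.println(toHex(s.codePointAt(0)));         // back to hex
    }
}
```

Returning a String rather than a char avoids silently truncating code points above FFFF.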

On String level:
The following uses int code points instead of char (needed, say, for some Chinese characters), but it also works for plain chars.
int cp = "\u041f".codePointAt(0);
String s = new String(Character.toChars(cp));
On native2ascii level:
If you want to convert back and forth between \uXXXX escapes and Unicode characters, use StringEscapeUtils from Apache commons-lang:
String t = StringEscapeUtils.escapeJava(s + "ö");
System.out.println(t);
On the command line, native2ascii can convert files back and forth between u-escaped text and, say, UTF-8.
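If pulling in commons-lang is not an option, the escaping direction can be approximated with the standard library alone. This is only a sketch under assumptions: the class name UnicodeEscaper and its escape method are hypothetical, and the behaviour is deliberately simpler than escapeJava.

```java
class UnicodeEscaper {
    // Replaces every char outside printable ASCII with a backslash-u escape
    // followed by four hex digits. Unlike commons-lang's escapeJava, this
    // simplified sketch leaves backslashes and quotes untouched.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch >= 0x20 && ch < 0x7F) {
                out.append(ch);
            } else {
                out.append(String.format("\\u%04x", (int) ch));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x041F and 0x00F6 are the two non-ASCII letters used earlier in this thread.
        String s = "a" + (char) 0x041F + (char) 0x00F6;
        System.out.println(escape(s));
    }
}
```

Because it walks chars rather than code points, a supplementary character comes out as two escapes, one per surrogate.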

Related

How to convert string representation of an ASCII value to character

I have a String containing the ASCII representation of a character, e.g.
String test = "0x07";
Is there a way I can somehow parse it to its character value?
I want something like
char c = 0x07;
But exactly which character it is will be known only by reading the value in the string.
You have to add one step:
String test = "0x07";
int decimal = Integer.decode(test);
char c = (char) decimal;
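Integer.decode understands more than the 0x prefix, which a small wrapper can make explicit. A hedged sketch follows; the class DecodeChar and its parse method are illustrative names, and the range check is an addition the answer above does not include.

```java
class DecodeChar {
    // Integer.decode accepts decimal, hex ("0x", "0X" or "#" prefix)
    // and octal (leading "0") notation.
    static char parse(String text) {
        int value = Integer.decode(text.trim());
        if (value < Character.MIN_VALUE || value > Character.MAX_VALUE) {
            throw new IllegalArgumentException("Out of char range: " + text);
        }
        return (char) value;
    }

    public static void main(String[] args) {
        System.out.println((int) parse("0x07")); // hex
        System.out.println(parse("#41"));        // hex with '#'
        System.out.println(parse("65"));         // decimal
        System.out.println(parse("0101"));       // octal
    }
}
```

Without the range check, the cast to char would silently keep only the low 16 bits of a larger value.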

Replace special characters in a string with their UTF-8 encoded character java?

I want to convert only the special characters to their UTF-8 equivalent character.
For example given a String: Abcds23#$_ss, it should get converted to Abcds23353695ss.
The following is how I did the above conversion:
The utf-8 in hexadecimal for # is 23 and in decimal is 35. The utf-8 in hexadecimal for $ is 24 and in decimal is 36. The utf-8 in hexadecimal for _ is 5f and in decimal is 95.
I know we have the String.replaceAll(String regex, String replacement) method. But I want to replace specific characters with their specific UTF-8 equivalents.
How do I do the same in Java?
I don't know how you define "special characters", but this function should give you an idea:
public static String convert(String str)
{
StringBuilder buf = new StringBuilder();
for (int index = 0; index < str.length(); index++)
{
char ch = str.charAt(index);
if (Character.isLetterOrDigit(ch))
buf.append(ch);
else
buf.append(str.codePointAt(index));
}
return buf.toString();
}
@Test
public void test()
{
Assert.assertEquals("Abcds23353695ss", convert("Abcds23#$_ss"));
}
The following requires Java 8 or above. It keeps each Unicode code point (symbol) that is a pure ASCII (< 128) letter or digit, and otherwise outputs the code point's numerical value as a string.
static String convert(String str) {
int[] cps = str.codePoints()
.flatMap((cp) ->
Character.isLetterOrDigit(cp) && cp < 128
? IntStream.of(cp)
: String.valueOf(cp).codePoints())
.toArray();
return new String(cps, 0, cps.length);
}
String.codePoints() yields an IntStream, flatMap merges the per-code-point IntStreams into a single flattened stream, and toArray collects it in an array. So we can construct a new String from those code points. Entirely Unicode safe.
The conversion cannot be reversed without delimiters.
On Unicode:
Unicode assigns numbers, called code points, to symbols, from 0 upwards into the 3-byte range.
To encode (format) them as bytes there are UTF-8 (multi-byte), UTF-16LE and UTF-16BE (2-byte sequences) and UTF-32 (code points more or less as-is).
Java string constants in a .class file are in UTF-8. A String is composed of UTF-16BE chars. And String can give code points as above. So Java by design uses Unicode for text.
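To make the difference between chars, code points and encoded bytes concrete, here is a small sketch that counts each for one string. The class EncodingSizes and its sizes method are hypothetical helpers; the sample string is an assumption chosen to mix a 1-byte, a 2-byte and a 4-byte UTF-8 character.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingSizes {
    // Returns { chars, code points, UTF-8 bytes, UTF-16BE bytes } for the given text.
    static int[] sizes(String s) {
        return new int[] {
            s.length(),
            s.codePointCount(0, s.length()),
            s.getBytes(StandardCharsets.UTF_8).length,
            s.getBytes(StandardCharsets.UTF_16BE).length
        };
    }

    public static void main(String[] args) {
        // 'a' (U+0061), U+00F6 and the supplementary code point U+20011.
        String s = new StringBuilder()
                .append('a').appendCodePoint(0xF6).appendCodePoint(0x20011).toString();
        System.out.println(Arrays.toString(sizes(s))); // [4, 3, 7, 8]
    }
}
```

The supplementary code point takes two chars, which is why the char count is one more than the code point count.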

How to format numbers to a hex strings?

I want to format int numbers as hex strings. System.out.println(Integer.toHexString(1)); prints 1 but I want it as 0x00000001. How do I do that?
Try this
System.out.println(String.format("0x%08X", 1));
You can use the String.format to format an integer as a hex string.
System.out.println(String.format("0x%08X", 1));
That is, pad with zeros, and make the total width 8. The 1 is converted to hex for you. The above line gives: 0x00000001 and
System.out.println(String.format("0x%08X", 234));
gives: 0x000000EA
Using the formatting syntax documented for Java's Formatter class:
Integer intObject = Integer.valueOf(1);
String s = String.format("0x%08x", intObject);
System.out.println(s);
Less verbose:
System.out.printf("0x%08x", 1); // use %08X for upper-case letters
I don't know Java too intimately, but there must be a way you can pad the output from the toHexString function with a '0' to a length of 8. If "0x" will always be at the beginning, just tack on that string to the beginning.
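The manual padding idea described above could look roughly like this. It is a sketch, not a recommendation over String.format; PadHex and toPaddedHex are invented names.

```java
class PadHex {
    // Pads the output of Integer.toHexString with leading zeros to 8 digits
    // and puts "0x" in front.
    static String toPaddedHex(int value) {
        String digits = Integer.toHexString(value); // 1 to 8 lowercase digits
        StringBuilder sb = new StringBuilder("0x");
        for (int i = digits.length(); i < 8; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        System.out.println(toPaddedHex(1));   // 0x00000001
        System.out.println(toPaddedHex(234)); // 0x000000ea
    }
}
```

Negative values need no padding, since toHexString already returns all eight digits of the two's-complement form.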
Java 17+
There is a new immutable class dedicated to converting to and formatting hexadecimal numbers. The easiest way to go is using HexFormat::toHexDigits, which includes leading zeroes:
String hex = "0x" + HexFormat.of().toHexDigits(1);
// 0x00000001
Beware: one has to concatenate the "0x" prefix by hand because this method ignores any defined prefix and suffix, so the following snippet doesn't work as expected (only the HexFormat::formatHex methods use them):
String hex = HexFormat.of().withPrefix("0x").toHexDigits(1);
// 00000001
Returns the eight hexadecimal characters for the int value. Each nibble (4 bits) from most significant to least significant of the value is formatted as if by toLowHexDigit(nibble). The delimiter, prefix and suffix are not used.
Alternatively, take advantage of HexFormat::formatHex, which formats each byte as two hexadecimal characters, and pass a StringBuilder containing "0x" as the Appendable:
Each byte value is formatted as the prefix, two hexadecimal characters selected from uppercase or lowercase digits, and the suffix.
StringBuilder hex = HexFormat.of()
.formatHex(new StringBuilder("0x"), new byte[] {0, 0, 0, 1});
// 0x00000001
StringBuilder hex = HexFormat.of()
.formatHex(new StringBuilder("0x"), ByteBuffer.allocate(4).putInt(1).array());
// 0x00000001
You can use a java.util.Formatter or the printf method on a PrintStream.
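For completeness, a sketch of both options named above. The wrapper class FormatterHex and its toHex method are illustrative; the format string is the same one used in the earlier answers.

```java
import java.util.Formatter;

class FormatterHex {
    // Uses java.util.Formatter directly; with no destination given,
    // it writes into an internal StringBuilder.
    static String toHex(int value) {
        try (Formatter formatter = new Formatter()) {
            formatter.format("0x%08X", value);
            return formatter.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(toHex(234));       // via Formatter
        System.out.printf("0x%08X%n", 234);   // via PrintStream.printf
    }
}
```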
This is a String extension for Kotlin:
//if lengthOfResultTextNeeded = 3 and input String is "AC", the result is = "0AC"
//if lengthOfResultTextNeeded = 4 and input String is "AC", the result is = "00AC"
fun String.unSignedHex(lengthOfResultTextNeeded: Int): String {
val count =
lengthOfResultTextNeeded - this.length
val buildHex4DigitString = StringBuilder()
var i = 1
while (i <= count) {
buildHex4DigitString.append("0")
++i
}
return buildHex4DigitString.toString() + this
}

How to convert a char from an alphabetical character to a hexadecimal number in Java?

How do I convert a char from an alphabetical character to a hexadecimal number in Java? If anyone knows of any built-in methods in Java that do the job, or if you have your own method, could you please help?
Also, how would I convert from hex to binary?
You can convert from char to hex string.
char ch = 'a'; // any char; 'a' is only an example value
String hex = String.format("%04x", (int) ch);
To read hex and convert to binary you can do
int num = Integer.parseInt(text, 16);
String bin = Integer.toString(num, 2);
You could use:
Integer.toHexString((int) 'a');
Integer.toBinaryString((int) 'b');
Update: hex -> binary conversion:
Integer.toBinaryString(Integer.parseInt("fa", 16))
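Integer.toBinaryString drops leading zeros, so a hex string such as "0a" would come out as "1010". If four bits per hex digit are wanted, a sketch like the following pads the result. HexBinary and hexToBinary are hypothetical names, and the input is assumed to be non-negative and to fit in an int.

```java
class HexBinary {
    // Converts hex digits to binary, left-padded to four bits per hex digit.
    static String hexToBinary(String hex) {
        int value = Integer.parseInt(hex, 16);
        String bits = Integer.toBinaryString(value);
        int width = hex.length() * 4;
        StringBuilder sb = new StringBuilder();
        for (int i = bits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(bits).toString();
    }

    public static void main(String[] args) {
        String hex = Integer.toHexString('a');              // "61"
        System.out.println(hex + " -> " + hexToBinary(hex)); // 61 -> 01100001
    }
}
```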
Use the apache commons codec library
Specifically:
http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Hex.html

Unicode to string conversion in Java

I am building a language, a toy language. The syntax \#0061 is supposed to convert the given Unicode code point to a character:
String temp = yytext().substring(2);
After that I tried to append '\u' to the string, and noticed that this generated an error.
I also tried "\\" + "u" + temp; this does not do any conversion.
I am basically trying to convert Unicode to a character by supplying only '0061' to a method. Help.
Strip the '#' and use Integer.parseInt("0061", 16) to convert the hex digits to an int. Then cast to a char.
(If you had implemented the lexer by hand, an alternative would be to do the conversion on the fly as your lexer matches the unicode literal. But on rereading the question, I see that you are using a lexer generator ... good move!)
I am basically trying to convert Unicode to a character by supplying only '0061' to a method, help.
char fromUnicode(String codePoint) {
return (char) Integer.parseInt(codePoint, 16);
}
You need to handle bad inputs and such, but that will work otherwise.
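One possible way to handle bad inputs is to validate the digits before parsing. This is only a sketch: SafeHexParse is a made-up class name, and limiting the input to one to four hex digits is an assumption that matches the '0061' format in the question.

```java
class SafeHexParse {
    // Accepts 1 to 4 hex digits, so the value always fits in a char;
    // rejects null, empty strings, signs and longer inputs.
    static char fromUnicode(String codePoint) {
        if (codePoint == null || !codePoint.matches("[0-9a-fA-F]{1,4}")) {
            throw new IllegalArgumentException("Expected 1 to 4 hex digits: " + codePoint);
        }
        return (char) Integer.parseInt(codePoint, 16);
    }

    public static void main(String[] args) {
        System.out.println(fromUnicode("0061")); // a
    }
}
```

The explicit check matters because Integer.parseInt alone would accept inputs like "-61" or "10061", which the cast to char would then mangle silently.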
You need to convert the particular codepoint to a char. You can do that with a little help of regex:
String string = "blah #0061 blah";
Matcher matcher = Pattern.compile("\\#((?i)[0-9a-f]{4})").matcher(string);
while (matcher.find()) {
int codepoint = Integer.valueOf(matcher.group(1), 16);
string = string.replaceAll(matcher.group(0), String.valueOf((char) codepoint));
}
System.out.println(string); // blah a blah
Edit: as per the comments, if it is a single token, then just do:
String string = "0061";
char c = (char) Integer.parseInt(string, 16);
System.out.println(c); // a
\uXXXX is an escape sequence. Before execution it has already been converted into the actual character value; it's not "evaluated" in any way at runtime.
What you probably want to do is define a mapping from your #XXXX syntax to Unicode code points and cast them to char.
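Such a mapping could be sketched as a single pass over the input with Matcher.appendReplacement. The class HashEscapes and its expand method are hypothetical, and the #XXXX form with exactly four hex digits is taken from the regex answer above rather than from any language spec.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashEscapes {
    private static final Pattern ESCAPE = Pattern.compile("#([0-9a-fA-F]{4})");

    // Replaces every #XXXX token with the char whose value is the hex number XXXX.
    static String expand(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char ch = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' results from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(ch)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("blah #0061 blah")); // blah a blah
    }
}
```

Doing the replacement in one pass avoids re-scanning the string for every match and never treats an already expanded character as regex input.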
