Convert special characters into decimal equivalents in Java

Is there a java library to convert special characters into decimal equivalent?
example:
input: "©™®"
output: "& #169; & #8482; & #174;"(space after & is only for question purpose, if typed without a space decimal equivalent is converted to special character)
Thank you !

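This can be achieved with String.format(). Each representation is simply the character's decimal value wrapped in &# and ;. The %04d format used below zero-pads the value to four digits, which is still a valid numeric character reference; use %d instead to get the unpadded form shown in the question.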
The only tricky part is deciding which characters are "special". Here I've assumed not digit, not whitespace and not alpha...
StringBuilder output = new StringBuilder();
String input = "Foo bar ©™® baz";
for (char each : input.toCharArray()) {
if (Character.isAlphabetic(each) || Character.isDigit(each) || Character.isWhitespace(each)) {
output.append(each);
} else {
output.append(String.format("&#%04d;", (int) each));
}
}
System.out.println(output.toString());
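The loop above works on char values, so a character outside the Basic Multilingual Plane would be emitted as two surrogate entities. As a minimal sketch (the class and method names here are made up for illustration, and the definition of "special" is the same assumption as above), the same idea can be written over code points and without padding:

```java
final class NumericEntityEncoder {

    // Letters, digits and whitespace are passed through unchanged;
    // everything else becomes a decimal numeric character reference.
    static String encode(String input) {
        StringBuilder out = new StringBuilder();
        // codePoints() yields whole code points, so a surrogate pair
        // produces a single entity instead of two.
        input.codePoints().forEach(cp -> {
            if (Character.isAlphabetic(cp) || Character.isDigit(cp) || Character.isWhitespace(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00A9 \u2122 \u00AE are the copyright, trademark and registered signs
        System.out.println(encode("Foo bar \u00A9\u2122\u00AE baz"));
    }
}
```

Unicode escapes are used in the literal only so that the example does not depend on the source file encoding.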

You just need to fetch the integer value of the character as mentioned in How do I get the decimal value of a unicode character in Java?.
As per Oracle Java doc
char: The char data type is a single 16-bit Unicode character. It has
a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or
65,535 inclusive).
Assuming your characters fall within the character range, you can just get the decimal equivalent of each character from your string.
String text = "©™®";
char[] cArr = text.toCharArray();
for (char c : cArr)
{
int value = c; // get the decimal equivalent of the character
String result = "& #" + value; // append to some format string
System.out.println(result);
}
Output:
& #169
& #8482
& #174

Related

Unescaping special characters using Java

I have been given the following value (escaped using Windows-1252):
ABC &#145 ; &#146 ; &#147 ; &#148 ; &#226 ;, &#234 ;, &#238 ;, &#244 ;, &#251 ;
(I had to add a space before each ; to display the exact value; in the actual data there is no space between the number and the ;.)
The value it represents, and the output I want, is:
ABC ‘ ’ “ ” â, ê, î, ô, û
I tried HtmlUtils.htmlUnescape(decodedString), but it did not work.
The output I get is:
ABC â, ê, î, ô, û
The ‘ ’ “ ” characters are removed.
How can I do this in Java?
The quote characters are probably still in the string, they are just invisible when displayed. That's because in Unicode or ISO 8859-1, the code point 145 is not assigned to a visible character.
The best solution (if possible) is to pass the encoding to the unescapeHtml method.
An alternative is to call htmlUnescape first and then map the cp1252 codepoints to the corresponding Unicode code points, using the following code:
String unescapeHtmlCp1252(String input) {
String nohtml = HtmlUtils.htmlUnescape(input);
byte[] bytes = nohtml.getBytes(StandardCharsets.ISO_8859_1);
String result = new String(bytes, Charset.forName("cp1252"));
return result;
}
When you step through this code with a debugger and inspect the nohtml string, you will probably see characters with the value 145, 146, and so on. This means that the characters are still there at this point.
Later, when the characters are converted into pixels by using a font, these characters do not have a definition and are therefore just ignored. But until this step, they are still there.
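To see the remapping step in isolation, here is a small sketch that does not need Spring: it starts from a string that already contains the code points 145 to 148 (as htmlUnescape would leave them) and reinterprets them as cp1252 bytes. The class and method names are hypothetical, and it assumes the JDK provides the windows-1252 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Cp1252Remap {

    // Reinterprets code points 0..255 as Windows-1252 bytes.
    // Note: any character above U+00FF cannot be encoded in ISO 8859-1
    // and is replaced by '?' during getBytes().
    static String remap(String unescaped) {
        byte[] bytes = unescaped.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // \u0091..\u0094 are the invisible control characters 145..148
        String raw = "ABC \u0091 \u0092 \u0093 \u0094 \u00E2";
        System.out.println(remap(raw));
    }
}
```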
You can use a regular expression for that.
Pattern p = Pattern.compile("&#(\\d+);");
StringBuffer out = new StringBuffer();
String s = "ABC&#145;&#146;&#226;D";
Matcher m = p.matcher(s);
int startIdx = 0;
byte[] bytes = new byte[]{0};
while(startIdx < s.length() && m.find(startIdx)) {
if (m.start() > startIdx) {
out.append(s.substring(startIdx, m.start()));
}
// fetch the numeric value from the encoding and put it into a byte array
bytes[0] = (byte)Short.parseShort(m.group(1));
// convert the windows 1252 encoded byte array into a java string
out.append(new String(bytes,"Windows-1252"));
startIdx = m.end();
}
if (startIdx < s.length()) {
out.append(s.substring(startIdx));
}
The output / result will be something like
ABC‘’âD
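On Java 9 or later, the same regular-expression approach can be written more compactly with Matcher.replaceAll and a replacement function. The sketch below is one possible variant, not the answer's exact behaviour: the class name is invented, only values 128 to 159 are decoded through Windows-1252, other valid values are taken as Unicode code points, and anything else is left untouched.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Cp1252EntityDecoder {

    // At most 7 digits, so Integer.parseInt cannot overflow.
    private static final Pattern ENTITY = Pattern.compile("&#(\\d{1,7});");
    private static final Charset CP1252 = Charset.forName("windows-1252");

    static String decode(String input) {
        return ENTITY.matcher(input).replaceAll(match -> {
            int value = Integer.parseInt(match.group(1));
            String decoded;
            if (value >= 128 && value <= 159) {
                // Assumption: this range was escaped from Windows-1252 bytes.
                decoded = new String(new byte[] {(byte) value}, CP1252);
            } else if (Character.isValidCodePoint(value)) {
                decoded = new String(Character.toChars(value));
            } else {
                decoded = match.group(); // leave unknown references as they are
            }
            // quoteReplacement keeps '$' and '\' in the result literal.
            return Matcher.quoteReplacement(decoded);
        });
    }

    public static void main(String[] args) {
        System.out.println(decode("ABC&#145;&#146;&#226;D"));
    }
}
```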

How to convert string representation of an ASCII value to character

I have a String containing the ASCII value of a character, for example
String test = "0x07";
Is there a way to parse it to its character value?
I want something like
char c = 0x07;
but which character it is will be known only by reading the value in the string.
You have to add one step:
String test = "0x07";
int decimal = Integer.decode(test);
char c = (char) decimal;
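If the string comes from outside the program, a range check before the cast avoids silently truncating large values. A small sketch, with a hypothetical helper name:

```java
final class CharLiteralParser {

    // Integer.decode accepts decimal, hex ("0x", "0X", "#") and octal
    // (leading "0") notation, so "0101" is read as octal, not decimal.
    static char parse(String text) {
        int value = Integer.decode(text.trim());
        if (value < Character.MIN_VALUE || value > Character.MAX_VALUE) {
            throw new IllegalArgumentException("Not a valid char value: " + text);
        }
        return (char) value;
    }

    public static void main(String[] args) {
        char bell = parse("0x07");
        System.out.println((int) bell);    // numeric value of the parsed char
        System.out.println(parse("0x41")); // a printable example
    }
}
```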

Replace special characters in a string with their UTF-8 encoded character java?

I want to convert only the special characters to their UTF-8 equivalent character.
For example given a String: Abcds23#$_ss, it should get converted to Abcds23353695ss.
This is how I arrived at the conversion above:
In UTF-8, # is 23 in hexadecimal and 35 in decimal; $ is 24 in hexadecimal and 36 in decimal; _ is 5f in hexadecimal and 95 in decimal.
I know we have the String.replaceAll(String regex, String replacement) method, but I want to replace each specific character with its specific UTF-8 equivalent.
How do I do this in Java?
I don't know how you define "special characters", but this function should give you an idea:
public static String convert(String str)
{
StringBuilder buf = new StringBuilder();
for (int index = 0; index < str.length(); index++)
{
char ch = str.charAt(index);
if (Character.isLetterOrDigit(ch))
buf.append(ch);
else
buf.append(str.codePointAt(index));
}
return buf.toString();
}
@Test
public void test()
{
Assert.assertEquals("Abcds23353695ss", convert("Abcds23#$_ss"));
}
The following requires Java 8 or later. For each Unicode code point (symbol) it checks whether the code point is an ASCII (< 128) letter or digit; if so it is kept, otherwise it is replaced by the decimal digits of its code point value.
static String convert(String str) {
int[] cps = str.codePoints()
.flatMap((cp) ->
Character.isLetterOrDigit(cp) && cp < 128
? IntStream.of(cp)
: String.valueOf(cp).codePoints())
.toArray();
return new String(cps, 0, cps.length);
}
String.codePoints() yields an IntStream, flatMap adds IntStreams in a single flattened stream, and toArray collects it in an array. So we can construct a new String from those code points. Entirely Unicode safe.
The conversion cannot be reversed without delimiters.
On Unicode:
Unicode numbers symbols, called code points, from 0 up to U+10FFFF, so a code point fits within 3 bytes.
To encode code points as bytes there are UTF-8 (1 to 4 bytes per code point), UTF-16LE and UTF-16BE (2-byte units, with two units for code points above U+FFFF) and UTF-32 (the code point more or less as-is).
Java string constants in a .class file are stored in (modified) UTF-8. A String is composed of UTF-16 chars, and a String can yield code points as above. So Java by design uses Unicode for text.
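The difference between chars, code points and encoded bytes can be checked directly. This sketch (class and method names are illustrative only) counts each unit for a string made of an ASCII letter, the copyright sign and one emoji:

```java
import java.nio.charset.StandardCharsets;

final class UnicodeUnits {

    static int utf16Units(String s) {
        return s.length(); // number of 16-bit chars
    }

    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a', U+00A9 (copyright sign), U+1F600 (written as a surrogate pair)
        String s = "a\u00A9\uD83D\uDE00";
        System.out.println(utf16Units(s)); // 4
        System.out.println(codePoints(s)); // 3
        System.out.println(utf8Bytes(s));  // 7
    }
}
```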

Printing next ASCII character in the sequence

I want to know how to recognize and print the next character in the ASCII sequence when the input is a non-letter value such as a space or "!".
I know that for a letter we can get its ASCII value by using
char character = 'a';
int ascii = (int) character;
Then, by adding 1 to it and converting it back to char, we can get the next value in the sequence.
You can use:
char character = 'a';
char next = (char) (character + 1); // 'b'; works the same for ' ' or '!'
It should work, but I haven't tested it.
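A quick way to confirm this for non-letter input is to wrap the cast in a helper (the name next is just for illustration) and try it on a space and on "!":

```java
final class NextChar {

    // char arithmetic is done on int, so the result must be cast back.
    // Note that '\uffff' wraps around to '\u0000'.
    static char next(char c) {
        return (char) (c + 1);
    }

    public static void main(String[] args) {
        System.out.println(next(' ')); // !
        System.out.println(next('!')); // "
        System.out.println(next('a')); // b
    }
}
```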

How to format numbers to a hex strings?

I want to format int numbers as hex strings. System.out.println(Integer.toHexString(1)); prints 1 but I want it as 0x00000001. How do I do that?
Try this
System.out.println(String.format("0x%08X", 1));
You can use the String.format to format an integer as a hex string.
System.out.println(String.format("0x%08X", 1));
That is, pad with zeros, and make the total width 8. The 1 is converted to hex for you. The above line gives: 0x00000001 and
System.out.println(String.format("0x%08X", 234));
gives: 0x000000EA
From formatting syntax documented on Java's Formatter class:
Integer intObject = Integer.valueOf(1);
String s = String.format("0x%08x", intObject);
System.out.println(s);
Less verbose:
System.out.printf("0x%08x", 1); // use %08X for upper-case letters
I don't know Java too intimately, but there must be a way you can pad the output from the toHexString function with a '0' to a length of 8. If "0x" will always be at the beginning, just tack on that string to the beginning.
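The manual padding described above could look like the following sketch. The helper name is hypothetical; in practice String.format("0x%08x", value) gives the same result in one call.

```java
final class HexPadding {

    // Pads the output of Integer.toHexString with leading zeros to 8 digits
    // and puts "0x" in front.
    static String toPaddedHex(int value) {
        String hex = Integer.toHexString(value); // lower case, no leading zeros
        StringBuilder sb = new StringBuilder("0x");
        for (int i = hex.length(); i < 8; i++) {
            sb.append('0');
        }
        return sb.append(hex).toString();
    }

    public static void main(String[] args) {
        System.out.println(toPaddedHex(1));   // 0x00000001
        System.out.println(toPaddedHex(234)); // 0x000000ea
    }
}
```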
Java 17+
There is a new immutable class dedicated to conversion into and formatting hexadecimal numbers. The easiest way to go is using HexFormat::toHexDigits which includes leading zeroes:
String hex = "0x" + HexFormat.of().toHexDigits(1);
// 0x00000001
Beware: the "0x" prefix has to be concatenated manually, because toHexDigits ignores any configured prefix and suffix (only the HexFormat::formatHex methods use them). The following snippet therefore does not work as expected:
String hex = HexFormat.of().withPrefix("0x").toHexDigits(1);
// 00000001
Returns the eight hexadecimal characters for the int value. Each nibble (4 bits) from most significant to least significant of the value is formatted as if by toLowHexDigit(nibble). The delimiter, prefix and suffix are not used.
Alternatively, take advantage of HexFormat::formatHex, which formats each byte as two hexadecimal characters, and pass it a StringBuilder that already contains "0x" as the Appendable:
Each byte value is formatted as the prefix, two hexadecimal characters selected from uppercase or lowercase digits, and the suffix.
StringBuilder hex = HexFormat.of()
.formatHex(new StringBuilder("0x"), new byte[] {0, 0, 0, 1});
// 0x00000001
StringBuilder hex = HexFormat.of()
.formatHex(new StringBuilder("0x"), ByteBuffer.allocate(4).putInt(1).array());
// 0x00000001
You can use a java.util.Formatter or the printf method on a PrintStream.
This is a String extension function for Kotlin that left-pads with zeros:
//if lengthOfResultTextNeeded = 3 and input String is "AC", the result is = "0AC"
//if lengthOfResultTextNeeded = 4 and input String is "AC", the result is = "00AC"
fun String.unSignedHex(lengthOfResultTextNeeded: Int): String {
val count =
lengthOfResultTextNeeded - this.length
val buildHex4DigitString = StringBuilder()
var i = 1
while (i <= count) {
buildHex4DigitString.append("0")
++i
}
return buildHex4DigitString.toString() + this
}
