A strange character - java

String str = "ิ";
System.out.println(str.length());
byte[] b = str.getBytes();
System.out.println(b[0]);
System.out.println(b[1]);
System.out.println(b[2]);
Above is my code. There is a special character in str. Its length is one, but getBytes() returns three bytes. Why? How can I make it one byte? How do I print this character from Java code? Also, on my Android phone this character cannot be deleted.

It's because the String is "encoded" into bytes. According to the getBytes() documentation:
Encodes this String into a sequence of bytes using the platform's default charset, storing the
result into a new byte array.
The behavior of this method when this string cannot be encoded in the default charset is unspecified.
The CharsetEncoder class should be used when more control over the encoding process is required.

It seems like your special character is encoded using UTF-8. UTF-8 encodes characters with one to four bytes, depending on where the code point falls in the Unicode range.
You can find the algorithm in the wikipedia page here and see how the size is determined.
From the Java String length() documentation:
The length is equal to the number of Unicode code units in the string.
The character is a single UTF-16 code unit, so length() returns 1, while getBytes() returns 3 bytes because UTF-8 needs three bytes to encode it.

Length is NOT bytes
You have only 1 character, but that character is 3 bytes long once encoded. A String is made of several characters, but that does not mean a 1-character string is 1 byte.
About that character "ิ"
Java uses Unicode internally. "ิ" is the code point 0E34, this value being THAI CHARACTER SARA I.
About your encoding issue
You need to change the way your application does its charset encoding and use a UTF-8 encoding explicitly instead of relying on the platform default.
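For illustration, here is a minimal sketch of encoding and decoding with an explicit charset instead of the platform default; the class name ExplicitCharset is just for the example:
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    public static void main(String[] args) {
        String str = "\u0E34"; // THAI CHARACTER SARA I
        // Encode with an explicit charset instead of the platform default.
        byte[] utf8 = str.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);      // 3 bytes in UTF-8
        // Decoding with the same charset restores the one-character string.
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.length());    // 1
        System.out.println(back.equals(str)); // true
    }
}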

Besides all the other comments, here is a small snippet to visualize it.
String str = "ิ"; // \u0E34
System.out.println("character length: " + str.length());
System.out.print("bytes: ");
for (byte b : str.getBytes("UTF-8")) {
    System.out.append(Integer.toHexString(b & 0xFF).toUpperCase() + " ");
}
System.out.println("");
int codePoint = Character.codePointAt(str, 0);
System.out.println("unicode name of the codepoint: " + Character.getName(codePoint));
output
character length: 1
bytes: E0 B8 B4
unicode name of the codepoint: THAI CHARACTER SARA I
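To show where length() and getBytes() diverge even further, here is a small extra example of my own with a character outside the Basic Multilingual Plane (it assumes java.nio.charset.StandardCharsets is imported):
String emoji = "\uD83D\uDE00"; // U+1F600, one symbol stored as a surrogate pair
System.out.println(emoji.length());                                // 2 UTF-16 code units
System.out.println(emoji.codePointCount(0, emoji.length()));       // 1 code point
System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8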

Related

Trim a string based on its byte length

I want to trim a string based on its byte length (not its character length). How can I achieve this?
Example:
String country = "日本日本日";
One Japanese character takes 3 bytes. The string above has a character length of 5 and a byte length of 15. If I give 3, only the 1st character should be printed. If I give 5, still only the 1st character, because 2 characters take 6 bytes. If I give 6, the first 2 characters should be printed.
Edit: The byte size varies depending on the string. It may be Japanese, Japanese mixed with numerals, or some other language.
Divide the required byte count by 3 and fetch that many characters (this assumes every character is 3 bytes). For example:
int requiredBytes = 5;
int requiredLength = requiredBytes / 3;
System.out.println(country.substring(0, requiredLength));
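That only works when every character is exactly 3 bytes. A more general sketch of my own (the helper name trimToByteLength is made up, not a standard API) walks the code points and adds up their UTF-8 sizes, so mixed Japanese and ASCII input is handled too:
static String trimToByteLength(String s, int maxBytes) {
    int byteCount = 0;
    int i = 0;
    while (i < s.length()) {
        int cp = s.codePointAt(i);
        int cpBytes = utf8Length(cp);
        if (byteCount + cpBytes > maxBytes) {
            break; // adding this character would exceed the limit
        }
        byteCount += cpBytes;
        i += Character.charCount(cp);
    }
    return s.substring(0, i);
}

// Number of bytes a code point occupies in UTF-8.
static int utf8Length(int cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}
With the example string, trimToByteLength(country, 5) returns "日" and trimToByteLength(country, 6) returns "日本".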

Unescaping special characters using Java

I have the following value (escaped using Windows-1252):
ABC &#145 ; &#146 ; &#147 ; &#148 ; &#226 ;, &#234 ;, &#238 ;, &#244 ;, &#251 ;
(I had to add a space to display the exact value; actually there is no space between the number and the ;.)
but the actual value should be the following, and that is what I want:
ABC ‘ ’ “ ” â, ê, î, ô, û
I have tried HtmlUtils.htmlUnescape(decodedString); but it did not work.
I am getting output like
ABC â, ê, î, ô, û
The ‘ ’ “ ” characters are removed.
Can you please provide how to do this in java?
The quote characters are probably still in the string, they are just invisible when displayed. That's because in Unicode or ISO 8859-1, the code point 145 is not assigned to a visible character.
The best solution (if possible) is to pass the encoding to the unescapeHtml method.
An alternative is to call htmlUnescape first and then map the cp1252 codepoints to the corresponding Unicode code points, using the following code:
String unescapeHtmlCp1252(String input) {
    String nohtml = HtmlUtils.htmlUnescape(input);
    byte[] bytes = nohtml.getBytes(StandardCharsets.ISO_8859_1);
    String result = new String(bytes, Charset.forName("cp1252"));
    return result;
}
When you step through this code with a debugger and inspect the nohtml string, you will probably see characters with the value 145, 146, and so on. This means that the characters are still there at this point.
Later, when the characters are converted into pixels by using a font, these characters do not have a definition and are therefore just ignored. But until this step, they are still there.
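A quick way to convince yourself of that is to print the code points of the unescaped string. A small sketch, assuming Spring's HtmlUtils as in the code above:
String nohtml = HtmlUtils.htmlUnescape("ABC &#145;&#146;");
nohtml.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
// prints U+0091 and U+0092 among the output: the characters are still there, just not visible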
You can use a regular expression for that.
Pattern p = Pattern.compile("&#(\\d+);");
StringBuffer out = new StringBuffer();
String s = "ABC&#145;&#146;&#226;D";
Matcher m = p.matcher(s);
int startIdx = 0;
byte[] bytes = new byte[]{0};
while (startIdx < s.length() && m.find(startIdx)) {
    if (m.start() > startIdx) {
        out.append(s.substring(startIdx, m.start()));
    }
    // fetch the numeric value from the encoding and put it into a byte array
    bytes[0] = (byte) Short.parseShort(m.group(1));
    // convert the windows 1252 encoded byte array into a java string
    out.append(new String(bytes, "Windows-1252"));
    startIdx = m.end();
}
if (startIdx < s.length()) {
    out.append(s.substring(startIdx));
}
The output / result will be something like
ABC‘’âD

Replace special characters in a string with their UTF-8 encoded character java?

I want to convert only the special characters to their UTF-8 equivalent character.
For example given a String: Abcds23#$_ss, it should get converted to Abcds23353695ss.
The following is how I did the above conversion:
The UTF-8 value of # is 23 in hexadecimal and 35 in decimal. The UTF-8 value of $ is 24 in hexadecimal and 36 in decimal. The UTF-8 value of _ is 5f in hexadecimal and 95 in decimal.
I know we have the String.replaceAll(String regex, String replacement) method. But I want to replace specific character with their specific UTF-8 equivalent.
How do I do the same in java?
I don't know how you define "special characters", but this function should give you an idea:
public static String convert(String str)
{
    StringBuilder buf = new StringBuilder();
    for (int index = 0; index < str.length(); index++)
    {
        char ch = str.charAt(index);
        if (Character.isLetterOrDigit(ch))
            buf.append(ch);
        else
            buf.append(str.codePointAt(index));
    }
    return buf.toString();
}
@Test
public void test()
{
    Assert.assertEquals("Abcds23353695ss", convert("Abcds23#$_ss"));
}
The following uses Java 8 or above. It keeps a Unicode code point (symbol) if it is an ASCII letter or digit (< 128), and otherwise outputs the code point's numerical value as a string of digits.
static String convert(String str) {
    int[] cps = str.codePoints()
            .flatMap((cp) ->
                    Character.isLetterOrDigit(cp) && cp < 128
                            ? IntStream.of(cp)
                            : String.valueOf(cp).codePoints())
            .toArray();
    return new String(cps, 0, cps.length);
}
String.codePoints() yields an IntStream, flatMap merges the per-character IntStreams into a single flattened stream, and toArray collects the result into an array, from which we construct a new String. Entirely Unicode safe.
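A quick check with the string from the question, using the convert method just shown:
System.out.println(convert("Abcds23#$_ss")); // prints Abcds23353695ss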
Note that the conversion cannot be reversed without delimiters: in "Abcds23353695ss" you cannot tell which digits were literal and which came from converted characters.
On Unicode:
Unicode assigns numbers to symbols, called code points, from 0 upwards into the 3-byte range (up to U+10FFFF).
To encode them as bytes there exist UTF-8 (multi-byte), UTF-16LE and UTF-16BE (sequences of 2-byte units), and UTF-32 (more or less the code points as-is).
Java string constants in a .class file are stored in (modified) UTF-8. A String is composed of UTF-16 chars, and a String can give you the code points as shown above. So Java by design uses Unicode for text.
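To make the byte question concrete, here is a small added sketch of my own showing how many bytes the Thai character from the first question takes in the encodings just mentioned (it assumes java.nio.charset.Charset and StandardCharsets are imported; UTF-32 is not a guaranteed charset, hence the Charset.forName lookup):
String s = "\u0E34"; // THAI CHARACTER SARA I
System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 3
System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 2
System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 2
System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 4, where supported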

How to convert hex string to octal string in java

I am converting a string into a hexadecimal string, and I want to convert that into an octal string.
I am doing it as follows
String s="I found the reason 2 days ago i was too bussy to update you.The problem was that the UDHL was missing. I should have added 06 at the beginning. it all works fine.For some reason I thought kannel adds this by itself (not very difficult...), but now I know it doesn't...";
String hex = String.format("%040x", new BigInteger(1, s.getBytes("UTF-8")));
Output is
4920666f756e642074686520726561736f6e203220646179732061676f20692077617320746f6f20627573737920746f2075706461746520796f752e5468652070726f626c656d20776173207468617420746865205544484c20776173206d697373696e672e20492073686f756c6420686176652061646465642030362061742074686520626567696e6e696e672e2020697420616c6c20776f726b732066696e652e466f7220736f6d6520726561736f6e20492074686f75676874206b616e6e656c2061646473207468697320627920697473656c6620286e6f74207665727920646966666963756c742e2e2e292c20627574206e6f772049206b6e6f7720697420646f65736e27742e2e2e
I need to convert the hex string to an octal string. I tried this:
String octal = Integer.toOctalString(Integer.parseInt(hex,16));
But as expected it gave me a NumberFormatException, as the hex string has too many characters in it.
I want to know how I can convert the hex string to an octal string.
As per the discussion in the comments, you want to take each byte individually and convert it to octal:
String s = "My string to convert";
byte[] bytes = s.getBytes("UTF-8");
for (byte b : bytes) {
String octalValue = Integer.toString(b, 8);
// Do whatever
}
The problem is that your input (String hex) is too big to be stored in a single int.
Three hex digits (12 bits) correspond to four octal digits. So split your input string into chunks of three hex digits (starting from the right, so the digit boundaries line up), convert each chunk the way you already figured out, pad each result to four octal digits, and concatenate the outputs.
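Alternatively, java.math.BigInteger has no size limit, so you can parse the entire hex string in one go and print it in base 8. A minimal sketch, reusing the hex variable from the question:
String octal = new BigInteger(hex, 16).toString(8);
System.out.println(octal);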

Java Char to its unicode hexadecimal string representation and vice-versa

I need to generate the hexadecimal codes of Java characters into strings, and parse those strings again later. I found here that parsing can be performed as follows:
char c = "\u041f".toCharArray()[0];
I was hoping for something more elegant like Integer.valueOf() for parsing.
And how do I generate the hexadecimal Unicode representation properly?
This will generate a hex string representation of the char:
char ch = 'ö';
String hex = String.format("%04x", (int) ch);
And this will convert the hex string back into a char:
int hexToInt = Integer.parseInt(hex, 16);
char intToChar = (char)hexToInt;
After doing some deeper reading, the Javadoc says that the Character methods based on char parameters do not support all Unicode values, but those taking code points (i.e., int) do.
Hence, I have been performing the following test:
int codePointCopyright = Integer.parseInt("00A9", 16);
System.out.println(Integer.toHexString(codePointCopyright));
System.out.println(Character.isValidCodePoint(codePointCopyright));
char[] toChars = Character.toChars(codePointCopyright);
System.out.println(toChars);
System.out.println();
int codePointAsian = Integer.parseInt("20011", 16);
System.out.println(Integer.toHexString(codePointAsian));
System.out.println(Character.isValidCodePoint(codePointAsian));
char[] toCharsAsian = Character.toChars(codePointAsian);
System.out.println(toCharsAsian);
and I am getting the expected output for both: the hex value, true from isValidCodePoint, and the printed character (the Asian one as a surrogate pair).
Therefore, I should not talk about a char in my question, but rather about an array of chars, since a Unicode character can be represented by more than one char. An int, on the other hand, covers it all.
On String level:
The following uses int code points rather than char, which is needed for e.g. Chinese supplementary characters, but works for ordinary chars too.
int cp = "\u041f".codePointAt(0);
String s = new String(Character.toChars(cp));
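For completeness, a small added sketch of my own showing the same hex round trip for a supplementary character (the 0x20011 code point from the test above), which comes back as a two-char String:
int codePoint = Integer.parseInt("20011", 16);
String hex = String.format("%04x", codePoint);             // "20011", five hex digits
String symbol = new String(Character.toChars(codePoint));  // one symbol, two chars
System.out.println(hex + " -> " + symbol + " (length " + symbol.length() + ")");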
On native2ascii level:
If you want to convert back and forth between \uXXXX escapes and Unicode characters, use StringEscapeUtils from Apache commons-lang:
String t = StringEscapeUtils.escapeJava(s + "ö");
System.out.println(t);
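And, assuming StringEscapeUtils from commons-lang as above, the reverse direction restores the real characters:
String back = StringEscapeUtils.unescapeJava(t);
System.out.println(back); // prints the original string with the real 'ö' again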
On the command line, native2ascii can convert files back and forth between \u-escaped ASCII and, say, UTF-8.
