character taking 6 bytes - java

We are trying to save the string below, which is actually a name, in the DB. We make an API call and get this name:
株式会社エス・ダブリュー・コミュニケーションズ
While saving it through our code (servlet - Hibernate - database), we get an error:
Caused by: java.sql.BatchUpdateException: ORA-12899: value too large for column "NAME_ON_ACCOUNT" (actual: 138, maximum: 100)
This is 23 characters, but it looks like each character is taking 6 bytes, which would make it 138.
The code below gives me 69:
byte[] utf8Bytes = string.getBytes("UTF-8");
System.out.println(utf8Bytes.length);
And this gives me 92:
byte[] utf8Bytes = string.getBytes("UTF-32");
System.out.println(utf8Bytes.length);
I will certainly check NLS_CHARACTERSET and look at the I/O classes, but have you ever seen a character take 6 bytes? Any help will be much appreciated.

The string probably holds HTML entities, like &#29123;, or possibly URL-style escapes, like %8C%9A, or maybe UTF-7, like +Ay76b. (I made up those values, but your actual ones will look similar.) It is always a pain to rely on a framework for character encoding, because its authors were likely American or European, for whom a simple ANSI encoding where one byte equals one character is enough.
If you work out which encoding you actually have and convert the data to real UTF-8 (or even UTF-16), it will take up less space in this particular case.

You probably literally have:
\u682a\u5f0f\u4f1a\u793e\u30a8\u30b9\u30fb\u30c0\u30d6\u30ea\u30e5\u30fc\u30fb\u30b3\u30df\u30e5\u30cb\u30b1\u30fc\u30b7\u30e7\u30f3\u30ba
See:
"\u682a\u5f0f\u4f1a\u793e\u30a8\u30b9\u30fb\u30c0\u30d6\u30ea\u30e5\u30fc\u30fb\u30b3\u30df\u30e5\u30cb\u30b1\u30fc\u30b7\u30e7\u30f3\u30ba".length();
//23, or 69 UTF-8 bytes
Vs:
"\\u682a\\u5f0f\\u4f1a\\u793e\\u30a8\\u30b9\\u30fb\\u30c0\\u30d6\\u30ea\\u30e5\\u30fc\\u30fb\\u30b3\\u30df\\u30e5\\u30cb\\u30b1\\u30fc\\u30b7\\u30e7\\u30f3\\u30ba".length();
//138, or 138 UTF-8 bytes
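If that is indeed what you have, one way to confirm and repair it is to unescape the literal \uXXXX sequences back into real characters before saving. A minimal sketch, with an illustrative class name (not from the question):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {
    // Turns literal "\uXXXX" escape sequences back into the characters they denote.
    static String unescape(String s) {
        Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
        Matcher m = p.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = "\\u682a\\u5f0f\\u4f1a\\u793e"; // literal backslash-u text, 24 characters
        System.out.println(escaped.length());            // 24
        System.out.println(unescape(escaped).length());  // 4 real characters after unescaping
    }
}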

Related

Java: what is the common way to include tags in a TCP/IP message

I designed a protocol for sending TCP/IP messages between peers in a peer-to-peer system.
A message is a byte array in which the first byte indicates the desired operation; the arguments follow.
To reconstruct the arguments I read byte by byte. Because there can be multiple arguments, I have to put tags between them (I call such a tag an end-of-argument byte).
What is the common way of including such tags?
Currently I use one byte to represent the end-of-argument tag (the number 17). It is important that I use a byte (or sequence of bytes) that can never appear inside an argument, otherwise it would be interpreted as an end-of-argument byte.
At first I thought of using 17 as the end-of-argument byte because that is the ASCII value for "Device Control 1". But now I'm not 100% sure that it can never appear inside an argument. Arguments are files (any possible file, for example txt or doc, but also an image, and so on).
You cannot insert separators without making assumptions about the data that will reside between them. If your protocol is to be as generic as possible, it should support a byte-array type, which can potentially conflict with your separator bytes.
I suggest taking the same approach as the typical binary serialization formats out there (e.g. Avro), but since you don't have any kind of schema definition, you will need to adjust it a bit to carry type information inline, as Thrift or Protobuf do, but without a schema.
Try the following format:
[type1][length1][data][type2][length2][data2]...[lengthN][dataN]
The type tag can be 4 bits, which gives you 16 types to assign; you could say type 1 is String, 2 is JPEG image, 3 is long number, depending on your needs.
The length can be one byte, which lets you express lengths of 0-255. If you need larger values, reserve the maximum (255) as a continuation marker: keep reading chunks of the same type until you find a length smaller than 255, which is the last chunk for that field.
The advantage of this method is that you always know where the service bytes are and where the actual data is. Rather than indicating the end of an argument, you indicate its beginning plus its length.
Later you can include a schema tag if you are able to categorize your messages; that would let you strip the type information from the messages and keep only the schema id and the length tags, which can potentially improve performance.
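As a rough illustration, here is a minimal sketch of writing and reading such [type][length][data] records. It is a sketch under stated assumptions, not a definitive implementation: the class name is made up, the type tag uses a whole byte instead of 4 bits for simplicity, and a length byte of 255 is treated as "a full 255-byte chunk, more chunks of this field follow".
import java.io.*;
import java.nio.charset.StandardCharsets;

public class TlvSketch {
    static final int TYPE_STRING = 1; // illustrative type tag
    static final int MAX_CHUNK = 255; // length byte 255 means "more chunks follow"

    static void writeField(DataOutputStream out, int type, byte[] data) throws IOException {
        int offset = 0;
        // Emit full 255-byte chunks; the reader treats length 255 as "more follows".
        while (data.length - offset >= MAX_CHUNK) {
            out.writeByte(type);
            out.writeByte(MAX_CHUNK);
            out.write(data, offset, MAX_CHUNK);
            offset += MAX_CHUNK;
        }
        // Final chunk; may be zero-length if the data was an exact multiple of 255 bytes.
        out.writeByte(type);
        out.writeByte(data.length - offset);
        out.write(data, offset, data.length - offset);
    }

    static byte[] readField(DataInputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (true) {
            in.readUnsignedByte();        // type tag, ignored in this sketch
            int len = in.readUnsignedByte();
            byte[] chunk = new byte[len];
            in.readFully(chunk);
            buf.write(chunk, 0, chunk.length);
            if (len < MAX_CHUNK) {
                return buf.toByteArray(); // a short (or empty) chunk ends the field
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        writeField(new DataOutputStream(bytes), TYPE_STRING,
                "hello".getBytes(StandardCharsets.UTF_8));
        byte[] back = readField(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        System.out.println(new String(back, StandardCharsets.UTF_8)); // prints "hello"
    }
}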

Android - String Shortening Approach

I'm currently developing a marketing app on Android, which has a feature to send a URL via SMS. Because I'm using SMS, I want to make the text as short as possible, so it won't be split into parts.
The URL is generated dynamically by the app. A different contact results in a different URL, because the app puts some "contact related information" into the URL. It is this information that needs to be shortened, not the base URL.
I tried using Base64 to shorten the string, but it's not working.
Before
Text: Myself|1234567890
Length: 17
After
Text: TXlzZWxmfDEyMzQ1Njc4OTA=
Length: 25
Then I tried Deflater; the result is better than Base64, but it still doesn't shorten the string.
Before
Text: Myself|1234567890
Length: 17
After
Text: x��,N�I�1426153��4����3��
Length: 24
I've also tried GZIP, and the result is much worse than the other methods.
Before
Text: Myself|1234567890
Length: 17
After
Text: ����������������,N�I�1426153��4�����w��������
Length: 36
After comparing the test results, I decided to use Base64 as it sometimes works, but I'm not at all satisfied. Can anyone suggest a better approach?
EDIT:
I need this string shortening to be done OFFLINE, without an internet connection. I'm terribly sorry for this sudden change; our developer team decided so. Any ideas?
Base64 on its own won't work because it typically increases the length of an encoded string by about 37%.
Deflater and GZIP both contain headers that will increase the length of short strings.
However, you can use Huffman coding or arithmetic coding to take advantage of the fact that some characters are much more common in URLs than others. Generate a frequency table for your strings by producing a thousand of them or so and summing the occurrences of each character, then build a Huffman coding table from those frequencies. You can then use this hard-coded table to encode and decode your strings: don't transmit the table along with the message.
Here is an interactive webpage that lets you enter various strings and Huffman encode them; you can try it out with your URLs to get a general idea of what kind of compression rate to expect, although in practice you will get a slightly lower rate if you use the same table for all your strings. For your sample text "Myself|1234567890" the Huffman-encoded string is 51% of the original size.
After you generate your Huffman-encoded string you might need another pass over it to escape any illegal characters that can't be transmitted in the SMS (or just Base64 encode the Huffman-coded string), which may negate some of the savings from the Huffman encoding, but hopefully you will still end up with a net saving.
If you get a compression rate of around 50% with Huffman coding and then Base64 encode the result (increasing the size again), you will still end up with a result around 30% smaller than the original.
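To make the frequency-table idea concrete, here is a minimal sketch of building a Huffman table and encoding a string against it. The class name and sample frequencies are made up for illustration; in practice you would gather frequencies from a large sample of your real strings and hard-code the resulting table on both the sending and receiving side. The sketch also only counts bits rather than packing them into bytes.
import java.util.*;

public class HuffmanSketch {
    static class Node implements Comparable<Node> {
        final long freq;
        final Character ch; // null for internal nodes
        final Node left, right;
        Node(long freq, Character ch, Node left, Node right) {
            this.freq = freq; this.ch = ch; this.left = left; this.right = right;
        }
        public int compareTo(Node o) { return Long.compare(freq, o.freq); }
    }

    // Build a character -> bit-string table from character frequencies.
    static Map<Character, String> buildTable(Map<Character, Long> freqs) {
        PriorityQueue<Node> pq = new PriorityQueue<>();
        for (Map.Entry<Character, Long> e : freqs.entrySet()) {
            pq.add(new Node(e.getValue(), e.getKey(), null, null));
        }
        while (pq.size() > 1) { // repeatedly merge the two rarest nodes
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(a.freq + b.freq, null, a, b));
        }
        Map<Character, String> table = new HashMap<>();
        fill(pq.poll(), "", table);
        return table;
    }

    static void fill(Node n, String prefix, Map<Character, String> table) {
        if (n == null) return;
        if (n.ch != null) {
            table.put(n.ch, prefix.isEmpty() ? "0" : prefix); // single-character edge case
            return;
        }
        fill(n.left, prefix + "0", table);
        fill(n.right, prefix + "1", table);
    }

    public static void main(String[] args) {
        // Hypothetical frequencies from a couple of sample strings;
        // in practice, sum counts over thousands of real generated strings.
        Map<Character, Long> freqs = new HashMap<>();
        for (char c : "Myself|1234567890Myself|0987654321".toCharArray()) {
            freqs.merge(c, 1L, Long::sum);
        }
        Map<Character, String> table = buildTable(freqs);

        // "Encode" by concatenating bit strings; a real implementation would
        // pack these bits into bytes before transmission.
        StringBuilder bits = new StringBuilder();
        for (char c : "Myself|1234567890".toCharArray()) {
            bits.append(table.get(c));
        }
        System.out.println(bits.length() + " bits vs " + (17 * 8) + " bits uncompressed");
    }
}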

Reassigning chinese character to a custom character code

I'm currently using UTF-8 as my default charset in Eclipse.
The character code for "隥" is 38565 in decimal. I used http://www.chinese-tools.com/tools/converter-unichar.html to convert that character and get its Unicode form.
I'm sending data to an LED panel over Bluetooth. I use OutputStream.write(s) to write out a signal so that my microcontroller side can detect it and display the corresponding content on the LED panel, where s is a byte value ranging from 0 to 255.
Since the default character code for 隥 is 38565, is there any way I could reassign it to another number? For example, I could use 254 as the code for displaying this character on the microcontroller side.
On the Android side, I have to use 38565 so that I can output that character to an EditText. But when it comes to the Bluetooth communication part, how do I reassign that character to another byte?
Bluetooth Data Sending Code
outputStream.write(5);
Thread.sleep(500);
//Row 1 Message / Scroll
outputStream.write(0); // To indicate to my MC Side that this is TEXT_MODE.
if(sPrefs.getBoolean("scrollRow1", false))
{
}else
{
outputStream.write(6); // Scrolling of text
}
msg.getBytes();
if(msg.getBytes().length > 0)
{
a = 1;
outputStream.write(msg.getBytes()); //Write the String which is converted to Bytes through the OutputStream of the Bluetooth.
outputStream.write(32); // Indicate the ending of a String.
}
As you can see in the code above, Bluetooth uses bytes as its data type, and 38565 exceeds the capacity of a byte. My thought was that, when I type the character in the Android app, the key that represents the Chinese character "隥" has a character code of 38565:
<Row>
<Key android:codes="38565"
android:keyIcon="@drawable/chinese1"
android:keyWidth="10%p"
android:keyHeight="4%p"
android:horizontalGap="0.5%p"
android:keyEdgeFlags="left"/>
</Row>
But when it comes to sending the data out to the microcontroller side, I plan to change the key code to something else within 0-255. Is that possible, or do I have to make my own charset? I've been trying to figure this out and haven't been able to find any reference or help for the past few days. I hope someone can help me out; I'm doing this for my final year project.
Thanks.
I'm sending out data to a LED Panel using Bluetooth Communication.
The documentation for the panel should tell you what encoding to use to convert your characters to bytes. If the answer is UTF-8, then you would send byte 233 then 154 then 165. If the answer is Windows code page 936, you would send byte 235 then 81. Quite possibly the answer is something else, maybe an encoding such as ASCII which does not support Chinese characters.
outputStream.write(msg.getBytes());
getBytes without a parameter uses the computer's default encoding, which likely does not match the encoding the panel uses. In general you should avoid using the default encoding in Java because it varies across machines and is usually wrong. Prefer to explicitly specify an encoding, for example if it is UTF-8:
outputStream.write(msg.getBytes("UTF-8"));
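As a quick sanity check of the byte values mentioned above, the snippet below (the class name is just illustrative) prints the UTF-8 bytes of "隥" (U+96A5), which are 233, 154 and 165:
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    public static void main(String[] args) {
        byte[] utf8 = "隥".getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.println(b & 0xFF); // 233, 154, 165
        }
    }
}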

Java string .length() X does not fit into DB2 varchar(X)

I observed an issue: I am trying to save a record in a DB2 database, with a length check on the fields in my Java code. I kept the length check exactly equal to the database varchar limit, but when saving I get the SQL exception DB2 SQL Error: SQLCODE=-302, SQLSTATE=22001, SQLERRMC=null, DRIVER=3.57.82;
Then I reduced (truncated) the length to smaller than the database size, truncating to roughly substring(0, 900) for a varchar(1000).
Please let me know what the reason for this could be. Is it related to character encoding?
How does it need to be handled?
What is the default character encoding applied to a String (input from a request parameter of a text area field), and what is the corresponding number of bytes?
DB2 counts string length in bytes, not characters. The max length of a string you can store can therefore be a lot shorter than the size given for varchar.
Unfortunately the only way to truncate a string to a given number of bytes is to encode it as bytes, truncate, and reconstruct the string. From what you say it sounds like a variable-length encoding such as UTF-8 is being used. The difficult part is not producing an invalid character at the end, and the way to do that is to use the NIO charset API:
import java.nio.*;
import java.nio.charset.*;
...
CharBuffer in = CharBuffer.wrap(stringToTruncate);
ByteBuffer out = ByteBuffer.allocate(maxLength);
Charset db2charset = Charset.forName("UTF-8");
CharsetEncoder db2encoder = db2charset.newEncoder();
db2encoder.encode(in, out, true); // stops before a character that would not fit completely
out.flip();
return db2charset.decode(out).toString();
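For reference, here is a self-contained version of the same idea, assuming the column is stored as UTF-8 (the class and method names are made up for illustration):
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class ByteTruncator {
    // Truncate a string so that its UTF-8 encoding fits in maxBytes,
    // without cutting a multi-byte character in half.
    static String truncateToFit(String s, int maxBytes) {
        Charset charset = StandardCharsets.UTF_8;
        CharsetEncoder encoder = charset.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(maxBytes);
        encoder.encode(CharBuffer.wrap(s), out, true); // stops at the last complete character that fits
        out.flip();
        return charset.decode(out).toString();
    }

    public static void main(String[] args) {
        String name = "株式会社エス・ダブリュー・コミュニケーションズ";
        System.out.println(name.length());                                // 23 characters
        System.out.println(name.getBytes(StandardCharsets.UTF_8).length); // 69 bytes in UTF-8
        System.out.println(truncateToFit(name, 30).length());             // 10 characters (30 bytes)
    }
}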

Could anyone verify the correctness of getting an MD5 hash using this method?

MessageDigest m=MessageDigest.getInstance("MD5");
StringBuffer sb = new StringBuffer();
if(nodeName!=null) sb.append(nodeName);
if(nodeParentName!=null) sb.append(nodeParentName);
if(nodeParentFieldName!=null) sb.append(nodeParentFieldName);
if(nodeRelationName!=null) sb.append(nodeRelationName);
if(nodeViewName!=null) sb.append(nodeViewName);
if(treeName!=null) sb.append(treeName);
if(nodeValue!=null && nodeValue.trim().length()>0) sb.append(nodeValue);
if(considerParentHash) sb.append(parentHash);
m.update(sb.toString().getBytes("UTF-8"),0,sb.toString().length());
BigInteger i = new BigInteger(1,m.digest());
hash = String.format("%1$032X", i);
The idea behind these lines of code is that we append all the values of a class/model into a StringBuffer and then return the padded hash of that (the Java implementation returns MD5 hashes that are length 30 or 31, so the last line formats the hash to be padded with 0s).
I can verify that this works, but I have a feeling it fails at one point (our application fails and I believe this to be the probable cause).
Can anyone see a reason why this wouldn't work? Are there any workarounds to make this code less prone to errors (e.g. removing the need for the strings to be UTF-8)?
There are a few weird things in your code.
UTF-8 encoding of a character may use more than one byte, so you should not use the string length as the final parameter to the update() call, but the length of the byte array that getBytes() actually returned. As suggested by Paŭlo, use the update() method which takes a single byte[] parameter.
The output of MD5 is a sequence of 16 bytes with quite arbitrary values. If you interpret it as an integer (which is what you do with your call to BigInteger()), you will get a numerical value which will be smaller than 2^128, possibly much smaller. When converted back to hexadecimal digits, you may get 32, 31, 30... or fewer than 30 characters. Your use of the "%032X" format string left-pads with enough zeros, so your code works, but it is somewhat indirect (the output of MD5 was never an integer to begin with).
You assemble the hash input elements with raw concatenation. This can cause collisions. For instance, if nodeName is "foo" and nodeParentName is "barqux", then the MD5 input will begin with (the UTF-8 encoding of) "foobarqux". If nodeName is "foobar" and nodeParentName is "qux", then the MD5 input will also begin with "foobarqux". You do not say why you want to use a hash function, but usually, when one uses a hash function, it is to obtain a unique trace of some piece of data; two distinct data elements should yield distinct hash inputs.
When handling nodeValue, you call trim(), which means that this string could begin and/or end with whitespace that you do not want to include in the hash input -- but you do include it, since you append nodeValue and not nodeValue.trim().
If what you are trying to do has any relation to security then you should not use MD5, which is cryptographically broken. Use SHA-256 instead.
Hashing an XML element is normally done through canonicalization (which handles whitespace, attribute order, text representation, and so on). See this question on the topic of canonicalizing XML data with Java.
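Putting the first few points together, here is a sketch of one way to address the truncation and concatenation issues. It is only an illustration, not the original code: the class name is made up, the NUL separator is one possible choice, and SHA-256 is used per the recommendation above (keep MD5 only if you must stay compatible and security is not a concern):
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FieldHasher {
    // Appends an explicit separator between fields so that ("foo", "barqux") and
    // ("foobar", "qux") hash differently, and passes the whole byte array to
    // update() so multi-byte UTF-8 characters are not cut off.
    static String hashFields(String... fields) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256"); // or "MD5" if you must stay compatible
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (field != null) {
                sb.append(field.trim());
            }
            sb.append((char) 0); // unambiguous field separator
        }
        md.update(sb.toString().getBytes(StandardCharsets.UTF_8));
        BigInteger i = new BigInteger(1, md.digest());
        return String.format("%064x", i); // 64 hex digits for SHA-256 (use %032x for MD5)
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        System.out.println(hashFields("foo", "barqux"));
        System.out.println(hashFields("foobar", "qux")); // different hash, thanks to the separator
    }
}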
One possible problem is here:
m.update(sb.toString().getBytes("UTF-8"),0,sb.toString().length());
As said by Robin Green, the UTF-8 encoding can produce a byte[] which is longer than your original String (it will do this exactly when the String contains non-ASCII characters). In this case, you are only hashing the start of your String.
Better write it like this:
m.update(sb.toString().getBytes("UTF-8"));
Of course, this would not cause an exception, just a different hash than would be produced otherwise, if you have non-ASCII characters in your string. You should try to boil your failure down to an SSCCE, as lesmana recommended.
