Java string .length() X does not fit into DB2 varchar(X) - java

I am trying to save a record in a DB2 database, with a length check on the fields in my Java code. The check is set exactly equal to the database varchar limit, yet the save fails with an SQL exception: DB2 SQL Error: SQLCODE=-302, SQLSTATE=22001, SQLERRMC=null, DRIVER=3.57.82
I then truncated the strings to well below the column size, roughly substring(0, 900) for a varchar(1000) column, and the save succeeded.
Please let me know what could be the reason for this. Is it related to character encoding?
How should it be handled?
What is the default character encoding applied to a String (input from a request parameter of a text area field), and how many bytes does each character take?

DB2 counts string length in bytes, not characters. The max length of a string you can store can therefore be a lot shorter than the size given for varchar.
Unfortunately, the only way to truncate a string to a given number of bytes is to encode it as bytes, truncate the bytes, and reconstruct the string. From what you describe it sounds like a variable-length encoding such as UTF-8 is being used. The difficult part is avoiding an invalid (partial) character at the end, and the way to do that is with the NIO charset API:
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
...
CharBuffer in = CharBuffer.wrap(stringToTruncate);
ByteBuffer out = ByteBuffer.allocate(maxLength);
Charset db2charset = Charset.forName("UTF-8");
CharsetEncoder db2encoder = db2charset.newEncoder();
// encode() stops at the last complete character that fits in maxLength bytes,
// so no partial (invalid) character ends up in the output
db2encoder.encode(in, out, true);
out.flip();
// decode only the bytes actually written, giving a valid truncated string
return db2charset.decode(out).toString();
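A quick way to confirm that this is the cause is to compare the character count with the UTF-8 byte count before the insert; a minimal check, assuming the DB2 column really is measured in UTF-8 bytes:
String value = "Grüße";   // example input containing non-ASCII characters
int chars = value.length();
int bytes = value.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
// if bytes exceeds the varchar size, the insert fails even when chars does not
System.out.println(chars + " characters, " + bytes + " UTF-8 bytes");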

Related

Handling COMP-3 and EBCDIC conversion in Java to ASCII for large files

I am trying to convert COMP-3 and EBCDIC characters in my Java code, but I'm running into an out-of-memory exception because the amount of data handled is huge, about 5 GB. My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This results in an out-of-memory exception, which I can understand, but I can't use a file scanner either, since the data in the file won't be split into lines.
Can anyone point me in the right direction on how to handle this?
Note: the file may contain records of different lengths, so splitting it based on a fixed record length does not seem possible.
As Bill said, you could (should) ask for the data to be converted to display characters on the mainframe, and if it is English-language data you can then do an ASCII transfer.
Also, how are you deciding where the COMP-3 fields start?
You do not have to read the whole file into memory; you can still read the file in blocks. This method will fill an array of bytes:
protected final int readBuffer(InputStream in, final byte[] buf)
        throws IOException {
    int total = 0;
    int num = in.read(buf, total, buf.length);
    // keep reading until the buffer is full or end-of-file is reached
    while (num >= 0 && total + num < buf.length) {
        total += num;
        num = in.read(buf, total, buf.length - total);
    }
    if (num > 0) {
        total += num;
    }
    // number of bytes actually read, or -1 if end-of-file was hit immediately
    return total > 0 ? total : num;
}
If all the records are the same length, create an array of that record length and the above method will read one record at a time, as in the sketch below.
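For illustration, a read loop over fixed-length records might look like this (the 500-byte record length and the processRecord call are assumptions made up for the sketch; path is the Path from the question, and the loop stops at end-of-file or a short final record):
int recordLength = 500;                       // assumed fixed record length
byte[] record = new byte[recordLength];
try (InputStream in = new BufferedInputStream(Files.newInputStream(path))) {
    while (readBuffer(in, record) == recordLength) {
        processRecord(record);                // hypothetical per-record handling
    }
}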
Finally the JRecord project has classes to read fixed length files etc. It can do comp-3 conversion. Note: I am the author of JRecord.
I'm running into an out-of-memory exception because the amount of data handled is huge, about 5 GB.
You only need to read one record at a time.
My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This results in an out-of-memory exception, which I can understand
Me too.
but I can't use a file scanner either, since the data in the file won't be split into lines.
You mean you can't use the Scanner class? That's not the only way to read a record at a time.
In any case, not all files have record delimiters. Some have fixed-length records, some have length words at the start of each record, and some have record-type attributes at the start of each record (or at least in a fixed part of the record).
I'll have to split it based on an attribute record_id at a particular position (say at the beginning of each record) that will tell me the record length.
So read that attribute, decode it if necessary, and read the rest of the record according to the record length you derive from the attribute. One at a time.
I direct your attention to the methods of DataInputStream, especially readFully(). You will also need a Java COMP-3 library. There are several available. Most of the rest can be done by built-in EBCDIC character set decoders.
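A minimal sketch of that approach, assuming each record starts with a two-byte big-endian length word giving the number of bytes that follow (that layout is an assumption; the question only says an attribute at the start of the record gives the length):
try (DataInputStream in = new DataInputStream(
        new BufferedInputStream(Files.newInputStream(path)))) {
    while (true) {
        int recordLength;
        try {
            recordLength = in.readUnsignedShort();  // assumed 2-byte length word
        } catch (EOFException e) {
            break;                                  // clean end of file
        }
        byte[] record = new byte[recordLength];
        in.readFully(record);                       // read exactly one record
        // hand the raw EBCDIC/COMP-3 bytes to the decoding routine here
    }
}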

Gson unicode deserialization

I am reading text from a longblob whose table-wide default charset is ASCII and deserializing it with Gson's fromJson(). The charset as given by another field is UTF-8. I want to serialize back a modified version, but I want to test that the serialized version is equivalent to the original text in the longblob, obviously aside from the modification.
byte[] words;//from the longblob field using getBytes()
String in = new String(words, Charsets.UTF_8);
MyClass myObj = gson.fromJson(in, MyClass.class);
//modify myObj...
String out = gson.toJson(myObj);
The problem seems to be Unicode characters. The lengths of the strings are not equal due to the effect Unicode characters have. For example, out as printed will show "we’ll" whereas in will show "we\u2019ll". I know that if I copy and paste these into Java code as literals they will be equal and have equal length, but in memory they are not equal in the above code.
I prefer a solution that doesn't rely on changing db field type.
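One way to make the equivalence check independent of such escaping differences is to compare the parsed JSON trees rather than the raw strings; a minimal sketch, assuming Gson 2.8.6 or later for the static JsonParser.parseString():
JsonElement original = JsonParser.parseString(in);
JsonElement roundTripped = JsonParser.parseString(out);
// "we\u2019ll" and "we’ll" parse to the same string value, so structural
// equality ignores how the source text happened to escape the character
boolean equivalent = original.equals(roundTripped);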

Base64 binary data type in java

I need to attach a Base64 binary element to a SOAP message. I'm doing a dry run to check whether I can convert a value read from a file into Base64 binary.
Here is the code. In the last line I try to print the type of encoded1 (I assume it should be Base64 binary values), but I get the following output: "Attachment[B". How can I confirm whether it is a Base64 binary string?
Path path = Paths.get("c:/tomcatupload/text.csv");
byte[] attachment1 = Files.readAllBytes(path);
byte[] encoded1 = Base64.encode(attachment1);
System.out.println("Attachment"+ encoded1.getClass().getName());
Base-64 encoding is a way to convert arbitrary bytes to bytes that fit in a range of text characters in ASCII encoding. This is done without any interpretation whatsoever: raw bytes are converted to base-64 on the sender's end; the receiver converts them back to a stream of bytes, and that's all there is to it.
When your code prints encoded1.getClass().getName(), all it gets is the JVM's name for the byte-array class ("[B"), not the content. In order to interpret the data encoded in base-64 as something meaningful to your program, you need to know the format of the underlying data transported as base-64. Once the bytes are delivered to you (in your case, the encoded1 array of bytes) you need to decide what's inside, and act accordingly.
For example, if a serialized Java object is sent to you as base-64, you need to take the received bytes, decode them from base-64 back to the raw bytes, make an in-memory stream from those, and read the object using the regular serialization mechanism:
// decodedBytes is the raw byte[] obtained by base-64 decoding the received data
ByteArrayInputStream memStream = new ByteArrayInputStream(decodedBytes);
ObjectInputStream objStream = new ObjectInputStream(memStream);
Object attachedObject = objStream.readObject();
As a rough sanity check, the encoding by Base64.encode() worked if encoded1 is larger than attachment1: Base64 output is roughly 4/3 the size of its input, because every 3 input bytes become 4 output characters.
Please refer to http://en.wikipedia.org/wiki/Base64 to understand how the encoding works.
By the way, your last statement doesn't print the array content. It prints the name of the class to which encoded1 belongs.
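Since Base64 output is plain ASCII, a simple way to actually see (and verify) the encoded text is to build a String from the byte array; a small sketch using the JDK's own encoder, java.util.Base64 (available since Java 8, and possibly not the Base64 class used in the question):
byte[] attachment = Files.readAllBytes(Paths.get("c:/tomcatupload/text.csv"));
byte[] encoded = java.util.Base64.getEncoder().encode(attachment);
// Base64 text contains only A-Z, a-z, 0-9, '+', '/' and '=' padding,
// so it can safely be printed as an ASCII string
System.out.println(new String(encoded, java.nio.charset.StandardCharsets.US_ASCII));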

character taking 6 bytes

We are trying to save the string below, which is actually a name, in the DB. We make an API call and get this name:
株式会社エス・ダブリュー・コミュニケーションズ
While saving it through our code (servlet - Hibernate - database), we get an error:
Caused by: java.sql.BatchUpdateException: ORA-12899: value too large for column "NAME_ON_ACCOUNT" (actual: 138, maximum: 100)
This is 23 characters, but it looks like each character is taking 6 bytes; that would make it exactly 138.
The code below gives me 69:
byte[] utf8Bytes = string.getBytes("UTF-8");
System.out.println(utf8Bytes.length);
And this gives me 92:
byte[] utf8Bytes = string.getBytes("UTF-32");
System.out.println(utf8Bytes.length);
I will surely check NLS_CHARACTERSET and look at the I/O classes, but have you ever seen a character take 6 bytes? Any help will be much appreciated.
It probably holds HTML entities in the string, like &#29123; (which renders as 燃), or possibly the URL-encoded style, %8C%9A. Or maybe UTF-7, like [Ay76b. (I made up those values, but your actual ones will be similar.) It is always a pain to rely on any framework for character encoding, because its authors were likely U.S. or European, both getting by with simple ANSI where one byte equals one character.
If you manage to work out your encoding and convert it to real UTF-8 or even UTF-16, it will take up less space in this particular case.
You probably literally have:
\u682a\u5f0f\u4f1a\u793e\u30a8\u30b9\u30fb\u30c0\u30d6\u30ea\u30e5\u30fc\u30fb\u30b3\u30df\u30e5\u30cb\u30b1\u30fc\u30b7\u30e7\u30f3\u30ba
See:
"\u682a\u5f0f\u4f1a\u793e\u30a8\u30b9\u30fb\u30c0\u30d6\u30ea\u30e5\u30fc\u30fb\u30b3\u30df\u30e5\u30cb\u30b1\u30fc\u30b7\u30e7\u30f3\u30ba".length();
//23, or 69 UTF-8 bytes
Vs:
"\\u682a\\u5f0f\\u4f1a\\u793e\\u30a8\\u30b9\\u30fb\\u30c0\\u30d6\\u30ea\\u30e5\\u30fc\\u30fb\\u30b3\\u30df\\u30e5\\u30cb\\u30b1\\u30fc\\u30b7\\u30e7\\u30f3\\u30ba".length();
//138, or 138 UTF-8 bytes
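If that is what is happening, one fix is to unescape the literal \uXXXX sequences before persisting, and then verify the result by checking its UTF-8 byte length against the column limit; a sketch assuming Apache Commons Lang 3 is on the classpath for StringEscapeUtils.unescapeJava():
String raw = "\\u682a\\u5f0f\\u4f1a\\u793e";   // literal backslash-u text, as received (first 4 characters of the name)
String name = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(raw);
int utf8Bytes = name.getBytes(java.nio.charset.StandardCharsets.UTF_8).length;
// 4 characters, 12 UTF-8 bytes here; the full 23-character name needs 69, well under the 100-byte limit
System.out.println(name + " -> " + utf8Bytes + " UTF-8 bytes");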

Could anyone verify the correctness of getting an MD5 hash using this method?

MessageDigest m=MessageDigest.getInstance("MD5");
StringBuffer sb = new StringBuffer();
if(nodeName!=null) sb.append(nodeName);
if(nodeParentName!=null) sb.append(nodeParentName);
if(nodeParentFieldName!=null) sb.append(nodeParentFieldName);
if(nodeRelationName!=null) sb.append(nodeRelationName);
if(nodeViewName!=null) sb.append(nodeViewName);
if(treeName!=null) sb.append(treeName);
if(nodeValue!=null && nodeValue.trim().length()>0) sb.append(nodeValue);
if(considerParentHash) sb.append(parentHash);
m.update(sb.toString().getBytes("UTF-8"),0,sb.toString().length());
BigInteger i = new BigInteger(1,m.digest());
hash = String.format("%1$032X", i);
The idea behind these lines of code is that we append all the values of a class/model into a StringBuffer and then return the padded hash of that (the Java implementation sometimes returns MD5 hashes of length 30 or 31 hex digits, so the last line formats the hash to be zero-padded to 32).
I can verify that this works, but I have a feeling it fails at one point (our application fails and I believe this to be the probable cause).
Can anyone see a reason why this wouldn't work? Are there any workarounds to make this code less prone to errors (e.g. removing the need for the strings to be UTF-8)?
There are a few weird things in your code.
UTF-8 encoding of a character may use more than one byte, so you should not use the string length as the final parameter to the update() call, but the length of the array of bytes that getBytes() actually returned. As suggested by Paŭlo, use the update() method which takes a single byte[] as its parameter.
The output of MD5 is a sequence of 16 bytes with quite arbitrary values. If you interpret it as an integer (that's what you do with your call to BigInteger()), then you will get a numerical value which will be smaller than 2^128, possibly much smaller. When converted back to hexadecimal digits, you may get 32, 31, 30... or fewer than 30 characters. Your usage of the "%032X" format string left-pads with enough zeros, so your code works, but it is kind of indirect (the output of MD5 was never an integer to begin with).
You assemble the hash input elements with raw concatenation. This may cause issues. For instance, if nodeName is "foo" and nodeParentName is "barqux", then the MD5 input will begin with (the UTF-8 encoding of) "foobarqux". If nodeName is "foobar" and nodeParentName is "qux", then the MD5 input will also begin with "foobarqux". You do not say why you want to use a hash function, but usually, when one uses a hash function, it is to have a unique trace of some piece of data; two distinct data elements should yield distinct hash inputs.
When handling nodeValue, you call trim(), which means that this string could begin and/or end with whitespace, and you do not want to include that whitespace into the hash input -- but you do include it, since you append nodeValue and not nodeValue.trim().
If what you are trying to do has any relation to security then you should not use MD5, which is cryptographically broken. Use SHA-256 instead.
Hashing an XML element is normally done through canonicalization (which handles whitespace, attribute order, text representation, and so on). See this question on the topic of canonicalizing XML data with Java.
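Putting those points together, a sketch of a less error-prone variant (the NUL field separator and the lower-case hex formatting are illustrative choices, not requirements):
MessageDigest md = MessageDigest.getInstance("SHA-256");   // MD5 is broken; prefer SHA-256
StringBuilder sb = new StringBuilder();
// separate the fields so that "foo"+"barqux" and "foobar"+"qux" hash differently
for (String field : new String[] {nodeName, nodeParentName, nodeParentFieldName,
        nodeRelationName, nodeViewName, treeName, nodeValue}) {
    sb.append(field == null ? "" : field).append('\u0000');
}
if (considerParentHash) sb.append(parentHash);
// hash the full encoded byte array, not just the first string-length bytes
byte[] digest = md.digest(sb.toString().getBytes(StandardCharsets.UTF_8));
hash = String.format("%064x", new BigInteger(1, digest));  // 64 hex digits, zero-padded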
One possible problem is here:
m.update(sb.toString().getBytes("UTF-8"),0,sb.toString().length());
As said by Robin Green, the UTF-8 encoding can produce a byte[] which is longer than your original string (it will do this exactly when the String contains non-ASCII characters). In that case, you are only hashing the start of your String.
Better write it like this:
m.update(sb.toString().getBytes("UTF-8"));
Of course, this would not cause an exception, just a different hash than would otherwise be produced, if you have non-ASCII characters in your string. You should try to boil your failure down to an SSCCE, like lesmana recommended.
