How to convert a Clob to a String with encoding in Java

We are doing a massive batch of XML processing, and the logic to convert a Clob to a String is shown below.
import java.sql.Clob
import org.apache.commons.io.IOUtils

String extractXml(Clob xmlClob) {
    log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
    String sourceXml
    try {
        sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream()), encoding)           // 1. Encoding not working
        sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream(), encoding), encoding) // 2. Encoding working
    } catch (Exception e) {
        ...
    }
    return sourceXml
}
My queries:
a. I am not sure why (1) doesn't work even though I am using getCharacterStream() instead of getAsciiStream(), while (2) seems to work fine. Is that because I am explicitly overriding the system encoding?
b. Solution (2) looks a bit odd because the encoding is specified twice (once for the byte array and once for the String constructor). I am not sure whether there are any performance issues, and I wonder if there is a better way to write it.
c. I thought of dropping the Apache Commons libraries and using a plain JDK solution instead. Surprisingly, I did not give any explicit encoding, yet it seems to work perfectly. Is that because it streams characters straight into a string buffer?
/*
 * Works perfectly and returns the encoding correctly.
 */
String extractXmlWithoutApacheCommons(Clob xmlClob) {
    log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
    StringBuffer sb = new StringBuffer((int) xmlClob.length())
    try {
        Reader r = xmlClob.getCharacterStream()
        char[] cbuf = new char[2048]
        int n = 0
        while ((n = r.read(cbuf, 0, cbuf.length)) != -1) {
            if (n > 0) {
                sb.append(cbuf, 0, n)
            }
        }
    } catch (Exception e) {
        ...
    }
    return sb.toString()
}
Can you please shed some light on these questions?

The Clob already has an encoding. It's whatever you've specified in the database, and once you read it on the Java side it'll be a String (with the implicit UTF-16 encoding, not that it matters at all).
Whatever you think you're doing with all those encoding tricks is wrong and useless. You only need to specify an encoding when turning bytes into chars or the other way around. You're dealing with chars only (except in your first example, where you for some unknown reason want to turn them into bytes).
If you want to use IOUtils, then readFully(Reader input, char[] buffer) would be the method to use.
The platform default encoding has no effect in this whole question, since you shouldn't be working with bytes at all.
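For illustration, a minimal Java sketch that never leaves the character domain (assuming commons-io is on the classpath; IOUtils.toString(Reader) simply buffers the chars for you):

import java.sql.Clob;
import org.apache.commons.io.IOUtils;

String extractXml(Clob xmlClob) throws Exception {
    // No byte[] and no charset anywhere: the Clob hands back a Reader of chars,
    // and those chars are collected straight into a String.
    return xmlClob == null ? null : IOUtils.toString(xmlClob.getCharacterStream());
}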
Edit:
A slightly more modern way with the standard JDK classes would be to use Reader.read(CharBuffer target) like
CharBuffer cb = CharBuffer.allocate((int) xmlClob.length());
Reader r = xmlClob.getCharacterStream();
while (r.read(cb) != -1)
    ;
cb.flip(); // without the flip, toString() would return the unfilled tail of the buffer
return cb.toString();
but it doesn't really make a huge difference (it's a bit nicer looking).

Related

OutputStream translating bytes into characters with charset (opposite of OutputStreamWriter)

I need to manipulate content as it is being written to an OutputStream. Specifically, I need to replace CR or LF with CRLF to canonicalize text. This is easy for simple character sets where CR=13 and LF=10, but not so simple with multi-byte character sets. The characters should be replaced, not the bytes. It is non-trivial in general to do that in the output stream itself.
The built-in class OutputStreamWriter converts from characters to bytes for a configured encoding. I'm looking for a class that does the opposite, that is an OutputStream configured with a character set that buffers data as needed and translates the written bytes into characters with the character set (or throws on invalid byte sequences), making the characters available in some way, for example by forwarding the call to a Writer.
In other words I want to convert from bytes to characters on-the-fly as content is being written. I could write everything to a buffer and read it back with an InputStreamReader, but that is inefficient for very large payloads that won't fit in memory.
Is there a class like this somewhere (ideally open source, as I don't think it is built in)? If not, are there similar examples for efficient streaming conversion I could use as a starting point? The JDK classes I've seen are optimized for converting many bytes at a time, not for streaming use.
I wrote an implementation based on CharsetDecoder. Create a decoder and allocate a ByteBuffer and CharBuffer in the constructor:
decoder = charset.newDecoder();
byteBuf = ByteBuffer.allocate(bufferSize);
charBuf = CharBuffer.allocate(bufferSize);
Then implement write:
public void write(int b) throws IOException {
    if (!byteBuf.hasRemaining()) {
        decodeAndWriteByteBuffer(false);
    }
    byteBuf.put((byte) b);
}
And decodeAndWriteByteBuffer:
private void decodeAndWriteByteBuffer(boolean endOfInput) throws IOException {
    byteBuf.flip();
    CoderResult cr;
    do {
        cr = byteBuf.hasRemaining() || endOfInput
                ? decoder.decode(byteBuf, charBuf, endOfInput)
                : CoderResult.UNDERFLOW;
        if (cr.isUnderflow()) {
            if (endOfInput) {
                do {
                    cr = decoder.flush(charBuf);
                    writeCharBuffer();
                } while (cr.isOverflow());
                if (cr.isError()) {
                    cr.throwException();
                }
            }
        } else if (cr.isOverflow()) {
            writeCharBuffer();
        } else {
            cr.throwException();
        }
    } while (cr.isOverflow());
    byteBuf.compact();
}
The remaining details are left as an exercise to the reader. It seems to work, though it is too early to say anything about performance.
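For completeness, a hypothetical writeCharBuffer() (not part of the original answer) might simply drain the CharBuffer into an underlying Writer; the 'writer' field is an assumption:

private void writeCharBuffer() throws IOException {
    charBuf.flip();                                   // switch the buffer from filling to draining
    writer.write(charBuf.array(), charBuf.position(), charBuf.remaining());
    charBuf.clear();                                  // make room for the next decode round
}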

Using something else instead of String

I have a big file and I want to do some operations on it (find some text, check whether some text exists, get the offset of some text, maybe change the file).
My current approach is this:
public ResultSet getResultSet(String fileName) throws IOException {
    InputStream in = new FileInputStream(fileName);
    byte[] buffer = new byte[CAPACITY];
    byte[] doubleBuffer = new byte[2 * CAPACITY];
    long len = in.read(doubleBuffer);
    while (true) {
        String reconstitutedString = new String(doubleBuffer, 0, doubleBuffer.length);
        // ...do stuff
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        os.write(doubleBuffer, CAPACITY, CAPACITY);
        readUntilNow += len;
        len = in.read(buffer);
        if (len <= 0) {
            break;
        }
        os.write(buffer, 0, CAPACITY);
        doubleBuffer = os.toByteArray();
        os.close();
    }
    in.close();
    return makeResult();
}
I would like to change the String reconstitutedString into something else. What would be the best alternative, considering I want to get information about the content of that data, the kind of information I could get by calling indexOf on a String?
You may use StringBuffer or StringBuilder. These two classes behave much like String, with the advantage of mutability.
Moreover, you can easily convert them to a String whenever you need functionality that only String provides; just call toString().
You may use some other data type as an alternative to String depending on your situation, but in general StringBuffer and StringBuilder are the best alternatives to String. Use StringBuffer when you need synchronization and StringBuilder otherwise.
The best type to do split or indexOf on is String. Just use it.
The most natural choice would be CharBuffer. Like String and StringBuilder it implements the CharSequence interface, therefore it can be used with a lot of text oriented APIs, most notably the regex engine which is the back-end for most search, split, and replacing operations.
What makes CharBuffer the natural choice is that it is also the type that is used by the charset package which provides the necessary operations for converting characters from and to bytes. By dealing with this API you can do the conversion directly from and to CharBuffers without additional data copying steps.
Note that Java’s regex API is prepared for processing buffers containing partially read files and can report whether reading more data might change the result (see hitEnd() and requireEnd()).
These are the necessary tools for building applications which can process large files in smaller chunks and without creating String instance out of it (or only when necessary, e.g. when extracting a matching subsequence).
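A minimal sketch of that idea (the file name and pattern are placeholders; a real chunked implementation would use a CharsetDecoder with endOfInput=false so multi-byte characters split across chunk boundaries are handled):

import java.io.FileInputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkSearch {
    public static void main(String[] args) throws Exception {
        Pattern needle = Pattern.compile("some text");                       // placeholder pattern
        try (FileChannel ch = new FileInputStream("big.dat").getChannel()) { // placeholder file
            ByteBuffer bytes = ByteBuffer.allocate(64 * 1024);
            ch.read(bytes);
            bytes.flip();
            // Decode the chunk straight into a CharBuffer - no String is created.
            CharBuffer chars = StandardCharsets.UTF_8.decode(bytes);
            Matcher m = needle.matcher(chars);
            while (m.find()) {
                System.out.println("match at char offset " + m.start());
            }
            // hitEnd() reports whether reading more data could change the result,
            // which is what makes processing a large file in chunks feasible.
            System.out.println("need more input? " + m.hitEnd());
        }
    }
}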

decode character from unicode using java

I was unable to insert a Chinese character into MySQL, so I thought of doing this. I have an Excel sheet with Chinese characters like 秀昭 and so on.
I converted them to Unicode representations like \uxxx using the code below, which I got from SO, and then stored them in MySQL.
private static String escapeNonAscii(String str) {
    List<String> arr = new ArrayList<String>();
    StringBuilder retStr = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int cp = Character.codePointAt(str, i);
        System.out.println("cp=" + cp);
        int charCount = Character.charCount(cp);
        if (charCount > 1) {
            i += charCount - 1; // 2.
            if (i >= str.length()) {
                throw new IllegalArgumentException("truncated unexpectedly");
            }
        }
        if (cp < 128) {
            retStr.appendCodePoint(cp);
        } else {
            retStr.append(String.format("\\u%x", cp));
            arr.add(String.format("\\\\u%x", cp));
        }
    }
    return retStr.toString();
}
The values have been stored properly. So now I need to display them back. When I tried
System.out.println("\u8BF7\u5728\u6B64\u5904");
It gives me proper output like,
`请在此`
But when I read from DB and did like
System.out.println(rs.getString(1).trim().toString() + " from DB");
It printed
`\u8BF7\u5728\u6B64\u5904`
What might be the problem? Have I missed anything? Please help.
Escape sequences like \u8BF7 are only processed by the compiler; text read from a database at runtime is not unescaped. To store and retrieve the data from a database, you only have to consider two things: make sure the data you read has the correct encoding, and make sure the correct encoding is set when printing the data.
If you read data on a Windows machine, it is possible you have to use one of the cp* encodings. Just use an InputStreamReader and set the charset. Now you have the data in the JVM; the internal representation is UTF-16. As long as you use a Type 4 JDBC driver, you do not have to worry about encoding, except that your database needs an encoding capable of storing the data; UTF-8 or Unicode will do the trick. Consult your JDBC documentation for properties to set; sometimes you have to set an encoding explicitly (jdbc:mysql://localhost:3306/?useUnicode=yes&characterEncoding=UTF-8).
When outputting the data, the output must sometimes have a specific encoding. Normally your JVM runs with the default system charset, but you may need another one, for example when rendering an HTML file.
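A minimal sketch of the read side, assuming the source file is in a Windows code page (the file name and charset below are placeholders):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ReadWithCharset {
    public static void main(String[] args) throws Exception {
        // Decode bytes -> chars with an explicit charset instead of the platform default.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"),    // placeholder file
                                      Charset.forName("windows-1252")))) { // placeholder charset
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // now a proper UTF-16 String inside the JVM
            }
        }
    }
}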

Reading characters from a file written with .net

I'm trying to use Java to read a string from a file that was written with a .NET BinaryWriter.
I think the problem is that the .NET binary writer uses some 7-bit format for its strings. By researching online, I came across this code that is supposed to work like BinaryReader's ReadString() method. This is in my CSDataInputStream class that extends DataInputStream.
public String readStringCS() throws IOException {
    int stringLength = 0;
    boolean stringLengthParsed = false;
    int step = 0;
    while (!stringLengthParsed) {
        byte part = readByte();
        stringLengthParsed = (((int) part >> 7) == 0);
        int partCutter = part & 127;
        part = (byte) partCutter;
        int toAdd = (int) part << (step * 7);
        stringLength += toAdd;
        step++;
    }
    char[] chars = new char[stringLength];
    for (int i = 0; i < stringLength; i++) {
        chars[i] = readChar();
    }
    return new String(chars);
}
The first part seems to be working, as it returns the correct number of characters (7). But when it reads the characters they are all Chinese! I'm pretty sure the problem is with DataInputStream.readChar() but I have no idea why it isn't working... I have even tried using
Character.reverseBytes(readChar());
to read the char to see if that would work, but it would just return different Chinese characters.
Maybe I need to emulate .net's way of reading chars? How would I go about doing that?
Is there something else I'm missing?
Thanks.
Okay, so you've parsed the length correctly by the sounds of it - but you're then treating it as the length in characters. As far as I can tell from the documentation it's the length in bytes.
So you should read the data into a byte[] of the right length, and then use:
return new String(bytes, encoding);
where encoding is the appropriate encoding based on whatever was written from .NET... it will default to UTF-8, but it can be specified as something else.
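Putting it together, the tail of readStringCS() might look roughly like this, replacing the readChar() loop (assuming the 7-bit length prefix has already been parsed into stringLength and that UTF-8 was used on the .NET side):

byte[] bytes = new byte[stringLength];      // the length prefix counts bytes, not chars
readFully(bytes);                           // DataInputStream.readFully fills the whole array
return new String(bytes, java.nio.charset.StandardCharsets.UTF_8); // decode with the .NET-side encoding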
As an aside, I personally wouldn't extend DataInputStream - I would compose it instead, i.e. make your type or method take a DataInputStream (or perhaps just take InputStream and wrap that in a DataInputStream). In general, if you favour composition over inheritance it can make code clearer and easier to maintain, in my experience.

Can an empty Java string be created from a non-empty UTF-8 byte array?

I'm trying to debug something and I'm wondering if the following code could ever return true
public boolean impossible(byte[] myBytes) throws UnsupportedEncodingException {
    if (myBytes.length == 0)
        return false;
    String string = new String(myBytes, "UTF-8");
    return string.length() == 0;
}
Is there some value I can pass in that will return true? I've fiddled with passing in just the first byte of a 2 byte sequence, but it still produces a single character string.
To clarify, this happened on a PowerPC chip on Java 1.4 code compiled through GCJ to a native binary executable. This basically means that most bets are off. I'm mostly wondering if Java's 'normal' behaviour, or Java's spec made any promises.
According to the javadoc for java.lang.String, the behavior of new String(byte[], "UTF-8") is not specified when the byte array contains invalid or unexpected data. If you want more predictability in your resultant string, use http://java.sun.com/j2se/1.5.0/docs/api/java/nio/charset/CharsetDecoder.html.
Possibly.
From the Java 5 API docs "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified."
I guess that it depends on :
Which version of java you're using
Which vendor wrote your JVM (Sun, HP, IBM, the open source one, etc)
Once the docs say "unspecified" all bets are off
Edit: Beaten to it by Trey
Take his advice about using a CharsetDecoder
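A minimal sketch of that advice: a strict CharsetDecoder rejects invalid input instead of leaving the result up to the unspecified String-constructor behaviour (the input bytes here are just an example):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        // 0xC3 is the first byte of a two-byte UTF-8 sequence, truncated here on purpose.
        byte[] bytes = {(byte) 0xC3};
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of replacing
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            String s = decoder.decode(ByteBuffer.wrap(bytes)).toString();
            System.out.println("decoded to " + s.length() + " chars");
        } catch (CharacterCodingException e) {
            // With REPORT the invalid input is rejected explicitly.
            System.out.println("invalid UTF-8: " + e);
        }
    }
}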
If Java handles the BOM correctly (which I'm not sure whether they have fixed yet), then it should be possible to pass in a byte array containing just the BOM (U+FEFF, which in UTF-8 is the byte sequence EF BB BF) and get an empty string.
Update:
I tested that method with all values of 1-3 bytes. None of them returned an empty string on Java 1.6. Here is the test code that I used with different byte array lengths:
public static void main(String[] args) throws UnsupportedEncodingException {
    byte[] test = new byte[3];
    byte[] end = new byte[test.length];
    if (impossible(test)) {
        System.out.println(Arrays.toString(test));
    }
    do {
        increment(test, 0);
        if (impossible(test)) {
            System.out.println(Arrays.toString(test));
        }
    } while (!Arrays.equals(test, end));
}

private static void increment(byte[] arr, int i) {
    arr[i]++;
    if (arr[i] == 0 && i + 1 < arr.length) {
        increment(arr, i + 1);
    }
}

public static boolean impossible(byte[] myBytes) throws UnsupportedEncodingException {
    if (myBytes.length == 0) {
        return false;
    }
    String string = new String(myBytes, "UTF-8");
    return string.length() == 0;
}
UTF-8 is a variable length encoding scheme, with most "normal" characters being single byte. So any given non-empty byte[] will always translate into a String, I'd have thought.
If you want to play it safe, write a unit test that iterates over every possible byte value, passes in a single-element array of that value, and asserts that the resulting string is non-empty, as sketched below.
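A plain-Java sketch of that test (no test framework; the loop just checks every single-byte value):

import java.io.UnsupportedEncodingException;

public class SingleByteTest {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Try every single-byte value 0x00..0xFF and check that none decodes to an empty string.
        for (int b = 0; b <= 0xFF; b++) {
            byte[] single = {(byte) b};
            String s = new String(single, "UTF-8");
            if (s.length() == 0) {
                throw new AssertionError("empty string for byte " + b);
            }
        }
        System.out.println("no single-byte value produced an empty string");
    }
}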
