Reading characters from a file written with .net - java

I'm trying to use Java to read a string from a file that was written with a .NET BinaryWriter.
I think the problem is that the .NET BinaryWriter uses some 7-bit format for its strings. While researching online, I came across this code, which is supposed to behave like the BinaryReader's ReadString() method. It lives in my CSDataInputStream class, which extends DataInputStream.
public String readStringCS() throws IOException {
int stringLength = 0;
boolean stringLengthParsed = false;
int step = 0;
while(!stringLengthParsed) {
byte part = readByte();
stringLengthParsed = (((int)part >> 7) == 0);
int partCutter = part & 127;
part = (byte)partCutter;
int toAdd = (int)part << (step*7);
stringLength += toAdd;
step++;
}
char[] chars = new char[stringLength];
for(int i = 0; i < stringLength; i++) {
chars[i] = readChar();
}
return new String(chars);
}
The first part seems to work, as it returns the correct number of characters (7). But when it reads the characters, they are all Chinese! I'm pretty sure the problem is with DataInputStream.readChar(), but I have no idea why it isn't working. I have even tried using
Character.reverseBytes(readChar());
to read the char to see if that would work, but it would just return different Chinese characters.
Maybe I need to emulate .NET's way of reading chars? How would I go about doing that?
Is there something else I'm missing?
Thanks.

Okay, so you've parsed the length correctly by the sounds of it - but you're then treating it as the length in characters. As far as I can tell from the documentation it's the length in bytes.
So you should read the data into a byte[] of the right length, and then use:
return new String(bytes, encoding);
where encoding is the appropriate encoding based on whatever was written from .NET. BinaryWriter defaults to UTF-8, but a different encoding can be specified.
As an aside, I personally wouldn't extend DataInputStream - I would compose it instead, i.e. make your type or method take a DataInputStream (or perhaps just take InputStream and wrap that in a DataInputStream). In general, if you favour composition over inheritance it can make code clearer and easier to maintain, in my experience.
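A minimal sketch of the approach described above, assuming the file was written with BinaryWriter's default UTF-8 encoding. The class name DotNetStringReader and its method names are hypothetical, chosen only for illustration; the class wraps a DataInputStream rather than extending it, and reads the length prefix as a byte count, not a character count.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

/** Hypothetical helper: composes a DataInputStream instead of extending it. */
final class DotNetStringReader {
    private final DataInputStream in;
    private final Charset charset;

    DotNetStringReader(InputStream source, Charset charset) {
        this.in = new DataInputStream(source);
        this.charset = charset;
    }

    /** Reads the length prefix: 7 payload bits per byte, low bits first, high bit set means "more follows". */
    int read7BitEncodedInt() throws IOException {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.readUnsignedByte();
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
        }
        throw new IOException("Malformed 7-bit encoded length");
    }

    /** Reads the prefix as a BYTE count, then decodes those bytes with the configured charset. */
    String readString() throws IOException {
        int byteLength = read7BitEncodedInt();
        if (byteLength < 0) {
            throw new IOException("Negative string length: " + byteLength);
        }
        byte[] bytes = new byte[byteLength];
        in.readFully(bytes); // throws EOFException if the file is truncated
        return new String(bytes, charset);
    }
}
```

Usage would be something like new DotNetStringReader(new FileInputStream(path), StandardCharsets.UTF_8).readString(), with the charset swapped for whatever the .NET side actually used.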

Related

How to convert clob to string with encoding in java

We are doing a massive batch of XML processing, and the logic to convert a Clob to a String is shown below.
import java.sql.Clob
import org.apache.commons.io.IOUtils
String extractXml(Clob xmlClob) {
log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
String sourceXml
try {
sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream()), encoding) // 1. Encoding not working
sourceXml = new String(IOUtils.toByteArray(xmlClob?.getCharacterStream(), encoding), encoding) // 2. Encoding working
} catch (Exception e) {
...
}
return sourceXml
}
My queries:
a. I am not sure why (1) doesn't work even though I am using getCharacterStream() instead of getAsciiStream(),
but (2) seems to work fine. Maybe because I am explicitly overriding the system encoding?
b. Solution (2) looks a bit odd, as the encoding is specified twice (once for the byte array and once for the string creation).
I am not sure if there are any performance issues, and I wonder if there is a better way to write this.
c. I thought of dropping the Apache Commons libraries and using a plain Java solution.
The surprising thing is that I did not give any explicit encoding, yet it seems to work perfectly.
Is it because it streams characters straight into the string buffer?
/*
* works perfectly and returns the correct encoding
*/
String extractXmlWithoutApacheCommons(Clob xmlClob) {
log.info "DefaultCharset: " + groovy.util.CharsetToolkit.getDefaultSystemCharset()
StringBuffer sb = new StringBuffer((int) xmlClob.length())
try {
Reader r = xmlClob.getCharacterStream()
char[] cbuf = new char[2048]
int n = 0
while ((n = r.read(cbuf, 0, cbuf.length)) != -1) {
if (n > 0) {
sb.append(cbuf, 0, n)
}
}
} catch (Exception e) {
...
}
return sb.toString()
}
Can you please shed some light on this?
The Clob already has an encoding. It's whatever you've specified in the database, and once you read it on Java side it'll be a String (with the implicit UTF-16 encoding, not that it matters at all).
Whatever you think you're doing with all those encoding tricks is wrong and useless. You only need to specify an encoding when turning bytes to chars or the other way around. You're dealing with chars only (except in your first example where you for some unknown reason want to turn them to bytes).
If you want to use IOUtils, then readFully(Reader input, char[] buffer) would be the method to use.
The platform default encoding has no effect in this whole question, since you shouldn't be working with bytes at all.
Edit:
A slightly more modern way with the standard JDK classes would be to use Reader.read(CharBuffer target) like
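CharBuffer cb = CharBuffer.allocate((int) xmlClob.length());
// Stop when the buffer is full or the reader is exhausted
while (cb.hasRemaining() && r.read(cb) != -1)
;
cb.flip(); // switch the buffer from writing to reading so toString() sees the content
return cb.toString();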
but it doesn't really make a huge difference (it's a bit nicer looking).
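To illustrate the point that an encoding only matters when converting between bytes and chars, here is a small sketch. The class and method names are made up for the example. It encodes text with one charset and decodes it with another, which is presumably what happened in variant (1): IOUtils.toByteArray(Reader) without a charset argument encodes with the platform default, and the result was then decoded with a different encoding.

```java
import java.nio.charset.Charset;

/** Hypothetical demo class: shows what a chars -> bytes -> chars detour does. */
final class EncodingRoundTrip {
    /** Encodes text to bytes with one charset, then decodes those bytes with another. */
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }
}
```

With matching charsets the text survives; with roundTrip("\u00e9", StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8) it does not, because the single byte E9 is not valid UTF-8. Staying with chars the whole way, as in the Reader-based code, avoids the problem entirely.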

Encoding-aware RandomAccessReader implementation?

The default implementation of RandomAccessFile is 'broken', in the sense that you can't specify which encoding your file is in.
I'm looking for an alternative which matches the following criteria:
Encoding-aware
Random access! (dealing with very big files, need to be able to position the cursor using a byte offset without streaming the whole thing).
I had a poke around in Commons IO, but there's nothing there. I'd rather not have to implement this myself, because there are entirely too many places it could go wrong.
RandomAccessFile is intended for accessing binary data. It is not possible to efficiently create a random access encoded file which is appropriate in all situations.
Even if you find such a solution I would check it carefully to ensure it suits your needs.
If you were to write it, I would suggest considering a random position of row and column rather than character offset from the start of the file.
This has the advantage that you only have to remember where the start of each line is and you can scan the line to get your character. If you index the position of every character, this could use 4 bytes for every character (assuming the file is < 4 GB)
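A rough sketch of the line-index idea, under the assumption that the file uses an encoding in which the byte 0x0A only ever means a line feed (true for ASCII, ISO-8859-x and UTF-8, not for UTF-16). The class LineIndexedFile and its methods are hypothetical names for illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: remembers where each line starts, decodes one line on demand. */
final class LineIndexedFile {
    /** Scans the file once and records the byte offset at which each line starts. */
    static List<Long> indexLineStarts(RandomAccessFile file) throws IOException {
        List<Long> starts = new ArrayList<>();
        starts.add(0L);
        file.seek(0);
        byte[] buffer = new byte[8192];
        long position = 0;
        int n;
        while ((n = file.read(buffer)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buffer[i] == '\n') {
                    starts.add(position + i + 1); // next line begins after the '\n'
                }
            }
            position += n;
        }
        return starts;
    }

    /** Seeks to the start of the given line (0-based) and decodes only that line. */
    static String readLine(RandomAccessFile file, List<Long> starts, int lineNumber,
                           Charset charset) throws IOException {
        long start = starts.get(lineNumber);
        long end = lineNumber + 1 < starts.size() ? starts.get(lineNumber + 1) : file.length();
        byte[] bytes = new byte[(int) (end - start)];
        file.seek(start);
        file.readFully(bytes);
        String line = new String(bytes, charset);
        // Strip the trailing line terminator, if any.
        int len = line.length();
        while (len > 0 && (line.charAt(len - 1) == '\n' || line.charAt(len - 1) == '\r')) {
            len--;
        }
        return line.substring(0, len);
    }
}
```

The index costs one long per line rather than one entry per character, and a column within the line is then just an index into the decoded String. Note that a file ending in a newline yields one final empty "line" in this sketch.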
The answer turned out to be less painful than I assumed:
// This gives me access to buffering and charset magic
new BufferedReader(new InputStreamReader(Channels.newInputStream(randomAccessFile.getChannel()), encoding));
....
I can then implement a readLine() method which reads character by character. Using String.getBytes(encoding) I can keep track of the offset in the file. Calling seek() on the underlying RandomAccessFile allows me to reposition the cursor at will. There are probably some bugs lurking in there, but the basic tests seem to work.
public String readLine() throws IOException {
eol = "";
lastLineByteCount = 0;
StringBuilder builder = new StringBuilder();
char[] characters = new char[1];
int status = reader.read(characters, 0, 1);
if (status == -1) {
return null;
}
char c = characters[0];
while (status != -1) {
if (c == '\n') {
eol += c;
break;
}
if (c == '\r') {
eol += c;
} else {
builder.append(c);
}
status = reader.read(characters, 0, 1);
c = characters[0];
}
String line = builder.toString();
lastLineByteCount = line.getBytes(encoding).length + eol.getBytes(encoding).length;
return line;
}

regular expression replace 2 characters with one

I would like to use a regular expression for the following problem:
SOME_RANDOM_TEXT
should be converted to:
someRandomText
So each _ followed by any character should be replaced with just that character in upper case. I found something like this using the tool:
_\w and $&
How do I get only the second character in the replacement? Any advice? Thanks.
It might be easier simply to String.split("_") and then rejoin, capitalising the first letter of each string in your collection.
Note that Apache Commons has lots of useful string-related stuff, including a join() method.
The problem is that case conversion from lowercase to uppercase is not supported by java.util.regex.Pattern.
This means you will need to do the conversion programmatically as Brian suggested. See also this thread
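One way to do that conversion programmatically while still using the regex for matching is Matcher.appendReplacement, which lets you compute each replacement in Java code. This is only a sketch; the class name CamelCase is made up, and it assumes the input is plain SOME_RANDOM_TEXT style.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: regex finds "_x", Java code supplies the upper-cased "X". */
final class CamelCase {
    // Group 1 captures the character that follows the underscore.
    private static final Pattern UNDERSCORE_THEN_CHAR = Pattern.compile("_(\\w)");

    static String toCamelCase(String value) {
        Matcher m = UNDERSCORE_THEN_CHAR.matcher(value.toLowerCase(Locale.ROOT));
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Replace the whole match "_x" with just the captured character, upper-cased.
            String upper = m.group(1).toUpperCase(Locale.ROOT);
            m.appendReplacement(out, Matcher.quoteReplacement(upper));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

CamelCase.toCamelCase("SOME_RANDOM_TEXT") gives someRandomText. The capture group is what answers the "only the second character" part of the question: m.group(1) is the character after the underscore, whereas $& is the whole match.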
You can also write a simple method to do this. It's more complicated, but faster:
public static String toCamelCase(String value) {
value = value.toLowerCase();
byte[] source = value.getBytes();
int maxLen = source.length;
byte[] target = new byte[maxLen];
int targetIndex = 0;
for (int sourceIndex = 0; sourceIndex < maxLen; sourceIndex++) {
byte c = source[sourceIndex];
if (c == '_') {
if (sourceIndex < maxLen - 1)
source[sourceIndex + 1] = (byte) Character.toUpperCase(source[sourceIndex + 1]);
continue;
}
target[targetIndex++] = source[sourceIndex];
}
return new String(target, 0, targetIndex);
}
I like Apache commons libraries, but sometimes it's good to know how it works and be able to write some specific code for jobs like this.

How to detect end of string in byte array to string conversion?

I receive a string from a socket in a byte array that looks like:
[128,5,6,3,45,0,0,0,0,0]
The size given by the network protocol is the total length of the string (including zeros), so in my example 10.
If I simply do:
String myString = new String(myBuffer);
I get 5 incorrect characters at the end of the string. The conversion doesn't seem to detect the end-of-string character (0).
To get the correct size and the correct string I do this:
int sizeLabelTmp = 0;
//Iterate over the 10 bytes to get the real size of the string
for(int j = 0; j<(sizeLabel); j++) {
byte charac = datasRec[j];
if(charac == 0)
break;
sizeLabelTmp ++;
}
// Create a temp byte array to make a correct conversion
byte[] label = new byte[sizeLabelTmp];
for(int j = 0; j<(sizeLabelTmp); j++) {
label[j] = datasRec[j];
}
String myString = new String(label);
Is there a better way to handle the problem?
Thanks
Maybe it's too late, but it may help others. The simplest thing you can do is new String(myBuffer).trim(), which gives you exactly what you want.
0 isn't an "end of string character". It's just a byte. Whether or not it only comes at the end of the string depends on what encoding you're using (and what the text can be). For example, if you used UTF-16, every other byte would be 0 for ASCII characters.
If you're sure that the first 0 indicates the end of the string, you can use something like the code you've given, but I'd rewrite it as:
int size = 0;
while (size < data.length)
{
if (data[size] == 0)
{
break;
}
size++;
}
// Specify the appropriate encoding as the last argument
String myString = new String(data, 0, size, "UTF-8");
I strongly recommend that you don't just use the platform default encoding - it's not portable, and may well not allow for all Unicode characters. However, you can't just decide arbitrarily - you need to make sure that everything producing and consuming this data agrees on the encoding.
If you're in control of the protocol, it would be much better if you could introduce a length prefix before the string, to indicate how many bytes are in the encoded form. That way you'd be able to read exactly the right amount of data (without "over-reading") and you'd be able to tell if the data was truncated for some reason.
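For illustration, a length-prefixed exchange could look roughly like this. It assumes both sides agree on UTF-8 and on a 4-byte big-endian length, which is what DataOutputStream.writeInt produces; the class name LengthPrefixedStrings is invented for the example.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: string framing with a byte-length prefix instead of zero padding. */
final class LengthPrefixedStrings {
    /** Writes the encoded byte count first, then the encoded bytes. */
    static void write(DataOutputStream out, String text) throws IOException {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        out.writeInt(encoded.length);
        out.write(encoded);
    }

    /** Reads exactly as many bytes as the prefix announces, no more and no fewer. */
    static String read(DataInputStream in) throws IOException {
        int length = in.readInt();
        if (length < 0) {
            throw new IOException("Negative length prefix: " + length);
        }
        byte[] encoded = new byte[length];
        in.readFully(encoded); // throws EOFException if the data was truncated
        return new String(encoded, StandardCharsets.UTF_8);
    }
}
```

Because readFully either fills the whole array or throws, truncated data shows up as an EOFException instead of a silently shortened string.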
You can always start at the end of the byte array and go backwards until you hit the first non-zero byte. Then copy that range into a new byte array and build the String from it. Hope this helps:
byte[] foo = {28,6,3,45,0,0,0,0};
int i = foo.length - 1;
while (i >= 0 && foo[i] == 0)
{
i--;
}
byte[] bar = Arrays.copyOf(foo, i+1);
String myString = new String(bar, "UTF-8");
System.out.println(myString.length());
Will give you a result of 4.
Strings in Java aren't ended with a 0, like in some other languages. 0 will get turned into the so-called null character, which is allowed to appear in a String. I suggest you use some trimming scheme that either detects the first index of the array that's a 0 and uses a sub-array to construct the String (assuming all the rest will be 0 after that), or just construct the String and call trim(). That'll remove leading and trailing whitespace, which is any character with ASCII code 32 or lower.
The latter won't work if you have leading whitespace you must preserve. Using a StringBuilder and deleting characters at the end as long as they're the null character would work better in that case.
It appears to me that you are ignoring the read-count returned by the read() method. The trailing null bytes probably weren't sent, they are probably still left over from the initial state of the buffer.
int count = in.read(buffer);
if (count < 0) {
// EOS: close the socket etc
} else {
String s = new String(buffer, 0, count);
}
Without diving into the protocol considerations that the OP mentioned, how about this for dropping the zeroes? Note that it skips every zero byte, not only the trailing ones, and treats each byte as one character.
public static String bytesToString(byte[] data) {
String dataOut = "";
for (int i = 0; i < data.length; i++) {
if (data[i] != 0x00)
dataOut += (char)data[i];
}
return dataOut;
}

Can a empty java string be created from non-empty UTF-8 byte array?

I'm trying to debug something and I'm wondering if the following code could ever return true
public boolean impossible(byte[] myBytes) {
if (myBytes.length == 0)
return false;
String string = new String(myBytes, "UTF-8");
return string.length() == 0;
}
Is there some value I can pass in that will return true? I've fiddled with passing in just the first byte of a 2 byte sequence, but it still produces a single character string.
To clarify, this happened on a PowerPC chip on Java 1.4 code compiled through GCJ to a native binary executable. This basically means that most bets are off. I'm mostly wondering if Java's 'normal' behaviour, or Java's spec made any promises.
According to the javadoc for java.lang.String, the behavior of new String(byte[], "UTF-8") is not specified when the byte array contains invalid or unexpected data. If you want more predictability in your resultant string, use http://java.sun.com/j2se/1.5.0/docs/api/java/nio/charset/CharsetDecoder.html.
Possibly.
From the Java 5 API docs "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified."
I guess that it depends on :
Which version of java you're using
Which vendor wrote your JVM (Sun, HP, IBM, the open source one, etc)
Once the docs say "unspecified" all bets are off
Edit: Beaten to it by Trey
Take his advice about using a CharsetDecoder
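A small sketch of what using a CharsetDecoder might look like. The class name StrictUtf8 is hypothetical. With CodingErrorAction.REPORT the decoder throws on invalid input instead of leaving the result unspecified.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: UTF-8 decoding that fails loudly on bad input. */
final class StrictUtf8 {
    static String decode(byte[] bytes) throws CharacterCodingException {
        // A decoder is stateful and not thread-safe, so create one per call.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }
}
```

Passing only the first byte of a two-byte sequence, as tried in the question, then produces a CharacterCodingException rather than some implementation-dependent string.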
If Java handles the BOM correctly (I'm not sure whether that has been fixed yet), then it should be possible to pass in a byte array containing just the BOM (U+FEFF, which in UTF-8 is the byte sequence EF BB BF) and get an empty string.
Update:
I tested that method with all values of 1-3 bytes. None of them returned an empty string on Java 1.6. Here is the test code that I used with different byte array lengths:
public static void main(String[] args) throws UnsupportedEncodingException {
byte[] test = new byte[3];
byte[] end = new byte[test.length];
if (impossible(test)) {
System.out.println(Arrays.toString(test));
}
do {
increment(test, 0);
if (impossible(test)) {
System.out.println(Arrays.toString(test));
}
} while (!Arrays.equals(test, end));
}
private static void increment(byte[] arr, int i) {
arr[i]++;
if (arr[i] == 0 && i + 1 < arr.length) {
increment(arr, i + 1);
}
}
public static boolean impossible(byte[] myBytes) throws UnsupportedEncodingException {
if (myBytes.length == 0) {
return false;
}
String string = new String(myBytes, "UTF-8");
return string.length() == 0;
}
UTF-8 is a variable length encoding scheme, with most "normal" characters being single byte. So any given non-empty byte[] will always translate into a String, I'd have thought.
If you want to play it safe, write a unit test that iterates over every possible byte value, passes in a single-element array containing that value, and asserts that the string is non-empty.
