How to detect end of string in byte array to string conversion? - java

I receive from a socket a string in a byte array, which looks like:
[128,5,6,3,45,0,0,0,0,0]
The size given by the network protocol is the total length of the buffer (including the zeros), so in my example 10.
If I simply do:
String myString = new String(myBuffer);
I get 5 incorrect characters at the end of the string. The conversion doesn't seem to detect the end-of-string character (0).
To get the correct size and the correct string i do this :
int sizeLabelTmp = 0;
// Iterate over the 10 bytes to get the real size of the string
for (int j = 0; j < sizeLabel; j++) {
    byte charac = datasRec[j];
    if (charac == 0)
        break;
    sizeLabelTmp++;
}
// Create a temp byte array to make a correct conversion
byte[] label = new byte[sizeLabelTmp];
for (int j = 0; j < sizeLabelTmp; j++) {
    label[j] = datasRec[j];
}
String myString = new String(label);
Is there a better way to handle the problem ?
Thanks

Maybe it's too late, but it may help others. The simplest thing you can do is new String(myBuffer).trim(), which gives you exactly what you want.

0 isn't an "end of string character". It's just a byte. Whether or not it only comes at the end of the string depends on what encoding you're using (and what the text can be). For example, if you used UTF-16, every other byte would be 0 for ASCII characters.
If you're sure that the first 0 indicates the end of the string, you can use something like the code you've given, but I'd rewrite it as:
int size = 0;
while (size < data.length)
{
    if (data[size] == 0)
    {
        break;
    }
    size++;
}
// Specify the appropriate encoding as the last argument
String myString = new String(data, 0, size, "UTF-8");
I strongly recommend that you don't just use the platform default encoding - it's not portable, and may well not allow for all Unicode characters. However, you can't just decide arbitrarily - you need to make sure that everything producing and consuming this data agrees on the encoding.
If you're in control of the protocol, it would be much better if you could introduce a length prefix before the string, to indicate how many bytes are in the encoded form. That way you'd be able to read exactly the right amount of data (without "over-reading") and you'd be able to tell if the data was truncated for some reason.
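If you can change the protocol, a minimal sketch of that length-prefixed idea might look like this (the class and method names are hypothetical, and a 4-byte int prefix is just one possible choice):
import java.io.*;
import java.nio.charset.StandardCharsets;

// Hypothetical helpers for a length-prefixed wire format: the writer sends
// the byte count first, so the reader never has to guess where the text ends.
class LengthPrefixedStrings {
    static void writeString(DataOutputStream out, String s) throws IOException {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(encoded.length); // 4-byte length prefix, counted in bytes
        out.write(encoded);
    }

    static String readString(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] encoded = new byte[length];
        in.readFully(encoded); // throws EOFException if the data was truncated
        return new String(encoded, StandardCharsets.UTF_8);
    }
}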

You can always start at the end of the byte array and go backwards until you hit the first non-zero byte. Then just copy that range into a new byte array and build a String from it. Hope this helps:
byte[] foo = {28, 6, 3, 45, 0, 0, 0, 0};
int i = foo.length - 1;
while (i >= 0 && foo[i] == 0) // the i >= 0 check guards against an all-zero buffer
{
    i--;
}
byte[] bar = Arrays.copyOf(foo, i + 1);
String myString = new String(bar, "UTF-8");
System.out.println(myString.length());
System.out.println(myString.length());
Will give you a result of 4.

Strings in Java aren't terminated with a 0 like in some other languages. 0 will get turned into the so-called null character, which is allowed to appear in a String. I suggest you use some trimming scheme that either detects the first index of the array that's a 0 and uses a sub-array to construct the String (assuming everything after it is 0 as well), or just constructs the String and calls trim(). That will remove leading and trailing whitespace, which is any character with code 32 or lower.
The latter won't work if you have leading whitespace you must preserve. Using a StringBuilder and deleting characters at the end as long as they're the null character works better in that case.
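A minimal sketch of that StringBuilder approach, assuming myBuffer holds UTF-8 text padded with trailing zeros:
import java.nio.charset.StandardCharsets;

StringBuilder sb = new StringBuilder(new String(myBuffer, StandardCharsets.UTF_8));
int end = sb.length();
while (end > 0 && sb.charAt(end - 1) == '\0') {
    end--; // walk back over trailing null characters only
}
sb.setLength(end); // truncate in one step; leading whitespace is preserved
String myString = sb.toString();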

It appears to me that you are ignoring the read-count returned by the read() method. The trailing null bytes probably weren't sent, they are probably still left over from the initial state of the buffer.
int count = in.read(buffer);
if (count < 0) {
    // EOS: close the socket etc.
} else {
    String s = new String(buffer, 0, count);
    // ...
}
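One caveat worth adding: a single read() may also return fewer bytes than the protocol's declared length. If the length is known up front, a sketch like this (the helper name is made up) reads exactly that many bytes by letting readFully() do the looping:
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

static String readExact(InputStream in, int length) throws IOException {
    byte[] buffer = new byte[length];
    // readFully blocks until the whole buffer is filled,
    // or throws EOFException if the stream ends early.
    new DataInputStream(in).readFully(buffer);
    return new String(buffer, StandardCharsets.UTF_8);
}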

Without diving into the protocol considerations the original OP mentioned, how about this for trimming the trailing zeroes?
public static String bytesToString(byte[] data) {
    // Note: this skips every NUL byte, not just trailing ones,
    // and assumes a single-byte character encoding.
    StringBuilder dataOut = new StringBuilder(data.length);
    for (int i = 0; i < data.length; i++) {
        if (data[i] != 0x00)
            dataOut.append((char) data[i]);
    }
    return dataOut.toString();
}

Related

Having problems with splitting a String into max 1Mb size subStrings

I have to split a String into strings of at most 1MB. With UTF-8 as the character encoding, some letters take up more than 1 byte, so I have to avoid splitting a character in the middle (for example, 'á' is 2 bytes, so one of its bytes must not end up at the end of one String and the other at the beginning of the next String).
public static List<String> cutString3(String original, int chunkSize, String encoding) throws UnsupportedEncodingException {
    List<String> strings = new ArrayList<>();
    final int end = original.length();
    int from = 0;
    int to = 0;
    do {
        to = (to + chunkSize > end) ? end : to + chunkSize;
        String chunk = original.substring(from, to); // get chunk
        while (chunk.getBytes(encoding).length > chunkSize) { // cut the chunk from the end
            chunk = original.substring(from, --to);
        }
        strings.add(chunk); // add chunk to collection
        from = to; // next chunk
    } while (to < end);
    return strings;
}
I'm using the following method to generate an example String:
private static String createDataSize(int msgSize) {
    StringBuilder sb = new StringBuilder(msgSize);
    for (int i = 0; i < msgSize; i++) {
        sb.append("a");
    }
    return sb.toString();
}
Calling the method as the following:
String exampleString = createDataSize(1024*1024*3);
cutString3(exampleString, 1024*1024, "UTF-8");
It works fine: I get back 3 Strings, as the 3-megabyte String was split into three 1-megabyte Strings. But if I change createDataSize() to append 'á' instead, so the String consists only of "áááááá...", the inner while loop in cutString3() takes forever, since it removes the 'á' characters one by one until the chunk fits into the given size. How can I improve the inner while loop, or come up with a similar solution? The String can be smaller than 1 megabyte, just not bigger!
Binary-search logic would fit your need. Instead of shrinking by one character at a time, work with half-steps: if the chunk still has room, grow it by half the previous step; if it is too big, shrink it by half, and so on.
A simpler improvement is to step back by the difference between chunk.getBytes(encoding).length and chunkSize, then check how many bytes you can still add if you want to fill the chunk completely.
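An untested sketch of the binary-search variant, as a drop-in replacement for the chunk extraction and inner while loop in cutString3() (it still assumes, like the original, that no surrogate pair straddles the boundary):
// Find the largest end index whose encoded chunk fits in chunkSize.
// getBytes() length grows monotonically with the substring, so we can bisect.
int lo = from;
int hi = to; // 'to' was already capped at from + chunkSize (>= 1 byte per char)
while (lo < hi) {
    int mid = lo + (hi - lo + 1) / 2; // bias upwards so the loop terminates
    if (original.substring(from, mid).getBytes(encoding).length <= chunkSize) {
        lo = mid;     // still fits: try a longer chunk
    } else {
        hi = mid - 1; // too many bytes: shrink
    }
}
to = lo;
String chunk = original.substring(from, to);
Each probe encodes at most chunkSize characters, so the cost per chunk drops from the original's O(chunkSize^2) worst case to roughly O(chunkSize * log chunkSize).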

Run-length decompression

CS student here. I want to write a program that will decompress a string that has been encoded according to a modified form of run-length encoding (which I've already written code for). For instance, if a string contains 'bba10' it would decompress to 'bbaaaaaaaaaa'. How do I get the program to recognize that part of the string ('10') is an integer?
Thanks for reading!
A simple regex will do.
final Matcher m = Pattern.compile("(\\D)(\\d+)").matcher(input);
final StringBuffer b = new StringBuffer();
while (m.find())
    m.appendReplacement(b, replicate(m.group(1), Integer.parseInt(m.group(2))));
m.appendTail(b);
where replicate is
String replicate(String s, int count) {
    final StringBuilder b = new StringBuilder(count);
    for (int i = 0; i < count; i++) b.append(s);
    return b.toString();
}
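For reference, here is that answer as a self-contained program run against the question's example (the class name is just for illustration):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunLengthDecompress {
    static String replicate(String s, int count) {
        final StringBuilder b = new StringBuilder(count);
        for (int i = 0; i < count; i++) b.append(s);
        return b.toString();
    }

    public static void main(String[] args) {
        final Matcher m = Pattern.compile("(\\D)(\\d+)").matcher("bba10");
        final StringBuffer b = new StringBuffer();
        while (m.find()) {
            // Matcher.quoteReplacement would be needed here if the expanded
            // text could contain '$' or '\' characters.
            m.appendReplacement(b, replicate(m.group(1), Integer.parseInt(m.group(2))));
        }
        m.appendTail(b); // copies the leading "bb", which has no count
        System.out.println(b); // bbaaaaaaaaaa
    }
}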
Not sure whether this is an efficient way, but just for reference:
for (int i = 0; i < your_string.length(); i++) {
    if (your_string.charAt(i) <= '9' && your_string.charAt(i) >= '0') {
        integer_begin_location = i;
        break; // stop at the first digit
    }
}
I think you can divide the chars into numeric and non-numeric symbols.
When you find a numeric one (between '0' and '9'), look at the next char and choose either to extend your number (current * 10 + new digit) or to expand your string.
Assuming that the uncompressed data never contains digits: iterate over the string, character by character, until you get a digit. Then continue until you hit a non-digit (or the end of the string). The digits in between can be parsed to an integer as others already stated:
int count = Integer.parseInt(str.substring(start, end));
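Spelled out as a method, that scan might look like the following sketch (the method name is made up; a run with no trailing count is emitted once, matching the 'bba10' example):
static String decompress(String str) {
    StringBuilder out = new StringBuilder();
    int i = 0;
    while (i < str.length()) {
        char symbol = str.charAt(i++); // a non-digit starts each run
        int start = i;
        while (i < str.length() && Character.isDigit(str.charAt(i))) {
            i++; // collect the digits that follow
        }
        int count = (i > start) ? Integer.parseInt(str.substring(start, i)) : 1;
        for (int j = 0; j < count; j++) {
            out.append(symbol);
        }
    }
    return out.toString(); // decompress("bba10") -> "bbaaaaaaaaaa"
}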
Here is a working implementation in Python. This also works fine for 2- or 3- or multi-digit counts.
inputString = "a1b3s22d4a2b22"
inputString = inputString + "\0"  # append a sentinel so the last run is flushed
charcount = ""
previouschar = ""
outputString = ""
for char in inputString:
    if char.isnumeric():
        charcount = charcount + char
    else:
        if previouschar:
            outputString = outputString + (previouschar * int(charcount))
        charcount = ""
        previouschar = char
print(outputString)  # abbbssssssssssssssssssssssddddaabbbbbbbbbbbbbbbbbbbbbb
Presuming that you're not asking about the parsing, you can convert a string like "10" into an integer like this:
int i = Integer.parseInt("10");

Encoding-aware RandomAccessReader implementation?

The default implementation of RandomAccessFile is 'broken', in the sense that you can't specify which encoding your file is in.
I'm looking for an alternative which matches the following criteria:
Encoding-aware
Random access! (dealing with very big files, need to be able to position the cursor using a byte offset without streaming the whole thing).
I had a poke around in Commons IO, but there's nothing there. I'd rather not have to implement this myself, because there are entirely too many places it could go wrong.
RandomAccessFile is intended for accessing binary data. It is not possible, in general, to create an efficient random-access view of an encoded text file that is appropriate in all situations.
Even if you find such a solution, I would check it carefully to ensure it suits your needs.
If you were to write it, I would suggest addressing a position by row and column rather than by character offset from the start of the file.
This has the advantage that you only have to remember where each line starts, and you can scan within the line to reach your character. Indexing the position of every character instead could cost 4 bytes per character (assuming the file is < 4 GB).
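A rough sketch of that line-start index, assuming an ASCII-compatible encoding such as UTF-8 or ISO-8859-1 where the byte 0x0A can only mean '\n' (for UTF-16 this scan would be wrong):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// One pass over the raw bytes records where each line begins,
// so a later seek() can jump straight to line N.
static long[] indexLineStarts(RandomAccessFile file) throws IOException {
    List<Long> starts = new ArrayList<>();
    starts.add(0L); // line 0 starts at offset 0
    file.seek(0);
    int b;
    while ((b = file.read()) != -1) {
        if (b == '\n') {
            starts.add(file.getFilePointer()); // the next line starts here
        }
    }
    long[] result = new long[starts.size()];
    for (int i = 0; i < result.length; i++) {
        result[i] = starts.get(i);
    }
    return result;
}
Reading byte by byte through RandomAccessFile is slow; buffering the scan would be the obvious refinement, at the cost of tracking offsets manually.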
The answer turned out to be less painful than I assumed:
// This gives me access to buffering and charset magic
reader = new BufferedReader(new InputStreamReader(
        Channels.newInputStream(randomAccessFile.getChannel()), encoding));
...
I can then implement a readLine() method which reads character by character. Using String.getBytes(encoding) I can keep track of the offset in the file. Calling seek() on the underlying RandomAccessFile allows me to reposition the cursor at will. There are probably some bugs lurking in there, but the basic tests seem to work.
public String readLine() throws IOException {
    eol = "";
    lastLineByteCount = 0;
    StringBuilder builder = new StringBuilder();
    char[] characters = new char[1];
    int status = reader.read(characters, 0, 1);
    if (status == -1) {
        return null;
    }
    char c = characters[0];
    while (status != -1) {
        if (c == '\n') {
            eol += c;
            break;
        }
        if (c == '\r') {
            eol += c;
        } else {
            builder.append(c);
        }
        status = reader.read(characters, 0, 1);
        c = characters[0];
    }
    String line = builder.toString();
    lastLineByteCount = line.getBytes(encoding).length + eol.getBytes(encoding).length;
    return line;
}
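Usage might then look like the following sketch, where reader is an instance of the custom class above (hypothetical: it assumes lastLineByteCount is readable from outside, and that the BufferedReader is recreated after every seek(), since the old one may have buffered bytes past the new position):
import java.util.ArrayList;
import java.util.List;

// Remember the byte offset at which each line starts while reading forward.
List<Long> lineStarts = new ArrayList<>();
long offset = 0;
String line;
while ((line = reader.readLine()) != null) {
    lineStarts.add(offset);
    offset += reader.lastLineByteCount; // bytes consumed, including the EOL
}
// Later, jump back to (say) line 42 and re-wrap the channel:
randomAccessFile.seek(lineStarts.get(42));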

From string to ASCII to binary back to ASCII to string in Java

I have sort of a funky question (that I hope hasn't been asked and answered yet). To start, I'll tell you the order of what I'm trying to do and how I'm doing it and then tell you where I'm having a problem:
Convert a string of characters into ASCII numbers
Convert those ASCII numbers into binary and store them in a string
Convert those binary numbers back into ASCII numbers
Convert the ASCII numbers back into normal characters
Here are the methods I've written so far:
public static String strToBinary(String inputString){
    int[] ASCIIHolder = new int[inputString.length()];
    // Storing ASCII representation of characters in array of ints
    for (int index = 0; index < inputString.length(); index++){
        ASCIIHolder[index] = (int) inputString.charAt(index);
    }
    StringBuffer binaryStringBuffer = new StringBuffer();
    /* Now appending values of ASCIIHolder to binaryStringBuffer using
     * Integer.toBinaryString in a for loop. Should not get an out of bounds
     * exception because more than 1 element will be added to StringBuffer
     * each iteration.
     */
    for (int index = 0; index < inputString.length(); index++){
        binaryStringBuffer.append(Integer.toBinaryString(ASCIIHolder[index]));
    }
    String binaryToBeReturned = binaryStringBuffer.toString();
    binaryToBeReturned.replace(" ", "");
    return binaryToBeReturned;
}

public static String binaryToString(String binaryString){
    int charCode = Integer.parseInt(binaryString, 2);
    String returnString = new Character((char) charCode).toString();
    return returnString;
}
I'm getting a NumberFormatException when I run the code and I think it's because the program is trying to convert the binary digits as one entire binary number rather than as separate letters. Based on what you see here, is there a better way to do this overall and/or how can I tell the computer to recognize the ASCII characters when it's iterating through the binary code? Hope that's clear and if not I'll be checking for comments.
So I used OP's code with some modifications and it works really well for me.
I'll post it here for future people. I don't think OP needs it anymore because he probably figured it out in the past 2 years.
public class Convert {
    public String strToBinary(String inputString) {
        int[] ASCIIHolder = new int[inputString.length()];
        // Storing ASCII representation of characters in array of ints.
        // Note: 7 bits only covers code points below 128, i.e. plain ASCII.
        for (int index = 0; index < inputString.length(); index++) {
            ASCIIHolder[index] = (int) inputString.charAt(index);
        }
        StringBuffer binaryStringBuffer = new StringBuffer();
        // Left-pad each value to exactly 7 bits, so that binaryToString
        // can read the result back in fixed-width chunks.
        for (int index = 0; index < inputString.length(); index++) {
            String bits = Integer.toBinaryString(ASCIIHolder[index]);
            while (bits.length() < 7) {
                bits = "0" + bits;
            }
            binaryStringBuffer.append(bits);
        }
        return binaryStringBuffer.toString();
    }

    public String binaryToString(String binaryString) {
        String returnString = "";
        int charCode;
        for (int i = 0; i < binaryString.length(); i += 7) {
            charCode = Integer.parseInt(binaryString.substring(i, i + 7), 2);
            returnString += (char) charCode;
        }
        return returnString;
    }
}
I'd like to thank OP for writing most of it out for me. Fixing errors is much easier than writing new code.
You've got at least two problems here:
You're just concatenating the binary strings, with no separators. So if you had "1100" and then "0011" you'd get "11000011" which is the same result as if you had "1" followed by "1000011".
You're calling String.replace and ignoring the return result. This sort of doesn't matter as you're replacing spaces, and there won't be any spaces anyway... but there should be!
Of course you don't have to use separators - but if you don't, you need to make sure that you include all 16 bits of each UTF-16 code point. (Or validate that your string only uses a limited range of characters and go down to an appropriate number of bits, e.g. 8 bits for ISO-8859-1 or 7 bits for ASCII.)
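A minimal sketch of the no-separator variant, padding every char to all 16 bits of its UTF-16 code unit so the decoder can read fixed-width chunks (the method names are reused from the question only for symmetry):
static String strToBinary(String input) {
    StringBuilder bits = new StringBuilder(input.length() * 16);
    for (char c : input.toCharArray()) {
        String b = Integer.toBinaryString(c);
        for (int i = b.length(); i < 16; i++) {
            bits.append('0'); // left-pad to a fixed 16 bits per char
        }
        bits.append(b);
    }
    return bits.toString();
}

static String binaryToString(String bits) {
    StringBuilder out = new StringBuilder(bits.length() / 16);
    for (int i = 0; i < bits.length(); i += 16) {
        out.append((char) Integer.parseInt(bits.substring(i, i + 16), 2));
    }
    return out.toString();
}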
(I have to wonder what the point of all of this is. Homework? I can't see this being useful in real life.)

Reading characters from a file written with .net

I'm trying to use Java to read a string from a file that was written with a .NET BinaryWriter.
I think the problem is that the .NET binary writer uses a 7-bit-per-byte format for its string lengths. By researching online, I came across this code that is supposed to work like the binary reader's ReadString() method. This is in my CSDataInputStream class that extends DataInputStream.
public String readStringCS() throws IOException {
    int stringLength = 0;
    boolean stringLengthParsed = false;
    int step = 0;
    while (!stringLengthParsed) {
        byte part = readByte();
        stringLengthParsed = (((int) part >> 7) == 0);
        int partCutter = part & 127;
        part = (byte) partCutter;
        int toAdd = (int) part << (step * 7);
        stringLength += toAdd;
        step++;
    }
    char[] chars = new char[stringLength];
    for (int i = 0; i < stringLength; i++) {
        chars[i] = readChar();
    }
    return new String(chars);
}
The first part seems to be working, as it returns the correct number of characters (7). But when it reads the characters they are all Chinese! I'm pretty sure the problem is with DataInputStream.readChar(), but I have no idea why it isn't working... I have even tried using
Character.reverseBytes(readChar());
to read the char to see if that would work, but it would just return different Chinese characters.
Maybe I need to emulate .net's way of reading chars? How would I go about doing that?
Is there something else I'm missing?
Thanks.
Okay, so you've parsed the length correctly by the sounds of it - but you're then treating it as the length in characters. As far as I can tell from the documentation it's the length in bytes.
So you should read the data into a byte[] of the right length, and then use:
return new String(bytes, encoding);
where encoding is the appropriate coding based on whatever was written from .NET... it will default to UTF-8, but it can be specified as something else.
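So the tail of readStringCS() might become something like this sketch (assuming the .NET side used its default UTF-8 encoding):
byte[] bytes = new byte[stringLength]; // stringLength is a byte count, not a char count
readFully(bytes); // inherited from DataInputStream; reads exactly that many bytes
return new String(bytes, java.nio.charset.StandardCharsets.UTF_8);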
As an aside, I personally wouldn't extend DataInputStream - I would compose it instead, i.e. make your type or method take a DataInputStream (or perhaps just take InputStream and wrap that in a DataInputStream). In general, if you favour composition over inheritance it can make code clearer and easier to maintain, in my experience.
