Getting an exception when converting a byte array to a String with fixed length - Java

I want to convert bytes into a String.
I have an Android application, and I am using a flat file for data storage.
The flat file holds lots of records. Each record has a fixed size of 10 characters, and I store many String records in sequence.
So when I read one record from the flat file, I always get a fixed number of bytes, because I wrote 10 bytes for every record.
If my string is S = "abc123", it is stored in the flat file as the ASCII value of each character, with the rest padded with 0.
That means the byte array is [97, 98, 99, 49, 50, 51, 0, 0, 0, 0].
To get the actual string back from the byte array, I use the code below, and it works fine.
But when my input string is "1234567890", it causes a problem.
public class MainActivity extends Activity {

    public static short messageNumb = 0;
    public static short appID = 16;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        // record with a fixed size of 10 bytes
        byte[] recordBytes = new byte[10];
        // fill the record with 0's
        Arrays.fill(recordBytes, (byte) 0);
        // input string
        String inputString = "abc123";
        int length = 0;
        int SECTOR_LENGTH = 10;
        // convert to bytes
        byte[] inputBytes = inputString.getBytes();
        // decide how many bytes we have to write
        length = SECTOR_LENGTH < inputBytes.length ? SECTOR_LENGTH : inputBytes.length;
        // copy the bytes into the fixed-size record
        System.arraycopy(inputBytes, 0, recordBytes, 0, length);

        // Here I write this record to the file.
        // Now it's time to read the record back from the file.
        // Suppose I read one record from the file successfully.
        // Convert the bytes we read back into the string we wrote.
        Log.d("TAG", "String is = " + getStringFromBytes(recordBytes));
    }

    public String getStringFromBytes(byte[] inputBytes) {
        String s;
        s = new String(inputBytes);
        return s = s.substring(0, s.indexOf(0));
    }
}
But I run into a problem when my string is a full 10 characters. In that case there are no 0's in my byte array, so this line
s = s.substring(0, s.indexOf(0));
throws the exception below:
java.lang.StringIndexOutOfBoundsException: length=10; regionStart=0; regionLength=-1
at java.lang.String.startEndAndLength(String.java:593)
at java.lang.String.substring(String.java:1474)
So what can I do when my string length is 10?
I see two solutions: check whether inputBytes.length == 10 and skip the substring in that case, or check whether the byte array contains a 0 before applying substring.
I don't want the first one, because I use this code in lots of places in my application. So, is there another way to achieve this?
Please suggest a good solution that works in every case. I think the second solution would be best (check whether the byte array contains a 0, and only then apply the substring).

public String getStringFromBytes(byte[] inputBytes) {
    String s = new String(inputBytes);
    // only cut at the first 0 if there is one; a full 10-character record
    // has no padding, so indexOf returns -1 and substring must be skipped
    int zeroIndex = s.indexOf(0);
    return zeroIndex < 0 ? s : s.substring(0, zeroIndex);
}

I think this line causes the error:
s = s.substring(0, s.indexOf(0));
s.indexOf(0)
returns -1; perhaps you should specify the ASCII code
for zero, which is 48,
so this will work: s = s.substring(0, s.indexOf(48));
Check the documentation for indexOf(int):
public int indexOf (int c) Since: API Level 1 Searches in this string
for the first index of the specified character. The search for the
character starts at the beginning and moves towards the end of this
string.
Parameters c the character to find. Returns the index in this string
of the specified character, -1 if the character isn't found.

Related

Splitting a string with byte length limits in java

I want to split a String into a String[] array whose elements meet the following conditions.
s.getBytes(encoding).length must not exceed maxsize (an int).
If I join the split strings with a StringBuilder or the + operator, the result must be exactly the original string.
The input string may contain Unicode characters, which can take multiple bytes when encoded in e.g. UTF-8.
The desired prototype is shown below.
public static String[] SplitStringByByteLength(String src, String encoding, int maxsize)
And the testing code:
public boolean isNice(String str, String encoding, int max) throws UnsupportedEncodingException {
    StringBuilder b = new StringBuilder();
    String[] splitted = SplitStringByByteLength(str, encoding, max);
    for (String s : splitted) {
        if (s.getBytes(encoding).length > max)
            return false;
        b.append(s);
    }
    // the joined chunks must reproduce the original string exactly
    if (str.compareTo(b.toString()) != 0)
        return false;
    return true;
}
Though it seems easy when the input string contains only ASCII characters, the fact that it could contain multibyte characters confuses me.
Thank you in advance.
Edit: I added my code implementation. (Inefficient)
public static String[] SplitStringByByteLength(String src, String encoding, int maxsize) throws UnsupportedEncodingException {
    ArrayList<String> splitted = new ArrayList<String>();
    StringBuilder builder = new StringBuilder();
    int i = 0;
    while (i < src.length()) {
        String tmp = builder.toString();
        char c = src.charAt(i);
        builder.append(c);
        // if appending this character pushes the chunk over the limit,
        // emit the chunk as it was before the append and start a new one with c
        if (builder.toString().getBytes(encoding).length > maxsize) {
            splitted.add(tmp);
            builder = new StringBuilder();
            builder.append(c);
        }
        ++i;
    }
    // don't forget the last, partially filled chunk
    if (builder.length() > 0)
        splitted.add(builder.toString());
    return splitted.toArray(new String[splitted.size()]);
}
Is this the only way to solve this problem?
The class CharsetEncoder has provision for this requirement. An extract from the Javadoc of the encode method:
public final CoderResult encode(CharBuffer in,
ByteBuffer out,
boolean endOfInput)
Encodes as many characters as possible from the given input buffer, writing the results to the given output buffer...
In addition to reading characters from the input buffer and writing bytes to the output buffer, this method returns a CoderResult object to describe its reason for termination:
...
CoderResult.OVERFLOW indicates that there is insufficient space in the output buffer to encode any more characters. This method should be invoked again with an output buffer that has more remaining bytes. This is typically done by draining any encoded bytes from the output buffer.
A possible code could be:
public static String[] SplitStringByByteLength(String src, String encoding, int maxsize) {
    Charset cs = Charset.forName(encoding);
    CharsetEncoder coder = cs.newEncoder();
    ByteBuffer out = ByteBuffer.allocate(maxsize);   // output buffer of required size
    CharBuffer in = CharBuffer.wrap(src);
    List<String> ss = new ArrayList<>();             // a list to store the chunks
    int pos = 0;
    while (true) {
        CoderResult cr = coder.encode(in, out, true); // try to encode as much as possible
        int newpos = src.length() - in.length();
        String s = src.substring(pos, newpos);
        ss.add(s);        // add what has been encoded to the list
        pos = newpos;     // store new input position
        out.rewind();     // and rewind output buffer
        if (!cr.isOverflow()) {
            break;        // everything has been encoded
        }
    }
    return ss.toArray(new String[0]);
}
This will split the original string in chunks that when encoded in bytes fit as much as possible in byte arrays of the given size (assuming of course that maxsize is not ridiculously small).
The problem lies in the existence of Unicode "supplementary characters" (see Javadoc of the Character class), that take up two "character places" (a surrogate pair) in a String, and you shouldn't split your String in the middle of such a pair.
An easy approach to splitting would be to stick to the worst case that a single Unicode code point takes at most four bytes in UTF-8, and split the string after a fixed number of code points: for a 400-byte limit, after every 99 code points (using string.offsetByCodePoints(pos, 99)). In most cases you won't fill the 400 bytes, but you'll be on the safe side.
Some words about code points and characters
When Java started, Unicode had less than 65536 characters, so Java decided that 16 bits were enough for a character. Later the Unicode standard exceeded the 16-bit limit, and Java had a problem: a single Unicode element (now called a "code point") no longer fit into a single Java character.
They decided to go for an encoding into 16-bit entities, being 1:1 for most usual code points, and occupying two "characters" for the exotic code points beyond the 16-bit limit (the pair built from so-called "surrogate characters" from a spare code range below 65535). So now it can happen that e.g. string.charAt(5) and string.charAt(6) must be seen in combination, as a "surrogate pair", together encoding one Unicode code point.
That's the reason why you shouldn't split a string at an arbitrary index.
To help the application programmer, the String class then got a new set of methods, working in code point units, and e.g. string.offsetByCodePoints(pos, 99) means: from the index pos, advance by 99 code points forward, giving an index that will often be pos+99 (in case the string doesn't contain anything exotic), but might be up to pos+198, if all the following string elements happen to be surrogate pairs.
Using the code-point methods, you are safe not to land in the middle of a surrogate pair.
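To make that concrete, here is a minimal sketch of the worst-case approach (the method name splitByCodePoints and the flat 4-bytes-per-code-point bound are my additions, not part of the answer above; it assumes UTF-8, a maxsize of at least 4, and that java.util.List and java.util.ArrayList are imported):
public static String[] splitByCodePoints(String src, int maxsize) {
    int step = maxsize / 4;                // worst case: 4 bytes per code point in UTF-8
    List<String> chunks = new ArrayList<>();
    int pos = 0;
    while (pos < src.length()) {
        // never step past the end, and never land in the middle of a surrogate pair
        int remaining = src.codePointCount(pos, src.length());
        int next = src.offsetByCodePoints(pos, Math.min(step, remaining));
        chunks.add(src.substring(pos, next));
        pos = next;
    }
    return chunks.toArray(new String[0]);
}
Every chunk holds at most maxsize / 4 code points, so its UTF-8 encoding can never exceed maxsize bytes; the trade-off is that most chunks stay well under the limit.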

Having problems with splitting a String into max 1Mb size subStrings

I have to split a String into substrings of at most 1 MB each. With UTF-8 as the character encoding, some letters take up more than one byte, so I have to avoid splitting a character in the middle (for example, 'á' is 2 bytes, so one of those bytes must not end up at the end of one substring and the other at the beginning of the next). This is my current method:
public static List<String> cutString3(String original, int chunkSize, String encoding) throws UnsupportedEncodingException {
    List<String> strings = new ArrayList<>();
    final int end = original.length();
    int from = 0;
    int to = 0;
    do {
        to = (to + chunkSize > end) ? end : to + chunkSize;
        String chunk = original.substring(from, to); // get chunk
        while (chunk.getBytes(encoding).length > chunkSize) { // cut the chunk from the end
            chunk = original.substring(from, --to);
        }
        strings.add(chunk); // add chunk to collection
        from = to; // next chunk
    } while (to < end);
    return strings;
}
I'm using the method below to generate an example String:
private static String createDataSize(int msgSize) {
    StringBuilder sb = new StringBuilder(msgSize);
    for (int i = 0; i < msgSize; i++) {
        sb.append("a");
    }
    return sb.toString();
}
I call the method like this:
String exampleString = createDataSize(1024 * 1024 * 3);
cutString3(exampleString, 1024 * 1024, "UTF-8");
That works without problems: I get back three Strings, because the 3 MB String was split into three 1 MB Strings. But if I change createDataSize() to append 'á' instead, so the example String consists only of "áááááá...", the inner while loop in cutString3 takes forever, since it removes one 'á' at a time until the chunk fits into the given size. How can I improve the inner while loop, or come up with a similar solution? The String can be smaller than 1 MB, just not bigger!
Binary-search logic would clearly fit your need: instead of removing one character per pass, cut the chunk back by half of the remaining overshoot; if there is then still room, add half of that back, if not, remove another half, and so on.
A simpler solution is to remove roughly the difference between chunk.getBytes(encoding).length and chunkSize in one step, and then see how many bytes you can still add if you want to fill the chunk completely.
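As a hedged sketch of that second suggestion, here is a drop-in replacement for the inner while loop of cutString3 above (the divisor 3 is my own assumption: in UTF-8, a single Java char never encodes to more than three bytes):
int byteLen = chunk.getBytes(encoding).length;
while (byteLen > chunkSize) {
    int excess = byteLen - chunkSize;
    // each removed char frees at least one byte (and at most three in UTF-8),
    // so the overshoot shrinks geometrically instead of one character per pass
    to -= Math.max(1, excess / 3);
    chunk = original.substring(from, to);
    byteLen = chunk.getBytes(encoding).length;
}
The loop stops at or just under chunkSize, so a chunk may end up slightly smaller than 1 MB, which the question explicitly allows.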

Java: Remove first UTF string from byte array

I'm trying to remove a written string from a byte array while keeping the remaining objects separate, as they originally were:
byte[] data... // this is populated with the following:
// 00094E6966747943686174001C00074D657373616765000B4372616674656446757279000474657374
// to string using converter : " ChannelMessageUsernametest"
// notice that tab/whitespace, ignore quotes
// The byte array was compiled by writing the following (writeUTF from a writer):
// Channel
// Message
// Username
// test
Now I'm trying to strip Channel from the byte array:
ByteArrayDataInput input = ByteStreams.newDataInput(message);
String channel = input.readUTF(); // Channel, don't want this
String message = input.readUTF(); // Message
// this works, but I don't want Channel,
// and I can't remove it from the data before it arrives;
// I have to work with what I have
Here is my problem:
byte[] newData = Arrays.copyOfRange(data, channel.length() + 2, data.length);
// I use Arrays.copyOfRange to strip the leading "whitespace" bytes (I assume they're not needed);
// since the length prefix takes two bytes, I skip channel.length() + 2
// ...
ByteArrayDataInput newInput = ByteStreams.newDataInput(newData);
String channel = newInput.readUTF(); // MessageUsernametest
See how I lose the separation of the objects? How can I keep the original "sections" of objects from byte[] data inside byte[] newData?
It's safe to assume that channel (before and after stripping) is a string.
It's NOT safe to assume that every object is a string; assume everything is random, because it is.
As long as you can guarantee that channel is always in a reasonable character range (for example alphanumeric), changing the channel.length() + 2 to channel.length() + 4 should be sufficient.
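If the first chunk really was written with writeUTF, another option (my sketch, not part of the answer above) is to skip it by its own two-byte length prefix instead of guessing from channel.length():
// assumes data starts with a writeUTF record: a big-endian unsigned short
// holding the payload length in bytes, followed by that many payload bytes
int utfLen = ((data[0] & 0xFF) << 8) | (data[1] & 0xFF);
byte[] newData = Arrays.copyOfRange(data, 2 + utfLen, data.length);
This works regardless of the channel's character range, because the prefix counts encoded bytes rather than characters.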
Java Strings have 16-bit elements, so it is safe to convert a byte array into a String, although not as memory efficient:
private byte[] removeElements(byte[] data, int fromIndex, int len) {
    String str1 = new String(data).substring(0, fromIndex);
    // take the rest of the decoded string; its length may differ from data.length
    String str2 = new String(data).substring(fromIndex + len);
    return (str1 + str2).getBytes();
}
In the same manner, you can also search for a String inside the byte array:
private int findStringInByteArray(byte[] mainByte, String str, int fromIndex) {
    String main = new String(mainByte);
    return main.indexOf(str, fromIndex);
}
Now you can call these methods together:
byte[] newData = removeElements(
        data,
        findStringInByteArray(data, channel, 0),
        channel.length());

From string to ASCII to binary back to ASCII to string in Java

I have sort of a funky question (one that I hope hasn't been asked and answered yet). To start, I'll lay out the order of what I'm trying to do and how I'm doing it, and then tell you where I'm having a problem:
Convert a string of characters into ASCII numbers
Convert those ASCII numbers into binary and store them in a string
Convert those binary numbers back into ASCII numbers
Convert the ASCII numbers back into normal characters
Here are the methods I've written so far:
public static String strToBinary(String inputString) {
    int[] ASCIIHolder = new int[inputString.length()];
    // Storing ASCII representation of characters in array of ints
    for (int index = 0; index < inputString.length(); index++) {
        ASCIIHolder[index] = (int) inputString.charAt(index);
    }
    StringBuffer binaryStringBuffer = new StringBuffer();
    /* Now appending values of ASCIIHolder to binaryStringBuffer using
     * Integer.toBinaryString in a for loop. Should not get an out of bounds
     * exception because more than 1 element will be added to StringBuffer
     * each iteration.
     */
    for (int index = 0; index < inputString.length(); index++) {
        binaryStringBuffer.append(Integer.toBinaryString(ASCIIHolder[index]));
    }
    String binaryToBeReturned = binaryStringBuffer.toString();
    binaryToBeReturned.replace(" ", "");
    return binaryToBeReturned;
}

public static String binaryToString(String binaryString) {
    int charCode = Integer.parseInt(binaryString, 2);
    String returnString = new Character((char) charCode).toString();
    return returnString;
}
I'm getting a NumberFormatException when I run the code, and I think it's because the program is trying to parse the binary digits as one entire binary number rather than as separate letters. Based on what you see here, is there a better way to do this overall, and/or how can I tell the computer to recognize the individual ASCII characters when it's iterating through the binary code? I hope that's clear; if not, I'll be checking the comments.
I used the OP's code with some modifications, and it works really well for me.
I'll post it here for future readers. I don't think the OP needs it anymore, since he probably figured it out in the past two years.
public class Convert {

    public String strToBinary(String inputString) {
        int[] ASCIIHolder = new int[inputString.length()];
        // Storing ASCII representation of characters in array of ints
        for (int index = 0; index < inputString.length(); index++) {
            ASCIIHolder[index] = (int) inputString.charAt(index);
        }
        StringBuffer binaryStringBuffer = new StringBuffer();
        /* Now appending values of ASCIIHolder to binaryStringBuffer using
         * Integer.toBinaryString in a for loop. Should not get an out of bounds
         * exception because more than 1 element will be added to StringBuffer
         * each iteration.
         */
        for (int index = 0; index < inputString.length(); index++) {
            binaryStringBuffer.append(Integer.toBinaryString(ASCIIHolder[index]));
        }
        String binaryToBeReturned = binaryStringBuffer.toString();
        binaryToBeReturned.replace(" ", "");
        return binaryToBeReturned;
    }

    public String binaryToString(String binaryString) {
        String returnString = "";
        int charCode;
        for (int i = 0; i < binaryString.length(); i += 7) {
            charCode = Integer.parseInt(binaryString.substring(i, i + 7), 2);
            String returnChar = new Character((char) charCode).toString();
            returnString += returnChar;
        }
        return returnString;
    }
}
I'd like to thank OP for writing most of it out for me. Fixing errors is much easier than writing new code.
You've got at least two problems here:
You're just concatenating the binary strings, with no separators. So if you had "1100" and then "0011" you'd get "11000011" which is the same result as if you had "1" followed by "1000011".
You're calling String.replace and ignoring the return result. This sort of doesn't matter as you're replacing spaces, and there won't be any spaces anyway... but there should be!
Of course you don't have to use separators - but if you don't, you need to make sure that you include all 16 bits of each UTF-16 code point. (Or validate that your string only uses a limited range of characters and go down to an appropriate number of bits, e.g. 8 bits for ISO-8859-1 or 7 bits for ASCII.)
(I have to wonder what the point of all of this is. Homework? I can't see this being useful in real life.)
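As a hedged sketch of the fixed-width option (my code, not from either answer above; it assumes the input only contains characters up to U+00FF, so 8 bits per character are enough):
public static String strToBinary(String input) {
    StringBuilder bits = new StringBuilder(input.length() * 8);
    for (char c : input.toCharArray()) {
        String b = Integer.toBinaryString(c);
        // left-pad every character to exactly 8 bits so the stream can be
        // decoded unambiguously without separators
        for (int pad = b.length(); pad < 8; pad++) {
            bits.append('0');
        }
        bits.append(b);
    }
    return bits.toString();
}

public static String binaryToString(String bits) {
    StringBuilder out = new StringBuilder(bits.length() / 8);
    for (int i = 0; i < bits.length(); i += 8) {
        out.append((char) Integer.parseInt(bits.substring(i, i + 8), 2));
    }
    return out.toString();
}
For full UTF-16 coverage you would pad to 16 bits per char instead, exactly as the answer above suggests.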

How to detect end of string in byte array to string conversion?

I receive from a socket a string in a byte array, which looks like:
[128, 5, 6, 3, 45, 0, 0, 0, 0, 0]
The size given by the network protocol is the total length of the string (including the zeros), so in my example 10.
If I simply do:
String myString = new String(myBuffer);
I get five incorrect characters at the end of the string. The conversion doesn't seem to detect the end-of-string character (0).
To get the correct size and the correct string I do this:
int sizeLabelTmp = 0;
// iterate over the 10 bytes to get the real size of the string
for (int j = 0; j < sizeLabel; j++) {
    byte charac = datasRec[j];
    if (charac == 0)
        break;
    sizeLabelTmp++;
}
// create a temporary byte array to make a correct conversion
byte[] label = new byte[sizeLabelTmp];
for (int j = 0; j < sizeLabelTmp; j++) {
    label[j] = datasRec[j];
}
String myString = new String(label);
Is there a better way to handle the problem ?
Thanks
Maybe it's too late, but it may help others. The simplest thing you can do is new String(myBuffer).trim(), which gives you exactly what you want.
0 isn't an "end of string character". It's just a byte. Whether or not it only comes at the end of the string depends on what encoding you're using (and what the text can be). For example, if you used UTF-16, every other byte would be 0 for ASCII characters.
If you're sure that the first 0 indicates the end of the string, you can use something like the code you've given, but I'd rewrite it as:
int size = 0;
while (size < data.length) {
    if (data[size] == 0) {
        break;
    }
    size++;
}
// Specify the appropriate encoding as the last argument
String myString = new String(data, 0, size, "UTF-8");
I strongly recommend that you don't just use the platform default encoding - it's not portable, and may well not allow for all Unicode characters. However, you can't just decide arbitrarily - you need to make sure that everything producing and consuming this data agrees on the encoding.
If you're in control of the protocol, it would be much better if you could introduce a length prefix before the string, to indicate how many bytes are in the encoded form. That way you'd be able to read exactly the right amount of data (without "over-reading") and you'd be able to tell if the data was truncated for some reason.
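A minimal sketch of that length-prefix idea, using java.io.DataOutputStream and DataInputStream (the 4-byte int prefix and UTF-8 here are my choices, not something the protocol above prescribes):
static void writeRecord(DataOutputStream out, String s) throws IOException {
    byte[] encoded = s.getBytes("UTF-8");
    out.writeInt(encoded.length);  // announce how many encoded bytes follow
    out.write(encoded);
}

static String readRecord(DataInputStream in) throws IOException {
    int len = in.readInt();        // read exactly the advertised number of bytes
    byte[] buf = new byte[len];
    in.readFully(buf);             // throws EOFException if the data was truncated
    return new String(buf, "UTF-8");
}
With this framing there is no padding to strip and no over-reading, and truncated data is detected instead of being silently decoded.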
You can always start at the end of the byte array and go backwards until you hit the first non-zero byte. Then just copy that range into a new byte array and build a String from it. Hope this helps:
byte[] foo = {28, 6, 3, 45, 0, 0, 0, 0};
int i = foo.length - 1;
while (foo[i] == 0) {
    i--;
}
byte[] bar = Arrays.copyOf(foo, i + 1);
String myString = new String(bar, "UTF-8");
System.out.println(myString.length());
Will give you a result of 4.
Strings in Java aren't terminated with a 0 like in some other languages. A 0 byte will be turned into the so-called null character, which is allowed to appear in a String. I suggest you use some trimming scheme that either finds the first index of the array that's a 0 and uses a sub-array to construct the String (assuming everything after that is 0 as well), or just constructs the String and calls trim(). That will remove leading and trailing whitespace, which is any character with ASCII code 32 or lower.
The latter won't work if you have leading whitespace you must preserve. Using a StringBuilder and deleting characters at the end as long as they're the null character would work better in that case.
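A small sketch of that StringBuilder variant (stripTrailingNuls is a hypothetical helper name, and the charset is left to the caller; it assumes you only want the trailing null characters removed and any whitespace preserved):
static String stripTrailingNuls(byte[] data, String charsetName) throws java.io.UnsupportedEncodingException {
    StringBuilder sb = new StringBuilder(new String(data, charsetName));
    // drop only the trailing padding, leaving leading and trailing whitespace alone
    while (sb.length() > 0 && sb.charAt(sb.length() - 1) == '\0') {
        sb.setLength(sb.length() - 1);
    }
    return sb.toString();
}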
It appears to me that you are ignoring the read count returned by the read() method. The trailing null bytes probably weren't sent; they are probably just left over from the initial state of the buffer.
int count = in.read(buffer);
if (count < 0) {
    // EOS: close the socket etc.
} else {
    String s = new String(buffer, 0, count);
}
Without diving into the protocol considerations the OP mentioned, how about this for trimming the trailing zeroes?
public static String bytesToString(byte[] data) {
    String dataOut = "";
    for (int i = 0; i < data.length; i++) {
        if (data[i] != 0x00)
            dataOut += (char) data[i];
    }
    return dataOut;
}
