Java bug? Why extra zero byte in utf8 encoding?

The following code
public class CharsetProblem {
public static void main(String[] args) {
//String str = "aaaaaaaaa";
String str = "aaaaaaaaaa";
Charset cs1 = Charset.forName("ASCII");
Charset cs2 = Charset.forName("utf8");
System.out.println(toHex(cs1.encode(str).array()));
System.out.println(toHex(cs2.encode(str).array()));
}
public static String toHex(byte[] outputBytes) {
StringBuilder builder = new StringBuilder();
for(int i=0; i<outputBytes.length; ++i) {
builder.append(String.format("%02x", outputBytes[i]));
}
return builder.toString();
}
}
returns
61616161616161616161
6161616161616161616100
i.e. the utf8 encoding returns an extra zero byte. With fewer a's there are no extra bytes; with more a's we can get more and more extra bytes.
Why?
How can one work around this?

You can't just get the backing array and use it. ByteBuffers have a capacity, position and a limit.
System.out.println(cs1.encode(str).remaining());
System.out.println(cs2.encode(str).remaining());
produces:
10
10
Try this instead:
public static void main(String[] args) {
//String str = "aaaaaaaaa";
String str = "aaaaaaaaaa";
Charset cs1 = Charset.forName("ASCII");
Charset cs2 = Charset.forName("utf8");
System.out.println(toHex(cs1.encode(str)));
System.out.println(toHex(cs2.encode(str)));
}
public static String toHex(ByteBuffer buff) {
StringBuilder builder = new StringBuilder();
while (buff.remaining() > 0) {
builder.append(String.format("%02x", buff.get()));
}
return builder.toString();
}
It produces the expected:
61616161616161616161
61616161616161616161

You're assuming that the backing array for a ByteBuffer is precisely the correct size to hold the contents, but that is not guaranteed. In fact, the contents don't even need to start at the first byte of the array! Study the API for ByteBuffer and you'll understand what's going on: the readable contents start at index arrayOffset() + position() of the backing array and end just before index arrayOffset() + limit().
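As a minimal sketch of that point, the helper below copies only the readable bytes of a buffer, whatever the size or offset of its backing array. The class name BufferBytes and the method toBytes are made up for illustration; only standard java.nio calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferBytes {
    // Hypothetical helper: copies exactly the bytes between position and limit.
    // duplicate() shares the content but has its own position, so the
    // caller's buffer is left untouched.
    static byte[] toBytes(ByteBuffer buffer) {
        ByteBuffer view = buffer.duplicate();
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer encoded = StandardCharsets.UTF_8.encode("aaaaaaaaaa");
        // capacity() may be larger than remaining(); the exact value is an
        // implementation detail of the encoder.
        System.out.println("capacity  = " + encoded.capacity());
        System.out.println("remaining = " + encoded.remaining());
        System.out.println("copied    = " + toBytes(encoded).length);
    }
}
```

Because it goes through get() rather than array(), the same helper also works for buffers that have no accessible backing array, such as direct or read-only buffers.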

The answer has already been given, but as I ran into the same problem, I think it might be useful to provide more details:
Invoking cs1.encode(str).array() or cs2.encode(str).array() returns a reference to the whole array allocated to the ByteBuffer at that time. The capacity of the array may be greater than what's actually used. To retrieve only the used portion you should do something like the following:
ByteBuffer bf1 = cs1.encode(str);
ByteBuffer bf2 = cs2.encode(str);
// The used portion runs from arrayOffset() + position() up to arrayOffset() + limit().
System.out.println(toHex(Arrays.copyOfRange(bf1.array(), bf1.arrayOffset() + bf1.position(), bf1.arrayOffset() + bf1.limit())));
System.out.println(toHex(Arrays.copyOfRange(bf2.array(), bf2.arrayOffset() + bf2.position(), bf2.arrayOffset() + bf2.limit())));
This yields the result you expect.

Related

Is there a way to concatenate Java strings in less than O(n) time?

My homework question involves joining strings in a particular sequence. We are first given the strings, followed by a set of instructions that tell us how to concatenate them; finally we print the output string.
I have used the Kattis FastIO class to handle buffered input and output. Below is my algorithm, which iterates through the instructions to concatenate the strings. I have tried making the array of normal strings, StringBuffers and StringBuilders.
The program seems to work as intended, but it gives a time limit error on my submission platform due to inefficiency. It seems like appending the way I did is O(n); is there any faster way?
public class JoinStrings {
public static void main(String[] args) {
Kattio io = new Kattio(System.in, System.out);
ArrayList<StringBuilder> stringList = new ArrayList<StringBuilder>();
int numStrings = io.getInt();
StringBuilder[] stringArray = new StringBuilder[numStrings];
for (int i = 0; i < numStrings; i++) {
String str = io.getWord();
stringArray[i] = new StringBuilder(str);
}
StringBuilder toPrint = stringArray[0];
while (io.hasMoreTokens()) {
int a = io.getInt();
int b = io.getInt();
stringArray[a-1].append(stringArray[b-1]); // this is the line that is done N times
toPrint = stringArray[a-1];
}
io.println(toPrint.toString());
io.flush();
}
}

StringBuilder.append() copies the chars of the appended string into the existing builder. It's fast, but not free.
Instead of repeatedly appending to the StringBuilders in the array, keep track of the order in which the string indexes need to be appended. Then, at the end, append the strings once, in the order given by that index list.
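One way to sketch the index-tracking idea is a linked chain of indexes: each instruction only rewires two integers, and the characters are copied once at the end. This is an illustrative sketch, not the asker's code. It assumes that a string is never used again after it has been appended to another one; the names JoinSketch, join, next and tail are made up.

```java
public class JoinSketch {
    // ops[k] = {a, b} with 1-based indexes, meaning "append string b to string a".
    // Assumption: once b has been appended somewhere, b is not used again.
    static String join(String[] words, int[][] ops) {
        int n = words.length;
        int[] next = new int[n]; // index of the piece that follows piece i, or -1
        int[] tail = new int[n]; // last piece of the chain that starts at i
        for (int i = 0; i < n; i++) {
            next[i] = -1;
            tail[i] = i;
        }
        int head = 0; // chain to print; stays 0 if there are no instructions
        for (int[] op : ops) {
            int a = op[0] - 1;
            int b = op[1] - 1;
            next[tail[a]] = b; // hook chain b behind the end of chain a: O(1)
            tail[a] = tail[b];
            head = a;
        }
        StringBuilder out = new StringBuilder();
        for (int i = head; i != -1; i = next[i]) {
            out.append(words[i]); // every character is copied exactly once
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] words = {"cute", "cat", "kattis", "is"};
        int[][] ops = {{3, 2}, {4, 1}, {3, 4}};
        System.out.println(join(words, ops)); // prints "kattiscatiscute"
    }
}
```

The total work is proportional to the number of instructions plus the length of the final string, instead of re-copying already joined text on every append.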

Array Index out of Bound Exception for returning Char Array

I am new to Java programming and I was writing code to replace spaces in Strings with %20 and return the final String. Here is the code for the problem. Since I am new to programming please tell me what I did wrong. Sorry for my bad English.
package Chapter1;
import java.io.BufferedReader;
import java.io.InputStreamReader;
public class Problem4 {
public char[] replaceSpaces(char[] str_array, int length)
{
int noOfSpaces=0,i,newLength;
for(i=0;i<length;i++)
{
if(str_array[i]==' ')
{
noOfSpaces++;
}
newLength = length + noOfSpaces * 2;
str_array[newLength]='\0';
for(i=0;i<length-1;i++)
{
if(str_array[i]==' ')
{
str_array[newLength-1]='0';
str_array[newLength-2]='2';
str_array[newLength-3]='%';
newLength = newLength-3;
}
str_array[newLength-1]=str_array[i];
newLength = newLength - 1;
}
}
return str_array;
}
public static void main(String args[])throws Exception
{
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
System.out.println("Please enter the string:");
String str = reader.readLine();
char[] str_array = str.toCharArray();
int length = str.length();
Problem4 obj = new Problem4();
char[] result = obj.replaceSpaces(str_array, length);
System.out.println(result);
}
}
But I get the following error:
Please enter the string:
hello world
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 11
at Chapter1.Problem4.replaceSpaces(Problem4.java:19)
at Chapter1.Problem4.main(Problem4.java:46)

How about using String.replaceAll():
String str = reader.readLine();
str = str.replaceAll(" ", "%20");
Sample code here
EDIT:
The problem is at line 19:
str_array[newLength]='\0';//<-- newLength exceeds the char array size
Here the array is static, i.e. its size is fixed. You can use StringBuilder, StringBuffer, etc. to build the new String without worrying about the size for such small operations.

Assuming that you want to see what mistakes you made when implementing your approach, instead of looking for a totally different approach:
(1) As has been pointed out, once an array has been allocated, its size cannot be changed. Your method takes str_array as a parameter, but the resulting array will likely be larger than str_array. Therefore, since str_array's length cannot be changed, you'll need to allocate a new array to hold the result, rather than using str_array. You've computed newLength correctly; allocate a new array of that size:
char[] resultArray = new char[newLength];
(2) As Elliott pointed out, Java strings don't need \0 terminators. If, for some reason, you really want to create an array that has a \0 character at the end, then you have to add 1 to your computed newLength to account for the extra character.
(3) You're actually creating the resulting array backward. I don't know if that is intentional.
if(str_array[i]==' ')
{
str_array[newLength-1]='0';
str_array[newLength-2]='2';
str_array[newLength-3]='%';
newLength = newLength-3;
}
str_array[newLength-1]=str_array[i];
newLength = newLength - 1;
i starts with the first character of the string and goes upward; you're filling in characters starting with the last character of the string (newLength) and going backward. If that's what you intended to do, it wasn't clear from your question. Did you want the output to be "dlrow%20olleh"?
(4) If you did intend to go backward, then what the above code does with a space is to put %20 in the string (backwards), but then it also puts the space into the result. If the input character is a space, you want to make sure you don't execute the two lines that copy the input character to the result. So you'll need to add an else. (Note that this problem will lead to an out-of-bounds error, because you're trying to put more characters into the result than you computed.) You'll need to have an else in there even if you really meant to build the string forwards and need to change the logic to make it go forward.
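Putting points (1) to (4) together, a forward-building version might look like the sketch below. It is one possible correction rather than the asker's intended design: it allocates a new array of the computed length, writes no '\0' terminator, and uses an else so a space is never copied after its %20. The class name ReplaceSpacesSketch is made up.

```java
public class ReplaceSpacesSketch {
    // Returns a new array in which every space is replaced by '%', '2', '0'.
    static char[] replaceSpaces(char[] input) {
        int spaces = 0;
        for (char c : input) {
            if (c == ' ') {
                spaces++;
            }
        }
        // Each space grows from one char to three, i.e. two extra chars.
        char[] result = new char[input.length + spaces * 2];
        int out = 0; // next free position in result
        for (char c : input) {
            if (c == ' ') {
                result[out++] = '%';
                result[out++] = '2';
                result[out++] = '0';
            } else {
                result[out++] = c;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(new String(replaceSpaces("hello world".toCharArray())));
        // prints "hello%20world"
    }
}
```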

Java arrays are not dynamic (they are Object instances, and they have a field length property that does not change). Because they store the length as a field, it is important to know that they're not '\0' terminated (your attempt to add such a terminator is causing your index out of bounds Exception). Your method doesn't appear to access any instance fields or methods, so I'd make it static. Then you could use a StringBuilder and a for-each loop. Something like
public static char[] replaceSpaces(char[] str_array) {
StringBuilder sb = new StringBuilder();
for (char ch : str_array) {
sb.append((ch != ' ') ? ch : "%20");
}
return sb.toString().toCharArray();
}
Then call it like
char[] result = replaceSpaces(str_array);
Finally, you might use String str = reader.readLine().replace(" ", "+"); or replaceAll(" ", "%20") as suggested by @Arvind here.
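P.S. When you finally get your result, take care with how you print it. System.out.println(result) works because PrintStream has a println(char[]) overload, but anything that converts the array to a String implicitly (for example "Result: " + result) will not print the characters. In those cases use
System.out.println(Arrays.toString(result));
or
System.out.println(new String(result));
A char[] is not a String and Java arrays (disappointingly) don't override toString(), so an implicit conversion gives you the one from Object.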

please tell me what I did wrong
You tried to replace a single character with three characters %20. That's not possible because arrays are fixed length.
Therefore you must allocate a new char[] and copy the characters from str_array into the new array.
for (i = 0; i < length; i++) {
if (str_array[i] == ' ') {
noOfSpaces++;
}
}
newLength = length + noOfSpaces * 2;
char[] newArray = new char[newLength];
// copy characters from str_array into newArray

The exception is raised at the line str_array[newLength]='\0'; because newLength is not a valid index of str_array (valid indexes run from 0 to str_array.length - 1).
The size of an array cannot be increased once it is allocated, so copy it into a larger array instead:
char[] str_array1=Arrays.copyOf(str_array, newLength+1);
str_array1[newLength]='\0';
Don't forget the import: import java.util.Arrays;

Convert Java string to byte array

I have a byte array which I'm encrypting then converting to a string so it can be transmitted. When I receive the string I then have to convert the string back into a byte array so it can be decrypted. I have checked that the received string matches the sent string (including length) but when I use something like str.getBytes() to convert it to a byte array, it does not match my original byte array.
example output:
SENT: WzShnf/fOV3NZO2nqnOXZbM1lNwVpcq3qxmXiiv6M5xqC1A3
SENT STR: [B@3e4a9a7d
RECEIVED STR: [B@3e4a9a7d
RECEIVED: W0JAM2U0YTlhN2Q=
any ideas how i can convert the received string to a byte array which matches the sent byte array?
Thanks

You used array.toString(), which is implemented like this:
return "[B@" + Integer.toHexString(this.hashCode());
(In fact it inherits the definition from Object, and the part before the @ simply is the result of getClass().getName().)
And the hashCode here does not depend on the content.
Instead, use new String(array, encoding).
Of course, this only works for byte arrays which are really representable as Java strings (which then contain readable characters), not for arbitrary arrays. For those, it is better to use Base64 as Bozho recommended (but make sure to use it on both sides of the channel).

This looks like Base64. Take a look at commons-codec Base64 class.
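As a hedged sketch of the Base64 round trip: the example below uses java.util.Base64 from the JDK (Java 8 and later) instead of the commons-codec class mentioned above, which is a substitution made here to keep the example dependency-free. The class and method names are made up.

```java
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    // Sender side: turn arbitrary bytes (for example ciphertext) into text.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Receiver side: recover exactly the same bytes from that text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] original = {0, -1, 127, -128, 42};
        String sent = toText(original);
        byte[] received = fromText(sent);
        System.out.println(sent);
        System.out.println(Arrays.equals(original, received)); // prints "true"
    }
}
```

The important part is that the bytes themselves are encoded, never the result of calling toString() on the array.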

You can't just use getBytes() on two different machines, since getBytes uses the platform's default charset.
Decode and encode the array with a specified charset (e.g. UTF-8) to make sure you get the correct results.
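A minimal sketch of naming the charset on both sides, assuming the data is genuinely text (for arbitrary binary data such as ciphertext, a text charset is not a safe transport). ExplicitCharset, encode and decode are illustrative names.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Both directions name the charset, so the result does not depend on
    // the default charset of the machine the code runs on.
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // contains one non-ASCII character
        byte[] bytes = encode(text);
        System.out.println(bytes.length);               // prints 6
        System.out.println(decode(bytes).equals(text)); // prints "true"
    }
}
```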

First convert your byte array to a proper string, by doing
String line = Arrays.toString(your_array);
Then send it and use the function below
public static byte[] StringToByteArray(String line)
{
String some=line.substring(1, line.length()-1);
int element_counter=1;
for(int i=0; i<some.length(); i++)
{
if (some.substring(i, i+1).equals(","))
{
element_counter++;
}
}
int [] comas =new int[element_counter-1];
byte [] a=new byte[element_counter];
if (a.length==1)
{
a[0]= Byte.parseByte(some.substring(0));
}
else
{
int j=0;
for (int i = 0; i < some.length(); i++)
{
if (some.substring(i, i+1).equals(","))
{
comas[j]=i;
j++;
}
}
for (int i=0; i<element_counter; i++)
{
if(i==0)
{
a[i]=Byte.parseByte(some.substring(0, comas[i]));
}
else if (i==element_counter-1)
{
a[i]=Byte.parseByte(some.substring(comas[comas.length-1]+2));
}
else
{
a[i]=Byte.parseByte(some.substring(comas[i-1]+2, comas[i]));
}
}
}
return a;
}

Can an empty Java string be created from a non-empty UTF-8 byte array?

I'm trying to debug something and I'm wondering if the following code could ever return true
public boolean impossible(byte[] myBytes) throws UnsupportedEncodingException {
if (myBytes.length == 0)
return false;
String string = new String(myBytes, "UTF-8");
return string.length() == 0;
}
Is there some value I can pass in that will return true? I've fiddled with passing in just the first byte of a 2 byte sequence, but it still produces a single character string.
To clarify, this happened on a PowerPC chip on Java 1.4 code compiled through GCJ to a native binary executable. This basically means that most bets are off. I'm mostly wondering if Java's 'normal' behaviour, or Java's spec made any promises.

According to the javadoc for java.lang.String, the behavior of new String(byte[], "UTF-8") is not specified when the byte array contains invalid or unexpected data. If you want more predictability in your resultant string, use a CharsetDecoder: http://java.sun.com/j2se/1.5.0/docs/api/java/nio/charset/CharsetDecoder.html.

Possibly.
From the Java 5 API docs "The behavior of this constructor when the given bytes are not valid in the given charset is unspecified."
I guess that it depends on :
Which version of java you're using
Which vendor wrote your JVM (Sun, HP, IBM, the open source one, etc)
Once the docs say "unspecified" all bets are off
Edit: Beaten to it by Trey
Take his advice about using a CharsetDecoder
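A sketch of what the CharsetDecoder route could look like: with CodingErrorAction.REPORT the decoder throws on invalid input instead of doing something unspecified. The class name StrictUtf8 and method decodeStrict are made up; StandardCharsets requires Java 7 or later, so this is not what a Java 1.4 runtime would offer verbatim.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Decodes the bytes as UTF-8 and throws CharacterCodingException on
    // malformed input rather than silently substituting characters.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        try {
            // 0xC3 is the first byte of a two-byte sequence; alone it is invalid.
            decodeStrict(new byte[] {(byte) 0xC3});
            System.out.println("decoded without error");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

With this in place, the caller decides explicitly what an invalid byte sequence should mean instead of relying on constructor behaviour the documentation leaves open.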

If Java handles the BOM correctly (I'm not sure whether that has been fixed yet), then it should be possible to pass in a byte array containing just the BOM (U+FEFF, which in UTF-8 is the byte sequence EF BB BF) and get an empty string.
Update:
I tested that method with all values of 1-3 bytes. None of them returned an empty string on Java 1.6. Here is the test code that I used with different byte array lengths:
public static void main(String[] args) throws UnsupportedEncodingException {
byte[] test = new byte[3];
byte[] end = new byte[test.length];
if (impossible(test)) {
System.out.println(Arrays.toString(test));
}
do {
increment(test, 0);
if (impossible(test)) {
System.out.println(Arrays.toString(test));
}
} while (!Arrays.equals(test, end));
}
private static void increment(byte[] arr, int i) {
arr[i]++;
if (arr[i] == 0 && i + 1 < arr.length) {
increment(arr, i + 1);
}
}
public static boolean impossible(byte[] myBytes) throws UnsupportedEncodingException {
if (myBytes.length == 0) {
return false;
}
String string = new String(myBytes, "UTF-8");
return string.length() == 0;
}

UTF-8 is a variable length encoding scheme, with most "normal" characters being single byte. So any given non-empty byte[] will always translate into a String, I'd have thought.
If you want to play it safe, write a unit test which iterates over every possible byte value, passing in a single-value array of that value, and assert that the string is non-empty.
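Such a check could be sketched as follows, written as a plain method rather than with a test framework. It uses the String(byte[], Charset) constructor (Java 6 and later), which is documented to replace malformed input with the charset's replacement string; that is a different overload from the charset-name one in the question, so it says nothing about GCJ or Java 1.4. SingleByteCheck and countEmpty are made-up names.

```java
import java.nio.charset.StandardCharsets;

public class SingleByteCheck {
    // Decodes every possible single-byte array as UTF-8 and counts how many
    // of the resulting strings are empty.
    static int countEmpty() {
        int empty = 0;
        for (int b = Byte.MIN_VALUE; b <= Byte.MAX_VALUE; b++) {
            String s = new String(new byte[] {(byte) b}, StandardCharsets.UTF_8);
            if (s.isEmpty()) {
                empty++;
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        System.out.println("empty results: " + countEmpty());
    }
}
```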

Java - Create a new String instance with specified length and filled with specific character. Best solution? [duplicate]

I did check the other questions; this question has its focus on solving this particular question the most efficient way.
Sometimes you want to create a new string with a specified length, and with a default character filling the entire string.
ie, it would be cool if you could do new String(10, '*') and create a new String from there, with a length of 10 characters all having a *.
Because such a constructor does not exist, and you cannot extend from String, you have either to create a wrapper class or a method to do this for you.
At this moment I am using this:
protected String getStringWithLengthAndFilledWithCharacter(int length, char charToFill) {
char[] array = new char[length];
int pos = 0;
while (pos < length) {
array[pos] = charToFill;
pos++;
}
return new String(array);
}
It still lacks any checking (i.e., when length is 0 it will not work). I am constructing the array first because I believe it is faster than using string concatenation or using a StringBuffer to do so.
Does anyone have a better solution?

Apache Commons Lang (probably useful enough to be on the classpath of any non-trivial project) has StringUtils.repeat():
String filled = StringUtils.repeat("*", 10);
Easy!

Simply use the StringUtils class from apache commons lang project. You have a leftPad method:
StringUtils.leftPad("foobar", 10, '*'); // Returns "****foobar"

No need to do the loop, and using just standard Java library classes:
protected String getStringWithLengthAndFilledWithCharacter(int length, char charToFill) {
if (length > 0) {
char[] array = new char[length];
Arrays.fill(array, charToFill);
return new String(array);
}
return "";
}
As you can see, I also added suitable code for the length == 0 case.

Some possible solutions.
This creates a String with length-times '0' filled and replaces then the '0' with the charToFill (old school).
String s = String.format("%0" + length + "d", 0).replace('0', charToFill);
This creates a List containing length-times Strings with charToFill and then joining the List into a String.
String s = String.join("", Collections.nCopies(length, String.valueOf(charToFill)));
This creates a unlimited java8 Stream with Strings with charToFill, limits the output to length and collects the results with a String joiner (new school).
String s = Stream.generate(() -> String.valueOf(charToFill)).limit(length).collect(Collectors.joining());

In Java 11, you have repeat:
String s = " ";
s = s.repeat(1);
(Although at the time of writing still subject to change)

char[] chars = new char[10];
Arrays.fill(chars, '*');
String text = new String(chars);

To improve performance you could have a single predefined string, if you know the max length, like:
String template = "####################################";
And then simply perform a substring once you know the length.
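A small sketch of that template idea, with an explicit bounds check since the approach only works up to the template's length. TemplateFill, TEMPLATE and hashes are illustrative names, and the fill character is fixed to '#' as in the template above.

```java
public class TemplateFill {
    // Predefined template; its length is the maximum supported fill length.
    private static final String TEMPLATE = "####################################";

    // Returns a string of `length` '#' characters cut from the template.
    static String hashes(int length) {
        if (length < 0 || length > TEMPLATE.length()) {
            throw new IllegalArgumentException("length must be between 0 and " + TEMPLATE.length());
        }
        return TEMPLATE.substring(0, length);
    }

    public static void main(String[] args) {
        System.out.println(hashes(5)); // prints "#####"
    }
}
```

Whether substring avoids a copy depends on the Java version, so this is not guaranteed to be faster than filling a char[].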

Solution using Google Guava
String filled = Strings.repeat("*", 10);

public static String fillString(int count,char c) {
StringBuilder sb = new StringBuilder( count );
for( int i=0; i<count; i++ ) {
sb.append( c );
}
return sb.toString();
}
What is wrong?

using Dollar is simple:
String filled = $("=").repeat(10).toString(); // produces "=========="

Solution using Google Guava, since I prefer it to Apache Commons-Lang:
/**
* Returns a String with exactly the given length composed entirely of
* the given character.
* @param length the length of the returned string
* @param c the character to fill the String with
*/
public static String stringOfLength(final int length, final char c)
{
return Strings.padEnd("", length, c);
}

The above is fine. Do you mind if I ask you a question: is this causing you a problem? It seems to me you are optimizing before you know if you need to.
Now for my over-engineered solution. In many (though not all) cases you can use CharSequence instead of a String.
public class OneCharSequence implements CharSequence {
private final char value;
private final int length;
public OneCharSequence(final char value, final int length) {
this.value = value;
this.length = length;
}
public char charAt(int index) {
if(index >= 0 && index < length) return value;
throw new IndexOutOfBoundsException();
}
public int length() {
return length;
}
public CharSequence subSequence(int start, int end) {
return new OneCharSequence(value, (end-start));
}
public String toString() {
char[] array = new char[length];
Arrays.fill(array, value);
return new String(array);
}
}

One extra note: it seems that all public ways of creating a new String instance necessarily involve copying whatever buffer you are working with, be it a char[], a StringBuffer or a StringBuilder. From the String javadoc (repeated in the respective toString methods of the other classes):
The contents of the character array are copied; subsequent modification of the character array does not affect the newly created string.
So you'll end up with a possibly big memory copy operation after the "fast filling" of the array. The only solution that may avoid this issue is the one from @mlk, if you can manage to work directly with the proposed CharSequence implementation (which may be possible).
PS: I would post this as a comment but I don't have enough reputation to do that yet.

Try this, using the substring(int start, int end) method:
String myLongString = "abcdefghij";
String shortStr = myLongString;
if (myLongString.length() >= 10) {
shortStr = myLongString.substring(0, 5) + "...";
}
This gives abcde followed by three dots.

My solution:
String pw = "1321";
if (pw.length() < 16){
for(int x = pw.length() ; x < 16 ; x++){
pw += "*";
}
}
The output :
1321************

Try this jobber
String stringy =null;
byte[] buffer = new byte[100000];
for (int i = 0; i < buffer.length; i++) {
buffer[i] =0;
}
stringy =StringUtils.toAsciiString(buffer);
