Is a character 1 byte or 2 bytes in Java?

I thought characters in Java are 16 bits, as suggested in the Java docs. Isn't that the case for strings? I have code that stores an object into a file:
public static void storeNormalObj(File outFile, Object obj) {
    FileOutputStream fos = null;
    ObjectOutputStream oos = null;
    try {
        fos = new FileOutputStream(outFile);
        oos = new ObjectOutputStream(fos);
        oos.writeObject(obj);
        oos.flush();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            if (oos != null) {
                oos.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        try {
            if (fos != null) {
                fos.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Basically, I tried to store the string "abcd" into the file "output". When I opened "output" with an editor and deleted the non-string part, what was left was just the string "abcd", which is 4 bytes in total. Does anyone know why? Does Java automatically save space by using ASCII instead of Unicode for strings that can be represented in ASCII? Thanks

(I think by "non-string part" you are referring to the bytes that ObjectOutputStream emits when you create it. It is possible you don't want to use ObjectOutputStream at all, but I don't know your requirements.)
Just FYI, Unicode and UTF-8 are not the same thing. Unicode is a standard that specifies, amongst other things, what characters are available. UTF-8 is a character encoding that specifies how those characters shall be physically encoded in 1s and 0s. UTF-8 uses 1 byte for ASCII (code points <= 127) and up to 4 bytes to represent other Unicode characters.
UTF-8 is a strict superset of ASCII. So even if you specify a UTF-8 encoding for a file and you write "abcd" to it, it will contain just those four bytes: they have the same physical encoding in ASCII as they do in UTF-8.
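To see this concretely, here is a minimal sketch (standard library only) comparing the encoded lengths of "abcd":
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    public static void main(String[] args) {
        String s = "abcd";
        System.out.println(s.getBytes(StandardCharsets.US_ASCII).length); // 4
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 4 -- identical bytes
        System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length); // 8 -- two bytes per char
    }
}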
Your method uses ObjectOutputStream, which actually has a significantly different encoding from either ASCII or UTF-8! If you read the Javadoc carefully, you'll see that if obj is a string that has already occurred in the stream, subsequent calls to writeObject cause a reference to the earlier string to be emitted instead, potentially resulting in many fewer bytes being written in the case of repeated strings.
If you're serious about understanding this, you really should spend a good amount of time reading about Unicode and character encoding systems. Wikipedia has an excellent article on Unicode as a start.

Yes, but a char is 16-bit Unicode only within the context of the Java runtime environment. If you wish to write it out using a 16-bit encoding, wrap the output in an OutputStreamWriter with a UTF-16 charset (a plain FileWriter would use the platform default encoding):
Writer outputStream = null;
try {
    // choose the encoding explicitly; FileWriter would silently use the platform default
    outputStream = new OutputStreamWriter(
            new FileOutputStream("myfilename.dat"), StandardCharsets.UTF_16);
    int c;
    // inputStream is assumed to be a Reader opened earlier
    while ((c = inputStream.read()) != -1) {
        outputStream.write(c);
    }
} finally {
    if (outputStream != null) {
        outputStream.close();
    }
}

If you look at how ObjectOutputStream serializes a String, you'll see that it calls DataOutput.writeUTF. And if you read that, you'll find strings are written as "modified UTF-8". The details are lengthy, but as long as you stick to 7-bit ASCII, yes, each character will take one byte. If you want the gory details, look at the extremely long Javadoc of DataOutput.writeUTF().
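A quick way to see the modified UTF-8 format in action (a minimal sketch; the two extra bytes are the unsigned-short length prefix that writeUTF emits):
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(bos);
        dos.writeUTF("abcd");
        // 2-byte length prefix + 4 single-byte ASCII chars = 6 bytes
        System.out.println(bos.size()); // prints 6
    }
}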

You may be interested to know there is a -XX:+UseCompressedStrings option in the Java 6 Update 21 performance release and later. This allows String to use a byte[] for strings which do not need a char[].
Despite the Java HotSpot VM Options guide suggesting it may be on by default, this may only be true for performance releases. It only appears to work for me if I turn it on explicitly.

So did you expect a 16 × 4 = 64-bit (8-byte) file, i.e. more than the UTF-8 or ASCII encoding needs? Once the data is written to a file, how the space is managed is down to the encoding used by the writer and the operating system; your code has no control over it beyond choosing that encoding.

Related

Is using String in Java to hold binary data wrong?

I need to pass binary data (read from a file) from Java to C++ (using JNI), so I have a C++ function that expects a string (because in C++ a string is just a char array).
I read my binary file in Java using the following code:
byte[] buffer = new byte[512];
FileInputStream in = new FileInputStream("some_file");
int rc = in.read(buffer);
while (rc != -1) {
    // rc should contain the number of bytes read in this operation.
    // do stuff...
    // next read
    rc = in.read(buffer);
    String s = new String(buffer);
    // here I call my C++ function and pass "s"
}
I'm worried about the line that creates the string. What actually happens when I put the buffer inside a String? It seems that when the data arrives at my C++ code, it is different from what I expect it to be.
Does the String constructor change the data somehow?
Strings are not char arrays at all. They are complex Unicode beasts with semantic interactions between code points, different binary encodings, and so on. This is true in every language. The only thing that's different about C++ is that its community hasn't finished complaining and started doing something about it yet.
In all languages, for binary data, use an explicit binary data type, like array of bytes.
A C++ char is a Java byte. Both are 8-bit. A Java char is a 16-bit value.
Ignore that C++ calls it char. Give it a Java byte[].
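A minimal sketch of the byte[]-based approach (nativeConsume is a hypothetical native method standing in for the questioner's C++ function; note that only the rc bytes actually read are passed, and nothing is routed through String):
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Arrays;

public class BinaryPass {
    // hypothetical JNI entry point taking raw bytes instead of a String
    private static native void nativeConsume(byte[] data);

    public static void main(String[] args) throws IOException {
        byte[] buffer = new byte[512];
        try (FileInputStream in = new FileInputStream("some_file")) {
            int rc;
            while ((rc = in.read(buffer)) != -1) {
                // pass only the bytes actually read in this operation
                nativeConsume(Arrays.copyOf(buffer, rc));
            }
        }
    }
}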

Data loss when writing bytes to a file

I'm working on a string compressor for a school assignment, and there's one bug that I can't seem to work out. The compressed data, represented by a byte array, is being written to a file using a FileWriter. The compression algorithm returns an input stream, so the data flows as such:
piped input stream
-> input stream reader
-> data stored in char buffer
-> data written to file with file writer.
Now, the bug is that with some very specific strings, the second-to-last byte in the byte array is written wrong, and it always has the same bit pattern: "11111100".
Every time it's this bit pattern, and always the second-to-last byte.
Here are some samples from the code:
InputStream compress(InputStream in) {
    //...
    //...
    PipedInputStream pin = new PipedInputStream();
    PipedOutputStream pout = new PipedOutputStream(pin);
    ObjectOutputStream oos = new ObjectOutputStream(pout);
    oos.writeObject(someobject);
    oos.flush();
    DataOutputStream dos = new DataOutputStream(pout);
    dos.writeFloat(//);
    dos.writeShort(//);
    dos.write(SomeBytes); // ---Here
    dos.flush();
    dos.close();
    return pin;
}
void write(char[] cbuf, int off, int len) {
    //....
    //....
    InputStreamReader s = new InputStreamReader(
            c.compress(new ByteArrayInputStream(str.getBytes())));
    s.read(charbuffer);
    out.write(charbuffer);
}
A string which triggers it is "hello and good evenin", for example.
I have tried iterating over the byte array and writing the bytes one by one; it didn't help.
It's also worth noting that when I tried writing to a file using the output stream inside the algorithm itself, it worked fine. This design was not my choice, by the way.
So I'm not really sure what I'm doing wrong here.
Considering that you're saying:
Now, the bug is that with some very specific strings, the second-to-last byte in the byte array is written wrong, and it always has the same bit pattern: "11111100".
You are taking a
binary stream (the compressed data)
-> reading it as chars
-> then writing it as chars.
And you are converting bytes to chars without clearly defining the encoding.
I'd say that the problem is that your InputStreamReader is translating some byte sequences in a way that you're not expecting.
Remember that in encodings like UTF-8, two or three bytes may become one single char.
It can't be a coincidence that the very byte pattern you pointed out (11111100) matches one of the UTF-8 lead-byte patterns (1111110x). Check the Wikipedia table on UTF-8 and you'll see that decoding arbitrary bytes as UTF-8 is destructive: if a byte starts with 1111110x, the next must start with 10xxxxxx, and sequences that don't obey this are replaced during decoding.
Meaning that if you use UTF-8 to convert
bytes1[] -> chars[] -> bytes2[]
in some cases bytes2 will be different from bytes1.
I recommend changing your code to remove those readers. Or specify ISO-8859-1 (Latin-1), which maps every byte value to exactly one char, to prevent the translations.
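Here is a minimal self-contained sketch of that lossy round trip (the byte values are chosen arbitrarily for illustration):
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTrip {
    public static void main(String[] args) {
        byte[] original = { (byte) 0xFC, 0x00, 0x41 }; // arbitrary binary data

        // UTF-8 round trip: invalid sequences are replaced, so data is lost
        byte[] viaUtf8 = new String(original, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);

        // ISO-8859-1 round trip: every byte maps to exactly one char, lossless
        byte[] viaLatin1 = new String(original, StandardCharsets.ISO_8859_1)
                .getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(original, viaUtf8));   // false
        System.out.println(Arrays.equals(original, viaLatin1)); // true
    }
}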
I solved this by encoding and decoding the bytes with Base64.
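For reference, a minimal sketch of that fix using the standard java.util.Base64 API (the byte values are illustrative):
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] compressedBytes = { (byte) 0xFC, 0x10, 0x41 }; // illustrative binary data
        String s = Base64.getEncoder().encodeToString(compressedBytes); // bytes -> safe ASCII
        byte[] back = Base64.getDecoder().decode(s);
        System.out.println(Arrays.equals(compressedBytes, back)); // true -- lossless
    }
}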

Writing to file

I am writing to a file for the first time, but the text in the file comes out completely wrong. Instead of numbers (which it is supposed to print), it prints unrecognizable characters. I can't seem to understand why this is happening. (In my code the print statement is inside a for loop, but this is the "shell" around the loop.)
Is there a logical explanation for this?
try {
    FileWriter outFile = new FileWriter("newFile.txt", true);
    outFile.write(number);
} catch (IOException e) {
    e.printStackTrace();
}
You're calling Writer.write(int):
Writes a single character. The character to be written is contained in the 16 low-order bits of the given integer value; the 16 high-order bits are ignored.
My guess is that's not what you want to do. If you want to write the text representation of a number, you need to do that explicitly:
outFile.write(String.valueOf(number));
(Personally I'd recommend using an OutputStreamWriter wrapped around a FileOutputStream, as then you can - and should - specify an encoding. FileWriter always uses the platform default encoding.)
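A minimal sketch of that suggestion (the file name comes from the question; UTF-8 is just an example encoding):
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteNumber {
    public static void main(String[] args) throws IOException {
        int number = 42; // illustrative value
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("newFile.txt", true), StandardCharsets.UTF_8)) {
            out.write(String.valueOf(number)); // writes the text "42", not a raw char
        }
    }
}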

Text to binary conversion and writing to file. help please

I'm trying to convert plain text data to binary format so that it becomes non-readable.
The data needs to be written to a file. The conversion appears to work when I print the result to the console window: I cannot read the original text.
However, when it is written to a file, the same original text appears.
How can I write this binary data to a file, without encrypting it, but making it non-readable?
This file later needs to be processed by another third-party tool which accepts binary data; that is why I cannot encrypt it using my own algorithm.
This is my code:
import java.io.*;

public class convert {
    public static void main(String args[]) {
        String s = "This is text";
        try {
            File file = new File("test.php");
            file.createNewFile();
            FileWriter fw = new FileWriter(file.getName(), true);
            BufferedWriter bw = new BufferedWriter(fw);
            byte[] b = s.getBytes();
            for (int x = 0; x < b.length; x++) {
                byte c = b[x];
                bw.write(c);
                System.out.println(c);
            }
            bw.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    } // main
} // class
Plain text is in itself a binary format, where each byte is interpreted as a character (or other variants, depending on the assumed or specified encoding, e.g. UTF-8, UTF-16, ...).
This means that if you write an ASCII character as a byte to a file, it will look identical and will be readable by anyone. In fact, a lot of binary file formats still save strings as plain bytes, which means they can be read when the file is opened in a hex editor.
What you will need to do to make it unreadable is to serialize it into some common format that is not readable. Normal serialization in Java is unfortunately still partly readable, but there is published advice on how to obscure it. You can also use ZIP or a similar compression algorithm.
Another, more hacky way is to shift all your character bytes by some known value. Each byte then becomes a different character, and the text is unreadable. This is essentially a very basic Caesar cipher, as sketched below.
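A minimal sketch of that byte-shifting idea (SHIFT is an arbitrary value chosen for illustration; this is obfuscation, not encryption):
import java.util.Arrays;

public class ShiftObfuscator {
    static final int SHIFT = 42; // any agreed-upon value works

    static byte[] obfuscate(byte[] data) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] + SHIFT); // wraps around on overflow
        }
        return out;
    }

    static byte[] deobfuscate(byte[] data) {
        byte[] out = new byte[data.length];
        for (int i = 0; i < data.length; i++) {
            out[i] = (byte) (data[i] - SHIFT);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] original = "This is text".getBytes();
        byte[] scrambled = obfuscate(original);
        System.out.println(new String(scrambled));                           // unreadable
        System.out.println(Arrays.equals(original, deobfuscate(scrambled))); // true
    }
}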
But in the end, all that matters is which formats your target program is able to read.
Well, you can use a RandomAccessFile for this. It becomes easy.
// modes are "r", "rw", "rws", and "rwd"
RandomAccessFile raf = new RandomAccessFile(path, "rw");
raf.seek(0); // this will set the pointer to the first position
raf.writeUTF(textToWrite); // writes a two-byte length prefix followed by modified UTF-8
// to read the file back, use readUTF(); seek lets you read from any part of the file
raf.seek(0);
System.out.println(raf.readUTF());

Wrap deflated data in gzip format

I think I'm missing something very simple. I have a byte array holding deflated data written into it using a Deflater:
deflate(outData, 0, BLOCK_SIZE, SYNC_FLUSH)
The reason I didn't just use GZIPOutputStream was because there were 4 threads (variable) that were each given a block of data, and each thread compressed its own block before storing that compressed data into a global byte array. If I used GZIPOutputStream it would mess up the format, because each little block would have a header and trailer and be its own gzip stream (I only want to compress each one).
So in the end, I've got this byte array, outData, that's holding all of my compressed data, but I'm not really sure how to wrap it. GZIPOutputStream writes from a buffer of uncompressed data, but this array is all set: it's already compressed, and I'm just hitting a wall trying to figure out how to get it into the right form.
EDIT: Ok, bad wording on my part. I'm writing it to output, not a file, so that it could be redirected if needed. A really simple example is that
cat file.txt | java Jzip | gzip -d | cmp file.txt
should return 0. The problem right now is if I write this byte array as is to output, it's just "raw" compressed data. I think gzip needs all this extra information.
If there's an alternative method, that would be fine too. The whole reason it's like this is because I needed to use multiple threads. Otherwise I would just call GZIPOutputStream.
DOUBLE EDIT: Since the comments provide a lot of good insight, another method is this: I just have a bunch of uncompressed blocks of data that were originally one long stream. If gzip can read concatenated streams, then I could take those blocks (keeping them in order), give each one to a thread that calls GZIPOutputStream on its own block, and then concatenate the results. In essence, each block now has a header, the compressed data, and a trailer. Would gzip recognize that if I concatenated them?
Example:
cat file.txt
Hello world! How are you? I'm ready to set fire to this assignment.
java Testcase < file.txt > file.txt.gz
So I accept it from input. Inside the program, the stream is split up into
"Hello world!" "How are you?" "I'm ready to set fire to this assignment" (they're not strings, it's just an array of bytes! this is just illustration)
So I've got these three blocks of bytes, all uncompressed. I give each of these blocks to a thread, which uses
public static class DGZIPOutputStream extends GZIPOutputStream {
    public DGZIPOutputStream(OutputStream out, boolean flush) throws IOException {
        super(out, flush);
    }

    public void setDictionary(byte[] b) {
        def.setDictionary(b);
    }

    public void updateCRC(byte[] input) {
        crc.update(input);
    }
}
As you can see, the only thing here is that I've set the flush mode to SYNC_FLUSH so I can get the alignment right, and added the ability to set the dictionary. If each thread were to use DGZIPOutputStream (which I've tested, and it works for one long continuous input), and I concatenated those three blocks (now each compressed with its own header and trailer), would gzip -d file.txt.gz work?
If that's too weird, ignore the dictionary completely. It doesn't really matter. I just added it in while I was at it.
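For what it's worth, here is a minimal sketch of that concatenation idea (names are illustrative; the gzip format allows multiple members in one file, and gzip -d decompresses them to the concatenation of the blocks; in a real version each member would be produced by its own thread):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class GzipConcat {
    // compress each block as its own gzip member and concatenate the members
    static byte[] gzipConcat(List<byte[]> blocks) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] block : blocks) {
            GZIPOutputStream gz = new GZIPOutputStream(out);
            gz.write(block);
            gz.close(); // ends this member; closing the ByteArrayOutputStream is a no-op
        }
        return out.toByteArray();
    }
}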
If you set nowrap true when using the Deflater (sic) constructor, then the result is raw deflate. Otherwise it's zlib, and you would have to strip the zlib header and trailer. For the rest of the answer, I am assuming nowrap is true.
To wrap a complete, terminated deflate stream to be a gzip stream, you need to prepend ten bytes:
"\x1f\x8b\x08\0\0\0\0\0\0\xff"
(sorry, C format; you'll need to convert to Java octal escapes). You also need to append the four-byte CRC in little-endian order, followed by the four-byte total uncompressed length modulo 2^32, also in little-endian order. Given what is available in the standard Java API, you'll need to compute the CRC serially; it can't be done in parallel. zlib does have a function to combine separate CRCs that are computed in parallel, but that is not exposed in Java.
Note that I said a complete, terminated deflate stream. It takes some care to make one of those with parallel deflate tasks. You would need to make n-1 unterminated deflate streams and one final terminated deflate stream and concatenate those. The last one is made normally. The other n-1 need to be terminated using sync flush in order to end each on a byte boundary and to not mark it as the end of the stream. To do that, you use deflate with the flush parameter SYNC_FLUSH. Don't use finish() on those.
For better compression, you can use setDictionary on each chunk with the last 32K of the previous chunk.
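A minimal sketch of the wrapping described above (names are illustrative; it assumes deflated holds one complete, terminated raw-deflate stream produced with new Deflater(level, true), and uncompressed is the original data):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.CRC32;

public class GzipWrap {
    static byte[] wrapAsGzip(byte[] deflated, byte[] uncompressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // 10-byte header: magic 1f 8b, CM=8 (deflate), no flags, zero mtime, XFL=0, OS=255
        out.write(new byte[] { 0x1f, (byte) 0x8b, 8, 0, 0, 0, 0, 0, 0, (byte) 0xff });
        out.write(deflated);
        CRC32 crc = new CRC32();
        crc.update(uncompressed);
        writeIntLE(out, (int) crc.getValue()); // CRC-32 of the uncompressed data
        writeIntLE(out, uncompressed.length);  // uncompressed length modulo 2^32
        return out.toByteArray();
    }

    static void writeIntLE(OutputStream out, int v) throws IOException {
        out.write(v);
        out.write(v >>> 8);
        out.write(v >>> 16);
        out.write(v >>> 24);
    }
}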
If you are looking to write outData to a file, you may write it as:
GZIPOutputStream outStream = new GZIPOutputStream(new FileOutputStream("fileName"));
outStream.write(outData, 0, outData.length);
outStream.close();
Or simply use java.io.FileOutputStream to write:
FileOutputStream outStream = new FileOutputStream("fileName");
outStream.write(outData, 0, outData.length);
outStream.close();
You just want to write a byte array, as is, to a file?
You can use Apache Commons IO:
FileUtils.writeByteArrayToFile(new File("yourFilename"), outData);
Or plain old Java:
BufferedOutputStream bs = null;
try {
    FileOutputStream fs = new FileOutputStream(new File("yourFilename"));
    bs = new BufferedOutputStream(fs);
    bs.write(outData);
} catch (Exception e) {
    // please handle this
} finally {
    if (bs != null) {
        try {
            bs.close();
        } catch (Exception e) {
            // please handle this
        }
    }
}
