I have written an algorithm that implements Huffman coding to compress text files. It takes a string as input and generates a string of bits as output. However, I am having trouble storing this binary data: it is stored as a string in which each bit is a character that consumes 2 bytes of memory. As a result, the output file is larger than the input, which makes the whole program worthless. How should I store this binary output so that each bit takes only one bit of storage?
P.S. I have tried using a BitSet, but that did not change the size of the output at all.
Once you have your result in the BitSet, you can call
BitSet.toByteArray() to save your data to a file, i.e.:
FileUtils.writeByteArrayToFile(new File(...), bitSet.toByteArray());
And BitSet.valueOf(byte[]) to read your data from file:
BitSet bitSet = BitSet.valueOf(FileUtils.readFileToByteArray(new File(...)));
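One caveat with this approach is that BitSet.toByteArray() only covers bits up to the highest set bit, so trailing zero bits of the Huffman output would be lost on a round trip. A minimal sketch of one way around that, assuming the encoder produces a String of '0' and '1' characters, is to store the bit count in front of the packed bytes. The BitPacker class and its method names are hypothetical, not part of any library:

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

// Hypothetical helper for illustration; not part of the original answer.
class BitPacker {

    // Packs a string of '0'/'1' characters as: 4-byte bit count, then the packed bits.
    static byte[] pack(String bits) {
        BitSet set = new BitSet(bits.length());
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                set.set(i);
            }
        }
        byte[] payload = set.toByteArray(); // trailing zero bits are not included
        return ByteBuffer.allocate(4 + payload.length)
                .putInt(bits.length())
                .put(payload)
                .array();
    }

    // Restores the original bit string, including any trailing zeros.
    static String unpack(byte[] packed) {
        ByteBuffer buf = ByteBuffer.wrap(packed);
        int bitCount = buf.getInt();
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        BitSet set = BitSet.valueOf(payload);
        StringBuilder sb = new StringBuilder(bitCount);
        for (int i = 0; i < bitCount; i++) {
            sb.append(set.get(i) ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bits = "1011001110001000"; // 16 encoded bits
        byte[] packed = pack(bits);
        System.out.println("bytes stored: " + packed.length);
        System.out.println("round trip ok: " + unpack(packed).equals(bits));
    }
}
```

The resulting byte array can be written with any byte-oriented API; writing the bit string itself as text is what inflates the file.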
I am mapping a file to memory and reading it back with Java's ByteBuffer. This proves to be a really fast way of reading large files. However, I can only read the values sequentially, meaning that once I read them with buffer.getInt() the buffer pointer moves to the next bytes. So if I want to use a value more than once, I have to store it in another variable:
int a = buffer.getInt();
I am noticing that this approach of copying a piece of memory to another is taking a long time (especially with a very large file) compared to just reading bytes. Is there a way I can re-read those bytes instead of copying them?
Just use position(int) to seek in ByteBuffer. Then you can read from anywhere.
ByteBuffer buffer=ByteBuffer.allocate(1000);
byte[] data=new byte[10];
buffer.position(100);
// read 10 bytes starting at position 100
buffer.get(data);
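ByteBuffer also has absolute accessors such as getInt(int index), which read at a given byte offset without moving the position at all, so the same value can be read as often as needed. A small sketch of that, using a heap buffer as a stand-in for the mapped file (MappedByteBuffer is a ByteBuffer subclass, so the same calls apply); the class and method names are made up for illustration:

```java
import java.nio.ByteBuffer;

// Hypothetical demo class; a heap buffer stands in for the memory-mapped file.
class AbsoluteReadDemo {

    // Reads the int stored at byteOffset without changing the buffer's position.
    static int peekInt(ByteBuffer buffer, int byteOffset) {
        return buffer.getInt(byteOffset);
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.putInt(0, 42); // absolute put: position stays at 0
        buffer.putInt(4, 7);

        int first = peekInt(buffer, 0);
        int again = peekInt(buffer, 0); // same bytes, read a second time
        System.out.println(first + " " + again + " position=" + buffer.position());
    }
}
```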
Alright, so we need to store a list of words and their respective position in a much bigger text. We've been asked if it's more efficient to save the position represented as text or represented as bits (data streams in Java).
I think that a bitwise representation is best, since the text "1024" takes up 4*8=32 bits but only 11 bits when stored as a binary number.
The follow-up question is whether the index should be saved in one or two files. Here I thought "perhaps you can't combine text and bitwise representation in one file?" and that's the reason you'd need two files?
So the question first and foremost is: can I store text information (the word) combined with bitwise information (its position) in one file?
Too vague in terms of what's really needed.
If you have up to a few million words + positions, don't even bother thinking about it. Store in whatever format is the simplest to implement; space would only be an issue if you need to send the data over a low-bandwidth network.
Then there is general-purpose data compression: by just wrapping your Input/OutputStreams in a Deflater or GZIP stream (already built into the JRE) you will get reasonably good compression (50% or more for text). That easily beats what you can quickly write yourself. If you need better compression, there is XZ for Java (implements LZMA compression), which is open source.
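As a rough illustration of the stream-wrapping idea, here is a sketch using java.util.zip.GZIPOutputStream and GZIPInputStream around in-memory streams. The GzipDemo class is hypothetical, and in real code the inner stream would be a FileOutputStream or FileInputStream instead:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; in-memory streams stand in for files.
class GzipDemo {

    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
            gzip.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray(); // complete only after the gzip stream is closed
    }

    static byte[] decompress(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder index = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            index.append("word").append(i % 50).append(' ').append(i * 37).append('\n');
        }
        byte[] raw = index.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(raw);
        System.out.println(raw.length + " bytes -> " + packed.length + " bytes");
    }
}
```

The actual ratio depends heavily on the data, so it is worth measuring on a real index before deciding on a custom format.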
If you need random access, you're on the wrong track, you will want to carefully design the data layout for the access patterns and storage should be only of tertiary concern.
The number 1024 would take at least 2-4 bytes (so 16-32 bits), as you need to know where the number starts and where it ends, and so it must have a fixed size. If your positions are very big, like 124058936, you would need to use 4 bytes per number (which would still be better than 9 bytes as a string representation).
Using binary files, you'll also need a way to know where each string starts and ends. You can do this by storing a byte before it that holds its length, and reading the string like this:
int len = in.readUnsignedByte(); // in is a DataInputStream / RandomAccessFile (a plain FileInputStream has no readByte())
byte[] arr = new byte[len]; // use len * 2 if the string is encoded in 16 bits
in.readFully(arr); // unlike read(), readFully() blocks until the whole array is filled
String yourString = new String(arr, StandardCharsets.US_ASCII); // java.nio.charset.StandardCharsets
The other possibility would be to terminate your string with a null character (00), but you would need to write your own reader for that, as no readers support it by default (AFAIK).
Now, is it really worth storing it as binary data? That really depends on how big your positions are (because the strings, if in the text version are separated from their position with a space, would take the same amount of bytes).
My recommendation is that you use the text version, as it will probably be easier to parse and more readable.
About using one or two files, it doesn't really matter. You can combine text and binary in the same file, and it would take the same space (though splitting it into two separate files will always take a bit more space, and it might make it messier to edit).
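To show that text and binary values can share one file, here is a sketch using DataOutputStream and DataInputStream: writeUTF stores each word with a 2-byte length prefix, and writeInt stores each position in a fixed 4 bytes. The WordIndexDemo class and the record layout are assumptions for illustration, and in-memory streams stand in for the file:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; the layout (count, then word/position pairs) is an assumption.
class WordIndexDemo {

    static byte[] write(String[] words, int[] positions) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeInt(words.length);          // number of entries
            for (int i = 0; i < words.length; i++) {
                out.writeUTF(words[i]);          // 2-byte length prefix + modified UTF-8 bytes
                out.writeInt(positions[i]);      // always 4 bytes, whatever the value
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Reads the entries back and renders them as "word position" lines.
    static String read(byte[] data) {
        StringBuilder text = new StringBuilder();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String word = in.readUTF();
                int position = in.readInt();
                text.append(word).append(' ').append(position).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] data = write(new String[] {"hello", "tree"}, new int[] {1024, 124058936});
        System.out.println(data.length + " bytes");
        System.out.print(read(data));
    }
}
```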
I want to insert and select images from SQL Server using JDBC. I am confused about whether BLOB and byte are the same thing or different. I have used Blob in my code, and the application loads slowly because it has to select the images stored as Blob and convert them pixel by pixel. I want to use a byte array, but I don't know whether they are the same or different. My main aim is to load the images faster.
Thank you
Before going further, it helps to recall the basic concepts of bit, byte, binary and BLOB.
Bit: Abbreviation of binary digit. It is the smallest storage unit. Bits can take values of 0 or 1.
Byte: The second smallest storage unit in common use (the nibble is skipped here since it is not a very common term). It consists of eight bits.
Binary: A numbering scheme in which each digit of a number can take a value of 0 or 1.
BLOB: Set of binary data stored in a database. Also, type of a column which stores binary data inside.
To sum up the definitions: binary is a representation made of bits, a byte is a group of eight bits, and a BLOB is a sequence of bytes stored in a database.
To make it more concrete, we can observe results with the code below.
import java.nio.charset.StandardCharsets;

public class TestByteAndBinary {
    public static void main(String[] args) {
        String s = "test"; // a string, i.e. a series of chars
        System.out.println(s);
        System.out.println();

        // Each of these ASCII characters encodes to exactly 1 byte, so the array has 4 elements
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        for (byte b : bytes) {
            System.out.println(b);
        }
        System.out.println();

        for (byte b : bytes) {
            // Print each element in binary, masked to 8 bits and left-padded with zeros
            String c = String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
            System.out.println(c);
        }
    }
}
Output:
$javac TestByteAndBinary.java
$java -Xmx128M -Xms16M TestByteAndBinary
test
116
101
115
116
01110100
01100101
01110011
01110100
Let's go back to the question:
If you really want to store an image inside a database, you have to use a BLOB type (in SQL Server, that is a VARBINARY(MAX) column).
BUT! It is not the best practice.
Because databases are designed to store data and filesystems are designed to store files.
Reading an image from disk is simple, but reading an image from the database takes more time (querying the data, transforming it to an array and vice versa).
While an image is being read, the database suffers from lower performance, since it is not a simple textual or numerical read.
An image file doesn't benefit from the characteristic features of a database (like indexing).
At this point, it is best practice to store that image on a server and store its path on the database.
As far as I can see on enterprise-level projects, images are very rarely stored inside the database. In the cases where they were, the images had to be stored encrypted because they contained very sensitive data. In my humble opinion, even in that situation the data should not have been stored in a database.
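A minimal sketch of the "file on disk, path in the database" approach described above, using java.nio.file. The ImageFileStore class, its method names and the directory layout are hypothetical; the returned path string is what would be stored in an ordinary text column:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; the database part is left out on purpose.
class ImageFileStore {

    // Writes the image bytes to disk and returns the path to store in the database.
    static String save(Path directory, String fileName, byte[] imageBytes) {
        try {
            Files.createDirectories(directory);
            Path target = directory.resolve(fileName);
            Files.write(target, imageBytes);
            return target.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Loads the image bytes using a path previously returned by save().
    static byte[] load(String storedPath) {
        try {
            return Files.readAllBytes(Paths.get(storedPath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = Paths.get(System.getProperty("java.io.tmpdir"), "image-store-demo");
        byte[] fakeImage = {(byte) 0x89, 'P', 'N', 'G'}; // placeholder bytes, not a real image
        String storedPath = save(dir, "sample.png", fakeImage);
        System.out.println(storedPath + " -> " + load(storedPath).length + " bytes");
    }
}
```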
BLOB simply means Binary Large Object, and it's the way a database stores a byte array.
Hope this is simple and answers your question.
I'm currently creating a web application that requires passwords to be encrypted and stored in a database. I found a guide that encrypts passwords using PBKDF2WithHmacSHA1.
In the example provided the getEncryptedPassword method returns a byte array.
Are there any advantages in doing Base64 encoding the result?
Any Disadvantages?
The byte[] array is the smallest mechanism for storing the value (storage-space wise). If you have lots of these values, it may make sense to store them as bytes. Depending on where you store the result, the format may make a difference too. Most databases will accommodate byte[] values fairly well, but it can be cumbersome (depending on the database). Stores like text files and XML documents, etc. obviously will struggle with the byte[] array.
In most circumstances I feel there are two formats that make sense, Hexadecimal representation, or byte[]. I seldom think that the advantages of Base64 for short values (less than 32 characters) are worth it (for larger items then sure, use base64, and there's a fantastic library for it too).
This is obviously all subjective.....
Converting values to hexadecimal is quite easy: see How to convert a byte array to a hex string in Java?
Hex output is convenient, and easier to manage than Base64 which has a more complicated algorithm to build, and is thus slightly slower.....
Assuming a reasonable database there is no advantage, since it's just an encoding scheme. There is a size increase as a consequence of base 64 encoding, which is a disadvantage. If your database reliably stores 8-bit bytes, just store the hash in the database.
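To make the size trade-off concrete, here is a sketch that encodes the same bytes as hexadecimal and as Base64 (using java.util.Base64, available since Java 8). It assumes a 20-byte (160-bit) derived key, which may not match the key length used in the guide, and the EncodingSizeDemo class is hypothetical:

```java
import java.util.Base64;

// Hypothetical demo class comparing the two text encodings of a hash.
class EncodingSizeDemo {

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF)); // mask so negative bytes print as two digits
        }
        return sb.toString();
    }

    static String toBase64(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        byte[] hash = new byte[20]; // stand-in for a derived key; assumed length of 160 bits
        for (int i = 0; i < hash.length; i++) {
            hash[i] = (byte) (i * 13);
        }
        System.out.println("raw bytes: " + hash.length);
        System.out.println("hex chars: " + toHex(hash).length());
        System.out.println("base64 chars: " + toBase64(hash).length());
    }
}
```

Hex doubles the size, while Base64 grows it by roughly a third, so the raw byte[] stays the most compact option whenever the column type supports it.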
While writing a message on the wire, I want to write the number of bytes in the data followed by the data.
Message format:
{num of bytes in data}{data}
I can do this by writing the data to a temporary ByteArrayOutputStream, obtaining the byte array size from it, and then writing the size followed by the byte array. This approach involves a lot of overhead, viz. unnecessary creation of temporary byte arrays, creation of temporary streams, etc.
Do we have a better (considering both CPU and garbage creation) way of achieving this?
A typical approach would be to introduce a reusable ByteBuffer. For example:
ByteBuffer out = ...
int oldPos = out.position(); // Remember current position.
out.position(oldPos + 2); // Leave space for message length (unsigned short)
out.putInt(...); // Write out data.
// Finally prepend buffer with number of bytes.
out.putShort(oldPos, (short)(out.position() - (oldPos + 2)));
Once the buffer is populated you could then send the data over the wire using SocketChannel.write(ByteBuffer) (assuming you are using NIO).
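The same back-filling idea as a self-contained sketch: the body is written by a callback so its size is not known up front, and the 2-byte length is filled in afterwards with an absolute putShort. The LengthPrefixDemo class, the Consumer-based signature and the unsigned-short limit are assumptions for illustration:

```java
import java.nio.ByteBuffer;
import java.util.function.Consumer;

// Hypothetical demo class; frame layout is {2-byte length}{data}.
class LengthPrefixDemo {

    // Reserves 2 bytes, lets bodyWriter append the data, then back-fills the length.
    static void writeFrame(ByteBuffer out, Consumer<ByteBuffer> bodyWriter) {
        int lengthPos = out.position();
        out.position(lengthPos + 2);               // leave room for the length
        bodyWriter.accept(out);                    // body of unknown size
        int bodyLength = out.position() - (lengthPos + 2);
        if (bodyLength > 0xFFFF) {
            throw new IllegalArgumentException("body too large for a 2-byte length");
        }
        out.putShort(lengthPos, (short) bodyLength); // absolute put: position is unchanged
    }

    // Reads one frame and returns its body.
    static byte[] readFrame(ByteBuffer in) {
        int length = in.getShort() & 0xFFFF;       // treat the length as unsigned
        byte[] body = new byte[length];
        in.get(body);
        return body;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(64); // would be reused across messages
        writeFrame(buffer, b -> b.putInt(7).put((byte) 1));
        buffer.flip();                               // switch from writing to reading
        System.out.println("body bytes: " + readFrame(buffer).length);
    }
}
```

No temporary array or stream is created per message; calling clear() on the buffer makes it ready for the next batch of frames.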
Here’s what I would do, in order of preference.
Don’t bother about memory consumption and stuff. Most likely this already is the optimal solution, unless it takes so much time to create the byte representation of your data that creating it twice has a noticeable impact.
(Actually this would be more like #37 on my list, with #2 to #36 being empty.) Include a method in all your data objects that can calculate the size of the byte representation and takes fewer resources than it would to create the byte representation.