I need to store a boolean array with 80,000 items in a file. I don't care how long saving takes; I'm only interested in the loading time of the array.
I didn't try to store it with DataOutputStream because that requires a separate write for each value.
I tried three approaches:
serialize the boolean array
use a BitSet instead of a boolean array and serialize it
convert the boolean array into a byte array (1 for true, 0 for false) and write it with a FileChannel using a ByteBuffer
To test reading the files produced by these approaches, I ran each approach 1,000 times in a loop. The results looked like this:
deserialization of the boolean array takes 574 ms
deserialization of the BitSet - 379 ms
getting the byte array from the FileChannel via a MappedByteBuffer - 170 ms
The first and second approaches are too slow; the third, perhaps, is not really an approach at all.
Perhaps there is a better way to accomplish this, so I need your advice.
EDIT
With each method run only once, the times are
13.8 ms
8.71 ms
6.46 ms
respectively.
What about writing a byte for each boolean and developing a custom parser? That would probably be one of the fastest methods.
If you want to save space, you could also pack 8 booleans into one byte, but that requires some bit-shifting operations (a sketch of that packed variant follows after the timing note below).
Here is a short example code:
public void save() throws IOException
{
    boolean[] testData = new boolean[80000];
    for (int X = 0; X < testData.length; X++)
    {
        testData[X] = Math.random() > 0.5;
    }
    FileOutputStream stream = new FileOutputStream(new File("test.bin"));
    for (boolean item : testData)
    {
        stream.write(item ? 1 : 0);
    }
    stream.close();
}
public boolean[] load() throws IOException
{
    long start = System.nanoTime();
    File file = new File("test.bin");
    FileInputStream inputStream = new FileInputStream(file);
    int fileLength = (int) file.length();
    byte[] data = new byte[fileLength];
    boolean[] output = new boolean[fileLength];
    // read() may return fewer bytes than requested, so loop until the array is full
    int off = 0;
    while (off < fileLength)
    {
        int read = inputStream.read(data, off, fileLength - off);
        if (read < 0)
            break;
        off += read;
    }
    inputStream.close();
    for (int X = 0; X < data.length; X++)
    {
        output[X] = data[X] != 0;
    }
    long end = System.nanoTime() - start;
    System.out.println("Time: " + end);
    return output;
}
It takes about 2 ms to load 80,000 booleans.
Tested with JDK 1.8.0_45
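If you do want the packed variant mentioned above (8 booleans per byte), a minimal sketch could look like this; pack() before writing and unpack() after reading. The helper names are made up for the example:

public static byte[] pack(boolean[] src)
{
    byte[] packed = new byte[(src.length + 7) / 8];
    for (int i = 0; i < src.length; i++)
    {
        if (src[i])
        {
            packed[i / 8] |= 1 << (i % 8); // set bit (i % 8) in byte (i / 8)
        }
    }
    return packed;
}

public static boolean[] unpack(byte[] packed, int count)
{
    boolean[] out = new boolean[count];
    for (int i = 0; i < count; i++)
    {
        out[i] = (packed[i / 8] & (1 << (i % 8))) != 0;
    }
    return out;
}

This shrinks the file from 80,000 bytes to 10,000 bytes; whether the extra bit twiddling pays off at load time is something you would have to measure.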
I had a very similar use case where I wanted to serialise/deserialise a very large boolean array.
I implemented something like this:
First, I converted the boolean array to an integer array, simply to pack multiple boolean values together (this makes storage more efficient and there are no issues with bit padding).
This means we now need wrapper methods that return true/false for a given index:
// Assumed fields: container holds the packed bits, buckets is the number of bits per int (32).
private int[] container;
private static final int buckets = 32;

private boolean get(int index) {
    int holderIndex = index / buckets;   // which int holds this bit
    int internalIndex = index % buckets; // the bit's position within that int
    return 0 != (container[holderIndex] & (1 << internalIndex));
}
and
private void set(int index) {
    int holderIndex = index / buckets;
    int internalIndex = index % buckets;
    int value = container[holderIndex];
    int newValue = value | (1 << internalIndex);
    container[holderIndex] = newValue;
}
To serialise and deserialise, you can then write this packed int[] directly to a file as a byte stream.
My source code, for reference.
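For illustration only (the actual implementation is in the source referenced above), a minimal sketch of serialising the packed int[] to a file and reading it back, using DataOutputStream/DataInputStream and a hypothetical file:

import java.io.*;

// Write: element count first, then the raw ints.
static void save(int[] container, File file) throws IOException {
    try (DataOutputStream out = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream(file)))) {
        out.writeInt(container.length);
        for (int word : container) {
            out.writeInt(word);
        }
    }
}

// Read: recover the count, then the ints.
static int[] load(File file) throws IOException {
    try (DataInputStream in = new DataInputStream(
            new BufferedInputStream(new FileInputStream(file)))) {
        int[] container = new int[in.readInt()];
        for (int i = 0; i < container.length; i++) {
            container[i] = in.readInt();
        }
        return container;
    }
}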
I am working on a BitBuffer that will take x bits from a ByteBuffer and return them as an int, long, etc., but I seem to be having a lot of problems.
I've tried loading a long at a time and using bit shifting, but the difficulty comes from rolling over from one long into the next. I am wondering if there's just a better way. Does anyone have any suggestions?
public class BitBuffer
{
    final private ByteBuffer bb;

    public BitBuffer(byte[] bytes)
    {
        this.bb = ByteBuffer.wrap(bytes);
    }

    public int takeInt(int bits)
    {
        int bytes = toBytes(bits); // toBytes: bytes needed to hold 'bits' bits (helper not shown)
        if (bytes > 4) throw new RuntimeException("Too many bits requested");
        int i = 0;
        // take bits from bb and fill them into an int
        return i;
    }
}
More specifically, I am trying to take x bits from the buffer and return them as an int (the minimal case). I can read whole bytes from the buffer, but let's say I only want to take the first 4 bits instead.
Example:
If my buffer is filled with "101100001111", if I run these in order:
takeInt(4) // should return 11 (1011)
takeInt(2) // should return 0 (00)
takeInt(2) // should return 0 (00)
takeInt(1) // should return 1 (1)
takeInt(3) // should return 7 (111)
I would like to use something like this for bit packed encoded data where an integer can be stored in just a few bits of a byte.
The BitSet and ByteBuffer ideas were a bit too difficult to control, so instead I went with a binary-string approach that takes a whole lot of the headache out of managing an intermediate buffer of bits.
public class BitBuffer
{
    final private String bin;
    private int start;

    public BitBuffer(byte[] bytes)
    {
        this.bin = toBinaryString(bytes);
        this.start = 0;
    }

    public int takeInt(int nbits)
    {
        // note: callers must keep nbits <= 31 and not read past the end of the string
        String bits = bin.substring(start, start += nbits);
        return Integer.parseInt(bits, 2);
    }

    // Expand each byte into its 8-character binary form, most significant bit first.
    private static String toBinaryString(byte[] bytes)
    {
        StringBuilder sb = new StringBuilder(bytes.length * 8);
        for (byte b : bytes)
        {
            String s = Integer.toBinaryString(b & 0xFF);
            for (int pad = 8 - s.length(); pad > 0; pad--)
                sb.append('0');
            sb.append(s);
        }
        return sb.toString();
    }
}
Out of everything I've tried this was the cleanest and easiest approach, but I am open to suggestions!
You can convert the ByteBuffer into a BitSet and then you'll have continuous access to the bits:
public class BitBuffer
{
    final private BitSet bs;

    public BitBuffer(byte[] bytes)
    {
        this.bs = BitSet.valueOf(bytes);
    }

    public int takeInt(int bits)
    {
        int bytes = toBytes(bits); // toBytes: bytes needed to hold 'bits' bits (helper not shown)
        if (bytes > 4) throw new RuntimeException("Too many bits requested");
        int i = 0;
        // take bits from bs and fill them into an int
        return i;
    }
}
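A minimal sketch of one way to fill in that loop (my illustration, not the answerer's code), assuming bits are consumed most-significant-bit first as in the question's example. Note that BitSet.valueOf(byte[]) treats bit 0 as the least significant bit of the first byte, so the in-byte offset has to be flipped:

import java.util.BitSet;

public class BitBuffer
{
    final private BitSet bs;
    private int pos = 0;          // logical bit position, most-significant-bit first
    private final int nbitsTotal;

    public BitBuffer(byte[] bytes)
    {
        this.bs = BitSet.valueOf(bytes);
        this.nbitsTotal = bytes.length * 8;
    }

    public int takeInt(int bits)
    {
        if (bits > 32 || pos + bits > nbitsTotal)
            throw new RuntimeException("Too many bits requested");
        int result = 0;
        for (int i = 0; i < bits; i++, pos++)
        {
            // BitSet.valueOf is little-endian within each byte, so flip the
            // in-byte offset to read the stream most-significant-bit first
            int index = (pos / 8) * 8 + (7 - pos % 8);
            result = (result << 1) | (bs.get(index) ? 1 : 0);
        }
        return result;
    }
}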
I have a file of around 4-5 GB (nearly a billion lines). From every line of the file, I have to parse an array of integers and some additional integer info and update my custom data structure. My class to hold this information looks like:
class Holder {
    private int[][] arr = new int[1000000000][5]; // assuming that max array size is 5
    private int[] meta = new int[1000000000];
}
A sample line from the file looks like
(1_23_4_55) 99
Every index in arr and meta corresponds to the line number in the file. From the above line, I extract the array of integers first and then the meta information. In that case:
--pseudo_code--
arr[line_num] = new int[]{1, 23, 4, 55}
meta[line_num]=99
Right now, I am using a BufferedReader object and its readLine method to read each line, and character-level operations to parse the integer array and meta information from each line and populate the Holder instance. But it takes almost half an hour to complete this entire operation.
I used both Java Serialization and Externalizable (writing meta and arr) to serialize and deserialize this HUGE Holder instance. With both of them, the time to serialize is almost half an hour, and deserialization also takes almost half an hour.
I would appreciate your suggestions on dealing with this kind of problem and would definitely love to hear your side of the story, if any.
P.S. Main memory is not a problem: I have almost 50 GB of RAM on my machine. I have also increased the BufferedReader size to 40 MB (of course, I could increase this up to 100 MB, considering disk access runs at approx. 100 MB/sec). CPU cores are not a problem either.
EDIT I
The code that I am using for this task is provided below (after anonymizing a few details):
public class BigFileParser {

    private int parsePositiveInt(final String s) {
        int num = 0;
        int sign = -1;
        final int len = s.length();
        final char ch = s.charAt(0);
        if (ch == '-')
            sign = 1;
        else
            num = '0' - ch;
        int i = 1;
        while (i < len)
            num = num * 10 + '0' - s.charAt(i++);
        return sign * num;
    }

    private void loadBigFile() {
        long startTime = System.nanoTime();
        Holder holder = new Holder();
        String line;
        try {
            Reader fReader = new FileReader("/path/to/BIG/file");
            // 40 MB buffer size
            BufferedReader bufferedReader = new BufferedReader(fReader, 40960);
            String tempTerm;
            int i, meta, ascii, len;
            boolean consumeNextInteger;
            // GNU Trove primitive int array list
            TIntArrayList arr;
            char c;
            while ((line = bufferedReader.readLine()) != null) {
                consumeNextInteger = true;
                tempTerm = "";
                arr = new TIntArrayList(5);
                for (i = 0, len = line.length(); i < len; i++) {
                    c = line.charAt(i);
                    ascii = c - 0;
                    // 95 is the ascii value of the _ char
                    if (consumeNextInteger && ascii == 95) {
                        arr.add(parsePositiveInt(tempTerm));
                        tempTerm = "";
                    } else if (ascii >= 48 && ascii <= 57) { // '0' - '9'
                        tempTerm += c;
                    } else if (ascii == 9) { // '\t'
                        arr.add(parsePositiveInt(tempTerm));
                        consumeNextInteger = false;
                        tempTerm = "";
                    }
                }
                meta = parsePositiveInt(tempTerm);
                holder.update(arr, meta);
            }
            bufferedReader.close();
            long endTime = System.nanoTime();
            System.out.println("#time -> " + (endTime - startTime) * 1.0
                    / 1000000000 + " seconds");
        } catch (IOException exp) {
            exp.printStackTrace();
        }
    }
}
public class Holder {
    private static final int SIZE = 500000000;
    private TIntArrayList[] arrs;
    private TIntArrayList metas;
    private int idx;

    public Holder() {
        arrs = new TIntArrayList[SIZE];
        metas = new TIntArrayList(SIZE);
        idx = 0;
    }

    public void update(TIntArrayList arr, int meta) {
        arrs[idx] = arr;
        metas.add(meta);
        idx++;
    }
}
It sounds like the time taken for file I/O is the main limiting factor, given that serialization (binary format) and your own custom format take about the same time.
Therefore, the best thing you can do is reduce the size of the file. If your numbers are generally small, then you could get a huge boost from using Google protocol buffers, which generally encode small integers in one or two bytes.
Or, if you know that all your numbers are in the 0-255 range, you could use a byte[] rather than an int[] and cut the size (and hence load time) to a quarter of what it is now (assuming you go back to serialization or just write to a ByteChannel).
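A minimal sketch of the byte[]-plus-channel idea (the file name and class are made up for the example):

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class RawDump {

    static void save(byte[] data) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("data.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf); // bulk write, no per-element overhead
            }
        }
    }

    static byte[] load() throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate((int) ch.size());
            while (buf.hasRemaining() && ch.read(buf) >= 0) {
                // keep reading until the buffer is full or EOF
            }
            return buf.array();
        }
    }
}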
It simply can't take that long. You're working with some 6e9 ints, which means 24 GB. Writing 24 GB to disk takes some time, but nothing like half an hour.
I'd put all the data in a single one-dimensional array and access it via methods like int getArr(int row, int col) which transform row and col into a single index. Depending on how the array gets accessed (usually row-wise or usually column-wise), this index would be computed as N * row + col or N * col + row to maximize locality. I'd also store meta in the same array.
Writing a single huge int[] into memory should be pretty fast, certainly nothing like half an hour.
Because of the data volume, the above doesn't work directly, as you can't have an array with 6e9 entries. But you can use a couple of big arrays instead, and all of the above still applies (compute a long index from row and col and split it into two ints for accessing the 2D array).
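A sketch of that layout (names and chunking constants are illustrative, not from the question):

// Pack a row-major 2D layout into a few big backing arrays,
// since a single Java array cannot hold 6e9 ints.
public class FlatHolder {
    private static final int COLS = 6;         // 5 array slots + 1 meta slot per line
    private static final int CHUNK = 1 << 28;  // entries per backing array
    private final int[][] chunks;

    public FlatHolder(long rows) {
        long total = rows * COLS;
        chunks = new int[(int) ((total + CHUNK - 1) / CHUNK)][CHUNK];
    }

    private long flatIndex(long row, int col) {
        return row * COLS + col;               // row-major for row-wise locality
    }

    public int get(long row, int col) {
        long idx = flatIndex(row, col);
        return chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)];
    }

    public void set(long row, int col, int value) {
        long idx = flatIndex(row, col);
        chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)] = value;
    }
}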
Make sure you aren't swapping. Swapping is the most probable reason for the slow speed I can think of.
There are several alternative Java file I/O libraries. This article is a little old, but it gives an overview that's still generally valid. He's reading about 300 MB per second with a 6-year-old Mac. So for 4 GB you have under 15 seconds of read time. Of course, my experience is that Mac I/O channels are very good. YMMV if you have a cheap PC.
Note that there is no advantage to a buffer size above 4 KB or so. In fact, you're more likely to cause thrashing with a big buffer, so don't do that.
The implication is that parsing characters into the data you need is the bottleneck.
I have found in other apps that reading into a block of bytes and writing C-like code to extract what I need goes faster than the built-in Java mechanisms like split and regular expressions.
If that still isn't fast enough, you'd have to fall back to a native C extension.
If you randomly pause it, you will probably see that the bulk of the time goes into parsing the integers and/or all the new-ing, as in new int[]{1, 23, 4, 55}. You should be able to allocate the memory just once and stick numbers into it at better than I/O speed if you code it carefully.
But there's another way - why is the file in ASCII?
If it were in binary, you could just slurp it up.
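For example, with a hypothetical fixed binary layout of six 4-byte ints per line (5 array values plus the meta value), loading reduces to straight sequential reads:

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical binary layout: for each line, 5 array ints followed by 1 meta int.
static void slurp(String path, int[][] arr, int[] meta) throws IOException {
    try (DataInputStream in = new DataInputStream(
            new BufferedInputStream(new FileInputStream(path), 1 << 16))) {
        for (int line = 0; line < meta.length; line++) {
            for (int col = 0; col < 5; col++) {
                arr[line][col] = in.readInt();
            }
            meta[line] = in.readInt();
        }
    }
}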
I am currently having heavy performance issues with an application I'm developing for natural language processing. Basically, given texts, it gathers various data and does a bit of number crunching.
For every sentence, it does EXACTLY the same thing. The algorithms applied to gather the statistics do not evolve with previously read data and therefore stay the same.
The issue is that processing time does not grow linearly at all: 1 minute for 10k sentences, 1 hour for 100k, and days for 1M...
I've tried everything I could, from re-implementing basic data structures to object pooling to recycling instances. The behavior doesn't change. I get a non-linear increase in time that seems impossible to justify by a few more hashmap collisions, nor by IO waiting, nor by anything else! Java starts to get sluggish as the data grows, and I feel totally helpless.
If you want an example, just try the following: count the number of occurrences of each word in a big file. Some code is shown below. Doing this takes me 3 seconds over 100k sentences and 326 seconds over 1.6M, so a factor of 110 instead of 16. As the data grows, it only gets worse...
Here is a code sample:
Note that I compare strings by reference (for efficiency reasons); this is possible thanks to the String.intern() method, which returns a unique reference per string. The map is never re-hashed during the whole process for the numbers given above.
public class DataGathering
{
    SimpleRefCounter<String> counts = new SimpleRefCounter<String>(1000000);

    private void makeCounts(String path) throws IOException
    {
        BufferedReader file_src = new BufferedReader(new FileReader(path));
        String line_src;
        int n = 0;
        while (file_src.ready())
        {
            n++;
            if (n % 10000 == 0)
                System.out.print(".");
            if (n % 100000 == 0)
                System.out.println("");
            line_src = file_src.readLine();
            String[] src_tokens = line_src.split("[ ,.;:?!'\"]");
            for (int i = 0; i < src_tokens.length; i++)
            {
                String src = src_tokens[i].intern();
                counts.bump(src);
            }
        }
        file_src.close();
    }

    public static void main(String[] args) throws IOException
    {
        String path = "some_big_file.txt";
        long timestamp = System.currentTimeMillis();
        DataGathering dg = new DataGathering();
        dg.makeCounts(path);
        long time = (System.currentTimeMillis() - timestamp) / 1000;
        System.out.println("\nElapsed time: " + time + "s.");
    }
}
public class SimpleRefCounter<K>
{
    static final double GROW_FACTOR = 2;
    static final double LOAD_FACTOR = 0.5;

    private int capacity;
    private Object[] keys;
    private int[] counts;
    // bookkeeping: number of distinct keys and sum of all counts
    private int key_count;
    private int total;

    public SimpleRefCounter()
    {
        this(1000);
    }

    public SimpleRefCounter(int capacity)
    {
        this.capacity = capacity;
        keys = new Object[capacity];
        counts = new int[capacity];
    }

    public synchronized int increase(K key, int n)
    {
        int id = System.identityHashCode(key) % capacity;
        while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
            id = (id + 1) % capacity;
        if (keys[id] == null)
        {
            key_count++;
            keys[id] = key;
            if (key_count > LOAD_FACTOR * capacity)
            {
                resize((int) (GROW_FACTOR * capacity));
            }
        }
        counts[id] += n;
        total += n;
        return counts[id];
    }

    public synchronized void resize(int capacity)
    {
        System.out.println("Resizing counters: " + this);
        this.capacity = capacity;
        Object[] new_keys = new Object[capacity];
        int[] new_counts = new int[capacity];
        for (int i = 0; i < keys.length; i++)
        {
            Object key = keys[i];
            int count = counts[i];
            int id = System.identityHashCode(key) % capacity;
            while (new_keys[id] != null && new_keys[id] != key) // if it's occupied, let's move to the next one!
                id = (id + 1) % capacity;
            new_keys[id] = key;
            new_counts[id] = count;
        }
        this.keys = new_keys;
        this.counts = new_counts;
    }

    public int bump(K key)
    {
        return increase(key, 1);
    }

    public int get(K key)
    {
        int id = System.identityHashCode(key) % capacity;
        while (keys[id] != null && keys[id] != key) // if it's occupied, let's move to the next one!
            id = (id + 1) % capacity;
        if (keys[id] == null)
            return 0;
        else
            return counts[id];
    }
}
Any explanations? Ideas? Suggestions?
...and, as said at the beginning, this is not about this toy example in particular but about the more general case. The same exploding behavior occurs for no apparent reason in the more complex, larger program.
Rather than feeling helpless, use a profiler! That would tell you exactly where in your code all this time is spent.
Bursting the processor cache and thrashing the Translation Lookaside Buffer (TLB) may be the problem.
For String.intern you might want to do your own single-threaded implementation.
However, I'm placing my bets on the relatively bad hash values from System.identityHashCode. It clearly isn't using the top bit, as you don't appear to get ArrayIndexOutOfBoundsExceptions. I suggest replacing that with String.hashCode.
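A minimal sketch of that swap (my illustration, not the answerer's code): keep the identity comparison for interned keys, but derive the slot from the string's own hash, masking the sign bit so the index stays non-negative:

// In increase(), get() and resize(), replace
//     int id = System.identityHashCode(key) % capacity;
// with something along these lines:
int id = (key.hashCode() & 0x7fffffff) % capacity;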
String[] src_tokens = line_src.split("[ ,.;:?!'\"]");
Just an idea -- you are creating a new Pattern object for every line here (look at the String.split() implementation). I wonder if this is also contributing to a ton of objects that need to be garbage collected?
I would create the Pattern once, probably as a static field:
final private static Pattern TOKEN_PATTERN = Pattern.compile("[ ,.;:?!'\"]");
And then change the split line do this:
String[] src_tokens = TOKEN_PATTERN.split(line_src);
Or, if you don't want to create it as a static field, at least create it only once as a local variable at the beginning of the method, before the while loop.
In get, when you search for a nonexistent key, search time is proportional to the size of the set of keys.
My advice: if you want a HashMap, just use a HashMap. They got it right for you.
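For comparison, a sketch of the same word count using the JDK's HashMap (illustrative, not the poster's code):

import java.util.HashMap;
import java.util.Map;

static Map<String, Integer> countTokens(Iterable<String> lines) {
    Map<String, Integer> counts = new HashMap<>();
    for (String line : lines) {
        for (String token : line.split("[ ,.;:?!'\"]")) {
            counts.merge(token, 1, Integer::sum); // create the entry on first sight, otherwise increment
        }
    }
    return counts;
}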
You are filling up the Perm Gen with the string intern. Have you tried viewing the -Xloggc output?
I would guess it's just memory filling up, growing outside the processor cache, memory fragmentation and the garbage collection pauses kicking in. Have you checked memory use at all? Tried to change the heap size the JVM uses?
Try doing it in Python and running the Python module from Java.
Enter all the keys into a database, and then execute the following query:
select key, count(*)
from keys
group by key
Have you tried only iterating through the keys without doing any calculations? Is it faster? If yes, then go with option (2).
Can't you do this? You can get your answer in no time.
It's me, the original poster; something went wrong during registration, so I'm posting separately. I'll try the various suggestions given.
PS for Tom Hawtin: thanks for the hints; perhaps String.intern() takes more and more time as the vocabulary grows. I'll check that tomorrow, along with everything else.
I have a BitSet and want to write it to a file. I came across a solution that uses an ObjectOutputStream with its writeObject method.
I looked at ObjectOutputStream in the Java API and saw that you can also write other things (byte, int, short, etc.).
To try out the class, I wrote a single byte to a file using the following code, but the result is a file with 7 bytes instead of 1.
My question is: what are the first 6 bytes in the file, and why are they there?
This is relevant to the BitSet because I don't want to start writing lots of data to a file and then find random bytes inserted in it without knowing what they are.
Here is the code:
byte[] bt = new byte[]{'A'};
File outFile = new File("testOut.txt");
FileOutputStream fos = new FileOutputStream(outFile);
ObjectOutputStream oos = new ObjectOutputStream(fos);
oos.write(bt);
oos.close();
thanks for any help
Avner
The other bytes will be type information.
Basically, ObjectOutputStream is a class used to write Serializable objects to some destination (usually a file). It makes more sense if you think about ObjectInputStream, which has a readObject() method. How does Java know what object to instantiate? Easy: there is type information in there.
You could be writing any objects out to an ObjectOutputStream, so the stream holds information about the types written as well as the data needed to reconstitute the object.
If you know that the stream will always contain a BitSet, don't use an ObjectOutputStream; and if space is at a premium, convert the BitSet to a set of bytes where each bit corresponds to a bit in the BitSet, then write that directly to the underlying stream (e.g. a FileOutputStream, as in your example).
The serialisation format, like many others, includes a header with magic number and version information. When you use the DataOutput/OutputStream methods of ObjectOutputStream, the bytes are placed in the middle of the serialised data (with no type information). This is typically only done in writeObject implementations after a call to defaultWriteObject or use of putFields.
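For the curious, a small check of what those bytes are (assuming the 7-byte testOut.txt produced by the code above; the constants come from the Java serialization stream protocol):

import java.nio.file.Files;
import java.nio.file.Paths;

public class DumpBytes {
    public static void main(String[] args) throws Exception {
        // Print the raw bytes of the file written with ObjectOutputStream.write(byte[]) above.
        byte[] raw = Files.readAllBytes(Paths.get("testOut.txt"));
        for (byte b : raw) {
            System.out.printf("%02X ", b);
        }
        // Typical output: AC ED 00 05 77 01 41
        //   AC ED  stream magic
        //   00 05  stream version
        //   77     TC_BLOCKDATA marker
        //   01     block length (one byte of raw data follows)
        //   41     the 'A' we wrote
    }
}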
If you only use the saved BitSet from Java, the serialization works fine. However, it's kind of annoying if you want to share the bitset across multiple platforms. Besides the overhead of Java serialization, the BitSet is stored in units of 8 bytes, which can generate too much overhead if your bitset is small.
We wrote this small class so we can extract byte arrays from a BitSet. Depending on your use case, it might work better than Java serialization for you.
public class ExportableBitSet extends BitSet {

    private static final long serialVersionUID = 1L;

    public ExportableBitSet() {
        super();
    }

    public ExportableBitSet(int nbits) {
        super(nbits);
    }

    public ExportableBitSet(byte[] bytes) {
        this(bytes == null ? 0 : bytes.length * 8);
        for (int i = 0; i < size(); i++) {
            if (isBitOn(i, bytes))
                set(i);
        }
    }

    public byte[] toByteArray() {
        if (size() == 0)
            return new byte[0];

        // Find highest bit
        int hiBit = -1;
        for (int i = 0; i < size(); i++) {
            if (get(i))
                hiBit = i;
        }

        int n = (hiBit + 8) / 8;
        byte[] bytes = new byte[n];
        if (n == 0)
            return bytes;

        Arrays.fill(bytes, (byte) 0);
        for (int i = 0; i < n * 8; i++) {
            if (get(i))
                setBit(i, bytes);
        }
        return bytes;
    }

    protected static int BIT_MASK[] =
            {0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01};

    protected static boolean isBitOn(int bit, byte[] bytes) {
        int size = bytes == null ? 0 : bytes.length * 8;
        if (bit >= size)
            return false;
        return (bytes[bit / 8] & BIT_MASK[bit % 8]) != 0;
    }

    protected static void setBit(int bit, byte[] bytes) {
        int size = bytes == null ? 0 : bytes.length * 8;
        if (bit >= size)
            throw new ArrayIndexOutOfBoundsException("Byte array too small");
        bytes[bit / 8] |= BIT_MASK[bit % 8];
    }
}
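A quick usage sketch (not part of the original answer), round-tripping a small bitset through the byte[] form:

ExportableBitSet bits = new ExportableBitSet(16);
bits.set(0);
bits.set(10);

byte[] raw = bits.toByteArray();                 // compact, platform-neutral bytes
// raw can now be written with any OutputStream or FileChannel

ExportableBitSet copy = new ExportableBitSet(raw);
System.out.println(copy.get(0) + " " + copy.get(10) + " " + copy.get(3)); // true true false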
What's the most efficient way to put as many bytes as possible from a ByteBuffer bbuf_src into another ByteBuffer bbuf_dest (and also know how many bytes were transferred)? I'm trying bbuf_dest.put(bbuf_src) but it seems to want to throw a BufferOverflowException, and I can't get the javadocs from Sun right now (network problems) when I need them. >:( argh.
edit: darnit, @Richard's approach (using put() with the backing array of bbuf_src) won't work if bbuf_src is a read-only buffer, as you can't get access to that array. What can I do in that case???
As you've discovered, getting the backing array doesn't always work (it fails for read only buffers, direct buffers, and memory mapped file buffers). The better alternative is to duplicate your source buffer and set a new limit for the amount of data you want to transfer:
int maxTransfer = Math.min(bbuf_dest.remaining(), bbuf_src.remaining());

// use a duplicated buffer so we don't disrupt the limit of the original buffer
ByteBuffer bbuf_tmp = bbuf_src.duplicate();
bbuf_tmp.limit(bbuf_tmp.position() + maxTransfer);
bbuf_dest.put(bbuf_tmp);

// now discard the data we've copied from the original source (optional)
bbuf_src.position(bbuf_src.position() + maxTransfer);
OK, I've adapted @Richard's answer:
public static int transferAsMuchAsPossible(ByteBuffer bbuf_dest, ByteBuffer bbuf_src)
{
    int nTransfer = Math.min(bbuf_dest.remaining(), bbuf_src.remaining());
    if (nTransfer > 0)
    {
        bbuf_dest.put(bbuf_src.array(),
                bbuf_src.arrayOffset() + bbuf_src.position(),
                nTransfer);
        bbuf_src.position(bbuf_src.position() + nTransfer);
    }
    return nTransfer;
}
and a test to make sure it works:
public static boolean transferTest()
{
    ByteBuffer bb1 = ByteBuffer.allocate(256);
    ByteBuffer bb2 = ByteBuffer.allocate(50);
    for (int i = 0; i < 100; ++i)
    {
        bb1.put((byte) i);
    }
    bb1.flip();
    bb1.position(5);
    ByteBuffer bb1a = bb1.slice(); // bb1a covers the 5-99 range
    bb1a.position(2);
    bb2.put((byte) 77); // something to see this works when bb2 isn't empty
    int n = transferAsMuchAsPossible(bb2, bb1a);
    boolean itWorked = (n == 49);
    if (bb1a.position() != 51)
        itWorked = false;
    if (bb2.position() != 50)
        itWorked = false;
    bb2.rewind();
    if (bb2.get() != 77)
        itWorked = false;
    for (int i = 0; i < 49; ++i)
    {
        if (bb2.get() != i + 7)
        {
            itWorked = false;
            break;
        }
    }
    return itWorked;
}
You get the BufferOverflowException because your bbuf_dest is not big enough.
You will need to use bbuf_dest.remaining() to find out the maximum number of bytes you can transfer from bbuf_src:
int maxTransfer = Math.min(bbuf_dest.remaining(), bbuf_src.remaining());
bbuf_dest.put(bbuf_src.array(), 0, maxTransfer); // note: assumes the source's arrayOffset and position are 0; see the adapted version above