I'm collecting a large number of UDP packets (time dependant) coming from a service on the same network. These packets are being deserialised into structures that contain numbers (float and int) in memory and processed. We could say we are collecting time series data. However, it's not the kind of time series data that you get from supervising a service (mostly the same value for a period of time). These values constantly vary, true, not by very much. But they vary, nevertheless.
Besides this, I would like to send that data to a server in the cloud and on that server store the time series data.
My question is: what possibilities are there to compress the data in order to send smaller packets over the wire to the server (we could send the incoming UDP packets in batches over TCP) and store them? I'm particularly interested in not using the whole storage I have attached to the server. The data for one session is close to 32MB and i would have multiple sessions at the same time. One session's data is not related to another session. They are totally independent.
You can compress time series data using this library: https://github.com/dgryski/go-tsz
It's based on this paper from Facebook:
http://www.vldb.org/pvldb/vol8/p1816-teller.pdf
We have found that about 96% of all time stamps
can be compressed to a single bit.
[...]
Roughly 51% of all values are compressed to a single bit since
the current and previous values are identical. About 30% of
the values are compressed with the control bits ‘10’ (case b),
with an average compressed size of 26.6 bits. The remaining
19% are compressed with control bits ‘11’, with an average
size of 36.9 bits, due to the extra 13 bits of overhead required
to encode the length of leading zero bits and meaningful bits.
You can use a key-value store like bolt (or probably better: rocksdb which supports compression) and store multiple points for each key. For example you could store one key-value pair every 10 minutes, where the value would be all of the points that occurred during that 10 minute window.
This should give you both good performance and high compression.
Related
I have an input file of about 2 GB. It contains numbers (duplicates possible) from 1 to 9999 and are space separated. I want to read the file in small chunks (chunks of say 100000 or 20000). What approach should I take?
I am planning to process these chunks of data on different nodes in distributed fashion. I cannot use HDFS or any other file system that would chunk data automatically.
When you store that 2GB of data in the HDFS, it will be broken down into blocks. The default block size for HDFS is 64MB. You can set it to any size that you wish. For example, if you set the size to be 100MB, your data will be broken down into approximately 20 blocks.
On the other hand, when you are processing the data through MapReduce, you can decide the amount of data that you want to process by defining the number of mappers to use. You do this by setting up the split size.
For example, if you have 20 blocks of size 100MB in your HDFS like mentioned earlier, if you do not set any split size, Hadoop will figure that out for you and assign 20 mappers. But if you specify, for example the split size to be 25MB, then you will have 80 mappers processing your data.
It is important to note, this is just an example. In practical, the higher number of mappers does not mean faster processing time. You'd have to look into optimisation in order to get the best number of splits to use.
Hope this helps.
I understand what it means to access memory such that it is aligned but I don’t understand why this is necessary. For instance, why can I access a single byte from an address 0x…1 but I cannot access a half word (two bytes) from the same address.
Again, I understand that if you have an address A and an object of size s that the access is aligned if A mod s = 0. But I just don’t understand why this is important at the hardware level.
Hardware is complex; this is a simplified explanation.
A typical modern computer might have a 32-bit data bus. This means that any fetch that the CPU needs to do will fetch all 32 bits of a particular memory address. Since the data bus can't fetch anything smaller than 32 bits, the lowest two address bits aren't even used on the address bus, so it's as if RAM is organised into a sequence of 32-bit words instead of 8-bit bytes.
When the CPU does a fetch for a single byte, the read cycle on the bus will fetch 32 bits and then the CPU will discard 24 of those bits, loading the remaining 8 bits into whatever register. If the CPU wants to fetch a 32 bit value that is not aligned on a 32-bit boundary, it has several general choices:
execute two separate read cycles on the bus to load the appropriate parts of the data word and reassemble them
read the 32-bit word at the address determined by throwing away the low two bits of the address
read some unexpected combination of bytes assembled into a 32-bit word, probably not the one you wanted
throw an exception
Various CPUs I have worked with have taken all four of those paths. In general, for maximum compatibility it is safest to align all n-bit reads to an n-bit boundary. However, you can certainly take shortcuts if you are sure that your software will run on some particular CPU family with known unaligned read behaviour. And even if unaligned reads are possible (such as on x86 family CPUs), they will be slower.
The computer always reads in some fixed size chunks which are aligned.
So, if you don't align your data in memory, you will have to probably read more than once.
Example
word size is 8 bytes
your structure is also 8 bytes
if you align it, you'll have to read one chunk
if you don't align it, you'll have to read two chunks
So, it's basically to speed up.
The reason for all alignment rules are the various widths of the Cache Lines (Instruction-Cache do have 16 Byte lines for the Core2 Architecture, and the Data-Cache do have 64-Byte Lines for L1 and 128-Byte Lines for L2).
So if you want to store/load data that crosses a Cahce-Line Boundary you need to load and store both Cache-lines, which hits the performance.
So you just don't do it because of the performance hit, its that simple.
Try reading a serial port. The data is 8 bits wide.
Nice hardware designers ensure it lies on a least significant byte of the word.
If you have a C structure that has elements not word aligned ( from backwards compatibility or conservation of memory say )
then the address of any byte within the structure is not word aligned.
I'm reading a file to parse few of the fields of each record as a reference key and another field as the reference value. These keys and values are referred for another process.
Hence, I chose a HashMap, so that I can get the values for each key, easily.
But, each of the file consists of tens of millions or records. Hence, the HashMap throws OutOfMemoryError. I hope increasing the heap memory will not be a good solution, if the input file in future grows.
For similar questions in SO, most have suggested to use a database. I fear I'll not be given option to use a DB. Is there any other way to handle the problem?
EDIT: I need to do this similar HashMap Loading for 4 such files :( I need all the four. Bcoz, If I dont find a matching entry for my input in the first Map, I need to find in second, then if there not, then third and finally in fourth.
Edit 2: The files I have sums up to, around 1 GB.
EDIT 3:
034560000010000001750
000234500010000100752
012340000010000300374
I have records like these in a file.. I need to have 03456000001000000 as key and 1750 as value.. for all the millions of records. I'll refer these keys and get the value for my another process.
Using a database will not reduce memory cost or runtime per itself.
However, the default hashmaps may not be what you are looking for, depending on your data types. When used with primitive values such as Integers then java hashmaps have a massive memory overhead. In a HashMap<Integer, Integer>, every entry uses like 24+16+16 bytes. Unused entries (and the hashmap keeps up to half of them unused) take 4 bytes extra. So you can roughly estimate >56 bytes per int->int entry in Java HashMap<Integer, Integer>.
If you encode the integers as String, and we're talking maybe 6 digit numbers, that is likely 24 bytes for the underlying char[] array (16 bit characters; 12 bytes overhead for the array, sizes are a multiple of 8!), plus 16 bytes for the String object around (maybe 24, too). For key and value each. So that is then around 24+40+40, i.e. over 104 bytes per entry.
(Update: as your keys are 17 characters in length, make this 24+62+40, i.e. 136 bytes)
If you used a primitive hashmap such as GNU Trove TIntIntHashMap, it would only take 8 bytes + unused, so lets estimate 16 bytes per entry, at least 6 times less memory.
(Update: for TLongIntHashMap, estimate 12 bytes per entry, 24 bytes with overhead of unused buckets.)
Now you could also just store everything in a massive sorted list. This will allow you to perform a fast join operation, and you will lose much of the overhead of unused entries, and can probably process twice as many in much shorter time.
Oh, and if you know the valid value range, you can abuse an array as "hashmap".
I.e. if your valid keys are 0...999999, then just use an int[1000000] as storage, and write each entry into the appropriate row. Don't store the key at all - it's the offset in the array.
Last but not least, Java by default only uses 25% of your memory. You probably want to increase its memory limit.
Short answer: no. It's quite clear that you can't load your entire dataset in memory. You need a way to keep it on disk together with an index, so that you can access the relevant bits of the dataset without rescanning the whole file every time a new key is requested.
Essentially, a DBMS is a mechanism for handling (large) quantities of data: storing, retrieving, combining, filtering etc. They also provide caching for commonly used queries and responses. So anything you are going to do will be a (partial) reimplementation of what a DBMS already does.
I understand your concerns about having an external component to depend on, however note that a DBMS is not necessarily a server daemon. There are tiny DBMS which link with your program and keep all the dataset in a file, like SQLite does.,
Such large data collections should be handled with a database. Java programs are limited in memory, varying from device to device. You provided no info about your program, but please remember that if it is run on different devices, some of them may have very little ram and will crash very quickly. DB (be it SQL or file-based) is a must when it comes to large-data programs.
You have to either
a) have enough memory load to load the data into memory.
b) have to read the data from disk, with an index which is either in memory or not.
Whether you use a database or not the problem is much the same. If you don't have enough memory, you will see a dramatic drop in performance if you start randomly accessing the disk.
There are alternatives like Chronicle Map which use off heap and performs well up to double your main memory size so you won't get an out of memory error, however you still have problem that you can't store more data in memory than you have main memory.
The memory footprint depends on how you approach the file in java. A widely used solution is based on streaming the file using the Apache Commons IO LineIterator. Their recommended usage
LineIterator it = FileUtils.lineIterator(file, "UTF-8");
try {
while (it.hasNext()) {
String line = it.nextLine();
// do something with line
}
} finally {
it.close();
}
Its an optimized approach, but if the file is too big, you can still end up with OutOfMemory
Since you write that you fear that you will not be given the option to use a database some kind of embedded DB might be the answer. If it is impossible to keep everything in memory it must be stored somewhere else.
I believe that some kind of embedded database that uses the disk as storage might work. Examples include BerkeleyDB and Neo4j. Since both databases use a file index for fast lookups the memory load is lesser than if you keep the entire load in memory but they are still fast.
You could try lazy loading it.
We could look at the physical bytes on disk, but that number includes all the replicas. So I am wondering is there a good approach to get the logical data usage (the real meaningful data size) within the cluster without iterating all of them? Thanks.
Unfortunately no. Even the nodetool cfstats which shows live sstable size, includes the replica. You can iterate and add all and divide by the number of replicas you have to get a tough estimate of one.
I've been working on a graphing/data processing application (you can see a screenshot here) using Clojure (though, oftentimes, it feels like I'm using more Java than Clojure), and have started testing my application with bigger datasets. I have no problem with around 100k points, but when I start getting higher than that, I run into heap space problems.
Now, theoretically, about half a GB should be enough to hold around 70 million doubles. Granted, I'm doing many things that require some overhead, and I may in fact be holding 2-3 copies of the data in memory at the same time, but I haven't optimized much yet, and 500k or so is still orders of magnitude less than that I should be able to load.
I understand that Java has artificial restrictions (that can be changed) on the size of the heap, and I understand those can be changed, in part, with options you can specify as the JVM starts. This leads me to my first questions:
Can I change the maximum allowed heap space if I am using Swank-Clojure (via Leiningen) the JVM has on startup?
If I package this application (like I plan to) as an Uberjar, would I be able to ensure my JVM has some kind of minimum heap space?
But I'm not content with just relying on the heap of the JVM to power my application. I don't know the size of the data I may eventually be working with, but it could reach millions of points, and perhaps the heap couldn't accommodate that. Therefore, I'm interesting in finding alternatives to just piling the data on. Here are some ideas I had, and questions about them:
Would it be possible to read in only parts of a large (text) file at a time, so I could import and process the data in "chunks", e.g, n lines at a time? If so, how?
Is there some faster way of accessing the file I'd be reading from (potentially rapidly, depending on the implementation), other than simply reading from it a bit at a time? I guess I'm asking here for any tips/hacks that have worked for you in the past, if you've done a similar thing.
Can I "sample" from the file; e.g. read only every z lines, effectively downsampling my data?
Right now I plan on, if there are answers to the above (I'll keep searching!), or insights offered that lead to equivalent solutions, read in a chunk of data at a time, graph it to the timeline (see the screenshot–the timeline is green), and allowed the user to interact with just that bit until she clicks next chunk (or something), then I'd save changes made to a file and load the next "chunk" of data and display it.
Alternatively, I'd display the whole timeline of all the data (downsampled, so I could load it), but only allow access to one "chunk" of it at a time in the main window (the part that is viewed above the green timeline, as outlined by the viewport rectangle in the timeline).
Most of all, though, is there a better way? Note that I cannot downsample the primary window's data, as I need to be able to process it and let the user interact with it (e.g, click a point or near one to add a "marker" to that point: that marker is drawn as a vertical rule over that point).
I'd appreciate any insight, answers, suggestions or corrections! I'm also willing to expound
on my question in any way you'd like.
This will hopefully, at least in part, be open-sourced; I'd like a simple-to-use yet fast way to make xy-plots of lots of data in the Clojure world.
EDIT Downsampling is possible only when graphing, and not always then, depending on the parts being graphed. I need access to all the data to perform analysis on. (Just clearing that up!) Though I should definitely look into downsampling, I don't think that will solve my memory issues in the least, as all I'm doing to graph is drawing on a BufferedImage.
Can I change the maximum allowed heap
space if I am using Swank-Clojure (via
Leiningen) the JVM has on startup?
You can change the Java heap size by supplying the -Xms (min heap) and -Xmx (max heap) options at startup, see the docs.
So something like java -Xms256m -Xmx1024m ... would give 256MB initial heap with the option to grow to 1GB.
I don't use Leiningen/Swank, but I expect that it's possible to change it. If nothing else, there should be a startup script for Java somewhere where you can change the arguments.
If I package this application (like I
plan to) as an Uberjar, would I be
able to ensure my JVM has some kind of
minimum heap space?
Memory isn't controlled from within a jar file, but from the startup script, normally a .sh or .bat file that calls java and supplies the arguments.
Can I "sample" from the file; e.g.
read only every z lines?
java.io.RandomAccessFile gives random file access by byte index, which you can build on to sample the contents.
Would it be possible to read in only
parts of a large (text) file at a
time, so I could import and process
the data in "chunks", e.g, n lines at
a time? If so, how?
line-seq returns a lazy sequence of each line in a file, so you can process as much at a time as you wish.
Alternatively, use the Java mechanisms in java.io - BufferedReader.readLine() or FileInputStream.read(byte[] buffer)
Is there some faster way of accessing
the file I'd be reading from
(potentially rapidly, depending on the
implementation), other than simply
reading from it a bit at a time?
Within Java/Clojure there is BufferedReader, or you can maintain your own byte buffer and read larger chunks at a time.
To make the most out of the memory you have, keep the data as primitive as possible.
For some actual numbers, let's assume you want to graph the contents of a music CD:
A CD has two channels, each with 44,100 samples per second
60 min. of music is then ~300 million data points
Represented as 16 bits (2 bytes, a short) per datapoint: 600MB
Represented as primitive int array (4 bytes per datapoint): 1.2GB
Represented as Integer array (32 bytes per datapoint): 10GB
Using the numbers from this blog for object size (16 byte overhead per object, 4 bytes for primitive int, objects aligned to 8-byte boundaries, 8-byte pointers in the array = 32 bytes per Integer datapoint).
Even 600MB of data is a stretch to keep in memory all at once on a "normal" computer, since you will probably be using lots of memory elsewhere too. But the switch from primitive to boxed numbers will all by itself reduce the number of datapoints you can hold in memory by an order of magnitude.
If you were to graph the data from a 60 min CD on a 1900 pixel wide "overview" timeline, you would have one pixel to display two seconds of music (~180,000 datapoints). This is clearly way too little to show any level of detail, you would want some form of subsampling or summary data there.
So the solution you describe - process the full dataset one chunk at a time for a summary display in the 'overview' timeline, and keep only the small subset for the main "detail" window in memory - sounds perfectly reasonable.
Update:
On fast file reads: This article times the file reading speed for 13 different ways to read a 100MB file in Java - the results vary from 0.5 seconds to 10 minutes(!). In general, reading is fast with a decent buffer size (4k to 8k bytes) and (very) slow when reading one byte at a time.
The article also has a comparison to C in case anyone is interested. (Spoiler: The fastest Java reads are within a factor 2 of a memory-mapped file in C.)
Tossing out a couple ideas from left field...
You might find something useful in the Colt library... http://acs.lbl.gov/software/colt/
Or perhaps memory-mapped I/O.
A couple of thoughts:
Best way to handle large in-memory data sets in Java/Clojure is to use large primitive arrays. If you do this, you are basically using only a little more memory than the size of the underlying data. You handle these arrays in Clojure just fine with the aget/aset functionality
I'd be tempted to downsample, but maintain a way to lazily access the detailed points "on demand" if you need to, e.g. in the user interaction case. Kind of like the way that Google maps lets you see the whole world, and only loads the detail when you zoom in....
If you only care about the output image from the x-y plot, then you can construct it by loading in a few thousand points at a time (e.g. loading into your primitive arrays), plotting them then discarding then. In this way you won't need to hold the full data set in memory.