Interpolating Large Datasets
I have a large data set of about 0.5 million records representing the USD/GBP exchange rate over the course of a given day.
I have an application that wants to be able to graph this data or maybe a subset. For obvious reasons I do not want to plot 0.5 million points on my graph.
What I need is a smaller data set (100 points or so) that represents the given data as accurately as possible. Does anyone know of any interesting and performant ways this can be achieved?
Cheers, Karl
There are several statistical methods for reducing a large dataset to a smaller one that is easier to visualize. It's not clear from your question what summary statistic you want. I've assumed that you want to see how the exchange rate changes as a function of time, but perhaps you are interested in how often the exchange rate goes above a certain value, or in some other statistic that I'm not considering.
Summarizing a trend over time
Here is an example using the lowess method in R (from the documentation on scatter plot smoothing):
> library(graphics)
# print out the first 10 rows of the cars dataset
> cars[1:10,]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
# plot the original data
> plot(cars, main = "lowess(cars)")
# fit a lowess-smoothed line to the points
> lines(lowess(cars), col = 2)
# plot a finer-grained lowess-smoothed line through the points
> lines(lowess(cars, f=.2), col = 3)
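The parameter f (the smoother span, i.e. the proportion of points that influence the smooth at each value) controls how tightly the regression fits your data: smaller values follow the data more closely, larger values give a smoother curve. Choose it with care, as you want something that accurately fits your data without overfitting. Rather than speed and distance, you could plot the exchange rate versus time.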
It's also straightforward to access the results of the smoothing. Here's how to do that:
> data = lowess( cars$speed, cars$dist )
> data
$x
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 18 18 19 19
[38] 19 20 20 20 20 20 22 23 24 24 24 24 25
$y
[1] 4.965459 4.965459 13.124495 13.124495 15.858633 18.579691 21.280313 21.280313 21.280313 24.129277 24.129277
[12] 27.119549 27.119549 27.119549 27.119549 30.027276 30.027276 30.027276 30.027276 32.962506 32.962506 32.962506
[23] 32.962506 36.757728 36.757728 36.757728 40.435075 40.435075 43.463492 43.463492 43.463492 46.885479 46.885479
[34] 46.885479 46.885479 50.793152 50.793152 50.793152 56.491224 56.491224 56.491224 56.491224 56.491224 67.585824
[45] 73.079695 78.643164 78.643164 78.643164 78.643164 84.328698
The data object that you get back contains entries named x and y, which correspond to the x and y values passed into the lowess function. In this case, x and y represent speed and dist.
One thought is to use the DBMS to compress the data for you with an appropriate query, along the lines of taking a median over a specific time range. A pseudo-query:
SELECT truncate_to_hour(rate_ts), median(rate) FROM exchange_rates
WHERE rate_ts >= start_ts AND rate_ts <= end_ts
GROUP BY truncate_to_hour(rate_ts)
ORDER BY truncate_to_hour(rate_ts)
Here truncate_to_hour stands for whatever is appropriate to your DBMS. A similar approach would use some other function to segment the time into unique blocks (such as rounding to the nearest 5-minute interval), or another aggregate function that's appropriate in place of median. Depending on the complexity of the time-segmenting procedure and how your DBMS optimizes it, it may be more efficient to run the query on a temporary table that holds the segmented time value.
If you wanted to write your own, one obvious solution would be to break your record set into fixed number-of-points chunks, for which the value would be the average (mean, median, ... pick one). This has the probable advantage of being the fastest, and shows overall trends.
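The fixed-chunk idea can be sketched in Java roughly as follows. This is only a minimal illustration that uses the mean as the per-chunk value; the class and method names (ChunkDownsampler, downsample) are made up for the example, and the input is assumed to be the rates already loaded into a double[] in time order.

```java
import java.util.Arrays;

final class ChunkDownsampler {
    /**
     * Reduces values to at most targetPoints values by averaging
     * consecutive chunks of (nearly) equal size.
     */
    static double[] downsample(double[] values, int targetPoints) {
        if (targetPoints <= 0) {
            throw new IllegalArgumentException("targetPoints must be positive");
        }
        int n = values.length;
        int points = Math.min(targetPoints, n);
        double[] result = new double[points];
        for (int p = 0; p < points; p++) {
            // Chunk p covers indices [start, end); long math avoids int overflow.
            int start = (int) ((long) p * n / points);
            int end = (int) ((long) (p + 1) * n / points);
            double sum = 0;
            for (int i = start; i < end; i++) {
                sum += values[i];
            }
            result[p] = sum / (end - start); // mean of the chunk
        }
        return result;
    }

    public static void main(String[] args) {
        double[] rates = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
        System.out.println(Arrays.toString(downsample(rates, 3)));
    }
}
```

Swapping the mean for a median (or min/max) only changes the inner loop; the chunk boundaries stay the same.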
But it lacks the drama of price ticks. A better solution would probably involve looking for the inflection points, then selecting among them using sliding windows. This has the advantage of better displaying the actual events of the day, but will be slower.
Something like RRDTool would do what you need automatically - the tutorial should get you started, and drraw will graph the data.
I use this at work for things like error graphs: I don't need 1-minute resolution for a 6-month time period, only for the most recent few hours. After that I have 1-hour resolution for a few days, then 1-day resolution for a few months.
The naive approach is simply calculating an average per time interval, where each interval corresponds to one pixel.
http://commons.wikimedia.org/wiki/File:Euro_exchange_rate_to_AUD.svg
This does not show fluctuations. I would suggest also calculating the standard deviation in each time interval and plotting that too (essentially making each point taller than a single pixel). I could not locate an example, but I know that Gnuplot can do this (although it is not written in Java).
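Computing the average and the standard deviation per time interval could look roughly like this in Java. It is a sketch under assumptions: the data is taken to be evenly spaced in time (so equal-sized index ranges stand in for time intervals), the population standard deviation is used, and the names IntervalStats and meanAndStdDev are invented for the example.

```java
final class IntervalStats {
    /**
     * Splits values into up to `buckets` consecutive chunks and returns,
     * for each chunk, a pair {mean, population standard deviation}.
     */
    static double[][] meanAndStdDev(double[] values, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        int n = values.length;
        int count = Math.min(buckets, n);
        double[][] stats = new double[count][2];
        for (int b = 0; b < count; b++) {
            int start = (int) ((long) b * n / count);
            int end = (int) ((long) (b + 1) * n / count);
            int size = end - start;

            // First pass: mean of the chunk.
            double sum = 0;
            for (int i = start; i < end; i++) {
                sum += values[i];
            }
            double mean = sum / size;

            // Second pass: squared deviations from the mean.
            double squares = 0;
            for (int i = start; i < end; i++) {
                double d = values[i] - mean;
                squares += d * d;
            }
            stats[b][0] = mean;
            stats[b][1] = Math.sqrt(squares / size);
        }
        return stats;
    }

    public static void main(String[] args) {
        double[][] s = meanAndStdDev(new double[]{1, 3, 5, 5}, 2);
        for (double[] row : s) {
            System.out.println("mean=" + row[0] + " sd=" + row[1]);
        }
    }
}
```

Each pixel column could then be drawn as a vertical bar from mean - sd to mean + sd.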
How about making an enumeration/iterator wrapper? I'm not familiar with Java, but it may look similar to:
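import java.util.Enumeration;

// Wraps an Enumeration and returns one value per frame of up to frameSize elements.
// Note: despite the name, each returned value is the frame's mean, not its median.
class MedianEnumeration implements Enumeration<Double>
{
    private final Enumeration<Double> frameEnum;
    private final int frameSize;

    MedianEnumeration(Enumeration<Double> e, int len) {
        frameEnum = e;
        frameSize = len;
    }

    public boolean hasMoreElements() {
        return frameEnum.hasMoreElements();
    }

    public Double nextElement() {
        double sum = frameEnum.nextElement();
        int i;
        // The last frame may be shorter, so divide by the number of elements actually read.
        for (i = 1; (i < frameSize) && frameEnum.hasMoreElements(); ++i) {
            sum += frameEnum.nextElement();
        }
        return sum / i;
    }
}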
I am designing software in Java; one of its functions is calculating the cumulative probability of a certain value in a distribution.
For example: the average marriage age in a country is 28 years (which is the mean of the distribution). The distribution that I am using is chi-square (class ChiSquaredDistribution) with 3 degrees of freedom, since it resembles the age-at-marriage distribution in the real world.
My goal is: if users type their age, the output is the approximate probability of getting married at that age (within a one-year interval) based on that distribution. Something like: input: 30 years >>> output: 5.1%, input: 28 years >>> output: 6%, input: 56 years >>> output: 0.8%. The input is an int, the output is a double.
The problem is that the distribution starts at 0, and the mean is, I believe, 3 by default. The following code I wrote displays the marriage probability from age 0 to 70. My question is how to shift it to 18 and over, with the mean at the average age at marriage?
ChiSquaredDistribution x = new ChiSquaredDistribution(3);
Random r = new Random();
for (int UserAtAge=0; UserAtAge<70; UserAtAge++) {
System.out.println((x.cumulativeProbability(UserAtAge+1)-x.cumulativeProbability(UserAtAge))*100);
}
Two images are attached, showing the current results and the intended results. Any code and help would be highly appreciated.
See the current results and the desired results
Shift your distribution by subtracting 18 from each value, so 18 maps to 0, 28 maps to 10, 70 maps to 52, etc. The mean of an unshifted chi-square is its degrees of freedom. Using a chi-square(3) would yield a mean of 21 for the shifted data, so you'll want to bump that up to a chi-square(10) to yield a mean of 28 with the shift.
With some cleanup (lower-case start for local variables, r was unused), the shifted version is:
ChiSquaredDistribution x = new ChiSquaredDistribution(10);
for (int userAge=18; userAge<71; userAge++) {
System.out.println((x.cumulativeProbability(userAge + 1 - 18) - x.cumulativeProbability(userAge - 18)) * 100);
}
I have a large array (~400.000.000 entries) of integers in {0, 1, ..., 8}.
So I need 4 bits per entry. Around 200 MB.
At the moment I use a byte-array and save 2 numbers in each entry.
I wonder if there is a good method to compress this array. I did some quick research and found algorithms like Huffman or LZW. But these algorithms are all meant for compressing the data, sending the compressed data to someone, and decompressing it.
I just want a table that takes less memory, so I can load it into RAM. The 200 MB table easily fits, but I'm thinking about even bigger tables.
It is important that I can still determine the values at certain positions.
Any tips?
Further information:
I just did a little experimenting, and it turns out, that on average 2.14 consecutive numbers have the same value.
There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights.
So more than half of the numbers are 6s.
I only need a fast getValue(idx) function; setValue(idx, value) is not important.
It depends on what your data look like. Are there repeating entries, or do they change only slowly, or what?
If so, you can try to compress chunks of your data and decompress them when needed. The bigger the chunks, the more memory you can save and the worse the speed. IMHO not a good deal. You could also store the data compressed and decompress it in memory.
Otherwise, i.e., if there are no regularities, you'll need at least log(9) / log(2) = 3.17 bits per entry, and there's nothing that could improve on it.
You can come pretty close to this value by packing 5 numbers into a short. As 9**5 = 59049 < 65536 = 2**16, it fits nearly perfectly. You'll need 3.2 bits per number, which is no big win. Packing five numbers is done via this formula:
a + 9 * (b + 9 * (c + 9 * (d + 9 * e)))
and unpacking is trivial via a precomputed table.
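As a rough Java sketch of the packing formula above: five values in 0..8 go into one short, and getValue reads entry idx back out. The names (Base9Packer, pack, unpack, getValue) are made up for the illustration, and unpacking here uses division and modulo instead of the precomputed table mentioned above, purely to keep the example short.

```java
final class Base9Packer {
    // Powers of 9 for the five digit positions a, b, c, d, e.
    private static final int[] POW9 = {1, 9, 81, 729, 6561};

    /** Packs five values in 0..8 as a + 9*(b + 9*(c + 9*(d + 9*e))). */
    static short pack(int a, int b, int c, int d, int e) {
        // The result is at most 9^5 - 1 = 59048, which fits in 16 bits.
        return (short) (a + 9 * (b + 9 * (c + 9 * (d + 9 * e))));
    }

    /** Extracts the digit at position pos (0 = a, ..., 4 = e). */
    static int unpack(short packed, int pos) {
        int unsigned = packed & 0xFFFF; // Java shorts are signed
        return (unsigned / POW9[pos]) % 9;
    }

    /** Reads entry idx from a table that stores five entries per short. */
    static int getValue(short[] table, long idx) {
        return unpack(table[(int) (idx / 5)], (int) (idx % 5));
    }

    public static void main(String[] args) {
        short[] table = {pack(1, 2, 3, 4, 5), pack(8, 8, 8, 8, 8)};
        for (long i = 0; i < 10; i++) {
            System.out.print(getValue(table, i) + " ");
        }
        System.out.println();
    }
}
```

A precomputed table of 59049 entries per digit position would replace the division in unpack with a single array lookup.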
UPDATE after question update
Further information: I just did a little experimenting, and it turns out, that on average 2.14 consecutive numbers have the same value. There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours, 85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights. So more than half of the numbers are 6s.
The fact that on average about 2.14 consecutive numbers are the same could lead to some compression, but in this case it tells us nothing. There are nearly only fives and sixes, so encountering two equal consecutive numbers is to be expected.
Given these facts, you can forget my optimization above. There are practically only 8 values, as you can treat the single zero separately. So you need just 3 bits per value plus a single index for the zero.
You can even create a HashMap for all values below four or above seven, store there 1+154+10373+385990+80 entries and use only 2 bits per value. But this is still far from ideal.
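The "2 bits plus a HashMap for the rare values" idea could be sketched like this in Java. It is only an illustration: the class name TwoBitTable is invented, int indexes are used for brevity, and a plain java.util.HashMap with boxed keys is far too memory-hungry for the real table, so a primitive map would be needed in practice.

```java
import java.util.HashMap;
import java.util.Map;

final class TwoBitTable {
    // Four 2-bit codes per byte; code = value - 4 for the common values 4..7.
    private final byte[] packed;
    // Index -> value for the rare values (0..3 and 8).
    private final Map<Integer, Byte> rare = new HashMap<>();

    TwoBitTable(byte[] values) {
        packed = new byte[(values.length + 3) / 4];
        for (int i = 0; i < values.length; i++) {
            int v = values[i];
            if (v >= 4 && v <= 7) {
                packed[i >> 2] |= (byte) ((v - 4) << ((i & 3) * 2));
            } else {
                // The 2-bit slot stays 0 and is ignored on lookup.
                rare.put(i, (byte) v);
            }
        }
    }

    int getValue(int idx) {
        Byte r = rare.get(idx); // rare values take precedence
        if (r != null) {
            return r;
        }
        return ((packed[idx >> 2] >> ((idx & 3) * 2)) & 3) + 4;
    }

    public static void main(String[] args) {
        byte[] data = {6, 5, 0, 8, 4, 7, 3, 6, 6};
        TwoBitTable t = new TwoBitTable(data);
        for (int i = 0; i < data.length; i++) {
            System.out.print(t.getValue(i) + " ");
        }
        System.out.println();
    }
}
```

Every lookup pays for one hash probe, which is the price of having no escape code left in the 2-bit alphabet.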
Assuming no regularities, you'd need about 1.44 bits per value, as this is the entropy. You could go over all 5-tuples, compute their histogram, and use 1 byte to encode the 255 most frequent tuples. All the remaining tuples would map to the 256th value, telling you that you have to look in a HashMap for the rare tuple value.
Some evaluation
I was curious whether it could work. Packing 5 numbers into one byte needs 85996340 bytes. There are nearly 5 million tuples that don't fit, so my idea was to use a hash map for them. Assuming rehashing rather than chaining, it makes sense to keep the map maybe 50% full, so we need 10 million entries. Assuming a TIntShortHashMap (mapping indexes to tuples), each entry takes 6 bytes, leading to 60 MB. Too bad.
Packing only 4 numbers into one byte consumes 107495425 bytes and leaves 159531 tuples which don't fit. This looks better, however, I'm sure the denser packing could be improved a lot.
The results as produced by this little program:
*** Packing 5 numbers in a byte. ***
Normal packed size: 85996340.
Number of tuples in need of special handling: 4813535.
*** Packing 4 numbers in a byte. ***
Normal packed size: 107495425.
Number of tuples in need of special handling: 159531.
There are many options - most depend on how your data looks. You could use any of the following and even combinations of them.
LZW - or variants
In your case a variant that uses a 4-bit initial dictionary would probably be a good start.
You could compress your data in blocks so you could use the index requested to determine which block to decode on the fly.
This would be a good fit if there are repeating patterns in your data.
Difference Coding
Your edit suggests that your data may benefit from a differencing pass. Essentially you replace every value with the difference between it and its predecessor.
Again you would need to treat your data in blocks and difference fixed run lengths.
You may also find that differencing followed by LZW would be a good solution.
Fourier Transform
If some data loss would be acceptable then some of the Fourier Transform compression schemes may be effective.
Lossless JPEG
If your data has a 2-dimensional aspect then some of the JPEG algorithms may lend themselves well.
The bottom line
You need to bear in mind:
The longer time you spend compressing - up to a limit - the better compression ratio you can achieve
There is a real practical limit to how far you can go with lossless compression.
Once you go lossy you are essentially no longer restricted. You could approximate the whole of your data with new int[]{6} and get quite a few correct results.
As more than half of the entries are sixes, just encode those as a single bit. Use 2 bits for the second most common value, and so on. Then you have something like this:
                    encoding               total
        #entries    bit pattern   #bits    # of bits
zero            1   000000001         9            9
ones          154   0000001           7         1078
twos        10373   000001            6        62238
threes     385990   00001             5      1929950
fours     8146188   0001              4     32584752
fives    85008968   01                2    170017936
sixes   265638366   1                 1    265638366
sevens   70791576   001               3    212374728
eights         80   00000001          8          640
--------------------------------------------------------
Total                                      682609697 bits
With 429981696 entries encoded in 682609697 bits, you would then need 1.59 bits per entry on average.
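The arithmetic behind the table can be checked with a few lines of Java. This is just a sketch that multiplies the counts from the question by the code lengths listed in the table; the class name PrefixCodeCost and its members are made up for the example.

```java
final class PrefixCodeCost {
    // Occurrences of the values 0..8, as given in the question.
    static final long[] COUNTS = {
        1, 154, 10373, 385990, 8146188, 85008968, 265638366, 70791576, 80
    };
    // Code length in bits for the values 0..8, taken from the table.
    static final int[] CODE_BITS = {9, 7, 6, 5, 4, 2, 1, 3, 8};

    static long totalEntries() {
        long total = 0;
        for (long c : COUNTS) {
            total += c;
        }
        return total;
    }

    static long totalBits() {
        long bits = 0;
        for (int v = 0; v < COUNTS.length; v++) {
            bits += COUNTS[v] * CODE_BITS[v];
        }
        return bits;
    }

    public static void main(String[] args) {
        System.out.println("entries: " + totalEntries());
        System.out.println("bits:    " + totalBits());
        System.out.println("bits per entry: " + (double) totalBits() / totalEntries());
    }
}
```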
Edit:
To allow for fast lookup, you can build an index into the compressed data that shows where every n-th entry starts. Finding the exact value would then involve decompressing n/2 entries on average. Depending on how fast it needs to be, you can adjust the number of entries in the index. To reduce the size of the pointers into the compressed data (and thus the size of the index), use an estimate and store only the offset from that estimate.
Entry no   Actual position   Estimated pos (n * 1.59)   Offset from estimated
       0                 0                          0                       0
     100               162                        159                       3
     200               332                        318                      14
     300               471                        477                      -6
     400               642                        636                       6
     500               807                        795                      12
     600               943                        954                     -11
Use the last column (offset from estimated) as the index.
The overhead for such an index, with one index entry per 100 entries and 10 bits for each offset, would be 0.1 bit extra per entry.
There are 1 zero, 154 ones, 10373 twos, 385990 threes, 8146188 fours,
85008968 fives, 265638366 sixes, 70791576 sevens and 80 eights
Total = 429981696 symbols
Assuming the distribution is random, the entropy theorem says you cannot do better than 618297161.7 bits ~ 73.707 MB or on average 1.438 bits / symbol.
Minimum number of bits is SUM(count[i] * LOG(429981696 / count[i], 2)).
You can achieve this size using a range coder.
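The bound given by the SUM formula above can be recomputed with a short Java sketch. The class and method names (EntropyBound, minimumBits) are invented for the example; the counts are the ones from the question.

```java
final class EntropyBound {
    /** Returns SUM(count[i] * log2(total / count[i])) over all non-zero counts. */
    static double minimumBits(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        double bits = 0;
        for (long c : counts) {
            if (c > 0) {
                bits += c * (Math.log((double) total / c) / Math.log(2));
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        long[] counts = {
            1, 154, 10373, 385990, 8146188, 85008968, 265638366, 70791576, 80
        };
        double bits = minimumBits(counts);
        System.out.println("minimum bits:    " + bits);
        System.out.println("bits per symbol: " + bits / 429981696L);
    }
}
```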
Here Sqrt(N) = 20736.
You can achieve O(Sqrt(N)) complexity for accessing a random element by saving an array int[k = 0 .. CEIL(SQRT(N)) - 1] that holds the arithmetic decoder state after every SQRT(N) decoded symbols. This allows fast decoding of the next block of 20736 symbols.
The complexity of accessing an element drops to O(1) if you access the memory stream in a linear way.
Additional memory used: 20736 * 4 = 81KB.
How about considering a caching solution, like MapDB or Apache JCS? This will let you persist the collection to disk, so you can work with very large lists.
You should look into a BitSet to store it most efficiently. Contrary to what the name suggests, it is not exactly a set: it is ordered and you can access it by index.
Internally it uses an array of longs to store the bits and hence can update itself by using bit masks.
I don't believe you can store it any more efficiently natively; if you want even more efficiency, you should consider packing/compression algorithms.
Here is a very interesting Java problem I've found:
Before book printing was invented, books were copied by certain people called "writers".
The bookkeeper has a stack of N books that need to be copied. For that purpose he has K writers. Each book can have a different number of pages, and every writer can only take books from the top of the stack (if he takes book 1 then he can take book 2, but not book 4 or book 5). The bookkeeper knows the number of pages each book has, and he needs to share the books between the writers so that the maximum number of pages any writer has to copy is as small as possible. A book of course can't be split; for example, you can't split a 30-page book into 15 and 15.
For example, if we have 7 books and 3 writers, and the books have 30 50 5 60 90 5 80 pages respectively, then the optimal solution would be for the first writer to take the first 4 books, the second writer the next book, and the third the last two books, so we would have:
1st = 145 pages
2nd = 90 pages
3rd = 85 pages
So the task is to write an algorithm that finds the optimal way of sharing the pages between the writers. At the end, the program has to show how many pages each writer got.
This was in a programming contest years ago and I wanted to give it a try. What I've found so far is this: take the total number of pages of all the books and divide it by the number of writers (106.66 pages in the example), then try to give each writer the consecutive books from the stack whose total is closest to that number. But that doesn't work well at all for large page numbers, especially if the number of pages in a single book exceeds the total number of pages divided by the number of writers.
So share your opinion and give tips and hints if you'd like, mathematical or whatever, this is a very interesting algorithm to be found!
I've come up with a straightforward solution, perhaps not very efficient, but the logic works. Basically you start with as many writers as there are books and reduce until you have your number of writers.
To show this with an example: suppose you start with your seven values, 30 50 5 60 90 5 80. In each step you reduce the count by one by summing up the "lowest pair", i.e. the two adjacent values with the smallest sum. Each count below is followed by the values at that step.
7
30 50 5 60 90 5 80
6
30 55 60 90 5 80
5
30 55 60 90 85
4
85 60 90 85
3
145 90 85
With some pseudo programming, this example shows how it could be implemented:
main(books: { 30 50 5 60 90 5 80 }, K: 3)

define main(books, K)
    writers = books
    while writers.length > K do
        reduceMinimalPair(writers)
    endwhile
end

define reduceMinimalPair(items)
    // find the adjacent pair with the smallest sum (indices are 0-based)
    index = 0
    minvalue = items[0] + items[1]
    for i in 1..items.length-2 do
        if items[i] + items[i + 1] < minvalue then
            index = i
            minvalue = items[i] + items[i + 1]
        endif
    endfor
    // replace the pair by its sum; after the first removal
    // the second item of the pair sits at the same index
    items.removeat(index)
    items.removeat(index)
    items.insertat(index, minvalue)
end
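Let us assume you have books 1...n with pages b1,b2,...,bn. Assume you have K writers.
Let F[i,k] be the smallest possible maximum load when the first i books are shared among k writers. Initialize a matrix F[1...n,1...K]=infinity.
Let F[i,1] = sum_{j=1..i} (bj) for every i=1..n.
Now, for every k=2..K and every i=1..n
F[i,k] = min_{j=1..i} ( max( F[j,k-1], sum_{r=j+1..i} (br) ) )
The answer is F[n,K].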
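A direct Java transcription of this recurrence might look as follows. It is a sketch, not a tuned implementation: the names (BookPartitionDp, minMaxPages) are invented, prefix sums replace the inner sum, only the previous column of F is kept, and j is allowed to start at 0 so that a writer may end up with no books.

```java
final class BookPartitionDp {
    /**
     * Smallest achievable maximum page count when the books are split,
     * in stack order, among k writers. Runs in O(k * n^2).
     */
    static long minMaxPages(int[] pages, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        int n = pages.length;
        long[] prefix = new long[n + 1]; // prefix[i] = b1 + ... + bi
        for (int i = 0; i < n; i++) {
            prefix[i + 1] = prefix[i] + pages[i];
        }

        // prev[i] = F[i, w-1]; for one writer F[i,1] is just the prefix sum.
        long[] prev = prefix.clone();
        for (int w = 2; w <= k; w++) {
            long[] cur = new long[n + 1]; // cur[0] = 0: no books, no load
            for (int i = 1; i <= n; i++) {
                long best = Long.MAX_VALUE;
                for (int j = 0; j <= i; j++) {
                    // First j books go to w-1 writers, books j+1..i to writer w.
                    long candidate = Math.max(prev[j], prefix[i] - prefix[j]);
                    best = Math.min(best, candidate);
                }
                cur[i] = best;
            }
            prev = cur;
        }
        return prev[n];
    }

    public static void main(String[] args) {
        int[] books = {30, 50, 5, 60, 90, 5, 80};
        System.out.println(minMaxPages(books, 3));
    }
}
```

Recording the minimizing j for each (i, k) would additionally give the actual split, not just the maximum load.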
I think your approach is the right one, but if you say it did not work for big numbers, then maybe you should check whether a number bigger than the average exists and do something else in that case. Maybe remove that number and give it to a writer from the start, or something along those lines.
As an alternative to solving it with dynamic programming, you can binary search for the smallest page limit such that the books can be shared without any writer copying more than that number of pages. When the search converges, that limit is the answer.
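The binary search could be sketched in Java like this. The names (BookPartitionSearch, feasible, minMaxPages) are made up for the example, and the feasibility check is a simple greedy pass that hands books to the current writer until the limit would be exceeded.

```java
final class BookPartitionSearch {
    /** Can the books be shared, in order, among at most k writers under the limit? */
    static boolean feasible(int[] pages, int k, long limit) {
        int writers = 1;
        long current = 0;
        for (int p : pages) {
            if (p > limit) {
                return false; // a single book already exceeds the limit
            }
            if (current + p > limit) {
                writers++;    // start a new writer with this book
                current = 0;
            }
            current += p;
        }
        return writers <= k;
    }

    /** Smallest limit for which the split is feasible; O(n * log(total pages)). */
    static long minMaxPages(int[] pages, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        long lo = 0; // the largest single book is a lower bound
        long hi = 0; // the total page count is an upper bound
        for (int p : pages) {
            lo = Math.max(lo, p);
            hi += p;
        }
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            if (feasible(pages, k, mid)) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        int[] books = {30, 50, 5, 60, 90, 5, 80};
        System.out.println(minMaxPages(books, 3));
    }
}
```

The search works because feasibility is monotone: if a limit is feasible, every larger limit is feasible too.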
I'm just having a few issues getting my head around this problem. Any help would be greatly appreciated.
The program must read a text file and compute the sum of divisors of each input number. For example, the number 20 has the sum 1+2+4+5+10=22. These sums are then summed up line by line. Next, the sum of divisors of each of these line sums is computed, and finally those results are totaled up.
E.g. initial file:
1 2 4 6 15 20
25 50 100 125 250 500
16 8 3
Then computes the sum of divisors.
1 1 3 6 9 22
6 43 117 31 218 592
15 7 1
Summed up line by line
42
1007
23
Then the sum of divisors of each of the above sums is computed.
54
73
1
Then finally totaled up and returned.
128
I need to complete the process with each new line being completed by a threadpool.
My logic is as follows.
5.2. For each input line (add each line to an ArrayBlockingQueue,
then add each item in the queue to an ExecutorService, which will run the following)
5.2.1. Parse the current input line into integers
5.2.2. For each integer in the current input line
5.2.2.1. Compute the sum-of-divisors of this integer
5.2.2.2. Add this to the cumulated sum-of-divisors
5.2.3. Compute the sum-of-divisors of this cumulated sum
5.2.4. Add this to the grand total
I get stuck after 5.2. Do I create a new class that implements the Runnable interface and adds the cumulated sum to an atomicArray, or is it best to create a class that implements the Callable interface and have it return the cumulated sum? Or is there a completely different way?
Here is what I have so far, which returns the desired result but in a sequential manner.
http://pastebin.com/AyB58fpr
Use java.util.concurrent.Future and a java.util.concurrent.Executors.newFixedThreadPool(int nThreads). This will be really easy to do.
Follow the Oracle tutorial if you are not familiar with Executors.
I prefer the Callable interface, since it doesn't make the code that processes the input depend on how the output is gathered.
The usual approach is to collect the tasks in a list of Futures. See this answer for an example.
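A minimal sketch of the Callable-plus-Future approach, using only java.util.concurrent, could look like this. The class and method names (DivisorSums, processLine, grandTotal) are invented, the lines are passed in as strings rather than read from a file, and sumOfDivisors maps 1 to 1 only because the worked example in the question does so.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class DivisorSums {
    /** Sum of the divisors of n other than n itself. */
    static long sumOfDivisors(long n) {
        if (n == 1) {
            return 1; // assumption: matches the question's example (1 -> 1)
        }
        long sum = 0;
        for (long d = 1; d <= n / 2; d++) {
            if (n % d == 0) {
                sum += d;
            }
        }
        return sum;
    }

    /** Steps 5.2.1 to 5.2.3 for one non-empty input line. */
    static long processLine(String line) {
        long cumulated = 0;
        for (String token : line.trim().split("\\s+")) {
            cumulated += sumOfDivisors(Long.parseLong(token));
        }
        return sumOfDivisors(cumulated);
    }

    /** Submits one task per line and adds the results up (step 5.2.4). */
    static long grandTotal(List<String> lines, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String line : lines) {
                Callable<Long> task = () -> processLine(line);
                futures.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : futures) {
                total += f.get(); // blocks until that line is done
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "1 2 4 6 15 20", "25 50 100 125 250 500", "16 8 3");
        System.out.println(grandTotal(lines, 3));
    }
}
```

Because each task returns its own result, no shared mutable state (such as an atomic array) is needed; the main thread does the final addition.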
I have an in-memory cache which stores a set of information by a certain level of aggregation - in the Students example below let's say I store it by Year, Subject, Teacher:
#  Students  Year  Subject  Teacher
1  30        7     Math     Mrs Smith
2  28        7     Math     Mr Cork
3  20        8     Math     Mrs Smith
4  20        8     English  Mr White
5  18        8     English  Mr Book
6  10        12    Math     Mrs Jones
Now unfortunately my cache doesn't have GROUP BY or similar functions - so when I want to look at things at a higher level of aggregation, I will have to 'roll up' the data myself. For example, if I aggregate Students by Year, Subject the aforementioned data would look like so:
#  Students  Year  Subject
1  58        7     Math
2  20        8     Math
3  38        8     English
4  10        12    Math
My question is thus - how would I best do this in Java? Theoretically I could be pulling back tens of thousands of objects from this cache, so being able to 'roll up' these collections quickly may become very important.
My initial (perhaps naive) thought would be to do something along the following lines:
Until I exhaust the list of records:
    Each 'unique' record that I come across is added as a key to a hashmap.
    If I encounter a record that has the same data for this new level of aggregation, add its quantity to the existing one.
Now for all I know this is a fairly common problem and there are much better ways of doing this. So I'd welcome any feedback as to whether I'm pointing myself in the right direction.
"Get a new cache" is not an option, I'm afraid :)
-Dave.
Your "initial thought" isn't a bad approach. The only way to improve on it would be to have an index for the fields on which you are aggregating (year and subject). (That's basically what a dbms does when you define an index.) Then your algorithm could be recast as iterating through all index values; you wouldn't have to check the results hash for each record.
Of course, you would have to build the index when populating the cache and maintain it as data is modified.
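For reference, the hash-map roll-up from the question could be sketched in Java roughly as follows. The Row class and all names are hypothetical stand-ins for whatever the cache returns; the key is a small list of the fields being aggregated on, and Map.merge adds each row's count to the running total.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RollUp {
    /** One cached row; the class and field names are hypothetical. */
    static final class Row {
        final int students;
        final int year;
        final String subject;
        final String teacher;

        Row(int students, int year, String subject, String teacher) {
            this.students = students;
            this.year = year;
            this.subject = subject;
            this.teacher = teacher;
        }
    }

    /** Sums students per (year, subject), keeping first-seen order. */
    static Map<List<Object>, Integer> byYearAndSubject(List<Row> rows) {
        Map<List<Object>, Integer> totals = new LinkedHashMap<>();
        for (Row r : rows) {
            // Lists compare by content, so equal (year, subject) pairs collide.
            List<Object> key = Arrays.<Object>asList(r.year, r.subject);
            totals.merge(key, r.students, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(30, 7, "Math", "Mrs Smith"));
        rows.add(new Row(28, 7, "Math", "Mr Cork"));
        rows.add(new Row(20, 8, "Math", "Mrs Smith"));
        rows.add(new Row(20, 8, "English", "Mr White"));
        rows.add(new Row(18, 8, "English", "Mr Book"));
        rows.add(new Row(10, 12, "Math", "Mrs Jones"));
        byYearAndSubject(rows).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

This is a single pass with one hash lookup per record, so tens of thousands of rows should not be a problem.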