Reducing the granularity of a data set - java

I have an in-memory cache which stores a set of information by a certain level of aggregation - in the Students example below let's say I store it by Year, Subject, Teacher:
#  Students  Year  Subject  Teacher
1  30        7     Math     Mrs Smith
2  28        7     Math     Mr Cork
3  20        8     Math     Mrs Smith
4  20        8     English  Mr White
5  18        8     English  Mr Book
6  10        12    Math     Mrs Jones
Now unfortunately my cache doesn't have GROUP BY or similar functions - so when I want to look at things at a higher level of aggregation, I will have to 'roll up' the data myself. For example, if I aggregate Students by Year, Subject the aforementioned data would look like so:
#  Students  Year  Subject
1  58        7     Math
2  20        8     Math
3  38        8     English
4  10        12    Math
My question is thus - how would I best do this in Java? Theoretically I could be pulling back tens of thousands of objects from this cache, so being able to 'roll up' these collections quickly may become very important.
My initial (perhaps naive) thought would be to do something along the following lines:
Until I exhaust the list of records:
- Each 'unique' record that I come across is added as a key to a hashmap.
- If I encounter a record that has the same data for this new level of aggregation, add its quantity to the existing one.
Now for all I know this is a fairly common problem and there are much better ways of doing this. So I'd welcome any feedback as to whether I'm pointing myself in the right direction.
"Get a new cache" is not an option, I'm afraid :)
-Dave.

Your "initial thought" isn't a bad approach. The only way to improve on it would be to have an index for the fields on which you are aggregating (year and subject). (That's basically what a dbms does when you define an index.) Then your algorithm could be recast as iterating through all index values; you wouldn't have to check the results hash for each record.
Of course, you would have to build the index when populating the cache and maintain it as data is modified.
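For illustration, here is a minimal sketch of that roll-up in Java, assuming the cache hands back simple value objects (the Row type and its field names are made up for the example):

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RollUp {
    // Hypothetical shape of a record as it comes back from the cache.
    record Row(int students, int year, String subject, String teacher) {}

    // Roll up by (year, subject): sum the student counts of all rows
    // that share the same year and subject.
    static Map<String, Integer> byYearAndSubject(List<Row> rows) {
        Map<String, Integer> totals = new HashMap<>();
        for (Row r : rows) {
            String key = r.year() + "|" + r.subject();       // composite aggregation key
            totals.merge(key, r.students(), Integer::sum);   // add to the existing total
        }
        return totals;
    }
}

A dedicated key class (with its own equals()/hashCode()) instead of the concatenated string avoids delimiter collisions and makes the aggregation level explicit.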

Related

Thinking about more optimal solution for below algorithm

There are n vendors on Amazon selling a product, each at a particular price during a particular time frame. I have to design an algorithm that selects the vendor with the lowest price at any particular time.
For example, for the below set of input:
Input format:
<StartTime, EndTime, Price of product by this vendor in this time frame>
1 5 20
3 8 15
7 10 8
Output should be:
1 2 20
3 6 15
7 10 8
I have solved it by storing the price for each time in a hashmap, updating the entry whenever a lower price exists for that time, and then keeping a list in the vendor class of all the times corresponding to a particular price.
But this solution has O(n²) time complexity, so I am looking for a fancier data structure or approach that solves it in less time.
You can use a sweep line algorithm and a multiset to solve it in O(N log N) time:
Let's create two events for each vendor: the moment she starts selling the item and the moment she ends. We'll also create one "check" event for each time we're interested in.
Now we'll sort the list of events by their times.
For each event, we do the following: if it's a start event, we add the new price to the multiset. Otherwise, we remove it.
At any moment of time, the answer is the smallest element in the multiset, so we can answer each query efficiently.
If the multiset supports "fast" (that is, O(log N) or better) insertions, deletions and finding the smallest element, this solution uses O(N log N) time and O(N) space. There is no multiset in the Java standard library, but you can use a TreeSet of pairs (price, vendor_id) to work around this limitation.
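A rough sketch of this in Java, using a TreeMap of price-to-count as the multiset (assuming inclusive [start, end] times, as in the example above):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class CheapestVendor {
    // offers[i] = {startTime, endTime, price}, with inclusive time bounds.
    static TreeMap<Integer, Integer> lowestPriceTimeline(int[][] offers) {
        // Two events per offer: the price becomes active at start and inactive just after end.
        List<int[]> events = new ArrayList<>();               // {time, +1 or -1, price}
        for (int[] o : offers) {
            events.add(new int[]{o[0], +1, o[2]});
            events.add(new int[]{o[1] + 1, -1, o[2]});
        }
        events.sort((a, b) -> Integer.compare(a[0], b[0]));

        TreeMap<Integer, Integer> active = new TreeMap<>();   // price -> count (the "multiset")
        TreeMap<Integer, Integer> timeline = new TreeMap<>(); // event time -> lowest active price
        for (int[] e : events) {
            if (e[1] > 0) {
                active.merge(e[2], 1, Integer::sum);
            } else if (active.merge(e[2], -1, Integer::sum) == 0) {
                active.remove(e[2]);
            }
            timeline.put(e[0], active.isEmpty() ? null : active.firstKey());
        }
        return timeline;   // cheapest price at time t is timeline.floorEntry(t).getValue()
    }
}

For the sample input this produces {1=20, 3=15, 6=15, 7=8, 9=8, 11=null}, which collapses to the 1-2 / 3-6 / 7-10 ranges in the expected output.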

Confusion in building a Svm Training Set

I am currently testing the training phase of my Binary SVM Java implementation.
I have tested it for small data shown below, but I need to apply my svm to a known dataset like spam/not spam, images, etc.
My SVM is capable of reading numeric values so I need to test it with some real data also.
Later I want to move on to images.
To find a real data set, I searched through different repos, but all I could find was numerical values + characters, text, etc.
And I found a spam Archive.
But how do I proceed with that?
I think I need to convert the text into numerical data using tf-idf and then apply my SVM.
But how do I label them as the 1/-1 classes?
Normally the input would be of this format, right?
0 0 1
3 4 1
5 9 1
12 1 1
8 7 1
9 8 -1
6 12 -1
10 8 -1
8 5 -1
14 8 -1
How do I bring the spam archive data into the above format?
It's all about feature selection. The input is, of course, the pairs of documents and labels, but the feature extraction is part of the training process. The most straightforward representation is binary: check whether a particular word occurs in a particular document. A closely related one is term frequency: the i-th component of the feature vector is the number of times word wi occurs in the document. Here the vector is laid out over an established dictionary that includes all the words in the training documents. You may also weight by the inverse document frequency, which downweights words that occur in many documents: the total number of documents divided by the number of documents containing wi, usually log-scaled.
FYI, one research paper about SVM on spam:
http://classes.soe.ucsc.edu/cmps290c/Spring12/lect/14/00788645-SVMspam.pdf
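As a rough illustration of the term-frequency part (independent of any particular SVM library), here is a sketch in Java that builds a dictionary from the training messages and turns each message into a count vector; the +1/-1 labels would simply come from whether a message sits in the spam or the non-spam folder of the archive:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TermFrequencyFeatures {
    // Build a dictionary: every distinct word across the training documents
    // gets a fixed index in the feature vector.
    static Map<String, Integer> buildDictionary(List<String> documents) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String doc : documents) {
            for (String word : doc.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) dictionary.putIfAbsent(word, dictionary.size());
            }
        }
        return dictionary;
    }

    // Term-frequency vector: component i counts how often dictionary word i
    // occurs in the document. An idf weight could be multiplied in afterwards.
    static double[] toVector(String document, Map<String, Integer> dictionary) {
        double[] vector = new double[dictionary.size()];
        for (String word : document.toLowerCase().split("\\W+")) {
            Integer index = dictionary.get(word);
            if (index != null) vector[index]++;
        }
        return vector;
    }
}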

Effective search from a huge number of points

I have a bunch of collected GPS points, and now I need to match these points against a set of 18,000 points. I have both sets in ArrayLists. Is there a better way to search than comparing every pair? I am doing this in Java.
Here is a sample of the data. The points contain one additional parameter, ID1, by which a set of points can be grouped.
ID1  ID2  ID3  longitude   latitude
2    1    1    -79.911635  39.609849
2    1    2    -79.91151   39.60956
2    1    3    -79.9115    39.609489
2    1    4    -79.911496  39.609433
3    1    1    -79.908162  39.609841
3    1    2    -79.908447  39.610019
4    1    1    -79.911136  39.608433
4    1    2    -79.910961  39.608446
4    1    3    -79.910629  39.608451
4    1    4    -79.910064  39.608493
4    1    5    -79.909117  39.608586
If you are looking for exact matches, then you can place the points in a set (both HashSet and TreeSet will work) and find the intersection, e.g. set1.retainAll(set2). You will have to implement compareTo() or hashCode() accordingly, and equals() in any case, but that is the easy scenario.
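A minimal sketch of the exact-match case, using a hypothetical Point record whose generated equals()/hashCode() compare the raw coordinates (so this only makes sense if both lists come from the same source at the same precision):

import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ExactMatch {
    record Point(double longitude, double latitude) {}   // equals()/hashCode() generated

    static Set<Point> intersection(List<Point> first, List<Point> second) {
        Set<Point> result = new HashSet<>(first);
        result.retainAll(new HashSet<>(second));   // keep only the points present in both lists
        return result;
    }
}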
If you are looking for "closer than X", you should use a quadtree. Place all the nodes in the first arraylist in a quadtree, and then perform quick lookup using this datastructure (which can yield the closest point in O(log N) per lookup instead of the O(N) per lookup of the brute-force approach). There is an open-source implementation of a quadtree in, for example, geotools.
You could also use the spatial index known as the R-tree. It is usually faster than a quadtree.
For example, this paper finds it to be 2-3 times faster in Oracle databases: http://pdf.aminer.org/000/300/406/incorporating_updates_in_domain_indexes_experiences_with_oracle_spatial_r.pdf
Java Topology Suite (JTS) contains a good implementation of the rtree: http://www.vividsolutions.com/jts/javadoc/com/vividsolutions/jts/index/strtree/STRtree.html
Note that GeoTools is based on JTS, so there may well also be an rtree lurking inside the spatial index functionality of it: http://docs.geotools.org/latest/userguide/library/main/collection.html
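For the "closer than X" case, here is a sketch of how the JTS STRtree from the link above might be used (package names as in the linked javadoc; the tolerance in degrees is just a placeholder):

import java.util.List;
import com.vividsolutions.jts.geom.Envelope;
import com.vividsolutions.jts.index.strtree.STRtree;

class NearbySearch {
    // Index the 18000 reference points once; each point is {longitude, latitude}.
    static STRtree buildIndex(List<double[]> referencePoints) {
        STRtree index = new STRtree();
        for (double[] p : referencePoints) {
            index.insert(new Envelope(p[0], p[0], p[1], p[1]), p);
        }
        index.build();
        return index;
    }

    // Returns the candidates whose bounding box falls within `tolerance` degrees
    // of the query point; exact distances can then be checked on this much
    // smaller list instead of on all 18000 points.
    @SuppressWarnings("unchecked")
    static List<double[]> candidatesNear(STRtree index, double lon, double lat, double tolerance) {
        Envelope searchBox = new Envelope(lon - tolerance, lon + tolerance,
                                          lat - tolerance, lat + tolerance);
        return index.query(searchBox);
    }
}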

knapsack-like algorithm

Here is a very interesting Java problem I've found:
Before book printing was invented, books were copied by certain people called "writers".
The bookkeeper has a stack of N books that need to be copied. For that purpose he has K writers. Each book can have a different number of pages, and every writer can only take books from the top of the stack (if he takes book 1 then he can take book 2, but not book 4 or book 5). The bookkeeper knows the number of pages each book has, and he needs to share the books between the writers so that the maximum number of pages a writer has to copy is as small as possible. The pages of course can't be split; for example, you can't have a 30-page book split into 15 and 15.
For example, if we have 7 books, 3 writers and page counts of 30 50 5 60 90 5 80, then the optimal solution would be for the first writer to take the first 4 books, the second writer the next book and the third writer the last two books, so we would have:
1st = 145 pages
2nd = 90 pages
3rd = 85 pages
So the task is to write an algorithm that finds the optimal solution for sharing the pages between the writers, and at the end the program has to show how many pages each writer got.
This was in a programming contest years ago and I wanted to give it a try. What I've found so far: if you take the total number of pages of all the books and divide it by the number of writers, you get 106.66 pages in this example, and then you try to give each writer the contiguous books from the stack whose total is closest to that number. But that doesn't work well at all for large page counts, especially when the number of pages in a single book exceeds the total number of pages divided by the number of writers.
So share your opinion and give tips and hints if you'd like, mathematical or otherwise; this is a very interesting algorithm to find!
I've come up with a straightforward solution; perhaps not very efficient, but the logic works. Basically you start with the number of writers being the same as the number of books, and reduce until you reach your number of writers.
To show with an example: suppose you start with your seven values, 30 50 5 60 90 5 80. At each step you reduce the count by one by summing up the "lowest adjacent pair"; the merged value is carried on to the next iteration.
7 values: 30 50 5 60 90 5 80
6 values: 30 55 60 90 5 80
5 values: 30 55 60 90 85
4 values: 85 60 90 85
3 values: 145 90 85
In Java, this could be implemented roughly as follows (when removing the pair, the element at index + 1 has to go first, otherwise the indices shift):

import java.util.ArrayList;
import java.util.List;

class PairMerging {
    // e.g. reduceToWriters(List.of(30, 50, 5, 60, 90, 5, 80), 3)
    static List<Integer> reduceToWriters(List<Integer> books, int k) {
        List<Integer> writers = new ArrayList<>(books);
        while (writers.size() > k) {
            reduceMinimalPair(writers);
        }
        return writers;
    }

    // Find the adjacent pair with the smallest sum and replace it by that sum.
    static void reduceMinimalPair(List<Integer> items) {
        int index = 0;
        int minValue = items.get(0) + items.get(1);
        for (int i = 1; i < items.size() - 1; i++) {
            if (items.get(i) + items.get(i + 1) < minValue) {
                index = i;
                minValue = items.get(i) + items.get(i + 1);
            }
        }
        items.remove(index + 1);      // remove the second element of the pair first
        items.remove(index);          // then the first
        items.add(index, minValue);   // insert the merged value in their place
    }
}
Let us assume you have books 1..n with pages b1, b2, ..., bn, and K writers.
Initialize a matrix F[1..n, 1..K] = infinity.
Let F[i,1] = sum_{j=1..i} b_j (one writer copies the first i books).
Now, for every k = 2..K:
F[i,k] = min_{j=1..i}( max( F[j,k-1], sum_{r=j+1..i} b_r ) )
That is, the last writer copies books j+1..i and the first k-1 writers split books 1..j as well as possible. The answer is F[n,K].
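A sketch of that recurrence in Java, using prefix sums for the inner page totals (O(K·n²) time overall, assuming K <= n; names follow the formula above):

import java.util.Arrays;

class MinMaxPages {
    // pages[i] = number of pages of book i+1; books are assigned as contiguous
    // ranges from the stack. Returns the minimal possible value of the maximum
    // number of pages any single writer has to copy.
    static long minimalMaximum(int[] pages, int k) {
        int n = pages.length;
        long[] prefix = new long[n + 1];                       // prefix[i] = b1 + ... + bi
        for (int i = 0; i < n; i++) prefix[i + 1] = prefix[i] + pages[i];

        long[][] f = new long[n + 1][k + 1];
        for (long[] row : f) Arrays.fill(row, Long.MAX_VALUE);
        for (int i = 1; i <= n; i++) f[i][1] = prefix[i];      // one writer copies books 1..i
        for (int w = 2; w <= k; w++) {
            for (int i = 1; i <= n; i++) {
                for (int j = w - 1; j < i; j++) {              // last writer copies books j+1..i
                    long candidate = Math.max(f[j][w - 1], prefix[i] - prefix[j]);
                    f[i][w] = Math.min(f[i][w], candidate);
                }
            }
        }
        return f[n][k];   // 145 for pages {30, 50, 5, 60, 90, 5, 80} and k = 3
    }
}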
I think the way you thought of it is the right one, but if you are saying it did not work for big numbers, then maybe you should check whether a number bigger than the average exists and do something else in that case. Maybe remove that number and give it to a writer from the start, or something along those lines.
As an alternative to solving it with dynamic programming, you can also binary search on an upper page limit such that no writer copies more than that number of pages, checking greedily whether K writers suffice for a given limit. When the binary search converges, that limit is the answer.
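A sketch of that binary search in Java: the greedy check fills each writer up to the candidate limit before moving on, and the search range runs from the largest single book up to the total page count (names are my own, not from the original contest):

class BinarySearchPartition {
    // Smallest page limit such that the stack can be split into at most k
    // contiguous groups, each summing to no more than that limit.
    static long minimalMaximum(int[] pages, int k) {
        long low = 0, high = 0;
        for (int p : pages) { low = Math.max(low, p); high += p; }
        while (low < high) {
            long mid = low + (high - low) / 2;
            if (feasible(pages, k, mid)) high = mid;   // limit achievable: try a smaller one
            else low = mid + 1;                        // too tight: need a larger limit
        }
        return low;   // also 145 for the example above
    }

    // Greedy feasibility check: keep giving books to the current writer until
    // the next book would push him over the limit, then start a new writer.
    static boolean feasible(int[] pages, int k, long limit) {
        int writers = 1;
        long current = 0;
        for (int p : pages) {
            if (current + p > limit) { writers++; current = p; }
            else current += p;
        }
        return writers <= k;
    }
}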

Interpolating Large Datasets On the Fly

I have a large data set of about 0.5 million records representing the USD/GBP exchange rate over the course of a given day.
I have an application that wants to be able to graph this data, or maybe a subset of it. For obvious reasons I do not want to plot 0.5 million points on my graph.
What I need is a smaller data set (100 points or so) which represents the given data as accurately as possible. Does anyone know of any interesting and performant ways this can be achieved?
Cheers, Karl
There are several statistical methods for reducing a large dataset to a smaller, easier to visualize dataset. It's not clear from your question what summary statistic you want. I've just assumed that you want to see how the exchange rate changes as a function of time, but perhaps you are interested in how often the exchange rate goes above a certain value, or some other statistic that I'm not considering.
Summarizing a trend over time
Here is an example using the lowess method in R (from the documentation on scatter plot smoothing):
> library(graphics)
# print out the first 10 rows of the cars dataset
> cars[1:10,]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
# plot the original data
> plot(cars, main = "lowess(cars)")
# fit a loess-smoothed line to the points
> lines(lowess(cars), col = 2)
# plot a finer-grained loess-smoothed line to the points
> lines(lowess(cars, f=.2), col = 3)
The parameter f controls how tightly the regression fits to your data. Use some thoughtfulness with this, as you want something that accurately fits your data without overfitting. Rather than speed and distance, you could plot the exchange rate versus time.
It's also straightforward to access the results of the smoothing. Here's how to do that:
> data = lowess( cars$speed, cars$dist )
> data
$x
[1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16 16 17 17 17 18 18 18 18 19 19
[38] 19 20 20 20 20 20 22 23 24 24 24 24 25
$y
[1] 4.965459 4.965459 13.124495 13.124495 15.858633 18.579691 21.280313 21.280313 21.280313 24.129277 24.129277
[12] 27.119549 27.119549 27.119549 27.119549 30.027276 30.027276 30.027276 30.027276 32.962506 32.962506 32.962506
[23] 32.962506 36.757728 36.757728 36.757728 40.435075 40.435075 43.463492 43.463492 43.463492 46.885479 46.885479
[34] 46.885479 46.885479 50.793152 50.793152 50.793152 56.491224 56.491224 56.491224 56.491224 56.491224 67.585824
[45] 73.079695 78.643164 78.643164 78.643164 78.643164 84.328698
The data object that you get back contains entries named x and y, which correspond to the x and y values passed into the lowess function. In this case, x and y represent speed and dist.
One thought is to use the DBMS to compress the data for you using an appropriate query. Something along the lines of having it take a median for a specific range, as a pseudo-query:
SELECT truncate_to_hour(rate_ts), median(rate) FROM exchange_rates
WHERE rate_ts >= start_ts AND rate_ts <= end_ts
GROUP BY truncate_to_hour(rate_ts)
ORDER BY truncate_to_hour(rate_ts)
Where truncate_to_hour is something appropriate to your DBMS. Or take a similar approach with some kind of function to segment the time into unique blocks (such as rounding to the nearest 5-minute interval), or another aggregate function in place of median. Given the complexity of the time-segmenting procedure and how your DBMS optimizes it, it may be more efficient to run the query against a temporary table with the segmented time value precomputed.
If you wanted to write your own, one obvious solution would be to break your record set into fixed number-of-points chunks, for which the value would be the average (mean, median, ... pick one). This has the probable advantage of being the fastest, and shows overall trends.
But it lacks the drama of price ticks. A better solution would probably involve looking for the inflection points, then selecting among them using sliding windows. This has the advantage of better displaying the actual events of the day, but will be slower.
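A minimal sketch of the fixed-chunk idea in Java, reducing the series to roughly targetPoints values by averaging each chunk (swap the mean for a median, or keep a min/max pair per chunk, to preserve more of the "drama"):

class Downsample {
    // Reduce `values` to about `targetPoints` points by splitting it into
    // equally sized chunks and keeping the mean of each chunk.
    static double[] byChunkAverage(double[] values, int targetPoints) {
        int chunkSize = Math.max(1, values.length / targetPoints);
        int chunks = (values.length + chunkSize - 1) / chunkSize;
        double[] result = new double[chunks];
        for (int c = 0; c < chunks; c++) {
            int from = c * chunkSize;
            int to = Math.min(values.length, from + chunkSize);
            double sum = 0;
            for (int i = from; i < to; i++) sum += values[i];
            result[c] = sum / (to - from);
        }
        return result;
    }
}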
Something like RRDTool would do what you need automatically - the tutorial should get you started, and drraw will graph the data.
I use this at work for things like error graphs, I don't need 1-minute resolution for a 6-month time period, only for the most recent few hours. After that I have 1-hour resolution for a few days, then 1-day resolution for a few months.
The naive approach is simply calculating an average per time interval corresponding to a pixel.
http://commons.wikimedia.org/wiki/File:Euro_exchange_rate_to_AUD.svg
This does not show fluctuations. I would suggest also calculating the standard deviation in each time interval and plotting that too (essentially drawing each point as a vertical band rather than a single pixel). I could not locate an example, but I know that Gnuplot can do this (though it is not written in Java).
How about making an enumeration/iterator wrapper? I'm not familiar with Java, but it might look something like this:
import java.util.Enumeration;

class MedianEnumeration implements Enumeration<Double>
{
    private final Enumeration<Double> frameEnum;
    private final int frameSize;

    MedianEnumeration(Enumeration<Double> e, int len) {
        frameEnum = e;
        frameSize = len;
    }

    public boolean hasMoreElements() {
        return frameEnum.hasMoreElements();
    }

    // Consumes up to frameSize elements from the wrapped enumeration and
    // returns their average, so the output sequence is frameSize times shorter.
    public Double nextElement() {
        Double sum = frameEnum.nextElement();
        int i;
        for (i = 1; (i < frameSize) && frameEnum.hasMoreElements(); ++i) {
            sum += frameEnum.nextElement();
        }
        return sum / i;
    }
}
