Sampling numerical arrays in Java

I have a data set of time series data I would like to display on a line graph. The data is currently stored in an Oracle table and is sampled at 1 point per second. The question is how do I plot the data over a 6 month period of time? Is there a way to down sample the data once it has been returned from Oracle (this can be done in various charting packages, but I don't want to move the data over the network)? For example, if a query returns 10K points, how can I down sample this to 1K points and still keep the visual characteristics (peaks/valleys) of the 10K points in the line graph?
I looked at Apache Commons but without knowing exactly what the statistical name for this is, I'm a bit at a loss.
The data I am sampling is indeed time series data such as page hits.

It sounds like what you want is to segment the 10K data points into 1K buckets -- the value of each bucket can be whatever statistic makes sense for your data (sorry, without actual context it's hard to say). For example, if you want to spot the trend of the data, you might summarize the 10 points in each bucket with the median. Apache Commons Math has helper functions for that. Then, with the 1K downsampled data points, you can plot the chart.
For example, if I have 10K data points of page load times, I might map that to 1K data points by taking the median of every 10 points -- that tells me the most typical load time within each range -- and plot that. Or I could use the max to find the maximum load time in each period.
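A minimal sketch of that bucketing, assuming Apache Commons Math 3 (org.apache.commons.math3) is on the classpath; the bucket statistic here is the median via Percentile, but max or mean are one-line swaps:

    import java.util.Arrays;
    import org.apache.commons.math3.stat.descriptive.rank.Percentile;

    public class Downsampler {

        /** Collapse raw points into buckets of bucketSize, keeping the median of each bucket. */
        public static double[] downsampleByMedian(double[] raw, int bucketSize) {
            int buckets = (raw.length + bucketSize - 1) / bucketSize;  // round up for a partial last bucket
            double[] result = new double[buckets];
            Percentile median = new Percentile(50.0);
            for (int b = 0; b < buckets; b++) {
                int from = b * bucketSize;
                int to = Math.min(from + bucketSize, raw.length);
                result[b] = median.evaluate(Arrays.copyOfRange(raw, from, to));
            }
            return result;
        }
    }

Swapping median.evaluate(...) for StatUtils.max(...) gives the max-per-bucket variant.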

There are two options: you can do as @Adrian Pang suggests and use time bins, which means you have bins with hard boundaries between them. This is perfectly fine, and it's called downsampling when you're working with a time series.
You can also use a smooth bin definition by applying a sliding window average/function convolution to the points. This gives you a time series at the same sampling rate as the original, but much smoother. Prominent examples are the sliding window average (the mean/median of all points in the window, i.e., an equally weighted average) and Gaussian convolution (a weighted average where the weights come from a Gaussian density curve).
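A quick sketch of the sliding-window variant in plain Java (no library assumed); replacing the equal weights with samples of a Gaussian turns it into the convolution version:

    /** Centered moving average: each output point is the mean of the points within +/- halfWindow. */
    public static double[] slidingAverage(double[] values, int halfWindow) {
        double[] smoothed = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            int from = Math.max(0, i - halfWindow);
            int to = Math.min(values.length - 1, i + halfWindow);  // window is clipped at the edges
            double sum = 0.0;
            for (int j = from; j <= to; j++) {
                sum += values[j];
            }
            smoothed[i] = sum / (to - from + 1);
        }
        return smoothed;
    }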

My advice is to average the values over shorter time intervals. Make the length of the shorter interval dependent on the overall time range. If the overall time range is short enough, just display the raw data. E.g.:
overall = 1 year: let subinterval = 1 day
overall = 1 month: let subinterval = 1 hour
overall = 1 day: let subinterval = 1 minute
overall = 1 hour: no averaging, just use raw data
You will have to make some choices about where to shift from one subinterval to another, e.g., for overall = 5 months, should the subinterval be 1 day or 1 hour?
My advice is to make a simple scheme so that it is easy for others to comprehend. Remember that the purpose of the plot is to help someone else (not you) understand the data. A simple averaging scheme will help get you to that goal.
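A minimal sketch of such a scheme using java.time (the exact breakpoints are assumptions; adjust them to whatever reads best for your users):

    import java.time.Duration;

    public class SubintervalChooser {

        /** Pick the averaging subinterval from the overall plotted range, per the scheme above. */
        public static Duration subintervalFor(Duration overall) {
            if (overall.compareTo(Duration.ofHours(1)) <= 0) {
                return Duration.ZERO;                  // short enough: plot the raw data
            } else if (overall.compareTo(Duration.ofDays(1)) <= 0) {
                return Duration.ofMinutes(1);
            } else if (overall.compareTo(Duration.ofDays(31)) <= 0) {
                return Duration.ofHours(1);
            } else {
                return Duration.ofDays(1);
            }
        }
    }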

If all you need is to reduce the number of points in your visualization without losing any visual information, I suggest using the code here. The tricky part of this approach is finding the correct threshold, where the threshold is the number of data points you want to end up with after downsampling. The lower the threshold, the more visual information you lose. However, going from 10K to 1K is feasible, since I have tried it with a similar amount of data.
As a side note, you should keep in mind:
The quality of your visualization depends on the number of points and the size (in pixels) of your chart, meaning that bigger charts need more data.
Any further analysis may not return correct results if it is applied to the downsampled data. Or at least I haven't seen anyone prove the opposite.

Related

Sampling a smaller set of line graph points without losing trends

Given a set of X/Y coordinates [(x, y)] with increasing X (representing a timestamp) and Y representing a value/measurement at that timestamp.
This set can possibly be huge, and I would like to avoid returning every single point in the set for display, and instead find a smaller subset that represents the overall trend of the measurement (some level of accuracy loss in the line graph is acceptable).
So far, I have tried simple uniform sampling of the measurement, skipping points at a uniform interval and then adding the max/min measurement values to the subset. While this is simple, it doesn't account well for local peaks or valleys if the measurement fluctuates often.
I'm wondering if there are any standard algorithms that deal with this type of problem on the server side?
I'd appreciate it if anyone has solved this or knows of any util/common libraries that solve such problems. I'm on Java, but if there is a reference to standard algorithms I might try to implement one in Java.
It's hard to give a general answer to this question. It all depends on how your data points are stored, what properties your chart has, how it is rendered, etc.
But as @dmuir suggested, you should check out the Douglas-Peucker algorithm. Another approach could be to split the input data into chunks of some size (maybe corresponding to a single horizontal pixel) and then use some statistic (min, max, or average) to render each chunk. If you use running statistics when adding data points to a chunk, this is O(n), so it's no more expensive than reading in your data points.
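A rough sketch of that per-chunk reduction in plain Java; keeping both min and max per chunk (rather than a single average) is an assumption here, chosen because it preserves peaks and valleys:

    /** Reduce raw samples to one (min, max) pair per horizontal pixel column. */
    public static double[][] chunkMinMax(double[] values, int pixelColumns) {
        double[][] result = new double[pixelColumns][2];
        int chunkSize = (int) Math.ceil(values.length / (double) pixelColumns);
        for (int c = 0; c < pixelColumns; c++) {
            int from = c * chunkSize;
            int to = Math.min(from + chunkSize, values.length);
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            for (int i = from; i < to; i++) {          // running min/max: one pass, O(n) overall
                min = Math.min(min, values[i]);
                max = Math.max(max, values[i]);
            }
            result[c][0] = min;
            result[c][1] = max;
        }
        return result;
    }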

HdrHistogram: how to control the number of buckets in outputPercentileDistribution()?

I've been using the HdrHistogram library in Java to monitor the distribution of a certain value in my system.
I decided to take a shortcut and use outputPercentileDistribution() to let HdrHistogram show me what it thinks of my data.
The output has been useful, but I have a hard time understanding how HdrHistogram controls the number of buckets it prints.
The number is controlled by the function argument:
Produce textual representation of the value distribution of histogram data by percentile. The distribution is output with exponentially increasing resolution, with each exponentially decreasing half-distance containing dumpTicksPerHalf percentile reporting tick points.
percentileTicksPerHalfDistance - The number of reporting points per exponentially decreasing half-distance
I do not understand exactly how it's translated into buckets. I did notice that the larger the number that I pass, the more buckets I get.
Can someone explain exactly how the buckets are set up?
After looking at the source code, I think I see what's going on there.
The function argument is slightly misnamed. It really should be percentileBucketsPerHalfDistance.
The system takes half the distance to 100% (initially 50%) and splits it into the given number of buckets. So, for percentileBucketsPerHalfDistance = 2 we get buckets at 0 and 25%.
Then the system takes half of what's left to 100% again, say 50% to 75%, and splits it into the required number of buckets, giving 50% and 62.5%.
Then we deal with 75% to 87.5%, split into 2 again, so we get 75% and 81.25%.
Eventually we get so close to 100% that we overshoot and get 99.9997138977 and 100.0.
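If it helps to see that progression in code, here is an illustrative sketch (not the actual HdrHistogram source) that reproduces the tick percentiles described above for percentileTicksPerHalfDistance = 2:

    public class PercentileTicks {
        public static void main(String[] args) {
            int ticksPerHalfDistance = 2;   // the argument to outputPercentileDistribution()
            double p = 0.0;
            for (int i = 0; i < 16; i++) {
                System.out.printf("%.10f%n", p);
                // Each time the remaining distance to 100% halves, the tick spacing halves too.
                long halvings = (long) (Math.log(100.0 / (100.0 - p)) / Math.log(2.0)) + 1;
                p += 100.0 / (ticksPerHalfDistance * Math.pow(2.0, halvings));
            }
        }
    }

This prints 0, 25, 50, 62.5, 75, 81.25, 87.5, 90.625, ... and creeps toward 100 exactly as the output does.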

What design pattern is appropriate for this situation?

I have 2D hydraulic data, which are multigigabyte text files containing depth and velocity information for each point in a grid, broken up into time steps. Each timestep contains a depth/velocity value for every point in the grid. So you could follow one point through each timestep and see how its depth/velocity changes. I want to read in this data one timestep at a time, calculating various things - the maximum depth a grid cell achieves, max velocity, the number of the first timestep where water is more than 2 feet deep, etc. The results of each of these calculations will be a grid - max depth at each point, etc.
So far, this sounds like the Decorator pattern. However, I'm not sure how to get the results out of the various calculations - each calculation produces a different grid. I would have to keep references to each decorator after I create it in order to extract the results from it, or else add a getResults() method that returns a map of different results, etc, neither of which sound ideal.
Another option is the Strategy pattern. Each calculation is a different algorithm that operates on a time step (current depth/velocity) and the results of previous rounds (max depth so far, max velocity so far, etc). However, these previous results are different for each computation - which means either the algorithm classes become stateful, or it becomes the caller's job to keep track of previous results and feed them in. I also dislike the Strategy pattern because the behavior of looping over the timesteps becomes the caller's responsibility - I'd like to just give the "calculator" an iterator over the timesteps (fetching them from the disk as needed) and have it produce the results it needs.
Additional constraints:
Input is large and being read from disk, so iterating exactly once, by time step, is the only practical method
Grids are large, so calculations should be done in place as much as possible
If I understand your problem right, you have grid points, each of which has many timesteps, and each timestep has a depth and a velocity. And you have GBs of data.
I would suggest doing one pass over the data and storing the parsed data in an RDBMS, then running queries or stored procedures on that data. This way, at least, the application will not run out of memory.
First, maybe I've not understood the issue well and am missing the point in my answer, in which case I apologize for taking your time.
At first sight I would think of an approach that's more akin to the strategy pattern, in combination with a data-oriented base, something like the following pseudo-code:
foreach timeStep
    readGridData
    foreach activeCalculator in activeCalculators
        useCalculatorPointerListToAccessSpecificStoredDataNeededForNewCalculation
        performCalculationOnFreshGridData
        updateUpdatableData
        presentUpdatedResultsToUser
        storeGridResultsInDataPool(OfResultBaseClassType)
        discardNoLongerNeededStoredGridResults
    next calculator
next timeStep
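To make that concrete, here is a rough Java sketch of the calculator idea; the Calculator interface and MaxDepthCalculator are made-up names for illustration, and the driver (not shown) would iterate the timesteps from disk once, hand each one to every registered calculator, and collect each result() grid at the end:

    /** One running calculation, e.g. the maximum depth seen so far at each grid cell. */
    interface Calculator {
        void accept(double[][] depth, double[][] velocity);  // update internal state from one timestep
        double[][] result();                                  // the accumulated result grid
    }

    class MaxDepthCalculator implements Calculator {
        private double[][] maxDepth;

        public void accept(double[][] depth, double[][] velocity) {
            if (maxDepth == null) {
                maxDepth = new double[depth.length][depth[0].length];
            }
            for (int r = 0; r < depth.length; r++) {
                for (int c = 0; c < depth[r].length; c++) {
                    maxDepth[r][c] = Math.max(maxDepth[r][c], depth[r][c]);  // in-place, no extra grids
                }
            }
        }

        public double[][] result() {
            return maxDepth;
        }
    }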
Again, sorry if this is off the point.

Best fit curve for trend line

Problem Constraints
Size of the data set, but not the data itself, is known.
Data set grows by one data point at a time.
Trend line is graphed one data point at a time (using a spline/Bezier curve).
Graphs
The collage below shows data sets with reasonably accurate trend lines:
The graphs are:
Upper-left. By hour, with ~24 data points.
Upper-right. By day for one year, with ~365 data points.
Lower-left. By week for one year, with ~52 data points.
Lower-right. By month for one year, with ~12 data points.
User Inputs
The user can select:
the type of time series (hourly, daily, monthly, quarterly, annual); and
the start and end dates for the time series.
For example, the user could select a daily report for 30 days in June.
Trend Weight
To calculate the window size (i.e., the number of data points to average when calculating the trend line), the following expression is used:
data points / trend weight
Where data points is derived from user inputs and trend weight is 6.4. Even though a trend weight of 6.4 produces good fits, it is rather arbitrary, and might not be appropriate for different user inputs.
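For concreteness, a small sketch of how that window size might feed a trailing moving-average trend line (the rounding and the minimum window of 1 are assumptions):

    /** Window size used for the trend line: data points divided by the trend weight. */
    static int windowSize(int dataPoints, double trendWeight) {
        return Math.max(1, (int) Math.round(dataPoints / trendWeight));
    }

    /** Trend line as a trailing moving average over windowSize points. */
    static double[] trendLine(double[] values, double trendWeight) {
        int window = windowSize(values.length, trendWeight);
        double[] trend = new double[values.length];
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
            if (i >= window) {
                sum -= values[i - window];   // drop the point that just left the window
            }
            trend[i] = sum / Math.min(i + 1, window);
        }
        return trend;
    }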
Question
How should trend weight be calculated given the constraints of this problem?
Based on the looks of the graphs I would say you have too many points for your 12 point graph (it is just a spline of the points given... which is visually pleasing, but actually does more harm than good when trying to understand the trend) and too few points for your 365 point graph. Perhaps try doing something a little exponential like:
(Data points)^1.2/14.1
I do realize this is even more arbitrary than what you already have, but arbitrary isn't the worst thing in the world.
(I got 14.1 by trying to keep the 52-point graph fixed, since that one looks nice: (52^1.2 / 52) * 6.4 = 14.1.) Using this technique you could try other powers besides 1.2 to see what you get visually.
Dan
I voted this up for the quality of your results and the clarity of your write-up. I wish I could offer an answer that could improve on your already excellent work.
I fear that it might be a matter of trial and error with the trend weight until you see an improved fit.
It could be that you could make this an input from users as well: allow them to fiddle with the value, given realistic constraints, until they get satisfactory values.
I also wondered if the weight would be different for each graph, since the number of points in each is different. Are you trying to get a single weighting that works for all graphs?
Excellent work; a nice question. Well done. I wish I was more helpful. Perhaps someone else will have more wisdom to impart than I do.
It might look like the trend lines are accurate in those 4 graphs, but they're really quite off. (This is best seen at the beginning of the lower-left one and the beginning of the upper-right one.) I would think that you would want to use no less than half of your points when finding the trend line (though really you should use much more than half), so I would suggest a trend weight of 2 at a maximum. Really, you ought to stick closer to the 1-1.5 range. Since it is arbitrary, I would suggest you give your user an "accuracy of trend line" slider where the most accurate setting uses a trend weight of 1 and the least accurate uses a weight of (number of data points) + 1. That setting would use 0 points (assuming you always round down) and, I would assume (though your statistics software might be different), will generate a straight horizontal line.

Is there a good algorithm to check for changes in data over a specified period of time?

We have around 7k financial products whose closing prices should theoretically move up and down within a certain percentage range throughout a defined period of time (say a one week or month period).
I have access to an internal system that stores these historical prices (not a relational database!). I would like to produce a report that lists any products whose price has not moved at all or less than say 10% over the time period.
I can't just compare the first value (day 1) to the value at the end (day n), as the price could have moved back to its starting value by the last day, which would lead to a false positive even though the price could have spiked somewhere in between.
Are there any established algorithms to do this in reasonable compute time?
There isn't any way to do this without looking at every single day.
Suppose the data looks like such:
oooo0oooo
With that one-day spike in the middle. You're not going to catch that unless you check the day that the spike happens - in other words, you need to check every single day.
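That single pass is cheap, though. A minimal sketch of the 10% check (defining "moved less than 10%" as the min-to-max range relative to the starting price is an assumption; adjust to your own definition):

    /** True if the closing price stayed within maxFraction (e.g. 0.10) of its starting value over the period. */
    static boolean barelyMoved(double[] closingPrices, double maxFraction) {
        double min = closingPrices[0];
        double max = closingPrices[0];
        for (double p : closingPrices) {   // one look at every single day
            min = Math.min(min, p);
            max = Math.max(max, p);
        }
        return (max - min) <= maxFraction * closingPrices[0];
    }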
If this needs to be checked often (for a large number of intervals, like daily for the last year, and for the same set of products), you can store the high and low values of each item per week/month. By combining the right weekly and/or monthly bounds with some raw data at the edges of the interval, you can get the minimum and maximum value over the interval.
If you can add data to kdb (i.e. you're not limited to read access) you might consider adding the 'number of days since last price change' as a new set of data (i.e. one number per financial instrument). A daily task would then fetch today's mark and yesterday's, and update the numbers stored. Similarly you could maintain recent (last month, last year) highs and lows in kdb. You'd have to run a job over the larger dataset to prime the values initially, but then your daily updates will involve much less data.
Recommend that if you adopt something like this you have some way to rerun for all or part of the dataset (say for adding a new product).
Lastly - is the history normalised against current prices? (i.e. are revaluations for stock splits or similar taken into account). If not, you'd need to detect these discontinuities and divide them out.
EDIT
I'd investigate using kdb+/Q to implement the signal processing, rather than extracting the raw data to a Java application. As you say, it's highly performant.
You can do this if you can keep track of the min and max value of the price during the time interval - this assumes that the time interval is not being constantly changed. One way of keeping track of the min and max values of a changing set of items is with two heaps placed 'back to back' - you could store this, plus the pointers needed to find and remove old items, in one or two arrays in your store. The idea of putting two heaps back to back is in Knuth's Art of Computer Programming Vol. 3 as Exercise 31 of section 5.2.3. Knuth calls this sort of beast a Priority Dequeue, and it seems to be searchable under that name. Min and max are available at constant cost; the cost of modifying it when a new price arrives is O(log n), where n is the number of items stored.
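A full priority dequeue is a bit long to sketch here, but a TreeMap used as a multiset is a simpler stand-in with the same guarantees: min, max, and removal of an expired price are all O(log n).

    import java.util.TreeMap;

    /** Tracks min and max of a changing collection of prices. */
    class MinMaxTracker {
        private final TreeMap<Double, Integer> counts = new TreeMap<>();

        void add(double price) {
            counts.merge(price, 1, Integer::sum);
        }

        void remove(double price) {   // e.g. a price that has left the sliding interval
            counts.computeIfPresent(price, (k, c) -> c > 1 ? c - 1 : null);
        }

        double min() { return counts.firstKey(); }
        double max() { return counts.lastKey(); }
    }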
