I have a set of X/Y coordinates [(x, y)], with increasing X representing a timestamp and Y representing a value/measurement at that timestamp.
This set can be huge, and I would like to avoid returning every single point for display. Instead I want to find a smaller subset that represents the overall trend of the measurement (some loss of accuracy in the line graph is acceptable).
So far, I have tried simple uniform sampling: skipping points at a uniform interval, then adding the max/min measurement values to the subset. While this is simple, it doesn't account well for local peaks or valleys if the measurement fluctuates often.
Are there any standard algorithms for solving this type of problem on the server side?
I'd appreciate hearing from anyone who has solved this or knows of any util/common libraries for such problems. I'm on Java, but if there is a reference to a standard algorithm I might try to implement it in Java myself.
It's hard to give a general answer to this question. It all depends on how your datapoints are stored, what properties your chart has, how it is rendered etc.
But as @dmuir suggested, you should check out the Douglas-Peucker algorithm. Another approach I just thought up could be to split the input data into chunks of some size (maybe corresponding to a single horizontal pixel) and then use some statistic (min, max, or average) to render each chunk. If you use running statistics when adding data points to a chunk, this should be O(n), so it's no more expensive than reading in your data points.
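To make the chunk idea a bit more concrete, here is a minimal sketch in Java. It assumes chunks are formed by splitting the points evenly by index (not by pixel width), and it keeps both the minimum and the maximum of each chunk so that peaks and valleys survive. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class MinMaxDownsampler {
    /**
     * Splits the points into `chunks` index ranges and keeps, for each range,
     * the point with the smallest y and the point with the largest y,
     * emitted in their original x order. Runs in a single O(n) pass.
     */
    static List<double[]> downsample(double[] xs, double[] ys, int chunks) {
        if (xs.length != ys.length || chunks < 1) {
            throw new IllegalArgumentException("need equal-length arrays and chunks >= 1");
        }
        List<double[]> out = new ArrayList<>();
        int n = xs.length;
        for (int c = 0; c < chunks; c++) {
            int start = (int) ((long) c * n / chunks);
            int end = (int) ((long) (c + 1) * n / chunks);
            if (start == end) {
                continue; // more chunks than points: this chunk is empty
            }
            int minIdx = start;
            int maxIdx = start;
            for (int i = start + 1; i < end; i++) {
                if (ys[i] < ys[minIdx]) minIdx = i;
                if (ys[i] > ys[maxIdx]) maxIdx = i;
            }
            int first = Math.min(minIdx, maxIdx);
            int second = Math.max(minIdx, maxIdx);
            out.add(new double[] {xs[first], ys[first]});
            if (second != first) {
                out.add(new double[] {xs[second], ys[second]});
            }
        }
        return out;
    }
}
```

Each chunk contributes at most two points, so asking for one chunk per horizontal pixel bounds the output at roughly twice the chart width.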
For a small project we're trying to implement an autopilot for a slot car. A gyro sensor is attached to the car and delivers the Z-value (meaning the amount of centrifugal force acting on the car/sensor) 20 times per second. One crucial part of this is detecting whether the car is in a curve or on a straight part, and exactly when it entered and left that part. Only then can we reliably predict what will happen next.
For now, we're using a sliding window to smooth the data and hardcoded limits (-400 for a left curve and +400 for a right curve) to detect what kind of sector (left, right, straight) we're in.
Obviously this is too slow: because of the smoothing and the hardcoded limits, it takes a couple of messages until the program detects a direction change.
Here's an example of two rounds on a simple track, starting at the checkered area:
A perfect algorithm would detect the sectors S R S R S L S R S R S R S for one round, with a delay of only a couple of data points.
We thought about using the first derivative of the gyro values, but in the sample graph right after the first left curve, the following right curve (between 22:36:40 and 22:36:42) shows signs of swerving. Here the first derivative would be close to 0 and indicate a straight part...
Also, we'd need to set a hardcoded threshold there again, and with the noise in the data, a small bump in the track could produce enough noise that its derivative would exceed the threshold.
Now we're not sure about what would be the easiest/fastest/most reliable way to handle this sort of detection. Would using a derivative be a good idea? Is there a better way?
Any input would be greatly appreciated :)
The existing software is written in Java.
In such problems, you have to trade robustness for immediacy. If you don't know what happens in the future, you can only make assumptions. And these assumptions may hold or may not.
From the looks of your data, there shouldn't be any smoothing necessary. If you define a reasonable threshold, the curves should be recognized quite reliably. If, however, this is not the case, here are some things you could try:
You already mentioned smoothing. The crucial point is how you smooth. An asymmetric smoothing kernel is probably desirable (a half triangle filter can be updated in constant time). You can directly weigh robustness and immediacy by modifying the kernel width.
A simple alternative to filtering is counting. If your data is above the curve threshold, don't call it a curve just yet. Count how many data points are above the threshold in a row. If there are more than n data points above the threshold, then you're most likely in a curve.
Using derivatives is potentially problematic. The main reason against derivatives is that a curve is not defined by any derivative at all (at least no derivative of the force). The second problem is that you can only estimate the derivatives numerically, which is quite unstable with lots of noise. So you would have to smooth your data (or find a numerical scheme for your noise model), which again requires some latency.
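The counting idea above could look roughly like the following Java sketch. The class name, the `minRun` parameter, and the choice to require the same number of consecutive samples for returning to a straight are assumptions for illustration; the thresholds would be the ±400 limits from the question.

```java
final class SectorDetector {
    enum Sector { LEFT, STRAIGHT, RIGHT }

    private final double threshold; // e.g. 400, as in the question
    private final int minRun;       // consecutive samples needed before switching
    private Sector current = Sector.STRAIGHT;
    private Sector candidate = Sector.STRAIGHT;
    private int run = 0;

    SectorDetector(double threshold, int minRun) {
        if (threshold <= 0 || minRun < 1) {
            throw new IllegalArgumentException("threshold must be > 0 and minRun >= 1");
        }
        this.threshold = threshold;
        this.minRun = minRun;
    }

    /** Feeds one gyro sample and returns the sector the detector currently believes in. */
    Sector update(double gyroZ) {
        Sector raw = gyroZ <= -threshold ? Sector.LEFT
                   : gyroZ >= threshold ? Sector.RIGHT
                   : Sector.STRAIGHT;
        if (raw == candidate) {
            run++;
        } else {
            candidate = raw; // a different classification restarts the count
            run = 1;
        }
        if (candidate != current && run >= minRun) {
            current = candidate;
        }
        return current;
    }
}
```

With 20 samples per second, `minRun = 3` adds about 150 ms of delay while ignoring single-sample spikes; raising it trades immediacy for robustness, just like widening a smoothing kernel.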
I have a data set of time series data I would like to display on a line graph. The data is currently stored in an Oracle table and is sampled at 1 point / second. The question is how do I plot the data over a 6 month period of time? Is there a way to downsample the data once it has been returned from Oracle (this can be done in various charts, but I don't want to move the data over the network)? For example, if a query returns 10K points, how can I downsample this to 1K points and still have the line graph keep the visual characteristics (peaks/valleys) of the 10K points?
I looked at Apache Commons, but without knowing exactly what the statistical name for this is, I'm a bit at a loss.
The data I am sampling is indeed time series data such as page hits.
It sounds like what you want is to segment the 10K data points into 1K buckets -- the value of each of these buckets may be any statistic that makes sense for your data (sorry, without actual context it's hard to say). For example, if you want to spot the trend of the data, you might want to use a median or percentile to summarize the 10 points in each bucket. Apache Commons Math has helper functions for that. Then, with the 1K downsampled data points, you can plot the chart.
For example, if I have 10K data points of page load times, I might map that to 1K data points by taking the median of every 10 points -- that will tell me the typical load time within the range -- and plot that. Or, maybe I can use max to find the maximum load time in the period.
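As a rough sketch of the bucket-and-median idea, here is a version that uses only the JDK so it stays self-contained (it does not use Apache Commons Math). The class name and the fixed `bucketSize` parameter are assumptions for illustration.

```java
import java.util.Arrays;

final class BucketMedian {
    /**
     * Groups consecutive values into buckets of `bucketSize` and returns the
     * median of each bucket. The last bucket may be shorter than the others.
     */
    static double[] downsample(double[] values, int bucketSize) {
        if (bucketSize < 1) {
            throw new IllegalArgumentException("bucketSize must be >= 1");
        }
        int buckets = (values.length + bucketSize - 1) / bucketSize;
        double[] out = new double[buckets];
        for (int b = 0; b < buckets; b++) {
            int from = b * bucketSize;
            int to = Math.min(from + bucketSize, values.length);
            double[] bucket = Arrays.copyOfRange(values, from, to);
            Arrays.sort(bucket);
            int mid = bucket.length / 2;
            out[b] = (bucket.length % 2 == 1)
                    ? bucket[mid]
                    : (bucket[mid - 1] + bucket[mid]) / 2.0;
        }
        return out;
    }
}
```

Calling `downsample(values, 10)` on 10K values gives 1K medians. Swapping the median for a max (or min) is a small change if peaks matter more than the typical value.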
There are two options: you can do as @Adrian Pang suggests and use time bins, which means you have bins with hard boundaries between them. This is perfectly fine, and it's called downsampling if you're working with a time series.
You can also use a smooth bin definition by applying a sliding window average/function convolution to points. This will give you a time series at the same sampling rate as your original, but much smoother. Prominent examples are the sliding window average (mean/median of all points in the window, equally weighted average) and Gaussian convolution (weighted average where the weights come from a Gaussian density curve).
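For the equally weighted case, a sliding window mean can be sketched as below. This assumes a window centered on each point and truncated at the edges of the series; `halfWidth` is a made-up parameter name. Prefix sums keep the whole pass O(n) regardless of window size.

```java
final class SlidingAverage {
    /**
     * Returns a series of the same length as the input where each value is the
     * mean of the points within `halfWidth` positions on either side.
     */
    static double[] smooth(double[] values, int halfWidth) {
        if (halfWidth < 0) {
            throw new IllegalArgumentException("halfWidth must be >= 0");
        }
        int n = values.length;
        double[] prefix = new double[n + 1]; // prefix[i] = sum of values[0..i-1]
        for (int i = 0; i < n; i++) {
            prefix[i + 1] = prefix[i] + values[i];
        }
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            int lo = Math.max(0, i - halfWidth);
            int hi = Math.min(n - 1, i + halfWidth);
            out[i] = (prefix[hi + 1] - prefix[lo]) / (hi - lo + 1);
        }
        return out;
    }
}
```

Note that the output has as many points as the input, so for plotting over long ranges it would still need to be combined with binning.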
My advice is to average the values over shorter time intervals. Make the length of the shorter interval dependent on the overall time range. If the overall time range is short enough, just display the raw data. E.g.:
overall = 1 year: let subinterval = 1 day
overall = 1 month: let subinterval = 1 hour
overall = 1 day: let subinterval = 1 minute
overall = 1 hour: no averaging, just use raw data
You will have to make some choices about where to shift from one subinterval to another, e.g., for overall = 5 months, is subinterval = 1 day or 1 hour?
My advice is to make a simple scheme so that it is easy for others to comprehend. Remember that the purpose of the plot is to help someone else (not you) understand the data. A simple averaging scheme will help get you to that goal.
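One way to pin down such a scheme is a small lookup like the sketch below. The cut-over points (1 hour, 1 day, 31 days) are my own assumption, since the table above leaves the in-between cases open; `Duration.ZERO` is used here as a marker for "no averaging, use raw data".

```java
import java.time.Duration;

final class SubintervalChooser {
    /**
     * Picks the averaging subinterval for a given overall time range.
     * Returns Duration.ZERO when the raw data should be plotted as-is.
     */
    static Duration choose(Duration overall) {
        if (overall.compareTo(Duration.ofHours(1)) <= 0) {
            return Duration.ZERO;          // up to 1 hour: raw data
        }
        if (overall.compareTo(Duration.ofDays(1)) <= 0) {
            return Duration.ofMinutes(1);  // up to 1 day: 1-minute averages
        }
        if (overall.compareTo(Duration.ofDays(31)) <= 0) {
            return Duration.ofHours(1);    // up to about 1 month: hourly averages
        }
        return Duration.ofDays(1);         // anything longer: daily averages
    }
}
```

Under these assumed boundaries, a 5 month range falls into the daily bucket.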
If all you need is to reduce the number of points in your visualization without losing visual information, I suggest using the code here. The tricky part of this approach is finding the right threshold, where the threshold is the number of data points you want to have after downsampling. The lower the threshold, the more visual information you lose. Going from 10K to 1K is feasible, though; I have tried it with a similar amount of data.
As a side note, keep in mind:
The quality of your visualization depends on the number of points and the size (in pixels) of your chart, meaning that bigger charts need more data.
Any further analysis may not return correct results if it is applied to the downsampled data. Or at least I haven't seen anyone prove the opposite.
I have 2D hydraulic data, which are multigigabyte text files containing depth and velocity information for each point in a grid, broken up into time steps. Each timestep contains a depth/velocity value for every point in the grid. So you could follow one point through each timestep and see how its depth/velocity changes. I want to read in this data one timestep at a time, calculating various things - the maximum depth a grid cell achieves, max velocity, the number of the first timestep where water is more than 2 feet deep, etc. The results of each of these calculations will be a grid - max depth at each point, etc.
So far, this sounds like the Decorator pattern. However, I'm not sure how to get the results out of the various calculations - each calculation produces a different grid. I would have to keep references to each decorator after I create it in order to extract the results from it, or else add a getResults() method that returns a map of different results, etc, neither of which sound ideal.
Another option is the Strategy pattern. Each calculation is a different algorithm that operates on a time step (current depth/velocity) and the results of previous rounds (max depth so far, max velocity so far, etc). However, these previous results are different for each computation - which means either the algorithm classes become stateful, or it becomes the caller's job to keep track of previous results and feed them in. I also dislike the Strategy pattern because the behavior of looping over the timesteps becomes the caller's responsibility - I'd like to just give the "calculator" an iterator over the timesteps (fetching them from the disk as needed) and have it produce the results it needs.
Additional constraints:
Input is large and being read from disk, so iterating exactly once, by time step, is the only practical method
Grids are large, so calculations should be done in place as much as possible
If I understand your problem correctly, you have grid points with many timesteps, each timestep has a depth and a velocity, and you now have GBs of data.
I would suggest doing one pass over the data and storing the parsed data in an RDBMS, then running queries or stored procedures on this data. This way, at least the application will not run out of memory.
First, maybe I haven't understood the issue well and my answer misses the point, in which case I apologize for taking your time.
At first sight I would think of an approach that's more akin to the "strategy pattern", in combination with a data-oriented base, something like the following pseudo-code:
foreach timeStep
    readGridData
    foreach activeCalculator in activeCalculators
        useCalculatorPointerListToAccessSpecificStoredDataNeededForNewCalculation
        performCalculationOnFreshGridData
        updateUpdatableData
        presentUpdatedResultsToUser
        storeGridResultsInDataPool(OfResultBaseClassType)
        discardNoLongerNeededStoredGridResults
    next calculator
next timeStep
Again, sorry if this is off the point.
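For what it's worth, the loop in the pseudo-code could be sketched in Java along these lines. Everything here is hypothetical: the grid is assumed to be flattened into one `double[]` per quantity, each calculator is stateful and updates its own result grid in place, and `-1` marks cells that never exceeded the depth limit.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class GridPass {
    /** One time step of the flattened grid (one array entry per cell). */
    static final class TimeStep {
        final double[] depth;
        final double[] velocity;
        TimeStep(double[] depth, double[] velocity) {
            this.depth = depth;
            this.velocity = velocity;
        }
    }

    /** A calculation that owns its running result grid. */
    interface Calculator {
        void accept(int stepIndex, TimeStep step);
        double[] result();
    }

    static final class MaxDepth implements Calculator {
        private final double[] max;
        MaxDepth(int cells) {
            max = new double[cells];
            Arrays.fill(max, Double.NEGATIVE_INFINITY);
        }
        public void accept(int stepIndex, TimeStep step) {
            for (int i = 0; i < max.length; i++) {
                max[i] = Math.max(max[i], step.depth[i]);
            }
        }
        public double[] result() { return max; }
    }

    static final class FirstStepDeeperThan implements Calculator {
        private final double limit;
        private final double[] first;
        FirstStepDeeperThan(int cells, double limit) {
            this.limit = limit;
            first = new double[cells];
            Arrays.fill(first, -1); // -1 means "never exceeded the limit"
        }
        public void accept(int stepIndex, TimeStep step) {
            for (int i = 0; i < first.length; i++) {
                if (first[i] < 0 && step.depth[i] > limit) {
                    first[i] = stepIndex;
                }
            }
        }
        public double[] result() { return first; }
    }

    /** Iterates the time steps exactly once and feeds each one to every calculator. */
    static void run(Iterator<TimeStep> steps, List<? extends Calculator> calculators) {
        int index = 0;
        while (steps.hasNext()) {
            TimeStep step = steps.next();
            for (Calculator c : calculators) {
                c.accept(index, step);
            }
            index++;
        }
    }
}
```

The caller only hands over an iterator (which can read from disk lazily) and a list of calculators, then asks each calculator for its `result()` afterwards; only one time step needs to be in memory at a time.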
I am implementing a project which needs to cluster geographical points. The OPTICS algorithm seems to be a very nice solution. It needs just 2 parameters as input (MinPts and Epsilon), which are, respectively, the minimum number of points needed to consider them a cluster, and the distance value used to decide whether two points can be placed in the same cluster.
My problem is that, due to the extreme variety of the points, I can't set a fixed epsilon.
Just look at the image below.
The same point structure at a different scale would give a very different result. Suppose we set MinPts = 2 and epsilon = 1 km.
On the left, the algorithm would create 2 clusters (red and blue), but on the right it would create one single cluster containing all of the points (red). I would like to obtain 2 clusters on the right as well.
So my question is: is there any kind of way to calculate dynamically the epsilon value to get this result?
EDIT 05 June 2012 3.15pm:
I thought I was using the OPTICS algorithm implementation from the javaml library, but it seems it is actually a DBSCAN algorithm implementation.
So the question now is: does anybody know a java based implementation of OPTICS algorithm?
Thank you very much, and excuse me for my poor English.
Marco
The epsilon value in OPTICS is solely to limit the runtime complexity when using index structures. If you do not have an index for acceleration, you can set it to infinity.
To quote Wikipedia on OPTICS
The parameter ε is strictly speaking not necessary. It can be set to a maximum value. When a spatial index is available, it does however play a practical role when it comes to complexity.
What you seem to have looks much more like DBSCAN than OPTICS. In OPTICS, you should not need to choose epsilon (it should have been called max-epsilon by the authors!), but your cluster extraction method will take care of that. Are you using the Xi extraction proposed in the OPTICS paper?
minPts is much more important. You should try a value of at least 5 or 10, not 2. With 2, you are essentially performing single-linkage clustering!
The example you gave above should work fine once you increase minPts!
Re: edit: As you can even see in the Wikipedia article, ELKI has a proper OPTICS implementation and it's in Java.
You could try scaling epsilon by the total size of the enclosing rectangle. For example, your left data is about 4km x 6km (using my Mark I eyeball to measure) and the right is about 2km x 2km. So, epsilon on the right should be about 2.5 times smaller.
Of course, this doesn't work reliably. If, on your right hand data, there were an additional single point 4km to the right and 2km down, that would make the enclosing rectangle for the right the same as on the left, and you'd get similar (wrong) results.
You can try building a minimum spanning tree and then removing its longest edge. That splits the tree into two subtrees; the center of each one is a good center for OPTICS, and you can count the number of points around it.
In your explanation above, it is the change in scale which creates the uncertainty. When your scale gets bigger, your epsilon should change accordingly. Because they are at two very different scales, the two images you've presented are NOT the same set of points. They will not respond identically to your OPTICS algorithm without changing the parameters.
In short, no, there is no way to dynamically calculate epsilon to get this result. Clustering like this is already NP-hard, and these clustering algorithms (OPTICS, k-means, Voronoi) can only approximate the optimal solution.
Problem Constraints
Size of the data set, but not the data itself, is known.
Data set grows by one data point at a time.
Trend line is graphed one data point at a time (using a spline/Bezier curve).
Graphs
The collage below shows data sets with reasonably accurate trend lines:
The graphs are:
Upper-left. By hour, with ~24 data points.
Upper-right. By day for one year, with ~365 data points.
Lower-left. By week for one year, with ~52 data points.
Lower-right. By month for one year, with ~12 data points.
User Inputs
The user can select:
the type of time series (hourly, daily, monthly, quarterly, annual); and
the start and end dates for the time series.
For example, the user could select a daily report for 30 days in June.
Trend Weight
To calculate the window size (i.e., the number of data points to average when calculating the trend line), the following expression is used:
data points / trend weight
Where data points is derived from user inputs and trend weight is 6.4. Even though a trend weight of 6.4 produces good fits, it is rather arbitrary, and might not be appropriate for different user inputs.
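For concreteness, the expression can be written as the small Java helper below. Rounding down and clamping the result to at least 1 are assumptions of this sketch, not something stated above.

```java
final class TrendWindow {
    /**
     * Window size = data points / trend weight, rounded down,
     * but never smaller than one data point.
     */
    static int windowSize(int dataPoints, double trendWeight) {
        if (dataPoints < 1 || trendWeight <= 0) {
            throw new IllegalArgumentException("dataPoints >= 1 and trendWeight > 0 required");
        }
        return Math.max(1, (int) Math.floor(dataPoints / trendWeight));
    }
}
```

With a trend weight of 6.4 this gives windows of 3, 57, 8 and 1 points for the ~24, ~365, ~52 and ~12 point graphs respectively.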
Question
How should trend weight be calculated given the constraints of this problem?
Based on the looks of the graphs I would say you have too many points for your 12 point graph (it is just a spline of the points given... which is visually pleasing, but actually does more harm than good when trying to understand the trend) and too few points for your 365 point graph. Perhaps try doing something a little exponential like:
(Data points)^1.2/14.1
I do realize this is even more arbitrary than what you already have, but arbitrary isn't the worst thing in the world.
(I got 14.1 by trying to keep the 52 point graph fixed, since that one looks nice, by taking (52^(1.2)/52)*6.4=14.1. Using this technique, you could try other powers besides 1.2 to see what you get visually.)
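A small sketch of that calculation, in case you want to experiment with other exponents. The method names are made up; `divisorFor` simply re-derives the constant so that the window at the anchor size (52 points here) matches the one given by the linear formula with weight 6.4.

```java
final class PowerLawWindow {
    /** Window size from the power-law formula: dataPoints^exponent / divisor. */
    static double windowSize(int dataPoints, double exponent, double divisor) {
        return Math.pow(dataPoints, exponent) / divisor;
    }

    /**
     * Divisor that keeps the window at `anchorPoints` equal to
     * anchorPoints / linearWeight, i.e. (anchor^exponent / anchor) * linearWeight.
     */
    static double divisorFor(int anchorPoints, double exponent, double linearWeight) {
        return Math.pow(anchorPoints, exponent) / anchorPoints * linearWeight;
    }
}
```

For example, `divisorFor(52, 1.2, 6.4)` comes out at roughly 14.1, and plugging that back into `windowSize(52, 1.2, ...)` reproduces the 52 / 6.4 window, while smaller data sets get a proportionally smaller window and larger ones a proportionally larger window.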
Dan
I voted this up for the quality of your results and the clarity of your write-up. I wish I could offer an answer that could improve on your already excellent work.
I fear that it might be a matter of trial and error with the trend weight until you see an improved fit.
It could be that you could make this an input from users as well: allow them to fiddle with the value, given realistic constraints, until they get satisfactory values.
I also wondered if the weight would be different for each graph, since the number of points in each is different. Are you trying to get a single weighting that works for all graphs?
Excellent work; a nice question. Well done. I wish I was more helpful. Perhaps someone else will have more wisdom to impart than I do.
It might look like the trend lines are accurate in those 4 graphs, but they are really quite off. (This is best seen at the beginning of the lower-left one and the beginning of the upper-right one.) I would think that you would want to use no fewer than half of your points when finding the trend line (though really you should use many more than half). I would suggest a trend weight of 2 at most, though really you ought to stick closer to the 1-1.5 range. Since it is arbitrary, I would suggest you give your user an "accuracy of trend line" slider, where the most accurate setting uses a trend weight of 1 and the least accurate uses a weight of (number of data points + 1). The latter would use 0 points (assuming you always round down) and, I would assume, though your statistics software might be different, will generate a straight horizontal line.