HdrHistogram: how to control the number of buckets in outputPercentileDistribution()?

I've been using the HdrHistogram library in Java to monitor the distribution of a certain number in my system.
I decided to take a shortcut and use outputPercentileDistribution to let HdrHistogram show me what it thinks of my data.
The output has been useful, but I have a hard time understanding how HdrHistogram controls the number of buckets it prints.
The number is controlled by the function argument. From the Javadoc:
Produce textual representation of the value distribution of histogram data by percentile. The distribution is output with exponentially increasing resolution, with each exponentially decreasing half-distance containing dumpTicksPerHalf percentile reporting tick points.
percentileTicksPerHalfDistance: The number of reporting points per exponentially decreasing half-distance
I do not understand exactly how it's translated into buckets. I did notice that the larger the number that I pass, the more buckets I get.
Can someone explain exactly how the buckets are set up?

After looking at the source code, I think I see what's going on there.
The function argument is slightly misnamed. It really should be percentileBucketsPerHalfDistance.
The system takes half the distance to 100% (initially 0% to 50%) and splits it into the given number of buckets. So, for percentileBucketsPerHalfDistance = 2 we get buckets at 0% and 25%.
Then the system takes half of what's left to 100% again, in this case 50% to 75%, and splits it into the required number of buckets, giving 50% and 62.5%.
Then we deal with 75% to 87.5%, split into 2 again, so we get 75% and 81.25%.
Eventually we get so close to 100% that we overshoot and get 99.9997138977 and 100.0.
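
The halving scheme described above can be sketched in a few lines. This is only an illustration of the answer's description, not HdrHistogram's own code: the class PercentileTicks and the method tickLevels are hypothetical names, and the sketch lists just the percentile levels at which a line would be reported, ignoring the recorded data entirely.

```java
import java.util.ArrayList;
import java.util.List;

class PercentileTicks {
    // Hypothetical helper: lists the percentile levels produced by splitting
    // each successive half-distance to 100% into ticksPerHalfDistance steps.
    static List<Double> tickLevels(int ticksPerHalfDistance, int halfDistances) {
        List<Double> levels = new ArrayList<>();
        double start = 0.0;       // where the current half-distance begins
        double remaining = 100.0; // distance from start up to 100%
        for (int h = 0; h < halfDistances; h++) {
            double half = remaining / 2.0;
            double step = half / ticksPerHalfDistance;
            for (int i = 0; i < ticksPerHalfDistance; i++) {
                levels.add(start + i * step);
            }
            start += half;
            remaining -= half;
        }
        return levels;
    }

    public static void main(String[] args) {
        // Prints [0.0, 25.0, 50.0, 62.5, 75.0, 81.25]
        System.out.println(tickLevels(2, 3));
    }
}
```

With 2 ticks per half-distance, every halving of the remaining distance adds 2 more lines, which is why a larger argument yields proportionally more buckets in the output.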

Related

Sampling a smaller set of line graph points without losing trends

I have a set of X/Y coordinates ([(x, y)]) with increasing X (representing a timestamp) and Y representing a value/measurement at that timestamp.
This set can be huge, and I would like to avoid returning every single point in the set for display. Instead I want to find a smaller subset that represents the overall trend of the measurement (some loss of accuracy in the line graph is acceptable).
So far, I have tried simple uniform sampling, skipping points at a uniform interval and then adding the max/min measurement values to the subset. While this is simple, it doesn't account well for local peaks or valleys if the measurement fluctuates often.
I'm wondering if there are any standard algorithms that solve this type of problem on the server side?
I'd appreciate it if anyone who has solved this, or knows of any util/common libraries for such problems, could point me to them. I'm on Java, but if there is a reference to a standard algorithm I might try to implement it in Java myself.

It's hard to give a general answer to this question. It all depends on how your datapoints are stored, what properties your chart has, how it is rendered etc.
But as @dmuir suggested, you should check out the Douglas-Peucker algorithm. Another approach I just thought up could be to split the input data into chunks of some size (maybe corresponding to a single horizontal pixel) and then use some statistic (min, max, or average) for rendering each chunk. If you use running statistics when adding data points to a chunk, this should be O(n), so it's not more expensive than reading in your data points.
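
A minimal sketch of the chunking idea, assuming the points arrive as two parallel arrays sorted by x. The class and method names are made up for illustration, and keeping both the minimum and the maximum of each chunk is one possible choice of statistic, made here so that local peaks and valleys survive.

```java
import java.util.ArrayList;
import java.util.List;

class ChunkDownsampler {
    // For each chunk of chunkSize consecutive points, keep the point with the
    // smallest y and the point with the largest y, emitted in x order.
    static List<double[]> downsample(double[] xs, double[] ys, int chunkSize) {
        List<double[]> out = new ArrayList<>();
        for (int start = 0; start < xs.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, xs.length);
            int minIdx = start;
            int maxIdx = start;
            // Running min/max: one pass over the chunk, so O(n) overall.
            for (int i = start + 1; i < end; i++) {
                if (ys[i] < ys[minIdx]) minIdx = i;
                if (ys[i] > ys[maxIdx]) maxIdx = i;
            }
            int first = Math.min(minIdx, maxIdx);
            int second = Math.max(minIdx, maxIdx);
            out.add(new double[] {xs[first], ys[first]});
            if (second != first) {
                out.add(new double[] {xs[second], ys[second]});
            }
        }
        return out;
    }
}
```

Choosing chunkSize so that the number of chunks roughly matches the chart width in pixels yields at most two points per pixel column.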

Unexpected deviations of the linear search graph on an ordered table

I have implemented a simple linear search and plotted the results with the StdDraw library. I ran the search for a randomly generated number on tables of sizes from 1000 to 100000 elements, incrementing the size by 1 each time. The points on the graph represent the average time it took to find a random number in the given table, averaged over 1000 runs on the same table size.
However, there are big deviations visible on the graph which I do not know how to explain. Is it possible that this is due to interference from other background tasks requesting CPU time? Could the spikes be caused by poorly generated pseudorandom integers, because the nextInt() method is called within a really tiny time slice, resulting in similar (very big or very small) random integers?
(The red line represents the linear search and the blue one binary search. Ignore the latter)

Need to search a big file of integers using Java

I have a file which has 100,000 lines, and each line is a list of 1000 space-separated integers (ranging from 0 to 1,000,000). Now I need to make an API which, when given two inputs a and b, tells me whether both numbers are present in the same line of the file, with b coming after a in terms of index. The total size of the file is ~700 MB.
Since it is an API, I cannot read from the file by creating a stream on every request, as I have to take care of response time and disk reads are slow. And I cannot load everything into memory since the file is too big.
Any suggestions on what is an optimal way?
Note - I created an API by loading everything into memory and building a hashmap of number -> set of lines it belongs to, and then tried to search it. It works for smaller files, but when I try to start the server with the larger file, the server does not start. (I am new to Java too; can anyone help me with where to see the logs on why it is not starting? I am just doing java -jar $DIR/target/test.jar in my bash script.)

I think you have a lot of numbers here (100M), and if you want to keep them all in memory you should be prepared to use GBs of RAM. The good news is that the highest number is 1M, so a lot of the numbers repeat.
I would probably represent the file as a graph. Each node contains a number (1-1000000), so you have 1 million nodes, indexed for O(1) access (nodes could easily be implemented as cells of an array). Then each node X is connected to a node Y if Y appears to the right of X in any line of the file.
The solution involves checking the connectivity of two nodes in the graph. I'm not an expert here, and I would implement a DFS-like algorithm, paying attention to avoid cycles. Because of this, the search will touch at most 1 million nodes, keeping the complexity low.
About space: each line should produce 999 connections, which (multiplied by 100k lines) is almost 100 million connections. If each connection is 4 bytes (though you can improve on that, as all you need is 20 bits to store 1 million), then you need 400 MB of memory for connections.
So with 400 MB of RAM you can make your API answer very fast.

Sampling numerical arrays in java

I have a data set of time series data I would like to display on a line graph. The data is currently stored in an Oracle table and is sampled at 1 point / second. The question is how do I plot the data over a 6 month period of time? Is there a way to downsample the data once it has been returned from Oracle (this can be done in various charts, but I don't want to move the data over the network)? For example, if a query returns 10K points, how can I downsample this to 1K points and still have the line graph keep the visual characteristics (peaks/valleys) of the 10K points?
I looked at Apache Commons, but without knowing exactly what the statistical name for this is I'm a bit at a loss.
The data I am sampling is indeed time series data such as page hits.

It sounds like what you want is to segment the 10K data points into 1K buckets -- the value of each of these buckets may be any statistic that makes sense for your data (sorry, without actual context it's hard to say). For example, if you want to spot the trend of the data, you might want to use the median (50th percentile) to summarize the 10 points in each bucket. Apache Commons Math has helper functions for that. Then, with the 1K downsampled data points, you can plot the chart.
For example, if I have 10K data points of page load times, I might map that to 1K data points by taking the median of every 10 points -- that will tell me the typical load time within the range -- and plot that. Or maybe I can use the max to find the maximum load time in the period.
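
A minimal sketch of the bucket-and-median idea using only the standard library (the answer mentions Apache Commons Math, which is not used here). MedianDownsampler and downsample are hypothetical names, and averaging the two middle values for even-sized buckets is an assumption about which median definition is wanted.

```java
import java.util.Arrays;

class MedianDownsampler {
    // Splits values into consecutive buckets of bucketSize points and returns
    // the median of each bucket. The last bucket may be shorter.
    static double[] downsample(double[] values, int bucketSize) {
        int buckets = (values.length + bucketSize - 1) / bucketSize;
        double[] out = new double[buckets];
        for (int b = 0; b < buckets; b++) {
            int from = b * bucketSize;
            int to = Math.min(from + bucketSize, values.length);
            // Sort a copy so the input array is left untouched.
            double[] bucket = Arrays.copyOfRange(values, from, to);
            Arrays.sort(bucket);
            int mid = bucket.length / 2;
            out[b] = (bucket.length % 2 == 1)
                    ? bucket[mid]
                    : (bucket[mid - 1] + bucket[mid]) / 2.0;
        }
        return out;
    }
}
```

Calling downsample(points, 10) on 10K points gives the 1K medians described above; swapping the median for a max is a one-line change inside the loop.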

There are two options: you can do as @Adrian Pang suggests and use time bins, which means you have bins and hard boundaries between them. This is perfectly fine, and it's called downsampling if you're working with a time series.
You can also use a smooth bin definition by applying a sliding window average/function convolution to points. This will give you a time series at the same sampling rate as your original, but much smoother. Prominent examples are the sliding window average (mean/median of all points in the window, equally weighted average) and Gaussian convolution (weighted average where the weights come from a Gaussian density curve).
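
A minimal sketch of the equally weighted sliding window average, under the assumption that the window is centered on each point and simply shrinks near the ends of the series. MovingAverage, smooth and halfWidth are illustrative names, not taken from any library.

```java
class MovingAverage {
    // Centered moving average: each output point is the mean of the input
    // points within halfWidth positions on either side. The output has the
    // same length (and sampling rate) as the input.
    static double[] smooth(double[] values, int halfWidth) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            int from = Math.max(0, i - halfWidth);
            int to = Math.min(values.length - 1, i + halfWidth);
            double sum = 0.0;
            for (int j = from; j <= to; j++) {
                sum += values[j];
            }
            out[i] = sum / (to - from + 1);
        }
        return out;
    }
}
```

A Gaussian convolution would follow the same loop structure, with each values[j] multiplied by a weight that depends on the distance between j and i and the sum divided by the total weight.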

My advice is to average the values over shorter time intervals. Make the length of the shorter interval dependent on the overall time range. If the overall time range is short enough, just display the raw data. E.g.:
overall = 1 year: let subinterval = 1 day
overall = 1 month: let subinterval = 1 hour
overall = 1 day: let subinterval = 1 minute
overall = 1 hour: no averaging, just use raw data
You will have to make some choices about where to shift from one subinterval to another, e.g., for overall = 5 months, is subinterval = 1 day or 1 hour?
My advice is to make a simple scheme so that it is easy for others to comprehend. Remember that the purpose of the plot is to help someone else (not you) understand the data. A simple averaging scheme will help get you to that goal.
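
One possible way to encode such a scheme is sketched below. The four example mappings come from the list above; the cutoffs between them (60 days, 2 days, 2 hours) are assumptions picked only for illustration, and SubintervalPicker is a hypothetical name.

```java
import java.time.Duration;

class SubintervalPicker {
    // Returns the averaging subinterval for a given overall time range.
    // Duration.ZERO means "no averaging, use the raw data".
    static Duration subintervalFor(Duration overall) {
        if (overall.compareTo(Duration.ofDays(60)) > 0) {
            return Duration.ofDays(1);     // e.g. overall = 1 year
        }
        if (overall.compareTo(Duration.ofDays(2)) > 0) {
            return Duration.ofHours(1);    // e.g. overall = 1 month
        }
        if (overall.compareTo(Duration.ofHours(2)) > 0) {
            return Duration.ofMinutes(1);  // e.g. overall = 1 day
        }
        return Duration.ZERO;              // e.g. overall = 1 hour
    }
}
```

With these assumed cutoffs, an overall range of 5 months falls into the 1 day bucket.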

If all you need is to reduce the number of points in your visualization without losing visual information, I suggest using the code here. The tricky part of this approach is finding the correct threshold, where the threshold is the number of data points you aim to have after downsampling. The lower the threshold, the more visual information you lose. However, going from 10K to 1K is feasible, since I have tried it with a similar amount of data.
As a side note, you should keep two things in mind:
The quality of your visualization depends on the number of points and the size (in pixels) of your chart, meaning that bigger charts need more data.
Any further analysis may not return correct results if it is applied to the downsampled data. Or at least I haven't seen anyone prove the opposite.

frequency / pitch detection for dummies

While I have found many questions on this site dealing with the concept of pitch detection, they all deal with this magical FFT with which I am not familiar. I am trying to build an Android application that needs to implement pitch detection. I have absolutely no understanding of the algorithms that are used to do this.
It can't be that hard, can it? There are around 8 billion guitar tuner apps on the Android market after all.
Can someone help?

The FFT is not really the best way to implement pitch detection or pitch tracking. One issue is that the loudest frequency is not always the fundamental frequency. Another is that the FFT, by itself, requires a pretty large amount of data and processing to obtain the resolution you need to tune an instrument, so it can appear slow to respond (i.e. latency). Yet another issue is that the result of an FFT is not necessarily intuitive to work with: you get an array of complex numbers and you have to know how to interpret them.
If you really want to use an FFT, here is one approach:
Low-pass your signal. This will help prevent noise and higher harmonics from creating spurious results. Conceivably, you could skip this step and instead weight your results towards the lower values of the FFT. For some instruments with strong fundamental frequencies, this might not be necessary.
Window your signal. Windows should be at least 4096 in size. Larger is better to a point because it gives you better frequency resolution. If you go too large, it will end up increasing your computation time and latency. The Hann function is a good choice for your window. http://en.wikipedia.org/wiki/Hann_function
FFT the windowed signal as often as you can. Even overlapping windows are good.
The results of the FFT are complex numbers. Find the magnitude of each complex number using sqrt( real^2 + imag^2 ). The index in the FFT array with the largest magnitude is the index with your peak frequency.
You may want to average multiple FFTs for more consistent results.
How do you calculate the frequency from the index? Well, let's say you've got a window of size N. After you FFT, you will have N complex numbers. If your peak is the nth one (with n no larger than N/2; for real input the upper half of the array mirrors the lower half), and your sample rate is 44100, then your peak frequency will be near 44100*n/N. Why near? Well, the bins are 44100/N apart, so you have an error of up to half a bin, (44100/2)*1/N. For a window size of 4096, this is about 5.4 Hz -- easily audible at A440. You can improve on that by 1. taking phase into account (I've only described how to take magnitude into account), 2. using larger windows (which will increase latency and processing requirements as the FFT is an N log N algorithm), or 3. using a better algorithm like YIN http://www.ircam.fr/pcm/cheveign/pss/2002_JASA_YIN.pdf
You can skip the windowing step and just break the audio into discrete chunks of however many samples you want to analyze. This is equivalent to using a square window, which works, but you may get more noise in your results.
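
The window, FFT and peak-magnitude steps above can be sketched as follows. This is a minimal illustration, not the tutorial code mentioned in the answer: the class and method names are made up, the low-pass and averaging steps are left out, the input length is assumed to be a power of two, and only the lower half of the bins is searched because the upper half mirrors it for real input. For an N-point transform the bins are spaced sampleRate / N apart, so bin k is reported as k * sampleRate / N.

```java
class FftPitchSketch {
    // In-place iterative radix-2 FFT. re.length must be a power of two.
    static void fft(double[] re, double[] im) {
        int n = re.length;
        // Reorder the input by bit-reversed index.
        for (int i = 1, j = 0; i < n; i++) {
            int bit = n >> 1;
            for (; (j & bit) != 0; bit >>= 1) {
                j ^= bit;
            }
            j ^= bit;
            if (i < j) {
                double t = re[i]; re[i] = re[j]; re[j] = t;
                t = im[i]; im[i] = im[j]; im[j] = t;
            }
        }
        // Combine sub-transforms of length 2, 4, 8, ... up to n.
        for (int len = 2; len <= n; len <<= 1) {
            double angle = -2.0 * Math.PI / len;
            for (int start = 0; start < n; start += len) {
                for (int k = 0; k < len / 2; k++) {
                    double wr = Math.cos(angle * k);
                    double wi = Math.sin(angle * k);
                    int a = start + k;
                    int b = a + len / 2;
                    double xr = re[b] * wr - im[b] * wi;
                    double xi = re[b] * wi + im[b] * wr;
                    re[b] = re[a] - xr;
                    im[b] = im[a] - xi;
                    re[a] += xr;
                    im[a] += xi;
                }
            }
        }
    }

    // Applies a Hann window, runs the FFT, and returns the frequency of the
    // bin with the largest magnitude (bin 0, the DC component, is skipped).
    static double peakFrequency(double[] samples, double sampleRate) {
        int n = samples.length;
        double[] re = new double[n];
        double[] im = new double[n];
        for (int i = 0; i < n; i++) {
            double hann = 0.5 * (1.0 - Math.cos(2.0 * Math.PI * i / (n - 1)));
            re[i] = samples[i] * hann;
        }
        fft(re, im);
        int peak = 1;
        double best = -1.0;
        for (int k = 1; k < n / 2; k++) {
            double mag = Math.sqrt(re[k] * re[k] + im[k] * im[k]);
            if (mag > best) {
                best = mag;
                peak = k;
            }
        }
        return peak * sampleRate / n;
    }
}
```

Feeding it 4096 samples of a pure 440 Hz sine at 44100 Hz should return the center of the nearest bin, a value within a few Hz of 440, which illustrates the resolution limit discussed above.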
BTW: Many of those tuner apps license code from third parties, such as z-plane and iZotope.
Update: If you want C source code and a full tutorial for the FFT method, I've written one. The code compiles and runs on Mac OS X, and should be convertible to other platforms pretty easily. It's not designed to be the best, but it is designed to be easy to understand.

A Fast Fourier Transform changes a function from the time domain to the frequency domain. So instead of f(t), where f is the signal that you are getting from the microphone and t is the time index of that signal, you get g(θ), where g is the FFT of f and θ is the frequency. Once you have g(θ), you just need to find which θ has the highest amplitude, meaning the "loudest" frequency. That will be the primary pitch of the sound that you are picking up.
As for actually implementing the FFT, if you google "fast fourier transform sample code", you'll get a bunch of examples.
