Calculating intraday candlesticks by time intervals - java

This may be an over-asked question, but my mind is drawing a blank at the moment. I know what a candlestick chart is and how to draw it daily, but how do you draw it intraday for requested time periods? I have a server, written in Java, that gives me trade depth (every trade done since the start of the day). It's just a stream of raw data: price, shares, timestamp.
How does one go about calculating candlestick data from that? Let's say they want 5-minute or 1-minute candlesticks. Or is there a library that will do that for me if I feed it the data?
Any help is appreciated!

The exact implementation varies depending on how you're storing the data, but in general:
1. Sort the data by timestamp.
2. Decide when the day starts (e.g. 9 AM EST, whatever) and find the timestamp of that time on the first day. You then know when each 5-minute (or whatever) bar begins and ends, by adding an appropriate offset to that number.
3. Find the index of the first data point that is not in the first bar - every data point whose index is lower than that is in the first bar. It's now straightforward to take the first, last, maximum, and minimum prices for a candlestick.
4. Repeat step 3, substituting the last index of the previous candle for 0.
You now have the data partitioned into candles.
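A minimal sketch of that bucketing in Java, assuming trades arrive as (price, shares, timestamp) tuples at or after the chosen day start; the Trade and Candle records here are hypothetical, not from any particular library:

import java.util.*;

public class CandleBuilder {
    // Hypothetical record types; adapt to however your server delivers trades.
    public record Trade(double price, long shares, long timestampMillis) {}
    public record Candle(double open, double high, double low, double close, long volume) {}

    public static SortedMap<Long, Candle> build(List<Trade> trades, long dayStartMillis, long barMillis) {
        trades.sort(Comparator.comparingLong(Trade::timestampMillis)); // step 1: sort by timestamp
        SortedMap<Long, Candle> candles = new TreeMap<>();             // bar start time -> candle
        for (Trade t : trades) {
            // Steps 2-4: snap each trade to the start of its bar.
            long bar = dayStartMillis + ((t.timestampMillis() - dayStartMillis) / barMillis) * barMillis;
            Candle c = candles.get(bar);
            candles.put(bar, c == null
                ? new Candle(t.price(), t.price(), t.price(), t.price(), t.shares())  // first trade opens the bar
                : new Candle(c.open(),                                                // open never changes
                             Math.max(c.high(), t.price()),
                             Math.min(c.low(), t.price()),
                             t.price(),                                               // latest trade is the close
                             c.volume() + t.shares()));
        }
        return candles;
    }
}

Pass barMillis = 5 * 60 * 1000 for 5-minute candles, or 60 * 1000 for 1-minute ones.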

Have you seen JFreeChart? It will draw candlesticks, and since it's incredibly configurable, it may well do what you want.

Related

Sampling numerical arrays in java

I have a set of time series data I would like to display on a line graph. The data is currently stored in an Oracle table, sampled at 1 point per second. The question is: how do I plot the data over a 6-month period of time? Is there a way to downsample the data once it has been returned from Oracle (this can be done in various charts, but I don't want to move the data over the network)? For example, if a query returns 10K points, how can I downsample this to 1K points and still keep the visual characteristics (peaks/valleys) of the 10K points on the line graph?
I looked at Apache Commons, but without knowing exactly what the statistical name for this is, I'm a bit at a loss.
The data I am sampling is indeed time series data such as page hits.
It sounds like what you want is to segment the 10K data points into 1K buckets - the value of each of these buckets may be any statistical computation that makes sense for your data (sorry, without actual context it's hard to say). For example, if you want to spot the trend of the data, you might want to use the median or a percentile to summarize the 10 points in each bucket. Apache Commons Math has helper functions for that. Then, with the 1K downsampled data points, you can plot the chart.
For example, if I have 10K data points of page load times, I might map that to 1K data points by taking a median over every 10 points - that will tell me the most common load time within the range - and plot that. Or maybe I can use the max to find the maximum load time in the period.
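A minimal sketch of that bucketing, assuming the samples are already ordered by time, the point count divides evenly into the bucket count, and commons-math3 is on the classpath:

import java.util.Arrays;
import org.apache.commons.math3.stat.descriptive.rank.Median;

// Downsample time-ordered 'points' into 'buckets' values, one median per bucket.
static double[] downsample(double[] points, int buckets) {
    Median median = new Median();
    double[] out = new double[buckets];
    int size = points.length / buckets;   // e.g. 10K points / 1K buckets = 10 points per bucket
    for (int i = 0; i < buckets; i++) {
        out[i] = median.evaluate(Arrays.copyOfRange(points, i * size, (i + 1) * size));
    }
    return out;
}

Swapping Median for Max (same package) gives the maximum-per-bucket variant.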
There are two options: you can do as @Adrian Pang suggests and use time bins, which means you have bins with hard boundaries between them. This is perfectly fine, and it's called downsampling when you're working with a time series.
You can also use a smooth bin definition by applying a sliding window average/function convolution to points. This will give you a time series at the same sampling rate as your original, but much smoother. Prominent examples are the sliding window average (mean/median of all points in the window, equally weighted average) and Gaussian convolution (weighted average where the weights come from a Gaussian density curve).
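A minimal sketch of the equally weighted sliding-window variant (the Gaussian version just swaps the uniform weights for ones drawn from a Gaussian density):

// Centered moving average: out[i] is the mean of all points within 'radius' of index i.
static double[] movingAverage(double[] points, int radius) {
    double[] out = new double[points.length];
    for (int i = 0; i < points.length; i++) {
        int lo = Math.max(0, i - radius);                 // window is clipped at the array edges
        int hi = Math.min(points.length - 1, i + radius);
        double sum = 0;
        for (int j = lo; j <= hi; j++) sum += points[j];
        out[i] = sum / (hi - lo + 1);
    }
    return out;
}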
My advice is to average the values over shorter time intervals. Make the length of the shorter interval dependent on the overall time range. If the overall time range is short enough, just display the raw data. E.g.:
overall = 1 year: let subinterval = 1 day
overall = 1 month: let subinterval = 1 hour
overall = 1 day: let subinterval = 1 minute
overall = 1 hour: no averaging, just use raw data
You will have to make some choices about where to shift from one subinterval to another, e.g., for overall = 5 months, is subinterval = 1 day or 1 hour?
My advice is to make a simple scheme so that it is easy for others to comprehend. Remember that the purpose of the plot is to help someone else (not you) understand the data. A simple averaging scheme will help get you to that goal.
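One simple way to encode such a scheme - the cutoff points below are assumptions, just one reading of the list above:

import java.time.Duration;

// Map the overall plotted range to an averaging subinterval; null means "just plot the raw data".
static Duration subintervalFor(Duration overall) {
    if (overall.compareTo(Duration.ofDays(180)) >= 0) return Duration.ofDays(1);    // months to a year: daily
    if (overall.compareTo(Duration.ofDays(7)) >= 0)   return Duration.ofHours(1);   // a week to months: hourly
    if (overall.compareTo(Duration.ofHours(1)) > 0)   return Duration.ofMinutes(1); // hours to a week: per minute
    return null;                                      // an hour or less: no averaging
}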
If all you need is to reduce the points of your visualization without losing any visual information, I suggest using the code here. The tricky part of this approach is finding the correct threshold, where the threshold is the number of data points you want to have after downsampling. The lower the threshold, the more visual information you lose. Going from 10K to 1K is feasible, though, since I have tried it with a similar amount of data.
As a side note, you should keep in mind:
The quality of your visualization depends on the number of points and the size (in pixels) of your charts, meaning that bigger charts need more data.
Any further analysis may not return correct results if it is applied to the downsampled data - or at least I haven't seen anyone prove the opposite.

What design pattern is appropriate for this situation?

I have 2D hydraulic data, which are multigigabyte text files containing depth and velocity information for each point in a grid, broken up into time steps. Each timestep contains a depth/velocity value for every point in the grid. So you could follow one point through each timestep and see how its depth/velocity changes. I want to read in this data one timestep at a time, calculating various things - the maximum depth a grid cell achieves, max velocity, the number of the first timestep where water is more than 2 feet deep, etc. The results of each of these calculations will be a grid - max depth at each point, etc.
So far, this sounds like the Decorator pattern. However, I'm not sure how to get the results out of the various calculations - each calculation produces a different grid. I would have to keep references to each decorator after I create it in order to extract the results from it, or else add a getResults() method that returns a map of different results, etc, neither of which sound ideal.
Another option is the Strategy pattern. Each calculation is a different algorithm that operates on a time step (current depth/velocity) and the results of previous rounds (max depth so far, max velocity so far, etc). However, these previous results are different for each computation - which means either the algorithm classes become stateful, or it becomes the caller's job to keep track of previous results and feed them in. I also dislike the Strategy pattern because the behavior of looping over the timesteps becomes the caller's responsibility - I'd like to just give the "calculator" an iterator over the timesteps (fetching them from the disk as needed) and have it produce the results it needs.
Additional constraints:
Input is large and being read from disk, so iterating exactly once, by time step, is the only practical method
Grids are large, so calculations should be done in place as much as possible
If I understand your problem right, you have grid points, each of which has many timesteps, and each timestep has a depth and a velocity - and GBs of data in total.
I would suggest doing one pass over the data and storing the parsed data in an RDBMS, then running queries or stored procedures on that data. This way, at least, the application will not run out of memory.
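A sketch of that one pass, as a fragment inside a method that may throw IOException/SQLException; the open java.sql.Connection conn, the timestep(step, point, depth, velocity) table, and the comma-separated line layout are all assumptions to adapt to the real file format:

// One streaming pass over the text file, batching rows into the database
// instead of holding the whole grid history in memory.
try (BufferedReader in = new BufferedReader(new FileReader("hydraulic.txt"));
     PreparedStatement ps = conn.prepareStatement(
         "INSERT INTO timestep (step, point, depth, velocity) VALUES (?, ?, ?, ?)")) {
    String line;
    int batched = 0;
    while ((line = in.readLine()) != null) {
        String[] f = line.split(",");                    // assumed layout: step,point,depth,velocity
        ps.setInt(1, Integer.parseInt(f[0]));
        ps.setInt(2, Integer.parseInt(f[1]));
        ps.setDouble(3, Double.parseDouble(f[2]));
        ps.setDouble(4, Double.parseDouble(f[3]));
        ps.addBatch();
        if (++batched % 10_000 == 0) ps.executeBatch();  // flush periodically to bound memory
    }
    ps.executeBatch();                                   // flush the final partial batch
}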
First, maybe I've misunderstood the issue and my answer misses the point, in which case I apologize for taking your time.
At first sight I would think of an approach that's more akin to the "strategy pattern", in combination with a data-oriented base, something like the following pseudo-code:
foreach timeStep
    readGridData
    foreach activeCalculator in activeCalculators
        useCalculatorPointerListToAccessSpecificStoredDataNeededForNewCalculation
        performCalculationOnFreshGridData
        updateUpdatableData
        presentUpdatedResultsToUser
        storeGridResultsInDataPool(OfResultBaseClassType)
        discardNoLongerNeededStoredGridResults
    next calculator
next timeStep
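In Java terms, the loop above might look like the following sketch, where TimeStep, Grid, and Calculator are hypothetical types standing in for your own:

import java.util.Iterator;
import java.util.List;

interface TimeStep {}                  // one timestep's depth/velocity values (hypothetical)
interface Grid {}                      // a per-cell result grid (hypothetical)

interface Calculator {
    void accept(TimeStep step);        // fold one timestep into this calculator's running state
    Grid result();                     // the grid accumulated so far (max depth, max velocity, ...)
}

// Single pass over the timesteps; each calculator updates its own state in place.
static void run(Iterator<TimeStep> steps, List<Calculator> calculators) {
    while (steps.hasNext()) {
        TimeStep step = steps.next();  // fetched lazily from disk by the iterator
        for (Calculator c : calculators) {
            c.accept(step);
        }
    }
}

This keeps the timestep loop out of the caller, as the question asked, while each Calculator owns whatever previous-round state it needs.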
Again, sorry if this is off the point.

How to discard time intervals with Time Series / XYPlots using JFreeChart?

I am building a set of chart displays, one of which is for a month display of daily trading - that is, one point of data per day (closing).
Since there is no trade during weekends and holidays, I need to discard these data points. Not only that, but data points should still appear adjacent to each other, regardless of any gaps in time. This can be seen in any such chart e.g. in the 3 month graph for Nasdaq on Yahoo Finance - see how weekends are skipped.
My question is: how should one correctly implement this in JFreeChart?
Thanks in advance!
In addition to omitting the excluded data points, you can apply a SegmentedTimeline to the corresponding DateAxis. For example,
axis.setTimeline(SegmentedTimeline.newMondayThroughFridayTimeline());
Although deprecated in the current version, as discussed here, the implementation may guide creation of a custom Timeline, as noted in a comment here.
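In context, the timeline hangs off the chart's domain axis; a minimal sketch, assuming an existing time-series chart named chart:

import org.jfree.chart.axis.DateAxis;
import org.jfree.chart.axis.SegmentedTimeline;
import org.jfree.chart.plot.XYPlot;

// Skip Saturdays and Sundays on the date axis so trading days plot adjacently.
XYPlot plot = (XYPlot) chart.getPlot();
DateAxis axis = (DateAxis) plot.getDomainAxis();
axis.setTimeline(SegmentedTimeline.newMondayThroughFridayTimeline());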

Best fit curve for trend line

Problem Constraints
Size of the data set, but not the data itself, is known.
Data set grows by one data point at a time.
Trend line is graphed one data point at a time (using a spline/Bezier curve).
Graphs
The collage below shows data sets with reasonably accurate trend lines:
The graphs are:
Upper-left. By hour, with ~24 data points.
Upper-right. By day for one year, with ~365 data points.
Lower-left. By week for one year, with ~52 data points.
Lower-right. By month for one year, with ~12 data points.
User Inputs
The user can select:
the type of time series (hourly, daily, monthly, quarterly, annual); and
the start and end dates for the time series.
For example, the user could select a daily report for 30 days in June.
Trend Weight
To calculate the window size (i.e., the number of data points to average when calculating the trend line), the following expression is used:
data points / trend weight
Where data points is derived from user inputs and trend weight is 6.4. Even though a trend weight of 6.4 produces good fits, it is rather arbitrary, and might not be appropriate for different user inputs.
Question
How should trend weight be calculated given the constraints of this problem?
Based on the looks of the graphs, I would say you have too many points for your 12-point graph (it is just a spline of the points given... which is visually pleasing, but actually does more harm than good when trying to understand the trend) and too few points for your 365-point graph. Perhaps try something a little exponential like:
(Data points)^1.2/14.1
I do realize this is even more arbitrary than what you already have, but arbitrary isn't the worst thing in the world.
(I got 14.1 by trying to keep the 52-point graph fixed, since that one looks nice: (52^1.2 / 52) * 6.4 = 14.1. Using this technique, you could try other powers besides 1.2 to see what you get visually.)
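As a sketch, with the window size rounded to a whole number of points:

// Exponential window size, replacing (data points / 6.4); 14.1 keeps the 52-point case unchanged.
static int windowSize(int dataPoints) {
    return Math.max(1, (int) Math.round(Math.pow(dataPoints, 1.2) / 14.1));
}

For 52 points this still gives a window of about 8, while 365 points now get a window of about 84 instead of 57.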
Dan
I voted this up for the quality of your results and the clarity of your write-up. I wish I could offer an answer that could improve on your already excellent work.
I fear that it might be a matter of trial and error with the trend weight until you see an improved fit.
It could be that you could make this an input from users as well: allow them to fiddle with the value, given realistic constraints, until they get satisfactory values.
I also wondered if the weight would be different for each graph, since the number of points in each is different. Are you trying to get a single weighting that works for all graphs?
Excellent work; a nice question. Well done. I wish I was more helpful. Perhaps someone else will have more wisdom to impart than I do.
It might look like the trend lines are accurate in those 4 graphs, but they're really quite off. (This is best seen at the beginning of the lower-left one and the beginning of the upper-right one.) I would think that you would want to use no less than half of your points when finding the trend line (though really you should use much more than half). I would suggest a trend weight of 2 at a maximum, though really you ought to stick closer to the 1-1.5 range. Since it is arbitrary, I would suggest you give your user an "accuracy of trend line" slider, where the most accurate setting uses a trend weight of 1 and the least accurate uses a weight of (# of data points) + 1. The latter would use 0 points (assuming you always round down) and, I would assume - though your statistics software might differ - would generate a straight horizontal line.
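A sketch of that slider mapping, linearly interpolating between the two extremes:

// accuracy in [0, 1]: 1.0 = most accurate (weight 1), 0.0 = least accurate (weight dataPoints + 1).
static double trendWeight(double accuracy, int dataPoints) {
    return 1 + (1 - accuracy) * dataPoints;
}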

Is there a good algorithm to check for changes in data over a specified period of time?

We have around 7k financial products whose closing prices should theoretically move up and down within a certain percentage range throughout a defined period of time (say a one week or month period).
I have access to an internal system that stores these historical prices (not a relational database!). I would like to produce a report that lists any products whose price has not moved at all or less than say 10% over the time period.
I can't just compare the first value (day 1) to the value at the end (day n), as the price could have moved back on the last day to what it was on day 1 - which would lead to a false positive even though the price could, of course, have spiked somewhere in between.
Are there any established algorithms to do this in reasonable compute time?
There isn't any way to do this without looking at every single day.
Suppose the data looks like this:
oooo0oooo
With that one-day spike in the middle, you're not going to catch it unless you check the day the spike happens - in other words, you need to check every single day.
If this needs to be checked often (for a large number of intervals - say, daily for the last year - and for the same set of products), you can store the high and low values of each item per week/month. By combining the right weekly and/or monthly bounds with some raw data at the edges of the interval, you can get the minimum and maximum value over the interval.
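A sketch of that combination, where PriceStore and all its accessors are hypothetical stand-ins for your historical-price system:

import java.time.LocalDate;
import java.time.YearMonth;

interface PriceStore {  // hypothetical accessors over precomputed bounds and raw prices
    Iterable<YearMonth> wholeMonthsWithin(LocalDate from, LocalDate to);
    double monthlyLow(YearMonth m);
    double monthlyHigh(YearMonth m);
    double[] rawPricesAtEdges(LocalDate from, LocalDate to);
}

// Combine precomputed monthly bounds with raw prices from the partial months at the edges.
static double[] minMaxOver(LocalDate from, LocalDate to, PriceStore store) {
    double lo = Double.POSITIVE_INFINITY, hi = Double.NEGATIVE_INFINITY;
    for (YearMonth m : store.wholeMonthsWithin(from, to)) {
        lo = Math.min(lo, store.monthlyLow(m));
        hi = Math.max(hi, store.monthlyHigh(m));
    }
    for (double p : store.rawPricesAtEdges(from, to)) {
        lo = Math.min(lo, p);
        hi = Math.max(hi, p);
    }
    return new double[] { lo, hi };   // min and max over the whole interval
}

One reasonable flatness check is then (hi - lo) / lo < 0.10.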
If you can add data to kdb (i.e. you're not limited to read access) you might consider adding the 'number of days since last price change' as a new set of data (i.e. one number per financial instrument). A daily task would then fetch today's mark and yesterday's, and update the numbers stored. Similarly you could maintain recent (last month, last year) highs and lows in kdb. You'd have to run a job over the larger dataset to prime the values initially, but then your daily updates will involve much less data.
If you adopt something like this, I'd recommend having some way to rerun it for all or part of the dataset (say, when adding a new product).
Lastly - is the history normalised against current prices? (i.e. are revaluations for stock splits or similar taken into account). If not, you'd need to detect these discontinuities and divide them out.
EDIT
I'd investigate using kdb+/Q to implement the signal processing, rather than extracting the raw data to a Java application. As you say, it's highly performant.
You can do this if you can keep track of the min and max value of the price during the time interval - this assumes that the time interval is not being constantly changed. One way of keeping track of the min and max values of a changing set of items is with two heaps placed 'back to back' - you could store this and some pointers necessary to find and remove old items in one or two arrays in your store. The idea of putting two heaps back to back is in Knuth's Art of Computer Programming Vol 3 as Exercise 31 section 5.2.3. Knuth calls this sort of beast a Priority Dequeue, and this seems to be searchable. Min and max are available at constant cost. Cost of modifying it when a new price arrives is log n, where n is the number of items stored.
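Java's standard library has no priority deque, but a TreeMap used as a multiset gives the same O(log n) add/remove with cheap min/max lookups - a minimal sketch of the same idea:

import java.util.TreeMap;

// Multiset of the prices currently in the interval: O(log n) add/remove, cheap min/max.
class MinMaxWindow {
    private final TreeMap<Double, Integer> counts = new TreeMap<>();

    void add(double price) { counts.merge(price, 1, Integer::sum); }

    void remove(double price) {                      // call as old prices leave the interval
        counts.computeIfPresent(price, (k, c) -> c == 1 ? null : c - 1);
    }

    double min() { return counts.firstKey(); }
    double max() { return counts.lastKey(); }
}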
