Can we use Spark Streaming for time-based events? (Java)

I have a requirement as follows:
There are multiple devices producing data based on their device configuration. For example, two devices produce data at their own intervals: say d1 produces a value every 15 min and d2 every 30 min.
All this data is sent to Kafka.
I need to consume the data and perform calculations for each device based on the values produced for the current hour plus the first value produced in the next hour. For example, if d1 produces data every 15 min from 12:00 AM-1:00 AM, the calculation uses the values produced in that hour and the first value produced between 1:00 AM and 2:00 AM. If no value is produced between 1:00 AM and 2:00 AM, I only consider the data from 12:00 AM-1:00 AM and save the result to a time-series data repository.
There will be 'n' such devices, each with its own configuration. In the above scenario d1 and d2 are handled on a 1-hour basis; there might be other devices handled on a 3-hour or 6-hour basis.
Currently this requirement is implemented in Java. Since the number of devices is growing, and the computations with it, I would like to know whether Spark/Spark Streaming can be applied to this scenario. Any articles covering this kind of requirement would be a great help.

If, and this is a big if, the computations are going to be device-wise, you can make use of topic partitions and scale the number of partitions with the number of devices. Messages are delivered in order per partition; this is the most powerful idea that you need to understand.
However, some words of caution:
The number of partitions may need to grow over time, and Kafka only lets you increase a topic's partition count; if you ever need to decrease it, you may have to purge the topic and start again.
To ensure the devices are distributed uniformly across partitions, consider assigning a GUID to each device and using it as the message key.
If the calculations do not involve machine learning libraries and can be done in plain Java, it may be a good idea to use plain old consumers (or Kafka Streams) for this, instead of abstracting them via Spark Streaming. The lower the level, the greater the flexibility.
You can check this article on choosing the number of topics/partitions: https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster
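As a rough illustration of the plain-consumer route, here is a minimal sketch (the topic name "device-data", the group id and the processing hook are assumptions; the producer side would use the device GUID as the record key so that each device's messages land in one partition and stay in order):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeviceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "device-calculations");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("device-data"));
            while (true) {
                for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
                    // rec.key() is the device GUID; records for one device arrive
                    // in order because they share a partition. The per-device
                    // hourly calculation would be triggered from here.
                }
            }
        }
    }
}

Adding more consumer instances to the same group (up to the partition count) is then how you scale out as the number of devices grows.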

Related

What exactly is the use of 'withIngestionTimestamps()' in a Hazelcast Jet Pipeline?

I'm running a pipeline that sources from a Kafka topic and sinks to an IMap. Every time I write one, I come across the methods withIngestionTimestamps() and withoutTimestamps() and wonder how they are useful. I understand it's all about the source adding a time to the event. The question is: how do I get to use it? I don't see any method to fetch the timestamp from the event.
My IMap may end up filled with duplicate values. Could I make use of the withIngestionTimestamps() method to identify the latest record and discard the old one?
Jet uses the event timestamps to correctly apply windowing. It must decide which event belongs to which window and when the time has come to close a window and emit its aggregated result. The timestamps are present on the events as metadata and aren't exposed to the user.
However, if you want to apply your logic that refers to the wall-clock time, you can always call System.currentTimeMillis() to check it against the timestamp explicitly stored in the IMap value. That would be equivalent to using the processing time, which is quite similar to the ingestion time that Jet applies. Ingestion time is simply the processing time valid at the source vertex of the pipeline, so applying processing time at the sink vertex is just slightly different from that, and has the same practical properties.
Jet manages the event timestamp behind the scenes; it's visible only to processors. For example, window aggregation will use the timestamp.
If you want to see the timestamp in the code, you have to include it in your item type: go without timestamps from the source, add the ingestion timestamp using a map operator, and let Jet know about it:
Pipeline p = Pipeline.create();
p.drawFrom(KafkaSources.kafka(...))
 .withoutTimestamps()
 // pair each item with the wall-clock time at which it was read
 .map(t -> tuple2(System.currentTimeMillis(), t))
 // tell Jet to treat that stored value as the event timestamp
 .addTimestamps(Tuple2::f0, 2000)
 .drainTo(Sinks.logger());
I used an allowedLag of 2000 ms. The reason for this is that the timestamps are declared (in addTimestamps) in a vertex downstream of the one that assigned them (the map stage); stream merging can take place there and the internal skew needs to be accounted for. For example, it should account for the longest expected GC pause or network delay. See the note in the addTimestamps method.
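On the deduplication part of the question, one possible sketch is to let the sink keep only the newest value per key, comparing the wall-clock timestamps stored in the values. The Record type, its getters, and the map name below are illustrative assumptions; Sinks.mapWithMerging is the relevant Jet sink:

p.drawFrom(KafkaSources.kafka(...))
 .withoutTimestamps()
 .drainTo(Sinks.mapWithMerging("records",
         record -> record.getKey(),
         record -> record,
         // keep whichever value carries the newer wall-clock timestamp
         (existing, incoming) ->
                 incoming.getTimestamp() > existing.getTimestamp() ? incoming : existing));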

Cross Correlation: Android AudioRecord create sample data for TDoA

On one side, with my Android smartphone, I'm recording an audio stream using AudioRecord.read(). For the recording I'm using the following specs:
SampleRate: 44100 Hz
MonoChannel
PCM-16Bit
size of the array I use for AudioRecord.read(): 100 (short array)
Using this small size allows me to read every 0.5 ms (mean value), so I can use this timestamp later for the multilateration (at least I think so :-) ). Maybe this will become obsolete if I can use cross correlation to determine the TDoA instead (see below).
On the other side I have three speakers emitting different sounds using the WebAudio API and the following specs:
freq1: 17500 Hz
freq2: 18500 Hz
freq3: 19500 Hz
signal length: 200 ms + a fade in and fade out of the gain node of 5ms, so in sum 210ms
My goal is to determine the time difference of arrival (TDoA) between the emitted sounds. So in each iteration I read 100 values from my AudioRecord buffer and then want to determine the time difference (if one of my sounds is present). So far I've used a simple frequency filter (using an FFT) to determine the TDoA, but this is really inaccurate in the real world.
I've since found out that I can determine the TDoA value even better using cross correlation (http://paulbourke.net/miscellaneous/correlate/ and some threads here on SO). Now my problem: at the moment I think I have to correlate the recorded signal (my short array) with a generated signal for each of my three sounds above, but I'm struggling to generate those signals. The code found at http://repository.tudelft.nl/view/ir/uuid%3Ab6c16565-cac8-448d-a460-224617a35ae1/ (section B1.1, genTone()) doesn't clearly solve my problem because it generates an array far bigger than my recorded samples, and as far as I know cross correlation needs two arrays of the same size to work. So how can I generate a sample array?
Another question: is my thinking of how to determine the TDoA correct so far?
Here are some lessons I've learned over the past days:
I can either use cross correlation (xcorr) or a frequency-recognition technique to determine the TDoA. The latter is far more imprecise, so I focus on the xcorr.
I can obtain the TDoA by applying the xcorr to my recorded signal and two reference signals. E.g. my recording has a length of 1000 samples; with the xcorr I recognize sound A at sample 500 and sound B at sample 600, so I know they have a time difference of 100 samples (which can be converted to seconds using the sample rate).
Therefore I generate a linear chirp (chirps work better than simple sine waves; see the literature) using this code found on SO. For an easy example, and to check whether my experiment works, I save my recording as well as my generated chirp sounds as .wav files (there are plenty of code examples for how to do this). Then I use MATLAB as an easy way to calculate the xcorr: see here
Another point: do the inputs of the xcorr have to be the same size? I'm not quite sure about this part, but I think they do. We can achieve this by zero-padding the two signals to the same length (preferably a power of two, so we can use the efficient radix-2 implementation of the FFT) and then use the FFT to calculate the xcorr (see another link from SO).
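To make the last two points concrete, here is a minimal Java sketch of the chirp generation and a brute-force (time-domain) cross correlation; the class and method names are mine, and a real implementation would switch to the zero-padded FFT approach for speed:

public class XcorrSketch {
    static final int SAMPLE_RATE = 44100;

    // Linear chirp from f0 Hz to f1 Hz lasting durationMs milliseconds,
    // as 16-bit PCM samples (same format as the AudioRecord data).
    static short[] generateChirp(double f0, double f1, int durationMs) {
        int n = SAMPLE_RATE * durationMs / 1000;
        short[] out = new short[n];
        double sweepRate = (f1 - f0) / (durationMs / 1000.0); // Hz per second
        for (int i = 0; i < n; i++) {
            double t = (double) i / SAMPLE_RATE;
            double phase = 2 * Math.PI * (f0 * t + sweepRate * t * t / 2);
            out[i] = (short) (Math.sin(phase) * Short.MAX_VALUE);
        }
        return out;
    }

    // Returns the lag (in samples) at which the reference best matches the recording.
    static int bestLag(short[] recording, short[] reference) {
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int lag = 0; lag <= recording.length - reference.length; lag++) {
            double score = 0;
            for (int i = 0; i < reference.length; i++) {
                score += (double) recording[lag + i] * reference[i];
            }
            if (score > bestScore) {
                bestScore = score;
                best = lag;
            }
        }
        return best;
    }
}

The difference between the best lags found for two reference chirps is the offset in samples; dividing by 44100 converts it to seconds.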
I hope this is correct so far and covers some questions other people may have :-)

How to handle a large datalogger CSV file / playback in Java

I have a datalogger that produces a CSV file containing UTC time and 4 parameters. The UTC time is logged about every 30 ms, followed by the 4 parameters. The problem I have is two-fold:
1) The CSV file is potentially huge if I run the datalogger for even an hour.
2) The UTC time is not exactly every 30ms.
In my simple design for a replay of the data I had planned to load the file, split each entry at the "," character, then loop through the entries assigning the UTC time value and the 4 parameters, but with the file being so large I am concerned it won't work or will be very slow. I am new to Java and am not sure if there is a better way to handle so much data (I suspect there is!).
My plan to loop through and repeat the filling of 4 variables for the parameters won't work as the UTC entries are not exact. I had planned to take a decimal place off the data, but that clearly loses me fidelity in the replay of my data. I want to be able to construct a "timeline" in my application to allow play/pause/stop-style functionality, hence my problem handling the UTC time.
Here is a sample of some of the data when the timing is pretty tight; this isn't always the case:
,13:35:38.772,0,0,0,0.3515625
,13:35:38.792,0,0,-0.0439453125,0.3515625
,13:35:38.822,0,0,0,0.3515625
,13:35:38.842,0,0,0,0.3515625
,13:35:38.872,0,0,0.0439453125,0.3515625
,13:35:38.892,0,0,0,0.3076171875
,13:35:38.922,0,0,0,0.3076171875
,13:35:38.942,0,0,0,0.3076171875
,13:35:38.962,0,0,0.0439453125,0.3515625
,13:35:38.992,0,0,0,0.3515625
,13:35:39.012,0,0,0,0.3076171875
,13:35:39.042,0,0,-0.0439453125,0.3076171875
,13:35:39.072,0,0,0,0.3515625
,13:35:39.092,0,0,0,0.3515625
,13:35:39.112,0,0,0.0439453125,0.3076171875
,13:35:39.142,0,0,0,0.3515625
,13:35:39.162,0,0,0,0.3076171875
,13:35:39.192,0,0,0,0.3515625
,13:35:39.212,0,0,0,0.3076171875
,13:35:39.242,0,0,0,0.3515625
,13:35:39.262,0,0,0,0.3076171875
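For illustration, a minimal sketch of how a file in this format could be streamed line by line instead of loaded into memory at once (the record class, file name and field order are assumptions based on the sample above, and it needs Java 16+ for the record syntax):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.LocalTime;

public class LogReplay {
    // One logged row: UTC time plus the four parameters (names are illustrative).
    record Sample(LocalTime time, double p1, double p2, double p3, double p4) {}

    public static void main(String[] args) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(Paths.get("log.csv"))) {
            String line;
            LocalTime previous = null;
            while ((line = in.readLine()) != null) {
                // Each row starts with a comma, so the first field is empty.
                String[] f = line.split(",");
                Sample s = new Sample(LocalTime.parse(f[1]),
                        Double.parseDouble(f[2]), Double.parseDouble(f[3]),
                        Double.parseDouble(f[4]), Double.parseDouble(f[5]));
                // For playback, wait for the actual delta between consecutive
                // timestamps rather than assuming a fixed 30 ms step.
                if (previous != null) {
                    long deltaMs = java.time.Duration.between(previous, s.time()).toMillis();
                    // schedule/emit s after deltaMs
                }
                previous = s.time();
            }
        }
    }
}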
I realise this is a broad question, but I am looking for a general steer in how to tackle the problem. Code is welcome, but I am expecting to have to ask more questions as time goes on.
Thanks for the help;
Andy

Unique ID throughout application lifetime

I am developing a JBoss Java EE application where I need to send messages through a messaging system (JMS or AMQP optional). There will be approximately 10k to 15k messages per second. The requirement is to generate a unique ID for each outgoing message that has never been used before, even across application restarts, i.e. the ID should never repeat throughout the application's lifetime (from day 1 of use until it is decommissioned).
I would prefer solutions based on:
Numeric value only (what data type?)
String
The auto-generation of the ID should be atomic.
Java provides a method for generating Universally Unique Identifiers in the UUID class.
Wikipedia has an explanation of why the probability that two of these collide is negligible.
I tend to like UUIDs, especially since you can easily create them from disparate sources. But you could also just use a long (64-bit) integer. At 15k messages per second you would get approximately 39 million years' worth of unique numbers (half that if you want them to be greater than zero).
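A minimal sketch of both options follows; the seeding scheme in the long-based variant is an assumption, and it relies on the system clock not moving backwards across restarts:

import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public class MessageIds {
    // Option 1: a random UUID; the collision probability is negligible.
    static String uuidId() {
        return UUID.randomUUID().toString();
    }

    // Option 2: a 64-bit counter seeded from the wall clock at startup.
    // Shifting by 20 bits leaves room for roughly a million IDs per elapsed
    // millisecond, so at ~15k messages/second a restart always resumes ahead
    // of every previously issued ID. The shift amount is an assumption; size
    // it to your peak rate.
    private static final AtomicLong COUNTER = new AtomicLong(System.currentTimeMillis() << 20);

    static long longId() {
        return COUNTER.getAndIncrement();
    }
}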
If you are looking for something fast and simple, take a look at UIDGenerator.java.
You can customize it (unique to the process only, or to the world), and it is easy to use and fast:
private static final UIDGenerator SCA_GEN = new UIDGenerator(new ScalableSequence(0, 100));
.......
SCA_GEN.next();
see my benchmarking results at:
http://zoltran.com/roller/zoltran/entry/generating_a_unique_id
or run them yourself.

Is there a good algorithm to check for changes in data over a specified period of time?

We have around 7k financial products whose closing prices should theoretically move up and down within a certain percentage range throughout a defined period of time (say a one-week or one-month period).
I have access to an internal system that stores these historical prices (not a relational database!). I would like to produce a report that lists any products whose price has not moved at all or less than say 10% over the time period.
I can't just compare the first value (day 1) to the value at the end (day n), as the price could have moved back on the last day to what it was at the start, which would be a false positive even though the price could have spiked somewhere in between.
Are there any established algorithms to do this in reasonable compute time?
There isn't any way to do this without looking at every single day.
Suppose the data looks like such:
oooo0oooo
With that one-day spike in the middle. You're not going to catch that unless you check the day that the spike happens - in other words, you need to check every single day.
If this needs to be checked often (for a large number of intervals, like daily for the last year, and for the same set of products), you can store the high and low values of each item per week/month. By combining the right weekly and/or monthly bounds with some raw data at the edges of the interval, you can get the minimum and maximum value over the interval.
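For the report itself, a minimal sketch of the per-product scan (assuming the daily closes for the interval are available as a double[]; the 10% threshold is taken relative to the first close):

// Returns true if the price range over the interval stays within the given
// fraction (e.g. 0.10 for 10%) of the first close.
static boolean barelyMoved(double[] closes, double threshold) {
    double min = closes[0];
    double max = closes[0];
    for (double c : closes) {   // single pass over every day, per the answer above
        if (c < min) min = c;
        if (c > max) max = c;
    }
    return (max - min) / closes[0] <= threshold;
}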
If you can add data to kdb (i.e. you're not limited to read access) you might consider adding the 'number of days since last price change' as a new set of data (i.e. one number per financial instrument). A daily task would then fetch today's mark and yesterday's, and update the numbers stored. Similarly you could maintain recent (last month, last year) highs and lows in kdb. You'd have to run a job over the larger dataset to prime the values initially, but then your daily updates will involve much less data.
I'd recommend that if you adopt something like this you have some way to rerun it for all or part of the dataset (say, when adding a new product).
Lastly - is the history normalised against current prices? (i.e. are revaluations for stock splits or similar taken into account). If not, you'd need to detect these discontinuities and divide them out.
EDIT
I'd investigate using kdb+/Q to implement the signal processing, rather than extracting the raw data to a Java application. As you say, it's highly performant.
You can do this if you can keep track of the min and max value of the price during the time interval - this assumes that the time interval is not being constantly changed. One way of keeping track of the min and max values of a changing set of items is with two heaps placed 'back to back' - you could store this and some pointers necessary to find and remove old items in one or two arrays in your store. The idea of putting two heaps back to back is in Knuth's Art of Computer Programming Vol 3 as Exercise 31 section 5.2.3. Knuth calls this sort of beast a Priority Dequeue, and this seems to be searchable. Min and max are available at constant cost. Cost of modifying it when a new price arrives is log n, where n is the number of items stored.
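As a simpler stand-in for that structure (not the two-heap priority dequeue itself), a TreeMap used as a sorted multiset also gives O(log n) insertion and removal of aged-out prices with cheap min/max lookups:

import java.util.TreeMap;

// Sorted multiset of the prices currently inside the interval.
class PriceWindow {
    private final TreeMap<Double, Integer> counts = new TreeMap<>();

    void add(double price) {
        counts.merge(price, 1, Integer::sum);
    }

    void remove(double price) {
        // decrement the count, dropping the entry when it reaches zero
        counts.computeIfPresent(price, (p, n) -> n == 1 ? null : n - 1);
    }

    double min() { return counts.firstKey(); }
    double max() { return counts.lastKey(); }
}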
