Time based reservoir sampling in java?


I've devised a way to do reservoir sampling in Java; the code I used is here.
I'm now feeding it a huge file, and it takes about 40 seconds to read the whole thing before outputting the results to the screen and then reading it all again. The file is too big to hold in memory, so I can't simply load it and pick a random sample.
I was hoping I could add an extra while loop so that it outputs my reservoirList at a set time interval, and not just after it has finished scanning the file.
Something like:
long startTime = System.nanoTime();
timeElapsed = 0;
while(sc.hasNext()) //avoid end of file
do{
long currentTime = System.nanoTime();
timeElapsed = (int) TimeUnit.MILLISECONDS.convert(startTime-currentTime,
TimeUnit.NANOSECONDS);
//sampling code goes here
}while(timeElapsed%5000!=0)
return reservoirList;
} return reservoirList;
But this outputs a number of lines (fewer than the full length of my reservoirList) and then a whole stream (a few hundred?) of the same line.
Is there a more elegant way to do this? One that, ideally, works.

I've cheated. For now I'm outputting every X lines read from the file, where X is large enough to give me a reasonable delay between samples. I use the count from the sampling program to work out when that is.
do {
//sampling which includes a count++
} while (count % 5000 != 0);
One final note: I initialise count to 1 to stop it outputting the first ten lines as a sample.
If anyone has a better, time-based solution, let me know.
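A time-based variant is possible without a nested loop. The attempt in the question is fragile for two reasons: startTime - currentTime is negative, and the elapsed milliseconds rarely land exactly on a multiple of 5000, so timeElapsed % 5000 != 0 is almost never false at the moment it is checked. A minimal sketch follows that instead compares the time elapsed since the last output against an interval. The class name, method name, and callback parameter are hypothetical, and the sampling step is the standard "replace with probability k/seen" reservoir update, which may differ from the linked code.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

final class TimedReservoir {
    /**
     * Reservoir-samples up to k lines and hands a copy of the reservoir to
     * onSnapshot whenever at least intervalNanos have passed since the last copy.
     */
    static List<String> sample(Iterator<String> lines, int k, long intervalNanos,
                               Random rng, Consumer<List<String>> onSnapshot) {
        List<String> reservoir = new ArrayList<>(k);
        int seen = 0; // assumes fewer than Integer.MAX_VALUE lines
        long lastEmit = System.nanoTime();
        while (lines.hasNext()) {
            String line = lines.next();
            seen++;
            if (reservoir.size() < k) {
                reservoir.add(line); // fill phase: keep the first k lines
            } else {
                int j = rng.nextInt(seen); // uniform in [0, seen)
                if (j < k) {
                    reservoir.set(j, line); // replace with probability k / seen
                }
            }
            long now = System.nanoTime();
            if (now - lastEmit >= intervalNanos) { // "at least", never "exactly"
                onSnapshot.accept(new ArrayList<>(reservoir));
                lastEmit = now;
            }
        }
        return reservoir;
    }
}
```

With a Scanner named sc (which implements Iterator<String>), a call such as TimedReservoir.sample(sc, 10, TimeUnit.SECONDS.toNanos(5), new Random(), System.out::println) would print the current sample roughly every five seconds while the file is still being read.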

Related

How do I efficiently iterate through a big list?

I want to make an open-world 2D Minecraft-like game and have the world load in chunks (just like MC) of 16x16 blocks (256 blocks in total). But I found that iterating 256 times takes almost 20 ms with code like this:
long time = System.nanoTime();
for(int i = 0; i < 16*16; i++)
{
System.out.println(i);
}
System.out.println(System.nanoTime() - time);
And since I'm not only going to print numbers but also get a block, get its texture and draw that texture onto the frame, I fear it might take even longer to iterate. Maybe I'm exaggerating a bit, but is there a way to iterate faster?

It's not the iteration that takes 20 ms, it's println().
The following will be much faster:
long time = System.nanoTime();
StringBuilder sb = new StringBuilder();
for(int i = 0; i < 16*16; i++)
{
sb.append(i).append(System.lineSeparator());
}
System.out.println(sb);
System.out.println(System.nanoTime() - time);

So, first off, keep in mind that a list of 256 elements is generally not considered big.
The main thing consuming time in your code is not iterating through the list but calling System.out.println(). Printing to the console (or any I/O operation) tends to take far longer than other instructions.
When I try your code locally I get roughly 6 ms but if I do something like this:
long secondStart = System.nanoTime();
StringBuffer stringBuffer = new StringBuffer();
for(int i = 0; i < 16*16; i++)
{
stringBuffer.append(i);
stringBuffer.append("\n");
}
System.out.println(stringBuffer);
System.out.println(System.nanoTime() - secondStart);
I get 0.5ms.
If that approach does not suit your needs, you could, as other comments suggest, traverse different parts of the list in parallel, switch to a different kind of traversal, or even use a different data structure.
Hope this helps.

You should ask yourself if you really need to do all that work. Do you need to draw things that are not seen by the camera for example? Of course not, so exclude every block in that chunk that is outside the camera rect.
Filtering out the blocks not seen implies some overhead but it is generally worth it compared to drawing every block in the chunk on each render update because drawing stuff is quite a heavy operation.
If you only want to speed up the traversal, you could spawn threads that traverse the chunk in parallel, or buy better hardware. But it is better to start by asking how you could achieve the same result with less work.
On the other hand, your computer should be able to draw 256 textures without a problem, especially if it is done on the GPU. So do some testing before making premature optimizations.
PS. It isn't really the traversal itself you want to optimize for but rather the work done in each iteration. Just iterating 256 times is going to be quite fast.
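The camera-culling idea above can be sketched as follows. This is only an illustration under stated assumptions: the chunk is axis-aligned, blocks are square with a pixel size of blockSizePx, the camera is a rectangle in the same world pixel coordinates, and every name here (ChunkCulling, visibleRange, countVisible) is hypothetical. Instead of testing all 256 blocks, it computes the index range that overlaps the camera and loops over only that range.

```java
final class ChunkCulling {
    static final int CHUNK_SIZE = 16; // blocks per chunk side, as in the question

    /**
     * Along one axis, returns {first, last}: the inclusive block indices of the chunk
     * that overlap the camera span [camMinPx, camMaxPx). first > last means none do.
     */
    static int[] visibleRange(int chunkOriginPx, int blockSizePx, int camMinPx, int camMaxPx) {
        int first = Math.floorDiv(camMinPx - chunkOriginPx, blockSizePx);
        int last = Math.floorDiv(camMaxPx - 1 - chunkOriginPx, blockSizePx);
        return new int[] { Math.max(0, first), Math.min(CHUNK_SIZE - 1, last) };
    }

    /** Counts the blocks that would be drawn; a real renderer would draw inside the loop. */
    static int countVisible(int originX, int originY, int blockSizePx,
                            int camX, int camY, int camW, int camH) {
        int[] xs = visibleRange(originX, blockSizePx, camX, camX + camW);
        int[] ys = visibleRange(originY, blockSizePx, camY, camY + camH);
        int drawn = 0;
        for (int by = ys[0]; by <= ys[1]; by++) {
            for (int bx = xs[0]; bx <= xs[1]; bx++) {
                drawn++; // placeholder for drawing block (bx, by)
            }
        }
        return drawn;
    }
}
```

Because the range is computed arithmetically, the per-frame cost scales with the number of visible blocks rather than with the chunk size.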

SlidingWindows for slow data (big intervals) on Apache Beam

I am working with the Chicago Traffic Tracker dataset, where new data is published every 15 minutes. When new data becomes available, its records lag "real time" by 10-15 minutes (for an example, look at _last_updt).
For example, at 00:20 I get data timestamped 00:10; at 00:35 I get data from 00:20; at 00:50 I get data from 00:40. So the interval at which I receive new data is fixed (every 15 minutes), although the interval between timestamps varies slightly.
I am trying to consume this data on Dataflow (Apache Beam) and for that I am playing with Sliding Windows. My idea is to collect and work on 4 consecutive datapoints (4 x 15min = 60min), and ideally update my calculation of sum/averages as soon as a new datapoint is available. For that, I've started with the code:
PCollection<TrafficData> trafficData = input
.apply("MapIntoSlidingWindows", Window.<TrafficData>into(
SlidingWindows.of(Duration.standardMinutes(60)) // (4x15)
.every(Duration.standardMinutes(15))) // interval to get new data
.triggering(AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()))
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes());
Unfortunately, it looks like when I receive a new datapoint from my input, I do not get a new (updated) result from the GroupByKey that follows.
Is this something wrong with my SlidingWindows? Or am I missing something else?

One issue may be that the watermark passes the end of the window, after which all later elements are dropped. You could try allowing a few minutes of lateness after the watermark passes:
PCollection<TrafficData> trafficData = input
.apply("MapIntoSlidingWindows", Window.<TrafficData>into(
SlidingWindows.of(Duration.standardMinutes(60)) // (4x15)
.every(Duration.standardMinutes(15))) // interval to get new data
.triggering(AfterWatermark
.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane())
.withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
.withAllowedLateness(Duration.standardMinutes(15))
.accumulatingFiredPanes());
Let me know if this helps at all.
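For reference, with 60-minute sliding windows that start every 15 minutes, each element belongs to four overlapping windows. The plain-Java sketch below only mimics that assignment rule to make it concrete; it is not Beam code, it assumes windows are aligned to multiples of the period, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SlidingWindowSketch {
    /** Start minutes of every window [start, start + size) that contains tMinutes. */
    static List<Long> windowStarts(long tMinutes, long sizeMinutes, long periodMinutes) {
        List<Long> starts = new ArrayList<>();
        // Latest window start at or before t, aligned to the period.
        long lastStart = tMinutes - Math.floorMod(tMinutes, periodMinutes);
        // Walk back one period at a time while the window still covers t.
        for (long s = lastStart; s > tMinutes - sizeMinutes; s -= periodMinutes) {
            starts.add(s);
        }
        return starts;
    }
}
```

For a record timestamped 01:10 (minute 70), windowStarts(70, 60, 15) yields the starts 60, 45, 30 and 15, that is, the windows 01:00-02:00, 00:45-01:45, 00:30-01:30 and 00:15-01:15. Each of those windows fires according to its own trigger, which is why the early and late firings in the snippets above matter.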

So @Pablo (from my understanding) gave the correct answer, but I have some suggestions that would not fit in a comment.
First, do you need sliding windows at all? From what I can tell, fixed windows would do the job and be computationally simpler. Since you are using accumulating fired panes, you don't need a sliding window, because your next DoFn will already be computing an average over the accumulated panes.
As for the code, I changed the early and late firing logic. I also suggest increasing the window size. Since you know the data arrives every 15 minutes, you should close the window some time after the 15-minute mark rather than exactly on it. But you also don't want a window size that will eventually line up with a multiple of 15 (like 20), because at 60 minutes you'll have the same problem. So pick a number that is co-prime to 15, for example 19. Also allow for late entries.
PCollection<TrafficData> trafficData = input
.apply("MapIntoFixedWindows", Window.<TrafficData>into(
FixedWindows.of(Duration.standardMinutes(19)))
.triggering(AfterWatermark.pastEndOfWindow()
// fire the moment you see an element
.withEarlyFirings(AfterPane.elementCountAtLeast(1))
//this line is optional since you already have a past end of window and a early firing. But just in case
.withLateFirings(AfterProcessingTime.pastFirstElementInPane()))
.withAllowedLateness(Duration.standardMinutes(60))
.accumulatingFiredPanes());
Let me know if that solves your issue!
EDIT
So, I could not understand how you computed the above example, so I am using a generic example. Below is a generic averaging function:
public class AverageFn extends CombineFn<Integer, AverageFn.Accum, Double> {
    public static class Accum {
        int sum = 0;
        int count = 0;
    }

    @Override
    public Accum createAccumulator() { return new Accum(); }

    @Override
    public Accum addInput(Accum accum, Integer input) {
        accum.sum += input;
        accum.count++;
        return accum;
    }

    @Override
    public Accum mergeAccumulators(Iterable<Accum> accums) {
        Accum merged = createAccumulator();
        for (Accum accum : accums) {
            merged.sum += accum.sum;
            merged.count += accum.count;
        }
        return merged;
    }

    @Override
    public Double extractOutput(Accum accum) {
        return ((double) accum.sum) / accum.count;
    }
}
In order to run it you would add the line:
PCollection<Double> average = trafficData.apply(Combine.globally(new AverageFn()));
Since you are currently using accumulating fired panes, this would be the simplest way to code a solution.
However, if you want to use discarding fired panes, you would need a PCollectionView to store the previous average and pass it as a side input to the next computation in order to keep track of the values. This is a little more complex to code, but it should improve performance because a constant amount of work is done per window, unlike with accumulating panes.
Does this make enough sense for you to write your own function for discarding fired panes?

execution of task in java within specified time

I want to execute a few lines of code within 5 ms in Java. Below is a snippet of my code:
public void delay(ArrayList<Double> delay_array, int counter_main) {
long start=System.currentTimeMillis();
ArrayList<Double> delay5msecs=new ArrayList<Double>();
int index1=0, i1=0;
while(System.currentTimeMillis() - start <= 5)
{
delay5msecs.add(i1,null);
//System.out.println("time");
i1++;
}
for(int i=0;i<counter_main-1;i++) {
if(delay5msecs.get(i)!=null) {
double x1=delay_array.get(i-index1);
delay5msecs.add(i,x1);
//System.out.println(i);
} else {
index1++;
System.out.println("index is :"+index1);
}
}
}
Now the problem is that the entire array list is getting filled with null values, and I am also getting some index-related exceptions. Basically, I want to fill my array list with 0 for 5 ms and after that fill it with the data from another array list. I haven't coded in a long time, so I appreciate your help.
Thank You.

System.currentTimeMillis() will probably not have the resolution you need for 5ms. The granularity on Windows may not be better than 15ms anyway, so your code will be very platform sensitive, and may actually not do what you want.
The resolution you need might be doable with System.nanoTime() but, again, there are platform limitations you might have to research. I recall that you can't just scale the value you get and have it work everywhere.
If you can guarantee no other threads running this code, then I suppose a naive loop and fill will work, without having to implement a worker thread that waits for the filler thread to finish.
You should try to use the Collection utilities and for-each loops instead of doing all this index math in the second part.
I suppose I should also warn you that nothing in a regular JVM is guaranteed to be real-time. So if you need a hard, dependable, reproducible 5ms you might be out of luck.
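The naive single-threaded "loop and fill" mentioned above might look like the sketch below. It uses System.nanoTime() for the deadline and appends to the list instead of doing index arithmetic. The class and method names are hypothetical, and the caveats above still apply: the number of zeros written during the delay depends entirely on how fast the machine runs the loop, and nothing here is real-time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class DelayFill {
    /** Appends 0.0 until delayMillis have elapsed, then appends every value from source. */
    static List<Double> delayThenCopy(List<Double> source, long delayMillis) {
        List<Double> out = new ArrayList<>();
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMillis);
        // Comparing the difference against 0 stays correct even if nanoTime wraps around.
        while (System.nanoTime() - deadline < 0) {
            out.add(0.0);
        }
        out.addAll(source); // the real data follows the zero padding
        return out;
    }
}
```

Calling DelayFill.delayThenCopy(delay_array, 5) returns a new list rather than mutating its input, which avoids the index exceptions that come from inserting into a list while reading from it.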

Sudoku timing irregularities

I wrote a Sudoku puzzle solver using brute force recursion. Now, I wanted to see how long it would take to solve 10 puzzles of similar types. So, I made a folder called easy and placed 10 "easy" puzzles in the folder. When I run the solver the first time it may take 171 ms, the second time it takes 37 ms, and the third run takes 16 ms. Why the different time for solving the exact same problems over again? Shouldn't the time be consistent?
The second problem is that it only displays the last puzzle solved, even though I tell it to repaint the screen after loading the puzzle and again after solving it. If I load just a single puzzle without solving it, it shows the initial puzzle state. If I then call the Solve method, the final solution is drawn on screen. Here is my method that solves multiple puzzles.
void LoadFolderAndSolve() throws FileNotFoundException {
String folderName = JOptionPane.showInputDialog("Enter folder name");
long startTime = System.currentTimeMillis();
for (int i = 1; i < 11; i++) {
String fileName = folderName + "/puzzle" + i + ".txt";
ReadPuzzle(fileName); // this has a call to repaint to show the initial puzzle
SolvePuzzle(); // this has a call to repaint to show the solution
// If I put something to delay here, puzzle 1-9 is still not shown only 10.
}
long finishTime = System.currentTimeMillis();
long difference = finishTime - startTime;
System.out.println("Time in ms - " + difference);
}

The first time it runs, the JVM needs to load the classes, create the objects you're using, and so on, so it takes more time. Further, the JVM always takes a while to "get going", which is why, when profiling, you usually run a few thousand iterations and divide the total to get a better estimate.
For the second problem it's impossible to help you without seeing the code, but a good guess would be that you're not "flushing" the data.
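The "run many iterations and divide" approach to the timing question can be sketched like this. It is a rough illustration only: the class and method names are hypothetical, the task is passed in as a Runnable, and a dedicated benchmarking harness such as JMH gives more trustworthy numbers than a hand-rolled loop.

```java
final class WarmupTiming {
    /**
     * Runs the task warmupRuns times without timing it, so class loading and
     * JIT compilation happen first, then returns the mean time per run in
     * milliseconds over measuredRuns further runs.
     */
    static double averageMillis(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // untimed warm-up
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredRuns; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / measuredRuns;
    }
}
```

Wrapping the solver call in a lambda and passing it as the task should give a figure closer to the later, faster runs reported in the question than to the first one.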

System.err & System.out interleaving in Java

To work out how much time an algorithm takes, I do this in the main method, but the time is not printed where I expect because it gets interleaved with the System.out output.
long startTime = System.currentTimeMillis();
A1.Print(2);
long endTime = System.currentTimeMillis();
System.err.print(endTime - startTime);
and if the class A is this:
public class A {
    public void Print(int n) {
        for (int i = 0; i <= n; i++) {
            System.out.println(i);
        }
    }
}
it prints
0
1
2
and on the next line it is supposed to print the amount of time taken to go through that loop, but it simply won't. So it doesn't print like this:
0
1
2
1
Here the last line, 1, is the time in milliseconds taken by the algorithm. The textbook says you MUST use System.err and figure out a way to prevent interleaving.

You could do something like
System.setErr(System.out);
so the output is in the same stream. They use two different streams so that's why you get interleaving.
For your code it would be:
long startTime = System.currentTimeMillis();
System.setErr(System.out);
A1.Print(50);
long endTime = System.currentTimeMillis();
System.err.print(endTime - startTime);

System.err and System.out use different buffers (the details depend on the OS), so these buffers may be flushed at different times, which can produce interleaved output.
Also, System.err is not guaranteed to be directed to the console by default (unlike System.out); it might be linked to the console or to a file.
To resolve this, you can point System.err at System.out, like this:
System.setErr(System.out);

If you are in Eclipse, it's a known bug: https://bugs.eclipse.org/bugs/show_bug.cgi?id=32205. Try the same from the command line.

You should probably be using System.err.println instead of System.err.print. It might simply be buffering until it gets an entire line.
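If System.err has to stay a separate stream, as the textbook requires, another option is to flush System.out explicitly before writing the timing to System.err. The sketch below assumes this is enough for the terminal in use; an IDE console that reads the two streams independently may still reorder them. The class and method names are hypothetical, and the streams are passed in as parameters only to keep the example self-contained.

```java
import java.io.PrintStream;

final class OrderedOutput {
    /** Runs the task, then reports its duration on err after flushing out. */
    static void runAndReportMillis(Runnable task, PrintStream out, PrintStream err) {
        long start = System.currentTimeMillis();
        task.run();
        long elapsed = System.currentTimeMillis() - start;
        out.flush();          // push any pending normal output first
        err.println(elapsed); // println ends the line, so the number is not held back
        err.flush();
    }
}
```

For the code in the question this would be called as OrderedOutput.runAndReportMillis(() -> A1.Print(2), System.out, System.err).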
