Interleaved parallel file read slower than sequential read? (Java)

I have implemented a small IO class that can read from multiple copies of the same file on different disks (e.g. two hard disks containing the same file). In the sequential case each disk averages about 60 MB/s over the file, but when I do an interleaved read (e.g. 4k from disk 1, then 4k from disk 2, then combine), the effective read speed drops to 40 MB/s instead of increasing. Why?
Context: Win 7 + JDK 7b70, 2GB RAM, 2.2GB test file. Basically, I try to mimic Win7's ReadyBoost and RAID x in a poor man's fashion.
At its heart, when a read() is issued to the class, it creates two tasks with instructions to read a pre-opened RandomAccessFile from a certain position and length. Using an executor service and Future.get() calls, when both finish, the data read gets copied into a common buffer and returned to the caller.
Is there a conceptual error in my approach? (For example, will the OS caching mechanism always counteract it?)
protected <T> List<T> waitForAll(List<Future<T>> futures)
throws MultiIOException {
MultiIOException mex = null;
int i = 0;
List<T> result = new ArrayList<T>(futures.size());
for (Future<T> f : futures) {
try {
result.add(f.get());
} catch (InterruptedException ex) {
if (mex == null) {
mex = new MultiIOException();
}
mex.exceptions.add(new ExceptionPair(metrics[i].file, ex));
} catch (ExecutionException ex) {
if (mex == null) {
mex = new MultiIOException();
}
mex.exceptions.add(new ExceptionPair(metrics[i].file, ex));
}
i++;
}
if (mex != null) {
throw mex;
}
return result;
}
public int read(long position, byte[] output, int start, int length)
throws IOException {
if (start < 0 || start + length > output.length) {
throw new IndexOutOfBoundsException(
String.format("start=%d, length=%d, output=%d",
start, length, output.length));
}
// compute the fragment sizes and positions
int result = 0;
final long[] positions = new long[metrics.length];
final int[] lengths = new int[metrics.length];
double speedSum = 0.0;
double maxValue = 0.0;
int maxIndex = 0;
for (int i = 0; i < metrics.length; i++) {
speedSum += metrics[i].readSpeed;
if (metrics[i].readSpeed > maxValue) {
maxValue = metrics[i].readSpeed;
maxIndex = i;
}
}
// adjust read lengths
int lengthSum = length;
for (int i = 0; i < metrics.length; i++) {
int len = (int)Math.ceil(length * metrics[i].readSpeed / speedSum);
lengths[i] = (len > lengthSum) ? lengthSum : len;
lengthSum -= lengths[i];
}
if (lengthSum > 0) {
lengths[maxIndex] += lengthSum;
}
// adjust read positions
long positionDelta = position;
for (int i = 0; i < metrics.length; i++) {
positions[i] = positionDelta;
positionDelta += (long)lengths[i];
}
List<Future<byte[]>> futures = new LinkedList<Future<byte[]>>();
// read in parallel
for (int i = 0; i < metrics.length; i++) {
final int j = i;
futures.add(exec.submit(new Callable<byte[]>() {
@Override
public byte[] call() throws Exception {
byte[] buffer = new byte[lengths[j]];
long t = System.nanoTime();
long t0 = t;
long currPos = metrics[j].handle.getFilePointer();
metrics[j].handle.seek(positions[j]);
t = System.nanoTime() - t;
metrics[j].seekTime = t * 1024.0 * 1024.0 /
Math.abs(currPos - positions[j]) / 1E9 ;
int c = metrics[j].handle.read(buffer);
t0 = System.nanoTime() - t0;
// adjust the read speed if we read something
if (c > 0) {
metrics[j].readSpeed = (alpha * c * 1E9 / t0 / 1024 / 1024
+ (1 - alpha) * metrics[j].readSpeed) ;
}
if (c < 0) {
return null;
} else
if (c == 0) {
return EMPTY_BYTE_ARRAY;
} else
if (c < buffer.length) {
return Arrays.copyOf(buffer, c);
}
return buffer;
}
}));
}
List<byte[]> data = waitForAll(futures);
boolean eof = true;
for (byte[] b : data) {
if (b != null && b.length > 0) {
System.arraycopy(b, 0, output, start + result, b.length);
result += b.length;
eof = false;
} else {
break; // the rest probably reached EOF
}
}
// if there was no data at all, we reached the end of file
if (eof) {
return -1;
}
sequentialPosition = position + (long)result;
// evaluate the fastest file to read
double maxSpeed = 0;
maxIndex = 0;
for (int i = 0; i < metrics.length; i++) {
if (metrics[i].readSpeed > maxSpeed) {
maxSpeed = metrics[i].readSpeed;
maxIndex = i;
}
}
fastest = metrics[maxIndex];
return result;
}
(The FileMetrics objects in the metrics array contain read-speed measurements used to adaptively determine the buffer sizes of the various input channels; in my test, alpha = 0 and readSpeed = 1 results in an equal distribution.)
Edit
I ran a non-interleaved test (i.e. reading the two files independently in separate threads) and got a combined effective speed of 110MB/s.
Edit2
I think I know why this is happening.
When I read the interleaved fragments in parallel, the disks do not see a sequential read but a read-skip-read-skip pattern due to the interleaving (possibly riddled with allocation table lookups as well). This basically reduces the effective read speed per disk to half or worse.

As you said, a sequential read on a disk is much faster than a read-skip-read-skip pattern. Hard disks are capable of high bandwidth when reading sequentially, but the seek time (latency) is expensive.
Instead of storing a copy of the file in each disk, try storing block i of the file on disk i (mod 2). This way you can read from both disks sequentially and recombine the result in memory.
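A minimal sketch of reading such a layout back, assuming the file has already been split into two stripe files (one per disk), with even-numbered blocks in the first and odd-numbered blocks in the second. The file names, block size, and class name are illustrative, and each stripe is assumed to fit in memory:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StripedRead {
    static final int BLOCK = 4 * 1024 * 1024; // illustrative stripe block size

    // stripe0 holds blocks 0, 2, 4, ... and stripe1 holds blocks 1, 3, 5, ...
    public static byte[] readStriped(final String stripe0, final String stripe1,
                                     int originalLength) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(2);
        Future<byte[]> f0 = exec.submit(new Callable<byte[]>() {
            public byte[] call() throws Exception {
                return Files.readAllBytes(Paths.get(stripe0)); // purely sequential read
            }
        });
        Future<byte[]> f1 = exec.submit(new Callable<byte[]>() {
            public byte[] call() throws Exception {
                return Files.readAllBytes(Paths.get(stripe1)); // purely sequential read
            }
        });
        byte[] s0 = f0.get();
        byte[] s1 = f1.get();
        exec.shutdown();

        // Recombine: block i of the original file came from stripe (i % 2).
        byte[] out = new byte[originalLength];
        int pos = 0, off0 = 0, off1 = 0;
        for (int block = 0; pos < originalLength; block++) {
            int len = Math.min(BLOCK, originalLength - pos);
            if (block % 2 == 0) {
                System.arraycopy(s0, off0, out, pos, len);
                off0 += len;
            } else {
                System.arraycopy(s1, off1, out, pos, len);
                off1 += len;
            }
            pos += len;
        }
        return out;
    }
}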

If you want to do a parallel read, break the read into two sequential reads. Find the halfway point and read the first half from the first file and the second half from the second file.
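For example, a sketch of that split using two pre-opened RandomAccessFile handles onto the two copies (names are illustrative, and short-read/EOF handling is glossed over):
import java.io.RandomAccessFile;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class HalfSplitRead {
    // Reads length bytes starting at position: the first half from copy1,
    // the second half from copy2, each as one contiguous sequential read.
    public static int read(final RandomAccessFile copy1, final RandomAccessFile copy2,
            final long position, final byte[] output, final int length) throws Exception {
        final int half = length / 2;
        ExecutorService exec = Executors.newFixedThreadPool(2);
        Future<Integer> first = exec.submit(new Callable<Integer>() {
            public Integer call() throws Exception {
                copy1.seek(position);
                return copy1.read(output, 0, half);
            }
        });
        Future<Integer> second = exec.submit(new Callable<Integer>() {
            public Integer call() throws Exception {
                copy2.seek(position + half);
                return copy2.read(output, half, length - half);
            }
        });
        int a = first.get();
        int b = second.get();
        exec.shutdown();
        return Math.max(a, 0) + Math.max(b, 0); // ignores partial reads for brevity
    }
}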

If you are sure that you are performing no more than one read per disk (otherwise you will get many disk misses), you still create contention on other parts of the computer - the bus, the RAID controller (if one exists), and so on.

Maybe http://stxxl.sourceforge.net/ is of interest to you, too.

Related

Find the start of a cache line for a Java byte array

For a high performance blocked bloom filter, I would like to align data to cache lines. (I know it's easier to do such tricks in C, but I would like to use Java.)
I do have a solution, but I'm not sure if it's correct, or if there is a better way. My solution tries to find the start of the cache line using the following algorithm:
for each possible offset o (0..63; I assume cache line length of 64)
start a thread that reads from data[o] and writes that to data[o + 8]
in the main thread, write '1' to data[o], and wait until that ends up in data[o + 8] (so wait for the other thread)
repeat that
Then, measure how fast this was - basically, how many increments happened during a loop of 1 million iterations (in each thread). My logic is that it is slower if the two locations are in different cache lines.
Here is my code:
public static void main(String... args) {
for(int i=0; i<20; i++) {
int size = (int) (1000 + Math.random() * 1000);
byte[] data = new byte[size];
int cacheLineOffset = getCacheLineOffset(data);
System.out.println("offset: " + cacheLineOffset);
}
}
private static int getCacheLineOffset(byte[] data) {
for (int i = 0; i < 10; i++) {
int x = tryGetCacheLineOffset(data, i + 3);
if (x != -1) {
return x;
}
}
System.out.println("Cache line start not found");
return 0;
}
private static int tryGetCacheLineOffset(byte[] data, int testCount) {
// assume synchronization between two threads is faster(?)
// if each thread works on the same cache line
int[] counters = new int[64];
int testOffset = 8;
for (int test = 0; test < testCount; test++) {
for (int offset = 0; offset < 64; offset++) {
final int o = offset;
final Semaphore sema = new Semaphore(0);
Thread t = new Thread() {
public void run() {
try {
sema.acquire();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
for (int i = 0; i < 1000000; i++) {
data[o + testOffset] = data[o];
}
}
};
t.start();
sema.release();
data[o] = 1;
int counter = 0;
byte waitfor = 1;
for (int i = 0; i < 1000000; i++) {
byte x = data[o + testOffset];
if (x == waitfor) {
data[o]++;
counter++;
waitfor++;
}
}
try {
t.join();
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
counters[offset] += counter;
}
}
Arrays.fill(data, 0, testOffset + 64, (byte) 0);
int low = Integer.MAX_VALUE, high = Integer.MIN_VALUE;
for (int i = 0; i < 64; i++) {
// average of 3
int avg3 = (counters[(i - 1 + 64) % 64] + counters[i] + counters[(i + 1) % 64]) / 3;
low = Math.min(low, avg3);
high = Math.max(high, avg3);
}
if (low * 1.1 > high) {
// no significant difference between low and high
return -1;
}
int lowCount = 0;
boolean[] isLow = new boolean[64];
for (int i = 0; i < 64; i++) {
if (counters[i] < (low + high) / 2) {
isLow[i] = true;
lowCount++;
}
}
if (lowCount != 8) {
// unclear
return -1;
}
for (int i = 0; i < 64; i++) {
if (isLow[(i - 1 + 64) % 64] && !isLow[i]) {
return i;
}
}
return -1;
}
It prints (example):
offset: 16
offset: 24
offset: 0
offset: 40
offset: 40
offset: 8
offset: 24
offset: 40
...
So arrays in Java seem to be aligned to 8 bytes.
You know that the GC can move objects... so your perfectly aligned array may get misaligned later.
I'd try ByteBuffer; I guess a direct one is often aligned quite generously (to a page boundary).
Unsafe can give you the address and with JNI, you can get an array pinned.
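If a direct ByteBuffer is an option instead of a byte[], Java 9 added an API for exactly this; a minimal sketch:
import java.nio.ByteBuffer;

public class AlignedBuffer {
    public static void main(String[] args) {
        // Over-allocate, then take a slice whose index 0 sits on a 64-byte boundary.
        ByteBuffer raw = ByteBuffer.allocateDirect(1024 + 64);
        ByteBuffer aligned = raw.alignedSlice(64); // Java 9+
        // 0 means index 0 of the slice starts exactly on a 64-byte boundary.
        System.out.println("offset of index 0: " + aligned.alignmentOffset(0, 64));
    }
}
Direct buffer memory lives off-heap and is not moved by the GC, which also addresses the first point above.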
First things first - everything in Java is 8-byte aligned, not only arrays. There's a tool for that, Java Object Layout (JOL), that you can play with. A small-ish aside (unrelated, but related): in Java 9, Strings are internally stored as byte[] to shrink their footprint for LATIN-1 content; because everything is 8-byte aligned, a coder field (a byte) could be added without making any String instance bigger - there was a gap big enough to fit that byte.
Your idea that aligned objects are faster to access is right. This is much more visible when multiple threads try to access the same data, which is known as false sharing (but I bet you knew that). By the way, there are methods in Unsafe that will show you object addresses, but since the GC can move objects around, this is useless for your requirement.
You would not be the first to try to overcome this. Unfortunately, if you read that blog entry, you will see that even very experienced developers (whom I admire) fail at this. The VM is notoriously good at removing checks and code that you might think are needed somewhere, especially once the C2 JIT kicks in.
What you are really looking for is:
jdk.internal.vm.annotation.Contended
annotation. This is the only way that will guarantee cache line alignment. If you really want to read about all the other "tricks" that could be done, then Aleksey Shipilev's examples are the ones you are looking for.
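A minimal sketch of using the annotation on JDK 9+ (on JDK 8 the equivalent lives in sun.misc). The class and field names are made up, and the flags shown are the usual incantation for letting user code see and have the JVM honor an internal annotation:
// Compile and run with something like:
//   javac --add-exports java.base/jdk.internal.vm.annotation=ALL-UNNAMED Counters.java
//   java -XX:-RestrictContended Counters
import jdk.internal.vm.annotation.Contended;

public class Counters {
    // Each annotated field is padded out, so two threads hammering c1 and c2
    // respectively do not end up false-sharing one cache line.
    @Contended volatile long c1;
    @Contended volatile long c2;

    public static void main(String[] args) {
        System.out.println("loaded " + Counters.class.getName());
    }
}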

How to reduce noise in signal overlap add in Java?

I have been working with a Java program (developed by other people) for text-to-speech synthesis. The synthesis is done by concatenation of "di-phones". In the original version, there was no signal processing; the diphones were just collected and concatenated together to produce the output. In order to improve the output, I tried to perform "phase matching" of the concatenated speech signals. The modification I've made is summarized here:
Audio data is collected from the AudioInputStream into a byte array.
Since the audio data is 16-bit, I converted the byte array to a short array.
The "signal processing" is done on the short array.
To output the audio data, the short array is converted back to a byte array.
Here's the part of the code that I've changed in the existing program:
Audio Input
This segment is called for every diphone.
Original Version
audioInputStream = AudioSystem.getAudioInputStream(sound);
while ((cnt = audioInputStream.read(byteBuffer, 0, byteBuffer.length)) != -1) {
if (cnt > 0) {
byteArrayPlayStream.write(byteBuffer, 0, cnt);
}
}
My Version
// public variable declarations
byte byteSoundFile[]; // byteSoundFile will contain a whole word or the diphones of a whole word
short shortSoundFile[] = new short[5000000]; // sound contents are taken in a short[] array for signal processing
short shortBuffer[];
int pos = 0;
int previousPM = 0;
boolean isWord = false;
public static HashMap<String, Integer> peakMap1 = new HashMap<String, Integer>();
public static HashMap<String, Integer> peakMap2 = new HashMap<String, Integer>();
// code for receiving and processing audio data
if(pos == 0) {
// a new word is going to be processed.
// so reset the shortSoundFile array
Arrays.fill(shortSoundFile, (short)0);
}
audioInputStream = AudioSystem.getAudioInputStream(sound);
while ((cnt = audioInputStream.read(byteBuffer, 0, byteBuffer.length)) != -1) {
if (cnt > 0) {
byteArrayPlayStream.write(byteBuffer, 0, cnt);
}
}
byteSoundFile = byteArrayPlayStream.toByteArray();
int nSamples = byteSoundFile.length;
byteArrayPlayStream.reset();
if(nSamples > 80000) { // it is a word
pos = nSamples;
isWord = true;
}
else { // it is a diphone
// audio data is converted from byte to short, so nSamples is halved
nSamples /= 2;
// transfer byteSoundFile contents to shortBuffer using byte-to-short conversion
shortBuffer = new short[nSamples];
for(int i=0; i<nSamples; i++) {
shortBuffer[i] = (short)((short)(byteSoundFile[i<<1]) << 8 | (short)byteSoundFile[(i<<1)+1]);
}
/************************************/
/**** phase-matching starts here ****/
/************************************/
int pm1 = 0;
int pm2 = 0;
String soundStr = sound.toString();
if(soundStr.contains("\\") && soundStr.contains(".")) {
soundStr = soundStr.substring(soundStr.indexOf("\\")+1, soundStr.indexOf("."));
}
if(peakMap1.containsKey(soundStr)) {
// perform overlap and add
System.out.println("we are here");
pm1 = peakMap1.get(soundStr);
pm2 = peakMap2.get(soundStr);
/*
Idea:
If pm1 is located after more than one third of the samples,
then there will be too much overlapping.
If pm2 is located before two thirds of the samples,
then there will also be extra overlapping for the next diphone.
In both of these cases, we will not perform the peak-matching operation.
*/
int idx1 = (previousPM == 0) ? pos : previousPM - pm1;
if((idx1 < 0) || (pm1 > (nSamples/3))) {
idx1 = pos;
}
int idx2 = idx1 + nSamples - 1;
for(int i=idx1, j=0; i<=idx2; i++, j++) {
if(i < pos) {
shortSoundFile[i] = (short) ((shortSoundFile[i] >> 1) + (shortBuffer[j] >> 1));
}
else {
shortSoundFile[i] = shortBuffer[j];
}
}
previousPM = (pm2 < (nSamples/3)*2) ? 0 : idx1 + pm2;
pos = idx2 + 1;
}
else {
// no peak found. simply concatenate the audio data
for(int i=0; i<nSamples; i++) {
shortSoundFile[pos++] = shortBuffer[i];
}
previousPM = 0;
}
Audio Output
After collecting all the diphones of a word, this segment is called to play the audio output.
Original Version
byte audioData[] = byteArrayPlayStream.toByteArray();
... code for writing audioData to output stream
My Version
byte audioData[];
if(isWord) {
audioData = Arrays.copyOf(byteSoundFile, pos);
isWord = false;
}
else {
audioData = new byte[pos*2];
for(int i=0; i<pos; i++) {
audioData[(i<<1)] = (byte) (shortSoundFile[i] >>> 8);
audioData[(i<<1)+1] = (byte) (shortSoundFile[i]);
}
}
pos = 0;
... code for writing audioData to output stream
But after the modification was made, the output became worse. There is a lot of noise in the output.
Here is a sample audio with modification: modified output
Here is a sample audio from the original version: original output
Now I'd appreciate it if anyone can point out the reason for the noise and how to remove it. Am I doing anything wrong in the code? I have tested my algorithm in Matlab and it worked fine.
The problem has been solved temporarily. It turns out that the conversion between byte array and short array is not necessary. The required signal processing operations can be performed directly on byte arrays.
I'd like to keep this question open in case someone finds the bug(s) in the given code.

Hashmap memoization slower than directly computing the answer

I've been playing around with the Project Euler challenges to help improve my knowledge of Java. In particular, I wrote the following code for problem 14, which asks you to find the longest Collatz chain which starts at a number below 1,000,000. It works on the assumption that subchains are incredibly likely to arise more than once, and by storing them in a cache, no redundant calculations are done.
Collatz.java:
import java.util.HashMap;
public class Collatz {
private HashMap<Long, Integer> chainCache = new HashMap<Long, Integer>();
public void initialiseCache() {
chainCache.put((long) 1, 1);
}
private long collatzOp(long n) {
if(n % 2 == 0) {
return n/2;
}
else {
return 3*n +1;
}
}
public int collatzChain(long n) {
if(chainCache.containsKey(n)) {
return chainCache.get(n);
}
else {
int count = 1 + collatzChain(collatzOp(n));
chainCache.put(n, count);
return count;
}
}
}
ProjectEuler14.java:
public class ProjectEuler14 {
public static void main(String[] args) {
Collatz col = new Collatz();
col.initialiseCache();
long limit = 1000000;
long temp = 0;
long longestLength = 0;
long index = 1;
for(long i = 1; i < limit; i++) {
temp = col.collatzChain(i);
if(temp > longestLength) {
longestLength = temp;
index = i;
}
}
System.out.println(index + " has the longest chain, with length " + longestLength);
}
}
This works. And according to the Measure-Command cmdlet in Windows PowerShell, it takes roughly 1708 milliseconds (1.708 seconds) to execute.
However, after reading through the forums, I noticed that some people, who had written seemingly naive code, which calculate each chain from scratch, seemed to be getting much better execution times than me. I (conceptually) took one of the answers, and translated it into Java:
NaiveProjectEuler14.java:
public class NaiveProjectEuler14 {
public static void main(String[] args) {
int longest = 0;
int numTerms = 0;
int i;
long j;
for (i = 1; i <= 10000000; i++) {
j = i;
int currentTerms = 1;
while (j != 1) {
currentTerms++;
if (currentTerms > numTerms){
numTerms = currentTerms;
longest = i;
}
if (j % 2 == 0){
j = j / 2;
}
else{
j = 3 * j + 1;
}
}
}
System.out.println("Longest: " + longest + " (" + numTerms + ").");
}
}
On my machine, this also gives the correct answer, but it gives it in roughly 502 milliseconds (0.502 seconds) - about a third of the time taken by my original program. At first I thought that maybe there was a small overhead in creating a HashMap, and that the times taken were too small to draw any conclusions. However, if I increase the upper limit from 1,000,000 to 10,000,000 in both programs, NaiveProjectEuler14 takes 4709 milliseconds (4.709 seconds), whilst ProjectEuler14 takes a whopping 25324 milliseconds (25.324 seconds)!
Why does ProjectEuler14 take so long? The only explanation I can fathom is that storing huge amounts of pairs in the HashMap data structure is adding a huge overhead, but I can't see why that should be the case. I've also tried recording the number of (key, value) pairs stored during the course of the program (2,168,611 pairs for the 1,000,000 case, and 21,730,849 pairs for the 10,000,000 case) and supplying a little over that number to the HashMap constructor so that it only has to resize itself at most once, but this does not seem to affect the execution times.
Does anyone have any rationale for why the memoized version is a lot slower?
There are some reasons for that unfortunate reality:
Instead of containsKey, do an immediate get and check for null
The code makes an extra method call (collatzOp) for every step
The map stores wrapped objects (Long, Integer) instead of primitives
The JIT compiler, translating byte code to machine code, can do more with plain calculations
The caching does not cover a large percentage of the calls, unlike e.g. a Fibonacci memoization
A comparable version would be:
public static void main(String[] args) {
    int longest = 0;
    int numTerms = 0;
    Map<Long, Integer> map = new HashMap<>();
    for (int i = 1; i <= 10000000; i++) {
        long j = i;
        Integer terms = map.get((long) i); // the key type is Long, so box explicitly or this never matches
        if (terms != null) {
            continue;
        }
        int currentTerms = 1;
        while (j != 1) {
            currentTerms++;
            if (j % 2 == 0) {
                j = j / 2;
                // check the map only here: 3*j + 1 always yields a larger value
                Integer m = map.get(j);
                if (m != null) {
                    currentTerms += m - 1; // m already counts j itself
                    break;
                }
            } else {
                j = 3 * j + 1;
            }
        }
        map.put((long) i, currentTerms); // memoize under the starting value
        if (currentTerms > numTerms) {
            numTerms = currentTerms;
            longest = i;
        }
    }
    System.out.println("Longest: " + longest + " (" + numTerms + ").");
}
This still does not do fully adequate memoization: the map is only consulted in the j/2 branch. Since 3*j + 1 always increases the value, skipping the lookup there mostly avoids misses, but it can also skip memoized values.
Memoization pays off when there is heavy calculation per call. If the function takes long because of deep recursion rather than per-call calculation, the memoization overhead per function call counts against it.
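To illustrate the boxing/overhead point: the same memoization with a plain int[] cache indexed by the starting value avoids the wrapper objects and the hashing entirely. A sketch, not a drop-in replacement for the classes above:
public class ArrayCollatz {
    public static void main(String[] args) {
        final int limit = 10_000_000;
        int[] cache = new int[limit + 1]; // cache[n] = chain length starting at n, 0 = unknown
        cache[1] = 1;
        int longest = 0, bestStart = 1;
        for (int i = 2; i <= limit; i++) {
            long j = i;
            int steps = 0;
            // walk until we drop below i; every value below i is already cached
            while (j >= i) {
                steps++;
                j = (j % 2 == 0) ? j / 2 : 3 * j + 1;
            }
            int length = steps + cache[(int) j];
            cache[i] = length;
            if (length > longest) {
                longest = length;
                bestStart = i;
            }
        }
        System.out.println(bestStart + " has the longest chain, with length " + longest);
    }
}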

Parallel sum of elements in a large Array

I have program that sums the elements in a very large array. I want to parallelize this sum.
#define N some_very_large_no // say 1e12
float x[N]; // read from a file
float sum = 0.0;
int main()
{
    for (int i = 0; i < N; i++)
        sum = sum + x[i];
}
How can I parallelize this sum using threads (C/C++/Java - any code example is fine)? How many threads should I use for optimal performance if the machine has 8 cores?
EDIT: N may be really large (larger than 1e6 actually) and varies based on the size of the file I read the data from. The file is on the order of GBs.
Edit: N is changed to a large value (1e12 to 1e16)
In Java you can write
int cpus = Runtime.getRuntime().availableProcessors();
// you would keep this pool around for other tasks as well
ExecutorService service = Executors.newFixedThreadPool(cpus);

final float[] floats = new float[N]; // filled from the file

List<Future<Double>> tasks = new ArrayList<>();
int blockSize = (floats.length + cpus - 1) / cpus;
for (int i = 0; i < cpus; i++) {
    final int start = blockSize * i;
    final int end = Math.min(blockSize * (i + 1), floats.length);
    tasks.add(service.submit(new Callable<Double>() {
        public Double call() {
            double d = 0;
            for (int j = start; j < end; j++)
                d += floats[j];
            return d;
        }
    }));
}
double sum = 0;
for (Future<Double> task : tasks)
    sum += task.get();
As WhozCraig mentions, it is likely that one million floats isn't enough to need multiple threads, or you could find that your bottleneck is how fast you can load the array from main memory (a single-threaded resource). In any case, you can't assume it will be faster once you include the cost of getting the data.
You say that the array comes from a file. If you time the different parts of the program, you'll find that summing up the elements takes a negligible amount of time compared to how long it takes to read the data from disk. From Amdahl's Law it follows that there is nothing to be gained by parallelising the summing up.
If you need to improve performance, you should focus on improving the I/O throughput.
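One way to act on that: sum while streaming the file, so the cheap additions are overlapped with the expensive reads instead of being a separate pass. A sketch that assumes the file holds raw little-endian 32-bit floats; the path, buffer size, and format are assumptions of this example:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class StreamingSum {
    public static double sum(String path) throws IOException {
        double total = 0;
        ByteBuffer buf = ByteBuffer.allocateDirect(8 * 1024 * 1024)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            while (ch.read(buf) != -1) {
                buf.flip();
                while (buf.remaining() >= 4) {
                    total += buf.getFloat(); // sum as the data arrives
                }
                buf.compact(); // keep any partial float for the next read
            }
        }
        return total;
    }
}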
You can use many threads (more than there are cores), but the number of threads and the resulting performance depend on your algorithm and how the threads work.
If the array length is 100000, create x threads and have each one calculate arr[x] to arr[x+limit], where you set limit so that there is no overlap with other threads and no element remains unused.
thread creation:
pthread_t tid[COUNT];
int i = 0;
int err;
while (i < COUNT)
{
void *arg;
arg = x; // pass a number here which tells this thread where to start, i.e. arr[x]
err = pthread_create(&(tid[i]), NULL, &doSomeThing, arg);
if (err != 0)
printf("\ncan't create thread :[%s]", strerror(err));
else
{
//printf("\n Thread created successfully\n");
}
i++;
}
// NOW CALCULATE....
for (int i = 0; i < COUNT; i++)
{
pthread_join(tid[i], NULL);
}
}
void* doSomeThing(void *arg)
{
int *x;
x = (int *) (arg);
// now use this x to start the array sum from arr[x] up to your limit, which should not overlap with other threads
return NULL;
}
Use a divide-and-conquer algorithm:
Divide the array into 2 or more parts (keep dividing recursively until you get arrays of manageable size)
Start computing the sums of the sub-arrays (the divided arrays) in separate threads
Finally, add the sums produced by all the threads together to produce the final result (see the sketch below)
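In Java, this divide-and-conquer shape maps naturally onto the fork/join framework. A minimal sketch (the class name and threshold are arbitrary choices of this example):
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ParallelSum extends RecursiveTask<Double> {
    private static final int THRESHOLD = 1 << 16; // below this size, sum directly
    private final float[] data;
    private final int from, to;

    ParallelSum(float[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Double compute() {
        if (to - from <= THRESHOLD) {
            double s = 0;
            for (int i = from; i < to; i++) s += data[i];
            return s;
        }
        int mid = (from + to) >>> 1;
        ParallelSum left = new ParallelSum(data, from, mid);
        ParallelSum right = new ParallelSum(data, mid, to);
        left.fork();                          // compute the left half asynchronously
        return right.compute() + left.join(); // right half in this thread, then combine
    }

    public static double sum(float[] data) {
        return ForkJoinPool.commonPool().invoke(new ParallelSum(data, 0, data.length));
    }
}
Calling ParallelSum.sum(x) splits the array recursively on the common pool and adds the partial sums back together, which is exactly the three steps above.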
As others have said, the time-cost of reading the file is almost certainly going to be much larger than that of calculating the sum. Is it a text file or binary? If the numbers are stored as text, then the cost of reading them can be very high depending on your implementation.
You should also be careful adding a large number of floats. Because of their limited precision, small values late in the array may not contribute to the sum. Think about at least using a double to accumulate the values.
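A double accumulator is the easy fix; if you want to go further, Kahan (compensated) summation keeps an explicit running error term. A small sketch:
public class KahanSum {
    // Compensated summation: c accumulates the low-order bits lost at each addition.
    public static double sum(float[] x) {
        double sum = 0.0;
        double c = 0.0;
        for (float v : x) {
            double y = v - c;
            double t = sum + y;
            c = (t - sum) - y; // what was lost when y was added to sum
            sum = t;
        }
        return sum;
    }
}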
You can use pthreads in C to solve your problem.
Here is my code for N=4 (you can change it to suit your needs).
To run this code, use the following commands:
gcc -pthread test.c -o test
./test
#include<stdio.h>
#include<stdlib.h>
#include<pthread.h>
#define NUM_THREADS 5
pthread_t threads[NUM_THREADS];
pthread_mutex_t mutexsum;
int a[2500];
int sum = 0;
void *do_work(void* parms) {
long tid = (long)parms;
printf("I am thread # %ld\n ", tid);
int start, end, mysum = 0;
start = (int)tid * 500;
end = start + 500;
int i = 0;
printf("Thread # %ld with start = %d and end = %d \n",tid,start,end);
for (int i = start; i < end; i++) {
mysum += a[i];
}
pthread_mutex_lock(&mutexsum);
printf("Thread # %ld lock and sum = %d\n",tid,sum);
sum += mysum;
pthread_mutex_unlock(&mutexsum);
pthread_exit(NULL);
}
int main(int argc, char* argv[]) {
int i = 0; int rc;
pthread_attr_t attr;
pthread_mutex_init(&mutexsum, NULL);
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
pthread_mutex_init(&mutexsum, NULL);
printf("Initializing array : \n");
for(i=0;i<2500;i++){
a[i]=1;
}
for (i = 0; i < NUM_THREADS; i++) {
printf("Creating thread # %d.\n", i);
rc = pthread_create(&threads[i], &attr, &do_work, (void *)i);
if (rc) {
printf("Error in thread %d with rc = %d. \n", i, rc);
exit(-1);
}
}
pthread_attr_destroy(&attr);
printf("Creating threads complete. start ruun " );
for (i = 0; i < NUM_THREADS; i++) {
pthread_join(threads[i], NULL);
}
printf("\n\tSum : %d", sum);
pthread_mutex_destroy(&mutexsum);
pthread_exit(NULL);
}
OpenMP supports built-in reduction. Add flag -fopenmp while compiling.
#include <omp.h>
#define N some_very_large_no // say 1e12
float x[N]; // read from a file
int main()
{
float sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; i++)
sum=sum+x[i];
return 0;
}

"Last 100 bytes" Interview Scenario

I got this question in an interview the other day and would like to know some of the best possible answers (I did not answer very well, haha):
Scenario: There is a webpage that is monitoring the bytes sent over some network. Every time a byte is sent, the recordByte() function is called with that byte; this could happen hundreds of thousands of times per day. There is a button on this page that, when pressed, displays the last 100 bytes passed to recordByte() on screen (it does this by calling the print method below).
The following code is what I was given and asked to fill out:
public class networkTraffic {
public void recordByte(Byte b){
}
public String print() {
}
}
What is the best way to store the 100 bytes? A list? Curious how best to do this.
Something like this (circular buffer):
byte[] buffer = new byte[100];
int index = 0;
public void recordByte(Byte b) {
index = (index + 1) % 100;
buffer[index] = b;
}
public void print() {
// start just after the most recently written slot, i.e. at the oldest byte, and go oldest to newest
for (int i = index + 1; i <= index + 100; i++) {
System.out.print(buffer[i % 100]);
}
}
The benefits of using a circular buffer:
You can reserve the space statically. In a real-time network application (VoIP, streaming, ...) this is often done because you don't need to store all data of a transmission, only a window containing the newest bytes to be processed.
It's fast: it can be implemented with an array, with O(1) read and write cost.
I don't know java, but there must be a queue concept whereby you would enqueue bytes until the number of items in the queue reached 100, at which point you would dequeue one byte and then enqueue another.
public void recordByte(Byte b)
{
if (queue.ItemCount >= 100)
{
queue.dequeue();
}
queue.enqueue(b);
}
You could print by peeking at the items:
public String print()
{
foreach (Byte b in queue)
{
print("X", b); // some hexadecimal print function
}
}
Circular Buffer using array:
Array of 100 bytes
Keep track of where the head index is i
For recordByte(), put the current byte in A[i] and set i = (i + 1) % 100
For print(), return subarray(i+1, 100) concatenated with subarray(0, i)
Queue using linked list (or the java Queue):
For recordByte(), add the new byte to the end
If the length would then exceed 100, remove the first element
For print(), simply print the list (see the sketch below)
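A minimal Java version of that queue approach using ArrayDeque; the class name, synchronization, and hex formatting are choices of this sketch, not part of the original interview skeleton:
import java.util.ArrayDeque;
import java.util.Deque;

public class LastHundredBytes {
    private static final int LIMIT = 100;
    private final Deque<Byte> lastBytes = new ArrayDeque<Byte>(LIMIT);

    public synchronized void recordByte(Byte b) {
        if (lastBytes.size() == LIMIT) {
            lastBytes.removeFirst(); // drop the oldest byte
        }
        lastBytes.addLast(b);
    }

    public synchronized String print() {
        StringBuilder sb = new StringBuilder();
        for (Byte b : lastBytes) { // iterates oldest to newest
            sb.append(String.format("%02X ", b));
        }
        return sb.toString();
    }
}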
Here is my code. It might look a bit obscure, but I am pretty sure this is the fastest way to do it (at least it would be in C++, not so sure about Java):
public class networkTraffic {
public networkTraffic() {
_ary = new byte[100];
_idx = _ary.length;
}
public void recordByte(Byte b){
_ary[--_idx] = b;
if (_idx == 0) {
_idx = _ary.length;
}
}
private int _idx;
private byte[] _ary;
}
Some points to note:
No data is allocated/deallocated when calling recordByte().
I did not use %, because it is slower than a direct comparison and using the if (branch prediction might help here too)
--_idx is faster than _idx-- because no temporary variable is involved.
I count backwards to 0, because then I do not have to get _ary.length each time in the call, but only every 100 times when the first entry is reached. Maybe this is not necessary, the compiler could take care of it.
if there were less than 100 calls to recordByte(), the rest is zeroes.
The easiest thing is to shove it in an array. The max size that the array can accommodate is 100 bytes. Keep adding bytes as they stream off the web. After the first 100 bytes are in the array, when the 101st byte comes, remove the byte at the head (i.e. the 0th). Keep doing this. This is basically a queue, FIFO concept. After the download is done, you are left with the last 100 bytes.
Not just after the download but at any given point in time, this array will have the last 100 bytes.
@Yottagray Not getting where the problem is? There seem to be a number of generic approaches (array, circular array, etc.) and a number of language-specific approaches (byte array, etc.). Am I missing something?
Multithreaded solution with non-blocking I/O:
private static final int N = 100;
private volatile byte[] buffer1 = new byte[N];
private volatile byte[] buffer2 = new byte[N];
private volatile int index = -1;
private volatile int tag;
synchronized public void recordByte(byte b) {
index++;
if (index == N * 2) {
//both buffers are full
buffer1 = buffer2;
buffer2 = new byte[N];
index = N;
}
if (index < N) {
buffer1[index] = b;
} else {
buffer2[index - N] = b;
}
}
public void print() {
byte[] localBuffer1, localBuffer2;
int localIndex, localTag;
synchronized (this) {
localBuffer1 = buffer1;
localBuffer2 = buffer2;
localIndex = index;
localTag = tag++;
}
int buffer1Start = localIndex - N >= 0 ? localIndex - N + 1 : 0;
int buffer1End = localIndex < N ? localIndex : N - 1;
printSlice(localBuffer1, buffer1Start, buffer1End, localTag);
if (localIndex >= N) {
printSlice(localBuffer2, 0, localIndex - N, localTag);
}
}
private void printSlice(byte[] buffer, int start, int end, int tag) {
for(int i = start; i <= end; i++) {
System.out.println(tag + ": "+ buffer[i]);
}
}
Just for the heck of it. How about using an ArrayList<Byte>? Say why not?
public class networkTraffic {
static ArrayList<Byte> networkMonitor; // ArrayList<Byte> reference
static { networkMonitor = new ArrayList<Byte>(100); } // Static Initialization Block
public void recordByte(Byte b){
networkMonitor.add(b);
while(networkMonitor.size() > 100){
networkMonitor.remove(0);
}
}
public void print() {
for (int i = 0; i < networkMonitor.size(); i++) {
System.out.println(networkMonitor.get(i));
}
// if(networkMonitor.size() < 100){
// for(int i = networkMonitor.size(); i < 100; i++){
// System.out.println("Emtpy byte");
// }
// }
}
}
