mmap() vs Java MappedByteBuffer performance?

mmap() vs Java MappedByteBuffer performance? - java

I have been developing a C++ project from existing Java code. I have the following C++ code and Java code reading from the same test file, which consists of millions of integers.
C++:
int * arr = new int[len]; //len is larger than the largest int from the data
fill_n(arr, len, -1); //fill with -1
long loadFromIndex = 0;
struct stat sizeResults;
long size;
if (stat(fileSrc, &sizeResults) == 0) {
size = sizeResults.st_size; //here size would be ~551950000 for 552M test file
}
mmapFile = (char *)mmap(NULL, size, PROT_READ, MAP_SHARED, fd, pageNum*pageSize);
long offset = loadFromIndex % pageSize;
while (offset < size) {
int i = htonl(*((int *)(mmapFile + offset)));
offset += sizeof(int);
int j = htonl(*((int *)(mmapFile + offset)));
offset += sizeof(int);
swapElem(i, j, arr);
}
return arr;
Java:
IntBuffer bb = srcFile.getChannel()
.map(MapMode.READ_ONLY, loadFromIndex, size)
.asIntBuffer().asReadOnlyBuffer();
while (bb.hasRemaining()) {
int i = bb.get();
int j = bb.get();
swapElem(i, j, arr); //arr is an int[] of the same size as the arr in C++ version, filled with -1
}
return arr;
void swapElem(arr) in C++ and Java are identical. It compares and modifies values in the array, but the original code is kind of long to post here. For testing purpose, I replaced it with the following function so the loop won't be dead code:
void swapElem(int i, int j, int * arr){ // int[] in Java
arr[i] = j;
}
I assumed the C++ version should outperform the java version, but the test gives the opposite result -- Java code is almost two times faster than the C++ code. Is there any way to improve the C++ code?
I feel maybe the mmapFile+offset in C++ is repeated too many times so it is O(n) additions for that and O(n) additions for offset+=sizeof(int), where n is number of integers to read. For Java's IntBuffer.get(), it just directly reads from a buffer's index so no addition operation is needed except O(n) increments of the buffer index by 1. Therefore, including the increments of buffer index, C++ takes O(2n) additions while Java takes O(n) additions. When it comes to millions of data, it might cause significant performance difference.
Following this idea, I modified the C++ code as follows:
mmapBin = (char *)mmap(NULL, size, PROT_READ, MAP_SHARED, fd, pageNum*pageSize);
int len = size - loadFromIndex % pageSize;
char * offset = loadFromIndex % pageSize + mmapBin;
int index = 0;
while (index < len) {
int i = htonl(*((int *)(offset)));
offset += sizeof(int);
int j = htonl(*((int *)(offset)));
offset += sizeof(int);
index+=2*sizeof(int);
}
I assumed there will be a slight performance gain, but there isn't.
Can anyone explain why the C++ code works slower than the Java code does? Thanks.
Update:
I have to apologize that when I said -O2 does not work, there was a problem at my end. I messed up Makefile so the C++ code did not recompile using -O2. I've updated the performance and the C++ version using -O2 has outperformed the Java version. This can seal the question, but if anyone would like to share how to improve the C++ code, I will follow. Generally I would expect it to be 2 times faster than the Java code, but currently it is not. Thank you all for your input.
Compiler: g++
Flags: -Wall -c -O2
Java Version: 1.8.0_05
Size of File: 552MB, all 4 byte integers
Processor: 2.53 GHz Intel Core 2 Duo
Memory 4GB 1067 MHz DDR3
Updated Benchmark:
Version Time(ms)
C++: ~1100
Java: ~1400
C++(without the while loop): ~35
Java(without the while loop): ~40
I have something before these code that causes the ~35ms performance(mostly filling the array with -1), but that is not important here.

I have some doubts that the benchmark method is correct. Both codes are "dead" codes. You don't actually use i and j anywhere so the gcc compiler or Java JIT might decide to actually remove the loop seeing that it has no effect on the future code flow.
Anyway, I would change the C++ code to:
mmapFile = (char *)mmap(NULL, size, PROT_READ, MAP_SHARED, fd, pageNum*pageSize);
long offset = loadFromIndex % pageSize;
int i, j;
int szInc = 2 * sizeof(int);
while (offset < size) {
scanf(mmapFile, "%d", &i);
scanf(mmapFile, "%d", &j);
offset += szInc; // offset += 8;
}
This would be the equivalent to Java code. In addition I would continue using -O2 as compilation flags. Keep in mind that htonl is an extra conversion that Java code does not seem to do it.

Related

Why is the java vector API so slow compared to scalar?

I recently decided to play around with Java's new incubated vector API, to see how fast it can get. I implemented two fairly simple methods, one for parsing an int and one for finding the index of a character in a string. In both cases, my vectorized methods were incredibly slow compared to their scalar equivalents.
Here's my code:
public class SIMDParse {
private static IntVector mul = IntVector.fromArray(
IntVector.SPECIES_512,
new int[] {0, 0, 0, 0, 0, 0, 1000000000, 100000000, 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1},
0
);
private static byte zeroChar = (byte) '0';
private static int width = IntVector.SPECIES_512.length();
private static byte[] filler;
static {
filler = new byte[16];
for (int i = 0; i < 16; i++) {
filler[i] = zeroChar;
}
}
public static int parseInt(String str) {
boolean negative = str.charAt(0) == '-';
byte[] bytes = str.getBytes(StandardCharsets.UTF_8);
if (negative) {
bytes[0] = zeroChar;
}
bytes = ensureSize(bytes, width);
ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_128, bytes, 0);
vec = vec.sub(zeroChar);
IntVector ints = (IntVector) vec.castShape(IntVector.SPECIES_512, 0);
ints = ints.mul(mul);
return ints.reduceLanes(VectorOperators.ADD) * (negative ? -1 : 1);
}
public static byte[] ensureSize(byte[] arr, int per) {
int mod = arr.length % per;
if (mod == 0) {
return arr;
}
int length = arr.length - (mod);
length += per;
byte[] newArr = new byte[length];
System.arraycopy(arr, 0, newArr, per - mod, arr.length);
System.arraycopy(filler, 0, newArr, 0, per - mod);
return newArr;
}
public static byte[] ensureSize2(byte[] arr, int per) {
int mod = arr.length % per;
if (mod == 0) {
return arr;
}
int length = arr.length - (mod);
length += per;
byte[] newArr = new byte[length];
System.arraycopy(arr, 0, newArr, 0, arr.length);
return newArr;
}
public static int indexOf(String s, char c) {
byte[] b = s.getBytes(StandardCharsets.UTF_8);
int width = ByteVector.SPECIES_MAX.length();
byte bChar = (byte) c;
b = ensureSize2(b, width);
for (int i = 0; i < b.length; i += width) {
ByteVector vec = ByteVector.fromArray(ByteVector.SPECIES_MAX, b, i);
int pos = vec.compare(VectorOperators.EQ, bChar).firstTrue();
if (pos != width) {
return pos + i;
}
}
return -1;
}
}
I fully expected my int parsing to be slower, since it won't ever be handling more than the vector size can hold (an int can never be more than 10 digits long).
By my bechmarks, parsing 123 as an int 10k times took 3081 microseconds for Integer.parseInt, and 80601 microseconds for my implementation. Searching for 'a' in a very long string ("____".repeat(4000) + "a" + "----".repeat(193)) took 7709 microseconds to String#indexOf's 7.
Why is it so unbelievably slow? I thought the entire point of SIMD is that it's faster than the scalar equivalents for tasks like these.

You picked something SIMD is not great at (string->int), and something that JVMs are very good at optimizing out of loops. And you made an implementation with a bunch of extra copying work if the inputs aren't exact multiples of the vector width.
I'm assuming your times are totals (for 10k repeats each), not a per-call average.
7 us is impossibly fast for that.
"____".repeat(4000) is 16k bytes before the 'a', which I assume is what you're searching for. Even a well-tuned / unrolled memchr (aka indexOf) running at 2x 32-byte vectors per clock cycle, on a 4GHz CPU, would take 625 us for 10k reps. (16000B / (64B/c) * 10000 reps / 4000 MHz). And yes, I'd expect a JVM to either call the native memchr or use something equally efficient for a commonly-used core library function like String#indexOf. For example, glibc's avx2 memchr is pretty well-tuned with loop unrolling; if you're on Linux, your JVM might be calling it.
Built-in String indexOf is also something the JIT "knows about". It's apparently able to hoist it out of loops when it can see that you're using the same string repeatedly as input. (But then what's it doing for the rest of those 7 us? I guess doing a not-quite-so-great memchr and then doing an empty 10k iteration loop at 1/clock could take about 7 microseconds, especially if your CPU isn't as fast as 4GHz.)
See Idiomatic way of performance evaluation? - if doubling the repeat-count to 20k doesn't double the time, your benchmark is broken and not measuring what you think it does.
Your manual SIMD indexOf is very unlikely to get optimized out of a loop. It makes a copy of the whole array every time, if the size isn't an exact multiple of the vector width!! (In ensureSize2). The normal technique is to fall back to scalar for the last size % width elements, which is obviously much better for large arrays. Or even better, do an unaligned load that ends at the end of the array (if the total size is >= vector width) for something where overlap with previous work is not a problem.
A decent memchr on modern x86 (using an algorithm like your indexOf without unrolling) should go at about 1 vector (16/32/64 bytes) per maybe 1.5 clock cycles, with data hot in L1d cache, without loop unrolling or anything. (Checking both the vector compare and the pointer bound as possible loop exit conditions takes extra asm instructions vs. a simple strlen, but see this answer for some microbenchmarks of a simple hand-written strlen that assumes aligned buffers). Probably your indexOf loops bottlenecks on front-end throughput on a CPU like Skylake, with its pipeline width of 4 uops/clock.
So let's guess that your implementation takes 1.5 cycles per 16 byte vector, if perhaps you're on a CPU without AVX2? You didn't say.
16kB / 16B = 1000 vectors. At 1 vector per 1.5 clocks, that's 1500 cycles. On a 3GHz machine, 1500 cycles takes 500 ns = 0.5 us per call, or 5000 us per 10k reps. But since 16194 bytes isn't a multiple of 16, you're also copying the whole thing every call, so that costs some more time, and could plausibly account for your 7709 us total time.
What SIMD is good for
for tasks like these.
No, "horizontal" stuff like ints.reduceLanes is something SIMD is generally slow at. And even with something like How to implement atoi using SIMD? using x86 pmaddwd to multiply and add pairs horizontally, it's still a lot of work.
Note that to make the elements wide enough to multiply by place-values without overflow, you have to unpack, which costs some shuffling. ints.reduceLanes takes about log2(elements) shuffle/add steps, and if you're starting with 512-bit AVX-512 vectors of int, the first 2 of those shuffles are lane-crossing, 3 cycle latency (https://agner.org/optimize/). (Or if your machine doesn't even have AVX2, then a 512-bit integer vector is actually 4x 128-bit vectors. And you had to do separate work to unpack each part. But at least the reduction will be cheap, just vertical adds until you get down to a single 128-bit vector.)

Hmm. I found this post because I've hit something strange with the Vector perfomance for something that ostensibly it should be ideal for - multiplying two double arrays.
static private void doVector(int iteration, double[] input1, double[] input2, double[] output) {
Instant start = Instant.now();
for (int i = 0; i < SPECIES.loopBound(ARRAY_LENGTH); i += SPECIES.length()) {
DoubleVector va = DoubleVector.fromArray(SPECIES, input1, i);
DoubleVector vb = DoubleVector.fromArray(SPECIES, input2, i);
va.mul(vb);
System.arraycopy(va.mul(vb).toArray(), 0, output, i, SPECIES.length());
}
Instant finish = Instant.now();
System.out.println("vector duration " + iteration + ": " + Duration.between(start, finish).getNano());
}
The species length comes out at 4 on my machine (CPU is Intel i7-7700HQ at 2.8 GHz).
On my first attempt the execution was taking more than 15 milliseconds to execute (compared with 0 for the scalar equivalent), even with a tiny array length (8 elements). On a hunch I added the iteration to see whether something had to warm up - and indeed, the first iteration still ALWAYS takes ages (44 ms for 65536 elements). Whilst most of the other iterations are reporting zero time, a few are taking around 15ms but they are randomly distributed (i.e. not always the same iteration index on each run). I sort of expect that (because I'm measuring real-time measurement and other stuff will be going on).
However, overall for an array size of 65536 elements, and 32 iterations, the total duration for the vector approach is 2-3 times longer than that for the scalar one.

Byte to "Bit"array

A byte is the smallest numeric datatype java offers but yesterday I came in contact with bytestreams for the first time and at the beginning of every package a marker byte is send which gives further instructions on how to handle the package. Every bit of the byte has a specific meaning so I am in need to entangle the byte into it's 8 bits.
You probably could convert the byte to a boolean array or create a switch for every case but that can't certainly be the best practice.
How is this possible in java why are there no bit datatypes in java?

Because there is no bit data type that exists on the physical computer. The smallest allotment you can allocate on most modern computers is a byte which is also known as an octet or 8 bits. When you display a single bit you are really just pulling that first bit out of the byte with arithmetic and adding it to a new byte which still is using an 8 bit space. If you want to put bit data inside of a byte you can but it will be stored as a at least a single byte no matter what programming language you use.

You could load the byte into a BitSet. This abstraction hides the gory details of manipulating single bits.
import java.util.BitSet;
public class Bits {
public static void main(String[] args) {
byte[] b = new byte[]{10};
BitSet bitset = BitSet.valueOf(b);
System.out.println("Length of bitset = " + bitset.length());
for (int i=0; i<bitset.length(); ++i) {
System.out.println("bit " + i + ": " + bitset.get(i));
}
}
}
$ java Bits
Length of bitset = 4
bit 0: false
bit 1: true
bit 2: false
bit 3: true
You can ask for any bit, but the length tells you that all the bits past length() - 1 are set to 0 (false):
System.out.println("bit 75: " + bitset.get(75));
bit 75: false

Have a look at java.util.BitSet.
You might use it to interpret the byte read and can use the get method to check whether a specific bit is set like this:
byte b = stream.read();
final BitSet bitSet = BitSet.valueOf(new byte[]{b});
if (bitSet.get(2)) {
state.activateComponentA();
} else {
state.deactivateComponentA();
}
state.setFeatureBTo(bitSet.get(1));
On the other hand, you can create your own bitmask easily and convert it to a byte array (or just byte) afterwards:
final BitSet output = BitSet.valueOf(ByteBuffer.allocate(1));
output.set(3, state.isComponentXActivated());
if (state.isY){
output.set(4);
}
final byte w = output.toByteArray()[0];

How is this possible in java why are there no bit datatypes in java?
There are no bit data types in most languages. And most CPU instruction sets have few (if any) instructions dedicated to adressing single bits. You can think of the lack of these as a trade-off between (language or CPU) complexity and need.
Manipulating a single bit can be though of as a special case of manipulating multiple bits; and languages as well as CPU's are equipped for the latter.
Very common operations like testing, setting, clearing, inverting as well as exclusive or are all supported on the integer primitive types (byte, short/char, int, long), operating on all bits of the type at once. By chosing the parameters appropiately you can select which bits to operate on.
If you think about it, a byte array is a bit array where the bits are grouped in packages of 8. Adressing a single bit in the array is relatively simple using logical operators (AND &, OR |, XOR ^ and NOT ~).
For example, testing if bit N is set in a byte can be done using a logical AND with a mask where only the bit to be tested is set:
public boolean testBit(byte b, int n) {
int mask = 1 << n; // equivalent of 2 to the nth power
return (b & mask) != 0;
}
Extending this to a byte array is no magic either, each byte consists of 8 bits, so the byte index is simply the bit number divided by 8, and the bit number inside that byte is the remainder (modulo 8):
public boolean testBit(byte[] array, int n) {
int index = n >>> 3; // divide by 8
int mask = 1 << (n & 7); // n modulo 8
return (array[index] & mask) != 0;
}

Here is a sample, I hope useful for you!
DatagramSocket socket = new DatagramSocket(6160, InetAddress.getByName("0.0.0.0"));
socket.setBroadcast(true);
while (true) {
byte[] recvBuf = new byte[26];
DatagramPacket packet = new DatagramPacket(recvBuf, recvBuf.length);
socket.receive(packet);
String bitArray = toBitArray(recvBuf);
System.out.println(Integer.parseInt(bitArray.substring(0, 8), 2)); // convert first byte binary to decimal
System.out.println(Integer.parseInt(bitArray.substring(8, 16), 2)); // convert second byte binary to decimal
}
public static String toBitArray(byte[] byteArray) {
StringBuilder sb = new StringBuilder();
for (int i = 0; i < byteArray.length; i++) {
sb.append(String.format("%8s", Integer.toBinaryString(byteArray[i] & 0xFF)).replace(' ', '0'));
}
return sb.toString();
}

Efficient BigInteger multiplication modulo n in Java

I can calculate the multiplication of two BigIntegers (say a and b) modulo n.
This can be done by:
a.multiply(b).mod(n);
However, assuming that a and b are of the same order of n, it implies that during the calculation, a new BigInteger is being calculated, and its length (in bytes) is ~ 2n.
I wonder whether there is more efficient implementation that I can use. Something like modMultiply that is implemented like modPow (which I believe does not calculate the power and then the modulo).

I can only think of
a.mod(n).multiply(b.mod(n)).mod(n)
and you seem already to be aware of this.
BigInteger has a toByteArray() but internally ints are used. hence n must be quite large to have an effect. Maybe in key generation cryptographic code there might be such work.
Furhtermore, if you think of short-cutting the multiplication, you'll get something like the following:
public static BigInteger multiply(BigInteger a, BigInteger b, int mod) {
if (a.signum() == -1) {
return multiply(a.negate(), b, mod).negate();
}
if (b.signum() == -1) {
return multiply(a, b.negate(), mod).negate();
}
int n = (Integer.bitCount(mod - 1) + 7) / 8; // mod in bytes.
byte[] aa = a.toByteArray(); // Highest byte at [0] !!
int na = Math.min(n, aa.length); // Heuristic.
byte[] bb = b.toByteArray();
int nb = Math.min(n, bb.length); // Heuristic.
byte[] prod = new byte[n];
for (int ia = 0; ia < na; ++ia) {
int m = ia + nb >= n ? n - ia - 1 : nb; // Heuristic.
for (int ib = 0; ib < m; ++ib) {
int p = (0xFF & aa[aa.length - 1 - ia]) * (0xFF & bb[bb.length - 1 - ib]);
addByte(prod, ia + ib, p & 0xFF);
if (ia + ib + 1 < n) {
addByte(prod, ia + ib + 1, (p >> 8) & 0xFF);
}
}
}
// Still need to do an expensive mod:
return new BigInteger(prod).mod(BigInteger.valueOf(mod));
}
private static void addByte(byte[] prod, int i, int value) {
while (value != 0 && i < prod.length) {
value += prod[prod.length - 1 - i] & 0xFF;
prod[prod.length - 1 - i] = (byte) value;
value >>= 8;
++i;
}
}
That code does not look appetizing. BigInteger has the problem of exposing the internal value only as big-endian byte[] where the first byte is the most significant one.
Much better would be to have the digits in base N. That is not unimaginable: if N is a power of 2 some nice optimizations are feasible.
(BTW the code is untested - as it does not seem convincingly faster.)

First, the bad news: I couldn't find any existing Java libraries that provided this functionality.
I couldn't find any pure-Java big integer libraries ... apart from java.math.BigInteger.
There are Java / JNI wrappers for the GMP library, but GMP doesn't implement this either.
So what are your options?
Maybe there is some pure-Java library that I missed.
Maybe there some other native (C / C++) big integer library supports this operation ... though you may need to write your own JNI wrappers.
You should be able to implement such a method for yourself, by copying the source code of java.math.BigInteger and adding an extra custom method. Alternatively, it looks like you could extend it.
Having said that, I'm not sure that there is a "substantially faster" algorithm for computing a * b mod n in Java, or any other language. (Apart from special cases; e.g. when n is a power of 2).
Specifically, the "Montgomery Reduction" approach wouldn't help for a single multiplication step. (The Wikipedia page says: "Because numbers have to be converted to and from a particular form suitable for performing the Montgomery step, a single modular multiplication performed using a Montgomery step is actually slightly less efficient than a "naive" one.")
So maybe the most effective way to speedup the computation would be to use the JNI wrappers for GMP.

You can use generic maths, like:
(A*B) mod N = ((A mod N) * (B mod N)) mod N
It may be more CPU intensive, but one should choose between CPU and memory, right?
If we are talking about modular arithmetic then indeed Montgomery reduction may be what you need. Don't know any out of box solutions though.

You can write a BigInteger multiplication as a standard long multiplication in a very large base -- for example, in base 2^32. It is fairly straightforward. If you want only the result modulo n, then it is advantageous to choose a base that is a factor of n or of which n is a factor. Then you can ignore all but one or a few of the lowest-order result (Big)digits as you perform the computation, saving space and maybe time.
That's most practical if you know n in advance, of course, but such pre-knowledge is not essential. It's especially nice if n is a power of two, and it's fairly messy if n is neither a power of 2 nor smaller than the maximum operand handled directly by the system's arithmetic unit, but all of those cases can be handled in principle.
If you must do this specifically with Java BigInteger instances, however, then be aware that any approach not provided by the BigInteger class itself will incur overhead for converting between internal and external representations.

Maybe this:
static BigInteger multiply(BigInteger c, BigInteger x)
{
BigInteger sum = BigInteger.ZERO;
BigInteger addOperand;
for (int i=0; i < FIELD_ELEMENT_BIT_SIZE; i++)
{
if (c.testBit(i))
addOperand = x;
else
addOperand = BigInteger.ZERO;
sum = add(sum, addOperand);
x = x.shiftRight(1);
}
return sum;
}
with the following helper functions:
static BigInteger add(BigInteger a, BigInteger b)
{
return modOrder(a.add(b));
}
static BigInteger modOrder(BigInteger n)
{
return n.remainder(FIELD_ORDER);
}
To be honest though, I'm not sure if this is really efficient at all since none of these operations are performed in-place.

Improvement of Algorithm: Counting set bits in Byte-Arrays

We store knowledge in byte arrays as bits. Counting the number of set bits is pretty slow. Any suggestion to improve the algorithm is welcome:
public static int countSetBits(byte[] array) {
int setBits = 0;
if (array != null) {
for (int byteIndex = 0; byteIndex < array.length; byteIndex++) {
for (int bitIndex = 0; bitIndex < 7; bitIndex++) {
if (getBit(bitIndex, array[byteIndex])) {
setBits++;
}
}
}
}
return setBits;
}
public static boolean getBit(int index, final byte b) {
byte t = setBit(index, (byte) 0);
return (b & t) > 0;
}
public static byte setBit(int index, final byte b) {
return (byte) ((1 << index) | b);
}
To count the bits of a byte array of length of 156'564 takes 300 ms, that's too much!

Try Integer.bitcount to obtain the number of bits set in each byte. It will be more efficient if you can switch from a byte array to an int array. If this is not possible, you could also construct a look-up table for all 256 bytes to quickly look up the count rather than iterating over individual bits.
And if it's always the whole array's count you're interested in, you could wrap the array in a class that stores the count in a separate integer whenever the array changes. (edit: Or, indeed, as noted in comments, use java.util.BitSet.)

I would use the same global loop but instead of looping inside each byte I would simply use a (precomputed) array of size 256 mapping bytes to their bit count. That would probably be very efficient.
If you need even more speed, then you should separately maintain the count and increment it and decrement it when setting bits (but that would mean a big additional burden on those operations so I'm not sure it's applicable for you).
Another solution would be based on BitSet implementation : it uses an array of long (and not bytes) and here's how it counts :
658 int sum = 0;
659 for (int i = 0; i < wordsInUse; i++)
660 sum += Long.bitCount(words[i]);
661 return sum;

I would use:
byte[] yourByteArray = ...
BitSet bitset = BitSet.valueOf(yourByteArray); // java.util.BitSet
int setBits = bitset.cardinality();
I don't know if it's faster, but I think it will be faster than what you have. Let me know?
Your method would look like
public static int countSetBits(byte[] array) {
return BitSet.valueOf(array).cardinality();
}
You say:
We store knowledge in byte arrays as bits.
I would recommend to use a BitSet for that. It gives you convenient methods, and you seem to be interested in bits, not bytes, so it is a much more appropriate data type compared to a byte[]. (Internally it uses a long[]).

By far the fastest way is counting bits set, in "parallel", method is called Hamming weight
and is implemented in Integer.bitCount(int i) as far as I know.

As per my understaning,
1 Byte = 8 Bits
So if Byte Array size = n , then isn't total number of bits = n*8 ?
Please correct me if my understanding is wrong
Thanks
Vinod

Arrays manipulation with respect to range of values

I have this requirement that I need to set the values in a byte array of size 20MB.
I'm looking for a JAVA API which does the following. I've gone through apache commons arrayutils but couldn't find something useful.
The operation should be something of this type. Say the values range from 0 to 100.
I'd like to manipulate the array such that values less than 15 are changed to 15 and values greater than 70 are changed to 70.
Basically, I'm looking for an operation which would avoid me doing this - iterate through the array, check if the value is below 15, if it is below 15 then set it to 15 otherwise is it above 75, if it is above 75 then set the value to 75.
Any help is appreciated.

Even if there's some third-party library which has this functionality, it's just going to be doing exactly the same operation - looping over an array. Fundamentally you need something like:
for (int i = 0; i < array.length; i++)
{
array[i] = clamp(array[i], 15, 70);
}
...
public static byte clamp(byte value, byte min, byte max)
{
return value < min ? min
: value > max ? max
: value;
}
You could implement this in native code if you really wanted, but I suspect you won't find an existing implementation. It's more likely that there are libraries which perform the sort of image manipulation you're interested in as image manipulation rather than as an array operation.

You could use Guava's Lists.transform method to update the values. However, this would result in a new array not updating the values in the existing array.
List<Byte> list = Lists.newArrayList(myArray);
List<Byte> trans = Lists.transform(list, new Function<Byte, Byte>(){...});
byte[] bytes = Bytes.toArray(trans);
However, given what you are trying to do, I would suggest just looping over the values.

I'd recommend that you write the simple loop, and profile it in the context of your application. Only if you can demonstrate that this code is the overall bottleneck it would make sense to try and make it faster.
I'd try something like this:
final int n = array.length;
for (int i = 0; i < n; i++) {
int val = array[i];
if (val < 15) {
array[i] = 15;
} else if (val > 75) {
array[i] = 75;
}
}
My final point is that this type of code is likely to be limited by memory bandwidth, so it seems unlikely that a native C solution would be a lot faster anyway.

Instead of checking the ranges like Jon skeet proposes, you could create a lookup table for each of the 256 possible a byte could have, i.e. something like
{15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,16,17,18,...,69,70,70,70,70,...}
for (int i = 0; i < len; i++)
{
array[i] = lookup[array[i]];
}
In C: Less branching, much faster. In Java: Unfortunately not faster, even a bit slower, maybe because Java's array range checks eat up the speed gained; and since Java's bytes are always signed, it's a bit more complicated than shown above.
In C, you could even do that for 16bit halfwords, making it faster again. (Probably by factor 2)
EDIT: To my own shame, I must admit that proper testing revealed that the lookup table isn't faster in C. My first results were probably skewed by compiler optimisations. Anyway, at least on my machine,
if (array[i]<15) array[i]=15;
else if (array[i]>70) array[i]=70;
is noticable faster then using the ternary operator.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.