Write a MessageDigest using the xxhash algorithm - java

I would be grateful for help with the following problem.
An application I'm working on uses the SHA-256 algorithm to produce a hash value, i.e.
MessageDigest messageDigest = MessageDigest.getInstance("SHA-256");
This has given us performance issues, and so I have been asked to replace this with an xxhash function. (Our main concern is having a good and fast hashing algorithm.)
We have identified the following component which we understand to be a good and effective implementation.
http://mvnrepository.com/artifact/net.jpountz.lz4/lz4/1.0.0
An example is given here for how to use the artefact: https://github.com/jpountz/lz4-java
XXHashFactory factory = XXHashFactory.fastestInstance();

byte[] data = "12345345234572".getBytes("UTF-8");
ByteArrayInputStream in = new ByteArrayInputStream(data);

int seed = 0x9747b28c; // used to initialize the hash value, use whatever
                       // value you want, but always the same
StreamingXXHash32 hash32 = factory.newStreamingHash32(seed);
byte[] buf = new byte[8]; // for real-world usage, use a larger buffer, like 8192 bytes
for (;;) {
    int read = in.read(buf);
    if (read == -1) {
        break;
    }
    hash32.update(buf, 0, read);
}
int hash = hash32.getValue();
In order to switch out the SHA-256 MessageDigest, I think the best course of action is to extend the java.security.MessageDigest abstract class. The following shows what I think would be an appropriate implementation, with the implementation of engineDigest() missing. Would anyone be able to give guidance on how to implement this method, or point out an approach they think is better? Thanks in advance.
package com.company.functions.crypto;

import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

import net.jpountz.xxhash.StreamingXXHash32;
import net.jpountz.xxhash.XXHashFactory;

public class XxHashMessageDigest extends MessageDigest {

    public static final String XXHASH = "xxhash";
    private static final int SEED = 0x9747b28c;
    private static final XXHashFactory FACTORY = XXHashFactory
            .fastestInstance();

    private List<Integer> values = new ArrayList<Integer>();

    protected XxHashMessageDigest() {
        super(XXHASH);
    }

    @Override
    protected void engineUpdate(byte input) {
        // intentionally no implementation
    }

    @Override
    protected void engineUpdate(byte[] input, int offset, int len) {
        StreamingXXHash32 hash32 = FACTORY.newStreamingHash32(SEED);
        hash32.update(input, offset, len);
        values.add(hash32.getValue());
    }

    @Override
    protected byte[] engineDigest() {
        /*
         * TODO provide implementation to "digest" the list of values into a
         * hash value
         */
        engineReset();
        return null;
    }

    @Override
    protected void engineReset() {
        values.clear();
    }
}
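For what it's worth, one possible shape for this (a sketch, not a vetted implementation): rather than collecting per-chunk hash values in a list, keep a single StreamingXXHash32 for the digest's lifetime, feed every update into it, and emit its 32-bit value as a 4-byte array. This assumes the lz4-java streaming API shown above (newStreamingHash32, update, getValue, and its reset method):

    // Sketch: replace the List<Integer> with one streaming hash instance.
    private final StreamingXXHash32 hash32 = FACTORY.newStreamingHash32(SEED);

    @Override
    protected void engineUpdate(byte input) {
        engineUpdate(new byte[] { input }, 0, 1); // route single bytes through the array variant
    }

    @Override
    protected void engineUpdate(byte[] input, int offset, int len) {
        hash32.update(input, offset, len);
    }

    @Override
    protected byte[] engineDigest() {
        int hash = hash32.getValue();
        engineReset();
        // Encode the 32-bit hash big-endian as the 4-byte "digest".
        return java.nio.ByteBuffer.allocate(4).putInt(hash).array();
    }

    @Override
    protected void engineReset() {
        hash32.reset();
    }

Note that the original engineUpdate(byte[], int, int) creates a fresh StreamingXXHash32 per call, so chunked input (as in the streaming loop above) would hash each chunk independently rather than hashing the whole message.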


Why aren't ByteBuffers collected when using FinalizableReferenceQueue?

I have a buffer pool implementation which basically provides pre-allocated ByteBuffer objects via an allocate()/release() API. In order to detect the cases when a caller forgot to call release and the ByteBuffer reference is leaked, I am using Guava's FinalizableReferenceQueue in conjunction with FinalizablePhantomReference and its finalizeReferent() callback.
Additionally, I need to selectively destroy the buffer pool and replace it with a newer one with a different configuration. For that, I was setting the previous SampleBufferPool reference to null and letting the garbage collector do its job. However, I noticed that the ByteBuffers were not getting collected/finalizeReferent was not being called. (I verified that the full GC pauses are not collecting any memory by adding the -XX:+PrintGCDetails -verbose:gc JVM flags.)
package foo;

import com.google.common.base.FinalizablePhantomReference;
import com.google.common.base.FinalizableReferenceQueue;
import com.google.common.collect.Sets;

import java.lang.ref.Reference;
import java.nio.ByteBuffer;
import java.util.Set;
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.stream.IntStream;

public class App {
    public static class SampleBufferPool {
        // Phantom reference queue for detecting memory leaks
        // See https://guava.dev/releases/19.0/api/docs/com/google/common/base/FinalizableReferenceQueue.html
        private static final FinalizableReferenceQueue FRQ = new FinalizableReferenceQueue();

        // This ensures that the FinalizablePhantomReference itself is not garbage-collected.
        public static final Set<Reference<?>> REFERENCES = Sets.newConcurrentHashSet();

        private final ConcurrentLinkedDeque<ByteBuffer> _bufferCache = new ConcurrentLinkedDeque<>();
        private final int _chunkSize;
        private final int _numChunks;

        public SampleBufferPool(int chunkSize, int numChunks) {
            _chunkSize = chunkSize;
            _numChunks = numChunks;
            IntStream.range(0, _numChunks).forEach(i -> populateSingleChunk());
        }

        public ByteBuffer allocate() {
            return _bufferCache.pollLast();
        }

        public void release(ByteBuffer chunk) {
            _bufferCache.offerLast(chunk);
        }

        private void populateSingleChunk() {
            ByteBuffer chunk = ByteBuffer.allocate(_chunkSize);
            _bufferCache.offerLast(chunk);
            Reference<?> reference = new FinalizablePhantomReference<>(chunk, FRQ) {
                @Override
                public void finalizeReferent() {
                    REFERENCES.remove(this);
                    System.out.println("LEAK DETECTED. ByteBuf[" + "] from RecyclingMemoryPool");
                }
            };
            REFERENCES.add(reference);
        }
    }

    public static void main(String[] args) {
        SampleBufferPool sampleBufferPool = new SampleBufferPool(20000000, 400);
        sampleBufferPool = null;
        for (int i = 0; i < 10; i++) {
            System.gc();
        }
    }
}
You are creating a subclass of FinalizablePhantomReference as an anonymous subclass inside a non-static context:
Reference<?> reference = new FinalizablePhantomReference<>(chunk, FRQ) {
    @Override
    public void finalizeReferent() {
        REFERENCES.remove(this);
        System.out.println("LEAK DETECTED. ByteBuf[" + "] from RecyclingMemoryPool");
    }
};
Prior to JDK 18, anonymous inner classes always keep a reference to their surrounding instance, whether they use it or not. As described in bug report JDK-8271717, this changes with JDK 18 when compiling the source code with javac. Since this is still compiler-specific behavior, and since it is too easy to accidentally use a member of the surrounding class (which would force keeping a reference to the instance), you should use a static context. E.g.
private void populateSingleChunk() {
    ByteBuffer chunk = ByteBuffer.allocate(_chunkSize);
    _bufferCache.offerLast(chunk);
    registerChunk(chunk);
}

private static void registerChunk(ByteBuffer chunk) {
    Reference<?> reference = new FinalizablePhantomReference<>(chunk, FRQ) {
        @Override
        public void finalizeReferent() {
            REFERENCES.remove(this);
            System.out.println("LEAK DETECTED. ByteBuf[" + "] from RecyclingMemoryPool");
        }
    };
    REFERENCES.add(reference);
}
Of course, the phantom reference object must not hold strong references to the referent either. If you need properties of the referent, you must extract them beforehand, e.g.
private void populateSingleChunk() {
    ByteBuffer chunk = ByteBuffer.allocate(_chunkSize);
    _bufferCache.offerLast(chunk);
    registerChunk(chunk);
}

private static void registerChunk(ByteBuffer chunk) {
    int capacity = chunk.capacity();
    Reference<?> reference = new FinalizablePhantomReference<>(chunk, FRQ) {
        @Override
        public void finalizeReferent() {
            REFERENCES.remove(this);
            System.out.println("LEAK DETECTED. ByteBuf[capacity = "
                    + capacity + "] from RecyclingMemoryPool");
        }
    };
    REFERENCES.add(reference);
}
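A quick way to check whether a given anonymous class instance captured its enclosing object is to dump its declared fields (a sketch; the synthetic field name this$0 is a javac convention rather than a spec guarantee):

import java.lang.reflect.Field;

static void printCapturedState(Object anonymousInstance) {
    // Pre-JDK 18 javac adds a synthetic "this$0" field when an anonymous class
    // is instantiated in an instance context; captured locals appear as "val$<name>" fields.
    for (Field f : anonymousInstance.getClass().getDeclaredFields()) {
        System.out.println(f);
    }
}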

Is it possible to get StackOverflowError without recursion?

I have a task to get a StackOverflowError in Java without using -Xss and without recursion. I really don't have ideas... only some nonsense like generating a huge Java class at runtime, compiling it and invoking it...
Java stores primitive local variables on the stack. Objects created in local scope are allocated on the heap, with the reference to them on the stack.
You can overflow the stack without recursion by declaring too many primitive local variables in a method. With normal stack size settings, you would have to declare an excessive number of variables to overflow.
Here is an implementation of Eric J.'s idea of generating an excessive number of local variables, using the javassist library:
class SoeNonRecursive {
    static final String generatedMethodName = "holderForVariablesMethod";

    @SneakyThrows
    Class<?> createClassWithLotsOfLocalVars(String generatedClassName, final int numberOfLocalVarsToGenerate) {
        ClassPool pool = ClassPool.getDefault();
        CtClass generatedClass = pool.makeClass(generatedClassName);
        CtMethod generatedMethod = CtNewMethod.make(getMethodBody(numberOfLocalVarsToGenerate), generatedClass);
        generatedClass.addMethod(generatedMethod);
        return generatedClass.toClass();
    }

    private String getMethodBody(final int numberOfLocalVarsToGenerate) {
        StringBuilder methodBody = new StringBuilder("public static long ")
                .append(generatedMethodName).append("() {")
                .append(System.lineSeparator());
        StringBuilder antiDeadCodeEliminationString = new StringBuilder("long result = i0");
        long i = 0;
        while (i < numberOfLocalVarsToGenerate) {
            methodBody.append("    long i").append(i)
                    .append(" = ").append(i).append(";")
                    .append(System.lineSeparator());
            antiDeadCodeEliminationString.append("+").append("i").append(i);
            i++;
        }
        antiDeadCodeEliminationString.append(";");
        methodBody.append("    ").append(antiDeadCodeEliminationString)
                .append(System.lineSeparator())
                .append("    return result;")
                .append(System.lineSeparator())
                .append("}");
        return methodBody.toString();
    }
}
and tests:
class SoeNonRecursiveTest {
    private final SoeNonRecursive soeNonRecursive = new SoeNonRecursive();

    // Should be different for every case, or the once-generated class becomes
    // "frozen" for javassist: http://www.javassist.org/tutorial/tutorial.html#read
    private String generatedClassName;

    @Test
    void stackOverflowWithoutRecursion() {
        generatedClassName = "Soe1";
        final int numberOfLocalVarsToGenerate = 6000;
        assertThrows(StackOverflowError.class, () -> soeNonRecursive
                .createClassWithLotsOfLocalVars(generatedClassName, numberOfLocalVarsToGenerate));
    }

    @SneakyThrows
    @Test
    void methodGeneratedCorrectly() {
        generatedClassName = "Soe2";
        final int numberOfLocalVarsToGenerate = 6;
        Class<?> generated = soeNonRecursive.createClassWithLotsOfLocalVars(generatedClassName, numberOfLocalVarsToGenerate);
        // Arithmetic progression
        long expected = Math.round((numberOfLocalVarsToGenerate - 1.0) / 2 * numberOfLocalVarsToGenerate);
        long actual = (long) generated.getDeclaredMethod(generatedMethodName).invoke(generated);
        assertEquals(expected, actual);
    }
}
EDIT:
The answer is incorrect, because it relies on a type of recursion, called indirect recursion: https://en.wikipedia.org/wiki/Recursion_(computer_science)#Indirect_recursion.
I think the simplest way to do this without recursion is the following:
import java.util.LinkedList;
import java.util.List;

interface Handler {
    void handle(Chain chain);
}

interface Chain {
    void process();
}

class FirstHandler implements Handler {
    @Override
    public void handle(Chain chain) {
        System.out.println("first handler");
        chain.process();
    }
}

class SecondHandler implements Handler {
    @Override
    public void handle(Chain chain) {
        System.out.println("second handler");
        chain.process();
    }
}

class Runner implements Chain {
    private List<Handler> handlers;
    private int size = 5000; // change this parameter to avoid StackOverflowError
    private int n = 0;

    public static void main(String[] args) {
        Runner runner = new Runner();
        runner.setHandlers();
        runner.process();
    }

    private void setHandlers() {
        handlers = new LinkedList<>();
        int i = 0;
        while (i < size) {
            // there can be different implementations of the Handler interface
            handlers.add(new FirstHandler());
            handlers.add(new SecondHandler());
            i += 2;
        }
    }

    public void process() {
        if (n < size) {
            Handler handler = handlers.get(n++);
            handler.handle(this);
        }
    }
}
At first glance this example looks a little crazy, but it's not as unrealistic as it seems.
The main idea of this approach is the chain of responsibility pattern. You can reproduce this error in real life by implementing the chain of responsibility pattern: for instance, you have some objects, and every object, after doing some logic, calls the next object in the chain and passes the results of its work along.
You can see this in Java filters (javax.servlet.Filter).
I don't know the detailed mechanism of this class, but it calls the next filter in the chain using the doFilter method, and after all filters/servlets have processed the request, it continues working in the same method below doFilter.
In other words, it intercepts the request/response before the servlets and before sending the response to the client. This is a dangerous piece of code, because all called methods are on the same stack in the same thread. Thus it may trigger a StackOverflowError if the chain is too big, or if you call the doFilter method at a deep level, which produces the same situation. During debugging you might see the whole chain of calls in one thread, and that can potentially be the cause of the StackOverflowError.
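As a sketch of why the filter chain stacks up (assuming the standard javax.servlet.Filter interface): each filter's doFilter frame stays on the stack until the rest of the chain returns.

public class AuditFilter implements javax.servlet.Filter {
    @Override
    public void init(javax.servlet.FilterConfig config) {}

    @Override
    public void doFilter(javax.servlet.ServletRequest req, javax.servlet.ServletResponse resp,
            javax.servlet.FilterChain chain) throws java.io.IOException, javax.servlet.ServletException {
        chain.doFilter(req, resp); // invokes the next filter; this frame stays on the stack
        // post-processing runs here, "below doFilter", once the whole chain unwinds
    }

    @Override
    public void destroy() {}
}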
You can also take a chain of responsibility pattern example from the links below, use a collection of elements instead of just a few, and you will likewise get a StackOverflowError.
Links with the pattern:
https://www.journaldev.com/1617/chain-of-responsibility-design-pattern-in-java
https://en.wikipedia.org/wiki/Chain-of-responsibility_pattern
I hope it was helpful for you.
Since the question is very interesting, I have tried to simplify the answer of hide:
import java.util.ArrayList;
import java.util.List;

public class Stackoverflow {
    static class Handler {
        void handle(Chain chain) {
            chain.process();
            System.out.println("yeah");
        }
    }

    static class Chain {
        private List<Handler> handlers = new ArrayList<>();
        private int n = 0;

        private void setHandlers(int count) {
            int i = 0;
            while (i++ < count) {
                handlers.add(new Handler());
            }
        }

        public void process() {
            if (n < handlers.size()) {
                Handler handler = handlers.get(n++);
                handler.handle(this);
            }
        }
    }

    public static void main(String[] args) {
        Chain chain = new Chain();
        chain.setHandlers(10000);
        chain.process();
    }
}
It's important to note that if the StackOverflowError occurs, the string "yeah" will never be printed.
Of course we can do it :) . No recursion at all!
public static void main(String[] args) {
    throw new StackOverflowError();
}
Looking at the answer below, I'm not sure if this works for Java, but it sounds like you can declare an array of pointers. It might be possible to achieve Eric J.'s idea without requiring a generator.
Is it on the Stack or Heap?
int* x[LARGENUMBER]; // The addresses are held on the stack
int i;               // On the stack
for (i = 0; i < LARGENUMBER; ++i)
    x[i] = malloc(sizeof(int) * 10); // Allocates memory on the heap

Check number of invocations within a class when JUnit testing

I have code which calculates something, caches it, and if already calculated, reads from the cache; similar to this:
public class LengthWithCache {
    private java.util.Map<String, Integer> lengthPlusOneCache = new java.util.HashMap<String, Integer>();

    public int getLenghtPlusOne(String string) {
        Integer cachedStringLenghtPlusOne = lengthPlusOneCache.get(string);
        if (cachedStringLenghtPlusOne != null) {
            return cachedStringLenghtPlusOne;
        }
        int stringLenghtPlusOne = determineLengthPlusOne(string);
        lengthPlusOneCache.put(string, new Integer(stringLenghtPlusOne));
        return stringLenghtPlusOne;
    }

    protected int determineLengthPlusOne(String string) {
        return string.length() + 1;
    }
}
I want to test whether function determineLengthPlusOne has been called an adequate number of times, like this:
public class LengthWithCacheTest {
    @Test
    public void testGetLenghtPlusOne() {
        LengthWithCache lengthWithCache = new LengthWithCache();
        assertEquals(6, lengthWithCache.getLenghtPlusOne("apple"));
        // here check that determineLengthPlusOne has been called once
        assertEquals(6, lengthWithCache.getLenghtPlusOne("apple"));
        // here check that determineLengthPlusOne has not been called
    }
}
Mocking class LengthWithCache does not seem a good option, as I want to test its functions. (According to my understanding, we mock the classes used by the tested class, not the tested class itself.) What is the most elegant solution for this?
My first idea was to create another class LengthPlusOneDeterminer containing function determineLengthPlusOne, pass it to function getLenghtPlusOne as a parameter, and mock LengthPlusOneDeterminer in the unit test, but that seems a bit strange, as it has unnecessary impact on the working code (the real clients of class LengthWithCache).
Basically I am using Mockito, but whatever mock framework (or other solution) is welcome! Thank you!
The most elegant way would be to create a separate class that does the caching, and decorate the current class with it (after removing the caching from it); this way you can safely unit test the caching itself without interfering with the functionality of the base class.
public class Length {
    public int getLenghtPlusOne(String string) {
        // Caching removed: this class only computes.
        return determineLengthPlusOne(string);
    }

    protected int determineLengthPlusOne(String string) {
        return string.length() + 1;
    }
}

public class CachedLength extends Length {
    private java.util.Map<String, Integer> lengthPlusOneCache = new java.util.HashMap<String, Integer>();
    private final Length length; // the decorated instance

    public CachedLength(Length length) {
        this.length = length;
    }

    @Override
    public int getLenghtPlusOne(String string) {
        Integer cachedStringLenghtPlusOne = lengthPlusOneCache.get(string);
        if (cachedStringLenghtPlusOne != null) {
            return cachedStringLenghtPlusOne;
        }
        int stringLenghtPlusOne = length.getLenghtPlusOne(string);
        lengthPlusOneCache.put(string, stringLenghtPlusOne); // remember the computed value
        return stringLenghtPlusOne;
    }
}
Then you can easily test the caching by injecting a mocked Length:
Length length = Mockito.mock(Length.class);
CachedLength cached = new CachedLength(length);
....
Mockito.verify(length, Mockito.times(5)).getLenghtPlusOne(Mockito.anyString());
You don't need a mock to address your need.
To test the internal behavior (whether determineLengthPlusOne() was called or not), you would need a method to access the cache in LengthWithCache.
But at the level of your design, we can imagine that you don't want to expose the cache in a public method, which is normal.
Multiple solutions exist to test the cache behavior despite this constraint.
I will present my way of doing it; maybe there is a better one.
But I think that in most cases you will be forced to use some tricks, or to complicate your design, to do your unit test.
My way relies on augmenting the class under test by extending it, in order to add the information and behavior needed for your test.
It is this subclass you will use in your unit test.
The most important point in this class extension is not to break or modify the behavior of the object under test.
It must add new information and new behavior, without modifying the information and behavior of the original class; otherwise the test loses its value, since it no longer tests the behavior of the original class.
The key points:
- having a private field lengthPlusOneWasCalledForCurrentCall which records, for the current call, whether the method determineLengthPlusOne was called
- having a public method to read the value of lengthPlusOneWasCalledForCurrentCall for the string used as parameter; it enables the assertion
- having a public method to clean the state of lengthPlusOneWasCalledForCurrentCall; it enables keeping a clean state after the assertion
package cache;

import java.util.HashSet;
import java.util.Set;

import org.junit.Assert;
import org.junit.Test;

public class LengthWithCacheTest {

    private class LengthWithCacheAugmentedForTest extends LengthWithCache {
        private Set<String> lengthPlusOneWasCalledForCurrentCall = new HashSet<>();

        @Override
        protected int determineLengthPlusOne(String string) {
            // start : info for testing
            this.lengthPlusOneWasCalledForCurrentCall.add(string);
            // end : info for testing
            return super.determineLengthPlusOne(string);
        }

        // method for assertion
        public boolean isLengthPlusOneCalled(String string) {
            return lengthPlusOneWasCalledForCurrentCall.contains(string);
        }

        // method added to clean the state of current calls
        public void cleanCurrentCalls() {
            lengthPlusOneWasCalledForCurrentCall.clear();
        }
    }

    @Test
    public void testGetLenghtPlusOne() {
        LengthWithCacheAugmentedForTest lengthWithCache = new LengthWithCacheAugmentedForTest();
        final String string = "apple";
        // here check that determineLengthPlusOne has been called once
        Assert.assertEquals(6, lengthWithCache.getLenghtPlusOne(string));
        Assert.assertTrue(lengthWithCache.isLengthPlusOneCalled(string));
        // clean call registered
        lengthWithCache.cleanCurrentCalls();
        // here check that determineLengthPlusOne has not been called
        Assert.assertEquals(6, lengthWithCache.getLenghtPlusOne(string));
        Assert.assertFalse(lengthWithCache.isLengthPlusOneCalled(string));
    }
}
Edit 28-07-16 to show why more code is needed to handle more scenarios
Suppose I improve the test by asserting that there are no side effects: adding an element to the cache for one key has no effect on how the cache is handled for other keys.
The test below fails because the counter doesn't depend on the string key, so it always increments.
@Test
public void verifyInvocationCountsWithDifferentElementsAdded() {
    final AtomicInteger plusOneInvkCounter = new AtomicInteger();
    LengthWithCache lengthWithCache = new LengthWithCache() {
        @Override
        protected int determineLengthPlusOne(String string) {
            plusOneInvkCounter.incrementAndGet();
            return super.determineLengthPlusOne(string);
        }
    };
    Assert.assertEquals(0, plusOneInvkCounter.get());
    lengthWithCache.getLenghtPlusOne("apple");
    Assert.assertEquals(1, plusOneInvkCounter.get());
    lengthWithCache.getLenghtPlusOne("pie");
    Assert.assertEquals(1, plusOneInvkCounter.get());
    lengthWithCache.getLenghtPlusOne("eggs");
    Assert.assertEquals(1, plusOneInvkCounter.get());
}
My version is longer because it provides more features, and so it can handle a broader range of unit testing scenarios.
Edit 28-07-16 to point out the Integer caching
No direct relation to the original question, but a little wink :)
Your getLenghtPlusOne(String string) should use Integer.valueOf(int) instead of new Integer(int).
Integer.valueOf(int) uses an internal cache.
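For instance, the caching line from the question could become:

lengthPlusOneCache.put(string, Integer.valueOf(stringLenghtPlusOne));

Integer.valueOf(int) reuses cached Integer instances for small values (at least -128 to 127), avoiding an allocation on every call.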
It feels like using mocks is overthinking it. LengthWithCache can be overridden as an anonymous inner class within the context of a test to get the invocation count. This requires no restructuring of the existing class being tested.
public class LengthWithCacheTest {
    @Test
    public void verifyLengthEval() {
        LengthWithCache lengthWithCache = new LengthWithCache();
        assertEquals(6, lengthWithCache.getLenghtPlusOne("apple"));
    }

    @Test
    public void verifyInvocationCounts() {
        final AtomicInteger plusOneInvkCounter = new AtomicInteger();
        LengthWithCache lengthWithCache = new LengthWithCache() {
            @Override
            protected int determineLengthPlusOne(String string) {
                plusOneInvkCounter.incrementAndGet();
                return super.determineLengthPlusOne(string);
            }
        };
        lengthWithCache.getLenghtPlusOne("apple");
        assertEquals(1, plusOneInvkCounter.get());
        lengthWithCache.getLenghtPlusOne("apple");
        lengthWithCache.getLenghtPlusOne("apple");
        lengthWithCache.getLenghtPlusOne("apple");
        lengthWithCache.getLenghtPlusOne("apple");
        lengthWithCache.getLenghtPlusOne("apple");
        lengthWithCache.getLenghtPlusOne("apple");
        assertEquals(1, plusOneInvkCounter.get());
    }
}
It's worth noting the separation between the two tests. One verifies that the length evaluation is right, the other verifies the invocation count.
If a wider data set for validation is required, you can turn the test above into a parameterized test and provide multiple data sets and expectations. In the sample below I've added a data set of 49 generated strings (lengths 1-49), an empty string, and a null value.
The null case fails.
@RunWith(Parameterized.class)
public class LengthWithCacheTest {

    @Parameters(name = "{0}")
    public static Collection<Object[]> buildTests() {
        Collection<Object[]> paramRefs = new ArrayList<Object[]>();
        paramRefs.add(new Object[]{null, 0});
        paramRefs.add(new Object[]{"", 1});
        for (int counter = 1; counter < 50; counter++) {
            String data = "";
            for (int index = 0; index < counter; index++) {
                data += "a";
            }
            paramRefs.add(new Object[]{data, counter + 1});
        }
        return paramRefs;
    }

    private String stringToTest;
    private int expectedLength;

    public LengthWithCacheTest(String string, int length) {
        this.stringToTest = string;
        this.expectedLength = length;
    }

    @Test
    public void verifyLengthEval() {
        LengthWithCache lengthWithCache = new LengthWithCache();
        assertEquals(expectedLength, lengthWithCache.getLenghtPlusOne(stringToTest));
    }

    @Test
    public void verifyInvocationCounts() {
        final AtomicInteger plusOneInvkCounter = new AtomicInteger();
        LengthWithCache lengthWithCache = new LengthWithCache() {
            @Override
            protected int determineLengthPlusOne(String string) {
                plusOneInvkCounter.incrementAndGet();
                return super.determineLengthPlusOne(string);
            }
        };
        assertEquals(0, plusOneInvkCounter.get());
        lengthWithCache.getLenghtPlusOne(stringToTest);
        assertEquals(1, plusOneInvkCounter.get());
        lengthWithCache.getLenghtPlusOne(stringToTest);
        assertEquals(1, plusOneInvkCounter.get());
        lengthWithCache.getLenghtPlusOne(stringToTest);
        assertEquals(1, plusOneInvkCounter.get());
    }
}
Parameterized testing is one of the best ways to vary your data set through a test, but it adds complexity to the test and can be difficult to maintain. It's useful to know about, but not always the right tool for the job.
As this was an interesting question, I decided to write the tests in two different ways: one with mocking (using JMockit) and the other without. (Personally, I prefer the version without mocking.) In either case, the original class is tested, with no modifications:
package example;

import mockit.*;
import org.junit.*;
import static org.junit.Assert.*;

public class LengthWithCacheMockedTest {
    @Tested(availableDuringSetup = true) @Mocked LengthWithCache lengthWithCache;

    @Before
    public void recordComputedLengthPlusOneWhileFixingTheNumberOfAllowedInvocations() {
        new Expectations() {{
            lengthWithCache.determineLengthPlusOne(anyString); result = 6; times = 1;
        }};
    }

    @Test
    public void getLenghtPlusOneNotFromCacheWhenCalledTheFirstTime() {
        int length = lengthWithCache.getLenghtPlusOne("apple");
        assertEquals(6, length);
    }

    @Test
    public void getLenghtPlusOneFromCacheWhenCalledAfterFirstTime() {
        int length1 = lengthWithCache.getLenghtPlusOne("apple");
        int length2 = lengthWithCache.getLenghtPlusOne("apple");
        assertEquals(6, length1);
        assertEquals(length1, length2);
    }
}
package example;

import mockit.*;
import org.junit.*;
import static org.junit.Assert.*;

public class LengthWithCacheNotMockedTest {
    @Tested LengthWithCache lengthWithCache;

    @Test
    public void getLenghtPlusOneNotFromCacheWhenCalledTheFirstTime() {
        long t0 = System.currentTimeMillis(); // millisecond precision is enough here
        int length = lengthWithCache.getLenghtPlusOne("apple");
        long dt = System.currentTimeMillis() - t0;
        assertEquals(6, length);
        assertTrue(dt >= 100); // assume at least 100 millis to compute the expensive value
    }

    @Test
    public void getLenghtPlusOneFromCacheWhenCalledAfterFirstTime() {
        // First time: takes some time to compute.
        int length1 = lengthWithCache.getLenghtPlusOne("apple");

        // Second time: gets from cache, takes no time.
        long t0 = System.nanoTime(); // max precision here
        int length2 = lengthWithCache.getLenghtPlusOne("apple");
        long dt = System.nanoTime() - t0;

        assertEquals(6, length1);
        assertEquals(length1, length2);
        assertTrue(dt < 1000000); // 1000000 nanos = 1 millis
    }
}
Just one detail: for the tests above to work, I added the following line inside the LengthWithCache#determineLengthPlusOne(String) method, in order to simulate the real-world scenario where the computation takes some time:
try { Thread.sleep(100); } catch (InterruptedException ignore) {}
Based on the proposal by krzyk here is my fully working solution:
The calculator itself:
public class LengthPlusOneCalculator {
    public int calculateLengthPlusOne(String string) {
        return string.length() + 1;
    }
}
The separate caching mechanism:
public class LengthPlusOneCache {
    private LengthPlusOneCalculator lengthPlusOneCalculator;
    private java.util.Map<String, Integer> lengthPlusOneCache = new java.util.HashMap<String, Integer>();

    public LengthPlusOneCache(LengthPlusOneCalculator lengthPlusOneCalculator) {
        this.lengthPlusOneCalculator = lengthPlusOneCalculator;
    }

    public int calculateLenghtPlusOne(String string) {
        Integer cachedStringLenghtPlusOne = lengthPlusOneCache.get(string);
        if (cachedStringLenghtPlusOne != null) {
            return cachedStringLenghtPlusOne;
        }
        int stringLenghtPlusOne = lengthPlusOneCalculator.calculateLengthPlusOne(string);
        lengthPlusOneCache.put(string, new Integer(stringLenghtPlusOne));
        return stringLenghtPlusOne;
    }
}
The unit test for checking the LengthPlusOneCalculator:
import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class LengthPlusOneCalculatorTest {
    @Test
    public void testCalculateLengthPlusOne() {
        LengthPlusOneCalculator lengthPlusOneCalculator = new LengthPlusOneCalculator();
        assertEquals(6, lengthPlusOneCalculator.calculateLengthPlusOne("apple"));
    }
}
And finally, the unit test for LengthPlusOneCache, checking the number of invocations:
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.*;

import org.junit.Test;

public class LengthPlusOneCacheTest {
    @Test
    public void testNumberOfInvocations() {
        LengthPlusOneCalculator lengthPlusOneCalculatorMock = mock(LengthPlusOneCalculator.class);
        when(lengthPlusOneCalculatorMock.calculateLengthPlusOne("apple")).thenReturn(6);
        LengthPlusOneCache lengthPlusOneCache = new LengthPlusOneCache(lengthPlusOneCalculatorMock);
        verify(lengthPlusOneCalculatorMock, times(0)).calculateLengthPlusOne("apple"); // verify that not called yet
        assertEquals(6, lengthPlusOneCache.calculateLenghtPlusOne("apple"));
        verify(lengthPlusOneCalculatorMock, times(1)).calculateLengthPlusOne("apple"); // verify that already called once
        assertEquals(6, lengthPlusOneCache.calculateLenghtPlusOne("apple"));
        verify(lengthPlusOneCalculatorMock, times(1)).calculateLengthPlusOne("apple"); // verify that not called again
    }
}
We can safely use the mocking mechanism, as we are already convinced by its own unit tests that the mocked class works properly.
Normally this is built into a build system; this example can be compiled and run from the command line as follows (the files junit-4.10.jar and mockito-all-1.9.5.jar have to be copied to the working directory):
javac -cp .;junit-4.10.jar;mockito-all-1.9.5.jar *.java
java -cp .;junit-4.10.jar org.junit.runner.JUnitCore LengthPlusOneCalculatorTest
java -cp .;junit-4.10.jar;mockito-all-1.9.5.jar org.junit.runner.JUnitCore LengthPlusOneCacheTest
However, I'm still not fully satisfied with this approach. My issues are the following:
Function calculateLengthPlusOne is mocked. I would prefer a solution where a mocking (or whatever) framework just counts the number of invocations, but the original code runs. (Something like this is mentioned by davidhxxx, however I don't find that one perfect either.)
The code became a bit over-complicated. This is not the way one would normally write it. Therefore this approach is not adequate if the original code is not fully under our control. This could be a constraint in reality.
Normally I would make function calculateLengthPlusOne static. This approach does not work in such a case. (But maybe my Mockito knowledge is weak.)
If someone could address any of these issues, I would really appreciate it!
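One note on the first point: a Mockito spy might help there, since a spy runs the real code by default while still recording invocations. A minimal sketch, reusing the classes above:

LengthPlusOneCalculator spiedCalculator = spy(new LengthPlusOneCalculator());
LengthPlusOneCache lengthPlusOneCache = new LengthPlusOneCache(spiedCalculator);

assertEquals(6, lengthPlusOneCache.calculateLenghtPlusOne("apple")); // real calculation runs
assertEquals(6, lengthPlusOneCache.calculateLenghtPlusOne("apple")); // served from the cache
verify(spiedCalculator, times(1)).calculateLengthPlusOne("apple");   // yet only one real invocation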

Cloud Dataflow: reading entire text files rather than line by line

I'm looking for a way to read ENTIRE files so that every file will be read entirely into a single String.
I want to pass a pattern of JSON text files on gs://my_bucket/*/*.json, have a ParDo, and then process each and every file entirely.
What's the best approach to it?
I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different.
I think what you want to do is to define a new subclass of FileBasedSource and use Read.from(<source>). Your source will also include a subclass of FileBasedReader; the source contains the configuration data and the reader actually does the reading.
I think a full description of the API is best left to the Javadoc, but I will highlight the key override points and how they relate to your needs:
FileBasedSource#isSplittable() you will want to override and return false. This will indicate that there is no intra-file splitting.
FileBasedSource#createForSubrangeOfFile(String, long, long) you will override to return a sub-source for just the file specified.
FileBasedSource#createSingleFileReader() you will override to produce a FileBasedReader for the current file (the method should assume it is already split to the level of a single file).
To implement the reader:
FileBasedReader#startReading(...) you will override to do nothing; the framework will already have opened the file for you, and it will close it.
FileBasedReader#readNextRecord() you will override to read the entire file as a single element.
[1] One easy special case is when you actually have a small number of files that you can expand prior to job submission, and they all take about the same amount of time to process. Then you can just use Create.of(expand(<glob>)) followed by ParDo(<read a file>).
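A rough sketch of that special case (hedged: expand() and readFile() here are hypothetical helpers you would supply; the DoFn uses Beam-style annotations):

List<String> files = expand("gs://my_bucket/*/*.json"); // hypothetical glob-expansion helper

p.apply(Create.of(files))
 .apply(ParDo.of(new DoFn<String, KV<String, String>>() {
     @ProcessElement
     public void processElement(ProcessContext c) throws IOException {
         String contents = readFile(c.element()); // hypothetical helper that slurps one file
         c.output(KV.of(c.element(), contents));
     }
 }));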
I was looking for a similar solution myself. Following Kenn's recommendations and a few other references, such as XMLSource.java, I created the following custom source, which seems to be working fine.
I am not a developer, so if anyone has suggestions on how to improve it, please feel free to contribute.
public class FileIO {
    // Match TextIO.
    public static Read.Bounded<KV<String, String>> readFilepattern(String filepattern) {
        return Read.from(new FileSource(filepattern, 1));
    }

    public static class FileSource extends FileBasedSource<KV<String, String>> {
        private String filename = null;

        public FileSource(String fileOrPattern, long minBundleSize) {
            super(fileOrPattern, minBundleSize);
        }

        public FileSource(String filename, long minBundleSize, long startOffset, long endOffset) {
            super(filename, minBundleSize, startOffset, endOffset);
            this.filename = filename;
        }

        // This will indicate that there is no intra-file splitting.
        @Override
        public boolean isSplittable() {
            return false;
        }

        @Override
        public boolean producesSortedKeys(PipelineOptions options) throws Exception {
            return false;
        }

        @Override
        public void validate() {}

        @Override
        public Coder<KV<String, String>> getDefaultOutputCoder() {
            return KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of());
        }

        @Override
        public FileBasedSource<KV<String, String>> createForSubrangeOfFile(String fileName, long start, long end) {
            return new FileSource(fileName, getMinBundleSize(), start, end);
        }

        @Override
        public FileBasedReader<KV<String, String>> createSingleFileReader(PipelineOptions options) {
            return new FileReader(this);
        }
    }

    /**
     * A reader that reads an entire file of text from a {@link FileSource}.
     */
    private static class FileReader extends FileBasedSource.FileBasedReader<KV<String, String>> {
        private static final Logger LOG = LoggerFactory.getLogger(FileReader.class);
        private static final int BUF_SIZE = 1024;
        private ReadableByteChannel channel = null;
        private long nextOffset = 0;
        private long currentOffset = 0;
        private boolean isAtSplitPoint = false;
        private final ByteBuffer buf;
        private KV<String, String> currentValue = null;
        private String filename;

        public FileReader(FileSource source) {
            super(source);
            buf = ByteBuffer.allocate(BUF_SIZE);
            buf.flip();
            this.filename = source.filename;
        }

        private int readFile(ByteArrayOutputStream out) throws IOException {
            int byteCount = 0;
            while (true) {
                if (!buf.hasRemaining()) {
                    buf.clear();
                    int read = channel.read(buf);
                    if (read < 0) {
                        break;
                    }
                    buf.flip();
                }
                byte b = buf.get();
                byteCount++;
                out.write(b);
            }
            return byteCount;
        }

        @Override
        protected void startReading(ReadableByteChannel channel) throws IOException {
            this.channel = channel;
        }

        @Override
        protected boolean readNextRecord() throws IOException {
            currentOffset = nextOffset;
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            int offsetAdjustment = readFile(buf);
            if (offsetAdjustment == 0) {
                // EOF
                return false;
            }
            nextOffset += offsetAdjustment;
            isAtSplitPoint = true;
            currentValue = KV.of(this.filename, CoderUtils.decodeFromByteArray(StringUtf8Coder.of(), buf.toByteArray()));
            return true;
        }

        @Override
        protected boolean isAtSplitPoint() {
            return isAtSplitPoint;
        }

        @Override
        protected long getCurrentOffset() {
            return currentOffset;
        }

        @Override
        public KV<String, String> getCurrent() throws NoSuchElementException {
            return currentValue;
        }
    }
}
A much simpler method is to generate the list of filenames and write a function to process each file individually. I'm showing Python, but Java is similar:
def generate_filenames():
    for shard in xrange(0, 300):
        yield 'gs://bucket/some/dir/myfilname-%05d-of-00300' % shard

with beam.Pipeline(...) as p:
    (p
     | beam.Create(generate_filenames())  # turn the filenames into a PCollection
     | beam.FlatMap(lambda filename: readfile(filename))
     | ...)
FileIO does that for you without the need to implement your own FileBasedSource.
Create matches for each of the files that you want to read:
mypipeline.apply("Read files from GCS", FileIO.match().filepattern("gs://mybucket/myfilles/*.txt"))
Also, you can match like this if you do not want Dataflow to throw an exception when no file is found for your file pattern:
mypipeline.apply("Read files from GCS", FileIO.match().filepattern("gs://mybucket/myfilles/*.txt").withEmptyMatchTreatment(EmptyMatchTreatment.ALLOW))
Read your matches using FileIO:
.apply("Read file matches", FileIO.readMatches())
The above code returns a PCollection of the type FileIO.ReadableFile (PCollection<FileIO.ReadableFile>). Then you create a DoFn that processes these ReadableFiles to meet your use case.
.apply("Process my files", ParDo.of(MyCustomDoFnToProcessFiles.create()))
You can read the entire documentation for FileIO here.

CellUtil: Key type in createCell method

I am using the CellUtil class packaged in org.apache.hadoop.hbase to create a Cell object. The function header looks like this:
public static Cell createCell(byte[] row, byte[] family, byte[] qualifier, long timestamp, byte type, byte[] value)
What does the 5th argument, byte type, represent? I looked into the KeyValue class, and it refers to an enum called Type with the following definition:
public static enum Type {
    Minimum((byte) 0),
    Put((byte) 4),
    Delete((byte) 8),
    DeleteFamilyVersion((byte) 10),
    DeleteColumn((byte) 12),
    DeleteFamily((byte) 14),

    // Maximum is used when searching; you look from maximum on down.
    Maximum((byte) 255);

    private final byte code;

    Type(final byte c) {
        this.code = c;
    }

    public byte getCode() {
        return this.code;
    }
}
My question is: what have the types Minimum, Put, etc. got to do with the type of cell I want to create?
Sarin,
Please refer to 69.7.6. KeyValue.
There are some scenarios in which you will use these enums. For example, if I'm writing a coprocessor like the one below, I will use KeyValue.Type.Put.getCode().
The other enums can be used similarly.
See the example coprocessor usage below...
package getObserver;

import java.io.IOException;
import java.util.List;
import java.util.NavigableSet;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;

public class Observer extends BaseRegionObserver {
    private boolean isOewc;

    @Override
    public void preGetOp(ObserverContext<RegionCoprocessorEnvironment> arg0,
            Get arg1, List<Cell> arg2) throws IOException {
        NavigableSet<byte[]> qset = arg1.getFamilyMap().get("colfam1".getBytes());
        if (qset == null) { // do nothing
        } else {
            String message = "qset.size() = " + String.valueOf(qset.size());
            String m = "isOewc = " + String.valueOf(isOewc);
            this.isOewc = true;
            Cell cell = CellUtil.createCell(
                    "preGet Row".getBytes(),
                    m.getBytes(),
                    message.getBytes(),
                    System.currentTimeMillis(),
                    KeyValue.Type.Put.getCode(),
                    "preGet Value".getBytes());
            arg2.add(cell);
        }
    }

    @Override
    public void postGetOp(ObserverContext<RegionCoprocessorEnvironment> arg0,
            Get arg1, List<Cell> arg2) throws IOException {
        String m = "isOewc = " + String.valueOf(isOewc);
        Cell cell = CellUtil.createCell(
                "postGet Row".getBytes(),
                m.getBytes(),
                "postGet Qualifier".getBytes(),
                System.currentTimeMillis(),
                KeyValue.Type.Put.getCode(),
                "postGet Value".getBytes());
        arg2.add(cell);
    }
}
Similarly, the other enum types can be used, depending on which operation you are going to perform in the coprocessor event.
The programcreek examples clearly explain the usage of Put and Delete (preparing key/value pairs for mutation) and of Maximum and Minimum (for range checks); the coprocessor example above uses Put.
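For instance, a delete marker could be built the same way (a sketch using the same CellUtil.createCell overload from the question; the row/family/qualifier values here are made up):

Cell deleteMarker = CellUtil.createCell(
        "row1".getBytes(),
        "colfam1".getBytes(),
        "qual1".getBytes(),
        System.currentTimeMillis(),
        KeyValue.Type.DeleteColumn.getCode(), // a DeleteColumn marker covers all versions of the column
        new byte[0]);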
