Buffer and flush Apache Beam streaming data

I have a streaming job that, on its initial run, will have to process a large amount of data. One of the DoFns calls a remote service that supports batch requests, so when working with bounded collections I use the following approach:
private static final class Function extends DoFn<String, Void> implements Serializable {
    private static final long serialVersionUID = 2417984990958377700L;
    private static final int LIMIT = 500;
    private transient Queue<String> buffered;
    @StartBundle
    public void startBundle(Context context) throws Exception {
        buffered = new LinkedList<>();
    }
    @ProcessElement
    public void processElement(ProcessContext context) throws Exception {
        buffered.add(context.element());
        if (buffered.size() > LIMIT) {
            flush();
        }
    }
    @FinishBundle
    public void finishBundle(Context c) throws Exception {
        // process remaining
        flush();
    }
    private void flush() {
        // build batch request
        while (!buffered.isEmpty()) {
            buffered.poll();
            // do something
        }
    }
}
Is there a way to window the data so that the same approach can be used on unbounded collections?
I've tried the following:
pipeline
    .apply("Read", Read.from(source))
    .apply(WithTimestamps.of(input -> Instant.now()))
    .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2L))))
    .apply("Process", ParDo.of(new Function()));
but startBundle and finishBundle are called for every element. Is there a way to get something like what RxJava offers (2-minute windows or 100-element bundles)?
source
    .toFlowable(BackpressureStrategy.LATEST)
    .buffer(2, TimeUnit.MINUTES, 100)

This is a quintessential use case for the new per-key-and-window state and timers feature.
State is described in a Beam blog post, while for timers you'll have to rely on the Javadoc. Never mind what the Javadoc says about runners supporting them; the true status is found in Beam's capability matrix.
The pattern is very much like what you have written, but state lets it work with windows and also across bundles, which may be very small in streaming. Since state must be partitioned somehow to maintain parallelism, you'll need to add some sort of key. Currently there is no automatic sharding for this.
private static final class Function extends DoFn<KV<Key, String>, Void> implements Serializable {
    private static final long serialVersionUID = 2417984990958377700L;
    private static final int LIMIT = 500;
    @StateId("bufferedSize")
    private final StateSpec<Object, ValueState<Integer>> bufferedSizeSpec =
        StateSpecs.value(VarIntCoder.of());
    @StateId("buffered")
    private final StateSpec<Object, BagState<String>> bufferedSpec =
        StateSpecs.bag(StringUtf8Coder.of());
    @TimerId("expiry")
    private final TimerSpec expirySpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);
    @ProcessElement
    public void processElement(
        ProcessContext context,
        BoundedWindow window,
        @StateId("bufferedSize") ValueState<Integer> bufferedSizeState,
        @StateId("buffered") BagState<String> bufferedState,
        @TimerId("expiry") Timer expiryTimer) {
        int size = firstNonNull(bufferedSizeState.read(), 0);
        bufferedState.add(context.element().getValue());
        size += 1;
        bufferedSizeState.write(size);
        // allowedLateness is not defined in this snippet; it is assumed to be a
        // Duration matching the allowed lateness of the windowing strategy.
        expiryTimer.set(window.maxTimestamp().plus(allowedLateness));
        if (size > LIMIT) {
            flush(context, bufferedState, bufferedSizeState);
        }
    }
    @OnTimer("expiry")
    public void onExpiry(
        OnTimerContext context,
        @StateId("bufferedSize") ValueState<Integer> bufferedSizeState,
        @StateId("buffered") BagState<String> bufferedState) {
        flush(context, bufferedState, bufferedSizeState);
    }
    private void flush(
        WindowedContext context,
        BagState<String> bufferedState,
        ValueState<Integer> bufferedSizeState) {
        Iterable<String> buffered = bufferedState.read();
        // build batch request from buffered
        ...
        // clear things
        bufferedState.clear();
        bufferedSizeState.clear();
    }
}
A few notes:
State replaces your DoFn's instance variables, since instance variables have no cohesion across windows.
The buffer and the size are initialized lazily as needed instead of in @StartBundle.
BagState supports "blind" writes, so there is no read-modify-write; new elements are simply committed, in the same way as when you output.
Setting a timer repeatedly for the same time is fine; it should mostly be a no-op.
@OnTimer("expiry") takes the place of @FinishBundle, since finishing a bundle is not a per-window event but an artifact of how a runner executes your pipeline.
All that said, if you are writing to an external system, you may want to reify the windows and re-window into the global window first, and only then perform writes whose form depends on the reified window, since "the external world is globally windowed".
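On the earlier point about needing a key: one possible way to pick it, sketched here in plain Java with no Beam dependency, is to hash each element into a fixed number of shards. The class name `ShardKeys`, the method `shardFor`, and the shard count of 10 are illustrative assumptions and not part of Beam.

```java
/** Hypothetical helper: assigns elements to a fixed number of shards. */
final class ShardKeys {
    private ShardKeys() {}

    /**
     * Maps an element to a shard in the range [0, numShards).
     * Math.floorMod keeps the result non-negative even when hashCode() is negative.
     */
    static int shardFor(String element, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        return Math.floorMod(element.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 10; // assumption: tune to the parallelism you want
        for (String element : new String[] {"a", "b", "c"}) {
            System.out.println(element + " -> shard " + shardFor(element, numShards));
        }
    }
}
```

In a pipeline, a function like this would be applied before the stateful step to turn each String into a keyed pair. More shards give more parallelism but smaller batches per key, so the count is a trade-off rather than a fixed rule.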

The documentation for Apache Beam 0.6.0 says that StateId is "Not currently supported by any runner."

Related

Mockito tests for class with Handler

I'm new to Android development. I have an app that contains a ViewPager with two UI fragments and one non-UI fragment in which operations are performed (I use setRetainInstance(true); it is deprecated, but I must use it). In this non-UI fragment I have a Handler that accepts messages from operations started with ExecutorServices.
Now my task is to test this app with Mockito, and I'm totally confused.
My mentor said: "you have to mock the operation that produces the result, is performed in a nonUIFragment, and its result is stored in a collection."
What should this test look like? I can't create a spy() of the NonUIFragment class and use its real methods because of "Method getMainLooper in android.os.Looper not mocked."
All of my methods are void and don't return anything, so how can I trace this chain?
NonUIFragment.java
private NonUIToActivityInterface nonUIInterface;
private final Map<DefOperandTags, HashMap<DefOperationTags, String>> allResultsMap
        = new HashMap<>();
@Override
public void onCreate(@Nullable Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setRetainInstance(true);
}
// The Handler passes results to this method
public void passAndSaveResult(DefOperandTags operandTag, DefOperationTags operationTag, String result) {
    allResultsMap.get(operandTag).put(operationTag, result);
}
private final Handler handler = new Handler(Looper.getMainLooper()) {
    @Override
    public void handleMessage(Message msg) {
        // Ignore messages that carry no payload
        if (msg.obj != null)
            passAndSaveResult(defOperandTags, defOperationTag, msg.obj.toString());
    }
};
OneOfOperation.java (add value to the List)
public class AddToStartList extends Operation {
    public AddToStartList(List list, DefOperationTags operationTag) {
        super(list);
        key = operationTag;
    }
    @Override
    public void operation(Object collection) {
        ((List) collection).add(0, "123");
    }
}
So, how can I implement what my mentor said?

This is going to be tricky, because the Android library on your unit-test classpath has no real implementations, and static methods are generally more difficult to mock safely and effectively.
Recent versions of Mockito have added the ability to mock static methods without using another library like PowerMock, so the first choice would be something like that. If at all possible, use mockStatic to mock Looper::getMainLooper.
Another solution is to add some indirection, giving you a testing seam:
public class NonUIFragment extends Fragment {
    /** Visible for testing. */
    static Looper overrideLooper;
    // ...
    private final Handler handler = new Handler(
            overrideLooper != null ? overrideLooper : Looper.getMainLooper()) {
        /* ... */
    };
}
Finally, if you find yourself doing this kind of mock a lot, you can consider a library like Robolectric. Using Robolectric you could simulate the looper with a ShadowLooper, which would let you remote-control it, while using Mockito for any classes your team has written. This would prevent you from having to mock a realistic Looper for every test, for instance.

How do I suggest that the user wrap this parameter in a method call?

A common problem we find in code review is people writing this:
assertThat(thing, nullValue());
instead of this:
assertThat(thing, is(nullValue()));
In order to catch it sooner, I thought I'd try writing a custom error-prone check. This is a poorly documented area though so I've been doing so by digging inside GitHub for working examples.
I have so far:
@AutoService(BugChecker.class)
@BugPattern(
    name = "AssertThatThingNullValue",
    summary = "`assertThat(thing, nullValue())` doesn't sound like English, wrap `nullValue` in `is`",
    severity = WARNING)
public class AssertThatThingNullValue extends BugChecker implements MethodInvocationTreeMatcher
{
    private static final Matcher<ExpressionTree> ASSERT_THAT = staticMethod()
        .onClassAny("org.hamcrest.MatcherAssert", "org.junit.Assert")
        .named("assertThat");
    private static final Matcher<ExpressionTree> NULL_VALUE = staticMethod()
        .onClass("org.hamcrest.Matchers")
        .named("nullValue");
    private static final Matcher<ExpressionTree> NULL_VALUE_INVOCATION =
        methodInvocation(NULL_VALUE);
    private static final Matcher<ExpressionTree> ASSERT_THAT_THING_NULL_VALUE =
        methodInvocation(ASSERT_THAT, MatchType.LAST, NULL_VALUE_INVOCATION);
    @Override
    public Description matchMethodInvocation(MethodInvocationTree tree, VisitorState state)
    {
        if (ASSERT_THAT_THING_NULL_VALUE.matches(tree, state))
        {
            return buildDescription(tree)
                .addFix(SuggestedFixes.somethingGoesHere(...))
                .build();
        }
        return Description.NO_MATCH;
    }
}
My problem is I can't figure out how to build the suggested fix from the available methods in SuggestedFixes. I'm wondering whether this API just isn't fleshed out well or whether I'm just going down the wrong track entirely and should be writing the check in a better way?

#OnTimer method receives null references when fired

I've recently been dealing with an issue that has been driving me crazy: it only happens once the job is deployed in Dataflow, never locally, where everything works flawlessly. FYI, I'm using Apache Beam 2.9.0.
I'm defining a DoFn step that buffers events for a certain period of time, say 5 minutes, and after that time fires some logic.
@StateId("bufferSize")
private final StateSpec<ValueState<Integer>> bufferSizeSpec =
    StateSpecs.value(VarIntCoder.of());
@StateId("eventsBuffer")
private final StateSpec<BagState<String>> eventsBufferSpec =
    StateSpecs.bag(StringUtf8Coder.of());
@TimerId("trigger")
private final TimerSpec triggerSpec =
    TimerSpecs.timer(TimeDomain.PROCESSING_TIME);
I've got my processElement logic to add incoming events...
@ProcessElement
public void processElement(
    ProcessContext processContext,
    @StateId("bufferSize") ValueState<Integer> bufferSize,
    @StateId("eventsBuffer") BagState<String> eventsBuffer,
    @TimerId("trigger") Timer triggerTimer) {
    triggerTimer.offset(Duration.standardMinutes(1)).setRelative();
    int size = ObjectUtils.firstNonNull(bufferSize.read(), 0);
    eventsBuffer.add(processContext.element().getValue());
    bufferSize.write(++size);
}
And then my trigger...
@OnTimer("trigger")
public void onExpiry(
    @StateId("bufferSize") ValueState<Integer> bufferSize,
    @StateId("eventsBuffer") BagState<String> eventsBuffer) throws Exception {
    doSomethingHere();
}
Whenever onExpiry is executed, the parameters that it receives are null and 0.
What could be going on cluster-wise?
EDIT:
This is the window applied before the DoFn:
.apply(
    "1min Window",
    Window
        .<KV<String, String>>into(
            FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterProcessingTime
            .pastFirstElementInPane()
            .plusDelayOf(Duration.standardSeconds(1)))
        .withAllowedLateness(Duration.ZERO)
        .accumulatingFiredPanes())

It is important to note that state is held per key-and-window tuple; when the window expires, its state is garbage-collected.
So for key-1, your bag will hold separate data for {key-1, TimeInterval-1}, {key-1, TimeInterval-2}, and so on.
If you want strong semantics between the input values and your timer, you may want to explore the use of an event-time timer.
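To make the key-and-window scoping concrete, here is a small plain-Java model (no Beam involved) of how fixed windows split one key's elements into separate state cells. The class `KeyWindowStateModel`, its methods, and the sample timestamps are hypothetical; only the one-minute window size comes from the question, and the windows are assumed to be aligned to the epoch with no offset.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of per-key-and-window state; this is not Beam code. */
final class KeyWindowStateModel {
    static final long WINDOW_MILLIS = 60_000L; // one-minute fixed windows

    // One buffer per (key, window start) pair.
    private final Map<String, List<String>> cells = new LinkedHashMap<>();

    /** Start of the fixed window that contains the given timestamp. */
    static long windowStart(long timestampMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, WINDOW_MILLIS);
    }

    private static String cellId(String key, long timestampMillis) {
        return key + "@" + windowStart(timestampMillis);
    }

    /** Appends a value to the cell for this key and the window of this timestamp. */
    void add(String key, long timestampMillis, String value) {
        cells.computeIfAbsent(cellId(key, timestampMillis), id -> new ArrayList<>()).add(value);
    }

    /** Reads only the cell for this key and the window of this timestamp. */
    List<String> read(String key, long timestampMillis) {
        return cells.getOrDefault(cellId(key, timestampMillis), Collections.emptyList());
    }

    public static void main(String[] args) {
        KeyWindowStateModel state = new KeyWindowStateModel();
        state.add("key-1", 10_000L, "a"); // first window
        state.add("key-1", 50_000L, "b"); // first window
        state.add("key-1", 70_000L, "c"); // second window
        System.out.println(state.read("key-1", 0L));      // contents of the first window's cell
        System.out.println(state.read("key-1", 60_000L)); // contents of the second window's cell
    }
}
```

In this model, anything that runs for the second window can only see the second cell, even though the key is the same. It loosely mirrors the point about each window having its own state for a key; it does not model timers or garbage collection.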

Android - slow memory allocation for big count of objects

I have an Android app in which I need to load a file, process every line in it, and then work with the processed strings.
So I created an Entry class for a processed line and a holder for these entries, the EntryManager class, which looks like this:
public class EntryManager {
    private static EntryManager instance;
    private List<Entry> rules = new ArrayList<>();
    public static synchronized EntryManager getInstance() {
        if (instance == null) {
            instance = new EntryManager();
        }
        return instance;
    }
    public void addRule(String rule) {
        rules.add(RulesParser.parseRule(rule));
    }
    public void clear() {
        rules.clear();
    }
}
The Entry class is quite simple: it contains the original String line and a few boolean flags. There are about 70,000 entries in total, so 70,000 objects will be created.
The problem is:
The first time it takes 0.3 seconds to finish, but every later run takes ~7 seconds. Is it possible to avoid this without using the NDK?
UPD: It looks like it was my mistake. During testing my phone was connected to the PC via cable; once I unplugged it, everything was fine.

passing static fields to a thread

I wrote a small HTTP server in Java and I have a problem passing static variables (server configuration: port, root, etc.) to the thread that handles requests. I do not want my thread to modify these variables and if it extends the server class, it will also inherit its methods which I don't want.
I don't want to use getters for reasons of performance. If I make the static members final, I will have a problem when loading their values from the config file.
Here's an example:
class HTTPServer {
    static int port;
    static File root;
    etc..
    ....
    //must be public
    public void launch() throws HTTPServerException {
        loadConfig();
        while (!pool.isShutdown()) {
            ....
            //using some config here
            ...
            try {
                Socket s = ss.accept();
                Worker w = new Worker(s);
                pool.execute(w);
            } catch () {...}
        }
    }
    private void loadConfig() { /* reading from file */ }
    ...
    // other methods that must be public go here
}
I also don't want to have the worker as a nested class. It's in another package...
What do you propose?

You could put your config in a final AtomicReference. Then it can be referenced by your worker and also updated in a thread-safe manner.
Something like:
class HTTPServer {
    // Starts out empty (null); loadConfig() stores the first ServerConf
    public static final AtomicReference<ServerConf> config =
            new AtomicReference<>();
}
Make the new ServerConf class immutable:
class ServerConf {
    final int port;
    final File root;
    public ServerConf(int port, File root) {
        this.port = port;
        this.root = root;
    }
}
Then your worker can get a reference to the current config via HTTPServer.config.get(). Perhaps something like:
Worker w = new Worker(s, HTTPServer.config.get());
loadConfig() can set a new config via something like:
HTTPServer.config.set(new ServerConf(8080, new File("/foo/bar")));
If it's not important for all your config to change at the same time, you could skip the ServerConf class and use AtomicInteger for the port setting, and AtomicReference<File> for the root.
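Putting the pieces of this answer together, a minimal self-contained sketch might look like the following. `ConfigHolder`, `ConfWorker`, and `AtomicConfigDemo` are hypothetical stand-ins for the real server and worker classes, and the ports and paths are made-up values.

```java
import java.io.File;
import java.util.concurrent.atomic.AtomicReference;

/** Immutable snapshot of the server settings. */
final class ServerConf {
    final int port;
    final File root;

    ServerConf(int port, File root) {
        this.port = port;
        this.root = root;
    }
}

/** Stand-in for HTTPServer: owns the single shared reference to the current config. */
final class ConfigHolder {
    // Empty (null) until the first config is loaded.
    static final AtomicReference<ServerConf> CONFIG = new AtomicReference<>();

    private ConfigHolder() {}
}

/** Stand-in for the worker: it only ever sees the snapshot it was given. */
final class ConfWorker implements Runnable {
    private final ServerConf conf;

    ConfWorker(ServerConf conf) {
        this.conf = conf;
    }

    @Override
    public void run() {
        System.out.println("serving " + conf.root + " on port " + conf.port);
    }
}

final class AtomicConfigDemo {
    public static void main(String[] args) {
        // What loadConfig() would do after reading the file.
        ConfigHolder.CONFIG.set(new ServerConf(8080, new File("/foo/bar")));

        // The worker is handed the config that is current at accept time.
        ServerConf snapshot = ConfigHolder.CONFIG.get();

        // A later reload swaps the whole config atomically.
        ConfigHolder.CONFIG.set(new ServerConf(9090, new File("/other")));

        // The worker still uses the snapshot it received (port 8080).
        new ConfWorker(snapshot).run();
    }
}
```

Because ServerConf has only final fields, the worker cannot change the settings, and reading them is a plain field access rather than a getter call.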

Read the static data into a static 'sharedConfig' object that also has a socket field; you can use that field for the listening socket. When accept() returns with a server<>client socket, clone() the 'sharedConfig', shove in the new socket, and pass that object to the server<>client worker thread. The thread then gets a copy of the config that it can read and even modify if it wants to, without affecting any other thread or the static config.
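A rough sketch of this copy-per-connection idea, under some simplifying assumptions: the class `SharedConfig` and its `copyFor` method are hypothetical names, the socket field holds only the client socket (in Java the listening socket is a `ServerSocket`, a different type), and an unconnected `Socket` stands in for the result of accept().

```java
import java.io.File;
import java.net.Socket;

/** Hypothetical mutable config that is copied once per accepted connection. */
final class SharedConfig implements Cloneable {
    int port;
    File root;
    Socket socket; // set only on per-connection copies in this sketch

    SharedConfig(int port, File root) {
        this.port = port;
        this.root = root;
    }

    /** Returns a shallow copy that carries the given client socket. */
    SharedConfig copyFor(Socket clientSocket) {
        try {
            SharedConfig copy = (SharedConfig) super.clone();
            copy.socket = clientSocket;
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }
}

final class SharedConfigDemo {
    public static void main(String[] args) throws Exception {
        SharedConfig shared = new SharedConfig(8080, new File("/foo/bar"));
        try (Socket client = new Socket()) { // unconnected stand-in for ss.accept()
            SharedConfig mine = shared.copyFor(client);
            mine.port = 9090; // changes only this worker's copy
            System.out.println(shared.port + " " + mine.port);
        }
    }
}
```

The copy is shallow, which is enough here because File is immutable; a config holding mutable collections would need a deeper copy.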
