I have a fairly simple Dataflow pipeline that reads a file and writes its parsed records to Avro. This works in most cases, except when the source file is very large (20+ GB), which causes an OOM even on machines with a lot of memory. I am fairly sure this happens because Beam reads the non-splittable source in its entirety, so I implemented a splittable DoFn<FileIO.ReadableFile, GenericRecord>.
This functionally works in that the pipeline now succeeds, which seems to validate my assumption that the single large batch from a non-splittable file is the cause. However, this does not seem to spread the work across multiple workers. I tried the following:
Disabled autoscaling (autoscalingAlgorithm=NONE) and set numWorkers to 10. This had the same throughput as numWorkers 1
Left autoscaling on with a high maxWorkers. This went briefly up to 2, and then came back down to 1
Added a shuffle (Reshuffle.viaRandomKey) after the DoFn, but before the Avro write
Any ideas? The exact code is difficult to share because of company policy, but overall it is pretty simple. I implemented the following:
public class SplittableReadFn extends DoFn<FileIO.ReadableFile, GenericRecord> {

  // ...

  @ProcessElement
  public void process(final ProcessContext c, final OffsetRangeTracker tracker) {
    final FileIO.ReadableFile file = c.element();
    // Followed by something like
    ReadableByteStream in = file.open();
    in.seek(tracker.from());

    Parser parser = new Parser(in);
    while (parser.next()) {
      if (parser.getOffset() > tracker.to()) {
        break;
      }
      tracker.tryClaim(parser.getOffset());
      c.output(parser.item());
    }
    tracker.markDone();
  }

  @GetInitialRestriction
  public OffsetRange getInitialRestriction(final FileIO.ReadableFile file) {
    return new OffsetRange(0, getSize(file) - 1);
  }

  @SplitRestriction
  public void splitRestriction(final FileIO.ReadableFile file, final OffsetRange restriction,
      final DoFn.OutputReceiver<OffsetRange> receiver) {
    // chunkRange for test purposes just breaks into at most 500MB chunks
    for (final OffsetRange chunk : chunkRange(restriction)) {
      receiver.output(chunk);
    }
  }
}
Related
I'm using Apache Beam with a streaming collection of 1.5 GB.
My lookup table is a JdbcIO MySQL result.
When I run the pipeline without the side input, my job finishes in about 2 minutes. When I run it with the side input, the job never finishes; it gets stuck and dies.
Here is the code I use to store my lookup (~1M records):
PCollectionView<Map<String, String>> sideData = pipeline.apply(JdbcIO.<KV<String, String>>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
        "com.mysql.jdbc.Driver", "jdbc:mysql://ip")
        .withUsername("username")
        .withPassword("password"))
    .withQuery("select a_number from cell")
    .withCoder(KvCoder.of(StringUtf8Coder.of(), StringUtf8Coder.of()))
    .withRowMapper(new JdbcIO.RowMapper<KV<String, String>>() {
      public KV<String, String> mapRow(ResultSet resultSet) throws Exception {
        return KV.of(resultSet.getString(1), resultSet.getString(1));
      }
    })).apply(View.asMap());
Here is the code of my streaming collection
pipeline
    .apply("ReadMyFile", TextIO.read().from("/home/data/**")
        .watchForNewFiles(Duration.standardSeconds(60), Watch.Growth.<String>never()))
    .apply(Window.<String>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(30))))
        .accumulatingFiredPanes()
        .withAllowedLateness(ONE_DAY))
Here is the code of my ParDo that iterates over each event row (10M records):
.apply(ParDo.of(new DoFn<KV<String, Integer>, KV<String, Integer>>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    KV<String, Integer> i = c.element();
    String sideInputData = c.sideInput(sideData).get(i.getKey());
    if (sideInputData == null) {
      c.output(i);
    }
  }
}).withSideInputs(sideData));
I'm running on a Flink cluster, but the direct runner gives the same result.
Cluster:
2 CPUs
6 cores
24 GB RAM
What am I doing wrong?
I've followed this
The solution was to create a "cache" Map.
The side input is only read once, and then I cache it into a map-like structure.
That way I avoid doing a side-input lookup for every processElement.
.apply(ParDo.of(new DoFn<KV<String, Integer>, KV<String, Integer>>() {

  // cached copy of the side input, read only once per DoFn instance
  private Map<String, String> myList;
  private boolean isFirstTime = true;

  @ProcessElement
  public void processElement(ProcessContext c) {
    if (isFirstTime) {
      myList = c.sideInput(sideData);
    }
    isFirstTime = false;
    boolean result = myList.containsKey(c.element().getKey());
    if (result == false) {
      c.output(c.element());
    }
  }
}).withSideInputs(sideData));
Since it runs with much less data, I suspect that the program is using up all the memory of the Java process. You can monitor that through JVisualVM or JConsole. There are many articles that cover the problem; I just stumbled upon this one with a quick Google search.
If memory runs out, your Java process is mostly busy with cleaning up memory and you see a huge performance decline. At some point, Java gives up and fails.
To solve the issue, it should be enough to increase the Java heap size. How you increase it depends on how and where you execute the pipeline. Look at Java's -Xmx parameter or the corresponding heap/memory option of your Beam runner.
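As a quick sanity check (a minimal sketch, independent of Beam), you can log from inside the process how much heap the JVM actually got:
// Minimal sketch: print the heap limits the worker JVM is actually running with.
public class HeapCheck {
  public static void main(String[] args) {
    Runtime rt = Runtime.getRuntime();
    System.out.printf("max heap:   %d MB%n", rt.maxMemory() / (1024 * 1024));
    System.out.printf("total heap: %d MB%n", rt.totalMemory() / (1024 * 1024));
    System.out.printf("free heap:  %d MB%n", rt.freeMemory() / (1024 * 1024));
  }
}
If the max heap is far below your machine's RAM, the -Xmx setting (or the runner's memory option) is the first thing to adjust.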
Background
I have a project where we are using akka-streams with Java.
In this project I have a stream of strings and a graph that does some operations on them.
Objective
In my graph, I want to broadcast that stream to 2 workers. One will replace all characters 'a' with 'A' and send data as it receives it in real time.
The other one will receive the data, and every 3 strings, it will concat those 3 strings and map them to numbers.
It would look like the following: Source -> bcast, with bcast -> worker 1 -> Sink 1 and bcast -> worker 2 -> Sink 2.
Obviously Sink 2 will not receive information as fast as Sink 1, but that is expected behavior. The interesting part here is worker 2.
Problem
Doing worker 1 is easy. The issue here is doing worker 2. I know Akka has buffers that can hold up to X messages, but then it looks like I am forced to choose one of the existing overflow strategies, which usually means choosing which message to drop or whether to keep the stream alive.
All I want is for worker 2, when its buffer reaches its maximum size, to perform the concat and map operations on all the messages it has and then send them along (resetting the buffer afterwards).
But even after reading the stream-rate documentation for Akka I couldn't find a way of doing it, at least using Java.
Research
I also checked a similar SO question, Selective request-throttling using akka-http streams, but it has been over a year and no one has responded.
Questions
Using the graph DSL, how would I create the path Source -> bcast -> worker2 -> Sink 2?
After your bcast, apply the groupedWithin operator with an effectively unlimited duration and the number of elements set to 3.
https://doc.akka.io/docs/akka/2.5/stream/operators/Source-or-Flow/groupedWithin.html
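For example, worker 2 could be expressed as a plain Flow (a sketch; the final mapping to a number is a placeholder, and if you only care about the count, grouped(3) does the batching without any time bound):
// Worker 2 as a Flow: batch every 3 strings, concat them, map each batch to a number.
// The "number" here is just the length of the concatenation, as a placeholder.
Flow<String, Integer, NotUsed> worker2 =
    Flow.of(String.class)
        .grouped(3)                            // emits List<String> of size 3
        .map(batch -> String.join("", batch))  // concat the 3 strings
        .map(String::length);                  // placeholder "map to a number"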
You can also do it yourself, adding a stage that stores elements in a List and emits the list every time it reaches 3 elements.
import akka.stream.Attributes;
import akka.stream.FlowShape;
import akka.stream.Inlet;
import akka.stream.Outlet;
import akka.stream.stage.AbstractInHandler;
import akka.stream.stage.GraphStage;
import akka.stream.stage.GraphStageLogic;
import com.google.common.collect.ImmutableList;
import java.util.ArrayList;
import java.util.List;

public class RecordGrouper<T> extends GraphStage<FlowShape<T, List<T>>> {

  private final Inlet<T> inlet = Inlet.create("in");
  private final Outlet<List<T>> outlet = Outlet.create("out");
  private final FlowShape<T, List<T>> shape = new FlowShape<>(inlet, outlet);

  @Override
  public GraphStageLogic createLogic(Attributes inheritedAttributes) {
    return new GraphStageLogic(shape) {
      List<T> batch = new ArrayList<>(3);

      {
        setHandler(
            inlet,
            new AbstractInHandler() {
              @Override
              public void onPush() {
                T record = grab(inlet);
                batch.add(record);
                if (batch.size() == 3) {
                  emit(outlet, ImmutableList.copyOf(batch));
                  batch.clear();
                }
                pull(inlet);
              }
            });
      }

      @Override
      public void preStart() {
        pull(inlet);
      }
    };
  }

  @Override
  public FlowShape<T, List<T>> shape() {
    return shape;
  }
}
As a side note, I don't think the buffer operator will work, as it only kicks in when there is backpressure. So if everything is quiet, elements will still be emitted one by one instead of 3 by 3. https://doc.akka.io/docs/akka/2.5/stream/operators/Source-or-Flow/buffer.html
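And for the graph DSL part of the question, the Source -> bcast -> worker2 -> Sink 2 path could be wired roughly like this (a sketch; source, sink1 and sink2 are assumed to be defined elsewhere as Source<String, ?>, Sink<String, ?> and Sink<List<String>, ?>):
// Sketch of the full graph; worker 2 reuses the RecordGrouper stage from above.
RunnableGraph<NotUsed> graph =
    RunnableGraph.fromGraph(
        GraphDSL.create(
            builder -> {
              UniformFanOutShape<String, String> bcast = builder.add(Broadcast.<String>create(2));
              FlowShape<String, String> worker1 =
                  builder.add(Flow.of(String.class).map(s -> s.replace('a', 'A')));
              FlowShape<String, List<String>> worker2 =
                  builder.add(Flow.fromGraph(new RecordGrouper<String>()));

              builder.from(builder.add(source).out()).viaFanOut(bcast);
              builder.from(bcast.out(0)).via(worker1).to(builder.add(sink1));
              builder.from(bcast.out(1)).via(worker2).to(builder.add(sink2));
              return ClosedShape.getInstance();
            }));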
I have (say) 5 documents and some processing to do on each of them. Processing here includes opening the document/file, reading the data, and doing some document manipulation (editing text, etc.). For the document manipulation I will probably use docx4j or Apache POI. But my use case is this: I want to process these 4-5 documents in parallel, utilizing the multiple cores available on my CPU. The processing of each document is independent of the others.
What would be the best way to achieve this parallel processing in Java? I have used ExecutorService and the Thread class before, but I don't know much about newer concepts like Streams or RxJava. Can this task be achieved with parallel streams as introduced in Java 8? Which would be better to use: executors, streams, the Thread class, etc.? If streams can be used, please provide a link to a tutorial on how to do that. Thanks for your help!
You can process in parallel with Java streams using the following pattern:
List<File> files = ...
files.parallelStream().forEach(f -> process(f));
or
File[] files = dir.listFiles();
Stream.of(files).parallel().forEach(f -> process(f));
Note: process cannot throw a checked exception in this example. I suggest you either log the exception or return a result object.
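For example, if process declares a checked exception, one option (just a sketch) is to handle it inside the lambda:
// Sketch: handle the checked exception inside the lambda so forEach compiles.
files.parallelStream().forEach(f -> {
    try {
        process(f);                                               // may throw a checked exception
    } catch (Exception e) {
        System.err.println("Failed to process " + f + ": " + e);  // or collect a result object instead
    }
});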
If you want to learn about ReactiveX, I would recommend RxJava's Observable.zip: http://reactivex.io/documentation/operators/zip.html
With it you can run multiple processes in parallel. Here is an example:
public class ObservableZip {

  private Scheduler scheduler;
  private Scheduler scheduler1;
  private Scheduler scheduler2;

  @Test
  public void testAsyncZip() {
    scheduler = Schedulers.newThread();  // Thread to open and read 1 file
    scheduler1 = Schedulers.newThread(); // Thread to open and read 1 file
    scheduler2 = Schedulers.newThread(); // Thread to open and read 1 file
    Observable.zip(obAsyncString(file1), obAsyncString1(file2), obAsyncString2(file3),
            (s, s2, s3) -> s.concat(s2).concat(s3))
        .subscribe(result -> showResult("All files in one:", result));
  }

  public void showResult(String transactionType, String result) {
    System.out.println(result + " " + transactionType);
  }

  public Observable<String> obAsyncString(File file) {
    return Observable.just(file)
        .observeOn(scheduler)
        .map(val -> {
          // Here you read your file and return its contents (placeholder)
          return "";
        });
  }

  public Observable<String> obAsyncString1(File file) {
    return Observable.just(file)
        .observeOn(scheduler1)
        .map(val -> {
          // Here you read your file 2 and return its contents (placeholder)
          return "";
        });
  }

  public Observable<String> obAsyncString2(File file) {
    return Observable.just(file)
        .observeOn(scheduler2)
        .map(val -> {
          // Here you read your file 3 and return its contents (placeholder)
          return "";
        });
  }
}
Like I said, this is just in case you want to learn about ReactiveX; if not, adding this framework to your stack to solve the issue would be a little overkill, and I would much rather use the previous parallel stream solution.
I've started drawing plugs in Java, like connectors using bezier curves, but just the visual stuff.
Then I began wondering about making some kind of modular thing, with inputs and outputs. However, I'm very confused about how to implement it. Take, for example, a modular synthesizer, or the Pure Data / Max/MSP concepts, in which you have modules, and every module has attributes, inputs and outputs.
I wonder what keywords I should use to search for something to read about this. I need some basic examples or abstract ideas concerning this kind of interface. Is there a design pattern that fits this idea?
Since you're asking for keywords: real-time design patterns. Heavily OOP designs are often a performance bottleneck in real-time applications, since all the objects (and, I guess, polymorphism to some extent) add overhead.
Why a real-time application? The graph you provided looks very sophisticated: you process the incoming data multiple times in parallel, split it up, merge it, and so on.
Every node in the graph adds different effects and performs different computations, and some computations may take longer than others. This leads to the conclusion that, in order to keep the data (sound) uniform, you have to keep it in sync. This is no trivial task.
I guess some other keywords would be: sound processing, filter. Or you could ask companies that work in that area for literature.
Leaving the time sensitivity aside, I constructed a little OOP example; maybe an approach like that is sufficient for less complex scenarios:
import java.io.Closeable;
import java.io.IOException;

public class ConnectionCable implements Runnable, Closeable {

  private final InputLine in;
  private final OutputLine out;

  public ConnectionCable(InputLine in, OutputLine out) {
    this.in = in;
    this.out = out;
    // cable connects open lines and closes them upon connection
    if (in.isOpen() && out.isOpen()) {
      in.close();
      out.close();
    }
  }

  @Override
  public void run() {
    byte[] data = new byte[1024];
    // cable connects output line to input line
    while (out.read(data) > 0)
      in.write(data);
  }

  @Override
  public void close() throws IOException {
    in.open();
    out.open();
  }
}

interface Line {
  void open();
  void close();
  boolean isOpen();
  boolean isClosed();
}

interface InputLine extends Line {
  int write(byte[] data);
}

interface OutputLine extends Line {
  int read(byte[] data);
}
I tried to distribute a calculation using Hadoop.
I am using Sequence input and output files, and custom Writables.
The input is a list of triangles with a maximum size of 2 MB, but it can also be smaller, around 50 kB.
The intermediate values and the output are a map(int, double) in the custom Writable.
Is this the bottleneck?
The problem is that the calculation is much slower than the version without Hadoop. Also, increasing the number of nodes from 2 to 10 doesn't speed up the process.
One possibility is that I don't get enough mappers because of the small input size.
I made tests changing mapreduce.input.fileinputformat.split.maxsize, but it just got worse, not better.
I am using Hadoop 2.2.0 locally and on Amazon Elastic MapReduce.
Did I overlook something? Or is this just the kind of task that should be done without Hadoop? (It's my first time using MapReduce.)
Would you like to see code parts?
Thank you.
public void map(IntWritable triangleIndex, TriangleWritable triangle, Context context)
    throws IOException, InterruptedException {
  StationWritable[] stations = kernel.newton(triangle.getPoints());
  if (stations != null) {
    for (StationWritable station : stations) {
      context.write(new IntWritable(station.getId()), station);
    }
  }
}
class TriangleWritable implements Writable {

  private final float[] points = new float[9];

  @Override
  public void write(DataOutput d) throws IOException {
    for (int i = 0; i < 9; i++) {
      d.writeFloat(points[i]);
    }
  }

  @Override
  public void readFields(DataInput di) throws IOException {
    for (int i = 0; i < 9; i++) {
      points[i] = di.readFloat();
    }
  }
}
public class StationWritable implements Writable {

  private int id;
  private final TIntDoubleHashMap values = new TIntDoubleHashMap();

  StationWritable(int iz) {
    this.id = iz;
  }

  @Override
  public void write(DataOutput d) throws IOException {
    d.writeInt(id);
    d.writeInt(values.size());
    TIntDoubleIterator iterator = values.iterator();
    while (iterator.hasNext()) {
      iterator.advance();
      d.writeInt(iterator.key());
      d.writeDouble(iterator.value());
    }
  }

  @Override
  public void readFields(DataInput di) throws IOException {
    id = di.readInt();
    int count = di.readInt();
    for (int i = 0; i < count; i++) {
      values.put(di.readInt(), di.readDouble());
    }
  }
}
You won't get any benefit from Hadoop with only 2 MB of data. Hadoop is all about big data. Distributing the 2 MB to your 10 nodes costs more time than just doing the job on a single node. The real benefit starts with a high number of nodes and huge amounts of data.
If the processing is really that complex, you should be able to realize a benefit from using Hadoop.
The common issue with small files is that Hadoop will run a single Java process per file, which creates overhead from having to start many processes and slows things down. In your case this does not sound like it applies. More likely you have the opposite problem: only one mapper is trying to process your input, and it doesn't matter how big your cluster is at that point. Using the input split sounds like the right approach, but because your use case is specialized and deviates significantly from the norm, you may need to tweak a number of components to get the best performance.
So you should be able to get the benefits you are seeking from Hadoop MapReduce, but it will probably take significant tuning and custom input handling.
That said, seldom (never?) will MapReduce be faster than a purpose-built solution. It is a generic tool that is useful because it can distribute and solve many diverse problems without the need to write a purpose-built solution for each.
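Coming back to the input-split tuning mentioned above: the split size can also be set programmatically on the job; a hedged sketch (the 4 MB value and job name are purely illustrative):
// Sketch: force smaller splits so a single small sequence file yields several mappers.
// Equivalent to tuning mapreduce.input.fileinputformat.split.maxsize.
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "triangle-calculation");
job.setInputFormatClass(SequenceFileInputFormat.class);
FileInputFormat.setMinInputSplitSize(job, 1L);
FileInputFormat.setMaxInputSplitSize(job, 4L * 1024 * 1024);  // ~4 MB per split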
So in the end I figured out a way to avoid storing the intermediate values in Writables, keeping them only in memory. This way it is faster.
But still, a non-Hadoop solution is best for this use case.