Splitting, aggregating, then stream-writing to one big file using Apache Camel - java

I have a large database from which I load huge numbers of records. I process them in batch mode using the splitter and aggregator patterns.
The step where I'm stuck is streaming each batch into one JSON file where I want them all to be stored. Here are the steps:
Fetch records from the DB
Process them as batches of N
Write each processed batch to the same big JSON file (missing step)
I tested this with the Append option from the File2 component, but it writes multiple arrays inside an array. I could flatten that JSON, but it brings me to one question.
How do I stop the route from running, knowing that I have two constraints:
After running the batch, the size at the start is not necessarily the same as at the end.
I tried to work with completionFromBatchConsumer, but it does not work with Quartz consumers.
I have this route :
from(endpointsURL)
    .log(LoggingLevel.INFO, LOGGER, "Start fetching records")
    .bean(DatabaseFetch, "fetch")
    .split().method(InspectionSplittingStrategy.class, "splitItems")
    .aggregate(constant(true), batchAggregationStrategy())
        .completionPredicate(batchSizePredicate())
        .completionTimeout(BATCH_TIME_OUT)
    .log(LoggingLevel.INFO, LOGGER, "Start processing items")
    .bean(ItemProcessor, "process")
    .marshal()
    .json(JsonLibrary.Jackson, true)
    .setHeader(Exchange.FILE_NAME, constant("extract.json"))
    .to("file:/json?doneFileName=${file:name}.done")
    .log(LoggingLevel.INFO, LOGGER, "Processing done");
The problem here is, as I suspected, that my extract.json gets overwritten with every batch processed. I want to append each batch after the previous one.
I have no clue how to design this or which pattern to use to make it possible. The Stream and File components have good features, but in what fashion can I use them?

You need to tell Camel to append to the file if it exists: add fileExists=Append as an option to your file endpoint.
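For example, the tail of the route above becomes (only the fileExists option is new, everything else is from the question):

    .marshal()
    .json(JsonLibrary.Jackson, true)
    .setHeader(Exchange.FILE_NAME, constant("extract.json"))
    // append each marshalled batch to extract.json instead of overwriting it
    .to("file:/json?fileExists=Append&doneFileName=${file:name}.done")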

I changed the route design, using only a splitting strategy:
from(endpointsURLs.get(START_AGENT))
    .bean(databaseFetch, "fetch")
    .split().method(SplittingStrategy.class, "splitItems")
    .parallelProcessing()
        .bean(databaseBatchExtractor, "launch")
    .end()
    .to("seda:generateExportFiles");

from("seda:generateExportFiles")
    .bean(databaseFetch, "fetchPublications")
    .multicast()
    .parallelProcessing()
    .to("direct:generateJson", "direct:generateCsv");

from("direct:generateJson")
    .log("generate JSON file")
    .marshal()
    .json(JsonLibrary.Jackson, true)
    .setHeader(Exchange.FILE_NAME, constant("extract.json"))
    .to("file:/json?doneFileName=${file:name}.done")
    .to("direct:notify");

from("direct:generateCsv")
    .log("generate CSV file")
    .bean(databaseFetch, "exportCsv")
    .to("direct:notify");

from("direct:notify")
    .log("generation done");
The important class, SplittingStrategy:

public class SplittingStrategy {

    private static final int BATCH_SIZE = 500;

    private AtomicInteger counter = new AtomicInteger();

    public Collection<List<Pair<Integer, Set<Integer>>>> splitItems(Map<Integer, Set<Integer>> itemsByID) {
        List<Pair<Integer, Set<Integer>>> rawList = itemsByID.entrySet().stream()
                .map(inspUA -> new ImmutablePair<>(inspUA.getKey(), inspUA.getValue()))
                .collect(Collectors.toList());
        return rawList.parallelStream()
                .collect(Collectors.groupingBy(pair -> counter.getAndIncrement() / BATCH_SIZE))
                .values();
    }
}
With this strategy, instead of using aggregate to re-assemble items, I embedded the aggregation as part of the splitting:
Transform my HashMap into a Collection<List<Pair<Integer, Set<Integer>>>> to be returned by the split method (cf. Splitter with POJO).
Split the items into batches of 500 using a groupingBy over a stream of the initial list (see the short usage sketch below).
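A minimal usage sketch (variable names are hypothetical; databaseFetch.fetch() stands for whatever produces the map in the route above):

Map<Integer, Set<Integer>> itemsByID = databaseFetch.fetch();
Collection<List<Pair<Integer, Set<Integer>>>> batches =
        new SplittingStrategy().splitItems(itemsByID);
// each element returned by splitItems becomes one sub-exchange of the splitter,
// i.e. one batch of at most 500 (id, item-ids) pairs
batches.forEach(batch -> LOGGER.info("batch of {} items", batch.size()));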
Comments and opinions about this approach are welcome!

Related

How to read output file for collecting stats (post) processing

Summary
I need to build a set of statistics during a Camel server in-modify-out process, and emit those statistics as one object (a single json log line).
Those statistics need to include:
input file metrics (size/chars/bytes and other file-section-specific measures)
processing time statistics (start/end/duration of processing time, start/end/duration of metrics gathering time)
output file metrics (same as input file metrics, and will be different numbers, output file being changed)
The output file metrics are the problem, as I can't access the file until it's written to disk, and
it's not written to disk until 'process'ing finishes.
Background
A log4j implementation is being used for service logging, but after some tinkering we realised it really doesn't suit the requirement here, as it would output multi-line JSON and embed the JSON into a single top-level field. We need varying top-level fields, depending on the file processed.
The server is expected to deal with multiple file operations asynchronously, and the files vary in size (from tiny to fairly immense, which is one reason we need to iterate on stats and measures before we start to tune or review).
Current State
Input file and processing-time stats are working OK, and I'm using the following technique to get them:
Inside the 'process' override method of "MyProcessor" I create a new instance of my JsonLogWriter class (shortened pseudo-code with ellipses):
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
...

@Component
public class MyProcessor implements Processor {
    ...
    @Override
    public void process(Exchange exchange) throws Exception {
        ...
        JsonLogWriter jlw = new JsonLogWriter();
        jlw.logfilePath = jsonLogFilePath;
        jlw.inputFilePath = inFilePath;
        jlw.outputfilePath = outFilePath;
        ...
        jlw.metricsInputFile(); // gathers metrics using inputFilePath - OK
        ...
        // input file is processed / changed and returned as an InputStream:
        InputStream result = myEngine.readAndUpdate(inFilePath);
        // ... get timings
        jlw.write();
    }
From this you can see that JsonLogWriter has:
properties for the file paths (input file, output file, log output),
a set of methods to populate the data,
and a method to emit the data to a file (once ready).
Once I have populated all the JSON objects in the class, I call the write() method; the class pulls all the JSON objects together and
the stats all arrive in a log file (in a single line of JSON) - OK.
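For context, a minimal sketch of what such a class could look like (this is not the asker's actual code; it assumes Jackson and collects all metrics into one flat map that write() emits as a single JSON line):

import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.LinkedHashMap;
import java.util.Map;

public class JsonLogWriter {

    public String logfilePath;
    public String inputFilePath;
    public String outputfilePath;

    private final Map<String, Object> stats = new LinkedHashMap<>();

    public void metricsInputFile() throws Exception {
        // hypothetical input-file metric: just the size in bytes
        stats.put("inputSizeBytes", Files.size(Paths.get(inputFilePath)));
    }

    public void write() throws Exception {
        // serialize everything collected so far as one flat JSON object on a single line
        String line = new ObjectMapper().writeValueAsString(stats) + System.lineSeparator();
        Files.write(Paths.get(logfilePath), line.getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}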
Error - no output file (yet)
If I use the metricsOutputFile method, however:

InputStream result = myEngine.readAndUpdate(inFilePath);
// ... get timings
jlw.metricsOutputFile(); // using outputfilePath
jlw.write();
}

... the JsonLogWriter fails because the file doesn't exist yet:
java.nio.file.NoSuchFileException: aroute\output\a_long_guid_filename
When debugging, I can't see any part of the exchange or result objects that I might pipe into a file-read / statistics-gathering step.
Will this require more Camel routes to solve? What might be an alternative approach where I can get all the stats from the input and output files and keep them in one object / one line of JSON?
(Very happy to receive constructive criticism - as in "why is your Java so heavy-handed" - and yes, it may well be; I am prototyping solutions at this stage, so this isn't production code, nor do I profess deep understanding of Java internals. I can usually get stuff working, though.)
Use one route and two processors: one for writing the file and the next for reading it, so one finishes writing before the other starts reading.
Or you can use two routes: one that writes the file (to:file) and another that listens for and reads the file (from:file).
You can check for common EIP patterns that will solve most of this questions here:
https://www.enterpriseintegrationpatterns.com/patterns/messaging/
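A minimal sketch of the two-route option (names are placeholders, not the asker's code): route one runs MyProcessor and writes the output file; route two is a file consumer that only fires once the file is on disk, so the output-file metrics can be gathered there and the single JSON log line emitted.

from("direct:processFile")
    .process(new MyProcessor())            // transforms the input and produces the output body
    .to("file:aroute/output");             // file is complete once this step returns

from("file:aroute/output?noop=true")       // triggers only after the file exists on disk
    .process(exchange -> {
        java.io.File outFile = exchange.getIn().getBody(java.io.File.class);
        // gather output-file metrics here (size, chars, ...) and emit the single JSON log line
    });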

What is the ideal way to design this Apache Beam transform that outputs multiple files including binary outputs?

I am trying to process PDF files in a Beam pipeline coming from an input bucket, and to output the results, the input, and an intermediate file, all to a separate output bucket.
The filenames of all three outputs are derived in the final step, and there is a 1:1 mapping of input files to output filenames, so I don't want shard templates in the output filenames (my UniquePrefixFileNaming class is doing the same thing as TextIO.withoutSharding()).
Since the filenames are only known in the last step, I don't think I can set up tagged outputs and output files in each of the previous processing steps - I have to carry data all the way through the pipeline.
What is the best way of achieving this? Below is my attempt at the problem - the text outputs work okay, but I don't have a solution for the PDF output (no binary output sink available, no binary data carried through). Is FileIO.writeDynamic the best approach?
Pipeline p = Pipeline.create();

PCollection<MyProcessorTransformResult> transformCollection =
    p.apply(FileIO.match().filepattern("Z:\\Inputs\\en_us\\**.pdf"))
     .apply(FileIO.readMatches())
     .apply(TikaIO.parseFiles())
     .apply(ParDo.of(new MyProcessorTransform()));

// Write output PDF
transformCollection.apply(FileIO.<String, MyProcessorTransformResult>writeDynamic()
    .withTempDirectory("Z:\\Temp\\vbeam")
    .by(input -> input.data.getResourceKey())
    .via(
        Contextful.fn((SerializableFunction<MyProcessorTransformResult, byte[]>) input -> new byte[] {})
    )
    .withNaming(d -> new UniquePrefixFileNaming(d, ".pdf"))
    .withNumShards(1)
    .withDestinationCoder(ByteArrayCoder.of())
    .to("Z:\\Outputs"));

// Write output TXT
transformCollection.apply(FileIO.<String, MyProcessorTransformResult>writeDynamic()
    .withTempDirectory("Z:\\Temp\\vbeam")
    .by(input -> input.data.getResourceKey())
    .via(
        Contextful.fn((SerializableFunction<MyProcessorTransformResult, String>) input -> input.originalContent),
        TextIO.sink()
    )
    .withNaming(d -> new UniquePrefixFileNaming(d, ".pdf.txt"))
    .withNumShards(1)
    .withDestinationCoder(StringUtf8Coder.of())
    .to("Z:\\Outputs"));

// Write output JSON
transformCollection.apply(FileIO.<String, MyProcessorTransformResult>writeDynamic()
    .withTempDirectory("Z:\\Temp\\vbeam")
    .by(input -> input.data.getResourceKey())
    .via(
        Contextful.fn((SerializableFunction<MyProcessorTransformResult, String>) input -> SerializationHelpers.toJSON(input.data)),
        TextIO.sink()
    )
    .withNaming(d -> new UniquePrefixFileNaming(d, ".pdf.json"))
    .withNumShards(1)
    .withDestinationCoder(StringUtf8Coder.of())
    .to("Z:\\Outputs"));

p.run();
I ended up writing my own file sink that saves all three outputs. FileIO is very much tailored towards streaming, with windows and panes to split the data up - my sink step kept running out of memory because it would try to aggregate everything before doing any actual writes, since batch jobs run in a single window in Beam. I had no such issues with my custom DoFn.
My recommendation for anyone looking into this is to do the same - you could try to hook into Beam's FileSystems classes, or look at jclouds for filesystem-agnostic storage.
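For illustration, a minimal sketch of such a DoFn using Beam's FileSystems API (this is not the asker's actual sink; MyProcessorTransformResult and getResourceKey() come from the question, while the pdfBytes field is a hypothetical carrier for the binary content):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;

public class WritePdfFn extends DoFn<MyProcessorTransformResult, Void> {

    private final String outputDir;

    public WritePdfFn(String outputDir) {
        this.outputDir = outputDir;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        MyProcessorTransformResult result = c.element();
        // resolve <outputDir>/<resourceKey>.pdf against the output directory
        ResourceId dir = FileSystems.matchNewResource(outputDir, true /* isDirectory */);
        ResourceId file = dir.resolve(result.data.getResourceKey() + ".pdf",
                ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
        // write the bytes straight through the filesystem abstraction,
        // bypassing FileIO's windowed aggregation
        try (WritableByteChannel channel = FileSystems.create(file, "application/pdf")) {
            channel.write(ByteBuffer.wrap(result.pdfBytes)); // pdfBytes: hypothetical byte[] field
        }
    }
}

// usage: transformCollection.apply(ParDo.of(new WritePdfFn("Z:\\Outputs")));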

How to log the content of csv in Apache Camel?

I have the following code
DataFormat bindy = new BindyCsvDataFormat(Employee.class);

from("file:src/main/resources/csv2?noop=true")
    .routeId("route3")
    .unmarshal(bindy)
    .to("mock:result")
    .log("${body[0].name}");
I am trying to log every line of the CSV file; currently I am only able to hardcode which record to print.
Do I have to use a loop even though I don't know the number of lines of the CSV? Or do I have to use a processor? What's the easiest way to achieve what I want?
The unmarshalling step produces an exchange whose body is a list with one entry per CSV row. For that reason you can simply use the Camel Splitter to slice the original exchange into 1..N sub-exchanges (one per line/item of the list) and then log each of them:
from("file:src/main/resources/csv2?noop=true")
.unmarshal(bindy)
.split().body()
.log("${name}");
If you do not want to alter the original message, you can use the wiretap pattern in order to log a copy of the exchange:
from("file:src/main/resources/csv2?noop=true")
.unmarshal(bindy)
.wireTap("direct:logBody")
.to("mock:result");
from("direct:logBody")
.split().body()
.log("Row# ${exchangeProperty.CamelSplitIndex} : ${name}");

Flink job hangs on submission when loading big file

I wrote a Flink streaming job in Java that loads a csv file that contains subscriber data (4 columns) and then reads data from a socket stream while matching against the subscriber data.
Initially I was using a small csv file (8 MB) and everything was working fine:
# flink run analytics-flink.jar 19001 /root/minisubs.csv /root/output.csv
loaded 200000 subscribers from csv file
11/02/2015 16:36:59 Job execution switched to status RUNNING.
11/02/2015 16:36:59 Socket Stream -> Flat Map -> Filter -> Map -> Stream Sink(1/1) switched to SCHEDULED
11/02/2015 16:36:59 Socket Stream -> Flat Map -> Filter -> Map -> Stream Sink(1/1) switched to DEPLOYING
11/02/2015 16:36:59 Socket Stream -> Flat Map -> Filter -> Map -> Stream Sink(1/1) switched to RUNNING
I switched the csv file to a bigger one (~45 MB) and now all I see is this:
# flink run analytics-flink.jar 19001 /root/subs.csv /root/output.csv
loaded 1173547 subscribers from csv file
Note that the number of subscribers above is the number of lines in the file. I tried to look for any timeouts in the Flink configuration but I couldn't find any.
Any help is greatly appreciated!
Edit: the CSV is loaded using this method, utilizing the commons-csv 1.2 library:
private static HashMap<String, String> loadSubscriberGroups(
        String referenceDataFile) throws IOException {
    HashMap<String, String> subscriberGroups = new HashMap<String, String>();

    File csvData = new File(referenceDataFile);
    CSVParser parser = CSVParser.parse(csvData, Charset.defaultCharset(), CSVFormat.EXCEL);
    for (CSVRecord csvRecord : parser) {
        String imsi = csvRecord.get(0);
        String groupStr = csvRecord.get(3);
        if (groupStr == null || groupStr.isEmpty()) {
            continue;
        }
        subscriberGroups.put(imsi, groupStr);
    }
    return subscriberGroups;
}
and here's a sample of the file (I know there's a comma at the end, the last column is empty for now):
450000000000001,450000000001,7752,Tier-2,
450000000000002,450000000002,1112,Tier-1,
450000000000003,450000000003,6058,Tier-2,
From Robert Metzger (Apache Flink developer):
I can explain why your first approach didn't work:
You were trying to send the CSV files from the Flink client to the
cluster using our RPC system (Akka). When you submit a job to Flink,
we serialize all the objects the user created (mappers, sources, ...)
and send it to the cluster. There is a method
StreamExecutionEnvironment.fromElements(..) which allows users to
serialize a few objects along with the job submission. But the amount
of data you can transfer like this is limited by the Akka frame size.
In our case I think the default is 10 megabytes. After that, Akka will
probably just drop or reject the deployment message.
The solution would be to use a rich operator instead of a regular operator (e.g. RichMapFunction instead of MapFunction), overriding the open() method and loading the CSV file inside that method.
Thanks Robert!
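For illustration, a minimal sketch of that approach (class and field names are made up; it assumes loadSubscriberGroups(...) from above is made reachable from the operator):

import java.util.HashMap;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

public class SubscriberMatcher extends RichMapFunction<String, String> {

    private final String referenceDataFile;
    private transient HashMap<String, String> subscriberGroups;

    public SubscriberMatcher(String referenceDataFile) {
        // only the file path is serialized with the job graph, not the CSV content
        this.referenceDataFile = referenceDataFile;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        // runs on the task manager after deployment, so the 45 MB CSV never travels through Akka
        subscriberGroups = loadSubscriberGroups(referenceDataFile);
    }

    @Override
    public String map(String event) {
        String imsi = event.split(",")[0];   // hypothetical event format: IMSI first
        String group = subscriberGroups.get(imsi);
        return group != null ? event + "," + group : event;
    }
}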

Camel: use a splitter without an aggregator

I'm new to Camel and I'd like to use it to read an XML file on an FTP server and to asynchronously process every node element of the XML.
So I'll use a splitter to process every node (in streaming mode, because the XML file is big):

from("ftp://user@host:port/...")
    .split().tokenizeXML("node").streaming()
    .to("seda:processNode")
    .end();
Then the route to the nodeProcessor:
from("seda:processNode")
.bean(lookup(MyNodeProcessor.class))
.end();
I was wondering if it's OK to use a splitter without an aggregator? In my case, I don't need to aggregate the outcome of all the processed nodes.
Is it a problem in Camel to have many split threads going into a "dead end" instead of being aggregated?
The examples provided by Camel show a splitter without an aggregator, but they still provide an aggregationStrategy with the splitter. Is it mandatory?
No, this is perfectly fine; you can use the splitter without an aggregation strategy, which is the normal Splitter EIP: http://camel.apache.org/splitter
If you use an aggregation strategy then it's more like the Composed Message Processor EIP: http://camel.apache.org/composed-message-processor.html which can be done with the splitter alone in Camel (see the sketch below).
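For reference, a minimal sketch of that second case, i.e. a splitter with an inline aggregation strategy (the merge logic is a made-up example, not something from the question):

from("ftp://user@host:port/...")                      // placeholder FTP URI from the question
    .split().tokenizeXML("node").streaming()
        // re-combine the processed nodes into a single outgoing body
        .aggregationStrategy((oldExchange, newExchange) -> {
            if (oldExchange == null) {
                return newExchange;                   // first processed node
            }
            String merged = oldExchange.getIn().getBody(String.class) + "\n"
                    + newExchange.getIn().getBody(String.class);
            oldExchange.getIn().setBody(merged);
            return oldExchange;
        })
        .bean(MyNodeProcessor.class)
    .end()
    .log("combined result of all nodes: ${body}");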
