Flink job hangs on submission when loading big file - java

I wrote a Flink streaming job in Java that loads a CSV file containing subscriber data (4 columns) and then reads data from a socket stream, matching it against the subscriber data.
Initially I was using a small CSV file (8 MB) and everything was working fine:
# flink run analytics-flink.jar 19001 /root/minisubs.csv /root/output.csv
loaded 200000 subscribers from csv file
11/02/2015 16:36:59 Job execution switched to status RUNNING.
11/02/2015 16:36:59 Socket Stream -> Flat Map -> Filter -> Map -> Stream Sink(1/1) switched to SCHEDULED
11/02/2015 16:36:59 Socket Stream -> Flat Map -> Filter -> Map -> Stream Sink(1/1) switched to DEPLOYING
11/02/2015 16:36:59 Socket Stream -> Flat Map -> Filter -> Map -> Stream Sink(1/1) switched to RUNNING
I switched the CSV file to a bigger one (~45 MB) and now all I see is this:
# flink run analytics-flink.jar 19001 /root/subs.csv /root/output.csv
loaded 1173547 subscribers from csv file
Note that the number of subscribers above is the number of lines in the file. I tried to look for any timeouts in the Flink configuration but I couldn't find any.
Any help is greatly appreciated!
Edit: The CSV is loaded with this method, which uses the commons-csv 1.2 library:
private static HashMap<String, String> loadSubscriberGroups(
        String referenceDataFile) throws IOException {
    HashMap<String, String> subscriberGroups = new HashMap<String, String>();
    File csvData = new File(referenceDataFile);
    CSVParser parser = CSVParser.parse(csvData, Charset.defaultCharset(), CSVFormat.EXCEL);
    for (CSVRecord csvRecord : parser) {
        String imsi = csvRecord.get(0);
        String groupStr = csvRecord.get(3);
        if (groupStr == null || groupStr.isEmpty()) {
            continue;
        }
        subscriberGroups.put(imsi, groupStr);
    }
    return subscriberGroups;
}
and here's a sample of the file (I know there's a comma at the end, the last column is empty for now):
450000000000001,450000000001,7752,Tier-2,
450000000000002,450000000002,1112,Tier-1,
450000000000003,450000000003,6058,Tier-2,

From Robert Metzger (Apache Flink developer):
I can explain why your first approach didn't work:
You were trying to send the CSV files from the Flink client to the
cluster using our RPC system (Akka). When you submit a job to Flink,
we serialize all the objects the user created (mappers, sources, ...)
and send it to the cluster. There is a method
StreamExecutionEnvironment.fromElements(..) which allows users to
serialize a few objects along with the job submission. But the amount
of data you can transfer like this is limited by the Akka frame size.
In our case I think the default is 10 megabytes. After that, Akka will
probably just drop or reject the deployment message.
The solution would be to use a rich operator instead of a regular operator (e.g. RichMapFunction instead of MapFunction), overriding the open() method and loading the CSV file inside that method.
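As a rough sketch of that suggestion (it assumes the CSV path is readable from every task manager; the class name and the enrichment in map() are illustrative, not the asker's actual job):

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

import java.io.File;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

// Loads the subscriber CSV on the task manager inside open(), so the file never has
// to travel through Akka as part of the job submission.
public class SubscriberEnricher extends RichMapFunction<String, String> {

    private final String referenceDataFile; // path that must be readable on every worker
    private transient Map<String, String> subscriberGroups;

    public SubscriberEnricher(String referenceDataFile) {
        this.referenceDataFile = referenceDataFile;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        subscriberGroups = new HashMap<>();
        try (CSVParser parser = CSVParser.parse(
                new File(referenceDataFile), Charset.defaultCharset(), CSVFormat.EXCEL)) {
            for (CSVRecord record : parser) {
                String group = record.get(3);
                if (group != null && !group.isEmpty()) {
                    subscriberGroups.put(record.get(0), group);
                }
            }
        }
    }

    @Override
    public String map(String imsi) {
        // Illustrative enrichment only: tag each incoming IMSI with its group.
        return imsi + "," + subscriberGroups.getOrDefault(imsi, "UNKNOWN");
    }
}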
Thanks Robert!

Related

readFile cause "Could not fulfill resource requirements of job"

I have an S3 bucket with terabytes of data, split into small files of less than 5 MB each.
I am trying to use Flink to process them.
I create the source with the following code:
var inputFormat = new TextInputFormat(null);
inputFormat.setNestedFileEnumeration(true);
return streamExecutionEnvironment.readFile(inputFormat, "s3://name/");
But the used memory grows up to the limit, the job is killed, and it is not scheduled again, with this error:
Could not fulfill resource requirements of job
No data reaches the sink.
On a small set of data it works fine.
How can I read the files without using too much memory?
Thanks.
The same behaviour occurs with:
env.fromSource(
    FileSource.forRecordStreamFormat(
            new TextLineFormat(),
            new Path("s3://name/"))
        .monitorContinuously(Duration.ofMillis(10000L))
        .build(),
    WatermarkStrategy.noWatermarks(),
    "MySourceName"
)
The FileSource is the preferred way to ingest data from files. It should be able to handle the sort of scale you are talking about. See the FileSource docs and javadocs.
setQueueLimit on the Kinesis producer solved my problem: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/kinesis/#backpressure
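For reference, that limit is set on the producer instance itself; a minimal sketch (the region and stream name are placeholders):

Properties producerConfig = new Properties();
producerConfig.put(AWSConfigConstants.AWS_REGION, "us-east-1"); // placeholder region

FlinkKinesisProducer<String> producer =
        new FlinkKinesisProducer<>(new SimpleStringSchema(), producerConfig);
producer.setDefaultStream("output-stream"); // placeholder stream name
// Cap the number of records buffered in the producer's internal queue so the sink
// exerts backpressure instead of growing memory without bound.
producer.setQueueLimit(100);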

Spring Cloud Function - Form Data/Multipart File?

I am creating a Spring Cloud Function that I want to give two inputs, an id and a multipart file (a CSV file), but I am having trouble.
If I send a POST with a multipart file, the function won't recognise it and gives an error like Failed to determine input for function call with parameters:
The Postman request is a multipart form-data POST, and the function looks like this:
@Bean
public Function<MultipartFile, String> uploadWatchlist() {
    return body -> {
        try {
            return service.convert(body);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    };
}
I have tried using something more akin to Spring MVC, like a request entity object, but no luck.
The backup I have (other than Python, haha) is a binary data POST, so the input will just be a string containing the contents of the file. That does work, but it requires me to append the id to each row of the CSV, which is a bit messy.
There are other solutions, but I'm trying to get this working since Java lambdas are what we want to use as a first choice.
The infrastructure is meant to fix up a manual file upload/verification process that is tedious at the moment and looks like: Postman -> load balancer -> lambda -> ECS.
The Postman/load balancer part will be replaced in future. Ideally the lambda would be sorted in Java, taking in a file and an id.
Thanks for any help :)

Splitting, Aggregating then streaming writing on one big file using Apache Camel

I have a large database from which I load huge numbers of records. I process them in batch mode using the Splitter and Aggregator patterns.
The step where I'm stuck is streaming each batch into one JSON file where I want them all to be stored. Here are the steps:
Fetch records from the DB
Process them in batches of N
Write each processed batch to the same big JSON file (the missing step)
I tested this with the Append option of the File2 component, but it writes multiple arrays inside an array. I could flatten the JSON, but that brings me to another question:
How do I stop the route from running, given that I have two constraints:
After running the batch, the size at the start is not necessarily the same as at the end.
I tried to work with completionFromConsumer, but it does not work with Quartz consumers.
I have this route:
from(endpointsURL)
    .log(LoggingLevel.INFO, LOGGER, "Start fetching records")
    .bean(DatabaseFetch, "fetch")
    .split().method(InspectionSplittingStrategy.class, "splitItems")
    .aggregate(constant(true), batchAggregationStrategy())
        .completionPredicate(batchSizePredicate())
        .completionTimeout(BATCH_TIME_OUT)
    .log(LoggingLevel.INFO, LOGGER, "Start processing items")
    .bean(ItemProcessor, "process")
    .marshal()
    .json(JsonLibrary.Jackson, true)
    .setHeader(Exchange.FILE_NAME, constant("extract.json"))
    .to("file:/json?doneFileName=${file:name}.done")
    .log(LoggingLevel.INFO, LOGGER, "Processing done");
The problem here, as I suspected, is that my extract.json gets overwritten by every processed batch. I want to append each batch after the previous one.
I have no clue which design or pattern to use to make this possible. The Stream and File components have good features, but in which fashion can I use them?
You need to tell Camel to append to the file if it exists: add fileExists=Append as an option to your file endpoint.
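Applied to the route in the question, that would look roughly like this (only the endpoint option changes):

.setHeader(Exchange.FILE_NAME, constant("extract.json"))
// fileExists=Append makes the file component append to extract.json instead of overwriting it
.to("file:/json?fileExists=Append&doneFileName=${file:name}.done")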
I changed the route to use only a splitting strategy:
from(endpointsURLs.get(START_AGENT))
    .bean(databaseFetch, "fetch")
    .split().method(SplittingStrategy.class, "splitItems")
        .parallelProcessing()
        .bean(databaseBatchExtractor, "launch")
    .end()
    .to("seda:generateExportFiles");

from("seda:generateExportFiles")
    .bean(databaseFetch, "fetchPublications")
    .multicast()
    .parallelProcessing()
    .to("direct:generateJson", "direct:generateCsv");

from("direct:generateJson")
    .log("generate JSON file")
    .marshal()
    .json(JsonLibrary.Jackson, true)
    .setHeader(Exchange.FILE_NAME, constant("extract.json"))
    .to("file:/json?doneFileName=${file:name}.done")
    .to("direct:notify");

from("direct:generateCsv")
    .log("generate CSV file")
    .bean(databaseFetch, "exportCsv")
    .to("direct:notify");

from("direct:notify")
    .log("generation done");
The important class is SplittingStrategy:
public class SplittingStrategy {

    private static final int BATCH_SIZE = 500;
    private AtomicInteger counter = new AtomicInteger();

    public Collection<List<Pair<Integer, Set<Integer>>>> splitItems(Map<Integer, Set<Integer>> itemsByID) {
        List<Pair<Integer, Set<Integer>>> rawList = itemsByID.entrySet().stream()
                .map((inspUA) -> new ImmutablePair<>(inspUA.getKey(), inspUA.getValue()))
                .collect(Collectors.toList());

        return rawList.parallelStream()
                .collect(Collectors.groupingBy(pair -> counter.getAndIncrement() / BATCH_SIZE))
                .values();
    }
}
With this strategy, instead of using aggregate to re-assemble the items, I embedded the aggregation as part of the splitting:
Transform my HashMap into a Collection<List<Pair<Integer, Set<Integer>>>> to be returned by the split method (cf. Splitter with POJO).
Split the items into batches of 500 using a groupingBy over the initial list stream.
Give a comment or your opinion about it!

How to read output file for collecting stats (post) processing

Summary
I need to build a set of statistics during a Camel server in-modify-out process, and emit those statistics as one object (a single json log line).
Those statistics need to include:
input file metrics (size/chars/bytes and other, file-section specific measures)
processing time statistics (start/end/duration of processing time, start/end/duration of metrics gathering time)
output file metrics (same as input file metrics, and will be different numbers, output file being changed)
The output file metrics are the problem, as I can't access the file until it's written to disk, and
it's not written to disk until the process() call finishes.
Background
A log4j implementation is being used for service logging, but after some tinkering we realised it really doesn't suit the requirement here, as it would output multi-line JSON and embed the JSON into a single top-level field. We need varying top-level fields, depending on the file processed.
The server is expected to deal with multiple file operations asynchronously, and the files vary in size (from tiny to fairly immense - which is one reason we need to iterate stats and measures before we start to tune or review)
Current State
Input file and even processing-time stats are working OK, and I'm using the following technique to get them:
Inside the process() override of MyProcessor I create a new instance of my JsonLogWriter class (shortened pseudocode with ellipses):
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
...

@Component
public class MyProcessor implements Processor {
    ...

    @Override
    public void process(Exchange exchange) throws Exception {
        ...
        JsonLogWriter jlw = new JsonLogWriter();
        jlw.logfilePath = jsonLogFilePath;
        jlw.inputFilePath = inFilePath;
        jlw.outputfilePath = outFilePath;
        ...
        jlw.metricsInputFile(); // gathers metrics using inputFilePath - OK
        ...
        // the input file is processed/changed and returned as an InputStream:
        InputStream result = myEngine.readAndUpdate(inFilePath);
        // ... get timings
        jlw.write();
    }
}
From this you can see that JsonLogWriter has:
properties for the file paths (input file, output file, log output),
a set of methods to populate the data, and
a method to emit the data to a file (once ready).
Once I have populated all the JSON objects in the class, I call the write() method; the class pulls all the JSON objects together and
the stats all arrive in the log file (in a single line of JSON) - OK.
Error - no output file (yet)
If I use the metricsOutputFile method, however:
InputStream result = myEngine.readAndUpdate(inFilePath);
// ... get timings
jlw.metricsOutputFile(); // using outputfilePath
jlw.write();
}
...the JsonLogWriter fails because the file doesn't exist yet:
java.nio.file.NoSuchFileException: aroute\output\a_long_guid_filename
When debugging, I can't see any part of the exchange or result objects that I might pipe into a file-read/statistics-gathering step.
Will this require more Camel routes to solve? What might be an alternative approach where I can get all the stats from the input and output files and keep them in one object / line of JSON?
(Very happy to receive constructive criticism - as in "why is your Java so heavy-handed" - and yes, it may well be; I am prototyping solutions at this stage, so this isn't production code, nor do I profess a deep understanding of Java internals - I can usually get stuff working though.)
Use one route and two processors: one for writing the file and the next for reading it, so the first finishes writing before the second starts reading.
Or you can use two routes: one that writes the file (to:file) and another that listens for the file and reads it (from:file).
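A minimal sketch of the second (two-route) variant; the directory names and the second processor are hypothetical:

// Route 1: run the existing processor and let Camel write the result to disk.
from("file:aroute/input")
    .process(new MyProcessor())            // gathers input metrics and transforms the file
    .to("file:aroute/output");

// Route 2: fires only once the output file exists on disk, so its metrics can be
// gathered there and the single JSON log line emitted.
from("file:aroute/output?noop=true")
    .process(new OutputStatsProcessor());  // hypothetical processor calling metricsOutputFile() and write()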
You can check the common EIP patterns that will solve most of these questions here:
https://www.enterpriseintegrationpatterns.com/patterns/messaging/

What is the ideal way to design this Apache Beam transform that outputs multiple files including binary outputs?

I am trying to process PDF files in a Beam pipeline coming from an input bucket, and output the results, input, and intermediate file all to a separate output bucket.
The filenames of all three outputs are derived from the final step, and there is a 1:1 mapping of input files to output filenames, so I don't want to have shard templates in the output filenames (my UniquePrefixFileNaming class is doing the same thing as TextIO.withoutSharding())
Since the filenames are only known in the last step, I don't think I can set up tagged outputs and output files in each of the previous processing steps - I have to carry data all the way through the pipeline.
What is the best way of achieving this? Below is my attempt at the problem - the text outputs work okay but I don't have a solution for the PDF output (no binary output sink available, no binary data carried through). Is FileIO.writeDynamic the best approach?
Pipeline p = Pipeline.create();

PCollection<MyProcessorTransformResult> transformCollection = p
    .apply(FileIO.match().filepattern("Z:\\Inputs\\en_us\\**.pdf"))
    .apply(FileIO.readMatches())
    .apply(TikaIO.parseFiles())
    .apply(ParDo.of(new MyProcessorTransform()));

// Write output PDF
transformCollection.apply(FileIO.match().filepattern())
transformCollection.apply(FileIO.<String, MyProcessorTransformResult>writeDynamic()
    .withTempDirectory("Z:\\Temp\\vbeam")
    .by(input -> input.data.getResourceKey())
    .via(
        Contextful.fn((SerializableFunction<MyProcessorTransformResult, byte[]>) input -> new byte[] {})
    )
    .withNaming(d -> new UniquePrefixFileNaming(d, ".pdf"))
    .withNumShards(1)
    .withDestinationCoder(ByteArrayCoder.of())
    .to("Z:\\Outputs"));

// Write output TXT
transformCollection.apply(FileIO.<String, MyProcessorTransformResult>writeDynamic()
    .withTempDirectory("Z:\\Temp\\vbeam")
    .by(input -> input.data.getResourceKey())
    .via(
        Contextful.fn((SerializableFunction<MyProcessorTransformResult, String>) input -> input.originalContent),
        TextIO.sink()
    )
    .withNaming(d -> new UniquePrefixFileNaming(d, ".pdf.txt"))
    .withNumShards(1)
    .withDestinationCoder(StringUtf8Coder.of())
    .to("Z:\\Outputs"));

// Write output JSON
transformCollection.apply(FileIO.<String, MyProcessorTransformResult>writeDynamic()
    .withTempDirectory("Z:\\Temp\\vbeam")
    .by(input -> input.data.getResourceKey())
    .via(
        Contextful.fn((SerializableFunction<MyProcessorTransformResult, String>) input -> SerializationHelpers.toJSON(input.data)),
        TextIO.sink()
    )
    .withNaming(d -> new UniquePrefixFileNaming(d, ".pdf.json"))
    .withNumShards(1)
    .withDestinationCoder(StringUtf8Coder.of())
    .to("Z:\\Outputs"));

p.run();
I ended up writing my own file sink that saves all three outputs. FileIO is very much tailored towards streaming, using windows and panes to split the data up; my sink step kept running out of memory because it would try to aggregate everything before doing any actual writes, since batch jobs run in a single window in Beam. I had no such issues with my custom DoFn.
My recommendation for anyone looking into this is to do the same - you could try to hook into Beam's FileSystems classes or look at jclouds for filesystem-agnostic storage.
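As a rough sketch of that kind of custom write step (the pdfBytes field on MyProcessorTransformResult and the output layout are assumptions, not the answerer's actual code):

import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.util.MimeTypes;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Writes each element's PDF bytes directly through Beam's FileSystems API,
// one file per element, so nothing has to be aggregated per window before writing.
class WritePdfFn extends DoFn<MyProcessorTransformResult, Void> {

    private final String outputDir;

    WritePdfFn(String outputDir) {
        this.outputDir = outputDir;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
        MyProcessorTransformResult result = c.element();
        ResourceId file = FileSystems
                .matchNewResource(outputDir, true) // true = the spec refers to a directory
                .resolve(result.data.getResourceKey() + ".pdf", StandardResolveOptions.RESOLVE_FILE);
        try (WritableByteChannel channel = FileSystems.create(file, MimeTypes.BINARY)) {
            channel.write(ByteBuffer.wrap(result.pdfBytes)); // assumed field holding the PDF content
        }
    }
}

It would be applied with something like transformCollection.apply(ParDo.of(new WritePdfFn("Z:\\Outputs"))), with similar DoFns (or a single one writing all three files) covering the TXT and JSON outputs.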
