How to read an output file for collecting stats in (post-)processing - Java

Summary
I need to build a set of statistics during a Camel server in-modify-out process, and emit those statistics as one object (a single json log line).
Those statistics need to include:
input file metrics (size/chars/bytes and other file-section-specific measures)
processing time statistics (start/end/duration of processing time, start/end/duration of metrics-gathering time)
output file metrics (the same measures as the input file, though the numbers will differ because the file has been changed)
The output file metrics are the problem, as I can't access the file until it's written to disk, and it's not written to disk until 'process'ing finishes.
Background
A log4j implementation is being used for service logging, but after some tinkering we realised it really doesn't suit the requirement here, as it would output multi-line json and embed the json into a single top-level field. We need varying top-level fields, depending on the file processed.
The server is expected to deal with multiple file operations asynchronously, and the files vary in size (from tiny to fairly immense - which is one reason we need to iterate on stats and measures before we start to tune or review).
Current State
Input file and even processing time stats are working OK, and I'm using the following technique to get them:
Inside the 'process' override method of "MyProcessor" I create a new instance of my JsonLogWriter class (shortened pseudocode with ellipses):
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
...
@Component
public class MyProcessor implements Processor {
...
    @Override
    public void process(Exchange exchange) throws Exception {
        ...
        JsonLogWriter jlw = new JsonLogWriter();
        jlw.logfilePath = jsonLogFilePath;
        jlw.inputFilePath = inFilePath;
        jlw.outputfilePath = outFilePath;
        ...
        jlw.metricsInputFile(); // gathers metrics using inputFilePath - OK
        ...
        // input file is processed / changed and returned as an InputStream:
        InputStream result = myEngine.readAndUpdate(inFilePath);
        // ... get timings
        jlw.write();
    }
From this you can see that JsonLogWriter has:
properties for the file paths (input file, output file, log output),
a set of methods to populate data, and
a method to emit the data to a file (once ready).
Once I have populated all the json objects in the class, I call the write() method; the class pulls all the json objects together and
the stats all arrive in a log file (in a single line of json) - OK.
Error - no output file (yet)
If I use the metricsOutputFile method, however:
InputStream result = myEngine.readAndUpdate(inFilePath);
// ... get timings
jlw.metricsOutputFile(); // using outputfilePath
jlw.write();
}
... the JsonLogWriter fails as the file doesn't exist yet.
java.nio.file.NoSuchFileException: aroute\output\a_long_guid_filename
When debugging, I can't see any part of the exchange or result objects that I might pipe into a file-read/statistics-gathering step.
Will this require more Camel routes to solve? What might be an alternative approach where I can get all the stats from the input and output files and keep them in one object / line of json?
(very happy to receive constructive criticism - as in why is your Java so heavy-handed - and yes it may well be, I am prototyping solutions at this stage, so this isn't production code, nor do I profess deep understanding of Java internals - I can usually get stuff working though)

Use one route and two processors: one for writing the file and the next for reading it, so the first finishes writing before the second starts reading.
Alternatively, you can use two routes: one that writes the file (to:file) and another that listens for and reads the file (from:file); see the sketch below.
You can check the common EIP patterns that will solve most of these questions here:
https://www.enterpriseintegrationpatterns.com/patterns/messaging/
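For the two-route option, a minimal Camel Java DSL sketch might look like the following. The directory names and the OutputStatsProcessor (a second processor that would call metricsOutputFile() and write()) are assumptions, not part of the original question:

import org.apache.camel.builder.RouteBuilder;

public class StatsRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Route 1: transform the input and let Camel write the output file to disk.
        from("file:aroute/input")
            .process(new MyProcessor())            // gathers input metrics, transforms the payload
            .to("file:aroute/output");

        // Route 2: the file consumer only picks up the output file once it is fully
        // written, so the output metrics can be gathered safely here.
        from("file:aroute/output?noop=true")
            .process(new OutputStatsProcessor());  // hypothetical processor calling metricsOutputFile() + write()
    }
}

With noop=true the second route reads the output file in place without moving or deleting it, and the default idempotent repository prevents it from being consumed twice.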

Related

readFile causes "Could not fulfill resource requirements of job"

I have an S3 bucket with terabytes of data, split into small files of less than 5 MB each.
I'm trying to use Flink to process them.
I create the source with the following code:
var inputFormat = new TextInputFormat(null);
inputFormat.setNestedFileEnumeration(true);
return streamExecutionEnvironment.readFile(inputFormat, "s3://name/");
But the used memory grows up to the limit, the job gets killed, and it is not scheduled again, with the error:
Could not fulfill resource requirements of job
No data reaches the sink.
On a small set of data it works fine.
How can I read the files without using too much memory?
Thanks.
The same behaviour occurs with:
env.fromSource(
    FileSource.forRecordStreamFormat(
            new TextLineFormat(),
            new Path("s3://name/"))
        .monitorContinuously(Duration.ofMillis(10000L))
        .build(),
    WatermarkStrategy.noWatermarks(),
    "MySourceName"
)
The FileSource is the preferred way to ingest data from files, and it should be able to handle the sort of scale you are talking about. See the FileSource docs and javadocs for details.
setQueueLimit on the Kinesis producer solved my problem: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/kinesis/#backpressure
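For reference, a minimal sketch of that queue-limit fix, assuming the job sinks a DataStream<String> named stream to Kinesis; the region, stream name, partition and limit are illustrative values only:

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;

Properties producerConfig = new Properties();
producerConfig.put("aws.region", "eu-west-1");

FlinkKinesisProducer<String> producer =
        new FlinkKinesisProducer<>(new SimpleStringSchema(), producerConfig);
producer.setDefaultStream("name");
producer.setDefaultPartition("0");
// Cap the number of in-flight records so backpressure propagates upstream
// instead of records piling up in memory.
producer.setQueueLimit(1000);

stream.addSink(producer);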

How to access file name within a DoFn in an unbounded pipeline

I'm looking for a way to access the name of the file being processed during the data transformation within a DoFn.
My pipeline is as shown below:
Pipeline p = Pipeline.create(options);
p.apply(FileIO.match()
        .filepattern(options.getInput())
        .continuously(Duration.standardSeconds(5),
            Watch.Growth.<String>never()))
    .apply(FileIO.readMatches()
        .withCompression(Compression.GZIP))
    .apply(XmlIO.<MyString>readFiles()
        .withRootElement("root")
        .withRecordElement("record")
        .withRecordClass(MyString.class)) // <-- This only returns the contents of the file
    .apply(ParDo.of(new ProcessRecord())) // <-- I need to access the file name here
    .apply(ParDo.of(new FormatRecord()))
    .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(5))))
    .apply(new CustomWrite(options));
Each file that is processed is an XML document. While processing the content, I also need access to the name of the file being processed, so it can be included in the transformed record.
Is there a way to achieve this?
This post has a similar question, but since I'm trying to use XmlIO I haven't found a way to access the file metadata.
Below is the approach I found online, but I'm not sure whether there is a way to use it in the pipeline described above.
p.apply(FileIO.match()
        .filepattern(options.getInput())
        .continuously(Duration.standardSeconds(5),
            Watch.Growth.<String>never())) // File metadata
    .apply(FileIO.readMatches()
        .withCompression(Compression.GZIP)) // Readable files
    .apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), new TypeDescriptor<ReadableFile>() {}))
        .via((ReadableFile file) -> {
            return KV.of(file.getMetadata().resourceId().getFilename(), file);
        })
    );
Any suggestions are highly appreciated.
Thank you for your time reviewing this.
EDIT:
I took Alexey's advice and implemented a custom XmlIO. It would have been nice to simply extend the class we need and override the appropriate method; however, in this specific case there was a reference to a method that is protected within the SDK, so I couldn't easily override what I needed and instead ended up copying a whole bunch of files. While this works for now, I hope that in future there is a more straightforward way to access the file metadata in these IO implementations.
I don't think it's possible to do this "out of the box" with the current implementation of XmlIO, since it returns a PCollection<T> where T is the type of your xml record and, if I'm not mistaken, there is no way to add a file name there. You can, though, try to "reimplement" ReadFiles and XmlSource so that they return the parsed payload together with the input file metadata.
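As a rough alternative sketch (not the author's implementation), you could skip XmlIO.readFiles() entirely and parse each matched file inside a DoFn, where both the file name and the contents are in scope; the XML-to-record parsing is only indicated by a comment and the output type is illustrative:

p.apply(FileIO.match()
        .filepattern(options.getInput())
        .continuously(Duration.standardSeconds(5), Watch.Growth.<String>never()))
    .apply(FileIO.readMatches().withCompression(Compression.GZIP))
    .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws IOException {
            FileIO.ReadableFile file = c.element();
            String fileName = file.getMetadata().resourceId().getFilename();
            String xml = file.readFullyAsUTF8String(); // decompressed per withCompression
            // parse 'xml' into your record type here and emit one element per record,
            // keyed by the file it came from
            c.output(KV.of(fileName, xml));
        }
    }));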

How can I create a route with multiple connected ftp calls using Camel and Java DSL?

I have this synchronous pipeline that needs to be executed from time to time (let's say every 30 minutes):
Connect to a ftp;
Read a .json file (single file) from folder A;
Unmarshall the content of the file (Class A) and add it to the route context;
Read all the .fixedlenght files (multiple files) from folder B (preMove: processingFolder, move: doneFolder, moveFailed: errorFolder);
Unmarshall the content of the files (Class B) and do some logic;
Read all the .xml files (multiple files) from folder C (preMove: processingFolder, move: doneFolder, moveFailed: errorFolder);
Unmarshall the content of the files (Class C) and do some logic;
End the route.
It is a single pipeline created with the Java DSL. If an error happens, the process stops.
I'm really struggling to create this with Camel. Is it possible, or will I need to handle it manually? I created some demos, but none of them work properly.
Any help will be appreciated.
I would approach this in the following manner:
All the interfaces to the FTP server where you read the files are separate routes. Their job is only to pick up the file; they don't deal with parsing or transformation.
Then create separate routes for actually receiving the data, parsing and transformation.
Finally, the delivery routes take the data and deliver it to your end destination.
This way you can customise the error handling, it's easier to find out what went wrong and where, it's easier to change one part without affecting everything, and you can reuse the routes in several different places.
The way you describe your message pipeline, it seems beneficial to have 3 separate routes, each handling a different folder on your FTP server. You can have a timer that triggers all 3 every 30 minutes or so. The FTP component derives from Camel's File component, and there are a lot of useful parameters that would help with your routing logic here.
For each of your 3 routes you would have something like this:
from("ftp://foo#myserver?include=*.xml&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder")
.unmarshal()
...
You can find more info about filtering files by their extensions here
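A minimal sketch of the three polling routes, assuming hypothetical host, credentials, folder names and direct: endpoints (delay is in milliseconds, so 1800000 polls every 30 minutes):

import org.apache.camel.builder.RouteBuilder;

public class FtpPollingRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Folder A: single .json configuration file.
        from("ftp://user@myserver/folderA?include=.*\\.json&delay=1800000")
            .to("direct:classA"); // unmarshal to Class A and keep it for the other routes

        // Folder B: fixed-length files, moved as they are processed.
        from("ftp://user@myserver/folderB?include=.*\\.fixedlenght"
                + "&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder&delay=1800000")
            .to("direct:classB"); // unmarshal Class B records and apply the business logic

        // Folder C: xml files, moved as they are processed.
        from("ftp://user@myserver/folderC?include=.*\\.xml"
                + "&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder&delay=1800000")
            .to("direct:classC"); // unmarshal Class C records and apply the business logic
    }
}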

How to fully read a file with delimited messages in Google Protobufs?

I'm trying to read a file which has multiple delimited messages in it (in the thousands). How can I do this properly using Google protobufs?
This is how I'm writing the delimited messages:
MyMessage myMessage = MyMessage.parseFrom(msg); // msg is a byte[]
myMessage.writeDelimitedTo(fileOutputStream);
and this is how I'm reading the delimited file:
CodedInputStream is = CodedInputStream.newInstance(new FileInputStream("/location/to/file"));
while (!is.isAtEnd()) {
    int size = is.readRawVarint32();
    MyMessage msg = MyMessage.parseFrom(is.readRawBytes(size));
    // do stuff with your messages
}
I'm kind of confused, because the accepted answer to this question says to use .parseDelimitedFrom() to read the delimited bytes: Google Protocol Buffers - Storing messages into file
However, when using .parseDelimitedFrom(), it only reads the first message (I don't know how to read the whole file using parseDelimitedFrom()).
This comment says to write the delimited messages using CodedOutputStream (Google Protocol Buffers - Storing messages into file), i.e. writer.writeRawVarint32(). I'm currently using the implementation from that comment to read the whole file. Does writeDelimitedTo() basically do the same thing as
writer.writeRawVarint32(bytes.length);
and
writer.writeRawBytes(bytes);
Also, if my way isn't the proper way of reading a whole file consisting of delimited messages, can you please show me what is?
Thank you.
Yes, writeDelimitedTo() simply writes the length as a varint followed by the bytes. There's no need to use CodedOutputStream directly if you're working in Java.
parseDelimitedFrom() parses one message, but you may call it repeatedly to parse all the messages in the InputStream. The method returns null when you reach the end of the stream.
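A minimal sketch of reading the whole file with parseDelimitedFrom(), using the MyMessage type and file path from the question:

import java.io.FileInputStream;
import java.io.InputStream;

try (InputStream in = new FileInputStream("/location/to/file")) {
    MyMessage msg;
    // parseDelimitedFrom() reads one length-prefixed message per call
    // and returns null once the end of the stream is reached.
    while ((msg = MyMessage.parseDelimitedFrom(in)) != null) {
        // do stuff with each message
    }
}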

Programmatically obtain RF test result xml file name and path

I have been asked to work on an RF Java test library that can post test results to a different service through SOAP. While the concept seems easy to understand, I am stuck on using RF due to inexperience.
Ideally, I will write the test library ABC, which contains a method that takes a string as an input parameter. The string would be the location of the test result xml file. Then I add ABC to an RF test case. When the test case is run, ABC will get called. The code inside ABC would parse this file to get the list of test case ids and results, then send this data to the external service using SOAP.
The difficulty now is how to make RF know the value of the test result file and pass that value to ABC at run time. Any ideas? Thank you in advance.
If you want to update an external system during test execution, the best way is to use the listener interface instead of a test library; this is similar to how TestNG and JUnit solve this problem.
Your listener will have access to the same information that is being written to the results XML file, including metadata and tags, where you are presumably going to store the test case IDs.
EDIT:
You can ignore the data sent to the listener and still work with the XML file. If you have an output_file method in your listener, Robot Framework will send it the path to the output XML file once it has finished writing it. Then you can open and process it.
You can also externalize the whole process and create an application or script that takes the location of the output XML file(s) as parameter(s).
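A minimal sketch of such a Java listener (assuming the listener API v2 conventions; the class name and the SOAP call are placeholders):

public class ResultPoster {
    public static final int ROBOT_LISTENER_API_VERSION = 2;

    // Java listeners use camelCase method names; Robot Framework calls this
    // with the path of the output XML file once it has finished writing it.
    public void outputFile(String path) {
        // parse the XML at 'path' for test case ids and statuses,
        // then post them to the external service over SOAP
    }
}

The listener is attached with the --listener command line option when the tests are run.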
