I have an S3 bucket with terabytes of data, split into small files of less than 5 MB each.
I am trying to use Flink to process them.
I create the source with the following code:
var inputFormat = new TextInputFormat(null);
inputFormat.setNestedFileEnumeration(true);
return streamExecutionEnvironment.readFile(inputFormat, "s3://name/");
But memory usage grows up to the limit, the job is killed, and it is not scheduled again, with the error:
Could not fulfill resource requirements of job
No data reaches the sink.
On a small set of data it works fine.
How can I read the files without using too much memory?
Thanks.
The same behaviour occurs with:
env.fromSource(
        FileSource.forRecordStreamFormat(
                new TextLineFormat(),
                new Path("s3://name/"))
            .monitorContinuously(Duration.ofMillis(10000L))
            .build(),
        WatermarkStrategy.noWatermarks(),
        "MySourceName"
)
The FileSource is the preferred way to ingest data from files. It should be able to handle the sort of scale you are talking about.
docs
javadocs
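For what it's worth, a minimal bounded (batch-style) FileSource job might look like the sketch below. This is an assumption on my part rather than your exact setup: without monitorContinuously the enumerator lists the bucket once and does not have to keep remembering already-processed paths, and note the record format class is named TextLineInputFormat in newer Flink releases.
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class S3BatchRead {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded source: enumerate the bucket once instead of monitoring it continuously.
        FileSource<String> source = FileSource
                .forRecordStreamFormat(new TextLineFormat(), new Path("s3://name/"))
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-file-source")
                .print(); // replace with your real sink

        env.execute("s3-batch-read");
    }
}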
Calling setQueueLimit on the Kinesis producer solved my problem: https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/kinesis/#backpressure
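For anyone hitting the same thing, here is a hedged sketch of what capping the producer's queue can look like; the region, stream name, and limit value are placeholders, and credentials configuration is omitted.
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

Properties producerConfig = new Properties();
producerConfig.put(AWSConfigConstants.AWS_REGION, "eu-west-1"); // placeholder region

FlinkKinesisProducer<String> producer =
        new FlinkKinesisProducer<>(new SimpleStringSchema(), producerConfig);
producer.setDefaultStream("my-output-stream"); // placeholder stream name
producer.setDefaultPartition("0");

// Cap the number of records buffered inside the producer; once the queue is full
// the sink blocks, so backpressure propagates upstream instead of growing memory.
producer.setQueueLimit(1000);

// someStream.addSink(producer);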
I'm looking for a way to access the name of the file being processed during the data transformation within a DoFn.
My pipeline is as shown below:
Pipeline p = Pipeline.create(options);
p.apply(FileIO.match()
        .filepattern(options.getInput())
        .continuously(Duration.standardSeconds(5),
                Watch.Growth.<String>never()))
 .apply(FileIO.readMatches()
        .withCompression(Compression.GZIP))
 .apply(XmlIO.<MyString>readFiles()
        .withRootElement("root")
        .withRecordElement("record")
        .withRecordClass(MyString.class)) //<-- This only returns the contents of the file
 .apply(ParDo.of(new ProcessRecord()))   //<-- I need to access file name here
 .apply(ParDo.of(new FormatRecord()))
 .apply(Window.<String>into(FixedWindows.of(Duration.standardSeconds(5))))
 .apply(new CustomWrite(options));
Each file that is processed is an XML document. While processing the content, I also need access to the name of the file being processed, so I can include it in the transformed record.
Is there a way to achieve this?
This post has a similar question, but since I'm trying to use XmlIO I haven't found a way to access the file metadata.
Below is the approach I found online, but I'm not sure whether there is a way to use it in the pipeline described above.
p.apply(FileIO.match()
        .filepattern(options.getInput())
        .continuously(Duration.standardSeconds(5),
                Watch.Growth.<String>never())) // File Metadata
 .apply(FileIO.readMatches()
        .withCompression(Compression.GZIP))    // Readable Files
 .apply(MapElements
        .into(TypeDescriptors.kvs(TypeDescriptors.strings(), new TypeDescriptor<ReadableFile>() {}))
        .via((ReadableFile file) -> {
            return KV.of(file.getMetadata().resourceId().getFilename(), file);
        })
 );
Any suggestions are highly appreciated.
Thank you for your time reviewing this.
EDIT:
I took Alexey's advice and implemented a custom XmlIO. It would be nice if we could just extend the class we need and override the appropriate method. However, in this specific case there was a reference to a method that was protected within the SDK, because of which I couldn't easily override what I needed and instead ended up copying a whole bunch of files. While this works for now, I hope that in future there is a more straightforward way to access the file metadata in these IO implementations.
I don't think it's possible to do this "out of the box" with the current implementation of XmlIO, since it returns a PCollection<T> where T is the type of your XML record and, if I'm not mistaken, there is no way to add a file name there. You can, however, still try to "reimplement" ReadFiles and XmlSource so that they return the parsed payload together with the input file metadata.
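To illustrate the kind of re-implementation Alexey describes, one workaround I can sketch (an assumption on my side, not out-of-the-box XmlIO) is to skip XmlIO, read each matched file yourself in a DoFn, and key the contents by file name; the XML parsing then has to be done manually in a later step. The continuous matching is left out here for brevity and can be added back the same way as in your pipeline.
import org.apache.beam.sdk.io.Compression;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

PCollection<KV<String, String>> namedContents =
    p.apply(FileIO.match().filepattern(options.getInput()))
     .apply(FileIO.readMatches().withCompression(Compression.GZIP))
     .apply("PairNameWithContents", ParDo.of(
         new DoFn<FileIO.ReadableFile, KV<String, String>>() {
           @ProcessElement
           public void processElement(ProcessContext c) throws Exception {
             FileIO.ReadableFile file = c.element();
             String name = file.getMetadata().resourceId().getFilename();
             // readFullyAsUTF8String loads the whole file into memory,
             // so this only suits reasonably small XML documents.
             c.output(KV.of(name, file.readFullyAsUTF8String()));
           }
         }));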
Summary
I need to build a set of statistics during a Camel server in-modify-out process, and emit those statistics as one object (a single json log line).
Those statistics need to include:
input file metrics (size/chars/bytes and other, file-section specific measures)
processing time statistics (start/end/duration of processing time, start/end/duration of metrics gathering time)
output file metrics (same as the input file metrics, but with different numbers, since the output file has been changed)
The output file metrics are the problem, as I can't access the file until it's written to disk, and it's not written to disk until 'process'ing finishes.
Background
A log4j implementation is being used for service logging, but after some tinkering we realised it really doesn't suit the requirement here, as it would output multi-line JSON and embed the JSON into a single top-level field. We need varying top-level fields, depending on the file processed.
The server is expected to deal with multiple file operations asynchronously, and the files vary in size (from tiny to fairly immense - which is one reason we need to iterate stats and measures before we start to tune or review)
Current State
Input file and even processing time stats are working OK, and I'm using the following technique to get them:
Inside the 'process' override method of "MyProcessor" I create a new instance of my JsonLogWriter class (shortened pseudo-code with ellipses):
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
...

@Component
public class MyProcessor implements Processor {
    ...
    @Override
    public void process(Exchange exchange) throws Exception {
        ...
        JsonLogWriter jlw = new JsonLogWriter();
        jlw.logfilePath = jsonLogFilePath;
        jlw.inputFilePath = inFilePath;
        jlw.outputfilePath = outFilePath;
        ...
        jlw.metricsInputFile(); // gathers metrics using inputFilePath - OK
        ...
        // input file is processed / changed and returned as an InputStream:
        InputStream result = myEngine.readAndUpdate(inFilePath);
        // ... get timings
        jlw.write();
    }
From this you can see that JsonLogWriter has:
properties for the file paths (input file, output file, log output),
a set of methods to populate the data, and
a method to emit the data to a file (once ready).
Once I have populated all the JSON objects in the class, I call the write() method; the class pulls all the JSON objects together and the stats all arrive in a log file (in a single line of JSON) - OK.
Error - no output file (yet)
If I use the metricsOutputFile method however:
InputStream result = myEngine.readAndUpdate(inFilePath);
// ... get timings
jlw.metricsOutputFile(); // using outputfilePath
jlw.write();
}
... the JsonLogWriter fails as the file doesn't exist yet.
java.nio.file.NoSuchFileException: aroute\output\a_long_guid_filename
When debugging I can't see any part of the exchange or result objects that I might pipe into a file-read / statistics-gathering process.
Will this require more Camel routes to solve? What might be an alternative approach where I can get all the stats from the input and output files and keep them in one object / line of JSON?
(Very happy to receive constructive criticism - as in "why is your Java so heavy-handed" - and yes, it may well be; I am prototyping solutions at this stage, so this isn't production code, nor do I profess a deep understanding of Java internals. I can usually get stuff working though.)
Use one route and two processors: one for writing the file and the next for reading it, so one finishes writing before the other starts reading.
Or you can use two routes: one that writes the file (to:file) and another that listens for it and reads it (from:file).
You can check the common EIP patterns that will solve most of these questions here:
https://www.enterpriseintegrationpatterns.com/patterns/messaging/
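As a hedged sketch of the second suggestion (two routes), something along these lines could work; the directory names and the OutputMetricsProcessor are illustrative assumptions, not code from your project.
import org.apache.camel.builder.RouteBuilder;

public class StatsRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Route 1: run the existing processor (input-file and timing metrics)
        // and write the transformed payload to the output directory.
        from("file:aroute/input")
            .process(new MyProcessor())
            .to("file:aroute/output");

        // Route 2: fires only once the output file is fully written
        // (readLock=changed waits for the file to stop growing), so here the
        // output-file metrics can be gathered and the single JSON line emitted.
        from("file:aroute/output?noop=true&readLock=changed")
            .process(new OutputMetricsProcessor()); // hypothetical processor
    }
}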
I am currently new to machine learning and I will be working on a project that involves using a machine learning library to detect and alert about possible anomalies. I will be using Apache Spark, and I decided to use the KMeans method for the project.
The main project consists of analyzing daily files and detecting fluctuating changes in some of the records, reporting them as possible anomalies (if the model considers them to be one). The files are generated at the end of a day, and my program needs to check them on the morning of the next day to see if there is an anomaly.
However, I need to check for anomalies file vs. file, NOT within a file. This means that I have to compare the data of every file and see whether it fits the model I would create following the specific algorithm. In other words, I have some valid data that I will apply the algorithm to in order to train my model. Then I have to apply this same model to other files of the same format but, obviously, different data. I'm not looking for a prediction column, but rather for detecting anomalies in these other files. If there is an anomaly, the program should tell me which row/column has the anomaly, and then I have to program it to send an email saying that there is a possible anomaly in that specific file.
Like I said I am new to machine learning. I want to know how I can use the KMeans algorithm to detect outliers/anomalies on a file.
So far I have created the model:
SparkConf conf = new SparkConf().setAppName("practice").setMaster("local");
JavaSparkContext sc = new JavaSparkContext(conf);
SparkSession spark = SparkSession
        .builder()
        .appName("Anomaly Detection")
        .getOrCreate();

String day1txt = "C:\\Users\\User\\Documents\\day1.txt";
String day2txt = "C:\\Users\\User\\Documents\\day2.txt";

Dataset<Row> day1 = spark.read()
        .option("header", "true")
        .option("delimiter", "\t")
        .option("inferSchema", "true")
        .csv(day1txt);

day1 = day1.withColumn("Size", day1.col("Size").cast("Integer"));
day1 = day1.withColumn("Records", day1.col("Records").cast("Integer"));

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"Size", "Records"})
        .setOutputCol("features");

Dataset<Row> day1vector = assembler.transform(day1);

KMeans kmeans = new KMeans().setK(5).setSeed(1L);
KMeansModel model = kmeans.fit(day1vector);
I don't know what to do from this point on to detect outliers. I have several other .txt files that should have "normalized" data, and I also have a couple of files that contain "tampered/not-normalized" data. Do I need to train my model with all the test data I have available, and if so, how can I train a model using different datasets? Or can I only train it with one dataset and test it with the others?
EDIT:
This is a sample of the file (day1.txt) I will be using (dummy data of course, top 10 rows):
Name Size Records
File1 1000 104370
File2 990 101200
File3 1500 109123
File4 2170 113888
File5 2000 111974
File6 1820 110666
File7 1200 106771
File8 1500 108991
File9 1000 104007
File10 1300 107037
This is considered normal data, and I will have different files with the same format but different values around the same range. Then I have some files where I purposely added an outlier, like Size: 1000, Records: 50000.
How can I detect that with KMeans? Or if KMeans is not the perfect model, which model should I use and how should I go around it?
There is a simple approach for this: create your clusters with KMeans, then for each cluster set a suitable radius with respect to the center of that cluster; if a point lies outside that radius, it is an outlier.
Try looking at this: https://arxiv.org/pdf/1402.6859.pdf
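To make the radius idea concrete with Spark ML in Java, here is a rough sketch. It assumes day2vector is a second day's file run through the same VectorAssembler, and the radius value is purely illustrative and would need tuning against your known-good files.
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

// Assign each row of the new day's data to its nearest cluster ("prediction" column).
Dataset<Row> scored = model.transform(day2vector);

// Distance from each point to the centre of the cluster it was assigned to.
Vector[] centers = model.clusterCenters();
spark.udf().register("distToCenter",
        (UDF2<Vector, Integer, Double>) (features, cluster) ->
                Math.sqrt(Vectors.sqdist(features, centers[cluster])),
        DataTypes.DoubleType);

Dataset<Row> withDistance = scored.withColumn("distance",
        callUDF("distToCenter", col("features"), col("prediction")));

// Anything farther from its assigned centre than the chosen radius is flagged.
double radius = 5000.0; // illustrative threshold, tune on the "normal" files
Dataset<Row> outliers = withDistance.filter(col("distance").gt(radius));
outliers.show(); // candidate anomalies, with Name, Size, Records and distance
The idea is that you fit the model once on the known-good data and then only call transform on each new day's file; you don't refit it on the files you are checking.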
There are some outlier detection techniques, like One-Class SVM or Angle-Based Outlier Detection, and so on. Try looking at this: http://scikit-learn.org/stable/modules/outlier_detection.html
I have this synchronous pipeline that needs to be executed from time to time (let's say every 30 minutes):
Connect to a ftp;
Read a .json file (single file) from folder A;
Unmarshall the content of the file (Class A) and add it to the route context;
Read all the .fixedlenght files (multiple files) from folder B (preMove: processingFolder, move: doneFolder, moveFailed: errorFolder);
Unmarshall the content of the files (Class B) and do some logic;
Read all the .xml files (multiple files) from folder C (preMove: processingFolder, move: doneFolder, moveFailed: errorFolder);
Unmarshall the content of the files (Class C) and do some logic;
End the route.
It is a single pipeline created with the Java DSL. If an error happens, the process stops.
I'm really struggling to create this with Camel. Is it possible, or will I need to handle it manually? I created some demos, but none of them are working properly.
Any help will be appreciated.
I would approach this in the following manner:
All the interfaces to the FTP where you read the files are separate routes. Their job is only to pick up the file. They don't deal with parsing or transformation.
Then create separate routes for actually receiving the data, parsing and transformation.
Finally the delivery routes which take the data and deliver to your end destination.
This way you can customise the error handling, it's easier to find out what went wrong and where, it's easier to change one part without affecting everything, and you can reuse the routes in several different places.
The way you describe your message pipeline, it seems beneficial to have 3 separate routes, each handling a different folder on your FTP server. You can have a timer that triggers all 3 every 30 minutes or so. The FTP component derives from Camel's File component, and there are a lot of useful parameters that would help with your routing logic here.
For each of your 3 routes you would have something like this:
from("ftp://foo#myserver?include=*.xml&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder")
.unmarshal()
...
You can find more info about filtering files by their extensions here
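Building on that, here is a hedged sketch of the three polling routes; the host, folder names, and the 30-minute interval (expressed via the delay option, in milliseconds) are placeholder assumptions, and the unmarshalling and business logic would live behind the direct: endpoints.
import org.apache.camel.builder.RouteBuilder;

public class FtpPollingRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Folder A: the single .json file, polled every 30 minutes.
        from("ftp://foo@myserver/folderA?include=.*\\.json&delay=1800000"
                + "&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder")
            .to("direct:handleA"); // unmarshal to Class A, store in the route context

        // Folder B: the fixed-length files.
        from("ftp://foo@myserver/folderB?include=.*\\.fixedlenght&delay=1800000"
                + "&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder")
            .to("direct:handleB"); // unmarshal to Class B and run the logic

        // Folder C: the .xml files.
        from("ftp://foo@myserver/folderC?include=.*\\.xml&delay=1800000"
                + "&preMove=processingFolder&move=doneFolder&moveFailed=errorFolder")
            .to("direct:handleC"); // unmarshal to Class C and run the logic
    }
}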
I have a web service which receives a 100 MB video file in chunks:
public void addFileChunk(Long fileId, byte[] buffer)
How can I store this file in a PostgreSQL database using Hibernate?
With regular JDBC it is very straightforward. I would use the following code inside my web service method:
LargeObject largeObject = largeObjectManager.open(fileId, LargeObjectManager.READWRITE);
int size = largeObject.size();
largeObject.seek(size);
largeObject.write(buffer);
largeObject.close();
How can I achieve the same functionality using Hibernate, and store this file chunk by chunk?
Storing each file chunk in a separate row as bytea doesn't seem like a smart idea to me. Please advise.
It's not advisable to store 100 MB files in the database. I would instead store them in the filesystem, but considering that transactions are active, an approach employing servlets seems reasonable:
Process the HTTP request so that the received file is stored in some temporary location.
Open a transaction, persist the file metadata including the temporary location, close the transaction.
Using some external process that monitors the temporary files, transfer the file to its final destination, from which it will be available to the user through some servlet.
see http://in.relation.to/Bloggers/PostgreSQLAndBLOBs
Yeah, byteas would be bad. Hibernate has a way to keep using large objects, and you get to keep the streaming interface.
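A hedged sketch of what that can look like, assuming the chunks have already been assembled into an InputStream (or a temporary file); the entity and the session handling are illustrative assumptions, not code from the question.
import java.io.InputStream;
import java.sql.Blob;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.Lob;
import org.hibernate.Session;

@Entity
public class VideoFile {

    @Id
    private Long id;

    @Lob
    private Blob content; // typically mapped to a PostgreSQL large object (oid) rather than bytea

    public void setId(Long id) { this.id = id; }
    public void setContent(Blob content) { this.content = content; }
}

// Inside an open Session/transaction:
void storeVideo(Session session, Long fileId, InputStream stream, long length) {
    VideoFile file = new VideoFile();
    file.setId(fileId);
    // LobHelper streams the data to the database instead of materialising a byte[].
    file.setContent(session.getLobHelper().createBlob(stream, length));
    session.persist(file);
}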