In my project, I am trying to add some metadata to the data processed in my pipeline. The metadata is located in a DBF file in a subfolder called resources next to the src-folder.
The src-folder contains the main class, and I have several packages (IO, processing, aggregation, utils).
I read and process the file with metadata in my main class where the pipeline is defined. The code I am using to access the file is as follows:
File temp1 = new File("resources/xxx.dbf");
I check if the file was found using:
LOG.info(temp1.exists());
which runs fine.
There are messages coming in as Strings which I read using PubSubIO. I use the contents of this file to fill a Map containing keys and values.
Map<String, ArrayList<Double>> sensorToCoordinates = coordinateData.getSensorLocations();
I then set a static variable in a custom class called 'SensorValues' I made:
SensorValue.setKeyToCoordinates(sensorToCoordinates);
When parsing the incoming messages from Strings to a SensorValue class I made, using a ParDo function (going from a PCollection<String> to a PCollection<SensorValue>), the map is used in the constructor of the SensorValue class.
Running this code using a DirectPipelineRunner works perfectly. However, when I use a DataflowPipelineRunner and try to access the map in the SensorValue constructor, I run into a NullPointerException.
Now I am wondering why the setter is not working when using a DataflowPipelineRunner (I'm guessing it has something to do with the execution being distributed among several workers), and what the best practice is for using static resource files to enrich a pipeline.
You're right that the problem is that execution of the ParDo is distributed across multiple workers. Those workers don't have the local file, and they may not have the contents of the map.
There are a few options here:

1. Put the file in GCS, and have the pipeline read the contents of the file (using TextIO or something like that) and use it as a side input to your later processing. (A rough sketch of this appears after the list.)
2. Include the file in the resources for the pipeline and load it in the startBundle of the DoFn that needs it (in the future there will be ways to make this happen less often than every bundle).
3. Serialize the contents of the map into the arguments of the DoFn, by passing it as a non-static field to the constructor of that class.

Option 1 is better as the size of this file increases (since it can support splitting it up into pieces and doing lookups), while Option 2 likely means less network traffic to retrieve the file. Option 3 will only work if the file is extremely small, since it will significantly increase the size of the serialized DoFn, which may make the job too large to submit to the Dataflow service.
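For illustration, here is a rough sketch of Option 1 using a side input. The exact transform and annotation names depend on which SDK version you are on (this uses the Beam-style API); the GCS path, the parseLine helper (which turns a metadata line into a KV of sensor id and coordinates), and the SensorValue constructor taking the map are assumptions:

// "messages" is the PCollection<String> read from PubSubIO.
PCollection<String> metadataLines =
    pipeline.apply(TextIO.read().from("gs://my-bucket/sensor-metadata.csv"));  // assumed path

// Turn the metadata into a map view that workers can look up.
final PCollectionView<Map<String, List<Double>>> sensorToCoordinatesView =
    metadataLines
        .apply(MapElements
            .into(TypeDescriptors.kvs(
                TypeDescriptors.strings(),
                TypeDescriptors.lists(TypeDescriptors.doubles())))
            .via(line -> parseLine(line)))  // hypothetical parser: String -> KV<String, List<Double>>
        .apply(View.asMap());

PCollection<SensorValue> sensorValues = messages.apply(
    ParDo.of(new DoFn<String, SensorValue>() {
        @ProcessElement
        public void processElement(ProcessContext c) {
            Map<String, List<Double>> sensorToCoordinates = c.sideInput(sensorToCoordinatesView);
            c.output(new SensorValue(c.element(), sensorToCoordinates));  // assumed constructor
        }
    }).withSideInputs(sensorToCoordinatesView));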
Related
As a best practice, I am trying to index a bunch of documents to Solr in one request instead of indexing them one at a time. The problem is that the files I am indexing are of different types (PDF, Word document, text file, ...) and therefore have different metadata extracted by Tika and indexed.
I'd like to have certain fields/information for all files regardless of their type, such as creator, creation date and path, but I don't know how to add fields manually when I index all the files at once.
If I indexed one file at a time, I could just add fields with request.setParam(), but that applies to the whole request, not to a single file. And even if something like that were possible, how would I get information like the creator of a file in Java?
Is there a possibility to add fields for each file?
if (listOfFiles != null) {
    for (File file : listOfFiles) {
        if (file.isFile()) {
            request.addFile(file, getContentType(file));
            // add field only for this file?
        } else {
            // Folder, call the same method again -> recursion
            request = addFilesToRequest(file, request);
        }
    }
}
As far as I know there is no way of submitting multiple files in the same request. These requests are usually so heavy on processing anyway that lowering the number of HTTP requests may not change the total processing time much.
If you want to speed things up, you can process all your files locally with Tika first (Tika is what Solr uses internally as well), then submit only the extracted data. That way you can multithread the extraction, add the results to a queue, and let the Solr submission run as the queue grows, with the content submitted to Solr in several larger batches (for example 1000 documents at a time).
This also allows you to scale your indexing process without having to add more Solr servers to make that part of the process go faster (if your Solr node can keep up with search traffic, it shouldn't be necessary to scale it just to process documents).
Using Tika manually also makes it easier to correct or change details during processing, such as file formats returning dates in time zones other than what you expect.
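To make this concrete, here is a rough sketch of that approach with Tika and SolrJ (SolrJ 6+ style client construction); the core URL, the field names, and the batch size of 1000 are assumptions:

public void extractAndIndex(List<File> files) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();  // org.apache.tika.parser
    SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

    List<SolrInputDocument> batch = new ArrayList<>();
    for (File file : files) {
        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler(-1);  // -1 = no write limit
        try (InputStream in = new FileInputStream(file)) {
            parser.parse(in, handler, metadata);  // extraction happens locally, not in Solr
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", file.getAbsolutePath());
        doc.addField("content", handler.toString());
        doc.addField("creator", metadata.get(TikaCoreProperties.CREATOR));  // whatever Tika found
        doc.addField("path", file.getAbsolutePath());  // your own "common" fields go here
        batch.add(doc);

        if (batch.size() >= 1000) {  // submit in larger batches
            solr.add(batch);
            batch.clear();
        }
    }
    if (!batch.isEmpty()) {
        solr.add(batch);
    }
    solr.commit();
    solr.close();
}

The per-file extraction is the part you can multithread; the Solr submission can then be a single consumer draining the queue.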
Goal
I have the task of finding duplicate entries within import files, and in a later stage duplicate entries in these import files compared to a global database. The data inside the files is personal information like name, email, address, etc. The data is not always complete, and is often spelled incorrectly.
The files will be uploaded by external users through a web form. The user needs to be notified when the process is done, and he / she has to be able to download the results.
In addition to solving this task, I need to assess the suitability of Apache Beam for it.
Possible solution
I was thinking about the following: the import files are uploaded to S3, and the pipeline will either get the file location as a pub/sub event (Kafka queue) or watch S3 (if possible) for incoming files.
Then the file is read by one PTransform and each line is pushed into a PCollection. As a side output I would update a search index (in Redis or some such). The next transform would access the search index and try to find matches. The end results (unique values, duplicate values) are written to an output file on S3, and the index is cleared for the next import.
Questions
Does this approach make sense - is it idiomatic for Beam?
Would Beam be suitable for this processing?
Any improvement suggestions for the above?
I would need to track the file name / ID to notify the user at the end. How can I move this metadata through the pipeline? Do I need to create an "envelope" object for metadata and payload, and use this object in my PCollection?
The incoming files are unbounded, but the contents of each file are bounded. Is there an idiomatic way to find out when the processing of a file has ended?
Does this approach make sense - is it idiomatic for Beam?
This is a subjective question. In general, I would say no, this is not idiomatic for Apache Beam. Apache Beam is a framework for defining ETL pipelines. The Beam programming model has no opinions or built-in functionality for deduplicating data. Deduplication is achieved through implementation (business logic code you write) or a feature of a data store (a UNIQUE constraint or SELECT DISTINCT in SQL, or a key/value store).
Would Beam be suitable for this processing?
Yes, Beam is suitable.
Any improvement suggestions for the above?
I do not recommend writing to a search index in the middle of the pipeline. By doing this and then attempting to read the data back in the following transform, you've effectively created a cycle in the DAG. The pipeline may suffer from race conditions. It is less complex to have two separate pipelines - one to write to the search index (deduplicate) and a second one to write back to S3.
I would need to track the file name / ID to notify the user at the end. How can I move this metadata through the pipeline? Do I need to create an "envelope" object for metadata and payload, and use this object in my PCollection?
Yes, this is one approach. I believe you can get the file metadata via the ReadableFile class.
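For example, a rough sketch using FileIO from Beam 2.x, which yields ReadableFile elements carrying the file metadata; the S3 file pattern is an assumption:

PCollection<KV<String, String>> linesWithFileName = pipeline
    .apply(FileIO.match().filepattern("s3://my-bucket/imports/*.csv"))  // assumed pattern
    .apply(FileIO.readMatches())
    .apply(ParDo.of(new DoFn<FileIO.ReadableFile, KV<String, String>>() {
        @ProcessElement
        public void processElement(ProcessContext c) throws IOException {
            FileIO.ReadableFile file = c.element();
            String fileName = file.getMetadata().resourceId().getFilename();
            // The file contents are bounded, so reading the whole file here is fine.
            for (String line : file.readFullyAsUTF8String().split("\n")) {
                c.output(KV.of(fileName, line));  // "envelope": file name + payload
            }
        }
    }));

For a stream of incoming files you can swap FileIO.match() for FileIO.match().continuously(...), which keeps polling the file pattern.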
The incoming files are unbounded, but the contents of each file are bounded. Is there an idiomatic way to find out when the processing of a file has ended?
I'm not sure off the top of my head, but I don't think this is possible for a pipeline executing in streaming mode.
I'm a new Spark user and I'm trying to process a large set of XML files sitting on an HDFS file system. There are about 150k files, totalling about 28 GB, on a "development" cluster of 1 machine (actually a VM).
The files are organised into a directory structure in HDFS such that there are about a hundred subdirectories under a single parent directory. Each "child" directory contains anything between a couple of hundred and a couple of thousand XML files.
My task is to parse each XML file, extract a few values using XPath expressions, and save the result to HBase. I'm trying to do this with Apache Spark, and I'm not having much luck. My problem appears to be a combination of the Spark API, and the way RDDs work. At this point it might be prudent to share some pseudocode to express what I'm trying to do:
val filePaths: RDD[String] = getAllFilePaths()
val parsedFiles: RDD[Map[String, String]] = filePaths.map { filePath =>
  // Load the file denoted by filePath
  // Parse the file and apply XPath expressions
}
// After calling map() above, I should have an RDD[Map[String, String]] where
// each map is keyed by a "label" for an XPath expression, and the
// corresponding value is the result of the expression applied to the file
So, discounting the part where I write to HBase for a moment, let's focus on the above. I cannot load a file from within the RDD map() call.
I have tried this a number of different ways, and all have failed:
Using a call to SparkContext.textFile("/my/path") to load the file fails because SparkContext is not serializable
Using a call to FileSystem.open(path) from the Hadoop API, where the FileSystem is instantiated outside the RDD fails because FileSystem is not serializable
Using a call to FileSystem.open(path) from the Hadoop API, where the FileSystem is instantiated inside the RDD fails because the program runs out of file handles.
An alternative approach was to use SparkContext.wholeTextFiles("/my/path/*") so that I don't have to load the files from within the map() call; this fails because the program runs out of memory, presumably because it loads the files eagerly.
Has anyone attempted anything similar in their own work, and if so, what approach did you use?
Try using a wildcard to read the whole directory.
val lines = sc.textFile("hdfs://some-directory/*")
Actually, Spark can read a whole HDFS directory; to quote from the Spark documentation:
All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
We have functionality to create a number of folders when the user saves data in CM.
The format is attached as an image:
ParentFolder
ChildFolder1
ChildFolder2
ChildFolder3
File1
File2
File3
ParentFolderConfig
ChildFolderConfig1
ChildFolderConfig2
ChildFolderConfig3
FileConfig1
FileConfig2
FileConfig3
All of these are created every time the user saves. I have found a way to add nodes one by one using addNode(). But to save time and improve performance, I wanted to find a way to create these files and folders temporarily in Java, save them to the JCR in one call, and afterwards dispose of the temporary files.
Calling addNode() multiple times and saving at the end with Session.save() is a common pattern in JCR; it's perfectly fine to create your structure like that.
To make your code simpler you could use a utility class that takes the path of a node that's deep in your hierarchy, and creates intermediate nodes as needed. The JcrUtils.getOrCreateByPath method provided by the Jackrabbit commons module does that.
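For example, a rough sketch with JcrUtils (from org.apache.jackrabbit.commons); the paths follow your structure, the node types are assumptions, and "session" is an already-open JCR Session:

// Creates /ParentFolder and /ParentFolder/ChildFolder3 if they don't exist yet.
Node childFolder3 = JcrUtils.getOrCreateByPath("/ParentFolder/ChildFolder3", "nt:folder", session);
JcrUtils.getOrCreateByPath("/ParentFolderConfig/ChildFolderConfig3", "nt:folder", session);

// An nt:file node needs a jcr:content child before the save.
Node file1 = childFolder3.addNode("File1", "nt:file");
Node content = file1.addNode("jcr:content", "nt:resource");
content.setProperty("jcr:data",
        session.getValueFactory().createBinary(new ByteArrayInputStream(new byte[0])));

// ...create the rest of the folders and files the same way...

session.save();  // a single save persists the whole structure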
If I have a property of an object which is a large String (say the contents of a file, ~50 KB to 1 MB, maybe larger), what is the practice around declaring such a property in a POJO? All I need is to be able to set the value from one layer of my application and transfer it to another without making the object itself "heavy".
I was considering whether it makes sense to associate an InputStream or OutputStream to get / set the value, rather than reference the String itself, which means that when I read the contents, I read them as a stream of bytes rather than a whole huge string loaded into memory. Thoughts?
What you're describing depends largely on your anticipated use of the data. If you're delivering the contents in raw form, then there may be more efficient ways to manage it.
For example, if your app has a web interface, your app may just provide a URL for a web server to stream the contents to the requester. If it's a CLI-based app, you may be able to get away with a simple file copy. If your app is processing the file, however, then perhaps your POJO could retain only the results of that processing rather than the raw data itself.
If you wish to provide a general pattern along the lines of POJOs with references to external streams, I would suggest storing in your POJO something akin to a URI that tells where to find the stream (like a row ID in a database, a filename, or a URI) rather than storing an instance of the stream itself. In doing so, you'll reduce the number of open file handles, prevent potential concurrency issues, and be able to serialize those objects locally if needed without having to duplicate the raw data persisted elsewhere.
You could have an object that supplies a stream or an iterator every time you access it. Note that the content has to live on some storage, like a file; i.e., your object stores a pointer (e.g. a file path) to the storage, and every time someone accesses it, you open a stream or create an iterator and let that party read. Note also that in order to save memory, whoever consumes it has to make sure not to hold the whole content in memory.
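A minimal sketch of that idea, storing only a pointer (here a java.nio.file.Path) and opening a fresh stream per access; the class and field names are made up:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class DocumentRef {
    private final Path contentPath;  // pointer to where the content actually lives

    public DocumentRef(Path contentPath) {
        this.contentPath = contentPath;
    }

    // Opens a new stream on every call; the caller must close it (try-with-resources).
    public InputStream openContent() throws IOException {
        return Files.newInputStream(contentPath);
    }
}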
However, 50 KB or 1 MB is really tiny. Unless you have gigabytes (or maybe hundreds of megabytes), I wouldn't try to do something like that.
Also, even if you have large data, it's often simpler to just use files or whatever storage you'll use.
tl;dr: Just use String.