textFileStream in Spark


I have the following code:
SparkConf sparkConf = new SparkConf().setAppName("My app")
.setMaster("local[4]")
.set("spark.executor.memory", "2g")
.set("spark.driver.allowMultipleContexts", "true");
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
JavaDStream<String> trainingData = jssc.textFileStream("filesDirectory");
trainingData.print();
jssc.start();
jssc.awaitTermination();
Unfortunately, for any file that already exists in the directory to be streamed, I have to edit and rename it after starting the streaming context; otherwise it is not processed.
Do I have to edit and rename each file to get it processed, or is there another way to process the existing files, for example by just editing and saving them?
P.S. When I move a new file into this directory, I also have to edit and rename it before it is streamed!

Try touching the file (updating its modification time) before moving it to the destination directory.
Below is what the documentation comment in the Spark source says:
Identify whether the given path is a new file for the batch of currentTime. For it to be
accepted, it has to pass the following criteria.
It must pass the user-provided file filter.
It must be newer than the ignore threshold. It is assumed that files older than the ignore
threshold have already been considered or are existing files before start
(when newFileOnly = true).
It must not be present in the recently selected files that this class remembers.
It must not be newer than the time of the batch (i.e. currentTime for which this
file is being tested. This can occur if the driver was recovered, and the missing batches
(during downtime) are being generated. In that case, a batch of time T may be generated
at time T+x. Say x = 5. If that batch T contains file of mod time T+5, then bad things can
happen. Let's say the selected files are remembered for 60 seconds. At time t+61,
the batch of time t is forgotten, and the ignore threshold is still T+1.
The files with mod time T+5 are not remembered and cannot be ignored (since, t+5 > t+1).
Hence they can get selected as new files again. To prevent this, files whose mod time is more
than current batch time are not considered.
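The "touch, then move" suggestion can be sketched in plain Java as follows. This is only an illustration under assumptions: the class and method names are hypothetical, the staging file and the monitored directory are assumed to be on the same local filesystem (so the move is a rename that keeps the fresh modification time), and a directory on HDFS would need the Hadoop FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;
import java.time.Instant;

// Hypothetical helper, not part of Spark.
final class TouchAndMove {

    // Updates the file's modification time to "now", then moves it into the
    // monitored directory in one step so the stream never sees a partial file.
    static Path touchAndMove(Path source, Path monitoredDir) throws IOException {
        // "touch": make the file newer than the stream's ignore threshold
        Files.setLastModifiedTime(source, FileTime.from(Instant.now()));
        Path target = monitoredDir.resolve(source.getFileName());
        // Assumes source and target are on the same filesystem.
        return Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Call it only after the streaming context has started, for example `TouchAndMove.touchAndMove(Paths.get("staging/data.txt"), Paths.get("filesDirectory"))`, where `staging/data.txt` is a made-up path.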

JavaStreamingContext.textFileStream returns a DStream backed by a FileInputDStream, which is meant for monitoring a folder to which files are added regularly. With the two-second batch interval above, the folder is checked every two seconds, and a batch contains data only when a new file has been added (or a file's modification time has been updated) since the previous check.
If your intent is just to read the existing files once, you can use SparkContext.textFile instead.
The documentation comment in the source of JavaStreamingContext.textFileStream() says:
/**
* Create a input stream that monitors a Hadoop-compatible filesystem
* for new files and reads them as text files (using key as LongWritable, value
* as Text and input format as TextInputFormat). Files must be written to the
* monitored directory by "moving" them from another location within the same
* file system. File names starting with . are ignored.
*/

Related

Reading current and new files from a directory using Java

I have written a program to process files in a directory. At startup it reads the files currently in the directory, and then it uses a monitor to discover new files. Once it has processed a file, the program deletes the file. The problem is that there is a time gap, no matter how slight, between reading the files in the directory at startup and starting the listener. A file created in that gap would be missed. One possible solution would be to repeatedly read the files in the directory (newDirectoryStream), but that doesn't seem as elegant, or possibly as efficient, as using a monitor. The code uses the Apache Commons IO monitor and looks something like this:
// Read current files
try (DirectoryStream<Path> stream = Files.newDirectoryStream(listenDir)) {
    for (Path file : stream) {
        processFile(file);
    }
}
// Process new files
FileAlterationObserver observer =
        new FileAlterationObserver(listenDir.toAbsolutePath().toString(), filter);
FileAlterationMonitor monitor = new FileAlterationMonitor(POLL_INTERVAL);
FileAlterationListener listener = new FileAlterationListenerAdaptor() {
    @Override
    public void onFileCreate(File file) {
        processFile(file.toPath());
    }
};
observer.addListener(listener);
monitor.addObserver(observer);
monitor.start();

Simply flip the order: set up the listener first, and only then obtain the directory stream. Route both through a concurrent set to get an "only once ever" guarantee: whoever wants to process a file first adds its name to the set and checks the return value. If the add reports that the name was newly added, that caller handles the file; if the name was already there, the caller skips it. This way, a file that appears right in the gap is seen by both the directory stream and the observer, but only one of them processes it.
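A minimal sketch of that "only once" gate, assuming a hypothetical OnceOnlyProcessor class that both the listener callback and the directory-stream loop call into. It relies on Set.add returning true only for the caller that actually inserted the element, which ConcurrentHashMap.newKeySet() makes safe across threads.

```java
import java.nio.file.Path;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical helper: ensures each file is handled by exactly one caller.
final class OnceOnlyProcessor {
    // Thread-safe set of files that have already been claimed.
    private final Set<Path> claimed = ConcurrentHashMap.newKeySet();
    private final Consumer<Path> action;

    OnceOnlyProcessor(Consumer<Path> action) {
        this.action = action;
    }

    // Returns true if this call processed the file, false if another
    // caller (listener or directory scan) had already claimed it.
    boolean handle(Path file) {
        Path key = file.toAbsolutePath().normalize();
        if (!claimed.add(key)) {
            return false; // someone else got here first
        }
        action.accept(key);
        return true;
    }
}
```

Start the monitor first with onFileCreate calling handle(file.toPath()), then iterate the directory stream and call handle(path) for each entry. Note that the set grows with every file seen; if file names can be reused after deletion, entries would have to be removed at some point.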

How to prevent file wipe if an error occurs while writing to it?

This is an issue I have had in many applications.
I want to change the information inside a file, which has an outdated version.
In this instance, I am updating the file that records playlists after adding a song to a playlist. (For reference, I am creating an app for android.)
The problem is if I run this code:
FileOutputStream output = new FileOutputStream(file);
output.write(data.getBytes());
output.close();
And if an IOException occurs while trying to write to the file, the data is lost (since creating an instance of FileOutputStream empties the file). Is there a better method to do this, so if an IOException occurs, the old data remains intact? Or does this error only occur when the file is read-only, so I just need to check for that?
My only "workaround" is to inform the user of the error and give them the correct data, which they then have to update manually. While this might work for a developer, there are many issues that could occur if this happens. Additionally, in this case the user doesn't have permission to edit the file themselves, so the "workaround" doesn't work at all.
Sorry if someone else has asked this. I couldn't find a result when searching.
Thanks in advance!

One way you could ensure that you do not wipe the file is by creating a new file with a different name first. If writing that file succeeds, you could delete the old file and rename the new one.
There is the possibility that renaming fails. To be completely safe from that, your files could be named according to the time at which they are created. For instance, if your file is named save.dat, you could add the time at which the file was saved (from System.currentTimeMillis()) to the end of the file's name. Then, no matter what happens later (including failure to delete the old file or rename the new one), you can recover the most recent successful save. I have included a sample implementation below which represents the time as a 16-digit zero-padded hexadecimal number appended to the file extension. A file named save.dat will be instead saved as save.dat00000171ed431353 or something similar.
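import java.io.File;
import java.util.Arrays;
import java.util.Comparator;
import java.util.regex.Pattern;

// name includes the file extension (e.g. "save.dat").
static File fileToSave(File directory, String name) {
    return new File(directory, name + String.format("%016x", System.currentTimeMillis()));
}
// Returns the most recent save, or null if there is none. Return the entire sorted array instead if you need
// older versions for which deletion failed, for instance to purge any unnecessary older versions.
static File fileToLoad(File directory, String name) {
    // Accept only "name" followed by exactly 16 hex digits, so the parse below cannot hit unrelated files.
    String pattern = Pattern.quote(name) + "[0-9a-f]{16}";
    File[] files = directory.listFiles((dir, n) -> n.matches(pattern));
    if (files == null || files.length == 0) {
        return null;
    }
    Arrays.sort(files, Comparator.comparingLong((File file) -> Long.parseLong(file.getName().substring(name.length()), 16)).reversed());
    return files[0];
}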
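The first suggestion above (write a new file, then rename it over the old one) can be sketched with java.nio.file. This is an illustration under assumptions, not the answer's own code: the SafeWriter name is hypothetical, the temporary file is assumed to live in the same directory as the target, and whether an atomic move replaces an existing target is implementation specific, so a non-atomic replace is used as a fallback.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: the old file is only replaced after the new data
// has been written completely.
final class SafeWriter {

    static void writeSafely(Path target, String data) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Write to a temporary file next to the target first.
        Path temp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
        try {
            Files.write(temp, data.getBytes(StandardCharsets.UTF_8));
            try {
                // If this throws, the old file is still intact.
                Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            // No-op after a successful move; cleans up after a failure.
            Files.deleteIfExists(temp);
        }
    }
}
```

If the write fails with an IOException, the exception propagates and the original file keeps its previous contents.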

Do not need .0 extension log file when using logger with FileHandler from java.util.logging

I am using java.util.logging's FileHandler class to create rotating logs. Why do these logs get a .0 extension? .1, .2, .3 etc. are fine; I only want to avoid .0 on the file name, since it is confusing for the customer. Is there any way to achieve this?
I am using Java version "1.8.0_144".
FileHandler fileTxt = new FileHandler(pathOfLogFile+"_%g.log",
Integer.parseInt(prop.getProperty("MAX_LOG_FILE_SIZE")),
Integer.parseInt(prop.getProperty("NO_OF_LOG_FILES")), true);
SimpleFormatter formatterTxt = new SimpleFormatter();
fileTxt.setFormatter(formatterTxt);
logger.addHandler(fileTxt);
The name of the log file is LOG_0.log. The requirement is to have no _0 on the latest file; it should simply be LOG.log.

You'll have to add more information about how you are setting up your FileHandler. Include the code and/or the logging.properties file.
Most likely you are creating multiple open file handlers and are simply not closing the previous one before you create the next one. This can happen due to a bug in your code, or because you are not holding a strong reference to the logger that holds your FileHandler. Another way to cause this issue is to have two running JVM processes, in which case you have no option but to choose where the unique number is placed in your file name.
Specify both the %g and the %u token in your file pattern, for example %gfoo%u.log.
Per the FileHandler documentation:
If no "%g" field has been specified and the file count is greater than one, then the generation number will be added to the end of the generated filename, after a dot.
[snip]
Normally the "%u" unique field is set to 0. However, if the FileHandler tries to open the filename and finds the file is currently in use by another process it will increment the unique number field and try again. This will be repeated until FileHandler finds a file name that is not currently in use. If there is a conflict and no "%u" field has been specified, it will be added at the end of the filename after a dot. (This will be after any automatically added generation number.)
Thus if three processes were all trying to log to fred%u.%g.txt then they might end up using fred0.0.txt, fred1.0.txt, fred2.0.txt as the first file in their rotating sequences.
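The naming rules quoted above can be observed with a small experiment. The sketch below is only illustrative: the PatternDemo class and filesCreatedBy method are hypothetical names, and the 1024-byte limit is an arbitrary choice. It opens a FileHandler in a fresh temporary directory, writes one record, closes the handler, and lists what was created.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.logging.FileHandler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.SimpleFormatter;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class for inspecting FileHandler file names.
final class PatternDemo {

    // Returns the names of the files a FileHandler with the given pattern
    // and file count leaves behind after logging a single record.
    static List<String> filesCreatedBy(String pattern, int count) throws IOException {
        Path dir = Files.createTempDirectory("fh-demo");
        FileHandler handler =
                new FileHandler(dir.resolve(pattern).toString(), 1024, count, false);
        handler.setFormatter(new SimpleFormatter());
        handler.publish(new LogRecord(Level.INFO, "hello"));
        handler.close(); // also removes the .lck lock file
        try (Stream<Path> files = Files.list(dir)) {
            return files.map(p -> p.getFileName().toString())
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```

Calling filesCreatedBy("LOG_%g.log", 3) shows the generation number where %g was placed, while filesCreatedBy("LOG.log", 3) shows it appended after a dot, as the documentation describes.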
If you want to remove the zero from the first file only, then the best approximation of the out-of-the-box behavior would be to modify your code to do the following:
1. If no LOG_0.log file exists, use File.renameTo to add the zero to the file. This changes LOG.log -> LOG_0.log.
2. Trigger a rotation. This results in LOG_0.log -> LOG_1.log etc., and then LOG_N.log -> LOG_0.log.
3. Use File.renameTo to remove the zero from the file: LOG_0.log -> LOG.log.
4. Open your file handler with the number of logs set to one and no append. This wipes the oldest log file and starts your new current one.
The problem with this is that your code won't rotate based on file size during a single run.

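Another option is to use the logger name as the file name and leave %g out of the pattern. Note, however, that per the documentation quoted above, FileHandler then appends the generation number after a dot whenever the file count is greater than one, so the latest file becomes filename.log.0; it is plain filename.log only when the count is one. The rotated files will likewise have the numbers as their extension.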

Java copy-overwrite file, gets old file when reading

In a unit test I am overwriting a config file to test handling bad property values.
I am using Apache Commons IO:
org.apache.commons.io.FileUtils.copyFile(new File(configDir, "xyz.properties.badValue"), new File(configDir, "xyz.properties"), false)
When investigating the file system I can see that xyz.properties is in fact overwritten - size is updated and the content is the same as that of xyz.properties.badValue.
When I complete the test case which goes through code that reads the file into a Properties object (using a FileReader object) I get the properties of the original xyz.properties file, not the newly copied version.
Through debugging where I single step and investigate the file I can rule out it being a timing issue of writing to the file system.
Does the copy step somehow hold a file handle? If so how would I release it again?
If not, does anybody have any idea why this happens and how to resolve it?
Thanks.

If you opened the FileReader before the copy, it may still be reading the old version of the file.
You'll need to open a new one after the copy:
FileReader f = new FileReader("the.file");
// Copy and overwrite "the.file"
f.close();
f = new FileReader("the.file");
In the Unix filesystem model, the inode containing the file's contents will persist as long as someone has an open filehandle into the file, or there is a directory entry pointing to it.
Replacing the file's name in the directory does not remove the inode (the contents of the file), so your already-open filehandle can continue to be used.
This is actually exploitable to create temporary files that never need to be cleaned up: create the file, then unlink it immediately while keeping it open. When you close the file handle, the inode is reaped.

I realize that this doesn't answer your question directly, but I think that it would be better to maintain two separate files, and arrange for your code to have the name of the configuration file configurable / injected at runtime. That way, your tests can specify which config file to use, rather than overwriting a single file.
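A minimal sketch of that idea, assuming a hypothetical ConfigLoader class: the code under test receives the configuration path as a parameter instead of hard-coding xyz.properties, so a test can point it at xyz.properties.badValue directly.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical loader: the caller decides which config file is read.
final class ConfigLoader {

    static Properties load(Path configFile) throws IOException {
        Properties props = new Properties();
        // A fresh reader is opened on every call, so no stale handle is reused.
        try (Reader reader = Files.newBufferedReader(configFile, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }
}
```

Production code would call load with the regular file, and the unit test with the bad-value file, without copying anything over the original.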

Java: Efficient way to scan a folder for a particular file

I am contacting an external service from my Java app.
The flow is as follows: I generate an XML file and put it in a folder; the service then processes the file and returns another file with the same name plus the extension .out.
Right now, after I put the file in the folder, I start a loop that runs until I get that file back, so I can read the result.
Here is the code:
fileName += ".out";
File f = new File(fileName);
do
{
f = new File(fileName);
} while (!f.exists());
response = readResponse(fileName); // got the response now read it
My question is: am I doing this the right way, or is there a better/more efficient way to wait for the file?
Some info: I run my app on WinXP, it usually takes the external service less than a second to respond with a file, and I send around 200 requests per day to this service. The path to the folder with the result file is always the same.
All suggestions are welcome.
Thank you for your time.

There's no reason to recreate the File object. It just represents the file location, whether the file exists or not. Also you probably don't want a loop without at least a short delay, otherwise it'll just max out a processor until the file exists. You probably want something like this instead:
File file = new File(fileName);
while (!file.exists()) {
    Thread.sleep(100); // throws InterruptedException; declare or handle it in the caller
}
Edit: Ingo makes a great point in the comments. The file might not be completely written just because it exists. One way to guarantee that it's ready is to have the first process create a second file after the first is completely written. Then have the Java program detect that second file, delete it, and then safely read the first one.
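The marker-file approach could look roughly like this. It is a sketch under assumptions: the external service would have to be changed to create the marker file (the name and location are up to you), and MarkerWait, readWhenReady, and the timeout parameter are hypothetical additions that keep the loop from waiting forever.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: waits for a marker file, then reads the response.
final class MarkerWait {

    // Polls for the marker file; once it appears, deletes it and returns the
    // response contents. Returns empty if the timeout elapses first.
    static Optional<String> readWhenReady(Path response, Path marker,
                                          long timeoutMillis, long pollMillis)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!Files.exists(marker)) {
            if (System.nanoTime() - deadline >= 0) {
                return Optional.empty(); // gave up waiting
            }
            Thread.sleep(pollMillis); // avoid spinning a CPU core
        }
        Files.deleteIfExists(marker);
        return Optional.of(new String(Files.readAllBytes(response), StandardCharsets.UTF_8));
    }
}
```

With responses usually arriving in under a second, a poll interval of around 100 ms and a timeout of a few seconds would be a reasonable starting point.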
