Reading Java properties files in Hadoop MapReduce applications - java

I was wondering what the standard practice is for reading Java properties files in MapReduce applications, and how to pass the file's location when submitting (starting) a job.
In regular Java applications you can pass the location to the properties file as a JVM system property (-D) or argument to main method.
What is the best alternative (standard practice) for this for MapReduce jobs? Some good examples would be very helpful.

The best alternative is to use the DistributedCache, though that may not be the only way; there may be other approaches, but I haven't seen code using anything else so far.
The idea is to add the file to the cache, read it inside the setup method of the map/reduce task, and load the values into a Properties object or a Map. A snippet is added below.
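A rough sketch of what that could look like, assuming the driver added the file to the cache with something like job.addCacheFile(new URI("/config/app.properties#app.properties")) (newer mapreduce API; older code would use DistributedCache.addCacheFile). The path, the symlink name "app.properties" and the property key are all made up for illustration:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.Properties;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PropsAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Properties props = new Properties();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The cached file appears in the task's working directory under its
        // symlink name ("app.properties" here).
        FileInputStream in = new FileInputStream("app.properties");
        try {
            props.load(in);
        } finally {
            in.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the loaded properties wherever the map logic needs them.
        String threshold = props.getProperty("some.threshold", "0");
        // ... actual map logic goes here ...
    }
}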
Now I remember, a friend of mine (JtheRocker) used another approach: he set the entire contents of the file against a key in the Configuration object, read that value in setup, then parsed the pairs and loaded them into a Map. In this case the file is read on the driver side instead of on the task side. While it's suitable for small files and seems cleaner, purists may not like polluting the conf at all.
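A rough sketch of that second approach; the configuration key "props.blob" and the local path are made up for illustration:

import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;

public class ConfPropsExample {

    // Driver side: read the whole file and stash its contents under a conf key.
    public static void stashProps(Configuration conf, String localPath) throws IOException {
        String blob = new String(Files.readAllBytes(Paths.get(localPath)),
                StandardCharsets.UTF_8);
        conf.set("props.blob", blob);
    }

    // Task side: call from Mapper/Reducer setup() with context.getConfiguration().
    public static Properties unstashProps(Configuration conf) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(conf.get("props.blob")));
        return props;
    }
}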
I'd like to see what other posts bring out.

Related

Apache Spark - run external exe or jar file in parallel

I have an .exe file (I don't have the source files, so I won't be able to edit the program) that takes as a parameter the path to the file to be processed and produces results at the end. For example, in a console I run the program as follows: program.exe -file file_to_process [other_parameters]. I also have an executable jar file that takes two parameters, file_to_process and a second file, plus [other_parameters]. In both cases I would like to split the input file into small parts and run the programs on them in parallel. Is there any way to do this efficiently with the Apache Spark Java framework? I'm new to parallel computation, and I have read about RDDs and the pipe operator, but I don't know if they would work in my case because I have a path to a file.
I will be very grateful for some help or tips.
I have run into similar issues recently, and I have working code with Spark 2.1.0. The basic idea is that you put your exe, together with its dependencies such as dlls, into HDFS or your local filesystem and use addFile to add them to the driver, which will also copy them to the worker executors. Then you can load your input file as an RDD and use the mapPartitionsWithIndex function to save each partition to a local file and execute the exe against that partition (use SparkFiles.get to get the path on the worker executor) via Process.
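A rough Java sketch of that recipe. It assumes program.exe accepts a -file argument and writes its results next to the input file with an .out suffix; the HDFS paths, file names and flags are all illustrative:

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function2;

public class ExeRunner {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("exe-runner");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Ship the executable (and any dlls placed alongside it) to every executor.
        sc.addFile("hdfs:///tools/program.exe");

        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");

        Function2<Integer, Iterator<String>, Iterator<String>> runExe = (index, it) -> {
            // Write this partition to a local temp file on the executor.
            Path part = Files.createTempFile("part-" + index + "-", ".txt");
            try (BufferedWriter w = Files.newBufferedWriter(part)) {
                while (it.hasNext()) {
                    w.write(it.next());
                    w.newLine();
                }
            }
            // Locate the shipped exe on this executor and run it on the local file.
            String exe = SparkFiles.get("program.exe");
            Process p = new ProcessBuilder(exe, "-file", part.toString()).inheritIO().start();
            p.waitFor();
            // Collect whatever output the program produced for this partition.
            List<String> output = Files.readAllLines(Paths.get(part + ".out"));
            return output.iterator();
        };

        JavaRDD<String> results = lines.mapPartitionsWithIndex(runExe, true);
        results.saveAsTextFile("hdfs:///data/output");

        sc.stop();
    }
}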
Hope that helps.
I think the general answer is "no". Spark is a framework that, in general, provides very specific mechanisms for cluster configuration, shuffling its own data, reading big inputs (typically from HDFS), monitoring task completion and retries, and performing efficient computation. It is not well suited to a case where you have a program you can't touch that expects a file from the local filesystem.
I guess you could put your inputs on HDFS, then, since Spark accepts arbitrary Java/Scala code, use whatever language facilities you have to dump the data to a local file, launch a process (e.g. via ProcessBuilder), and then build some logic to monitor for completion (maybe based on the content of the output). The mapPartitions() Spark method would be the one best suited for this.
That said, I would not recommend it. It would be ugly and complex, require you to mess with permissions on the nodes and things like that, and it would not take good advantage of Spark's strengths.
Spark is well suited to your problem, though, especially if each line of your file can be processed independently. I would look for a way to get the program's code, for a library that does the same thing, or check whether the algorithm is trivial enough to re-implement.
Probably not the answer you were looking for though :-(

parsing vagrant file inside java

I need to parse the configuration defined in a Vagrantfile written in Ruby and use the settings elsewhere in my Java code. I tried exploring jRubyParser but didn't come across any documentation that describes its use.
I cloned the Vagrant repo locally, but browsing through the code doesn't help either, as I have no prior experience with Ruby. How does Vagrant read the configuration defined in the file? Any inputs?
A Vagrantfile is a regular Ruby script, i.e. it's meant to be interpreted by the Ruby interpreter rather than read as a configuration file.
To make things harder, some configuration options aren't declared as top-level variables in the Vagrantfile, but rather as properties of objects inside certain calls (like config.vm.provider).
Depending on how complex your configuration is, I would consider just reading the file line by line and doing regular-expression matching to pull out the variables you need. Not the most elegant solution, but probably much quicker to implement than the alternatives.
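For illustration, a small sketch of that line-by-line regex approach; the pattern below only handles simple quoted assignments such as config.vm.box = "ubuntu/trusty64", not nested provider blocks:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VagrantfileScraper {

    // Matches lines like: config.vm.box = "ubuntu/trusty64"
    private static final Pattern ASSIGNMENT =
            Pattern.compile("^\\s*config\\.([\\w.]+)\\s*=\\s*\"([^\"]*)\"");

    public static Map<String, String> scrape(String path) throws IOException {
        Map<String, String> settings = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            Matcher m = ASSIGNMENT.matcher(line);
            if (m.find()) {
                settings.put(m.group(1), m.group(2));   // e.g. "vm.box" -> "ubuntu/trusty64"
            }
        }
        return settings;
    }
}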
Also, if your provider is always the same, say VirtualBox, maybe you could get some of your configuration from there. In that case, you would just need to read the file located somewhere in the "VirtualBox VMs" directory (on a Mac, it's in "$HOME/VirtualBox VMs"). It's an XML file, so you could use one of the Java XML parsers to get what you need.

Options for file backed persistence in Java and Spring

I am inexperienced with Spring and I've been reading up on persistence options in it, as I am trying to find a suitable way to store data without using a database such as Oracle or MySQL.
When my app loads, it will read a file containing IDs. As the app runs, it may gain new IDs which will need to be written to the file in case of a crash. From what I can tell, I will need to replace the whole file each time, which is fine, as the data should be held in RAM and I can just overwrite the original file.
What I would prefer, however, is a way, in Spring or even plain Java, to keep the file and the data in sync, so that if I add one new ID to my list, a single line is automatically appended to the end of the file without me having to write extra file-management code. I know I could probably just append the line myself, but something that basic probably won't be thread safe, and thread safety is a major concern here. I'd rather find a ready-made library than re-invent the wheel.
So, can anyone point me in the direction of a tutorial, or technology, that allows for what I need? Or tell me if one exists, or how best I should go about this?
Thanks.
EDIT: It seems Spring's resource bundle is the way forward, but I don't think it does exactly what I need: using it, I would still have to write code both to add to the map and to add to the file.
Take a look at SQLite.
It is a thread-safe, serverless SQL database with a Java driver.
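For instance, a minimal sketch using the xerial sqlite-jdbc driver (the database file name and table are made up). Each insert is an atomic, durable write, so there is no need to rewrite a whole file when a new ID arrives:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

public class IdStore {

    private final String url = "jdbc:sqlite:ids.db";

    public IdStore() throws SQLException {
        try (Connection c = DriverManager.getConnection(url);
             Statement s = c.createStatement()) {
            s.execute("CREATE TABLE IF NOT EXISTS ids (id TEXT PRIMARY KEY)");
        }
    }

    // Safe to call from multiple threads; SQLite serializes the writes.
    public void addId(String id) throws SQLException {
        try (Connection c = DriverManager.getConnection(url);
             PreparedStatement ps = c.prepareStatement(
                     "INSERT OR IGNORE INTO ids (id) VALUES (?)")) {
            ps.setString(1, id);
            ps.executeUpdate();
        }
    }
}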
EDIT
Another option is Spring Batch's support for flat files.
See http://docs.spring.io/spring-batch/reference/html/readersAndWriters.html#flatfiles
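A very small sketch of that flat-file route, using FlatFileItemWriter as it looks in the Spring Batch 3.x/4.x line (the file name and items are illustrative). In a real job you would normally wire this up as a step's writer rather than drive it by hand:

import java.util.Arrays;

import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.file.FlatFileItemWriter;
import org.springframework.batch.item.file.transform.PassThroughLineAggregator;
import org.springframework.core.io.FileSystemResource;

public class IdFileWriter {
    public static void main(String[] args) throws Exception {
        FlatFileItemWriter<String> writer = new FlatFileItemWriter<>();
        writer.setResource(new FileSystemResource("ids.txt"));
        writer.setAppendAllowed(true);                        // append new IDs instead of rewriting
        writer.setLineAggregator(new PassThroughLineAggregator<>());
        writer.afterPropertiesSet();

        writer.open(new ExecutionContext());
        writer.write(Arrays.asList("id-12345"));              // one ID becomes one line
        writer.close();
    }
}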

java - write two files atomically

I am facing a problem for which I don't have a clean solution. I am writing a Java application, and the application stores certain data in a limited set of files. We are not using any database, just plain files. Due to some user-triggered action, certain files need to be changed. I need this to be an all-or-nothing operation. That is, either all files are updated, or none of them. It is disastrous if, for example, 2 of the 5 files are changed while the other 3 are not due to some IOException.
What is the best strategy to accomplish this?
Is embedding an in-memory database, such as hsqldb, a good way to get this kind of atomicity/transactional behavior?
Thanks a lot!
A safe approach IMO is:
Back up the files
Maintain a list of processed files
On exception, restore the files that have already been processed from their backups.
How practical this is depends on how heavy the operation is going to be and on your limits for time and such.
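A rough sketch of that backup/restore idea; the ".bak" suffix is illustrative, and note that this protects against exceptions during the update, not against the JVM being killed mid-rollback:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class AllOrNothingUpdate {

    public interface FileUpdater {
        void update(Path file) throws IOException;
    }

    public static void updateAll(List<Path> files, FileUpdater updater) throws IOException {
        List<Path> processed = new ArrayList<>();
        try {
            // 1. Back up every file before touching anything.
            for (Path f : files) {
                Files.copy(f, backupOf(f), StandardCopyOption.REPLACE_EXISTING);
            }
            // 2. Apply the changes, remembering which files have been modified.
            for (Path f : files) {
                updater.update(f);
                processed.add(f);
            }
        } catch (IOException e) {
            // 3. Restore only the files that were already modified.
            for (Path f : processed) {
                Files.copy(backupOf(f), f, StandardCopyOption.REPLACE_EXISTING);
            }
            throw e;
        }
    }

    private static Path backupOf(Path f) {
        return f.resolveSibling(f.getFileName() + ".bak");
    }
}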
What is the best strategy to accomplish this? Is embedding an in-memory database, such as hsqldb, a good way to get this kind of atomicity/transactional behavior?
Yes. If you want transactional behavior, use a well-tested system that was designed with that in mind instead of trying to roll your own on top of an unreliable substrate.
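For illustration, a minimal sketch of that direction with embedded HSQLDB in file mode; the JDBC URL, table and values are made up, the point being that the updates either all commit or all roll back:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class TransactionalUpdate {
    public static void main(String[] args) throws SQLException {
        Connection c = DriverManager.getConnection("jdbc:hsqldb:file:data/appdb", "SA", "");
        try {
            c.setAutoCommit(false);
            Statement s = c.createStatement();
            s.executeUpdate("UPDATE settings SET value = 'x' WHERE name = 'a'");
            s.executeUpdate("UPDATE settings SET value = 'y' WHERE name = 'b'");
            c.commit();          // both updates become visible together
        } catch (SQLException e) {
            c.rollback();        // neither update is applied on failure
            throw e;
        } finally {
            c.close();
        }
    }
}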
File systems do not, in general, support transactions involving multiple files.
Non-Windows file-systems and NTFS tend to have the property that you can do atomic file replacement, so if you can't use a database and
all of the files are under one reasonably small directory
which your application owns and
which is stored on one physical drive:
then you could do the following:
Copy the directory contents using hard-links as appropriate.
Modify the 5 files.
Atomically swap the modified copy of the directory with the original (one way to do the swap is sketched below).
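Directories can't generally be renamed over each other atomically, so one common way to implement the final swap is to keep a symlink pointing at the live directory and replace that symlink atomically. A rough sketch, with all names made up; it relies on POSIX rename semantics, and per the Files.move javadoc the behaviour when the target already exists is implementation-specific:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class DirectorySwap {

    // liveLink is a symlink (e.g. "data") that readers always go through;
    // newVersion is the fully prepared copy of the directory.
    public static void swap(Path newVersion, Path liveLink) throws IOException {
        Path tmpLink = liveLink.resolveSibling(liveLink.getFileName() + ".tmp");
        Files.deleteIfExists(tmpLink);
        Files.createSymbolicLink(tmpLink, newVersion.toAbsolutePath());
        // On POSIX file systems this rename atomically repoints the live link,
        // so readers see either the old directory or the new one, never a mix.
        Files.move(tmpLink, liveLink, StandardCopyOption.ATOMIC_MOVE);
    }
}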
I've used the Apache Commons Transaction library for atomic file operations with success. It allows you to modify files transactionally and potentially roll back on failures.
Here's a link: http://commons.apache.org/transaction/
My approach would be to use a lock in your Java code, so that only one process can write a given file at a time. I'm assuming your application is the only one writing these files.
If, even so, some write problem occurs, then to "roll back" your files you need to keep a copy of them, as suggested above.
Can't you lock all the files and only write to them once all files have been locked?
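A small sketch of that idea using java.nio file locks; note that a FileLock guards against other processes, so threads inside the same JVM still need an ordinary lock such as ReentrantLock (the file handling below is illustrative):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class MultiFileLock {

    public static void updateAll(List<Path> files) throws IOException {
        List<FileChannel> channels = new ArrayList<>();
        try {
            // Acquire an exclusive lock on every file before writing anything.
            for (Path p : files) {
                FileChannel ch = FileChannel.open(p,
                        StandardOpenOption.READ, StandardOpenOption.WRITE);
                channels.add(ch);
                ch.lock();   // exclusive lock, held until the channel is closed
            }
            // ... perform the writes on all channels here ...
        } finally {
            // Closing each channel also releases its lock.
            for (FileChannel ch : channels) {
                ch.close();
            }
        }
    }
}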

Shipping Java code with data baked into the .jar

I need to ship some Java code that has an associated set of data. It's a simulator for a device, and I want to be able to include all of the data used for the simulated records in the one .JAR file. In this case, each simulated record contains four fields (calling party, called party, start of call, call duration).
What's the best way to do that? I've gone down the path of generating the data as Java statements, but IntelliJ doesn't seem particularly happy dealing with a 100,000 line Java source file!
Is there a smarter way to do this?
In the C#/.NET world I'd create the data as a separate file, embed it in the assembly as a resource, and then use reflection to pull that out at runtime and access it. I'm unsure of what the appropriate analogy is in the Java world.
FWIW, Java 1.6, shipping for Solaris.
It is perfectly OK to include static resource files in the JAR. This is commonly done with properties files. You can access the resource with the following:
Class.getResourceAsStream("/some/pkg/resource.properties");
where the leading / means the path is resolved relative to the root of the classpath.
This article deals with the subject Smartly load your properties.
Sure, just include them in your jar and do
InputStream is = this.getClass().getClassLoader().getResourceAsStream("file.name");
If you put them under some folder, like "data", then just do
InputStream is = this.getClass().getClassLoader().getResourceAsStream("data/file.name");
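Tying this back to the original question, here is a sketch of reading a bundled CSV of simulated call records this way. The resource name "data/records.csv" and the column order are assumptions, and it is written in Java 6-compatible style since the question mentions Java 1.6:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class RecordLoader {

    public static List<String[]> loadRecords() throws IOException {
        List<String[]> records = new ArrayList<String[]>();
        InputStream is = RecordLoader.class.getClassLoader()
                .getResourceAsStream("data/records.csv");
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // callingParty,calledParty,startOfCall,callDurationSeconds
                records.add(line.split(",", 4));
            }
        } finally {
            reader.close();
        }
        return records;
    }
}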
