Apache Spark - run external exe or jar file in parallel - Java

I have an .exe file (I don't have the source files, so I won't be able to edit the program) that takes as a parameter the path to the file to be processed and produces results at the end. For example, in the console I run the program as follows: program.exe -file file_to_process [other_parameters]. I also have an executable jar file which takes two parameters, file_to_process and a second file, plus [other_parameters]. In both cases I would like to split the input file into smaller parts and run the programs in parallel. Is there any way to do this efficiently with the Apache Spark Java framework? I'm new to parallel computation and I have read about RDDs and the pipe operator, but I don't know whether they would be a good fit in my case, because what I have is a path to a file.
I would be very grateful for any help or tips.

I have run into similar issues recently, and I have working code with Spark 2.1.0. The basic idea is that you put your exe, together with its dependencies such as DLLs, into HDFS or your local filesystem and use addFile to add them to the driver, which will also copy them to the worker executors. Then you can load your input file as an RDD and use the mapPartitionsWithIndex function to save each partition to a local file and execute the exe against that partition using Process (use SparkFiles.get to get the path on the worker executor).
Hope that helps.
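For illustration, here is a rough sketch of that approach in Java. The HDFS paths and the program.exe arguments are placeholders taken from the question, not anything fixed, and you would adapt the output handling to whatever your program produces:

import java.io.File;
import java.io.PrintWriter;
import java.util.Collections;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RunExeOnPartitions {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("run-exe-on-partitions");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Ship the executable to every executor (path is an example).
        sc.addFile("hdfs:///apps/program.exe");

        JavaRDD<String> lines = sc.textFile("hdfs:///data/file_to_process");

        JavaRDD<String> results = lines.mapPartitionsWithIndex((index, it) -> {
            // 1. Dump this partition to a local temp file on the executor.
            File part = File.createTempFile("part-" + index + "-", ".txt");
            try (PrintWriter out = new PrintWriter(part, "UTF-8")) {
                while (it.hasNext()) {
                    out.println(it.next());
                }
            }
            // 2. Run the external program against that local file.
            String exe = SparkFiles.get("program.exe");
            Process p = new ProcessBuilder(exe, "-file", part.getAbsolutePath())
                    .inheritIO()
                    .start();
            int exitCode = p.waitFor();
            return Collections.singletonList(
                    "partition " + index + " finished with exit code " + exitCode).iterator();
        }, false);

        results.collect().forEach(System.out::println);
        sc.stop();
    }
}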

I think the general answer is "no". Spark is a framework, and in general it provides very specific mechanisms for cluster configuration, shuffling its own data, reading big inputs (based on HDFS), monitoring task completion and retries, and performing efficient computation. It is not well suited to a case where you have a program you can't touch that expects a file on the local filesystem.
I guess you could put your inputs on HDFS and then, since Spark accepts arbitrary Java/Scala code, use whatever language facilities you have to dump each piece to a local file, launch a process, and then build some logic to monitor for completion (maybe based on the content of the output). The mapPartitions() Spark method would be the one best suited for this.
That said, I would not recommend it. It will be ugly and complex, require you to mess with permissions on the nodes and things like that, and it would not take good advantage of Spark's strengths.
Spark is well suited to your problem, though, especially if each line of your file can be processed independently. I would look to see whether there is a way to get the program's code, whether a library does the same thing, or whether the algorithm is trivial enough to re-implement.
Probably not the answer you were looking for though :-(

Related

How to get git log of all branches in Java?

I have a task to implement a program in Java (pure Java, without 3rd-party libraries) that reads the history of any git repository and puts the commits into a tree data structure.
Could you give me any hints? How do I read the git log in Java without 3rd-party libraries?
You might want to take a look at Processes and Threads and how to execute a command at runtime. It does require some fundamental understanding of java.lang.Runtime, java.io and some other relevant topics, so I'd refrain from writing a whole method here and recommend that you search for a good tutorial and also get a first idea from other questions here, like → getting output from executing a command line program
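For a rough idea of that approach, here is a sketch that runs git log for all branches via ProcessBuilder and reads its output line by line. The --all flag and the --pretty format string are just one possible choice of what to extract:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class GitLogReader {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Run "git log" for all branches; adjust the format string to the fields you need.
        ProcessBuilder pb = new ProcessBuilder(
                "git", "log", "--all", "--pretty=format:%H|%P|%s");
        pb.redirectErrorStream(true);
        Process process = pb.start();

        List<String> commits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                commits.add(line); // each line: hash|parent hashes|subject
            }
        }
        int exit = process.waitFor();
        System.out.println("git exited with " + exit + ", read " + commits.size() + " commits");
    }
}

Each line then carries the commit hash and its parent hashes, which is enough to build the tree structure yourself.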

integrating an external program

So I have been tasked with integrating a program called "lightSIDE" into a Hadoop job, and I'm having some trouble figuring out how to go about this.
Essentially, rather than a single JAR, lightSIDE comes as an entire directory, including XML files that are crucial to its running.
Up until now, the way the data scientists on my team have been using this program is by running a Python script that actually runs the executable, but this seems extremely inefficient since it spins up a new JVM every time it gets called. That being said, I have no idea how else to handle this.
If you are writing your own MapReduce jobs then it is possible to include all the jar files as libraries and the XML files as resources.
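For example, something along these lines when setting up the job. All the paths here are placeholders, and the jars and XML files are assumed to already be in HDFS:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class LightSideJobSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "lightside-job");

        // Put the LightSide jars on the task classpath (paths are examples).
        job.addFileToClassPath(new Path("/libs/lightside/lightside.jar"));
        job.addFileToClassPath(new Path("/libs/lightside/deps/some-dependency.jar"));

        // Ship the XML resources with the job; the "#name" fragment makes the file
        // appear under that name in each task's working directory.
        job.addCacheFile(new URI("/libs/lightside/config/feature-config.xml#feature-config.xml"));

        // ... set mapper/reducer, input/output paths, then job.waitForCompletion(true) ...
    }
}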
I'm one of the maintainers for the LightSide Researcher's Workbench. LightSide also includes a tiny PredictionServer class to handle predictions on new instances over HTTP - you can see it here on BitBucket.
If you want to train new models instead, you could modify this server to do what you want, drawing clues from the side.recipe.Chef class.

Does MapReduce need to be used with HDFS?

I want to get better performance for data processing with Hadoop MapReduce. So, do I need to use it along with Hadoop DFS? Or can MapReduce be used with other types of distributed data? Show me the way, please.
Hadoop is a framework which includes the MapReduce programming model for computation and HDFS for storage.
HDFS stands for Hadoop Distributed File System, which was inspired by the Google File System. The overall Hadoop project is based on research papers published by Google:
research.google.com/archive/mapreduce-osdi04.pdf
http://research.google.com/archive/mapreduce.html
Using the MapReduce programming model, data is computed in parallel on different nodes across the cluster, which decreases the processing time.
You need to use HDFS or HBase to store your data in the cluster to get high performance. If you choose a normal file system, there will not be much difference. Once the data is in the distributed file system, it is automatically divided into blocks and replicated (3 times by default) to provide fault tolerance. None of this is possible with a normal file system.
Hope this helps!
First, your premise is wrong: the performance of Hadoop MapReduce is not directly related to the performance of HDFS. It is considered slow because of its architecture:
It processes data with Java. Each separate mapper and reducer is a separate JVM instance, which needs to be started, and that takes some time.
It puts intermediate data on the disks many times. At a minimum, mappers write their results (one), reducers read and merge them, writing the merged set to disk (two), and the reducer results are written back to your filesystem, usually HDFS (three). You can find more details on the process here: http://0x0fff.com/hadoop-mapreduce-comprehensive-description/.
Second, Hadoop is an open framework and it supports many different filesystems. You can read data from FTP, S3, the local filesystem (an NFS share, for instance), MapR-FS, IBM GPFS, GlusterFS by Red Hat, etc. So you are free to choose the one you like. The main idea for MapReduce is to specify an InputFormat and an OutputFormat that can work with your filesystem.
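For instance, here is a sketch of pointing a job at a non-HDFS filesystem just by using a different URI scheme. The s3a bucket and paths are invented, and the relevant filesystem connector and credentials must be available on the cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class AnyFilesystemJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "any-filesystem-job");
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // The scheme in the URI selects the filesystem implementation:
        FileInputFormat.addInputPath(job, new Path("s3a://my-bucket/input/"));      // S3
        // FileInputFormat.addInputPath(job, new Path("file:///mnt/nfs/input/"));   // local / NFS
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///results/out"));

        // ... set mapper/reducer classes, then job.waitForCompletion(true) ...
    }
}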
Spark is currently considered a faster replacement for Hadoop MapReduce, as it keeps much of the computation in memory. But whether it is the right choice really depends on your case.

Reading Java properties files in Hadoop MapReduce applications

I was wondering what the standard practice is for reading Java properties files in MapReduce applications, and how to pass the file's location when submitting (starting) a job.
In regular Java applications you can pass the location of the properties file as a JVM system property (-D) or as an argument to the main method.
What is the best alternative (standard practice) for this for MapReduce jobs? Some good examples would be very helpful.
The best alternative is to use the DistributedCache, although it may not be the standard way. There can be other ways, but I haven't seen any code using anything else so far.
The idea is to add the file to the cache, read it inside the setup method of the map/reduce task, and load the values into a Properties object or a Map. If you need a snippet I can add one.
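Something along these lines, for example. This is only a rough sketch: the file name app.properties, the property key output.prefix, and the HDFS path are all made up for illustration:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class PropsCacheExample {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Properties props = new Properties();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The "#app.properties" fragment below makes the cached file show up
            // under this name in the task's working directory.
            try (InputStream in = new FileInputStream("app.properties")) {
                props.load(in);
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String prefix = props.getProperty("output.prefix", "");
            context.write(new Text(prefix), value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "props-cache-example");
        job.setJarByClass(PropsCacheExample.class);
        job.setMapperClass(MyMapper.class);
        // Ship the properties file from HDFS with the job; the path is just an example.
        job.addCacheFile(new URI("/user/me/app.properties#app.properties"));
        // ... set input/output paths and formats as usual ...
    }
}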
Oh, I remember now: my friend JtheRocker used another approach. He set the entire contents of the file against a key in the Configuration object, got its value in setup(), then parsed and loaded the pairs into a Map. In this case the file reading is done in the driver, whereas previously it was on the task's side. While this is suitable for small files and seems cleaner, orthodox people may not like polluting the conf at all.
I would like to see what other posts bring out.
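A sketch of that Configuration-based variant, assuming a small file; the key name my.props.content and the file name are invented here:

import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;

public class ConfPropsSketch {
    // Driver side: read the (small) properties file locally and stash its text
    // in the job Configuration under an arbitrary key.
    static void stash(Configuration conf) throws Exception {
        String content = new String(
                Files.readAllBytes(Paths.get("app.properties")), StandardCharsets.UTF_8);
        conf.set("my.props.content", content);
    }

    // Task side (e.g. in a Mapper's setup()): parse the text back into Properties.
    static Properties restore(Configuration conf) throws Exception {
        Properties props = new Properties();
        props.load(new StringReader(conf.get("my.props.content")));
        return props;
    }
}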

behavior of external executable

I am currently writing a program in Java that examines the behavior of an external executable. One of the requirements is to observe the file operations of the external executable in real time (check whether the executable creates/deletes/modifies any file). I tried to find a suitable API in Java to help me do this, but was not able to find one. I did find the FileAlterationObserver class, which is not suitable for my program since you have to manually specify all the directories you want to monitor.
I was wondering if any of you know of a good API to use?
Thanks for your time in advance.
Without Java, you could use the Linux lsof command to list the open files on the system. Alternatively, with Java, you can use JNotify, but you will need to specify the folders. I can't see any other way of doing this with pure Java.
EDIT: @Keppil linked you to the file change notification API, which looks way more suitable than JNotify. I wasn't aware it existed!
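For reference, a minimal sketch of that file change notification API (java.nio.file.WatchService). Like JNotify, it still works per directory, so you would have to register each directory of interest:

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class WatchDemo {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_DELETE,
                StandardWatchEventKinds.ENTRY_MODIFY);

        // Block until events arrive, then print what happened to which file.
        while (true) {
            WatchKey key = watcher.take();
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
            }
            if (!key.reset()) {
                break; // directory no longer accessible
            }
        }
    }
}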
