Spark, Alternative to Fat Jar - java

I know at least 2 ways to get my dependencies into a Spark EMR job. One is to create a fat jar and another is to specify which packages you want in spark submit using the --packages option.
The fat jar takes quite a long time to zip up (~10 minutes). Is that normal, or is it possible that we have it configured incorrectly?
The command-line option is fine, but error-prone.
Are there any alternatives? I'd like it if there already existed a way to include the dependency list in the jar with Gradle, then have it download the dependencies itself. Is this possible? Are there other alternatives?
Update: I'm posting a partial answer. One thing I didn't make clear in my original question was that I also care about the case where you have dependency conflicts because the same jar appears in different versions.
Update
Thank you for the responses about cutting back the number of dependencies or using provided where possible. For the sake of this question, let's say we have the minimal number of dependencies necessary to run the jar.

SparkLauncher can be used if the Spark job has to be launched from another application. With SparkLauncher you configure the path to your jar, and there is no need to create a fat jar to run the application.
With a fat-jar you have to have Java installed and launching the Spark application requires executing java -jar [your-fat-jar-here]. It's hard to automate it if you want to, say, launch the application from a web application.
With SparkLauncher you're given the option of launching a Spark application from another application, e.g. the web application above. It is just much easier.
import org.apache.spark.launcher.SparkLauncher

object Launcher extends App {
  // Build and start the Spark application as a child process;
  // launch() returns a java.lang.Process handle.
  val spark = new SparkLauncher()
    .setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6")
    .setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar")
    .setMainClass("SparkApp")
    .setMaster("local[*]")
    .launch()

  // Block until the Spark application finishes.
  spark.waitFor()
}
Code:
https://github.com/phalodi/Spark-launcher
Here:
setSparkHome("/home/knoldus/spark-1.4.0-bin-hadoop2.6") sets the Spark home, which is used internally to call spark-submit.
setAppResource("/home/knoldus/spark_launcher-assembly-1.0.jar") specifies the jar of our Spark application.
setMainClass("SparkApp") sets the entry point of the Spark program, i.e. the driver program.
setMaster("local[*]") sets the master URL; here we run it on the local machine.
launch() simply starts our Spark application.
What are the benefits of SparkLauncher vs java -jar fat-jar?
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-SparkLauncher.html
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/launcher/SparkLauncher.html
http://henningpetersen.com/post/22/running-apache-spark-jobs-from-applications

For example, on Cloudera's clusters there is a set of libraries already available on all nodes, which will be on the classpath for drivers and executors.
Those are e.g. spark-core, spark-hive, hadoop, etc.
Versions are grouped by Cloudera, so e.g. you have spark-core-cdh5.9.0, where the cdh5.9.0 suffix means that all libraries with that suffix were actually verified by Cloudera to work together properly.
The only thing you need to do is use libraries with the same group suffix, and you shouldn't have any classpath conflicts.
What that allows is:
set dependencies configured in an app with Maven's provided scope, so they will not be part of the fat jar but will be resolved from the classpath on the nodes; a minimal sketch follows below.
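For example, a minimal sketch of such a dependency in a pom.xml (the artifact and version are placeholders for whatever your cluster actually ships):

<!-- Needed at compile time, but the cluster already provides it,
     so it stays out of the assembled jar. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.0.0</version>
  <scope>provided</scope>
</dependency>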
You didn't write what kind of cluster you have, but maybe you can use a similar approach.
The Maven Shade plugin may be used to create the fat jar; it additionally allows you to list the libraries you want to include in the jar, and libraries not in that list are not included. A sketch follows below.
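A minimal sketch of that configuration (the include patterns are placeholders; only artifacts listed there end up in the fat jar):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <includes>
        <include>com.example:my-app</include>
        <include>com.google.guava:guava</include>
      </includes>
    </artifactSet>
    <!-- Optionally also strip classes that are never referenced. -->
    <minimizeJar>true</minimizeJar>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>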
I think something similar is described in this answer Spark, Alternative to Fat Jar, but using S3 as dependency storage.

HubSpot has a (partial) solution: SlimFast. You can find an explanation here http://product.hubspot.com/blog/the-fault-in-our-jars-why-we-stopped-building-fat-jars and you can find the code here https://github.com/HubSpot/SlimFast
Effectively it stores all the jars it'll ever need on S3, so at build time it doesn't package the jars, and at run time it fetches them from S3. So your builds are quick, and downloads don't take long.
I think if this also had the ability to shade the jars' paths on upload, in order to avoid conflicts, then it would be a perfect solution.
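This is not SlimFast's actual code, but the general shape of the idea is easy to sketch: at deploy time, fetch each dependency from a remote store into a local lib/ directory instead of packaging it into the application jar (the bucket URL and jar names below are made up):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DependencyFetcher {
    public static void main(String[] args) throws Exception {
        Path libDir = Files.createDirectories(Paths.get("lib"));
        // In a real setup this list would come from the build's dependency report.
        String[] jars = { "guava-19.0.jar", "jackson-core-2.8.1.jar" };
        for (String jar : jars) {
            Path target = libDir.resolve(jar);
            if (Files.notExists(target)) { // jars already cached locally are skipped
                try (InputStream in = new URL("https://my-bucket.s3.amazonaws.com/deps/" + jar).openStream()) {
                    Files.copy(in, target);
                }
            }
        }
    }
}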

The fat jar indeed takes a lot of time to create. I was able to optimize a little by removing the dependencies which were not required at runtime. But it is really a pain.

Related

Does a different Java classpath order give a "No method found" error?

I have a Scala Akka application which connects to HBase (currently CDP, earlier HDP), deployed on Rancher. We never faced any trouble when connecting to HDP HBase. Since the recent HDP-to-CDP change, with the same image, we are getting "no method found" on one of the dependency's classes in one of the containers, whereas another container of the same image connects to HBase properly, even though the jar exists in the same image and on the same classpath.
One noticeable difference is a change in the order of the classpath.
Does a change in the classpath order affect the availability of jars?
Would Java libraries/classes load in a different order when they hit a faster CPU cycle at startup?
What could be the reason for such a "no class method found" error?
It certainly can, if the same class file is present in different classpath entries. For example, if your classpath is: java -cp a.jar:b.jar com.foo.App, and:
a.jar:
pkg/SomeClass.class
b.jar:
pkg/SomeClass.class
Then this can happen - usually because one of the jars on your classpath is an older version than the other, or the same but more complicated: one of the jars of your classpath contains a whole heap of different libraries all squished together and one of those components is an older version.
There are some basic hygiene rules to observe:
Don't squish jars together. If you have 500 deps, put 500 entries on your classpath. We have tools to manage this stuff, use them. Don't make striped jars, uber jars, etc.
Use dependency trackers to check if there are version differences in your dependency chain. If your app depends on, say, 'hibernate' and 'jersey', and they both depend on Google's Guava libraries, but hibernate imports v26 and jersey imports v29, that's problematic. Be aware of it and ensure that you explicitly decide which version ends up making it in. Presumably you'd want to explicitly pick v29, and perhaps check that hibernate also runs on v29*. If it doesn't, you have bigger problems. They are fixable (with modular classloaders), but not easily.
*) Neither hibernate nor jersey actually depend on guava, I'm just using them as hypothetical examples.
For example, if you use Maven, check out the enforcer plugin (groupId: org.apache.maven.plugins, artifactId: maven-enforcer-plugin); a sketch follows below.
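A sketch of wiring that up with the built-in dependencyConvergence rule, which fails the build when two versions of the same artifact meet in the dependency tree:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-enforcer-plugin</artifactId>
  <version>1.4.1</version>
  <executions>
    <execution>
      <id>enforce-convergence</id>
      <goals>
        <goal>enforce</goal>
      </goals>
      <configuration>
        <rules>
          <!-- Fails the build if e.g. guava v26 and v29 both appear. -->
          <dependencyConvergence/>
        </rules>
      </configuration>
    </execution>
  </executions>
</plugin>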
My bet is that there is another version of the jar somewhere in CDP, and occasionally it is loaded before the version that you ship with your project, causing the error.
So, when your container starts, try logging from which location the conflicting class is loaded. This question might help you: Determine which JAR file a class is from
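A minimal sketch of that logging (substitute the conflicting class for the placeholder used here):

public class WhereFrom {
    public static void main(String[] args) {
        // Placeholder: put the conflicting dependency class here instead.
        Class<?> clazz = javax.xml.parsers.DocumentBuilderFactory.class;
        java.security.CodeSource source = clazz.getProtectionDomain().getCodeSource();
        // Core JDK classes have no CodeSource, hence the null check.
        System.out.println(clazz.getName() + " loaded from "
                + (source != null ? source.getLocation() : "<bootstrap class loader>"));
    }
}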

Is it possible to use jQAssistant as a tool inside a java application?

I am currently working on a small project. The idea is to use jQAssistant to fill the Neo4j database so that the data can be used by a REST API. The plan is to upload a jar, war or ear to a Java backend so that it can be scanned (scan -f) and then start the Neo4j server on port 7474.
What I already have tried:
1. Trying to execute "scan" and "server" with Java ProcessBuilder and Runtime.
2. Importing JQAssistant Commandline Neo4jv3 - 1.6.0 with gradle and trying to use the run-Method in Main.class with the commandline arguments (scan -f foldername).
Server-start works without any problems in both cases, but scanning is a huge problem. It does not seem to scan the specified folder correctly. The jqassistant-folder which has been created does not have any scanned data.
I assume that the root of the problem is the plugins folder and the variables JQASSISTANT_HOME and JQASSISTANT_OPTS appearing in the jqassistant.cmd and .sh files.
Is it actually possible to execute "server" and especially "scan" within java code?
It is possible to use jQAssistant from Java code but I'd not recommend it as the underlying APIs are subject to change. What remains downwards compatible over releases are the command line arguments, so going for the Main class as described in your question should be safe for a while. This approach is also used by the Gradle integration provided by Kontext E (http://techblog.kontext-e.de/jqassistant-with-gradle/).
I assume that you're encountering the same problem with missing data when using the provided shell scripts for Windows/Linux. A common issue is that for scanning folders containing Java classes you need to specify a scope:
scan -f java:classpath::build/classes/main
The java:classpath prefix provides a hint that the folder shall be treated as a classpath element, see http://buschmais.github.io/jqassistant/doc/1.6.0/#_scanner and http://buschmais.github.io/jqassistant/doc/1.6.0/#cli:scan.
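Putting both points together, a sketch of driving the CLI in-process; the fully qualified name of Main is an assumption here, so check the Main-Class entry in the manifest of the command line distribution's jar:

public class JqaRunner {
    public static void main(String[] args) throws Exception {
        // Assumed entry point class; the argument arrays mirror the shell usage.
        com.buschmais.jqassistant.commandline.Main.main(
                new String[] { "scan", "-f", "java:classpath::build/classes/main" });
        com.buschmais.jqassistant.commandline.Main.main(new String[] { "server" });
    }
}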

Java: Resolve namespace conflict

We have a jar that we lost the source code to. I decompiled the jar and created new source from it. I want to verify that the new source code and the old jar have the same behavior. I am writing unit tests to do the verification. The problem is that they both have the same namespace / class names, so I do not know how to disambiguate the old jar and the new source code. What can I do, or is it impossible?
You need to only have one version on the class path at once to guarantee that you are running that version of the code. Develop your unit test separate from the code so you can drop in either version.
Give the new source a temporary namespace for testing purposes. Then instead of import, you can refer your new classes as:
com.yourfirm.test.packagename.TheClassName
the old ones can be simply imported and referred to as TheClassName. This way you can tell by looking at your test cases which is which.
Or simply run the tests with -cp oldpackage.jar and then -cp newpackage.jar.
It's possible, but you have to mess around with class loading. Instead of putting either of the jars on the classpath, you'll need to load them at runtime. Check out JCL for a library to allow you to do this. (Disclaimer: I have never used JCL.)
Basically, each test would have to load the class from the old JAR, grab the results of the method you're testing, then unload that JAR, load up the new one, run the same method against the new version, and compare the results.
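A minimal sketch of that idea using the JDK's plain URLClassLoader rather than JCL (the jar paths, class name, and compute method are placeholders):

import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

public class DualJarTest {
    static Object run(String jarPath, String className) throws Exception {
        // Parent = null, so the class is resolved from the given jar only,
        // never from the application classpath.
        try (URLClassLoader loader =
                new URLClassLoader(new URL[] { new URL("file:" + jarPath) }, null)) {
            Class<?> cls = loader.loadClass(className);
            Object instance = cls.getDeclaredConstructor().newInstance();
            Method m = cls.getMethod("compute"); // placeholder method under test
            return m.invoke(instance);
        }
    }

    public static void main(String[] args) throws Exception {
        Object oldResult = run("old.jar", "pkg.TheClassName");
        Object newResult = run("new.jar", "pkg.TheClassName");
        System.out.println("Same behaviour: " + oldResult.equals(newResult));
    }
}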
I'd change which classes are being tested at runtime with the classpath. This approach would be less error-prone in terms of ensuring that you're running the same test code against both binaries. Otherwise you introduce more complexity around whether the tests are correct.
It sounds like you are trying to execute the tests against both jars at the same time. I don't know of a way to disambiguate the old/new jars if they are both in the classpath.
If your unit tests output results to stdout/stderr, you could run the tests against the original jar and save the results. Then run the tests against the new jar and save the results in a separate file. Then diff the files.
Another approach would be to refactor the new source code so that it has a unique namespace. You could then test against both jars at the same time, but it could be a lot of work to make existing programs use the new jar.
If you run your tests via Ant (JUnit task), you can control the Ant classpath separately for both runs (once via the jar, once via a fileset of classes).

Minimizing jar dependency sizes

An application I have written uses several third-party jars. Sometimes only a small portion of the entire 50 kB to 1.7 MB jar is used: one or two function calls or classes.
What is the best way to reduce the jar sizes? Should I download the sources and build a jar with just the classes I need? What existing tools can help automate this (e.g. I briefly looked at http://code.google.com/p/jarjar/)?
Thank you
Edit 1:
I would like to lower the size of my third-party 'official' jars, like swingx-1.6.jar (1.4 MB), set-3.6 (1.7 MB), glazedlists-1.8.jar (820 kB), etc., so that they only contain the bare minimum classes I need.
Edit 2:
Minimizing a jar by hand or by using a program like proguard is further complicated if the library uses reflection.
Injection with google guice does not work anymore after obfuscation with proguard
The answer by cletus on another post is very good How to determine which classes are used by a Java program?
Proguard would be an option. It can eliminate unused classes and methods. You can also use it to obfuscate, which can further reduce the size of your final jar. Be aware that class loading by name is liable to break unless care is taken to keep the affected classes unobfuscated; a sketch of such keep rules follows below.
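For the reflection caveat, a sketch of the kind of keep rules involved (the package names are placeholders for whatever your code looks up by name):

# Keep classes that are located by name at runtime (reflection, injection),
# so shrinking/obfuscation does not remove or rename them.
-keep class com.example.plugins.** { *; }
# Keep annotations if a framework such as Guice reads them at runtime.
-keepattributes *Annotation*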
I've found Proguard quite effective, though it can be a bit cryptic to understand at the outset. I don't have any experience with similar tools to offer a comparison.
First of all, if you use only one class from a JAR file, this does not mean that the class does not use other classes from that JAR.
One option, if you use open-source JARs, is to get the sources of that JAR, attach them to your project, remove unnecessary stuff, and build the changes yourself.
You could add GenJar as an Ant task and use it to build the JAR. As it says on the library's home page,
GenJar is a specialized Ant task that builds jar files based on class dependencies rather than simply the contents of a directory.
You can find it on SourceForge.

java plugin specification via properties file

I have a project to which I want to add plugins. I have all the interfaces/factories/etc. set up (my gateway interface is called ApplicationMonitorFactory); I just need a way to locate and activate the plugins. My configuration file is a Java properties file.
I think what I need to do is:
find a good way to specify a set of one or more plugins
for each plugin, run it
1. find a good way to specify a set of one or more plugins
something like:
application.plugins=foo-monitor.jar,bar-monitor.jar
I think maybe it's just best to specify a list of jar files; for each jar file specified, the implication is that it contains one or more classes which implement ApplicationMonitorFactory, and these are the ones that will be instantiated. (I might also add an annotation @ApplicationMonitorPlugin so that a .jar file can have a test ApplicationMonitorFactory that does not get instantiated.)
Does this sound reasonable?
2. for each plugin, run it
I did this once a while back, and if I remember right I think I need to use a custom classloader to add the appropriate .jar file to the classpath dynamically. Or is there an easier way?
Could I suggest using OSGi instead? If it's a server-side project, something like Apache Karaf gives you quite a lot out of the box in terms of plugin deployment and specification.
To answer the questions based on what you have at the moment:
1. find a good way to specify a set of one or more plugins
The properties file approach is fine. You may want to just be able to drop plugins into a folder that you monitor if you want hot deploy. Just having 1 jar file for a plugin does limit plugin developers to packaging all of their dependencies into a single jar file (maven shade plugin is useful for this). The annotation approach should work (the approach that Servlet 3.0 uses). Using OSGI, you'd have a manifest file with a Bundle-Activator property that would reference the plugin class that should be instantiated.
2. for each plugin, run it
Yes, you would need to fire up a classloader for the jar files. This is where things get a bit hairier. It's easy enough to do, but class loading has all sorts of gotchas. This is where OSGi would really help, even though it is a bit of an upfront cost. A minimal sketch of the classloader route follows below.
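As a sketch of the non-OSGi route (the properties file name and the factory interface stub are assumptions; the JDK's ServiceLoader is just one concrete way to locate the implementations inside each plugin jar):

import java.io.FileReader;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Properties;
import java.util.ServiceLoader;

// Stub standing in for the real gateway interface from the question.
interface ApplicationMonitorFactory {
    Runnable createMonitor();
}

public class PluginLoader {
    public static void main(String[] args) throws Exception {
        Properties config = new Properties();
        config.load(new FileReader("application.properties")); // assumed file name

        // e.g. application.plugins=foo-monitor.jar,bar-monitor.jar
        String[] jars = config.getProperty("application.plugins", "").split(",");
        URL[] urls = new URL[jars.length];
        for (int i = 0; i < jars.length; i++) {
            urls[i] = new URL("file:" + jars[i].trim());
        }

        // Delegates to the application loader so plugin jars can see the
        // ApplicationMonitorFactory interface itself.
        ClassLoader loader = new URLClassLoader(urls, PluginLoader.class.getClassLoader());

        // Each plugin jar lists its implementations in
        // META-INF/services/ApplicationMonitorFactory (the ServiceLoader contract).
        for (ApplicationMonitorFactory factory : ServiceLoader.load(ApplicationMonitorFactory.class, loader)) {
            factory.createMonitor().run();
        }
    }
}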
