Include spark avoiding huge dependencies - java

I would like to include Spark SQL in my project. However, doing so makes the jar file huge (over 120 MB) because Maven includes numerous dependencies.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
Is there a way to minimize the included dependencies?

It depends on your use case. By default, Maven includes all the dependencies of spark-sql in the uber jar. If your application does not use all of them, you can exclude the ones you don't need from the dependency.
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
<exclusions>
<!-- to remove jackson-databind from your uber jar -->
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</exclusion>
</exclusions>
</dependency>
But this won't help if your application uses most of the features of spark-sql.
In many cases the Spark dependencies are provided by the environment in which you run your application (apart from standalone mode). In such cases, you can mark the spark-sql dependency as provided, as shown below:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
<scope>provided</scope>
</dependency>

Related

Maven: Transitive Dependencies

I am working on a library for some projects which relies on Spark and HBase.
So the POM of the library looks something like this:
<dependencies>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.7.4</version>
</dependency>
<dependency>
<groupId>org.apache.hbase</groupId>
<artifactId>hbase-client</artifactId>
<version>1.1.2</version>
</dependency>
...
</dependencies>
And in the specific project that uses the central library (which is published on an internal Maven repository) I have this:
<dependencies>
<dependency>
<groupId>my.group.id</groupId>
<artifactId>myartifact</artifactId>
<version>LATEST</version>
</dependency>
</dependencies>
This, however, does not automatically include the dependencies that the library itself has, so I would need to copy the dependency section of the library POM into the application POM.
Do you have any advice on what might be missing or wrong?
Thanks and regards!

How can I remove the old vulnerable Apache commons collection version dependency from my project's maven dependency tree?

My Java app project is being managed by Maven.
My project has a few library dependencies (e.g. Apache Commons Configuration, Velocity) that in turn depend on Apache Commons Collections 3.2.1, which is vulnerable.
(I can see it is being used by running the mvn dependency:tree command.)
I neither wrote any code that uses Apache Commons Collections directly nor declared a dependency on it, but it is still being pulled in.
What can I do to remove this dependency and force the use of a safe version (3.2.2 or 4.1)?
For your information:
JIRA Bug - Arbitrary remote code execution with InvokerTransformer
Here is part of my pom.xml; I guess there's nothing remarkable in it.
...
<dependency>
<groupId>commons-configuration</groupId>
<artifactId>commons-configuration</artifactId>
<version>1.6</version>
</dependency>
<dependency>
<groupId>org.apache.velocity</groupId>
<artifactId>velocity</artifactId>
<version>1.7</version>
</dependency>
...
Unless I am missing something obvious, just specifying the dependency in your POM ought to be sufficient:
<dependencies>
<dependency>
<groupId>commons-collections</groupId>
<artifactId>commons-collections</artifactId>
<version>3.2.2</version>
</dependency>
...
</dependencies>
If you declare it directly in your <dependencies> section, it will override any other transitive inclusion of commons-collections, because Maven resolves version conflicts in favor of the nearest declaration.
Of course, you may wind up with incompatibilities where other dependencies depend on the other version, but that's what unit tests are for, right? ;-)
What you need to do is exclude commons-collections from the affected dependencies and include the desired version in your dependencies directly.
Example pom.xml excerpt assuming commons-configuration uses the vulnerable commons-collections
<dependency>
<groupId>commons-configuration</groupId>
<artifactId>commons-configuration</artifactId>
<version>1.10</version>
<exclusions>
<exclusion>
<artifactId>commons-collections</artifactId>
<groupId>commons-collections</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>commons-collections</groupId>
<artifactId>commons-collections</artifactId>
<version>3.2.2</version>
<scope>runtime</scope>
</dependency>
For simplicity I didn't show configuring this in a root pom.xml in the dependency-management section.
The <scope> should be set to runtime since you mentioned not using the library directly.
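As a rough sketch of the dependency-management variant mentioned above (assuming a root or parent pom.xml is available; the coordinates are the ones already used in this answer), the version can be pinned once so that it applies wherever commons-collections shows up, including transitively:
<dependencyManagement>
<dependencies>
<!-- assumption: placed in the root/parent POM; pins the version for direct and transitive uses -->
<dependency>
<groupId>commons-collections</groupId>
<artifactId>commons-collections</artifactId>
<version>3.2.2</version>
</dependency>
</dependencies>
</dependencyManagement>
With this in place the per-dependency exclusions may no longer be needed, but it is worth re-running mvn dependency:tree to confirm which version is actually resolved.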
I've added these lines to my pom.xml, but commons-collections 3.2 is still being downloaded:
<dependencies>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-collections4</artifactId>
<version>4.1</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-lang3</artifactId>
<version>${apachecommonslang.version}</version>
<exclusions>
<exclusion>
<artifactId>commons-collections</artifactId>
<groupId>commons-collections</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>commons-dbcp</groupId>
<artifactId>commons-dbcp</artifactId>
<version>${dbcp.version}</version>
<exclusions>
<exclusion>
<artifactId>commons-collections</artifactId>
<groupId>commons-collections</groupId>
</exclusion>
</exclusions>
</dependency>

maven dependency pulling a wrong dependency

I have a dependency as follows:
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.2</version>
<scope>compile</scope>
</dependency>
This pulls in another dependency, httpcore 4.1.4, which throws a NoClassDefFoundError; when I deploy httpcore 4.2 everything works.
I added both of the dependencies as follows:
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<version>4.2</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.2</version>
<scope>compile</scope>
</dependency>
and I am still facing the same issue, i.e. mvn brings down httpcore 4.1.2, not httpcore 4.2.
How can I resolve this?
EDIT:
I added:
<dependencyManagement>
<dependencies>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.2</version>
<scope>compile</scope>
</dependency>
</dependencies>
</dependencyManagement>
You might have a transitive dependency: one of your other dependencies depends on the version you don't want.
To get an overview of all dependencies, direct and transitive, try:
mvn dependency:tree
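On a large project the full tree can be long. A minimal sketch of narrowing it down, using the tree goal's includes filter and the httpcore coordinates from the question (adjust the groupId:artifactId pattern to whatever artifact you are chasing):
mvn dependency:tree -Dincludes=org.apache.httpcomponents:httpcore
The output should then be limited to the branches that lead to httpcore, which makes it easier to see which dependency requests the older version.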
If you find a clash between different versions of the same dependency, the first thing you should do is figure out whether the clash is critical (do you need both?). If not, upgrade so that the lowest dependency version becomes equal to the highest. If it is a transitive dependency, consider upgrading its version.
If you just want to lock on to a specific version of the dependency, you have some choices:
Exclude the transitive dependency:
<dependency>
<groupId>com.something</groupId>
<artifactId>something</artifactId>
<exclusions>
<exclusion>
<groupId>com.somethingElse</groupId>
<artifactId>somethingElse</artifactId>
</exclusion>
</exclusions>
</dependency>
Include a specific version:
<dependency>
<groupId>com.somethingElse</groupId>
<artifactId>somethingElse</artifactId>
<version>2.0</version>
</dependency>
Any dependency version declared explicitly in your POM overrides the version of any transitive dependency with the same groupId/artifactId.
Although it can be a bit of a puzzle, you should try to use compatible versions of your dependencies, that is, versions whose transitive dependencies have the same versions.

How to get rid of servletcontainer-specific library jsp-api.jar?

I am facing the issue described here. I found a dependency on jsp-api.jar, which in fact comes from a dependency on Joda-Time:
<dependency>
<groupId>joda-time</groupId>
<artifactId>joda-time-jsptags</artifactId>
<version>1.0.1</version>
<exclusions>
<exclusion>
<artifactId>jsp-api</artifactId>
<groupId>javax.servlet</groupId>
</exclusion>
</exclusions>
</dependency>
I have tried to exclude it (see above), but then the application won't compile. How do I make sure jsp-api is not shipped in my .war?
Instead of excluding this library, add it to your dependencies explicitly with provided scope:
<dependency>
<groupId>joda-time</groupId>
<artifactId>joda-time-jsptags</artifactId>
<version>1.0.1</version>
</dependency>
<dependency>
<artifactId>jsp-api</artifactId>
<groupId>javax.servlet</groupId>
<version>2.0</version>
<scope>provided</scope>
</dependency>
Add the appropriate version of the JSP API to the dependencies of your project, with the provided scope. It will be available at compile time, but Maven will consider it to be provided by the runtime environment and thus won't ship it with the app.

Java - getting jar dependencies right

I'm relatively new to Java & maven, and so to get to know my way around, I decided to do a project as a means for learning.
I picked a pretty common stack :
Java 1.6
Hibernate (with annotations)
Spring (with annotations)
JUnit 4
Tomcat
Oracle XE / In-mem hsqldb
By far one of the biggest problems I've experienced is getting the correct combination of jar versions to get a stable environment. It's an issue I'm still fighting with over two months later.
Quite often I get noSuchMethod or classNotFound exceptions, and it turns out that Spring module A x.x.x is not compatible with Hibernate module B y.y.y. Or, just as commonly, Spring module A x.x.x is not compatible with Spring module B y.y.y.
I expected in starting from a clean slate, version dependencies should be minimal -- just grab the latest version and everything should work... but that has not been the case.
I expected that using Maven would simplify this process, and no doubt it has.
But it has certainly been far from painless. I'd have thought that if module A requires a specific version of module B, this would be enforced somewhere along the line, and would certainly produce more meaningful messages than just "noSuchMethod".
Additionally, it seems that the only way I discover these problems is to try a new method call, get the dreaded noSuchMethod error, and start googling.
Have I missed something along the way here that has made this more difficult on myself than it needed to be?
For reference, here's the dependencies section of my pom...if you notice anything horrendously non-standard, please let me know!
<dependencies>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.5.6</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>1.5.6</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.4</version>
</dependency>
<dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.3</version>
</dependency>
<dependency>
<groupId>ojdbc</groupId>
<artifactId>ojdbc</artifactId>
<version>14</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-io</artifactId>
<version>1.3.2</version>
</dependency>
<dependency>
<groupId>commons-io</groupId>
<artifactId>commons-io</artifactId>
<version>1.4</version>
</dependency>
<dependency><!-- java bytecode processor -->
<groupId>javassist</groupId>
<artifactId>javassist</artifactId>
<version>3.8.0.GA</version>
</dependency>
<dependency>
<groupId>commons-dbcp</groupId>
<artifactId>commons-dbcp</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>hsqldb</groupId>
<artifactId>hsqldb</artifactId>
<version>1.8.0.7</version>
</dependency>
<dependency>
<groupId>org.dbunit</groupId>
<artifactId>dbunit</artifactId>
<version>2.4.4</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring</artifactId>
<version>2.5.6</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-test</artifactId>
<version>2.5.6</version>
</dependency>
<dependency>
<groupId>org.springframework</groupId>
<artifactId>spring-orm</artifactId>
<version>2.5.6</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-annotations</artifactId>
<version>3.4.0.GA</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-commons-annotations</artifactId>
<version>3.3.0.ga</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-core</artifactId>
<version>3.3.1.GA</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-validator</artifactId>
<version>3.1.0.GA</version>
</dependency>
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-entitymanager</artifactId>
<version>3.4.0.GA</version>
</dependency>
<dependency>
<groupId>commons-lang</groupId>
<artifactId>commons-lang</artifactId>
<version>2.3</version>
</dependency>
</dependencies>
Thanks
Marty
One thing that I've found challenging is determining what is in each package, especially from Spring.
To that end, I've found NetBeans' support for Maven to be outstanding in how it lets you know what libraries are pulled in by each requirement. 6.7 Beta contains an excellent graphical tree, and m2eclipse also has a very nice graphical dependency tree. How else would you know that spring-orm includes spring-beans, spring-core, spring-context, and spring-tx? You can also ask Maven for the dependencies from the command line using the dependency plugin; dependency:tree is the goal you want to run, and you can run it from NetBeans or Eclipse as well. Still, the graphical representation is quite handy.
So, as an example of one of your collisions:
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-annotations</artifactId>
<version>3.4.0.GA</version>
</dependency>
actually includes hibernate-commons-annotations-3.1.0.GA not 3.3. It also includes hibernate-core-3.3.0.SP1, not 3.3.1.GA.
I would start at your "biggest" component, and start to see what parts that already includes and only add what is missing. Even then, double check that you don't have a duplicate dependency and if need be, exclude the duplicate as shown in the answer to this question.
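As a minimal sketch of that approach, assuming the transitive versions listed above are acceptable for your project: keep the hibernate-annotations entry and drop the separate hibernate-core and hibernate-commons-annotations entries, so that the versions it was built against are the ones that get resolved. Whether the other Hibernate modules in the POM agree with those versions is something to verify with dependency:tree.
<dependency>
<groupId>org.hibernate</groupId>
<artifactId>hibernate-annotations</artifactId>
<version>3.4.0.GA</version>
</dependency>
<!-- no explicit hibernate-core or hibernate-commons-annotations entries here: -->
<!-- per the tree described above, they arrive transitively (3.3.0.SP1 and 3.1.0.GA) -->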
If you are using Eclipse, then you should download the Maven plugin from Sonatype here: http://m2eclipse.sonatype.org/.
This comes with a useful graphic visualisation of your dependencies (in particular transitive dependencies - dependencies you have not explicitly defined in your POM), and also shows conflicting dependencies.
Update: from the comments below, your mileage may vary.
