CPD / PMD between projects? - java

I am rephrasing this question to make it a little more straightforward and easy to understand, hopefully.
I have roughly 30 components (internal) that go into a single web application. That means 30 different projects with their own separate POM. I use inheritance quite a bit in my POMs so one of the things they inherit is a PMD/CPD configuration to prevent code duplication.
Even though I have CPD/PMD running, it only detects duplicate code within the same project. I would like it to detect in any of my projects if there is code shared among the projects that can be refactored out. Moreover, I was looking for something that could (using the same concept/pattern) verify that no code is shared between other open source dependencies.
It would be CPD/PMD, except it would operate on the source jars. This task would consume a large amount of memory if you scan all projects and their dependencies for duplication. Right now, I would just like to apply that to internal projects. If it works, then it would be relatively easy/straightforward to scale that out.
Walter

I'm not sure I got everything but...
I'd create an aggregating module with all projects as dependencies, use the maven-dependency-plugin and it's unpack-dependencies mojo to get all dependencies sources jar (the mojo can take a classifier as parameter) and unpack-them (maybe in target/generated-sources/java, the maven build helper plugin may help here) and finally run pmd:cpd on the whole source base.
This may need some tweaking, I didn't test this at all.

It sounds like you want to find duplicate code anywhere in your 30 projects. I can't speak for PMD; I assume you tell it to make one giant project containing all the source files from the union of the projects. But yes, this would take a lot of RAM and CPU.
Another tool that does is the Java CloneDR. The CloneDR finds duplicate code whether it is exactly the same or close (e.g., a few edits) regardless of source code layout or intervening comments. It is pretty easy to set it up to process all the files in your set of projects.

Just run PMD:CPD as a stand-alone program. All it needs is a directory, and it will recurse. At least, it did for me. I moved all my source to one directory and ran the CPD gui from the batch file distributed with PMD-4.2.5 .

You can perhaps take a look at sonar :
Sonar-CPD engine that is much more scalable and can detect cross-projects duplications.

You can try Lizard for Python.
It doesn't work on source jars, though.
"Code Duplicate Detector
lizard -Eduplicate {path to your code}"
https://pypi.org/project/lizard/
PMD/CPD provides more granularity since it allows the user to specify the number of tokens before a block of code is flagged as duplicate.
https://pmd.github.io/latest/pmd_userdocs_cpd.html#cli-options-reference

Related

Lightweight way to use a Maven dependency?

What do you normally do when you want to make use of some utility which is part of some Java library? I mean, if you add it as a Maven dependency is the whole library get included in the assembled JAR?
I don't have storage problems of any sort and I don't mind the JAR to get bloated but I'm just curious if there's some strategy to reduce the JAR size in this case.
Thanks
It is even worse: You do not only add the "whole library" to your project, but also its dependencies, whether you need them or not.
Joachim Sauer is right in saying that you do not bundle dependencies into your artifact unless you want it to be runnable. But IMHO this just moves the problem to a different point. Eventually, you want to run the stuff. At some point, a runnable JAR, a WAR or an EAR is built and it will incorporate the whole dependency tree (minus the fact that you get only one version per artifact).
The Maven shade plugin can help you to minimise your jar (see Minimize an Uber Jar correctly, Using Shade-Plugin) by only adding the "necessary" classes. But this is of course tricky in general:
Classes referenced by non-Java elements are not found.
Classes used by reflection are not found.
If you have some dependency injection going on, you only use the interfaces in your code. So the implementations might get kicked out.
So your minimizedJar might in general just be too small.
As you see I don't know any general solution for this.
What are sensible approaches?
Don't build JARs that have five purposes, but build JARs that are small. To avoid a huge number of different build processes, use multi-module projects.
Don't add libraries as dependencies unless you really need them. Better duplicate a method to convert Strings instead of adding a whole String library just for this purpose.

Should I bundle source and class files in the same JAR?

Separate Jars
When creating JAR files, I've always kept the source separate and offered it as an optional extra.
eg:
Foo.jar
Foo-source.jar
It seems to be the obvious way to do things and is very common. Advantages being:
Keeps binary jar small
Source may not be open / public
Faster for classloader? (I've no idea, just guessing)
Single Jar
I've started to doubt whether these advantages are always worth it. I'm working on a tiny component that is open-source. None of the advantages I've listed above were problems in this project anyway:
Classes + source still trivially small (and will remain that way)
Source is open
Class loading speed of this jar is irrelevant
Keeping the source with the classes does however bring new advantages:
Single dependency
No issues of version mismatch between source and classes
Developers using this jar will always have the source to hand (to debug or inspect)
Those new advantages are really attractive to me. Yes, I could just zip source, classes and even javadoc into a zip file and let clients of my component decide which they want to use (like Google do with the guava libraries) but is it really worth it?
I know it goes against conventional software engineering logic a little, but I think the advantages of a single jar file out-weigh the alternatives.
Am I wrong? Is there a better way?
Yes, I could just zip source, classes and even javadoc into a zip file and let clients of my component decide which they want to use (like Google do with the guava libraries) but is it really worth it?
Of course it is worth it! It takes about 2 seconds to do it, or just a few minutes to change your build scripts.
This is the way that most people who distribute sources and binaries handle this problem.
EDIT
It is not your perspective you need to consider. You have to think of this from the perspective of the people deploying / using your software.
They aren't going to use the source code on the deployment platform.
Therefore putting the source code in the binary JAR is a waste of disc space, slows down deployment and slows down application startup.
If they want to do something about it, they've got a problem. How do they rebuild the JAR file to get rid of the source code? How do they know what is safe to leave out?
From the deployer / user's perspectives, there are no positives, only negatives.
Finally, your point about people not being able to track source versus binary versions doesn't really hold water. Most people who would be interested in the source code are perfectly capable of doing this. Besides, there some simple things you can do to address the issue, like using JAR filenames that include your software's version number, or putting the version number into the manifest.
I have just come across a potential pitfall for the java+classes in a single jar.
If you have java files in a jar and that jar is included in the classpath of a subsequent javac execution, you MUST make sure that the timestamps of the java file is less than the timestamp of the class file.
This scenario can happen when you copy/move the java or class files prior to packaging as a jar.
If the java file is newer than the class, then even though the java file is found on the classpath (rather than an argument to javac), javac will attempt to compile that java file and then potentially end up with duplicate class errors during the compilation stage.
For this reason I would recommend keeping the source in a separate jar to the class files.
Note that relevant flags in javac will not allow you to prefer class over source: http://docs.oracle.com/javase/7/docs/technotes/tools/windows/javac.html#searching
I prefer 'Separate Jars'.
Because binary class jar is for running on JVM, but source not. Source should be carefully maintained by your source control system(SVN). If source needs to release, zip it in separate jar. Many open source separates class jar and source one.
If you want others to test and inspect/improve your code then you can have your source with the binaries. If not, keep the source away from the jar.
How small is small and why should your jar act differently from others?
Unless you have a very good reason why your jar should have the sources, not simply debugging but something specific to this one jar then I'd say no, choice is best.
I say this because if your jar should not be different from other, then you have to work on the assumption that others should do the same as you. If so, the size of the jar is not important, because its duplicated over all "small" jars. Then my WAR is much bigger than needed which, admittedly is not a massive issue, but is not something I would chose for production when I can download sources in DEV so easily.

How to automate a build of a Java class and all the classes it depends on?

I guess this is kind of a follow-on to question 1522329.
That question talked about getting a list of all classes used at runtime via the java -verbose:class option.
What I'm interested in is automating the build of a JAR file which contains my class(es), and all other classes they rely on. Typically, this would be where I am using code from some third party open source product's "client logic" but they haven't provided a clean set of client API objects. Their complete set of code goes server-side, but I only need the necessary client bits.
This would seem a common issue but I haven't seen anything (e.g. in Eclipse) which helps with this. Am I missing something?
Of course I can still do it manually by: biting the bullet and including all the third-party code in a massive JAR (offending my purist sensibilities) / source walkthrough / trial and error / -verbose:class type stuff (but the latter wouldn't work where, say, my code runs as part of a J2EE servlet, and thus I only want to see this for a given Tomcat webapp and, ideally, only for classes related to my classes therein).
I would recommend using a build system such as Ant or Maven. Maven is designed with Java in mind, and is what I use pretty much exclusively. You can even have Maven assemble (using the assembly plugin) all of the dependent classes into one large jar file, so you don't have to worry about dependencies.
http://maven.apache.org/
Edit:
Regarding the servlet, you can also define which dependencies you want packaged up with your jar, and if you are making a stand alone application you can have the jar tool make an executable jar.
note: yes, I am a bit of a Maven advocate, as it has made the project I work on much easier. No I do not work on the project personally. :)
Take a look at ProGuard.
ProGuard is a free Java class file shrinker, optimizer, obfuscator, and preverifier. It detects and removes unused classes, fields, methods, and attributes. It optimizes bytecode and removes unused instructions. It renames the remaining classes, fields, and methods using short meaningless names. Finally, it preverifies the processed code for Java 6 or for Java Micro Edition.
What you want is not only to include the classes you rely on but also the classes, the classes you rely on, rely on. And so on, and so forth.
So that's not really a build problem, but more a dependency one. To answer your question, you can either solve this with Maven (apparently) or Ant + Ivy.
I work with Ivy and I sometimes build "ueber-jar" using the zipgroupfileset functionality of the Ant Jar task. Not very elegant would say some, but it's done in 10 seconds :-)

How many multiple "Eclipse Projects" is considered too excessive for one actual development project?

I'm currently working on a project that contains many different Eclipse projects referencing each other to make up one large project. Is there a point where a developer should ask themselves if they should rethink the way their development project is structured?
NOTE: My project currently contains 25+ different Eclipse projects.
My general rule of thumb is I would create a new project for every reusable component. So for example if I have some isolated functionality that can be packaged say as a jar, I would create a new project so I can build,package and distribute the component independently.
Also, if there are certain projects that you do not need to make frequent changes to, you can build them only when required and keep them "closed" in eclipse to save time on indexing, etc. Even if you think that a certain component is not reusable, as long as it is separated from the rest of the code base in terms of logic/concerns you may be well served by just separating it out. Sometimes seemingly specific code might be reusable in another project or in a future version of the same project.
When compiled, a project would typically result in a jar. So if your application consists of potentially reusable components, it is ok to use a project for each.
I'm a big fan of using a lot of projects, I feel that this "breaks down" large things beyond what I can do with packages, and helps me orient and navigate.
Of course, if you're developing Eclipse plug-ins, everything would be a project anyway.
The only thing I would watch out for has to do with your source-control and it's ability to handle moves of files between projects. Subclipse had been giving me trouble with it, or maybe it's my SVN server that did.
If your project has that many sub-projects, or modules, needed to actually compose your final artifact then it is time to look at having something like Maven and setting up a multi-module project. It will a) allow you to work on each module independently without ide worries and allow easy setup in your ide (and others' IDEs) through the mvn eclipse:eclipse goal. In addition, when building your entire top level project, maven will be able to derive from list of dependencies you have described what modules need to be built in what order.
Here's a quick link via google and a link to the book Maven: The Definitive Guide, which will explain things in much better detail in chapter 6 (once you have the basics).
This will also force your project to not be explicitly tied to Eclipse. Being able to build independent from an ide means that any Joe Schmoe can come along and easily work with your code base using whatever tools he/she needs.
Create jars for the projects you don't work in often. That should greatly reduce the clutter. If you work on all the projects often, then you can add targets to your build that will jar up the respective projects for you, which condenses everything down to one file that you can then include on the class path.
An additional method is to create many different workspaces. The benefit of separate workspaces is that you can remove some of the visual clutter/ performance overhead of having lots of projects. You can use targets to jar up all of you projects and put them in a repository so you can reference them in each workspace.
At a former job the entire application was more then +170 projects. While it was rarely necessary to have all projects checked out locally, even the 30-40 projects constantly in our scope made reindexing, etc. very slow.
Yeesh. One Project for each Project. If you are using reusable projects, make them into a library for heavens sake. Break the none re-usable projects into packages, that's what they are there for.
That's a hard question and answers span from having one eclipse project at all to having one eclipse project for every single class.
My bottomline:
You can have too few projects,
and never too many (of course use
automation e.g. mvn eclipse:eclipse)
Use
-Declipse.useProjectReferences=true/false
when using maven to switch workspace
mode btw jar and project
dependencies
Use mvn release plugin to generate
consecutive releases (automatic
version increase)
Multiple projects gives you
independent versioning which is
extremely important. E.g. one dev may work on a new version of a
module while you still depends on
the previous one and you at some
point decide to upgrade to the newer
version(possibly by increasing its version in pom.xml dependency section). Or in other scenario if one
project contains a bug you downgrade
to its previous version.
Multiple projects makes you think
about the architecture more than if
you have just packages.
Multiple projects generally make
architectural problems evident more
than if you have just one project.
Anyone would like to comment on
this?
You never know if you project
evolves into OSGI/SOA/EDA where you
need separation.
Even if you're 100% sure that you
projects will be deployed as one jar
in an old way in a single jvm, it
still does not hurt(mvn assembly
plugin) to have multiple eclipse
projects for logically independent
pieces of code
BTW, the project I work on is divided into 24 eclipse projects.
Hell, we have more than 100. Projects don't cost anything.

How to divide a large Java project into smaller components

We've trying to separate a big code base into logical modules. I would like some recommendations for tools as well as whatever experiences you might have had with this sort of thing.
The application consists of a server WAR and several rich-clients distributed in JARs. The trouble is that it's all in one big, hairy code base, one source tree of > 2k files war. Each JAR has a dedicated class with a main method, but the tangle of dependencies ensnares quickly. It's not all that bad, good practices were followed consistently and there are components with specific tasks. It just needs some improvement to help our team scale as it grows.
The modules will each be in a maven project, built by a parent POM. The process has already started on moving each JAR/WAR into it's own project, but it's obvious that this will only scratch the surface: a few classes in each app JAR and a mammoth "legacy" project with everything else. Also, there are already some unit and integration tests.
Anyway, I'm interesting in tools, techniques, and general advice to breaking up an overly large and entangled code base into something more manageable. Free/open source is preferred.
Have a look a Structure 101. It is awesome for visualizing dependencies, and showing the dependencies to break on your way to a cleaner structure.
We recently have accomplished a similar task, i.e. a project that consisted of > 1k source files with two main classes that had to be split up. We ended up with four separate projects, one for the base utility classes, one for the client database stuff, one for the server (the project is a rmi-server-client application), and one for the client gui stuff. Our project had to be separated because other applications were using the client as a command line only and if you used any of the gui classes by accident you were experiencing headless exceptions which only occurred when starting on the headless deployment server.
Some things to keep in mind from our experience:
Use an entire sprint for separating the projects (don't let other tasks interfere with the split up for you will need the the whole time of a sprint)
Use version control
Write unit tests before you move any functionality somewhere else
Use a continuous integration system (doesn't matter if home grown or out of the box)
Minimize the number of files in the current changeset (you will save yourself a lot of work when you have to undo some changes)
Use a dependency analysis tool all the way before moving classes (we have made good experiences with DependencyFinder)
Take the time to restructure the packages into reasonable per project package sets
Don't fear to change interfaces but have all dependent projects in the workspace so that you get all the compilation errors
Two advices: The first thing you need is Test suites. The second advice is to work in small steps.
If you have a strong test suite already then you're in a good position. Otherwise, I would some good high level tests (aka: system tests).
The main advantage of high level tests is that a relatively small amount of tests can get you great coverage. They will not help you in pin-pointing a bug, but you won't really need that: if you work in small steps and you make sure to run the tests after each change you'll be able to quickly detect (accidentally introduced) bugs: the root of the bug is in the small portion of the code has changed since the last time you ran the tests.
I would start with the various tasks that you need to accomplish.
I was faced with a similar task recently, given a 15 year old code base that had been made by a series of developers who didn't have any communication with one another (one worked on the project, left, then another got hired, etc, with no crosstalk). The result is a total mishmash of very different styles and quality.
To make it work, we've had to isolate the necessary functionality, distinct from the decorative fluff to make it all work. For instance, there's a lot of different string classes in there, and one person spent what must have been a great deal of time making a 2k line conversion between COleDateTime to const char* and back again; that was fluff, code to solve a task ancillary to the main goal (getting things into and out of a database).
What we ended up having to do was identify a large goal that this code accomplished, and then writing the base logic for that. When there was a task we needed to accomplish that we know had been done before, we found it and wrapped it in library calls, so that it could exist on its own. One code chunk, for instance, activates a USB device driver to create an image; that code is untouched by this current project, but called when necessary via library calls. Another code chunk works the security dongle, and still another queries remote servers for data. That's all necessary code that can be encapsulated. The drawing code, though, was built over 15 years and such an edifice to insanity that a rewrite in OpenGL over the course of a month was a better use of time than to try to figure out what someone else had done and then how to add to it.
I'm being a bit handwavy here, because our project was MFC C++ to .NET C#, but the basic principles apply:
find the major goal
identify all the little goals that make the major goal possible
Isolate the already encapsulated portions of code, if any, to be used as library calls
figure out the logic to piece it all together.
I hope that helps...
To continue Itay's answer, I suggest reading Michael Feathers' "Working Effectively With Legacy Code"(pdf). He also recommends every step to be backed by tests. There is also A book-length version.
Maven allows you to setup small projects as children of a larger one. If you want to extract a portion of your project as a separate library for other projects, then maven lets you do that as well.
Having said that, you definitely need to document your tasks, what each smaller project will accomplish, and then (as has been stated here multiple times) test, test, test. You need tests which work through the whole project, then have tests that work with the individual portions of the project which will wind up as child projects.
When you start to pull out functionality, you need additional tests to make sure that your functionality is consistent, and that you can mock input into your various child projects.

Categories