How to divide a large Java project into smaller components

How to divide a large Java project into smaller components - java

We've trying to separate a big code base into logical modules. I would like some recommendations for tools as well as whatever experiences you might have had with this sort of thing.
The application consists of a server WAR and several rich-clients distributed in JARs. The trouble is that it's all in one big, hairy code base, one source tree of > 2k files war. Each JAR has a dedicated class with a main method, but the tangle of dependencies ensnares quickly. It's not all that bad, good practices were followed consistently and there are components with specific tasks. It just needs some improvement to help our team scale as it grows.
The modules will each be in a maven project, built by a parent POM. The process has already started on moving each JAR/WAR into it's own project, but it's obvious that this will only scratch the surface: a few classes in each app JAR and a mammoth "legacy" project with everything else. Also, there are already some unit and integration tests.
Anyway, I'm interesting in tools, techniques, and general advice to breaking up an overly large and entangled code base into something more manageable. Free/open source is preferred.

Have a look a Structure 101. It is awesome for visualizing dependencies, and showing the dependencies to break on your way to a cleaner structure.

We recently have accomplished a similar task, i.e. a project that consisted of > 1k source files with two main classes that had to be split up. We ended up with four separate projects, one for the base utility classes, one for the client database stuff, one for the server (the project is a rmi-server-client application), and one for the client gui stuff. Our project had to be separated because other applications were using the client as a command line only and if you used any of the gui classes by accident you were experiencing headless exceptions which only occurred when starting on the headless deployment server.
Some things to keep in mind from our experience:
Use an entire sprint for separating the projects (don't let other tasks interfere with the split up for you will need the the whole time of a sprint)
Use version control
Write unit tests before you move any functionality somewhere else
Use a continuous integration system (doesn't matter if home grown or out of the box)
Minimize the number of files in the current changeset (you will save yourself a lot of work when you have to undo some changes)
Use a dependency analysis tool all the way before moving classes (we have made good experiences with DependencyFinder)
Take the time to restructure the packages into reasonable per project package sets
Don't fear to change interfaces but have all dependent projects in the workspace so that you get all the compilation errors

Two advices: The first thing you need is Test suites. The second advice is to work in small steps.
If you have a strong test suite already then you're in a good position. Otherwise, I would some good high level tests (aka: system tests).
The main advantage of high level tests is that a relatively small amount of tests can get you great coverage. They will not help you in pin-pointing a bug, but you won't really need that: if you work in small steps and you make sure to run the tests after each change you'll be able to quickly detect (accidentally introduced) bugs: the root of the bug is in the small portion of the code has changed since the last time you ran the tests.

I would start with the various tasks that you need to accomplish.
I was faced with a similar task recently, given a 15 year old code base that had been made by a series of developers who didn't have any communication with one another (one worked on the project, left, then another got hired, etc, with no crosstalk). The result is a total mishmash of very different styles and quality.
To make it work, we've had to isolate the necessary functionality, distinct from the decorative fluff to make it all work. For instance, there's a lot of different string classes in there, and one person spent what must have been a great deal of time making a 2k line conversion between COleDateTime to const char* and back again; that was fluff, code to solve a task ancillary to the main goal (getting things into and out of a database).
What we ended up having to do was identify a large goal that this code accomplished, and then writing the base logic for that. When there was a task we needed to accomplish that we know had been done before, we found it and wrapped it in library calls, so that it could exist on its own. One code chunk, for instance, activates a USB device driver to create an image; that code is untouched by this current project, but called when necessary via library calls. Another code chunk works the security dongle, and still another queries remote servers for data. That's all necessary code that can be encapsulated. The drawing code, though, was built over 15 years and such an edifice to insanity that a rewrite in OpenGL over the course of a month was a better use of time than to try to figure out what someone else had done and then how to add to it.
I'm being a bit handwavy here, because our project was MFC C++ to .NET C#, but the basic principles apply:
find the major goal
identify all the little goals that make the major goal possible
Isolate the already encapsulated portions of code, if any, to be used as library calls
figure out the logic to piece it all together.
I hope that helps...

To continue Itay's answer, I suggest reading Michael Feathers' "Working Effectively With Legacy Code"(pdf). He also recommends every step to be backed by tests. There is also A book-length version.

Maven allows you to setup small projects as children of a larger one. If you want to extract a portion of your project as a separate library for other projects, then maven lets you do that as well.
Having said that, you definitely need to document your tasks, what each smaller project will accomplish, and then (as has been stated here multiple times) test, test, test. You need tests which work through the whole project, then have tests that work with the individual portions of the project which will wind up as child projects.
When you start to pull out functionality, you need additional tests to make sure that your functionality is consistent, and that you can mock input into your various child projects.

Related

Managing Java wrapping of a C++ API for use by a Java GUI: proper version control

We have a large project consisting of the following:
A: C++ source code / libraries
B: Java and Python wrapping of the C++ libraries, using SWIG
C: a GUI written in Java, that depends on the Java API/wrapping.
People use the project in all the possible ways:
C++ projects using the C++ API
Java projects using the Java API
Python scripting
MATLAB scripting (using the Java API)
through the Java GUI
Currently, A, B and C are all in a single Subversion repository. We're moving to git/GitHub, so we have an opportunity to reorganize. We are thinking of splitting A, B, and C into their own repositories. This raises a few questions for us:
Does it make sense to split off the Java and Python SWIG wrapping (that is, the interface (*.i) files) into a separate repository?
Currently, SWIG-generated .java files are output in the source tree of the GUI and are tracked in SVN. Since we don't want to track machine-generated files in our git repositories, what is the best way of maintaining the dependency of the GUI on the .java/.jar files generated by SWIG? Consider this: if a new developer wants to build the Java GUI, they shouldn't need to build A and B from scratch; they should be able to get a copy of C from which they can immediately build the GUI.
Versioning: When everything is in one repository, A, B and C are necessarily consistent with each other. When we have different repositories, the GUI needs to work with a known version of the wrapping, and the wrapping needs to work with a known version of the C++ API. What is the best way to manage this?
We have thought deeply about each of these questions, but want to hear how the community would approach these issues. Perhaps git submodule/subtree is part of the solution to some of these? We haven't used either of these, and it seems submodules cause people some headache. Does anybody have stories of success with either of these?

OK, I looked in a similar problem like you (multiple interacting projects) and I tried the three possibilities subtree, submodules and a single plain repository with multiple folders containing the individual parts. If there are more/better solutions I am not aware of them.
In short I went for a single repository but this might not be the best solution in your case, that depends...
The benefit of submodules is that it allows easy management as every part is itself a repo. Thus individual parties can work only on their repo and the other parts can be added from predefined binary releases/... (however you like). You have to add an addtional repo that concatenates the individual repos together.
This is both the advantage and disadvantage: Each commit in this repo defines a running configuration. Ideally your developers will have to make each commit twice one for the "working repo" (A through C) and one for the configuration repo.
So this method might be well suited if you intent you parts A-C to be mostly independent and not changing too often (that is only for new releases).
I have to confess that I did not like the subtree method personally. For me (personally) the syntax seems clumsy and the benefit is not too large.
The benefit is that remote modifications are easily fetched and inserted but you loose the remote history. Thus you should avoid to interfere with the remote development.
This is the downside: If you intend to do modifications on the parts you have always to worry about the history. You can of course just develop in a git remote and for testing/merging/integrating change to the master branch. This is ok for mainly reading remote repos (if I am developing only on A but need B and C) but not for regular modifications (in the example for A).
The last possibility is one plain repo with folders for each part. The benefit is that no adminstration to keep the parts in sync is directly needed. However you will not be able to guarantee that each commit will be a running commit. Also you developers will have to do the administration by hand.
You see that the choice depends on how close the individual parts A-C are interconnected. Here I can only guess:
If you are in an earlier stage of development where modifications throughout the whole source tree are common one big repo is better handleable than a splitted version. If your interfaces are mostly constant the splitting allows smaller repos and a more strict separation of things.
The SWIG code and the C++ code seems quite close. Thus splitting those two seems less practical than splitting the GUI from the rest (for my guess).
For you other question "How to handle new developers/(un)tracking machine-generated code?":
How many commits are made/required (only releases or each individual commit)? If only releases are of interest you could go with binary packages. If you intent to share each single commit, you would have to provide many different binary versions. Here I would suggest let them compile the whole tree once consuming a few minutes and from there on rebuilding is just a short make that should not take too long. This could even automatized using hooks.

Package vs project separation in java

First off, I'm coming (back) to Java from C#, so apologies if my terminology or philosophy doesn't quite line up.
Here's the background: we've got a growing collection of internal support tools written for the web. They use HTML5/AJAX/other buzzwords for the frontend and Java for the backend. These tools utilize a lightweight in-house framework so they can share an administrative interface for security and other configuration. Each tool has been written by a separate author and I expect that trend to continue, so I'd like to make it easy for future authors to stay "standardized" on the third-party libraries that we've already decided to use for things like DI, unit testing, ORM, etc.
Our package naming currently looks like this:
com.ourcompany.tools.framework
com.ourcompany.tools.apps.app1name
com.ourcompany.tools.apps.app2name
...and so on.
So here's my question: should each of these apps (and the framework) be treated as a separate project for purposes of Maven setup, Eclipse, etc?
We could have lots of apps appear here over time, so it seems like separation would keep dependencies cleaner and let someone jump in on a single tool more easily. On the other hand, (1) maybe "splitting" deeper portions of a package structure over multiple projects is a code smell and (2) keeping them combined would make tool writers more inclined to use third-party libraries already in place for the other tools.
FWIW, my initial instinct is to separate them.
What say you, Java gurus?

I would absolutely separate them. For the purposes of Maven, make sure each app/project has the appropriate dependencies to the framework/apps so you don't have to build everything when you just want to build a single app.

I keep my projects separated out, but use a parent pom for including all of the dependencies and other common properties. Individual tools / projects have a name and a reference to the parent project, and any project-specific dependencies, if any. This works for helping to keep to common libraries and dependencies, since the common ones are already all configured, but allows me to focus on the specific portion of the codebase that I need to work with.

I'd definitely separate these kind of things out into separate projects.
You should use Maven to handle the dependencies / build process automatically (both for your own internal shared libraries and third party dependencies). There won't be any issue having multiple applications reference the same shared libraries - you can even keep multiple versions around if you need to.
Couple of bonuses from this approach:
This forces you to think carefully about your API design for the shared projects which will be a good thing in the long run.
It will probably also give you about the right granularity for source code control - i.e. your developers can check out and work on specific applications or backend modules individually

If there is a section of a project that is likely to be used on more than one project it makes sense to pull that out. It will make it a little cleaner as well if you need to update the code in one of the commonly used projects.

If you keep them together you will have fewer obstacles developing, building and deploying your tools.
We had the opposite situation, having many separate projects. After merging them into one project tree we are much more productive and this is more important to us than whatever conventions happen to be trending.

How to determine which source files are required for an Eclipse run configuration

When writing code in an Eclipse project, I'm usually quite messy and undisciplined in how I create and organize my classes, at least in the early hacky and experimental stages. In particular, I create more than one class with a main method for testing different ideas that share most of the same classes.
If I come up with something like a useful app, I can export it to a runnable jar so I can share it with friends. But this simply packs up the whole project, which can become several megabytes big if I'm relying on large library such as httpclient.
Also, if I decide to refactor my lump of code into several projects once I work out what works, and I can't remember which source files are used in a particular run configuration, all I can do it copy the main class to a new project and then keep copying missing types till the new project compiles.
Is there a way in Eclipse to determine which classes are actually used in a particular run configuration?
EDIT: Here's an example. Say I'm experimenting with web scraping, and so far I've tried to scrape the search-result pages of both youtube.com and wrzuta.pl. I have a bunch of classes that implement scraping in general, a few that are specific to each of youtube and wrzuta. On top of this I have a basic gui common to both scrapers, but a few wrzuta- and youtube-specific buttons and options.
The WrzutaGuiMain and YoutubeGuiMain classes each contain a main method to configure and show the gui for each respective website. Can Eclipse look at each of these to determine which types are referenced?

Take a look at ProGuard, it is a "java shrinker, optimizer, obfuscator, and preverifier". I think you'll mainly be interested in the first capability for this problem.
Yes it's not technically part of Eclipse, as you requested, but it can be run from an Ant script, which can be pretty easily run in Eclipse.

I create more than one class with a main method for testing different ideas that share most of the same classes.
It's better to be pedantic than lazy, it saves you time when coding :-)
You can have one class with a main method that accepts a command-line argument and calls a certain branch of functionality based on its value.

Separation of JUnit classes into special test package?

I am learning the concepts of Test-Driven Development through reading the Craftsman articles (click Craftsman under By Topic) recommended in an answer to my previous question, "Sample project for learning JUnit and proper software engineering". I love it so far!
But now I want to sit down and try it myself. I have a question that I hope will need only a simple answer.
How do you organize your JUnit test classes and your actual code? I'm talking mainly about the package structure, but any other concepts of note would be helpful too.
Do you put test classes in org.myname.project.test.* and normal code in org.myname.project.*? Do you put the test classes right alongside the normal classes? Do you prefer to prefix the class names with Test rather than suffix them?
I know this seems like the kind of thing I shouldn't worry about so soon, but I am a very organization-centric person. I'm almost the kind of person that spends more time figuring out methods to keep track of what to get done, rather than actually getting things done.
And I have a project that is currently neatly divided up into packages, but the project became a mess. Instead of trying to refactor everything and write tests, I want to start fresh, tests first and all. But first I need to know where my tests go.
edit: I totally forgot about Maven, but it seems a majority of you are using it! In the past I had a specific use case where Maven completely broke down on me but Ant gave me the flexibility I needed, so I ended up attached to Ant, but I'm thinking maybe I was just taking the wrong approach. I think I'll give Maven another try because it sounds like it will go well with test-driven development.

I prefer putting the test classes into the same package as the project classes they test, but in a different physical directory, like:
myproject/src/com/foo/Bar.java
myproject/test/com/foo/BarTest.java
In a Maven project it would look like this:
myproject/src/main/java/com/foo/Bar.java
myproject/src/test/java/com/foo/BarTest.java
The main point in this is that my test classes can access (and test!) package-scope classes and members.
As the above example shows, my test classes have the name of the tested class plus Test as a suffix. This helps finding them quickly - it's not very funny to try searching among a couple of hundred test classes, each of whose name starts with Test...
Update inspired by #Ricket's comment: this way test classes (typically) show up right after their tested buddy in a project-wise alphabetic listing of class names. (Funny that I am benefiting from this day by day, without having consciously realized how...)
Update2: A lot of developers (including myself) like Maven, but there seems to be at least as many who don't. IMHO it is very useful for "mainstream" Java projects (I would put about 90% of projects into this category... but the other 10% is still a sizeable minority). It is easy to use if one can accept the Maven conventions; however if not, it makes life a miserable struggle. Maven seems to be difficult to comprehend for many people socialized on Ant, as it apparently requires a very different way of thinking. (Myself, having never used Ant, can't compare the two.) One thing is for sure: it makes unit (and integration) testing a natural, first-class step in the process, which helps developers adopt this essential practice.

I put my test classes in the same package as what they are testing but in a different source folder or project. Organizing my test code in this fashion allows me to easily compile and package it separately so that production jar files do not contain test code. It also allows the test code to access package private fields and methods.

I use Maven. The structure that Maven promotes is:-
src/main/java/org/myname/project/MyClass.java
src/test/java/org/myname/project/TestMyClass.java
i.e. a test class with Test prepended to the name of the class under test is in a parallel directory structure to the main test.
One advantage of having the test classes in the same package (not necessarily directory though) is you can leverage package-scope methods to inspect or inject mock test objects.

CPD / PMD between projects?

I am rephrasing this question to make it a little more straightforward and easy to understand, hopefully.
I have roughly 30 components (internal) that go into a single web application. That means 30 different projects with their own separate POM. I use inheritance quite a bit in my POMs so one of the things they inherit is a PMD/CPD configuration to prevent code duplication.
Even though I have CPD/PMD running, it only detects duplicate code within the same project. I would like it to detect in any of my projects if there is code shared among the projects that can be refactored out. Moreover, I was looking for something that could (using the same concept/pattern) verify that no code is shared between other open source dependencies.
It would be CPD/PMD, except it would operate on the source jars. This task would consume a large amount of memory if you scan all projects and their dependencies for duplication. Right now, I would just like to apply that to internal projects. If it works, then it would be relatively easy/straightforward to scale that out.
Walter

I'm not sure I got everything but...
I'd create an aggregating module with all projects as dependencies, use the maven-dependency-plugin and it's unpack-dependencies mojo to get all dependencies sources jar (the mojo can take a classifier as parameter) and unpack-them (maybe in target/generated-sources/java, the maven build helper plugin may help here) and finally run pmd:cpd on the whole source base.
This may need some tweaking, I didn't test this at all.

It sounds like you want to find duplicate code anywhere in your 30 projects. I can't speak for PMD; I assume you tell it to make one giant project containing all the source files from the union of the projects. But yes, this would take a lot of RAM and CPU.
Another tool that does is the Java CloneDR. The CloneDR finds duplicate code whether it is exactly the same or close (e.g., a few edits) regardless of source code layout or intervening comments. It is pretty easy to set it up to process all the files in your set of projects.

Just run PMD:CPD as a stand-alone program. All it needs is a directory, and it will recurse. At least, it did for me. I moved all my source to one directory and ran the CPD gui from the batch file distributed with PMD-4.2.5 .

You can perhaps take a look at sonar :
Sonar-CPD engine that is much more scalable and can detect cross-projects duplications.

You can try Lizard for Python.
It doesn't work on source jars, though.
"Code Duplicate Detector
lizard -Eduplicate {path to your code}"
https://pypi.org/project/lizard/
PMD/CPD provides more granularity since it allows the user to specify the number of tokens before a block of code is flagged as duplicate.
https://pmd.github.io/latest/pmd_userdocs_cpd.html#cli-options-reference

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.