Best practice to store .jar files in VCS (SVN, Git, ...) - java

I know that in the age of Maven it is not recommended to store libraries in a VCS, but sometimes it still makes sense.
My question is how to best store them - compressed or uncompressed? Uncompressed they are larger, but if they are replaced a few times with newer versions, the stored difference between two uncompressed .jar files might be much smaller than the difference between the compressed ones. Has anyone run tests on this?

Best practice to store .jar files in VCS (SVN, Git, …): don't.
It could make sense in a CVCS (Centralized VCS) like SVN, which can handle millions of files regardless of their size.
It doesn't in a DVCS, especially one like Git (and its limits):
Binary files don't fit well with VCS.
By default, cloning a DVCS repo will get you all of its history, with all the jar versions.
That will be slow and take a lot of disk space, no matter how well those jars are compressed.
You could try to play with shallow cloning, but that's highly impractical.
Use a second repository, like Nexus, to store those jars, and only keep a text file (or a pom.xml for a Maven project) in the repo that references the right jar versions.
An artifact repo is better suited for distribution and release management purposes.
All that being said, if you must store jars in a Git repo, I would initially have recommended storing them in their compressed format (which is the default format for a jar: see Creating a JAR File).
Both compressed and uncompressed formats are treated as binary by Git, but at least with the compressed format, clone and checkout take less time.
However, many threads mention the possibility of storing jars in uncompressed format:
I'm using some repos that get regular 50MB tarballs checked into them.
I convinced them to not compress the tarballs, and git does a fairly decent job of doing delta compression between them (although it needs quite a bit of RAM to do so).
There is more on deltified objects in Git here:
It does not make a difference whether you are dealing with binary or text;
The delta is not necessarily against the same path in the previous revision, so even a new file added to the history can be stored in a deltified form;
When an object stored in the deltified representation is used, it would incur more cost than using the same object in the compressed base representation. The deltification mechanism makes a trade-off taking this cost into account, as well as the space efficiency.
So, if clones and checkouts are not common operations that you have to perform every 5 minutes, storing jars in an uncompressed format in Git would make more sense because:
Git will compress and compute deltas for those files;
you will end up with uncompressed jars in your working directory, which could then potentially be loaded more quickly.
Recommendation: uncompressed.
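As a rough illustration of what "store them uncompressed" means in practice, here is a minimal java.util.zip sketch (file names are placeholders) that rewrites a jar so every entry uses the STORED method before it is committed, leaving the compression and delta work to Git:

```java
import java.io.*;
import java.util.zip.*;

// Rewrites input.jar as output.jar with every entry STORED (uncompressed),
// so Git's delta compression can work well across revisions of the jar.
public class RezipStored {
    public static void main(String[] args) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new FileInputStream("input.jar"));
             ZipOutputStream out = new ZipOutputStream(new FileOutputStream("output.jar"))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                // Read the whole entry into memory.
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) > 0) {
                    buf.write(chunk, 0, n);
                }
                byte[] data = buf.toByteArray();

                // STORED entries must declare their size and CRC up front.
                ZipEntry stored = new ZipEntry(entry.getName());
                stored.setMethod(ZipEntry.STORED);
                stored.setSize(data.length);
                CRC32 crc = new CRC32();
                crc.update(data);
                stored.setCrc(crc.getValue());

                out.putNextEntry(stored);
                out.write(data);
                out.closeEntry();
            }
        }
    }
}
```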

You can use a solution similar to the one found in the answers to the "Uncompress OpenOffice files for better storage in version control" question here on SO, namely a clean/smudge gitattributes filter (using rezip) to store *.jar files uncompressed.

.jar files are (or can be) compressed already; compressing them a second time will probably not yield the size improvement you expect.

Related

Ways to store small files in Hadoop HDFS other than HAR or Sequence Files + doubts about them

I have read lots of blog entries and articles about the "small files problem in Hadoop", but many of them simply seem to be copy-pastes of earlier ones. Furthermore, they all seem a bit dated, and even the most recent ones (around 2015) essentially describe what this Cloudera blog post covered back in early 2009.
Does this mean no archiving solution has been found in 6 years?
Here is the reason for my research: I need to move and catalogue files as they arrive, in varying numbers, sometimes even one at a time, and then store them in HDFS.
These files will later be accessed and returned through a web service layer (which must be fast), to be opened and viewed by people or software.
The files may be videos, images, documents, whatever, and need to be accessed later using an ID I generate with the Java UUID class.
The choice to use HDFS is entirely my PM's; I proposed HBase to compensate for the lack of indexing in HDFS (although I'm not sure it is an optimal solution), but he asked me to look outside HBase anyway in case we have to deal with bigger files (so far the biggest of about 1000 files has been 2 MB, but we expect 1 GB videos).
As far as I understand, the small files problem arises with MapReduce jobs, because of memory consumption, but I was wondering:
Does it really matter how many files there are in HDFS if I am using Spark to extract them? Or if I am using webhdfs/v1/? Or Java?
Talking about storing a group of small files, so far I've found three main solutions, all of which are quite inconvenient in a production environment:
HAR: looks fantastic with its indexed file extraction, but the fact that I cannot append or add new files is quite troublesome. Does the opening and recreation of HARs weigh heavily on the system?
Sequence Files have the opposite pros and cons: you can append files, but they're not indexed, so there is an O(n) look-up time. Is it worth it?
Merge them: impossible to do in my case.
Is there some new technology I'm missing out regarding this common problem? Something on the lines of Avro or Parquet for files?
Here is some feedback on your solutions:
a) HAR is not appendable. You can unarchive and re-archive your HAR archive with the new files via the HDFS command line interface. Both operations are implemented as MapReduce jobs, so execution time depends on your compute cluster as well as the size of your archive files. My colleague and I developed and use AHAR, a tool that allows you to append data more efficiently without rewriting the whole archive.
b) As far as I know, you are right about the high index look-up time. But note that with HAR you also have a higher look-up time due to its two-step indexing strategy.
This post gives you a very good overview of the small files problem and possible solutions. Maybe you can "just" increase the memory of the NameNode.
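Regarding the SequenceFile option from the question, here is a rough sketch (Hadoop 2.x Writer options; the paths and names are made up) of packing an incoming file into a container SequenceFile keyed by its UUID. Look-ups then mean scanning the container with SequenceFile.Reader, which is the O(n) cost mentioned above; if I remember correctly, newer Hadoop releases also offer a Writer.appendIfExists option for appending to an existing container.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.UUID;

// Packs one small local file into an HDFS SequenceFile, keyed by a generated UUID.
public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path container = new Path("/data/containers/bundle-0001.seq");            // hypothetical HDFS path
        byte[] content = Files.readAllBytes(Paths.get("/tmp/incoming/picture.jpg")); // hypothetical input
        String id = UUID.randomUUID().toString();

        SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(container),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class));
        try {
            writer.append(new Text(id), new BytesWritable(content));
        } finally {
            IOUtils.closeStream(writer);
        }
        System.out.println("Stored " + content.length + " bytes under key " + id);
    }
}
```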

Accessing HSQLDB "res" database bundled as resource in jar is extremely slow

I'm attempting to build an open source project to provide easy access to machine learning datasets, which bundles the data in an easily accessible way. Basically, I have code that converts the raw data into an HSQLDB file database, producing *.data, *.properties, and *.script files. I then take those 3 files, put them in src/main/resources of my Maven project and build a jar. Applications depending on this jar can then access the HSQLDB database as a res database.
Technically, I have no problems getting all the pieces in place to accomplish this. However, accessing the data is extremely slow. The strange thing, though, is that if I have the datasets project and a project depending on it both open in Eclipse and run it from there, it's as fast as one would expect. This means that the problem has to do with the HSQLDB files being jarred up. Another clue is that the larger the DB, the (seemingly) exponentially longer it takes to access the data.
I've tried bumping up the memory and perm space given as JVM args. I've also tried setting various HSQLDB flags in the *.properties file.
Any ideas?
Edit: I also have jar compression turned off using the <compress>false</compress> element in the maven-jar-plugin definition.
I tried many things, including setting the cache size and cache rows as suggested on the HSQLDB forum. I ended up solving this problem with a workaround as suggested above by Boris the Spider, which was to:
Create a temporary dir under java.io.tmpdir.
Move the DB files out of the jar and into the temporary dir.
Open a file HSQLDB database using those files.
Clean up afterwards by deleting the temporary dir.
Worked like a charm. A bit of a hack, but at least it works.
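For reference, a rough sketch of that workaround, assuming the DB files are bundled under a /db resource folder and the database alias is mydataset (both hypothetical; error handling and cleanup trimmed):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;

// Copies the HSQLDB files out of the jar into a temp dir and opens them
// as a regular file database instead of a "res" database.
public class DatasetDb {
    public static Connection open() throws Exception {
        Path tempDir = Files.createTempDirectory("datasets-db");

        // Hypothetical resource names bundled under src/main/resources/db/.
        for (String name : new String[] {"mydataset.script", "mydataset.properties", "mydataset.data"}) {
            try (InputStream in = DatasetDb.class.getResourceAsStream("/db/" + name)) {
                if (in != null) { // the *.data file may be absent for small databases
                    Files.copy(in, tempDir.resolve(name));
                }
            }
        }

        // Open as a file database; shutdown=true closes it when the last connection closes.
        // Real code should also delete the copied files when the application exits.
        String url = "jdbc:hsqldb:file:" + tempDir.resolve("mydataset") + ";shutdown=true";
        return DriverManager.getConnection(url, "SA", "");
    }
}
```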

Merging two folders with priority on timestamp

One of our teams recently migrated a good deal of legacy data to a folder with all of the current data. Some of the team members weren't aware of the changes, so they continued to make modifications in the legacy folder.
I'd like to consolidate the data by doing timestamp checks. I can write a script for this, but I essentially want to do a Windows Explorer merge of two folders where, in conflicts, the selected option is always the newer file. That is, if the source has the newer file, copy the source into the dest. If the dest is newer, don't copy the source into the dest. If the source exists but the dest doesn't, copy the source over.
I'm writing a quick script in Java, but I'm running into some issues, and I wanted to know if there's a much simpler solution to simply always select the "newer" option.
To copy a file: see "Standard concise way to copy a file in Java?"
To get the timestamp: File#lastModified() - http://docs.oracle.com/javase/6/docs/api/java/io/File.html#lastModified%28%29
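For what it's worth, a rough NIO sketch of the "newer wins" merge (the directory paths are placeholders):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

// Copies every regular file from source into dest, but only when the source copy
// is newer than the destination copy, or the destination copy is missing.
public class NewerWinsMerge {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("C:/data/legacy");  // placeholder
        Path dest = Paths.get("C:/data/current");   // placeholder

        try (Stream<Path> files = Files.walk(source)) {
            files.filter(Files::isRegularFile).forEach(src -> {
                try {
                    Path target = dest.resolve(source.relativize(src).toString());
                    Files.createDirectories(target.getParent());
                    boolean destMissing = !Files.exists(target);
                    if (destMissing || Files.getLastModifiedTime(src)
                            .compareTo(Files.getLastModifiedTime(target)) > 0) {
                        Files.copy(src, target,
                                StandardCopyOption.REPLACE_EXISTING,
                                StandardCopyOption.COPY_ATTRIBUTES);
                    }
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }
}
```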

Patching Java software

I'm trying to create a process to patch our current java application so users only need to download the diffs rather than the entire application. I don't think I need to go as low level as a binary diff since most of the jar files are small, so replacing an entire jar file wouldn't be that big of a deal (maybe 5MB at most).
Are there standard tools for determining which files changed and generating a patch for them? I've seen tools like xdelta and vpatch, but I think they work at a binary level.
I basically want to figure out which files need to be added, replaced, or removed. When I run the patch, it will check the current version of the software (from a registry setting) and ensure the patch is for the correct version. If it is, it will then make the necessary changes. It doesn't sound like this would be too difficult to implement on my own, but I was wondering if other people had already done this. I'm using NSIS as my installer if that makes any difference.
Thanks,
Jeff
Be careful when doing this--I recommend not doing it at all.
The biggest problem is public static final constants. Their values are compiled into the referencing classes, not looked up at runtime. This means that even if a java file doesn't change, its class must be recompiled, or it will still refer to the old value.
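A hypothetical two-class example of the problem:

```java
// Config.java -- TIMEOUT is a compile-time constant.
public class Config {
    public static final int TIMEOUT = 30;
}

// Client.java -- javac inlines the literal 30 here at compile time.
public class Client {
    public static void main(String[] args) {
        // Still prints 30 even if Config is later recompiled with TIMEOUT = 60,
        // unless Client.class is recompiled and shipped as well.
        System.out.println(Config.TIMEOUT);
    }
}
```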
You also want to be very careful about changing method signatures--you will get some very subtle bugs if you change a method signature and do not recompile all files that call that method, even if the calling java files don't actually need to change (for instance, changing a parameter from an int to a long).
If you decide to go down this path, be ready for some really hard-to-debug errors (generally no stack traces or significant indications, just strange behavior like the number received not matching the one sent) at customer sites that you cannot reproduce, and a lot of pissed-off customers.
Edit (too long for comment):
A binary diff of the class files might work, but I'd assume that some kind of version number or date gets compiled in, so they'd change a little with every compile for no reason--but that could easily be tested.
You could adopt some strict development practices of not using public static finals (make them private) and not ever changing method signatures (deprecate instead), but I'm not convinced that I know all the possible problems; I just know the ones we encountered.
Also, binary diffs of the jar files would be useless; you'd have to diff the classes and re-integrate them into the jars (which doesn't sound easy to track).
Can you package your resources separately, then minimize your code a bit? Pull out strings (good for i18n)--I guess I'm just wondering if you could trim the class files enough to always do a full build/ship.
On the other hand, Sun seems to do an okay job of making class files that are completely compatible with the previous JRE release, so they must have guidelines somewhere.
You may want to see if Java WebStart can help you as it is designed to do exactly those things you want to do.
I know that the documentation describes how to create and do incremental updates, but we deploy the whole application as it changes very rarely. It is then an issue of updating the JNLP when ready.
How is it deployed?
On a local network I just leave everything as .class files in a folder. The startup script uses robocopy or rsync to copy from network share to local. If any .class file is different it is synced down. If not, it doesn't sync.
For non-local networks I created my own updater. It downloads a text file of md5sums and compares them to the local files. If a file differs, it pulls it down over HTTP.
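As a sketch of that idea (the file name and the expected hash are placeholders for values that would come from the downloaded manifest), the client side boils down to hashing each local file and comparing it against the list:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;

// Computes the md5 of a local file so it can be compared against a
// server-provided "md5sum  filename" manifest; files whose hashes differ
// are then re-downloaded over HTTP.
public class Md5Check {
    static String md5Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Path jar = Paths.get("lib/app-core.jar");      // hypothetical local file
        String expected = "<md5 from the manifest>";   // placeholder value
        if (!Files.exists(jar) || !md5Of(jar).equals(expected)) {
            System.out.println("Out of date: " + jar + " -> download replacement over HTTP");
        }
    }
}
```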
A long time ago, the way we solved this was to use the classpath and jar files. Our application was built into a jar file, and it had a launcher jar file. The launcher classpath had a patch.jar that was read into the classpath before the main application.jar. This meant that we could update the patch.jar to supersede any classes in the main application.
However, this was a long time ago. You may be better using something like the Java Web Start type of approach, which offers more seamless application updating.
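A hedged sketch of that launcher approach (the jar names and the main class are hypothetical): the patch jar simply comes before the application jar in the classloader's search order, so its classes win.

```java
import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

// Launcher that puts patch.jar ahead of application.jar on the classpath,
// so any class present in both is loaded from the patch.
public class PatchingLauncher {
    public static void main(String[] args) throws Exception {
        URL[] classpath = {
                new File("patch.jar").toURI().toURL(),       // searched first
                new File("application.jar").toURI().toURL()  // searched second
        };
        try (URLClassLoader loader = new URLClassLoader(classpath)) {
            Class<?> mainClass = loader.loadClass("com.example.app.Main"); // hypothetical main class
            Method main = mainClass.getMethod("main", String[].class);
            main.invoke(null, (Object) args);
        }
    }
}
```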

Alternative to ZIP as a project file format. SQLite or Other?

My Java application is currently using ZIP as a project file format. The project files contain a few XML files and many image and sound files.
The project files are getting pretty big, and since I can't find a way with the java.util.zip classes to write to a ZIP file without recreating it, my file saves are becoming very slow. So for example, if I just want to update one XML file, I need to rewrite the entire ZIP.
Is there some other Java ZIP library that will allow me to do random writes to a ZIP file?
I know switching to something like SQLite solves the random write issue. Would using SQLite just to write XML, Sound and Images as blobs be an appropriate use?
I suppose I could come up with my own file format and use RandomAccessFile but then there would be a lot of bookkeeping I'd have to write.
Update...
My file format is very much like Office Open XML. It is a ZIP file containing XML and other resources.
Someone must have solved the problem of how to do random writes to update a ZIP file. Does anyone know how?
There exist so-called single-file virtual file systems that let you create file-based containers and provide a file-system-like structure and API. One example is SolFS (it has a core written in C with a JNI wrapper), and there are some other C- and Delphi-based solutions (I don't remember their names at the moment). I guess similar native Java solutions exist as well.
First of all, I would separate your app's resources into those that are static (such as images) and those that can change (the XML files you mentioned).
Since the static files won't be rewritten, you can continue to store them in a zip file, which IMHO is a good approach for deploying any resources.
Now you have 2 options:
Since the non-static files are probably not too big (the xml files are likely to be smaller than images+sounds), you can stick with your current solution (zip file) and simply maintain 2 zip files, of which only one (the smaller one with the changeable files) can/will be re-written.
You could use an in-memory database (such as HSQLDB) to store the changeable files and only persist them (transferring them from the database to a file on disk) when your application shuts down or that operation is explicitly needed.
SQLite is not always fast (at least in my experience). I would suggest individually compressing the XML files -- you'll still get decent compression -- and just using the file system to store them. You could experiment with btrfs, or just go with ext4. If you're not on Linux, this should still work okay, but it might not be as fast until things are cached in memory.
The idea is that if you do not have redundancy between XML files, you don't gain that much by compressing them into one "solid" archive.
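A minimal sketch of compressing a single XML resource on its own with GZIP (the file names are placeholders); each resource is compressed independently, so updating one file never touches the others:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPOutputStream;

// Writes scene.xml as scene.xml.gz next to it.
public class GzipOne {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("project/scene.xml");     // placeholder
        Path target = Paths.get("project/scene.xml.gz");  // placeholder
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
            Files.copy(source, out);
        }
    }
}
```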
Before offering another answer along the lines of using properly structured JARs, I have to ask -- why does the project need to be encapsulated in one file? How do you distribute the program to users to run?
If you must keep a project contained within a single file and be able to replace resources efficiently, yes I would say SQLite is a good choice.
If you do choose to use SQLite, also consider converting some of the XML schemas to one or more SQL tables rather than storing large XML documents as BLOBs.
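If you do go the SQLite route, here is a hedged sketch, assuming the xerial sqlite-jdbc driver (the table, file, and resource names are made up), of storing and replacing a single resource as a BLOB; saving one XML file then only rewrites that row instead of the whole container file.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Stores each project resource as a row keyed by name.
public class ProjectStore {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:project.prj")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS resource (" +
                           "  name TEXT PRIMARY KEY," +
                           "  data BLOB NOT NULL)");
            }
            byte[] xml = Files.readAllBytes(Paths.get("scene.xml")); // hypothetical resource
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT OR REPLACE INTO resource (name, data) VALUES (?, ?)")) {
                ps.setString(1, "scene.xml");
                ps.setBytes(2, xml);
                ps.executeUpdate();
            }
        }
    }
}
```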
