I have a set of auto-generated java files which are checked into git. Each file contains the line
final static long serialVersionUID = -4268385597939353497L;
where the value after the equals sign is changed to a random number on each regeneration.
Note: This is set in stone and I am aware of "not checking generated code into version control etc.".
How can I identify all files where only the serialVersionUID changed?
Changed means the files are modified in the working copy, but not committed yet.
My goal is to revert those files via pre-commit hook.
I've come as far as either
git diff -U10000 --raw MyFile.java
which gives me a diff of the whole file
or
git diff -U0 --raw --word-diff=porcelain MyFile.java
which gives me a "diff header" plus a list of changes.
Note: this particular StackOverflow answer doesn't solve your problem (I literally can't solve it properly as I don't have a Java parser). It's all about all the other stumbling blocks you're going to run into, and how to avoid them so that your task really is only the Java-related part.
It's important here to note that there are three copies of every file here:
the one in your current commit, HEAD:MyFile.java (use git show HEAD:MyFile.java to see this one);
the one in your proposed next commit, :MyFile.java (again, use git show to see it); and
the one in your work-tree, MyFile.java, which you can see and edit directly.
The git diff command will, in general, pick two of the three to compare.
Running git diff with no arguments, or with arguments that select only a file (not a commit), compares the index copy of the file with the work-tree copy. It does not extract the currently-committed file. The index copy is the one that git commit will write to a new commit, so it is, in effect, what you are proposing to commit now.
Using git diff --cached tells Git to compare the file(s) in HEAD to the file(s) in the index. Using git diff HEAD tells Git to compare the file(s) in HEAD to the file(s) in the work-tree. So these are how you select which pairs of files get compared. But no matter what, each git diff just picks one pair of files, or one set-of-pairs if you let Git compare all the files.
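To make the pair-selection concrete, here is a throwaway demo (repo path and file contents invented) showing which two copies each git diff variant compares:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name you
echo v1 > MyFile.java
git add MyFile.java
git commit -qm initial
echo v2 > MyFile.java            # work-tree edit, not yet staged
git diff --name-only             # index vs work-tree: lists MyFile.java
git diff --cached --name-only    # HEAD vs index: empty, nothing staged yet
git add MyFile.java              # copy the change into the index
git diff --name-only             # now empty: index matches work-tree
git diff --cached --name-only    # HEAD vs index: lists MyFile.java
```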
If you run git commit -a—and I recommend that you don't, here—that's roughly equivalent to git add -u && git commit, except that it builds a temporary index with the updated files. Things get particularly tricky in the various commit hooks here, since there are now multiple different index files with different proposed-next-commits. That's why I recommend avoiding git commit -a here. It's already hard enough to work with, and reason about, three copies of a file, and using tricky commit options, such as -a, --only, or --include, throws a fourth and sometimes even a fifth set of copies into the mix.
(Git can only deal with one index file at a time. A standard git commit has only the one standard index file. The standard index file has copies of the files that would or will go into the next commit.1 The options cause Git to create additional temporary index files, into which it builds the proposed new commit, then run the rest of the operations—including your hooks—with $GIT_INDEX_FILE set in the environment to make these sub-commands look at whichever temporary index is to be used. If all goes well and git commit winds up making a new commit, one of these various temporary index files, with whatever contents are appropriate based on the options and arguments, becomes the new index, after which you're back to the normal situation with a mere three copies of each file.)
Since your plan is to work in a pre-commit hook, you probably should compare the HEAD files against the proposed-for-commit files in the index, i.e., you should probably be using git diff --cached here. However, if you intend to do this by computer program, rather than as something a human peruses at leisure, you should not be using git diff at all. The front-end git diff command is meant for use by humans, which is why it paginates and colors the output and does all those kinds of things that just annoy computer programs. Git calls these fancy front ends porcelain commands.
Each kind of git diff is implemented by a back-end plumbing command. The plumbing command that compares a commit—technically, a tree—to the index is git diff-index, which still needs --cached to tell it to do the desired comparison: git diff-index --cached HEAD produces predictable output, that does not depend on each user's preferred pager, color styles, and so on.
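A hedged sketch of the plumbing command in action (the repo setup and file contents are invented for the demo):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name you
echo 'final static long serialVersionUID = 1L;' > MyFile.java
git add MyFile.java; git commit -qm initial
echo 'final static long serialVersionUID = 2L;' > MyFile.java
git add MyFile.java                       # stage the regenerated file
# plumbing output: stable, no pager, no color; --name-only limits it
# to one changed path per line, which is easy to iterate over in a hook
git diff-index --cached --name-only HEAD
```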
(If you're writing this hook exclusively for your own use, you can use either git diff or git diff-index since you can compensate for your own personal git diff settings. But it's better, in a sense, to use the plumbing command anyway—then there's no need to compensate for anything.)
Whatever you choose here, you still have to write your own code to interpret the diff output. You might instead choose to write a program that simply extracts the two files of interest—HEAD:MyFile.java and :MyFile.java, that is—from the current commit and from the index, and compare them in your own program, rather than using git diff at all. You can extract the files using git show, but that has the slight defect that it's another porcelain command. You can use git cat-file -p, which is the underlying plumbing command, to extract the files directly, without going through git show.
Actually parsing the Java code would be the most reliable method, so that you don't get tripped up by some sort of silly formatting change. A more-hacky method such as assuming that everything must match except for one line of some specific form would be not too difficult in, say, awk (read both files one line at a time, check that only one line is different in the two files and that it has the expected form). All of these seem likely to be simpler than trying to parse diff output, though if you want to parse diff output, a non-Git non-context diff might be simpler.
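Putting those pieces together, here is one possible sketch of the awk approach, with the repository setup faked for demonstration. The file name, the serialVersionUID pattern, and the revert step are assumptions taken from the question, not a tested production hook:

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q .
git config user.email you@example.com
git config user.name you
printf 'class A {\n  final static long serialVersionUID = -42L;\n}\n' > MyFile.java
git add MyFile.java; git commit -qm initial
printf 'class A {\n  final static long serialVersionUID = 7L;\n}\n' > MyFile.java
git add MyFile.java                          # the "regenerated" file, staged

f=MyFile.java
git cat-file -p "HEAD:$f" > old.tmp          # committed version
git cat-file -p ":$f"     > new.tmp          # staged (index) version
# succeed only if both files have the same line count and every
# differing line contains serialVersionUID in both versions
if awk 'NR==FNR { a[FNR]=$0; n=FNR; next }
        { b[FNR]=$0; m=FNR }
        END {
          if (n != m) exit 1
          for (i = 1; i <= n; i++)
            if (a[i] != b[i] && (a[i] !~ /serialVersionUID/ || b[i] !~ /serialVersionUID/))
              exit 1
          exit 0
        }' old.tmp new.tmp
then
  git checkout HEAD -- "$f"                  # only the UID changed: revert it
fi
```

Note the awk script is line-based, so it inherits the weakness the answer describes: any reformatting of the file defeats it.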
Finally, regarding:
My goal is to revert those files via pre-commit hook.
This is OK to do (Git will handle it correctly, for some definition of "correct"), but it's also a bit surprising to many Git users. Git hooks like this are not supposed to change things. The intent of the people writing Git is for Git hooks like this to merely verify things. If something fails the verification step, the hook should exit nonzero, which will cause git commit to stop. Any fixing-up is supposed to be done by some non-hook operation.
Note that git commit --no-verify skips the pre-commit hook entirely.
1Technically, an index has references to read-only copies of each file. Because these copies are read-only, they can be shared. So "copying" an index is cheap, because it really just copies all the references. Also, every file that's in the proposed new commit that's 100% bit-for-bit identical to a file that's already in some existing commit, is really just a reference to that file, since every file stored within every commit is itself entirely read-only.
Related
Our QA team wants to know what areas we have changed between revisions and the possible UI locations these changes can affect.
Right now each developer is in charge of writing this out on their own tickets. Then at the end of that project we use git to generate a diff of this branch vs master and manually trace each class to all possible UI locations.
This is eating up a lot of developer time and if UAT pushes a project back we have to do the whole process again.
We have thought about writing a program that looks at the source code and finds all the files that contain the name of the class that changed.
We end up getting a lot of red herrings when we do this, and it runs over several hours.
Is there a better way to handle this, preferably something we can put into our release management tools?
We are using Struts 2 and Spring to pull our application together.
We have thought about writing a program that looks at the source code and finds all the files that contain the name of the class that changed
You can use the log command
git log -- path_to_file
This command will print out all the commits which modified the given file name.
To also see how much each commit changed the file, add --stat:
git log --stat -- <file to search>
Search for a string in all the commits:
# search for the given string in all the commits
git log -S"string to search"
I am looking for the most efficient way to move a directory recursively in Java. At the moment, I am using Apache commons-io as shown in the code below. (If the destDir exists and contains part of the files, I would like those to be overwritten and the nested directory structures to be merged).
FileUtils.copyDirectoryToDirectory(srcDir, destDir);
FileUtils.deleteDirectory(srcDir);
While this does the trick, in my opinion, it isn't efficient enough. There are at least two issues that come to mind:
You will need to have twice as much space.
If this is an SSD, copying the data over to another part of the drive and then erasing the old data will eventually have an impact on the hardware, as it will in effect shorten the drive's life.
What is the best approach to do this?
As per my understanding commons-io doesn't seem to be using the new Java 7/8 features available in Files. On the other hand, I was unable to get Files.move(...) to work, if the destDir exists (by "get it to work" I mean have it merge the directory structures -- it complains that the destDir exists).
Regarding failures to move (please correct me, if I am wrong):
As far as I understand, an atomic move is one that only succeeds if all files are moved at once. If I understand this correctly, this means copying first and then deleting. I don't think this is what I'm looking for.
If a certain path/file cannot be moved, then the operation should cease and throw an exception, preserving the current source path it reached.
Please, note that I am not limiting myself to using the commons-io library. I am open to suggestions. I am using Java 8.
This is just an answer to the "what needs to happen to the filesystem" part of the question, not how to do it with Java.
Even if you did want to call out to an external tool, Unix mv is not like Windows Explorer. Same-name directories don't merge. So you will need to implement that yourself, or find a library function that does. There is no single Unix system call that does the whole recursive operation (let alone atomically), so it's something either your code or a library function has to do.
If you need to atomically cut from one version of a tree to another, you need to build a new tree. The files can be hard-links to the old version. i.e. do the equivalent of
cp -al dir/ new
rsync -a /path/to/some/stuff/ new/
# or maybe something smarter / custom that renames instead of copies files.
# your sanity check here
mv dir old &&
mv new dir && # see below for how to make this properly atomic
rm -rf old
This leaves a window where dir doesn't exist. To solve this, add a level of indirection, by making dir a symlink. Symlinks can be replaced atomically with mv (but not ln -sf). So in Java, you want something that will end up doing a rename system call, not an unlink / rename.
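A minimal sketch of that symlink swap (paths invented; note that mv -T is a GNU coreutils option, and on other systems you would issue the rename system call yourself):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
mkdir data.v1 data.v2
echo old > data.v1/f
echo new > data.v2/f
ln -s data.v1 current          # readers always go through "current"
ln -s data.v2 current.tmp      # build the replacement link first
mv -T current.tmp current      # one rename(2): atomically repoints "current"
```

At no point does a reader following "current" see a missing directory; the rename swaps the symlink in a single step.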
Unless you have a boatload of extremely small files (under 100 bytes), the directory metadata operations of building a hardlink farm are much cheaper than a full copy of a directory tree. The file data will stay put (and never even be read); the directory data will be a fresh copy. The file metadata (inodes) will be written for all files: once to update the ctime and link count when creating the hardlink farm, and again when removing the old tree, leaving the files with their original link count.
If you're running on a recent Linux kernel, there is a new (2013) system call, renameat2, available that can exchange two paths atomically. This avoids the symlink level of indirection. Using a Linux-only system call from Java is going to be more trouble than it's worth, though, since symlinks are easy.
I am answering my own question, as I ended up writing my own implementation.
What I didn't like about the implementations of:
Apache Commons IO
Guava
Springframework
for moving files was that all of them first copy the directories and files and then delete them (as far as I checked, September 2015). They all seem to be stuck with methods from JDK 1.6.
My solution isn't atomic. It handles the moving by walking the directory structure and performing the moves file by file. I am using the new methods from JDK 1.7. It does the job for me, and I'm sure other people would like to do the same, wonder how to do it, and then waste time. I have therefore created a small GitHub project which contains the implementation:
FileUtils#moveDirectory(Path srcPath, Path destPath)
An illustration of how to use it can be seen here.
If anybody has suggestions on how to improve it, or would like to add features, please feel free to open a pull request.
Traverse the source directory tree:
When meeting a directory, ensure the same directory exists in the target tree (and has the right permissions etc.).
When meeting a file, rename it to the same name in the corresponding directory in the target tree.
When leaving a directory, ensure it is empty and delete it.
Consider carefully how any error should be handled.
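The traversal above can be sketched with ordinary POSIX tools (paths invented; note that mv is only a cheap rename when source and target are on the same filesystem, and falls back to copying otherwise):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
mkdir -p src/sub dest/sub
echo new  > src/sub/a.txt
echo keep > dest/sub/b.txt
# for each file: ensure the corresponding target directory exists,
# then rename the file into it, overwriting any existing file
( cd src && find . -type f -exec sh -c '
      mkdir -p "../dest/$(dirname "$1")"
      mv -f "$1" "../dest/$1"
  ' _ {} \; )
# when leaving each directory it should be empty; delete it bottom-up
find src -depth -type d -exec rmdir {} \;
```

Files already present only in the target tree (dest/sub/b.txt here) are left untouched, which gives the merge behavior the question asks for.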
Note that you might also simply call out to "rsync" if it is available on your system.
How to identify which link modified the target file when multiple symbolic links exist for a single file, using Java? I was unable to find which link modified the target file.
Example: D:\sample.txt and D:\folder1\sample.txt are both links. The target file is located at E:\sample.txt.
Now how to identify whether D:\sample.txt or D:\folder1\sample.txt modified E:\sample.txt?
How to identify which link modified the target file when multiple symbolic links exist for a single file using Java?
It is not possible.
It is not possible in any programming language.
This functionality would have to be supported by the operating system, and no operating system I've ever come across does.
There are heuristics (using timestamps) that will probably work "most of the time", but in each case there are circumstances under which the heuristic will give no answer or even the wrong answer. Here are some of the confounding issues:
With simple timestamp heuristics:
it won't work if either of the symlinks is on a read-only file system, or a file system where access times are not recorded (e.g. depending on mount options), and
it won't work if a file read occurs on the symlink after the last file write.
When you add a watcher:
it won't work if you weren't "watching" at the time (duh!), and
it won't work if you have too many watcher events ... and you can't keep up.
(Besides, I don't think you can get events on the use of a symlink. So you would still need to check the symlink access timestamps anyway. And that means that read-only file systems, etc are a problem here too.)
And then there are scenarios like:
both symlinks are used to write the file,
you don't know about all of the symlinks, or
the symlink used for writing has been deleted or "touched".
These are probably beyond the scope of the OP's use-case. But they are relevant to the general question as set out by the OP's first sentence.
Maybe you can do that using Files.readAttributes(). The following works on Linux, since when you "use" a symlink under Linux, its last access time is modified. No idea about Windows; you'll have to test.
If symlink1 is a Path to your first symlink and symlink2 a Path to your second symlink, and realFile a Path to your real file, then you can retrieve FileTime objects of the last access time for both symlinks and last modification time of the file using:
Files.readAttributes(symlink1, BasicFileAttributes.class).lastAccessTime();
Files.readAttributes(symlink2, BasicFileAttributes.class).lastAccessTime();
Files.readAttributes(realFile, BasicFileAttributes.class).lastModifiedTime();
Since FileTime is Comparable to itself, you may be able to spot which symlink was used, but this is NOT a guarantee.
Explanation: if someone uses symlink1 to modify realFile, then the access time of symlink1 will be modified and the modification time of realFile will be modified. If the last access time of symlink1 is GREATER than the last access time of symlink2, then there is a possibility that symlink1 was used for this operation; on the other hand, if the last access time of symlink2 is greater and the last modification time of realFile is lesser, then you are sure that symlink2 was not used for this purpose.
But again there is no REAL guarantee. Those are only heuristics!
You should also have a look at using a WatchService in order to watch for modifications on the real file; this would make the heuristics above even more precise. But again, no guarantee.
Assume I have a class.
package org.my.domain;
public class BestClassEver{}
My Workflow
Through some refactoring, I change this class's package.
package org.another.domain;
public class BestClassEver{}
I commit this to my local repository using git and push it to a remote repository.
git add .
git commit -m "Refactoring"
git push origin master
Another Developer's Workflow
Another developer modifies the class, without pulling my changes.
package org.my.domain;
public class BestClassEver{
private String something;
}
Then commits and pushes to the remote repository
git add .
git commit -m "Some Changes"
git push origin master
Questions
Will Git merge the other developer's changes into the class?
If not, what happens?
Is this workflow something that needs to be coordinated amongst the team?
Git won't allow the other developer to push his changes without pulling.
It will throw an error that the refs don't match and that his local branch therefore needs to be updated with the remote refs.
That's pretty much all there is to know about that: if there are changes in the remote repository, git won't allow you to push unless you do a forced push.
EDIT
Once he pulls, if there are any conflicts in the file, the developer will have to correct any conflicts, commit them and only then he will be able to push.
If there are no conflicts, git will automatically merge those changes and the developer will be able to push after the pull.
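A self-contained demonstration of that reject, pull, push cycle, using a local bare repository as a stand-in for the remote (all names and contents invented):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
git init -q --bare origin.git               # stands in for the remote
git clone -q origin.git dev1
( cd dev1 && git config user.email a@example.com && git config user.name a \
  && echo base > f && git add f && git commit -qm base && git push -q origin HEAD )
git clone -q origin.git dev2                # second developer clones now
# first developer pushes another commit the second one doesn't have
( cd dev1 && echo one >> f && git add f && git commit -qm one && git push -q origin HEAD )
cd dev2
git config user.email b@example.com; git config user.name b
echo two > g; git add g; git commit -qm two
if git push -q origin HEAD 2>/dev/null; then
  echo "unexpected: push succeeded"
else
  echo "push rejected: remote has commits we do not have"
fi
git pull -q --no-rebase                     # fetch + merge the missing commit
git push -q origin HEAD                     # fast-forward now, succeeds
```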
EDIT AGAIN
I didn't realize that you were moving the file. In any case, running git status would give you an idea as to the state of your local repository after a pull. If there was any conflicts, you'd be able to correct them.
NOTE
git rebase or git pull --rebase is usually used to give you a much cleaner commit history, as it will pretty much apply any local changes on top of any other changes that were pulled.
On the other hand, git pull and git merge will usually make an extra commit to link the changes on the branches.
For more information take a look at this guide Git Rebasing
It is always a good idea to let people work on different parts of the program.
Merge or rebase in this case should be fully automatic, but in the real world it is always a bit dramatic and sometimes there are some conflicts. Of course, this merge/rebase will be done after the server rejects the push for being non-fast-forward.
When such thing fails, some workarounds include:
Just repeating "refactoring" in the other branch prior to merging;
Converting the work to patch (git format-patch), editing the patch (applying that "refactor" to it) and applying the edited patch (git am). It is like manual rebasing.
I think it's better to separate merge-complicating refactoring (one that involves renaming, for example) from usual minor refactoring.
Sometimes for a merge-complicating refactoring a script can be written (such as find -name '*.c' -exec sed 's/something/anything/g' -i '{}' ';'). The script can be used to repeat the refactoring multiple times in various places when we need it, thus avoiding having to merge refactored code with non-refactored code.
YES. Git is able to identify those changes. I am working on a project using git on my own fork (forked from origin). In the meantime another developer refactored the codebase on the original fork, which includes changing package structure.
I used the following commands:
git stash save             # saves all your work onto a personal stash
git pull                   # fetches all the latest changes on the primary fork and merges them
git stash apply stash@{0}  # or whatever stash holds your personal work
Now you will have the refactored codebase+your changes. You can now commit without any conflict and the packaging will not be changed.
Just note that in my case I waited for the original fork to be refactored and then I committed my changes, which are only changes to a few files and not repackaging.
Also note that if you have added new files, then you might have to edit a few imports to make sure the imports are correct.
e.g. import org.my.domain.BestClassEver;
should now be edited to: import org.another.domain.BestClassEver;
So I'm making some changes/fixes to someone's subclipse mod and had a few questions.
First, I noticed that an svn commit fails when trying to commit a single file that is identical to the existing one in the repository. (And returns -1 for the revision #) Makes sense.
Does this happen if you commit multiple files, only some of which have no changes?
Is the best way around this to just do a diff (on every file?) before attempting to commit?
If anyone knows, that'd be great. Or if you can point me in the right direction? (My google-fu failed me)
If the file is identical, SVN will not commit it. If you provide a list of files, the ones that are identical will just be skipped. I assume you are working with the SVN API and not the Subclipse GUI or command-line client, as you do not see a -1 in either of those.
If you are 100% sure the file is 'identical', then the quickest solution would be to do a 'revert' on the troublesome file (by right-clicking on the file, then selecting 'Team', then 'Revert'). Subversion does 'atomic' commits (see "What is the value of atomic commits in Subversion?"), which basically means that if one part of a batch commit fails, the whole commit fails.