I am trying to write a program that deletes all duplicate files in a directory. It can detect duplicates, but my deletion code does not seem to work: File.delete() returns false. Can anybody tell me why?
Current code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class Duplicate {
    @SuppressWarnings("resource")
    public static boolean isDuplicate(File a, File b) throws IOException {
        FileInputStream as = new FileInputStream(a);
        FileInputStream bs = new FileInputStream(b);
        while (true) {
            int aBytes = as.read();
            int bBytes = bs.read();
            if (aBytes != bBytes) {
                return false;
            } else if (aBytes == -1) {
                System.out.println("Duplicate found: " + a.getName() + ", " + b.getName());
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File(System.getProperty("user.dir"));
        File[] files = dir.listFiles();
        for (int i = 0; i < files.length; i++) {
            for (int j = i + 1; j < files.length; j++) {
                if (isDuplicate(files[i], files[j])) {
                    String filePath = System.getProperty("user.dir").replace("\\", "/") + "/" + files[i].getName();
                    System.out.println("Deleting " + filePath);
                    File f = new File(filePath);
                    if (f.delete())
                        System.out.println(filePath + " deleted successfully");
                    else
                        System.out.println("Could not delete " + filePath);
                }
            }
        }
    }
}
Did you close your file streams? It would make sense for delete() to return false if the file is still open.
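For reference, here is a minimal sketch of isDuplicate rewritten with try-with-resources (it assumes java.io.InputStream and java.io.BufferedInputStream are imported as well), so both streams are closed on every exit path; the byte-by-byte comparison is otherwise unchanged:

public static boolean isDuplicate(File a, File b) throws IOException {
    // try-with-resources closes both streams on every exit path
    try (InputStream as = new BufferedInputStream(new FileInputStream(a));
         InputStream bs = new BufferedInputStream(new FileInputStream(b))) {
        int aByte;
        int bByte;
        do {
            aByte = as.read();
            bByte = bs.read();
            if (aByte != bByte)
                return false; // contents differ, or one file ended first
        } while (aByte != -1);
        return true; // both streams ended at the same position
    }
}

With the streams closed, delete() no longer fails because of open handles (on Windows in particular, a file that is still open cannot be deleted).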
Apart from the resource leak (which certainly explains why you can't delete), another problem is that you won't know why a deletion fails -- in fact, with File you have no way to find out at all.
Here is the equivalent program written with java.nio.file, with resource management:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.nio.file.attribute.BasicFileAttributeView;
import java.util.ArrayList;
import java.util.List;

public final class Duplicates
{
    private Duplicates()
    {
        throw new Error("nice try!");
    }

    private static boolean duplicate(final Path path1, final Path path2)
        throws IOException
    {
        if (Files.isSameFile(path1, path2))
            return true;
        final BasicFileAttributeView view1
            = Files.getFileAttributeView(path1, BasicFileAttributeView.class);
        final BasicFileAttributeView view2
            = Files.getFileAttributeView(path2, BasicFileAttributeView.class);
        final long size1 = view1.readAttributes().size();
        final long size2 = view2.readAttributes().size();
        if (size1 != size2)
            return false;
        try (
            final FileChannel channel1 = FileChannel.open(path1,
                StandardOpenOption.READ);
            final FileChannel channel2 = FileChannel.open(path2,
                StandardOpenOption.READ);
        ) {
            final ByteBuffer buf1
                = channel1.map(FileChannel.MapMode.READ_ONLY, 0L, size1);
            final ByteBuffer buf2
                = channel2.map(FileChannel.MapMode.READ_ONLY, 0L, size1);
            // Yes, this works; see javadoc for ByteBuffer.equals()
            return buf1.equals(buf2);
        }
    }

    public static void main(final String... args)
        throws IOException
    {
        final Path dir = Paths.get(System.getProperty("user.dir"));
        final List<Path> list = new ArrayList<>();
        // the DirectoryStream is a resource too, so close it as well
        try (final DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (final Path entry: entries)
                if (Files.isRegularFile(entry))
                    list.add(entry);
        }
        final int size = list.size();
        for (int i = 0; i < size; i++)
            for (int j = i + 1; j < size; j++)
                try {
                    if (duplicate(list.get(i), list.get(j)))
                        Files.deleteIfExists(list.get(j));
                } catch (IOException e) {
                    System.out.printf("Aiie... Failed to delete %s\nCause:\n%s\n",
                        list.get(j), e);
                }
    }
}
Note: a better strategy would probably be to create a directory into which you move every duplicate you detect; when done, just delete all files in that directory and then the directory itself. See Files.move().
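A minimal sketch of that strategy, assuming a quarantine subdirectory called duplicates (the name is made up) and java.nio.file.StandardCopyOption imported along with the imports above:

private static void quarantine(final Path dir, final Path duplicate)
    throws IOException
{
    final Path trash = dir.resolve("duplicates");
    Files.createDirectories(trash); // no-op if the directory already exists
    Files.move(duplicate, trash.resolve(duplicate.getFileName()),
        StandardCopyOption.REPLACE_EXISTING);
}

You would call quarantine(dir, list.get(j)) instead of Files.deleteIfExists(list.get(j)) in the loop above, then review and delete the quarantine directory when done.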
I have a directory with sub-directories that contain text or binary files (like pictures). I need to find duplicate files, which can be in different sub-directories and have different names. So I need an algorithm that looks inside the files and does NOT rely on the file name or file length.
I came up with a quick solution. I know this code can be written much better, but functionality-wise it works perfectly. I even tested it on jpeg and gif files.
public static Map<Long, List<File>> mapFilesSize = new HashMap<Long, List<File>>();
public static Map<String, List<File>> mapFilesHash = new HashMap<String, List<File>>();
public static MessageDigest md;

static {
    try {
        md = MessageDigest.getInstance("MD5");
    } catch (Exception ex) {} // MD5 is always available in practice
}

private static String checksum(File file) throws IOException {
    FileInputStream fis = new FileInputStream(file);
    byte[] byteArray = new byte[1024];
    int bytesCount = 0;
    while ((bytesCount = fis.read(byteArray)) != -1) {
        md.update(byteArray, 0, bytesCount);
    }
    fis.close();
    byte[] bytes = md.digest();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < bytes.length; i++) {
        sb.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
    }
    return sb.toString();
}

public static void findDuplicateFiles(File rootDir) throws Exception {
    iterateOverDirectory(rootDir);
    System.out.println("based on hash " + mapFilesHash.size());
    for (List<File> files : mapFilesHash.values()) {
        if (files.size() > 1) {
            System.out.println(files);
        }
    }
}

private static void iterateOverDirectory(File rootDir) throws Exception {
    for (File file : rootDir.listFiles()) {
        if (file.isDirectory()) {
            iterateOverDirectory(file);
        } else {
            if (mapFilesSize.get(file.length()) == null) {
                mapFilesSize.put(file.length(), new ArrayList<>());
            }
            mapFilesSize.get(file.length()).add(file);
            String md5hash = checksum(file);
            if (mapFilesHash.get(md5hash) == null) {
                mapFilesHash.put(md5hash, new ArrayList<>());
            }
            mapFilesHash.get(md5hash).add(file);
        }
    }
}
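A possible invocation of the above (the directory path is made up):

public static void main(String[] args) throws Exception {
    // hypothetical root directory; point this at your own tree
    findDuplicateFiles(new File("C:\\photos"));
}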
Without mapFilesSize, your method iterateOverDirectory can become:
private static void iterateOverDirectory(File rootDir) throws Exception {
    for (File file : rootDir.listFiles()) {
        if (file.isDirectory()) {
            iterateOverDirectory(file);
        } else {
            mapFilesHash.computeIfAbsent(checksum(file), k -> new ArrayList<>()).add(file);
        }
    }
}
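Once mapFilesHash is populated, actually removing the duplicates could look like this sketch (the delete step is my addition; the original code only prints the groups):

for (List<File> group : mapFilesHash.values()) {
    // keep the first file of each group, delete the rest
    for (int i = 1; i < group.size(); i++) {
        File extra = group.get(i);
        if (!extra.delete()) {
            System.out.println("Could not delete " + extra);
        }
    }
}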
I'm currently trying to write a program that will delete duplicate files within a given folder. I've been told to use the Path object in favor of the File object, and that the Path API has everything File has, but I can't seem to figure out how to make an array of the items within the given path. Is it bad practice to convert from a Path to a File and back, as I do in the code below?
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;

public class FileIO {
    final static String FILE_PATH = "C:\\Users\\" + System.getProperty("user.name") + "\\Documents\\Duplicate Test";

    public static void main(String args[]) throws IOException {
        Path path = Paths.get(FILE_PATH);
        folderDive(path);
    }

    public static void folderDive(Path path) throws IOException {
        File[] pathList = path.toFile().listFiles();
        ArrayList<String> deletedList = new ArrayList<String>();
        Arrays.sort(pathList);
        BufferedWriter writer = new BufferedWriter(new FileWriter(FILE_PATH + "\\Deleted.txt"));
        deletedList.add("Listed below are files that have been successfully deleted: ");
        for (int pivot = 0; pivot < pathList.length - 1; pivot++) {
            for (int index = pivot + 1; index < pathList.length; index++) {
                if (pathList[pivot].exists() && pathList[index].exists() &&
                        fileCompare(pathList[pivot].toPath(), pathList[index].toPath())) {
                    deletedList.add(pathList[index].getName());
                    pathList[index].delete();
                }
            }
        }
        for (String list : deletedList) {
            writer.write(list);
            writer.newLine();
        }
        writer.close();
    }

    public static boolean fileCompare(Path firstFile, Path comparedFile) throws IOException {
        byte[] first = Files.readAllBytes(firstFile);
        byte[] second = Files.readAllBytes(comparedFile);
        return Arrays.equals(first, second);
    }
}
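You don't actually need to convert at all: DirectoryStream (java.nio.file.DirectoryStream, Java 7+) iterates a directory as Path objects directly. A minimal sketch, assuming the same path parameter as folderDive:

List<Path> entries = new ArrayList<>();
try (DirectoryStream<Path> stream = Files.newDirectoryStream(path)) {
    for (Path entry : stream) {
        if (Files.isRegularFile(entry)) { // skip subdirectories
            entries.add(entry);
        }
    }
}

From there, Files.delete(entry) and Files.readAllBytes(entry) work on the Path directly, so the toFile()/toPath() round-trip disappears.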
I want to copy files from a parent directory into a subfolder of that same directory. The files do get copied into the subfolder, but on every run the already-copied subfolder and its files get copied again, over and over. I want the copy to happen only once.
public static void main(String[] args) throws IOException {
    File source = new File(path2);
    File target = new File("Test/subfolder");
    copyDirectory(source, target);
}

public static void copyDirectory(File sourceLocation, File targetLocation)
        throws IOException {
    if (sourceLocation.isDirectory()) {
        if (!targetLocation.exists()) {
            targetLocation.mkdir();
        }
        String[] children = sourceLocation.list();
        for (int i = 0; i < children.length; i++) {
            copyDirectory(new File(sourceLocation, children[i]), new File(
                    targetLocation, children[i]));
        }
    } else {
        InputStream in = new FileInputStream(sourceLocation);
        OutputStream out = new FileOutputStream(targetLocation);
        byte[] buf = new byte[1];
        int length;
        while ((length = in.read(buf)) > 0) {
            out.write(buf, 0, length);
        }
        in.close();
        out.close();
    }
}
Your program has a problem in the following line:
String[] children = sourceLocation.list();
Let's suppose your parent dir is Test.
The following code will then create a sub-folder under Test:
if (!targetLocation.exists()) {
    targetLocation.mkdir();
}
After that, you retrieve the children of the source folder. Since your destination has already been created at that point, it is counted as a child of the source folder and recursively gets copied. So you need to retrieve the children first and only then create the target directory, so that the target directory is not included in the copy.
Change your code as follows.
public static void main(String[] args) throws IOException {
    File source = new File("Test");
    File target = new File("Test/subfolder");
    copyDirectory(source, target);
}

public static void copyDirectory(File sourceLocation, File targetLocation)
        throws IOException {
    // list the children BEFORE creating the target, so the target
    // directory is not picked up as a child of the source
    String[] children = sourceLocation.list();
    if (sourceLocation.isDirectory()) {
        if (!targetLocation.exists()) {
            targetLocation.mkdir();
        }
        for (int i = 0; i < children.length; i++) {
            copyDirectory(new File(sourceLocation, children[i]), new File(
                    targetLocation, children[i]));
        }
    } else {
        InputStream in = new FileInputStream(sourceLocation);
        OutputStream out = new FileOutputStream(targetLocation);
        byte[] buf = new byte[1];
        int length;
        while ((length = in.read(buf)) > 0) {
            out.write(buf, 0, length);
        }
        in.close();
        out.close();
    }
}
You are calling your method recursively without a condition to break the recursion. You will have to exclude the target directory in your for-loop, as sketched below.
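For example, inside the for-loop of copyDirectory you could skip the entry that is the target itself (a sketch; it assumes the target directory lives inside the source directory, as in the question):

for (int i = 0; i < children.length; i++) {
    File child = new File(sourceLocation, children[i]);
    // skip the target itself so it is never copied into itself
    if (child.getCanonicalFile().equals(targetLocation.getCanonicalFile())) {
        continue;
    }
    copyDirectory(child, new File(targetLocation, children[i]));
}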
I have around 100 files in a folder. Each file has data like this, where each line represents a user id:
960904056
6624084
1096552020
750160020
1776024
211592064
1044872088
166720020
1098616092
551384052
113184096
136704072
And I am trying to keep merging the files from that folder into a new big file until the total number of user ids in that new big file reaches 10 million.
I am able to read all the files from a particular folder, and I keep adding the user ids from those files to a LinkedHashSet. Then I was thinking of checking whether the size of the set has reached 10 million and, if so, writing all those user ids to a new text file. Is that a feasible solution?
That 10 million number should be configurable: if in the future I need to change it from 10 million to 50 million, I should be able to do that.
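For instance, the threshold could be read from a system property with a default (just a sketch; the property name merge.limit is made up):

// run with -Dmerge.limit=50000000 to override the 10 million default
final int LIMIT = Integer.getInteger("merge.limit", 10_000_000);

The code below just hard-codes the constant for clarity.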
Below is the code I have so far:
public static void main(String args[]) {
    // configurable threshold (hard-coded here; see the sketch above)
    final int LIMIT = 10_000_000;
    File folder = new File("C:\\userids-20130501");
    File[] listOfFiles = folder.listFiles();
    Set<String> userIdSet = new LinkedHashSet<String>();
    for (int i = 0; i < listOfFiles.length; i++) {
        File file = listOfFiles[i];
        if (file.isFile() && file.getName().endsWith(".txt")) {
            try {
                List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                userIdSet.addAll(content);
                if (userIdSet.size() >= LIMIT) {
                    break;
                }
                System.out.println(userIdSet);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Any help will be appreciated. Is there a better way to do the same process?
Continuing from where we left off. ;)
You can use FileUtils with its writeLines() method to write the file. Try this:
public static void main(String args[]) {
    final int LIMIT = 10_000_000; // configurable threshold
    File folder = new File("C:\\userids-20130501");
    Set<String> userIdSet = new LinkedHashSet<String>();
    int count = 1;
    for (File file : folder.listFiles()) {
        if (file.isFile() && file.getName().endsWith(".txt")) {
            try {
                List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                userIdSet.addAll(content);
                if (userIdSet.size() >= LIMIT) {
                    File bigFile = new File("<path>" + count + ".txt");
                    FileUtils.writeLines(bigFile, userIdSet);
                    count++;
                    userIdSet = new LinkedHashSet<String>();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
If the purpose of saving the data in the LinkedHashSet is just to write it out again to another file, then I have another solution.
EDIT: to avoid an OutOfMemoryError
public static void main(String args[]) {
    final int LIMIT = 10_000_000; // configurable threshold
    File folder = new File("C:\\userids-20130501");
    int fileNameCount = 1;
    int contentCounter = 0; // lines written to the current big file so far
    File bigFile = new File("<path>" + fileNameCount + ".txt");
    for (File file : folder.listFiles()) {
        if (file.isFile() && file.getName().endsWith(".txt")) {
            try {
                List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                contentCounter += content.size();
                if (contentCounter < LIMIT) {
                    FileUtils.writeLines(bigFile, content, true); // append
                } else {
                    // current big file is full: start a new one
                    fileNameCount++;
                    bigFile = new File("<path>" + fileNameCount + ".txt");
                    FileUtils.writeLines(bigFile, content);
                    contentCounter = content.size();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
You can avoid the use of the Set as intermediate storage if you write at the same time as you read from the files. You could do something like this:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

public class AppMain {

    private static final int NUMBER_REGISTERS = 10000000;

    private static String[] filePaths = {"filePath1", "filePath2", "filePathN"};
    private static String mergedFile = "mergedFile";

    public static void main(String[] args) throws IOException {
        mergeFiles(filePaths, mergedFile);
    }

    private static void mergeFiles(String[] filePaths, String mergedFile) throws IOException {
        BufferedReader[] readerArray = createReaderArray(filePaths);
        boolean[] closedReaderFlag = new boolean[readerArray.length];
        PrintWriter writer = createWriter(mergedFile);
        int currentReaderIndex = 0;
        int numberLinesInMergedFile = 0;
        BufferedReader currentReader = null;
        String currentLine = null;
        // round-robin over the readers until the line limit is reached
        // or every input file is exhausted
        while (numberLinesInMergedFile < NUMBER_REGISTERS && getNumberReaderClosed(closedReaderFlag) < readerArray.length) {
            currentReaderIndex = (currentReaderIndex + 1) % readerArray.length;
            if (closedReaderFlag[currentReaderIndex]) {
                continue;
            }
            currentReader = readerArray[currentReaderIndex];
            currentLine = currentReader.readLine();
            if (currentLine == null) {
                currentReader.close();
                closedReaderFlag[currentReaderIndex] = true;
                continue;
            }
            writer.println(currentLine);
            numberLinesInMergedFile++;
        }
        writer.close();
        for (int index = 0; index < readerArray.length; index++) {
            if (!closedReaderFlag[index]) {
                readerArray[index].close();
            }
        }
    }

    private static BufferedReader[] createReaderArray(String[] filePaths) throws FileNotFoundException {
        BufferedReader[] readerArray = new BufferedReader[filePaths.length];
        for (int index = 0; index < readerArray.length; index++) {
            readerArray[index] = createReader(filePaths[index]);
        }
        return readerArray;
    }

    private static BufferedReader createReader(String path) throws FileNotFoundException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        return reader;
    }

    private static PrintWriter createWriter(String path) throws FileNotFoundException {
        PrintWriter writer = new PrintWriter(path);
        return writer;
    }

    private static int getNumberReaderClosed(boolean[] closedReaderFlag) {
        int count = 0;
        for (boolean currentFlag : closedReaderFlag) {
            if (currentFlag) {
                count++;
            }
        }
        return count;
    }
}
The way you're going, you may well run out of memory; you are keeping an unnecessary record in userIdSet.
A slight modification that can improve your code is as follows:
public static void main(String args[]) {
    final long LIMIT = 10_000_000L; // configurable threshold
    File folder = new File("C:\\userids-20130501");
    File[] listOfFiles = folder.listFiles();
    // there's no need for the userIdSet!
    //Set<String> userIdSet = new LinkedHashSet<String>();
    // Instead I'd go for a counter ;)
    long userIdCount = 0;
    for (int i = 0; i < listOfFiles.length; i++) {
        File file = listOfFiles[i];
        if (file.isFile() && file.getName().endsWith(".txt")) {
            try {
                List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                // I just want to know how many lines there are...
                userIdCount += content.size();
                // my guess is you'd probably want to print what you've got
                // before a possible break?? - You know better!
                System.out.println(content);
                if (userIdCount >= LIMIT) {
                    break;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
Like I noted, this is just a slight modification; it was not my intention to run a very detailed analysis of your code. I just pointed out a glaring design flaw.
Finally, where you stated System.out.println(content);, you might consider writing to file at that point.
If you write to the file one line at a time, your try-catch block may look like this:
try {
    List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
    for (int lineNumber = 0; lineNumber < content.size(); lineNumber++) {
        if (++userIdCount >= LIMIT) {
            break;
        }
        // here, write to file... but I will use a simple System.out.println for example
        System.out.println(content.get(lineNumber));
    }
} catch (IOException e) {
    e.printStackTrace();
}
Your code can be improved in many ways, but I don't have time to do that here. I hope my suggestion pushes you further along the right track. Cheers!
I want to count the number of files in which a string occurs. I have a list of documents in a directory, but some of them are duplicates. How do I remove the duplicate files from that particular directory?
Any help appreciated!
public static boolean CompareFiles(File x, File y) throws FileNotFoundException
{   //boolean result=true;
    try {
        // note: Scanner.nextByte() parses text tokens, it does not read raw bytes
        Scanner xs = new Scanner(x);
        Scanner ys = new Scanner(y);
        boolean result = true;
        while (result)
        {
            if (xs.nextByte() != ys.nextByte()) result = false;
        }
        return result;
    }
    catch (FileNotFoundException e)
    {
        System.out.println(e.getMessage());
        return false;
    }
}

public static void main(String[] args) throws FileNotFoundException, IOException
{
    File dir = new File("C:/Users/Aravind/Documents/ranked");
    File[] fileList = dir.listFiles();
    for (int x = 0; x < fileList.length; x++)
    {
        for (int y = x + 1; y < fileList.length; y++)
        {
            if (CompareFiles(fileList[x], fileList[y]))
            {
                System.out.println("in calling fn");
                fileList[x].delete();
            }
            //System.out.println(fileList[x]);
        }
    }
}
Create a map using the name of the file as key and the checksum of the file as value (follow this example to get a file's checksum using Java).
Before adding a new entry to that map, check whether the calculated checksum already exists with containsValue (if two files have the same checksum, their contents are the same).
Delete the "redundant" file.
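A minimal sketch of that idea (it assumes a checksum(File) helper like the MD5 one shown earlier on this page; note that keying the map by checksum instead of by file name turns the containsValue scan into a constant-time containsKey lookup):

Map<String, File> seen = new HashMap<>();
for (File f : dir.listFiles()) {
    if (!f.isFile()) continue;
    String sum = checksum(f);
    if (seen.containsKey(sum)) {
        f.delete();       // same checksum as an earlier file: redundant copy
    } else {
        seen.put(sum, f); // first file with this checksum survives
    }
}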
for (File f : dir.listFiles()) if (isDuplicate(f)) f.delete();
... or maybe give us more details on what you need.