I want to count the number of files in which a string occurs, and I have a list of documents in a directory, but they are redundant. How do I remove the duplicate files from that particular directory?
Any help appreciated!
public static boolean CompareFiles(File x, File y)
{
    // Files of different lengths cannot have identical contents
    if (x.length() != y.length()) return false;
    try (FileInputStream xs = new FileInputStream(x);
         FileInputStream ys = new FileInputStream(y))
    {
        // Scanner.nextByte() parses text tokens, not raw bytes;
        // compare the raw streams byte by byte instead
        int a, b;
        do {
            a = xs.read();
            b = ys.read();
            if (a != b) return false;
        } while (a != -1); // both streams reached EOF together
        return true;
    }
    catch (IOException e)
    {
        System.out.println(e.getMessage());
        return false;
    }
}
public static void main(String[] args) throws IOException
{
    File dir = new File("C:/Users/Aravind/Documents/ranked");
    File[] fileList = dir.listFiles();
    for (int x = 0; x < fileList.length; x++)
    {
        for (int y = x + 1; y < fileList.length; y++)
        {
            if (CompareFiles(fileList[x], fileList[y]))
            {
                // delete the later copy, not fileList[x], which is
                // still needed for the remaining comparisons
                fileList[y].delete();
            }
        }
    }
}
Create a map using the name of the file as key and the checksum of the file as value (follow this example to get a file's checksum using Java).
Before adding a new entry to that map, check if the calculated checksum already exists with containsValue (if two files have the same checksum, their contents are the same).
Delete the "redundant" file.
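A minimal sketch of that idea, assuming SHA-256 via MessageDigest; note it keys the map by checksum rather than file name, which turns the duplicate check into a constant-time containsKey instead of a linear containsValue scan:
import java.io.File;
import java.io.InputStream;
import java.math.BigInteger;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class ChecksumDedup {
    // Hex-encoded SHA-256 of the file's contents
    static String checksum(File f) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(f.toPath())) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return new BigInteger(1, md.digest()).toString(16);
    }

    public static void main(String[] args) throws Exception {
        File dir = new File("C:/Users/Aravind/Documents/ranked");
        Map<String, File> seen = new HashMap<>(); // checksum -> first file with it
        for (File f : dir.listFiles()) {
            if (!f.isFile()) continue;
            String sum = checksum(f);
            if (seen.containsKey(sum)) {
                f.delete();        // same checksum: treat as duplicate content
            } else {
                seen.put(sum, f);  // first occurrence: keep it
            }
        }
    }
}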
for (File f : dir.listFiles()) if (isDuplicate(f)) f.delete();
... or maybe give us more details on what you need.
I'm currently trying to write a program that will delete duplicate files within a given folder. I've been told to use the Path object in favor of the File object, and that the Path API has everything that File would have, but I can't seem to figure out how to make an array of the items within the given path. Is it bad practice to be converting from a Path to a File and back, and using the listFiles() method, as I do in the code below?
public class FileIO {
final static String FILE_PATH = "C:\\Users\\" + System.getProperty("user.name") + "\\Documents\\Duplicate Test";
public static void main(String args[]) throws IOException {
Path path = Paths.get(FILE_PATH);
folderDive(path);
}
public static void folderDive(Path path) throws IOException {
File [] pathList = path.toFile().listFiles();
ArrayList<String> deletedList = new ArrayList<String>();
Arrays.sort(pathList);
BufferedWriter writer = new BufferedWriter(new FileWriter(FILE_PATH + "\\Deleted.txt"));
deletedList.add("Listed Below are files that have been succesfully deleted: ");
for(int pivot = 0; pivot < pathList.length - 1; pivot++) {
for(int index = pivot + 1; index < pathList.length; index++) {
if(pathList[pivot].exists() && pathList[index].exists() &&
fileCompare(pathList[pivot].toPath(), pathList[index].toPath())) {
deletedList.add(pathList[index].getName());
pathList[index].delete();
}
}
}
for(String list: deletedList) {
writer.write(list);
writer.newLine();
}
writer.close();
}
public static boolean fileCompare(Path firstFile, Path comparedFile) throws IOException {
byte [] first = Files.readAllBytes(firstFile);
byte [] second = Files.readAllBytes(comparedFile);
if(Arrays.equals(first, second)) {
return true;
}
return false;
}
}
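For reference, a minimal sketch of the Path-native way to gather a directory's entries, assuming Java 7+ (Files.newDirectoryStream takes the place of File.listFiles(), so no round-trip through File is needed):
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class PathListing {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("C:\\Users\\" + System.getProperty("user.name")
                + "\\Documents\\Duplicate Test");
        List<Path> entries = new ArrayList<>();
        // DirectoryStream is the Path-native equivalent of File.listFiles()
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isRegularFile(entry)) {
                    entries.add(entry);
                }
            }
        }
        // Files.size(), Files.readAllBytes(), Files.delete() all take a Path,
        // so the rest of the program can stay in the Path world
        System.out.println(entries);
    }
}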
I am trying to write a program which will delete all duplicate files in a directory. It is currently able to detect duplicates, but my deleting code does not seem to be working (File.delete() returns false). Can anybody tell me why this is?
Current code:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.lang.SecurityManager;
public class Duplicate {
@SuppressWarnings("resource")
public static boolean isDuplicate(File a, File b) throws IOException {
FileInputStream as = new FileInputStream(a);
FileInputStream bs = new FileInputStream(b);
while(true) {
int aBytes = as.read();
int bBytes = bs.read();
if(aBytes != bBytes) {
return false;
} else if(aBytes == -1) {
System.out.println("Duplicate found: "+a.getName()+", "+b.getName());
return true;
}
}
}
public static void main(String[] args) throws IOException {
File dir = new File(System.getProperty("user.dir"));
File[] files = dir.listFiles();
for(int i = 0; i < files.length; i++) {
for(int j = i+1; j < files.length; j++) {
if(isDuplicate(files[i], files[j])) {
String filePath = System.getProperty("user.dir").replace("\\", "/")+"/"+files[i].getName();
System.out.println("Deleting "+filePath);
File f = new File(filePath);
if(f.delete())
System.out.println(filePath+" deleted successfully");
else
System.out.println("Could not delete "+filePath);
}
}
}
}
}
Did you close your file streams? It would make sense that it would return false if the file is currently open.
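A minimal sketch of that fix, assuming Java 7+ try-with-resources so both streams are closed even when the method returns early (a drop-in replacement for the question's isDuplicate):
public static boolean isDuplicate(File a, File b) throws IOException {
    try (FileInputStream as = new FileInputStream(a);
         FileInputStream bs = new FileInputStream(b)) {
        while (true) {
            int aBytes = as.read();
            int bBytes = bs.read();
            if (aBytes != bBytes) {
                return false;          // contents differ (or lengths differ)
            } else if (aBytes == -1) {
                return true;           // both streams ended together: duplicates
            }
        }
    } // streams closed here, so the later File.delete() is no longer blocked
}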
Apart from the resources problem (which certainly explains why you can't delete), the problem is that you won't know why the deletion fails -- in fact, with File you have no means to know at all.
Here is the equivalent program written with java.nio.file, with resource management:
public final class Duplicates
{
private Duplicates()
{
throw new Error("nice try!");
}
private static boolean duplicate(final Path path1, final Path path2)
throws IOException
{
if (Files.isSameFile(path1, path2))
return true;
final BasicFileAttributeView view1
= Files.getFileAttributeView(path1, BasicFileAttributeView.class);
final BasicFileAttributeView view2
= Files.getFileAttributeView(path2, BasicFileAttributeView.class);
final long size1 = view1.readAttributes().size();
final long size2 = view2.readAttributes().size();
if (size1 != size2)
return false;
try (
final FileChannel channel1 = FileChannel.open(path1,
StandardOpenOption.READ);
final FileChannel channel2 = FileChannel.open(path2,
StandardOpenOption.READ);
) {
final ByteBuffer buf1
= channel1.map(FileChannel.MapMode.READ_ONLY, 0L, size1);
final ByteBuffer buf2
= channel2.map(FileChannel.MapMode.READ_ONLY, 0L, size1);
// Yes, this works; see javadoc for ByteBuffer.equals()
return buf1.equals(buf2);
}
}
public static void main(final String... args)
throws IOException
{
final Path dir = Paths.get(System.getProperty("user.dir"));
final List<Path> list = new ArrayList<>();
for (final Path entry: Files.newDirectoryStream(dir))
if (Files.isRegularFile(entry))
list.add(entry);
final int size = list.size();
for (int i = 0; i < size; i++)
for (int j = i + 1; j < size; j++)
try {
if (duplicate(list.get(i), list.get(j)))
Files.deleteIfExists(list.get(j));
} catch (IOException e) {
System.out.printf("Aiie... Failed to delete %s\nCause:\n%s\n",
list.get(j), e);
}
}
}
Note: a better strategy would probably be to create a directory into which you move all duplicates you detect; when done, just delete all files in this directory and then the directory itself. See Files.move().
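A sketch of that quarantine strategy; the method and directory names are made up for illustration, and it assumes java.nio.file.StandardCopyOption in addition to the imports used above:
// Move a detected duplicate into a quarantine directory instead of deleting it
private static void quarantine(final Path dupe, final Path quarantineDir)
    throws IOException
{
    Files.createDirectories(quarantineDir); // no-op if it already exists
    Files.move(dupe, quarantineDir.resolve(dupe.getFileName()),
        StandardCopyOption.REPLACE_EXISTING);
}
Once you have reviewed the quarantine directory, delete its contents and then the directory itself.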
I have a directory that contains sequentially numbered log files and some Excel spreadsheets used for analysis. The log files are ALWAYS sequentially numbered beginning at zero, but the number of them can vary. I am trying to concatenate the log files, in the order they were created, into a single text file.
For instance, log files foo0.log, foo1.log, foo2.log would be output to concatenatedfoo.log by appending foo1 after foo0, and foo2 after foo1.
I need to count all the files in the given directory with the extension *.log, using the count to drive a for-loop that also generates the file name for concatenation. I'm having a hard time finding a way to count the files using a filter; none of the Java Tutorials on file operations seem to fit the situation, but I'm sure I'm missing something. Does this approach make sense, or is there an easier way?
int numDocs = [number of *.log docs in directory];
//
for (int i = 0; i <= numberOfFiles; i++) {
fileNumber = Integer.toString(i);
try
{
FileInputStream inputStream = new FileInputStream("\\\\Path\\to\\file\\foo" + fileNumber + ".log");
BufferedReader br = new BufferedReader(new InputStreamReader(inputStream));
try
{
BufferedWriter metadataOutputData = new BufferedWriter(new FileWriter("\\\\Path\\to\\file\\fooconcat.log", true)); // true = append mode
metadataOutputData.close();
}
//
catch (IOException e) // catch IO exception writing final output
{
System.err.println("Exception: ");
System.out.println("Exception: "+ e.getMessage().getClass().getName());
e.printStackTrace();
}
catch (Exception e) // catch IO exception reading input file
{
System.err.println("Exception: ");
System.out.println("Exception: "+ e.getMessage().getClass().getName());
e.printStackTrace();
}
}
how about
public static void main(String[] args){
final int BUFFERSIZE = 1024 << 8;
File baseDir = new File("C:\\path\\logs\\");
// Get the simple names of the files ("foo.log" not "/path/logs/foo.log")
String[] fileNames = baseDir.list(new FilenameFilter() {
@Override
public boolean accept(File dir, String name) {
return name.endsWith(".log");
}
});
// Sort the names
Arrays.sort(fileNames);
// Create the output file
File output = new File(baseDir.getAbsolutePath() + File.separatorChar + "MERGED.log");
try{
BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(output), BUFFERSIZE);
byte[] bytes = new byte[BUFFERSIZE];
int bytesRead;
final byte[] newLine = "\n".getBytes(); // used to separate contents
for(String s : fileNames){
// get the full path to read from
String fullName = baseDir.getAbsolutePath() + File.separatorChar + s;
BufferedInputStream in = new BufferedInputStream(new FileInputStream(fullName),BUFFERSIZE);
while((bytesRead = in.read(bytes,0,bytes.length)) != -1){
out.write(bytes, 0, bytesRead);
}
// close input file and ignore any issue with closing it
try{in.close();}catch(IOException e){}
out.write(newLine); // separation
}
out.close();
}catch(Exception e){
throw new RuntimeException(e);
}
}
This code DOES assume that the "sequential naming" would be zero padded such that the files will sort correctly lexicographically. i.e. The files would be
0001.log (or blah0001.log, or 0001blah.log etc)
0002.log
....
0010.log
and not
1.log
2.log
...
10.log
The latter pattern will not sort correctly with the code I have given.
Here's some code for you.
File dir = new File("C:/My Documents/logs");
File outputFile = new File("C:/My Documents/concatenated.log");
Find the ".log" files:
File[] files = dir.listFiles(new FilenameFilter() {
    @Override
    public boolean accept(File dir, String name) {
        // the first parameter is the parent directory, so test the entry itself
        return name.endsWith(".log") && new File(dir, name).isFile();
    }
});
Sort them into the appropriate order:
Arrays.sort(files, new Comparator<File>() {
@Override
public int compare(File file1, File file2) {
return numberOf(file1).compareTo(numberOf(file2));
}
private Integer numberOf(File file) {
return Integer.parseInt(file.getName().replaceAll("[^0-9]", ""));
}
});
Concatenate them:
byte[] buffer = new byte[8192];
OutputStream out = new BufferedOutputStream(new FileOutputStream(outputFile));
try {
for (File file : files) {
InputStream in = new FileInputStream(file);
try {
int charCount;
while ((charCount = in.read(buffer)) >= 0) {
out.write(buffer, 0, charCount);
}
} finally {
in.close();
}
}
} finally {
out.flush();
out.close();
}
By having the log folder as a File object, you can code like this
for (File logFile : logFolder.listFiles()){
if (logFile.getAbsolutePath().endsWith(".log")){
numDocs++;
}
}
to find the number of log files.
I would:
- open the output file once. Just use a PrintWriter.
- in a loop, create a File for each possible file;
- if it doesn't exist, break the loop;
- using a BufferedReader, read the lines of the file with readLine();
- write each line to the output file.
You should be able to do this with about 12 lines of code, as sketched below. I would pass the IOExceptions to the caller.
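A minimal sketch of that approach, assuming the foo<N>.log naming from the question and the usual java.io imports (File, FileWriter, PrintWriter, FileReader, BufferedReader):
public static void concat(String dir) throws IOException {
    try (PrintWriter out = new PrintWriter(new FileWriter(dir + "/concatenatedfoo.log"))) {
        for (int i = 0; ; i++) {
            File f = new File(dir + "/foo" + i + ".log");
            if (!f.exists()) break;                 // no more sequential logs
            try (BufferedReader in = new BufferedReader(new FileReader(f))) {
                String line;
                while ((line = in.readLine()) != null) {
                    out.println(line);
                }
            }
        }
    }
}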
You can use SequenceInputStream for concatenation of FileInputStreams.
To list all the log files, File.listFiles(FileFilter) can be used.
It will give you an unsorted array of files. To sort the files into the right order, use Arrays.sort.
Code example:
static File[] logs(String dir) {
File root = new File(dir);
return root.listFiles(new FileFilter() {
@Override
public boolean accept(File pathname) {
return pathname.isFile() && pathname.getName().endsWith(".log");
}
});
}
static String cat(final File[] files) throws IOException {
Enumeration<InputStream> e = new Enumeration<InputStream>() {
int index;
@Override
public boolean hasMoreElements() {
return index < files.length;
}
@Override
public InputStream nextElement() {
index++;
try {
return new FileInputStream(files[index - 1]);
} catch (FileNotFoundException ex) {
throw new RuntimeException("File not available!", ex);
}
}
};
SequenceInputStream input = new SequenceInputStream(e);
StringBuilder sb = new StringBuilder();
int c;
while ((c = input.read()) != -1) {
sb.append((char) c);
}
return sb.toString();
}
public static void main(String[] args) throws IOException {
String dir = "<path-to-dir-with-logs>";
File[] logs = logs(dir);
for (File f : logs) {
System.out.println(f.getAbsolutePath());
}
System.out.println();
System.out.println(cat(logs));
}
I have around 100 files in a folder. Each file has data like this, where each line represents a user id.
960904056
6624084
1096552020
750160020
1776024
211592064
1044872088
166720020
1098616092
551384052
113184096
136704072
I am trying to keep merging the files from that folder into a new big file until the total number of user ids in that new big file reaches 10 million.
I am able to read all the files from a particular folder, and I keep adding the user ids from those files to a LinkedHashSet. My thinking was then to check whether the size of the set is 10 million and, if so, write all those user ids to a new text file. Is that a feasible solution?
That 10 million number should be configurable: if in the future I need to change 10 million to 50 million, I should be able to do that.
Below is the code I have so far
public static void main(String args[]) {
File folder = new File("C:\\userids-20130501");
File[] listOfFiles = folder.listFiles();
Set<String> userIdSet = new LinkedHashSet<String>();
for (int i = 0; i < listOfFiles.length; i++) {
File file = listOfFiles[i];
if (file.isFile() && file.getName().endsWith(".txt")) {
try {
List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
userIdSet.addAll(content);
if(userIdSet.size() >= 10Million) {
break;
}
System.out.println(userIdSet);
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
Any help will be appreciated. And is there a better way to do the same process?
Continuing from where we left off. ;)
You can use FileUtils to write the file as well, with its writeLines() method.
Try this -
public static void main(String args[]) {
File folder = new File("C:\\userids-20130501");
Set<String> userIdSet = new LinkedHashSet<String>();
int count = 1;
for (File file : folder.listFiles()) {
if (file.isFile() && file.getName().endsWith(".txt")) {
try {
List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
userIdSet.addAll(content);
if(userIdSet.size() >= 10Million) {
File bigFile = new File("<path>" + count + ".txt");
FileUtils.writeLines(bigFile, userIdSet);
count++;
userIdSet = new LinkedHashSet<String>();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
If the purpose of saving the data in the LinkedHashSet is just for writing it again to another file then I have another solution.
EDIT to avoid OutOfMemory exception
public static void main(String args[]) {
File folder = new File("C:\\userids-20130501");
int fileNameCount = 1;
int contentCounter = 1;
File bigFile = new File("<path>" + fileNameCount + ".txt");
boolean isFileRequired = true;
for (File file : folder.listFiles()) {
if (file.isFile() && file.getName().endsWith(".txt")) {
try {
List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
contentCounter += content.size();
if(contentCounter < 10Million) {
FileUtils.writeLines(bigFile, content, true);
} else {
fileNameCount++;
bigFile = new File("<path>" + fileNameCount + ".txt");
FileUtils.writeLines(bigFile, content);
contentCounter = 1;
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
You can avoid the use of the Set as intermediate storage if you write at the same time that you read from file. You could do something like this,
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
public class AppMain {
private static final int NUMBER_REGISTERS = 10000000;
private static String[] filePaths = {"filePath1", "filePath2", "filePathN"};
private static String mergedFile = "mergedFile";
public static void main(String[] args) throws IOException {
mergeFiles(filePaths, mergedFile);
}
private static void mergeFiles(String[] filePaths, String mergedFile) throws IOException{
BufferedReader[] readerArray = createReaderArray(filePaths);
boolean[] closedReaderFlag = new boolean[readerArray.length];
PrintWriter writer = createWriter(mergedFile);
int currentReaderIndex = 0;
int numberLinesInMergedFile = 0;
BufferedReader currentReader = null;
String currentLine = null;
while(numberLinesInMergedFile < NUMBER_REGISTERS && getNumberReaderClosed(closedReaderFlag) < readerArray.length){
currentReaderIndex = (currentReaderIndex + 1) % readerArray.length;
if(closedReaderFlag[currentReaderIndex]){
continue;
}
currentReader = readerArray[currentReaderIndex];
currentLine = currentReader.readLine();
if(currentLine == null){
currentReader.close();
closedReaderFlag[currentReaderIndex] = true;
continue;
}
writer.println(currentLine);
numberLinesInMergedFile++;
}
writer.close();
for(int index = 0; index < readerArray.length; index++){
if(!closedReaderFlag[index]){
readerArray[index].close();
}
}
}
private static BufferedReader[] createReaderArray(String[] filePaths) throws FileNotFoundException{
BufferedReader[] readerArray = new BufferedReader[filePaths.length];
for (int index = 0; index < readerArray.length; index++) {
readerArray[index] = createReader(filePaths[index]);
}
return readerArray;
}
private static BufferedReader createReader(String path) throws FileNotFoundException{
BufferedReader reader = new BufferedReader(new FileReader(path));
return reader;
}
private static PrintWriter createWriter(String path) throws FileNotFoundException{
PrintWriter writer = new PrintWriter(path);
return writer;
}
private static int getNumberReaderClosed(boolean[] closedReaderFlag){
int count = 0;
for (boolean currentFlag : closedReaderFlag) {
if(currentFlag){
count++;
}
}
return count;
}
}
The way you're going, you may well run out of memory: you are keeping an unnecessary record in userIdSet.
A slight modification that can improve your code is as follows:
public static void main(String args[]) {
File folder = new File("C:\\userids-20130501");
File[] listOfFiles = folder.listFiles();
// there's no need for the userIdSet!
//Set<String> userIdSet = new LinkedHashSet<String>();
// Instead I'd go for a counter ;)
long userIdCount = 0;
for (int i = 0; i < listOfFiles.length; i++) {
File file = listOfFiles[i];
if (file.isFile() && file.getName().endsWith(".txt")) {
try {
List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
// I just want to know how many lines there are...
userIdCount += content.size();
// my guess is you'd probably want to print what you've got
// before a possible break?? - You know better!
System.out.println(content);
if(userIdCount >= 10Million) {
break;
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
Like I noted, this is just a slight modification. It was not my intention to run a very detailed analysis of your code; I just pointed out a glaring mis-design.
Finally, where you stated System.out.println(content);, you might consider writing to file at that point.
If you will write to the file one line at a time, your try-catch block may look like this:
try {
List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
for(int lineNumber = 0; lineNumber < content.size(); lineNumber++){
if(++userIdCount >= 10Million){
break;
}
// here, write to file... But I will use simple System.out.print for example
System.out.println(content.get(lineNumber));
}
} catch (IOException e) {
e.printStackTrace();
}
Your code can be improved in many ways, but I don't have time to do that. I hope my suggestions push you further along the right track. Cheers!
I have developed an application that reads all the files in each Java package inside a Java project and counts the lines of code in those individual files. For example, if a Java project has 2 packages with 4 files in total, then 4 files are read, and if each of those files has 10 lines of code, the overall project total is 4*10 = 40 lines of code. Below is my code:
private static int totalLineCount = 0;
private static int totalFileScannedCount = 0;
public static void main(final String[] args) throws FileNotFoundException {
JFileChooser chooser = new JFileChooser();
chooser.setCurrentDirectory(new java.io.File("C:" + File.separator));
chooser.setDialogTitle("FILES ALONG WITH LINE NUMBERS");
chooser.setFileSelectionMode(JFileChooser.DIRECTORIES_ONLY);
chooser.setAcceptAllFileFilterUsed(false);
if (chooser.showOpenDialog(null) == JFileChooser.APPROVE_OPTION) {
Map<File, Integer> result = new HashMap<File, Integer>();
File directory = new File(chooser.getSelectedFile().getAbsolutePath());
List<File> files = getFileListing(directory);
// print out all file names, in the order of File.compareTo()
for (File file : files) {
// System.out.println("Directory: " + file);
getFileLineCount(result, file);
//totalFileScannedCount += result.size(); //saral
}
System.out.println("*****************************************");
System.out.println("FILE NAME FOLLOWED BY LOC");
System.out.println("*****************************************");
for (Map.Entry<File, Integer> entry : result.entrySet()) {
System.out.println(entry.getKey().getAbsolutePath() + " ==> " + entry.getValue());
}
System.out.println("*****************************************");
System.out.println("SUM OF FILES SCANNED ==>" + "\t" + totalFileScannedCount);
System.out.println("SUM OF ALL THE LINES ==>" + "\t" + totalLineCount);
}
}
public static void getFileLineCount(final Map<File, Integer> result, final File directory)
throws FileNotFoundException {
File[] files = directory.listFiles(new FilenameFilter() {
public boolean accept(final File directory, final String name) {
if (name.endsWith(".java")) {
return true;
} else {
return false;
}
}
});
for (File file : files) {
if (file.isFile()) {
Scanner scanner = new Scanner(new FileReader(file));
int lineCount = 0;
totalFileScannedCount ++; //saral
try {
for (lineCount = 0; scanner.nextLine() != null; ) {
while (scanner.hasNextLine()) {
String line = scanner.nextLine().trim();
if (!line.isEmpty()) {
lineCount++;
}
}
}
} catch (NoSuchElementException e) {
result.put(file, lineCount);
totalLineCount += lineCount;
}
}
}
}
/**
* Recursively walk a directory tree and return a List of all Files found;
* the List is sorted using File.compareTo().
*
* @param aStartingDir
* is a valid directory, which can be read.
*/
static public List<File> getFileListing(final File aStartingDir) throws FileNotFoundException {
validateDirectory(aStartingDir);
List<File> result = getFileListingNoSort(aStartingDir);
Collections.sort(result);
return result;
}
// PRIVATE //
static private List<File> getFileListingNoSort(final File aStartingDir) throws FileNotFoundException {
List<File> result = new ArrayList<File>();
File[] filesAndDirs = aStartingDir.listFiles();
List<File> filesDirs = Arrays.asList(filesAndDirs);
for (File file : filesDirs) {
if (file.isDirectory()) {
result.add(file);
}
if (!file.isFile()) {
// must be a directory
// recursive call!
List<File> deeperList = getFileListingNoSort(file);
result.addAll(deeperList);
}
}
return result;
}
/**
* Directory is valid if it exists, does not represent a file, and can be
* read.
*/
static private void validateDirectory(final File aDirectory) throws FileNotFoundException {
if (aDirectory == null) {
throw new IllegalArgumentException("Directory should not be null.");
}
if (!aDirectory.exists()) {
throw new FileNotFoundException("Directory does not exist: " + aDirectory);
}
if (!aDirectory.isDirectory()) {
throw new IllegalArgumentException("Is not a directory: " + aDirectory);
}
if (!aDirectory.canRead()) {
throw new IllegalArgumentException("Directory cannot be read: " + aDirectory);
}
}
but the issue is that it also counts whitespace-only lines when calculating the lines of code for the individual files, which it should not. Please advise what modifications I need to make so that blank lines are not counted in the line-of-code totals for the individual files.
The idea that came to my mind was to just compare the read string with "" and count it only if it is not empty, like if (!readString.trim().equals("")) lineCount++
Please advise on this.
Suggestions:
- Scanner has a hasNextLine() method which you should use. I would use it as the condition of a while loop.
- Get the line inside the while loop by calling nextLine() just once inside of the loop.
- Again, call trim() on the Strings that are read in. I still don't see your attempt at this in the latest code update!
- A key concept when calling methods on Strings is that they are immutable, and the methods called on them do not change the underlying String; trim() is no different: the String it is called on is unchanged, but the String returned by the method is trimmed.
- String has an isEmpty() method that you should call after trimming the String.
So don't do:
try {
for (lineCount = 0; scanner.nextLine() != null; ) {
if(!readString.trim().equals("")) lineCount++; // updated one
}
} catch (NoSuchElementException e) {
result.put(file, lineCount);
totalLineCount += lineCount;
}
Instead do:
int lineCount = 0;
while (scanner.hasNextLine()) {
String line = scanner.nextLine().trim();
if (!line.isEmpty()) {
lineCount++;
}
}
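For context, here is how that loop might sit inside the question's getFileLineCount, a sketch assuming Java 7+ try-with-resources (result and totalLineCount are the question's own variables; the rest of the method is unchanged):
try (Scanner scanner = new Scanner(new FileReader(file))) {
    int lineCount = 0;
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine().trim(); // trim() returns a new String
        if (!line.isEmpty()) {                   // skip blank lines
            lineCount++;
        }
    }
    result.put(file, lineCount);
    totalLineCount += lineCount;
}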