Multiple file reading loop and distinguishing between .pdf and .doc files - java

I am writing a Java program in Eclipse that scans resumes for keywords and picks the most suitable resume, in addition to showing the keywords found in each one. The resumes can be in doc or pdf format.
I've successfully implemented code that reads pdf files and doc files separately (using Apache's PDFBox and POI jars and importing the required classes), displays the keywords, and reports resume strength as the number of keywords found.
Now I am stuck on two issues:
(1) I need to distinguish between a pdf file and a doc file within the program. That is easily done with an if statement, but I am not sure how to detect whether a file has a .pdf or .doc extension. (I intend to build an application in which the user selects the resumes, and the program then has to decide whether to run the doc reading block or the pdf reading block.)
(2) I intend to run the program on a list of resumes, so I need a loop that runs the keyword scan for each resume. I can't think of a way to do this: even if the files were named 'resume1', 'resume2', etc., I can't put the loop variable into the file location, as in 'C:/Resumes_Folder/Resume[i]', because that is the path.
Any help would be appreciated!

You can use a FileFilter to select only one file type or another, then respond accordingly. File.listFiles(FileFilter) gives you an array containing only files of the desired type.
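As a rough sketch of the FileFilter idea, assuming the resumes sit directly in one folder: the class name ResumeFiles and the method listResumes are made up for illustration, and only java.io.File calls from the standard library are used.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class ResumeFiles {
    // Accepts regular files whose name ends in .pdf or .doc, ignoring case.
    static final FileFilter PDF_OR_DOC = file -> {
        String name = file.getName().toLowerCase(Locale.ROOT);
        return file.isFile() && (name.endsWith(".pdf") || name.endsWith(".doc"));
    };

    // Returns the matching files in the folder, or an empty array when the
    // folder does not exist or cannot be read (listFiles returns null then).
    static File[] listResumes(File folder) {
        File[] found = folder.listFiles(PDF_OR_DOC);
        return found != null ? found : new File[0];
    }
}
```

Looping over the result, for example `for (File resume : ResumeFiles.listResumes(new File("C:/Resumes_Folder")))`, covers the second requirement without having to build file names from a counter.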
The second requirement is confusing to me. I think you would be well served by creating a class that encapsulates the data and behavior that you want for a parsed Resume. Write a factory class that takes in an InputStream and produces a Resume with the data you need inside.
You are making a classic mistake: You are embedding all the logic in a main method. This will make it harder to test your code.
All problem solving consists of breaking big problems into smaller ones, solving the small problems, and assembling them to finally solve the big problem.
I would recommend that you decompose this problem into smaller classes. For example, don't worry about looping over a directory's worth of files until you can read and parse an individual PDF and DOC file.
Create an interface:
    public interface ResumeParser {
        Resume parse(InputStream is) throws IOException;
    }
Write separate implementations for PDF and Word documents.
Create a factory to give you the appropriate ResumeParser based on file type:
    public class ResumeParserFactory {
        // Picks a parser from the file name or type string passed in
        public ResumeParser create(String fileType) {
            if (fileType.contains(".pdf")) {
                return new PdfResumeParser();
            } else if (fileType.contains(".doc")) {
                return new WordResumeParser();
            } else {
                throw new IllegalArgumentException("Unknown document type: " + fileType);
            }
        }
    }
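One caveat with contains(".doc") is that it also matches names such as "notes.docs.pdf" or "cv.docx". A small sketch of a stricter check, which takes only the text after the last dot, might look like this. FileExtensions and extensionOf are hypothetical names, and the method expects a bare file name rather than a full path.

```java
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class FileExtensions {
    // Returns the lower-cased text after the last dot, or "" when the name
    // has no dot, has its only dot at the start (".gitignore"), or ends
    // with a dot.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot <= 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }
}
```

The factory could then compare the result with "pdf" or "doc" using equals instead of contains.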
Be sure to write unit tests as you go. You should know how to use JUnit.

An alternative to a FileFilter is a DirectoryStream, because Files::newDirectoryStream makes it easy to specify the relevant file endings with a glob:
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.{doc,pdf}")) {
for (Path entry: stream) {
// process files here
}
} catch (DirectoryIteratorException ex) {
// I/O error encountered during the iteration; the cause is an IOException
throw ex.getCause();
}

You can do something basic like:
// Put the path to the folder containing all the resumes here
File f = new File("C:\\");
ArrayList<String> names = new ArrayList<>
(Arrays.asList(Objects.requireNonNull(f.list())));
for (String fileName : names) {
if (fileName.length() > 3) {
String type = fileName.substring(fileName.length() - 3);
if (type.equalsIgnoreCase("doc")) {
// doc file logic here
} else if (type.equalsIgnoreCase("pdf")) {
// pdf file logic here
}
}
}
But as DuffyMo's answer says, you can also use a FileFilter (it's definitely a better option than my quick code).
Hope it helps.

Related

Check does file exist java.nio

I'm trying to check whether a file exists, but it doesn't work.
FileSystem fs = FileSystems.getDefault();
Path p = fs.getPath(fileName);
if(!Files.exists(p)) {
create(fileName);
} else {
throw new ConflictException(String.format("File already exist."));
}
The problem is that even when a file with the same fileName exists, execution enters the if branch and calls create, and when create tries to create the file it throws an exception saying the file already exists.
What could be the problem, and what is a reliable way to check whether a file or directory exists when using FileSystem?
You're doing it wrong.
The general principle when working in environments subject to external change, such as file systems, is that you cannot do check-and-act. That approach is broken in such an environment, and you're doing it here:
You check if the file exists, and then depending on the result of that, you choose an action. That's check-and-act and doesn't work.
After all, what if the 'answer' to your check changes in between the check and the act? It doesn't even have to be another thread within your own VM, it can be another process. You can't synchronize on anything to get this job done 'safely' either.
No, the right principle is act-then-check. Do the operation 'make this file but only if it is not already there', atomically, and deal with the fallout, that is, deal with the error afterwards if the file already existed.
Java's nio has support for this, fortunately (the old File API is weaker here: File.createNewFile() only reports the outcome as a boolean, so prefer nio). Lastly, there is no need to go via the FileSystem stuff as long as you are using the default filesystem. However, if that was just there for the purposes of simplifying the question, this works just as well with a custom filesystem:
Path p = Paths.get(fileName);
try {
try (var out = Files.newOutputStream(p, StandardOpenOption.CREATE_NEW)) {
// write your file here
}
} catch (FileAlreadyExistsException e) {
throw new ConflictException("File already exists", e); // assumes ConflictException has a (String, Throwable) constructor
}
// CREATE_NEW is the magic voodoo here: That tells java:
// do this ONLY if you make a new file, otherwise don't do it, atomically.
Note, though, that FileAlreadyExistsException is fine as is, so I'm not sure you should necessarily wrap it in a ConflictException. That only makes sense if this API has abstracted away the notion that you are doing this thing to the filesystem (you did not include a method name or javadoc in your paste, so I can't tell).
If you don't need to write anything to the file, you don't need newOutputStream, you can just go with:
Path p = Paths.get(fileName);
try {
Files.createFile(p);
} catch (FileAlreadyExistsException e) {
throw new ConflictException("File already exists", e); // assumes ConflictException has a (String, Throwable) constructor
}
// Files.createFile implies CREATE_NEW already; it either makes
// the file and returns, or doesn't and throws FAEEx.
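To see the act-then-check behaviour in isolation, here is a minimal sketch that wraps Files.createFile and reports whether this call was the one that created the file. AtomicCreate and createIfAbsent are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
final class AtomicCreate {
    // Tries to create the file atomically. Returns true if this call
    // created it and false if it was already there; other I/O problems
    // still surface as IOException.
    static boolean createIfAbsent(Path p) throws IOException {
        try {
            Files.createFile(p);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }
}
```

Calling it twice on the same path should return true the first time and false the second, with no separate Files.exists check in between.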

Use Java to build an index of files in Windows 10

I am writing a desktop application in Java to quickly find files. I have used the exec command in Java to run powershell to do this, as Java's os.walk method seems to be much slower. Right now it takes about 5 minutes to generate a text file that lists the contents of all files on my computer (a total of around 440,000 files).
This is fine, but the problem I have is that I have no way of updating this list of files. So if I change a few files in my file system and want to update my file list, I can't do so quickly (i.e. incrementally). Instead, I have to generate the file list all over from scratch.
I know you can use git-bash to create a locate database (using updatedb). Now this is an awesome solution, but the application I'm trying to create may be used by people who don't have that installed. So I'd like to do it using default apps provided with Windows (i.e. powershell, or natively in Java). I am trying to make this app easy to use, so I don't want the user to have to install a bunch of other dependencies.
The following code shows how to use Java and avoid Powershell altogether. It builds an array in memory and writes it to a text file (467,000 files listed) all in under 30 seconds!
Run the following code in Main or wherever you want. It calls the createFileList method.
List<Path> pathsArrayList = new ArrayList<>();
Path rootPath_obj;
rootPath_obj = Paths.get(this.configMap.get("root_path"));
createFileList(rootPath_obj);
Here's the tree stream traversal code:
public void createFileList(Path path_in) throws IOException, AccessDeniedException {
try (DirectoryStream<Path> mystream = Files.newDirectoryStream(path_in)) {
for (Path entry : mystream) {
if (Files.isDirectory(entry)) {
createFileList(entry);
}
pathsArrayList.add(entry);
}
}
catch (AccessDeniedException ex) {
// Do nothing, just move on to the next file
}
}
Now write the file to save for later. This is the listing of all files within the root path tree.
System.out.println("Writing database...");
try (FileWriter writer = new FileWriter(this.configMap.get("db_path"))) {
for(Path pth: pathsArrayList){
writer.write(pth.toString() + System.lineSeparator());
}
}
System.out.println("...database written.");
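A possible variation on the traversal, offered as a sketch rather than a drop-in replacement, uses Files.walkFileTree so that unreadable entries are skipped in one place instead of by catching AccessDeniedException in a recursive method. FileIndexer and index are hypothetical names; like the code above, it collects both files and sub-directories but not the root itself.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class FileIndexer {
    // Collects every file and sub-directory under root, skipping entries
    // that cannot be read.
    static List<Path> index(Path root) throws IOException {
        List<Path> found = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                if (!dir.equals(root)) {
                    found.add(dir);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                found.add(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Unreadable entry (for example access denied): skip it.
                return FileVisitResult.CONTINUE;
            }
        });
        return found;
    }
}
```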

Create a text file if it doesn't exist and append to it if it does using Java BufferedWriter

This is probably ridiculously simple for gun Java programmers, yet the fact that I (a relative newbie to Java) couldn't find a simple, straightforward example of how to do it means that I'm going to use the self-answer option to hopefully prevent others going through similar frustration.
I needed to output error information to a simple text file. These actions would be infrequent and small (and sometimes not needed at all) so there is no point keeping a stream open for the file; the file is opened, written to and closed in the one action.
Unlike other "append" questions that I've come across, this one requires that the file be created on the first call to the method in that run of the Java application. The file will not exist before that.
The original code was:
Path pathOfLog = Paths.get(gsOutputPathUsed + gsOutputFileName);
Charset charSetOfLog = Charset.forName("US-ASCII");
bwOfLog = Files.newBufferedWriter(pathOfLog, charSetOfLog);
bwOfLog.append(stringToWrite, 0, stringToWrite.length());
iReturn = stringToWrite.length();
bwOfLog.newLine();
bwOfLog.close();
The variables starting with gs are pre-populated string variables showing the output location, and stringToWrite is an argument which is passed in.
So the .append method should be enough to show that I wanted to append content, right?
But it isn't; each time the procedure was called the file was left containing only the string of the most recent call.
The answer is that you also need to specify open options when calling the newBufferedWriter method. What gets you is the default arguments as specified in the documentation:
If no options are present then this method works as if the CREATE,
TRUNCATE_EXISTING, and WRITE options are present.
Specifically, it's TRUNCATE_EXISTING that causes the problem:
If the file already exists and it is opened for WRITE access, then its
length is truncated to 0.
The solution, then, is to change
bwOfLog = Files.newBufferedWriter(pathOfLog, charSetOfLog);
to
bwOfLog = Files.newBufferedWriter(pathOfLog, charSetOfLog,StandardOpenOption.CREATE, StandardOpenOption.APPEND);
Probably obvious to long time Java coders, less so to new ones. Hopefully this will help someone avoid a bit of head banging.
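Put together as a method that opens, writes and closes the file in one action, the fix might look like the sketch below. AppendLog and appendLine are hypothetical names, and try-with-resources is used so the writer is closed even if the write fails.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of any library.
final class AppendLog {
    // Opens the file, appends one line, and closes it again. CREATE makes
    // the file on the first call; APPEND keeps earlier content on later calls.
    static void appendLine(Path log, String line) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(log, StandardCharsets.US_ASCII,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            writer.write(line);
            writer.newLine();
        }
    }
}
```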
You can also try this :
Path path = Paths.get("C:\\Users", "textfile.txt");
String text = "\nHello how are you ?";
try (BufferedWriter writer = Files.newBufferedWriter(path, StandardCharsets.UTF_8, StandardOpenOption.APPEND,StandardOpenOption.CREATE)) {
writer.write(text);
} catch (IOException e) {
e.printStackTrace();
}

Read metadata with ExifTool

I'm trying to read metadata values from Illustrator files using ExifTool. I tried the following.
File[] images = new File("filepath").listFiles();
ExifTool tool = new ExifTool(Feature.STAY_OPEN);
for(File f : images) {
if (f.toString().contains(".ai"))
{
System.out.println("test "+tool.getImageMeta(f, Tag.DATE_TIME_ORIGINAL));
}
}
tool.close();
The code above does not print any value. I also tried this.
public static final File[] IMAGES = new File("filepath").listFiles();
ExifTool tool = new ExifTool(Feature.STAY_OPEN);
for (File f : IMAGES) {
System.out.println("\n[" + f.getName() + "]");
System.out.println(tool.getImageMeta(f, Format.NUMERIC,
Tag.values()));
}
This only prints {IMAGE_HEIGHT=2245, IMAGE_WIDTH=5393}. How do I read other metadata values using ExifTool? Any advice or reference links would be highly appreciated.
With the given API, one of the following applies:
1. The API does not contain the tag you are looking for.
2. The file itself might not have that tag filled in.
3. You might need to write your own wrapper that passes a more general tag command to exiftool.exe.
Look in the source code and find the enum containing all the tags available to the API, that'll show you what you're restricted to. But yeah, you might want to consider making your own class similar to the one you're using. I'm in the midst of doing the same. That way you can store the tags in perhaps a set or HashMap instead of an enum and therefore be much less limited in tag choice. Then, all you have to do is write the commands for the tags you want to the process's OutputStream and then read the results from the InputStream.

Java - find matching pairs from list

background:
I need to load test a process on a server that I am working with. What I am doing is I am creating a bunch of files on client side and will upload them to server. The server is monitoring for new files (in input dir, file names are unique) and once there is a new file it processes it, once done, it creates a response file with same name but different extension to output dir. If the processing fails, it puts the incoming file to error dir. I am using the inotifywait to monitor the changes on server, which outputs:
10:48:47 /path/to/in/ CREATE ABCD.infile1
10:48:55 /path/to/out/ CREATE ABCD.outfile1
or
10:49:11 /path/to/in/ CREATE ASDF.infile1
10:49:19 /path/to/err/ CREATE ASDF.infile1
problem:
I need to parse the list of all results (planning to implement this in Java) so that I take each infile and match it with the same file name (found in either ERR or OUT), calculate the time taken, and indicate whether it was a success or not. My idea is to create 3 lists (in, out, err) and parse them, something like this (in pseudo-code):
inList
outList
errList
for item : inList
if outlist.contains(item) parse;
else if errList.contains(item) parse;
else error;
question:
Is this efficient? Or is there a better way to approach this situation? You might think that this is code you execute just once, so why the struggle, but I really would like to know how to handle this properly.
The solution with lists is problematic: you have to keep them properly synchronized with the state of the drive and reload them every time. What is more, at some point you will hit the capacity limit for files stored in a single location.
The alternatives are to use the I/O API to check whether a path exists, or to introduce an intermediate database in which you store your values.
With the database approach, you store the keys together with the physical path each file really has.
If I were you, I would start with the I/O API and design a simple interface that can be replaced later if the solution turns out to be inefficient.
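For the log-parsing idea in the question itself, a map keyed by file name avoids scanning a list for every item. The sketch below rests on several assumptions: each line has exactly four whitespace-separated fields as in the sample output, events arrive in chronological order within one day, the directories end in /in/, /out/ and /err/, and the part of the file name before the last dot identifies a job. InotifyMatcher and match are hypothetical names.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of any library.
final class InotifyMatcher {
    // Parses lines such as "10:48:47 /path/to/in/ CREATE ABCD.infile1" and
    // reports, for each input file, the outcome and the seconds it took.
    static List<String> match(List<String> lines) {
        Map<String, LocalTime> started = new HashMap<>();
        List<String> report = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 4) {
                continue; // not in the assumed format
            }
            LocalTime time = LocalTime.parse(parts[0]);
            String dir = parts[1];
            String file = parts[3];
            int dot = file.lastIndexOf('.');
            String key = dot > 0 ? file.substring(0, dot) : file;
            if (dir.endsWith("/in/")) {
                started.put(key, time);
            } else if (dir.endsWith("/out/") || dir.endsWith("/err/")) {
                LocalTime begin = started.remove(key);
                if (begin == null) {
                    continue; // result without a recorded input: ignore
                }
                String outcome = dir.endsWith("/out/") ? "SUCCESS" : "ERROR";
                long seconds = Duration.between(begin, time).getSeconds();
                report.add(key + " " + outcome + " " + seconds + "s");
            }
        }
        return report;
    }
}
```

Each lookup in the map is a constant-time operation on average, so the whole log is handled in a single pass. Anything still left in the map at the end would be an input that never produced an output or error file.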
You can use the UserDefinedFileAttributeView concept.
Create your own file attribute, say "Result", and set its value for the files in the IN dir. If the file is moved to the OUT dir, "Result"="Success", and if the file is moved to the ERR dir, "Result"="Error".
I tried the below code, hope it helps.
    public static void main(String[] args) {
        try {
            Path file = Paths.get("C:\\Users\\rohit\\Desktop\\imp docs\\Steps.txt");
            UserDefinedFileAttributeView userView =
                    Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
            String attribName = "RESULT";
            String attribValue = "SUCCESS";
            // Store the attribute on the file
            userView.write(attribName, Charset.defaultCharset().encode(attribValue));
            // Read every user-defined attribute back and print those set to SUCCESS
            List<String> attribList = userView.list();
            for (String s : attribList) {
                ByteBuffer buf = ByteBuffer.allocate(userView.size(s));
                userView.read(s, buf);
                buf.flip();
                String value = Charset.defaultCharset().decode(buf).toString();
                if ("SUCCESS".equals(value)) {
                    System.out.print(String.format("User defined attribute: %s", s));
                    System.out.println(String.format("; value: %s", value));
                }
            }
        } catch (Exception e) {
            e.printStackTrace(); // report failures instead of swallowing them
        }
    }
You can do this for every file placed in IN dir.
