Efficient way of crawling the file system using threads in Java

I am working on a Java project that runs OCR on PDFs in the file system so that their content can be searched.
The user specifies a folder. I extract each PDF's content with OCR and check whether it contains the keywords the user provided.
I want the traversal to continue (on one or more other threads) while OCR runs on a PDF, so that the overall performance does not drop dramatically.
Is there a way to accomplish this? The traversal code I am using is below.
public void traverseDirectory(File[] files) {
if (files != null) {
for (File file : files) {
if (file.isDirectory()) {
traverseDirectory(file.listFiles());
} else {
String[] type = file.getName().toString().split("\\.(?=[^\\.]+$)");
if (type.length > 1) {
if (type[1].equals("pdf")) {
//checking content goes here
}
}
}
}
}
}

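You can use Files.walkFileTree and submit each file to an ExecutorService, so the walk continues while OCR runs on the pool's threads:
ExecutorService executor = Executors.newFixedThreadPool(threadCount);
PdfOcrService service = ...
Path rootPath = Paths.get("/path/to/your/directory");
Files.walkFileTree(rootPath, new SimpleFileVisitor<Path>() {
@Override
public FileVisitResult visitFile(Path path, BasicFileAttributes attrs) {
// Queue the OCR work and return immediately so the walk keeps going
executor.submit(() -> {
service.performOcrOnFile(path);
});
return FileVisitResult.CONTINUE;
}
});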
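For a fuller picture, here is a minimal sketch of the same idea with a .pdf filter and an orderly shutdown of the pool. The class name PdfCrawler, the crawl method and the ocrTask callback are hypothetical stand-ins for whatever OCR service you have, and the one-hour wait is an arbitrary assumption you would tune.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Locale;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical helper: walks a tree on the calling thread, runs OCR on worker threads.
final class PdfCrawler {

    /** Queues every *.pdf under root for ocrTask and returns how many files were queued. */
    static int crawl(Path root, int threadCount, Consumer<Path> ocrTask)
            throws IOException, InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        int[] queued = {0}; // only touched by the walking thread
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
                    if (name.endsWith(".pdf")) {
                        // Hand the slow work to the pool; the walk continues immediately
                        executor.submit(() -> ocrTask.accept(file));
                        queued[0]++;
                    }
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // Skip unreadable entries instead of aborting the whole walk
                    return FileVisitResult.CONTINUE;
                }
            });
        } finally {
            executor.shutdown(); // no new tasks; queued ones still run
        }
        // Assumed upper bound on total OCR time; cancel whatever is left after that
        if (!executor.awaitTermination(1, TimeUnit.HOURS)) {
            executor.shutdownNow();
        }
        return queued[0];
    }
}
```

A call such as PdfCrawler.crawl(Paths.get("/path/to/your/directory"), 4, path -> service.performOcrOnFile(path)) would block until all queued OCR tasks have finished. OCR tends to be CPU-bound, so a pool size near the number of cores is a reasonable starting point.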

Related

How can you create a new zip file in Java and add a large directory of files to it?

I'm trying to add a directory of files to a zip. The directory contains around 150 files. Somewhere between 5 and 75 files in, the program crashes with the error message "The process cannot access the file because it is being used by another process."
I tried adding a delay, which may help but certainly does not fix the bug.
Using code from:
Is it possible to create a NEW zip file using the java FileSystem?
final File folder = new File("C:/myDir/img");
for (final File fileEntry : folder.listFiles()) {
if (fileEntry.isDirectory()) {
continue;
}
else {
String filename = fileEntry.getName();
String toBeAddedName = "C:/myDir/img/" + filename;
Path toBeAdded = FileSystems.getDefault().getPath(toBeAddedName).toAbsolutePath();
createZip(zipLocation, toBeAdded, "./" + filename);
System.out.println("Added file " + ++count);
//Delay because 'file in use' bug
try { Thread.sleep(1000); } //1secs
catch (InterruptedException e) {}
}
}
public static void createZip(Path zipLocation, Path toBeAdded, String internalPath) throws Throwable {
Map<String, String> env = new HashMap<String, String>();
//Check if file exists.
env.put("create", String.valueOf(Files.notExists(zipLocation)));
//Use a zip filesystem URI
URI fileUri = zipLocation.toUri(); //Here
URI zipUri = new URI("jar:" + fileUri.getScheme(), fileUri.getPath(), null);
System.out.println(zipUri);
//URI uri = URI.create("jar:file:"+zipLocation); //Here creates the zip
//Try with resource
try (FileSystem zipfs = FileSystems.newFileSystem(zipUri, env)) {
//Create internal path in the zipfs
Path internalTargetPath = zipfs.getPath(internalPath);
//Create parent dir
Files.createDirectories(internalTargetPath.getParent());
//Copy a file into the zip file
Files.copy(toBeAdded, internalTargetPath, StandardCopyOption.REPLACE_EXISTING);
}
}
I can't promise this is the cause of your problem, but your code compresses files into a ZIP file in a strange, or at least inefficient, manner. Specifically, you're opening up a new FileSystem for each individual file you want to compress. I'm assuming you're doing it this way because that's what the Q&A you linked to does. However, that answer is only compressing one file whereas you want to compress multiple files at the same time. You should keep the FileSystem open for the entire duration of compressing your directory.
public static void compress(Path directory, int depth, Path zipArchiveFile) throws IOException {
var uri = URI.create("jar:" + zipArchiveFile.toUri());
var env = Map.of("create", Boolean.toString(Files.notExists(zipArchiveFile, NOFOLLOW_LINKS)));
try (var fs = FileSystems.newFileSystem(uri, env)) {
Files.walkFileTree(directory, Set.of(), depth, new SimpleFileVisitor<>() {
private final Path archiveRoot = fs.getRootDirectories().iterator().next();
@Override
public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
// Don't include the directory itself
if (!directory.equals(dir)) {
Files.createDirectory(resolveDestination(dir));
}
return FileVisitResult.CONTINUE;
}
@Override
public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
Files.copy(file, resolveDestination(file), REPLACE_EXISTING);
return FileVisitResult.CONTINUE;
}
private Path resolveDestination(Path path) {
/*
* Use Path#resolve(String) instead of Path#resolve(Path). I couldn't find where the
* documentation mentions this, but at least three implementations will throw a
* ProviderMismatchException if #resolve(Path) is invoked with a Path argument that
* belongs to a different provider (i.e. if the implementation types don't match).
*
* Note: Those three implementations, at least in OpenJDK 12.0.1, are the JRT, ZIP/JAR,
* and Windows file system providers (I don't have access to Linux's or Mac's provider
* source currently).
*/
return archiveRoot.resolve(directory.relativize(path).toString());
}
});
}
}
Note: The depth parameter is used in exactly the same way as maxDepth is in Files#walkFileTree.
Note: If you only ever care about the files in the directory itself (i.e. don't want to recursively traverse the file tree), then you can use Files#list(Path). Don't forget to close the Stream when finished with it.
It's possible that opening and closing the FileSystem over and over is what causes your problem, in which case the above should solve the issue.
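For the non-recursive case mentioned in the note about Files#list(Path), a minimal sketch could look like the following. The class and method names (FlatZip, zipTopLevelFiles) are made up for illustration, and the sketch assumes the zip file does not live inside the directory being compressed.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: zips only the regular files directly inside one directory.
final class FlatZip {

    /** Copies every regular file directly inside directory into zipFile; returns how many were added. */
    static int zipTopLevelFiles(Path directory, Path zipFile) throws IOException {
        List<Path> files;
        // Files.list keeps a directory handle open, so close the Stream via try-with-resources
        try (Stream<Path> entries = Files.list(directory)) {
            files = entries.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        URI uri = URI.create("jar:" + zipFile.toUri());
        Map<String, String> env = Map.of("create", String.valueOf(Files.notExists(zipFile)));
        // One zip FileSystem for the whole batch; the archive is written when it is closed
        try (FileSystem zipfs = FileSystems.newFileSystem(uri, env)) {
            for (Path file : files) {
                Path target = zipfs.getPath("/", file.getFileName().toString());
                Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        return files.size();
    }
}
```

The point of the design is the same as above: the zip FileSystem is opened once and closed once, instead of once per file.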

How to list all text files in a directory [duplicate]

I need to read all ".txt" files from a folder (the user selects the folder).
How can I do this?
You can use the FilenameFilter interface; it is simple to use:
public static void main(String[] args) throws IOException {
File f = new File("c:\\mydirectory");
FilenameFilter textFilter = new FilenameFilter() {
public boolean accept(File dir, String name) {
return name.toLowerCase().endsWith(".txt");
}
};
File[] files = f.listFiles(textFilter);
for (File file : files) {
if (file.isDirectory()) {
System.out.print("directory:");
} else {
System.out.print(" file:");
}
System.out.println(file.getCanonicalPath());
}
}
Just create a FilenameFilter instance and override the accept method however you want.
Assuming you already have the directory, you can do something like this:
File directory = new File("user submits directory");
for (File file : directory.listFiles())
{
if (FilenameUtils.getExtension(file.getName()).equals("txt"))
{
// do something here.
}
}
FilenameUtils.getExtension() comes from Apache Commons IO (org.apache.commons.io.FilenameUtils).
Edit: What you seem to want to do is to access the file structure from the web browser. According to this previous SO post, what you want to do is not possible due to security reasons.
You need to read the directory and iterate over its contents.
This is more a question about Java access to the file system than about MVC.
I wrote the following function that will search for all the text files inside a directory.
public static void parseDir(File dirPath)
{
File files[] = null;
if(dirPath.isDirectory())
{
files = dirPath.listFiles();
for(File dirFiles:files)
{
if(dirFiles.isDirectory())
{
parseDir(dirFiles);
}
else
{
if(dirFiles.getName().endsWith(".txt"))
{
//do your processing here....
}
}
}
}
else
{
if(dirPath.getName().endsWith(".txt"))
{
//do your processing here....
}
}
}
See if this helps.
Provide a text box for the user to enter the path of the directory.
File userDir = new File("userEnteredDir");
File[] allFiles = userDir.listFiles();
Then iterate over allFiles and keep only the .txt files, for example by checking each name with a getExtension() method.
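If java.nio.file is available to you (Java 8 or later), the recursive search can also be sketched with Files.walk. The names TextFileFinder and findTextFiles are hypothetical, and the case-insensitive match on ".txt" is an assumption about what you want.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: collects .txt files below a directory.
final class TextFileFinder {

    /** Returns all regular files under dir (recursively) whose name ends in .txt, sorted by path. */
    static List<Path> findTextFiles(Path dir) throws IOException {
        // Files.walk must be closed, hence try-with-resources
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString()
                            .toLowerCase(Locale.ROOT).endsWith(".txt"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }
}
```

Unlike File#listFiles(), which returns null when the path is not a readable directory, Files.walk reports that case with an IOException.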

Generating URI's from files in a File class in Java FX?

I'm writing a music player in JavaFX. I use the MediaPlayer class, which is initialized with a Media object. As far as I can tell, the source passed to the Media constructor must be a URI string, so I've written this code to add a list of song files to a playlist and play that list:
public void setPlaylist (List<File> lista) {
songsList.clear();
for (File s : lista) {
songsList.add(s.toURI());
}
}
This works fine. However, when I instead take a File that points to a folder and build a URI from each file name in it, I run into trouble. This is what I've tried so far:
public void setPlaylist (File folder) {
songsList.clear();
for (String s : folder.list()) {
try {
songsList.add(new URI("file:///" + (folder + "\\" + s).replace("\\", "/").replaceAll(" ", "%20")));
} catch (URISyntaxException ex) {
Logger.getLogger(PlayList.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
I'm getting error logs like this:
SEVERE: null java.net.URISyntaxException: Illegal character in path at
index 78:
file:///C:/Users/Diego%20Aguilar/Music/3%20Grandes%20de%20la%20Banda/AlbumArt_{9AEABE24-F5A2-441C-A71A-D061E000A9BA}_Large.jpg
Use File#toURI() as you were using before to avoid running into encoding issues and make use of a FilenameFilter to restrict the list to media files only. Here's how the code would look then.
public void setPlaylist (File folder) {
songsList.clear();
File[] musicFiles = folder.listFiles(new FilenameFilter() {
@Override
public boolean accept(File dir, String name) {
return (name.endsWith(".mp3") || name.endsWith(".m4a"));
}
});
for (File file : musicFiles) {
songsList.add(file.toURI());
}
}
See JavaDocs: FilenameFilter, File#toURI()
Instead of iterating with String s : folder.list(), use File s : folder.listFiles() and take the URI from each file.
Your file URI contains a curly brace {, which is not a legal character in a URI and causes the java.net.URISyntaxException.
The path has to be properly encoded before it forms a valid URI.
See the URI RFC for what is and is not allowed in a URI.
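To illustrate the difference, here is a small sketch. PlaylistUris and its two methods are hypothetical names; the behaviour shown relies only on java.io.File#toURI() and the java.net.URI(String) constructor.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper contrasting File#toURI() with hand-built URI strings.
final class PlaylistUris {

    /** Converts files to URIs; File#toURI percent-encodes characters that are illegal in a URI path. */
    static List<URI> toUris(List<File> files) {
        List<URI> uris = new ArrayList<>();
        for (File file : files) {
            uris.add(file.toURI());
        }
        return uris;
    }

    /** Returns true if the raw string parses as a URI without any further encoding. */
    static boolean parsesAsUri(String raw) {
        try {
            new URI(raw);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```

With this, a hand-built string such as "file:///C:/Music/AlbumArt_{9A}_Large.jpg" fails parsesAsUri because of the braces, while a File with the same name converts cleanly through toUris, with the spaces and braces percent-encoded.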

How do I automatically convert all javadoc package.html files into package-info.java files?

We use a lot of legacy package.html files in our project and we want to convert them to package-info.java files. Doing that manually isn't an option (way too many files). Is there a good way to automate that?
We want to convert them for a couple of reasons:
From the javadoc specs: This file is new in JDK 5.0, and is preferred over package.html.
To not mix both types of files in the same codebase
To keep IntelliJ/Eclipse builds from putting those *.html files in our classes dirs (and possibly in release binary jars), where they would be treated like our other normal HTML resources.
You may need to change the directory separator if you're not running Windows. The conversion is a bit of a hack, but it should work. Out of curiosity, how many packages do you have that doing it manually isn't an option?
public class Converter {
public static void main(String[] args) {
File rootDir = new File(".");
renamePackageToPackageInfo(rootDir);
}
private static void renamePackageToPackageInfo(File dir) {
File[] files = dir.listFiles(new FilenameFilter() {
@Override
public boolean accept(File dir, String name) {
return "package.html".equals(name);
}
});
for (File file : files) {
convertFile(file);
}
// now recursively rename all the child directories.
File[] dirs = dir.listFiles(new FileFilter() {
@Override
public boolean accept(File pathname) {
return pathname.isDirectory();
}
});
for (File subdir : dirs) {
renamePackageToPackageInfo(subdir);
}
}
private static void convertFile(File html) {
// determine the FQN package name
String fqpn = getPackageName(html);
// check if package-info.java already exists
File packageInfo = new File(html.getParent(), "package-info.java");
if (packageInfo.exists()) {
System.out.println("package-info.java already exists for package: "+fqpn);
return;
}
// create the i/o streams, and start pumping the data
try {
PrintWriter out = new PrintWriter(packageInfo);
BufferedReader in = new BufferedReader(new FileReader(html));
out.println("/**");
// skip over the headers; readLine() returns null at end of file, so stop there too
String line;
while ((line = in.readLine()) != null) {
if (line.equalsIgnoreCase("<BODY>"))
break;
}
// now pump the file into the package-info.java file
while ((line = in.readLine()) != null) {
if (line.equalsIgnoreCase("</BODY>"))
break;
out.println(" * " + line);
}
out.println(" */");
out.println("package "+fqpn+";");
out.close();
in.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
// queue the package.html file for deletion
//html.deleteOnExit();
}
private static String getPackageName(File file) {
StringBuilder path = new StringBuilder(file.getParent());
// trim the first two characters (./ or .\)
path.delete(0, 2);
// then convert all separators into . (HACK: should use directory separator property)
return path.toString().replaceAll("\\\\", ".");
}
}
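If the separator hack is a concern, the package name can be derived without touching separators at all. The following is a sketch, not part of the converter above: PackageNames and packageNameOf are hypothetical names, and it assumes you know the source root that the packages are relative to.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: separator-independent replacement for getPackageName.
final class PackageNames {

    /** Derives a dotted package name from the directory of packageHtml, relative to sourceRoot. */
    static String packageNameOf(Path sourceRoot, Path packageHtml) {
        Path relativeDir = sourceRoot.toAbsolutePath().normalize()
                .relativize(packageHtml.toAbsolutePath().normalize().getParent());
        List<String> parts = new ArrayList<>();
        // Iterating a Path yields its name elements, whatever the platform separator is
        for (Path element : relativeDir) {
            parts.add(element.toString());
        }
        // A package.html directly in sourceRoot yields the empty string (default package)
        return String.join(".", parts);
    }
}
```

In the converter, this would be called with the root directory and html.toPath() in place of the delete(0, 2) and replaceAll steps.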
The IntelliJ guys have made an intention to do this for all files. It's been resolved and will probably be released in the next IntelliJ release.
To do this in batch mode in IDEA:
1. In Settings, activate the inspection "'package.html' may be converted to 'package-info.java' inspection".
2. Open a package.html file.
3. A banner offering to fix the inspection appears at the top of the file.
4. Click the settings icon at the right of the banner.
5. Select "Run inspection on" >> "Whole project".
6. Click "Convert to package-info.java" >> OK.
7. Optionally remove the inappropriate lines (sed -i "/Put @see and @since/d" `find . -name "package-info.java"`).
