This is what I've made so far:
import java.io.File;
import java.io.FileInputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.IOUtils;

public class Test {
    public static void main(String... args) {
        Pattern p = Pattern.compile("(?s).*(MyFunc[(](?s).*[)];)+(?s).*");
        File[] files = new File("C:\\TestDir").listFiles();
        showFiles(files, p);
    }

    public static void showFiles(File[] files, Pattern p) {
        for (File file : files) {
            if (file.isDirectory()) {
                System.out.println("Directory: " + file.getName());
                showFiles(file.listFiles(), p); // Calls same method again.
            } else {
                System.out.println("File: " + file.getAbsolutePath());
                String f;
                try {
                    f = IOUtils.toString(new FileInputStream(file.getAbsolutePath()), "UTF-8");
                    System.out.println(file.getName());
                    Matcher m = p.matcher(f);
                    if (m.find()) {
                        System.out.println(m.group());
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                    return;
                }
            }
        }
    }
}
What I want to do is find every call of MyFunc in the files inside a certain directory (which may have subdirectories whose files should be checked too). The number of files is pretty big, but the above is very slow even for a single 1 MB file. Do you have any idea how to achieve what I want? I didn't expect this to be so slow.
EDIT// If this can't be done efficiently by a simple program, please feel free to advise me on useful FREE frameworks. Thank you for your help, everyone.
The problem with your approach is the regular expression you're using. Including .* at the beginning and at the end of the pattern increases processing time dramatically, because it forces enormous amounts of backtracking. Try the same code with the following regex:
(MyFunc\\(.*?\\);)
You can also apply the enhancements proposed by the other answers, but I am pretty sure your bottleneck is in the regex itself.
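A minimal sketch of that change, reusing the f string from the question's loop (Pattern.DOTALL replaces the inline (?s) in case the arguments span lines, and find() in a loop prints every call, not just the first):

Pattern p = Pattern.compile("MyFunc\\(.*?\\);", Pattern.DOTALL); // non-greedy, no leading/trailing .*
Matcher m = p.matcher(f);
while (m.find()) { // find() scans forward instead of matching the whole file at once
    System.out.println(m.group());
}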
Good luck!
You are likely taking a hit on creating a String out of each file's contents. This will stress the heap and garbage collector.
You can use the Scanner object to help with this:
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html
Additionally, this has been answered here already:
Performing regex on a stream
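Inside the question's try block that could look something like this (a sketch; it assumes the non-greedy pattern from the other answer). Scanner.findWithinHorizon streams through the file without first building one big String:

Scanner scanner = new Scanner(file, "UTF-8");
Pattern p = Pattern.compile("MyFunc\\(.*?\\);", Pattern.DOTALL);
String match;
while ((match = scanner.findWithinHorizon(p, 0)) != null) { // horizon 0 = no bound
    System.out.println(match);
}
scanner.close();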
Best of luck!
This may help you along a little further:
http://www.java-tips.org/java-se-tips/java.util.regex/how-to-apply-regular-expressions-on-the-contents-of-a.html
Again, creating a String for each file is costly. This example uses memory-mapped files to avoid the hit on the garbage collector: it uses the native (C) heap instead of memory inside the JVM.
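A rough sketch of that approach, assuming the question's Pattern p and UTF-8 encoding (imports from java.io and java.nio elided; note the decode step still allocates a CharBuffer, so the win is skipping the intermediate String copies):

try (FileChannel channel = new RandomAccessFile(file, "r").getChannel()) {
    // Map the file into memory outside the JVM heap
    MappedByteBuffer bytes = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    // Decode into a CharSequence the regex engine can run over
    CharBuffer chars = StandardCharsets.UTF_8.decode(bytes);
    Matcher m = p.matcher(chars);
    while (m.find()) {
        System.out.println(m.group());
    }
}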
I have a method that starts creating JSON files in each of the folders in my tree.
public static void fill(List<String> subFoldersPaths) {
    for (int i = 0; i < subFoldersPaths.size(); i++) {
        String fullFileName = subFoldersPaths.get(i) + FILE_NAME;
        String formatFullFileName = String.format(fullFileName, i) + "%d";
        Runnable runnable = new JsonCreator(formatFullFileName);
        new Thread(runnable).start();
    }
}
List<String> subFoldersPaths is a list that contains paths to each folder in order.
Here is my folder structure:
I want each folder to be filled with files in a separate thread, one file every 0.08 seconds. But my class does not fill every folder.
Here is a class that implements Runnable, which should perform the filling:
import com.epam.lab.model.Author;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import net.andreinc.mockneat.MockNeat;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import java.io.FileWriter;
import java.io.IOException;

public class JsonCreator implements Runnable {

    private static Logger logger = LogManager.getLogger();
    private static String fileName;
    private static final int FILES_COUNT = 100;

    public JsonCreator(String s) {
        this.fileName = s;
    }

    @Override
    public void run() {
        for (int i = 0; i < FILES_COUNT; i++) {
            try {
                String formatFullFileName = String.format(fileName, i) + ".json";
                FileWriter fileWriter = new FileWriter(formatFullFileName);
                fileWriter.write(createJsonString());
                fileWriter.close();
                Thread.sleep(80);
            } catch (IOException | InterruptedException e) {
                logger.error("File was not created", e);
            }
        }
    }

    private static String createJsonString() {
        MockNeat mockNeat = MockNeat.threadLocal();
        Gson gson = new GsonBuilder()
                .setPrettyPrinting()
                .create();
        String json = mockNeat
                .reflect(Author.class)
                .field("authorName", mockNeat.names().first())
                .field("authorSurname", mockNeat.names().last())
                .map(gson::toJson)
                .val();
        return json;
    }
}
But this class does not fill every folder with files (maybe there is a problem with the file names). I cannot figure it out.
And I want each folder below "foo" to be filled, in a separate thread, with JSON files in the amount of FILES_COUNT = 10.
Some examples of algorithm execution: the folder structure is created randomly, so it is almost always different, but this does not change the fact that files are not created in all folders.
Your code is buggy: you should never use that FileWriter constructor. Use new FileWriter(formatFullFileName, StandardCharsets.UTF_8), which is only available from JDK 11. If you're not on JDK 11, you can't use FileWriter at all: it uses the platform default encoding, and that is not acceptable; JSON must be in UTF-8 per the JSON spec, and you have no guarantee that UTF-8 is your platform default.
You aren't guarding your FileWriter with an ARM (try-with-resources) block; you should add that.
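A minimal sketch of that fix inside run(), keeping the question's names and the JDK 11 charset constructor from the previous point (needs import java.nio.charset.StandardCharsets):

String formatFullFileName = String.format(fileName, i) + ".json";
// try-with-resources closes the writer even when write() throws
try (FileWriter fileWriter = new FileWriter(formatFullFileName, StandardCharsets.UTF_8)) {
    fileWriter.write(createJsonString());
}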
In the first snippet, formatFullFileName is a variable holding a format string. In the run() method, it's the opposite: it holds the result of running a String.format op on one. That makes your code very hard to read.
Most likely your file names are incorrect. You should be using List<Path>, which would have removed any doubt. If your List<String> subFoldersPaths contains, for example, /home/misnomer/project/foo/1stLayerSubFolder0, and the constant FILE_NAME (which you did not include in your paste) is, say, example, then the path for the very first file to be created becomes /home/misnomer/project/foo/1stLayerSubFolder0example0.json, which is not what you wanted - you're missing a slash.
NB: If using the newer path API, writing a string to a file becomes vastly simpler: Files.writeString(path, string) (JDK 11+) is all you need (and note that the Files API defaults to UTF-8, unlike most other parts of the Java libraries that involve turning strings into bytes or vice versa).
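A sketch of what the Path-based version could look like (hypothetical, mixing names from both of the question's snippets; the String.format usage is an assumption since FILE_NAME wasn't shown):

Path folder = Paths.get(subFoldersPaths.get(i));
// resolve() inserts the separator that the string concatenation was missing
Path file = folder.resolve(String.format(FILE_NAME, i) + ".json");
Files.writeString(file, createJsonString()); // UTF-8 by default, JDK 11+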
The paste needs more info, or you should debug this on your own: print when you write a file, preferably including the thread name (you can get it with Thread.currentThread().getName()). That's how programming works: you don't just stare at it, go --heck, I dunno, better ask Stack Overflow!-- and then give up. You debug it. Use a debugger, or if you can't/don't want to, use the poor man's debugger: add a whole bunch of System.out.println statements. Go through your code and imagine (write it down if you have to) what each step is doing. Then add a println statement that confirms this. The very place where what the program says it is doing does not match what you thought it would do? That's where the bug is. Fix it, and keep going until all bugs are eliminated.
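For example, one line at the top of the try block in run() would already show which thread writes which file:

System.out.println(Thread.currentThread().getName() + " writing " + formatFullFileName);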
I have been attempting to program a solution for ImageJ to process my images.
I understand how to get a directory, run commands on it, etc. However, I've run into a situation where I now need some type of search function in order to pair two images together in a directory full of image pairs.
I'm hoping you can confirm I'm heading in the right direction and that my idea is sound. So far it is proving difficult for me to understand, as I have less than a month's worth of experience with Java. Since this project is directly for my research, I have plenty of drive to get it done; I just need some direction on which functions are useful to me.
I initially thought of using regex, but I saw that when you start processing a lot of images (especially with ImageJ, which it seems does not release memory well, if that's the correct way to put it), regex is very slow.
The general format of these images is:
someString_DAPI_0001.tif
someString_GFP_0001.tif
someString_DAPI_0002.tif
someString_GFP_0002.tif
someString_DAPI_0003.tif
someString_GFP_0003.tif
They are in alphabetical order, so it should be possible to go to the next image in the list. I'm just a bit lost on which functions I should use to accomplish this, but I think my overall while structure is correct, thanks to some help from Java forums. However, I'm still stuck on where to go next.
Here is my code so far (thanks to this SO answer for partial code):
int count = 0;
getFile("C:\\");
String DAPI;
String GFP;

private void getFile(String dirPath) {
    File f = new File(dirPath);
    File[] files = f.listFiles();
    while (files.length > 0) {
        if (/* file name contains "DAPI" */) {
            // DAPI = this file
            // substitute "DAPI" in the name to get the GFP file name
            // store the GFP file name into a variable
            // doSomething(DAPI, GFP);
        }
        // advance to the next file name in the list
    }
}
As of right now I don't really know how to search for a string within a string. I've seen regex capture groups and other solutions, but I don't know the "best" one for processing hundreds of images.
I also have no clue which function would be used to substitute substrings.
I'd much appreciate it if you could point me towards the functions best suited for this case. I like to figure things out on my own; I just need help getting to the right information. I also want to make sure I'm not making major logic mistakes here.
It doesn't seem like you need regex if your file names follow the simple pattern that you mentioned. You can simply iterate over the files and filter based on whether the file name contains DAPI; see below. This code may be an oversimplification of your requirements, but I couldn't tell based on the details you've provided.
import java.io.*;

public class Temp {
    int count = 0;

    private void getFile(String dirPath) {
        File f = new File(dirPath);
        File[] files = f.listFiles();
        if (files != null) {
            for (File file : files) {
                if (file.getName().contains("DAPI")) {
                    String dapiFile = file.getName();
                    String gfpFile = dapiFile.replace("DAPI", "GFP");
                    doSomething(dapiFile, gfpFile);
                }
            }
        }
    }

    // doSomething does nothing useful right now; expand on it.
    private void doSomething(String dapiFile, String gfpFile) {
        System.out.println(new File(dapiFile).getAbsolutePath());
        System.out.println(new File(gfpFile).getAbsolutePath());
    }

    public static void main(String[] args) {
        Temp app = new Temp();
        app.getFile("C:\\tmp\\");
    }
}
NOTE: As per Vogel612's answer, if you have Java 8 and like a functional solution you can have:
private void getFile(String dirPath) {
    try {
        Files.find(Paths.get(dirPath), 1,
                (path, basicFileAttributes) -> path.toFile().getName().contains("DAPI"))
             .forEach(dapiPath -> {
                 Path gfpPath = dapiPath.resolveSibling(
                         dapiPath.getFileName().toString().replace("DAPI", "GFP"));
                 doSomething(dapiPath, gfpPath);
             });
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Dummy method; does nothing useful yet.
private void doSomething(Path dapiPath, Path gfpPath) {
    System.out.println(dapiPath.toAbsolutePath().toString());
    System.out.println(gfpPath.toAbsolutePath().toString());
}
Using java.io.File is the wrong way to approach this problem. What you're looking for is a Stream-based solution using Files.find that would look something like this:
Files.find(dirPath, 1, (path, attributes) -> {
    return path.getFileName().toString().contains("DAPI");
}).forEach(path -> {
    Path gfpFile = path.resolveSibling(/* build GFP name */);
    doSomething(path, gfpFile);
});
What this does is:
Iterate over all Paths below dirPath 1 level deep (may be adjusted)
Check that the File's name contains "DAPI"
Use these files to find the relevant "GFP"-File
give them to doSomething
This is preferable to the File-based solution for multiple reasons:
It's significantly more informative when failing
It's cleaner and terser than your File-based solution and doesn't have to check for null
It's forward compatible, and thus preferable to a File-based solution
Files.find is available from Java 8 onwards
I have searched pretty thoroughly and I'm fairly certain that no one has asked this question. However, that may be because this is completely the wrong way to go about it. I once wrote an effective Java program that copied files from one directory to another. If a file already existed in the corresponding directory, it would be caught with an exception and renamed. I want to use this program for another application, and for that I want it to do nothing when the exception is caught: simply continue with the program without copying that file. I will be using this to fix approximately 18 GB of files, so if it printed even one character when the exception was caught, it would be extremely inefficient. This is the code I have so far:
import java.nio.file.*;
import java.io.IOException;

public class Sync
{
    public static void main(String[] args)
    {
        String from = args[0];
        Path to = FileSystems.getDefault().getPath(args[1]);
        copyFiles(from, to);
    }

    public static void copyFiles(String from, Path to)
    {
        try (DirectoryStream<Path> files = Files.newDirectoryStream(FileSystems.getDefault().getPath(from)))
        {
            for (Path f : files)
            {
                Files.copy(f, to.resolve(f.getFileName()));
                System.out.println(" " + f.getFileName() + " copied ");
            }
        }
        catch (FileAlreadyExistsException e)
        {
            // not sure what to do here
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }
}
Is there a way to use FileAlreadyExistsException to do this?
I wouldn't use try/catch to implement user logic; that is wrong from a programming point of view and also not efficient.
What I would do is check whether the file exists, and in that case skip the copy operation (or do whatever you want):
for (Path f : files)
{
    // use Files.exists(Path path, LinkOption... options) to test the target first
    if (Files.exists(to.resolve(f.getFileName())))
    {
        continue; // already there: skip silently
    }
    Files.copy(f, to.resolve(f.getFileName()));
    System.out.println(" " + f.getFileName() + " copied ");
}
Here are the Files docs:
http://docs.oracle.com/javase/7/docs/api/java/nio/file/Files.html
Good luck
I've found a bottleneck in my app that keeps growing as the data in my files grows (see attached screenshot of VisualVM below).
Below is the getFileContentsAsList code. How can this be made better performance-wise? I've read several posts on efficient file I/O and some have suggested Scanner as a way to read from a file efficiently. I've also tried Apache Commons readFileToString, but that isn't running fast either.
The data file that's causing the app to run slower is 8 KB... that doesn't seem too big to me.
I could convert to an embedded database like Apache Derby if that seems like a better route. Ultimately I'm looking for whatever will help the application run faster (it's a Java 1.7 Swing app, BTW).
Here's the code for getFileContentsAsList:
public static List<String> getFileContentsAsList(String filePath) throws IOException {
    if (ReceiptPrinterStringUtils.isNullOrEmpty(filePath))
        throw new IllegalArgumentException("File path must not be null or empty");
    Scanner s = null;
    List<String> records = new ArrayList<String>();
    try {
        s = new Scanner(new BufferedReader(new FileReader(filePath)));
        s.useDelimiter(FileDelimiters.RECORD);
        while (s.hasNext()) {
            records.add(s.next());
        }
    } finally {
        if (s != null) {
            s.close();
        }
    }
    return records;
}
The size of an ArrayList is multiplied by 1.5 when necessary, so the number of reallocations is O(log N). (Doubling was used in Vector.) I would certainly use a LinkedList here, whose appends are O(1), and BufferedReader.readLine() rather than a Scanner, if I were trying to speed it up. But it's hard to believe that the time to read one 8 KB file is seriously a concern. You can read millions of lines in a second.
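A minimal sketch of that suggestion (imports elided), assuming FileDelimiters.RECORD, which isn't shown in the question, is a line terminator:

List<String> records = new LinkedList<>();
try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
    String line;
    while ((line = reader.readLine()) != null) { // one record per line, assumed
        records.add(line);
    }
}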
So, file I/O gets to be REAL expensive if you do it a lot... as seen in my screenshot and original code, getFileContentsAsList, which contains file I/O calls, gets invoked quite a bit (18,425 times). VisualVM is a real gem of a tool for pointing out bottlenecks like these!
After contemplating various ways to improve performance, it dawned on me that possibly the best way is to do file I/O calls as little as possible. So I decided to use private static variables to hold the file contents and to only do file I/O in the static initializer and when a file is written to. As my application is (fortunately) not doing excessive writing (but excessive reading), this makes for a much better performing application.
Here's the source for the entire class that contains the getFileContentsAsList method. I took a snapshot of that method and it now runs in 57.2 ms (down from 3116 ms). Also, it was my longest running method and is now my 4th longest running method. The top 5 longest running methods now run for a total of 498.8 ms, as opposed to the ones in the original screenshot that ran for a total of 3812.9 ms. That's a percentage decrease of about 87%
[100 * (3812.9 - 498.8) / 3812.9].
package com.mbc.receiptprinter.util;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import org.apache.commons.io.FileUtils;
import com.mbc.receiptprinter.constant.FileDelimiters;
import com.mbc.receiptprinter.constant.FilePaths;

/*
 * Various file utility functions. This class uses the Apache Commons FileUtils class.
 */
public class ReceiptPrinterFileUtils {

    private static Map<String, String> fileContents = new HashMap<String, String>();
    private static Map<String, Boolean> fileHasBeenUpdated = new HashMap<String, Boolean>();

    static {
        for (FilePaths fp : FilePaths.values()) {
            File f = new File(fp.getPath());
            try {
                FileUtils.touch(f);
                fileHasBeenUpdated.put(fp.getPath(), false);
                fileContents.put(fp.getPath(), FileUtils.readFileToString(f));
            } catch (IOException e) {
                ReceiptPrinterLogger.logMessage(ReceiptPrinterFileUtils.class,
                        Level.SEVERE,
                        "IOException while performing FileUtils.touch in static block of ReceiptPrinterFileUtils", e);
            }
        }
    }

    public static String getFileContents(String filePath) throws IOException {
        if (ReceiptPrinterStringUtils.isNullOrEmpty(filePath))
            throw new IllegalArgumentException("File path must not be null or empty");
        File f = new File(filePath);
        if (fileHasBeenUpdated.get(filePath)) {
            fileContents.put(filePath, FileUtils.readFileToString(f));
            fileHasBeenUpdated.put(filePath, false);
        }
        return fileContents.get(filePath);
    }

    public static List<String> convertFileContentsToList(String fileContents) {
        List<String> records = new ArrayList<String>();
        if (fileContents.contains(FileDelimiters.RECORD)) {
            records = Arrays.asList(fileContents.split(FileDelimiters.RECORD));
        }
        return records;
    }

    public static void writeStringToFile(String filePath, String data) throws IOException {
        fileHasBeenUpdated.put(filePath, true);
        FileUtils.writeStringToFile(new File(filePath), data);
    }

    public static void writeStringToFile(String filePath, String data, boolean append) throws IOException {
        fileHasBeenUpdated.put(filePath, true);
        FileUtils.writeStringToFile(new File(filePath), data, append);
    }
}
ArrayLists perform well on reads, and also on writes IF the length does not change very often. In your application the length changes very often (the backing array grows by about half its size whenever it is full and an element is added), and your application then needs to copy the contents into a new, longer array.
You could use a LinkedList, where new elements are appended and no copy actions are needed.
List<String> records = new LinkedList<String>();
Or you could initialize the ArrayList with the approximate final number of records. This will reduce the number of copy actions.
List<String> records = new ArrayList<String>(2000);
By default the HotSpot JIT refuses to compile methods bigger than about 8k of bytecode (1). Is there anything that can scan a jar for such methods (2)?
(1) unless you pass -XX:-DontCompileHugeMethods
(2) Jon Masamitsu describes how interpreted methods can slow down garbage collection, and notes that refactoring is generally wiser than -XX:-DontCompileHugeMethods.
Thanks to Peter Lawrey for the pointer to ASM. This program prints out the size of each method in a jar:
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.tree.ClassNode;
import org.objectweb.asm.tree.MethodNode;

public class MethodSizes { // class name is arbitrary
    public static void main(String[] args) throws IOException {
        for (String filename : args) {
            System.out.println("Methods in " + filename);
            ZipFile zip = new ZipFile(filename);
            Enumeration<? extends ZipEntry> it = zip.entries();
            while (it.hasMoreElements()) {
                InputStream clazz = zip.getInputStream(it.nextElement());
                try {
                    ClassReader cr = new ClassReader(clazz);
                    ClassNode cn = new ClassNode();
                    cr.accept(cn, ClassReader.SKIP_DEBUG);
                    List<MethodNode> methods = cn.methods;
                    for (MethodNode method : methods) {
                        // instruction count is a proxy for bytecode size
                        int count = method.instructions.size();
                        System.out.println(count + " " + cn.name + "." + method.name);
                    }
                } catch (IllegalArgumentException ignored) {
                    // ignore entries that aren't valid class files
                }
            }
            zip.close();
        }
    }
}
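To try it, put the ASM core and tree jars on the classpath along with the wrapper class above, e.g. on Linux/macOS (jar names and versions are illustrative): java -cp asm-9.5.jar:asm-tree-9.5.jar:. MethodSizes myapp.jar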
Checkstyle would probably be good for this - it doesn't work on the 8k limit, but on the number of executable statements in a method in general. To be honest, that's a limit you want in practice anyway.
As you already state, -XX:-DontCompileHugeMethods is generally a bad idea - it forces the JVM to dig through all that ugly code and try to do something with it, which can have a negative effect on performance rather than a positive one! Refactoring, or better still not writing methods that huge to start with, would be the way forward.
Oh, and if methods that huge ended up there through some human design, and not auto-generated code, then there are probably some people on your team who need talking to...