Search a codebase for large methods - Java

By default the HotSpot JIT refuses to compile methods bigger than about 8 KB of bytecode, unless you pass -XX:-DontCompileHugeMethods. Is there anything that can scan a jar for such methods?
Jon Masamitsu describes how interpreted methods can slow down garbage collection, and notes that refactoring is generally wiser than -XX:-DontCompileHugeMethods.

Thanks to Peter Lawrey for the pointer to ASM. This program prints out the size of each method in a jar:
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import org.objectweb.asm.ClassReader;
import org.objectweb.asm.tree.ClassNode;
import org.objectweb.asm.tree.MethodNode;

public class MethodSizes {
    public static void main(String[] args) throws IOException {
        for (String filename : args) {
            System.out.println("Methods in " + filename);
            ZipFile zip = new ZipFile(filename);
            Enumeration<? extends ZipEntry> it = zip.entries();
            while (it.hasMoreElements()) {
                InputStream clazz = zip.getInputStream(it.nextElement());
                try {
                    ClassReader cr = new ClassReader(clazz);
                    ClassNode cn = new ClassNode();
                    cr.accept(cn, ClassReader.SKIP_DEBUG);
                    List<MethodNode> methods = cn.methods;
                    for (MethodNode method : methods) {
                        // instruction count, a rough proxy for bytecode size
                        int count = method.instructions.size();
                        System.out.println(count + " " + cn.name + "." + method.name);
                    }
                } catch (IllegalArgumentException ignored) {
                    // not a class file (e.g. MANIFEST.MF); skip it
                }
            }
            zip.close();
        }
    }
}
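Note that method.instructions.size() counts ASM instruction nodes rather than bytes. If you want something closer to the byte count HotSpot actually checks against (8000 bytes, if memory serves), the asm-commons jar has a CodeSizeEvaluator you can feed each MethodNode into. A rough sketch that drops into the loop above; the 8000 threshold is my assumption of HotSpot's default, not something from the original post:

import org.objectweb.asm.commons.CodeSizeEvaluator;

// ... inside the for (MethodNode method : methods) loop:
CodeSizeEvaluator eval = new CodeSizeEvaluator(null);
method.accept(eval); // replays the method's instructions into the evaluator
if (eval.getMaxSize() >= 8000) { // assumed HotSpot huge-method threshold, in bytes
    System.out.println(cn.name + "." + method.name
            + " is roughly " + eval.getMaxSize() + " bytes");
}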

Checkstyle would probably be good for this - it doesn't check against the 8k bytecode limit, but rather the number of executable statements in a method in general. To be honest, that is the limit you actually want in practice, though.
As you already state, -XX:-DontCompileHugeMethods is generally a bad idea - it forces the JVM to dig through all that ugly code and try to do something with it, which can hurt performance rather than help it! Refactoring, or better still not writing methods that huge to start with, would be the way forward.
Oh, and if the methods that huge ended up there through some human design, and not auto-generated code, then there's probably some people on your team who need talking to...

Related

Understanding what happens when we override the clone method with and without invoking super.clone?

I'm reading Effective Java by Joshua Bloch. I must say it's a dense and complex book. The chapter on Methods Common to All Objects (chapter 3) is proving hard for me to grasp, as I've been programming for less than 3 years (1 year in Java). I don't quite understand the concept of overriding the clone method appropriately. Can I get a simple-to-follow example of implementing clone the right way as well as the wrong way? And why would failing to invoke super.clone cause a problem? What will happen?
Thank you in advance.
I'm reading that book myself. Not sure if I did everything "right" in this example, but maybe it'll help you understand.
Computer.java
package testclone;

public class Computer implements Cloneable {
    String OperatingSystem;

    protected Computer Clone() throws CloneNotSupportedException {
        Computer newClone = (Computer) super.clone();
        newClone.OperatingSystem = this.OperatingSystem;
        return newClone;
    }
}
MultiCore.java
package testclone;

public class MultiCore extends Computer implements Cloneable {
    int NumberOfCores;

    @Override
    protected MultiCore Clone() throws CloneNotSupportedException {
        //********* use 1 of the next 2 lines ***********
        //MultiCore newClone = (MultiCore) super.clone();
        MultiCore newClone = new MultiCore();
        newClone.NumberOfCores = this.NumberOfCores;
        return newClone;
    }
}
TestClone.java
package testclone;

public class TestClone implements Cloneable {
    public static void main(String[] args) throws CloneNotSupportedException {
        //Computer myComputer = new Computer();
        //myComputer.OperatingSystem = "Windows";
        MultiCore myMultiCore = new MultiCore();
        myMultiCore.OperatingSystem = "Windows"; //field is in parent class
        myMultiCore.NumberOfCores = 4;
        MultiCore newMultiCore = myMultiCore.Clone();
        System.out.println("orig Operating System = " + myMultiCore.OperatingSystem);
        System.out.println("orig Number of Cores = " + myMultiCore.NumberOfCores);
        System.out.println("clone Operating System = " + newMultiCore.OperatingSystem);
        System.out.println("clone Number of Cores = " + newMultiCore.NumberOfCores);
    }
}
Output:
orig Operating System = Windows
orig Number of Cores = 4
clone Operating System = null      <-- this line is not what you want
clone Number of Cores = 4

If you use the super.clone() line instead, the output is:
orig Operating System = Windows
orig Number of Cores = 4
clone Operating System = Windows   <-- now it's what you want
clone Number of Cores = 4
So if you don't use super.clone(), the fields in the parent (or grandparent, or great-grandparent, etc.) don't get cloned.
Good luck!
You should always use super.clone(). If you don't, and say just return new MyObject(this.x);, then that works fine for instances of MyObject. But if someone extends MyObject, it's no longer possible for them to get an instance of the right class when overriding your clone method. The one thing Object.clone does that you can't do a good job of yourself is creating an instance of the right class; the rest is just copying instance fields, which is drudgework you could have done yourself if you wanted.
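To make that concrete, here is a minimal sketch of the conventional chaining pattern (identifiers are mine, not the poster's): every class calls super.clone() first and only then fixes up its own deep state, so the chain always bottoms out at Object.clone(), which creates an instance of the right runtime class and copies every field.

// Minimal sketch: both overrides delegate to super.clone(), so cloning a
// MultiCore still reaches Object.clone(), which copies ALL fields at once.
class Computer implements Cloneable {
    String operatingSystem;

    @Override
    protected Computer clone() throws CloneNotSupportedException {
        // Object.clone() already copies operatingSystem; nothing to fix up
        return (Computer) super.clone();
    }
}

class MultiCore extends Computer {
    int numberOfCores;

    @Override
    protected MultiCore clone() throws CloneNotSupportedException {
        // super.clone() chains up to Object.clone(), so numberOfCores and
        // operatingSystem are both copied, and the instance is a MultiCore
        return (MultiCore) super.clone();
    }
}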

Best way of extracting data from project

Here's what I've made so far:
import java.io.File;
import java.io.FileInputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.IOUtils;

public class Test {
    public static void main(String... args) {
        Pattern p = Pattern.compile("(?s).*(MyFunc[(](?s).*[)];)+(?s).*");
        File[] files = new File("C:\\TestDir").listFiles();
        showFiles(files, p);
    }

    public static void showFiles(File[] files, Pattern p) {
        for (File file : files) {
            if (file.isDirectory()) {
                System.out.println("Directory: " + file.getName());
                showFiles(file.listFiles(), p); // calls the same method again
            } else {
                System.out.println("File: " + file.getAbsolutePath());
                String f;
                try {
                    f = IOUtils.toString(new FileInputStream(file.getAbsolutePath()), "UTF-8");
                    System.out.println(file.getName());
                    Matcher m = p.matcher(f);
                    if (m.find()) {
                        System.out.println(m.group());
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                    return;
                }
            }
        }
    }
}
What I want to do is find every call of MyFunc written in files inside a certain directory (which may have subdirectories with files that should be checked too). The number of files is pretty big, but the above is very, very slow for even a single file of 1 MB. Do you have any idea how to achieve what I want? I didn't expect this to be so slow.
EDIT// If this can't be done efficiently by a simple program, please feel free to advise me on useful FREE frameworks. Thank you for your help, everyone.
The problem with your approach is the regular expression you're using. Including .* at the beginning and at the end of your pattern dramatically increases the work the regex engine has to do. Try the same code with the following regex:
(MyFunc\\(.*?\\);)
You can also apply the enhancements proposed by the other answers but I am pretty sure your bottleneck is in the regex itself.
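For instance, something like this (a minimal sketch of the suggested change, applied to the same file contents f from your code) lets Matcher.find() walk through every call site without the pattern having to swallow the whole file:

// drop the anchoring ".*" and let find() locate each match in turn
Pattern p = Pattern.compile("MyFunc\\(.*?\\);");
Matcher m = p.matcher(f); // f is the file contents, as in the question
while (m.find()) {
    System.out.println(m.group()); // one MyFunc(...); call per match
}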
Good luck!
You are likely taking a hit on creating a String out of each file's contents. This will stress the heap and garbage collector.
You can use the Scanner object to help with this:
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html
Additionally this has been answered here already:
Performing regex on a stream
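As a rough sketch of the Scanner idea (reusing the MyFunc pattern from the question, and assuming java.util.Scanner is imported), this streams the match over the file so the whole contents never sit in one String:

// Scanner buffers the file internally; findWithinHorizon(pattern, 0)
// searches with no horizon limit and returns each match in turn
try (Scanner sc = new Scanner(file, "UTF-8")) {
    String hit;
    while ((hit = sc.findWithinHorizon("MyFunc\\(.*?\\);", 0)) != null) {
        System.out.println(hit);
    }
}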
Best of luck!
This may help you along a little further:
http://www.java-tips.org/java-se-tips/java.util.regex/how-to-apply-regular-expressions-on-the-contents-of-a.html
Again, creating a String for each file is costly. This example uses memory-mapped files to avoid the hit on the garbage collector: the file's bytes live in memory mapped outside the JVM heap rather than in a Java String.
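In outline, the memory-mapped approach looks something like this (a sketch along the lines of the linked tip, using the question's pattern, with the java.nio imports elided; note the decode step still allocates a char buffer, but the raw bytes stay outside the JVM heap):

// map the file read-only, decode it to a CharBuffer, and run the matcher
// over that CharSequence instead of building a String first
try (FileChannel ch = FileChannel.open(file.toPath(), StandardOpenOption.READ)) {
    ByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    CharBuffer chars = StandardCharsets.UTF_8.decode(buf);
    Matcher m = Pattern.compile("MyFunc\\(.*?\\);").matcher(chars);
    while (m.find()) {
        System.out.println(m.group());
    }
}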

File I/O bottleneck found via VisualVM

I've found a bottleneck in my app that keeps growing as the data in my files grows (see the attached VisualVM screenshot below).
Below is the getFileContentsAsList code. How can it be made better performance-wise? I've read several posts on efficient file I/O, and some have suggested Scanner as a way to read from a file efficiently. I've also tried Apache Commons readFileToString, but that's not running fast either.
The data file that's causing the app to run slower is 8 KB... that doesn't seem too big to me.
I could convert to an embedded database like Apache Derby if that seems like a better route. Ultimately I'm looking for whatever will help the application run faster (it's a Java 1.7 Swing app, BTW).
Here's the code for getFileContentsAsList:
public static List<String> getFileContentsAsList(String filePath) throws IOException {
    if (ReceiptPrinterStringUtils.isNullOrEmpty(filePath)) throw new IllegalArgumentException("File path must not be null or empty");
    Scanner s = null;
    List<String> records = new ArrayList<String>();
    try {
        s = new Scanner(new BufferedReader(new FileReader(filePath)));
        s.useDelimiter(FileDelimiters.RECORD);
        while (s.hasNext()) {
            records.add(s.next());
        }
    } finally {
        if (s != null) {
            s.close();
        }
    }
    return records;
}
The size of an ArrayList is multiplied by 1.5 when necessary, so the number of reallocations is O(log N). (Doubling was used in Vector.) I would certainly use a LinkedList here, whose appends are O(1), and BufferedReader.readLine() rather than a Scanner if I were trying to speed it up. That said, it's hard to believe that the time to read one 8k file is seriously a concern; you can read millions of lines in a second.
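A minimal sketch of that suggestion, assuming the delimiter in FileDelimiters.RECORD is a line terminator (if it is not, keep the Scanner and just pre-size the list instead):

public static List<String> getFileContentsAsList(String filePath) throws IOException {
    List<String> records = new LinkedList<String>(); // O(1) append, no array copies
    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        String line;
        while ((line = reader.readLine()) != null) {
            records.add(line);
        }
    }
    return records;
}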
So, file I/O gets to be REALLY expensive if you do it a lot... as seen in my screenshot and original code, getFileContentsAsList, which contains file I/O calls, gets invoked quite a bit (18,425 times). VisualVM is a real gem of a tool for pointing out bottlenecks like these!
After contemplating various ways to improve performance, it dawned on me that possibly the best way is to do file I/O calls as little as possible. So I decided to use private static variables to hold the file contents, and to only do file I/O in the static initializer and when a file is written to. As my application is (fortunately) not doing excessive writing (but excessive reading), this makes for a much better performing application.
Here's the source for the entire class that contains the getFileContentsAsList method. I took a snapshot of that method and it now runs in 57.2 ms (down from 3116 ms). Also, it was my longest running method and is now my 4th longest running method. The top 5 longest running methods now run for a total of 498.8 ms, as opposed to the ones in the original screenshot that ran for a total of 3812.9 ms. That's a decrease of about 87%
[100 * (3812.9 - 498.8) / 3812.9].
package com.mbc.receiptprinter.util;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;

import org.apache.commons.io.FileUtils;

import com.mbc.receiptprinter.constant.FileDelimiters;
import com.mbc.receiptprinter.constant.FilePaths;

/*
 * Various File utility functions. This class uses the Apache Commons FileUtils class.
 */
public class ReceiptPrinterFileUtils {

    private static Map<String, String> fileContents = new HashMap<String, String>();
    private static Map<String, Boolean> fileHasBeenUpdated = new HashMap<String, Boolean>();

    static {
        for (FilePaths fp : FilePaths.values()) {
            File f = new File(fp.getPath());
            try {
                FileUtils.touch(f);
                fileHasBeenUpdated.put(fp.getPath(), false);
                fileContents.put(fp.getPath(), FileUtils.readFileToString(f));
            } catch (IOException e) {
                ReceiptPrinterLogger.logMessage(ReceiptPrinterFileUtils.class,
                        Level.SEVERE,
                        "IOException while performing FileUtils.touch in static block of ReceiptPrinterFileUtils", e);
            }
        }
    }

    public static String getFileContents(String filePath) throws IOException {
        if (ReceiptPrinterStringUtils.isNullOrEmpty(filePath)) throw new IllegalArgumentException("File path must not be null or empty");
        File f = new File(filePath);
        // re-read from disk only if the file has been written to since the last read
        if (fileHasBeenUpdated.get(filePath)) {
            fileContents.put(filePath, FileUtils.readFileToString(f));
            fileHasBeenUpdated.put(filePath, false);
        }
        return fileContents.get(filePath);
    }

    public static List<String> convertFileContentsToList(String fileContents) {
        List<String> records = new ArrayList<String>();
        if (fileContents.contains(FileDelimiters.RECORD)) {
            records = Arrays.asList(fileContents.split(FileDelimiters.RECORD));
        }
        return records;
    }

    public static void writeStringToFile(String filePath, String data) throws IOException {
        fileHasBeenUpdated.put(filePath, true);
        FileUtils.writeStringToFile(new File(filePath), data);
    }

    public static void writeStringToFile(String filePath, String data, boolean append) throws IOException {
        fileHasBeenUpdated.put(filePath, true);
        FileUtils.writeStringToFile(new File(filePath), data, append);
    }
}
ArrayLists have good performance for reading, and also for writing IF the length does not change very often. In your application the length changes very often (the backing array is grown by about half whenever it is full and an element is added), and your application then needs to copy the old array into a new, longer one.
You could use a LinkedList, where new elements are appended and no copy actions are needed:
List<String> records = new LinkedList<String>();
Or you could initialize the ArrayList with the approximate final number of records, which will reduce the number of copy actions:
List<String> records = new ArrayList<String>(2000);

which of the two is a better way of creating and destroying objects?

I have a question on lines 26 & 27:
String dumb = input.nextLine();
output.println(dumb.replaceAll(REMOVE, ADD));
I was hoping that I'd be able to shrink this down to a single line and save space, so I did:
output.println(new String(input.nextLine()).replaceAll(REMOVE, ADD));
But now I'm wondering about performance. I understand that this program is quite basic and doesn't need optimization, but I'd like to learn this.
The way I look at it, in the first scenario I'm creating a string object dumb, but once I leave the loop the object is abandoned and the JVM should clean it up, right? But does the JVM clean up the abandoned object faster than the program goes through the loop? Or will there be several string objects waiting for garbage collection once the program is done?
And is my logic correct that in the second scenario the String object is created on the fly and destroyed once the program has passed through that line? And is this in fact a performance gain?
I'd appreciate it if you could clear this up for me.
Thank you.
P.S. In case you are wondering about the program (I assumed it was straightforward): it takes an input file, an output file, and two words; it reads the input file, replaces the first word with the second, and writes the result into the second file. If you've actually read this far and would like to suggest ways I could make my code better, PLEASE DO SO. I'd be very grateful.
import java.io.File;
import java.io.PrintWriter;
import java.util.Scanner;

public class RW {
    public static void main(String[] args) throws Exception {
        String INPUT_FILE = args[0];
        String OUTPUT_FILE = args[1];
        String REMOVE = args[2];
        String ADD = args[3];

        File ifile = new File(INPUT_FILE);
        File ofile = new File(OUTPUT_FILE);
        if (ifile.exists() == false) {
            System.out.println("the input file does not exist in the current folder");
            System.out.println("please provide the input file");
            System.exit(0);
        }

        Scanner input = new Scanner(ifile);
        PrintWriter output = new PrintWriter(ofile);
        while (input.hasNextLine()) {
            String dumb = input.nextLine();
            output.println(dumb.replaceAll(REMOVE, ADD));
        }
        input.close();
        output.close();
    }
}
The very, very first thing I'm going to say is this:
Don't worry about optimizing performance prematurely. The Java compiler is smart; it'll optimize a lot of this stuff for you, and even if it didn't, you'd be optimizing away incredibly tiny amounts of time. The stream I/O you've got going there already runs for orders of magnitude longer than the amounts of time you're talking about.
What is most important is how easy the code is to understand. You've got a nice code style, going from your example, so keep that up. Which of the two code snippets is easier for someone other than you to read? That is the best option. :)
That said, here are some more specific answers to your questions:
Garbage collection will absolutely pick up objects instantiated inside the scope of a loop. Because the object is instantiated inside the loop, it becomes unreachable, and therefore eligible for cleanup, as soon as it falls out of scope. The next time GC runs, it can clean up everything that has become eligible.
Creating an object inline will still create an object. The constructor is still called, memory is still allocated... Under the hood, they are really, really similar. It's just that in one case that object has a name, and in the other it doesn't. You're not going to save any real resources by combining two lines of code into one.
"input.nextLine()" already returns a String, so you don't need to wrap it in a new String(). (So yes, removing that actually will result in one less object being instantiated!)
Local objects are eligible for GC once they go out of scope. That does not mean GC collects them at that very moment; eligible objects go through a lifecycle, and GC may or may not collect them immediately.
As far as your program is concerned, there is not much to optimize except a line or two. Below is a restructured program.
import java.io.File;
import java.io.PrintWriter;
import java.util.Scanner;

public class Test {
    public static void main(String[] args) throws Exception {
        String INPUT_FILE = args[0];
        String OUTPUT_FILE = args[1];
        String REMOVE = args[2];
        String ADD = args[3];

        File ifile = new File(INPUT_FILE);
        File ofile = new File(OUTPUT_FILE);
        if (ifile.exists() == false) {
            System.out.println("the input file does not exist in the current folder\nplease provide the input file");
            System.exit(0);
        }

        Scanner input = null;
        PrintWriter output = null;
        try {
            input = new Scanner(ifile);
            output = new PrintWriter(ofile);
            while (input.hasNextLine()) {
                output.println(input.nextLine().replaceAll(REMOVE, ADD));
            }
        } finally {
            if (input != null)
                input.close();
            if (output != null)
                output.close();
        }
    }
}
If you are concerned about object creation and performance, use a profiler to measure your code. And keep in mind that doing new String(input.nextLine()) is totally pointless, since input.nextLine() already returns an immutable String instance. So just do output.println(input.nextLine().replaceAll(REMOVE, ADD));.

FST (Finite-state transducers) Libraries, C++ or java

I have a problem to solve using FSTs.
Basically, I'll be building a morphological parser, and at this point I have to work with large transducers. Performance is the big issue here.
Recently I worked in C++ on other projects where performance matters, but now I'm considering Java, because of Java's benefits and because Java is getting better.
I studied some comparisons between Java and C++, but I cannot decide which language I should use for this specific problem, because it depends on the lib in use.
I can't find much information about Java libs, so my question is: are there any open source Java libs with good performance, comparable to the RWTH FSA Toolkit, which an article I read calls the fastest C++ lib?
Thanks all.
What are the "benefits" of Java, for your purposes? What specific problem does that platform solve that you need? What is the performance constraint you must consider? Were the "comparisons" fair, because Java is actually extremely difficult to benchmark. So is C++, but you can at least get some algorithmic boundary guarantees from STL.
I suggest you look at OpenFst and the AT&T finite-state transducer tools. There are others out there, but I think your worry about Java puts the cart before the horse-- focus on what solves your problem well.
Good luck!
http://jautomata.sourceforge.net/ and http://www.cs.duke.edu/csed/jflap/ are Java-based finite state machine libraries, although I don't have experience using them, so I cannot comment on their efficiency.
I'm one of the developers of the morfologik-stemming library. It's pure Java and its performance is very good, both when you build the automaton and when you use it. We use it for morphological analysis in LanguageTool.
The problem here is the minimum size of your objects in Java. In C++, without virtual methods and run-time type identification, your objects weigh exactly their content. And the time your automata spend manipulating memory has a big impact on performance.
I think that should be the main reason for choosing C++ over Java.
OpenFST is a C++ finite state transducer framework that is really comprehensive. Some people from CMU ported it to Java for use in their natural language processing.
A blog post series describing it.
The code is located on svn.
Update:
I ported it to Java here.
Lucene has an excellent implementation of FST, which is easy to use and high performance; it is what lets query engines like Elasticsearch and Solr deliver very fast sub-second term-based queries. Let me take an example:
import com.google.common.base.Preconditions;
import org.apache.lucene.store.ByteArrayDataInput;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.GrowableByteArrayDataOutput;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;
import java.io.IOException;
public class T {

    private final String[] inputValues = {"cat", "dog", "dogs"};
    private final long[] outputValues = {5, 7, 12};

    // https://lucene.apache.org/core/8_4_0/core/org/apache/lucene/util/fst/package-summary.html
    public static void main(String[] args) throws IOException {
        T t = new T();
        FST<Long> fst = t.buildFSTInMemory();
        System.out.println(String.format("memory used for fst is %d bytes", fst.ramBytesUsed()));
        t.searchFST(fst);
        byte[] bytes = t.serialize(fst);
        System.out.println(String.format("length of serialized fst is %d bytes", bytes.length));
        fst = t.deserialize(bytes);
        t.searchFST(fst);
    }

    private FST<Long> buildFSTInMemory() throws IOException {
        // Input values (keys). These must be provided to Builder in Unicode
        // sorted order! Use Collections.sort() to sort inputValues first.
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
        IntsRefBuilder scratchInts = new IntsRefBuilder();
        for (int i = 0; i < inputValues.length; i++) {
            // BytesRef(CharSequence) encodes the term as UTF-8 for us
            BytesRef scratchBytes = new BytesRef(inputValues[i]);
            builder.add(Util.toIntsRef(scratchBytes, scratchInts), outputValues[i]);
        }
        return builder.finish();
    }

    private FST<Long> deserialize(byte[] bytes) throws IOException {
        DataInput in = new ByteArrayDataInput(bytes);
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        return new FST<Long>(in, outputs);
    }

    private byte[] serialize(FST<Long> fst) throws IOException {
        final int capacity = 32;
        GrowableByteArrayDataOutput out = new GrowableByteArrayDataOutput(capacity);
        fst.save(out);
        return out.getBytes();
    }

    private void searchFST(FST<Long> fst) throws IOException {
        for (int i = 0; i < inputValues.length; i++) {
            Long value = Util.get(fst, new BytesRef(inputValues[i]));
            Preconditions.checkState(value == outputValues[i], "fatal error");
        }
    }
}
