I've found a bottleneck in my app that keeps growing as the data in my files grows (see the attached VisualVM screenshot below).
Below is the getFileContentsAsList code. How can this be made better performance-wise? I've read several posts on efficient file I/O, and some have suggested Scanner as a way to read from a file efficiently. I've also tried Apache Commons readFileToString, but that isn't fast either.
The data file that's causing the app to run slower is 8 KB...that doesn't seem too big to me.
I could convert to an embedded database like Apache Derby if that seems like a better route. Ultimately I'm looking for whatever will help the application run faster (it's a Java 1.7 Swing app, BTW).
Here's the code for getFileContentsAsList:
public static List<String> getFileContentsAsList(String filePath) throws IOException {
    if (ReceiptPrinterStringUtils.isNullOrEmpty(filePath)) throw new IllegalArgumentException("File path must not be null or empty");
    Scanner s = null;
    List<String> records = new ArrayList<String>();
    try {
        s = new Scanner(new BufferedReader(new FileReader(filePath)));
        s.useDelimiter(FileDelimiters.RECORD);
        while (s.hasNext()) {
            records.add(s.next());
        }
    } finally {
        if (s != null) {
            s.close();
        }
    }
    return records;
}
The backing array of an ArrayList grows by a factor of 1.5 when necessary, which means only O(log N) resize-and-copy operations overall. (Vector used doubling.) I would certainly use a LinkedList here, whose appends are O(1) with no copying, and BufferedReader.readLine() rather than a Scanner if I were trying to speed it up. That said, it's hard to believe that the time to read one 8 KB file is seriously a concern: you can read millions of lines in a second.
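A minimal sketch of that suggestion (it assumes the records are line-delimited; the original code uses FileDelimiters.RECORD, which may differ, and the class name here is illustrative):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

public class FileLines {
    // Read every line of the file into a list; LinkedList appends without copying.
    public static List<String> getFileContentsAsList(String filePath) throws IOException {
        List<String> records = new LinkedList<String>();
        // try-with-resources (Java 7+) closes the reader even if an exception is thrown
        try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
            String line;
            while ((line = reader.readLine()) != null) {
                records.add(line);
            }
        }
        return records;
    }
}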
So, file I/O gets to be REALLY expensive if you do it a lot. As seen in my screenshot and original code, getFileContentsAsList, which performs file I/O, gets invoked quite a bit (18,425 times). VisualVM is a real gem of a tool for pointing out bottlenecks like these!
After contemplating various ways to improve performance, it dawned on me that possibly the best way is to do file I/O calls as little as possible. So I decided to use private static variables to hold the file contents, and to do file I/O only in the static initializer and when a file is written to. As my application (fortunately) does not do excessive writing (but does do excessive reading), this makes for a much better performing application.
Here's the source for the entire class that contains the getFileContentsAsList method. I took a snapshot of that method and it now runs in 57.2 ms (down from 3116 ms). It had been my longest running method and is now my 4th longest running method. The top 5 longest running methods now run for a total of 498.8 ms, as opposed to the ones in the original screenshot, which ran for a total of 3812.9 ms. That's a decrease of about 87%
[100 * (3812.9 - 498.8) / 3812.9].
package com.mbc.receiptprinter.util;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;

import org.apache.commons.io.FileUtils;

import com.mbc.receiptprinter.constant.FileDelimiters;
import com.mbc.receiptprinter.constant.FilePaths;

/**
 * Various file utility functions. This class uses the Apache Commons FileUtils class.
 */
public class ReceiptPrinterFileUtils {

    private static Map<String, String> fileContents = new HashMap<String, String>();
    private static Map<String, Boolean> fileHasBeenUpdated = new HashMap<String, Boolean>();

    static {
        for (FilePaths fp : FilePaths.values()) {
            File f = new File(fp.getPath());
            try {
                FileUtils.touch(f);
                fileHasBeenUpdated.put(fp.getPath(), false);
                fileContents.put(fp.getPath(), FileUtils.readFileToString(f));
            } catch (IOException e) {
                ReceiptPrinterLogger.logMessage(ReceiptPrinterFileUtils.class,
                        Level.SEVERE,
                        "IOException while performing FileUtils.touch in static block of ReceiptPrinterFileUtils", e);
            }
        }
    }

    public static String getFileContents(String filePath) throws IOException {
        if (ReceiptPrinterStringUtils.isNullOrEmpty(filePath)) throw new IllegalArgumentException("File path must not be null or empty");
        File f = new File(filePath);
        // Re-read from disk only when a write has invalidated the cached copy;
        // assumes filePath is one of the FilePaths cached by the static block.
        if (fileHasBeenUpdated.get(filePath)) {
            fileContents.put(filePath, FileUtils.readFileToString(f));
            fileHasBeenUpdated.put(filePath, false);
        }
        return fileContents.get(filePath);
    }

    public static List<String> convertFileContentsToList(String fileContents) {
        List<String> records = new ArrayList<String>();
        if (fileContents.contains(FileDelimiters.RECORD)) {
            records = Arrays.asList(fileContents.split(FileDelimiters.RECORD));
        }
        return records;
    }

    public static void writeStringToFile(String filePath, String data) throws IOException {
        fileHasBeenUpdated.put(filePath, true);
        FileUtils.writeStringToFile(new File(filePath), data);
    }

    public static void writeStringToFile(String filePath, String data, boolean append) throws IOException {
        fileHasBeenUpdated.put(filePath, true);
        FileUtils.writeStringToFile(new File(filePath), data, append);
    }
}
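A hypothetical call site (not part of the original class) showing how the old getFileContentsAsList usage can be replaced by combining the two methods:

// Serve records from the in-memory cache; disk is touched only if the
// file was written since the last read.
List<String> records = ReceiptPrinterFileUtils.convertFileContentsToList(
        ReceiptPrinterFileUtils.getFileContents(filePath));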
ArrayLists have good read performance, and also good write performance IF the length does not change very often. In your application the length changes very often (the backing array grows by half when it is full and an element is added), and your application then needs to copy your array into a new, longer array.
You could use a LinkedList, where new elements are appended and no copy actions are needed.
List<String> records = new LinkedList<String>();
Or you could initialize the ArrayList with the approximate final number of records. This will reduce the number of copy actions.
List<String> records = new ArrayList<String>(2000);
Related
I'm currently working on a project and I'm running into a couple of issues. This project involves working with two classes, Subject and TestSubject. Basically, I need my program (in the TestSubject class) to read details (subject code and subject name) from a text file, create Subject objects using this information, then add those to an array list. The text file looks like this:
ITC105: Communication and Information Management
ITC106: Programming Principles
ITC114: Introduction to Database Systems
ITC161: Computer Systems
ITC204: Human Computer Interaction
ITC205: Professional Programming Practice
The first part is the subject code, i.e. ITC105, and the second part is the name (Communication and Information Management).
I have created the subject object with the code and name as strings, with getters and setters to allow access (in the Subject class):
private static String subjectCode;
private static String subjectName;

public Subject(String newSubjectCode, String newSubjectName) {
    newSubjectCode = subjectCode;
    newSubjectName = subjectName;
}

public String getSubjectCode() {
    return subjectCode;
}

public String getSubjectName() {
    return subjectName;
}

public void setSubjectCode(String newSubjectCode) {
    subjectCode = newSubjectCode;
}

public void setSubjectName(String newSubjectName) {
    subjectName = newSubjectName;
}
The code I have so far for reading the file and creating the array list is:
public class TestSubject {

    @SuppressWarnings({ "null", "resource" })
    public static void main(String[] args) throws IOException {
        File subjectFile = new File("A:\\Assessment 3 Task 1\\src\\subjects.txt");
        Scanner scanFile = new Scanner(subjectFile);
        System.out.println("The current subjects are as follows: ");
        System.out.println(" ");
        while (scanFile.hasNextLine()) {
            System.out.println(scanFile.nextLine());
        }
        // This array will store the list of subject objects.
        ArrayList<Object> subjectList = new ArrayList<>();
        // Subjects split into code and name and added to a new subject object.
        String[] token = new String[3];
        while (scanFile.hasNextLine()) {
            token = scanFile.nextLine().split(": ");
            String code = token[0] + ": ";
            String name = token[1];
            Subject addSubjects = new Subject(code, name);
            // Each subject is then added to the subject list array list.
            subjectList.add(addSubjects);
        }
        // Check if the array list is being filled by printing it to the console.
        System.out.println(subjectList.toString());
This code isn't working; the array list is just printing as blank. I have tried doing this several ways, including a buffered reader, but I can't get it to work so far. The next section of code allows a user to enter a subject code and name, which is then added to the array list as well. That section of code works perfectly; I'm just stuck on the above part. Any advice on how to fix it would be amazing.
Another small thing:
File subjectFile = new File ("A:\\Assessment 3 Task 1\\src\\subjects.txt"); //this file path
Scanner scanFile = new Scanner(subjectFile);
I'd like to know how I can change the file path so that it will still work if the folder is moved or the files are opened on another computer. The .txt file is in the source folder with the Java files. I have tried:
File subjectFile = new File ("subjects.txt");
But that doesn't work and just throws errors.
That is because you have already read through the file
while (scanFile.hasNextLine()) {
    System.out.println(scanFile.nextLine());
}
The contents are exhausted. So when you do
while (scanFile.hasNextLine()) {
    token = scanFile.nextLine().split(": ");
there is no data left.
Remove the first loop or re-open the file. Or, as @UsagiMiyamoto mentions, read each line into a String variable, print it, then split it... all in one loop.
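A minimal sketch of that one-loop approach, reusing the names from the question:

// Print and parse each line as it is read, so the Scanner is traversed only once.
while (scanFile.hasNextLine()) {
    String line = scanFile.nextLine();
    System.out.println(line);
    String[] parts = line.split(": ");
    subjectList.add(new Subject(parts[0] + ": ", parts[1]));
}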
I assume you are just beginning to learn Java, and hence the below code is probably way too advanced, but it may help others who are trying to do something similar, and also give you a glimpse of what you will probably learn in future.
The below code uses the following (in no particular order):
Streams
Accessing resources
Records
try-with-resources
Multi-catch
Method references
NIO.2
More notes after the code.
import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public record Subject(String subjectCode, String subjectName) {

    private static final String DELIMITER = ": ";

    private static Path getPath(String filename) throws URISyntaxException {
        URL url = Subject.class.getResource(filename);
        URI uri = url.toURI(); // throws java.net.URISyntaxException
        return Paths.get(uri);
    }

    private static Subject makeSubject(String line) {
        String[] parts = line.split(DELIMITER);
        return new Subject(parts[0].trim(), parts[1].trim());
    }

    /**
     * Reads contents of a text file, converts its contents to a list of
     * instances of this record and displays that list.
     *
     * @param args - not used.
     */
    public static void main(String[] args) {
        try {
            Path path = getPath("subjects.txt");
            try (Stream<String> lines = Files.lines(path)) { // throws java.io.IOException
                lines.map(Subject::makeSubject)
                     .collect(Collectors.toList())
                     .forEach(System.out::println);
            }
        }
        catch (IOException | URISyntaxException x) {
            x.printStackTrace();
        }
    }
}
A Java record is applicable for an immutable object and it simply saves you from writing code for methods including getters as well as equals, hashCode and toString. (There are no setters since a record is immutable.) It's a bit like Project Lombok. I would say that a Subject is immutable since I don't think the code or name would need to be changed and that's why I thought making Subject a record was applicable.
Running the above code produces the following output:
Subject[subjectCode=ITC105, subjectName=Communication and Information Management]
Subject[subjectCode=ITC106, subjectName=Programming Principles]
Subject[subjectCode=ITC114, subjectName=Introduction to Database Systems]
Subject[subjectCode=ITC161, subjectName=Computer Systems]
Subject[subjectCode=ITC204, subjectName=Human Computer Interaction]
Subject[subjectCode=ITC205, subjectName=Professional Programming Practice]
Regarding
I'd like to know how I can change the file path so that it will still work if the folder is moved
I placed file subjects.txt in the same folder as file Subject.class, which allowed me to use method getResource. Refer to the Accessing resources link, above. Note that this can't be used if
the files are opened on another computer
Alternatively, there are several directories whose paths are stored in System properties including
java.home
java.io.tmpdir
user.home
user.dir
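For example, a small sketch (it assumes you decide to keep subjects.txt in the user's home directory, which exists on any computer):

// Resolve the data file against the user's home directory instead of a
// hard-coded absolute path such as "A:\\...".
String home = System.getProperty("user.home");
File subjectFile = new File(home, "subjects.txt"); // hypothetical location
Scanner scanFile = new Scanner(subjectFile);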
What did your debug console say about the exception?
Your code works very well in my editor.
And you should code like below if you want to read the file through a relative path.
before ->
new File ("A:\Assessment 3 Task 1\src\subjects.txt");
after ->
new File (".\\subjects.txt");
I have a method that starts creating JSON files in each of the folders in my tree.
public static void fill(List<String> subFoldersPaths) {
    for (int i = 0; i < subFoldersPaths.size(); i++) {
        String fullFileName = subFoldersPaths.get(i) + FILE_NAME;
        String formatFullFileName = String.format(fullFileName, i) + "%d";
        Runnable runnable = new JsonCreator(formatFullFileName);
        new Thread(runnable).start();
    }
}
List<String> subFoldersPaths is a list that contains paths to each folder in order.
Here is my folder structure:
I want each folder to be filled with files in a separate thread every 0.08 seconds. But my class will not fill every folder.
Here is a class that implements Runnable, which should perform the filling:
import com.epam.lab.model.Author;
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import net.andreinc.mockneat.MockNeat;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.io.FileWriter;
import java.io.IOException;

public class JsonCreator implements Runnable {

    private static Logger logger = LogManager.getLogger();
    private static String fileName;
    private static final int FILES_COUNT = 100;

    public JsonCreator(String s) {
        this.fileName = s;
    }

    @Override
    public void run() {
        for (int i = 0; i < FILES_COUNT; i++) {
            try {
                String formatFullFileName = String.format(fileName, i) + ".json";
                FileWriter fileWriter = new FileWriter(formatFullFileName);
                fileWriter.write(createJsonString());
                fileWriter.close();
                Thread.sleep(80);
            } catch (IOException | InterruptedException e) {
                logger.error("File was not created", e);
            }
        }
    }

    private static String createJsonString() {
        MockNeat mockNeat = MockNeat.threadLocal();
        Gson gson = new GsonBuilder()
                .setPrettyPrinting()
                .create();
        String json = mockNeat
                .reflect(Author.class)
                .field("authorName", mockNeat.names().first())
                .field("authorSurname", mockNeat.names().last())
                .map(gson::toJson)
                .val();
        return json;
    }
}
But this class does not fill every folder with files (maybe there is a problem with the file names); I can not figure it out. I want each folder below "foo" to be filled, in a separate thread, with FILES_COUNT JSON files.
Some examples of the algorithm execution: the folder structure is created randomly, so it is almost always different, but this does not affect the fact that files are not created in all folders.
Your code is buggy; you cannot ever use that FileWriter constructor. Use new FileWriter(formatFullFileName, StandardCharsets.UTF_8), which is only in JDK 11. If you're not on JDK 11, you can't use FileWriter at all (it uses the platform default encoding, and that is not acceptable; JSON must be in UTF-8 per the JSON spec, and you have no guarantee that UTF-8 is your platform default).
You aren't guarding your FileWriter with an ARM block (try-with-resources); you should add that.
In the initial block, formatFullFileName is a variable that holds a format string. In the run() method, it's the opposite (it's the result of running a String.format op on one). That makes your code very hard to read.
Most likely your filenames are incorrect. You should be using List<Path>, which would have removed any doubt. If your List<String> subFoldersPaths contains, for example, /home/misnomer/project/foo/1stLayerSubFolder0, and the constant FILE_NAME (which you did not include in your paste) is, say, example, then the path for the very first file to be created becomes /home/misnomer/project/foo/1stLayerSubFolder0example0.json, which is not what you wanted: you're missing a slash.
NB: If using the newer path API, writing a string to a file becomes vastly simpler: Files.writeString(path, string) (JDK 11+) is all you need. Note that the Files API defaults to UTF-8, unlike most other parts of the Java libraries that involve turning strings into bytes or vice versa.
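A sketch of a revised run() with those fixes applied (assumes JDK 11+, plus java.nio.file.Files and java.nio.file.Path imports; the other names are from the question):

@Override
public void run() {
    for (int i = 0; i < FILES_COUNT; i++) {
        try {
            // Format first, then write via the Files API: UTF-8 by default, no close needed.
            Path path = Path.of(String.format(fileName, i) + ".json");
            Files.writeString(path, createJsonString());
            Thread.sleep(80);
        } catch (IOException | InterruptedException e) {
            logger.error("File was not created", e);
        }
    }
}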
The paste needs more info, or you should debug this on your own: Print when you write a file, preferably including the thread ID (you can get it with Thread.currentThread().getName()). That's how programming works: You don't just stare at it, go --heck, I dunno, better ask stack overflow!-- and then give up. You debug it. Use a debugger, or if you can't/don't want to, use the poor man's debugger: Add a whole bunch of System.out.println statements. Go through your code and imagine (write it down if you have to) which each step is doing. Then, add a println statement that confirms this. The very place where what the program says it is doing does not match with what you thought it would do? That's where a bug is. Fix it, and keep going until all bugs are eliminated.
I am having some trouble using the Groovy TemplateEngines in Java without running into OOM. When creating a lot of different templates, it seems to me that a lot of scripts are created on the heap, which are then never garbage collected.
I use Java 8. When running this code with -Xmx32M, about 3000 iterations are possible; after that, an OOM error is thrown.
Here is my code:
import groovy.text.SimpleTemplateEngine;
import groovy.text.Template;
import groovy.text.TemplateEngine;

import java.util.HashMap;
import java.util.Map;

public class Test {
    public static void main(String[] args) throws Exception {
        String groovy = "XX-${i}";
        for (int i = 0; i < (1000000000); i++) {
            TemplateEngine e = new SimpleTemplateEngine();
            Template t = e.createTemplate(groovy);
            Map<String, Object> binding = new HashMap<>();
            binding.put("i", i);
            String res = t.make(binding).toString();
            if (i % 100 == 0) {
                System.out.println("->" + res);
            }
        }
    }
}
I also tried different variations and ClassLoaders, but in essence the results are always the same. As I can't find any current issues about this, I guess I am missing something.
Could anyone help enlighten me?
Tino
Here is your problem: https://bugs.openjdk.java.net/browse/JDK-8037342
Each time the parser runs, it creates a new unique class whose name is based on the number of the parse being done. For instance, after a while the class names look like:
groovy.runtime.metaclass.SimpleTemplateScript4237MetaClass
groovy.runtime.metaclass.SimpleTemplateScript4238MetaClass
After a while, the ClassLoader's parallelLockMap will fill the heap and nothing is eligible to be GC'd. It's sort of like an OOM PermGen error.
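One way to avoid the leak in the posted code (a sketch, not taken from the linked issue): compile the template once and reuse it, so only a single generated script class is ever loaded:

import groovy.text.SimpleTemplateEngine;
import groovy.text.Template;

import java.util.HashMap;
import java.util.Map;

public class Test {
    public static void main(String[] args) throws Exception {
        // Compile the template once; each createTemplate call would otherwise
        // generate and load a brand-new SimpleTemplateScriptN class.
        Template t = new SimpleTemplateEngine().createTemplate("XX-${i}");
        for (int i = 0; i < 1000000000; i++) {
            Map<String, Object> binding = new HashMap<>();
            binding.put("i", i);
            String res = t.make(binding).toString();
            if (i % 100 == 0) {
                System.out.println("->" + res);
            }
        }
    }
}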
Use Apache Commons Text: a fast and efficient alternative to SimpleTemplateEngine.
// org.apache.commons.text.StrSubstitutor
Map<String, Object> binding = new HashMap<>();
binding.put("i", 42);
StrSubstitutor sb = new StrSubstitutor(binding);
String value = sb.replace(templateString); // e.g. "XX-${i}" -> "XX-42"
I have been struggling with that problem for a while, and I came up with the following workaround: just call clear after running your script.
https://gist.github.com/jpozorio/38f26120e6346dfd74cecd7a147028aa
I've made this so far
import java.io.File;
import java.io.FileInputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.IOUtils;

public class Test {
    public static void main(String... args) {
        Pattern p = Pattern.compile("(?s).*(MyFunc[(](?s).*[)];)+(?s).*");
        File[] files = new File("C:\\TestDir").listFiles();
        showFiles(files, p);
    }

    public static void showFiles(File[] files, Pattern p) {
        for (File file : files) {
            if (file.isDirectory()) {
                System.out.println("Directory: " + file.getName());
                showFiles(file.listFiles(), p); // Calls same method again.
            } else {
                System.out.println("File: " + file.getAbsolutePath());
                String f;
                try {
                    f = IOUtils.toString(new FileInputStream(file.getAbsolutePath()), "UTF-8");
                    System.out.println(file.getName());
                    Matcher m = p.matcher(f);
                    if (m.find()) {
                        System.out.println(m.group());
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                    return;
                }
            }
        }
    }
}
What I want to do is find every call of MyFunc written in files inside a certain directory (which may have subdirectories with files that should be checked too). The number of files is pretty big, but the above is very, very slow for even a single file of 1 MB. Do you have any idea how to achieve what I want? I didn't expect this to be so slow.
EDIT// If this can't be done efficiently by a simple program, please feel free to advise me on useful FREE frameworks. Thank you for your help, everyone.
The problem with your approach is the regular expression you're using. Including .* at the beginning and at the end of your pattern increases processing dramatically. Try the same code with the following regex:
(MyFunc\\(.*?\\);)
You can also apply the enhancements proposed by the other answers, but I am pretty sure your bottleneck is in the regex itself.
Good luck!
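With the leading and trailing .* removed, a while loop over Matcher.find() (instead of the single if in the question) reports every call in a file:

Pattern p = Pattern.compile("MyFunc\\(.*?\\);");
Matcher m = p.matcher(f);
while (m.find()) {               // each find() resumes after the previous match
    System.out.println(m.group());
}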
You are likely taking a hit on creating a String out of each file's contents. This will stress the heap and garbage collector.
You can use the Scanner object to help with this:
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/Scanner.html
Additionally this has been answered here already:
Performing regex on a stream
Best of luck!
This may help you along a little further:
http://www.java-tips.org/java-se-tips/java.util.regex/how-to-apply-regular-expressions-on-the-contents-of-a.html
Again, creating a String for each file is costly. This example uses memory-mapped files to avoid the hit on the garbage collector: the file bytes live in native, OS-managed memory instead of on the JVM heap.
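A hedged sketch of that memory-mapped approach (class and method names here are illustrative): the file bytes stay in an OS-managed mapping, and only the decoded characters are allocated, skipping the intermediate String:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MappedRegexSearch {
    // Match a pattern against a memory-mapped file without first copying
    // its contents into a String.
    public static void search(Path file, Pattern p) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Charset.decode still allocates a CharBuffer, but avoids the extra String copy.
            CharSequence cs = StandardCharsets.UTF_8.decode(buf);
            Matcher m = p.matcher(cs);
            while (m.find()) {
                System.out.println(m.group());
            }
        }
    }
}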
I'm thinking about using HBase as a source for one of my MapReduce jobs. I know that TableInputFormat specifies one input split (and thus one mapper) per Region. However, this seems inefficient. I'd really like to have multiple mappers working on a given Region at once. Can I achieve this by extending TableInputFormatBase? Can you please point me to an example? Furthermore, is this even a good idea?
Thanks for the help.
You need a custom input format that extends InputFormat. You can get an idea of how to do this from the answer to the question "I want to scan lots of data (range-based queries); what optimizations can I do while writing the data so that the scan becomes faster?" This is a good idea if the data processing time is much greater than the data retrieval time.
Not sure if you can specify multiple mappers for a given region, but consider the following:
If you think one mapper per region is inefficient (maybe your data nodes don't have enough resources, like CPUs), you can perhaps specify smaller region sizes in the file hbase-site.xml.
Here's a site for the default config options if you want to look into changing that:
http://hbase.apache.org/configuration.html#hbase_default_configurations
Please note that by making the region size small, you will be increasing the number of files in your DFS, and this can limit the capacity of your Hadoop DFS depending on the memory of your namenode. Remember, the namenode's memory usage is directly related to the number of files in your DFS. This may or may not be relevant to your situation, as I do not know how your cluster is being used. There is never a silver-bullet answer to these questions!
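For instance, the region size is governed by the hbase.hregion.max.filesize property; a hypothetical hbase-site.xml fragment (the value is illustrative, not a recommendation):

<property>
  <!-- Regions are split once a store file grows beyond this size. -->
  <name>hbase.hregion.max.filesize</name>
  <value>1073741824</value> <!-- 1 GB -->
</property>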
1. It's absolutely fine; just make sure the key sets are mutually exclusive between the mappers.
2. Make sure you aren't creating too many clients, as this may lead to a lot of GC, since HBase block cache churning happens during HBase reads.
Using this MultipleScanTableInputFormat, you can use MultipleScanTableInputFormat.PARTITIONS_PER_REGION_SERVER configuration to control how many mappers should execute against a single regionserver. The class will group all the input splits by their location (regionserver), and the RecordReader will properly iterate through all aggregated splits for the mapper.
Here is the example
https://gist.github.com/bbeaudreault/9788499#file-multiplescantableinputformat-java-L90
This is how the multiple aggregated splits for a single mapper are created:
private List<InputSplit> getAggregatedSplits(JobContext context) throws IOException {
    final List<InputSplit> aggregatedSplits = new ArrayList<InputSplit>();
    final Scan scan = getScan();

    for (int i = 0; i < startRows.size(); i++) {
        scan.setStartRow(startRows.get(i));
        scan.setStopRow(stopRows.get(i));
        setScan(scan);
        aggregatedSplits.addAll(super.getSplits(context));
    }

    // set the state back to where it was..
    scan.setStopRow(null);
    scan.setStartRow(null);
    setScan(scan);

    return aggregatedSplits;
}
And this is how the splits are partitioned by region server:
@Override
public List<InputSplit> getSplits(JobContext context) throws IOException {
    List<InputSplit> source = getAggregatedSplits(context);
    if (!partitionByRegionServer) {
        return source;
    }

    // Partition by regionserver
    Multimap<String, TableSplit> partitioned = ArrayListMultimap.<String, TableSplit>create();
    for (InputSplit split : source) {
        TableSplit cast = (TableSplit) split;
        String rs = cast.getRegionLocation();
        partitioned.put(rs, cast);
    }
This would be useful if you want to scan large regions (hundreds of millions of rows) with a conditional scan that finds only a few records. It will prevent a ScannerTimeoutException.
package org.apache.hadoop.hbase.mapreduce;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

public class RegionSplitTableInputFormat extends TableInputFormat {

    public static final String REGION_SPLIT = "region.split";

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        Configuration conf = context.getConfiguration();
        int regionSplitCount = conf.getInt(REGION_SPLIT, 0);
        List<InputSplit> superSplits = super.getSplits(context);
        if (regionSplitCount <= 0) {
            return superSplits;
        }

        List<InputSplit> splits = new ArrayList<InputSplit>(superSplits.size() * regionSplitCount);

        for (InputSplit inputSplit : superSplits) {
            TableSplit tableSplit = (TableSplit) inputSplit;
            System.out.println("splitting by " + regionSplitCount + " " + tableSplit);

            byte[] startRow0 = tableSplit.getStartRow();
            byte[] endRow0 = tableSplit.getEndRow();
            boolean discardLastSplit = false;
            if (endRow0.length == 0) {
                endRow0 = new byte[startRow0.length];
                Arrays.fill(endRow0, (byte) 255);
                discardLastSplit = true;
            }
            byte[][] split = Bytes.split(startRow0, endRow0, regionSplitCount);
            if (discardLastSplit) {
                split[split.length - 1] = new byte[0];
            }
            for (int regionSplit = 0; regionSplit < split.length - 1; regionSplit++) {
                byte[] startRow = split[regionSplit];
                byte[] endRow = split[regionSplit + 1];
                TableSplit newSplit = new TableSplit(tableSplit.getTableName(), startRow, endRow,
                        tableSplit.getLocations()[0]);
                splits.add(newSplit);
            }
        }
        return splits;
    }
}
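A hedged driver-side sketch of wiring this up (the job name and split count are illustrative; assumes org.apache.hadoop.hbase.HBaseConfiguration and org.apache.hadoop.mapreduce.Job):

// Hypothetical job setup: request 4 sub-splits per region via the custom key.
Configuration conf = HBaseConfiguration.create();
conf.setInt(RegionSplitTableInputFormat.REGION_SPLIT, 4);

Job job = Job.getInstance(conf, "region-split-scan");
job.setInputFormatClass(RegionSplitTableInputFormat.class);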