Reading input in Java

I have two types of input data.
a b c d e...
Here a, b, and so on are values to be read. All are of the same data type, which may be short, int, long, or double. The values are separated by one or more spaces and given on a single line, and we don't know how many there are. Input ends with a newline. In the second case we're given a count "n" as the first value, and then n values follow. For example, for n=5 it looks like this:
n a b c d e
This could be done with Scanner, but I've heard that reading input with Scanner is slower than with BufferedReader. I'm looking for any way of doing this other than using the Scanner class. I'm new to Java. Please help.

I would get something that works first. Only once you understand where the bottleneck is, is it worth trying to optimise it.
To answer your question, IMHO the fastest way to read the data is to use a memory-mapped file and parse the ByteBuffer yourself, assuming you have ASCII 8-bit byte data (a reasonable assumption for numbers), avoiding the built-in parsers altogether. This will be much faster, but it is also a lot more complicated and complete overkill. ;)
If you want examples of how to parse numbers straight from a ByteBuffer, see Java low level: Converting between integers and text (part 1). To go even faster you can use the Unsafe class, but that is not standard Java.
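For a flavour of what that looks like, here is a minimal sketch (not the code from that article; the class and method names are made up for illustration, and it assumes space-separated non-negative decimal integers in an ASCII-encoded ByteBuffer):
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ByteBufferParse {
    // Parse whitespace-separated non-negative decimal integers straight
    // from an ASCII ByteBuffer, without creating intermediate Strings.
    static int nextInt(ByteBuffer bb) {
        int b;
        // skip leading whitespace (assumes at least one digit follows)
        do {
            b = bb.get();
        } while (b == ' ' || b == '\n' || b == '\r');
        int value = 0;
        while (b >= '0' && b <= '9') {
            value = value * 10 + (b - '0');
            if (!bb.hasRemaining()) return value;
            b = bb.get();
        }
        return value;
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.wrap("42 7 19".getBytes(StandardCharsets.US_ASCII));
        while (bb.hasRemaining()) {
            System.out.println(nextInt(bb));
        }
    }
}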

Especially when you're new to a language or environment, I would suggest starting out with something easily understood yet functional, like:
String inputline = "n a b c d e";
// Obtain the real input line, e.g. by reading it from a file via a reader
inputline = someBufferedReaderDefinedElsewhere.readLine();
// split on whitespace; "\\s+" handles one or more spaces between values
String[] parts = inputline.split("\\s+");

// easiest for case 1
for (String part : parts) {
    // ... process each value
}

// easiest for case 2
int numberToRead = Integer.parseInt(parts[0]);
// not < but <= because you start reading at element 1
for (int ii = 1; ii <= numberToRead; ii++) {
    // ... process parts[ii]
}
Of course to be completed with a healthy dose of error checking!
Afterwards, if you determine (with proof, e.g. the output of profiling your app) that this part of the code is in fact responsible for an unreasonable amount of CPU consumption, you can start thinking about faster, more custom ways of reading the data. Not the other way around.

Related

Most efficient way to create a string out of a list of characters then clear it

I'm trying to create a JSON-like format to load components from files and while writing the parser I've run into an interesting performance question.
The parser reads the file character by character, so I have a LinkedList as a buffer. After reaching the end of a key (:) or a value (,), the buffer has to be emptied and a string constructed from it.
My question is: what is the most efficient way to do this?
My two best bets would be:
for (int i = 0; i < buff.size(); i++)
    value += buff.removeFirst().toString();
and
value = new String((char[]) buff.toArray(new char[buff.size()]));
Instead of guessing this you should write a benchmark. Take a look at How do I write a correct micro-benchmark in Java to understand how to write a benchmark with JMH.
Your for loop would be inefficient, as you are concatenating one-letter Strings using the + operator. This creates and immediately throws away intermediate String objects. You should use a StringBuilder if you plan to concatenate in a loop, as sketched below.
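A minimal sketch of the StringBuilder approach, assuming buff is a LinkedList<Character> as in the question (the class and method names here are just for illustration):
import java.util.LinkedList;

class BufferDrain {
    // Drain the character buffer into a single String via StringBuilder,
    // avoiding the intermediate Strings created by repeated + concatenation.
    static String drain(LinkedList<Character> buff) {
        StringBuilder sb = new StringBuilder(buff.size());
        while (!buff.isEmpty()) {
            sb.append(buff.removeFirst().charValue());
        }
        return sb.toString();
    }
}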
The second option should use a zero-length array as per Arrays of Wisdom of the Ancients article which dives into internal details of the JVM:
value = new String((char[]) buff.toArray(new char[0]));

JAVA : Performance and Memory improvement code comparison from codechef

So today I solved a very simple problem from CodeChef using Java, and my answer was accepted. My code was:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

class INTEST {
    public static void main(String args[]) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
        String input = reader.readLine();
        int n = Integer.parseInt(input.split(" ")[0]);
        long k = Long.parseLong(input.split(" ")[1]);
        int count = 0;
        String element;
        for (int i = 0; i < n; i++) {
            element = reader.readLine();
            if (Long.parseLong(element) % k == 0) {
                count++;
            }
        }
        System.out.println(count);
    }
}
The online judge reported:
Running Time : 0.58 Second
Memory : 1340.5M
So I looked at some other solutions for the same problem (sorted by time) and found another solution by the user indontop.
public class Main {
    public static void main(String... args) throws Exception {
        byte b;
        byte barr[] = new byte[1028];
        int r = 0, n = 0, k = 0;
        while ((r = System.in.read()) != ' ') {
            n = n * 10 + r - '0';
        }
        //System.out.println(n);
        while ((r = System.in.read()) != '\n') { //change
            k = k * 10 + r - '0';
        }
        //System.out.println(k);
        //System.in.read(); // remove
        n = 0;
        int count = 0;
        while ((r = System.in.read(barr, 0, 1028)) != -1) {
            for (int i = 0; i < barr.length; i++) {
                b = barr[i];
                if (b != '\n') { //change
                    n = n * 10 + b - '0';
                } else {
                    // i++; //remove
                    if (n % k == 0) count++;
                    n = 0;
                }
            }
        }
        System.out.println(count);
    }
}
The execution time and memory for the above code:
Running Time : 0.13 Second
Memory : 0M
I wonder how the user was able to achieve this much performance and memory gain on such a simple problem.
I don't understand the logic behind this code. Can anyone help me by explaining it, and also explain what is wrong with my code?
Thank you.
How indontop achieved a better memory footprint
Basically, indontop's program reads bytes directly from the input stream, without going through readers or reading lines. The only structure it allocates is a single array of 1028 bytes, and no other objects are created directly.
Your program, on the other hand, reads lines from a BufferedReader. Each such line is allocated in memory as a string. But your program is rather short, so it's highly likely that the garbage collector doesn't kick in, hence all those lines that were read are not cleared away from memory.
What indontop's program does
It reads the input byte by byte and parses the numbers directly from it, without using Integer.parseInt or similar methods. The characters '0' through '9' can be converted to their respective values (0-9) by subtracting '0' from each of them. The numbers themselves are parsed by noting that a number like '123' can be parsed as 1*10*10 + 2*10 + 3.
The bottom line is that the user is implementing the very basic algorithm for interpreting numbers without ever having the full string in memory.
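As a concrete illustration of that digit-by-digit parsing, here is a minimal sketch (not indontop's exact code), assuming a non-negative number terminated by a newline:
import java.io.IOException;
import java.io.InputStream;

class DigitParser {
    // Reads one non-negative decimal number from the stream, assuming it is
    // terminated by a newline ('\n'). Each digit character c contributes
    // (c - '0'), and the running value is multiplied by 10 for each new digit,
    // so "123" becomes ((1 * 10) + 2) * 10 + 3.
    static long readNumber(InputStream in) throws IOException {
        long n = 0;
        int c;
        while ((c = in.read()) != '\n' && c != -1) {
            n = n * 10 + (c - '0');
        }
        return n;
    }
}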
Is indontop's program better than yours?
My answer to this is no. First, his program is not entirely correct: he is reading an array of bytes and is not checking how many bytes were actually read. The last array read can contain bytes from the previous read, which may give wrong output, and it is by sheer luck that this didn't happen when he ran it.
Now, the rest of this is opinion-based:
Your program is much more readable than his. You have meaningful variable names, he doesn't. You are using well-known methods, he doesn't. Your code is concise, his is verbose and repeats the same code many times.
He is reinventing the wheel - there are good number parsing methods in Java, no need to rewrite them.
Reading data byte by byte is inefficient as far as system calls are concerned, and improves efficiency only in artificial environments like CodeChef and similar sites.
Runtime efficiency
You really can't tell by looking at one run. Those programs run under a shared server that does lots of other things and there are too many factors that affect performance. Benchmarking is a complicated issue. The numbers you see? Just ignore them.
Premature Optimization
In real world programs, memory is garbage collected when it's needed. Memory efficiency should be improved only if it's something very obvious (don't allocate an array of 1000000 bytes if you only intend to use 1000 of them), or when the program, when running under real conditions, has memory issues.
This is true for the time efficiency as well, but as I said, it's not even clear if his program is more runtime efficient than yours.
Is your program good?
Well, not perfect: you are running split twice, and it would be better to do it once and store the result in an array, as shown below. But other than that, it's a good answer to this question.
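A minimal sketch of that change, reusing the variable names from the original program (and assuming input holds the first line as before):
String[] firstLine = input.split(" ");
int n = Integer.parseInt(firstLine[0]);
long k = Long.parseLong(firstLine[1]);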

remove duplicate lines from a file

I have the following data:
number1
I am writing line1 .
number2
First line .
number3
I am writing line2.
number4
Second line .
number5
I am writing line3 .
number6
Third line.
number7
I am writing line2 .
number8
Fourth line .
number9
I am writing line5 .
number10
Fifth line .
Now I want to remove the duplicate lines from this text file, and along with each duplicate line I want to remove its 1 preceding and 2 succeeding lines, so that after removal my data looks like:
number1
I am writing line1 .
number2
First line .
number3
I am writing line2.
number4
Second line .
number5
I am writing line3 .
number6
Third line.
number9
I am writing line5 .
number10
Fifth line .
The size of my file is 60 GB and I am using a server with 64 GB RAM. I am using the following code for removing the duplicates:
fOutput = open('myfile', 'w')
table_size = 2**16
seen = [False] * table_size
infile = open('test.ttl', 'r')
while True:
    inFileLine1 = infile.readline()
    if not inFileLine1:
        break  # EOF
    inFileLine2 = infile.readline()
    inFileLine3 = infile.readline()
    inFileLine4 = infile.readline()
    h = hash(inFileLine2) % table_size
    if seen[h]:
        dup = False
        with open('test.ttl', 'r') as f:
            for line1 in f:
                if inFileLine2 == line1:
                    dup = True
                    break
        if not dup:
            fOutput.write(inFileLine1)
            fOutput.write(inFileLine2)
            fOutput.write(inFileLine3)
            fOutput.write(inFileLine4)
    else:
        seen[h] = True
        fOutput.write(inFileLine1)
        fOutput.write(inFileLine2)
        fOutput.write(inFileLine3)
        fOutput.write(inFileLine4)
fOutput.close()
However, it turns out this code is very slow. Is there some way I can improve its efficiency using parallelization, i.e. using all 24 cores available to me on my system, or using any other technique?
Although the above code is written in Python, I am fine with efficient solutions in C++, Python, Java, or Linux commands.
Here test.ttl is my input file, with size 60 GB.
It seems that your code is reading every line exactly once, and writing every line (that needs to be written) also exactly once. Thus there is no way to optimize the algorithm on the file reading/writing part.
I strongly suspect that your code is slow because of the very poor use of the hash table. Your hash table only has size 2^16, while your file may contain about 2^28 lines, assuming an average of 240 bytes per line.
Since you have such a big RAM (enough to contain all the file), I suggest you change the hash table to a size of 2^30. This should help considerably.
Edit:
In this case, you could try a very simple hash function. For example:
long long weight[] = { /* generate some random numbers */ };

long long Hash(char * s, int length)
{
    long long result = 0;
    int i = 0, j = 0;
    while (i < length)
    {
        result += s[i] * weight[j++];
        i += j;
    }
    return result & ((1 << 30) - 1); // assume that your hash table has size 2^30
}
If duplicate lines are quite common, then I think the right way to solve the problem is similar to the one you have, but you must use a hash table that can grow on demand and will automatically handle collisions. Try using the Python set data type to store lines that were already reached. With set you will not need to confirm that duplicate lines really are duplicates; if they're in the set already, they are definitely duplicates. This will work, and be quite efficient. However, Python's memory management may not be very efficient, and the set data type might grow beyond the available memory, in which case a rethink will be required. Try it.
Edit: OK, so the set grew too large.
For a good solution, you want to avoid repeatedly re-reading the input file. In your original solution, the input file is read again for each possible duplicate, so if there are N lines, the total number of lines read may be up to N^2. Optimization (profiling) and parallelism won't make this better. And, due to the massive file size, you also have a memory constraint which rules out simple tricks like storing all of the lines seen so far in a hash table (like set).
Here is my second suggestion. In this suggestion, memory requirements will scale to fit whatever you have available. You will need enough disk space for at least one copy of your input file. The steps form a pipeline - the output from one step is the input of the next.
Step 1. I think you are interested in working on groups of 4 lines. You want to keep the whole group of 4, or none of them. Your first step should be to combine each group of 4 lines into a single line. For example:
number1
I am writing line1 .
number2
First line .
number3
I am writing line2.
number4
Second line .
becomes
number1#I am writing line1 .#number2#First line .
number3#I am writing line2 .#number4#Second line .
Note that I used '#' to mark where the line breaks were. This is important. You can use any character here, provided it is not used in any other place in your input file.
Step 2. Prepend the line number to each line.
1#number1#I am writing line1 .#number2#First line .
2#number3#I am writing line2 .#number4#Second line .
Step 3. Use the Unix sort utility (or a Windows port of it). It's already highly optimized. There are even options to do the sort in parallel for extra speed. Sort with the following options:
sort '-t#' -k3
These sort options cause the program to consider only the 3rd field - which is the 2nd line in each group.
Step 4. Now step through the output of the previous stage, looking for duplicates, making use of the fact that they will be next to each other. Look at the 3rd field. If you find a duplicate line, discard it.
Step 5. Reconstruct the order of the original file using another sort:
sort '-t#' -k1 -n
This time, the sort uses the numerical value of the line number (the first field).
Step 6. Remove the line number from the start of each line.
Step 7. Turn each '#' character back into a newline character. Job done.
Though this seems like a lot of steps, all but steps 3 and 5 only involve a single pass through the input file, so they'll be very fast. N steps for N lines. The sorting steps (3 and 5) are also fast because the sort program has been heavily optimized and uses a good sorting algorithm (at most N log N steps for N lines).
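A minimal Java sketch of steps 1 and 2 (combining each group of 4 lines into one line and prepending a group number); the output file name and the '#' separator are just the examples used above, and the sketch assumes the line count is a multiple of 4 as in the sample data:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

class CombineGroups {
    public static void main(String[] args) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader("test.ttl"));
             BufferedWriter out = new BufferedWriter(new FileWriter("combined.txt"))) {
            String l1;
            long groupNo = 1;
            while ((l1 = in.readLine()) != null) {
                String l2 = in.readLine();
                String l3 = in.readLine();
                String l4 = in.readLine();
                // one group of 4 input lines becomes one output line,
                // prefixed with its group number for the final re-sort
                out.write(groupNo + "#" + l1 + "#" + l2 + "#" + l3 + "#" + l4);
                out.newLine();
                groupNo++;
            }
        }
    }
}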
fOutput = open('myfile', 'w')
infile = open('test.ttl', 'r')

all_line2 = {}
while True:
    inFileLine1 = infile.readline()
    if not inFileLine1:
        break  # EOF
    inFileLine2 = infile.readline()
    _ = infile.readline()
    _ = infile.readline()
    all_line2[inFileLine2] = False

infile.seek(0)
while True:
    inFileLine1 = infile.readline()
    if not inFileLine1:
        break  # EOF
    inFileLine2 = infile.readline()
    inFileLine3 = infile.readline()
    inFileLine4 = infile.readline()
    if not all_line2.get(inFileLine2):
        fOutput.write(inFileLine1)
        fOutput.write(inFileLine2)
        fOutput.write(inFileLine3)
        fOutput.write(inFileLine4)
        all_line2[inFileLine2] = True
Look at java.util.concurrent.ConcurrentHashMap in Java. It is designed to perform well when used by multiple threads that access the map concurrently.
Also, read the file using Java NIO through an Executor fixed thread pool.
To start with you can use this code
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {

    private static final ConcurrentHashMap<String, Boolean> map = new ConcurrentHashMap<>();

    public static class Task implements Runnable {
        private final String line;

        public Task(String line) {
            this.line = line;
        }

        @Override
        public void run() {
            // if (!map.containsKey(line)) // not needed
            map.put(line, true);
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService service = Executors.newFixedThreadPool(10);
        String dir_path = "...", file_name = "..."; // fill in your input location
        Files.lines(Paths.get(dir_path, file_name)).forEach(l -> service.execute(new Task(l)));
        service.shutdown();
        service.awaitTermination(1, TimeUnit.HOURS); // wait for all tasks to finish
        map.keySet().forEach(System.out::println);
    }
}
I would prefer to use Java for this. Given that the size of the file is 60 GB, Java provides a well-suited API for this, named MappedByteBuffer.
You load the file using a file channel and map the channel using the above API as follows:
FileChannel fileChannel = new RandomAccessFile(new File(inputFile), "r").getChannel();
MappedByteBuffer mappedBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, fileChannel.size());
This maps the file into memory. Note that a single MappedByteBuffer can cover at most 2 GB, so for a 60 GB file you will have to map it in chunks anyway, for example 50,000 bytes at a time:
mappedBuffer = fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, 50000);
Now you can iterate over the mappedBuffer and do your processing, for example as sketched below. Let me know if anything needs clarification.
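A minimal sketch of that iteration, splitting the mapped region into lines (the file name is the one from the question; only the first chunk is scanned here, and the "process a line" step is just a placeholder):
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

class MappedScan {
    public static void main(String[] args) throws IOException {
        try (FileChannel fileChannel = new RandomAccessFile(new File("test.ttl"), "r").getChannel()) {
            long chunkSize = Math.min(fileChannel.size(), 50_000);
            MappedByteBuffer mappedBuffer =
                    fileChannel.map(FileChannel.MapMode.READ_ONLY, 0, chunkSize);
            StringBuilder line = new StringBuilder();
            while (mappedBuffer.hasRemaining()) {
                byte b = mappedBuffer.get();
                if (b == '\n') {
                    // process the completed line here (e.g. check it against the duplicates seen so far)
                    line.setLength(0);
                } else {
                    line.append((char) b);
                }
            }
            // looping over successive chunks of the file is left out of this sketch
        }
    }
}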
I would read the file sequentially. Let's consider some factors affecting performance and possible solutions:
Language: vote for C/C++.
IO: we can use memory mapping, which is available on Windows and Linux; on Linux it is the mmap() function. Basically, this maps the file content to a pointer, e.g. char* data. Tell me if you are using Windows and need the code.
Searching for a key: I suggest using a binary search tree. Each time we take a new pair of lines (key, value), we traverse the tree to find the key. If it is found, skip this pair and the next pair. If it is not found, insert the pair into the tree as a new node at the position where the search ended, and also write the pair to the output file. The search takes O(log N).
Data structure of a node:
struct Node {
    char* key;
    unsigned short keyLen;
    char* value;
    unsigned short valueLen;
    Node* leftNode;
    Node* rightNode;
};
You can change unsigned short to unsigned char if appropriate. The pointers key and value actually point to positions inside the memory block held by data, so no new memory is allocated to store the keys and values.
The searching can be further improved by using a Bloom filter. If the filter answers NO (very quickly), then the key definitely does not exist in our tree and there is no need to traverse it. If the answer is YES, traverse the tree normally. Bloom filters are implemented in Redis and HBase; take a look at these open-source database systems if needed.
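To make the idea concrete, here is a rough sketch of such a filter (written in Java for brevity, even though this answer targets C/C++; the bit-set size and hash functions are arbitrary illustrative choices, not tuned values):
import java.util.BitSet;

class SimpleBloomFilter {
    private static final int SIZE = 1 << 24;      // number of bits in the filter
    private final BitSet bits = new BitSet(SIZE);

    private int hash1(String s) {
        return (s.hashCode() & 0x7fffffff) % SIZE;
    }

    private int hash2(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * 131 + s.charAt(i);
        }
        return (h & 0x7fffffff) % SIZE;
    }

    void add(String key) {
        bits.set(hash1(key));
        bits.set(hash2(key));
    }

    // false means "definitely not present"; true means "possibly present",
    // in which case the real data structure (the tree) must still be checked.
    boolean mightContain(String key) {
        return bits.get(hash1(key)) && bits.get(hash2(key));
    }
}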

Reading large files for a simulation (Java crashes with out of heap space)

For a school assignment, I need to create a simulation of memory accesses. First I need to read 1 or more trace files. Each contains the memory address for each access. Example:
0 F001CBAD
2 EEECA89F
0 EBC17910
...
Where the first integer indicates a read/write etc., then the hex memory address follows. With this data, I am supposed to run a simulation. So the idea I had was to parse this data into an ArrayList<Trace> (for now I am using Java), with Trace being a simple class containing the memory address and the access type (just a String and an integer). After that, I plan to loop through these lists to process them.
The problem is that even at parsing, it runs out of heap space. Each trace file is ~200 MB and I have up to 8 of them, meaning a minimum of ~1.6 GB of data I am trying to "cache". What baffles me is that I am only parsing 1 file and Java is already using 2 GB according to my task manager...
What is a better way of doing this?
A code snippet can be found at Code Review
The answer I gave on codereview is the same one you should use here .....
But, because duplication appears to be OK, I'll duplicate the answer here.
The issue is almost certainly in the structure of your Trace class and its memory efficiency. You should ensure that the instrType and hexAddress are stored as memory-efficient structures. The instrType appears to be an int, which is good, but just make sure that it is declared as an int in the Trace class.
The more likely problem is the size of the hexAddress String. You may not realise it but Strings are notorious for 'leaking' memory. In this case, you have a line and you think you are just getting the hexString from it... but in reality, the hexString contains the entire line.... yeah, really. For example, look at the following code:
import java.util.StringTokenizer;

public class SToken {
    public static void main(String[] args) {
        StringTokenizer tokenizer = new StringTokenizer("99 bottles of beer");
        int instrType = Integer.parseInt(tokenizer.nextToken());
        String hexAddr = tokenizer.nextToken();
        System.out.println(instrType + hexAddr);
    }
}
Now, set a breakpoint in your IDE (I use Eclipse), run it, and you will see that hexAddr contains a char[] array for the entire line, with an offset of 3 and a count of 7.
Because of the way that String substring and other constructs work, they can consume huge amounts of memory for short strings... (in theory that memory is shared with other strings though). As a consequence, you are essentially storing the entire file in memory!!!!
At a minimum, you should change your code to:
hexAddr = new String(tokenizer.nextToken().toCharArray());
But even better would be:
long hexAddr = parseHexAddress(tokenizer.nextToken());
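parseHexAddress is not defined in the answer; a minimal version of what it might look like, assuming the addresses are plain hex digits that fit in a long, is:
// Hypothetical helper: parse a hex address such as "F001CBAD" into a long,
// so no String needs to be kept per trace entry.
static long parseHexAddress(String hex) {
    return Long.parseLong(hex, 16);
}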
Like rolfl, I answered your question on Code Review. The biggest issue, to me, is reading everything into memory first and then processing it. You need to read a fixed amount, process that, and repeat until finished.
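A rough sketch of that pattern (the file name, batch size and the processBatch step are placeholders, not part of the original answer):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class StreamingSimulation {
    static final int BATCH_SIZE = 100_000;

    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("trace0.txt"))) {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == BATCH_SIZE) {
                    processBatch(batch);  // run the simulation step on this chunk
                    batch.clear();        // then let the memory be reused
                }
            }
            processBatch(batch);          // handle the final partial batch
        }
    }

    static void processBatch(List<String> batch) {
        // placeholder for the simulation logic
    }
}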
Try using the class java.nio.ByteBuffer instead of java.util.ArrayList<Trace>. It should also reduce the memory usage.
import java.nio.ByteBuffer;

class TraceList {
    private static final int RECORD_SIZE = 5; // 1 byte type + 4 bytes address
    private ByteBuffer buffer;

    public TraceList() {
        // allocate byte buffer (the capacity here is an arbitrary example)
        buffer = ByteBuffer.allocate(10_000_000 * RECORD_SIZE);
    }

    public void put(byte operationType, int addres) {
        // put data to byte buffer
        buffer.put(operationType);
        buffer.putInt(addres);
    }

    public Trace get(int index) {
        // get data from byte buffer by index
        byte type = buffer.get(index * RECORD_SIZE);         // read type
        int addres = buffer.getInt(index * RECORD_SIZE + 1); // read addres
        return new Trace(type, addres);
    }
}

Why is my hashset so memory-consuming?

I found out that my program's memory use is increasing because of the code below. I am currently reading a file that is about 7 GB, and I believe the part that would be stored in the hashset is less than 10 MB, but my program's memory keeps increasing to 300 MB and then it crashes with an OutOfMemoryError. If the HashSet is the problem, which data structure should I choose?
if (tagsStr != null) {
    if (tagsStr.contains("a") || tagsStr.contains("b") || tagsStr.contains("c")) {
        maTable.add(postId);
    }
} else {
    if (maTable.contains(parentId)) {
        // do sth else, no memories added here
    }
}
You haven't really told us what you're doing, but:
If your file is currently in something like ASCII, each character you read will be one byte in the file or two bytes in memory.
Each string will have an object overhead - this can be significant if you're storing lots of small strings
If you're reading lines with BufferedReader (or taking substrings from large strings), each one may have a large backing buffer - you may want to use maTable.add(new String(postId)) to avoid this
Each entry in the hash set needs a separate object to keep the key/hashcode/value/next-entry values. Again, with a lot of entries this can add up
In short, it's quite possible that you're doing nothing wrong, but a combination of memory-increasing factors are working against you. Most of these are unavoidable, but the third one may be relevant.
You've either got a memory leak or your understanding of the amount of string data that you are storing is incorrect. We can't tell which without seeing more of your code.
The scientific solution is to run your application using a memory profiler, and analyze the output to see which of your data structures is using an unexpectedly large amount of memory.
If I was to guess, it would be that your application (at some level) is doing something like this:
String line;
while ((line = br.readLine()) != null) {
// search for tag in line
String tagStr = line.substring(pos1, pos2);
// code as per your example
}
This uses a lot more memory than you'd expect. The substring(...) call creates a tagStr object that refers to the backing array of the original line string. Your tag strings that you expect to be short actually refer to a char[] object that holds all characters in the original line.
The fix is to do this:
String tagStr = new String(line.substring(pos1, pos2));
This creates a String object that does not share the backing array of the argument String.
UPDATE - this or something similar is an increasingly likely explanation ... given your latest data.
To expand on another of Jon Skeet's points, the overheads of a small String are surprisingly high. For instance, on a typical 32-bit JVM, the memory usage of a one-character String is:
Object header for the String object: 2 words
String object fields: 3 words
Padding: 1 word (I think)
Backing array object header: 3 words
Backing array data: 1 word
Total: 10 words - 40 bytes - to hold one char of data ... or one byte of data if your input is in an 8-bit character set.
(This is not sufficient to explain your problem, but you should be aware of it anyway.)
Couldn't it be possible that the data read into memory (from the 7 GB file) is somehow not freed? Something like Jon says: since strings are immutable, every string read requires a new String object, which might lead to running out of memory if the GC is not quick enough...
If the above is the case, you might insert some 'breakpoints' into your code/iteration, i.e. at some defined points, issue a GC and wait till it terminates.
Run your program with -XX:+HeapDumpOnOutOfMemoryError. You'll then be able to use a memory analyser like MAT to see what is using up all of the memory - it may be something completely unexpected.
