Reading a very long string from the console using Java Scanner takes a long time? - java

Currently I'm creating a console program that reads one line containing a very long String using Java Scanner.
The sample data looks like this:
50000 integers in one line, separated by whitespace,
"11 23 34 103 999 381 ....." up to 50000 integers
This data is entered by the user via the console, not read from a file.
Here's my code:
System.out.print("Input of integers : ");
Scanner sc = new Scanner(System.in);
long start = System.currentTimeMillis();
String Z = sc.nextLine();
long end = System.currentTimeMillis();
System.out.println("String Z created in "+(end-start)+"ms, Z character length is "+Z.length()+" characters");
Then when I execute it, I get this result:
String Z created within 49747ms, Z character length is 194539 characters
My question is: why does it take such a long time?
Is there any faster way to read a very long string?
I have tried BufferedReader, but it's not much different:
String Z created within 41881ms, Z character length is 194539 characters

It looks like Scanner uses a regular expression to match the end of line - this is likely causing the inefficiency, especially since you're matching a regex against a ~200k-character String.
The pattern used is, effectively, .*(\r\n|[\n\r\u2028\u2029\u0085])|.+$

My guess would be memory allocation: as it reads the line, it fills a char buffer, which gets larger and larger and has to copy all of the text read so far again and again. Each time it makes the internal buffer N times larger, so it is not atrociously slow, but for your huge line it is still slow.
Processing the regexp itself does not help either, but my guess is that the reallocation and copying is the source of the slowdown.
And maybe it needs to run GC to free memory before it can allocate more, which is another slowdown.
You can test this hypothesis by copying the Scanner source and changing BUFFER_SIZE to equal your line length (or larger, to be sure).
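If you want to test the buffer theory without patching Scanner, here is a minimal sketch that reads the line with a plain BufferedReader whose buffer is pre-sized; the 1 << 20 size is my own assumption, chosen to hold the whole line without repeated resizing:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class LongLineRead {
    public static void main(String[] args) throws IOException {
        System.out.print("Input of integers : ");

        // Skip Scanner's regex-based line matching entirely and give the
        // reader a buffer big enough for the whole ~200k-character line.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in), 1 << 20);

        long start = System.currentTimeMillis();
        String z = in.readLine();
        long end = System.currentTimeMillis();

        System.out.println("String Z created in " + (end - start) + "ms, "
                + "Z character length is " + z.length() + " characters");
    }
}
Keep in mind that the measured time also includes however long the console takes to deliver the pasted input, which no reader can speed up.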

Related

Having problems formatting Strings to a specific precision

I am trying to format a string and write it into a file. See my code below
StringBuilder sb=new StringBuilder();
sb.insert(0, String.format("%-30s", recDataWithWarId.getRiaFirmName() != null ? recDataWithWarId.getRiaFirmName() : " "));
The conditions I would like to achieve are (I already have a handle on condition 1 mentioned below):
If recDataWithWarId.getRiaFirmName() is less than thirty characters long, or is null, I want to pad the rest with blank spaces.
I would also like to truncate recDataWithWarId.getRiaFirmName() to 30 characters if its length is greater than 30, which I had hoped this code would do automatically.
What is the most efficient way of achieving these two conditions without having to write lengthy code, one call for the substring and one to format? I have to do this in multiple places, as this is a fairly large text file.
You need to specify both a precision and a width in your format string. You want "%-30.30s"
The first -30 specifies that the value will be padded on the right to 30 characters. The 30 after the decimal point specifies the maximum number of characters.
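For illustration, a quick standalone demo of that format specifier (the firm-name values here are made up):
public class WidthPrecisionDemo {
    public static void main(String[] args) {
        String shortName = "Acme Advisors";                          // shorter than 30 chars
        String longName = "An Extremely Long Registered Firm Name";  // longer than 30 chars
        String nullName = null;

        // %-30.30s left-justifies, pads to a width of 30 and truncates at 30 chars.
        System.out.println("[" + String.format("%-30.30s", shortName) + "]");
        System.out.println("[" + String.format("%-30.30s", longName) + "]");
        // A null still needs the explicit fallback, otherwise the literal
        // text "null" would be padded instead of a blank field.
        System.out.println("[" + String.format("%-30.30s", nullName != null ? nullName : " ") + "]");
    }
}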

Performance in Reading data to memory java

I am trying to read a 512MB file into java memory. Here is my code:
String url_part = "/homes/t1.csv";
File f = new File(url_part);
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
ArrayList<String> mem = new ArrayList<String>();
System.out.println("Start loading.....");
System.gc();
double start = System.currentTimeMillis();
String line;
int count = 0;
while ((line = br.readLine()) != null) {
    mem.add(line);
    count++;
    if (count % 500000 == 0) {
        System.out.println(count);
    }
}
The file contains 40000000 lines. The performance is totally fine before reading 18500000 lines, but it gets stuck somewhere after reading about 20000000 lines. (It freezes there, but continues after a long wait of about 10 seconds.)
I kept track of the memory use and found that even though the total file size is just 512 MB, the memory grows to about 2 GB while the program runs. Also, the 8-core CPU keeps working at 100% utilization.
I just want to read the file into memory so that later I can access the data I want faster from memory. Am I doing it the right way? Thanks!
First, Java stores strings in UTF-16, so if your input file contains mostly Latin-1 characters, you will need twice as much memory to store them; thus about 1 GB is used just for the chars. Second, there's an overhead per line. We can roughly estimate it:
Reference from ArrayList to String - 4 bytes (assuming compressed oops)
Reference from String to char[] array - 4 bytes
String object header - at least 8 bytes
String hash field (to store the hashCode) - 4 bytes
char[] object header - at least 8 bytes
char[] array length - 4 bytes
So in total at least 32 bytes are wasted per line. Usually it's more, as objects must be padded. So for 20_000_000 lines you have at least 640_000_000 bytes of overhead.
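If that per-line overhead turns out to be the limiting factor, one way to sidestep most of it is to keep the raw bytes in a single array and only record where each line starts. This is just a sketch under my own assumptions (the path is the one from the question, the index 12345 is arbitrary), not something the answer prescribes:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class RawLineIndex {
    public static void main(String[] args) throws Exception {
        // Load the whole file as one byte[] (a 512 MB file fits comfortably on a larger heap).
        byte[] data = Files.readAllBytes(Paths.get("/homes/t1.csv"));

        // Record the start offset of every line in a primitive int[]
        // instead of creating one String object per line up front.
        int[] starts = new int[1024];
        int lines = 0;
        starts[lines++] = 0;
        for (int i = 0; i < data.length - 1; i++) {
            if (data[i] == '\n') {
                if (lines == starts.length) {
                    starts = Arrays.copyOf(starts, starts.length * 2);
                }
                starts[lines++] = i + 1;
            }
        }
        System.out.println("Lines: " + lines);

        // Materialize a single line only when it is actually needed.
        int n = 12345;                       // arbitrary example index
        int from = starts[n];
        int to = (n + 1 < lines) ? starts[n + 1] - 1 : data.length;
        String line = new String(data, from, to - from, StandardCharsets.UTF_8);
        System.out.println(line);
    }
}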

Reading N lines with optimization - Java

I have a large text file with N lines. I have to read these lines in i iterations, which means I have to read n = Math.floor(N/i) lines in a single iteration, and in each iteration I have to fill a string array of length n. So the basic question is: how should I read n lines in optimum time? The simplest way is to use a BufferedReader and read one line at a time with BufferedReader.readLine(), but it will significantly decrease performance if n is too large. Is there a way to read exactly n lines at a time?
To read n lines from a text file, from a system point of view there is no other way than reading as many characters as necessary until you have seen n end-of-line delimiters (unless the file has been preprocessed to detect these, but I doubt that is allowed here).
As far as I know, no file I/O system in the world supports a function to read "until the nth occurrence of some character", nor "the n following lines" (but I am probably wrong).
If you really want to minimize the number of I/O function calls, your last resort is block I/O with which you could read a "page" at a time (say of length n times the expected or maximum line length), and detect the end-of-lines yourself.
I agree with Yves Daoust's answer, except for the paragraph recommending
If you really want to minimize the number of I/O function calls, your last resort is block I/O with which you could read a "page" at a time (say of length n times the expected or maximum line length), and detect the end-of-lines yourself.
There's no need to "detect the end-of-lines yourself". Something like
new BufferedReader(new InputStreamReader(is, charset), 8192);
creates a reader with a buffer of 8192 chars. The question is how useful this is for reading data in blocks. For this a byte[] is needed and there's a sun.nio.cs.StreamDecoder in between which I haven't looked into.
To be sure use
new BufferedReader(new InputStreamReader(new BufferedInputStream(is, 8192), charset));
so you get a byte[] buffer.
Note that 8192 is the default buffer size for both BufferedReader and BufferedInputStream, so leaving it out would change nothing in my above examples. Note that using much larger buffers makes no sense and can even be detrimental to performance.
Update
So far you get all the buffering needed and this should suffice. In case it doesn't, you can try:
to avoid the decoder overhead. When your lines are terminated by \n, you can look for (byte) \n in the file content without decoding it (unless you're using some exotic Charset).
to prefetch the data yourself. Normally, the OS should take care of it, so when your buffer becomes empty, Java calls into the OS, and it has the data already in memory.
to use a memory mapped file, so that no OS calls are needed for fetching more data (as all data "are" there when the mapping gets created).
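To illustrate the last two points, here is a rough sketch of scanning a memory-mapped file for line breaks; the file name, the line count n and the single-byte-encoding assumption are mine, not part of the answer:
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;

public class MappedLineReader {
    public static void main(String[] args) throws Exception {
        int n = 1000; // read the first n lines, as in the question

        try (RandomAccessFile raf = new RandomAccessFile("big.txt", "r");
             FileChannel ch = raf.getChannel()) {
            // Map the file; no read() calls are needed once the mapping exists.
            // (A single mapping is limited to 2 GB.)
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());

            List<String> lines = new ArrayList<>(n);
            StringBuilder current = new StringBuilder();
            while (buf.hasRemaining() && lines.size() < n) {
                byte b = buf.get();
                if (b == '\n') {
                    lines.add(current.toString());
                    current.setLength(0);
                } else if (b != '\r') {
                    current.append((char) (b & 0xFF)); // assumes a single-byte encoding
                }
            }
            System.out.println("Read " + lines.size() + " lines");
        }
    }
}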

What does scanf,memset, and a couple while's in C++ mean in Java?

So I have been working on understanding an algorithm and I found a C++ code for it on the internet but the problem is that I know Java but am really unfamiliar with C++ so I'm having some trouble understanding it. This part in particular:
int m, i, j, len;
char temp[50];
char stuff[100][100][4];
while(scanf("%d",&m)!=EOF)
{
    while(m--)
    {
        scanf("%s",temp);
        len = strlen(temp);
        for(i=0;i<=len;i++)
            for(j=0;j<=len;j++)
                memset(stuff[i][j],0,sizeof(char)*4);
    }
}
Ok, so here is what I'm thinking about the parts I don't understand. If someone could tell me if I'm on the right track, I'd be super grateful.
while(scanf("%d",&m)!=EOF)
For this part, since there is no input file or anything like that and m hasn't been initialized, I'm thinking it means that it will take in user input of ints or doubles until EOF is signalled in the console, and that input will be saved in m. I'm just confused about the while loop. Is something ongoing? m is just one int, so I don't really understand that.
while(m--)
Like equivalent to for(i = m; i>=0; i--) ?
scanf("%s",temp);
len = strlen(temp);
for(i=0;i<=len;i++)
for(j=0;j<=len;j++)
memset(stuff[i][j],0,sizeof(char)*4);
So if m is decrementing it wants to take user input from the console in the form of a string, but really a char, and save each char in the next index in the temp[]? len is the number of chars in that array that have actually been initialized? Does this mean len == m? And then memset is just setting 4 indices of each 3rd dimension array to null?
What I am expecting here is for the user to enter a sequence of n chars to be saved in temp, and then, depending on how many chars are in the sequence, prepare a 3d array of stuff[n][n][4] all filled with zeroes. It's just that the while statements seem sort of excessive for taking in what is basically a string.
Any help would be great. I have never done anything with C++ before and I've figured it all out except this one last part. I'm sorry this is so long, but I was trying to show what I've been thinking.
Java and C syntax are not so far apart that you should have trouble reading this if you know Java. While loops, for loops, the code blocks that go with each, and variable pre/post increment/decrement are pretty much identical.
scanf() is a complicated beast, but you've got the gist of how it's used here… it reads from the standard input stream and in your case it reads an integer (%d is for an int; google "man scanf" for full details), placing it into m. You have a loop to read this input, then perform a code block based on the new value of m.
The while (m--) block runs its body m times, much like the for loop you suggested; note that in Java you would have to write it as while (m-- > 0), since the condition must be a boolean.
scanf("%s") reads a character string, not just a character… the string will be NUL-terminated, but there's no guarantee (without doing more) that the input won't exceed the size of the buffer (temp).
strlen() returns the length of the string (not including its NUL terminator).
memset(buffer, value, length) writes value into memory beginning at buffer, for a total of length bytes. Google "man memset" for the docs.
while(scanf("%d",&m)!=EOF)
Read an integer as input into m, until no input remains.
while(m--)
Yes, essentially equivalent to for(; m > 0; m--) - the body runs m times.
So if m is decrementing it wants to take user input from the console in the form of a string, but really a char, and save each char in the next index in the temp[]? len is the number of chars in that array that have actually been initialized? Does this mean len == m? And then memset is just setting 4 indices of each 3rd dimension array to null?
The code reads some text into temp, which is an array of char. strlen(temp) determines how long the text in temp is (it counts forward until it sees a zero char, which is how a text string is terminated in C). So no, len != m. And you're correct about the role of memset.
What I am expecting here is for the user to enter a sequence of n chars to be saved in temp, and then, depending on how many chars are in the sequence, prepare a 3d array of stuff[n][n][4] all filled with zeroes.
Yes, this is exactly what happens.
First of all, scanf and printf are functions used mostly in C, although they can be used in C++ as well.
As for the while(m--)
Is the same as for (int i = m; i > 0; i--)
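Putting the answers together, a rough Java rendering of the C snippet could look like the sketch below; it is my own translation for illustration (Scanner.hasNextInt() plays the role of the EOF check), not code from any of the answers:
import java.util.Scanner;

public class CishTranslation {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        char[][][] stuff = new char[100][100][4];

        // while (scanf("%d", &m) != EOF)  ->  keep reading ints until input ends
        while (sc.hasNextInt()) {
            int m = sc.nextInt();
            // while (m--)  ->  run the body m times
            while (m-- > 0) {
                // scanf("%s", temp)  ->  read one whitespace-delimited token
                // (assumes tokens shorter than 100 chars, matching the C arrays)
                String temp = sc.next();
                int len = temp.length();      // strlen(temp)
                // memset(stuff[i][j], 0, 4)  ->  zero out the 4-char slots
                for (int i = 0; i <= len; i++) {
                    for (int j = 0; j <= len; j++) {
                        for (int k = 0; k < 4; k++) {
                            stuff[i][j][k] = 0;
                        }
                    }
                }
            }
        }
    }
}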

Determine Line Number from Byte Offset in a text file

I currently have a scenario where I know a byte offset into a text file. I want to know: is there any way I can determine the line number from the byte offset? The records in the text file are not of fixed length, otherwise I would simply have divided the offset by the record width.
You cannot determine the line number from a byte offset unless all the lines are a uniform length. However, you can scan for newlines up to that offset and count them to work out the line number.
You could do something like:
String fullTextFile = loadTextFile();
String section = fullTextFile.substring(0, byteOffset);
String reduced = section.replaceAll("[^\n]", "");   // strip everything except the newlines
int lineNumber = reduced.length() + 1;              // newlines before the offset, as a 1-based line number
Note that this treats the byte offset as a character offset, so as written it only works for single-byte encodings, but it shouldn't require much tweaking.
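If loading the whole file as a String is not an option, a streaming variant is sketched below; the path records.txt and the offset are made-up examples. It counts newline bytes up to the offset, which also sidesteps the byte-vs-char mismatch:
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class LineFromOffset {
    // Returns the 1-based line number containing the given byte offset.
    static int lineNumberAt(String path, long byteOffset) throws IOException {
        int line = 1;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            for (long pos = 0; pos < byteOffset; pos++) {
                int b = in.read();
                if (b == -1) {
                    break;              // offset is past the end of the file
                }
                if (b == '\n') {
                    line++;
                }
            }
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(lineNumberAt("records.txt", 123456L));
    }
}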
