I know a byte offset into a text file. Is there any way to determine the line number from that byte offset? The records in the text file are not of fixed length; if they were, I would simply divide the offset by the record width.
You cannot determine the line number from a byte offset unless all the lines have a uniform length. However, you can scan for newlines and count them up to the offset to calculate the line number.
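You could do something like:
// Assumes a single-byte encoding, so the byte offset equals the character index.
String fullTextFile = loadTextFile();
String section = fullTextFile.substring(0, byteOffset);
// Remove everything except newlines; one character remains per completed line.
String reduced = section.replaceAll("[^\n]*","");
// Zero-based line index; add 1 for a one-based line number.
int lineNumber = reduced.length();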
I'm not entirely sure how legal that regex is, but it shouldn't require much tweaking.
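The newline-counting idea can also be applied to raw bytes, without loading the whole file into a String. This is a minimal sketch, assuming an ASCII-compatible encoding such as UTF-8 (where the byte 0x0A always means a line feed); the class and method names are made up for illustration.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class LineNumberAtOffset {

    /** Returns the 1-based number of the line that contains the given byte offset. */
    static long lineNumberAt(Path file, long byteOffset) throws IOException {
        long line = 1;
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            // Count the '\n' bytes that occur strictly before the offset.
            for (long pos = 0; pos < byteOffset; pos++) {
                int b = in.read();
                if (b < 0) {
                    break; // the offset lies past the end of the file
                }
                if (b == '\n') {
                    line++;
                }
            }
        }
        return line;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lines", ".txt");
        Files.write(tmp, "ab\ncde\nf".getBytes(StandardCharsets.UTF_8));
        // Offset 4 is the 'd' in "cde", which sits on the second line.
        System.out.println(lineNumberAt(tmp, 4));
        Files.delete(tmp);
    }
}
```

This still reads every byte before the offset, so the cost is linear in the offset; it only avoids holding the file in memory.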
I can read a file from a start index to an end index:
Files.lines(Paths.get("file.csv")).skip(1000000).limit(1000).forEach(s-> {});
But this does not perform well. Is it possible to read a CSV file efficiently starting from the middle of the file?
Would the methods of Java's RandomAccessFile class help? Something like seek, skipBytes, etc. You can find tutorials.
It depends on how predictable it is, and on what exactly you mean by "middle", and what you consider to be "read from middle".
If the "middle" has to be an exact line, then you will have to read all the bytes before that middle, because otherwise you can miss bytes that end lines (the only way, with a CSV file, of knowing where line N is, is to have read exactly N-1 end-of-line characters before arriving at that position). Having to read all bytes up to a point is linear in time, and is certainly not as fast as actually jumping there in one go, but it may count as "reading from middle" for you.
If the file is highly predictable (all lines have approximately the same length), and you do not care much about getting exactly at the middle, then you can always take the length of the file, L, and jump to the last position before position L/2 which contains a newline character. The next position is, with high probability (since your file is predictable), the "middle line".
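The "jump near L/2 and back up to a newline" approach could look roughly like this with RandomAccessFile. It is a sketch under assumptions: lines end with '\n', the encoding is ASCII-compatible, and the class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class MiddleLine {

    /** Returns the offset of the first byte of the line that contains position length/2. */
    static long startOfMiddleLine(RandomAccessFile raf) throws IOException {
        long pos = raf.length() / 2;
        // Walk backwards until the byte just before pos is a newline, or we reach the start.
        while (pos > 0) {
            raf.seek(pos - 1);
            if (raf.read() == '\n') {
                break;
            }
            pos--;
        }
        return pos;
    }

    /** Seeks to the middle line and reads it; readLine() maps each byte to one char. */
    static String readMiddleLine(RandomAccessFile raf) throws IOException {
        raf.seek(startOfMiddleLine(raf));
        return raf.readLine();
    }
}
```

The backward walk touches at most one line's worth of bytes, so the cost does not depend on the file size. From that position you can keep calling readLine() to consume the following lines. Note that a CSV file with quoted fields containing line breaks would defeat this simple scan.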
I am trying to format a string and write it into a file. See my code below
StringBuilder sb=new StringBuilder();
sb.insert(0, String.format("%-30s", recDataWithWarId.getRiaFirmName() != null ? recDataWithWarId.getRiaFirmName() : " "));
The conditions I would like to achieve are (I already have a handle on condition 1 mentioned below):
If recDataWithWarId.getRiaFirmName() is shorter than thirty characters or is null, I want to fill the remaining positions with blank spaces.
I would also like to truncate recDataWithWarId.getRiaFirmName() to 30 characters if it is longer than 30, which I had hoped this code would do automatically.
What is the most efficient way of achieving these two conditions without having to write lengthy code, one call for substring and one for format? I have to do this in multiple places, as this is a fairly large text file.
You need to specify both a precision and a width in your format string. You want "%-30.30s"
The first -30 specifies that the value will be padded on the right to 30 characters. The 30 after the decimal point specifies the maximum number of characters.
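A small sketch of how that format string behaves for short, long, and null values. The helper name field30 is hypothetical; the null check is kept because passing null directly to %s would print the text "null".

```java
class FixedWidth {

    /** Pads or truncates the value to exactly 30 characters; null becomes 30 blanks. */
    static String field30(String value) {
        // Width 30 with '-' pads on the right; precision .30 cuts anything longer.
        return String.format("%-30.30s", value == null ? "" : value);
    }

    public static void main(String[] args) {
        System.out.println("[" + field30("Acme Advisors") + "]");
        System.out.println("[" + field30("A firm name that is much longer than thirty characters") + "]");
        System.out.println("[" + field30(null) + "]");
    }
}
```

Each call returns a string of length 30, so the result can be appended to the StringBuilder without a separate substring step.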
Introduction
We store tuples (string,int) in a binary file. The string represents a word (no spaces nor numbers). In order to find a word, we apply binary search algorithm, since we know that all the tuples are sorted with respect to the word.
In order to store this, we use writeUTF for the string and writeInt for the integer. Other than that, let's assume for now there are no ways to distinguish between the start and the end of the tuple unless we know them in advance.
Problem
When we apply binary search, we get a position (i.e. (a+b)/2) in the file, which we can read using the methods of RandomAccessFile, i.e. we can read the byte at that position. However, since we may land in the middle of a word, we cannot know where this word starts or finishes.
Solution
Here are two possible solutions we came up with; we are trying to decide which one will be more space-efficient and faster.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using e.g. writeChars or writeUTF), because in that case we can insert a null character at the end of the tuple. That is, we can be sure that none of the methods used to serialize the data will use the null character, since the information we store (letters and digits) has higher ASCII values.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file). In this case, we assume that words have a low entropy, so it's very unlikely they will have any signs of randomness. Even if the integer may get 4 bytes that are exactly the same as those in the random noise, the additional two bytes that follow will not (with high probability).
Which of these methods would you recommend? Is there a better way to store this kind of information? Note that we cannot read the entire file and deserialize it into memory, since it is very big (and we are not allowed to).
I assume you're trying to optimize for speed & space (in that order).
I'd use a different layout, built from 2 files:
Integer + Index file
Each "record" is exactly 8 bytes long, the lower 4 are the integer value for the record, and the upper 4 bytes are an integer representing the offset for the record in the other file (the characters file).
Characters file
Contiguous file of characters (UTF-8 encoding or anything you choose). "Records" are not separated or terminated in any way; the characters simply follow one another. For example, the records Good, Hello, Morning will look like GoodHelloMorning.
To iterate the dataset, you iterate the integer/index file with direct access (recordNum * 8 is the byte offset of the record), read the integer and the characters offset, plus the character offset of the next record (which is the 4 byte integer at recordNum * 8 + 12), then read the string from the characters file between the offsets you read from the index file. Done!
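A rough sketch of reading this two-file layout, including a binary search over the words. The assumptions are mine and not part of the answer: integers are written big-endian (as DataOutputStream.writeInt does), offsets are byte offsets into a UTF-8 characters file, words are sorted in String.compareTo order, and the last record ends at the end of the characters file. All names are illustrative.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class TwoFileIndex {
    // Each index record: 4-byte value followed by a 4-byte offset into the characters file.
    static final int RECORD_SIZE = 8;

    /** Reads the word belonging to record i from the characters file. */
    static String wordAt(RandomAccessFile index, RandomAccessFile chars, long i) throws IOException {
        index.seek(i * RECORD_SIZE + 4);           // skip the value, land on this record's offset
        int start = index.readInt();
        long records = index.length() / RECORD_SIZE;
        int end;
        if (i + 1 < records) {
            index.seek(i * RECORD_SIZE + 12);      // offset field of the next record
            end = index.readInt();
        } else {
            end = (int) chars.length();            // assumption: last word runs to end of file
        }
        byte[] buf = new byte[end - start];
        chars.seek(start);
        chars.readFully(buf);
        return new String(buf, StandardCharsets.UTF_8);
    }

    /** Binary search over the sorted words; returns the stored integer, or null if absent. */
    static Integer lookup(RandomAccessFile index, RandomAccessFile chars, String word) throws IOException {
        long lo = 0;
        long hi = index.length() / RECORD_SIZE - 1;
        while (lo <= hi) {
            long mid = (lo + hi) >>> 1;
            int cmp = wordAt(index, chars, mid).compareTo(word);
            if (cmp == 0) {
                index.seek(mid * RECORD_SIZE);     // value is the first 4 bytes of the record
                return index.readInt();
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null;
    }
}
```

Because every index record has the same size, the midpoint of a binary search always lands on a record boundary, which is exactly the problem the original layout could not solve.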
it's less than 200MB. Max 20 chars for a word.
So why bother? Unless you work on some severely restricted system, load everything into a Map<String, Integer> and get a few orders of magnitude speed up.
But let's say, I'm overlooking something and let's continue.
Method 1: Instead of storing the integer as a number, we thought to store it as a string (using eg. writeChars or writeUTF), because in that case, we can insert a null character
You don't have to as you said that your word contains no numbers. So you can always parse things like 0124some456word789 uniquely.
The efficiency depends on the distribution. You may win a factor of 4 (single digit numbers) or lose a factor of 2.5 (10-digit numbers). You could save something by using a higher base. But there's the storage for the string and it may dominate.
Method 2: We keep the same structure, but instead we separate each tuple with 6-8 (or less) bytes of random noise (same across the file).
This is too wasteful. Using four zero bytes between the records would do:
Find a sequence of at least four zeros.
Find the last zero.
That's the last separator byte.
Method 3: Using some hacks, you could ensure that the number contains no zero byte (either assuming that it doesn't use the whole range or representing it with five bytes). Then a single zero byte would do.
Method 4: As the disk is organized in blocks, you should probably split your data into 4 KiB blocks. Then you can add a tiny header giving you quick access to the data (start indexes for the 8th, 16th, etc. piece of data). The range between e.g. the 8th and 16th piece of data should be scanned sequentially, as that is both simpler and faster than binary search.
I'm currently writing a console program that reads a single line containing a very long String with the Java Scanner.
The sample data looks like this:
50000 integers on one line, separated by whitespace,
"11 23 34 103 999 381 ....." until 50000 integer
This data is entered by the user via the console, not read from a file.
Here's my code:
System.out.print("Input of integers : ");
Scanner sc = new Scanner(System.in);
long start = System.currentTimeMillis();
String Z = sc.nextLine();
long end = System.currentTimeMillis();
System.out.println("String Z created in "+(end-start)+"ms, Z character length is "+Z.length()+" characters");
When I run it, I get this result:
String Z created within 49747ms, Z character length is 194539 characters
My question is: why does it take so long?
Is there any faster way to read a very long string?
I have tried BufferedReader, but the result is not much different:
String Z created within 41881ms, Z character length is 194539 characters
It looks like Scanner uses a regular expression to match the end of the line. This is likely causing the inefficiency, especially since the regex is being matched against a String of about 200k characters.
The pattern used is, effectively, .*(\r\n|[\n\r\u2028\u2029\u0085])|.+$
My guess would be memory allocation: as it reads the line, it fills a char buffer, which gets larger and larger and has to copy all the text read so far again and again. Each time it makes the internal buffer N times larger, so it is not atrociously slow, but for your huge line it is still slow.
Processing the regexp itself does not help either. But my guess is that reallocation and copying are the source of the slowdown.
It may also need to run GC to free memory before it can acquire more, which is another slowdown.
You can test my hypothesis by making a copy of Scanner and changing BUFFER_SIZE to equal your line length (or larger, to be sure).
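Before patching Scanner, it may be worth checking how long the two readers take when the same line comes from memory instead of the console. That separates the cost of line parsing from the cost of the input source. The harness below is only a sketch: the generated data is synthetic, the class name is made up, and the timings are single, unwarmed measurements that should be treated as rough indications.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class LongLineTiming {

    /** Builds one line of `count` integers separated by single spaces (synthetic data). */
    static String buildLine(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(i % 1000);
        }
        return sb.toString();
    }

    static String readWithScanner(byte[] data) {
        try (Scanner sc = new Scanner(new ByteArrayInputStream(data), "UTF-8")) {
            return sc.nextLine();
        }
    }

    static String readWithBufferedReader(byte[] data) throws IOException {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return br.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = (buildLine(50_000) + "\n").getBytes(StandardCharsets.UTF_8);
        long t0 = System.nanoTime();
        String a = readWithScanner(data);
        long t1 = System.nanoTime();
        String b = readWithBufferedReader(data);
        long t2 = System.nanoTime();
        System.out.println("Scanner: " + (t1 - t0) / 1_000_000 + " ms, " + a.length() + " chars");
        System.out.println("BufferedReader: " + (t2 - t1) / 1_000_000 + " ms, " + b.length() + " chars");
    }
}
```

If both readers finish quickly here but remain slow on real console input, the bottleneck is more likely the console or terminal delivering the characters than the buffer growth or the regex.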
I have text files containing numbers like:
100
38963
27856
0
534
From these numbers I want to find the maximum and assign it the value 1. Based on that, I want to assign values to the other numbers, which are smaller. For example, the first one should get (38963/100)*100. I want to do all this in a Java program. Can anybody please help me?
To read lines of text from a file, you can wrap a FileReader in a BufferedReader. You can use String.split() to split a line of text into tokens around spaces, and you can use Integer.parseInt() to turn a String representing a valid integer into an int.
You can find the maximum and minimum of a list of ints in linear time (examining each int once) using two ints worth of storage.
That should be enough to get you started.
Edit: just realized those were supposed to be on separate lines (you should use the formatting tools when posting). String.split() will be unnecessary, then.
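Putting those pieces together, a single pass over the lines with two ints of storage might look like the sketch below. It assumes one integer per line and skips blank lines; the class name and the file name numbers.txt in the usage note are placeholders.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class MinMax {

    /**
     * Returns {min, max} of the integers found one per line.
     * For empty input the result is {Integer.MAX_VALUE, Integer.MIN_VALUE}.
     */
    static int[] minMax(Reader source) throws IOException {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // ignore blank lines
                }
                int n = Integer.parseInt(line);
                if (n < min) {
                    min = n;
                }
                if (n > max) {
                    max = n;
                }
            }
        }
        return new int[] { min, max };
    }

    public static void main(String[] args) throws IOException {
        int[] result = minMax(new StringReader("100\n38963\n27856\n0\n534\n"));
        System.out.println("min=" + result[0] + ", max=" + result[1]);
    }
}
```

To read from a file instead, pass new FileReader("numbers.txt") as the source. Scaling each number relative to the maximum would be a second pass once the maximum is known.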