I am using an OBJ Loader library that I found on the 'net and I want to look into speeding it up.
It works by reading an .OBJ file line by line (it's basically a text file with lines of numbers).
I have a 12 MB OBJ file that equates to approximately 400,000 lines. Suffice it to say, it takes forever to read line by line.
Is there a way to speed it up? It uses a BufferedReader to read the file (which is stored in my assets folder).
Here is the link to the library: click me
Just an idea: you could first get the size of the file using the File class, after getting the file:
File file = new File(sdcard,"sample.txt");
long size = file.length();
The size returned is in bytes. Divide the file size into a manageable number of chunks, e.g. 5, 10, 20, etc., with a byte range recorded for each chunk. Create a byte array the same size as each chunk, then "assign" each chunk to a separate worker thread, which reads its chunk into its array using the read(buffer, offset, length) method, i.e. read "length" bytes of its chunk into the array "buffer" starting at array index "offset". You then have to convert the bytes into characters and concatenate all the arrays to get the final complete file contents. Add checks on the chunk sizes so that no thread overlaps another's boundaries. Again, this is just an idea; hopefully it will work when actually implemented.
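A rough sketch of that idea, assuming the file is available as a regular file on disk and contains only ASCII text (as OBJ files do), so chunks can be decoded independently; the path and chunk count are placeholders, and line breaks will still straddle chunk boundaries:
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedReader {
    // Reads the file in "chunks" pieces, one worker thread per piece, then concatenates.
    public static String readInChunks(String path, int chunks) throws Exception {
        long size = new File(path).length();
        long chunkSize = (size + chunks - 1) / chunks;            // round up
        ExecutorService pool = Executors.newFixedThreadPool(chunks);
        List<Future<byte[]>> parts = new ArrayList<>();
        for (int i = 0; i < chunks; i++) {
            final long start = i * chunkSize;
            final int length = (int) Math.max(0, Math.min(chunkSize, size - start));
            parts.add(pool.submit(() -> {
                byte[] buffer = new byte[length];
                try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                    raf.seek(start);                              // jump to this chunk's byte range
                    raf.readFully(buffer);                        // fill this chunk's buffer
                }
                return buffer;
            }));
        }
        StringBuilder all = new StringBuilder();
        for (Future<byte[]> part : parts) {
            all.append(new String(part.get(), StandardCharsets.US_ASCII));
        }
        pool.shutdown();
        return all.toString();
    }
}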
Related
I have a use case where I have one S3 file. The file is not especially large, but it can contain 10-50 million single-row records. I want to read a specific byte range, and I have read that we can use the Range header in S3 GetObject.
Like this:
final GetObjectRequest request = new GetObjectRequest(s3Bucket, key);
request.withRange(byteStartRange, byteEndRange);
return s3Client.getObject(request);
But I want to know: does the byte range always guarantee a complete line?
For e.g:
My S3 file content is :
dhjdjdjdjdk
djdjjdfddkkd
dhdjjdjdjdd
cjjjdjdddd
......
If I specify some byte range X to Y, will it guarantee a full-line read, or can it return an incomplete line that falls within the byte range?
No, the Range will not guarantee a complete line.
It returns only the specific range of bytes requested. Amazon S3 has no insight into the contents of a file; it cannot parse or recognize newline characters.
You will need to request a large enough range that it (hopefully) contains a complete line. Then your code would need to determine where the line ends and the next line begins.
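A rough sketch of that approach using the v1 AWS SDK for Java; the bucket, key, and range values are placeholders. It fetches one range, drops the first (possibly partial) line unless the range starts at byte 0, and discards the trailing line because it may be cut off:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class S3RangeRead {
    // Returns only the lines that are known to be complete within [start, end].
    public static List<String> readCompleteLines(AmazonS3 s3Client, String bucket, String key,
                                                 long start, long end) throws Exception {
        GetObjectRequest request = new GetObjectRequest(bucket, key).withRange(start, end);
        try (S3Object object = s3Client.getObject(request);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(object.getObjectContent(), StandardCharsets.UTF_8))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            if (start > 0 && !lines.isEmpty()) {
                lines.remove(0);                   // first line may start mid-record
            }
            if (!lines.isEmpty()) {
                lines.remove(lines.size() - 1);    // last line may be cut off at the range end
            }
            return lines;
        }
    }
}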
I am seeing something unusual in my zip files.
I have two .txt files, and both are zipped with java.util.zip (ZipOutputStream, ZipEntry, ...) in my application and then returned in the response as downloadable zip files through the browser.
One file's data is a database blob and the other is a StringBuffer. My blob .txt file is 10 MB and my StringBuffer .txt file is 15 MB, but when these are zipped, the blob zip file is larger than the StringBuffer zip file, even though it contains the smaller .txt file.
Any reason why this might be happening?
The StringBuffer and (as of Java 5) StringBuilder classes store just the buffer for the character data plus the current length (without the additional offset and hash code fields of a String), but that buffer could be larger than the actual number of characters placed in it; a Java char takes up two bytes, even if you're using them to store boring old ASCII values that would fit into a single byte.
Your BLOB -- binary large object -- probably contains data that isn't text and isn't as compressible as text. For example, it could contain an image.
If you don't already know what the blob contains, you can use a hexdump program to look at it.
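For example, a few lines of Java can dump the first bytes of the file as hex if no hexdump tool is at hand (the path is a placeholder):
import java.io.FileInputStream;
import java.io.IOException;

public class HexHead {
    // Prints the first 64 bytes of a file as hex, enough to spot common signatures
    // (e.g. JPEG starts with FF D8, PNG with 89 50 4E 47, ZIP with 50 4B).
    public static void hexDumpHead(String path) throws IOException {
        try (FileInputStream in = new FileInputStream(path)) {
            byte[] head = new byte[64];
            int n = in.read(head);
            for (int i = 0; i < n; i++) {
                System.out.printf("%02X ", head[i]);
                if ((i + 1) % 16 == 0) System.out.println();
            }
            System.out.println();
        }
    }
}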
I am trying to convert COMP-3 and EBCDIC characters in my Java code, but I'm running into an out-of-memory exception because the amount of data handled is huge, about 5 GB. My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This results in an out-of-memory exception, which I can understand, but I can't use a file scanner either, since the data in the file won't be split into lines.
Can anyone point me in the right direction on how to handle this?
Note: the file may contain records of different lengths, so splitting it based on a fixed record length does not seem possible.
As Bill said, you could (and should) ask for the data to be converted to display characters on the mainframe; if it is English-only text, you can then do an ASCII transfer.
Also, how are you deciding where the COMP-3 fields start?
You do not have to read the whole file into memory; you can still read the file in blocks. This method will fill an array of bytes:
// Fills buf from the stream, looping because a single read() may return fewer
// bytes than requested. Returns the number of bytes actually read, or -1 if
// end-of-file was reached before any bytes were read.
protected final int readBuffer(InputStream in, final byte[] buf)
        throws IOException {
    int total = 0;
    int num = in.read(buf, 0, buf.length);
    while (num >= 0 && total + num < buf.length) {
        total += num;
        num = in.read(buf, total, buf.length - total);
    }
    return (num < 0 && total == 0) ? -1 : total + Math.max(num, 0);
}
If all the records are the same length, create an array of the record length and the above method will read one record at a time.
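For example, with a hypothetical fixed record length of 100 bytes, each call to readBuffer fills exactly one record (the file name and processRecord are placeholders for your own code):
int RECORD_LENGTH = 100;                           // assumed fixed record length
byte[] record = new byte[RECORD_LENGTH];
try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"))) {
    int read;
    while ((read = readBuffer(in, record)) > 0) {
        processRecord(record, read);               // decode EBCDIC / COMP-3 fields here
    }
}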
Finally the JRecord project has classes to read fixed length files etc. It can do comp-3 conversion. Note: I am the author of JRecord.
I'm running into an out-of-memory exception as the amount of data handled is huge, about 5 GB.
You only need to read one record at a time.
My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This is resulting in an out-of-memory exception, which I can understand
Me too.
but I can't use a file scanner either, since the data in the file won't be split into lines.
You mean you can't use the Scanner class? That's not the only way to read a record at a time.
In any case, not all files have record delimiters. Some have fixed-length records, some have length words at the start of each record, and some have record-type attributes at the start of each record, or at least somewhere in the fixed part of the record.
I'll have to split it based on an attribute record_id at a particular position (say at the beginning of each record) that will tell me the record length
So read that attribute, decode it if necessary, and read the rest of the record according to the record length you derive from the attribute. One at a time.
I direct your attention to the methods of DataInputStream, especially readFully(). You will also need a Java COMP-3 library. There are several available. Most of the rest can be done by built-in EBCDIC character set decoders.
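A rough sketch of that, assuming (purely for illustration) that each record starts with a 2-byte big-endian length field; the real position and format of the length attribute come from your copybook:
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class RecordReader {
    // Reads records that start with a header carrying the record length,
    // one record at a time, so the whole file never has to be in memory.
    public static void readRecords(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(path)))) {
            while (true) {
                int length;
                try {
                    length = in.readUnsignedShort();   // assumed 2-byte length header
                } catch (EOFException eof) {
                    break;                             // clean end of file
                }
                byte[] record = new byte[length];
                in.readFully(record);                  // read exactly one record
                // decode EBCDIC text (e.g. Charset "Cp037"), unpack COMP-3 fields, ...
            }
        }
    }
}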
I have a semicolon-delimited input file where the first column is a 3-character fixed-width code and the remaining columns are string data.
001;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
003;first_data_str;second_data_str;third_data_str;fourth_data_str
002;first_data_str;second_data_str;third_data_str;fourth_data_str
001;first_data_str;second_data_str;third_data_str;fourth_data_str
I want to divide the above file into a number of files based on the different values in the first column.
For example, in the above data there are three different values in the first column, so I will divide the file into three files: 001.txt, 002.txt, 003.txt.
The output file should contain item count as line one and data as remaining lines.
There are five 001 rows, so 001.txt will be:
5
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
first_data_str;second_data_str;third_data_str;fourth_data_str
Similarly, the 002 file will have 4 as its first line followed by 4 lines of data, and the 003 file will have 5 as its first line followed by 5 lines of data.
What would be the most efficient way to achieve this, considering a very large input file with more than 100,000 rows?
I have written below code to read lines from the file:
try {
    FileInputStream fstream = new FileInputStream(this.inputFilePath);
    DataInputStream in = new DataInputStream(fstream);
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String strLine;
    while ((strLine = br.readLine()) != null) {
        String[] tokens = strLine.split(";");
    }
    in.close();
} catch (IOException e) {
    e.printStackTrace();
}
for each line
extract chunk name, e.g 001
look for file named "001-tmp.txt"
if one exists, read the first line - it gives you the number of lines; increment that value and write it back into the same file, using the seek function with argument 0 and then writeUTF to overwrite the string. Some string-length calculation has to be applied here; leave a placeholder of, say, 10 spaces (a sketch of this approach follows the list)
if one does not exist, then create one and write 1 as the first line, padded to 10 spaces
append current line to the file
close current file
proceed with next line of source file
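A sketch of the per-line steps above, assuming a hypothetical 10-character padded count line and a "-tmp.txt" file name per code; it uses writeBytes rather than writeUTF, because writeUTF prefixes the string with a 2-byte length:
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class CountedAppender {
    // One pass of the steps above for a single input line: "code" is e.g. "001"
    // and "data" is the rest of the line.
    public static void appendLine(String code, String data) throws IOException {
        File out = new File(code + "-tmp.txt");
        try (RandomAccessFile raf = new RandomAccessFile(out, "rw")) {
            long count = 1;
            if (raf.length() > 0) {
                count = Long.parseLong(raf.readLine().trim()) + 1;   // existing count + 1
            }
            raf.seek(0);
            raf.writeBytes(String.format("%-10d%n", count));          // fixed-width count line
            raf.seek(raf.length());                                   // jump to the end of the file
            raf.writeBytes(data + System.lineSeparator());            // append the data line
        }
    }
}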
One of the solutions that comes to mind is to keep a Map and only open every file once. But you won't be able to do this, because you have around 100,000 rows, and no OS will allow you that many open file descriptors.
So one way is to open each file in append mode, keep writing to it, and close it again. But because of the huge number of file open/close calls, the process may slow down. You can test it for yourself, though.
If the above does not give satisfying results, you may try a mix of approaches 1 and 2, whereby you only keep 100 files open at any time and only close a file if a new file that is not already open needs to be written to...
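A rough sketch of that mixed approach, using a LinkedHashMap in access order as a small LRU pool of writers; the 100-writer limit and the file naming are assumptions, and the count-as-first-line requirement still has to be handled separately:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

public class WriterPool {
    private static final int MAX_OPEN = 100;

    // Access-ordered map: when the limit is exceeded, the least recently used
    // writer is closed and evicted.
    private final Map<String, BufferedWriter> open =
            new LinkedHashMap<String, BufferedWriter>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, BufferedWriter> eldest) {
                    if (size() > MAX_OPEN) {
                        try { eldest.getValue().close(); } catch (IOException ignored) { }
                        return true;
                    }
                    return false;
                }
            };

    public void write(String code, String line) throws IOException {
        BufferedWriter w = open.get(code);
        if (w == null) {
            w = new BufferedWriter(new FileWriter(code + ".txt", true));  // append mode
            open.put(code, w);
        }
        w.write(line);
        w.newLine();
    }

    public void closeAll() throws IOException {
        for (BufferedWriter w : open.values()) {
            w.close();
        }
        open.clear();
    }
}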
First, create a HashMap<String, ArrayList<String>> map to collect all the data from the file.
Second, use strLine.split(";", 2) instead of strLine.split(";"). The result will be an array of length 2: the first element is the code and the second is the data.
Then, add decoded string to the map:
ArrayList<String> list = map.get(tokens[0]);
if (list == null) {
    map.put(tokens[0], list = new ArrayList<String>());
}
list.add(tokens[1]);
At the end, scan map.keySet() and, for each key, create a file named after that key and write the list's size and the list's contents to it.
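A sketch of that last step, assuming the map has already been filled as above and that output file names are derived from the keys:
// "map" is the HashMap<String, ArrayList<String>> built while reading the input.
static void writeFiles(HashMap<String, ArrayList<String>> map) throws IOException {
    for (String code : map.keySet()) {
        ArrayList<String> lines = map.get(code);
        try (PrintWriter out = new PrintWriter(code + ".txt")) {
            out.println(lines.size());             // item count as line one
            for (String data : lines) {
                out.println(data);
            }
        }
    }
}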
For each three character code, you're going to have a list of input lines. To me the obvious solution would be to use a Map, with String keys (your three character codes) pointing to the corresponding List that contains all of the lines.
For each of those keys, you'd create a file with the relevant name, the first line would be the size of the list, and then you'd iterate over it to write the remaining lines.
I guess you are not fixed to three files, so I suggest you create a map of writers with your three-character code as the key and the writer as the value.
For each line you read, you select or create the required writer and write the line to it. You also need a second map to maintain the line-count values for all files.
Once you are done reading the source file, you flush and close all the writers and read the files one by one again. This time you just add the line count in front of the file's contents. To my knowledge there is no other way than rewriting the entire file, because it's not directly possible to add anything to the beginning of a file without buffering and rewriting the whole thing. I suggest you use a temporary file for this.
This answer applies only if your file is too large to be stored fully in memory. If it can be stored, there are faster solutions, such as collecting the contents fully in StringBuffer objects before writing them out to files.
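A sketch of the rewrite step for one output file, using a temporary file as suggested; the file name and line count are supplied by the caller from the first pass:
// Copies fileName into a temp file with lineCount as the first line, then
// replaces the original with the temp file.
static void prependCount(String fileName, long lineCount) throws IOException {
    Path source = Paths.get(fileName);
    Path temp = Files.createTempFile("prepend", ".tmp");
    try (BufferedReader in = Files.newBufferedReader(source);
         BufferedWriter out = Files.newBufferedWriter(temp)) {
        out.write(Long.toString(lineCount));
        out.newLine();
        String line;
        while ((line = in.readLine()) != null) {    // copy the original contents
            out.write(line);
            out.newLine();
        }
    }
    Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
}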
I am a beginner and I have a file with variable-sized records; there are two fields per row,
i.e. a 7-15 digit key, followed by a space and then a string, which is also variable-sized for each record.
I am trying to read only a page-sized block of bytes into my buffer and then process it.
The problem is that I use java.io.RandomAccessFile and its seek method to reach a particular position, then use the readFully method to read those 1024 bytes into my buffer. I have written functions to convert bytes into an int and bytes into a String, but I don't know how many bytes form the 7-15 digit key and how many bytes form my string.
When you say a row, do you mean each row has a line separator in between? If that is the case, you can use something like BufferedReader's readLine() method. That gives you a String which is one line, without the line separator.
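For example, assuming each row really is terminated by a line separator and the key and the string are separated by the first space (the path is a placeholder):
// Reads the file a line at a time and splits each line at the first space:
// the 7-15 digit key comes first, the rest of the line is the string field.
public static void readRows(String path) throws IOException {
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split(" ", 2);   // [key, value]
            long key = Long.parseLong(parts[0]);   // up to 15 digits fits in a long
            String value = parts.length > 1 ? parts[1] : "";
            // process key and value ...
        }
    }
}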