I have a use case with a single S3 file. The file itself is not particularly large, but it can contain 10-50 million single-row records. I want to read a specific byte range, and I have read that we can use the Range header in S3 GetObject.
Like this:
final GetObjectRequest request = new GetObjectRequest(s3Bucket, key);
request.withRange(byteStartRange, byteEndRange);
return s3Client.getObject(request);
But I want to know: does the byte range always guarantee a complete line?
For example, my S3 file content is:
dhjdjdjdjdk
djdjjdfddkkd
dhdjjdjdjdd
cjjjdjdddd
......
If I specify the byte range to be some range X to Y, will it guarantee reading only full lines, or can it return an incomplete line that falls within the byte range?
No, the Range will not guarantee a complete line.
It will return only the specific range of bytes requested. Amazon S3 has no insight into the contents of a file; it cannot parse or recognize newline characters.
You will need to request a large enough range that it (hopefully) contains a complete line. Then your code would need to determine where the line ends and the next line begins.
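For illustration, here is a rough sketch of that approach using the SDK v1 API from the question. It assumes UTF-8 text with '\n' line endings and keeps only the complete lines inside the fetched range; the trailing partial line would be carried over into the next range request:
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class RangeLineReader {
    // Returns only the complete lines found inside the requested byte range.
    static String readCompleteLines(AmazonS3 s3Client, String s3Bucket, String key,
                                    long byteStartRange, long byteEndRange) throws IOException {
        GetObjectRequest request = new GetObjectRequest(s3Bucket, key)
                .withRange(byteStartRange, byteEndRange);
        try (S3Object object = s3Client.getObject(request)) {
            String text = new String(IOUtils.toByteArray(object.getObjectContent()),
                                     StandardCharsets.UTF_8);
            // drop the partial line at the start unless the range began at byte 0
            int firstNewline = text.indexOf('\n');
            int start = (byteStartRange == 0) ? 0 : firstNewline + 1;
            // drop the partial line at the end
            int lastNewline = text.lastIndexOf('\n');
            return (lastNewline >= start) ? text.substring(start, lastNewline + 1) : "";
        }
    }
}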
I am trying to convert COMP-3 and EBCDIC characters in my Java code, but I'm running into an out of memory exception because the amount of data handled is huge, about 5 GB. My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This results in an out of memory exception, which I can understand, but I can't use a file scanner either, since the data in the file won't be split into lines.
Can anyone point me in the correct direction on how to handle this?
Note: the file may contain records of different lengths, so splitting it based on a fixed record length does not seem possible.
As Bill said, you could (should) ask for the data to be converted to display characters on the mainframe; if the text is English, you can then do an ASCII transfer.
Also, how are you deciding where the COMP-3 fields start?
You do not have to read the whole file into memory; you can still read the file in blocks. This method will fill an array of bytes:
protected final int readBuffer(InputStream in, final byte[] buf)
        throws IOException {
    int total = 0;
    int num = in.read(buf, 0, buf.length);
    // keep reading until the buffer is full or the stream ends
    while (num >= 0 && total + num < buf.length) {
        total += num;
        num = in.read(buf, total, buf.length - total);
    }
    // return the number of bytes actually read, or -1 if the stream was already at end of file
    return (num >= 0) ? total + num : (total > 0 ? total : -1);
}
If all the records are the same length, create an array of the record length and the above method will read one record at a time.
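For example, here is a sketch of reading a fixed-length file one record at a time with that helper (RECORD_LENGTH is a placeholder; substitute your real record length):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class FixedLengthReader {
    private static final int RECORD_LENGTH = 200;   // placeholder value

    public static void main(String[] args) throws IOException {
        byte[] record = new byte[RECORD_LENGTH];
        try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
            // a full buffer means one complete record; a shorter read (or -1) means end of file
            while (readBuffer(in, record) == RECORD_LENGTH) {
                // convert this one record here (EBCDIC text, COMP-3 fields, ...)
            }
        }
    }

    // the same helper as above, made static for this standalone sketch
    static int readBuffer(InputStream in, final byte[] buf) throws IOException {
        int total = 0;
        int num = in.read(buf, 0, buf.length);
        while (num >= 0 && total + num < buf.length) {
            total += num;
            num = in.read(buf, total, buf.length - total);
        }
        return (num >= 0) ? total + num : (total > 0 ? total : -1);
    }
}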
Finally, the JRecord project has classes to read fixed-length files etc. It can do COMP-3 conversion. Note: I am the author of JRecord.
I'm running into an out of memory exception as the amount of data handled is huge, about 5 GB.
You only need to read one record at a time.
My code is currently as follows:
byte[] data = Files.readAllBytes(path);
This is resulting in an out of memory exception, which I can understand.
Me too.
But I can't use a file scanner either, since the data in the file won't be split into lines.
You mean you can't use the Scanner class? That's not the only way to read a record at a time.
In any case, not all files have record delimiters. Some have fixed-length records, some have a length word at the start of each record, and some have a record-type attribute at the start of each record, or at least in the fixed part of the record.
I'll have to split it based on an attribute record_id at a particular position (say at the beginning of each record) that will tell me the record length.
So read that attribute, decode it if necessary, and read the rest of the record according to the record length you derive from the attribute. One at a time.
I direct your attention to the methods of DataInputStream, especially readFully(). You will also need a Java COMP-3 library. There are several available. Most of the rest can be done by built-in EBCDIC character set decoders.
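A minimal sketch of that record-at-a-time approach, assuming a hypothetical layout where the first 4 bytes of each record hold the record length as EBCDIC digits and the length includes the length field itself (adjust the field size and decoding to your actual format):
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.Charset;

public class VariableRecordReader {
    // assumed layout: 4-byte record-length attribute at the start of each record
    private static final int LENGTH_FIELD_SIZE = 4;
    private static final Charset EBCDIC = Charset.forName("Cp037");

    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(args[0])))) {
            byte[] lengthField = new byte[LENGTH_FIELD_SIZE];
            while (true) {
                try {
                    in.readFully(lengthField);                 // read the length attribute
                } catch (EOFException eof) {
                    break;                                     // clean end of file
                }
                int recordLength = Integer.parseInt(new String(lengthField, EBCDIC).trim());
                byte[] rest = new byte[recordLength - LENGTH_FIELD_SIZE];
                in.readFully(rest);                            // read the rest of one record
                process(rest);
            }
        }
    }

    private static void process(byte[] record) {
        // decode COMP-3 fields / EBCDIC text here, one record at a time
    }
}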
I am using an OBJ Loader library that I found on the 'net and I want to look into speeding it up.
It works by reading an .OBJ file line by line (it's basically a text file with lines of numbers).
I have a 12 MB OBJ file that equates to approx. 400,000 lines. Suffice to say, it takes forever to read it line by line.
Is there a way to speed it up? It uses a BufferedReader to read the file (which is stored in my assets folder).
Here is the link to the library: click me
Just an idea: you could first get the size of the file using the File class, once you have the file:
File file = new File(sdcard,"sample.txt");
long size = file.length();
The size returned is in bytes. Divide the file size into a manageable number of chunks, e.g. 5, 10, 20, etc., with a byte range computed and saved for each chunk. Create a byte array the size of each chunk, then assign each chunk to a separate worker thread, which reads its chunk into its array using the read(buffer, offset, length) method, i.e. reads "length" bytes of its chunk into the array "buffer" starting at array index "offset". You then convert the bytes into characters and concatenate all the arrays to get the final complete file contents. Insert checks on the chunk boundaries so that no thread overlaps another. Again, this is just an idea; hopefully it will work when actually implemented.
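A rough sketch of that idea (the chunk count and UTF-8 decoding are assumptions, and multi-byte characters split across chunk boundaries would need extra handling):
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedFileReader {
    public static String readInChunks(String path, int numChunks)
            throws IOException, InterruptedException, ExecutionException {
        long size = new File(path).length();
        long chunkSize = (size + numChunks - 1) / numChunks;     // ceiling division
        ExecutorService pool = Executors.newFixedThreadPool(numChunks);
        List<Future<byte[]>> parts = new ArrayList<>();

        for (int i = 0; i < numChunks; i++) {
            final long offset = i * chunkSize;
            final int length = (int) Math.min(chunkSize, size - offset);
            if (length <= 0) break;
            parts.add(pool.submit(() -> {
                byte[] buffer = new byte[length];
                // each worker opens its own handle and reads only its own byte range
                try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                    raf.seek(offset);
                    raf.readFully(buffer);
                }
                return buffer;
            }));
        }

        StringBuilder out = new StringBuilder((int) size);
        for (Future<byte[]> part : parts) {
            out.append(new String(part.get(), StandardCharsets.UTF_8)); // bytes -> chars
        }
        pool.shutdown();
        return out.toString();
    }
}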
I am building a history parser; there is an application that already does the logging (text based).
Now my supervisor wants me to create an application to read that log.
The log is created at the end of the month and is separated by [date]:
[19-11-2014]
- what goes here
- what goes here
[20-11-2014]
- what goes here
- what goes here
etc...
If the log file is small, there is no problem reading the content with a DataInputStream to get a byte[], converting it to a String, and then doing the filtering (with substring and such).
But when the file is large (about 100 MB), it throws a Java heap space error. I know this is because the length of the content exceeds what the String can hold; when I try not to convert the byte[] into a String, no exception is thrown.
Now the question is: how do I split the byte[] into several byte[] arrays, where each new byte[] contains only a single:
[date]
- what goes here
So if within a month there are 9 dates in the log, it would be split into 9 byte[] arrays.
The splitting marker would be based on [\\d{2}-\\d{2}-\\d{4}]; if it were a String I could just use a regex to find all the markers, get their indexOf, and then substring it.
But how do I do this without converting to a String first? As that would throw the Java heap space error.
I think there are several concepts here that you're missing.
First, an InputStream is a Stream, which means it is a flow of bytes. What you do with that flow is up to you, but saving the entire stream to memory defeats the point of the stream construct altogether.
Second, a DataInputStream is used to read primitive values and strings from a binary file that were written there by a DataOutputStream. Reading just text is overkill for this type of stream, since a simple InputStream can do that.
As for your specific problem, I would use a BufferedReader and read one line at a time until reaching the next date. At that point you can do whatever processing you need on the last chunk of lines you read and free the memory, thus not running into the same problem.
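A minimal sketch of that approach, matching the [dd-mm-yyyy] marker from the question (processChunk is just a placeholder for the filtering logic):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LogSplitter {
    // matches lines like [19-11-2014]
    private static final Pattern DATE_LINE = Pattern.compile("\\[\\d{2}-\\d{2}-\\d{4}\\]");

    public static void main(String[] args) throws IOException {
        List<String> currentChunk = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (DATE_LINE.matcher(line).matches() && !currentChunk.isEmpty()) {
                    processChunk(currentChunk);   // handle the previous date's entries
                    currentChunk.clear();         // free the memory before moving on
                }
                currentChunk.add(line);
            }
        }
        if (!currentChunk.isEmpty()) {
            processChunk(currentChunk);           // last date in the file
        }
    }

    private static void processChunk(List<String> chunk) {
        // filtering / substring logic for one [date] section goes here
    }
}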
The question may be quite vague, so let me expand on it here.
I'm developing an application in which I'll be reading data from a file. I have a FileReader class which opens the file in the following fashion:
currentFileStream = new FileInputStream(currentFile);
fileChannel = currentFileStream.getChannel();
Data is read as follows:
bytesRead = fileChannel.read(buffer); // Data is buffered using a ByteBuffer
I'm processing the data in one of two forms: binary or character.
If it is processed as characters, I do an additional step of decoding this ByteBuffer into a CharBuffer:
CoderResult result = decoder.decode(byteBuffer, charBuffer, false);
Now my problem is that I need to reposition the file to some offset during recovery mode, in case of a failure or crash in the application.
For this, I maintain a byteOffset which keeps track of the number of bytes processed in binary mode, and I persist this variable.
If something happens, I reposition the file like this:
fileChannel.position(byteOffset);
which is pretty straightforward.
But if the processing mode is character, I maintain a recordOffset which keeps track of the character position/offset in the file. During recovery I make read() calls internally until I reach the persisted recordOffset + 1.
Is there any way to get the corresponding bytes that were needed to decode the characters? For instance, if recordOffset is 400, its corresponding byteOffset might be 410 or 480 or so (depending on the charset), so that while repositioning I can do this:
fileChannel.position(recordOffset); //recordOffset equivalent value in number of bytes
instead of making repeated calls internally in my application.
Another approach I could think of was using InputStreamReader's skip() method.
If there is a better approach, or if it is possible to get a byte-to-character mapping, please let me know.
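One possible way to get such a mapping (a sketch, assuming UTF-8 and the buffer sizes shown) is to record how many bytes each decode() call consumes, and persist the byte offset together with the character offset at record boundaries:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class OffsetTrackingReader {
    public static void main(String[] args) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer byteBuffer = ByteBuffer.allocate(8192);
        CharBuffer charBuffer = CharBuffer.allocate(8192);

        long byteOffset = 0;    // bytes fully consumed by the decoder so far
        long recordOffset = 0;  // characters produced so far

        try (FileChannel fileChannel = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ)) {
            while (fileChannel.read(byteBuffer) != -1) {
                byteBuffer.flip();
                int before = byteBuffer.position();
                decoder.decode(byteBuffer, charBuffer, false);
                byteOffset += byteBuffer.position() - before;    // bytes this decode consumed
                charBuffer.flip();
                recordOffset += charBuffer.remaining();          // characters this decode produced

                // process charBuffer here, and persist (recordOffset, byteOffset) pairs
                // at record boundaries so recovery can simply call
                // fileChannel.position(persistedByteOffset)

                charBuffer.clear();
                byteBuffer.compact();   // keep any undecoded trailing bytes for the next pass
            }
        }
    }
}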
I am a beginner and I have a file with variable-sized records; there are two fields per row,
i.e. a key of 7-15 digits, followed by a space, and then a string which is also variable-sized for each record.
I am trying to read only a page-sized block of bytes into my buffer and then process it.
The problem is that if I use java.io.RandomAccessFile and its seek method to reach a particular line, I then use the readFully method to read those 1024 bytes into my buffer. I have written functions to convert bytes into an int and bytes into a String, but the problem is that I don't know how many bytes form the 7-15 digit key and how many bytes form my string.
When you say a row, do you mean each row has a line separator in between? If that is the case, you can use something like BufferedReader's readLine() method. That gives you a String which is one line, without the line separator.
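For instance, a small sketch along those lines, assuming each row is "key<space>string" terminated by a line separator:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class KeyValueReader {
    public static void main(String[] args) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int space = line.indexOf(' ');           // key and string are separated by one space
                if (space < 0) continue;                 // skip malformed rows
                long key = Long.parseLong(line.substring(0, space)); // a 7-15 digit key fits in a long
                String value = line.substring(space + 1);
                // process(key, value);
            }
        }
    }
}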