Zip file size anomaly - Java

I am seeing something unusual in my zip files.
I have two .txt files, both zipped through java.util.zip (ZipOutputStream, ZipEntry, ...) in my application and then returned in the response as downloadable zip files through the browser.
One file's data comes from a database blob and the other from a StringBuffer. The blob txt file is 10 MB and the StringBuffer txt file is 15 MB, but when these are zipped, the blob zip file is larger than the StringBuffer zip file, even though it contains the smaller txt file.
Any reason why this might be happening?

The StringBuffer and (as of Java 5) StringBuilder classes store just the buffer for the character data plus the current length (without the additional offset and hash code fields of a String), but that buffer can be larger than the actual number of characters placed in it. A Java char takes up two bytes, even when it holds boring old ASCII values that would fit into a single byte.

Your BLOB, a binary large object, probably contains data that isn't text and isn't as compressible as text. For example, it could contain an image.
If you don't already know what the blob contains, you can use a hexdump program to look at it.
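You can see the effect with java.util.zip itself. A small, self-contained demo (the class name, sizes, and data are illustrative, not from the question): highly repetitive text shrinks dramatically, while random "blob-like" bytes barely shrink at all:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class CompressibilityDemo {
    static int zippedSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry("entry.txt"));
            zos.write(data);
            zos.closeEntry();
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        int size = 10 * 1024 * 1024; // 10 MB of input either way

        byte[] text = new byte[size];      // stand-in for repetitive ASCII text
        Arrays.fill(text, (byte) 'a');

        byte[] blobLike = new byte[size];  // stand-in for incompressible binary data
        new Random(42).nextBytes(blobLike);

        System.out.println("text zipped to:  " + zippedSize(text) + " bytes");
        System.out.println("noise zipped to: " + zippedSize(blobLike) + " bytes");
    }
}

If the blob turns out to be an image or already-compressed data, zipping it can even make it slightly larger.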

Related

How to read S3 file chunk by chunk in java?

I have a use case where I have one S3 file. The file is not especially large, but it can contain 10-50 million single-line records. I want to read a specific byte range. I have read that we can use the Range header in S3 GetObject.
Like this:
final GetObjectRequest request = new GetObjectRequest(s3Bucket, key);
request.withRange(byteStartRange, byteEndRange);
return s3Client.getObject(request);
But I want to know: does the byte range always guarantee a complete line?
For e.g:
My S3 file content is :
dhjdjdjdjdk
djdjjdfddkkd
dhdjjdjdjdd
cjjjdjdddd
......
If I specify the byte range to be some range X to Y, will it guarantee a full-line read, or can it return an incomplete line that falls within the byte range?
No, the Range will not guarantee a complete line.
It returns only the specific range of bytes requested. Amazon S3 has no insight into the contents of a file; it cannot parse or recognize newline characters.
You will need to request a large enough range that it (hopefully) contains a complete line. Then your code would need to determine where the line ends and the next line begins.
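For illustration, here is a hedged sketch using the AWS SDK for Java v1 classes from the question; the class name, helper name, and the drop-first/drop-last-line policy are mine, not an official pattern. It fetches a range and keeps only the lines that are guaranteed complete:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3Object;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class RangeReader {
    // Hypothetical helper: fetch a byte range and keep only the lines that
    // are certainly complete, dropping the (possibly partial) edges.
    static List<String> readCompleteLines(AmazonS3 s3, String bucket, String key,
                                          long start, long end) throws IOException {
        GetObjectRequest request = new GetObjectRequest(bucket, key)
                .withRange(start, end);
        try (S3Object object = s3.getObject(request);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     object.getObjectContent(), StandardCharsets.UTF_8))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            // Unless the range starts at byte 0, the first "line" may be the
            // tail of a record that began before 'start': drop it.
            if (start > 0 && !lines.isEmpty()) {
                lines.remove(0);
            }
            // The last line may be cut off at 'end': drop it too; the next
            // (overlapping) range request can pick it up in full.
            if (!lines.isEmpty()) {
                lines.remove(lines.size() - 1);
            }
            return lines;
        }
    }
}

Successive requests then use overlapping ranges (or remember the byte offset where the last complete line ended) so that no line is lost.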

File split in java [duplicate]

This question already has answers here: Java - Read file and split into multiple files.
How can I split a file in two (file1 and file2) such that file1 contains the first 10 KB of the file and file2 contains the rest of the data?
I am using AIDE on Android.
There is no "system call" to split a file. You need to open a file, read it and copy the contents to the corresponding output files (which you need to create).
Synopsis:
Open the input file as a FileInputStream
Make a byte[] buffer somewhere around 4k
Open the two output files as two FileOutputStreams
Read from input into buffer and write buffer to first OutputStream
Do this until exactly 10 KB have been read and written
Read from input into buffer and write buffer to second OutputStream
Do this until there are no more bytes from the input stream
Close all three streams
Of course, you will need to be careful to copy exactly the correct number of bytes. See InputStream.read(buf, offset, length) for details. Also test the special case where the input file is less than 10 KB long.
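A minimal sketch of that synopsis (file names are placeholders):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class FileSplitter {
    public static void main(String[] args) throws IOException {
        final int firstPartSize = 10 * 1024; // first 10 KB go to file1
        byte[] buffer = new byte[4096];

        try (FileInputStream in = new FileInputStream("input.bin");
             FileOutputStream out1 = new FileOutputStream("file1");
             FileOutputStream out2 = new FileOutputStream("file2")) {

            // Copy exactly firstPartSize bytes (or fewer, if the input
            // is shorter) into file1.
            int remaining = firstPartSize;
            int n;
            while (remaining > 0 &&
                   (n = in.read(buffer, 0, Math.min(buffer.length, remaining))) != -1) {
                out1.write(buffer, 0, n);
                remaining -= n;
            }

            // Copy whatever is left into file2.
            while ((n = in.read(buffer)) != -1) {
                out2.write(buffer, 0, n);
            }
        }
    }
}

The try-with-resources block closes all three streams, covering the last step of the synopsis.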

How to speed up reading of a large OBJ (text) file?

I am using an OBJ Loader library that I found on the 'net and I want to look into speeding it up.
It works by reading an .OBJ file line by line (it's basically a text file with lines of numbers).
I have a 12 MB OBJ file that equates to approx. 400,000 lines. Suffice it to say, it takes forever to read line by line.
Is there a way to speed it up? It uses a BufferedReader to read the file (which is stored in my assets folder)
Here is the link to the library: click me
Just an idea: you could first get the size of the file using the File class, after getting the file:
File file = new File(sdcard,"sample.txt");
long size = file.length();
The size returned is in bytes. Divide the file size into a sizable number of chunks, e.g. 5, 10, 20, etc., with a byte range computed and saved for each chunk. Create a byte array the same size as each chunk, then "assign" each chunk to a separate worker thread, which reads its chunk into its corresponding array using the read(buffer, offset, length) method, i.e. read "length" bytes into the array "buffer" beginning at array index "offset". You will have to convert the bytes into characters. Then concatenate all the arrays to get the final complete file contents. Insert checks on the chunk boundaries so that no thread overlaps another's range. Again, this is just an idea; hopefully it will work when actually implemented. A rough sketch of the approach follows.
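Here is that sketch (the class name and chunk count are mine; it assumes the file is under 2 GB and fits in memory). Note that parallel reads of a single file do not always beat one sequential read; parsing the lines is often the real bottleneck:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ChunkedReader {
    public static byte[] readInChunks(String path, int numChunks)
            throws IOException, InterruptedException {
        long size;
        try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
            size = f.length();
        }
        byte[] result = new byte[(int) size];            // assumes file < 2 GB
        long chunkSize = (size + numChunks - 1) / numChunks; // ceiling division

        ExecutorService pool = Executors.newFixedThreadPool(numChunks);
        for (int i = 0; i < numChunks; i++) {
            final long start = i * chunkSize;
            final long end = Math.min(start + chunkSize, size); // no overlap
            if (start >= size) break;
            pool.submit(() -> {
                // Each worker opens its own handle and reads only its range.
                try (RandomAccessFile f = new RandomAccessFile(path, "r")) {
                    f.seek(start);
                    f.readFully(result, (int) start, (int) (end - start));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return result; // convert to chars/lines afterwards as needed
    }
}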

Connect Direct : File sending from Mainframe to Unix

When I send a variable-length file from the mainframe via Connect:Direct to a UNIX box, the file on UNIX has some extra bytes at the beginning. I tried different SYSOPTS options but I am still getting those initial bytes. Any idea?
You should look at getting the file copied to a fixed-length-record (RECFM=FB) file on the mainframe before doing the transfer. There are a number of mainframe utilities that can do this (e.g. SORT).
If you transfer it as a VB file you should also leave it as an EBCDIC file (the BDW/RDW fields are binary fields and should not be translated to ASCII).
As others have said, it would be useful to have an example of the file.
Following on from NealB: a VB file on the mainframe is stored in this format:
<BDW><RDW>Record Data 1
<RDW>Record Data 2
....
<RDW>Record Data n-1
<BDW><RDW>Record Data n
<RDW>Record Data n+1
....
<RDW>Record Data o-1
<BDW><RDW>Record Data o
<RDW>Record Data o+1
....
Where
BDW: the Block Descriptor Word is 4 bytes; the first 2 bytes hold the block length in big-endian format (including the 4-byte BDW itself); the last 2 bytes are hex zeros for disk files (tape files can use these 2 bytes).
RDW: the Record Descriptor Word is 4 bytes; the first 2 bytes hold the record length in big-endian format (including the 4-byte RDW itself); the last 2 bytes are hex zeros.
So for a block holding three 80-byte records, each RDW length is 84 (0x0054, the data plus the 4-byte RDW) and the block length is 256 (0x0100, the three records plus the 4-byte BDW), giving:
---BDW--- ---RDW---
0100 0000 0054 0000 80 bytes of data (record 1)
0054 0000 80 bytes of data (record 2)
0054 0000 80 bytes of data (record 3)
There may be a UNIX utility for handling mainframe VB files.
There are some VB options for Connect:Direct (NDM) (see http://pic.dhe.ibm.com/infocenter/sb2bi/v5r2/index.jsp?topic=%2Fcom.ibm.help.cd_interop_sysopts.doc%2FCDP_UNIXSysopts.html).
Looking at the documentation, you cannot combine the VB options with ASCII translation; converting the file to fixed-length records (RECFM=FB) on the mainframe may make a lot of sense.
Note: You could try looking at the file with the Record Editor and using the File-Wizard (button to the left of the layout name). The wizard should pickup that it is a Mainframe-VB file.
Note: While converting the file to fixed-length records on the mainframe would be the best option, the Java project JRecord can read mainframe VB files if need be.
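For illustration, a minimal sketch (class and file names are mine) of walking a VB file that was transferred in binary mode, using the inclusive BDW/RDW lengths described above:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

public class VbWalker {
    public static void main(String[] args) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new FileInputStream("mainframe_vb.bin"))) {
            while (true) {
                int blockLen;
                try {
                    blockLen = in.readUnsignedShort(); // BDW: big-endian length
                } catch (EOFException e) {
                    break;                             // clean end of file
                }
                in.skipBytes(2);                       // BDW: two zero bytes
                int remaining = blockLen - 4;          // block length includes the BDW
                while (remaining > 0) {
                    int recLen = in.readUnsignedShort(); // RDW: big-endian length
                    in.skipBytes(2);                     // RDW: two zero bytes
                    byte[] data = new byte[recLen - 4];  // record length includes the RDW
                    in.readFully(data);
                    System.out.println("record of " + data.length + " bytes");
                    remaining -= recLen;
                }
            }
        }
    }
}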
Some extra bytes... how many is "some"?
If there are always 4 bytes, these may be the RDW (Record Descriptor Word) which carries the record length.
I don't know much about Connect Direct, but from a command line FTP session on the mainframe you
can verify the RDW status using the LOCSTAT command as follows:
Command:
LOCSTAT RDW
Response:
RDW's from VB/VBS files are retained as part of data.
If you see the above message, you can drop the RDWs using the following command:
LOCSITE NORDW
If you are pulling from the mainframe then you can find out whether RDW's are being stripped or not using FTP command:
QUOTE STAT
You will then see several messages, one of which reports the RDW status:
211-RDWs from variable format datasets are retained as part of the data.
Again, you can fix this with
QUOTE SITE NORDW
after which QUOTE STAT should give you:
211-RDWs from variable format datasets are discarded
Are the extra bytes 0xEF 0xBB 0xBF, 0xFF 0xFE, or 0xFE 0xFF? That's a UTF Byte Order Mark (BOM).
If it's UTF-8, ignore it. Strip it, if you like. It's pointless.
If it's UTF-16, then you can use the bytes to determine endianness. If you know the endianness, it's safe to ignore or strip them.
If you control the application generating the files, stop it from writing a BOM. Just save the files as ASCII and the BOMs will go away.
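If you need to detect or strip a BOM in Java, here is a minimal sketch (the class and method names are mine, and it assumes Java 9+ for readNBytes):

import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomStripper {
    // Peeks at up to 3 leading bytes and consumes them only if they form
    // a known BOM, so callers see pure data.
    static InputStream stripBom(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 3);
        byte[] head = new byte[3];
        int n = in.readNBytes(head, 0, 3);
        if (n == 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF) {
            return in; // UTF-8 BOM: already consumed, nothing to push back
        }
        if (n >= 2 && ((head[0] == (byte) 0xFE && head[1] == (byte) 0xFF)
                || (head[0] == (byte) 0xFF && head[1] == (byte) 0xFE))) {
            if (n == 3) {
                in.unread(head[2]); // third byte is real data, push it back
            }
            return in; // UTF-16 BOM (BE or LE) consumed
        }
        if (n > 0) {
            in.unread(head, 0, n); // no BOM: push everything back
        }
        return in;
    }
}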

Java RandomAccessFile setLength but for the start of binary file

I've been reading about RandomAccessFile and understand that it's possible to truncate the end of a file by calling setLength with a length shorter than the file. I'm trying to copy just the "end" of the file to a new file, truncating the beginning.
So for example: I want to delete the first 1300 bytes of a file and copy the rest of the file into a new file.
Is there any way of doing this?
Cheers
Have you considered using the RandomAccessFile seek method to seek to byte 1300, then reading the remainder of the file from that offset and using another RandomAccessFile (or a different output stream) to write those bytes to a new file?
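For example, using FileChannel.transferTo to avoid an explicit copy loop over a buffer (file names are placeholders; the 1300-byte offset is from the question):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class TrimStart {
    public static void main(String[] args) throws IOException {
        final long skip = 1300; // bytes to drop from the start

        try (RandomAccessFile src = new RandomAccessFile("original.bin", "r");
             RandomAccessFile dst = new RandomAccessFile("trimmed.bin", "rw")) {
            FileChannel in = src.getChannel();
            FileChannel out = dst.getChannel();
            long pos = skip;
            long remaining = in.size() - skip;
            // transferTo may copy fewer bytes than requested, so loop.
            while (remaining > 0) {
                long copied = in.transferTo(pos, remaining, out);
                pos += copied;
                remaining -= copied;
            }
        }
    }
}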
