String max size for 722MB xml file - java

I have a ByteArrayOutputStream which holds a byte representation of an XML with 750MB size.
I need to convert it to String.
I wrote:
ByteArrayOutputStream xmlArchive = ...
String xmlAsString = xmlArchive.toString(UTF8);
However although I am using 4GB of heap size I get java.lang.OutOfMemoryError: Java heap space
What is wrong? How can I know which heap size to use? I am using JDK64 bit
UPDATE
I need it as String in order to remove all the characters before "<?xml"
Currently my code is:
String xmlAsString = xmlArchive.toString(UTF8);
int xmlBegin = xmlAsString.indexOf("<?xml");
if (xmlBegin >0){
return xmlAsString.substring(xmlBegin);
}
return xmlAsString;
I then convert it again to byte array.
UPDATED 2
The ByteArrayOutputStream is written like this:
HttpMethod method ..
InputStream response = method.getResponseBodyAsStream();
byte[] buf = new byte[5000];
while ( (len=response.read(buf)) != -1) {
output.write(buf, 0, len);
}
len is from the header of the response Content-Length

You could use the Scanner class:
Scanner scanner = new Scanner(response, StandardCharsets.UTF_8.name());
// skip to "<?xml"
scanner.skip(".*?(?=<\\?xml)");
// process rest of stream
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
// Do something with line
}
scanner.close();

Expanding on Jamie Cockburn's answer:
To fill in his while loop to match your expected behaviour:
byte[] buf = line.getBytes(StandardCharsets.UTF_8.name());
output.write(buf, 0, buf.length);

Related

Sending string as byte array from C# to Java via socket

I am trying the following:
C# Client:
string stringToSend = "Hello man";
BinaryWriter writer = new BinaryWriter(mClientSocket.GetStream(),Encoding.UTF8);
//write number of bytes:
byte[] headerBytes = BitConverter.GetBytes(stringToSend.Length);
mClientSocket.GetStream().Write(headerBytes, 0, headerBytes.Length);
//write text:
byte[] textBytes = System.Text.Encoding.UTF8.GetBytes(stringToSend);
writer.Write(textBytes, 0, textBytes.Length);
Java Server:
Charset utf8 = Charset.forName("UTF-8");
BufferedReader in = new BufferedReader(new InputStreamReader(clientSocket.getInputStream(), utf8));
while (true) {
//we read header first
int headerSize = in.read();
int bytesRead = 0;
char[] input = new char[headerSize];
while (bytesRead < headerSize)
{
bytesRead += in.read(input, bytesRead, headerSize - bytesRead);
}
String resString = new String(input);
System.out.println(resString);
if (resString.equals("!$$$")) {
break;
}
}
The string size equals 9.That's correct on both sides.But, when I am reading the string iteself on the Java side, the data looks wrong.The char buffer ('input' variable)content looks like this:
",",",'H','e','l','l','o',''
I tried to change endianness with reversing the byte array.Also tried changing string encoding format between ASCII and UTF-8.I still feel like it relates to the endianness problem,but can not figure out how to solve it.I know I can use other types of writers in order to write text data to the steam,but I am trying using raw byte arrays for the sake of learning.
These
byte[] headerBytes = BitConverter.GetBytes(stringToSend.Length);
are 4 bytes. And they aren't character data so it makes no sense to read them with a BufferedReader. Just read the bytes directly.
byte[] headerBytes = new byte[4];
// shortcut, make sure 4 bytes were actually read
in.read(headerBytes);
Now extract your text's length and allocate enough space for it
int length = ByteBuffer.wrap(headerBytes).getInt();
byte[] textBytes = new byte[length];
Then read the text
int remaining = length;
int offset = 0;
while (remaining > 0) {
int count = in.read(textBytes, offset, remaining);
if (-1 == count) {
// deal with it
break;
}
remaining -= count;
offset += count;
}
Now decode it as UTF-8
String text = new String(textBytes, StandardCharsets.UTF_8);
and you are done.
Endianness will have to match for those first 4 bytes. One way of ensuring that is to use "network order" (big-endian). So:
C# Client
byte[] headerBytes = BitConverter.GetBytes(IPAddress.HostToNetworkOrder(stringToSend.Length));
Java Server
int length = ByteBuffer.wrap(headerBytes).order(ByteOrder.BIG_ENDIAN).getInt();
At first glance it appears you have a problem with your indexes.
You C# code is sending an integer converted to 4 bytes.
But you Java Code is only reading a single byte as the length of the string.
The next 3 bytes sent from C# are going to the three zero bytes from your string length.
You Java code is reading those 3 zero bytes and converting them to empty characters which represent the first 3 empty characters of your input[] array.
C# Client:
string stringToSend = "Hello man";
BinaryWriter writer = new BinaryWriter(mClientSocket.GetStream(),Encoding.UTF8);
//write number of bytes: Original line was sending the entire string here. Optionally if you string is longer than 255 characters, you'll need to send another data type, perhaps an integer converted to 4 bytes.
byte[] textBytes = System.Text.Encoding.UTF8.GetBytes(stringToSend);
mClientSocket.GetStream().Write((byte)textBytes.Length);
//write text the entire buffer
writer.Write(textBytes, 0, textBytes.Length);
Java Server:
Charset utf8 = Charset.forName("UTF-8");
BufferedReader in = new BufferedReader(new InputStreamReader(clientSocket.getInputStream(), utf8));
while (true) {
//we read header first
// original code was sending an integer as 4 bytes but was only reading a single char here.
int headerSize = in.read();// read a single byte from the input
int bytesRead = 0;
char[] input = new char[headerSize];
// no need foe a while statement here:
bytesRead = in.read(input, 0, headerSize);
// if you are going to use a while statement, then in each loop
// you should be processing the input but because it will get overwritten on the next read.
String resString = new String(input, utf8);
System.out.println(resString);
if (resString.equals("!$$$")) {
break;
}
}

IllegalArgumentException using Java8 Base64 decoder

I wanted to use Base64.java to encode and decode files. Encode.wrap(InputStream) and decode.wrap(InputStream) worked but runned slowly. So I used following code.
public static void decodeFile(String inputFileName,
String outputFileName)
throws FileNotFoundException, IOException {
Base64.Decoder decoder = Base64.getDecoder();
InputStream in = new FileInputStream(inputFileName);
OutputStream out = new FileOutputStream(outputFileName);
byte[] inBuff = new byte[BUFF_SIZE]; //final int BUFF_SIZE = 1024;
byte[] outBuff = null;
while (in.read(inBuff) > 0) {
outBuff = decoder.decode(inBuff);
out.write(outBuff);
}
out.flush();
out.close();
in.close();
}
However, it always throws
Exception in thread "AWT-EventQueue-0" java.lang.IllegalArgumentException: Input byte array has wrong 4-byte ending unit
at java.util.Base64$Decoder.decode0(Base64.java:704)
at java.util.Base64$Decoder.decode(Base64.java:526)
at Base64Coder.JavaBase64FileCoder.decodeFile(JavaBase64FileCoder.java:69)
...
After I changed final int BUFF_SIZE = 1024; into final int BUFF_SIZE = 3*1024;, the code worked. Since "BUFF_SIZE" is also used to encode file, I believe there were something wrong with the file encoded (1024 % 3 = 1, which means paddings are added in the middle of the file).
Also, as #Jon Skeet and #Tagir Valeev mentioned, I should not ignore the return value from InputStream.read(). So, I modified the code as below.
(However, I have to mention that the code does run much faster than using wrap(). I noticed the speed difference because I had coded and intensively used Base64.encodeFile()/decodeFile() long before jdk8 was released. Now, my buffed jdk8 code runs as fast as my original code. So, I do not know what is going on with wrap()... )
public static void decodeFile(String inputFileName,
String outputFileName)
throws FileNotFoundException, IOException
{
Base64.Decoder decoder = Base64.getDecoder();
InputStream in = new FileInputStream(inputFileName);
OutputStream out = new FileOutputStream(outputFileName);
byte[] inBuff = new byte[BUFF_SIZE];
byte[] outBuff = null;
int bytesRead = 0;
while (true)
{
bytesRead = in.read(inBuff);
if (bytesRead == BUFF_SIZE)
{
outBuff = decoder.decode(inBuff);
}
else if (bytesRead > 0)
{
byte[] tempBuff = new byte[bytesRead];
System.arraycopy(inBuff, 0, tempBuff, 0, bytesRead);
outBuff = decoder.decode(tempBuff);
}
else
{
out.flush();
out.close();
in.close();
return;
}
out.write(outBuff);
}
}
Special thanks to #Jon Skeet and #Tagir Valeev.
I strongly suspect that the problem is that you're ignoring the return value from InputStream.read, other than to check for the end of the stream. So this:
while (in.read(inBuff) > 0) {
// This always decodes the *complete* buffer
outBuff = decoder.decode(inBuff);
out.write(outBuff);
}
should be
int bytesRead;
while ((bytesRead = in.read(inBuff)) > 0) {
outBuff = decoder.decode(inBuff, 0, bytesRead);
out.write(outBuff);
}
I wouldn't expect this to be any faster than using wrap though.
Try to use decode.wrap(new BufferedInputStream(new FileInputStream(inputFileName))). With buffering it should be at least as fast as your manually crafted version.
As for why your code doesn't work: that's because the last chunk is likely to be shorter than 1024 bytes, but you try to decode the whole byte[] array. See the #JonSkeet answer for details.
Well, I changed
"final int BUFF_SIZE = 1024;"
into
"final int BUFF_SIZE = 1024 * 3;"
It worked!
So, I guess probabaly there is something wrong with padding... I mean, when encoding the file, (since 1024 % 3 = 1) there must be paddings. And those might raise problems when decoding...
You should records the number of bytes you have read, beside this,
You should be sure that your buffer size is divisible for 3, cause in Base64, every 3 bytes have four output(64 is 2^6, and 3*8 equals 4*6), by doing this, you can avoid padding problems.( In this way your output will not have the wrong ending of "=")

Trim Padding From ByteArrayOutputStream

I'm working with Amazon S3 and would like to upload an InputStream (which requires counting the number of bytes I'm sending).
public static boolean uploadDataTo(String bucketName, String key, String fileName, InputStream stream) {
ByteArrayOutputStream out = new ByteArrayOutputStream();
byte[] buffer = new byte[1];
try {
while (stream.read(buffer) != -1) { // copy from stream to buffer
out.write(buffer); // copy from buffer to byte array
}
} catch (Exception e) {
UtilityFunctionsObject.writeLogException(null, e);
}
byte[] result = out.toByteArray(); // we needed all that just for length
int bytes = result.length;
IO.close(out);
InputStream uploadStream = new ByteArrayInputStream(result);
....
}
I was told copying a byte at a time is highly inefficient (obvious for large files). I can't make it more because it will add padding to the ByteArrayOutputStream, which I can't strip out. I can strip it out from result, but how can I do it safely? If I use an 8KB buffer, can I just strip out the right most buffer[i] == 0? Or is there a better way to do this? Thanks!
Using Java 7 on Windows 7 x64.
You can do something like this:
int read = 0;
while ((read = stream.read(buffer)) != -1) {
out.write(buffer, 0, read);
}
stream.read() returns the number of bytes that have been written into buffer. You can pass this information to the len parameter of out.write(). So you make sure that you write only the bytes you have read from the stream.
Use Jakarta Commons IOUtils to copy from the input stream to the byte array stream in a single step. It will use an efficient buffer, and not write any excess bytes.
If you want efficiency you could process the file as you read it. I would replace uploadStream with stream and remove the rest of the code.
If you need some buffering you can do this
InputStream uploadStream = new BufferedInputStream(stream);
the default buffer size is 8 KB.
If you want the length use File.length();
long length = new File(fileName).length();

Java: Reading file in two parts - partly as String and partly as byte[]

I have a file which is split in two parts by "\n\n" - first part is not too long String and second is byte array, which can be quite long.
I am trying to read the file as follows:
byte[] result;
try (final FileInputStream fis = new FileInputStream(file)) {
final InputStreamReader isr = new InputStreamReader(fis);
final BufferedReader reader = new BufferedReader(isr);
String line;
// reading until \n\n
while (!(line = reader.readLine()).trim().isEmpty()){
// processing the line
}
// copying the rest of the byte array
result = IOUtils.toByteArray(reader);
reader.close();
}
Even though the resulting array is the size it should be, its contents are broken. If I try to use toByteArray directly on fis or isr, the contents of result are empty.
How can I read the rest of the file correctly and efficiently?
Thanks!
The reason your contents are broken is because the IOUtils.toByteArray(...) function reads your data as a string in the default character encoding, i.e. it converts the 8-bit binary values into text characters using whatever logic your default encoding prescribes. This usually leads to many of the binary values getting corrupted.
Depending on how exactly the charset is implemented, there is a slight chance that this might work:
result = IOUtils.toByteArray(reader, "ISO-8859-1");
ISO-8859-1 uses only a single byte per character. Not all character values are defined, but many implementations will pass them anyways. Maybe you're lucky with it.
But a much cleaner solution would be to instead read the String in the beginning as binary data first and then converting it to text via new String(bytes) rather than reading the binary data at the end as a String and then converting it back.
This might mean, though, that you need to implement your own version of a BufferedReader for performance purposes.
You can find the source code of the standard BufferedReader via the obvious Google search, which will (for example) lead you here:
http://www.docjar.com/html/api/java/io/BufferedReader.java.html
It's a bit long, but conceptually not too difficult to understand, so hopefully it will be useful as a reference.
Alternatively, you could read the file into byte array, find \n\n position and split the array into the line and bytes
byte[] a = Files.readAllBytes(Paths.get("file"));
String line = "";
byte[] result = a;
for (int i = 0; i < a.length - 1; i++) {
if (a[i] == '\n' && a[i + 1] == '\n') {
line = new String(a, 0, i);
int len = a.length - i - 1;
result = new byte[len];
System.arraycopy(a, i + 1, result, 0, len);
break;
}
}
Thanks for all the comments - the final implementation was done in this way:
try (final FileInputStream fis = new FileInputStream(file)) {
ByteBuffer buffer = ByteBuffer.allocate(64);
boolean wasLast = false;
String headerValue = null, headerKey = null;
byte[] result = null;
while (true) {
byte current = (byte) fis.read();
if (current == '\n') {
if (wasLast) {
// this is \n\n
break;
} else {
// just a new line in header
wasLast = true;
headerValue = new String(buffer.array(), 0, buffer.position()));
buffer.clear();
}
} else if (current == '\t') {
// headerKey\theaderValue\n
headerKey = new String(buffer.array(), 0, buffer.position());
buffer.clear();
} else {
buffer.put(current);
wasLast = false;
}
}
// reading the rest
result = IOUtils.toByteArray(fis);
}

Problem: Java-Read socket (Bluetooth) input stream until a string (<!MSG>) is read

I use the following code (from Bluetooth Chat sample app) to read the incoming data and construct a string out of the bytes read. I want to read until this string has arrived <!MSG>. How to insert this condition with read() function?
The whole string looks like this <MSG><N>xxx<!N><V>yyy<!V><!MSG>. But the read() function does not read entire string at once. When I display the characters, I cannot see all the characters in the same line. It looks like:
Sender: <MS
Sender: G><N>xx
Sender: x<V
.
.
.
I display the characters on my phone (HTC Desire) and I send the data using windows hyperterminal.
How to make sure all the characters are displayed in a single line? I have tried using StringBuilder and StringBuffer instead of new String() but the problem is read() function does not read all the characters sent. The length of the input stream (bytes) is not equal to actual length of the string sent. The construction of string from the read bytes is happening alright.
Thank you for any suggestions and time spent on this. Also please feel free to suggest other mistakes or better way of doing below things, if any.
Cheers,
Madhu
public void run() {
Log.i(TAG, "BEGIN mConnectedThread");
//Writer writer = new StringWriter();
byte[] buffer = new byte[1024];
int bytes;
//String end = "<!MSG>";
//byte compare = new Byte(Byte.parseByte(end));
// Keep listening to the InputStream while connected
while (true) {
try {
//boolean result = buffer.equals(compare);
//while(true) {
// Read from the InputStream
bytes = mmInStream.read(buffer);
//Reader reader = new BufferedReader(new InputStreamReader(mmInStream, "UTF-8"));
//int n;
//while ((bytes = reader.read(buffer)) != -1) {
//writer.write(buffer, 0, bytes);
//StringBuffer sb = new StringBuffer();
//sb = sb.append(buffer);
//String readMsg = writer.toString();
String readMsg = new String(buffer, 0, bytes);
//if (readMsg.endsWith(end))
// Send the obtained bytes to the UI Activity
mHandler.obtainMessage(BluetoothChat.MESSAGE_READ, bytes, -1, readMsg)
.sendToTarget();
//}
} catch (IOException e) {
Log.e(TAG, "disconnected", e);
connectionLost();
break;
}
}
}
The read function does not make any guarantee about the number of bytes it returns (it generally tries to return as many bytes from the stream as it can, without blocking). Therefore, you have to buffer the results, and keep them aside until you have your full message. Notice that you could receive something after the "<!MSG>" message, so you have to take care not to throw it away.
You can try something along these lines:
byte[] buffer = new byte[1024];
int bytes;
String end = "<!MSG>";
StringBuilder curMsg = new StringBuilder();
while (-1 != (bytes = mmInStream.read(buffer))) {
curMsg.append(new String(buffer, 0, bytes, Charset.forName("UTF-8")));
int endIdx = curMsg.indexOf(end);
if (endIdx != -1) {
String fullMessage = curMsg.substring(0, endIdx + end.length());
curMsg.delete(0, endIdx + end.length());
// Now send fullMessage
}
}

Categories