My free webhost appends analytics javascript to all PHP and HTML files. Which is fine, except that I want to send XML to my Android app, and it's invalidating my files.
Since the XML is parsed in its entirety (and blows up) before being passed along to my SAX ContentHandler, I can't just catch the exception and continue merrily along with a fleshed-out object. (Which I tried, and then felt sheepish about.)
Any suggestions on a reasonably efficient strategy?
I'm about to create a class that will take my InputStream, read through it until I find the junk, break, then take what I just wrote to, convert it back into an InputStream and pass it along like nothing happened. But I'm worried that it'll be grossly inefficient, have bugs I shouldn't have to deal with (e.g. breaking on binary values such as embedded images) and hopefully unnecessary.
FWIW, this is part of an Android project, so I'm using the android.util.Xml class (see source code). When I traced the exception, it took me to a native appendChars function that is itself called from a network of private methods, so subclassing anything seems unlikely to help.
Here's the salient bit from my stacktrace:
E/AndroidRuntime( 678): Caused by: org.apache.harmony.xml.ExpatParser$ParseException: At line 3, column 0: junk after document element
E/AndroidRuntime( 678): at org.apache.harmony.xml.ExpatParser.parseFragment(ExpatParser.java:523)
E/AndroidRuntime( 678): at org.apache.harmony.xml.ExpatParser.parseDocument(ExpatParser.java:482)
E/AndroidRuntime( 678): at org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:320)
E/AndroidRuntime( 678): at org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:277)
I guess in the end I'm asking for opinions on whether the InputStream -> manually parse to OutputStream -> recreate InputStream -> pass along solution is as horrible as I think it is.
"I'm about to create a class that will take my InputStream, read through it until I find the junk, break, then take what I just wrote to, convert it back into an InputStream and pass it along like nothing happened. But I'm worried that it'll be grossly inefficient, have bugs I shouldn't have to deal with (e.g. breaking on binary values such as embedded images) and hopefully unnecessary."
You could use a FilterInputStream for that; there's no need for an intermediate buffer.
The best thing to do is to add a delimiter to the end of the XML, like --theXML ends HERE--, or a character that won't appear in the XML, such as a run of 16 \u0004 chars (then you only need to check every 16th byte), and read until you find it.
Here's an implementation assuming a \u0004 delimiter:
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class WebStream extends FilterInputStream {
    byte[] buff = new byte[1024];
    int offset = 0, length = 0;
    boolean done = false; // set once the delimiter has been seen

    public WebStream(InputStream i) {
        super(i);
    }

    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public int read() throws IOException {
        if (offset == length) {
            if (done)
                return -1;
            readNextChunk();
        }
        if (length == -1)
            return -1; // eof
        return buff[offset++] & 0xff; // mask so bytes >= 0x80 aren't returned as negative
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (offset == length) {
            if (done)
                return -1;
            readNextChunk();
        }
        if (length == -1)
            return -1; // eof
        int cop = length - offset;
        if (len < cop)
            cop = len;
        System.arraycopy(buff, offset, b, off, cop);
        offset += cop;
        return cop;
    }

    private void readNextChunk() throws IOException {
        // compact: move any unread bytes to the front of the buffer
        if (offset <= length) {
            System.arraycopy(buff, offset, buff, 0, length - offset);
            length -= offset;
            offset = 0;
        }
        int read = in.read(buff, length, buff.length - length);
        if (read < 0 && length <= 0) {
            length = -1;
            offset = 0;
            return;
        }
        // note that this is assuming an ascii-compatible encoding;
        // anything like utf-16 or utf-32 will break here.
        // the delimiter is a run of 16 \u0004 bytes, so probing every
        // 16th new byte is guaranteed to land on at least one of them
        for (int i = length; i < length + read; i += 16) {
            if (buff[i] == 0x04) {
                while (i > 0 && buff[i - 1] == 0x04)
                    i--; // back up to the first byte of the delimiter run
                length = i; // truncate the stream just before the delimiter
                done = true;
                return;
            }
        }
        if (read > 0)
            length += read; // no delimiter found; keep everything we read
    }
}
Note this still omits some error checking and needs proper testing; in particular, a delimiter run that straddles two reads isn't handled.
"I'm about to create a class that will take my InputStream, read through it until I find the junk, break, then take what I just wrote to, convert it back into an InputStream and pass it along like nothing happened. But I'm worried that it'll be grossly inefficient, have bugs I shouldn't have to deal with (e.g. breaking on binary values such as embedded images) and hopefully unnecessary."
That'll work. You can read the stream into a ByteArrayOutputStream, trim the junk, and then wrap the result in a ByteArrayInputStream (or, if characters are more convenient, a StringBuilder and a StringReader).
http://developer.android.com/reference/java/io/ByteArrayInputStream.html
The downside is that you're reading the entire XML file into memory; for large files this can be inefficient.
Alternatively, you can subclass InputStream and do the filtering out via the stream. You'd probably just need to override the 3 read() methods by calling super.read(), flagging when you've gotten to the garbage at the end, and returning EOF as needed.
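For illustration, here's a minimal sketch of that subclassing approach. It assumes an ASCII-compatible encoding and that you know your document's closing root tag; the </response> tag below is made up, so substitute your own. Bytes are passed through untouched until the closing tag has been emitted, after which the stream reports EOF:

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class TrimmingStream extends FilterInputStream {
    // the closing root tag of your document; "</response>" is made up
    private final byte[] endTag = "</response>".getBytes();
    private int matched = 0;      // how many bytes of endTag we've matched so far
    private boolean done = false;

    TrimmingStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (done)
            return -1;
        int b = super.read();
        if (b == -1) {
            done = true;
            return -1;
        }
        if (b == endTag[matched]) {
            if (++matched == endTag.length)
                done = true; // let the tag itself through, then report EOF
        } else {
            matched = (b == endTag[0]) ? 1 : 0;
        }
        return b;
    }

    // FilterInputStream.read(byte[],int,int) bypasses read(), so route it back
    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int i = 0;
        for (; i < len; i++) {
            int c = read();
            if (c == -1)
                return i == 0 ? -1 : i;
            b[off + i] = (byte) c;
        }
        return i;
    }
}

Since everything up to and including the closing tag passes through unchanged, embedded binary content inside the document isn't a problem; the stream simply stops once the document proper has ended.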
Free webhosts have this issue. I have yet to find a free alternative that doesn't.
Related
I have code which tries to append two StringBuffers:
logBuf.append(errStrBuf);
In the logs I see the following trace:
java.lang.StringIndexOutOfBoundsException: String index out of range: 90
at java.lang.AbstractStringBuilder.getChars(AbstractStringBuilder.java:325)
at java.lang.StringBuffer.getChars(StringBuffer.java:201)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:404)
at java.lang.StringBuffer.append(StringBuffer.java:253)
I cannot understand the cause of the issue.
Can you provide an example with constants?
Can it be related to concurrency?
Can you propose a solution?
Yes, it can have to do with concurrency. As per the doc:
This method synchronizes on this (the destination) object but does not
synchronize on the source (sb).
So, if errStrBuf is changed in the process, it may yield this error. Synchronize on it yourself, as such:
synchronized (errStrBuf) {
    logBuf.append(errStrBuf);
}
Use the same synchronized block wherever errStrBuf is modified.
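As for an example with constants: the following sketch (all names and values made up) will typically die with the same StringIndexOutOfBoundsException within a second or two; wrapping both uses of errStrBuf in synchronized (errStrBuf) blocks makes it run to completion:

public class StringBufferRace {
    public static void main(String[] args) throws InterruptedException {
        final StringBuffer logBuf = new StringBuffer();
        final StringBuffer errStrBuf = new StringBuffer("initial error text");

        Thread appender = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 1000000; i++) {
                    logBuf.setLength(0);      // keep logBuf from growing unbounded
                    logBuf.append(errStrBuf); // length() and getChars() are two separate
                                              // steps; not atomic w.r.t. the source buffer
                }
            }
        });
        Thread mutator = new Thread(new Runnable() {
            public void run() {
                for (int i = 0; i < 1000000; i++) {
                    errStrBuf.append("xxxxxxxxxx");
                    errStrBuf.setLength(5);   // shrink: a stale length inside append()
                                              // now reads past the end
                }
            }
        });
        appender.start();
        mutator.start();
        appender.join();
        mutator.join();
    }
}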
Some poking around the Java sources shows StringBuffer.append(StringBuffer sb) delegates to AbstractStringBuilder.append(StringBuffer sb) which does this:
// Length of additional sb.
int len = sb.length();
// Make sure there's room.
ensureCapacityInternal(count + len);
// Copy them through.
sb.getChars(0, len, value, count);
StringBuffer.getChars delegates to AbstractStringBuilder again so getChars looks a bit like:
public void getChars(int srcBegin, int srcEnd, char[] dst, int dstBegin) {
    if (srcBegin < 0)
        throw new StringIndexOutOfBoundsException(srcBegin);
    if ((srcEnd < 0) || (srcEnd > count))
        throw new StringIndexOutOfBoundsException(srcEnd);
    if (srcBegin > srcEnd)
        throw new StringIndexOutOfBoundsException("srcBegin > srcEnd");
    System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
}
Note that you are getting "String index out of range: 90", so it must be the srcEnd > count check returning true. So the string that is being appended is now shorter than the len that was passed. Clearly it must have been fiddled with by another thread.
I'm working on a Java project where I need to monitor files in a certain directory and be notified whenever one of the files changes; this can be achieved using WatchService. Furthermore, I want to know what changes were made, for example: "characters 10 to 15 were removed", "at index 13 the characters 'abcd' were added"... I'm willing to accept any solution, even one in C that monitors the filesystem.
I also want to avoid the diff approach, both because it stores the same file twice and because of the algorithm's complexity: it takes too much time for big files.
Thank you for your help. :)
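For what it's worth, the notification half really is straightforward with WatchService; here is a minimal sketch (the directory path is made up). Note it only tells you which file changed and how (created, modified, deleted), not what changed inside it, which is the hard part:

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class DirWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/tmp/watched"); // hypothetical directory to monitor
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        while (true) {
            WatchKey key = watcher.take(); // blocks until something happens
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
            }
            if (!key.reset())
                break; // directory no longer accessible
        }
    }
}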
If you're using Linux, the following code will detect changes in file length; you can easily extend it to detect other modifications.
Because you don't want to keep two copies of the file, there is no way to tell which characters were altered if either the file length was reduced (the lost characters can't be recovered) or the file was altered somewhere in the middle.
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char** argv)
{
    int fd = open("test", O_RDONLY);
    int length = lseek(fd, 0, SEEK_END);
    while (1)
    {
        int new_length;
        /* reopen the file each round so a recreated file is picked up too */
        close(fd);
        fd = open("test", O_RDONLY);
        sleep(1);
        new_length = lseek(fd, 0, SEEK_END);
        printf("new_length = %d\n", new_length);
        if (new_length != length)
            printf("Length changed! %d->%d\n", length, new_length);
        length = new_length;
    }
    return 0;
}
[EDIT]
Since the author accepts changes to the kernel for this task, the following change to vfs_write should do the trick:
#define MAX_DIFF_LENGTH 128

/* crude global counter so only the first few writes get logged */
static int ___ishay = 0;

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
    char old_content[MAX_DIFF_LENGTH + 1];
    char new_content[MAX_DIFF_LENGTH + 1];
    ssize_t ret;

    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
        return -EINVAL;
    if (unlikely(!access_ok(VERIFY_READ, buf, count)))
        return -EFAULT;

    ret = rw_verify_area(WRITE, file, pos, count);
    if (___ishay < 20)
    {
        int i;
        int length = count > MAX_DIFF_LENGTH ? MAX_DIFF_LENGTH : count;
        /* read through a copy of the position so the write offset is untouched */
        loff_t read_pos = *pos;
        ___ishay++;
        vfs_read(file, old_content, length, &read_pos);
        old_content[length] = 0;
        new_content[length] = 0;
        memcpy(new_content, buf, length);
        printk(KERN_ERR "[___ISHAY___]Write request for file named: %s count: %d pos: %lld:\n",
               file->f_path.dentry->d_name.name,
               count,
               *pos);
        printk(KERN_ERR "[___ISHAY___]New content (replacement) <%d>:\n", length);
        for (i = 0; i < length; i++)
        {
            printk("[0x%02x] (%c)", new_content[i],
                   (new_content[i] > 32 && new_content[i] < 127) ? new_content[i] : 46);
            if ((i + 1) % 10 == 0) /* line break every 10 bytes */
                printk("\n");
        }
        printk(KERN_ERR "[___ISHAY___]Old content (on file now):\n");
        for (i = 0; i < length; i++)
        {
            printk("[0x%02x] (%c)", old_content[i],
                   (old_content[i] > 32 && old_content[i] < 127) ? old_content[i] : 46);
            if ((i + 1) % 10 == 0)
                printk("\n");
        }
    }
    if (ret >= 0) {
        count = ret;
        if (file->f_op->write)
            ret = file->f_op->write(file, buf, count, pos);
        else
            ret = do_sync_write(file, buf, count, pos);
        if (ret > 0) {
            fsnotify_modify(file);
            add_wchar(current, ret);
        }
        inc_syscw(current);
    }
    return ret;
}
Explanation:
vfs_write is the function that handles write requests for files, so that's our best central hook to catch modification requests for files before they occur.
vfs_write accepts the file, file position, buffer and length for the write operation, so we know what part of the file will be replaced by this write, and what data will replace it.
Since we know what part of the file will be altered, I added the vfs_read call just before the actual write, to keep in memory the part of the file we are about to overwrite.
This should be a good starting point for what you need. I made the following simplifications, as this is only an example:
Buffers are statically allocated at 128 bytes max (they should be allocated dynamically, with protection against wasting too much memory on huge write requests)
The file length should be checked; the current code prints a read buffer even if the write extends beyond the end of the file
The output currently goes to dmesg. A better implementation would keep a cyclic buffer accessible through debugfs, possibly with a poll option
The current code captures writes to ALL files; I'm sure that's not what you want...
[EDIT2]
Forgot to mention where this function is located: it's under fs/read_write.c in the kernel tree.
[EDIT3]
There's another possible solution, provided you know which program you want to monitor and that it doesn't link libc statically: use LD_PRELOAD to override the write function, use that as your hook, and record the changes. I haven't tried this, but there's no reason it shouldn't work.
---EDIT below
I'm actually implementing the Mina ProtocolCodecFilter in order to receive messages from a serial device.
The codec specifies multiple different messages (with their POJOs), and even though the implementation works correctly 99% of the time, I'm having trouble with one type of message: the only one that doesn't have a fixed length. I can know the minimum length, but never the maximum.
This is the exception message I'm receiving (just the important parts):
org.apache.mina.filter.codec.ProtocolDecoderException: org.apache.mina.core.buffer.BufferDataException: dataLength: -2143812863 (Hexdump: 02 01 A2 02 01 A0 02)
at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:25
...
Caused by: org.apache.mina.core.buffer.BufferDataException: dataLength: -2143812863
at org.apache.mina.core.buffer.AbstractIoBuffer.prefixedDataAvailable(AbstractIoBuffer.java:2058)
at my.codec.in.folder.codec.MAFrameDecoder.doDecode(MAFrameDecoder.java:29)
at org.apache.mina.filter.codec.CumulativeProtocolDecoder.decode(CumulativeProtocolDecoder.java:178)
at org.apache.mina.filter.codec.ProtocolCodecFilter.messageReceived(ProtocolCodecFilter.java:241)
Sometimes the dataLength is negative, sometimes positive (I haven't found any clue about the cause of this).
MAFrameDecoder:29 is the second statement of my implementation of CumulativeProtocolDecoder's doDecode() method (MAX_SIZE = 4096):
public boolean doDecode(IoSession session, IoBuffer in, ProtocolDecoderOutput out)
        throws Exception {
    boolean result = false;
    if (in.prefixedDataAvailable(4, MAX_SIZE)) { // --> This is line 29
        int length = in.getInt();
        byte[] idAndData = new byte[length];
        in.get(idAndData);
        // do things, read from buffer, create message, out.write, etc.
        // if all has been correct, result = true
    }
    return result;
}
While debugging the error with a TCP sniffer, we figured out that the exception was thrown when multiple messages were inserted into the same IoBuffer (in).
It seems my decoder simply cannot handle multiple messages inside the same buffer. But as I said before, there's also the variable-length message issue (I really can't tell whether it's relevant). In other doDecode implementations I've seen other ways of managing the buffer, such as:
while (in.hasRemaining())
or
InputStream is=in.asInputStream();
Anyway, I'm trying to avoid blind fixes, which is why I'm asking here. Rather than just patching the error, I'd like to understand its cause.
I hope you can help; any advice would be really appreciated. :)
P.S.: The encoder that sends me the messages through the buffer has its autoExpand parameter set to false.
EDIT 10/11/2014
I've been exploring the AbstractIoBuffer source and found this:
@Override
public boolean prefixedDataAvailable(int prefixLength, int maxDataLength) {
    if (remaining() < prefixLength) {
        return false;
    }
    int dataLength;
    switch (prefixLength) {
    case 1:
        dataLength = getUnsigned(position());
        break;
    case 2:
        dataLength = getUnsignedShort(position());
        break;
    case 4:
        dataLength = getInt(position());
        break;
    default:
        throw new IllegalArgumentException("prefixLength: " + prefixLength);
    }
    if (dataLength < 0 || dataLength > maxDataLength) {
        throw new BufferDataException("dataLength: " + dataLength);
    }
    return remaining() - prefixLength >= dataLength;
}
The prefixLength I'm sending is 4, so the switch enters on the last valid case:
dataLength = getInt(position());
After that, it throws the BufferDataException with the negative dataLength, which means getInt(position()) is returning a negative value.
I always thought a NIO buffer could never hold a negative value at its position. Any clues as to why this is happening?
I think you should first try reading the size of the packet you have to decode, and ensure you have enough bytes remaining in the buffer for the decoding to complete successfully.
If there aren't enough bytes you should return false, so the cumulative protocol decoder can get more data for you.
Be careful to return the buffer to the appropriate position before returning; otherwise you will lose the length data for the next iteration. (If you are using 4 bytes for the length, you should rewind 4 bytes.)
Edit: You could actually use the mark() and reset() methods of the IoBuffer to achieve this behaviour
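To make that concrete, here's a sketch of a doDecode() along those lines, reusing MAX_SIZE and the byte-array handling from the question (the actual message building is elided):

public boolean doDecode(IoSession session, IoBuffer in, ProtocolDecoderOutput out)
        throws Exception {
    if (in.remaining() < 4) {
        return false; // not even a complete length prefix yet
    }
    in.mark(); // remember where this message starts
    int length = in.getInt();
    if (length < 0 || length > MAX_SIZE) {
        throw new ProtocolDecoderException("invalid length: " + length);
    }
    if (in.remaining() < length) {
        in.reset(); // rewind past the prefix and wait for more data
        return false;
    }
    byte[] idAndData = new byte[length];
    in.get(idAndData);
    // build the message POJO from idAndData here
    out.write(idAndData);
    // returning true makes CumulativeProtocolDecoder call us again, so
    // several messages sitting in one buffer are consumed one at a time
    return true;
}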
I'm making a rhythm game and I need a quick way to get the length of an Ogg file. The only way I could think of would be to stream the file really fast without playing it, but if I have hundreds of songs that would obviously not be practical. Another way would be to store each song's length in some sort of properties file, but I would like to avoid that. I know there must be some way to do this, as most music players can tell you the length of a song.
The quickest way to do it is to seek to the end of the file, then back up to the last Ogg page header you find and read its granulePosition (which is the total number of samples per channel in the file). That's not foolproof (you might be looking at a chained file, in which case you're only getting the last stream's length), but should work for the vast majority of Ogg files out there.
If you need help with reading the Ogg page header, you can read the Jorbis source code... The short version is to look for "OggS", read a byte (should be 0), read a byte (only bit 3 should be set), then read a 64-bit little endian value.
I implemented the solution described by ioctlLR and it seems to work:
double calculateDuration(final File oggFile) throws IOException {
    int rate = -1;
    long length = -1; // the granule position is a 64-bit value
    int size = (int) oggFile.length();
    byte[] t = new byte[size];
    FileInputStream stream = new FileInputStream(oggFile);
    try {
        // read() may return before filling the array, so loop until done
        int done = 0;
        while (done < size) {
            int r = stream.read(t, done, size - done);
            if (r < 0)
                break;
            done += r;
        }
    } finally {
        stream.close();
    }
    // Looking for the granule position (the value after the last "OggS"):
    // 4 bytes for "OggS", 2 bytes for version and header type, 8 bytes for the value
    for (int i = size - 1 - 8 - 2 - 4; i >= 0 && length < 0; i--) {
        if (t[i] == (byte) 'O' && t[i + 1] == (byte) 'g'
                && t[i + 2] == (byte) 'g' && t[i + 3] == (byte) 'S') {
            byte[] byteArray = new byte[] { t[i + 6], t[i + 7], t[i + 8], t[i + 9],
                    t[i + 10], t[i + 11], t[i + 12], t[i + 13] };
            ByteBuffer bb = ByteBuffer.wrap(byteArray);
            bb.order(ByteOrder.LITTLE_ENDIAN);
            length = bb.getLong(0); // read all 8 bytes, not just 4
        }
    }
    // Looking for the sample rate (the first value after "vorbis")
    for (int i = 0; i < size - 8 - 2 - 4 && rate < 0; i++) {
        if (t[i] == (byte) 'v' && t[i + 1] == (byte) 'o' && t[i + 2] == (byte) 'r'
                && t[i + 3] == (byte) 'b' && t[i + 4] == (byte) 'i' && t[i + 5] == (byte) 's') {
            byte[] byteArray = new byte[] { t[i + 11], t[i + 12], t[i + 13], t[i + 14] };
            ByteBuffer bb = ByteBuffer.wrap(byteArray);
            bb.order(ByteOrder.LITTLE_ENDIAN);
            rate = bb.getInt(0);
        }
    }
    // duration in milliseconds; multiply as double to avoid int overflow
    return (length * 1000.0) / rate;
}
Beware: finding the rate this way will only work for Vorbis Ogg files!
Feel free to edit my answer; it may not be perfect.
I have some data in a byte array, retrieved earlier from a network session using non-blocking IO (to facilitate multiple channels).
The format of the data is essentially
varint: length of text
UTF-8: the text
I am trying to figure out a way of efficiently extracting the text, given that its starting position is undetermined (as a varint is variable in length). I have something that's really close but for one small niggle, here goes:
import com.clearspring.analytics.util.Varint;
// Some fields for your info
private final byte replyBuffer[] = new byte[32768];
private static final Charset UTF8 = Charset.forName ("UTF-8");
// ...
// Code which extracts the text
ByteArrayInputStream byteInputStream = new ByteArrayInputStream(replyBuffer);
DataInputStream inputStream = new DataInputStream(byteInputStream);
int textLengthBytes;
try {
textLengthBytes = Varint.readSignedVarInt (inputStream);
}
catch (IOException e) {
// I don't think we should ever get an IOException when using the
// ByteArrayInputStream class
throw new RuntimeException ("Unexpected IOException", e);
}
int offset = byteInputStream.pos(); // ** Here lies the problem **
String textReceived = new String (replyBuffer, offset, textLengthBytes, UTF8);
The idea being that the text offset in the buffer is indicated by byteInputStream.pos(). However that method is protected.
It seems to me that the only way to get the "rest" of the text after decoding the varint is to use something that copies it all into another buffer, but that seems rather wasteful to me.
Constructing the string directly from the underlying buffer should be fine, because after this I don't care anymore for the state of byteInputStream or inputStream. So I am trying to figure out a way to calculate offset, or, put another way, how many bytes Varint.readSignedVarInt consumed. Perhaps there is an efficient method of converting from the integer value returned by Varint.readSignedVarInt to the number of bytes that would have taken up in the encoding?
There are a few ways you can find the offset of the string in the byte array:
You can create a subclass of ByteArrayInputStream that gives you access to the pos field; it has protected access precisely so that subclasses can use it. (See the sketch at the end of this answer.)
If you want something more generally applicable, create a subclass of FilterInputStream that counts the number of bytes that have been read. This is more work and probably not worth the effort though.
Count the number of bytes that encode the varint. There are at most 5.
int offset = 0; while (replyBuffer[offset++] < 0);
Calculate the number of bytes needed to encode a varint. Each byte encodes 7 bits so you can take the position of the highest 1 bit and divide by 7.
// "zigzag" encoding required since you store the length as signed
int textLengthUnsigned = (textLengthBytes << 1) ^ (textLengthBytes >> 31);
int offset = (31 - Integer.numberOfLeadingZeros(textLengthUnsigned)) / 7 + 1;
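For completeness, here's a sketch of the first option. The subclass name and accessor are made up, but pos itself really is a protected field of ByteArrayInputStream:

import java.io.ByteArrayInputStream;

class PositionedByteArrayInputStream extends ByteArrayInputStream {
    PositionedByteArrayInputStream(byte[] buf) {
        super(buf);
    }

    int position() {
        return pos; // protected in ByteArrayInputStream, visible to subclasses
    }
}

Then int offset = byteInputStream.position(); replaces the call that doesn't compile.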