I'm working on a Java project where I need to monitor files in a certain directory and be notified whenever changes are made to one of the files; this can be achieved using WatchService. Furthermore, I want to know what changes were made, for example: "characters 10 to 15 were removed", "at index 13 characters 'abcd' were added"... I'm willing to accept any solution, even one based on C that monitors the file system.
I also want to avoid the diff solution, both to avoid storing the same file twice and because of the complexity of the algorithm: it takes too much time for big files.
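For reference, a minimal WatchService loop (standard java.nio.file API) looks like the sketch below; it tells me that a file changed, but nothing about what changed inside it:

import java.io.IOException;
import java.nio.file.*;

public class DirWatcher {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path dir = Paths.get(args[0]);
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_MODIFY);
        while (true) {
            WatchKey key = watcher.take(); // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                // context() is the path of the touched file, relative to dir
                System.out.println(event.kind() + ": " + event.context());
            }
            if (!key.reset()) // directory is no longer accessible
                break;
        }
    }
}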
Thank you for help. :)
If you're using Linux, the following code will detect changes in file length; you can easily extend this to detect modifications.
Because you don't want to keep two copies of the file, there is no way to tell which characters were altered if either the file length was reduced (the lost characters can't be recovered) or the file was altered somewhere in the middle.
#include <stdio.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h> /* lseek, close, sleep */

int main(int argc, char** argv)
{
    int fd = open("test", O_RDONLY);
    off_t length = lseek(fd, 0, SEEK_END);
    while (1)
    {
        off_t new_length;
        /* reopen each iteration so a replaced file (new inode) is seen too */
        close(fd);
        fd = open("test", O_RDONLY);
        sleep(1);
        new_length = lseek(fd, 0, SEEK_END);
        printf("new_length = %ld\n", (long) new_length);
        if (new_length != length)
            printf("Length changed! %ld->%ld\n", (long) length, (long) new_length);
        length = new_length;
    }
}
[EDIT]
Since the author accepts changes to the kernel for this task, the following change to vfs_write should do the trick:
#define MAX_DIFF_LENGTH 128

static int ___ishay = 0; /* debug counter: only log the first 20 writes */

ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos)
{
    char old_content[MAX_DIFF_LENGTH+1];
    char new_content[MAX_DIFF_LENGTH+1];
    ssize_t ret;
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
        return -EINVAL;
    if (unlikely(!access_ok(VERIFY_READ, buf, count)))
        return -EFAULT;
    ret = rw_verify_area(WRITE, file, pos, count);
    if (___ishay < 20)
    {
        int i;
        int length = count > MAX_DIFF_LENGTH ? MAX_DIFF_LENGTH : count;
        loff_t read_pos = *pos; /* read via a copy so the write position isn't advanced */
        ___ishay++;
        vfs_read(file, old_content, length, &read_pos);
        old_content[length] = 0;
        new_content[length] = 0;
        /* buf is a user-space pointer, so copy_from_user rather than memcpy */
        if (copy_from_user(new_content, buf, length))
            length = 0;
        printk(KERN_ERR "[___ISHAY___]Write request for file named: %s count: %zu pos: %lld:\n",
            file->f_path.dentry->d_name.name,
            count,
            *pos);
        printk(KERN_ERR "[___ISHAY___]New content (replacement) <%d>:\n", length);
        for (i = 0; i < length; i++)
        {
            printk("[0x%02x] (%c)", new_content[i], (new_content[i] > 32 && new_content[i] < 127) ?
                new_content[i] : 46);
            if ((i + 1) % 10 == 0) /* line break every 10 bytes */
                printk("\n");
        }
        printk(KERN_ERR "[___ISHAY___]Old content (on file now):\n");
        for (i = 0; i < length; i++)
        {
            printk("[0x%02x] (%c)", old_content[i], (old_content[i] > 32 && old_content[i] < 127) ?
                old_content[i] : 46);
            if ((i + 1) % 10 == 0)
                printk("\n");
        }
    }
    if (ret >= 0) {
        count = ret;
        if (file->f_op->write)
            ret = file->f_op->write(file, buf, count, pos);
        else
            ret = do_sync_write(file, buf, count, pos);
        if (ret > 0) {
            fsnotify_modify(file);
            add_wchar(current, ret);
        }
        inc_syscw(current);
    }
    return ret;
}
Explanation:
vfs_write is the function that handles write requests for files, so that's our best central hook to catch modification requests for files before they occur.
vfs_write accepts the file, file position, buffer and length for the write operation, so we know what part of the file will be replaced by this write, and what data will replace it.
Since we know what part of the file will be altered, I added a vfs_read call just before the actual write to keep in memory the part of the file we are about to overwrite.
This should be a good starting point to get what you need. I made the following simplifications, as this is only an example:
Buffers are allocated statically, at most 128 bytes (they should be allocated dynamically, with protection against wasting too much memory on huge write requests)
The file length should be checked and the read clamped to it; the current code prints a read buffer even if the write extends beyond the current end of the file
The output currently goes to dmesg. A better implementation would be to keep a cyclic buffer accessible in debugfs, possibly with a poll option
The current code captures writes to ALL files; I'm sure that's not what you want...
[EDIT2]
Forgot to mention where this function is located: it's in fs/read_write.c in the kernel tree.
[EDIT3]
There's another possible solution, provided you know which program you want to monitor and that it doesn't have libc linked statically: use LD_PRELOAD to override the write function and use that as your hook to record the changes. I haven't tried this, but there's no reason why it shouldn't work.
I have a data file which contains 100,000+ lines; each line contains just two fields, a key and a value separated by a comma, and all the keys are unique. I want to query the value by key from this file. Loading it into a map is out of the question as that consumes too much memory (the code will run on an embedded device), and I don't want a DB involved. What I do so far is preprocess the file on my PC, i.e., sort the lines, then use binary search like below on the preprocessed file:
public long findKeyOffset(RandomAccessFile raf, String key)
throws IOException {
int blockSize = 8192;
long fileSize = raf.length();
long min = 0;
long max = (long) fileSize / blockSize;
long mid;
String line;
while (max - min > 1) {
mid = min + (long) ((max - min) / 2);
raf.seek(mid * blockSize);
if (mid > 0)
line = raf.readLine(); // probably a partial line
line = raf.readLine();
String[] parts = line.split(",");
if (key.compareTo(parts[0]) > 0) {
min = mid;
} else {
max = mid;
}
}
// find the right line
min = min * blockSize;
raf.seek(min);
if (min > 0)
line = raf.readLine();
while (true) {
min = raf.getFilePointer();
line = raf.readLine();
if (line == null)
break;
String[] parts = line.split(",");
if (key.compareTo(parts[0]) <= 0) // stop at the first key that is >= the search key
break;
}
raf.seek(min);
return min;
}
I think there are better solutions than this. Can anyone give me some enlightenment?
Data is immutable and keys are unique (as mentioned in the comments on the question).
A simple solution: write your own hashing code to map keys to line numbers.
This means: skip the sorting and instead write your data to the file in the order that your hashing algorithm dictates.
When a key is queried, you hash the key, get the specific line number and then read the value.
In theory, you have an O(1) solution to your problem.
Ensure that the hashing algorithm produces few collisions; depending on your exact case, a few collisions should be fine. Example: 3 keys map to the same line number, so you write all three of them on the same line, and when any of the collided keys is searched, you read all 3 entries from that line, then do a linear (aka O(3), aka constant-time in this case) search on the entire line. A sketch follows below.
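A minimal sketch of this idea, assuming the file has been preprocessed into fixed-width buckets (the RECORD_WIDTH and BUCKETS values and the ';' separator between collided entries are all assumptions, not part of your format):

import java.io.IOException;
import java.io.RandomAccessFile;

public class HashedFileLookup {
    static final int RECORD_WIDTH = 64; // bytes per bucket line, space-padded (assumption)
    static final int BUCKETS = 1 << 17; // bucket count the file was written with (assumption)

    public static String lookup(RandomAccessFile raf, String key) throws IOException {
        // hash the key to a bucket, and the bucket number to a byte offset
        int bucket = (key.hashCode() & 0x7fffffff) % BUCKETS;
        raf.seek((long) bucket * RECORD_WIDTH);
        byte[] record = new byte[RECORD_WIDTH];
        raf.readFully(record);
        // a bucket may hold several collided "key,value" entries; scan them linearly
        for (String entry : new String(record, "US-ASCII").trim().split(";")) {
            String[] parts = entry.split(",");
            if (parts.length == 2 && parts[0].equals(key))
                return parts[1];
        }
        return null; // not found
    }
}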
An easy algorithm to optimise performance for your specific constraints:
Let n be the number of lines in the original, immutable, sorted file.
Let k < n be a number (we'll discuss the ideal number later).
Divide the file into k files with approximately equal numbers of lines in each (so each file has n/k lines). The files will be referred to as F1...Fk. If you prefer to keep the original file intact, just treat F1...Fk as ranges of line numbers within the file, cutting it into segments.
Create a new file called P with k lines, where line i is the first key of Fi.
When looking for a key, first binary search over P in O(log k) to find which file/segment (F1...Fk) you need, then go to that file/segment and search within it.
If k is big enough, the size of Fi (n/k) will be small enough to load into a HashMap and retrieve keys in O(1). If that is still not practical, do a binary search in O(log(n/k)).
The total search is O(log k) + O(log(n/k)), which is an improvement on the O(log n) of your original solution.
I would suggest finding a k big enough to let you load a specific Fi file/segment into a HashMap, yet not so big that P fills up space on your device. The most balanced k is sqrt(n), which makes the solution run in O(log(sqrt(n))), but that may produce quite a large P file. If you can find a k that allows you to load both P and Fi into HashMaps for O(1) retrieval, that would be the best solution. A sketch of the lookup follows below.
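A rough sketch of the lookup side, assuming P has already been loaded into a sorted firstKeys array and the segments are files named F1...Fk holding one "key,value" pair per line (the names follow the description above; error handling is omitted):

import java.io.*;
import java.util.*;

public class SegmentedLookup {
    private final File dir;           // directory holding F1...Fk
    private final String[] firstKeys; // contents of P: the first key of each segment, sorted

    public SegmentedLookup(File dir, String[] firstKeys) {
        this.dir = dir;
        this.firstKeys = firstKeys;
    }

    public String lookup(String key) throws IOException {
        // binary search over P: O(log k)
        int seg = Arrays.binarySearch(firstKeys, key);
        if (seg < 0)
            seg = -seg - 2; // last segment whose first key precedes `key`
        if (seg < 0)
            return null;    // key is smaller than every first key
        // load segment F(seg+1) into a HashMap: O(n/k) read, then O(1) retrieval
        Map<String, String> map = new HashMap<>();
        try (BufferedReader r = new BufferedReader(
                new FileReader(new File(dir, "F" + (seg + 1))))) {
            String line;
            while ((line = r.readLine()) != null) {
                int comma = line.indexOf(',');
                map.put(line.substring(0, comma), line.substring(comma + 1));
            }
        }
        return map.get(key);
    }
}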
What about this?
#include <iostream>
#include <fstream>
#include <boost/algorithm/string.hpp>
#include <vector>
using namespace std;

int main(int argc, char *argv[])
{
    ifstream f(argv[1], ios::ate); // open positioned at the end to get the size
    if (!f.is_open())
        return 0;
    string key(argv[2]), value;
    int max = f.tellg();
    int min = 0, mid = 0;
    string s;
    while (max - min > 1)
    {
        mid = min + (max - min) / 2;
        f.seekg(mid);
        f >> s; // probably a partial line
        std::vector<std::string> strs;
        if (!f)
        {
            break;
        }
        if (mid)
        {
            f >> s; // read the next full line
        }
        boost::split(strs, s, boost::is_any_of(","));
        int comp = key.compare(strs[0]);
        if (comp < 0)
        {
            max = mid;
        }
        else if (comp > 0)
        {
            min = mid;
        }
        else
        {
            value = strs[1];
            break;
        }
    }
    // The loop can end (e.g. when the key is on the very first line)
    // without ever having compared the final candidate line, so check it.
    if (value.empty())
    {
        f.clear();
        f.seekg(min);
        if (min)
            f >> s; // skip the partial line
        if (f >> s)
        {
            std::vector<std::string> strs;
            boost::split(strs, s, boost::is_any_of(","));
            if (strs.size() > 1 && key.compare(strs[0]) == 0)
                value = strs[1];
        }
    }
    cout << "key " << key;
    if (!value.empty())
    {
        cout << " found! value = " << value << endl;
    }
    else
    {
        cout << " not found..." << endl;
    }
    f.close();
    return 0;
}
Is there a smart way to create a 'JSON-like' structure of String-Float pairs? A 'key' is not needed, as data will be grabbed randomly, although an incremented key from 0-n might aid random retrieval of associated data. Due to the size of the data set (10k pairs of values), I need this to be saved out to an external file.
The reason is how my data will be compiled. To save someone entering data into an array manually, the items will be Excel-based, saved out to CSV, and parsed using a temporary Java program into a file format (for example JSON) which can be added to my project's resources folder. I can then retrieve data from this set without my application having to load a huge array into memory on application creation. I can quite easily parse the CSV to 'fill up' an array (or similar) at run-time, but I fear that on a mobile device the memory overhead will be significant.
I have reviewed the answers to: Suitable Java data structure for parsing large data file and Data structure options for efficiently storing sets of integer pairs on disk? and have not been able to draw a definitive conclusion.
I have tried saving to a .JSON file; however, I'm not sure if I can request a random entry, plus this seems quite cumbersome for holding a simple structure. Is a TreeMap or Hashtable where I need to be focusing my search?
To provide some context for my query: my application will be running on Android and needs to reference a definition (an approx. 500-character String) and a conversion factor (a Float). I need to retrieve a random data entry. The user may only make 2 or 3 requests during a session, so I see no point in loading a 10k-element array into memory. QUERY: potentially, modern-day technology on Android phones will easily munch through this type of query, and it's perhaps only an issue if I am parsing millions of entries at run-time?
I am open to using SQLite to hold my data if it will provide the functionality required. Please note that the data set must be derived from an easily exportable file format from Excel (CSV, TXT, etc.).
Any advice you can give me would be much appreciated.
Here's one possible design that requires a minimal memory footprint while providing fast access:
Start with a data file of comma-separated or tab-separated values so you have line breaks between your data pairs.
Keep an array of long values corresponding to the byte offsets of the lines in the data file. When you know where the lines are, you can use InputStream.skip() to advance to the desired line. This leverages the fact that skip() is typically quite a bit faster than read() for InputStreams.
You would have some setup code that would run at initialization time to index the lines.
An enhancement would be to only index every nth line so that the array is smaller. So if n is 100 and you're accessing line 1003, you take the 10th index to skip to line 1000, then read past two more lines to get to line 1003. This allows you to tune the size of the array to use less memory.
I thought this was an interesting problem, so I put together some code to test my idea. It uses a sample 4MB CSV file that I downloaded from some big data website that has about 36K lines of data. Most of the lines are longer than 100 chars.
Here's code snippet for the setup phase:
long start = SystemClock.elapsedRealtime();
int lineCount = 0;
try (InputStream in = getResources().openRawResource(R.raw.fl_insurance_sample)) {
int index = 0;
int charCount = 0;
int cIn;
while ((cIn = in.read()) != -1) {
charCount++;
char ch = (char) cIn;
// note: a file with \r\n line endings would count each line twice here
if (ch == '\n' || ch == '\r') {
lineCount++;
if (lineCount % MULTIPLE == 0) {
index = lineCount / MULTIPLE;
if (index == mLines.length) {
mLines = Arrays.copyOf(mLines, mLines.length + 100);
}
mLines[index] = charCount;
}
}
}
mLines = Arrays.copyOf(mLines, index+1);
} catch (IOException e) {
Log.e(TAG, "error reading raw resource", e);
}
long elapsed = SystemClock.elapsedRealtime() - start;
I discovered my data file was actually separated by carriage returns rather than line feeds. It must have been created on an Apple computer. Hence the test for '\r' as well as '\n'.
Here's a snippet from the code to access the line:
long start = SystemClock.elapsedRealtime();
int ch;
int line = Integer.parseInt(editText.getText().toString().trim());
// (the upper bound should be the total line count, approximated here)
if (line < 1 || line >= mLines.length * MULTIPLE) {
    mTextView.setText("invalid line: " + line);
    return;
}
line--;
int index = (line / MULTIPLE);
in.skip(mLines[index]);
int rem = line % MULTIPLE;
while (rem > 0) {
ch = in.read();
if (ch == -1) {
return; // readLine will fail
} else if (ch == '\n' || ch == '\r') {
rem--;
}
}
BufferedReader reader = new BufferedReader(new InputStreamReader(in));
String text = reader.readLine();
long elapsed = SystemClock.elapsedRealtime() - start;
My test program used an EditText so that I could input the line number.
So to give you some idea of performance: the first phase averaged around 1600 ms to read through the entire file, using a MULTIPLE value of 10. Accessing the last record in the file averaged about 30 ms.
To get down to 30ms access with only a 29312-byte memory footprint is pretty good, I think.
You can see the sample project on GitHub.
I have a file of around 4-5 GB (nearly a billion lines). From every line of the file, I have to parse an array of integers plus one additional integer of metadata, and update my custom data structure. My class to hold such information looks like
class Holder {
private int[][] arr = new int[1000000000][5]; // assuming that max array size is 5
private int[] meta = new int[1000000000];
}
A sample line from the file looks like
(1_23_4_55) 99
Every index in the arr & meta corresponds to the line number in the file. From the above line, I extract the array of integers first and then the meta information. In that case,
--pseudo_code--
arr[line_num] = new int[]{1, 23, 4, 55}
meta[line_num]=99
Right now, I am using a BufferedReader object and its readLine method to read each line, using character-level operations to parse the integer array and meta information from each line and populate the Holder instance. But it takes almost half an hour to complete the entire operation.
I used both Java Serialization & Externalizable (writing the meta and arr) to serialize and deserialize this HUGE Holder instance. With both of them, the time to serialize is almost half an hour, and the time to deserialize is also almost half an hour.
I would appreciate your suggestions on dealing with this kind of problem & would definitely love to hear your side of the story, if any.
P.S. Main memory is not a problem. I have almost 50 GB of RAM on my machine. I have also increased the BufferedReader size to 40 MB (of course, I can increase this up to 100 MB considering that disk access takes approx. 100 MB/sec). Even cores and CPU are not a problem.
EDIT I
The code that I am using to do this task is provided below (after anonymizing a few details):
public class BigFileParser {
private int parsePositiveInt(final String s) {
int num = 0;
int sign = -1;
final int len = s.length();
final char ch = s.charAt(0);
if (ch == '-')
sign = 1;
else
num = '0' - ch;
int i = 1;
while (i < len)
num = num * 10 + '0' - s.charAt(i++);
return sign * num;
}
private void loadBigFile() {
long startTime = System.nanoTime();
Holder holder = new Holder();
String line;
try {
Reader fReader = new FileReader("/path/to/BIG/file");
// 40 MB buffer size
BufferedReader bufferedReader = new BufferedReader(fReader, 40 * 1024 * 1024);
String tempTerm;
int i, meta, ascii, len;
boolean consumeNextInteger;
// GNU Trove primitive int array list
TIntArrayList arr;
char c;
while ((line = bufferedReader.readLine()) != null) {
consumeNextInteger = true;
tempTerm = "";
arr = new TIntArrayList(5);
for (i = 0, len = line.length(); i < len; i++) {
c = line.charAt(i);
ascii = (int) c; // the char's code point
// 95 is the ascii value of the '_' char
if (consumeNextInteger && ascii == 95) {
arr.add(parsePositiveInt(tempTerm));
tempTerm = "";
} else if (ascii >= 48 && ascii <= 57) { // '0' - '9'
tempTerm += c;
} else if (ascii == 9) { // '\t'
arr.add(parsePositiveInt(tempTerm));
consumeNextInteger = false;
tempTerm = "";
}
}
meta = parsePositiveInt(tempTerm);
holder.update(arr, meta);
}
bufferedReader.close();
long endTime = System.nanoTime();
System.out.println("#time -> " + (endTime - startTime) * 1.0
/ 1000000000 + " seconds");
} catch (IOException exp) {
exp.printStackTrace();
}
}
}
public class Holder {
private static final int SIZE = 500000000;
private TIntArrayList[] arrs;
private TIntArrayList metas;
private int idx;
public Holder() {
arrs = new TIntArrayList[SIZE];
metas = new TIntArrayList(SIZE);
idx = 0;
}
public void update(TIntArrayList arr, int meta) {
arrs[idx] = arr;
metas.add(meta);
idx++;
}
}
It sounds like the time taken for file I/O is the main limiting factor, given that serialization (binary format) and your own custom format take about the same time.
Therefore, the best thing you can do is reduce the size of the file. If your numbers are generally small, you could get a huge boost from using Google protocol buffers, which encode small integers in only one or two bytes.
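For intuition, this is roughly how protobuf-style varints pack small integers; a hand-rolled sketch of the encoding, not the actual protocol buffers API:

class Varint {
    // Each byte carries 7 payload bits; the high bit means "more bytes follow",
    // so values < 128 take one byte, values < 16384 take two, and so on.
    static int write(byte[] out, int pos, int value) {
        while ((value & ~0x7F) != 0) {
            out[pos++] = (byte) ((value & 0x7F) | 0x80); // low 7 bits + continuation flag
            value >>>= 7;
        }
        out[pos++] = (byte) value; // final byte, high bit clear
        return pos;                // next free position in the buffer
    }
}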
Or, if you know that all your numbers are in the 0-255 range, you could use a byte[] rather than int[] and cut the size (and hence load time) to a quarter of what it is now. (assuming you go back to serialization or just write to a ByteChannel)
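A sketch of the byte[] variant (the file path and raw single-byte-per-value layout are assumptions):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class ByteDump {
    // store each value as one byte instead of a 4-byte int,
    // assuming all values fit in the 0-255 range
    public static void write(byte[] data, Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ch.write(ByteBuffer.wrap(data)); // one bulk write, no per-element overhead
        }
    }

    public static byte[] read(Path path) throws IOException {
        byte[] data = new byte[(int) Files.size(path)];
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining() && ch.read(buf) != -1)
                ; // keep filling until the buffer is full or EOF
        }
        return data;
    }
}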
It simply can't take that long. You're working with some 6e9 ints, which means 24 GB. Writing 24 GB to the disk takes some time, but nothing like half an hour.
I'd put all the data in a single one-dimensional array and access it via methods like int getArr(int row, int col) which transform row and col onto a single index. According to how the array gets accessed (usually row-wise or usually column-wise), this index would be computed as N * row + col or N * col + row to maximize locality. I'd also store meta in the same array.
Filling a single huge int[] in memory should be pretty fast, surely nothing like half an hour.
Because of the amount of data, the above doesn't work directly, as you can't have an array with 6e9 entries. But you can use a couple of big arrays instead, and all of the above applies (compute a long index from row and col and split it into two ints for accessing the 2D array), as in the sketch below.
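A sketch of that layout (the chunking scheme and N = 6 slots per row, 5 array values plus meta, are assumptions to make the example concrete):

public class FlatMatrix {
    private static final int N = 6;           // 5 array slots + 1 meta slot per row (assumption)
    private static final int CHUNK = 1 << 30; // entries per backing array
    private final int[][] chunks;

    public FlatMatrix(long rows) {
        long total = rows * N;
        int nChunks = (int) ((total + CHUNK - 1) / CHUNK);
        chunks = new int[nChunks][];
        for (int i = 0; i < nChunks; i++) {
            long remaining = total - (long) i * CHUNK;
            chunks[i] = new int[(int) Math.min(CHUNK, remaining)];
        }
    }

    // row-major: N * row + col keeps a row's entries adjacent for row-wise access
    public int get(long row, int col) {
        long idx = row * N + col;
        return chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)];
    }

    public void set(long row, int col, int value) {
        long idx = row * N + col;
        chunks[(int) (idx / CHUNK)][(int) (idx % CHUNK)] = value;
    }
}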
Make sure you aren't swapping. Swapping is the most probable reason for the slow speed I can think of.
There are several alternative Java file I/O libraries. This article is a little old, but it gives an overview that's still generally valid. He's reading about 300 MB per second with a 6-year-old Mac. So for 4 GB you have under 15 seconds of read time. Of course my experience is that Mac I/O channels are very good. YMMV if you have a cheap PC.
Note there is no advantage above a buffer size of 4K or so. In fact you're more likely to cause thrashing with a big buffer, so don't do that.
The implication is that parsing characters into the data you need is the bottleneck.
I have found in other apps that reading into a block of bytes and writing C-like code to extract what I need goes faster than the built-in Java mechanisms like split and regular expressions.
If that still isn't fast enough, you'd have to fall back to a native C extension.
If you randomly pause it you will probably see that the bulk of the time goes into parsing the integers, and/or all the new-ing, as in new int[]{1, 23, 4, 55}. You should be able to just allocate the memory once and stick numbers into it at better than I/O speed if you code it carefully.
But there's another way - why is the file in ASCII?
If it were in binary, you could just slurp it up.
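For illustration, a sketch of what the binary "slurp" could look like, assuming the file were rewritten as raw little-endian ints, six per record (5 array values plus the meta value). Note a single map() call is limited to 2 GB, so a 4-5 GB file would have to be mapped in slices:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.*;

public class BinarySlurp {
    public static int[] slurp(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // assumes the file is under 2 GB; bigger files need several mappings
            ByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.order(ByteOrder.LITTLE_ENDIAN);
            IntBuffer ints = buf.asIntBuffer();
            int[] out = new int[ints.remaining()];
            ints.get(out); // one bulk copy instead of per-character text parsing
            return out;
        }
    }
}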
I'm making a rhythm game and I need a quick way to get the length of an ogg file. The only way I could think would be to stream the file really fast without playing it but if I have hundreds of songs this would obviously not be practical. Another way would be to store the length of the file in some sort of properties file but I would like to avoid this. I know there must be some way to do this as most music players can tell you the length of a song.
The quickest way to do it is to seek to the end of the file, then back up to the last Ogg page header you find and read its granulePosition (which is the total number of samples per channel in the file). That's not foolproof (you might be looking at a chained file, in which case you're only getting the last stream's length), but should work for the vast majority of Ogg files out there.
If you need help with reading the Ogg page header, you can read the Jorbis source code... The short version is to look for "OggS", read a byte (should be 0), read a byte (only bit 3 should be set), then read a 64-bit little endian value.
I implemented the solution described by ioctlLR and it seems to work:
double calculateDuration(final File oggFile) throws IOException {
    int rate = -1;
    long length = -1; // the granule position is a 64-bit value
    int size = (int) oggFile.length();
    byte[] t = new byte[size];
    try (DataInputStream stream = new DataInputStream(new FileInputStream(oggFile))) {
        stream.readFully(t); // a single read() is not guaranteed to fill the buffer
    }
    // 4 bytes for "OggS", 2 header bytes, 8 bytes for the granule position
    for (int i = size - 1 - 8 - 2 - 4; i >= 0 && length < 0; i--) {
        // looking for the granule position (value after the last "OggS")
        if (t[i] == (byte) 'O' && t[i + 1] == (byte) 'g'
                && t[i + 2] == (byte) 'g' && t[i + 3] == (byte) 'S') {
            byte[] byteArray = new byte[] { t[i + 6], t[i + 7], t[i + 8], t[i + 9],
                    t[i + 10], t[i + 11], t[i + 12], t[i + 13] };
            ByteBuffer bb = ByteBuffer.wrap(byteArray);
            bb.order(ByteOrder.LITTLE_ENDIAN);
            length = bb.getLong(0); // read all 64 bits, not just the low 32
        }
    }
    for (int i = 0; i < size - 8 - 2 - 4 && rate < 0; i++) {
        // looking for the sample rate (first value after "vorbis" in the id header)
        if (t[i] == (byte) 'v' && t[i + 1] == (byte) 'o' && t[i + 2] == (byte) 'r'
                && t[i + 3] == (byte) 'b' && t[i + 4] == (byte) 'i' && t[i + 5] == (byte) 's') {
            byte[] byteArray = new byte[] { t[i + 11], t[i + 12], t[i + 13], t[i + 14] };
            ByteBuffer bb = ByteBuffer.wrap(byteArray);
            bb.order(ByteOrder.LITTLE_ENDIAN);
            rate = bb.getInt(0);
        }
    }
    return (length * 1000.0) / rate; // duration in milliseconds
}
Beware, finding the rate this way will work only for vorbis OGG!
Feel free to edit my answer, it may not be perfect.
My free webhost appends analytics javascript to all PHP and HTML files. Which is fine, except that I want to send XML to my Android app, and it's invalidating my files.
Since the XML is parsed in its entirety (and blows up) before being passed along to my SAX ContentHandler, I can't just catch the exception and continue merrily along with a fleshed-out object. (Which I tried, and then felt sheepish about.)
Any suggestions on a reasonably efficient strategy?
I'm about to create a class that will take my InputStream, read through it until I find the junk, break, then take what I just wrote to, convert it back into an InputStream and pass it along like nothing happened. But I'm worried that it'll be grossly inefficient, have bugs I shouldn't have to deal with (e.g. breaking on binary values such as embedded images) and hopefully unnecessary.
FWIW, this is part of an Android project, so I'm using the android.util.Xml class (see source code). When I traced the exception, it took me to a native appendChars function that is itself being called from a network of private methods anyway, so subclassing anything seems to be unreasonably useless.
Here's the salient bit from my stacktrace:
E/AndroidRuntime( 678): Caused by: org.apache.harmony.xml.ExpatParser$ParseException: At line 3, column 0: junk after document element
E/AndroidRuntime( 678): at org.apache.harmony.xml.ExpatParser.parseFragment(ExpatParser.java:523)
E/AndroidRuntime( 678): at org.apache.harmony.xml.ExpatParser.parseDocument(ExpatParser.java:482)
E/AndroidRuntime( 678): at org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:320)
E/AndroidRuntime( 678): at org.apache.harmony.xml.ExpatReader.parse(ExpatReader.java:277)
I guess in the end I'm asking for opinions on whether the InputStream -> manually parse to OutputStream -> recreate InputStream -> pass along solution is as horrible as I think it is.
"I'm about to create a class that will take my InputStream, read through it until I find the junk, break, then take what I just wrote to, convert it back into an InputStream and pass it along like nothing happened. But I'm worried that it'll be grossly inefficient, have bugs I shouldn't have to deal with (e.g. breaking on binary values such as embedded images) and hopefully unnecessary."
You could use a FilterInputStream for that; no need for a buffer.
The best thing to do is add a delimiter to the end of the XML, like --theXML ends HERE-- or a char not found in XML, like a run of 16 \u0004 chars (you then only need to check every 16th byte), and read until you find it.
Implementation assuming a \u0004 delimiter:
class WebStream extends FilterInputStream {
    byte[] buff = new byte[1024];
    int offset = 0, length = 0;
    boolean done = false; // set once the delimiter block has been seen

    public WebStream(InputStream i) {
        super(i);
    }

    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public int read() throws IOException {
        if (offset == length)
            readNextChunk();
        if (length == -1)
            return -1; // eof
        return buff[offset++] & 0xff; // mask so bytes >= 0x80 aren't returned negative
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (offset == length)
            readNextChunk();
        if (length == -1)
            return -1; // eof
        int cop = length - offset;
        if (len < cop)
            cop = len;
        System.arraycopy(buff, offset, b, off, cop);
        offset += cop;
        return cop;
    }

    private void readNextChunk() throws IOException {
        if (done) {
            length = -1;
            offset = 0;
            return;
        }
        // compact any unread remainder to the front of the buffer
        if (offset <= length) {
            System.arraycopy(buff, offset, buff, 0, length - offset);
            length -= offset;
            offset = 0;
        }
        int read = in.read(buff, length, buff.length - length);
        if (read < 0) {
            if (length <= 0) {
                length = -1;
                offset = 0;
            }
            return;
        }
        // note that this is assuming an ascii compatible encoding;
        // anything like utf16 or utf32 will break here.
        // the delimiter is a run of 16 \u0004 bytes, so probing every 16th
        // byte is guaranteed to land inside it
        for (int i = length; i < read + length; i += 16) {
            if (buff[i] == 0x04) {
                while (i > 0 && buff[i - 1] == 0x04)
                    i--; // back up to the beginning of the delimiter run
                done = true;
                length = i; // everything before the delimiter is real data
                return;
            }
        }
        length += read;
    }
}
Note this misses throws declarations and some error checking, and needs proper debugging.
"I'm about to create a class that will take my InputStream, read through it until I find the junk, break, then take what I just wrote to, convert it back into an InputStream and pass it along like nothing happened. But I'm worried that it'll be grossly inefficient, have bugs I shouldn't have to deal with (e.g. breaking on binary values such as embedded images) and hopefully unnecessary."
That'll work. You can read into a StringBuffer and then use a ByteArrayInputStream or something similar (like StreamReader if that's applicable).
http://developer.android.com/reference/java/io/ByteArrayInputStream.html
The downside is that you're reading the entire XML file into memory; for large files, that can be inefficient memory-wise. A sketch of this approach follows.
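A sketch of the read-and-trim approach (the </response> root tag is an assumption; substitute your document's actual root element):

import java.io.*;

class XmlTrimmer {
    // buffer the whole response, drop everything after the closing root tag,
    // and hand back a fresh InputStream for the SAX parser
    static InputStream stripTrailingJunk(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1)
            buf.write(chunk, 0, n);
        String text = buf.toString("UTF-8");
        int end = text.lastIndexOf("</response>"); // assumed root element
        if (end != -1)
            text = text.substring(0, end + "</response>".length());
        return new ByteArrayInputStream(text.getBytes("UTF-8"));
    }
}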
Alternatively, you can subclass InputStream and do the filtering out via the stream. You'd probably just need to override the 3 read() methods, calling super.read() and flagging when you've gotten to the garbage at the end, returning EOF as needed. A rough sketch of that idea follows below.
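A rough sketch of that idea (again, the root element name is an assumption; the tag matching is naive and only suits tags with no repeated '<'):

import java.io.*;

class TrimmedXmlStream extends FilterInputStream {
    private final byte[] endTag;
    private int matched = 0;
    private boolean done = false;

    TrimmedXmlStream(InputStream in, String rootElement) {
        super(in);
        this.endTag = ("</" + rootElement + ">").getBytes();
    }

    @Override
    public int read() throws IOException {
        if (done)
            return -1; // everything after the root element is junk
        int b = super.read();
        if (b == -1)
            return -1;
        matched = (b == endTag[matched]) ? matched + 1 : (b == endTag[0] ? 1 : 0);
        if (matched == endTag.length)
            done = true; // report EOF from the next call on
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        // simple byte-at-a-time delegation so the matching logic stays in read()
        int i = 0;
        for (; i < len; i++) {
            int b = read();
            if (b == -1)
                return i == 0 ? -1 : i;
            buf[off + i] = (byte) b;
        }
        return i;
    }
}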
Free webhosts have this issue. I have yet to find an alternative that's still free.