Why is ReversedLinesFileReader so slow? - java

I have a file that is 21.6GB and I want to read it from the end to the start rather than from the beginning to the end as you would usually do.
If I read each line of the file from the start to the end using the following code, then it takes 1 minute, 12 seconds.
val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLine {
    val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())
Now, I have read that to read a file in reverse I should use ReversedLinesFileReader from Apache Commons. I have created the following extension function to do just this:
fun File.forEachLineFromTheEndOfFile(action: (line: String) -> Unit) {
    val reader = ReversedLinesFileReader(this, Charset.defaultCharset())
    var line = reader.readLine()
    while (line != null) {
        action.invoke(line)
        line = reader.readLine()
    }
    reader.close()
}
and then call it in the following way, which is the same as the previous way, only with a call to the forEachLineFromTheEndOfFile function:
val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLineFromTheEndOfFile {
    val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())
This took 17 minutes and 50 seconds to run!
Am I using ReversedLinesFileReader in the correct way?
I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
Is it just the case that files should not be read from the end to the start?

You are asking for a very expensive operation. Not only are you using random access in blocks to read the file and going backwards (so if the file system is reading ahead, it is reading in the wrong direction), you are also reading an XML file, which is UTF-8, and decoding a variable-width encoding is slower than a fixed-width one.
Then, on top of that, you are using a less-than-efficient algorithm. It reads a block at a time, of an inconvenient size (is it aware of the disk block size? are you setting the block size to match your file system?), backwards while processing the encoding, makes an (unnecessary?) copy of the partial byte array, and then turns it into a string (do you need a string to parse?). It could create the string without the copy, and creating the string could probably be deferred so that you work directly from the buffer, only decoding when you need to (XML parsers, for example, also work from ByteArrays or buffers). There are other array copies that are simply not needed, but they are more convenient for the code.
It also might have a bug, in that it checks for newlines without considering that the character might mean something different if it is actually part of a multi-byte sequence. It would have to look back a few extra characters to check this for variable-length encodings, and I don't see it doing that.
So instead of a nice, forward-only, heavily buffered sequential read of the file, which is the fastest thing you can do on your file system, you are doing random reads of one block at a time. It should at least read multiple disk blocks so that it can use the forward momentum (setting the block size to some multiple of your disk block size will help) and also reduce the number of "left over" copies made at buffer boundaries.
There are probably faster approaches, but none of them will be as fast as reading the file in forward order.
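As a first, cheap experiment, recent versions of Commons IO have a ReversedLinesFileReader constructor that also takes a block size, so each backward read can at least cover several disk blocks. A minimal Java sketch, assuming a recent Commons IO version; the file name, the 32 * 4096 block size, and the explicit UTF-8 charset are placeholders to tune for your setup:
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.input.ReversedLinesFileReader;

public class ReverseReadWithBlockSize {
    public static void main(String[] args) throws Exception {
        // Block size is a multiple of a typical 4096-byte file system block (an assumption; tune it).
        try (ReversedLinesFileReader reader = new ReversedLinesFileReader(
                new File("very-large-file.xml"), 32 * 4096, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // process the line
            }
        }
    }
}
This does not change the fundamentally backward access pattern; it only reduces the number of small reads and buffer-boundary copies.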
UPDATE:
OK, so I tried an experiment with a rather silly version that processes around 27G of data by reading the first 10 million lines from the Wikidata JSON dump and reversing those lines.
Timings on my 2015 MacBook Pro (with all my dev stuff and many Chrome windows open, eating memory and some CPU all the time; about 5G of total memory is free; VM size is the default with no parameters set at all; not run under a debugger):
reading in reverse order: 244,648 ms = 244 secs = 4 min 4 secs
reading in forward order: 77,564 ms = 77 secs = 1 min 17 secs
temp file count: 201
approx char count: 29,483,478,770 (line content not including line endings)
total line count: 10,050,000
The algorithm is to read the original file by lines, buffering 50,000 lines at a time and writing the lines in reverse order to a numbered temp file. Then, after all the files are written, they are read in reverse numerical order, forwards, line by line. Basically it divides the original into fragments in reverse sort order. It could be optimized, because this is the most naive version of that algorithm with no tuning, but it does do what file systems do best: sequential reads and sequential writes with good-sized buffers.
So this is a lot faster than the one you were using, and it could be tuned from here to be more efficient. You could trade CPU for disk I/O size and try using gzipped files as well, maybe with a two-threaded model that gzips the next buffer while processing the previous one. Fewer string allocations, checking each file function to make sure nothing extra is going on, making sure there is no double buffering, and more.
The ugly but functional code is:
package com.stackoverflow.reversefile

import java.io.File
import java.util.*

fun main(args: Array<String>) {
    val maxBufferSize = 50000
    val lineBuffer = ArrayList<String>(maxBufferSize)
    val tempFiles = ArrayList<File>()
    val originalFile = File("/data/wikidata/20150629.json")
    val tempFilePrefix = "/data/wikidata/temp/temp"
    val maxLines = 10000000

    var approxCharCount: Long = 0
    var tempFileCount = 0
    var lineCount = 0

    val startTime = System.currentTimeMillis()

    println("Writing reversed partial files...")

    try {
        fun flush() {
            val bufferSize = lineBuffer.size
            if (bufferSize > 0) {
                lineCount += bufferSize
                tempFileCount++
                File("$tempFilePrefix-$tempFileCount").apply {
                    bufferedWriter().use { writer ->
                        ((bufferSize - 1) downTo 0).forEach { idx ->
                            writer.write(lineBuffer[idx])
                            writer.newLine()
                        }
                    }
                    tempFiles.add(this)
                }
                lineBuffer.clear()
            }
            println(" flushed at $lineCount lines")
        }

        // read and break into backward sorted chunks
        originalFile.bufferedReader(bufferSize = 4096 * 32)
                .lineSequence()
                .takeWhile { lineCount <= maxLines }.forEach { line ->
                    lineBuffer.add(line)
                    if (lineBuffer.size >= maxBufferSize) flush()
                }
        flush()

        // read backward sorted chunks backwards
        println("Reading reversed lines ...")
        tempFiles.reversed().forEach { tempFile ->
            tempFile.bufferedReader(bufferSize = 4096 * 32).lineSequence()
                    .forEach { line ->
                        approxCharCount += line.length
                        // a line has been read here
                    }
            println(" file $tempFile current char total $approxCharCount")
        }
    } finally {
        tempFiles.forEach { it.delete() }
    }

    val elapsed = System.currentTimeMillis() - startTime

    println("temp file count: $tempFileCount")
    println("approx char count: $approxCharCount")
    println("total line count: $lineCount")
    println()
    println("Elapsed: ${elapsed}ms ${elapsed / 1000}secs ${elapsed / 1000 / 60}min ")

    println("reading original file again:")
    val againStartTime = System.currentTimeMillis()
    var againLineCount = 0
    originalFile.bufferedReader(bufferSize = 4096 * 32)
            .lineSequence()
            .takeWhile { againLineCount <= maxLines }
            .forEach { againLineCount++ }
    val againElapsed = System.currentTimeMillis() - againStartTime
    println("Elapsed: ${againElapsed}ms ${againElapsed / 1000}secs ${againElapsed / 1000 / 60}min ")
}

The correct way to investigate this problem would be:
Write a version of this test in pure Java.
Benchmark it to make sure that the performance problem is still there.
Profile it to figure out where the performance bottleneck is.
Q: Am I using ReversedLinesFileReader in the correct way?
Yes. (Assuming that it is an appropriate thing to use a line reader at all. That depends on what it is you are really trying to do. For instance, if you just wanted to count lines backwards, then you should be reading 1 character at a time and counting the newline sequences.)
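For the line-counting aside above, the direction you read in does not change the count, so a plain forward, buffered, byte-at-a-time scan is enough. A hedged Java sketch (the path is a placeholder; counting the 0x0A byte is safe for UTF-8 because continuation bytes are always >= 0x80):
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

final class LineCounter {
    // Count newline bytes in a single forward pass through a buffered stream.
    static long countLines(String path) throws IOException {
        long count = 0;
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(path), 1 << 16)) {
            int b;
            while ((b = in.read()) != -1) {
                if (b == '\n') {
                    count++;
                }
            }
        }
        return count;
    }
}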
Q: I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
Possibly. Reading a file in reverse means that the read-ahead strategies used by the OS to give fast I/O may not work. It could be interacting with the characteristics of an SSD.
Q: Is it just the case that files should not be read from the end to the start?
Possibly. See above.
The other thing that you have not considered is that your file could actually contain some extremely long lines. The bottleneck could be the assembly of the characters into (long) lines.
Looking at the source code, it would seem that there is potential for O(N^2) behavior when lines are very long. The critical part is (I think) in the way that "rollover" is handled by FilePart. Note the way that the "left over" data gets copied.
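To make the potential O(N^2) concrete, here is a hypothetical illustration in Java of the copy pattern described above (this is not the Commons IO source, and blocksReadBackwards is an invented name): if every block read backwards is merged with the accumulated "left over" bytes by allocating a new array and copying both, a single line spanning k blocks costs 1 + 2 + ... + k block copies.
import java.util.List;

final class LeftOverCopyCost {
    // Each iteration copies everything accumulated so far, so the total work
    // grows quadratically with the number of blocks a single line spans.
    static byte[] accumulate(List<byte[]> blocksReadBackwards) {
        byte[] leftOver = new byte[0];
        for (byte[] block : blocksReadBackwards) {
            byte[] merged = new byte[block.length + leftOver.length];
            System.arraycopy(block, 0, merged, 0, block.length);
            System.arraycopy(leftOver, 0, merged, block.length, leftOver.length);
            leftOver = merged; // keeps growing until a line terminator is found
        }
        return leftOver;
    }
}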

Related

Iterating massive CSVs for comparisons

I have two very large CSV files that will only continue to get larger with time. The documents I'm using to test are 170 columns wide and roughly 57,000 rows. This is using data from 2018 to now; ideally the end result will be able to run on CSVs with data going as far back as 2008, which will result in the CSVs being massive.
Currently I'm using Univocity, but the creator has been inactive on answering questions for quite some time and their website has been down for weeks, so I'm open to changing parsers if need be.
Right now I have the following code:
public void test() {
    CsvParserSettings parserSettings = new CsvParserSettings();
    parserSettings.setLineSeparatorDetectionEnabled(true);
    parserSettings.setHeaderExtractionEnabled(false);

    CsvParser sourceParser = new CsvParser(parserSettings);
    sourceParser.beginParsing(sourceFile);

    Writer writer = new OutputStreamWriter(new FileOutputStream(outputPath), StandardCharsets.UTF_8);
    CsvWriterSettings writerSettings = new CsvWriterSettings();
    CsvWriter csvWriter = new CsvWriter(writer, writerSettings);
    csvWriter.writeRow(headers);

    String[] sourceRow;
    String[] compareRow;

    while ((sourceRow = sourceParser.parseNext()) != null) {
        CsvParser compareParser = new CsvParser(parserSettings);
        compareParser.beginParsing(Path.of("src/test/resources/" + compareCsv + ".csv").toFile());
        while ((compareRow = compareParser.parseNext()) != null) {
            if (Arrays.equals(sourceRow, compareRow)) {
                break;
            } else {
                if (compareRow[KEY_A].trim().equals(sourceRow[KEY_A].trim()) &&
                        compareRow[KEY_B].trim().equals(sourceRow[KEY_B].trim()) &&
                        compareRow[KEY_C].trim().equals(sourceRow[KEY_C].trim())) {
                    for (String[] result : getOnlyDifferentValues(sourceRow, compareRow)) {
                        csvWriter.writeRow(result);
                    }
                    break;
                }
            }
        }
        compareParser.stopParsing();
    }
}
This all works exactly as I need it to, but of course as you can obviously tell it takes forever. I'm stopping and restarting the parsing of the compare file because order is not guaranteed in these files, so what is in row 1 in the source CSV could be in row 52,000 in the compare CSV.
The Question:
How do I get this faster? Here are my requirements:
Print row under following conditions:
KEY_A, KEY_B, KEY_C are equal but any other column is not equal
Source row is not found in compare CSV
Compare row is not found in source CSV
Presently I only have the first requirement working, but I need to tackle the speed issue first and foremost. Also, if I try to parse the file into memory I immediately run out of heap space and the application laughs at me.
Thanks in advance.
Also, if I try to parse the file into memory I immediately run out of heap space
Have you tried increasing the heap size? You don't say how large your data file is, but 57000 rows * 170 columns * 100 bytes per cell = 1 GB, which should pose no difficulty on modern hardware. Then you can keep the comparison file in a HashMap for efficient lookup by key.
Alternatively, you could import the CSVs into a database and make use of its join algorithms.
Or, if you'd rather reinvent the wheel while scrupulously avoiding memory use, you could first sort the CSVs (by partitioning them into sets small enough to sort in memory, and then doing a k-way merge to merge the sublists), and then do a merge join. But the other solutions are likely to be a lot easier to implement :-)
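A minimal sketch of the first (in-memory HashMap) approach, assuming the heap has been sized to hold the compare file and that KEY_A/KEY_B/KEY_C identify a row uniquely; the file names and key column indexes are placeholders, and only Univocity calls already shown in the question are used:
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.File;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class CsvCompareSketch {
    static final int KEY_A = 0, KEY_B = 1, KEY_C = 2; // placeholder column indexes

    static String keyOf(String[] row) {
        return row[KEY_A].trim() + "\u0000" + row[KEY_B].trim() + "\u0000" + row[KEY_C].trim();
    }

    public static void main(String[] args) {
        CsvParserSettings settings = new CsvParserSettings();
        settings.setLineSeparatorDetectionEnabled(true);
        settings.setHeaderExtractionEnabled(false);

        // Pass 1: load the compare file once, keyed by the three key columns.
        Map<String, String[]> compareByKey = new HashMap<>();
        CsvParser compareParser = new CsvParser(settings);
        compareParser.beginParsing(new File("compare.csv"));
        String[] row;
        while ((row = compareParser.parseNext()) != null) {
            compareByKey.put(keyOf(row), row);
        }

        // Pass 2: stream the source file and look each row up in O(1).
        CsvParser sourceParser = new CsvParser(settings);
        sourceParser.beginParsing(new File("source.csv"));
        String[] sourceRow;
        while ((sourceRow = sourceParser.parseNext()) != null) {
            String[] compareRow = compareByKey.remove(keyOf(sourceRow));
            if (compareRow == null) {
                // requirement: source row not found in the compare CSV
            } else if (!Arrays.equals(sourceRow, compareRow)) {
                // requirement: keys equal but some other column differs
            }
        }

        // Whatever is still in compareByKey was never matched by a source row
        // (the third requirement).
    }
}
Each file is read exactly once, so the quadratic re-parsing of the compare file disappears.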

Is there any hash function which has the following properties

I want a hash function which is fast, collision resistant and can give unique output. The primary requirement is that it should be persistable, i.e. its progress (hashing progress) can be saved to a file and later resumed. You can also provide your own implementation in Python.
Implementations in "other languages" are also accepted if it is possible to use them with Python without getting my hands dirty in the internals.
Thanks in advance :)
Because of the pigeonhole principle no hash function can generate hashes which are unique / collision-proof. A good hashing function is collision-resistant, and makes it difficult to generate a file that produces a specified hash. Designing a good hash function is an advanced topic, and I'm certainly no expert in that field. However, since my code is based on sha256 it should be fairly collision-resistant, and hopefully it's also difficult to generate a file that produces a specified hash, but I can make no guarantees in that regard.
Here's a resumable hash function based on sha256 which is fairly fast. It takes about 44 seconds to hash a 1.4GB file on my 2GHz machine with 2GB of RAM.
persistent_hash.py
#! /usr/bin/env python

''' Use SHA-256 to make a resumable hash function

The file is divided into fixed-sized chunks, which are hashed separately.
The hash of each chunk is combined into a hash for the whole file.

The hashing process may be interrupted by Control-C (SIGINT) or SIGTERM.
When a signal is received, hashing continues until the end of the
current chunk, then the file position and current hex digest is saved
to a file. The name of this file is formed by appending '.hash' to the
name of the file being hashed.

Just re-run the program to resume hashing. The '.hash' file will be deleted
once hashing is completed.

Written by PM 2Ring 2014.11.11
'''

import sys
import os
import hashlib
import signal

quit = False
blocksize = 1<<16 # 64kB
blocksperchunk = 1<<10
chunksize = blocksize * blocksperchunk

def handler(signum, frame):
    global quit
    print "\nGot signal %d, cleaning up." % signum
    quit = True

def do_hash(fname):
    hashname = fname + '.hash'
    if os.path.exists(hashname):
        with open(hashname, 'rt') as f:
            data = f.read().split()
            pos = int(data[0])
            current = data[1].decode('hex')
    else:
        pos = 0
        current = ''

    finished = False
    with open(fname, 'rb') as f:
        f.seek(pos)
        while not (quit or finished):
            full = hashlib.sha256(current)
            part = hashlib.sha256()
            for _ in xrange(blocksperchunk):
                block = f.read(blocksize)
                if block == '':
                    finished = True
                    break
                part.update(block)
            full.update(part.digest())
            current = full.digest()
            pos += chunksize
            print pos
            if finished or quit:
                break

    hexdigest = full.hexdigest()
    if quit:
        with open(hashname, 'wt') as f:
            f.write("%d %s\n" % (pos, hexdigest))
    elif os.path.exists(hashname):
        os.remove(hashname)

    return (not quit), pos, hexdigest

def main():
    if len(sys.argv) != 2:
        print "Calculate resumable hash of a file."
        print "Usage:\npython %s filename\n" % sys.argv[0]
        exit(1)

    fname = sys.argv[1]
    signal.signal(signal.SIGINT, handler)
    signal.signal(signal.SIGTERM, handler)
    print do_hash(fname)

if __name__ == '__main__':
    main()

fastest packing of data in Python (and Java)

(Sometimes our host is wrong; nanoseconds matter ;)
I have a Python Twisted server that talks to some Java servers, and profiling shows it spending ~30% of its runtime in the JSON encoder/decoder; its job is handling thousands of messages per second.
This talk by YouTube raises interesting, applicable points:
Serialization formats - no matter which one you use, they are all expensive. Measure. Don’t use pickle. Not a good choice. Found protocol buffers slow. They wrote their own BSON implementation which is 10-15 times faster than the one you can download.
You have to measure. Vitess swapped out one of its protocols for an HTTP implementation. Even though it was in C it was slow. So they ripped out HTTP and did a direct socket call using Python and that was 8% cheaper on global CPU. The enveloping for HTTP is really expensive.
Measurement. In Python measurement is like reading tea leaves. There’s a lot of things in Python that are counter-intuitive, like the cost of garbage collection. Most chunks of their apps spend their time serializing. Profiling serialization is very dependent on what you are putting in. Serializing ints is very different than serializing big blobs.
Anyway, I control both the Python and Java ends of my message-passing API and can pick a different serialisation than JSON.
My messages look like:
a variable number of longs; anywhere between 1 and 10K of them
and two already-UTF8 text strings; both between 1 and 3KB
Because I am reading them from a socket, I want libraries that can cope gracefully with streams - it's irritating if a library doesn't tell me how much of a buffer it consumed, for example.
The other end of this stream is a Java server, of course; I don't want to pick something that is great for the Python end but moves problems to the Java end, e.g. poor performance or a torturous or flaky API.
I will obviously be doing my own profiling. I ask here in the hope you describe approaches I wouldn't think of e.g. using struct and what the fastest kind of strings/buffers are.
Some simple test code gives surprising results:
import time, random, struct, json, sys, pickle, cPickle, marshal, array

def encode_json_1(*args):
    return json.dumps(args)

def encode_json_2(longs,str1,str2):
    return json.dumps({"longs":longs,"str1":str1,"str2":str2})

def encode_pickle(*args):
    return pickle.dumps(args)

def encode_cPickle(*args):
    return cPickle.dumps(args)

def encode_marshal(*args):
    return marshal.dumps(args)

def encode_struct_1(longs,str1,str2):
    return struct.pack(">iii%dq"%len(longs),len(longs),len(str1),len(str2),*longs)+str1+str2

def decode_struct_1(s):
    i, j, k = struct.unpack(">iii",s[:12])
    assert len(s) == 3*4 + 8*i + j + k, (len(s),3*4 + 8*i + j + k)
    longs = struct.unpack(">%dq"%i,s[12:12+i*8])
    str1 = s[12+i*8:12+i*8+j]
    str2 = s[12+i*8+j:]
    return (longs,str1,str2)

struct_header_2 = struct.Struct(">iii")

def encode_struct_2(longs,str1,str2):
    return "".join((
        struct_header_2.pack(len(longs),len(str1),len(str2)),
        array.array("L",longs).tostring(),
        str1,
        str2))

def decode_struct_2(s):
    i, j, k = struct_header_2.unpack(s[:12])
    assert len(s) == 3*4 + 8*i + j + k, (len(s),3*4 + 8*i + j + k)
    longs = array.array("L")
    longs.fromstring(s[12:12+i*8])
    str1 = s[12+i*8:12+i*8+j]
    str2 = s[12+i*8+j:]
    return (longs,str1,str2)

def encode_ujson(*args):
    return ujson.dumps(args)

def encode_msgpack(*args):
    return msgpacker.pack(args)

def decode_msgpack(s):
    msgunpacker.feed(s)
    return msgunpacker.unpack()

def encode_bson(longs,str1,str2):
    return bson.dumps({"longs":longs,"str1":str1,"str2":str2})

def from_dict(d):
    return [d["longs"],d["str1"],d["str2"]]

tests = [ #(encode,decode,massage_for_check)
    (encode_struct_1,decode_struct_1,None),
    (encode_struct_2,decode_struct_2,None),
    (encode_json_1,json.loads,None),
    (encode_json_2,json.loads,from_dict),
    (encode_pickle,pickle.loads,None),
    (encode_cPickle,cPickle.loads,None),
    (encode_marshal,marshal.loads,None)]

try:
    import ujson
    tests.append((encode_ujson,ujson.loads,None))
except ImportError:
    print "no ujson support installed"

try:
    import msgpack
    msgpacker = msgpack.Packer()
    msgunpacker = msgpack.Unpacker()
    tests.append((encode_msgpack,decode_msgpack,None))
except ImportError:
    print "no msgpack support installed"

try:
    import bson
    tests.append((encode_bson,bson.loads,from_dict))
except ImportError:
    print "no BSON support installed"

longs = [i for i in xrange(10000)]
str1 = "1"*5000
str2 = "2"*5000
random.seed(1)
encode_data = [[
    longs[:random.randint(2,len(longs))],
    str1[:random.randint(2,len(str1))],
    str2[:random.randint(2,len(str2))]] for i in xrange(1000)]

for encoder,decoder,massage_before_check in tests:
    # do the encoding
    start = time.time()
    encoded = [encoder(i,j,k) for i,j,k in encode_data]
    encoding = time.time()
    print encoder.__name__, "encoding took %0.4f,"%(encoding-start),
    sys.stdout.flush()
    # do the decoding
    decoded = [decoder(e) for e in encoded]
    decoding = time.time()
    print "decoding %0.4f"%(decoding-encoding)
    sys.stdout.flush()
    # check it
    if massage_before_check:
        decoded = [massage_before_check(d) for d in decoded]
    for i,((longs_a,str1_a,str2_a),(longs_b,str1_b,str2_b)) in enumerate(zip(encode_data,decoded)):
        assert longs_a == list(longs_b), (i,longs_a,longs_b)
        assert str1_a == str1_b, (i,str1_a,str1_b)
        assert str2_a == str2_b, (i,str2_a,str2_b)
gives:
encode_struct_1 encoding took 0.4486, decoding 0.3313
encode_struct_2 encoding took 0.3202, decoding 0.1082
encode_json_1 encoding took 0.6333, decoding 0.6718
encode_json_2 encoding took 0.5740, decoding 0.8362
encode_pickle encoding took 8.1587, decoding 9.5980
encode_cPickle encoding took 1.1246, decoding 1.4436
encode_marshal encoding took 0.1144, decoding 0.3541
encode_ujson encoding took 0.2768, decoding 0.4773
encode_msgpack encoding took 0.1386, decoding 0.2374
encode_bson encoding took 55.5861, decoding 29.3953
bson, msgpack and ujson all installed via easy_install
I would love to be shown I'm doing it wrong; that I should be using cStringIO interfaces or however else you speed it all up!
There must be a way to serialise this data that is an order of magnitude faster surely?
While JSON is flexible, it is one of the slowest serialization formats in Java (and possibly Python as well). When nanoseconds matter, I would use a binary format in native byte order (likely to be little endian).
Here is a library where I do exactly that: AbstractExcerpt and UnsafeExcerpt. A typical message takes 50 to 200 ns to serialize and send, or to read and deserialize.
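For reference, here is a hedged sketch of what the Java end of a fixed binary layout could look like, mirroring the encode_struct_1 layout from the question (three big-endian ints for the counts, then the longs, then the raw bytes of the two already-UTF-8 strings); the class and method names are illustrative, and you could call order(ByteOrder.nativeOrder()) on both ends if you prefer native byte order as suggested above:
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public final class MessageCodec {
    static byte[] encode(long[] longs, byte[] str1Utf8, byte[] str2Utf8) {
        ByteBuffer buf = ByteBuffer.allocate(12 + 8 * longs.length + str1Utf8.length + str2Utf8.length);
        buf.putInt(longs.length).putInt(str1Utf8.length).putInt(str2Utf8.length);
        for (long l : longs) {
            buf.putLong(l);
        }
        buf.put(str1Utf8).put(str2Utf8);
        return buf.array();
    }

    static Object[] decode(byte[] message) {
        ByteBuffer buf = ByteBuffer.wrap(message);
        long[] longs = new long[buf.getInt()];
        byte[] str1 = new byte[buf.getInt()];
        byte[] str2 = new byte[buf.getInt()];
        for (int i = 0; i < longs.length; i++) {
            longs[i] = buf.getLong();
        }
        buf.get(str1);
        buf.get(str2);
        return new Object[] {
            longs,
            new String(str1, StandardCharsets.UTF_8),
            new String(str2, StandardCharsets.UTF_8)
        };
    }
}
The length prefix also answers the streaming concern: after reading the 12-byte header you know exactly how many more bytes the message occupies.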
In the end, we chose to use msgpack.
If you go with JSON, your choice of library on Python and Java is critical to performance:
On Java, http://blog.juicehub.com/2012/11/20/benchmarking-web-frameworks-for-games/ says:
Performance was absolutely atrocious until we swapped out the JSON lib (json-simple) for Jackson's ObjectMapper. This brought RPS from 35 to 300+ – a 10x increase
You may be able to speed up the struct case
def encode_struct(longs,str1,str2):
    return struct.pack(">iii%dq"%len(longs),len(longs),len(str1),len(str2),*longs)+str1+str2
Try using the Python array module and its tostring method to convert your longs into a binary string. Then you can append it like you did with the strings.
Create a struct.Struct object and use that. I believe it's more efficient
You can also look into:
http://docs.python.org/library/xdrlib.html#module-xdrlib
Your fastest method encodes 1000 elements in .1222 seconds. That's 1 element in .1222 milliseconds. That's pretty fast. I doubt you'll do much better without switching languages.
I know this is an old question, but it is still interesting. My most recent choice was to use Cap’n Proto, written by the same guy who did protobuf for Google. In my case, that led to a decrease in both time and volume of around 20% compared to Jackson's JSON encoder/decoder (server to server, Java on both sides).
Protocol Buffers are pretty fast and have bindings for both Java and Python. It's quite a popular library and used inside Google so it should be tested and optimized quite well.
Since the data you're sending is already well defined, non-recursive, and non-nested, why not just use a simple delimited string? You just need a delimiter that isn't contained in your string variables, maybe '\n'.
"10\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\nFoo Foo Foo\nBar Bar Bar"
Then just use a simple String Split method.
String[] temp = str.split("\n");
ArrayList<Long> longs = new ArrayList<Long>(Integer.parseInt(temp[0]));
String string1 = temp[temp.length - 2];
String string2 = temp[temp.length - 1];
for (int i = 1; i < temp.length - 2; i++)
    longs.add(Long.parseLong(temp[i]));
note The above was written in the web browser and untested so syntax errors may exist.
For a text-based approach, I'd assume the above is the fastest method.

Sorting a 100MB XML file with Java?

How long does sorting a 100MB XML file with Java take ?
The file has items with the following structure and I need to sort them by event
<doc>
<id>84141123</id>
<title>kk+ at Hippie Camp</title>
<description>photo by SFP</description>
<time>18945840</time>
<tags>elphinstone tribalharmonix vancouver intention intention7 newyears hippiecamp bc sunshinecoast woowoo kk kriskrug sunglasses smoking unibomber møtleykrüg </tags>
<geo></geo>
<event>47409</event>
</doc>
I'm on an Intel Core Duo with 4GB RAM.
Minutes ? Hours ?
thanks
Here are the timings for a similar task executed using Saxon XQuery on a 100MB input file.
Saxon-EE 9.3.0.4J from Saxonica
Java version 1.6.0_20
Analyzing query from {for $i in //item order by location return $i}
Analysis time: 195 milliseconds
Processing file:/e:/javalib/xmark/xmark100.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/e:/javalib/xmark/xmark100.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 6158 milliseconds
Tree size: 4787932 nodes, 79425460 characters, 381878 attributes
Execution time: 3.466s (3466ms)
Memory used: 471679816
So: about 6 seconds for parsing the input file and building a tree, 3.5 seconds for sorting it. That's invoked from the command line, but invoking it from Java will get very similar performance. Don't try to code the sort yourself - it's only a one-line query, and you are very unlikely to match the performance of an optimized XQuery engine.
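A hedged sketch of what that command-line run looks like when invoked from Java via Saxon's s9api, adapted to the <doc>/<event> structure in the question rather than the xmark query above; the element path, file names, and the xs:integer cast are assumptions about your data:
import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.Serializer;
import net.sf.saxon.s9api.XQueryCompiler;
import net.sf.saxon.s9api.XQueryEvaluator;

public class SortWithSaxon {
    public static void main(String[] args) throws Exception {
        Processor processor = new Processor(false); // false = no licensed (EE) features required
        XQueryCompiler compiler = processor.newXQueryCompiler();
        // Sort the <doc> elements by their <event> child and wrap them in a new root.
        XQueryEvaluator query = compiler
                .compile("<docs>{ for $d in /*/doc order by xs:integer($d/event) return $d }</docs>")
                .load();
        query.setSource(new StreamSource(new File("input.xml")));
        Serializer out = processor.newSerializer(new File("sorted.xml"));
        query.run(out);
    }
}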
I would say minutes - you should be able to do that completely in memory, so with a SAX parser it would be reading-sorting-writing; that should not be a problem for your hardware.
I think a problem like this would be better sorted using serialisation.
Deserialise the XML file into an ArrayList of 'doc'.
Using straight Java code, apply a sort on the event attribute and store the sorted ArrayList in another variable (a sketch follows this list).
Serialise out the sorted 'doc' ArrayList to file
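A minimal Java sketch of step 2, assuming the file has already been deserialised into an in-memory list; the Doc class and its field names are hypothetical stand-ins for whatever your deserialisation produces:
import java.util.Comparator;
import java.util.List;

final class SortDocsByEvent {
    // Hypothetical deserialised form of one <doc> element.
    static final class Doc {
        long id;
        long event;
        String rawXml; // everything needed to write the element back out
    }

    // Step 2: sort the in-memory list by the event value.
    static void sortByEvent(List<Doc> docs) {
        docs.sort(Comparator.comparingLong((Doc d) -> d.event));
    }
}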
If you do it in memory, you should be able to do this in under 10 seconds. You would be pushing it to do this in under 2 seconds, because it will spend that much time reading/writing to disk.
This program should use no more than 4-5x the original file size, about 500 MB in your case.
String[] records = FileUtils.readFileToString(new File("my-file.xml")).split("</?doc>");

Map<Long, String> recordMap = new TreeMap<Long, String>();
for (int i = 1; i < records.length; i += 2) {
    String record = records[i];
    int pos1 = record.indexOf("<id>");
    int pos2 = record.indexOf("</id>", pos1 + 4);
    long num = Long.parseLong(record.substring(pos1 + 4, pos2));
    recordMap.put(num, record);
}

StringBuilder sb = new StringBuilder(records[0]);
for (String s : recordMap.values()) {
    sb.append("<doc>").append(s).append("</doc>");
}
sb.append(records[records.length - 1]);

FileUtils.writeStringToFile(new File("my-output-file.xml"), sb.toString());

Escaping large number of characters for display on XHTML web page via Java

I have an embedded device which runs Java applications which can among other things serve up XHTML web pages (I could write the pages as something other than XHTML, but I'm aiming for that for now).
When a request for a web page handled by my application is received a method is called in my code with all the information on the request including an output stream to display the page.
On one of my pages I would like to display a (log) file, which can be up to 1 MB in size.
I can display this file unescaped using the following code:
final PrintWriter writer; // Is initialized to a PrintWriter writing to the output stream.
final FileInputStream fis = new FileInputStream(file);
final InputStreamReader inputStreamReader = new InputStreamReader(fis);
try {
    writer.println("<div id=\"log\" style=\"white-space: pre-wrap; word-wrap: break-word\">");
    writer.println(" <pre>");
    int length;
    char[] buffer = new char[1024];
    while ((length = inputStreamReader.read(buffer)) != -1) {
        writer.write(buffer, 0, length);
    }
    writer.println(" </pre>");
    writer.println("</div>");
} finally {
    if (inputStreamReader != null) {
        inputStreamReader.close();
    }
}
This works reasonably well, and displays the entire file within a second or two (an acceptable timeframe).
This file can (and in practice, does) contain characters which are invalid in XHTML, most commonly < and >. So I need to find a way to escape these characters.
The first thing I tried was a CDATA section, but as documented here they do not display correctly in IE8.
The second thing I tried was a method like the following:
// Based on code: https://stackoverflow.com/questions/439298/best-way-to-encode-text-data-for-xml-in-java/440296#440296
// Modified to write directly to the stream to avoid creating extra objects.
private static void writeXmlEscaped(PrintWriter writer, char[] buffer, int offset, int length) {
    for (int i = offset; i < length; i++) {
        char ch = buffer[i];
        boolean controlCharacter = ch < 32;
        boolean unicodeButNotAscii = ch > 126;
        boolean characterWithSpecialMeaningInXML = ch == '<' || ch == '&' || ch == '>';

        if (characterWithSpecialMeaningInXML || unicodeButNotAscii || controlCharacter) {
            writer.write("&#" + (int) ch + ";");
        } else {
            writer.write(ch);
        }
    }
}
This correctly escapes the characters (I was going to expand it to escape HTML invalid characters if needed), but the web page then takes 15+ seconds to display and other resources on the page (images, css stylesheet) intermittently fail to load (I believe due to the requests for them timing out because the processor is pegged).
I've tried using a BufferedWriter in front of the PrintWriter as well as changing the buffer size (both for reading the file and for the BufferedWriter) in various ways, with no improvement.
Is there a way to escape all XHTML-invalid characters that does not require iterating over every single character in the stream? Failing that, is there a way to speed up my code enough to display these files within a couple of seconds?
I'll consider reducing the size of the log files if I have to, but I was hoping to make them at least 250-500 KB in size (with 1 MB being ideal).
I already have a method to simply download the log files, but I would like to display them in browser as well for simple troubleshooting/perusal.
If there's a way to set the headers so that IE8/Firefox will simply display the file in browser as a text file I would consider that as an alternative (and have an entire page dedicated to the file with no XHTML of any kind).
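On that last point, the essential part is just sending a Content-Type: text/plain header and streaming the bytes. A hedged sketch using a servlet-style response object, purely to show the shape; your embedded stack's API for setting headers will differ, and both the HttpServletResponse class and the UTF-8 charset are assumptions here:
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.servlet.http.HttpServletResponse;

final class PlainTextLogServing {
    // Serve the log as text/plain so IE8/Firefox render it without any XHTML escaping.
    static void serveAsPlainText(Path logFile, HttpServletResponse response) throws IOException {
        response.setContentType("text/plain");
        response.setCharacterEncoding("UTF-8");
        response.setContentLength((int) Files.size(logFile));
        try (OutputStream out = response.getOutputStream();
                InputStream in = Files.newInputStream(logFile)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}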
EDIT:
After making the change suggested by Cameron Skinner and performance testing it looks like the escaped writing takes about 1.5-2x as long as the block-written version. It's not nothing, but I'm probably not going to be able to get a huge speedup by messing with it.
I may just need to reduce the max size of the log file.
One small change that will (well, might) significantly increase the speed is to change
writer.write("&#" + (int) ch + ";");
to
writer.write("&#");
writer.print((int) ch);
writer.write(";");
String concatenation is extremely expensive as Java allocates a new temporary string buffer for each + operator, so you are generating two temporary buffers each time there is a character that needs replacing.
EDIT: One of the comments on another answer is highly relevant: find where the slow bit is first. I'd suggest testing logs that have no characters to be escaped and many characters to be escaped.
I think you should make the suggested change anyway because it costs you only a few seconds of your time.
You can try StringEscapeUtils from commons-lang:
StringEscapeUtils.escapeHtml(writer, string);
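Wrapping that in the question's streaming loop might look like the following hedged sketch; reading line by line with a BufferedReader and the UTF-8 charset are my assumptions, and escapeHtml(Writer, String) is the commons-lang 2.x overload used above:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import org.apache.commons.lang.StringEscapeUtils;

final class EscapedLogWriter {
    // Stream the log through commons-lang's escaper one line at a time instead of
    // escaping character by character in your own loop.
    static void writeEscaped(File logFile, PrintWriter writer) throws IOException {
        writer.println("<pre>");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(logFile), "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                StringEscapeUtils.escapeHtml(writer, line);
                writer.println();
            }
        }
        writer.println("</pre>");
    }
}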
One option is for you to serve up the log contents inside of an iframe hosted inside of your web page. The iframe's source could point to a URL that serves up the content as text.
