fastest packing of data in Python (and Java) - java

(Sometimes our host is wrong; nanoseconds matter ;)
I have a Python Twisted server that talks to some Java servers, and profiling shows it spending ~30% of its runtime in the JSON encoder/decoder; its job is handling thousands of messages per second.
This talk by YouTube raises interesting, applicable points:
Serialization formats - no matter which one you use, they are all
expensive. Measure. Don't use pickle. Not a good choice. Found
protocol buffers slow. They wrote their own BSON implementation which
is 10-15 times faster than the one you can download.
You have to measure. Vitess swapped out one of its protocols for an HTTP
implementation. Even though it was in C it was slow. So they ripped
out HTTP and did a direct socket call using Python, and that was 8%
cheaper on global CPU. The enveloping for HTTP is really expensive.
Measurement. In Python, measurement is like reading tea leaves.
There's a lot of things in Python that are counter-intuitive, like
the cost of garbage collection. Most chunks of their apps spend
their time serializing. Profiling serialization is very dependent on
what you are putting in. Serializing ints is very different than
serializing big blobs.
Anyway, I control both the Python and Java ends of my message-passing API and can pick a different serialisation than JSON.
My messages look like:
a variable number of longs; anywhere between 1 and 10K of them
and two already-UTF8 text strings; both between 1 and 3KB
Because I am reading them from a socket, I want libraries that can cope gracefully with streams - it's irritating if a library doesn't tell me how much of a buffer it consumed, for example.
The other end of this stream is a Java server, of course; I don't want to pick something that is great for the Python end but moves problems to the Java end, e.g. poor performance or a tortuous or flaky API.
I will obviously be doing my own profiling. I ask here in the hope you describe approaches I wouldn't think of e.g. using struct and what the fastest kind of strings/buffers are.
Some simple test code gives surprising results:
import time, random, struct, json, sys, pickle, cPickle, marshal, array

def encode_json_1(*args):
    return json.dumps(args)

def encode_json_2(longs,str1,str2):
    return json.dumps({"longs":longs,"str1":str1,"str2":str2})

def encode_pickle(*args):
    return pickle.dumps(args)

def encode_cPickle(*args):
    return cPickle.dumps(args)

def encode_marshal(*args):
    return marshal.dumps(args)

def encode_struct_1(longs,str1,str2):
    return struct.pack(">iii%dq"%len(longs),len(longs),len(str1),len(str2),*longs)+str1+str2

def decode_struct_1(s):
    i, j, k = struct.unpack(">iii",s[:12])
    assert len(s) == 3*4 + 8*i + j + k, (len(s),3*4 + 8*i + j + k)
    longs = struct.unpack(">%dq"%i,s[12:12+i*8])
    str1 = s[12+i*8:12+i*8+j]
    str2 = s[12+i*8+j:]
    return (longs,str1,str2)

struct_header_2 = struct.Struct(">iii")

def encode_struct_2(longs,str1,str2):
    return "".join((
        struct_header_2.pack(len(longs),len(str1),len(str2)),
        array.array("L",longs).tostring(),
        str1,
        str2))

def decode_struct_2(s):
    i, j, k = struct_header_2.unpack(s[:12])
    assert len(s) == 3*4 + 8*i + j + k, (len(s),3*4 + 8*i + j + k)
    longs = array.array("L")
    longs.fromstring(s[12:12+i*8])
    str1 = s[12+i*8:12+i*8+j]
    str2 = s[12+i*8+j:]
    return (longs,str1,str2)

def encode_ujson(*args):
    return ujson.dumps(args)

def encode_msgpack(*args):
    return msgpacker.pack(args)

def decode_msgpack(s):
    msgunpacker.feed(s)
    return msgunpacker.unpack()

def encode_bson(longs,str1,str2):
    return bson.dumps({"longs":longs,"str1":str1,"str2":str2})

def from_dict(d):
    return [d["longs"],d["str1"],d["str2"]]

tests = [ #(encode,decode,massage_for_check)
    (encode_struct_1,decode_struct_1,None),
    (encode_struct_2,decode_struct_2,None),
    (encode_json_1,json.loads,None),
    (encode_json_2,json.loads,from_dict),
    (encode_pickle,pickle.loads,None),
    (encode_cPickle,cPickle.loads,None),
    (encode_marshal,marshal.loads,None)]

try:
    import ujson
    tests.append((encode_ujson,ujson.loads,None))
except ImportError:
    print "no ujson support installed"

try:
    import msgpack
    msgpacker = msgpack.Packer()
    msgunpacker = msgpack.Unpacker()
    tests.append((encode_msgpack,decode_msgpack,None))
except ImportError:
    print "no msgpack support installed"

try:
    import bson
    tests.append((encode_bson,bson.loads,from_dict))
except ImportError:
    print "no BSON support installed"

longs = [i for i in xrange(10000)]
str1 = "1"*5000
str2 = "2"*5000
random.seed(1)
encode_data = [[
    longs[:random.randint(2,len(longs))],
    str1[:random.randint(2,len(str1))],
    str2[:random.randint(2,len(str2))]] for i in xrange(1000)]

for encoder,decoder,massage_before_check in tests:
    # do the encoding
    start = time.time()
    encoded = [encoder(i,j,k) for i,j,k in encode_data]
    encoding = time.time()
    print encoder.__name__, "encoding took %0.4f,"%(encoding-start),
    sys.stdout.flush()
    # do the decoding
    decoded = [decoder(e) for e in encoded]
    decoding = time.time()
    print "decoding %0.4f"%(decoding-encoding)
    sys.stdout.flush()
    # check it
    if massage_before_check:
        decoded = [massage_before_check(d) for d in decoded]
    for i,((longs_a,str1_a,str2_a),(longs_b,str1_b,str2_b)) in enumerate(zip(encode_data,decoded)):
        assert longs_a == list(longs_b), (i,longs_a,longs_b)
        assert str1_a == str1_b, (i,str1_a,str1_b)
        assert str2_a == str2_b, (i,str2_a,str2_b)
gives:
encode_struct_1 encoding took 0.4486, decoding 0.3313
encode_struct_2 encoding took 0.3202, decoding 0.1082
encode_json_1 encoding took 0.6333, decoding 0.6718
encode_json_2 encoding took 0.5740, decoding 0.8362
encode_pickle encoding took 8.1587, decoding 9.5980
encode_cPickle encoding took 1.1246, decoding 1.4436
encode_marshal encoding took 0.1144, decoding 0.3541
encode_ujson encoding took 0.2768, decoding 0.4773
encode_msgpack encoding took 0.1386, decoding 0.2374
encode_bson encoding took 55.5861, decoding 29.3953
bson, msgpack and ujson all installed via easy_install
I would love to be shown that I'm doing it wrong - that I should be using cStringIO interfaces or whatever else speeds it all up!
There must be a way to serialise this data that is an order of magnitude faster, surely?

While JSON is flexible, it is one of the slowest serialization formats in Java (possibly Python as well). If nanoseconds matter, I would use a binary format in native byte order (likely to be little-endian).
Here is a library where I do exactly that: AbstractExcerpt and UnsafeExcerpt. A typical message takes 50 to 200 ns to serialize and send, or to read and deserialize.
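For illustration only, here is a minimal plain-ByteBuffer sketch (not the Excerpt classes mentioned above) of how the Java side might read and write a frame with the same layout as encode_struct_1 in the question - three big-endian int lengths, then the longs, then the two already-UTF-8 strings. Switch the buffer to ByteOrder.LITTLE_ENDIAN via buf.order(...) if you prefer native order as suggested here:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BinaryFrame {
    // Encode: header of three big-endian ints (count of longs, byte length of each string),
    // then the longs, then the two UTF-8 strings - the layout used by encode_struct_1 above.
    public static byte[] encode(long[] longs, byte[] str1Utf8, byte[] str2Utf8) {
        ByteBuffer buf = ByteBuffer.allocate(12 + 8 * longs.length + str1Utf8.length + str2Utf8.length);
        buf.putInt(longs.length).putInt(str1Utf8.length).putInt(str2Utf8.length);
        for (long l : longs) {
            buf.putLong(l);
        }
        buf.put(str1Utf8).put(str2Utf8);
        return buf.array();
    }

    // Decode the same frame back into its three parts.
    public static Object[] decode(byte[] frame) {
        ByteBuffer buf = ByteBuffer.wrap(frame);   // ByteBuffer defaults to big-endian
        int longCount = buf.getInt();
        int len1 = buf.getInt();
        int len2 = buf.getInt();
        long[] longs = new long[longCount];
        for (int i = 0; i < longCount; i++) {
            longs[i] = buf.getLong();
        }
        byte[] b1 = new byte[len1];
        byte[] b2 = new byte[len2];
        buf.get(b1).get(b2);
        return new Object[] { longs,
                new String(b1, StandardCharsets.UTF_8),
                new String(b2, StandardCharsets.UTF_8) };
    }
}

Note that whichever byte order you choose has to match on both ends; Python's array.array("L") in encode_struct_2 writes machine-native order and a machine-dependent size, so pin the wire format down explicitly before relying on it from Java.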

In the end, we chose to use msgpack.
If you go with JSON, your choice of library on Python and Java is critical to performance:
On Java, http://blog.juicehub.com/2012/11/20/benchmarking-web-frameworks-for-games/ says:
Performance was absolutely atrocious until we swapped out the JSON lib (json-simple) for Jackson's ObjectMapper. This brought RPS from 35 to 300+ – a 10x increase.
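For context, this is roughly what the Jackson usage referred to above looks like (a sketch, not code from the linked post; the map keys mirror the question's message fields):

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class JacksonRoundTrip {
    // Create the mapper once and reuse it; construction is the expensive part.
    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws Exception {
        Map<String, Object> message = new HashMap<String, Object>();
        message.put("longs", Arrays.asList(1L, 2L, 3L));
        message.put("str1", "first string");
        message.put("str2", "second string");

        String json = MAPPER.writeValueAsString(message);                 // encode

        @SuppressWarnings("unchecked")
        Map<String, Object> decoded = MAPPER.readValue(json, Map.class);  // decode
        System.out.println(decoded.get("str1"));
    }
}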

You may be able to speed up the struct case:
def encode_struct(longs,str1,str2):
    return struct.pack(">iii%dq"%len(longs),len(longs),len(str1),len(str2),*longs)+str1+str2
Try using the Python array module and its tostring method to convert your longs into a binary string. Then you can append it like you did with the strings.
Create a struct.Struct object and use that; I believe it's more efficient.
You can also look into:
http://docs.python.org/library/xdrlib.html#module-xdrlib
Your fastest method encodes 1000 elements in 0.1222 seconds. That's one element in 0.1222 milliseconds. That's pretty fast. I doubt you'll do much better without switching languages.

I know this is an old question, but it is still interesting. My most recent choice was to use Cap'n Proto, written by the same guy who did protobuf for Google. In my case, that led to a decrease of around 20% in both time and volume compared to Jackson's JSON encoder/decoder (server to server, Java on both sides).

Protocol Buffers are pretty fast and have bindings for both Java and Python. It's quite a popular library that is used inside Google, so it should be well tested and optimized.
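As a rough illustration (not code from the question): assuming a schema along the lines of message Msg { repeated int64 longs = 1; string str1 = 2; string str2 = 3; }, the Java class protoc generates - hypothetically called Msg here - is typically used like this:

import java.util.Arrays;

public class ProtobufExample {
    public static void main(String[] args) throws Exception {
        // Msg is the hypothetical class generated by protoc from the schema above.
        Msg message = Msg.newBuilder()
                .addAllLongs(Arrays.asList(1L, 2L, 3L))
                .setStr1("first string")
                .setStr2("second string")
                .build();
        byte[] wire = message.toByteArray();   // send these bytes over the socket

        Msg decoded = Msg.parseFrom(wire);     // and parse them on the other side
        System.out.println(decoded.getLongsCount() + " longs, str1=" + decoded.getStr1());
    }
}

The Python side uses the equivalent generated module, so both ends share one schema definition.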

Since the data you're sending is already well defined, non-recursive, and non-nested, why not just use a simple delimited string? You just need a delimiter that isn't contained in your string variables, maybe '\n'.
"10\n1\n2\n3\n4\n5\n6\n7\n8\n9\n10\nFoo Foo Foo\nBar Bar Bar"
Then just use a simple String split:
String[] temp = str.split("\n");
ArrayList<Long> longs = new ArrayList<Long>(Integer.parseInt(temp[0]));
String string1 = temp[temp.length-2];
String string2 = temp[temp.length-1];
for (int i = 1; i < temp.length-2; i++)
    longs.add(Long.parseLong(temp[i]));
Note: the above was written in the web browser and untested, so syntax errors may exist.
For a text-based approach, I'd assume the above is about the fastest method.

Related

How to get rid of incorrect symbols during Java NIO decoding?

I need to read text from a file and, for instance, print it to the console. The file is in UTF-8. It seems that I'm doing something wrong, because some Russian characters are printed incorrectly. What's wrong with my code?
StringBuilder content = new StringBuilder();
try (FileChannel fChan = (FileChannel) Files.newByteChannel(Paths.get("D:/test.txt"))) {
    ByteBuffer byteBuf = ByteBuffer.allocate(16);
    Charset charset = Charset.forName("UTF-8");
    while (fChan.read(byteBuf) != -1) {
        byteBuf.flip();
        content.append(new String(byteBuf.array(), charset));
        byteBuf.clear();
    }
    System.out.println(content);
}
The result:
Здравствуйте, как поживае��е?
Это п��имер текста на русском яз��ке.ом яз�
The actual text:
Здравствуйте, как поживаете?
Это пример текста на русском языке.
UTF-8 uses a variable number of bytes per character. That gives you a boundary error: you have mixed buffer-based code with byte-array-based code, and you can't do that here. It is possible to read enough bytes to be stuck halfway into a character; you then turn your input into a byte array and convert it, which fails, because you can't convert half a character.
What you really want is either to read ALL the data first and then convert the entire input, or to keep any half-characters in the byte buffer when you flip back, or better yet, to ditch all this and use code that is written to read actual characters. In general, the channel API complicates matters a ton; it's flexible, but complicated - that's how it goes.
Unless you can explain why you need it, don't use it. Do this instead:
Path target = Paths.get("D:/test.txt");
try (var reader = Files.newBufferedReader(target)) {
    // read a line at a time here. Yes, it will be UTF-8 decoded.
}
or better yet, as you apparently want to read the whole thing in one go:
Path target = Paths.get("D:/test.txt");
var content = Files.readString(target);
NB: Unlike most Java methods that convert bytes to chars or vice versa, the Files API defaults to UTF-8 (instead of the useless, dangerous, untestable-bug-causing 'platform default encoding' that most of the Java API uses). That's why this last, incredibly simple code is nevertheless correct.
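If the channel API really is needed, the "keep any half-characters in the buffer" approach mentioned above can be done with a CharsetDecoder, which stops at an incomplete sequence so that compact() carries the leftover bytes into the next read. A minimal sketch (the file path and the deliberately tiny buffers are just placeholders):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Paths;

public class DecodeUtf8FromChannel {
    public static void main(String[] args) throws Exception {
        StringBuilder content = new StringBuilder();
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer byteBuf = ByteBuffer.allocate(16);
        CharBuffer charBuf = CharBuffer.allocate(16);

        try (FileChannel fChan = FileChannel.open(Paths.get("D:/test.txt"))) {
            while (fChan.read(byteBuf) != -1) {
                byteBuf.flip();
                // Decodes only complete characters; trailing bytes of a split
                // character stay in byteBuf, and compact() keeps them around.
                decoder.decode(byteBuf, charBuf, false);
                charBuf.flip();
                content.append(charBuf);
                charBuf.clear();
                byteBuf.compact();
            }
            byteBuf.flip();
            decoder.decode(byteBuf, charBuf, true);   // signal end of input
            decoder.flush(charBuf);
            charBuf.flip();
            content.append(charBuf);
        }
        System.out.println(content);
    }
}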

Attempting to parse a file (mmp format) from legacy software using Python

I have a piece of legacy software called Mixmeister that saved off playlist files in an MMP format.
This format appears to contain binary as well as file paths.
I am looking to extract the file paths along with any additional information I can from these files.
I see this has been done using Java (I do not know Java) here (see around line 56):
https://github.com/liesen/CueMeister/blob/master/src/mixmeister/mmp/MixmeisterPlaylist.java
and Haskell here:
https://github.com/larjo/MixView/blob/master/ListFiles.hs
So far, I have tried reading the file as binary (got stuck), using regular expressions (messy output with moderate success), and attempting some code to read chunks (beyond my skill level).
The code I am using with moderate success for the regex approach is:
import re

file = 'C:\\Users\\xxx\\Desktop\\mixmeisterfile.mmp'

with open(file, 'r', encoding="Latin-1") as filehandle:
    #with open(file, 'rb') as filehandle:
    for text in filehandle:
        b = re.search('TRKF(.*)TKLYTRKM', text)
        if b:
            print(b.group())
Again, this gets me close, but it is messy (the data is not all intact and is surrounded by ASCII and binary characters). Basically, my logic is just searching between two strings to attempt to extract the filenames. What I am really trying to do is get closer to what the Java code on GitHub does, which is (the code below is sampled from the GitHub link):
List<Track> tracks = new ArrayList<Track>();
Marker trks = null;
for (Chunk chunk : trkl.getChunks()) {
    TrackHeader header = new TrackHeader();
    String file = "";
    List<Marker> meta = new LinkedList<Marker>();
    if (chunk.canContainSubchunks()) {
        for (Chunk chunk2 : ((ChunkContainer) chunk).getChunks()) {
            if ("TRKH".equals(chunk2.getIdentifier())) {
                header = readTrackHeader(chunk2);
            } else if ("TRKF".equals(chunk2.getIdentifier())) {
                file = readTrackFile(chunk2);
            } else {
                if (chunk2.canContainSubchunks()) {
                    for (Chunk chunk3 : ((ChunkContainer) chunk2).getChunks()) {
                        if ("TRKM".equals(chunk3.getIdentifier())) {
                            meta.add(readTrackMarker(chunk3));
                        } else if ("TRKS".equals(chunk3.getIdentifier())) {
                            trks = readTrackMarker(chunk3);
                        }
                    }
                }
            }
        }
    }
    Track tr = new Track(header, file, meta);
I am guessing this would use either RIFF or the chunk library in Python, if not done using a regex? Although I read the documentation at https://docs.python.org/2/library/chunk.html, I am not sure that I understand how to go about something like this - mainly, I do not understand how to properly read a binary file that has the visible file paths mixed in.
I don't really know what's going on here, but I'll try my best; if it doesn't work out, then please excuse my stupidity. When I had a project parsing METAR weather data, I realized that my main issue was that I was trying to turn everything into a String type, which wasn't suitable for all the data, so it would just come out as nothing. Your for loop should work just fine. However, when you traverse the file, have you tried making everything the same type, such as a Character/String type? Perhaps certain elements are messed up simply because they don't match the type you are going for.

Why is ReversedLinesFileReader so slow?

I have a file that is 21.6GB and I want to read it from the end to the start rather than from the beginning to the end as you would usually do.
If I read each line of the file from the start to the end using the following code, then it takes 1 minute, 12 seconds.
val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLine {
    val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())
Now, I have read that to read in a file in reverse then I should use ReversedLinesFileReader from Apache Commons. I have created the following extension function to do just this:
fun File.forEachLineFromTheEndOfFile(action: (line: String) -> Unit) {
    val reader = ReversedLinesFileReader(this, Charset.defaultCharset())
    var line = reader.readLine()
    while (line != null) {
        action.invoke(line)
        line = reader.readLine()
    }
    reader.close()
}
and then call it in the following way, which is the same as the previous way only with a call to forEachLineFromTheEndOfFile function:
val startTime = System.currentTimeMillis()
File("very-large-file.xml").forEachLineFromTheEndOfFile {
    val i = 0
}
val diff = System.currentTimeMillis() - startTime
println(diff.timeFormat())
This took 17 minutes and 50 seconds to run!
Am I using ReversedLinesFileReader in the correct way?
I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
Is it just the case that files should not be read from the end to the start?
You are asking for a very expensive operation. Not only are you using random access in blocks to read the file and going backwards (so if the file system is reading ahead, it is reading in the wrong direction), you are also reading an XML file, which is UTF-8, and that encoding is slower than a fixed-byte encoding.
Then on top of that you are using a less-than-efficient algorithm. It reads a block at a time, of an inconvenient size (is it disk-block-size aware? are you setting the block size to match your file system?), backwards while processing encoding, and makes an (unnecessary?) copy of the partial byte array and then turns it into a string (do you need a string to parse?). It could create the string without the copy, and really, creating the string could probably be deferred so you work directly from the buffer, only decoding if you need to (XML parsers, for example, also work from byte arrays or buffers). And there are other array copies that just are not needed, but they are more convenient for the code.
It also might have a bug in that it checks for newlines without considering that the character might mean something different if it is actually part of a multi-byte sequence. It would have to look back a few extra characters to check this for variable-length encodings; I don't see it doing that.
So instead of a nice, forward-only, heavily buffered sequential read of a file, which is the fastest thing you can do on your filesystem, you are doing random reads of one block at a time. It should at least read multiple disk blocks so that it can use the forward momentum (setting the block size to some multiple of your disk block size will help), and also avoid the number of "left over" copies made at buffer boundaries.
There are probably faster approaches. But it'll not be as fast as reading a file in forward order.
UPDATE:
Ok, so I tried an experiment with a rather silly version that processes around 27G of data by reading the first 10 million lines from a Wikidata JSON dump and reversing those lines.
Timings on my 2015 MacBook Pro (with all my dev stuff and many Chrome windows open, eating memory and some CPU all the time; about 5G of total memory free, VM size at default with no parameters set at all, not run under a debugger):
reading in reverse order: 244,648 ms = 244 secs = 4 min 4 secs
reading in forward order: 77,564 ms = 77 secs = 1 min 17 secs
temp file count: 201
approx char count: 29,483,478,770 (line content not including line endings)
total line count: 10,050,000
The algorithm is to read the original file by lines, buffering 50,000 lines at a time and writing the lines in reverse order to a numbered temp file. Then, after all files are written, they are read forwards by lines in reverse numerical order - basically dividing the original into reverse-sort-order fragments. It could be optimized, because this is the most naive version of that algorithm with no tuning, but it does do what file systems do best: sequential reads and sequential writes with good-sized buffers.
So this is a lot faster than the one you were using, and it could be tuned from here to be more efficient. You could trade CPU for disk I/O size and try using gzipped files as well, maybe a two-threaded model to have the next buffer gzipping while processing the previous one. Fewer string allocations, checking each file function to make sure nothing extra is going on, making sure there is no double buffering, and more.
The ugly but functional code is:
package com.stackoverflow.reversefile

import java.io.File
import java.util.*

fun main(args: Array<String>) {
    val maxBufferSize = 50000
    val lineBuffer = ArrayList<String>(maxBufferSize)
    val tempFiles = ArrayList<File>()
    val originalFile = File("/data/wikidata/20150629.json")
    val tempFilePrefix = "/data/wikidata/temp/temp"
    val maxLines = 10000000

    var approxCharCount: Long = 0
    var tempFileCount = 0
    var lineCount = 0

    val startTime = System.currentTimeMillis()

    println("Writing reversed partial files...")
    try {
        fun flush() {
            val bufferSize = lineBuffer.size
            if (bufferSize > 0) {
                lineCount += bufferSize
                tempFileCount++
                File("$tempFilePrefix-$tempFileCount").apply {
                    bufferedWriter().use { writer ->
                        ((bufferSize - 1) downTo 0).forEach { idx ->
                            writer.write(lineBuffer[idx])
                            writer.newLine()
                        }
                    }
                    tempFiles.add(this)
                }
                lineBuffer.clear()
            }
            println(" flushed at $lineCount lines")
        }

        // read and break into backward sorted chunks
        originalFile.bufferedReader(bufferSize = 4096 * 32)
                .lineSequence()
                .takeWhile { lineCount <= maxLines }.forEach { line ->
                    lineBuffer.add(line)
                    if (lineBuffer.size >= maxBufferSize) flush()
                }
        flush()

        // read backward sorted chunks backwards
        println("Reading reversed lines ...")
        tempFiles.reversed().forEach { tempFile ->
            tempFile.bufferedReader(bufferSize = 4096 * 32).lineSequence()
                    .forEach { line ->
                        approxCharCount += line.length
                        // a line has been read here
                    }
            println(" file $tempFile current char total $approxCharCount")
        }
    } finally {
        tempFiles.forEach { it.delete() }
    }

    val elapsed = System.currentTimeMillis() - startTime

    println("temp file count: $tempFileCount")
    println("approx char count: $approxCharCount")
    println("total line count: $lineCount")
    println()
    println("Elapsed: ${elapsed}ms ${elapsed / 1000}secs ${elapsed / 1000 / 60}min ")

    println("reading original file again:")
    val againStartTime = System.currentTimeMillis()
    var againLineCount = 0
    originalFile.bufferedReader(bufferSize = 4096 * 32)
            .lineSequence()
            .takeWhile { againLineCount <= maxLines }
            .forEach { againLineCount++ }
    val againElapsed = System.currentTimeMillis() - againStartTime
    println("Elapsed: ${againElapsed}ms ${againElapsed / 1000}secs ${againElapsed / 1000 / 60}min ")
}
The correct way to investigate this problem would be:
Write a version of this test in pure Java.
Benchmark it to make sure that the performance problem is still there.
Profile it to figure out where the performance bottleneck is.
Q: Am I using ReversedLinesFileReader in the correct way?
Yes. (Assuming that it is an appropriate thing to use a line reader at all. That depends on what it is you are really trying to do. For instance, if you just wanted to count lines backwards, then you should be reading 1 character at a time and counting the newline sequences.)
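(As an aside, here is a minimal sketch of that counting idea - scanning backwards in fixed-size blocks rather than one character at a time, which is safe for UTF-8 because the newline byte 0x0A never appears inside a multi-byte sequence:)

import java.io.IOException;
import java.io.RandomAccessFile;

public class BackwardLineCount {
    // Counts '\n' bytes by seeking backwards through the file in blocks.
    public static long countLines(String path) throws IOException {
        final int blockSize = 64 * 1024;
        long count = 0;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            byte[] block = new byte[blockSize];
            long pos = raf.length();
            while (pos > 0) {
                int toRead = (int) Math.min(blockSize, pos);
                pos -= toRead;
                raf.seek(pos);
                raf.readFully(block, 0, toRead);
                for (int i = 0; i < toRead; i++) {
                    if (block[i] == '\n') count++;
                }
            }
        }
        return count;
    }
}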
Q: I am running Linux Mint with an Ext4 file system on an SSD. Could this have anything to do with it?
Possibly. Reading a file in reverse means that the read-ahead strategies used by the OS to give fast I/O may not work. It could be interacting with the characteristics of an SSD.
Q: Is it just the case that files should not be read from the end to the start?
Possibly. See above.
The other thing that you have not considered is that your file could actually contain some extremely long lines. The bottleneck could be the assembly of the characters into (long) lines.
Looking at the source code, it would seem that there is potential for O(N^2) behavior when lines are very long. The critical part is (I think) in the way that "rollover" is handled by FilePart. Note the way that the "left over" data gets copied.

Java equivalent of perl unpack function

I have Perl code (say, the client) which sends packed data via HTTP POST to other Perl code running under Apache's mod_perl module (say, the server).
On the client side, I have a pack call like this:
$postData = pack("N a*", length($metaData), $metaData);
From the Perl pack documentation, it seems:
N -> An unsigned long (32-bit) in "network" (big-endian) order.
a -> A string with arbitrary binary data, will be null padded.
Now $postData will be sent to the server using Perl's LWP user agent.
On the server side, the Perl code unpacks it like this:
# first reading the metaData Length
my $buf;
$request->read($buf, 4); #$request is apache request handler
my $metaDataLength = unpack("N", $buf);
# now read the metaData itself
$request->read($buf, $metaDataLength);
Now I have to do this server-side data parsing in Java (we are moving away from Perl for various reasons). I have searched Google for this, and it seems there is no one-line solution as in Perl; some suggest writing our own unpack function. I am using Java 1.7.
Is there any simple solution available in Java for the above server-side data parsing?
Edit: Thanks Elliot for the ByteBuffer idea. The following code works fine for me:
InputStream is = request.getInputStream(); //request is HTTPServletRequest
byte[] bArr = new byte[4]; //reading first 4 bytes to get metaDataLength
int bytesRead = is.read(bArr);
ByteBuffer buf = ByteBuffer.wrap(bArr);
int metaDataLength = buf.getInt(); //shows value matches with clientside perl code.
Potentially JBBP can be such a solution:
final int value = JBBPParser.prepare("int;").parse(theInputStream).findFieldForType(JBBPFieldInt.class).getAsInt();
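A plain-JDK alternative sketch, assuming the same wire format (a 4-byte big-endian length followed by that many payload bytes), is DataInputStream, whose readInt() reads in network byte order just like Perl's N template:

import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PerlUnpack {
    // Reads one pack("N a*", ...) style record from the stream.
    public static byte[] readPacked(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        int metaDataLength = din.readInt();   // the "N": 4 bytes, big-endian
        byte[] metaData = new byte[metaDataLength];
        din.readFully(metaData);              // the "a*" payload
        return metaData;
    }
}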

Sorting a 100MB XML file with Java?

How long does sorting a 100MB XML file with Java take?
The file has items with the following structure, and I need to sort them by the event element:
<doc>
<id>84141123</id>
<title>kk+ at Hippie Camp</title>
<description>photo by SFP</description>
<time>18945840</time>
<tags>elphinstone tribalharmonix vancouver intention intention7 newyears hippiecamp bc sunshinecoast woowoo kk kriskrug sunglasses smoking unibomber møtleykrüg </tags>
<geo></geo>
<event>47409</event>
</doc>
I'm on an Intel dual-core with 4GB of RAM.
Minutes? Hours?
Thanks
Here are the timings for a similar task executed using Saxon XQuery on a 100MB input file.
Saxon-EE 9.3.0.4J from Saxonica
Java version 1.6.0_20
Analyzing query from {for $i in //item order by location return $i}
Analysis time: 195 milliseconds
Processing file:/e:/javalib/xmark/xmark100.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/e:/javalib/xmark/xmark100.xml using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 6158 milliseconds
Tree size: 4787932 nodes, 79425460 characters, 381878 attributes
Execution time: 3.466s (3466ms)
Memory used: 471679816
So: about 6 seconds for parsing the input file and building a tree, 3.5 seconds for sorting it. That's invoked from the command line, but invoking it from Java will get very similar performance. Don't try to code the sort yourself - it's only a one-line query, and you are very unlikely to match the performance of an optimized XQuery engine.
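For reference, invoking that kind of query from Java via Saxon's s9api looks roughly like the sketch below (the query string, file names and sort key are placeholders based on the question's structure; check the s9api documentation for your Saxon version):

import java.io.File;
import javax.xml.transform.stream.StreamSource;
import net.sf.saxon.s9api.*;

public class SortWithSaxon {
    public static void main(String[] args) throws SaxonApiException {
        Processor processor = new Processor(false);   // false = open-source feature set
        XQueryCompiler compiler = processor.newXQueryCompiler();

        // One-line query: return every <doc> ordered by its <event> child.
        XQueryExecutable executable = compiler.compile(
                "for $d in //doc order by xs:integer($d/event) return $d");

        XQueryEvaluator evaluator = executable.load();
        evaluator.setSource(new StreamSource(new File("input.xml")));
        evaluator.run(processor.newSerializer(new File("sorted.xml")));
    }
}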
I would say minutes - you should be able to do that completely in memory, so with a SAX parser it would be reading-sorting-writing; it should not be a problem for your hardware.
I think a problem like this would be better handled using serialisation:
1. Deserialise the XML file into an ArrayList of 'doc'.
2. Using straight Java code, sort on the event attribute and store the sorted ArrayList in another variable.
3. Serialise the sorted 'doc' ArrayList out to a file.
If you do it in memory, you should be able to do this in under 10 seconds. You would be pushing it to do this in under 2 seconds, because it will spend that much time reading from and writing to disk.
This program should use no more than 4-5x the original file size, about 500 MB in your case.
String[] records = FileUtils.readFileToString(new File("my-file.xml")).split("</?doc>");

Map<Long, String> recordMap = new TreeMap<Long, String>();
for (int i = 1; i < records.length; i += 2) {
    String record = records[i];
    int pos1 = record.indexOf("<id>");
    int pos2 = record.indexOf("</id>", pos1 + 4);
    // the id text starts after the 4-character "<id>" tag
    long num = Long.parseLong(record.substring(pos1 + 4, pos2));
    recordMap.put(num, record);
}

StringBuilder sb = new StringBuilder(records[0]);
for (String s : recordMap.values()) {
    sb.append("<doc>").append(s).append("</doc>");
}
sb.append(records[records.length - 1]);

FileUtils.writeStringToFile(new File("my-output-file.xml"), sb.toString());
