Consider a huge JSON document with a structure like:
{"text": "very HUGE text here.."}
I am storing this JSON in an ObjectNode, say json.
Now I try to extract the text from the ObjectNode:
String text = json.get("text").asText();
The JSON can be 4-5 MB in size. When I run this code, I don't get a result; the program keeps executing forever.
The method above works fine for small and normal-sized strings. Is there another best practice for extracting huge data from JSON?
A test with Jackson (FasterXML): a 7 MB JSON node can be parsed in about 200 milliseconds:
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper objectMapper = new ObjectMapper();
InputStream is = getClass().getResourceAsStream("/test.json");
long begin = System.currentTimeMillis();
// Bind the whole document into a Map; the huge value becomes a plain String
Map<String, String> obj = objectMapper.readValue(is, HashMap.class);
long end = System.currentTimeMillis();
System.out.println(obj.get("value").length() + "\t" + (end - begin));
The output is:
7888888 168
Perhaps try upgrading your Jackson version?
Perhaps your default heap size is too small: if the input is 5 MB of UTF-8, the Java String for it will usually need about 10 MB of memory (a char is 16 bits, while UTF-8 uses a single byte for most English characters).
There isn't much you can do about this, regardless of the JSON library, if the value has to be handled as a Java String: you need enough memory for the value plus the rest of the processing. Further, since the Java heap is divided into generations, 64 MB may or may not work: the 10 MB must be contiguous, so it probably gets allocated in the old generation.
So: try with a bigger heap size and see how much you need.
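To see what limit is actually in effect, a quick check like this can help (a minimal sketch; Runtime.maxMemory() reports the effective -Xmx value):

// Print the maximum heap the JVM will attempt to use (i.e. the effective -Xmx)
long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
System.out.println("Max heap: " + maxHeapMb + " MB");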
I have a large JSON file, roughly 300 MB, which I am parsing with the Jackson ObjectMapper:
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.datatype.jsr310.JSR310Module;

private void parseJson(Object obj) throws IOException {
    ObjectMapper map = new ObjectMapper();
    map.registerModule(new JSR310Module()); // register before any read/write calls
    // Round-trip: serialize the incoming object, then bind the JSON to the target type
    String str = map.writeValueAsString(obj);
    Map<String, List<POJOClass>> result = map.readValue(
            new ByteArrayInputStream(str.getBytes(StandardCharsets.UTF_8)),
            new TypeReference<Map<String, List<POJOClass>>>() {});
}
The parameter to parseJson is an Object which contains the JSON string.
This works fine for JSON files of around 150 MB; however, it starts to fail with a heap space error when the files are around 250-300 MB. I am using Jackson 2.4.0.
Your Java process doesn't have enough memory to handle such a huge file.
When launching your app, use the -Xmx option to increase the maximum memory available to the Java process:
java -Xmx512m ...
or
java -Xmx1G ...
Have you tried streaming over the JSON, for instance with https://www.baeldung.com/jackson-streaming-api, so that you don't have to hold everything in memory at once?
For a huge JSON file, you should use the Jackson streaming API.
You can find the documentation here. It costs far less memory than the normal approach. In general, your POJO must have a suitable shape: it should be separable into many independent parts, such as a list of objects. With the streaming API you can then retrieve the objects in the list one by one, as in the sketch below.
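A minimal sketch of this pattern, assuming the top level of the file is one large JSON array and reusing the POJOClass name from the question (the file name and the process() consumer are placeholders):

import java.io.File;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.ObjectMapper;

ObjectMapper mapper = new ObjectMapper();
try (JsonParser parser = mapper.getFactory().createParser(new File("huge.json"))) {
    if (parser.nextToken() == JsonToken.START_ARRAY) {
        // Advance element by element; only one object is materialized at a time
        while (parser.nextToken() == JsonToken.START_OBJECT) {
            POJOClass item = mapper.readValue(parser, POJOClass.class);
            process(item); // placeholder consumer
        }
    }
}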
I am comparing JSON and BSON for serializing objects. These objects contain several arrays holding a large number of integers; in my test, the object being serialized contains about 12,000 integers in total. I am only interested in how the sizes of the serialized results compare. I am using JSON.NET as the serialization library, and JSON because I also want to be able to work with the result in JavaScript.
The size of the JSON string is about 43 kB and the size of the BSON result is 161 kB, a difference of roughly a factor of 4. This is not what I expected; I looked at BSON precisely because I thought it stored data more efficiently.
So my question is: why is BSON less efficient here, and can it be made more efficient? Or is there another way of serializing data with arrays containing large numbers of integers that can still be handled easily in JavaScript?
Below is the code used to test the JSON/BSON serialization.

// Read the file containing the JSON string
string _jsonString = ReadFile();
object _object = Newtonsoft.Json.JsonConvert.DeserializeObject(_jsonString);

FileStream _fs = File.OpenWrite("BsonFileName");
using (Newtonsoft.Json.Bson.BsonWriter _bsonWriter =
           new BsonWriter(_fs) { CloseOutput = false })
{
    Newtonsoft.Json.JsonSerializer _jsonSerializer = new JsonSerializer();
    _jsonSerializer.Serialize(_bsonWriter, _object);
    _bsonWriter.Flush();
}
Edit:
Here are the resulting files
https://skydrive.live.com/redir?resid=9A6F31F60861DD2C!362&authkey=!AKU-ZZp8C_0gcR0
The efficiency of JSON vs. BSON depends on the size of the integers you're storing. There's an interesting crossover point where ASCII digits take fewer bytes than an actual integer type. 64-bit integers, which appears to be how your BSON document stores the numbers, take up 8 bytes each. Your numbers are all less than 10,000, which means you could store each one in ASCII in at most 4 bytes (one byte per digit up through 9999). In fact, most of your data looks like it's less than 1,000, meaning it can be stored in 3 or fewer bytes. Of course, parsing that text back into numbers takes time and isn't cheap, but it saves space. Furthermore, JavaScript uses 64-bit values to represent all numbers, so if you wrote the BSON after converting each integer to a more appropriate data type, your BSON file could be much smaller.
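A back-of-the-envelope check of that factor of 4 (an estimate, assuming the BSON spec's array encoding, where each element carries a one-byte type tag plus its decimal index as a null-terminated key string): 12,000 int64 elements at roughly 8 + 1 + 5 = 14 bytes each comes to about 168,000 bytes, close to the observed 161 kB. In JSON, a number of at most four digits plus a comma costs at most 5 bytes, roughly 60,000 bytes, which is in line with the observed 43 kB given that most of the numbers are shorter.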
According to the spec, BSON also contains a lot of metadata that JSON doesn't. This metadata consists mostly of length prefixes that let you skip over data you aren't interested in. For example, take the following data:
["hello there, this is an necessarily long string. It's especially long, but you don't care about it. You're just trying to get to the next element. But I keep going on and on.",
"oh man. here's another string you still don't care about. You really just want the third element in the array. How long are the first two elements? JSON won't tell you",
"data_you_care_about"]
Now, if you're using JSON, you have to parse the entirety of the first two strings to find out where the third one is. If you use BSON, you get markup more like the following (not the actual format; this markup is made up for the sake of example):
[175 "hello there, this is an necessarily long string. It's especially long, but you don't care about it. You're just trying to get to the next element. But I keep going on and on.",
169 "oh man. here's another string you still don't care about. You really just want the third element in the array. How long are the first two elements? JSON won't tell you",
19 "data_you_care_about"]
So now, you can read '175', know to skip forward 175 bytes, then read '169', skip forward 169 bytes, and then read '19' and copy the next 19 bytes to your string. That way you don't even have to parse the strings for delimiters.
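To make the skipping concrete, here is an illustrative sketch (not the real BSON wire format, just the length-prefix idea):

import java.io.DataInputStream;
import java.io.IOException;

// Each record is assumed to be written as a 4-byte length prefix followed by
// that many payload bytes
static void skipRecord(DataInputStream in) throws IOException {
    int len = in.readInt(); // read the length prefix
    in.skipBytes(len);      // jump over the payload without parsing it
}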
Using one versus the other depends very much on your needs. If you're going to store enormous documents that you have all the time in the world to parse, but whose disk footprint matters, use JSON because it's more compact and space-efficient.
If you're going to store documents, but reducing parse time (perhaps in a server context) matters more to you than saving some disk space, use BSON.
Another thing to consider is human readability. If you need to debug a crash report that contains BSON, you'll probably need a utility to decipher it; you most likely can't read BSON by eye, but you can just read JSON.
I have a problem creating a String from a JSON node.
Currently I'm doing it with the node.toString() method, but sometimes this takes 7-8 seconds to create a JSON string weighing 15-18 MB.
I tried the mapper.writeValueAsString(node) method too, but it showed some additional time in the test. The issue is very difficult to investigate because it is also hard to reproduce.
I'm currently using only ObjectNode (not TextNode, BooleanNode, etc.); could that have an effect? Is there a better way to convert a JsonNode to a String?
Sample code:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.JsonNodeFactory;
import com.fasterxml.jackson.databind.node.ObjectNode;

JsonNodeFactory nodeFactory = JsonNodeFactory.instance;
ObjectNode node = nodeFactory.objectNode();
node.put("fnm", "Namal");
node.put("lnm", "Fernando");
node.put("age", 30);

// Build a large tree of nested objects
for (int i = 0; i < 10000; i++) {
    ObjectNode order = nodeFactory.objectNode();
    order.put("id", (i + 1000) + "");
    order.put("nm", "ORD" + (i + 1000));
    order.put("ref", "RF-" + i);
    node.set("order" + i, order); // set(), not put(): put(String, JsonNode) is deprecated
}

long smili = System.currentTimeMillis();
System.out.println("main().Node : " + node.toString());
System.out.println("main().TIMING 1 : " + (System.currentTimeMillis() - smili) / 1000.0);

long smili2 = System.currentTimeMillis();
ObjectMapper mapper = new ObjectMapper();
System.out.println("main().Node : " + mapper.writeValueAsString(node));
System.out.println("main().TIMING 2 : " + (System.currentTimeMillis() - smili2) / 1000.0);
First things first: JsonNode.toString() should NOT be used for serialization, ever. It is useful for simple troubleshooting, but since it does not have access to contextual configuration, it will not necessarily produce valid JSON. This is by design.
Instead, you should use ObjectMapper or ObjectWriter to serialize it; this will produce valid JSON using the exact configuration and settings of the mapper (or of a writer created by the mapper).
Now: your timing comparison is flawed, since you only make one call with ObjectMapper. There is overhead in the first N calls, partly due to ObjectMapper initialization and partly due to JVM warmup (the dynamic JIT compiler kicking in to optimize hot code). To get more usable results you would need to call the method many times, and ideally let the benchmark run for a couple of seconds.
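A rough sketch of that methodology, reusing the node built in the question (the warm-up count of 50 is arbitrary, just enough to let the JIT settle):

ObjectMapper mapper = new ObjectMapper();
// Warm-up: exercise the same code path repeatedly so ObjectMapper setup
// and JIT compilation don't dominate the measurement
for (int i = 0; i < 50; i++) {
    mapper.writeValueAsString(node);
}
long start = System.nanoTime();
String json = mapper.writeValueAsString(node);
System.out.println("serialization took " + (System.nanoTime() - start) / 1_000_000 + " ms");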
But beyond this, a common problem is forgetting to reserve enough heap space.
In Java, a String requires memory of at least 2x its length in characters (each char takes 2 bytes). But this is just the size of the result; during serialization a temporary buffer is also needed, using a roughly comparable amount.
On disk, however, UTF-8 encoding typically uses just 1 byte per character (for ASCII characters).
So if you see a 15 MB file, producing it could use 60 MB of memory during processing. If your heap size is set too small (say, 64 MB, which is the default in many cases), this leads to very heavy garbage collection. The solution there is to increase the heap size.
Or, unless you actually need the String:
Write directly into a file: use of an intermediate String is an anti-pattern if the result goes into a file or a network connection.
If you do require an intermediate form, serialize as byte[]: this only needs the same amount of memory as the file would, since the content is UTF-8 encoded in memory instead of being a wrapped char[].
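Both options in a minimal sketch (assuming Jackson 2.x and reusing the node from the question; the file name is a placeholder):

ObjectMapper mapper = new ObjectMapper();

// Option 1: stream straight into the file; no 15 MB String is ever built
mapper.writeValue(new File("out.json"), node);

// Option 2: if an intermediate form is required, UTF-8 bytes take roughly
// half the memory of the equivalent String
byte[] bytes = mapper.writeValueAsBytes(node);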
I'm using the Jackson streaming API to deserialize a fairly large JSON document (on the order of megabytes) into POJOs. It works fine, but I'd like to optimize it, both memory- and processing-wise (the code runs on Android).
The main problem I'd like to optimize away is converting a large number of strings from UTF-8 to ISO-8859-1. Currently I use:
String result = new String(parser.getText().getBytes("ISO-8859-1"));
As I understand it, the parser first copies the token content into a String (getText()), then creates a byte array from it (getBytes()), which is then used to create the final String in the desired encoding. That is far too much allocation and copying.
The ideal solution would be for getText() to accept an encoding parameter and just give me the final string, but that is not the case.
Any other ideas, or flaws in my thinking?
You can use:
parser.getBinaryValue() (available since Jackson 2.4),
or you can implement an ObjectCodec (with a readValue(...) method that knows how to convert bytes to a String in ISO-8859-1) and set it with parser.setCodec().
If you have control over the JSON generation, avoid using a charset other than UTF-8.
I am trying to pass a byte[] containing ASCII characters to log4j, to be logged into a file using the obvious representation. When I simply pass in the byte[], it is of course treated as an object and the logs are pretty useless. When I convert it to a String using new String(byte[] data), the performance of my application is halved.
How can I pass the data in efficiently, without incurring the approximately 30 µs penalty of converting it to a String?
Also, why does the conversion take so long?
Thanks.
Edit
I should add that I am optimising for latency here, and yes, 30 µs does make a difference! Also, these arrays vary from ~100 bytes all the way up to a few thousand bytes.
ASCII is one of the few encodings that can be converted to/from UTF-16 with no arithmetic or table lookups, so it's possible to convert manually:
String convert(byte[] data) {
    StringBuilder sb = new StringBuilder(data.length);
    for (int i = 0; i < data.length; ++i) {
        // A negative byte means the high bit is set, i.e. not 7-bit ASCII
        if (data[i] < 0) throw new IllegalArgumentException();
        sb.append((char) data[i]); // ASCII maps 1:1 onto the low UTF-16 range
    }
    return sb.toString();
}
But make sure it really is ASCII, or you'll end up with garbage.
What you want to do is delay processing of the byte[] array until log4j decides it actually wants to log the message. That way you could, for example, log it at DEBUG level while testing and then disable it in production:
final byte[] myArray = ...;
Logger.getLogger(MyClass.class).debug(new Object() {
    @Override
    public String toString() {
        return new String(myArray);
    }
});
Now you don't pay the conversion penalty unless you actually log the data, because the toString method isn't called until log4j decides it will actually log the message!
Now, I'm not sure what you mean by "the obvious representation", so I've assumed you mean converting to a String by reinterpreting the bytes in the default character encoding. If you are dealing with binary data, that is obviously worthless; in that case I'd suggest using Arrays.toString(byte[]) to create a formatted string along the lines of
[54, 23, 65, ...]
If your data is in fact ASCII (i.e. 7-bit data), then you should be using new String(data, "US-ASCII") instead of depending on the platform's default encoding. This may be faster than interpreting the bytes in your platform's default encoding, which could be UTF-8 and so requires more introspection.
You can speed this up further by avoiding the charset lookup on every call: cache the Charset instance and call new String(data, charset) instead.
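A minimal sketch of that caching (assuming Java 7+ for StandardCharsets; new String(byte[], Charset) skips the per-call charset-name lookup and throws no checked exception):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AsciiDecoder {
    private static final Charset ASCII = StandardCharsets.US_ASCII; // cached once

    static String decode(byte[] data) {
        return new String(data, ASCII);
    }
}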
Having said that: it's been a very, very long time since I've seen real ASCII data in a production environment.
Halved performance? How large is this byte array? If it's, say, 1 MB, then there are certainly more factors to take into account than just "converting" from bytes to chars (which should be fast enough). Writing 1 MB of data to a log file instead of "just" 100 bytes (which byte[].toString() might generate) is obviously going to take some time; the disk file system is not as fast as RAM.
You'll want to change the string representation of the byte array, perhaps to something more sensible, e.g. the name associated with it (a filename?), its length, and so on. After all, what does that byte array actually represent?
Edit: I don't remember seeing the "approximately 30 µs" phrase in your question; maybe you edited it in within 5 minutes of asking. That is micro-optimization territory, and it should certainly not cause "halved performance" in general, unless you write the arrays a million times per second (and even then: why? aren't you overusing logging?).
Take a look here: Faster new String(bytes, cs/csn) and String.getBytes(cs/csn)