Slowness when creating a String from an ObjectNode with Jackson in Java - java

I have a problem with creating a String from a JSON node.
Currently I'm doing it with the node.toString() method, but sometimes it takes 7-8 seconds to create a JSON string weighing 15 MB-18 MB.
I tried the mapper.writeValueAsString(node) method too, but it shows some additional time in the test. The issue is also hard to reproduce, which makes it very difficult to investigate.
I'm currently using only ObjectNode (not TextNode, BooleanNode, etc.); could this be affecting it? Or is there a better way to convert a JsonNode to a String?
Sample code:
JsonNodeFactory nodeFactory = JsonNodeFactory.instance;
ObjectNode node = nodeFactory.objectNode();
node.put("fnm", "Namal");
node.put("lnm", "Fernando");
node.put("age", 30);
for (int i = 0; i < 10000; i++) {
    ObjectNode order = nodeFactory.objectNode();
    order.put("id", (i + 1000) + "");
    order.put("nm", "ORD" + (i + 1000));
    order.put("ref", "RF-" + i);
    node.set("order" + i, order); // set() is the non-deprecated form of put(String, JsonNode)
}
long smili = System.currentTimeMillis();
System.out.println("main().Node : " + node.toString());
System.out.println("main().TIMING 1 : " + (System.currentTimeMillis() - smili) / 1000.0);
long smili2 = System.currentTimeMillis();
ObjectMapper mapper = new ObjectMapper();
System.out.println("main().Node : " + mapper.writeValueAsString(node));
System.out.println("main().TIMING 2 : " + (System.currentTimeMillis() - smili2) / 1000.0);

First things first: JsonNode.toString() should NOT be used for serialization, ever. It is useful for simple troubleshooting, but since it does not have access to contextual configuration, it will not necessarily produce valid JSON. This is by design.
Instead, you should use ObjectMapper or ObjectWriter to serialize it; this will produce valid JSON using the exact configuration and settings that the mapper (or a writer created by the mapper) has.
Now: your timing comparison is flawed, since you only make one call with ObjectMapper. There is overhead in the first N calls, partly due to ObjectMapper initialization and partly due to JVM warmup (the dynamic JIT compiler optimizing the code and so on). To get more usable results you would need to call the method multiple times, and ideally let it run for a couple of seconds.
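For illustration, a rough timing sketch (not from the original answer) that reuses a single ObjectMapper and repeats the serialization so that warm-up is amortized; it assumes the node built in the question and a surrounding method that declares throws JsonProcessingException:
ObjectMapper mapper = new ObjectMapper(); // create once, reuse for every call
for (int round = 0; round < 20; round++) {
    long start = System.nanoTime();
    String json = mapper.writeValueAsString(node);
    long millis = (System.nanoTime() - start) / 1_000_000;
    System.out.println("round " + round + ": " + json.length() + " chars in " + millis + " ms");
}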
But beyond this, a common problem is not having enough heap space.
In Java, a String needs memory equal to at least 2x its length in characters (each char takes 2 bytes). And that is just the size of the result; during serialization a temporary buffer is also needed, using a roughly comparable amount.
On disk, however, UTF-8 encoding typically uses just 1 byte per character (for ASCII characters).
So a JSON document that is a 15 MB file could use some 60 MB of memory during processing. If your heap size is set too small (say, 64 MB, which is the default in many cases), that leads to very heavy garbage-collection churn. One solution is to increase the heap size.
Or, unless you actually need the String, to do one of the following (both sketched below):
- Write directly into a file: using an intermediate String is an anti-pattern if the result goes into a File or a network connection.
- If you do require an intermediate form, serialize to byte[]: this only needs the same amount of memory as the file would (since it is UTF-8 encoded in memory, instead of being a wrapped char[]).
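A minimal sketch of both alternatives (assuming the mapper and node from the question; the file name is made up, and the surrounding method would declare throws IOException):
ObjectMapper mapper = new ObjectMapper();
// 1) Stream straight into a file: no giant intermediate String is ever built.
mapper.writeValue(new File("orders.json"), node);
// 2) If an in-memory form is needed, UTF-8 bytes take roughly half the memory of a String.
byte[] utf8 = mapper.writeValueAsBytes(node);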

Related

Parse Large JSON File (around 300 MB) to List of POJOs

I have a large JSON file, roughly around 300 MB, that I am parsing using the Jackson ObjectMapper:
private void parseJson(Object obj) throws IOException {
    ObjectMapper map = new ObjectMapper();
    Map<String, List<POJOClass>> result = new HashMap<>();
    String str = map.writeValueAsString(obj);
    map.registerModule(new JSR310Module()); // from jackson-datatype-jsr310
    result = map.readValue(new ByteArrayInputStream(str.getBytes(StandardCharsets.UTF_8)),
            new TypeReference<Map<String, List<POJOClass>>>() {});
}
The parameter to parseJson is an Object which contains the JSON String.
This works fine for JSON files of around 150 MB; however, it starts to fail with a heap space error when the JSON files are around 250-300 MB. I am using Jackson 2.4.0.
You don't have enough memory in your Java process to handle such a huge file.
When launching your app, use the -Xmx option to increase the maximum memory used by your Java process:
java -Xmx512m ...
or
java -Xmx1G ...
Have you tried streaming over the JSON using, for instance, https://www.baeldung.com/jackson-streaming-api, so that you do not have to put everything into memory at once?
For a huge JSON file, you should use the Jackson streaming API.
You can find the documentation here. It uses far less memory than the normal way. In general, your POJO must have a certain shape: it should be separable into many independent parts, such as a list of objects. Then, with the Jackson streaming API, you can obtain the objects in the list one by one.
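As a hedged illustration of that shape (the top-level map of lists comes from the question; POJOClass and the file name are placeholders), the streaming API can hand you one element at a time instead of building the whole Map:
ObjectMapper mapper = new ObjectMapper();
try (JsonParser parser = mapper.getFactory().createParser(new File("big.json"))) {
    parser.nextToken();                                  // START_OBJECT of the top-level map
    while (parser.nextToken() == JsonToken.FIELD_NAME) { // one map key at a time
        String key = parser.getCurrentName();
        parser.nextToken();                              // START_ARRAY for that key's list
        while (parser.nextToken() == JsonToken.START_OBJECT) {
            POJOClass item = mapper.readValue(parser, POJOClass.class);
            // process 'item' here instead of accumulating everything in memory
        }
    }
}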

Scala: how to do String concatenation to avoid GC Overhead issue

I have an application which takes some really big delimited files (~10 to 15 M records) and ingests them into Kafka after doing some preprocessing. As part of this preprocessing we convert the delimited records into JSON and add metadata to that JSON message (file name, row number). We are doing it using the Json4s native serializer like below:
import org.json4s.native.Serialization._

// some more code and below is the final output.
write(Map(
    "schema" -> schemaName,
    "data" -> List(resultMap),
    "flag" -> "I"))
Once the message is converted to Json we add message metadata like:
def addMetadata(msg: String, metadata: MessageMetadata): String = {
    val meta = write(asJsonObject(metadata))
    val strippedMeta = meta.substring(1, meta.length - 1)
    val strippedMessage = msg.substring(1, msg.lastIndexOf("}"))
    "{" + strippedMessage + "," + strippedMeta + "}"
}
The final message looks like this at the end:
{"schema":"SchemaName"
"data": [
],
"flag": "I",
"metadata":{"srcType":"file","fileName":"file","line":1021}}
Now both of these methods are leaking some memory and throwing the error below. The application has the capacity to process 300k messages per minute, but after around 4-5 minutes it slows down and eventually dies. I know string concatenation generates lots of garbage objects; what is the best way of doing this?
java.lang.OutOfMemoryError: GC overhead limit exceeded
When producing tons of such short messages, there will be tons of tiny short-lived objects created. Such tiny short-lived objects are something the GC can handle very efficiently; it's very improbable that they could cause any serious problems.
The message
java.lang.OutOfMemoryError: GC overhead limit exceeded
means that the GC was working very hard without any success. That's not what happens with tiny short-lived objects. Most probably, you have a big memory leak which takes away all of your memory after a few minutes. Then the GC has to fail, as there's nothing left to reclaim.
Don't waste time on optimizing something which may be harmless. Use some tool to find the leak instead.
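For example (these are standard JVM options and the jmap tool, not something specific to this code), you can capture a heap dump and open it in a tool such as Eclipse MAT or VisualVM to see what is holding the memory:
java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/app.hprof ...
jmap -dump:live,format=b,file=heap.hprof <pid>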
Try using StringBuilder; you can avoid creating unnecessary objects.
Is string concatenation in scala as costly as it is in Java?

Parsing Huge JSON with Jackson

Consider a huge JSON with structure like -
{"text": "very HUGE text here.."}
I am storing this JSON as an ObjectNode object called, say, json.
Now I try to extract this text from the ObjectNode:
String text = json.get("text").asText();
This JSON can be about 4-5 MB in size. When I run this code, I don't get a result (the program keeps executing forever).
The above method works fine for small and normal-sized strings. Is there any other best practice for extracting huge data from JSON?
Tested with Jackson (FasterXML); a 7 MB JSON node can be parsed in 200 milliseconds:
ObjectMapper objectMapper = new ObjectMapper();
InputStream is = getClass().getResourceAsStream("/test.json");
long begin = System.currentTimeMillis();
Map<String,String> obj = objectMapper.readValue(is, HashMap.class);
long end = System.currentTimeMillis();
System.out.println(obj.get("value").length() + "\t" + (end - begin));
the output is:
7888888 168
Try upgrading your Jackson?
Perhaps your default heap size is too small: if the input is 5 MB UTF-8 encoded, the Java String for it will usually need 10 MB of memory (a char is 16 bits; UTF-8 for English characters is mostly a single byte).
There isn't much you can do about this, regardless of the JSON library, if the value has to be handled as a Java String; you need enough memory for the value and the rest of the processing. Further, since the Java heap is divided into different generations, 64 MB may or may not be enough: since the 10 MB needs to be consecutive, it will probably be allocated in the old generation.
So: try with a bigger heap size and see how much you need.
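If the value only needs to be read once rather than kept around, a hedged alternative (not from the original answer) is the streaming API: it still materializes the one big String, but skips building the whole tree around it. The "text" field name comes from the question; the file name is made up:
try (JsonParser p = new JsonFactory().createParser(new File("huge.json"))) {
    while (p.nextToken() != null) {
        if (p.getCurrentToken() == JsonToken.FIELD_NAME && "text".equals(p.getCurrentName())) {
            p.nextToken();              // move to the VALUE_STRING token
            String text = p.getText();  // the only large allocation is this one String
            // use 'text' here
            break;
        }
    }
}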

How to avoid out of memory in StringBuilder or String in Java

I am getting a lot of data from a web service containing XML entity references. While replacing those with the respective characters I am getting an out-of-memory error. Can anybody give an example of how to avoid that? I have been stuck on this problem for two days.
This is my code:
public String decodeXMLData(String s)
{
    s = s.replaceAll("&gt;", ">");
    System.out.println("string value is " + s);
    s = s.replaceAll("&lt;", "<");
    System.out.println("string value1 is " + s);
    s = s.replaceAll("&amp;", "&");
    s = s.replaceAll("&quot;", "\"");
    s = s.replaceAll("&apos;", "'");
    s = s.replaceAll("&nbsp;", " ");
    return s;
}
You should use a SAX parser, not parse it on your own.
Just look into these resources; they have code samples too:
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
http://www.java-samples.com/showtutorial.php?tutorialid=152
http://www.totheriver.com/learn/xml/xmltutorial.html
Take a look at Apache Commons Lang | StringEscapeUtils.unescapeHtml.
Calling replaceAll five times, you are creating five new String objects; in total, you are working with six Strings. This is not an efficient way to XML-decode a string.
I recommend using a more robust implementation of XML-encoding/decoding methods, like those contained in the Commons Lang libraries. In particular, StringEscapeUtils may help you get the job done.
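A hedged sketch using Commons Lang 3 (assumes commons-lang3 is on the classpath; unescapeXml covers the five XML entities, while &nbsp; is an HTML entity and needs separate handling):
import org.apache.commons.lang3.StringEscapeUtils;

public String decodeXMLData(String s) {
    String decoded = StringEscapeUtils.unescapeXml(s); // handles &lt; &gt; &amp; &quot; &apos;
    return decoded.replace("&nbsp;", " ");             // not an XML entity, so handle it explicitly
}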
The method as shown would not be a source of out of memory errors (unless the string you are handling is as big as the remaining free heap).
What you could be running into is the fact that String.substring() calls do not allocate a new string, but create a String object that reuses the char[] of the string substring() is called on. If your code consists of reading large buffers and creating strings from those buffers, you might need to use new String(str.substring(index)) to force reallocation of the string values into new, small char arrays.
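For context, this sharing behavior applies to JVMs before Java 7u6; a small illustration (the buffer is made up):
String buffer = new String(new char[1000000]);        // stands in for a large read buffer
String shared = buffer.substring(0, 10);              // on old JVMs this keeps the whole 1M char[] alive
String copied = new String(buffer.substring(0, 10));  // forces a compact copy that can be freed independently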
You can try increasing JVM memory, but that will only delay the inevitable if the problem is serious (i.e. if you're trying to claim gigabytes for example).
If you've got a single String that causes you to run out of memory trying to do this, it must be humongous :) Suggestion to use a SAX parser to handle it and print it in bits and pieces is a good one.
Or split it up into smaller bits yourself and send each of those to a routine that does what you want and discard the result afterwards.

Convert ASCII byte[] to String

I am trying to pass a byte[] containing ASCII characters to log4j, to be logged into a file using the obvious representation. When I simply pass in the byte[] it is of course treated as an object and the logs are pretty useless. When I try to convert it to a String using new String(byte[] data), the performance of my application is halved.
How can I efficiently pass it in, without incurring the approximately 30us time penalty of converting it to a String?
Also, why does it take so long to convert them?
Thanks.
Edit
I should add that I am optimising for latency here, and yes, 30us does make a difference! Also, these arrays vary from ~100 bytes all the way up to a few thousand bytes.
ASCII is one of the few encodings that can be converted to/from UTF-16 with no arithmetic or table lookups, so it's possible to convert manually:
String convert(byte[] data) {
    StringBuilder sb = new StringBuilder(data.length);
    for (int i = 0; i < data.length; ++i) {
        if (data[i] < 0) throw new IllegalArgumentException();
        sb.append((char) data[i]);
    }
    return sb.toString();
}
But make sure it really is ASCII, or you'll end up with garbage.
What you want to do is delay processing of the byte[] array until log4j decides that it actually wants to log the message. This way you could log it at DEBUG level, for example, while testing and then disable it during production. For example, you could:
final byte[] myArray = ...;
Logger.getLogger(MyClass.class).debug(new Object() {
    @Override public String toString() {
        return new String(myArray);
    }
});
Now you don't pay the speed penalty unless you actually log the data, because the toString method isn't called until log4j decides it'll actually log the message!
Now I'm not sure what you mean by "the obvious representation" so I've assumed that you mean convert to a String by reinterpreting the bytes as the default character encoding. Now if you are dealing with binary data, this is obviously worthless. In that case I'd suggest using Arrays.toString(byte[]) to create a formatted string along the lines of
[54, 23, 65, ...]
If your data is in fact ASCII (i.e. 7-bit data), then you should be using new String(data, "US-ASCII") instead of depending on the platform default encoding. This may be faster than trying to interpret it as your platform default encoding (which could be UTF-8, which requires more introspection).
You could also speed this up by avoiding the Charset-Lookup hit each time, by caching the Charset instance and calling new String(data, charset) instead.
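A minimal sketch of that caching (StandardCharsets is available from Java 7 onward; on older JVMs you can cache Charset.forName("US-ASCII") yourself):
import java.nio.charset.StandardCharsets;

String asciiToString(byte[] data) {
    // No charset lookup by name on each call; the Charset instance is a shared constant.
    return new String(data, StandardCharsets.US_ASCII);
}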
Having said that: it's been a very, very long time since I've seen real ASCII data in a production environment.
Halved performance? How large is this byte array? If it's, for example, 1 MB, then there are certainly more factors to take into account than just "converting" from bytes to chars (which is supposed to be fast enough anyway). Writing 1 MB of data instead of "just" 100 bytes (which byte[].toString() may generate) to a log file is obviously going to take some time. The disk file system is not as fast as RAM.
You'll need to change the string representation of the byte array, maybe to some more sensible information, e.g. the name associated with it (a filename?), its length, and so on. After all, what does that byte array actually represent?
Edit: I can't remember seeing the "approximately 30us" phrase in your question; maybe you edited it in within 5 minutes after asking. But this is micro-optimization and it should certainly not cause "halved performance" in general. Unless you write them a million times per second (and even then, why would you want to do that? Aren't you overusing logging?).
Take a look here: Faster new String(bytes, cs/csn) and String.getBytes(cs/csn)
