InputBuffer in Jackson parser - java

I use Jackson for parsing JSON files. The file is passed as a stream to Jackson to create a parser. Here is the sample code:
JsonFactory f = new JsonFactory();
JsonParser parser = f.createParser(inputStream);
I know that createParser() prefetches data from the stream into an input buffer, and subsequent calls to nextToken() are served from this buffer. In my application, along with parsing, I also want to keep track of the offset in the inputStream up to which I have consumed data. Due to this buffering, offset tracking does not work.
Does anyone know if there is a way to disable buffering in Jackson? Or, is there an API call that I can use to determine if the buffer has data that has not yet been consumed?

Why not use JsonParser.getTokenLocation() or JsonParser.getCurrentLocation() to keep track of the file offset?
The returned object seems to have the byte position (in addition to the character position), which should be the position in the underlying input stream...
http://fasterxml.github.io/jackson-core/javadoc/2.2.0/com/fasterxml/jackson/core/JsonParser.html#getCurrentLocation()
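A minimal sketch of that idea, assuming the parser is created from an InputStream (byte offsets are only reported for byte-based sources; the file name is just for illustration):
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.FileInputStream;
import java.io.InputStream;

public class OffsetTrackingSketch {
    public static void main(String[] args) throws Exception {
        JsonFactory f = new JsonFactory();
        try (InputStream in = new FileInputStream("data.json");   // hypothetical input file
             JsonParser parser = f.createParser(in)) {
            JsonToken token;
            while ((token = parser.nextToken()) != null) {
                // Byte offset where the current token starts in the underlying stream
                long tokenStart = parser.getTokenLocation().getByteOffset();
                // Byte offset the parser has actually parsed up to, regardless of
                // how far ahead the internal buffer has prefetched
                long parsedUpTo = parser.getCurrentLocation().getByteOffset();
                System.out.println(token + " starts at " + tokenStart
                        + ", parsed up to " + parsedUpTo);
            }
        }
    }
}
The key point is that these offsets reflect parsing progress rather than the raw read position of the stream, so the prefetching buffer does not distort them.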

Related

How to put a limit on memory usage while parsing a JSON string?

How can I rule out a java.lang.OutOfMemoryError when calling
new JSONObject(longAndMalformedJSONString)
I'm using the org.json implementation of a JSON parser.
I'm not looking for a way to decode the bad JSON String. I just want to put an upper limit on memory usage (and possibly CPU usage) and maybe get an exception that I can recover from.
Or, alternatively, is it safe to say that memory usage while parsing will never exceed a certain ratio of the input string length? Then I could just limit that.
Or is there an alternate library that offers that?
There are two approaches to reading serialized data (JSON, XML, whatever): either you parse the entire input and keep the resulting object in memory, or you navigate the stream via the provided API and keep only the pieces you are interested in. It seems org.json doesn't have a streaming API, but more sophisticated libraries like Gson do:
Gson gson = new Gson();
JsonReader reader = new JsonReader(new InputStreamReader(in, "UTF-8")); // 'in' is your InputStream
List<Message> messages = new ArrayList<Message>();
reader.beginArray();
while (reader.hasNext()) {
    // Deserialize one array element at a time; only this element is held in memory
    Message message = gson.fromJson(reader, Message.class);
    messages.add(message);
}
reader.endArray();
reader.close();
You can also put limits on the input, but it depends on the protocol you use for transferring the JSON payload.
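As a sketch of that last idea (the wrapper class and the limit value below are illustrative, not part of any particular library), you can cap the number of bytes you are willing to read before the stream ever reaches the parser:
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: fails fast once more than maxBytes have been read.
class LimitedInputStream extends FilterInputStream {
    private final long maxBytes;
    private long readSoFar;

    LimitedInputStream(InputStream in, long maxBytes) {
        super(in);
        this.maxBytes = maxBytes;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b != -1 && ++readSoFar > maxBytes) {
            throw new IOException("Input exceeds limit of " + maxBytes + " bytes");
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0 && (readSoFar += n) > maxBytes) {
            throw new IOException("Input exceeds limit of " + maxBytes + " bytes");
        }
        return n;
    }
}
With something like new LimitedInputStream(in, 10 * 1024 * 1024) wrapped around the input before it reaches the JsonReader, an oversized payload is rejected with an IOException instead of exhausting the heap.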

Split byte array by certain byte (character)

I am building a history parser; there's an application that already does the logging (text based).
Now my supervisor wants me to create an application to read that log.
The log is created at the end of the month, and is separated by [date]:
[19-11-2014]
- what goes here
- what goes here
[20-11-2014]
- what goes here
- what goes here
etc...
If the log file is small, there's no problem reading the content with a DataInputStream to get a byte[], converting it to a String, and then doing the filtering (with substring and such).
But when the file is large (about 100 MB), it throws a Java heap space error (OutOfMemoryError). I assume this is because the content is too large to hold as a String; when I don't convert the byte[] into a String, no exception is thrown.
Now the question is, how do I split the byte[] into several byte[]?
where each new byte[] contains only a single:
[date]
- what goes here
So if within a month we have 9 dates in log, it would be split into 9 byte[].
The splitting marker would be based on [\\d{2}-\\d{2}-\\d{4}]; if it were a String I could just use a regex to find all the markers, get their indexOf positions, and then substring.
But how do I do this without converting to a String first, since that would blow the Java heap?
I think there are several concepts here that you're missing.
First, an InputStream is a stream, which means it is a flow of bytes. What you do with that flow is up to you, but saving the entire stream to memory defeats the point of the stream construct altogether.
Second, a DataInputStream is used to read primitive data from a binary file that was written there by a DataOutputStream. Reading just a string is overkill for this type of stream, since a simple InputStream can do that.
As for your specific problem, I would use a BufferedReader (wrapped around a FileReader) and read one line at a time until reaching the next date. At that point you can do whatever processing you need on the chunk of lines you have read, then free the memory, thus not running into the same problem.
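A minimal sketch of that approach, assuming the [dd-MM-yyyy] header format shown in the question (the file name and the processChunk method are hypothetical placeholders for your actual filtering):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class LogSplitter {
    // Matches a date header line such as [19-11-2014]
    private static final Pattern DATE_HEADER = Pattern.compile("\\[\\d{2}-\\d{2}-\\d{4}\\]");

    public static void main(String[] args) throws IOException {
        List<String> chunk = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("history.log"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (DATE_HEADER.matcher(line).matches() && !chunk.isEmpty()) {
                    processChunk(chunk);   // handle the previous date's entries
                    chunk.clear();         // free the memory before reading on
                }
                chunk.add(line);
            }
            if (!chunk.isEmpty()) {
                processChunk(chunk);       // last date in the file
            }
        }
    }

    private static void processChunk(List<String> lines) {
        // hypothetical: filter, substring, store, etc.
        System.out.println("Chunk starting with " + lines.get(0) + " has " + lines.size() + " lines");
    }
}
Only one date's worth of lines is ever held in memory, so the file size no longer matters.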

How to reposition a file by having a track on bytes which were decoded to corresponding characters?

The question may be quite vague, so let me expand on it here.
I'm developing an application in which I'll be reading data from a file. I have a FileReader class which opens the file in the following fashion:
currentFileStream = new FileInputStream(currentFile);
fileChannel = currentFileStream.getChannel();
Data is read as follows:
bytesRead = fileChannel.read(buffer); // Data is buffered using a ByteBuffer
I process the data in one of two forms: binary or character.
If it's processed as characters, I do an additional step of decoding the ByteBuffer into a CharBuffer:
CoderResult result = decoder.decode(byteBuffer, charBuffer, false);
Now my problem: in case of a failure or crash in the application, I need to resume reading by repositioning the file to some offset during recovery.
For this, I maintain a byteOffset which keeps track of the number of bytes processed in binary mode, and I persist this variable.
If something happens I reposition the file like this
fileChannel.position(byteOffset);
which is pretty straightforward.
But if the processing mode is character, I maintain a recordOffset which keeps track of the character position/offset in the file. During recovery I make internal calls to read() until I reach the character offset persisted as recordOffset + 1.
Is there any way to get the number of bytes that were consumed to decode those characters? For instance, if recordOffset is 400, its corresponding byteOffset might be 410 or 480 or so (depending on the charset). Then while repositioning I could do this:
fileChannel.position(recordOffset); //recordOffset equivalent value in number of bytes
instead of making repeated calls internally in my application.
The other approach I could think of was using InputStreamReader's skip method.
If there is a better approach for this, or if it's possible to get a byte-to-character mapping, please let me know.
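For what it's worth, one way to build such a mapping yourself (a sketch only, assuming a single CharsetDecoder for the whole file; the file name is illustrative) is to add up, after each decode() call, how many bytes were consumed and how many chars were produced. The running totals give a byte offset for each character offset, at buffer-boundary granularity:
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public class OffsetMapper {
    public static void main(String[] args) throws IOException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        ByteBuffer byteBuffer = ByteBuffer.allocate(8192);
        CharBuffer charBuffer = CharBuffer.allocate(8192);

        long bytesConsumed = 0;   // byte offset in the file up to which characters were decoded
        long charsProduced = 0;   // character offset corresponding to bytesConsumed

        try (FileChannel fileChannel = new FileInputStream("data.txt").getChannel()) {
            while (fileChannel.read(byteBuffer) != -1) {
                byteBuffer.flip();
                decoder.decode(byteBuffer, charBuffer, false); // final flush (endOfInput=true) omitted for brevity

                bytesConsumed += byteBuffer.position();  // bytes actually decoded in this pass
                charsProduced += charBuffer.position();  // chars produced in this pass

                // Persist the pair (charsProduced -> bytesConsumed) here as a checkpoint.

                charBuffer.clear();
                byteBuffer.compact();  // keep any incomplete multi-byte sequence for the next read
            }
        }
        System.out.println(charsProduced + " chars correspond to " + bytesConsumed + " bytes");
    }
}
On recovery you could then call fileChannel.position() with the persisted byte total of the latest checkpoint at or below recordOffset, and decode forward from there instead of re-reading from the start of the file.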

Jackson JSON Streaming API: Read an entire object directly to String

I'm trying to stream in an array of JSON, object by object, but I need to import it as a raw JSON String.
Given an array of input like so:
[
{"object":1},
{"object":2},
...
{"object":n}
]
I am trying to iterate through the Strings:
{"object":1}
{"object":2}
...
{"object":n}
I can navigate the structure using the streaming API to validate that I have encountered an object, and all that, but I don't think the way I'm getting my String back is ideal.
Currently:
//[...]
//we have read a START_OBJECT token
JsonNode node = parser.readValueAsTree();
String jsonString = anObjectMapper.writeValueAsString(node);
//as opposed to String jsonString = node.toString() ;
//[...]
I imagine the building of the whole JsonNode structure involves a bunch of overhead, which is pointless if I'm just reserializing, so I'm looking for a better solution. Something along the lines of this would be ideal:
//[...]
//we have read a START_OBJECT token
String jsonString = parser.readValueAsString()
//or parser.skipChildrenAsString()
//[...]
The objects are obviously not as simple as
{"object":1}
which is why I'm looking to not waste time doing pointless node building. There may be some ideal way, involving mapping the content to objects and working with that, but I am not in a position where I am able to do that. I need the raw JSON string, one object at a time, to work with existing code.
Any suggestions or comments are appreciated. Thanks!
EDIT : parser.getText() returns the current token as text (e.g. START_OBJECT -> "{"), but not the rest of the object.
Edit 2: The motivation for using the Streaming API is to read objects in one by one. The actual JSON files can be quite large, and each object can be discarded after use, so I simply need to iterate through.
There is no way to avoid JSON tokenization (otherwise the parser wouldn't know where objects start and end, etc.), so it will always involve some level of parsing and generation.
But you can reduce overhead slightly by reading values as a TokenBuffer -- it is Jackson's internal type with the lowest memory/performance overhead (and is used internally whenever things need to be buffered):
TokenBuffer buf = parser.readValueAs(TokenBuffer.class);
// write straight from buffer if you have JsonGenerator
jgen.writeObject(buf);
// or, if you must, convert to byte[] or String
byte[] stuff = mapper.writeValueAsBytes(buf);
We can do a bit better, however: if you can create a JsonGenerator for the output, just use JsonGenerator.copyCurrentStructure(JsonParser):
jgen.copyCurrentStructure(jp); // points to END_OBJECT after copy
This will avoid all object allocation; and although it still needs to decode the JSON and encode it back as JSON, it will be rather efficient.
And you can in fact use this even for transcoding -- read JSON, write XML/Smile/CSV/YAML/Avro -- between any formats Jackson supports.
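A sketch of that copyCurrentStructure idea applied to the question's use case, writing each array element through a StringWriter-backed generator so it comes back as a raw JSON String (file name and error handling are illustrative):
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import java.io.FileInputStream;
import java.io.StringWriter;

public class PerObjectStrings {
    public static void main(String[] args) throws Exception {
        JsonFactory factory = new JsonFactory();
        try (JsonParser parser = factory.createParser(new FileInputStream("objects.json"))) {
            if (parser.nextToken() != JsonToken.START_ARRAY) {
                throw new IllegalStateException("Expected a JSON array");
            }
            while (parser.nextToken() == JsonToken.START_OBJECT) {
                StringWriter writer = new StringWriter();
                try (JsonGenerator generator = factory.createGenerator(writer)) {
                    // Copies the whole object the parser is positioned on; afterwards
                    // the parser points at that object's END_OBJECT token.
                    generator.copyCurrentStructure(parser);
                }
                String jsonString = writer.toString();  // e.g. {"object":1}
                // ... hand jsonString to the existing code, then discard it
            }
        }
    }
}
Each iteration holds only one object's worth of tokens and text, so arbitrarily large arrays can be streamed through.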

Can ANTLR4 java parser handle very large files or can it stream files

Is the java parser generated by ANTLR capable of streaming arbitrarily large files?
I tried constructing a Lexer with an UnbufferedCharStream and passed that to the parser. I got an UnsupportedOperationException because of a call to size() on the UnbufferedCharStream, and the exception contained an explanation that you can't call size() on an UnbufferedCharStream.
Lexer lexer = new Lexer(new UnbufferedCharStream(new CharArrayReader("".toCharArray())));
CommonTokenStream stream = new CommonTokenStream(lexer);
Parser parser = new Parser(stream);
I basically have a file I exported from Hadoop using Pig. It has a large number of rows separated by '\n', and each column is split by a '\t'. This is easy to parse in Java: I use a buffered reader to read each line, then split on '\t' to get each column. But I also want some sort of schema validation: the first column should be a properly formatted date, followed by some price columns, followed by some hex columns.
When I look at the generated parser code I could call it like so
parser.lines().line()
This would give me a List which, conceptually, I could iterate over. But it seems that the list would have a fixed size by the time I get it, which means the parser has probably already parsed the entire file.
Is there another part of the API that would allow you to stream really large files? Like some way of using the Visitor or Listener to get called as it is reading the file? But it can't keep the entire file in memory. It will not fit.
You could do it like this:
InputStream is = new FileInputStream(inputFile); // inputFile is the path to your input file
ANTLRInputStream input = new ANTLRInputStream(is);
GeneratedLexer lex = new GeneratedLexer(input);
lex.setTokenFactory(new CommonTokenFactory(true));
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lex);
GeneratedParser parser = new GeneratedParser(tokens);
parser.setBuildParseTree(false); // !! essential: don't build the parse tree
parser.top_level_rule();
And if the file is quite big, forget about the listener or visitor - I would create objects directly in the grammar (with embedded actions). Just put them all in some structure (e.g. a HashMap or Vector) and retrieve them as needed. This way, creating the parse tree (which is what really takes a lot of memory) is avoided.
