I need something like
Triple readNextTriple(stream)
so that I can process the triples one by one.
It is difficult to find instructions for this task in Jena.
It seems that all options lead to model.read(), which reads the whole stream at once.
Does anyone have any suggestions?
See StreamRDF and RDFDataMgr.
https://jena.apache.org/documentation/io/streaming-io.html
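A minimal sketch of that streaming approach (assuming Jena 3.x package names; the file name and the body of triple() are placeholders):

import org.apache.jena.graph.Triple;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFBase;

public class TripleStreamer {
    public static void main(String[] args) {
        // StreamRDFBase is a no-op base class; override only the callbacks you need
        StreamRDF sink = new StreamRDFBase() {
            @Override
            public void triple(Triple triple) {
                // called once per parsed triple, so you process them one by one
                System.out.println(triple);
            }
        };
        // "data.nt" is a placeholder; the format is picked from the file extension
        RDFDataMgr.parse(sink, "data.nt");
    }
}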
Related
I need an XML parser to parse a file that is approximately 1.8 GB.
So the parser should not load the whole file into memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), included in the JDK (package javax.xml.stream); a rough usage sketch follows the links below.
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
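A minimal sketch of the StAX cursor API (the file name and element name are placeholders):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxScan {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // "huge.xml" and "record" are placeholders for your file and the element of interest
        try (FileInputStream in = new FileInputStream("huge.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "record".equals(reader.getLocalName())) {
                    // only the current event is held in memory;
                    // reader.getElementText() would give this element's text content
                    System.out.println("found a record element");
                }
            }
            reader.close();
        }
    }
}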
Use a SAX-based parser that presents you with the contents of the document as a stream of events.
The StAX API is easier to deal with than SAX. Here is a short tutorial
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly store it someplace else (database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable. A lot can go wrong in the middle of 1.8 GB.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on; you have to pay attention to some things, like when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current xpath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
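A minimal sketch of such a path-tracking handler (the file name and the path of interest are placeholders):

import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PathTrackingHandler extends DefaultHandler {
    // path of the element we are currently inside, e.g. catalog/book/title
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // may be called several times per element, so accumulate
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // "catalog/book/title" is a placeholder path for the data of interest
        if (String.join("/", path).equals("catalog/book/title")) {
            System.out.println("title: " + text.toString().trim());
        }
        path.removeLast();
    }

    public static void main(String[] args) throws Exception {
        // "huge.xml" is a placeholder file name
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("huge.xml"), new PathTrackingHandler());
    }
}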
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4GB and I had an 8GB machine, so I figured maybe 3GB of the file was just text, and java.lang.String would probably need 6GB for that text given its UTF-16 representation.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection results in pages being accessed in random order and objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to a file on disk (the FS can obviously handle sequential writes of 3GB just fine, and when reading it back the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done) and it has (AFAIK) no limit as to the size of the files it can process.
I'm a newbie to Hadoop. I did the setup and executed the basic word count Java program. The results look good.
My question is: is it possible to parse an extremely big log file to fetch only a few required lines using the map/reduce classes? Or is some other step required?
Any pointers in this direction will be very useful.
Thanks, Aarthi
Yes, it is entirely possible, and if the file is sufficiently large, I believe Hadoop could prove to be a good way to tackle it, despite what nhahtdh says.
Your mappers could simply act as filters - check the values passed to them, and only if they fit the conditions of a required line do you context.write() them out.
You won't even need to write your own reducer, just use the default reduce() in the Reducer class.
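A rough sketch of such a filtering mapper (the "ERROR" condition and the class name are placeholders; assumes the newer org.apache.hadoop.mapreduce API):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits only the lines that match a condition; everything else is dropped.
public class GrepMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // "ERROR" is a placeholder condition - replace with whatever marks a required line
        if (value.toString().contains("ERROR")) {
            context.write(NullWritable.get(), value);
        }
    }
}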
We have a new requirement:
Some BIG XML files keep coming into our system, and we will need to process them immediately and quickly using Java. Each file is huge, but the required information for our processing is inside an element which is very small.
...
...
What is the best way to extract this small portion of the data from the huge file before we start processing? If we try to load the entire file, we will get an out-of-memory error immediately due to its size. What is an efficient way in Java to get the ..data..data..data.. element without loading the entire file or reading it line by line? Is there any SAX parser that I can use to get this done?
Thank you
SAX parsers are event-based and much faster because they do what you need: they don't hold the entire XML document in memory. There is a SAXParser implementation available in the Java distribution.
I had to parse huge files in a previous project (1G-2G) and didn't want to deal with SAX. I find SAX too low-level in some instances and prefer keeping a traversal approach in most cases.
I have used the VTD library http://vtd-xml.sourceforge.net/. It's an EXTREMELY fast library that uses pointers to navigate through the document.
Well, if you want to read a part of a file, you will still need to read through it to identify the part of interest and then extract what you need.
If you only need a small portion of the incoming XML, you can either use SAX, or if you need to read only specific elements or attributes, you could use XPath, which would be a lot simpler to implement.
Java comes with a built-in SAXParser implementation as well as an XPath implementation. Find the javadocs for SAXParser here and for XPath here.
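A quick sketch of the XPath route (the expression and file name are placeholders; note that the JDK's default XPath implementation builds a DOM behind the scenes, so it only suits files that fit in memory):

import java.io.FileReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public class XPathExtract {
    public static void main(String[] args) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // "/root/payload/data" is a placeholder expression for the small element of interest
        try (FileReader reader = new FileReader("incoming.xml")) {
            String data = xpath.evaluate("/root/payload/data", new InputSource(reader));
            System.out.println(data);
        }
    }
}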
StAX is another option based on streaming the data, like SAX, but it benefits from a more friendly approach (IMO) to processing the data by "pulling" what you want rather than having it "pushed" to you.
I have huge XML files, up to 1-2 GB, and obviously I can't parse the whole file at once; I'd have to split it into parts, then parse the parts and do whatever with them.
How can I count the number of occurrences of a certain node? That way I can keep track of how many parts I need to split the file into. Is there maybe a better way to do this? I'm open to all suggestions, thank you.
Question update:
Well, I did use StAX; maybe the logic I'm using it for is wrong. I'm parsing the file, then for each node I'm getting the node value and storing it inside a StringBuilder. Then in another method I go through the StringBuilder and edit the output. Then I write that output to the file. I can do no more than 10000 objects like this.
Here is the exception I get:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.sun.org.apache.xerces.internal.util.NamespaceSupport.<init>(Unknown Source)
at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.setNamespaceContext(Unknown Source)
at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.getXMLEvent(Unknown Source)
at com.sun.xml.internal.stream.events.XMLEventAllocatorImpl.allocate(Unknown Source)
at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(Unknown Source)
at com.sun.org.apache.xalan.internal.xsltc.trax.StAXEvent2SAX.bridge(Unknown Source)
at com.sun.org.apache.xalan.internal.xsltc.trax.StAXEvent2SAX.parse(Unknown Source)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(Unknown Source)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(Unknown Source)
Actually, I think my whole approach is wrong; what I'm actually trying to do is convert XML files into CSV samples. Here is how I do it so far:
Read/parse the XML file
For each element node, get the text node value
Open a stream and write the values to a temp file; after n nodes, flush and close the stream
Then open another stream to read from the temp file, use Commons string utils and some other stuff to create proper CSV output, then write it to the CSV file
The SAX or StAX API would be your best bet here. They don't parse the whole thing at once; they take one node at a time and let your app process it. They're good for arbitrarily large documents.
SAX is the older API and works on a push model; StAX is newer and is a pull parser, and is therefore rather easier to use. For your requirements, either one would be fine.
See this tutorial to get you started with StAX parsing.
You can use a streaming parser like StAX for this. It will not require you to read the entire file into memory at once.
I think you want to avoid creating a DOM, so SAX or StAX should be good choices.
With SAX, just implement a simple content handler that increments a counter whenever an interesting element is found.
With SAX you don't have to split the file: It's streaming, so it holds only the current bits in memory. It's very easy to write a ContentHandler that just does the counting. And it's very fast (in my experience, almost as fast as simply reading the file).
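A minimal counting handler might look like this (the element name and file name are placeholders):

import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ElementCounter extends DefaultHandler {
    private long count;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // "record" is a placeholder for the node you want to count
        if ("record".equals(qName)) {
            count++;
        }
    }

    public static void main(String[] args) throws Exception {
        ElementCounter counter = new ElementCounter();
        // "huge.xml" is a placeholder file name; the parser streams it, nothing is kept in memory
        SAXParserFactory.newInstance().newSAXParser().parse(new File("huge.xml"), counter);
        System.out.println("count = " + counter.count);
    }
}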
Well, I did use StAX; maybe the logic I'm using it for is wrong. I'm parsing the file, then for each node I'm getting the node value and storing it inside a StringBuilder. Then in another method I go through the StringBuilder and edit the output. Then I write that output to the file. I can do no more than 10000 objects like this.
By this description, I'd say yes, the logic you're using it for is wrong. You're holding on to too much in memory.
Rather than parsing the entire file, storing all the node values into something and then processing the result, you should handle each node as you hit it, and output while parsing.
With more details on what you're actually trying to accomplish and what the input XML and desired output look like, we could probably help streamline it.
You'd be better off using an event-based parser such as SAX.
I think splitting the file is not the way to go. You'd better handle the XML file as a stream and use the SAX API (and not the DOM API).
Even better, you should use XQuery to handle your requests.
Saxon is a good Java / .NET implementation (using SAX) that is amazingly fast, even on big files. The HE version is under the MPL open-source license.
Here is a little example:
java -cp saxon9he.jar net.sf.saxon.Query -qs:"count(doc('/path/to/your/doc/doc.xml')//YouTagToCount)"
With extended VTD-XML, you can load the document efficiently, as it supports memory mapping. Compared to DOM, the memory usage won't explode by an order of magnitude. And you will be able to use XPath to count the number of nodes very easily.
I'm currently trying to read in an XML file, make some minor changes (alter the value of some attributes), and write it back out again.
I had intended to use a StAX parser (javax.xml.stream.XMLStreamReader) to read in each event, see if it was one I wanted to change, and then pass it straight on to the StAX writer (javax.xml.stream.XMLStreamWriter) if no changes were required.
Unfortunately, that doesn't look to be so simple - the writer has no way to take an event type and a parser object, only methods like writeAttribute and writeStartElement. Obviously I could write a big switch statement with a case for every possible type of event which can occur in an XML document, and just write it back out again, but it seems like a lot of trouble for something which should be simple.
Is there something I'm missing that makes it easy to write out a very similar XML document to the one you read in with StAX?
After a bit of mucking around, the answer seems to be to use the Event reader/writer versions rather than the Stream versions.
(i.e. javax.xml.stream.XMLEventReader and javax.xml.stream.XMLEventWriter)
See also http://www.devx.com/tips/Tip/37795, which is what finally got me moving.
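A rough sketch of that event-copying loop (the file names are placeholders; the modification step is only indicated by a comment):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

public class CopyWithStax {
    public static void main(String[] args) throws Exception {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new FileInputStream("in.xml"));
        XMLEventWriter writer = XMLOutputFactory.newInstance()
                .createXMLEventWriter(new FileOutputStream("out.xml"));
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            // inspect the event here and swap in a modified one if needed;
            // unchanged events are passed straight through
            writer.add(event);
        }
        writer.close();
        reader.close();
    }
}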
StAX works pretty well and is very fast. I used it in a project to parse XML files which are up to 20MB. I don't have a thorough analysis, but it was definitely faster than SAX.
As for your question: the difference between streaming and event handling, AFAIK, is control. With the streaming API you can walk through your document step by step and pull the contents you want, whereas with the event-based API you can only react to the events you are interested in.
I know this is a rather old question, but if anyone else is looking for something like this, there is another alternative: the Woodstox Stax2 extension API has the method:
XMLStreamWriter2.copyEventFromReader(XMLStreamReader2 r, boolean preserveEventData)
which copies the currently pointed-to event from stream reader using stream writer. This is not only simple but very efficient. I have used it for similar modifications with success.
(how to get XMLStreamWriter2 etc? All Woodstox-provided instances implement these extended versions -- plus there are wrappers in case someone wants to use "basic" Stax variants, as well)
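A very rough sketch of how that copy loop might look (assumes Woodstox is the StAX provider on the classpath, so the factory-created reader and writer can be cast to the Stax2 interfaces; file names are placeholders and the document start/end events may need extra care):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import org.codehaus.stax2.XMLStreamReader2;
import org.codehaus.stax2.XMLStreamWriter2;

public class Stax2Copy {
    public static void main(String[] args) throws Exception {
        XMLStreamReader2 reader = (XMLStreamReader2) XMLInputFactory.newInstance()
                .createXMLStreamReader(new FileInputStream("in.xml"));
        XMLStreamWriter2 writer = (XMLStreamWriter2) XMLOutputFactory.newInstance()
                .createXMLStreamWriter(new FileOutputStream("out.xml"));
        while (reader.hasNext()) {
            reader.next();
            // copies the event the reader currently points to straight to the writer
            writer.copyEventFromReader(reader, false);
        }
        writer.close();
        reader.close();
    }
}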