I'm trying to convert the data contained in a database table into a set of triples, so I'm writing an OWL file using the Jena Java library.
I have successfully done it with a small number of table records (100), which corresponds to nearly 20,000 lines in the .owl file, and I'm happy with the result.
To write the owl file I have used the following code (m is an OntModel object):
BufferedWriter out = null;
try {
    out = new BufferedWriter(new FileWriter(FILENAME));
    m.write(out);
    out.close();
} catch (IOException e) {}
Unfortunately, when I try to do the same with the entire result set of the table (800,000 records), the Eclipse console shows me the exception:
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
The exception is raised by
m.write(out);
I'm absolutely sure the model is filled correctly, because I tried to execute the program without creating the OWL file and everything worked fine.
To fix it, I tried to increase the heap memory by setting -Xmx4096M in Run -> Run Configurations -> VM arguments, but the error still appears.
I'm running the application on a MacBook, so I don't have unlimited memory. Is there any chance to complete the task? Maybe there is a more efficient way to store the model?
The default output format, RDF/XML, is a "pretty" form, but calculating the "pretty" layout requires quite a lot of work before writing starts. This includes building up internal data structures, and some shapes of data cause quite extensive searching for the "most pretty" variation.
RDF/XML in pretty form is the most expensive format to write. Even the pretty Turtle form is cheaper, though it still involves some preparation calculations.
To write in RDF/XML in a simpler format, with no complex pretty features:
RDFDataMgr.write(System.out, m, RDFFormat.RDFXML_PLAIN);
Output streams are preferred, and the output will be UTF-8; new BufferedWriter(new FileWriter(FILENAME)) will use the platform default character set.
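For example, a minimal sketch of writing the model straight to a file through an OutputStream (reusing the question's FILENAME constant and model m; exception handling omitted):
import java.io.FileOutputStream;
import java.io.OutputStream;
import org.apache.jena.riot.RDFDataMgr;
import org.apache.jena.riot.RDFFormat;

// Write the model in plain (non-pretty) RDF/XML; RDFDataMgr always emits UTF-8.
try (OutputStream out = new FileOutputStream(FILENAME)) {
    RDFDataMgr.write(out, m, RDFFormat.RDFXML_PLAIN);
}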
See the documentation for other formats and variations:
https://jena.apache.org/documentation/io/rdf-output.html
such as RDFFormat.TURTLE_BLOCKS.
Related
The Apache Commons CSV project works quite well for parsing comma-separated values, tab-delimited data, and similar data formats.
My impression is that this tool reads a file entirely, with the resulting line objects kept in memory. But I am not sure; I cannot find any documentation about this behavior.
For parsing very large files, I would like to do an incremental read, one line at a time or perhaps a relatively small number of lines at a time, to avoid overwhelming memory limitations.
With regard only to the aspect of memory usage, the idea here is like how a SAX parser for XML reads incrementally to minimize use of RAM versus a DOM style XML parser that reads a document entirely into memory to provide tree-traversal.
Questions:
What is the default behavior of Apache Commons CSV with regard to reading documents: Entirely into memory, or incremental?
Can this behavior be altered between incremental and entire-document?
My impression is that this tool reads a file entirely with the resulting line objects kept in memory
No. The use of memory is governed by how you choose to interact with your CSVParser object.
The Javadoc for CSVParser addresses this issue explicitly, in its sections Parsing record wise versus Parsing into memory, with a caution:
Parsing into memory may consume a lot of system resources depending on the input. For example if you're parsing a 150MB file of CSV data the contents will be read completely into memory.
I took a quick glance at the source code, and indeed parsing record wise seems to be reading from its input source a chunk at a time, not all at once. But see for yourself.
Parsing record wise
In the section Parsing record wise, it shows how to incrementally read one CSVRecord at a time by looping over the Iterable that is CSVParser.
CSVParser parser = CSVParser.parse(csvData, CSVFormat.RFC4180);
for (CSVRecord csvRecord : parser) {
    ...
}
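For a large file on disk, the same record-wise loop can be wrapped in try-with-resources so the parser (and its underlying reader) is closed cleanly; the file name below is just a placeholder:
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// Records are materialized one at a time as the loop advances,
// so only the current record needs to fit in memory.
try (CSVParser parser = CSVParser.parse(
        new File("big-file.csv"), StandardCharsets.UTF_8, CSVFormat.RFC4180)) {
    for (CSVRecord record : parser) {
        String firstColumn = record.get(0); // process and discard each record here
    }
}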
Parsing into memory
In contrast, the Parsing into memory section shows the use of CSVParser::getRecords to load all the CSVRecord objects into a List all at once, in memory. So obviously a very large input file could blow out memory on a constrained machine.
Reader in = new StringReader("a;b\nc;d");
CSVParser parser = new CSVParser(in, CSVFormat.EXCEL);
List<CSVRecord> list = parser.getRecords();
I need an XML parser to parse a file that is approximately 1.8 GB.
So the parser should not load the whole file into memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
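A minimal StAX sketch, assuming the big document is essentially a long sequence of record-like elements (the file and element names are placeholders):
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Pull events one at a time; only the current event is held in memory.
XMLStreamReader reader = XMLInputFactory.newInstance()
        .createXMLStreamReader(new FileInputStream("huge.xml"));
while (reader.hasNext()) {
    int event = reader.next();
    if (event == XMLStreamConstants.START_ELEMENT
            && "record".equals(reader.getLocalName())) {
        // extract the data you need for this record, then move on
    }
}
reader.close();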
Use a SAX based parser that presents you with the contents of the document in a stream of events.
The StAX API is easier to deal with compared to SAX. Here is a short tutorial.
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly store it someplace else (database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on, and you have to pay attention to some things, like when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element is read. So you can keep track of the current xpath in the document, identify which paths have the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
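As a rough sketch of that idea (the file name and the path-tracking logic are placeholders, not from the question):
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PathTrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>(); // current element path
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.push(qName);
        text.setLength(0); // characters() may arrive in several calls, so collect per element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // If the current path marks the end of a chunk you care about,
        // hand off the collected data here (to a database, another file, ...).
        path.pop();
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("huge.xml"), new PathTrackingHandler());
    }
}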
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem: I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but which wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4GB and I had an 8GB machine, so I figured maybe 3GB of the file was just text, and java.lang.String would probably need 6GB for that text, given its UTF-16 representation.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection will result in the pages getting accessed in a random-order manner and also objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to a file on disk (the file system can obviously handle a sequential write of 3GB just fine, and when reading it back, the OS will use available memory as a file-system cache; there might still be random-access reads, but fewer than during a GC in Java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done), and it has (AFAIK) no limit on the size of the files it can process.
I have a working ANTLR4 compiler which works well for files up to ~300MB if I set the Java VM size to 8G with -Xmx8G. However, larger files crash the parser/compiler with a heap out-of-memory message. I have been advised to check my code for memory consumption outside of the ANTLR4 process (data below). I'm using a token factory and unbuffered char and token streams.
One strategy I'm working with is to test the size of the input file/stream, if knowable; in my case it is. If the file is small, I parse using my top-level rule, which generates a parse tree that is large but works for small files.
If the file is larger than an arbitrary threshold, I attempt to divide the parsing into chunks by selecting a sub-rule. So for small files I parse the rule patFile (existing working code); for large files I'm exploring breaking things up by parsing the sub-rule "patFileHeader", followed by parsing the rule "bigPatternRec", which replaces the "patterns+" portion of the former rule.
In this way my expectation is that I can control how much of the token stream is read in.
At the moment this looks promising, but I see issues with controlling how much ANTLR4 parses when processing the header. I likely have a grammar rule that causes patFileHeader to consume all available input tokens before exiting. Other cases seem to work, but I'm still testing. I'm just not sure that this approach to solving "large file" parsing is viable.
SMALL file Example Grammar:
patFile : patFileHeader patterns+
// {System.out.println("parser encountered patFile");}
;
patFileHeader : SpecialDirective? includes* gbl_directives* patdef
;
patterns : patdata+ patEnd
// {System.out.println("parser encountered patterns");}
;
bigPatternRec : patdata
| patEnd
;
...
In my case for a small file, I create the parse tree with:
parser = new myParser(tokens);
tree = parser.patFile(); // rule that parses to EOF
walker = walk(mylisteners, tree);
Which will parse the entire file to EOF.
For larger files I considered the following technique:
// Process the first few lines of the file
tree = parser.patFileHeader(); // sub rule that does not parse to EOF
walker=walk(mylisteners,tree);
//
// Process remaining lines one line/record at a time
//
while (inFile.available() > 0) {
    parser = new myParser(tokens);
    tree = parser.bigPatternRec();
    walker = walk(mylisteners, tree);
}
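For reference, the unbuffered char/token stream setup mentioned above typically looks roughly like the sketch below; the file name is a placeholder and the lexer/parser class names are assumed to match the question's generated classes, so treat this as an illustration rather than the asker's exact code:
import java.io.FileInputStream;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CommonToken;
import org.antlr.v4.runtime.CommonTokenFactory;
import org.antlr.v4.runtime.TokenStream;
import org.antlr.v4.runtime.UnbufferedCharStream;
import org.antlr.v4.runtime.UnbufferedTokenStream;

CharStream input = new UnbufferedCharStream(new FileInputStream("big.pat"));
myLexer lexer = new myLexer(input);                  // hypothetical generated lexer
lexer.setTokenFactory(new CommonTokenFactory(true)); // copy token text, since the char stream is unbuffered
TokenStream tokens = new UnbufferedTokenStream<CommonToken>(lexer);
myParser parser = new myParser(tokens);
// Optionally, skip building parse trees entirely and attach a listener with
// parser.addParseListener(...) instead of walking a tree afterwards:
// parser.setBuildParseTree(false);
tree = parser.patFileHeader();                       // then bigPatternRec() per chunk, as above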
In response to a suggestion that I profile the behavior, I have generated this screenshot of JVMonitor on the "whole file" during processing of my project.
One thing of interest to me was the three Context sets of ~398MB. In my grammar, vec is a component of vecdata, so it appears that some context data is getting replicated. I may play with that. It's possible that the char[] entry is my code outside of ANTLR4; I'd have to disable my listeners and run to generate the parse tree without my code active to be sure. I do other things that consume memory (MappedByteBuffers) for high-speed file I/O on output, which will contribute to exceeding the 8GB image.
What is interesting, however, is what happens to the memory image if I break the calls up and JUST process sub-rules. The memory consumption is ~10% of the full size, and the ANTLR4 objects are not even on the radar in that case.
Friends,
In my application, I came across a scenario where the user may request a report download as a flat file, which may have up to 1.7 million (17 lakh) records, around 650 MB of data. During this request, either my application server stops serving other threads or an out-of-memory exception occurs.
As of now I am iterating through the result set and printing it to the file.
When I googled for this, I came across a library named OpenCSV. I tried that too, but I didn't see any improvement in performance.
Please help me out on this.
Thanks for the quick responses, guys. Here is my code snippet:
try {
    response.setContentType("application/csv");
    PrintWriter dout = response.getWriter();
    while (rs.next()) {
        dout.print(data row); // here I print my ResultSet tuples into the flat file
        dout.print("\r\n");
        dout.flush();
    }
} catch (Exception e) {
    // ...
}
OpenCSV will cleanly deal with the eccentricities of the CSV format, but a large report is still a large report. Take a look at the specific memory error; it sounds like you need to increase the heap or the max PermGen space (it will depend on the error). Without any adjusting, the JVM will only occupy a fixed amount of RAM (in my experience this number is 64 MB).
If you only stream the data from the result set to the file without using big buffers, this should be possible, but maybe you are first collecting the data in a growing list before sending it to the file? You should investigate this.
Please specify your question in more detail; otherwise we have to speculate.
The CSV format isn't limited by memory; at most, memory matters while prepopulating the data for the CSV, but this can be done efficiently as well, for example by querying subsets of rows from the DB using LIMIT/OFFSET and immediately writing them to the file, instead of hauling the entire DB table contents into Java's memory before writing any line. (The Excel limitation on the number of rows in one "sheet" only increases to about one million in newer versions.)
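A hedged sketch of that paging idea with plain JDBC; the table and column names, the page size, and the connection variable are hypothetical, not taken from the question:
import java.io.PrintWriter;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Assumes `connection` is an open java.sql.Connection and `response` is the servlet response.
// Each page is written out and flushed before the next one is fetched,
// so only one page of rows is ever held in memory.
int pageSize = 10_000;
response.setContentType("application/csv");
try (PrintWriter dout = response.getWriter();
     PreparedStatement ps = connection.prepareStatement(
             "SELECT col1, col2 FROM report_data ORDER BY id LIMIT ? OFFSET ?")) {
    for (int offset = 0; ; offset += pageSize) {
        ps.setInt(1, pageSize);
        ps.setInt(2, offset);
        int rows = 0;
        try (ResultSet page = ps.executeQuery()) {
            while (page.next()) {
                dout.print(page.getString(1));
                dout.print(',');
                dout.print(page.getString(2));
                dout.print("\r\n");
                rows++;
            }
        }
        dout.flush();
        if (rows < pageSize) {
            break; // last (partial) page reached
        }
    }
}
This keeps memory use roughly constant regardless of how many records the report contains, at the cost of one query per page.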
Most decent DBs have an export-to-CSV function which can undoubtedly do this task much more efficiently. In the case of MySQL, for example, you can use SELECT ... INTO OUTFILE for this (LOAD DATA INFILE is its import counterpart).
I'm using iText to create an RTF document. It'll have a few hundred pages when completed. However, I keep getting an OutOfMemoryError when it has finished adding all the various paragraphs and tables to the document and is trying to actually create the RTF file (with document.close()).
I've increased the runtime memory with -Xmx350m, but it's not feasible to increase it any further, as the users' computers won't have that much memory.
Is there a way to gradually write to the RTF document, rather than in a huge block at the end?
I found you can set it to explicitly cache on disk rather than memory using:
Document document = new Document();
RtfWriter2 writer2 = RtfWriter2.getInstance(document, new FileOutputStream("document.rtf"));
writer2.getDocumentSettings().setDataCacheStyle(RtfDataCache.CACHE_DISK);
document.open();
This makes it slower to generate but at least it creates the file without error. However, I'd still prefer a method which gradually creates the file if anyone knows one.