Parsing xlsx files in chunks via a streaming/pagination strategy using Apache POI - Java

We have a case where xlsx and xlsm files holding large amounts of data (on the order of 80-100 MB) cause out-of-memory (heap space) errors on our servers when they are loaded with the load() method of the Workbook object, which takes a FileInputStream as its parameter.
The intent is to load the data, validate the cell contents, and report an error if any record is invalid. If all the data is correct, it is written to the table. Hence the following did not serve my purpose:
Error While Reading Large Excel Files (xlsx) Via Apache POI
The problem involves paginated parsing, validating the data, and then writing it to the database.

Since xlsx files are zip archives that hold their content as XML, you can remove pages with simple parse-and-discard logic, producing a smaller content XML. Then build a smaller xlsx from it and open that with Apache POI. Develop the parsing against a test xlsx. The XML generally has no line breaks or indentation, so an XML beautifier or tree editor may help. Excel uses shared strings, so the actual content is hard to see in the sheet XML.
Use a zip file system (URLs "jar:file://... .xlsx") to operate on the xlsx.
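A minimal sketch of the zip file system idea, using only the JDK. The class name XlsxZipSketch and the method listEntries are made up for illustration. Sheet XML normally sits under xl/worksheets/ inside the archive, but it is worth listing the entries of your own file to confirm.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

class XlsxZipSketch {

    // Lists every entry inside the xlsx (zip) container without unpacking it to disk.
    static List<String> listEntries(Path xlsx) throws IOException {
        // Builds a "jar:file:..." URI so the JDK's zip file system provider opens the archive.
        URI uri = URI.create("jar:" + xlsx.toUri());
        List<String> names = new ArrayList<>();
        try (FileSystem zip = FileSystems.newFileSystem(uri, Collections.<String, Object>emptyMap());
             Stream<Path> walk = Files.walk(zip.getPath("/"))) {
            walk.filter(Files::isRegularFile).forEach(p -> names.add(p.toString()));
        }
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of an xlsx file as the first argument.
        for (String name : listEntries(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Inside the same try block, Files.newInputStream(zip.getPath(...)) gives a stream over a single entry, which can then be handed to a streaming XML parser.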

A StAX parser is a good approach to this situation.
https://docs.oracle.com/javase/tutorial/jaxp/stax/index.html
We can iterate over the sheets to obtain the shared-string index stored in each cell, and use the SharedStringsTable object to look up the actual value at a particular cell location.
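A rough sketch of the StAX idea, using only the JDK's javax.xml.stream. The class name StaxSheetSketch and the method forEachCell are hypothetical, and a plain List<String> stands in for the SharedStringsTable; getting the real sheet XML and shared strings out of the workbook is not shown here. It assumes the usual sheet layout where a cell is a c element with r (reference) and t (type) attributes and a v child, and it ignores inline strings and formulas.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxSheetSketch {

    // Streams through sheet XML and hands each (cell reference, value) pair to the handler,
    // so only one cell is held in memory at a time.
    // sharedStrings is a stand-in for the shared strings table (assumption for this sketch).
    static void forEachCell(Reader sheetXml, List<String> sharedStrings,
                            BiConsumer<String, String> handler) throws XMLStreamException {
        XMLStreamReader xml = XMLInputFactory.newInstance().createXMLStreamReader(sheetXml);
        String ref = null;
        String type = null;
        try {
            while (xml.hasNext()) {
                if (xml.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = xml.getLocalName();
                if ("c".equals(name)) {
                    // Remember the cell reference (e.g. A1) and its type for the <v> that follows.
                    ref = xml.getAttributeValue(null, "r");
                    type = xml.getAttributeValue(null, "t");
                } else if ("v".equals(name)) {
                    String raw = xml.getElementText();
                    // t="s" means the value is an index into the shared strings.
                    String value = "s".equals(type) ? sharedStrings.get(Integer.parseInt(raw)) : raw;
                    handler.accept(ref, value);
                }
            }
        } finally {
            xml.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String sheet = "<worksheet><sheetData><row r='1'>"
                + "<c r='A1' t='s'><v>0</v></c><c r='B1'><v>42</v></c>"
                + "</row></sheetData></worksheet>";
        forEachCell(new StringReader(sheet), Arrays.asList("id"),
                (ref, value) -> System.out.println(ref + "=" + value));
    }
}
```

The handler is where per-cell validation would go; rows that pass could be buffered in small batches before being written to the database.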

Related

read the contents from html and write in excel sheet in java

Please help me read an HTML file and write it to an Excel file using Java. I have searched the net, but I could only find examples for copying tables. I need the data in the file to be read and written to an Excel file.
Contents of the HTML file:
<title>**Deprecated Method Found**</title>
<h2>**Summary**</h2>
Deprecated API is error-prone and is a potential security threat and thus should not be used.
<h2>**Description**</h2>
Old API is sometimes marked deprecated because its implementation is designed in a way that can be error-prone. Deprecated API should be avoided where possible.
<h2>**Security Implications**</h2>
Blocks of code that use deprecated API are designed in a careless manner and thus are a potential security threat.
I need these headings in separate columns and the contents in rows.
Is it possible to parse this HTML into an Excel file?
Try reading the HTML using jsoup, a Java HTML parser. You can then save the data in CSV (comma-separated values) format, which can be opened in Excel.
jsoup to read text after element
Get all h2 elements using
Elements h2s = document.select("h2");
Then iterate over the elements reading what the heading says. If this heading is important, use the following code to get the text following that tag.
String text = h2.nextSibling().toString();
Sample CSV opened in Excel:

Is it possible to output to a csv file with multiple sheets?

I need to output data to a CSV file from Java, but in that csv file I hope to create multiple sheets so that data can be organized in a better way. After some googling, it seems this is not possible. A CSV file can only receive one-sheet data.
Is this true? If yes, what would be the options? Thank you.
A CSV file is interpreted as a plain sequence of characters following a simple convention, therefore it cannot contain more than one sheet. You can instead output your data to an Excel file that contains more than one sheet using the Apache POI API.
Comma-separated value lists are generally stored in plain text files, which do not have multiple pages. What you could do instead is create multiple CSV files, but if you really have a lot of data, a database might be your best bet.
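A small sketch of the "multiple CSV files" option, using only the JDK: one file per logical sheet, named after the sheet. The class name CsvSheetsSketch, the method names, and the sample sheet names are made up for illustration, and the quoting rule is the common one (quote a field that contains a comma, a quote, or a line break, and double any embedded quotes).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class CsvSheetsSketch {

    // Quotes a field only when it contains a comma, a double quote, or a line break.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Writes one <sheetName>.csv file per map entry into the given directory.
    static void writeSheets(Path dir, Map<String, List<List<String>>> sheets) throws IOException {
        for (Map.Entry<String, List<List<String>>> sheet : sheets.entrySet()) {
            List<String> lines = new ArrayList<>();
            for (List<String> row : sheet.getValue()) {
                lines.add(row.stream().map(CsvSheetsSketch::escape).collect(Collectors.joining(",")));
            }
            Files.write(dir.resolve(sheet.getKey() + ".csv"), lines, StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<List<String>>> sheets = new LinkedHashMap<>();
        sheets.put("orders", Arrays.asList(Arrays.asList("id", "note"), Arrays.asList("1", "a,b")));
        sheets.put("customers", Arrays.asList(Arrays.asList("id", "name"), Arrays.asList("7", "Ann")));
        // Writes orders.csv and customers.csv into the current directory.
        writeSheets(Paths.get("."), sheets);
    }
}
```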

API for Processing large XLSX file

I have one question: is there any API that can process xlsx and xls files? The requirement is that I have an Excel file and must encrypt the values of some specific columns without affecting the cell formatting, such as cell color, cell formulas, date format, currency format, charts, etc. I have tried the Apache POI library without success: it is very slow and does not work on large files. I also searched on Google but didn't find a proper answer.
An alternative to POI is JXL.
We used it successfully with rather large files.

Java: Question about data representation

I need to parse 70 MB of data with Java. I currently have an XML document (one level, no children) in which each document has multiple fields.
I was wondering if I should replace it with a simpler text file in which each row is a document and the fields are comma separated.
Is this going to significantly improve performance? And what if I had, for instance, 4 GB of data instead?
thanks
It would probably be more efficient to use the text file than the XML file if you ever get to a point where you can't fit the whole data set into memory at once. At that point, being able to parse the text file line by line would be better than the XML approach (which I believe loads the whole file into memory).
According to Robin Green, XML parsing only loads the whole file at once if you use DOM; SAX parsing streams.
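To illustrate the SAX point, here is a minimal sketch using the JDK's built-in SAX parser. The class name SaxStreamSketch, the method countElements, and the element name doc are assumptions for the example, since the question does not show the actual XML. The handler sees one element at a time, so the document is never held in memory as a whole.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxStreamSketch {

    // Counts elements with the given name while streaming through the input.
    static int countElements(InputStream xml, String elementName)
            throws ParserConfigurationException, SAXException, IOException {
        int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(xml, new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                // Called once per opening tag; real per-record processing would go here.
                if (elementName.equals(qName)) {
                    count[0]++;
                }
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String sample = "<docs><doc id='1'/><doc id='2'/></docs>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in, "doc"));
    }
}
```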
There are other ways to persist data like this:
Database
Can this data be represented in a database? Java has easy support for most database systems, and you just have to install the right libraries to do so.
Java Properties
An alternative is the Java properties system. It lets you store all your data in a file as key/value pairs and load it back later; Java parses the file when loading it.
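A minimal sketch of the properties approach with java.util.Properties. The class name PropertiesSketch, the file name, and the keys are made up for illustration. Note that Properties keeps every key and value in memory as strings, so it fits small, configuration-like data better than multi-gigabyte data sets.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

class PropertiesSketch {

    // Writes all key/value pairs to the file in the standard properties format.
    static void save(Path file, Properties props) throws IOException {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            props.store(out, "sample records");
        }
    }

    // Loads the key/value pairs back; parsing happens inside Properties.load.
    static Properties load(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(in);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("doc1.title", "Report"); // hypothetical keys
        props.setProperty("doc1.author", "Ann");
        Path file = Paths.get("docs.properties");
        save(file, props);
        System.out.println(load(file).getProperty("doc1.title"));
    }
}
```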

Convert Semicolon Text to Excel

I need to convert a semicolon-delimited file to an Excel file.
However, some fields must be removed from the Excel file, and some extra fields must be added. These extra fields are drop-down fields.
Is there any way to do this? The preferred programming language is Java, but an Excel macro solution is also welcome.
I'm pretty sure you can do this with vanilla Excel. You can either do a global search and replace on semicolon to comma and just open as CSV or use the "Text to Columns" feature.
EDIT: I've not done this programmatically in Java, but in Perl it should be pretty straightforward with Text::xSV and Spreadsheet::WriteExcel
You could look at opencsv and HSSF.
I'm not sure how to do this with an Excel macro, but for Java:
Read the file with FileReader
Use a StringTokenizer with a ";" delimiter to separate the fields
Make an array holding one custom object per row. The object can store arrays for the data needed to populate the drop-down box
Use Apache POI to create an Excel spreadsheet (there are lots of POI examples on Stack Overflow)
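The reading and field-dropping steps above can be sketched with the JDK alone. The class name SemicolonSketch and the method parse are hypothetical, and the POI writing step is left out. This sketch swaps StringTokenizer for String.split(";", -1), because StringTokenizer treats consecutive delimiters as one and silently skips empty fields, which would shift columns. It does not handle quoted fields that contain a semicolon.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class SemicolonSketch {

    // Splits each line on ';' and keeps only the columns listed in keep, in that order.
    static List<List<String>> parse(BufferedReader in, int... keep) throws IOException {
        List<List<String>> rows = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split(";", -1); // -1 keeps trailing empty fields
            List<String> row = new ArrayList<>();
            for (int index : keep) {
                // Short lines get an empty string instead of failing.
                row.add(index < fields.length ? fields[index] : "");
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        String sample = "id;secret;name\n1;x;Ann\n2;;Bob\n";
        // Keep columns 0 and 2, dropping the middle field.
        for (List<String> row : parse(new BufferedReader(new StringReader(sample)), 0, 2)) {
            System.out.println(row);
        }
    }
}
```

Each resulting row could then be written to a sheet with POI, with the extra drop-down columns added at that stage.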
You have two options:
Use Apache POI to write and customise the XLS
Create a sample spreadsheet in Excel, but save it as an HTML page. Take the saved HTML and use it as a template for your data. You can save the output (template+data) as a file with .xls suffix. Even though its content is really HTML it will open correctly.
If you use CSV you won't be able to get additional features such as drop downs or styling.
