SharedStringsTable payload before parsing [duplicate]


I am trying to open a large (>30 MB) .xlsx file and copy all the rows and columns (more than 200k rows) of its sheet into a sheet of a new workbook. I get an error on the following code:
FileInputStream fis = new FileInputStream(file);
XSSFWorkbook newWorkBook = new XSSFWorkbook(fis);
Increasing the heap space does not help. After much research, I understand that a workaround is to use either XSSF and SAX (Event API) or XLSX2CSV.java. I just need to copy all the data from the old sheet to the new sheet. After trying SAX, I am stuck, as I am not sure how to get the values from the old sheet so that I can copy them to the new sheet. Also, empty cells are not included in the SST, and I need to copy over all cells, including empty ones.
@Override
public void startElement(String uri, String localName, String name,
        Attributes attributes) throws SAXException {
    // c => cell
    if (name.equals("c")) {
        // Print the cell reference
        System.out.print(attributes.getValue("r") + " - ");
        // Figure out if the value is an index in the SST
        String cellType = attributes.getValue("t");
        System.out.println("CellType " + cellType);
        isNextString = "s".equals(cellType);
    }
    // Clear last content
    lastContents = "";
}
I am using Java, though I can try C#.

Related

How to detect excel cell reference style of a file using apache POI?

I get an excel file through front end and I do not know what is the user preferred cell reference style (A1 or R1C1) for that file. I want to display the header with column position as present in the file.
For example, if the file is using the R1C1 reference style, the column positions should be shown as 1, 2, 3..., and for A1 references as A, B, C...
I want to achieve this using Java apache POI. Any lead in this will be helpful.
Thanks in advance.

The reference mode in use (either A1 or R1C1) can be stored in the Excel file, but it may be omitted. In that case Excel defaults to the setting last used in the application.
In the old binary *.xls file format (HSSF) it is stored using a RefModeRecord in the worksheet's record stream. It is stored for each worksheet separately, although it cannot differ between sheets of the same workbook.
In the Office Open XML file format (*.xlsx, XSSF) it is stored in xl/workbook.xml, in the refMode attribute of the calcPr element.
Neither is directly supported by Apache POI up to now. But if one knows the internal structure of the file formats, it can be set and read using the following code:
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.record.RecordBase;
import org.apache.poi.hssf.record.RefModeRecord;
import org.apache.poi.hssf.model.InternalSheet;
import java.lang.reflect.Field;
import java.util.List;
public class CreateExcelRefModes {
static void setRefMode(HSSFWorkbook hssfWorkbook, String refMode) throws Exception {
for (Sheet sheet : hssfWorkbook) {
HSSFSheet hssfSheet = (HSSFSheet)sheet;
Field _sheet = HSSFSheet.class.getDeclaredField("_sheet");
_sheet.setAccessible(true);
InternalSheet internalsheet = (InternalSheet)_sheet.get(hssfSheet);
Field _records = InternalSheet.class.getDeclaredField("_records");
_records.setAccessible(true);
@SuppressWarnings("unchecked")
List<RecordBase> records = (List<RecordBase>)_records.get(internalsheet);
RefModeRecord refModeRecord = null;
for (RecordBase record : records) {
if (record instanceof RefModeRecord) refModeRecord = (RefModeRecord)record;
}
if ("R1C1".equals(refMode)) {
if (refModeRecord == null) {
refModeRecord = new RefModeRecord();
records.add(records.size() - 1, refModeRecord);
}
refModeRecord.setMode(RefModeRecord.USE_R1C1_MODE);
} else if ("A1".equals(refMode)) {
if (refModeRecord == null) {
refModeRecord = new RefModeRecord();
records.add(records.size() - 1, refModeRecord);
}
refModeRecord.setMode(RefModeRecord.USE_A1_MODE);
}
}
}
static String getRefMode(HSSFWorkbook hssfWorkbook) throws Exception {
for (Sheet sheet : hssfWorkbook) {
HSSFSheet hssfSheet = (HSSFSheet)sheet;
Field _sheet = HSSFSheet.class.getDeclaredField("_sheet");
_sheet.setAccessible(true);
InternalSheet internalsheet = (InternalSheet)_sheet.get(hssfSheet);
Field _records = InternalSheet.class.getDeclaredField("_records");
_records.setAccessible(true);
@SuppressWarnings("unchecked")
List<RecordBase> records = (List<RecordBase>)_records.get(internalsheet);
RefModeRecord refModeRecord = null;
for (RecordBase record : records) {
if (record instanceof RefModeRecord) refModeRecord = (RefModeRecord)record;
}
if (refModeRecord == null) return "not specified";
if (refModeRecord.getMode() == RefModeRecord.USE_R1C1_MODE) return "R1C1";
if (refModeRecord.getMode() == RefModeRecord.USE_A1_MODE) return "A1";
}
return null;
}
static void setRefMode(XSSFWorkbook xssfWorkbook, String refMode) {
if ("R1C1".equals(refMode)) {
if (xssfWorkbook.getCTWorkbook().getCalcPr() == null) xssfWorkbook.getCTWorkbook().addNewCalcPr();
xssfWorkbook.getCTWorkbook().getCalcPr().setRefMode(org.openxmlformats.schemas.spreadsheetml.x2006.main.STRefMode.R_1_C_1);
} else if ("A1".equals(refMode)) {
if (xssfWorkbook.getCTWorkbook().getCalcPr() == null) xssfWorkbook.getCTWorkbook().addNewCalcPr();
xssfWorkbook.getCTWorkbook().getCalcPr().setRefMode(org.openxmlformats.schemas.spreadsheetml.x2006.main.STRefMode.A_1);
}
}
static String getRefMode(XSSFWorkbook xssfWorkbook) {
if (xssfWorkbook.getCTWorkbook().getCalcPr() == null) return "not specified";
if (xssfWorkbook.getCTWorkbook().getCalcPr().getRefMode() == org.openxmlformats.schemas.spreadsheetml.x2006.main.STRefMode.R_1_C_1) return "R1C1";
if (xssfWorkbook.getCTWorkbook().getCalcPr().getRefMode() == org.openxmlformats.schemas.spreadsheetml.x2006.main.STRefMode.A_1) return "A1";
return null;
}
public static void main(String[] args) throws Exception {
Workbook workbook = new XSSFWorkbook(); String filePath = "./CreateExcelRefModes.xlsx";
//Workbook workbook = new HSSFWorkbook(); String filePath = "./CreateExcelRefModes.xls";
Sheet sheet = workbook.createSheet();
if (workbook instanceof XSSFWorkbook) {
XSSFWorkbook xssfWorkbook = (XSSFWorkbook)workbook;
setRefMode(xssfWorkbook, "R1C1" );
//setRefMode(xssfWorkbook, "A1" );
System.out.println(getRefMode(xssfWorkbook));
} else if (workbook instanceof HSSFWorkbook) {
HSSFWorkbook hssfWorkbook = (HSSFWorkbook)workbook;
setRefMode(hssfWorkbook, "R1C1" );
//setRefMode(hssfWorkbook, "A1" );
System.out.println(getRefMode(hssfWorkbook));
}
FileOutputStream out = new FileOutputStream(filePath);
workbook.write(out);
out.close();
workbook.close();
}
}
But the question is: why? Microsoft Excel uses the A1 reference mode by default when storing formulas, so you will never find R1C1 formulas in stored Excel files. Office Open XML stores formulas as strings in XML, and although the Office Open XML specification allows R1C1 there, even Microsoft Excel itself never stores R1C1 formula strings. The old binary *.xls file format stores formulas as binary Ptg records, which are independent of their string representation. The conversion to R1C1 is done in the Excel GUI only: the Excel application performs it while parsing the file and keeps two representations of each formula in memory, one in A1 and one in R1C1 notation. So both kinds of formulas are available in the GUI and in VBA.
Apache POI does not support R1C1 formulas so far. To do so, it would have to perform the conversion programmatically, as the Excel application does. But that code is not publicly available and has not been reverse engineered by Apache POI up to now.
With current Apache POI versions, reflection is no longer necessary. HSSFSheet has a method getSheet which returns the InternalSheet, and InternalSheet has a method getRecords which returns the List<RecordBase>.
So the code could be changed as follows:
...
/*
Field _sheet = HSSFSheet.class.getDeclaredField("_sheet");
_sheet.setAccessible(true);
InternalSheet internalsheet = (InternalSheet)_sheet.get(hssfSheet);
*/
InternalSheet internalsheet = hssfSheet.getSheet();
/*
Field _records = InternalSheet.class.getDeclaredField("_records");
_records.setAccessible(true);
@SuppressWarnings("unchecked")
List<RecordBase> records = (List<RecordBase>)_records.get(internalsheet);
*/
List<RecordBase> records = internalsheet.getRecords();
...
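For the original goal of showing the column headers in the file's own reference style, a minimal sketch in plain Java (no POI calls) might look like the following. The class and method names are made up for illustration, the refMode strings are the ones returned by getRefMode above, and falling back to A1 when no mode is stored is only an assumption, since Excel itself falls back to the last setting used in the application.

```java
final class ColumnLabels {

    /** Converts a zero-based column index to an A1-style label: 0 -> A, 25 -> Z, 26 -> AA. */
    public static String a1Label(int columnIndex) {
        StringBuilder label = new StringBuilder();
        int n = columnIndex + 1; // the letter scheme is easiest to compute on one-based numbers
        while (n > 0) {
            int remainder = (n - 1) % 26;
            label.insert(0, (char) ('A' + remainder));
            n = (n - 1) / 26;
        }
        return label.toString();
    }

    /** In R1C1 mode the column label is simply the one-based column number. */
    public static String r1c1Label(int columnIndex) {
        return Integer.toString(columnIndex + 1);
    }

    /**
     * Picks the label style from the stored reference mode.
     * Assumption: anything other than "R1C1" (including "not specified") is shown as A1.
     */
    public static String headerLabel(int columnIndex, String refMode) {
        return "R1C1".equals(refMode) ? r1c1Label(columnIndex) : a1Label(columnIndex);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            System.out.println(headerLabel(i, "A1") + " / " + headerLabel(i, "R1C1"));
        }
    }
}
```

If POI is on the classpath anyway, org.apache.poi.ss.util.CellReference.convertNumToColString should do the same index-to-letters conversion, so the hand-written a1Label is only needed when you want to avoid that dependency.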

How to read large Excel file using POI in Java? [duplicate]

I have a large .xlsx file (141 MB, containing 293413 lines with 62 columns each) I need to perform some operations within.
I am having problems with loading this file (OutOfMemoryError), as POI has a large memory footprint on XSSF (xlsx) workbooks.
This SO question is similar, and the solution presented is to increase the VM's allocated/maximum memory.
It seems to work for that kind of file size (9 MB), but for me it simply doesn't work, even if I allocate all available system memory. (Well, that's no surprise considering the file is over 15 times larger.)
I'd like to know if there is any way to load the workbook so that it won't consume all the memory, and yet without doing the processing based on (going into) XSSF's underlying XML. (In other words, keeping a pure POI solution.)
If there isn't, though, you are welcome to say so ("There isn't.") and point me towards an "XML" solution.

I was in a similar situation with a webserver environment. The typical size of the uploads were ~150k rows and it wouldn't have been good to consume a ton of memory from a single request. The Apache POI Streaming API works well for this, but it requires a total redesign of your read logic. I already had a bunch of read logic using the standard API that I didn't want to have to redo, so I wrote this instead: https://github.com/monitorjbl/excel-streaming-reader
It's not entirely a drop-in replacement for the standard XSSFWorkbook class, but if you're just iterating through rows it behaves similarly:
import com.monitorjbl.xlsx.StreamingReader;
InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
StreamingReader reader = StreamingReader.builder()
.rowCacheSize(100) // number of rows to keep in memory (defaults to 10)
.bufferSize(4096) // buffer size to use when reading InputStream to file (defaults to 1024)
.sheetIndex(0) // index of sheet to use (defaults to 0)
.read(is); // InputStream or File for XLSX file (required)
for (Row r : reader) {
for (Cell c : r) {
System.out.println(c.getStringCellValue());
}
}
There are some caveats to using it; due to the way XLSX sheets are structured, not all data is available in the current window of the stream. However, if you're just trying to read simple data out from the cells, it works pretty well for that.

An improvement in memory usage can be achieved by using a File instead of a stream.
(It is better to use a streaming API, but the streaming APIs have limitations, see http://poi.apache.org/spreadsheet/index.html)
So instead of
Workbook workbook = WorkbookFactory.create(inputStream);
do
Workbook workbook = WorkbookFactory.create(new File("yourfile.xlsx"));
This is according to : http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream
Files vs InputStreams
"When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file."

The Excel support in Apache POI, HSSF and XSSF, supports 3 different modes.
One is a full, DOM-Like in-memory "UserModel", which supports both reading and writing. Using the common SS (SpreadSheet) interfaces, you can code for both HSSF (.xls) and XSSF (.xlsx) basically transparently. However, it needs lots of memory.
POI also supports a streaming read-only way to process the files, the EventModel. This is much more low-level than the UserModel, and gets you very close to the file format. For HSSF (.xls) you get a stream of records, and optionally some help with handling them (missing cells, format tracking etc). For XSSF (.xlsx) you get streams of SAX events from the different parts of the file, with help to get the right part of the file and also easy processing of common but small bits of the file.
For XSSF (.xlsx) only, POI also supports a write-only streaming write, suitable for low level but low memory writing. It largely just supports new files though (certain kinds of append are possible). There is no HSSF equivalent, and due to back-and-forth byte offsets and index offsets in many records it would be pretty hard to do...
For your specific case, as described in your clarifying comments, I think you'll want to use the XSSF EventModel code. See the POI documentation to get started, then try looking at these three classes in POI and Tika which use it for more details.

POI now includes an API for these cases: SXSSF, see http://poi.apache.org/spreadsheet/index.html
It does not load everything into memory, so it may allow you to handle such a file.
Note: I have read that SXSSF works as a writing API. Loading should be done using XSSF from a File rather than from an InputStream (to avoid loading the whole file into memory).

Check this post, where I show how to use a SAX parser to process an XLSX file.
https://stackoverflow.com/a/44969009/4587961
In short, I extended org.xml.sax.helpers.DefaultHandler, which processes the XML structure of XLSX files. It is an event parser (SAX).
class SheetHandler extends DefaultHandler {
private static final String ROW_EVENT = "row";
private static final String CELL_EVENT = "c";
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
private List<String> cellCache = new LinkedList<>();
private List<String[]> rowCache = new LinkedList<>();
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
if (CELL_EVENT.equals(name)) {
String cellType = attributes.getValue("t");
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
} else if (ROW_EVENT.equals(name)) {
if (!cellCache.isEmpty()) {
rowCache.add(cellCache.toArray(new String[cellCache.size()]));
}
cellCache.clear();
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
nextIsString = false;
}
// v => contents of a cell
// Output after we've seen the string contents
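if(name.equals("v")) {
cellCache.add(lastContents);
} else if (ROW_EVENT.equals(name) && !cellCache.isEmpty()) {
// Flush the row when it ends; otherwise the last row of the sheet is never added
rowCache.add(cellCache.toArray(new String[0]));
cellCache.clear();
}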
}
public void characters(char[] ch, int start, int length)
throws SAXException {
lastContents += new String(ch, start, length);
}
public List<String[]> getRowCache() {
return rowCache;
}
}
And then I parse the XML representing the XLSX file:
private List<String []> processFirstSheet(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename, PackageAccess.READ);
XSSFReader r = new XSSFReader(pkg);
SharedStringsTable sst = r.getSharedStringsTable();
SheetHandler handler = new SheetHandler(sst);
XMLReader parser = fetchSheetParser(handler);
Iterator<InputStream> sheetIterator = r.getSheetsData();
if (!sheetIterator.hasNext()) {
return Collections.emptyList();
}
InputStream sheetInputStream = sheetIterator.next();
BufferedInputStream bisSheet = new BufferedInputStream(sheetInputStream);
InputSource sheetSource = new InputSource(bisSheet);
parser.parse(sheetSource);
List<String []> res = handler.getRowCache();
bisSheet.close();
return res;
}
public XMLReader fetchSheetParser(ContentHandler handler) throws SAXException {
XMLReader parser = new SAXParser();
parser.setContentHandler(handler);
return parser;
}
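One thing the handler above does not cover is empty cells: the sheet XML simply omits them, so the cached arrays shift to the left whenever a cell is missing. A possible way around this is to read the r attribute of each c element and pad up to that column. The sketch below uses only the JDK's SAX parser on a small hand-written XML string; GapFillingHandler, its methods and the sample XML are hypothetical, the shared-strings lookup is left out for brevity, and it assumes cells appear in column order within a row.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class GapFillingHandler extends DefaultHandler {
    private final List<List<String>> rows = new ArrayList<>();
    private final StringBuilder value = new StringBuilder();
    private List<String> currentRow;
    private int currentColumn;
    private boolean inValue;

    /** Converts the letters of a cell reference such as "AB12" to a zero-based column index. */
    public static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char ch = cellRef.charAt(i);
            if (ch < 'A' || ch > 'Z') break; // stop at the row digits
            col = col * 26 + (ch - 'A' + 1);
        }
        return col - 1;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("row".equals(qName)) {
            currentRow = new ArrayList<>();
        } else if ("c".equals(qName)) {
            String ref = attrs.getValue("r");
            // Without a reference, assume the cell directly follows the previous one.
            currentColumn = (ref != null) ? columnIndex(ref) : currentRow.size();
            // Pad with empty strings for skipped columns, plus a slot for this cell.
            while (currentRow.size() <= currentColumn) currentRow.add("");
        } else if ("v".equals(qName)) {
            value.setLength(0);
            inValue = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inValue) value.append(ch, start, length); // may be called more than once per value
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("v".equals(qName)) {
            currentRow.set(currentColumn, value.toString());
            inValue = false;
        } else if ("row".equals(qName)) {
            rows.add(currentRow); // store the row as soon as it ends
        }
    }

    /** Parses sheet XML held in a string; only meant for this demonstration. */
    public static List<List<String>> parse(String sheetXml) {
        GapFillingHandler handler = new GapFillingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(sheetXml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sheet XML", e);
        }
        return handler.rows;
    }

    public static void main(String[] args) {
        String xml = "<worksheet><sheetData>"
            + "<row r=\"1\"><c r=\"A1\"><v>1</v></c><c r=\"C1\"><v>3</v></c></row>"
            + "<row r=\"2\"><c r=\"B2\"><v>5</v></c></row>"
            + "</sheetData></worksheet>";
        System.out.println(parse(xml));
    }
}
```

The same padding idea should carry over to the SheetHandler by calling a helper like columnIndex in its startElement. Note that this only fills gaps before the last stored cell of a row; trailing empty cells and completely missing rows would need the sheet's dimensions or the row's r attribute as well.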

Based on monitorjbl's answer and the test suite explored from POI, the following worked for me on a multi-sheet xlsx file with 200K records (size > 50 MB):
import com.monitorjbl.xlsx.StreamingReader;
. . .
try (
InputStream is = new FileInputStream(new File("sample.xlsx"));
Workbook workbook = StreamingReader.builder().open(is);
) {
DataFormatter dataFormatter = new DataFormatter();
for (Sheet sheet : workbook) {
System.out.println("Processing sheet: " + sheet.getSheetName());
for (Row row : sheet) {
for (Cell cell : row) {
String value = dataFormatter.formatCellValue(cell);
}
}
}
}

With the latest version of the library, use this:
InputStream file = new FileInputStream(
        new File("uploads/" + request.getSession().getAttribute("username") + "/" + userFile));
Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100) // number of rows to keep in memory
        .bufferSize(4096)  // buffer size to use when reading InputStream to file
        .open(file);       // InputStream or File for XLSX file (required)
DataFormatter dataFormatter = new DataFormatter();
Iterator<Row> rowIterator = workbook.getSheetAt(0).rowIterator();
while (rowIterator.hasNext()) {
    Row row = rowIterator.next();
    Iterator<Cell> cellIterator = row.cellIterator();
    while (cellIterator.hasNext()) {
        Cell cell = cellIterator.next();
        String cellValue = dataFormatter.formatCellValue(cell);
    }
}

You can use SXSSF instead of HSSF. With it I could generate an Excel file with 200000 rows.

Read UTF-8 encoded text content inside table cell in MS-word file using Apache POI

I'm trying to read a table and extract data from a Microsoft Word document (docx file) using Apache POI. The file contains UTF-8 encoded characters (Sinhala language). I'm using the following code block.
FileInputStream fis = new FileInputStream("path\\to\\file.docx");
XWPFDocument doc = new XWPFDocument(fis);
Iterator<IBodyElement> iter = doc.getBodyElementsIterator();
while (iter.hasNext()) {
IBodyElement elem = iter.next();
if (elem instanceof XWPFTable) {
List<XWPFTableRow> rows = ((XWPFTable) elem).getRows();
for(XWPFTableRow row :rows){
List<XWPFTableCell> cells = row.getTableCells();
for(XWPFTableCell cell : cells){
PrintStream out = new PrintStream(System.out, true, "UTF-8");
out.println(cell.getText());
}
}
}
}
But I'm not getting the correct UTF-8 characters in the output console.
I have already referred to several solutions, including the following.
How to parse UTF-8 characters in Excel files using POI | I'm trying to read a table in a Word file, so my cell object doesn't have a getStringCellValue() method.
http://www.herongyang.com/Java-Tools/native2ascii-Set-UTF-8-Encoding-in-PrintStream.html | I have already tried this solution and it's not working.
Does anyone know how to read UTF-8 encoded characters in a Word file using Apache POI?

I found a solution by setting the font for a cell (as a paragraph).
Code:
private static final String FILE_NAME = "/tmp/Diskade.docx";
public static void main(String[] args) throws IOException {
FileInputStream fis = new FileInputStream(FILE_NAME);
XWPFDocument doc = new XWPFDocument(fis);
Iterator<IBodyElement> iter = doc.getBodyElementsIterator();
while (iter.hasNext()) {
IBodyElement elem = iter.next();
if (elem instanceof XWPFTable) {
List<XWPFTableRow> rows = ((XWPFTable) elem).getRows();
for(XWPFTableRow row :rows){
List<XWPFTableCell> cells = row.getTableCells();
for(XWPFTableCell cell : cells){
String celltext = cell.getText();
XWPFParagraph paragraph = cell.addParagraph();
setRun(paragraph.createRun() , "Arial" , 10, "2b5079" , celltext , false, false);
System.out.print(cell.getParagraphs().get(0).getParagraphText() + " - ");
}
System.out.println();
}
}
}
}
private static void setRun (XWPFRun run , String fontFamily , int fontSize , String colorRGB , String text , boolean bold , boolean addBreak) {
run.setFontFamily(fontFamily);
run.setFontSize(fontSize);
run.setColor(colorRGB);
run.setText(text);
run.setBold(bold);
if (addBreak) run.addBreak();
}
EDIT:
Later I noticed that actually adding the paragraph is enough. You don't need the setRun method or the call setRun(paragraph.createRun() , "Arial" , 10, "2b5079" , celltext , false, false);.
I will see whether anything can be done about the encoding (because, for me, once the font was loaded it was working fine even without the paragraph).
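On the encoding side, cell.getText() returns an ordinary Java String, so one plausible place for the characters to get lost is the console output rather than POI. That is an assumption worth checking on your own setup. A small JDK-only sketch for that check is below: it pushes a sample Sinhala string through a UTF-8 PrintStream and compares the bytes that come out. Utf8OutputCheck and the sample text are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

final class Utf8OutputCheck {

    /** Prints text through a UTF-8 PrintStream and returns the raw bytes that were written. */
    public static byte[] encodeViaPrintStream(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(buffer, true, "UTF-8")) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is required to exist on every JVM
        }
        return buffer.toByteArray();
    }

    /** True when the text survives the trip through the UTF-8 PrintStream unchanged. */
    public static boolean roundTrips(String text) {
        byte[] bytes = encodeViaPrintStream(text);
        return new String(bytes, StandardCharsets.UTF_8).equals(text);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Hypothetical sample: the word "Sinhala" in Sinhala script, written as escapes
        // so the source file's own encoding does not matter.
        String sample = "\u0DC3\u0DD2\u0D82\u0DC4\u0DBD";
        System.out.println("round trip ok: " + roundTrips(sample));

        // Create the console stream once instead of once per table cell.
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println(sample);
    }
}
```

If the round trip succeeds but the console still shows garbage, the terminal's code page or font is the more likely culprit, which would also fit the observation that things started working once a suitable font was loaded.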

csv to xls java, there is a column with some currency (like $400)

I want to convert CSV to XLS using Java. Everything is working fine, but there is a column with currency values (like $400) in the CSV file.
When these currency values are written to the XLS file, Excel shows green flags, and we have to click and change the data type from string to number to get rid of them.
My idea is to check whether the value in a cell starts with '$' and, if so, set the cell type to number. But how do I implement that in code?
Please help, I'm new to Excel and Java too :P
//all imports are proper
public class Convert_CSV_XLS {
public static void main(String[] args) throws Exception{
/* Step -1 : Read input CSV file in Java */
String inputCSVFile = "csv_2_xls.csv";
CSVReader reader = new CSVReader(new FileReader(inputCSVFile));
/* Variables to loop through the CSV File */
String [] nextLine; /* for every line in the file */
int lnNum = 0; /* line number */
/* Step -2 : Define POI Spreadsheet objects */
HSSFWorkbook new_workbook = new HSSFWorkbook(); //create a blank workbook object
HSSFSheet sheet = new_workbook.createSheet("CSV2XLS"); //create a worksheet named CSV2XLS
/* Step -3: Define logical Map to consume CSV file data into excel */
Map<String, Object[]> excel_data = new LinkedHashMap<String, Object[]>(); //insertion-ordered map so rows keep the CSV order (a plain HashMap does not guarantee it)
/* Step -4: Populate data into logical Map */
while ((nextLine = reader.readNext()) != null) {
lnNum++;
excel_data.put(Integer.toString(lnNum), new Object[] {nextLine[0],nextLine[1]});
}
/* Step -5: Create Excel Data from the map using POI */
Set<String> keyset = excel_data.keySet();
int rownum = 0;
for (String key : keyset) { //loop through the data and add them to the cell
Row row = sheet.createRow(rownum++);
Object [] objArr = excel_data.get(key);
int cellnum = 0;
for (Object obj : objArr) {
Cell cell = row.createCell(cellnum++);
// Now here i want to check if first char of the value in cell is '$' or not.
if(obj instanceof Double)
cell.setCellValue((Double)obj);
else
cell.setCellValue((String)obj);
}
}
/* Write XLS converted CSV file to the output file */
FileOutputStream output_file = new FileOutputStream(new File("CSV2XLS.xls")); //create XLS file
new_workbook.write(output_file);//write converted XLS file to output stream
output_file.close(); //close the file
}
}

You should be able to check to see if a String starts with $ or not pretty easily. For setting cell types, it should look like this if you want Numeric:
cell.setCellType(HSSFCell.CELL_TYPE_NUMERIC);
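The "starts with $" check itself can be done in plain Java before anything is written to the cell. The following is a minimal sketch under a few assumptions: amounts look like "$400" or "$1,234.50", negative amounts and other currencies are not handled, and CurrencyCells and parseDollarAmount are made-up names.

```java
import java.util.OptionalDouble;

final class CurrencyCells {

    /**
     * Returns the numeric amount for strings such as "$400" or "$1,234.50",
     * or an empty result when the text is not a plain dollar amount.
     */
    public static OptionalDouble parseDollarAmount(String text) {
        if (text == null) {
            return OptionalDouble.empty();
        }
        String trimmed = text.trim();
        if (!trimmed.startsWith("$")) {
            return OptionalDouble.empty();
        }
        // Drop the currency sign and the thousands separators.
        String digits = trimmed.substring(1).replace(",", "").trim();
        // Only accept plain decimal numbers; anything else stays a string cell.
        if (!digits.matches("\\d+(\\.\\d+)?")) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(Double.parseDouble(digits));
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"$400", "$1,234.50", "hello", "$abc"}) {
            OptionalDouble amount = parseDollarAmount(raw);
            System.out.println(raw + " -> " + (amount.isPresent() ? amount.getAsDouble() : "keep as text"));
        }
    }
}
```

In the conversion loop you would then call cell.setCellValue with the parsed double when a value is present and fall back to the string otherwise. Calling setCellValue with a double already makes the cell numeric, and the currency sign can be brought back for display with a cell style and data format.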

Solved.
Used DataFormatter - https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DataFormatter.html
cell.setCellValue(Double.parseDouble(str));
CellStyle cellStyle = wb.createCellStyle();
// A custom pattern is not one of the built-in formats, so register it through the workbook's DataFormat
cellStyle.setDataFormat(wb.createDataFormat().getFormat("£#,##0;[Red]-£#,##0"));
cell.setCellStyle(cellStyle);
