Why is there such a large Java performance difference between operating systems? - java

I tested a program I developed and found a significant performance difference depending on the operating system the JVM runs on.
I've spent all day trying to find the cause, but haven't found it.
The logic I developed is as follows:
A user uploads an Excel file in a browser.
The file is read using the POI library.
The string data read from the file is parsed into objects.
The converted objects are saved to a Cassandra database.
The demo source attached at the bottom contains only the data-parsing part.
Even the parsing logic alone shows a serious performance difference.
The environment configuration is as follows:
SpringBoot 2.2.2.RELEASE
JVM options: -Xms16g -Xmx16g
Cassandra 3.11.7
OpenJDK 1.8
I ran 3 tests, and the results are:
Test 1
SpringBoot 2.2.2.RELEASE, JVM options -Xms16g -Xmx16g, Cassandra 3.11.7, OpenJDK 1.8
Windows OS
POI Excel read: 2 seconds
Data parsing: 42 seconds
Database save: 31 seconds
Linux OS
POI Excel read: 2 seconds
Data parsing: 4 seconds
Database save: 5 seconds
Test 2 (save logic removed)
SpringBoot 2.2.2.RELEASE, JVM options -Xms16g -Xmx16g, Cassandra 3.11.7, OpenJDK 1.8
Windows OS
POI Excel read: 2 seconds
Data parsing: 40 seconds
Linux OS
POI Excel read: 1 second
Data parsing: 6 seconds
Test 3
This test was run because the parsing step looked like the source of the performance degradation.
JVM options -Xms16g -Xmx16g, OpenJDK 1.8
Windows OS
POI Excel read: 2 seconds
Data parsing: 2 seconds
Linux OS
POI Excel read: 2 seconds
Data parsing: 1 second
However, the results were unexpected.
When I removed all of the previously developed logic and ran only the data parsing, it became much faster on both systems.
In the real application I trigger this through a simple REST API request from the browser.
I don't know where to start the root-cause analysis. Please help.
Excel sample data to parse: (download link)
package com.example.demo;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class DemoApplication {

    public static final List<String> TABLE_USERNOUN_EXCEL_FIELD = Arrays.asList("keyword", "separation", "foreign");
    public static final List<String> TITLE_USERNOUN_EXCEL_FIELD = Arrays.asList("검색어", "분리정보", "외래어");

    public static void main(String[] args) {
        try {
            Long start = new Date().getTime();
            List<Map<String, Object>> data = readFileData();
            Long end = new Date().getTime();
            System.out.println(data.toString());
            System.out.println("finish time : " + (end - start) / 1000 + "seconds");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static List<Map<String, Object>> readFileData() throws IOException {
        List<String> tableColumns = TABLE_USERNOUN_EXCEL_FIELD;
        List<String> columnNames = TITLE_USERNOUN_EXCEL_FIELD;
        List<Map<String, Object>> contents = new ArrayList<>(); // contents array
        List<Integer> requiredList = Arrays.asList(0, 1);
        Path targetLocation = Paths.get("D:", "usernoun_dic_10000.xlsx");
        InputStream inputStream = new FileInputStream(new File(targetLocation.toString()));
        SXSSFWorkbook workbook = new SXSSFWorkbook(new XSSFWorkbook(inputStream));
        Sheet sheet = workbook.getXSSFWorkbook().getSheetAt(0);
        int rows = sheet.getPhysicalNumberOfRows();
        if (rows < 2) {
            workbook.close();
            inputStream.close();
            return contents;
        }
        Row checkRow = sheet.getRow(0);
        int cellCnt = checkRow.getPhysicalNumberOfCells();
        if (cellCnt < columnNames.size()) {
            workbook.close();
            inputStream.close();
            return contents;
        } else if (cellCnt > columnNames.size()) {
            Object check = getCellValue(checkRow.getCell(tableColumns.size()));
            if (check != null) {
                if (!check.toString().equals("")) {
                    workbook.close();
                    inputStream.close();
                    return contents;
                }
            }
        } else {
            for (int i = 0; i < cellCnt; i++) {
                Object cellVal = getCellValue(checkRow.getCell(i));
                String columnName = "";
                if (cellVal != null) {
                    columnName = cellVal.toString();
                }
                if (columnName.equals("") || !columnName.equals(columnNames.get(i))) {
                    workbook.close();
                    inputStream.close();
                    return contents;
                }
            }
        }
        sheet.removeRow(checkRow);
        List<String> finalTableColumns = tableColumns;
        List<Integer> finalRequiredList = requiredList;
        sheet.forEach(row -> {
            Map<String, Object> content = new HashMap<>(); // contents object
            AtomicReference<Boolean> skip = new AtomicReference<>(false);
            row.forEach(cell -> {
                int cellIdx = cell.getColumnIndex();
                if (cellIdx < finalTableColumns.size()) {
                    Object value = getCellValue(cell);
                    boolean require = finalRequiredList.stream().anyMatch(integer -> integer == cellIdx);
                    if (require) {
                        if (value != null) {
                            if (value.toString().equals("")) {
                                skip.set(true);
                            }
                        } else {
                            skip.set(true);
                        }
                    }
                    content.put(finalTableColumns.get(content.size()), value);
                }
            });
            if (!skip.get()) {
                contents.add(content);
            }
        });
        workbook.close();
        inputStream.close();
        return contents;
    }

    public static Object getCellValue(Cell cell) {
        switch (cell.getCellType()) {
            case BLANK: // Null exception
                return "";
            case ERROR:
                return cell.getErrorCellValue();
            case STRING:
                return cell.getStringCellValue();
            case BOOLEAN:
                return cell.getBooleanCellValue();
            case NUMERIC:
                return cell.getNumericCellValue();
            case FORMULA:
                return cell.getCellFormula();
        }
        return null;
    }
}
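For reference, a minimal harness like the one below (illustrative only, not part of the real project; it simply calls readFileData() from the demo class several times in the same JVM) can help separate JIT warm-up and file-system caching effects from a genuine OS-level difference:
package com.example.demo;

import java.util.List;
import java.util.Map;

public class StageTimer {
    public static void main(String[] args) throws Exception {
        // Run the parse repeatedly in one JVM: if only the first run is slow,
        // JIT warm-up or disk caching is the likely cause rather than the OS.
        for (int run = 1; run <= 3; run++) {
            long start = System.nanoTime();
            List<Map<String, Object>> data = DemoApplication.readFileData();
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("run " + run + ": " + data.size() + " rows in " + elapsedMs + " ms");
        }
    }
}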

Related

Java Iterator is skipping half elements

I wrote a small script to read from a CSV in Java. It takes a CSV and pushes some values from it into a HashMap. My CSV has 110 records (109 without the header), however I get a HashMap with 54 values. When I debug, I can see that at each iteration a line from my CSV is skipped.
Here's the code
package **CENSORED**.utils;

import com.day.cq.dam.api.Asset;
import com.day.cq.dam.api.Rendition;
import com.day.text.csv.Csv;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceResolver;

public class DateFormatUtils {

    private static String dateFormatCsvPath = "/content/dam/csv/country_date_format.csv";

    public static String getDateFormatByLocale(Locale Locale, ResourceResolver resourceResolver) {
        Resource res = resourceResolver.getResource(dateFormatCsvPath);
        Asset asset = res.adaptTo(Asset.class);
        Rendition rendition = asset.getOriginal();
        InputStream is = rendition.adaptTo(InputStream.class);
        HashMap<String, String> localeToFormat = new HashMap<String, String>();
        Csv csv = new Csv();
        try {
            Iterator<String[]> rowIterator = csv.read(is, StandardCharsets.UTF_8.name());
            while (rowIterator.hasNext()) {
                String[] row = rowIterator.next();
                String country = row[1];
                String locale = row[4];
                String dateFormat = row[6];
                localeToFormat.put(locale.toLowerCase() + "_" + country.toLowerCase(), dateFormat);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null; // placeholder: the lookup/return statement was omitted in the original post
    }
}
Here are a few screenshots from my debugging session:
At the 1st iteration, line 2 of my CSV gets added to my HashMap (the header has been skipped).
At the 2nd iteration, line 5 gets added to my HashMap, but lines 3-4 aren't.
At the 3rd iteration, line 8 gets added to my HashMap, but lines 6-7 aren't.
In the end I have 53 elements in my HashMap while I expect 109.
Here's also a sample of my CSV:
ISO 3166 Country Code,ISO639-2 Country Code,Country,ISO 3166 Country Code,ISO639-2 Lang,Language,Date Format
ALB,AL,Albania,sqi,sq,Albanian,yyyy-MM-dd
ARE,AE,United Arab Emirates,ara,ar,Arabic,dd/MM/yyyy
ARG,AR,Argentina,spa,es,Spanish,dd/MM/yyyy
AUS,AU,Australia,eng,en,English,d/MM/yyyy
AUT,AT,Austria,deu,de,German,dd.MM.yyyy
BEL,BE,Belgium,fra,fr,French,d/MM/yyyy
BEL,BE,Belgium,nld,nl,Dutch,d/MM/yyyy
BGR,BG,Bulgaria,bul,bg,Bulgarian,yyyy-M-d
BHR,BH,Bahrain,ara,ar,Arabic,dd/MM/yyyy
BIH,BA,Bosnia and Herzegovina,srp,sr,Serbian,yyyy-MM-dd
BLR,BY,Belarus,bel,be,Belarusian,d.M.yyyy
BOL,BO,Bolivia,spa,es,Spanish,dd-MM-yyyy
BRA,BR,Brazil,por,pt,Portuguese,dd/MM/yyyy
CAN,CA,Canada,fra,fr,French,yyyy-MM-dd
CAN,CA,Canada,eng,en,English,dd/MM/yyyy
Finally, a last screenshot shows that my CSV has the correct EOL on each line.
This is the csv.read() function, from a class made by Adobe for AEM:
public Iterator<String[]> read(InputStream in, String charset) throws IOException {
    if (charset == null) {
        charset = System.getProperty("file.encoding");
    }
    InputStream in = new BufferedInputStream(in, 4096); // as posted: redeclares the parameter "in"
    this.input = new InputStreamReader(in, charset);
    return this.read();
}
I finally went with another solution since I wasn't able to use this one. For posterity: I was developing this for an AEM project, and I decided to leverage the Generic List item in ACS Commons to get a dictionary with all the values I needed instead of reading from a CSV. As #Artistotle stated, there is definitely something wrong with the reader, so I'd advise against using com.day.text.csv.Csv.
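If you do need to stay with the CSV, a plain BufferedReader can serve as a library-agnostic fallback. A minimal sketch (assuming the column layout from the sample above and no quoted commas in the data; the class name is just for illustration):
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class CsvDateFormats {
    // Builds the same "locale_country" -> date format map without com.day.text.csv.Csv.
    public static Map<String, String> read(InputStream is) throws IOException {
        Map<String, String> localeToFormat = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
            reader.readLine(); // skip the header row
            String line;
            while ((line = reader.readLine()) != null) {
                String[] row = line.split(",", -1);
                if (row.length < 7) {
                    continue; // skip malformed lines
                }
                localeToFormat.put(row[4].toLowerCase() + "_" + row[1].toLowerCase(), row[6]);
            }
        }
        return localeToFormat;
    }
}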

How to output large csv file through Apache POI Excel in java?

I am struggling to write 300k rows to a CSV file through Apache POI in Java. I have been trying to generate a CSV file from an Excel file with 300k rows. Every time, I get a GC out-of-memory error when it tries to write to the output CSV file. I even tried splitting the write into chunks of 100k rows. The output file size keeps growing, but my System.out.println statement never gets printed.
import javafx.beans.binding.StringBinding;
import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
import org.apache.poi.hssf.usermodel.HSSFDateUtil;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.*;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.math.BigDecimal;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Timestamp;
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.*;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReadWrite {

    private static Logger logger = LoggerFactory.getLogger(ReadWrite.class);

    public static void main(String[] args) {
        try {
            long startReading = System.currentTimeMillis();
            Path path = Paths.get("/Users/venkatesh/Documents/Citiout_files/citiout300k_2sheets.xlsx");
            byte[] result = new byte[0];
            try {
                result = Files.readAllBytes(path);
            } catch (IOException e) {
                e.printStackTrace();
            }
            InputStream is = new ByteArrayInputStream(result);
            Workbook workbook = WorkbookFactory.create(is);
            long readDone = System.currentTimeMillis() - startReading;
            logger.info("read time " + readDone);
            Sheet sheet = workbook.getSheetAt(1);
            Row firstRow = sheet.getRow(0);
            int headcol = firstRow.getLastCellNum();
            long startTransform = System.currentTimeMillis();
            firstRow.createCell(headcol++).setCellValue("Sold Amount1");
            firstRow.createCell(headcol++).setCellValue("CF_Quantity1");
            firstRow.createCell(headcol++).setCellValue("CF_Quantity2");
            firstRow.createCell(headcol++).setCellValue("CF_TradePrice");
            firstRow.createCell(headcol++).setCellValue("CF_ForwardPrice");
            firstRow.createCell(headcol++).setCellValue("CF_UnrealizedPL");
            firstRow.createCell(headcol++).setCellValue("CF_Quantity1Round");
            firstRow.createCell(headcol++).setCellValue("CF_Quantity2Round");
            firstRow.createCell(headcol++).setCellValue("CF_FXLotKeyNoTradeDate");
            firstRow.createCell(headcol++).setCellValue("CF_FXRoundedKeyNoTradeDate");
            firstRow.createCell(headcol++).setCellValue("CF_SettlementDate");
            for (int i = 1; i <= sheet.getLastRowNum() + 1; i++) {
                String jj = "";
                Row nRow = sheet.getRow(i - 1);
                for (Cell c : nRow) {
                    if (c.getColumnIndex() == 3 && i != 1) {
                        Calendar cal = Calendar.getInstance();
                        Date date1 = new SimpleDateFormat("dd-MMM-yyyy").parse(c.getStringCellValue());
                        cal.setTime(date1);
                        jj = String.valueOf(cal.get(Calendar.MONTH) + 1) + "/" + String.valueOf(cal.get(Calendar.DAY_OF_MONTH)) + "/" + String.valueOf(cal.get(Calendar.YEAR));
                    }
                }
                int count = nRow.getLastCellNum();
                //System.out.println(nRow.getCell(3).getClass());
                nRow.createCell(count++).setCellFormula("G" + i + "*-1");
                nRow.createCell(count++).setCellFormula("E" + i + "/" + "G" + i);
                nRow.createCell(count++).setCellFormula("G" + i + "/E" + i);
                nRow.createCell(count++).setCellFormula("ROUND(ABS(T" + i + "/S" + i + "),6)");
                nRow.createCell(count++).setCellFormula("ROUND(K" + i + ",6)");
                nRow.createCell(count++).setCellFormula("ROUND(N" + i + ",2)");
                nRow.createCell(count++).setCellFormula("ROUND(S" + i + ",0)");
                nRow.createCell(count++).setCellFormula("ROUND(T" + i + ",0)");
                nRow.createCell(count++).setCellFormula("CONCATENATE(T" + i + "," + "\"~\"" + ",S" + i + ")");
                nRow.createCell(count++).setCellFormula("CONCATENATE(X" + i + "," + "\"~\"" + ",Y" + i + ")");
                nRow.createCell(count++).setCellValue(jj);
                c.setCellValue(DateUtil.getExcelDate(calendar.getTime())); // as posted: 'c' and 'calendar' are not in scope here
            }
            long endTransform = System.currentTimeMillis() - startTransform;
            System.out.println("Transformations time " + endTransform);
            final FormulaEvaluator evaluator = workbook.getCreationHelper().createFormulaEvaluator();
            FileWriter writer = new FileWriter(new File("/Users/venkatesh/Documents/cit300k.csv"));
            StringBuilder data = new StringBuilder();
            Iterator<Row> rowIterator = workbook.getSheetAt(1).iterator();
            try {
                while (rowIterator.hasNext()) {
                    Row row = rowIterator.next();
                    Iterator<Cell> cellIterator = row.cellIterator();
                    while (cellIterator.hasNext()) {
                        Cell cell = cellIterator.next();
                        CellType type = cell.getCellType();
                        if (type == CellType.BOOLEAN) {
                            data.append(cell.getBooleanCellValue());
                        } else if (type == CellType.NUMERIC) {
                            data.append(cell.getNumericCellValue());
                        } else if (type == CellType.STRING) {
                            data.append(cell.getStringCellValue());
                        } else if (type == CellType.FORMULA) {
                            switch (evaluator.evaluateFormulaCell(cell)) {
                                case STRING:
                                    data.append(cell.getStringCellValue());
                                    break;
                                case NUMERIC:
                                    data.append(cell.getNumericCellValue());
                                    break;
                            }
                        } else if (type == CellType.BLANK) {
                        } else {
                            data.append(cell + "");
                        }
                        data.append(",");
                    }
                    writer.append(data.toString());
                    writer.append('\n');
                }
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                if (writer != null) {
                    writer.flush();
                    writer.close();
                }
            }
            for (MemoryPoolMXBean mpBean : ManagementFactory.getMemoryPoolMXBeans()) {
                if (mpBean.getType() == MemoryType.HEAP) {
                    System.out.printf(
                            "Name: %s: %s\n",
                            mpBean.getName(), mpBean.getUsage()
                    );
                }
            }
            try {
                workbook.close();
                is.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
20-01-12 19:52:49:267 INFO main ReadWrite:64 - read time 11354
Transformations time 38659
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.TreeMap$Values.iterator(TreeMap.java:1031)
at org.apache.poi.xssf.usermodel.XSSFRow.cellIterator(XSSFRow.java:117)
at org.apache.poi.xssf.usermodel.XSSFRow.iterator(XSSFRow.java:132)
at org.apache.poi.xssf.usermodel.XSSFEvaluationSheet.getCell(XSSFEvaluationSheet.java:86)
at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateFormula(WorkbookEvaluator.java:402)
at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateAny(WorkbookEvaluator.java:275)
at org.apache.poi.ss.formula.WorkbookEvaluator.evaluate(WorkbookEvaluator.java:216)
at org.apache.poi.xssf.usermodel.BaseXSSFFormulaEvaluator.evaluateFormulaCellValue(BaseXSSFFormulaEvaluator.java:56)
at org.apache.poi.ss.formula.BaseFormulaEvaluator.evaluateFormulaCell(BaseFormulaEvaluator.java:185)
at ReadWrite.main(ReadWrite.java:150)
So now that we have a usable stacktrace, it is clear that the problem is NOT happening while writing the CSV file. It is actually happening while you are evaluating a spreadsheet formula. My guess is that the formula is summing across all rows in a sheet ... or something like that.
This is a problem, and there is probably no simple solution.
Here's what the POI documentation says:
File sizes/Memory usage
There are some inherent limits in the Excel file formats. These are defined in class SpreadsheetVersion. As long as you have enough main-memory, you should be able to handle files up to these limits. For huge files using the default POI classes you will likely need a very large amount of memory.
There are ways to overcome the main-memory limitations if needed:
For writing very huge files, there is SXSSFWorkbook which allows to do a streaming write of data out to files (with certain limitations on what you can do as only parts of the file are held in memory).
For reading very huge files, take a look at the sample XLSX2CSV which shows how you can read a file in streaming fashion (again with some limitations on what information you can read out of the file, but there are ways to get at most of it if necessary).
You are clearly running into these memory limitations. Basically, POI is trying to load too much of the spreadsheet into memory ... while you are evaluating the spreadsheet formulae ... and you are filling the heap.
One solution would be to increase the Java heap size. Or if you are already using all available RAM for your heap, run the conversion on a machine with more RAM. A lot of standard PCs have 16GB RAM these days. Maybe it is time for a hardware upgrade? But I'm guessing you have already thought of this.
If increasing the heap size is not viable, then you will need to rewrite your application to use SXSSFWorkbook. Furthermore, you may need to replace your approach of using formula evaluation with doing the calculations in native Java in a way that is compatible with row-by-row streaming of the spreadsheet. (It will depend on what the formulae do.)
Look at the linked example from the POI documentation for ideas.
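For the writing side, this is roughly what the streaming SXSSFWorkbook approach looks like; a minimal sketch only (it creates a new workbook rather than converting yours, and the file names are illustrative), with a row-access window of 100 so only the last 100 rows are held in memory:
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class SxssfWriteSketch {
    public static void main(String[] args) throws Exception {
        SXSSFWorkbook workbook = new SXSSFWorkbook(100); // keep at most 100 rows in memory
        Sheet sheet = workbook.createSheet("data");
        for (int r = 0; r < 300_000; r++) {
            Row row = sheet.createRow(r);
            for (int c = 0; c < 5; c++) {
                Cell cell = row.createCell(c);
                cell.setCellValue("r" + r + "c" + c);
            }
        }
        try (FileOutputStream out = new FileOutputStream("streamed.xlsx")) {
            workbook.write(out);
        }
        workbook.dispose(); // remove the temporary files backing the flushed rows
        workbook.close();
    }
}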

Unexpected record type (org.apache.poi.hssf.record.HyperlinkRecord)

The problem:
I'm just trying to open a .xls file using the Apache POI 4.1.0 library, and it gives the same error as in a similar question from 4 years ago.
I already tried versions 3.12-3.16, and 3.13 as well.
All versions can open a blank .xls, or one I filled in myself, but not this one.
This document is generated automatically, and I need to make a program that accepts it.
I already made a .NET Standard library in C# which works. I tried to use Xamarin Android, but it's a horror: the app weighs 50 MB vs 3 MB due to various terrible SDK linking errors, but that's a different story. So I decided to do it in Kotlin.
The code is from the documentation.
You can check the file on git.
val inputStream = FileInputStream("./test.xls")
val wb = HSSFWorkbook(inputStream)
I expect no errors while opening the .xls.
The actual output is:
Exception in thread "main" java.lang.RuntimeException: Unexpected record type (org.apache.poi.hssf.record.HyperlinkRecord)
at org.apache.poi.hssf.record.aggregates.RowRecordsAggregate.<init>(RowRecordsAggregate.java:97)
at org.apache.poi.hssf.model.InternalSheet.<init>(InternalSheet.java:183)
at org.apache.poi.hssf.model.InternalSheet.createSheet(InternalSheet.java:122)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:354)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:400)
at org.apache.poi.hssf.usermodel.HSSFWorkbook.<init>(HSSFWorkbook.java:381)
at ru.plumber71.toolbox.ExcelParcerKt.main(ExcelParcer.kt:19)
at ru.plumber71.toolbox.ExcelParcerKt.main(ExcelParcer.kt)
The document will not be modified in any way. If there are any other libraries that can just read the dataset or strings from the .xls file, that will be OK.
After some investigation I found the problem with your test.xls file.
According to the file format specifications, all HyperlinkRecords should be together in the Hyperlink Table, which is contained in the Sheet Substream following the cell records. In your case the HyperlinkRecords are between other records (between NumberRecords and LabelSSTRecords). So I suspect it was not Excel that created that test.xls file.
Excel might be tolerant enough to open the file nevertheless, but you cannot expect Apache POI to tolerate all possible violations of the file format as well. If you open the file using Excel and then re-save it, Apache POI is able to create the Workbook after that.
Apache POI cannot repair this the way Excel can. But one could read the POIFSFileSystem in a low-level way and filter out the HyperlinkRecords that sit between other records. That way one could read the content using Apache POI, except of course the hyperlinks.
Example:
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;
import org.apache.poi.poifs.filesystem.DirectoryNode;
import org.apache.poi.hssf.record.Record;
import org.apache.poi.hssf.record.NameRecord;
import org.apache.poi.hssf.record.NameCommentRecord;
import org.apache.poi.hssf.record.HyperlinkRecord;
import org.apache.poi.hssf.record.RecordFactoryInputStream;
import org.apache.poi.hssf.record.RecordFactory;
import org.apache.poi.hssf.model.RecordStream;
import org.apache.poi.hssf.model.InternalWorkbook;
import org.apache.poi.hssf.model.InternalSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFName;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.util.CellReference;
import java.util.List;
import java.util.ArrayList;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.lang.reflect.Constructor;

class ExcelOpenHSSF {

    public static void main(String[] args) throws Exception {
        String fileName = "test(2).xls";
        try (InputStream is = new FileInputStream(fileName);
             POIFSFileSystem fileSystem = new POIFSFileSystem(is)) {

            //find workbook directory entry
            DirectoryNode directory = fileSystem.getRoot();
            String workbookName = "";
            for (String wbName : InternalWorkbook.WORKBOOK_DIR_ENTRY_NAMES) {
                if (directory.hasEntry(wbName)) {
                    workbookName = wbName;
                    break;
                }
            }
            InputStream stream = directory.createDocumentInputStream(workbookName);

            //loop over all records and manipulate if needed
            List<Record> records = new ArrayList<Record>();
            RecordFactoryInputStream recStream = new RecordFactoryInputStream(stream, true);

            //here we filter out the HyperlinkRecords that are between other records (NumberRecords and LabelSSTRecords in that case)
            //System.out.println prints the problematic records
            Record record1 = null;
            Record record2 = null;
            while ((record1 = recStream.nextRecord()) != null) {
                record2 = recStream.nextRecord();
                if (!(record1 instanceof HyperlinkRecord) && (record2 instanceof HyperlinkRecord)) {
                    System.out.println(record1);
                    System.out.println(record2);
                    records.add(record1);
                } else if ((record1 instanceof HyperlinkRecord) && !(record2 instanceof HyperlinkRecord)) {
                    System.out.println(record1);
                    System.out.println(record2);
                    records.add(record2);
                } else {
                    records.add(record1);
                    if (record2 != null) records.add(record2);
                }
            }

            //now create the HSSFWorkbook
            //see https://svn.apache.org/viewvc/poi/tags/REL_4_1_0/src/java/org/apache/poi/hssf/usermodel/HSSFWorkbook.java?view=markup#l322
            InternalWorkbook internalWorkbook = InternalWorkbook.createWorkbook(records);
            HSSFWorkbook wb = HSSFWorkbook.create(internalWorkbook);
            int recOffset = internalWorkbook.getNumRecords();
            Method convertLabelRecords = HSSFWorkbook.class.getDeclaredMethod("convertLabelRecords", List.class, int.class);
            convertLabelRecords.setAccessible(true);
            convertLabelRecords.invoke(wb, records, recOffset);
            RecordStream rs = new RecordStream(records, recOffset);
            while (rs.hasNext()) {
                InternalSheet internelSheet = InternalSheet.createSheet(rs);
                Constructor constructor = HSSFSheet.class.getDeclaredConstructor(HSSFWorkbook.class, InternalSheet.class);
                constructor.setAccessible(true);
                HSSFSheet hssfSheet = (HSSFSheet)constructor.newInstance(wb, internelSheet);
                Field _sheets = HSSFWorkbook.class.getDeclaredField("_sheets");
                _sheets.setAccessible(true);
                @SuppressWarnings("unchecked")
                List<HSSFSheet> sheets = (ArrayList<HSSFSheet>)_sheets.get(wb);
                sheets.add(hssfSheet);
            }
            for (int i = 0; i < internalWorkbook.getNumNames(); ++i) {
                NameRecord nameRecord = internalWorkbook.getNameRecord(i);
                Constructor constructor = HSSFName.class.getDeclaredConstructor(HSSFWorkbook.class, NameRecord.class, NameCommentRecord.class);
                constructor.setAccessible(true);
                HSSFName name = (HSSFName)constructor.newInstance(wb, nameRecord, internalWorkbook.getNameCommentRecord(nameRecord));
                Field _names = HSSFWorkbook.class.getDeclaredField("names");
                _names.setAccessible(true);
                @SuppressWarnings("unchecked")
                List<HSSFName> names = (ArrayList<HSSFName>)_names.get(wb);
                names.add(name);
            }

            //now the workbook is created properly
            System.out.println(wb);

            /*
            //getting the data
            DataFormatter formatter = new DataFormatter();
            Sheet sheet = wb.getSheetAt(0);
            for (Row row : sheet) {
                for (Cell cell : row) {
                    CellReference cellRef = new CellReference(row.getRowNum(), cell.getColumnIndex());
                    System.out.print(cellRef.formatAsString());
                    System.out.print(" - ");
                    String text = formatter.formatCellValue(cell);
                    System.out.println(text);
                }
            }
            */
        }
    }
}
I was able to open a file of this "corrupted" type by using the JExcel API.
Apache POI also opens the file if you manually re-save it using the Excel application (which may not be suitable for everyone).
Sorry for asking strange questions. Thank you all, and I hope someone may find this useful.
val inputStream = FileInputStream("./testCorrupted.xls")
val workbook = Workbook.getWorkbook(inputStream)
val sheet = workbook.getSheet(0)
val cell1 = sheet.getCell(0, 0)
print(cell1.contents + ":")

How to modify a large Excel file when memory is an issue

As the title states, I have a large Excel file (>200 sheets) that I need to add data to. I do not want to create new cells, I only want to modify existing ones.
I tried using Apache Poi but my application runs out of memory even with Xms and Xmx set to 8g. The only option for low-memory writing is seemingly with SXSSF. The problem is that it only works for creating new cells and does not allow modifying existing ones. I also tried using the event API in order to process the sheet's XML, but it only seems to work for read operations. I've been trying to use an XMLEventWriter but I can't find a way to access the sheets' XML data which works for writing. Is there a way to access an Excel file's XML data other than with XSSFReader?
As discussed in the comments above, there is no one-size-fits-all solution based on pure XML reading and writing of Office Open XML spreadsheets. Each Excel workbook needs its own code, depending on its structure and on what content is to be changed.
This is because Apache POI's high-level classes provide a meta level precisely to avoid this, but that meta level needs memory to work, and for very big workbooks it needs a lot of memory. If you avoid that memory consumption by manipulating the XML directly, the meta level is not available, so you must know the XML structure of a worksheet and the meaning of the XML elements used.
So if we have an Excel workbook whose first sheet has strings in column A and numbers in column B, then we could change every fifth row by manipulating the XML directly with StAX, using the following code:
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackagePart;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTRst;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.Characters;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import javax.xml.namespace.QName;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.regex.Pattern;

class StaxReadAndChangeTest {

    public static void main(String[] args) throws Exception {
        File file = new File("ReadAndWriteTest.xlsx");
        OPCPackage opcpackage = OPCPackage.open(file);

        //since there are strings in the sheet data, we need the SharedStringsTable
        PackagePart sharedstringstablepart = opcpackage.getPartsByName(Pattern.compile("/xl/sharedStrings.xml")).get(0);
        SharedStringsTable sharedstringstable = new SharedStringsTable();
        sharedstringstable.readFrom(sharedstringstablepart.getInputStream());

        //get first worksheet
        PackagePart sheetpart = opcpackage.getPartsByName(Pattern.compile("/xl/worksheets/sheet1.xml")).get(0);

        //get XML reader and writer
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(sheetpart.getInputStream());
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sheetpart.getOutputStream());
        XMLEventFactory eventFactory = XMLEventFactory.newInstance();

        int rowsCount = 0;
        int colsCount = 0;
        boolean cellAfound = false;
        boolean cellBfound = false;

        while (reader.hasNext()) { //loop over all XML in sheet1.xml
            XMLEvent event = (XMLEvent)reader.next();
            if (event.isStartElement()) {
                StartElement startElement = (StartElement)event;
                QName startElementName = startElement.getName();
                if (startElementName.getLocalPart().equalsIgnoreCase("row")) { //start element of row
                    rowsCount++;
                    colsCount = 0;
                } else if (startElementName.getLocalPart().equalsIgnoreCase("c")) { //start element of cell
                    colsCount++;
                    cellAfound = false;
                    cellBfound = false;
                    if (rowsCount % 5 == 0) { // every 5th row
                        if (colsCount == 1) { // cell A
                            cellAfound = true;
                        } else if (colsCount == 2) { // cell B
                            cellBfound = true;
                        }
                    }
                } else if (startElementName.getLocalPart().equalsIgnoreCase("v")) { //start element of value
                    if (cellAfound) {
                        // create new rich text content for cell A
                        CTRst ctstr = CTRst.Factory.newInstance();
                        ctstr.setT("changed String Value A" + (rowsCount));
                        //int sRef = sharedstringstable.addEntry(ctstr);
                        int sRef = sharedstringstable.addSharedStringItem(new XSSFRichTextString(ctstr));
                        // set the new characters for A's value in the XML
                        if (reader.hasNext()) {
                            writer.add(event); // write the old event
                            event = (XMLEvent)reader.next(); // get next event - should be characters
                            if (event.isCharacters()) {
                                Characters value = eventFactory.createCharacters(Integer.toString(sRef));
                                event = value;
                            }
                        }
                    } else if (cellBfound) {
                        // set the new characters for B's value in the XML
                        if (reader.hasNext()) {
                            writer.add(event); // write the old event
                            event = (XMLEvent)reader.next(); // get next event - should be characters
                            if (event.isCharacters()) {
                                double oldValue = Double.valueOf(((Characters)event).getData()); // old double value
                                Characters value = eventFactory.createCharacters(Double.toString(oldValue * rowsCount));
                                event = value;
                            }
                        }
                    }
                }
            }
            writer.add(event); //by default write each read event
        }
        writer.flush();

        //write the SharedStringsTable
        OutputStream out = sharedstringstablepart.getOutputStream();
        sharedstringstable.writeTo(out);
        out.close();

        opcpackage.close();
    }
}
This will consume much less memory than Apache POI's XSSF classes. But, as said, it only works for exactly this kind of Excel workbook: one whose first sheet has strings in column A and numbers in column B.

Using hashmap for POI Java XLSX

I have been trying to edit my code so that an XLSX file can be uploaded and read on the website. But after countless tries, the data I typed into the XLSX file is not captured on the website. (E.g. after downloading the XLSX template from the website, I should be able to type anything I want into the XLSX file and upload it again, so I don't have to keep adding new data by clicking "new" every single time; I can just type everything into that XLSX file at once and upload it right away.)
I was told to use a HashMap, but I am unsure how it works. The code I currently have only lets the website capture the header titles, and I am not supposed to use jxl.
While removing the code that uses jxl, I encountered some errors (underlined in red).
public HashMap getConstructJXLList_xlsx(UploadedFile File, int Sheetindex) {
    String _LOC = "[PageCodeBase: getConstructJXLList]";
    HashMap _m = new HashMap();
    InputStream _is = null;
    try {
        _is = File.getInputstream();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    XSSFWorkbook workbook;
    XSSFSheet s;
    try {
        workbook = new XSSFWorkbook(_is);
        s = workbook.getSheetAt(Sheetindex);
    } catch (Exception e) {
        System.out.println(_LOC + "1.0 " + " Test:");
        int _totalc = getColumns(); // getColumns is being underlined in red
        int _totalr = getRows();    // getRows is being underlined in red
        // Header r=0
        String[] _st = new String[_totalc];
        //XSSFSheet sheet = null;
        for (int _c = 0; _c < _totalc; _c++) {
            _st[_c] = getCell(_c, 0); // getCell is being underlined in red
        }
        _m.put("HEADER", _st);
        System.out.println(_LOC + "1.0 " + " _m:" + _m);
        // Data r=1 thereafter
        List _l = new ArrayList();
        for (int _r = 1; _r < _totalr; _r++) {
            Object[] _o = new Object[_totalc];
            String _s_r = null;
            for (int _c = 0; _c < _totalc; _c++) {
                _o[_c] = getCell(_c, _r);
                String _cn = _o[_c].getClass().getName();
                String _s_c = null;
                if (!isEmptyNull(_s_c)) {
                    _s_r = "record_available";
                }
            }
            if ((_o != null) && (_o.length != 0)) {
                _l.add(_o);
            }
        }
        _m.put("DATA", _l);
        System.out.println(_LOC + "1.0 " + " _m:" + _m);
    }
    return _m;
}
Would you mind helping me solve this? Why isn't any data being captured on the website? The error shown is "The method getColumns/getCell/getRows is undefined for the type PageCodeBase", and the quick fix offered is to create a new method. But after creating the new method, I am unsure what to put in it. I have tried various examples (http://snippetjournal.wordpress.com/2014/02/05/read-xlsx-using-poi/) but I still can't seem to get it to work.
I would recommend managing the Excel file using these classes from the Apache POI API:
org.apache.poi.ss.usermodel.Cell;
org.apache.poi.ss.usermodel.Row;
org.apache.poi.ss.usermodel.Sheet;
org.apache.poi.ss.usermodel.Workbook;
org.apache.poi.ss.usermodel.WorkbookFactory;
instead of XSSFWorkbook, XSSFSheet, and so on.
Also, when accessing the file input stream, try doing it this way:
FileInputStream input = new FileInputStream(new File("C:\\Users\\admin\\Desktop\\Load_AcctCntr_Template.xlsx"));
Workbook workBook = WorkbookFactory.create(input);
workBook.getSheetAt(0);
use this.
FileInputStream input = new FileInputStream(new File("C:/Users/admin/Desktop/Load_AcctCntr_Template.xlsx"));
Workbook wb = WorkbookFactory.create(input);
As mentioned in user3661357's answer, use
Workbook instead of XSSFWorkbook.
Sheet instead of XSSFSheet.
etc..
Also read this
Getting Exception(org.apache.poi.openxml4j.exception - no content type [M1.13]) when reading xlsx file using Apache POI?
*HINT > use ALT+SHIFT+I in netbeans to load the necessary packages.
A working example
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Iterator;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class POITest {

    public static void test() {
        try {
            FileInputStream input = new FileInputStream(new File("C:/Users/kingslayer/Desktop/test/a.xlsx"));
            Workbook wb = WorkbookFactory.create(input);
            Sheet s = wb.getSheetAt(0);
            Iterator<Row> rows = s.rowIterator();
            while (rows.hasNext()) {
                Row row = rows.next();
                Iterator cells = row.cellIterator();
                while (cells.hasNext()) {
                    Cell cell = (Cell) cells.next();
                    if (cell.getCellType() == Cell.CELL_TYPE_STRING) {
                        System.out.print(cell.getStringCellValue() + "t");
                    } else if (cell.getCellType() == Cell.CELL_TYPE_NUMERIC) {
                        System.out.print(cell.getNumericCellValue() + "t");
                    } else if (cell.CELL_TYPE_BLANK == cell.getCellType()) {
                        System.out.print("BLANK ");
                    } else {
                        System.out.print("Unknown cell type");
                    }
                }
                input.close();
            }
        } catch (IOException | InvalidFormatException ex) {
            Logger.getLogger(POITest.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static void main(String[] args) {
        test();
    }
}
All these libraries must be on the project path:
commons-codec-1.5.jar ,
commons-logging-1.1.jar ,
dom4j-1.6.1.jar ,
junit-3.8.1.jar ,
log4j-1.2.13.jar ,
poi-3.9-20121203.jar ,
poi-excelant-3.9-20121203.jar ,
poi-ooxml-3.9-20121203.jar ,
poi-ooxml-schemas-3.9-20121203.jar ,
poi-scratchpad-3.9-20121203.jar ,
stax-api-1.0.1.jar ,
xmlbeans-2.3.0.jar ,
1) Get rid of POIFSFileSystem fs = new POIFSFileSystem(input); as you are not using it.
2) input.close(); is called after the first iteration over the rows.
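For point 2, a small sketch of the fix (same illustrative file path as in the example above): create the workbook, iterate all rows, and close the stream once at the end.
import java.io.File;
import java.io.FileInputStream;
import java.util.Iterator;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class POITestFixed {
    public static void main(String[] args) throws Exception {
        FileInputStream input = new FileInputStream(new File("C:/Users/kingslayer/Desktop/test/a.xlsx"));
        Workbook wb = WorkbookFactory.create(input);
        Sheet s = wb.getSheetAt(0);
        Iterator<Row> rows = s.rowIterator();
        while (rows.hasNext()) {
            Row row = rows.next();
            // ... process the cells of this row exactly as in the example above ...
        }
        input.close(); // close once, after all rows have been processed
    }
}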
