APACHE POI autoSizeColumn with useMergedCells too slow - java

I'm trying to apply autoSizeColumn to an Excel sheet. I'm using POI 3.10.1.
I call autoSizeColumn at the end, but the problem is that the process takes far too long:
the sheet has approximately 1000 rows and 20 columns, and after 5 hours I killed the process.
I don't understand what is taking so long; 1000 rows and 20 columns doesn't seem that huge. Did I miss something? (NB: on a smaller file it works.)
My simplified code is below:
Workbook vWorkbook = getWorkbook();
Sheet vSheet = vWorkbook.createSheet("sheet");
CreationHelper vHelper = vWorkbook.getCreationHelper();
Drawing drawing = vSheet.createDrawingPatriarch();
Set<CellRangeAddress> vRegions = new HashSet<CellRangeAddress>();
// Iterate over the document's rows
MatrixDocument vMatrixDocument = getMatrixDocument();
List<MatrixRow> vListMatrixRows = vMatrixDocument.getRows();
int maxColNb = 0;
// Iterate over the grid's rows
for (MatrixRow vMatrixRow : vListMatrixRows)
{
    //(...)
    //create cells
    //(...)
}
initColSpan(vListMatrixRows, vRegions);
// Handle the colSpans and rowSpans
for (CellRangeAddress vRegion : vRegions)
{
    vSheet.addMergedRegion(vRegion);
}
for (int i = 0; i < maxColNb; ++i)
{
    vSheet.autoSizeColumn(i, true); // Here is the problem: this spent more than 5 hours for 1000 rows and 20 columns
}
I've already read the threads below:
http://stackoverflow.com/questions/16943493/apache-poi-autosizecolumn-resizes-incorrectly
http://stackoverflow.com/questions/15740100/apache-poi-autosizecolumn-not-working-right
http://stackoverflow.com/questions/23366606/autosizecolumn-performance-effect-in-apache-poi
http://stackoverflow.com/questions/18984785/a-poi-related-code-block-running-dead-slow
http://stackoverflow.com/questions/28564045/apache-poi-autosizecolumn-behaving-weird
http://stackoverflow.com/questions/18456474/apache-poi-autosizecolumn-is-not-working
But none of them solves my issue.
Any idea?
PS: I tried to upload an example image of the Excel file, but I couldn't find how to upload it.

Even after upgrading to Apache POI 3.12, I was facing the same auto-sizing issue. Note also that auto-sizing can misbehave on Unix/Linux servers, because autoSizeColumn measures text through AWT font metrics and the required fonts are often missing there.
What I learnt from various forums is this:
1. You can try using the SXSSF (streaming) API, which usually works much faster.
2. If not, then go for the setColumnWidth method (I know it's literally manual work for 20 columns); see the sketch below.
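A minimal sketch of option 2, assuming the longest string in each column is a good width estimate (sizeColumnsManually is a hypothetical helper; Cell.CELL_TYPE_STRING is the pre-3.15 constant matching the POI versions discussed here):
// Approximate auto-sizing by hand: measure the longest string per column.
private static void sizeColumnsManually(Sheet sheet, int colCount) {
    for (int col = 0; col < colCount; col++) {
        int maxChars = 10; // minimum fallback width, in characters
        for (Row row : sheet) {
            Cell cell = row.getCell(col);
            if (cell != null && cell.getCellType() == Cell.CELL_TYPE_STRING) {
                maxChars = Math.max(maxChars, cell.getStringCellValue().length());
            }
        }
        // Column width is set in 1/256ths of a character, capped at 255 characters.
        sheet.setColumnWidth(col, Math.min(maxChars, 255) * 256);
    }
}
Unlike autoSizeColumn, this is a single linear pass over the cells and ignores merged regions entirely, which is exactly what makes it fast.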

My solution was to run AutoSizeColumn only once, after the last Excel row has been written, like this (C# in my case):
if (i >= abaExcel.Itens.Count)
    sheet.AutoSizeColumn(j);

Because merged regions cannot overlap without producing a corrupt document, POI checks each new merged region against all regions already on the sheet before adding it. This gives O(N) behaviour for adding one region instead of the expected O(1), and O(N^2) for adding N regions. There are two variants:
addMergedRegion - with overlap checking, but slow
addMergedRegionUnsafe - without checking, but fast
Documentation: read more about addMergedRegionUnsafe(...)
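A minimal sketch of the fast path, assuming a POI release recent enough to have both methods (check your version; the variable names reuse the question's code):
// Add all regions without the per-region overlap check...
for (CellRangeAddress vRegion : vRegions) {
    vSheet.addMergedRegionUnsafe(vRegion);
}
// ...then validate the whole set once at the end; this throws an
// exception if any two merged regions intersect.
vSheet.validateMergedRegions();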

Related

iText 5.5.11 - bold text looks blurry after using PdfCleanUpProcessor

I need to remove some content from an existing PDF (created with Jasper Reports) using iText 5.5.11, but after running PdfCleanUpProcessor all bold text is blurry.
This is the code I'm using:
PdfReader reader = new PdfReader("input.pdf");
PdfStamper stamper = new PdfStamper(reader, new FileOutputStream("output.pdf"));
List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<PdfCleanUpLocation>();
cleanUpLocations.add(new PdfCleanUpLocation(1, new Rectangle(0f, 0f, 595f, 680f)));
PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
cleaner.cleanUp();
stamper.close();
reader.close();
As already discussed here, downgrading to itext-5.5.4 solves the problem, but in my case itext-5.5.11 is already in use for other reasons, so downgrading is not an option.
Is there another solution or workaround?
These are the PDF files before and after cleaning: BEFORE - AFTER
By comparing the before and after files it becomes clear that for some reason the PdfCleanUpProcessor falsely drops general graphics state operations (at least w, J, and d).
In your before document in particular, the w operation is important for the text because a poor man's bold variant is used: instead of an actual bold font, the normal font is used and the text rendering mode is set to not only fill the glyph contours but also stroke a line along them, giving the glyphs a bold-ish appearance.
The width of that line is set to 0.23333 using a w operation. As that operation is missing in the after document, the default width value of 1 is used. Thus, the line along the contour now is 4 times as big as before resulting in a very fat appearance.
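For illustration (a hedged sketch, not code from the document in question; bf, the text, and the coordinates are placeholders), this is roughly how a poor man's bold is produced with iText 5's low-level API:
PdfContentByte cb = stamper.getOverContent(1);
cb.beginText();
cb.setFontAndSize(bf, 10); // a normal font, not a bold one
// Fill AND stroke the glyph outlines to fake boldness...
cb.setTextRenderingMode(PdfContentByte.TEXT_RENDER_MODE_FILL_STROKE);
// ...with a thin line; this emits the 'w' operator the cleanup drops.
cb.setLineWidth(0.23333f);
cb.showTextAligned(Element.ALIGN_LEFT, "poor man's bold", 36, 700, 0);
cb.endText();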
This issue has been introduced in commit d5abd23 (dated May 4th, 2015) which (among other things) added this block to PdfCleanUpContentOperator.invoke:
} else if (lineStyleOperators.contains(operatorStr)) {
    if ("w" == operatorStr) {
        cleanUpStrategy.getContext().setLineWidth(((PdfNumber) operands.get(0)).floatValue());
    } else if ("J" == operatorStr) {
        cleanUpStrategy.getContext().setLineCapStyle(((PdfNumber) operands.get(0)).intValue());
    } else if ("j" == operatorStr) {
        cleanUpStrategy.getContext().setLineJoinStyle(((PdfNumber) operands.get(0)).intValue());
    } else if ("M" == operatorStr) {
        cleanUpStrategy.getContext().setMiterLimit(((PdfNumber) operands.get(0)).floatValue());
    } else if ("d" == operatorStr) {
        cleanUpStrategy.getContext().setLineDashPattern(new LineDashPattern(((PdfArray) operands.get(0)),
                ((PdfNumber) operands.get(1)).floatValue()));
    }
    disableOutput = true;
This causes all lineStyleOperators to be dropped, while at the same time an attempt was made to store the changed values in the cleanup strategy context. But of course, using == for String comparisons in Java is usually a very bad idea, so from this version on the line style operators were dropped for good in iText.
Actually this code had been ported from iTextSharp, and in C# == on the string type works entirely differently; nonetheless, even in the iTextSharp version these stored values at first glance only seem to have been taken into account if paths were stroked, not if text rendering included stroking along the contour.
Later on, in commit 9967627 (on the same day as the commit above), the inner if..else if..else.. was removed with the comment "Replaced PdfCleanUpGraphicsState with existing GraphicsState from itext.pdf.parser package, added missing parameters into the latter"; only the disableOutput = true remained. This (also at first glance) appears to have fixed the difference between iText/Java and iTextSharp/.Net, but the line style values still are not considered if text rendering includes stroking along the contour.
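For illustration (plain Java, not iText code): == compares references, so it is only true when both operands are the very same String object, while equals compares contents:
String a = "w";
String b = new String("w"); // same contents, different object
System.out.println(a == b);      // false - reference comparison
System.out.println(a.equals(b)); // true  - content comparison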
As a work-around consider removing the lines
} else if (lineStyleOperators.contains(operatorStr)) {
disableOutput = true;
from PdfCleanUpContentOperator.invoke. Now the line style operators are not dropped anymore and the text in your PDF after redaction looks like before. I have not checked for any side effects, though, so please test with a number of documents before even considering using that work-around in production.

Exception when adding a PdfFormField to a big PDF

I am adding a PdfTextFormField over a Table cell using a custom renderer, as per the iText 7 example code in CreateFormInTable.java. This works initially, but once I create a Table on page 3 or later of the PDF, I get an exception:
Caused by: java.lang.NullPointerException
at com.itextpdf.kernel.pdf.PdfDictionary.get(PdfDictionary.java:552)
at com.itextpdf.kernel.pdf.PdfDictionary.getAsArray(PdfDictionary.java:156)
at com.itextpdf.kernel.pdf.PdfPage.getAnnotations(PdfPage.java:746)
at ...pdf.annot.PdfAnnotation.getPage(PdfAnnotation.java:435)
at ...forms.fields.PdfFormField.regenerateField(PdfFormField.java:1761)
at ...forms.fields.PdfFormField.setValue(PdfFormField.java:1038)
at ...forms.fields.PdfFormField.setValue(PdfFormField.java:999)
at ...forms.fields.PdfFormField.setValue(PdfFormField.java:994)
etc.
It seems fairly easy to reproduce, and I can provide a full code sample if you want, but a simple way to see the problem is to insert:
for (int i = 1; i < 2; i++) // Change 2 to 3 and you get an NPE
{
    Paragraph para = new Paragraph("Page " + i);
    doc.add(para);
    doc.add(new AreaBreak(AreaBreakType.NEXT_PAGE));
}
straight after the Document constructor in the aforementioned iText7 Java sample file at:
http://developers.itextpdf.com/examples/form-examples/clone-create-fields-table#2350-createformintable.java
I've tested it on 7.0.1 and 7.0.2, with same result.
Well, currently some of the form-related functionality requires the whole PDF document structure to be in memory to operate. This means that no object can be flushed. But layout's DocumentRenderer flushes the pages when possible. The problem reproduces only for three or more pages because there is a small "window" of unflushed pages.
This is indeed not mentioned in the sample and can be improved in the future. In the current version, to get the desired PDF, you can set the Document to operate in "postpone flushing" mode using the following constructor:
Document doc = new Document(pdfDoc, PageSize.A4, false);
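In context (a hedged sketch; the file name is a placeholder), the third constructor argument is immediateFlush, and passing false keeps completed pages in memory until the document is closed:
PdfDocument pdfDoc = new PdfDocument(new PdfWriter("form_in_table.pdf"));
// immediateFlush = false: form fields need the full page tree in memory
Document doc = new Document(pdfDoc, PageSize.A4, false);
// ... add the table and form fields as in CreateFormInTable.java ...
doc.close();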

Java Memory Error on Importing .xlsx files into R [duplicate]

The xlsx package can be used to read and write Excel spreadsheets from R. Unfortunately, even for moderately large spreadsheets, java.lang.OutOfMemoryError can occur. In particular,
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: Java heap space
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "newInstance", .jfindClass(class), :
java.lang.OutOfMemoryError: GC overhead limit exceeded
(Other related exceptions are also possible but rarer.)
A similar question was asked regarding this error when reading spreadsheets.
Importing a big xlsx file into R?
The main advantage of using Excel spreadsheets as a data storage medium over CSV is that you can store multiple sheets in the same file, so here we consider a list of data frames to be written, one data frame per worksheet. This example dataset contains 40 data frames, each with two columns of up to 200k rows. It is designed to be big enough to be problematic, but you can change the size by altering n_sheets and n_rows.
library(xlsx)
set.seed(19790801)
n_sheets <- 40
the_data <- replicate(
  n_sheets,
  {
    n_rows <- sample(2e5, 1)
    data.frame(
      x = runif(n_rows),
      y = sample(letters, n_rows, replace = TRUE)
    )
  },
  simplify = FALSE
)
names(the_data) <- paste("Sheet", seq_len(n_sheets))
The natural method of writing this to file is to create a workbook using createWorkbook, then loop over each data frame calling createSheet and addDataFrame. Finally the workbook can be written to file using saveWorkbook. I've added messages to the loop to make it easier to see where it falls over.
wb <- createWorkbook()
for(i in seq_along(the_data))
{
  message("Creating sheet", i)
  sheet <- createSheet(wb, sheetName = names(the_data)[i])
  message("Adding data frame", i)
  addDataFrame(the_data[[i]], sheet)
}
saveWorkbook(wb, "test.xlsx")
Running this in 64-bit R on a machine with 8 GB RAM, it throws the GC overhead limit exceeded error while running addDataFrame for the first time.
How do I write large datasets to Excel spreadsheets using xlsx?
This is a known issue:
http://code.google.com/p/rexcel/issues/detail?id=33
While unresolved, the issue page links to a solution by Gabor Grothendieck suggesting that the heap size should be increased by setting the java.parameters option before the rJava package is loaded. (rJava is a dependency of xlsx.)
options(java.parameters = "-Xmx1000m")
The value 1000 is the number of megabytes of RAM to allow for the Java heap; it can be replaced with any value you like. My experiments with this suggest that bigger values are better, and you can happily use your full RAM entitlement. For example, I got the best results using:
options(java.parameters = "-Xmx8000m")
on the machine with 8GB RAM.
A further improvement can be obtained by requesting a garbage collection in each iteration of the loop. As noted by @gjabel, R garbage collection can be performed using gc(). We can define a Java garbage collection function that calls the Java System.gc() method:
jgc <- function()
{
  .jcall("java/lang/System", method = "gc")
}
Then the loop can be updated to:
for(i in seq_along(the_data))
{
  gc()
  jgc()
  message("Creating sheet", i)
  sheet <- createSheet(wb, sheetName = names(the_data)[i])
  message("Adding data frame", i)
  addDataFrame(the_data[[i]], sheet)
}
With both these code fixes, the code ran as far as i = 29 before throwing an error.
One technique that I tried unsuccessfully was to use write.xlsx2 to write the contents to file at each iteration. This was slower than the other code, and it fell over on the 10th iteration (but at least part of the contents were written to file).
for(i in seq_along(the_data))
{
  message("Writing sheet", i)
  write.xlsx2(
    the_data[[i]],
    "test.xlsx",
    sheetName = names(the_data)[i],
    append = i > 1
  )
}
Building on @richie-cotton's answer, I found that adding gc() to the jgc function kept the CPU usage low.
jgc <- function()
{
  gc()
  .jcall("java/lang/System", method = "gc")
}
My previous for loop still struggled with the original jgc function, but with the extra command I no longer run into the GC overhead limit exceeded error message.
Solution for the above error: use the R code below:
detach(package:xlsx)
detach(package:XLConnect)
library(openxlsx)
Then try to import the file again; you should not get any errors, as this worked for me.
Restart R and, before loading the R packages, insert:
options(java.parameters = "-Xmx2048m")
or
options(java.parameters = "-Xmx8000m")
You can also use gc() inside the loop if you are writing row by row. gc() stands for garbage collection, and it can be used in any case of memory issues.
I was having issues with write.xlsx() rather than reading, but then realised that I had accidentally been running 32-bit R. Swapping it out for 64-bit fixed the issue.

Error when reading Excel in R: java.lang.OutOfMemoryError: Java heap space [duplicate]

I'm wondering if anyone knows of a way to import data from a "big" xlsx file (~20 MB). I tried to use the xlsx and XLConnect libraries. Unfortunately, both use rJava and I always obtain the same error:
> library(XLConnect)
> wb <- loadWorkbook("MyBigFile.xlsx")
Error: OutOfMemoryError (Java): Java heap space
or
> library(xlsx)
> mydata <- read.xlsx2(file="MyBigFile.xlsx")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: Java heap space
I also tried to modify the java.parameters before loading rJava:
> options( java.parameters = "-Xmx2500m")
> library(xlsx) # load rJava
> mydata <- read.xlsx2(file="MyBigFile.xlsx")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: Java heap space
or after loading rJava (this is a bit stupid, I think):
> library(xlsx) # load rJava
> options( java.parameters = "-Xmx2500m")
> mydata <- read.xlsx2(file="MyBigFile.xlsx")
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.lang.OutOfMemoryError: Java heap space
But nothing works. Does anyone have an idea?
I stumbled on this question when someone sent me (yet another) Excel file to analyze. This one isn't even that big but for whatever reason I was running into a similar error:
java.lang.OutOfMemoryError: GC overhead limit exceeded
Based on a comment by @DirkEddelbuettel on a previous answer, I installed the openxlsx package (http://cran.r-project.org/web/packages/openxlsx/) and then ran:
library("openxlsx")
mydf <- read.xlsx("BigExcelFile.xlsx", sheet = 1, startRow = 2, colNames = TRUE)
It was just what I was looking for: easy to use and wicked fast. It's my new BFF. Thanks for the tip, @DirkEddelbuettel!
options(java.parameters = "-Xmx2048m") ## memory set to 2 GB
library(XLConnect)
Allow for more memory using options before any Java component is loaded, then load the XLConnect library (it uses Java).
That's it. Start reading in data with readWorksheet, and so on.
:)
I agree with @orville jackson's response; it really helped me too.
In line with @orville jackson's answer, here is a detailed description of how you can use openxlsx to read and write big files.
When the data size is small, R has many packages and functions that can do the job as per your requirements.
write.xlsx, write.xlsx2, and XLConnect also do the work, but these are sometimes slow compared to openxlsx.
So, if you are dealing with large data sets and run into Java errors, I would suggest having a look at openxlsx, which is really awesome and reduces the time to roughly 1/12th.
I've tested them all, and I was really impressed with the performance of openxlsx's capabilities.
Here are the steps for writing multiple datasets into multiple sheets.
install.packages("openxlsx")
library("openxlsx")
start.time <- Sys.time()
# Creating large data frame
x <- as.data.frame(matrix(1:4000000,200000,20))
y <- as.data.frame(matrix(1:4000000,200000,20))
z <- as.data.frame(matrix(1:4000000,200000,20))
# Creating a workbook
wb <- createWorkbook("Example.xlsx")
Sys.setenv("R_ZIPCMD" = "C:/Rtools/bin/zip.exe") ## path to zip.exe
Sys.setenv("R_ZIPCMD" = "C:/Rtools/bin/zip.exe") has to be static as it takes reference of some utility from Rtools.
Note: Incase Rtools is not installed on your system, please install it first for smooth experience. here is the link for your reference: (choose appropriate version)
https://cran.r-project.org/bin/windows/Rtools/
check the options as per link below (need to select all the check box while installation)
https://cloud.githubusercontent.com/assets/7400673/12230758/99fb2202-b8a6-11e5-82e6-836159440831.png
# Adding worksheets: the parameters for addWorksheet are 1. workbook name 2. sheet name
addWorksheet(wb, "Sheet 1")
addWorksheet(wb, "Sheet 2")
addWorksheet(wb, "Sheet 3")
# Writing data into the respective sheets: the parameters for writeData are 1. workbook name 2. sheet index/sheet name 3. data frame name
writeData(wb, 1, x)
# In case you would like the sheet to have a filter available for ease of access, you can pass the parameter withFilter = TRUE to the writeData function.
writeData(wb, 2, x = y, withFilter = TRUE)
## Similarly, writeDataTable is another way of representing your data with table formatting:
writeDataTable(wb, 3, z)
saveWorkbook(wb, file = "Example.xlsx", overwrite = TRUE)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
The openxlsx package is really good for reading and writing huge data from/to Excel files and has lots of options for custom formatting within Excel.
The interesting fact is that we don't have to bother about Java heap memory here, since openxlsx does not depend on Java at all.
I know this question is a bit old, but there is a good solution for it nowadays: the readxl package. It is the default used when you import Excel files via the RStudio GUI, and it works well in my situation.
library(readxl)
data <- read_excel(filename)
As mentioned in the canonical Excel->R question, a recent alternative which has emerged comes from the readxl package, which I've found to be quite fast, compared with, e.g. openxlsx and xlsx.
That said, there's a definite limit of spreadsheet size past which you're probably better off just saving the thing as a .csv and using fread.
I also had the same error in both xlsx::read.xlsx and XLConnect::readWorksheetFromFile. Maybe you can use RODBC::odbcDriverConnect and RODBC::sqlFetch, which go through the Microsoft Excel ODBC driver and are much more efficient.
@flodel's suggestion of converting to CSV seems the most straightforward. If for whatever reason that's not an option, you can read the file in chunks:
require(XLConnect)
chnksz <- 2e3
s <- <sheet>
wb <- loadWorkbook(<file>)
tot.rows <- getLastRow(wb, s)
for (i in seq(ceiling(tot.rows / chnksz))) {
  next.batch <- readWorksheet(wb, s, startRow = (i - 1) * chnksz + 1, endRow = i * chnksz)
  # optionally save next.batch to disk or
  # append it to a list; see which works for you.
}
I found this thread while looking for an answer to the exact same question. Rather than trying to hack the xlsx file from within R, what ended up working for me was converting the file to .csv using Python and then importing it into R using a standard scanning function.
Check out: https://github.com/dilshod/xlsx2csv

POI Excel Merging Causing "Repaired Records: Format from /xl/styles.xml part (Styles)"

I have merged two Excel files using the code specified here:
http://www.coderanch.com/t/614715/Web-Services/java/merge-excel-files
This is the block that applies the styles to the merged cells:
if (styleMap != null)
{
    if (oldCell.getSheet().getWorkbook() == newCell.getSheet().getWorkbook())
    {
        newCell.setCellStyle(oldCell.getCellStyle());
    }
    else
    {
        int stHashCode = oldCell.getCellStyle().hashCode();
        XSSFCellStyle newCellStyle = styleMap.get(stHashCode);
        if (newCellStyle == null)
        {
            newCellStyle = newCell.getSheet().getWorkbook().createCellStyle();
            newCellStyle.cloneStyleFrom(oldCell.getCellStyle());
            styleMap.put(stHashCode, newCellStyle);
        }
        newCell.setCellStyle(newCellStyle);
    }
}
It is all working as expected, and my XSSFWorkbook is generated fine.
The problem starts when I try to open it: Excel reports that the file needed repairs, and my error report contains the following:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<logFileName>error072840_01.xml</logFileName>
<summary>Errors were detected in file 'XYZ.xlsx'</summary>
<repairedRecords summary="Following is a list of repairs:">
<repairedRecord>Repaired Records: Format from /xl/styles.xml part (Styles)</repairedRecord>
</repairedRecords>
</recoveryLog>
After all this, my sheet opens up fine but without styles. I know there is a limit on the number of styles that can be created, but I counted the styles being created and hardly see 4 of them, even though I know this kind of issue is caused by too many styles.
Unfortunately, POI only has support for optimising styles in an HSSFWorkbook (Apache POI delete CellStyle from workbook).
Any help on how to mitigate this issue would be great.
Well, after debugging a bit of the POI code to see how the styles are being applied, doing the following solved the problem:
newCellStyle.getCoreXf().unsetBorderId(); // remove the copied borderId reference
newCellStyle.getCoreXf().unsetFillId();   // remove the copied fillId reference
I had the same issue.
You should minimize the number of style and font instances, because each instance is placed into xl/styles.xml.
Create styles and fonts only once per workbook, as sketched below.
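A minimal sketch of that advice (the sheet layout and the bold header style are illustrative only):
XSSFWorkbook wb = new XSSFWorkbook();
XSSFSheet sheet = wb.createSheet("data");
XSSFRow headerRow = sheet.createRow(0);

// Create the font and style ONCE per workbook...
XSSFFont boldFont = wb.createFont();
boldFont.setBold(true);
XSSFCellStyle headerStyle = wb.createCellStyle();
headerStyle.setFont(boldFont);

// ...and reuse the same instances for every cell that needs them,
// instead of calling createCellStyle() per cell.
for (int c = 0; c < 20; c++) {
    XSSFCell cell = headerRow.createCell(c);
    cell.setCellValue("Col " + c);
    cell.setCellStyle(headerStyle);
}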
I had the same issue using the Python library xlsxwriter with Pandas. After I stopped trying to use Pandas' date_format specification, I stopped getting the error.
import pandas as pd

data = pd.read_excel('somefile.xlsx')
grp = data.groupby('Property Manager')
for i, (pm, g) in enumerate(grp):
    writer = pd.ExcelWriter(p + f.format(pm[:30]), engine='xlsxwriter')  # ,date_format='%m/%d/%Y')
    g[cols].to_excel(writer, sheet_name='Summary', index=False)
    writer.save()
