I have an xlsx file which contains around 150 sheets. I need to extract text for only 30 of those sheets. I have tried the code below, but it extracts the text for all of the sheets.
try (InputStream inp = new FileInputStream(filePath)) {
    OPCPackage d = OPCPackage.open(inp);
    XSSFWorkbook wb = new XSSFWorkbook(d);
    XSSFExcelExtractor extractor = new XSSFExcelExtractor(wb);
    extractor.setFormulasNotResults(true);
    extractor.setIncludeSheetNames(false);
    String text = extractor.getText().replaceAll("\\t", " ").replaceAll("%", "");
    lines = text.split("\n");
}
Could someone please tell me whether there is a method that lets me extract text only for the sheets whose names I specify?
Sure
The Workbook class has a getSheet(String name) method that returns a Sheet instance.
I don't remember the exact class names, but you can write something like this:
List<String> sheetNames = List.of("sheet1", "sheet2", .... );
List<Sheet> sheets = new ArrayList<>();
sheetNames.forEach(nm -> sheets.add(workbook.getSheet(nm)));
Then you may want to filter out nulls (getSheet returns null when the sheet wasn't found):
List<Sheet> found = sheets.stream().filter(Objects::nonNull).collect(Collectors.toList());
here you go
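The lookup-and-filter step can be tried without POI on the classpath. The following is a minimal sketch, not POI code: `SheetSelector` and `selectByName` are hypothetical names, a plain map plays the role of the workbook, and the `lookup` function stands in for `workbook::getSheet`, which is assumed to return null for an unknown name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper: looks up each requested name and drops the misses.
final class SheetSelector {

    // 'lookup' stands in for workbook::getSheet (assumed to return null for unknown names).
    static <T> List<T> selectByName(List<String> names, Function<String, T> lookup) {
        return names.stream()
                .map(lookup)
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // A plain map plays the role of the workbook in this sketch.
        Map<String, String> fakeWorkbook = new HashMap<>();
        fakeWorkbook.put("sheet1", "contents of sheet1");
        fakeWorkbook.put("sheet2", "contents of sheet2");

        List<String> wanted = List.of("sheet1", "missing", "sheet2");
        // "missing" is silently skipped; the other two are returned in request order.
        System.out.println(selectByName(wanted, fakeWorkbook::get));
    }
}
```

With POI the call would presumably be `SheetSelector.selectByName(sheetNames, workbook::getSheet)`, after which each returned Sheet can be iterated row by row.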
I have a large .xlsx file (141 MB, containing 293413 lines with 62 columns each) on which I need to perform some operations.
I am having problems with loading this file (OutOfMemoryError), as POI has a large memory footprint on XSSF (xlsx) workbooks.
This SO question is similar, and the solution presented is to increase the VM's allocated/maximum memory.
It seems to work for that file size (9 MB), but for me it simply doesn't work, even if I allocate all available system memory. (Well, that's no surprise considering the file is over 15 times larger.)
I'd like to know if there is any way to load the workbook so that it won't consume all the memory, yet without processing (going into) XSSF's underlying XML directly. (In other words, keeping a pure POI solution.)
If there isn't, though, you are welcome to say so ("There isn't.") and point me toward an "XML" solution.
I was in a similar situation with a webserver environment. The typical size of the uploads was ~150k rows, and it wouldn't have been good to consume a ton of memory from a single request. The Apache POI Streaming API works well for this, but it requires a total redesign of your read logic. I already had a bunch of read logic using the standard API that I didn't want to have to redo, so I wrote this instead: https://github.com/monitorjbl/excel-streaming-reader
It's not entirely a drop-in replacement for the standard XSSFWorkbook class, but if you're just iterating through rows it behaves similarly:
import com.monitorjbl.xlsx.StreamingReader;
InputStream is = new FileInputStream(new File("/path/to/workbook.xlsx"));
StreamingReader reader = StreamingReader.builder()
        .rowCacheSize(100)    // number of rows to keep in memory (defaults to 10)
        .bufferSize(4096)     // buffer size to use when reading InputStream to file (defaults to 1024)
        .sheetIndex(0)        // index of sheet to use (defaults to 0)
        .read(is);            // InputStream or File for XLSX file (required)
for (Row r : reader) {
    for (Cell c : r) {
        System.out.println(c.getStringCellValue());
    }
}
There are some caveats to using it; due to the way XLSX sheets are structured, not all data is available in the current window of the stream. However, if you're just trying to read simple data out from the cells, it works pretty well for that.
Memory usage can be improved by using a File instead of an InputStream.
(It is better to use a streaming API, but the streaming APIs have limitations; see http://poi.apache.org/spreadsheet/index.html)
So instead of
Workbook workbook = WorkbookFactory.create(inputStream);
do
Workbook workbook = WorkbookFactory.create(new File("yourfile.xlsx"));
This is according to: http://poi.apache.org/spreadsheet/quick-guide.html#FileInputStream
Files vs InputStreams
"When opening a workbook, either a .xls HSSFWorkbook, or a .xlsx XSSFWorkbook, the Workbook can be loaded from either a File or an InputStream. Using a File object allows for lower memory consumption, while an InputStream requires more memory as it has to buffer the whole file."
The Excel support in Apache POI, HSSF and XSSF, supports 3 different modes.
One is a full, DOM-Like in-memory "UserModel", which supports both reading and writing. Using the common SS (SpreadSheet) interfaces, you can code for both HSSF (.xls) and XSSF (.xlsx) basically transparently. However, it needs lots of memory.
POI also supports a streaming read-only way to process the files, the EventModel. This is much more low-level than the UserModel, and gets you very close to the file format. For HSSF (.xls) you get a stream of records, and optionally some help with handling them (missing cells, format tracking etc). For XSSF (.xlsx) you get streams of SAX events from the different parts of the file, with help to get the right part of the file and also easy processing of common but small bits of the file.
For XSSF (.xlsx) only, POI also supports a write-only streaming write, suitable for low level but low memory writing. It largely just supports new files though (certain kinds of append are possible). There is no HSSF equivalent, and due to back-and-forth byte offsets and index offsets in many records it would be pretty hard to do...
For your specific case, as described in your clarifying comments, I think you'll want to use the XSSF EventModel code. See the POI documentation to get started, then try looking at these three classes in POI and Tika which use it for more details.
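To make "the different parts of the file" concrete: an .xlsx file is a ZIP package, and each worksheet is typically stored as a separate XML part under xl/worksheets/. The sketch below uses only the JDK (no POI) to list those parts. `XlsxParts`, `worksheetParts` and `fakePackage` are hypothetical names, and the in-memory package built in `main` is a stand-in that contains only the entry names needed for the demo, not a valid workbook.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class XlsxParts {

    // Returns the names of the worksheet XML parts found in an .xlsx (ZIP) stream.
    static List<String> worksheetParts(InputStream xlsx) {
        List<String> parts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(xlsx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Relationship files (*.rels) live nearby but are not sheet data.
                if (name.startsWith("xl/worksheets/") && name.endsWith(".xml")) {
                    parts.add(name);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parts;
    }

    // Demo helper: builds a tiny ZIP that has the given entry names and dummy content.
    static byte[] fakePackage(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String entryName : entryNames) {
                zip.putNextEntry(new ZipEntry(entryName));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] pkg = fakePackage("xl/workbook.xml", "xl/sharedStrings.xml",
                "xl/worksheets/sheet1.xml", "xl/worksheets/sheet2.xml");
        System.out.println(worksheetParts(new ByteArrayInputStream(pkg)));
    }
}
```

The event model hands you streams for parts like these one at a time, which is why memory use stays low compared with the UserModel.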
POI now includes an API for these cases: SXSSF (http://poi.apache.org/spreadsheet/index.html).
It does not load everything into memory, so it could allow you to handle such a file.
Note: I have read that SXSSF works as a writing API. Loading should be done using XSSF without wrapping the file in an InputStream (to avoid loading all of it into memory).
Check this post, where I show how to use a SAX parser to process an XLSX file.
https://stackoverflow.com/a/44969009/4587961
In short, I extended org.xml.sax.helpers.DefaultHandler, which processes the XML structure of XLSX files. It is an event-based parser (SAX).
class SheetHandler extends DefaultHandler {
    private static final String ROW_EVENT = "row";
    private static final String CELL_EVENT = "c";
    private SharedStringsTable sst;
    private String lastContents;
    private boolean nextIsString;
    private List<String> cellCache = new LinkedList<>();
    private List<String[]> rowCache = new LinkedList<>();
    private SheetHandler(SharedStringsTable sst) {
        this.sst = sst;
    }
    @Override
    public void startElement(String uri, String localName, String name,
                             Attributes attributes) throws SAXException {
        // c => cell
        if (CELL_EVENT.equals(name)) {
            // t="s" means the cell value is an index into the shared strings table
            String cellType = attributes.getValue("t");
            nextIsString = "s".equals(cellType);
        } else if (ROW_EVENT.equals(name)) {
            // A new row starts: drop any cells left over from the previous one
            cellCache.clear();
        }
        // Clear contents cache
        lastContents = "";
    }
    @Override
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        // Process the last contents as required.
        // Do now, as characters() may be called more than once
        if (nextIsString) {
            int idx = Integer.parseInt(lastContents);
            // Newer POI versions may need sst.getItemAt(idx).getString() here instead
            lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
            nextIsString = false;
        }
        // v => contents of a cell
        // Output after we've seen the string contents
        if (name.equals("v")) {
            cellCache.add(lastContents);
        } else if (ROW_EVENT.equals(name) && !cellCache.isEmpty()) {
            // Flush at the end of each row so that the last row is not lost
            rowCache.add(cellCache.toArray(new String[0]));
            cellCache.clear();
        }
    }
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        lastContents += new String(ch, start, length);
    }
    public List<String[]> getRowCache() {
        return rowCache;
    }
}
And then I parse the XML representing the XLSX file:
private List<String[]> processFirstSheet(String filename) throws Exception {
    // try-with-resources closes the package even if parsing fails
    try (OPCPackage pkg = OPCPackage.open(filename, PackageAccess.READ)) {
        XSSFReader r = new XSSFReader(pkg);
        SharedStringsTable sst = r.getSharedStringsTable();
        SheetHandler handler = new SheetHandler(sst);
        XMLReader parser = fetchSheetParser(handler);
        Iterator<InputStream> sheetIterator = r.getSheetsData();
        if (!sheetIterator.hasNext()) {
            return Collections.emptyList();
        }
        // Only the first sheet is read
        try (InputStream bisSheet = new BufferedInputStream(sheetIterator.next())) {
            parser.parse(new InputSource(bisSheet));
        }
        return handler.getRowCache();
    }
}
public XMLReader fetchSheetParser(ContentHandler handler)
        throws SAXException, ParserConfigurationException {
    // Obtain the XMLReader through the JDK's JAXP factory
    XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    parser.setContentHandler(handler);
    return parser;
}
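The handler logic can also be exercised without POI. The sketch below parses a hand-written worksheet fragment with the JDK's own SAX parser (`SAXParserFactory`). It is a simplified illustration, not the code from the linked post: `MiniSheetParser` is a hypothetical name, a plain `List<String>` stands in for POI's SharedStringsTable, and the XML fragment omits the namespace declarations a real sheet part carries. The idea is the same: cells with t="s" hold an index into the shared strings, other cells hold the value directly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class MiniSheetParser {

    // Parses sheet XML into rows of cell strings, resolving shared-string indexes.
    static List<List<String>> parse(String sheetXml, List<String> sharedStrings) {
        List<List<String>> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private List<String> currentRow;
            private boolean isSharedString;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("row".equals(qName)) {
                    currentRow = new ArrayList<>();
                } else if ("c".equals(qName)) {
                    isSharedString = "s".equals(atts.getValue("t"));
                }
                text.setLength(0); // start collecting text for this element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length); // may be called several times per element
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("v".equals(qName)) {
                    String raw = text.toString();
                    currentRow.add(isSharedString ? sharedStrings.get(Integer.parseInt(raw)) : raw);
                } else if ("row".equals(qName)) {
                    rows.add(currentRow); // flush when the row closes
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(sheetXml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse sheet XML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<worksheet><sheetData>"
                + "<row r=\"1\"><c r=\"A1\" t=\"s\"><v>0</v></c><c r=\"B1\"><v>42</v></c></row>"
                + "<row r=\"2\"><c r=\"A2\" t=\"s\"><v>1</v></c></row>"
                + "</sheetData></worksheet>";
        System.out.println(parse(xml, List.of("name", "total"))); // prints [[name, 42], [total]]
    }
}
```

Rows are flushed when the row element closes, so the final row of the sheet is included in the result.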
Based on monitorjbl's answer and the test suite from POI, the following worked for me on a multi-sheet xlsx file with 200K records (size > 50 MB):
import com.monitorjbl.xlsx.StreamingReader;
. . .
try (
    InputStream is = new FileInputStream(new File("sample.xlsx"));
    Workbook workbook = StreamingReader.builder().open(is);
) {
    DataFormatter dataFormatter = new DataFormatter();
    for (Sheet sheet : workbook) {
        System.out.println("Processing sheet: " + sheet.getSheetName());
        for (Row row : sheet) {
            for (Cell cell : row) {
                String value = dataFormatter.formatCellValue(cell);
            }
        }
    }
}
For the latest version of the library, use this:
InputStream file = new FileInputStream(
        new File("uploads/" + request.getSession().getAttribute("username") + "/" + userFile));
Workbook workbook = StreamingReader.builder()
        .rowCacheSize(100)  // number of rows to keep in memory
        .bufferSize(4096)   // buffer size to use when reading InputStream to file
        .open(file);        // InputStream or File for XLSX file (required)
DataFormatter dataFormatter = new DataFormatter();
Iterator<Row> rowIterator = workbook.getSheetAt(0).rowIterator();
while (rowIterator.hasNext()) {
    Row row = rowIterator.next();
    Iterator<Cell> cellIterator = row.cellIterator();
    while (cellIterator.hasNext()) {
        Cell cell = cellIterator.next();
        String cellValue = dataFormatter.formatCellValue(cell);
    }
}
You can use SXSSF instead of HSSF. I was able to generate an Excel file with 200000 rows.
This is the code that I have for reading a very large excel file (xlsx) that is 23.5MB with 700,000+ rows.
String dir = rootPath + File.separator + "tmpFiles" + File.separator
+ FILE_NAME;
File fisNew = new File(dir);
Workbook w = StreamingReader.builder()
.rowCacheSize(100)
.open(fisNew);
Sheet worksheet = null;
worksheet = w.getSheetAt(0);
worksheet.getRow(0).getPhysicalNumberOfCells();
I get an UnsupportedOperationException (null pointer) error on this line:
worksheet.getRow(0).getPhysicalNumberOfCells();
I also don't get an actual String value when I print the result of SpecialtyUtil.removeWhiteSpaces(excelheader.getCell(0)). I am supposed to get the name of the column, but I get some StreamingSheet string instead. I'm not sure what I need to change here in order to process an xlsx file.
EDIT: Any idea how to write to an excel file using StreamingReader? I know that it is an unsupported operation, but is there a workaround?
If you look at the StreamingSheet source code on GitHub, some Sheet methods are not supported and simply throw UnsupportedOperationException. Here is one such snippet:
/**
 * Not supported
 */
@Override
public int getPhysicalNumberOfRows() {
    throw new UnsupportedOperationException();
}
github link is given below.
https://github.com/monitorjbl/excel-streaming-reader/blob/master/src/main/java/com/monitorjbl/xlsx/impl/StreamingSheet.java#L97
We can use getLastRowNum() instead:
int lastRowNum = sheet.getLastRowNum(); // row numbers run from 0 to n
Here is the implementation:
@Override
public int getLastRowNum() {
    return reader.getLastRowNum();
}
StreamingSheet.java
I would like to save all attached files from an Excel file (xls/HSSF), without an extension.
I've been trying for a long time now, and I really don't know if this is even possible. I also tried Apache Tika, but I don't want to use Tika for this, because I need POI for other tasks anyway.
I tried the sample code from the Busy Developers Guide, but it does not extract files in the old Office formats (doc, ppt, xls). It also reports an error at new SlideShow(new HSLFSlideShow(dn, fs)): "Remove argument to match HSLFSlideShow(dn)".
My actual code is:
public static void saveEmbeddedXLS(InputStream fis_param, String embDIR) throws IOException, InvalidFormatException {
    // HSSF - XLS
    int i = 0;
    System.out.println("Starting Embedded Search in xls...");
    POIFSFileSystem fs = new POIFSFileSystem(fis_param); // create FileSystem using fileInputStream
    HSSFWorkbook workbook = new HSSFWorkbook(fs);
    for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
        System.out.println("Objects : " + obj.getOLE2ClassName()); // the OLE2 Class Name of the object
        String oleName = obj.getOLE2ClassName(); // Document Type
        DirectoryNode dn = (DirectoryNode) obj.getDirectory(); // get Directory Node
        // Trying to create an input Stream with the embedded document, argument of createDocumentInputStream should be: String; Where/How can I get this correct parameter for the function?
        InputStream is = dn.createDocumentInputStream(dn); // This line is incorrect! How can I do it correctly?
        FileOutputStream fos = new FileOutputStream("embDIR" + i); // Outputfilepath + Number
        IOUtils.copy(is, fos); // FileInputStream > FileOutput Stream (save File without extension)
        i++;
    }
}
So my simple question is:
Is it possible to save ALL attachments from an xls file without any extension (as simply as possible)? And can anyone provide me with a solution? Many thanks!
I need to append contents to an existing excel file using JExcel.
I am trying the following approach:
Read from existing workbook
workbook = Workbook.getWorkbook(new File(errorFilePath));
Create a writable workbook from the existing workbook in a temp file
if (!tempFile.exists()) {
tempFile.getParentFile().mkdirs();
tempFile.createNewFile();
}
newCopy = Workbook.createWorkbook(tempFile, workbook);
excelSheet = newCopy.getSheet(0);
Write to the writable workbook (times is a writable cell format variable)
Label label;
label = new Label(column, row, stringData, times);
excelSheet.addCell(label);
In the finally block: close both the existing and the writable workbook, delete the existing workbook, then rename the temp file to the existing (now deleted) workbook's name
finally {
if (null != newCopy) {
newCopy.write();
newCopy.close();
}
if (null != workbook) {
workbook.close();
}
if (null != errorFile && errorFile.exists()) {
errorFile.delete();
}
if (null != tempFile) {
tempFile.renameTo(new File(errorFilePath));
}
}
The problem is that everything works fine on the first run (without redeploying).
But whenever I change some Java code and the web application redeploys, I get a NullPointerException while closing the newly created workbook (after writing).
I am getting the following stack trace (originating from the line newCopy.write()):
java.lang.NullPointerException
at jxl.write.biff.CellValue.getData(CellValue.java:259)
at jxl.write.biff.LabelRecord.getData(LabelRecord.java:141)
at jxl.biff.WritableRecordData.getBytes(WritableRecordData.java:71)
at jxl.write.biff.File.write(File.java:147)
at jxl.write.biff.RowRecord.writeCells(RowRecord.java:329)
at jxl.write.biff.SheetWriter.write(SheetWriter.java:479)
at jxl.write.biff.WritableSheetImpl.write(WritableSheetImpl.java:1514)
at jxl.write.biff.WritableWorkbookImpl.write(WritableWorkbookImpl.java:950)
Java Version : 1.6
JExcel Version : 2.6.10
Windows 7
Well, first suspicion is, in this line:
label = new Label(column, row, stringData, times);
you pass null argument(s).
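One way to test that suspicion is to fail fast, or substitute a default, before the cell is created. The sketch below is JDK-only and does not call the jxl API. `CellArgs`, `textOrEmpty` and `requireFormat` are hypothetical helpers, and which argument (if any) is actually null in the question is an assumption.

```java
import java.util.Objects;

final class CellArgs {

    // Replaces a null cell text with an empty string so the writer never sees null.
    static String textOrEmpty(String value) {
        return value == null ? "" : value;
    }

    // Fails at the call site with a clear message instead of later inside write().
    static <T> T requireFormat(T format) {
        return Objects.requireNonNull(format, "cell format must not be null");
    }

    public static void main(String[] args) {
        System.out.println("[" + textOrEmpty(null) + "]"); // prints []
        System.out.println(requireFormat("times"));        // prints times
    }
}
```

In the question's code this would read new Label(column, row, CellArgs.textOrEmpty(stringData), CellArgs.requireFormat(times)), so a null format is reported where the cell is built rather than deep inside newCopy.write().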
I faced the same issue.
I was trying to add rows to the sheet dynamically in a loop using insertRow. After spending several hours on it, I concluded it was probably a bug in the latest version of the jxl API.
Versions of the JXL API after 2.6.9 seem to have a bug in insertRow. I switched from 2.6.12 back to 2.6.9.