How to open a huge Excel file efficiently - Java

I have a 150MB one-sheet Excel file that takes about 7 minutes to open on a very powerful machine using the following:
# using python
import xlrd
wb = xlrd.open_workbook(file)
sh = wb.sheet_by_index(0)
Is there any way to open the Excel file quicker? I'm open to even very outlandish suggestions (such as Hadoop, Spark, C, Java, etc.). Ideally I'm looking for a way to open the file in under 30 seconds, if that's not a pipe dream. Also, the above example uses Python, but it doesn't have to be Python.
Note: this is an Excel file from a client. It cannot be converted into any other format before we receive it. It is not our file.
UPDATE: An answer with a working code example that opens the following 200MB Excel file in under 30 seconds will be rewarded with a bounty: https://drive.google.com/file/d/0B_CXvCTOo7_2VW9id2VXRWZrbzQ/view?usp=sharing. This file should have string (col 1), date (col 9), and number (col 11) data.

Most programming languages that work with Office products have some middle layer, and this is usually where the bottleneck is; good examples are the PIAs/Interop and the Open XML SDK.
One way to get the data at a lower level (bypassing the middle layer) is using a Driver.
"150MB one-sheet excel file that takes about 7 minutes."
The best I could do is a 130MB file in 135 seconds, roughly 3 times faster:
Stopwatch sw = new Stopwatch();
sw.Start();
DataSet excelDataSet = new DataSet();
string filePath = @"c:\temp\BigBook.xlsx";
// For .xlsx we use Provider=Microsoft.ACE.OLEDB.12.0; for .xls we'd use Microsoft.Jet.OLEDB.4.0 with Extended Properties="Excel 8.0;HDR=YES;"
string connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source='" + filePath + "';Extended Properties=\"Excel 12.0;HDR=YES;\"";
using (OleDbConnection conn = new OleDbConnection(connectionString))
{
conn.Open();
OleDbDataAdapter objDA = new System.Data.OleDb.OleDbDataAdapter
("select * from [Sheet1$]", conn);
objDA.Fill(excelDataSet);
//dataGridView1.DataSource = excelDataSet.Tables[0];
}
sw.Stop();
Debug.Print("Load XLSX took: " + sw.ElapsedMilliseconds + " millisecs. Records = " + excelDataSet.Tables[0].Rows.Count);
Win 7 x64, Intel i5 2.3GHz, 8GB RAM, 250GB SSD.
If I could recommend a hardware solution as well: try to resolve it with an SSD if you're using standard HDDs.
Note: I can't download your example Excel spreadsheet as I'm behind a corporate firewall.
PS. See MSDN - Fastest Way to import xlsx files with 200 MB of Data, the consensus being that OleDB is the fastest.
PS 2. Here's how you can do it with Python:
http://code.activestate.com/recipes/440661-read-tabular-data-from-excel-spreadsheets-the-fast/

I managed to read the file in about 30 seconds using .NET Core and the Open XML SDK.
The following example returns a list of objects containing all rows and cells with the matching types; it supports date, numeric and text cells. The project is available here: https://github.com/xferaa/BigSpreadSheetExample/ (it should work on Windows, Linux and Mac OS and does not require Excel or any Excel component to be installed).
public List<List<object>> ParseSpreadSheet()
{
List<List<object>> rows = new List<List<object>>();
using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(filePath, false))
{
WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
OpenXmlReader reader = OpenXmlReader.Create(worksheetPart);
Dictionary<int, string> sharedStringCache = new Dictionary<int, string>();
int i = 0;
foreach (var el in workbookPart.SharedStringTablePart.SharedStringTable.ChildElements)
{
sharedStringCache.Add(i++, el.InnerText);
}
while (reader.Read())
{
if(reader.ElementType == typeof(Row))
{
reader.ReadFirstChild();
List<object> cells = new List<object>();
do
{
if (reader.ElementType == typeof(Cell))
{
Cell c = (Cell)reader.LoadCurrentElement();
if (c == null || c.DataType == null || !c.DataType.HasValue)
continue;
object value;
switch(c.DataType.Value)
{
case CellValues.Boolean:
value = bool.Parse(c.CellValue.InnerText);
break;
case CellValues.Date:
value = DateTime.Parse(c.CellValue.InnerText);
break;
case CellValues.Number:
value = double.Parse(c.CellValue.InnerText);
break;
case CellValues.InlineString:
case CellValues.String:
value = c.CellValue.InnerText;
break;
case CellValues.SharedString:
value = sharedStringCache[int.Parse(c.CellValue.InnerText)];
break;
default:
continue;
}
if (value != null)
cells.Add(value);
}
} while (reader.ReadNextSibling());
if (cells.Any())
rows.Add(cells);
}
}
}
return rows;
}
I ran the program on a three-year-old laptop with an SSD drive, 8GB of RAM and an Intel Core i7-4710 CPU @ 2.50GHz (two cores) on 64-bit Windows 10.
Note that although opening and parsing the whole file as strings takes a bit less than 30 seconds, when using objects as in the example of my last edit, the time goes up to almost 50 seconds on my crappy laptop. You will probably get closer to 30 seconds on your server running Linux.
The trick was to use the SAX approach as explained here:
https://msdn.microsoft.com/en-us/library/office/gg575571.aspx

Well, if your Excel file is going to be as simple as a CSV file, like your example (https://drive.google.com/file/d/0B_CXvCTOo7_2UVZxbnpRaEVnaFk/view?usp=sharing), you can try to open the file as a zip file and read every XML directly:
Intel i5 4460, 12 GB RAM, SSD Samsung EVO PRO.
If you have a lot of RAM:
This code needs a lot of RAM, but it takes 20-25 seconds. (You need the JVM parameter -Xmx7g.)
package com.devsaki.opensimpleexcel;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;
public class Multithread {
public static final char CHAR_END = (char) -1;
public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
String excelFile = "C:/Downloads/BigSpreadsheetAllTypes.xlsx";
ZipFile zipFile = new ZipFile(excelFile);
long init = System.currentTimeMillis();
ExecutorService executor = Executors.newFixedThreadPool(4);
char[] sheet1 = readEntry(zipFile, "xl/worksheets/sheet1.xml").toCharArray();
Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(new CharReader(sheet1), executor));
char[] sharedString = readEntry(zipFile, "xl/sharedStrings.xml").toCharArray();
Future<String[]> futureWords = executor.submit(() -> processSharedStrings(new CharReader(sharedString)));
Object[][] sheet = futureSheet1.get();
String[] words = futureWords.get();
executor.shutdown();
long end = System.currentTimeMillis();
System.out.println("only read: " + (end - init) / 1000);
// Doing something with the file: saving as CSV
init = System.currentTimeMillis();
try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
for (Object[] rows : sheet) {
for (Object cell : rows) {
if (cell != null) {
if (cell instanceof Integer) {
writer.append(words[(Integer) cell]);
} else if (cell instanceof String) {
writer.append(toDate(Double.parseDouble(cell.toString())));
} else {
writer.append(cell.toString()); //Probably a number
}
}
writer.append(";");
}
writer.append("\n");
}
}
end = System.currentTimeMillis();
System.out.println("Main saving to csv: " + (end - init) / 1000);
}
private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);
// Excel stores date/time values as day counts from 1900-01-01, so each numeric value must be added to that base date
public static String toDate(double s) {
return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
}
public static String readEntry(ZipFile zipFile, String entry) throws IOException {
System.out.println("Initialing readEntry " + entry);
long init = System.currentTimeMillis();
String result = null;
try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
br.readLine();
result = br.readLine();
}
long end = System.currentTimeMillis();
System.out.println("readEntry '" + entry + "': " + (end - init) / 1000);
return result;
}
public static String[] processSharedStrings(CharReader br) throws IOException {
System.out.println("Initialing processSharedStrings");
long init = System.currentTimeMillis();
String[] words = null;
char[] wordCount = "Count=\"".toCharArray();
char[] token = "<t>".toCharArray();
String uniqueCount = extractNextValue(br, wordCount, '"');
words = new String[Integer.parseInt(uniqueCount)];
String nextWord;
int currentIndex = 0;
while ((nextWord = extractNextValue(br, token, '<')) != null) {
words[currentIndex++] = nextWord;
br.skip(11); //you can skip at least 11 chars "/t></si><si>"
}
long end = System.currentTimeMillis();
System.out.println("SharedStrings: " + (end - init) / 1000);
return words;
}
public static Object[][] processSheet1(CharReader br, ExecutorService executorService) throws IOException, ExecutionException, InterruptedException {
System.out.println("Initialing processSheet1");
long init = System.currentTimeMillis();
char[] dimensionToken = "dimension ref=\"".toCharArray();
String dimension = extractNextValue(br, dimensionToken, '"');
int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
br.skip(30); // Between the dimension tag and the next c tag there are roughly 30 chars
Object[][] result = new Object[sizes[0]][sizes[1]];
int parallelProcess = 8;
int currentIndex = br.currentIndex;
CharReader[] charReaders = new CharReader[parallelProcess];
int totalChars = Math.round(br.chars.length / parallelProcess);
for (int i = 0; i < parallelProcess; i++) {
int endIndex = currentIndex + totalChars;
charReaders[i] = new CharReader(br.chars, currentIndex, endIndex, i);
currentIndex = endIndex;
}
Future[] futures = new Future[parallelProcess];
for (int i = charReaders.length - 1; i >= 0; i--) {
final int j = i;
futures[i] = executorService.submit(() -> inParallelProcess(charReaders[j], j == 0 ? null : charReaders[j - 1], result));
}
for (Future future : futures) {
future.get();
}
long end = System.currentTimeMillis();
System.out.println("Sheet1: " + (end - init) / 1000);
return result;
}
public static void inParallelProcess(CharReader br, CharReader back, Object[][] result) {
System.out.println("Initialing inParallelProcess : " + br.identifier);
char[] tokenOpenC = "<c r=\"".toCharArray();
char[] tokenOpenV = "<v>".toCharArray();
char[] tokenAttributS = " s=\"".toCharArray();
char[] tokenAttributT = " t=\"".toCharArray();
String v;
int firstCurrentIndex = br.currentIndex;
boolean first = true;
while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
if (first && back != null) {
int sum = br.currentIndex - firstCurrentIndex - tokenOpenC.length - v.length() - 1;
first = false;
System.out.println("Adding to : " + back.identifier + " From : " + br.identifier);
back.plusLength(sum);
}
int[] indexes = extractSizeFromDimention(v);
int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
char type = 's'; //3 types: number (n), string (s) and date (d)
if (s == 0) { // Token S = number or date
char read = br.read();
if (read == '1') {
type = 'n';
} else {
type = 'd';
}
} else if (s == -1) {
type = 'n';
}
String c = extractNextValue(br, tokenOpenV, '<');
Object value = null;
switch (type) {
case 'n':
value = Double.parseDouble(c);
break;
case 's':
try {
value = Integer.parseInt(c);
} catch (Exception ex) {
System.out.println("Identifier Error : " + br.identifier);
}
break;
case 'd':
value = c.toString();
break;
}
result[indexes[0] - 1][indexes[1] - 1] = value;
br.skip(7); ///v></c>
}
}
static class CharReader {
char[] chars;
int currentIndex;
int length;
int identifier;
public CharReader(char[] chars) {
this.chars = chars;
this.length = chars.length;
}
public CharReader(char[] chars, int currentIndex, int length, int identifier) {
this.chars = chars;
this.currentIndex = currentIndex;
if (length > chars.length) {
this.length = chars.length;
} else {
this.length = length;
}
this.identifier = identifier;
}
public void plusLength(int n) {
if (this.length + n <= chars.length) {
this.length += n;
}
}
public char read() {
if (currentIndex >= length) {
return CHAR_END;
}
return chars[currentIndex++];
}
public void skip(int n) {
currentIndex += n;
}
}
public static int[] extractSizeFromDimention(String dimention) {
StringBuilder sb = new StringBuilder();
int columns = 0;
int rows = 0;
for (char c : dimention.toCharArray()) {
if (columns == 0) {
if (Character.isDigit(c)) {
columns = convertExcelIndex(sb.toString());
sb = new StringBuilder();
}
}
sb.append(c);
}
rows = Integer.parseInt(sb.toString());
return new int[]{rows, columns};
}
public static int foundNextTokens(CharReader br, char until, char[]... tokens) {
char character;
int[] indexes = new int[tokens.length];
while ((character = br.read()) != CHAR_END) {
if (character == until) {
break;
}
for (int i = 0; i < indexes.length; i++) {
if (tokens[i][indexes[i]] == character) {
indexes[i]++;
if (indexes[i] == tokens[i].length) {
return i;
}
} else {
indexes[i] = 0;
}
}
}
return -1;
}
public static String extractNextValue(CharReader br, char[] token, char until) {
char character;
StringBuilder sb = new StringBuilder();
int index = 0;
while ((character = br.read()) != CHAR_END) {
if (index == token.length) {
if (character == until) {
return sb.toString();
} else {
sb.append(character);
}
} else {
if (token[index] == character) {
index++;
} else {
index = 0;
}
}
}
return null;
}
public static int convertExcelIndex(String index) {
int result = 0;
for (char c : index.toCharArray()) {
result = result * 26 + ((int) c - (int) 'A' + 1);
}
return result;
}
}
Old answer (does not need the -Xmx7g parameter, so it takes less memory):
It takes about 35 seconds to open and read the example file (200MB) with an HDD; with an SSD it takes a little less (30 seconds).
Here is the code:
https://github.com/csaki/OpenSimpleExcelFast.git
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.nio.charset.Charset;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipFile;
public class Launcher {
public static final char CHAR_END = (char) -1;
public static void main(String[] args) throws IOException, ExecutionException, InterruptedException {
long init = System.currentTimeMillis();
String excelFile = "D:/Downloads/BigSpreadsheet.xlsx";
ZipFile zipFile = new ZipFile(excelFile);
ExecutorService executor = Executors.newFixedThreadPool(4);
Future<String[]> futureWords = executor.submit(() -> processSharedStrings(zipFile));
Future<Object[][]> futureSheet1 = executor.submit(() -> processSheet1(zipFile));
String[] words = futureWords.get();
Object[][] sheet1 = futureSheet1.get();
executor.shutdown();
long end = System.currentTimeMillis();
System.out.println("Main only open and read: " + (end - init) / 1000);
// Doing something with the file: saving as CSV
init = System.currentTimeMillis();
try (PrintWriter writer = new PrintWriter(excelFile + ".csv", "UTF-8");) {
for (Object[] rows : sheet1) {
for (Object cell : rows) {
if (cell != null) {
if (cell instanceof Integer) {
writer.append(words[(Integer) cell]);
} else if (cell instanceof String) {
writer.append(toDate(Double.parseDouble(cell.toString())));
} else {
writer.append(cell.toString()); //Probably a number
}
}
writer.append(";");
}
writer.append("\n");
}
}
end = System.currentTimeMillis();
System.out.println("Main saving to csv: " + (end - init) / 1000);
}
private static final DateTimeFormatter formatter = DateTimeFormatter.ISO_DATE_TIME;
private static final LocalDateTime INIT_DATE = LocalDateTime.parse("1900-01-01T00:00:00+00:00", formatter).plusDays(-2);
// Excel stores date/time values as day counts from 1900-01-01, so each numeric value must be added to that base date
public static String toDate(double s) {
return formatter.format(INIT_DATE.plusSeconds((long) ((s*24*3600))));
}
public static Object[][] processSheet1(ZipFile zipFile) throws IOException {
String entry = "xl/worksheets/sheet1.xml";
Object[][] result = null;
char[] dimensionToken = "dimension ref=\"".toCharArray();
char[] tokenOpenC = "<c r=\"".toCharArray();
char[] tokenOpenV = "<v>".toCharArray();
char[] tokenAttributS = " s=\"".toCharArray();
char[] tokenAttributT = " t=\"".toCharArray();
try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
String dimension = extractNextValue(br, dimensionToken, '"');
int[] sizes = extractSizeFromDimention(dimension.split(":")[1]);
br.skip(30); // Between the dimension tag and the next c tag there are roughly 30 chars
result = new Object[sizes[0]][sizes[1]];
String v;
while ((v = extractNextValue(br, tokenOpenC, '"')) != null) {
int[] indexes = extractSizeFromDimention(v);
int s = foundNextTokens(br, '>', tokenAttributS, tokenAttributT);
char type = 's'; //3 types: number (n), string (s) and date (d)
if (s == 0) { // Token S = number or date
char read = (char) br.read();
if (read == '1') {
type = 'n';
} else {
type = 'd';
}
} else if (s == -1) {
type = 'n';
}
String c = extractNextValue(br, tokenOpenV, '<');
Object value = null;
switch (type) {
case 'n':
value = Double.parseDouble(c);
break;
case 's':
value = Integer.parseInt(c);
break;
case 'd':
value = c.toString();
break;
}
result[indexes[0] - 1][indexes[1] - 1] = value;
br.skip(7); ///v></c>
}
}
return result;
}
public static int[] extractSizeFromDimention(String dimention) {
StringBuilder sb = new StringBuilder();
int columns = 0;
int rows = 0;
for (char c : dimention.toCharArray()) {
if (columns == 0) {
if (Character.isDigit(c)) {
columns = convertExcelIndex(sb.toString());
sb = new StringBuilder();
}
}
sb.append(c);
}
rows = Integer.parseInt(sb.toString());
return new int[]{rows, columns};
}
public static String[] processSharedStrings(ZipFile zipFile) throws IOException {
String entry = "xl/sharedStrings.xml";
String[] words = null;
char[] wordCount = "Count=\"".toCharArray();
char[] token = "<t>".toCharArray();
try (BufferedReader br = new BufferedReader(new InputStreamReader(zipFile.getInputStream(zipFile.getEntry(entry)), Charset.forName("UTF-8")))) {
String uniqueCount = extractNextValue(br, wordCount, '"');
words = new String[Integer.parseInt(uniqueCount)];
String nextWord;
int currentIndex = 0;
while ((nextWord = extractNextValue(br, token, '<')) != null) {
words[currentIndex++] = nextWord;
br.skip(11); //you can skip at least 11 chars "/t></si><si>"
}
}
return words;
}
public static int foundNextTokens(BufferedReader br, char until, char[]... tokens) throws IOException {
char character;
int[] indexes = new int[tokens.length];
while ((character = (char) br.read()) != CHAR_END) {
if (character == until) {
break;
}
for (int i = 0; i < indexes.length; i++) {
if (tokens[i][indexes[i]] == character) {
indexes[i]++;
if (indexes[i] == tokens[i].length) {
return i;
}
} else {
indexes[i] = 0;
}
}
}
return -1;
}
public static String extractNextValue(BufferedReader br, char[] token, char until) throws IOException {
char character;
StringBuilder sb = new StringBuilder();
int index = 0;
while ((character = (char) br.read()) != CHAR_END) {
if (index == token.length) {
if (character == until) {
return sb.toString();
} else {
sb.append(character);
}
} else {
if (token[index] == character) {
index++;
} else {
index = 0;
}
}
}
return null;
}
public static int convertExcelIndex(String index) {
int result = 0;
for (char c : index.toCharArray()) {
result = result * 26 + ((int) c - (int) 'A' + 1);
}
return result;
}
}

Python's Pandas library could be used to hold and process your data, but using it to directly load the .xlsx file will be quite slow, e.g. using read_excel().
One approach would be to use Python to automate the conversion of your file into CSV using Excel itself, and then use Pandas to load the resulting CSV file using read_csv(). This will give you a good speed-up, but not under 30 seconds:
import win32com.client as win32
import pandas as pd
from datetime import datetime
print ("Starting")
start = datetime.now()
# Use Excel to load the xlsx file and save it in csv format
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(r'c:\full path\BigSpreadsheet.xlsx')
excel.DisplayAlerts = False
wb.DoNotPromptForConvert = True
wb.CheckCompatibility = False
print('Saving')
wb.SaveAs(r'c:\full path\temp.csv', FileFormat=6, ConflictResolution=2)
excel.Application.Quit()
# Use Pandas to load the resulting CSV file
print('Loading CSV')
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str)
print(df.shape)
print("Done", datetime.now() - start)
Column types
The types for your columns can be specified by passing dtype, converters and parse_dates:
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[8], infer_datetime_format=True)
You should also specify infer_datetime_format=True, as this will greatly speed up the date conversion.
infer_datetime_format : boolean, default False
If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.
Also add dayfirst=True if dates are in the form DD/MM/YYYY.
Selective columns
If you actually only need to work on columns 1, 9 and 11, then you could further reduce resources by specifying usecols=[0, 8, 10] as follows:
df = pd.read_csv(r'c:\full path\temp.csv', dtype=str, converters={10:int}, parse_dates=[1], dayfirst=True, infer_datetime_format=True, usecols=[0, 8, 10])
The resulting dataframe would then only contain those 3 columns of data.
RAM drive
Using a RAM drive to store the temporary CSV file would further speed up the load time.
Note: This does assume you are using a Windows PC with Excel available.

I have created a sample Java program which is able to load the file in ~40 seconds on my laptop (Intel i7, 4 cores, 16 GB RAM).
https://github.com/skadyan/largefile
This program uses the Apache POI library to load the .xlsx file using the XSSF SAX API.
An implementation of the callback interface com.stackoverlfow.largefile.RecordHandler can be used to process the data loaded from the Excel file. The interface defines only one method, which takes three arguments:
sheet name: String, the Excel sheet name
row number: int, the row number of the data
data map: Map from Excel cell reference to Excel formatted cell value
The class com.stackoverlfow.largefile.Main demonstrates a basic implementation of this interface which just prints the row number to the console.
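For reference, here is a hypothetical sketch of what that callback interface could look like, reconstructed from the description above; the actual definition lives in the linked repository and may differ in naming and detail:
import java.util.Map;

public interface RecordHandler {
    // sheetName: the Excel sheet name
    // rowNumber: the row number of the data
    // data: map from Excel cell reference (e.g. "A1") to the formatted cell value
    void handle(String sheetName, int rowNumber, Map<String, String> data);
}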
Update
The Woodstox parser seems to have better performance than the standard SAXReader (code updated in the repo).
Also, in order to meet the desired performance requirement, you may consider re-implementing org.apache.poi...XSSFSheetXMLHandler. In that implementation, more optimized string/text value handling can be added and unnecessary text formatting operations can be skipped.

I'm using a Dell Precision T1700 workstation, and using C# I was able to open the file and read its contents in about 24 seconds, just using standard code to open a workbook via interop services. Using a reference to the Microsoft Excel 15.0 Object Library, here is my code.
My using statements:
using System.Runtime.InteropServices;
using Excel = Microsoft.Office.Interop.Excel;
Code to open and read workbook:
public partial class MainWindow : Window {
public MainWindow() {
InitializeComponent();
Excel.Application xlApp;
Excel.Workbook wb;
Excel.Worksheet ws;
xlApp = new Excel.Application();
xlApp.Visible = false;
xlApp.ScreenUpdating = false;
wb = xlApp.Workbooks.Open(@"Desired Path of workbook\Copy of BigSpreadsheet.xlsx");
ws = wb.Sheets["Sheet1"];
//string rng = ws.get_Range("A1").Value;
MessageBox.Show(ws.get_Range("A1").Value);
Marshal.FinalReleaseComObject(ws);
wb.Close();
Marshal.FinalReleaseComObject(wb);
xlApp.Quit();
Marshal.FinalReleaseComObject(xlApp);
GC.Collect();
GC.WaitForPendingFinalizers();
}
}

Looks like it is hardly achievable in Python at all. If we unpack the sheet data file, it takes all of the required 30 seconds just to pass it through the C-based iterative SAX parser (using lxml, a very fast wrapper over libxml2):
from __future__ import print_function
from lxml import etree
import time
start_ts = time.time()
for data in etree.iterparse(open('xl/worksheets/sheet1.xml'), events=('start',),
collect_ids=False, resolve_entities=False,
huge_tree=True):
pass
print(time.time() - start_ts)
The sample output: 27.2134890556
By the way, Excel itself needs about 40 seconds to load the workbook.

The C# and OLE solution still has some bottlenecks, so I tested it with C++ and ADO.
_bstr_t connStr(makeConnStr(excelFile, header).c_str());
TESTHR(pRec.CreateInstance(__uuidof(Recordset)));
TESTHR(pRec->Open(sqlSelectSheet(connStr, sheetIndex).c_str(), connStr, adOpenStatic, adLockOptimistic, adCmdText));
while(!pRec->adoEOF)
{
for(long i = 0; i < pRec->Fields->GetCount(); ++i)
{
_variant_t v = pRec->Fields->GetItem(i)->Value;
if(v.vt == VT_R8)
num[i] = v.dblVal;
if(v.vt == VT_BSTR)
str[i] = v.bstrVal;
++cellCount;
}
pRec->MoveNext();
}
On an i5-4460 machine with an HDD, I find that 500 thousand cells in .xls take 1.5s, but the same data in .xlsx takes 2.829s, so it should be possible to manipulate your data in under 30s.
If you really need under 30s, use a RAM drive to reduce file I/O. It will significantly improve your process.
I cannot download your data to test it, so please tell me the result.

Another way that should largely improve the load/operation time is a RAM drive:
create a RAM drive with enough space for your file and 10%-20% extra space...
copy the file to the RAM drive (a small sketch of this step follows below)...
load the file from there... depending on your drive and filesystem,
the speed improvement should be huge...
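A minimal Java sketch of the copy step, assuming the RAM drive is mounted as R: (both paths are placeholders):
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class CopyToRamDrive {
    public static void main(String[] args) throws Exception {
        Path src = Paths.get("C:\\data\\BigSpreadsheet.xlsx"); // file on the slow disk
        Path dst = Paths.get("R:\\BigSpreadsheet.xlsx");       // destination on the RAM drive
        Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
        // now open the copy at dst with whatever reader you prefer
    }
}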
My favourite is the IMDisk toolkit
(https://sourceforge.net/projects/imdisk-toolkit/)
here you have a powerful command line to script everything...
I also recommend SoftPerfect RAM Disk
(http://www.majorgeeks.com/files/details/softperfect_ram_disk.html)
but that also depends on your OS...

I would like to have more info about the system where you
are opening the file... anyway:
look in your system for a Windows update called
"Office File Validation Add-In for Office ..."
if you have it... uninstall it...
the file should load much more quickly,
especially if it is loaded from a share

Have you tried loading the worksheet on demand, which has been available since version 0.7.1 of xlrd?
To do this you need to pass on_demand=True to open_workbook().
xlrd.open_workbook(filename=None, logfile=<_io.TextIOWrapper
name='' mode='w' encoding='UTF-8'>, verbosity=0, use_mmap=1,
file_contents=None, encoding_override=None, formatting_info=False,
on_demand=False, ragged_rows=False)
Other potential python solutions I found for reading an xlsx file:
Read the raw xml in 'xl/sharedStrings.xml' and 'xl/worksheets/sheet1.xml'
Try the openpyxl library's read-only mode, which claims to be optimized in memory usage for large files:
from openpyxl import load_workbook
wb = load_workbook(filename='large_file.xlsx', read_only=True)
ws = wb['big_data']
for row in ws.rows:
    for cell in row:
        print(cell.value)
If you are running on Windows you could use PyWin32 and 'Excel.Application'
import time
import win32com.client as win32
def excel():
xl = win32.gencache.EnsureDispatch('Excel.Application')
ss = xl.Workbooks.Add()
...

Related

How to read the first n rows from a Parquet file in Java without downloading the entire file

My requirement was to read a Parquet file from S3/SFTP/FTP, read a few rows from the file, and write them to a CSV file.
Since I didn't find any generic solution to read a Parquet file directly from S3/SFTP/FTP, I am downloading the Parquet file to my local machine using an InputStream.
File tmp = null;
File parquetFile = null;
try {
tmp = File.createTempFile("csvFile", ".csv");
parquetFile = File.createTempFile("partquetFile",".parquet");
//downloading file to local
StreamUtils.dumpToDisk(parquetFile, feed.getInputStream());
parquetReaderUtils.parquetReader(new org.apache.hadoop.fs.Path(parquetFile.getAbsolutePath()), tmp);
} catch(IOException e){
System.out.println("Error reading parquet file.");
}
finally {
FileUtils.deleteQuietly(tmp);
FileUtils.deleteQuietly(parquetFile);
}
Once the file is downloaded, I call the parquetReader() method of the ParquetReaderUtils class to read the file from the local path and write the first 5 rows of the Parquet file to a CSV file.
Below is the ParquetReaderUtils class definition:
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Type;
import org.springframework.stereotype.Component;
import java.io.*;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.temporal.JulianFields;
@Component
public class ParquetReaderUtils {
private static final String CSV_DELIMITER = ",";
// Reading parquet file from local and writing first 5 rows to csv file.
public void parquetReader(org.apache.hadoop.fs.Path path, File csvOutputFile, InputStream in) throws IllegalArgumentException {
Configuration conf = new Configuration();
conf.addResource(in);
int headerRow = 0;
int rowsRead = 0;
try {
ParquetMetadata readFooter = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
MessageType schema = readFooter.getFileMetaData().getSchema();
ParquetFileReader r = new ParquetFileReader(conf, path, readFooter);
BufferedWriter w = new BufferedWriter(new FileWriter(csvOutputFile));
PageReadStore pages = null;
try {
while (null != (pages = r.readNextRowGroup())) {
final long rows = pages.getRowCount();
System.out.println("Number of rows: " + rows);
final MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
final RecordReader recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));
for (int i = 0; i <= 5; i++) {
final Group g = (Group) recordReader.read();
//printGroup(g);
writeGroup(w, g, schema, headerRow);
rowsRead++;
}
if(rowsRead==5)
break;
}
} finally {
r.close();
w.close();
}
} catch (IOException e) {
System.out.println("Error reading parquet file.");
e.printStackTrace();
}
}
// writing rows to csv file.
private static void writeGroup(BufferedWriter w, Group g, MessageType schema, int headerRow)
throws IOException {
if (headerRow < 1) {
for (int j = 0; j < schema.getFieldCount(); j++) {
if (j > 0) {
w.write(CSV_DELIMITER);
}
Type fieldType = g.getType().getType(j);
String fieldName = fieldType.getName();
w.write(fieldName);
}
w.write('\n');
headerRow++;
}
for (int j = 0; j < schema.getFieldCount(); j++) {
try {
if (j > 0) {
w.write(CSV_DELIMITER);
}
Type fieldType = g.getType().getType(j);
PrimitiveType pt = (PrimitiveType) g.getType().getFields().get(j);
int valueCount = g.getFieldRepetitionCount(j);
String valueToString = g.getValueToString(j, 0);
if (pt.getPrimitiveTypeName().name().equals("INT96")) {
for (int index = 0; index < valueCount; index++) {
if (fieldType.isPrimitive()) {
LocalDateTime dateTime = convertToDate(g.getInt96(j, index).getBytes());
valueToString = String.valueOf(dateTime);
}
}
}
w.write(valueToString);
} catch (Exception e) {
w.write("");
continue;
}
}
w.write('\n');
}
// Method to convert INT96 value to LocalDateTime.
private static LocalDateTime convertToDate(byte[] int96Bytes) {
// Find Julian day
int julianDay = 0;
int index = int96Bytes.length;
while (index > 8) {
index--;
julianDay <<= 8;
julianDay += int96Bytes[index] & 0xFF;
}
// Find nanos since midday (since Julian days start at midday)
long nanos = 0;
// Continue from the index we got to
while (index > 0) {
index--;
nanos <<= 8;
nanos += int96Bytes[index] & 0xFF;
}
LocalDateTime timestamp = LocalDate.MIN
.with(JulianFields.JULIAN_DAY, julianDay)
.atTime(LocalTime.NOON)
.plusNanos(nanos);
System.out.println("Timestamp: " + timestamp);
return timestamp;
}
}
Here I am downloading the entire file to the local system; if the size of the Parquet file is big, this solution is not scalable. Downloading the full file is not useful for me.
Is there any way to read a Parquet file directly from an InputStream, instead of downloading it to local storage and reading a local file?
You can download a single row group from the server and parse it later.
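For illustration, a minimal sketch of how the footer (already read in the question's code) exposes the byte range of each row group; with these offsets a client could fetch just one row group, e.g. via an HTTP Range request, instead of the whole file. The helper below is hypothetical glue code, not part of any library:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.format.converter.ParquetMetadataConverter;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.metadata.BlockMetaData;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;

public class RowGroupRanges {
    // Prints the byte offset and compressed size of every row group in the file.
    public static void printRanges(Configuration conf, Path path) throws IOException {
        ParquetMetadata footer = ParquetFileReader.readFooter(conf, path, ParquetMetadataConverter.NO_FILTER);
        for (BlockMetaData block : footer.getBlocks()) {
            System.out.println("row group at byte " + block.getStartingPos()
                    + ", compressed size " + block.getCompressedSize()
                    + ", rows " + block.getRowCount());
        }
    }
}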

Why does the stream position go to the end

I have a CSV file. After I overwrite one line with the Write method, every subsequent write to the file is appended at the end of the file, and not at the specific line.
using System.Collections;
using System.Collections.Generic;
using UnityEngine.UI;
using UnityEngine;
using System.Text;
using System.IO;
public class LoadQuestion : MonoBehaviour
{
int index;
string path;
FileStream file;
StreamReader reader;
StreamWriter writer;
public Text City;
public string[] allQuestion;
public string[] addedQuestion;
private void Start()
{
index = 0;
path = Application.dataPath + "/Files/Questions.csv";
allQuestion = File.ReadAllLines(path, Encoding.GetEncoding(1251));
file = new FileStream(path, FileMode.Open, FileAccess.ReadWrite);
writer = new StreamWriter(file, Encoding.GetEncoding(1251));
reader = new StreamReader(file, Encoding.GetEncoding(1251));
writer.AutoFlush = true;
List<string> _questions = new List<string>();
for (int i = 0; i < allQuestion.Length; i++)
{
char status = allQuestion[i][0];
if (status == '0')
{
_questions.Add(allQuestion[i]);
}
}
addedQuestion = _questions.ToArray();
City.text = ParseToCity(addedQuestion[0]);
}
private string ParseToCity(string current)
{
string _city = "";
string[] data = current.Split(';');
_city = data[2];
return _city;
}
private void OnApplicationQuit()
{
writer.Close();
reader.Close();
file.Close();
}
public void IKnow()
{
string[] quest = addedQuestion[index].Split(';');
int indexFromFile = int.Parse(quest[1]);
string questBeforeAnsver = "";
for (int i = 0; i < quest.Length; i++)
{
if (i == 0)
{
questBeforeAnsver += "1";
}
else
{
questBeforeAnsver += ";" + quest[i];
}
}
Debug.Log("indexFromFile : " + indexFromFile);
for (int i = 0; i < allQuestion.Length; i++)
{
if (i == indexFromFile)
{
writer.Write(questBeforeAnsver);
break;
}
else
{
reader.ReadLine();
}
}
reader.DiscardBufferedData();
reader.BaseStream.Seek(0, SeekOrigin.Begin);
if (index < addedQuestion.Length - 1)
{
index++;
}
City.text = ParseToCity(addedQuestion[index]);
}
}
The file contains lines of this type:
0;0;Africa
0;1;London
0;2;Paris
The bottom line is that this is a game, and only questions whose status is 0, that is, unanswered, are loaded from the file. If, during the game, the user clicks that he knows the answer, the corresponding line in the file is overwritten, only with the status changed from 0 to 1, so that when the game is replayed this question will not be loaded.
It turns out that the first question is overwritten successfully, and all subsequent ones are simply appended at the end of the file:
1;0;Africa
0;1;London
0;2;Paris1;1;London1;2;Paris
What's wrong?
The video shows everything in detail.

How can I improve the execution time? And is there any better way to read this file?

I am trying to split a text file with multiple threads. The file is 1 GB. I am reading the file char by char, and the execution time is 24 min 54 s. Instead of reading the file char by char, is there any better way that would reduce the execution time?
I'm having a hard time figuring out an approach that will reduce the execution time. Please also suggest any other better way to split a file with multiple threads. I am very new to Java.
Any help will be appreciated. :)
public class FileSplitter { // class name assumed; the original snippet omitted the class declaration
public static void main(String[] args) throws Exception {
long startTime = System.currentTimeMillis(); // declared here; it was used below but missing from the snippet
RandomAccessFile raf = new RandomAccessFile("D:\\sample\\file.txt", "r");
long numSplits = 10;
long sourceSize = raf.length();
System.out.println("file length:" + sourceSize);
long bytesPerSplit = sourceSize / numSplits;
long remainingBytes = sourceSize % numSplits;
int maxReadBufferSize = 9 * 1024;
List<String> filePositionList = new ArrayList<String>();
long startPosition = 0;
long endPosition = bytesPerSplit;
for (int i = 0; i < numSplits; i++) {
raf.seek(endPosition);
String strData = raf.readLine();
if (strData != null) {
endPosition = endPosition + strData.length();
}
String str = startPosition + "|" + endPosition;
if (sourceSize > endPosition) {
startPosition = endPosition;
endPosition = startPosition + bytesPerSplit;
} else {
break;
}
filePositionList.add(str);
}
for (int i = 0; i < filePositionList.size(); i++) {
String str = filePositionList.get(i);
String[] strArr = str.split("\\|");
String strStartPosition = strArr[0];
String strEndPosition = strArr[1];
long startPositionFile = Long.parseLong(strStartPosition);
long endPositionFile = Long.parseLong(strEndPosition);
MultithreadedSplit objMultithreadedSplit = new MultithreadedSplit(startPositionFile, endPositionFile);
objMultithreadedSplit.start();
}
long endTime = System.currentTimeMillis();
System.out.println("It took " + (endTime - startTime) + " milliseconds");
}
}
public class MultithreadedSplit extends Thread {
public static String filePath = "D:\\tenlakh\\file.txt";
private int localCounter = 0;
private long start;
private long end;
public static String outPath;
List<String> result = new ArrayList<String>();
public MultithreadedSplit(long startPos, long endPos) {
start = startPos;
end = endPos;
}
@Override
public void run() {
try {
String threadName = Thread.currentThread().getName();
long currentTime = System.currentTimeMillis();
RandomAccessFile file = new RandomAccessFile("D:\\sample\\file.txt", "r");
String outFile = "out_" + threadName + ".txt";
System.out.println("Thread Reading started for start:" + start + ";End:" + end+";threadname:"+threadName);
FileOutputStream out2 = new FileOutputStream("D:\\sample\\" + outFile);
file.seek(start);
int nRecordCount = 0;
char c = (char) file.read();
StringBuilder objBuilder = new StringBuilder();
int nCounter = 1;
while (c != -1) {
objBuilder.append(c);
// System.out.println("char-->" + c);
if (c == '\n') {
nRecordCount++;
out2.write(objBuilder.toString().getBytes());
objBuilder.delete(0, objBuilder.length());
//System.out.println("--->" + nRecordCount);
// break;
}
c = (char) file.read();
nCounter++;
if (nCounter > end) {
break;
}
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
The fastest way would be to map the file into memory segment by segment (mapping a large file as a whole may cause undesired side effects). It skips a few relatively expensive copy operations: the operating system loads the file into RAM, and the JRE exposes it to your application as a view into an off-heap memory area in the form of a ByteBuffer. It would usually allow you to squeeze the last 2x/3x out of the performance.
The memory-mapped way requires quite a bit of helper code (see the fragment at the bottom), so it's not always the best tactical choice. Instead, if your input is line-based and you just need reasonable performance (what you have now probably is not), then just do something like:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
...
Files.lines(Paths.get("/path/to/the/file"), StandardCharsets.ISO_8859_1)
// .parallel() // parallel processing is still possible
.forEach(line -> { /* your code goes here */ });
By contrast, a working example of code for processing the file via memory mapping would look something like the fragment below. In the case of fixed-size records (when segments can be selected precisely to match record boundaries), subsequent segments can be processed in parallel.
static ByteBuffer mapFileSegment(FileChannel fileChannel, long fileSize, long regionOffset, long segmentSize) throws IOException {
long regionSize = Math.min(segmentSize, fileSize - regionOffset);
// small last region prevention
final long remainingSize = fileSize - (regionOffset + regionSize);
if (remainingSize < segmentSize / 2) {
regionSize += remainingSize;
}
return fileChannel.map(FileChannel.MapMode.READ_ONLY, regionOffset, regionSize);
}
...
final ToIntFunction<ByteBuffer> consumer = ...
try (FileChannel fileChannel = FileChannel.open(Paths.get("/path/to/file"), StandardOpenOption.READ)) {
final long fileSize = fileChannel.size();
long regionOffset = 0;
while (regionOffset < fileSize) {
final ByteBuffer regionBuffer = mapFileSegment(fileChannel, fileSize, regionOffset, segmentSize);
while (regionBuffer.hasRemaining()) {
final int usedBytes = consumer.applyAsInt(regionBuffer);
if (usedBytes == 0)
break;
}
regionOffset += regionBuffer.position();
}
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}

Merging sorted Files using multithreading

Multithreading is new to me, so sorry for mistakes.
I have written the program below, which merges files with multithreading, but I am not able to figure out how to handle the last file and, after one iteration, how to merge the newly created files.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.ArrayList;
public class MergerSorter extends Thread {
int fileNumber = 1;
public static void main(String[] args) {
startMergingfiles(9);
}
public MergerSorter(int fileNum) {
fileNumber = fileNum;
}
public static void startMergingfiles(int numberOfFiles) {
int objectcounter = 0;
while (numberOfFiles != 1) {
try {
ArrayList<MergerSorter> objectList = new ArrayList<MergerSorter>();
for (int j = 1; j <= numberOfFiles; j = j + 2) {
if (numberOfFiles == j) {// Last Single remaining File
} else {
objectList.add(new MergerSorter(j));
objectList.get(objectcounter).start();
objectList.get(objectcounter).join();
objectcounter++;
}
}
objectcounter = 0;
numberOfFiles = numberOfFiles / 2;
} catch (Exception e) {
System.out.println(e);
}
}
}
public void run() {
try {
FileReader fileReader1 = new FileReader("src/externalsort/" + Integer.toString(fileNumber));
FileReader fileReader2 = new FileReader("src/externalsort/" + Integer.toString(fileNumber + 1));
BufferedReader bufferedReader1 = new BufferedReader(fileReader1);
BufferedReader bufferedReader2 = new BufferedReader(fileReader2);
String line1 = bufferedReader1.readLine();
String line2 = bufferedReader2.readLine();
FileWriter tmpFile = new FileWriter("src/externalsort/" + Integer.toString(fileNumber) + "op.txt", false);
int whichFileToRead = 0;
boolean file_1_reader = true;
boolean file_2_reader = true;
while (file_1_reader || file_2_reader) {
if (file_1_reader == false) {
tmpFile.write(line2 + "\r\n");
whichFileToRead = 2;
} else if (file_2_reader == false) {
tmpFile.write(line1 + "\r\n");
whichFileToRead = 1;
} else {
String value1 = line1.substring(0, 10);
String value2 = line2.substring(0, 10);
int ans = value1.compareTo(value2);
if (ans < 0) {
tmpFile.write(line1 + "\r\n");
whichFileToRead = 1;
} else if (ans > 0) {
tmpFile.write(line2 + "\r\n");
whichFileToRead = 2;
} else if (ans == 0) {
tmpFile.write(line1 + "\r\n");
whichFileToRead = 1;
}
}
if (whichFileToRead == 1) {
line1 = bufferedReader1.readLine();
if (line1 == null)
file_1_reader = false;
} else {
line2 = bufferedReader2.readLine();
if (line2 == null)
file_2_reader = false;
}
}
tmpFile.close();
bufferedReader1.close();
bufferedReader2.close();
fileReader1.close();
fileReader2.close();
} catch (Exception e) {
System.out.println(e);
}
}
}
I am trying to merge sorted files with multithreading. Say I have 50 files and I want to merge all these individual files into one final sorted file, but I want to speed things up and utilize every core by multithreading; however, I am not able to do it. And the files are big, so they can't be placed in heap/RAM, which is why I have to read every file and keep writing.
You can do this with merge sort, but instead of lots of little sorted lists, you'll need to use lots of little sorted files. Once you have broken all of the files down into small sorted files, you can start merging them together again until you end up with a single sorted file.
Unfortunately, you likely won't be able to achieve high CPU utilisation, as much of the time will be spent waiting for disk I/O to complete.
Edit: just read your response to a comment and it sounds like you are asking for help on the last step of the merge sort. The graphics in the wiki link above will also help you understand. So, assuming all of your files are sorted, here we go:
Read 1 item from each file
Figure out which one is the lowest/smallest/whatever and write that line to the result file
Read a new item from the file which just provided the last item
Repeat steps 2 and 3 until all files have been completely read (a minimal sketch of this loop follows below)
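For illustration, here is a minimal single-threaded sketch of that loop using a priority queue keyed on the current line of each file. The class and file names are placeholders, and it compares whole lines (the question's code compares a 10-character prefix), so treat it as a shape, not a drop-in answer:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    // Holds the current line of one input file together with its reader.
    private static class Entry {
        String line;
        final BufferedReader reader;
        Entry(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
    }

    public static void merge(List<String> inputFiles, String outputFile) throws IOException {
        // Orders entries by line content, so poll() always returns the smallest.
        PriorityQueue<Entry> queue = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        for (String name : inputFiles) {                   // step 1: read one item from each file
            BufferedReader reader = new BufferedReader(new FileReader(name));
            String line = reader.readLine();
            if (line != null) queue.add(new Entry(line, reader));
            else reader.close();
        }
        try (PrintWriter out = new PrintWriter(outputFile)) {
            while (!queue.isEmpty()) {
                Entry smallest = queue.poll();             // step 2: write the smallest line
                out.println(smallest.line);
                String next = smallest.reader.readLine();  // step 3: refill from the same file
                if (next != null) {
                    smallest.line = next;
                    queue.add(smallest);                   // step 4: repeat until all files are drained
                } else {
                    smallest.reader.close();
                }
            }
        }
    }
}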

Apache POI WorkbookFactory.create(new File()) java.lang.OutOfMemoryError

I'm trying to load an Excel file (.xlsx) into a Workbook object using Apache POI 3.10.
I'm receiving a java.lang.OutOfMemoryError.
I'm using Java 8 with the -Xmx2g argument on the JVM.
All 4 cores (64-bit system) and my RAM (4GB) are maxed out when I run the program.
The Excel sheet has 43 columns and 166,961 rows, which equals 7,179,323 cells.
I'm using Apache POI's WorkbookFactory.create(new File(...)) because it uses less memory than using an InputStream.
Does anyone have any ideas how to optimize memory usage, or another way to create the Workbook?
Below is my test Reader class, don't judge, it's rough and includes debugging statements:
import java.io.File;
import java.io.IOException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;
public class Reader {
private Workbook wb;
public Reader(File excel) {
System.out.println("CONSTRUCTOR");
wb = null;
try {
wb = WorkbookFactory.create(excel);
} catch (IOException e) {
System.out.println("IO Exception");
System.out.println(e.getMessage());
} catch (InvalidFormatException e) {
System.out.println("Invalid Format");
System.out.println(e.getMessage());
}
}
public boolean exists() { return (wb != null); }
public void print() {}
public static void main(String[] args) {
System.out.println("START PRG");
//File f = new File("oldfilename.xls");
File f = new File("filename.xlsx");
System.out.println("PATH:" + f.getAbsoluteFile());
if (!f.exists()) {
System.out.println("File does not exist.");
System.exit(0);
}
System.out.println("FILE");
Reader r = new Reader(f);
System.out.println("Reader");
r.print();
System.out.println("PRG DONE");
}
}
Apparently loading a 24MB file shouldn't be causing an OOM...
At first glance it appears to me that, though Xmx is set to 2G, there's actually not that much free memory in the system. In other words, the OS and other processes may have taken more than 2G out of the 4G of physical memory! Check the available physical memory first. In case it is below what's expected, try closing some other running apps/processes.
If that's not the case and there's indeed enough memory left, without profiling it's really hard to identify the real cause. Use a profiling tool to check the JVM status, related to memory first. You may simply use jconsole (as it comes with the JDK). See this on how to activate JMX.
Once you are connected, check readings related to memory, specifically the memory spaces below:
old gen
young gen
perm gen
Monitor these spaces and see where it's struggling. I assume this is a standalone application. In case it is deployed on a server (as web or services), you may consider the '-XX:NewRatio' option for distributing heap spaces effectively and efficiently. See tuning-related details here.
Please confirm these before proceeding:
Is there any infinite execution in a loop (for/while)?
Ensure your physical storage size.
Maximize buffer memory.
Note
As per my understanding, Apache POI should not consume that much memory.
I am just a beginner, but may I ask you some questions?
Why not use the XSSFWorkbook class to open the XLSX file? I mean, I always use it to handle XLSX files, and this time I tried it with a file (7 MB; that was the largest I could find on my computer), and it worked perfectly.
Why not use the newer file API (NIO, Java 7)? Again, I do not know if this will make any difference or not, but it worked for me.
Windows 7 Ultimate | 64 bit | Intel 2nd Gen Core i3 | Eclipse Juno | JDK 1.7.45 | Apache POI 3.9
Path file = Paths.get("XYZABC.xlsx");
try {
XSSFWorkbook wb = new XSSFWorkbook(Files.newInputStream(file, StandardOpenOption.READ));
} catch (IOException e) {
System.out.println("Some IO Error!!!!");
}
Do tell if it works for you or not.
Have you tried using SXSSFWorkbook? We also used Apache POI to handle relatively big XLSX files, and we also had memory problems when using the plain XSSFWorkbook, even though we didn't have to read files in, we were just writing tens of thousands of lines of information. Using this, our memory problems were solved. You can pass an XSSFWorkbook to its constructor, along with the number of rows you want to keep in memory (a small sketch follows below).
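A minimal write-side sketch of that idea, assuming POI is on the classpath; the window size of 100 rows and the file names are placeholders:
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class StreamingWriteSketch {
    public static void main(String[] args) throws IOException {
        // Keep only 100 rows in memory; older rows are flushed to a temp file on disk.
        SXSSFWorkbook wb = new SXSSFWorkbook(new XSSFWorkbook(), 100);
        Sheet sheet = wb.createSheet("big_data");
        for (int r = 0; r < 100_000; r++) {
            Row row = sheet.createRow(r);
            row.createCell(0).setCellValue("row " + r);
        }
        try (FileOutputStream out = new FileOutputStream("large_output.xlsx")) {
            wb.write(out);
        }
        wb.dispose(); // deletes the temporary files backing the flushed rows
    }
}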
Java 1.8
Based on the HSSF and XSSF Limitations.
My POI version is 3.17 (POI Examples).
This launches my code:
public class Controller {
EX stressTest;
public void fineFile() {
String stresstest = "C:\\Stresstest.xlsx";
HashMap<String, String[]> stressTestMap = new HashMap<>();
stressTestMap.put("aaaa", new String[]{"myField", "The field"});
stressTestMap.put("bbbb", new String[]{"other", "Other value"});
try {
InputStream stressTestIS = new FileInputStream(stresstest);
stressTest = new EX(stresstest, stressTestIS, stressTestMap);
} catch (IOException exp) {
}
}
public void printErr() {
if (stressTest.thereAreErrors()) {
try {
FileWriter myWriter = new FileWriter(
"C:\\logErrorsStressTest" +
(new SimpleDateFormat("ddMMyyyyHHmmss")).format(new Date()) +
".txt"
);
myWriter.write(stressTest.getBodyFileErrors());
myWriter.close();
} catch (IOException e) {
e.printStackTrace();
}
} else {
}
}
public void createBD() {
List<OneObjectWhatever> entitiesList =
(
!stressTest.thereAreErrors()
? ((List<OneObjectWhatever>) stressTest.toListCustomerObject(OneObjectWhatever.class))
: new ArrayList<>()
);
entitiesList.forEach(entity -> {
Field[] fields = entity.getClass().getDeclaredFields();
String valueString = "";
for (Field attr : fields) {
try {
attr.setAccessible(true);
valueString += " StressTest:" + attr.getName() + ": -" + attr.get(fields) + "- ";
attr.setAccessible(true);
} catch (Exception reflectionError) {
System.out.println(reflectionError);
}
}
});
}
}
MY CODE
public class EX {
private HashMap<Integer, HashMap<Integer, String> > rows;
private List<String> errors;
private int maxColOfHeader, minColOfHeader;
private HashMap<Integer, String> header;
private HashMap<String,String[]> relationHeaderClassPropertyDescription;
private void initVariables(String name, InputStream file) {
this.rows = new HashMap();
this.header = new HashMap<>();
this.errors = new ArrayList<String>(){{add("["+name+"] empty cells in position -> ");}};
try{
InputStream is = FileMagic.prepareToCheckMagic(file);
FileMagic fm = FileMagic.valueOf(is);
is.close();
switch (fm) {
case OLE2:
XLS2CSVmra xls2csv = new XLS2CSVmra(name, 50, rows);
xls2csv.process();
System.out.println("OLE2");
break;
case OOXML:
File flatFile = new File(name);
OPCPackage p = OPCPackage.open(flatFile, PackageAccess.READ);
XLSX2CSV xlsx2csv = new XLSX2CSV(p, System.out, 50, this.rows);
xlsx2csv.process();
p.close();
System.out.println("OOXML");
break;
default:
System.out.println("Your InputStream was neither an OLE2 stream, nor an OOXML stream");
break;
}
} catch (IOException | EncryptedDocumentException | SAXException | OpenXML4JException exp){
System.out.println(exp);
exp.printStackTrace();
}
int rowHeader = rows.keySet().stream().findFirst().get();
this.header.putAll(rows.get(rowHeader));
this.rows.remove(rowHeader);
this.minColOfHeader = this.header.keySet().stream().findFirst().get();
this.maxColOfHeader = this.header.entrySet().stream()
.mapToInt(e -> e.getKey()).max()
.orElseThrow(NoSuchElementException::new);
}
public EX(String name, InputStream file, HashMap<String,String[]> relationHeaderClassPropertyDescription_) {
this.relationHeaderClassPropertyDescription = relationHeaderClassPropertyDescription_;
initVariables(name, file);
validate();
}
private void validate(){
rows.forEach((inx,row) -> {
for(int i = minColOfHeader; i <= maxColOfHeader; i++) {
//System.out.println("r:"+inx+" c:"+i+" cr:"+(!row.containsKey(i))+" vr:"+((!row.containsKey(i)) || row.get(i).trim().isEmpty())+" ch:"+header.containsKey(i)+" vh:"+(header.containsKey(i) && (!header.get(i).trim().isEmpty()))+" val:"+(row.containsKey(i)&&!row.get(i).trim().isEmpty()?row.get(i):"empty"));
if((!row.containsKey(i)) || row.get(i).trim().isEmpty()) {
if(header.containsKey(i) && (!header.get(i).trim().isEmpty())) {
String description = getRelationHeaders(i,1);
errors.add(" ["+header.get(i)+"]{"+description+"} = row: "+(inx+1)+" - column: "+ CellReference.convertNumToColString(i));
// System.out.println(" row: "+inx+" - column: " + i + " - value: "+ (row.get(i).isEmpty()?"empty":row.get(i)));
}
}
}
});
header.forEach((i,v)->{System.out.println("stressTestMap.put(\""+v+"\", new String[]{\"{"+i+"}\",\"My description XD\"});");});
}
public String getBodyFileErrors()
{
return String.join(System.lineSeparator(), errors);
}
public boolean thereAreErrors() {
return errors.stream().count() > 1;
}
public<T extends Class> List<? extends Object> toListCustomerObject(T type) {
List<Object> list = new ArrayList<>();
rows.forEach((inx, row) -> {
try {
Object obj = type.newInstance();
for(int i = minColOfHeader; i <= maxColOfHeader; i++) {
if (row.containsKey(i) && !row.get(i).trim().isEmpty()) {
if (header.containsKey(i) && !header.get(i).trim().isEmpty()) {
if(relationHeaderClassPropertyDescription.containsKey(header.get(i))) {
String nameProperty = getRelationHeaders(i,0);
Field field = type.getDeclaredField(nameProperty);
try{
field.setAccessible(true);
field.set(obj, (isConvertibleTo(field.getType(),row.get(i)) ? toObject(field.getType(),row.get(i)) : defaultValue(field.getType())) );
field.setAccessible(false);
}catch (Exception fex) {
//System.out.println("113"+fex);
continue;
}
}
}
}
}
list.add(obj);
} catch (Exception ex) {
//System.out.println("123:"+ex);
}
});
return list;
}
private Object toObject( Class clazz, String value ) {
if( Boolean.class == clazz || Boolean.TYPE == clazz) return Boolean.parseBoolean( value );
if( Byte.class == clazz || Byte.TYPE == clazz) return Byte.parseByte( value );
if( Short.class == clazz || Short.TYPE == clazz) return Short.parseShort( value );
if( Integer.class == clazz || Integer.TYPE == clazz) return Integer.parseInt( value );
if( Long.class == clazz || Long.TYPE == clazz) return Long.parseLong( value );
if( Float.class == clazz || Float.TYPE == clazz) return Float.parseFloat( value );
if( Double.class == clazz || Double.TYPE == clazz) return Double.parseDouble( value );
return value;
}
private boolean isConvertibleTo( Class clazz, String value ) {
String ptn = "";
if( Boolean.class == clazz || Boolean.TYPE == clazz) ptn = ".*";
if( Byte.class == clazz || Byte.TYPE == clazz) ptn = "^\\d+$";
if( Short.class == clazz || Short.TYPE == clazz) ptn = "^\\d+$";
if( Integer.class == clazz || Integer.TYPE == clazz) ptn = "^\\d+$";
if( Long.class == clazz || Long.TYPE == clazz) ptn = "^\\d+$";
if( Float.class == clazz || Float.TYPE == clazz) ptn = "^\\d+(\\.\\d+)?$";
if( Double.class == clazz || Double.TYPE == clazz) ptn = "^\\d+(\\.\\d+)?$";
Pattern pattern = Pattern.compile(ptn, Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(value);
return matcher.find();
}
private Object defaultValue( Class<?> clazz) {
if( Boolean.class == clazz || Boolean.TYPE == clazz) return Boolean.FALSE;
if( Byte.class == clazz || Byte.TYPE == clazz) return (byte) 0;
if( Short.class == clazz || Short.TYPE == clazz) return (short) 0;
if( Integer.class == clazz || Integer.TYPE == clazz) return 0;
if( Long.class == clazz || Long.TYPE == clazz) return 0L;
if( Float.class == clazz || Float.TYPE == clazz) return 0.0f;
if( Double.class == clazz || Double.TYPE == clazz) return 0.0;
return "";
}
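/*
 * A quick illustration (hypothetical values) of how the two helpers above
 * cooperate: a cell is converted only when its text matches the pattern
 * for the target type, otherwise the field receives a type default:
 *
 *   isConvertibleTo(Double.TYPE, "42.5") -> true,  toObject(Double.TYPE, "42.5") -> 42.5
 *   isConvertibleTo(Integer.TYPE, "abc") -> false, defaultValue(Integer.TYPE)    -> 0
 */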
private String getRelationHeaders(Integer columnIndexHeader, Integer typeOrDescription /*0 - Type, 1 - Description*/) {
try {
return relationHeaderClassPropertyDescription.get(header.get(columnIndexHeader))[typeOrDescription];
} catch (Exception e) {
// no mapping for this header; fall back to the raw header text below
}
}
}
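For reference, here is a minimal sketch of how the class above could be driven. The wrapper class name (ExcelReader) and the Customer POJO are assumptions for illustration, since the enclosing class declaration is not shown:
// Hypothetical POJO; its field names must appear in relationHeaderClassPropertyDescription
public class Customer {
private String name;
private Integer age;
private Double balance;
}
// Hypothetical driver; ExcelReader stands in for the wrapper class above
ExcelReader reader = new ExcelReader("c:/temp/BigBook.xlsx");
if (reader.thereAreErrors()) {
System.out.println(reader.getBodyFileErrors());
} else {
List<Customer> customers = reader.toListCustomerObject(Customer.class);
System.out.println("Loaded " + customers.size() + " rows");
}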
These are the modifications I made to the Apache POI examples:
XLSX2CSV
public class XLSX2CSV {
/**
* Uses the XSSF Event SAX helpers to do most of the work
* of parsing the Sheet XML, and outputs the contents
* as a (basic) CSV.
*/
private class SheetToCSV implements SheetContentsHandler {
private boolean firstCellOfRow = false;
private int currentRow = -1;
private int currentCol = -1;
HashMap<Integer, String> valuesCell;
private void outputMissingRows(int number) {
for (int i=0; i<number; i++) {
for (int j=0; j<minColumns; j++) {
output.append(',');
}
output.append('\n');
}
}
@Override
public void startRow(int rowNum) {
// If there were gaps, output the missing rows
outputMissingRows(rowNum-currentRow-1);
// Prepare for this row
firstCellOfRow = true;
currentRow = rowNum;
currentCol = -1;
valuesCell = new HashMap<>();
}
@Override
public void endRow(int rowNum) {
// Ensure the minimum number of columns
for (int i = currentCol; i < minColumns; i++) {
output.append(',');
}
output.append('\n');
if (!valuesCell.isEmpty())
_rows.put(rowNum, valuesCell);
}
@Override
public void cell(String cellReference, String formattedValue,
XSSFComment comment) {
if (firstCellOfRow) {
firstCellOfRow = false;
} else {
output.append(',');
}
// gracefully handle missing CellRef here in a similar way as XSSFCell does
if (cellReference == null) {
cellReference = new CellAddress(currentRow, currentCol).formatAsString();
}
// Did we miss any cells?
int thisCol = (new CellReference(cellReference)).getCol();
int missedCols = thisCol - currentCol - 1;
for (int i = 0; i < missedCols; i++) {
output.append(',');
}
currentCol = thisCol;
if (!formattedValue.isEmpty())
valuesCell.put(thisCol, formattedValue);
// Number or string?
output.append(formattedValue);
/*try {
//noinspection ResultOfMethodCallIgnored
Double.parseDouble(formattedValue);
output.append(formattedValue);
} catch (NumberFormatException e) {
output.append('"');
output.append(formattedValue);
output.append('"');
}*/
}
@Override
public void headerFooter(String text, boolean isHeader, String tagName) {
// Skip, no headers or footers in CSV
}
}
///////////////////////////////////////
private final OPCPackage xlsxPackage;
/**
* Number of columns to read starting with leftmost
*/
private final int minColumns;
/**
* Destination for data
*/
private final PrintStream output;
public HashMap<Integer, HashMap<Integer, String>> _rows;
/**
* Creates a new XLSX -> CSV converter
*
* @param pkg The XLSX package to process
* @param output The PrintStream to output the CSV to
* @param minColumns The minimum number of columns to output, or -1 for no minimum
*/
public XLSX2CSV(OPCPackage pkg, PrintStream output, int minColumns, HashMap<Integer, HashMap<Integer, String> > __rows) {
this.xlsxPackage = pkg;
this.output = output;
this.minColumns = minColumns;
this._rows = __rows;
}
/**
* Parses and shows the content of one sheet
* using the specified styles and shared-strings tables.
*
* @param styles The table of styles that may be referenced by cells in the sheet
* @param strings The table of strings that may be referenced by cells in the sheet
* @param sheetInputStream The stream to read the sheet-data from.
* @exception java.io.IOException An IO exception from the parser,
* possibly from a byte stream or character stream
* supplied by the application.
* @throws SAXException if parsing the XML data fails.
*/
public void processSheet(
StylesTable styles,
ReadOnlySharedStringsTable strings,
SheetContentsHandler sheetHandler,
InputStream sheetInputStream) throws IOException, SAXException {
DataFormatter formatter = new DataFormatter();
InputSource sheetSource = new InputSource(sheetInputStream);
try {
XMLReader sheetParser = SAXHelper.newXMLReader();
ContentHandler handler = new XSSFSheetXMLHandler(
styles, null, strings, sheetHandler, formatter, false);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
} catch(ParserConfigurationException e) {
throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
}
}
/**
* Initiates the processing of the XLS workbook file to CSV.
*
* @throws IOException If reading the data from the package fails.
* @throws SAXException if parsing the XML data fails.
*/
public void process() throws IOException, OpenXML4JException, SAXException {
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(this.xlsxPackage);
XSSFReader xssfReader = new XSSFReader(this.xlsxPackage);
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
int index = 0;
while (iter.hasNext()) {
InputStream stream = iter.next();
String sheetName = iter.getSheetName();
this.output.println();
this.output.println(sheetName + " [index=" + index + "]:");
processSheet(styles, strings, new SheetToCSV(), stream);
stream.close();
++index;
break;
}
}
}
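Here is a minimal sketch of how this modified XLSX2CSV could be invoked; the file path and the minColumns value of -1 are assumptions, and it requires poi-ooxml on the classpath:
import java.util.HashMap;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
public class Xlsx2CsvDriver {
public static void main(String[] args) throws Exception {
// Collected rows: row index -> (column index -> formatted cell value)
HashMap<Integer, HashMap<Integer, String>> rows = new HashMap<>();
try (OPCPackage pkg = OPCPackage.open("c:/temp/BigBook.xlsx", PackageAccess.READ)) {
XLSX2CSV converter = new XLSX2CSV(pkg, System.out, -1, rows);
converter.process(); // stops after the first sheet because of the break above
}
System.out.println("Rows captured: " + rows.size());
}
}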
XLS2CSVmra
public class XLS2CSVmra implements HSSFListener {
private int minColumns;
private POIFSFileSystem fs;
private PrintStream output;
public HashMap<Integer, HashMap<Integer, String>> _rows;
private HashMap<Integer, String> valuesCell;
private int lastRowNumber;
private int lastColumnNumber;
/** Should we output the formula, or the value it has? */
private boolean outputFormulaValues = false;
/** For parsing Formulas */
private SheetRecordCollectingListener workbookBuildingListener;
private HSSFWorkbook stubWorkbook;
// Records we pick up as we process
private SSTRecord sstRecord;
private FormatTrackingHSSFListener formatListener;
/** So we known which sheet we're on */
private int sheetIndex = -1;
private BoundSheetRecord[] orderedBSRs;
private List<BoundSheetRecord> boundSheetRecords = new ArrayList<BoundSheetRecord>();
// For handling formulas with string results
private int nextRow;
private int nextColumn;
private boolean outputNextStringRecord;
/**
* Creates a new XLS -> CSV converter
* @param fs The POIFSFileSystem to process
* @param output The PrintStream to output the CSV to
* @param minColumns The minimum number of columns to output, or -1 for no minimum
*/
public XLS2CSVmra(POIFSFileSystem fs, PrintStream output, int minColumns, HashMap<Integer, HashMap<Integer, String>> __rows) {
this.fs = fs;
this.output = output;
this.minColumns = minColumns;
this._rows = __rows;
this.valuesCell = new HashMap<>();
}
/**
* Creates a new XLS -> CSV converter
* @param filename The file to process
* @param minColumns The minimum number of columns to output, or -1 for no minimum
* @throws IOException
* @throws FileNotFoundException
*/
public XLS2CSVmra(String filename, int minColumns, HashMap<Integer, HashMap<Integer, String>> __rows) throws IOException, FileNotFoundException {
this(
new POIFSFileSystem(new FileInputStream(filename)),
System.out, minColumns,
__rows
);
}
/**
* Initiates the processing of the XLS file to CSV
*/
public void process() throws IOException {
MissingRecordAwareHSSFListener listener = new MissingRecordAwareHSSFListener(this);
formatListener = new FormatTrackingHSSFListener(listener);
HSSFEventFactory factory = new HSSFEventFactory();
HSSFRequest request = new HSSFRequest();
if(outputFormulaValues) {
request.addListenerForAllRecords(formatListener);
} else {
workbookBuildingListener = new SheetRecordCollectingListener(formatListener);
request.addListenerForAllRecords(workbookBuildingListener);
}
factory.processWorkbookEvents(request, fs);
}
/**
* Main HSSFListener method, processes events, and outputs the
* CSV as the file is processed.
*/
@Override
public void processRecord(Record record) {
if(sheetIndex>0)
return;
int thisRow = -1;
int thisColumn = -1;
String thisStr = null;
switch (record.getSid())
{
case BoundSheetRecord.sid:
if(sheetIndex==-1)
boundSheetRecords.add((BoundSheetRecord)record);
break;
case BOFRecord.sid:
BOFRecord br = (BOFRecord)record;
if(br.getType() == BOFRecord.TYPE_WORKSHEET && sheetIndex==-1) {
// Create sub workbook if required
if(workbookBuildingListener != null && stubWorkbook == null) {
stubWorkbook = workbookBuildingListener.getStubHSSFWorkbook();
}
// Output the worksheet name
// Works by ordering the BSRs by the location of
// their BOFRecords, and then knowing that we
// process BOFRecords in byte offset order
sheetIndex++;
if(orderedBSRs == null) {
orderedBSRs = BoundSheetRecord.orderByBofPosition(boundSheetRecords);
}
output.println();
output.println(
orderedBSRs[sheetIndex].getSheetname() +
" [" + (sheetIndex+1) + "]:"
);
}
break;
case SSTRecord.sid:
sstRecord = (SSTRecord) record;
break;
case BlankRecord.sid:
BlankRecord brec = (BlankRecord) record;
thisRow = brec.getRow();
thisColumn = brec.getColumn();
thisStr = "";
break;
case BoolErrRecord.sid:
BoolErrRecord berec = (BoolErrRecord) record;
thisRow = berec.getRow();
thisColumn = berec.getColumn();
thisStr = "";
break;
case FormulaRecord.sid:
FormulaRecord frec = (FormulaRecord) record;
thisRow = frec.getRow();
thisColumn = frec.getColumn();
if(outputFormulaValues) {
if(Double.isNaN( frec.getValue() )) {
// Formula result is a string
// This is stored in the next record
outputNextStringRecord = true;
nextRow = frec.getRow();
nextColumn = frec.getColumn();
} else {
thisStr = formatListener.formatNumberDateCell(frec);
}
} else {
thisStr = '"' +
HSSFFormulaParser.toFormulaString(stubWorkbook, frec.getParsedExpression()) + '"';
}
break;
case StringRecord.sid:
if(outputNextStringRecord) {
// String for formula
StringRecord srec = (StringRecord)record;
thisStr = srec.getString();
thisRow = nextRow;
thisColumn = nextColumn;
outputNextStringRecord = false;
}
break;
case LabelRecord.sid:
LabelRecord lrec = (LabelRecord) record;
thisRow = lrec.getRow();
thisColumn = lrec.getColumn();
thisStr = '"' + lrec.getValue() + '"';
break;
case LabelSSTRecord.sid:
LabelSSTRecord lsrec = (LabelSSTRecord) record;
thisRow = lsrec.getRow();
thisColumn = lsrec.getColumn();
if(sstRecord == null) {
thisStr = '"' + "(No SST Record, can't identify string)" + '"';
} else {
thisStr = '"' + sstRecord.getString(lsrec.getSSTIndex()).toString() + '"';
}
break;
case NoteRecord.sid:
NoteRecord nrec = (NoteRecord) record;
thisRow = nrec.getRow();
thisColumn = nrec.getColumn();
// TODO: Find object to match nrec.getShapeId()
thisStr = '"' + "(TODO)" + '"';
break;
case NumberRecord.sid:
NumberRecord numrec = (NumberRecord) record;
thisRow = numrec.getRow();
thisColumn = numrec.getColumn();
// Format
thisStr = formatListener.formatNumberDateCell(numrec);
break;
case RKRecord.sid:
RKRecord rkrec = (RKRecord) record;
thisRow = rkrec.getRow();
thisColumn = rkrec.getColumn();
thisStr = '"' + "(TODO)" + '"';
break;
default:
break;
}
// Handle new row
if(thisRow != -1 && thisRow != lastRowNumber) {
lastColumnNumber = -1;
}
// Handle missing column
if(record instanceof MissingCellDummyRecord) {
MissingCellDummyRecord mc = (MissingCellDummyRecord)record;
thisRow = mc.getRow();
thisColumn = mc.getColumn();
thisStr = "";
}
// If we got something to print out, do so
if(thisStr != null) {
if (thisColumn > 0) {
output.print(',');
}
if (!thisStr.isEmpty())
valuesCell.put(thisColumn, thisStr);
output.print(thisStr);
}
// Update column and row count
if(thisRow > -1)
lastRowNumber = thisRow;
if(thisColumn > -1)
lastColumnNumber = thisColumn;
// Handle end of row
if(record instanceof LastCellOfRowDummyRecord) {
// Print out any missing commas if needed
if(minColumns > 0) {
// Columns are 0 based
if(lastColumnNumber == -1) { lastColumnNumber = 0; }
for(int i=lastColumnNumber; i<(minColumns); i++) {
output.print(',');
}
}
// We're onto a new row
lastColumnNumber = -1;
// End the row
output.println();
if(!valuesCell.isEmpty()) {
HashMap<Integer, String> newRow = new HashMap<>();
valuesCell.forEach((inx,vStr) -> {
newRow.put(inx, vStr);
});
_rows.put(lastRowNumber, newRow);
valuesCell = new HashMap<>();
}
}
}
}
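And the same idea for the .xls path, using the convenience constructor above (file path is an assumption; requires poi on the classpath):
import java.util.HashMap;
public class Xls2CsvDriver {
public static void main(String[] args) throws Exception {
// Collected rows: row index -> (column index -> cell text)
HashMap<Integer, HashMap<Integer, String>> rows = new HashMap<>();
XLS2CSVmra converter = new XLS2CSVmra("c:/temp/BigBook.xls", -1, rows);
converter.process(); // only the first sheet is read (sheetIndex > 0 returns early)
System.out.println("Rows captured: " + rows.size());
}
}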
