Extract text from a PDF file with PDFBox (Java)

I am facing an issue reading a PDF with PDFBox.
public class GetLinesFromPDF extends PDFTextStripper {
static List<String> lines = new ArrayList<String>();
Map<String, String> auMap = new HashMap();
boolean objFlag = false;
public GetLinesFromPDF() throws IOException {
}
/**
* @throws IOException If there is an error parsing the document.
*/
public static void main(String[] args) throws IOException {
PDDocument document = null;
String fileName = "E:\\sample.pdf";
try {
int i;
document = PDDocument.load(new File(fileName));
PDFTextStripper stripper = new GetLinesFromPDF();
stripper.setSortByPosition(true);
stripper.setStartPage(1); // PDFTextStripper page numbers are 1-based
stripper.setEndPage(document.getNumberOfPages());
Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
stripper.writeText(document, dummy);
// print lines
for (String line : lines) {
//System.out.println("line = " + line);
if (line.matches("(.*)Objection(.*)")) {
System.out.println(line);
withObjection(lines);
//System.out.println("iiiiiiiiiiii");
break;
}
//System.out.println("uuuuuuuuuuuuuu");
}
} finally {
if (document != null) {
document.close();
}
}
}
/**
* Override the default functionality of PDFTextStripper.writeString()
*/
@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
System.out.println("textPositions = " + string);
// System.out.println("tex "+textPositions.get(0).getFont()+ getArticleEnd());
// you may process the line here itself, as and when it is obtained
}
}
I need output like this:
My PDF has some titles, which need to be skipped.
The PDF file content is:
How can I extract the text in the separate formats specified?
Thanks in advance.

Related

Text not found in .txt, but it's already there

I am going through the Mooc.fi Java course and I can't figure out how to avoid writing a String into a file if the file already contains it. I tried with only one String, without the " " (space), and without the second string, but it still adds the string even when the file already contains it.
Also, the translate() method doesn't find or return the whole line in which the given word occurs.
public class main {
public static void main(String[] args) throws Exception {
MindfulDictionary dict = new MindfulDictionary();
dict.add("apina", "monkey");
dict.add("banaani", "banana");
dict.add("apina", "apfe");
System.out.println( dict.translate("apina") );
System.out.println( dict.translate("monkey") );
System.out.println( dict.translate("programming") );
System.out.println( dict.translate("banana") );
}
}
public class MindfulDictionary {
File file;
FileWriter writer;
Scanner imeskanera;
public MindfulDictionary() throws Exception {
this.file = new File("C:\\Users\\USER\\Desktop\\test.txt");
this.imeskanera = new Scanner(this.file, "UTF-8");
}
public void add(String word, String translation) throws Exception {
boolean found = false;
while(this.imeskanera.hasNextLine()) {
String lineFromFile = this.imeskanera.nextLine();
if(word.contains(lineFromFile)) {
found = true;
break;
}
}
if(!found) {
this.writer = new FileWriter(this.file,true);
this.writer.write(word +" " + translation +"\n");
this.writer.close();
}
}
public String translate(String word) throws Exception {
String line = null;
while(this.imeskanera.hasNextLine()) {
String data = this.imeskanera.nextLine();
if(data.contains(word)) {
line = data;
break;
}
}
return line;
}
}
The problem is that your Scanner object has already been consumed by the add() method. You need to reopen the input stream in order to read the contents of the file. If you add
this.imeskanera = new Scanner(this.file, "UTF-8");
at the beginning of the translate() method, it should work. This also tells you that the Scanner does not need to be a field at all: create it locally in each method. This is how I have explained the concept of file streams in the past:
Think about file streams (for reading and writing) logically. You
cannot allow for such a stream to be "circular". Otherwise, when you
try to get the "next line", there will always be a next line and you
will never be able to stop reading (or writing). The stream is
consumed when it reaches the end, and once that is done, to go back to
the beginning of the stream, you will need to open a new one; not
reuse the old one.
I added this explanation even after the answer was accepted because I know new developers struggle with this concept, so it is worth explaining in detail.
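The point about a consumed stream can be seen in a small sketch. The class name `ScannerReuseDemo`, its helper methods, and the temporary file are made up for illustration and are not part of the code in the question; the sketch only assumes a text file with one entry per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerReuseDemo {

    // Reads every line that the given scanner has not consumed yet.
    static List<String> drain(Scanner scanner) {
        List<String> lines = new ArrayList<>();
        while (scanner.hasNextLine()) {
            lines.add(scanner.nextLine());
        }
        return lines;
    }

    // Opens a fresh Scanner on the file, reads it fully, and closes it again.
    static List<String> readAll(Path file) throws IOException {
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            return drain(scanner);
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data written to a temporary file.
        Path file = Files.createTempFile("dictionary", ".txt");
        Files.write(file, "apina monkey\nbanaani banana\n".getBytes(StandardCharsets.UTF_8));

        try (Scanner shared = new Scanner(file, "UTF-8")) {
            System.out.println(drain(shared)); // first pass returns both lines
            System.out.println(drain(shared)); // same scanner again: nothing left to read
        }
        System.out.println(readAll(file));     // a new scanner starts from the beginning
        Files.delete(file);
    }
}
```

The second `drain(shared)` call returns an empty list because the scanner is already at the end of the file, which is the situation `translate()` runs into after `add()` has used the shared field.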
With that said, your MindfulDictionary class should look like this:
public class MindfulDictionary {
File file;
FileWriter writer;
// Scanner imeskanera;
public MindfulDictionary() throws Exception {
this.file = new File("test.txt"); // I changed the path to the file to make it work for me. You can change it back if you want to.
file.createNewFile();
}
public void add(String word, String translation) throws Exception {
Scanner imeskanera = new Scanner(this.file, "UTF-8");
boolean found = false;
while (imeskanera.hasNextLine()) {
String lineFromFile = imeskanera.nextLine();
if (word.contains(lineFromFile)) {
found = true;
break;
}
}
if (!found) {
this.writer = new FileWriter(this.file, true);
this.writer.write(word + " " + translation + "\n");
this.writer.close();
}
imeskanera.close();
}
public String translate(String word) throws Exception {
Scanner imeskanera = new Scanner(this.file, "UTF-8");
String line = null;
while (imeskanera.hasNextLine()) {
String data = imeskanera.nextLine();
if (data.contains(word)) {
line = data;
break;
}
}
imeskanera.close();
return line;
}
}
I ran your code with my modifications and now the output is
apina monkey
apina monkey
null
banaani banana
In addition to the Scanner issue mentioned in @hfontanez's answer, the following change is needed.
if (word.contains(lineFromFile))
This checks whether the word contains the whole line, which is not the case: each line in the file holds the word and its translation. So it can be changed to
if (lineFromFile.contains(word))
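A minimal sketch of the difference between the two checks, using the sample line "apina monkey" from the expected output. The class and method names are hypothetical and exist only for this illustration.

```java
class ContainsDirectionDemo {

    // Check as written in the question: does the short word contain the whole line?
    static boolean originalCheck(String word, String lineFromFile) {
        return word.contains(lineFromFile);
    }

    // Corrected check: does the line from the file contain the word?
    static boolean correctedCheck(String word, String lineFromFile) {
        return lineFromFile.contains(word);
    }

    public static void main(String[] args) {
        String line = "apina monkey";
        System.out.println(originalCheck("apina", line));  // false: "apina" is shorter than the line
        System.out.println(correctedCheck("apina", line)); // true: the line starts with "apina"
    }
}
```

Note that `contains` matches any substring, so the corrected check also succeeds when the word appears only in the translation part of a line.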
As @ghostCat mentioned, the search for the key (word) can be refactored into a single method. The code with these changes:
public class MindfulDictionary {
File file;
FileWriter writer;
// Scanner imeskanera;
public MindfulDictionary() throws Exception {
this.file = new File("test.txt");
file.createNewFile();
}
public void add(String word, String translation) throws Exception {
if (get(word) == null) {
this.writer = new FileWriter(this.file, true);
this.writer.write(word + " " + translation + "\n");
System.out.println("Out>>:"+word + " " + translation + "\n");
this.writer.close();
}
}
private String get(String word) throws Exception {
Scanner imeskanera = new Scanner(this.file, "UTF-8");
boolean found = false;
String retStr= null;
while (imeskanera.hasNextLine()) {
String lineFromFile = imeskanera.nextLine();
if (lineFromFile.contains(word)) {
found = true;
retStr=lineFromFile;
break;
}
}
imeskanera.close();
return(retStr);
}
public String translate(String word) throws Exception {
return get(word);
}
public static void main(String[] args) throws Exception {
MindfulDictionary dict = new MindfulDictionary();
dict.add("apina", "monkey");
dict.add("apina", "monkey");
dict.add("banaani", "banana");
dict.add("apina", "apfe");
System.out.println( dict.translate("apina") );
System.out.println( dict.translate("monkey") );
System.out.println( dict.translate("programming") );
System.out.println( dict.translate("banana") );
}
}

Font problem with the renderImage function in PDFBox

I get an error when I read a page from a PDF document. The page contains a bar code that is
rendered with a font (AAAAAC+Code3de9). The error appears only when I use the renderImage function.
I use version 2.0.17 of pdfbox-app.
déc. 02, 2019 9:34:13 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
AVERTISSEMENT: Could not read embedded OTF for font AAAAAC+Code3de9
java.io.IOException: Illegal seek position: 2483278652
at org.apache.fontbox.ttf.MemoryTTFDataStream.seek(MemoryTTFDataStream.java:164)
at org.apache.fontbox.ttf.TrueTypeFont.readTable(TrueTypeFont.java:352)
at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:173)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:150)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:79)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:27)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:106)
at org.apache.fontbox.ttf.OTFParser.parse(OTFParser.java:73)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:112)
at org.apache.pdfbox.pdmodel.font.PDCIDFontType2.<init>(PDCIDFontType2.java:65)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createDescendantFont(PDFontFactory.java:139)
at org.apache.pdfbox.pdmodel.font.PDType0Font.<init>(PDType0Font.java:192)
at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:97)
at org.apache.pdfbox.pdmodel.PDResources.getFont(PDResources.java:146)
at org.apache.pdfbox.contentstream.operator.text.SetFontAndSize.process(SetFontAndSize.java:61)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processOperator(PDFStreamEngine.java:872)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:506)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:480)
at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:153)
at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:268)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:321)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:243)
at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:203)
at patrick.mart1.impose.ImposeKosmedias$1.run(ImposeKosmedias.java:370)
déc. 02, 2019 9:34:13 AM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 findFontOrSubstitute
AVERTISSEMENT: Using fallback font LiberationSans for CID-keyed TrueType font AAAAAC+Code3de9
Many thanks for your help
This is based on the RemoveAllText.java example from the source code download. It removes the selection of F2 in the content stream, and also removes the font in the resources. It makes the assumption that F2 is not really used, i.e. that there is no text related to F2. Compared to the official example, only "createTokensWithoutText" has been changed. I kept all the names even if the meaning is different, except for the class name.
So this code is really just for this file, or for files generated similarly.
public final class RemoveFontF2
{
/**
* Default constructor.
*/
private RemoveFontF2()
{
// example class should not be instantiated
}
/**
* This will remove all text from a PDF document.
*
* @param args The command line arguments.
*
* @throws IOException If there is an error parsing the document.
*/
public static void main(String[] args) throws IOException
{
if (args.length != 2)
{
usage();
}
else
{
PDDocument document = PDDocument.load(new File(args[0]));
if (document.isEncrypted())
{
System.err.println(
"Error: Encrypted documents are not supported for this example.");
System.exit(1);
}
for (PDPage page : document.getPages())
{
List<Object> newTokens = createTokensWithoutText(page);
PDStream newContents = new PDStream(document);
writeTokensToStream(newContents, newTokens);
page.setContents(newContents);
processResources(page.getResources());
}
document.save(args[1]);
document.close();
}
}
private static void processResources(PDResources resources) throws IOException
{
for (COSName name : resources.getXObjectNames())
{
PDXObject xobject = resources.getXObject(name);
if (xobject instanceof PDFormXObject)
{
PDFormXObject formXObject = (PDFormXObject) xobject;
writeTokensToStream(formXObject.getContentStream(),
createTokensWithoutText(formXObject));
processResources(formXObject.getResources());
}
}
for (COSName name : resources.getPatternNames())
{
PDAbstractPattern pattern = resources.getPattern(name);
if (pattern instanceof PDTilingPattern)
{
PDTilingPattern tilingPattern = (PDTilingPattern) pattern;
writeTokensToStream(tilingPattern.getContentStream(),
createTokensWithoutText(tilingPattern));
processResources(tilingPattern.getResources());
}
}
}
private static void writeTokensToStream(PDStream newContents, List<Object> newTokens) throws IOException
{
OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
ContentStreamWriter writer = new ContentStreamWriter(out);
writer.writeTokens(newTokens);
out.close();
}
private static List<Object> createTokensWithoutText(PDContentStream contentStream) throws IOException
{
PDFStreamParser parser = new PDFStreamParser(contentStream);
Object token = parser.parseNextToken();
List<Object> newTokens = new ArrayList<Object>();
while (token != null)
{
if (token instanceof Operator)
{
Operator op = (Operator) token;
String opName = op.getName();
if (OperatorName.SET_FONT_AND_SIZE.equals(opName) &&
newTokens.get(newTokens.size() - 2).equals(COSName.getPDFName("F2")))
{
// remove the 2 arguments to this operator
newTokens.remove(newTokens.size() - 1);
newTokens.remove(newTokens.size() - 1);
token = parser.parseNextToken();
continue;
}
}
newTokens.add(token);
token = parser.parseNextToken();
}
// remove F2
COSBase fontBase = contentStream.getResources().getCOSObject().getItem(COSName.FONT);
if (fontBase instanceof COSDictionary)
{
((COSDictionary) fontBase).removeItem(COSName.getPDFName("F2"));
}
return newTokens;
}
/**
* This will print the usage for this document.
*/
private static void usage()
{
System.err.println("Usage: java " + RemoveFontF2.class.getName() + " <input-pdf> <output-pdf>");
}
}

Problems reading a PDF with iText7 (works with iText5)

Here is the code to read a PDF with iText5, and it works :
public class CreateTOC {
public static final String SRC = "file.pdf";
class FontRenderFilter extends RenderFilter {
public boolean allowText(TextRenderInfo renderInfo) {
String font = renderInfo.getFont().getPostscriptFontName();
return font.endsWith("Bold") || font.endsWith("Oblique");
}
}
public static void main(String[] args) throws IOException, DocumentException {
new CreateTOC().parse(SRC);
}
public void parse(String filename) throws IOException {
PdfReader reader = new PdfReader(filename);
Rectangle rect = new Rectangle(1000, 1000);
RenderFilter regionFilter = new RegionTextRenderFilter(rect);
FontRenderFilter fontFilter = new FontRenderFilter();
TextExtractionStrategy strategy = new FilteredTextRenderListener(
new LocationTextExtractionStrategy(), regionFilter, fontFilter);
System.out.println(PdfTextExtractor.getTextFromPage(reader, 56, strategy));
reader.close();
}
}
Can someone help me get this working in iText7? There are problems with the Rectangle and the TextExtractionStrategy (the constructors are not the same as in iText5).
Edit: RenderFilter isn't available in iText7...

Lucene IndexWriter.commit() doesn't finish on Ubuntu

Here is the initialization code:
public class Main {
public void index(String input_path, String index_dir, String separator, String extension, String field, DataHandler handler) {
Index index = new Index(handler);
index.initWriter(index_dir, new StandardAnalyzer());
index.run(input_path, field, extension, separator);
}
public List<?> search(String input_path, String index_dir, String separator, String extension, String field, DataHandler handler) {
Search search = new Search(handler);
search.initSearcher(index_dir, new StandardAnalyzer());
return search.runUsingFiles(input_path, field, extension, separator);
}
@SuppressWarnings("unchecked")
public static void main(String[] args) {
String lang = "en-US";
String dType = "data";
String train = "res/input/" +lang+ "/" +dType +"/train/";
String test = "res/input/"+ lang+ "/" +dType+ "/test/";
String separator = "\\|";
String extension = "csv";
String index_dir = "res/index/" +lang+ "." +dType+ ".index";
String output_file = "res/result/" +lang+ "." +dType+ ".output.json";
String searched_field = "utterance";
Main main = new Main();
DataHandler handler = new DataHandler();
main.index(train, index_dir, separator, extension, searched_field, handler);
//List<JSONObject> result = (List<JSONObject>) main.search(test, index_dir, separator, extension, searched_field, handler);
//handler.writeOutputJson(result, output_file);
}
}
Here is my indexing code:
public class Index {
private IndexWriter writer;
private DataHandler handler;
public Index(DataHandler handler) {
this.handler = handler;
}
public Index() {
this(new DataHandler());
}
public void initWriter(String index_path, Directory store, Analyzer analyzer) {
IndexWriterConfig config = new IndexWriterConfig(analyzer);
try {
this.writer = new IndexWriter(store, config);
} catch (IOException e) {
e.printStackTrace();
}
}
public void initWriter(String index_path, Analyzer analyzer) {
try {
initWriter(index_path, FSDirectory.open(Paths.get(index_path)), analyzer);
} catch (IOException e) {
e.printStackTrace();
}
}
public void initWriter(String index_path) {
List<String> stopWords = Arrays.asList();
CharArraySet stopSet = new CharArraySet(stopWords, false);
initWriter(index_path, new StandardAnalyzer(stopSet));
}
@SuppressWarnings("unchecked")
public void indexDocs(List<?> datas, String field) throws IOException {
FieldType fieldType = new FieldType();
FieldType fieldType2 = new FieldType();
fieldType.setStored(true);
fieldType.setTokenized(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
fieldType2.setStored(true);
fieldType2.setTokenized(false);
fieldType2.setIndexOptions(IndexOptions.DOCS);
for(int i = 0 ; i < datas.size() ; i++) {
Map<String,String> temp = (Map<String,String>) datas.get(i);
Document doc = new Document();
for(String key : temp.keySet()) {
if(key.equals(field))
continue;
doc.add(new Field(key, temp.get(key), fieldType2));
}
doc.add(new Field(field, temp.get(field), fieldType));
this.writer.addDocument(doc);
}
}
public void run(String path, String field, String extension, String separator) {
List<File> files = this.handler.getInputFiles(path, extension);
List<?> data = this.handler.readDocs(files, separator);
try {
System.out.println("start index");
indexDocs(data, field);
this.writer.commit();
this.writer.close();
System.out.println("done");
} catch (IOException e) {
e.printStackTrace();
}
}
public void run(String path) {
run(path, "search_field", "csv", "\t");
}
}
I made a simple search module using Java and Lucene.
The module consists of two phases, index and search.
In the index phase, it reads CSV files, converts each row to a Document, and adds it to the IndexWriter object using the IndexWriter.addDocument() method.
Finally, it calls the IndexWriter.commit() method.
It works well on my local PC (Windows),
but on an Ubuntu PC the IndexWriter.commit() method never finishes.
The IndexWriter.flush() method doesn't work either.
What is the problem?

Error While Reading Large Excel Files (xlsx) Via Apache POI

I am trying to read large Excel files (xlsx) via Apache POI, say 40-50 MB. I am getting an out-of-memory exception. The current heap size is 3 GB.
I can read smaller Excel files without any issues. I need a way to read large Excel files and then send them back as a response via a Spring Excel view.
public class FetchExcel extends AbstractView {
@Override
protected void renderMergedOutputModel(
Map model, HttpServletRequest request, HttpServletResponse response)
throws Exception {
String fileName = "SomeExcel.xlsx";
response.setContentType("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
OPCPackage pkg = OPCPackage.open("/someDir/SomeExcel.xlsx");
XSSFWorkbook workbook = new XSSFWorkbook(pkg);
ServletOutputStream respOut = response.getOutputStream();
pkg.close();
workbook.write(respOut);
respOut.flush();
workbook = null;
response.setHeader("Content-disposition", "attachment;filename=\"" +fileName+ "\"");
}
}
I first started off using XSSFWorkbook workbook = new XSSFWorkbook(FileInputStream in);
but that is costly according to the Apache POI API documentation, so I switched to the OPCPackage approach, with the same result. I don't need to parse or process the file, just read it and return it.
Here is an example that reads a large xlsx file using a SAX parser.
public void parseExcel(File file) throws IOException {
OPCPackage container;
try {
container = OPCPackage.open(file.getAbsolutePath());
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(container);
XSSFReader xssfReader = new XSSFReader(container);
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
while (iter.hasNext()) {
InputStream stream = iter.next();
processSheet(styles, strings, stream);
stream.close();
}
} catch (InvalidFormatException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
} catch (OpenXML4JException e) {
e.printStackTrace();
}
}
protected void processSheet(StylesTable styles, ReadOnlySharedStringsTable strings, InputStream sheetInputStream) throws IOException, SAXException {
InputSource sheetSource = new InputSource(sheetInputStream);
SAXParserFactory saxFactory = SAXParserFactory.newInstance();
try {
SAXParser saxParser = saxFactory.newSAXParser();
XMLReader sheetParser = saxParser.getXMLReader();
ContentHandler handler = new XSSFSheetXMLHandler(styles, strings, new SheetContentsHandler() {
@Override
public void startRow(int rowNum) {
}
@Override
public void endRow() {
}
@Override
public void cell(String cellReference, String formattedValue) {
}
@Override
public void headerFooter(String text, boolean isHeader, String tagName) {
}
},
false//means result instead of formula
);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
} catch (ParserConfigurationException e) {
throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
}
}
You don't mention whether you need to modify the spreadsheet or not.
This may be obvious, but if you don't need to modify the spreadsheet, then you don't need to parse it and write it back out, you can simply read bytes from the file, and write out bytes, as you would with, say an image, or any other binary format.
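A minimal sketch of that byte-for-byte approach, using only the JDK. The class name `RawFileCopy` and the method `send` are hypothetical; in the view from the question, the `OutputStream` would presumably be `response.getOutputStream()`, with the content type and `Content-disposition` header set before any bytes are written.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class RawFileCopy {

    // Copies the file to the output stream as raw bytes, without parsing it.
    // Files.copy streams the content, so the whole file is never held in memory.
    static long send(Path source, OutputStream out) throws IOException {
        long written = Files.copy(source, out);
        out.flush();
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the spreadsheet on disk; any binary file behaves the same way.
        Path file = Files.createTempFile("SomeExcel", ".xlsx");
        Files.write(file, new byte[] {0x50, 0x4B, 0x03, 0x04});

        // Stand-in for the servlet response stream.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        System.out.println(send(file, out) + " bytes copied");
        Files.delete(file);
    }
}
```

Because no workbook object is built, memory use stays flat regardless of the spreadsheet size.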
If you do need to modify the spreadsheet before sending it to the user, then to my knowledge, you may have to take a different approach.
Every library that I'm aware of for reading Excel files in Java reads the whole spreadsheet into memory, so you'd have to have 50MB of memory available for every spreadsheet that could possibly be concurrently processed. This involves, as others have pointed out, adjusting the heap available to the VM.
If you need to process a large number of spreadsheets concurrently, and can't allocate enough memory, consider using a format that can be streamed, instead of read all at once into memory. CSV format can be opened by Excel, and I've had good results in the past by setting the content-type to application/vnd.ms-excel, setting the attachment filename to something ending in ".xls", but actually returning CSV content. I haven't tried this in a couple of years, so YMMV.
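As a rough sketch of the streaming idea, the following writes CSV rows one at a time to a `Writer`, so only the current row is in memory. The class `CsvStreamSketch` and its methods are made up for illustration, and the quoting follows the common RFC 4180 convention, which is an assumption rather than something the answer specifies.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;

class CsvStreamSketch {

    // Quotes a field when it contains a comma, a double quote, or a line break,
    // doubling any embedded double quotes.
    static String escape(String field) {
        if (field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    // Writes one row and returns; nothing is buffered beyond the Writer itself.
    static void writeRow(Writer out, List<String> cells) throws IOException {
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) {
                out.write(',');
            }
            out.write(escape(cells.get(i)));
        }
        out.write("\r\n");
    }

    public static void main(String[] args) throws IOException {
        // A StringWriter stands in for the response writer here.
        StringWriter out = new StringWriter();
        writeRow(out, Arrays.asList("id", "name"));
        writeRow(out, Arrays.asList("1", "Doe, Jane"));
        System.out.print(out);
    }
}
```

In a web application the rows would typically come from a database cursor or another lazy source, and the `Writer` would be the response writer.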
In the example below I include complete code that parses a whole Excel file (60 MB in my case) into a list of objects without running out of memory:
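import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.util.SAXHelper; // location in POI 3.x; newer releases moved this helper
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;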
class DistinctByProperty {
private static OPCPackage xlsxPackage = null;
private static PrintStream output= System.out;
private static List<MassUpdateMonitoringRow> resultMapping = new ArrayList<>();
public static void main(String[] args) throws IOException {
File file = new File("C:\\Users\\aberguig032018\\Downloads\\your_excel.xlsx");
double bytes = file.length();
double kilobytes = (bytes / 1024);
double megabytes = (kilobytes / 1024);
System.out.println("Size "+megabytes);
parseExcel(file);
}
public static void parseExcel(File file) throws IOException {
try {
xlsxPackage = OPCPackage.open(file.getAbsolutePath(), PackageAccess.READ);
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(xlsxPackage);
XSSFReader xssfReader = new XSSFReader(xlsxPackage);
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
int index = 0;
while (iter.hasNext()) {
try (InputStream stream = iter.next()) {
String sheetName = iter.getSheetName();
output.println();
output.println(sheetName + " [index=" + index + "]:");
processSheet(styles, strings, new MappingFromXml(resultMapping), stream);
}
++index;
}
} catch (InvalidFormatException e) {
e.printStackTrace();
} catch (OpenXML4JException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
}
}
private static void processSheet(StylesTable styles, ReadOnlySharedStringsTable strings, MappingFromXml mappingFromXml, InputStream sheetInputStream) throws IOException, SAXException {
DataFormatter formatter = new DataFormatter();
InputSource sheetSource = new InputSource(sheetInputStream);
try {
XMLReader sheetParser = SAXHelper.newXMLReader();
ContentHandler handler = new XSSFSheetXMLHandler(
styles, null, strings, mappingFromXml, formatter, false);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
System.out.println("Size of Array "+resultMapping.size());
} catch(ParserConfigurationException e) {
throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
}
}
}
You also have to add a class that implements SheetContentsHandler:
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.usermodel.XSSFComment;
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
public class MappingFromXml implements SheetContentsHandler {
private List<myObject> result = new ArrayList<>();
private myObject myObject = null;
private int lineNumber = 0;
/**
* Number of columns to read starting with leftmost
*/
private int minColumns = 25;
/**
* Destination for data
*/
private PrintStream output = System.out;
public MappingFromXml(List<myObject> list) {
this.result = list;
}
@Override
public void startRow(int i) {
output.println("iii " + i);
lineNumber = i;
myObject = new myObject();
}
@Override
public void endRow(int i) {
output.println("jjj " + i);
result.add(myObject);
myObject = null;
}
@Override
public void cell(String cellReference, String formattedValue, XSSFComment comment) {
int columnIndex = (new CellReference(cellReference)).getCol();
if(lineNumber > 0){
switch (columnIndex) {
case 0: {//Tech id
if (formattedValue != null && !formattedValue.isEmpty())
myObject.setId(Integer.parseInt(formattedValue));
}
break;
//TODO add other cell
}
}
}
@Override
public void headerFooter(String s, boolean b, String s1) {
}
}
For more information, visit this link.
I too faced the same OOM issue while parsing an xlsx file. After two days of struggle, I finally found the code below, which worked perfectly.
This code is based on sjxlsx. It reads the xlsx file and stores it in an HSSF sheet.
// read the xlsx file
SimpleXLSXWorkbook workbook = new SimpleXLSXWorkbook(new File("C:/test.xlsx"));
HSSFWorkbook hsfWorkbook = new HSSFWorkbook();
org.apache.poi.ss.usermodel.Sheet hsfSheet = hsfWorkbook.createSheet();
Sheet sheetToRead = workbook.getSheet(0, false);
SheetRowReader reader = sheetToRead.newReader();
Cell[] row;
int rowPos = 0;
while ((row = reader.readRow()) != null) {
org.apache.poi.ss.usermodel.Row hfsRow = hsfSheet.createRow(rowPos);
int cellPos = 0;
for (Cell cell : row) {
if(cell != null){
org.apache.poi.ss.usermodel.Cell hfsCell = hfsRow.createCell(cellPos);
hfsCell.setCellType(org.apache.poi.ss.usermodel.Cell.CELL_TYPE_STRING);
hfsCell.setCellValue(cell.getValue());
}
cellPos++;
}
rowPos++;
}
return hsfSheet;
