Low memory writing/reading with Apache POI - java

I'm trying to write a pretty large XLSX file (4M+ cells) and I'm having some memory issues.
I can't use SXSSF since I also need to read the existing cells in the template.
Is there anything I can do to reduce the memory footprint?
Perhaps combine streaming reading and streaming writing?

To handle large data with a low memory footprint, the best (and, I think, the only) option is the SXSSF API.
If you need to read some data from the existing cells, I assume you do not need all 4M+ of them at the same time.
In that case, depending on your application's requirements, you can manage the window size yourself and keep in memory only the data you need at a particular time.
You can start by looking at the example at:
http://poi.apache.org/spreadsheet/how-to.html#sxssf
Something like:
SXSSFWorkbook wb = new SXSSFWorkbook(-1); // turn off auto-flushing and accumulate all rows in memory
Sheet sh = wb.createSheet();
for (int rownum = 0; rownum < NROWS; rownum++) {
    // ... create and populate the row here ...
    if (rownum % NOR == 0) { // manually control how rows are flushed to disk
        ((SXSSFSheet) sh).flushRows(NOR); // retain the last NOR rows and flush all others
    }
}
Hope this helps.
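To the original question about combining the two halves: a rough sketch (not taken from either answer) is to read the template with the streaming XSSF event API and write the result with SXSSF. The method name, file names and window size below are illustrative, and it assumes the cells you need from the template can be collected in a single streaming pass:
import java.io.FileOutputStream;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public static void copyWithLowMemory(String templatePath, String outputPath) throws Exception {
    // Streaming read: the XSSF event API never loads the whole template into memory.
    OPCPackage pkg = OPCPackage.open(templatePath, PackageAccess.READ);
    XSSFReader reader = new XSSFReader(pkg);
    // ... parse reader.getSheetsData() with a SAX ContentHandler (see the answer below),
    //     keeping only the cells you actually need ...
    pkg.revert(); // release the read-only package; nothing needs saving

    // Streaming write: SXSSF keeps only a sliding window of 100 rows in memory.
    SXSSFWorkbook wb = new SXSSFWorkbook(100);
    Sheet sheet = wb.createSheet();
    // ... create rows/cells from the data collected above; older rows are flushed to a temp file ...
    try (FileOutputStream fos = new FileOutputStream(outputPath)) {
        wb.write(fos);
    }
    wb.dispose(); // delete the temporary files backing the SXSSF workbook
}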

I used a SAX parser to process the events of the sheet's XML representation. This is the code:
import com.sun.org.apache.xerces.internal.parsers.SAXParser;
import org.apache.poi.openxml4j.opc.PackageAccess;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
public class LowMemoryExcelFileReader {
private String file;
public LowMemoryExcelFileReader(String file) {
this.file = file;
}
public List<String[]> read() {
try {
return processFirstSheet(file);
} catch (Exception e) {
throw new RuntimeException(e);
}
}
private List<String []> readSheet(Sheet sheet) {
List<String []> res = new LinkedList<>();
Iterator<Row> rowIterator = sheet.rowIterator();
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
int cellsNumber = row.getLastCellNum();
String [] cellsValues = new String[cellsNumber];
Iterator<Cell> cellIterator = row.cellIterator();
int cellIndex = 0;
while (cellIterator.hasNext()) {
Cell cell = cellIterator.next();
cellsValues[cellIndex++] = cell.getStringCellValue();
}
res.add(cellsValues);
}
return res;
}
public String getFile() {
return file;
}
public void setFile(String file) {
this.file = file;
}
private List<String []> processFirstSheet(String filename) throws Exception {
OPCPackage pkg = OPCPackage.open(filename, PackageAccess.READ);
XSSFReader r = new XSSFReader(pkg);
SharedStringsTable sst = r.getSharedStringsTable();
SheetHandler handler = new SheetHandler(sst);
XMLReader parser = fetchSheetParser(handler);
Iterator<InputStream> sheetIterator = r.getSheetsData();
if (!sheetIterator.hasNext()) {
return Collections.emptyList();
}
InputStream sheetInputStream = sheetIterator.next();
BufferedInputStream bisSheet = new BufferedInputStream(sheetInputStream);
InputSource sheetSource = new InputSource(bisSheet);
parser.parse(sheetSource);
List<String []> res = handler.getRowCache();
bisSheet.close();
pkg.revert(); // release the package; it was opened read-only, so nothing needs saving
return res;
}
public XMLReader fetchSheetParser(ContentHandler handler) throws SAXException {
XMLReader parser = new SAXParser();
parser.setContentHandler(handler);
return parser;
}
/**
* See org.xml.sax.helpers.DefaultHandler javadocs
*/
private static class SheetHandler extends DefaultHandler {
private static final String ROW_EVENT = "row";
private static final String CELL_EVENT = "c";
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
private List<String> cellCache = new LinkedList<>();
private List<String[]> rowCache = new LinkedList<>();
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
if (CELL_EVENT.equals(name)) {
String cellType = attributes.getValue("t");
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
} else if (ROW_EVENT.equals(name)) {
if (!cellCache.isEmpty()) {
rowCache.add(cellCache.toArray(new String[cellCache.size()]));
}
cellCache.clear();
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
nextIsString = false;
}
// v => contents of a cell
// Output after we've seen the string contents
if(name.equals("v")) {
cellCache.add(lastContents);
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
lastContents += new String(ch, start, length);
}
public List<String[]> getRowCache() {
return rowCache;
}
}
}
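For completeness, a usage sketch (the file name is just an example):
LowMemoryExcelFileReader reader = new LowMemoryExcelFileReader("big-template.xlsx");
List<String[]> rows = reader.read(); // parses only the first sheet
System.out.println("Read " + rows.size() + " rows");
Note that this handler still accumulates every row in rowCache, so the whole sheet ends up in memory anyway; for a truly low-memory pass you would process or discard rows inside endElement() instead of caching them.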

Related

Tensorflow lite model(.tflite) in android is always giving the same result for text classification

I am trying to classify the messages that I receive into 5 categories (sports, politics, business, tech, entertainment), but the tflite model classifies every message I send as sports ONLY.
This is my model:
I am using the Average Word Vector model to train and test my data, and it gives me correct predictions when testing. However, when I integrate the model into Android Studio, it always predicts the message as sports with high confidence (around 95%).
!pip install -q tflite-model-maker
import numpy as np
import os
from tflite_model_maker import configs
from tflite_model_maker import ExportFormat
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker import TextClassifierDataLoader
import pandas as pd
import tensorflow as tf
assert tf.__version__.startswith('2')
data = pd.read_csv("bbc-text.csv")
print(data)
awv_spec = model_spec.get('average_word_vec')
awv_train_data = TextClassifierDataLoader.from_csv(
filename='bbc-text.csv',
text_column='text',
label_column='category',
model_spec=awv_spec,
is_training=True)
awv_test_data = TextClassifierDataLoader.from_csv(
filename='bbc-text1.csv',
text_column='text',
label_column='category',
model_spec=awv_spec,
is_training=False)
awv_model = text_classifier.create(awv_train_data, model_spec=awv_spec, epochs=20)
awv_model.evaluate(awv_test_data)
awv_model.export(export_dir='average_word_vec/')
And this is how I retrieve the results in Android:
package com.example.letstalk.lib_interpreter;
import android.content.Context;
import android.content.res.AssetFileDescriptor;
import android.content.res.AssetManager;
import android.util.Log;
import com.example.letstalk.lib_interpreter.Result;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.support.metadata.MetadataExtractor;
public class TextClassificationClient {
private static final String TAG = "Interpreter";
private static final int SENTENCE_LEN = 256; // The maximum length of an input sentence.
// Simple delimiter to split words.
private static final String SIMPLE_SPACE_OR_PUNCTUATION = " |\\,|\\.|\\!|\\?|\n";
private static final String MODEL_PATH = "sentiment_analysis.tflite";
/*
* Reserved values in ImdbDataSet dic:
* dic["<PAD>"] = 0 used for padding
* dic["<START>"] = 1 mark for the start of a sentence
* dic["<UNKNOWN>"] = 2 mark for unknown words (OOV)
*/
private static final String START = "<START>";
private static final String PAD = "<PAD>";
private static final String UNKNOWN = "<UNKNOWN>";
/** Number of results to show in the UI. */
private static final int MAX_RESULTS = 3;
private final Context context;
private final Map<String, Integer> dic = new HashMap<>();
private final List<String> labels = new ArrayList<>();
private Interpreter tflite;
public TextClassificationClient(Context context) {
this.context = context;
}
/** Load the TF Lite model and dictionary so that the client can start classifying text. */
public void load() {
loadModel();
}
/** Load TF Lite model. */
private synchronized void loadModel() {
try {
// Load the TF Lite model
ByteBuffer buffer = loadModelFile(this.context.getAssets(), MODEL_PATH);
tflite = new Interpreter(buffer);
Log.v(TAG, "TFLite model loaded.");
// Use metadata extractor to extract the dictionary and label files.
MetadataExtractor metadataExtractor = new MetadataExtractor(buffer);
// Extract and load the dictionary file.
InputStream dictionaryFile = metadataExtractor.getAssociatedFile("vocab.txt");
loadDictionaryFile(dictionaryFile);
Log.v(TAG, "Dictionary loaded.");
// Extract and load the label file.
InputStream labelFile = metadataExtractor.getAssociatedFile("labels.txt");
loadLabelFile(labelFile);
Log.v(TAG, "Labels loaded.");
} catch (IOException ex) {
Log.e(TAG, "Error loading TF Lite model.\n", ex);
}
}
/** Free up resources as the client is no longer needed. */
public synchronized void unload() {
tflite.close();
dic.clear();
labels.clear();
}
/** Classify an input string and returns the classification results. */
public synchronized List<Result> classify(String text) {
// Pre-processing.
int[][] input = tokenizeInputText(text);
// Run inference.
Log.v(TAG, "Classifying text with TF Lite...");
float[][] output = new float[1][labels.size()];
tflite.run(input, output);
// Find the best classifications.
PriorityQueue<Result> pq =
new PriorityQueue<>(
MAX_RESULTS, (lhs, rhs) -> Float.compare(rhs.getConfidence(), lhs.getConfidence()));
for (int i = 0; i < labels.size(); i++) {
pq.add(new Result("" + i, labels.get(i), output[0][i]));
}
final ArrayList<Result> results = new ArrayList<>();
while (!pq.isEmpty()) {
results.add(pq.poll());
}
Collections.sort(results);
// Return the probability of each class.
return results;
}
/** Load TF Lite model from assets. */
private static MappedByteBuffer loadModelFile(AssetManager assetManager, String modelPath)
throws IOException {
try (AssetFileDescriptor fileDescriptor = assetManager.openFd(modelPath);
FileInputStream inputStream = new FileInputStream(fileDescriptor.getFileDescriptor())) {
FileChannel fileChannel = inputStream.getChannel();
long startOffset = fileDescriptor.getStartOffset();
long declaredLength = fileDescriptor.getDeclaredLength();
return fileChannel.map(FileChannel.MapMode.READ_ONLY, startOffset, declaredLength);
}
}
/** Load labels from model file. */
private void loadLabelFile(InputStream ins) throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(ins));
// Each line in the label file is a label.
while (reader.ready()) {
labels.add(reader.readLine());
}
}
/** Load dictionary from model file. */
private void loadDictionaryFile(InputStream ins) throws IOException {
BufferedReader reader = new BufferedReader(new InputStreamReader(ins));
// Each line in the dictionary has two columns.
// First column is a word, and the second is the index of this word.
while (reader.ready()) {
List<String> line = Arrays.asList(reader.readLine().split(" "));
if (line.size() < 2) {
continue;
}
dic.put(line.get(0), Integer.parseInt(line.get(1)));
}
}
/** Pre-processing: tokenize and map the input words into an int array. */
int[][] tokenizeInputText(String text) {
Log.d("hello", "tokenize: "+ text);
int[] tmp = new int[SENTENCE_LEN];
List<String> array = Arrays.asList(text.split(SIMPLE_SPACE_OR_PUNCTUATION));
int index = 0;
// Prepend <START> if it is in vocabulary file.
if (dic.containsKey(START)) {
tmp[index++] = dic.get(START);
}
for (String word : array) {
if (index >= SENTENCE_LEN) {
break;
}
tmp[index++] = dic.containsKey(word) ? dic.get(word) : (int) dic.get(UNKNOWN);
}
// Padding and wrapping.
Arrays.fill(tmp, index, SENTENCE_LEN - 1, (int) dic.get(PAD));
int[][] ans = {tmp};
return ans;
}
Map<String, Integer> getDic() {
return this.dic;
}
Interpreter getTflite() {
return this.tflite;
}
List<String> getLabels() {
return this.labels;
}
}

Apache POI SAX XSSFReader reads wrong date format

An Apache POI SAX reader, implemented similarly to this well-known example https://github.com/pjfanning/poi-shared-strings-sample/blob/master/src/main/java/com/github/pjfanning/poi/sample/XLSX2CSV.java, reads some date values differently from how they are presented in Excel, even though it is supposed to read the "formatted value".
Value in the Excel file: 1/1/2019; "formatted value" read by the reader: 1/1/19.
Any idea why there is a difference?
Apache POI version 3.17
Reader code:
package com.lopuch.sk.lita.is.importer;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import org.apache.log4j.Logger;
import org.apache.poi.openxml4j.exceptions.OpenXML4JException;
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.openxml4j.util.ZipSecureFile;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.util.CellAddress;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.util.SAXHelper;
import org.apache.poi.xssf.eventusermodel.ReadOnlySharedStringsTable;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler;
import org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.SheetContentsHandler;
import org.apache.poi.xssf.model.StylesTable;
import org.apache.poi.xssf.usermodel.XSSFComment;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import com.lopuch.sk.lita.is.importer.fileImport.ExcelRowReadListener;
public class ExcelSaxImporter {
private static final Logger logger = Logger.getLogger(ExcelSaxImporter.class);
private ExcelRowReadListener listener;
public void setOnRowRead(ExcelRowReadListener listener) {
this.listener = listener;
}
public ExcelRowReadListener getListener() {
return listener;
};
public void process(byte[] fileByteArray)
throws IOException, OpenXML4JException, ParserConfigurationException, SAXException {
ZipSecureFile.setMinInflateRatio(0.0d);
OPCPackage opcpPackage = OPCPackage.open(new ByteArrayInputStream(fileByteArray));
ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(opcpPackage);
XSSFReader xssfReader = new XSSFReader(opcpPackage);
StylesTable styles = xssfReader.getStylesTable();
XSSFReader.SheetIterator iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
while (iter.hasNext()) {
InputStream stream = iter.next();
processSheet(styles, strings, getHandler(), stream);
stream.close();
}
}
private SheetContentsHandler getHandler() {
return new SheetContentsHandler() {
private boolean firstCellOfRow = false;
private int currentRow = -1;
private int currentCol = -1;
// Maps a column letter name to its value.
// Does not contain a key-value pair if the cell value is null
// for the currently processed column and row.
private Map<String, String> rowValues;
@Override
public void startRow(int rowNum) {
// Prepare for this row
firstCellOfRow = true;
currentRow = rowNum;
currentCol = -1;
rowValues = new HashMap<String, String>();
}
@Override
public void endRow(int rowNum) {
if (rowValues.keySet().size() == 0) {
logger.trace("Skipping calling rowRead() because of empty row");
} else {
ExcelSaxImporter.this.getListener().rowRead(rowValues);
}
}
@Override
public void cell(String cellReference, String formattedValue, XSSFComment comment) {
if (firstCellOfRow) {
firstCellOfRow = false;
}
// gracefully handle missing CellRef here in a similar way
// as XSSFCell does
if (cellReference == null) {
cellReference = new CellAddress(currentRow, currentCol).formatAsString();
}
// Did we miss any cells?
int thisCol = (new CellReference(cellReference)).getCol();
currentCol = thisCol;
cellReference = cellReference.replaceAll("\\d","");
rowValues.put(cellReference, formattedValue);
}
@Override
public void headerFooter(String text, boolean isHeader, String tagName) {
}
};
}
/**
* Parses and shows the content of one sheet using the specified styles and
* shared-strings tables.
*
* @param styles
* @param strings
* @param sheetInputStream
*/
public void processSheet(StylesTable styles, ReadOnlySharedStringsTable strings, SheetContentsHandler sheetHandler,
InputStream sheetInputStream) throws IOException, ParserConfigurationException, SAXException {
DataFormatter formatter = new DataFormatter();
InputSource sheetSource = new InputSource(sheetInputStream);
try {
XMLReader sheetParser = SAXHelper.newXMLReader();
ContentHandler handler = new XSSFSheetXMLHandler(styles, null, strings, sheetHandler, formatter, false);
sheetParser.setContentHandler(handler);
sheetParser.parse(sheetSource);
} catch (ParserConfigurationException e) {
throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
}
}
}
The difference between the value displayed by Excel and the value read by Apache POI comes from date formats that react to the user's regional settings. From the Excel documentation:
Date formats that begin with an asterisk (*) respond to changes in regional date and time settings that are specified for the operating system.
Apache POI's DataFormatter ignores these locale-specific formats and returns the date in the default US format. From the Apache POI DataFormatter documentation:
Some formats are automatically "localized" by Excel, eg show as mm/dd/yyyy when loaded in Excel in some Locales but as dd/mm/yyyy in others. These are always returned in the "default" (US) format, as stored in the file.
To work around this behavior, see the answer to "Java: excel to csv date conversion issue with Apache Poi".
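One workaround in that spirit, sketched against the reader code above (the "m/d/yy" format string is an assumption about how the cells are stored; adjust it to the format actually used in the file):
DataFormatter formatter = new DataFormatter();
// Built-in format 14 ("m/d/yy") is one of the asterisk-prefixed, locale-sensitive formats.
// Mapping it to an explicit java.text.Format forces a four-digit year regardless of the
// regional settings of the machine that opens the file.
formatter.addFormat("m/d/yy", new java.text.SimpleDateFormat("M/d/yyyy"));
// Pass this formatter to the XSSFSheetXMLHandler in processSheet() as before.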

How can I go back to the main method in my code, depending on a condition?

In this program I am reading an .xlsx file and adding cell data to a vector. If the vector size is less than 12, there is no need to read the remaining data, and I need to go back to the main method.
How can I do this in my program?
This is my code:
package com.read;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Vector;
import org.apache.poi.openxml4j.opc.OPCPackage;
import java.io.InputStream;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
public class SendDataToDb {
public static void main(String[] args) {
SendDataToDb sd = new SendDataToDb();
try {
sd.processOneSheet("C:/Users/User/Desktop/New folder/Untitled 2.xlsx");
System.out.println("in Main method");
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public void processOneSheet(String filename) throws Exception {
System.out.println("executing Process Method");
OPCPackage pkg = OPCPackage.open(filename);
XSSFReader r = new XSSFReader( pkg );
SharedStringsTable sst = r.getSharedStringsTable();
System.out.println("count "+sst.getCount());
XMLReader parser = fetchSheetParser(sst);
// To look up the Sheet Name / Sheet Order / rID,
// you need to process the core Workbook stream.
// Normally it's of the form rId# or rSheet#
InputStream sheet2 = r.getSheet("rId2");
System.out.println("Sheet2");
InputSource sheetSource = new InputSource(sheet2);
parser.parse(sheetSource);
sheet2.close();
}
public XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
//System.out.println("EXECUTING fetchSheetParser METHOD");
XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
ContentHandler handler = new SheetHandler(sst);
parser.setContentHandler(handler);
System.out.println("Method :fetchSheetParser");
return parser;
}
/**
* See org.xml.sax.helpers.DefaultHandler javadocs
*/
private class SheetHandler extends DefaultHandler {
private SharedStringsTable sst;
private String lastContents;
private boolean nextIsString;
Vector values = new Vector(20);
private int columnNum = 0; // cell counter used in startElement()
private SheetHandler(SharedStringsTable sst) {
this.sst = sst;
}
public void startElement(String uri, String localName, String name,
Attributes attributes) throws SAXException {
// c => cell
//long l = Long.valueOf(attributes.getValue("r"));
if(name.equals("c")){
columnNum++;
}
if(name.equals("c")) {
// Print the cell reference
// Figure out if the value is an index in the SST
String cellType = attributes.getValue("t");
if(cellType != null && cellType.equals("s")) {
nextIsString = true;
} else {
nextIsString = false;
}
}
// Clear contents cache
lastContents = "";
}
public void endElement(String uri, String localName, String name)
throws SAXException {
//System.out.println("Method :222222222");
// Process the last contents as required.
// Do now, as characters() may be called more than once
if(nextIsString) {
int idx = Integer.parseInt(lastContents);
lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
nextIsString = false;
}
// v => contents of a cell
// Output after we've seen the string contents
if(name.equals("v")) {
values.add(lastContents);
}
if(name.equals("row")) {
System.out.println(values);
//values.setSize(50);
System.out.println(values.size()+" "+values.capacity());
//********************************************************
//I AM CHECKING THE CONDITION HERE. IF THE CONDITION IS TRUE, I NEED TO STOP THE REMAINING PROCESSING AND GO BACK TO THE MAIN METHOD.
if(values.size() < 12)
values.removeAllElements();
//WHAT CODE DO I NEED TO WRITE HERE TO STOP THE REMAINING PROCESSING AND GO BACK TO MAIN?
//***************************************************************
}
}
public void characters(char[] ch, int start, int length)
throws SAXException {
//System.out.println("method : 333333333333");
lastContents += new String(ch, start, length);
}
}
}
Check the code between the //****************************** comment lines.
You can throw a SAXException wherever you want the parsing to stop:
throw new SAXException("<Your message>");
and handle it in the main method.
After your check, you should throw the exception to get out of the handler and back to the main method, for example:
throw new SAXException("vector size is less than 12");
Note that it has to be a SAXException (or a subclass of it), because that is the only checked exception endElement() is declared to throw.
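A minimal sketch of that approach against the code above (the message text is only an example):
// In SheetHandler.endElement(), after the row has been collected:
if (name.equals("row") && values.size() < 12) {
    values.removeAllElements();
    // SAXException is the only checked exception endElement() may throw,
    // so use it (or a subclass) to abort the parse.
    throw new SAXException("row has fewer than 12 values - stopping the parse");
}

// In main(), the exception surfaces from parser.parse(...) via processOneSheet():
try {
    sd.processOneSheet("C:/Users/User/Desktop/New folder/Untitled 2.xlsx");
} catch (SAXException stopped) {
    System.out.println("Back in main: " + stopped.getMessage());
}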

Reading an ORC file in Java

How do you read an ORC file in Java? I want to read in a small file to verify some unit test output, but I can't find a solution.
I came across this and implemented a reader myself recently:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import java.util.List;
public class OrcFileDirectReaderExample {
public static void main(String[] argv)
{
try {
Reader reader = OrcFile.createReader(HdfsFactory.getFileSystem(), new Path("/user/hadoop/000000_0"));
StructObjectInspector inspector = (StructObjectInspector)reader.getObjectInspector();
System.out.println(reader.getMetadata());
RecordReader records = reader.rows();
Object row = null;
//These objects are the metadata for each column. They give you the type of each column and can parse it unless you
//want to parse each column yourself
List fields = inspector.getAllStructFieldRefs();
for(int i = 0; i < fields.size(); ++i) {
System.out.print(((StructField)fields.get(i)).getFieldObjectInspector().getTypeName() + '\t');
}
while(records.hasNext())
{
row = records.next(row);
List value_lst = inspector.getStructFieldsDataAsList(row);
StringBuilder builder = new StringBuilder();
//iterate over the fields
//Also fields can be null if a null was passed as the input field when processing wrote this file
for(Object field : value_lst) {
if(field != null)
builder.append(field.toString());
builder.append('\t');
}
//this writes out the row as it would appear in a tab-separated text file
System.out.println(builder.toString());
}
}catch (Exception e)
{
e.printStackTrace();
}
}
}
As per the Apache wiki, the ORC file format was introduced in Hive 0.11.
So you will need the Hive packages on your project's classpath to read ORC files. The relevant classes are:
org.apache.hadoop.hive.ql.io.orc.Reader
org.apache.hadoop.hive.ql.io.orc.OrcFile
An ORC read test case:
@Test
public void read_orc() throws Exception {
//todo do kerberos auth
String orcPath = "hdfs://user/hive/warehouse/demo.db/orc_path";
//load hdfs conf
Configuration conf = new Configuration();
conf.addResource(getClass().getResource("/hdfs-site.xml"));
conf.addResource(getClass().getResource("/core-site.xml"));
FileSystem fs = FileSystem.get(conf);
// custom read column
List<String> columns = Arrays.asList("id", "title");
final List<Map<String, Object>> maps = OrcUtil.readOrcFile(fs, orcPath, columns);
System.out.println(new Gson().toJson(maps));
}
OrcUtil, which reads an ORC path with the specified columns:
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcSerde;
import org.apache.hadoop.hive.ql.io.orc.OrcSplit;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.objectinspector.StructField;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
public class OrcUtil {
public static List<Map<String, Object>> readOrcFile(FileSystem fs, String orcPath, List<String> readColumns)
throws IOException, SerDeException {
JobConf jobConf = new JobConf();
for (Map.Entry<String, String> entry : fs.getConf()) {
jobConf.set(entry.getKey(), entry.getValue());
}
FileInputFormat.setInputPaths(jobConf, orcPath);
FileInputFormat.setInputPathFilter(jobConf, ((PathFilter) path1 -> true).getClass());
InputSplit[] splits = new OrcInputFormat().getSplits(jobConf, 1);
InputFormat<NullWritable, OrcStruct> orcInputFormat = new OrcInputFormat();
List<Map<String, Object>> rows = new ArrayList<>();
for (InputSplit split : splits) {
OrcSplit orcSplit = (OrcSplit) split;
System.out.printf("read orc split %s%n", ((OrcSplit) split).getPath());
StructObjectInspector inspector = getStructObjectInspector(orcSplit.getPath(), jobConf, fs);
List<? extends StructField> readFields = inspector.getAllStructFieldRefs()
.stream().filter(e -> readColumns.contains(e.getFieldName())).collect(Collectors.toList());
// a file of only 49 bytes is an empty ORC file, so skip it
if (orcSplit.getLength() > 49) {
RecordReader<NullWritable, OrcStruct> recordReader = orcInputFormat.getRecordReader(orcSplit, jobConf, Reporter.NULL);
NullWritable key = recordReader.createKey();
OrcStruct value = recordReader.createValue();
while (recordReader.next(key, value)) {
Map<String, Object> entity = new HashMap<>();
for (StructField field : readFields) {
entity.put(field.getFieldName(), inspector.getStructFieldData(value, field));
}
rows.add(entity);
}
}
}
return rows;
}
private static StructObjectInspector getStructObjectInspector(Path path, JobConf jobConf, FileSystem fs)
throws IOException, SerDeException {
OrcFile.ReaderOptions readerOptions = OrcFile.readerOptions(jobConf);
readerOptions.filesystem(fs);
Reader reader = OrcFile.createReader(path, readerOptions);
String typeStruct = reader.getObjectInspector().getTypeName();
System.out.println(typeStruct);
List<String> columnList = parseColumnAndType(typeStruct);
String[] fullColNames = new String[columnList.size()];
String[] fullColTypes = new String[columnList.size()];
for (int i = 0; i < columnList.size(); ++i) {
String[] temp = columnList.get(i).split(":");
fullColNames[i] = temp[0];
fullColTypes[i] = temp[1];
}
Properties p = new Properties();
p.setProperty("columns", StringUtils.join(fullColNames, ","));
p.setProperty("columns.types", StringUtils.join(fullColTypes, ":"));
OrcSerde orcSerde = new OrcSerde();
orcSerde.initialize(jobConf, p);
return (StructObjectInspector) orcSerde.getObjectInspector();
}
private static List<String> parseColumnAndType(String typeStruct) {
int startIndex = typeStruct.indexOf("<") + 1;
int endIndex = typeStruct.lastIndexOf(">");
typeStruct = typeStruct.substring(startIndex, endIndex);
List<String> columnList = new ArrayList<>();
List<String> splitList = Arrays.asList(typeStruct.split(","));
Iterator<String> it = splitList.iterator();
while (it.hasNext()) {
StringBuilder current = new StringBuilder(it.next());
String currentStr = current.toString();
boolean left = currentStr.contains("(");
boolean right = currentStr.contains(")");
if (!left && !right) {
columnList.add(currentStr);
continue;
}
if (left && right) {
columnList.add(currentStr);
continue;
}
if (left && !right) {
while (it.hasNext()) {
String next = it.next();
current.append(",").append(next);
if (next.contains(")")) {
break;
}
}
columnList.add(current.toString());
}
}
return columnList;
}
}
Try this for getting the ORC file row count...
private long getRowCount(FileSystem fs, String fName) throws Exception {
long tempCount = 0;
Reader rdr = OrcFile.createReader(fs, new Path(fName));
StructObjectInspector insp = (StructObjectInspector) rdr.getObjectInspector();
Iterable<StripeInformation> iterable = rdr.getStripes();
for(StripeInformation stripe:iterable){
tempCount = tempCount + stripe.getNumberOfRows();
}
return tempCount;
}
//fName is hdfs path to file.
long rowCount = getRowCount(fs,fName);

jsoup java html parsing

I'm a new French user on Stack Overflow and I have a problem.
I use the Jsoup HTML parser to parse an HTML page. That part works, but I can't parse several URLs at the same time.
This is my code:
The first class, which parses a web page:
package test2;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public final class Utils {
public static Map<String, String> parse(String url){
Map<String, String> out = new HashMap<String, String>();
try
{
Document doc = Jsoup.connect(url).get();
doc.select("img").remove();
Elements denomination = doc.select(".AmmDenomination");
Elements composition = doc.select(".AmmComposition");
Elements corptexte = doc.select(".AmmCorpTexte");
for(int i = 0; i < denomination.size(); i++)
{
out.put("denomination" + i, denomination.get(i).text());
}
for(int i = 0; i < composition.size(); i++)
{
out.put("composition" + i, composition.get(i).text());
}
for(int i = 0; i < corptexte.size(); i++)
{
out.put("corptexte" + i, corptexte.get(i).text());
System.out.println(corptexte.get(i));
}
} catch(IOException e){
e.printStackTrace();
}
return out;
}//end of the parse method
public static void excelizer(int fileId, Map<String, String> values){
try
{
FileOutputStream out = new FileOutputStream("C:/Documents and Settings/c.bon/git/clinsearch/drugs/src/main/resources/META-INF/test/fichier2.xls" );
Workbook wb = new HSSFWorkbook();
Sheet mySheet = wb.createSheet();
Row row1 = mySheet.createRow(0);
Row row2 = mySheet.createRow(1);
String entete[] = {"CIS", "Denomination", "Composition", "Form pharma", "Indication therapeutiques", "Posologie", "Contre indication", "Mise en garde",
"Interraction", "Effet indesirable", "Surdosage", "Pharmacodinamie", "Liste excipients", "Incompatibilité", "Duree conservation",
"Conservation", "Emballage", "Utilisation Manipulation", "TitulaireAMM"};
for (int i = 0; i < entete.length; i++)
{
row1.createCell(i).setCellValue(entete[i]);
}
Set<String> set = values.keySet();
int rowIndexDenom = 1;
int rowIndexCompo = 1;
for(String key : set)
{
if(key.contains("denomination"))
{
mySheet.createRow(1).createCell(1).setCellValue(values.get(key));
rowIndexDenom++;
}
else if(key.contains("composition"))
{
row2.createCell(2).setCellValue(values.get(key));
rowIndexDenom++;
}
}
wb.write(out);
out.close();
}
catch(Exception e)
{
e.printStackTrace();
}
}
}
The second class:
package test2;
public final class Task extends Thread {
private static int fileId = 0;
private int id;
private String url;
public Task(String url)
{
this.url = url;
id = fileId;
fileId++;
}
@Override
public void run()
{
Utils.excelizer(id, Utils.parse(url));
}
}
The main class (entry point):
package test2;
import java.util.ArrayList;
public class Main {
public static void main(String[] args)
{
ArrayList<String> urls = new ArrayList<String>();
urls.add("http://base-donnees-publique.medicaments.gouv.fr/affichageDoc.php?specid=61266250&typedoc=R");
urls.add("http://base-donnees-publique.medicaments.gouv.fr/affichageDoc.php?specid=66207341&typedoc=R");
for(String url : urls)
{
new Task(url).run();
}
}
}
When the data is copied to my Excel file, the second URL doesn't work.
Can you help me solve this problem, please?
Thanks
I think it's because your main() exits before your second thread has a chance to do its job. You should wait for all spawned threads to complete using Thread.join(). Or better yet, create an ExecutorService and use awaitTermination(...) to block until all URLs are parsed.
EDIT: See some examples here: http://www.javacodegeeks.com/2013/01/java-thread-pool-example-using-executors-and-threadpoolexecutor.html
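A minimal sketch of the ExecutorService approach, reusing the Task class above (pool size and timeout are illustrative):
import java.util.ArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        ArrayList<String> urls = new ArrayList<String>();
        urls.add("http://base-donnees-publique.medicaments.gouv.fr/affichageDoc.php?specid=61266250&typedoc=R");
        urls.add("http://base-donnees-publique.medicaments.gouv.fr/affichageDoc.php?specid=66207341&typedoc=R");
        ExecutorService pool = Executors.newFixedThreadPool(urls.size());
        for (String url : urls) {
            pool.submit(new Task(url)); // Task extends Thread, so it is also a Runnable
        }
        pool.shutdown();                             // stop accepting new tasks
        pool.awaitTermination(10, TimeUnit.MINUTES); // block until every URL has been parsed
    }
}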
