how to write parquet files in java with apache arrow - java

I am trying to write data in java into apache parquet. So far, what i've done is use apache arrow via the examples here: https://arrow.apache.org/cookbook/java/schema.html#creating-fields and create an arrow format dataset.
Question is, how do I write it into parquet after that? Also, do I need to use apache arrow to output the data as a parquet file? or can I use apache parquet directly to serialize the data and then output it as a parquet file?
what i've done:
try (BufferAllocator allocator = new RootAllocator()) {
Field name = new Field("name", FieldType.nullable(new ArrowType.Utf8()), null);
Field age = new Field("age", FieldType.nullable(new ArrowType.Int(32, true)), null);
Schema schemaPerson = new Schema(asList(name, age));
try(
VectorSchemaRoot vectorSchemaRoot = VectorSchemaRoot.create(schemaPerson, allocator)
){
VarCharVector nameVector = (VarCharVector) vectorSchemaRoot.getVector("name");
nameVector.allocateNew(3);
nameVector.set(0, "David".getBytes());
nameVector.set(1, "Gladis".getBytes());
nameVector.set(2, "Juan".getBytes());
IntVector ageVector = (IntVector) vectorSchemaRoot.getVector("age");
ageVector.allocateNew(3);
ageVector.set(0, 10);
ageVector.set(1, 20);
ageVector.set(2, 30);
vectorSchemaRoot.setRowCount(3);
File file = new File("randon_access_to_file.arrow");
try (
FileOutputStream fileOutputStream = new FileOutputStream(file);
ArrowFileWriter writer = new ArrowFileWriter(vectorSchemaRoot, null, fileOutputStream.getChannel())
) {
writer.start();
writer.writeBatch();
writer.end();
System.out.println("Record batches written: " + writer.getRecordBlocks().size() + ". Number of rows written: " + vectorSchemaRoot.getRowCount());
} catch (IOException e) {
e.printStackTrace();
}
}
}
but this outputs as an arrow file. not a parquet. Any ideas how I can output this to parquet file instead? And do i need arrow to generate a parquet file to begin with - or can i just use parquet directly?

Arrow Java does not yet support writing to Parquet files, but you can use Parquet to do that.
There is some code in the Arrow dataset test classes that may help. See
org.apache.arrow.dataset.ParquetWriteSupport;
org.apache.arrow.dataset.file.TestFileSystemDataset;
The second class has some tests that use the utilities in the first one.
You can find them on GitHub here:
https://github.com/apache/arrow/tree/master/java/dataset/src/test/java/org/apache/arrow/dataset

Related

Spark Save as Text File grouped by Key

I would like to save RDD to text file grouped by key, currently I can't figure out how to split the output to multiple files, it seems all the output spanning across multiple keys which share the same partition gets written to the same file. I would like to have different files for each key. Here's my code snippet :
JavaPairRDD<String, Iterable<Customer>> groupedResults = customerCityPairRDD.groupByKey();
groupedResults.flatMap(x -> x._2().iterator())
.saveAsTextFile(outputPath + "/cityCounts");
This can be achieved by using foreachPartition to save each partitions into separate file.
You can develop your code as follows
groupedResults.foreachPartition(new VoidFunction<Iterator<Customer>>() {
#Override
public void call(Iterator<Customer> rec) throws Exception {
FSDataOutputStream fsoutputStream = null;
BufferedWriter writer = null;
try {
fsoutputStream = FileSystem.get(new Configuration()).create(new Path("path1"))
writer = new BufferedWriter(fsoutputStream)
while (rec.hasNext()) {
Customer cust = rec.next();
writer.write(cust)
}
} catch (Exception exp) {
exp.printStackTrace()
//Handle exception
}
finally {
// close writer.
}
}
});
Hope this helps.
Ravi
So I figured how to solve this. Convert RDD to Dataframe and then just partition by key during write.
Dataset<Row> dataFrame = spark.createDataFrame(customerRDD, Customer.class);
dataFrame.write()
.partitionBy("city")
.text("cityCounts"); // write as text file at file path cityCounts

Save embedded files from .xls document (Apache POI)

I would like to save all attached files from an Excel (xls/HSSF) without extension.
I've been trying for a long time now, and I really don't know if this is even possible. I also tried Apache Tika, but I don't want to use Tika for this, because I need POI for other tasks, anyway.
I tried the sample code from the Busy Developers Guide, but this does not extract files in the old office format (doc, ppt, xls). And it throws an Error when trying to create new SlideShow(new HSLFSlideShow(dn, fs)) Error: (Remove argument to match HSLFSlideShow(dn))
My actual code is:
public static void saveEmbeddedXLS(InputStream fis_param, String embDIR) throws IOException, InvalidFormatException{
//HSSF - XLS
int i = 0;
System.out.println("Starting Embedded Search in xls...");
POIFSFileSystem fs = new POIFSFileSystem(fis_param);//create FileSystem using fileInputStream
HSSFWorkbook workbook = new HSSFWorkbook(fs);
for (HSSFObjectData obj : workbook.getAllEmbeddedObjects()) {
System.out.println("Objects : "+ obj.getOLE2ClassName());//the OLE2 Class Name of the object
String oleName = obj.getOLE2ClassName();//Document Type
DirectoryNode dn = (DirectoryNode) obj.getDirectory();//get Directory Node
//Trying to create an input Stream with the embedded document, argument of createDocumentInputStream should be: String; Where/How can I get this correct parameter for the function?
InputStream is = dn.createDocumentInputStream(dn);//This line is incorrect! How can I do i correctly?
FileOutputStream fos = new FileOutputStream("embDIR" + i);//Outputfilepath + Number
IOUtils.copy(is, fos);//FileInputStream > FileOutput Stream (save File without extension)
i++;
}
}
So my simple question is:
Is it possible to save ALL attachments from an xls file without any extension (as simple as possible)? And can any one provide me a solution? Many Thanks!

How to read files with an offset from Hadoop using Java

Problem: I want to read a section of a file from HDFS and return it, such as lines 101-120 from a file of 1000 lines.
I don't want to use seek because I have read that it is expensive.
I have log files which I am using PIG to process down into meaningful sets of data. I've been writing an API to return the data for consumption and display by a front end. Those processed data sets can be large enough that I don't want to read the entire file out of Hadoop in one slurp to save wire time and bandwidth. (Let's say 5 - 10MB)
Currently I am using a BufferedReader to return small summary files which is working fine
ArrayList lines = new ArrayList();
...
for (FileStatus item: items) {
// ignoring files like _SUCCESS
if(item.getPath().getName().startsWith("_")) {
continue;
}
in = fs.open(item.getPath());
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String line;
line = br.readLine();
while (line != null) {
line = line.replaceAll("(\\r|\\n)", "");
lines.add(line.split("\t"));
line = br.readLine();
}
}
I've poked around the interwebs quite a bit as well as Stack but haven't found exactly what I need.
Perhaps this is completely the wrong way to go about doing it and I need a completely separate set of code and different functions to manage this. Open to any suggestions.
Thanks!
As added noted based on research from the below discussions:
How does Hadoop process records records split across block boundaries?
Hadoop FileSplit Reading
I think SEEK is a best option for reading files with huge volumes. It did not cause any problems to me as the volume of data that i was reading was in the range of 2 - 3GB. I did not encounter any issues till today but we did use file splitting to handle the large data set. below is the code which you can use for reading purpose and test the same.
public class HDFSClientTesting {
/**
* #param args
*/
public static void main(String[] args) {
// TODO Auto-generated method stub
try{
//System.loadLibrary("libhadoop.so");
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
conf.addResource(new Path("core-site.xml"));
String Filename = "/dir/00000027";
long ByteOffset = 3185041;
SequenceFile.Reader rdr = new SequenceFile.Reader(fs, new Path(Filename), conf);
Text key = new Text();
Text value = new Text();
rdr.seek(ByteOffset);
rdr.next(key,value);
//Plain text
JSONObject jso = new JSONObject(value.toString());
String content = jso.getString("body");
System.out.println("\n\n\n" + content + "\n\n\n");
File file =new File("test.gz");
file.createNewFile();
}
catch (Exception e ){
throw new RuntimeException(e);
}
finally{
}
}
}

using CSVWriter to export database tables with BLOB

I've already tried exporting my database tables to CSV using the CSVWriter.
But my tables contain BLOB data. How can I include them in my export?
Then later on, im going to import that exported CSV using CSVReader. Can anyone share some concepts?
This is a part of my code for export
ResultSet res = st.executeQuery("select * from "+db+"."+obTableNames[23]);
int colunmCount = getColumnCount(res);
try {
File filename = new File(dir,""+obTableNames[23]+".csv");
fw = new FileWriter(filename);
CSVWriter writer = new CSVWriter(fw);
writer.writeAll(res, false);
int colType = res.getMetaData().getColumnType(colunmCount);
dispInt(colType);
fw.flush();
fw.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Did you take a look at encodeBase64String(byte[] data) method from the Base64 provided by Apache?
Encodes binary data using the base64 algorithm but does not chunk the output.
This should allow you to return encoded strings representing your Binary Large Object and incorporate it in your CSV.
People on the other side can then use the decodeBase64String(String data) to get the BLOB back again.

Creating tables in a MS Word file using Java

I want to create a table in a Microsoft Office Word file using Java. Can anybody tell me how to do it with an example?
Have a look at Apache POI
The POI project is the master project
for developing pure Java ports of file
formats based on Microsoft's OLE 2
Compound Document Format. OLE 2
Compound Document Format is used by
Microsoft Office Documents, as well as
by programs using MFC property sets to
serialize their document objects.
I've never seen it done, and I work in Word a lot. If you really want to programatically do something in a word document then I'd advise using Microsoft's scripting language VBA which is specifically designed for this purpose. In fact, I'm working in it right now.
If you're working under Open Office then they have a very similar set of macro-powered tools for doing the same thing.
Office 2003 has an xml format, and the default document format for office 2007 is xml (zipped). So you could just generate xml from java. If you open an existing document it's not too hard too see the xml required.
Alternatively, you could use openoffice's api to generate a document, and save it as a ms-word document.
This snippet can be used to create a table dynamically in MS Word document.
WPFDocument document = new XWPFDocument();
XWPFTable tableTwo = document.createTable();
XWPFTableRow tableTwoRowOne = tableTwo.getRow(0);
tableTwoRowOne.getCell(0).setText(Knode1);
tableTwoRowOne.createCell().setText(tags.get("node1").toString());
for (int i = 1; i < nodeList.length; i++) {
String node = "node";
String nodeVal = "";
XWPFTableRow tr = null;
node = node + (i + 1);
nodeVal = tags.get(node).toString();
if (tr == null) {
tr = tableTwo.createRow();
tr.getCell(0).setText(nodeList[i]);
tr.getCell(1).setText(tags.get(node).toString());
}
}
Our feature set is to hit a button in our web app and get the page you are looking at back as a Word document. We use the docx schema for description of documents and have a bunch of Java code on the server side which does the document creation and response back to our web client. The formatting itself is done with some compiled xsl-t's from within Java to translate from our own XML persistence tier.
The docx schema is pretty hard to understand. The way we made most progress was to create template docx's in Word with exactly the formatting that we needed but with bogus content. We then fooled around with them until we understood exactly what was going on. There is a huge amount in the docx that you don't really need to worry about. When reading / translating the docx Word is pretty tolerant to a partially complete formatting schema. In fact we chose to strip out pretty much all the formatting because it also means that the user's default formatting takes precedence, which they seem to prefer. It also makes the xsl process faster and the resulting document smaller.
I manage the docx4j project
docx4j contains a class TblFactory, which creates regular tables (ie no row or column spans), with the default settings which Word 2007 would create, and with the dimensions specified by the user.
If you want a more complex table, the easiest approach is to create it in Word, then copy the resulting XML into a String in your IDE, where you can use docx4j's XmlUtils.unmarshalString to create a Tbl object from it.
Using my little zip utility, you can create docx with ease, if you know what you're doing. Word's DOCX file format is simply zip (folders with xml files). By using java zip utilities, you can modify existing docx, just the content part.
For the following sample to work, simply open Word, enter few lines, save document. Then with zip program, remove file word/document.xml (this is file where main content of the Word document is residing) from the zip. Now you have the template prepared. Save modified zip.
Here is what creation of new Word file looks:
/* docx file head */
final String DOCUMENT_XML_HEAD =
"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\" ?>" +
"<w:document xmlns:wpc=\"http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas\" xmlns:mc=\"http://schemas.openxmlformats.org/markup-compatibility/2006\" xmlns:o=\"urn:schemas-microsoft-com:office:office\" xmlns:r=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships\" xmlns:m=\"http://schemas.openxmlformats.org/officeDocument/2006/math\" xmlns:v=\"urn:schemas-microsoft-com:vml\" xmlns:wp14=\"http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing\" xmlns:wp=\"http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing\" xmlns:w10=\"urn:schemas-microsoft-com:office:word\" xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\" xmlns:w14=\"http://schemas.microsoft.com/office/word/2010/wordml\" xmlns:w15=\"http://schemas.microsoft.com/office/word/2012/wordml\" xmlns:wpg=\"http://schemas.microsoft.com/office/word/2010/wordprocessingGroup\" xmlns:wpi=\"http://schemas.microsoft.com/office/word/2010/wordprocessingInk\" xmlns:wne=\"http://schemas.microsoft.com/office/word/2006/wordml\" xmlns:wps=\"http://schemas.microsoft.com/office/word/2010/wordprocessingShape\" mc:Ignorable=\"w14 w15 wp14\">" +
"<w:body>";
/* docx file foot */
final String DOCUMENT_XML_FOOT =
"</w:body>" +
"</w:document>";
final ZipOutputStream zos = new ZipOutputStream(new FileOutputStream("c:\\TEMP\\test.docx"));
final String fullDocumentXmlContent = DOCUMENT_XML_HEAD + "<w:p><w:r><w:t>Hey MS Word, hello from java.</w:t></w:r></w:p>" + DOCUMENT_XML_FOOT;
final si.gustinmi.DocxZipCreator creator = new si.gustinmi.DocxZipCreator();
// create new docx file
creator.createDocxFromExistingDocx(zos, "c:\\TEMP\\existingDocx.docx", fullDocumentXmlContent);
These are zip utilities:
package si.gustinmi;
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.logging.Logger;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
/**
* Creates new docx from existing one.
* #author gustinmi [at] gmail [dot] com
*/
public class DocxZipCreator {
public static final Logger log = Logger.getLogger(DocxZipCreator.class.getCanonicalName());
private static final int BUFFER_SIZE = 4096;
/** OnTheFly zip creator. Traverses through existing docx zip and creates new one simultaneousl.
* On the end, custom document.xml is inserted inside
* #param zipFilePath location of existing docx template (without word/document.xml)
* #param documentXmlContent content of the word/document.xml
* #throws IOException
*/
public void createDocxFromExistingDocx(ZipOutputStream zos, String zipFilePath, String documentXmlContent) throws IOException {
final FileInputStream fis = new FileInputStream(zipFilePath);
final ZipInputStream zipIn = new ZipInputStream(fis);
try{
log.info("Starting to create new docx zip");
ZipEntry entry = zipIn.getNextEntry();
while (entry != null) { // iterates over entries in the zip file
copyEntryfromZipToZip(zipIn, zos, entry.getName());
zipIn.closeEntry();
entry = zipIn.getNextEntry();
}
// add document.xml to existing zip
addZipEntry(documentXmlContent, zos, "word/document.xml");
}finally{
zipIn.close();
zos.close();
log.info("End of docx creation");
}
}
/** Copies sin gle entry from zip to zip */
public void copyEntryfromZipToZip(ZipInputStream is, ZipOutputStream zos, String entryName)
{
final byte [] data = new byte[BUFFER_SIZE];
int len;
int lenTotal = 0;
try {
final ZipEntry entry = new ZipEntry(entryName);
zos.putNextEntry(entry);
final CRC32 crc32 = new CRC32();
while ((len = is.read(data)) > -1){
zos.write(data, 0, len);
crc32.update(data, 0, len);
lenTotal += len;
}
entry.setSize(lenTotal);
entry.setTime(System.currentTimeMillis());
entry.setCrc(crc32.getValue());
}
catch (IOException ioe){
ioe.printStackTrace();
}
finally{
try { zos.closeEntry();} catch (IOException e) {}
}
}
/** Create new zip entry with content
* #param content content of a new zip entry
* #param zos
* #param entryName name (npr: word/document.xml)
*/
public void addZipEntry(String content, ZipOutputStream zos, String entryName)
{
final byte [] data = new byte[BUFFER_SIZE];
int len;
int lenTotal = 0;
try {
final InputStream is = new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
final ZipEntry entry = new ZipEntry(entryName);
zos.putNextEntry(entry);
final CRC32 crc32 = new CRC32();
while ((len = is.read(data)) > -1){
zos.write(data, 0, len);
crc32.update(data, 0, len);
lenTotal += len;
}
entry.setSize(lenTotal);
entry.setTime(System.currentTimeMillis());
entry.setCrc(crc32.getValue());
}
catch (IOException ioe){
ioe.printStackTrace();
}
finally{
try { zos.closeEntry();} catch (IOException e) {}
}
}
}
Office Writer would be a better tool to use than POI for your requirement.
If all you want is a simple table without too much of formatting, I would use this simple trick. Use Java to generate the table as HTML using plain old table,tr,td tags and copy the rendered HTML table into the word document ;)
Click here for a Working example with source code.
This example generates MS-Word docs from Java, based on a template concept.

Categories