Read Parquet Files using Apache Arrow

I have some Parquet files that I've written in Python using PyArrow (Apache Arrow):
pyarrow.parquet.write_table(table, "example.parquet")
Now I want to read these files (and preferably get an Arrow Table) using a Java program.
In Python, I can simply use the following to get an Arrow Table from my Parquet file:
table = pyarrow.parquet.read_table("example.parquet")
Is there an equivalent and easy solution in Java?
I couldn't find any good, working examples or useful documentation for Java (only for Python), and the examples I did find don't list all of the required Maven dependencies. I also don't want to use a Hadoop file system; I just want to read local files.
Note: I also found out that I can't use "Apache Avro" because my Parquet files contain column names with the symbols [, ] and $, which are invalid characters in Apache Avro.
Also, please provide the Maven dependencies if your solution uses Maven.
I am on Windows and using Eclipse.
Update (November 2020): I never found a suitable solution and just stuck with Python for my use case.

It's somewhat overkill, but you can use Spark.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

Related

google-java-format: format whole project

I'm using google-java-format to format Java code according to Google Java Style. However, I only find documentation and examples showing how to format one file using the CLI.
Is there a built-in way to format an entire Java project directly using the CLI (without looping using a shell script or something else)?

After a quick read of the google-java-format documentation, it appears the tool is intended to work on one file at a time or on a group of files, each listed on the command line. There appear to be plugins for IntelliJ and Eclipse.
If you need to format every file in your project, you will need to do one of the following:
Feed a list of every file. This is fairly easy with a script; use xargs.
If you use IntelliJ, check the plugin options. It likely has some kind of selection mechanism.
As above, the Eclipse plugin likely has some kind of selection mechanism, as well.
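
Since the question asks about avoiding a shell script, the first option can also be sketched in plain Java: walk the project tree, collect every .java file, and pass the whole list to the formatter in one invocation. This is only a sketch under assumptions. The class name FormatAllSources, the default jar file name, and the default "src" directory are hypothetical; --replace is the formatter's option for rewriting files in place.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FormatAllSources {

    // Recursively collect every *.java file under root, sorted for stable output.
    static List<String> findJavaFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .map(Path::toString)
                    .filter(name -> name.endsWith(".java"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Build: java -jar <formatterJar> --replace file1 file2 ...
    static List<String> buildCommand(String formatterJar, List<String> files) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-jar");
        command.add(formatterJar);
        command.add("--replace"); // rewrite the files in place instead of printing to stdout
        command.addAll(files);
        return command;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical defaults; pass the jar path and source root as arguments.
        String jar = args.length > 0 ? args[0] : "google-java-format-all-deps.jar";
        Path root = Paths.get(args.length > 1 ? args[1] : "src");
        if (!Files.isDirectory(root)) {
            System.out.println("Not a directory: " + root);
            return;
        }
        List<String> files = findJavaFiles(root);
        if (files.isEmpty()) {
            System.out.println("No .java files found under " + root);
            return;
        }
        // Run the formatter once with every file listed on the command line.
        int exitCode = new ProcessBuilder(buildCommand(jar, files))
                .inheritIO()
                .start()
                .waitFor();
        System.exit(exitCode);
    }
}
```

On a very large project the argument list may exceed the operating system's command-line length limit, in which case the file list would need to be split into batches.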

Write parquet in java

I want to write a Parquet file in standalone Java, to the local filesystem (not to Hadoop).
How can I do this?
I know I can do this easily with Spark, but I need to do it in standalone Java, so no Hadoop, Spark, etc.

See this blog post: http://blog.antlypls.com/blog/2015/12/02/how-to-write-data-into-parquet/
The short version is that you need to define or provide a schema and use the appropriate ParquetWriter.

How to add external tags file into CEDET in Emacs

I tried to use CEDET to get auto-completion in Emacs, and that works fine for C/C++. But I cannot find anything about how to use CEDET with Java without the help of JDEE, which is considered out of date and is not compatible with CEDET 1.1. I generated a tags file using a utility found here, but I don't know how to integrate it into the CEDET system. According to CEDET's website, that's possible, but they don't explain how to do it. Is anyone willing to answer this question?
Here is a sample of the tags file generated by that utility:
java.applet.Applet$AccessibleApplet
protected java.applet.Applet$AccessibleApplet(java.applet.Applet)
public java.applet.Applet$AccessibleApplet.getAccessibleRole() returns javax.accessibility.AccessibleRole
public java.applet.Applet$AccessibleApplet.getAccessibleStateSet() returns javax.accessibility.AccessibleStateSet

It is possible to have CEDET pull in tags from a .jar file. It works by using javap to extract the tags in text form, and then it parses that data.
It isn't very easy to set up since in CEDET, the concept of where to find your library files is part of EDE, the project management system, not the parser and smart completion system. The only Java based project supported in CEDET 1.1 is Android.
The basic approach is to first enable the javap database by loading it with (require 'semanticdb-javap) in CEDET 1.1, or (require 'semantic/db-javap) in the bzr version of CEDET.
Once you've done that, you can configure it via the cedet-java-classpath-extension. I'm a little fuzzy on the details of what happens next, but folks have reported success on the mailing list.
If you use CEDET from the bzr repository, there is the ede-java-root project, which is similar to the ede-cpp-root project. That project type lets you configure what your library path is. The doc for that is in the ede/java-root.el file with the project type, and shows you the basics of how to use it.

How to create .mpp file in java?

I am able to create an .mpx file using the MPXJ library in Java.
I need to write (create) an .mpp file in Java. Can anyone suggest how to do this?

I maintain MPXJ, and the short answer to your enquiry is that, at present, MPXJ does not write MPP files.
The main reason for this is simply that despite the effort which has gone into understanding the MPP file structure, there is still a great deal of it which is not well understood, hence it is difficult to reliably generate. The other issue is that even if I was to produce some code which could generate an MPP file, the features it could write to that file are likely to lag behind what MPXJ supports in the MSPDI file format, again due to my incomplete understanding of the MPP format.
My suspicion is that the next version of MS Project (Project 15? Project 2013?) may offer a ".mppx" file format, similar to the ".docx" etc. formats used by other applications in the MS Office suite. This would be XML-based and more straightforward to generate than the binary MPP file format currently is... let's see what Microsoft come up with!
Jon

Visit http://www.mpxj.org/faq/
Can I use MPXJ to write MPP files?
Not at present. Although it is technically feasible to generate an MPP file, the knowledge we have of the file structure is still relatively incomplete, despite the amount of data we are able to correctly extract. It is therefore likely to take a considerable amount of development effort to make this work, and it is conceivable that we will not be able to write the full set of attributes that MPXJ supports back into the MPP file - simply because we don't understand the format well enough. You are therefore probably better off using MSPDI which does support the full range of data items present in an MPP file.
You can try this: http://www.aspose.com/java/project-management-component.aspx
It writes MPP and Microsoft Project XML, but it is not free.

I think by "mpp" you probably mean "Microsoft PowerPoint", correct?
Q: Why do you think MPXJ (Microsoft Project Exchange/Java) can't do this?
http://www.mpxj.org/
Welcome to MPXJ! This library provides a set of facilities to allow
project information to be manipulated in Java and .Net. MPXJ supports
a range of data formats: Microsoft Project Exchange (MPX), Microsoft
Project (MPP,MPT), Microsoft Project Data Interchange (MSPDI XML),
Microsoft Project Database (MPD), Planner (XML), Primavera (PM XML,
XER, and database), and Asta Powerproject (PP, MDB).

Automatic update of keyword in Word document

As part of our build process (a Java build with Ant), I want to update a version number somehow in or near a Word document (a software guide). By "near" I mean that I'd accept updating the document properties rather than something in the text itself.
From looking around the internets, it looks like the main option is writing a small C# program that uses Office's COM functionality to do this task. I have a big philosophical problem with this (not the C# part, but making Office and COM part of our build process). Are there any other options out there?
(Yes, .docx is theoretically XML; haven't found anybody updating it that way yet - why not?)

Version 3.5 of Apache POI (a Java API for accessing Office format files) has support for Office Open XML format documents. It is currently in beta as of writing.
The Aspose.Words class library looks like a non-free option that could also be used to help solve your problem.
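
Because a .docx file is a ZIP archive of XML parts, a dependency-free alternative for a build step is to copy the archive and substitute a placeholder token in one part. The following is a minimal sketch under assumptions, not the POI or Aspose API: the class name DocxVersionStamper, the @VERSION@ token, and the choice of docProps/custom.xml (the custom document properties part) are all hypothetical, and the template document is assumed to already contain the token in that part, encoded as UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class DocxVersionStamper {

    // Read the remaining bytes of the current stream (for a ZipInputStream: the current entry).
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    // Copy every entry of the archive; in the part named partName, replace token with version.
    // Assumes the part is UTF-8 and that version contains no XML special characters.
    static void stamp(InputStream docxIn, OutputStream docxOut,
                      String partName, String token, String version) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(docxIn);
             ZipOutputStream zout = new ZipOutputStream(docxOut)) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                byte[] data = readAll(zin);
                if (entry.getName().equals(partName)) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    data = xml.replace(token, version).getBytes(StandardCharsets.UTF_8);
                }
                // A fresh entry lets the output stream recompute sizes and checksums.
                zout.putNextEntry(new ZipEntry(entry.getName()));
                zout.write(data);
                zout.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("usage: DocxVersionStamper <template.docx> <output.docx> <version>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]));
             OutputStream out = Files.newOutputStream(Paths.get(args[1]))) {
            // Hypothetical part name and token; adjust to match the template document.
            stamp(in, out, "docProps/custom.xml", "@VERSION@", args[2]);
        }
    }
}
```

Targeting a document property rather than body text is deliberate here: Word can split visible text across several XML elements, so a token typed into the body is not guaranteed to appear contiguously in word/document.xml.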

You could have a look at how Groovy does it using their Scriptom module, which is based on the Jacob library (Java COM Bridge).
