I want to write a Parquet file in standalone Java, on the local filesystem (not on Hadoop).
How can I do this?
I know I can do this easily with Spark, but I need to do this in standalone Java, so no Hadoop, Spark, etc.
See this blog post: http://blog.antlypls.com/blog/2015/12/02/how-to-write-data-into-parquet/
The short version is that you need to define/provide a schema and use the appropriate ParquetWriter.
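For example, here is a minimal sketch using the example Group API from parquet-hadoop (the schema, file name and Maven artifacts are my assumptions, not a definitive recipe):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class WriteLocalParquet {
    public static void main(String[] args) throws Exception {
        // Define the schema as a Parquet message type
        MessageType schema = MessageTypeParser.parseMessageType(
                "message example { required int32 id; required binary name (UTF8); }");

        // A plain relative path resolves against the local filesystem; no cluster involved
        Path path = new Path("example.parquet");

        try (ParquetWriter<Group> writer = ExampleParquetWriter.builder(path)
                .withType(schema)
                .withConf(new Configuration())
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            SimpleGroupFactory factory = new SimpleGroupFactory(schema);
            writer.write(factory.newGroup().append("id", 1).append("name", "alice"));
            writer.write(factory.newGroup().append("id", 2).append("name", "bob"));
        }
    }
}

Note that even for local files you still need parquet-hadoop plus hadoop-common (or hadoop-client) on the classpath, because parquet-mr uses Hadoop's Path and Configuration classes internally; no Hadoop cluster or HDFS is required, though.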
Related
I have some Parquet files that I've written in Python using PyArrow (Apache Arrow):
pyarrow.parquet.write_table(table, "example.parquet")
Now I want to read these files (and preferably get an Arrow Table) using a Java program.
In Python, I can simply use the following to get an Arrow Table from my Parquet file:
table = pyarrow.parquet.read_table("example.parquet")
Is there an equivalent and easy solution in Java?
I couldn't really find any good, working examples or any useful documentation for Java (only for Python), and some examples don't provide all the needed Maven dependencies. I also don't want to use a Hadoop file system; I just want to use local files.
Note: I also found out that I can't use Apache Avro, because my Parquet files contain column names with the symbols [, ] and $, which are invalid characters in Apache Avro.
Also, please provide the Maven dependencies if your solution uses Maven.
I am on Windows and using Eclipse.
Update (November 2020): I never found a suitable solution and just stuck with Python for my use case.
It's somewhat overkill, but you can use Spark.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
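If you do go the Spark route, a local-mode session is enough; here is a rough sketch in Java, assuming the spark-sql dependency and the file name from the question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadParquetWithSpark {
    public static void main(String[] args) {
        // Local-mode session; no cluster needed
        SparkSession spark = SparkSession.builder()
                .appName("read-parquet")
                .master("local[*]")
                .getOrCreate();

        // Reads the schema and the data from the local file
        Dataset<Row> df = spark.read().parquet("example.parquet");
        df.printSchema();
        df.show();

        spark.stop();
    }
}

You get a Spark Dataset rather than an Arrow Table, but for inspecting or converting local Parquet files it works, at the cost of pulling in the whole Spark distribution.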
I want to store some Java/Scala objects as records in Parquet format, and I'm currently using parquet-avro and the AvroParquetWriter class for this purpose. This works fine, but it is tightly coupled to Hadoop and its file system implementation(s). Instead, I would like to somehow get the raw binary data of the files (preferably, but not absolutely necessarily, in a streaming fashion) and handle the writing of the files "manually", due to the nature of the framework I'm integrating with. Has anyone been able to achieve something like this?
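For context, a stripped-down version of the current approach (the record class and output path are placeholders; the org.apache.hadoop.fs.Path / FileSystem usage is exactly the coupling I'd like to get rid of):

import org.apache.avro.reflect.ReflectData;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

// Hypothetical record class standing in for the real domain objects
class Event {
    public String name;
    public long timestamp;
    public Event() {}
    public Event(String name, long timestamp) { this.name = name; this.timestamp = timestamp; }
}

public class CurrentApproach {
    public static void main(String[] args) throws Exception {
        // Hadoop Path -> Hadoop FileSystem: this is the part I want to replace
        Path path = new Path("events.parquet");

        try (ParquetWriter<Event> writer = AvroParquetWriter.<Event>builder(path)
                .withSchema(ReflectData.get().getSchema(Event.class))
                .withDataModel(ReflectData.get())
                .build()) {
            writer.write(new Event("started", System.currentTimeMillis()));
        }
    }
}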
What is the best way to create and populate Parquet files in HDFS using Java without the support of Hive or Impala libraries?
My goal is to write a simple csv record (String) to a Parquet file located in HDFS.
All the questions/answers previously asked are confusing.
Seems like parquet-mr is the way to go. It provides implementations for Thrift and Avro. Your own implementation should be based on ParquetOutputFormat and might look similar to AvroParquetOutputFormat and AvroWriteSupport, which does the actual conversion.
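As a concrete illustration of the Avro-backed route, here is a rough sketch that turns one CSV line into a GenericRecord and writes it through AvroParquetWriter (the schema, the HDFS path and the parquet-avro/hadoop-client dependencies are assumptions, not a definitive recipe):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class CsvToParquet {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-column schema matching a csv line like "1,alice"
        Schema schema = SchemaBuilder.record("CsvRecord").fields()
                .requiredInt("id")
                .requiredString("name")
                .endRecord();

        Configuration conf = new Configuration();             // picks up core-site.xml / hdfs-site.xml
        Path path = new Path("hdfs:///tmp/example.parquet");  // assumed target path in HDFS

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(path)
                .withSchema(schema)
                .withConf(conf)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            String csvLine = "1,alice";
            String[] fields = csvLine.split(",");
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", Integer.parseInt(fields[0]));
            record.put("name", fields[1]);
            writer.write(record);
        }
    }
}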
I'm looking for tools/libraries that allow fast (easy) data import into existing database tables. For example, phpMyAdmin allows data import from .csv, .xml, etc. In Hadoop, Hue via Beeswax for Hive lets us create a table from a file. I'm looking for tools I can use with PostgreSQL, or libraries that allow doing such things quickly and easily - I'm looking for a way to avoid coding it manually, from reading the file to inserting into the DB via JDBC.
You can do all that with standard tools in PostgreSQL, without additional libraries.
For .csv files you can use the built-in COPY command. COPY is fast and simple. The source file has to lie on the same machine as the database for that. If not, you can use the very similar \copy meta-command of psql.
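If the load does end up being driven from Java, the PostgreSQL JDBC driver exposes the same COPY mechanism through its CopyManager, which streams a client-side file through COPY ... FROM STDIN much like \copy does; a rough sketch, with connection string, table and file names as placeholders:

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CsvCopyLoad {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Reader csv = new FileReader("data.csv")) {

            CopyManager copyManager = new CopyManager((BaseConnection) conn);
            // The file is read on the client and streamed to the server,
            // so it does not have to live on the database machine
            long rows = copyManager.copyIn(
                    "COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", csv);
            System.out.println("loaded " + rows + " rows");
        }
    }
}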
For .xml files (or any format, really) you can use the built-in pg_read_file() inside a PL/pgSQL function. However, I quote:
Only files within the database cluster directory and the log_directory can be accessed.
So you have to put your source file there or create a symbolic link to your actual file/directory. Then you can parse it with unnest() and xpath() and friends. You need at least PostgreSQL 8.4 for that.
There's a kick start on parsing XML in this blog post by Scott Bailey.
I was wondering if an API exists that would allow me to dump a Synergy database through Java. I can create an Excel report through IBM Rational Change, but it would be nice if there was a way to just send a CCMDB query to the server through Java, generate a CSV or XML file, and have it dumped locally. Does anybody know of a possible way to do that?
I would doubt that something exists publicly. You can always call a command-line ccm query from Java and parse the output. I am working on a Python script that will do something similar, but it will be highly specific to my project DB setup (which is a mess).
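For what it's worth, the call-the-command-line-and-parse route is straightforward from Java with ProcessBuilder. A rough sketch follows; the exact ccm query arguments and -format keywords below are assumptions and will depend on your DB setup:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CcmQueryDump {
    public static void main(String[] args) throws Exception {
        // Assumed example invocation: query all task objects, print three attributes per line
        List<String> command = Arrays.asList(
                "ccm", "query", "-format", "%name|%status|%owner", "type='task'");

        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)
                .start();

        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split("\\|"));   // one record per line, fields split on '|'
            }
        }
        process.waitFor();

        // From here the rows can be written out as CSV or XML with any standard library
        for (String[] row : rows) {
            System.out.println(String.join(",", row));
        }
    }
}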