Writing files to Parquet format in Java?

While researching how to write files to Parquet in Java, I came across:
org.apache.parquet.hadoop.ParquetWriter
org.apache.parquet.avro.AvroParquetWriter
But both have been deprecated. What are the alternatives?

AvroParquetWriter itself isn't deprecated. Its constructors are deprecated in favor of the static builder(org.apache.parquet.io.OutputFile file) method.
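For example, a minimal write sketch against that builder; the schema, field values, and output file name below are made up for illustration:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.util.HadoopOutputFile;

public class ParquetWriteExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-field schema, defined inline for the example.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        Path path = new Path("users.parquet"); // hypothetical output path
        // builder(OutputFile) replaces the deprecated constructors.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(HadoopOutputFile.fromPath(path, new Configuration()))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("name", "alice");
            record.put("age", 30);
            writer.write(record);
        }
    }
}

This assumes the parquet-avro and hadoop-client artifacts are on the classpath; HadoopOutputFile adapts a Hadoop Path to the OutputFile the builder expects.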


Write/Read Apache Parquet Files in Java/Spring

I'm just trying to write a Parquet file, but every example I find on Google uses deprecated methods or simply doesn't work.
Besides, there doesn't seem to be any official documentation with examples.
It would be interesting to see a writing example and a reading-to-POJO example as well.
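Since the question asks for both directions, here is a minimal read sketch using AvroParquetReader into Avro GenericRecords (the input file name is made up):

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class ParquetReadExample {
    public static void main(String[] args) throws Exception {
        Path path = new Path("users.parquet"); // hypothetical input file
        try (ParquetReader<GenericRecord> reader = AvroParquetReader
                .<GenericRecord>builder(HadoopInputFile.fromPath(path, new Configuration()))
                .build()) {
            GenericRecord record;
            // read() returns null at end of file.
            while ((record = reader.read()) != null) {
                System.out.println(record);
            }
        }
    }
}

For POJOs, the usual route is Avro's generated SpecificRecord classes or its reflect-based data model rather than a direct arbitrary-POJO binding.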

How to create and populate Parquet files in HDFS using Java?

What is the best way to create and populate Parquet files in HDFS using Java without the support of Hive or Impala libraries?
My goal is to write a simple csv record (String) to a Parquet file located in HDFS.
All the questions/answers previously asked are confusing.
Seems like parquet-mr is the way to go. It provides implementations for Thrift and Avro. Your own implementation should be based on ParquetOutputFormat and might look similar to AvroParquetOutputFormat and AvroWriteSupport, which does the actual conversion.
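As a hedged sketch of that route without Hive/Impala, using parquet-mr's example Group API rather than Avro; the schema, HDFS path, and CSV line are invented for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.hadoop.example.GroupWriteSupport;
import org.apache.parquet.hadoop.util.HadoopOutputFile;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class CsvToParquet {
    public static void main(String[] args) throws Exception {
        // Hypothetical two-column schema matching the CSV record below.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message csv { required binary name (UTF8); required int32 age; }");

        Configuration conf = new Configuration();
        GroupWriteSupport.setSchema(schema, conf);

        Path path = new Path("hdfs:///tmp/records.parquet"); // hypothetical HDFS location
        try (ParquetWriter<Group> writer = ExampleParquetWriter
                .builder(HadoopOutputFile.fromPath(path, conf))
                .withConf(conf)
                .withType(schema)
                .build()) {
            String csvRecord = "alice,30"; // the simple CSV record (String) from the question
            String[] fields = csvRecord.split(",");
            Group group = new SimpleGroupFactory(schema).newGroup()
                .append("name", fields[0])
                .append("age", Integer.parseInt(fields[1]));
            writer.write(group);
        }
    }
}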

Write parquet in java

I want to write a parquet file in standalone java, in local filesystem (not on hadoop).
How to do this?
I know I can do this easily with spark, but I need to do this in standalone java so no hadoop, spark, ecc.
See this blog post: http://blog.antlypls.com/blog/2015/12/02/how-to-write-data-into-parquet/
The short version: you need to define/provide a schema and use the appropriate ParquetWriter.
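A minimal sketch along those lines; note that "standalone" is only partial, because parquet-mr still needs the hadoop-client jars on the classpath even though no cluster is involved. The schema and file:// path are made up:

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.util.HadoopOutputFile;

public class LocalParquetWrite {
    public static void main(String[] args) throws Exception {
        // SchemaBuilder is a programmatic alternative to parsing a JSON schema string.
        Schema schema = SchemaBuilder.record("Line").fields()
                .requiredString("text")
                .endRecord();

        // A file:// path keeps everything on the local filesystem; no HDFS
        // cluster is involved, only the Hadoop libraries themselves.
        Path path = new Path("file:///tmp/standalone.parquet"); // hypothetical path
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(HadoopOutputFile.fromPath(path, new Configuration()))
                .withSchema(schema)
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("text", "hello parquet");
            writer.write(record);
        }
    }
}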

Class file parser

I have an assignment where I need to write a Java program that parses a .class file and retrieves things like:
1. name of the .java file
2. implemented interfaces
3. variables
4. constructors
5. methods
I don't have any idea where to begin. For example, what kind of data structures and I/O should I use?
You can use ClassParser, which is available in the Apache Commons BCEL library; its Javadoc describes the API, and the library can be downloaded from the Apache Commons site.
You can also use the Java reflection API, which provides methods such as getDeclaredFields, getDeclaredMethods, etc.
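A minimal BCEL sketch covering the items in the list; the .class file name is hypothetical, and constructors show up among the methods as <init>:

import org.apache.bcel.classfile.ClassParser;
import org.apache.bcel.classfile.Field;
import org.apache.bcel.classfile.JavaClass;
import org.apache.bcel.classfile.Method;

public class ClassFileDump {
    public static void main(String[] args) throws Exception {
        // Parse the class file directly from disk; no class loading involved.
        JavaClass clazz = new ClassParser("MyClass.class").parse(); // hypothetical file
        System.out.println("source file: " + clazz.getSourceFileName());
        for (String iface : clazz.getInterfaceNames()) {
            System.out.println("implements: " + iface);
        }
        for (Field field : clazz.getFields()) {
            System.out.println("field: " + field);
        }
        // Constructors appear here as methods named <init>.
        for (Method method : clazz.getMethods()) {
            System.out.println("method: " + method);
        }
    }
}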
There are already several libraries for classfile parsing out there. ObjectWeb ASM is the most popular.
If you have to do it from scratch, then I'd recommend starting with the JVM specification, which explains the binary layout of classfiles in detail. After that, parsing is a simple matter of programming. I've written a classfile parser before; it's not that hard.
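For the ASM route, a hedged sketch using a ClassVisitor; the file name is made up and the ASM 9 API level is assumed:

import java.io.FileInputStream;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.FieldVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

public class AsmDump {
    public static void main(String[] args) throws Exception {
        ClassReader reader = new ClassReader(new FileInputStream("MyClass.class")); // hypothetical file
        reader.accept(new ClassVisitor(Opcodes.ASM9) {
            @Override
            public void visit(int version, int access, String name,
                              String signature, String superName, String[] interfaces) {
                for (String iface : interfaces) {
                    System.out.println("implements: " + iface);
                }
            }
            @Override
            public void visitSource(String source, String debug) {
                System.out.println("source file: " + source); // the original .java file name
            }
            @Override
            public FieldVisitor visitField(int access, String name, String descriptor,
                                           String signature, Object value) {
                System.out.println("field: " + name + " " + descriptor);
                return null;
            }
            @Override
            public MethodVisitor visitMethod(int access, String name, String descriptor,
                                             String signature, String[] exceptions) {
                System.out.println("method: " + name + descriptor); // <init> = constructor
                return null;
            }
        }, 0);
    }
}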
You don't need any external library; just use java.lang.Class. Write the name of your class:
[NameOfMyClass].class.getDeclaredFields();
[NameOfMyClass].class.getDeclaredConstructors();
[NameOfMyClass].class.getDeclaredMethods();
It's the same for interfaces and many other attributes.
You can use Java Reflection. Here is a good tutorial: Java Reflection Tutorial.
OpenJDK actually comes with an API that lets you parse and manipulate class files programmatically, which most programmers don't know about. It lives in the internal package com.sun.org.apache.bcel.internal (a repackaged copy of Apache BCEL).

Parsing a directory of logs in Hadoop 0.20.2

I have a directory of text-based, compressed log files, each containing many records. In older versions of Hadoop I would extend MultiFileInputFormat to return a custom RecordReader which decompressed the log files and continue from there. But I'm trying to use Hadoop 0.20.2.
In the Hadoop 0.20.2 documentation, I notice MultiFileInputFormat is deprecated in favor of CombineFileInputFormat. But to extend CombineFileInputFormat, I have to use the deprecated classes JobConf and InputSplit. What is the modern equivalent of MultiFileInputFormat, or the modern way of getting records from a directory of files?
What is the modern equivalent of MultiFileInputFormat, or the modern way of getting records from a directory of files?
o.a.h.mapred.* (org.apache.hadoop.mapred) has the old API, while o.a.h.mapreduce.* has the new API. Some of the input/output formats have not been migrated to the new API; MultiFileInputFormat/CombineFileInputFormat are among those still missing in 0.20.2. I remember a JIRA being opened to migrate the missing formats, but I don't remember the JIRA #.
But to extend CombineFileInputFormat, I have to use the deprecated classes JobConf and InputSplit.
For now it should be OK to use the old API; check this response in the Apache forums. I am not sure of the exact plans for stopping support for the old API, but I don't think many have started using the new API yet, so I think it will be supported for the foreseeable future.
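A hedged sketch of staying on the old API: subclass the old-style CombineFileInputFormat and hand each underlying file to a per-file reader. LogRecordReader below is a hypothetical stub; the old-API CombineFileRecordReader instantiates it reflectively, so the exact constructor signature shown is required:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Old-API (org.apache.hadoop.mapred.lib) input format that combines many
// small log files into fewer splits.
public class CombinedLogInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @SuppressWarnings({"unchecked", "rawtypes"})
    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf conf, Reporter reporter) throws IOException {
        // CombineFileRecordReader opens one delegate reader per file in the split.
        return new CombineFileRecordReader<LongWritable, Text>(
                conf, (CombineFileSplit) split, reporter, (Class) LogRecordReader.class);
    }

    // Hypothetical per-file reader stub; this is where decompression would go.
    public static class LogRecordReader implements RecordReader<LongWritable, Text> {
        public LogRecordReader(CombineFileSplit split, Configuration conf,
                               Reporter reporter, Integer index) {
            // Open split.getPath(index) here and wrap it in a decompressing stream.
        }
        public boolean next(LongWritable key, Text value) throws IOException { return false; }
        public LongWritable createKey() { return new LongWritable(); }
        public Text createValue() { return new Text(); }
        public long getPos() { return 0; }
        public float getProgress() { return 0f; }
        public void close() { }
    }
}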
