What is the best way to create and populate Parquet files in HDFS using Java without the support of Hive or Impala libraries?
My goal is to write a simple CSV record (a String) to a Parquet file located in HDFS.
The previously asked questions and answers on this are confusing.
It seems like parquet-mr is the way to go. It provides implementations for Thrift and Avro. A custom implementation should be based on ParquetOutputFormat and might look similar to AvroParquetOutputFormat and AvroWriteSupport, which do the actual conversion.
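If you go the Avro route, a minimal sketch might look like the following, assuming the parquet-avro and hadoop-client artifacts are on the classpath; the schema, field names, and HDFS path are made up for illustration:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class CsvToParquetHdfs {
    // Hypothetical two-column schema matching a "name,age" CSV record.
    private static final Schema SCHEMA = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"CsvRow\",\"fields\":["
        + "{\"name\":\"name\",\"type\":\"string\"},"
        + "{\"name\":\"age\",\"type\":\"int\"}]}");

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumes core-site.xml/hdfs-site.xml are on the classpath, or set fs.defaultFS explicitly.
        Path path = new Path("hdfs:///tmp/example.parquet");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(path)
                .withConf(conf)
                .withSchema(SCHEMA)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            String csvRecord = "alice,42";           // the simple CSV record (String)
            String[] fields = csvRecord.split(",");
            GenericRecord record = new GenericData.Record(SCHEMA);
            record.put("name", fields[0]);
            record.put("age", Integer.parseInt(fields[1]));
            writer.write(record);
        }
    }
}

Under the hood this goes through AvroWriteSupport, so you never have to write your own WriteSupport unless your records are neither Avro nor Thrift.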
Related
I have some Parquet files that I've written in Python using PyArrow (Apache Arrow):
pyarrow.parquet.write_table(table, "example.parquet")
Now I want to read these files (and preferably get an Arrow Table) using a Java program.
In Python, I can simply use the following to get an Arrow Table from my Parquet file:
table = pyarrow.parquet.read_table("example.parquet")
Is there an equivalent and easy solution in Java?
I couldn't really find any good or working examples, nor any useful documentation for Java (only for Python), and some examples don't provide all the needed Maven dependencies. I also don't want to use a Hadoop file system; I just want to use local files.
Note: I also found out that I can't use Apache Avro, because my Parquet files contain column names with the symbols [, ] and $, which are invalid characters in Apache Avro.
Also, please provide the Maven dependencies if your solution uses Maven.
I am on Windows and using Eclipse.
Update (November 2020): I never found a suitable solution and just stuck with Python for my use case.
It's somewhat overkill, but you can use Spark.
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
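A minimal sketch, assuming the spark-sql artifact (for example spark-sql_2.12) is added as a Maven dependency; on Windows, Spark typically also expects HADOOP_HOME/winutils.exe to be set up:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadParquetWithSpark {
    public static void main(String[] args) {
        // local[*] runs Spark in-process, so no cluster is needed.
        SparkSession spark = SparkSession.builder()
                .appName("read-parquet")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = spark.read().parquet("example.parquet");
        df.printSchema();
        df.show();

        spark.stop();
    }
}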
I am new to Hadoop and have been given the task of migrating structured data to HDFS using Java code.
Can someone please explain a possible way to do this.
I did attempt to do it. What I did was copy data from a PostgreSQL server using the JDBC driver and then store it in CSV format in HDFS. Is this the right way to go about it?
I have read that Hadoop has its own datatypes for storing structured data. Can you please explain how that works.
Thank you.
The common approach is to use Sqoop (pull ETL) as a regular batch process to fetch the data from the RDBMS. However, this way of doing things is resource-consuming for the RDBMS (Sqoop often runs multiple threads with multiple JDBC connections), takes a long time (fetches often run sequentially against the RDBMS), and can lead to data inconsistencies (the live RDBMS is updated while this long Sqoop process is always lagging behind).
An alternative paradigm (push ETL) exists and is maturing. The idea is to build change-data-capture streams that listen to the RDBMS. An example project is Debezium. You can then build a real-time ETL pipeline that synchronizes the RDBMS with the data warehouse on Hadoop or elsewhere.
Sqoop is a simple tool which performs the following:
1) Connects to the RDBMS (PostgreSQL), reads the table metadata, and generates a POJO (a Java class) for the table.
2) Uses that Java class to import and export data through a MapReduce program.
If you need to write plain Java code (where you control the parallelism yourself for performance), do the following (a minimal sketch is shown after this list):
1) Create a Java class which connects to the RDBMS using JDBC.
2) Create a Java class which accepts an input String (read from the ResultSet) and writes it to a file in HDFS.
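A minimal sketch of those two steps combined, assuming the PostgreSQL JDBC driver and hadoop-client are on the classpath; the connection URL, table, and paths are illustrative:

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JdbcToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");   // illustrative namenode address

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, name FROM my_table");
             FileSystem fs = FileSystem.get(conf);
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     fs.create(new Path("/data/my_table.csv")), StandardCharsets.UTF_8))) {

            while (rs.next()) {
                // Write each row as one CSV line; real code should escape delimiters.
                out.write(rs.getLong("id") + "," + rs.getString("name"));
                out.newLine();
            }
        }
    }
}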
Another way of doing this (see the sketch below): create a MapReduce job that uses DBInputFormat as the input format, configure the number of input splits, and use TextOutputFormat with an HDFS output directory.
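A hedged sketch of that MapReduce wiring; the table, column names, and the MyTableRecord class are illustrative, and DBInputFormat ships with the Hadoop MapReduce client libraries:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DbToHdfsJob {

    // Hypothetical record mapping one row of my_table; DBInputFormat needs a DBWritable,
    // and implementing Writable as well keeps the framework happy when it copies records.
    public static class MyTableRecord implements Writable, DBWritable {
        long id;
        String name;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            name = rs.getString("name");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, name);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            name = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(name);
        }
    }

    public static class ToCsvMapper
            extends Mapper<LongWritable, MyTableRecord, NullWritable, Text> {
        protected void map(LongWritable key, MyTableRecord row, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(NullWritable.get(), new Text(row.id + "," + row.name));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
        // DBInputFormat derives its split count from the configured number of map tasks.
        conf.setInt("mapreduce.job.maps", 4);

        Job job = Job.getInstance(conf, "db-to-hdfs");
        job.setJarByClass(DbToHdfsJob.class);
        job.setMapperClass(ToCsvMapper.class);
        job.setNumReduceTasks(0);                       // map-only job
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(DBInputFormat.class);
        DBInputFormat.setInput(job, MyTableRecord.class,
                "my_table", null /* conditions */, "id" /* orderBy */, "id", "name");

        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path("/data/my_table_csv"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}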
Please visit https://bigdatatamer.blogspot.com/ for any Hadoop and Spark related questions.
Thanks
Sainagaraju Vaduka
You are better off using Sqoop, because you may end up reimplementing exactly what Sqoop already does if you go down the path of building it yourself.
Either way, conceptually, you will need a custom mapper with a custom input format that is able to read partitioned data from the source. In this case, a table column on which the data can be partitioned is required to exploit parallelism. A partitioned source table would be ideal.
DBInputFormat doesn't optimise the calls on the source database: the complete dataset is sliced into the configured number of splits by the InputFormat.
Each mapper executes the same query and loads only the portion of the data corresponding to its split. This means every mapper issues the same query, along with a sort of the dataset, so that it can pick out its own portion of the data.
This class doesn't seem to take advantage of a partitioned source table. You can extend it to handle partitioned tables more efficiently.
Hadoop has structured file formats like Avro, ORC and Parquet to begin with.
If your data doesn't need to be stored in a columnar format (used primarily for OLAP use cases where only a few columns out of a large set need to be selected), go with Avro.
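For example, writing rows to an Avro data file with the generic API might look like this minimal sketch (the schema and values are illustrative):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteAvroExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema for an "id,name" row.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("rows.avro"));   // local file; write to an HDFS stream as needed
            GenericRecord record = new GenericData.Record(schema);
            record.put("id", 1L);
            record.put("name", "alice");
            writer.append(record);
        }
    }
}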
The way you are trying to do this is not a good one, because you are going to waste a lot of time developing and testing the code. Instead, use Sqoop to import the data from any RDBMS into Hive. The first tool that should come to mind here is Sqoop (SQL to Hadoop).
I want to store some Java/Scala objects as records in Parquet format, and I'm currently using parquet-avro and the AvroParquetWriter class for this purpose. This works fine, but it is very coupled to Hadoop and its file system implementation(s). Instead, I would like to somehow get the raw binary data of the files (preferably, but not necessarily, in a streaming fashion) and handle the writing of the files "manually", due to the nature of the framework I'm integrating with. Has anyone been able to achieve something like this?
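What I have in mind is something along these lines (a rough sketch, assuming a reasonably recent parquet-avro, 1.10 or later, which accepts a custom org.apache.parquet.io.OutputFile; the class and schema names are illustrative):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

public class InMemoryParquet {

    // OutputFile backed by an in-memory buffer, so no Hadoop FileSystem is involved.
    static class InMemoryOutputFile implements OutputFile {
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        public PositionOutputStream create(long blockSizeHint) {
            return new PositionOutputStream() {
                long pos = 0;
                public long getPos() { return pos; }
                public void write(int b) { buffer.write(b); pos++; }
                public void write(byte[] b, int off, int len) {
                    buffer.write(b, off, len); pos += len;
                }
            };
        }
        public PositionOutputStream createOrOverwrite(long blockSizeHint) { return create(blockSizeHint); }
        public boolean supportsBlockSize() { return false; }
        public long defaultBlockSize() { return 0; }
    }

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
            + "{\"name\":\"value\",\"type\":\"string\"}]}");

        InMemoryOutputFile out = new InMemoryOutputFile();
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(out)
                .withSchema(schema)
                .build()) {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("value", "hello");
            writer.write(rec);
        }
        byte[] parquetBytes = out.buffer.toByteArray();   // raw Parquet file bytes to hand off
        System.out.println("wrote " + parquetBytes.length + " bytes");
    }
}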
I want to write a parquet file in standalone java, in local filesystem (not on hadoop).
How to do this?
I know I can do this easily with Spark, but I need to do this in standalone Java, so no Hadoop, Spark, etc.
See this blog post: http://blog.antlypls.com/blog/2015/12/02/how-to-write-data-into-parquet/
The short version is that you need to define/provide a schema and use the appropriate ParquetWriter; a minimal sketch follows.
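Something along these lines, using an Avro schema and AvroParquetWriter (the schema is illustrative; parquet-avro still pulls in the Hadoop client jars as dependencies, but no running Hadoop cluster is required):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class WriteLocalParquet {
    public static void main(String[] args) throws Exception {
        // Illustrative one-column schema.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Rec\",\"fields\":["
            + "{\"name\":\"message\",\"type\":\"string\"}]}");

        // A plain relative path is resolved against the local file system,
        // so this writes data.parquet into the working directory.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("data.parquet"))
                .withSchema(schema)
                .build()) {
            GenericRecord rec = new GenericData.Record(schema);
            rec.put("message", "hello parquet");
            writer.write(rec);
        }
    }
}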
I'm looking for tools/libraries which allow fast (easy) data import into existing database tables. For example, phpMyAdmin allows data import from .csv, .xml, etc. In Hadoop, Hue via Beeswax for Hive lets us create a table from a file. I'm looking for tools I can use with PostgreSQL, or libraries which allow doing such things quickly and easily - I want to avoid coding it manually, from reading the file to inserting into the DB via JDBC.
You can do all that with standard tools in PostgreSQL, without additional libraries.
For .csv files you can use the built-in COPY command. COPY is fast and simple. The source file has to lie on the same machine as the database for that. If it doesn't, you can use the very similar \copy meta-command of psql.
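If you ever do need to drive the import from Java rather than psql, the PostgreSQL JDBC driver exposes the same COPY fast path through its CopyManager API, so you still avoid row-by-row inserts; a small sketch (connection details, table, and file name are illustrative):

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvCopyImport {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             FileReader csv = new FileReader("data.csv")) {

            // COPY ... FROM STDIN streams the client-side file through the connection,
            // so the file does not need to be on the database server.
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            long rows = copy.copyIn("COPY my_table FROM STDIN WITH CSV HEADER", csv);
            System.out.println("imported " + rows + " rows");
        }
    }
}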
For .xml files (or any format, really) you can use the built-in pg_read_file() inside a PL/pgSQL function. However, I quote:
Only files within the database cluster directory and the log_directory can be accessed.
So you have to put your source file there or create a symbolic link to your actual file/directory. Then you can parse it with unnest() and xpath() and friends. You need at least PostgreSQL 8.4 for that.
There is a kick start on parsing XML in this blog post by Scott Bailey.