I want to store some Java/Scala objects as records in Parquet format, and I'm currently using parquet-avro and the AvroParquetWriter class for this purpose. This works fine, but it is tightly coupled to Hadoop and its file system implementation(s). Instead, I would like to get hold of the raw binary data of the files (preferably, but not necessarily, in a streaming fashion) and handle the writing of the files "manually", due to the nature of the framework I'm integrating with. Has anyone been able to achieve something like this?
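For illustration, this is roughly the shape of what I'm hoping for. It is only a sketch: it assumes a parquet-avro/parquet-mr release new enough to expose org.apache.parquet.io.OutputFile and the corresponding AvroParquetWriter.builder(OutputFile) overload, and the in-memory class is something I made up for the example.

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.io.OutputFile;
import org.apache.parquet.io.PositionOutputStream;

public class InMemoryParquet {

    // OutputFile backed by a plain in-memory buffer, so no Hadoop file system is involved.
    static class InMemoryOutputFile implements OutputFile {
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

        @Override
        public PositionOutputStream create(long blockSizeHint) {
            return new PositionOutputStream() {
                private long pos = 0;
                @Override public long getPos() { return pos; }
                @Override public void write(int b) { buffer.write(b); pos++; }
                @Override public void write(byte[] b, int off, int len) {
                    buffer.write(b, off, len);
                    pos += len;
                }
            };
        }

        @Override public PositionOutputStream createOrOverwrite(long blockSizeHint) { return create(blockSizeHint); }
        @Override public boolean supportsBlockSize() { return false; }
        @Override public long defaultBlockSize() { return 0; }
    }

    public static byte[] writeRecords(Schema schema, Iterable<GenericRecord> records) throws IOException {
        InMemoryOutputFile out = new InMemoryOutputFile();
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(out)
                .withSchema(schema)
                .withConf(new Configuration()) // still used for Avro/Parquet settings, not for I/O
                .build()) {
            for (GenericRecord record : records) {
                writer.write(record);
            }
        }
        return out.buffer.toByteArray(); // the raw bytes of a complete Parquet file
    }
}

The resulting byte array would be a complete Parquet file, so it could be streamed or handed to whatever sink the surrounding framework provides.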
Related
I am new to Hadoop and have been given the task of migrating structured data to HDFS using Java code. I know the same can be accomplished with Sqoop, but that is not my task.
Can someone please explain a possible way to do this?
I did attempt it: I copied data from a psql server using the JDBC driver and then stored it in CSV format in HDFS. Is this the right way to go about it?
I have read that Hadoop has its own data types for storing structured data. Can you please explain how that works?
Thank you.
The state of the art is to use Sqoop (pull ETL) as a regular batch process to fetch the data from the RDBMS. However, this approach is resource-consuming for the RDBMS (Sqoop often runs multiple threads with multiple JDBC connections), takes a long time (you often run sequential fetches against the RDBMS), and can lead to data inconsistencies (the live RDBMS is updated while this long Sqoop process is perpetually lagging behind).
An alternative paradigm (push ETL) exists and is maturing. The idea is to build change data capture (CDC) streams that listen to the RDBMS. An example project is Debezium. You can then build a real-time ETL pipeline that keeps the RDBMS and the data warehouse on Hadoop (or elsewhere) in sync.
Sqoop is a simple tool which performs the following:
1) Connects to the RDBMS (PostgreSQL), reads the table's metadata, and generates a POJO (a Java class) for the table.
2) Uses that Java class to import and export data through a MapReduce program.
If you need to write plain Java code (where you need to control the parallelism yourself for performance), do the following:
1) Create a Java class which connects to the RDBMS using JDBC.
2) Create a Java class which accepts an input String (taken from the ResultSet) and writes it to a file on HDFS (see the sketch below).
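A bare-bones sketch of those two steps, collapsed into one class for brevity; the connection details, query, and HDFS path are placeholders, and CSV quoting/escaping and error handling are omitted:

import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class JdbcToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        try (Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
             Statement stmt = db.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM my_table");
             FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.create(new Path("/data/my_table/part-00000.csv"))) {

            ResultSetMetaData meta = rs.getMetaData();
            int columns = meta.getColumnCount();
            StringBuilder line = new StringBuilder();
            while (rs.next()) {
                line.setLength(0);
                for (int i = 1; i <= columns; i++) {
                    if (i > 1) line.append(',');
                    line.append(rs.getString(i)); // naive: no quoting/escaping of commas
                }
                line.append('\n');
                out.write(line.toString().getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}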
Another way of doing this:
Create a MapReduce job that uses DBInputFormat as the input format (passing the desired number of input splits) and TextOutputFormat writing to an output directory on HDFS, as sketched below.
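A rough sketch of such a job, assuming a PostgreSQL source; the table, columns, record class, and paths are placeholders, and the JDBC driver jar must be available on the cluster classpath:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DbToHdfsJob {

    // One row of the source table; must implement both Writable and DBWritable.
    public static class UserRecord implements Writable, DBWritable {
        long id;
        String name;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            name = rs.getString("name");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, name);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            name = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(name);
        }
    }

    public static class ExportMapper extends Mapper<LongWritable, UserRecord, Text, NullWritable> {
        protected void map(LongWritable key, UserRecord row, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(row.id + "," + row.name), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
        conf.setInt("mapreduce.job.maps", 4); // number of input splits / parallel fetches

        Job job = Job.getInstance(conf, "db-to-hdfs");
        job.setJarByClass(DbToHdfsJob.class);
        job.setMapperClass(ExportMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setInputFormatClass(DBInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        DBInputFormat.setInput(job, UserRecord.class, "users", null, "id", "id", "name");
        FileOutputFormat.setOutputPath(job, new Path("/data/users"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}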
Please visit https://bigdatatamer.blogspot.com/ for any Hadoop- and Spark-related questions.
Thanks
Sainagaraju Vaduka
You are better off using Sqoop, because if you go down the path of building it yourself you may end up reimplementing exactly what Sqoop already does.
Either way, conceptually, you will need a custom mapper with a custom input format that is able to read partitioned data from the source. In this case, a table column on which the data can be partitioned is required to exploit parallelism. A partitioned source table would be ideal.
DBInputFormat doesn't optimise the calls on the source database: the complete dataset is sliced into a configured number of splits by the InputFormat.
Each mapper executes the same query and loads only the portion of the data corresponding to its split. This means every mapper issues the same query, along with a sort of the dataset, just so it can pick out its own portion of the data.
This class doesn't take advantage of a partitioned source table; you can extend it to handle partitioned tables more efficiently.
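For what it's worth, Hadoop also ships DataDrivenDBInputFormat in the same org.apache.hadoop.mapreduce.lib.db package; it computes non-overlapping ranges over a split-by column, so each mapper runs a bounded query instead of re-sorting the whole table. A rough sketch (connection details, table, columns, and the DBWritable record class are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.db.DataDrivenDBInputFormat;

public class PartitionedImportJob {
    public static Job configure(Configuration conf,
                                Class<? extends DBWritable> recordClass) throws Exception {
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
        Job job = Job.getInstance(conf, "partitioned-import");
        job.setInputFormatClass(DataDrivenDBInputFormat.class);
        // The fifth argument is the split-by column: the framework runs a MIN/MAX bounding
        // query on it and gives every mapper its own contiguous range of that column.
        DataDrivenDBInputFormat.setInput(job, recordClass, "users", null, "id", "id", "name");
        return job;
    }
}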
Hadoop has structured file formats like Avro, ORC, and Parquet to begin with.
If your data doesn't need to be stored in a columnar format (used primarily for OLAP use cases where only a few columns out of a large set need to be selected), go with Avro.
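For reference, writing an Avro data file from Java takes only a few lines with the org.apache.avro classes; a minimal sketch (the schema and field names are made up, and the local File could just as easily be an HDFS output stream):

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"long\"},"
              + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1L);
        user.put("name", "alice");

        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro")); // row-oriented, splittable, schema embedded
            writer.append(user);
        }
    }
}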
The way you are trying to do it is not a good one, because you will waste a lot of time developing and testing the code. Instead, use Sqoop to import the data from any RDBMS into Hive. The first tool that should come to mind here is Sqoop (SQL-to-Hadoop).
I'm dealing with a somewhat unusual use case where the data is stored in WARC files.
https://en.wikipedia.org/wiki/Web_ARChive
I want to import the data into Neo4j.
One solution I can think of is to parse the WARC file (some Java code to read it), then write the structured data out as CSV so that it can be loaded using an import tool.
Is extracting to CSV the only option for loading data into Neo4j?
Could you give me some advice on how to go about implementing this use case?
Thanks,
Phaneendra
It depends.
It depends on what data you want to load from the web archive. If you're talking about loading the metadata, then you do not need the intermediate step: process the file and insert the data straight into the database. You could use a stored procedure for that (the APOC library is full of similar things) or a small application using your favourite language plus the driver.
If you're talking about the content inside the web archive, it's a different story. Neo4j is not a blob/document store, so you would have to extract and interpret the archived files; that would probably be more efficient as a separate, indirect process.
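For the metadata route, a minimal sketch with the official Neo4j Java driver might look like the following; the bolt URL, credentials, Cypher properties, and the WarcRecordMeta holder class are all placeholders, and it assumes the WARC records have already been parsed by your own code:

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class WarcMetadataLoader {

    // Hypothetical holder for whatever your WARC parser extracts per record.
    public static class WarcRecordMeta {
        public String url;
        public String contentType;
        public String date;
    }

    public static void load(Iterable<WarcRecordMeta> records) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                                                  AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            for (WarcRecordMeta r : records) {
                // One MERGE per record keeps the example simple; batching would be faster.
                session.run("MERGE (p:Page {url: $url}) "
                          + "SET p.contentType = $contentType, p.capturedAt = $date",
                        Values.parameters("url", r.url,
                                          "contentType", r.contentType,
                                          "date", r.date));
            }
        }
    }
}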
Hope this helps,
Tom
BTW, CSV is not the only format that can be loaded; there are procedures for loading XML, JSON, and more.
What is the best way to create and populate Parquet files in HDFS using Java without the support of Hive or Impala libraries?
My goal is to write a simple CSV record (a String) to a Parquet file located in HDFS.
The previously asked questions/answers on this topic are confusing.
It seems like parquet-mr is the way to go. It provides implementations for Thrift and Avro. Your own implementation should be based on ParquetOutputFormat and might look similar to AvroParquetOutputFormat and AvroWriteSupport, which does the actual conversion.
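If you don't need the MapReduce OutputFormat machinery, AvroParquetWriter from parquet-avro can write the file directly; a minimal Avro-based sketch (the HDFS URI, schema, and CSV line are placeholders, and it assumes a parquet-mr release where the classes live under org.apache.parquet):

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class CsvToParquet {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"CsvRow\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"long\"},"
              + "{\"name\":\"name\",\"type\":\"string\"}]}");

        String csvLine = "1,alice";          // the "simple csv record (String)"
        String[] values = csvLine.split(","); // naive split; real CSV needs proper parsing

        GenericRecord record = new GenericData.Record(schema);
        record.put("id", Long.parseLong(values[0]));
        record.put("name", values[1]);

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("hdfs://namenode:8020/data/rows.parquet"))
                .withSchema(schema)
                .withConf(new Configuration())
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            writer.write(record);
        }
    }
}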
I am able to create an .mpx file using the MPXJ library in Java.
I need to write (create) an .mpp file in Java; can anyone offer a suggestion, please?
I maintain MPXJ, and the short answer to your enquiry is that, at present, MPXJ does not write MPP files.
The main reason for this is simply that, despite the effort which has gone into understanding the MPP file structure, there is still a great deal of it which is not well understood, hence it is difficult to generate reliably. The other issue is that even if I were to produce some code which could generate an MPP file, the features it could write to that file would likely lag behind what MPXJ supports in the MSPDI file format, again due to my incomplete understanding of the MPP format.
My suspicion is that the next version of MS Project (Project 15? Project 2013?) may well offer an ".mppx" file format, similar to the ".docx" etc. formats used by other applications in the MS Office suite. This will be XML-based and will be more straightforward to generate than the binary MPP file format currently is... let's see what Microsoft comes up with!
Jon
Visit http://www.mpxj.org/faq/
Can I use MPXJ to write MPP files?
Not at present. Although it is technically feasible to generate an MPP file, the knowledge we have of the file structure is still relatively incomplete, despite the amount of data we are able to correctly extract. It is therefore likely to take a considerable amount of development effort to make this work, and it is conceivable that we will not be able to write the full set of attributes that MPXJ supports back into the MPP file - simply because we don't understand the format well enough. You are therefore probably better off using MSPDI, which does support the full range of data items present in an MPP file.
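If MSPDI output works for your use case, a minimal sketch of writing it with MPXJ might look like this (the task fields and output file name are placeholders, and the class names assume the net.sf.mpxj package layout):

import net.sf.mpxj.Duration;
import net.sf.mpxj.ProjectFile;
import net.sf.mpxj.Task;
import net.sf.mpxj.TimeUnit;
import net.sf.mpxj.mspdi.MSPDIWriter;

public class WriteMspdi {
    public static void main(String[] args) throws Exception {
        ProjectFile project = new ProjectFile();

        Task task = project.addTask();
        task.setName("Design");
        task.setDuration(Duration.getInstance(5, TimeUnit.DAYS));

        // MSPDI is the XML interchange format that MS Project itself can open and save.
        new MSPDIWriter().write(project, "example.xml");
    }
}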
You can.
Try this: http://www.aspose.com/java/project-management-component.aspx
It writes MPP and Microsoft Project XML.
But it is not free.
I think by "mpp" you probably mean "Microsoft PowerPoint", correct?
Q: Why do you think MPXJ (Microsoft Project Exchange/Java) can't do this?
http://www.mpxj.org/
Welcome to MPXJ! This library provides a set of facilities to allow project information to be manipulated in Java and .Net. MPXJ supports a range of data formats: Microsoft Project Exchange (MPX), Microsoft Project (MPP, MPT), Microsoft Project Data Interchange (MSPDI XML), Microsoft Project Database (MPD), Planner (XML), Primavera (PM XML, XER, and database), and Asta Powerproject (PP, MDB).
I'm looking for tools/libraries which allow fast (easy) data import into existing database tables. For example, phpMyAdmin allows data import from .csv, .xml, etc. In Hadoop, Hue via Beeswax for Hive lets us create a table from a file. I'm looking for tools I can use with PostgreSQL, or libraries which make such things fast and easy; I want to avoid hand-coding the whole path from reading the file to inserting into the DB via JDBC.
You can do all that with standard tools in PostgreSQL, without additional libraries.
For .csv files you can use the built-in COPY command. COPY is fast and simple, but the source file has to live on the same machine as the database. If it doesn't, you can use the very similar \copy meta-command of psql.
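If the file lives on the application side rather than on the database server, the stock PostgreSQL JDBC driver can also drive the same COPY protocol from Java through org.postgresql.copy.CopyManager; a small sketch (connection details, table name, and file path are placeholders):

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvCopy {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             FileReader csv = new FileReader("/tmp/data.csv")) {
            // Streams the client-side file through COPY, like \copy does in psql.
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            long rows = copy.copyIn("COPY my_table FROM STDIN WITH CSV HEADER", csv);
            System.out.println("Imported rows: " + rows);
        }
    }
}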
For .xml files (or any format, really) you can use the built-in pg_read_file() inside a PL/pgSQL function. However, I quote:
Only files within the database cluster directory and the log_directory
can be accessed.
So you have to put your source file there or create a symbolic link to your actual file/directory. Then you can parse it with unnest() and xpath() and friends. You need at least PostgreSQL 8.4 for that.
A kick start on parsing XML in this blog post by Scott Bailey.