I have a requirement to build a system with Java and Hadoop to handle large-scale data processing (in JSON format). The system I am going to create includes inserting data into file storage (whether HDFS or a database) and querying the processed data.
My rough plan is to use Hadoop MapReduce to query the data that the user wants.
But one thing that confuses me is how I should insert the data. Should I use HDFS and insert the files using Java with the Hadoop API? Or is it better to use another tool (e.g. HBase, a relational database, a NoSQL database) to insert the data, so that Hadoop MapReduce takes its input from that tool?
Please advise.
Thank you very much
I would suggest the HDFS/Hive/JsonSerDe approach.
The solution outline would look like this:
Store your JSON data on HDFS.
Create external tables using Hive and use JsonSerDe to map the JSON data to the columns of your table.
Query your data using HiveQL.
In the above solution, since Hive is schema-on-read, your JSON data will be parsed every time you query the tables.
But if you have data arriving in batches (weekly, monthly), it would be good to parse the data once and create a staging table, which can then be used for frequent querying to avoid repetitive parsing by the SerDe.
I have an example created at: Hadoopgig
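To make the outline concrete, here is a minimal sketch of it over Hive JDBC, assuming HiveServer2 is reachable and the hive-hcatalog JsonSerDe is available; the table, columns, HDFS path and connection details are just placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JsonOnHive {
    public static void main(String[] args) throws Exception {
        // Placeholder HiveServer2 connection details
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // External table over JSON files already stored on HDFS,
            // mapped to columns by the JsonSerDe
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS events_json (" +
                "  id STRING, event_type STRING, amount DOUBLE) " +
                "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' " +
                "STORED AS TEXTFILE " +
                "LOCATION '/data/events/json'");

            // Query the JSON data with HiveQL
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT event_type, count(*) FROM events_json GROUP BY event_type")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                }
            }
        }
    }
}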
I have a Spring Boot project which will pull a large amount of data from one database, do some kind of transformation on it, and then insert it into a table in a PostgreSQL database. This process will continue for a few billion records, so performance is key.
I've been researching the best way to do this, such as using an ORM or a JdbcTemplate, for example. One thing I keep seeing regarding bulk inserts into PostgreSQL is the COPY command: https://www.postgresql.org/docs/current/populate.html
I'm confused because using COPY requires the data to be written to a file, and while I've seen people recommend it, I've yet to come across a case where someone mentions how to get the data into the file. Isn't writing to a file slow? And if writing to a file is slow, doesn't that cancel out whatever performance gain COPY brings?
This kind of data migration and conversion is better handled in stored procedures. Assuming the source data is already loaded into Postgres (if not, use a Postgres utility to load the raw data into some flat staging table), write a series of stored procedures to transform the data and insert it into the destination table.
I have done some complex data migrations and used this approach. If you have to do a lot of complex data conversion, write a Python script (which is usually faster to set up than Spring Boot/Spring Data), insert the partially converted data, then use stored procedures for the final conversion.
It is better to keep the business logic that converts/massages the data close to the data source (in stored procedures) instead of pulling the data to the app server and reinserting it.
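For illustration, a minimal sketch of driving such stored procedures in chunks from the Java side over plain JDBC; the function name transform_and_load_chunk, its id-range parameters and the connection details are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class ChunkedTransform {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details for the target PostgreSQL database
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/target", "user", "password")) {
            long chunkSize = 100_000L;
            long maxId = 1_000_000L;   // in practice, read this from the staging table

            // Invoke the (hypothetical) transform function once per id range so each
            // call works on a bounded chunk and progress is easy to resume.
            try (PreparedStatement ps =
                    conn.prepareStatement("SELECT transform_and_load_chunk(?, ?)")) {
                for (long start = 0; start < maxId; start += chunkSize) {
                    ps.setLong(1, start);
                    ps.setLong(2, start + chunkSize);
                    ps.execute();
                }
            }
        }
    }
}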
Hope it helps.
What is the best approach for saving statistical data to a file using the Spring framework? Is there any available library that offers reading and updating the data in a file, or should I build my own I/O code?
I already have a relational database, but I don't like the approach of creating an additional table to save calculated values spread across multiple tables with joins, and I also don't want to add more complexity to the project by introducing an additional database such as MongoDB for just one task.
To understand the complexity of this report, imagine drawing a chart of the total number of daily transactions for a full year, over billions of records at any time, with a lot of extra information (totals and averages in different currencies at different rates).
So my approach was to generate that data into a file on a regular basis, so that later I don't need to generate it again when requested, only append the new dates to the file when available.
Is this approach fine? And what is the best library to do that efficiently?
Update
I found this answer useful for understanding why people sometimes prefer flat files over relational or non-relational databases:
Is it faster to access data from files or a database server?
I would prefer to use MongoDB for such purposes, but if you need a simple approach, you can write your data to a CSV/Excel file.
Just use plain I/O:
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// each string in the list becomes one line of the CSV file
List<String> data = new ArrayList<>();
data.add("head1;head2;head3");
data.add("a;b;c");
data.add("e;f;g");
data.add("9;h;i");
Files.write(Paths.get("my.csv"), data);
That is all)
How to convert your own object into such a string ('field1;field2') I think you already know.
You can also use Apache POI or a CSV library, but I think this plain approach is much faster.
If you want to append data to an existing file, pass an open option:
Files.write(Paths.get("my.csv"), data, StandardOpenOption.APPEND);
There are many other options in StandardOpenOption.
For reading, you can use Files.readAllLines(Paths.get("my.csv")); it returns a list of strings.
You can also read only a range of lines.
But if you need to retrieve a single column, update a couple of columns with a condition, and so on, you should look at MongoDB or another non-relational database. It is difficult to cover MongoDB here; you should read the documentation.
Enjoy)
I found a library that can be used to write/read CSV files easily and map them to objects as well: Jackson data formats.
Find an example with Spring.
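As a rough illustration, a minimal sketch with jackson-dataformat-csv; the StatRecord class, its fields and the file name are just placeholders:

import java.io.File;
import java.util.List;
import com.fasterxml.jackson.databind.MappingIterator;
import com.fasterxml.jackson.dataformat.csv.CsvMapper;
import com.fasterxml.jackson.dataformat.csv.CsvSchema;

public class CsvStats {
    // Hypothetical record holding one row of the report
    public static class StatRecord {
        public String date;
        public long total;
        public double average;
    }

    public static void main(String[] args) throws Exception {
        CsvMapper mapper = new CsvMapper();
        CsvSchema schema = mapper.schemaFor(StatRecord.class).withHeader();

        // Write a list of objects as CSV rows
        List<StatRecord> stats = List.of(new StatRecord());
        mapper.writer(schema).writeValue(new File("stats.csv"), stats);

        // Read the CSV back into objects
        try (MappingIterator<StatRecord> it =
                mapper.readerFor(StatRecord.class).with(schema).readValues(new File("stats.csv"))) {
            List<StatRecord> loaded = it.readAll();
            System.out.println(loaded.size() + " rows loaded");
        }
    }
}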
Is CSV the only option to speed up my bulk relationship creation?
I have read many articles on the internet, and they all talk about CSV. CSV will definitely give me a performance boost (could you estimate how big?), but I'm not sure I can store my data in CSV format. Any other options? How much will I gain from using the Neo4j 3 Bolt protocol?
My program
I'm using Neo4j 2.1.7. I am trying to create about 50,000 relationships at once. I execute queries in batches of 10,000, and it takes about 120-140 seconds to insert all 50,000.
My query looks like:
MATCH (n),(m)
WHERE id(n)=5948 and id(m)=8114
CREATE (n)-[r:MY_REL {
ID:"4611686018427387904",
TYPE: "MY_REL_1",
PROPERTY_1:"some_data_1",
PROPERTY_2:"some_data_2",
.........................
PROPERTY_14:"some_data_14"
}]->(m)
RETURN id(n),id(m),r
As it is written in the documentation:
Cypher supports querying with parameters. This means developers don’t
have to resort to string building to create a query. In addition to
that, it also makes caching of execution plans much easier for Cypher.
So you need to pack your data as parameters and pass them with the Cypher query:
UNWIND {rows} as row
MATCH (n),(m)
WHERE id(n)=row.nid and id(m)=row.mid
CREATE (n)-[r:MY_REL {
ID:row.relId,
TYPE:row.relType,
PROPERTY_1:row.someData_1,
PROPERTY_2:row.someData_2,
.........................
PROPERTY_14:row.someData_14
}]->(m)
RETURN id(n),id(m),r
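And if you do move to Neo4j 3 and the Bolt protocol, a minimal sketch of sending one such parameterized batch with the official 1.x Java driver might look like this; the URI, credentials and property names are placeholders:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;

public class BatchRelInsert {
    public static void main(String[] args) {
        // Placeholder connection details
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // Build the "rows" parameter: one map per relationship to create
            List<Map<String, Object>> rows = new ArrayList<>();
            Map<String, Object> row = new HashMap<>();
            row.put("nid", 5948L);
            row.put("mid", 8114L);
            row.put("relId", "4611686018427387904");
            row.put("relType", "MY_REL_1");
            row.put("someData_1", "some_data_1");
            rows.add(row);

            // Same query shape as above (newer Neo4j versions use $rows instead of {rows})
            String query =
                "UNWIND {rows} AS row " +
                "MATCH (n),(m) WHERE id(n)=row.nid AND id(m)=row.mid " +
                "CREATE (n)-[r:MY_REL {ID: row.relId, TYPE: row.relType, " +
                "PROPERTY_1: row.someData_1}]->(m) " +
                "RETURN id(n), id(m), r";

            Map<String, Object> params = new HashMap<>();
            params.put("rows", rows);
            session.run(query, params).consume();
        }
    }
}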
I am currently working on a Java MapReduce job, which should output data to a bucketed Hive table.
I am thinking of two approaches:
The first is to write directly to Hive via HCatalog. The problem is that this approach does not support writing to a bucketed Hive table. Hence, when using a bucketed Hive table, I would need to first write to a non-bucketed table and then copy the data into the bucketed one.
The second option is to write the output to a text file and load this data into Hive afterwards.
What is the best practice here?
Which approach is more performant with a huge amount of data (with respect to memory and time taken)?
Which approach would be the better one, if I could also use non-bucketed Hive tables?
Thanks a lot!
For non-bucketed tables, you can store your MapReduce output in the table storage location. Then you'd only need to run MSCK REPAIR TABLE to update the metadata with the new partitions.
Hive's load command actually just copies the data to the table storage location.
Also, from the Hive documentation:
The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
So you'd need to tweak your MapReduce output to fit these constraints.
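As a rough sketch of that tweak on the MapReduce side, the driver below sets one reducer per bucket and a partitioner keyed on the bucketing column; the simple hash only illustrates the idea, since for a real bucketed table the hash has to match Hive's own bucketing function for the column type. Class names and the bucket count are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class BucketedOutputJob {

    // Route each key to a reducer the same way the table is expected to be bucketed.
    public static class BucketPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    public static void main(String[] args) throws Exception {
        int numBuckets = 32; // must equal the CLUSTERED BY ... INTO n BUCKETS value
        Job job = Job.getInstance(new Configuration(), "write-bucketed-output");
        job.setJarByClass(BucketedOutputJob.class);
        // one reducer per bucket, as the Hive documentation quoted above requires
        job.setNumReduceTasks(numBuckets);
        job.setPartitionerClass(BucketPartitioner.class);
        // mapper/reducer/input/output settings omitted for brevity
    }
}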
I need some input and suggestions from you. I have a very huge database which has around 2000 records holding some information.
Is it good to have another database holding key-value pairs pointing to that huge database, or is an XML file enough?
Having 2000 records is not huge. And it's better to use SQLite for data operations rather than an XML file, because an XML file with 2000 pairs will make processing slow and waste resources. Better to use SQLite for such requirements.
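For illustration, a minimal key-value table in SQLite from plain Java, assuming the Xerial sqlite-jdbc driver is on the classpath; the database file, table and keys are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class KeyValueStore {
    public static void main(String[] args) throws Exception {
        // Opens (or creates) a local SQLite database file
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:lookup.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)");
            }

            // Insert or update one key-value pair
            try (PreparedStatement ps =
                    conn.prepareStatement("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)")) {
                ps.setString(1, "record-42");
                ps.setString(2, "pointer-to-main-db-row-42");
                ps.executeUpdate();
            }

            // Look a value up by key
            try (PreparedStatement ps = conn.prepareStatement("SELECT v FROM kv WHERE k = ?")) {
                ps.setString(1, "record-42");
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println(rs.getString("v"));
                    }
                }
            }
        }
    }
}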