Fast and easy data import tools/libraries - java

I'm looking for tools/libraries which allow fast (easy) data import into existing database tables. For example, phpMyAdmin allows data import from .csv, .xml etc. In Hadoop Hue, via Beeswax for Hive, we can create a table from a file. I'm looking for tools I can use with PostgreSQL, or libraries which allow doing such things quickly and easily - I want to avoid coding it manually, from reading the file to inserting into the DB via JDBC.

You can do all that with standard tools in PostgreSQL, without additional libraries.
For .csv files you can use the built-in COPY command. COPY is fast and simple. The source file has to reside on the same machine as the database server for that. If not, you can use the very similar \copy meta-command of psql.
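If you want to drive COPY from Java rather than psql, the PostgreSQL JDBC driver exposes it through CopyManager. A minimal sketch (connection details, table and file path are made up):

```java
import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvCopy {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Reader csv = new FileReader("/path/to/data.csv")) {
            // CopyManager streams the file through the COPY protocol,
            // so the file does not need to live on the database server.
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            long rows = copy.copyIn(
                    "COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", csv);
            System.out.println("Imported " + rows + " rows");
        }
    }
}
```

Like \copy, this reads the file on the client side, so it works from any machine that can reach the database.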
For .xml files (or any format really) you can use the built-in pg_read_file() inside a plpgsql function. However, I quote:
Only files within the database cluster directory and the log_directory
can be accessed.
So you have to put your source file there or create a symbolic link to your actual file/directory. Then you can parse it with unnest() and xpath() and friends. You need at least PostgreSQL 8.4 for that.
For a kick start on parsing XML, see this blog post by Scott Bailey.

Related

How to make .jar file read database and serialised .txt file?

I'm trying to make an executable .jar file from a program that uses both an SQLite database and a serialized read/write file system for storage inside the program. We are able to make an executable .jar file, but it doesn't read from any database or file system that we have written within the program. Does anybody know how to do this?
"To make a .jar file read from a database": once you have a Java or Kotlin (or Scala) program which does that, you are almost done - you just have to compile and package it, which gives you a .jar file that does exactly that.
I think your question is what does it take to write a program that does it.
For that you have to think about what your database should be able to handle.
Read from filesystem
A text file on disk is very rudimentary, but it might suffice if the application just needs to read/write some persisted state. For that the Java API has all you need; take a look at Baeldung - read from file via NIO.
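A tiny sketch of that plain-file route using java.nio (the file name and the key written are made up):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class StateFile {
    public static void main(String[] args) throws IOException {
        Path state = Path.of("state.txt");
        // Read the existing state (if any), append a line, write it back.
        List<String> lines = Files.exists(state)
                ? new ArrayList<>(Files.readAllLines(state, StandardCharsets.UTF_8))
                : new ArrayList<>();
        lines.add("lastRun=" + System.currentTimeMillis());
        Files.write(state, lines, StandardCharsets.UTF_8);
    }
}
```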
Read from database
If you need transactions or support for multiple applications working with the persisted storage, a database would be the more sensible thing to use.
Although the Java API again provides everything you need for that (e.g. Processing SQL Statements with JDBC), you would need to reinvent the wheel in a sense. The easier thing would be to rely on an OR-mapper like Hibernate or jOOQ.
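Since the question mentions SQLite, here is a minimal plain-JDBC sketch, assuming the xerial sqlite-jdbc driver is on the classpath (database file and table are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class NotesDao {
    public static void main(String[] args) throws Exception {
        // The database file lives next to the jar; it is created on first use.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:app.db")) {
            try (Statement st = conn.createStatement()) {
                st.executeUpdate(
                    "CREATE TABLE IF NOT EXISTS notes(id INTEGER PRIMARY KEY, text TEXT)");
            }
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO notes(text) VALUES (?)")) {
                insert.setString(1, "hello from the jar");
                insert.executeUpdate();
            }
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, text FROM notes")) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + ": " + rs.getString("text"));
                }
            }
        }
    }
}
```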
For writing enterprise Java software you could even overengineer it with Spring Boot and Spring Data JPA.

Migrating data from PostgreSQL to HDFS using Java code

I am new to Hadoop and have been given the task of migrating structured data to HDFS using Java code. I know the same can be accomplished by Sqoop, but that is not my task.
Can someone please explain a possible way to do this?
I did attempt it: I copied data from the PostgreSQL server using the JDBC driver and then stored it in CSV format in HDFS. Is this the right way to go about it?
I have read that Hadoop has its own datatypes for storing structured data. Can you please explain how that works?
Thank you.
The state of the art is to use Sqoop (pull ETL) as a regular batch process to fetch the data from the RDBMS. However, this approach is resource-consuming for the RDBMS (Sqoop often runs multiple threads with multiple JDBC connections), takes a long time (you often run sequential fetches against the RDBMS), and can lead to data inconsistencies (the live RDBMS keeps being updated while this long Sqoop process is always lagging behind).
Alternative paradigms (push ETL) exist and are maturing. The idea is to build change data capture streams that listen to the RDBMS; an example project is Debezium. You can then build a real-time ETL pipeline that keeps the RDBMS and the data warehouse on Hadoop (or elsewhere) in sync.
Sqoop is a simple tool which performs the following:
1) Connects to the RDBMS (PostgreSQL), reads the table's metadata and generates a POJO (a Java class) for the table.
2) Uses that Java class to import and export data through a MapReduce program.
If you need to write plain Java code (where you control the parallelism yourself for performance), do the following (a rough sketch appears after this list):
1) Create a Java class which connects to the RDBMS using JDBC.
2) Create a Java class which accepts an input String (taken from the ResultSet) and writes it to a file on HDFS.
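A compressed sketch of those two classes in one place: read rows over JDBC and write them as CSV lines into an HDFS file via the Hadoop FileSystem API. Host names, the table and the output path are made up, and splitting the table for parallelism is left out:

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PgToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
             Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT id, name FROM users");
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     fs.create(new Path("/data/users/part-00000.csv")),
                     StandardCharsets.UTF_8))) {
            while (rs.next()) {
                // One CSV line per row; no escaping, so only for simple data.
                out.write(rs.getLong("id") + "," + rs.getString("name"));
                out.newLine();
            }
        }
    }
}
```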
Another way of doing this: create a MapReduce job using DBInputFormat, pass the desired number of input splits, and use TextOutputFormat with an HDFS output directory (see the sketch below).
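A minimal sketch of that map-only job, assuming a hypothetical "users" table with id and name columns; class names and connection details are made up:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class PgToHdfsJob {

    // DBInputFormat hands each row to the mapper as a DBWritable.
    public static class UserRecord implements Writable, DBWritable {
        long id;
        String name;

        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            name = rs.getString("name");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setString(2, name);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            name = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeUTF(name);
        }
    }

    public static class ToTextMapper
            extends Mapper<LongWritable, UserRecord, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, UserRecord row, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text(row.id + "\t" + row.name), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        DBConfiguration.configureDB(conf, "org.postgresql.Driver",
                "jdbc:postgresql://dbhost:5432/mydb", "user", "password");
        conf.setInt("mapreduce.job.maps", 4); // number of input splits / parallel fetches

        Job job = Job.getInstance(conf, "pg-to-hdfs");
        job.setJarByClass(PgToHdfsJob.class);
        job.setMapperClass(ToTextMapper.class);
        job.setNumReduceTasks(0); // map-only
        job.setInputFormatClass(DBInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // table, conditions, orderBy, then the column names to fetch
        DBInputFormat.setInput(job, UserRecord.class, "users", null, "id", "id", "name");
        FileOutputFormat.setOutputPath(job, new Path("/data/users"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```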
Please visit https://bigdatatamer.blogspot.com/ for any hadoop and spark related question.
Thanks
Sainagaraju Vaduka
You are better off using Sqoop, because you may end up doing exactly what Sqoop already does if you go down the path of building it yourself.
Either way, conceptually, you will need a custom mapper with a custom input format and the ability to read partitioned data from the source. In this case, a table column on which the data can be partitioned is required to exploit parallelism. A partitioned source table would be ideal.
DBInputFormat doesn't optimise the calls on the source database. The complete dataset is sliced into the configured number of splits by the InputFormat.
Each mapper then executes the same query and loads only the portion of the data corresponding to its split. This results in every mapper issuing the same query, along with a sort of the dataset, just so it can pick out its own portion of the data.
This class doesn't seem to take advantage of a partitioned source table. You can extend it to handle partitioned tables more efficiently.
Hadoop has structured file formats like Avro, ORC and Parquet to begin with.
If your data doesn't need to be stored in a columnar format (used primarily for OLAP use cases where only a few columns out of a large set of columns need to be selected), go with Avro.
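For completeness, a small sketch of writing records to an Avro data file with the Avro Java library; the schema and values are made up:

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteExample {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"long\"},"
              + "{\"name\":\"name\",\"type\":\"string\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("id", 1L);
        rec.put("name", "alice");

        // The schema is embedded in the file, so readers can interpret the rows later.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));
            writer.append(rec);
        }
    }
}
```

The same file could also be written straight to HDFS by passing an output stream obtained from the FileSystem API instead of a local File.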
The way you are trying to do it is not a good one, because you are going to waste a lot of time developing and testing the code. Instead, use Sqoop to import the data from any RDBMS into Hive. The first tool that should come to mind here is Sqoop (SQL-to-Hadoop).

Is there a way to insert data to Oracle using Direct Path loading for in-memory records?

New to Oracle here, but I have now read about the various bulk insert options for Oracle. In essence, true bulk loading is done using the Direct Path loading mechanism via SQL*Loader. There are also APPEND hint options that use serial or parallel Direct Path loading. But each of these has the following limitations -
SQL*Loader works off of a Control File, which contains the path of the data file. In my case, there is no file.
The APPEND hint option for INSERT can only be used with insert-into-select-from syntax. In my case, the source data is not in any table.
The source of my data is actually a Spark dataframe. I am looking for options to push this data in chunks to Oracle tables, but using the Direct Path loading option. For example, in Postgres, the PGConnection interface provides getCopyAPI().copyIn functionality and you can create a huge serialized blob that can be sent over as one big chunk using the COPY tableName FROM STDIN yourBlob command. I am unable to find any similar Java API for Oracle that works on in-memory records and is able to push data directly (without any insert statements).
Any ideas on how to achieve this? Anyone done this before?
In general, how do folks using Oracle and Spark push data to Oracle from a dataframe in an optimized way?
Thanks in advance!

importing data from WARC files (WebArchive)

I'm dealing with a not-so-normal use case where data is present in WARC files.
https://en.wikipedia.org/wiki/Web_ARChive
And I want to import the data into Neo4j.
One solution I can think of is to parse the WARC file (some Java code to read it), then write the structured data into CSV so that it can be loaded using some import tool.
Is extracting into CSV the only option to load data into Neo4j?
Could you give me some advice on how to go about implementing this use case?
Thanks,
Phaneendra
It depends.
It depends on what data you want to load from the Web Archive. If you're talking about loading the metadata ... then you do not need the intermediate step: process the file and insert the data straight into the database. You could use a stored procedure for that (the APOC library is full of similar things) or a small server application using your favorite language + driver.
If you're talking about the content inside the Web Archive, it's a different story. Neo4j is not a blob/document store so you would have to extract and interpret the archived files. That would probably be more efficient in an indirect process.
Hope this helps,
Tom
BTW csv is not the only format that can be loaded. There are procedures for loading xml, json, ...
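For the metadata route Tom describes, a minimal Java sketch using the official Neo4j Java driver (4.x API assumed); parseWarc is a hypothetical placeholder for whatever WARC parser you pick (e.g. jwarc), and the connection details are made up:

```java
import java.nio.file.Path;
import java.util.List;

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class WarcToNeo4j {

    record PageMeta(String url, String fetchedAt, long length) {}

    public static void main(String[] args) {
        List<PageMeta> pages = parseWarc(Path.of(args[0])); // hypothetical helper

        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                     AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {
            for (PageMeta p : pages) {
                // One MERGE per page keeps the import idempotent.
                session.run("MERGE (n:Page {url: $url}) "
                          + "SET n.fetchedAt = $fetchedAt, n.length = $length",
                        Values.parameters("url", p.url(),
                                          "fetchedAt", p.fetchedAt(),
                                          "length", p.length()));
            }
        }
    }

    // Placeholder: extract (url, timestamp, length) per WARC record with a real parser.
    static List<PageMeta> parseWarc(Path warcFile) {
        throw new UnsupportedOperationException("plug in a WARC parser here");
    }
}
```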

Using Hibernate to work with Text Files

I am using Hibernate in a Java application to access my database, and it works pretty well with MS SQL and MySQL. But some of the data I have to show on some forms has to come from text files, and by text files I mean human-readable files; they can be CSV, tab-delimited, or even one key/value pair per line since my data is that simple, but my preference of course is XML files.
My question is: can I use Hibernate to read those files using HQL, Query, EntityManager and all the other resources Hibernate gives me to access data? Which file format should I use, and how do I configure my persistence.xml file to recognize files as tables?
Hibernate is written against the JDBC API. So, you need a JDBC driver that works with the file format you are interested in. Obviously, even for read-only access, this isn't going to perform well, but it might still be useful if that's not a high priority. On a Windows system, you can set up ODBC datasources for delimited text files, Excel files, etc. Then you can set up the JdbcOdbcDriver in your Java application to use this data source.
For most of the applications I work on, I would not consider this approach; I would use an import/export mechanism to convert from a real database (even if it's an in-process database like Berkeley DB or Derby) to the text files. Yes, it's an extra step, but it could be automated, and the performance isn't likely to be much worse than trying to use the text files directly (it will likely be much better, overall), and it will be more robust and easy to develop.
A quick google came up with
JDBC driver for csv files
JDBC driver for XML files
Hope this might provide some inspiration?
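To make the "JDBC driver for csv files" idea concrete, a rough sketch assuming the CsvJdbc driver (org.relique.jdbc.csv.CsvDriver) is on the classpath; the directory and column names are made up:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class CsvAsTable {
    public static void main(String[] args) throws Exception {
        // Explicitly register the driver in case it is not auto-loaded.
        Class.forName("org.relique.jdbc.csv.CsvDriver");
        // Each .csv file in the directory is exposed as a table named after the file,
        // so settings.csv becomes the "settings" table below.
        try (Connection conn =
                     DriverManager.getConnection("jdbc:relique:csv:/path/to/csv-dir");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT name, value FROM settings")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " = " + rs.getString("value"));
            }
        }
    }
}
```

Once a driver like this works over plain JDBC, pointing Hibernate at it is mostly a matter of the connection URL, driver class and a suitable (very simple) dialect in persistence.xml, as noted elsewhere in this thread.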
Like erickson said, your only hope is finding a JDBC driver for that task. There is maybe xlsql (CSV, XML and Excel driver) which could fit the task. After that, you just have to either find or write the simplest Hibernate Dialect which fits your driver.
