I have a folder/stream of different complex XML files (each ~1 GB in size). I know how to load an XML file's data into a Hive table (or any Hadoop database).
But I want to know two things:
Can I load each XML file's data into Hive dynamically, i.e. without explicitly writing a create table command (because I receive different XML files as a stream)? Is there any way to do this automatically?
"Stream of different complex XML files --> load into Hive tables (without manually writing a create table command) --> use the data that has been loaded into the Hive tables"
Instead of writing command-line scripts to create Hive tables, how can I write Java code to load XML data into a Hive table?
Regarding your first question: AFAIK, it is not possible. Hive is intended to manage data stored in Hive tables (the data is not always physically stored within the tables; in the case of Hive external tables, metadata is added to the tables pointing to the real data).
The only thing I think you can try is to create a single big table for all the data within your XML files, both the ones already stored and the future ones; the trick is to put all the XML files under a common HDFS folder that is used as the location of the create table command, as sketched below.
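The idea in HiveQL, using the same placeholders as the code below (the HDFS path is just an example):

create external table <table_name> (<list_of_columns>)
row format serde '<your_xml_serde>'
location '/user/hive/xml_incoming/';

Because the table is external and points at the folder, any new XML file dropped into /user/hive/xml_incoming/ becomes visible to queries on the table without another create table command.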
Regarding your second question, please refer to this code:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public final class HiveBasicClient {
private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
private static Connection con;
private static Connection getConnection(String hiveServer, String hivePort, String hadoopUser, String hadoopPassword) {
try {
Class.forName(driverName);
} catch (ClassNotFoundException e) {
return null;
}
try {
return DriverManager.getConnection("jdbc:hive://" + hiveServer + ":" + hivePort + "/default?user=" + hadoopUser + "&password=" + hadoopPassword);
} catch (SQLException e) {
return null;
}
}
private static ResultSet doQuery(String query) {
try {
Statement stmt = con.createStatement();
// the caller is responsible for closing the ResultSet and its Statement
return stmt.executeQuery(query);
} catch (SQLException ex) {
System.exit(0);
return null; // unreachable, but the compiler requires a return here
}
}
public static void main(String[] args) {
String hiveServer = args[0];
String hivePort = args[1];
String hadoopUser = args[2];
String hadoopPassword = args[3];
con = getConnection(hiveServer, hivePort, hadoopUser, hadoopPassword);
doQuery("create external table <table_name> (<list_of_columns>) row format serde '<your_xml_serde>' location `<your_xml_files_location>');
}
}
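For reference, you would run the client above with the connection details as arguments; the host, port, and credentials here are placeholders:

java HiveBasicClient my-hive-server 10000 hadoop_user hadoop_password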
Hope it helps.
Related
I have one jar file; inside it there is a lib folder which contains all the jar files that we mention in the pom file.
The question is: how are the resources of all the external jars (the pom file jars) read?
Example: example.jar has dependencies in its lib folder, file1.jar and file2.jar; I want to read the resources of both file1.jar and file2.jar.
How does the JVM load all these resources?
This is a very unusual situation. A much better approach, and the one usually used, is "flattening" the jars: instead of having dependent jars in a folder inside an "outer" jar, all the packages from the dependent jars become packages of the outer jar, residing next to your own code (which is probably in the outer jar anyway).
Maven has the shade plugin for this, and this is usually the way to go.
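A minimal sketch of enabling it in the pom; the plugin coordinates are the real ones, but the version and the rest of the build section are omitted:

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>

Running mvn package then produces a single jar whose packages include those of all the dependencies.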
One noticeable exception is Spring Boot applications packaged as JARs, which work just as you've described (they put the dependent jars into a BOOT-INF/lib folder, so technically it is jars inside a jar).
They have their own reasons to work like this, which are well beyond the scope of this question, but the relevant point is that they had to create a special class loader to handle this situation. Out of the box, Java can read classes from the filesystem or from a regular jar, but in theory a Java application can read binary code from any place (a remote filesystem, a database, a jar inside a jar, whatever) as long as you implement a class loader that can find and load the resources from there.
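For the plain filesystem case, the JDK's own URLClassLoader already does this; a minimal sketch (the jar path and resource name are made up for illustration):

import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;

public class JarResourceDemo {
    public static void main(String[] args) throws Exception {
        // load classes and resources from a jar on the local filesystem
        URL jarUrl = new URL("file:///tmp/file1.jar");
        try (URLClassLoader loader = new URLClassLoader(new URL[] { jarUrl })) {
            // look up a resource inside that jar by its path
            try (InputStream in = loader.getResourceAsStream("config/app.properties")) {
                System.out.println(in == null ? "resource not found" : "resource found");
            }
        }
    }
}

Reading a jar nested inside another jar is precisely what this stock loader cannot do, which is why Spring Boot ships its own.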
In general I would recommend not messing with class loaders, which are a pretty advanced concept, unless you really know what you're doing. Most Java programmers never create their own class loaders.
venkateswararao yeluru, please follow the code below to read data (a standard JDBC example):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class FirstExample {
// JDBC driver name and database URL
static final String JDBC_DRIVER = "com.mysql.jdbc.Driver";
static final String DB_URL = "jdbc:mysql://localhost/EMP";
// Database credentials
static final String USER = "username";
static final String PASS = "password";
public static void main(String[] args) {
Connection conn = null;
Statement stmt = null;
try {
// STEP 2: Register JDBC driver
Class.forName(JDBC_DRIVER);
// STEP 3: Open a connection
System.out.println("Connecting to database...");
conn = DriverManager.getConnection(DB_URL, USER, PASS);
// STEP 4: Execute a query
System.out.println("Creating statement...");
stmt = conn.createStatement();
String sql;
sql = "SELECT id, first, last, age FROM Employees";
ResultSet rs = stmt.executeQuery(sql);
// STEP 5: Extract data from result set
while (rs.next()) {
// Retrieve by column name
int id = rs.getInt("id");
int age = rs.getInt("age");
String first = rs.getString("first");
String last = rs.getString("last");
// Display values
System.out.print("ID: " + id);
System.out.print(", Age: " + age);
System.out.print(", First: " + first);
System.out.println(", Last: " + last);
}
// STEP 6: Clean-up environment
rs.close();
stmt.close();
conn.close();
} catch (SQLException se) {
// Handle errors for JDBC
se.printStackTrace();
} catch (Exception e) {
// Handle errors for Class.forName
e.printStackTrace();
} finally {
// finally block used to close resources
try {
if (stmt != null)
stmt.close();
} catch (SQLException se2) {
// nothing we can do
}
try {
if (conn != null)
conn.close();
} catch (SQLException se) {
se.printStackTrace();
} // end finally try
} // end try
System.out.println("Goodbye!");
}// end main
}// end FirstExample
I am trying to add records to a table in an HSQL database through Java.
I have an HSQL database I made through OpenOffice. I renamed the .odb file to .zip and extracted the SCRIPT and PROPERTIES files (it has no data in it at the moment) into a folder "\database" in my Java project folder.
The table looks like this in the SCRIPT file
CREATE CACHED TABLE PUBLIC."Season"("SeasonID" INTEGER GENERATED BY DEFAULT AS IDENTITY(START WITH 0) NOT NULL PRIMARY KEY,"Year" VARCHAR(50))
All fine so far, the database connects just fine in Java with this code:
public void connect(){
try{
String dbName = "database\\db";
con = DriverManager.getConnection("jdbc:hsqldb:file:" + dbName, // filenames prefix
"sa", // user
""); // pass
}catch (Exception e){
e.printStackTrace();
}
}
I have the following code to insert a record into "Season".
public void addSeason(String year){
int result = 0;
try {
stmt = con.createStatement();
result = stmt.executeUpdate("INSERT INTO \"Season\"(\"Year\") VALUES ('" + year + "')");
con.commit();
stmt.close();
}catch (Exception e) {
e.printStackTrace();
}
System.out.println(result + " rows affected");
}
I have a final function called printTables():
private void printTables(){
try {
stmt = con.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM \"Season\"");
System.out.println("SeasonID\tYear");
while(rs.next()){
System.out.println(rs.getInt("SeasonID") + "\t\t" + rs.getString("Year"));
}
}catch (Exception e) {
e.printStackTrace(System.out);
}
}
Now if I run this sequence of functions:
connect();
printTables();
addSeason("2010");
printTables();
I get this output:
SeasonID Year
1 rows affected
SeasonID Year
0 2010
Now when I close the program and start it again I get exactly the same output. So the change made during the first run hasn't been saved to the database. Is there something I'm missing?
It's caused by the write delay parameter in HSQLDB; by default there is a 500 ms delay between a change in memory and its sync to the files on disk.
So the problem is solved when it's set to false:
statement.execute("SET FILES WRITE DELAY FALSE");
or set it as you like based on your app's behaviour.
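The same setting can also be supplied as a connection property in the JDBC URL; a sketch using this question's file path (hsqldb.write_delay is a documented HSQLDB connection property):

con = DriverManager.getConnection(
        "jdbc:hsqldb:file:database\\db;hsqldb.write_delay=false", // same path as in connect()
        "sa", "");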
So my workaround is to close the connection after every update, then open a new connection any time I want to do something else.
This is pretty unsatisfactory, and I'm sure it will cause problems later on if I want to perform queries mid-update. It also wastes time.
If I could find a way to ensure that con.close() was called whenever the program was killed, that would be fine...
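For a normal JVM exit (though not a forced kill), a shutdown hook can arrange exactly that; a minimal sketch, assuming con is the connection opened in connect():

Runtime.getRuntime().addShutdownHook(new Thread() {
    @Override
    public void run() {
        try {
            if (con != null) {
                con.close(); // give the database a chance to flush pending changes
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
});

HSQLDB also supports an explicit SHUTDOWN SQL statement that closes the database files cleanly, and a ;shutdown=true connection property in the URL that shuts the database down when the last connection closes.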
One Hive table, t_event, is in the demo_read database. The table has more than 100,000 records. How do I read the records through the Java API?
You can use the Hive JDBC driver to connect to Hive tables. It's okay for testing or a POC with the code below, but I recommend moving your end tables to HBase (check Phoenix), MongoDB, or some sort of relational table with low latency.
You could also use dynamic partitions or some sort of clustering technique in Hive for better performance. You can use the following code; I haven't tested it (use it as a sample).
import java.sql.*;
public class HiveDB {
public static final String HIVE_JDBC_DRIVER = "org.apache.hadoop.hive.jdbc.HiveDriver";
public static final String HIVE_JDBC_EMBEDDED_CONNECTION = "jdbc:hive://";
private static String driverName = "org.apache.hadoop.hive.jdbc.HiveDriver";
private Statement getConnection() throws ClassNotFoundException,
SQLException {
Class.forName(HIVE_JDBC_DRIVER);
Connection connection = DriverManager.getConnection(
HIVE_JDBC_EMBEDDED_CONNECTION, "", "");
Statement statement = connection.createStatement();
return statement;
}
public static void main(String[] args) {
HiveDB hiveDB = new HiveDB();
try {
Statement statement = hiveDB.getConnection();
//print each row
ResultSet resultSet = statement.executeQuery("select * from demo_read.t_event");
int columns = resultSet.getMetaData().getColumnCount();
int rows = 0;
while (resultSet.next()) {
for (int i = 0; i < columns; ++i) {
System.out.print(resultSet.getString(i + 1) + " ");
}
System.out.println();
if (++rows == 100) break; // print up to the first 100 rows
}
statement.close(); //close statement
} catch (ClassNotFoundException e) {
e.printStackTrace(); // Hive JDBC driver not on the classpath
} catch (SQLException e) {
e.printStackTrace();
}
}
}
Well, actually you don't want to read all that data. You need to transform it and load it into some database, or (if the data is relatively small) export it to a common format (CSV, JSON, etc.).
You could transform the data with the Hive CLI, WebHCat, or the JDBC Hive driver, for example as sketched below.
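For instance, a HiveQL export of the table to files in HDFS (the output path below is just an assumption; INSERT OVERWRITE DIRECTORY is standard HiveQL):

-- writes the query result as delimited files under the given HDFS directory
INSERT OVERWRITE DIRECTORY '/tmp/t_event_export'
SELECT * FROM demo_read.t_event;

The files use Hive's default field delimiter, so a downstream consumer has to account for that.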
I am planning to use the Datastax Java driver for writing to Cassandra. I am mainly interested in the batch writes and asynchronous features of the Datastax Java driver, but I am not able to find any tutorials explaining how to incorporate these features into my code below, which uses the Datastax Java driver.
/**
* Performs an upsert of the specified attributes for the specified id.
*/
public void upsertAttributes(final String userId, final Map<String, String> attributes, final String columnFamily) {
try {
// build the CQL statement here from the input parameters
String sql = sqlPart1.toString()+sqlPart2.toString();
DatastaxConnection.getInstance();
PreparedStatement prepStatement = DatastaxConnection.getSession().prepare(sql);
prepStatement.setConsistencyLevel(ConsistencyLevel.ONE);
BoundStatement query = prepStatement.bind(userId, attributes.values().toArray(new Object[attributes.size()]));
DatastaxConnection.getSession().execute(query);
} catch (InvalidQueryException e) {
LOG.error("Invalid Query Exception in DatastaxClient::upsertAttributes "+e);
} catch (Exception e) {
LOG.error("Exception in DatastaxClient::upsertAttributes "+e);
}
}
In the below code, I am creating a Connection to Cassandra nodes using Datastax Java driver.
/**
* Creating Cassandra connection using Datastax Java driver
*
*/
private DatastaxConnection() {
try{
builder = Cluster.builder();
builder.addContactPoint("some_nodes");
builder.poolingOptions().setCoreConnectionsPerHost(
HostDistance.LOCAL,
builder.poolingOptions().getMaxConnectionsPerHost(HostDistance.LOCAL));
cluster = builder
.withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
.withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
.build();
StringBuilder s = new StringBuilder();
Set<Host> allHosts = cluster.getMetadata().getAllHosts();
for (Host h : allHosts) {
s.append("[");
s.append(h.getDatacenter());
s.append(h.getRack());
s.append(h.getAddress());
s.append("]");
}
System.out.println("Cassandra Cluster: " + s.toString());
session = cluster.connect("testdatastaxks");
} catch (NoHostAvailableException e) {
e.printStackTrace();
throw new RuntimeException(e);
} catch (Exception e) {
e.printStackTrace(); // don't swallow unexpected errors silently
}
}
Can anybody help me with how to add batch writes or asynchronous features to my code above? Thanks for the help.
I am running Cassandra 1.2.9
For async it's as simple as using the executeAsync function:
...
DatastaxConnection.getSession().executeAsync(query);
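executeAsync returns a ResultSetFuture, so you can decide when (or whether) to block; a minimal sketch, where query is the bound statement from your code:

ResultSetFuture future = DatastaxConnection.getSession().executeAsync(query);
// do other work here while the write is in flight...
ResultSet rs = future.getUninterruptibly(); // blocks until the write completes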
For the batch, you need to build the query (I use strings because the compiler knows how to optimize string concatenation really well):
String cql = "BEGIN BATCH "
cql += "INSERT INTO test.prepared (id, col_1) VALUES (?,?); ";
cql += "INSERT INTO test.prepared (id, col_1) VALUES (?,?); ";
cql += "APPLY BATCH; "
DatastaxConnection.getInstance();
PreparedStatement prepStatement = DatastaxConnection.getSession().prepare(cql);
prepStatement.setConsistencyLevel(ConsistencyLevel.ONE);
// this is where you need to be careful
// bind expects a comma separated list of values for all the params (?) above
// so for the above batch we need to supply 4 params:
BoundStatement query = prepStatement.bind(userId, "col1_val", userId_2, "col1_val_2");
DatastaxConnection.getSession().execute(query);
On a side note, I think your binding of the statement might look something like this, assuming you change attributes to a list of maps where each map represents an update/insert inside the batch:
BoundStatement query = prepStatement.bind(userId,
attributesList.get(0).values().toArray(new Object[attributes.size()]),
userId_2,
attributesList.get(1).values().toArray(new Object[attributes.size()]));
For the example provided in Lyuben's answer, setting certain attributes of a batch, such as Type.COUNTER (if you need to update counters), won't work using strings. Instead you can arrange your prepared statements in a batch like so:
final String insertQuery = "INSERT INTO test.prepared (id, col_1) VALUES (?,?);";
final PreparedStatement prepared = session.prepare(insertQuery);
final BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
batch.add(prepared.bind(userId1, "something"));
batch.add(prepared.bind(userId2, "another"));
batch.add(prepared.bind(userId3, "thing"));
session.executeAsync(batch);
I have made some code in ECLIPSE. As you can see, the username, password, database URL, and SQL queries are hardcoded into my program. I don't want to change my code and recompile it every time I change the database password or username, or modify the query. For that reason, I want to put these "parameters" in a file (text, XML, JSON?). I want to be able to change the parameters easily.
So what should I do? What kind of file should I use? I want to avoid XML and JSON because
you need to know both well, and they can become difficult to manage as the code grows.
import java.sql.*;
public class JDBCDemo {
static String query =
"SELECT customerno, name " +
"FROM customers;";
public static void main(String[]args)
{
String username = "cowboy";
String password = "123456";
try
{
Connection con = DriverManager.getConnection("jdbc:mysql://localhost:3306/business", username, password);
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(query);
}catch(SQLException ex){System.out.println("Exception! " + ex);}
}
}
You can use Properties and read the values from a .properties file.
Maybe this helps; it's an example of reading such a config file:
// needs: java.io.InputStream, java.io.IOException, java.util.Properties
InputStream inputStream = null;
Properties properties = new Properties();
try {
inputStream = JDBCDemo.class.getClassLoader().getResourceAsStream("jdbc.properties");
properties.load(inputStream);
}
finally {
try {
if (inputStream != null) {
inputStream.close();
}
}
catch (IOException e) {
// ignore
}
}
String username = properties.getProperty("username");
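For completeness, the jdbc.properties file referenced above is plain key=value text; the values here are placeholders taken from the question's code:

# jdbc.properties (on the classpath)
username=cowboy
password=123456
url=jdbc:mysql://localhost:3306/business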
XML and JSON are structured formats, so they need to be parsed.
A text file would be the simplest to implement, but because it is unstructured, the end user will need to know the internal format understood by your program; for instance, an extra space may break your program.
Ever think of using Spring injection? You can create one config file containing all the hardcoded variables, which can be put into your code through getters and setters. Then, if any of these variables need changing in the future, you only change one file.
http://www.springsource.org/javaconfig/
Generally, user IDs and passwords (rather, any resource that needs to be configured from outside your JAR/WAR) are stored in a properties file as key-value pairs. You can then access them using the ResourceBundle class: http://docs.oracle.com/javase/1.4.2/docs/api/java/util/ResourceBundle.html
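A minimal sketch of that approach, assuming a db.properties file on the classpath containing username and password keys:

import java.util.ResourceBundle;

public class DbConfig {
    public static void main(String[] args) {
        // loads db.properties from the classpath (note: no file extension here)
        ResourceBundle bundle = ResourceBundle.getBundle("db");
        String username = bundle.getString("username");
        String password = bundle.getString("password");
        System.out.println("user = " + username);
    }
}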
For SQL queries, you could use a database function or procedure that returns some kind of cursor, similar to SYS_REFCURSOR in Oracle, and call the function or procedure from the front end. By doing this you avoid hardcoding SQL queries in the Java code.
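A hedged sketch of calling such a function over JDBC; the function name get_customers is invented here, and OracleTypes comes from the Oracle JDBC driver:

// 'con' is an open java.sql.Connection to Oracle
CallableStatement cs = con.prepareCall("{ ? = call get_customers() }");
cs.registerOutParameter(1, oracle.jdbc.OracleTypes.CURSOR); // the SYS_REFCURSOR result
cs.execute();
ResultSet rs = (ResultSet) cs.getObject(1); // iterate the cursor like a normal ResultSet
while (rs.next()) {
    System.out.println(rs.getString(1));
}
rs.close();
cs.close();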
Regards