I am trying to find the count of rows in all tables of a database on the source and the destination, the source being Greenplum and the destination being Hive (on HDFS).
To do the parallel processing, I have created two threads which call the methods that calculate the counts on both ends independently. The code can be seen below:
new Thread(new Runnable() {
@Override
public void run() {
try {
gpTableCount = getGpTableCount();
} catch (SQLException e) {
e.printStackTrace();
} catch (Exception e) {
e.printStackTrace();
}
}
}).start();
new Thread(new Runnable() {
@Override
public void run() {
try {
hiveTableCount = getHiveTableCount();
} catch (SQLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
} catch (Exception e) {
e.printStackTrace();
}
}
}).start();
while(!(gpTableCount != null && gpTableCount.size() > 0 && hiveTableCount != null && hiveTableCount.size() > 0)) {
Thread.sleep(5000);
}
The results of both threads are stored in two separate Java HashMaps.
Below is the method for calculating the GP counts. The method for calculating the Hive counts is the same except for the database name, hence I have only included one method.
public Map<String,String> getGpTableCount() throws SQLException {
Connection gpAnalyticsCon = (Connection) DbManager.getGpConnection();
while(keySetIterator_gpTableList.hasNext()) {
gpTabSchemakey = keySetIterator_gpTableList.next();
tablesnSSNs = gpTabSchemakey.split(",");
target = tablesnSSNs[1].split(":");
analyticsTable = target[0].split("\\.");
gpCountQuery = "select '" + analyticsTable[1] + "' as TableName, count(*) as Count, source_system_name, max(xx_last_update_tms) from " + tablesnSSNs[0] + " where source_system_name = '" + target[1] + "' group by source_system_name";
try {
gp_pstmnt = gpAnalyticsCon.prepareStatement(gpCountQuery);
ResultSet gpCountRs = gp_pstmnt.executeQuery();
while(gpCountRs.next()) {
System.out.println("Count: " + gpCountRs.getLong(2) + ", Max GP Tms: " + gpCountRs.getTimestamp(4).toString());
gpDataMap.put(gpCountRs.getString(1) + "," + gpCountRs.getString(3), gpCountRs.getLong(2) + "," + gpCountRs.getTimestamp(4).toString());
}
} catch(org.postgresql.util.PSQLException e) {
e.printStackTrace();
} catch(SQLException e) {
e.printStackTrace();
} catch(Exception e) {
e.printStackTrace();
}
}
System.out.println("GP Connection closed");
gp_pstmnt.close();
gpAnalyticsCon.close();
return gpDataMap;
}
Hive's Method:
public Map<String, String> getHiveTableCount() throws IOException, SQLException {
Connection hiveConnection = DbManager.getHiveConnection();
while(hiveIterator.hasNext()) {
gpHiveRec = hiveIterator.next();
hiveArray = gpHiveRec.split(",");
hiveDetails = hiveArray[1].split(":");
hiveTable = hiveDetails[0].split("\\.");
hiveQuery = "select '" + hiveTable[1] + "' as TableName, count(*) as Count, source_system_name, max(xx_last_update_tms) from " + hiveDetails[0] + " where source_system_name='" + hiveDetails[1] + "' group by source_system_name";
try {
hive_pstmnt = hiveConnection.prepareStatement(hiveQuery);
ResultSet hiveCountRs = hive_pstmnt.executeQuery();
while(hiveCountRs.next()) {
hiveDataMap.put(hiveCountRs.getString(1) + "," + hiveCountRs.getString(3), hiveCountRs.getLong(2) + "," + hiveCountRs.getTimestamp(4).toString());
}
} catch(HiveSQLException e) {
e.printStackTrace();
} catch(SQLException e) {
e.printStackTrace();
} catch(Exception e) {
e.printStackTrace();
}
}
return hiveDataMap;
}
When the jar is submitted, both threads are launched and the SQL queries for GP and Hive start executing simultaneously.
But the problem here is, as soon as the GP thread finishes executing getGpTableCount(), I see the print statement GP Connection closed, and the Hive thread hangs for at least 30 minutes before resuming.
I checked for locks on the Hive tables in case any were held, but there were none. After 30-40 minutes the Hive thread resumes and finishes. This happens even for a small number of tables (around 20) on Hive.
This is how I submit the jar:
/usr/jdk64/jdk1.8.0_112/bin/java -Xdebug -Dsun.security.krb5.debug=true -Djava.security.krb5.conf=/etc/krb5.conf -Djava.security.krb5.realm=PROD.COM -Djava.security.krb5.kdc=ip-xx-xxx-xxx-xxx.ec2.internal -Djavax.security.auth.useSubjectCredsOnly=false -jar /home/etl/ReconTest/ReconAuto_Test_Prod.jar
Could anyone let me know if there is any issue with the way I create the threads in this code and how I can fix it?
Assuming your gpTableCount and hiveTableCount are plain HashMaps, you're running into synchronization issues.
This is too broad a topic to explain fully here, but here's a short intro:
Since they are populated in different threads, your main thread does not 'see' these changes until the memory is synchronized. There's no guarantee when this happens (and it's best to assume it will never happen unless you force it).
To do this properly, either use thread-safe versions (see Collections.synchronizedMap or ConcurrentHashMap), or manually synchronize your checks on the same monitor (i.e. put the check itself in a synchronized method, and put the code that populates the map in a synchronized method, too). Alternatively, you could keep the counts themselves in two volatile ints and update those in the other two threads.
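For illustration, here is a minimal sketch of the thread-safe-map route, reusing the poster's getGpTableCount()/getHiveTableCount() methods. The CountDownLatch that replaces the sleep/poll loop is my own addition, not something the original code uses:
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

final Map<String, String> gpTableCount = new ConcurrentHashMap<>();
final Map<String, String> hiveTableCount = new ConcurrentHashMap<>();
final CountDownLatch done = new CountDownLatch(2);

new Thread(() -> {
    try {
        gpTableCount.putAll(getGpTableCount());     // poster's existing method
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        done.countDown();
    }
}).start();

new Thread(() -> {
    try {
        hiveTableCount.putAll(getHiveTableCount()); // poster's existing method
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        done.countDown();
    }
}).start();

try {
    done.await();   // blocks until both threads have finished, no polling loop needed
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}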
I have to do some custom preprocessing tasks on a huge data file (~200GB).
Currently, it works as follows:
select * from table
preprocessing line by line
return a new single flow file
So I decided to convert the above approach to the following:
get the row count from the user (let's assume the user gives 1000)
execute select * query as resultSet
read the results line by line (rs.next())
when the line count reaches 1000, return the flow file and continue with the remaining lines
So my approach is as below.
onTrigger
public void onTrigger(final ProcessContext context, final ProcessSession session) throws ProcessException {
logger = getLogger();
FlowFile flowFile = session.get();
if (flowFile == null) {
return;
}
try {
final Long rowLimit = context.getProperty(ProcessorUtils.MAX_RECORD).evaluateAttributeExpressions(flowFile).asLong();
Connection conn = DriverManager.getConnection(
// db connection properties
);
Statement stm = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
ResultSet rs = stm.executeQuery("sql query");
Map<String, String> flowFileAttributes = flowFile.getAttributes();
process(
rs,
session,
flowFileAttributes,
rowLimit
);
FlowFile stateFlowFile = session.create();
session.putAttribute(stateFlowFile, "processing_status", "end");
session.putAttribute(stateFlowFile, "record_count", "0");
session.transfer(stateFlowFile, GPReaderProcessorUtils.STATUS); // working line
} catch (Exception e) {
logger.warn(" conn " + e);
session.transfer(flowFile, GPReaderProcessorUtils.FAILURE);
}
}
Recursive approach for termination based on line count:
private void process(ResultSet rs, ProcessSession session, Map<String, String> flowFileAttributes, Long rowLimit) throws SQLException {
try{
logger.info("-> start processing with row limit = " + rowLimit);
AtomicInteger mainI = new AtomicInteger(0);
FlowFile flowFile =
session.write(session.putAllAttributes(session.create(), flowFileAttributes), (OutputStream out) -> {
int i = 0;
Map<String, String> preProcessResults = null;
try {
String res = "";
while (i < rowLimit && rs.next()) {
//preprocessing happens here
i++;
mainI.set(i);
out.write(preProcessResults.toString().getBytes(StandardCharsets.UTF_8));
}
}catch (SQLException e) {
e.printStackTrace();
}
logger.info("gp-log -> " + String.valueOf(i));
out.close();
});
FlowFile stateFlowFile = session.create();
session.putAttribute(stateFlowFile, "processing_status", "processing");
session.putAttribute(stateFlowFile, "record_count", mainI.toString());
session.transfer(stateFlowFile, GPReaderProcessorUtils.STATUS); // state relationship
session.transfer(flowFile, GPReaderProcessorUtils.SUCCESS); // preprocessed flow files returns
if(!rs.isAfterLast() && mainI.get() != 0 && !rs.isLast()){ // recursion call
logger.info("gp-log -> recursion call" );
process(rs, session, flowFileAttributes, rowLimit);
}
}catch (Exception e){
logger.info(e.getMessage());
logger.error(e.getMessage());
session.transfer(session.putAllAttributes(session.create(),flowFileAttributes), GPReaderProcessorUtils.FAILURE);
}
}
Expected behaviour -> while processing, return each completed chunk of rows as a flow file as soon as it is ready.
Current behaviour -> only after everything finishes are all the flow files (generated in the recursion) returned at once.
Please advise on this.
Your processor should extend AbstractSessionFactoryProcessor and create/commit sessions for the incoming file and for each outgoing file.
Files go to the output queue as soon as their session has been committed.
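To make that concrete, here is a rough sketch of a processor built on AbstractSessionFactoryProcessor that commits one session per chunk, so each flow file reaches the output queue immediately. Assumptions of mine: the same GPReaderProcessorUtils.SUCCESS relationship from the question, a hard-coded row limit, and placeholder JDBC details:
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractSessionFactoryProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.ProcessSessionFactory;
import org.apache.nifi.processor.exception.ProcessException;

public class GPChunkReader extends AbstractSessionFactoryProcessor {

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSessionFactory sessionFactory)
            throws ProcessException {
        final long rowLimit = 1000L; // assumption: would normally come from a processor property
        try (Connection conn = DriverManager.getConnection("jdbc:..."); // placeholder connection details
             Statement stm = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             ResultSet rs = stm.executeQuery("sql query")) {

            boolean more = rs.next();
            while (more) {
                // One session per chunk: the flow file leaves the processor as soon as
                // this session is committed, instead of when onTrigger() returns.
                final ProcessSession session = sessionFactory.createSession();
                try {
                    final boolean[] hasMore = {false};
                    FlowFile chunk = session.create();
                    chunk = session.write(chunk, out -> {
                        try {
                            long written = 0;
                            boolean next = true;
                            while (written < rowLimit && next) {
                                // preprocessing happens here (placeholder: write first column only)
                                out.write((rs.getString(1) + "\n").getBytes(StandardCharsets.UTF_8));
                                written++;
                                next = rs.next();
                            }
                            hasMore[0] = next;
                        } catch (java.sql.SQLException e) {
                            throw new ProcessException(e);
                        }
                    });
                    session.transfer(chunk, GPReaderProcessorUtils.SUCCESS); // poster's relationship
                    session.commit(); // downstream sees this chunk immediately
                    more = hasMore[0];
                } catch (final Exception e) {
                    session.rollback();
                    throw new ProcessException(e);
                }
            }
        } catch (final Exception e) {
            throw new ProcessException("chunked read failed", e);
        }
    }
}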
I am trying to check the job status of a MapReduce job.
When I run job.isComplete(), I get the exception
"Job in state DEFINE instead of RUNNING".
try {
if (job.isComplete()) {
printInfoLog(LOG, this.filename,
"** " + job.getTrackingURL());
break;
}
} catch (Exception e) {
LOG.warn("** " + e.getMessage());
}
But there is no such state, as far as I can tell from the fields in JobStatus (https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/JobStatus.html).
My sense is that the job has not yet been submitted. Can anyone please suggest how to check whether the job has been submitted or not, as I could not find any such method in the API?
I solved it as follows:
if ( job.getJobState() == JobStatus.State.RUNNING || job.getJobState() == JobStatus.State.SUCCEEDED || job.getJobState() == JobStatus.State.KILLED || job.getJobState() == JobStatus.State.FAILED)
{
try {
if (job.isComplete()) {
printInfoLog(LOG, this.filename,
"** " + job.getTrackingURL());
break;
}
} catch (Exception e) {
LOG.warn("** " + e.getMessage());
}
}
}
Although I do agree it's a crude solution.
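A lighter-weight alternative (my own sketch, assuming the same enclosing polling loop from the question): the "Job in state DEFINE instead of RUNNING" message comes from an IllegalStateException, so you can treat that exception as "not submitted yet" and keep polling instead of comparing states:
try {
    if (job.isComplete()) {
        printInfoLog(LOG, this.filename, "** " + job.getTrackingURL());
        break;
    }
} catch (IllegalStateException e) {
    // Thrown while the job is still in DEFINE (i.e. not yet submitted);
    // do nothing and wait for the next poll.
} catch (Exception e) {
    LOG.warn("** " + e.getMessage());
}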
I have a GUI-based application that takes in a file and displays it to the user in a table format, gets some input in the form of column annotations and a bunch of parameters. Then it parses the file accordingly and initiates an "analysis".
I just found a deadlock, one I have not encountered before.
Found one Java-level deadlock:
=============================
"RMI TCP Connection(5)-130.235.214.23":
waiting to lock monitor 0x00007fac650875e8 (object 0x0000000793267298, a java.util.logging.ConsoleHandler),
which is held by "AWT-EventQueue-0"
"AWT-EventQueue-0":
waiting to lock monitor 0x00007fac65086b98 (object 0x00000006c00dd8d0, a java.io.PrintStream),
which is held by "SwingWorker-pool-1-thread-3"
"SwingWorker-pool-1-thread-3":
waiting to lock monitor 0x00007fac65087538 (object 0x00000006c001db48, a java.awt.Component$AWTTreeLock),
which is held by "AWT-EventQueue-0"
Essentially there is a parsing error, and trying to log it hangs the application altogether. Interestingly, logging appears to work normally before and after that particular step.
Here's the part of the code that's relevant for the analysis task:
// Activate progress indicator
frame.getMainFrame().activateInfiGlass();
SwingWorker<Map<Analyte,AnalysisResult>, Void> worker = new SwingWorker<Map<Analyte,AnalysisResult>, Void>() {
@Override
protected Map<Analyte,AnalysisResult> doInBackground() {
try {
// register parameters
param.addParam(AnalysisParams.value_key,descPanel.getValueTypeComboIndex());
param.addParam(AnalysisParams.sepchar_key,descPanel.getSepCharComboIndex());
paramPanel.registerParams();
StringBuilder sb = new StringBuilder("Data preview completed, initiating analysis...");
sb.append(System.lineSeparator())
.append("... column annotations: ")
.append(Arrays.toString(annots));
logger.info(sb.toString() + System.lineSeparator());
// Create dataset; to be passed on to SwingWorker which will
// execute the analysis
ds = new Dataset();
String[] line;
for (int i=0; i < data.length; i++){
line = data[i];
// If ignore button is clicked, skip row..
if(!(Boolean) table.getValueAt(i, 0))
ds.addRow(line, annots); // <-- This step is where the parsing exception occurs
}
System.out.println("Dataset parsed...");
logger.info("Dataset parsing complete "
+ System.lineSeparator()
+ ds.toString()
+ System.lineSeparator());
visualizeDataset();
conserv = new ConcurrencyService(ds, dbMan);
conserv.serve();
} catch (InterruptedException e) {
logger.severe("Concurrency service interrupted"
+ System.lineSeparator()
+ DebugToolbox.getStackTraceAsString(e)
+ System.lineSeparator());
System.err.println("Interrupt exception!!");
}
return conserv.getAnalyzedPaths();
}
@Override
protected void done() {
try{
results = get();
visualizeResults();
}
catch (InterruptedException ignore) {}
catch (java.util.concurrent.ExecutionException e) {
String why = null;
Throwable cause = e.getCause();
if (cause != null) {
why = cause.getMessage();
} else {
why = e.getMessage();
}
System.err.println("Error analysing data: " + why);
} catch (SQLException e) {
e.printStackTrace();
}
logger.info("#DEBUG: Conserv should have been terminated by now..." + System.lineSeparator());
frame.getMainFrame().deactivateInfiGlass();
DebugToolbox.stopExecTimer();
}
};
worker.execute();
}});
The parsing of the values happens in an instance of Dataset, using the method addRow(). The following piece of code shows the way the parsing error is handled:
public double valueToIntensity(String val){
if(val.equalsIgnoreCase(""))
return missingVal;
try{
double d = Double.parseDouble(val);
switch(valType){
case RAW: break;
case LOG2: d = StrictMath.pow(2,d); break;
case LOGN: d = StrictMath.pow(StrictMath.E, d); break;
case LOG10: d = StrictMath.pow(10,d); break;
default: throw new RuntimeException("Unrecognized value type");
}
if(Double.isInfinite(d)){
StringBuilder msg = new StringBuilder("Double precision overflow occurred: 'd' is infinite!!");
msg.append(System.lineSeparator())
.append("chosen value scale is ").append(valType)
.append(System.lineSeparator())
.append("value = ").append(val);
logger.severe(msg.toString() + System.lineSeparator());
System.err.println("Data parsing error!!" +
"Please make sure that you have selected the correct scale...");
System.exit(FeverMainFrame.exitCodes.get(this.getClass()));
}
else
return d;
} catch (NumberFormatException e){
System.err.println("Data parsing error!!");
// THE FOLLOWING LINE IS WHERE DEADLOCK OCCURS
logger.severe("Expected: string representation of a numerical value, "
+ "Found: " + val + System.lineSeparator());
System.err.println("Please make sure the datafile does not include any strings "
+ "like 'N/A' or '-' for denoting missing values.");
System.exit(FeverMainFrame.exitCodes.get(this.getClass()));
}
// TODO: This should never happen!
throw new RuntimeException("Assertion failed during dataset parsing...");
}
If I remove the values that are causing the parsing error, without changing anything else, both the logging framework and the rest of the application run as expected.
I would really appreciate any insight as to what is going on in this particular case.
Absent a complete example, verify that your implementation of doInBackground() does not attempt to update any GUI component or model. Instead, publish() interim results and process() them on the EDT as they become available. A complete example is shown here.
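A minimal sketch of that publish()/process() pattern, using illustrative names only (data, loadRows() and statusLabel are placeholders, not the poster's fields):
import java.util.List;
import javax.swing.JLabel;
import javax.swing.SwingWorker;

final JLabel statusLabel = new JLabel();   // placeholder GUI component
final String[][] data = loadRows();        // placeholder data source

SwingWorker<Integer, String> worker = new SwingWorker<Integer, String>() {
    @Override
    protected Integer doInBackground() {
        int parsed = 0;
        for (String[] line : data) {
            // ... parse the line off the EDT; never touch Swing components here ...
            parsed++;
            publish("Parsed row " + parsed);   // hand interim results to the EDT
        }
        return parsed;
    }

    @Override
    protected void process(List<String> chunks) {
        // Runs on the EDT: safe to update the GUI with the latest interim result.
        statusLabel.setText(chunks.get(chunks.size() - 1));
    }

    @Override
    protected void done() {
        try {
            statusLabel.setText("Done: " + get() + " rows");
        } catch (Exception e) {
            statusLabel.setText("Failed: " + e.getMessage());
        }
    }
};
worker.execute();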
I am getting too many deadlocks on OrientDB while using the Java API to query vertices. After a deadlock happens, the entire database becomes unresponsive and I have to kill the daemon and start it again. As an example, the error that I get from the deadlocks is:
com.orientechnologies.common.concur.OTimeoutException: Can not lock record for 2000 ms. seems record is deadlocked by other record
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.acquireReadLock(OAbstractPaginatedStorage.java:1300)
at com.orientechnologies.orient.core.tx.OTransactionAbstract.lockRecord(OTransactionAbstract.java:120)
at com.orientechnologies.orient.core.id.ORecordId.lock(ORecordId.java:282)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.lockRecord(OAbstractPaginatedStorage.java:1776)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:1416)
at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.readRecord(OAbstractPaginatedStorage.java:694)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.executeReadRecord(ODatabaseDocumentTx.java:1569)
at com.orientechnologies.orient.core.tx.OTransactionNoTx.loadRecord(OTransactionNoTx.java:80)
at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.load(ODatabaseDocumentTx.java:1434)
at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.readRecord(ONetworkProtocolBinary.java:1456)
at com.orientechnologies.orient.server.network.protocol.binary.ONetworkProtocolBinary.executeRequest(ONetworkProtocolBinary.java:346)
at com.orientechnologies.orient.server.network.protocol.binary.OBinaryNetworkProtocolAbstract.execute(OBinaryNetworkProtocolAbstract.java:216)
at com.orientechnologies.common.thread.OSoftThread.run(OSoftThread.java:65)
Following is the block that I use to query edges and create associations between vertices:
public User generateFriend(String mobile, String userRID) {
StringBuilder errorMsg = new StringBuilder();
Iterable<OrientVertex> vertexes;
//Retrieve friends of the user
List<User> friendsList = new ArrayList<User>();
vertexes = db.queryVertices("select expand( unionAll(inE('E_Friend').out,outE('E_Friend').in) ) from " + userRID,errorMsg);
if (!errorMsg.toString().equals("")) {
throw new DbException("Db exception occured, " + errorMsg);
}
for (OrientVertex v : vertexes){
friendsList.add(vertexToUser(v));
}
//Create an edge between the user and the other user with this mobile number, if the edge does not exist yet
User u = findUserByMobileNo(friendsList,mobile);
if ( u == null){
u = findByMobileNo(mobile);
if (u != null) {
//create edge
db.executeQuery("select createEdge('E_Friend','" + userRID + "','" + u.getRid() + "') from " + userRID, new HashMap<String, Object>(), errorMsg);
if (!errorMsg.toString().equals("")) {
throw new DbException("Db exception occured, " + errorMsg);
}
}
}
return u;
}
public Iterable<OrientVertex> queryVertices(String query, StringBuilder errMsg){
logger.error("before getGraph, " + errMsg.toString());
graph = getGraph(errMsg);
if (!errMsg.toString().equals("")){
return null;
}
logger.error("after getGraph, " + errMsg.toString());
Iterable<OrientVertex> vertices = null;
try {
OSQLSynchQuery<OrientVertex> qr = new OSQLSynchQuery<OrientVertex>(query);
vertices = graph.command(qr).execute();
logger.error("after graph command execute, " + errMsg.toString());
}
catch (Exception ex){
errMsg.append(ex.getMessage());
logger.error("graph command exception, " + errMsg.toString());
}
logger.error("before return vertices, " + errMsg.toString());
return vertices;
}
public List<ODocument> executeQuery(String sql, HashMap<String,Object> params,StringBuilder errMsg) {
List<ODocument> result = new ArrayList<ODocument>();
try {
db = getDatabase(errMsg);
if (!errMsg.toString().equals("")){
return null;
}
OSQLSynchQuery<ODocument> query = new OSQLSynchQuery<ODocument>(sql);
if (params.isEmpty()) {
result = db.command(query).execute();
} else {
result = db.command(query).execute(params);
}
} catch (Exception e) {
errMsg.append(e.getMessage());
//TODO: Add threaded error log saving mechanism
}
return result;
}
Deadlocks can come from a missing index on a table, so check all the tables involved in this operation and verify that indexes exist on the columns you query.
Refer to the link where I had the same deadlock problem.
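For example, an index on the property used for the lookup can be created through the same graph handle used above. This is only a sketch: the class and property names V_User and mobile are assumptions, so adjust them to your actual schema:
import com.orientechnologies.orient.core.sql.OCommandSQL;

// Index the property that findByMobileNo() filters on, so lookups do not
// fall back to full scans that hold record locks for a long time.
graph.command(new OCommandSQL(
        "CREATE INDEX V_User.mobile ON V_User (mobile) NOTUNIQUE")).execute();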
I am trying to retrieve all the names from my database.
I wrote this code:
public static String getCmdCommand(int resultCount) throws Exception {
try {
// This will load the MySQL driver, each DB has its own driver
Class.forName("com.mysql.jdbc.Driver");
// Setup the connection with the DB
connect = DriverManager.getConnection(""+MyBot.mysqlDbPath+"",""+MyBot.mysqlDbUsername+"",""+MyBot.mysqlDbPassword+"");
PreparedStatement zpst=null;
ResultSet zrs=null;
zpst=connect.prepareStatement("SELECT `befehlsname` FROM `eigenebenutzerbefehle`");
zrs=zpst.executeQuery();
if(zrs.next()){
return zrs.getString(resultCount);
}else{
return "-none-";
}
}catch (Exception e) {
throw e;
} finally {
close();
}
}
and I start the method by running a loop:
for(int i = 0; i <= cmdAmount-1; i++){
try {
eebBenutzerBefehl = dao.getCmdCommand(i);
} catch (Exception e) {
e.printStackTrace();
}
}
cmdAmount is an integer holding the total number of rows in the table.
So, for example, if my database holds name1, name2, name3, is it wrong to call them like this?
return zrs.getString(resultCount);
which should be:
zrs.getString(0) = name1
zrs.getString(1) = name2
zrs.getString(2) = name3
I always receive java.sql.SQLException: Column Index out of range; perhaps it just keeps checking only the first entry in the database.
return zrs.getString(resultCount);
The getString() method should be given the index of the column you want to return, which is always going to be the same here. You should pass in a constant such as 1 (JDBC column indexes are 1-based), not the row number.
Also, you should open the database connection only once rather than over and over again in that one method, by passing in the connect variable as a parameter.
Here's what I would do if you want to retrieve the name from each row of the table:
public static ArrayList<String> getCmdCommand(Connection connect) throws Exception {
try {
PreparedStatement zpst=null;
ResultSet zrs=null;
ArrayList<String> names = new ArrayList<String>();
zpst=connect.prepareStatement("SELECT `befehlsname` FROM `eigenebenutzerbefehle`");
zrs=zpst.executeQuery();
// The result set contains all the names retrieved from the call to the database, so
// you just need to iterate through them all and store them in a list.
while(zrs.next()) {
names.add(zrs.getString(1));
}
} catch (Exception e) {
throw e;
} finally {
close();
}
return names;
}
You don't need to tell it how many rows there are because it will figure that out itself.
Class.forName("com.mysql.jdbc.Driver");
Connection connect = DriverManager.getConnection(""+MyBot.mysqlDbPath+"",""+MyBot.mysqlDbUsername+"",""+MyBot.mysqlDbPassword+"");
ArrayList<String> names = new ArrayList<String>();
try {
names = dao.getCmdCommand(connect);
} catch (Exception e) {
e.printStackTrace();
}
if(names.size() < 1) {
// " - none - "
}