I have to map a table that stores the utilization history of an app. The table contains tuples like these:
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
<AppId,date,cpuUsage,memoryUsage>
AppId is always different because it refers to many different apps; the date is expressed in the format dd/mm/yyyy hh:mm, and cpuUsage and memoryUsage are expressed in %, so for example:
<3ghffh3t482age20304,230720142245,0.2,3.5>
I retrieved the data from Cassandra in this way (a small snippet):
public static void main(String[] args) {
    Cluster cluster;
    Session session;
    cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
    session = cluster.connect();
    session.execute("CREATE KEYSPACE IF NOT EXISTS foo WITH replication "
            + "= {'class':'SimpleStrategy', 'replication_factor':3};");
    String createTableAppUsage = "CREATE TABLE IF NOT EXISTS foo.appusage"
            + "(appid text, date text, cpuusage double, memoryusage double, "
            + "PRIMARY KEY(appid, date)) "
            + "WITH CLUSTERING ORDER BY (date ASC);";
    session.execute(createTableAppUsage);
    // Use a SELECT to get the appusage table's rows
    ResultSet resultForAppUsage = session.execute("SELECT appid,cpuusage FROM foo.appusage");
    for (Row row : resultForAppUsage)
        System.out.println("appid: " + row.getString("appid") + " cpuusage: " + row.getDouble("cpuusage"));
    // Clean up the connection by closing it
    cluster.close();
}
So, my problem now is to map the data by key/value and create a tuple of the form
<AppId,cpuusage>
integrating this code (a snippet that doesn't work):
JavaPairRDD<String, Integer> saveTupleKeyValue = someStructureFromTakeData.mapToPair(new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String x) {
        return new Tuple2(x, y); // y is not defined here, which is part of what doesn't work
    }
});
How can I map appId and cpuusage using an RDD and then reduce/filter, e.g. keep rows where cpuusage > 50?
Any help?
Thanks in advance.
Assuming that you already have a valid SparkContext sparkContext created, have added the spark-cassandra-connector dependencies to your project, and have configured your Spark application to talk to your Cassandra cluster (see the docs for that), then we can load the data into an RDD like this:
val data = sparkContext.cassandraTable("foo", "appusage").select("appid", "cpuusage")
In Java the idea is the same, but it requires a bit more plumbing, described here.
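For completeness, here is a rough Java sketch of that plumbing using the spark-cassandra-connector Java API (CassandraJavaUtil). Treat it as an untested outline rather than the canonical way; the sparkContext variable and the cpuusage > 50 filter are taken from the question:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public static JavaPairRDD<String, Double> heavyCpuApps(JavaSparkContext sparkContext) {
    // Load only the two columns we care about from foo.appusage.
    return javaFunctions(sparkContext)
            .cassandraTable("foo", "appusage")
            .select("appid", "cpuusage")
            // Build <appid, cpuusage> pairs.
            .mapToPair(row -> new Tuple2<>(row.getString("appid"), row.getDouble("cpuusage")))
            // Keep only the rows whose cpuusage exceeds 50 (adjust to how the % is stored).
            .filter(pair -> pair._2() > 50);
}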
I am using Apache Camel with the SQL component to perform SQL operations against PostgreSQL. I have already succeeded in inserting multiple rows into a single table as a batch, using the batch=true option and providing an Iterator in the message body. To keep the example simple, with Student as the table name and two columns, name and age, below is a snippet showing the relevant part:
from("direct:batch_insert_single_table")
...
.process(ex -> {
log.info("batch insert for single table");
final var iterator = IntStream.range(0, 5000).boxed()
.map(x -> {
final var query = new HashMap<String, Object>();
Integer counter = x.intValue();
String name = "abc_" + counter;
query.put("name", name);
query.put("age", counter);
return query;
}).iterator();
ex.getMessage().setBody(iterator);
})
.to("sqlComponent:INSERT INTO student (name, age) VALUES (:#name, :#age);?batch=true")
...
;
This overall takes 10 seconds for 5000 records.
However, when I use the same approach for inserting as a batch on multiple different tables, I get an error:
Here is the code that is not working:
from("direct:batch_insert_multiple_tables")
...
.process(ex -> {
log.info("batch insert for multiple tables");
final var iterator = IntStream.range(0, 3).boxed()
.map(x -> {
final var query = new HashMap<String, Object>();
Integer counter = x.intValue();
String name = "abc_" + counter;
query.put("table", "test" + counter);
query.put("name", "name");
query.put("age", counter);
return query;
}).iterator();
ex.getMessage().setBody(iterator);
})
.to("sqlComponent:INSERT INTO :#table (name,age) VALUES (:#name,:#age);?batch=true")
...
;
The tables test0, test1 and test2 already exist.
The exception thrown is:
Failed delivery for (MessageId: A0D98C12BAD769F-0000000000000000 on ExchangeId: A0D98C12BAD769F-0000000000000000). Exhausted after delivery attempt: 1 caught: org.springframework.jdbc.BadSqlGrammarException: PreparedStatementCallback; bad SQL grammar []; nested exception is org.postgresql.util.PSQLException: ERROR: syntax error at or near "$1"
Position: 13
Please suggest whether I am doing something wrong or my approach is simply not supported by Apache Camel.
NOTE: I am using the latest versions of Apache Camel and PostgreSQL.
Regards,
GSN
You cannot use a parameter for a table name, a column name, or any other identifier in PostgreSQL. You either have to use a dynamically generated SQL statement (that is, a statement you construct in your Java code; take special care about SQL injection) or separate SQL statements, one per table.
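As an illustration of the first option, here is a minimal, hedged JDBC sketch (not Camel-specific; the table whitelist and connection handling are assumptions) that builds the table name dynamically while still binding the values as parameters, so the injection risk is confined to the identifier check:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.Set;

public final class DynamicTableInsert {

    // Whitelist of identifiers we allow; never concatenate raw user input into SQL.
    private static final Set<String> ALLOWED_TABLES = Set.of("test0", "test1", "test2");

    public static void insert(Connection connection, String table, String name, int age) throws Exception {
        if (!ALLOWED_TABLES.contains(table)) {
            throw new IllegalArgumentException("Unknown table: " + table);
        }
        // The identifier is concatenated only after the whitelist check; the values stay bound.
        String sql = "INSERT INTO " + table + " (name, age) VALUES (?, ?)";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setString(1, name);
            ps.setInt(2, age);
            ps.executeUpdate();
        }
    }
}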
I have a for loop in my program where I save new objects to the database. It looks like this:
for (String something : readvalue.readValue()) {
    Value value = getValueForSomething(something);
    System.out.println(value);
    valueRepository.save(value);
}
This fragment of code is executed every 30 s and saves all values to the database. Some rows in the database have the same two fields and differ only in the third. How can I update values in the H2 database instead of inserting new rows?
I would suggest that, within your for loop, you call a method that checks whether the object already exists in your H2 DB by querying for it with a unique identifier such as an id. Use the following example query as a reference:
private static final String PRODUCT_ALREADY_EXISTS_QUERY = "SELECT EXISTS(SELECT 1"
+ " FROM inventory.products "
+ " WHERE 1 = 1"
+ " AND id = :id)";
Then, if the record does exist, call an update method that uses a query to UPDATE via the unique identifier. An example query would be:
private static final String UPDATE_QUERY = "UPDATE inventory.products"
+ " SET (company_id, name, price, type, quantity, created_date, last_modified_date) = "
+ " (:companyId, :productName, :price, :productType, :quantity, :createdDate, :lastModifiedDateTime)"
+ " WHERE id = :id ";
If the record doesn't exist, then just create the record like you are.
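Putting the two queries together, a minimal sketch of that check-then-update flow could look like the following. It assumes Spring's NamedParameterJdbcTemplate (since the queries above use named :id parameters); buildUpdateParams and productRepository are hypothetical stand-ins for your own mapping and save code:

// Sketch only: PRODUCT_ALREADY_EXISTS_QUERY and UPDATE_QUERY are the statements shown above,
// and jdbcTemplate is an injected org.springframework.jdbc.core.namedparam.NamedParameterJdbcTemplate.
public void saveOrUpdate(Product product) {
    Map<String, Object> params = Map.of("id", product.getId());
    Boolean exists = jdbcTemplate.queryForObject(PRODUCT_ALREADY_EXISTS_QUERY, params, Boolean.class);
    if (Boolean.TRUE.equals(exists)) {
        // Row is already there: update it in place using the unique identifier.
        jdbcTemplate.update(UPDATE_QUERY, buildUpdateParams(product)); // buildUpdateParams is hypothetical
    } else {
        // Row is new: insert it, as the original loop already does.
        productRepository.save(product); // productRepository is hypothetical
    }
}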
Trying my hand at the JSONB datatype for the first time (discussion continued from (Join tables using a value inside a JSONB column) on advice from @Erwin; starting a new thread).
Two tables (obfuscated data and table names):
Discussion table { discussion_id int, contact_id, group_id, discussion_updates jsonb } [has around 600 thousand rows]
Authorization table { user_id varchar , auth_contacts jsonb, auth_groups jsonb } [has around 100 thousand rows]
The auth_contacts jsonb column holds key-value pairs, for example:
{ "CC1": "rr", "CC2": "ro" }
The auth_groups jsonb column holds key-value pairs, for example:
{ "GRP1": "rr", "GRP2": "ro" }
First, on inserts into the database via Java JDBC, what I am doing is:
JSONObject authContacts = new JSONObject();
JSONObject authGroups = new JSONObject();
for (each record in data) {   // pseudocode: iterate over the records being loaded
    authContacts.put(contactKey, contactRight);
    authGroups.put(groupKey, groupRight);
}
String insertSql = "INSERT INTO SSTA_AuthAll(employee_id, auth_contacts, auth_groups) VALUES(?,?::jsonb,?::jsonb)";
// -- Connect to the DB and prepare the query
preparedStatement.setObject(2, authContacts.toJSONString());
preparedStatement.setObject(3, authGroups.toJSONString());
// INSERT into the DB
Now, toJSONString() takes time (as much as 1 second sometimes; TIME FOR toJSON STRING LOOP: 17238 ms), which again is inefficient.
So again, is this the right way to do it? Most examples on Google insert a string directly.
If I insert a Map directly into the jsonb column, it expects the HSTORE extension, which is what I shouldn't be using if I am going for jsonb?
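One alternative I have come across is to hand the driver a PGobject whose type is set to jsonb instead of casting in the SQL text; a rough, untested sketch is below (it still needs the JSON string, so it may not help the toJSONString() timing):

import org.postgresql.util.PGobject;

// Build the parameter as a jsonb-typed object instead of using ?::jsonb in the statement.
PGobject jsonbValue = new PGobject();
jsonbValue.setType("jsonb");
jsonbValue.setValue(authContacts.toJSONString()); // still requires a JSON string
preparedStatement.setObject(2, jsonbValue);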
Now on the next part:
I need to join contact_id from the discussion table with the contact_id keys of the auth_contacts json data [the keys, as shown in the example above], and likewise join group_id from the discussion table with the keys of auth_groups.
As of now I have tried the join only on contact_id:
SELECT *
FROM discussion d
JOIN
(SELECT user_id, jsonb_object_keys(a.contacts) AS contacts
FROM auth_contacts a
WHERE user_id = 'XXX') AS c ON (d.contact_id = c.contacts::text)
ORDER BY d.updated_date DESC
This join, for a user who has around 60 thousand authorized contacts, takes around 60 ms, and consecutive runs take less. The obfuscated explain plan is as follows:
"Sort (cost=4194.02..4198.39 rows=1745 width=301) (actual time=50.791..51.042 rows=5590 loops=1)"
" Sort Key: d.updated_date"
" Sort Method: quicksort Memory: 3061kB"
" Buffers: shared hit=11601"
" -> Nested Loop (cost=0.84..4100.06 rows=1745 width=301) (actual time=0.481..44.437 rows=5590 loops=1)"
" Buffers: shared hit=11598"
" -> Index Scan using auth_contacts_pkey on auth_contacts a (cost=0.42..8.93 rows=100 width=888) (actual time=0.437..1.074 rows=1987 loops=1)"
" Index Cond: ((user_id)::text = '105037'::text)"
" Buffers: shared hit=25"
" -> Index Scan using discussion_contact_id on discussion d (cost=0.42..40.73 rows=17 width=310) (actual time=0.016..0.020 rows=3 loops=1987)"
" Index Cond: ((contact_id)::text = (jsonb_object_keys(a.contacts)))"
" Buffers: shared hit=11573"
"Planning time: 17.866 ms"
"Execution time: 52.192 ms"
My final aim is an additional join on group_id in the same query. What jsonb_object_keys actually does is create a user_id vs. auth_contacts mapping for each key, so for a user with 60 thousand contacts it will create a view of 60 thousand rows (probably in memory). Now if I include the join on auth_groups (which, for the sample user with 60 thousand contacts, would have around 1000 thousand groups), the query would become slower.
So is this the right way to do a join on a jsonb object, and is there a better way to do it?
I am working with Cassandra and I am using the Hector client to read and upsert data in the Cassandra database. I am trying to retrieve data from the Cassandra database using the Hector client, and I am able to do that if I retrieve only one column.
Now I am trying to retrieve the data for rowKey 1011 but with columnNames as a collection of strings. Below is my API that retrieves the data from the Cassandra database using the Hector client:
public Map<String, String> getAttributes(String rowKey, Collection<String> attributeNames, String columnFamily) {
final Cluster cluster = CassandraHectorConnection.getInstance().getCluster();
final Keyspace keyspace = CassandraHectorConnection.getInstance().getKeyspace();
try {
ColumnQuery<String, String, String> columnQuery = HFactory
.createStringColumnQuery(keyspace)
.setColumnFamily(columnFamily).setKey(rowKey)
.setName("c1");
QueryResult<HColumn<String, String>> result = columnQuery.execute();
System.out.println("Column Name from cassandra: " + result.get().getName() + "Column value from cassandra: " + result.get().getValue());
} catch (HectorException e) {
LOG.error("Exception in CassandraHectorClient::getAttributes " +e+ ", RowKey = " +rowKey+ ", Attribute Names = " +attributeNames);
} finally {
cluster.getConnectionManager().shutdown();
}
return null;
}
As you can see in the above method, I am retrieving the data from the Cassandra database for a particular rowKey and for column c1. Now I am trying to retrieve the data for a collection of columns for a particular rowKey.
Meaning something like this: I want to retrieve the data for multiple columns but for the same rowKey. How can I do this using the Hector client? And I don't want to retrieve the data for all the columns and then iterate to find the data for the individual columns I am looking for.
Use a column name made up of a composite key, as a combination of UTF8Type and TIMEUUID, and then:
// Assumed setup: sliceQuery is a SliceQuery<String, Composite, String> created roughly like this
// (serializers and column family name are assumptions; adjust to your schema):
SliceQuery<String, Composite, String> sliceQuery = HFactory.createSliceQuery(
        keyspace, StringSerializer.get(), CompositeSerializer.get(), StringSerializer.get());
sliceQuery.setColumnFamily("yourColumnFamily");
sliceQuery.setKey("your row key");
Composite startRange = new Composite();
startRange.addComponent(0, "c1", Composite.ComponentEquality.EQUAL);
Composite endRange = new Composite();
endRange.addComponent(0, "c1", Composite.ComponentEquality.GREATER_THAN_EQUAL);
sliceQuery.setRange(startRange, endRange, false, Integer.MAX_VALUE);
QueryResult<ColumnSlice<Composite, String>> result = sliceQuery.execute();
ColumnSlice<Composite, String> cs = result.get();
The above code will give you all records for your row key.
After that, iterate as follows:
for (HColumn<Composite, String> col : cs.getColumns()) {
System.out.println("column key's first part : "+col.getName().get(0, HFactoryHelper.stringSerializer).toString());
System.out.println("column key's second part : "+col.getName().get(1, HFactoryHelper.uuidSerializer).toString());
System.out.println("column key's value : "+col.getValue());
}
Somewhere you will have to write logic to maintain the set of records.
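If the column names in your column family are plain UTF8 strings rather than composites, a possibly simpler route is a slice query with the column names set explicitly. The sketch below is untested and assumes string serializers throughout; attributeNames, columnFamily, rowKey and keyspace are the variables from the question's method:

// Fetch only the named columns for one row key (all comparators assumed to be UTF8/String).
SliceQuery<String, String, String> sliceQuery = HFactory.createSliceQuery(
        keyspace, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
sliceQuery.setColumnFamily(columnFamily);
sliceQuery.setKey(rowKey);
sliceQuery.setColumnNames(attributeNames.toArray(new String[0]));

Map<String, String> attributes = new HashMap<>();
for (HColumn<String, String> column : sliceQuery.execute().get().getColumns()) {
    attributes.put(column.getName(), column.getValue());
}
return attributes;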
I am working on an Android app and I am creating a database called HealthDev.db that has a table called rawData with 4 columns:
_id, foreignUserId, data, timeStamp
I have worked with the sqlite3 program in the bash shell and have figured out that I can have a timestamp column with the following column schema:
timeStamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
so when I created the table I used:
create table rawData(_id integer primary key autoincrement, foreignUserId integer, data real, timeStamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP);
This worked fine in bash.
Then I practiced in sqlite3 and learned that, when inserting into the timeStamp column, using the function time('now') as the value actually stores a timestamp in the form HH:MM:SS in Coordinated Universal Time.
So now, translating that into Java for the Android app, I used the code below. This way the table automatically generates about 20 rows when onCreate is called. This is just for testing whether I am passing time('now') correctly in Java.
// Below are variables to the database table name and the
// database column names.
public static final String TABLE_RAW_DATA = "rawData";
public static final String COLUMN_ID = "_id";
public static final String COLUMN_FOREIGN_USER_ID = "foreignUserId";
public static final String COLUMN_DATA = "data";
public static final String COLUMN_TIME_STAMP = "timeStamp";
// Database creation sql statement.
private static final String DATABASE_CREATE = "create table "
+ TABLE_RAW_DATA
+ "("
+ COLUMN_ID + " integer primary key autoincrement, "
+ COLUMN_FOREIGN_USER_ID + " integer, "
+ COLUMN_DATA + " real, "
+ COLUMN_TIME_STAMP + " TIMESTAMP DEFAULT CURRENT_TIMESTAMP"
+ ");";
// initializes the columns of the database given by passing the DATABASE_CREATE
// sql statement to the incoming database.
public static void onCreate(SQLiteDatabase database) {
database.execSQL(DATABASE_CREATE);
// For testing
ContentValues contentValues = new ContentValues();
System.out.println("The database is open? " + database.isOpen());
for (int i = 0; i < 20; i++)
{
contentValues.put( COLUMN_FOREIGN_USER_ID, 8976);
contentValues.put( COLUMN_DATA, Math.random()*100 );
contentValues.put( COLUMN_TIME_STAMP, " time('now') " );
database.insert( TABLE_RAW_DATA, null, contentValues );
//contentValues = new ContentValues();
}
}
After running this code in an Eclipse emulator, I pulled the database file from the file explorer in the DDMS view for Eclipse Android projects. Then I opened the database in a bash shell and selected all the columns from the rawData table to show them in the shell. I noticed that time('now') was treated as a string and not as a function. To prove that the time('now') function works, I manually inserted a new row using time('now') for the timeStamp value, then re-selected all the columns to show them again. It successfully printed the timestamp as HH:MM:SS.
I am thinking there might be a difference in the environments? The bash shell recognizes the time('now') function (which was written in C, right?) because I have the sqlite3 program in bash, yet in Eclipse, when I use a SQL database and insert, it treats time('now') as a string. Keep in mind I am working on Windows 7 and accessing bash as a client (SSH Secure Shell) to my school's host machine.
My main question: is it possible to code it so that it recognizes the time('now') function?
Since the default for the column is CURRENT_TIMESTAMP, what if you leave out this line entirely:
contentValues.put( COLUMN_TIME_STAMP, " time('now') " );
Won't it now insert the current timestamp into that column by default?
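In other words, the DEFAULT should fill in the timestamp whenever the column is simply not mentioned in the insert. If you specifically want time('now') rather than the full CURRENT_TIMESTAMP, one hedged workaround (values put into ContentValues are bound as plain literals, so SQL functions inside them are never evaluated) is to fall back to raw SQL for that insert:

// Option 1: omit the timestamp column and let the column DEFAULT apply.
ContentValues contentValues = new ContentValues();
contentValues.put(COLUMN_FOREIGN_USER_ID, 8976);
contentValues.put(COLUMN_DATA, Math.random() * 100);
database.insert(TABLE_RAW_DATA, null, contentValues);

// Option 2 (sketch): use raw SQL so SQLite itself evaluates time('now').
database.execSQL(
        "INSERT INTO " + TABLE_RAW_DATA + " (" + COLUMN_FOREIGN_USER_ID + ", "
                + COLUMN_DATA + ", " + COLUMN_TIME_STAMP + ") VALUES (?, ?, time('now'))",
        new Object[] { 8976, Math.random() * 100 });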