Insert into column family using Datastax Java driver? - java

I have a column family created like this in Cassandra, as I was previously using a Thrift-based client:
create column family USER
with comparator = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and default_validation_class = 'BytesType'
Can I insert into the above column family using the Datastax Java driver, with its asynchronous / batch write capability?
I will be using an INSERT statement to insert into the above column family. Is that possible with the Datastax Java driver?
I am under the impression that I can only insert into CQL-based tables using the Datastax Java driver, not into tables designed as column families...

TL;DR
Sort of, but it is better to create a CQL3-based table and continue from there.
First off, to get a clear picture of what's going on, use the describe command in cqlsh:
cqlsh> describe COLUMNFAMILY "USER";
CREATE TABLE "USER" (
key text,
column1 text,
value blob,
PRIMARY KEY (key, column1)
) WITH COMPACT STORAGE AND
bloom_filter_fp_chance=0.010000 AND
caching='KEYS_ONLY' AND
comment='' AND
dclocal_read_repair_chance=0.000000 AND
gc_grace_seconds=864000 AND
index_interval=128 AND
read_repair_chance=0.100000 AND
replicate_on_write='true' AND
populate_io_cache_on_flush='false' AND
default_time_to_live=0 AND
speculative_retry='NONE' AND
memtable_flush_period_in_ms=0 AND
compaction={'class': 'SizeTieredCompactionStrategy'} AND
compression={'sstable_compression': 'LZ4Compressor'};
Then you can build insert statements using CQL (I used cqlsh):
cqlsh:test> insert into test."USER" (key, column1, value) VALUES ('epickey', 'epic column 1 text', null);
If you do a select however...
cqlsh:test> SELECT * FROM test."USER" ;
(0 rows)
Once you've done that go back to the CLI:
[default@unknown] use test;
Authenticated to keyspace: test
[default@test] list USER;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: epickey
1 Row Returned.
Elapsed time: 260 msec(s).
The last tool I'll use is sstable2json (it takes all the data from an sstable and turns it into a JSON representation).
For the above single insert into the USER CF, I got this back:
[
{"key": "657069636b6579","columns": [["epic column 1 text","5257c34b",1381483339315000,"d"]]}
]
So the data is there, but you just don't really have access to it over CQL.
Note: this is all done using C* 2.0 and cqlsh 4.0 (CQL3).

Yes, using the code below:
// column_names lists the columns and column_values is expected to contain
// matching bind markers, e.g. "?, ?, ?"
String query = "INSERT INTO " + keyspace_name + "." + column_family
        + " (" + column_names + ") VALUES (" + column_values + ");";
PreparedStatement statement = session.prepare(query);
BoundStatement boundStatement = new BoundStatement(statement);
boundStatement.setString(0, key);
boundStatement.setString(1, subColNames[k]);
boundStatement.setMap(2, colValues);
session.execute(boundStatement);
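To cover the asynchronous / batch part of the question, here is a minimal sketch, assuming the CQL3 view of the USER column family shown above (key text, column1 text, value blob) and Datastax Java driver 2.x/3.x; the contact point and sample values are placeholders:
import java.nio.ByteBuffer;
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;

// Placeholder contact point and keyspace
Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("test");

PreparedStatement ps = session.prepare(
        "INSERT INTO \"USER\" (key, column1, value) VALUES (?, ?, ?)");

// Asynchronous single write: executeAsync returns immediately with a future
ResultSetFuture future = session.executeAsync(
        ps.bind("epickey", "epic column 1 text", ByteBuffer.wrap(new byte[] { 1, 2, 3 })));
future.getUninterruptibly(); // block here, or register a callback instead

// Batch of writes (UNLOGGED: no atomicity guarantee, less coordinator overhead)
BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
batch.add(ps.bind("epickey", "another column", ByteBuffer.wrap(new byte[] { 4 })));
batch.add(ps.bind("otherkey", "first column", ByteBuffer.wrap(new byte[] { 5 })));
session.execute(batch);

cluster.close();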

Related

Can you check if a column exists and perform different actions with Oracle?

My table looks like the following:
id | value1 | count
I have a list of value1 in RAM and I want to do the following:
if (value1 exists in table) {
    count + 1
} else {
    insert new row into table
}
Is this possible with Oracle, or do I have to handle it in code, with a for loop that processes one element of the list at a time? The list contains 5 million values. I'd have to do something like this in the code:
for (int i = 0; i < list.size(); i++) {
    boolean exists = checkifexists(list.get(i));
    if (exists) {
        countPlusOne(list.get(i));
    } else {
        createNewRow(list.get(i));
    }
}
So I have to do at least two queries for each value, totalling 10m+ queries. This could take a long time and may not be the most efficient way to do this. I'm trying to think of another way.
"I load them into RAM from the database"
You already have the source data in the database so you should do the processing in the database. Instantiating a list of 5 million strings in local memory is not a cheap operation, especially when it's unnecessary.
Oracle supports a MERGE capability which we can use to test whether a record exists in the target table and populate a new row conditionally. Being a set operation, MERGE is far more performant than single-row inserts in a Java loop.
The tricky bit is uniqueness. You need to have a driving query from the source table which contains unique values (otherwise MERGE will throw an error). In this example I aggregate a count of each occurrence of value1 in the source table. This gives us a set of value1 plus a figure we can use to maintain the count column on the target table.
merge into you_target_table tt
using ( select value1
, count(*) as dup_cnt
from your_source_table
group by value1
) st
on ( st.value1 = tt.value1 )
when not matched then
insert (id, value1, cnt)
values (someseq.nextval, st.value1, st.dup_cnt)
when matched then
update
set tt.cnt = tt.cnt + st.dup_cnt;
(I'm assuming the ID column of the target table is populated by a sequence; amend that as you require).
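If you are issuing this from Java, a minimal JDBC sketch might look like the following; jdbcUrl, user and password are placeholders, and the table, sequence and column names are the assumed ones used above:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
     Statement stmt = conn.createStatement()) {
    // One round trip: the whole reconciliation happens inside the database
    int affected = stmt.executeUpdate(
        "merge into you_target_table tt "
      + "using (select value1, count(*) as dup_cnt "
      + "       from your_source_table group by value1) st "
      + "on (st.value1 = tt.value1) "
      + "when not matched then insert (id, value1, cnt) "
      + "  values (someseq.nextval, st.value1, st.dup_cnt) "
      + "when matched then update set tt.cnt = tt.cnt + st.dup_cnt");
    System.out.println(affected + " rows merged");
}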
In Oracle, we could use a MERGE statement to check if a row exists and do the insertion only if it doesn't.
First create a type that defines your list.
CREATE OR REPLACE TYPE value1_type as TABLE OF VARCHAR2(10); --use the datatype of value1
Merge statement.
MERGE INTO yourtable t
USING (
  select distinct column_value as value1 FROM TABLE(value1_type(v1,v2,v3))
) s ON ( s.value1 = t.value1 )
WHEN NOT MATCHED THEN INSERT
(value1) VALUES (s.value1);
You may also use NOT EXISTS.
INSERT INTO yourtable t
select * FROM
(
select distinct column_value as value1 from TABLE(value1_type(v1,v2,v3))
) s
WHERE NOT EXISTS
(
select 1 from
yourtable t where t.value1 = s.value1
);
You can do this with two approaches.
Approach 1:
Create a temp table in the database and insert all your values in RAM into that temp table.
Write a query that updates the counts based on a join between your main table and the temp table, and set a flag in the temp table for the values that were updated; for the values that were not updated, use an insert query to insert them.
Approach 2:
You can create your own data type, which accepts an array of values as input:
CREATE OR REPLACE TYPE MyType AS VARRAY(200) OF VARCHAR2(50);
Then write a procedure with your logic; the procedure will take the array as input:
CREATE OR REPLACE PROCEDURE testing (t_in MyType)
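A hedged sketch of invoking such a procedure from Java, assuming the MyType VARRAY and testing procedure above and an Oracle JDBC 12c+ driver (which provides createOracleArray); jdbcUrl, user, password and list are placeholders:
import java.sql.Array;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import oracle.jdbc.OracleConnection;

String[] values = list.toArray(new String[0]);        // the list held in RAM
try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
    Array arr = conn.unwrap(OracleConnection.class)
                    .createOracleArray("MYTYPE", values);  // SQL type name, uppercased
    try (CallableStatement cs = conn.prepareCall("{call testing(?)}")) {
        cs.setArray(1, arr);
        cs.execute();   // the procedure does the compare-and-insert/update work
    }
}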
First, fill your RAM list into a temporary table TMP:
select * from tmp;
VALUE1
----------
V00000001
V00000002
V00000003
V00000004
V00000005
...
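Filling TMP from the in-memory list is cheap if you batch the inserts; a rough JDBC sketch (the tmp table and value1 column are the ones above; jdbcUrl, user, password and list are placeholders):
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
     PreparedStatement ps = conn.prepareStatement("insert into tmp (value1) values (?)")) {
    conn.setAutoCommit(false);
    int n = 0;
    for (String value1 : list) {
        ps.setString(1, value1);
        ps.addBatch();
        if (++n % 10000 == 0) {
            ps.executeBatch();          // flush every 10k rows to keep memory bounded
        }
    }
    ps.executeBatch();
    conn.commit();
}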
You may use a MERGE statement to handle your logic:
if the key exists, increase the count by 1
if the key doesn't exist, insert it with an initial count of 1
merge into val
using tmp
on (val.value1 = tmp.value1)
when matched then update
set val.count = val.count + 1
when not matched then
insert (val.value1, val.count)
values (tmp.value1, 1)
;
Note that I assume you have an IDENTITY key in the column ID, so no key assignment is required.
In case there are duplicated records in the TMP table (more than one record with the same VALUE1 key) you get an error, as MERGE cannot handle more than one action per key.
ORA-30926: unable to get a stable set of rows in the source tables
If you want each occurrence of a duplicated key to be counted, you must pre-aggregate the temporary table using GROUP BY and add the counts.
Otherwise, simply ignore the duplicates using DISTINCT.
merge /*+ PARALLEL(5) */ into val
using (select value1, count(*) cnt from tmp group by value1) tmp
on (val.value1 = tmp.value1)
when matched then update
set val.count = val.count + tmp.cnt
when not matched then
insert (val.value1, val.count)
values (tmp.value1, tmp.cnt)

Missing column that was just inserted in Cassandra column family

We are constantly getting a problem on our test cluster.
Cassandra configuration:
Cassandra version: 2.2.12
nodes count: 6, seed nodes: 3, non-seed nodes: 3
replication factor 1 (of course for prod we will use 3)
Table configuration where we get problem:
CREATE TABLE "STATISTICS" (
key timeuuid,
column1 blob,
column2 blob,
column3 blob,
column4 blob,
value blob,
PRIMARY KEY (key, column1, column2, column3, column4)
) WITH COMPACT STORAGE
AND CLUSTERING ORDER BY (column1 ASC, column2 ASC, column3 ASC, column4 ASC)
AND caching = {
'keys':'ALL', 'rows_per_partition':'100'
}
AND compaction = {
'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
};
Our Java code details:
Java 8
Cassandra driver: Astyanax
app-nodes count: 4
So, what's happening:
Under high load our application does many inserts into Cassandra tables from all nodes.
During this we have one workflow where we do the following with one row in the STATISTICS table:
do insert 3 columns from app-node-1
do insert 1 column from app-node-2
do insert 1 column from app-node-3
do read all columns from row on app-node-4
at the last step (4), when we read all columns, we are sure that the insert of all columns is done (this is guaranteed by other checks that we have)
The problem is that sometimes (2-5 times per 100,000) at step 4, when we read all columns, we get 4 columns instead of 5, i.e. we are missing the column that was inserted at step 2 or 3.
We even started re-reading these columns every 100ms in a loop and we don't get the expected result. During this time we also check the columns using cqlsh - same result, i.e. 4 instead of 5.
BUT, if we add any new column to this row, then we immediately get the expected result, i.e. we then get 6 columns - 5 columns from the workflow and 1 dummy.
So after inserting the dummy column we get the missing column that was inserted at step 2 or 3.
Moreover, when we look at the timestamp of the missing (and then appeared) column, it is very close to the time when this column was actually added from our app-node.
Basically the insertions from app-node-2 & app-node-3 are done at nearly the same time, so these two columns always have nearly the same timestamp, even if we insert the dummy column 1 minute after the first read of all columns at step 4.
With replication factor 3 we cannot reproduce this problem.
So the open questions are:
Maybe this is expected behavior of Cassandra when the replication factor is 1?
If it's not expected, then what could be the potential reason?
UPDATE 1:
The following code is used to insert a column:
UUID uuid = <some uuid>;
short shortV = <some short>;
int intVal = <some int>;
String strVal = <some string>;
ColumnFamily<UUID, Composite> statisticsCF = ColumnFamily.newColumnFamily(
"STATISTICS",
UUIDSerializer.get(),
CompositeSerializer.get()
);
MutationBatch mb = keyspace.prepareMutationBatch();
ColumnListMutation<Composite> clm = mb.withRow(statisticsCF, uuid);
clm.putColumn(new Composite(shortV, intVal, strVal, null), true);
mb.execute();
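For completeness, the read at step 4 looks roughly like this with Astyanax (a sketch using the standard Astyanax row-query API, not taken from our actual code; keyspace, statisticsCF and uuid are the objects defined above):
import com.netflix.astyanax.connectionpool.OperationResult;
import com.netflix.astyanax.model.ColumnList;
import com.netflix.astyanax.model.Composite;

OperationResult<ColumnList<Composite>> result = keyspace
        .prepareQuery(statisticsCF)
        .getKey(uuid)
        .execute();
ColumnList<Composite> columns = result.getResult();
// here we expect 5 columns, but sometimes see only 4
System.out.println("columns in row: " + columns.size());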
UPDATE 2:
We continued testing/investigating.
When we caught this situation again, we immediately stopped (killed) our Java apps. We could then consistently see in cqlsh that the particular row does not contain the inserted column.
To make the column appear, first we tried nodetool flush on every Cassandra node:
pssh -h cnodes.txt /path-to-cassandra/bin/nodetool flush
Result - the same, the column did not appear.
Then we just restarted the Cassandra cluster and the column appeared.
UPDATE 3:
We tried disabling the Cassandra row cache by setting the row_cache_size_in_mb property to 0 (before it was 2 GB):
row_cache_size_in_mb: 0
After that, the problem was gone.
So the problem may be in OHCProvider, which is used as the default cache provider.

Timestamp in Cassandra

I want to use a timestamp as one of the columns in Cassandra (which I decided to use as a clustering key). What is the right way to store the column as a timestamp in Cassandra?
That is, is it fine to use the milliseconds (example: 1513078338560) directly, like below?
INSERT INTO testdata (nodeIp, totalCapacity, physicalUsage, readIOPS, readBW, writeIOPS, writeBW, writeLatency, flashMode, timestamp) VALUES('172.30.56.60',1, 1,1,1,1,1,1,'yes',1513078338560);
Or to use dateof(now())?
INSERT INTO testdata (nodeIp, totalCapacity, physicalUsage, readIOPS, readBW, writeIOPS, writeBW, writeLatency, flashMode, timestamp) VALUES('172.30.56.60',1, 1,1,1,1,1,1,'yes',dateof(now()));
Which is the faster and recommended way for timestamp-based queries in Cassandra?
NOTE: I know it is stored internally as milliseconds; I checked using 'SELECT timestamp, blobAsBigint(timestampAsBlob(timestamp)) FROM'
Thanks,
Harry
dateof is deprecated in Cassandra >= 2.2. Instead it's better to use the toTimestamp function, like this: toTimestamp(now()). When selecting, you can also use the toUnixTimestamp function if you want to get the timestamp as a long:
cqlsh:test> CREATE TABLE test_times (a int, b timestamp, PRIMARY KEY (a,b));
cqlsh:test> INSERT INTO test_times (a,b) VALUES (1, toTimestamp(now()));
cqlsh:test> SELECT toUnixTimestamp(b) FROM test_times;
system.tounixtimestamp(b)
---------------------------
1513086032267
(1 rows)
cqlsh:test> SELECT b FROM test_times;
b
---------------------------------
2017-12-12 13:40:32.267000+0000
(1 rows)
Regarding performance, there are different considerations:
If you already have the timestamp as a number, then you can use it instead of calling a function.
It's better to use prepared statements instead of "raw inserts" - in this case Cassandra won't need to transfer the full query, only the data, and also won't need to parse the statement every time.
The pseudo code looks as follows (Java-like):
PreparedStatement prepared = session.prepare(
"insert into your_table (field1, field2) values (?, ?)");
while(true) {
session.execute(prepared.bind(value1, value2));
}
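A minimal runnable sketch, assuming the Datastax Java driver 3.x (where a CQL timestamp binds to java.util.Date) and the test_times table from above; the contact point is a placeholder:
import java.util.Date;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("test");

PreparedStatement prepared = session.prepare(
        "INSERT INTO test_times (a, b) VALUES (?, ?)");

// If you already have epoch milliseconds, bind them directly as a Date;
// no server-side function call (toTimestamp(now())) is needed.
long epochMillis = System.currentTimeMillis();
session.execute(prepared.bind(1, new Date(epochMillis)));

cluster.close();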

How to separate data selected from multiple tables?

I want to search in 16 different tables, but I don't want to repeat the "select from DB" 16 times; I don't think that would help performance.
I am using:
query="SELECT * FROM table1, table2,..., table16 WHERE id=?";
Is it correct?
My problem is how to separate the data that comes from each table.
Also, I may get two or more results from one table for a single "id", so I want to know which data is from which table.
Best regards,
Your query will not work, because you are trying to join those multiple tables, whereas what you want to do is search (filter) those 16 tables.
You could use a union all to do this in a single query:
select xxx, 'table1' as source_table
from table1
where id = ?
union all
select xxx, 'table2' as source_table
from table2
where id = ?
and so on. The second derived field source_table can be used to determine which table returned which result.
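If you build the statement from Java, a rough JDBC sketch could look like this; the column list xxx is replaced by placeholder columns id and value, and conn and searchedId are assumed to exist, so adapt it to your schema:
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

List<String> tables = Arrays.asList("table1", "table2", /* ... */ "table16");

StringJoiner sql = new StringJoiner(" union all ");
for (String t : tables) {
    // t comes from the fixed list above, not from user input
    sql.add("select id, value, '" + t + "' as source_table from " + t + " where id = ?");
}

try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
    for (int i = 1; i <= tables.size(); i++) {
        ps.setInt(i, searchedId);              // bind the same id once per branch
    }
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            System.out.println(rs.getString("source_table") + ": " + rs.getString("value"));
        }
    }
}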
You have to list all the fields, using aliases for fields with the same name, prefixed with the table names.
For example :
query = "SELECT table1.id as id_1, table2.id as id_2, ... WHERE id_1 = 23"
Probably a very long query to write, but there are tools to generate it and paste it in: you can do this for example with FlySpeed SqlQuery (free for personal use).
FlySpeed SqlQuery will generate all the aliases for you and automatically prefix them with the table names.
A little clarification would help. If all 16 tables have the same fields and you want them in a continuous list, you can use UNION as suggested above. On the other hand, if there are only a few fields that match and you want to compare the values for each table side-by-side, you'll want to use joins and provide aliases with the table names, as also suggested above.
However, looking at the snippet of code you've provided, I'm going to guess that you're either building some kind of stored procedure or else implementing SQL in some other language. If that's the case, how about loading your table names into an array and using a for loop to build the query, such as the following pseudo-code:
tableList = ["table1", "table2"...]
fieldnames = ["field1", "field2"...]
query = "SELECT "
for i = 0 to count(tableList):
for j = 0 to count(fieldnames):
query = query + tablelist[i] + "." + fieldnames[j] + ", "
j++
i++
query = query + "FROM "
for i = 0 to count(tableList):
query = query + tableList[i] + ", "
i++
query = query + "WHERE " ...
And so forth. Much of this depends on what exactly you're looking to do, how often you're looking to do it, and how often the variables (like which tables or fields you're using) are going to change.

Look up a table having 1 million records using Hibernate

My database table (geo IP lookup) has 7 columns, of which 2 columns constitute the <composite-id>.
Now when I look up a value using the first 2 columns, it takes 12-14 seconds to fetch a record.
My DAO code looks like this:
String queryString = "from Igeo igeo where igeo.ip_from <= " + ip
+ "and igeo.ip_to >= " + ip;
Query q = session.createQuery(queryString);
List<Igeo> igeoList = q.list();
if(igeoList.size() > 0){
Igeo igeo = igeoList.get(0);
ISP = igeo.getIsp();
...
...
}
*Igeo = class in Java representing the table
**A record is fetched when ip lies between the values of the composite-id columns, e.g.
ip_from = 1; ip_to = 3; ip = 2;
so the above row will be returned.
This table is only used for reading records; please suggest a queryString which is more efficient than the above.
First, remove Hibernate and run your query in a query browser to see how long it takes to return. If it takes the same amount of time, it's not Hibernate; it's the performance of the database. Make sure you add indexes on the two columns ip_from and ip_to. You can also look at the query plan to see what the database is running under the hood and try to optimize it.
I would suggest NOT using concatenation in your query as you are doing. That creates a security hole allowing potential SQL injection by outside parties. It's better to use the following:
Query q = session.createQuery("from Igeo igeo where igeo.from_ip >= ? and igeo.to_ip <= ?");
q.setString( 0, ip );
q.setString( 1, ip );
You could also use named parameters, which might shorten it up a bit more.
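For example, a sketch of the same query with a named parameter (entity and property names as in the question):
Query q = session.createQuery(
        "from Igeo igeo where igeo.ip_from <= :ip and igeo.ip_to >= :ip");
q.setParameter("ip", ip);      // one binding covers both occurrences of :ip
List<Igeo> igeoList = q.list();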
If the table Igeo does not contain overlapping ranges of ip_from and ip_to, you might try this:
String queryString = "FROM Igeo igeo"
+ " WHERE igeo.ip_to >= " + ip
+ " ORDER BY igeo.ip_to";
Then check the first item in the list (to see that ip_from <= ip).
Even if the table could contain overlapping ranges of ip_from, ip_to, I bet the above HQL will run faster.
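A sketch of that pattern, assuming ip is held as a numeric value and the entity exposes a getIpFrom() accessor (both are assumptions, not taken from the question):
Query q = session.createQuery(
        "from Igeo igeo where igeo.ip_to >= :ip order by igeo.ip_to");
q.setParameter("ip", ip);
q.setMaxResults(1);                            // only the first candidate is needed
List<Igeo> igeoList = q.list();
if (!igeoList.isEmpty() && igeoList.get(0).getIpFrom() <= ip) {
    Igeo igeo = igeoList.get(0);               // ip falls inside [ip_from, ip_to]
}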
<Aside> You really should not concatenate a raw string like "ip" into HQL or SQL. It leads to SQL Injection Attack vulnerabilities. Use a query parameter instead</Aside>
Also, verify that your database has an index on the column corresponding to Igeo.ip_to.
It sounds to me from your description that the database has a primary key of Igeo.ip_from + Igeo.ip_to. If the values of ip_from and ip_to are not overlapping, that does not seem to be normalized; you should need only a single column for the primary key. If you have chosen to use both columns as a primary key, the above query will benefit from adding a single index.
Some databases will perform better if you add an index containing all the columns in the table, starting with ip_to and ip_from. (This enables the database to satisfy the query by accessing only the index). Not sure if MySQL can optimize to this extent, but I know DB2 and Oracle will provide this.
