I have a Cassandra CQL3 column family with the following structure
CREATE TABLE mytable(
A text,
B text,
C text,
mymap map<text,text>,
D text,
PRIMARY KEY (A,B,C)
);
I am trying to insert a bunch of data into it using Astyanax.
The version of Cassandra that I am working with is 1.2, so I can't use BATCH insert.
I know that I can run CQL3 commands in a for loop using Prepared Statements.
I wanted to know whether it's possible to use an Astyanax mutation batch to insert the data into the above column family. I realize that this would go through Astyanax's Thrift interface to write into a CQL3 column family, but for the sake of write performance, is this a viable option?
I took a look at the structure of the column family in cassandra-cli and it looks something like this
ColumnFamily: mytable
Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Default column value validator: org.apache.cassandra.db.marshal.BytesType
Cells sorted by: org.apache.cassandra.db.marshal.CompositeType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.ColumnToCollectionType(6974656d73:org.apache.cassandra.db.marshal.MapType(org.apache.cassandra.db.marshal.UTF8Type,org.apache.cassandra.db.marshal.UTF8Type)))
While I can insert data into the other columns (i.e. A, B, C, D) by creating a POJO with @Component on the various fields, I'm not sure how to go about dealing with the map insert, i.e. inserting into the mymap column.
A sample POJO that I created is
public class TestColumn {

    @Component(ordinal = 0)
    String bComponent;

    @Component(ordinal = 1)
    String cComponent;

    @Component(ordinal = 2)
    String columnName;

    public TestColumn() {
    }

    public TestColumn(String bComponent, String cComponent, String columnName) {
        this.bComponent = bComponent;
        this.cComponent = cComponent;
        this.columnName = columnName;
    }
}
The insertion code is as follows
AnnotatedCompositeSerializer<TestColumn> columnSerializer =
        new AnnotatedCompositeSerializer<TestColumn>(TestColumn.class);
ColumnFamily<String, TestColumn> columnFamily =
        new ColumnFamily<String, TestColumn>("mytable", StringSerializer.get(), columnSerializer);

final MutationBatch m = keyspace.prepareMutationBatch();
ColumnListMutation<TestColumn> columnListMutation = m.withRow(columnFamily, "AVal");
columnListMutation.putColumn(new TestColumn("BVal", "CVal", null), ByteBufferUtil.EMPTY_BYTE_BUFFER,
        timeToLive);
columnListMutation.putColumn(new TestColumn("BVal", "CVal", "D"), "DVal",
        timeToLive);
m.execute();
How exactly should I modify the above code so that I can insert the map value as well?
We solved this by using the DataStax Java Driver instead of Astyanax.
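For reference, a rough sketch of the kind of insert that becomes straightforward with the driver. This assumes the older (1.x/2.x) driver API; the contact point, keyspace name, and bound values are placeholders, not taken from our actual code:

import java.util.HashMap;
import java.util.Map;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("mykeyspace");   // keyspace name is a placeholder

// One prepared statement covers the regular columns and the map column.
PreparedStatement ps = session.prepare(
        "INSERT INTO mytable (a, b, c, mymap, d) VALUES (?, ?, ?, ?, ?)");

Map<String, String> mapValue = new HashMap<String, String>();
mapValue.put("key1", "value1");

// The driver serializes the java.util.Map directly into the map<text,text> column.
session.execute(ps.bind("AVal", "BVal", "CVal", mapValue, "DVal"));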
I'm using Spark and trying to write an RDD to an HBase table.
Here is the sample code:
public static void main(String[] args) {
    // ... code omitted
    JavaPairRDD<ImmutableBytesWritable, Put> hBasePutsRDD = rdd
            .javaRDD()
            .flatMapToPair(new MyFunction());
    hBasePutsRDD.saveAsNewAPIHadoopDataset(job.getConfiguration());
}

private static class MyFunction implements
        PairFlatMapFunction<Row, ImmutableBytesWritable, Put> {

    public Iterable<Tuple2<ImmutableBytesWritable, Put>> call(final Row row)
            throws Exception {
        Put put = new Put(getRowKey(row));
        String value = row.getAs("rddFieldName");
        put.addColumn("CF".getBytes(Charset.forName("UTF-8")),
                "COLUMN".getBytes(Charset.forName("UTF-8")),
                value.getBytes(Charset.forName("UTF-8")));
        return Collections.singletonList(
                new Tuple2<>(new ImmutableBytesWritable(getRowKey(row)), put));
    }
}
If I manually set the timestamp like this:
put.addColumn("CF".getBytes(Charset.forName("UTF-8")),
"COLUMN".getBytes(Charset.forName("UTF-8")),
manualTimestamp,
value.getBytes(Charset.forName("UTF-8")));
everything works fine and I have as many cell versions in the HBase column "COLUMN" as there are distinct values in the RDD.
But if I do not, there is only one cell version.
In other words, if there are multiple Put objects with the same column family and column, different values, and the default timestamp, only one value is inserted and the others are omitted (or overwritten).
Could you please help me understand how this works (saveAsNewAPIHadoopDataset in particular) in this case, and how I can modify the code so that all values are inserted without my setting a timestamp manually?
They are overwritten when you don't set your own timestamp. HBase needs a unique key for every value, so the real key for every value is
rowkey + column family + column qualifier + timestamp => value
When you don't set a timestamp and the Puts are inserted in bulk, many of them get the same timestamp, because HBase can insert multiple rows within the same millisecond. So you need a distinct timestamp for every Put that writes the same row key and column.
I don't understand why you don't want to use a custom timestamp, since you said it already works. If you are worried that it will use extra space in the database: HBase stores a timestamp anyway, even if you don't pass one in the Put command. So nothing changes when you set it manually, please use it.
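For illustration, a minimal sketch of one way to hand out distinct timestamps per Put. The helper and its counter are assumptions, not part of your code, and the column family/qualifier names are just the placeholders from the question; note that in a distributed Spark job each executor would have its own counter, so this only avoids collisions within one executor:

import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.hadoop.hbase.client.Put;

// Hypothetical helper: seed a counter with the current time and hand out a
// strictly increasing timestamp for every Put created in this JVM.
private static final AtomicLong TS = new AtomicLong(System.currentTimeMillis());

private static Put putWithDistinctTimestamp(byte[] rowKey, String value) {
    Put put = new Put(rowKey);
    put.addColumn("CF".getBytes(StandardCharsets.UTF_8),
            "COLUMN".getBytes(StandardCharsets.UTF_8),
            TS.incrementAndGet(),
            value.getBytes(StandardCharsets.UTF_8));
    return put;
}

With something like this, each call in MyFunction produces a Put carrying its own timestamp, so multiple values for the same cell end up as separate versions.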
I am using Play Framework for the first time and I need to link objects of the same type. In order to do so I have added a self-referencing many-to-many relationship like this:
@ManyToMany(cascade=CascadeType.ALL)
@JoinTable(name="journal_predecessor", joinColumns={@JoinColumn(name="journal_id")}, inverseJoinColumns={@JoinColumn(name="predecessor_id")})
public List<Journal> journalPredecessor = new ArrayList<Journal>();
I obtain the table journal_predecessor which contains the two columns: journal_id and predecessor_id, both being FKs pointing to the primary key of the table journal.
My question is: how can I query this table using raw queries if I am using an H2 in-memory database? Thanks!
Actually it was very easy. I just needed to create an instance of SqlQuery to create a raw query:
SqlQuery rawQuery = Ebean.createSqlQuery("SELECT journal_id from journal_predecessor where journal_id=" + successorId + " AND predecessor_id=" + predecessorId);
And because I just needed to check whether a row exists or not, I take the size of the set of results returned by the query:
Set<SqlRow> sqlRow = rawQuery.findSet();
int rowExists = sqlRow.size();
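If you would rather not build the SQL by string concatenation, a sketch of the same check using Ebean's named parameters (the parameter names here are just placeholders) could look like this:

SqlQuery rawQuery = Ebean.createSqlQuery(
        "SELECT journal_id FROM journal_predecessor "
        + "WHERE journal_id = :jid AND predecessor_id = :pid");
rawQuery.setParameter("jid", successorId);
rawQuery.setParameter("pid", predecessorId);

boolean rowExists = !rawQuery.findSet().isEmpty();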
I am working on a project in which I need to delete all the columns and their data, except for one column and its data, in Cassandra using the Astyanax client.
I have a dynamic column family like the one below, and we already have a couple of million records in that column family.
create column family USER_TEST
with key_validation_class = 'UTF8Type'
and comparator = 'UTF8Type'
and default_validation_class = 'UTF8Type'
and gc_grace = 86400
and column_metadata = [ {column_name : 'lmd', validation_class : DateType}];
I have user_id as the row key, and the other columns are something like this:
a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,lmd
Now I need to delete all the columns and their data except for the a15 column. Meaning, I want to keep the a15 column and its data for every user_id (row key) and delete the rest of the columns and their data.
I already know how to delete data from Cassandra using the Astyanax client for a particular row key:
public void deleteRecord(final String rowKey) {
try {
MutationBatch m = AstyanaxConnection.getInstance().getKeyspace().prepareMutationBatch();
m.withRow(AstyanaxConnection.getInstance().getEmp_cf(), rowKey).delete();
m.execute();
} catch (ConnectionException e) {
// some code
} catch (Exception e) {
// some code
}
}
Now, how do I delete all the columns and their data except for one column, for all the user ids (my row key)?
Any thoughts on how this can be done efficiently using the Astyanax client?
It appears that Astyanax does not currently support the slice delete functionality that is a fairly recent addition to both the storage engine and the Thrift API. If you look at the Thrift API reference: http://wiki.apache.org/cassandra/API10
You see that the delete operation takes a SlicePredicate, which can take either a list of columns or a SliceRange. A SliceRange could specify all columns greater than or less than the column you wanted to keep, so that would allow you to do two slice delete operations to delete all but one of the columns in the row.
Unfortunately, Astyanax only has the ability to delete an entire row or a defined list of columns, and doesn't wrap the full SlicePredicate functionality. So it looks like you have a few options:
1) See about sending a raw thrift slice delete, bypassing Astyanax wrapper, or
2) Do a column read, followed by a row delete, followed by a column write. This is not ideally efficient, but if it isn't done too frequently shouldn't be prohibitive.
or
3) Read the entire row and explicitly delete all of the columns other than the one you want to preserve (a rough sketch of this follows below).
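For option 3, a rough Astyanax sketch. USER_TEST_CF is assumed to be a ColumnFamily<String, String> definition for the column family above, and error handling is omitted:

// Read the whole row, then delete every column except "a15".
ColumnList<String> columns = keyspace.prepareQuery(USER_TEST_CF)
        .getKey(rowKey)
        .execute()
        .getResult();

MutationBatch m = keyspace.prepareMutationBatch();
ColumnListMutation<String> rowMutation = m.withRow(USER_TEST_CF, rowKey);
for (Column<String> column : columns) {
    if (!"a15".equals(column.getName())) {
        rowMutation.deleteColumn(column.getName());
    }
}
m.execute();

For a couple of million rows you would iterate over the row keys (for example with Astyanax's AllRowsReader recipe) and batch these deletions, which is exactly the read-then-delete cost mentioned above.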
I should note that while the storage engine and thrift API both support slice deletes, this is also not yet explicitly supported by CQL.
I filed this ticket to address that last limitation:
https://issues.apache.org/jira/browse/CASSANDRA-6292
In MySQL, if you specify ON DUPLICATE KEY UPDATE and a row is inserted that would cause a duplicate value in a UNIQUE index or PRIMARY KEY, an UPDATE of the old row is performed. For example, if column a is declared as UNIQUE and contains the value 1, the following two statements have identical effect:
INSERT INTO table (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=c+1;
UPDATE table SET c=c+1 WHERE a=1;
I don't believe I've come across anything of the like in T-SQL. Does SQL Server offer anything comparable to MySQL's ON DUPLICATE KEY UPDATE?
I was surprised that none of the answers on this page contained an example of an actual query, so here you go:
A more complex example of inserting data and then handling a duplicate:
MERGE
INTO MyBigDB.dbo.METER_DATA WITH (HOLDLOCK) AS target
USING (SELECT
          77748 AS rtu_id
         ,'12B096876' AS meter_id
         ,56112 AS meter_reading
         ,'20150602 00:20:11' AS time_local) AS source
      (rtu_id, meter_id, meter_reading, time_local)
ON (target.rtu_id = source.rtu_id
    AND target.time_local = source.time_local)
WHEN MATCHED
    THEN UPDATE
        SET meter_id = source.meter_id
           ,meter_reading = source.meter_reading
WHEN NOT MATCHED
    THEN INSERT (rtu_id, meter_id, meter_reading, time_local)
         VALUES (source.rtu_id, source.meter_id, source.meter_reading, source.time_local);
There's no ON DUPLICATE KEY UPDATE equivalent, but MERGE with WHEN MATCHED might work for you.
Inserting, Updating, and Deleting Data by Using MERGE
You can try the other way around. It does the same thing more or less.
UPDATE tablename
SET field1 = 'Test1',
field2 = 'Test2'
WHERE id = 1
IF @@ROWCOUNT = 0
INSERT INTO tablename
(id,
field1,
field2)
VALUES (1,
'Test1',
'Test2')
SQL Server 2008 has this feature, as part of T-SQL.
See documentation on MERGE statement here - http://msdn.microsoft.com/en-us/library/bb510625.aspx
SQL Server 2000 onwards has the concept of INSTEAD OF triggers, which can accomplish the wanted functionality - although there will be a nasty trigger hiding behind the scenes.
Check the section "Insert or update?"
http://msdn.microsoft.com/en-us/library/aa224818(SQL.80).aspx
Is there some way we can group similar data in Java?
I want to group all the data with the same id and print it out.
I am querying for the data using JDBC and was searching for a library I could use for this.
Any ideas?
Thanks
Use a Map<GroupID, List<Data>>.
Map<Long, List<Data>> groups = new HashMap<Long, List<Data>>();
while (resultSet.next()) {
Long groupId = resultSet.getLong("groupId");
String col1 = resultSet.getString("col1");
String col2 = resultSet.getString("col2");
// ...
List<Data> group = groups.get(groupId);
if (group == null) {
group = new ArrayList<Data>();
groups.put(groupId, group);
}
group.add(new Data(groupId, col1, col2 /* ... */));
}
You could also just make it a property of another (parent) bean.
See also:
Collections and Maps tutorial
Ideally you should use a where clause in your SQL query to limit the returned data to the id in question:
select *
from table
where id = 'xxxxxx'
Of course, if you will be printing out the data for all ids, this may be a bad choice, as your app will then perform multiple SQL queries, which will usually result in a performance hit.
As for grouping data in Java, take a look at java.util.HashMap (or any of the container classes that implement the Map interface). HashMap is a container of key-value pairs. In your case, the 'key' can be a String (or whichever data type applies) representing your id, and the 'value' can be an object containing the data associated with that id key (e.g. an ArrayList of Strings, or a new class you define to help you manage the data).
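As a small illustration of that idea (the Record class, its fields, and the printing are hypothetical placeholders, not from your code):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical value class holding one row's data.
class Record {
    final String id;
    final String payload;
    Record(String id, String payload) { this.id = id; this.payload = payload; }
}

// Group records by id, then print the size of each group.
Map<String, List<Record>> byId = new HashMap<String, List<Record>>();
for (Record r : records) {                  // 'records' built from your JDBC ResultSet
    List<Record> group = byId.get(r.id);
    if (group == null) {
        group = new ArrayList<Record>();
        byId.put(r.id, group);
    }
    group.add(r);
}
for (Map.Entry<String, List<Record>> entry : byId.entrySet()) {
    System.out.println(entry.getKey() + ": " + entry.getValue().size() + " rows");
}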
Are you looking for the SQL ORDER BY clause?
SELECT columns
FROM table
WHERE criteria
ORDER BY id ASC;
That will give you all the data in your criteria and will order it by the id column which naturally means that all the rows with the same id will appear consecutively.