insert JSON String in Apache-Cassandra 0.8.2 - java

Does anybody know an easy way to insert a JSON string into Cassandra?
Suppose I have a JSON string like this: {'key1':'val1','key2':'val2'}
In MongoDB we can directly insert a JSON string, like dbobj.insert(jsonstring);
So is there any way to do this in Cassandra? (I am coding in Java.)

There are at least 3 ways, but it depends on what you are trying to achieve and what kinds of queries you want to run.
You could store the JSON string as just a plain string/bytes in a single Cassandra column (assuming there is something you can use as the row key). You won't be able to run queries based on the JSON content, though; this would be opaque data that you process client-side.
You could split up the JSON before storage, so that key1, key2 are column names and val1, val2 are the corresponding column values. Again, you'd need something to use as a row key. This method would let you retrieve individual values, and use secondary indexes to retrieve rows with particular values.
You could even use key1, key2 as row keys, with val1, val2 as column names. Given that you have the key-val pairs grouped in JSON, they presumably belong to the same entity and are related, so this is unlikely to be useful, but I mention it for completeness.
Edited to add: If your question is actually how to insert data into Cassandra at all, then you should read the docs for a Java client such as Hector (there are other options too - see http://wiki.apache.org/cassandra/ClientOptions)
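For illustration, here is a minimal Hector sketch of option 2, splitting the JSON into individual columns. The cluster, keyspace and column family names, and the use of org.json for parsing, are assumptions for the example, not part of the question:
import java.util.Iterator;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import org.json.JSONObject;

public class JsonToColumns {
    public static void main(String[] args) throws Exception {
        // Hypothetical cluster/keyspace/column family names - adjust to your schema.
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", "localhost:9160");
        Keyspace keyspace = HFactory.createKeyspace("MyKeyspace", cluster);
        String columnFamily = "MyData";

        String json = "{\"key1\":\"val1\",\"key2\":\"val2\"}";
        JSONObject obj = new JSONObject(json);

        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        String rowKey = "row-1"; // something that identifies this entity

        // One Cassandra column per JSON key/value pair.
        Iterator<String> keys = obj.keys();
        while (keys.hasNext()) {
            String key = keys.next();
            mutator.addInsertion(rowKey, columnFamily,
                    HFactory.createStringColumn(key, obj.get(key).toString()));
        }
        mutator.execute();
    }
}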

You can try this (the JSON insert syntax requires Cassandra 2.2 or later):
INSERT INTO users JSON '{"id": "user123", "age": 42, "state": "TX"}';
Reference:
http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-2-json-support
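And, if it helps, a hedged sketch of running that statement from Java with the DataStax driver (3.x style); the contact point and keyspace name below are assumptions:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class JsonInsert {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {
            // Works only against Cassandra 2.2+, where INSERT ... JSON is available.
            session.execute(
                "INSERT INTO users JSON '{\"id\": \"user123\", \"age\": 42, \"state\": \"TX\"}'");
        }
    }
}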

Related

Which datastore to use when you have an unbounded (dynamic) number of fields/attributes for an entity?

I am designing a system where I have a fixed set of attributes (an entity) and then some dynamic attributes per client.
e.g. customer_name, customer_id, etc. are common attributes,
whereas order_id, patient_number, date_of_joining, etc. are dynamic attributes.
I have read about EAV (entity-attribute-value) being an anti-pattern. I wish to use a combination of MySQL and a NoSQL datastore for complex queries. I already use Elasticsearch.
I cannot let the mapping explode with an unlimited number of fields, so I have devised the following model:
MySQL:
customer, custom_attribute, custom_attribute_mapping, custom_attribute_value
Array of nested documents in Elasticsearch:
[{
  "field_id": 123,
  "field_type": "date",
  "value": "01/01/2020" // mapping type "date" - looked up from the MySQL table at insert time
}, ...]
I cannot use flattened mappings in ES, as I wish to run range queries on the custom fields as well.
Is there a better way to do it? Or an obvious choice of another database that I am too naive to see?
If I need to modify the question to add more info, I'd welcome the feedback.
P.S.: I will have a lot of data (on the order of tens of millions of records).
Why not use something like MongoDB as a pure NoSQL database?
Or, as a less popular solution, I would recommend a triple store such as Virtuoso or any similar one. You can then use SPARQL as the query language over it, and there are many drivers for such stores, e.g. Jena for Java.
Triple stores allow you to store data in the format <subject predicate object>,
where in your case the subject is the customer id, the predicates are the attributes, and the object is the value. All standard and dynamic attributes will be in the same table.
Triple stores can be modeled as a three-column table in any database management system.
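To make the shape concrete, here is a small Jena (3.x style) sketch of the <subject predicate object> idea; the URIs, namespace and attribute names are made up for illustration, and a real setup would point at a store such as Virtuoso rather than an in-memory model:
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

public class TripleSketch {
    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.com/attrs/";

        // subject = the customer, predicate = the attribute, object = the value
        Resource customer = model.createResource("http://example.com/customer/123");
        Property customerName = model.createProperty(ns, "customer_name"); // common attribute
        Property orderId = model.createProperty(ns, "order_id");           // dynamic attribute, same "table"

        customer.addProperty(customerName, "Alice");
        customer.addProperty(orderId, "ORD-42");

        model.write(System.out, "TURTLE"); // dump the triples
    }
}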

Filtering bounded data in Dataflow based on timestamp

In my Dataflow pipeline, I'll have two PCollection<TableRow>s that have been read from BigQuery tables. I plan to merge those two PCollections into one PCollection with a Flatten.
Since BigQuery is append-only, the goal is to write-truncate the second table in BigQuery with the new PCollection.
With my new PCollection, the plan is to use a Comparator DoFn to look at the max last update date and return the given row. I'm unsure whether I should be using a Filter transform or whether I should be doing a GroupByKey and then using a filter.
All PCollection<TableRow>s will contain the same values, i.e. string, integer and timestamp. When it comes to key-value pairs, most of the documentation on Cloud Dataflow uses just simple strings. Is it possible to have a key-value pair that is the entire row of the PCollection<TableRow>?
The rows would look similar to:
customerID, customerName, lastUpdateDate
0001, customerOne, 2016-06-01 00:00:00
0001, customerOne, 2016-06-11 00:00:00
In the example above, I would want to filter the PCollection down to just the second row, into a PCollection that would be written to BigQuery. Also, is it possible to apply these ParDos on the third PCollection without creating a fourth?
You've asked a few questions. I have tried to answer them in isolation, but I may have misunderstood the whole scenario. If you provided some example code, it might help to clarify.
With my new PCollection, the plan is to use a Comparator DoFn to look at the max last update date and return the given row. I'm unsure whether I should be using a Filter transform or whether I should be doing a GroupByKey and then using a filter.
Based on your description, it seems that you want to take a PCollection of elements and for each customerID (the key) find the most recent update to that customer's record. You can use the provided transforms to accomplish this via Top.largestPerKey(1, timestampComparator) where you set up your timestampComparator to look only at the timestamp.
Is it possible to have a key-value pair that is the entire row of the PCollection?
A KV<K, V> can have any type for the key (K) and value (V). If you want to group by key, then the coder for the keys needs to be deterministic. TableRowJsonCoder is not deterministic, because a TableRow may contain arbitrary objects. But it sounds like you want the customerID as the key and the entire TableRow as the value.
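A rough Beam-style sketch of both pieces (the older Dataflow 1.x SDK uses slightly different package names and DoFn annotations); the field names come from the example rows above, everything else is an assumption:
// merged is the flattened PCollection<TableRow> produced by the Flatten of the two BigQuery reads.
// Key each TableRow by customerID, then keep the row with the largest lastUpdateDate per key.
PCollection<KV<String, TableRow>> keyed = merged.apply("KeyByCustomer",
    ParDo.of(new DoFn<TableRow, KV<String, TableRow>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        TableRow row = c.element();
        c.output(KV.of((String) row.get("customerID"), row));
      }
    }));

// Comparator ordering rows by lastUpdateDate; the "yyyy-MM-dd HH:mm:ss" strings in the
// example sort correctly as plain strings, so lexicographic comparison is enough here.
class ByUpdateDate implements Comparator<TableRow>, Serializable {
  @Override
  public int compare(TableRow a, TableRow b) {
    return ((String) a.get("lastUpdateDate")).compareTo((String) b.get("lastUpdateDate"));
  }
}

// Top.perKey(1, comparator) yields KV<customerID, List<TableRow>> holding just the latest row.
PCollection<KV<String, List<TableRow>>> latestPerCustomer =
    keyed.apply("LatestPerCustomer",
        Top.<String, TableRow, ByUpdateDate>perKey(1, new ByUpdateDate()));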
is it possible to apply these ParDos on the third PCollection without creating a fourth?
When you apply a PTransform to a PCollection, it results in a new PCollection. There is no way around that, and you don't need to try to minimize the number of PCollections in your pipeline.
A PCollection is a conceptual object; it does not have intrinsic cost. Your pipeline is going to be heavily optimized so that many intermediate PCollections - especially those in a sequence of ParDo transforms - will never be materialized anyhow.

How to store a HashMap<Enum,String> in a single row where each key of the map corresponds 1-to-1 to a column name in that row?

Let's say we have a legacy SQL table with more than 50 columns,
with a different representation in the model:
id, timestamp (stored as separate fields)
column_1, column_2, ..., column_51 (stored as a single map)
I would like to avoid generating a field in the Java code for each column from column_1 to column_51. I would rather use a HashMap with an enumeration as keys, named the same as the columns.
I would like to store and read the map from the table without boilerplate code for storing/reading the attribute map. Instead, I would like to read/write the map in one step.
PS:
MyBatis had parameterMap, which would be good enough for this purpose, but it is now deprecated.
Using any deprecated or alpha-stage API is not an option.
Changing database is not an option.
Just do a standard SQL query and then scan through the returned column metadata.
I can't remember the exact API off hand but you can query the column names from the result set, something like:
ResultSetMetaData metadata = results.getMetaData();
Map<String, String> map = new HashMap<>();
for (int i = 1; i <= metadata.getColumnCount(); i++) {  // JDBC column indexes are 1-based
    map.put(metadata.getColumnName(i), results.getString(i));
}
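If you specifically want the enum-keyed map from the question rather than string keys, a small variation (assuming a hypothetical enum MyColumn whose constants match the column names, e.g. COLUMN_1 ... COLUMN_51) would be:
// Sketch only: MyColumn is a hypothetical enum mirroring the table's column names.
ResultSetMetaData md = results.getMetaData();
Map<MyColumn, String> row = new EnumMap<>(MyColumn.class);
for (int i = 1; i <= md.getColumnCount(); i++) {
    // valueOf assumes the enum constants use the same (upper-cased) names as the columns
    row.put(MyColumn.valueOf(md.getColumnName(i).toUpperCase()), results.getString(i));
}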

Parsing an Apache HttpClient CSV-like string into objects

I'm trying to parse data obtained via Apache HttpClient in the fastest and most efficient way possible.
The data returned in the response is a string, but in a CSV-like format:
e.g. the String looks like this:
date, price, status, ...
2014-02-05, 102.22, OK,...
2014-02-05, NULL, OK
I thought about taking the string and manually parsing it, but this may be too slow as I have to do this for multiple requests.
Also, the data returned is about 23,000 lines from one source, and I may potentially have to parse several sources.
I'm also storing the data in a hash map of type:
Map<String, Map<String, MyObject>>
where the key is the source name, and the value is a map containing the parsed objects.
So I have two questions: what is the best way to parse a 23,000-line file into objects, and what is the best way to store them?
I tried a CSV parser; however, doubles that are not present come through as NULL rather than 0, so I will need to handle them manually.
Thanks
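A minimal sketch of the "handle missing doubles manually" approach using Apache Commons CSV; the library choice, the header names (from the example above) and the MyObject constructor are assumptions:
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

public class CsvResponseParser {

    // The feed uses the literal NULL for missing doubles; map those to 0.
    static double parseDoubleOrZero(String s) {
        return (s == null || s.trim().isEmpty() || "NULL".equalsIgnoreCase(s.trim()))
                ? 0d : Double.parseDouble(s.trim());
    }

    public static Map<String, MyObject> parse(String responseBody) throws Exception {
        Map<String, MyObject> parsed = new HashMap<>();
        Iterable<CSVRecord> records = CSVFormat.DEFAULT
                .withFirstRecordAsHeader()
                .withIgnoreSurroundingSpaces()
                .parse(new StringReader(responseBody));
        for (CSVRecord record : records) {
            String date = record.get("date");
            double price = parseDoubleOrZero(record.get("price"));
            String status = record.get("status");
            // MyObject(date, price, status) is a placeholder for however MyObject is built.
            parsed.put(date, new MyObject(date, price, status));
        }
        return parsed;
    }
}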

How to add arbitrary columns to Cassandra using CQL with Datastax Java driver?

I have recently started taking an interest in CQL, as I am thinking of using the DataStax Java driver. Previously, I was using a column family instead of a table, with the Astyanax driver. I need to clarify something here:
I am using the column family definition below in my production cluster, and I can insert arbitrary columns (with their values) on the fly without actually modifying the column family schema.
create column family FAMILY_DATA
with key_validation_class = 'UTF8Type'
and comparator = 'UTF8Type'
and default_validation_class = 'BytesType'
and gc_grace = 86400;
But after going through this post, it looks like I need to alter the schema every time I get a new column to insert, which is not what I want to do, as I believe CQL3 requires column metadata to exist.
Is there any other way I can still add arbitrary columns and their values if I go with the DataStax Java driver?
Any code samples/examples will help me to understand better. Thanks.
I believe in CQL you solve this problem using collections.
You can define the data type of a field to be a map, and then insert an arbitrary number of key-value pairs into the map; that should mostly behave the way dynamic columns did in traditional Thrift.
Something like:
CREATE TABLE data ( data_id int PRIMARY KEY, data_time bigint, data_values map<text, float> );
INSERT INTO data (data_id, data_time, data_values) VALUES (1, 21341324, {'sum': 2134, 'avg': 44.5});
Here is more information.
Additionally, you can find the mapping between the CQL3 types and the Java types used by the DataStax driver here.
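A hedged sketch of writing to such a map column from the DataStax Java driver (prepared statement, driver 2.x/3.x style); the contact point, keyspace name and the map's value type are assumptions:
import java.util.HashMap;
import java.util.Map;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class MapInsert {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {
            PreparedStatement ps = session.prepare(
                "INSERT INTO data (data_id, data_time, data_values) VALUES (?, ?, ?)");
            Map<String, Float> values = new HashMap<>();
            values.put("sum", 2134f);  // the "dynamic column" names simply become map keys
            values.put("avg", 44.5f);
            session.execute(ps.bind(1, 21341324L, values));
        }
    }
}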
If you enable compact storage for that table, it will be backwards compatible with Thrift and CQL 2.0, both of which allow you to enter dynamic column names.
You can have as many columns, with whatever names you want, with this approach. The primary key is composed of two things: the first element, which is the row key, and the remaining elements, which combined form a single column name.
See the tweets example here
Though, as you've said this is in production already, it may not be possible to alter a table with existing data to use compact storage.
