What to use to store serialized data that can be queried? - java

I need to extract data from an incoming message that could be in any format. The extracted data to store also depends on the format, i.e. format A could yield fields X, Y, Z, while format B could yield fields A, B, C. I also need to be able to find a message of format B by searching for field C within the message.
Right now I'm configuring and storing the extraction strategy (XSLT) and executing it at runtime when its related format is encountered, and I'm storing the extracted data in an Oracle database as an XmlType column. Oracle seems to have pretty lax development/support for XmlType: it requires an old jar that forces you to use a pretty old DOM DocumentBuilderFactory implementation (it looks like Java 1.4 code), which collides with Spring 3 and doesn't play very nicely with Hibernate. The XML queries are slow and non-intuitive as well.
I'm concluding that Oracle with XmlType isn't a very good way to store the extracted data, so my question is, what is the best way to store the serialized/queryable data?
NoSQL (Cassandra, CouchDB, MongoDB, etc.)?
A JCR like JackRabbit?
A blob with manual de/serialization?
Another Oracle solution?
Something else??

One alternative that you haven't listed is using an XML database. (Notice that Oracle is one of the ten or so XML database products.)
(Obviously, a blob type won't allow querying "inside" the persisted XML objects unless you read each blob instance into memory and do the querying there; e.g. using XSLT.)
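As a rough illustration of that last point, here is a minimal sketch (the helper name and sample document are made up) of querying a blob-stored XML document in memory with the JDK's built-in XPath support; an XSLT transform would follow the same read-parse-evaluate pattern:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class BlobXmlQuery {

    // Evaluate an XPath expression against XML bytes read from a blob column.
    // Every blob has to be parsed in memory, which is why this approach does
    // not scale well when many rows must be searched.
    static String queryBlob(byte[] xmlBytes, String xpathExpr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(xpathExpr, doc);
    }

    public static void main(String[] args) throws Exception {
        byte[] blob = "<test><name>Peter</name></test>".getBytes("UTF-8");
        System.out.println(queryBlob(blob, "/test/name")); // prints "Peter"
    }
}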

I have had great success in storing complex xml objects in PostgreSQL. Together with the functional index features, you can even create indexes on node values of the stored xml files, and use those indexes to do very fast lookups using index scans without having to reparse the XML file.
This will only work, however, if you know your query patterns; arbitrary xpath queries will still be slow.
Example (untested):
Create a simple table:
create table test123 (
    id serial primary key,
    myxml xml
);
Now let's assume you have xml documents like:
<test>
<name>Peter</name>
<info>Peter is a <i>very</i> good cook</info>
</test>
Now create a functional index on the name node:
create index idx_test123_name
    on test123 ( ((xpath('/test/name/text()', myxml))[1]::text) );
Now do your fast xml lookups:
SELECT myxml FROM test123
WHERE (xpath('/test/name/text()', myxml))[1]::text = 'Peter';
You should also consider creating the index using text_pattern_ops, so you can have fast prefix lookups like:
SELECT myxml FROM test123
WHERE (xpath('/test/name/text()', myxml))[1]::text like 'Pe%';
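Queried from Java over JDBC (the question is Java-based), a lookup against that expression index could look roughly like the sketch below; the connection details are placeholders, and the WHERE clause repeats the index expression so the planner can use the index:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class XmlLookup {
    public static void main(String[] args) throws Exception {
        // Connection URL and credentials are placeholders.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
             PreparedStatement ps = con.prepareStatement(
                 // Same expression as the index definition above.
                 "SELECT myxml FROM test123 " +
                 "WHERE (xpath('/test/name/text()', myxml))[1]::text = ?")) {
            ps.setString(1, "Peter");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("myxml"));
                }
            }
        }
    }
}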

Related

Which datastore to use when you have an unbounded (dynamic) number of fields/attributes for an entity?

I am designing a system where I have a fixed set of attributes (an entity) and then some dynamic attributes per client.
e.g. customer_name, customer_id etc are common attributes.
whereas order_id, patient_number, date_of_joining etc are dynamic attributes.
I read about EAV (entity-attribute-value) being an anti-pattern. I wish to use a combination of mysql and a nosql datastore for complex queries. I already use Elasticsearch.
I cannot let the mapping explode with an unlimited number of fields, so I have devised the following model:
mysql :
customer, custom_attribute, custom_attribute_mapping, custom_attribute_value
array of nested documents in elasticsearch :
[{
"field_id" :123,
"field_type" : "date",
"value" : "01/01/2020" // mapping type date - referred from mysql table at time on inserting data
}...]
I cannot use flattened mappings in ES, as I wish to use range queries on custom fields as well.
Is there a better way to do it? Or an obvious choice of another database that I am too naive to see?
If I need to modify the question to add more info, I'd welcome the feedback.
P.S. : I will have large data (order in 10s of millions of records)
Why not use something like MongoDB as a pure NoSQL database?
Or, as a less popular solution, I would recommend triple stores such as Virtuoso or other similar ones. You can then use SPARQL as a query language over them, and there are many drivers for such stores, e.g. Jena for Java.
Triple stores allow you to store data in the format <subject predicate object>,
where in your case the subject is the customer id, the predicates are the attributes, and the object is the value. All standard and dynamic attributes will be in the same table.
Triple stores can be modeled as 3 columns table in any database management system.
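As a rough illustration of the triple approach (not a definitive design), here is a minimal sketch using Apache Jena's in-memory model with a SPARQL lookup; the URIs and attribute names are invented for the example, and a real store such as Virtuoso would be reached through its SPARQL endpoint instead:

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;

public class CustomerTriples {
    public static void main(String[] args) {
        // Namespace and customer URI are placeholders for whatever vocabulary you define.
        String ns = "http://example.com/attr/";
        Model model = ModelFactory.createDefaultModel();

        // One triple per attribute: standard and dynamic attributes look identical.
        Resource customer = model.createResource("http://example.com/customer/42");
        customer.addProperty(model.createProperty(ns, "customer_name"), "Alice");
        customer.addProperty(model.createProperty(ns, "patient_number"), "P-100");

        // Look a customer up by a dynamic attribute with SPARQL.
        String sparql =
            "SELECT ?customer WHERE { ?customer <" + ns + "patient_number> \"P-100\" }";
        QueryExecution qe = QueryExecutionFactory.create(sparql, model);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                System.out.println(results.next().getResource("customer"));
            }
        } finally {
            qe.close();
        }
    }
}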

Is it possible to create a presto table using binary serialized objects as data format?

I am quite new to Presto and want to create some tables. My files are serialized objects, and I would like to know if I can skip the conversion to Parquet/ORC/CSV or another format by creating the table so that it reads serialized Java objects directly.
I have not tried anything
CREATE [EXTERNAL] TABLE
mydb.ser_objs [(col_name data_type [COMMENT col_comment] [, ...] )]
ROW FORMAT row_format
STORED AS (JAVA CLASS or something?)
LOCATION 'myloc'

How to add arbitrary columns to Cassandra using CQL with Datastax Java driver?

I have recently started taking much interest in CQL as I am thinking of using the Datastax Java driver. Previously, I was using a column family instead of a table, with the Astyanax driver. I need to clarify something here:
I am using the below column family definition in my production cluster, and I can insert any arbitrary columns (with their values) on the fly without actually modifying the column family schema.
create column family FAMILY_DATA
with key_validation_class = 'UTF8Type'
and comparator = 'UTF8Type'
and default_validation_class = 'BytesType'
and gc_grace = 86400;
But after going through this post, it looks like I need to alter the schema every time I get a new column to insert, which is not what I want to do, as I believe CQL3 requires column metadata to exist...
Is there any other way I can still add arbitrary columns and their values if I am going with the Datastax Java driver?
Any code samples/examples will help me to understand better. Thanks.
I believe in CQL you solve this problem using collections.
You can define the data type of a field to be a map, and then insert arbitrary numbers of key-value pairs into the map, which should mostly behave as dynamic columns did in traditional Thrift.
Something like:
CREATE TABLE data ( data_id int PRIMARY KEY, data_time bigint, data_values map<text, float> );
INSERT INTO data (data_id, data_time, data_values) VALUES (1, 21341324, {'sum': 2134, 'avg': 44.5 });
Here is more information.
Additionally, you can find the mapping between the CQL3 types and the Java types used by the DataStax driver here.
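For completeness, here is a minimal sketch of that insert with the (3.x-era) DataStax Java driver, assuming the table above; the contact point and keyspace are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;
import java.util.HashMap;
import java.util.Map;

public class DynamicColumns {
    public static void main(String[] args) {
        // Contact point and keyspace are placeholders; closing the Cluster
        // also closes any Session obtained from it.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
            Session session = cluster.connect("mykeyspace");

            PreparedStatement insert = session.prepare(
                "INSERT INTO data (data_id, data_time, data_values) VALUES (?, ?, ?)");

            // The map plays the role of the old dynamic Thrift columns:
            // arbitrary keys can be added without altering the schema.
            Map<String, Float> values = new HashMap<>();
            values.put("sum", 2134f);
            values.put("avg", 44.5f);

            session.execute(insert.bind(1, 21341324L, values));
        }
    }
}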
If you enable compact storage for that table, it will be backwards compatible with Thrift and CQL 2.0, both of which allow you to enter dynamic column names.
You can have as many columns of whatever name you want with this approach. The primary key is composed of two things: the first element, which is the row key, and the remaining elements, which when combined as a set form a single column name.
See the tweets example here
Since you've said this is in production already, note that it may not be possible to alter a table with existing data to use compact storage.

Is there a java library to bind XML (LOM) to XML+RDF?

For an educational project, I need some code (if it exists) that transforms XML files (specifically LOM metadata, but plain XML is fine) to XML+RDF.
I need that because I'm using an RDF store (4store) to query the triples and make searches faster.
I read that with XSLT it's possible to transform any XML into another XML, so if you know of an actual class, library or code, please tell me.
Thank you all.
My advice would be to use a software library to transform the XML to RDF/XML since the mapping may not be straightforward and RDF/XML has different XML semantics.
There are loads of different RDF APIs for different technology stacks, including
dotNetRDF, Jena, Sesame, ARC, and Redland:
http://semanticweb.org/wiki/Tools
You also need to define how the LOM metadata should be serialised into RDF. There is a good article here:
http://www.downes.ca/xml/rss_lom.htm
Answering my own question...
I'm using a binding of key/value for the LOM file. So, this part of the metadata:
<general>
<identifier xmlns="http://ltsc.ieee.org/xsd/LOM">
<catalog>oai</catalog>
<entry>oai:archiplanet.org:ap_44629</entry>
</identifier>
catalog and entry are going to be converted like this:
s = the URI of my graph; it contains my filename or identifier
p = "lom.general.identifier.catalog"
v = "oai"

s = the URI of my graph; it contains my filename or identifier
p = "lom.general.identifier.entry"
v = "oai:archiplanet.org:ap_44629"
And so, it generates all the triples for the RDF file. I think this approach will help in making queries about specific values or properties.
IEEE LOM is not a straightforward structure. It contains a hierarchical taxonomy which should be taken into account when you are mapping. Here you can find instructions on how you can map each IEEE LOM element to RDF, if this is your case.
Regarding the conversion, you can use the standard Java XML libraries to read the XML files and create the final RDF/XML file using Jena, according to the ontology I mentioned. The LOM ontology is available here.
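As a rough, untested sketch of that conversion, you could read the elements with the standard DOM API and emit the triples with Jena; the file name, base URI and predicate namespace below are placeholders, and the predicate names follow the flat key/value scheme from the earlier answer:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.w3c.dom.Document;

public class LomToRdf {
    public static void main(String[] args) throws Exception {
        // Parse the LOM file with the standard, namespace-aware DOM API.
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        Document doc = dbf.newDocumentBuilder().parse(new File("lom-record.xml"));

        String lomNs = "http://ltsc.ieee.org/xsd/LOM";
        String catalog = doc.getElementsByTagNameNS(lomNs, "catalog").item(0).getTextContent();
        String entry   = doc.getElementsByTagNameNS(lomNs, "entry").item(0).getTextContent();

        // Build the triples with Jena; the base URI identifies the graph/file.
        Model model = ModelFactory.createDefaultModel();
        String predNs = "http://example.org/lom/";
        Resource subject = model.createResource("http://example.org/graph/lom-record");
        subject.addProperty(model.createProperty(predNs, "general.identifier.catalog"), catalog);
        subject.addProperty(model.createProperty(predNs, "general.identifier.entry"), entry);

        // Serialise as RDF/XML, ready to be loaded into 4store.
        model.write(System.out, "RDF/XML");
    }
}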

What is the range of the MS SQL xml argument?

I am using MS SQL with the J2EE Spring framework.
When inserting data into a table, I am using a bulk insert with an xml argument in MS SQL.
Can anyone say how much data we can pass using this?
I would like to know the range of the xml argument.
T.Saravanan
On the SQL Server side, it is 2 GB:
The stored representation of xml data type instances cannot exceed 2 gigabytes (GB) in size
"Stored" means after some processing for efficiency
SQL Server internally represents XML in an efficient binary representation that uses UTF-16 encoding. User-provided encoding is not preserved, but is considered during the parse process.
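For what it's worth, from the Java side (the question mentions Spring/J2EE) an xml argument is typically passed through JDBC's SQLXML type; a minimal sketch follows, where the connection string and procedure name are made up. Anything up to the 2 GB limit above can be passed, though very large payloads are better streamed or batched:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLXML;

public class XmlBulkInsert {
    public static void main(String[] args) throws Exception {
        // Connection URL, credentials and procedure name are placeholders.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://localhost;databaseName=mydb", "user", "secret")) {

            SQLXML xmlArg = con.createSQLXML();
            xmlArg.setString("<rows><row id=\"1\"/><row id=\"2\"/></rows>");

            try (CallableStatement cs = con.prepareCall("{call dbo.BulkInsertRows(?)}")) {
                cs.setSQLXML(1, xmlArg);
                cs.execute();
            }
        }
    }
}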
