Extracting Cassandra's bloom filter - Java

I have a Cassandra server that is queried by another service, and I need to reduce the number of queries.
My first thought was to build a bloom filter of the whole database every couple of minutes and send it to the service.
But with a couple of hundred gigabytes in the database (expected to grow to a couple of terabytes), it doesn't seem like a good idea to overload the database every few minutes.
After searching for a better solution for a while, I remembered that Cassandra maintains its own bloom filters.
Is it possible to copy the *-Filter.db files and use them in my code instead of creating my own bloom filter?
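(For reference, the first idea above, building an application-side filter and shipping it to the other service, might look like the following minimal sketch using Guava's BloomFilter; the class name, sizing and false-positive rate are illustrative assumptions, not part of the original question.)
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

public class KeyFilterBuilder {
    // Build a filter over the partition keys read from the database and write it to a file
    // that can be shipped to the consuming service.
    public static void writeFilter(List<Integer> partitionKeys, String path) throws IOException {
        BloomFilter<Integer> filter =
                BloomFilter.create(Funnels.integerFunnel(), partitionKeys.size(), 0.01);
        partitionKeys.forEach(filter::put);
        try (OutputStream out = new FileOutputStream(path)) {
            filter.writeTo(out);
        }
    }
}
The service would rebuild it with BloomFilter.readFrom(in, Funnels.integerFunnel()) and call mightContain(key) before deciding whether to query Cassandra.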

I created a table test:
CREATE TABLE test (
    a int PRIMARY KEY,
    b int
);
and inserted one row:
INSERT INTO test (a, b) VALUES (1, 10);
After flushing the data to disk (for example with nodetool flush), we can use the *-Filter.db file. In my case it was la-2-big-Filter.db.
Here is sample code to check whether a partition key exists (the imports come from the Cassandra server jar; exact packages and the FilterFactory.deserialize signature can differ between Cassandra versions):
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.cassandra.db.DecoratedKey;
import org.apache.cassandra.db.marshal.Int32Type;
import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.utils.FilterFactory;
import org.apache.cassandra.utils.IFilter;

Murmur3Partitioner partitioner = new Murmur3Partitioner();
try (DataInputStream in = new DataInputStream(new FileInputStream(new File("la-2-big-Filter.db")));
     IFilter filter = FilterFactory.deserialize(in, true)) {
    for (int i = 1; i <= 10; i++) {
        // Decorate the key the same way Cassandra does before probing the filter
        DecoratedKey decoratedKey = partitioner.decorateKey(Int32Type.instance.decompose(i));
        if (filter.isPresent(decoratedKey)) {
            System.out.println(i + " is present");
        } else {
            System.out.println(i + " is not present");
        }
    }
}
Output:
1 is present
2 is not present
3 is not present
4 is not present
5 is not present
6 is not present
7 is not present
8 is not present
9 is not present
10 is not present

Related

Google Bigtable Filter with Java

I want to filter all rows that match this condition: given an input value x, return all records whose two qualifier values bracket x, in Java.
Example: with input value x = 15,
a record with qualifiers q1 = 10 and q2 = 20 will match,
a record with qualifiers q1 = 1 and q2 = 10 will not match.
You are trying to filter rows that contain a min numerical qualifier that is < x as well as a max numerical qualifier that is > x, and then perhaps to narrow those rows down to the data between those qualifiers.
This is pretty much the opposite of the access pattern one tries to achieve when setting up a Bigtable, so it has a code smell. Having said that, you can successfully achieve this sort of query using a combination of filters; however, these filters cannot be chained together, as far as I can tell.
First, use a filter to get the keys with a qualifier < x. Next, send a query to Bigtable for each key from the first filter, filtering on the key as well as on qualifier > x. That is already a reasonably optimized way; an even more optimized way might be to limit the first filter to one element (i.e. get the min element) and only run the unlimited query on the less-than portion after the second step.
My implementation below is slightly more naive, in that the second step filters only on qualifier > x and not on the keys from the first step, but the gist is the same:
val x = "15"
// Zero-pad to a fixed width so the lexicographic range comparison behaves numerically
val padLen = Ints.max(Int.MinValue.toString.length, Int.MaxValue.toString.length)
def pad(s: String): String = s.padTo(padLen, '0')

val a = new mutable.HashMap[ByteString, Row]
val b = new mutable.HashMap[ByteString, Row]
val c = new mutable.HashMap[ByteString, Row]

// Rows that have a qualifier below x
dataClient.readRows(Query.create(tableId)
    .filter(Filters.FILTERS.qualifier().rangeWithinFamily("cf")
      .startClosed(pad(Int.MinValue.toString)).endOpen(pad(x))))
  .forEach(r => a.put(r.getKey, r))
// Rows that have a qualifier above x
dataClient.readRows(Query.create(tableId)
    .filter(Filters.FILTERS.qualifier().rangeWithinFamily("cf")
      .startOpen(x).endClosed(pad(Int.MaxValue.toString))))
  .forEach(r => b.put(r.getKey, r))
// Rows that have a qualifier exactly equal to x
dataClient.readRows(Query.create(tableId)
    .filter(Filters.FILTERS.qualifier().exactMatch(x)))
  .forEach(r => c.put(r.getKey, r))

// Keep only rows seen on both sides of x and collect their cells
val all_cells = a.keys.toSet.intersect(b.keys.toSet).flatMap { k =>
  a(k).getCells.toArray.toSeq ++ b(k).getCells.toArray.toSeq ++
    c.get(k).map(_.getCells.toArray.toSeq).getOrElse(Seq.empty) // a row may have no qualifier exactly equal to x
}
Can you tell me more about your use case?
It is possible to create a filter on a range of values, but it will depend on how you are encoding them. If they are encoded as strings, you would use the ValueRange filter like so:
Filter filter = FILTERS.value().range().startClosed("10").endClosed("20");
Then perform your read with the filter
try (BigtableDataClient dataClient = BigtableDataClient.create(projectId, instanceId)) {
    Query query = Query.create(tableId).filter(filter);
    ServerStream<Row> rows = dataClient.readRows(query);
    for (Row row : rows) {
        printRow(row);
    }
} catch (IOException e) {
    System.out.println(
        "Unable to initialize service client, as a network error occurred: \n" + e.toString());
}
You can also pass bytes to the range, so if your numbers are encoded in some way, you could encode them as bytes in the same way and pass that into startClosed and endClosed.
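For example, if the values had been written as 4-byte big-endian integers (an assumption for illustration, not from the original answer), the bounds could be built with the same encoding, here using Guava's Ints and protobuf's ByteString:
// Hypothetical example: the encoding must match however the values were actually written.
// Big-endian int bytes compare like the numbers themselves for non-negative values.
Filter byteRangeFilter = FILTERS.value().range()
        .startClosed(ByteString.copyFrom(Ints.toByteArray(10)))
        .endClosed(ByteString.copyFrom(Ints.toByteArray(20)));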
You can read more about filters in the Cloud Bigtable Documentation.

MongoDB 4.4, Java driver 4.2.3 - InsertManyResult.getInsertedIds() not returning IDs for all inserted documents

I am trying to retrieve the _id values of inserted documents after a successful insertMany operation. To achieve this I am using InsertManyResult.getInsertedIds(). While this approach works most of the time, there are cases where not all _id values are retrieved.
I am not sure if I am doing something wrong, but I would assume that InsertManyResult.getInsertedIds() returns an _id for every document inserted.
Problem details
I am inserting 1000 documents into MongoDB in two batches of 500 documents. Each document is approx 1 MB in size.
After each batch is inserted using insertMany, I attempt to read the _id values via InsertManyResult.getInsertedIds() and save them to a collection for later use.
I would assume that after inserting 500 documents via insertMany, InsertManyResult.getInsertedIds() would return 500 _id values. It is, however, returning only 16 _id values out of 500.
When I check the collection directly via the Mongo shell, I see that all records were successfully inserted: there are 1000 documents in my test collection. I am just unable to get the _id of all the inserted documents via InsertManyResult.getInsertedIds(); I only get 32 _id values for 1000 documents inserted.
JSON structure
To replicate the issue I use exactly one JSON document, approx 1 MB in size, which looks like this:
{
    "textVal" : "RmKHtEMMzJDXgEApmWeoZGRdZJZerIj1",
    "intVal" : 161390623,
    "longVal" : "98213019054010317",
    "timestampVal" : "2020-12-31 23:59:59.999",
    "numericVal" : -401277306,
    "largeArrayVal" : [ "MMzJDXg", "ApmWeoZGRdZJZerI", "1LhTxQ", "adprPSb1ZT", ..., "QNLkBZuXenmYE77"]
}
Note that key largeArrayVal is holding almost all the data. I have omitted most of the values for readability.
Sample code
The code below parses the JSON shown above into a Document, which is then inserted into MongoDB via insertMany. After that is done, I try to get the inserted _id values using InsertManyResult.getInsertedIds().
private static final int MAX_DOCUMENTS = 1000;
private static final int BULK_SIZE = 500;

private static List<ObjectId> insertBatchReturnIds(List<Document> insertBatch)
{
    List<ObjectId> insertedIds = new ArrayList<ObjectId>();
    InsertManyResult insertManyResult =
            mongoClient.getDatabase(MONGO_DATABASE).getCollection(MONGO_COLLECTION).insertMany(insertBatch);
    insertManyResult.getInsertedIds().forEach((k, v) -> insertedIds.add(v.asObjectId().getValue()));
    System.out.println("Batch inserted:");
    System.out.println(" - Was acknowledged: " + Boolean.toString(insertManyResult.wasAcknowledged()).toUpperCase());
    System.out.println(" - InsertManyResult.getInsertedIds().size(): " + insertManyResult.getInsertedIds().size());
    return insertedIds;
}

private static void insertDocuments()
{
    int documentsInserted = 0;
    List<Document> insertBatch = new ArrayList<Document>();
    List<ObjectId> insertedIds = new ArrayList<ObjectId>();
    final String largeJson = loadLargeJsonFromFile("d:\\test-sample.json");
    System.out.println("Starting INSERT test...");
    while (documentsInserted < MAX_DOCUMENTS)
    {
        insertBatch.add(Document.parse(largeJson));
        documentsInserted++;
        if (documentsInserted % BULK_SIZE == 0)
        {
            insertedIds.addAll(insertBatchReturnIds(insertBatch));
            insertBatch.clear();
        }
    }
    if (insertBatch.size() > 0)
        insertedIds.addAll(insertBatchReturnIds(insertBatch));
    System.out.println("INSERT test finished");
    System.out.println(String.format("Expected IDs retrieved: %d. Actual IDs retrieved: %d.", MAX_DOCUMENTS, insertedIds.size()));
    if (insertedIds.size() != MAX_DOCUMENTS)
        throw new IllegalStateException("Not all _ID were returned for each document in batch");
}
Sample output
Starting INSERT test...
Batch inserted:
 - Was acknowledged: TRUE
 - InsertManyResult.getInsertedIds().size(): 16
Batch inserted:
 - Was acknowledged: TRUE
 - InsertManyResult.getInsertedIds().size(): 16
INSERT test finished
Expected IDs retrieved: 1000. Actual IDs retrieved: 32.
Exception in thread "main" java.lang.IllegalStateException: Not all _ID were returned for each document in batch
My questions
Is InsertManyResult.getInsertedIds() meant to return _id for all documents inserted?
Is the way I am using InsertManyResult.getInsertedIds() correct?
Could size of the inserted JSON be a factor here?
How should I use InsertManyResult to get _id for inserted documents?
Note
I am aware that I can either read the _id from the Document objects themselves (since it is the driver that generates it) or select the _id values after the documents were inserted.
I would like to know how this can be achieved using InsertManyResult.getInsertedIds(), as it seems to be made for exactly this purpose.
This is a bug in the Java driver, and it's being tracked in https://jira.mongodb.org/browse/JAVA-4436 (reported on January 5, 2022).
Your documents are 1 MB large, hence no more than 16 of them fit into a single command. The driver does split the full set of documents into batches automatically, but you appear to be reading ids from one batch at a time, therefore the problem is likely one of the following:
There is a driver issue where it doesn't merge the batch results together prior to returning them to your application.
The driver is giving you the results one batch at a time, hence you do get all of the ids but not in the segments you were expecting (in which case there is no bug, but you do need to work with the batches as they are provided by the driver).
The following test in Ruby works as expected, producing 100 ids:
c = Mongo::Client.new(['localhost:14920'])
docs = [{a: 'x'*1_000_000}]*100
res = c['foo'].insert_many(docs)
p res.inserted_ids.length
pp res.inserted_ids
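If an _id per document is needed regardless of how the driver batches the insert, the workaround hinted at in the question's note is to read the driver-generated _id back from the Document objects themselves after insertMany returns; a minimal sketch reusing the question's variables:
// Sketch: with the sync Java driver, missing _id values are generated client-side and
// written back into the Document instances, so they can be collected after insertMany.
List<Document> batch = new ArrayList<>();
for (int i = 0; i < BULK_SIZE; i++) {
    batch.add(Document.parse(largeJson));
}
mongoClient.getDatabase(MONGO_DATABASE).getCollection(MONGO_COLLECTION).insertMany(batch);
List<ObjectId> ids = batch.stream()
        .map(d -> d.getObjectId("_id"))
        .collect(Collectors.toList());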

Full scan of a long table with a huge number of versions returns only a small part of the rows

I have a case where I need to scan a table with about 50 columns, every column containing about 100 versions. Nothing special (this.table is just the appropriate HTable and processor is intended to handle the resulting rows):
final Scan scan = new Scan();
scan.setCaching(1000);
scan.setMaxVersions(Integer.MAX_VALUE);
final ResultScanner rs = this.table.getScanner(scan);
try {
    for (Result r = rs.next(); r != null; r = rs.next()) {
        processor.processRow(r);
    }
} finally {
    rs.close();
}
When I scan a table of about 20 x 10^6 rows with this approach, I get only about 50 x 10^3 rows back. No special configuration is applied to the scanner, and HBase is 0.98.1 (CDH 5.1). What am I missing here? Is it an HBase limitation, or am I doing something seriously wrong? What can I check? I have checked the result size limit (not the case), and as you can see maxVersions is configured. What could be limiting such scans?
UPDATE
I checked the returned Result instances, and the number of Cell instances inside them differs drastically from the expected count. To repeat, the table has about 20 x 10^6 rows, which can be counted by the same code without the maximum-versions configuration, yet the number of rows returned WITH versions is only about 50 x 10^3.
I am not sure what you have in processRow, but the key-value pairs are inside the Result object. For one row key there can be many key-value pairs, so maybe that is the missing point:
for (Result result : resultScanner) {
    for (KeyValue kv : result.raw()) {
        Bytes.toString(kv.getQualifier());
        Bytes.toString(kv.getValue());
        Bytes.toString(result.getRow());
    }
}
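Note that Result.raw() and KeyValue are deprecated in the 0.96+ client in favour of the Cell API; a roughly equivalent sketch that also counts cells per row (the cell count, not the count of Result objects, is what reflects the number of versions actually returned):
// Sketch using the Cell API; counting cells per Result shows how many
// column/version combinations were actually returned for each row.
long totalCells = 0;
for (Result result : resultScanner) {
    List<Cell> cells = result.listCells();   // all cells (all versions) in this Result
    if (cells == null) {
        continue;                             // empty Result
    }
    totalCells += cells.size();
    for (Cell cell : cells) {
        Bytes.toString(CellUtil.cloneRow(cell));
        Bytes.toString(CellUtil.cloneQualifier(cell));
        Bytes.toString(CellUtil.cloneValue(cell));
    }
}
System.out.println("Total cells (row/column/version combinations): " + totalCells);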

Server side sorting on huge data

As of now we provide client-side sorting on a Dojo datagrid. Now we need to add server-side sorting, meaning the sort should apply across all pages of the grid. We have 4 tables joined onto a main table, with about 2 lac (200,000) records at the moment, and it may grow. When the SQL executes, it takes 5-8 minutes to fetch all records into my Java code, where I need to apply some calculations over them; I then provide a custom sort using Comparators, one comparator per column.
My worry is how to get all that data into the service-layer code in a short time. Is there a way to increase execution speed through data source configuration?
return new Comparator<QueryHS>() {
    public int compare(QueryHS object1, QueryHS object2) {
        int tatAbs = object1.getTatNb().intValue() - object1.getExternalUnresolvedMins().intValue();
        String negative = "";
        if (tatAbs < 0) {
            negative = "-";
        }
        String tatAbsStr = negative + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs / 60)), 2) + ":"
                + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs % 60)), 2);
        // object1.setTatNb(tatAbs);
        object1.setAbsTat(tatAbsStr.trim());

        int tatAbs2 = object2.getTatNb().intValue() - object2.getExternalUnresolvedMins().intValue();
        negative = "";
        if (tatAbs2 < 0) {
            negative = "-";
        }
        String tatAbsStr2 = negative + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs2 / 60)), 2) + ":"
                + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs2 % 60)), 2);
        // object2.setTatNb(tatAbs2);
        object2.setAbsTat(tatAbsStr2.trim());

        if (tatAbs > tatAbs2)
            return 1;
        if (tatAbs < tatAbs2)
            return -1;
        return 0;
    }
};
You should not fetch all 2 lac (200,000) records from the database into your application; fetch only what is needed.
As you said, you have 4 tables joined onto a main table, so you presumably have Hibernate entity classes for them with the corresponding mappings. Use pagination to fetch only the rows you are showing to the user; Hibernate knows how to do this efficiently on your particular database (see the sketch below).
You can also use aggregate functions: count(), min(), max(), sum(), and avg() with your HQL to fetch the relevant data.
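A minimal sketch of pushing the sort and the paging into Hibernate (assuming QueryHS is a mapped entity and tatNb / externalUnresolvedMins are numeric mapped properties; names follow the question's code, everything else is illustrative):
// Sort and page on the database side so only one page of rows crosses the wire.
public List<QueryHS> fetchPage(Session session, int pageNumber, int pageSize, boolean ascending) {
    @SuppressWarnings("unchecked")
    List<QueryHS> page = session.createQuery(
            "from QueryHS q order by (q.tatNb - q.externalUnresolvedMins) "
                + (ascending ? "asc" : "desc"))
        .setFirstResult(pageNumber * pageSize)   // offset of the first row of this page
        .setMaxResults(pageSize)                 // rows per page
        .list();
    return page;
}
The grid's page number, page size, and sort column can then be passed from the client on each request, so only one page of rows ever reaches the service layer.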

Update queries taking 0.1s each

I'm writing a simple Java program that does a simple task: it takes a folder of text files as input and returns the 5 words with the highest frequency per document.
At first, I tried to do it without any database support, but when I started having memory problems, I decided to change my approach and configured the program to run with SQLite.
Everything works just fine now, but it takes a lot of time just to add the words to the database (67 seconds for 801 words).
Here is how I initialize the database:
this.Execute(
    "CREATE TABLE words (" +
    "word VARCHAR(20)" +
    ");"
);
this.Execute(
    "CREATE UNIQUE INDEX wordindex ON words (word);"
);
Then, once the program has counted the documents in the folder (let's say N), I add N counter columns and N frequency columns to the table:
for (int i = 0; i < fileList.size(); i++)
{
    db.Execute("ALTER TABLE words ADD doc" + i + " INTEGER");
    db.Execute("ALTER TABLE words ADD freq" + i + " DOUBLE");
}
Finally, I add words using the following function:
public void AddWord(String word, int docid)
{
    String query = "UPDATE words SET doc" + docid + "=doc" + docid + "+1 WHERE word='" + word + "'";
    int rows = this.ExecuteUpdate(query);
    if (rows <= 0)
    {
        query = "INSERT INTO words (word,doc" + docid + ") VALUES ('" + word + "',1)";
        this.ExecuteUpdate(query);
    }
}
Am I doing something wrong, or is it normal for an update query to take this long to execute?
Wrap all the commands inside one transaction; otherwise you get one transaction (with the associated storage synchronization) per command.
12 per second is slow but not unreasonable. With a database like MySQL I would expect it to be closer to 100/second with HDD storage.
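A minimal JDBC sketch of that advice, assuming a plain java.sql.Connection to the SQLite file and the table layout from the question (the PreparedStatements are an extra improvement that avoids per-call SQL parsing and quoting problems):
// Group many AddWord calls into one transaction and reuse prepared statements.
void addWords(Connection conn, List<String> words, int docid) throws SQLException {
    conn.setAutoCommit(false);                      // one transaction for the whole batch
    String update = "UPDATE words SET doc" + docid + " = doc" + docid + " + 1 WHERE word = ?";
    String insert = "INSERT INTO words (word, doc" + docid + ") VALUES (?, 1)";
    try (PreparedStatement upd = conn.prepareStatement(update);
         PreparedStatement ins = conn.prepareStatement(insert)) {
        for (String word : words) {
            upd.setString(1, word);
            if (upd.executeUpdate() == 0) {         // word not seen yet for this document
                ins.setString(1, word);
                ins.executeUpdate();
            }
        }
        conn.commit();                              // one sync instead of one per statement
    } catch (SQLException e) {
        conn.rollback();
        throw e;
    }
}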
