Check for substring efficiently for large data sets - Java

I have:
a database table with 400 000 000 rows (Cassandra 3)
a list of circa 10 000 keywords
both data sets are expected to grow over time
I need to:
check whether a specified column contains a keyword
count how many rows contain the keyword in that column
Which approach should I choose?
Approach 1 (Secondary index):
Create a secondary SASI index on the table
Find matches for a given keyword "on the fly" at any time
However, I am afraid of:
capacity - secondary indices can consume extra space, and for such a large table it could be too much
performance - I am not sure whether finding a keyword among hundreds of millions of rows can be done in a reasonable time
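If you do try the SASI route, here is a minimal sketch of what the index and a per-keyword count could look like with a 3.x DataStax Java driver (keyspace, table and column names are made up; whether a full-table COUNT with LIKE finishes in reasonable time on 400M rows is exactly the open question):
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class SasiExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_ks")) {

            // One-time index creation; CONTAINS mode allows LIKE '%keyword%' substring matches.
            session.execute(
                "CREATE CUSTOM INDEX IF NOT EXISTS articles_title_idx ON articles (title) " +
                "USING 'org.apache.cassandra.index.sasi.SASIIndex' " +
                "WITH OPTIONS = {'mode': 'CONTAINS', " +
                "'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', " +
                "'case_sensitive': 'false'}");

            // Count rows containing one keyword; on 400M rows this may still time out,
            // so the aggregation might have to be done client-side or per partition.
            ResultSet rs = session.execute(
                "SELECT COUNT(*) FROM articles WHERE title LIKE '%keyword%'");
            System.out.println("matches: " + rs.one().getLong(0));
        }
    }
}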
Approach 2 (Java job - brute force):
Java job that continuously iterates over data
Matches are saved into cache
Cache is updated during the next iteration
// Paginate through data...
String page = null;
do {
    PagingState state = page == null ? null : PagingState.fromString(page);
    PagedResult<DataRow> res = getDataPaged(query, status, PAGE_SIZE, state);
    // Iterate through the current page...
    for (DataRow row : res.getResult()) {
        // Skip empty titles
        if (row.getTitle().length() == 0) {
            continue;
        }
        // Find a match in the title
        for (String k : keywords) {
            if (k.length() > row.getTitle().length()) {
                continue;
            }
            if (row.getTitle().toLowerCase().contains(k.toLowerCase())) {
                // TODO: save match
                break;
            }
        }
    }
    status = res.getResult();
    page = res.getPage();
    // TODO: wait here to reduce DB load
} while (page != null);
Problems
Iterating over the whole table could be very slow. If I waited one second per 1000 rows, the cycle would finish in about 4.6 days
The cache would require extra space; moreover, frequent deletions from the cache would produce tombstones in Cassandra
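If you do stick with the brute-force job, one cheap improvement is to lower-case each keyword once up front and each title once per row, instead of calling toLowerCase() inside the inner loop. A minimal sketch of helper methods (names are made up; the paging loop above would call them):
import java.util.ArrayList;
import java.util.List;

// Built once, outside the paging loop, so toLowerCase() is not repeated per row.
static List<String> lowerCase(List<String> keywords) {
    List<String> result = new ArrayList<>(keywords.size());
    for (String k : keywords) {
        result.add(k.toLowerCase());
    }
    return result;
}

// Returns true if the title contains any of the already lower-cased keywords.
static boolean containsAnyKeyword(String title, List<String> lowerKeywords) {
    String lowerTitle = title.toLowerCase();
    for (String k : lowerKeywords) {
        if (k.length() <= lowerTitle.length() && lowerTitle.contains(k)) {
            return true;
        }
    }
    return false;
}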

A better way would be to use a search engine like Solr or Elasticsearch; full-text search is their specialty. You could easily dump your data from Cassandra to Elasticsearch and implement your Java job on top of Elasticsearch.
EDIT:
With Cassandra you can request your query results as JSON, and Elasticsearch speaks JSON natively, so you will be able to transfer your data very easily.
Elasticsearch
Solr
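As a rough illustration of the kind of per-keyword query the Java job could then issue, here is a sketch using the Elasticsearch 7.x high-level REST client (the index name "articles" and field "title" are assumptions, and the exact client API depends on your Elasticsearch version):
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.core.CountRequest;
import org.elasticsearch.client.core.CountResponse;
import org.elasticsearch.index.query.QueryBuilders;

public class KeywordCounter {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Count documents whose title contains the keyword (analyzed match query).
            CountRequest request = new CountRequest("articles")
                    .query(QueryBuilders.matchQuery("title", "keyword"));
            CountResponse response = client.count(request, RequestOptions.DEFAULT);
            System.out.println("rows containing keyword: " + response.getCount());
        }
    }
}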

Related

Limiting the nested fields in Elasticsearch

I am trying to index JSON documents in Elasticsearch with dynamic mappings on. Some of the documents have an unpredictable number of keys (nested levels), because of which I started getting this error from the ES Java API.
[ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Limit of total fields [1000] in index [my_index] has been exceeded]]]failure in bulk execution
I was wondering whether there is an option that can be configured at the index level to map fields only down to a certain depth (maybe 2) and store the rest of the document as a string or in flattened form. I did come across settings like index.mapping.depth.limit, but if I set it to 2, that setting rejects documents with more levels.
For the total fields limit:
PUT <index_name>/_settings
{
    "index.mapping.total_fields.limit": 2000
}
For the depth limit:
PUT <index_name>/_settings
{
    "index.mapping.depth.limit": 2
}
https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping.html
Add this to your index _settings:
"settings": {
"index.mapping.nested_fields.limit": 150,
...
}
"mappings": {
...
}
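Since the question mentions the ES Java API: for reference, a rough sketch of raising the total-fields limit programmatically with the 7.x high-level REST client ("my_index" is a placeholder, and the exact client API depends on your Elasticsearch version):
import java.io.IOException;
import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;

// Raises the total-fields limit on an existing index ("my_index" is a placeholder).
static void raiseFieldLimit(RestHighLevelClient client) throws IOException {
    UpdateSettingsRequest request = new UpdateSettingsRequest("my_index")
            .settings(Settings.builder()
                    .put("index.mapping.total_fields.limit", 2000)
                    .build());
    client.indices().putSettings(request, RequestOptions.DEFAULT);
}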

Dataflow CoGroupByKey is very slow for more than 10000 elements per key

I have two PCollection<KV<String, TableRow>>s: one has ~7 million rows and the other has ~1 million rows.
What I want to do is apply a left outer join between these two PCollections and, when the join succeeds, put all the data of the right TableRow into the left TableRow and return the result.
I have tried CoGroupByKey in Apache Beam SDK 2.10.0 for Java, but I am getting many hot keys, so fetching results after CoGroupByKey gets slow and I see the warning 'More than 10000 elements per key, need to reiterate'. I have also tried the Shuffle Mode service, but it did not help.
PCollection<TableRow> finalResultCollection =
    coGbkResultCollection.apply(ParDo.of(
        new DoFn<KV<String, CoGbkResult>, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                KV<String, CoGbkResult> e = c.element();
                // Get all collection 1 values
                Iterable<TableRow> pt1Vals = e.getValue().getAll(t1);
                Iterable<TableRow> pt2Vals = e.getValue().getAll(t2);
                for (TableRow tr : pt1Vals) {
                    TableRow out = tr.clone();
                    if (pt2Vals.iterator().hasNext()) {
                        for (TableRow tr1 : pt2Vals) {
                            out.putAll(tr1);
                            c.output(out);
                        }
                    } else {
                        c.output(out);
                    }
                }
            }
        }));
What is the way to perform these type of joins in dataflow?
I have done some research and found some information that may help you.
The data transferred by Dataflow between PCollections (serializable objects) may not live on a single machine. Furthermore, a transformation like GroupByKey/CoGroupByKey requires all the data for a key to be collected in one place before the result is populated; I don't know whether you can restructure it to avoid that.
Besides that, you can redistribute your keys, use fewer workers with more memory, or try using Combine.perKey.
You can also try this workaround (sketched below), or read this article for more information that may help.
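One common workaround when one side is much smaller (your ~1 million-row collection, if it fits in worker memory) is a side-input join instead of CoGroupByKey. A rough sketch, assuming hypothetical leftRows/rightRows collections of KV<String, TableRow> that already exist in the pipeline:
import java.util.Map;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;
import com.google.api.services.bigquery.model.TableRow;

// Materialize the smaller side as a multimap side input keyed by the join key.
PCollectionView<Map<String, Iterable<TableRow>>> rightView =
        rightRows.apply(View.asMultimap());

PCollection<TableRow> joined = leftRows.apply(ParDo.of(
        new DoFn<KV<String, TableRow>, TableRow>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                TableRow out = c.element().getValue().clone();
                Iterable<TableRow> matches = c.sideInput(rightView).get(c.element().getKey());
                if (matches == null) {
                    c.output(out);            // left outer join: emit the left row unchanged
                    return;
                }
                for (TableRow right : matches) {
                    TableRow merged = out.clone();
                    merged.putAll(right);     // copy right-side fields onto the left row
                    c.output(merged);
                }
            }
        }).withSideInputs(rightView));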

hbase how to choose pre split strategies and how its affect your rowkeys

I am trying to pre-split an HBase table. One of the HBaseAdmin Java APIs creates an HBase table as a function of start key, end key and number of regions. Here's the API I use from HBaseAdmin: void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)
Is there any recommendation on choosing the start key and end key based on the dataset?
My approach: let's say we have 100 records in the dataset. I want the data divided into approximately 10 regions, so each will have roughly 10 records. To find the start key I would run scan '/mytable', {LIMIT => 10} and pick the last rowkey as my start key, then run scan '/mytable', {LIMIT => 90} and pick the last rowkey as my end key.
Does this approach to finding the start key and end key look OK, or is there a better practice?
EDIT
I tried the following approaches to pre-split an empty table. All three didn't work the way I used them. I think I will need to salt the key to get an equal distribution.
PS: I am displaying only some of the region info
1)
byte[][] splits = new RegionSplitter.HexStringSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits);
This gives regions with boundaries like:
{
"startkey":"-INFINITY",
"endkey":"11111111",
"numberofrows":3628951,
},
{
"startkey":"11111111",
"endkey":"22222222",
},
{
"startkey":"22222222",
"endkey":"33333333",
},
{
"startkey":"33333333",
"endkey":"44444444",
},
{
"startkey":"88888888",
"endkey":"99999999",
},
{
"startkey":"99999999",
"endkey":"aaaaaaaa",
},
{
"startkey":"aaaaaaaa",
"endkey":"bbbbbbbb",
},
{
"startkey":"eeeeeeee",
"endkey":"INFINITY",
}
This is useless, as my rowkeys are of composite form like 'deptId|month|roleId|regionId' and don't fit into the above boundaries.
2)
byte[][] splits = new RegionSplitter.UniformSplit().split(10);
hBaseAdmin.createTable(tabledescriptor, splits)
This has the same issue:
{
"startkey":"-INFINITY",
"endkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\x99",
}
{
"startkey":"\\x19\\x99\\x99\\x99\\x99\\x99\\x99\\
"endkey":"33333332",
}
{
"startkey":"33333332",
"endkey":"L\\xCC\\xCC\\xCC\\xCC\\xCC\\xCC\\xCB",
}
{
"startkey":"\\xE6ffffffa",
"endkey":"INFINITY",
}
3) I tried supplying a start key and end key and got the following useless regions.
hBaseAdmin.createTable(tabledescriptor, Bytes.toBytes("04120|200808|805|1999"),
Bytes.toBytes("01253|201501|805|1999"), 10);
{
"startkey":"-INFINITY",
"endkey":"04120|200808|805|1999",
}
{
"startkey":"04120|200808|805|1999",
"endkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
}
{
"startkey":"000PTP\\xDC200W\\xD07\\x9C805|1999",
"endkey":"000ptq<200wp6\\xBC805|1999",
}
{
"startkey":"001\\x11\\x15\\x13\\x1C201\\x15\\x902\\x5C805|1999",
"endkey":"01253|201501|805|1999",
}
{
"startkey":"01253|201501|805|1999",
"endkey":"INFINITY",
}
First question: From my experience with HBase, I am not aware of any hard rule for choosing the number of regions, start key and end key.
But the underlying point is:
with your rowkey design, data should be distributed across the regions and not hotspotted
(36.1. Hotspotting)
However, if you define a fixed number of regions, say 10 as you mentioned, there may not be 10 after a heavy data load; once a region reaches a certain size limit, it will split again.
For the way you are creating the table, the HBaseAdmin documentation says: "Creates a new table with the specified number of regions. The start key specified will become the end key of the first region of the table, and the end key specified will become the start key of the last region of the table (the first region has a null start key and the last region has a null end key)."
Personally, I prefer creating the table with pre-splits, say 0-9, and designing a salted rowkey so that rows are spread across those regions and hotspotting is avoided.
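A rough sketch of what that could look like with the Java admin API used in the question (table and column-family names are made up, and this assumes a 1.x-era client; the salt is simply the natural key's hash modulo the number of regions):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PresplitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
        desc.addFamily(new HColumnDescriptor("cf"));

        // Nine split points "1".."9" give ten regions; rowkeys get a one-character salt prefix 0..9.
        byte[][] splits = new byte[9][];
        for (int i = 1; i <= 9; i++) {
            splits[i - 1] = Bytes.toBytes(String.valueOf(i));
        }
        admin.createTable(desc, splits);
        admin.close();
    }

    // Salted rowkey: the prefix spreads writes across the ten pre-split regions.
    static String saltedKey(String naturalKey) {
        int salt = (naturalKey.hashCode() & 0x7fffffff) % 10;
        return salt + "|" + naturalKey;
    }
}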
EDIT: If you want to implement your own region split, you can implement and provide your own org.apache.hadoop.hbase.util.RegionSplitter.SplitAlgorithm and override
public byte[][] split(int numberOfSplits)
Second question:
My understanding: you want to find the start rowkey and end rowkey for the data inserted into a specific table. Below are some ways.
If you want to find the start and end rowkeys of the regions, scan the 'hbase:meta' table to see what your start and end rowkeys are.
You can also open the master UI at http://hbasemaster:60010 to see how the rowkeys are spread across the regions; the start and end rowkeys are listed for each region.
To see how your keys are organized after pre-splitting the table and inserting into HBase, use FirstKeyOnlyFilter,
for example : scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'
which displays all your 100 rowkeys.
If you have a lot of data (not just the 100 rows you mentioned) and want to dump all rowkeys, you can run the following from outside the shell:
echo "scan 'yourtablename', FILTER => 'FirstKeyOnlyFilter()'" | hbase shell > rowkeys.txt

How to speed up my program? (Lots of slow MySQL queries)

I have a database with 300 000 rows, and I need to filter some rows with an algorithm.
protected boolean validateMatch(DbMatch m) throws MatchException, NotSupportedSportException {
    // expensive part
    List<DbMatch> hh = sd.getMatches(DateService.beforeDay(m.getStart()), m.getHt(), m.getCountry(), m.getSportID());
    List<DbMatch> ah = sd.getMatches(DateService.beforeDay(m.getStart()), m.getAt(), m.getCountry(), m.getSportID());
    ....
My Hibernate DAO function that loads data from MySQL is called twice per match (once for the home team, once for the away team), i.e. 2x the size of the initial list.
public List<DbMatch> getMatches(Date before, String team, String country, int sportID) throws NotSupportedSportException {
    // Match_soccer where date between :start and :end
    Criteria criteria = session.createCriteria(DbMatch.class);
    criteria.add(Restrictions.le("start", before));
    criteria.add(Restrictions.disjunction()
        .add(Restrictions.eq("ht", team))
        .add(Restrictions.eq("at", team)));
    criteria.add(Restrictions.eq("country", country));
    criteria.add(Restrictions.eq("sportID", sportID));
    criteria.addOrder(Order.desc("start"));
    return criteria.list();
}
Example how i try filter data
function List<DbMatch> filter(List<DbMatch> mSet){
List<DbMatch> filtred = new ArrayList<>();
for(DbMatch m:mSet){
if(validateMatch(DbMatch m))filtred.add(m);
}
}
(1) I tried different criteria settings and measured the function times with a stopwatch. When I use filter(matches) with a matches size of 1000, my program takes 3 min 21 s 659 ms.
(2) When I remove criteria.addOrder(Order.desc("start"));, the program finishes filtering after 3 min 12 s 811 ms.
(3) But if I remove criteria.addOrder(Order.desc("start")); and add criteria.setMaxResults(1);, the result is 22 s 311 ms.
Using the last configuration I can filter all my 300 000 records in 22.3 * 300 = 22 300 s (~6 h), but with the first version I would have to wait ~60 h.
If I want to use the criteria without order and limit, I must be sure that my table is sorted by date in the database, because it is important to get the last match.
All data is stored in the matches table.
Table indexes:
Table, Non_unique, Key_name, Seq_in_index, Column_name, Collation, Cardinality, Sub_part, Packed, Null, Index_type, Comment, Index_comment
matches, 0, PRIMARY, 1, mid, A, 220712, , , , BTREE, ,
matches, 0, UK_kcenwf4m58fssuccpknl1v25v, 1, beid, A, 220712, , , YES, BTREE, ,
UPDATED
After adding ALTER TABLE matches ADD INDEX (sportID, country); the program time decreased to 15 s per 1000 matches. And if I do not use ORDER BY and add a limit, I only need to wait 4 s per 1000 matches.
What should I do in this situation to improve the program's execution speed?
Your first order of business is to figure out how long each component takes to process the request.
Find the SQL query generated by the ORM, run it manually in MySQL Workbench and see how long it takes (uncached). You can also use EXPLAIN to check the index usage.
If the query is fast enough, then it's your Java code that's taking the time and you need to optimize your algorithm. You can use JConsole to dig further into that.
Once you identify which component is taking longer, post your analysis here and we can make suggestions accordingly.
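To capture the SQL generated by the ORM, one option is simply to turn on Hibernate's SQL logging. A minimal sketch (standard Hibernate property names, assuming an existing hibernate.cfg.xml on the classpath):
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

// Builds a SessionFactory with SQL logging turned on, so the generated queries
// can be copied into MySQL Workbench and run with EXPLAIN.
static SessionFactory buildLoggingSessionFactory() {
    Configuration cfg = new Configuration().configure(); // loads the existing hibernate.cfg.xml
    cfg.setProperty("hibernate.show_sql", "true");       // print every generated SQL statement
    cfg.setProperty("hibernate.format_sql", "true");     // pretty-print it
    return cfg.buildSessionFactory();
}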

How do I limit the number of results when using the Java driver for mongo db?

http://api.mongodb.org/java/2.1/com/mongodb/DBCollection.html#find(com.mongodb.DBObject,com.mongodb.DBObject,int,int)
Using this with Grails and the mongo db plugin.
Here's the code I'm using... not sure why but the cursor is returning the entire set of data. In this case, I'm just trying to return the first 20 matches (with is_processed = false):
def limit = {
    def count = 1;
    def shape_cursor = mongo.shapes.find(new BasicDBObject("is_processed", false), new BasicDBObject(), 0, 20);
    while(shape_cursor.hasNext()){
        shape_cursor.next();
        render "<div>" + count + "</div>"
        count++;
    }
}
Anyone have an idea?
limit is a method of DBCursor: DBCursor.limit(n).
So you simply need to do
def shape_cursor = mongo.shapes.find(...).limit(20);
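For completeness, a small plain-Java sketch of the same idea with the legacy DBCollection/DBCursor API (connection details and database name are made up; this assumes a later 2.x/3.x driver where MongoClient exists):
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class LimitExample {
    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("mydb");                      // database name is a placeholder
        DBCollection shapes = db.getCollection("shapes");

        // Only the first 20 unprocessed shapes are fetched.
        DBCursor cursor = shapes.find(new BasicDBObject("is_processed", false)).limit(20);
        try {
            while (cursor.hasNext()) {
                DBObject shape = cursor.next();
                System.out.println(shape);
            }
        } finally {
            cursor.close();
        }
        client.close();
    }
}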
According to the JavaDoc you linked to, the second int parameter is not the maximum number to return, but:
batchSize - if positive, is the # of objects per batch sent back from the db. all objects that match will be returned. if batchSize < 0, its a hard limit, and only 1 batch will either batchSize or the # that fit in a batch
Maybe a negative number (-20) would do what you want, but I find the statement above too confusing to be sure about it, so I would set the batchSize to 20 and then filter in your application code.
Maybe file this as a bug / feature request. There should be a way to specify skip/limit that works just like on the shell interface. (Update: and there is, on the cursor class, see the other answer).
