I want to filter all rows that match this condition: given an input value x, return all records whose two qualifier values bracket x, in Java.
Example: with input value x = 15,
a record with qualifiers q1 = 10 and q2 = 20 will match;
a record with qualifiers q1 = 1 and q2 = 10 will not match.
You are trying to filter rows that contain a minimum numerical qualifier that is < x as well as a maximum numerical qualifier that is > x, and then perhaps filter the data in those rows down to the cells between those qualifiers.
This is pretty much the opposite of the access pattern one tries to achieve when setting up a Bigtable, so it has a code smell. Having said that, you can successfully achieve this sort of query using a combination of filters. However, these filters cannot be chained together, as far as I can tell.
First, use a filter to get keys with a column qualifier (cq) < x. Next, send a query to Bigtable for each key from the first filter, filtering on that key as well as on cq > x. This is an optimized way. An even more optimized way might be to limit the first filter to 1 element (i.e. get the min element), and only run the unlimited query on the less-than portion after the second step.
My implementation below is slightly more naive, in that the second step filters only on cq > x, and not on the keys from the first step. But the gist is the same:
val x = "15"
val a = new mutable.HashMap[ByteString, Row]
val b = new mutable.HashMap[ByteString, Row]
val c = new mutable.HashMap[ByteString, Row]
dataClient.readRows( Query.create(tableId)
.filter(Filters.FILTERS.qualifier().rangeWithinFamily("cf").startClosed(Int.MinValue.toString.padTo(Ints.max(Int.MinValue.toString.length, Int.MaxValue.toString.length), "0").toString()).endOpen(x.padTo(Ints.max(Int.MinValue.toString.length, Int.MaxValue.toString.length), "0").toString()
)))
.forEach(r => a.put(r.getKey, r))
dataClient.readRows(Query.create(tableId)
.filter(Filters.FILTERS.qualifier().rangeWithinFamily("cf").startOpen(x).endClosed(Int.MaxValue.toString.padTo(Ints.max(Int.MinValue.toString.length, Int.MaxValue.toString.length), "0").toString()))
)
.forEach(r => b.put(r.getKey, r))
dataClient.readRows(Query.create(tableId)
.filter(Filters.FILTERS.qualifier().exactMatch(x)))
.forEach(r => c.put(r.getKey, r))
val all_cells = a.keys.toSet.intersect(b.keys.toSet).flatMap(k => a.get(k).map(_.getCells).get.toArray.toSeq ++ b.get(k).map(_.getCells).get.toArray.toSeq
++ c.get(k).map(_.getCells).get.toArray.toSeq)
Can you tell me more about your use case?
It is possible to create a filter on a range of values, but it will depend on how you are encoding them. If they are encoded as strings, you would use the ValueRange filter like so:
Filter filter = FILTERS.value().range().startClosed("10").endClosed("20");
Then perform your read with the filter
try (BigtableDataClient dataClient = BigtableDataClient.create(projectId, instanceId)) {
  Query query = Query.create(tableId).filter(filter);
  ServerStream<Row> rows = dataClient.readRows(query);
  for (Row row : rows) {
    printRow(row);
  }
} catch (IOException e) {
  System.out.println(
      "Unable to initialize service client, as a network error occurred: \n" + e.toString());
}
You can also pass bytes to the range, so if your numbers are encoded in some way, you could encode them as bytes in the same way and pass that into startClosed and endClosed.
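For example, here is a sketch of that idea, assuming the values were written as 4-byte big-endian integers (it uses Guava's Ints.toByteArray; big-endian byte order only matches numeric order for non-negative values, and the cells must have been written with the same encoding):
import com.google.cloud.bigtable.data.v2.models.Filters
import com.google.common.primitives.Ints
import com.google.protobuf.ByteString

// Match cells whose value is a big-endian int in [10, 20].
val start = ByteString.copyFrom(Ints.toByteArray(10))
val end = ByteString.copyFrom(Ints.toByteArray(20))
val byteRangeFilter = Filters.FILTERS.value().range().startClosed(start).endClosed(end)
The resulting filter plugs into Query.create(tableId).filter(byteRangeFilter) exactly like the string version above.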
You can read more about filters in the Cloud Bigtable Documentation.
Related
I'm looking for a way to split an RDD into two or more RDDs. The closest I've seen is Scala Spark: Split collection into several RDD? which is still a single RDD.
If you're familiar with SAS, something like this:
data work.split1 work.split2;
    set work.preSplit;
    if (condition1) then output work.split1;
    else if (condition2) then output work.split2;
run;
which resulted in two distinct data sets. It would have to be immediately persisted to get the results I intend...
It is not possible to yield multiple RDDs from a single transformation*. If you want to split an RDD you have to apply a filter for each split condition. For example:
def even(x): return x % 2 == 0
def odd(x): return not even(x)
rdd = sc.parallelize(range(20))
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If you have only a binary condition and computation is expensive you may prefer something like this:
kv_rdd = rdd.map(lambda x: (x, odd(x)))
kv_rdd.cache()
rdd_odd = kv_rdd.filter(lambda kv: kv[1]).keys()
rdd_even = kv_rdd.filter(lambda kv: not kv[1]).keys()
This means only a single predicate computation, but it requires an additional pass over all the data.
It is important to note that, as long as the input RDD is properly cached and there are no additional assumptions regarding data distribution, there is no significant difference in time complexity between a repeated filter and a for-loop with nested if-else.
With N elements and M conditions, the number of operations you have to perform is clearly proportional to N times M. In the case of the for-loop it should be closer to (N + MN) / 2, and the repeated filter is exactly NM, but at the end of the day it is nothing other than O(NM). You can see my discussion** with Jason Lenderman to read about some pros and cons.
At the very high level you should consider two things:
Spark transformations are lazy; until you execute an action, your RDD is not materialized
Why does it matter? Going back to my example:
rdd_odd, rdd_even = (rdd.filter(f) for f in (odd, even))
If later I decide that I need only rdd_odd then there is no reason to materialize rdd_even.
If you take a look at your SAS example, to compute work.split2 you need to materialize both the input data and work.split1.
RDDs provide a declarative API. When you use filter or map it is completely up to the Spark engine how the operation is performed. As long as the functions passed to the transformations are side-effect free, it creates multiple possibilities to optimize the whole pipeline.
At the end of the day this case is not special enough to justify its own transformation.
This map-with-filter pattern is actually used in core Spark. See my answer to How does Sparks RDD.randomSplit actually split the RDD and the relevant part of the randomSplit method.
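Roughly, that pattern looks like the following sketch (simplified: Spark's actual randomSplit re-computes the random draws deterministically per split rather than caching a tagged RDD):
// Tag each element with a uniform draw, then carve [0, 1) into one sub-range per split;
// every split is then just a filter over the same cached, tagged RDD.
val rdd = sc.parallelize(1 to 20)
val seed = 42L
val tagged = rdd.mapPartitionsWithIndex((idx, iter) => {
  val rng = new scala.util.Random(seed + idx) // deterministic per partition
  iter.map(x => (x, rng.nextDouble()))
}).cache()
val firstSplit = tagged.filter { case (_, p) => p < 0.5 }.keys
val secondSplit = tagged.filter { case (_, p) => p >= 0.5 }.keys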
If the only goal is to achieve a split of the input, it is possible to use the partitionBy clause of DataFrameWriter with the text output format:
def makePairs(row: T): (String, String) = ???
data
.map(makePairs).toDF("key", "value")
.write.partitionBy("key").format("text").save(...)
* There are only 3 basic types of transformations in Spark:
RDD[T] => RDD[T]
RDD[T] => RDD[U]
(RDD[T], RDD[U]) => RDD[W]
where T, U, W can be either atomic types or products / tuples (K, V). Any other operation has to be expressed using some combination of the above. You can check the original RDD paper for more details.
** https://chat.stackoverflow.com/rooms/91928/discussion-between-zero323-and-jason-lenderman
*** See also Scala Spark: Split collection into several RDD?
As other posters mentioned above, there is no single, native RDD transform that splits RDDs, but here are some "multiplex" operations that can efficiently emulate a wide variety of "splitting" on RDDs, without reading multiple times:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.rdd.multiplex.MuxRDDFunctions
Some methods specific to random splitting:
http://silex.freevariable.com/latest/api/#com.redhat.et.silex.sample.split.SplitSampleRDDFunctions
Methods are available from open source silex project:
https://github.com/willb/silex
A blog post explaining how they work:
http://erikerlandson.github.io/blog/2016/02/08/efficient-multiplexing-for-spark-rdds/
def muxPartitions[U: ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[U],
                               persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => Iterator.single(itr.next()(j)) } }
}

def flatMuxPartitions[U: ClassTag](n: Int, f: (Int, Iterator[T]) => Seq[TraversableOnce[U]],
                                   persist: StorageLevel): Seq[RDD[U]] = {
  val mux = self.mapPartitionsWithIndex { case (id, itr) =>
    Iterator.single(f(id, itr))
  }.persist(persist)
  Vector.tabulate(n) { j => mux.mapPartitions { itr => itr.next()(j).toIterator } }
}
As mentioned elsewhere, these methods do involve a trade-off of memory for speed, because they operate by computing entire partition results "eagerly" instead of "lazily." Therefore, it is possible for these methods to run into memory problems on large partitions, where more traditional lazy transforms will not.
One way is to use a custom partitioner to partition the data depending upon your filter condition. This can be achieved by extending Partitioner and implementing something similar to the RangePartitioner.
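For instance, a minimal sketch of such a partitioner; the two-way split, the predicate, and the example keying are illustrative assumptions, and a real implementation would mirror RangePartitioner's range boundaries instead:
import org.apache.spark.Partitioner

// Route every key to partition 0 or 1 depending on a splitting predicate.
class SplitPartitioner(condition: Int => Boolean) extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int =
    if (condition(key.asInstanceOf[Int])) 0 else 1
}

// partitionBy needs a keyed RDD, e.g.:
// val partitioned = rdd.map(x => (x, x)).partitionBy(new SplitPartitioner(_ % 2 == 0)).values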
mapPartitions can then be used to construct multiple RDDs from the partitioned RDD without reading all the data:
import org.apache.spark.TaskContext

// rangeOfPartitionsToKeep is e.g. a Set[Int] of the partition ids you want to retain
val filtered = partitioned.mapPartitions { iter =>
  new Iterator[Int] {
    override def hasNext: Boolean = {
      // only yield elements from the partitions we want to keep
      if (rangeOfPartitionsToKeep.contains(TaskContext.get().partitionId)) {
        iter.hasNext
      } else {
        false
      }
    }
    override def next(): Int = iter.next()
  }
}
Just be aware that the number of partitions in the filtered RDDs will be the same as the number in the partitioned RDD so a coalesce should be used to reduce this down and remove the empty partitions.
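For example, a one-line sketch (assuming rangeOfPartitionsToKeep from the snippet above):
// merge down to roughly one partition per retained partition id, removing the empty ones
val compacted = filtered.coalesce(rangeOfPartitionsToKeep.size)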
If you split an RDD using the randomSplit API call, you get back an array of RDDs.
If you want 5 RDDs returned, pass in 5 weight values.
e.g.
val sourceRDD = sc.parallelize(1 to 100, 4)
val seedValue = 5
val splitRDD = sourceRDD.randomSplit(Array(1.0,1.0,1.0,1.0,1.0), seedValue)
splitRDD(1).collect()
res7: Array[Int] = Array(1, 6, 11, 12, 20, 29, 40, 62, 64, 75, 77, 83, 94, 96, 100)
I have a huge amount of data in a Cassandra DB, and I want to do aggregations like avg, max, and sum over some columns using the Spark Java API.
I tried the following:
cassandraRowsRDD
.select("name", "age", "ann_salaray", "dept","bucket", "resourceid", "salaray")
.where("timestamp = ?", "2018-01-09 00:00:00")
.withAscOrder()
I saw the method .aggregate(zeroValue, seqOp, combOp), but I don't know how to use it.
Expected :
max(salary column name)
avg(salary column name)
I have tried with CQL, but it fails because of the huge amount of data.
Can anyone give me an example of aggregation over Cassandra tables using the Spark Java API?
The first parameter provides the so-called "zero value" used to initialize the accumulator, the second is a function that takes the accumulator and a single value from your RDD, and the third is a function that takes two accumulators and combines them.
For your task you could use something like this (a Scala sketch; value.salary stands in for your salary accessor, and the Java API's aggregate takes the same zero value, seqOp, and combOp):
val res = rdd.aggregate((0, 0, 0))(
  (acc, value) => (acc._1 + 1,
                   acc._2 + value.salary,
                   if (acc._3 > value.salary) acc._3 else value.salary),
  (acc1, acc2) => (acc1._1 + acc2._1,
                   acc1._2 + acc2._2,
                   if (acc1._3 > acc2._3) acc1._3 else acc2._3))
val avg = res._2.toDouble / res._1
val max = res._3
In this case we have:
(0, 0, 0) - a tuple of 3 elements representing, respectively, the number of elements in the RDD, the sum of all salaries, and the max salary
a function that generates a new tuple from the accumulator and a value
a function that combines two tuples
Having the number of entries, the full sum of salaries, and the max, we can compute all the necessary figures.
I have a Cassandra server that is queried by another service, and I need to reduce the number of queries.
My first thought was to create a bloom filter of the whole database every couple of minutes and send it to the service,
but as I have a couple of hundred gigabytes in the database (expected to grow to a couple of terabytes), it doesn't seem like a good idea to put that load on the database every few minutes.
After a while of searching for a better solution, I remembered that Cassandra maintains its own bloom filters.
Is it possible to copy the *-Filter.db files and use them in my code instead of creating my own bloom filter?
I created a table test:
CREATE TABLE test (
a int PRIMARY KEY,
b int
);
and inserted one row:
INSERT INTO test(a,b) VALUES(1, 10);
After flushing the data to disk (e.g. with nodetool flush), we can use the *-Filter.db file. In my case it was la-2-big-Filter.db.
Here is sample code to check whether a partition key exists:
// the imports below assume Cassandra's internal classes (2.x/3.x package layout) are on the classpath
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import org.apache.cassandra.db.DecoratedKey;
import org.apache.cassandra.db.marshal.Int32Type;
import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.utils.FilterFactory;
import org.apache.cassandra.utils.IFilter;

Murmur3Partitioner partitioner = new Murmur3Partitioner();
try (DataInputStream in = new DataInputStream(new FileInputStream(new File("la-2-big-Filter.db")));
     IFilter filter = FilterFactory.deserialize(in, true)) {
    for (int i = 1; i <= 10; i++) {
        DecoratedKey decoratedKey = partitioner.decorateKey(Int32Type.instance.decompose(i));
        if (filter.isPresent(decoratedKey)) {
            System.out.println(i + " is present ");
        } else {
            System.out.println(i + " is not present ");
        }
    }
}
Output :
1 is present
2 is not present
3 is not present
4 is not present
5 is not present
6 is not present
7 is not present
8 is not present
9 is not present
10 is not present
I have two RDDs containing time information. RDDs are split in different partitions.
One is of the form
16:00:00
16:00:18
16:00:25
16:01:01
16:01:34
16:02:12
16:02:42
...
and another containing span of time in form of tuple2
<16:00:00, 16:00:59>
<16:01:00, 16:01:59>
<16:02:00, 16:02:59>
...
I need to aggregate the first and the second RDD, by aggregating values of the first according to values in the second, in order to obtain something like
<<16:00:00, 16:00:59>, [16:00:00,16:00:18,16:00:25]>
<<16:01:00, 16:01:59>, [16:01:01,16:01:34]>
<<16:02:00, 16:02:59>, [16:02:12,16:02:42]>
...
Or, in alternative, something like
<<16:00:00, 16:00:59>, 16:00:00>
<<16:00:00, 16:00:59>, 16:00:18>
<<16:00:00, 16:00:59>, 16:00:25>
<<16:01:00, 16:01:59>, 16:01:01>
<<16:01:00, 16:01:59>, 16:01:34>
<<16:02:00, 16:02:59>, 16:02:12>
<<16:02:00, 16:02:59>, 16:02:42>
...
I'm trying to use the whole range of Spark transformation functions, but I'm having a hard time finding one that works on RDDs of such different natures. I know I could go for a cartesian product and then filter, but I'd like a "better" solution. I tried zipPartitions, which may work, but I may have inconsistencies across partitions, e.g. 16:00:00 may end up in a partition where the corresponding aggregation value (the tuple <16:00:00, 16:00:59>) is not present.
Which is the best way to deal with this?
PS: I'm using Java, but Scala solutions are welcome as well.
Thanks
I've simplified the examples below to use ints, but I believe the same can be done with times. While the examples are in Scala, I suspect it can all be done in Java as well.
If the ranges are regular I'd turn the "values" RDD into (range, value) pairs and then do a simple join or group on that key.
val values = Seq(1, 5, 10, 14, 20)
val valuesRdd = sc.parallelize(values, 2)
valuesRdd.map(x => (((x/10)*10, ((x/10)*10)+9), x)).collect
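To get the <range, [values]> shape from the question, you could then group on that derived key (a small extension of the snippet above):
// RDD[((Int, Int), Iterable[Int])]: each value grouped under its enclosing range
valuesRdd
  .map(x => (((x / 10) * 10, ((x / 10) * 10) + 9), x))
  .groupByKey()
  .collect()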
However if the ranges are not regular then:
If you don't mind using DataFrames then an option would be to use a user defined function to create a column based on whether the value is in the given range, and join on that.
case class Range(low : Int, high :Int)
val ranges = Seq( Range(0,9), Range(10,19), Range(20,29));
val rangesDf = sc.parallelize(ranges, 2).toDF
case class Value(value : Int)
val values = Seq(Value(1), Value(5), Value(10), Value(14), Value(20))
val valuesDf = sc.parallelize(values, 2).toDF
val inRange = udf{(v: Int, low: Int, high : Int) => v >= low && v<= high}
rangesDf.join(valuesDf, inRange(valuesDf("value"), rangesDf("low"), rangesDf("high"))).show
The next option would be to explode out the ranges and join on the exploded version:
val rangesRdd = sc.parallelize(ranges, 2)
val explodedRange = rangesRdd
  .map(r => (r, List.range(r.low, r.high + 1)))
  .flatMap { case (range, lst) => lst.map(x => (x, range)) }
val valuesRdd = sc.parallelize(values, 2).map(v => (v.value, true))
valuesRdd.join(explodedRange).map(x => (x._2._2, x._1)).collect
As of now we provide client-side sorting on a Dojo datagrid. Now we need to add server-side sorting, meaning the sort should apply across all pages of the grid. We have 4 tables joined to the main table, with about 200,000 (2 lac) records at the moment, and it may increase. When the SQL executes it takes 5-8 minutes to fetch all the records into my Java code, where I need to apply some calculations over them, and I provide custom sorting using Comparators. We have one comparator per column.
My worry is how to get all of this data to the service-layer code in a short time. Is there a way to increase execution speed through data source configuration?
return new Comparator<QueryHS>() {
    public int compare(QueryHS object1, QueryHS object2) {
        int tatAbs = object1.getTatNb().intValue() - object1.getExternalUnresolvedMins().intValue();
        String negative = "";
        if (tatAbs < 0) {
            negative = "-";
        }
        String tatAbsStr = negative + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs / 60)), 2) + ":"
                + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs % 60)), 2);
        // object1.setTatNb(tatAbs);
        object1.setAbsTat(tatAbsStr.trim());
        int tatAbs2 = object2.getTatNb().intValue() - object2.getExternalUnresolvedMins().intValue();
        negative = "";
        if (tatAbs2 < 0) {
            negative = "-";
        }
        String tatAbsStr2 = negative + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs2 / 60)), 2) + ":"
                + FormatUtil.pad0(String.valueOf(Math.abs(tatAbs2 % 60)), 2);
        // object2.setTatNb(tatAbs2);
        object2.setAbsTat(tatAbsStr2.trim());
        if (tatAbs > tatAbs2)
            return 1;
        if (tatAbs < tatAbs2)
            return -1;
        return 0;
    }
};
You should not fetch all 200,000 records from the database into your application; fetch only what is needed.
As you have said, you have 4 tables joined to the main table, so you should have Hibernate entity classes for them with the corresponding mappings. Use pagination to fetch only the number of rows that you are showing to the user; Hibernate knows the tricks to make this work efficiently on your particular database.
You can even use aggregate functions such as count(), min(), max(), sum(), and avg() in your HQL to fetch only the relevant data.
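A minimal sketch of the pagination idea, assuming Hibernate 5 and that QueryHS (from your comparator code) is a mapped entity with a tatNb property; the HQL string and page parameters are illustrative only, and the same calls work from Java:
import org.hibernate.SessionFactory

// Fetch one grid page at a time instead of all rows.
def fetchPage(sessionFactory: SessionFactory, pageIndex: Int, pageSize: Int): java.util.List[QueryHS] = {
  val session = sessionFactory.openSession()
  try {
    session.createQuery("from QueryHS q order by q.tatNb", classOf[QueryHS])
      .setFirstResult(pageIndex * pageSize) // offset of the first row of this page
      .setMaxResults(pageSize)              // number of rows shown on one grid page
      .getResultList
  } finally session.close()
}
// Aggregates such as "select max(q.tatNb) from QueryHS q" can be run the same way,
// letting the database do the work instead of a Java-side calculation.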