Multiple Apache Flink window validations - Java

I'm just getting started with stream processing using Apache Flink. The thing is that I'm receiving a stream of JSON that looks like this:
{
  "token_id": "tok_afgtryuo",
  "ip_address": "128.123.45.1",
  "device_fingerprint": "abcghift",
  "card_hash": "hgtyuigash",
  "bin_number": "424242",
  "last4": "4242",
  "name": "Seu Jorge"
}
And I was asked if I could fulfill the following business rules:
Decline if number of tokens > 5 for this IP in last 10 seconds
Decline if number of tokens > 15 for this IP in last minute
Decline if number of tokens > 60 for this IP in last hour
I made two classes: a main class where I create an instance to call the window function with different parameters, to avoid duplicate code:
Main.java
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//This DataStream Would be Converting the Json to a Token Object
DataStream<Token> baseStream =
env.addSource(new SocketTextStreamFunction("localhost",
9999,
"\n",
1))
.map(new MapTokens());
// 1- First rule Decline if number of tokens > 5 for this IP in last 10 seconds
DataStreamSink<String> response1 = new RuleMaker().getStreamKeyCount(baseStream, "ip", Time.seconds(10),
5, "seconds").print();
//2 -Decline if number of tokens > 15 for this IP in last minute
DataStreamSink<String> response2 = new RuleMaker().getStreamKeyCount(baseStream, "ip", Time.minutes(1),
62, "minutes").print();
//3- Decline if number of tokens > 60 for this IP in last hour
DataStreamSink<String> response3 = new RuleMaker().getStreamKeyCount(baseStream, "ip", Time.hours(1),
60, "Hours").print();
env.execute("Job2");
}
And another class where I'm doing all the logic for the rules: I'm counting the times an IP address appears and, if it appears more than the allowed number of times within the time window, I return a message with some information:
RuleMaker.java
public class RuleMaker {
public DataStream<String> getStreamKeyCount(DataStream<Token> stream,
String tokenProp,
Time time,
Integer maxPetitions,
String ruleType){
return
stream
.flatMap(new FlatMapFunction<Token, Tuple3<String, Integer, String>>() {
@Override
public void flatMap(Token token, Collector<Tuple3<String, Integer, String>> collector) throws Exception {
String tokenSelection = "";
switch (tokenProp)
{
case "ip":
tokenSelection = token.getIpAddress();
break;
case "device":
tokenSelection = token.getDeviceFingerprint();
break;
case "cardHash":
tokenSelection = token.getCardHash();
break;
}
collector.collect(new Tuple3<>(tokenSelection, 1, token.get_tokenId()));
}
})
.keyBy(0)
.timeWindow(time)
.process(new MyProcessWindowFunction(maxPetitions, ruleType));
}
//Class to process the elements from the window
private class MyProcessWindowFunction extends ProcessWindowFunction<
Tuple3<String, Integer, String>,
String,
Tuple,
TimeWindow
> {
private Integer _maxPetitions;
private String _ruleType;
public MyProcessWindowFunction(Integer maxPetitions, String ruleType) {
this._maxPetitions = maxPetitions;
this._ruleType = ruleType;
}
@Override
public void process(Tuple tuple, Context context, Iterable<Tuple3<String, Integer, String>> iterable, Collector<String> out) throws Exception {
Integer counter = 0;
for (Tuple3<String, Integer, String> element : iterable) {
counter += element.f1++;
if(counter > _maxPetitions){
out.collect("El elemeto ha sido declinado: " + element.f2 + " Num elements: " + counter + " rule type: " + _ruleType + " token: " + element.f0 );
counter = 0;
}
}
}
}
}
So far, I think this code is working, but I'm a beginner with Apache Flink, and I'd appreciate it a lot if you could tell me whether there's something wrong with the way I'm trying to work with this and point me in the right direction.
Thanks a lot.

The general approach looks very good, although I would have thought that the Table API, which supports JSON out of the box, would be powerful enough to help you (and be more concise).
If you want to stick to the DataStream API, the switch on tokenProp in getStreamKeyCount should be replaced by passing a key extractor into getStreamKeyCount, so that there is only one place to add new rules.
public DataStream<String> getStreamKeyCount(DataStream<Token> stream,
KeySelector<Token, String> keyExtractor,
Time time,
Integer maxPetitions,
String ruleType){
return stream
.map(token -> new Tuple3<>(keyExtractor.getKey(token), 1, token.get_tokenId()))
.keyBy(0)
.timeWindow(time)
.process(new MyProcessWindowFunction(maxPetitions, ruleType));
}
Then the invocation becomes
DataStreamSink<String> response2 = ruleMaker.getStreamKeyCount(baseStream,
Token::getIpAddress, Time.minutes(1), 62, "minutes");
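For completeness, here is a rough sketch (my addition, not from the original answer) of how all three rules, plus a device-based one, could be wired through the same key extractor; the thresholds are taken from the business rules stated in the question:
RuleMaker ruleMaker = new RuleMaker();
// One RuleMaker, one place to add rules, different key extractors per rule
ruleMaker.getStreamKeyCount(baseStream, Token::getIpAddress, Time.seconds(10), 5, "seconds").print();
ruleMaker.getStreamKeyCount(baseStream, Token::getIpAddress, Time.minutes(1), 15, "minutes").print();
ruleMaker.getStreamKeyCount(baseStream, Token::getIpAddress, Time.hours(1), 60, "hours").print();
// A rule on another attribute only needs a different extractor, e.g. the device fingerprint
ruleMaker.getStreamKeyCount(baseStream, Token::getDeviceFingerprint, Time.minutes(1), 15, "device/minute").print();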


How to use parallel processing in the most efficient and elegant way in Java

I have different sources of data from which I want to request in parallel (since each of these requests is an HTTP call and may be pretty time-consuming). But I'm going to use only one response from these requests. So I kind of prioritize them: if the first response is invalid, I'm going to check the second one. If it's also invalid, I want to use the third, etc.
But I want to stop processing and return the result as soon as I receive the first correct response.
To simulate the problem I created the following code, where I'm trying to use Java parallel streams. But the problem is that I receive the final results only after processing all requests.
public class ParallelExecution {
private static Supplier<Optional<Integer>> testMethod(String strInt) {
return () -> {
Optional<Integer> result = Optional.empty();
try {
result = Optional.of(Integer.valueOf(strInt));
System.out.printf("converted string %s to int %d\n",
strInt,
result.orElse(null));
} catch (NumberFormatException ex) {
System.out.printf("CANNOT CONVERT %s to int\n", strInt);
}
try {
int randomValue = result.orElse(10000);
TimeUnit.MILLISECONDS.sleep(randomValue);
System.out.printf("converted string %s to int %d in %d milliseconds\n",
strInt,
result.orElse(null), randomValue);
} catch (InterruptedException e) {
e.printStackTrace();
}
return result;
};
}
public static void main(String[] args) {
Instant start = Instant.now();
System.out.println("Starting program: " + start.toString());
List<Supplier<Optional<Integer>>> listOfFunctions = new ArrayList();
for (String arg: args) {
listOfFunctions.add(testMethod(arg));
}
Integer value = listOfFunctions.parallelStream()
.map(function -> function.get())
.filter(optValue -> optValue.isPresent()).map(val-> {
System.out.println("************** VAL: " + val);
return val;
}).findFirst().orElse(null).get();
Instant end = Instant.now();
Long diff = end.toEpochMilli() - start.toEpochMilli();
System.out.println("final value:" + value + ", worked during " + diff + "ms");
}
}
So when I execute the program using the following command:
$java ParallelExecution dfafj 34 1341 4656 dfad 245df 5767
I want to get the result "34" as soon as possible (after around 34 milliseconds), but in fact I'm waiting for more than 10 seconds.
Could you help to find the most efficient solution for this problem?
ExecutorService#invokeAny looks like a good option.
List<Callable<Optional<Integer>>> tasks = listOfFunctions
.stream()
.<Callable<Optional<Integer>>>map(f -> f::get)
.collect(Collectors.toList());
ExecutorService service = Executors.newCachedThreadPool();
Optional<Integer> value = service.invokeAny(tasks);
service.shutdown();
I converted your List<Supplier<Optional<Integer>>> into a List<Callable<Optional<Integer>>> to be able to pass it to invokeAny. You could build Callables from the start instead. Then I created an ExecutorService and submitted the tasks.
The result of the first task that completes successfully will be returned as soon as it is available. The other tasks will end up interrupted.
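One caveat (my note, not part of the original answer): invokeAny treats any task that returns without throwing as successful, so a supplier that returns an empty Optional still counts as a result. If the goal is to wait for the first valid conversion, the Callables could throw on empty results; a minimal sketch, assuming that behaviour is wanted:
// Make invalid conversions fail the task so invokeAny keeps waiting for a valid one
List<Callable<Integer>> strictTasks = listOfFunctions
        .stream()
        .<Callable<Integer>>map(f -> () -> f.get()
                .orElseThrow(() -> new IllegalStateException("invalid response")))
        .collect(Collectors.toList());
ExecutorService strictService = Executors.newCachedThreadPool();
Integer firstValid = strictService.invokeAny(strictTasks); // first non-empty result
strictService.shutdownNow();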
You also may want to look into CompletionService.
List<Callable<Optional<Integer>>> tasks = Arrays
.stream(args)
.<Callable<Optional<Integer>>>map(arg -> () -> testMethod(arg).get())
.collect(Collectors.toList());
final ExecutorService underlyingService = Executors.newCachedThreadPool();
final ExecutorCompletionService<Optional<Integer>> service = new ExecutorCompletionService<>(underlyingService);
tasks.forEach(service::submit);
Optional<Integer> value = service.take().get();
underlyingService.shutdownNow();
You can use a queue to put your results in:
private static void testMethod(String strInt, BlockingQueue<Integer> queue) {
// your code, but instead of returning anything:
result.ifPresent(queue::add);
}
and then call it with
for (String s : args) {
CompletableFuture.runAsync(() -> testMethod(s, queue));
}
Integer result = queue.take();
Note that this will only handle the first result, as in your sample.
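One thing to watch with the queue variant (again my note, not the answer's): if none of the inputs is valid, queue.take() blocks forever. A poll with a timeout avoids that; a sketch, assuming a 15-second overall budget is acceptable:
// Bounded wait instead of a potentially infinite take()
BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
for (String s : args) {
    CompletableFuture.runAsync(() -> testMethod(s, queue));
}
Integer result = queue.poll(15, TimeUnit.SECONDS); // null if no valid conversion arrived in time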
I have tried it using CompletableFutures and the anyOf method. It will return when any one of the futures is completed. Now, the key to stopping the other tasks is to provide your own executor service to the CompletableFutures and shut it down when required.
public static void main(String[] args) {
Instant start = Instant.now();
System.out.println("Starting program: " + start.toString());
CompletableFuture<Optional<Integer>> completableFutures[] = new CompletableFuture[args.length];
ExecutorService es = Executors.newFixedThreadPool(args.length,r -> {
Thread t = new Thread(r);
t.setDaemon(false);
return t;
});
for (int i = 0;i < args.length; i++) {
completableFutures[i] = CompletableFuture.supplyAsync(testMethod(args[i]),es);
}
CompletableFuture.anyOf(completableFutures).
thenAccept(res-> {
System.out.println("Result - " + res + ", Time Taken : " + (Instant.now().toEpochMilli()-start.toEpochMilli()));
es.shutdownNow();
});
}
PS: It will throw InterruptedExceptions that you can catch and ignore (without printing the stack trace). Also, your thread pool size should ideally be the same as the length of the args array.

LDA in Spark 1.3.1. Converting raw data into Term Document Matrix?

I'm trying out LDA with Spark 1.3.1 in Java and got this error:
Error: application failed with exception
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: "��"
My .txt file looks like this:
put weight find difficult pull ups push ups now
blindness diseases everything eyes work perfectly except ability take light use light form images
role model kid
Dear recall saddest memory childhood
This is the code:
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;
public class JavaLDA {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("LDA Example");
JavaSparkContext sc = new JavaSparkContext(conf);
// Load and parse the data
String path = "/tutorial/input/askreddit20150801.txt";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Vector> parsedData = data.map(
new Function<String, Vector>() {
public Vector call(String s) {
String[] sarray = s.trim().split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++)
values[i] = Double.parseDouble(sarray[i]);
return Vectors.dense(values);
}
}
);
// Index documents with unique IDs
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
return doc_id.swap();
}
}
));
corpus.cache();
// Cluster the documents into three topics using LDA
LDAModel ldaModel = new LDA().setK(100).run(corpus);
// Output topics. Each is a distribution over words (matching word count vectors)
System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ " words):");
Matrix topics = ldaModel.topicsMatrix();
for (int topic = 0; topic < 100; topic++) {
System.out.print("Topic " + topic + ":");
for (int word = 0; word < ldaModel.vocabSize(); word++) {
System.out.print(" " + topics.apply(word, topic));
}
System.out.println();
}
ldaModel.save(sc.sc(), "myLDAModel");
}
}
Does anyone know why this happened? I'm trying Spark's LDA for the first time. Thanks.
values[i] = Double.parseDouble(sarray[i]);
Why are you trying to convert each word of your text file into a Double?
That's the answer to your issue:
http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#parseDouble%28java.lang.String%29
Your code is expecting the input file to be a bunch of lines of whitespace-separated text that looks like numbers. Assuming your text is actually words instead:
Get a list of every word that appears in your corpus:
JavaRDD<String> words =
data.flatMap((FlatMapFunction<String, String>) s -> {
s = s.replaceAll("[^a-zA-Z ]", "");
s = s.toLowerCase();
return Arrays.asList(s.split(" "));
});
Make a map giving each word an integer associated with it:
Map<String, Long> vocab = words.zipWithIndex().collectAsMap();
Then instead of your parsedData doing what it's doing up there, make it look up each word, find the associated number, go to that location in an array, and add 1 to the count for that word.
JavaRDD<Vector> tokens = data.map(
(Function<String, Vector>) s -> {
String[] vals = s.split("\\s");
double[] idx = new double[vocab.size() + 1];
for (String val : vals) {
idx[vocab.get(val).intValue()] += 1.0;
}
return Vectors.dense(idx);
}
);
This results in an RDD of vectors, where each vector is vocab.size() long, and each spot in the vector is the count of how many times that vocab word appeared in the line.
I modified this code slightly from what I'm currently using and didn't test it, so there could be errors in it. Good luck!
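To tie this back to the original code, the tokens RDD would then take the place of parsedData when building the corpus, roughly like this (same caveat: an untested sketch):
// Index the word-count vectors with unique IDs and run LDA as before
JavaRDD<Tuple2<Long, Vector>> indexed =
        tokens.zipWithIndex().map(doc -> doc.swap());
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(indexed);
corpus.cache();
LDAModel ldaModel = new LDA().setK(100).run(corpus);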

How do I determine an offset in Apache Spark?

I'm searching through some data files (~20GB). I'd like to find some specific terms in that data and mark the offset for the matches. Is there a way to have Spark identify the offset for the chunk of data I'm operating on?
import org.apache.spark.api.java.*;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import java.util.regex.*;
public class Grep {
public static void main( String args[] ) {
SparkConf conf = new SparkConf().setMaster( "spark://ourip:7077" );
JavaSparkContext jsc = new JavaSparkContext( conf );
JavaRDD<String> data = jsc.textFile( "hdfs://ourip/test/testdata.txt" ); // load the data from HDFS
JavaRDD<String> filterData = data.filter( new Function<String, Boolean>() {
// I'd like to do something here to get the offset in the original file of the string "babe ruth"
public Boolean call( String s ) { return s.toLowerCase().contains( "babe ruth" ); } // case insens matching
});
long matches = filterData.count(); // count the hits
// execute the RDD filter
System.out.println( "Lines with search terms: " + matches );
} // end main
} // end class Grep
I'd like to do something in the "filter" operation to compute the offset of "babe ruth" in the original file. I can get the offset of "babe ruth" in the current line, but what's the process or function that tells me the offset of the line within the file?
In Spark, the common Hadoop input formats can be used. To read the byte offset from the file, you can use the class TextInputFormat from Hadoop (org.apache.hadoop.mapreduce.lib.input). It is already bundled with Spark.
It will read the file as key (byte offset) and value (text line):
An InputFormat for plain text files. Files are broken into lines. Either linefeed or carriage-return are used to signal end of line. Keys are the position in the file, and values are the line of text.
In Spark it can be used by calling newAPIHadoopFile()
SparkConf conf = new SparkConf().setMaster("");
JavaSparkContext jsc = new JavaSparkContext(conf);
// read the content of the file using Hadoop format
JavaPairRDD<LongWritable, Text> data = jsc.newAPIHadoopFile(
"file_path", // input path
TextInputFormat.class, // used input format class
LongWritable.class, // class of the key
Text.class, // class of the value
new Configuration());
JavaRDD<String> mapped = data.map(new Function<Tuple2<LongWritable, Text>, String>() {
@Override
public String call(Tuple2<LongWritable, Text> tuple) throws Exception {
// you will get each line from as a tuple (offset, text)
long pos = tuple._1().get(); // extract offset
String line = tuple._2().toString(); // extract text
return pos + " " + line;
}
});
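And, tying it to the original goal, the offsets of the actual matches could then be derived from those tuples, for example (a sketch along the same lines, using the "babe ruth" term from the question):
// Keep only lines containing the term and report the term's offset within the whole file
JavaRDD<String> hits = data
    .filter(new Function<Tuple2<LongWritable, Text>, Boolean>() {
        @Override
        public Boolean call(Tuple2<LongWritable, Text> tuple) throws Exception {
            return tuple._2().toString().toLowerCase().contains("babe ruth");
        }
    })
    .map(new Function<Tuple2<LongWritable, Text>, String>() {
        @Override
        public String call(Tuple2<LongWritable, Text> tuple) throws Exception {
            long lineOffset = tuple._1().get();                                    // offset of the line in the file
            int inLine = tuple._2().toString().toLowerCase().indexOf("babe ruth"); // offset inside the line
            return (lineOffset + inLine) + ": " + tuple._2().toString();
        }
    });
System.out.println("Lines with search terms: " + hits.count());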
You could use the wholeTextFiles(String path, int minPartitions) method from JavaSparkContext to return a JavaPairRDD<String,String> where the key is filename and the value is a string containing the entire content of a file (thus, each record in this RDD represents a file). From here, simply run a map() that will call indexOf(String searchString) on each value. This should return the first index in each file with the occurrence of the string in question.
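A rough sketch of that approach (the directory path and partition count are placeholders):
// Each record is (file path, entire file content); indexOf gives a file-level offset, -1 if absent
final String searchString = "babe ruth";
JavaPairRDD<String, String> files = jsc.wholeTextFiles("hdfs://ourip/test/", 4);
JavaPairRDD<String, Integer> offsets = files.mapValues(new Function<String, Integer>() {
    @Override
    public Integer call(String content) throws Exception {
        return content.toLowerCase().indexOf(searchString);
    }
});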
(EDIT:)
So finding the offset in a distributed fashion for one file (per your use case below in the comments) is possible. Below is an example that works in Scala.
val searchString = *search string*
val rdd1 = sc.textFile(*input file*, *num partitions*)
// Zip RDD lines with their indices
val zrdd1 = rdd1.zipWithIndex()
// Find the first RDD line that contains the string in question
val firstFind = zrdd1.filter { case (line, index) => line.contains(searchString) }.first()
// Grab all lines before the line containing the search string and sum up all of their lengths (and then add the inline offset)
val filterLines = zrdd1.filter { case (line, index) => index < firstFind._2 }
val offset = filterLines.map { case (line, index) => line.length }.reduce(_ + _) + firstFind._1.indexOf(searchString)
Note that you would additionally need to add any new line characters manually on top of this since they are not accounted for (the input format uses new lines as demarcations between records). The number of new lines is simply the number of lines before the line containing the search string so this is trivial to add.
I'm not entirely familiar with the Java API unfortunately and it's not exactly easy to test so I'm not sure if the code below works but have at it (Also, I used Java 1.7 but 1.8 compresses a lot of this code with lambda expressions.):
final String searchString = *search string*;
JavaRDD<String> data = jsc.textFile("hdfs://ourip/test/testdata.txt");
JavaPairRDD<String, Long> zrdd1 = data.zipWithIndex();
final Tuple2<String, Long> firstFind = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
    public Boolean call(Tuple2<String, Long> input) { return input._1().contains(searchString); }
}).first();
JavaPairRDD<String, Long> filterLines = zrdd1.filter(new Function<Tuple2<String, Long>, Boolean>() {
    public Boolean call(Tuple2<String, Long> input) { return input._2() < firstFind._2(); }
});
long offset = filterLines.map(new Function<Tuple2<String, Long>, Integer>() {
    public Integer call(Tuple2<String, Long> input) { return input._1().length(); }
}).reduce(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer a, Integer b) { return a + b; }
}) + firstFind._1().indexOf(searchString);
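And the newline adjustment mentioned above would then roughly be (a sketch, assuming single-character "\n" line endings):
// One newline per preceding line; firstFind._2() is the zero-based index of the matching line
long offsetWithNewlines = offset + firstFind._2();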
This can only be done when your input is one file (since otherwise, zipWithIndex() wouldn't guarantee offsets within a file) but this method works for an RDD of any number of partitions so feel free to partition your file up into any number of chunks.

How to measure the time Spark needs to run an action on partitioned RDD?

I wrote a small Spark application which should measure the time that Spark needs to run an action on a partitioned RDD (combineByKey function to sum a value).
My problem is that the first iteration seems to work correctly (calculated duration ~25 ms), but the next ones take much less time (~5 ms). It seems to me that Spark persists the data without any request to do so. Can I avoid that programmatically?
I have to know the duration that Spark needs to calculate a new RDD (without any caching/persisting of earlier iterations) --> I think the duration should always be about 20-25 ms!
To ensure the recalculation I moved the SparkContext creation into the for-loop, but this didn't change anything...
Thanks for your advice!
Here is my code, which seems to persist the data:
public static void main(String[] args) {
switchOffLogging();
// now
try {
// Setup: Read out parameters & initialize SparkContext
String path = args[0];
SparkConf conf = new SparkConf(true);
JavaSparkContext sc;
// Create output file & writer
System.out.println("\npar.\tCount\tinput.p\tcons.p\tTime");
// The RDDs used for the benchmark
JavaRDD<String> input = null;
JavaPairRDD<Integer, String> pairRDD = null;
JavaPairRDD<Integer, String> partitionedRDD = null;
JavaPairRDD<Integer, Float> consumptionRDD = null;
// Do the tasks iterative (10 times the same benchmark for testing)
for (int i = 0; i < 10; i++) {
boolean partitioning = true;
int partitionsCount = 8;
sc = new JavaSparkContext(conf);
setS3credentials(sc, path);
input = sc.textFile(path);
pairRDD = mapToPair(input);
partitionedRDD = partition(pairRDD, partitioning, partitionsCount);
// Measure the duration
long duration = System.currentTimeMillis();
// Do the relevant function
consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
duration = System.currentTimeMillis() - duration;
// Do some action to invoke the calculation
System.out.println(consumptionRDD.collect().size());
// Print the results
System.out.println("\n" + partitioning + "\t" + partitionsCount + "\t" + input.partitions().size() + "\t" + consumptionRDD.partitions().size() + "\t" + duration + " ms");
input = null;
pairRDD = null;
partitionedRDD = null;
consumptionRDD = null;
sc.close();
sc.stop();
}
} catch (Exception e) {
e.printStackTrace();
System.out.println(e.getMessage());
}
}
Some helper functions (should not be the problem):
private static void switchOffLogging() {
Logger.getLogger("org").setLevel(Level.OFF);
Logger.getLogger("akka").setLevel(Level.OFF);
}
private static void setS3credentials(JavaSparkContext sc, String path) {
if (path.startsWith("s3n://")) {
Configuration hadoopConf = sc.hadoopConfiguration();
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
hadoopConf.set("fs.s3n.awsAccessKeyId", "mycredentials");
hadoopConf.set("fs.s3n.awsSecretAccessKey", "mycredentials");
}
}
// Initial element
private static Function<String, Float> createCombiner = new Function<String, Float>() {
public Float call(String dataSet) throws Exception {
String[] data = dataSet.split(",");
float value = Float.valueOf(data[2]);
return value;
}
};
// merging function for a new dataset
private static Function2<Float, String, Float> mergeValue = new Function2<Float, String, Float>() {
public Float call(Float sumYet, String dataSet) throws Exception {
String[] data = dataSet.split(",");
float value = Float.valueOf(data[2]);
sumYet += value;
return sumYet;
}
};
// function to sum the consumption
private static Function2<Float, Float, Float> mergeCombiners = new Function2<Float, Float, Float>() {
public Float call(Float a, Float b) throws Exception {
a += b;
return a;
}
};
private static JavaPairRDD<Integer, String> partition(JavaPairRDD<Integer, String> pairRDD, boolean partitioning, int partitionsCount) {
if (partitioning) {
return pairRDD.partitionBy(new HashPartitioner(partitionsCount));
} else {
return pairRDD;
}
}
private static JavaPairRDD<Integer, String> mapToPair(JavaRDD<String> input) {
return input.mapToPair(new PairFunction<String, Integer, String>() {
public Tuple2<Integer, String> call(String debsDataSet) throws Exception {
String[] data = debsDataSet.split(",");
int houseId = Integer.valueOf(data[6]);
return new Tuple2<Integer, String>(houseId, debsDataSet);
}
});
}
And finally the output of the Spark console:
part. Count input.p cons.p Time
true 8 6 8 20 ms
true 8 6 8 23 ms
true 8 6 8 7 ms // Too less!!!
true 8 6 8 21 ms
true 8 6 8 13 ms
true 8 6 8 6 ms // Too less!!!
true 8 6 8 5 ms // Too less!!!
true 8 6 8 6 ms // Too less!!!
true 8 6 8 4 ms // Too less!!!
true 8 6 8 7 ms // Too less!!!
I have found a solution now: I wrote a separate class which calls the spark-submit command in a new process. This can be done in a loop, so every benchmark is started in a new process and the SparkContext is separate per process. So garbage collection is done and everything works fine!
String submitCommand = "/root/spark/bin/spark-submit " + submitParams + " --class partitioning.PartitionExample /root/partitioning.jar " + javaFlags;
Process p = Runtime.getRuntime().exec(submitCommand);
BufferedReader reader;
String line;
System.out.println(p.waitFor());
reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
while ((line = reader.readLine()) != null) {
System.out.println(line);
}
If the shuffle output is small enough, then the Spark shuffle files will be written to the OS buffer cache, as fsync is not explicitly called... this means that, as long as there is room, your data will remain in memory.
If a cold performance test is truly necessary, then you can try something like this attempt to flush the disk, but that is going to slow things down in between each test. Could you just spin the context up and down? That might solve your need.

Solr Performance for many documents query

I want to have Solr always retrieve all documents found by a search (I know Solr wasn't built for that, but anyway), and I am currently doing it with this code:
...
QueryResponse response = solr.query(query);
int offset = 0;
int totalResults = (int) response.getResults().getNumFound();
List<Article> ret = new ArrayList<Article>(totalResults);
query.setRows(FETCH_SIZE);
while(offset < totalResults) {
//requires an int? wtf?
query.setStart((int) offset);
int left = totalResults - offset;
if(left < FETCH_SIZE) {
query.setRows(left);
}
response = solr.query(query);
List<Article> current = response.getBeans(Article.class);
offset += current.size();
ret.addAll(current);
}
...
This works, but it is pretty slow if a query gets over 1000 hits (I've read about that on here; it is caused by Solr because I am setting the start every time, which, for some reason, takes some time). What would be a nicer (and faster) way to do this?
To improve on the suggested answer, you could use a streamed response. This was added especially for the case where one fetches all results. As you can see in Solr's Jira, that guy wants to do the same thing you do. It has been implemented for Solr 4.
This is also described in Solrj's javadoc.
Solr will pack the response and create a whole XML/JSON document before it starts sending the response. Then your client is required to unpack all that and offer it as a list to you. By using streaming and parallel processing, which you can do when using such a queued approach, the performance should improve further.
Yes, you will lose the automatic bean mapping, but since performance is a factor here, I think this is acceptable.
Here is a sample unit test:
public class StreamingTest {
@Test
public void streaming() throws SolrServerException, IOException, InterruptedException {
HttpSolrServer server = new HttpSolrServer("http://your-server");
SolrQuery tmpQuery = new SolrQuery("your query");
tmpQuery.setRows(Integer.MAX_VALUE);
final BlockingQueue<SolrDocument> tmpQueue = new LinkedBlockingQueue<SolrDocument>();
server.queryAndStreamResponse(tmpQuery, new MyCallbackHander(tmpQueue));
SolrDocument tmpDoc;
do {
tmpDoc = tmpQueue.take();
} while (!(tmpDoc instanceof PoisonDoc));
}
private class PoisonDoc extends SolrDocument {
// marker to finish queuing
}
private class MyCallbackHander extends StreamingResponseCallback {
private BlockingQueue<SolrDocument> queue;
private long currentPosition;
private long numFound;
public MyCallbackHander(BlockingQueue<SolrDocument> aQueue) {
queue = aQueue;
}
@Override
public void streamDocListInfo(long aNumFound, long aStart, Float aMaxScore) {
// called before start of streaming
// probably use for some statistics
currentPosition = aStart;
numFound = aNumFound;
if (numFound == 0) {
queue.add(new PoisonDoc());
}
}
@Override
public void streamSolrDocument(SolrDocument aDoc) {
currentPosition++;
System.out.println("adding doc " + currentPosition + " of " + numFound);
queue.add(aDoc);
if (currentPosition == numFound) {
queue.add(new PoisonDoc());
}
}
}
}
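Building on that sample, the parallel-processing part could look roughly like the sketch below: one thread consumes the queue and maps documents while the response is still streaming in. The Article(SolrDocument) constructor is hypothetical here, since the automatic bean mapping is lost.
// Consume and map documents concurrently with the streaming response
ExecutorService consumerPool = Executors.newSingleThreadExecutor();
Future<List<Article>> mapped = consumerPool.submit(new Callable<List<Article>>() {
    @Override
    public List<Article> call() throws Exception {
        List<Article> result = new ArrayList<Article>();
        SolrDocument doc;
        while (!((doc = tmpQueue.take()) instanceof PoisonDoc)) {
            result.add(new Article(doc)); // hypothetical manual mapping
        }
        return result;
    }
});
server.queryAndStreamResponse(tmpQuery, new MyCallbackHander(tmpQueue));
List<Article> articles = mapped.get(); // all documents, mapped while streaming
consumerPool.shutdown();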
You might improve performance by increasing FETCH_SIZE. Since you are getting all the results, pagination doesn't make sense unless you are concerned with memory or some such. If 1000 results are liable to cause a memory overflow, I'd say your current performance seems pretty outstanding though.
So I would try getting everything at once, simplifying this to something like:
//WHOLE_BUNCHES is a constant representing a reasonable max number of docs we want to pull here.
//Integer.MAX_VALUE would probably invite an OutOfMemoryError, but that would be true of the
//implementation in the question anyway, since they were still being stored in the list at the end.
query.setRows(WHOLE_BUNCHES);
QueryResponse response = solr.query(query);
int totalResults = (int) response.getResults().getNumFound(); //If you even still need this figure.
List<Article> ret = response.getBeans(Article.class);
If you need to keep the pagination though:
You are performing this first query:
QueryResponse response = solr.query(query);
and are populating the number of found results from it, but you are not pulling any results with the response. Even if you keep pagination here, you could at least eliminate one extra query here.
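A sketch of that tweak, reusing the first response instead of discarding it:
query.setRows(FETCH_SIZE);
QueryResponse response = solr.query(query);
int totalResults = (int) response.getResults().getNumFound();
List<Article> ret = new ArrayList<Article>(totalResults);
ret.addAll(response.getBeans(Article.class)); // use the first page instead of throwing it away
int offset = ret.size();
while (offset < totalResults) {
    query.setStart(offset);
    response = solr.query(query);
    List<Article> current = response.getBeans(Article.class);
    offset += current.size();
    ret.addAll(current);
}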
This:
int left = totalResults - offset;
if(left < FETCH_SIZE) {
query.setRows(left);
}
Is unnecessary. setRows specifies a maximum number of rows to return, so asking for more than are available won't cause any problems.
Finally, apropos of nothing, but I have to ask: what argument would you expect setStart to take if not an int?
Use the logic below to fetch Solr data in batches, to optimize the performance of the Solr data-fetch query:
public List<Map<String, Object>> getData(int id,Set<String> fields){
final int SOLR_QUERY_MAX_ROWS = 3;
long start = System.currentTimeMillis();
SolrQuery query = new SolrQuery();
String queryStr = "id:" + id;
LOG.info(queryStr);
query.setQuery(queryStr);
query.setRows(SOLR_QUERY_MAX_ROWS);
QueryResponse rsp = server.query(query, SolrRequest.METHOD.POST);
List<Map<String, Object>> mapList = null;
if (rsp != null) {
long total = rsp.getResults().getNumFound();
System.out.println("Total count found: " + total);
// Solr query batch
mapList = new ArrayList<Map<String, Object>>();
if (total <= SOLR_QUERY_MAX_ROWS) {
addAllData(mapList, rsp,fields);
} else {
int marker = SOLR_QUERY_MAX_ROWS;
do {
if (rsp != null) {
addAllData(mapList, rsp,fields);
}
query.setStart(marker);
rsp = server.query(query, SolrRequest.METHOD.POST);
marker = marker + SOLR_QUERY_MAX_ROWS;
} while (marker <= total);
}
}
long end = System.currentTimeMillis();
LOG.debug("SOLR Performance: getData: " + (end - start));
return mapList;
}
private void addAllData(List<Map<String, Object>> mapList, QueryResponse rsp,Set<String> fields) {
for (SolrDocument sdoc : rsp.getResults()) {
Map<String, Object> map = new HashMap<String, Object>();
for (String field : fields) {
map.put(field, sdoc.getFieldValue(field));
}
mapList.add(map);
}
}
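A hypothetical call site for the method above (the id and field names are placeholders):
// Fetch the "id" and "title" fields for record id 42, in batches of SOLR_QUERY_MAX_ROWS
Set<String> fields = new HashSet<String>(Arrays.asList("id", "title"));
List<Map<String, Object>> rows = getData(42, fields);
System.out.println("Fetched " + rows.size() + " documents");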
