I have created an Apache Ignite application with Spark
Ignite Version - 1.6.0
Spark Version - 1.5.2 (Built on Scala 2.11)
The application stores two tuples in an IgniteRDD.
When retrieve is called, the collect function takes more than 3 minutes.
The number of jobs submitted is more than 1000.
Code snippet:
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import org.apache.ignite.spark.IgniteContext;
import org.apache.ignite.spark.IgniteRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
public class CopyOfMainIgnite {

    public static void main(String args[]) {
        SparkConf conf = new SparkConf().setAppName("Demo").setMaster(
                "spark://169.254.228.183:7077");
        System.out.println("Spark conf initialized.");
        JavaSparkContext sc = new JavaSparkContext(conf);
        sc.addJar("./target/IgnitePOC-0.0.1-SNAPSHOT-jar-with-dependencies.jar");
        System.out.println("Spark context initialized.");
        IgniteContext ic = new IgniteContext(sc.sc(),
                "ignite/client-default-config.xml");
        System.out.println("Ignite Context initialized.");
        String cacheName = "demo6";
        save(sc, ic, cacheName);
        retrieve(ic, cacheName);
        ic.close(false);
        sc.close();
    }

    private static void retrieve(IgniteContext ic, String cacheName) {
        System.out.println("Getting IgniteRDD saved.");
        IgniteRDD<String, String> javaIRDDRet = ic.fromCache(cacheName);
        long temp1 = System.currentTimeMillis();
        JavaRDD<Tuple2<String, String>> javardd = javaIRDDRet.toJavaRDD();
        System.out.println("Is empty Start Time: " + System.currentTimeMillis());
        System.out.println("javaIRDDRet.isEmpty(): " + javardd.isEmpty());
        System.out.println("Is empty End Time: " + System.currentTimeMillis());
        long temp2 = System.currentTimeMillis();
        long temp3 = System.currentTimeMillis();
        System.out.println("collect and println Start Time: "
                + System.currentTimeMillis());
        javardd.collect().forEach(System.out::println);
        System.out.println("collect and println End Time: "
                + System.currentTimeMillis());
        long temp4 = System.currentTimeMillis();
        System.out.println("Is empty : " + temp1 + " " + temp2
                + " Collect and print: " + temp3 + " " + temp4);
    }

    private static void save(JavaSparkContext sc, IgniteContext ic,
            String cacheName) {
        IgniteRDD<String, String> igniteRDD = ic.fromCache(cacheName);
        System.out.println("IgniteRDD from cache initialized.");
        Map<String, String> tempMap = new HashMap<String, String>();
        tempMap.put("Aditya", "Jain");
        tempMap.put("Pranjal", "Jaju");
        Tuple2<String, String> tempTuple1 = new Tuple2<String, String>(
                "Aditya", "Jain");
        Tuple2<String, String> tempTuple2 = new Tuple2<String, String>(
                "Pranjal", "Jaju");
        List<Tuple2<String, String>> list = new LinkedList<Tuple2<String, String>>();
        list.add(tempTuple1);
        list.add(tempTuple2);
        JavaPairRDD<String, String> jpr = sc.parallelizePairs(list, 4);
        System.out.println("Random RDD saved.");
        igniteRDD.savePairs(jpr.rdd(), false);
        System.out.println("IgniteRDD saved.");
    }
}
So my question: is it going to take 3-4 minutes to fetch 2 tuples from the IgniteRDD and collect them in my process?
Or is my expectation wrong that it should respond in milliseconds?
After debugging, I found that it creates 1024 partitions in the IgniteRDD, which causes it to fire 1024 jobs. And I haven't found any way to control the number of partitions.
You can decrease the number of partitions in the CacheConfiguration:
<bean class="org.apache.ignite.configuration.CacheConfiguration">
    <property name="affinity">
        <bean class="org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction">
            <property name="partitions" value="32"/>
        </bean>
    </property>
</bean>
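If you configure the cache from Java rather than Spring XML, the equivalent setting would look roughly like the sketch below. The cache name "demo6" is taken from the question; RendezvousAffinityFunction's default of 1024 partitions is what produces the 1024 jobs you observed.
// Sketch of the same setting in Java-based configuration (not from the original post).
CacheConfiguration<String, String> cacheCfg = new CacheConfiguration<>("demo6");
// 32 partitions instead of the default 1024, so far fewer Spark partitions/jobs are created.
cacheCfg.setAffinity(new RendezvousAffinityFunction(false, 32));

IgniteConfiguration igniteCfg = new IgniteConfiguration();
igniteCfg.setCacheConfiguration(cacheCfg);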
You can also use the IgniteRDD.sql(..) and IgniteRDD.objectSql(..) methods to retrieve data directly from Ignite, taking advantage of fast indexed search. For details on how to configure SQL in Ignite, refer to this page: https://apacheignite.readme.io/docs/sql-queries
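For the Java API there is a JavaIgniteContext / JavaIgniteRDD wrapper around the Scala classes; a minimal sketch of such a query is below. It assumes the cache was configured with indexed types (e.g. CacheConfiguration.setIndexedTypes(String.class, String.class)) so a String table is exposed to SQL, and it reuses sc and cacheName from the question; check the exact wrapper names and signatures against your Ignite version.
// Minimal sketch, assuming indexed types are configured on the cache.
JavaIgniteContext<String, String> jic =
        new JavaIgniteContext<>(sc, "ignite/client-default-config.xml");
JavaIgniteRDD<String, String> rdd = jic.fromCache(cacheName);

// Key/value pairs matching an indexed SQL predicate.
JavaPairRDD<String, String> pairs = rdd.objectSql("String", "_key = ?", "Aditya");
pairs.collect().forEach(System.out::println);

// Or as a Spark DataFrame (Spark 1.x API, org.apache.spark.sql.DataFrame).
DataFrame df = rdd.sql("select _key, _val from String where _key = ?", "Aditya");
df.show();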
Related
I want to execute MQL (Monitoring Query Language) queries using the library below.
<dependency>
    <groupId>com.google.apis</groupId>
    <artifactId>google-api-services-monitoring</artifactId>
    <version>v3-rev540-1.25.0</version>
</dependency>
Here is my code snippet, which creates a monitoring client and tries to collect data from GCP Monitoring.
public void queryTimeSeriesData() throws IOException {
    // create monitoring
    Monitoring m = createAuthorizedMonitoringClient();
    QueryTimeSeriesRequest req = new QueryTimeSeriesRequest();
    String query = "fetch consumed_api\n" +
            "| metric 'serviceruntime.googleapis.com/api/request_count'\n" +
            "| align rate(2m)\n" +
            "| every 2m\n" +
            "| group_by [metric.response_code],\n" +
            " [value_request_count_max: max(value.request_count)]";
    req.setQuery(query);

    HashMap<String, Object> queryTransformationSpec = new HashMap<String, Object>();
    HashMap<String, Object> timingState = new HashMap<String, Object>();
    HashMap<String, Object> absoluteWindow = new HashMap<String, Object>();
    absoluteWindow.put("startTime", "2020-09-03T12:40:00.000Z");
    absoluteWindow.put("endTime", "2020-09-03T13:41:00.000Z");
    timingState.put("absoluteWindow", absoluteWindow);
    timingState.put("graphPeriod", "60s");
    timingState.put("queryPeriod", "60s");
    queryTransformationSpec.put("timingState", timingState);

    req.set("queryTransformationSpec", queryTransformationSpec);
    req.set("reportPeriodicStats", false);
    req.set("reportQueryPlan", false);

    QueryTimeSeriesResponse res = m.projects().timeSeries().query("projects/MY_PROJECT_NAME", req).execute();
    System.out.println(res);
}
The above code works fine, but it does not return data for the given startTime and endTime;
it always returns the latest datapoint available. Is there any problem with my code?
I found a way to execute an MQL query with a given time range. The
new working code is the following:
public void queryTimeSeriesData() throws IOException {
    // create monitoring
    Monitoring m = createAuthorizedMonitoringClient();
    QueryTimeSeriesRequest req = new QueryTimeSeriesRequest();
    String query = "fetch consumed_api\n" +
            "| metric 'serviceruntime.googleapis.com/api/request_count'\n" +
            "| align rate(5m)\n" +
            "| every 5m\n" +
            "| group_by [metric.response_code],\n" +
            " [value_request_count_max: max(value.request_count)]" +
            "| within d'2020/09/03-12:40:00', d'2020/09/03-12:50:00'\n";
    req.setQuery(query);
    QueryTimeSeriesResponse res = m.projects().timeSeries().query("projects/MY_PROJECT_NAME", req).execute();
    System.out.println(res);
}
I included the query start time and end time in the query itself by using the within operator. As per the Google docs for MQL queries:
within - Specifies the time range of the query output.
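For reference, within accepts a few window forms besides an explicit start and end. A hedged sketch is below: baseQuery is a hypothetical variable holding the query up to the group_by stage, and the exact accepted forms should be checked against the MQL reference.
// baseQuery is hypothetical and stands for the fetch/metric/align/every/group_by part above.
String lastHour  = baseQuery + "| within 1h";                                              // duration only, window ending now
String absolute  = baseQuery + "| within d'2020/09/03-12:40:00', d'2020/09/03-12:50:00'";  // explicit start and end
String startSpan = baseQuery + "| within d'2020/09/03-12:40:00', 10m";                     // start time plus duration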
I'm trying to get all items from an Apache Ignite cache.
Currently I can get an individual item using
ClientCache<Integer, BinaryObject> cache = igniteClient.cache("myCache").withKeepBinary();
BinaryObject temp = cache.get(1);
To get all keys, I've tried the following:
try (QueryCursor<Entry<Integer, BinaryObject>> cursor = cache.query(new ScanQuery<Integer, BinaryObject>(null))) {
    for (Object p : cursor)
        System.out.println(p.toString());
}
This returns a list of org.apache.ignite.internal.client.thin.ClientCacheEntry which is internal, and I cannot call getValue.
How can I get all items for this cache?
By using an Iterator you can get all keys and values from the cache. Below is sample code to retrieve all values from the cache.
Iterator<Entry<Integer, BinaryObject>> itr = cache.iterator();
while (itr.hasNext()) {
    BinaryObject object = itr.next().getValue();
    System.out.println(object);
}
The following may help you to iterate over all the records in the cache.
import javax.cache.Cache.Entry;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.binary.BinaryObject;
public class App5BinaryObject {

    public static void main(String[] args) {
        Ignition.setClientMode(true);
        try (Ignite client = Ignition
                .start("/Users/amritharajherle/git_new/ignite-learning-by-examples/complete/cfg/ignite-config.xml")) {
            IgniteCache<BinaryObject, BinaryObject> cities = client.cache("City").withKeepBinary();
            int count = 0;
            for (Entry<BinaryObject, BinaryObject> entry : cities) {
                count++;
                BinaryObject key = entry.getKey();
                BinaryObject value = entry.getValue();
                System.out.println("CountyCode=" + key.field("COUNTRYCODE") + ", DISTRICT = " + value.field("DISTRICT")
                        + ", POPULATION = " + value.field("POPULATION") + ", NAME = " + value.field("NAME"));
            }
            System.out.println("total cities count = " + count);
        }
    }
}
Using the Ignite REST API we can fetch a certain number of records (pageSize). I searched a lot and finally found the API:
http://<Server_IP>:8080/ignite?cmd=qryscanexe&pageSize=10&cacheName=
Add an Authorization header as per the Ignite cluster user.
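For example, calling that endpoint from Java could look roughly like the sketch below. The host, cache name, and credentials are placeholders, and how authentication is supplied depends on how security is configured for your cluster.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class IgniteRestScan {
    public static void main(String[] args) throws Exception {
        // Placeholders: SERVER_IP, myCache, user and password are not from the original post.
        String url = "http://SERVER_IP:8080/ignite?cmd=qryscanexe&pageSize=10&cacheName=myCache";
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        // If REST security is enabled, supply credentials as required by your cluster
        // (shown here as a basic Authorization header).
        String token = Base64.getEncoder()
                .encodeToString("user:password".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + token);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON response containing up to pageSize entries
            }
        }
    }
}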
How can I use Stanford CoreNLP to generate the dependency parse of a Chinese sentence? I can only get it to work well with English.
public class DemoChinese {
    public static void main(String[] args) {
        Properties props = PropertiesUtils.asProperties("props", "StanfordCoreNLP-chinese.properties");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document = new Annotation("我喜欢吃苹果");
        pipeline.annotate(document);
        List<CoreMap> sentence = document.get(SentencesAnnotation.class);
        @SuppressWarnings("deprecation")
        // Produce a dependency of this sentence.
        SemanticGraph dp = sentence.get(0).get(SemanticGraphCoreAnnotations
                .CollapsedCCProcessedDependenciesAnnotation.class);
        String s = dp.typedDependencies().toString();
        System.out.println(s);
    }
}
Setting up the Properties as you did doesn't work. This may be confusing, but the StanfordCoreNLP constructor needs a "real" properties list; it won't process a props key by expanding it out with its contents. (Doing things the way you did does appear in some examples. I initially assumed that it used to work and there had been a regression, but it doesn't seem to have worked in any of 3.6, 3.7, or 3.8, so maybe those examples never worked.) Also, in the example below, I get the dependencies in the non-deprecated way.
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
/**
 * @author Christopher Manning
 */
public class StanfordCoreNlpDemoChinese {

    private StanfordCoreNlpDemoChinese() { } // static main

    public static void main(String[] args) throws IOException {
        // set up optional output files
        PrintWriter out;
        if (args.length > 1) {
            out = new PrintWriter(args[1]);
        } else {
            out = new PrintWriter(System.out);
        }

        Properties props = new Properties();
        props.load(IOUtils.readerFromString("StanfordCoreNLP-chinese.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        Annotation document;
        if (args.length > 0) {
            document = new Annotation(IOUtils.slurpFileNoExceptions(args[0]));
        } else {
            document = new Annotation("我喜欢吃苹果");
        }
        pipeline.annotate(document);

        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        int sentNo = 1;
        for (CoreMap sentence : sentences) {
            out.println("Sentence #" + sentNo + " tokens are:");
            for (CoreMap token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                out.println(token.toShorterString("Text", "CharacterOffsetBegin", "CharacterOffsetEnd", "Index", "PartOfSpeech", "NamedEntityTag"));
            }
            out.println("Sentence #" + sentNo + " basic dependencies are:");
            out.println(sentence.get(SemanticGraphCoreAnnotations.BasicDependenciesAnnotation.class).toString(SemanticGraph.OutputFormat.LIST));
            sentNo++;
        }

        // Access coreference.
        out.println("Coreference information");
        Map<Integer, CorefChain> corefChains =
                document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
        if (corefChains == null) { return; }
        for (Map.Entry<Integer, CorefChain> entry : corefChains.entrySet()) {
            out.println("Chain " + entry.getKey());
            for (CorefChain.CorefMention m : entry.getValue().getMentionsInTextualOrder()) {
                // We need to subtract one since the indices count from 1 but the Lists start from 0
                List<CoreLabel> tokens = sentences.get(m.sentNum - 1).get(CoreAnnotations.TokensAnnotation.class);
                // We subtract two for end: one for 0-based indexing, and one because we want last token of mention not one following.
                out.println(" " + m + ":[" + tokens.get(m.startIndex - 1).beginPosition() + ", " +
                        tokens.get(m.endIndex - 2).endPosition() + ')');
            }
        }
        out.println();
        IOUtils.closeIgnoringExceptions(out);
    }
}
I need help from Authorize.Net Java SDK experts. I am running a GetSettledBatchList transaction with the following code, but it throws an exception, and I am not able to understand which date format it accepts.
The error, for reference:
11/05/15 00:32:56,875: INFO [pool-1-thread-1] (net.authorize.util.LogHelper:24) - Use Proxy: 'false'
Exception in thread "main" java.lang.NullPointerException
at com.auth.net.commons.authorize.net.GetSettledBatchList.main(GetSettledBatchList.java:52)
Here is the code I have developed so far, for reference. Please help me solve this error.
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.GregorianCalendar;
import javax.xml.datatype.DatatypeConfigurationException;
import javax.xml.datatype.DatatypeFactory;
import net.authorize.Environment;
import net.authorize.api.contract.v1.GetSettledBatchListRequest;
import net.authorize.api.contract.v1.GetSettledBatchListResponse;
import net.authorize.api.contract.v1.MerchantAuthenticationType;
import net.authorize.api.contract.v1.MessageTypeEnum;
import net.authorize.api.controller.GetSettledBatchListController;
import net.authorize.api.controller.base.ApiOperationBase;
public class GetSettledBatchList {

    public static final String apiLoginId = "XXXXX";
    public static final String transactionKey = "XXXX";

    public static void main(String[] args) throws ParseException, DatatypeConfigurationException {
        GregorianCalendar gc = new GregorianCalendar();
        ApiOperationBase.setEnvironment(Environment.SANDBOX);

        MerchantAuthenticationType merchantAuthenticationType = new MerchantAuthenticationType();
        merchantAuthenticationType.setName(apiLoginId);
        merchantAuthenticationType.setTransactionKey(transactionKey);
        ApiOperationBase.setMerchantAuthentication(merchantAuthenticationType);

        GetSettledBatchListRequest getRequest = new GetSettledBatchListRequest();
        DateFormat df = new SimpleDateFormat("yyyy-MM-dd");
        Date firstSettlementDate = df.parse("2015-01-26");
        gc.setTime(firstSettlementDate);
        Date lastSettlementDate = df.parse("2015-05-05");
        gc.setTime(lastSettlementDate);
        getRequest.setFirstSettlementDate(DatatypeFactory.newInstance().newXMLGregorianCalendar(gc));
        getRequest.setLastSettlementDate(DatatypeFactory.newInstance().newXMLGregorianCalendar(gc));
        getRequest.setMerchantAuthentication(merchantAuthenticationType);

        GetSettledBatchListController controller = new GetSettledBatchListController(getRequest);
        controller.execute();

        GetSettledBatchListResponse getResponse = new GetSettledBatchListResponse();
        if (getResponse != null) {
            if (getResponse.getMessages().getResultCode() == MessageTypeEnum.OK) {
                System.out.println(getResponse.getMessages().getMessage().get(0).getCode());
                System.out.println(getResponse.getMessages().getMessage().get(0).getText());
            } else {
                System.out.println("Failed to get settled batch list: " + getResponse.getMessages().getResultCode());
            }
        }
    }
}
Something on line 52 is null. Try adding null checks:
if (getResponse != null && getResponse.getMessages() != null && getResponse.getMessages().getResultCode() != null) {
    if (getResponse.getMessages().getResultCode() == MessageTypeEnum.OK) {
        if (getResponse.getMessages().getMessage() != null && getResponse.getMessages().getMessage().get(0) != null) {
            System.out.println(getResponse.getMessages().getMessage().get(0).getCode());
            System.out.println(getResponse.getMessages().getMessage().get(0).getText());
        }
    } else {
        System.out.println("Failed to get settled batch list: " + getResponse.getMessages().getResultCode());
    }
}
Line 47 does not make sense in your code. The line
GetSettledBatchListResponse getResponse = new GetSettledBatchListResponse();
simply creates an empty response object; it does not contain anything returned by the API. You never actually extract the response from the controller.
If you look at this link in the Authorize.Net GitHub repository for the sample code, you will notice that the above line should be replaced by
GetSettledBatchListResponse getResponse = controller.getApiResponse();
Try this and get back to us with the result.
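Putting it together, the corrected tail of the main method would look roughly like this (a sketch that reuses only the calls already shown above plus controller.getApiResponse()):
GetSettledBatchListController controller = new GetSettledBatchListController(getRequest);
controller.execute();

// Take the real response from the controller instead of instantiating an empty one.
GetSettledBatchListResponse getResponse = controller.getApiResponse();
if (getResponse != null && getResponse.getMessages() != null
        && getResponse.getMessages().getResultCode() == MessageTypeEnum.OK) {
    System.out.println(getResponse.getMessages().getMessage().get(0).getCode());
    System.out.println(getResponse.getMessages().getMessage().get(0).getText());
} else {
    System.out.println("Failed to get settled batch list.");
}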
This issue is fixed in the latest version of anet-java-sdk, 1.8.6, per https://github.com/AuthorizeNet/sdk-java/issues/61, so the code below works fine. Make sure that when you're using dates, the gap between FirstSettlementDate and LastSettlementDate is not more than 30 days.
public class SettledTransactionDetails {

    public static final String apiLoginID = "XXXXX";
    public static final String transactionKey = "XXXXXX";

    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws ParseException {
        Merchant merchant = Merchant.createMerchant(Environment.SANDBOX, apiLoginID, transactionKey);

        // get the list of settled transactions
        net.authorize.reporting.Transaction transaction =
                merchant.createReportingTransaction(TransactionType.GET_SETTLED_BATCH_LIST);
        ReportingDetails reportingDetails = ReportingDetails.createReportingDetails();
        SimpleDateFormat formatter = new SimpleDateFormat("dd/MM/yyyy");
        reportingDetails.setBatchFirstSettlementDate(formatter.parse("16/06/2015"));
        reportingDetails.setBatchLastSettlementDate(formatter.parse("15/07/2015"));
        reportingDetails.setBatchIncludeStatistics(true);
        transaction.setReportingDetails(reportingDetails);

        Result<Transaction> result = (Result<Transaction>) merchant.postTransaction(transaction);
        System.out.println("Result : " + result.getResultCode());

        ArrayList<BatchDetails> batchDetailsList = result.getReportingDetails().getBatchDetailsList();
        for (int i = 0; i < batchDetailsList.size(); i++) {
            ArrayList<BatchStatistics> batchStatisticsList = batchDetailsList.get(i).getBatchStatisticsList();
            for (int j = 0; j < batchStatisticsList.size(); j++) {
                BatchStatistics batchStatistics = batchStatisticsList.get(j);
                System.out.println("====================== " + j + " start");
                System.out.println("Account Type : [" + batchStatistics.getAccountType() + "]");
                System.out.println("Charge Amount : [" + batchStatistics.getChargeAmount() + "]");
                System.out.println("Charge Back Amount : [" + batchStatistics.getChargebackAmount() + "]");
                System.out.println("Charge Charge Back Amount : [" + batchStatistics.getChargeChargebackAmount() + "]");
                System.out.println("Charge Returned Items Amount : [" + batchStatistics.getChargeReturnedItemsAmount() + "]");
                System.out.println("Refund Amount : [" + batchStatistics.getRefundAmount() + "]");
                System.out.println("Refund Charge Back Amount : [" + batchStatistics.getRefundChargebackAmount() + "]");
                System.out.println("Account Type : [" + batchStatistics.getAccountType() + "]");
                System.out.println("====================== " + j + " end");
            }
        }
    }
}
I wrote a small Spark application which should measure the time that Spark needs to run an action on a partitioned RDD (a combineByKey function to sum a value).
My problem is that the first iteration seems to work correctly (calculated duration ~25 ms), but the following ones take much less time (~5 ms). It seems to me that Spark persists the data without any request to do so. Can I avoid that programmatically?
I have to know the duration that Spark needs to calculate a new RDD (without any caching/persisting of earlier iterations), so I think the duration should always be about 20-25 ms.
To ensure the recalculation I moved the SparkContext creation into the for loop, but this didn't change anything...
Thanks for your advice!
Here is my code, which seems to persist the data:
public static void main(String[] args) {
    switchOffLogging();
    // now
    try {
        // Setup: Read out parameters & initialize SparkContext
        String path = args[0];
        SparkConf conf = new SparkConf(true);
        JavaSparkContext sc;

        // Create output file & writer
        System.out.println("\npar.\tCount\tinput.p\tcons.p\tTime");

        // The RDDs used for the benchmark
        JavaRDD<String> input = null;
        JavaPairRDD<Integer, String> pairRDD = null;
        JavaPairRDD<Integer, String> partitionedRDD = null;
        JavaPairRDD<Integer, Float> consumptionRDD = null;

        // Do the tasks iteratively (10 times the same benchmark for testing)
        for (int i = 0; i < 10; i++) {
            boolean partitioning = true;
            int partitionsCount = 8;

            sc = new JavaSparkContext(conf);
            setS3credentials(sc, path);

            input = sc.textFile(path);
            pairRDD = mapToPair(input);
            partitionedRDD = partition(pairRDD, partitioning, partitionsCount);

            // Measure the duration
            long duration = System.currentTimeMillis();
            // Do the relevant function
            consumptionRDD = partitionedRDD.combineByKey(createCombiner, mergeValue, mergeCombiners);
            duration = System.currentTimeMillis() - duration;

            // Do some action to invoke the calculation
            System.out.println(consumptionRDD.collect().size());

            // Print the results
            System.out.println("\n" + partitioning + "\t" + partitionsCount + "\t" + input.partitions().size() + "\t" + consumptionRDD.partitions().size() + "\t" + duration + " ms");

            input = null;
            pairRDD = null;
            partitionedRDD = null;
            consumptionRDD = null;

            sc.close();
            sc.stop();
        }
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println(e.getMessage());
    }
}
Some helper functions (should not be the problem):
private static void switchOffLogging() {
    Logger.getLogger("org").setLevel(Level.OFF);
    Logger.getLogger("akka").setLevel(Level.OFF);
}

private static void setS3credentials(JavaSparkContext sc, String path) {
    if (path.startsWith("s3n://")) {
        Configuration hadoopConf = sc.hadoopConfiguration();
        hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
        hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
        hadoopConf.set("fs.s3n.awsAccessKeyId", "mycredentials");
        hadoopConf.set("fs.s3n.awsSecretAccessKey", "mycredentials");
    }
}

// Initial element
private static Function<String, Float> createCombiner = new Function<String, Float>() {
    public Float call(String dataSet) throws Exception {
        String[] data = dataSet.split(",");
        float value = Float.valueOf(data[2]);
        return value;
    }
};

// merging function for a new dataset
private static Function2<Float, String, Float> mergeValue = new Function2<Float, String, Float>() {
    public Float call(Float sumYet, String dataSet) throws Exception {
        String[] data = dataSet.split(",");
        float value = Float.valueOf(data[2]);
        sumYet += value;
        return sumYet;
    }
};

// function to sum the consumption
private static Function2<Float, Float, Float> mergeCombiners = new Function2<Float, Float, Float>() {
    public Float call(Float a, Float b) throws Exception {
        a += b;
        return a;
    }
};

private static JavaPairRDD<Integer, String> partition(JavaPairRDD<Integer, String> pairRDD, boolean partitioning, int partitionsCount) {
    if (partitioning) {
        return pairRDD.partitionBy(new HashPartitioner(partitionsCount));
    } else {
        return pairRDD;
    }
}

private static JavaPairRDD<Integer, String> mapToPair(JavaRDD<String> input) {
    return input.mapToPair(new PairFunction<String, Integer, String>() {
        public Tuple2<Integer, String> call(String debsDataSet) throws Exception {
            String[] data = debsDataSet.split(",");
            int houseId = Integer.valueOf(data[6]);
            return new Tuple2<Integer, String>(houseId, debsDataSet);
        }
    });
}
And finally, the output of the Spark console:
part.  Count  input.p  cons.p  Time
true   8      6        8       20 ms
true   8      6        8       23 ms
true   8      6        8        7 ms  // Too short!!!
true   8      6        8       21 ms
true   8      6        8       13 ms
true   8      6        8        6 ms  // Too short!!!
true   8      6        8        5 ms  // Too short!!!
true   8      6        8        6 ms  // Too short!!!
true   8      6        8        4 ms  // Too short!!!
true   8      6        8        7 ms  // Too short!!!
I found a solution for my case now: I wrote a separate class which calls the spark-submit command in a new process. This can be done in a loop, so every benchmark is started in a fresh JVM and the SparkContext is separated per process. So garbage collection is done and everything works fine!
String submitCommand = "/root/spark/bin/spark-submit " + submitParams + " --class partitioning.PartitionExample /root/partitioning.jar " + javaFlags;
Process p = Runtime.getRuntime().exec(submitCommand);
BufferedReader reader;
String line;

System.out.println(p.waitFor());
reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
while ((line = reader.readLine()) != null) {
    System.out.println(line);
}
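A hedged variant of the same idea using ProcessBuilder is sketched below; reading the child's output before waitFor() avoids blocking if the output buffer fills up, and the paths and class name are the same placeholders as above.
ProcessBuilder pb = new ProcessBuilder("/root/spark/bin/spark-submit",
        "--class", "partitioning.PartitionExample", "/root/partitioning.jar");
pb.redirectErrorStream(true); // merge stderr into stdout so nothing is lost
Process p = pb.start();
try (BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line); // stream the benchmark output as it arrives
    }
}
System.out.println("exit code: " + p.waitFor());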
If the shuffle output is small enough, the Spark shuffle files will be written to the OS buffer cache, since fsync is not explicitly called... this means that, as long as there is room, your data will remain in memory.
If a cold performance test is truly necessary, then you can try something like this attempt to flush the disk cache, but that is going to slow things down in between each test. Could you just spin the context up and down? That might solve your need.
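If you do want a truly cold run, dropping the OS page cache between runs is one way to approximate the "flush the disk" idea. This is a sketch, not something Spark provides: it is Linux-only, requires root, and the calls need to sit in a method that handles IOException and InterruptedException.
// Flush dirty pages to disk, then drop the page cache (Linux only, needs root).
Runtime.getRuntime().exec(new String[] { "sync" }).waitFor();
Runtime.getRuntime().exec(new String[] { "sh", "-c", "echo 3 > /proc/sys/vm/drop_caches" }).waitFor();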