Why is this PageRank job so much slower using Dataset than RDD? - java

I implemented this example of PageRank in Java using the newer Dataset API. When I benchmark my code against the sample which uses the older RDD API, I find that my code takes 186 seconds while the baseline only takes 109 seconds. What is causing the discrepancy? (Side-note: is it normal for Spark to take hundreds of seconds even when the database only contains a handful of entries?)
My code:
Dataset<Row> outLinks = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "storagepage_outlinks", props);
Dataset<Row> page = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "pages", props);
outLinks = page.join(outLinks, page.col("id").equalTo(outLinks.col("storagepage_id")));
outLinks = outLinks.distinct().groupBy(outLinks.col("url")).agg(collect_set("outlinks")).cache();
Dataset<Row> ranks = outLinks.map(row -> new Tuple2<>(row.getString(0), 1.0), Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF("url", "rank");
for (int i = 0; i < iterations; i++) {
Dataset<Row> joined = outLinks.join(ranks, new Set.Set1<>("url").toSeq());
Dataset<Row> contribs = joined.flatMap(row -> {
List<String> links = row.getList(1);
double rank = row.getDouble(2);
return links
.stream()
.map(s -> new Tuple2<>(s, rank / links.size()))
.collect(Collectors.toList()).iterator();
}, Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF("url", "num");
Dataset<Tuple2<String, Double>> reducedByKey =
contribs.groupByKey(r -> r.getString(0), Encoders.STRING())
.mapGroups((s, iterator) -> {
double sum = 0;
while (iterator.hasNext()) {
sum += iterator.next().getDouble(1);
}
return new Tuple2<>(s, sum);
}, Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE()));
ranks = reducedByKey.map(t -> new Tuple2<>(t._1, .15 + t._2 * .85),
Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF("url", "rank");
}
ranks.show();
The sample code which uses RDD (adapted to read from my database):
Dataset<Row> outLinks = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "storagepage_outlinks", props);
Dataset<Row> page = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "pages", props);
outLinks = page.join(outLinks, page.col("id").equalTo(outLinks.col("storagepage_id")));
outLinks = outLinks.distinct().groupBy(outLinks.col("url")).agg(collect_set("outlinks")).cache(); // TODO: play with this cache
JavaPairRDD<String, Iterable<String>> links = outLinks.javaRDD().mapToPair(row -> new Tuple2<>(row.getString(0), row.getList(1)));
// Loads every URL that has outgoing links and initializes its rank to one.
JavaPairRDD<String, Double> ranks = links.mapValues(rs -> 1.0);
// Calculates and updates URL ranks continuously using PageRank algorithm.
for (int current = 0; current < 20; current++) {
// Calculates URL contributions to the rank of other URLs.
JavaPairRDD<String, Double> contribs = links.join(ranks).values()
.flatMapToPair(s -> {
int urlCount = size(s._1());
List<Tuple2<String, Double>> results = new ArrayList<>();
for (String n : s._1) {
results.add(new Tuple2<>(n, s._2() / urlCount));
}
return results.iterator();
});
// Re-calculates URL ranks based on neighbor contributions.
ranks = contribs.reduceByKey((x, y) -> x + y).mapValues(sum -> 0.15 + sum * 0.85);
}
// Collects all URL ranks and dumps them to the console.
List<Tuple2<String, Double>> output = ranks.collect();
for (Tuple2<?,?> tuple : output) {
System.out.println(tuple._1() + " has rank: " + tuple._2() + ".");
}

TL;DR It is probably the good old "Avoid GroupByKey" problem.
It is hard to say for sure, but your Dataset code is equivalent to groupByKey:
groupByKey(...).mapGroups(...)
which means it shuffles the data first and only reduces it afterwards.
Your RDD code uses reduceByKey, which cuts the shuffle size by applying a local (map-side) reduction before the shuffle. If you want the two versions to be roughly equivalent, rewrite groupByKey(...).mapGroups(...) as groupByKey(...).reduceGroups(...).
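A minimal, untested sketch of that rewrite, reusing the url/num columns from the code above (the usual Spark Java helpers MapFunction, ReduceFunction, Encoders and Tuple2 are assumed to be imported):
Dataset<Tuple2<String, Double>> pairs = contribs.map(
        (MapFunction<Row, Tuple2<String, Double>>) r -> new Tuple2<>(r.getString(0), r.getDouble(1)),
        Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE()));
Dataset<Tuple2<String, Double>> reducedByKey = pairs
        .groupByKey((MapFunction<Tuple2<String, Double>, String>) t -> t._1, Encoders.STRING())
        // reduceGroups can do partial (map-side) aggregation, unlike mapGroups
        .reduceGroups((ReduceFunction<Tuple2<String, Double>>) (a, b) -> new Tuple2<>(a._1, a._2 + b._2))
        .map((MapFunction<Tuple2<String, Tuple2<String, Double>>, Tuple2<String, Double>>) t -> t._2,
                Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE()));
// the final ranks step (.15 + .85 * sum) stays exactly as before; an untyped
// alternative with the same effect would be contribs.groupBy("url").agg(sum("num"))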
Another possible culprit is configuration. The default value of spark.sql.shuffle.partitions is 200, and that is what Dataset aggregations will use. If
the database only contains a handful of entries
then this is serious overkill. RDDs use spark.default.parallelism or a value derived from the parent data instead, which is typically much more modest.
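For illustration only (this is not part of either snippet above), the setting can be lowered at runtime or when the session is built; pick a value that fits your data and cluster:
spark.conf().set("spark.sql.shuffle.partitions", "8"); // default is 200
// or at session build time:
// SparkSession spark = SparkSession.builder()
//         .config("spark.sql.shuffle.partitions", "8")
//         .getOrCreate();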

Related

Find duplicates in first column and take average based on third column

My issue here is that I need to compute the average time for each ID.
Sample data
T1,2020-01-16,11:16pm,start
T2,2020-01-16,11:18pm,start
T1,2020-01-16,11:20pm,end
T2,2020-01-16,11:23pm,end
I have written code that keeps the first column and the third column in a map, something like
T1, 11:16pm
but I was not able to compute the values after putting them in the map. I also tried keeping them in a String array and splitting line by line, but I ran into the same issue with that approach.
public class AverageTimeGenerate {
public static void main(String[] args) throws IOException {
File file = new File("/abc.txt");
try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
while (true) {
String line = reader.readLine();
if (line == null) {
break;
}
ArrayList<String> list = new ArrayList<>();
String[] tokens = line.split(",");
for (String s: tokens) {
list.add(s);
}
Map<String, String> map = new HashMap<>();
String[] data = line.split(",");
String ids= data[0];
String dates = data[1];
String transactionTime = data[2];
String transactionStartAndEndTime = data[3];
String[] transactionIds = ids.split("/n");
String[] timeOfEachTransaction = transactionTime.split("/n");
for(String id : transactionIds) {
for(String time : timeOfEachTransaction) {
map.put(id, time);
}
}
}
}
}
}
Can anyone suggest whether it is possible to find duplicates in a map and compute values from it, or is there any other way I can do this so that the output looks like
T1 2:00
T2 5:00
I don't know what your logic for computing the average time is, but you can save the data in a map per transaction. The map structure can look like this: the transaction id is the key and all of its times go into a list.
Map<String,List<String>> map = new HashMap<String,List<String>>();
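A rough, untested sketch of that structure, assuming the four-column layout from the sample data (id, date, time, start/end) and the usual java.io/java.util imports:
Map<String, List<String>> timesById = new HashMap<>();
try (BufferedReader reader = new BufferedReader(new FileReader("/abc.txt"))) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] fields = line.split(",");
        timesById.computeIfAbsent(fields[0], k -> new ArrayList<>()).add(fields[2]);
    }
}
// timesById is now {T1=[11:16pm, 11:20pm], T2=[11:18pm, 11:23pm]}; the elapsed
// time per id can then be derived from the first and last entry of each list.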
You can do it like this:
Map<String, String> result = Files.lines(Paths.get("abc.txt"))
.map(line -> line.split(","))
.map(arr -> {
try {
return new AbstractMap.SimpleEntry<>(arr[0],
new SimpleDateFormat("HH:mm").parse(arr[2]));
} catch (ParseException e) {
return null;
}
}).collect(Collectors.groupingBy(Map.Entry::getKey,
Collectors.collectingAndThen(Collectors
.mapping(Map.Entry::getValue, Collectors.toList()),
list -> toStringTime.apply(convert.apply(list)))));
To simplify, I've declared two helper functions:
Function<List<Date>, Long> convert = list -> (list.get(1).getTime() - list.get(0).getTime()) / 2;
Function<Long, String> toStringTime = l -> l / 60000 + ":" + l % 60000 / 1000;

Iterating over multiple CSVs and joining with Spark SQL

I have several CSV files with the same headers and the same IDs. I am attempting to iterate over them and merge all files, up to the one indexed '31'. In my while loop, I'm trying to initialise the merged dataset so it can be used for the remainder of the loop. On the last line, the compiler tells me that the 'local variable merged may not have been initialised'. How should I be doing this instead?
SparkSession spark = SparkSession.builder().appName("testSql")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///c:tmp")
.getOrCreate();
Dataset<Row> first = spark.read().option("header", true).csv("mypath/01.csv");
Dataset<Row> second = spark.read().option("header", true).csv("mypath/02.csv");
IntStream.range(3, 31)
.forEach(i -> {
while(i==3) {
Dataset<Row> merged = first.join(second, first.col("customer_id").equalTo(second.col("customer_id")));
}
Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
Dataset<Row> merged = merged.join(next, merged.col("customer_id").equalTo(next.col("customer_id")));
EDITED based on feedback in the comments.
Following your pattern, something like this would work:
Dataset<Row> ds1 = spark.read().option("header", true).csv("mypath/01.csv");
Dataset<?>[] result = {ds1};
IntStream.range(2, 31)
.forEach(i -> {
Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
result[0] = result[0].join(next, "customer_id");
});
We wrap the Dataset in a one-element array to work around the restriction that local variables captured by a lambda must be effectively final.
The more straightforward way, for this particular case, is to simply use a for-loop rather than stream.forEach:
Dataset<Row> result = spark.read().option("header", true).csv("mypath/01.csv");
for( int i = 2 ; i < 31 ; i++ ) {
Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
result = result.join(next, "customer_id");
}
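Another option along the same lines, sketched here on the assumption that the file names are zero-padded (01.csv, 02.csv, ...) as in the question (adjust the bounds to your actual file count), is to reduce over a stream of Datasets:
Dataset<Row> merged = IntStream.rangeClosed(1, 30)
        .mapToObj(i -> spark.read().option("header", true).csv(String.format("mypath/%02d.csv", i)))
        .reduce((a, b) -> a.join(b, "customer_id"))
        .orElseThrow(() -> new IllegalStateException("no input files"));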

Cannot iterate through CSV columns

I'm building a stock screener that applies a calculation to each column of a CSV file. However, when I run the for loop, I only get one result back.
String path = "C:/Users/0/Desktop/Git/Finance/Data/NQ100.csv";
Reader buf = Files.newBufferedReader(Paths.get(path));
CSVParser parsed = new CSVParser(buf, CSVFormat.DEFAULT.withFirstRecordAsHeader()
.withIgnoreHeaderCase().withTrim());
// Parse tickers
Map<String, Integer> header = parsed.getHeaderMap();
List<String> tickerList = new ArrayList<>(header.keySet());
for (int x=1; x < tickerList.size(); x++) { <----------------------- PROBLEM
// Accessing closing price by Header names
List<Double> closeList = new ArrayList<>();
for (CSVRecord record : parsed) {
String stringClose = record.get(x);
Double close = Double.valueOf(stringClose);
closeList.add(close);
}
// Percentage Change
List<Double> pctList = new ArrayList<>();
for (int i=1; i < closeList.size(); i++) {
Double pct = closeList.get(i) / closeList.get(i-1) - 1;
pctList.add(pct);
}
// Statistics
Double sum = 0.0, var = 0.0, mean, sd, rfr, sr;
// Mean
for (Double num : pctList) sum += num;
mean = sum/pctList.size();
// Standard Deviation
for (Double num: pctList) var += Math.pow(num - mean, 2);
sd = Math.sqrt(var/pctList.size());
// Risk Free Rate
rfr = Math.pow((1+0.03),(1/252.0))-1;
// Sharpe Ratio
sr = Math.sqrt(252) * ((mean-rfr)/sd);
System.out.println(tickerList.get(x) + " " + sr);
}
My data looks like this:
,AAL,AAPL,ADBE
2007-10-25,26.311651,23.141403,47.200001
2007-10-26,26.273216,23.384495,47.0
2007-10-29,26.004248,23.43387,47.0
So I was expecting:
AAL XXX
AAPL XXX
ADBE XXX
But I got just:
AAL 0.3604941921663456
Would be grateful if you guys can help me find the problem!
CSVParser parsed implements Iterable<CSVRecord>, but it is backed by the underlying reader, so you can only iterate over it once.
That means it is fully consumed the first time through, while you are calculating the statistics for AAL; for AAPL and ADBE it then behaves as if it were empty.
You can handle this by copying the records into a helper list before the outer for cycle (there is a one-line way to do this, e.g. in Java 8, but the version below also works for earlier versions):
List<CSVRecord> records = new ArrayList<>();
for (CSVRecord record : parsed) {
records.add(record);
}
And change the line:
for (CSVRecord record : parsed) {
to:
for (CSVRecord record : records) {
For the CSV you've provided, you will then get the following output:
AAL -21.583101145880306
AAPL 23.417753561072438
ADBE -16.75343297000953
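Side note: the one-line version mentioned above can simply use Commons CSV's getRecords(), which reads the remaining records into a list:
List<CSVRecord> records = parsed.getRecords();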
Here's a block of code that works for me. If I understand your question correctly, you only want to read each column and row from a CSV file; hope this helps.
String cvsSplitBy = ","; // column separator
int a = 0;               // row counter, used to skip the header line
String line;
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(archivo), "UTF8"));
while ((line = br.readLine()) != null) {
    if (a != 0) { // skip the header row
        String[] datos = line.split(cvsSplitBy);
        System.out.println(datos[0] + " - " + datos[1] + " - " + datos[2]);
    }
    a++;
}
br.close();

LDA in Spark 1.3.1. Converting raw data into Term Document Matrix?

I'm trying out LDA with Spark 1.3.1 in Java and got this error:
Error: application failed with exception
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NumberFormatException: For input string: "��"
My .txt file looks like this:
put weight find difficult pull ups push ups now
blindness diseases everything eyes work perfectly except ability take light use light form images
role model kid
Dear recall saddest memory childhood
This is the code:
import scala.Tuple2;
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.clustering.LDAModel;
import org.apache.spark.mllib.clustering.LDA;
import org.apache.spark.mllib.linalg.Matrix;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.SparkConf;
public class JavaLDA {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("LDA Example");
JavaSparkContext sc = new JavaSparkContext(conf);
// Load and parse the data
String path = "/tutorial/input/askreddit20150801.txt";
JavaRDD<String> data = sc.textFile(path);
JavaRDD<Vector> parsedData = data.map(
new Function<String, Vector>() {
public Vector call(String s) {
String[] sarray = s.trim().split(" ");
double[] values = new double[sarray.length];
for (int i = 0; i < sarray.length; i++)
values[i] = Double.parseDouble(sarray[i]);
return Vectors.dense(values);
}
}
);
// Index documents with unique IDs
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
new Function<Tuple2<Vector, Long>, Tuple2<Long, Vector>>() {
public Tuple2<Long, Vector> call(Tuple2<Vector, Long> doc_id) {
return doc_id.swap();
}
}
));
corpus.cache();
// Cluster the documents into three topics using LDA
LDAModel ldaModel = new LDA().setK(100).run(corpus);
// Output topics. Each is a distribution over words (matching word count vectors)
System.out.println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize()
+ " words):");
Matrix topics = ldaModel.topicsMatrix();
for (int topic = 0; topic < 100; topic++) {
System.out.print("Topic " + topic + ":");
for (int word = 0; word < ldaModel.vocabSize(); word++) {
System.out.print(" " + topics.apply(word, topic));
}
System.out.println();
}
ldaModel.save(sc.sc(), "myLDAModel");
}
}
Does anyone know why this happened? I'm trying out LDA on Spark for the first time. Thanks.
values[i] = Double.parseDouble(sarray[i]);
Why are you trying to convert each word of your text file into a Double?
That's the answer to your issue:
http://docs.oracle.com/javase/6/docs/api/java/lang/Double.html#parseDouble%28java.lang.String%29
Your code expects the input file to be a bunch of lines of whitespace-separated text that looks like numbers. Assuming your text is words instead:
Get a list of every word that appears in your corpus:
JavaRDD<String> words =
data.flatMap((FlatMapFunction<String, String>) s -> {
s = s.replaceAll("[^a-zA-Z ]", "");
s = s.toLowerCase();
return Arrays.asList(s.split(" "));
});
Make a map giving each word an integer associated with it:
Map<String, Long> vocab = words.distinct().zipWithIndex().collectAsMap(); // distinct() keeps every index within 0..vocab.size()-1
Then instead of your parsedData doing what it's doing up there, make it look up each word, find the associated number, go to that location in an array, and add 1 to the count for that word.
JavaRDD<Vector> tokens = data.map(
    (Function<String, Vector>) s -> {
        // apply the same cleanup as for the vocabulary, so every token can be found in vocab
        String[] vals = s.replaceAll("[^a-zA-Z ]", "").toLowerCase().split("\\s+");
        double[] idx = new double[vocab.size() + 1];
        for (String val : vals) {
            Long wordIndex = vocab.get(val);
            if (wordIndex != null) { // guards against tokens that never made it into vocab
                idx[wordIndex.intValue()] += 1.0;
            }
        }
        return Vectors.dense(idx);
    }
);
This results in an RDD of vectors, where each vector is vocab.size() long, and each spot in the vector is the count of how many times that vocab word appeared in the line.
I modified this code slightly from what I'm currently using and didn't test it, so there could be errors in it. Good luck!
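For completeness, and with the same untested caveat, the tokens RDD can be plugged back into the LDA call from the question by reusing the same corpus construction:
JavaPairRDD<Long, Vector> corpus = JavaPairRDD.fromJavaRDD(
        tokens.zipWithIndex().map(t -> t.swap()));
corpus.cache();
LDAModel ldaModel = new LDA().setK(100).run(corpus);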

Aggregate data in CSV file using Java

I have a big CSV file, thousands of rows, and I want to aggregate some columns using Java code.
The file is in the form:
1,2012,T1
2,2015,T2
3,2013,T1
4,2012,T1
The results should be:
T, Year, Count
T1,2012, 2
T1,2013, 1
T2,2015, 1
Put your data into a Map-like structure, and add +1 to the stored value each time a key (in your case "" + T + year) is already present.
You can use a map like
Map<String, Integer> rowMap = new HashMap<>();
rowMap.put("T1" + "2012", 1);
rowMap.put("T2" + "2015", 1);
or you can define your own class with T and Year fields, overriding hashCode and equals. Then you can use
Map<YourClass, Integer> map= new HashMap<>();
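For illustration, a rough sketch of such a key class (the name YearType and the field layout are my own; java.util.Objects is assumed to be imported):
final class YearType {
    final String type; // e.g. "T1"
    final String year; // e.g. "2012"
    YearType(String type, String year) { this.type = type; this.year = year; }
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof YearType)) return false;
        YearType other = (YearType) o;
        return type.equals(other.type) && year.equals(other.year);
    }
    @Override
    public int hashCode() { return Objects.hash(type, year); }
}
// usage: Map<YearType, Integer> counts = new HashMap<>();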
String csv =
"1,2012,T1\n"
+ "2,2015,T2\n"
+ "3,2013,T1\n"
+ "4,2012,T1\n";
Map<String, Integer> map = new TreeMap<>();
BufferedReader reader = new BufferedReader(new StringReader(csv));
String line;
while ((line = reader.readLine()) != null) {
String[] fields = line.split(",");
String key = fields[2] + "," + fields[1];
Integer value = map.get(key);
if (value == null)
value = 0;
map.put(key, value + 1);
}
System.out.println(map);
// -> {T1,2012=2, T1,2013=1, T2,2015=1}
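For comparison, the same counting can also be written with Java 8 streams (a sketch; the path "data.csv" is a placeholder and the surrounding method is assumed to declare throws IOException):
Map<String, Long> counts = Files.lines(Paths.get("data.csv"))
        .map(line -> line.split(","))
        .collect(Collectors.groupingBy(fields -> fields[2] + "," + fields[1],
                TreeMap::new, Collectors.counting()));
System.out.println(counts); // {T1,2012=2, T1,2013=1, T2,2015=1}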
Use uniVocity-parsers for the best performance. It should take 1 second to process 1 million rows.
CsvParserSettings settings = new CsvParserSettings();
settings.selectIndexes(1, 2); //select the columns we are going to read
final Map<List<String>, Integer> results = new LinkedHashMap<List<String>, Integer>(); //stores the results here
//Use a custom implementation of RowProcessor
settings.setRowProcessor(new AbstractRowProcessor() {
@Override
public void rowProcessed(String[] row, ParsingContext context) {
List<String> key = Arrays.asList(row); // converts the input array to a List - lists implement hashCode and equals based on their values so they can be used as keys on your map.
Integer count = results.get(key);
if (count == null) {
count = 0;
}
results.put(key, count + 1);
}
});
//creates a parser with the above configuration and RowProcessor
CsvParser parser = new CsvParser(settings);
String input = "1,2012,T1"
+ "\n2,2015,T2"
+ "\n3,2013,T1"
+ "\n4,2012,T1";
//the parse() method will parse and submit all rows to your RowProcessor - use a FileReader to read a file instead the String I'm using as example.
parser.parse(new StringReader(input));
//Here are the results:
for(Entry<List<String>, Integer> entry : results.entrySet()){
System.out.println(entry.getKey() + " -> " + entry.getValue());
}
Output:
[2012, T1] -> 2
[2015, T2] -> 1
[2013, T1] -> 1
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
