Spark Sorting with JavaRDD<String>

Let's say I have a file with lines of strings and I import it into a JavaRDD. If I want to sort the strings and export them as a new file, how should I do it? The code below is my attempt, and it is not working:
JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "Spark Sort");
Configuration hadoopConfig = sparkContext.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.imp", DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl", LocalFileSystem.class.getName());
JavaRDD<String> lines = sparkContext.textFile(args[0]);
JavaRDD<String> sorted = lines.sortBy(i->i, true,1);
sorted.saveAsTextFile(args[1]);
What I mean by "not working" is that the output file is not sorted. I think the issue is with my "i->i" code; I am not sure how to make it sort with the compare method of strings, as each "i" will be a string (I am also not sure how to make it compare between different "i"s).
EDIT
I have modified the code as per the comments; I suspect the file was being read as one giant string.
JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "Spark Sort");
Configuration hadoopConfig = sparkContext.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.imp", DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl", LocalFileSystem.class.getName());
long start = System.currentTimeMillis();
List<String> array = buildArrayList(args[0]);
JavaRDD<String> lines = sparkContext.parallelize(array);
JavaRDD<String> sorted = lines.sortBy(i->i, true, 1);
sorted.saveAsTextFile(args[1]);
Still not sorting it :(

I did a little research. Your code is correct. Here are the samples I tested:
Spark initialization
SparkSession spark = SparkSession.builder().appName("test")
.config("spark.debug.maxToStringFields", 10000)
.config("spark.sql.tungsten.enabled", true)
.enableHiveSupport().getOrCreate();
JavaSparkContext jSpark = new JavaSparkContext(spark.sparkContext());
Example for RDD
//RDD
JavaRDD<String> rdd = jSpark.parallelize(Arrays.asList("z", "b", "c", "a"));
JavaRDD<String> sorted = rdd.sortBy(i -> i, true, 1);
List<String> result = sorted.collect();
result.stream().forEach(i -> System.out.println(i));
The output is
a
b
c
z
You can also use the Dataset API
//Dataset
Dataset<String> stringDataset = spark.createDataset(Arrays.asList("z", "b", "c", "a"), Encoders.STRING());
Dataset<String> sortedDataset = stringDataset.sort(stringDataset.col(stringDataset.columns()[0]).desc()); // the sort is ascending by default, so .desc() gives descending order
result = sortedDataset.collectAsList();
result.stream().forEach(i -> System.out.println(i));
The output is
z
c
b
a
I think your problem is that your text file has a specific line separator. If so, you can use the flatMap function to split your giant text string into line strings.
Here is an example with a Dataset:
//flatMap example
Dataset<String> singleLineDS= spark.createDataset(Arrays.asList("z:%b:%c:%a"), Encoders.STRING());
Dataset<String> splitedDS = singleLineDS.flatMap(i->Arrays.asList(i.split(":%")).iterator(),Encoders.STRING());
Dataset<String> sortedSplitedDs = splitedDS.sort(splitedDS.col(splitedDS.columns()[0]).desc());
result = sortedSplitedDs.collectAsList();
result.stream().forEach(i -> System.out.println(i));
So you should find out which separator is used in your text file and adapt the code above for your task.
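If you want to stay with the JavaRDD API from your original code, the same idea applies there. A minimal sketch, assuming Spark 2.x (where flatMap returns an Iterator; on 1.x drop the .iterator() call) and using ":%" only as a placeholder for whatever separator your file actually contains:
JavaRDD<String> lines = sparkContext.textFile(args[0]);
// Split each chunk on the real separator so every element becomes a single line
JavaRDD<String> splitLines = lines.flatMap(s -> Arrays.asList(s.split(":%")).iterator());
JavaRDD<String> sorted = splitLines.sortBy(i -> i, true, 1);
sorted.saveAsTextFile(args[1]);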

Related

Iterating over multiple CSVs and joining with Spark SQL

I have several csv files with the same headers and the same IDs. I am attempting to iterate over them and merge all files, up to the one indexed '31'. In my while loop, I'm trying to initialise the merged dataset so it can be used for the remainder of the loop. On the last line, the compiler tells me that the 'local variable merged may not have been initialised'. How should I be doing this instead?
SparkSession spark = SparkSession.builder().appName("testSql")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///c:tmp")
.getOrCreate();
Dataset<Row> first = spark.read().option("header", true).csv("mypath/01.csv");
Dataset<Row> second = spark.read().option("header", true).csv("mypath/02.csv");
IntStream.range(3, 31)
    .forEach(i -> {
        while (i == 3) {
            Dataset<Row> merged = first.join(second, first.col("customer_id").equalTo(second.col("customer_id")));
        }
        Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
        Dataset<Row> merged = merged.join(next, merged.col("customer_id").equalTo(next.col("customer_id")));
EDITED based on feedback in the comments.
Following your pattern, something like this would work:
Dataset<Row> ds1 = spark.read().option("header", true).csv("mypath/01.csv");
Dataset<?>[] result = {ds1};
IntStream.range(2, 31)
    .forEach(i -> {
        Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
        result[0] = result[0].join(next, "customer_id");
    });
We're wrapping the Dataset in an array to work around the restriction that variables captured by lambda expressions must be effectively final.
The more straightforward way, for this particular case, is to simply use a for-loop rather than stream.forEach:
Dataset<Row> result = spark.read().option("header", true).csv("mypath/01.csv");
for (int i = 2; i < 31; i++) {
    Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
    result = result.join(next, "customer_id");
}

Why is this PageRank job so much slower using Dataset than RDD?

I implemented this example of PageRank in Java using the newer Dataset API. When I benchmark my code against the sample which uses the older RDD API, I find that my code takes 186 seconds while the baseline only takes 109 seconds. What is causing the discrepancy? (Side-note: is it normal for Spark to take hundreds of seconds even when the database only contains a handful of entries?)
My code:
Dataset<Row> outLinks = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "storagepage_outlinks", props);
Dataset<Row> page = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "pages", props);
outLinks = page.join(outLinks, page.col("id").equalTo(outLinks.col("storagepage_id")));
outLinks = outLinks.distinct().groupBy(outLinks.col("url")).agg(collect_set("outlinks")).cache();
Dataset<Row> ranks = outLinks.map(row -> new Tuple2<>(row.getString(0), 1.0), Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF("url", "rank");
for (int i = 0; i < iterations; i++) {
    Dataset<Row> joined = outLinks.join(ranks, new Set.Set1<>("url").toSeq());
    Dataset<Row> contribs = joined.flatMap(row -> {
        List<String> links = row.getList(1);
        double rank = row.getDouble(2);
        return links
                .stream()
                .map(s -> new Tuple2<>(s, rank / links.size()))
                .collect(Collectors.toList()).iterator();
    }, Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF("url", "num");
    Dataset<Tuple2<String, Double>> reducedByKey =
            contribs.groupByKey(r -> r.getString(0), Encoders.STRING())
                    .mapGroups((s, iterator) -> {
                        double sum = 0;
                        while (iterator.hasNext()) {
                            sum += iterator.next().getDouble(1);
                        }
                        return new Tuple2<>(s, sum);
                    }, Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE()));
    ranks = reducedByKey.map(t -> new Tuple2<>(t._1, .15 + t._2 * .85),
            Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF("url", "rank");
}
ranks.show();
The sample code which uses RDD (adapted to read from my database):
Dataset<Row> outLinks = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "storagepage_outlinks", props);
Dataset<Row> page = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "pages", props);
outLinks = page.join(outLinks, page.col("id").equalTo(outLinks.col("storagepage_id")));
outLinks = outLinks.distinct().groupBy(outLinks.col("url")).agg(collect_set("outlinks")).cache(); // TODO: play with this cache
JavaPairRDD<String, Iterable<String>> links = outLinks.javaRDD().mapToPair(row -> new Tuple2<>(row.getString(0), row.getList(1)));
// Loads all URLs with other URL(s) link to from input file and initialize ranks of them to one.
JavaPairRDD<String, Double> ranks = links.mapValues(rs -> 1.0);
// Calculates and updates URL ranks continuously using PageRank algorithm.
for (int current = 0; current < 20; current++) {
    // Calculates URL contributions to the rank of other URLs.
    JavaPairRDD<String, Double> contribs = links.join(ranks).values()
            .flatMapToPair(s -> {
                int urlCount = size(s._1());
                List<Tuple2<String, Double>> results = new ArrayList<>();
                for (String n : s._1) {
                    results.add(new Tuple2<>(n, s._2() / urlCount));
                }
                return results.iterator();
            });
    // Re-calculates URL ranks based on neighbor contributions.
    ranks = contribs.reduceByKey((x, y) -> x + y).mapValues(sum -> 0.15 + sum * 0.85);
}
// Collects all URL ranks and dump them to console.
List<Tuple2<String, Double>> output = ranks.collect();
for (Tuple2<?,?> tuple : output) {
    System.out.println(tuple._1() + " has rank: " + tuple._2() + ".");
}
TL;DR It is probably the good old "Avoid GroupByKey" problem.
Hard to say for sure, but your Dataset code is equivalent to groupByKey:
groupByKey(...).mapGroups(...)
it means that it shuffles first, then reduces the data.
Your RDD code uses reduceByKey, which should reduce the shuffle size by applying a local reduction first. If you want the Dataset code to be somewhat equivalent, you should rewrite groupByKey(...).mapGroups(...) as groupByKey(...).reduceGroups(...).
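As a rough sketch of that rewrite (assuming contribs still carries the ("url", "num") columns from your code), you could first map to a typed Tuple2 dataset and then let reduceGroups compute the per-key sum; the rest of the loop can stay unchanged:
// Sum contributions per URL with reduceGroups (partial aggregation) instead of mapGroups
Dataset<Tuple2<String, Double>> typedContribs = contribs.map(
        r -> new Tuple2<>(r.getString(0), r.getDouble(1)),
        Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE()));
Dataset<Tuple2<String, Double>> reducedByKey = typedContribs
        .groupByKey(t -> t._1, Encoders.STRING())
        .reduceGroups((a, b) -> new Tuple2<>(a._1, a._2 + b._2))
        .map(t -> t._2, Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE()));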
Another possible candidate is configuration. The default value of spark.sql.shuffle.partitions is 200, and that is what will be used for the Dataset aggregations. If
the database only contains a handful of entries
this is serious overkill.
The RDD code will use spark.default.parallelism or a value based on the parent data, which are typically much more modest.
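If the data really is that small, you can simply lower that setting for the Dataset version; a minimal sketch (the value 8 here is an arbitrary small number, not a recommendation):
// Lower the number of shuffle partitions used by Dataset/DataFrame aggregations
spark.conf().set("spark.sql.shuffle.partitions", "8");
// or set it once when building the session:
// SparkSession spark = SparkSession.builder().config("spark.sql.shuffle.partitions", "8").getOrCreate();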

Finding bigrams in Spark with Java 8

I have tokenized the sentences into a word RDD, so now I need bigrams.
e.g. This is my test => (This is), (is my), (my test)
I have searched around and found the .sliding operator for this purpose, but I'm not getting this option in my Eclipse (maybe it is only available in a newer version of Spark).
So how can I make this happen without .sliding?
Adding code to get started:
public static void biGram (JavaRDD<String> in)
{
    JavaRDD<String> sentence = in.map(s -> s.toLowerCase());
    //get bigram from sentence w/o sliding - CODE HERE
}
You can simply use the n-gram transformation feature in Spark ML.
public static void biGram (JavaRDD<String> in)
{
    // Converting each string into a Row
    JavaRDD<Row> sentence = in.map(s -> RowFactory.create(s.toLowerCase()));
    StructType schema = new StructType(new StructField[] {
            new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });
    // Creating a dataframe
    DataFrame dataFrame = sqlContext.createDataFrame(sentence, schema);
    // Tokenizing the sentence into words
    RegexTokenizer rt = new RegexTokenizer().setInputCol("sentence").setOutputCol("split")
            .setMinTokenLength(4)
            .setPattern("\\s+");
    DataFrame rtDF = rt.transform(dataFrame);
    // Creating bigrams
    NGram bigram = new NGram().setInputCol(rt.getOutputCol()).setOutputCol("bigram").setN(2); // setN(2) means bigram
    DataFrame bigramDF = bigram.transform(rtDF);
    System.out.println("Result :: " + bigramDF.select("bigram").collectAsList());
}
sliding is indeed the way to go for n-grams. The thing is, sliding works on iterators, so just split your sentence and slide over the resulting array. I am adding Scala code:
val sentences:RDD[String] = in.map(s => s.toLowerCase())
val biGrams:RDD[Iterator[Array[String]]] = sentences.map(s => s.split(" ").sliding(2))
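If you need this in Java without .sliding, you can also build the bigrams by hand inside a map over each sentence. A minimal sketch meant to fit into the biGram method from the question (so in is the JavaRDD<String> parameter), assuming simple whitespace tokenization:
// Pair up adjacent words of each sentence to form bigrams
JavaRDD<List<String>> biGrams = in.map(sentence -> {
    String[] words = sentence.toLowerCase().split("\\s+");
    List<String> grams = new ArrayList<>();
    for (int i = 0; i < words.length - 1; i++) {
        grams.add(words[i] + " " + words[i + 1]);
    }
    return grams;
});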

Reading delimited data from text file into different RDDs

I am new to Apache Spark with Java.
I have a comma-delimited text file as below:
3,45.25,23.45
5,22.15,19.35
4,33.24,12.45
2,15.67,21.22
Here the columns mean:
1st column: index value
2nd column: latitude values
3rd column: longitude values
I am trying to parse this data into 2 or 3 RDDs (or pair RDDs). This is my code so far:
JavaRDD<String> data = sc.textFile("hdfs://data.txt");
JavaRDD<Double> data1 = data.flatMap(
        new FlatMapFunction<String, Double>() {
            public Iterable<Double> call(Double data) {
                return Arrays.asList(data.split(","));
            }
        });
Something like this (use Java 8 for better readability)?
JavaRDD<String> data = sc.textFile("hdfs://data.txt");
JavaRDD<Tuple3<Integer, Float, Float>> parsedData = data.map((line) -> line.split(","))
.map((line) -> new Tuple3<>(Integer.parseInt(line[0]), Float.parseFloat(line[1]), Float.parseFloat(line[2])))
.cache(); // Cache parsed to avoid recomputation in subsequent .mapToPair calls
JavaPairRDD<Integer, Float> latitudeByIndex = parsedData.mapToPair((line) -> new Tuple2<>(line._1(), line._2()));
JavaPairRDD<Integer, Float> longitudeByIndex = parsedData.mapToPair((line) -> new Tuple2<>(line._1(), line._3()));
JavaPairRDD<Integer, Tuple2<Float, Float>> pointByIndex = parsedData.mapToPair((line) -> new Tuple2<>(line._1(), new Tuple2<>(line._2(), line._3())));

Parsing xml content line by line and extracting some values from it

How can I elegantly extract these values from the following text content? I have a long file that contains thousands of entries. I tried the XML parser and Slurper approaches, but I ran out of memory; I only have 1 GB. So now I'm reading the file line by line and extracting the values. But I think there should be a better way in Java/Groovy to do this, maybe a cleaner and more reusable way. (I read the content from standard input.)
1 line of Content:
<sample t="336" lt="0" ts="1406036100481" s="true" lb="txt1016.pb" rc="" rm="" tn="Thread Group 1-9" dt="" by="0"/>
My Groovy Solution:
Map<String, List<Integer>> requestSet = new HashMap<String, List<Integer>>();
String reqName;
String[] tmpData;
Integer reqTime;
System.in.eachLine() { line ->
    if (line.find("sample")) {
        tmpData = line.split(" ");
        reqTime = Integer.parseInt(tmpData[1].replaceAll('"', '').replaceAll("t=", ""));
        reqName = tmpData[5].replaceAll('"', '').replaceAll("lb=", "");
        if (requestSet.containsKey(reqName)) {
            List<Integer> myList = requestSet.get(reqName);
            myList.add(reqTime);
            requestSet.put(reqName, myList);
        } else {
            List<Integer> myList = new ArrayList<Integer>();
            myList.add(reqTime);
            requestSet.put(reqName, myList);
        }
    }
}
Any suggestions or code snippets that would improve this?
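If the input is (or can be wrapped as) a single well-formed XML document, one lighter-weight option is a streaming StAX parser, which never holds the whole file in memory. A rough Java sketch of the same extraction, with the attribute names t and lb taken from the sample line above:
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SampleExtractor {
    public static void main(String[] args) throws Exception {
        Map<String, List<Integer>> requestSet = new HashMap<>();
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(System.in);
        while (reader.hasNext()) {
            // Only react to <sample .../> start tags; everything else streams past
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "sample".equals(reader.getLocalName())) {
                String reqName = reader.getAttributeValue(null, "lb");
                Integer reqTime = Integer.valueOf(reader.getAttributeValue(null, "t"));
                requestSet.computeIfAbsent(reqName, k -> new ArrayList<>()).add(reqTime);
            }
        }
        reader.close();
        System.out.println(requestSet);
    }
}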
