Iterating over multiple CSVs and joining with Spark SQL - java

I have several CSV files with the same headers and the same IDs. I am attempting to iterate and merge all files up to the one indexed '31'. In my while loop, I'm trying to initialise the merged dataset so it can be used for the remainder of the loop. On the last line, I get the error 'local variable merged may not have been initialised'. How should I be doing this instead?
SparkSession spark = SparkSession.builder().appName("testSql")
        .master("local[*]")
        .config("spark.sql.warehouse.dir", "file:///c:tmp")
        .getOrCreate();

Dataset<Row> first = spark.read().option("header", true).csv("mypath/01.csv");
Dataset<Row> second = spark.read().option("header", true).csv("mypath/02.csv");

IntStream.range(3, 31)
        .forEach(i -> {
            while (i == 3) {
                Dataset<Row> merged = first.join(second, first.col("customer_id").equalTo(second.col("customer_id")));
            }
            Dataset<Row> next = spark.read().option("header", true).csv("mypath/" + i + ".csv");
            Dataset<Row> merged = merged.join(next, merged.col("customer_id").equalTo(next.col("customer_id")));
        });

EDITED based on feedback in the comments.
Following your pattern, something like this would work:
Dataset<Row> ds1 = spark.read().option("header", true).csv("mypath/01.csv");
Dataset<?>[] result = {ds1};
IntStream.range(2, 31)
        .forEach(i -> {
            Dataset<Row> next = spark.read().option("header", true).csv("mypath/" + i + ".csv");
            result[0] = result[0].join(next, "customer_id");
        });
We're wrapping the Dataset in a one-element array to work around the restriction that local variables captured by a lambda must be effectively final.
The more straightforward way, for this particular case, is to simply use a for-loop rather than stream.forEach:
Dataset<Row> result = spark.read().option("header", true).csv("mypath/01.csv");
for (int i = 2; i < 31; i++) {
    Dataset<Row> next = spark.read().option("header", true).csv("mypath/" + i + ".csv");
    result = result.join(next, "customer_id");
}
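One more thing worth double-checking (an assumption based on the 01.csv / 02.csv names above): if all the files are zero-padded, building the path as "mypath/" + i + ".csv" will look for 3.csv rather than 03.csv. Formatting the index would cover that:
// Zero-pad the index so it matches names like 03.csv (assumes two-digit file names)
Dataset<Row> next = spark.read().option("header", true)
        .csv(String.format("mypath/%02d.csv", i));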

Related

Replace the WindowSpec by groupBy in spark java

I would like to replace the WindowSpec below with a groupBy and retrieve all the columns of the initial dataset:
resultDataset = targetDataset.unionByName(sourceDs);
Column[] keys = getKey(listOfColumnsAsKey);
WindowSpec windowSpec = Window.partitionBy(keys).orderBy(functions.col(ORDER).asc());
resultDataset = resultDataset.withColumn(RANK_COLUMN, functions.dense_rank().over(windowSpec))
        .filter(functions.col(RANK_COLUMN).equalTo(1))
        .drop(functions.col(RANK_COLUMN));
I tried the code below, but it does not give the expected result:
Seq<String> joinedColumns = getJoinedColumns(listOfColumnsAsKey + order column);
Dataset<Row> groupedDataset = resultDataset.groupBy(keys).agg(functions.min(ORDER).as(ORDER));
resultDataset = resultDataset.join(groupedDataset, joinedColumns);
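For reference, a minimal sketch of the shape the join-back approach normally takes, following the question's own names and assuming listOfColumnsAsKey is a List<String> and ORDER is the name of the order column. The point is that the order column has to appear both as the aggregation alias and in the join keys, so that only rows carrying each group's minimum survive (ties included, just like dense_rank() == 1); if the result is still wrong, the mismatch is usually in how the joined-columns list is built.
// Sketch only, not a tested drop-in: aggregate the minimum ORDER per key,
// then join back on the key columns plus ORDER.
Dataset<Row> minPerGroup = resultDataset
        .groupBy(keys)
        .agg(functions.min(ORDER).as(ORDER));

List<String> joinCols = new ArrayList<>(listOfColumnsAsKey);
joinCols.add(ORDER);

resultDataset = resultDataset.join(
        minPerGroup,
        scala.collection.JavaConverters.asScalaBuffer(joinCols).toSeq());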

Spark Sorting with JavaRDD<String>

Let's say I have a file with lines of strings and I import it into a JavaRDD. If I want to sort the strings and export them as a new file, how should I do it? The code below is my attempt, and it is not working:
JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "Spark Sort");
Configuration hadoopConfig = sparkContext.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.imp", DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl", LocalFileSystem.class.getName());
JavaRDD<String> lines = sparkContext.textFile(args[0]);
JavaRDD<String> sorted = lines.sortBy(i->i, true,1);
sorted.saveAsTextFile(args[1]);
What I mean by "not working" is that the output file is not sorted. I think the issue is with my "i -> i" code; I am not sure how to make it sort using the compare method of strings, as each "i" will be a string (I'm also not sure how to make it compare between different "i"s).
EDIT
I have modified the code as per the comments; I suspect the file was being read as one giant string.
JavaSparkContext sparkContext = new JavaSparkContext("local[*]", "Spark Sort");
Configuration hadoopConfig = sparkContext.hadoopConfiguration();
hadoopConfig.set("fs.hdfs.imp", DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl", LocalFileSystem.class.getName());
long start = System.currentTimeMillis();
List<String> array = buildArrayList(args[0]);
JavaRDD<String> lines = sparkContext.parallelize(array);
JavaRDD<String> sorted = lines.sortBy(i->i, true, 1);
sorted.saveAsTextFile(args[1]);
Still not sorting it :(
I did a little research. Your code is correct. Here are the samples which I tested:
Spark initialization
SparkSession spark = SparkSession.builder().appName("test")
.config("spark.debug.maxToStringFields", 10000)
.config("spark.sql.tungsten.enabled", true)
.enableHiveSupport().getOrCreate();
JavaSparkContext jSpark = new JavaSparkContext(spark.sparkContext());
Example for RDD
//RDD
JavaRDD<String> rdd = jSpark.parallelize(Arrays.asList("z", "b", "c", "a"));
JavaRDD<String> sorted = rdd.sortBy(i -> i, true, 1);
List<String> result = sorted.collect();
result.stream().forEach(i -> System.out.println(i));
The output is
a
b
c
z
You can also use the Dataset API
//Dataset
Dataset<String> stringDataset = spark.createDataset(Arrays.asList("z", "b", "c", "a"), Encoders.STRING());
Dataset<String> sortedDataset = stringDataset.sort(stringDataset.col(stringDataset.columns()[0]).desc()); // sort() is ascending by default, so desc() is set explicitly
result = sortedDataset.collectAsList();
result.stream().forEach(i -> System.out.println(i));
The output is
z
c
b
a
I think your problem is that your text file has a specific line separator. If so, you can use the flatMap function to split your giant text string into line strings.
Here is an example with Dataset:
//flatMap example
Dataset<String> singleLineDS = spark.createDataset(Arrays.asList("z:%b:%c:%a"), Encoders.STRING());
Dataset<String> splitedDS = singleLineDS.flatMap(i -> Arrays.asList(i.split(":%")).iterator(), Encoders.STRING());
Dataset<String> sortedSplitedDs = splitedDS.sort(splitedDS.col(splitedDS.columns()[0]).desc());
result = sortedSplitedDs.collectAsList();
result.stream().forEach(i -> System.out.println(i));
So you should find which separator is used in your text file and adapt the code above to your task.
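Since your code is JavaRDD based, here is a rough JavaRDD version of the same flatMap-then-sort idea (":%" is just a stand-in for whatever separator your file actually uses, and jSpark is the JavaSparkContext created above):
// Split one giant string into lines, then sort the resulting RDD
JavaRDD<String> singleLine = jSpark.parallelize(Arrays.asList("z:%b:%c:%a"));
JavaRDD<String> split = singleLine.flatMap(s -> Arrays.asList(s.split(":%")).iterator());
JavaRDD<String> sortedLines = split.sortBy(s -> s, true, 1);
sortedLines.collect().forEach(System.out::println); // prints a, b, c, z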

Why is this PageRank job so much slower using Dataset than RDD?

I implemented this example of PageRank in Java using the newer Dataset API. When I benchmark my code against the sample which uses the older RDD API, I find that my code takes 186 seconds while the baseline only takes 109 seconds. What is causing the discrepancy? (Side-note: is it normal for Spark to take hundreds of seconds even when the database only contains a handful of entries?)
My code:
Dataset<Row> outLinks = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "storagepage_outlinks", props);
Dataset<Row> page = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "pages", props);
outLinks = page.join(outLinks, page.col("id").equalTo(outLinks.col("storagepage_id")));
outLinks = outLinks.distinct().groupBy(outLinks.col("url")).agg(collect_set("outlinks")).cache();
Dataset<Row> ranks = outLinks.map(row -> new Tuple2<>(row.getString(0), 1.0), Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF("url", "rank");
for (int i = 0; i < iterations; i++) {
    Dataset<Row> joined = outLinks.join(ranks, new Set.Set1<>("url").toSeq());
    Dataset<Row> contribs = joined.flatMap(row -> {
        List<String> links = row.getList(1);
        double rank = row.getDouble(2);
        return links
                .stream()
                .map(s -> new Tuple2<>(s, rank / links.size()))
                .collect(Collectors.toList()).iterator();
    }, Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF("url", "num");
    Dataset<Tuple2<String, Double>> reducedByKey =
            contribs.groupByKey(r -> r.getString(0), Encoders.STRING())
                    .mapGroups((s, iterator) -> {
                        double sum = 0;
                        while (iterator.hasNext()) {
                            sum += iterator.next().getDouble(1);
                        }
                        return new Tuple2<>(s, sum);
                    }, Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE()));
    ranks = reducedByKey.map(t -> new Tuple2<>(t._1, .15 + t._2 * .85),
            Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE())).toDF("url", "rank");
}
ranks.show();
The sample code which uses RDD (adapted to read from my database):
Dataset<Row> outLinks = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "storagepage_outlinks", props);
Dataset<Row> page = spark.read().jdbc("jdbc:postgresql://127.0.0.1:5432/postgres", "pages", props);
outLinks = page.join(outLinks, page.col("id").equalTo(outLinks.col("storagepage_id")));
outLinks = outLinks.distinct().groupBy(outLinks.col("url")).agg(collect_set("outlinks")).cache(); // TODO: play with this cache
JavaPairRDD<String, Iterable<String>> links = outLinks.javaRDD().mapToPair(row -> new Tuple2<>(row.getString(0), row.getList(1)));
// Loads all URLs with other URL(s) link to from input file and initialize ranks of them to one.
JavaPairRDD<String, Double> ranks = links.mapValues(rs -> 1.0);
// Calculates and updates URL ranks continuously using PageRank algorithm.
for (int current = 0; current < 20; current++) {
    // Calculates URL contributions to the rank of other URLs.
    JavaPairRDD<String, Double> contribs = links.join(ranks).values()
            .flatMapToPair(s -> {
                int urlCount = size(s._1());
                List<Tuple2<String, Double>> results = new ArrayList<>();
                for (String n : s._1) {
                    results.add(new Tuple2<>(n, s._2() / urlCount));
                }
                return results.iterator();
            });
    // Re-calculates URL ranks based on neighbor contributions.
    ranks = contribs.reduceByKey((x, y) -> x + y).mapValues(sum -> 0.15 + sum * 0.85);
}
// Collects all URL ranks and dump them to console.
List<Tuple2<String, Double>> output = ranks.collect();
for (Tuple2<?,?> tuple : output) {
System.out.println(tuple._1() + " has rank: " + tuple._2() + ".");
}
TL;DR It is probably the good old Avoid GroupByKey problem.
It is hard to say for sure, but your Dataset code is equivalent to groupByKey:
groupByKey(...).mapGroups(...)
which means it shuffles first and only then reduces the data.
Your RDD code uses reduceByKey, which shrinks the shuffle by applying a local (map-side) reduction first. If you want the Dataset code to be roughly equivalent, you should rewrite groupByKey(...).mapGroups(...) as groupByKey(...).reduceGroups(...).
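As an illustration only, here is a hedged sketch of what that could look like for the contribs step in your loop (the explicit MapFunction / ReduceFunction casts from org.apache.spark.api.java.function are there to disambiguate the Java overloads; this is a sketch under those assumptions, not a tested drop-in replacement):
// Sketch: replace mapGroups with reduceGroups so Spark can combine
// partial sums on the map side before the shuffle.
Dataset<Tuple2<String, Double>> typedContribs = contribs.map(
        (MapFunction<Row, Tuple2<String, Double>>) r ->
                new Tuple2<>(r.getString(0), r.getDouble(1)),
        Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE()));

Dataset<Tuple2<String, Tuple2<String, Double>>> reduced = typedContribs
        .groupByKey((MapFunction<Tuple2<String, Double>, String>) t -> t._1, Encoders.STRING())
        .reduceGroups((ReduceFunction<Tuple2<String, Double>>) (a, b) ->
                new Tuple2<>(a._1, a._2 + b._2));

ranks = reduced.map(
        (MapFunction<Tuple2<String, Tuple2<String, Double>>, Tuple2<String, Double>>) t ->
                new Tuple2<>(t._1, 0.15 + t._2._2 * 0.85),
        Encoders.tuple(Encoders.STRING(), Encoders.DOUBLE()))
        .toDF("url", "rank");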
Another possible candidate is configuration. The default value of spark.sql.shuffle.partitions is 200, and that is what Dataset aggregations will use. If, as you say, the database only contains a handful of entries, this is serious overkill. RDDs use spark.default.parallelism or a value based on the parent data, which is typically much more modest.
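If that turns out to be the bottleneck, lowering it is a one-line configuration change (the value below is only illustrative, not a recommendation):
// Hypothetical: use far fewer shuffle partitions for a tiny dataset
spark.conf().set("spark.sql.shuffle.partitions", "8");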

Duplicate column when I create an IndexedRowMatrix using Spark

I need to calculate the pairwise similarity between several documents. To do that, I proceed as follows:
JavaPairRDD<String,String> files = sc.wholeTextFiles(file_path);
System.out.println(files.count()+"**");
JavaRDD<Row> rowRDD = files.map((Tuple2<String, String> t) -> {
    return RowFactory.create(t._1, t._2.replaceAll("[^\\w\\s]+", "").replaceAll("\\d", ""));
});
StructType schema = new StructType(new StructField[]{
    new StructField("id", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> rows = spark.createDataFrame(rowRDD, schema);
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> tokenized_rows = tokenizer.transform(rows);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words");
Dataset<Row> filtred_rows = remover.transform(tokenized_rows);
CountVectorizerModel cvModel = new CountVectorizer().setInputCol("filtered_words").setOutputCol("rowF").setVocabSize(100000).fit(filtred_rows);
Dataset<Row> verct_rows = cvModel.transform(filtred_rows);
IDF idf = new IDF().setInputCol("rowF").setOutputCol("features");
IDFModel idfModel = idf.fit(verct_rows);
Dataset<Row> rescaledData = idfModel.transform(verct_rows);
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().map((Row r) -> {
    String s = r.getAs(0);
    int index = new Integer(s.replace(s.substring(0, 24), "").replace(s.substring(s.indexOf(".txt")), ""));
    SparseVector sparse = (SparseVector) r.getAs(5);
    org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
    return new IndexedRow(index, vec);
});
System.out.println(vrdd.count()+"---");
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());
System.out.println(mat.numCols()+"---"+mat.numRows());
Unfortunately, the results show that the IndexedRowMatrix is created with 4 rows (the first one seemingly duplicated) even though my dataset contains only 3 documents.
3**
3--
1106---4
Can you help me to detect the cause of this duplication?
Most likely there is no duplication at all, and your data simply doesn't follow the specification, which requires indices to be consecutive, zero-based integers. Therefore numRows is max(row.index for row in rows) + 1:
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg.Vectors

new IndexedRowMatrix(sc.parallelize(Seq(
  IndexedRow(5, Vectors.sparse(5, Array(), Array()))  // only a single row, at index 5
))).numRows
// res4: Long = 6
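The same check as a minimal Java sketch (it assumes the JavaSparkContext sc and the org.apache.spark.mllib.linalg classes you are already using):
// One row at index 5 -> numRows() reports 6 (largest index + 1), not the row count
JavaRDD<IndexedRow> single = sc.parallelize(Arrays.asList(
        new IndexedRow(5, Vectors.sparse(5, new int[]{}, new double[]{}))));
IndexedRowMatrix m = new IndexedRowMatrix(single.rdd());
System.out.println(m.numRows()); // 6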

finding Bigrams in spark with java(8)

I have tokenized the sentences into a word RDD, so now I need bigrams.
Example: This is my test => (This is), (is my), (my test)
I have searched around and found the .sliding operator for this purpose, but I'm not seeing this option in my Eclipse (maybe it is available in a newer version of Spark).
So how can I make this happen without .sliding?
Adding code to get started:
public static void biGram(JavaRDD<String> in)
{
    JavaRDD<String> sentence = in.map(s -> s.toLowerCase());
    // get bigrams from sentence w/o sliding - CODE HERE
}
You can simply use the NGram transformation feature in Spark ML:
public static void biGram(JavaRDD<String> in)
{
    // Converting strings into Rows (mapping over the input RDD `in`)
    JavaRDD<Row> sentence = in.map(s -> RowFactory.create(s.toLowerCase()));
    StructType schema = new StructType(new StructField[] {
        new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });
    // Creating a dataframe
    DataFrame dataFrame = sqlContext.createDataFrame(sentence, schema);
    // Tokenizing the sentence into words
    RegexTokenizer rt = new RegexTokenizer().setInputCol("sentence").setOutputCol("split")
            .setMinTokenLength(4)
            .setPattern("\\s+");
    DataFrame rtDF = rt.transform(dataFrame);
    // Creating bigrams
    NGram bigram = new NGram().setInputCol(rt.getOutputCol()).setOutputCol("bigram").setN(2); // setN(2) means bigrams
    DataFrame bigramDF = bigram.transform(rtDF);
    System.out.println("Result :: " + bigramDF.select("bigram").collectAsList());
}
sliding is indeed the way to go for n-grams. The thing is, sliding works on iterators and collections, so just split your sentence and slide over the resulting array. I am adding Scala code:
val sentences:RDD[String] = in.map(s => s.toLowerCase())
val biGrams:RDD[Iterator[Array[String]]] = sentences.map(s => s.split(" ").sliding(2))
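For completeness, a plain-Java sketch of the same idea without .sliding (it assumes the same JavaRDD<String> in as in the question, with one sentence per element):
// Build bigrams by pairing each word with its successor
JavaRDD<List<String>> bigrams = in.map(s -> {
    String[] words = s.toLowerCase().split("\\s+");
    List<String> result = new ArrayList<>();
    for (int i = 0; i < words.length - 1; i++) {
        result.add(words[i] + " " + words[i + 1]);
    }
    return result;
});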
