I have a web service built around Spark that, based on a JSON request, builds a series of dataframe/dataset operations.
These operations involve multiple joins, filters, etc. that would change the ordering of the values in the columns. This final data set could have rows to the scale of millions.
Preferably without converting it to an RDD, is there any way to apply a custom sort on some columns of the final dataset, based on the order of elements passed in as lists?
The original dataframe is of the form
+----------+----------+
| Column 1 | Column 2 |
+----------+----------+
| Val 1 | val a |
+----------+----------+
| Val 2 | val b |
+----------+----------+
| val 3 | val c |
+----------+----------+
After a series of transformations are performed, the dataframe ends up looking like this.
+----------+----------+----------+----------+
| Column 1 | Column 2 | Column 3 | Column 4 |
+----------+----------+----------+----------+
| Val 2 | val b | val 999 | val 900 |
+----------+----------+----------+----------+
| Val 1 | val c | val 100 | val 9$## |
+----------+----------+----------+----------+
| val 3 | val a | val 2## | val $##8 |
+----------+----------+----------+----------+
I now need to apply a sort on multiple columns based on the order of the values passed as an Array list.
For example:
Col1values Order=[val 1,val 3,val 2]
Col3values Order=[100,2##,999]
Custom sorting works by creating a column for sorting. It does not need to be a visible column inside the dataframe. I can show it using PySpark.
Initial df:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[(1, 'a', 'A'),
(2, 'a', 'B'),
(3, 'a', 'C'),
(4, 'b', 'A'),
(5, 'b', 'B'),
(6, 'b', 'C'),
(7, 'c', 'A'),
(8, 'c', 'B'),
(9, 'c', 'C')],
['id', 'c1', 'c2']
)
Custom sorting on 1 column:
from itertools import chain
order = {'b': 1, 'a': 2, 'c': 3}
sort_col = F.create_map([F.lit(x) for x in chain(*order.items())])[F.col('c1')]
df = df.sort(sort_col)
df.show()
# +---+---+---+
# | id| c1| c2|
# +---+---+---+
# | 5| b| B|
# | 6| b| C|
# | 4| b| A|
# | 1| a| A|
# | 2| a| B|
# | 3| a| C|
# | 7| c| A|
# | 8| c| B|
# | 9| c| C|
# +---+---+---+
On 2 columns:
from itertools import chain
order1 = {'b': 1, 'a': 2, 'c': 3}
order2 = {'B': 1, 'C': 2, 'A': 3}
sort_col1 = F.create_map([F.lit(x) for x in chain(*order1.items())])[F.col('c1')]
sort_col2 = F.create_map([F.lit(x) for x in chain(*order2.items())])[F.col('c2')]
df = df.sort(sort_col1, sort_col2)
df.show()
# +---+---+---+
# | id| c1| c2|
# +---+---+---+
# | 5| b| B|
# | 6| b| C|
# | 4| b| A|
# | 2| a| B|
# | 3| a| C|
# | 1| a| A|
# | 8| c| B|
# | 9| c| C|
# | 7| c| A|
# +---+---+---+
Or as a function:
from itertools import chain
def cust_sort(col: str, order: dict):
    # Look up each value of `col` in a literal map and sort by the resulting rank
    return F.create_map([F.lit(x) for x in chain(*order.items())])[F.col(col)]
df = df.sort(
cust_sort('c1', {'b': 1, 'a': 2, 'c': 3}),
cust_sort('c2', {'B': 1, 'C': 2, 'A': 3})
)
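Since the question passes the desired order as lists rather than dicts, the rank dict can be derived from the list positions. The following is only a sketch: `order_from_list` and `cust_sort_from_list` are hypothetical helper names, not part of PySpark, and the Spark import is done lazily so that the pure-Python helper can be used on its own.

```python
from itertools import chain


def order_from_list(values):
    """Map each value to its position in the requested order,
    e.g. ['b', 'a'] -> {'b': 0, 'a': 1}."""
    return {value: rank for rank, value in enumerate(values)}


def cust_sort_from_list(col, values):
    """Build a sort expression for column `col` from an ordered list of values."""
    # Imported lazily so that order_from_list works without a Spark install.
    from pyspark.sql import functions as F

    order = order_from_list(values)
    return F.create_map([F.lit(x) for x in chain(*order.items())])[F.col(col)]


# Usage, assuming `df` is the dataframe with columns 'c1' and 'c2' from above:
# df = df.sort(
#     cust_sort_from_list('c1', ['b', 'a', 'c']),
#     cust_sort_from_list('c2', ['B', 'C', 'A']),
# )
```

Values that are missing from the list get a null rank, and with an ascending sort Spark puts nulls first by default. If they should go last instead, calling `.asc_nulls_last()` on the returned column is one option.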
I have a 2 column data frame as mentioned below.
+---+--------+
| src| dst|
+---+--------+
| 1| 2|
| 2| 3|
| 1| 3|
| 4| 3|
| 4| 5|
| 2| 5|
| 7| 8|
| 9| 8|
+---+--------+
I am trying to create a graph from this edge DF. I am able to create the graph manually using a List<Edge<String>>. I want to know if there is a simpler way to create a graph from the edge DF.
List<Edge<String>> edges = new ArrayList<>();
edges.add(new Edge<String>(1, 2, ""));
edges.add(new Edge<String>(2, 3, ""));
edges.add(new Edge<String>(1, 3, ""));
edges.add(new Edge<String>(4, 3, ""));
edges.add(new Edge<String>(4, 5, ""));
edges.add(new Edge<String>(2, 5, ""));
//sc is the Java Spark Context
JavaRDD<Edge<String>> edgeRDD = sc.parallelize(edges, 1);
ClassTag<String> stringTag = scala.reflect.ClassTag$.MODULE$.apply(String.class);
Graph<String, String> graph = Graph.fromEdges(edgeRDD.rdd(), "",StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(), stringTag, stringTag);
What would be the easiest way to create a graph from the data frame above?
I have a data set of 20000 rows in a JavaRDD. Now I want to save it as several files of exactly the same size (say 70 rows per file).
I tried it with the code below, but because the row count is not evenly divisible, some files end up with 69, 70 or 71 rows. The problem is that I need all files to have the same size except the last one (it can have fewer rows).
Help is appreciated! Thanks in advance!
myString.repartition(286).saveAsTextFile(outputPath);
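You can use filterByRange on an indexed RDD and do something like this (a Scala sketch, where rdd is the underlying RDD; note that both bounds of filterByRange are inclusive):
// Key every row by its 0-based index so that filterByRange has keys to work on
val indexed = rdd.zipWithIndex.map(_.swap).cache()
val total = indexed.count()
for (i <- 0L until total by 70L) {
indexed.filterByRange(i, i + 69).values.repartition(1).saveAsTextFile(outputPath + i)
}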
Unfortunately a Scala answer, but it works.
First define a custom partitioner:
class IndexPartitioner[V](n_per_part: Int, rdd: org.apache.spark.rdd.RDD[_ <: Product2[Long, V]], do_cache: Boolean = true) extends org.apache.spark.Partitioner {
val max = {
if (do_cache) rdd.cache()
rdd.map(_._1).max
}
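// max is the largest 0-based index, so the last partition number is max / n_per_part
override def numPartitions: Int = (max / n_per_part).toInt + 1
override def getPartition(key: Any): Int = key match {
case k:Long => (k/n_per_part).toInt
// Fallback for non-Long keys: keep the result inside [0, numPartitions)
case _ => java.lang.Math.floorMod(key.hashCode, numPartitions)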
}
}
Create an RDD of random strings and index it:
val rdd = sc.parallelize(Array.tabulate(1000)(_ => scala.util.Random.alphanumeric.filter(_.isLetter).take(5).mkString))
val rdd_idx = rdd.zipWithIndex.map(_.swap)
Create the partitioner and apply it:
val partitioner = new IndexPartitioner(70, rdd_idx)
val rdd_part = rdd_idx.partitionBy(partitioner).values
Check partition sizes:
rdd_part
.mapPartitionsWithIndex{case (i,rows) => Iterator((i,rows.size))}
.toDF("partition_number","number_of_records")
.show
/**
+----------------+-----------------+
| 0| 70|
| 1| 70|
| 2| 70|
| 3| 70|
| 4| 70|
| 5| 70|
| 6| 70|
| 7| 70|
| 8| 70|
| 9| 70|
| 10| 70|
| 11| 70|
| 12| 70|
| 13| 70|
| 14| 20|
+----------------+-----------------+
*/
One file for each partition:
import sqlContext.implicits._
rdd_part.toDF.write.format("com.databricks.spark.csv").save("/tmp/idx_part_test/")
(+1 for "_SUCCESS")
XXX$ ls /tmp/idx_part_test/ | wc -l
16
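The arithmetic behind the index-based partitioner can be checked without Spark. This is a plain-Python sketch of the same idea (row index divided by the rows per file), not a translation of the Scala class; the function names are made up for illustration.

```python
import math


def partition_for(index, rows_per_file):
    """File (partition) number for a 0-based row index."""
    return index // rows_per_file


def num_partitions(total_rows, rows_per_file):
    """Number of files needed; only the last one may be smaller."""
    return math.ceil(total_rows / rows_per_file)


def partition_sizes(total_rows, rows_per_file):
    """Count how many rows land in each file."""
    sizes = [0] * num_partitions(total_rows, rows_per_file)
    for index in range(total_rows):
        sizes[partition_for(index, rows_per_file)] += 1
    return sizes


sizes = partition_sizes(20000, 70)
print(len(sizes), sizes[0], sizes[-1])
```

For the 20000 rows from the question this gives 286 files, all with 70 rows except the last one, which holds the remaining 50.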
I have data that looks like the following:
ID Sensor No
1 specificSensor 1
2 1234 null
3 1234 null
4 specificSensor 2
5 2345 null
6 2345 null
7
...
I need an output format like this
ID Sensor No
1 specificSensor 1
2 1234 1
3 1234 1
4 specificSensor 2
5 2345 2
6 2345 2
7
...
I'm using Apache Spark in Java.
After that, the data is processed using groupBy and pivot.
I'm thinking of something like
df.withColumn("No", functions.when(df.col("Sensor").equalTo("specificSensor"), functions.monotonically_increasing_id())
//this works as I need it
.otherwise(WHEN NULL THEN VALUE ABOVE);
I don't know if this is feasible at all.
Help appreciated, thanks a lot!
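A dataframe with the ID range of each sensor can be created and then joined to the original dataframe:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val df = Seq((1, "specificSensor", Some(1)),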
(2, "1234", None),
(3, "1234", None),
(4, "specificSensor", Some(2)),
(5, "2345", None),
(6, "2345", None))
.toDF("ID", "Sensor", "No")
val idWindow = Window.orderBy("ID")
val sensorsRange = df
.where($"Sensor" === "specificSensor")
.withColumn("nextId", coalesce(lead($"id", 1).over(idWindow), lit(Long.MaxValue)))
sensorsRange.show(false)
val joinColumn = $"d.ID" > $"s.id" && $"d.ID" < $"s.nextId"
val result =
df.alias("d")
.join(sensorsRange.alias("s"), joinColumn, "left")
.select($"d.ID", $"d.Sensor", coalesce($"d.No", $"s.No").alias("No"))
Output:
+---+--------------+---+-------------------+
|ID |Sensor |No |nextId |
+---+--------------+---+-------------------+
|1 |specificSensor|1 |4 |
|4 |specificSensor|2 |9223372036854775807|
+---+--------------+---+-------------------+
+---+--------------+---+
|ID |Sensor |No |
+---+--------------+---+
|1 |specificSensor|1 |
|2 |1234 |1 |
|3 |1234 |1 |
|4 |specificSensor|2 |
|5 |2345 |2 |
|6 |2345 |2 |
+---+--------------+---+
Using the last aggregation with ignoreNulls over an ordered window does the trick:
df.select(
$"ID",
$"Sensor",
last($"No", ignoreNulls = true) over Window.orderBy($"ID") as "No")
.show()
Output:
+---+--------------+---+
| ID| Sensor| No|
+---+--------------+---+
| 1|specificSensor| 1|
| 2| 1234| 1|
| 3| 1234| 1|
| 4|specificSensor| 2|
| 5| 2345| 2|
| 6| 2345| 2|
+---+--------------+---+
P.S. I have no working Java setup right now but should be easy to translate
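To make clear what last with ignoreNulls computes over a window ordered by ID, here is the same forward fill written in plain Python. It is only an illustration of the logic on an already ordered list, with `forward_fill` as a made-up name; it is not Spark code.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value  # remember the latest non-null value
        filled.append(last_seen)
    return filled


# The "No" column from the question, ordered by ID
print(forward_fill([1, None, None, 2, None, None]))
```

One thing to keep in mind: a window with orderBy but no partitionBy makes Spark process all rows in a single partition, so this approach may become slow on large data.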
I have a Dataset as below,
+------+------+---------------+
| col1 | col2 | sum(costs) |
+------+------+---------------+
| 1 | a | 3555204326.27 |
| 4 | b | 22273491.72 |
| 5 | c | 219175.00 |
| 3 | a | 219175.00 |
| 2 | c | 75341433.37 |
+------+------+---------------+
I need to select the distinct values of col1, and my resultant dataset should have the order 1, 4, 5, 3, 2 (the order in which these values appear in the initial dataset). But the order is getting shuffled. Is there any way to maintain the same order as in the initial dataset? Any suggestion in Spark or SQL would be fine.
This dataset can be obtained by below sequence in spark.
df = sqlCtx.createDataFrame(
[(1, 'a', 355.27), (4, 'b', 222.98), (5, 'c', 275.00), (3, 'a', 25.00),
(2, 'c', 753.37)], ('Col1', 'col2', 'cost'))
You can add another column containing the index of each row, then sort on that column after "distinct". Here is an example:
import org.apache.spark.sql.functions._
val df = Seq(1, 4, 4, 5, 2)
.toDF("a")
.withColumn("id", monotonically_increasing_id())
df.show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 4| 2|
// | 5| 3|
// | 2| 4|
// +---+---+
df.dropDuplicates("a").sort("id").show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 5| 3|
// | 2| 4|
// +---+---+
Note that to do a distinct on one specific column, you can use dropDuplicates. If you want to control which row is kept when there are duplicates, use groupBy instead.
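As far as I know, dropDuplicates does not promise which of the duplicate rows survives, so the kept id is not necessarily the smallest one. The groupBy variant mentioned above avoids that by taking the minimum id per value. The sketch below shows the logic in plain Python (`first_seen_order` is a made-up name), with the PySpark equivalent as a comment.

```python
def first_seen_order(values):
    """Distinct values, ordered by the index of their first occurrence."""
    first_index = {}
    for idx, value in enumerate(values):
        # setdefault keeps the smallest index seen for each value
        first_index.setdefault(value, idx)
    return sorted(first_index, key=first_index.get)


print(first_seen_order([1, 4, 4, 5, 2]))

# PySpark equivalent, assuming a dataframe with columns 'a' and 'id' as above:
# from pyspark.sql import functions as F
# df.groupBy('a').agg(F.min('id').alias('id')).sort('id')
```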
Assuming you are trying to remove the duplicates in col2 (as there are none in col1), the final result would be:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
You could add an index column like:
df = df.withColumn("__idx", monotonically_increasing_id());
Then do all the transformations you want, and then drop it, like in:
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
This amounts to the following steps:
Step 1: load the data and stuff:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
| 3| a| 219175.0|
| 2| c| 7.534143337E7|
+----+----+---------------+
Step 2: add the index:
+----+----+---------------+-----+
|col1|col2| sum|__idx|
+----+----+---------------+-----+
| 1| a|3.55520432627E9| 0|
| 4| b| 2.227349172E7| 1|
| 5| c| 219175.0| 2|
| 3| a| 219175.0| 3|
| 2| c| 7.534143337E7| 4|
+----+----+---------------+-----+
Step 3: transformations (here remove the dups in col2) and remove the __idx column:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
The Java code could be:
package net.jgp.books.spark.ch12.lab990_others;
import static org.apache.spark.sql.functions.monotonically_increasing_id;
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
/**
* Keeping the order of rows during transformations.
*
* @author jgp
*/
public class KeepingOrderApp {
/**
* main() is your entry point to the application.
*
* @param args
*/
public static void main(String[] args) {
KeepingOrderApp app = new KeepingOrderApp();
app.start();
}
/**
* The processing code.
*/
private void start() {
// Creates a session on a local master
SparkSession spark = SparkSession.builder()
.appName("Splitting a dataframe to collect it")
.master("local")
.getOrCreate();
Dataset<Row> df = createDataframe(spark);
df.show();
df = df.withColumn("__idx", monotonically_increasing_id());
df.show();
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
df.show();
}
private static Dataset<Row> createDataframe(SparkSession spark) {
StructType schema = DataTypes.createStructType(new StructField[] {
DataTypes.createStructField(
"col1",
DataTypes.IntegerType,
false),
DataTypes.createStructField(
"col2",
DataTypes.StringType,
false),
DataTypes.createStructField(
"sum",
DataTypes.DoubleType,
false) });
List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create(1, "a", 3555204326.27));
rows.add(RowFactory.create(4, "b", 22273491.72));
rows.add(RowFactory.create(5, "c", 219175.0));
rows.add(RowFactory.create(3, "a", 219175.0));
rows.add(RowFactory.create(2, "c", 75341433.37));
return spark.createDataFrame(rows, schema);
}
}
You could add an index column to your DB and then use ORDER BY id in your SQL query.
I believe you need to rewrite your query to use GROUP BY instead of DISTINCT, as suggested in the answer to "SQL: How to keep rows order with DISTINCT?".
Newbie here on Spark... How can I use a column in a Spark dataset as a key to look up some values and add those values as a new column to the dataset?
In Python, we have something like:
df.loc[:,'values'] = df.loc[:,'key'].apply(lambda x: D.get(x))
where D is a function in Python defined earlier.
How can I do this in Spark using Java? Thank you.
Edit:
for example:
I have a following dataset df:
A
1
3
6
0
8
I want to create a weekday column based on the following dictionary:
D[1] = "Monday"
D[2] = "Tuesday"
D[3] = "Wednesday"
D[4] = "Thursday"
D[5] = "Friday"
D[6] = "Saturday"
D[7] = "Sunday"
and add the column back to my dataset df:
A days
1 Monday
3 Wednesday
6 Saturday
0 Sunday
8 NULL
This is just an example, column A could be anything other than integers of course.
You can use df.withColumn to return a new df that has the new column in addition to the existing columns of df.
Create a UDF (user-defined function) to apply the dictionary mapping.
Here's a reproducible example:
>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import udf
>>> df = spark.createDataFrame([{'A':1,'B':5},{'A':5,'B':2},{'A':1,'B':3},{'A':5,'B':4}], ['A','B'])
>>> df.show()
+---+---+
| A| B|
+---+---+
| 1| 5|
| 5| 2|
| 1| 3|
| 5| 4|
+---+---+
>>> d = {1:'x', 2:'y', 3:'w', 4:'t', 5:'z'}
>>> mapping_func = lambda x: d.get(x)
>>> df = df.withColumn('values',udf(mapping_func, StringType())("A"))
>>> df.show()
+---+---+------+
| A| B|values|
+---+---+------+
| 1| 5| x|
| 5| 2| z|
| 1| 3| x|
| 5| 4| z|
+---+---+------+