I have a 2-column data frame, shown below.
+---+--------+
| src| dst|
+---+--------+
| 1| 2|
| 2| 3|
| 1| 3|
| 4| 3|
| 4| 5|
| 2| 5|
| 7| 8|
| 9| 8|
+---+--------+
I am trying to create a graph from this edge DataFrame. I am able to create the graph manually using a List<Edge<String>>, but I want to know if there is any simpler way to create a graph from the edge DataFrame.
List<Edge<String>> edges = new ArrayList<>();
edges.add(new Edge<String>(1, 2, ""));
edges.add(new Edge<String>(2, 3, ""));
edges.add(new Edge<String>(1, 3, ""));
edges.add(new Edge<String>(4, 3, ""));
edges.add(new Edge<String>(4, 5, ""));
edges.add(new Edge<String>(2, 5, ""));
//sc is the Java Spark Context
JavaRDD<Edge<String>> edgeRDD = sc.parallelize(edges, 1);
ClassTag<String> stringTag = scala.reflect.ClassTag$.MODULE$.apply(String.class);
Graph<String, String> graph = Graph.fromEdges(edgeRDD.rdd(), "",StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(), stringTag, stringTag);
What would be the easiest way to create a graph from the above-mentioned data frame?
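One direction I can think of (a rough, untested sketch; edgeDf is assumed to be the two-column DataFrame above) is to map the rows straight to edges instead of building the List by hand, but I am not sure whether there is anything more built-in:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.graphx.Edge;
import org.apache.spark.graphx.Graph;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.storage.StorageLevel;
import scala.reflect.ClassTag;

// Map each (src, dst) row to an Edge; the Number cast covers both int and long columns
JavaRDD<Edge<String>> edgeRDD = edgeDf.javaRDD()
    .map(row -> new Edge<String>(
        ((Number) row.get(0)).longValue(),  // src
        ((Number) row.get(1)).longValue(),  // dst
        ""));

ClassTag<String> stringTag = scala.reflect.ClassTag$.MODULE$.apply(String.class);
Graph<String, String> graph = Graph.fromEdges(
    edgeRDD.rdd(), "",
    StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(),
    stringTag, stringTag);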
Related
I have a web service built around Spark that, based on a JSON request, builds a series of dataframe/dataset operations.
These operations involve multiple joins, filters, etc. that would change the ordering of the values in the columns. This final data set could have rows to the scale of millions.
Preferably without converting it to an RDD, is there any way to apply custom sorts on some columns of the final dataset based on the order of elements passed in as Lists?
The original dataframe is of the form
+----------+----------+
| Column 1 | Column 2 |
+----------+----------+
| Val 1 | val a |
+----------+----------+
| Val 2 | val b |
+----------+----------+
| val 3 | val c |
+----------+----------+
After a series of transformations are performed, the dataframe ends up looking like this.
+----------+----------+----------+----------+
| Column 1 | Column 2 | Column 3 | Column 4 |
+----------+----------+----------+----------+
| Val 2 | val b | val 999 | val 900 |
+----------+----------+----------+----------+
| Val 1 | val c | val 100 | val 9$## |
+----------+----------+----------+----------+
| val 3 | val a | val 2## | val $##8 |
+----------+----------+----------+----------+
I now need to apply a sort on multiple columns based on the order of the values passed in as ArrayLists.
For example:
Col1 values order = [val 1, val 3, val 2]
Col3 values order = [100, 2##, 999]
Custom sorting works by creating a column for sorting. It does not need to be a visible column inside the dataframe. I can show it using PySpark.
Initial df:
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'a', 'A'),
     (2, 'a', 'B'),
     (3, 'a', 'C'),
     (4, 'b', 'A'),
     (5, 'b', 'B'),
     (6, 'b', 'C'),
     (7, 'c', 'A'),
     (8, 'c', 'B'),
     (9, 'c', 'C')],
    ['id', 'c1', 'c2']
)
Custom sorting on 1 column:
from itertools import chain
order = {'b': 1, 'a': 2, 'c': 3}
sort_col = F.create_map([F.lit(x) for x in chain(*order.items())])[F.col('c1')]
df = df.sort(sort_col)
df.show()
# +---+---+---+
# | id| c1| c2|
# +---+---+---+
# | 5| b| B|
# | 6| b| C|
# | 4| b| A|
# | 1| a| A|
# | 2| a| B|
# | 3| a| C|
# | 7| c| A|
# | 8| c| B|
# | 9| c| C|
# +---+---+---+
On 2 columns:
from itertools import chain
order1 = {'b': 1, 'a': 2, 'c': 3}
order2 = {'B': 1, 'C': 2, 'A': 3}
sort_col1 = F.create_map([F.lit(x) for x in chain(*order1.items())])[F.col('c1')]
sort_col2 = F.create_map([F.lit(x) for x in chain(*order2.items())])[F.col('c2')]
df = df.sort(sort_col1, sort_col2)
df.show()
# +---+---+---+
# | id| c1| c2|
# +---+---+---+
# | 5| b| B|
# | 6| b| C|
# | 4| b| A|
# | 2| a| B|
# | 3| a| C|
# | 1| a| A|
# | 8| c| B|
# | 9| c| C|
# | 7| c| A|
# +---+---+---+
Or as a function:
from itertools import chain
def cust_sort(col: str, order: dict):
    return F.create_map([F.lit(x) for x in chain(*order.items())])[F.col(col)]

df = df.sort(
    cust_sort('c1', {'b': 1, 'a': 2, 'c': 3}),
    cust_sort('c2', {'B': 1, 'C': 2, 'A': 3})
)
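The same idea should carry over to the JVM side. A rough, untested Java sketch (assuming a Dataset<Row> named df with the same c1/c2 columns): build a literal map column from value to rank and sort on the looked-up rank.
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Literal map columns: value -> rank; apply() looks the column value up in the map
Column sortCol1 = map(lit("b"), lit(1), lit("a"), lit(2), lit("c"), lit(3)).apply(col("c1"));
Column sortCol2 = map(lit("B"), lit(1), lit("C"), lit(2), lit("A"), lit(3)).apply(col("c2"));

Dataset<Row> sorted = df.sort(sortCol1, sortCol2);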
I have a Dataset as below,
+------+------+---------------+
| col1 | col2 | sum(costs) |
+------+------+---------------+
| 1 | a | 3555204326.27 |
| 4 | b | 22273491.72 |
| 5 | c | 219175.00 |
| 3 | a | 219175.00 |
| 2 | c | 75341433.37 |
+------+------+---------------+
I need to select the distinct values of col1, and my resultant dataset should have the order 1, 4, 5, 3, 2 (the order in which these values appear in the initial dataset). But the order is getting shuffled. Is there any way to maintain the same order as the initial dataset? Any suggestion in Spark/SQL would be fine.
This dataset can be obtained by the sequence below in Spark.
df = sqlCtx.createDataFrame(
    [(1, 'a', 355.27), (4, 'b', 222.98), (5, 'c', 275.00), (3, 'a', 25.00),
     (2, 'c', 753.37)], ('Col1', 'col2', 'cost'))
You can add another column containing the index of each row, then sort on that column after "distinct". Here is an example:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF on a local Seq
val df = Seq(1, 4, 4, 5, 2)
.toDF("a")
.withColumn("id", monotonically_increasing_id())
df.show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 4| 2|
// | 5| 3|
// | 2| 4|
// +---+---+
df.dropDuplicates("a").sort("id").show()
// +---+---+
// | a| id|
// +---+---+
// | 1| 0|
// | 4| 1|
// | 5| 3|
// | 2| 4|
// +---+---+
Note that to do a distinct on one specific column you can use dropDuplicates; if you want to control which row to keep in case of duplicates, use groupBy.
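For instance, a rough, untested Java sketch (assuming a Dataset<Row> df with the same a and id columns as above) that keeps the earliest occurrence of each value and restores the original order:
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// groupBy lets you pick which duplicate survives; here the smallest id, i.e. the first occurrence
Dataset<Row> deduped = df
    .groupBy("a")
    .agg(min("id").as("id"))
    .sort("id");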
Assuming you are trying to remove the duplicates in col2 (as there are none in col1), the final result would be:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
You could add an index column like:
df = df.withColumn("__idx", monotonically_increasing_id());
Then do all the transformations you want, and then drop it, like in:
df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
That would mean:
Step 1: load the data:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
| 3| a| 219175.0|
| 2| c| 7.534143337E7|
+----+----+---------------+
Step 2: add the index:
+----+----+---------------+-----+
|col1|col2| sum|__idx|
+----+----+---------------+-----+
| 1| a|3.55520432627E9| 0|
| 4| b| 2.227349172E7| 1|
| 5| c| 219175.0| 2|
| 3| a| 219175.0| 3|
| 2| c| 7.534143337E7| 4|
+----+----+---------------+-----+
Step 3: transformations (here remove the dups in col2) and remove the __idx column:
+----+----+---------------+
|col1|col2| sum|
+----+----+---------------+
| 1| a|3.55520432627E9|
| 4| b| 2.227349172E7|
| 5| c| 219175.0|
+----+----+---------------+
The Java code could be:
package net.jgp.books.spark.ch12.lab990_others;

import static org.apache.spark.sql.functions.monotonically_increasing_id;

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

/**
 * Keeping the order of rows during transformations.
 *
 * @author jgp
 */
public class KeepingOrderApp {

  /**
   * main() is your entry point to the application.
   *
   * @param args
   */
  public static void main(String[] args) {
    KeepingOrderApp app = new KeepingOrderApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Splitting a dataframe to collect it")
        .master("local")
        .getOrCreate();

    Dataset<Row> df = createDataframe(spark);
    df.show();

    df = df.withColumn("__idx", monotonically_increasing_id());
    df.show();

    df = df.dropDuplicates("col2").orderBy("__idx").drop("__idx");
    df.show();
  }

  private static Dataset<Row> createDataframe(SparkSession spark) {
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("col1", DataTypes.IntegerType, false),
        DataTypes.createStructField("col2", DataTypes.StringType, false),
        DataTypes.createStructField("sum", DataTypes.DoubleType, false) });

    List<Row> rows = new ArrayList<>();
    rows.add(RowFactory.create(1, "a", 3555204326.27));
    rows.add(RowFactory.create(4, "b", 22273491.72));
    rows.add(RowFactory.create(5, "c", 219175.0));
    rows.add(RowFactory.create(3, "a", 219175.0));
    rows.add(RowFactory.create(2, "c", 75341433.37));
    return spark.createDataFrame(rows, schema);
  }
}
You could add an index column to your DB table and then use an ORDER BY id in your SQL query.
I believe you need to rewrite your query and use GROUP BY instead of DISTINCT, as this answer suggests: SQL: How to keep rows order with DISTINCT?
I am performing groupBy on COL1 and getting the concatenated list of COL2 using concat_ws. How can I get the count of values in that list? Here's my code:
Dataset<Row> ds = df.groupBy("COL1").agg(org.apache.spark.sql.functions
.concat_ws(",",org.apache.spark.sql.functions.collect_list("COL2")).as("sample"));
Use the size function:
size(e: Column): Column
Returns the length of an array or a map.
The following example is in Scala; I am leaving it to you to convert it to Java, but the general idea is exactly the same regardless of the programming language.
val input = spark.range(4)
.withColumn("COL1", $"id" % 2)
.select($"COL1", $"id" as "COL2")
scala> input.show
+----+----+
|COL1|COL2|
+----+----+
| 0| 0|
| 1| 1|
| 0| 2|
| 1| 3|
+----+----+
val s = input
.groupBy("COL1")
.agg(
concat_ws(",", collect_list("COL2")) as "concat",
size(collect_list("COL2")) as "size") // <-- size
scala> s.show
+----+------+----+
|COL1|concat|size|
+----+------+----+
| 0| 0,2| 2|
| 1| 1,3| 2|
+----+------+----+
In Java that'd be as follows. Thanks Krishna Prasad for sharing the code with the SO/Spark community!
Dataset<Row> ds = df.groupBy("COL1").agg(
org.apache.spark.sql.functions.concat_ws(",",org.apache.spark.sql.functions.collect_list("COL2")).as("sample"),
org.apache.spark.sql.functions.size(org.apache.spark.sql.functions.collect_list("COL2")).as("size"));
Newbie here on Spark... how can I use a column in a Spark dataset as a key to get some values and add those values as a new column to the dataset?
In python, we have something like:
df.loc[:,'values'] = df.loc[:,'key'].apply(lambda x: D.get(x))
where D is a dictionary defined earlier in Python.
How can I do this in Spark using Java? Thank you.
Edit:
for example:
I have a following dataset df:
A
1
3
6
0
8
I want to create a weekday column based on the following dictionary:
D[1] = "Monday"
D[2] = "Tuesday"
D[3] = "Wednesday"
D[4] = "Thursday"
D[5] = "Friday"
D[6] = "Saturday"
D[7] = "Sunday"
and add the column back to my dataset df:
A days
1 Monday
3 Wednesday
6 Saturday
0 Sunday
8 NULL
This is just an example, column A could be anything other than integers of course.
You can use df.withColumn to return a new df with the new column values and the previous values of df.
Create a udf (user-defined function) to apply the dictionary mapping.
Here's a reproducible example:
>>> from pyspark.sql.types import StringType
>>> from pyspark.sql.functions import udf
>>> df = spark.createDataFrame([{'A':1,'B':5},{'A':5,'B':2},{'A':1,'B':3},{'A':5,'B':4}], ['A','B'])
>>> df.show()
+---+---+
| A| B|
+---+---+
| 1| 5|
| 5| 2|
| 1| 3|
| 5| 4|
+---+---+
>>> d = {1:'x', 2:'y', 3:'w', 4:'t', 5:'z'}
>>> mapping_func = lambda x: d.get(x)
>>> df = df.withColumn('values',udf(mapping_func, StringType())("A"))
>>> df.show()
+---+---+------+
| A| B|values|
+---+---+------+
| 1| 5| x|
| 5| 2| z|
| 1| 3| x|
| 5| 4| z|
+---+---+------+
After some data processing, I end up with this Dataset:
Dataset<Row> counts //ID,COUNT,DAY_OF_WEEK
Now I want to transform it to the following format and save it as CSV:
ID,COUNT_DoW1, ID,COUNT_DoW2, ID,COUNT_DoW3,..ID,COUNT_DoW7
I can think of one approach:
JavaPairRDD<Long, Map<Integer, Integer>> r = counts.toJavaRDD().mapToPair(...)
JavaPairRDD<Long, Map<Integer, Integer>> merged = r.reduceByKey(...);
where each entry is a pair of "ID" and a List of size 7.
After I get the JavaPairRDD, I can store it as CSV. Is there a simpler approach for this transformation without converting it to an RDD?
You can use the struct function to construct a pair from cnt and day and then do a groupBy with collect_list.
Something like this (Scala, but you can easily convert it to Java):
df.groupBy("ID").agg(collect_list(struct("COUNT","DAY")))
Now you can write a UDF which extracts the relevant column. In a loop, do a withColumn to copy the ID (df.withColumn("id2", col("id"))), then create a UDF which extracts the count element at position i and run it for each position, and lastly do the same for day.
If you keep the order you want and drop the irrelevant columns, you get what you asked for. A partial sketch of the grouping step follows.
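A rough, untested Java sketch of that first grouping step (column names taken from the question; the per-position UDF extraction is left out):
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Collect the (COUNT, DAY_OF_WEEK) pairs per ID; UDFs would then pull
// individual elements out of the "pairs" array as described above
Dataset<Row> grouped = counts
    .groupBy("ID")
    .agg(collect_list(struct(col("COUNT"), col("DAY_OF_WEEK"))).as("pairs"));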
You can also work with the pivot command (again in Scala, but you should be able to convert it to Java easily; a Java sketch follows the output below):
df.show()
>>+---+---+---+
>>| id|cnt|day|
>>+---+---+---+
>>|333| 31| 1|
>>|333| 32| 2|
>>|333|133| 3|
>>|333| 34| 4|
>>|333| 35| 5|
>>|333| 36| 6|
>>|333| 37| 7|
>>|222| 41| 4|
>>|111| 11| 1|
>>|111| 22| 2|
>>|111| 33| 3|
>>|111| 44| 4|
>>|111| 55| 5|
>>|111| 66| 6|
>>|111| 77| 7|
>>|222| 21| 1|
>>+---+---+---+
val df2 = df.withColumn("all", struct('id, 'cnt, 'day))
val res = df2.groupBy("id").pivot("day").agg(first('all).as("bla")).select("1.*", "2.*", "3.*", "4.*", "5.*", "6.*", "7.*")
res.show()
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
>>| id|cnt|day| id| cnt| day| id| cnt| day| id|cnt|day| id| cnt| day| id| cnt| day| id| cnt| day|
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
>>|333| 31| 1| 333| 32| 2| 333| 133| 3|333| 34| 4| 333| 35| 5| 333| 36| 6| 333| 37| 7|
>>|222| 21| 1|null|null|null|null|null|null|222| 41| 4|null|null|null|null|null|null|null|null|null|
>>|111| 11| 1| 111| 22| 2| 111| 33| 3|111| 44| 4| 111| 55| 5| 111| 66| 6| 111| 77| 7|
>>+---+---+---+----+----+----+----+----+----+---+---+---+----+----+----+----+----+----+----+----+----+
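And the pivot step as a rough, untested Java sketch (column names as in the Scala snippet; the final per-day field selection would follow in the same way):
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// One struct column per day after the pivot; missing days stay null
Dataset<Row> wide = df
    .withColumn("all", struct(col("id"), col("cnt"), col("day")))
    .groupBy("id")
    .pivot("day")
    .agg(first(col("all")).as("bla"));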