I would love it if someone could guide me on converting a Scala (or Java) ResultSet to a Spark DataFrame.
I cannot use this notation:
val jdbcDF = spark.read
.format("jdbc")
.option("url", "jdbc:mysql://XXX-XX-XXX-XX-XX.compute-1.amazonaws.com:3306/")
.option("dbtable", "pg_partner")
.option("user", "XXX")
.option("password", "XXX")
.load()
So before referring me to a similar question, please take that into account.
The reason I cannot use that notation is that I need a JDBC option which is not present in the Spark version I am using (2.2.0): the "queryTimeout" option was only added in Spark 2.4, so I need to work with the ResultSet directly.
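For context, this is roughly where I need to set the timeout on the plain JDBC side (the connection details below are placeholders, not my real ones):
import java.sql.{DriverManager, ResultSet}
val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "password") // placeholder credentials
val stmt = conn.createStatement()
stmt.setQueryTimeout(60) // seconds; this is the option missing from spark.read.jdbc in 2.2.0
val rs: ResultSet = stmt.executeQuery("select * from pg_partner")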
Any help will be appreciated.
Thank you in advance!
A working example against a public MySQL source:
import java.util.Properties
import org.apache.spark.rdd.JdbcRDD
import java.sql.{Connection, DriverManager, ResultSet}
import spark.implicits._
val url = "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam"
val username = "rfamro"
val password = ""
val myRDD = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(url, username, password),
  "select rfam_id, noise_cutoff from family limit ?, ?",
  1, 100, 10,
  r => r.getString("rfam_id") + ", " + r.getString("noise_cutoff"))
val DF = myRDD.toDF
DF.show
returns:
+-------------------+
| value|
+-------------------+
| 5_8S_rRNA, 41.9|
| U1, 39.9|
| U2, 45.9|
| tRNA, 28.9|
| Vault, 33.9|
| U12, 52.9|
....
....
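If you would rather get two typed columns instead of a single concatenated value column, a small variation (untested sketch) is to map each row to a tuple and name the columns in toDF:
// Map each JDBC row to a (String, String) tuple instead of a concatenated string
val myTupleRDD = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(url, username, password),
  "select rfam_id, noise_cutoff from family limit ?, ?",
  1, 100, 10,
  r => (r.getString("rfam_id"), r.getString("noise_cutoff")))
val namedDF = myTupleRDD.toDF("rfam_id", "noise_cutoff")
namedDF.show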
Give this a try (I haven't tried it, but it should work with slight modification):
import java.sql.ResultSet
import org.apache.spark.sql.{DataFrame, SparkSession}

// assuming the ResultSet comprises rows of (String, Int)
def resultSetToDataFrame(resultSet: ResultSet, spark: SparkSession): DataFrame = {
  val resultSetAsList: List[(String, Int)] = new Iterator[(String, Int)] {
    override def hasNext: Boolean = resultSet.next()
    override def next(): (String, Int) = {
      // JDBC column indexes are 1-based; column labels can be used instead
      (resultSet.getString(1), resultSet.getInt(2))
    }
  }.toStream.toList

  import spark.implicits._
  val listAsDataFrame: DataFrame = resultSetAsList.toDF("column_name_1", "column_name_2")
  listAsDataFrame
}
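A hypothetical usage sketch (the JDBC URL, query, and column types are assumptions, not taken from the question):
import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").getOrCreate()
val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "password") // placeholder connection
val stmt = conn.createStatement()
stmt.setQueryTimeout(60) // the queryTimeout the asker needs, set on the raw Statement
val rs = stmt.executeQuery("select some_string_col, some_int_col from some_table") // hypothetical query
val df = resultSetToDataFrame(rs, spark)
df.show()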
References:
SQL ResultSet to Scala List
Creating Spark DataFrame manually
I am writing a small UDF
val transform = udf((x: Array[Byte]) => {
val mapper = new ObjectMapper() with ScalaObjectMapper
val stream: InputStream = new ByteArrayInputStream(x);
val obs = new ObjectInputStream(stream)
val stock = mapper.readValue(obs, classOf[util.Hashtable[String, String]])
stock
})
With this I get the error:
java.lang.UnsupportedOperationException: Schema for type java.util.Hashtable[String,String] is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:809)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:740)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:926)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:739)
at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:736)
at org.apache.spark.sql.functions$.udf(functions.scala:3898)
... 59 elided
Can anyone help me understand why this is happening?
The error you get just means that Spark does not understand Java hash tables. We can reproduce your error with this simple UDF:
val gen = udf(() => new java.util.Hashtable[String, String]())
Spark tries to create a DataType (to put in a Spark schema) from a java.util.Hashtable, which it does not know how to do. Spark does understand Scala maps, though. Indeed, the following code
val gen2 = udf(() => Map("a" -> "b"))
spark.range(1).select(gen2()).show()
yields
+--------+
| UDF()|
+--------+
|[a -> b]|
+--------+
To fix the first UDF, and yours by the way, you can convert the Hashtable to a Scala map. Converting a HashMap can be done easily with JavaConverters. I do not know of any easy way to do it with a Hashtable, but you can do it this way:
import collection.JavaConverters._
val gen3 = udf(() => {
val table = new java.util.Hashtable[String, String]()
table.put("a", "b")
Map(table.entrySet.asScala.toSeq.map(x => x.getKey -> x.getValue) :_*)
})
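Applied to your UDF, the idea would be to wrap that conversion in a small helper and return the Scala map instead of the raw Hashtable (a sketch reusing the conversion above):
import collection.JavaConverters._

// Convert a java.util.Hashtable to an immutable Scala Map, as in gen3 above
def toScalaMap(table: java.util.Hashtable[String, String]): Map[String, String] =
  Map(table.entrySet.asScala.toSeq.map(x => x.getKey -> x.getValue): _*)
Returning toScalaMap(stock) from your UDF should then give Spark a type it can build a schema for (a map of strings to strings).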
I'm using jOOQ with Kotlin and I want to write a statement that fetches data from a query joining a couple of tables (example attached).
The problem I'm facing is that I want to map the result to my complex model, which consists of one-to-many and many-to-many relationships.
As far as I know, I can use the fetchGroups operation in jOOQ to somehow group the records, but I still can't figure out how to get the result into my model.
My model:
data class MicroserviceDto(
val microservice_id: Long = 1,
val microservice_name: String? = "",
val endpoint: String? = "",
val mappings: String? = "",
val solutionDefinitionMinimalDtoList: List<SolutionDefinitionDto> = emptyList(),
val projectFileDtoList: List<ProjectFileDto> = emptyList()
)
data class SolutionDefinitionDto(
val solution_definition_id: Long = 0L,
val solution_definition_name: String = "",
val solutionId: String = "",
val solutionVersion: String = ""
)
data class ProjectFileDto(
val project_file_id: Long = 1,
val model: String = "",
val relativePath: String = "",
val fileContentDtoList: List<FileContentDto> = emptyList()
)
data class FileContentDto(
val file_content_id: Long = 1,
val content: ByteArray = ByteArray(0)
)
Link to my schema diagram: Database Diagram visualization
Explanation of the diagram:
Microservice has a many-to-many relationship with SolutionDefinition
ProjectFile has a one-to-many relationship with Microservice
ProjectFile has a one-to-many relationship with SolutionDefinition
FileContent has a one-to-many relationship with ProjectFile
I've created a view to represent my desired query with all tables and the join statements between them.
Here is the View:
CREATE OR REPLACE VIEW Microservice_Metadata_by_Microservice_Id AS
select
# microservice
M.id as `microservice_id`,
M.name as `microservice_name`,
M.mappings,
M.endpoint,
# solution definition
SD.id as `solution_definition_id`,
SD.name as `solution_definition_name`,
SD.solutionId,
SD.solutionVersion,
# project file of microservice
PF.id as `project_file_id`,
PF.relativePath,
PF.model,
# file content data of project file
FC.id as `file_content_id`,
FC.content
from Microservice M
# get project file
left join Microservice_SolutionDefinition MSD
on MSD.microserviceId = M.id
left join ProjectFile PF
on PF.microserviceId = M.id
# get data content
left JOIN FileContent FC
on PF.id = FC.projectFileId
# get solutions of microservice
left join SolutionDefinition SD
on SD.id = MSD.solutionDefinitionId;
How can I implement a jOOQ DSL query that maps this result set to my data model?
Spark 2.0 (final) with Scala 2.11.8. The following super simple code yields the compilation error:
Error:(17, 45) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
import org.apache.spark.sql.SparkSession
case class SimpleTuple(id: Int, desc: String)
object DatasetTest {
val dataList = List(
SimpleTuple(5, "abc"),
SimpleTuple(6, "bcd")
)
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder
.master("local")
.appName("example")
.getOrCreate()
val dataset = sparkSession.createDataset(dataList)
}
}
Spark Datasets require Encoders for the data type that is about to be stored. For common types (atomics, product types) there are a number of predefined encoders available, but you first have to import them from SparkSession.implicits to make it work:
val sparkSession: SparkSession = ???
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)
Alternatively, you can directly provide an explicit Encoder for the stored type:
import org.apache.spark.sql.{Encoder, Encoders}
val dataset = sparkSession.createDataset(dataList)(Encoders.product[SimpleTuple])
or an implicit one:
implicit val enc: Encoder[SimpleTuple] = Encoders.product[SimpleTuple]
val dataset = sparkSession.createDataset(dataList)
Note that the Encoders object also provides a number of predefined Encoders for atomic types, and Encoders for complex ones can be derived with ExpressionEncoder.
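For example, a sketch of deriving one with ExpressionEncoder (assuming Spark 2.x and the SimpleTuple case class from the question):
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

// Derive an Encoder for the Product type explicitly instead of importing implicits
implicit val simpleTupleEncoder: Encoder[SimpleTuple] = ExpressionEncoder[SimpleTuple]()
val dataset = sparkSession.createDataset(dataList)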
Further reading:
For custom objects which are not covered by built-in encoders see How to store custom objects in Dataset?
For Row objects you have to provide Encoder explicitly as shown in Encoder error while trying to map dataframe row to updated row
For REPL/debug cases, the case class must be defined outside of the main object: https://stackoverflow.com/a/34715827/3535853
For other users (yours is correct), note that it's also important that the case class is defined outside of the object scope. So:
Fails:
object DatasetTest {
case class SimpleTuple(id: Int, desc: String)
val dataList = List(
SimpleTuple(5, "abc"),
SimpleTuple(6, "bcd")
)
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder
.master("local")
.appName("example")
.getOrCreate()
val dataset = sparkSession.createDataset(dataList)
}
}
Adding the implicits still fails with the same error:
object DatasetTest {
case class SimpleTuple(id: Int, desc: String)
val dataList = List(
SimpleTuple(5, "abc"),
SimpleTuple(6, "bcd")
)
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder
.master("local")
.appName("example")
.getOrCreate()
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)
}
}
Works:
case class SimpleTuple(id: Int, desc: String)
object DatasetTest {
val dataList = List(
SimpleTuple(5, "abc"),
SimpleTuple(6, "bcd")
)
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder
.master("local")
.appName("example")
.getOrCreate()
import sparkSession.implicits._
val dataset = sparkSession.createDataset(dataList)
}
}
Here's the relevant bug: https://issues.apache.org/jira/browse/SPARK-13540, so hopefully it will be fixed in the next release of Spark 2.
(Edit: Looks like that bugfix is actually in Spark 2.0.0... So I'm not sure why this still fails).
To clarify with an answer to my own question: if the goal is to define a simple literal Spark DataFrame, rather than use Scala tuples and implicit conversion, the simpler route is to use the Spark API directly, like this:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._
val simpleSchema = StructType(
StructField("a", StringType) ::
StructField("b", IntegerType) ::
StructField("c", IntegerType) ::
StructField("d", IntegerType) ::
StructField("e", IntegerType) :: Nil)
val data = List(
Row("001", 1, 0, 3, 4),
Row("001", 3, 4, 1, 7),
Row("001", null, 0, 6, 4),
Row("003", 1, 4, 5, 7),
Row("003", 5, 4, null, 2),
Row("003", 4, null, 9, 2),
Row("003", 2, 3, 0, 1)
)
val df = spark.createDataFrame(data.asJava, simpleSchema)
val rdd = sc.parallelize(Seq(
  ("vskp", Array(2.0, 1.0, 2.1, 5.4)),
  ("hyd", Array(1.5, 0.5, 0.9, 3.7)),
  ("hyd", Array(1.5, 0.5, 0.9, 3.2)),
  ("tvm", Array(8.0, 2.9, 9.1, 2.5))))
val df1= rdd.toDF("id", "vals")
val rdd1 = sc.parallelize(Seq(("vskp","ap"),("hyd","tel"),("bglr","kkt")))
val df2 = rdd1.toDF("id", "state")
val df3 = df1.join(df2,df1("id")===df2("id"),"left")
The join operation works fine, but when I reuse df2 I get an unresolved attribute error:
val rdd2 = sc.parallelize(Seq(("vskp", "Y"),("hyd", "N"),("hyd", "N"),("tvm", "Y")))
val df4 = rdd2.toDF("id","existance")
val df5 = df4.join(df2,df4("id")===df2("id"),"left")
ERROR: org.apache.spark.sql.AnalysisException: resolved attribute(s) id#426
As mentioned in my comment, it is related to https://issues.apache.org/jira/browse/SPARK-10925 and, more specifically, https://issues.apache.org/jira/browse/SPARK-14948. Reuse of the reference creates ambiguity in naming, so you will have to clone the DataFrame; see the last comment in https://issues.apache.org/jira/browse/SPARK-14948 for an example.
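A minimal sketch of such a clone, applied to the df2/df4 join from the question (assuming a SparkSession named spark, e.g. in the shell):
// Recreating df2 from its RDD and schema gives its columns fresh attribute IDs
val df2Cloned = spark.createDataFrame(df2.rdd, df2.schema)
val df5 = df4.join(df2Cloned, df4("id") === df2Cloned("id"), "left")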
If you have df1, and df2 derived from df1, try renaming all columns in df2 such that no two columns have an identical name after the join. So before the join, instead of df1.join(df2, ...), do:
# Step 1 rename shared column names in df2.
df2_renamed = df2.withColumnRenamed('columna', 'column_a_renamed').withColumnRenamed('columnb', 'column_b_renamed')
# Step 2 do the join on the renamed df2 such that no two columns have same name.
df1.join(df2_renamed)
This issue cost me a lot of time, and I finally found an easy solution for it.
In PySpark, for the problematic column, say colA, we could simply use
import pyspark.sql.functions as F
df = df.select(F.col("colA").alias("colA"))
prior to using df in the join.
I think this should work for Scala/Java Spark too.
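A sketch of the same trick in Scala (using the colA name from above):
import org.apache.spark.sql.functions.col

// Re-aliasing the column gives it a fresh attribute reference before the join
val dfFixed = df.select(col("colA").alias("colA"))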
Just rename your columns to their own names (re-aliasing them). In PySpark:
for i in df.columns:
df = df.withColumnRenamed(i,i)
In my case this error appeared during a self-join of the same table.
I was facing the issue below with Spark SQL, not the DataFrame API:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) originator#3084,program_duration#3086,originator_locale#3085 missing from program_duration#1525,guid#400,originator_locale#1524,EFFECTIVE_DATETIME_UTC#3157L,device_timezone#2366,content_rpd_id#734L,originator_sublocale#2355,program_air_datetime_utc#3155L,originator#1523,master_campaign#735,device_provider_id#2352 in operator !Deduplicate [guid#400, program_duration#3086, device_timezone#2366, originator_locale#3085, originator_sublocale#2355, master_campaign#735, EFFECTIVE_DATETIME_UTC#3157L, device_provider_id#2352, originator#3084, program_air_datetime_utc#3155L, content_rpd_id#734L]. Attribute(s) with the same name appear in the operation: originator,program_duration,originator_locale. Please check if the right attribute(s) are used.;;
Earlier I was using the query below:
SELECT * FROM DataTable as aext
INNER JOIN AnotherDataTable LAO
ON aext.device_provider_id = LAO.device_provider_id
Selecting only the required columns before joining solved the issue for me:
SELECT * FROM (
select distinct EFFECTIVE_DATE,system,mso_Name,EFFECTIVE_DATETIME_UTC,content_rpd_id,device_provider_id
from DataTable
) as aext
INNER JOIN AnotherDataTable LAO ON aext.device_provider_id = LAO.device_provider_id
I got the same issue when trying to use one DataFrame in two consecutive joins.
Here is the problem: DataFrame A has 2 columns (let's call them x and y) and DataFrame B has 2 columns as well (let's call them w and z). I need to join A with B on x=z and then join them together on y=z.
(A join B on A.x=B.z) as C join B on C.y=B.z
I was getting the exact error: in the second join it was complaining "resolved attribute(s) B.z#1234 ...".
Following the links @Erik provided and some other blogs and questions, I gathered I needed a clone of B.
Here is what I did:
val aDF = ...
val bDF = ...
val bCloned = spark.createDataFrame(bDF.rdd, bDF.schema)
aDF.join(bDF, aDF("x") === bDF("z")).join(bCloned, aDF("y") === bCloned("z"))
@Json_Chan's answer is pretty good because it does not require any resource-intensive operation. Anyhow, when dealing with huge numbers of columns you need a generic function to handle that on the fly rather than coding hundreds of columns manually.
Luckily, you can derive that function from the DataFrame itself, so that you do not need any additional code except for a one-liner (at least in Python / PySpark):
import pyspark.sql.functions as f
df # Some Dataframe you have the "resolve(d) attribute(s)" error with
df = df.select([ f.col( column_name ).alias( column_name) for column_name in df.columns])
Since the correct string representation of a column is still stored in the columns attribute of the DataFrame (df.columns: list), you can just reset it with itself; that's what the .alias() does. (Note: this still results in a new DataFrame, since DataFrames are immutable, meaning they cannot be changed.)
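For reference, a sketch of the equivalent one-liner in Scala:
import org.apache.spark.sql.functions.col

// Re-alias every column with its own name, producing a new DataFrame
val dfReset = df.select(df.columns.map(c => col(c).alias(c)): _*)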
For Java developers, try calling this method:
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.StructField;
import scala.collection.JavaConversions;

private static Dataset<Row> cloneDataset(Dataset<Row> ds) {
List<Column> filterColumns = new ArrayList<>();
List<String> filterColumnsNames = new ArrayList<>();
scala.collection.Iterator<StructField> it = ds.exprEnc().schema().toIterator();
while (it.hasNext()) {
String columnName = it.next().name();
filterColumns.add(ds.col(columnName));
filterColumnsNames.add(columnName);
}
ds = ds.select(JavaConversions.asScalaBuffer(filterColumns).seq()).toDF(scala.collection.JavaConverters.asScalaIteratorConverter(filterColumnsNames.iterator()).asScala().toSeq());
return ds;
}
Call it on both datasets just before the join; it clones the datasets into new ones:
df1 = cloneDataset(df1);
df2 = cloneDataset(df2);
Dataset<Row> join = df1.join(df2, col("column_name"));
// if it didn't work try this
final Dataset<Row> join = cloneDataset(df1.join(df2, columns_seq));
It will work if you do the following.
Suppose you have a DataFrame df1 and you want to join the same DataFrame with itself; you can use the below:
df1.toDF("ColA","ColB").as("f_df").join(df1.toDF("ColA","ColB").as("t_df"),
$"f_df.pcmdty_id" ===
$"t_df.assctd_pcmdty_id").select($"f_df.pcmdty_id",$"f_df.assctd_pcmdty_id")
From my experience, we have 2 solutions:
1) clone the DataFrame
2) rename the columns that have ambiguity before joining the tables (and don't forget to drop the duplicated join key)
Personally I prefer the second method, because cloning the DataFrame in the first method takes time, especially if the data size is big.
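A sketch of the second option, assuming two DataFrames df1 and df2 that share an ambiguous join key named id:
// Rename the join key on the right side, join on it, then drop the duplicate
val dfRight = df2.withColumnRenamed("id", "id_right")
val joined = df1.join(dfRight, df1("id") === dfRight("id_right"), "left").drop("id_right")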
[TL;DR]
Break the AttributeReference shared between columns in the parent DataFrame and the derived DataFrame by writing the intermediate DataFrame to the file system and reading it again.
Ex:
val df1 = spark.read.parquet("file1")
df1.createOrReplaceTempView("df1")
val df2 = spark.read.parquet("file2")
df2.createOrReplaceTempView("df2")
val df12 = spark.sql("""SELECT * FROM df1 as d1 JOIN df2 as d2 ON d1.a = d2.b""")
df12.createOrReplaceTempView("df12")
val df12_ = spark.sql(""" -- some transformation -- """)
df12_.createOrReplaceTempView("df12_")
val df3 = spark.read.parquet("file3")
df3.createOrReplaceTempView("df3")
val df123 = spark.sql("""SELECT * FROM df12_ as d12_ JOIN df3 as d3 ON d12_.a = d3.c""")
df123.createOrReplaceTempView("df123")
Now joining with the top-level DataFrame will lead to the "unresolved attribute" error:
val df1231 = spark.sql("""SELECT * FROM df123 as d123 JOIN df1 as d1 ON d123.a = d1.a""")
Solution: d123.a and d1.a share the same AttributeReference. Break it by writing the intermediate table df123 to the file system and reading it again. Now df123write.a and d1.a no longer share an AttributeReference:
val df123 = spark.sql("""SELECT * FROM df12 as d12 JOIN df3 as d3 ON d12.a = d3.c""")
df123.createOrReplaceTempView("df123")
df123.write.parquet("df123.par")
val df123write = spark.read.parquet("df123.par")
spark.catalog.dropTempView("df123")
df123write.createOrReplaceTempView("df123")
val df1231 = spark.sql("""SELECT * FROM df123 as d123 JOIN df1 as d1 ON d123.a = d1.a""")
Long story:
We had complex ETLs with transformations and self-joins of DataFrames, performed at multiple levels. We faced the "unresolved attribute" error frequently, and we solved it by selecting the required attributes from the top-level table and joining on that projection instead of joining directly with the top-level table. This solved the issue temporarily, but when we applied some more transformations on these DataFrames and joined with any top-level DataFrames, the "unresolved attribute" error raised its ugly head again.
This was happening because the bottom-level DataFrames were sharing the same AttributeReference with the top-level DataFrames from which they were derived.
So we broke this reference sharing by writing just one intermediate transformed DataFrame to disk, reading it again, and continuing with our ETL. This broke the AttributeReference sharing between the bottom DataFrames and the top DataFrames, and we never again faced the "unresolved attribute" error.
This worked for us because, as we moved from the top-level DataFrame downwards performing transformations and joins, our data shrank compared to the initial DataFrames we started with. It also improved performance, as the data size was smaller and Spark didn't have to traverse the DAG all the way back to the last persisted DataFrame.
Thanks to Tomer's answer.
For Scala: the issue came up when I tried to use a column in a self-join clause; to fix it, use the methods below.
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.lit

// To `and` all the column conditions together
def andAll(cols: Iterable[Column]): Column =
if (cols.isEmpty) lit(true)
else cols.tail.foldLeft(cols.head) { case (soFar, curr) => soFar.and(curr) }
// To perform join different col name
def renameColAndJoin(leftDf: DataFrame, joinCols: Seq[String], joinType: String = "inner")(rightDf: DataFrame): DataFrame = {
val renamedCols: Seq[String] = joinCols.map(colName => s"${colName}_renamed")
val zippedCols: Seq[(String, String)] = joinCols.zip(renamedCols)
val renamedRightDf: DataFrame = zippedCols.foldLeft(rightDf) {
case (df, (origColName, renamedColName)) => df.withColumnRenamed(origColName, renamedColName)
}
val joinExpr: Column = andAll(zippedCols.map {
case (origCol, renamedCol) => renamedRightDf(renamedCol).equalTo(leftDf(origCol))
})
leftDf.join(renamedRightDf, joinExpr, joinType)
}
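A hypothetical usage, assuming two DataFrames that share an id join column:
// dfA and dfB are hypothetical DataFrames that both contain an "id" column;
// "id" is renamed to "id_renamed" on the right side before the join
val joined = renameColAndJoin(dfA, Seq("id"))(dfB)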
In my case, checkpointing the original DataFrame fixed the issue.
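A sketch of that approach (checkpoint requires a checkpoint directory to be set first; the path and names are assumptions):
// Checkpointing truncates the lineage, so downstream joins see fresh attribute references
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints") // hypothetical path
val dfCheckpointed = df.checkpoint()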
Using Spark SQL, I have two DataFrames, both created from the same one, such as:
df = sqlContext.createDataFrame(...);
df1 = df.filter("value = 'abc'"); //[path, value]
df2 = df.filter("value = 'qwe'"); //[path, value]
I want to filter df1 to keep only rows where part of the 'path' matches any path in df2.
So if df1 has a row with path 'a/b/c/d/e', I would like to find out whether df2 contains a row whose path is 'a/b/c'.
In SQL it should be like
SELECT * FROM df1 WHERE udf(path) IN (SELECT path FROM df2)
where udf is a user-defined function that shortens the original path from df1.
The naive solution is to use a JOIN and then filter the result, but it is slow, since df1 and df2 each have more than 10 million rows.
I also tried the following code, but first I had to create a broadcast variable from df2:
static Broadcast<DataFrame> bdf;
bdf = sc.broadcast(df2); //variable 'sc' is JavaSparkContext
sqlContext.createDataFrame(df1.javaRDD().filter(
new Function<Row, Boolean>(){
@Override
public Boolean call(Row row) throws Exception {
String foo = shortenPath(row.getString(0));
return bdf.value().filter("path = '"+foo+"'").count()>0;
}
}
), myClass.class)
The problem I'm having is that Spark gets stuck when the return value is evaluated, i.e. when the filtering of df2 is performed.
I would like to know how to work with two dataframes to do this.
I really want to avoid JOIN. Any ideas?
EDIT:
In my original code, df1 has the alias 'first' and df2 has the alias 'second'. This join is not Cartesian, and it also does not use broadcast:
df1 = df1.as("first");
df2 = df2.as("second");
df1.join(df2, df1.col("first.path").
lt(df2.col("second.path"))
, "left_outer").
filter("isPrefix(first.path, second.path)").
na().drop("any");
isPrefix is a UDF:
UDF2 isPrefix = new UDF2<String, String, Boolean>() {
@Override
public Boolean call(String p, String s) throws Exception {
//return true if (p.length()+4==s.length()) and s.contains(p)
}};
shortenPath is a UDF that cuts the last two segments off the path:
UDF1 shortenPath = new UDF1<String, String>() {
@Override
public String call(String s) throws Exception {
String[] foo = s.split("/");
String result = "";
for (int i = 0; i < foo.length-2; i++) {
result += foo[i];
if(i<foo.length-3) result+="/";
}
return result;
}
};
Example of the records (path is unique):
a/a/a/b/c abc
a/a/a qwe
a/b/c/d/e abc
a/b/c qwe
a/b/b/k foo
a/b/f/a bar
...
So df1 consists of
a/a/a/b/c abc
a/b/c/d/e abc
...
and df2 consists of
a/a/a qwe
a/b/c qwe
...
There are at least a few problems with your code:
you cannot execute an action or transformation inside another action or transformation. It means that filtering the broadcasted DataFrame simply cannot work and you should get an exception.
the join you use is executed as a Cartesian product followed by a filter. Since Spark uses hashing for joins, only equality-based joins can be executed efficiently without a Cartesian product. It is slightly related to Why using a UDF in a SQL query leads to cartesian product?
if both DataFrames are relatively large and of similar size, then broadcasting is unlikely to be useful. See Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark
not important when it comes to performance, but isPrefix seems to be wrong. In particular, it looks like it can match both a prefix and a suffix
the col("first.path").lt(col("second.path")) condition looks wrong. I assume you want a/a/a/b/c from df1 to match a/a/a from df2. If so, it should be gt, not lt.
Probably the best thing you can do is something similar to this:
import org.apache.spark.sql.functions.{col, regexp_extract}
val df = sc.parallelize(Seq(
("a/a/a/b/c", "abc"), ("a/a/a","qwe"),
("a/b/c/d/e", "abc"), ("a/b/c", "qwe"),
("a/b/b/k", "foo"), ("a/b/f/a", "bar")
)).toDF("path", "value")
val df1 = df
.where(col("value") === "abc")
.withColumn("path_short", regexp_extract(col("path"), "^(.*)(/.){2}$", 1))
.as("df1")
val df2 = df.where(col("value") === "qwe").as("df2")
val joined = df1.join(df2, col("df1.path_short") === col("df2.path"))
You can try to broadcast one of the tables like this (Spark >= 1.5.0 only):
import org.apache.spark.sql.functions.broadcast
df1.join(broadcast(df2), col("df1.path_short") === col("df2.path"))
and increase the auto broadcast limits, but as I've mentioned above, it will most likely be less efficient than a plain HashJoin.
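For example, a sketch of raising that limit (assuming the sqlContext available in a Spark shell of that era; the value is in bytes and the 100 MB figure is arbitrary):
// Tables smaller than this threshold are broadcast automatically; setting it to -1 disables auto broadcast
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)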
As a possible way of implementing IN with a subquery, a LEFT SEMI JOIN can be used:
JavaSparkContext javaSparkContext = new JavaSparkContext("local", "testApp");
SQLContext sqlContext = new SQLContext(javaSparkContext);
StructType schema = DataTypes.createStructType(new StructField[]{
DataTypes.createStructField("path", DataTypes.StringType, false),
DataTypes.createStructField("value", DataTypes.StringType, false)
});
// Prepare First DataFrame
List<Row> dataForFirstDF = new ArrayList<>();
dataForFirstDF.add(RowFactory.create("a/a/a/b/c", "abc"));
dataForFirstDF.add(RowFactory.create("a/b/c/d/e", "abc"));
dataForFirstDF.add(RowFactory.create("x/y/z", "xyz"));
DataFrame df1 = sqlContext.createDataFrame(javaSparkContext.parallelize(dataForFirstDF), schema);
//
df1.show();
//
// +---------+-----+
// | path|value|
// +---------+-----+
// |a/a/a/b/c| abc|
// |a/b/c/d/e| abc|
// | x/y/z| xyz|
// +---------+-----+
// Prepare Second DataFrame
List<Row> dataForSecondDF = new ArrayList<>();
dataForSecondDF.add(RowFactory.create("a/a/a", "qwe"));
dataForSecondDF.add(RowFactory.create("a/b/c", "qwe"));
DataFrame df2 = sqlContext.createDataFrame(javaSparkContext.parallelize(dataForSecondDF), schema);
// Use left semi join to filter out df1 based on path in df2
Column pathContains = functions.column("firstDF.path").contains(functions.column("secondDF.path"));
DataFrame result = df1.as("firstDF").join(df2.as("secondDF"), pathContains, "leftsemi");
//
result.show();
//
// +---------+-----+
// | path|value|
// +---------+-----+
// |a/a/a/b/c| abc|
// |a/b/c/d/e| abc|
// +---------+-----+
The Physical Plan of such query will look like this:
== Physical Plan ==
Limit 21
ConvertToSafe
LeftSemiJoinBNL Some(Contains(path#0, path#2))
ConvertToUnsafe
Scan PhysicalRDD[path#0,value#1]
TungstenProject [path#2]
Scan PhysicalRDD[path#2,value#3]
It will use LeftSemiJoinBNL for the actual join operation, which should broadcast values internally. For more details refer to the actual implementation in Spark - LeftSemiJoinBNL.scala
P.S. I didn't quite understand the need for removing the last two path segments, but if that's needed, it can be done as @zero323 proposed (using regexp_extract).