How to output a value without brackets in Spark? - Java

I want to store the dataframe as plain values, but what I get is each value wrapped in brackets. The code:
val df = sqlContext.read.format("orc").load(filename)
// I skip the processing here; this just shows an example
df.rdd.saveAsTextFile(outputPath)
The data is:
[40fc4ab12a174bf4]
[5572a277df472931]
[5fbce7c5c854996b]
[b4283abd92ea904]
[2f486994064f6875]
What I want is :
40fc4ab12a174bf4
5572a277df472931
5fbce7c5c854996b
b4283abd92ea904
2f486994064f6875

Use spark-csv to write data:
df.write
.format("com.databricks.spark.csv")
.option("header", "false")
.save(outputPath)
Or, using the RDD, just take the first value from each Row:
df.rdd.map(l => l.get(0)).saveAsTextFile(outputPath)
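For the Java API, a minimal sketch of the same two approaches (the column name "value" and the single-string-column assumption for text output are mine, not from the question):
// Option 1: write the single string column as plain text files
df.select("value").write().text(outputPath);
// Option 2: map each Row to its first field before saving
df.toJavaRDD().map(row -> row.getString(0)).saveAsTextFile(outputPath);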

Related

Java Spark: How to get value from a column which is JSON formatted string for entire dataset?

I need some help here. I am trying to read data from Hive/CSV. There is a column whose type is string and whose value is a JSON-formatted string. It is something like this:
| Column Name A |
|----------------------------------------------------------|
|"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"|
How can I get the value of key_2 and insert it to a new column?
I tried to create a new function to get the value via Gson:
private BigDecimal getValue(final String columnValue) {
    JsonObject jsonObject = JsonParser.parseString(columnValue).getAsJsonObject();
    return jsonObject.get("key").getAsJsonObject()
            .get("data").getAsJsonObject()
            .get("key_2").getAsJsonArray()
            .get(0).getAsBigDecimal();
}
But how can I apply this method to the whole dataset?
I was trying to achieve something like this:
Dataset<Row> ds = sourceDataSet.withColumn("New_column", getValue(sourceDataSet.col("Column Name A")));
But it cannot be done as the data types are different...
Could you please give any suggestions?
Thx!
------------------Update---------------------
As @Mck suggested, I used get_json_object.
As my value is wrapped in double quotes:
"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"
I used substring to remove the quotes and make the new string look like this:
{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}
Code for the substring:
Dataset<Row> dsA = sourceDataSet.withColumn("Column Name A", expr("substring(`Column Name A`, 2, length(`Column Name A`))"));
I used dsA.show() and confirmed the dataset looks correct.
Then I used the following code to try to do it:
Dataset<Row> ds = dsA.withColumn("New_column",get_json_object(dsA.col("Column Name A"), "$.key.data.key_2[0]"));
which returns null.
However, if the data is this:
{"key":{"data":{"key_2":[456]}}}
I can get the value 456.
Any suggestions why I get null?
Thx for the help!
Use get_json_object:
ds.withColumn(
    "New_column",
    get_json_object(
        col("Column Name A").substr(lit(2), length(col("Column Name A")) - 2),
        "$.key.data.key_2[0]")
).show(false)
+----------------------------------------------------------+----------+
|Column Name A |New_column|
+----------------------------------------------------------+----------+
|"{"key":{"data":{"key_1":{"key_A":[123]},"key_2":[456]}}}"|456 |
+----------------------------------------------------------+----------+
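In Java, the same approach would look roughly like this (a sketch, using the dsA dataset from the question; the backticks in the expression handle the space in the column name):
import static org.apache.spark.sql.functions.expr;
import static org.apache.spark.sql.functions.get_json_object;

Dataset<Row> result = dsA.withColumn(
        "New_column",
        get_json_object(
                // strip the leading and trailing double quotes before parsing
                expr("substring(`Column Name A`, 2, length(`Column Name A`) - 2)"),
                "$.key.data.key_2[0]"));
result.show(false);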

JavaRDD<String> to JavaRDD<Row>

I am reading a txt file as a JavaRDD with the following command:
JavaRDD<String> vertexRDD = ctx.textFile(pathVertex);
Now, I would like to convert this to a JavaRDD<Row>, because the txt file contains two columns of integers and I want to add a schema to the rows after splitting the columns.
I also tried this:
JavaRDD<Row> rows = vertexRDD.map(line -> line.split("\t"))
But it says I cannot assign the map function to an "Object" RDD.
How can I create a JavaRDD<Row> out of a JavaRDD<String>?
How can I apply map to the JavaRDD?
Thanks!
Creating a JavaRDD out of another is implicit when you apply a transformation such as map. Here, the RDD you create is an RDD of arrays of strings (the result of split).
To get an RDD of rows, just create a Row from each array:
JavaRDD<String> vertexRDD = ctx.textFile("");
JavaRDD<String[]> rddOfArrays = vertexRDD.map(line -> line.split("\t"));
JavaRDD<Row> rddOfRows = rddOfArrays.map(fields -> RowFactory.create(fields));
Note that if your goal is then to transform the JavaRDD<Row> to a dataframe (Dataset<Row>), there is a simpler way. You can change the delimiter option when using spark.read to avoid having to use RDDs:
Dataset<Row> dataframe = spark.read()
.option("delimiter", "\t")
.csv("your_path/file.csv");
You can define these two columns as fields of a class, and then you can use:
JavaRDD<Row> rows = rdd.map(new Function<ClassName, Row>() {
    @Override
    public Row call(ClassName target) throws Exception {
        return RowFactory.create(
                target.getField1(),
                target.getUsername());
    }
});
Then create the StructFields for the schema, and finally use:
StructType struct = DataTypes.createStructType(fields);
Dataset<Row> dataFrame = sparkSession.createDataFrame(rows, struct);
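Putting the pieces together, a minimal Java sketch for the two-integer-column case from the question (the column names "src" and "dst" and the sparkSession variable are assumptions):
import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Assumed schema: two integer columns named "src" and "dst"
List<StructField> fields = Arrays.asList(
        DataTypes.createStructField("src", DataTypes.IntegerType, false),
        DataTypes.createStructField("dst", DataTypes.IntegerType, false));
StructType struct = DataTypes.createStructType(fields);

// Split each tab-separated line and parse the two integers
JavaRDD<Row> rows = vertexRDD.map(line -> {
    String[] parts = line.split("\t");
    return RowFactory.create(Integer.parseInt(parts[0]), Integer.parseInt(parts[1]));
});

Dataset<Row> dataFrame = sparkSession.createDataFrame(rows, struct);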

Java equivalent implementation of withColumn in Spark

I am trying to use the functions that are available in org.apache.spark.sql.functions. When I use them in SQL, like this, they work fine:
Dataset<Row> dfSelect = sqlContext.sql(
    "SELECT unix_timestamp(concat(Date,' ',regexp_replace(Time,'[.]',':'))) AS TIMESTAMP, `NMHC(GT)` from airQuality");
But when I use the
Dataset<Row> org.apache.spark.sql.Dataset.withColumn(String colName, Column col)
method in Java, as implemented below, it gives an error:
Dataset<Row> df = spark.read().format("csv")
.option("dateFormat", "dd/MM/yyyy")
.option("timeFormat", "hh.mm.ss")
.option("mode", "PERMISSIVE")
.option("inferSchema", true)
.option("header", true)
.schema(schema)
.load("src/main/resources/AirQualityUCI/AirQualityUCI.csv");
df.createOrReplaceTempView("airQuality");
df.withColumn("DateStamp",unix_timestamp(concat(df.col("Date"),col(" "),regexp_replace(df.col("Time"),"[.]",":"))));
The error is:
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve '` `' given input columns: [Time, Date];;
'Project [Date#0, Time#1, unix_timestamp(concat(Date#0, ' , regexp_replace(Time#1, [.], :)), yyyy-MM-dd HH:mm:ss) AS DateStamp#32]
+- Relation[Date#0,Time#1] csv
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
Your issue probably lies in the concat:
concat(df.col("Date"),col(" "),regexp_replace(df.col("Time"),"[.]",":"))
and, more precisely, in the col(" ") part, which instructs the SQL engine to find a column (hence the col function) whose name is " " (a space character). Of course, no such column exists, which is why you get an error saying there is no such column:
cannot resolve '` `' given input columns: [Time, Date];;
If what you want, as I suspect, is a blank character inside your concatenation, you can express that with a literal column value, which is lit(" ") in Spark.
That would give:
concat(df.col("Date"),lit(" "),regexp_replace(df.col("Time"),"[.]",":"))
In any case, my advice when dealing with such errors is to simplify your expression until it works, thus identifying what is at fault.
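A side note, not from the original answers: withColumn returns a new Dataset rather than modifying df in place, so keep the returned value, and pass the input format to unix_timestamp since the Date/Time columns here are not in the default yyyy-MM-dd HH:mm:ss form. A rough sketch:
Dataset<Row> dfStamped = df.withColumn("DateStamp",
        unix_timestamp(
                concat(df.col("Date"), lit(" "), regexp_replace(df.col("Time"), "[.]", ":")),
                "dd/MM/yyyy HH:mm:ss"));
dfStamped.show(false);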
Try this.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.regexp_replace;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.unix_timestamp;
import static org.apache.spark.sql.functions.lit;
//Display date and time
df.withColumn("DateTime",concat(col("Date"),lit(" "),
regexp_replace(col("Time"),"[.]",":"))).show(false);
//Display unix timestamp
df.withColumn("DateTimeUnix",unix_timestamp(concat(col("Date"),lit(" "),
regexp_replace(col("Time"),"[.]",":")),"dd/MM/yyyy HH:mm:ss")).show(false);

How to resolve the AnalysisException: resolved attribute(s) in Spark

val rdd = sc.parallelize(Seq(("vskp", Array(2.0, 1.0, 2.1, 5.4)),("hyd",Array(1.5, 0.5, 0.9, 3.7)),("hyd", Array(1.5, 0.5, 0.9, 3.2)),("tvm", Array(8.0, 2.9, 9.1, 2.5))))
val df1= rdd.toDF("id", "vals")
val rdd1 = sc.parallelize(Seq(("vskp","ap"),("hyd","tel"),("bglr","kkt")))
val df2 = rdd1.toDF("id", "state")
val df3 = df1.join(df2,df1("id")===df2("id"),"left")
The join operation works fine, but when I reuse df2 I am facing an unresolved attributes error:
val rdd2 = sc.parallelize(Seq(("vskp", "Y"),("hyd", "N"),("hyd", "N"),("tvm", "Y")))
val df4 = rdd2.toDF("id","existance")
val df5 = df4.join(df2,df4("id")===df2("id"),"left")
ERROR: org.apache.spark.sql.AnalysisException: resolved attribute(s)id#426
As mentioned in my comment, it is related to https://issues.apache.org/jira/browse/SPARK-10925 and, more specifically https://issues.apache.org/jira/browse/SPARK-14948. Reuse of the reference will create ambiguity in naming, so you will have to clone the df - see the last comment in https://issues.apache.org/jira/browse/SPARK-14948 for an example.
If you have df1, and df2 derived from df1, try renaming all columns in df2 such that no two columns have identical names after the join. So instead of df1.join(df2, ...), do:
# Step 1 rename shared column names in df2.
df2_renamed = df2.withColumnRenamed('columna', 'column_a_renamed').withColumnRenamed('columnb', 'column_b_renamed')
# Step 2 do the join on the renamed df2 such that no two columns have same name.
df1.join(df2_renamed)
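A rough Java sketch of the same rename-before-join idea, applied to the df4/df2 join from the question (the _renamed suffix is arbitrary):
Dataset<Row> df2Renamed = df2.withColumnRenamed("id", "id_renamed");
Dataset<Row> df5 = df4
        .join(df2Renamed, df4.col("id").equalTo(df2Renamed.col("id_renamed")), "left")
        .drop("id_renamed"); // drop the duplicated join key afterwards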
This issue cost me a lot of time, and I finally found an easy solution for it.
In PySpark, for the problematic column, say colA, we could simply use
import pyspark.sql.functions as F
df = df.select(F.col("colA").alias("colA"))
prior to using df in the join.
I think this should work for Scala/Java Spark too.
Just rename your columns, keeping the same names. In PySpark:
for i in df.columns:
df = df.withColumnRenamed(i,i)
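A Java equivalent of that rename loop might look like this (a sketch; it mirrors the answer by renaming every column to itself, forcing fresh column references):
Dataset<Row> renamed = df;
for (String c : df.columns()) {
    renamed = renamed.withColumnRenamed(c, c);
}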
In my case, this error appeared during a self join of the same table.
I was facing the below issue with Spark SQL and not the dataframe API:
org.apache.spark.sql.AnalysisException: Resolved attribute(s) originator#3084,program_duration#3086,originator_locale#3085 missing from program_duration#1525,guid#400,originator_locale#1524,EFFECTIVE_DATETIME_UTC#3157L,device_timezone#2366,content_rpd_id#734L,originator_sublocale#2355,program_air_datetime_utc#3155L,originator#1523,master_campaign#735,device_provider_id#2352 in operator !Deduplicate [guid#400, program_duration#3086, device_timezone#2366, originator_locale#3085, originator_sublocale#2355, master_campaign#735, EFFECTIVE_DATETIME_UTC#3157L, device_provider_id#2352, originator#3084, program_air_datetime_utc#3155L, content_rpd_id#734L]. Attribute(s) with the same name appear in the operation: originator,program_duration,originator_locale. Please check if the right attribute(s) are used.;;
Earlier I was using the query below:
SELECT * FROM DataTable as aext
INNER JOIN AnotherDataTable LAO
ON aext.device_provider_id = LAO.device_provider_id
Selecting only required columns before joining solved the issue for me.
SELECT * FROM (
select distinct EFFECTIVE_DATE,system,mso_Name,EFFECTIVE_DATETIME_UTC,content_rpd_id,device_provider_id
from DataTable
) as aext
INNER JOIN AnotherDataTable LAO ON aext.device_provider_id = LAO.device_provider_id
I got the same issue when trying to use one DataFrame in two consecutive joins.
Here is the problem: DataFrame A has 2 columns (let's call them x and y) and DataFrame B has 2 columns as well (let's call them w and z). I need to join A with B on x=z and then join them together on y=z.
(A join B on A.x=B.z) as C join B on C.y=B.z
I was getting the exact same error: in the second join it complained "resolved attribute(s) B.z#1234 ...".
Following the links @Erik provided and some other blogs and questions, I gathered that I need a clone of B.
Here is what I did:
val aDF = ...
val bDF = ...
val bCloned = spark.createDataFrame(bDF.rdd, bDF.schema)
aDF.join(bDF, aDF("x") === bDF("z")).join(bCloned, aDF("y") === bCloned("z"))
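The same clone trick in Java would be roughly (a sketch, assuming the same aDF, bDF and a SparkSession named spark):
Dataset<Row> bCloned = spark.createDataFrame(bDF.javaRDD(), bDF.schema());
Dataset<Row> joined = aDF
        .join(bDF, aDF.col("x").equalTo(bDF.col("z")))
        .join(bCloned, aDF.col("y").equalTo(bCloned.col("z")));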
@Json_Chan's answer is pretty good because it does not require any resource-intensive operation. However, when dealing with huge numbers of columns you need a generic function to handle that on the fly rather than coding hundreds of columns manually.
Luckily, you can derive that function from the DataFrame itself, so you do not need any additional code except a one-liner (at least in Python / PySpark):
import pyspark.sql.functions as f
df # Some Dataframe you have the "resolve(d) attribute(s)" error with
df = df.select([ f.col( column_name ).alias( column_name) for column_name in df.columns])
Since the correct string representation of each column is still stored in the columns attribute of the DataFrame (df.columns: list), you can just reset it with itself via .alias(). (Note: this still results in a new DataFrame, since DataFrames are immutable, meaning they cannot be changed.)
For Java developers, try calling this method:
private static Dataset<Row> cloneDataset(Dataset<Row> ds) {
    List<Column> filterColumns = new ArrayList<>();
    List<String> filterColumnsNames = new ArrayList<>();
    scala.collection.Iterator<StructField> it = ds.exprEnc().schema().toIterator();
    while (it.hasNext()) {
        String columnName = it.next().name();
        filterColumns.add(ds.col(columnName));
        filterColumnsNames.add(columnName);
    }
    ds = ds.select(JavaConversions.asScalaBuffer(filterColumns).seq())
           .toDF(scala.collection.JavaConverters.asScalaIteratorConverter(filterColumnsNames.iterator()).asScala().toSeq());
    return ds;
}
Call it on both datasets just before the join; it clones the datasets into new ones:
df1 = cloneDataset(df1);
df2 = cloneDataset(df2);
Dataset<Row> join = df1.join(df2, col("column_name"));
// if it didn't work try this
final Dataset<Row> join = cloneDataset(df1.join(df2, columns_seq));
It will work if you do the following.
Suppose you have a dataframe df1 and you want to cross join the same dataframe; then you can use the below:
df1.toDF("pcmdty_id","assctd_pcmdty_id").as("f_df")
  .join(df1.toDF("pcmdty_id","assctd_pcmdty_id").as("t_df"),
    $"f_df.pcmdty_id" === $"t_df.assctd_pcmdty_id")
  .select($"f_df.pcmdty_id", $"f_df.assctd_pcmdty_id")
From my experience, there are 2 solutions:
1) clone the DF
2) rename the columns that have ambiguity before joining the tables (don't forget to drop the duplicated join key)
Personally, I prefer the second method, because cloning a DF in the first method takes time, especially if the data size is big.
[TL;DR]
Break the AttributeReference shared between columns in the parent DataFrame and the derived DataFrame by writing the intermediate DataFrame to the file system and reading it back.
Example:
val df1 = spark.read.parquet("file1")
df1.createOrReplaceTempView("df1")
val df2 = spark.read.parquet("file2")
df2.createOrReplaceTempView("df2")
val df12 = spark.sql("""SELECT * FROM df1 as d1 JOIN df2 as d2 ON d1.a = d2.b""")
df12.createOrReplaceTempView("df12")
val df12_ = spark.sql(""" -- some transformation -- """)
df12_.createOrReplaceTempView("df12_")
val df3 = spark.read.parquet("file3")
df3.createOrReplaceTempView("df3")
val df123 = spark.sql("""SELECT * FROM df12_ as d12_ JOIN df3 as d3 ON d12_.a = d3.c""")
df123.createOrReplaceTempView("df123")
Now joining with the top-level DataFrame will lead to the "unresolved attribute" error:
val df1231 = spark.sql("""SELECT * FROM df123 as d123 JOIN df1 as d1 ON d123.a = d1.a""")
Solution: d123.a and d1.a share the same AttributeReference. Break it by writing the intermediate table df123 to the file system and reading it again; now df123write.a and d1.a no longer share an AttributeReference:
val df123 = spark.sql("""SELECT * FROM df12 as d12 JOIN df3 as d3 ON d12.a = d3.c""")
df123.createOrReplaceTempView("df123")
df123.write.parquet("df123.par")
val df123write = spark.read.parquet("df123.par")
spark.catalog.dropTempView("df123")
df123write.createOrReplaceTempView("df123")
val df1231 = spark.sql("""SELECT * FROM df123 as d123 JOIN df1 as d1 ON d123.a = d1.a""")
Long story:
We had complex ETLs with transformations and self joins of DataFrames, performed at multiple levels. We faced the "unresolved attribute" error frequently, and we initially solved it by selecting only the required attributes and joining on that projection instead of joining the top-level table directly. This solved the issue temporarily, but when we applied more transformations to these DataFrames and joined them with any top-level DataFrame, the "unresolved attribute" error raised its ugly head again.
This was happening because the bottom-level DataFrames shared the same AttributeReferences with the top-level DataFrames from which they were derived [more details].
So we broke this reference sharing by writing just one intermediate transformed DataFrame, reading it back, and continuing with our ETL. This broke the AttributeReference sharing between the bottom and top DataFrames, and we never faced the "unresolved attribute" error again.
This worked for us because, as we moved from the top-level DataFrames down through transformations and joins, the data shrank compared with the initial DataFrames. It also improved performance, since the data size was smaller and Spark didn't have to traverse the DAG all the way back to the last persisted DataFrame.
Thanks to Tomer's answer.
For Scala - the issue came up when I tried to use the column in a self-join clause; to fix it, use the methods below:
// To `and` all the column conditions
def andAll(cols: Iterable[Column]): Column =
  if (cols.isEmpty) lit(true)
  else cols.tail.foldLeft(cols.head) { case (soFar, curr) => soFar.and(curr) }

// To perform the join on renamed column names
def renameColAndJoin(leftDf: DataFrame, joinCols: Seq[String], joinType: String = "inner")(rightDf: DataFrame): DataFrame = {
  val renamedCols: Seq[String] = joinCols.map(colName => s"${colName}_renamed")
  val zippedCols: Seq[(String, String)] = joinCols.zip(renamedCols)

  // Rename the join columns on the right-hand side to break the ambiguity
  val renamedRightDf: DataFrame = zippedCols.foldLeft(rightDf) {
    case (df, (origColName, renamedColName)) => df.withColumnRenamed(origColName, renamedColName)
  }

  // Join the left columns against the renamed right columns
  val joinExpr: Column = andAll(zippedCols.map {
    case (origCol, renamedCol) => renamedRightDf(renamedCol).equalTo(leftDf(origCol))
  })

  leftDf.join(renamedRightDf, joinExpr, joinType)
}
In my case, checkpointing the original dataframe fixed the issue.
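For completeness, a minimal Java sketch of the checkpoint approach (df, otherDf and the checkpoint directory are placeholders; checkpoint() materializes the DataFrame and truncates its lineage, which also breaks the shared attribute references):
spark.sparkContext().setCheckpointDir("/tmp/spark-checkpoints"); // placeholder path
Dataset<Row> dfChk = df.checkpoint(); // eager checkpoint
Dataset<Row> joined = dfChk.join(otherDf, dfChk.col("id").equalTo(otherDf.col("id")), "left");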

Convert RDD to Dataset in Java Spark

I have an RDD and I need to convert it into a Dataset. I tried:
Dataset<Person> personDS = sqlContext.createDataset(personRDD, Encoders.bean(Person.class));
The above line throws the error:
cannot resolve method createDataset(org.apache.spark.api.java.JavaRDD<Main.Person>, org.apache.spark.sql.Encoder<T>)
However, I can convert to a Dataset after converting to a DataFrame. The below code works:
Dataset<Row> personDF = sqlContext.createDataFrame(personRDD, Person.class);
Dataset<Person> personDS = personDF.as(Encoders.bean(Person.class));
.createDataset() accepts RDD<T>, not JavaRDD<T>. JavaRDD is a wrapper around RDD in order to make calls from Java code easier. It contains the RDD internally, which can be accessed using .rdd(). The following can create a Dataset:
Dataset<Person> personDS = sqlContext.createDataset(personRDD.rdd(), Encoders.bean(Person.class));
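Note that Encoders.bean(Person.class) expects Person to be a standard Java bean. A minimal sketch of what that implies (the fields here are made up):
public class Person implements java.io.Serializable {
    private String name;
    private int age;

    public Person() {}  // no-arg constructor required for bean encoding
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public int getAge() { return age; }
    public void setAge(int age) { this.age = age; }
}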
On your RDD, use .toDS() (in Scala, via import spark.implicits._) and you will get a Dataset.
Let me know if it helps. Cheers.
In addition to the accepted answer, if you want to create a Dataset<Row> instead of a Dataset<Person> in Java, try this:
StructType yourStruct = ...; //Create your own structtype based on individual field types
Dataset<Row> personDS = sqlContext.createDataset(personRDD.rdd(), RowEncoder.apply(yourStruct));
StructType schema = new StructType()
.add("Id", DataTypes.StringType)
.add("Name", DataTypes.StringType)
.add("Country", DataTypes.StringType);
Dataset<Row> dataSet = sqlContext.createDataFrame(yourJavaRDD, schema);
Be careful with the schema variable; it is not always easy to predict which data type you need to use, and sometimes it's better to just use StringType for all columns.
