We need to merge two datasets that have different column names; there are no common columns across the datasets.
We have tried a couple of approaches, but neither of them yields the expected result. Kindly let us know how to combine two datasets using Apache Spark with Java.
Input dataset 1
"405-048011-62815", "CRC Industries",
"630-0746","Dixon value",
"4444-444","3M INdustries",
"555-55","Dixon coupling valve"
Input dataset 2
"222-2222-5555", "Tata",
"7777-88886","WestSide",
"22222-22224","Reliance",
"33333-3333","V industries"
Expected output is:
+----------------+--------------+-------------+---------+
|          label1|     sentence1|       label2|sentence2|
+----------------+--------------+-------------+---------+
|405-048011-62815|CRC Industries|222-2222-5555|     Tata|
|        630-0746|   Dixon value|   7777-88886| WestSide|
+----------------+--------------+-------------+---------+
List<Row> data = Arrays.asList(
RowFactory.create("405-048011-62815", "CRC Industries"),
RowFactory.create("630-0746","Dixon value"),
RowFactory.create("4444-444","3M INdustries"),
RowFactory.create("555-55","Dixon coupling valve"));
StructType schema = new StructType(new StructField[] {new StructField("label1", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence1", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
List<String> listStrings = new ArrayList<String>();
listStrings.add("405-048011-62815");
listStrings.add("630-0746");
Dataset<Row> matchFound1=sentenceDataFrame.filter(col("label1").isin(listStrings.stream().toArray(String[]::new)));
matchFound1.show();
listStrings.clear();
listStrings.add("222-2222-5555");
listStrings.add("7777-88886");
List<Row> data2 = Arrays.asList(
RowFactory.create("222-2222-5555", "Tata"),
RowFactory.create("7777-88886","WestSide"),
RowFactory.create("22222-22224","Reliance"),
RowFactory.create("33333-3333","V industries"));
StructType schema2 = new StructType(new StructField[] {new StructField("label2", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence2", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
Dataset<Row> matchFound2=sentenceDataFrame2.filter(col("label2").isin(listStrings.stream().toArray(String[]::new)));
matchFound2.show();
//Approach 1
Dataset<Row> matchFound3=matchFound1.select(matchFound1.col("label1"),matchFound1.col("sentence1"),matchFound2.col("label2"),
matchFound2.col("sentence2"));
System.out.println("After concat");
matchFound3.show();
//Approach 2
Dataset<Row> matchFound4=matchFound1.filter(concat((col("label1")),matchFound1.col("sentence1"),matchFound2.col("label2"),
matchFound2.col("sentence2")));
System.out.println("After concat 2");
matchFound4.show();
The errors for each approach are as follows.
Approach 1 error:
----------
org.apache.spark.sql.AnalysisException: resolved attribute(s) label2#10,sentence2#11 missing from label1#0,sentence1#1 in operator !Project [label1#0, sentence1#1, label2#10, sentence2#11];;
!Project [label1#0, sentence1#1, label2#10, sentence2#11]
+- Filter label1#0 IN (405-048011-62815,630-0746)
+- LocalRelation [label1#0, sentence1#1]
----------
Approach 2 error:
org.apache.spark.sql.AnalysisException: filter expression 'concat(`label1`, `sentence1`, `label2`, `sentence2`)' of type string is not a boolean.;;
!Filter concat(label1#0, sentence1#1, label2#10, sentence2#11)
+- Filter label1#0 IN (405-048011-62815,630-0746)
+- LocalRelation [label1#0, sentence1#1]
Hope this works for you.
DF
val pre: Array[String] = Array("CRC Industries", "Dixon value" ,"3M INdustries" ,"Dixon coupling valve")
val rea: Array[String] = Array("405048011-62815", "630-0746", "4444-444", "555-55")
val df1 = sc.parallelize( rea zip pre).toDF("label1","sentence1")
val preasons2: Array[String] = Array("Tata", "WestSide","Reliance", "V industries")
val reasonsI2: Array[String] = Array( "222-2222-5555", "7777-88886", "22222-22224", "33333-3333")
val df2 = sc.parallelize( reasonsI2 zip preasons2 ).toDF("label2","sentence2")
String Indexer
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("label1")
.setOutputCol("label1Index")
val indexed = indexer.fit(df1).transform(df1)
indexed.show()
val indexer1 = new StringIndexer()
.setInputCol("label2")
.setOutputCol("label2Index")
val indexed1 = indexer1.fit(df2).transform(df2)
indexed1.show()
Join
val rnd_reslt12 = indexed.join(indexed1 , indexed.col("label1Index")===indexed1.col("label2Index")).drop(indexed.col("label1Index")).drop(indexed1.col("label2Index"))
rnd_reslt12.show()
+---------------+--------------------+-------------+------------+
| label1| sentence1| label2| sentence2|
+---------------+--------------------+-------------+------------+
| 630-0746| Dixon value|222-2222-5555| Tata|
| 4444-444| 3M INdustries| 22222-22224| Reliance|
| 555-55|Dixon coupling valve| 33333-3333|V industries|
|405048011-62815| CRC Industries| 7777-88886| WestSide|
+---------------+--------------------+-------------+------------+
I have done the same with StringIndexer in Java; this works.
public class StringIndexer11 {
public static void main(String[] args) {
Dataset<Row> csvDataSet=null;
try{
System.setProperty("hadoop.home.dir", "D:\\AI matching\\winutil");
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
SQLContext sqlContext = new SQLContext(sc);
SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();
List<Row> data = Arrays.asList(
RowFactory.create("405-048011-62815", "CRC Industries"),
RowFactory.create("630-0746","Dixon value"),
RowFactory.create("4444-444","3M INdustries"),
RowFactory.create("555-55","Dixon coupling valve"));
StructType schema = new StructType(new StructField[] {new StructField("label1", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence1", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
List<String> listStrings = new ArrayList<String>();
listStrings.add("405-048011-62815");
listStrings.add("630-0746");
Dataset<Row> matchFound1=sentenceDataFrame.filter(col("label1").isin(listStrings.stream().toArray(String[]::new)));
matchFound1.show();
listStrings.clear();
listStrings.add("222-2222-5555");
listStrings.add("7777-88886");
StringIndexer indexer = new StringIndexer()
.setInputCol("label1")
.setOutputCol("label1Index");
Dataset<Row> Dataset1 = indexer.fit(matchFound1).transform(matchFound1);
//Dataset1.show();
List<Row> data2 = Arrays.asList(
RowFactory.create("222-2222-5555", "Tata"),
RowFactory.create("7777-88886","WestSide"),
RowFactory.create("22222-22224","Reliance"),
RowFactory.create("33333-3333","V industries"));
StructType schema2 = new StructType(new StructField[] {new StructField("label2", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence2", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
Dataset<Row> matchFound2=sentenceDataFrame2.filter(col("label2").isin(listStrings.stream().toArray(String[]::new)));
matchFound2.show();
StringIndexer indexer1 = new StringIndexer()
.setInputCol("label2")
.setOutputCol("label2Index");
Dataset<Row> Dataset2 = indexer1.fit(matchFound2).transform(matchFound2);
//Dataset2.show();
Dataset<Row> Finalresult = Dataset1.join(Dataset2 , Dataset1.col("label1Index").equalTo(Dataset2.col("label2Index"))).drop(Dataset1.col("label1Index")).drop(Dataset2.col("label2Index"));
Finalresult.show();
}catch(Exception e)
{
e.printStackTrace();
}
}
}
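A note on the StringIndexer approach: StringIndexer assigns indices by label frequency, not by row position, so the pairing it produces does not necessarily follow the original row order (the Scala join output above differs from the expected output in the question). If the goal is simply to place the two filtered datasets side by side, row by row, an alternative is to zip each one with a positional index and join on that index. Below is a minimal sketch, assuming the matchFound1, matchFound2 and spark objects from the code above, the same spark.sql.types imports, plus org.apache.spark.api.java.JavaPairRDD, org.apache.spark.api.java.JavaRDD and scala.Tuple2; it also assumes both datasets contain the same number of rows:
// Pair each row with its position in the dataset
JavaPairRDD<Long, Row> left = matchFound1.javaRDD().zipWithIndex()
    .mapToPair(t -> new Tuple2<>(t._2, t._1));
JavaPairRDD<Long, Row> right = matchFound2.javaRDD().zipWithIndex()
    .mapToPair(t -> new Tuple2<>(t._2, t._1));
// Join on the position and flatten each pair into a single four-column row
JavaRDD<Row> zipped = left.join(right).sortByKey().values()
    .map(pair -> RowFactory.create(
        pair._1.getString(0), pair._1.getString(1),
        pair._2.getString(0), pair._2.getString(1)));
StructType joinedSchema = new StructType(new StructField[] {
    new StructField("label1", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence1", DataTypes.StringType, false, Metadata.empty()),
    new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence2", DataTypes.StringType, false, Metadata.empty()) });
spark.createDataFrame(zipped, joinedSchema).show();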
Related
I need to create a data frame in my test.
I tried the code below:
StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);
List<String> nums = new ArrayList<String>();
nums.add("value1");
nums.add("value2");
Dataset<Row> df = spark.createDataFrame(nums, structType);
The expected result is :
+------+------+
|A |B |
+------+------+
|value1|value2|
+------+------+
But it is not accepted. How do I initialize a DataFrame/Dataset?
For Spark 3.0 and earlier, SparkSession instances don't have a method to create a DataFrame from a list of Objects and a StructType.
However, there is a method that can build a DataFrame from a list of Rows and a StructType. So to make your code work, you have to change the type of nums from ArrayList<String> to ArrayList<Row>. You can do that using RowFactory:
// imports
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
// code
StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);
List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));
Dataset<Row> df = spark.createDataFrame(nums, structType);
// result
// +------+------+
// |A |B |
// +------+------+
// |value1|value2|
// +------+------+
If you want to add more rows to your DataFrame, just add other rows:
// code
...
List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));
nums.add(RowFactory.create("value3", "value4"));
Dataset<Row> df = spark.createDataFrame(nums, structType);
// result
// +------+------+
// |A |B |
// +------+------+
// |value1|value2|
// |value3|value4|
// +------+------+
Here is a cleaner way of doing things.
Step 1: Create a bean class for your custom type. Make sure it has public getters, setters, and an all-args constructor, and that the class implements Serializable.
public class StringWrapper implements Serializable {
private String key;
private String value;
public StringWrapper(String key, String value) {
this.key = key;
this.value = value;
}
public String getKey() {
return key;
}
public void setKey(String key) {
this.key = key;
}
public String getValue() {
return value;
}
public void setValue(String value) {
this.value = value;
}
}
Step 2: Generate data
List<StringWrapper> nums = new ArrayList<>();
nums.add(new StringWrapper("value1", "value2"));
Step 3: Convert it to RDD
JavaRDD<StringWrapper> rdd = javaSparkContext.parallelize(nums);
Step 4: Convert it to a Dataset
sparkSession.createDataFrame(rdd, StringWrapper.class).show(false);
Step 5: See the results
+------+------+
|key |value |
+------+------+
|value1|value2|
+------+------+
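As a small follow-up, the explicit RDD step is optional: SparkSession.createDataFrame also accepts a List of beans directly, so Steps 3 and 4 can be collapsed into one call. A sketch using the same nums list and StringWrapper class as above:
// Build the dataset straight from the bean list, no JavaRDD needed
sparkSession.createDataFrame(nums, StringWrapper.class).show(false);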
I can't get k-means to work on the "score" column of my dataset loaded from MongoDB with Spark.
Here is my code:
public static void main(String[] args) {
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
JavaMongoRDD<Document> rdd = MongoSpark.load(jsc);
Dataset<Row> df = rdd.toDF();
Dataset<Row> dataset = df.select("score");
dataset.show();
VectorAssembler assembler = new VectorAssembler()
.setInputCols(new String[]{"score"})
.setOutputCol("features");
Dataset<Row> vectorized_df = assembler.transform(dataset);
// vectorized_df.show();
KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(dataset);
Dataset<Row> predictions = model.transform(dataset);
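One thing stands out in the snippet above: vectorized_df, the output of the VectorAssembler, is never used. spark.ml's KMeans expects a vector column ("features" by default), so fitting on dataset, which only contains the raw "score" column, is a likely cause of the failure. A minimal sketch of that change, not a verified fix:
// Fit and predict on the DataFrame that actually carries the assembled "features" column
KMeans kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features");
KMeansModel model = kmeans.fit(vectorized_df);
Dataset<Row> predictions = model.transform(vectorized_df);
predictions.show();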
I have a dataset whose records have the following structure:
{"user":"A10T7BS07XCWQ1","recommendations":[{"iID":34142,"rating":22.998692},{"iID":24963,"rating":22.678337},{"iID":47761,"rating":22.31455},{"iID":28694,"rating":21.269365},{"iID":36890,"rating":21.143366},{"iID":48522,"rating":20.678747},{"iID":20032,"rating":20.330639},{"iID":57315,"rating":20.099955},{"iID":18148,"rating":20.07064},{"iID":7321,"rating":19.754635}]}
I try to flatMap my dataset in the following way:
StructType struc = new StructType();
struc.add("user", DataTypes.StringType, false);
struc.add("item", DataTypes.IntegerType, false);
struc.add("relevance", DataTypes.DoubleType, false);
ExpressionEncoder<Row> encoder = RowEncoder.apply(struc);
Dataset<Row> recomenderResult = userRecs.flatMap((FlatMapFunction<Row, Row>) row -> {
String user = row.getString(0);
List<Row> recsWithIntItemID = row.getList(1);
Integer item;
Double relevance;
List<Row> rows = new ArrayList<>();
for (Row rec : recsWithIntItemID) {
item = rec.getInt(0);
relevance = (double) rec.getFloat(1);
System.out.println(user + " : " + item + " : " + relevance);
Row newRow = RowFactory.create(user, item, relevance);
rows.add(newRow);
}
System.out.println("++++++++++++++++++++++++++++++++");
return rows.iterator();
}, encoder);
recomenderResult.write().json("temp2");
recomenderResult.show();
The system output is the following:
...
A1049B0RS95K7B : 24708 : 17.146669387817383
A1049B0RS95K7B : 2825 : 16.809375762939453
A1049B0RS95K7B : 36503 : 16.758258819580078
++++++++++++++++++++++++++++++++
...
But the Row instances are empty; the show() method gives this output:
++
||
++
||
||
I have no idea why my result dataset is empty. I have already looked through all the topics on this site relevant to my problem and used Google, but I have not found a solution. Could somebody help me?
It was a very silly bug :( Simple answer: the mistake was here. StructType.add returns a new StructType instead of modifying the one it is called on, so the result has to be reassigned:
StructType struc = new StructType();
struc = struc.add("user", DataTypes.StringType, false);
struc = struc.add("item", DataTypes.IntegerType, false);
struc = struc.add("relevance", DataTypes.DoubleType, false);
ExpressionEncoder<Row> encoder = RowEncoder.apply(struc);
It cost me two days and one night...
Using Spark with Java, I have created a DataFrame on a comma-delimited source file. If the last column in the source file contains a blank value, it throws an ArrayIndexOutOfBoundsException. Below are sample data and code. Is there any way I can handle this error? There is a high chance of getting blank values in the last column. In the sample data below, the 4th row causes the issue.
Sample Data:
1,viv,chn,34
2,man,gnt,56
3,anu,pun,22
4,raj,bang,*
Code:
JavaRDD<String> dataQualityRDD = spark.sparkContext().textFile(inputFile, 1).toJavaRDD();
String schemaString = schemaColumns;
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = dataQualityRDD.map((Function<String, Row>) record -> {
// String[] attributes = record.split(attributes[0], attributes[1].trim());
Object[] items = record.split(fileSplit);
// return RowFactory.create(attributes[0], attributes[1].trim());
return RowFactory.create(items);
});
}
}
I used Spark 2.0 and was able to read the CSV without any exception:
SparkSession spark = SparkSession.builder().config("spark.master", "local").getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<Row> csvRows = spark.read().csv("resources/csvwithnulls.csv").toJavaRDD();
StructType schema = DataTypes.createStructType(
new StructField[] { new StructField("id", DataTypes.StringType, false, Metadata.empty()),
new StructField("fname", DataTypes.StringType, false, Metadata.empty()),
new StructField("lname", DataTypes.StringType, false, Metadata.empty()),
new StructField("age", DataTypes.StringType, false, Metadata.empty()) });
Dataset<Row> newCsvRows = spark.createDataFrame(csvRows, schema);
newCsvRows.show();
I used exactly the rows you have and it worked fine.
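If you would rather keep the manual split from the question, note that Java's String.split drops trailing empty strings by default; passing a negative limit preserves them, so a row with a blank last column still produces one element per column. A minimal sketch of that change, assuming fileSplit is the comma delimiter from the original code:
JavaRDD<Row> rowRDD = dataQualityRDD.map((Function<String, Row>) record -> {
    // limit -1 keeps trailing empty fields, so a blank last column yields "" instead of a shorter array
    Object[] items = record.split(fileSplit, -1);
    return RowFactory.create(items);
});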
I need to calculate the pairwise similarity between several documents. To do that, I proceed as follows:
JavaPairRDD<String,String> files = sc.wholeTextFiles(file_path);
System.out.println(files.count()+"**");
JavaRDD<Row> rowRDD = files.map((Tuple2<String, String> t) -> {
return RowFactory.create(t._1,t._2.replaceAll("[^\\w\\s]+","").replaceAll("\\d", ""));
});
StructType schema = new StructType(new StructField[]{
new StructField("id", DataTypes.StringType, false, Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> rows = spark.createDataFrame(rowRDD, schema);
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> tokenized_rows = tokenizer.transform(rows);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words");
Dataset<Row> filtred_rows = remover.transform(tokenized_rows);
CountVectorizerModel cvModel = new CountVectorizer().setInputCol("filtered_words").setOutputCol("rowF").setVocabSize(100000).fit(filtred_rows);
Dataset<Row> verct_rows = cvModel.transform(filtred_rows);
IDF idf = new IDF().setInputCol("rowF").setOutputCol("features");
IDFModel idfModel = idf.fit(verct_rows);
Dataset<Row> rescaledData = idfModel.transform(verct_rows);
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().map((Row r) -> {
//DenseVector dense;
String s = r.getAs(0);
int index = new Integer(s.replace(s.substring(0,24),"").replace(s.substring(s.indexOf(".txt")),""));
SparseVector sparse = (SparseVector) r.getAs(5);
//dense = sparse.toDense();
org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
return new IndexedRow(index, vec);
});
System.out.println(vrdd.count()+"---");
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());
System.out.println(mat.numCols()+"---"+mat.numRows());
Unfortunately, the results show that the IndexedRowMatrix is created with 4 rows (one appears to be duplicated) even though my dataset contains only 3 documents.
3**
3--
1106---4
Can you help me to detect the cause of this duplication?
Most likely there is no duplication at all and your data simply doesn't follow the specification, which requires indices to be consecutive, zero-based integers. Therefore numRows is max(row.index for row in rows) + 1:
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg.Vectors
new IndexedRowMatrix(sc.parallelize(Seq(
IndexedRow(5, Vectors.sparse(5, Array(), Array()))) // Only one non-empty row
)).numRows
// res4: Long = 6
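For reference, a rough Java equivalent of the Scala snippet above (a sketch, assuming a JavaSparkContext named jsc, java.util.Arrays, and the org.apache.spark.mllib.linalg classes already used in the question):
// A single row at index 5 holding an empty sparse vector of size 5
IndexedRowMatrix single = new IndexedRowMatrix(
    jsc.parallelize(Arrays.asList(
        new IndexedRow(5, Vectors.sparse(5, new int[] {}, new double[] {})))).rdd());
System.out.println(single.numRows()); // 6, i.e. max row index + 1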