How to create a DataFrame using Spark in Java

I need to create a data frame in my test.
I tried the code below:
StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);
List<String> nums = new ArrayList<String>();
nums.add("value1");
nums.add("value2");
Dataset<Row> df = spark.createDataFrame(nums, structType);
The expected result is:
+------+------+
|A |B |
+------+------+
|value1|value2|
+------+------+
But this code is not accepted. How do I initialize a DataFrame/Dataset?

For Spark 3.0 and earlier, SparkSession instances don't have a method that creates a DataFrame from a list of plain objects and a StructType.
However, there is a method that builds a DataFrame from a list of Rows and a StructType. So to make your code work, change the type of nums from ArrayList<String> to ArrayList<Row>. You can do that using RowFactory:
// imports
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
// code
StructType structType = new StructType();
structType = structType.add("A", DataTypes.StringType, false);
structType = structType.add("B", DataTypes.StringType, false);
List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));
Dataset<Row> df = spark.createDataFrame(nums, structType);
// result
// +------+------+
// |A |B |
// +------+------+
// |value1|value2|
// +------+------+
If you want more rows in your dataframe, just add more Row objects to the list:
// code
...
List<Row> nums = new ArrayList<Row>();
nums.add(RowFactory.create("value1", "value2"));
nums.add(RowFactory.create("value3", "value4"));
Dataset<Row> df = spark.createDataFrame(nums, structType);
// result
// +------+------+
// |A |B |
// +------+------+
// |value1|value2|
// |value3|value4|
// +------+------+
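Since the question mentions a test, here is a minimal sketch of a self-contained setup that the snippet above can run inside (a local-mode session; the app name is illustrative):
SparkSession spark = SparkSession.builder()
        .appName("createDataFrameTest")
        .master("local[*]")
        .getOrCreate();

// build structType and nums exactly as shown above, then:
Dataset<Row> df = spark.createDataFrame(nums, structType);
df.show();

spark.stop();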

Alternatively, here is a cleaner way of doing things with a Java bean.
Step 1: Create a bean class for your custom type. Make sure it has public getters and setters and an all-args constructor, and that the class implements Serializable.
public class StringWrapper implements Serializable {
    private String key;
    private String value;

    public StringWrapper(String key, String value) {
        this.key = key;
        this.value = value;
    }

    public String getKey() {
        return key;
    }

    public void setKey(String key) {
        this.key = key;
    }

    public String getValue() {
        return value;
    }

    public void setValue(String value) {
        this.value = value;
    }
}
Step 2: Generate data
List<StringWrapper> nums = new ArrayList<>();
nums.add(new StringWrapper("value1", "value2"));
Step 3: Convert it to RDD
JavaRDD<StringWrapper> rdd = javaSparkContext.parallelize(nums);
Step 4: Convert it to dataset
sparkSession.createDataFrame(rdd, StringWrapper.class).show(false);
Step 5: See the results
+------+------+
|key |value |
+------+------+
|value1|value2|
+------+------+
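A side note: SparkSession.createDataFrame also has an overload that takes a java.util.List of beans directly, so steps 3 and 4 can be collapsed when the intermediate RDD is not otherwise needed. A minimal sketch using the same nums list and StringWrapper bean:
sparkSession.createDataFrame(nums, StringWrapper.class).show(false);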

Related

Convert a string datatype to an array datatype in Spark Java

JavaRDD<Row> rawText = raw
    .filter(
        (FilterFunction<Row>) f -> f.getAs("content") != null
    )
    .toJavaRDD()
    .mapPartitions(partition -> {
        List<Row> rows = new ArrayList<>();
        while (partition.hasNext()) {
            Row row = partition.next();
            String content = row.getAs("content");
            String category = row.getAs("category");
            if (content != null) {
                List<String> words = Arrays.asList("a", "b");
                rows.add(RowFactory.create(category, words));
            }
        }
        return rows.iterator();
    });

List<StructField> structFields = new ArrayList<>();
structFields.add(DataTypes.createStructField("category", DataTypes.StringType, true));
structFields.add(DataTypes.createStructField("content", DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty()));
StructType structType = DataTypes.createStructType(structFields);

Dataset<Row> dataset = sparkSession.createDataFrame(rawText, structType);
dataset.show();
I'm stuck converting the string data to a dataset with an ArrayList of strings. Can you help me with this problem? My error is:
Two non-abstract methods "public scala.collection.Iterator scala.collection.IterableOnceOps.toIterator()" have the same parameter types, declaring type and return type

How can I convert a list of maps List<Map<String, Object>> myList to a Spark DataFrame in Java?

I have a list of maps like this:
List<Map<String, Object>> myList = new ArrayList<>();
Map<String, Object> mp1 = new HashMap<>();
mp1.put("id", 1);
mp1.put("name", "John");
Map<String, Object> mp2 = new HashMap<>();
mp2.put("id", 2);
mp2.put("name", "Carte");
The key-value pairs that we are putting in the map are not fixed; we can have any dynamic key-value pairs (a dynamic schema).
I want to convert it into a Spark DataFrame (Dataset<Row>):
+---+-----+
| id| name|
+---+-----+
|  1| John|
|  2|Carte|
+---+-----+
How can this be achieved?
Note: as I said, the key-value pairs are dynamic, so I cannot create a Java bean in advance and use the syntax below.
Dataset<Row> ds = spark.createDataFrame(myList, MyClass.class);
You can build the rows and the schema from the list of maps, then use createDataFrame(java.util.List<Row> rows, StructType schema) to build your DataFrame:
import java.util.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.catalyst.expressions.GenericRow;
import org.apache.spark.sql.types.*;
...
public static Dataset<Row> buildDataframe(List<Map<String, Object>> listOfMaps, SparkSession spark) {
    // extract the list of column names
    Set<String> columnSet = new HashSet<>();
    for (Map<String, Object> elem : listOfMaps) {
        columnSet.addAll(elem.keySet());
    }
    List<String> columns = new ArrayList<>(columnSet);

    // build rows
    List<Row> rows = new ArrayList<>();
    for (Map<String, Object> elem : listOfMaps) {
        List<Object> row = new ArrayList<>();
        for (String key : columns) {
            row.add(elem.get(key)); // null if the map has no value for this column
        }
        rows.add(new GenericRow(row.toArray()));
    }

    // build schema
    List<StructField> fields = new ArrayList<>();
    for (String column : columns) {
        fields.add(new StructField(column, getDataType(column, listOfMaps), true, Metadata.empty()));
    }
    StructType schema = new StructType(fields.toArray(new StructField[0]));

    // build the dataframe from rows and schema
    return spark.createDataFrame(rows, schema);
}

public static DataType getDataType(String column, List<Map<String, Object>> data) {
    // infer the column type from the first non-null value found
    for (Map<String, Object> elem : data) {
        if (elem.get(column) != null) {
            return getDataType(elem.get(column));
        }
    }
    return DataTypes.NullType;
}

public static DataType getDataType(Object value) {
    if (value.getClass() == Integer.class) {
        return DataTypes.IntegerType;
    } else if (value.getClass() == String.class) {
        return DataTypes.StringType;
        // TODO add all other spark types (Long, Timestamp, etc...)
    } else {
        throw new IllegalArgumentException("unknown type for value " + value);
    }
}
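For illustration, a possible usage with the maps from the question (note that the column order comes from a HashSet, so it is not guaranteed):
List<Map<String, Object>> myList = new ArrayList<>();
Map<String, Object> mp1 = new HashMap<>();
mp1.put("id", 1);
mp1.put("name", "John");
Map<String, Object> mp2 = new HashMap<>();
mp2.put("id", 2);
mp2.put("name", "Carte");
myList.add(mp1);
myList.add(mp2);

Dataset<Row> df = buildDataframe(myList, spark);
df.show();
// possible output (column order may differ):
// +---+-----+
// | id| name|
// +---+-----+
// |  1| John|
// |  2|Carte|
// +---+-----+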

Duplicate column when I create an IndexedRowMatrix using Spark

I need to calculate the pairwise similarity between several documents. For that, I proceed as follows:
JavaPairRDD<String,String> files = sc.wholeTextFiles(file_path);
System.out.println(files.count()+"**");
JavaRDD<Row> rowRDD = files.map((Tuple2<String, String> t) -> {
return RowFactory.create(t._1,t._2.replaceAll("[^\\w\\s]+","").replaceAll("\\d", ""));
});
StructType schema = new StructType(new StructField[]{
new StructField("id", DataTypes.StringType, false, Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> rows = spark.createDataFrame(rowRDD, schema);
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> tokenized_rows = tokenizer.transform(rows);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words");
Dataset<Row> filtred_rows = remover.transform(tokenized_rows);
CountVectorizerModel cvModel = new CountVectorizer().setInputCol("filtered_words").setOutputCol("rowF").setVocabSize(100000).fit(filtred_rows);
Dataset<Row> verct_rows = cvModel.transform(filtred_rows);
IDF idf = new IDF().setInputCol("rowF").setOutputCol("features");
IDFModel idfModel = idf.fit(verct_rows);
Dataset<Row> rescaledData = idfModel.transform(verct_rows);
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().map((Row r) -> {
    String s = r.getAs(0);
    int index = new Integer(s.replace(s.substring(0, 24), "").replace(s.substring(s.indexOf(".txt")), ""));
    SparseVector sparse = (SparseVector) r.getAs(5);
    org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
    return new IndexedRow(index, vec);
});
System.out.println(vrdd.count()+"---");
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());
System.out.println(mat.numCols()+"---"+mat.numRows());
Unfortunately, the results show that the IndexedRowMatrix is created with 4 rows (as though the first one were duplicated), even though my dataset contains only 3 documents.
3**
3--
1106---4
Can you help me to detect the cause of this duplication?
Most likely there is no duplication at all; your data simply doesn't follow the specification, which expects indices to be consecutive, zero-based integers. numRows is therefore max(row.index for row in rows) + 1:
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg.Vectors
new IndexedRowMatrix(sc.parallelize(Seq(
IndexedRow(5, Vectors.sparse(5, Array(), Array()))) // Only one non-empty row
)).numRows
// res4: Long = 6
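A rough Java equivalent of the same check (a sketch; jsc stands for an existing JavaSparkContext): a single row at index 5 already yields numRows() == 6.
import java.util.Collections;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;

JavaRDD<IndexedRow> rows = jsc.parallelize(Collections.singletonList(
        new IndexedRow(5, Vectors.sparse(5, new int[0], new double[0])))); // only one non-empty row
System.out.println(new IndexedRowMatrix(rows.rdd()).numRows()); // 6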

merge two dataset which are having different column names in Apache spark

We need to merge two datasets that have different column names; there are no common columns across the datasets.
We have tried a couple of approaches, but neither of them yields the expected result. Kindly let us know how to combine the two datasets using Apache Spark in Java.
Input data set 1
"405-048011-62815", "CRC Industries",
"630-0746","Dixon value",
"4444-444","3M INdustries",
"555-55","Dixon coupling valve"
Input dataset 2
"222-2222-5555", "Tata",
"7777-88886","WestSide",
"22222-22224","Reliance",
"33333-3333","V industries"
The expected output is:
+----------------+--------------+-------------+---------+
|label1          |sentence1     |label2       |sentence2|
+----------------+--------------+-------------+---------+
|405-048011-62815|CRC Industries|222-2222-5555|Tata     |
|630-0746        |Dixon value   |7777-88886   |WestSide |
+----------------+--------------+-------------+---------+
The code we tried is:
List<Row> data = Arrays.asList(
RowFactory.create("405-048011-62815", "CRC Industries"),
RowFactory.create("630-0746","Dixon value"),
RowFactory.create("4444-444","3M INdustries"),
RowFactory.create("555-55","Dixon coupling valve"));
StructType schema = new StructType(new StructField[] {new StructField("label1", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence1", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);
List<String> listStrings = new ArrayList<String>();
listStrings.add("405-048011-62815");
listStrings.add("630-0746");
Dataset<Row> matchFound1=sentenceDataFrame.filter(col("label1").isin(listStrings.stream().toArray(String[]::new)));
matchFound1.show();
listStrings.clear();
listStrings.add("222-2222-5555");
listStrings.add("7777-88886");
List<Row> data2 = Arrays.asList(
RowFactory.create("222-2222-5555", "Tata"),
RowFactory.create("7777-88886","WestSide"),
RowFactory.create("22222-22224","Reliance"),
RowFactory.create("33333-3333","V industries"));
StructType schema2 = new StructType(new StructField[] {new StructField("label2", DataTypes.StringType, false,Metadata.empty()),
new StructField("sentence2", DataTypes.StringType, false,Metadata.empty()) });
Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
Dataset<Row> matchFound2=sentenceDataFrame2.filter(col("label2").isin(listStrings.stream().toArray(String[]::new)));
matchFound2.show();
//Approach 1
Dataset<Row> matchFound3=matchFound1.select(matchFound1.col("label1"),matchFound1.col("sentence1"),matchFound2.col("label2"),
matchFound2.col("sentence2"));
System.out.println("After concat");
matchFound3.show();
//Approach 2
Dataset<Row> matchFound4=matchFound1.filter(concat((col("label1")),matchFound1.col("sentence1"),matchFound2.col("label2"),
matchFound2.col("sentence2")));
System.out.println("After concat 2");
matchFound4.show();
The errors for each approach are as follows.
Approach 1 error:
----------
org.apache.spark.sql.AnalysisException: resolved attribute(s) label2#10,sentence2#11 missing from label1#0,sentence1#1 in operator !Project [label1#0, sentence1#1, label2#10, sentence2#11];;
!Project [label1#0, sentence1#1, label2#10, sentence2#11]
+- Filter label1#0 IN (405-048011-62815,630-0746)
+- LocalRelation [label1#0, sentence1#1]
----------
Approach 2 error:
org.apache.spark.sql.AnalysisException: filter expression 'concat(`label1`, `sentence1`, `label2`, `sentence2`)' of type string is not a boolean.;;
!Filter concat(label1#0, sentence1#1, label2#10, sentence2#11)
+- Filter label1#0 IN (405-048011-62815,630-0746)
+- LocalRelation [label1#0, sentence1#1]
Hope this works for you.
DataFrames
val pre: Array[String] = Array("CRC Industries", "Dixon value" ,"3M INdustries" ,"Dixon coupling valve")
val rea: Array[String] = Array("405048011-62815", "630-0746", "4444-444", "555-55")
val df1 = sc.parallelize( rea zip pre).toDF("label1","sentence1")
val preasons2: Array[String] = Array("Tata", "WestSide","Reliance", "V industries")
val reasonsI2: Array[String] = Array( "222-2222-5555", "7777-88886", "22222-22224", "33333-3333")
val df2 = sc.parallelize( reasonsI2 zip preasons2 ).toDF("label2","sentence2")
String Indexer
import org.apache.spark.ml.feature.StringIndexer
val indexer = new StringIndexer()
.setInputCol("label1")
.setOutputCol("label1Index")
val indexed = indexer.fit(df1).transform(df1)
indexed.show()
val indexer1 = new StringIndexer()
.setInputCol("label2")
.setOutputCol("label2Index")
val indexed1 = indexer1.fit(df2).transform(df2)
indexed1.show()
Join
val rnd_reslt12 = indexed.join(indexed1 , indexed.col("label1Index")===indexed1.col("label2Index")).drop(indexed.col("label1Index")).drop(indexed1.col("label2Index"))
rnd_reslt12.show()
+---------------+--------------------+-------------+------------+
| label1| sentence1| label2| sentence2|
+---------------+--------------------+-------------+------------+
| 630-0746| Dixon value|222-2222-5555| Tata|
| 4444-444| 3M INdustries| 22222-22224| Reliance|
| 555-55|Dixon coupling valve| 33333-3333|V industries|
|405048011-62815| CRC Industries| 7777-88886| WestSide|
+---------------+--------------------+-------------+------------+
I have done the same with StringIndexer in Java; this will work.
public class StringIndexer11 {
    public static void main(String[] args) {
        Dataset<Row> csvDataSet = null;
        try {
            System.setProperty("hadoop.home.dir", "D:\\AI matching\\winutil");
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]"));
            SQLContext sqlContext = new SQLContext(sc);
            SparkSession spark = SparkSession.builder().appName("JavaTokenizerExample").getOrCreate();

            List<Row> data = Arrays.asList(
                    RowFactory.create("405-048011-62815", "CRC Industries"),
                    RowFactory.create("630-0746", "Dixon value"),
                    RowFactory.create("4444-444", "3M INdustries"),
                    RowFactory.create("555-55", "Dixon coupling valve"));
            StructType schema = new StructType(new StructField[] {
                    new StructField("label1", DataTypes.StringType, false, Metadata.empty()),
                    new StructField("sentence1", DataTypes.StringType, false, Metadata.empty()) });
            Dataset<Row> sentenceDataFrame = spark.createDataFrame(data, schema);

            List<String> listStrings = new ArrayList<String>();
            listStrings.add("405-048011-62815");
            listStrings.add("630-0746");
            Dataset<Row> matchFound1 = sentenceDataFrame.filter(col("label1").isin(listStrings.stream().toArray(String[]::new)));
            matchFound1.show();

            listStrings.clear();
            listStrings.add("222-2222-5555");
            listStrings.add("7777-88886");

            StringIndexer indexer = new StringIndexer()
                    .setInputCol("label1")
                    .setOutputCol("label1Index");
            Dataset<Row> Dataset1 = indexer.fit(matchFound1).transform(matchFound1);
            //Dataset1.show();

            List<Row> data2 = Arrays.asList(
                    RowFactory.create("222-2222-5555", "Tata"),
                    RowFactory.create("7777-88886", "WestSide"),
                    RowFactory.create("22222-22224", "Reliance"),
                    RowFactory.create("33333-3333", "V industries"));
            StructType schema2 = new StructType(new StructField[] {
                    new StructField("label2", DataTypes.StringType, false, Metadata.empty()),
                    new StructField("sentence2", DataTypes.StringType, false, Metadata.empty()) });
            Dataset<Row> sentenceDataFrame2 = spark.createDataFrame(data2, schema2);
            Dataset<Row> matchFound2 = sentenceDataFrame2.filter(col("label2").isin(listStrings.stream().toArray(String[]::new)));
            matchFound2.show();

            StringIndexer indexer1 = new StringIndexer()
                    .setInputCol("label2")
                    .setOutputCol("label2Index");
            Dataset<Row> Dataset2 = indexer1.fit(matchFound2).transform(matchFound2);
            //Dataset2.show();

            Dataset<Row> Finalresult = Dataset1.join(Dataset2, Dataset1.col("label1Index").equalTo(Dataset2.col("label2Index")))
                    .drop(Dataset1.col("label1Index")).drop(Dataset2.col("label2Index"));
            Finalresult.show();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
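A caveat on this approach: when every label occurs exactly once, StringIndexer's index assignment is effectively arbitrary, so the row pairing shown in the output above is not guaranteed. If the intent is simply to combine the two filtered datasets row by row in their original order, a positional index is more predictable. A minimal sketch, assuming the matchFound1, matchFound2 and spark variables from the code above (the withRowIndex helper is hypothetical; imports as in the snippets above, plus org.apache.spark.api.java.JavaRDD):
// Hypothetical helper: append a positional index column to a DataFrame via zipWithIndex.
static Dataset<Row> withRowIndex(Dataset<Row> df, SparkSession spark, String colName) {
    StructType schema = df.schema().add(colName, DataTypes.LongType, false);
    JavaRDD<Row> indexed = df.javaRDD().zipWithIndex().map(t -> {
        Object[] values = new Object[t._1().size() + 1];
        for (int i = 0; i < t._1().size(); i++) {
            values[i] = t._1().get(i);
        }
        values[t._1().size()] = t._2(); // the positional index
        return RowFactory.create(values);
    });
    return spark.createDataFrame(indexed, schema);
}

// Usage: join the two datasets on the synthetic index, then drop it.
Dataset<Row> combined = withRowIndex(matchFound1, spark, "idx")
        .join(withRowIndex(matchFound2, spark, "idx"), "idx")
        .drop("idx");
combined.show();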

Append a column to Data Frame in Apache Spark 1.4 in Java

I am trying to add a column to my DataFrame that serves as a unique ROW_ID for each row. So it would be something like this:
1, user1
2, user2
3, user3
...
I could have done this easily with a HashMap and an incrementing integer, but I can't do that in Spark using the map function on a DataFrame, since I can't have an increasing integer inside the map function. Is there any way to do this by appending a column to my existing DataFrame, or any other way?
PS: I know there is a very similar post, but that one is for Scala, not Java.
Thanks in advance.
I did it by adding a new column containing UUIDs to the DataFrame:
StructType objStructType = inputDataFrame.schema();
StructField[] arrStructField = objStructType.fields();
List<StructField> fields = new ArrayList<StructField>();
List<StructField> newfields = new ArrayList<StructField>();
List<StructField> listFields = Arrays.asList(arrStructField);

// append the new column (leftCol holds its name) to the existing schema
StructField a = DataTypes.createStructField(leftCol, DataTypes.StringType, true);
fields.add(a);
newfields.addAll(listFields);
newfields.addAll(fields);

final int size = objStructType.size();
JavaRDD<Row> rowRDD = inputDataFrame.javaRDD().map(new Function<Row, Row>() {
    private static final long serialVersionUID = 3280804931696581264L;

    public Row call(Row tblRow) throws Exception {
        Object[] newRow = new Object[size + 1];
        int rowSize = tblRow.length();
        for (int itr = 0; itr < rowSize; itr++) {
            if (tblRow.apply(itr) != null) {
                newRow[itr] = tblRow.apply(itr);
            }
        }
        newRow[size] = UUID.randomUUID().toString(); // the generated unique id
        return RowFactory.create(newRow);
    }
});
inputDataFrame = objsqlContext.createDataFrame(rowRDD, DataTypes.createStructType(newfields));
OK, I found the solution to this problem and I'm posting it in case someone has the same problem.
The way to do this is zipWithIndex from JavaRDD:
df.javaRDD().zipWithIndex().map(new Function<Tuple2<Row, Long>, Row>() {
    @Override
    public Row call(Tuple2<Row, Long> v1) throws Exception {
        return RowFactory.create(v1._1().getString(0), v1._2());
    }
})
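The map above produces a JavaRDD<Row>; to get a DataFrame back, a schema for the two fields is also needed. A minimal sketch of that last step (the column names and the spark variable are assumptions; on Spark 1.4 the result type is DataFrame and the factory is sqlContext.createDataFrame):
// Rebuild a DataFrame from the (name, index) rows produced above.
StructType idSchema = new StructType(new StructField[] {
        new StructField("name", DataTypes.StringType, false, Metadata.empty()),
        new StructField("ROW_ID", DataTypes.LongType, false, Metadata.empty()) });

JavaRDD<Row> withId = df.javaRDD().zipWithIndex().map(
        (Tuple2<Row, Long> v1) -> RowFactory.create(v1._1().getString(0), v1._2()));

Dataset<Row> dfWithId = spark.createDataFrame(withId, idSchema);
dfWithId.show();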
