I want to create a StructType dynamically out of a JSON file. I iterate over my fields and I want to find out how (if it is even possible) I can build a list during the iteration and then create a StructType from it.
The code I've tried:
List<StructField> structFieldList = new ArrayList<>();
for (String field : fields.values()) {
    StructField sf = DataTypes.createStructField(field, DataTypes.StringType, true);
    structFieldList.add(sf);
}
StructType structType = new StructType(structFieldList.toArray());
But this doesn't compile. Is there any way to make this work?
You don't need to convert the ArrayList to a Scala array here, since the StructType constructor takes a Java StructField[] array as its argument.
Your code only needs a change in the last line: pass a typed array to .toArray() so it returns a StructField[] array instead of an Object[] array:
List<StructField> structFieldList = new ArrayList<>();
for (String field : fields.values()) {
    StructField sf = DataTypes.createStructField(field, DataTypes.StringType, true);
    structFieldList.add(sf);
}
StructType structType = new StructType(structFieldList.toArray(new StructField[0]));
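Note that DataTypes also provides a factory method that accepts the list directly, so the array conversion can be skipped altogether. A minimal sketch, assuming fields is the same collection you iterate over in the question:
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build the StructFields as before, then let DataTypes build the StructType
// straight from the List<StructField> (no toArray() needed).
List<StructField> structFieldList = new ArrayList<>();
for (String field : fields.values()) {
    structFieldList.add(DataTypes.createStructField(field, DataTypes.StringType, true));
}
StructType structType = DataTypes.createStructType(structFieldList);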
So, I have a dataframe with this schema:
StructType(StructField(experimentid,StringType,true), StructField(descinten,ArrayType(MapType(StringType,DoubleType,true),true),true))
and the content is like this:
+----------------+-------------------------------------------------------------+
|experimentid |descinten |
+----------------+-------------------------------------------------------------+
|id1 |[[x1->120.51513], [x2->57.59762], [x3->83028.867]] |
|id2 |[[x2->478.5698], [x3->79.6873], [x1->341.89]] |
+----------------+-------------------------------------------------------------+
I want to sort "descinten" by key in ascending order and then take the sorted values. I tried mapping and sorting each row separately, but I kept getting errors like:
ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.Map
or similar. Is there a more straightforward way to do it in Java?
For anyone interested, I managed to solve it with map and a TreeMap for the sorting. My aim was to create vectors of the values in ascending order of their keys.
StructType schema = new StructType(new StructField[] {
    new StructField("expids", DataTypes.StringType, false, Metadata.empty()),
    new StructField("intensity", new VectorUDT(), false, Metadata.empty())
});

Dataset<Row> newdf = olddf.map((Row originalrow) -> {
    String firstpos = originalrow.get(0).toString();
    List<scala.collection.Map<String, Double>> mplist = originalrow.getList(1);
    int s = mplist.size();
    // TreeMap keeps its entries sorted by key in ascending order
    Map<String, Double> treemp = new TreeMap<>();
    for (int k = 0; k < s; k++) {
        Object[] keys = JavaConversions.mapAsJavaMap(mplist.get(k)).keySet().toArray();
        Object[] values = JavaConversions.mapAsJavaMap(mplist.get(k)).values().toArray();
        treemp.put(keys[0].toString(), Double.parseDouble(values[0].toString()));
    }
    Object[] sortedValues = treemp.values().toArray();
    double[] tmplist = new double[s];
    for (int i = 0; i < s; i++) {
        tmplist[i] = Double.parseDouble(sortedValues[i].toString());
    }
    return RowFactory.create(firstpos, Vectors.dense(tmplist));
}, RowEncoder.apply(schema));
Using Spark with Java, I have created a DataFrame from a comma-delimited source file. If the last column in the source file contains a blank value, it throws an ArrayIndexOutOfBoundsException. Below are sample data and code. Is there any way I can handle this error? There is a good chance of getting blank values in the last column. In the sample data below, the 4th row causes the issue.
Sample Data:
1,viv,chn,34
2,man,gnt,56
3,anu,pun,22
4,raj,bang,*
Code:
JavaRDD<String> dataQualityRDD = spark.sparkContext().textFile(inputFile, 1).toJavaRDD();
String schemaString = schemaColumns;
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
    StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
    fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);
JavaRDD<Row> rowRDD = dataQualityRDD.map((Function<String, Row>) record -> {
    // String[] attributes = record.split(attributes[0], attributes[1].trim());
    Object[] items = record.split(fileSplit);
    // return RowFactory.create(attributes[0], attributes[1].trim());
    return RowFactory.create(items);
});
I used Spark 2.0 and was able to read the CSV without any exception:
SparkSession spark = SparkSession.builder().config("spark.master", "local").getOrCreate();
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaRDD<Row> csvRows = spark.read().csv("resources/csvwithnulls.csv").toJavaRDD();
StructType schema = DataTypes.createStructType(
        new StructField[] { new StructField("id", DataTypes.StringType, false, Metadata.empty()),
                new StructField("fname", DataTypes.StringType, false, Metadata.empty()),
                new StructField("lname", DataTypes.StringType, false, Metadata.empty()),
                new StructField("age", DataTypes.StringType, false, Metadata.empty()) });
Dataset<Row> newCsvRows = spark.createDataFrame(csvRows, schema);
newCsvRows.show();
I used exactly the rows you have and it worked fine.
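Alternatively, if you want to keep the RDD-based approach from the question, the root cause is that Java's String.split drops trailing empty strings by default; passing a negative limit keeps them, so every row gets one entry per column. A minimal sketch, reusing dataQualityRDD and fileSplit from the question:
JavaRDD<Row> rowRDD = dataQualityRDD.map((Function<String, Row>) record -> {
    // limit -1 keeps trailing empty strings:
    // "4,raj,bang,".split(",", -1) has length 4 instead of 3
    Object[] items = record.split(fileSplit, -1);
    return RowFactory.create(items);
});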
I'm trying to use the DataFrame map function on an arbitrary dataset. However, I don't understand how you would map from Row -> Row. No examples are given for arbitrary data in the Spark SQL documentation:
Dataset<Row> original_data = ...
Dataset<Row> changed_data = original_data.map(new MapFunction<Row, Row>() {
    @Override
    public Row call(Row row) throws Exception {
        Row newRow = RowFactory.create(obj1, obj2);
        return newRow;
    }
}, Encoders.bean(Row.class));
However, this does not work, since there needs to be some sort of Encoder.
How can I map to a generic Row?
If obj1 and obj2 are not primitive types, describe their schema with a StructType to create a Row encoder. I would suggest that instead of using the Row type, you create a custom bean which stores both obj1 and obj2, and then use that custom bean's encoder in the map transformation.
Row type:
StructType customStructType = new StructType();
customStructType = customStructType.add("obj1", DataTypes.<type>, false);
customStructType = customStructType.add("obj2", DataTypes.<type>, false);

Dataset<Row> changed_data = original_data.map(row -> {
    return RowFactory.create(obj1, obj2);
}, RowEncoder.apply(customStructType));
Custom Bean type:
class CustomBean implements .... {
    Object obj1;
    Object obj2;
    ....
}

Dataset<CustomBean> changed_data = original_data.map(row -> {
    return new CustomBean(obj1, obj2);
}, Encoders.bean(CustomBean.class));
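For illustration, here is how the Row-based route might look with concrete types. The column names and types are made up for the example, and the output values are simply copied from the input row rather than from obj1/obj2:
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.catalyst.encoders.RowEncoder;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical output schema: one string column and one integer column.
StructType customStructType = new StructType()
        .add("obj1", DataTypes.StringType, false)
        .add("obj2", DataTypes.IntegerType, false);

// The cast to MapFunction picks the Java overload of map; the encoder tells
// Spark how to serialize the generic Rows produced by the lambda.
Dataset<Row> changed_data = original_data.map(
        (MapFunction<Row, Row>) row -> RowFactory.create(row.getString(0), row.getInt(1)),
        RowEncoder.apply(customStructType));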
I need to calculate the pairwise similarity between several documents. For that, I proceed as follows:
JavaPairRDD<String,String> files = sc.wholeTextFiles(file_path);
System.out.println(files.count()+"**");
JavaRDD<Row> rowRDD = files.map((Tuple2<String, String> t) -> {
    return RowFactory.create(t._1, t._2.replaceAll("[^\\w\\s]+", "").replaceAll("\\d", ""));
});
StructType schema = new StructType(new StructField[]{
    new StructField("id", DataTypes.StringType, false, Metadata.empty()),
    new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> rows = spark.createDataFrame(rowRDD, schema);
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> tokenized_rows = tokenizer.transform(rows);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words");
Dataset<Row> filtred_rows = remover.transform(tokenized_rows);
CountVectorizerModel cvModel = new CountVectorizer().setInputCol("filtered_words").setOutputCol("rowF").setVocabSize(100000).fit(filtred_rows);
Dataset<Row> verct_rows = cvModel.transform(filtred_rows);
IDF idf = new IDF().setInputCol("rowF").setOutputCol("features");
IDFModel idfModel = idf.fit(verct_rows);
Dataset<Row> rescaledData = idfModel.transform(verct_rows);
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().map((Row r) -> {
    String s = r.getAs(0);
    int index = new Integer(s.replace(s.substring(0, 24), "").replace(s.substring(s.indexOf(".txt")), ""));
    SparseVector sparse = (SparseVector) r.getAs(5);
    org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
    return new IndexedRow(index, vec);
});
System.out.println(vrdd.count()+"---");
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());
System.out.println(mat.numCols()+"---"+mat.numRows());
Unfortunately, the results show that the IndexedRowMatrix is created with 4 rows (the first one appears to be duplicated) even though my dataset contains 3 documents.
3**
3--
1106---4
Can you help me to detect the cause of this duplication?
Most likely there is no duplication at all; your data simply doesn't follow the specification, which requires indices to be consecutive, zero-based integers. Therefore numRows is max(row.index for row in rows) + 1:
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg.Vectors
new IndexedRowMatrix(sc.parallelize(Seq(
IndexedRow(5, Vectors.sparse(5, Array(), Array()))) // Only one non-empty row
)).numRows
// res4: Long = 6
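If you want each document to occupy one row regardless of what index the file name yields, one option (a sketch, not part of the original answer, assuming Spark 2.x where the features column holds an org.apache.spark.ml.linalg.SparseVector as in your cast) is to let zipWithIndex assign consecutive, zero-based indices:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.linalg.SparseVector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.sql.Row;

// rescaledData is the Dataset<Row> from the question; zipWithIndex assigns
// indices 0..n-1, so numRows ends up equal to the number of documents.
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD()
        .zipWithIndex()
        .map(tuple -> {
            Row r = tuple._1();
            long index = tuple._2();
            SparseVector sparse = (SparseVector) r.getAs("features");
            return new IndexedRow(index, Vectors.dense(sparse.toDense().toArray()));
        });

IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());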
I have tokenized the sentences into a word RDD, so now I need bigrams.
e.g. "This is my test" => (This is), (is my), (my test)
I have searched around and found the .sliding operator for this purpose, but I'm not getting this option in my Eclipse (maybe it is available in a newer version of Spark).
So how can I make this happen without .sliding?
Adding code to get started:
public static void biGram(JavaRDD<String> in) {
    JavaRDD<String> sentence = in.map(s -> s.toLowerCase());
    //get bigram from sentence w/o sliding - CODE HERE
}
You can simply use the n-gram transformation feature in Spark:
public static void biGram(JavaRDD<String> in) {
    //Converting each string into a Row
    JavaRDD<Row> sentence = in.map(s -> RowFactory.create(s.toLowerCase()));
    StructType schema = new StructType(new StructField[] {
        new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
    });
    //Creating the dataframe
    DataFrame dataFrame = sqlContext.createDataFrame(sentence, schema);
    //Tokenizing the sentence into words
    RegexTokenizer rt = new RegexTokenizer().setInputCol("sentence").setOutputCol("split")
            .setMinTokenLength(4)
            .setPattern("\\s+");
    DataFrame rtDF = rt.transform(dataFrame);
    //Creating bigrams
    NGram bigram = new NGram().setInputCol(rt.getOutputCol()).setOutputCol("bigram").setN(2); //Here setN(2) means bigram
    DataFrame bigramDF = bigram.transform(rtDF);
    System.out.println("Result :: " + bigramDF.select("bigram").collectAsList());
}
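Note that setMinTokenLength(4) filters out tokens shorter than four characters, so in the example sentence "This is my test" the words "is" and "my" would be dropped before the NGram stage. If you want bigrams over every word, leave the minimum token length at its default of 1:
RegexTokenizer rt = new RegexTokenizer()
        .setInputCol("sentence")
        .setOutputCol("split")
        .setPattern("\\s+");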
sliding is indeed the way to go with n-grams. The thing is, sliding works on iterators and collections, so just split your sentence and slide over the resulting array. I am adding Scala code:
val sentences:RDD[String] = in.map(s => s.toLowerCase())
val biGrams:RDD[Iterator[Array[String]]] = sentences.map(s => s.split(" ").sliding(2))
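Since the question asks for plain Java without .sliding, here is a minimal sketch (assuming Spark 2.x, where flatMap expects an Iterator) that builds the bigrams with flatMap and a simple loop:
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

public static JavaRDD<String> biGram(JavaRDD<String> in) {
    return in.flatMap(sentence -> {
        String[] words = sentence.toLowerCase().split("\\s+");
        List<String> bigrams = new ArrayList<>();
        // pair each word with its successor: (w0 w1), (w1 w2), ...
        for (int i = 0; i < words.length - 1; i++) {
            bigrams.add(words[i] + " " + words[i + 1]);
        }
        return bigrams.iterator();
    });
}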