Replace the WindowSpec with a groupBy in Spark Java - java

I would like to replace the WindowSpec below with a groupBy while still retrieving all the columns of the initial dataset:
resultDataset = targetDataset.unionByName(sourceDs);
Column[] keys = getKey(listOfColumnsAsKey);
WindowSpec windowSpec = Window.partitionBy(keys).orderBy(functions.col(ORDER).asc());
resultDataset = resultDataset.withColumn(RANK_COLUMN, functions.dense_rank().over(windowSpec))
.filter(functions.col(RANK_COLUMN).equalTo(1))
.drop(functions.col(RANK_COLUMN));
I tried the code below, but it does not give the expected result:
Seq<String> joinedColumns = getJoinedColumns(listOfColumnsAsKey /* plus the ORDER column */);
Dataset<Row> groupedDataset = resultDataset.groupBy(keys).agg(functions.min(ORDER).as(ORDER));
resultDataset = resultDataset.join(groupedDataset, joinedColumns);
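For reference, here is a minimal sketch (not from the post, and assuming ORDER has no null values) of one way to express the same logic with groupBy: compute the per-key minimum of ORDER, then join it back on the key columns plus ORDER, which keeps exactly the rows that dense_rank() == 1 keeps, ties included.
// Sketch only: rows whose ORDER equals the per-key minimum are the "rank 1" rows
Dataset<Row> minOrderPerKey = resultDataset
        .groupBy(keys)
        .agg(functions.min(functions.col(ORDER)).as(ORDER));
// Join back with a using-column join (key columns + ORDER) so no columns are duplicated
Seq<String> usingColumns = getJoinedColumns(listOfColumnsAsKey, ORDER); // hypothetical signature of the post's helper
resultDataset = resultDataset.join(minOrderPerKey, usingColumns);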

Related

How to add a new column with values from a list in Apache Spark Java?

I've run into an issue with my code. I'm attempting to read a CSV file into a DataFrame and then add a new column with values from an ArrayList.
However, I cannot seem to use either the ArrayList or an array without getting an error. It wants me to enter the values for the new column manually. How can I get around this, please?
Exception in thread "main" org.apache.spark.SparkRuntimeException: The feature is not supported: literal for '[[153.41, [153.41, ...
(the list keeps going) ... of class java.util.ArrayList
at org.apache.spark.sql.errors.QueryExecutionErrors$.literalTypeUnsupportedError
I've put the line in bold and added a comment on the same line
public static void dataframe(){
// TODO Auto-generated method stub
SparkSession spark = SparkSession.builder().appName("RDD or DataFrame").getOrCreate();
String path = "C:\\Users\\Paolo Agyei\\Desktop\\Computer Science\\Java\\SparkSimpleApp\\data.csv";
Dataset<Row> csvDataset = spark.read().format("csv").option("header", "true")
.load(path);
// Filtering columns by value
Dataset<Row> result = csvDataset.filter( col("status").equalTo("authorized"));
result = result.filter( col("card_present_flag").equalTo("0"));
// Collecting columns to be split
List<Row> long_lat = csvDataset.select("long_lat").collectAsList();
List<Row> merchant_long_lat = csvDataset.select("merchant_long_lat").collectAsList();
// Lists to hold result of long_lat
**ArrayList<String> longing = new ArrayList<String>();**
ArrayList<String> lat = new ArrayList<String>();
// Lists to hold result of merchant_long_lat
ArrayList<String> merch_long = new ArrayList<String>();
ArrayList<String> merch_lat = new ArrayList<String>();
for (Row row: long_lat) {
convert = row.toString().split(" -",2);
**longing.add(convert[0]);**
lat.add(convert[1]);
}
for (Row row: merchant_long_lat) {
convert = row.toString().split("-",2);
merch_long.add(convert[0]);
if(convert.length>1)
merch_lat.add(convert[1]);
else
merch_lat.add("null");
}
// Adding new columns
**result = result.withColumn("long",lit(longing));** // Issue
/*
result = result.withColumn("lat", null);
result = result.withColumn("merch_long", null);
result = result.withColumn("merch_lat", null);
result = result.drop("long_lat","merchant_long_lat");
result.show();
*/
System.out.println("Hello World!");
}
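One way to avoid the unsupported ArrayList literal altogether (a sketch, assuming long_lat and merchant_long_lat are plain string columns in the format the splits above expect, and assuming a static import of org.apache.spark.sql.functions.split next to col and lit) is to derive the new columns directly on the DataFrame instead of collecting to Java lists:
// Sketch: split the existing string columns into new columns, so no lit(ArrayList) is needed
Dataset<Row> withCoords = result
        .withColumn("long", split(col("long_lat"), " -").getItem(0))
        .withColumn("lat", split(col("long_lat"), " -").getItem(1))
        .withColumn("merch_long", split(col("merchant_long_lat"), "-").getItem(0))
        .withColumn("merch_lat", split(col("merchant_long_lat"), "-").getItem(1))
        .drop("long_lat", "merchant_long_lat");
withCoords.show();
This also sidesteps the row-count mismatch between the filtered result and the lists collected from the unfiltered csvDataset.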

Spark Java: How to sort ArrayType(MapType) by key and access the values of the sorted array

So, I have a dataframe with this schema:
StructType(StructField(experimentid,StringType,true), StructField(descinten,ArrayType(MapType(StringType,DoubleType,true),true),true))
and the content is like this:
+----------------+-------------------------------------------------------------+
|experimentid |descinten |
+----------------+-------------------------------------------------------------+
|id1 |[[x1->120.51513], [x2->57.59762], [x3->83028.867]] |
|id2 |[[x2->478.5698], [x3->79.6873], [x1->341.89]] |
+----------------+-------------------------------------------------------------+
I want to sort "descinten" by key in ascending order and then take the sorted values. I tried mapping and sorting each row separately, but I was getting errors like:
ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to scala.collection.Map
or similar. Is there a more straightforward way to do it in Java?
For anyone interested, I managed to solve it with map and a TreeMap for sorting. My aim was to create vectors of the values in ascending order of their keys.
StructType schema = new StructType(new StructField[] {
new StructField("expids",DataTypes.StringType, false,Metadata.empty()),
new StructField("intensity",new VectorUDT(),false,Metadata.empty())
});
Dataset<Row> newdf=olddf.map((Row originalrow) -> {
String firstpos = new String();
firstpos=originalrow.get(0).toString();
List<scala.collection.Map<String,Double>>mplist=originalrow.getList(1);
int s = mplist.size();
Map<String,Double>treemp=new TreeMap<>();
for(int k=0;k<s;k++){
// put key -> value (keys are the "x1", "x2", ... strings; values are the doubles)
Object[] kvkeys = JavaConversions.mapAsJavaMap(mplist.get(k)).keySet().toArray();
Object[] kvvalues = JavaConversions.mapAsJavaMap(mplist.get(k)).values().toArray();
treemp.put(kvkeys[0].toString(), Double.parseDouble(kvvalues[0].toString()));
}
Object[] yo1 = treemp.values().toArray();
double[] tmplist= new double[s];
for(int i=0;i<s;i++){
tmplist[i]=Double.parseDouble(yo1[i].toString());
}
Row newrow = RowFactory.create(firstpos,Vectors.dense(tmplist));
return newrow;
}, RowEncoder.apply(schema));
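As a side note, the sorting loop inside the map function could be written a bit more compactly (a sketch under the same single-entry-map assumption):
// Sketch: TreeMap keeps keys in ascending order, so values() already comes out sorted by key
TreeMap<String, Double> sorted = new TreeMap<>();
for (scala.collection.Map<String, Double> m : mplist) {
    JavaConversions.mapAsJavaMap(m).forEach(sorted::put);
}
double[] tmplist = sorted.values().stream().mapToDouble(Double::doubleValue).toArray();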

Iterating over multiple CSVs and joining with Spark SQL

I have several CSV files with the same headers and the same IDs. I am trying to iterate over them and merge all of the files, up to the one indexed '31'. In my while loop, I'm trying to initialise the merged dataset so it can be used for the remainder of the loop. On the last line, I'm told that the 'local variable merged may not have been initialised'. How should I be doing this instead?
SparkSession spark = SparkSession.builder().appName("testSql")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///c:tmp")
.getOrCreate();
Dataset<Row> first = spark.read().option("header", true).csv("mypath/01.csv");
Dataset<Row> second = spark.read().option("header", true).csv("mypath/02.csv");
IntStream.range(3, 31)
.forEach(i -> {
while(i==3) {
Dataset<Row> merged = first.join(second, first.col("customer_id").equalTo(second.col("customer_id")));
}
Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
Dataset<Row> merged = merged.join(next, merged.col("customer_id").equalTo(next.col("customer_id")));
EDITED based on feedback in the comments.
Following your pattern, something like this would work:
Dataset<Row> ds1 = spark.read().option("header", true).csv("mypath/01.csv");
Dataset<?>[] result = {ds1};
IntStream.range(2, 31)
.forEach(i -> {
Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv");
result[0] = result[0].join(next, "customer_id");
});
We're wrapping the Dataset in a one-element array to work around the restriction that variables captured by a lambda must be effectively final.
The more straightforward way, for this particular case, is to simply use a for-loop rather than stream.forEach:
Dataset<Row> result = spark.read().option("header", true).csv("mypath/01.csv");
for (int i = 2; i < 31; i++) {
Dataset<Row> next = spark.read().option("header", true).csv("mypath/"+i+".csv"); // note: adjust the name if the files are zero-padded like "01.csv"
result = result.join(next, "customer_id"); // plain local variable; no array wrapper needed outside a lambda
}

Duplicate column when I create an IndexedRowMatrix using Spark

I need to calculate the pairwise similarity between several documents. For that, I proceed as follows:
JavaPairRDD<String,String> files = sc.wholeTextFiles(file_path);
System.out.println(files.count()+"**");
JavaRDD<Row> rowRDD = files.map((Tuple2<String, String> t) -> {
return RowFactory.create(t._1,t._2.replaceAll("[^\\w\\s]+","").replaceAll("\\d", ""));
});
StructType schema = new StructType(new StructField[]{
new StructField("id", DataTypes.StringType, false, Metadata.empty()),
new StructField("sentence", DataTypes.StringType, false, Metadata.empty())
});
Dataset<Row> rows = spark.createDataFrame(rowRDD, schema);
Tokenizer tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words");
Dataset<Row> tokenized_rows = tokenizer.transform(rows);
StopWordsRemover remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered_words");
Dataset<Row> filtred_rows = remover.transform(tokenized_rows);
CountVectorizerModel cvModel = new CountVectorizer().setInputCol("filtered_words").setOutputCol("rowF").setVocabSize(100000).fit(filtred_rows);
Dataset<Row> verct_rows = cvModel.transform(filtred_rows);
IDF idf = new IDF().setInputCol("rowF").setOutputCol("features");
IDFModel idfModel = idf.fit(verct_rows);
Dataset<Row> rescaledData = idfModel.transform(verct_rows);
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD().map((Row r) -> {
//DenseVector dense;
String s = r.getAs(0);
int index = new Integer(s.replace(s.substring(0,24),"").replace(s.substring(s.indexOf(".txt")),""));
SparseVector sparse = (SparseVector) r.getAs(5);
//dense = sparse.toDense();
org.apache.spark.mllib.linalg.Vector vec = org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
return new IndexedRow(index, vec);
});
System.out.println(vrdd.count()+"---");
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());
System.out.println(mat.numCols()+"---"+mat.numRows());
Unfortunately, the results show that the IndexedRowMatrix is created with 4 columns (the first one is duplicated) even though my dataset contains 3 documents.
3**
3---
1106---4
Can you help me to detect the cause of this duplication?
Most likely there is no duplication at all and your data simply doesn't follow the specification, which requires indices to be consecutive, zero-based integers. numRows is therefore max(row.index for row in rows) + 1:
import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg.Vectors
new IndexedRowMatrix(sc.parallelize(Seq(
IndexedRow(5, Vectors.sparse(5, Array(), Array()))) // Only one non-empty row
)).numRows
// res4: Long = 6
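For the question's pipeline, one way to guarantee consecutive, zero-based indices is to let zipWithIndex assign them instead of parsing them out of the file name. A sketch (not part of the answer; the "features" column name and the vector handling are taken from the question's code):
// Sketch: zipWithIndex yields 0, 1, 2, ... so numRows() matches the number of documents
JavaRDD<IndexedRow> vrdd = rescaledData.toJavaRDD()
        .zipWithIndex()
        .map(t -> {
            SparseVector sparse = (SparseVector) t._1.getAs("features"); // same cast as in the question
            org.apache.spark.mllib.linalg.Vector vec =
                    org.apache.spark.mllib.linalg.Vectors.dense(sparse.toDense().toArray());
            return new IndexedRow(t._2, vec);
        });
IndexedRowMatrix mat = new IndexedRowMatrix(vrdd.rdd());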

How can I group by a column in MongoDB using Java? [duplicate]

I'd like to learn how to implement a group-by query using the MongoDB Java 3.x driver. I want to group my collection by username and sort the results by the count of results, descending.
Here is the shell query for which I want to implement the Java equivalent:
db.stream.aggregate({ $group: {_id: '$username', tweetCount: {$sum: 1} } }, { $sort: {tweetCount: -1} } );
Here is the Java code that I have implemented:
BasicDBObject groupFields = new BasicDBObject("_id", "username");
// count the results and store into countOfResults
groupFields.put("countOfResults", new BasicDBObject("$sum", 1));
BasicDBObject group = new BasicDBObject("$group", groupFields);
// sort the results by countOfResults DESC
BasicDBObject sortFields = new BasicDBObject("countOfResults", -1);
BasicDBObject sort = new BasicDBObject("$sort", sortFields);
List<BasicDBObject> pipeline = new ArrayList<BasicDBObject>();
pipeline.add(group);
pipeline.add(sort);
AggregateIterable<Document> output = collection.aggregate(pipeline);
The result I need is the count of documents grouped by username, but countOfResults returns the total number of documents in the collection.
You should avoid the old BasicDBObject types with the Mongo 3.x driver. You can try something like this:
import static com.mongodb.client.model.Accumulators.*;
import static com.mongodb.client.model.Aggregates.*;
import static java.util.Arrays.asList;
Bson group = group("$username", sum("tweetCount", 1));
Bson sort = sort(new Document("tweetCount", -1));
AggregateIterable<Document> output = collection.aggregate(asList(group, sort));
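For completeness, the original BasicDBObject pipeline would most likely also group as intended once the group key references the field with a '$' prefix, as the shell query does; without it, "_id" is the constant string "username", so every document lands in a single group and countOfResults becomes the collection size (a hedged note, not part of the answer above):
// Assumed fix for the original pipeline: reference the field as "$username"
BasicDBObject groupFields = new BasicDBObject("_id", "$username");
groupFields.put("countOfResults", new BasicDBObject("$sum", 1));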
