explode a Spark array column to multiple columns - Spark SQL / Java

I have a column whose type Value is defined like below
val Value: ArrayType = ArrayType(
  new StructType()
    .add("unit", StringType)
    .add("value", StringType)
)
and data like this
[[unit1, 25], [unit2, 77]]
[[unit2, 100], [unit1, 40]]
[[unit2, 88]]
[[unit1, 33]]
I know Spark SQL can use functions.explode to turn the data into multiple rows, but what I want is to explode it into multiple columns (or keep the one column, but with 2 items even for the rows that have only 1 item).
So the end result looks like below:
unit1 unit2
25 77
40 100
value1 88
33 value2
How could I achieve this?
Addition after initial post and update:
I want to get a result like this (this is more like my final goal).
transformed-column
[[unit1, 25], [unit2, 77]]
[[unit2, 104], [unit1, 40]]
[[unit1, value1], [unit2, 88]]
[[unit1, 33],[unit2,value2]]
where value1 is the result of applying some kind of map/conversion function using the [unit2, 88]
Similarly, value2 is the result of applying the same map/conversion function using the [unit1, 33]

I solved this problem using map_from_entries as suggested by @jxc, and then used a UDF to convert the map of 1 item into a map of 2 items, applying business logic to convert between the 2 units.
One thing to note: the map returned from map_from_entries is a Scala map, so if you use Java you need to make sure the UDF method takes a Scala map.
P.S. Maybe I did not have to use map_from_entries; instead I could perhaps have made the UDF take an array of the StructType directly.
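For reference, here is a minimal Java sketch of that approach (Spark 2.4+ for map_from_entries). The UDF name, the column names "Value"/"transformed", and the conversion helpers convertToUnit1/convertToUnit2 are placeholders for the business logic, not part of the original post.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import scala.collection.JavaConverters;

// The UDF receives a scala.collection.Map because map_from_entries yields a MapType column.
spark.udf().register("fillMissingUnit",
    (UDF1<scala.collection.Map<String, String>, java.util.Map<String, String>>) scalaMap -> {
        // copy into a mutable Java map
        java.util.Map<String, String> m =
            new java.util.HashMap<>(JavaConverters.mapAsJavaMapConverter(scalaMap).asJava());
        if (!m.containsKey("unit1")) {
            m.put("unit1", convertToUnit1(m.get("unit2")));  // hypothetical business conversion
        }
        if (!m.containsKey("unit2")) {
            m.put("unit2", convertToUnit2(m.get("unit1")));  // hypothetical business conversion
        }
        return m;  // Spark converts the java.util.Map back into a MapType column
    },
    DataTypes.createMapType(DataTypes.StringType, DataTypes.StringType));

Dataset<Row> transformed = df.withColumn(
    "transformed",
    callUDF("fillMissingUnit", map_from_entries(col("Value"))));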

Related

How to access the entries in every row and apply custom functions?

My input was a Kafka stream with only one value, which is comma-separated. It looks like this.
"id,country,timestamp"
I have already split the dataset so that I have something like the following structured stream:
Dataset<Row> words = df
    .selectExpr("CAST(value AS STRING)")
    .as(Encoders.STRING())
    .withColumn("id", split(col("value"), ",").getItem(0))
    .withColumn("country", split(col("value"), ",").getItem(1))
    .withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with 3 columns, and I want to use the entries in each row in a custom function, e.g.
Dataset<String> names = words.map(row -> {
    // do something with every entry of each row, e.g.
    Person person = new Person(id, country, timestamp);
    String name = person.getName();
    return name;
});
In the end I want to sink out a comma-separated String again.
A DataFrame has a schema, so you can't just call a map function on it without defining a new schema.
You can either convert to an RDD and use a map, or use a Dataset map with an encoder.
Another option is to use Spark SQL with user-defined functions; you can read about them.
If your use case is really as simple as you are showing, doing something like:
val nameRdd = words.rdd.map(x => f(x))
seems like all you need.
If you still want a DataFrame, you can use something like:
val schema = StructType(Seq(StructField("name", StringType)))
val rddToDf = nameRdd.map(name => Row(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S. DataFrame === Dataset[Row]
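Since the question itself is in Java, here is a hedged sketch of the "Dataset map with an encoder" route described above; Person is the asker's own class, and the Kafka sink options (servers, topic, checkpoint path) are placeholders:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;

Dataset<String> names = words.map(
    (MapFunction<Row, String>) row -> {
        // build the Person from the three split columns and return its name
        Person person = new Person(row.getAs("id"), row.getAs("country"), row.getAs("timestamp"));
        return person.getName();
    },
    Encoders.STRING());

// write the resulting strings back out; the Kafka sink expects a column named "value"
StreamingQuery query = names.toDF("value")
    .writeStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  // placeholder
    .option("topic", "output-topic")                       // placeholder
    .option("checkpointLocation", "/tmp/checkpoint")       // placeholder
    .start();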
If you have a custom function that is not available by composing functions in the existing Spark API[1], then you can either drop down to the RDD level (as @Ilya suggested), or use a UDF[2].
Typically I'll try to use the Spark API functions on a DataFrame whenever possible, as they will generally be the best optimized.
If that's not possible, I will construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF; you can pass them comma-separated, e.g. squared(col("col_a"), col("col_b")).
Since you are writing your UDF in Scala it should be pretty efficient, but keep in mind that if you use Python there will in general be extra latency due to data movement between the JVM and Python.
[1]https://spark.apache.org/docs/latest/api/scala/index.html#package
[2]https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
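Passing multiple columns works the same way in the Java API; a sketch, where the UDF name "toName" is a placeholder and Person is again the asker's own class:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF3;
import org.apache.spark.sql.types.DataTypes;

// register a three-argument UDF and apply it to the id, country and timestamp columns
spark.udf().register("toName",
    (UDF3<String, String, String, String>) (id, country, timestamp) ->
        new Person(id, country, timestamp).getName(),
    DataTypes.StringType);

Dataset<Row> named = words.select(
    callUDF("toName", col("id"), col("country"), col("timestamp")).as("name"));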

JavaPairRDD to Dataset<Row> in Spark

I have data in a JavaPairRDD in the format
JavaPairRDD<String, Tuple2<String, String>>
I tried using the below code:
Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");
But how can I generate a Dataset with 3 columns? The output of the above code gives me data in 2 columns. Any pointers/suggestions?
Try running printSchema - you will see that value2 is a complex type.
Having such information, you can write:
Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3")
value2._1 means the first element of the tuple inside the current "value2" field. We overwrite the value2 field so that it holds one value only.
Note that this will work after https://issues.apache.org/jira/browse/SPARK-24548 is merged to the master branch. Currently there is a bug in Spark where the tuple is converted to a struct with two fields both named "value".
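As an alternative sketch (not part of the answer above): mapping each pair to a Row and supplying an explicit three-column schema sidesteps the tuple encoder entirely. MY_RDD and the column names are taken from the question.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

StructType schema = new StructType()
    .add("value1", DataTypes.StringType)
    .add("value2", DataTypes.StringType)
    .add("value3", DataTypes.StringType);

// flatten the nested tuple into one Row per record
JavaRDD<Row> rows = MY_RDD.map(t -> RowFactory.create(t._1(), t._2()._1(), t._2()._2()));
Dataset<Row> userViolationsDetails = spark.createDataFrame(rows, schema);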

CouchDB: How to make queries with multiple complex keys

I am trying to make a CouchDB view to obtain the documents that are in set 1 and in set 2. For example, when I have a single key I can make a query like:
dbname/_design_doc/viewName?keys=[value1, value2, value3]
and it returns all the documents where it finds either value1, value2 or value3. What I want is something like this, but for a complex key.
For example,
dbname/_design_doc/viewName?keys=[[key11, key12, key13],[key21, key22]]
where key1x is a value for the first key and key2x is a value for the second key, meaning I would like to get every document that has key11 and key21, key11 and key22, key12 and key21, key12 and key22, and so on.
My view is this one:
"twokeys": {
"map": "function(doc) {\n if (doc.uid && doc.hid){\n
emit([doc.uid, doc.hid], doc);\n }\n}"
}
Is this possible?
Thanks in advance
You can query with the keys parameter using complex keys if you follow this answer.
Unfortunately, you can't combine the keys parameter with startkey or endkey.
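What you can do is enumerate the combinations yourself and pass each one as a complete complex key; a sketch against the view above:
dbname/_design_doc/twokeys?keys=[["key11","key21"],["key11","key22"],["key12","key21"],["key12","key22"]]
If the list of key combinations gets long, the same keys array can be sent as a JSON body in a POST request to the view instead of in the URL.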

ELKI DBSCAN : How to set dbc.parser?

I am doing DBSCAN clustering and I have one more column, apart from latitude and longitude, which I want to see with the cluster results. For example, the data looks like this:
28.6029445 77.3443552 1
28.6029511 77.3443573 2
28.6029436 77.3443458 3
28.6029011 77.3443032 4
28.6028967 77.3443042 5
28.6029087 77.3442829 6
28.6029132 77.3442797 7
Now in the MiniGUI, if I set parser.labelindices to 2 and run the task, then the output looks like this:
# Cluster: Cluster 0
ID=63222 28.6031295 77.3407848 441
ID=63225 28.603134 77.3407744 444
ID=63220 28.6031566667 77.3407816667 439
ID=63226 28.6030819 77.3407605 445
ID=63221 28.6032 77.3407616667 440
ID=63228 28.603085 77.34071 447
ID=63215 28.60318 77.3408583333 434
ID=63229 28.6030751 77.3407096 448
So it is still connected to the 3rd column which I passed as a label. I have checked the clustering result by passing just latitude and longitude, and it is exactly the same. So, in a way, by passing a column as a 'label' I can retrieve that column along with lat/long in the cluster results.
Now I want to use this in my Java code:
// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(
FileBasedDatabaseConnection.Parameterizer.INPUT_ID,
fileLocation);
params.addParameter(
NumberVectorLabelParser.Parameterizer.LABEL_INDICES_ID,
2);
params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
RStarTreeFactory.class);
But this is giving a NullPointerException. In the MiniGUI, dbc.parser is NumberVectorLabelParser by default, so this should work fine. What am I missing?
I will have a look into the NPE, it should return a more helpful error message instead.
Most likely, the problem is that this parameter is of type List<Integer>, i.e. you would need to pass a list. Alternatively, you can pass a String, which will be parsed. The following should work just fine:
params.addParameter(
NumberVectorLabelParser.Parameterizer.LABEL_INDICES_ID,
"2");
Note that the text writer might (I have not checked this) print labels as-is, so you cannot take the output as an indication that it considered your data set to be 3-dimensional.
The debugging handler -resulthandler LogResultStructureResultHandler -verbose should give you type output:
java -jar elki.jar KDDCLIApplication -dbc.in dbpedia.gz \
-algorithm NullAlgorithm \
-resulthandler LogResultStructureResultHandler -verbose
should yield an output like this:
de.lmu.ifi.dbs.elki.datasource.FileBasedDatabaseConnection.load: 1941 ms
de.lmu.ifi.dbs.elki.algorithm.NullAlgorithm.runtime: 0 ms
BasicResult: Algorithm Step (main)
StaticArrayDatabase: Database (database)
DBIDView: Database IDs (DBID)
MaterializedRelation: DoubleVector,dim=2 (relation)
MaterializedRelation: LabelList (relation)
SettingsResult: Settings (settings)
In this case, my data set are coordinates from Wikipedia, along with a name each. I have a 2 dimensional DoubleVector relation, and a LabelList relation storing the object names.

MongoDB find has the same speed with and without index

I use MongoTemplate from Spring to access a MongoDB.
final Query query = new Query(Criteria.where("_id").exists(true));
query.with(new Sort(Direction.ASC, "FIRSTNAME", "LASTNAME", "EMAIL"));
if (count > 0) {
query.limit(count);
}
query.skip(start);
query.fields().include("FIRSTNAME");
query.fields().include("LASTNAME");
query.fields().include("EMAIL");
return mongoTemplate.find(query, User.class, "users");
I generated 400,000 records in my MongoDB.
When asking for the first 25 users without using the sort line written above, I get the result in less than 50 milliseconds.
With the sort it takes over 4 seconds.
I then created indexes for FIRSTNAME, LASTNAME and EMAIL - single indexes, not combined ones:
mongoTemplate.indexOps("users").ensureIndex(new Index("FIRSTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("LASTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("EMAIL", Order.ASCENDING));
After creating these indexes, the query still takes over 4 seconds.
What was my mistake?
-- edit
MongoDB writes this on the console...
Thu Jul 04 10:10:11.442 [conn50] query mydb.users query: { query: { _id: { $exists: true } }, orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } } ntoreturn:25 ntoskip:0 nscanned:382424 scanAndOrder:1 keyUpdates:0 numYields: 2 locks(micros) r:6903475 nreturned:25 reslen:3669 4097ms
You have to create a compound index on FIRSTNAME, LASTNAME and EMAIL, in this order, all of them ascending.
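With the Spring Data MongoDB API used in the question, such a compound index might be created like this sketch (assuming the chained Index.on(...) builder of that API version):
mongoTemplate.indexOps("users").ensureIndex(new Index()
    .on("FIRSTNAME", Order.ASCENDING)
    .on("LASTNAME", Order.ASCENDING)
    .on("EMAIL", Order.ASCENDING));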
Thu Jul 04 10:10:11.442 [conn50] query mydb.users query:
{ query: { _id: { $exists: true } }, orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } }
ntoreturn:25 ntoskip:0 nscanned:382424 scanAndOrder:1 keyUpdates:0 numYields: 2
locks(micros) r:6903475 nreturned:25 reslen:3669 4097ms
Possible bad signs:
Your scanAndOrder is true (scanAndOrder:1) - correct me if I am wrong.
It only has to return 25 documents (ntoreturn:25), but it is scanning 382424 documents (nscanned:382424).
For indexed queries, nscanned is the number of index keys in the range that Mongo scanned, and nscannedObjects is the number of documents it looked at to get to the final result. nscannedObjects includes at least all the documents returned, even if Mongo could tell just by looking at the index that the document was definitely a match. Thus, you can see that nscanned >= nscannedObjects >= n always.
Context of Question:
Case 1: When asking for the first 25 users without using the sort line written above, I get the result in less than 50 milliseconds.
Case 2: With sort it lasts over 4 seconds.
query.with(new Sort(Direction.ASC, "FIRSTNAME", "LASTNAME", "EMAIL"));
As in this case there is no index matching the sort, it is doing as mentioned here:
This means MongoDB had to batch up all the results in memory, sort them, and then return them. Infelicities abound. First, it costs RAM and CPU on the server. Also, instead of streaming my results in batches, Mongo just dumps them all onto the network at once, taxing the RAM on my app servers. And finally, Mongo enforces a 32MB limit on data it will sort in memory.
Case 3: created indexes for FIRSTNAME, LASTNAME, EMAIL. Single indexes, not combined ones
I guess it is still not fetching data from the index. You have to tune your indexes according to the sort order:
Sort Fields (ascending / descending only matters if there are multiple sort fields)
Add sort fields to the index in the same order and direction as your query's sort
For more details, check this
http://emptysqua.re/blog/optimizing-mongodb-compound-indexes/
Possible Answer:
In the query, the sort order orderby: { LASTNAME: 1, FIRSTNAME: 1, EMAIL: 1 } is different from the order you have specified in:
mongoTemplate.indexOps("users").ensureIndex(new Index("FIRSTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("LASTNAME", Order.ASCENDING));
mongoTemplate.indexOps("users").ensureIndex(new Index("EMAIL", Order.ASCENDING));
I guess the Spring API might not be retaining the order:
https://jira.springsource.org/browse/DATAMONGO-177
When I try to sort on multiple fields the order of the fields is not maintained. The Sort class is using a HashMap instead of a LinkedHashMap so the order they are returned is not guaranteed.
Could you mention the Spring jar version?
Hope this answers your question.
Correct me where you feel I might be wrong, as I am a little rusty.
