I have an aggregation:
AggregationResults<Integer> result = mongoTemplate.aggregate(
        Aggregation.newAggregation(
                Aggregation.group().count().as("value"),
                Aggregation.project("value").andExclude("_id")),
        MyData.class, Integer.class);
In the mongo shell, where I don't have to map to an object, I get: { "value" : 2 }
However, I get the following error when trying to map this lone value: org.springframework.data.mapping.model.MappingException: No mapping metadata found for java.lang.Integer
Can I get around having to create a new output type class when all I want is a single Java primitive?
Note: I'm taking this approach instead of db.collection.count() because of the sharding inaccuracies described here: https://docs.mongodb.com/manual/reference/method/db.collection.count/#sharded-clusters
AggregationResults<DBObject> result = mongoTemplate.aggregate(
        Aggregation.newAggregation(
                Aggregation.group().count().as("value"),
                Aggregation.project("value").andExclude("_id")),
        MyData.class, DBObject.class);
int count = (Integer) result.getUniqueMappedResult().get("value");
So it's not exactly what I wanted, because I still have to reach into an object, but it's no more code than I had before, and I didn't need to make another class as the outputType.
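If you're on a newer Spring Data MongoDB (2.x+, where the driver's DBObject has largely been replaced by org.bson.Document), the same workaround should carry over (a minimal sketch under that assumption):

AggregationResults<org.bson.Document> result = mongoTemplate.aggregate(
        Aggregation.newAggregation(
                Aggregation.group().count().as("value"),
                Aggregation.project("value").andExclude("_id")),
        MyData.class, org.bson.Document.class);

// org.bson.Document offers typed accessors, so no cast is needed
int count = result.getUniqueMappedResult().getInteger("value");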
Related
Can I get a Pair as an output from jdbcTemplate? I tried the following (which works when querying for separate Integers):
Pair<Integer, Integer> result = jdbcTemplate.queryForObject(GET_PAIR, new Object[]{}, Pair.class);
But it throws the following exception:
org.springframework.jdbc.IncorrectResultSetColumnCountException: Incorrect column count: expected 1, actual 2
at org.springframework.jdbc.core.SingleColumnRowMapper.mapRow(SingleColumnRowMapper.java:92)
at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:93)
at org.springframework.jdbc.core.RowMapperResultSetExtractor.extractData(RowMapperResultSetExtractor.java:60)
at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(JdbcTemplate.java:703)
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:639)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:690)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:722)
at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:732)
at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:800)
I tried this with org.apache.commons.lang3.tuple.Pair.
queryForObject requires exactly one result, and just one. So when you get an EmptyResultDataAccessException, it means the query didn't find anything.
However, I still don't think it will work even if you get a result. A better way is to use a RowMapper:
jdbcTemplate.query(GET_PAIR, (rs, i) -> Pair.of(rs.getInt(1), rs.getInt(2)))
which maps each row to a Pair (this returns a list, with one entry per row). Note that org.apache.commons.lang3.tuple.Pair has no public constructor; use the Pair.of(...) factory method.
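For completeness, a minimal sketch of pulling a single Pair out this way (assuming GET_PAIR returns exactly one two-column row; the query string and column names here are invented):

import java.util.List;
import org.apache.commons.lang3.tuple.Pair;
import org.springframework.jdbc.core.JdbcTemplate;

class PairDao {
    // Hypothetical query: two integer columns in one row
    private static final String GET_PAIR = "SELECT min_val, max_val FROM limits WHERE id = 1";

    private final JdbcTemplate jdbcTemplate;

    PairDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    Pair<Integer, Integer> fetchPair() {
        List<Pair<Integer, Integer>> rows = jdbcTemplate.query(
                GET_PAIR,
                (rs, i) -> Pair.of(rs.getInt("min_val"), rs.getInt("max_val")));
        // Fail loudly if the query did not return exactly one row
        if (rows.size() != 1) {
            throw new IllegalStateException("Expected exactly one row, got " + rows.size());
        }
        return rows.get(0);
    }
}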
My input is a Kafka stream containing a single comma-separated value. It looks like this:
"id,country,timestamp"
I have already split the dataset, so that I have something like the following structured stream:
Dataset<Row> words = df
.selectExpr("CAST (value AS STRING)")
.as(Encoders.STRING())
.withColumn("id", split(col("value"), ",").getItem(0))
.withColumn("country", split(col("value"), ",").getItem(1))
.withColumn("timestamp", split(col("value"), ",").getItem(2));
+----+---------+----------+
|id |country |timestamp |
+----+---------+----------+
|2922|de |1231231232|
|4195|de |1231232424|
|6796|fr |1232412323|
+----+---------+----------+
Now I have a dataset with 3 columns, and I want to use the entries of each row in a custom function, e.g.:
Dataset<String> names = words.map((MapFunction<Row, String>) row -> {
    // do something with every entry of each row, e.g.
    Person person = new Person(
            row.getAs("id"), row.getAs("country"), row.getAs("timestamp"));
    return person.getName();
}, Encoders.STRING());
In the end I want to sink out a comma-separated String again.
A DataFrame has a schema, so you can't just call a map function on it without defining a new schema.
You can either convert to an RDD and use map, or use a DataFrame map with an encoder.
Another option is to use Spark SQL with user-defined functions; you can read about those.
If your use case is really as simple as what you're showing, you can do something like:
val nameRdd = words.rdd.map(x => f(x))
which seems like all you need.
If you still want a DataFrame, you can use something like:
val schema = StructType(Seq(StructField(name = "name", dataType = StringType)))
val rddToDf = nameRdd.map(name => Row.apply(name))
val df = sparkSession.createDataFrame(rddToDf, schema)
P.S. DataFrame === Dataset[Row]
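For the sink-out part of the question (writing a comma-separated String back to Kafka), something like this should work in Java (a sketch; the broker address, topic name, and checkpoint path are placeholders):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat_ws;
import org.apache.spark.sql.streaming.StreamingQuery;

// Rebuild a single comma-separated "value" column, which the Kafka sink expects
StreamingQuery query = words
        .select(concat_ws(",", col("id"), col("country"), col("timestamp")).as("value"))
        .writeStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
        .option("topic", "output-topic")                      // placeholder
        .option("checkpointLocation", "/tmp/checkpoint")      // placeholder
        .start();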
If you have a custom function that is not available by composing functions in the existing Spark API [1], then you can either drop down to the RDD level (as @Ilya suggested) or use a UDF [2].
Typically I'll try to use the Spark API functions on a DataFrame whenever possible, as they are generally the best optimized.
If that's not possible, I construct a UDF:
import org.apache.spark.sql.functions.{col, udf}
val squared = udf((s: Long) => s * s)
display(spark.range(1, 20).select(squared(col("id")) as "id_squared"))
In your case you need to pass multiple columns to your UDF; you can pass them comma-separated: squared(col("col_a"), col("col_b")).
Since you are writing your UDF in Scala it should be pretty efficient, but keep in mind that if you use Python, there will generally be extra latency due to data movement between the JVM and Python.
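In Java the same idea looks roughly like this (a sketch; the make_key name and the concatenation logic are invented for illustration):

import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;

// Register a two-argument UDF (both inputs assumed to be strings)
spark.udf().register("make_key",
        (UDF2<String, String, String>) (id, country) -> id + "_" + country,
        DataTypes.StringType);

// Call it with multiple columns, comma separated
Dataset<Row> withKey = words.withColumn("key",
        callUDF("make_key", col("id"), col("country")));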
[1] https://spark.apache.org/docs/latest/api/scala/index.html#package
[2] https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html
I have data in a JavaPairRDD in the format
JavaPairRDD<String, Tuple2<String, String>>
I tried using the code below:
Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
Dataset<Row> userViolationsDetails = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");
But how do I generate a Dataset with 3 columns? The output of the above code gives me data in 2 columns. Any pointers/suggestions?
Try running printSchema() - you will see that value2 is a complex (struct) type.
With that information, you can write:
Dataset<Row> uvd = userViolationsDetails.selectExpr("value1", "value2._1 as value2", "value2._2 as value3");
value2._1 means the first element of the tuple inside the current value2 field. We overwrite the value2 field so that it holds one value only.
Note that this will work after https://issues.apache.org/jira/browse/SPARK-24548 is merged to the master branch. Currently there is a bug in Spark where a tuple is converted to a struct with two fields, both named value.
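Putting the answer together with the code from the question (a sketch; assuming MY_RDD and encoder2 are as defined above, and that after the SPARK-24548 fix the struct fields are exposed as _1 and _2):

Dataset<Row> userViolationsDetails = spark
        .createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2)
        .toDF("value1", "value2");

// value2 prints as a struct with fields _1 and _2
userViolationsDetails.printSchema();

// Flatten the struct into separate top-level columns
Dataset<Row> flattened = userViolationsDetails
        .selectExpr("value1", "value2._1 as value2", "value2._2 as value3");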
I am trying to use Aggregates.project to slice the array in my documents.
My documents are like
{
"date":"",
"stype_0":[1,2,3,4]
}
and my code in Java is:
Aggregates.project(Projections.fields(
        Projections.slice("stype_0", pst - 1, pen - pst),
        Projections.slice("stype_1", pst - 1, pen - pst),
        Projections.slice("stype_2", pst - 1, pen - pst),
        Projections.slice("stype_3", pst - 1, pen - pst)))
Finally I get the error:
First argument to $slice must be an array, but is of type: int
I guess that is because the first element in stype_0 is an int, but I really do not know why. Thanks a lot!
$slice has two versions: $slice (aggregation) and $slice (projection). You are using the wrong one.
The aggregation $slice has no built-in helper in the Java driver. Below is an example for one such projection; do the same for all the other fields (see the combined sketch after the example).
List<Object> stype_0 = Arrays.asList("$stype_0", 1, 1);
Bson project = Aggregates.project(Projections.fields(
        new Document("stype_0", new Document("$slice", stype_0))));
AggregateIterable<Document> iterable = dbCollection.aggregate(Arrays.asList(project));
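A hedged sketch doing all four fields in one projection (the field names and the pst/pen bounds come from the question; dbCollection is assumed to be a MongoCollection<Document>):

import java.util.Arrays;
import org.bson.Document;
import org.bson.conversions.Bson;
import com.mongodb.client.AggregateIterable;
import com.mongodb.client.model.Aggregates;

// Build a {$slice: ["$field", skip, count]} expression per field
Document fields = new Document();
for (String field : Arrays.asList("stype_0", "stype_1", "stype_2", "stype_3")) {
    fields.append(field, new Document("$slice",
            Arrays.asList("$" + field, pst - 1, pen - pst)));
}
Bson project = Aggregates.project(fields);
AggregateIterable<Document> iterable = dbCollection.aggregate(Arrays.asList(project));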
Update: I should have mentioned this right off the bat: I first considered a Java/JSON mapping framework, but my manager does not want me adding any more dependencies to the project, so that is out as an option. The JSON-Java jar is already on our classpath, so I could use that, but I'm still not seeing the forest for the trees on how it could be used.
My Java program is being handed JSON of the following form (although the values will change all the time):
{"order":{"booze":"1","handled":"0","credits":"0.6",
"execute":0,"available":["299258"],"approved":[],
"blizzard":"143030","reviewable":["930932","283982","782821"],
"units":"6","pending":["298233","329449"],"hobbit":"blasphemy"}}
I'm looking for the easiest, most efficient, surefire way of cherry-picking specific values out of this JSON string and aggregating them into a List<Long>.
Specifically, I'm looking to extract-and-aggregate all of the "ids", that is, all the numeric values that you see for the available, approved, reviewable and pending fields. Each of these fields is an array of 0+ "ids". So, in the example above, we see the following breakdown of ids:
available: has 1 id (299258)
approved: has 0 ids
reviewable: has 3 ids (930932, 283982, 782821)
pending: has 2 ids (298233, 329449)
I need some Java code to run and produce a List<Long> with all 6 of these extracted ids, in no particular order. The ids just need to make it into the list.
This feels like it needs an incredibly complex, convoluted regex, and I'm not even sure where to begin. Any help at all is enormously appreciated. Thanks in advance.
The easiest way IMO is to use a JSON library such as Gson, Jackson, json.org, etc., parse the JSON into an object, and create a new List<Long> with the values of the properties you need.
Pseudocode with gson:
class Order {
    long[] available;
    long[] approved;
    long[] reviewable;
    long[] pending;
}

class Wrapper {
    Order order; // matches the top-level "order" key
}

Order order = gson.fromJson("{ your json goes here }", Wrapper.class).order;
List<Long> result = new ArrayList<Long>();
for (long id : order.available) result.add(id);
for (long id : order.approved) result.add(id);
...
Pseudocode with json.org/java:
JSONObject myobject = new JSONObject("{ your json goes here }");
JSONObject order = myobject.getJSONObject("order");
List<Long> result = new ArrayList<Long>();
JSONArray approved = order.getJSONArray("approved");
for (int i = 0; i < approved.length(); i++) {
    result.add(approved.getLong(i));
}
...
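A fuller sketch with json.org (JSON-Java), looping over all four id arrays (the array names come from the question; assumes the raw JSON is passed in as a String):

import java.util.ArrayList;
import java.util.List;
import org.json.JSONArray;
import org.json.JSONObject;

List<Long> extractIds(String json) {
    JSONObject order = new JSONObject(json).getJSONObject("order");
    List<Long> ids = new ArrayList<Long>();
    // The four fields that hold arrays of ids
    for (String field : new String[] {"available", "approved", "reviewable", "pending"}) {
        JSONArray arr = order.getJSONArray(field);
        for (int i = 0; i < arr.length(); i++) {
            // Ids arrive as strings like "299258"; getLong coerces numeric strings
            ids.add(arr.getLong(i));
        }
    }
    return ids;
}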