Read values from Java Map using Spark Column using java


I tried the code below to look up Map values through a Spark column in Java, but I get null instead of the value stored in the Map for each key.
The Spark Dataset is named dataset1 and contains a single column named KEY.
Values in the dataset:
KEY
1
2
Java Code -
Map<String, String> map1 = new HashMap<>();
map1.put("1","CUST1");
map1.put("2","CUST2");
dataset1.withColumn("ABCD", functions.lit(map1.get(col("KEY"))));
Current Output is:
ABCD (Column name)
null
null
Expected Output :
ABCD (Column name)
CUST1
CUST2
Please help me get this expected output.

The reason you get this output is simple. Map.get in Java accepts any object as input. If that object is not a key of the map, the result is null.
The lit function in Spark creates a constant column (all rows have the same value), e.g. lit(1) creates a column that takes the value 1 for each row.
Here, map1.get(col("KEY")), which is executed on the driver, asks map1 for the value corresponding to a Column object (not the value inside the column, but the Java/Scala object representing the column). The map does not contain that object, so the result is null. You could just as well have written lit(null), which is why you get null in every row of your dataset.
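The effect can be reproduced without Spark. The sketch below is only an illustration: ColumnStandIn is a hypothetical class that stands in for Spark's Column, and MapGetDemo and buildMap are names made up for this example. It shows that Map.get compiles with any object and returns null when that object is not a key.

```java
import java.util.HashMap;
import java.util.Map;

public class MapGetDemo {
    // Hypothetical stand-in for a Spark Column: it describes a column, it is not a row value.
    static final class ColumnStandIn {
        final String name;

        ColumnStandIn(String name) {
            this.name = name;
        }
    }

    // Same contents as map1 in the question.
    static Map<String, String> buildMap() {
        Map<String, String> map1 = new HashMap<>();
        map1.put("1", "CUST1");
        map1.put("2", "CUST2");
        return map1;
    }

    public static void main(String[] args) {
        Map<String, String> map1 = buildMap();
        // Map.get takes an Object, so this compiles, but no key equals the stand-in object.
        System.out.println(map1.get(new ColumnStandIn("KEY"))); // prints null
        // Looking up an actual key value works as expected.
        System.out.println(map1.get("1")); // prints CUST1
    }
}
```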
To solve your problem, you could wrap the map access in a UDF, for instance. Something like:
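import static org.apache.spark.sql.functions.expr;
import static org.apache.spark.sql.functions.udf;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.expressions.UserDefinedFunction;
import org.apache.spark.sql.types.DataTypes;

// Wrap the map lookup in a UDF so that it runs once per row on the value of KEY.
UserDefinedFunction map_udf = udf(new UDF1<String, String>() {
    @Override
    public String call(String x) {
        return map1.get(x);
    }
}, DataTypes.StringType);
spark.udf().register("map_udf", map_udf);
dataset1.withColumn("ABCD", expr("map_udf(KEY)"));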

Related

Java Streams with combining multiple rows to one

My code consists of a class with 10 variables. The class will get the data from a database table and the results from it is a List. Here is a sample class:
@Data
class pMap {
long id;
String rating;
String movieName;
}
The data will be as follows:
id=1, rating=PG-13, movieName=
id=1, rating=, movieName=Avatar
id=2, rating=, movieName=Avatar 2
id=2, rating=PG, movieName=
I want to combine the rows that share an id into a single row using Java streams. The end result should look like this Map<Long, pMap>:
1={id=1, rating=PG-13, movieName=Avatar},
2={id=2, rating=PG, movieName=Avatar 2}
I am not sure how I can get the rows combined to one by pivoting them.

You can use toMap to achieve this:
Map<Long, pMap> myMap = myList.stream().collect(Collectors.toMap(x -> x.id, Function.identity(),
(x1, x2) -> new pMap(x1.id, x1.rating != null ? x1.rating : x2.rating, x1.movieName != null ? x1.movieName : x2.movieName)));
I am passing three functions to the toMap method:
The first one is a key mapper. It maps an element to a key of the map. In this case, I want the key to be the id.
The second one is a value mapper. I want the value to be the pMap itself, which is why I am passing the identity function.
The third one is a merge function that tells toMap how to combine two values with the same id.
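The three arguments described above can be tried out with a self-contained sketch. Some assumptions are made here: PMap is a hand-written stand-in for the Lombok class in the question, the blank fields in the sample data are taken to be empty strings (the null case is handled as well), and MergeRowsDemo, firstNonEmpty, mergeById and sampleRows are names invented for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class MergeRowsDemo {
    // Hand-written stand-in for the Lombok @Data class in the question.
    static final class PMap {
        final long id;
        final String rating;
        final String movieName;

        PMap(long id, String rating, String movieName) {
            this.id = id;
            this.rating = rating;
            this.movieName = movieName;
        }

        @Override
        public String toString() {
            return "{id=" + id + ", rating=" + rating + ", movieName=" + movieName + "}";
        }
    }

    // Returns a unless it is null or empty, in which case b is returned.
    static String firstNonEmpty(String a, String b) {
        return (a != null && !a.isEmpty()) ? a : b;
    }

    // Key mapper, value mapper and merge function, as in the answer.
    static Map<Long, PMap> mergeById(List<PMap> rows) {
        return rows.stream().collect(Collectors.toMap(
                row -> row.id,
                Function.identity(),
                (x1, x2) -> new PMap(x1.id,
                        firstNonEmpty(x1.rating, x2.rating),
                        firstNonEmpty(x1.movieName, x2.movieName))));
    }

    // The four sample rows from the question, with blanks modelled as empty strings.
    static List<PMap> sampleRows() {
        return Arrays.asList(
                new PMap(1, "PG-13", ""),
                new PMap(1, "", "Avatar"),
                new PMap(2, "", "Avatar 2"),
                new PMap(2, "PG", ""));
    }

    public static void main(String[] args) {
        Map<Long, PMap> merged = mergeById(sampleRows());
        System.out.println(merged.get(1L)); // {id=1, rating=PG-13, movieName=Avatar}
        System.out.println(merged.get(2L)); // {id=2, rating=PG, movieName=Avatar 2}
    }
}
```

If the blank fields are really null rather than empty, the null check alone (as in the answer) is enough; checking for both simply covers either case.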

How to get column names of Spark Row using java

I am trying to convert a spark dataframe to rdd and apply a function using map.
In PySpark, we can fetch the value of a given column by converting the Row to a dictionary (the key being the column name, the value being the value of that column), as below:
row_dict = Row.asDict()
val = row_dict['column1'] # I can access the value of any column
Now, in Java, I am trying to do something similar. I get the Row, and I found that it has APIs to get values by index:
JavaRDD<Row> resultRdd = df.javaRDD().map(x -> customFunction(x, customParam1, customParam2));
public static Row customFunction(Row row, Object o1, Object o2) {
// need to access "column1" value from the row
// how to get column name of each index if we have to use row.get(index)
}
How can I access the row values based on column names in java code?

How to Alias a DataSet column before writing to a parquet in Java

I am working with apache spark in java and what I am trying to do is filter some data, group it by a specific key and then count the number of elements for each key. At the moment I am doing this:
Dataset<MyBean> rawEvents = readData(spark);
Dataset<MyBean> filtered = rawEvents.filter((FilterFunction<MyBean>) events ->
//filter function
));
KeyValueGroupedDataset<String, MyBean> grouped = filtered
.groupByKey((MapFunction<MyBean, String>) event -> {
return event.getKey();
}, Encoders.STRING());
grouped.count().write().parquet("output.parquet");
It fails to write because: org.apache.spark.sql.AnalysisException: Attribute name "count(1)" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
How can I alias the count column so this does not happen?

grouped.count() returns a Dataset<Tuple2<String, Object>> in your case.
Renaming the column in the Dataset will solve your problem.
You can use the withColumnRenamed method of the Dataset API:
grouped.count().withColumnRenamed("count(1)", "counts").write().parquet("output.parquet");

After grouped.count(), select all columns, add an alias to the count column, and then call the write method.
Example:
import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Column;
Column[] colList = { col("column1"), col("column2"), col("count(1)").alias("count") };
grouped.count().select(colList).write().parquet("output.parquet");

Get values from hashmap in Eclipse

I am adding to a project done in Eclipse that displays data pulled from various databases to a webpage. I am trying to figure out how to use a hashmap. I have a variable/column called "description" that will show a description based on the value of the column just before it. The descriptions are in my hashmap. I just don't know how to pull the "description" value.
Here is part of the DtoBuilder -
private HashMap<String,String> itemDescrMap = null;
public VersionsDto build(Versions oneVersion){
VersionsDto result = null;
if(itemDescrMap==null){
itemDescrMap = loadItemDescrMap();
}
// Create instance of versions object and build it.
if(oneVersion != null){
result = new VersionsDto();
result.setStore(oneVersion.getStore());
result.setUpdatePackage(oneVersion.getUpdatePackage());
result.setDescription(oneVersion.getDescription());
and here is part of the hashmap -
private static HashMap<String,String> loadItemDescrMap(){
HashMap<String,String> map = new HashMap<String,String>();
map.put("CDSA", "Color Match");
map.put("CDSB", "New Formula Book");
map.put("CDSC", "Base Assignments");
map.put("CDSD", "Product Formulation");
map.put("CDSE", "Old TAC");
map.put("CDSF", "Colorant Systems");
map.put("CDSG", "Miscellaneous");
map.put("CDSH", "AFCD");
Initially, I was just grabbing the same data for "description" as I was for "updatePackage", just to test that it would populate all the fields and display it to the webpage. So now I need to know how to set the "description" value based on the hash map, where the first values (CDSD, CDSE, etc) are all the possible values in the "updatePackage" column and the second value is the corresponding "description" that I need.

Look at the javadoc https://docs.oracle.com/javase/7/docs/api/java/util/HashMap.html
I think itemDescrMap.get(oneVersion.getUpdatePackage()) should do the job
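A minimal sketch of that lookup, runnable on its own. The class name DescriptionLookupDemo, the describe helper and the empty-string fallback for unknown codes are assumptions made for illustration; only a few of the map entries from the question are repeated here.

```java
import java.util.HashMap;
import java.util.Map;

public class DescriptionLookupDemo {
    // A shortened copy of loadItemDescrMap() from the question.
    static Map<String, String> loadItemDescrMap() {
        Map<String, String> map = new HashMap<>();
        map.put("CDSA", "Color Match");
        map.put("CDSB", "New Formula Book");
        map.put("CDSD", "Product Formulation");
        return map;
    }

    // Looks up the description for an updatePackage code.
    // The empty-string fallback for unknown codes is an assumption, not part of the original code.
    static String describe(Map<String, String> itemDescrMap, String updatePackage) {
        return itemDescrMap.getOrDefault(updatePackage, "");
    }

    public static void main(String[] args) {
        Map<String, String> itemDescrMap = loadItemDescrMap();
        System.out.println(describe(itemDescrMap, "CDSD")); // prints Product Formulation
        System.out.println(describe(itemDescrMap, "ZZZZ").isEmpty()); // prints true
    }
}
```

In the builder from the question, the same idea would presumably replace the existing setDescription line with result.setDescription(itemDescrMap.get(oneVersion.getUpdatePackage())).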

How to get just the desired field from an array of sub documents in Mongodb using Java

I have just started using MongoDB. Below is my data structure.
It holds a set of skillIDs, each of which has an array of activeCampaigns, and each activeCampaign has an array of callsByTimeZone.
What I am looking for, in SQL terms, is:
Select activeCampaigns.callsByTimeZone.label,
activeCampaigns.callsByTimeZone.loaded
from X
where skillID=50296 and activeCampaigns.campaignId=11371940
and activeCampaigns.callsByTimeZone.label='PT'
The output I am expecting is:
{"label":"PT", "loaded":1 }
The command I used is:
db.cd.find({ "skillID" : 50296 , "activeCampaigns.campaignId" : 11371940,
"activeCampaigns.callsByTimeZone.label" :"PT" },
{ "activeCampaigns.callsByTimeZone.label" : 1 ,
"activeCampaigns.callsByTimeZone.loaded" : 1 ,"_id" : 0})
The output I am getting is everything under activeCampaigns.callsByTimeZone, while I expect only the PT entry.
Data structure:
{
"skillID":50296,
"clientID":7419,
"voiceID":1,
"otherResults":7,
"activeCampaigns":
[{
"campaignId":11371940,
"campaignFileName":"Aaron.name.121.csv",
"loaded":259,
"callsByTimeZone":
[{
"label":"CT",
"loaded":6
},
{
"label":"ET",
"loaded":241
},
{
"label":"PT",
"loaded":1
}]
}]
}
I tried the same in Java.
QueryBuilder query = QueryBuilder.start().and("skillID").is(50296)
.and("activeCampaigns.campaignId").is(11371940)
.and("activeCampaigns.callsByTimeZone.label").is("PT");
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone.label",1)
.append("activeCampaigns.callsByTimeZone.loaded",1).append("_id", 0);
DBCursor cursor = coll.find(query.get(), fields);
String campaignJson = null;
while(cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
campaignJson = campaignDBO.toString();
System.out.println(campaignJson);
}
The value obtained is everything under the callsByTimeZone array. I am currently parsing the JSON obtained and extracting only the PT values. Is there a way to query just the PT fields inside activeCampaigns.callsByTimeZone?
Thanks in advance. Sorry if this question has already been raised in the forum; I have searched a lot and failed to find a proper solution.

There are several ways of doing it, but you should not be using String manipulation (e.g. indexOf); the performance could be horrible.
The results in the cursor are nested Maps, representing the document in the database - a Map is a good Java-representation of key-value pairs. So you can navigate to the place you need in the document, instead of having to parse it as a String. I've tested the following and it works on your test data, but you might need to tweak it if your data is not all exactly like the example:
while (cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
List callsByTimezone = (List) ((DBObject) ((List) campaignDBO.get("activeCampaigns")).get(0)).get("callsByTimeZone");
DBObject valuesThatIWant = null; // stays null if no "PT" entry is found
for (Object o : callsByTimezone) {
DBObject call = (DBObject) o;
if (call.get("label").equals("PT")) {
valuesThatIWant = call;
}
}
}
Depending upon your data, you might want to add protection against null values as well.
The thing you were looking for ({"label":"PT", "loaded":1 }) is in the variable valuesThatIWant. Note that this, too, is a DBObject, i.e. a Map, so if you want to see what's inside it you need to use get:
valuesThatIWant.get("label"); // will return "PT"
valuesThatIWant.get("loaded"); // will return 1
Because DBObject is effectively a Map of String to Object (i.e. Map<String, Object>) you need to cast the values that come out of it (hence the ugliness in the first bit of code in my answer) - with numbers, it will depend on how the data was loaded into the database, it might come out as an int or as a double:
String theValueOfLabel = (String) valuesThatIWant.get("label"); // will return "PT"
double theValueOfLoaded = (Double) valuesThatIWant.get("loaded"); // will return 1.0
I'd also like to point out the following from my answer:
((List) campaignDBO.get("activeCampaigns")).get(0)
This assumes that "activeCampaigns" is a) a list and in this case b) only has one entry (I'm doing get(0)).
You will also have noticed that the fields you've set do not narrow the result down to the PT entry. A projection chooses which fields come back, but it does not filter the elements of an array, so every entry of callsByTimeZone is returned. For the purpose of isolating PT, your code:
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone.label",1)
.append("activeCampaigns.callsByTimeZone.loaded",1)
.append("_id", 0);
helps no more than:
BasicDBObject fields = new BasicDBObject("activeCampaigns", 1).append("_id", 0);
I think some of the points that will help you to work with Java & MongoDB are:
When you query the database, it returns each document that matches your query, i.e. everything from "skillID" downwards, limited to the fields you select. Selecting fields does not remove non-matching array elements. See the documentation for more detail.
To navigate the results, you need to know that DBObjects are returned, and that these are effectively a Map<String, Object> in Java. You can use get to navigate to the correct node, but you will need to cast the values into the correct shape.
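The null protection and the int-or-double caveat mentioned in this answer could be handled along the following lines. This is only a sketch under stated assumptions: plain java.util Maps and Lists stand in for the driver's DBObject results so that the example runs without MongoDB, and NestedLookupDemo, findCall, loadedOf and sampleDocument are hypothetical names. Unlike the loop above, it also checks every campaign rather than only get(0).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NestedLookupDemo {
    // Walks activeCampaigns -> callsByTimeZone and returns the first entry with the given label.
    // Returns null if any level is missing or has an unexpected shape.
    static Map<String, Object> findCall(Map<String, Object> doc, String label) {
        Object campaigns = doc.get("activeCampaigns");
        if (!(campaigns instanceof List)) {
            return null;
        }
        for (Object c : (List<?>) campaigns) {
            if (!(c instanceof Map)) {
                continue;
            }
            Object calls = ((Map<?, ?>) c).get("callsByTimeZone");
            if (!(calls instanceof List)) {
                continue;
            }
            for (Object o : (List<?>) calls) {
                if (o instanceof Map && label.equals(((Map<?, ?>) o).get("label"))) {
                    @SuppressWarnings("unchecked")
                    Map<String, Object> call = (Map<String, Object>) o;
                    return call;
                }
            }
        }
        return null;
    }

    // Reads "loaded" through Number, so it works whether the value is stored as an int or a double.
    // Returning NaN for a missing or non-numeric value is a choice made for this sketch.
    static double loadedOf(Map<String, Object> call) {
        Object value = call.get("loaded");
        return (value instanceof Number) ? ((Number) value).doubleValue() : Double.NaN;
    }

    static Map<String, Object> call(String label, int loaded) {
        Map<String, Object> m = new HashMap<>();
        m.put("label", label);
        m.put("loaded", loaded);
        return m;
    }

    // A reduced copy of the document from the question.
    static Map<String, Object> sampleDocument() {
        List<Object> calls = new ArrayList<>();
        calls.add(call("CT", 6));
        calls.add(call("ET", 241));
        calls.add(call("PT", 1));
        Map<String, Object> campaign = new HashMap<>();
        campaign.put("campaignId", 11371940);
        campaign.put("callsByTimeZone", calls);
        List<Object> campaigns = new ArrayList<>();
        campaigns.add(campaign);
        Map<String, Object> doc = new HashMap<>();
        doc.put("skillID", 50296);
        doc.put("activeCampaigns", campaigns);
        return doc;
    }

    public static void main(String[] args) {
        Map<String, Object> pt = findCall(sampleDocument(), "PT");
        System.out.println(pt.get("label")); // prints PT
        System.out.println(loadedOf(pt)); // prints 1.0
    }
}
```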

Replacing while loop from your Java code with below seems to give "PT" as output.
while(cursor.hasNext()) {
    DBObject campaignDBO = cursor.next();
    campaignJson = campaignDBO.get("activeCampaigns").toString();
    int labelInt = campaignJson.indexOf("PT", -1);
    String label = campaignJson.substring(labelInt, labelInt+2);
    System.out.println(label);
}
