Spark error when accessing an empty or null array - java

I have a JSON file with this type of schema:
{
"name" : "john doe",
"phone-numbers" : {
"home": ["1111", "222"],
"country" : "England"
}
}
The home phone numbers array could sometimes be empty.
My spark application receives a list of these JSONS and does this:
val dataframe = spark.read.json(filePaths: _*)
val result = dataframe.select($"name",
explode(dataframe.col("phone-numbers.home")))
When the 'home' array is empty, I receive the following error when I try to explode it:
org.apache.spark.sql.AnalysisException: cannot resolve
'phone-numbers['home']' due to data type mismatch: argument 2
requires integral type, however, ''home'' is of string type.;;
Is there an elegant way to prevent spark from exploding this field if it's empty or null?

The problem is not empty arrays ("home" : []) but null arrays ("home" : null), which do not work with explode.
So either filter out the null values first:
val result = df
.filter($"phone-numbers.home".isNotNull)
.select($"name", explode($"phone-numbers.home"))
or replace the null values with an empty array (which I would prefer in your situation):
val nullToEmptyArr = udf(
  // Spark passes array columns to Scala UDFs as Seq; the phone numbers are strings
  (arr: Seq[String]) => if (arr == null) Seq.empty[String] else arr
)
val result = df
  .withColumn("home", nullToEmptyArr($"phone-numbers.home")) // null-safe copy of the nested array
  .select($"name", explode($"home"))

Spark has a class called DataFrameNaFunctions that is specialized for working with missing data in DataFrames.
It provides three essential methods: drop, replace and fill.
To use them, call df.na, which returns a DataFrameNaFunctions for your DataFrame, then apply one of the three methods; each returns a new DataFrame with the operation applied.
To solve your problem you can use something like this:
val dataframe = spark.read.json(filePaths: _*)
val result = dataframe.na.drop().select($"name",
explode(dataframe.col("phone-numbers.home")))
Hope this helps. Best regards

Related

How to create a Hashmap in Scala with key as String and value as another String or another Hashmap

I want to create a variable in Scala which can hold data in the following format:
"one" -> "two",
"three" -> {"four" -> "five", "six" -> "seven"},
"eight" -> {"nine" -> { "ten" -> "eleven", "twelve" -> "thirteen"},
"fourteen" -> {"fifteen" -> "sixteen"}
}
I tried creating a Java HashMap using:
var requiredVar = new HashMap[String, Object]()
I am able to do something like :
var hm = new HashMap[String, String]
hm.put("four","five")
requiredVar.put("three",hm)
But if I try to add :
requiredVar.get("three").put("six","seven")
I get the error that
value put is not a member of Object
How can I get this done?
I have tried something like native to Scala as well:
val a = Map("one" -> "two" , "three" -> Map("four"->"five"))
a.get("three").put("six"->"seven")
but get the following error:
error: value put is not a member of Option[Any]
In the first case, when using Java, you get the error because the compiler doesn't know that the value retrieved from requiredVar is a Map.
You declare requiredVar to be a HashMap[String, Object], so the compiler will only know anything retrieved from the map is an Object, nothing more specific.
Specifically, in your case:
requiredVar.get("three").put("six","seven")
requiredVar.get("three") returns an Object, which doesn't have a put() method.
You are running into a similar issue in the Scala version of your code as well. When you create the Map:
val a = Map("one" -> "two" , "three" -> Map("four"->"five"))
the compiler must infer the types of the keys and values, which it is doing by finding the closest common ancestor of all the values, which for a String and another Map, is Any, Scala's equivalent to Java's Object. So when you try to do
a.get("three").put("six"->"seven")
a.get("three") is returning an Option[Any], which doesn't have a put method. By the way, Scala's Map.get returns an Option, so that if the key is not present in the map, a None is returned instead of an exception being thrown. You can also use the more concise a("three"), which returns the value type directly (in this case Any), but will throw an exception if the key is not in the map.
There are a few ways I can think of to achieve what you want to do.
1) Casting
If you are absolutely sure that the value you are retrieving from the map is another Map instead of a String, you can cast the value:
requiredVar.get("three").asInstanceOf[HashMap[String, String]].put("six","seven")
This is a fairly brittle approach, because if the value is actually a String, a ClassCastException will be thrown at runtime.
2) Pattern Matching
Rather than casting arbitrarily, you can test the retrieved value for its type, and only call put on values you know are maps:
requiredVar.get("three") match {
case m: HashMap[String, String] => m.put("six", "seven")
case _ => // the value is probably a string here, handle this how you'd like
}
This allows you to guard against the case that the value is not a map. However, it is still brittle because the value type is Any, so in the case _ branch you don't actually know that the value is a String, and you would have to pattern match or cast again to use it as one.
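For the Java HashMap version from the question, the same type test can be written with instanceof. This is only a minimal sketch: the class name NestedMapDemo and the helper putNested are made up for illustration and are not from the original post.

```java
import java.util.HashMap;
import java.util.Map;

class NestedMapDemo {
    // Hypothetical helper: adds key/value to the map stored under outerKey,
    // but only if that entry really is a Map. Returns false otherwise.
    @SuppressWarnings("unchecked")
    static boolean putNested(Map<String, Object> outer, String outerKey,
                             String key, String value) {
        Object inner = outer.get(outerKey);
        if (inner instanceof Map) {
            // Unchecked cast: generics are erased at runtime, so only the
            // raw Map type can be verified here.
            ((Map<String, Object>) inner).put(key, value);
            return true;
        }
        return false; // key is missing, or the value is a plain String
    }

    public static void main(String[] args) {
        Map<String, Object> requiredVar = new HashMap<>();
        requiredVar.put("one", "two");
        Map<String, Object> hm = new HashMap<>();
        hm.put("four", "five");
        requiredVar.put("three", hm);

        System.out.println(putNested(requiredVar, "three", "six", "seven")); // true
        System.out.println(putNested(requiredVar, "one", "six", "seven"));   // false
    }
}
```

As with the Scala pattern match, the check only proves that the value is some Map, not what its key and value types are.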
3) Create a new value type
Rather than rely on a top type like Object or Any, you can create types of your own to use as the value type. Something like the following could work:
import scala.collection.mutable.Map
sealed trait MyVal
case class StringVal(s: String) extends MyVal
case class MapVal(m: Map[String, String]) extends MyVal
object MyVal {
def apply(s: String): StringVal = StringVal(s)
def apply(m: Map[String, String]): MapVal = MapVal(m)
}
var rv = Map[String, MyVal]()
rv += "one" -> MyVal("two")
rv += "three" -> MyVal(Map[String, String]("four" -> "five"))
rv.get("three") match {
case Some(v) => v match {
case MapVal(m) => m("six") = "seven"
case StringVal(s) => // handle the string case as you'd like
}
case None => // handle the key not being present in the map here
}
The usage may look similar, but the advantage now is that the pattern match on the rv.get("three") is complete.
4) Union types
If you happen to be using a 3.x version of Scala, you can use a union type to specify exactly what types of values you will have in your map, and achieve something like the above option much less verbosely:
import scala.collection.mutable.Map
val rv: Map[String, String | Map[String, String]] = Map()
rv += "one" -> "two"
rv += "three" -> Map[String, String]("four" -> "five")
rv.get("three") match {
case Some(m: Map[String, String]) => m += "six" -> "seven"
case Some(s: String) => // handle string values
case None => // handle key not present
}
One thing to note though, with all of the above options, is that in Scala, it is preferable to use immutable collections, instead of mutable versions like HashMap or scala.collection.mutable.Map (which is by default a HashMap under the hood). I would do some research about immutable collections and try to think about how you can redesign your code accordingly.

Jackson & Scala: How to get property value from a list of objects by property value?

I'd like to get the requestedInstanceCount from instanceGroupName = slave. How can this be achieved with Jackson?
Below is the job-flow.json:
{
"generalId": "ABC",
"instanceCount": 4,
"instanceGroups": [
{
"instanceGroupId": "CDE",
"instanceGroupName": "master",
"requestedInstanceCount": 1
},
{
"instanceGroupId": "FGH",
"instanceGroupName": "slave",
"requestedInstanceCount": 8
}
]
}
So far this is what I have:
val jobFlowJson: String = new String(Files.readAllBytes(Paths.get("/mnt/var/lib/info/job-flow.json")))
val jsonNode = mapper.readValue(jobFlowJson, classOf[JsonNode])
val instanceCount = jsonNode.get("requestedInstanceCount").asInt
But there are 2 values and the order between master & slave can change at any time. Thanks in advance!
You have to go through the JSON tree step by step:
get instanceGroups as an array
iterate over the array to find the item you want
extract the value of requestedInstanceCount
Something like this (Scala, using Jackson's tree API):
import scala.collection.JavaConverters._
// elements() iterates over the items of the JSON array
val slaveCount: Option[Int] = jsonNode.get("instanceGroups")
  .elements().asScala
  .collectFirst {
    // "slave" is the group name asked for in the question
    case item if item.get("instanceGroupName").asText == "slave" =>
      item.get("requestedInstanceCount").asInt
  }
Or define some case classes representing the structure and parse your JSON into them. That will be much easier to work with, unless you have a specific reason not to do so.
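The three steps can also be sketched in plain Java. To keep the example self-contained, it assumes the JSON has already been parsed into nested java.util Map and List objects; that is a stand-in for Jackson's JsonNode, not Jackson's API. The class InstanceGroupLookup and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

class InstanceGroupLookup {
    // Returns requestedInstanceCount of the first group whose
    // instanceGroupName equals groupName, or empty if none matches.
    static Optional<Integer> requestedCount(Map<String, Object> root, String groupName) {
        Object groups = root.get("instanceGroups");      // step 1: get the array
        if (!(groups instanceof List)) {
            return Optional.empty();
        }
        for (Object item : (List<?>) groups) {           // step 2: find the item
            if (item instanceof Map) {
                Map<?, ?> group = (Map<?, ?>) item;
                Object count = group.get("requestedInstanceCount");
                if (groupName.equals(group.get("instanceGroupName"))
                        && count instanceof Number) {
                    // step 3: extract the value
                    return Optional.of(((Number) count).intValue());
                }
            }
        }
        return Optional.empty();
    }

    // Builds one entry of the instanceGroups array (test data helper).
    static Map<String, Object> group(String name, int requestedInstanceCount) {
        Map<String, Object> g = new HashMap<>();
        g.put("instanceGroupName", name);
        g.put("requestedInstanceCount", requestedInstanceCount);
        return g;
    }

    public static void main(String[] args) {
        Map<String, Object> root = new HashMap<>();
        root.put("instanceCount", 4);
        root.put("instanceGroups", Arrays.asList(group("master", 1), group("slave", 8)));
        System.out.println(requestedCount(root, "slave")); // Optional[8]
    }
}
```

Because the lookup goes by name rather than by position, it does not matter in which order master and slave appear in the array.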

How to check if a list of objects contains a key from a List of strings

Hello I'm still new to Java. I have a List of objects called formData coming from the UI which looks like this:
[
{
fieldName: "phone",
value: "1111111111"
}
]
I have a List of strings called requiredTypes that represent my keys that I need to check which is:
["phone", "DOB"]
I want to be able to loop through formData to see if my formData has these types in requiredTypes. If the fieldNames in formData contains these types from requiredTypes then return true. If not, return false.
So for this case, it will return false because it's missing a DOB.
I tried using Java 8 Streams to see if I could make the comparison but it's returning void
val hasFormType = formData.stream().forEach(fieldData ->
requiredTypes.forEach(requiredType ->
requiredType.contains(fieldData.getFieldName())));
Am I misusing streams? Is there a better approach to this case? Thank you for your time.
Collect the field names into a set and check that it contains all of the required fields:
Set<String> existingFields = formData.stream()
.map(FieldData::getFieldName).collect(Collectors.toSet());
boolean hasFormType = existingFields.containsAll(requiredTypes);
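Here is a self-contained sketch of that approach. The question does not show the form entry class, so the FieldData class below is a minimal stand-in, and RequiredFieldsCheck and hasAllRequired are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class RequiredFieldsCheck {
    // Minimal stand-in for the form entry type from the question.
    static class FieldData {
        private final String fieldName;
        private final String value;

        FieldData(String fieldName, String value) {
            this.fieldName = fieldName;
            this.value = value;
        }

        String getFieldName() { return fieldName; }
        String getValue() { return value; }
    }

    // True only if every required type appears as a fieldName in formData.
    static boolean hasAllRequired(List<FieldData> formData, List<String> requiredTypes) {
        Set<String> existingFields = formData.stream()
                .map(FieldData::getFieldName)
                .collect(Collectors.toSet());
        return existingFields.containsAll(requiredTypes);
    }

    public static void main(String[] args) {
        List<FieldData> formData = Arrays.asList(new FieldData("phone", "1111111111"));
        List<String> requiredTypes = Arrays.asList("phone", "DOB");
        System.out.println(hasAllRequired(formData, requiredTypes)); // false, DOB is missing
    }
}
```

The original attempt returned void because forEach is a terminal operation that produces no value; collecting into a Set and calling containsAll yields the boolean directly.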

convert scala hashmap with List to java hashmap with java list

I am new to Scala and Spark. I have the case class A below:
case class A(uniqueId : String,
attributes: HashMap[String, List[String]])
Now I have a DataFrame of type A. I need to call a Java function on each row of that DataFrame, so I need to convert the HashMap to a Java HashMap and the List to a Java List.
How can I do that?
I am trying the following:
val rddCaseClass = RDD[A]
val a = rddCaseClass.toDF().map ( x=> {
val rowData = x.getAs[java.util.HashMap[String,java.util.List[String]]]("attributes")
callJavaMethod(rowData)
But this is giving me error :
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to java.util.List
Please help.
You can convert a Scala WrappedArray to a Java List using scala.collection.JavaConversions:
import scala.collection.JavaConversions
import scala.collection.mutable.WrappedArray
val wrappedArray: WrappedArray[String] = WrappedArray.make(Array("Java", "Scala"))
val javaList = JavaConversions.mutableSeqAsJavaList(wrappedArray)
JavaConversions.asJavaList can also be used, but it is deprecated; use mutableSeqAsJavaList instead.
I think you could use Seq instead of List for your parameters. That way it should work with most Seq implementations, with no need to convert Seqs like WrappedArray.
val rddCaseClass = RDD[A]
val a = rddCaseClass.toDF().map ( x=> {
val rowData = x.getAs[java.util.HashMap[String, Seq[String]]]("attributes")
callJavaMethod(rowData)
})

How to get just the desired field from an array of sub documents in Mongodb using Java

I have just started using MongoDB. Below is my data structure.
It has an array of skillIDs, each of which has an array of activeCampaigns, and each activeCampaign has an array of callsByTimeZone.
What I am looking for in SQL terms is :
Select activeCampaigns.callsByTimeZone.label,
activeCampaigns.callsByTimeZone.loaded
from X
where skillID=50296 and activeCampaigns.campaignId=11371940
and activeCampaigns.callsByTimeZone.label='PT'
The output I am expecting is:
{"label":"PT", "loaded":1 }
The Command I used is
db.cd.find({ "skillID" : 50296 , "activeCampaigns.campaignId" : 11371940,
"activeCampaigns.callsByTimeZone.label" :"PT" },
{ "activeCampaigns.callsByTimeZone.label" : 1 ,
"activeCampaigns.callsByTimeZone.loaded" : 1 ,"_id" : 0})
The output I am getting is everything under activeCampaigns.callsByTimeZone, while I am expecting just the PT entry.
DataStructure :
{
"skillID":50296,
"clientID":7419,
"voiceID":1,
"otherResults":7,
"activeCampaigns":
[{
"campaignId":11371940,
"campaignFileName":"Aaron.name.121.csv",
"loaded":259,
"callsByTimeZone":
[{
"label":"CT",
"loaded":6
},
{
"label":"ET",
"loaded":241
},
{
"label":"PT",
"loaded":1
}]
}]
}
I tried the same in Java.
QueryBuilder query = QueryBuilder.start().and("skillID").is(50296)
.and("activeCampaigns.campaignId").is(11371940)
.and("activeCampaigns.callsByTimeZone.label").is("PT");
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone.label",1)
.append("activeCampaigns.callsByTimeZone.loaded",1).append("_id", 0);
DBCursor cursor = coll.find(query.get(), fields);
String campaignJson = null;
while(cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
campaignJson = campaignDBO.toString();
System.out.println(campaignJson);
}
The value obtained is everything under the callsByTimeZone array. I am currently parsing the JSON obtained and extracting only the PT values. Is there a way to query just the PT fields inside activeCampaigns.callsByTimeZone?
Thanks in advance. Sorry if this question has already been raised in the forum; I have searched a lot and failed to find a proper solution.
There are several ways of doing it, but you should not be using String manipulation (e.g. indexOf); the performance could be horrible.
The results in the cursor are nested Maps, representing the document in the database - a Map is a good Java-representation of key-value pairs. So you can navigate to the place you need in the document, instead of having to parse it as a String. I've tested the following and it works on your test data, but you might need to tweak it if your data is not all exactly like the example:
while (cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
List callsByTimezone = (List) ((DBObject) ((List) campaignDBO.get("activeCampaigns")).get(0)).get("callsByTimeZone");
DBObject valuesThatIWant = null; // stays null if no "PT" entry is found
for (Object o : callsByTimezone) {
DBObject call = (DBObject) o;
if (call.get("label").equals("PT")) {
valuesThatIWant = call;
}
}
}
Depending upon your data, you might want to add protection against null values as well.
The thing you were looking for ({"label":"PT", "loaded":1 }) is in the variable valuesThatIWant. Note that this, too, is a DBObject, i.e. a Map, so if you want to see what's inside it you need to use get:
valuesThatIWant.get("label"); // will return "PT"
valuesThatIWant.get("loaded"); // will return 1
Because DBObject is effectively a Map of String to Object (i.e. Map<String, Object>) you need to cast the values that come out of it (hence the ugliness in the first bit of code in my answer) - with numbers, it will depend on how the data was loaded into the database, it might come out as an int or as a double:
String theValueOfLabel = (String) valuesThatIWant.get("label"); // will return "PT"
double theValueOfLoaded = (Double) valuesThatIWant.get("loaded"); // will return 1.0
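Since the stored number may come back as an Integer, a Long or a Double, casting through Number avoids a ClassCastException when you guess the wrong one. A minimal sketch, using a plain Map as a stand-in for the DBObject (an assumption made so the example runs without the driver); NumericFieldDemo and asDouble are hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

class NumericFieldDemo {
    // Reads a numeric field regardless of whether it was stored as an
    // Integer, Long or Double. Returns fallback if absent or not numeric.
    static double asDouble(Map<String, Object> doc, String key, double fallback) {
        Object value = doc.get(key);
        return (value instanceof Number) ? ((Number) value).doubleValue() : fallback;
    }

    public static void main(String[] args) {
        Map<String, Object> call = new HashMap<>();
        call.put("label", "PT");
        call.put("loaded", 1); // stored as an Integer here

        System.out.println(asDouble(call, "loaded", -1.0)); // 1.0
        System.out.println(asDouble(call, "label", -1.0));  // -1.0, not a number
    }
}
```

With the real driver, the Object returned by valuesThatIWant.get("loaded") can be tested with instanceof Number in the same way.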
I'd also like to point out the following from my answer:
((List) campaignDBO.get("activeCampaigns")).get(0)
This assumes that "activeCampaigns" is a) a list and in this case b) only has one entry (I'm doing get(0)).
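If you cannot rely on those assumptions, you can loop over every campaign and check each step before casting. The sketch below uses plain Map and List objects as stand-ins for DBObject and the driver's list type (an assumption to keep it self-contained); CallsByTimeZoneLookup, findByLabel and entry are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CallsByTimeZoneLookup {
    // Returns the first callsByTimeZone entry with the given label across all
    // activeCampaigns, or null if the path is missing or nothing matches.
    static Map<?, ?> findByLabel(Map<String, Object> document, String label) {
        Object campaigns = document.get("activeCampaigns");
        if (!(campaigns instanceof List)) {
            return null; // field absent or not a list
        }
        for (Object campaign : (List<?>) campaigns) {
            if (!(campaign instanceof Map)) {
                continue;
            }
            Object calls = ((Map<?, ?>) campaign).get("callsByTimeZone");
            if (!(calls instanceof List)) {
                continue; // this campaign has no callsByTimeZone array
            }
            for (Object call : (List<?>) calls) {
                if (call instanceof Map && label.equals(((Map<?, ?>) call).get("label"))) {
                    return (Map<?, ?>) call;
                }
            }
        }
        return null;
    }

    // Builds one callsByTimeZone entry (test data helper).
    static Map<String, Object> entry(String label, int loaded) {
        Map<String, Object> e = new HashMap<>();
        e.put("label", label);
        e.put("loaded", loaded);
        return e;
    }

    public static void main(String[] args) {
        Map<String, Object> campaign = new HashMap<>();
        campaign.put("campaignId", 11371940);
        campaign.put("callsByTimeZone",
                Arrays.asList(entry("CT", 6), entry("ET", 241), entry("PT", 1)));
        Map<String, Object> document = new HashMap<>();
        document.put("skillID", 50296);
        document.put("activeCampaigns", Arrays.asList(campaign));

        Map<?, ?> pt = findByLabel(document, "PT");
        System.out.println(pt.get("loaded")); // 1
    }
}
```

Calling label.equals(...) on the search value rather than on the stored value also protects against entries that have no label field.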
You will also have noticed that the fields values you've set are almost entirely being ignored, and the result is most of the document, not just the fields you asked for. I'm pretty sure you can only define the top-level fields you want the query to return, so your code:
BasicDBObject fields = new BasicDBObject("activeCampaigns.callsByTimeZone.label",1)
.append("activeCampaigns.callsByTimeZone.loaded",1)
.append("_id", 0);
is actually exactly the same as:
BasicDBObject fields = new BasicDBObject("activeCampaigns", 1).append("_id", 0);
I think some of the points that will help you to work with Java & MongoDB are:
When you query the database, it will return the whole document that matches your query, i.e. everything from "skillID" downwards. If you want to select the fields to return, I think those will only be top-level fields. See the documentation for more detail.
To navigate the results, you need to know that DBObjects are returned, and that these are effectively a Map<String, Object> in Java. You can use get to navigate to the correct node, but you will need to cast the values into the correct shape.
Replacing the while loop in your Java code with the one below seems to give "PT" as output.
while(cursor.hasNext()) {
DBObject campaignDBO = cursor.next();
campaignJson = campaignDBO.get("activeCampaigns").toString();
int labelInt = campaignJson.indexOf("PT", -1);
String label = campaignJson.substring(labelInt, labelInt+2);
System.out.println(label);
}
