I'm new to Java. I want to drop partitions in a Hive table. I want to use SparkSession.ExternalCatalog().listPartitions and SparkSession.ExternalCatalog().dropPartitions.
I saw these methods used in Scala in How to truncate data and drop all partitions from a Hive table using Spark.
But I can't understand how to call them from Java. This is part of an ETL process, and I want to understand how to deal with it in Java.
My code fails because I don't understand how to work with the data types involved and convert them to Java: what type of objects the methods need, and how to understand what data the API returns.
Example of my code:
ExternalCatalog ec = SparkSessionFactory.getSparkSession().sharedState().externalCatalog();
ec.listPartitions("need_schema", "need_table");
And it fails with:
method listPartitions in class org.apache.spark.sql.catalog.ExternalCatalog cannot be applied to given types.
I can't get past it because there is so little information about the API (https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-ExternalCatalog.html#listPartitions), because of my limited Java knowledge, and because all the examples I can find are written in Scala.
Finally, I need to convert this Scala code, which works, to Java:
def dropPartitions(spark: SparkSession, schema: String, table: String, need_date: String): Unit = {
  val cat = spark.sharedState.externalCatalog
  val partit = cat.listPartitions(schema, table).map(_.spec).map(t => t.get("partition_field")).flatten
  val filteredPartit = partit.filter(_ < need_date).map(x => Map("partition_field" -> x))
  cat.dropPartitions(
    schema
    , table
    , filteredPartit
    , ignoreIfNotExists = true
    , purge = false
    , retainData = false
  )
}
Please, if you know how to deal with this, could you help me with these things:
some example code in Java for writing my own container to manipulate the data that comes back from the externalCatalog
what data structures this API uses, and some theoretical source that can help me understand how to use them from Java
what does this line of Scala code mean: cat.listPartitions(schema, table).map(_.spec).map(t => t.get("partition_field")).flatten?
Thanks.
UPDATING
Thank you very much for your feedback @jpg. I'll try. I have a big ETL task, and its goal is to write data into a dynamically partitioned table once a week. The business rule for building this datamart is (sysdate - 90 days), and because of that I want to drop ranges of partitions (by day) in the target table, which lives in a public-access schema. I have read that the right way to drop partitions is through the externalCatalog. I have to use Java because of the historical tradition of this project) and I'm trying to understand how to do this most efficiently. Some methods of the externalCatalog I can already print to the terminal through System.out.println():
externalCatalog.tableExists(), externalCatalog.listTables() and the methods of externalCatalog.getTable(). But I don't understand how to deal with externalCatalog.listPartitions().
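As a small illustration of that business rule (just a sketch; the 90-day window comes from the rule above, and the yyyy-MM-dd format matches the partition values shown further down):
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Cutoff for the weekly job: partitions older than (today - 90 days) should be dropped.
LocalDate cutoff = LocalDate.now().minusDays(90);
String dateLessDelete = cutoff.format(DateTimeFormatter.ofPattern("yyyy-MM-dd"));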
UPDATING ONE MORE TIME
Hello everyone. I have made one step forward in my task:
Now I can print the list of partitions to the terminal:
ExternalCatalog ec = SparkSessionFactory.getSparkSession().sharedState().externalCatalog();
ec.listPartitions("schema", "table", Option.empty()); // works! passing null or omitting the parameter fails the program
Seq<CatalogTablePartition> ctp = ec.listPartitions("schema", "table", Option.empty());
List<CatalogTablePartition> catalogTablePartitions = JavaConverters.seqAsJavaListConverter(ctp).asJava();
for (CatalogTablePartition catalogTablePartition : catalogTablePartitions) {
    System.out.println(catalogTablePartition.toLinkedHashMap().get("Partition Values")); // returns the partition value, e.g. "Some([validation_date=2021-07-01])"
}
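As far as I understand it, the Some(...) wrapper shows up because these are Scala maps, and a Scala Map's get(key) returns an Option rather than the raw value. Inside the loop above, a single value could be unwrapped like this (a sketch, assuming the partition key is validation_date as in the printed output):
// spec() returns the partition spec as a scala.collection.immutable.Map<String, String>;
// its get(key) returns an Option, hence the Some(...) in the printed output.
Option<String> value = catalogTablePartition.spec().get("validation_date");
String partitionValue = value.isDefined() ? value.get() : null;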
But here is the next problem.
I can't pass a Java List to ec.dropPartitions: its third parameter wants a Seq<Map<String, String>>. I also can't filter the partitions in this form. Ideally I want to filter out the partition values that are less than a date parameter and then drop them.
If anyone knows how to write the map step with this API so that it returns the same thing as in my Scala example, please help.
I solved it by myself. Maybe it'll help someone.
// Imports used. Note that Seq and Map here are the Scala collections, not java.util.
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.stream.Collectors;
import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition;
import org.apache.spark.sql.catalyst.catalog.ExternalCatalog;
import scala.Option;
import scala.collection.JavaConverters;
import scala.collection.Seq;
import scala.collection.immutable.Map;

public static void partitionDeleteLessDate(String db_name, String table_name, String date_less_delete) {
    ExternalCatalog ec = SparkSessionFactory.getSparkSession().sharedState().externalCatalog();
    // List all partitions of the table and convert the Scala Seq to a Java List.
    Seq<CatalogTablePartition> ctp = ec.listPartitions(db_name, table_name, Option.empty());
    List<CatalogTablePartition> catalogTablePartitions = JavaConverters.seqAsJavaListConverter(ctp).asJava();
    // Each spec() is the partition spec as a Scala Map, e.g. {partition_name=2021-07-01}.
    List<Map<String, String>> allPartList = catalogTablePartitions.stream()
            .map(s -> s.spec())
            .collect(Collectors.toList());
    // Pull out the partition values themselves (the Scala Map's get returns an Option, hence the extra .get()).
    List<String> datePartDel = allPartList.stream()
            .map(x -> x.get("partition_name").get())
            .sorted()
            .collect(Collectors.toList());
    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    LocalDate date = LocalDate.parse(date_less_delete, formatter);
    // Keep only the partition values that fall before the cutoff date.
    List<String> filteredDates = datePartDel.stream()
            .map(s -> LocalDate.parse(s, formatter))
            .filter(d -> d.isBefore(date))
            .map(s -> s.toString())
            .collect(Collectors.toList());
    for (String seeDate : filteredDates) {
        // Collect the full specs matching this value and convert them back to a Scala Seq for dropPartitions.
        List<Map<String, String>> elem = allPartList.stream()
                .filter(x -> x.get("partition_name").get().equals(seeDate))
                .collect(Collectors.toList());
        Seq<Map<String, String>> seqElem = JavaConverters.asScalaIteratorConverter(elem.iterator()).asScala().toSeq();
        ec.dropPartitions(
                db_name
                , table_name
                , seqElem
                , true   // ignoreIfNotExists
                , false  // purge
                , false  // retainData
        );
    }
}
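For the weekly job described above, the call then looks something like this (the schema, table name and cutoff date below are just placeholder values):
// Hypothetical invocation: drop every partition whose value sorts before 2021-04-01.
partitionDeleteLessDate("need_schema", "need_table", "2021-04-01");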
Related
Java 11 here. I have a List<Foobar> as well as a Map<Foobar,List<String>>.
I would like to iterate over the list and:
if the current Foobar is a key in the map, add a specific string ("Can't please everyone") to that entry's value list
if the current Foobar is not a key in the map, add it as a new key, with a value that is an ArrayList consisting of a single string with that same value
I can accomplish this like so:
List<Foobar> foobarList = getSomehow();
Map<Foobar,List<String>> foobarMap = getItSomehow();
String msg = "Can't please everyone";
for (Foobar fb : foobarList) {
if (foobarMap.containsKey(fb)) {
foobarMap.get(fb).add(msg);
} else {
foobarMap.put(fb, Collections.singletonList(msg));
}
}
This works great, but I'm trying to get this to work using the Java Stream API. My best attempt thus far:
List<Foobar> foobarList = getSomehow();
Map<Foobar,List<String>> foobarMap = getItSomehow();
String msg = "Can't please everyone";
foobarList.stream()
.filter(fb -> foobarMap.containsKey(fb))
.map(fb -> foobarMap.get(fb).add(msg))
.filter(fb -> !foobarMap.containsKey(fb))
.map(fb -> foobarMap.put(fb. Collections.singleton(msg));
Yields several compiler errors. Can anyone spot where I'm going awry?
Streams are used either
To modify the contents of the stream elements, or
To produce another stream from it, or
To iterate over the elements and do something that doesn't affect the elements of this stream.
Since your use case is the last type, the logical operation is simply forEach(..). (I know it is a dampener :-), but that is what the use case calls for.)
foobarList.forEach( fb -> {
if (foobarMap.containsKey(fb)) {
foobarMap.get(fb).add(msg);
} else {
foobarMap.put(fb, Collections.singletonList(msg));
}
} );
As noticed by @Sree Kumar, you should use forEach().
However, I would suggest leveraging the Map.merge() method:
foobarList.forEach(fb -> foobarMap.merge(fb, Collections.singletonList(msg),
    (l1, l2) -> Stream.concat(l1.stream(), l2.stream()).collect(Collectors.toList())));
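If the value lists are meant to stay mutable (the original loop calls add() on them), a computeIfAbsent variant is another option. This is just a sketch, assuming ArrayList values are acceptable:
// Create an empty ArrayList the first time a key is seen, then append in place.
foobarList.forEach(fb -> foobarMap.computeIfAbsent(fb, k -> new ArrayList<>()).add(msg));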
I am looking at code that has deeply nested for loops, which I wanted to rewrite in a pure functional form using Java 8 streams, but multiple values are needed at each level and I am not sure how to approach this in a clean way.
List<Report> reports = new ArrayList<>();
for (DigitalLogic dl : digitalLogics){
for (Wizard wiz : dl.getWizards()){
for(Vice vice : wiz.getVices()){
reports.add(createReport(dl, wiz, vice));
}
}
}
//
Report createReport(DigitalLogic dl, Wizard wiz, Vice vice){
//Gets certain elements from all parameters and creates a report object
}
My real case scenario is much more complicated than this, but I am wondering if there is a cleaner, purely functional way of writing this using streams. Below is my initial attempt:
List<Report> reports = new ArrayList();
digitalLogics.stream()
.map(dl -> dl.getWizards())
.flatMap(List::stream)
.map(wiz -> wiz.getVices())
.flatMap(List::stream)
.forEach(vice -> reports.add(createReport(?, ?, vice)));
Obviously, I have lost the DigitalLogic and Wizard references.
I would go with the forEach method, because a stream solution makes this complicated:
List<Report> reports = new ArrayList<>();
digitalLogics.forEach(dl->dl.getWizards()
.forEach(wiz->wiz.getVices()
.forEach(v->reports.add(createReport(dl, wiz, v)))));
Though what you currently have (for loops) is much cleaner than what it would be with streams, here it is if you still want to try it out:
public void createReports(List<DigitalLogic> digitalLogics) {
List<Report> reports = digitalLogics.stream()
.flatMap(dl -> dl.getWizards().stream()
.map(wizard -> new AbstractMap.SimpleEntry<>(dl, wizard)))
.flatMap(entry -> entry.getValue().getVices().stream()
.map(vice -> createReport(entry.getKey(), entry.getValue(), vice)))
.collect(Collectors.toList());
}
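A note on the design choice: the SimpleEntry pair only exists to carry the DigitalLogic reference along to the second flatMap. If you are willing to nest the lambdas instead, dl and wizard stay in scope and no intermediate pair object is needed. A sketch of that variant, assuming getWizards() and getVices() both return Lists:
List<Report> reports = digitalLogics.stream()
        .flatMap(dl -> dl.getWizards().stream()
                .flatMap(wiz -> wiz.getVices().stream()
                        .map(vice -> createReport(dl, wiz, vice))))
        .collect(Collectors.toList());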
I have a streaming RDD and applied a Java POJO to it, so now I have inputRDD, which contains an id and other details. I want to group/filter by id, and then each resulting RDD should be saved to its own DB. I have tried this and the code works with a for loop, but it needs to happen with Spark parallel processing. Any help is appreciated.
messages.transform(this::getClass).foreachRDD(inputRDD -> {
    List<String> idList = inputRDD.map(ClassObject::getEmpid).distinct().collect();
    for (String id : idList) {
        String EmpName = EmpCache.getEmpNameFor(id).toLowerCase();
        // Keep only the rows for this id, then write them to that employee's own table.
        JavaRDD<ClassObject> byId = inputRDD.filter(f -> f.getEmpid().equals(id));
        javaFunctions(byId).writerBuilder(in_EmpName, tableName, mapToRow(agg)).saveToCassandra();
    }
});
I have two lists: one is messagePermissionResponseDTOList and the other is dispatchSMSQList.
I want to take the list of blocked numbers out of dispatchSMSQList. Below is my working code snippet.
Can you please guide me on how this can be converted to a lambda expression, or tell me whether it is already correct?
A working link with test data is https://repl.it/repls/FriendlyImmenseClasses
TreeSet<Long> blockedNumbersSet = new TreeSet<>();
for (MessagePermissionResponseDTO permission: messagePermissionResponseDTOList) {
if (permission.isBlocked()) {
blockedNumbersSet.add(permission.getPhoneNumber());
}
}
List<DispatchSMSQ> blockedNumbers = dispatchSMSQList.stream().filter(t -> blockedNumbersSet.contains(t.getMdn())).collect(Collectors.toList());
System.out.print("blockedNumbers-->"+ blockedNumbers.size());
You may do it like so,
List<DispatchSMSQ> blockedNumbers = messagePermissionResponseDTOList.stream()
.filter(MessagePermissionResponseDTO::isBlocked)
.map(MessagePermissionResponseDTO::getPhoneNumber)
.collect(Collectors.collectingAndThen(Collectors.toSet(),
s -> dispatchSMSQList.stream()
.filter(d -> s.contains(d.getMdn())).collect(Collectors.toList())));
This should do it: first collect the blocked numbers into a set, then reuse the same contains filter over dispatchSMSQList:
Set<Long> blockedNumbersSet = messagePermissionResponseDTOList
        .stream()
        .filter(MessagePermissionResponseDTO::isBlocked)
        .map(MessagePermissionResponseDTO::getPhoneNumber)
        .collect(Collectors.toSet());
List<DispatchSMSQ> blockedNumbers = dispatchSMSQList.stream()
        .filter(t -> blockedNumbersSet.contains(t.getMdn()))
        .collect(Collectors.toList());
I'd like to imagine there's existing API functionality for this. Suppose there was Java code that looks something like this:
JavaRDD<Integer> queryKeys = ...; //values not particularly important
List<Document> allMatches = db.getCollection("someDB").find(queryKeys); //doesn't work, I'm aware
JavaPairRDD<Integer, Iterator<ObjectContainingKey>> dbQueryResults = ...;
Goal of this: After a bunch of data transformations, I end up with an RDD of integer keys that I'd like to make a single db query with (rather than a bunch of queries) based on this collection of keys.
From there, I'd like to turn the query results into a pair RDD of each key and all of its results in an iterator (making it easy to hit the ground running for the next steps I intend to take). And to clarify, I mean a pair of the key and its results as an iterator.
I know there's functionality in MongoDB capable of coordinating with Spark, but I haven't found anything that'll work with this yet (it seems to lean towards writing to a database rather than querying it).
I managed to figure this out in an efficient enough manner.
JavaRDD<Integer> queryKeys = ...;
JavaRDD<BasicDBObject> queries = queryKeys.map(value -> new BasicDBObject("keyName", value));
BasicDBObject orQuery = SomeHelperClass.buildOrQuery(queries.collect());
List<Document> queryResults = db.getCollection("docs").find(orQuery).into(new ArrayList<>());
JavaRDD<Document> parallelResults = sparkContext.parallelize(queryResults);
JavaRDD<ObjectContainingKey> results = parallelResults.map(doc -> SomeHelperClass.fromJSONtoObj(doc));
JavaPairRDD<Integer, Iterable<ObjectContainingKey>> keyResults = results.groupBy(obj -> obj.getKey());
And here is the buildOrQuery method:
public static BasicDBObject buildOrQuery(List<BasicDBObject> queries) {
BasicDBList or = new BasicDBList();
for(BasicDBObject query : queries) {
or.add(query);
}
return new BasicDBObject("$or", or);
}
Note that there's a fromJSONtoObj method that will convert your object back from JSON into all of the required field variables. Also note that obj.getKey() is simply a getter method associated with whatever the "key" is.