How to define Spark RDD transformation with non-Lambda Function - java

I recently started working with Spark and Java. I am currently experimenting with RDD transformations and actions. At the moment I am reading data from a CSV that contains some DateTime fields, then I apply a filter to keep only the rows that are younger than 2 days, and finally I check whether the resulting RDD is empty. I wrote a simple snippet that does what I want on a minimal level:
Function<List<String>, Boolean> filterPredicate = row -> new DateTime(row.get(1)).isAfter(dtThreshold);
sc.textFile(inputFilePath)
    .map(text -> Arrays.asList(text.split(",")))
    .filter(filterPredicate)
    .isEmpty();
In this simple case I have assumed that the DateTime objects always live in the same column. I now want to expand that to use multiple column indexes.
To do that I need to be able to define a predicate function with more than one line, which is why I have separated the predicate definition from the transformation code.
How am I supposed to define such a function?

Use the curly brace notation...
Function<List<String>, Boolean> filterPredicate = row -> {
    boolean isDateAfter = new DateTime(row.get(1)).isAfter(dtThreshold);
    boolean hasName = !row.get(2).isEmpty(); // compare contents, not references
    return isDateAfter && hasName;
};
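Note the trailing semicolon: with the block form the lambda is still the right-hand side of an assignment, so the closing brace must be followed by ;. Also, comparing strings with != checks reference identity in Java, which is why the sketch above tests isEmpty() instead. The predicate then plugs into the original pipeline unchanged:
sc.textFile(inputFilePath)
    .map(text -> Arrays.asList(text.split(",")))
    .filter(filterPredicate)
    .isEmpty();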

Related

jOOQ: returning a list with join, group by and count in a single object

Core question: how do you properly fetch information from a query into objects?
Idea
I am creating functions in my DAO, which come down to the following query:
select A.*, count(*)
from A
left join B on B.aId = A.aId
group by A.*
I'm looking for a way to create a jOOQ expression that just gives me a list (or something I can loop over) with objects A (POJO) and an Integer.
Concrete case
In my concrete case: A = Volunteer and B = VolunteerMatch, where I store several matches for each volunteer. B has (volunteerId, volunteerMatchId) as its primary
key. Thus this query returns both the information from the Volunteer and the number of matches. Clearly this can be done in two separate queries, but I want to do it in one!
Problem
I cannot find a single object to return from my function. I am trying to get something like List<VolunteerPojo, Integer>. Let me explain this better using examples and why they don't fit for me.
What I tried 1
SelectHavingStep<Record> query = using(configuration())
    .select(Volunteer.VOLUNTEER.fields())
    .select(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.count())
    .from(Volunteer.VOLUNTEER)
    .leftJoin(Volunteermatch.VOLUNTEERMATCH).on(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.eq(Volunteer.VOLUNTEER.VOLUNTEERID))
    .groupBy(Volunteer.VOLUNTEER.fields());
Map<VolunteerPojo, List<Integer>> map = query.fetchGroups(
    r -> r.into(Volunteer.VOLUNTEER).into(VolunteerPojo.class),
    r -> r.into(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.count()).into(Integer.class)
);
The problem with this is that I get a List of integers. But that is not what I want; I want a single Integer (the count will always return one row). Note: I don't want the solution "just create your own map without the list", since my gut says there is a solution inside jOOQ. I'm here to learn!
What I tried 2
SelectHavingStep<Record> query = using(configuration())
    .select(Volunteer.VOLUNTEER.fields())
    .select(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.count())
    .from(Volunteer.VOLUNTEER)
    .leftJoin(Volunteermatch.VOLUNTEERMATCH).on(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.eq(Volunteer.VOLUNTEER.VOLUNTEERID))
    .groupBy(Volunteer.VOLUNTEER.fields());
Result<Record> result = query.fetch();
for (Record r : result) {
    VolunteerPojo volunteerPojo = r.into(Volunteer.VOLUNTEER).into(VolunteerPojo.class);
    Integer count = r.into(Volunteermatch.VOLUNTEERMATCH.VOLUNTEERID.count()).into(Integer.class);
}
However, I do not want to return the Result object from my code. Everywhere I call this function I would be calling r.into(...).into(...), and the compiler won't complain whether it returns an Integer or a real POJO. I don't want that, in order to prevent future errors. But at least it doesn't come wrapped in a List, I suppose.
Reasoning
Either option is probably fine, but I have the feeling there is something better that I missed in the documentation. Maybe I can adapt (1) to not get a list of integers. Maybe I can change Result<Record> into something like Result<VolunteerPojo, Integer> to indicate what objects really are returned. A solution for each problem would be nice, since I am using jOOQ more and more and this would be a good learning experience!
So close! Don't use ResultQuery.fetchGroups(). Use ResultQuery.fetchMap() instead:
Map<VolunteerPojo, Integer> map =
using(configuration())
    .select(VOLUNTEER.fields())
    .select(VOLUNTEERMATCH.VOLUNTEERID.count())
    .from(VOLUNTEER)
    .leftJoin(VOLUNTEERMATCH)
    .on(VOLUNTEERMATCH.VOLUNTEERID.eq(VOLUNTEER.VOLUNTEERID))
    .groupBy(VOLUNTEER.fields())
    .fetchMap(
        r -> r.into(VOLUNTEER).into(VolunteerPojo.class),
        r -> r.get(VOLUNTEERMATCH.VOLUNTEERID.count())
    );
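Iterating over the result is then straightforward; a small usage sketch (getVolunteerId() is a hypothetical getter on the generated POJO):
map.forEach((volunteer, count) ->
    System.out.println(volunteer.getVolunteerId() + " has " + count + " matches"));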

Filtering JavaPairRDD based on a JavaRDD in Spark

I'm very new to Apache Spark. I need a Java solution for the problem below:
JavaPairRDD:    JavaRDD:    Desired Output:
1,USA           France      2,England
2,England       England     3,France
3,France
4,Italy
Edit:
Frankly, I have no idea what I can try. Like I said, I'm a complete newbie at Spark. I just thought I could use a method like intersection, but it requires another JavaPairRDD object. I think the filter method won't work for this problem either. For example,
Function<Tuple2<String, String>, Boolean> myFilter =
    new Function<Tuple2<String, String>, Boolean>() {
        public Boolean call(Tuple2<String, String> keyValue)
        {
            return ("some boolean expression");
        }
    };
myPairRDD.filter(myFilter);
I have no idea what kind of boolean expression I could write instead of "some boolean expression" in the function above. Sorry for my English, by the way.
There are at least three options:
map the JavaRDD to a JavaPairRDD with an arbitrary value, join, and map again to drop the dummy values
if the number of unique values in the JavaRDD is small, collect the distinct values, convert them to a Set, broadcast it, and use it to filter the JavaPairRDD (see the sketch after this list)
convert both RDDs to DataFrames and use an inner join followed by drop / select.
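A minimal sketch of the second option, assuming a JavaSparkContext sc, a JavaPairRDD<Integer, String> pairs holding the (id, country) pairs above, and a JavaRDD<String> countries (all three names are hypothetical):
// Collect the distinct values of the plain RDD and broadcast them as a Set.
Set<String> allowed = new HashSet<>(countries.distinct().collect());
Broadcast<Set<String>> allowedBc = sc.broadcast(allowed);

// Keep only the pairs whose value appears in the broadcast set.
JavaPairRDD<Integer, String> filtered =
    pairs.filter(kv -> allowedBc.value().contains(kv._2()));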

Merge CSV files with dynamic headers in Java

I have two or more .csv files which have the following data:
//CSV#1
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType
1, Test, 2014-04-03, 2, page
//CSV#2
Actor.id, Actor.DisplayName, Published, Object.id
2, Testing, 2014-04-04, 3
Desired Output file:
//CSV#Output
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page,
2, Testing, 2014-04-04, , , 3
In case some of you are wondering: the "." in the header is just additional information in the .csv file and shouldn't be treated as a separator (the "." results from the conversion of a JSON file to CSV, preserving the nesting level of the JSON data).
My problem is that so far I have not found any solution which accepts different column counts.
Is there a clean way to achieve this? I don't have any code so far, but I thought the following would work:
Read two or more files and add each row to a HashMap<Integer, String> // Integer = lineNumber, String = data, so that each file gets its own HashMap
Iterate through all indices and add the data to a new HashMap.
Why I think this idea is not so good:
If the header and the row data from file 1 differ from file 2 (etc.), the order won't be kept right.
I think the suggested approach might result in this:
//CSV#Suggested
Actor.id, Actor.DisplayName, Published, Target.id, Target.ObjectType, Object.id
1, Test, 2014-04-03, 2, page //wrong, because one "," is missing
2, Testing, 2014-04-04, 3 // wrong, because the 3 does not belong to Target.id. Furthermore the empty values won't be considered.
Is there a handy way to merge the data of two or more files without(!) knowing how many elements the headers contain?
This isn't the only answer, but hopefully it points you in a good direction. Merging is hard; you're going to have to give it some rules, and you need to decide what those rules are. Usually you can break it down to a handful of criteria and then go from there.
I wrote a "database" to deal with situations like this a while back:
https://github.com/danielbchapman/groups
It is basically just a Map<Integer, Map<Integer, Map<String, String>>>, which isn't all that complicated. What I'd recommend is that you read each row into a structure similar to:
(Set One) -> Map<Column, Data>
(Set Two) -> Map<Column, Data>
A Bidi map (as suggested in the comments) will make your lookups faster but carries some pitfalls if you have duplicate values.
Once you have these structures, your lookup can be as simple as:
public List<Data> process(Data one, Data two) // pseudo code
{
    List<Data> result = new ArrayList<>();
    for (Row row : one)
    {
        Id id = row.getId();
        Row additional = two.lookup(id);
        if (additional != null)
            merge(row, additional);
        result.add(row);
    }
    return result;
}
public void merge(Row a, Row b)
{
    // Your logic here... either mutating or returning a copy.
}
Nowhere in this solution am I worried about the columns, as this just acts on the raw data types. You can easily remap all the column names, either by storing them each time you do a lookup or by recreating them at output.
The reason I linked my project is that I'm pretty sure I have a few methods in there (such as outputting column names, etc.) that might save you considerable time / point you in the right direction.
I do a lot of TSV processing in my line of work and maps are my best friends.
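If you'd rather keep it column-oriented, here is a minimal sketch of the header-union idea, assuming plain comma-separated input with a header row and no quoted fields (a real CSV parser should handle quoting; all names here are illustrative):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Merges CSV files by the union of their headers; cells missing in a file stay empty.
public class CsvMerger {
    public static void merge(List<Path> inputs, Path output) throws IOException {
        Set<String> headers = new LinkedHashSet<>();        // union of headers, in first-seen order
        List<Map<String, String>> rows = new ArrayList<>(); // each row keyed by column name

        for (Path input : inputs) {
            List<String> lines = Files.readAllLines(input);
            String[] cols = lines.get(0).split(",");
            for (int i = 0; i < cols.length; i++) {
                cols[i] = cols[i].trim();
                headers.add(cols[i]);
            }
            for (String line : lines.subList(1, lines.size())) {
                String[] values = line.split(",", -1);      // -1 keeps trailing empty fields
                Map<String, String> row = new HashMap<>();
                for (int i = 0; i < cols.length && i < values.length; i++) {
                    row.put(cols[i], values[i].trim());
                }
                rows.add(row);
            }
        }

        List<String> out = new ArrayList<>();
        out.add(String.join(", ", headers));
        for (Map<String, String> row : rows) {
            List<String> cells = new ArrayList<>();
            for (String h : headers) {
                cells.add(row.getOrDefault(h, ""));
            }
            out.add(String.join(", ", cells));
        }
        Files.write(output, out);
    }
}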

object search optimization in memory

I have an ArrayList of MyObjects. The MyObjects class has more than 10 properties, but I need to search by only 4 of them. The user presses a button and can select values for property1.
Let's say the user selects property1Value1, property1Value2 and property1Value4; then he presses a button and makes a selection for the property2 values: property2Value1, property2Value5, property2Value7 and so on. Those are filter1 and filter2.
property2Value2, property2Value3 and property2Value4 are not visible to the user because they were filtered out by filter1. It is like doing a search before entering a new filter screen.
I need to store somewhere what was selected in each filter, because when the user navigates back I must show him the selected values.
I think it is easier to understand with pictures, since something similar is implemented at eBay:
No filters at beginning: user able to select all values for each property:
The user selected "Tablet" for type property. - a search is done and some property values aren't visible anymore:
The second filter value is selected:
Pressing the search (automatically), I should do something like this in SQL:
SELECT * FROM MyObjects WHERE
( (property1 = property1Value1) || (property1 = property1Value2) || (property1 = property1Value4) )
AND
( (property2 = property2Value1) || (property2 = property2Value5) )
Since I have the objects in memory, I don't think it is a good idea to create an SQLite3 database, write everything out and then select. In the iOS implementation I wrote a very complex caching algorithm, caching the filter values separately, with a lot of auxiliary index holders (at least 20), because for each filter there is some extra work to do (not mentioned here), and the data is stored only once.
I am scared of rewriting the algorithm I have on iOS for Android; there must be something easier.
Edit:
Basically I need to rewrite that SQL search as a Java object search.
Edit2:
Based on the answer with Multimap:
The Multimap is no better than a HashMap<String, ArrayList<Integer>>
where the key is the value of a property (e.g. property2Value3) and the value is a list of indexes into my ArrayList<MyObjects> (1, 2, 3, 4, 5 ... 100).
For each filter and each filter value I need to build up the HashMap<String, ArrayList<Integer>>, and then I am exactly where I was on iOS... maybe with a few auxiliary collections less.
Any idea?
What you're talking about is basically indexing. A similar approach to what you describe is perfectly manageable in Java; it just takes the same careful coding it would in Objective-C.
You haven't specified much about questions like whether multiple items are allowed to have the same values in their fields, so I'll presume they are. In that case, here's how I'd start:
Use Guava's Multimap, probably HashMultimap, where the key is the property being indexed and each object being indexed is put into the map under that key.
When you're trying to search on multiple fields, call multimap.get(property) to get a Collection of all the objects that match that property, and keep only the objects that match all the properties:
Set<Item> items = new HashSet<>(typeMultimap.get("tablet")); // Set is an interface, so instantiate a HashSet
items.retainAll(productLineMultimap.get("Galaxy Tab"));
// your results are now in "items"
If your property list is stable, write a wrapper Indexer class that has fields for all of the Multimaps, ensures that objects are inserted into and removed from all of the property indexes, and maybe has convenience wrappers for the map getters.
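A rough sketch of such a wrapper, assuming Guava on the classpath and a hypothetical Item class with getType() and getProductLine() accessors:
import java.util.HashSet;
import java.util.Set;
import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;

class Indexer {
    private final Multimap<String, Item> byType = HashMultimap.create();
    private final Multimap<String, Item> byProductLine = HashMultimap.create();

    // Both indexes are updated together, so they can never disagree.
    void add(Item item) {
        byType.put(item.getType(), item);
        byProductLine.put(item.getProductLine(), item);
    }

    void remove(Item item) {
        byType.remove(item.getType(), item);
        byProductLine.remove(item.getProductLine(), item);
    }

    // Intersect the per-property matches, as in the retainAll() example above.
    Set<Item> search(String type, String productLine) {
        Set<Item> result = new HashSet<>(byType.get(type));
        result.retainAll(byProductLine.get(productLine));
        return result;
    }
}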
How does MySQL execute that SQL behind the scenes? A MyISAM table has one file where it keeps the data and another file with the id positions.
SELECT * FROM mytable will put all ids into the result set, because there is no filter.
Because of the *, it will copy all fields for each id. This is equivalent to:
ArrayList<MyObject> result = new ArrayList<MyObject>();
for (int i = 0; i < listMyObjects.size(); i++) {
    if (true == true) { // SELECT * FROM has a hidden WHERE 1, which is always true
        result.add(listMyObjects.get(i));
    }
}
In the case of a filter, it keeps a list of filters:
ArrayList<String> filterByProperty1 = new ArrayList<String>();
In the filter interface I will add some Strings: property1Value1, property1Value2, .... The search algorithm will be:
ArrayList<MyObject> result = new ArrayList<MyObject>();
for (int i = 0; i < listMyObjects.size(); i++) {
    MyObject curMyObject = listMyObjects.get(i);
    // let's see if it passes the filters, if any exist
    boolean property1Allow = false;
    boolean property2Allow = false;
    if (filterByProperty1.size() > 0) {
        String theCurProperty1Value = curMyObject.getProperty1();
        if (filterByProperty1.contains(theCurProperty1Value)) {
            property1Allow = true;
        }
    }
    else { // no filter by property1: allowed to add to the result
        property1Allow = true;
    }
    // do the same with property2, 3 and 4 (too lazy to write it out)
    if (property1Allow && property2Allow) {
        result.add(curMyObject); // add the object itself, not the property value
    }
}
Not sure if this is a lot slower, but at least I have escaped from the tens or hundreds of auxiliary collections and indexes. After this I will do the extra required work and it's done.

Sort Order in HBase with Pig/Piglatin in Java

I created an HBase table in the shell and added some data. http://hbase.apache.org/book/dm.sort.html says that datasets are first sorted by the rowkey and then by the column. So I tried something in the HBase shell:
hbase(main):013:0> put 'mytable', 'key1', 'cf:c', 'val'
0 row(s) in 0.0110 seconds
hbase(main):011:0> put 'mytable', 'key1', 'cf:d', 'val'
0 row(s) in 0.0060 seconds
hbase(main):012:0> put 'mytable', 'key1', 'cf:a', 'val'
0 row(s) in 0.0060 seconds
hbase(main):014:0> get 'mytable', 'key1'
COLUMN    CELL
 cf:a     timestamp=1376468325426, value=val
 cf:c     timestamp=1376468328318, value=val
 cf:d     timestamp=1376468321642, value=val
3 row(s) in 0.0570 seconds
Everything looks fine. I got the right order, a -> c -> d, as expected.
Now I tried the same with Apache Pig in Java:
pigServer.registerQuery("mytable_data = load 'hbase://mytable' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf', '-loadKey true') as (rowkey:chararray, columncontent:map[]);");
printAlias("mytable_data"); // own function, which itereate over the keys
I got this result:
(key1,[c#val,d#val,a#val])
So now the order is c -> d -> a. That seems a little odd to me; shouldn't it be the same as in HBase? It's important for me to get the right order because I transform the map into a bag afterwards and then join it with other tables. If both inputs are sorted I could use a merge join without sorting these two datasets?! So does anyone know how it is possible to get a sorted map (or bag) of the columns?
You're fundamentally misunderstanding something -- the HBaseStorage backend loads each row as a single Tuple. You've told Pig to load the column family cf as a map[], which is exactly what Pig is doing. A Pig map under the hood is just a java.util.HashMap, which obviously has no order.
There is currently no way in Pig to convert a map to a bag, but that should be a trivial UDF to write. Barring the null checks and other boilerplate, the body is something like:
public DataBag exec(Tuple input) {
    DataBag resultBag = bagFactory.newDefaultBag();
    HashMap<String, Object> map = (HashMap<String, Object>) input.get(0);
    for (Map.Entry<String, Object> entry : map.entrySet()) { // iterate the entry set, not the map itself
        Tuple t = tupleFactory.newTuple();
        t.append(entry.getKey());
        t.append(entry.getValue().toString());
        resultBag.add(t);
    }
    return resultBag;
}
With that you can generate a bag{(k:chararray, v:chararray)}, use FLATTEN to get (k:chararray, v:chararray) pairs, and ORDER those by k.
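For illustration, the wiring from the Java side might look like this, assuming the UDF above is packaged as a hypothetical com.example.MapToBag class (the jar path and alias names are placeholders):
pigServer.registerJar("/path/to/udfs.jar"); // jar containing the hypothetical MapToBag UDF
pigServer.registerQuery("with_bag = foreach mytable_data generate rowkey, FLATTEN(com.example.MapToBag(columncontent)) as (k:chararray, v:chararray);");
pigServer.registerQuery("sorted = order with_bag by rowkey, k;");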
As for whether there is a way to get the data sorted -- generally, no. If the number of fields in the column family is not constant, or the fields are not always the same / defined, your only options are:
transforming the map to a bag of tuples and sorting that, or
writing a custom LoadFunc which takes a table and a column family and emits a tuple per KeyValue pair scanned. HBase will ensure the ordering and give you the data in the sorted order you see in the shell, but note that the order is only guaranteed upon loading. Any further transformation you apply ruins that.
