Calling pipe() from a PairRDD and passing a Java Object to it


I have a PairRDD like JavaPairRDD<String, Graph> where Graph is a Java object I created using
PairFunction<Row, String, Graph> pairFunction = new PairFunction<Row, String, Graph>() {
private static final long serialVersionUID = 1L;
public Tuple2<String, Graph> call(Row row) throws Exception {
Integer parameter = row.getAs("foo");
String otherParameter = row.getAs("bar");
Graph graph = new Graph( parameter, otherParameter );
String key = someKeyGenerator();
return new Tuple2<String, Graph>( key, graph );
}
};
Now I need to run an external program using myPairRdd.pipe("external.sh"), but I think Spark will pass the Graph object to external.sh via stdin.
I need to access Graph.parameter and Graph.otherParameter inside external.sh.
How can I manage this situation?

Found it!
I just need to override the toString() method of my POJO (Graph) to expose the desired attributes.
In this case:
@Override
public String toString() {
return this.parameter + "," + this.otherParameter;
}
Now the output is:
(62,foo,bar)
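As far as I know, pipe() hands each element to the external process as one line of text built from its string form, which is why the tuple shows up wrapped in parentheses. A minimal sketch of an alternative, assuming you would rather not tie toString() to the script's input format: build the line explicitly and map to it before piping. The names GraphLine, toPipeLine and the tab delimiter are hypothetical, and Graph here is only a stand-in for the class in the question.

```java
// Sketch only: plain Java, no Spark on the classpath.
class GraphLineDemo {
    public static void main(String[] args) {
        Graph graph = new Graph(62, "foo");
        // This is the line the external script would read from stdin.
        System.out.println(GraphLine.toPipeLine("key1", graph));
    }
}

// Stand-in for the Graph class from the question.
class Graph {
    final Integer parameter;
    final String otherParameter;

    Graph(Integer parameter, String otherParameter) {
        this.parameter = parameter;
        this.otherParameter = otherParameter;
    }
}

// Hypothetical helper that formats one key/Graph pair as a single delimited line.
class GraphLine {
    static final String DELIMITER = "\t";

    static String toPipeLine(String key, Graph graph) {
        return key + DELIMITER + graph.parameter + DELIMITER + graph.otherParameter;
    }
}
```

In Spark this could be applied with something like myPairRdd.map(t -> GraphLine.toPipeLine(t._1(), t._2())).pipe("external.sh") (or an anonymous Function on Java 7), and the script can then split each line on the tab character without stripping parentheses.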

Related

Searching on an object with different keys in Java

I'm a developer transitioning from C++ to Java, so I still don't have all the expertise needed to get things done the Java way.
I have the following class
class Site
{
String siteName;
Integer siteId;
Integer views;
Integer searches;
}
I maintain 2 maps to search over the objects of this class(using sitename or siteid)
HashMap<String, Site> siteNameToSiteMap;
HashMap<Integer, Site> siteIdToSiteMap;
However, going forward I have to add one more field called parentBrand to the class Site. This will force me to create another map to be able to search over it.
HashMap<String, Site> parentBrandToSiteMap;
Such "indexing" fields might increase over time, and with them the number of maps I maintain.
I remember using the Boost multi-index container to solve a similar issue in C++. Is there an equivalent, well-supported, well-documented library in Java that I can use? If not, is there a way I can refactor my code to solve my problem?

I'm surprised that there isn't a version of something like the Boost multi-index containers available. (Maybe there is somewhere...) But it's not too hard to hook up your own version in Java.
A rough, but working, version might look like this:
The basic site object
I've used a slightly different Site object, just to keep things simple (and because I didn't have access to this post on the bus...)
public class Site {
Integer id;
String name;
String rating;
// .. Constructor and toString removed for brevity
}
A wrapped version
I'm going to introduce some workhorse classes later, but they're a little ugly. This is just to show what the final interface would look like once you've wrapped it up a little:
class SiteRepository {
private final MultiMap<Site> sites = new MultiMap<>();
public final AbstractMap<String, Site> byName = sites.addIndex((site)->site.name);
public final AbstractMap<Integer,Site> byId = sites.addIndex((site)->site.id);
public final AbstractMap<String,List<Site>> byRating = sites.addMultiIndex((Site site)->site.rating);
public void add(Site s) { sites.add(s); }
}
SiteRepository repo = new SiteRepository();
repo.add(...);
Site site = repo.byId.get(1234);
repo.byId.forEach((Integer id, Site s) -> System.err.printf(" %s => %s\n", id, s));
The MultiMap core
Probably should be called MultiIndex since MultiMap means something else...
public static class MultiMap<V> {
public static class MultiMapIndex<K,V> extends AbstractMap<K,V> {
@Override
public Set<Entry<K, V>> entrySet() {
return map.entrySet();
}
HashMap<K,V> map = new HashMap<>();
}
public <K> MultiMapIndex<K,V> addIndex(Function<V, K> f) {
MultiMapIndex<K,V> result = new MultiMapIndex<>();
Consumer<V> e = (V v) -> result.map.put(f.apply(v), v);
mappers.add(e);
values.forEach(e);
return result;
}
public <K> MultiMapIndex<K,List<V>> addMultiIndex(Function<V, K> f) {
MultiMapIndex<K,List<V>> result = new MultiMapIndex<>();
Consumer<V> e = (V v) -> {
K key = f.apply(v);
List<V> list = result.map.get(key);
if (list == null) {
list = new ArrayList<>();
result.map.put(key, list);
}
list.add(v);
};
mappers.add(e);
values.forEach(e);
return result;
}
public void add(V v) {
values.add(v);
mappers.forEach( e -> e.accept(v));
}
private List<Consumer<V>> mappers = new ArrayList<>();
private List<V> values = new ArrayList<>();
}
More low level examples
public static void main(String[] args) {
// Create a multi-map
MultiMap<Site> multiMap = new MultiMap<>();
// Add an index by Site.id
MultiMapIndex<Integer, Site> byId = multiMap.addIndex((site)->site.id);
// Add some entries to the map
multiMap.add(new Site(1234,"A Site","A"));
multiMap.add(new Site(4321,"Another Site","B"));
multiMap.add(new Site(7777,"My Site","A"));
// We can add a new index after the entries have been
// added - this time by name.
MultiMapIndex<String, Site> byName = multiMap.addIndex((site)->site.name);
// Get a value by Id.
System.err.printf("Get by id=7777 = %s\n", byId.get(7777));
// Get a value by Name
System.err.printf("Get by name='A Site' = %s\n", byName.get("A Site"));
// We can do usual mappy things with the indexes,
// such as list the keys, or iterate over all entries
System.err.printf("byId.keys() = %s\n", byId.keySet());
byId.forEach((Integer id, Site s) -> System.err.printf(" %s => %s\n", id, s));
// In some cases the map is not unique, so I provide a
// way to get all entries with the same value as a list.
// in this case by their rating value.
MultiMapIndex<String, List<Site>> byRating = multiMap.addMultiIndex((Site site)->site.rating);
System.err.printf("byRating('A') = %s\n", byRating.get("A"));
System.err.printf("byRating('B') = %s\n", byRating.get("B"));
// Adding stuff after creating the indices is fine.
multiMap.add(new Site(3333,"Last Site","B"));
System.err.printf("byRating('A') = %s\n", byRating.get("A"));
System.err.printf("byRating('B') = %s\n", byRating.get("B"));
}
}

I think you can simply search your objects in a List:
List<Site> sites;
for (Site s : sites) {
if (s.getSiteName().equals(siteName)) {
// do something
}
if (s.getSiteId().equals(siteId)) {
// do something
}
}

You should create a bean (a plain container class). There is little point in hand-optimising this kind of code in Java, since the JIT compiler will optimise it anyway.
public class SiteMap {
String siteName;
Integer siteId;
String parentBrand;
// ... getters and setters ...
}
List<SiteMap> myList = new ArrayList<>();
If you need to compare or sort, you can implement the Comparable interface on the SiteMap class, which lets you sort the entries when needed.
If you are using Java 8, you can also use streams to filter and fetch the one you want, since there is a findFirst method:
SiteMap mysite = myList.stream()
.filter(e -> e.siteName.equals("Amazon.com"))
.findFirst()
.get();
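Building on the stream approach, the lookup maps from the question can also be derived from a single list instead of being maintained by hand. The following is a rough sketch under assumptions: SiteEntry, SiteIndexes, uniqueIndex and groupIndex are hypothetical names, and the indexes are snapshots that have to be rebuilt when the list changes. Note that Collectors.toMap throws an IllegalStateException on duplicate keys, so it only suits fields that are unique.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

class SiteIndexDemo {
    public static void main(String[] args) {
        List<SiteEntry> sites = Arrays.asList(
                new SiteEntry("Amazon.com", 1, "Amazon"),
                new SiteEntry("Zappos.com", 2, "Amazon"),
                new SiteEntry("Example.org", 3, "Example"));

        // One index per searchable field, all derived from the same list.
        Map<String, SiteEntry> byName = SiteIndexes.uniqueIndex(sites, s -> s.siteName);
        Map<String, List<SiteEntry>> byBrand = SiteIndexes.groupIndex(sites, s -> s.parentBrand);

        System.out.println(byName.get("Amazon.com").siteId);   // prints 1
        System.out.println(byBrand.get("Amazon").size());      // prints 2

        // findFirst returns an Optional, so a missing match can be checked
        // instead of calling get() and risking NoSuchElementException.
        Optional<SiteEntry> missing = sites.stream()
                .filter(s -> s.siteName.equals("Nope"))
                .findFirst();
        System.out.println(missing.isPresent());                // prints false
    }
}

// Hypothetical stand-in for the Site class in the question.
class SiteEntry {
    final String siteName;
    final int siteId;
    final String parentBrand;

    SiteEntry(String siteName, int siteId, String parentBrand) {
        this.siteName = siteName;
        this.siteId = siteId;
        this.parentBrand = parentBrand;
    }
}

class SiteIndexes {
    // Index for a field that is unique per item.
    static <K, V> Map<K, V> uniqueIndex(List<V> items, Function<V, K> key) {
        return items.stream().collect(Collectors.toMap(key, Function.identity()));
    }

    // Index for a field shared by several items.
    static <K, V> Map<K, List<V>> groupIndex(List<V> items, Function<V, K> key) {
        return items.stream().collect(Collectors.groupingBy(key));
    }
}
```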

Save and Read Key-Value pair in Spark

I have a JavaPairRDD in the following format:
JavaPairRDD< String, Tuple2< String, List< String>>> myData;
I want to save it as a Key-Value format (String, Tuple2< String, List< String>>).
myData.saveAsXXXFile("output-path");
So my next job could read in the data directly to my JavaPairRDD:
JavaPairRDD< String, Tuple2< String, List< String>>> newData = context.XXXFile("output-path");
I am using Java 7, Spark 1.2, Java API. I tried saveAsTextFile and saveAsObjectFile, neither works. And I don't see saveAsSequenceFile option in my eclipse.
Does anyone have any suggestion for this problem?
Thank you very much!

You could use SequenceFileRDDFunctions, which is available through implicits in Scala. From Java, however, that is likely to be nastier than the usual suggestion:
myData.saveAsHadoopFile(fileName, Text.class, MyWritable.class,
SequenceFileOutputFormat.class);
with MyWritable being a custom value class that implements
org.apache.hadoop.io.Writable
Something like this should work (did not check for compilation):
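import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import scala.Tuple2;
// Writable is an interface, so it is implemented rather than extended.
public class MyWritable implements Writable {
private String _1;
private String[] _2;
// Hadoop instantiates Writables reflectively, so a no-argument constructor is required.
public MyWritable() {
}
public MyWritable(Tuple2<String, String[]> data) {
_1 = data._1();
_2 = data._2();
}
public Tuple2<String, String[]> get() {
return new Tuple2<String, String[]>(_1, _2);
}
@Override
public void readFields(DataInput in) throws IOException {
// Read the fields in exactly the order and encoding used by write().
_1 = Text.readString(in);
_2 = new String[in.readInt()];
for (int i = 0; i < _2.length; i++) {
_2[i] = Text.readString(in);
}
}
@Override
public void write(DataOutput out) throws IOException {
Text.writeString(out, _1);
// Write the array length first so readFields() knows how many strings follow.
out.writeInt(_2.length);
for (String s : _2) {
Text.writeString(out, s);
}
}
}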
such that it fits your data model.
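The key property of such a class is that write and readFields are exact mirrors of each other. The sketch below checks that symmetry with nothing but java.io, so it runs without Hadoop; PairRecord, toBytes and fromBytes are hypothetical names, and writeUTF/readUTF stand in for whatever string encoding the real Writable uses.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class RoundTripDemo {
    public static void main(String[] args) {
        PairRecord original = new PairRecord("id-1", new String[] {"a", "b"});
        PairRecord copy = PairRecord.fromBytes(original.toBytes());
        System.out.println(copy.first + " " + Arrays.toString(copy.second)); // prints id-1 [a, b]
    }
}

// Hypothetical record with the same write/readFields shape as a Hadoop Writable.
class PairRecord {
    String first;
    String[] second;

    PairRecord() {
        // Empty instance to be filled by readFields().
    }

    PairRecord(String first, String[] second) {
        this.first = first;
        this.second = second;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(first);
        out.writeInt(second.length);      // length prefix for the array
        for (String s : second) {
            out.writeUTF(s);
        }
    }

    void readFields(DataInput in) throws IOException {
        first = in.readUTF();
        second = new String[in.readInt()];
        for (int i = 0; i < second.length; i++) {
            second[i] = in.readUTF();
        }
    }

    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static PairRecord fromBytes(byte[] bytes) {
        PairRecord record = new PairRecord();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            record.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return record;
    }
}
```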

Getting MapReduce results on RIAK (using the Java client)

I am storing Person POJOs (4 string fields - id, name, lastUpdate, Data) on RIAK, then trying to fetch these objects with MapReduce.
I am doing it very similar to Basho documentation:
BucketMapReduce m = riakClient.mapReduce("person");
m.addMapPhase(new NamedJSFunction("Riak.mapByFields"), true);
MapReduceResult result = m.execute();
Collection<Person> tmp = result.getResult(Person.class);
With this code, the Person's String constructor is invoked:
public Person(String str){}
(I must have this constructor; otherwise I get an exception saying it is missing.)
In there I get the object as a String: the object's fields in one string with a strange delimiter.
Why am I not getting the object automatically converted to my POJO? Do I really need to go over the string and deserialize it? Am I doing something wrong?

The JS function you're using doesn't do what you think it does :) It selects objects based on a field with a specific value you have to supply as an argument to the phase.
I think what you're looking for is mapValuesJson which will do what you seem to be wanting to do.
Also, you don't need a constructor at all in your POJO.
The code below should point you in the right direction (obviously this is super-simple with all public fields in the POJO and no annotations):
public class App {
public static void main( String[] args ) throws IOException, RiakException
{
IRiakClient client = RiakFactory.httpClient();
Bucket b = client.fetchBucket("test_mr").execute();
b.store("myobject", new Person()).execute();
IRiakObject o = b.fetch("myobject").execute();
System.out.println(o.getValueAsString());
BucketMapReduce m = client.mapReduce("test_mr");
m.addMapPhase(new NamedJSFunction("Riak.mapValuesJson"), true);
MapReduceResult result = m.execute();
System.out.println(result.getResultRaw());
Collection<Person> tmp = result.getResult(Person.class);
for (Person p : tmp)
{
System.out.println(p.data);
}
client.shutdown();
}
}
class Person
{
public String id = "12345";
public String name = "my name";
public String lastUpdate = "some time";
public String data = "some data";
}

Using String to find Class in java?

I have made a class named Entity, and have the following code:
Entity zombie1 = new Entity();
I get input 'zombie' from a scanner, and then concatenate a number, based on level on the end of that, leaving 'zombie1' as the string... I want to be able to use that string and call
zombie1.shoot("shotgun");
but I can't seem to find a solution. I'd just use an if statement, but I want to be able to create as many zombies as I want without having to add more if statements every single time.
I've read articles using reflection and forString but that doesn't seem to be what i'm looking for.
Any help would be nice.

Possible solutions are to use a Map<String, Entity> to be able to store and retrieve entities based on specific Strings. If you have a limited number of sub-types of Entity such as Zombies, Vampires, Victims, etc, you could have a Map<String, List<Entity>>, allowing you to map a String to a specific type of entity and then get that type by number.
e.g.,
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
public class Foo002 {
private static final String ZOMBIE = "zombie";
public static void main(String[] args) {
Map<String, List<Entity>> entityMap = new HashMap<String, List<Entity>>();
entityMap.put(ZOMBIE, new ArrayList<Entity>());
entityMap.get(ZOMBIE).add(new Entity(ZOMBIE, "John"));
entityMap.get(ZOMBIE).add(new Entity(ZOMBIE, "Fred"));
entityMap.get(ZOMBIE).add(new Entity(ZOMBIE, "Bill"));
for (Entity entity : entityMap.get(ZOMBIE)) {
System.out.println(entity);
}
}
}
class Entity {
private String type;
private String name;
public Entity(String type, String name) {
this.type = type;
this.name = name;
}
public String getType() {
return type;
}
public String getName() {
return name;
}
@Override
public String toString() {
return type + ": " + name;
}
}

This is not your best bet. Your best bet is to use a Map:
// PLEASE LOOK INTO WHICH MAP WOULD BE BEST FOR YOUR CASE OVERALL
// HASHMAP IS JUST AN EXAMPLE.
Map<String, Entity> zombieHoard = new HashMap<String, Entity>();
String getZombieID( int id )
{
return String.format( "zombie%s", id );
}
String createZombie() {
// The next ID is derived from the current size of the map (zombie0, zombie1, ...).
String zid = getZombieID( zombieHoard.size() );
zombieHoard.put( zid, new Entity() );
return zid;
}
void sendForthTheHoard() {
createZombie();
createZombie();
String currentZombie = createZombie();
zombieHoard.get( currentZombie ).shoot( "blow-dryer" );
zombieHoard.get( getZombieID( 1 ) ).eatBrains();
}
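To connect the map idea back to the original question, here is a small runnable sketch in which the string typed by the user plus the level number is used directly as the map key. Monster, ZombieRegistry, create and find are hypothetical names chosen for this illustration, and the hard-coded "zombie" string stands in for the Scanner input.

```java
import java.util.HashMap;
import java.util.Map;

class ZombieLookupDemo {
    public static void main(String[] args) {
        ZombieRegistry registry = new ZombieRegistry();
        registry.create("zombie", 1);
        registry.create("zombie", 2);

        String typed = "zombie"; // stand-in for the text read from a Scanner
        int level = 1;

        // The concatenated string is the lookup key, so no if statements are needed.
        Monster target = registry.find(typed + level);
        System.out.println(target.shoot("shotgun")); // prints zombie1 was shot with a shotgun
    }
}

// Hypothetical stand-in for the Entity class in the question.
class Monster {
    final String name;

    Monster(String name) {
        this.name = name;
    }

    String shoot(String weapon) {
        return name + " was shot with a " + weapon;
    }
}

class ZombieRegistry {
    private final Map<String, Monster> byName = new HashMap<>();

    // Creates a monster named kind + number (for example "zombie1") and registers it.
    Monster create(String kind, int number) {
        Monster monster = new Monster(kind + number);
        byName.put(monster.name, monster);
        return monster;
    }

    // Looks a monster up by its full name and fails loudly when it does not exist.
    Monster find(String name) {
        Monster monster = byName.get(name);
        if (monster == null) {
            throw new IllegalArgumentException("No entity named " + name);
        }
        return monster;
    }
}
```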

Put your zombies in an ArrayList. Example:
ArrayList<Entity> zombies = new ArrayList<Entity>();
Entity zombie1 = new Entity();
zombies.add(zombie1);
Entity zombie2 = new Entity();
zombies.add(zombie2);
etc...
Then, when it is time to call a certain zombie, do the following:
zombies.get(1).shoot("shotgun");

If you are talking about dynamically invoking a method on an object, you can use reflection to get the Method object and invoke it (note: I may have inadvertently mixed some C# syntax into this Java):
Entity zombie1 = new Entity();
Method shootMethod = Entity.class.getMethod("shoot", new Class[] { String.class });
shootMethod.invoke(zombie1, new Object[] { "shotgun" });
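A self-contained version of the reflective call might look like the sketch below. Shooter and ReflectiveCall are hypothetical names. Note that reflection resolves the method from a string, not the object: a local variable such as zombie1 cannot be looked up by its name this way, so a map or list is still needed to find the instance.

```java
import java.lang.reflect.Method;

class ReflectionDemo {
    public static void main(String[] args) {
        Shooter zombie1 = new Shooter();
        System.out.println(ReflectiveCall.invoke(zombie1, "shoot", "shotgun")); // prints shot with shotgun
    }
}

// Hypothetical stand-in for the Entity class in the question.
class Shooter {
    public String shoot(String weapon) {
        return "shot with " + weapon;
    }
}

class ReflectiveCall {
    // Invokes a public method that takes a single String argument, chosen by name at runtime.
    static Object invoke(Object target, String methodName, String argument) {
        try {
            Method method = target.getClass().getMethod(methodName, String.class);
            return method.invoke(target, argument);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not call " + methodName, e);
        }
    }
}
```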

java/scala: faster type-aware serialization of only basic types?

in scala, i have a need to serialize objects that are limited to a small set of basic types: array, list, map, set, int, boolean, etc. i want to be able to serialize and deserialize those in a way that preserves the type information in the serialized format. specifically, if i have serialized an Array[Any], i want to be able to deserialize it and only specify that the resulting object is Array[Any]. that is, i don't want to specify a structure definition for every single thing i'm going to serialize. at the same time it needs to be able to distinguish between int and long, tuple and array, etc.
for example:
val obj = Array[Any](...) // can have any basic types in here
val ser = serialize(obj)
val newObj = deserialize[Array[Any]](ser) // recovers the exact types from the original obj
json is not appropriate for this case because it has a many-to-one mapping of scala types to json types. i'm currently using java serialization but it's extremely slow. since i don't need to serialize any arbitrary object type, is there a faster alternative for my narrower use case?

I don't know about speed, or indeed about the availability of library support, but have you looked at ASN.1?

I'd use a simple interface like this:
public interface Serializer{
public <T> T deserialize(String serializedData);
public String serialize(Object data);
}
And an enum to implement it:
public enum StandardSerializer implements Serializer{
INTEGER("I", Integer.class, int.class){
@Override
protected Integer doDeserialize(final String stripped){
return Integer.valueOf(stripped);
}
},
STRING("S", String.class){ // prefix must differ from INTEGER's "I", or forData() cannot tell them apart
@Override
protected Object doDeserialize(final String stripped){
return stripped;
}
},
LIST("L", List.class){
@Override
protected String doSerialize(final Object data){
final Iterator<?> it = ((List<?>) data).iterator();
final StringBuilder sb = new StringBuilder();
if(it.hasNext()){
Object next = it.next();
sb.append(StandardSerializer
.forType(next.getClass())
.serialize(next));
while(it.hasNext()){
sb.append(',');
next = it.next();
sb.append(StandardSerializer
.forType(next.getClass())
.serialize(next));
}
}
return sb.toString();
}
@Override
protected Object doDeserialize(final String stripped){
final List<Object> list = new ArrayList<Object>();
for(final String item : stripped.split(",")){
list.add(StandardSerializer.forData(item).deserialize(item));
}
return list;
}
}
/* feel free to implement more enum entries */
;
private static final String DELIMITER = ":";
public static StandardSerializer forType(final Class<?> type){
for(final StandardSerializer candidate : values()){
for(final Class<?> supportedType : candidate.supportedClasses){
if(supportedType.isAssignableFrom(type)) return candidate;
}
}
throw new IllegalArgumentException("Unmapped type: " + type);
}
private final String prefix;
private final Class<?>[] supportedClasses;
private StandardSerializer(final String prefix,
final Class<?>... supportedClasses){
this.prefix = prefix;
this.supportedClasses = supportedClasses;
}
private String base64decode(final String removePrefix){
// TODO call one of the many base64 libraries here
return null;
}
private String base64encode(final String data){
// TODO call one of the many base64 libraries here
return null;
}
@SuppressWarnings("unchecked")
@Override
public final <T> T deserialize(final String serializedData){
return (T) doDeserialize(base64decode(removePrefix(serializedData)));
}
public static StandardSerializer forData(final String serializedData){
final String prefix =
serializedData.substring(0, serializedData.indexOf(DELIMITER));
for(final StandardSerializer candidate : values()){
if(candidate.prefix.equals(prefix)) return candidate;
}
throw new IllegalArgumentException("Unknown prefix: " + prefix);
}
protected abstract Object doDeserialize(String strippedData);
private String removePrefix(final String serializedData){
return serializedData.substring(prefix.length() + DELIMITER.length());
}
// default implementation calls toString()
protected String doSerialize(final Object data){
return data.toString();
}
@Override
public String serialize(final Object data){
return new StringBuilder()
.append(prefix)
.append(DELIMITER)
.append(base64encode(doSerialize(data)))
.toString();
}
}
Now here's how you can code against that:
List<?> list = Arrays.asList("abc",123);
String serialized = StandardSerializer.forType(list.getClass()).serialize(list);
List<?> unserialized = StandardSerializer.forData(serialized)
.deserialize(serialized);
(While you might choose a different format for serialization, using an enum strategy pattern is probably still a good idea)
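If a binary format is acceptable, the same type-prefix idea can be sketched with one tag byte per value written through DataOutputStream, which keeps int and long distinct without any per-structure schema. This is only an illustration under assumptions: TaggedCodec and its tag values are hypothetical, only a handful of types are covered, writeUTF limits each string to 65535 encoded bytes, and no speed claims are made.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TaggedCodecDemo {
    public static void main(String[] args) {
        List<Object> data = Arrays.asList(1, 2L, "abc", true, Arrays.asList(3, "x"));
        Object back = TaggedCodec.decode(TaggedCodec.encode(data));
        System.out.println(back.equals(data)); // prints true
    }
}

class TaggedCodec {
    // One tag byte per supported type; the values are arbitrary but must stay stable.
    private static final byte INT = 0, LONG = 1, STRING = 2, BOOLEAN = 3, LIST = 4;

    static byte[] encode(Object value) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(out, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static Object decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void write(DataOutputStream out, Object value) throws IOException {
        if (value instanceof Integer) {
            out.writeByte(INT);
            out.writeInt((Integer) value);
        } else if (value instanceof Long) {
            out.writeByte(LONG);
            out.writeLong((Long) value);
        } else if (value instanceof String) {
            out.writeByte(STRING);
            out.writeUTF((String) value);
        } else if (value instanceof Boolean) {
            out.writeByte(BOOLEAN);
            out.writeBoolean((Boolean) value);
        } else if (value instanceof List) {
            List<?> list = (List<?>) value;
            out.writeByte(LIST);
            out.writeInt(list.size()); // element count, then each element with its own tag
            for (Object item : list) {
                write(out, item);
            }
        } else {
            throw new IllegalArgumentException(
                    "Unsupported type: " + (value == null ? "null" : value.getClass()));
        }
    }

    private static Object read(DataInputStream in) throws IOException {
        byte tag = in.readByte();
        switch (tag) {
            case INT:     return in.readInt();
            case LONG:    return in.readLong();
            case STRING:  return in.readUTF();
            case BOOLEAN: return in.readBoolean();
            case LIST: {
                int size = in.readInt();
                List<Object> list = new ArrayList<>(size);
                for (int i = 0; i < size; i++) {
                    list.add(read(in));
                }
                return list;
            }
            default:
                throw new IOException("Unknown type tag: " + tag);
        }
    }
}
```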
