I am trying to build a utility layer in Java on top of Apache Spark Streaming where users can aggregate data over a period of time (using Spark's window functions), but all of the available options seem to require associative functions taking two arguments. However, some fairly common use cases, like averaging temperature sensor values over an hour, don't seem possible with the Spark API.
Is there any alternative for achieving this kind of functionality? I am thinking of implementing repeated interactive queries to achieve it, but that would be too slow.
Statistical aggregates (average, variance) can in fact be computed with associative operations, and online. See here for a good numerical way of doing this.
As for the number of arguments, remember that the type of the arguments is your choice: you can pack several values into a single argument using a Tuple.
Finally, you can also use stateful information with something like updateStateByKey.
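As a concrete illustration of the Tuple trick, here is a minimal Java sketch, not the definitive way to do it: carry a (sum, count) pair through the associative two-argument reduce and divide once per window. The input stream construction, the one-hour window and the five-minute slide interval are placeholders you would adapt.

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

public class WindowedAverage {
    // readings: (sensorId, temperature) pairs; building the input stream is left to the caller
    public static JavaPairDStream<String, Double> hourlyAverage(
            JavaPairDStream<String, Double> readings) {
        return readings
            // pair every reading with a count of 1
            .mapValues(t -> new Tuple2<Double, Long>(t, 1L))
            // associative, two-argument reduce over the window: add sums and counts
            .reduceByKeyAndWindow(
                (a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()),
                Durations.minutes(60),   // window length: one hour (placeholder)
                Durations.minutes(5))    // slide interval (placeholder)
            // finalize: divide sum by count to obtain the per-sensor average
            .mapValues(sumCount -> sumCount._1() / sumCount._2());
    }
}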
Related
In Apache Spark, what are the differences between those API? Why and when should we choose one over the others?
First, let's define what Spark does
Simply put, it executes operations on distributed data, so the operations themselves also need to be distributed. Some operations are simple, such as filtering out all items that don't respect some rule. Others are more complex, such as groupBy, which needs to move data around, and join, which needs to associate items from two or more datasets.
Another important fact is that input and output are stored in various formats, and Spark has connectors to read and write them. But that means serializing and deserializing the data. While transparent, serialization is often the most expensive part of the job.
Finally, Spark tries to keep data in memory for processing, but it will [ser/deser]ialize data locally on each worker when it doesn't fit in memory. Once again, this is done transparently, but it can be costly. (Interesting fact: even estimating the data size can take time.)
The APIs
RDD
It's the first API provided by Spark. To put it simply, it is an unordered sequence of Scala/Java objects distributed over a cluster. All operations executed on it are JVM methods (passed to map, flatMap, groupBy, ...) that need to be serialized, sent to all workers, and applied to the JVM objects there. This is pretty much the same as using a Scala Seq, but distributed. It is strongly typed, meaning that "if it compiles then it works" (if you don't cheat). However, lots of distribution issues can arise, especially if Spark doesn't know how to [de]serialize the JVM classes and methods.
Dataframe
It came later and is semantically very different from RDD. The data are considered as tables, and SQL-like operations can be applied to them. It is not typed at all, so errors can arise at any time during execution. However, I think it has two pros: (1) many people are used to the table/SQL semantics and operations, and (2) Spark doesn't need to deserialize the whole row to process one of its columns, provided the data format offers suitable column access. And many formats do, such as Parquet, which is the most commonly used.
Dataset
It is an improvement of Dataframe that brings some type safety. A Dataset is a Dataframe to which we associate an "encoder" related to a JVM class, so Spark can check that the data schema is correct before executing the code. Note, however, that while you can sometimes read that Datasets are strongly typed, they are not: they bring partial type safety, in that you cannot compile code that uses a Dataset with a type other than what has been declared. But it is very easy to write code that compiles yet still fails at runtime, because many Dataset operations lose the type (pretty much everything apart from filter). Still, it is a huge improvement, because even when we make a mistake it fails fast: the failure happens when interpreting the Spark DAG (i.e. at start) instead of during data processing.
Note: Dataframes are now simply untyped Datasets (Dataset<Row>).
Note 2: Datasets provide the main RDD API, such as map and flatMap. From what I know, these are shortcuts that convert to an RDD, apply map/flatMap, then convert back to a Dataset. That's practical, but it also hides the conversion, making it difficult to realize that a possibly costly ser/deser-ialization happened.
Pros and cons
Dataset:
pros: has optimized operations over column-oriented storage
pros: many operations don't need deserialization
pros: provides table/SQL semantics if you like that (I don't ;)
pros: Dataset operations come with an optimization engine, "catalyst", that improves the performance of your code. I'm not sure, however, whether it is really that great: if you know what you code, i.e. what is done to the data, your code should be optimized by itself.
cons: most operations lose typing
cons: Dataset operations can become too complicated for complex algorithms that don't suit them. The two main limits I know of are managing invalid data and complex math algorithms.
Dataframe:
pros: unavoidable, since it is what you get back from Dataset operations that lose the type
cons: just use Dataset; it has all the advantages and more
RDD:
pros: (really) strongly typed
pros: Scala/Java semantics. You can design your code pretty much as you would for a single-JVM app that processes in-memory collections. Well, with functional semantics :)
cons: full JVM deserialization is required to process the data, at any of the steps mentioned before: after reading the input, and between all processing steps that require data to be moved between workers or stored locally to stay within memory bounds.
Conclusion
Just use Dataset by default:
read input with an Encoder; if the data format allows it, this will validate the input schema at start
use Dataset operations, and when you lose the type, go back to a typed Dataset. Typically, use typed Datasets as the input and output of all methods.
There are cases where what you want to code would be too complex to express with Dataset operations. Most apps don't hit this, but it happens often in my work, where I implement complex mathematical models. In that case:
start with a Dataset
filter and shuffle (groupBy, join) the data as much as possible with Dataset ops
once you have only the required data and no longer need to move it around, convert to an RDD and apply your complex computation (a rough sketch of this workflow follows below)
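Here is a rough Java sketch of that workflow, not the definitive way to do it; the Measurement bean, the input path and the filter condition are made-up placeholders:

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class Pipeline {
    // Hypothetical bean describing the expected input schema.
    public static class Measurement implements java.io.Serializable {
        private String id;
        private double value;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public double getValue() { return value; }
        public void setValue(double value) { this.value = value; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("example").getOrCreate();

        // Read with an explicit Encoder so the schema is validated at start.
        Dataset<Measurement> input = spark.read()
            .parquet("/path/to/input")                 // placeholder path
            .as(Encoders.bean(Measurement.class));

        // Filter/shuffle with Dataset operations while the data is still columnar.
        Dataset<Measurement> relevant = input.filter("value > 0");

        // Only now drop down to the RDD level for the complex, hand-written computation.
        JavaRDD<Measurement> rdd = relevant.javaRDD();
        // ... complex math on rdd ...
    }
}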
In short:
RDDs come from the early versions of Spark. They are still used "under the hood" by Dataframes.
Dataframes were introduced in late Spark 1.x and really matured in Spark 2.x. They are the preferred storage now. In Java they are implemented as a Dataset.
Datasets are the generic implementation; you could have, for example, a Dataset of a specific JVM class.
I use Dataframes and highly recommend them: Spark's optimizer, Catalyst, understands Datasets (and as such Dataframes) better, and the Row is a better storage container than a pure JVM object. You will find a lot of blog posts (including Databricks') on the internals.
We have a requirement to incorporate an Excel-based tool into a Java web application. This Excel tool has a set of master data and a couple of result outputs that use formula calculations on the master data.
The master data can be captured in relational database tables. We are looking for the best way to provide the capability to capture, validate and evaluate formulas.
So far we have looked at using the Nashorn scripting engine and providing formula support through eval. We would like to know how people are doing this elsewhere.
I've searched and found two possible libraries that could be useful for you; please have a look:
mXparser: http://mathparser.org/
mXparser hello world (Java): http://mathparser.org/mxparser-hello-world/mxparser-hello-world-java/
exp4j: https://lallafa.objecthunter.net/exp4j/
exp4j asynchronous evaluation: https://lallafa.objecthunter.net/exp4j/#Evaluating_an_expression_asynchronously
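For a first impression of the expression-evaluation route, here is a minimal exp4j sketch; the formula, variable names and values are made up, and in your case they would come from the user-entered formula and the master data:

import net.objecthunter.exp4j.Expression;
import net.objecthunter.exp4j.ExpressionBuilder;

public class FormulaDemo {
    public static void main(String[] args) {
        // Parse a user-supplied formula once, then bind variables from your master data.
        Expression e = new ExpressionBuilder("3 * sin(y) - 2 / (x - 2)")
            .variables("x", "y")
            .build()
            .setVariable("x", 2.3)
            .setVariable("y", 3.14);
        System.out.println(e.evaluate());
    }
}

Going through a parser like this also gives you a natural place to validate formulas before evaluating them, which eval-based approaches make harder.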
It depends on how big your data is, what your required SLA is, and also what kind of formulas and other functions you want to support.
For example, consider a function like sum or max, with the master data in a relational table containing 10K rows. You could pull all this data into the Java app and compute the sum (or run any other function). However, imagine the table contained 500K rows. Streaming all 500K rows to the Java app would take time and consume a lot of CPU and network bandwidth (database resources, local CPU resources). A better-optimized approach in that case would be to index that column in the database and let the database do the hard work for you, as in the sketch below.
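As an illustration of pushing the work to the database, here is a plain JDBC sketch; the connection string, table and column names are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DbAggregate {
    public static void main(String[] args) throws Exception {
        // Let the database compute the aggregate instead of streaming 500K rows to the app.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:postgresql://host/db", "user", "pass");       // placeholder connection
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT SUM(amount), MAX(amount) FROM master_data WHERE region = ?")) {
            ps.setString(1, "EMEA");                                  // placeholder filter value
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    System.out.println("sum=" + rs.getDouble(1) + ", max=" + rs.getDouble(2));
                }
            }
        }
    }
}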
Personally, I don't like using eval. I would rather parse the user input to determine what actions to take.
I am assuming that the data is not big enough to warrant big data tools.
I am using the libsvm library for document classification of resumes. I have multiple resumes and I need to classify them. Do I need multilabel classification or multiclass classification in this case? Which of the two should I consider, and please also suggest a way to do it.
Your requirement is not straightforward. In order to develop such a system you need to go through several steps, for example:
You need a data set of different types of documents (various types of resumes).
Then you need to identify what kinds of features can be used to separate them (how are you going to distinguish them, and based on what: e.g. resume length, word counts, content of the resume header, etc.).
Then you need to prepare sets of feature vectors in order to train the SVM. (If you only need to classify resumes as relevant or irrelevant, that is two classes. If there are more than two classes, that is multi-class, and LibSVM supports multi-class.)
When training, you need to perform scaling and cross-validation in order to increase the accuracy (read here).
You need to complete the above steps in order to make successful predictions. A rough sketch of the LibSVM part follows below.
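If you go the LibSVM route, here is a minimal Java sketch of training and prediction; the feature values, labels and parameter choices are toy placeholders, and in practice the vectors would come from your feature-extraction step (and should be scaled):

import libsvm.*;

public class ResumeClassifier {
    public static void main(String[] args) {
        // Toy training set: two feature vectors with hypothetical, already-scaled features.
        double[][] features = { {0.2, 0.7}, {0.9, 0.1} };
        double[] labels = { 0, 1 };   // e.g. 0 = irrelevant, 1 = relevant (add more labels for multi-class)

        svm_problem prob = new svm_problem();
        prob.l = features.length;
        prob.y = labels;
        prob.x = new svm_node[features.length][];
        for (int i = 0; i < features.length; i++) {
            prob.x[i] = new svm_node[features[i].length];
            for (int j = 0; j < features[i].length; j++) {
                prob.x[i][j] = node(j + 1, features[i][j]);   // libsvm feature indices start at 1
            }
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.gamma = 0.5;         // placeholder; tune via cross-validation
        param.C = 1;               // placeholder; tune via cross-validation
        param.cache_size = 100;
        param.eps = 1e-3;

        svm_model model = svm.svm_train(prob, param);

        // Predict the class of a new resume's feature vector.
        svm_node[] test = { node(1, 0.25), node(2, 0.65) };
        System.out.println("predicted class: " + svm.svm_predict(model, test));
    }

    private static svm_node node(int index, double value) {
        svm_node n = new svm_node();
        n.index = index;
        n.value = value;
        return n;
    }
}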
I need to run a k-medoids clustering algorithm by using ELKI programmatically. I have a similarity matrix that I wish to input to the algorithm.
Is there any code snippet available for how to run ELKI algorithms?
I basically need to know how to create Database and Relation objects, create a custom distance function, and read the algorithm output.
Unfortunately the ELKI tutorial (http://elki.dbs.ifi.lmu.de/wiki/Tutorial) focuses on the GUI version and on implementing new algorithms, and trying to write code by looking at the Javadoc is frustrating.
If someone is aware of any easy-to-use library for k-medoids, that's probably a good answer to this question as well.
We do appreciate documentation contributions! (Update: I have turned this post into a new ELKI tutorial entry for now.)
ELKI does advocate not embedding it in other Java applications, for a number of reasons. This is why we recommend using the MiniGUI (or the command line it constructs). Adding custom code is best done e.g. as a custom ResultHandler, or simply by using the ResultWriter and parsing the resulting text files.
If you really want to embed it in your code (there are a number of situations where it is useful, in particular when you need multiple relations, and want to evaluate different index structures against each other), here is the basic setup for getting a Database and Relation:
// Setup parameters:
ListParameterization params = new ListParameterization();
params.addParameter(FileBasedDatabaseConnection.INPUT_ID, filename);
// Add other parameters for the database here!
// Instantiate the database:
Database db = ClassGenericsUtil.parameterizeOrAbort(
    StaticArrayDatabase.class,
    params);
// Don't forget this, it will load the actual data...
db.initialize();
Relation<DoubleVector> vectors = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
Relation<LabelList> labels = db.getRelation(TypeUtil.LABELLIST);
If you want your code to be more generic, use NumberVector<?>.
Why we (currently) do not recommend using ELKI as a "library":
The API is still changing a lot. We keep adding options, and we cannot (yet) provide a stable API. The command line / MiniGUI / parameterization is much more stable, because of the handling of default values: the parameterization only lists the non-default parameters, so you will only notice when those change.
In the code example above, note that I also used this pattern. A change to the parsers, database, etc. will likely not affect this program!
Memory usage: data mining is quite memory intensive. If you use the MiniGUI or command line, you get a good cleanup when the task is finished. If you invoke it from Java, chances are really high that you keep a reference somewhere and end up leaking lots of memory. So do not use the above pattern without ensuring that the objects are properly cleaned up when you are done!
By running ELKI from the command line, you get two things for free:
no memory leaks. When the task is finished, the process quits and frees all memory.
no need to rerun it twice for the same data. Subsequent analysis does not need to rerun the algorithm.
ELKI is not designed as an embeddable library, for good reasons. ELKI has tons of options and functionality, and this comes at a price, both in runtime (although it can easily outperform R and Weka, for example!), in memory usage, and in particular in code complexity.
ELKI was designed for research in data mining algorithms, not for making them easy to include in arbitrary applications. Instead, if you have a particular problem, you should use ELKI to find out which approach works well, then reimplement that approach in an optimized manner for your problem.
Best ways of using ELKI
Here are some tips and tricks:
Use the MiniGUI to build a command line. Note that the logging window of the "GUI" shows the corresponding command line parameters - running ELKI from the command line is easy to script, and can easily be distributed to multiple computers, e.g. via Grid Engine.
#!/bin/bash
for k in $( seq 3 39 ); do
  java -jar elki.jar KDDCLIApplication \
    -dbc.in whatever \
    -algorithm clustering.kmeans.KMedoidsEM \
    -kmeans.k $k \
    -resulthandler ResultWriter -out.gzip \
    -out output/k-$k
done
Use indexes. For many algorithms, index structures can make a huge difference!
(But you need to do some research which indexes can be used for which algorithms!)
Consider using the extension points such as ResultWriter. It may be easiest for you to hook into this API, then use ResultUtil to select the results that you want to output in your own preferred format or analyze:
List<Clustering<? extends Model>> clusterresults =
    ResultUtil.getClusteringResults(result);
To identify objects, use labels and a LabelList relation. The default parser will do this when it sees text alongside the numerical attributes, i.e. a file such as
1.0 2.0 3.0 ObjectLabel1
will make it easy to identify the object by its label!
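To tie this together, here is a rough sketch of walking over one Clustering result and printing the label of each cluster member. It assumes "clustering" is one of the results obtained via ResultUtil above and "labels" is the LabelList relation from the earlier snippet; check the Javadoc of your ELKI version for the exact types.

// imports assumed: de.lmu.ifi.dbs.elki.data.Cluster, de.lmu.ifi.dbs.elki.data.Clustering,
// de.lmu.ifi.dbs.elki.data.model.Model, de.lmu.ifi.dbs.elki.database.ids.DBIDIter
for (Cluster<? extends Model> cluster : clustering.getAllClusters()) {
  System.out.println("Cluster of size " + cluster.size());
  for (DBIDIter it = cluster.getIDs().iter(); it.valid(); it.advance()) {
    // look up the object's label in the LabelList relation
    System.out.println("  member: " + labels.get(it));
  }
}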
UPDATE: See ELKI tutorial created out of this post for updates.
ELKI's documentation is pretty sparse (I don't know why they don't include a simple "hello world" program in the examples).
You could try Java-ML. Its documentation is a bit more user friendly, and it does have K-medoid.
Clustering example with Java-ML: http://java-ml.sourceforge.net/content/clustering-basics
K-medoids API: http://java-ml.sourceforge.net/api/0.1.7/
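For completeness, a minimal Java-ML sketch; the data points are made up, and using the default KMedoids constructor is an assumption - check the Javadoc linked above for the constructor that takes the number of clusters, iterations and distance measure:

import net.sf.javaml.clustering.Clusterer;
import net.sf.javaml.clustering.KMedoids;
import net.sf.javaml.core.Dataset;
import net.sf.javaml.core.DefaultDataset;
import net.sf.javaml.core.DenseInstance;

public class KMedoidsDemo {
    public static void main(String[] args) {
        // Toy two-dimensional data set.
        Dataset data = new DefaultDataset();
        data.add(new DenseInstance(new double[] {1.0, 2.0}));
        data.add(new DenseInstance(new double[] {1.1, 1.9}));
        data.add(new DenseInstance(new double[] {8.0, 8.5}));

        Clusterer km = new KMedoids();            // default settings (assumption; see Javadoc)
        Dataset[] clusters = km.cluster(data);    // one Dataset per cluster
        System.out.println("found " + clusters.length + " clusters");
    }
}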
My understanding is that to calculate percentiles, the data needs to be sorted. Would this be possible with a huge amount of data spread across multiple servers, without moving it around?
While MapReduce as a paradigm does not look suited to the problem, Hadoop's implementation of MR is.
Hadoop's implementation of MapReduce is based on a distributed sort - and that is exactly what you need. Hadoop does the sort by moving data between servers only once - not that bad.
I would suggest looking at Hadoop's TeraSort implementation, which illustrates a good (and probably the best) way to sort massive data with Hadoop: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/examples/terasort/package-summary.html
I would first create a histogram, either on one machine or on multiple machines. Once you have a count for each possible value (or bucket of possible values) you can combine these if needed. The gain from using a histogram is that it has O(1) insertion/sort time instead of O(log n), and uses O(M) space, where M is the number of possible values or buckets, instead of O(N), where N is the number of samples.
A histogram is naturally sorted, so you can get a total count and find the percentiles by counting from either end, as in the sketch below.
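A rough Java sketch of that idea, assuming the per-server histograms have already been merged into a single value-to-count map:

import java.util.Map;
import java.util.TreeMap;

public class HistogramPercentile {
    // histogram: value (or bucket) -> number of samples, kept sorted by the TreeMap
    public static double percentile(TreeMap<Double, Long> histogram, double p) {
        long total = 0;
        for (long count : histogram.values()) {
            total += count;
        }
        long targetRank = (long) Math.ceil(p / 100.0 * total);
        long seen = 0;
        for (Map.Entry<Double, Long> entry : histogram.entrySet()) {
            seen += entry.getValue();
            if (seen >= targetRank) {
                return entry.getKey();   // first bucket whose cumulative count reaches the target rank
            }
        }
        throw new IllegalArgumentException("empty histogram");
    }
}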
The answer to your question is yes, it is possible. But MapReduce isn't really designed for this kind of task. MapReduce (as used in a Hadoop cluster, for instance) shines on unstructured or semi-structured data. While it has the ability to process other kinds, it is not best suited for them. (I had one project at a company where they wanted to analyze XML in a Hadoop cluster... it wasn't the most fun thing.)
This scholarly article describes some of the issues with MapReduce on structured data and offers an alternative approach with "Clydesdale". (I have never heard of or used this, so I can neither endorse it nor speak to its strengths/weaknesses.)
I'm looking for more links that offer explanations and alternatives.