I have no experience with either Flink or Spark, and I would like to use one of them for my use case. I'd like to present my use case and hopefully get some insight into whether this can be done with either, and if both can do it, which one would work best.
I have a bunch of entities A stored in a data store (Mongo, to be precise, but it doesn't really matter). I have a Java application that can load these entities and run some logic on them to generate a Stream of some data type E. (To be 100% clear, I don't have the Es in any data set; I need to generate them in Java after I load the As from the DB.)
So I have something like this
A1 -> Stream<E>
A2 -> Stream<E>
...
An -> Stream<E>
The data type E is a bit like a long row in Excel: it has a bunch of columns. I need to collect all the Es and run some sort of pivot aggregation, like you would do in Excel. I can see how I could do that easily in either Spark or Flink.
Now comes the part I cannot figure out.
Imagine that one of the entities, say A1, is changed (by a user or a process); that means that all the Es for A1 need updating. Of course I could reload all my As, recompute all the Es, and then re-run the whole aggregation. But I'm wondering if it's possible to be a bit more clever here.
Would it be possible to only recompute the Es for A1 and do the minimum amount of processing?
For Spark, would it be possible to persist the RDD and only update part of it when needed (here, that would be the Es for A1)?
For Flink, in the case of streaming, is it possible to update data points that have already been processed? Can it handle that sort of case? Or could I perhaps generate negative events for A1's old Es (i.e. events that would remove them from the result) and then add the new ones?
Is that a common use case? Is that even something that Flink or Spark are designed to do? I would think so but again I haven't used either so my understanding is very limited.
I think your question is very broad and depends on many conditions. In Flink you could keep a MapState<A, E>, update only the values for the changed As, and then, depending on your use case, either emit the updated Es downstream or emit the difference (a retraction stream).
In Flink there is also the concept of Dynamic Tables and retraction streams that may inspire you, or maybe the Table API already covers your use case. You can check out the docs here.
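If you go the DataStream route, a minimal sketch of that keyed-state idea might look roughly like this (A, E, Change and computeEs are placeholders for your own types and logic; a ListState is used here because each A produces several Es, and the delete/insert events play the role of a retraction stream):

```java
import java.util.Collections;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keyed by the id of A: remember the Es last emitted for that A, retract them
// when the A changes, then emit the freshly computed Es.
public class RecomputeEsFunction extends KeyedProcessFunction<String, A, Change> {

    private transient ListState<E> lastEmitted;

    @Override
    public void open(Configuration parameters) {
        lastEmitted = getRuntimeContext().getListState(
                new ListStateDescriptor<>("last-emitted-es", E.class));
    }

    @Override
    public void processElement(A updatedA, Context ctx, Collector<Change> out) throws Exception {
        // Retract everything previously derived from this A ("negative" events).
        for (E old : lastEmitted.get()) {
            out.collect(Change.delete(old));
        }
        lastEmitted.clear();

        // Recompute only this A's Es and emit them as additions.
        List<E> fresh = computeEs(updatedA);      // your existing Java logic
        for (E e : fresh) {
            out.collect(Change.insert(e));
        }
        lastEmitted.update(fresh);
    }

    private List<E> computeEs(A a) {
        return Collections.emptyList();           // placeholder for the per-A generation logic
    }
}
```

A downstream aggregation that understands insert/delete changes can then maintain the pivot incrementally instead of recomputing it from scratch.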
I have a Kafka topic and a Spark application. The Spark application gets data from Kafka topic, pre aggregates it and stores it in Elastic Search. Sounds simple, right?
Everything works fine as expected, but the minute I set the "spark.cores" property to something other than 1, I start getting
version conflict, current version [2] is different than the one provided [1]
After researching a bit, I think the error occurs because multiple cores can be working on the same document at the same time; thus, when one core finishes its part of the aggregation and tries to write back to the document, it gets this error.
TBH, I am a bit surprised by this behaviour because I thought Spark and ES would handle this on their own. This leads me to believe that maybe, there is something wrong with my approach.
How can I fix this? Is there some sort of "synchronized" or "lock" sort of concept that I need to follow?
Cheers!
It sounds like you have several messages in the queue that all update the same ES document, and these messages are being processed concurrently. There are two possible solutions:
First, you can use Kafka partitions to ensure that all the messages that update the same ES document are handled in sequence. This assumes that there's some property in your message that Kafka can use to determine how messages map to ES documents.
The other way is the standard way of handling optimistic concurrency conflicts: retry the transaction. If you have some data from a Kafka message that you need to add to an ES document, and the current document in ES is version 1, then you can try to update it and save back version 2. But if someone else already wrote version 2, you can retry by using version 2 as a starting point, adding your new data, and saving version 3.
If either of these approaches destroys the concurrency you were expecting to get from Kafka and Spark, then you may need to rethink your approach. You may have to introduce a new processing stage that does some heavy lifting but doesn’t actually write to ES, then do the ES updates in a separate step.
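As an illustration of the second approach, here is a minimal, entirely hypothetical retry sketch; the DocStore interface stands in for whatever Elasticsearch client you use (the real clients expose the document version on a get and accept an expected version on the write):

```java
import java.util.function.UnaryOperator;

/** Sketch of optimistic-concurrency retry. All names here are made up. */
public class OptimisticRetry {

    /** Re-reads the document, re-applies the change and writes it back until the version check passes. */
    static <T> void updateWithRetry(DocStore<T> store, String id,
                                    UnaryOperator<T> applyChange, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            Versioned<T> current = store.read(id);            // document body + its version
            T updated = applyChange.apply(current.body);
            try {
                store.write(id, updated, current.version);    // write conditional on the version
                return;                                       // success
            } catch (VersionConflictException e) {
                // someone else wrote a newer version in the meantime; loop and re-read
            }
        }
        throw new IllegalStateException("Gave up after " + maxRetries + " attempts");
    }

    // --- hypothetical plumbing, just enough to make the sketch compile ---
    interface DocStore<T> {
        Versioned<T> read(String id);
        void write(String id, T body, long expectedVersion);
    }
    static class Versioned<T> {
        final T body; final long version;
        Versioned(T body, long version) { this.body = body; this.version = version; }
    }
    static class VersionConflictException extends RuntimeException {}
}
```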
I would like to answer my own question. In my use case, I was updating a document counter, so all I had to do was retry whenever a conflict arose, because I just needed to aggregate my counter.
My use case was essentially this:
For many uses of partial update, it doesn’t matter that a document has been changed. For instance, if two processes are both incrementing the page-view counter, it doesn’t matter in which order it happens; if a conflict occurs, the only thing we need to do is reattempt the update.
This can be done automatically by setting the retry_on_conflict parameter to the number of times that update should retry before failing; it defaults to 0.
Thanks to Willis and this blog, I was able to configure the Elasticsearch settings, and now I am not having any problems at all.
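For anyone writing through the elasticsearch-hadoop / elasticsearch-spark connector, the corresponding knob is a Spark configuration property (key names as I remember them from the es-hadoop docs, so verify against your connector version):

```java
import org.apache.spark.SparkConf;

public class EsWriteConfig {
    public static void main(String[] args) {
        // Assumed property names from the elasticsearch-hadoop documentation.
        SparkConf conf = new SparkConf()
                .setAppName("kafka-to-es")
                .set("es.nodes", "localhost:9200")
                .set("es.write.operation", "upsert")        // partial updates instead of full overwrites
                .set("es.update.retry.on.conflict", "3");   // re-apply the update when a version conflict occurs
        // ... build the streaming context / session from this conf as usual
    }
}
```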
I am starting to work with Datasets after several projects in which I worked with RDDs. I am using Java for development.
As far as I understand, columns are immutable - there is no map function for a column, and the standard way to map a column is to add a new column with withColumn.
My question is: what is really happening when I call withColumn? Is there a performance penalty? Should I try to make as few withColumn calls as possible, or does it not matter?
Piggybacked question: Is there any performance penalty when I call any other row/column creation function such as explode or pivot?
The various functions used to interact with a DataFrame are all fast enough that you will never have a problem with them (or really notice them).
This will make more sense if you understand how Spark executes the transformations you define in your driver. When you call the various transformation functions (withColumn, select, etc.), Spark isn't actually doing anything immediately; it just registers the operations you want to run in its execution plan. Spark doesn't start computing on your data until you call an action, typically to get results or write out data.
Knowing all the operations you want to run allows Spark to perform optimizations on the execution plan before actually running it. For example, imagine you use withColumn to create a new column but then drop that column before you write the data out to a file. Spark knows that it never actually needs to compute that column.
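As a small illustration of that laziness (the column name and paths below are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.length;

public class LazyPlanDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("lazy-plan").master("local[*]").getOrCreate();

        Dataset<Row> df = spark.read().parquet("/tmp/input");    // made-up input path

        // Neither withColumn nor drop runs anything here; they only extend the plan.
        Dataset<Row> result = df
                .withColumn("name_len", length(col("name")))     // assumes a "name" column exists
                .drop("name_len");                                // never used downstream

        // Printing the optimized plan shows the dropped column has been pruned away.
        result.explain(true);

        // Only an action (count, collect, write, ...) actually triggers execution.
        result.write().mode("overwrite").parquet("/tmp/output");

        spark.stop();
    }
}
```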
The things that will typically determine the performance of your driver are:
How many wide transformations (shuffles of data between executors) there are, and how much data is being shuffled
Whether there are any expensive transformation functions
For your extra question about explode and pivot:
Explode creates new rows but is a narrow transformation: it can expand each partition in place without needing to move data between executors, so it is relatively cheap to perform. There is an exception if you are exploding very large arrays, as Raphael pointed out in the comments.
Pivot requires a groupBy operation which is a wide transformation. It must send data from every executor to every other executor to ensure that all the data for a given key is in the same partition. This is an expensive operation because of all the extra network traffic required.
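A tiny sketch to make the difference visible in the query plans (the schema and path are made up):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class ExplodePivotDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("explode-pivot").master("local[*]").getOrCreate();

        // Assumed schema: region, month, amount, plus an array column "tags".
        Dataset<Row> df = spark.read().parquet("/tmp/sales");

        // explode: narrow, each partition expands its own rows in place, no shuffle.
        Dataset<Row> exploded = df.select(
                col("region"), col("month"), col("amount"),
                explode(col("tags")).as("tag"));

        // pivot: built on groupBy, so a shuffle moves all rows for a key to one partition.
        Dataset<Row> pivoted = df.groupBy("region").pivot("month").sum("amount");

        exploded.explain();   // no Exchange (shuffle) node in this plan
        pivoted.explain();    // the plan contains an Exchange for the groupBy/pivot

        spark.stop();
    }
}
```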
If the result set is large, then holding the entire result set in memory (in a server-side cache, e.g. Hazelcast) will not be feasible; with large result sets, you simply cannot afford to keep them in memory. In such a case, you have to fetch a chunk of data at a time (query-based paging). The downside of query-based paging is that there will be multiple calls to the database for the multiple page requests.
Can anyone suggest how to implement a hybrid approach?
I haven't put any sample code here since I think the question is more about a logic instead of specific code. Still if you need sample code I can put it.
Thanks in advance.
The most effective solution is to use the primary key as the paging criterion. This enables us to rely on first-class constructs like a BETWEEN range query, which is simple for the RDBMS to optimize, and the primary key of the queried entity will most likely be indexed already.
Retrieving data using a range query on the primary key is a two-step process. First you retrieve the collection of primary keys, then you generate the intervals that identify a proper subset of the data, and finally you run the actual queries against the data.
This approach is almost as fast as the brute-force version, while the memory consumption is about one tenth. By selecting an appropriate page size for this implementation, you can alter the ratio between execution time and memory consumption. This version is also stateless: it does not keep references to resources the way the ScrollableResults version does, nor does it strain the database like the version using setFirstResult/setMaxResults.
Effective pagination using Hibernate
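A rough sketch of that two-step approach with Hibernate (the entity and helper names are made up; with an older Hibernate you would use createQuery(...).list() instead of the typed variant):

```java
import java.util.List;

import org.hibernate.Session;

public class KeyRangePaging {

    /** Sketch of the two-step range approach; Person, its id field and process() are placeholders. */
    static void readInChunks(Session session, int pageSize) {
        // Step 1: fetch only the ordered primary keys (cheap; uses the PK index).
        List<Long> ids = session
                .createQuery("select p.id from Person p order by p.id", Long.class)
                .getResultList();

        // Step 2: cut the key list into intervals and run one BETWEEN query per page.
        for (int from = 0; from < ids.size(); from += pageSize) {
            int to = Math.min(from + pageSize, ids.size()) - 1;
            List<Person> page = session
                    .createQuery("from Person p where p.id between :low and :high order by p.id",
                                 Person.class)
                    .setParameter("low", ids.get(from))
                    .setParameter("high", ids.get(to))
                    .getResultList();
            process(page);   // handle one chunk; nothing else is kept in memory
        }
    }

    static void process(List<Person> page) { /* application logic */ }

    /** Placeholder entity. */
    public static class Person { public Long id; }
}
```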
How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried putting cache() on the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS, so it's not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
Longer answer:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
Reference
Spark Programming Guide.
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is an action and it has clean semantics. Just as important, unlike map, it doesn't imply referential transparency.
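A minimal sketch of that (the path and the upload helper are placeholders for your own code):

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ForeachSideEffect {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "foreach-demo");

        JavaRDD<String> records = sc.textFile("/tmp/input");   // made-up input path

        // records.map(r -> upload(r)) on its own would never run: map only describes a transformation.
        // foreach is an action, so the side effect below executes right away on the executors.
        records.foreach(record -> upload(record));

        sc.close();
    }

    static void upload(String record) {
        // your existing "write to HDFS" logic goes here
    }
}
```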
I work on an application that is deployed on the web. Part of the app is search functions where the result is presented in a sorted list. The application targets users in several countries using different locales (= sorting rules). I need to find a solution for sorting correctly for all users.
I currently sort with ORDER BY in my SQL query, so the sorting is done according to the locale (LC_COLLATE) set for the database. These rules are incorrect for users whose locale differs from the one set for the database.
Also, to further complicate the issue, I use pagination in the application, so when I query the database I ask for rows 1 - 15, 16 - 30, etc. depending on the page I need. However, since the sorting is wrong, each page contains entries that are incorrectly sorted. In a worst case scenario, the entire result set for a given page could be out of order, depending on the locale/sorting rules of the current user.
If I were to sort in (server-side) code, I would need to retrieve all rows from the database and then sort them. That would be a tremendous performance hit given the amount of data, so I would like to avoid it.
Does anyone have a strategy (or even technical solution) for attacking this problem that will result in correctly sorted lists without having to take the performance hit of loading all data?
Tech details: The database is PostgreSQL 8.3, the application an EJB3 app using EJB QL for data query, running on JBoss 4.5.
Are you willing to develop a small Postgres custom function module in C? (Probably only a few days for an experienced C coder.)
strxfrm() is the C function that transforms a language-dependent text string, based on the current LC_COLLATE setting (more or less the current language), into a transformed string that yields the proper collation order for that language when sorted as a binary byte sequence (e.g. with strcmp()).
If you implement this for Postgres, say so that it takes a string and a collation order, then you will be able to ORDER BY strxfrm(textfield, collation_order). I think you could then even create multiple functional indexes on your text column (say, one per language) that store the results of strxfrm(), so that the optimizer will use the index.
Alternatively, you could join the Postgres developers in implementing this in mainstream Postgres. Here are the wiki pages about these issues: Collation, ICU (which, as far as I know, is also used by Java).
Alternatively, as a less sophisticated solution if data input is only through Java, you could compute these strxfrm() values in Java (Java will probably have a different name for this concept) when you add the data to the database, and then let Postgres index and order by these precomputed values.
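In Java, that concept is a CollationKey obtained from a Collator. A sketch of the precompute-in-Java variant, with one extra indexed column per locale (the persistence call is a made-up placeholder):

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SortKeyDemo {
    public static void main(String[] args) {
        // Java's counterpart of strxfrm(): a Collator turns text into a locale-aware sort key.
        Collator collator = Collator.getInstance(new Locale("sv", "SE"));   // e.g. Swedish rules

        List<String> names = Arrays.asList("Örebro", "Zürich", "Oslo");
        for (String name : names) {
            CollationKey key = collator.getCollationKey(name);
            byte[] sortKey = key.toByteArray();
            // Store sortKey in an extra indexed bytea column when the row is written.
            // ORDER BY that column then yields the correct Swedish ordering from a plain
            // binary comparison, and LIMIT/OFFSET paging keeps working unchanged.
            persistSortKey(name, sortKey);   // hypothetical DAO call
        }
    }

    static void persistSortKey(String value, byte[] sortKey) { /* write both columns */ }
}
```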
How tied are you to PostgreSQL? The documentation isn't promising:
The nature of some locale categories is that their value has to be fixed for the lifetime of a database cluster. That is, once initdb has run, you cannot change them anymore. LC_COLLATE and LC_CTYPE are those categories. They affect the sort order of indexes, so they must be kept fixed, or indexes on text columns will become corrupt. PostgreSQL enforces this by recording the values of LC_COLLATE and LC_CTYPE that are seen by initdb. The server automatically adopts those two values when it is started.
(Collation rules define how text is sorted.)
A Google search turns up a patch under discussion:
PostgreSQL currently only supports one collation at a time, as fixed by the LC_COLLATE variable at the time the database cluster is initialised.
I'm not sure I'd want to manage this outside the database, though I'd be interested in reading about how it can be done. (Anyone wanting a good technical overview of the issues should check out Sorting Your Linguistic Data inside the Oracle Database on the Oracle globalization site.)
I don't know of any way to switch the collation used for ORDER BY on a per-query basis. Therefore, one has to consider other solutions.
If the number of results is really big (hundreds of thousands?), I have no solution, other than showing only the number of results and asking the user to make a more precise request. Otherwise, doing it on the server side could work, depending on the precise conditions...
In particular, using a cache could improve things tremendously. The first request to the database (unlimited) would not be much slower than a query limited in the number of results, and the subsequent requests would be much faster. Paging and re-sorting often lead to several requests, so the cache would work well (even with a lifetime of a few minutes).
I use EhCache as a technical solution.
Sorting and paging go together: sort first, then page.
The raw results can be kept in the cache.
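A minimal sketch of that idea, with a plain in-memory map standing in for EhCache (the query key and database helper are made up):

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachedPaging {

    // Stand-in for EhCache: raw (unsorted) results keyed by the query that produced them.
    private static final Map<String, List<String>> cache = new ConcurrentHashMap<>();

    /** Returns one page sorted with the caller's locale; hits the database only on a cache miss. */
    static List<String> page(String queryKey, Locale userLocale, int pageIndex, int pageSize) {
        List<String> raw = cache.computeIfAbsent(queryKey, CachedPaging::runDatabaseQuery);

        List<String> sorted = new ArrayList<>(raw);
        sorted.sort(Collator.getInstance(userLocale));        // sort first ...

        int from = Math.min(pageIndex * pageSize, sorted.size());
        int to = Math.min(from + pageSize, sorted.size());
        return sorted.subList(from, to);                      // ... then page
    }

    private static List<String> runDatabaseQuery(String queryKey) {
        return new ArrayList<>();   // placeholder for the single, unlimited database query
    }
}
```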
To reduce the performance hit, some hints:
you can run the query once just to get the result set size, and warn the user if there are too many results (either ask for confirmation of a slow query, or ask for additional selection fields)
only request the columns you need and drop all the others (often some data is not shown immediately for all results but only displayed on mouse-over, for example; such data can be requested lazily, only as needed, thereby reducing the columns requested for all results)
if you have computed values, cache whichever is smaller: the underlying database columns or the computed values
if some values are repeated across multiple results, you can request that data/those columns separately (so you retrieve them from the database once and cache them only once), and retrieve only a key (typically an id) in the main request.
You might want to check out this package: http://www.fi.muni.cz/~adelton/l10n/postgresql-nls-string/. It hasn't been updated in a long time and may not work anymore, but it seems like a reasonable starting point if you want to build a function that can do this for you.
This module is broken for Postgres 8.4.3. I fixed it - you can download the fixed version from http://www.itreport.eu/__cw_files/.01/.17/.ee7844ba6716aa36b19abbd582a31701/nls_string.c - and you'll have to compile and install it by hand (as described in the related README and INSTALL from the original module), but even so the sorting works incorrectly. I tried it on FreeBSD 8.0 with LC_COLLATE set to cs_CZ.UTF-8.