How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried putting cache() on the map call, but that still doesn't do the trick. My map function actually uploads results to HDFS, so it's not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
Longer explanation:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
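For example, a minimal sketch with the Java RDD API (the path, variable names, and the trivial map function here are purely illustrative, not from the question):

JavaRDD<String> lines = sc.textFile("hdfs:///some/input");   // sc: your JavaSparkContext
JavaRDD<String> parsed = lines.map(String::trim);            // lazy: nothing runs yet
parsed.cache();                                              // only marks the RDD for caching

long first = parsed.count();   // first action: runs the map, caches the partitions
long second = parsed.count();  // served from the cache, no recomputation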
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
Reference
Spark Programming Guide.
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is an action and has clean semantics. What is also important is that, unlike map, it doesn't imply referential transparency.
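For example, a short sketch (Java RDD API; saveToHdfs stands in for your own upload logic and is not a Spark method):

// foreach is an action: it executes right away on the executors, element by element
records.foreach(record -> saveToHdfs(record));   // side effect per element, no new RDD produced
// compare: records.map(record -> saveToHdfs(record)) would only describe work, never run it on its own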
Related
I use executeBatch() with JDBC to insert multiple rows, and I want to get the ids of the inserted rows for another insert. I use this code for that purpose:
insertInternalStatement = dbConnection.prepareStatement(INSERT_RECORD, generatedColumns);
for (Foo foo : foosHashSet) {
    insertInternalStatement.setInt(1, foo.getMe());
    insertInternalStatement.setInt(2, foo.getMe2());
    // ..
    insertInternalStatement.addBatch();
}
insertInternalStatement.executeBatch();

// now get the inserted ids
try (ResultSet generatedKeys = insertInternalStatement.getGeneratedKeys()) {
    Iterator<Foo> fooIterator = foosHashSet.iterator();
    while (generatedKeys.next() && fooIterator.hasNext()) {
        fooIterator.next().setId(generatedKeys.getLong(1));
    }
}
It works fine and the ids are returned. My questions are:
If I iterate over getGeneratedKeys() and foosHashSet, will the ids be returned in the same order, so that each id returned from the database belongs to the corresponding Foo instance?
What about when I use multiple threads and the above code runs in several threads simultaneously?
Is there any other solution for this? I have two tables, foo1 and foo2, and I want to first insert the foo1 records and then use their primary keys as foreign keys in foo2.
Given that support for getGeneratedKeys with batch execution is not defined in the JDBC specification, the behavior will depend on the driver used. I would expect any driver that supports generated keys for batch execution to return the ids in the order they were added to the batch.
However, the fact that you are using a Set is problematic. Iteration order for most sets is not defined and could change between iterations (usually only after modification, but in theory you can't assume anything about the order). You need to use something with a guaranteed order, e.g. a List or maybe a LinkedHashSet.
Applying multi-threading here would probably be a bad idea: you should only use a JDBC connection from a single thread at a time. Accounting for multi-threading would either require correct locking or require you to split up the workload so it can use separate connections. Whether that would improve or worsen performance is hard to say.
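A minimal adjustment along those lines, reusing the code from the question (sketch; how the driver returns generated keys for batches is still driver-dependent, as noted above):

// Fix the iteration order once, and use the same list both for the batch and for the keys
List<Foo> orderedFoos = new ArrayList<>(foosHashSet);

for (Foo foo : orderedFoos) {
    insertInternalStatement.setInt(1, foo.getMe());
    insertInternalStatement.setInt(2, foo.getMe2());
    insertInternalStatement.addBatch();
}
insertInternalStatement.executeBatch();

try (ResultSet generatedKeys = insertInternalStatement.getGeneratedKeys()) {
    for (Foo foo : orderedFoos) {
        if (!generatedKeys.next()) {
            break; // fewer keys than rows added: driver-dependent, treat as an error if needed
        }
        foo.setId(generatedKeys.getLong(1));
    }
}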
You should be able to iterate through the generated keys without a problem; they will be returned in the order the rows were inserted.
I think there should not be any problem adding threads in this matter. The only thing I'm pretty sure of is that you would not be able to control the order in which the ids are inserted into the two tables without some added code complexity.
You could store all of the ids inserted first in a Collection and, after all threads/iterations have finished, insert them into the second table.
The iteration order is the same as long as the foosHashSet is not altered.
One could consider using a LinkedHashSet, which yields the items in insertion order. Especially when nothing is removed or overwritten, that would be nice.
Concurrent access would be problematic.
Use a LinkedHashSet without removal, only adding new items, and additionally wrap it in Collections.synchronizedSet. For set alterations one would need a Semaphore or such, as synchronizing such a large code block is a no-go.
An even better performing solution might be to make a local copy with a fixed iteration order:
List<Foo> list = foosHashSet.stream()
        .collect(Collectors.toList());
However, this is still a somewhat unsatisfying solution: a batch for multiple inserts, and then several other updates/inserts per inserted row.
Transitioning to JPA instead of JDBC would somewhat alleviate the situation.
After some experience, however, I would pose the question whether a database is still the correct tool (hammer) at that point. If the data is a graph or a hierarchical structure, then storing the entire data structure as XML with JAXB in a single database table could be the best solution: faster, easier development, verifiable data.
You would use the database for the main data, and the XML for an edited/processed document.
Yes. As per the Javadoc of executeBatch():
Submits a batch of commands to the database for execution and if all commands execute successfully, returns an array of update counts. The int elements of the array that is returned are ordered to correspond to the commands in the batch, which are ordered according to the order in which they were added to the batch. The elements in the array returned by the method executeBatch may be one of the following:
A number greater than or equal to zero -- indicates that the command was processed successfully and is an update count giving the number of rows in the database that were affected by the command's execution
A value of SUCCESS_NO_INFO -- indicates that the command was processed successfully but that the number of rows affected is unknown
If one of the commands in a batch update fails to execute properly, this method throws a BatchUpdateException, and a JDBC driver may or may not continue to process the remaining commands in the batch. However, the driver's behavior must be consistent with a particular DBMS, either always continuing to process commands or never continuing to process commands. If the driver continues processing after a failure, the array returned by the method BatchUpdateException.getUpdateCounts will contain as many elements as there are commands in the batch, and at least one of the elements will be the following:
A value of EXECUTE_FAILED -- indicates that the command failed to execute successfully and occurs only if a driver continues to process commands after a command fails
The possible implementations and return values have been modified in the Java 2 SDK, Standard Edition, version 1.3 to accommodate the option of continuing to process commands in a batch update after a BatchUpdateException object has been thrown.
I am starting to work with Datasets after several projects in which I worked with RDDs. I am using Java for development.
As far as I understand, columns are immutable - there is no map function for a column, and the standard way to map a column is to add a new column with withColumn.
My question is: what is really happening when I call withColumn? Is there a performance penalty? Should I try to make as few withColumn calls as possible, or doesn't it matter?
Piggybacked question: is there any performance penalty when I call any other row/column creation function, such as explode or pivot?
The various functions used to interact with a DataFrame are all fast enough that you will never have a problem (or even really notice them).
This will make more sense if you understand how Spark executes the transformations you define in your driver. When you call the various transformation functions (withColumn, select, etc.), Spark isn't actually doing anything immediately. It just registers the operations you want to run in its execution plan. Spark doesn't start computing your data until you call an action, typically to get results back or write out data.
Knowing all the operations you want to run allows Spark to perform optimizations on the execution plan before actually running it. For example, imagine you use withColumn to create a new column but then drop that column before you write the data out to a file. Spark knows that it never actually needs to compute that column.
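A quick way to see this yourself (Java Dataset API; the input path and column names are just illustrative) is to look at the plans that explain() prints:

// import static org.apache.spark.sql.functions.col;
Dataset<Row> df = spark.read().parquet("/some/input");       // spark: your SparkSession
Dataset<Row> result = df
        .withColumn("doubled", col("value").multiply(2))      // only recorded in the plan
        .drop("doubled");                                      // Catalyst prunes it away
result.explain(true);  // the optimized plan no longer computes the dropped column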
The things that will typically determine the performance of your Spark job are:
How many wide transformations (shuffles of data between executors) there are, and how much data is being shuffled
Whether there are any expensive transformation functions
For your extra question about explode and pivot:
explode creates new rows but is a narrow transformation: it can change the partitions in place without needing to move data between executors, which means it is relatively cheap to perform. There is an exception to this if you are exploding very large arrays, as Raphael pointed out in the comments.
pivot requires a groupBy operation, which is a wide transformation: it must send data from every executor to every other executor to ensure that all the data for a given key ends up in the same partition. This is an expensive operation because of all the extra network traffic required.
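For illustration (Java Dataset API; assumes a Dataset<Row> named sales with columns region, quarter and amount, which are not from the question):

Dataset<Row> pivoted = sales
        .groupBy("region")   // wide: rows are shuffled so each region ends up in one partition
        .pivot("quarter")    // one output column per distinct quarter value
        .sum("amount");
pivoted.explain();           // the physical plan shows an Exchange (shuffle) step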
I have no experience with either Flink or Spark, and I would like to use one of them for my use case. I'd like to present the use case and hopefully get some insight into whether it can be done with either, and if both can do it, which one would work best.
I have a bunch of entities A stored in a data store (Mongo, to be precise, but it doesn't really matter). I have a Java application that can load these entities and run some logic on them to generate a Stream of some data type E (to be 100% clear, I don't have the Es in any data set; I need to generate them in Java after I load the As from the DB).
So I have something like this
A1 -> Stream<E>
A2 -> Stream<E>
...
An -> Stream<E>
The data type E is a bit like a long row in Excel: it has a bunch of columns. I need to collect all the Es and run some sort of pivot aggregation, like you would do in Excel. I can see how I could do that easily in either Spark or Flink.
Now is the part I cannot figure out.
Imagine that one of the entities, say A1, is changed (by a user or a process); that means that all the Es for A1 need updating. Of course I could reload all my As, recompute all the Es, and then re-run the whole aggregation. But I'm wondering if it's possible to be a bit more clever here.
Would it be possible to only recompute the Es for A1 and do the minimum amount of processing?
For Spark would it be possible to persist the RDD and only update part of it when needed (here that would be the Es for A1)?
For Flink, in the case of streaming, is it possible to update data points that have already been processed? Can it handle that sort of case? Or could I perhaps generate negative events for A1's old Es (i.e. events that would remove them from the result) and then add the new ones?
Is that a common use case? Is that even something that Flink or Spark are designed to do? I would think so but again I haven't used either so my understanding is very limited.
I think your question is very broad and depends on many conditions. In Flink you could have a MapState<A, E>, only update the values for the changed As, and then, depending on your use case, either emit the updated Es downstream or emit the difference (a retraction stream).
In Flink there is also the concept of Dynamic Tables and retraction streams that may inspire you, or maybe the Table API even covers your use case already. You can check out the docs here.
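To make the keyed-state idea above a little more concrete, here is a rough, untested Java sketch. EntityA, RowE, the Change wrapper and computeEs are placeholders for your own types and logic: key the stream of changed As by id, remember the Es produced for that A last time, and emit retractions before the recomputed values so a downstream aggregation only applies the delta.

import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class RecomputeEsForChangedA extends KeyedProcessFunction<String, EntityA, Change> {

    private transient ListState<RowE> previousEs; // Es produced for this A last time

    @Override
    public void open(Configuration parameters) {
        previousEs = getRuntimeContext().getListState(
                new ListStateDescriptor<>("previous-es", RowE.class));
    }

    @Override
    public void processElement(EntityA a, Context ctx, Collector<Change> out) throws Exception {
        // retract the old Es so the downstream aggregate can subtract them
        Iterable<RowE> old = previousEs.get();
        if (old != null) {
            for (RowE e : old) {
                out.collect(Change.retract(e));
            }
        }
        List<RowE> fresh = computeEs(a);     // your existing A -> Es logic
        previousEs.update(fresh);
        for (RowE e : fresh) {
            out.collect(Change.insert(e));   // add the recomputed Es
        }
    }

    private List<RowE> computeEs(EntityA a) {
        // placeholder: plug in the Java logic you already have
        throw new UnsupportedOperationException("A -> Stream<E> logic goes here");
    }
}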
I want to use Spring Batch to process CSV files. Each CSV file contains one record per line. For a given file, some records may be inter-related, i.e. the processing of such records MUST follow the order in which they appear in the file. Using the regular sequential approach (i.e. a single thread for the entire file) gives me bad performance, therefore I want to use the partitioning feature. Due to my processing requirement, inter-related records MUST be in the same partition (and in the order in which they appear in the file). I thought about the idea of using a hash-based partitioning algorithm with a carefully chosen hash function (so that near equally sized partitions are created).
Any idea if this is possible with Spring Batch?
How should the Partitioner be implemented for such a case? According to one of the Spring Batch authors/developers, the master does not send the actual data, only the information required for the slave to obtain the data it is supposed to process. In my case, I guess this information would be the hash value. Therefore, does the FlatFileItemReader of each slave need to read the entire file line by line, skipping the lines with a different hash?
Thanks,
Mickael
What you're describing is something normally seen in batch processing. You have a couple of options here:
Split the file by sequence and partition based on the created files - In this case, you'd iterate through the file once to divide it up into the lists of records that need to be processed in sequence. From there, you can use the MultiResourcePartitioner to process each file in parallel (see the sketch below).
Load the file into a staging table - This is the easier method, IMHO. Load the file into a staging table; from there, you can partition the processing based on any number of factors.
In either case, the result allows you to scale the process out as wide as you need to achieve the performance you're after.
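As a rough illustration of the first option (a fragment of a Java @Configuration class in Spring Batch 4 style; the split-file location, bean names and the FieldSet-based reader are placeholders, and the step that produces the split files is not shown):

@Bean
public Partitioner filePartitioner() throws IOException {
    MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
    // one file per group of inter-related records, produced by a prior splitting step
    partitioner.setResources(new PathMatchingResourcePatternResolver()
            .getResources("file:/tmp/splits/group-*.csv"));
    return partitioner;
}

@Bean
@StepScope
public FlatFileItemReader<FieldSet> partitionedReader(
        @Value("#{stepExecutionContext['fileName']}") Resource file) {
    return new FlatFileItemReaderBuilder<FieldSet>()
            .name("partitionedReader")
            .resource(file)                                  // each partition reads only its own file
            .lineTokenizer(new DelimitedLineTokenizer())
            .fieldSetMapper(new PassThroughFieldSetMapper())
            .build();
}

@Bean
public Step masterStep(StepBuilderFactory steps, Step workerStep, Partitioner filePartitioner) {
    return steps.get("masterStep")
            .partitioner("workerStep", filePartitioner)
            .step(workerStep)
            .taskExecutor(new SimpleAsyncTaskExecutor())     // run partitions in parallel threads
            .build();
}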
FlatFileItemReader is not thread safe, so you cannot simply use it in parallel processing.
There is more info in the docs:
Spring Batch provides some implementations of ItemWriter and ItemReader. Usually they say in the Javadocs if they are thread safe or not, or what you have to do to avoid problems in a concurrent environment. If there is no information in Javadocs, you can check the implementation to see if there is any state. If a reader is not thread safe, it may still be efficient to use it in your own synchronizing delegator. You can synchronize the call to read() and as long as the processing and writing is the most expensive part of the chunk your step may still complete much faster than in a single threaded configuration.
I think your question is somewhat of a duplicate of this one: multithreaded item reader
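If you do go the synchronizing route mentioned in the docs above, recent Spring Batch versions already provide a delegator for it. A minimal sketch (the delegate reader bean is assumed to be defined elsewhere in your configuration):

@Bean
public SynchronizedItemStreamReader<FieldSet> synchronizedReader(FlatFileItemReader<FieldSet> delegate) {
    SynchronizedItemStreamReader<FieldSet> reader = new SynchronizedItemStreamReader<>();
    reader.setDelegate(delegate);  // calls to read() are now serialized across threads
    return reader;
}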
I am using the REPLACE output type, meaning the MR result is stored in a collection.
Two HTTP requests doing MR simultaneously in different threads means I cannot use the same output collection name, so there will be one collection per request, which may result in the creation of many MR result collections.
How do you deal with this situation? How do you limit the number of concurrent requests? Do you keep the MR result collections around in case the queries repeat?
In short, I am interested to know how others manage these MR collections (if at all).
I am using the MongoDB Java driver (2.7.3) and Restlet (2.0.10).
Thanks.
Well, if the results are going to be used more than once, it makes sense to create a unique collection for each map-reduce query and, whenever you need an answer, retrieve it from that collection.
Putting up a flag in the server indicating that an MR job is running might save you from executing simultaneous MR jobs. If the MR job is resource-consuming, it is good practice to keep the result of the MR somewhere and retrieve it whenever you need it.
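One way to implement the "keep and reuse the result" idea (sketch with the 2.x Java driver; the collection, query, map/reduce functions and naming scheme are only illustrative):

// db: your com.mongodb.DB instance
DBCollection source = db.getCollection("events");
DBObject query = new BasicDBObject("type", "click");
String mapFunction = "function() { emit(this.type, 1); }";
String reduceFunction = "function(key, values) { return Array.sum(values); }";

// same query -> same output collection, so repeated requests share one result collection
String outName = "mr_result_" + Integer.toHexString(query.toString().hashCode());

if (!db.collectionExists(outName)) {
    // only run the expensive map-reduce when there is no cached result yet
    source.mapReduce(mapFunction, reduceFunction, outName,
            MapReduceCommand.OutputType.REPLACE, query);
}
DBCursor results = db.getCollection(outName).find();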