I am starting to work with Datasets after several projects in which I worked with RDDs. I am using Java for development.
As far as I understand, columns are immutable - there is no map function for a column, and the standard way to transform a column is to add a new one with withColumn.
My question is: what is really happening when I call withColumn? Is there a performance penalty? Should I try to make as few withColumn calls as possible, or does it not matter?
Piggybacked question: Is there any performance penalty when I call any other row/column creation function such as explode or pivot?
The various functions used to interact with a DataFrame are all fast enough that you will never have a problem (or really notice them).
This will make more sense if you understand how Spark executes the transformations you define in your driver. When you call the various transformation functions (withColumn, select, etc.), Spark isn't actually doing anything immediately. It just registers the operations you want to run in its execution plan. Spark doesn't start computing on your data until you call an action, typically to get results or write out data.
Knowing all the operations you want to run allows Spark to perform optimizations on the execution plan before actually running it. For example, imagine you use withColumn to create a new column, but then drop that column before you write the data out to a file. Spark knows that it never actually needs to compute that column.
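For example, here is a minimal Java sketch (the input path and column names are made up) that makes this visible: a column is added with withColumn, dropped again before any action, and explain() prints the optimized plan, in which that column never gets computed.

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WithColumnPlan {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("withColumn-plan").getOrCreate();

        // Hypothetical input path and column names, purely for illustration.
        Dataset<Row> df = spark.read().parquet("/data/events.parquet");

        Dataset<Row> result = df
                .withColumn("doubled", col("value").multiply(2)) // only registered in the plan
                .drop("doubled");                                // dropped before any action runs

        // explain(true) prints the plans; the optimizer prunes the "doubled" column
        // because nothing downstream ever needs it.
        result.explain(true);

        spark.stop();
    }
}
```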
The things that will typically determine the performance of your driver are:
How many wide transformations (shuffles of data between executors) are there and how much data is being shuffled
Do I have any expensive transformation functions
For your extra question about explode and pivot:
Explode creates new rows, but it is a narrow transformation: it can change partitions in place without needing to move data between executors. This means it is relatively cheap to perform. There is an exception if you are exploding very large arrays, as Raphael pointed out in the comments.
Pivot requires a groupBy operation which is a wide transformation. It must send data from every executor to every other executor to ensure that all the data for a given key is in the same partition. This is an expensive operation because of all the extra network traffic required.
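As a rough illustration, here is a hedged Java sketch (the input path, schema and column names are assumptions) contrasting the two: the explode on its own needs no shuffle, while groupBy plus pivot does.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplodeVsPivot {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("explode-vs-pivot").getOrCreate();

        // Assumed schema: orders(customer, items array<string>, country, amount)
        Dataset<Row> orders = spark.read().parquet("/data/orders.parquet");

        // Narrow: each partition expands its own rows, no data moves between executors.
        Dataset<Row> exploded = orders.withColumn("item", explode(col("items")));

        // Wide: groupBy + pivot shuffles data so all rows for a customer end up together.
        Dataset<Row> pivoted = orders.groupBy("customer").pivot("country").agg(sum("amount"));

        exploded.explain(); // no Exchange (shuffle) introduced by the explode itself
        pivoted.explain();  // an Exchange appears for the aggregation

        spark.stop();
    }
}
```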
In Java code I am trying to fetch 3500 rows from a DB (Oracle). It takes almost 15 seconds to load the data. I have also tried storing the result in a cache and retrieving it from there. I am using a simple SELECT statement and displaying 8 columns from a single table (no joins used), using a List to hold the data from the DB and using it as the source for the DataTable. I have also considered the hardware side, such as RAM capacity, storage, network speed, etc.; it exceeds the minimum requirements comfortably. Can you help me make it quicker (it shouldn't take more than 3 seconds)?
Have you implemented proper indexing on your tables? I hesitate to ask, since this is a very basic way of optimizing tables for queries and you mention that you have already tried several approaches. One workaround that works for me: if the purpose of the query is to display results, the code can be designed so that the query displays the initial data immediately while it is still loading more. This means using a separate thread for loading and a separate thread for displaying, as sketched below.
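As a rough sketch of that idea (the JDBC URL, credentials, query and batch size are all placeholders), the loading can run on its own thread and hand partial batches to whatever renders the table:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

public class IncrementalLoader {

    // Hypothetical connection details and query.
    private static final String URL = "jdbc:oracle:thin:@//dbhost:1521/ORCL";
    private static final String SQL = "SELECT c1, c2, c3, c4, c5, c6, c7, c8 FROM my_table";

    private final ExecutorService loader = Executors.newSingleThreadExecutor();

    /** Loads rows in the background and hands partial batches to the display callback. */
    public void load(Consumer<List<Object[]>> displayBatch) {
        loader.submit(() -> {
            try (Connection con = DriverManager.getConnection(URL, "user", "password");
                 PreparedStatement ps = con.prepareStatement(SQL)) {
                ps.setFetchSize(500); // stream rows instead of waiting for the full result
                try (ResultSet rs = ps.executeQuery()) {
                    List<Object[]> batch = new ArrayList<>();
                    while (rs.next()) {
                        batch.add(new Object[] {
                                rs.getObject(1), rs.getObject(2), rs.getObject(3), rs.getObject(4),
                                rs.getObject(5), rs.getObject(6), rs.getObject(7), rs.getObject(8)});
                        if (batch.size() == 500) {
                            displayBatch.accept(batch); // show what has arrived so far
                            batch = new ArrayList<>();
                        }
                    }
                    if (!batch.isEmpty()) {
                        displayBatch.accept(batch);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        });
    }
}
```

How displayBatch updates the table is up to your UI framework; the point is only that rendering starts before the full result set has been fetched.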
It is most likely that the core problem is that you have one or more of the following:
a poorly designed schema,
a poorly designed query,
a badly overloaded database, and/or
a badly overloaded / underprovisioned network connection between the database and your client.
No amount of changing the client-side (Java) code is likely to make a significant difference (i.e. a 5-fold speed-up) ... unless you are doing something crazy in the way you are building the list, or the bottleneck is in the display code rather than the retrieval.
You need to use some client-side and server-side performance tools to figure out whether the real bottleneck is the client, the server or the network. Then use those results to decide where to focus your attention.
How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried putting cache() after the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS, so it's not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
Longer answer:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
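For example, a minimal Java sketch (with made-up data) of using count to force a map to actually run:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ForceMap {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("force-map"));

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4));

        // Nothing runs yet: map only records the transformation in the lineage.
        JavaRDD<Integer> mapped = numbers.map(x -> x * 2);

        // count() is an action, so the map function is actually executed here.
        long n = mapped.count();
        System.out.println("processed " + n + " elements");

        sc.stop();
    }
}
```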
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
Reference
Spark Programming Guide.
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is an action and it has clean semantics. Just as important, unlike map, it doesn't imply referential transparency.
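A minimal Java sketch of that idea, assuming an RDD of strings and with the upload call left as a placeholder for whatever side effect your job performs:

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SideEffectWithForeach {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("foreach-side-effect"));

        JavaRDD<String> records = sc.parallelize(Arrays.asList("a", "b", "c"));

        // foreach is an action: it runs immediately on the executors and is the
        // idiomatic place for side effects such as uploading each record somewhere.
        records.foreach(record -> upload(record));

        sc.stop();
    }

    // Placeholder for the actual side effect (e.g. an HDFS or HTTP upload).
    private static void upload(String record) {
        System.out.println("uploading " + record);
    }
}
```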
There doesn't seem to be any direct way to know the number of affected rows in Cassandra for update and delete statements.
For example if I have a query like this:
DELETE FROM xyztable WHERE PKEY IN (1,2,3,4,5,6);
Now, of course, since I've passed 6 keys, it is obvious that 6 rows will be affected.
But, as in the RDBMS world, is there any way to know the number of affected rows for update/delete statements with the datastax-driver?
I've read that Cassandra gives no feedback on write operations here.
Beyond that, I could not find any other discussion on this topic through Google.
If that's not possible, can I be sure that with the type of query given above, it will either delete all or fail to delete all?
In the eventually consistent world, you can look at these operations as saving a delete request and, depending on the requested consistency level, waiting for confirmation from several nodes that the request has been accepted. The request is then delivered to the other nodes asynchronously.
Since there is no dependency on anything like foreign keys, nothing should stop data from being deleted if the request was successfully accepted by the cluster.
However, there are a lot of ifs. For example, a delete performed at consistency level ONE and successfully accepted by one node, followed by an immediate hard failure of that node, may be lost if it was not replicated before the failure.
Another example: during the deletion, one node was down and stayed down for a significant amount of time, longer than gc_grace_seconds, i.e., longer than it takes for the tombstones to be removed along with the deleted data. If that node is then recovered, all the data that was deleted from the rest of the cluster, but not from this node, will suddenly be brought back to the cluster.
So, to avoid these situations and to consider operations successful and final, a Cassandra admin needs to put some measures in place, including regular repair jobs (to make sure all nodes are up to date). Applications also need to decide what matters more: faster performance with consistency level ONE at the expense of possible data loss, or lower performance with higher consistency levels and less chance of data loss.
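For illustration, here is a hedged sketch with the DataStax Java driver (3.x API; the contact point and keyspace are placeholders) showing how an application chooses the consistency level per statement:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class DeleteWithConsistency {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {

            SimpleStatement delete =
                    new SimpleStatement("DELETE FROM xyztable WHERE PKEY IN (1,2,3,4,5,6)");

            // ONE: fastest, acknowledged by a single replica, most exposed to the scenarios above.
            // QUORUM: slower, but a majority of replicas must acknowledge the delete.
            delete.setConsistencyLevel(ConsistencyLevel.QUORUM);

            session.execute(delete);
        }
    }
}
```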
There is no way to do this in Cassandra because the model for writes, deletes, and updates in Cassandra is basically the same. In all of those cases a cell is added to the table which has either the new information or information about the delete. This is done without any inspection of the current DB state.
Without checking the rest of the replicas and doing a full merge on the row, there is no way to tell whether any operation will actually affect the current read state of the database.
This leads to the oft-cited anti-pattern of "reading before a write." In Cassandra you are meant to write as fast as possible, and if you need history, use a data structure that preserves a log of modifications rather than just the current state.
There is one option for doing queries like this: the compare-and-set (lightweight transaction) syntax, where you attach an IF condition to the statement. But this is a very expensive operation compared to a normal write and should be used sparingly.
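Conditional statements are also the one place where the driver reports back whether anything matched, via ResultSet.wasApplied(). A hedged sketch with the DataStax Java driver 3.x (contact point and keyspace are placeholders, and PKEY is assumed to be the full primary key):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;

public class ConditionalDelete {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {

            // IF EXISTS turns the delete into a lightweight transaction (a Paxos round),
            // which is far more expensive than a plain write.
            ResultSet rs = session.execute("DELETE FROM xyztable WHERE PKEY = 1 IF EXISTS");

            // wasApplied() says whether the condition held, i.e. whether a row was there to delete.
            System.out.println("row existed and was deleted: " + rs.wasApplied());
        }
    }
}
```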
I'm fairly new to the App Engine datastore, but I get that it is designed more like a Hashtable than a database table. This leads me to think it's better to have fewer rows (entities) and more columns (object properties) "in general".
That is, you can create a Car object with properties color and count or you can create it with properties redCount, blueCount, greenCount, assuming you know all the colors (dimensions). If you are storing instances of those objects you would have either three or one:
For each color and count, save new entity:
"red", 3
"blue", 8
"green", 4
Or save one entity with properties for each possible color: 3, 8, 4
Obviously there are some design challenges with the latter approach, but I'm wondering what the advice is for getting out of relational thinking. It seems the datastore is quite happy with hundreds of "columns"/properties.
Good job trying to get out of relational thinking. It's good to move away from the row/table thinking.
A closer approximation, at least on the programming side, would be to think of entities as data structures or class instances stored remotely. These entities have properties. Separate from the entities are indexes, which essentially store lists of entities that match certain criteria for properties.
When you write an entity, the datastore updates that instance in memory/storage, and then updates all the indexes.
When you do a query, you essentially walk through one of the index lists.
That should give you a basic framework to think about the datastore.
When you design for the datastore, you generally have to design for cost, and to a lesser degree, performance. On the write side, you want to minimize the number of indexes. On the read side, you want to minimize the number of entities you're reading, so the idea of having separate entities for red, blue, green could be a bad idea, tripling your read costs if you constantly need to read back the number of red/blue/green cars. There could be some really obscure corner case where this makes sense.
Your design considerations generally should go along the lines of:
What types of queries do I need to do?
How do I structure my data to make these queries easy to do (since the GAE query capabilities are limited)? Would a query be easier if I duplicate data somehow, and would I be able to maintain this duplicated data on my own?
How can I minimize the number of indexes that need to be updated when I update an entity?
Are there any special cases where I must have full consistency and therefore need to adjust the structure so that consistent queries can be made?
Are there any write performance cases I need to be careful about?
Without knowing exactly what kind of query you're going to make, this answer will likely not be right, but it should illustrate how you might want to think of this.
I'll assume you have an application where people register their cars, and a dashboard that polls the datastore and displays the number of cars of each color. In that case, the traditional mechanism of having a Car class with color and count attributes still makes sense, because it minimizes the number of indexed properties and thus reduces your write costs.
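As a hedged sketch with the low-level Java Datastore API (the kind name "CarCount" and the property names are assumptions, and this is meant to run inside an App Engine environment), indexing only the property you query on looks roughly like this:

```java
import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Query.FilterOperator;
import com.google.appengine.api.datastore.Query.FilterPredicate;

public class CarCounts {
    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    /** One entity per color: only "color" is indexed, "count" is never queried on. */
    public void saveCount(String color, long count) {
        Entity e = new Entity("CarCount");
        e.setProperty("color", color);          // indexed, used in queries
        e.setUnindexedProperty("count", count); // unindexed, cheaper to write
        datastore.put(e);
    }

    /** Fetching the count for one color walks the "color" index. */
    public Long countFor(String color) {
        Query q = new Query("CarCount")
                .setFilter(new FilterPredicate("color", FilterOperator.EQUAL, color));
        Entity e = datastore.prepare(q).asSingleEntity();
        return e == null ? null : (Long) e.getProperty("count");
    }
}
```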
It's a bit of an odd example, because I can't tell whether you want a single entity that keeps track of your counts (in which case you don't even need to do a query, you can just fetch that count), or a number of count entities that you fetch and sum up.
If user updates modify the same entity, though, you might run into performance problems; you should read through this: https://developers.google.com/appengine/articles/sharding_counters
I would recommend not storing things in your own custom format in a single cell, unless it is encoded in JSON or something similar.
{"red": 3, "blue": 4}
JSON is OK because it can easily be decoded into a Java data structure such as a map or a list.
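For example, a small sketch using Jackson (just one of several libraries that would work; the stored string stands in for the value of a single property) to decode such a value into a Map:

```java
import java.util.Map;

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DecodeCounts {
    public static void main(String[] args) throws Exception {
        // This string would come out of a single datastore property/cell.
        String stored = "{\"red\": 3, \"blue\": 4}";

        ObjectMapper mapper = new ObjectMapper();
        Map<String, Integer> counts =
                mapper.readValue(stored, new TypeReference<Map<String, Integer>>() {});

        System.out.println(counts.get("red")); // 3
    }
}
```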
There is nothing wrong with lots of columns in an app. You will get more gains by having a column for red, blue and green. There would have to be a very large number of columns before you saw a big slowdown.
I think it's safe to say that there is no significant performance penalty for having a lot of properties (columns) for each entity (row) in a database model. Nor is there a penalty for lots of rows (entities), or even lots of tables (db classes). If I were doing your example, I would definitely set up separate properties for color and count. We always explicitly call out indexed=False/True to avoid the dreaded problem of wondering why your indexes are so large when you only have a few properties indexed (forgetting that the default is True). Although GAE gives you nice properties such as lists that can be indexed, these specialized properties are not without their overhead costs. Understand them well whenever you use them.
One thing that I think is important to remember with GAE when plotting your design is that standard queries are slow, and slow equates to increased latency, and increased latency results in more instances and more expense (and other frustrations). Before defaulting to a standard query, always ask (if this is a mission-critical part of your code) whether you can accomplish the same thing by setting up a more denormalized data structure. For example, linking a set of entities together using a common key and then doing a series of get_by_id() calls can often be advantageous (be sure to manage ndb's automatic memcache when doing this - not everything needs to be cached). Ancestor queries are also much faster than standard queries (but impose a limit of roughly one update per second on the entity group).
Concluding: within reason, the number of properties (columns) in an entity (row) and the total number of classes (tables) will not impose any real issues. However, if you are coming from a standard relational DB background, your inclination will be to use SQL-like queries to move your logic along. Remember that in GAE, standard GQL queries are slow and costly, so always think about using denormalization to avoid them. GAE is a big, flat, highly performant NoSQL-like resource. Use it as such. Take the extra time to avoid reliance on GQL queries; it will be worth it.
I have a database with a lot of web pages stored.
I will need to process all the data I have, so I have two options: retrieve the data into the program, or process it directly in the database with some functions I will create.
What I want to know is:
is doing some processing in the database, rather than in the application, a good idea?
when is this recommended, and when not?
are there pros and cons?
is it possible to extend the language with new features (external APIs/libraries)?
I tried retrieving the content into the application (it worked), but it was too slow and messy. My concern was that I can't do in the database what I can do in Java, but I don't know if this is true.
ONLY an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to determine whether a word between two tokens classified as 'Proper Name' is part of a name or not.
I will need to process all the data. In this case, is doing it directly in the database better than retrieving it into the application?
My concern was that I can't do in the database what I can do in Java, but I don't know if this is true.
No, that is not a correct assumption. There are valid circumstances for using the database to process data. For example, if the work involves issuing a lot of disparate SQL statements that can be combined into a stored procedure, then you should do the processing in the stored procedure and call it from your Java application. This way you avoid making several network trips to the database server.
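As a rough sketch (the connection details and procedure name are entirely hypothetical), calling such a stored procedure from Java via JDBC looks like this:

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class StoredProcCall {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details and procedure name.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             CallableStatement cs = con.prepareCall("{call process_tokens(?)}")) {

            cs.setInt(1, 1000); // e.g. a batch size parameter
            cs.execute();       // one round trip instead of many individual SQL calls
        }
    }
}
```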
I do not know what you are processing, though. Are you parsing XML data stored in your database? Then perhaps you should use XQuery; many modern databases support it.
ONLY an example: I have a table called Token. At the moment it has 180,000 rows, but this will increase to over 10 million rows. I need to do some processing to determine whether a word between two tokens classified as 'Proper Name' is part of a name or not.
Is there some indicator in the data that tells you it's a proper name? Fetching 10 million rows (highly susceptible to an OutOfMemoryError) and then iterating over them is not a good idea. If there are criteria that can be put into a WHERE clause to limit the amount of data being fetched, that is the way to go in my opinion. You will certainly need to run EXPLAIN on your SQL, check that the correct indexes are in place, and look at the index clustering ratio and the type of index; all of that will make a difference. If you can't fully eliminate all the "improper names" with SQL, then get rid of as many as you can there and process the rest in your application (a sketch of this pattern follows below). I am assuming this is a batch application, right? If it is a web application, then you definitely want a separate batch application to stage the data before the web application queries it.
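A hedged sketch of that pattern (the connection URL, table/column names and fetch size are assumptions): filter as much as possible in the WHERE clause and stream the remainder instead of materialising everything in memory.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class TokenScan {
    public static void main(String[] args) throws Exception {
        // Hypothetical column names; the WHERE clause discards rows the application
        // would otherwise have to filter out itself.
        String sql = "SELECT id, word, classification FROM Token "
                   + "WHERE classification = ? ORDER BY id";

        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://dbhost/corpus", "user", "password");
             PreparedStatement ps = con.prepareStatement(sql)) {

            ps.setString(1, "Proper Name");
            ps.setFetchSize(1000); // stream in chunks instead of loading 10 million rows at once

            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    process(rs.getLong("id"), rs.getString("word"));
                }
            }
        }
    }

    // Placeholder for the application-side part of the analysis.
    private static void process(long id, String word) { /* ... */ }
}
```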
I hope my explanation makes sense. Please let me know if you have questions.
Directly interacting with the DB for every single operation is tedious and hurts performance. There are several ways to get around this: you can use indexing, caching, or a tool such as Hibernate, whose caching keeps frequently used data in memory so that you don't need to query the DB for every operation. There are also tools such as Lucene indexes, which are very popular and could solve the problem of hitting the DB every time.