I have a PairRDD<Metadata, BigData>.
I want to do two actions: one on all the data in the RDD and then another action on only the Metadata.
The input comes from reading massive files, which I don't want to repeat.
I understand that the classic thing to do is to use cache() or persist() on the input RDD so that it is kept in memory:
JavaPairRDD<Metadata, BigData> inputRDD = expensiveSource();
JavaPairRDD<Metadata, BigData> cachedRDD = inputRDD.cache();
cachedRDD.foreach(doWorkOnAllData);
cachedRDD.keys().foreach(doWorkOnMetadata);
The problem is that the input is so big that it doesn't fit in memory and cache() therefore does not do anything.
I could use persist() to cache it on disk, but since the data is so big, saving and reading all that data would actually be slower than reading the original source.
I could use MEMORY_ONLY_SER to gain a bit of space, but it is probably not enough, and even then serializing the whole thing when I am only interested in 0.1% of the data seems silly.
What I want is to cache only the key part of my PairRDD. I thought I could do that by calling cache() on the keys() RDD:
JavaPairRDD<Metadata, BigData> inputRDD = expensiveSource();
JavaRDD<Metadata> cachedRDD = inputRDD.keys().cache();
inputRDD.foreach(doWorkOnAllData);
cachedRDD.foreach(doWorkOnMetadata);
But in that case it doesn't seem to cache anything and just goes back to loading the source.
Is it possible to only put a part of the data in cache? The operation on the metadata is ridiculously small but I have to do it after the operation on the whole data.
Spark will only serve your data from the cache when the action runs on the RDD you actually cached, i.e. the one returned by inputRDD.keys().cache(); actions on inputRDD itself will not use it.
What you can try is:
JavaRDD<Metadata> keys = inputRDD.keys().cache();
to cache your JavaRDD<Metadata>.
Then, if you need the full pair RDD again, you can rebuild it by joining the cached keys back with the big values, for example (pseudocode; bigDataRDD stands for a pair RDD holding the big values, since join is only defined on pair RDDs):
JavaPairRDD<Metadata, BigData> cachedRDD = keys.join(bigDataRDD)
Also, if your RDD is huge, reading from the cache is slower the first time, because the RDD has to be written to the cache first, but the next time you read it, it will be faster.
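A minimal sketch of that idea, reusing the names from the question (expensiveSource, doWorkOnAllData and doWorkOnMetadata are assumed to exist):

JavaPairRDD<Metadata, BigData> inputRDD = expensiveSource();
JavaRDD<Metadata> keys = inputRDD.keys().cache();   // only the small Metadata objects get cached

inputRDD.foreach(doWorkOnAllData);   // full pass over the source; does not fill the keys cache
keys.foreach(doWorkOnMetadata);      // first action on keys: re-reads the source and fills the cache
// any further action on keys (e.g. keys.count()) is now served from the cache

Note that this does not avoid the second read of the source; it only makes any further work on the metadata cheap.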
For small amounts of data we store all the keys in one bin as a List.
But there are limitations on the size of a bin.
The scanAll function with ScanCallback in the Java client is actually very slow, so we cannot afford it in our project. Aerospike is fast when you give it the key.
Now we have some sets with a lot of records and keys. What is the best way to store all the keys, or is there some way to get them quickly without scanAll?
Scanning small sets is currently an inefficient operation, because there are 4K logical partitions, and a scan thread has to reduce each of those partitions during the scan. Small sets don't necessarily have records in all the partitions, so you're paying for the overhead of scanning those regardless. This is likely to change in future versions, but is the case for now.
There are two ways to get all the records in a set faster:
If you actually know what the key space is like, you can iterate over batch-reads to fetch them (which can also be done in parallel). Trying to access a non-existent key in a batch-read does not cause an error, it just comes back with no value in the specific index.
Alternatively, you can add a bin that has the set name, and create a secondary index over that bin, then query for all the records WHERE setname=XYZ. This will come back much faster than the scan, for a small set.
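A rough sketch of the batch-read approach with the Aerospike Java client; the namespace, set name and key pattern below are assumptions for illustration:

import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

// Build the keys you already know about; a non-existent key simply yields a null entry.
Key[] keys = new Key[1000];
for (int i = 0; i < keys.length; i++) {
    keys[i] = new Key("test", "mySmallSet", "user-" + i);
}

Record[] records = client.get(null, keys);   // one batch call instead of a scan
for (Record record : records) {
    if (record != null) {
        // process the record
    }
}
client.close();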
How can I force Spark to execute a call to map, even if it thinks it does not need to be executed due to its lazy evaluation?
I have tried adding cache() to the map call, but that still doesn't do the trick. My map method actually uploads results to HDFS, so it's not useless, but Spark thinks it is.
Short answer:
To force Spark to execute a transformation, you'll need to require a result. Sometimes a simple count action is sufficient.
TL;DR:
Ok, let's review the RDD operations.
RDDs support two types of operations:
transformations - which create a new dataset from an existing one.
actions - which return a value to the driver program after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away.
Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Conclusion
To force Spark to execute a call to map, you'll need to require a result. Sometimes a count action is sufficient.
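As a minimal sketch of that (records and uploadToHdfs are assumed names, not from the question):

JavaRDD<String> uploaded = records.map(line -> {
    uploadToHdfs(line);   // the side effect you want to happen
    return line;
});
uploaded.count();         // the action forces the lazy map above to actually execute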
Reference
Spark Programming Guide.
Spark transformations only describe what has to be done. To trigger an execution you need an action.
In your case there is a deeper problem. If the goal is to create some kind of side effect, like storing data on HDFS, the right method to use is foreach. It is an action and it has clean semantics. What is also important is that, unlike map, it doesn't imply referential transparency.
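For example, a short sketch (records and uploadToHdfs are assumed names):

records.foreach(line -> uploadToHdfs(line));   // foreach is an action, so it runs immediately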
Very open question:
I need to write a Java client that reads millions of records (let's say account information) from an Oracle database, dumps them into XML, and sends them through web services to a vendor.
What is the most optimized way to do this, starting from fetching the millions of records? I went the JPA/Hibernate route and got OutOfMemory errors fetching 2 million records.
Is JDBC a better approach? Fetch each row and build the XML as I go? Any other alternatives?
I am not an expert in Java so any guidance is appreciated.
We faced a similar problem some time back, with more than 2M records. This is how we approached it.
Using any OR mapping tool was simply ruled out due to large overheads, such as creating large POJOs, which is not really required if the data is just being dumped to XML.
Plain JDBC is the way to go. The main advantage is that it returns a ResultSet object that does not actually hold all the results at once, so loading the entire data set into memory is no longer a problem; the data is loaded as we iterate over the ResultSet.
Next comes the creation of the XML file. We create an XML file and open it in append mode.
Now, in the loop where we iterate over the ResultSet, we create XML fragments and append them to the XML file. This goes on until the entire ResultSet has been iterated.
In the end what we have is an XML file with all the records.
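A stripped-down sketch of that loop; the connection details, table, columns and element names are placeholders, and real code would also escape the XML values:

Connection conn = DriverManager.getConnection(url, user, password);
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT id, name FROM accounts");

BufferedWriter out = new BufferedWriter(new FileWriter("accounts.xml", true));  // append mode
while (rs.next()) {
    // build a small XML fragment for this row and append it to the file
    out.write("<account id=\"" + rs.getLong("id") + "\"><name>"
              + rs.getString("name") + "</name></account>\n");
}
out.close();
rs.close();
stmt.close();
conn.close();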
Now, for sharing this file, we created a web service that returns the URL of the XML file (archived/zipped) once the file is available.
The client can download the file any time after that.
Note that this is not a synchronous system, meaning the file does not become available right after the client makes the call. Since creating the XML takes a lot of time, HTTP would normally time out, hence this approach.
Just an approach you can take a clue from. Hope this helps.
Use ResultSet#setFetchSize() to control how many records are fetched at a time from the database.
See What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?
In JDBC, the ResultSet#setFetchSize(int) method is very important to performance and memory management within the JVM, as it controls the number of network calls from the JVM to the database and, correspondingly, the amount of RAM used for ResultSet processing.
Read here about Oracle ResultSet Fetch Size
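For example (500 is just an illustrative value; tune it to your row size):

PreparedStatement stmt = conn.prepareStatement("SELECT id, name FROM accounts");
stmt.setFetchSize(500);              // ask the driver to pull ~500 rows per network round trip
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
    // process one row at a time; only about one fetch batch is buffered in the JVM
}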
For this size of data, you can probably get away with starting Java with more memory. Check out the -Xmx and -Xms options when you start Java.
If your data is truly too big to fit in memory, but not big enough to warrant investment in different technology, think about operating in chunks. Does this have to be done at once? Can you slice up the data into 10 chunks and do each chunk independently? If it has to be done in one shot, can you stream data from the database, and then stream it into the file, forgetting about things you are done with (to keep memory use in the JVM low)?
Read the records in chunks, as explained by previous answers.
Use StAX http://stax.codehaus.org/ to stream the record chunks into your XML file, as opposed to putting all records into one large in-memory document.
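A sketch with the standard javax.xml.stream API (element names are placeholders; rs is assumed to be the JDBC ResultSet from the previous step):

XMLOutputFactory factory = XMLOutputFactory.newInstance();
XMLStreamWriter xml = factory.createXMLStreamWriter(
        new BufferedOutputStream(new FileOutputStream("accounts.xml")), "UTF-8");

xml.writeStartDocument("UTF-8", "1.0");
xml.writeStartElement("accounts");
while (rs.next()) {
    xml.writeStartElement("account");
    xml.writeAttribute("id", rs.getString("id"));
    xml.writeStartElement("name");
    xml.writeCharacters(rs.getString("name"));   // writeCharacters escapes the text for you
    xml.writeEndElement();   // </name>
    xml.writeEndElement();   // </account>
}
xml.writeEndElement();       // </accounts>
xml.writeEndDocument();
xml.close();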
As far as the Hibernate side is concerned, fetch using a SELECT query (instead of a FROM query) to prevent filling up the caches, or alternatively use a StatelessSession. Also be sure to use scroll() instead of list(). Configuring hibernate.jdbc.fetch_size to something like 200 is also recommended.
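A rough sketch of the StatelessSession plus scroll() route (Account, the HQL string and the session factory are assumptions):

StatelessSession session = sessionFactory.openStatelessSession();
ScrollableResults results = session
        .createQuery("select a from Account a")
        .setFetchSize(200)                      // matches the hibernate.jdbc.fetch_size advice
        .scroll(ScrollMode.FORWARD_ONLY);

while (results.next()) {
    Account a = (Account) results.get(0);
    // write this record to the XML stream, then let it go out of scope
}
results.close();
session.close();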
On the response side, XML is quite a bad choice because parsing is difficult. If this is already set, then make sure you use a streaming XML serializer; for example, the XPP3 library contains one.
While a reasonable Java approach would probably involve a StAX construction of your XML in conjunction with paginated result sets (straightforward JDBC or JPA), keep in mind that you may need to lock your database against updates all the while, which may or may not be acceptable in your case.
We took a different, database-centric approach, using stored procedures and triggers on INSERT and UPDATE to generate the XML node corresponding to each row/[block of] data. This constantly ensures that 250GB+ of raw data and its XML representation (~10 GB) are kept up to date, and reduces (no pun intended) the export to a mere matter of concatenation.
You can still use Hibernate to fetch millions of records; you just cannot do it in one round, because millions is a big number and you will of course hit an OutOfMemory exception. You can divide the work into pages and dump each page to XML, so that the records are not kept in RAM and your program does not need a huge amount of memory.
I have these two methods from a previous project that I used very frequently. Unfortunately I did not like using HQL much, so I don't have the code for that.
Here INT_PAGE_SIZE is the number of rows you want to fetch in each round, and getPageCount returns the total number of rounds needed to fetch all of the records.
Then paging fetches the records page by page, from 1 to getPageCount.
// Returns the total number of pages needed to fetch all rows for the given Criteria.
public int getPageCount(Criteria criteria) {
    ProjectionList pl = Projections.projectionList();
    pl.add(Projections.rowCount());
    criteria.setProjection(pl);
    int rowCount = (Integer) criteria.list().get(0);
    criteria.setProjection(null);   // reset so the criteria can be reused for the real query
    if (rowCount % INT_PAGE_SIZE == 0) {
        return rowCount / INT_PAGE_SIZE;
    }
    return rowCount / INT_PAGE_SIZE + 1;
}

// Restricts the Criteria to one page of INT_PAGE_SIZE rows (pages are 1-based; -1 means no paging).
public Criteria paging(Criteria criteria, int page) {
    if (page != -1) {
        criteria.setFirstResult((page - 1) * INT_PAGE_SIZE);
        criteria.setMaxResults(INT_PAGE_SIZE);
    }
    return criteria;
}
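A possible way to drive those two methods (session, Account and writePageToXml are assumptions):

Criteria criteria = session.createCriteria(Account.class);
int pageCount = getPageCount(criteria);

for (int page = 1; page <= pageCount; page++) {
    List<?> records = paging(criteria, page).list();
    writePageToXml(records);   // append this page's records to the output file
    session.clear();           // drop the fetched entities from the session cache
}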
I'm sorry that I don't have a deep understanding of HBase and Hadoop MapReduce, but I think you can help me find a way to use them, or perhaps propose the frameworks I need.
Part I
There is a 1st stream of records that I have to store somewhere. The records should be accessible by keys that depend on the records themselves. Several records can have the same key. There are quite a lot of them. I have to delete old records after a timeout.
There is also a 2nd stream of records, which is very intensive too. For each record (argument-record) I need to: get all records from the 1st stream with that argument-record's key, find the first matching record, delete it from the 1st stream's storage, and return the result (res1) of merging these two records.
Part II
The 3rd stream of records is like the 1st. Its records should be accessible by keys (different from the ones in Part I). As usual, several records will have the same key. There are not as many of them as in the 1st stream. I have to delete old records after a timeout.
For each res1 (argument-record) I have to: get all records from the 3rd stream with that record's other key, map these records with res1 as a parameter, and reduce them into a result. The 3rd stream's records should stay unmodified in storage.
Records with the same key should preferably be stored on the same node, and the procedures that fetch records by key and act on a given argument-record should preferably run on the node where those records live.
Are HBase and Hadoop MapReduce applicable in my case? And how should such an app look (the basic idea)? If the answer is no, are there frameworks for building such an app?
Please ask questions if anything about what I want is unclear.
I am referring to the storage backend technologies; the front end accepting records can be stateless and therefore trivially scalable.
We have streams of records and we want to join them on the fly. Some of the records should be persisted while others (as far as I understood, the 1st stream) are transient.
If we take scalability and persistence out of the equation, this could be implemented in a single Java process using a HashMap for randomly accessible data and a TreeMap for data we want to keep sorted.
Now let's see how this can be mapped onto NoSQL technologies to gain the scalability and performance we need.
HBase is a distributed sorted map, so it is a good candidate for stream 2. If we use our key as the HBase row key, we gain data locality for records with the same key.
MapReduce on top of HBase is also available.
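As a rough illustration of storing and reading a record by key with the HBase Java client (the table, column family and qualifier names are invented):

Configuration conf = HBaseConfiguration.create();
try (Connection conn = ConnectionFactory.createConnection(conf);
     Table table = conn.getTable(TableName.valueOf("stream2"))) {

    // Store a record under its key; rows are kept sorted by key, so records with the
    // same key prefix end up in the same region (data locality).
    Put put = new Put(Bytes.toBytes("some-key"));
    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("record bytes"));
    table.put(put);

    // Read it back by key.
    Result result = table.get(new Get(Bytes.toBytes("some-key")));
    byte[] payload = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload"));
}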
Stream 1 looks like transient, randomly accessed data. I think it does not make sense to pay the price of persistence for those records, so a distributed in-memory hashtable should do, for example http://memcached.org/. The stored element there would probably be the list of records sharing the same key.
I am still not 100% sure about the 3rd stream's requirements, but the need for a secondary index (if it is known beforehand) can be implemented at the application level as another distributed map.
In a nutshell, my suggestion is to pick HBase for the data you want to persist and keep sorted, and to consider more lightweight solutions for the transient (but still considerably big) data.
I'm getting a large amount of data from a database query and making objects out of it. I end up with a list of these objects (about 1M of them) and I want to serialize that to disk for later use. The problem is that it barely fits in memory and won't fit at all in the future, so I need some scheme to serialize, say, the first 100k, then the next 100k, and so on, and also to read the data back in 100k increments.
I could write some obvious code that checks whether the list is getting too big and then writes it to file 'list1', then 'list2', etc., but maybe there's a better way to handle this?
You could go through the query results, create each object, and then feed it immediately to an ObjectOutputStream, which writes it to the file.
Read the objects one by one from the DB
Don't put them into a list but write them into the file as you get them from the DB
Never keep more than a single object in RAM. When reading the objects back, terminate the reading loop when readObject() returns null (which you can arrange by writing a null as the very last object to mark the end of the file).
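A minimal sketch of that write/read cycle, assuming Account is a Serializable row object, loadNext() is a hypothetical method that returns the next row from the DB or null when there are no more, and process() is a placeholder:

try (ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(new FileOutputStream("accounts.ser")))) {
    Account a;
    while ((a = loadNext()) != null) {
        out.writeObject(a);
        out.reset();          // drop the stream's back-reference table so it doesn't grow
    }
    out.writeObject(null);    // end-of-data marker for the reading loop
}

try (ObjectInputStream in = new ObjectInputStream(
        new BufferedInputStream(new FileInputStream("accounts.ser")))) {
    Object o;
    while ((o = in.readObject()) != null) {
        process((Account) o); // do whatever you need with each record
    }
}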
I assume you have checked that it is really necessary to save the data to disk; it couldn't stay in the database, could it?
To handle data that is too big, you need to make it smaller :-)
One idea is to get the data by chunks:
start with the request, so you don't build this huge list (because that will become a point of failure sooner or later)
serialize your smaller list of objects
then loop
Also think about setting the fetch size for the JDBC driver; for example, the JDBC driver for MySQL defaults to fetching the whole result set.
read here for more information: fetch size
It seems that you are retrieving a large dataset from the database, converting it into a list of objects, and serializing it in a single shot.
Don't do that; it may eventually crash the application.
Instead you have to:
minimize the amount of data retrieved from the database in each round (say, 1,000 records instead of 1M),
convert them into business objects,
and serialize them.
Then repeat the same procedure until the last record.
This way you can avoid the performance problem.
ObjectOutputStream will work but it has more overhead. I think DataOutputStream/DataInputStream is a better choice.
Just read/write the values one by one and let the stream worry about buffering. For example, you can do something like this:
DataOutputStream os = new DataOutputStream(new FileOutputStream("myfile"));
for (...) {
    os.writeInt(num);
}
One gotcha with both object and data streams is that write(int) only writes one byte; use writeInt(int) instead.
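The reading side would mirror it; one way to detect end of file is to catch EOFException, which readInt() throws once the stream is exhausted:

try (DataInputStream is = new DataInputStream(
        new BufferedInputStream(new FileInputStream("myfile")))) {
    while (true) {
        try {
            int num = is.readInt();
            // process num
        } catch (EOFException eof) {
            break;   // no more ints in the file
        }
    }
}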