Migrating A Java Application to Hadoop : Architecture/Design Roadblocks?

Migrating A Java Application to Hadoop : Architecture/Design Roadblocks? - java

Alrite.. so.. here's a situation:
I am responsible for architect-ing the migration of an ETL software (EAI, rather) that is java-based.
I'll have to migrate this to Hadoop (the apache version). Now, technically this is more like a reboot and not a migration - coz I've got no database to migrate. This is about leveraging Hadoop, such that, the Transformation phase (of 'ETL') is parallel-iz-ed. This would make my ETL software,
Faster - with transformation parallel-iz-ed.
Scalable - Handling more data / big data is about adding more nodes.
Reliable - Hadoop's redundancy and reliability will add to my product's features.
I've tested this configuration out - changed my transformation algos into a mapreduce model, tested it out on a high end Hadoop cluster and bench-marked the performance. Now, I'm trying to understand and document all those things that could stand in the way of this application redesign/ rearch / migration. Here's a few I could think of:
The other two phases: Extraction and Load - My ETL tool can handle a variety of datasources - So, do I redesign my data adapters to read data from these data sources, load it to HDFS and then transform it and load it into the target datasource? Could this step act as a huge bottleneck to the entire architecture?
Feedback: So my transformation fails on a record - how do I let the end user know that the ETL hit an error on a particular record? In short, how do I keep track of what is actually going on at the app level with all the maps/reduces/merges and sorts happening - The default Hadoop web interface is not for the end-user - its for admins. So should I build a new web app that scrapes from the Hadoop web interface? (I know this is not recommended)
Security: How do I handle authorization at Hadoop level? Who can run jobs, who are not allowed to run 'em - how to support ACL?
I look forward to hearing from you with possible answers to above questions and more questions/facts I'd need to consider, based on your experiences with Hadoop / problem analysis.
Like always, I appreciate your help and thank ya all in advance.

I do not expect loading to the HDFS to be a bottlneck, since the load is distributed among datanodes - so the network interface will be only bottleneck. Loading data back to the database might be a bottlneck but I think it is no worse then now. I would design jobs to have their input and their output to sit in the HDFS, and then run some kind of bulk load of results into the database.
Feedback is a problematic point, since actually MR have only one result - and it is transformed data. All other tricks, like write failed records into HDFS files will lack "functional" reliability of the MR, because it is a side effect. One of the ways to mitigate this problem you should design you software in the way to be ready for duplicated failed records. There is also scoop = the tool specific for migrating data between SQL databases and Hadoop. http://www.cloudera.com/downloads/sqoop/
In the same time I would consider usage of HIVE - if Your SQL transformations are not that complicated - it might be practical to create CSV files, and make initial preaggregation with Hive, therof reducing data volumes before going to (perhaps single node) database.

Related

Luigi vs Spring Batch

I have to load txt files into oracle tables. Nowadays the process is being done using bash scripting, sql loader and command line tools for validations.
I'm trying to find more robust alternatives. The two options I came up with are Luigi (Python framework) and Spring Batch.
I made a little POC using Spring Batch, but I believe it has a lot of boilerplate code and might be an overkill. I also prefer Python over Java. The good thing about Batch is the job tracking schema that comes out of the box with the framework.
Files contain from 200k to 1kk records. No transformations are performed, only datatype and lenght validations. First steps of the job consist on checking header, trailer, some dates, making queries to parameters table and truncating the staging table.
Could you give me some pro and cons of each framework for this use case?

I would argue they are not equivalent technologies. Luigi is more of a workflow/process management framework that can help organize and orchestrate many different batch job
The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but are typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else. https://luigi.readthedocs.io/en/stable/
Spring Batch gives you a reusable framework for structuring a batch job. It gives you a lot of things out of the box, like being able to read input from text files and write output to databases.
A lightweight, comprehensive batch framework designed to enable the development of robust batch applications vital for the daily operations of enterprise systems.
Spring Batch provides reusable functions that are essential in processing large volumes of records, including logging/tracing, transaction management, job processing statistics, job restart, skip, and resource management https://spring.io/projects/spring-batch
You could theoretically run Spring Batch jobs with Luigi.
Based on the brief description of your use case, it sounds like the bread and butter of what inspired Spring Batch in the first place. In fact, their 15 minute demo application covers the use case of reading from a file and loading records into a JDBC database https://spring.io/guides/gs/batch-processing/.

Keeping neo4j updated with production MSSQL

I'm investigating the possibility of using neo4j to handle some of the queries of our java web application that simply take too long to run on MSSQL as they require so many joins on large tables, even with indexes implemented.
I am however concerned about the time that it might take to complete the ETL ultimately impacting on how outdated the information may be when queries.
Can someone advise on either a production strategy or toolkit / library that can assist in reading a production sql-server database (using deltas if possible to optimise) and updating a running instance of a neo4j database? I imagine that there will have to be some kind of mapping configuration but the idea is to have this run in an automated manner, updating the neo4j database with one or more sql-server table or view contents.

The direct way to connect a MS SQL database to a Neo4j database would be using the apoc.load.jdbc procedure.
For an initial load you can use Neo4j ETL (https://neo4j.com/blog/rdbms-neo4j-etl-tool/).
There is however no way around the fact that some planning and work will be involved if you want to keep two databases in sync (and if the logic involved goes beyond a few simple queries) continiously. You might want to offload a delta every so often (monthly, daily, hourly, ...) into CSV files and load those (with CYPHER syntax determining what needs to be added, removed, changed or connected) with LOAD CSV.
Sadly enough there's no such thing as a free lunch.
Hope this helps,
Tom

Preventing Mule servers from reprocessing same information from a database

I am working on a Mule application which reads a series of database records generates reports and posts them to a number of HTTP locations. Unfortunately, the servers are not clustered, so it is possible that both servers could read the records and post them multiple times which is undesirable. Could someone suggest the simplest way to prevent all three Mule servers reading the database, generating the reports and sending them off??

Short answer - use cluster.
Long answer - there is no magic in this world. If you don't use cluster which coordinates your efforts then you should do it by yourself. Since servers are not in cluster they should communicate somehow to prevent duplication. Cluster is the best answer and it is designed to do so. Without cluster - do it "manually".
There are many ways to do so. Main point is that it should be the only one place responsible for coordination (may I say cluster? :) - the best way IMHO it is database - it is one place which is common for all these servers. simplest way is to mark processed records and process only unprocessed ones. How you do this - extra table or extra field - it's up to you.

Solr as primary database [duplicate]

My team is working with a third party CMS that uses Solr as a search index. I've noticed that it seems like the authors are using Solr as a database of sorts in that each document returned contains two fields:
The Solr document ID (basically a classname and database id)
An XML representation of the entire object
So basically it runs a search against Solr, download the XML representation of the object, and then instantiate the object from the XML rather than looking it up in the database using the id.
My gut feeling tells me this is a bad practice. Solr is a search index, not a database... so it makes more sense to me to execute our complex searches against Solr, get the document ids, and then pull the corresponding rows out of the database.
Is the current implementation perfectly sound, or is there data to support the idea that this is ripe for refactoring?
EDIT: When I say "XML representation" - I mean one stored field that contains an XML string of all of the object's properties, not multiple stored fields.

Yes, you can use SOLR as a database but there are some really serious caveats :
SOLR's most common access pattern, which is over http doesnt respond particularly well to batch querying. Furthermore, SOLR does NOT stream data --- so you can't lazily iterate through millions of records at a time. This means you have to be very thoughtful when you design large scale data access patterns with SOLR.
Although SOLR performance scales horizontally (more machines, more cores, etc..) as well as vertically (more RAM, better machines, etc), its querying capabilities are severely limited compared to those of a mature RDBMS. That said, there are some excellent functions, like the field stats queries, which are quite convenient.
Developers who are used to using relational databases will often run into problems when they use the same DAO design patterns in a SOLR paradigm, because of the way SOLR uses filters in queries. There will be a learning curve for developing the right approach to building an application that uses SOLR for part of its large queries or statefull modifications.
The "enterprisy" tools that allow for advanced session management and statefull entities that many advanced web-frameworks (Ruby, Hibernate, ...) offer will have to be thrown completely out the window.
Relational databases are meant to deal with complex data and relationships - and they are thus accompanied by state of the art metrics and automated analysis tools. In SOLR, I've found myself writing such tools and manually stress-testing alot, which can be a time sink.
Joining : this is the big killer. Relational databases support methods for building and optimizing views and queries that join tuples based on simple predicates. In SOLR, there aren't any robust methods for joining data across indices.
Resiliency : For high availability, SolrCloud uses a distributed file system underneath (i.e. HCFS). This model is quite different then that of a relational database, which usually does resiliency using slaves and masters, or RAID, and so on. So you have to be ready to provide the resiliency infrastructure SOLR requires if you want it to be cloud scalable and resistent.
That said - there are plenty of obvious advantages to SOLR for certain tasks : (see http://wiki.apache.org/solr/WhyUseSolr) -- loose queries are much easier to run and return meaningful results. Indexing is done as a matter of default, so most arbitrary queries run pretty effectively (unlike a RDBMS, where you often have to optimize and de-normalize after the fact).
Conclusion: Even though you CAN use SOLR as an RDBMS, you may find (as I have) that there is ultimately "no free lunch" - and the cost savings of super-cool lucene text-searches and high-performance, in-memory indexing, are often paid for by less flexibility and adoption of new data access workflows.

It's perfectly reasonable to use Solr as a database, depending on your application. In fact, that's pretty much what guardian.co.uk is doing.
It's definitely not bad practice per se. It's only bad if you use it the wrong way, just like any other tool at any level, even GOTOs.
When you say "An XML representation..." I assume you're talking about having multiple stored Solr fields and retrieving this using Solr's XML format, and not just one big XML-content field (which would be a terrible use of Solr). The fact that Solr uses XML as default response format is largely irrelevant, you can also use a binary protocol, so it's quite comparable to traditional relational databases in that regard.
Ultimately, it's up to your application's needs. Solr is primarily a text search engine, but can also act as a NoSQL database for many applications.

This was probably done for performance reasons, if it doesn't cause any problems I would leave it alone. There is a big grey area of what should be in a traditional database vs a solr index. Ive seem people do similar things to this (usually key value pairs or json instead of xml) for UI presentation and only get the real object from the database if needed for updates/deletes. But all reads just go to Solr.

I've seen similar things done because it allows for very fast lookup. We're moving data out of our Lucene indexes into a fast key-value store to follow DRY principles and also decrease the size of the index. There's not a hard-and-fast rule for this sort of thing.

I had similar idea, in my case to store some simple json data in Solr, using Solr as a database. However, a BIG caveat that changed my mind was the Solr upgrade process.
Please see https://issues.apache.org/jira/browse/LUCENE-9127.
Apparently, there has been in the past (pre v6) the recommendation to re-index documents after major version upgrades (not just use IndexUpdater) although you did not have to do this to maintain functionality (I cannot vouch for this myself, this is from what I have read). Now, after you have upgraded 2 major versions but did not re-index (actually, fully delete docs then the index files themselves) after the first major version upgrade, your core is now not recognized.
Specifically in my case, I started with Solr v6. After upgrade to v7, I ran IndexUpdater so index is now at v7. After upgrade to v8, the core would not load. I had no idea why - my index was at v7, so that satisfies the version-minus-1 compatibility statement from Solr, right? Well, no - wrong.
I did an experiment. I started fresh from v6.6, created a core and added some documents. Upgraded to v7.7.3 and ran IndexUpdater, so index for that core is now at v7.7.3. Upgraded to v8.6.0, after which the core would not load. Then I repeated the same steps, except after running IndexUpdater I also re-indexed the documents. Same problem. Then I again repeated everything, except I did not just re-index, I deleted the docs from the index and deleted the index files and then re-indexed. Now, when I arrived in v8.6.0, my core was there and everything OK.
So, the takeaway for the OP or anyone else contemplating this idea (using Solr as db) is that you must EXPECT and PLAN to re-index your documents/data from time to time, meaning you must store them somewhere else anyway (a previous poster alluded to this idea), which sort of defeats the concept of a database. Unless of course your Solr core/index will be short-lived (not last more than one major version Solr upgrade), you never intend to upgrade Solr more than 1 version, or the Solr devs change this upgrade limitation. So, as an index for data stored elsewhere (and readily available for re-indexing when necessary), Solr is excellent. As a database for the data itself, it strongly "depends".

Adding to #Jayunit100 response, using solar as a database, you get availability and partition tolerance at the cost of some consistency. There is going to be a configurable lag between what you write and when you can read it back.

Using Solr search index as a database - is this "wrong"?

My team is working with a third party CMS that uses Solr as a search index. I've noticed that it seems like the authors are using Solr as a database of sorts in that each document returned contains two fields:
The Solr document ID (basically a classname and database id)
An XML representation of the entire object
So basically it runs a search against Solr, download the XML representation of the object, and then instantiate the object from the XML rather than looking it up in the database using the id.
My gut feeling tells me this is a bad practice. Solr is a search index, not a database... so it makes more sense to me to execute our complex searches against Solr, get the document ids, and then pull the corresponding rows out of the database.
Is the current implementation perfectly sound, or is there data to support the idea that this is ripe for refactoring?
EDIT: When I say "XML representation" - I mean one stored field that contains an XML string of all of the object's properties, not multiple stored fields.

It's perfectly reasonable to use Solr as a database, depending on your application. In fact, that's pretty much what guardian.co.uk is doing.
It's definitely not bad practice per se. It's only bad if you use it the wrong way, just like any other tool at any level, even GOTOs.
When you say "An XML representation..." I assume you're talking about having multiple stored Solr fields and retrieving this using Solr's XML format, and not just one big XML-content field (which would be a terrible use of Solr). The fact that Solr uses XML as default response format is largely irrelevant, you can also use a binary protocol, so it's quite comparable to traditional relational databases in that regard.
Ultimately, it's up to your application's needs. Solr is primarily a text search engine, but can also act as a NoSQL database for many applications.

This was probably done for performance reasons, if it doesn't cause any problems I would leave it alone. There is a big grey area of what should be in a traditional database vs a solr index. Ive seem people do similar things to this (usually key value pairs or json instead of xml) for UI presentation and only get the real object from the database if needed for updates/deletes. But all reads just go to Solr.

I've seen similar things done because it allows for very fast lookup. We're moving data out of our Lucene indexes into a fast key-value store to follow DRY principles and also decrease the size of the index. There's not a hard-and-fast rule for this sort of thing.

Adding to #Jayunit100 response, using solar as a database, you get availability and partition tolerance at the cost of some consistency. There is going to be a configurable lag between what you write and when you can read it back.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.