Hadoop integration with e-commerce portal - java

We are building a new e-commerce portal from scratch using Java REST services, and we are planning to use MySQL (for now, Oracle in the future). We are also using Elasticsearch. We are building the whole portal as microservices. My question is: do I need to take care of analytics from the beginning (like Hadoop and HDFS integration)?

A single relational database works fine, but it scales poorly, especially for large-scale web services.
You need to measure your data ingestion volume and size to determine whether you need Hadoop (more specifically HDFS) for batch analytics on top of Elasticsearch, but likely you do not. You can use a standalone Apache Spark cluster to run analytics against Elasticsearch directly.
However, you could also use Kafka as a message bus that feeds both your JDBC-compatible database and an Elasticsearch index. Spark Streaming also works well with Kafka.
And if you want to add Hadoop to the mix, you can pull the same data from Kafka to populate an HDFS directory.
There are many blog posts about microservice communication via Kafka.
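For example, here is a minimal sketch of a microservice publishing order events to Kafka, so that an Elasticsearch indexer and an HDFS/Spark ingestion job can both consume the same stream. The topic name, key, and JSON payload are made up for illustration:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // adjust to your brokers
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Publish an order event; downstream consumers (Elasticsearch indexer,
        // HDFS/Spark ingestion) read the same topic independently.
        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            String orderJson = "{\"orderId\":\"1001\",\"total\":49.99}";
            producer.send(new ProducerRecord<>("orders", "1001", orderJson));
        }
    }
}
```

Each consumer group keeps its own offset, so adding an HDFS loader later does not affect the existing Elasticsearch pipeline.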

Related

Worker role with java example in Azure

I couldn't find any good example of a worker role for Java on the Azure cloud.
I am writing an AMQP publisher (JMS) application for Event Hubs to simulate a large amount of data as a stream. I want to run this application in the cloud and scale it to produce data according to changing needs.
As far as I know, the Azure plugin for Eclipse supported the Cloud Services features a few years ago, so you can find many resources such as the Channel 9 videos that @Micah_MSFT mentioned. However, after trying to install the plugin in my Eclipse, I found that these Cloud Services features have since been removed.
There are two old blogs which may be helpful in your scenario.
Deploying Java Applications in Azure
Installing Java Runtime in Azure Cloud Services with Chocolatey
Meanwhile, Microsoft Azure Service Fabric is the next-generation cloud application platform for highly scalable, highly reliable distributed applications, and it can be used instead of Cloud Services. You can refer to the official document Learn about the differences between Cloud Services and Service Fabric before migrating applications to compare them, and there is a tutorial for Java.
In my experience, as a workaround, there are other simpler services on Azure that are suitable for generating data with Java and that can be scaled.
With App Service, continuous WebJobs can be scaled with the number of Web App instances.
Azure Batch lets you run large-scale parallel and high-performance computing (HPC) applications efficiently in the cloud, so you can write a Java application that produces data and run it on the Batch service in parallel. There is an official Java sample which you can refer to.
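Whichever hosting option you pick, the data-producing part can stay a small Java program. Below is a rough sketch using the azure-messaging-eventhubs SDK (which speaks AMQP under the hood); the connection string, hub name, and payload are placeholders you would replace with your own:

```java
import java.util.Collections;
import com.azure.messaging.eventhubs.EventData;
import com.azure.messaging.eventhubs.EventHubClientBuilder;
import com.azure.messaging.eventhubs.EventHubProducerClient;

public class StreamSimulator {

    public static void main(String[] args) {
        // Connection string and hub name are placeholders; supply your own.
        EventHubProducerClient producer = new EventHubClientBuilder()
            .connectionString("<namespace-connection-string>", "<event-hub-name>")
            .buildProducerClient();

        // Emit simulated events; run more instances (WebJobs, Batch tasks)
        // to scale the volume up or down.
        for (int i = 0; i < 1000; i++) {
            String payload = "{\"sensorId\":" + i + ",\"value\":" + Math.random() + "}";
            producer.send(Collections.singletonList(new EventData(payload)));
        }
        producer.close();
    }
}
```

Scaling out then becomes a deployment concern (more WebJob instances or Batch tasks) rather than a code change.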

Scalable spring batch job on kubernetes

I am developing an ETL batch application using Spring Batch. My ETL process takes data from a pagination-based REST API and loads it into Google BigQuery. I would like to deploy this batch application in a Kubernetes cluster and want to exploit pod scalability. I understand Spring Batch supports both horizontal and vertical scaling. I have a few questions:
1) How do I deploy this ETL app on Kubernetes so that it creates pods on demand using remote chunking / remote partitioning?
2) I am assuming there would be a main master pod and different slave pods provisioned based on load. Is that correct?
3) There is also a Kubernetes batch API available. Should I use the Kubernetes batch API or the Spring Cloud features? Which option is better?
I have used Spring Boot with Spring Batch and Spring Cloud Task to do something similar to what you want to do. Maybe it will help you.
The way it works is like this: I have a manager app that deploys pods on Kubernetes with my master application. The master application does some work and then starts the remote partitioning, deploying several other pods with "workers".
Trying to answer your questions:
1) You can create a Docker image of an application that has a Spring Batch job. Let's call it the master application.
The application that deploys the master application could use a TaskLauncher or an AppDeployer from Spring Cloud Deployer Kubernetes.
2) Correct. In this case you could use remote partitioning. Each partition would be another Docker image with a Job. This would be your worker.
An example of remote partitioning can be found here.
3) In my case I used Spring Batch and managed to do everything I needed. The only problems I have now are with upscaling and downscaling my cluster. Since my workers are not stateful, I'm experiencing some problems when instances are removed from the cluster. If you don't need to upscale or downscale your cluster, you are good to go.
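To make the partitioning part concrete, here is a hypothetical sketch of how the master step could split a pagination-based REST source into page ranges. The class name, context keys, and page arithmetic are illustrative only; each partition would be handed to a worker pod by a Kubernetes-aware partition handler:

```java
import java.util.HashMap;
import java.util.Map;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Hypothetical partitioner: splits a pagination-based REST source into page
// ranges, so each partition (and thus each worker pod) processes one range.
public class PagePartitioner implements Partitioner {

    private final int totalPages;
    private final int pagesPerPartition;

    public PagePartitioner(int totalPages, int pagesPerPartition) {
        this.totalPages = totalPages;
        this.pagesPerPartition = pagesPerPartition;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        int partitionIndex = 0;
        for (int start = 1; start <= totalPages; start += pagesPerPartition) {
            ExecutionContext context = new ExecutionContext();
            context.putInt("startPage", start);
            context.putInt("endPage", Math.min(start + pagesPerPartition - 1, totalPages));
            partitions.put("partition" + partitionIndex++, context);
        }
        return partitions;
    }
}
```

Each worker step then reads its "startPage"/"endPage" values from the step execution context and calls the REST API only for that slice.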

How does the Embedded Neo4j actually work?

I am new to Neo4j and, based on the reading I have done so far, it seems there are two ways to interact with Neo4j: REST and embedded. Where I am a little confused is: does the embedded option only give you the ability to use the native Neo4j API to manipulate the datastore, or can you also embed Neo4j and package it with your Java application, and if so how would I go about doing it?
As far as I know, the term "embedded" refers to integrating Neo4j within your application. In embedded mode, your database is locked and your application is solely authorized to access it; you cannot access the database from anywhere else while your application is running and accessing it.
Neo4j Server mode, on the other hand, exposes a REST API through which you can perform all datastore-related operations via API calls. In REST API mode, you can also manage your database externally, for example via the Neo4j GUI console, alongside your application.
Performance-wise, I found embedded mode to be much faster than server mode.
does the embedded option only give you the ability to use the native Neo4j API to manipulate the datastore
You can use either mode (server REST API mode or embedded mode) to manipulate the datastore.
Package with Java Application
It depends on your application configuration. In embedded mode you generally don't need an external Neo4j server running; you just need to explicitly specify your database path along with other configuration (I have used Spring Data Neo4j). In Neo4j server mode, on the other hand, you need a Neo4j server running.
You can have a look at this thread as well.
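As an illustration, here is a minimal sketch of packaging embedded Neo4j inside a plain Java application, assuming the Neo4j 3.x embedded API (org.neo4j:neo4j dependency) and an arbitrary on-disk database path; later 4.x releases changed this API:

```java
import java.io.File;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class EmbeddedNeo4jExample {

    public static void main(String[] args) {
        // Opens (or creates) a graph database inside the application's own
        // process; no separate Neo4j server is needed.
        GraphDatabaseService db = new GraphDatabaseFactory()
            .newEmbeddedDatabase(new File("data/graph.db"));

        try (Transaction tx = db.beginTx()) {
            Node person = db.createNode(Label.label("Person"));
            person.setProperty("name", "Alice");
            tx.success();
        }

        db.shutdown();
    }
}
```

Because the database runs inside the same JVM, shipping the application jar together with the Neo4j libraries is all the "packaging" that is required.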

How can I access the HDFS(Hadoop File System) from existing web application

I have installed Hadoop 1.0.4 on my cluster of 1 master and 3 slaves. Now I want to access the HDFS file system through my web application for storing and retrieving data.
As my web application currently uses MySQL as its database, I want to replace it with HDFS.
So what can I use to access HDFS from the existing web application?
For backend data migration I am using Sqoop and Flume, but I want real-time synchronization between the application and HDFS: whatever I save from the web page should go directly to HDFS, and whatever I search for should come directly from HDFS.
Please suggest.
Thanks in advance.
It's like replacing an apple with an orange.
You can't replace MySQL with HDFS. MySQL is a database, while HDFS is a file system like ext3/ext4, except that HDFS operates in a distributed fashion and ext3/ext4 do not.
HDFS provides high throughput at the cost of high latency, while a MySQL database provides low latency and comparatively low throughput. Think instead of replacing the RDBMS (MySQL, Oracle, etc.) with a NoSQL database (Cassandra, HBase, etc.).
There are plenty of NoSQL databases; the appropriate one has to be chosen based on a requirement analysis.
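That said, if all you need is to read and write files in HDFS from your Java code (rather than treat it as a database), the standard Hadoop FileSystem API is enough. A minimal sketch, assuming a placeholder NameNode URI and file path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAccessExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // NameNode URI is a placeholder; adjust to your cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://master:9000"), conf);

        // Write a file from the web application.
        Path file = new Path("/webapp/uploads/page1.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("content saved from the web page".getBytes(StandardCharsets.UTF_8));
        }

        // Read it back.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
            System.out.println(in.readLine());
        }

        fs.close();
    }
}
```

Just keep in mind that each such call is a file operation, not a query, which is why the answer above steers you toward a NoSQL store for record-level, low-latency access.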

Java deployment to a cloud for fast computation

I have an application written in Java that performs some straightforward but time-consuming calculations analysing some texts and prints the results to the terminal. I want to speed up the process by deploying the application to a cloud and letting the calculation happen there. Which cloud service allows such a deployment with minimal code changes?
Most cloud computing servers are designed to host web applications (Servlets mostly). I'm guessing your application is not a web application. You could write a simple web application that wraps around your application and uses some kind of messaging layer to distribute the load. You could then deploy on any of the major cloud sites (e.g. GAE, AWS, CloudFoundry).
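For instance, a thin wrapper might look like the following sketch; the class and the analyze method are hypothetical stand-ins for your existing calculation code:

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical thin HTTP wrapper: the existing analysis code runs unchanged,
// and the servlet simply exposes it so a cloud web container can host it.
public class TextAnalysisServlet extends HttpServlet {

    @Override
    protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        String text = req.getParameter("text");
        String result = analyze(text);
        resp.setContentType("text/plain");
        PrintWriter out = resp.getWriter();
        out.println(result);
    }

    // Placeholder for the existing time-consuming calculation.
    private static String analyze(String text) {
        return "words=" + text.split("\\s+").length;
    }
}
```

The platform then handles scaling by running more instances of the web application behind a load balancer.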
Alternatively, you can find an existing cloud framework such as Amazon MapReduce (link is to a ppt describing the tool) and fit your application into that framework. This would probably be the fastest approach, especially if you don't have much experience with Servlets.
