Polling data from REST API to HDFS


I have a blog that offers a REST API to download data. The API gives the list of topics (in JSON). It's possible to iterate on the list to download the messages of each topic. I want to download all messages of the forum every day and store them in HDFS.
I was thinking about writing a Java program that calls the API to get the data and stores it in HDFS using the Hadoop API. I can run the Java program within a daily Oozie batch.
Is there a better way of doing this? Maybe store the data on the local file system and put the file on HDFS at the end. I was also wondering whether Flume can be used in this case and what its added value would be.
Thanks in advance

This seems to be a fairly simple program. You can use any language or tool to read JSON from a REST API and then upload the content to HDFS.
You also need a scheduler to run the job.
Oozie with a Java or shell action gives you better tracking in terms of job history. I would go for this if Oozie is already available.
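A minimal sketch of the "write locally, then put on HDFS" variant mentioned in the question, using only the JDK. The class name TopicDump, the fetchMessages function (a stand-in for the real REST call) and the target directory are all assumptions made for illustration; the upload step is just the standard hdfs dfs -put command, built here as an argument list.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper: dumps every topic's messages to one local file, then hands it to HDFS. */
class TopicDump {

    /**
     * Writes one line per topic to localFile and returns the number of lines written.
     * fetchMessages stands in for the real REST call (topic id in, JSON text out);
     * its shape is an assumption, not the blog's actual API.
     */
    static int dumpToLocalFile(List<String> topicIds,
                               Function<String, String> fetchMessages,
                               Path localFile) {
        int written = 0;
        try (BufferedWriter out = Files.newBufferedWriter(localFile, StandardCharsets.UTF_8)) {
            for (String topicId : topicIds) {
                String json = fetchMessages.apply(topicId);
                // Keep one JSON document per line so the file is easy to process later.
                out.write(json.replace("\n", " "));
                out.write("\n");
                written++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    /** Builds the "hdfs dfs -put -f" command; run it with ProcessBuilder on a Hadoop client node. */
    static List<String> hdfsPutCommand(Path localFile, String hdfsDir) {
        return List.of("hdfs", "dfs", "-put", "-f", localFile.toString(), hdfsDir);
    }
}
```

In a real job, fetchMessages would wrap an HTTP client call, and the command returned by hdfsPutCommand could be started with new ProcessBuilder(command).inheritIO().start(). Writing the whole day to one local file first also means a failed API call never leaves a half-written file on HDFS.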

Related

How can we ingest data into Elasticsearch through Java without Logstash and Beats

How can we ingest data into Elasticsearch through Java without Logstash and Beats? Is there an option such as Kafka, or a way to do it using only Java without any other tools?

I am not sure why you don't want to consider Filebeat --> Elasticsearch, but yes, there are other ways to send your logs to Elasticsearch.
Also, you did not mention what the source is, i.e. whether you want to insert application logs or database records. I am assuming you want to send microservice logs; the options below hold for other data too.
As you don't want to use Filebeat, you have to add custom code to collect, refine, format and publish the logs.
You can use the Kafka sink connector for Elasticsearch to move all your logs.
Alternatively, you can send the logs over UDP from a client to a listening server, then buffer them there and ingest them into Elasticsearch.
You can develop a common library that holds all this code and use it in all your Java applications.
Simple UDP client/server code - https://github.com/suren03/udp-server-client
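The "buffer and ingest" step could look roughly like the sketch below. It only builds the newline-delimited body that Elasticsearch's _bulk endpoint expects (an action line followed by the document, each ending in a newline) and hands it to a sender callback. The class name BulkBuffer, the index name and the sender are assumptions; the actual HTTP POST (with Content-Type application/x-ndjson) is left to whatever client you use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical buffer that groups JSON log documents into Elasticsearch _bulk request bodies. */
class BulkBuffer {
    private final String index;
    private final int flushSize;
    private final Consumer<String> sender; // stands in for the HTTP POST to /_bulk
    private final List<String> pending = new ArrayList<>();

    BulkBuffer(String index, int flushSize, Consumer<String> sender) {
        this.index = index;
        this.flushSize = flushSize;
        this.sender = sender;
    }

    /** Adds one document (already serialized as single-line JSON) and flushes when the buffer is full. */
    synchronized void add(String jsonDoc) {
        pending.add(jsonDoc);
        if (pending.size() >= flushSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; call this on shutdown too so nothing is left behind. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sender.accept(toBulkBody(index, pending));
        pending.clear();
    }

    /** Builds the NDJSON body: an "index" action line, then the document, each newline-terminated. */
    static String toBulkBody(String index, List<String> docs) {
        StringBuilder body = new StringBuilder();
        for (String doc : docs) {
            body.append("{\"index\":{\"_index\":\"").append(index).append("\"}}\n");
            body.append(doc).append('\n');
        }
        return body.toString();
    }
}
```

Batching like this keeps the number of HTTP requests low; a real version would also flush on a timer so that a quiet application does not hold log lines back indefinitely.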

Spring Boot Rest API - input file + endpoints

I have a general question about the architecture I should use for my specific problem.
I have a .TSV file with some information, and my task is to create a REST API app that consumes this .TSV file and exposes 3 REST API endpoints. Each endpoint will return JSON data that I process from the .TSV file.
My question is: should I create a POST method that uploads the TSV file, save it e.g. to the session, and then do the logic in the API endpoints?
Or should I POST the content of the TSV file as JSON in every request to the specific endpoint?
I don't know how to glue it all together.
There is no requirement for a DB. The program will be tested with numerous requests through the API, and I don't know how to process or store the .TSV content in my app so that one user can call all three endpoints sequentially on the same data without re-uploading the TSV file.

It's better to upload the file once and then do the processing on the server. The file is uploaded in a single request, which is better than sending its content with every request.

I believe the solution depends on the size of the file. Keeping the file in memory is not a good approach if the file is very large. Saving the file in a session may not be good either, because it will get in the way if you need to scale your service in the future. Even storing the file in a /tmp directory can be a bad approach, because the solution is still not scalable.
It is a good idea to use a storage service like AWS S3, Google Firebase or something similar. When one of your three REST endpoints is called, your application checks whether that file has already been processed; if not, it reads the file, does the processing and saves the result to your S3 bucket. (If you don't want to keep the processed files, you can use a retention policy on S3 to delete them after a given period of time.)
Only after this do you return the result. As you can see, this is a synchronous solution.
If the file processing needs a lot of CPU and takes a long time, you will need an asynchronous solution. Instead of processing the files directly when the REST API is called, you create another application that reads the file from S3, processes it and saves the result, all asynchronously. Your REST API then only fetches the result from S3 and returns it.
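For a small file and a single instance, the "upload once, query many times" idea can be sketched with a plain in-memory store. TsvStore and its method names are hypothetical; the upload endpoint would return the id, and the three GET endpoints would take it as a parameter. For large files or several instances, the map would be replaced by the external storage described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory store: parse the TSV once on upload, then serve the rows by id. */
class TsvStore {
    private final Map<String, List<Map<String, String>>> datasets = new ConcurrentHashMap<>();

    /** Parses the TSV text (first line is the header) and returns an id for later requests. */
    String upload(String tsv) {
        String[] lines = tsv.split("\\R");
        String[] header = lines[0].split("\t", -1);
        List<Map<String, String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = lines[i].split("\t", -1);
            Map<String, String> row = new LinkedHashMap<>();
            for (int c = 0; c < header.length; c++) {
                // Missing trailing cells become empty strings.
                row.put(header[c], c < cells.length ? cells[c] : "");
            }
            rows.add(row);
        }
        String id = UUID.randomUUID().toString();
        datasets.put(id, rows);
        return id;
    }

    /** Returns the parsed rows for a previously uploaded file. */
    List<Map<String, String>> rows(String id) {
        List<Map<String, String>> rows = datasets.get(id);
        if (rows == null) {
            throw new IllegalArgumentException("Unknown dataset id: " + id);
        }
        return rows;
    }
}
```

Using an explicit id instead of the HTTP session keeps the endpoints stateless from the client's point of view, which makes it easier to swap the map for shared storage later.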

Building a File Polling/Ingest Task with Spring Batch and Spring Cloud Data Flow

We are planning to create a new processing mechanism which consists of listening to a few directories, e.g. /opt/dir1, /opt/dirN, and, for each document created in these directories, starting a routine to process it, persist its records in a database (via REST calls to an existing CRUD API) and generate a protocol file in another directory.
For testing purposes, I am not using any modern (or even decent) framework/approach, just a regular Spring Boot app with a WatchService implementation that listens to these directories and picks up the files to be processed as soon as they are created. It works, but I will clearly run into performance problems when I move to production and start receiving dozens of files to be processed in parallel, which isn't the case in my example.
After some research and some tips from a few colleagues, I found Spring Batch + Spring Cloud Data Flow to be the best combination for my needs. However, I have never dealt with either Batch or Data Flow before, and I'm somewhat confused about what blocks I should build, and how, in order to get this routine going in the simplest and most performant manner. I have a few questions regarding its added value and architecture and would really appreciate hearing your thoughts!
I managed to create and run a sample batch file ingest task based on this section of Spring Docs. How can I launch a task every time a file is created in a directory? Do I need a Stream for that?
If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?
How can I keep some variables externalized for my task, e.g. directory paths? Can I have these streams and tasks read an application.yml from somewhere other than inside their jar?
Why should I use Spring Cloud Data Flow alongside Spring Batch and not only a batch application? Just because it spans parallel tasks for each file or do I get any other benefit?
Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation if you think only about the sequential processing scenario, where I'd receive only 1 file per hour or so?
Also, if any of you have a guide or sample about how to launch a task programmatically, I would be really thankful! I am still searching for that, but it doesn't seem I'm doing it right.
Thank you for your attention and any input is highly appreciated!
UPDATE
I managed to launch my task via the SCDF REST API, so I could keep my original Spring Boot app using WatchService and have it launch a new task via Feign or XXX. I know this is still far from what I should do here. After some more research, I think creating a stream using a file source and sink would be the way to go, unless someone has another opinion. However, I can't get the inbound channel adapter to poll from multiple directories, and I can't have multiple streams, because this platform is supposed to scale to the point where we have thousands of participants (or directories to poll files from).
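For reference, the plain WatchService approach described above can watch many directories with a single service by remembering which WatchKey belongs to which directory. This is only a sketch of that JDK mechanism, not of the Spring Cloud Data Flow file source; the class name MultiDirWatcher is made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Hypothetical wrapper: one WatchService shared by any number of directories. */
class MultiDirWatcher implements AutoCloseable {
    private final WatchService watcher;
    private final Map<WatchKey, Path> dirsByKey = new HashMap<>();

    MultiDirWatcher(List<Path> dirs) {
        try {
            watcher = FileSystems.getDefault().newWatchService();
            for (Path dir : dirs) {
                // Each registration returns its own key; remember which directory it belongs to.
                WatchKey key = dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
                dirsByKey.put(key, dir);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Waits up to timeoutMs for the next batch of created files; returns an empty list on timeout. */
    List<Path> pollCreated(long timeoutMs) {
        WatchKey key;
        try {
            key = watcher.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return List.of();
        }
        List<Path> created = new ArrayList<>();
        if (key == null) {
            return created;
        }
        Path dir = dirsByKey.get(key);
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                // The event context is the file name relative to the watched directory.
                created.add(dir.resolve((Path) event.context()));
            }
        }
        key.reset(); // re-arm the key so later events are delivered
        return created;
    }

    @Override
    public void close() {
        try {
            watcher.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each path returned by pollCreated could then be passed as an argument when launching the task. Note that a create event can arrive before the file has been fully written, so the consumer still needs some way to tell when a file is complete.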

Here are a few pointers.
I managed to create and run a sample batch file ingest task based on this section of Spring Docs. How can I launch a task every time a file is created in a directory? Do I need a Stream for that?
If you'd have to launch it automatically upon an upstream event (eg: new file), yes, you could do that via a stream (see example). If the events are coming off of a message-broker, you can directly consume them in the batch-job, too (eg: AmqpItemReader).
If I do, how can I create a stream application that launches my task programmatically for each new file, passing its path as an argument? Should I use RabbitMQ for this purpose?
Hopefully, the above example clarifies it. If you want to programmatically launch the Task (not via DSL/REST/UI), you can do so with the new Java DSL support, which was added in 1.3.
How can I keep some variables externalized for my task, e.g. directory paths? Can I have these streams and tasks read an application.yml from somewhere other than inside their jar?
The recommended approach is to use Config Server. Depending on the platform where this is being orchestrated, you'd have to provide the config-server credentials to the Task and its sub-tasks including batch-jobs. In Cloud Foundry, we simply bind config-server service instance to each of the tasks and at runtime the externalized properties would be automatically resolved.
Why should I use Spring Cloud Data Flow alongside Spring Batch and not only a batch application? Just because it spans parallel tasks for each file or do I get any other benefit?
As a replacement for Spring Batch Admin, SCDF provides monitoring and management for Tasks/Batch-Jobs. The executions, steps, step progress, and stack traces upon errors are persisted and available to explore from the Dashboard. You can also use SCDF's REST endpoints directly to examine this information.
Talking purely about performance, how would this solution compare to my WatchService + plain processing implementation if you think only about the sequential processing scenario, where I'd receive only 1 file per hour or so?
This is implementation specific. We do not have any benchmarks to share. However, if performance is a requirement, you could explore remote-partitioning support in Spring Batch. You can partition the ingest or data processing Tasks with "n" number of workers, so that way you can achieve parallelism.

Configuring storm cluster for production cluster

We have configured a Storm cluster with one Nimbus server and three supervisors, and deployed three topologies that do different calculations, as follows:
Topology1: Reads raw data from MongoDB, does some calculations and stores the result back.
Topology2: Reads the result of Topology1, does some calculations and publishes the results to a queue.
Topology3: Consumes the output of Topology2 from the queue, calls a REST service, gets the reply from the REST service, updates the result in a MongoDB collection, and finally sends an email.
As a newbie to Storm, I am looking for expert advice on the following questions:
Is there a way to externalize all configurations, for example a config.json, that can be referred by all topologies?
Currently the configuration for connecting to MongoDB, MySQL, MQ and the REST URLs is hard-coded in Java files. It is not good practice to customize source files for each customer.
I want to log at each stage (spouts and bolts). Where should log4j.xml be placed so that it can be used by the cluster?
Is it right to execute a blocking call, like a REST call, from a bolt?
Any help would be much appreciated.

Since each topology is just a Java program, simply pass the configuration into the Java Jar, or pass a path to a file. The topology can read the file at startup, and pass any configuration to components as it instantiates them.
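As a rough sketch of the "pass a path to a file" option: the topology's main method can load a properties file named on the command line into a map and hand the values to spouts and bolts when it builds them. The question mentions config.json, but the JDK has no built-in JSON parser, so this example assumes a .properties file instead; the class name TopologySettings and the keys are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/** Hypothetical loader for per-customer settings kept outside the topology jar. */
class TopologySettings {

    /** Reads a .properties file into a plain map that can be passed to components at build time. */
    static Map<String, Object> load(Path file) {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, Object> settings = new HashMap<>();
        for (String name : props.stringPropertyNames()) {
            settings.put(name, props.getProperty(name));
        }
        return settings;
    }

    /** Fails fast at startup if a required key is missing, instead of failing later inside a bolt. */
    static String require(Map<String, Object> settings, String key) {
        Object value = settings.get(key);
        if (value == null) {
            throw new IllegalStateException("Missing setting: " + key);
        }
        return value.toString();
    }
}
```

Assuming the file path arrives as the first program argument, the main method would call TopologySettings.load(Paths.get(args[0])) before building the topology. Since the topology configuration in Storm is itself a map, the loaded entries can also be copied into it so that components can read them when they are prepared.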
Storm uses slf4j out of the box, and it should be easy to use within your topology as such. If you use the default configuration, you should be able to see logs either through the UI, or dumped to disk. If you can't find them, there are a number of guides to help, e.g. http://www.saurabhsaxena.net/how-to-find-storm-worker-log-directory/.
With storm, you have the flexibility to push concurrency out to the component level, and get multiple executors by instantiating multiple bolts. This is likely the simplest approach, and I'd advise you start there, and later introduce the complexity of an executor inside of your topology for asynchronously making HTTP calls.
See http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html for the canonical overview of parallelism in storm. Start simple, and then tune as necessary, as with anything.

Creating a Listening Service In Java

I have a webapp with an architecture I'm not thrilled with. In particular, I have a servlet that handles a very large file upload (via commons-fileupload), then processes the file, passing it to a service/repository layer.
What has been suggested to me is that I simply have my servlet upload the file, and have a service on the backend do the processing. I like the idea, but I have no idea how to go about it. I do not know JMS.
Other details:
- App is a GWT app split into the recommended client/server/shared subpackages, using an MVP architecture.
- Currently, I am only running in GWT hosted mode, but am planning to move to Tomcat in the very near future.
I'm perfectly willing to learn whatever I need to in order to get this working (in fact, that's the point of writing the app). I'm not expecting anyone to write code for me, but can someone point me in the right direction to get started?

There are many options for this scenario, but the simplest may be to copy the uploaded file to a known location on the file system and have a background daemon monitor that location and process each file when it appears.
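A minimal sketch of such a monitor, assuming the servlet writes uploads into an inbox directory and that files still being written carry a ".part" suffix (a convention invented for this example). The names InboxScanner and scanOnce are hypothetical; each processed file is moved to a second directory so it is not picked up twice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical directory monitor: one scan processes every finished upload in the inbox. */
class InboxScanner {

    /** Processes each regular file in inbox (except *.part) and moves it to done; returns the moved paths. */
    static List<Path> scanOnce(Path inbox, Path done, Consumer<Path> processor) {
        List<Path> candidates;
        try (Stream<Path> files = Files.list(inbox)) {
            candidates = files
                    .filter(Files::isRegularFile)
                    // Skip uploads that are still being written (assumed naming convention).
                    .filter(p -> !p.getFileName().toString().endsWith(".part"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        List<Path> handled = new ArrayList<>();
        for (Path file : candidates) {
            processor.accept(file);
            try {
                // Move the file out of the inbox so the next scan does not see it again.
                Path target = done.resolve(file.getFileName());
                handled.add(Files.move(file, target, StandardCopyOption.REPLACE_EXISTING));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return handled;
    }
}
```

To run it as a background daemon, scanOnce can be scheduled with Executors.newSingleThreadScheduledExecutor().scheduleWithFixedDelay(...), for example every few seconds.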

@Jason, there are many ways to solve your problem.
i) Dump your file data into a database column of type BLOB, and have a DB polling thread that periodically checks the table for newly inserted files.
ii) Dump the file to the file system and have a file monitoring process.
The benefit of i) over ii) is that the DB is a centralized and fast resource, whereas file systems are generally slow and non-centralized in nature.
So basically the servlet would dump either to the DB or to the file system. As for who processes the dumped file: a) it could be the monitoring process discussed above, or b) you can use JMS, which is asynchronous in nature. That means the servlet puts a trigger event in a queue, which asynchronously triggers a new processing thread.
That said, don't introduce JMS into your system unnecessarily if you are OK with a monitoring process.

This sounds interesting and familiar to me :). We do it in a similar way.
We have four projects, and all four include file upload and file processing (image/video/PDF/docs etc.). So we created a single project to handle all file processing. It works as follows:
All four projects and the File Processor use Amazon S3/our file storage, so file storage is shared among all five projects.
We make an HTTP request to the File Processor, providing details in XML which include the file path on S3/storage, AWS authentication details, and file conversion/processing parameters. The File Processor does the processing and puts the processed files on S3/storage, constructs XML with the details of the processed files and sends that XML in the response.
We use Spring Framework and Tomcat.

Since this is foremost a learning exercise, you need to pick an easy-to-use JMS provider. This discussion suggested FFMQ just one year ago.
Since you are starting with a simple processor, you can keep it simple and use a JMS Queue.
In the simplest form, each message sent by the servlet corresponds to a single job. You can either put the entire payload of the upload in the message, or just send a filename as a reference to the content. These are details you can refactor later.
On the processor side, if you are using Java EE, you can use a message-driven bean. If you are not, then I would suggest a 3 JVM solution -- one each for Tomcat, the JMS server, and the message processor. This article includes the basics of a message consuming client.
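To get a feel for the "one message per job, filename as reference" pattern before bringing in a JMS provider, it can be tried inside a single JVM with a java.util.concurrent.BlockingQueue. To be clear, this is a stand-in and not JMS: there is no broker, no persistence and no second JVM. The class name UploadJobQueue is hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** In-process stand-in for a JMS queue: the servlet submits file names, one worker thread processes them. */
class UploadJobQueue {
    private final BlockingQueue<String> jobs = new LinkedBlockingQueue<>();
    private final Thread worker;

    UploadJobQueue(Consumer<String> processor) {
        worker = new Thread(() -> {
            try {
                while (true) {
                    String fileName = jobs.take(); // blocks until a job arrives
                    try {
                        processor.accept(fileName);
                    } catch (RuntimeException e) {
                        // Keep the worker alive if one job fails.
                        System.err.println("Processing failed for " + fileName + ": " + e);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // shutdown requested
            }
        }, "upload-processor");
        worker.setDaemon(true);
        worker.start();
    }

    /** Called by the servlet once the upload has been saved; returns immediately. */
    void submit(String fileName) {
        jobs.add(fileName);
    }

    /** Stops the worker thread. */
    void shutdown() {
        worker.interrupt();
    }
}
```

The servlet's job shrinks to saving the upload and calling submit with the file name. Moving to a real JMS Queue later mostly means replacing submit with a message send and the worker loop with a message consumer.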
