How does Hadoop run in "real-time" against non-stale data?

How does Hadoop run in "real-time" against non-stale data? - java

My abysmally-rudimentary understanding of Hadoop and its "data ingest" tools (such as Flume or Sqoop) is that Hadoop must always run its MR jobs against data that is stored in structured files on its HDFS. And, that these tools (again, Flume, Sqoop, etc.) are responsible for essentially importing data from disparate systems (RDBMS, NoSQL, etc.) into HDFS.
To me, this means that Hadoop will always be running on "stale" (for lack of a better word) data that is minutes/hours/etc. old. Because, to import big data from these disparate systems onto HDFS takes time. By the time MR can even run, the data is stale and may no longer be relevant.
Say we have an app that has real-time constraints of making a decision within 500ms of something occurring. Say we have a massive stream of data that is being imported into HDFS, and because the data is so big it takes, say, 3 seconds to even get the data on to HDFS. Then say that the MR job that is responsible for making the decision takes 200ms. Because the loading of the data takes so long, we've already blown our time constraint, even though the MR job processing the data would be able to finish inside the given window.
Is there a solution for this kind of big data problem?

With the help of tools Apache Spark streaming API & another one is Storm which you can use for real time stream processing.

Related

What is the right way to create/write a large file in java that are generated by a user?

I have looked at examples that tell best practices for file write/create operations but have not seen an example that takes into consideration my requirements. I have to create a class which reads the contents of 1 file, does some data transformation, and then write the transformed contents to a different file then sends the file to a web service. Both files ultimately can be quite large like up to 20 MB and also it is unpredictable when these files will be created because they are generated by the user. Therefore it could be like 2 minutes between the time when this process occurs or it could be several all in the same second. The system is not like crazy in the sense that it could be like hundreds of these operations in the same second but it could be several.
My instinct says to solve it by:
Creating a separate thread when the process begins.
Read the first file.
Do the data transformation.
Write the contents to the new file.
Send the file to the service.
Delete the created file.
Am I missing something? Is there a best practice to tackle this kind of issue?

The first question you should ask is weather you need to write the file to the disk in the first place. Even if you are supposed to send a file to a consumer at the end of your processing phase, you could keep the file contents in memory and send that. The consumer doesn't care weather the file is stored on disk or not, since it only receives an array of bytes with the file contents.
The only scenario in which it would make sense to store the file on disk would be if you would communicate between your processes via disk files (i.e. your producer writes a file to disk, sends some notification to your consumer and afterwards your consumer reads the file from disk - for example based on a file name it receives from the notification).
Regarding I/O best practices, make sure you use buffers to read (and potentially write) files. This could greatly reduce the memory overhead (since you would end up keeping only a chunk instead of the whole 20 MB file in memory at a given moment).
Regarding adding multiple threads, you should test weather that improves your application performance or not. If your application is already I/O intensive, adding multiple threads will result in adding even more contention on your I/O streams, which would result in a performance degradation.

Without the full details of the situation, a problem like this may be better solved with existing software such as Apache NiFi:
An easy to use, powerful, and reliable system to process and distribute data.
It's very good at picking up files, transforming them, and putting them somewhere else (and sending emails, and generating analytics, and...). NiFi is a very powerful tool, but may be overkill if you're needs are just a couple of files given the additional set-up.

Given the description you have given, I think you should perform the operations for each file on one thread; i.e. on thread will download the file, process it and then upload the results.
If you need parallelism, then implement the download / process / upload as a Runnable and submit the tasks to an ExecutorService with a bounded thread pool. And tune the size of the thread pool. (That's easy if you expose the thread pool size as a config property.)
Why this way?
It is simple. Minimal synchronization is required.
One of the three subtasks is likely to be your performance bottleneck. So by combining all three into a single task, you avoid the situation where the non-bottleneck tasks get too far ahead. And if you get too far ahead on some of the subtasks you risk running out of (local) disk space.
I'm going to contradict what Alex Rolea said about buffering. Yes, it may help. But in on a modern (e.g. Linux) operating system on a typical modern machine, memory <-> disk I/O is unlikely to be the main bottleneck. It is more likely that the bottleneck will be network I/O or server-side I/O performance (especially if the server is serving other clients at the same time.)
So, I would not prematurely tune the buffering. Get the system working, benchmark it, profile / analyze it, and based on those results figure out where the real bottlenecks are and how best to address them.
Part of the solution may be to not use disk at all. (I know you think you need to, but unless your server and its protocols are really strange, you should be able to stream the data to the server out of memory on the client side.)

Apache Beam / Google Dataflow Final step to run only once

I have a pipeline where I download thousands of files, then transform them and store them as CSV on google cloud storage, before running a load job on bigquery.
This works fine, but as I run thousands of load jobs (one per downladed file), I reached the quota for imports.
I've changed my code so it lists all the files in a bucket and runs one job with all the files as parameters of the job.
So basically I need the final step to be run only once, when all the data has been processed. I guess I could use a groupBy transform to make sure all the data has been processed, but I'm wondering whether there is a better / more standard approach to it.

If I understood your question correctly, we might have had similar problem in one of our dataflows - we were hitting 'Load jobs per table per day' BigQuery limit due to the fact that the dataflow execution was triggered for each file in GCS separately and we had 1000+ files in the bucket.
In the end, the solution to our problem was quite simple - we modified our TextIO.read transform to use wildcards instead of individual file names
i.e TextIO.read().from("gs://<BUCKET_NAME>/<FOLDER_NAME>/**")
In this way only one dataflow job was executed and as a consequence all the data written to BigQuery was considered as a single load job, despite the fact that there were multiple sources.
Not sure if you can apply the same approach, tho.

Spring batch - using in-memory database for huge file processing

I am using Spring batch to process huge data (150 GB) to produce 60 GB output file. I am using Vertical Scaling approach and with 15 threads (Step partitioning approach).
The Job execution details are stored in the in-memory database. The CPU Utilization is more because its running on single machine and the file size is huge. But the Server is having a good configuration like 32 core processor and i am using 10 GB memory for this process.
My question is, if i move this to separate database will it reduce some CPU Utilization? Also, Using In-Memory database for Production is a bad choice /decision?
Regards,
Shankar

When you are talking about moving from in-memory db to a separate db, you are just talking about the batch runtime tables (job_instance, job_execution, step_execution, ...), right?
If so, I wouldn't expect that the CPU usage will drop a lot. Depending on your chunksize, a lot more CPU usage will be needed for your data processing, than for your updating the batch runtime tables.
If using an in-memory db for production is a good decision or not, depends on your needs. Two points to consider:
You lose any-information which was written into the batch-runtime tables. This could be helpful for debug sessions or simply to have a kind of history. But you can "persist" such information also in logfiles.
You will not be able to implement a restartable job. This could be an issue, if your job takes hours to complete. But for job, that only reads from a file, writes to a file, and is completed within a couple of minutes, this is not really a problem.

Fastest way to store data from sensors in java

I am currently writing a Java application that receives data from various sensors. How often this happens varies, but I believe that my application will receive signals about 100k times per day. I would like to log the data received from a sensor every time the application receives a signal. Because the application does much more than just log sensor data, performance is an issue. I am looking for the best and fastest way to log the data. Thus, I might not use a database, but rather write to a file and keep 1 file per day.
So what is faster? Use a database or log to files? No doubt there is also a lot of options to what logging software to use. Which is the best for my purpose if logging to file is the best option?
The data stored might be used later for analytical purposes, so please keep this in mind as well.

I would recommend you first of all to use log4j (or any other logging framework).
You can use a jdbc appender that writes into the db or any kind of file appender that writes into the file. The point is that your code will be generic enough to be changed later if you like...
In general files are much faster than db access, but there is a place for optimizations here.
If the performance is critical, you can use batching/asynchronous calls to the logging infrastructure.

A free database on a cheap PC should be able to record 10 records per second easily.
A tuned database on a good system or a logger on a cheap PC should be able to write 100 records/lines per second easily.
A tuned logger should be able to write 1000 lines per second easily.
A fast binary logger can perform 1 million records per second easily (depending on the size of the record)
Your requirement is about 1.2 records per second per signal which should be able to achieve any way you like. I assume you want to be able to query your data so you want it in a database eventually so I would put it there.

Ah the world of embedded systems. I had a similar problem when working with a hovercraft. I solved it with a separate computer(you can do this with a separate program) over the local area network that would just SIT and LISTEN as a server for logs I sent to it. The FileWriter program was written in C++. This must solve two problems of yours. First is the obvious performance gain while writing the logs. And secondly the Java program is FREED of writing any logs at all(but will act as a proxy) and can concentrate on performance critical tasks. Using a DB for this is going to be an overkill, except if you're using SQLite.
Good luck!

Downloading A Large SQLite Database From Server in Binary vs. Creating It On The Device

I have an application that requires the creation and download of a significantly large SQLite database. Depending on the user's data, creation of the db and the syncing of data from the server can take upwards of 20 to 25 minutes (some customers have a LOT of data). The data is downloaded as JSON and processed with Android's built in JSON classes.
To account for OutOfMemory issues I was having with some devices, I needed to limit the per-call download from the server to 500 records at a time. But, as of now, all of the above is working successfully - although slow.
Recently, there has been talk from my team of creating the complete SQLite db on the server side and then just downloading it to the device in binary in an effort to speed things up. I've never done this before. Is this indeed a viable option OR should I just be looking into speeding up the processing of the JSON through a 3rd party lib like GSON or Jackson.
Thanks in advance for your input.

From my experience with mobile devices, reinventing synchronization is an overkill most of the time. It obviously depends on the hardware, software and amounts of data you're working with. But most of the time long operation execution times on mobile devices are caused by faulty design, careless coding or specifics of embedded systems not taken into consideration.
Unfortunately, I can only give you some hints which you may consider, given pretty vague description of issues you're facing. I mean "LOT" doesn't mean much to me - I've seen mobile apps with DBs containing millions of records running pretty smoothly and ones that had around a 1K records running horribly slow and causing UI to freeze. You also didn't mentioned what OS version and device (or at least it's capabilities) you're using. What's the server configuration, what software is installed, what libraries/frameworks are used and in what modes. It all matters when you want to really speed things up.
Apart of encoding being gzip (which I believe you left default, which is on), you should give this ideas a try:
Streaming! - make sure both the client and the server use a streaming version of JSON API and use buffered streams. If either doesn't - replace it with a library that does. Jackson has one of the fastest streaming API. Sure it's more cumbersome to write a (de)serializer, but it pays off. When done properly, none of the sides must create a buffer large enough for (de)serialization of all the data, fill it with contents, and then parse/write it. Instead, a much smaller buffer is allocated and filled gradually as successive fields are serialized. When this buffer gets filled, it's contents is immediately sent to the other end of data channel. There it can be deserialized right away. The process continues until all data have been transmitted in small chunks. It makes the data interchange much more fluent and less resource-intensive.
For large batch inserts or updates use prepared statements. It also sometimes helps to insert your data without constraints and then create them - that way, for example, an index can be computed in one run instead of for each insert. Don't use transactions (they require maintaining extra database logs) or commit every 300 rows to minimize the overhead. If you're updating existing database and atomic modifications are necessary - load new data to a temporary database and, if everything is ok, replace old database with new one on the fly.
Almost always some data can be precomputed and stored on an sd-card for example. Or it can be loaded directly to an sd-card as a prepared SQLite DB in the company. If a task requires data that is so large that an import takes more than 10 minutes, you probably shouldn't do that task on mobile devices in the first place.

We Keep Coding

Java is a programming language and computing platform first released by Sun Microsystems in 1995.