Using local file system as Flume source - java

I've just started learning Big Data, and at the moment I'm working on Flume. The common example I've encountered is processing tweets (the Cloudera example) using some Java.
Just for testing and simulation purposes, can I use my local file system as a Flume source, particularly some Excel or CSV files? Do I also need some Java code, aside from the Flume configuration file, as in the Twitter extraction example?
Will this source be event-driven or pollable?
Thanks for your input.

I assume you are using a Cloudera sandbox and are talking about putting a file on the sandbox, local to the Flume agent you are planning to start. A Flume agent contains a:
Source
Channel
Sink
These should sit local to the Flume agent. The list of available Flume sources is in the user guide: https://flume.apache.org/FlumeUserGuide.html. You could use an Exec Source if you just want to stream data from a file with a tail or cat command.
You could also use a Spooling Directory Source, which will watch the specified directory for new files and parse events out of new files as they appear.
Have a good read of the user guide; it contains everything you need.
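Assuming the stock Spooling Directory Source fits your test, a minimal agent configuration could look like the sketch below (the agent name, directory path, and the logger sink are placeholders for testing; you would point spoolDir at the folder holding your CSV files and swap the sink for whatever destination you actually want):

agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Watch a local directory and turn each new file into events (one event per line by default)
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /home/cloudera/flume_in
agent1.sources.src1.channels = ch1

# Simple in-memory buffer between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# Logger sink just prints events to the Flume log, which is enough to verify the flow
agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1

You would start it with something like flume-ng agent --conf conf --conf-file local-files.conf --name agent1. For a stock source like this, no extra Java code is needed beyond the configuration file; the Java in the Cloudera Twitter example is there because the Twitter source is a custom source.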

Related

Embedding model files to jar in Tensorflow Java

I want to embed a pre-trained model in a JAR file and later use it for prediction with TensorFlow's Java API. My TensorFlow version is 1.12.0.
I have a TensorFlow model exported in Python using tf.saved_model.simple_save. This saves the PB file and variables to an export directory. I can successfully load this model in Java using SavedModelBundle.load and run predictions as long as the export directory is locally available.
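For reference, a minimal sketch of that working local-directory load (the "serve" tag is what tf.saved_model.simple_save writes; the tensor names and input shape below are hypothetical placeholders):

// Uses org.tensorflow.SavedModelBundle and org.tensorflow.Tensor (TensorFlow Java 1.12).
SavedModelBundle bundle = SavedModelBundle.load("/path/to/export_dir", "serve");
Tensor<?> input = Tensor.create(new float[][] {{1.0f, 2.0f}});  // hypothetical input shape
Tensor<?> output = bundle.session().runner()
        .feed("input_tensor", input)    // tensor names are hypothetical
        .fetch("output_tensor")
        .run()
        .get(0);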
Now I want to use my predictor in a restricted/remote environment as a custom function for the Aster Analytics database. This database takes a JAR file and runs a user-defined function during SQL calls.
My problem is that SavedModelBundle loads from a local directory and does not support file streams or similar methods for reading resources inside a JAR file. As this database has access restrictions, I cannot create a permanent local directory and move the exported files there. Similarly, I cannot call TensorFlow Serving or make any other RPC/REST calls to an external server.
I don't want to create a temp directory and copy the exported directory there on each function call (it may leave left-over directories, and simultaneous calls could create race conditions, etc.).
Is there an efficient way to deliver the model in the JAR file and then read it? I'm hoping there is a way to use SavedModelBundle to read JAR resources. Alternatively, I'm hoping I can read the model's graph from a file stream and then initialize variables using files included in the JAR. However, I do not know how to do either.
I appreciate any suggestions and/or directions.
Thanks in advance.

How to process 50k files received over FTP every 10 seconds

I have 50k machines, and each machine has a unique ID.
Every 10 seconds, each machine sends a file to the machine_feed directory located on an FTP server. Not all files are received at the same time.
Each machine creates its file using its ID as the file name.
I need to process all received files. If a file is not processed quickly, the machine will send a new file that overwrites the existing one, and I will lose the existing data.
My solution is:
I have created a Spring Boot application containing one scheduler that executes every millisecond; it renames each received file and copies it to a processing directory, appending the current date and time to the file name.
I have one more job, written in Apache Camel, that polls files from the processing location every 500 milliseconds, processes them, and inserts the data into the DB. If an error occurs, it moves the file to an error directory.
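For context, the Camel part of that setup boils down to a file-consumer route along the lines of the hedged sketch below (directory names, delay, and the DB insert are placeholders; readLock=changed is a stock option of Camel's file component that waits until a file stops growing before consuming it, which is one way to avoid picking up half-copied files):

// Inside a RouteBuilder's configure() method.
from("file:/data/processing?delay=500&readLock=changed&moveFailed=error")
    .process(exchange -> {
        String line = exchange.getIn().getBody(String.class);
        // parse the single line and insert it into the DB here
    });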
The files are not big; each contains only one line of information.
The issue is that with fewer files it works well. As the number of files increases, files end up in the error folder even though they are valid.
When Camel polls a file it finds a zero-length file, yet after that file has been copied to the error directory it contains valid data. Somehow Camel is polling files that have not been copied completely.
Does anyone know a good solution for this problem?
Thanks in advance.
I've faced a similar problem before but I used a slightly different set of tools...
I would recommend taking a look at Apache Flume - it is a lightweight Java process. This is what I used in my situation. The documentation is pretty decent, so you should be able to find your way, but here is a brief introduction anyway just to get you started.
Flume has 3 main components and each of these can be configured in various ways:
Source - The component responsible for sourcing the data
Channel - Buffer component
Sink - This would represent the destination where the data needs to land
There are other optional components as well, such as the Interceptor, which is primarily useful for intercepting the flow and carrying out basic filtering, transformations, etc.
There is a wide variety of options to choose from for each of these, but if none of the available ones suits your use case, you could write your own component.
Now, for your situation, here are a couple of options I can think of:
Since your file location needs almost continuous monitoring, you might want to use Flume's Spooling Directory Source, which would continuously watch your machine_feed directory and pick up each file as soon as it arrives (you could choose to alter the name yourself before the file gets overwritten).
So the idea is to pick up the file, hand it over to the processing directory, and then carry on with the processing in Apache Camel as you are already doing.
The other option (and this is the one I would recommend considering) would be to do everything in one Flume agent.
Your Flume set-up could look like this:
Spooling Directory Source
One of the interceptors (optional: for your processing before inserting the data into the DB. If none of the available options is suitable, you could even write your own custom interceptor)
One of the channels (a memory channel, perhaps)
Lastly, one of the sinks (This might just need to be a custom sink in your case for landing the data in a DB)
If you do need to write a custom component (an interceptor or a sink), you could just look at the source code of one of the default components for reference. Here's the link to the source code repository.
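If you do end up writing a custom sink for the DB insert, the skeleton is fairly small. A hedged sketch (the class name is made up, and the actual insert logic and connection handling are left as placeholder comments):

import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class DbSink extends AbstractSink implements Configurable {

    public void configure(Context context) {
        // read JDBC URL, table name, etc. from the agent configuration here
    }

    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();
            if (event != null) {
                // parse event.getBody() (the single line from the file) and insert it into the DB
            }
            txn.commit();
            return event != null ? Status.READY : Status.BACKOFF;
        } catch (Throwable t) {
            txn.rollback();
            return Status.BACKOFF;
        } finally {
            txn.close();
        }
    }
}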
I understand that I've gone off on a slightly different tangent by suggesting a new tool altogether, but this worked very well for me, as Flume is a very lightweight tool with a fairly straightforward set-up and configuration.
I hope this helps.

Writing MapReduce output directly onto a webpage

I have a MapReduce job which writes its output to a file in HDFS. But instead of writing it to HDFS, I want the output to be written directly to a webpage. I have created a web project in Eclipse and written the driver, mapper and reducer classes in it. When I ran it with the Tomcat server, it didn't work.
So how can the output be displayed on a webpage?
If you are using the MapR distribution, you can write the output of your MapReduce job to the regular file system (not HDFS), but fixing your issue will require more info.
HDFS (by itself) is not really designed for low-latency random reads/writes. A few options you do have, however, are WebHDFS / HttpFS, which expose a REST API to HDFS: http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.6.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html and http://hadoop.apache.org/docs/r2.4.1/hadoop-hdfs-httpfs/. You could have the web server pull whatever file you want and serve it on the webpage. I don't think this is a very good solution, however.
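If you do go the WebHDFS route, the web-server side can be as simple as opening the REST URL and streaming it into the response. A hedged sketch (host, port, and file path are placeholders; op=OPEN is the standard WebHDFS read operation):

// Uses java.net.URL, java.io.BufferedReader and java.io.InputStreamReader.
// op=OPEN redirects to a datanode and streams the file contents back.
URL url = new URL("http://namenode-host:50070/webhdfs/v1/user/hadoop/output/part-r-00000?op=OPEN");
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
try {
    String line;
    while ((line = reader.readLine()) != null) {
        // write each line into the servlet/JSP response here
    }
} finally {
    reader.close();
}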
A better solution might be to have MapReduce write its output to HBase (http://hbase.apache.org/) and have your web server pull from HBase. It is far better suited for low-latency random reads/writes.
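On the web-server side, reading a result row back out of HBase is then a short client call. A hedged sketch using the newer HBase client Connection API (table name, column family, qualifier, and row key are all placeholders):

// Uses org.apache.hadoop.conf.Configuration, org.apache.hadoop.hbase.* and org.apache.hadoop.hbase.client.*.
Configuration conf = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("mr_output"));
try {
    Result row = table.get(new Get(Bytes.toBytes("some-row-key")));
    byte[] value = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("result"));
    String display = value == null ? "" : Bytes.toString(value);
    // render 'display' in the webpage response
} finally {
    table.close();
    connection.close();
}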

Windows7 Batch file: Net Use command

I have a team of users that have read only access to a shared network drive. Sometimes these users will need to deploy their project resources to the drive. I am trying to come up with a secure build process for them to use. Currently I am using a batch file that they can execute from their local system which will do the following...
User starts batch file
Batch file calls a Java program (the credentials are 'hidden' and 'encrypted' within the Java program)
The Java program handles the encryption process and then calls a final batch file that actually runs the NET USE command to map the drive with admin credentials
The final batch file maps the drive, copies the required resources onto the shared drive, and then re-maps the drive with the original user credentials (read only).
My major problem is that users have direct access to the batch files that do this entire process, and they could simply remove the @ECHO OFF command from the final batch file to display all the credentials in the cmd output window.
I'm not sure if there's a better solution to this sort of thing? Any ideas will be greatly appreciated!
Also, all machines are using Windows 7 and using a Windows network drive.
The best solution would be to copy the resources directly in the Java program using the jCIFS library.
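A hedged sketch of what that could look like with jCIFS (domain, credentials, share path, and file names are placeholders; in the real build process the decrypted credentials would come from the Java program rather than being hard-coded):

// Uses jcifs.smb.NtlmPasswordAuthentication, jcifs.smb.SmbFile, jcifs.smb.SmbFileOutputStream and java.io streams.
NtlmPasswordAuthentication auth = new NtlmPasswordAuthentication("DOMAIN", "deployUser", "secret");
SmbFile remote = new SmbFile("smb://fileserver/shared/project/resource.jar", auth);
InputStream in = new FileInputStream("C:/build/resource.jar");
OutputStream out = new SmbFileOutputStream(remote);
try {
    byte[] buffer = new byte[8192];
    int length;
    while ((length = in.read(buffer)) > 0) {
        out.write(buffer, 0, length);
    }
} finally {
    out.close();
    in.close();
}

This way the share is written to directly over SMB, so no drive mapping (and no NET USE in a readable batch file) is needed.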
A second option would be to map the drive from within the Java program. There's more information in this SO question: How can I mount a windows drive in Java?
There are some .bat to .exe compilers out there. I'm not sure how well they will work for your particular batch file, but they are probably worth a look. Here are a couple of them:
Advanced BAT to EXE Compiler
Quick Batch File Compiler
Batch File Compiler PE

Copy Files Between Windows Servers with Java?

I'm looking for code, or a library, which I can use to copy files between Windows servers using Java. Picture a primary server that runs a background thread, so that whenever a transaction is completed, it backs up the database files to the backup server. (This is required protocol, so no need to discuss the pros/cons of this action). In other words, transaction completes, Java code gets executed which copies one directory to the back-up server.
The way the Windows machines are set up, the primary server has the back-up server's C: drive mapped as its own Z: drive. Both machines run Windows 2003 or 2008 Server. Java 1.6.
Found the correct answer on another forum and from messing around a little with the settings. The problem with copying files from one machine to another in Windows (using either a .bat file or straight-up Java code) is user permissions. On the primary server, you MUST set the Tomcat process to run as the administrator, using that administrator's username and password. (Right-click on the Tomcat service, select the "Log On" tab, enter the administrator's username/password.) The default user that Tomcat runs as (a local user) isn't sufficient to copy files between networked drives on Windows. When I set that correctly, both the .bat file solution I had tried before this post and the straight-Java solution suggested here worked just fine.
Hope that helps someone else, and thanks for the suggestions.
Obtain the files with File#listFiles() on a directory on one disk, iterate over them, create a new File on the other disk, read from a FileInputStream on the source file, and write to a FileOutputStream on the destination file.
In a nutshell:
for (File file : new File("C:/path/to/files").listFiles()) {
    if (!file.isDirectory()) {
        File destination = new File("Z:/path/to/files", file.getName());
        // Do the usual read InputStream, write OutputStream job here.
    }
}
See also:
Java IO tutorial
By the way, if you were using Java 7, you could use java.nio.file.Files#copy() for this (it copies a single file at a time; for a whole directory you would still iterate or walk the tree):
Files.copy(Paths.get("C:/path/to/files/somefile"), Paths.get("Z:/path/to/files/somefile"));
I would really recommend using Apache Commons IO for this.
The FileUtils class provides methods to iterate over directories and copy files from one directory to another. You don't have to worry about reading and writing files yourself because it's all done by the library for you.
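A hedged sketch with Commons IO (the source and destination paths are the same placeholders used above):

import java.io.File;
import java.io.IOException;
import org.apache.commons.io.FileUtils;

public class BackupCopy {
    public static void main(String[] args) throws IOException {
        // Recursively copies the whole directory tree to the mapped backup drive.
        FileUtils.copyDirectory(new File("C:/path/to/files"), new File("Z:/path/to/files"));
    }
}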
