(Please don't suggest a Hadoop or MapReduce solution, even though they sound logically suited to this.)
I have a big file - 70GB of raw HTML files - and I need to parse it to get the information I need.
I have successfully dealt with a 10GB file before using standard I/O:
cat input_file | python parse.py > output_file
My Python script reads one HTML document per line from standard input and writes the results back to standard output:
from bs4 import BeautifulSoup
import sys

for line in sys.stdin:
    soup = BeautifulSoup(line, "html.parser")
    print(...)  # extract the fields you need from soup and print them
The code is very simple, but right now I am dealing with a big file that is horribly slow to process on one node. I have a cluster of about 20 nodes, and I am wondering how I could easily distribute this work.
What I have done so far:
split -l 5000 input_file.all input_file_ # I have 60K lines in total in that 70G file
Now the big file has been split into several small files:
input_file_aa
input_file_ab
input_file_ac
...
Then I have no problem working with each one of them:
cat input_file_aa | python parser.py > output_file_aa
What I would probably do is scp the input files to each node, do the parsing, and then scp the results back, but there are 10+ nodes! It is so tedious to do that manually.
I am wondering how I could easily distribute these files to the other nodes, do the parsing, and move the results back.
I am open to basic shell, Java, or Python solutions. Thanks a lot in advance, and let me know if you need more explanation.
Note: I do have a folder called /bigShare/ that is accessible on every node, and its contents are synchronized and stay the same. I don't know how the architect implemented that (NFS? I don't know how to check), but I could put my input_file and Python script there, so what remains is how to easily log into those nodes and execute the command.
Btw, I am on Red Hat.
Execute the command remotely, piping the remote stdout back over ssh, and redirect the local command's output to a local file.
Example:
ssh yourUserName@node1 "cat input_file_node1 | python parser.py" > output_file_node1
If the files have not been copied to the different nodes, then:
ssh yourUserName@node1 "python parser.py" < input_file_node1 > output_file_node1
This assumes that yourUserName has been configured with key-based authentication. Otherwise, you will need to enter your password manually (20 times! :-( ). To avoid this you can use expect, but I strongly suggest setting up key-based authentication. You can do the latter with expect too.
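For example, key-based authentication can be set up once from your local machine (node1 stands in for each of your nodes):
ssh-keygen -t rsa                  # generate a key pair; accept the defaults
ssh-copy-id yourUserName@node1     # repeat once for each node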
Assuming you want to process each piece of the file on a host of its own: first copy the Python script to the remote hosts. Then loop over the remote hosts:
for x in aa ab ac ...; do
    ssh user@remote-$x python yourscript.py <input_file_$x >output_file_$x &
done
wait  # block until all the background jobs have finished
If the processing nodes don't have names that are easy to generate you can create aliases for them in your .ssh/config, for example:
Host remote-aa
    Hostname alpha.int.yourcompany
Host remote-ab
    Hostname beta.int.yourcompany
Host remote-ac
    Hostname gamma.int.yourcompany
This particular use case could be more easily solved by editing /etc/hosts though.
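For example, the /etc/hosts entries could look like this (the IP addresses here are made up):
10.0.0.1    remote-aa
10.0.0.2    remote-ab
10.0.0.3    remote-ac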
My Java application has to work like this:
1. The user selects bash commands in the GUI and presses "send."
2. The application returns a distinct, independent answer for each command (e.g., we could store them in different files).
3. The commands each run interactively, not as a batch (it can't be something like "ls\n pwd \n", etc.).
4. After each command, the application checks whether the results are OK. If so, it sends the next command.
5. We need to execute su <user> on the remote host.
This will be a plugin for a bigger app, so answers like "try something else" (e.g., RPC, web services) will not help me :(
As far as I understand, I have to use a shell channel, or at least keep the channel connected.
I have tested JSch, sshj, and ethz.ssh2, but with bad results.
I've dug through Stack Overflow answers to questions like "sending-commands-to-server-via-jsch-shell-channel", etc., but they all focus on sending whole commands in one line. I need an interactive, persistent SSH session.
I've used ExpectJ (with a little hack of the output stream). It has resolved points 1, 3, 4, and 5.
But there is a problem with point 2. In my app I need to get each answer separately, but we won't know its length, and the command prompts can differ. Does anyone know how to "hack" ExpectJ so that it is somehow more synchronized? I am looking for behavior like: send, wait for the full answer, send, wait... I've tried some basic synchronization tricks, but they usually end in timeouts and lost connections.
You should use ExpectJ, a Java implementation of the Unix expect utility.
Not sure if you still have this problem; in any case, this might help other people.
ExpectJ is indeed the Java implementation of Unix expect, and you should definitely buy the "Exploring Expect" book and look into it; it is worth it.
For your question:
When you spawn a process, you listen to the returned output, match it against a prompt, and then send a command.
If you want to analyze the output, buffer it and do some actions before the next send().
To do so, use the interact() method of the Spawn class you used.
http://expectj.sourceforge.net/apidocs/index.html
And for interact() and how it works:
http://oreilly.com/catalog/expect/chapter/ch03.html
Look for the part called "The interact Command".
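For illustration, here is a minimal sketch of that expect/send loop, assuming key-based ssh, a "$" prompt, and placeholder host and commands (a sketch of the technique, not the poster's exact code):

import expectj.ExpectJ;
import expectj.Spawn;

public class SshSessionDemo {
    public static void main(String[] args) throws Exception {
        ExpectJ expectj = new ExpectJ(10);              // default timeout: 10 seconds
        Spawn shell = expectj.spawn("ssh user@host");   // placeholder host; key-based auth assumed
        shell.expect("$");                              // wait for the first prompt
        shell.send("ls -l\n");                          // placeholder command
        shell.expect("$");                              // the answer is now buffered
        System.out.println(shell.getCurrentStandardOutContents());
        shell.send("exit\n");
        shell.expectClose();                            // wait for the session to end
    }
}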
I am learning Java programming and I am doing well.
I am working on this project at work. Let me carefully lay out what I need.
My computer is connected to a tool I will call 'A'. This tool streams data to my computer, 'B', in real time.
Two logs are created:
One on the screen that I can clear or save.
A second, 'A' log created in the background.
Is there a way I can query this changing background log (which is a copy of the data being streamed) for predetermined keywords or strings and output the matches immediately on the screen? How do I go about this? What's the efficient way? How about listening on the ports?
I just want to accomplish this in real time.
If this is a Linux box, you can do:
tail -f <my_log_file_name> | grep <my_keyword>
If you have a Windows box, try BareTail - a free (and, I think, great) tool that colors your keyword.
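Since the question mentions Java, here is a minimal polling sketch of the same tail-and-grep idea; the file name and keyword are placeholders:

import java.io.RandomAccessFile;

public class LogWatcher {
    public static void main(String[] args) throws Exception {
        String file = "background.log";   // placeholder: path to the background log
        String keyword = "ERROR";         // placeholder: string to look for
        RandomAccessFile log = new RandomAccessFile(file, "r");
        long pos = log.length();          // start at the end, like tail -f
        while (true) {
            if (log.length() > pos) {     // new data has been appended
                log.seek(pos);
                String line;
                while ((line = log.readLine()) != null) {
                    if (line.contains(keyword)) {
                        System.out.println(line);   // print matches immediately
                    }
                }
                pos = log.getFilePointer();
            }
            Thread.sleep(500);            // poll twice per second
        }
    }
}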
I'm building a web application that helps people improve their English pronunciation of certain words. The website displays a sentence; the user speaks it and then presses the "Results" button. The web application then sends two files to the server: a .txt file and a .wav file.
The server (Linux, Ubuntu) should take the files, do some analysis and calculations, and then print the results to a file called "Results.txt". The web application (which is PHP-based) should then read the results from that file and display them to the user.
The problem is: I'm not sure of the best way to handle the communication between the web application and the Linux server. So far I have succeeded in writing the .txt and .wav files to the server, and I can build a Linux script that takes these two files and does the required calculations. What I'm facing is: I don't know how to properly and effectively start the script. More importantly: when the script is done, how do I know that I can safely read the results from the "Results.txt" file? I need a synchronization tool or method.
I asked some guys, and they told me to use a Java application on the server side, but I'm not sure how to do it!
Any help, please? :)
First, you can do it with PHP. Using the shell_exec() function, you can run your Linux scripts and read their output. Ex:
$output = shell_exec("ls -l > outputFile.txt");
This will write the current directory listing to a file called outputFile.txt.
Second, you can also do the same using Java. Use:
Runtime.getRuntime().exec("ls -l").waitFor();
Not calling waitFor() at the end will cause your shell script to execute asynchronously. You can also read the stdout and stderr streams of your Linux script using
Process.getInputStream()
and
Process.getErrorStream()
methods respectively. Hope that helps.
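For illustration, a minimal sketch putting those pieces together (the script path is a placeholder for your analysis script):

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class RunScript {
    public static void main(String[] args) throws Exception {
        Process p = Runtime.getRuntime().exec("/path/to/analysis.sh");  // placeholder command
        BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()));
        BufferedReader err = new BufferedReader(new InputStreamReader(p.getErrorStream()));
        String line;
        while ((line = out.readLine()) != null) System.out.println(line);  // stdout
        while ((line = err.readLine()) != null) System.err.println(line);  // stderr
        int exitCode = p.waitFor();   // once this returns, Results.txt is safe to read
        System.out.println("exit code: " + exitCode);
        // note: for scripts with heavy output, read the two streams on separate
        // threads, or one of them can fill its buffer and block the process
    }
}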
Hey, I have run into the following problem when attempting to build a Java program which executes commands on a remote Linux server and returns the output for processing...
Basically I have installed Cygwin with an SSH client and want to do the following:
Open Cygwin,
Send command "user#ip";
Return output;
Send command "password";
Return output;
Send multiple other commands,
Return output;
...etc...
So far:
Process proc = Runtime.getRuntime().exec("C:/Power Apps/Cygwin/Cygwin.bat");
This works nicely, except that I am at a loss as to how to attempt the next steps.
Any help?
The quick way: don't go through Cygwin. Pass your login info and commands as arguments to ssh, as in the example below.
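For instance (the user, host, and command here are placeholders):
ssh user@192.168.1.10 "ls -l /tmp"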
A better way: install and use the open-source and very mature Sun Grid Engine, and use its DRMAA binding for Java to exec your commands. You might also consider switching to a scripting language (yours is a very script-like task); if you do, DRMAA has Perl, Ruby, and other bindings as well.
You could also use Plink:
Download here
There is a good set of instructions linked here.
You can use a command like:
plink root@myserver -pw passw /etc/backups/do-backup.sh
Use an SSH implementation in Java. I used Ganymed a couple of years ago; there may be better alternatives now.
Using Ganymed, you get an input stream to read from and an output stream to write to.
You can create a reader (e.g., a BufferedReader) on the input stream and use it to read lines of output from the remote server, then use a regexp Pattern/Matcher to parse the responses.
Create a PrintWriter on the output stream and use println() to send your commands.
It's simple and actually quite powerful (if you know regexps... it might take some trial and error to get right...).
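A minimal sketch of that approach, assuming the Ganymed SSH-2 API (host, credentials, and commands are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import ch.ethz.ssh2.Connection;
import ch.ethz.ssh2.Session;

public class GanymedShellDemo {
    public static void main(String[] args) throws Exception {
        Connection conn = new Connection("host");                    // placeholder host
        conn.connect();
        if (!conn.authenticateWithPassword("user", "password"))      // placeholder credentials
            throw new RuntimeException("authentication failed");
        Session sess = conn.openSession();
        sess.requestPTY("dumb");                                     // ask for an interactive shell
        sess.startShell();
        PrintWriter out = new PrintWriter(sess.getStdin(), true);    // send commands here
        BufferedReader in = new BufferedReader(new InputStreamReader(sess.getStdout()));
        out.println("ls -l");    // placeholder command
        out.println("exit");     // close the shell so the read loop below ends
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);   // parse with a regexp Pattern/Matcher here
        }
        conn.close();
    }
}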