I need to copy some files from hdfs:///user/hdfs/path1 to hdfs:///user/hdfs/path2. I wrote some Java code to do the job:
ugi = UserGroupInformation.createRemoteUser("hdfs", AuthMethod.SIMPLE);
System.out.println(ugi.getUserName());
conf = new org.apache.hadoop.conf.Configuration();
// TODO: Change IP
conf.set("fs.defaultFS", URL);
conf.set("hadoop.job.ugi", user);
conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
// paths = new ArrayList<>();
fs = FileSystem.get(conf);
I get all the paths matching the wildcard with
fs.globStatus(new Path(regPath));
and copy each one with
FileUtil.copy(fs, p, fs, new Path(to + "/" + p.getName()), false, true, conf);
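For reference, the whole glob-and-copy step looks roughly like this (a consolidated sketch of the snippets above; regPath and to stand for my source wildcard and destination directory):

FileStatus[] matches = fs.globStatus(new Path(regPath));
if (matches != null) {
    for (FileStatus status : matches) {
        Path p = status.getPath();
        // copy each matched file into the destination directory, overwriting if present
        FileUtil.copy(fs, p, fs, new Path(to + "/" + p.getName()), false, true, conf);
    }
}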
However, copying fails with the following message, whereas globStatus executes successfully:
WARN BlockReaderFactory:682 - I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.110.80.177:50010]
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:532)
at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3044)
at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:744)
at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:659)
at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:327)
at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:574)
at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:797)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:844)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:78)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:52)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:112)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:366)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:338)
Note that I am running the code remotely over the Internet using port forwarding, i.e.
192.168.1.10[JAVA API] ---> 154.23.56.116:8082[Name Node Public I/P]======10.1.3.4:8082[Name Node private IP]
I guess the following is the reason:
The globStatus query goes to the namenode, which executes it successfully.
The copy request also goes to the namenode, which returns datanode addresses such as 10.110.80.177:50010 on other machines; the Java client then tries to read the blocks directly from those datanodes, and since they are not exposed to the outside world I get this error!
Am I right in this deduction? How can I solve the issue? Do I need to create a Java server on the namenode that collects copy commands and copies the files locally within the cluster?
Related
I'm using Apache commons-net FTPClient (version 3.3), and the error below was produced on one of my client's machines (I've tried reproducing the error on my dev machine without luck, using a testing folder with the same requests on the same server with the same login).
I have a process that checks the remote FTP server for new requests in the form of XML files. After listing all of those files, I loop over them to check whether they are in XML format. If a file is, I first rename it from *.xml to *.xmlProcessing, retrieve it into an input stream, parse it into my object and create a request in my queue, and finally rename it again and move it to a subfolder that works as an archive.
After downloading a random number of files I get stuck while calling retrieveFileStream on the next file, without a timeout or IOException.
I've managed to get logs from the FTP server, and it just says it can't open a data connection:
05.06.2019 12:45:45 - > RNFR /folder/file.xml
05.06.2019 12:45:45 - > 350 File exists, ready for destination name.
05.06.2019 12:45:45 - > RNTO /folder/file.xmlProcessing
05.06.2019 12:45:45 - > 250 file renamed successfully
05.06.2019 12:45:45 - > PORT *ports*
05.06.2019 12:45:45 - > 200 Port command successful
05.06.2019 12:45:45 - > RETR /folder/file.xmlProcessing
05.06.2019 12:45:45 - > 150 Opening data channel for file download from server of "/folder/file.xmlProcessing"
05.06.2019 12:45:45 - > 425 Can't open data connection for transfer of "/folder/file.xmlProcessing"
I've already tried different FTP modes: active local, active remote, passive, etc. (currently stuck with local passive mode).
I've tried setting the data timeout, but when I finally got stuck on one of the files, the method took more than a minute on that file despite my setting the timeout to 30 s.
ftp = new FTPClient();
ftp.addProtocolCommandListener(new PrintCommandListener(new PrintWriter(System.out)));
ftp.connect(server, port);
int reply = ftp.getReplyCode();
if (!FTPReply.isPositiveCompletion(reply)) {
ftp.disconnect();
throw new IOException("Exception in connecting to FTP Server");
}
ftp.login(user, password);
ftp.enterLocalPassiveMode();
ftp.setKeepAlive(true);
ftp.setDataTimeout(30000);
Collection<String> listOfFiles = listFiles(FOLDER_PATH);
for (String filePath : listOfFiles) {
    if (filePath != null && filePath.endsWith(".xml")) {
        ftp.rnfr(folder + filePath);
        ftp.rnto(folder + filePath + "Processing");
        InputStream fileInputStream = ftp.retrieveFileStream(folder + filePath + "Processing");
        ftp.completePendingCommand();
        // Parsing file to an instance of my object and creating request
        ftp.rnfr(folder + filePath + "Processing");
        ftp.rnto(archiveFolder + filePath);
        if (fileInputStream != null) {
            fileInputStream.close();
        }
    }
}
Is it more likely that this is the fault of the FTP server, a firewall issue, or something else?
I've run the same code from my dev machine and it processed all the files in the test folder (there were around 400 of them). I don't know whether I'm just being unlucky in that the error doesn't occur on my local dev machine, or whether there is actually something wrong with the communication between my contractor's machine and the remote server.
Every time you do a file transfer or a directory listing with FTP, the server (or the client, if using active mode) assigns that transfer a random port number out of a configured range. The port number is not released immediately when the transfer completes; there is some cooldown interval. If you do too many file transfers in a short time, the server can run out of available ports, because they all end up in the cooldown state.
If you can, check the server configuration and configure a larger range of ports.
Or as a workaround, you can try to slow down the transfer rate.
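For example, a minimal way to throttle the loop from the question (just a sketch; the one-second pause is an arbitrary value you would tune for your server):

for (String filePath : listOfFiles) {
    if (filePath != null && filePath.endsWith(".xml")) {
        // ... rename, retrieve and archive the file as before ...

        // Pause between transfers so data ports have time to leave the cooldown state
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
    }
}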
For some background, see:
How many data channel ports do I need for an FTP?
Why does FTP passive mode require a port range as opposed to only one port?
Though this is just a guess, you should check the server's log, as it can show more details.
Another possibility is that the server simply limits the number of transfers it allows for a specific user or source address in some time interval.
I have a client-server topology in which the client asks for a listing of the directories or files in the current working directory on the server, and the server replies with the appropriate information.
See client code
controlSocket.writeByte(LSDIR);
int dirCount = controlSocket.readInt();
Map<String, Long> dirMap = new HashMap<>();
for (int i = 0; i < dirCount; i++) {
    dirMap.put(controlSocket.readString(), controlSocket.readLong());
}
and server code
dir = new File(cwd);
output.writeInt(dir.listFiles(File::isDirectory).length);
for (File file : dir.listFiles(File::isDirectory)) {
    output.writeUTF(file.getName());
    output.writeLong(file.lastModified());
}
Now when I don't change the directory on the server the directory listing works just fine, no matter how many times I call it. However, if I cd using this client code
controlSocket.writeByte(CD);
controlSocket.writeString(path);
and this server code
String inputDir = input.readUTF();
if (inputDir.equals("..")) {
    cwd = cwd.substring(0, cwd.lastIndexOf("/"));
} else {
    cwd += "/" + inputDir;
}
the directory listing runs, but the integer read from the socket is not what the server sends (e.g. on the server I see 1 being sent, but on the client something like 16777216 is read). The server reads the directory content with no problem, so there is no issue on that side.
It seems like the data I/O stream is not consistent, or else I'm missing something. Note that both the client and the server run on the same machine.
The problem was that the server wrote an additional confirmation boolean after changing the directory, which was read by the client as the high byte of the integer: a boolean true is written as the single byte 0x01, so the client read 0x01000000 = 16777216 instead of the count.
So the code above works without any problem.
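In other words, the request/response protocol has to stay symmetric. A minimal sketch of the two possible fixes, assuming the server wrote the confirmation with a plain writeBoolean and that the control-socket wrapper has a matching readBoolean (both assumptions on my part):

// Server side: either remove the extra confirmation after the CD command ...
output.writeBoolean(true);

// ... or, client side, consume it before sending the next request
controlSocket.writeByte(CD);
controlSocket.writeString(path);
boolean cdOk = controlSocket.readBoolean();  // keeps the stream in sync for the next LSDIR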
I have a Swing front end for use as a client to kick off Selenium tests. At the point where I kick off the test process, the first thing I do is fire up a local Selenium hub and node using the following code:
String[] nodeCmd = new String[]{"java", "-jar", "selenium-server-standalone-2.39.0.jar", "-role node", "-nodeConfig config\\DefaultNode.json"};
ProcessBuilder pbNode = new ProcessBuilder(nodeCmd);
pbNode.directory(new File("C:\\selenium\\"));
File nodeLog = new File("C:\\selenium\\logs\\nodeOut.log");
pbNode.redirectErrorStream(true);
pbNode.redirectOutput(nodeLog);
Process nodeP = pbNode.start();
String[] hubCmd = new String[]{"java", "-jar", "selenium-server-standalone-2.39.0.jar", "-role hub", "-hubConfig config\\DefaultHub.json"};
ProcessBuilder pbHub = new ProcessBuilder(hubCmd);
pbHub.directory(new File("C:\\selenium\\"));
File hubLog = new File("C:\\selenium\\logs\\hubOut.log");
pbHub.redirectErrorStream(true);
pbHub.redirectOutput(hubLog);
Process hubP = pbHub.start();
Whilst the hub starts up correctly, the node process seems to start as a hub as well (its log output is exactly the same), with the result that it complains that the port is already in use.
Exception in thread "main" java.net.BindException: Selenium is already running on port 4444. Or some other service is.
at org.openqa.selenium.server.SeleniumServer.start(SeleniumServer.java:491)
at org.openqa.selenium.server.SeleniumServer.boot(SeleniumServer.java:300)
at org.openqa.selenium.server.SeleniumServer.main(SeleniumServer.java:245)
at org.openqa.grid.selenium.GridLauncher.main(GridLauncher.java:96)
Any ideas what I am doing wrong? I have checked the config files and they are definitely right.
Update
So I finally worked out what I was doing wrong! My mistake was in the way I was constructing my array of parameters to pass into the ProcessBuilder constructor.
Wrong:
String[] nodeCmd = new String[]{"java", "-jar", "selenium-server-standalone-2.39.0.jar", "-role node", "-nodeConfig config\\DefaultNode.json"};
Right:
String[] nodeCmd = new String[]{"java", "-jar", "selenium-server-standalone-2.39.0.jar", "-role", "node", "-nodeConfig", "config\\DefaultNode.json"};
I wasn't splitting the key string and the value string for the role and config params. Grrr!
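For completeness, the hub command needed the same fix (splitting the -role and -hubConfig switches from their values):

String[] hubCmd = new String[]{"java", "-jar", "selenium-server-standalone-2.39.0.jar", "-role", "hub", "-hubConfig", "config\\DefaultHub.json"};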
Make sure that your nodeConfig.json sets its "port" property to an unused port (it defaults to 5555), and that your hubConfig.json likewise sets its "port" property to an unused port (it defaults to 4444).
If you can validate that both of the above are correct, then the only other issue that I can imagine is that previous hubs (probably started up while testing this very functionality) were left running.
I would ensure that all java.exe processes are terminated, and add a failsafe in your code to quit the grid.
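One possible failsafe (just a sketch, assuming the Process handles hubP and nodeP from the question are kept accessible, e.g. as final variables or fields): destroy them in a JVM shutdown hook so the local grid goes down with the client.

// Hypothetical failsafe: tear the local grid down when the client JVM exits
Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
    @Override
    public void run() {
        if (nodeP != null) {
            nodeP.destroy();
        }
        if (hubP != null) {
            hubP.destroy();
        }
    }
}));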
I'm having a bit of trouble with a simple Hadoop install. I've downloaded Hadoop 2.4.0 and installed it on a single CentOS Linux node (a virtual machine). I've configured Hadoop for a single node with pseudo-distribution as described on the Apache site (http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html). It starts with no issues in the logs, and I can read and write files using the "hadoop fs" commands from the command line.
I'm attempting to read a file from HDFS from a remote machine using the Java API. The remote machine can connect and list directory contents. It can also determine whether a file exists, using this code:
Path p=new Path("hdfs://test.server:9000/usr/test/test_file.txt");
FileSystem fs = FileSystem.get(new Configuration());
System.out.println(p.getName() + " exists: " + fs.exists(p));
The system prints “true” indicating it exists. However, when I attempt to read the file with:
BufferedReader br = null;
try {
    Path p = new Path("hdfs://test.server:9000/usr/test/test_file.txt");
    FileSystem fs = FileSystem.get(CONFIG);
    System.out.println(p.getName() + " exists: " + fs.exists(p));
    br = new BufferedReader(new InputStreamReader(fs.open(p)));
    String line = br.readLine();
    while (line != null) {
        System.out.println(line);
        line = br.readLine();
    }
}
finally {
    if (br != null) br.close();
}
this code throws the exception:
Exception in thread "main" org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-13917963-127.0.0.1-1398476189167:blk_1073741831_1007 file=/usr/test/test_file.txt
Googling gave some possible tips, but everything checked out. The data node is connected, active, and has enough space. The admin report from hdfs dfsadmin -report shows:
Configured Capacity: 52844687360 (49.22 GB)
Present Capacity: 48507940864 (45.18 GB)
DFS Remaining: 48507887616 (45.18 GB)
DFS Used: 53248 (52 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Datanodes available: 1 (1 total, 0 dead)
Live datanodes:
Name: 127.0.0.1:50010 (test.server)
Hostname: test.server
Decommission Status : Normal
Configured Capacity: 52844687360 (49.22 GB)
DFS Used: 53248 (52 KB)
Non DFS Used: 4336746496 (4.04 GB)
DFS Remaining: 48507887616 (45.18 GB)
DFS Used%: 0.00%
DFS Remaining%: 91.79%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Fri Apr 25 22:16:56 PDT 2014
The client jars were copied directly from the hadoop install so no version mismatch there. I can browse the file system with my Java class and read file attributes. I just can’t read the file contents without getting the exception. If I try to write a file with the code:
FileSystem fs = null;
BufferedWriter br = null;
System.setProperty("HADOOP_USER_NAME", "root");
try {
    fs = FileSystem.get(new Configuration());
    //Path p = new Path(dir, file);
    Path p = new Path("hdfs://test.server:9000/usr/test/test.txt");
    br = new BufferedWriter(new OutputStreamWriter(fs.create(p, true)));
    br.write("Hello World");
}
finally {
    if (br != null) br.close();
    if (fs != null) fs.close();
}
this creates the file but doesn’t write any bytes and throws the exception:
Exception in thread "main" org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /usr/test/test.txt could only be replicated to 0 nodes instead of minReplication (=1). There are 1 datanode(s) running and 1 node(s) are excluded in this operation.
Googling for this indicated a possible space issue but from the dfsadmin report, it seems there is plenty of space. This is a plain vanilla install and I can’t get past this issue.
The environment summary is:
SERVER:
Hadoop 2.4.0 with pseudo-distribution (http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html)
CentOS 6.5 Virtual Machine 64 bit server
Java 1.7.0_55
CLIENT:
Windows 8 (Virtual Machine)
Java 1.7.0_51
Any help is greatly appreciated.
Hadoop error messages are frustrating. Often they don't say what they mean and have nothing to do with the real issue. I've seen problems like this occur when the client, namenode, and datanode cannot communicate properly. In your case I would pick one of two issues:
Your cluster runs in a VM and its virtualized network access to the client is blocked.
You are not consistently using fully-qualified domain names (FQDN) that resolve identically between the client and host.
The host name "test.server" is very suspicious. Check all of the following:
Is test.server a FQDN?
Is this the name that has been used EVERYWHERE in your conf files?
Can the client and all hosts forward and reverse resolve "test.server" and its IP address and get the same thing?
Are IP addresses being used instead of FQDN anywhere?
Is "localhost" being used anywhere?
Any inconsistency in the use of FQDN, hostname, numeric IP, and localhost must be removed. Do not ever mix them in your conf files or in your client code. Consistent use of FQDNs is preferred. Consistent use of numeric IPs usually also works. Use of an unqualified hostname, localhost, or 127.0.0.1 causes problems.
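As a quick sanity check of forward and reverse resolution, something like the sketch below can be run on both the client and the cluster hosts (standard java.net APIs; "test.server" is the hostname from the question). Every machine should print the same hostname/IP pairs.

import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        // Forward resolution: hostname -> IP address(es)
        for (InetAddress addr : InetAddress.getAllByName("test.server")) {
            // Reverse resolution: IP address -> canonical hostname
            System.out.println(addr.getHostAddress() + " -> " + addr.getCanonicalHostName());
        }
    }
}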
We need to make sure the configuration has fs.default.name set, such as:
configuration.set("fs.default.name","hdfs://ourHDFSNameNode:50000");
Below I've put a piece of sample code:
Configuration configuration = new Configuration();
configuration.set("fs.default.name", "hdfs://ourHDFSNameNode:50000");
FileSystem fs = pt.getFileSystem(configuration);  // pt is the Path being read
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(pt)));
try {
    String line = br.readLine();
    while (line != null) {
        System.out.println(line);
        line = br.readLine();
    }
} finally {
    br.close();
}
The answer above is pointing in the right direction. Allow me to add the following:
1. The Namenode does NOT directly read or write data.
2. The client (your Java program with direct access to HDFS) interacts with the Namenode to update the HDFS namespace and retrieve block locations for reading/writing.
3. The client interacts directly with the Datanodes to read/write data.
You were able to list directory contents because hostname:9000 was accessible to your client code. You were doing number 2 above.
To be able to read and write, your client code needs access to the Datanode (number 3). The default port for Datanode DFS data transfer is 50010. Something was blocking your client communication to hostname:50010. Possibly a firewall or SSH tunneling configuration problem.
I was using Hadoop 2.7.2, so maybe you have a different port number setting.
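To verify from the client machine that the datanode transfer port is reachable at all, a plain TCP probe like the sketch below can help (the hostname and port are placeholders; adjust them to your cluster):

import java.net.InetSocketAddress;
import java.net.Socket;

public class DatanodePortCheck {
    public static void main(String[] args) {
        try (Socket socket = new Socket()) {
            // Attempt a raw TCP connection to the datanode data-transfer port
            socket.connect(new InetSocketAddress("test.server", 50010), 5000);
            System.out.println("Datanode port is reachable");
        } catch (Exception e) {
            System.out.println("Cannot reach datanode port: " + e);
        }
    }
}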
I am trying to run the HBase ImportTsv Hadoop job to load data into HBase from a TSV file. I am using the following code:
Configuration config = new Configuration();
Iterator iter = config.iterator();
while (iter.hasNext()) {
    Object obj = iter.next();
    System.out.println(obj);
}
Job job = new Job(config);
job.setJarByClass(ImportTsv.class);
job.setJobName("ImportTsv");
job.getConfiguration().set("user", "hadoop");
job.waitForCompletion(true);
I am getting this error
ERROR security.UserGroupInformation: PriviledgedActionException as:E317376 cause:org.apache.hadoop.security.AccessControlException: org.apache.hadoop.security.AccessControlException: Permission denied: user=E317376, access=WRITE, inode="staging":hadoop:supergroup:rwxr-xr-x
I don't know how the user name E317376 is being set. This is my Windows machine user, from which I am trying to run this job on a remote cluster. My Hadoop user account on the Linux machine is "hadoop".
When I run this on a Linux machine that is part of the Hadoop cluster, under the hadoop user account, everything works well. But I want to run this job programmatically from a Java web application. Am I doing anything wrong? Please help...
You should have a property like the one below in your mapred-site.xml file:
<property>
  <name>mapreduce.jobtracker.staging.root.dir</name>
  <value>/user</value>
</property>
and maybe it is necessary to chmod the /user folder of your dfs file system to 777
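If you do go that route, the command would be along these lines (assuming the standard HDFS shell; apply it only if your security requirements allow such permissive access):

hadoop fs -chmod 777 /user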
do not forget to stop/start your jobtrackers and tasktrackers (sh stop-mapred.sh and sh start-mapred.sh)
I haven't tested these solutions, but try adding something like this to your job configuration:
conf.set("hadoop.job.ugi", "hadoop");
The above may be obsolete, so you can also try the following, with user set to "hadoop" (code from http://hadoop.apache.org/common/docs/r1.0.3/Secure_Impersonation.html):
UserGroupInformation ugi =
        UserGroupInformation.createProxyUser(user, UserGroupInformation.getLoginUser());
ugi.doAs(new PrivilegedExceptionAction<Void>() {
    public Void run() throws Exception {
        // Submit a job
        JobClient jc = new JobClient(conf);
        jc.submitJob(conf);
        // OR access HDFS
        FileSystem fs = FileSystem.get(conf);
        fs.mkdirs(someFilePath);
        return null;
    }
});