How to append data to an existing file in HDFS using the Java API?

I am using Hadoop 2.6.0 and trying to append data to an existing file in HDFS, but it doesn't seem to work for me. Here is the code I use to write to HDFS with the FileSystem.append() method.
HdfsIO hdfsIO = new HdfsIO(hdfsCoreSite, hdfsSite);
FileSystem fs = FileSystem.get(hdfsIO.getConfiguration());
FSDataOutputStream out = fs.append(new Path("/test_dir_10/append_data_to_this_file.txt"));
out.writeUTF("Append demo...");
out.close(); // close the output stream before closing the FileSystem
fs.close();
The write() and create() functions work well, but the append() function does not.
I got this error:
Failed to close file /test_dir_10/append_data_to_this_file.txt. Lease
recovery is in progress. Try again later.
I also added this property to hdfs-site.xml:
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
So does anyone have an idea what I'm missing or doing wrong?
Thanks.

The problem was solved by running the following command on the file system (it sets the replication factor of all files to 2):
hadoop dfs -setrep -R -w 2 /
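For reference, a minimal self-contained append sketch (not the poster's exact code); the fs.defaultFS address is an assumption, and it assumes append support is enabled on the cluster as shown above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/test_dir_10/append_data_to_this_file.txt");
        // Open the existing file in append mode, write, and close the stream explicitly.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeUTF("Append demo...");
            out.hflush(); // make the bytes visible to readers before closing
        }
        fs.close();
    }
}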

Related

Yarn Distributed cache, no mapper/reducer

I am unable to access files in the distributed cache in Hadoop 2.6. Below is a code snippet. I am attempting to place a file, pattern.properties (passed in args[0]), in the distributed cache of YARN:
Configuration conf1 = new Configuration();
Job job = Job.getInstance(conf1);
DistributedCache.addCacheFile(new URI(args[0]), conf1);
Also, I am trying to access the file in the cache using the code below:
Context context = null;
URI[] cacheFiles = context.getCacheFiles(); //Error at this line
System.out.println(cacheFiles);
But I am getting the below error at the line mentioned above:
java.lang.NullPointerException
I am not using a Mapper class. It's just Spark Streaming code that needs to access a file on the cluster. I want the file to be distributed across the cluster, but I can't take it from HDFS.
I don't know whether I understood your question correctly.
We had some local files which we needed to access in Spark Streaming jobs.
We used this option:
time spark-submit --files
/user/dirLoc/log4j.properties#log4j.properties 'rest other options'
Another way we tried was SparkContext.addFile().
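For what it's worth, a minimal sketch of the SparkContext.addFile() route in Java; the local path is an assumption, and the shipped copy is later resolved through SparkFiles.get():
import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaSparkContext;

public class AddFileExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("AddFileExample"));
        // Ship a local properties file to every executor (path is an assumption).
        sc.addFile("/local/path/pattern.properties");
        // On the driver or inside any task, resolve the shipped copy by file name.
        String localPath = SparkFiles.get("pattern.properties");
        System.out.println("File available at: " + localPath);
        sc.stop();
    }
}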

File Not Found - Spark standalone cluster

I have two machines named: ubuntu1 and ubuntu2.
On ubuntu1 I started the master node of the Spark standalone cluster, and on ubuntu2 I started a worker (slave).
I am trying to execute the wordCount example available on GitHub.
When I submit the application, the worker sends an error message:
java.io.FileNotFoundException: File file:/home/ubuntu1/demo/test.txt does not exist.
My command line is:
./spark-submit --master spark://ubuntu1-VirtualBox:7077 --deploy-mode cluster --class br.com.wordCount.App -v --name "Word Count" /home/ubuntu1/demo/wordCount.jar /home/ubuntu1/demo/test.txt
Does the file test.txt have to be on only one machine?
Note: The master and the worker are on different machines.
Thank you
I got the same problem while loading a JSON file. I realized that by default Windows stores the file as a plain text file regardless of the name; identify the actual file format and then you can load it easily.
For example: you think you saved the file as test.JSON, but by default Windows adds .txt to it.
Check that and try to run it again.
I hope this idea resolves your problem.
Thank you.
You should put your file on HDFS by going to the folder and typing:
hdfs dfs -put <file>
Otherwise, each node has to have access to it, which means the same folder path must exist on each machine.
Don't forget to change file:/ to hdfs:/ after you do that.
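As a rough sketch of the hdfs:/ variant (host, port, and target path are assumptions): after uploading the file with hdfs dfs -put, the job can read it through its HDFS URI from any worker:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class WordCountInput {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("WordCount"));
        // An hdfs:// URI is reachable from every worker, unlike a file:/ path
        // that only exists on the submitting machine. Host, port, and path are assumptions.
        JavaRDD<String> lines = sc.textFile("hdfs://ubuntu1-VirtualBox:9000/demo/test.txt");
        System.out.println("Line count: " + lines.count());
        sc.stop();
    }
}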

Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs

So I've installed the Hadoop file system on my machine, and I'm using a Maven dependency (spark-mllib_2.10) to provide my code with a Spark environment.
My code uses Spark MLlib and accesses data from the Hadoop file system with this code:
String finalData = ProjectProperties.hadoopBasePath + ProjectProperties.finalDataPath;
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), finalData).toJavaRDD();
With the following properties set:
finalDataPath = /data/finalInput.txt
hadoopBasePath = hdfs://127.0.0.1:54310
I start the DFS nodes externally through the command:
start-dfs.sh
My code works perfectly fine when run from Eclipse, but if I export the whole project to an executable jar, it gives me the following exception:
Exception in thread "main" java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2421)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
I also checked different solutions given online for this issue, where people suggest adding the following:
hadoopConfig.set("fs.hdfs.impl",
    org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
hadoopConfig.set("fs.file.impl",
    org.apache.hadoop.fs.LocalFileSystem.class.getName());
OR
<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.LocalFileSystem</value>
  <description>The FileSystem for file: uris.</description>
</property>
<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>
But I don't use any Hadoop context or Hadoop configuration in my project; I simply load the data from Hadoop using the URL.
Can someone give an answer relevant to this issue?
Please note that this works perfectly fine from Eclipse and only fails when I export the same project as an executable jar.
Update
As suggested in the comments and in solutions found online, I tried two things:
Added dependencies into my pom.xml for hadoop-core, hadoop-hdfs and hadoop-client libraries.
Added the above properties configuration to Hadoop's core-site.xml, as suggested here: http://grokbase.com/t/cloudera/scm-users/1288xszz7r/no-filesystem-for-scheme-hdfs
But still no luck resolving the error. It gives the same issue locally on my machine as well as on one of the remote machines I tried it on.
I also installed Hadoop on the remote machine the same way I did on my own machine, using the link mentioned above.
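One common cause when packaging a single executable jar is that the META-INF/services/org.apache.hadoop.fs.FileSystem entries from hadoop-common and hadoop-hdfs overwrite each other during merging. A hedged sketch of the programmatic workaround from the snippets above, with the input path assumed to match the question's hadoopBasePath + finalDataPath:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSchemeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the FileSystem implementations explicitly so the fat jar
        // does not rely on the merged META-INF/services entries.
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
        conf.set("fs.defaultFS", "hdfs://127.0.0.1:54310");
        FileSystem fs = FileSystem.get(conf);
        // Assumed to match the question's hadoopBasePath + finalDataPath.
        System.out.println("Exists: " + fs.exists(new Path("/data/finalInput.txt")));
        fs.close();
    }
}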

Writing to file from jar run from Oozie shell

I have a jar file that needs to be run before our MapReduce process; it preprocesses the data that will later be fed into the MapReduce job. The jar file works fine without Oozie, but I would like to automate the workflow.
When run, the jar should accept two inputs: <input_file> and <output_dir>.
It is expected to output two files, <output_file_1> and <output_file_2>, under the specified <output_dir>.
This is the workflow:
<workflow-app name="RI" xmlns="uri:oozie:workflow:0.4">
  <start to="RI"/>
  <action name="RI">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>java </exec>
      <argument>-jar</argument>
      <argument>RI-Sequencer.jar </argument>
      <argument>log.csv</argument>
      <argument>/tmp</argument>
      <file>/user/root/algo/RI-Sequencer.jar#RI-Sequencer.jar</file>
      <file>/user/root/algo/log.csv#log.csv</file>
      <capture-output/>
    </shell>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <kill name="kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
I run the task using Hue, and currently I can't get the output of the process written to files. It runs fine, but the supposed output files are nowhere to be found.
I have also changed the output directory to be in HDFS, but with the same result: no files are generated.
If it helps, this is a sample of the code from my jar file:
File fileErr = new File(targetPath + "\\input_RI_err.txt");
fileErr.createNewFile();
textFileErr = new BufferedWriter(new FileWriter(fileErr));
//
// fill in the buffer with the result
//
textFileErr.close();
UPDATE:
If it helps, I can upload the jar file for testing.
UPDATE 2:
I've changed the jar to write to HDFS. It still doesn't work when using Oozie to execute the job; running the job independently works.
It seems like you are creating a regular output file (on the local filesystem, not HDFS). As the job is going to run on one of the nodes of the cluster, the output is going to end up in the local /tmp of whichever machine is picked.
I do not understand why you want to preprocess the data before MapReduce; I don't think it is very efficient. But as Roamin said, you are saving your output file to the local filesystem (the file should be in your user's home folder, ~/). If you want to save your data to HDFS directly from Java (without using the MapReduce library), look here: How to write a file in HDFS using hadoop or Write a file in hdfs with java.
Alternatively, you can generate your file in a local directory and then load it into HDFS with this command:
hdfs dfs -put <localsrc> ... <dst>
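A hedged sketch of writing the output straight to HDFS from Java, roughly what the linked answers describe; the output path below is an assumption:
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // On a cluster node, Configuration picks up core-site.xml/hdfs-site.xml
        // from the classpath; the output path below is an assumption.
        FileSystem fs = FileSystem.get(new Configuration());
        Path out = new Path("/tmp/input_RI_err.txt");
        try (BufferedWriter writer =
                 new BufferedWriter(new OutputStreamWriter(fs.create(out, true)))) {
            writer.write("error records go here");
            writer.newLine();
        }
        fs.close();
    }
}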

InvalidInputException When loading file into Hbase MapReduce

I am very new to Hadoop and MapReduce. To get started, I executed the word count program, and it ran well. But when I try to load a CSV file into an HTable, following [Csv File][1],
it throws the following error, which I am not familiar with. Please, can anyone help me understand it?
12/09/07 05:47:31 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://HadoopMaster:54310/user/hduser/csvtable
[1]: http://salsahpc.indiana.edu/ScienceCloud/hbase_hands_on_1.htm#shell_exercises
This error is really killing my time. Please, can anyone help me with this exception?
The reason your job is pointing to the path hdfs://HadoopMaster:54310/user/hduser/csvtable instead of csvtable is the following:
1) Add your HBase jars to the Hadoop classpath, because MapReduce does not pick up the HBase jars by default.
2) Go to hadoop-env.sh, edit HADOOP_CLASSPATH, and add all your HBase jars to it. I hope it works now.
Your job is attempting to read an input file from:
hdfs://HadoopMaster:54310/user/hduser/csvtable
You should verify that this file exists on HDFS using the Hadoop shell tools:
hadoop fs -ls /user/hduser/csvtable
My guess is that your file hasn't been loaded onto HDFS.
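For completeness, a small hedged sketch of the same check done from Java, with an optional upload when the path is missing; the local source path is an assumption:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CsvTableCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://HadoopMaster:54310");
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/user/hduser/csvtable");
        if (!fs.exists(input)) {
            // Local source path is a placeholder for wherever the CSV actually lives.
            fs.copyFromLocalFile(new Path("/home/hduser/csvtable"), input);
        }
        System.out.println("Input present: " + fs.exists(input));
        fs.close();
    }
}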
