Copy JSON file from local to HDFS - Java

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class HdfsWriter extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        //String localInputPath = args[0];
        Path outputPath = new Path(args[0]); // argument for the HDFS output location
        Configuration conf = getConf();
        FileSystem fs = FileSystem.get(conf);
        OutputStream os = fs.create(outputPath);
        // The local data set is read through a buffered input stream.
        InputStream is = new BufferedInputStream(new FileInputStream("/home/acadgild/acadgild.txt"));
        // Copy the data set from the input stream to the output stream on HDFS.
        IOUtils.copyBytes(is, os, conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int returnCode = ToolRunner.run(new HdfsWriter(), args);
        System.exit(returnCode);
    }
}
I need to move the data from the local file system to HDFS.
I got the above code from another blog and it's not working. Can anyone help me with this?
I also need to parse the JSON using MapReduce, group it by DateTime, and write the result to HDFS.

MapReduce is a distributed job-processing framework, so for each mapper "local" means the local filesystem of the node that mapper is running on.
What you want is to read from the local filesystem of a given node, put the data onto HDFS, and then process it via MapReduce.
There are multiple tools available for copying from the local filesystem of one node to HDFS, for example (a programmatic sketch follows this list):
hdfs dfs -put localPath hdfsPath (shell command)
Flume
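If you want to do the copy programmatically instead, a minimal sketch using the Hadoop FileSystem API is shown below; the local and HDFS paths are placeholders, and it assumes fs.defaultFS is set in the core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalToHdfsCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Resolves the cluster from fs.defaultFS in core-site.xml.
        FileSystem fs = FileSystem.get(conf);
        Path localFile = new Path("/tmp/acadgild.json");    // placeholder local path
        Path hdfsDir = new Path("/user/acadgild/input/");   // placeholder HDFS path
        // copyFromLocalFile(delSrc, overwrite, src, dst)
        fs.copyFromLocalFile(false, true, localFile, hdfsDir);
        fs.close();
    }
}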

Related

Writing to Google Persistent Disk

I've written some Java code that creates a CSV file at the mount point of a disk I have attached to a Google Compute Engine instance. I run the code, in the form of a SQL stored procedure, from the instance that the disk is attached to. The issue is that at the mount point, a "lost+found" folder is created where I would expect to find my CSV file. What am I doing wrong? Thank you for your time! The code is similar to the following:
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.FileOutputStream;

public class file_write {
    public static void main(String[] args) throws IOException {
        String filePath = "/mnt/point/file.csv";
        // Creates file in mount point
        File myFile = new File(filePath);
        myFile.getParentFile().mkdirs();
        myFile.createNewFile();
        FileWriter stringWriter = new FileWriter(myFile);
        for (int i = 0; i < 100000; i++) {
            stringWriter.write(i + ", ");
            stringWriter.write("something");
            stringWriter.write(System.lineSeparator());
        }
        stringWriter.close();
    }
}

Typo in word "hdfs" gives me: "java.io.IOException: No FileSystem for scheme: hdfs". Using FileSystem lib over hadoop 2.7.7

While using FileSystem.get(URI.create("hdfs://localhost:9000/"), configuration) I get the warning "Typo in word hdfs", and when I try to run the code it gives me this IOException:
java.io.IOException: No FileSystem for scheme: hdfs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2658)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2665)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:93)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2701)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2683)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:372)
at com.oracle.hadoop.client.Test.main(Test.java:53)
I have already tried calling HDFS in different ways; I'm using the Hadoop 2.7.7 libraries.
Here is my current code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.log4j.BasicConfigurator;

import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

public class Test {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        String uri = "hdfs://localhost:9000/"; // HDFS path of the file to read
        InputStream in = null;
        try {
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Actually, I just added this Maven dependency to my pom.xml and the problem was solved: http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs/2.7.7
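If the dependency is in place but the error persists (typically when building an uber/shaded jar that loses the META-INF/services filesystem registrations), a commonly used workaround, not part of the original answer, is to register the implementations explicitly on the Configuration. A minimal sketch, with the localhost:9000 URI carried over from the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import java.net.URI;

public class HdfsSchemeCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the scheme lookups at the concrete FileSystem classes so
        // "No FileSystem for scheme: hdfs" cannot occur.
        conf.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
        conf.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000/"), conf);
        System.out.println("Connected to: " + fs.getUri());
    }
}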

Copy JSON flat file from local to HDFS

package com.Main;

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class Main {
    public static void main(String[] args) throws IOException {
        // Source file in the local file system
        String localSrc = args[0];
        // Destination file in HDFS
        String dst = args[1];
        // Input stream for the file in the local file system to be written to HDFS
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        // Get configuration of the Hadoop system
        Configuration conf = new Configuration();
        System.out.println("Connecting to -- " + conf.get("fs.defaultFS"));
        // Destination file in HDFS
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst));
        // Copy file from local to HDFS
        IOUtils.copyBytes(in, out, 4096, true);
        System.out.println(dst + " copied to HDFS");
    }
}
I am getting the following error message:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 0
    at com.Main.Main.main(Main.java:22)
I have a JSON file on my local file system and have to move it to HDFS.
Ex:
{"Del":"Ef77xvP","time":1509073785106},
{"Del":"2YXsF7r","time":1509073795109}
Specify command-line arguments to your program; see the sketch below. Your code snippet expects the first argument to be the source path and the second argument to be the destination path.
For more details refer to What is "String args[]"? parameter in main method Java
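A minimal sketch of that guard (the usage message and example paths are placeholders, not from the question):

public class Main {
    public static void main(String[] args) throws java.io.IOException {
        // Fail fast with a usage message instead of an ArrayIndexOutOfBoundsException.
        if (args.length < 2) {
            System.err.println("Usage: Main <local source path> <hdfs destination path>");
            System.exit(1);
        }
        String localSrc = args[0];  // e.g. /home/user/data.json
        String dst = args[1];       // e.g. hdfs://localhost:9000/user/data.json
        // ... the copy logic from the snippet above ...
    }
}

Run it with both paths on the command line, for example java com.Main.Main /home/user/data.json hdfs://localhost:9000/user/data.json, and the exception goes away.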

Can a Jar File be updated programmatically without rewriting the whole file?

It is possible to update individual files in a JAR file using the jar command as follows:
jar uf TicTacToe.jar images/new.gif
Is there a way to do this programmatically?
I have to rewrite the entire jar file if I use JarOutputStream, so I was wondering if there was a similar "random access" way to do this. Given that it can be done using the jar tool, I had expected there to be a similar way to do it programmatically.
It is possible to update just parts of the JAR file using the Zip File System Provider available since Java 7:
import java.net.URI;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.Map;

public class ZipFSPUser {
    public static void main(String[] args) throws Throwable {
        Map<String, String> env = new HashMap<>();
        env.put("create", "true");
        // Locate the file system by using the syntax
        // defined in java.net.JarURLConnection
        URI uri = URI.create("jar:file:/codeSamples/zipfs/zipfstest.zip");
        try (FileSystem zipfs = FileSystems.newFileSystem(uri, env)) {
            Path externalTxtFile = Paths.get("/codeSamples/zipfs/SomeTextFile.txt");
            Path pathInZipfile = zipfs.getPath("/SomeTextFile.txt");
            // Copy a file into the zip file
            Files.copy(externalTxtFile, pathInZipfile,
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
Yes, if you use this open-source library you can modify it in this way as well:
https://truevfs.java.net
// Requires the TrueVFS access API on the classpath for TFile and TFileWriter,
// plus java.io.File, java.io.Writer, and java.io.IOException.
public static void main(String args[]) throws IOException {
    File entry = new TFile("c:/tru6413/server/lib/nxps.jar/dir/second.txt");
    Writer writer = new TFileWriter(entry);
    try {
        writer.write(" this is writing into a file inside an archive");
    } finally {
        writer.close();
    }
}

How to process a range of HBase rows using Spark?

I am trying to use HBase as a data source for Spark, so the first step turns out to be creating an RDD from an HBase table. Since Spark works with Hadoop input formats, I could find a way to use all rows by creating an RDD as described at http://www.vidyasource.com/blog/Programming/Scala/Java/Data/Hadoop/Analytics/2014/01/25/lighting-a-spark-with-hbase. But how do we create an RDD for a range scan?
All suggestions are welcome.
Here is an example of using Scan in Spark:
import java.io.{DataOutputStream, ByteArrayOutputStream}
import java.lang.String
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Base64
def convertScanToString(scan: Scan): String = {
  val out: ByteArrayOutputStream = new ByteArrayOutputStream
  val dos: DataOutputStream = new DataOutputStream(out)
  scan.write(dos)
  Base64.encodeBytes(out.toByteArray)
}
val conf = HBaseConfiguration.create()
val scan = new Scan()
scan.setCaching(500)
scan.setCacheBlocks(false)
conf.set(TableInputFormat.INPUT_TABLE, "table_name")
conf.set(TableInputFormat.SCAN, convertScanToString(scan))
val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
rdd.count
You need to add the related libraries to the Spark classpath and make sure they are compatible with your Spark version. Tip: you can run hbase classpath to find them.
You can set the conf below:
val conf = HBaseConfiguration.create() // need to set all params for HBase
conf.set(TableInputFormat.SCAN_ROW_START, "row2")
conf.set(TableInputFormat.SCAN_ROW_STOP, "stoprowkey")
This will load the RDD only for those records.
Here is a Java example with TableMapReduceUtil.convertScanToString(Scan scan):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.io.IOException;

public class HbaseScan {
    public static void main(String... args) throws IOException, InterruptedException {
        // Spark conf
        SparkConf sparkConf = new SparkConf().setMaster("local[4]").setAppName("My App");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        // HBase conf
        Configuration conf = HBaseConfiguration.create();
        conf.set(TableInputFormat.INPUT_TABLE, "big_table_name");

        // Create scan with a start/stop row key range
        Scan scan = new Scan();
        scan.setCaching(500);
        scan.setCacheBlocks(false);
        scan.setStartRow(Bytes.toBytes("a"));
        scan.setStopRow(Bytes.toBytes("d"));

        // Submit scan into HBase conf
        conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan));

        // Get RDD
        JavaPairRDD<ImmutableBytesWritable, Result> source = jsc
                .newAPIHadoopRDD(conf, TableInputFormat.class,
                        ImmutableBytesWritable.class, Result.class);

        // Process RDD
        System.out.println(source.count());
    }
}
