How to read files with an offset from Hadoop using Java

Problem: I want to read a section of a file from HDFS and return it, such as lines 101-120 from a file of 1000 lines.
I don't want to use seek because I have read that it is expensive.
I have log files which I am processing with Pig into meaningful sets of data. I've been writing an API to return the data for consumption and display by a front end. Those processed data sets can be large enough that I don't want to read the entire file out of Hadoop in one slurp, to save wire time and bandwidth (let's say 5-10 MB).
Currently I am using a BufferedReader to return small summary files, which works fine:
ArrayList<String[]> lines = new ArrayList<String[]>();
...
for (FileStatus item : items) {
    // ignore bookkeeping files like _SUCCESS
    if (item.getPath().getName().startsWith("_")) {
        continue;
    }
    in = fs.open(item.getPath());
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String line = br.readLine();
    while (line != null) {
        line = line.replaceAll("(\\r|\\n)", "");
        lines.add(line.split("\t"));
        line = br.readLine();
    }
    br.close();
}
I've poked around the interwebs quite a bit as well as Stack but haven't found exactly what I need.
Perhaps this is completely the wrong way to go about doing it and I need a completely separate set of code and different functions to manage this. Open to any suggestions.
Thanks!
Some added notes, based on research from the discussions below:
How does Hadoop process records split across block boundaries?
Hadoop FileSplit Reading

I think seek is the best option for reading files with huge volumes. It did not cause any problems for me, as the volume of data I was reading was in the range of 2-3 GB. I have not encountered any issues to date, but we did use file splitting to handle the large data set. Below is the code you can use for reading, so you can test it yourself.
public class HDFSClientTesting {

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            // add the site configuration before obtaining the FileSystem
            conf.addResource(new Path("core-site.xml"));
            FileSystem fs = FileSystem.get(conf);

            String filename = "/dir/00000027";
            long byteOffset = 3185041;

            SequenceFile.Reader rdr = new SequenceFile.Reader(fs, new Path(filename), conf);
            Text key = new Text();
            Text value = new Text();

            // jump straight to the record at the given byte offset
            rdr.seek(byteOffset);
            rdr.next(key, value);
            rdr.close();

            // the value here happens to be JSON; pull out the "body" field
            JSONObject jso = new JSONObject(value.toString());
            String content = jso.getString("body");
            System.out.println("\n\n\n" + content + "\n\n\n");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
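If the files are plain text rather than SequenceFiles, the same idea works with an ordinary stream. Below is a minimal sketch, assuming a plain-text file at a hypothetical path, that returns lines 101-120 without slurping the whole file: it stops reading after line 120, so the rest of a large file never crosses the wire (the bytes before line 101 are still streamed, since line boundaries aren't known in advance).
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLineRange {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        int firstLine = 101, lastLine = 120; // 1-based, inclusive
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/dir/part-00000"))))) {
            String line;
            for (int n = 1; (line = br.readLine()) != null && n <= lastLine; n++) {
                if (n >= firstLine) {
                    System.out.println(line); // or collect into a list to return
                }
            }
        }
    }
}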

Related

Updating a particular column in csv with huge amount of data with java

I have a csv file 'Master List' with 800 K records, each record have 13 values.
combination of cell[0] and cell[1] give a unique record and I need to update value of cell[12] say status for every record.
I have another csv file say 'Updated subset list'. This is sort of subset of file 'Master list'. For all the records in my 2nd csv which are less in number say 10000, I need to update cell[11] aka status column value of each matching record.
I tried direct BufferedReader, CsvParser from commons-csv and CsvParser from univocity.parsers.
But reading whole file and creating List of 800K is giving out of memory exception.
Same code will be deployed on different servers so I want to have a efficient code for reading huge csv file and updating same file.
Partially reading huge file and writing in same file might corrupt the data.
Any suggestions on how can I do this. ??
File inputF = new File(inputFilePath);
if (inputF.exists()) {
    InputStream inputFS = new FileInputStream(inputF);
    BufferedReader br = new BufferedReader(new InputStreamReader(inputFS));
    // skip the header of the file
    String line = br.readLine();
    mandatesList = new ArrayList<DdMandates>();
    while ((line = br.readLine()) != null) {
        mandatesList.add(mapToItem(line));
    }
    br.close();
}
The memory issue was resolved by doing it in chunks. Reading a single line and writing a single line might take more time; I didn't try it, as my issue was resolved by using batches of 100K records at a time and clearing the list after writing out each batch.
Now the issue is that updating the status takes too much looping.
I have two CSVs. The master sheet ('Master list') has 800K records, and I have a subset CSV as well, say with 10K records. This subset CSV is updated from some other system, and it has an updated status, say 'OK' and 'NOT OK'. I need to update this status in the master sheet. How can I do that in the best possible way? The dumbest way, which I am currently using, is the following:
// Master list is processed in batches but contains 800K records and 12 columns
List<DdMandates> mandatesList = new ArrayList<DdMandates>();
// Subset list has the updated status
List<DdMandates> updatedMandatesList = new ArrayList<DdMandates>();
// Read the subset CSV file, map each row to a DdMandates item, and add it to the updated mandate list
File inputF = new File(Property.inputFilePath);
if (inputF.exists()) {
    InputStream inputFS = new FileInputStream(inputF);
    BufferedReader br = new BufferedReader(new InputStreamReader(inputFS, "UTF-8"));
    checkFilterAndmapToItem(br);
    br.close();
}
In the method checkFilterAndmapToItem(BufferedReader br):
private static void checkFilterAndmapToItem(BufferedReader br) throws IOException {
    FileWriter fileWriter = null;
    try {
        // skip the header of the csv
        String line = br.readLine();
        int batchSize = 0, currentBatchNo = 0;
        fileWriter = new FileWriter(Property.outputFilePath);
        // write the CSV file header
        fileWriter.append(FILE_HEADER.toString());
        // add a new line separator after the header
        fileWriter.append(NEW_LINE_SEPARATOR);
        if (!Property.batchSize.isEmpty()) {
            batchSize = Integer.parseInt(Property.batchSize.trim());
        }
        while ((line = br.readLine()) != null) {
            DdMandates item = new DdMandates();
            String[] p = line.concat(" ").split(SEPERATOR);
            // parse each p[x] and map it onto the DdMandates item;
            // then iterate over the updated mandate list to check whether this
            // item is present there, and if so copy the updated status onto it
            // -- that inner for loop runs over ~10K elements for every row
            mandatesList.add(item);
            if (batchSize != 0 && mandatesList.size() == batchSize) {
                currentBatchNo++;
                logger.info("Batch no. : " + currentBatchNo + " is executing...");
                processOutputFile(fileWriter);
                mandatesList.clear();
            }
        }
        // process the output file here for the last batch ...
    } finally {
        if (fileWriter != null) {
            fileWriter.close();
        }
    }
}
The outer while loop runs 800K iterations, with an inner loop of 10K iterations for each element,
so that's at least 800K × 10K iterations in total.
Please help me find the best possible way to do this and reduce the iteration count.
Thanks in advance.
Suppose you are reading 'Main Data File' in batches of 50K:
Store this data in a Java HashMap, using cell[0] + cell[1] as the key and the rest of the columns as the value.
The complexity of get and put is O(1) most of the time (see here).
So the complexity of searching 10K records within that particular batch will be O(10K).
HashMap<String, DdMandates> hmap = new HashMap<String, DdMandates>();
Use key = DdMandates.get(0) + DdMandates.get(1)
Note: If 50K records are exceeding the memory limit of HashMap create smaller batches.
For further performance enhancement you can use multi-threading by creating small batches and processing them on different threads.
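A minimal sketch of that lookup, assuming hypothetical accessors getKey() (returning cell[0] + cell[1]) and getStatus()/setStatus() on DdMandates -- those names are not from the question:
import java.util.HashMap;
import java.util.List;
import java.util.Map;

static void applyUpdates(List<DdMandates> masterBatch, List<DdMandates> updates) {
    // index the 10K-record subset once: key -> record
    Map<String, DdMandates> updatedByKey = new HashMap<>();
    for (DdMandates u : updates) {
        updatedByKey.put(u.getKey(), u);
    }
    // one O(1) lookup per master record instead of a 10K-element scan
    for (DdMandates m : masterBatch) {
        DdMandates u = updatedByKey.get(m.getKey());
        if (u != null) {
            m.setStatus(u.getStatus()); // copy 'OK' / 'NOT OK' across
        }
    }
}
This drops the cost from 800K × 10K comparisons to roughly 800K lookups plus one pass over the subset.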
The first suggestion: when you create an ArrayList, its initial capacity is 10, so if you work with a large amount of data, set the capacity up front:
private static final int LIST_CAPACITY = 800000;
mandatesList = new ArrayList<DdMandates>(LIST_CAPACITY);
The second suggestion: don't store all the data in memory. Read the data line by line, apply your business logic, then free the memory:
FileInputStream inputStream = null;
Scanner sc = null;
try {
    inputStream = new FileInputStream(path);
    sc = new Scanner(inputStream, "UTF-8");
    while (sc.hasNextLine()) {
        String line = sc.nextLine();
        /* your business rule here */
    }
    // note that Scanner suppresses exceptions
    if (sc.ioException() != null) {
        throw sc.ioException();
    }
} finally {
    if (inputStream != null) {
        inputStream.close();
    }
    if (sc != null) {
        sc.close();
    }
}

How to append existing line in text file

How do I append to an existing line in a text file? What if the line to be edited is in the middle of the file? Please offer a suggestion, given the following code.
I have gone through and tried the following:
How to add a new line of text to an existing file in Java?
How to append existing line within a java text file
My code:
filePath = new File("").getAbsolutePath();
BufferedReader reader = new BufferedReader(new FileReader(filePath + "/src/DBTextFiles/Customer.txt"));
try
{
    String line = null;
    while ((line = reader.readLine()) != null)
    {
        if (!(line.startsWith("*")))
        {
            //System.out.println(line);
            //check if target customer exists, via 2 fields - customer name, contact number
            if ((line.equals(customername)) && (reader.readLine().equals(String.valueOf(customermobilenumber))))
            {
                System.out.println("\nWelcome (Existing User) " + line + "!");
                //with the target customer found, alter the total number of bookings at the
                //5th line of 'Customer.txt' by reading lines sequentially
                reader.readLine();
                reader.readLine();
                int total_no_of_bookings = Integer.valueOf(reader.readLine());
                System.out.println(total_no_of_bookings);
                reader.close();
                valid = true;
                //append total number of bookings (5th line) of target customer in 'Customer.txt'
                try {
                    BufferedWriter writer = new BufferedWriter(new FileWriter(new File(filePath + "/src/DBTextFiles/Customer.txt")));
                    writer.write(total_no_of_bookings + 1);
                    //writer.write("\n");
                    writer.close();
                }
                catch (IOException ex)
                {
                    ex.printStackTrace();
                }
            }
        }
    }
}
To be able to append content to an existing file you need to open it in append mode, for example by using FileWriter(String fileName, boolean append) and passing true as the second parameter.
If the line is in the middle, then you need to read the entire file into memory and write it back once all the editing is done.
This might be workable for small files, but if your files are too big, I would suggest writing the actual content plus the edited content into a temp file; when done, delete the old file and rename the temp file to the old file's name.
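For the first point, a minimal append-mode sketch (the file name is a placeholder):
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

public class AppendDemo {
    public static void main(String[] args) throws IOException {
        // true as the second argument opens the file in append mode,
        // so existing content is kept and new writes land at the end
        try (BufferedWriter out = new BufferedWriter(new FileWriter("Customer.txt", true))) {
            out.write("appended line");
            out.newLine();
        }
    }
}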
The reader.readLine() method advances a line each time it is called. I am not sure whether this is intended in your program, but you may want to store the result of reader.readLine() in a String so it is only called once.
To append to a line in the middle of the text file, I believe you will have to re-write the text file up to the point at which you wish to change the line, then write the rest of the file. This could be achieved by storing the whole file in a String array, then writing up to a certain point.
Example of writing:
BufferedWriter writer = new BufferedWriter(new FileWriter(new File(path)));
writer.write(someStuff);
writer.write("\n");
writer.close();
You should probably be following the advice in the answer to the second link you posted. You can access the middle of a file using a random access file, but if you start appending at an arbitrary position in the middle of a file without recording what's there when you start writing, you'll be overwriting its current contents, as noted in this answer. Your best bet, unless the files in question are intractably large, is to assemble a new file using the existing file and your new data, as others have previously suggested.
AFAIK you cannot do that. I mean, appending a line is possible, but not inserting one in the middle. That has nothing to do with Java or any other language: a file is a sequence of written bytes, and if you insert something at an arbitrary point, the rest of the sequence is no longer valid and needs to be re-written.
So basically you have to write a function that does that read-insert-rewrite cycle.
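A minimal sketch of that read-insert-rewrite routine using java.nio, assuming a small file; the file name and line index are placeholders:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class RewriteLine {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("Customer.txt");
        List<String> lines = Files.readAllLines(file); // whole file in memory
        lines.set(4, "updated 5th line");              // edit the line you need
        Files.write(file, lines);                      // rewrite the whole file
    }
}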

Mahout: CSV to vector and running the program

I'm analysing the k-means algorithm with Mahout. I'm going to run some tests, observe performance, and do some statistics with the results I get.
I can't figure out the way to run my own program within Mahout. However, the command-line interface might be enough.
To run the sample program I do
$ mahout seqdirectory --input uscensus --output uscensus-seq
$ mahout seq2sparse -i uscensus-seq -o uscensus-vec
$ mahout kmeans -i reuters-vec/tfidf-vectors -o uscensus-kmeans-clusters -c uscensus-kmeans-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
The dataset is one large CSV file. Each line is a record. Features are comma separated. The first field is an ID.
Because of the input format I cannot use seqdirectory right away.
I'm trying to implement the answer to this similar question, How to perform k-means clustering in mahout with vector data stored as CSV?, but I still have 2 questions:
How do I convert from CSV to a SequenceFile? I guess I can write my own program using Mahout to make this conversion and then use its output as input for seq2sparse. I guess I can use CSVIterator (https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations). What class should I use to read and write?
How do I build and run my new program? I couldn't figure it out with the book Mahout in Action or with other questions here.
For getting your data in SequenceFile format, you have a couple of strategies you can take. Both involve writing your own code -- i.e., not strictly command-line.
Strategy 1
Use Mahout's CSVVectorIterator class. You pass it a java.io.Reader and it will read in your CSV file, turning each row into a DenseVector. I've never used this, but saw it in the API. It looks straightforward enough if you're OK with DenseVectors.
Strategy 2
Write your own parser. This is really easy, since you just split each line on "," and you have an array you can loop through. For each array of values in each line, you instantiate a vector using something like this:
new DenseVector(<your array here>);
and add it to a List (for example).
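For instance, a minimal parser sketch along those lines, assuming each row looks like "id,f1,f2,..." (the method name and file-name argument are placeholders):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;

public static List<NamedVector> readCsv(String fileName) throws IOException {
    List<NamedVector> vectors = new ArrayList<NamedVector>();
    try (BufferedReader br = new BufferedReader(new FileReader(fileName))) {
        String row;
        while ((row = br.readLine()) != null) {
            String[] cells = row.split(",");
            double[] values = new double[cells.length - 1];
            for (int i = 1; i < cells.length; i++) {
                values[i - 1] = Double.parseDouble(cells[i]);
            }
            // first field is the ID, the rest are the features
            vectors.add(new NamedVector(new DenseVector(values), cells[0]));
        }
    }
    return vectors;
}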
Then ... once you have a List of Vectors, you can write them to SequenceFiles using something like this (I'm using NamedVectors in the code below):
FileSystem fs = null;
SequenceFile.Writer writer;
Configuration conf = new Configuration();
List<NamedVector> vectors = <here's your List of vectors obtained from CSVVectorIterator>;

// Write the data to a SequenceFile
try {
    fs = FileSystem.get(conf);
    Path path = new Path(<your path> + <your filename>);
    writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
    VectorWritable vec = new VectorWritable();
    for (NamedVector vector : vectors) {
        vec.set(vector);
        writer.append(new Text(vector.getName()), vec);
    }
    writer.close();
} catch (Exception e) {
    System.out.println("ERROR: " + e);
}
Now you have a directory of "points" in SequenceFile format that you can use for your K-means clustering. You can point the command line Mahout commands at this directory as input.
Anyway, that's the general idea. There are probably other approaches as well.
To run k-means with a CSV file, you first have to create a SequenceFile to pass as an argument to KMeansDriver. The following code reads each line of the CSV file "points.csv", converts it into a vector, and writes it to the SequenceFile "points.seq":
try (
    BufferedReader reader = new BufferedReader(new FileReader("testdata2/points.csv"));
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, new Path("testdata2/points.seq"), LongWritable.class, VectorWritable.class)
) {
    String line;
    long counter = 0;
    while ((line = reader.readLine()) != null) {
        String[] c = line.split(",");
        if (c.length > 1) {
            double[] d = new double[c.length];
            for (int i = 0; i < c.length; i++) {
                d[i] = Double.parseDouble(c[i]);
            }
            Vector vec = new RandomAccessSparseVector(c.length);
            vec.assign(d);
            VectorWritable writable = new VectorWritable();
            writable.set(vec);
            writer.append(new LongWritable(counter++), writable);
        }
    }
    // no explicit close needed: try-with-resources closes reader and writer
}
Hope it helps!!
There were a few issues when I was running the above code, so with a few modifications to the syntax, here is the working code:
String inputfiledata = Input_file_path;
String outputfile = output_path_for_sequence_file;
FileSystem fs = null;
SequenceFile.Writer writer;
Configuration conf = new Configuration();
fs = FileSystem.get(conf);
Path path = new Path(outputfile);
writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();
try {
    FileReader fr = new FileReader(inputfiledata);
    BufferedReader br = new BufferedReader(fr);
    String s = null;
    while ((s = br.readLine()) != null) {
        // my columns are split by tabs, with each entry on a new line as rows
        String[] spl = s.split("\\t");
        String key = spl[0];
        double[] colvalues = new double[spl.length - 1];
        int val = 0;
        for (int k = 1; k < spl.length; k++) {
            colvalues[val] = Double.parseDouble(spl[k]);
            val++;
        }
        NamedVector nmv = new NamedVector(new DenseVector(colvalues), key);
        vec.set(nmv);
        writer.append(new Text(nmv.getName()), vec);
    }
    writer.close();
} catch (Exception e) {
    System.out.println("ERROR: " + e);
}
I would suggest you implement a program to convert the CSV to the sparse-vector sequence file that Mahout accepts.
What you need to do is understand how InputDriver converts text files containing space-delimited floating point numbers into Mahout sequence files of VectorWritable, suitable for input to the clustering jobs in particular and to any Mahout job requiring this input in general. You can customize the code to your needs.
If you have downloaded the source code of Mahout, InputDriver is in the package org.apache.mahout.clustering.conversion.
org.apache.mahout.clustering.conversion.InputDriver is a class that you can use to create sparse vectors.
Sample code is given below
mahout org.apache.mahout.clustering.conversion.InputDriver -i testdata -o output1/data -v org.apache.mahout.math.RandomAccessSparseVector
If you run mahout org.apache.mahout.clustering.conversion.InputDriver without arguments,
it will list the parameters it expects.
Hope this helps.
Also, here is an article I wrote explaining how I ran k-means clustering on an ARFF file:
http://mahout-hadoop.blogspot.com/2013/10/using-mahout-to-cluster-iris-data.html

Java: How do I change a configuration file value in Java easily?

I have a config file named config.txt that looks like this:
IP=192.168.1.145
PORT=10022
URL=http://www.stackoverflow.com
I want to change some values in the config file from Java, say the port to 10045. How can I achieve this easily?
IP=192.168.1.145
PORT=10045
URL=http://www.stackoverflow.com
In my attempt, I need to write lots of code to read every line, find the PORT, delete the original 10022, and then write 10045. My code is clumsy and hard to read. Is there a more convenient way in Java?
Thanks a lot!
If you want something short you can use this:
public static void changeProperty(String filename, String key, String value) throws IOException {
    Properties prop = new Properties();
    prop.load(new FileInputStream(filename));
    prop.setProperty(key, value);
    prop.store(new FileOutputStream(filename), null);
}
Unfortunately it doesn't preserve the order of the fields or any comments.
If you want to preserve order, reading a line at a time isn't so bad.
This untested code would keep comments, blank lines and order. It won't handle multi-line values.
public static void changeProperty(String filename, String key, String value) throws IOException {
    final File tmpFile = new File(filename + ".tmp");
    final File file = new File(filename);
    PrintWriter pw = new PrintWriter(tmpFile);
    BufferedReader br = new BufferedReader(new FileReader(file));
    boolean found = false;
    final String toAdd = key + '=' + value;
    for (String line; (line = br.readLine()) != null; ) {
        if (line.startsWith(key + '=')) {
            line = toAdd;
            found = true;
        }
        pw.println(line);
    }
    if (!found)
        pw.println(toAdd);
    br.close();
    pw.close();
    tmpFile.renameTo(file);
}
My suggestion would be to read the entire config file into memory (maybe into a list of (attribute:value) pair objects), do whatever processing you need to do (and consequently make any changes), then overwrite the original file with all the changes you have made.
For example, you could read the config file you have provided by line, use String.split("=") to separate the attribute:value pairs - making sure to name each pair read accordingly. Then make whatever changes you need, iterate over the pairs you have read in (and possibly modified), writing them back out to the file.
Of course, this approach would work best if you had a relatively small number of lines in your config file, that you can definitely know the format for.
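A minimal sketch of that approach, assuming simple KEY=VALUE lines with no comments; the file name and new port come from the question:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ConfigUpdate {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("config.txt");
        // LinkedHashMap keeps insertion order, so the file order is preserved
        Map<String, String> pairs = new LinkedHashMap<>();
        for (String line : Files.readAllLines(file)) {
            String[] kv = line.split("=", 2); // split on the first '=' only
            pairs.put(kv[0], kv.length > 1 ? kv[1] : "");
        }
        pairs.put("PORT", "10045");
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : pairs.entrySet()) {
            out.add(e.getKey() + "=" + e.getValue());
        }
        Files.write(file, out); // overwrite with the modified pairs
    }
}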
This code works for me:
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Properties;

public void setProperties(String key, String value) throws IOException {
    Properties prop = new Properties();
    FileInputStream ip;
    try {
        ip = new FileInputStream("config.txt");
        prop.load(ip);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    prop.setProperty(key, value);
    PrintWriter pw = new PrintWriter("config.txt");
    prop.store(pw, null);
}
Use the Properties class to load/save the configuration. Then simply set the value and save it again:
Properties p = new Properties();
p.load(...);
p.put("key", "value");
p.store(...);
It's easy and straightforward.
As an aside, if your application is a single application that does not need to scale to run on multiple computers, do not bother using a database to save config; it is utter overkill. However, if your application needs real-time config changes and needs to scale, Redis works pretty well to distribute config and handle the synchronization for you. I have used it for this purpose with great success.
Consider using java.util.Properties and its load() and store() methods.
But remember that this will not preserve comments and extra line breaks in the file.
Also, certain characters need to be escaped.
If you are open to using third-party libraries, explore http://commons.apache.org/configuration/. It supports configurations in multiple formats, and comments will be preserved as well (except for a minor bug: apache-commons-config PropertiesConfiguration: comments after the last property are lost).
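A minimal sketch with Commons Configuration 1.x, assuming the library is on the classpath; unlike plain java.util.Properties, it keeps comments and layout when saving:
import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.configuration.PropertiesConfiguration;

public class CommonsConfigDemo {
    public static void main(String[] args) throws ConfigurationException {
        PropertiesConfiguration config = new PropertiesConfiguration("config.txt");
        config.setProperty("PORT", "10045");
        config.save(); // writes back to config.txt, preserving comments
    }
}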

How to read and update row in file with Java

Currently I am creating a Java app, and no database is required,
which is why I am using a text file instead.
The structure of the file is like this:
unique6id username identitynumber point
unique6id username identitynumber point
May I know how I could read and find the matching unique6id, then update the point on the corresponding row?
Sorry for the lack of information.
Here is the part I have typed:
public class Cust {
    String name;
    long idenid, uniqueid;
    int pts;

    Cust() {}

    Cust(String n, long ide, long uni, int pt) {
        name = n;
        idenid = ide;
        uniqueid = uni;
        pts = pt;
    }
}

FileWriter fstream = new FileWriter("Data.txt", true);
BufferedWriter fbw = new BufferedWriter(fstream);
Cust newCust = new Cust();
newCust.name = memUNTF.getText();
newCust.idenid = Long.parseLong(memICTF.getText());
newCust.uniqueid = Long.parseLong(memIDTF.getText());
newCust.pts = points;
fbw.write(newCust.name + " " + newCust.idenid + " " + newCust.uniqueid + " " + newCust.pts);
fbw.newLine();
fbw.close();
This is the way I write in the data. The result inside Data.txt is:
spencerlim 900419129876 448505 0
Eugene 900419081234 586026 0
When the user types in 586026, it should grab Eugene's row,
bind it into a Cust,
and update the pts (0 in this case; try updating it to another number, e.g. 30).
Thanks for replying =D
Reading is pretty easy, but updating a text file in place (i.e. without rewriting the whole file) is very awkward.
So, you have two options:
Read the whole file, make your changes, and then write the whole file to disk, overwriting the old version; this is quite easy, and will be fast enough for small files, but is not a good idea for very large files.
Use a format that is not a simple text file. A database would be one option (and bear in mind that there is one, Derby, built into the JDK); there are other ways of keeping simple key-value stores on disk (like a HashMap, but in a file), but there's nothing built into the JDK.
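For option 1, a minimal sketch against the question's space-separated format; the target id and new point value come from the question's example:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class UpdatePoints {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("Data.txt");
        String targetId = "586026";
        int newPoints = 30;
        List<String> out = new ArrayList<>();
        for (String row : Files.readAllLines(file)) {
            String[] cols = row.split(" ");
            // columns per the sample rows: name, identitynumber, uniqueid, points
            if (cols.length == 4 && cols[2].equals(targetId)) {
                cols[3] = String.valueOf(newPoints);
                row = String.join(" ", cols);
            }
            out.add(row);
        }
        Files.write(file, out); // rewrite the whole file with the change applied
    }
}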
You can use OpenCSV with custom separators.
Here's a sample method that updates the info for a specified user:
public static void updateUserInfo(
        String userId,   // user id
        String[] values  // new values
) throws IOException {
    String fileName = "yourfile.txt.csv";
    CSVReader reader = new CSVReader(new FileReader(fileName), ' ');
    List<String[]> lines = reader.readAll();
    reader.close();
    Iterator<String[]> iterator = lines.iterator();
    while (iterator.hasNext()) {
        String[] items = (String[]) iterator.next();
        if (items[0].equals(userId)) {
            for (int i = 0; i < values.length; i++) {
                String value = values[i];
                if (value != null) {
                    // for every array value that's not null,
                    // update the corresponding field
                    items[i + 1] = value;
                }
            }
            break;
        }
    }
    CSVWriter csvWriter = new CSVWriter(new FileWriter(fileName), ' ');
    csvWriter.writeAll(lines);
    csvWriter.close();
}
Use InputStream(s) and Reader(s) to read the file.
Here is a code snippet that shows how to read a file:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("c:/myfile.txt")));
String line = null;
while ((line = reader.readLine()) != null) {
    // do something with the line
}
Use OutputStream and Writer(s) to write to the file. Although you can use random-access files, i.e. write to a specific place in the file, I do not recommend doing this. A much easier and more robust way is to create a new file every time you have to write something. I know that this is probably not the most efficient way, but you do not want to use a DB for some reason... If you have to save and update partial information relatively often and perform searches in the file, I'd recommend you use a DB after all. There are very lightweight implementations, including pure Java implementations (e.g. H2: http://www.h2database.com/html/main.html).
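If you do go the embedded-database route, a minimal H2 sketch might look like this; the JDBC URL and table layout are assumptions, not from the question. A single UPDATE replaces the whole read-and-rewrite cycle needed with a flat text file:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class H2Demo {
    public static void main(String[] args) throws SQLException {
        // embedded H2 database stored in ./customers.mv.db
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./customers")) {
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS cust(uniqueid BIGINT PRIMARY KEY, " +
                "name VARCHAR(64), identitynumber BIGINT, points INT)");
            try (PreparedStatement ps =
                    conn.prepareStatement("UPDATE cust SET points = ? WHERE uniqueid = ?")) {
                ps.setInt(1, 30);
                ps.setLong(2, 586026L);
                ps.executeUpdate(); // update in place, no file rewrite
            }
        }
    }
}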
