Mahout: CSV to vector and running the program - java

I'm analysing the k-means algorithm with Mahout. I'm going to run some tests, observe performance, and do some statistics with the results I get.
I can't figure out the way to run my own program within Mahout. However, the command-line interface might be enough.
To run the sample program I do
$ mahout seqdirectory --input uscensus --output uscensus-seq
$ mahout seq2sparse -i uscensus-seq -o uscensus-vec
$ mahout kmeans -i uscensus-vec/tfidf-vectors -o uscensus-kmeans-clusters -c uscensus-kmeans-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
The dataset is one large CSV file. Each line is a record. Features are comma separated. The first field is an ID.
Because of this input format I cannot use seqdirectory directly.
I'm trying to implement the answer to this similar question, How to perform k-means clustering in mahout with vector data stored as CSV?, but I still have two questions:
1. How do I convert from CSV to a SequenceFile? I guess I can write my own program using Mahout to make this conversion and then use its output as input for seq2sparse. I guess I can use CSVIterator (https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations). Which classes should I use to read and write?
2. How do I build and run my new program? I couldn't figure it out with the book Mahout in Action or with the other questions here.

For getting your data in SequenceFile format, you have a couple of strategies you can take. Both involve writing your own code -- i.e., not strictly command-line.
Strategy 1
Use Mahout's CSVVectorIterator class. You pass it a java.io.Reader and it will read your CSV file, turning each row into a DenseVector. I've never used this, but saw it in the API. It looks straightforward enough if you're OK with DenseVectors.
Strategy 2
Write your own parser. This is really easy, since you just split each line on "," and you have an array you can loop through. For each array of values in each line, you instantiate a vector using something like this:
new DenseVector(<your array here>);
and add it to a List (for example).
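A minimal sketch of such a parser (not from the original answer; the file name is a placeholder, and the usual java.io, java.util, and org.apache.mahout.math imports are assumed, as in the other snippets in this thread) might look like this:
List<NamedVector> vectors = new ArrayList<NamedVector>();
BufferedReader reader = new BufferedReader(new FileReader("uscensus.csv"));
String line;
while ((line = reader.readLine()) != null) {
    String[] fields = line.split(",");
    // the first field is the record ID, the rest are numeric features
    double[] values = new double[fields.length - 1];
    for (int i = 1; i < fields.length; i++) {
        values[i - 1] = Double.parseDouble(fields[i]);
    }
    vectors.add(new NamedVector(new DenseVector(values), fields[0]));
}
reader.close();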
Then, once you have a List of Vectors, you can write them to a SequenceFile using something like this (I'm using NamedVectors in the code below):
FileSystem fs = null;
SequenceFile.Writer writer;
Configuration conf = new Configuration();
List<NamedVector> vectors = <here's your List of vectors obtained from CSVVectorIterator>;

// Write the data to a SequenceFile
try {
    fs = FileSystem.get(conf);
    Path path = new Path(<your path> + <your filename>);
    writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
    VectorWritable vec = new VectorWritable();
    for (NamedVector vector : vectors) {
        vec.set(vector);
        writer.append(new Text(vector.getName()), vec);
    }
    writer.close();
} catch (Exception e) {
    System.out.println("ERROR: " + e);
}
Now you have a directory of "points" in SequenceFile format that you can use for your K-means clustering. You can point the command line Mahout commands at this directory as input.
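For example, pointing the kmeans job at that directory could look like this (the paths are placeholders; the flags are the ones already used in the question):
$ mahout kmeans -i <your path> -c <initial centroids path> -o <clusters output path> -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25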
Anyway, that's the general idea. There are probably other approaches as well.

To run k-means with a CSV file, first you have to create a SequenceFile to pass as an argument to KMeansDriver. The following code reads each line of the CSV file "points.csv", converts it into a vector, and writes it to the SequenceFile "points.seq":
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

try (
        BufferedReader reader = new BufferedReader(new FileReader("testdata2/points.csv"));
        SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, new Path("testdata2/points.seq"), LongWritable.class, VectorWritable.class)
) {
    String line;
    long counter = 0;
    while ((line = reader.readLine()) != null) {
        String[] c = line.split(",");
        if (c.length > 1) {
            double[] d = new double[c.length];
            for (int i = 0; i < c.length; i++) {
                d[i] = Double.parseDouble(c[i]);
            }
            Vector vec = new RandomAccessSparseVector(c.length);
            vec.assign(d);
            VectorWritable writable = new VectorWritable();
            writable.set(vec);
            writer.append(new LongWritable(counter++), writable);
        }
    }
}
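If you want to sanity-check the generated points.seq before clustering, Mahout ships a seqdumper utility (depending on your Mahout version the input flag is -i or -s):
$ mahout seqdumper -i testdata2/points.seq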
Hope it helps!!

There were a few issues when I ran the above code, so here it is with a few modifications to the syntax:
String inputfiledata = Input_file_path;
String outputfile = output_path_for_sequence_file;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path path = new Path(outputfile);
SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, path, Text.class, VectorWritable.class);
VectorWritable vec = new VectorWritable();

try {
    BufferedReader br = new BufferedReader(new FileReader(inputfiledata));
    String s;
    while ((s = br.readLine()) != null) {
        // My columns are split by tabs, with each row on a new line
        String[] spl = s.split("\\t");
        String key = spl[0];
        double[] colvalues = new double[spl.length - 1];
        for (int k = 1; k < spl.length; k++) {
            colvalues[k - 1] = Double.parseDouble(spl[k]);
        }
        NamedVector nmv = new NamedVector(new DenseVector(colvalues), key);
        vec.set(nmv);
        writer.append(new Text(nmv.getName()), vec);
    }
    br.close();
    writer.close();
} catch (Exception e) {
    System.out.println("ERROR: " + e);
}

I would suggest you implement a program to convert the CSV into the sparse-vector SequenceFile format that Mahout accepts.
What you need to do is understand how InputDriver converts text files containing space-delimited floating-point numbers into Mahout sequence files of VectorWritable, suitable for input to the clustering jobs in particular and to any Mahout job requiring this input in general. You can then customize the code to your needs.
If you have downloaded the source code of Mahout, the InputDriver is at package org.apache.mahout.clustering.conversion.

org.apache.mahout.clustering.conversion.InputDriver is a class that you can use to create sparse vectors.
A sample invocation is given below:
mahout org.apache.mahout.clustering.conversion.InputDriver -i testdata -o output1/data -v org.apache.mahout.math.RandomAccessSparseVector
If you run mahout org.apache.mahout.clustering.conversion.InputDriver
it will list out the parameters it expects.
Hope this helps.
Also here is an article I wrote to explain how I ran kmeans clustering on an arff file
http://mahout-hadoop.blogspot.com/2013/10/using-mahout-to-cluster-iris-data.html

Related

Load Naïve Bayes model in java code using weka jar

I have used Weka and made a Naive Bayes classifier by using the Weka GUI. Then I saved this model by following this tutorial. Now I want to load this model through Java code, but I am unable to find any way to load a saved model using Weka.
My requirement is that I have to build the model separately and then use it in a separate program.
If anyone can guide me in this regard I will be thankful to you.
You can easily load a saved model in java using this command:
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(pathToModel);
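For instance, a minimal end-to-end sketch of loading the saved model and classifying instances from an ARFF file could look like this (the file paths are placeholders, and it assumes the test data has the same attribute structure the model was trained on):
import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadModelExample {
    public static void main(String[] args) throws Exception {
        // Load the model that was saved from the Weka GUI
        Classifier myCls = (Classifier) SerializationHelper.read("/path/to/naivebayes.model");

        // Load the instances to classify
        Instances data = new DataSource("/path/to/test.arff").getDataSet();
        if (data.classIndex() == -1) {
            data.setClassIndex(data.numAttributes() - 1); // last attribute as the class
        }

        // Classify each instance and print the predicted nominal class value
        for (int i = 0; i < data.numInstances(); i++) {
            double pred = myCls.classifyInstance(data.instance(i));
            System.out.println(data.classAttribute().value((int) pred));
        }
    }
}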
For a complete workflow in Java I wrote the following article in SO Documentation, now copied here:
Text Classification in Weka
Text Classification with LibLinear
Create training instances from .arff file
private static Instances getDataFromFile(String path) throws Exception {
    DataSource source = new DataSource(path);
    Instances data = source.getDataSet();
    if (data.classIndex() == -1) {
        data.setClassIndex(data.numAttributes() - 1); // last attribute as class index
    }
    return data;
}
Instances trainingData = getDataFromFile(pathToArffFile);
Use StringToWordVector to transform your string attributes to number representation:
Important features of this filter:
tf-idf representation
stemming
lowercase words
stopwords
n-gram representation
StringToWordVector filter = new StringToWordVector();
filter.setWordsToKeep(1000000);
if (useIdf) {
    filter.setIDFTransform(true);
}
filter.setTFTransform(true);
filter.setLowerCaseTokens(true);
filter.setOutputWordCounts(true);
filter.setMinTermFreq(minTermFreq);
filter.setNormalizeDocLength(new SelectedTag(StringToWordVector.FILTER_NORMALIZE_ALL, StringToWordVector.TAGS_FILTER));

NGramTokenizer t = new NGramTokenizer();
t.setNGramMaxSize(maxGrams);
t.setNGramMinSize(minGrams);
filter.setTokenizer(t);

WordsFromFile stopwords = new WordsFromFile();
stopwords.setStopwords(new File("data/stopwords/stopwords.txt"));
filter.setStopwordsHandler(stopwords);

if (useStemmer) {
    Stemmer s = new /*Iterated*/LovinsStemmer();
    filter.setStemmer(s);
}

filter.setInputFormat(trainingData);
Apply the filter to trainingData: trainingData = Filter.useFilter(trainingData, filter);
Create the LibLinear Classifier
SVMType 0 below corresponds to the L2-regularized logistic regression
Set setProbabilityEstimates(true) to print the output probabilities
Classifier cls = null;
LibLINEAR liblinear = new LibLINEAR();
liblinear.setSVMType(new SelectedTag(0, LibLINEAR.TAGS_SVMTYPE));
liblinear.setProbabilityEstimates(true);
// liblinear.setBias(1); // default value
cls = liblinear;
cls.buildClassifier(trainingData);
Save model
System.out.println("Saving the model...");
ObjectOutputStream oos;
oos = new ObjectOutputStream(new FileOutputStream(path+"mymodel.model"));
oos.writeObject(cls);
oos.flush();
oos.close();
Create testing instances from .arff file
Instances testingData = getDataFromFile(pathToArffFile);
Load classifier
Classifier myCls = (Classifier) weka.core.SerializationHelper.read(path+"mymodel.model");
Use the same StringToWordVector filter as above, or create a new one for testingData, but remember to use the trainingData for this command: filter.setInputFormat(trainingData); This will make training and testing instances compatible.
Alternatively you could use InputMappedClassifier
Apply the filter to testingData: testingData = Filter.useFilter(testingData, filter);
Classify!
1. Get the class value for every instance in the testing set
for (int j = 0; j < testingData.numInstances(); j++) {
    double res = myCls.classifyInstance(testingData.get(j));
}
res is a double value that corresponds to the nominal class defined in the .arff file. To get the nominal class label, use: testingData.classAttribute().value((int) res)
2. Get the probability distribution for every instance
for (int j = 0; j < testingData.numInstances(); j++) {
    double[] dist = myCls.distributionForInstance(testingData.get(j));
}
dist is a double array that contains the probabilities for every class defined in the .arff file.
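If you only need the most likely label from that distribution, a small argmax over dist (a sketch reusing the myCls and testingData names from above) does the job:
// pick the index with the highest probability and map it back to the class label
double[] dist = myCls.distributionForInstance(testingData.get(j));
int best = 0;
for (int k = 1; k < dist.length; k++) {
    if (dist[k] > dist[best]) {
        best = k;
    }
}
String predictedLabel = testingData.classAttribute().value(best);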
Note: the classifier should support probability distributions; enable them with myClassifier.setProbabilityEstimates(true);

Reading file containing random numbers, sorting it and than writing to other file

In an interview I was asked the following question,
There is a file named sourceFile.txt containing random numbers, one per line, like below:
608492
213420
23305
255572
64167
144737
81122
374768
535077
866831
496153
497059
931322
The same number can occur more than once. The size of sourceFile.txt is around 65 GB.
I need to read that file and write the numbers into a new file, let's say destinationFile.txt, in sorted order.
I wrote the following code for this,
/*
 * Copy the numbers present in the file into a list,
 * sort it, and then write it into another file.
 */
public static void readFileThanWrite(String sourceFileName, String destinationFileName) throws Exception {
    BufferedReader reader = new BufferedReader(new FileReader(sourceFileName));
    List<Integer> list = new ArrayList<Integer>();
    String line;
    while ((line = reader.readLine()) != null) {
        list.add(Integer.parseInt(line));
    }
    reader.close();

    Collections.sort(list);

    File file = new File(destinationFileName);
    FileWriter fileWriter = new FileWriter(file, true); // 'true' means append to the end of the file
    BufferedWriter buff = new BufferedWriter(fileWriter);
    PrintWriter out = new PrintWriter(buff);
    for (Iterator<Integer> itr = list.iterator(); itr.hasNext();) {
        out.println(itr.next());
    }
    out.close();
    buff.close();
    fileWriter.close();
}
But the interviewer said the above program will fail to load and sort the numbers because the file is too large.
What would be a better solution?
If you know that all the numbers are relatively small, keeping an array of occurrences would do just fine.
If you don't have any information about the input, you're looking for external sorting. Here's a Java project that could help you, and here is the corresponding class.
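For the first approach, a counting-sort-style sketch could look like the following; it assumes the values are non-negative and bounded (say, below 1,000,000, as in the sample), so the count array stays small even though the file itself is 65 GB:
import java.io.*;

public class CountingSortFile {
    public static void main(String[] args) throws IOException {
        final int MAX = 1_000_000;      // assumed upper bound on the values
        long[] counts = new long[MAX];  // long, since a 65 GB file can repeat values billions of times

        // one pass over the input: only count occurrences, never hold all numbers in memory
        try (BufferedReader reader = new BufferedReader(new FileReader("sourceFile.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                counts[Integer.parseInt(line.trim())]++;
            }
        }

        // one pass over the counts: emit each value as many times as it occurred, in order
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("destinationFile.txt")))) {
            for (int value = 0; value < MAX; value++) {
                for (long c = counts[value]; c > 0; c--) {
                    out.println(value);
                }
            }
        }
    }
}
If the values are unbounded, the external merge sort from the linked project is the right tool.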

How to read files with an offset from Hadoop using Java

Problem: I want to read a section of a file from HDFS and return it, such as lines 101-120 from a file of 1000 lines.
I don't want to use seek because I have read that it is expensive.
I have log files which I am using PIG to process down into meaningful sets of data. I've been writing an API to return the data for consumption and display by a front end. Those processed data sets can be large enough that I don't want to read the entire file out of Hadoop in one slurp to save wire time and bandwidth. (Let's say 5 - 10MB)
Currently I am using a BufferedReader to return small summary files which is working fine
ArrayList<String[]> lines = new ArrayList<String[]>();
...
for (FileStatus item : items) {
    // ignoring files like _SUCCESS
    if (item.getPath().getName().startsWith("_")) {
        continue;
    }
    in = fs.open(item.getPath());
    BufferedReader br = new BufferedReader(new InputStreamReader(in));
    String line = br.readLine();
    while (line != null) {
        line = line.replaceAll("(\\r|\\n)", "");
        lines.add(line.split("\t"));
        line = br.readLine();
    }
}
I've poked around the interwebs quite a bit as well as Stack but haven't found exactly what I need.
Perhaps this is completely the wrong way to go about doing it and I need a completely separate set of code and different functions to manage this. Open to any suggestions.
Thanks!
As an added note, based on research from the discussions below:
How does Hadoop process records split across block boundaries?
Hadoop FileSplit Reading
I think seek is the best option for reading files with huge volumes. It did not cause any problems for me, as the volume of data I was reading was in the range of 2-3 GB. I have not encountered any issues to date, but we did use file splitting to handle the large data set. Below is the code you can use for reading and testing.
public class HDFSClientTesting {

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            // System.loadLibrary("libhadoop.so");
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            conf.addResource(new Path("core-site.xml"));

            String Filename = "/dir/00000027";
            long ByteOffset = 3185041;

            SequenceFile.Reader rdr = new SequenceFile.Reader(fs, new Path(Filename), conf);
            Text key = new Text();
            Text value = new Text();

            rdr.seek(ByteOffset);
            rdr.next(key, value);

            // Plain text
            JSONObject jso = new JSONObject(value.toString());
            String content = jso.getString("body");
            System.out.println("\n\n\n" + content + "\n\n\n");

            File file = new File("test.gz");
            file.createNewFile();
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
        finally {
        }
    }
}
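If the files are plain text rather than SequenceFiles, the same idea works with the FSDataInputStream returned by fs.open(): seek to the offset, then read forward line by line. A rough sketch (the path, offset, and line count are made-up values for illustration; java.io and java.util imports assumed):
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);

FSDataInputStream in = fs.open(new Path("/logs/processed/part-r-00000"));
in.seek(3185041L);              // jump to the known byte offset (this may land mid-line)

BufferedReader br = new BufferedReader(new InputStreamReader(in));
br.readLine();                  // discard the partial line we probably landed in

List<String> window = new ArrayList<String>();
for (int i = 0; i < 20; i++) {  // read the next 20 complete lines
    String line = br.readLine();
    if (line == null) {
        break;
    }
    window.add(line);
}
br.close();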

How to chain multiple different InputStreams into one InputStream

I'm wondering if there is any ideomatic way to chain multiple InputStreams into one continual InputStream in Java (or Scala).
What I need it for is to parse flat files that I load over the network from an FTP-Server. What I want to do is to take file[1..N], open up streams and then combine them into one stream. So when file1 comes to an end, I want to start reading from file2 and so on, until I reach the end of fileN.
I need to read these files in a specific order; the data comes from a legacy system that produces files in batches, so data in one file depends on data in another, but I would like to handle them as one continuous stream to simplify my domain logic interface.
I searched around and found PipedInputStream, but I'm not positive that is what I need. An example would be helpful.
It's right there in the JDK! Quoting the JavaDoc of SequenceInputStream:
A SequenceInputStream represents the logical concatenation of other input streams. It starts out with an ordered collection of input streams and reads from the first one until end of file is reached, whereupon it reads from the second one, and so on, until end of file is reached on the last of the contained input streams.
You want to concatenate an arbitrary number of InputStreams, while the basic SequenceInputStream constructor accepts only two. But since a SequenceInputStream is also an InputStream, you can apply it recursively (nest them):
new SequenceInputStream(
new SequenceInputStream(
new SequenceInputStream(file1, file2),
file3
),
file4
);
...you get the idea.
See also
How do you merge two input streams in Java? (dup?)
This is done using SequenceInputStream, which is straightforward in Java, as Tomasz Nurkiewicz's answer shows. I had to do this repeatedly in a project recently, so I added some Scala-y goodness via the "pimp my library" pattern.
object StreamUtils {
  implicit def toRichInputStream(str: InputStream) = new RichInputStream(str)

  class RichInputStream(str: InputStream) {
    // a bunch of other handy Stream functionality, deleted
    def ++(str2: InputStream): InputStream = new SequenceInputStream(str, str2)
  }
}
With that, I can do stream sequencing as follows
val mergedStream = stream1++stream2++stream3
or even
val streamList = //some arbitrary-length list of streams, non-empty
val mergedStream = streamList.reduceLeft(_++_)
Another solution: first create a list of input streams and then create the sequence of input streams:
List<InputStream> iss = Files.list(Paths.get("/your/path"))
    .filter(Files::isRegularFile)
    .map(f -> {
        try {
            return new FileInputStream(f.toString());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }).collect(Collectors.toList());

InputStream is = new SequenceInputStream(Collections.enumeration(iss));
Here is a more elegant solution using Vector; this one is for Android specifically, but you can use Vector in any Java code:
AssetManager am = getAssets();
Vector<InputStream> v = new Vector<InputStream>(Constant.PAGES);
for (int i = 0; i < Constant.PAGES; i++) {
    String fileName = "file" + i + ".txt";
    InputStream is = am.open(fileName);
    v.add(is);
}
Enumeration<InputStream> e = v.elements();
SequenceInputStream sis = new SequenceInputStream(e);
InputStreamReader isr = new InputStreamReader(sis);
Scanner scanner = new Scanner(isr); // or use a BufferedReader
Here's a simple Scala version that concatenates an Iterator[InputStream]:
import java.io.{InputStream, SequenceInputStream}
import scala.collection.JavaConverters._
def concatInputStreams(streams: Iterator[InputStream]): InputStream =
new SequenceInputStream(streams.asJavaEnumeration)

How to read and update row in file with Java

Currently I am creating a Java app and no database is required, which is why I am using a text file instead.
The structure of the file is like this:
unique6id username identitynumber point
unique6id username identitynumber point
May I know how I can read the file, find the matching unique6id, and then update the corresponding row's point value?
Sorry for the lack of information; here is the code I have typed:
public class Cust {
    String name;
    long idenid, uniqueid;
    int pts;

    Cust() {}

    Cust(String n, long ide, long uni, int pt) {
        name = n;
        idenid = ide;
        uniqueid = uni;
        pts = pt;
    }
}

FileWriter fstream = new FileWriter("Data.txt", true);
BufferedWriter fbw = new BufferedWriter(fstream);
Cust newCust = new Cust();
newCust.name = memUNTF.getText();
newCust.idenid = Long.parseLong(memICTF.getText());
newCust.uniqueid = Long.parseLong(memIDTF.getText());
newCust.pts = points;
fbw.write(newCust.name + " " + newCust.idenid + " " + newCust.uniqueid + " " + newCust.pts);
fbw.newLine();
fbw.close();
This is the way I write in the data.
The result inside Data.txt is:
spencerlim 900419129876 448505 0
Eugene 900419081234 586026 0
When the user types in 586026, it should grab Eugene's row, bind it into a Cust, and update the pts (0 in this case; I am trying to update it to another number, e.g. 30).
Thx for reply =D
Reading is pretty easy, but updating a text file in place (i.e. without rewriting the whole file) is very awkward.
So, you have two options:
Read the whole file, make your changes, and then write the whole file to disk, overwriting the old version; this is quite easy (see the sketch after this list) and will be fast enough for small files, but it is not a good idea for very large files.
Use a format that is not a simple text file. A database would be one option (and bear in mind that there is one, Derby, built into the JDK); there are other ways of keeping simple key-value stores on disk (like a HashMap, but in a file), but there's nothing built into the JDK.
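For the first option, a minimal sketch could look like this (the file name and the new point value are placeholders; note that in the question's sample data the 6-digit ID appears in the third column, so adjust the index to your actual layout):
// Read every line, update the point column of the matching row, then rewrite the whole file.
List<String> lines = new ArrayList<String>();
BufferedReader reader = new BufferedReader(new FileReader("Data.txt"));
String line;
while ((line = reader.readLine()) != null) {
    String[] parts = line.split(" ");
    // In the sample data the columns are: username identitynumber unique6id point.
    // If your file really is "unique6id username identitynumber point", match parts[0] instead.
    if (parts.length == 4 && parts[2].equals("586026")) {
        parts[3] = "30"; // the new point value
        line = parts[0] + " " + parts[1] + " " + parts[2] + " " + parts[3];
    }
    lines.add(line);
}
reader.close();

PrintWriter writer = new PrintWriter(new FileWriter("Data.txt")); // overwrite, do not append
for (String l : lines) {
    writer.println(l);
}
writer.close();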
You can use OpenCSV with custom separators.
Here's a sample method that updates the info for a specified user:
public static void updateUserInfo(
        String userId,   // user id
        String[] values  // new values
) throws IOException {
    String fileName = "yourfile.txt.csv";

    CSVReader reader = new CSVReader(new FileReader(fileName), ' ');
    List<String[]> lines = reader.readAll();
    reader.close();

    Iterator<String[]> iterator = lines.iterator();
    while (iterator.hasNext()) {
        String[] items = iterator.next();
        if (items[0].equals(userId)) {
            for (int i = 0; i < values.length; i++) {
                String value = values[i];
                if (value != null) {
                    // for every array value that's not null,
                    // update the corresponding field
                    items[i + 1] = value;
                }
            }
            break;
        }
    }

    CSVWriter writer = new CSVWriter(new FileWriter(fileName), ' ');
    writer.writeAll(lines);
    writer.close();
}
Use InputStream(s) and Reader(s) to read the file.
Here is a code snippet that shows how to read a file:
BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("c:/myfile.txt")));
String line = null;
while ((line = reader.readLine()) != null) {
    // do something with the line
}
Use OutputStream(s) and Writer(s) to write to the file. Although you can use random access files, i.e. write to a specific place in the file, I do not recommend doing this. A much easier and more robust way is to create a new file every time you have to write something. I know this is probably not the most efficient way, but you do not want to use a DB for some reason... If you have to save and update partial information relatively often and perform searches within the file, I'd recommend you use a DB after all. There are very lightweight implementations, including pure Java ones (e.g. H2: http://www.h2database.com/html/main.html).
