Convert txt file to arff file via TextDirectoryToArff.java

I am trying to convert a txt file to arff via TextDirectoryToArff.java. I am using Eclipse on Windows, and the usage says to run TextDirectoryToArff <directory path>, but I am not sure what that means.
Could somebody help me with this program?
TextDirectoryToArff.java:
import java.io.*;
import weka.core.*;
/**
* Builds an arff dataset from the documents in a given directory.
* Assumes that the file names for the documents end with ".txt".
*
* Usage:<p>
*
* TextDirectoryToArff <directory path><p>
*
* @author Richard Kirkby (rkirkby at cs.waikato.ac.nz)
* @version 1.0
*/
public class TextDirectoryToArff {
public Instances createDataset(String directoryPath) throws Exception {
FastVector atts = new FastVector(2);
atts.addElement(new Attribute("filename", (FastVector) null));
atts.addElement(new Attribute("contents", (FastVector) null));
Instances data = new Instances("text_files_in_" + directoryPath, atts, 0);
File dir = new File(directoryPath);
String[] files = dir.list();
for (int i = 0; i < files.length; i++) {
if (files[i].endsWith(".txt")) {
try {
double[] newInst = new double[2];
newInst[0] = (double)data.attribute(0).addStringValue(files[i]);
File txt = new File(directoryPath + File.separator + files[i]);
InputStreamReader is;
is = new InputStreamReader(new FileInputStream(txt));
StringBuffer txtStr = new StringBuffer();
int c;
while ((c = is.read()) != -1) {
txtStr.append((char)c);
}
newInst[1] = (double)data.attribute(1).addStringValue(txtStr.toString());
data.add(new Instance(1.0, newInst));
} catch (Exception e) {
//System.err.println("failed to convert file: " + directoryPath + File.separator + files[i]);
}
}
}
return data;
}
public static void main(String[] args) {
if (args.length == 1) {
TextDirectoryToArff tdta = new TextDirectoryToArff();
try {
Instances dataset = tdta.createDataset(args[0]);
System.out.println(dataset);
}
catch (Exception e) {
System.err.println(e.getMessage());
e.printStackTrace();
}
}
else {
System.out.println("Usage: java TextDirectoryToArff <directory name>");
}
}
}

Weka is asking you to use Java to execute TextDirectoryToArff, passing the directory as a parameter.
You have two options:
1) Generate a JAR called "TextDirectoryToArff", then use Java through the Windows console and execute java TextDirectoryToArff, passing a directory as a parameter.
2) From Eclipse you can pass your directory directly by modifying the dataset line and then doing Run > Run: change
Instances dataset = tdta.createDataset(args[0]);
To:
Instances dataset = tdta.createDataset("C:\\yourDir");
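If you also want the result saved as an .arff file instead of only printed to the console, here is a minimal sketch (the class name, directory and output path are just examples; it assumes weka.jar and the TextDirectoryToArff class above are on the classpath):
import java.io.PrintWriter;
import weka.core.Instances;
public class ConvertTxtDirToArff {
    public static void main(String[] args) throws Exception {
        TextDirectoryToArff tdta = new TextDirectoryToArff();
        // The directory and output file are examples only; adjust them to your own paths.
        Instances dataset = tdta.createDataset("C:\\yourDir");
        // Instances.toString() produces the ARFF text, so writing it out yields a valid .arff file.
        try (PrintWriter out = new PrintWriter("C:\\yourDir\\output.arff")) {
            out.println(dataset);
        }
    }
}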

Related

To print one confusion matrix instead of multiple matrices from each mapper

I am trying to print a confusion matrix for the Weka J48 algorithm, and I am getting multiple matrices as output.
This is the class that runs the whole program. It is responsible for getting input from the user, setting up the mapper and reducer, organizing the weka input, etc.
public class WekDoop {
/**
* The main method of this program.
* Precondition: arff file is uploaded into HDFS and the correct
* number of parameters were passed into the JAR file when it was run
*
* @param args
* @throws Exception
*/
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
// Make sure we have the correct number of arguments passed into the program
if (args.length != 4) {
System.err.println("Usage: WekDoop <# of splits> <classifier> <input file> <output file>");
System.exit(1);
}
// configure the job using the command line args
conf.setInt("Run-num.splits", Integer.parseInt(args[0]));
conf.setStrings("Run.classify", args[1]);
conf.set("io.serializations", "org.apache.hadoop.io.serializer.JavaSerialization," + "org.apache.hadoop.io.serializer.WritableSerialization");
// Configure the jobs main class, mapper and reducer
// TODO: Make the Job name print the name of the currently running classifier
Job job = new Job(conf, "WekDoop");
job.setJarByClass(WekDoop.class);
job.setMapperClass(WekaMap.class);
job.setReducerClass(WekaReducer.class);
// Start with 1
job.setNumReduceTasks(1);
// This section sets the values of the <K2, V2>
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(weka.classifiers.bayes.NaiveBayes.class);
job.setOutputValueClass(AggregateableEvaluation.class);
// Set the input and output directories based on command line args
FileInputFormat.addInputPath(job, new Path(args[2]));
FileOutputFormat.setOutputPath(job, new Path(args[3]));
// Set the input type of the environment
// (In this case we are overriding TextInputFormat)
job.setInputFormatClass(WekaInputFormat.class);
// wait until the job is complete to exit
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Mapper Class
This class is a mapper for the Weka classifiers. It is given a chunk of data and sets up a classifier to run on that data. There is a lot of other handling that occurs in the method as well.
public class WekaMap extends Mapper<Object, Text, Text, AggregateableEvaluation> {
private Instances randData = null;
private Classifier cls = null;
private AggregateableEvaluation eval = null;
private Classifier clsCopy = null;
// Run 10 mappers
private String numMaps = "10";
// TODO: Make sure this is not hard-coded -- preferably a command line arg
// Set the classifier
private String classname = "weka.classifiers.bayes.NaiveBayes";
private int seed = 20;
public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
System.out.println("CURRENT LINE: " + line);
//line = "/home/ubuntu/Workspace/hadoop-1.1.0/hadoop-data/spambase_processed.arff";
Configuration conf = new Configuration();
FileSystem fileSystem = FileSystem.get(conf);
Path path = new Path("/home/hduser/very_small_spam.arff");
// Make sure the file exists...
if (!fileSystem.exists(path)) {
System.out.println("File does not exists");
return;
}
JobID test = context.getJobID();
TaskAttemptID tid = context.getTaskAttemptID();
// Set up the weka configuration
Configuration wekaConfig = context.getConfiguration();
numMaps = wekaConfig.get("Run-num.splits");
classname = wekaConfig.get("Run.classify");
String[] splitter = tid.toString().split("_");
String jobNumber = "";
int n = 0;
if (splitter[4].length() > 0) {
jobNumber = splitter[4].substring(splitter[4].length() - 1);
n = Integer.parseInt(jobNumber);
}
FileSystem fs = FileSystem.get(context.getConfiguration());
System.out.println("PATH: " + path);
// Read in the data set
context.setStatus("Reading in the arff file...");
readArff(fs, path.toString());
context.setStatus("Done reading arff! Initializing aggregateable eval...");
try {
eval = new AggregateableEvaluation(randData);
}
catch (Exception e1) {
e1.printStackTrace();
}
// Split the data into two sets: Training set and a testing set
// this will allow us to use a little bit of data to train the classifier
// before running the classifier on the rest of the dataset
Instances trainInstance = randData.trainCV(Integer.parseInt(numMaps), n);
Instances testInstance = randData.testCV(Integer.parseInt(numMaps), n);
// Set parameters to be passed to the classifiers
String[] opts = new String[3];
if (classname.equals("weka.classifiers.lazy.IBk")) {
opts[0] = "";
opts[1] = "-K";
opts[2] = "1";
}
else if (classname.equals("weka.classifiers.trees.J48")) {
opts[0] = "";
opts[1] = "-C";
opts[2] = "0.25";
}
else if (classname.equals("weka.classifiers.bayes.NaiveBayes")) {
opts[0] = "";
opts[1] = "";
opts[2] = "";
}
else {
opts[0] = "";
opts[1] = "";
opts[2] = "";
}
// Start setting up the classifier and its various options
try {
cls = (Classifier) Utils.forName(Classifier.class, classname, opts);
}
catch (Exception e) {
e.printStackTrace();
}
// These are all used for timing different processes
long beforeAbstract = 0;
long beforeBuildClass = 0;
long afterBuildClass = 0;
long beforeEvalClass = 0;
long afterEvalClass = 0;
try {
// Create the classifier and record how long it takes to set up
context.setStatus("Creating the classifier...");
System.out.println(new Timestamp(System.currentTimeMillis()));
beforeAbstract = System.currentTimeMillis();
clsCopy = AbstractClassifier.makeCopy(cls);
beforeBuildClass = System.currentTimeMillis();
System.out.println(new Timestamp(System.currentTimeMillis()));
// Train the classifier on the training set and record how long this takes
context.setStatus("Training the classifier...");
clsCopy.buildClassifier(trainInstance);
afterBuildClass = System.currentTimeMillis();
System.out.println(new Timestamp(System.currentTimeMillis()));
beforeEvalClass = System.currentTimeMillis();
// Run the classifer on the rest of the data set and record its duration as well
context.setStatus("Evaluating the model...");
eval.evaluateModel(clsCopy, testInstance);
afterEvalClass = System.currentTimeMillis();
System.out.println(new Timestamp(System.currentTimeMillis()));
// We are done this iteration!
context.setStatus("Complete");
}
catch (Exception e) {
System.out.println("Debugging strarts here!");
e.printStackTrace();
}
// calculate the total times for each section
long abstractTime = beforeBuildClass - beforeAbstract;
long buildTime = afterBuildClass - beforeBuildClass;
long evalTime = afterEvalClass - beforeEvalClass;
// Print out the times
System.out.println("The value of creation time: " + abstractTime);
System.out.println("The value of Build time: " + buildTime);
System.out.println("The value of Eval time: " + evalTime);
context.write(new Text(line), eval);
}
/**
* This can be used to write out the results on HDFS, but it is not essential
* to the success of this project. If time allows, we can implement it.
*/
public void writeResult() {
}
/**
* This method reads in the arff file that is provided to the program.
* Nothing really special about the way the data is handled.
*
* @param fs
* @param filePath
* @throws IOException
* @throws InterruptedException
*/
public void readArff(FileSystem fs, String filePath) throws IOException, InterruptedException {
BufferedReader reader;
DataInputStream d;
ArffReader arff;
Instance inst;
Instances data;
try {
// Read in the data using a ton of wrappers
d = new DataInputStream(fs.open(new Path(filePath)));
reader = new BufferedReader(new InputStreamReader(d));
arff = new ArffReader(reader, 100000);
data = arff.getStructure();
data.setClassIndex(data.numAttributes() - 1);
// Add each line to the input stream
while ((inst = arff.readInstance(data)) != null) {
data.add(inst);
}
reader.close();
Random rand = new Random(seed);
randData = new Instances(data);
randData.randomize(rand);
// This is how weka handles the sampling of the data
// the stratify method splits up the data to cross validate it
if (randData.classAttribute().isNominal()) {
randData.stratify(Integer.parseInt(numMaps));
}
}
catch (IOException e) {
e.printStackTrace();
}
}
}
Reducer Class
This class is a reducer for the output from the Weka classifiers. It is given a bunch of cross-validated data chunks from the mappers, and its job is to aggregate the data into one solution.
public class WekaReducer extends Reducer<Text, AggregateableEvaluation, Text, IntWritable> {
Text result = new Text();
Evaluation evalAll = null;
IntWritable test = new IntWritable();
AggregateableEvaluation aggEval;
/**
* The reducer method takes all the stratified, cross-validated
* values from the mappers in a list and uses an aggregatable evaluation to consolidate
* them.
*/
public void reduce(Text key, Iterable<AggregateableEvaluation> values, Context context) throws IOException, InterruptedException {
int sum = 0;
// record how long it takes to run the aggregation
System.out.println(new Timestamp(System.currentTimeMillis()));
long beforeReduceTime = System.currentTimeMillis();
// loop through each of the values and "aggregate"
// which basically means to consolidate the values
for (AggregateableEvaluation val : values) {
System.out.println("IN THE REDUCER!");
// The first time through, give aggEval a value
if (sum == 0) {
try {
aggEval = val;
}
catch (Exception e) {
e.printStackTrace();
}
}
else {
// combine the values
aggEval.aggregate(val);
}
try {
// This is what is taken from the mapper to be aggregated
System.out.println("This is the map result");
System.out.println(aggEval.toMatrixString());
}
catch (Exception e) {
e.printStackTrace();
}
sum += 1;
}
// Here is where the typical weka matrix output is generated
try {
System.out.println("This is reduce matrix");
System.out.println(aggEval.toMatrixString());
}
catch (Exception e) {
e.printStackTrace();
}
// calculate the duration of the aggregation
context.write(key, new IntWritable(sum));
long afterReduceTime = System.currentTimeMillis();
long reduceTime = afterReduceTime - beforeReduceTime;
// display the output
System.out.println("The value of reduce time is: " + reduceTime);
System.out.println(new Timestamp(System.currentTimeMillis()));
}
}
And lastly, the InputFormat class.
Takes a JobContext and returns a list of data split into pieces. Basically this is a way of handling large data sets: it allows us to split a large data set into smaller chunks to pass across worker nodes (or, in our case, to pass the chunks to a single node so that it is not overwhelmed by one large data set).
public class WekaInputFormat extends TextInputFormat {
public List<InputSplit> getSplits(JobContext job) throws IOException {
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
long maxSize = getMaxSplitSize(job);
List<InputSplit> splits = new ArrayList<InputSplit>();
for (FileStatus file: listStatus(job)) {
Path path = file.getPath();
FileSystem fs = path.getFileSystem(job.getConfiguration());
//number of bytes in this file
long length = file.getLen();
BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
// make sure this is actually a valid file
if(length != 0) {
// set the number of splits to make. NOTE: the value can be changed to anything
int count = job.getConfiguration().getInt("Run-num.splits", 1);
for(int t = 0; t < count; t++) {
//split the file and add each chunk to the list
splits.add(new FileSplit(path, 0, length, blkLocations[0].getHosts()));
}
}
else {
// Create empty array for zero length files
splits.add(new FileSplit(path, 0, length, new String[0]));
}
}
return splits;
}
}

Hadoop stdout is always empty and bytes written is zero

I am trying to execute Weka on MapReduce and the stdout is always empty
The driver (WekDoop), mapper (WekaMap), reducer (WekaReducer), and input format (WekaInputFormat) classes are the same ones shown in the previous question above.
For each of the mapper, reducer and overall job, there is an stderr file, an stdout file and a syslog file.
You are printing to stdout in the mapper and the reducer, so you should check the stdout files of the mapper and the reducer, not that of the overall job.
Best of luck
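As an illustration (the exact layout depends on your Hadoop version and configuration, so treat these locations as assumptions): on a classic Hadoop 1.x setup the per-attempt logs usually sit under $HADOOP_HOME/logs/userlogs/<job-id>/<attempt-id>/ as the files stdout, stderr and syslog, and the same files are reachable through the JobTracker web UI (port 50030 by default) via the task attempt's log links.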

Getting TF-IDF values from index

The code below is for getting tf-idf values from an index, but I get an error while running it, on the line with Correct_ME.
Using Lucene 4.8.
DocIndexing.java
public class DocIndexing {
private DocIndexing() {}
/** Index all text files under a directory.
* @param args
* @throws java.io.IOException */
public static void main(String[] args) throws IOException {
String usage = "java org.apache.lucene.demo.IndexFiles"
+ " [-index INDEX_PATH] [-docs DOCS_PATH] [-update]\n\n"
+ "This indexes the documents in DOCS_PATH, creating a Lucene index"
+ "in INDEX_PATH that can be searched with Searching";
String indexPath = "C:/Users/dell/Documents/NetBeansProjects/IndexingSearching/Index";
String docsPath = "C:/Users/dell/Documents/NetBeansProjects/IndexingSearching/ToBeIndexed";
boolean create = true;
for(int i=0;i<args.length;i++) {
if (null != args[i]) switch (args[i]) {
case "-index":
indexPath = args[i+1];
i++;
break;
case "-docs":
docsPath = args[i+1];
i++;
break;
case "-update":
create = false;
break;
}
}
if (docsPath == null) {
System.err.println("Usage: " + usage);
System.exit(1);
}
final File docDir = new File(docsPath);
if (!docDir.canRead() && !docDir.isDirectory() &&
!docDir.isHidden() &&
!docDir.exists()) {
System.out.println("Document directory '" +docDir.getAbsolutePath()+ "' does not exist or is not readable, please check the path");
System.exit(1);
}
Date start = new Date();
try {
System.out.println("Indexing to directory '" + indexPath + "'...");
Directory dir = FSDirectory.open(new File(indexPath));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_48);
//Filter filter = new PorterStemFilter();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_48, analyzer);
if (create) {
iwc.setOpenMode(OpenMode.CREATE);
} else {
// Add new documents to an existing index:
iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
}
try (
IndexWriter writer = new IndexWriter(dir, iwc)) {
indexDocs(writer, docDir);
}
Date end = new Date();
System.out.println(end.getTime() - start.getTime() + " total milliseconds");
} catch (IOException e) {
System.out.println(" caught a " + e.getClass() +
"\n with message: " + e.getMessage());
}
Tf_Idf tfidf = new Tf_Idf();
String field = null,term = null;
tfidf.scoreCalculator(field, term);
}
/**
* @param writer Writer to the index where the given file/dir info will be stored
* @param file The file to index, or the directory to recurse into to find files to index
* @throws IOException If there is a low-level I/O error
*/
static void indexDocs(IndexWriter writer, File file)
throws IOException {
// do not try to index files that cannot be read
if (file.canRead()) {
if (file.isDirectory()) {
String[] files = file.list();
// an IO error could occur
if (files != null) {
for (int i = 0; i < files.length; i++) {
indexDocs(writer, new File(file, files[i]));
}
}
} else {
FileInputStream fis;
try {
fis = new FileInputStream(file);
} catch (FileNotFoundException fnfe) {
return;
}
try {
// make a new, empty document
Document doc = new Document();
// Field termV = new LongField("termVector", file.g)
Field pathField = new StringField("path", file.getPath(), Field.Store.YES);
doc.add(pathField);
Field modifiedField = new LongField("modified", file.lastModified(), Field.Store.NO);
doc.add(modifiedField);
Field titleField = new TextField("title", file.getName(), Field.Store.YES);
doc.add(titleField);
Field contentsField = new TextField("contents", new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8)));
doc.add(contentsField);
//contentsField.setBoost((float)0.5);
//titleField.setBoost((float)2.5);
/* doc.add(new LongField("modified", file.lastModified(), Field.Store.NO));
doc.add(new TextField("title", file.getName(), Field.Store.YES));
doc.add(new TextField("contents", new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8))));
*/
// StringField..setBoost(1.2F);
if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
// New index, so we just add the document (no old document can be there):
System.out.println("adding " + file);
writer.addDocument(doc);
} else {
// Existing index (an old copy of this document may have been indexed) so
// we use updateDocument instead to replace the old one matching the exact
// path, if present:
System.out.println("updating " + file);
writer.updateDocument(new Term("path", file.getPath()), doc);
}
} finally {
fis.close();
}
}
}
}
}
Tf_Idf.java
public class Tf_Idf {
static float tf = 1;
static float idf = 0;
private float tfidf_score;
static float [] tfidf = null;
IndexReader indexReader;
public Tf_Idf() throws IOException {
this.indexReader = DirectoryReader.open(FSDirectory.open(new File("C:/Users/dell/Documents/NetBeansProjects/IndexingSearching/Index")));
}
public void scoreCalculator (String field, String term) throws IOException
{
TFIDFSimilarity tfidfSIM = new DefaultSimilarity();
Bits liveDocs = MultiFields.getLiveDocs(indexReader);
TermsEnum termEnum = MultiFields.getTerms(indexReader, field).iterator(null);
BytesRef bytesRef=null;
while ((bytesRef = termEnum.next()) != null) {
if(bytesRef.utf8ToString().trim().equals(term.trim())) {
if(termEnum.seekExact(bytesRef)) {
idf = tfidfSIM.idf(termEnum.docFreq(), indexReader.numDocs());
DocsEnum docsEnum = termEnum.docs(liveDocs, null);
if(docsEnum != null) {
int doc=0;
while((doc = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
tf = tfidfSIM.tf(docsEnum.freq());
tfidf_score = tf * idf ;
System.out.println(" -tfidf_score-" + tfidf_score);
}
}
}
}
}
}
}
It's obvious that you pass a null IndexReader to the MultiFields method:
IndexReader reader = null;
tfidf.scoreCalculator( reader, field,term);
You need to write something like this:
IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(PATH_TO_LUCENE_INDEX)));
tfidf.scoreCalculator( reader, field,term);
You need to replace PATH_TO_LUCENE_INDEX with the real path, of course.
Another problem that I see: you open an IndexReader in Tf_Idf but never use it. It may be a good idea to remove it, or to use it inside the scoreCalculator method, e.g.
tfidf.scoreCalculator(field, term);
but inside the method use the field of this class, this.indexReader, instead of the indexReader that you try to pass into scoreCalculator.
UPD
public Tf_Idf() throws IOException {
this.reader = DirectoryReader.open(FSDirectory.open(new File("Index")));
}
In this code, you need to replace "Index" with the real path to your Lucene index, e.g. /home/user/index or C:/index or wherever you have it.
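Putting it together, a minimal sketch of the corrected call site (it assumes the Tf_Idf constructor keeps opening this.indexReader as shown above; "contents" is the field that DocIndexing actually indexes, and the term is only an example):
Tf_Idf tfidf = new Tf_Idf();
// "contents" is the field created in indexDocs(); "lucene" is just an example term
tfidf.scoreCalculator("contents", "lucene");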

Unable to execute the jar file using a Java program. I need to pass the file path as a command-line argument to the jar file

The code below is not able to write to the files.
I wrote a program to execute commands in cmd. Those commands simply run the Java class that is inside a jar file, and that class expects a file path as a command-line argument.
Note: the jar file was also created by me.
Here is my Java file:
public class ExcelDriver extends Thread {
public static void main(String[] args) throws IOException, InterruptedException {
File directory = new File("C://Users//kondeti.venkatarao//Documents//Regresion_sheets//custome");
File[] files = directory.listFiles();
for (File file: files) {
System.out.println("\""+file.getAbsolutePath()+"\"");
if(file.isFile()){
Runtime.getRuntime().exec("cmd.exe /c start java -jar Demo.jar readExcelDemo.Final "+file.getAbsolutePath());
ExcelDriver.sleep(5000);
}
}
}
}
Here is the jar file code
public class Final {
public static int getExcelColumnNumber(String column) {
int result = 0;
for (int i = 0; i < column.length(); i++) {
result *= 26;
result += column.charAt(i) - 'A' + 1;
}
return result;
}
public static String getExcelColumnName(int number) {
final StringBuilder sb = new StringBuilder();
int num = number - 1;
while (num >= 0) {
int numChar = (num % 26) + 65;
sb.append((char)numChar);
num = (num / 26) - 1;
}
return sb.reverse().toString();
}
void run(File file, File errors, File misMatchs) throws IOException{
if (file.getName().endsWith(".xlsx") || file.getName().endsWith(".xlsm")) {
FileInputStream fis = new FileInputStream(file);
StringBuilder error = new StringBuilder();
StringBuilder misMatch = new StringBuilder();
// Create Workbook instance holding reference to .xlsx file
//OPCPackage pkg = OPCPackage.open(file, PackageAccess.READ);
XSSFWorkbook workbook = new XSSFWorkbook(fis);
int i = 1;
while (i < workbook.getNumberOfSheets()) {
// System.out.println(workbook.getNumberOfSheets());
// Get first/desired sheet from the workbook
XSSFSheet sheet = workbook.getSheetAt(i);
if(sheet.getRow(0).getCell(0).getRawValue().equalsIgnoreCase("fail")){
// Iterate through each rows one by one
Iterator<Row> rowIterator = sheet.iterator();
while (rowIterator.hasNext()) {
Row row = rowIterator.next();
// For each row, iterate through all the columns
Iterator<Cell> cellIterator = row.cellIterator();
while (cellIterator.hasNext()) {
Cell cell = cellIterator.next();
// Check the cell type and format accordingly
switch (cell.getCellType()) {
/*
* case Cell.CELL_TYPE_NUMERIC:
* System.out.print(cell.getNumericCellValue());
* break; case Cell.CELL_TYPE_STRING:
* System.out.print(cell.getStringCellValue());
* break;
*/
// case Cell.CELL_TYPE_FORMULA:
case Cell.CELL_TYPE_FORMULA:
if (cell.getCellFormula().startsWith("IF("))
if (sheet.getRow(row.getRowNum()).getCell(cell.getColumnIndex()).getRawValue().equals("1")) {
HashSet<Integer> number = new HashSet<Integer>();
ArrayList<String> alphas = new ArrayList<String>();
String formula = sheet.getRow(row.getRowNum()).getCell(cell.getColumnIndex()).toString();
Matcher digitMatcher = Pattern.compile("\\d+").matcher(formula);
Matcher alphabetMatcher = Pattern.compile("[a-zA-Z]+").matcher(formula);
while (alphabetMatcher.find()) {
if (!alphabetMatcher.group().equals("TYPE"))
alphas.add(alphabetMatcher.group());
}
int countIF = Collections.frequency(alphas, "IF");
int countABS = Collections.frequency(alphas, "ABS");
HashSet<String> alphaSet = new HashSet<String>(alphas);
if (countIF != 5 && countIF != 6)
alphaSet.remove("IF");
if (countABS != 3 && countABS != 4)
alphaSet.remove("ABS");
while (digitMatcher.find()) {
if (!digitMatcher.group().equals("0") && !digitMatcher.group().equals("1") && !digitMatcher.group().equals("01"))
number.add(Integer.parseInt(digitMatcher.group()));
}
ArrayList<Integer> numberList = new ArrayList<Integer>(number);
ArrayList<String> alphaList = new ArrayList<String>(alphaSet);
System.out.println("alphaSet"+ alphaSet);
System.out.println("numberList"+ numberList);
int rowIndex = numberList.get(0) - 1;
int originalColumnIndex = getExcelColumnNumber(alphaList.get(0)) - 1;
int referenceColumnIndex = getExcelColumnNumber(alphaList.get(1)) - 1;
if (originalColumnIndex > referenceColumnIndex) {
int temp = referenceColumnIndex;
referenceColumnIndex = originalColumnIndex;
originalColumnIndex = temp;
}
// System.out.println(sheet.getRow(row.getRowNum()));
System.out.println("File Name: "+ file.getName());
System.out.println("Sheet Name: "+ sheet.getSheetName());
System.out.println(sheet.getRow(row.getRowNum()).getCell(cell.getColumnIndex()).toString());
if (sheet.getRow(rowIndex).getCell(originalColumnIndex).getCellFormula().equals(""))
System.out.println("please help me out");
System.out.println("Function Name: "+ sheet.getRow(rowIndex).getCell(originalColumnIndex).getCellFormula());
System.out.println("row indext"+ rowIndex);
System.out.println("original column index"+ originalColumnIndex);
System.out.println("ref column index"+ referenceColumnIndex);
/*
* System.out.println("File Name: " +
* file.getName());
* System.out.println("Sheet Name: " +
* sheet.getSheetName());
* System.out.println(cell
* .getCellFormula());
*/
if (sheet.getRow(rowIndex).getCell(originalColumnIndex).getCellFormula().contains("qCRA_")&& sheet.getRow(rowIndex)
.getCell(originalColumnIndex).getRawValue().contains("Error:")) {
error.append(System.getProperty("line.separator"));
error.append("File Name: "+ file.getName());
error.append(System.getProperty("line.separator"));
error.append("Sheet Name: "+ sheet.getSheetName());
error.append(System.getProperty("line.separator"));
error.append("Function Name: "+ sheet.getRow(rowIndex).getCell(originalColumnIndex).getCellFormula());
error.append(System.getProperty("line.separator"));
error.append("Cell Number: "+getExcelColumnName(originalColumnIndex+1)+numberList.get(0));
error.append(System.getProperty("line.separator"));
error.append("Orginal Value : "+sheet.getRow(rowIndex).getCell(originalColumnIndex).getRawValue());
error.append(System.getProperty("line.separator"));
error.append("Reference Value : "+sheet.getRow(rowIndex).getCell(referenceColumnIndex));
error.append(System.getProperty("line.separator"));
} else {
misMatch.append(System.getProperty("line.separator"));
misMatch.append("File Name: "+ file.getName());
misMatch.append(System.getProperty("line.separator"));
misMatch.append("Sheet Name: "+ sheet.getSheetName());
misMatch.append(System.getProperty("line.separator"));
misMatch.append("Function Name: "+ sheet.getRow(rowIndex).getCell(originalColumnIndex).getCellFormula());
misMatch.append(System.getProperty("line.separator"));
misMatch.append("Cell Number: "+getExcelColumnName(originalColumnIndex+1)+numberList.get(0));
misMatch.append(System.getProperty("line.separator"));
misMatch.append("Orginal Value : "+sheet.getRow(rowIndex).getCell(originalColumnIndex).getRawValue());
misMatch.append(System.getProperty("line.separator"));
misMatch.append("Reference Value : "+sheet.getRow(rowIndex).getCell(referenceColumnIndex));
misMatch.append(System.getProperty("line.separator"));
}
}
break;
}
cell = null;
}
row = null;
}
}
i++;
fis.close();
sheet=null;
}
workbook=null;
//FileUtils.writeStringToFile(errors, error.toString(),true);
//FileUtils.writeStringToFile(misMatchs, misMatch.toString(),true);
FileWriter errorsFileWriter = new FileWriter(errors,true);
BufferedWriter errorsBufferedWriter = new BufferedWriter(errorsFileWriter);
errorsBufferedWriter.write(error.toString());
errorsBufferedWriter.flush();
errorsBufferedWriter.close();
FileWriter misMatchFileWriter = new FileWriter(misMatchs, true);
BufferedWriter misMatchesBufferedWriter = new BufferedWriter(misMatchFileWriter);
misMatchesBufferedWriter.write(misMatch.toString());
misMatchesBufferedWriter.flush();
misMatchesBufferedWriter.close();
}
}
public static void main(String[] args) {
try {
String filepath = args[0];//.replace("\" , ", "\\");
//System.out.println(filepath);
File directory = new File(filepath);
File errors = new File("C://Users//kondeti.venkatarao//Documents//Regresion_sheets//Error.txt");
if(!errors.exists()){
errors.createNewFile();
}
File mismatch = new File("C://Users//kondeti.venkatarao//Documents//Regresion_sheets//Mismatch.txt");
if(!mismatch.exists()){
mismatch.createNewFile();
}
Final hvd=new Final();
hvd.run(directory,errors,mismatch);
} catch (Exception e) {
e.printStackTrace();
}
}
}
The issue is what you're passing to the Jar file when calling it. Your Jar file is set to use the first parameter it receives as the directory you want to use:
public static void main(String[] args) {
...
String filepath = args[0];//.replace("\" , ", "\\"); // First parameter
File directory = new File(filepath);
....
But when you're calling it, you're actually passing in "readExcelDemo.Final" as the first argument, and not what it looks like you want the directory to be:
File directory = new File("C://Users//kondeti.venkatarao//Documents//Regresion_sheets//custome");
...
for (File file: files) {
...
Runtime.getRuntime().exec("java -jar Demo.jar readExcelDemo.Final " + file.getAbsolutePath());
So your solution would be one of 3 things:
Change the Jar file to use the second input (easiest, but not the best)
Set your implementation to pass the file path as the first parameter (second easiest, but still not the best)
Parameterize what you send the Jar file (e.g. "dir=C:/path/to/dir") and you won't need to have to worry about parameter ordering (hardest, but worth it if you're likely to pass in more parameters)
Hope that helps~
Edit:
PS: I'm assuming you're also not getting any exceptions because it actually is working; however, the directory containing your results will be located at your root project directory/readExcelDemo.Final instead of the path it looks like you want it to use.
PPS: Also, your execution command doesn't need the cmd.exe /c start prefix and should instead be replaced with:
Runtime.getRuntime().exec("java -jar Demo.jar readExcelDemo.Final " + file.getAbsolutePath());
I also took the liberty of replacing that in the first part of my original post.
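As a side note, and only as a sketch: the array form of exec avoids problems with paths that contain spaces, and if the path really should arrive as args[0], the "readExcelDemo.Final" token can be dropped entirely, because java -jar already takes the main class from Demo.jar's manifest:
// Array form: each argument is passed through as-is, so spaces in the path survive
Runtime.getRuntime().exec(new String[] { "java", "-jar", "Demo.jar", file.getAbsolutePath() });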

Deleting Multiple Files Java (Android)

I'm new to programming Android, and I want to delete files on the SD card. This is my current (working) code...
File appvc = new File(Environment.getExternalStorageDirectory()
.getAbsolutePath(), "ApplifierVideoCache");
if (appvc.isDirectory()) {
String[] children = appvc.list();
for (int i = 0; i < children.length; i++) {
new File(appvc, children[i]).delete();
}
}
Now I want to delete multiple files, but I don't want to repeat that big block for each file. Am I able to combine all files in one variable? Thanks ;)
Make a recursive method:
/*
* NOTE: coded so as to work around File's misbehaviour with regards to .delete(),
* which does not throw an exception if it fails -- or why you should use Java 7's Files
*/
public void doDelete(final File base)
throws IOException
{
if (base.isDirectory()) {
for (final File entry: base.listFiles())
doDelete(entry);
return;
}
if (!base.delete())
throw new IOException("Failed to delete " + base + '!');
}
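Called on the directory from the question, this could look like the following (a sketch only; doDelete throws IOException, so call it from code that handles or declares it):
// ApplifierVideoCache is the directory named in the question
doDelete(new File(Environment.getExternalStorageDirectory().getAbsolutePath(), "ApplifierVideoCache"));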
Another possibility would be using the Apache commons-io library and calling
if (file.isDirectory())
FileUtils.deleteDirectory(file);
else {
if(!file.delete())
throw new IOException("Failed to delete " + file);
}
You should make a method out of this chunk of code, pass the file name, and call it whenever you like:
public void DeleteFile(String fileName) {
File appvc = new File(Environment.getExternalStorageDirectory()
.getAbsolutePath(), fileName);
if (appvc.isDirectory()) {
String[] children = appvc.list();
for (int i = 0; i < children.length; i++) {
new File(appvc, children[i]).delete();
}
}
}
File dir = new File(android.os.Environment.getExternalStorageDirectory(),"ApplifierVideoCache");
Then call
deletedir(dir);
public void deletedir(File dir) {
File listFile[] = dir.listFiles();
if (listFile != null) {
for (int i = 0; i < listFile.length; i++) {
listFile[i].delete();
}
}
}
or if your folder as sub folders then
public void walkdir(File dir) {
File listFile[] = dir.listFiles();
if (listFile != null) {
for (int i = 0; i < listFile.length; i++)
{
if (listFile[i].isDirectory())
{
walkdir(listFile[i]);
} else
{
listFile[i].delete();
}
}
}
}
For Kotlin
Create an array of paths:
val paths: MutableList<String> = ArrayList()
paths.add("Yor path")
paths.add("Yor path")
.
.
Delete the file for each path:
try{
paths.forEach{
val file = File(it)
if (file.exists()) {
file.delete()
}
}
}catch(e:IOException){
}
