I wrote a MapReduce program, but when I try to run it on Hadoop it cannot finish: it generates so much intermediate data that I get an error saying the node has run out of space. It then tries the second node, but the result is the same. All I want to process is two text files of roughly 60k lines.
I have tried:
- enabling Snappy compression (the driver settings are sketched below), but it didn't help;
- adding more space, so the two nodes now have 50 GB of storage each.
Since neither of these helped, maybe the problem is with the code rather than the setup.
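For reference, map output compression is normally turned on in the driver roughly like this (a sketch using the standard MapReduce property names, not necessarily my exact settings):
Configuration conf = new Configuration();
// compress intermediate (map output) data with Snappy
conf.set("mapreduce.map.output.compress", "true");
conf.set("mapreduce.map.output.compress.codec",
        "org.apache.hadoop.io.compress.SnappyCodec");
Here is the mapper and the driver: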
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class FirstMapper extends Mapper<LongWritable, Text, Text, Text> {
enum POS_TAG {
CC, CD, DT, EX,
FW, IN, JJ, JJR,
JJS, LS, MD, NN,
NNS, NNP, NNPS, PDT,
WDT, WP, POS, PRP,
PRP$, RB, RBR, RBS,
RP, SYM, TO, UH,
VB, VBD, VBG, VBN,
VBP, VBZ, WP$, WRB
}
private static final List<String> tags = Stream.of(POS_TAG.values())
.map(Enum::name)
.collect(Collectors.toList());
private static final int MAX_NGRAM = 5;
private static String[][] cands = {
new String[3],
new String[10],
new String[32],
new String[10]
};
@Override
protected void setup(Context context) throws IOException, InterruptedException {
Configuration conf = context.getConfiguration();
String location = conf.get("job.cands.path");
if (location != null) {
BufferedReader br = null;
try {
FileSystem fs = FileSystem.get(conf);
Path path = new Path(location);
if (fs.exists(path)) {
FSDataInputStream fis = fs.open(path);
br = new BufferedReader(new InputStreamReader(fis));
String line;
int i = 0;
while ((line = br.readLine()) != null) {
String[] splitted = line.split(" ");
cands[i] = splitted;
i++;
}
}
} catch (IOException e) {
// ignored: if the candidates file cannot be read, cands keeps its default arrays
} finally {
if (br != null) {
br.close();
}
}
}
}
protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] tokens = value.toString().split(" ");
int m = tokens.length;
for (int n = 2; n <= MAX_NGRAM; n++) {
for (int s = 0; s <= m - n; s++) {
for (int i = 0; i < cands[n - 2].length; i++) {
List<String> pattern = new ArrayList<>();
List<String> metWords = new ArrayList<>();
for (int j = 0; j <= n - 1; j++) {
String[] pair = tokens[s + j].split("/");
String word = pair[0];
String pos = pair[1];
char c = cands[n - 2][i].charAt(j);
addToPattern(word, pos, c, pattern);
if (c > 0 && tags.contains(pos)) {
metWords.add(word);
}
}
if (metWords.isEmpty()) {
metWords.add("_NONE");
}
Text resultKey = new Text(pattern.toString() + ";" + metWords.toString());
context.write(resultKey, new Text(key.toString()));
}
}
}
}
public void addToPattern(String word, String pos, char c, List<String> pattern) {
switch (c) {
case 'w':
pattern.add(word);
break;
case 'p':
pattern.add(pos);
break;
default:
pattern.add("_WC_");
break;
}
}
}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class Main {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
conf.set("job.cands.path", "/user/thelfter/pwp");
Job job1 = Job.getInstance(conf, "word pattern1");
job1.setJarByClass(Main.class);
job1.setMapperClass(FirstMapper.class);
job1.setCombinerClass(FirstReducer.class);
job1.setReducerClass(FirstReducer.class);
job1.setMapOutputKeyClass(Text.class);
job1.setMapOutputValueClass(Text.class);
FileInputFormat.addInputPath(job1, new Path(args[0]));
FileOutputFormat.setOutputPath(job1, new Path("/user/thelfter/output"));
System.exit(job1.waitForCompletion(true) ? 0 : 1);
}
}
If you're using YARN, the NodeManager's local disk space is controlled by yarn.nodemanager.local-dirs in your yarn-site.xml file, so whatever directories that points at need to have enough free space.
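For example, a minimal yarn-site.xml entry could look like this (the path is just a placeholder; point it at a disk with enough free space):
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/path/to/large/disk/yarn-local</value>
</property>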
I am trying to classify a dataset. In this dataset the first column is the ideal outcome and the other 20 columns are the inputs.
The problem is that the SVM trained on the dataset (80% of it is used for training) shows a training error of 0.0, yet it always predicts 1.0 as the outcome.
I have divided the set into two parts: 80% of the data for training and 20% for classification. The data is a concatenation of two short time series of RSI values (one 2-period and one 14-period).
Why does the SVM behave this way, and can I do something to avoid it? I thought a training error of 0.0 would mean that the SVM makes no errors on the training set; judging from the results, that seems to be false.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.encog.Encog;
import org.encog.ml.data.MLData;
import org.encog.ml.data.MLDataPair;
import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.ml.svm.SVM;
import org.encog.ml.svm.training.SVMTrain;
public class SVMTest {
public static void main(String[] args) {
List<String> lines = readFile("/home/wens/mlDataSet.csv");
double[][] trainingSetData = getInputData(lines, 0, lines.size()/10*8);
double[][] trainingIdeal = getIdeal(lines, 0, lines.size()/10*8);
MLDataSet trainingSet = new BasicMLDataSet(trainingSetData, trainingIdeal);
double[][] classificationSetData = getInputData(lines, lines.size()/10*8, lines.size());
double[][] classificationIdeal = getIdeal(lines, lines.size()/10*8, lines.size());
MLDataSet classificationSet = new BasicMLDataSet(classificationSetData, classificationIdeal);
SVM svm = new SVM(20,false);
final SVMTrain train = new SVMTrain(svm, trainingSet);
train.iteration();
train.finishTraining();
System.out.println("training error: " + train.getError());
System.out.println("SVM Results:");
for(MLDataPair pair: classificationSet ) {
final MLData output = svm.compute(pair.getInput());
System.out.println("actual: " + output.getData(0) + "\tideal=" + pair.getIdeal().getData(0));
}
Encog.getInstance().shutdown();
}
private static List<String> readFile(String filepath){
List<String> res = new ArrayList<>();
try {
File f = new File(filepath);
BufferedReader b = new BufferedReader(new FileReader(f));
String readLine = "";
while ((readLine = b.readLine()) != null) {
res.add(readLine);
}
} catch (IOException e) {
e.printStackTrace();
}
return res;
}
private static double[][] getInputData(List<String> lines, int start, int end){
double[][] res = new double[end-start][20];
int cnt = 0;
for(int i=start; i<end; i++){
String[] tmp = lines.get(i).split("\t");
for(int j=1; j<tmp.length; j++){
res[cnt][j-1] = Double.parseDouble(tmp[j]);
}
cnt++;
}
return res;
}
private static double[][] getIdeal(List<String> lines, int start, int end){
double[][] res = new double[end-start][1];
int cnt = 0;
for(int i=start; i<end; i++){
String[] tmp = lines.get(i).split("\t");
res[cnt][0] = Double.parseDouble(tmp[0]);
cnt++;
}
return res;
}
}
I have the Java code below and TestData.csv as the input file, and my expected output is like the one below, but the program shows the actual count instead.
I have tried a lot. Does anyone have any idea about this? Any help is appreciated. Based on the column data, I want the count for a particular value.
package com;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Arrays;
import com.opencsv.CSVWriter;
import com.opencsv.CSVReader;
import java.time.format.DateTimeFormatter;
import java.time.LocalDateTime;
public class TestDataProcess {
public static void main(String args[]) throws IOException {
processData();
}
public static void processData() {
String[] trafficDetails;
int locColumnPosition, subCcolumnPosition, j, i, msgTypePosition, k, m, trafficLevelPosition;
String masterCSVFile, dayFolderPath;
String[] countryID = { "LOC1" };
String[] subID = { "S1" };
String[] mType = { "MSG1" };
String[] trafficLevel = { "1", "2", "3" };
String columnNameLocation = "CountryID";
String columnNameSubsystem = "SubID";
String columnNameMsgType = "Type";
String columnNameAlrmLevel = "TrafficLevel";
masterCSVFile = "D:\\TestData.csv";
dayFolderPath = "D:\\output\\";
DateTimeFormatter dtf = DateTimeFormatter.ofPattern("dd_MM_yyyy");
LocalDateTime now = LocalDateTime.now();
System.out.println(dtf.format(now));
int count = 0;
for (i = 0; i < countryID.length; i++) {
count = 0;
for (j = 0; j < subID.length; j++) {
count = 0;
String locaIdSubsysId = dtf.format(now) + "_" + countryID[i] + "_" + subID[j] + ".csv";
try (CSVWriter csvWriter = new CSVWriter(new FileWriter(dayFolderPath + locaIdSubsysId, true));
CSVReader csvReader = new CSVReader(new FileReader(masterCSVFile));) {
trafficDetails = csvReader.readNext();
csvWriter.writeNext(trafficDetails);
locColumnPosition = getHeaderLocation(trafficDetails, columnNameLocation);
subCcolumnPosition = getHeaderLocation(trafficDetails, columnNameSubsystem);
msgTypePosition = getHeaderLocation(trafficDetails, columnNameMsgType);
trafficLevelPosition = getHeaderLocation(trafficDetails, columnNameAlrmLevel);
while ((trafficDetails = csvReader.readNext()) != null && locColumnPosition > -1
&& subCcolumnPosition > -1) {
for (k = 0; k < mType.length; k++) {
for (m = 0; m < trafficLevel.length; m++) {
if (trafficDetails[locColumnPosition].matches(countryID[i])
& trafficDetails[subCcolumnPosition].matches(subID[j])
& trafficDetails[trafficLevelPosition].matches(trafficLevel[m])
& trafficDetails[msgTypePosition].matches(mType[k]))
{
count = count + 1;
csvWriter.writeNext(trafficDetails);
}
}
}
}
} catch (Exception ee) {
ee.printStackTrace();
}
}
}
}
public static int getHeaderLocation(String[] headers, String columnName) {
return Arrays.asList(headers).indexOf(columnName);
}
}
You can do that by using a Map that stores the traffic level as a key and all the matching rows from your CSV file in a List as its value. Then just print the size of each List.
See the following example and have a look at the code comments:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
public class ExampleMain {
public static void main(String[] args) {
// create a Path object from the path to your file
Path csvFilePath = Paths.get("Y:\\our\\path\\to\\file.csv");
// create a data structure that stores data rows per traffic level
Map<Integer, List<DataRow>> dataRowsPerTrafficLevel = new TreeMap<Integer, List<DataRow>>();
try {
// read all the lines of the file
List<String> lines = Files.readAllLines(csvFilePath);
// iterate all the lines, skipping the header line
for (int i = 1; i < lines.size(); i++) {
// split the lines by the separator (WHICH MAY DIFFER FROM THE ONE USED HERE)
String[] lineValues = lines.get(i).split(",");
// store the value from column 6 (index 5) as the traffic level
int trafficLevel = Integer.valueOf(lineValues[5]);
// if the map already contains this key, just add the next data row
if (dataRowsPerTrafficLevel.containsKey(trafficLevel)) {
DataRow dataRow = new DataRow();
dataRow.subId = lineValues[1];
dataRow.countryId = lineValues[2];
dataRow.type = lineValues[3];
dataRowsPerTrafficLevel.get(trafficLevel).add(dataRow);
} else {
/* otherwise create a list, then a data row, add it to the list and put it in
* the map along with the new key
*/
List<DataRow> dataRows = new ArrayList<DataRow>();
DataRow dataRow = new DataRow();
dataRow.subId = lineValues[1];
dataRow.countryId = lineValues[2];
dataRow.type = lineValues[3];
dataRows.add(dataRow);
dataRowsPerTrafficLevel.put(trafficLevel, dataRows);
}
}
// print the result
dataRowsPerTrafficLevel.forEach((trafficLevel, dataRows) -> {
System.out.println("For TrafficLevel " + trafficLevel + " there are " + dataRows.size()
+ " data rows in the csv file");
});
} catch (IOException e) {
e.printStackTrace();
}
}
/*
* small holder class that just holds the values of columns 3, 4 and 5.
* If you want to have distinct values, make this one a full POJO implementing Comparable
*/
static class DataRow {
String subId;
String countryId;
String type;
}
}
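If you are on Java 8 or later, the populate-the-map step can be shortened with computeIfAbsent (a sketch, assuming the same column layout as above):
DataRow dataRow = new DataRow();
dataRow.subId = lineValues[1];
dataRow.countryId = lineValues[2];
dataRow.type = lineValues[3];
// creates the list for a new traffic level on demand, then appends the row
dataRowsPerTrafficLevel.computeIfAbsent(trafficLevel, k -> new ArrayList<>()).add(dataRow);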
I am writing a Java program using PDFBox that reads a PDF file and counts how many times each word appears in it, but for some reason nothing is printed when I run the program. I expect it to print each word with the number of its occurrences next to it. Thanks in advance.
Here is my code:
package lab8;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.Scanner;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class Extractor {
public static void main(String[] args) throws FileNotFoundException {
Map<String, Integer> frequencies = new TreeMap<String, Integer>();
PDDocument pd;
File input = new File("C:\\Users\\Ammar\\Desktop\\Application.pdf");
Scanner in = new Scanner(input);
try {
pd = PDDocument.load(input);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setEndPage(20);
String text = stripper.getText(pd);
while (in.hasNext()) {
String word = clean(in.next());
if (word != "") {
Integer count = frequencies.get(word);
if (count == null) {
count = 1;
} else {
count = count + 1;
}
frequencies.put(word, count);
}
}
for (String key : frequencies.keySet()) {
System.out.println(key + ": " + frequencies.get(key));
}
if (pd != null) {
pd.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
private static String clean(String s) {
String r = "";
for (int i = 0; i < s.length(); i++) {
char c = s.charAt(i);
if (Character.isLetter(c)) {
r = r + c;
}
}
return r.toLowerCase();
}
}
I have reworked the logic:
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
public class Extractor {
public static void main(String[] args) throws FileNotFoundException {
Map<String, Integer> wordFrequencies = new TreeMap<String, Integer>();
Map<Character, Integer> charFrequencies = new TreeMap<Character, Integer>();
PDDocument pd;
File input = new File("C:\\Users\\Ammar\\Desktop\\Application.pdf");
try {
pd = PDDocument.load(input);
PDFTextStripper stripper = new PDFTextStripper();
stripper.setEndPage(20);
String text = stripper.getText(pd);
for(int i=0; i<text.length(); i++)
{
char c = text.charAt(i);
int count = charFrequencies.get(c) != null ? (charFrequencies.get(c)) + 1 : 1;
charFrequencies.put(c, count);
}
String[] texts = text.split(" ");
for (String txt : texts) {
int count = wordFrequencies.get(txt) != null ? (wordFrequencies.get(txt)) + 1 : 1;
wordFrequencies.put(txt, count);
}
System.out.println("Printing the number of words");
for (String key : wordFrequencies.keySet()) {
System.out.println(key + ": " + wordFrequencies.get(key));
}
System.out.println("Printing the number of characters");
for (char charKey : charFrequencies.keySet()) {
System.out.println(charKey + ": " + charFrequencies.get(charKey));
}
if (pd != null) {
pd.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
Try this code. If there is still a problem that you cannot resolve, I can take another look.
In your code you could also use a StringTokenizer by passing it the extracted text, i.e.
StringTokenizer st = new StringTokenizer(stripper.getText(pd));
Then loop with st.hasMoreTokens() and get each word with String word = clean(st.nextToken()); this works fine as well.
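Put together, that loop could look roughly like this in your original program (a sketch reusing your clean() method and frequencies map; it assumes an extra import java.util.StringTokenizer;):
StringTokenizer st = new StringTokenizer(stripper.getText(pd));
while (st.hasMoreTokens()) {
    String word = clean(st.nextToken());
    if (!word.isEmpty()) {
        // count the word, starting at 1 the first time it is seen
        Integer count = frequencies.get(word);
        frequencies.put(word, count == null ? 1 : count + 1);
    }
}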
package demo_thesis;
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.NominalPrediction;
import weka.classifiers.rules.DecisionTable;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.DecisionStump;
import weka.classifiers.trees.J48;
import weka.core.FastVector;
import weka.core.Instances;
public class WekaTest {
public static BufferedReader readDataFile(String filename) {
BufferedReader inputReader = null;
try {
inputReader = new BufferedReader(new FileReader(filename));
} catch (FileNotFoundException ex) {
System.err.println("File not found: " + filename);
}
return inputReader;
}
public static Evaluation classify(Classifier model,
Instances trainingSet, Instances testingSet) throws Exception {
Evaluation evaluation = new Evaluation(trainingSet);
model.buildClassifier(trainingSet);
evaluation.evaluateModel(model, testingSet);
return evaluation;
}
public static double calculateAccuracy(FastVector predictions) {
double correct = 0;
for (int i = 0; i < predictions.size(); i++) {
NominalPrediction np = (NominalPrediction) predictions.elementAt(i);
if (np.predicted() == np.actual()) {
correct++;
}
}
return 100 * correct / predictions.size();
}
public static Instances[][] crossValidationSplit(Instances data, int numberOfFolds) {
Instances[][] split = new Instances[2][numberOfFolds];
for (int i = 0; i < numberOfFolds; i++) {
split[0][i] = data.trainCV(numberOfFolds, i);
split[1][i] = data.testCV(numberOfFolds, i);
}
return split;
}
public static void main(String[] args) throws Exception {
BufferedReader datafile = readDataFile("C:\\Users\\user\\Desktop\\demo_thesis\\src\\input_file\\weather.txt");
Instances data = new Instances(datafile);
data.setClassIndex(data.numAttributes() - 1);
// Do 10-split cross validation
Instances[][] split = crossValidationSplit(data, 10);
// Separate split into training and testing arrays
Instances[] trainingSplits = split[0];
Instances[] testingSplits = split[1];
// Use a set of classifiers
Classifier[] models = {
new J48(), // a decision tree
new PART(),
new DecisionTable(),//decision table majority classifier
new DecisionStump() //one-level decision tree
};
// Run for each model
for (int j = 0; j < models.length; j++) {
// Collect every group of predictions for current model in a FastVector
FastVector predictions = new FastVector();
// For each training-testing split pair, train and test the classifier
for (int i = 0; i < trainingSplits.length; i++) {
Evaluation validation = classify(models[j], trainingSplits[i], testingSplits[i]);
predictions.appendElements(validation.predictions());
// Uncomment to see the summary for each training-testing pair.
System.out.println(models[j].toString());
}
// Calculate overall accuracy of current classifier on all splits
double accuracy = calculateAccuracy(predictions);
// Print current classifier's name and accuracy in a complicated,
// but nice-looking way.
System.out.println("Accuracy of " + models[j].getClass().getSimpleName() + ": "
+ String.format("%.2f%%", accuracy)
+ "\n---------------------------------");
}
}
}
I have integrated the Weka jar file into a Java package and used the code above, but there is an error on the line predictions.appendElements(validation.predictions()); saying "cannot find symbol - symbol: method predictions()". I am using NetBeans IDE 8.2 and JDK 1.8. I have tried many things but cannot solve it. How should I fix this error?
I need help with writing to and reading from text files.
The program seems to get almost all the way through, but then it reports that a file does not exist. At that point it should create the file and start writing to it, yet it says it failed to find one and then simply exits. I don't know why.
package sorting;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.Random;
public class Sorting {
private static int[] oneToFiftyThou = new int[50000];
private static int[] fiftyThouToOne = new int[50000];
private static int[] randomFiftyThou = new int[50000];
public static void main(String[] args) {
if(args.length>0) {
if(args[0].equalsIgnoreCase("init")) {
// initialize the 3 files
// 1-50000 file1
// 50000-1 file2
// random 50000 file3
initializeFiles();
writeFiles();
}
} else {
readFilestoArray();
System.out.println(""+oneToFiftyThou[0] + " - " +
oneToFiftyThou[oneToFiftyThou.length-1]);
System.out.println(""+fiftyThouToOne[0] + " - " +
fiftyThouToOne[fiftyThouToOne.length-1]);
System.out.println(""+randomFiftyThou[0] + " - " +
randomFiftyThou[randomFiftyThou.length-1]);
intInsertionSort(oneToFiftyThou);
intInsertionSort(fiftyThouToOne);
intInsertionSort(randomFiftyThou);
}
}
private static void initializeFiles() {
//Array one
for(int i=1; i<oneToFiftyThou.length+1; i++) {
oneToFiftyThou[i-1] = i;
}
//Array two
for(int i=50000; i>0; i--) {
fiftyThouToOne[fiftyThouToOne.length-(i)] = i;
}
//Array Three Random. Copy Array one into a new Array and shuffle.
System.arraycopy(oneToFiftyThou, 0, randomFiftyThou, 0,
randomFiftyThou.length);
Random random = new Random();
for(int i=randomFiftyThou.length-1; i>0; i--) {
int index = random.nextInt(i+1);
//Swap the values
int value = randomFiftyThou[index];
randomFiftyThou[index] = randomFiftyThou[i];
randomFiftyThou[i] = value;
}
}
public static void writeFiles() {
ArrayList<int[]> arrayList = new ArrayList<int[]>();
arrayList.add(oneToFiftyThou);
arrayList.add(fiftyThouToOne);
arrayList.add(randomFiftyThou);
int fileIter = 1;
for(Iterator<int[]> iter = arrayList.iterator();
iter.hasNext(); ) {
int[] array = iter.next();
try {
File file = new File("file"+fileIter+".txt");
//check for file, create it if it doesn't exist
if(!file.exists()) {
file.createNewFile();
}
FileWriter fileWriter = new FileWriter(file);
BufferedWriter bufferWriter = new BufferedWriter
(fileWriter);
for(int i = 0; i<array.length; i++) {
bufferWriter.write(""+array[i]);
if(i!=array.length-1) {
bufferWriter.newLine();
}
}
bufferWriter.close();
fileIter++;
}catch(IOException ioe) {
ioe.printStackTrace();
System.exit(-1);
}
}
}
public static void readFilestoArray() {
ArrayList<int[]> arrayList = new ArrayList<int[]>();
arrayList.add(oneToFiftyThou);
arrayList.add(fiftyThouToOne);
arrayList.add(randomFiftyThou);
int fileIter = 1;
for(Iterator<int[]> iter = arrayList.iterator();
iter.hasNext(); ) {
int[] array = iter.next();
try {
File file = new File("file"+fileIter+".txt");
//check for file, exit with error if file doesn't exist
if(!file.exists()) {
System.out.println("file doesn't exist "
+ file.getName());
System.exit(-1);
}
FileReader fileReader = new FileReader(file);
BufferedReader bufferReader = new BufferedReader
(fileReader);
for(int i = 0; i<array.length; i++) {
array[i] = Integer.parseInt
(bufferReader.readLine());
}
bufferReader.close();
fileIter++;
}catch(IOException ioe) {
ioe.printStackTrace();
System.exit(-1);
}
}
}
private static void intInsertionSort(int[] intArray) {
int comparisonCount = 0;
long startTime = System.currentTimeMillis();
for(int i=1; i<intArray.length;i++) {
int tempValue = intArray[i];
int j = 0;
for(j=i-1; j>=0 && tempValue<intArray[j];j--){
comparisonCount++;
intArray[j+1] = intArray[j];
}
intArray[j+1] = tempValue;
}
long endTime=System.currentTimeMillis();
System.out.println("Comparison Count = " + comparisonCount
+ " running time (in millis) = " +
(endTime-startTime) );
}
}
Well, it works for me. Execute it in a console like this:
java Sorting init
Then execute it another time:
java Sorting
It works perfectly. If you are in Eclipse, go to Run Configurations > Arguments and put init there.
The point is that your main method checks whether the program was invoked with the init parameter: if it was, you create the files and write to them; if not, you read from them. You are probably invoking it without init while the files do not exist yet, which is why it does not work.
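If you would rather have the program generate the files automatically when they are missing instead of exiting, one option is a small fallback in main (a sketch; it only checks for file1.txt and leaves the rest of the class unchanged):
public static void main(String[] args) {
    if (args.length > 0 && args[0].equalsIgnoreCase("init")) {
        initializeFiles();
        writeFiles();
    } else {
        // fallback: create the three files on the first run if they do not exist yet
        if (!new File("file1.txt").exists()) {
            initializeFiles();
            writeFiles();
        }
        readFilestoArray();
        // ... the existing printouts and intInsertionSort calls follow unchanged
    }
}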