I have a program written in Java that I want to reimplement as a MapReduce job so that it can run on the Hadoop framework, but I am new to Hadoop. How should I proceed? I have pasted the basic structure of the program below (there are more classes and methods, but I have omitted them, along with some code, to keep things simple). How should I begin writing the driver, mapper, and reducer classes?
public class TC{
TC tc = new TC();
long start =System.currentTimeMillis(); // staring the timer to get running time
int i=0;
int a=0, c=0, e=0, g=0, in=0, mn=0, sh=0, td=0; // per-category counters, named as they are used below
String categories[]={"a","c","e","g","in","mn","sh","td"}; // different categories in the training set
Map<String, Integer> all_wordMap=new HashMap<String, Integer>(); // creating a map
for(i=0;i<categories.length;i++)
{
String class_file[]=no_of_files("path"+categories[i]);
for(int j=0;j<class_file.length;j++){
// code block for counting all distinct words in the files and putting them into the map
}
}
int total_words=all_wordMap.size(); // size of the map
Map<String,String> bvm=new HashMap<String,String>(); // creating another map
for(i=0;i<categories.length;i++)
{
String class_file[]=tc.no_of_files("path"+categories[i]);
for(int j=0;j<class_file.length;j++){
//code block to create binary vector for each text file
}
}
}
String classs = null;
String s="path"; // path from where the test files will be read
String files[]=tc.no_of_files(s); // no. of test files
for(int y=0;y<files.length;y++){
String file1 =files[y];
classs=tc.classifier(file1,bvm,all_wordMap); // calling a method classifier which classify the test file(which class does the test file belong to)
System.out.println("The category is "+classs); // these are the outputs
if ("a".equals(classs)){
a++;
}
if ("c".equals(classs)){
c++;
}
if ("e".equals(classs)){
e++;
}
if ("g".equals(classs)){
g++;
}
if ("in".equals(classs)){
in++;
}
if ("mn".equals(classs)){
mn++;
}
if ("sh".equals(classs)){
sh++;
}
if ("td".equals(classs)){
td++;
}
}
System.out.println ("a = "+a); //counting the no. of files of this class
System.out.println ("c = "+c);
System.out.println ("e = "+e);
System.out.println ("g = "+g);
System.out.println ("in = "+in);
System.out.println ("mn = "+mn);
System.out.println ("sh = "+sh);
System.out.println ("td = "+td);
System.out.println(System.currentTimeMillis() - start); // stopping the clock
}
package Testing;
import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
public class testing {
// map to store the number of errors per user
private static Map<String, Integer> errorsPerUser = new HashMap<>();
// variable to store the number of jobs started
private static int jobsStarted = 0;
// variable to store the number of jobs completed
private static int jobsCompleted = 0;
public static void main(String[] args) {
// specify the path to the log file
String filePath = "C:/Users/Wafiq/Documents/WIX1002/GroupAssignment/extracted_log.txt";
try (Scanner scanner = new Scanner(new File(filePath))) {
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
int timestampEndIndex = line.indexOf("]");
String lineWithoutTimestamp = line.substring(timestampEndIndex+2);
// check if line contains error message
if (lineWithoutTimestamp.contains("error: This association")) {
// extract the user from the line
String user = extractUser(lineWithoutTimestamp);
// increment the error count for the user
incrementErrorCount(user);
}
// check if line indicates job start
if (lineWithoutTimestamp.contains("sched: Allocate")) {
jobsStarted++;
}
// check if line indicates job completion
if (lineWithoutTimestamp.contains("_job_complete: JobId")) {
jobsCompleted++;
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
// print the results
System.out.println("Number of jobs started: " + jobsStarted);
System.out.println("Number of jobs completed: " + jobsCompleted);
System.out.println("Number of errors per user:");
for (Map.Entry<String, Integer> entry : errorsPerUser.entrySet()) {
System.out.println(entry.getKey() + ": " + entry.getValue());
}
}
// method to extract the user from the line
private static String extractUser(String line) {
// assuming the user is the string before "error" in the line
return line.substring(0, line.indexOf("error")).trim();
}
// method to increment the error count for the user
private static void incrementErrorCount(String user) {
if (errorsPerUser.containsKey(user)) {
errorsPerUser.put(user, errorsPerUser.get(user) + 1);
} else {
errorsPerUser.put(user, 1);
}
}
}
I'm trying to extract the number of jobs causing errors and the corresponding user. I have counted the jobs causing errors, but I don't know how to extract the corresponding user.
(P.S. Please go easy on me; I'm a first-year Computer Science student and I have tried my best.)
The user is not at the same index on each line, so I don't know how to extract it from the line.
Although the user is not at the same index on every line, it always comes after user=' and ends at the next '. Search for these substrings in your line and you are done.
String marker = "user='";
int markerIndex = line.indexOf(marker);
if (markerIndex>=0) {
int startIndex = markerIndex + marker.length(); // skip past user=' itself
int endIndex = line.indexOf("'", startIndex); // the closing quote
String user = line.substring(startIndex, endIndex);
System.out.println("user="+user);
} else {
System.out.println("no user in line");
}
Edit: I saw there is another pattern also in use. I think you can change the above algorithm to also allow for the second one.
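As a rough sketch, the search described above can be wrapped in a small helper. It assumes the name is always written as user='...'; the class name UserExtractor and the sample log line are made up for illustration and are not taken from the real log file.

```java
import java.util.Optional;

class UserExtractor {
    private static final String MARKER = "user='";

    // Returns the text between user=' and the next ', or empty if the pattern is absent.
    static Optional<String> extractUser(String line) {
        int markerIndex = line.indexOf(MARKER);
        if (markerIndex < 0) {
            return Optional.empty();
        }
        int startIndex = markerIndex + MARKER.length(); // first character of the name
        int endIndex = line.indexOf('\'', startIndex);   // closing quote
        if (endIndex < 0) {
            return Optional.empty();                     // opening marker without a closing quote
        }
        return Optional.of(line.substring(startIndex, endIndex));
    }

    public static void main(String[] args) {
        // Hypothetical line; the real log format may differ.
        String line = "error: This association 123(account='a', user='alice', partition='p') is not usable";
        System.out.println(extractUser(line).orElse("no user in line"));
    }
}
```

Returning an Optional keeps the "no user in line" case explicit, so the caller can decide whether to skip the line or count it separately.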
This is a simple question: how do I print out the name of the selected file? Thanks.
public class CSVMAX {
public CSVRecord hottestInManyDays() {
//select many csv files from my computer
DirectoryResource dr = new DirectoryResource();
CSVRecord largestSoFar = null;
//read every row and implement the method we just define
for(File f : dr.selectedFiles()) {
FileResource fr = new FileResource(f);
CSVRecord currentRow = hottestHourInFile(fr.getCSVParser());
if (largestSoFar == null) {
largestSoFar = currentRow;
}
else {
double currentTemp = Double.parseDouble(currentRow.get("TemperatureF"));
double largestTemp = Double.parseDouble(largestSoFar.get("TemperatureF"));
//Check if currentRow’s temperature > largestSoFar’s
if (currentTemp > largestTemp) {
//If so update largestSoFar to currentRow
largestSoFar = currentRow;
}
}
}
return largestSoFar;
}
Here I want to print out the file name, but I don't know how to do that.
public void testHottestInManyDay () {
CSVRecord largest = hottestInManyDays();
System.out.println("hottest temperature on that day was in file " + ***FILENAME*** + largest.get("TemperatureF") +
" at " + largest.get("TimeEST"));
}
}
Ultimately, it seems that hottestInManyDays() will need to return this information.
Does CSVRecord have a property for that?
Something like this:
CSVRecord currentRow = hottestHourInFile(fr.getCSVParser());
currentRow.setFileName(f.getName());
If not, can such a property be added to it?
Maybe CSVRecord doesn't have that property. But it can be added?:
private String _fileName;
public void setFileName(String fileName) {
this._fileName = fileName;
}
public String getFileName() {
return this._fileName;
}
If not, can you create a wrapper class for both pieces of information?
If you can't modify CSVRecord and it doesn't have a place for the information you want, wrap it in a class which does. Something as simple as this:
class CSVWrapper {
private CSVRecord _csvRecord;
private String _fileName;
// getters and setters for the above
// maybe also a constructor? make them final? your call
}
Then return that from hottestInManyDays() instead of a CSVRecord. Something like this:
CSVWrapper csvWrapper = new CSVWrapper();
csvWrapper.setCSVRecord(currentRow);
csvWrapper.setFileName(f.getName());
Changing the method signature and return value as needed, of course.
However you do it, once it's on the return value from hottestInManyDays() you can use it in the method which consumes that:
CSVWrapper largest = hottestInManyDays();
System.out.println("hottest temperature on that day was in file " + largest.getFileName() + largest.getCSVRecord().get("TemperatureF") +
" at " + largest.getCSVRecord().get("TimeEST"));
(Note: If the bits at the very end there don't sit right as a Law Of Demeter violation, then feel free to extend the wrapper to include pass-thru operations as needed. Maybe even have it share a common interface with CSVRecord so it can be used as a drop-in replacement for one as needed elsewhere in the system.)
Referring to the line for(File f : dr.selectedFiles()):
f is a File. According to the documentation, its toString() method
"returns the pathname string of this abstract pathname. This is just
the string returned by the getPath() method."
So, in the first line inside the loop, you can put System.out.println(f.toString()); to print out the file path.
Hope this helps clear a part of the story.
Now, to update this string, I see you are using some object that is called largest in testHottestInManyDay(). You should add a filepath string in this object and set it inside the else block.
One has to return both the CSVRecord and the File, either in a newly made class or in a map.
Since a CSVRecord can be converted to a map, add the file name to that map under a new column name, here "FILENAME".
public Map<String, String> hottestInManyDays() {
//select many csv files from my computer
DirectoryResource dr = new DirectoryResource();
CSVRecord largestSoFar = null;
File fileOfLargestSoFar = null;
//read every row and implement the method we just define
for (File f : dr.selectedFiles()) {
FileResource fr = new FileResource(f);
CSVRecord currentRow = hottestHourInFile(fr.getCSVParser());
if (largestSoFar == null) {
largestSoFar = currentRow;
fileOfLargestSoFar = f;
}
else {
double currentTemp = Double.parseDouble(currentRow.get("TemperatureF"));
double largestTemp = Double.parseDouble(largestSoFar.get("TemperatureF"));
//Check if currentRow’s temperature > largestSoFar’s
if (currentTemp > largestTemp) {
//If so update largestSoFar to currentRow
largestSoFar = currentRow;
fileOfLargestSoFar = f;
}
}
}
Map<String, String> map = new HashMap<>(largestSoFar.toMap());
map.put("FILENAME", fileOfLargestSoFar.getPath());
return map;
}
Map<String, String> largest = hottestInManyDays();
System.out.println("hottest temperature on that day was in file "
+ largest.get("FILENAME") + largest.get("TemperatureF") +
" at " + largest.get("TimeEST"));
For my assignment I have to list all the courses (just the course code) that have classes in a given building on a given day such that any part of the class falls between the given times. Each course should be listed only once, even if it has several classes. I have done everything except listing each course only once. How do I ignore duplicate strings read from a file?
public void potentialDisruptions(String building, String targetDay, int targetStart, int targetEnd){
UI.printf("\nClasses in %s on %s between %d and %d%n",
building, targetDay, targetStart, targetEnd);
UI.println("=================================");
boolean containsCourse = false;
try {
Scanner scan = new Scanner(new File("classdata.txt"));
while(scan.hasNext()){
String course = scan.next();
String type= scan.next();
String day = scan.next();
int startTime = scan.nextInt();
int endTime = scan.nextInt();
String room = scan.next();
if(room.contains(building)){
if(day.contains(targetDay)){
if(endTime >= targetStart){
if( startTime<= targetEnd){
UI.printf("%s%n", course);
containsCourse = true;
}
}
}
}
}
if(!containsCourse){
UI.println("error");
}
}
catch(IOException e){
UI.println("File reading failed");
}
UI.println("=========================");
}
You can put each course token in a Set and check whether the Set already contains it before processing further, as below:
// Declaration
....
Set<String> courseSet = new HashSet<String>();
...
// Check before you process further
if(!courseSet.contains(course))
{
...
// Your Code...
...
courseSet.add(course);
}
You could put the courses in a Set and loop over them since a Set always contains unique values.
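A minimal sketch of that idea, assuming the matching course codes have already been collected into a list. The class name UniqueCourses and the course codes are invented for illustration; LinkedHashSet is used here (rather than a plain HashSet) only so that the courses print in the order they were first seen.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class UniqueCourses {
    // Keeps the first occurrence of each course code, in encounter order.
    static Set<String> unique(List<String> courses) {
        return new LinkedHashSet<>(courses);
    }

    public static void main(String[] args) {
        // Made-up codes standing in for the matches found in classdata.txt.
        List<String> matches = Arrays.asList("COMP102", "MATH151", "COMP102");
        for (String course : unique(matches)) {
            System.out.println(course); // each code is printed once
        }
    }
}
```

In the original method the same effect can be had by calling add(course) on the set inside the innermost if and printing only when add returns true.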
When I run the LensKit demo program I get this error:
[main] ERROR org.grouplens.lenskit.data.dao.DelimitedTextRatingCursor - C:\Users\sean\Desktop\ml-100k\u - Copy.data:4: invalid input, skipping line
I reworked the ML 100k data set so that it holds only these lines, although I don't see how this would affect it:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244
Here is the code I am using too:
public class HelloLenskit implements Runnable {
public static void main(String[] args) {
HelloLenskit hello = new HelloLenskit(args);
try {
hello.run();
} catch (RuntimeException e) {
System.err.println(e.getMessage());
System.exit(1);
}
}
private String delimiter = "\t";
private File inputFile = new File("C:\\Users\\sean\\Desktop\\ml-100k\\u - Copy.data");
private List<Long> users;
public HelloLenskit(String[] args) {
int nextArg = 0;
boolean done = false;
while (!done && nextArg < args.length) {
String arg = args[nextArg];
if (arg.equals("-e")) {
delimiter = args[nextArg + 1];
nextArg += 2;
} else if (arg.startsWith("-")) {
throw new RuntimeException("unknown option: " + arg);
} else {
inputFile = new File(arg);
nextArg += 1;
done = true;
}
}
users = new ArrayList<Long>(args.length - nextArg);
for (; nextArg < args.length; nextArg++) {
users.add(Long.parseLong(args[nextArg]));
}
}
public void run() {
// We first need to configure the data access.
// We will use a simple delimited file; you can use something else like
// a database (see JDBCRatingDAO).
EventDAO base = new SimpleFileRatingDAO(inputFile, "\t");
// Reading directly from CSV files is slow, so we'll cache it in memory.
// You can use SoftFactory here to allow ratings to be expunged and re-read
// as memory limits demand. If you're using a database, just use it directly.
EventDAO dao = new EventCollectionDAO(Cursors.makeList(base.streamEvents()));
// Second step is to create the LensKit configuration...
LenskitConfiguration config = new LenskitConfiguration();
// ... configure the data source
config.bind(EventDAO.class).to(dao);
// ... and configure the item scorer. The bind and set methods
// are what you use to do that. Here, we want an item-item scorer.
config.bind(ItemScorer.class)
.to(ItemItemScorer.class);
// let's use personalized mean rating as the baseline/fallback predictor.
// 2-step process:
// First, use the user mean rating as the baseline scorer
config.bind(BaselineScorer.class, ItemScorer.class)
.to(UserMeanItemScorer.class);
// Second, use the item mean rating as the base for user means
config.bind(UserMeanBaseline.class, ItemScorer.class)
.to(ItemMeanRatingItemScorer.class);
// and normalize ratings by baseline prior to computing similarities
config.bind(UserVectorNormalizer.class)
.to(BaselineSubtractingUserVectorNormalizer.class);
// There are more parameters, roles, and components that can be set. See the
// JavaDoc for each recommender algorithm for more information.
// Now that we have a factory, build a recommender from the configuration
// and data source. This will compute the similarity matrix and return a recommender
// that uses it.
Recommender rec = null;
try {
rec = LenskitRecommender.build(config);
} catch (RecommenderBuildException e) {
throw new RuntimeException("recommender build failed", e);
}
// we want to recommend items
ItemRecommender irec = rec.getItemRecommender();
assert irec != null; // not null because we configured one
// for users
for (long user: users) {
// get 10 recommendation for the user
List<ScoredId> recs = irec.recommend(user, 10);
System.out.format("Recommendations for %d:\n", user);
for (ScoredId item: recs) {
System.out.format("\t%d\n", item.getId());
}
}
}
}
I am really lost on this one and would appreciate any help. Thanks for your time.
The last line of your input file only contains one field. Each input file line needs to contain 3 or 4 fields.
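To find such lines before handing the file to LensKit, a quick stand-alone check can help. This is only a sketch in plain Java, not part of the LensKit API; the class name RatingLineCheck is hypothetical, and the 3-or-4 field rule is taken from the statement above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RatingLineCheck {
    // Returns the 1-based numbers of lines that do not have 3 or 4 delimited fields.
    static List<Integer> badLines(List<String> lines, String delimiterRegex) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            int fields = lines.get(i).trim().split(delimiterRegex).length;
            if (fields < 3 || fields > 4) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // The four lines from the question, tab-delimited.
        List<String> lines = Arrays.asList(
                "196\t242\t3\t881250949",
                "186\t302\t3\t891717742",
                "22\t377\t1\t878887116",
                "244");
        System.out.println(badLines(lines, "\t")); // reports line 4, as in the error message
    }
}
```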
DurationOfRun:5
ThreadSize:10
ExistingRange:1-1000
NewRange:5000-10000
Percentage:55 - AutoRefreshStoreCategories Data:Previous/30,New/70 UserLogged:true/50,false/50 SleepTime:5000 AttributeGet:1,16,10106,10111 AttributeSet:2060/30,10053/27
Percentage:25 - CrossPromoEditItemRule Data:Previous/60,New/40 UserLogged:true/50,false/50 SleepTime:4000 AttributeGet:1,10107 AttributeSet:10108/34,10109/25
Percentage:20 - CrossPromoManageRules Data:Previous/30,New/70 UserLogged:true/50,false/50 SleepTime:2000 AttributeGet:1,10107 AttributeSet:10108/26,10109/21
I am trying to parse the above .txt file (the first four lines are fixed; the Percentage lines can grow, so there may be more than three of them). I wrote the code below and it works, but it looks messy. Is there a better way to parse this file? And if performance matters, what is the best way to parse it?
private static int noOfThreads;
private static List<Command> commands;
public static int startRange;
public static int endRange;
public static int newStartRange;
public static int newEndRange;
private static BufferedReader br = null;
private static String sCurrentLine = null;
private static List<String> values;
private static String commandName;
private static String percentage;
private static List<String> attributeIDGet;
private static List<String> attributeIDSet;
private static LinkedHashMap<String, Double> dataCriteria;
private static LinkedHashMap<Boolean, Double> userLoggingCriteria;
private static long sleepTimeOfCommand;
private static long durationOfRun;
br = new BufferedReader(new FileReader("S:\\Testing\\PDSTest1.txt"));
values = new ArrayList<String>();
while ((sCurrentLine = br.readLine()) != null) {
if(sCurrentLine.startsWith("DurationOfRun")) {
durationOfRun = Long.parseLong(sCurrentLine.split(":")[1]);
} else if(sCurrentLine.startsWith("ThreadSize")) {
noOfThreads = Integer.parseInt(sCurrentLine.split(":")[1]);
} else if(sCurrentLine.startsWith("ExistingRange")) {
startRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[0]);
endRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[1]);
} else if(sCurrentLine.startsWith("NewRange")) {
newStartRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[0]);
newEndRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[1]);
} else {
attributeIDGet = new ArrayList<String>();
attributeIDSet = new ArrayList<String>();
dataCriteria = new LinkedHashMap<String, Double>();
userLoggingCriteria = new LinkedHashMap<Boolean, Double>();
percentage = sCurrentLine.split("-")[0].split(":")[1].trim();
values = Arrays.asList(sCurrentLine.split("-")[1].trim().split("\\s+"));
for(String s : values) {
if(s.startsWith("Data")) {
String[] data = s.split(":")[1].split(",");
for (String n : data) {
dataCriteria.put(n.split("/")[0], Double.parseDouble(n.split("/")[1]));
}
//dataCriteria.put(data.split("/")[0], value)
} else if(s.startsWith("UserLogged")) {
String[] userLogged = s.split(":")[1].split(",");
for (String t : userLogged) {
userLoggingCriteria.put(Boolean.parseBoolean(t.split("/")[0]), Double.parseDouble(t.split("/")[1]));
}
//userLogged = Boolean.parseBoolean(s.split(":")[1]);
} else if(s.startsWith("SleepTime")) {
sleepTimeOfCommand = Long.parseLong(s.split(":")[1]);
} else if(s.startsWith("AttributeGet")) {
String[] strGet = s.split(":")[1].split(",");
for(String q : strGet) attributeIDGet.add(q);
} else if(s.startsWith("AttributeSet:")) {
String[] strSet = s.split(":")[1].split(",");
for(String p : strSet) attributeIDSet.add(p);
} else {
commandName = s;
}
}
Command command = new Command();
command.setName(commandName);
command.setExecutionPercentage(Double.parseDouble(percentage));
command.setAttributeIDGet(attributeIDGet);
command.setAttributeIDSet(attributeIDSet);
command.setDataUsageCriteria(dataCriteria);
command.setUserLoggingCriteria(userLoggingCriteria);
command.setSleepTime(sleepTimeOfCommand);
commands.add(command);
Well, parsers usually are messy once you get down to the lower layers of them :-)
However, one possible improvement, at least in terms of code quality, would be to recognize the fact that your grammar is layered.
By that, I mean every line is an identifying token followed by some properties.
In the case of DurationOfRun, ThreadSize, ExistingRange and NewRange, the properties are relatively simple. Percentage is somewhat more complex but still okay.
I would structure the code as (pseudo-code):
def parseFile (fileHandle):
    while (currentLine = fileHandle.getNextLine()) != EOF:
        if currentLine.beginsWith ("DurationOfRun:"):
            processDurationOfRun (currentLine[14:])
        elsif currentLine.beginsWith ("ThreadSize:"):
            processThreadSize (currentLine[11:])
        elsif currentLine.beginsWith ("ExistingRange:"):
            processExistingRange (currentLine[14:])
        elsif currentLine.beginsWith ("NewRange:"):
            processNewRange (currentLine[9:])
        elsif currentLine.beginsWith ("Percentage:"):
            processPercentage (currentLine[11:])
        else:
            raise error
Then, in each of those processWhatever() functions, you parse the remainder of the line based on the expected format. That keeps your code small and readable and easily changed in future, without having to navigate a morass :-)
For example, processDurationOfRun() simply gets an integer from the remainder of the line:
def processDurationOfRun (line):
    this.durationOfRun = line.parseAsInt()
Similarly, the functions for the two ranges split the string on - and get two integers from the resultant values:
def processExistingRange (line):
    values[] = line.split("-")
    this.existingRangeStart = values[0].parseAsInt()
    this.existingRangeEnd = values[1].parseAsInt()
The processPercentage() function is the tricky one but that is also easily doable if you layer it as well. Assuming those things are always in the same order, it consists of:
an integer;
a literal -;
some sort of textual category; and
a series of key:value pairs.
And even these values within the pairs can be parsed by lower levels, splitting first on commas to get subvalues like Previous/30 and New/70, then splitting each of those subvalues on slashes to get individual items. That way, a logical hierarchy can be reflected in your code.
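The layering described above might look roughly like this in Java. It is a sketch under the same assumption that the parts always appear in the same order; the class and method names (PercentageLineParser, parseWeights, parseProperties) are invented here, and the input is the remainder of the first Percentage line after the "Percentage:" prefix has been stripped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PercentageLineParser {
    // Lowest layer: "Previous/30,New/70" becomes {Previous=30.0, New=70.0}.
    static Map<String, Double> parseWeights(String text) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String pair : text.split(",")) {
            String[] parts = pair.split("/");
            weights.put(parts[0], Double.parseDouble(parts[1]));
        }
        return weights;
    }

    // Middle layer: every token from index 'from' onwards is a key:value pair.
    static Map<String, String> parseProperties(String[] tokens, int from) {
        Map<String, String> props = new LinkedHashMap<>();
        for (int i = from; i < tokens.length; i++) {
            String[] kv = tokens[i].split(":", 2);
            props.put(kv[0], kv[1]);
        }
        return props;
    }

    public static void main(String[] args) {
        String rest = "55 - AutoRefreshStoreCategories Data:Previous/30,New/70 "
                + "UserLogged:true/50,false/50 SleepTime:5000 "
                + "AttributeGet:1,16,10106,10111 AttributeSet:2060/30,10053/27";
        String[] tokens = rest.trim().split("\\s+");
        int percentage = Integer.parseInt(tokens[0]);           // the integer
        String category = tokens[2];                            // tokens[1] is the literal "-"
        Map<String, String> props = parseProperties(tokens, 3); // the key:value pairs
        System.out.println(percentage + " " + category);
        System.out.println(parseWeights(props.get("Data")));
        System.out.println(Long.parseLong(props.get("SleepTime")));
    }
}
```

Each layer only knows about its own separator (whitespace, then colon, then comma and slash), which is what keeps the individual methods short.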
Unless you're expecting to parse these text files many times per second, or unless they're many megabytes in size, I'd be more concerned about the readability and maintainability of your code than the speed of the parsing.
Mostly gone are the days when we need to wring the last ounce of performance from our code but we still have problems in fixing said code in a timely manner when bugs are found or enhancements are desired.
Sometimes it's preferable to optimise for readability.
I would not worry about performance until I was sure there was actually a performance issue. Regarding the rest of the code, if you won't be adding any new line types I would not worry about it. If you do worry about it, however, a factory design pattern can help you separate the selection of the type of processing needed from the actual processing. It makes adding new line types easier without introducing as much opportunity for error.
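One lightweight way to get that separation is a lookup table from line prefix to handler. This is a simplified stand-in for a full factory, sketched here with only two of the line types; the class name LineDispatcher and its fields are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

class LineDispatcher {
    // Maps a line prefix to the code that processes the rest of that line.
    private final Map<String, Consumer<String>> handlers = new LinkedHashMap<>();
    long durationOfRun;
    int threadSize;

    LineDispatcher() {
        handlers.put("DurationOfRun:", rest -> durationOfRun = Long.parseLong(rest.trim()));
        handlers.put("ThreadSize:", rest -> threadSize = Integer.parseInt(rest.trim()));
        // A new line type is added by registering one more handler here.
    }

    // Selects the handler by prefix and passes it the remainder of the line.
    void process(String line) {
        for (Map.Entry<String, Consumer<String>> e : handlers.entrySet()) {
            if (line.startsWith(e.getKey())) {
                e.getValue().accept(line.substring(e.getKey().length()));
                return;
            }
        }
        throw new IllegalArgumentException("unrecognised line: " + line);
    }

    public static void main(String[] args) {
        LineDispatcher dispatcher = new LineDispatcher();
        dispatcher.process("DurationOfRun:5");
        dispatcher.process("ThreadSize:10");
        System.out.println(dispatcher.durationOfRun + " " + dispatcher.threadSize);
    }
}
```

The selection logic in process() never changes when a line type is added, which is the main benefit the pattern is meant to give.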
The newer and more convenient class is Scanner. You just need to modify the delimiter, and you can read data in the desired format (nextInt, nextLong) in one go, with no need for separate x.parseX calls.
Second: Split your code into small, reusable pieces. They make the program readable, and you can hide details easily.
Don't hesitate to use a struct-like class for a range, for example. Returning multiple values from a method can be done by these, without boilerplate (getter,setter,ctor).
import java.util.*;
import java.io.*;
public class ReadSampleFile
{
// struct like classes:
class PercentageRow {
public int percentage;
public String name;
public int dataPrevious;
public int dataNew;
public int userLoggedTrue;
public int userLoggedFalse;
public List<Integer> attributeGet;
public List<Integer> attributeSet;
}
class Range {
public int from;
public int to;
}
private int readInt (String name, Scanner sc) {
String s = sc.next ();
if (s.startsWith (name)) {
return sc.nextInt ();
}
// err () only prints, so throw here to satisfy the return type
throw new IllegalStateException (name + " expected, found: " + s);
}
private long readLong (String name, Scanner sc) {
String s = sc.next ();
if (s.startsWith (name)) {
return sc.nextLong ();
}
throw new IllegalStateException (name + " expected, found: " + s);
}
private Range readRange (String name, Scanner sc) {
String s = sc.next ();
if (s.startsWith (name)) {
Range r = new Range ();
r.from = sc.nextInt ();
r.to = sc.nextInt ();
return r;
}
throw new IllegalStateException (name + " expected, found: " + s);
}
private PercentageRow readPercentageLine (Scanner sc) {
// reuse above methods
PercentageRow percentageRow = new PercentageRow ();
percentageRow.percentage = readInt ("Percentage", sc);
// ...
return percentageRow;
}
public ReadSampleFile () throws FileNotFoundException
{
/* I only read from my sourcefile for convenience.
So I could scroll up to see what's the next entry.
Don't do this at home. :) The dummy later ...
*/
Scanner sc = new Scanner (new File ("./ReadSampleFile.java"));
sc.useDelimiter ("[ \r\n/,:-]+"); // "+" treats runs such as " - " as a single delimiter
// ... is the comment I had to insert.
String dummy = sc.nextLine ();
List <String> values = new ArrayList<String> ();
if (sc.hasNext ()) {
// see how nice the data structure is reflected
// by this code:
long duration = readLong ("DurationOfRun", sc);
int noOfThreads = readInt ("ThreadSize", sc);
Range eRange = readRange ("ExistingRange", sc);
Range nRange = readRange ("NewRange", sc);
List <PercentageRow> percentageRows = new ArrayList <PercentageRow> ();
// including the repetition ...
while (sc.hasNext ()) {
percentageRows.add (readPercentageLine (sc));
}
}
}
public static void main (String args[]) throws FileNotFoundException
{
new ReadSampleFile ();
}
public static void err (String msg)
{
System.out.println ("Err:\t" + msg);
}
}