Improving the code that parses a Text File


Text file (the first three lines are simple to read; the next three lines start with p):
ThreadSize:2
ExistingRange:1-1000
NewRange:5000-10000
p:55 - AutoRefreshStoreCategories Data:Previous UserLogged:true Attribute:1 Attribute:16 Attribute:2060
p:25 - CrossPromoEditItemRule Data:New UserLogged:false Attribute:1 Attribute:10107 Attribute:10108
p:20 - CrossPromoManageRules Data:Previous UserLogged:true Attribute:1 Attribute:10107 Attribute:10108
Below is the code I wrote to parse the above file. After parsing each line, I set the corresponding values using setters. Can this code be improved, in the parsing or otherwise, for example by using a regex? My main goal is to parse the file and set the corresponding values. Any feedback or suggestions will be highly appreciated.
private List<Command> commands;
private static int noOfThreads = 3;
private static int startRange = 1;
private static int endRange = 1000;
private static int newStartRange = 5000;
private static int newEndRange = 10000;
private BufferedReader br = null;
private String sCurrentLine = null;
private int distributeRange = 100;
private List<String> values = new ArrayList<String>();
private String commandName;
private static String data;
private static boolean userLogged;
private static List<Integer> attributeID = new ArrayList<Integer>();
try {
// Initialize the system
commands = new LinkedList<Command>();
br = new BufferedReader(new FileReader("S:\\Testing\\Test1.txt"));
while ((sCurrentLine = br.readLine()) != null) {
if(sCurrentLine.contains("ThreadSize")) {
noOfThreads = Integer.parseInt(sCurrentLine.split(":")[1]);
} else if(sCurrentLine.contains("ExistingRange")) {
startRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[0]);
endRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[1]);
} else if(sCurrentLine.contains("NewRange")) {
newStartRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[0]);
newEndRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[1]);
} else {
// Command line, e.g. "p:55 - Name Data:Previous UserLogged:true Attribute:1"
String[] parts = sCurrentLine.split("-");
String key = parts[0].split(":")[1].trim();
values = Arrays.asList(parts[1].trim().split("\\s+"));
// Fresh list per command; a single shared list would mix the attributes of all commands
List<Integer> attributes = new ArrayList<Integer>();
for(String s : values) {
if(s.contains("Data:")) {
data = s.split(":")[1];
} else if(s.contains("UserLogged:")) {
userLogged = Boolean.parseBoolean(s.split(":")[1]);
} else if(s.contains("Attribute:")) {
attributes.add(Integer.parseInt(s.split(":")[1]));
} else {
commandName = s;
}
}
Command command = new Command();
command.setName(commandName);
command.setExecutionPercentage(Double.parseDouble(key));
command.setAttributeID(attributes);
command.setDataCriteria(data);
command.setUserLogging(userLogged);
commands.add(command);
}
}
} catch(Exception e) {
System.out.println(e);
}

With a regex you need to know exactly what input format to expect. http://java.sun.com/developer/technicalArticles/releases/1.4regex/ should be helpful.

To answer a comment:
p:55 - AutoRefreshStoreCategories Data:Previous UserLogged:true Attribute:1 Attribute:16 Attribute:2060
To parse the line above with a regex (assuming exactly three Attribute: fields):
String parseLine = "p:55 - AutoRefreshStoreCategories Data:Previous UserLogged:true Attribute:1 Attribute:16 Attribute:2060";
Matcher m = Pattern
.compile(
"p:(\\d+)\\s-\\s(.*?)\\s+Data:(.*?)\\s+UserLogged:(.*?)\\s+Attribute:(\\d+)\\s+Attribute:(\\d+)\\s+Attribute:(\\d+)")
.matcher(parseLine);
if(m.find()) {
int p = Integer.parseInt(m.group(1));
String method = m.group(2);
String data = m.group(3);
boolean userLogged = Boolean.valueOf(m.group(4));
int at1 = Integer.parseInt(m.group(5));
int at2 = Integer.parseInt(m.group(6));
int at3 = Integer.parseInt(m.group(7));
System.out.println(p + " " + method + " " + data + " " + userLogged + " " + at1 + " " + at2 + " "
+ at3);
}
EDIT: looking at your comment, you can still use a regex:
String parseLine = "p:55 - AutoRefreshStoreCategories Data:Previous UserLogged:true "
+ "Attribute:1 Attribute:16 Attribute:2060";
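// UserLogged uses \\S+ because a lazy (.*?) at the very end of a pattern matches the empty string
Matcher m = Pattern.compile("p:(\\d+)\\s-\\s(.*?)\\s+Data:(.*?)\\s+UserLogged:(\\S+)").matcher(
parseLine);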
if(m.find()) {
for(int i = 0; i < m.groupCount(); ++i) {
System.out.println(m.group(i + 1));
}
}
Matcher m2 = Pattern.compile("Attribute:(\\d+)").matcher(parseLine);
while(m2.find()) {
System.out.println("Attribute matched: " + m2.group(1));
}
But this only works if the text "Attribute:" does not appear before the "real" attributes (for example, as part of the method name after p).

You can use the Scanner class. It has some helper methods to read text files
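As a rough illustration of the Scanner idea, the sketch below reads the two range lines from the question. The class name ScannerHeaderSketch and the method parseRange are made up for this example, and it assumes the values are non-negative integers separated by ':' and '-'.

```java
import java.util.Scanner;

final class ScannerHeaderSketch {

    /** Parses a line such as "ExistingRange:1-1000" into {start, end}. */
    static int[] parseRange(String line) {
        // ':' and '-' act as delimiters, so the tokens are: key, start, end
        try (Scanner sc = new Scanner(line).useDelimiter("[:-]")) {
            sc.next();                 // skip the key, e.g. "ExistingRange"
            int start = sc.nextInt();
            int end = sc.nextInt();
            return new int[] {start, end};
        }
    }

    public static void main(String[] args) {
        // Stand-in for the file content; a Scanner over a File works the same way
        String text = "ExistingRange:1-1000\nNewRange:5000-10000";
        try (Scanner lines = new Scanner(text)) {
            while (lines.hasNextLine()) {
                int[] range = parseRange(lines.nextLine());
                System.out.println(range[0] + " to " + range[1]);
            }
        }
    }
}
```

Because '-' is used as a delimiter here, this sketch would not handle negative numbers.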

I would turn this inside out. Presently you are:
Scanning the line for a keyword, which means scanning the entire line whenever the keyword isn't found. That is the usual case, as you have a number of keywords to process and they won't all be present on every line.
Scanning the entire line again for ':' and splitting it on all occurrences.
Mostly parsing the part after ':' as an integer, or occasionally as a range.
That makes several complete scans of each line. Unless the file has zillions of lines this isn't a concern in itself, but it demonstrates that you have got the processing back to front.
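One possible reading of "inside out" is to find the first ':' once, take the text before it as the key, and only then decide how to parse the rest. The sketch below is an assumption about what that could look like; the class KeyFirstParserSketch and its fields are hypothetical and only cover part of the file format.

```java
import java.util.ArrayList;
import java.util.List;

final class KeyFirstParserSketch {
    int threadSize;
    int existingStart;
    int existingEnd;
    final List<String> commandLines = new ArrayList<>();

    void parseLine(String line) {
        int colon = line.indexOf(':');       // a single scan finds the key
        if (colon < 0) {
            return;                          // not a key:value line, ignore it
        }
        String key = line.substring(0, colon).trim();
        String value = line.substring(colon + 1).trim();
        switch (key) {
            case "ThreadSize":
                threadSize = Integer.parseInt(value);
                break;
            case "ExistingRange": {
                int dash = value.indexOf('-');
                existingStart = Integer.parseInt(value.substring(0, dash).trim());
                existingEnd = Integer.parseInt(value.substring(dash + 1).trim());
                break;
            }
            case "p":
                // Keep the rest of the command line for further parsing
                commandLines.add(value);
                break;
            default:
                break;                       // unknown key, ignore it
        }
    }
}
```

The key is looked at exactly once per line, and each branch parses only the value it knows how to handle.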

Related

How can I scope three different conditions using the same loop in Java?

I would like to count countSign, countX and countI using the same loop instead of creating three different loops. Is there an easy way to do that?
public class Absence {
private static File file = new File("/Users/naplo.txt");
private static File file_out = new File("/Users/naplo_out.txt");
private static BufferedReader br = null;
private static BufferedWriter bw = null;
public static void main(String[] args) throws IOException {
int countSign = 0;
int countX = 0;
int countI = 0;
String sign = "#";
String absenceX = "X";
String absenceI = "I";
try {
br = new BufferedReader(new FileReader(file));
bw = new BufferedWriter(new FileWriter(file_out));
String st;
while ((st = br.readLine()) != null) {
for (String element : st.split(" ")) {
if (element.matches(sign)) {
countSign++;
continue;
}
if (element.matches(absenceX)) {
countX++;
continue;
}
if (element.matches(absenceI)) {
countI++;
}
}
}
System.out.println("2. exerc.: There are " + countSign + " rows in the file with that sign.");
System.out.println("3. exerc.: There are " + countX + " with sick note, and " + countI + " without sick note!");
} catch (FileNotFoundException ex) {
Logger.getLogger(Absence.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
text file example:
# 03 26
Jujuba Ibolya IXXXXXX
Maracuja Kolos XXXXXXX

I think you meant using fewer than 3 if statements. You can actually do it with no ifs.
In your for loop write this:
countSign += (element.matches(sign)) ? 1 : 0;
countX += (element.matches(absenceX)) ? 1 : 0;
countI += (element.matches(absenceI)) ? 1 : 0;

Both answers check if the word (element) matches all regular expressions while this can (and should, if you ask me) be avoided since a word can match only one regex. I am referring to the continue part your original code has, which is good since you do not have to do any further checks.
So, I am leaving here one way to do it with Java 8 Streams in "one liner".
But let's assume the following regular expressions:
String absenceX = "X*";
String absenceI = "I.*";
and one more (for the sake of the example):
String onlyNumbers = "[0-9]*";
In order to have some matches on them.
The text is as you gave it.
public class Test {
public static void main(String[] args) throws IOException {
File desktop = new File(System.getProperty("user.home"), "Desktop");
File txtFile = new File(desktop, "test.txt");
String sign = "#";
String absenceX = "X*";
String absenceI = "I.*";
String onlyNumbers = "[0-9]*";
List<String> regexes = Arrays.asList(sign, absenceX, absenceI, onlyNumbers);
List<String> lines = Files.readAllLines(txtFile.toPath());
//#formatter:off
Map<String, Long> result = lines.stream()
.flatMap(line-> Stream.of(line.split(" "))) //map these lines to words
.map(word -> regexes.stream().filter(word::matches).findFirst()) //find the first regex this word matches
.filter(Optional::isPresent) //If it matches no regex, it will be ignored
.collect(Collectors.groupingBy(Optional::get, Collectors.counting())); //collect
System.out.println(result);
}
}
The result:
{X*=1, #=1, I.*=2, [0-9]*=2}
X*=1 came from word: XXXXXXX
#=1 came from word: #
I.*=2 came from words: IXXXXXX and Ibolya
[0-9]*=2 came from words: 03 and 26
Ignore the fact I load all lines in memory.

So I got it to work with the following lines. It had escaped my attention that every character needs to be separated from the others. Your ternary operator suggestion is also nice, so I will use it.
String myString;
while ((myString = br.readLine()) != null) {
String newString = myString.replaceAll("", " ").trim();
for (String element : newString.split(" ")) {
countSign += (element.matches(sign)) ? 1 : 0;
countX += (element.matches(absenceX)) ? 1 : 0;
countI += (element.matches(absenceI)) ? 1 : 0;
}
}
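If the goal is to count single characters, inserting spaces and splitting may not be needed at all. The following is only a sketch with a hypothetical helper, countChar, which walks over the characters of a line directly. Like the loop above, it counts every occurrence in the line, including letters inside names.

```java
final class AbsenceCountSketch {

    /** Counts how often the character c occurs in the given line. */
    static int countChar(String line, char c) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String line = "Jujuba Ibolya IXXXXXX";
        System.out.println("X: " + countChar(line, 'X'));
        System.out.println("I: " + countChar(line, 'I'));
    }
}
```

In the reading loop, the three counters can then be increased with countChar(myString, '#'), countChar(myString, 'X') and countChar(myString, 'I').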

Overall count for substrings in a string java

I have a program which takes tweets from Twitter that contain a specific word and searches through each tweet to count the occurrences of other words that relate to the topic (in this case the main word is cameron and it's searching for tax and panama). I have it working so that it counts the occurrences in each individual tweet, but I can't work out how to get a cumulative count of all the occurrences. I've played around with incrementing a variable when the word occurs, but it doesn't seem to work. The code is below; I've taken out my Twitter API keys for obvious reasons.
public class TwitterWordCount {
public static void main(String[] args) {
ConfigurationBuilder configBuilder = new ConfigurationBuilder();
configBuilder.setOAuthConsumerKey(XXXXXXXXXXXXXXXXXX);
configBuilder.setOAuthConsumerSecret(XXXXXXXXXXXXXXXXXX);
configBuilder.setOAuthAccessToken(XXXXXXXXXXXXXXXXXX);
configBuilder.setOAuthAccessTokenSecret(XXXXXXXXXXXXXXXXXX);
//create instance of twitter for searching etc.
TwitterFactory tf = new TwitterFactory(configBuilder.build());
Twitter twitter = tf.getInstance();
//build query
Query query = new Query("cameron");
//number of results pulled each time
query.setCount(100);
//set the language of the tweets that we want
query.setLang("en");
//Execute the query
QueryResult result;
try {
result = twitter.search(query);
//Get the results
List<Status> tweets = result.getTweets();
//Print out the information
for (Status tweet : tweets) {
//get information about the tweet
String userName = tweet.getUser().getName();
long userId = tweet.getUser().getId();
Date creationDate = tweet.getCreatedAt();
String tweetText = tweet.getText();
//print out the information
System.out.println();
System.out.println("Tweeted by " + userName + "(" + userId + ") on date " + creationDate);
System.out.println("Tweet: " + tweetText);
// System.out.println();
String s = tweetText;
Pattern pattern = Pattern.compile("\\w+");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
System.out.print(matcher.group() + " ");
}
String str = s;
String findStr = "tax";
int lastIndex = 0;
int count = 0;
//int countall = 0;
while (lastIndex != -1) {
lastIndex = str.indexOf(findStr, lastIndex);
if (lastIndex != -1) {
count++;
lastIndex += findStr.length();
//countall++;
}
}
System.out.println();
System.out.println(findStr + " = " + count);
String two = tweetText;
String str2 = two;
String findStr2 = "panama";
int lastIndex2 = 0;
int count2 = 0;
while (lastIndex2 != -1) {
lastIndex2 = str2.indexOf(findStr2, lastIndex2);
if (lastIndex2 != -1) {
count2++;
lastIndex2 += findStr2.length();
}
}
System.out.println(findStr2 + " = " + count2);
}
}
catch (TwitterException ex) {
ex.printStackTrace();
}
}
}
I'm also aware that this definitely isn't the cleanest of programs, it's work in progress!

You must define your count variables outside of the for loop.
int countKeyword1 = 0;
int countKeyword2 = 0;
for (Status tweet : tweets) {
    // increase the count variables in your while loops
}
System.out.println("Keyword1 occurrences : " + countKeyword1);
System.out.println("Keyword2 occurrences : " + countKeyword2);
System.out.println("All occurrences : " + (countKeyword1 + countKeyword2));
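To make the idea concrete, here is a small self-contained sketch. It assumes the tweet texts are already available as plain strings (the list below is made-up sample data, not real Twitter output), and the helper countOccurrences is a hypothetical name that wraps the indexOf loop from the question.

```java
import java.util.Arrays;
import java.util.List;

final class KeywordTotalsSketch {

    /** Counts non-overlapping occurrences of word in text. */
    static int countOccurrences(String text, String word) {
        if (word.isEmpty()) {
            return 0;                        // avoid an endless loop on ""
        }
        int count = 0;
        int index = text.indexOf(word);
        while (index != -1) {
            count++;
            index = text.indexOf(word, index + word.length());
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up stand-ins for tweet.getText()
        List<String> tweets = Arrays.asList(
                "tax tax panama", "nothing relevant", "panama papers and tax");
        // Totals live outside the loop, so they survive from tweet to tweet
        int totalTax = 0;
        int totalPanama = 0;
        for (String text : tweets) {
            totalTax += countOccurrences(text, "tax");
            totalPanama += countOccurrences(text, "panama");
        }
        System.out.println("tax = " + totalTax);
        System.out.println("panama = " + totalPanama);
    }
}
```

The search is case-sensitive, as in the question; lower-casing the text first would be one way to change that.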

Resetting Java while loop

I currently have two loops: one that reads a time stamp, and a nested while loop that finds the mapped records for that time stamp and outputs them in a certain order.
The issue is that I am looping through a text file and want the second loop to start reading the file from the beginning again whenever isdone is "N". However, that does not happen.
Code so far:
public static void organiseFile() throws FileNotFoundException {
String directory = "C:\\Users\\xxx\\Desktop\\Files\\ex1";
Scanner fileIn = new Scanner(new File(directory + "_temp.txt"));
Scanner readIn = new Scanner(new File(directory + ".txt"));
PrintWriter out = new PrintWriter(directory + "_ordered.txt");
ArrayList<String> lines = new ArrayList<String>();
String readTimeStamp = "";
String timeStampMapping = "";
String outputFirst = "";
String outputSecond = "";
String outputThird = "";
String previousTimeStamp = "";
String doneList = "";
String isdone = "";
int counter = 1;
// Loop to get time stamps
while(fileIn.hasNextLine()) {
readTimeStamp = fileIn.nextLine();
if(readTimeStamp != null && readTimeStamp.trim().length() > 0) {
readTimeStamp = readTimeStamp.substring(12, 25);
System.out.println(readTimeStamp);
// Previous time stamp found, no need to loop through it again
if(doneList.contains(readTimeStamp))
isdone = "Y";
// Counter in place to stop outputting the first record, otherwise output file and clear variables down
else if(!previousTimeStamp.equals(readTimeStamp) && counter > 1) {
out.println(outputFirst + outputSecond + outputThird);
System.out.println("Outputting....");
outputFirst = "";
outputSecond = "";
outputThird = "";
counter = 1;
}
// New time stamp found, start finding values in second loop
else
isdone = "N";
// Secondary loop to find match of record
while(readIn.hasNextLine() && isdone.equals("N")) {
System.out.println("Mapping...");
timeStampMapping = readIn.nextLine();
System.out.println(timeStampMapping);
// When a record has been found with matching time stamps, start ordering
if(timeStampMapping.contains(readTimeStamp)) {
previousTimeStamp = readTimeStamp;
System.out.println(previousTimeStamp);
if(timeStampMapping.contains("[EVENT=agentStateEvent]")) {
outputFirst += timeStampMapping + "\r\n";
} else if(timeStampMapping.contains("[EVENT=TerminalConnectionCreated]")) {
outputSecond += timeStampMapping + "\r\n";
} else {
outputThird += timeStampMapping + "\r\n";
doneList += readTimeStamp + ",";
}
counter++;
}
}
}
}
System.out.println("Outputting final record");
out.println(outputFirst + outputSecond + outputThird);
System.out.println("Complete!");
out.close();
}

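A Scanner cannot be rewound to the beginning of a file. Scanner.reset() only resets the delimiter, locale and radix, not the read position. To read the file from the start again, close the Scanner and open a new one. For example, after your second while-loop include:
if (isdone.equals("N")) {
    // The inner loop has consumed the file, so reopen it for the next time stamp
    readIn.close();
    readIn = new Scanner(new File(directory + ".txt"));
}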
Btw: why are you using String for isdone instead of boolean??
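If the mapping file fits in memory, another option is to read it once into a list and search that list for each time stamp, so that nothing has to be rewound. This is only a sketch under that assumption; the class RescanSketch, the method linesFor and the sample lines are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RescanSketch {

    /** Returns every line that contains the given time stamp. */
    static List<String> linesFor(List<String> allLines, String timeStamp) {
        List<String> matches = new ArrayList<>();
        // Iterating a list always starts at the first element, so no reset is needed
        for (String line : allLines) {
            if (line.contains(timeStamp)) {
                matches.add(line);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Stand-in for the file content, e.g. loaded once with Files.readAllLines
        List<String> allLines = Arrays.asList(
                "10:00:01 [EVENT=agentStateEvent]",
                "10:00:02 [EVENT=other]",
                "10:00:01 [EVENT=TerminalConnectionCreated]");
        for (String stamp : Arrays.asList("10:00:01", "10:00:02")) {
            System.out.println(stamp + " -> " + linesFor(allLines, stamp));
        }
    }
}
```

The ordering by event type from the question would then be applied to the returned list for each time stamp.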

Extracting certain pattern from log using Java

I want to extract a piece of information from a log file. The pattern that I am using is the prompt of the node-name and the command. I want to extract information of the command output and compare them. Consider the sample output as follows
NodeName > command1
this is the sample output
NodeName > command2
this is the sample output
I have tried the following code.
public static void searchcommand( String strLineString)
{
String searchFor = "Nodename> command1";
String endStr = "Nodename";
String op="";
int end=0;
int len = searchFor.length();
int result = 0;
if (len > 0) {
int start = strLineString.indexOf(searchFor);
while(start!=-1){
end = strLineString.indexOf(endStr,start+len);
if(end!=-1){
op=strLineString.substring(start, end);
}else{
op=strLineString.substring(start, strLineString.length());
}
String[] arr = op.split("%%%%%%%");
for (String z : arr) {
System.out.println(z);
}
start = strLineString.indexOf(searchFor,start+len);
}
}
}
The issue is that the code is too slow to extract the data. Is there any other way to do so?
EDIT 1
It's a log file, which I have read into a string in the above code.

My suggestion..
public static void main(String[] args) {
    String log = "NodeName > command1 \n" + "this is the sample output \n"
            + "NodeName > command2 \n" + "this is the sample output";
    String[] lines = log.split("\\r?\\n");
    boolean record = false;
    StringBuilder statements = new StringBuilder();
    for (String line : lines) {
        if (line.startsWith("NodeName")) {
            if (record) {
                // process the output of the previous command
                System.out.println(statements);
            }
            record = true; // keep recording after every prompt line
            statements.setLength(0); // reset for the next command
            continue;
        }
        if (record) {
            statements.append(line);
        }
    }
    if (record) {
        // process the output of the last command
        System.out.println(statements);
    }
}

Here is my suggestion:
Use a regular expression. Here is one:
final String input = " NodeName > command1\n" +
"\n" +
" this is the sample output1 \n" +
"\n" +
" NodeName > command2 \n" +
"\n" +
" this is the sample output2";
final String regex = ".*?NodeName > command(\\d)(.*?)(?=NodeName|\\z)";
final Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(input);
while(matcher.find()) {
System.out.println(matcher.group(1));
System.out.println(matcher.group(2).trim());
}
Output:
1
this is the sample output1
2
this is the sample output2
So, to break down the regex:
First, it skips all characters until it finds the first "NodeName > command", followed by a digit. We want to keep this digit, to know which command created the output. Next, we grab all the following characters until we find (using a lookahead) another NodeName or the end of the input.

Example using WikipediaTokenizer in Lucene

I want to use WikipediaTokenizer in a Lucene project - http://lucene.apache.org/java/3_0_2/api/contrib-wikipedia/org/apache/lucene/wikipedia/analysis/WikipediaTokenizer.html But I have never used Lucene. I just want to convert a Wikipedia string into a list of tokens. However, I see that there are only four methods available in this class: end, incrementToken, reset and reset(reader). Can someone point me to an example of how to use it?
Thank you.

In Lucene 3.0, the next() method has been removed. You should now use incrementToken() to iterate through the tokens; it returns false when you reach the end of the input stream. To obtain each token, use the methods of the AttributeSource class. Depending on the attributes that you want to obtain (term, type, payload, etc.), add the class of the corresponding attribute to your tokenizer using the addAttribute method.
The following partial code sample is from the test class of WikipediaTokenizer, which you can find if you download the Lucene source code.
...
WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
int count = 0;
int numItalics = 0;
int numBoldItalics = 0;
int numCategory = 0;
int numCitation = 0;
TermAttribute termAtt = tf.addAttribute(TermAttribute.class);
TypeAttribute typeAtt = tf.addAttribute(TypeAttribute.class);
while (tf.incrementToken()) {
String tokText = termAtt.term();
//System.out.println("Text: " + tokText + " Type: " + token.type());
String expectedType = (String) tcm.get(tokText);
assertTrue("expectedType is null and it shouldn't be for: " + tf.toString(), expectedType != null);
assertTrue(typeAtt.type() + " is not equal to " + expectedType + " for " + tf.toString(), typeAtt.type().equals(expectedType) == true);
count++;
if (typeAtt.type().equals(WikipediaTokenizer.ITALICS) == true){
numItalics++;
} else if (typeAtt.type().equals(WikipediaTokenizer.BOLD_ITALICS) == true){
numBoldItalics++;
} else if (typeAtt.type().equals(WikipediaTokenizer.CATEGORY) == true){
numCategory++;
}
else if (typeAtt.type().equals(WikipediaTokenizer.CITATION) == true){
numCitation++;
}
}
...

WikipediaTokenizer tf = new WikipediaTokenizer(new StringReader(test));
Token token = new Token();
token = tf.next(token);
http://www.javadocexamples.com/java_source/org/apache/lucene/wikipedia/analysis/WikipediaTokenizerTest.java.html
Regards

public class WikipediaTokenizerTest {
static Logger logger = Logger.getLogger(WikipediaTokenizerTest.class);
protected static final String LINK_PHRASES = "click [[link here again]] click [http://lucene.apache.org here again] [[Category:a b c d]]";
public WikipediaTokenizer testSimple() throws Exception {
String text = "This is a [[Category:foo]]";
return new WikipediaTokenizer(new StringReader(text));
}
public static void main(String[] args){
WikipediaTokenizerTest wtt = new WikipediaTokenizerTest();
try {
WikipediaTokenizer x = wtt.testSimple();
logger.info(x.hasAttributes());
Token token = new Token();
int count = 0;
int numItalics = 0;
int numBoldItalics = 0;
int numCategory = 0;
int numCitation = 0;
while (x.incrementToken() == true) {
logger.info("seen something");
}
} catch(Exception e){
logger.error("Exception while tokenizing Wiki Text: " + e.getMessage());
}
}
}
