How to extract key phrases from a given text with OpenNLP? - java

I'm using Apache OpenNLP and I'd like to extract the key phrases of a given text. I'm already gathering entities, but I would also like to have key phrases.
The problem I have is that I can't use TF-IDF, because I don't have models for that and I only have a single text (not multiple documents).
Here is some code (a prototype, not very clean):
public List<KeywordsModel> extractKeywords(String text, NLPProvider pipeline) {
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(pipeline.getSentencedetecto("en"));
    TokenizerME tokenizer = new TokenizerME(pipeline.getTokenizer("en"));
    POSTaggerME posTagger = new POSTaggerME(pipeline.getPosmodel("en"));
    ChunkerME chunker = new ChunkerME(pipeline.getChunker("en"));
    ArrayList<String> stopwords = pipeline.getStopwords("en");

    Span[] sentSpans = sentenceDetector.sentPosDetect(text);
    Map<String, Float> results = new LinkedHashMap<>();
    SortedMap<String, Float> sortedData = new TreeMap(new MapSort.FloatValueComparer(results));

    float sentenceCounter = sentSpans.length;
    float prominenceVal = 0;
    int sentences = sentSpans.length;

    for (Span sentSpan : sentSpans) {
        prominenceVal = sentenceCounter / sentences;
        sentenceCounter--;
        String sentence = sentSpan.getCoveredText(text).toString();
        int start = sentSpan.getStart();
        Span[] tokSpans = tokenizer.tokenizePos(sentence);
        String[] tokens = new String[tokSpans.length];
        for (int i = 0; i < tokens.length; i++) {
            tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
        }
        String[] tags = posTagger.tag(tokens);
        Span[] chunks = chunker.chunkAsSpans(tokens, tags);
        for (Span chunk : chunks) {
            if ("NP".equals(chunk.getType())) {
                int npstart = start + tokSpans[chunk.getStart()].getStart();
                int npend = start + tokSpans[chunk.getEnd() - 1].getEnd();
                String potentialKey = text.substring(npstart, npend);
                if (!results.containsKey(potentialKey)) {
                    boolean hasStopWord = false;
                    String[] pKeys = potentialKey.split("\\s+");
                    if (pKeys.length < 3) {
                        for (String pKey : pKeys) {
                            for (String stopword : stopwords) {
                                if (pKey.toLowerCase().matches(stopword)) {
                                    hasStopWord = true;
                                    break;
                                }
                            }
                            if (hasStopWord) {
                                break;
                            }
                        }
                    } else {
                        hasStopWord = true;
                    }
                    if (!hasStopWord) {
                        int count = StringUtils.countMatches(text, potentialKey);
                        results.put(potentialKey, (float) (Math.log(count) / 100) + (float) (prominenceVal / 5));
                    }
                }
            }
        }
    }
    sortedData.putAll(results);
    System.out.println(sortedData);
    return null;
}
What it basically does is return the noun phrases, scored by a prominence value (how early in the text the phrase first appears) plus its occurrence count, and sorted by that score.
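For example, with a ten-sentence text and a noun phrase that first appears in sentence 1 and occurs three times overall, the score is ln(3)/100 + (10/10)/5 ≈ 0.011 + 0.2 ≈ 0.21, so the position term dominates the count term.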
But honestly, this doesn't work very well.
I also tried it with the Lucene analyzer, but the results were not good either.
So, how can I achieve what I want to do? I already know of KEA/Maui-indexer etc. (but I'm afraid I can't use them because of the GPL :( )
Also interesting: which other algorithms can I use instead of TF-IDF?
Example:
This text: http://techcrunch.com/2015/09/04/etsys-pulling-the-plug-on-grand-st-at-the-end-of-this-month/
Good output in my opinion: Etsy, Grand St., solar chargers, maker marketplace, tech hardware

Finally, I found something:
https://github.com/srijiths/jtopia
It uses the POS taggers from OpenNLP/Stanford NLP and has an ASL 2.0 license. I haven't measured precision and recall yet, but in my opinion it delivers great results.
Here is my code:
Configuration.setTaggerType("openNLP");
Configuration.setSingleStrength(6);
Configuration.setNoLimitStrength(5);
// if tagger type is "openNLP" then give the openNLP POS tagger path
//Configuration.setModelFileLocation("model/openNLP/en-pos-maxent.bin");
// if tagger type is "default" then give the default POS lexicon file
//Configuration.setModelFileLocation("model/default/english-lexicon.txt");
// if tagger type is "stanford"
Configuration.setModelFileLocation("Dont need that here");
Configuration.setPipeline(pipeline);

TermsExtractor termExtractor = new TermsExtractor();
TermDocument topiaDoc = termExtractor.extractTerms(text);
//logger.info("Extracted terms : " + topiaDoc.getExtractedTerms());

Map<String, ArrayList<Integer>> finalFilteredTerms = topiaDoc.getFinalFilteredTerms();
List<KeywordsModel> keywords = new ArrayList<>();
for (Map.Entry<String, ArrayList<Integer>> e : finalFilteredTerms.entrySet()) {
    KeywordsModel keyword = new KeywordsModel();
    keyword.setLabel(e.getKey());
    keywords.add(keyword);
}
I modified jtopia's Configuration class a bit so that the POSModel is loaded from the pipeline instance instead of from a file on disk.
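A rough sketch of the kind of change I mean is below; the field, getter, and the spot where the model is loaded are illustrative guesses, not jtopia's actual internals:

// Inside jtopia's Configuration class (hypothetical sketch, not the library's real code):
private static NLPProvider pipeline;

public static void setPipeline(NLPProvider p) { pipeline = p; }
public static NLPProvider getPipeline() { return pipeline; }

// ...and wherever jtopia would normally load the POSModel from
// Configuration.getModelFileLocation(), it now asks the pipeline instead:
POSModel posModel = Configuration.getPipeline().getPosmodel("en");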

Related

Cannot iterate through CSV columns

I'm building a stock screener that applies a calculation to each column of a CSV file. However, when I run the for loop, I only get one result back.
String path = "C:/Users/0/Desktop/Git/Finance/Data/NQ100.csv";
Reader buf = Files.newBufferedReader(Paths.get(path));
CSVParser parsed = new CSVParser(buf, CSVFormat.DEFAULT.withFirstRecordAsHeader()
        .withIgnoreHeaderCase().withTrim());

// Parse tickers
Map<String, Integer> header = parsed.getHeaderMap();
List<String> tickerList = new ArrayList<>(header.keySet());

for (int x = 1; x < tickerList.size(); x++) {    // <----------------------- PROBLEM
    // Accessing closing price by header names
    List<Double> closeList = new ArrayList<>();
    for (CSVRecord record : parsed) {
        String stringClose = record.get(x);
        Double close = Double.valueOf(stringClose);
        closeList.add(close);
    }

    // Percentage change
    List<Double> pctList = new ArrayList<>();
    for (int i = 1; i < closeList.size(); i++) {
        Double pct = closeList.get(i) / closeList.get(i - 1) - 1;
        pctList.add(pct);
    }

    // Statistics
    Double sum = 0.0, var = 0.0, mean, sd, rfr, sr;

    // Mean
    for (Double num : pctList) sum += num;
    mean = sum / pctList.size();

    // Standard deviation
    for (Double num : pctList) var += Math.pow(num - mean, 2);
    sd = Math.sqrt(var / pctList.size());

    // Risk-free rate
    rfr = Math.pow((1 + 0.03), (1 / 252.0)) - 1;

    // Sharpe ratio
    sr = Math.sqrt(252) * ((mean - rfr) / sd);

    System.out.println(tickerList.get(x) + " " + sr);
}
My data looks like this:
,AAL,AAPL,ADBE
2007-10-25,26.311651,23.141403,47.200001
2007-10-26,26.273216,23.384495,47.0
2007-10-29,26.004248,23.43387,47.0
So I was expecting:
AAL XXX
AAPL XXX
ADBE XXX
But I got just:
AAL 0.3604941921663456
Would be grateful if you guys can help me find the problem!
You can iterate through an Iterable in Java only once; in your case, CSVParser parsed implements Iterable<CSVRecord>.
So you only iterate through it the first time, when you calculate the statistics for AAL; while analyzing the data for AAPL and ADBE it is treated as empty.
You can handle this by introducing a helper list initialized from parsed. Add the following code before the for loop (it can be a one-liner in Java 8, of course, but this version works for earlier versions too):
List<CSVRecord> records = new ArrayList<>();
for (CSVRecord record : parsed) {
    records.add(record);
}
And change the line:
for (CSVRecord record : parsed) {
to:
for (CSVRecord record : records) {
For the CSV you've provided, you will then get the following output:
AAL -21.583101145880306
AAPL 23.417753561072438
ADBE -16.75343297000953
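As a side note, Commons CSV can also materialize all the records in one call, which replaces the manual copy loop above:
List<CSVRecord> records = parsed.getRecords();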
Here's a block of code that works for me. If I understand your question correctly, you only want to "read" each column and row from a CSV file. Hope it helps.
// archivo, br, line, a and cvsSplitBy are declared elsewhere; a != 0 skips the header row
br = new BufferedReader(new InputStreamReader(new FileInputStream(archivo), "UTF8"));
while ((line = br.readLine()) != null) {
    if (a != 0) {
        String[] datos = line.split(cvsSplitBy);
        System.out.println(datos[0] + " - " + datos[1] + " - " + datos[2]);
    }
    a++;
}

How to Loop next element in hashmap

I have a set of strings like this:
A_2007-04, A_2007-09, A_Agent, A_Daily, A_Execute, A_Exec, B_Action, B_HealthCheck
I want the output to be:
Key = A, Value = [2007-04,2007-09,Agent,Execute,Exec]
Key = B, Value = [Action,HealthCheck]
I'm using a HashMap to do this, where:
pckg: {A, B}
count: total number of strings
reports: set of strings
The logic I used is a nested loop:
for (String l : reports[i]) {
    for (String r : pckg) {
        String[] g = l.split("_");
        if (g[0].equalsIgnoreCase(r)) {
            report.add(g[1]);
            dirFiles.put(g[0], report);
        } else {
            break;
        }
    }
}
I'm getting the output as:
Key = A, Value = [2007-04,2007-09,Agent,Execute,Exec]
How do I get the second key?
Can someone suggest logic for this?
Assuming that you use Java 8, it can be done using computeIfAbsent, which initializes the List of values when the key is new, as follows:
List<String> tokens = Arrays.asList(
        "A_2007-04", "A_2007-09", "A_Agent", "A_Daily", "A_Execute",
        "A_Exec", "P_Action", "P_HealthCheck"
);
Map<String, List<String>> map = new HashMap<>();
for (String token : tokens) {
    String[] g = token.split("_");
    map.computeIfAbsent(g[0], key -> new ArrayList<>()).add(g[1]);
}
In terms of raw code this should do what I think you are trying to achieve:
// Create a collection of String any way you like, but for testing
// I've simply split a flat string into an array.
String flatString = "A_2007-04,A_2007-09,A_Agent,A_Daily,A_Execute,A_Exec,"
        + "P_Action,P_HealthCheck";
String[] reports = flatString.split(",");

Map<String, List<String>> mapFromReportKeyToValues = new HashMap<>();
for (String report : reports) {
    int underscoreIndex = report.indexOf("_");
    String key = report.substring(0, underscoreIndex);
    String newValue = report.substring(underscoreIndex + 1);
    List<String> existingValues = mapFromReportKeyToValues.get(key);
    if (existingValues == null) {
        // This key hasn't been seen before, so create a new list
        // to contain values which belong under this key.
        existingValues = new ArrayList<>();
        mapFromReportKeyToValues.put(key, existingValues);
    }
    existingValues.add(newValue);
}
System.out.println("Generated map:\n" + mapFromReportKeyToValues);
Though I recommend tidying it up and organising it into a method or methods as fits your project code.
Doing this with a Map<String, ArrayList<String>> is another good approach, I think:
String reports[] = {"A_2007-04", "A_2007-09", "A_Agent", "A_Daily",
        "A_Execute", "A_Exec", "P_Action", "P_HealthCheck"};
Map<String, ArrayList<String>> map = new HashMap<>();
for (String rep : reports) {
    String s[] = rep.split("_");
    String prefix = s[0], suffix = s[1];
    ArrayList<String> list = new ArrayList<>();
    if (map.containsKey(prefix)) {
        list = map.get(prefix);
    }
    list.add(suffix);
    map.put(prefix, list);
}

// Print
for (Map.Entry<String, ArrayList<String>> entry : map.entrySet()) {
    String key = entry.getKey();
    ArrayList<String> valueList = entry.getValue();
    System.out.println(key + " " + valueList);
}
for (String l : reports[i]) {
    String[] g = l.split("_");
    for (String r : pckg) {
        if (g[0].equalsIgnoreCase(r)) {
            report = dirFiles.get(g[0]);
            if (report == null) { report = new ArrayList<String>(); } // create new report
            report.add(g[1]);
            dirFiles.put(g[0], report);
        }
    }
}
I removed the else part of the if condition. You were using break there, which exits the inner loop, so you never got to evaluate the keys beyond the first one.
I added a check for existing values, as suggested by Orin2005.
I also moved the statement String[] g = l.split("_"); outside the inner loop so that it doesn't get executed multiple times.

Insert a bulleted list from an ArrayList Apache POI XWPF

I have an ArrayList that I want to use to create a new bullet list inside a document.
I already have numbering (with numbers) and I want to have both (numbers and bullets) in different lists.
My document is pre-populated with some data, and I have tokens that determine where my data goes. For my list, the token looks like the one below, and I am able to reach it.
{{tokenlist1}}
I want to do one of the following:
First option: reach my token, create a new bullet list and delete my token.
Second option: replace my token with my first element and continue my bullet list.
It would be really appreciated if the bullet style (square, round, check, ...) could stay the same as it is with the token.
EDIT
For those who want an answer, here's my solution.
Action
Map<String, Object> replacements = new HashMap<String, Object>();
replacements.put("{{token1}}", "changed text 1");
replacements.put("{{token2}}", "here is the text of token number 2");
replacements.put("{{tokenList1}}", tokenList1);
replacements.put("{{tokenList2}}", tokenList1);
templateWithToken = reportService.findAndReplaceToken(replacements, templateWithToken);
Service
public XWPFDocument findAndReplaceToken(Map<String, Object> replacements,
        XWPFDocument document) {
    List<XWPFParagraph> paragraphs = document.getParagraphs();
    for (int i = 0; i < paragraphs.size(); i++) {
        XWPFParagraph paragraph = paragraphs.get(i);
        List<XWPFRun> runs = paragraph.getRuns();
        for (Map.Entry<String, Object> replPair : replacements.entrySet()) {
            String find = replPair.getKey();
            Object repl = replPair.getValue();
            TextSegment found =
                    paragraph.searchText(find, new PositionInParagraph());
            if (found != null) {
                if (repl instanceof String) {
                    replaceText(found, runs, find, repl);
                } else if (repl instanceof ArrayList<?>) {
                    Iterator<?> iterArrayList = ((ArrayList) repl).iterator();
                    boolean isPassed = false;
                    while (iterArrayList.hasNext()) {
                        Object object = (Object) iterArrayList.next();
                        if (!isPassed) {
                            replaceText(found, runs, find, object.toString());
                        } else {
                            XWPFRun run = paragraph.createRun();
                            run.addCarriageReturn();
                            run.setText(object.toString());
                        }
                        isPassed = true;
                    }
                }
            }
        }
    }
    return document;
}
private void replaceText(TextSegment found, List<XWPFRun> runs,
        String find, Object repl) {
    int biginRun = found.getBeginRun();
    int biginRun2 = found.getEndRun();
    if (found.getBeginRun() == found.getEndRun()) {
        // whole search string is in one Run
        XWPFRun run = runs.get(found.getBeginRun());
        String runText = run.getText(run.getTextPosition());
        String replaced = runText.replace(find, repl.toString());
        run.setText(replaced, 0);
    } else {
        // The search string spans over more than one Run.
        // Put the Strings together.
        StringBuilder b = new StringBuilder();
        for (int runPos = found.getBeginRun(); runPos <= found.getEndRun(); runPos++) {
            XWPFRun run = runs.get(runPos);
            b.append(run.getText(run.getTextPosition()));
        }
        String connectedRuns = b.toString();
        String replaced = connectedRuns.replace(find, repl.toString());
        // The first Run receives the replaced String of all connected Runs
        XWPFRun partOne = runs.get(found.getBeginRun());
        partOne.setText(replaced, 0);
        // Remove the text in the other Runs.
        for (int runPos = found.getBeginRun() + 1; runPos <= found.getEndRun(); runPos++) {
            XWPFRun partNext = runs.get(runPos);
            partNext.setText("", 0);
        }
    }
}

Generate Term-Document matrix using Lucene 4.4

I'm trying to create a term-document matrix for a small corpus to further experiment with LSI. However, I couldn't find a way to do it with Lucene 4.4.
I know how to get the TermVector for each document, as follows:
//create boolean query to search for a specific document (not shown)
TopDocs hits = searcher.search(query, 1);
Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
System.out.println(termVector.size()); //just testing
I thought I could just union all the term vectors together as columns of a matrix to get the matrix. However, the term vectors of different documents have different sizes, and I don't know how to pad 0s into them, so this method certainly does not work.
Hence, I wonder if someone can show me how to create a term-document matrix with Lucene 4.4 (if possible, with sample code)?
If Lucene does not support this, what other way would you recommend to do it?
Many thanks,
I found the solution to my problem here. It is a very detailed example given by Mr. Sujit, although the code is written for an older version of Lucene, so many things have to be changed. I'll update the details when I finish my code.
Here is my solution, which works on Lucene 4.4:
public class BuildTermDocumentMatrix {

    // fields referenced below (declarations were omitted in the original snippet)
    private final IndexReader reader;
    private final IndexSearcher searcher;
    private final File corpus;
    private final Map<String, Integer> termIdMap;

    public BuildTermDocumentMatrix(File index, File corpus) throws IOException {
        reader = DirectoryReader.open(FSDirectory.open(index));
        searcher = new IndexSearcher(reader);
        this.corpus = corpus;
        termIdMap = computeTermIdMap(reader);
    }

    /**
     * Map each term to a fixed integer so that we can build the document matrix later.
     * It's used to assign a term to a specific row in the term-document matrix.
     */
    private Map<String, Integer> computeTermIdMap(IndexReader reader) throws IOException {
        Map<String, Integer> termIdMap = new HashMap<String, Integer>();
        int id = 0;
        Fields fields = MultiFields.getFields(reader);
        Terms terms = fields.terms("contents");
        TermsEnum itr = terms.iterator(null);
        BytesRef term = null;
        while ((term = itr.next()) != null) {
            String termText = term.utf8ToString();
            if (termIdMap.containsKey(termText))
                continue;
            //System.out.println(termText);
            termIdMap.put(termText, id++);
        }
        return termIdMap;
    }

    /**
     * Build the term-document matrix for the given directory.
     */
    public RealMatrix buildTermDocumentMatrix() throws IOException {
        // iterate through the directory to work with each doc
        int col = 0;
        int numDocs = countDocs(corpus);    // number of documents (helper method not shown)
        int numTerms = termIdMap.size();    // total number of terms
        RealMatrix tdMatrix = new Array2DRowRealMatrix(numTerms, numDocs);

        for (File f : corpus.listFiles()) {
            if (!f.isHidden() && f.canRead()) {
                // I build the term-document matrix for a subset of the corpus, so
                // I need to look up each document by path name.
                // If you build it for the whole corpus, just iterate through all documents.
                String path = f.getPath();
                BooleanQuery pathQuery = new BooleanQuery();
                pathQuery.add(new TermQuery(new Term("path", path)), BooleanClause.Occur.SHOULD);
                TopDocs hits = searcher.search(pathQuery, 1);

                // get the term vector
                Terms termVector = reader.getTermVector(hits.scoreDocs[0].doc, "contents");
                TermsEnum itr = termVector.iterator(null);
                BytesRef term = null;

                // compute the term weight
                while ((term = itr.next()) != null) {
                    String termText = term.utf8ToString();
                    int row = termIdMap.get(termText);
                    long termFreq = itr.totalTermFreq();
                    long docCount = itr.docFreq();
                    double weight = computeTfIdfWeight(termFreq, docCount, numDocs);    // helper method not shown
                    tdMatrix.setEntry(row, col, weight);
                }
                col++;
            }
        }
        return tdMatrix;
    }
}
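For what it's worth, a minimal usage sketch could look like the following; the index and corpus paths are placeholders, and countDocs/computeTfIdfWeight are the helper methods omitted above:

File index = new File("/path/to/lucene/index");    // placeholder path
File corpus = new File("/path/to/corpus");         // placeholder path
BuildTermDocumentMatrix builder = new BuildTermDocumentMatrix(index, corpus);
RealMatrix tdMatrix = builder.buildTermDocumentMatrix();
System.out.println(tdMatrix.getRowDimension() + " terms x " + tdMatrix.getColumnDimension() + " documents");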
One can also refer to this code; with the latest Lucene version it is quite easy.
Example 15:
public void testSparseFreqDoubleArrayConversion() throws Exception {
    Terms fieldTerms = MultiFields.getTerms(index, "text");
    if (fieldTerms != null && fieldTerms.size() != -1) {
        IndexSearcher indexSearcher = new IndexSearcher(index);
        for (ScoreDoc scoreDoc : indexSearcher.search(new MatchAllDocsQuery(), Integer.MAX_VALUE).scoreDocs) {
            Terms docTerms = index.getTermVector(scoreDoc.doc, "text");
            Double[] vector = DocToDoubleVectorUtils.toSparseLocalFreqDoubleArray(docTerms, fieldTerms);
            assertNotNull(vector);
            assertTrue(vector.length > 0);
        }
    }
}

Parsing a .txt file (considering performance measure)

DurationOfRun:5
ThreadSize:10
ExistingRange:1-1000
NewRange:5000-10000
Percentage:55 - AutoRefreshStoreCategories Data:Previous/30,New/70 UserLogged:true/50,false/50 SleepTime:5000 AttributeGet:1,16,10106,10111 AttributeSet:2060/30,10053/27
Percentage:25 - CrossPromoEditItemRule Data:Previous/60,New/40 UserLogged:true/50,false/50 SleepTime:4000 AttributeGet:1,10107 AttributeSet:10108/34,10109/25
Percentage:20 - CrossPromoManageRules Data:Previous/30,New/70 UserLogged:true/50,false/50 SleepTime:2000 AttributeGet:1,10107 AttributeSet:10108/26,10109/21
I am trying to parse the .txt file above (the first four lines are fixed; the last three lines can increase, i.e. there can be more than three of them). I wrote the code below and it works, but it looks messy. Is there a better way to parse this file, and, considering performance, which approach would be best?
private static int noOfThreads;
private static List<Command> commands;
public static int startRange;
public static int endRange;
public static int newStartRange;
public static int newEndRange;
private static BufferedReader br = null;
private static String sCurrentLine = null;
private static List<String> values;
private static String commandName;
private static String percentage;
private static List<String> attributeIDGet;
private static List<String> attributeIDSet;
private static LinkedHashMap<String, Double> dataCriteria;
private static LinkedHashMap<Boolean, Double> userLoggingCriteria;
private static long sleepTimeOfCommand;
private static long durationOfRun;
br = new BufferedReader(new FileReader("S:\\Testing\\PDSTest1.txt"));
values = new ArrayList<String>();
while ((sCurrentLine = br.readLine()) != null) {
    if (sCurrentLine.startsWith("DurationOfRun")) {
        durationOfRun = Long.parseLong(sCurrentLine.split(":")[1]);
    } else if (sCurrentLine.startsWith("ThreadSize")) {
        noOfThreads = Integer.parseInt(sCurrentLine.split(":")[1]);
    } else if (sCurrentLine.startsWith("ExistingRange")) {
        startRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[0]);
        endRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[1]);
    } else if (sCurrentLine.startsWith("NewRange")) {
        newStartRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[0]);
        newEndRange = Integer.parseInt(sCurrentLine.split(":")[1].split("-")[1]);
    } else {
        attributeIDGet = new ArrayList<String>();
        attributeIDSet = new ArrayList<String>();
        dataCriteria = new LinkedHashMap<String, Double>();
        userLoggingCriteria = new LinkedHashMap<Boolean, Double>();

        percentage = sCurrentLine.split("-")[0].split(":")[1].trim();
        values = Arrays.asList(sCurrentLine.split("-")[1].trim().split("\\s+"));
        for (String s : values) {
            if (s.startsWith("Data")) {
                String[] data = s.split(":")[1].split(",");
                for (String n : data) {
                    dataCriteria.put(n.split("/")[0], Double.parseDouble(n.split("/")[1]));
                }
            } else if (s.startsWith("UserLogged")) {
                String[] userLogged = s.split(":")[1].split(",");
                for (String t : userLogged) {
                    userLoggingCriteria.put(Boolean.parseBoolean(t.split("/")[0]), Double.parseDouble(t.split("/")[1]));
                }
            } else if (s.startsWith("SleepTime")) {
                sleepTimeOfCommand = Long.parseLong(s.split(":")[1]);
            } else if (s.startsWith("AttributeGet")) {
                String[] strGet = s.split(":")[1].split(",");
                for (String q : strGet) attributeIDGet.add(q);
            } else if (s.startsWith("AttributeSet:")) {
                String[] strSet = s.split(":")[1].split(",");
                for (String p : strSet) attributeIDSet.add(p);
            } else {
                commandName = s;
            }
        }
        Command command = new Command();
        command.setName(commandName);
        command.setExecutionPercentage(Double.parseDouble(percentage));
        command.setAttributeIDGet(attributeIDGet);
        command.setAttributeIDSet(attributeIDSet);
        command.setDataUsageCriteria(dataCriteria);
        command.setUserLoggingCriteria(userLoggingCriteria);
        command.setSleepTime(sleepTimeOfCommand);
        commands.add(command);
    }
}
Well, parsers usually are messy once you get down to the lower layers of them :-)
However, one possible improvement, at least in terms of code quality, would be to recognize the fact that your grammar is layered.
By that, I mean every line is an identifying token followed by some properties.
In the case of DurationOfRun, ThreadSize, ExistingRange and NewRange, the properties are relatively simple. Percentage is somewhat more complex but still okay.
I would structure the code as (pseudo-code):
def parseFile (fileHandle):
    while (currentLine = fileHandle.getNextLine()) != EOF:
        if currentLine.beginsWith ("DurationOfRun:"):
            processDurationOfRun (currentLine[14:])
        elsif currentLine.beginsWith ("ThreadSize:"):
            processThreadSize (currentLine[11:])
        elsif currentLine.beginsWith ("ExistingRange:"):
            processExistingRange (currentLine[14:])
        elsif currentLine.beginsWith ("NewRange:"):
            processNewRange (currentLine[9:])
        elsif currentLine.beginsWith ("Percentage:"):
            processPercentage (currentLine[11:])
        else
            raise error
Then, in each of those processWhatever() functions, you parse the remainder of the line based on the expected format. That keeps your code small and readable and easily changed in future, without having to navigate a morass :-)
For example, processDurationOfRun() simply gets an integer from the remainder of the line:
def processDurationOfRun (line):
    this.durationOfRun = line.parseAsInt()
Similarly, the functions for the two ranges split the string on - and get two integers from the resultant values:
def processExistingRange (line):
    values[] = line.split("-")
    this.existingRangeStart = values[0].parseAsInt()
    this.existingRangeEnd = values[1].parseAsInt()
The processPercentage() function is the tricky one but that is also easily doable if you layer it as well. Assuming those things are always in the same order, it consists of:
an integer;
a literal -;
some sort of textual category; and
a series of key:value pairs.
And even these values within the pairs can be parsed by lower levels, splitting first on commas to get subvalues like Previous/30 and New/70, then splitting each of those subvalues on slashes to get individual items. That way, a logical hierarchy can be reflected in your code.
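As a small illustration of that lowest layer (written in Java rather than the pseudo-code above; the method name is mine, not from the question's code), splitting a pair list such as Previous/30,New/70 into a map could look like this:

// Turn "Previous/30,New/70" into {Previous=30.0, New=70.0}
static Map<String, Double> parsePairs(String pairList) {
    Map<String, Double> result = new LinkedHashMap<>();
    for (String pair : pairList.split(",")) {   // "Previous/30", "New/70"
        String[] parts = pair.split("/");       // key and numeric value
        result.put(parts[0], Double.parseDouble(parts[1]));
    }
    return result;
}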
Unless you're expecting to parse this text file many times per second, or unless it's many megabytes in size, I'd be more concerned about the readability and maintainability of your code than about the speed of the parsing.
Mostly gone are the days when we need to wring the last ounce of performance from our code but we still have problems in fixing said code in a timely manner when bugs are found or enhancements are desired.
Sometimes it's preferable to optimise for readability.
I would not worry about performance until I was sure there was actually a performance issue. Regarding the rest of the code, if you won't be adding any new line types I would not worry about it. If you do worry about it, however, a factory design pattern can help you separate the selection of the type of processing needed from the actual processing. It makes adding new line types easier without introducing as much opportunity for error.
A newer and more convenient class is Scanner. You just need to modify the delimiter, and you can read data in the desired format (int, long) in one go, with no need for separate Xxx.parseXxx calls.
Second: split your code into small, reusable pieces. They make the program readable, and you can hide details easily.
Don't hesitate to use a struct-like class, for example for a range. It lets a method return multiple values without boilerplate (getters, setters, constructor).
import java.util.*;
import java.io.*;

public class ReadSampleFile
{
    // struct-like classes:
    class PercentageRow {
        public int percentage;
        public String name;
        public int dataPrevious;
        public int dataNew;
        public int userLoggedTrue;
        public int userLoggedFalse;
        public List<Integer> attributeGet;
        public List<Integer> attributeSet;
    }

    class Range {
        public int from;
        public int to;
    }

    private int readInt (String name, Scanner sc) {
        String s = sc.next ();
        if (s.startsWith (name)) {
            return sc.nextInt ();
        }
        err (name + " expected, found: " + s);
        return -1; // fallback so the method compiles; err() has already reported the problem
    }

    private long readLong (String name, Scanner sc) {
        String s = sc.next ();
        if (s.startsWith (name)) {
            return sc.nextLong ();
        }
        err (name + " expected, found: " + s);
        return -1L;
    }

    private Range readRange (String name, Scanner sc) {
        String s = sc.next ();
        if (s.startsWith (name)) {
            Range r = new Range ();
            r.from = sc.nextInt ();
            r.to = sc.nextInt ();
            return r;
        }
        err (name + " expected, found: " + s);
        return null;
    }

    private PercentageRow readPercentageRow (Scanner sc) {
        // reuse the methods above
        PercentageRow percentageRow = new PercentageRow ();
        percentageRow.percentage = readInt ("Percentage", sc);
        // ...
        return percentageRow;
    }

    public ReadSampleFile () throws FileNotFoundException
    {
        /* I only read from my source file for convenience,
           so I could scroll up to see what the next entry is.
           Don't do this at home. :) The dummy later ...
        */
        Scanner sc = new Scanner (new File ("./ReadSampleFile.java"));
        sc.useDelimiter ("[ \n/,:-]");
        // ... is the comment I had to insert.
        String dummy = sc.nextLine ();
        List<String> values = new ArrayList<String> ();
        if (sc.hasNext ()) {
            // See how nicely the data structure is reflected by this code:
            long duration = readLong ("DurationOfRun", sc);
            int noOfThreads = readInt ("ThreadSize", sc);
            Range eRange = readRange ("ExistingRange", sc);
            Range nRange = readRange ("NewRange", sc);
            List<PercentageRow> percentageRows = new ArrayList<PercentageRow> ();
            // including the repetition ...
            while (sc.hasNext ()) {
                percentageRows.add (readPercentageRow (sc));
            }
        }
    }

    public static void main (String args[]) throws FileNotFoundException
    {
        new ReadSampleFile ();
    }

    public static void err (String msg)
    {
        System.out.println ("Err:\t" + msg);
    }
}
