apply extraction information with java

I am trying to apply a dictionary (a file of words) to a text (a file of text):
for each line of the text, I test whether a dictionary word occurs in it and, if so, print the line. Every word of the dictionary is tested against every line of the text.
I used regular expressions (Pattern + Matcher), but the problem is the running time: the operation takes 5 hours.
The two files are 3330 KB and 55 KB.
My question: is there another way to do this, like UNITEX, but in Java?
public class Tratemant_Dic extends Thread {
Tratemant_Dic() {
}
public void run() {
try {
BufferedReader file_corpus = new BufferedReader(
new InputStreamReader(new FileInputStream(
"corpus-medical.TXT"), "UTF-16LE"));
PrintWriter ecrire = new PrintWriter("sort.html");
String line;
String nom = null;
ecrire.write("<mot><span style=\"color:red\">startsss</span></mot></br><ligne>start\n");
while ((line = file_corpus.readLine()) != null) {
BufferedReader file_nom = new BufferedReader(
new InputStreamReader(new FileInputStream(
"Fichie_sorte.DIC"), "UTF-16LE"));
while ((nom = file_nom.readLine()) != null) {
nom = nom.substring(0, nom.length() - 3);
Pattern p = Pattern.compile("(.*)\\W+" + nom + "\\b.*",
Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(line);
if (m.find()) {
System.out.println(nom + "==>" + line);
ecrire.write("<mot><span style=\"color:red\">" + nom
+ "</span></mot></br><ligne>" + line + "\n");
}
}
file_nom.close();
}
ecrire.close();
System.out.println("FIN");
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

If I understand what you are trying to do correctly, I would not use regular expressions to do it. They're slow and you do not need them.
This is really a string matching problem. Your dictionary should probably be stored in a hash table, using the hashCode() method to get a key for the string. You then look up each word from the text in your dictionary as you read it (calculating the appropriate hash code as you go). Properly done, that should be as fast as it gets.
Remember that hash codes are not guaranteed to be unique, so always make sure the actual strings match even if the hash code is found in the table.
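A minimal sketch of the hash-table idea, assuming the dictionary entries are single words and that splitting a line on non-letter characters is an acceptable definition of a "word". The class and method names (DictionaryLookup, buildDictionary, wordsFound) and the sample words are made up for illustration. java.util.HashSet already compares the actual strings with equals() after hashing, so collisions are handled for you.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class DictionaryLookup {

    // Build the lookup set once; lower-casing gives case-insensitive matching.
    static Set<String> buildDictionary(List<String> words) {
        Set<String> dictionary = new HashSet<>();
        for (String word : words) {
            dictionary.add(word.toLowerCase(Locale.ROOT));
        }
        return dictionary;
    }

    // Return the dictionary words that occur in the line, in order of appearance.
    static List<String> wordsFound(String line, Set<String> dictionary) {
        List<String> found = new ArrayList<>();
        // Split on runs of characters that are not Unicode letters.
        for (String token : line.toLowerCase(Locale.ROOT).split("\\P{L}+")) {
            if (dictionary.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the two files.
        Set<String> dictionary = buildDictionary(Arrays.asList("aspirine", "Doliprane"));
        String line = "Le patient prend de l'aspirine et du doliprane.";
        System.out.println(wordsFound(line, dictionary)); // [aspirine, doliprane]
    }
}
```

With this layout the dictionary file is read once, and each line of the text costs one set lookup per word instead of one regex compilation per dictionary entry.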

I would start by timing each of the "things" your application does and then target the slowest item (as mentioned in a comment by Jay, one issue in your case is that you are re-reading the dictionary file for every line of the text), rather than basing the improvements on a guess about what is wrong (the regex being slow).
You can use System.nanoTime() or one of the many stopwatch classes to do this. I normally use Guava.
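A minimal sketch of timing one step with System.nanoTime(). The helper name timeMillis and the dummy workload are assumptions for illustration; in the real program the Runnable would wrap a step such as loading the dictionary or scanning the corpus.

```java
import java.util.concurrent.TimeUnit;

class StepTimer {

    // Run the task once and return the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // Stand-in for one step of the application.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("step took " + elapsed + " ms");
    }
}
```

A single measurement like this is only a rough guide, but it is usually enough to tell a step that takes hours from one that takes milliseconds.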

Why not use, instead of
Pattern p = Pattern.compile("(.*)\\W+" + nom + "\\b.*",
Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(line);
if (m.find()) {
...
just
if(line.indexOf(nom) > -1) {
...
?
Update: if you need word-boundary checks, use something like this:
String lineToLowerCase = line.toLowerCase(); // before second while
...
int index = lineToLowerCase.indexOf(nom.toLowerCase());
if(index > -1) {
if(index ==0 || Character.isWhitespace(lineToLowerCase.charAt(index-1))) {
int indexEnd = index + nom.length();
if (indexEnd >= lineToLowerCase.length() || !Character.isAlphabetic(lineToLowerCase.charAt(indexEnd))) {
...
For testing:
public static void main(String[] s) {
check("skdc s dcd dsf", "dcd"); // print true
check("skdc sdcd dsf", "dcd"); // print false
check("dcd dsf", "dcd"); // print true
check("afasa dcd", "dcd"); // print true
check("afasa dCD11", "dcD"); // print true
check("skdc s dcda dsf", "dcd"); // print false
}
public static void check(String line, String nom) {
String lineToLowerCase = line.toLowerCase();
int index = lineToLowerCase.indexOf(nom.toLowerCase());
if(index > -1) {
if(index ==0 || Character.isWhitespace(lineToLowerCase.charAt(index-1))) {
int indexEnd = index + nom.length();
if (indexEnd >= lineToLowerCase.length() || !Character.isAlphabetic(lineToLowerCase.charAt(indexEnd))) {
System.out.println("true");
return;
}
}
}
System.out.println("false");
}

Related

How can I scope three different conditions using the same loop in Java?

I would like to count countSign, countX and countI using the same loop instead of creating three different loops. Is there an easy way to approach that?
public class Absence {
private static File file = new File("/Users/naplo.txt");
private static File file_out = new File("/Users/naplo_out.txt");
private static BufferedReader br = null;
private static BufferedWriter bw = null;
public static void main(String[] args) throws IOException {
int countSign = 0;
int countX = 0;
int countI = 0;
String sign = "#";
String absenceX = "X";
String absenceI = "I";
try {
br = new BufferedReader(new FileReader(file));
bw = new BufferedWriter(new FileWriter(file_out));
String st;
while ((st = br.readLine()) != null) {
for (String element : st.split(" ")) {
if (element.matches(sign)) {
countSign++;
continue;
}
if (element.matches(absenceX)) {
countX++;
continue;
}
if (element.matches(absenceI)) {
countI++;
}
}
}
System.out.println("2. exerc.: There are " + countSign + " rows int the file with that sign.");
System.out.println("3. exerc.: There are " + countX + " with sick note, and " + countI + " without sick note!");
} catch (FileNotFoundException ex) {
Logger.getLogger(Absence.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
text file example:
# 03 26
Jujuba Ibolya IXXXXXX
Maracuja Kolos XXXXXXX

I think you meant using fewer than three if statements. You can actually do it with no ifs.
In your for loop write this:
countSign += (element.matches(sign)) ? 1 : 0;
countX += (element.matches(absenceX)) ? 1 : 0;
countI += (element.matches(absenceI)) ? 1 : 0;

Both answers check whether the word (element) matches every regular expression, although this can (and should, if you ask me) be avoided, since a word can match only one regex. I am referring to the continue statements in your original code, which are good because they skip any further checks.
So, I am leaving here one way to do it with Java 8 Streams as a "one-liner".
But let's assume the following regular expressions:
String absenceX = "X*";
String absenceI = "I.*";
and one more (for the sake of the example):
String onlyNumbers = "[0-9]*";
In order to have some matches on them.
The text is as you gave it.
public class Test {
public static void main(String[] args) throws IOException {
File desktop = new File(System.getProperty("user.home"), "Desktop");
File txtFile = new File(desktop, "test.txt");
String sign = "#";
String absenceX = "X*";
String absenceI = "I.*";
String onlyNumbers = "[0-9]*";
List<String> regexes = Arrays.asList(sign, absenceX, absenceI, onlyNumbers);
List<String> lines = Files.readAllLines(txtFile.toPath());
//#formatter:off
Map<String, Long> result = lines.stream()
.flatMap(line-> Stream.of(line.split(" "))) //map these lines to words
.map(word -> regexes.stream().filter(word::matches).findFirst()) //find the first regex this word matches
.filter(Optional::isPresent) //If it matches no regex, it will be ignored
.collect(Collectors.groupingBy(Optional::get, Collectors.counting())); //collect
System.out.println(result);
}
}
The result:
{X*=1, #=1, I.*=2, [0-9]*=2}
X*=1 came from word: XXXXXXX
#=1 came from word: #
I.*=2 came from words: IXXXXXX and Ibolya
[0-9]*=2 came from words: 03 and 26
Ignore the fact I load all lines in memory.

I got it working with the following lines. It had escaped my attention that the characters need to be separated from each other. Your ternary-operator suggestion is also nice, so I will use it.
String myString;
while ((myString = br.readLine()) != null) {
String newString = myString.replaceAll("", " ").trim();
for (String element : newString.split(" ")) {
countSign += (element.matches(sign)) ? 1 : 0;
countX += (element.matches(absenceX)) ? 1 : 0;
countI += (element.matches(absenceI)) ? 1 : 0;

Java parsing alternative to current solution

I have a text file to parse that requires different logic depending on certain conditions. Below is my current solution, which works. However, I find it very clunky. I have been looking into other options such as StringTokenizer or the Pattern class, and I am wondering whether I could implement this more elegantly using them.
Do let me know if I should move this to the Code Review forum--I have not initially put it there, as I am unable to implement the other mentioned solutions.
File file = fileChooser.getSelectedFile();
java.io.BufferedReader reader = new java.io.BufferedReader(new java.io.FileReader(file));
memoryMap = new HashMap<Integer, Integer>();
registerMap = new HashMap<Integer, Integer>();
String line = reader.readLine();
while (line != null) {
if (line.contains("#")) {
System.out.println(line);
line = reader.readLine();
}
if (!Character.isDigit(line.charAt(0))) {
System.out.println(line);
String[] setFirstSplit = line.split(":");
if (setFirstSplit[0].equals("M")) {
boolean isFirst = true;
for (String setFirstSegment : setFirstSplit) {
if (!isFirst) {
String[] setSecondSplit = setFirstSegment.split(",");
for (String setSecondSegment : setSecondSplit) {
String[] setThirdSplit = setSecondSegment.split("=");
for (String setThirdSegment : setThirdSplit) {
System.out.println(setThirdSegment);
memoryMap.put(Integer.parseInt(setThirdSplit[0]), Integer.parseInt(setThirdSplit[1]));
System.out.println("Memory Set Result: " + memoryMap);
}
}
} else {
isFirst = false;
}
}
}
if (setFirstSplit[0].equals("R")) {
boolean isFirst = true;
for (String setFirstSegment : setFirstSplit) {
if (!isFirst) {
String[] setSecondSplit = setFirstSegment.split(",");
for (String setSecondSegment : setSecondSplit) {
String[] setThirdSplit = setSecondSegment.split("=");
for (String setThirdSegment : setThirdSplit) {
System.out.println(setThirdSegment);
registerMap.put(Integer.parseInt(setThirdSplit[0]), Integer.parseInt(setThirdSplit[1]));
System.out.println("Register Set Result: " + registerMap);
}
}
} else {
isFirst = false;
}
}
}
line = reader.readLine();
} else {
System.out.println(line);
String[] actionFirstSplit = line.split(" ");
if (actionFirstSplit[1].equals("LOAD")) {
String[] actionSecondSplit = actionFirstSplit[2].split(",");
LoadStep action = new LoadStep();
action.executeStep(Integer.parseInt(actionSecondSplit[0]), Integer.parseInt(actionSecondSplit[1]));
System.out.println("Memory Action Result: " + memoryMap);
System.out.println("Register Action Result: " + registerMap);
}
else {
System.out.println(line);
}
line = reader.readLine();
}
}
reader.close();
The text file looks like this:
# sets the memory address 0 to store the value 1. M stands for memory.
M:0=1,1=11
# All programs starts with an initial setup of values in memory such as the example shown above
0 LOAD 1,3
1 LOAD 0,2
2 ADD 1,2
3 ADD 0,1
4 LSS 1,3,2
5 STOR 62,1
6 STOP

Write it top-down.
String line = reader.readLine();
while (line != null) {
if (parsedComment(line)) {
} else if (parsedMemory(line)) {
} else if (parsedInstruction(line)) {
} else {
error(...);
}
line = reader.readLine();
}
Parse functions may use fields to pass results, like those maps, or have extra parameters.
(If you have multi-line syntax, the reader might be better placed in a field, so that it no longer needs to be passed as a parameter. You can then read one line ahead into a field and check on that.)
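A minimal sketch of that top-down layout for the file format shown in the question. The class name ProgramParser, the helper parsedAssignments and the instructionCount field are assumptions for illustration, and the instruction handler only counts lines rather than executing them. Treating a line as a comment only when it starts with "#" is also an assumption; the question's code uses contains("#").

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ProgramParser {
    final Map<Integer, Integer> memoryMap = new HashMap<>();
    final Map<Integer, Integer> registerMap = new HashMap<>();
    int instructionCount = 0;

    // Top level: try each line kind in turn; each parsedXxx reports whether it handled the line.
    void parse(List<String> lines) {
        for (String line : lines) {
            if (parsedComment(line)) {
                // comments and blank lines carry no data
            } else if (parsedAssignments(line, "M:", memoryMap)) {
                // memory values stored in memoryMap
            } else if (parsedAssignments(line, "R:", registerMap)) {
                // register values stored in registerMap
            } else if (parsedInstruction(line)) {
                // instruction recorded
            } else {
                throw new IllegalArgumentException("Unrecognized line: " + line);
            }
        }
    }

    boolean parsedComment(String line) {
        return line.trim().isEmpty() || line.startsWith("#");
    }

    // Handles lines such as "M:0=1,1=11" by storing each address=value pair.
    boolean parsedAssignments(String line, String prefix, Map<Integer, Integer> target) {
        if (!line.startsWith(prefix)) {
            return false;
        }
        for (String pair : line.substring(prefix.length()).split(",")) {
            String[] keyValue = pair.split("=");
            target.put(Integer.parseInt(keyValue[0].trim()), Integer.parseInt(keyValue[1].trim()));
        }
        return true;
    }

    // Handles lines such as "0 LOAD 1,3"; this sketch only counts them.
    boolean parsedInstruction(String line) {
        if (line.isEmpty() || !Character.isDigit(line.charAt(0))) {
            return false;
        }
        instructionCount++;
        return true;
    }

    public static void main(String[] args) {
        ProgramParser parser = new ProgramParser();
        parser.parse(Arrays.asList("# setup", "M:0=1,1=11", "0 LOAD 1,3", "6 STOP"));
        System.out.println(parser.memoryMap + " " + parser.instructionCount);
    }
}
```

Because the memory and register lines differ only in their prefix and target map, one helper serves both, which removes the duplicated nested loops from the question.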

You could use a parser generator like ANTLR http://www.antlr.org/

Translate words in a string using BufferedReader (Java)

I've been working on this for a few days now and I just can't make any headway. I've tried using Scanner and BufferedReader and had no luck.
Basically, I have a working method (shortenWord) that takes a String and shortens it according to a text file formatted like this:
hello,lo
any,ne
anyone,ne1
thanks,thx
It also accounts for punctuation so 'hello?' becomes 'lo?' etc.
I need to be able to read in a String and translate each word individually, so "hello? any anyone thanks!" will become "lo? ne ne1 thx!", basically using the method I already have on each word in the String. The code I have will translate the first word but then does nothing to the rest. I think it's something to do with how my BufferedReader is working.
import java.io.*;
public class Shortener {
private FileReader in ;
/*
* Default constructor that will load a default abbreviations text file.
*/
public Shortener() {
try {
in = new FileReader( "abbreviations.txt" );
}
catch ( Exception e ) {
System.out.println( e );
}
}
public String shortenWord( String inWord ) {
String punc = new String(",?.!;") ;
char finalchar = inWord.charAt(inWord.length()-1) ;
String outWord = new String() ;
BufferedReader abrv = new BufferedReader(in) ;
// ends in punctuation
if (punc.indexOf(finalchar) != -1 ) {
String sub = inWord.substring(0, inWord.length()-1) ;
outWord = sub + finalchar ;
try {
String line;
while ( (line = abrv.readLine()) != null ) {
String[] lineArray = line.split(",") ;
if ( line.contains(sub) ) {
outWord = lineArray[1] + finalchar ;
}
}
}
catch (IOException e) {
System.out.println(e) ;
}
}
// no punctuation
else {
outWord = inWord ;
try {
String line;
while( (line = abrv.readLine()) != null) {
String[] lineArray = line.split(",") ;
if ( line.contains(inWord) ) {
outWord = lineArray[1] ;
}
}
}
catch (IOException ioe) {
System.out.println(ioe) ;
}
}
return outWord;
}
public void shortenMessage( String inMessage ) {
String[] messageArray = inMessage.split("\\s+") ;
for (String word : messageArray) {
System.out.println(shortenWord(word));
}
}
}
Any help, or even a nudge in the right direction would be so much appreciated.
Edit: I've tried closing the BufferedReader at the end of the shortenWord method and it just results in me getting an error on every word in the String after the first one saying that the BufferedReader is closed.

So I took a look at this. First of all, if you have the option to change the format of your text file, I would change it to something like this (or XML):
key1=value1
key2=value2
By doing this you could later use Java's Properties.load(Reader). This would remove the need for any manual parsing of the file.
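A minimal sketch of the Properties route, assuming the abbreviations have been rewritten as key=value lines. The class name AbbreviationProperties is made up, and a StringReader stands in for the FileReader you would use on the real file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class AbbreviationProperties {

    // Load key=value lines from any Reader (a FileReader in real use).
    static Properties load(Reader reader) {
        Properties rules = new Properties();
        try {
            rules.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rules;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the reformatted abbreviations file.
        Properties rules = load(new StringReader("hello=lo\nany=ne\nanyone=ne1\nthanks=thx\n"));
        System.out.println(rules.getProperty("anyone")); // ne1
        System.out.println(rules.getProperty("unknown")); // null
    }
}
```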
If by any chance you don't have the option to change the format, then you'll have to parse it yourself. Something like the code below would do that and put the results into a Map called shortningRules, which could then be used later.
private void parseInput(FileReader reader) {
try (BufferedReader br = new BufferedReader(reader)) {
String line;
while ((line = br.readLine()) != null) {
String[] lineComponents = line.split(",");
this.shortningRules.put(lineComponents[0], lineComponents[1]);
}
} catch (IOException e) {
e.printStackTrace();
}
}
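When it comes to actually shortening a message, I would probably opt for a regex approach, e.g. \\bKEY\\b, where KEY is the word you want shortened. \\b is an anchor in regex that marks a word boundary: it matches the position between a word character and a non-word character, so the pattern does not match KEY inside a longer word and leaves the surrounding spaces and punctuation untouched.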
The whole code for doing the shortening would then become something like this:
public void shortenMessage(String message) {
for (Entry<String, String> entry : shortningRules.entrySet()) {
message = message.replaceAll("\\b" + entry.getKey() + "\\b", entry.getValue());
}
System.out.println(message); //This should probably be a return statement instead of a sysout.
}
Putting it all together will give you something like this; here I've added a main for testing purposes.
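A minimal sketch of how the parsing and replacing steps above might fit together, not the answer's original listing. The class name MessageShortener and the in-memory sample rules are assumptions; Pattern.quote and Matcher.quoteReplacement are additions that keep keys and values containing regex metacharacters from being misread.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MessageShortener {
    private final Map<String, String> shortningRules = new LinkedHashMap<>();

    // Read "word,abbreviation" lines once, when the object is created.
    MessageShortener(Reader reader) {
        try (BufferedReader br = new BufferedReader(reader)) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] parts = line.split(",");
                if (parts.length == 2) { // silently skip malformed lines
                    shortningRules.put(parts[0], parts[1]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Replace every whole-word occurrence of each key with its abbreviation.
    String shortenMessage(String message) {
        for (Map.Entry<String, String> entry : shortningRules.entrySet()) {
            message = message.replaceAll(
                    "\\b" + Pattern.quote(entry.getKey()) + "\\b",
                    Matcher.quoteReplacement(entry.getValue()));
        }
        return message;
    }

    public static void main(String[] args) {
        // In-memory stand-in for abbreviations.txt.
        MessageShortener shortener = new MessageShortener(
                new StringReader("hello,lo\nany,ne\nanyone,ne1\nthanks,thx\n"));
        System.out.println(shortener.shortenMessage("hello? any anyone thanks!")); // lo? ne ne1 thx!
    }
}
```

The word boundaries are what keep the "any" rule from rewriting the start of "anyone".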

I think you can have a simpler solution using a HashMap. Read all the abbreviations into the map when the Shortener object is created, and just reference it once you have a word. The word will be the key and the abbreviation the value. Like this:
public class Shortener {
private FileReader in;
//the map
private HashMap<String, String> abbreviations;
/*
* Default constructor that will load a default abbreviations text file.
*/
public Shortener() {
//initialize the map
this.abbreviations = new HashMap<>();
try {
in = new FileReader("abbreviations.txt" );
BufferedReader abrv = new BufferedReader(in) ;
String line;
while ((line = abrv.readLine()) != null) {
String [] abv = line.split(",");
//If there is not two items in the file, the file is malformed
if (abv.length != 2) {
throw new IllegalArgumentException("Malformed abbreviation file");
}
//populate the map with the word as key and abbreviation as value
abbreviations.put(abv[0], abv[1]);
}
}
catch ( Exception e ) {
System.out.println( e );
}
}
public String shortenWord( String inWord ) {
String punc = new String(",?.!;") ;
char finalchar = inWord.charAt(inWord.length()-1) ;
// ends in punctuation
if (punc.indexOf(finalchar) != -1) {
String sub = inWord.substring(0, inWord.length() - 1);
//Reference map
String abv = abbreviations.get(sub);
if (abv == null)
return inWord;
return new StringBuilder(abv).append(finalchar).toString();
}
// no punctuation
else {
//Reference map
String abv = abbreviations.get(inWord);
if (abv == null)
return inWord;
return abv;
}
}
public void shortenMessage( String inMessage ) {
String[] messageArray = inMessage.split("\\s+") ;
for (String word : messageArray) {
System.out.println(shortenWord(word));
}
}
public static void main (String [] args) {
Shortener s = new Shortener();
s.shortenMessage("hello? any anyone thanks!");
}
}
Output:
lo?
ne
ne1
thx!
Edit:
From atomman's answer, you can basically remove the shortenWord method by modifying the shortenMessage method like this:
public void shortenMessage(String inMessage) {
for (Entry<String, String> entry : this.abbreviations.entrySet())
// the \\b word boundaries keep "any" from matching inside "anyone"
inMessage = inMessage.replaceAll("\\b" + entry.getKey() + "\\b", entry.getValue());
System.out.println(inMessage);
}

print out from switch statement java

here is a piece of code:
class Main {
public static void main(String[] args) {
try {
CLI.parse (args, new String[0]);
InputStream inputStream = args.length == 0 ?
System.in : new java.io.FileInputStream(CLI.infile);
ANTLRInputStream antlrIOS = new ANTLRInputStream(inputStream);
if (CLI.target == CLI.SCAN || CLI.target == CLI.DEFAULT)
{
DecafScanner lexer = new DecafScanner(antlrIOS);
Token token;
boolean done = false;
while (!done)
{
try
{
for (token=lexer.nextToken();
token.getType()!=Token.EOF; token=lexer.nextToken())
{
String type = "";
String text = token.getText();
switch (token.getType())
{
case DecafScanner.ID:
type = " CHARLITERAL";
break;
}
System.out.println (token.getLine() + type + " " + text);
}
done = true;
} catch(Exception e) {
// print the error:
System.out.println(CLI.infile+" "+e);
}
}
}
else if (CLI.target == CLI.PARSE)
{
DecafScanner lexer = new DecafScanner(antlrIOS);
CommonTokenStream tokens = new CommonTokenStream(lexer);
DecafParser parser = new DecafParser (tokens);
parser.program();
}
} catch(Exception e) {
// print the error:
System.out.println(CLI.infile+" "+e);
}
}
}
The tokens are printed, but the type is not: only its default value, an empty string, appears. How can I get the value assigned in the switch statement to be printed?
Thanks!

Try debugging.
Try printing the value from within the switch section, to see if you ever get into it.
Try replacing the switch with a simple "==" to see if you ever get "token.getType() == DecafScanner.ID"
General suggestion: move the declarations of "type" and "text" outside the loop to avoid re-declaring them on every iteration.

Finish the logic of search on java [duplicate]

This question already has answers here:
How do I compare strings in Java?
(23 answers)
Closed 8 years ago.
When the array argi has more than one element, the search only seems to look for the word from the second URL of the array onward. Why? Please correct the code. Here is the relevant part.
I build a regular expression to find the title of the page behind the URL:
private final Pattern TITLE = Pattern.compile("\\<title\\>(.*)\\<\\/title\\>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
And the search logic is here:
public String search(String url, String someword) {
try {
InputStreamReader in = new InputStreamReader(new URL(url).openStream(),"UTF-8");
StringBuilder input = new StringBuilder();
int ch;
while ((ch = in.read()) != -1) {
input.append((char) ch);
}
if (Pattern.compile(someword, Pattern.CASE_INSENSITIVE).matcher(input).find()) {
Matcher title = TITLE.matcher(input);
if (title.find()) {
return title.group(1);
}
}
} catch (IOException e) {
e.printStackTrace();
} catch (PatternSyntaxException e) {
e.printStackTrace();
}
return null;
}
String[] argi = {"http://localhost:8080/site/dipnagradi","http://localhost:8080/site/contacts"};
for (int i = 0; i < argi.length; i++) {
String result = search(argi[i], word);
if (result != null) {
str = "Search phrase " + "<b>"+ word + "</b>" + " have found " + "" + result + ""+ "<p></p>";
}
else{
str="Search word not found!";
}
if (word == null||word=="") {
str = "Enter a search word!";
}
}
return null;
}

Use if (word == null || word.isEmpty()). As a rule for beginners: never use == to compare objects in Java.
Also, for a more detailed answer, you really need to post your input and expected output, and also what search() does.

Do not use the '==' operator with String: a String is an object and is compared using the equals method. Instead use: if (!"".equals(word)) {}
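A minimal sketch of the difference between == and equals for strings. The class name StringComparison and the helper isBlankWord are made up for illustration.

```java
class StringComparison {

    // True when the search word is missing or empty; this checks content, not references.
    static boolean isBlankWord(String word) {
        return word == null || word.isEmpty();
    }

    public static void main(String[] args) {
        String typed = new String("");           // a distinct object with empty content
        System.out.println(typed == "");         // false: the two references differ
        System.out.println("".equals(typed));    // true: the contents are equal
        System.out.println(isBlankWord(typed));  // true
        System.out.println(isBlankWord(null));   // true
    }
}
```

Writing "".equals(word) with the literal first also avoids a NullPointerException when word is null.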
