I am using Java 8 and OpenNLP. I am trying to extract all nouns from sentences.
I have tried this example, but it extracts all noun phrases ("NP"). Does anyone know how I can just extract the individual noun words?
Thanks
What have you tried so far? I haven't looked at the example you link to in much detail, but I'm pretty sure you could get where you want by modifying that example.
In any case, it's not very difficult:
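import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.SimpleTokenizer;

InputStream modelIn = null;
try{
    File f = new File("<location to your tagger model here>");
    modelIn = new FileInputStream(f);
    POSModel posModel = new POSModel(modelIn);
    POSTaggerME tagger = new POSTaggerME(posModel);
    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String tokens[] = tokenizer.tokenize("This is a sample sentence.");
    // tagged[i] is the part-of-speech tag for tokens[i]
    String[] tagged = tagger.tag(tokens);
    for (int i = 0; i < tagged.length; i++){
        // Penn Treebank noun tags all start with "NN": NN, NNS, NNP, NNPS
        if (tagged[i].startsWith("NN")){
            System.out.println(tokens[i]);
        }
    }
}
catch(IOException e){
    throw new BadRequestException(e.getMessage());
}
finally{
    if (modelIn != null){
        try{ modelIn.close(); } catch(IOException e){ /* nothing more to do */ }
    }
}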
You can download the tagger models here: http://opennlp.sourceforge.net/models-1.5/
And I should say that the SimpleTokenizer constructor is deprecated (SimpleTokenizer.INSTANCE is the replacement). You may want to look into a more sophisticated tokenizer, but in my experience the fancier ones from OpenNLP are also a lot slower (and in general unacceptably slow for just tokenisation).
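The English OpenNLP tagger models use the Penn Treebank tag set, which has four noun tags: NN, NNS, NNP and NNPS. Here is a minimal sketch of just the filtering step, assuming you already have the parallel tokens and tags arrays from the tagger. The class name NounFilter and the method nouns are made up for illustration, and the sample tags are written by hand rather than produced by a model.

```java
import java.util.ArrayList;
import java.util.List;

class NounFilter {
    // Returns the tokens whose tag is one of the Penn Treebank noun tags
    // (NN, NNS, NNP, NNPS). Pass includeProper = false to skip NNP/NNPS.
    static List<String> nouns(String[] tokens, String[] tags, boolean includeProper) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tags.length; i++) {
            String tag = tags[i].toUpperCase();
            boolean common = tag.equals("NN") || tag.equals("NNS");
            boolean proper = tag.equals("NNP") || tag.equals("NNPS");
            if (common || (includeProper && proper)) {
                result.add(tokens[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-written tags for illustration; in practice they come from tagger.tag(tokens).
        String[] tokens = {"John", "bought", "two", "books", "."};
        String[] tags = {"NNP", "VBD", "CD", "NNS", "."};
        System.out.println(nouns(tokens, tags, true));
        System.out.println(nouns(tokens, tags, false));
    }
}
```

Keeping the proper-noun switch separate makes it easy to decide later whether names such as "John" should count as nouns for your purpose.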
I'm new to Java so apologies in advance.
I need to scan through a .txt file where each line is a set of names, and if a name is present anywhere in another .txt file it should then output the line from the first file into a third .txt file.
As far as I am aware my current solution will only scan through the first line and then stop, because once scanB has reached the end of the file it cannot return to the beginning? So I probably need to use a completely different approach to achieve the result I'm looking for. The code I've got so far is below but I am aware it is most likely waaay off for what I need to be doing.
Sorry again if there's any really really stupid mistakes in this, as I said I'm very new to this.
`File A = new File("A.txt");
Scanner scanA = new Scanner(A);
String personA = "";
File B = new File("B.txt");
Scanner scanB = new Scanner(B);
String personCheck = "";
while(scanA.hasNextLine()){
personA = scanA.nextLine();
while(scanB.hasNextLine()){
personB = scaninteractionevents.nextLine();
if(personCheck.contains(personB)){
FileWriter f = new FileWriter("PersonList.txt", true);
BufferedWriter b = new BufferedWriter(f);
PrintWriter writer = new PrintWriter(b);
writer.print(personCheck);
}
}
}`
Thank you for asking your question. Being new is not a problem. Your question is clear and comes with a reproducible example.
If you want to scan a file multiple times, you need to hand the file to a new scanner each time, because a Scanner cannot go back to the start of its input. The simplest way to do this is to create a new scanner in every iteration of the outer loop.
File A = new File("A.txt");
Scanner scanA = new Scanner(A);
File B = new File("B.txt");
// Open the output once, before the loops, so it can be closed at the end.
PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter("PersonList.txt", true)));
while(scanA.hasNextLine()){
    String personA = scanA.nextLine();
    // A new Scanner starts reading B.txt from the beginning again.
    Scanner scanB = new Scanner(B);
    while(scanB.hasNextLine()){
        String personB = scanB.nextLine();
        if(personA.contains(personB)){
            writer.println(personA);
            break; // one match is enough, so the line is written only once
        }
    }
    scanB.close();
}
scanA.close();
writer.close();
As you can see, I also added some close() calls. It is good practice to close readers and writers: this releases the underlying file handles and, for writers, flushes any buffered output to disk.
Edit: as other answers pointed out, it might be better not to reread the file on every iteration. You could indeed keep its contents in memory (for example in a string or a list), depending on the file size. This requires slightly more programming insight and only matters if you care about efficiency rather than just learning a new language. This is my opinion; others may disagree. ;)
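For what it's worth, here is a rough sketch of that in-memory variant, assuming both files are small enough to fit in memory and are UTF-8 encoded. The class name NameMatcher and the method matchingLines are invented for this example. Unlike the code above, it overwrites PersonList.txt instead of appending to it.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class NameMatcher {
    // Returns every line of linesA that contains at least one non-empty entry of namesB.
    static List<String> matchingLines(List<String> linesA, List<String> namesB) {
        List<String> matches = new ArrayList<>();
        for (String line : linesA) {
            for (String name : namesB) {
                if (!name.isEmpty() && line.contains(name)) {
                    matches.add(line);
                    break; // keep each line only once
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // Each file is read exactly once; B.txt is kept in memory as a list.
        List<String> linesA = Files.readAllLines(Paths.get("A.txt"), StandardCharsets.UTF_8);
        List<String> namesB = Files.readAllLines(Paths.get("B.txt"), StandardCharsets.UTF_8);
        Files.write(Paths.get("PersonList.txt"), matchingLines(linesA, namesB), StandardCharsets.UTF_8);
    }
}
```

Empty lines in B.txt are skipped on purpose, because every string contains the empty string and would otherwise match every line of A.txt.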
If I use a query like this on the command line
./opennlp TokenNameFinder en-ner-person.bin "input.txt" "output.txt"
I get person names printed in output.txt, but I want to write my own models so that I can print my own entities.
E.g.
what is the risk value on icm2500.
Delivery of prd_234 will be arrived late.
Watson is handling router_34.
If I pass in these lines, it should parse them and extract the product entities: icm2500, prd_234, router_34, etc. These are all products (we can save this information in a file and use it as a kind of lookup for the models or OpenNLP).
Can anyone please tell me how to do this?
You'll need to train your own model by annotating some sentences in the OpenNLP format. For the example sentences you posted, the format would look like this:
what is the risk value on <START:product> icm2500 <END> .
Delivery of <START:product> prd_234 <END> will be arrived late .
Watson is handling <START:product> router_34 <END> .
Put each sentence on its own line, with tokens (including punctuation) separated by spaces. If a sentence itself contains newlines, escape or remove them.
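Since the question mentions keeping the product names in a lookup file, one way to bootstrap such a training file is to wrap the known names in the tags automatically. The following is only a sketch under a few assumptions: the sentences are already split into tokens separated by single spaces, every product name is a single token, and the class name TrainingLineAnnotator is invented for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class TrainingLineAnnotator {
    // Wraps every token found in knownProducts in <START:product> ... <END>.
    // Assumes single spaces between tokens and single-token product names.
    static String annotate(String tokenizedSentence, Set<String> knownProducts) {
        StringBuilder out = new StringBuilder();
        for (String token : tokenizedSentence.split(" ")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            if (knownProducts.contains(token)) {
                out.append("<START:product> ").append(token).append(" <END>");
            } else {
                out.append(token);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> products = new HashSet<>(Arrays.asList("icm2500", "prd_234", "router_34"));
        // prints: Watson is handling <START:product> router_34 <END> .
        System.out.println(annotate("Watson is handling router_34 .", products));
    }
}
```

The generated lines still deserve a manual check, because a model trained only on names it could have looked up anyway will not generalize much beyond the list.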
Once you have made a file like this out of your data, you can use the Java API to train the model like this:
public static void main(String[] args) throws IOException {
    Charset charset = Charset.forName("UTF-8");
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("your file in the above format"), charset);
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
    TokenNameFinderModel model;
    try {
        // "product" is the entity type; the cast tells the compiler which train(...)
        // overload is meant when no custom feature generator is passed
        model = NameFinderME.train("en", "product", sampleStream, TrainingParameters.defaultParams(),
            (AdaptiveFeatureGenerator) null, Collections.<String, Object>emptyMap());
    }
    finally {
        sampleStream.close();
    }
    // placeholder: where the trained model should be written
    File modelFile = new File("<location to write your model to>");
    OutputStream modelOut = null;
    try {
        modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
        model.serialize(modelOut);
    } finally {
        if (modelOut != null)
            modelOut.close();
    }
}
Now you can use the model with the name finder (NameFinderME).
Because you may have a definitive, and possibly short, list of product names, you might consider a simple regex approach.
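A rough sketch of what such a regex approach could look like. The pattern is a guess built only from the three sample names in the question ("icm" plus digits, "prd_" plus digits, "router_" plus digits), so treat it as a placeholder for your real naming scheme; the class name ProductRegex is likewise invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductRegex {
    // Hypothetical pattern derived from the sample names icm2500, prd_234, router_34.
    private static final Pattern PRODUCT =
            Pattern.compile("\\b(?:icm\\d+|prd_\\d+|router_\\d+)\\b");

    // Returns all substrings of text that match the product pattern, in order.
    static List<String> findProducts(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PRODUCT.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findProducts("what is the risk value on icm2500."));   // [icm2500]
        System.out.println(findProducts("Delivery of prd_234 will be arrived late.")); // [prd_234]
        System.out.println(findProducts("Watson is handling router_34."));         // [router_34]
    }
}
```

This needs no training data at all, but it only finds names that follow a pattern you have written down in advance.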
Here are the OpenNLP docs that cover the NameFinder a bit:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.tool
I am new to OpenNLP. I use OpenNLP to find location names in a sentence. My input string is "Italy pardons US colonel in CIA case". I cannot find the word "Italy" in the result set. How can I solve this problem? Thanks in advance!
try {
InputStream modelIn = new FileInputStream("en-token.bin");
TokenizerModel tokenModel = new TokenizerModel(modelIn);
modelIn.close();
Tokenizer tokenizer = new TokenizerME(tokenModel);
NameFinderME nameFinder =
new NameFinderME(
new TokenNameFinderModel(new FileInputStream("en-ner-location.bin")));
String tokens[] = tokenizer.tokenize(documentStr);
Span nameSpans[] = nameFinder.find(tokens);
for( int i = 0; i<nameSpans.length; i++) {
System.out.println("Span: "+nameSpans[i].toString());
}
}
catch(Exception e) {
System.out.println(e.toString());
}
OpenNLP results depend on the data the model was trained on. The en-ner-location.bin file at SourceForge may not contain samples that resemble your data. Also, extracting proper nouns (NNP) or noun phrases with a POS tagger or chunker would not isolate only locations. So the answer to your question is: the model doesn't account for every case in your data, which is why you don't get a hit on this particular sentence. BTW, NER is never perfect and will always return some false positives and false negatives.
I am currently creating a Java app that does not require a database,
which is why I am using a text file instead.
The structure of the file is like this:
unique6id username identitynumber point
unique6id username identitynumber point
How can I read the file, find the row with a matching unique6id, and then update the points in that row?
Sorry for the lack of information.
Here is the part I have written so far:
public class Cust{
    String name;
    long ic, uniqueID;
    int pts;
    Cust(){}
    Cust(String n, long i, long uni, int pt){
        name = n;
        ic = i;
        uniqueID = uni;
        pts = pt;
    }
}
// Append one customer record to Data.txt
FileWriter fstream = new FileWriter("Data.txt", true);
BufferedWriter fbw = new BufferedWriter(fstream);
Cust newCust = new Cust();
newCust.name = memUNTF.getText();
newCust.ic = Long.parseLong(memICTF.getText());
newCust.uniqueID = Long.parseLong(memIDTF.getText());
newCust.pts = points;
fbw.write(newCust.name + " " + newCust.ic + " " + newCust.uniqueID + " " + newCust.pts);
fbw.newLine();
fbw.close();
This is how I write the data.
The resulting content of Data.txt is:
spencerlim 900419129876 448505 0
Eugene 900419081234 586026 0
When the user types in 586026, the program should fetch Eugene's row,
bind it to a Cust object,
and update the pts (0 in this case) to another number, e.g. 30.
Thanks for any replies =D
Reading is pretty easy, but updating a text file in place (i.e. without rewriting the whole file) is very awkward.
So, you have two options:
Read the whole file, make your changes, and then write the whole file to disk, overwriting the old version; this is quite easy, and will be fast enough for small files, but is not a good idea for very large files.
Use a format that is not a simple text file. A database would be one option (and bear in mind that Derby shipped with JDK 6 to 8 as Java DB, although it was removed in JDK 9); there are other ways of keeping simple key-value stores on disk (like a HashMap, but in a file), but there's nothing for that built into the JDK.
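The first option can be sketched with nothing but the standard library. This assumes the "name ic uniqueID pts" column order of the sample rows in the question, space-separated fields and a UTF-8 file; the class name PointsUpdater and its method are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class PointsUpdater {
    // Returns a copy of lines in which the row whose third column equals uniqueId
    // has its fourth column (the points) replaced by newPoints.
    // Assumes the "name ic uniqueID pts" layout of the sample Data.txt rows.
    static List<String> withUpdatedPoints(List<String> lines, String uniqueId, int newPoints) {
        List<String> updated = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 4 && fields[2].equals(uniqueId)) {
                fields[3] = Integer.toString(newPoints);
                updated.add(String.join(" ", fields));
            } else {
                updated.add(line); // leave every other row untouched
            }
        }
        return updated;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("Data.txt");
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        // Overwrites the old file with the modified contents.
        Files.write(file, withUpdatedPoints(lines, "586026", 30), StandardCharsets.UTF_8);
    }
}
```

With the two sample rows, Eugene's line would become "Eugene 900419081234 586026 30" and the other line would stay as it is.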
You can use OpenCSV with custom separators.
Here's a sample method that updates the info for a specified user:
public static void updateUserInfo(
    String userId, // user id
    String[] values // new values
) throws IOException{
    String fileName = "yourfile.txt.csv";
    CSVReader reader = new CSVReader(new FileReader(fileName), ' ');
    List<String[]> lines = reader.readAll();
    reader.close(); // release the file before it is overwritten below
    for(String[] items : lines){
        if(items[0].equals(userId)){
            for(int i = 0; i < values.length; i++){
                String value = values[i];
                if(value!=null){
                    // for every array value that's not null,
                    // update the corresponding field
                    items[i+1]=value;
                }
            }
            break;
        }
    }
    // NO_QUOTE_CHARACTER keeps the fields unquoted, like the original file
    CSVWriter writer = new CSVWriter(new FileWriter(fileName), ' ', CSVWriter.NO_QUOTE_CHARACTER);
    writer.writeAll(lines);
    writer.close(); // flushes the buffered output to disk
}
Use InputStreams and Readers to read the file.
Here is a code snippet that shows how to read a file:
try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream("c:/myfile.txt")))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // do something with the line.
    }
}
Use OutputStreams and Writers to write to the file. Although you can use a random access file, i.e. write to a specific place in the file, I do not recommend it. A much easier and more robust way is to create a new file every time you have to write something. I know that this is probably not the most efficient way, but you do not want to use a DB for some reason. If you have to save and update partial information relatively often and search within the file, I'd recommend using a DB. There are very lightweight implementations, including pure Java ones (e.g. H2: http://www.h2database.com/html/main.html).
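To illustrate the "create a new file every time" idea, here is a small sketch that streams the old file line by line into a temporary file and then swaps it in. It assumes UTF-8 text and that the temporary file may be created in the same directory; the class name FileRewriter and the transform parameter are invented for this example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.function.UnaryOperator;

class FileRewriter {
    // Copies source line by line into a temporary file next to it, applying
    // transform to each line, then replaces source with the temporary file.
    static void rewrite(Path source, UnaryOperator<String> transform) throws IOException {
        Path dir = source.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "rewrite", ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(transform.apply(line));
                writer.newLine();
            }
        }
        // Both streams are closed here, so the old file can be replaced.
        Files.move(temp, source, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("demo", ".txt");
        Files.write(file, Arrays.asList("alpha 0", "beta 0"), StandardCharsets.UTF_8);
        // Hypothetical change: set the number on the "beta" line to 30.
        rewrite(file, line -> line.startsWith("beta ") ? "beta 30" : line);
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

Because only one line is held in memory at a time, this also works for files that are too large to read in one go, and the original file stays intact until the new one has been written completely.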
I downloaded the Apache HWPF. I want to use it to read a doc file and write its text into a plain text file. I don't know the HWPF so well.
My very simple program is here:
I have 3 problems now:
Some of the packages have errors (they can't find apache hdf). How can I fix them?
How can I use the methods of HWPF to find and extract the images?
Some parts of my program are incomplete and incorrect, so please help me complete it.
I have to complete this program in 2 days.
once again I repeat Please Please help me to complete this.
Thanks you Guys a lot for your help!!!
This is my elementary code :
public class test {
public void m1 (){
String filesname = "Hello.doc";
POIFSFileSystem fs = null;
fs = new POIFSFileSystem(new FileInputStream(filesname );
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
String str = we.getText() ;
String[] paragraphs = we.getParagraphText();
Picture pic = new Picture(. . .) ;
pic.writeImageContent( . . . ) ;
PicturesTable picTable = new PicturesTable( . . . ) ;
if ( picTable.hasPicture( . . . ) ){
picTable.extractPicture(..., ...);
picTable.getAllPictures() ;
}
}
Apache Tika will do this for you. It handles talking to POI to do the HWPF stuff, and presents you with either XHTML or Plain Text for the contents of the file. If you register a recursing parser, then you'll also get all the embedded images too.
//you can use the org.apache.poi.hwpf.extractor.WordExtractor to get the text
String fileName = "example.doc";
HWPFDocument wordDoc = new HWPFDocument(new FileInputStream(fileName));
WordExtractor extractor = new WordExtractor(wordDoc);
String[] text = extractor.getParagraphText();
// Accumulates the text of the Word document. Strings are immutable,
// so the paragraphs are appended to a StringBuilder.
StringBuilder article = new StringBuilder();
for(int index = 0; index < text.length; ++index){
    String paragraphStr = text[index].replaceAll("\r\n","").replaceAll("\n","").trim();
    if(paragraphStr.length() != 0){
        article.append(paragraphStr);
    }
}
String articleStr = article.toString();
//you can use the org.apache.poi.hwpf.usermodel.Picture to get the image
List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();
for(int i = 0;i < picturesList.size();++i){
BufferedImage image = null;
Picture pic = picturesList.get(i);
image = ImageIO.read(new ByteArrayInputStream(pic.getContent()));
if(image != null){
System.out.println("Image["+i+"]"+" ImageWidth:"+image.getWidth()+" ImageHeight:"+image.getHeight()+" Suggest Image Format:"+pic.suggestFileExtension());
}
}
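The question also asks for the text to end up in a plain text file. A minimal sketch of that last step using only the standard library, assuming you already have the String[] returned by getParagraphText(); the class name ParagraphWriter, the output name out.txt and the stand-in paragraphs are all made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class ParagraphWriter {
    // Trims each paragraph and drops the ones that are empty afterwards.
    static List<String> clean(String[] paragraphs) {
        List<String> lines = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    // Writes one paragraph per line to target as UTF-8.
    static void writeTo(Path target, String[] paragraphs) throws IOException {
        Files.write(target, clean(paragraphs), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the array returned by extractor.getParagraphText().
        String[] paragraphs = {"First paragraph.\r\n", "\r\n", "Second paragraph.\r\n"};
        writeTo(Paths.get("out.txt"), paragraphs);
    }
}
```

Writing one paragraph per line keeps the paragraph boundaries of the document, which are lost if everything is concatenated into a single string.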
If you just want to do this, and you don't care about the coding, you can just use Antiword.
$ antiword file.doc > out.txt
I know this is long after the fact, but I've found TextMining on Google Code to be more accurate and very easy to use. It is, however, pretty much abandoned code.