If I run a command like this on the command line
./opennlp TokenNameFinder en-ner-person.bin "input.txt" "output.txt"
I get person names printed in output.txt, but I want to build my own models so that I can extract my own entities.
E.g.
what is the risk value on icm2500.
Delivery of prd_234 will be arrived late.
Watson is handling router_34.
If I pass in these lines, the model should parse them and extract the product entities: icm2500, prd_234, router_34, and so on. These are all products (we can save this list in a file and use it as a lookup for the models or for OpenNLP).
Can anyone please tell me how to do this?
You'll need to train your own model by annotating some sentences in the opennlp format. For the example sentences you posted the format would look like this:
what is the risk value on <START:product> icm2500 <END>.
Delivery of <START:product> prd_234 <END> will be arrived late.
Watson is handling <START:product> router_34 <END>.
Make sure each sentence is on its own line; if a sentence itself contains newlines, remove or escape them first.
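If you already keep the product names in a lookup list, the annotated lines can be generated rather than typed by hand. The following is only a sketch under stated assumptions: `ProductAnnotator` and `annotate` are hypothetical names, the lookup set is assumed to hold lower-case names, and punctuation is split into its own whitespace-separated token, which is how I understand the name finder expects its training text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: wraps known product names in <START:product> ... <END> tags.
class ProductAnnotator {

    // products is assumed to contain lower-case names from your lookup file.
    static String annotate(String sentence, Set<String> products) {
        List<String> parts = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            // Split trailing punctuation off so "icm2500." still matches "icm2500".
            String word = token.replaceAll("[.,;:!?]+$", "");
            String tail = token.substring(word.length());
            if (products.contains(word.toLowerCase(Locale.ROOT))) {
                parts.add("<START:product> " + word + " <END>");
            } else if (!word.isEmpty()) {
                parts.add(word);
            }
            if (!tail.isEmpty()) {
                parts.add(tail); // punctuation becomes its own token
            }
        }
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        Set<String> products = new HashSet<>(Arrays.asList("icm2500", "prd_234", "router_34"));
        System.out.println(annotate("Watson is handling router_34.", products));
    }
}
```

A model trained only on lines produced this way can do little more than memorise the list, so it is worth mixing in sentences with varied contexts.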
Once you make a file like this out of your data, then you can use the Java API to train the model like this
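import java.io.*;
import java.nio.charset.Charset;
import java.util.Collections;
import opennlp.tools.namefind.*;
import opennlp.tools.util.*;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

public static void main(String[] args) throws IOException {
    Charset charset = Charset.forName("UTF-8");
    ObjectStream<String> lineStream =
        new PlainTextByLineStream(new FileInputStream("your file in the above format"), charset);
    ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
    TokenNameFinderModel model;
    try {
        // "product" matches the <START:product> tag used in the training file.
        // With OpenNLP 1.5.x the cast selects the AdaptiveFeatureGenerator overload;
        // a bare null is ambiguous with the byte[] overload.
        model = NameFinderME.train("en", "product", sampleStream, TrainingParameters.defaultParams(),
            (AdaptiveFeatureGenerator) null, Collections.<String, Object>emptyMap());
    }
    finally {
        sampleStream.close();
    }
    // Placeholder output path for the trained model; pick your own file name.
    File modelFile = new File("en-ner-product.bin");
    OutputStream modelOut = null;
    try {
        modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
        model.serialize(modelOut);
    } finally {
        if (modelOut != null)
            modelOut.close();
    }
}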
Now you can use the model with the name finder.
Because you may have a definitive, and possibly short, list of product names, you might consider a simple regex approach.
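As a rough illustration of the regex approach, here is a minimal sketch. The pattern is an assumption derived only from the three sample names (letters, an optional underscore, then digits), and `ProductRegex` and `findProducts` are made-up names; real product identifiers may well need a different pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based extractor for product-like tokens.
class ProductRegex {

    // Assumed shape: letters, optional underscore, digits (icm2500, prd_234, router_34).
    private static final Pattern PRODUCT = Pattern.compile("\\b[A-Za-z]+_?\\d+\\b");

    static List<String> findProducts(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PRODUCT.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findProducts("Delivery of prd_234 will be arrived late."));
    }
}
```

If the list of products is fixed, checking each token against a set of known names is simpler still and avoids false matches on unrelated alphanumeric tokens.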
Here are the OpenNLP docs that cover training the name finder:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.tool
I am using Java 8 and OpenNLP. I am trying to extract all the nouns from sentences.
I have tried this example, but it extracts all noun phrases ("NP"). Does anyone know how I can just extract the individual noun words?
Thanks
What have you tried so far? I haven't looked at the example you link to in much detail, but I'm pretty sure you could get where you want by modifying that example.
In any case, it's not very difficult:
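// Requires: java.io.*, opennlp.tools.postag.POSModel, opennlp.tools.postag.POSTaggerME,
// opennlp.tools.tokenize.SimpleTokenizer
File f = new File("<location to your tagger model here>");
try (InputStream modelIn = new FileInputStream(f)) {
    POSModel posModel = new POSModel(modelIn);
    POSTaggerME tagger = new POSTaggerME(posModel);
    // The SimpleTokenizer constructor is deprecated; use the shared instance.
    SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize("This is a sample sentence.");
    String[] tagged = tagger.tag(tokens);
    for (int i = 0; i < tagged.length; i++) {
        // Penn Treebank noun tags all start with "NN": NN, NNS, NNP, NNPS.
        if (tagged[i].startsWith("NN")) {
            System.out.println(tokens[i]);
        }
    }
}
catch (IOException e) {
    // BadRequestException is not part of OpenNLP; substitute an exception type from your own code.
    throw new BadRequestException(e.getMessage());
}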
You can download the tagger models here: http://opennlp.sourceforge.net/models-1.5/
I should add that creating a SimpleTokenizer with new is deprecated; SimpleTokenizer.INSTANCE is the replacement. You may want to look into a more sophisticated tokenizer, but in my experience the fancier ones in OpenNLP are also a lot slower (and in general unacceptably slow for plain tokenisation).
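The filtering step can be separated from the tagger, which makes it easy to check on its own. This is a sketch, assuming the tagger emits Penn Treebank tags, where every noun tag (NN, NNS, NNP, NNPS) starts with "NN"; `NounFilter` is a hypothetical name, and the tags in `main` are written by hand for illustration rather than taken from an actual tagger run.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: keeps the tokens whose POS tag marks them as a noun.
class NounFilter {

    // tokens and tags are parallel arrays, e.g. the input and output of POSTaggerME.tag.
    static List<String> nouns(String[] tokens, String[] tags) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < tokens.length && i < tags.length; i++) {
            if (tags[i].startsWith("NN")) {
                result.add(tokens[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"This", "is", "a", "sample", "sentence", "."};
        String[] tags = {"DT", "VBZ", "DT", "NN", "NN", "."}; // hand-written tags
        System.out.println(nouns(tokens, tags));
    }
}
```

To keep only common nouns and drop proper names, compare against "NN" and "NNS" explicitly instead of using the prefix test.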
I am using univocity to parse files to beans, perform some magic on the objects, and afterwards write the beans back to a file. The code snippet that duplicates my problem:
public class Sample {
public static void main(String[] args) throws IOException {
BeanListProcessor<Person> rowProcessor = new BeanListProcessor<Person>(Person.class);
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setProcessor(rowProcessor);
parserSettings.getFormat().setDelimiter(',');
parserSettings.getFormat().setQuote('"');
parserSettings.getFormat().setQuoteEscape('/');
CsvParser parser = new CsvParser(parserSettings);
parser.parse(new FileReader("src/main/resources/person.csv"));
List<Person> beans = rowProcessor.getBeans();
Writer outputWriter = new FileWriter("src/main/resources/personOut.csv", true);
CsvWriterSettings settings = new CsvWriterSettings();
settings.getFormat().setDelimiter(',');
settings.getFormat().setQuote('"');
settings.getFormat().setQuoteEscape('/');
settings.setRowWriterProcessor(new BeanWriterProcessor<Person>(Person.class));
CsvWriter writer = new CsvWriter(outputWriter, settings);
for (Person person : beans) {
writer.processRecord(person);
}
writer.close();
}
}
Bean class (getters and setters omitted):
public class Person {
    @Parsed(index = 0)
    private int id;
    @Parsed(index = 1)
    private String firstName;
    @Parsed(index = 2)
    private String lastName;
}
And the file content (input):
1,"Alex ,/,awesome/,",chan
2,"Peter ,boring",pitt
The file is not under my control, as it is provided by an external party. After the application performs its operations on the objects (not included in the code snippet), I need to return the file to the external party with exactly the same settings.
Desired file output content:
1,"Alex ,/,awesome/,",chan
2,"Peter ,boring",pitt
However I am getting:
1,"Alex ,,awesome,",chan
2,"Peter ,boring",pitt
Is there a way to include the actual used quote escape character when writing the beans back to a file?
EDIT
I have tried using settings.setQuoteAllFields(true); on the writer and all combinations of
parserSettings.setKeepQuotes(true);
parserSettings.setKeepEscapeSequences(true);
on the CsvParserSettings, but none seem to give me the result I am looking for.
The solution for this is divided in two parts:
First, please make sure you use version 2.2.3, as it's been just released with an adjustment to properly capture the quote escape character if it's not been escaped, which is your case:
Here the / is not escaped:
"Alex ,/,awesome/,"
Versions prior to 2.2.3 would expect this:
"Alex ,//,awesome//,"
Second, when writing this back to the output, you don't want the writer to escape the escape character itself. That is, given the string "Alex ,/,awesome/,"
you want to get:
"Alex ,/,awesome/,"
Instead of what it will do by default:
"Alex ,//,awesome//,"
On the CsvWriterSettings, do this:
settings.getFormat().setCharToEscapeQuoteEscaping('\0');
And it will not try to escape the quote escape character.
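To make the difference concrete, here is a small stand-alone sketch of the two writing behaviours described above. It is purely illustrative and is not univocity's code: `QuoteEscapeSketch` and `quoteField` are hypothetical names, and passing '\0' as the last argument stands for "do not escape the escape character".

```java
// Illustration only: quotes a field, optionally escaping the escape character itself.
class QuoteEscapeSketch {

    static String quoteField(String value, char quote, char escape, char escapeOfEscape) {
        StringBuilder sb = new StringBuilder().append(quote);
        for (char c : value.toCharArray()) {
            if (c == quote) {
                sb.append(escape);          // a quote inside the field is always escaped
            } else if (c == escape && escapeOfEscape != '\0') {
                sb.append(escapeOfEscape);  // the escape character is escaped only on request
            }
            sb.append(c);
        }
        return sb.append(quote).toString();
    }

    public static void main(String[] args) {
        String field = "Alex ,/,awesome/,";
        System.out.println(quoteField(field, '"', '/', '/'));   // slashes doubled
        System.out.println(quoteField(field, '"', '/', '\0'));  // slashes left alone
    }
}
```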
Hope it helps
The JSON example file consists of:
{
    "1st_key": "value1",
    "2nd_key": "value2",
    "object_keys": {
        "obj_1st": "value1",
        "obj_2nd": "value2",
        "obj_3rd": "value3",
    }
}
I read the JSON file into a String with this StringBuilder method, in order to add the newlines into the string itself. So the String looks exactly like the JSON file above.
public String getJsonContent(String fileName) {
    StringBuilder result = new StringBuilder("");
    File file = new File(fileName);
    // try-with-resources closes the scanner, so no explicit close() is needed
    try (Scanner scanner = new Scanner(file)) {
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            result.append(line).append("\n");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return result.toString();
}
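If the goal is simply to get the file content into a String with its line breaks intact, the standard library can read the whole file in one call. This is a sketch of that alternative, not what the question uses; `JsonFileReader` and `readAll` are hypothetical names, and UTF-8 is an assumed encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: reads a whole file into a String without touching its line breaks.
class JsonFileReader {

    static String readAll(String fileName) throws IOException {
        // Assumes the file is UTF-8 encoded.
        return new String(Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8);
    }
}
```

Unlike the Scanner loop, this keeps the original line endings as they are and does not append a trailing newline of its own.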
Then I translate the JSON file into an Object using MongoDB API (with DBObject, BasicDBObject and util.JSON) and I call out the Object section I need to change, which is 'object_keys':
File jsonFile = new File("C:\\example.json");
String jsonString = getJsonContent(jsonFile.getAbsolutePath());
DBObject jsonObject = (DBObject)JSON.parse(jsonString);
BasicDBObject objectKeys = (BasicDBObject) jsonObject.get("object_keys");
Now I can write new values into the Object using the put method like this:
objectKeys.put("obj_1st","NEW_VALUE1");
objectKeys.put("obj_2nd","NEW_VALUE2");
objectKeys.put("obj_3rd","NEW_VALUE3");
Note: the following part turned out not to be needed; see my answer below.
After I have changed the object, I need to write it back into the json file, so I need to translate the Object into a String. There are two methods to do this, either one works.
String newJSON = jsonObject.toString();
or
String newJSON = JSON.serialize(jsonObject);
Then I write the content back into the file using PrintWriter
PrintWriter writer = new PrintWriter("C:\\example.json");
writer.print(newJSON);
writer.close();
The problem I am facing now is that the String that is written is on a single line with no formatting whatsoever. Somewhere along the way it lost all the newlines. So it basically looks like this:
{"1st_key": "value1","2nd_key": "value2","object_keys": { "obj_1st": "NEW_VALUE1","obj_2nd": "NEW_VALUE2","obj_3rd": "NEW_VALUE3", }}
I need to write the JSON file back in the same format as shown in the beginning, keeping all the tabulation, spaces etc.
Is this possible somehow ?
What you are describing is usually called pretty-printing (or beautifying) the output. A quick search for terms such as "output beautiful JSON" turns up what I believe solves your problem.
Solution
You're going to have to use a json parser of some sort. I personally prefer org.json and would recommend it if you are manipulating the json data, but you may also like json-io which is really good for json serialization with no external dependencies.
With json-io, it's as simple as
String formattedJson = JsonWriter.formatJson(jsonObject.toString());
With org.json, you simply pass an int to the toString method.
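To show what such a pretty printer does under the hood, here is a minimal standard-library sketch that re-indents an already valid, compact JSON string. It is not how json-io or org.json implement it; `JsonIndenter` is a hypothetical name, the input is assumed to be well-formed, and empty objects or arrays come out split over two lines. In a real project the libraries above are the better choice.

```java
// Hypothetical re-indenter for compact, well-formed JSON text.
class JsonIndenter {

    static String indent(String json, int width) {
        StringBuilder out = new StringBuilder();
        int depth = 0;
        boolean inString = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {
                out.append(c);
                if (c == '\\' && i + 1 < json.length()) {
                    out.append(json.charAt(++i)); // copy the escaped character verbatim
                } else if (c == '"') {
                    inString = false;
                }
                continue;
            }
            switch (c) {
                case '"':
                    inString = true;
                    out.append(c);
                    break;
                case '{':
                case '[':
                    out.append(c);
                    newline(out, ++depth, width);
                    break;
                case '}':
                case ']':
                    newline(out, --depth, width);
                    out.append(c);
                    break;
                case ',':
                    out.append(c);
                    newline(out, depth, width);
                    break;
                case ':':
                    out.append(": ");
                    break;
                default:
                    if (!Character.isWhitespace(c)) {
                        out.append(c); // whitespace outside strings is dropped
                    }
            }
        }
        return out.toString();
    }

    private static void newline(StringBuilder out, int depth, int width) {
        out.append('\n');
        for (int i = 0; i < depth * width; i++) {
            out.append(' ');
        }
    }
}
```

Calling `JsonIndenter.indent(newJSON, 4)` on a single-line string would give one key per line with four spaces per nesting level.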
Thanks Saraiva, I found a surprisingly simple solution by Googling around with the words 'pretty printing JSON' and used the Google GSON library. I downloaded the .jar and added it to my project in Eclipse.
These are the new imports I needed:
import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
Since I already had the JSON Object (jsonObject) readily available from my previous code, I only needed to add two new lines:
Gson gson = new GsonBuilder().setPrettyPrinting().create();
String newJSON = gson.toJson(jsonObject);
Now when I use writer.print(newJSON); it will write the JSON in the right format, beautifully formatted and indented.
I've written a Java program that ingests data from a .csv and converts those data into RDFXML. I used Sesame's framework when writing this program, and the program successfully does what it was written to do.
However, I am trying to unit test this program using JUnit, and I need to test a method which converts RDF triples (in Turtle format) to RDFXML. To show that the method works correctly, I would like to convert the RDFXML back into triples and compare them to the original triples I passed into the method. So far, I have not found anything in Sesame's documentation that does this. Any suggestions?
I just solved the problem a few minutes ago. Here's my solution:
@Test
public void testWriteStmtToRDFPos() {
    // sexOffend, predicate, object, converter and rdfFile are assumed to be defined
    // elsewhere in the test class.
    RDFParser parser = new RDFXMLParser();
    String baseURI = "";
    Model origStmts = new LinkedHashModel();
    Model processedStmts = new LinkedHashModel();
    // The collector stores every statement the parser reads into processedStmts.
    StatementCollector collector = new StatementCollector(processedStmts);
    parser.setRDFHandler(collector);
    origStmts.add(sexOffend, predicate, object);
    try {
        converter.writeStmtToRDF(origStmts, rdfFile);
        FileReader reader = new FileReader(rdfFile);
        parser.parse(reader, baseURI);
        // Fail the test when the round-tripped statements differ from the originals.
        assertTrue(origStmts.equals(processedStmts));
    } catch (Exception e) {
        e.printStackTrace();
        fail();
    }
}
When you set the collector as the parser's RDF handler, as above, it simply collects every statement the parser reads into processedStmts. After parsing, you can compare processedStmts with origStmts. This wasn't immediately obvious, but it is really useful once you find it!
I am new to OpenNLP. I use it to find location names in a sentence. My input string is "Italy pardons US colonel in CIA case". I cannot find the word "Italy" in the result set. How can I solve this problem? Thanks in advance!
try {
    InputStream modelIn = new FileInputStream("en-token.bin");
    TokenizerModel tokenModel = new TokenizerModel(modelIn);
    modelIn.close();
    Tokenizer tokenizer = new TokenizerME(tokenModel);
    NameFinderME nameFinder =
        new NameFinderME(
            new TokenNameFinderModel(new FileInputStream("en-ner-location.bin")));
    String tokens[] = tokenizer.tokenize(documentStr);
    Span nameSpans[] = nameFinder.find(tokens);
    for (int i = 0; i < nameSpans.length; i++) {
        System.out.println("Span: " + nameSpans[i].toString());
    }
}
catch (Exception e) {
    System.out.println(e.toString());
}
OpenNLP results depend on the data the model was trained on. The en-ner-location.bin file at SourceForge may not contain samples that resemble your data. Also, extracting nouns or noun phrases (for example NNP-tagged tokens) with a POS tagger or chunker will not isolate only locations. So the answer to your question is: the model doesn't account for every case in your data, which is why you don't get a hit on this particular sentence. BTW, NER is never perfect and will always return some false positives and false negatives.