For Lucene 3.6.2 I have the following Analyzer:
public final class StandardAnalyzerV36 extends Analyzer {
private Analyzer analyzer;
public StandardAnalyzerV36() {
analyzer = new StandardAnalyzer(Version.LUCENE_36);
}
public StandardAnalyzerV36(Set<?> stopWords) {
analyzer = new StandardAnalyzer(Version.LUCENE_36, stopWords);
}
@Override
public final TokenStream tokenStream(String fieldName, Reader reader) {
return analyzer.tokenStream(fieldName, new HTMLStripCharFilter(CharReader.get(reader)));
}
@Override
public final TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
return analyzer.reusableTokenStream(fieldName, reader);
}
}
Could you please help me port it to the new Analyzer API in Lucene 5.5.0? The Analyzer interface has changed in the new version.
UPDATED
I have reimplemented this Analyzer as follows:
public final class StandardAnalyzerV36 extends Analyzer {
public static final CharArraySet STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
@Override
protected TokenStreamComponents createComponents(String fieldName) {
final ClassicTokenizer src = new ClassicTokenizer();
TokenStream tok = new StandardFilter(src);
tok = new StopFilter(new LowerCaseFilter(tok), STOP_WORDS_SET);
return new TokenStreamComponents(src, tok);
}
@Override
protected Reader initReader(String fieldName, Reader reader) {
return new HTMLStripCharFilter(reader);
}
}
but my tests fail on the following call:
tokens = LuceneUtils.tokenizeString(analyzer, "[{(RDBMS)}]");
public static List<String> tokenizeString(Analyzer analyzer, String string) {
List<String> result = new ArrayList<String>();
try {
TokenStream stream = analyzer.tokenStream(null, new StringReader(string));
stream.reset();
while (stream.incrementToken()) {
result.add(stream.getAttribute(CharTermAttribute.class).toString());
}
} catch (IOException e) {
// not thrown b/c we're using a string reader...
throw new RuntimeException(e);
}
return result;
}
with the following exception:
java.lang.IllegalStateException: TokenStream contract violation: close() call missing
at org.apache.lucene.analysis.Tokenizer.setReader(Tokenizer.java:90)
at org.apache.lucene.analysis.Analyzer$TokenStreamComponents.setReader(Analyzer.java:315)
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:143)
What is wrong with this code?
Finally I got it working. The Analyzer reuses its TokenStreamComponents, and Tokenizer.setReader() refuses to accept a new reader until the previous stream has been closed, so the consumer has to close the stream when it is done:
public final class StandardAnalyzerV36 extends Analyzer {
public static final CharArraySet STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;
@Override
protected TokenStreamComponents createComponents(String fieldName) {
final ClassicTokenizer src = new ClassicTokenizer();
TokenStream tok = new StandardFilter(src);
tok = new StopFilter(new LowerCaseFilter(tok), STOP_WORDS_SET);
return new TokenStreamComponents(src, tok);
}
@Override
protected Reader initReader(String fieldName, Reader reader) {
return new HTMLStripCharFilter(reader);
}
}
public class LuceneUtils {
public static List<String> tokenizeString(Analyzer analyzer, String string) {
List<String> result = new ArrayList<String>();
TokenStream stream = null;
try {
stream = analyzer.tokenStream(null, new StringReader(string));
stream.reset();
while (stream.incrementToken()) {
result.add(stream.getAttribute(CharTermAttribute.class).toString());
}
} catch (IOException e) {
// not thrown b/c we're using a string reader...
throw new RuntimeException(e);
} finally {
IOUtils.closeQuietly(stream);
}
return result;
}
}
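Since TokenStream implements Closeable, the same cleanup can also be written with try-with-resources. A small sketch of that variant, which additionally calls end() as the TokenStream contract expects after the last incrementToken():
public static List<String> tokenizeString(Analyzer analyzer, String string) {
List<String> result = new ArrayList<>();
// try-with-resources closes the stream even if incrementToken() throws
try (TokenStream stream = analyzer.tokenStream(null, new StringReader(string))) {
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
result.add(term.toString());
}
stream.end(); // signal end-of-stream before close()
} catch (IOException e) {
// not expected with a StringReader
throw new RuntimeException(e);
}
return result;
}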
The problem is the following: I have several reports that I want to mock and test with Mockito. Each report throws the same UnfinishedVerificationException, and nothing I have tried so far has fixed it. An example of one of the reports, with all of its parent classes, is below. So far I have:
Changed any to anyString.
Changed ReportSaver from an interface to an abstract class.
Added validateMockitoUsage to pin down the failing test.
Looked into similar Mockito-related questions on Stack Overflow.
Test:
public class ReportProcessorTest {
private ReportProcessor reportProcessor;
private ByteArrayOutputStream mockOutputStream = (new ReportProcessorMock()).mock();
@SuppressWarnings("serial")
private final static Map<String, Object> epxectedMaps = new HashMap<String, Object>();
@Before
public void setUp() throws IOException {
reportProcessor = mock(ReportProcessor.class);
ReflectionTestUtils.setField(reportProcessor, "systemOffset", "Europe/Berlin");
ReflectionTestUtils.setField(reportProcessor, "redisKeyDelimiter", "#");
Mockito.doNothing().when(reportProcessor).saveReportToDestination(Mockito.any(), Mockito.anyString());
Mockito.doCallRealMethod().when(reportProcessor).process(Mockito.any());
}
@Test
public void calculateSales() throws IOException {
Map<String, Object> processedReport = reportProcessor.process(mockOutputStream);
verify(reportProcessor, times(1)); // The line that causes trouble
assertThat(Maps.difference(processedReport, epxectedMaps).areEqual(), Matchers.is(true));
}
@After
public void validate() {
Mockito.validateMockitoUsage();
}
}
Class under test:
@Component
public class ReportProcessor extends ReportSaver {
@Value("${system.offset}")
private String systemOffset;
@Value("${report.relativePath}")
private String destinationPathToSave;
@Value("${redis.delimiter}")
private String redisKeyDelimiter;
public Map<String, Object> process(ByteArrayOutputStream outputStream) throws IOException {
saveReportToDestination(outputStream, destinationPathToSave);
Map<String, Object> report = new HashMap<>();
try (InputStream inputStream = new ByteArrayInputStream(outputStream.toByteArray());
InputStreamReader reader = new InputStreamReader(inputStream)) {
CSVReaderHeaderAware csvReader = new CSVReaderFormatter(outputStream).headerAware(reader);
Map<String, String> data;
while ((data = csvReader.readMap()) != null) {
String name = data.get("data").toUpperCase(); // renamed so it does not shadow the row map
Long quantity = NumberUtils.toLong(data.get("quantity"));
report.put(name, quantity);
}
}
return report;
}
}
Parent class:
public abstract class ReportSaver {
public void saveReportToDestination(ByteArrayOutputStream outputStream, String destinationPathToSave) throws IOException {
File destinationFile = new File(destinationPathToSave);
destinationFile.getParentFile().mkdirs();
destinationFile.delete();
destinationFile.createNewFile();
OutputStream fileOutput = new FileOutputStream(destinationFile);
outputStream.writeTo(fileOutput);
}
}
Mock:
public class ReportProcessorMock implements GeneralReportProcessorMock {
private static final String report = ""; // There can be some data in here
@Override
public ByteArrayOutputStream mock() {
byte[] reportBytes = report.getBytes();
ByteArrayOutputStream outputStream = new ByteArrayOutputStream(reportBytes.length);
outputStream.write(reportBytes, 0, reportBytes.length);
return outputStream;
}
}
When you verify, you verify a particular public method of the mock:
verify(reportProcessor, times(1)).process(mockOutputStream);
or use a wildcard if appropriate:
verify(reportProcessor, times(1)).process(any(ByteArrayOutputStream.class));
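Put together, the verification in the failing test could look like this (a sketch reusing the fields from the test above):
@Test
public void calculateSales() throws IOException {
Map<String, Object> processedReport = reportProcessor.process(mockOutputStream);
// verify a concrete method call instead of passing only the mock to verify()
verify(reportProcessor, times(1)).process(mockOutputStream);
assertThat(Maps.difference(processedReport, epxectedMaps).areEqual(), Matchers.is(true));
}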
Here is the initialization code:
public class Main {
public void index(String input_path, String index_dir, String separator, String extension, String field, DataHandler handler) {
Index index = new Index(handler);
index.initWriter(index_dir, new StandardAnalyzer());
index.run(input_path, field, extension, separator);
}
public List<?> search(String input_path, String index_dir, String separator, String extension, String field, DataHandler handler) {
Search search = new Search(handler);
search.initSearcher(index_dir, new StandardAnalyzer());
return search.runUsingFiles(input_path, field, extension, separator);
}
@SuppressWarnings("unchecked")
public static void main(String[] args) {
String lang = "en-US";
String dType = "data";
String train = "res/input/" +lang+ "/" +dType +"/train/";
String test = "res/input/"+ lang+ "/" +dType+ "/test/";
String separator = "\\|";
String extension = "csv";
String index_dir = "res/index/" +lang+ "." +dType+ ".index";
String output_file = "res/result/" +lang+ "." +dType+ ".output.json";
String searched_field = "utterance";
Main main = new Main();
DataHandler handler = new DataHandler();
main.index(train, index_dir, separator, extension, searched_field, handler);
//List<JSONObject> result = (List<JSONObject>) main.search(test, index_dir, separator, extension, searched_field, handler);
//handler.writeOutputJson(result, output_file);
}
}
And this is my Index class:
public class Index {
private IndexWriter writer;
private DataHandler handler;
public Index(DataHandler handler) {
this.handler = handler;
}
public Index() {
this(new DataHandler());
}
public void initWriter(String index_path, Directory store, Analyzer analyzer) {
IndexWriterConfig config = new IndexWriterConfig(analyzer);
try {
this.writer = new IndexWriter(store, config);
} catch (IOException e) {
e.printStackTrace();
}
}
public void initWriter(String index_path, Analyzer analyzer) {
try {
initWriter(index_path, FSDirectory.open(Paths.get(index_path)), analyzer);
} catch (IOException e) {
e.printStackTrace();
}
}
public void initWriter(String index_path) {
List<String> stopWords = Arrays.asList();
CharArraySet stopSet = new CharArraySet(stopWords, false);
initWriter(index_path, new StandardAnalyzer(stopSet));
}
@SuppressWarnings("unchecked")
public void indexDocs(List<?> datas, String field) throws IOException {
FieldType fieldType = new FieldType();
FieldType fieldType2 = new FieldType();
fieldType.setStored(true);
fieldType.setTokenized(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
fieldType2.setStored(true);
fieldType2.setTokenized(false);
fieldType2.setIndexOptions(IndexOptions.DOCS);
for(int i = 0 ; i < datas.size() ; i++) {
Map<String,String> temp = (Map<String,String>) datas.get(i);
Document doc = new Document();
for(String key : temp.keySet()) {
if(key.equals(field))
continue;
doc.add(new Field(key, temp.get(key), fieldType2));
}
doc.add(new Field(field, temp.get(field), fieldType));
this.writer.addDocument(doc);
}
}
public void run(String path, String field, String extension, String separator) {
List<File> files = this.handler.getInputFiles(path, extension);
List<?> data = this.handler.readDocs(files, separator);
try {
System.out.println("start index");
indexDocs(data, field);
this.writer.commit();
this.writer.close();
System.out.println("done");
} catch (IOException e) {
e.printStackTrace();
}
}
public void run(String path) {
run(path, "search_field", "csv", "\t");
}
}
I made a simple search module using Java and Lucene.
The module consists of two phases: indexing and searching.
In the index phase, it reads CSV files, converts each row to a Document, and adds it to an IndexWriter via IndexWriter.addDocument().
Finally, it calls IndexWriter.commit().
This works fine on my local PC (Windows), but on an Ubuntu PC IndexWriter.commit() never finishes.
IndexWriter.flush() does not return either.
What is the problem?
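As a debugging aid (an assumption on my part, not part of the code above), Lucene's IndexWriterConfig can log the writer's internal activity, which helps narrow down where commit() blocks on the Ubuntu machine:
// Debugging sketch: open the writer with the info stream enabled so flushes,
// merges and directory locking around commit() are logged to stdout.
static void indexWithLogging(String indexDir, List<Document> docs) throws IOException {
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setInfoStream(System.out);
try (Directory dir = FSDirectory.open(Paths.get(indexDir));
IndexWriter writer = new IndexWriter(dir, config)) {
for (Document doc : docs) {
writer.addDocument(doc);
}
writer.commit(); // if this hangs, the info stream shows the last step reached
}
}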
My problem is that I have a List (an ArrayList) filled with my objects of class Daten, which I save to a .json file:
PopUp pop = new PopUp();
Gson gson = new GsonBuilder().create();
JsonWriter writer = new JsonWriter(new FileWriter(new File(System.getProperty("user.home"), "/Wirtschaft.json")));
gson.toJson(allEntries, List.class, writer);
try {
writer.flush();
writer.close();
pop.show("Saved!");
} catch (IOException e) {
pop.show("Trying to save data failed!");
}
To use the list of Daten again, I read everything back from the .json file and store it in a List<Daten>:
Gson gson = new GsonBuilder().create();
List<Daten> allSaves = new ArrayList<>();
if (new File(System.getProperty("user.home"), "/Wirtschaft.json").exists()) {
JsonReader jReader = new JsonReader(new FileReader(new File(System.getProperty("user.home"), "/Wirtschaft.json")));
BufferedReader br = new BufferedReader(new FileReader(new File(System.getProperty("user.home"), "/Wirtschaft.json")));
if (br.readLine() != null) {
allSaves = gson.fromJson(jReader, List.class);
}
br.close();
jReader.close();
}
return allSaves;
Now I want to display this List in a TableView like this:
ObservableList<Daten> listEntries = FXCollections.observableArrayList(Daten.readData());
columnGewicht.setCellValueFactory(new PropertyValueFactory<>("gewicht"));
columnPreis.setCellValueFactory(new PropertyValueFactory<>("preisProStueck"));
columnGewinn.setCellValueFactory(new PropertyValueFactory<>("gewinn"));
columnEB.setCellValueFactory(new PropertyValueFactory<>("eb"));
columnAKK.setCellValueFactory(new PropertyValueFactory<>("akk"));
columnSB.setCellValueFactory(new PropertyValueFactory<>("sb"));
columnGK.setCellValueFactory(new PropertyValueFactory<>("gk"));
columnBoni.setCellValueFactory(new PropertyValueFactory<>("boni"));
table.setItems(listEntries);
The problem is that the TableView remains empty when I use the code above, but if I don't use the Daten read from the file and instead build the list directly, it works and shows everything in the TableView, even with the same numbers:
List<Daten> list = new ArrayList<>();
list.add(new Daten(100, 6, 421, 3, 4, 1, 6, 0));
ObservableList<Daten> listEntries = FXCollections.observableArrayList(list);
columnGewicht.setCellValueFactory(new PropertyValueFactory<>("gewicht"));
columnPreis.setCellValueFactory(new PropertyValueFactory<>("preisProStueck"));
columnGewinn.setCellValueFactory(new PropertyValueFactory<>("gewinn"));
columnEB.setCellValueFactory(new PropertyValueFactory<>("eb"));
columnAKK.setCellValueFactory(new PropertyValueFactory<>("akk"));
columnSB.setCellValueFactory(new PropertyValueFactory<>("sb"));
columnGK.setCellValueFactory(new PropertyValueFactory<>("gk"));
columnBoni.setCellValueFactory(new PropertyValueFactory<>("boni"));
table.setItems(listEntries);
How can I fix this so that the TableView shows the data I read from the file? There cannot be much wrong with the TableView itself, since the other approach works... I am unfortunately clueless.
Thanks in advance!
EDIT:
Here is the Class for Daten:
public class Daten {
private Double gewicht;
private Double preisProStueck;
private Double gewinn;
private Double eb;
private Double akk;
private Double sb;
private Double gk;
private Integer boni;
Daten(double gewicht, double preisProStueck, double gewinn, double EB, double AKK, double SB, double GK, int boni) {
this.gewicht = gewicht;
this.preisProStueck = preisProStueck;
this.gewinn = gewinn;
this.eb = EB;
this.akk = AKK;
this.sb = SB;
this.gk = GK;
this.boni = boni;
}
static List<Daten> readData() throws IOException {
Gson gson = new GsonBuilder().create();
List<Daten> allSaves = new ArrayList<>();
if (new File(System.getProperty("user.home"), "/Wirtschaft.json").exists()) {
JsonReader jReader = new JsonReader(new FileReader(new File(System.getProperty("user.home"), "/Wirtschaft.json")));
BufferedReader br = new BufferedReader(new FileReader(new File(System.getProperty("user.home"), "/Wirtschaft.json")));
if (br.readLine() != null) {
allSaves = gson.fromJson(jReader, List.class);
}
br.close();
jReader.close();
}
return allSaves;
}
private static void writeData(List<Daten> allEntries) throws IOException {
PopUp pop = new PopUp();
Gson gson = new GsonBuilder().create();
JsonWriter writer = new JsonWriter(new FileWriter(new File(System.getProperty("user.home"), "/Wirtschaft.json")));
gson.toJson(allEntries, List.class, writer);
try {
writer.flush();
writer.close();
pop.show("Zeugs wurde gespeichert!");
} catch (IOException e) {
pop.show("Trying to save data failed!");
}
}
static void addData(Daten data) throws IOException {
List<Daten> list = readData();
list.add(data);
writeData(list);
}
public Double getGewicht() {
return gewicht;
}
public void setGewicht(Double gewicht) {
this.gewicht = gewicht;
}
public Double getPreisProStueck() {
return preisProStueck;
}
public void setPreisProStueck(Double preisProStueck) {
this.preisProStueck = preisProStueck;
}
public Double getGewinn() {
return gewinn;
}
public void setGewinn(Double gewinn) {
this.gewinn = gewinn;
}
public Double getEb() {
return eb;
}
public void setEb(Double eb) {
this.eb = eb;
}
public Double getAkk() {
return akk;
}
public void setAkk(Double akk) {
this.akk = akk;
}
public Double getSb() {
return sb;
}
public void setSb(Double sb) {
this.sb = sb;
}
public Double getGk() {
return gk;
}
public void setGk(Double gk) {
this.gk = gk;
}
public Integer getBoni() {
return boni;
}
public void setBoni(Integer boni) {
this.boni = boni;
}}
I am experimenting with OpenNLP 1.7.2 and maxent-3.0.0.jar to train a model for the Thai language. Below is the code that reads the Thai training data and creates the .bin model.
public class TrainPerson {
public static void main(String[] args) throws IOException {
String trainFile = "/Documents/workspace/ThaiOpenNLP/bin/thaiPerson.train";
String modelFile = "/Documents/workspace/ThaiOpenNLP/bin/th-ner-person.bin";
writePersonModel(trainFile, modelFile);
}
private static void writePersonModel(String trainFile, String modelFile)
throws FileNotFoundException, IOException {
Charset charset = Charset.forName("UTF-8");
InputStreamFactory fileInputStream = new MarkableFileInputStreamFactory(new File(trainFile));
ObjectStream<String> lineStream = new PlainTextByLineStream(fileInputStream, charset);
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
TokenNameFinderModel model;
try {
model = NameFinderME.train("th", "person", sampleStream , TrainingParameters.defaultParams(), new TokenNameFinderFactory());
} finally {
sampleStream.close();
}
BufferedOutputStream modelOut = null;
try {
modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
model.serialize(modelOut);
} finally {
if (modelOut != null) {
modelOut.close();
}
}
}}
The Thai training data looks like the attached trainingData file.
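For context, NameSampleDataStream expects one whitespace-tokenized sentence per line, with each name wrapped in <START:person> ... <END> tags and a blank line between documents. Purely as an illustration of the format (reusing the tokens from the test sentence further down), a training line would look like:
<START:person> จอห์น <END> 30 ปี จะ เข้าร่วม ก เริ่มต้น ขึ้น บน มกราคม .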
I am using the output model to detect person names, as shown in the program below. It fails to identify the name.
public class ThaiPersonNameFinder {
static String modelFile = "/Users/avinashpaula/Documents/workspace/ThaiOpenNLP/bin/th-ner-person.bin";
public static void main(String[] args) {
try {
InputStream modelIn = new FileInputStream(new File(modelFile));
TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
NameFinderME nameFinder = new NameFinderME(model);
String sentence[] = new String[]{
"จอห์น",
"30",
"ปี",
"จะ",
"เข้าร่วม",
"ก",
"เริ่มต้น",
"ขึ้น",
"บน",
"มกราคม",
"."
};
Span nameSpans[] = nameFinder.find(sentence);
for (int i = 0; i < nameSpans.length; i++) {
System.out.println(nameSpans[i]);
}
}
catch (IOException e) {
e.printStackTrace();
}
}
}
What am I doing wrong?
I need to build a simple search engine that can recognize and stem Romanian words, including those with diacritics. I used RomanianAnalyzer, but it does not stem the same word consistently when it is written with and without diacritics.
Can you help me with code for adding to or modifying an existing Romanian stemmer?
PS: I edited the question to make it clearer.
You can copy the RomanianAnalyzer source to create a custom analyzer, and add a filter to the analysis chain in the createComponents method. ASCIIFoldingFilter would probably be what you are looking for. I would add it to the end, to be sure that you don't mess up the stemmer when removing the diacritics.
public final class RomanianASCIIAnalyzer extends StopwordAnalyzerBase {
private final CharArraySet stemExclusionSet;
public final static String DEFAULT_STOPWORD_FILE = "stopwords.txt";
private static final String STOPWORDS_COMMENT = "#";
public static CharArraySet getDefaultStopSet(){
return DefaultSetHolder.DEFAULT_STOP_SET;
}
private static class DefaultSetHolder {
static final CharArraySet DEFAULT_STOP_SET;
static {
try {
DEFAULT_STOP_SET = loadStopwordSet(false, RomanianAnalyzer.class,
DEFAULT_STOPWORD_FILE, STOPWORDS_COMMENT);
} catch (IOException ex) {
throw new RuntimeException("Unable to load default stopword set");
}
}
}
public RomanianASCIIAnalyzer() {
this(DefaultSetHolder.DEFAULT_STOP_SET);
}
public RomanianASCIIAnalyzer(CharArraySet stopwords) {
this(stopwords, CharArraySet.EMPTY_SET);
}
public RomanianASCIIAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionSet) {
super(stopwords);
this.stemExclusionSet = CharArraySet.unmodifiableSet(CharArraySet.copy(stemExclusionSet));
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
final Tokenizer source = new StandardTokenizer();
TokenStream result = new StandardFilter(source);
result = new LowerCaseFilter(result);
result = new StopFilter(result, stopwords);
if(!stemExclusionSet.isEmpty())
result = new SetKeywordMarkerFilter(result, stemExclusionSet);
result = new SnowballFilter(result, new RomanianStemmer());
// The following line is the addition made to the RomanianAnalyzer source.
result = new ASCIIFoldingFilter(result);
return new TokenStreamComponents(source, result);
}
}
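A quick way to check the effect is to run the same word with and without diacritics through the analyzer and compare the emitted terms. A small sketch (the helper name and the sample words are just for illustration):
public static void printTokens(Analyzer analyzer, String text) throws IOException {
// prints each term the analyzer produces for the given text
try (TokenStream stream = analyzer.tokenStream("field", new StringReader(text))) {
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
System.out.println(term.toString());
}
stream.end();
}
}
// e.g. printTokens(new RomanianASCIIAnalyzer(), "română"); and printTokens(new RomanianASCIIAnalyzer(), "romana");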