I need to build a simple search engine that can recognize and stem Romanian words, including those with diacritics. I used RomanianAnalyzer, but it does not stem consistently when the same word is written with and without diacritics.
Can you help me with code for adding a Romanian stemmer or modifying an existing one?
PS: I edited the question to make it clearer.
You can copy the RomanianAnalyzer source to create a custom analyzer and add a filter to the analysis chain in the createComponents method. ASCIIFoldingFilter is probably what you are looking for. I would add it at the end of the chain, so that the stemmer still sees the original diacritics and is not thrown off by their removal.
public final class RomanianASCIIAnalyzer extends StopwordAnalyzerBase {
private final CharArraySet stemExclusionSet;
public final static String DEFAULT_STOPWORD_FILE = "stopwords.txt";
private static final String STOPWORDS_COMMENT = "#";
public static CharArraySet getDefaultStopSet(){
return DefaultSetHolder.DEFAULT_STOP_SET;
}
private static class DefaultSetHolder {
static final CharArraySet DEFAULT_STOP_SET;
static {
try {
DEFAULT_STOP_SET = loadStopwordSet(false, RomanianAnalyzer.class,
DEFAULT_STOPWORD_FILE, STOPWORDS_COMMENT);
} catch (IOException ex) {
throw new RuntimeException("Unable to load default stopword set", ex);
}
}
}
public RomanianASCIIAnalyzer() {
this(DefaultSetHolder.DEFAULT_STOP_SET);
}
public RomanianASCIIAnalyzer(CharArraySet stopwords) {
this(stopwords, CharArraySet.EMPTY_SET);
}
public RomanianASCIIAnalyzer(CharArraySet stopwords, CharArraySet stemExclusionSet) {
super(stopwords);
this.stemExclusionSet = CharArraySet.unmodifiableSet(CharArraySet.copy(stemExclusionSet));
}
@Override
protected TokenStreamComponents createComponents(String fieldName) {
final Tokenizer source = new StandardTokenizer();
TokenStream result = new StandardFilter(source);
result = new LowerCaseFilter(result);
result = new StopFilter(result, stopwords);
if(!stemExclusionSet.isEmpty())
result = new SetKeywordMarkerFilter(result, stemExclusionSet);
result = new SnowballFilter(result, new RomanianStemmer());
//This following line is the addition made to the RomanianAnalyzer source.
result = new ASCIIFoldingFilter(result);
return new TokenStreamComponents(source, result);
}
}
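To get a feel for what the folding step at the end of the chain does to a token, here is a minimal plain-Java sketch. It is only a stand-in: it uses java.text.Normalizer rather than Lucene's ASCIIFoldingFilter (which covers many more characters), and the class and method names are made up for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene: approximates diacritic folding
// by decomposing characters and dropping the combining marks.
class DiacriticFoldingSketch {

    static String fold(String input) {
        // NFD splits a letter such as "a with breve" into 'a' + combining breve.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Romanian word written with comma-below t (U+021B) and a-breve (U+0103).
        System.out.println(fold("\u021Bar\u0103"));
        // Same word with the legacy cedilla t (U+0163).
        System.out.println(fold("\u0163ar\u0103"));
        // Already plain ASCII, left unchanged.
        System.out.println(fold("tara"));
    }
}
```

All three spellings fold to the same ASCII string, which is the effect you want on the stemmer's output so that indexed and queried forms can meet.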
Here is the initialization code:
public class Main {
public void index(String input_path, String index_dir, String separator, String extension, String field, DataHandler handler) {
Index index = new Index(handler);
index.initWriter(index_dir, new StandardAnalyzer());
index.run(input_path, field, extension, separator);
}
public List<?> search(String input_path, String index_dir, String separator, String extension, String field, DataHandler handler) {
Search search = new Search(handler);
search.initSearcher(index_dir, new StandardAnalyzer());
return search.runUsingFiles(input_path, field, extension, separator);
}
@SuppressWarnings("unchecked")
public static void main(String[] args) {
String lang = "en-US";
String dType = "data";
String train = "res/input/" +lang+ "/" +dType +"/train/";
String test = "res/input/"+ lang+ "/" +dType+ "/test/";
String separator = "\\|";
String extension = "csv";
String index_dir = "res/index/" +lang+ "." +dType+ ".index";
String output_file = "res/result/" +lang+ "." +dType+ ".output.json";
String searched_field = "utterance";
Main main = new Main();
DataHandler handler = new DataHandler();
main.index(train, index_dir, separator, extension, searched_field, handler);
//List<JSONObject> result = (List<JSONObject>) main.search(test, index_dir, separator, extension, searched_field, handler);
//handler.writeOutputJson(result, output_file);
}
}
This is my indexing code:
public class Index {
private IndexWriter writer;
private DataHandler handler;
public Index(DataHandler handler) {
this.handler = handler;
}
public Index() {
this(new DataHandler());
}
public void initWriter(String index_path, Directory store, Analyzer analyzer) {
IndexWriterConfig config = new IndexWriterConfig(analyzer);
try {
this.writer = new IndexWriter(store, config);
} catch (IOException e) {
e.printStackTrace();
}
}
public void initWriter(String index_path, Analyzer analyzer) {
try {
initWriter(index_path, FSDirectory.open(Paths.get(index_path)), analyzer);
} catch (IOException e) {
e.printStackTrace();
}
}
public void initWriter(String index_path) {
List<String> stopWords = Arrays.asList();
CharArraySet stopSet = new CharArraySet(stopWords, false);
initWriter(index_path, new StandardAnalyzer(stopSet));
}
@SuppressWarnings("unchecked")
public void indexDocs(List<?> datas, String field) throws IOException {
FieldType fieldType = new FieldType();
FieldType fieldType2 = new FieldType();
fieldType.setStored(true);
fieldType.setTokenized(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
fieldType2.setStored(true);
fieldType2.setTokenized(false);
fieldType2.setIndexOptions(IndexOptions.DOCS);
for(int i = 0 ; i < datas.size() ; i++) {
Map<String,String> temp = (Map<String,String>) datas.get(i);
Document doc = new Document();
for(String key : temp.keySet()) {
if(key.equals(field))
continue;
doc.add(new Field(key, temp.get(key), fieldType2));
}
doc.add(new Field(field, temp.get(field), fieldType));
this.writer.addDocument(doc);
}
}
public void run(String path, String field, String extension, String separator) {
List<File> files = this.handler.getInputFiles(path, extension);
List<?> data = this.handler.readDocs(files, separator);
try {
System.out.println("start index");
indexDocs(data, field);
this.writer.commit();
this.writer.close();
System.out.println("done");
} catch (IOException e) {
e.printStackTrace();
}
}
public void run(String path) {
run(path, "search_field", "csv", "\t");
}
}
I made a simple search module using Java and Lucene.
The module consists of two phases: index and search.
In the index phase, it reads CSV files, converts each row to a Document, and adds it to an IndexWriter object using the IndexWriter.addDocument() method.
Finally, it calls the IndexWriter.commit() method.
This works well on my local PC (Windows),
but on an Ubuntu PC the IndexWriter.commit() call never finishes.
Calling IndexWriter.flush() does not work either.
What is the problem?
I need a Lucene Tokenizer that can do the following. Given the string "wines bottle caps", the following queries should succeed
wine
bott
cap
ottl
aps
wine bottl
Here is what I have so far. How might I modify it to work? No query shorter than three characters should match.
public class PorterAnalyzer extends Analyzer {
private final Version version;
public PorterAnalyzer(Version version) {
this.version = version;
}
@Override
@SuppressWarnings("resource")
protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
final StandardTokenizer src = new StandardTokenizer(reader);
TokenStream tok = new StandardFilter(src);
tok = new LowerCaseFilter( tok);
tok = new StopFilter( tok, StandardAnalyzer.STOP_WORDS_SET);
tok = new PorterStemFilter(tok);
return new TokenStreamComponents(src, tok);
}
}
I think you are looking for NGramTokenFilter.
Try, for example (a minimum gram size of 3 keeps queries shorter than three characters from matching):
tok = new NGramTokenFilter(tok, 3, 5);
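To see which substrings an n-gram filter would make searchable, here is a small plain-Java sketch that lists the character n-grams of one token. It is not Lucene's implementation, the emission order may differ from the real filter, and the sizes 3 and 5 are assumed values picked to match the three-character minimum in the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of character n-grams; not the Lucene filter itself.
class NGramSketch {

    static List<String> ngrams(String token, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        // For every start offset, emit every substring whose length is in [minGram, maxGram].
        for (int start = 0; start < token.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints the grams of "bottle"; the list includes "bott" and "ottl" but nothing shorter than 3.
        System.out.println(ngrams("bottle", 3, 5));
    }
}
```

Note that in the analyzer above the n-grams would be built from the stemmed token rather than the raw word, so it is worth checking each sample query against the actual output of the chain.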
I would like to find the Lucene analyzer corresponding to the language of a Java locale.
For instance, Locale.ENGLISH would be mapped to org.apache.lucene.analysis.en.EnglishAnalyzer.
Is there an automated mapping somewhere?
This is not available out-of-the-box. See below the way I do it.
public final class LocaleAwareAnalyzer extends AnalyzerWrapper {
private static final Logger LOG = LoggerFactory.getLogger(LocaleAwareAnalyzer.class);
private final Analyzer defaultAnalyzer;
private final Map<String, Analyzer> perLocaleAnalyzer = perLocaleAnalyzers();
public LocaleAwareAnalyzer(final Analyzer defaultAnalyzer) {
this.defaultAnalyzer = Precondition.notNull("defaultAnalyzer", defaultAnalyzer);
}
@Override
protected Analyzer getWrappedAnalyzer(final String fieldName) {
if (fieldName == null) {
return defaultAnalyzer;
}
final int n = fieldName.indexOf('_');
if (n >= 0) {
// Unfortunately CharArrayMap does not offer get(CharSequence, start, end)
final String locale = fieldName.substring(n + 1);
final Analyzer a = perLocaleAnalyzer.get(locale);
if (a != null) {
return a;
}
LOG.warn("No Analyzer for Locale '{}', using default", locale);
}
return defaultAnalyzer;
}
@Override
protected TokenStreamComponents wrapComponents(final String fieldName,
final TokenStreamComponents components) {
return components;
}
private static Map<String, Analyzer> perLocaleAnalyzers() {
final Map<String, Analyzer> m = new HashMap<>();
m.put("en", new EnglishAnalyzer(Version.LUCENE_43));
m.put("es", new SpanishAnalyzer(Version.LUCENE_43));
m.put("de", new GermanAnalyzer(Version.LUCENE_43));
m.put("fr", new FrenchAnalyzer(Version.LUCENE_43));
// ... etc
return m;
}
}
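If you want to start from a java.util.Locale rather than a field-name suffix, the lookup can be keyed on Locale.getLanguage(). The sketch below is only an illustration of that idea: it maps to analyzer class names as strings so that it runs without Lucene on the classpath, and the class name, method names and the choice of StandardAnalyzer as fallback are my own assumptions.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry: language code -> fully qualified Lucene analyzer class name.
class AnalyzerLookupSketch {

    // Assumed fallback for languages without a dedicated entry.
    private static final String DEFAULT = "org.apache.lucene.analysis.standard.StandardAnalyzer";

    private static final Map<String, String> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("en", "org.apache.lucene.analysis.en.EnglishAnalyzer");
        BY_LANGUAGE.put("es", "org.apache.lucene.analysis.es.SpanishAnalyzer");
        BY_LANGUAGE.put("de", "org.apache.lucene.analysis.de.GermanAnalyzer");
        BY_LANGUAGE.put("fr", "org.apache.lucene.analysis.fr.FrenchAnalyzer");
    }

    // Country and variant are ignored; only the language part selects the analyzer.
    static String analyzerClassFor(Locale locale) {
        return BY_LANGUAGE.getOrDefault(locale.getLanguage(), DEFAULT);
    }

    // Mirrors the field-name convention above: "title_en" -> "en", null when there is no suffix.
    static String languageOfField(String fieldName) {
        int n = fieldName.indexOf('_');
        return n >= 0 ? fieldName.substring(n + 1) : null;
    }

    public static void main(String[] args) {
        System.out.println(analyzerClassFor(Locale.ENGLISH));
        System.out.println(analyzerClassFor(Locale.JAPANESE));
        System.out.println(languageOfField("title_de"));
    }
}
```

In real code the map values would be Analyzer instances, as in perLocaleAnalyzers() above.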
I'm using java.util.ResourceBundle to format my JSTL messages, and this works fine.
I use the MessageFormat class. Now I want to encapsulate this in a single method, getParametrizedMessage(String key, String[] parameters), but I'm not sure how to do it. At the moment it takes quite a lot of code to display just one or two messages with parameters:
UserMessage um = null;
ResourceBundle messages = ResourceBundle.getBundle("messages");
String str = messages.getString("PF1");
Object[] messageArguments = new String[]{nyreg.getNummer()};
MessageFormat formatter = new MessageFormat("");
formatter.applyPattern(messages.getString("PI14"));
String outputPI14 = formatter.format(messageArguments);
formatter.applyPattern(messages.getString("PI15"));
String outputPI15 = formatter.format(messageArguments);
if(ipeaSisFlag)
if(checkIfPCTExistInDB && nyreg.isExistInDB()) {
//um = new ExtendedUserMessage(MessageHandler.getParameterizedMessage("PI15", new String[]{nyreg.getNummer()}) , UserMessage.TYPE_INFORMATION, "Info");
um = new ExtendedUserMessage(outputPI15 , UserMessage.TYPE_INFORMATION, "Info");
…and so on. Can I move this logic into a static method, MessageHandler.getParameterizedMessage? My current version does not work and looks like this:
private final static String dictionaryFileName="messages.properties";
public static String getParameterizedMessage(String key, String [] params){
if (dictionary==null){
loadDictionary();
}
return getParameterizedMessage(dictionary,key,params);
}
private static void loadDictionary(){
String fileName = dictionaryFileName;
try {
dictionary=new Properties();
InputStream fileInput = MessageHandler.class.getClassLoader().getResourceAsStream(fileName);
dictionary.load(fileInput);
fileInput.close();
}
catch(Exception e) {
System.err.println("Exception reading propertiesfile in init "+e);
e.printStackTrace();
dictionary=null;
}
}
How can I make using my parameterized messages as easy as calling a method with a key and parameters?
Thanks for any help
Update
The logic comes from an inherited method in the abstract class that this class extends. The method looks like this:
protected static String getParameterizedMessage(Properties dictionary,String key,String []params){
if (dictionary==null){
return "ERROR";
}
String msg = dictionary.getProperty(key);
if (msg==null){
return "?!Meddelande " +key + " saknas!?";
}
if (params==null){
return msg;
}
StringBuffer buff = new StringBuffer(msg);
for (int i=0;i<params.length;i++){
String placeHolder = "<<"+(i+1)+">>";
if (buff.indexOf(placeHolder)!=-1){
replace(buff,placeHolder,params[i]);
}
else {
remove(buff,placeHolder);
}
}
return buff.toString();
}
I think I must rewrite the method above so that it works with a ResourceBundle rather than just a Properties dictionary.
Update 2
This code seems to work:
public static String getParameterizedMessage(String key, Object [] params){
ResourceBundle messages = ResourceBundle.getBundle("messages");
MessageFormat formatter = new MessageFormat("");
formatter.applyPattern(messages.getString(key));
return formatter.format(params);
}
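One thing to watch when moving from the inherited method to MessageFormat: the old messages use 1-based `<<1>>` placeholders, while MessageFormat expects 0-based `{0}`. A rough sketch of a one-off converter is below. The class name, method name and sample messages are invented for illustration, and it does not handle messages that already contain literal curly braces.

```java
import java.text.MessageFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical migration helper: rewrites "<<n>>" placeholders as MessageFormat "{n-1}".
class PlaceholderSketch {

    private static final Pattern LEGACY = Pattern.compile("<<(\\d+)>>");

    static String toMessageFormatPattern(String legacy) {
        // MessageFormat treats a single quote as an escape character, so double it first.
        String escaped = legacy.replace("'", "''");
        Matcher m = LEGACY.matcher(escaped);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int zeroBased = Integer.parseInt(m.group(1)) - 1;
            m.appendReplacement(sb, Matcher.quoteReplacement("{" + zeroBased + "}"));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String pattern = toMessageFormatPattern("Number <<1>> is registered to <<2>>");
        System.out.println(pattern);
        System.out.println(MessageFormat.format(pattern, "42", "Alice"));
    }
}
```

Passing the parameters as strings, as in the question, also avoids MessageFormat applying locale-specific number formatting to them.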
I'm not really sure what you're trying to achieve; here's what I did in the past:
public static final String localize(final Locale locale, final String key, final Object... param) {
final String name = "message";
final ResourceBundle rb;
/* Resource bundles are cached internally,
never saw a need to implement another caching level
*/
try {
rb = ResourceBundle.getBundle(name, locale, Thread.currentThread()
.getContextClassLoader());
} catch (MissingResourceException e) {
throw new RuntimeException("Bundle not found:" + name);
}
String keyValue = null;
try {
keyValue = rb.getString(key);
} catch (MissingResourceException e) {
// LOG.severe("Key not found: " + key);
keyValue = "???" + key + "???";
}
/* Message formatting is expensive, try to avoid it */
if (param != null && param.length > 0) {
return MessageFormat.format(keyValue, param);
} else {
return keyValue;
}
}
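For trying this out without a properties file on the classpath, something along these lines should work. It is a sketch only: the bundle is an in-memory ListResourceBundle, and the keys' message texts and all class and method names are made up for the example.

```java
import java.text.MessageFormat;
import java.util.ListResourceBundle;
import java.util.MissingResourceException;
import java.util.ResourceBundle;

// Hypothetical demo: same lookup-then-format idea, backed by an in-memory bundle.
class BundleFormatSketch {

    // Stand-in for messages.properties; the texts are invented.
    static class DemoBundle extends ListResourceBundle {
        @Override
        protected Object[][] getContents() {
            return new Object[][] {
                {"PI15", "Application {0} already exists in the database."},
                {"PLAIN", "Nothing to substitute here."}
            };
        }
    }

    static String message(ResourceBundle bundle, String key, Object... params) {
        String pattern;
        try {
            pattern = bundle.getString(key);
        } catch (MissingResourceException e) {
            // Make missing keys visible instead of failing the whole page.
            return "???" + key + "???";
        }
        // Skip MessageFormat entirely when there is nothing to substitute.
        if (params == null || params.length == 0) {
            return pattern;
        }
        return MessageFormat.format(pattern, params);
    }

    public static void main(String[] args) {
        ResourceBundle bundle = new DemoBundle();
        System.out.println(message(bundle, "PI15", "12345"));
        System.out.println(message(bundle, "PLAIN"));
        System.out.println(message(bundle, "MISSING"));
    }
}
```

The call site then shrinks to a single line per message, which is what the question was after.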
I need to convert HTML to plain text. My only formatting requirement is to retain new lines in the plain text. New lines should be produced not only for <br> but also for other tags, e.g. <tr/> and </p>.
Sample HTML pages for testing are:
http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html
http://www.javadb.com/write-to-file-using-bufferedwriter
Note that these are only random URLs.
I have tried out various libraries (JSoup, javax.swing, Apache utils) mentioned in the answers to this StackOverflow question to convert HTML to plain text.
Example using JSoup:
public class JSoupTest {
@Test
public void SimpleParse() {
try {
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
System.out.print(doc.text());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Example with HTMLEditorKit:
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;
public class Html2Text extends HTMLEditorKit.ParserCallback {
StringBuffer s;
public Html2Text() {}
public void parse(Reader in) throws IOException {
s = new StringBuffer();
ParserDelegator delegator = new ParserDelegator();
// the third parameter is TRUE to ignore charset directive
delegator.parse(in, this, Boolean.TRUE);
}
public void handleText(char[] text, int pos) {
s.append(text);
}
public String getText() {
return s.toString();
}
public static void main (String[] args) {
try {
// the HTML to convert
URL url = new URL("http://www.javadb.com/write-to-file-using-bufferedwriter");
URLConnection conn = url.openConnection();
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
String finalContents = "";
while ((inputLine = reader.readLine()) != null) {
finalContents += "\n" + inputLine.replace("<br", "\n<br");
}
BufferedWriter writer = new BufferedWriter(new FileWriter("samples/testHtml.html"));
writer.write(finalContents);
writer.close();
FileReader in = new FileReader("samples/testHtml.html");
Html2Text parser = new Html2Text();
parser.parse(in);
in.close();
System.out.println(parser.getText());
}
catch (Exception e) {
e.printStackTrace();
}
}
}
Have your parser append text content and newlines to a StringBuilder.
final StringBuilder sb = new StringBuilder();
HTMLEditorKit.ParserCallback parserCallback = new HTMLEditorKit.ParserCallback() {
public boolean readyForNewline;
@Override
public void handleText(final char[] data, final int pos) {
String s = new String(data);
sb.append(s.trim());
readyForNewline = true;
}
@Override
public void handleStartTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
if (readyForNewline && (t == HTML.Tag.DIV || t == HTML.Tag.BR || t == HTML.Tag.P)) {
sb.append("\n");
readyForNewline = false;
}
}
@Override
public void handleSimpleTag(final HTML.Tag t, final MutableAttributeSet a, final int pos) {
handleStartTag(t, a, pos);
}
};
new ParserDelegator().parse(new StringReader(html), parserCallback, false);
I would guess you could use the ParserCallback.
You would need to add code to support the tags that require special handling. There are:
handleStartTag
handleEndTag
handleSimpleTag
callbacks that should allow you to check for the tags you want to monitor and then append a newline character to your buffer.
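A rough sketch of how those callbacks could be wired up is below. Treat it as an assumption-laden starting point: the class and method names are mine, the set of tags that trigger a newline (</p>, </div>, </tr>, <br>) is just a guess at what you need, and I have not checked it against the sample pages.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects text and adds a newline for selected tags.
class HtmlToTextSketch {

    static String toText(String html) {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data);
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                // Block-level elements end a line when they close.
                if (t == HTML.Tag.P || t == HTML.Tag.DIV || t == HTML.Tag.TR) {
                    sb.append('\n');
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                // <br> has no end tag, so it arrives here.
                if (t == HTML.Tag.BR) {
                    sb.append('\n');
                }
            }
        };
        try {
            // Third argument true: ignore any charset directive in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(toText("<html><body><p>one</p><p>two<br>three</p></body></html>"));
    }
}
```

Adding more tags is a matter of extending the two conditions.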
Building on your example, with a hint from html to plain text? message:
import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
public class TestJsoup
{
public void SimpleParse()
{
try
{
Document doc = Jsoup.connect("http://www.particle.kth.se/~lindsey/JavaCourse/Book/Part1/Java/Chapter09/scannerConsole.html").get();
// Trick for better formatting
doc.body().wrap("<pre></pre>");
String text = doc.text();
// Converting nbsp entities
text = text.replaceAll("\u00A0", " ");
System.out.print(text);
}
catch (IOException e)
{
e.printStackTrace();
}
}
public static void main(String args[])
{
TestJsoup tjs = new TestJsoup();
tjs.SimpleParse();
}
}
You can use XSLT for this purpose. Take a look at this link which addresses a similar problem.
Hope it is helpful.
I would use SAX. If your document is not well-formed XHTML, I would transform it with JTidy.
JSoup is not compatible with FreeMarker (or any other custom, non-HTML tags). I consider this the cleanest solution for converting HTML to plain text:
http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726
My code:
return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();