I'm removing stop words from a String, using Apache's Lucene (8.6.3) and the following Java 8 code:
private static final String CONTENTS = "contents";

final String text = "This is a short test! Bla!";
final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

try {
    Analyzer analyzer = new StandardAnalyzer(stopSet);
    TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
    }
    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    System.out.println("Exception:\n");
    e.printStackTrace();
}
This outputs the desired result:
[this] [is] [a] [bla]
Now I want to use both the default English stop set, which should also remove "this", "is" and "a" (according to the stop-word list on GitHub), AND the custom stop set above (the actual one I'm going to use is a lot longer), so I tried this:
Analyzer analyzer = new EnglishAnalyzer(stopSet);
The output is:
[thi] [is] [a] [bla]
Yes, the "s" in "this" is missing. What's causing this? It also didn't use the default stop set.
The following changes remove both the default and the custom stop words:
Analyzer analyzer = new EnglishAnalyzer();
TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
tokenStream = new StopFilter(tokenStream, stopSet);
Question: What is the "right" way to do this? Is using the tokenStream within itself (see code above) going to cause problems?
Bonus question: How do I output the remaining words with the right upper/lower case, hence what they use in the original text?
I will tackle this in two parts:
stop-words
preserving original case
Handling the Combined Stop Words
To handle the combination of Lucene's English stop word list, plus your own custom list, you can create a merged list as follows:
import org.apache.lucene.analysis.en.EnglishAnalyzer;
...
final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);
CharArraySet enStopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
stopSet.addAll(enStopSet);
The above code simply takes the English stopwords bundled with Lucene and merges them with your list.
That gives the following output:
[bla]
Handling Word Case
This is a bit more involved. As you have noticed, the StandardAnalyzer includes a step in which all words are converted to lower case - so we can't use that.
Also, if you want to maintain your own custom list of stop words, and if that list is of any size, I would recommend storing it in its own text file, rather than embedding the list into your code.
So, let's assume you have a file called stopwords.txt. In this file, there will be one word per line - and the file will already contain the merged list of your custom stop words and the official list of English stop words.
You will need to prepare this file manually yourself (i.e. ignore the notes in part 1 of this answer).
My test file is just this:
short
this
is
a
test
the
him
it
I also prefer to use the CustomAnalyzer for something like this, as it lets me build an analyzer very simply.
import org.apache.lucene.analysis.custom.CustomAnalyzer;
...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();
This does the following:
It uses the "icu" tokenizer org.apache.lucene.analysis.icu.segmentation.ICUTokenizer, which takes care of tokenizing on Unicode whitespace, and handling punctuation.
It applies the stopword list. Note the use of true for the ignoreCase attribute, and the reference to the stop-word file. The format of wordset means "one word per line" (there are other formats, also).
The key here is that there is nothing in the above chain which changes word case.
So, now, using this new analyzer, the output is as follows:
[Bla]
Final Notes
Where do you put the stop list file? By default, Lucene expects to find it on the classpath of your application. So, for example, you can put it in the default package.
But remember that the file needs to be handled by your build process, so that it ends up alongside the application's class files (not left behind with the source code).
I mostly use Maven - and therefore I have this in my POM to ensure the ".txt" file gets deployed as needed:
<build>
    <resources>
        <resource>
            <directory>src/main/java</directory>
            <excludes>
                <exclude>**/*.java</exclude>
            </excludes>
        </resource>
    </resources>
</build>
This tells Maven to copy files (except Java source files) to the build target - thus ensuring the text file gets copied.
Final note - I did not investigate why you were getting that truncated [thi] token. If I get a chance I will take a closer look.
Follow-Up Questions
After combining I have to use the StandardAnalyzer, right?
Yes, that is correct. The notes I provided in part 1 of the answer relate directly to the code in your question, and to the StandardAnalyzer you use.
I want to keep the stop word file on a specific non-imported path - how to do that?
You can tell the CustomAnalyzer to look in a "resources" directory for the stop-words file. That directory can be anywhere on the file system (for easy maintenance, as you noted):
import java.nio.file.Path;
import java.nio.file.Paths;
...
Path resources = Paths.get("/path/to/resources/directory");
Analyzer analyzer = CustomAnalyzer.builder(resources)
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();
Instead of using .builder() we now use .builder(resources).
I want to replace token values in all the files inside a specific directory (say dir1).
A few examples:
/dir1/helloworld.jsp
/dir1/dir2/sample.jsp
/dir1/dir2/dir3/produtinfo.jsp
/dir1/random/dir3/dir4/random.jsp
I tried using the following configuration, which does not seem to work:
<configuration>
    <delimiters>
        <delimiter>#</delimiter>
    </delimiters>
    <basedir>target/dir1</basedir>
    <filesToInclude>*.jsp</filesToInclude>
    <tokenValueMap>/usr/home/output.properties</tokenValueMap>
</configuration>
Am I missing something here?
Also, I have two properties, something like:
prop1.value=10
prop2.prop1.value=20
This plugin replaces "sample text ${prop2.prop1.value}" with "sample text prop2.10", but it should produce "sample text 20" instead. How do I fix this?
In short, I want to replace some property values in all the JSP files inside src/main/webapp/jsp while building the Maven project, so if there are any alternatives, please feel free to suggest them.
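For what it's worth, the substring collision itself can be reproduced and avoided in plain Java. This is only an illustrative sketch of the ordering problem (a shorter token matching inside a longer one), not the replacer plugin's actual implementation:

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

public class TokenReplaceDemo {

    // Replaces each token with its value. If shorter tokens run first,
    // "prop1.value" matches inside "prop2.prop1.value" and corrupts it;
    // sorting the tokens longest-first avoids the partial match.
    static String replaceLongestFirst(String text, Map<String, String> tokens) {
        return tokens.entrySet().stream()
                .sorted(Comparator.comparingInt(
                        (Map.Entry<String, String> e) -> e.getKey().length()).reversed())
                .reduce(text,
                        (s, e) -> s.replace(e.getKey(), e.getValue()),
                        (a, b) -> a);
    }

    public static void main(String[] args) {
        Map<String, String> tokens = new LinkedHashMap<>();
        tokens.put("prop1.value", "10");
        tokens.put("prop2.prop1.value", "20");

        // Replacing the short token first reproduces the reported corruption:
        String naive = "sample text prop2.prop1.value".replace("prop1.value", "10");
        System.out.println(naive); // sample text prop2.10

        // Longest-first replacement yields the intended result:
        System.out.println(replaceLongestFirst("sample text prop2.prop1.value", tokens)); // sample text 20
    }
}
```

Whether the plugin itself orders its token map this way is an open question, but it matches the symptom you describe.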
I'm fairly new to servlets so I hope this isn't an obvious question. So I have a simple Java servlet I've created in NetBeans using a template. I have a context parameter I created in web.xml that lists allowed hosts (one of my request parameters is a URL which I'll compare against this list):
<context-param>
    <param-name>allowedHosts</param-name>
    <param-value>
        http://opendap.co-ops.nos.noaa.gov/thredds/wms/NOAA/CBOFS/MODELS/201206/nos.cbofs.fields.nowcast.20120612.t00z.nc?service=WMS&version=1.3.0&request=GetCapabilities
        http://www.google.com
        http://www.facebook.com
    </param-value>
</context-param>
When I put only dummy URLs in like Google and Facebook, this works perfectly. However, when I add the first URL, the Tomcat server cannot even deploy. Looking at my logs, I see this at the top of a very long stacktrace:
SEVERE: Parse Fatal Error at line 19 column 146: The reference to entity "version" must end with the ';' delimiter.
org.xml.sax.SAXParseException: The reference to entity "version" must end with the ';' delimiter.
Line 19 column 146 indeed points to the "version" portion of that long URL in my list of context parameters. So obviously "version" is some kind of reserved word. If I remove the "version" parameter from this URL, "request" is also a problem.
I can get around this by using just http://opendap.co-ops.nos.noaa.gov/thredds/wms/ as the URL (because ultimately I want a list of hosts, not specific URLs, anyway), but I was wondering what one would have to do to get around this otherwise... Is there a way to include URLs that contain such "reserved words" in web.xml?
Thanks!
Just put "&amp;" (without the quotes) instead of the raw ampersands ("&") in your URL.
Hope this helps - DF
Edit: Try this:
<context-param>
    <param-name>allowedHosts</param-name>
    <param-value>
        http://opendap.co-ops.nos.noaa.gov/thredds/wms/NOAA/CBOFS/MODELS/201206/nos.cbofs.fields.nowcast.20120612.t00z.nc?service=WMS&amp;version=1.3.0&amp;request=GetCapabilities
    </param-value>
</context-param>
Edit2: PS These things are called "entities" in XML - search for "XML entity references" for more info.
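To see why the raw ampersand breaks deployment, here is a small self-contained sketch using the JDK's own XML parser (the URL is an illustrative stand-in, not your actual web.xml content):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class EntityEscapeDemo {

    // Returns true if the XML snippet parses, false on a SAX parse error.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc != null;
        } catch (SAXException e) {
            // e.g. "The reference to entity "version" must end with the ';' delimiter."
            return false;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "<param-value>http://example.com/wms?service=WMS&version=1.3.0</param-value>";
        String escaped = "<param-value>http://example.com/wms?service=WMS&amp;version=1.3.0</param-value>";

        System.out.println(parses(raw));     // false - raw '&' starts an entity reference
        System.out.println(parses(escaped)); // true  - '&amp;' is the escaped ampersand
    }
}
```

The parser treats everything between "&" and the next ";" as an entity name, which is exactly what the Tomcat stacktrace is complaining about.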
I want to use a key defined in a properties file as a variable, like this:
key1= value1
key2= value2
key3= key1
I tried:
key3= {key1}
or
key3= ${key1}
But it doesn't work!
Any ideas, please?
Java's built-in Properties class doesn't do what you're looking for.
But there are third-party libraries out there that do. Commons Configuration is one that I have used with some success. The PropertiesConfiguration class does exactly what you're looking for.
So you might have a file named my.properties that looks like this:
key1=value1
key2=Something and ${key1}
Code that uses this file might look like this:
CompositeConfiguration config = new CompositeConfiguration();
config.addConfiguration(new SystemConfiguration());
config.addConfiguration(new PropertiesConfiguration("my.properties"));
String myValue = config.getString("key2");
myValue will be "Something and value1".
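If you would rather not pull in a dependency, the interpolation boils down to resolving ${...} references against the same property set. A minimal sketch in plain Java might look like this (getInterpolated is a hypothetical helper, not a JDK API, and it does no cycle detection):

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PropertyInterpolator {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    // Resolves ${key} references in a value against the same Properties object.
    // Plain java.util.Properties does none of this on its own.
    static String getInterpolated(Properties props, String key) {
        String value = props.getProperty(key, "");
        Matcher m = PLACEHOLDER.matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Recursively resolve nested references; missing keys resolve to "".
            m.appendReplacement(sb, Matcher.quoteReplacement(getInterpolated(props, m.group(1))));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("key1", "value1");
        props.setProperty("key2", "Something and ${key1}");

        System.out.println(getInterpolated(props, "key2")); // Something and value1
    }
}
```

For anything beyond a toy case, though, the Commons Configuration approach above is more robust.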
When you define the value of a key in a properties file, it is parsed as a literal value. So when you define key3= ${key1}, key3 will have the value "${key1}".
http://docs.oracle.com/javase/1.4.2/docs/api/java/util/Properties.html#load(java.io.InputStream)
I agree with csd that a plain properties file may not be enough here. I prefer using Apache Ant ( http://ant.apache.org/ ), where you can do something like this:
<property name="src.dir" value="src"/>
<property name="conf.dir" value="conf" />
Then later when you want to use key 'src.dir', just call it something like this:
<dirset dir="${src.dir}"/>
Another good thing about using Apache Ant is you also can load the .properties file into Ant build file. Just import it like this:
<loadproperties srcFile="file.properties"/>
Even better: use the latest Maven.
You can do some neat stuff with Maven. In this case you can make a .properties file with these lines in it:
key1 = ${src1.dir}
key2 = ${src1.dir}/${filename}
key3 = ${project.basedir}
In Maven's pom.xml file (placed in the root of your project) you should add something like this:
<resources>
    <resource>
        <filtering>true</filtering>
        <directory>just-path-to-the-properties-file-without-it</directory>
        <includes>
            <include>file-name-to-be-filtered</include>
        </includes>
    </resource>
</resources>
...
<properties>
    <src1.dir>/home/root</src1.dir>
    <filename>some-file-name</filename>
</properties>
That way the key values get substituted at build time, which means that after the build you will have these values inside your properties file:
key1 = /home/root
key2 = /home/root/some-file-name
key3 = the-path-to-your-project
Build with this command from the same directory as the pom.xml file:
mvn clean install -DskipTests
I am writing an app which has to create multiple Resources. The input is an XML document. I need to parse the XML, create the Resources in parallel and record the responses in the Outputs.
<Components>
    <Resources>
        <CreateResourceARequest>...</CreateResourceARequest>
        <CreateResourceBRequest>...</CreateResourceBRequest>
        <CreateResourceCRequest>...</CreateResourceCRequest>
    </Resources>
    <Outputs>
        <CreateResourceAResponse>...</CreateResourceAResponse>
        <CreateResourceBResponse>...</CreateResourceBResponse>
        <CreateResourceCResponse>...</CreateResourceCResponse>
    </Outputs>
</Components>
Each of the ResourceRequests is handled by a specific class.
What is the best way to create Resources in parallel, aggregate the results and update the xml ?
You will parse the XML file in a single thread anyway, because the file is linear. But you can collect the handlers for each CreateResourceRequest in a collection and start them all in parallel threads after parsing the file.
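A minimal sketch of that approach with an ExecutorService might look like this (the handler Callables are hypothetical stand-ins for your actual resource classes, which aren't shown in the question):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelResourceDemo {

    public static void main(String[] args) throws Exception {
        // Hypothetical stand-ins for the per-resource handler classes:
        // each Callable creates one resource and returns its response text.
        List<Callable<String>> handlers = new ArrayList<>();
        handlers.add(() -> "<CreateResourceAResponse>ok</CreateResourceAResponse>");
        handlers.add(() -> "<CreateResourceBResponse>ok</CreateResourceBResponse>");
        handlers.add(() -> "<CreateResourceCResponse>ok</CreateResourceCResponse>");

        ExecutorService pool = Executors.newFixedThreadPool(handlers.size());
        try {
            // invokeAll runs the handlers in parallel and blocks until all finish,
            // returning the Futures in the same order as the handlers were given.
            List<Future<String>> results = pool.invokeAll(handlers);
            for (Future<String> result : results) {
                System.out.println(result.get()); // aggregate into <Outputs> here
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Because invokeAll preserves the submission order, matching each response back to its request element in the XML is straightforward.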