Is there any use in caching very small objects?

TaggedLogger has only a string field - tag.
public class TaggedLogger {
private final String tag;
public static TaggedLogger forInstance(Object instance) {
return new TaggedLogger(getTagOfInstance(instance));
}
public static String getTagOfInstance(Object instance) {
return getTagOfClass(instance.getClass());
}
public static TaggedLogger forClass(Class<?> someClass) {
return new TaggedLogger(getTagOfClass(someClass));
}
public static String getTagOfClass(Class<?> someClass) {
return someClass.getName();
}
public static TaggedLogger withTag(String tag) {
return new TaggedLogger(tag);
}
private TaggedLogger(String tag) {
this.tag = tag;
}
public void debug(Object obj) {
Log.d(getTag(), String.valueOf(obj));
}
public String getTag() {
return tag;
}
public void exception(String message) {
Log.e(getTag(), String.valueOf(message));
}
public void exception(Throwable exception) {
Log.e(getTag(), String.valueOf(exception.getMessage()), exception);
}
public void exception(Throwable exception, String additionalMessage) {
Log.e(getTag(), String.valueOf(exception.getMessage()), exception);
Log.e(getTag(), String.valueOf(additionalMessage));
}
public void info(Object obj) {
Log.i(getTag(), String.valueOf(obj));
}
}
And TaggedLoggers is used to get cached TaggedLogger instances (or to create new ones and put them in the cache):
public class TaggedLoggers {
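// The cache must be declared before GLOBAL: static initializers run in textual order,
// so getCachedWithTag("GLOBAL") would otherwise hit a null map.
private static final Map<String,TaggedLogger> cache = new HashMap<String, TaggedLogger>();
public static final TaggedLogger GLOBAL = getCachedWithTag("GLOBAL");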
public static TaggedLogger getCachedForInstance(Object obj) {
return getCachedWithTag(TaggedLogger.getTagOfInstance(obj));
}
public static TaggedLogger getCachedForClass(Class<?> someClass) {
return getCachedWithTag(TaggedLogger.getTagOfClass(someClass));
}
public static TaggedLogger getCachedWithTag(String tag) {
TaggedLogger logger = cache.get(tag);
if (logger == null) {
logger = TaggedLogger.withTag(tag);
cache.put(tag, logger);
}
return logger;
}
}
Is there any use in the TaggedLoggers class?
I often use TaggedLogger with tags derived from method arguments, e.g.:
public class FragmentUtils {
public static void showMessage(Fragment fragment, String message, int toastDuration) {
TaggedLoggers.getCachedForInstance(fragment).debug(message);
Context context = fragment.getActivity();
if (context == null) {
return;
}
Toast toast = Toast.makeText(context, message, toastDuration);
toast.show();
}
}
So caching TaggedLogger instances actually helps me avoid creating a lot of unnecessary instances.
But should I do so?

Caching existing instances can help a lot, or it can kill performance.
When creating a new instance, you have to consider two factors:
The time to set up the instance itself, that is, allocating RAM, initializing fields and executing the constructor
The time taken by the garbage collector to clean up once the instance is no longer reachable
This used to take a lot of time on older JVMs. The garbage collector has since been improved, and creating and throwing away small instances is usually not as big a problem as it was before, but it still has its cost. I don't know exactly how far Android's VMs have been optimized.
In this case it depends on how often you create and throw away these instances, which you said is very often.
When reusing them instead, you have to consider two other factors:
The time to look up the existing instance, that is, a map lookup
The RAM that stays occupied by instances that are not actually in use
So, in this case, it depends on how many different instances you have. If you have thousands of TaggedLoggers, then looking up the map and keeping all of them in RAM could hurt performance more than creating and throwing them away.
If the TaggedLoggers number in the hundreds, caching is probably better; if they go into the thousands, it is probably better to instantiate and throw away.
However, I would question whether you need a TaggedLogger at all. If you always have the tag String, can't you simply call the logger method directly, or have a (possibly even static) façade in front of it, instead of instances that contain only one piece of information (the tag string) that you already have?

Creating and garbage collecting TaggedLogger instances would be nearly free, so there's no real benefit to caching them. But since TaggedLoggers uses a HashMap instead of a ConcurrentHashMap, calling it from multiple threads can cause difficult-to-debug problems, up to and including an infinite loop if two threads try to resize the map at the same time.
It provides little if any benefit, creates additional complexity, and may create problems.
See also: A Beautiful Race Condition
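If the cache is kept anyway, the HashMap hazard described above can be avoided with a ConcurrentHashMap. The following is only a sketch, assuming Java 8 or later: ConcurrentTaggedLoggers is a hypothetical name, and TaggedLogger here is a stand-in reduced to its tag field rather than the class from the question.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Stand-in for the question's TaggedLogger, reduced to the tag field.
final class TaggedLogger {
    private final String tag;

    TaggedLogger(String tag) {
        this.tag = tag;
    }

    String getTag() {
        return tag;
    }
}

// Hypothetical thread-safe variant of TaggedLoggers.
final class ConcurrentTaggedLoggers {
    private static final ConcurrentMap<String, TaggedLogger> CACHE =
            new ConcurrentHashMap<>();

    private ConcurrentTaggedLoggers() {
    }

    static TaggedLogger getCachedWithTag(String tag) {
        // On ConcurrentHashMap, computeIfAbsent is atomic: the mapping function
        // runs at most once per key, even when several threads ask at once.
        return CACHE.computeIfAbsent(tag, TaggedLogger::new);
    }

    static TaggedLogger getCachedForClass(Class<?> someClass) {
        return getCachedWithTag(someClass.getName());
    }
}
```

This removes the race but not the question of whether the cache pays for itself; that still depends on how many distinct tags there are.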

Related

Hadoop singleton pattern usage

I'm trying to implement a singleton that caches and validates the configuration of MapReduce jobs in Hadoop. Let's call it ConfigurationManager.
Here is what I have for now:
public class ConfigurationManager {
private static volatile ConfigurationManager instance;
private static final String CONF_NAME = "isSomethingEnabled";
private boolean isSomethingEnabled;
private ConfigurationManager(Configuration configuration) {
this.isSomethingEnabled = configuration.getBoolean(CONF_NAME, false);
}
public static void init(Configuration configuration) {
if (instance == null) {
synchronized (ConfigurationManager.class) {
if (instance == null) {
instance = new ConfigurationManager(configuration); // no "this" in a static method
}
}
}
}
public static ConfigurationManager get() {
return instance;
}
public boolean isSomethingEnabled() {
return isSomethingEnabled;
}
}
As you can see, it is designed to be thread-safe. It is also not a standard singleton: I separated the initialization and accessor methods so that get does not require a Hadoop Configuration instance. To use it, I call init early, in my Tool implementation, and then try to access the singleton with get in the reducers (like this: ConfigurationManager.get().isSomethingEnabled()), but for some reason get returns null. Could somebody please explain this behaviour? Maybe mappers/reducers are started as separate processes?

Each reduce task runs in a separate JVM, which would explain the null.
You can do the initialization per reduce task in the Reducer's configure method.

Lazy initialization for non-static values

The question actually refers to a different question, which was closed as duplicate because it was probably not well formulated.
What would be an effective alternative lazy initialization idiom instead of double-checked locking for this code sample (in a multithreaded environment):
public class LazyEvaluator {
private final Object state;
private volatile LazyValue lazyValue;
public LazyEvaluator(Object state) {
this.state = state;
}
public LazyValue getLazyValue() {
if (lazyValue == null) {
synchronized (this) {
if (lazyValue == null) {
lazyValue = new LazyValue(someSlowOperationWith(state), 42);
}
}
}
return lazyValue;
}
public static class LazyValue {
private String name;
private int value;
private LazyValue(String name, int value) {
this.name = name;
this.value = value;
}
private String getName() {
return name;
}
private int getValue() {
return value;
}
}
}
EDIT Updated to include a slow operation and added explicit mention about multithreaded environment

If I understand you, then you could change this
public LazyValue getLazyValue() {
if (lazyValue == null) {
synchronized (this) {
if (lazyValue == null) {
lazyValue = new LazyValue(state.toString());
}
}
}
return lazyValue;
}
to this
public synchronized LazyValue getLazyValue() {
if (lazyValue == null) {
lazyValue = new LazyValue(state.toString());
}
return lazyValue;
}
But it's only necessary pre-Java 5 (which doesn't support acquire/release semantics for volatile) and if multiple threads might access the same instance of your LazyEvaluator. If each thread has a thread-local instance then you don't need to synchronize.

The simplest solution would be
public LazyValue getLazyValue() {
return new LazyValue(state.toString(), 42);
}
as LazyValue is a trivial object that is not worth caching at all.
If an expensive computation is involved you can turn the LazyValue into a true immutable object by declaring its fields final:
public static class LazyValue {
private final String name;
private final int value;
// …
this way you could publish the instance even through a data race:
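// lazyValue does not even need to be volatile
public LazyValue getLazyValue() {
LazyValue result = lazyValue; // read the field once; a second racy read could see null
if (result == null) {
lazyValue = result = new LazyValue(state.toString(), 42);
}
return result;
}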
In this case the value might be calculated multiple times in the unlikely case that multiple threads access it concurrently, but once a thread sees a non-null value, it will be a correctly initialized value thanks to the final field initialization guarantee.
If the calculation is so expensive that even an unlikely concurrent calculation must be avoided, then simply declare getLazyValue() synchronized as its overhead will be negligible compared to the calculation that will be saved.
Finally, if you really encounter a scenario where the computation is so heavy that overlapping concurrent computation must be avoided at all costs, but profiling shows that the synchronization later becomes a bottleneck, you might have encountered one of the very rare cases where double-checked locking could be an option (really rare).
In this case, there’s still an alternative to your question’s code. Combine the DCL with my suggestion above of declaring all LazyValue’s fields as final and make the lazyValue holder field non-volatile. That way you can even save the volatile read after the lazy value has been constructed. However, I still say, it should be really rarely needed.
Maybe that’s the non-technical reason why DCL has such a bad reputation: its appearance in discussions (or on StackOverflow) is way out of proportion to its real need.

Well, "effective alternative lazy initialization idiom" leaves a lot of flexibility, so I'll put my two cents in the ring by noting that this might be a good place to apply a library. In particular, Guava. https://code.google.com/p/guava-libraries/
// You have some long method call you want to make lazy
MyValue someLongMethod(int input) { ... }
// So you wrap it in a supplier so it's lazy
Supplier<MyValue> lazy = new Supplier<MyValue>() {
public MyValue get() {
return someLongMethod(2);
}
};
// and you want it only to be called once ...
Supplier<MyValue> cached = Suppliers.memoize(lazy);
// ... and someLongMethod won't actually be called until
cached.get();
Double-checked locking is used (properly) by the Suppliers class. As far as idioms go, Supplier is certainly effective and quite popular: java.util.function.Supplier came in Java 8.
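For readers without Guava, roughly the same idea can be sketched on top of Java 8's java.util.function.Supplier. The Memoized class below is a hypothetical helper written for illustration, not Guava's implementation; it assumes the delegate never returns null, because null is used as the "not computed yet" marker.

```java
import java.util.Objects;
import java.util.function.Supplier;

// Hypothetical helper: caches the first result of a delegate supplier.
final class Memoized<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    // volatile is what makes double-checked locking safe on Java 5+.
    private volatile T value;

    Memoized(Supplier<T> delegate) {
        this.delegate = Objects.requireNonNull(delegate);
    }

    @Override
    public T get() {
        T result = value; // one volatile read on the fast path
        if (result == null) {
            synchronized (this) {
                result = value; // re-check under the lock
                if (result == null) {
                    result = Objects.requireNonNull(delegate.get(), "delegate returned null");
                    value = result;
                }
            }
        }
        return result;
    }
}
```

Usage would look like `Supplier<MyValue> cached = new Memoized<>(() -> someLongMethod(2));`, with someLongMethod invoked only on the first cached.get().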
Good luck.

Disabling an if-Condition for one static method call by setting a static field

We have a class in our project, call it AttributeUpdater, that handles copying values from one entity to another. The core method iterates over the attributes of an entity and copies them, as specified, into the second one. During that loop the AttributeUpdater collects all reports, which describe what value was overwritten during copying, into a list for eventual logging. This list is cleared if the old entity whose values were overwritten was never persisted to the database, because in that case you would only overwrite default values, and logging that is deemed redundant. In pseudo Java code:
public class AttributeUpdater {
public static CopyResult updateAttributes(Entity source, Entity target, String[] attributes) {
List<CopyReport> reports = new ArrayList<CopyReport>();
for(String attribute : attributes) {
reports.add(copy(source, target, attribute));
}
if(target.isNotPersisted()) {
reports.clear();
}
return new CopyResult(reports);
}
}
Now someone had the epiphany that there is one case in which the reports actually matter even if the entity has not been persisted yet. This would not be a big deal if I could just add another parameter to the method signature, but that is not really an option due to the actual structure of the class and the amount of refactoring required. Since the method is static, the only other solution I came up with is adding a flag as a static field and setting it just before the function call.
public class AttributeUpdater {
public static final ThreadLocal<Boolean> isDeletionEnabled = new ThreadLocal<Boolean>() {
@Override protected Boolean initialValue() {
return Boolean.TRUE;
}
};
public static Boolean getDeletionEnabled() { return isDeletionEnabled.get(); }
public static void setDeletionEnabled(Boolean b) { isDeletionEnabled.set(b); }
public static CopyResult updateAttributes(Entity source, Entity target, String[] attributes) {
List<CopyReport> reports = new ArrayList<CopyReport>();
for(String attribute : attributes) {
reports.add(copy(source, target, attribute));
}
if(isDeletionEnabled.get() && target.isNotPersisted()) {
reports.clear();
}
return new CopyResult(reports);
}
}
ThreadLocal is a container used for thread-safety. This solution, while it does the job, has at least one major drawback for me: for all the other methods that assume the reports are deleted, there is now no way of guaranteeing that they will be deleted as expected. Again, refactoring is not an option. So I came up with this:
public class AttributeUpdater {
private static final ThreadLocal<Boolean> isDeletionEnabled = new ThreadLocal<Boolean>() {
@Override protected Boolean initialValue() {
return Boolean.TRUE;
}
};
public static Boolean getDeletionEnabled() { return isDeletionEnabled.get(); }
public static void disableDeletionForNextCall() { isDeletionEnabled.set(Boolean.FALSE); }
public static CopyResult updateAttributes(Entity source, Entity target, String[] attributes) {
List<CopyReport> reports = new ArrayList<CopyReport>();
for(String attribute : attributes) {
reports.add(copy(source, target, attribute));
}
if(isDeletionEnabled.get() && target.isNotPersisted()) {
reports.clear();
}
isDeletionEnabled.set(Boolean.TRUE);
return new CopyResult(reports);
}
}
This way I can guarantee that for old code the function will always work as it did before the change. The downside of this solution, especially for nested entities, is that I will be accessing the ThreadLocal container a lot: iterating over one of those means calling disableDeletionForNextCall() for each nested element. Also, as the method is called a lot overall, there are valid performance concerns.
TL;DR: Look at pseudo Java source code. First one is old code, second and third are different attempts to allow deletion disabling. Parameters cannot be added to method signature.
Is there a possibility to determine which solution is better or is this merely a philosophical issue? Or is there even a better solution to this problem?

The obvious way to decide which solution is better in terms of performance would be benchmarking this. As both solutions access the thread-local variable at least for reading, I doubt that they would differ too much. You could perhaps combine them like this:
if(!isDeletionEnabled.get())
isDeletionEnabled.set(Boolean.TRUE);
else if (target.isNotPersisted())
reports.clear();
In this case, you get the benefit of the second solution (guaranteed resetting of the flag) without unnecessary writes.
I doubt there will be much practical difference. With a bit of luck, the HotSpot JVM will compile the thread local variable into some nice native code which works without too much of a performance penalty, though I have no actual experience there.
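The combined check above can be tried out in isolation. The following is a minimal sketch, not the real AttributeUpdater: OneShotFlagDemo is a hypothetical class, reports are plain strings, the persisted state is passed in as a boolean, and ThreadLocal.withInitial assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for AttributeUpdater.
final class OneShotFlagDemo {
    private static final ThreadLocal<Boolean> DELETION_ENABLED =
            ThreadLocal.withInitial(() -> Boolean.TRUE);

    private OneShotFlagDemo() {
    }

    static void disableDeletionForNextCall() {
        DELETION_ENABLED.set(Boolean.FALSE);
    }

    static List<String> updateAttributes(List<String> attributes, boolean targetPersisted) {
        List<String> reports = new ArrayList<>();
        for (String attribute : attributes) {
            reports.add("copied " + attribute);
        }
        if (!DELETION_ENABLED.get()) {
            // The flag applies to this call only: keep the reports and
            // restore the default, so the write happens only when needed.
            DELETION_ENABLED.set(Boolean.TRUE);
        } else if (!targetPersisted) {
            reports.clear();
        }
        return reports;
    }
}
```

Because the flag is thread-local, a caller on one thread that disables deletion does not affect calls made concurrently on other threads.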

Thread-safe cache of one object in java

Let's say we have a CountryList object in our application that should return the list of countries. Loading the countries is a heavy operation, so the list should be cached.
Additional requirements:
CountryList should be thread-safe
CountryList should load lazily (only on demand)
CountryList should support the invalidation of the cache
CountryList should be optimized considering that the cache will be invalidated very rarely
I came up with the following solution:
public class CountryList {
private static final Object ONE = new Integer(1);
// MapMaker is from Google Collections Library
private Map<Object, List<String>> cache = new MapMaker()
.initialCapacity(1)
.makeComputingMap(
new Function<Object, List<String>>() {
@Override
public List<String> apply(Object from) {
return loadCountryList();
}
});
private List<String> loadCountryList() {
// HEAVY OPERATION TO LOAD DATA
}
public List<String> list() {
return cache.get(ONE);
}
public void invalidateCache() {
cache.remove(ONE);
}
}
What do you think about it? Do you see something bad about it? Is there another way to do it? How can I make it better? Should I look for a totally different solution in this case?
Thanks.

Google Collections actually supplies just the thing for this: Supplier.
Your code would be something like:
private Supplier<List<String>> supplier = new Supplier<List<String>>(){
public List<String> get(){
return loadCountryList();
}
};
// volatile reference so that changes are published correctly see invalidate()
private volatile Supplier<List<String>> memorized = Suppliers.memoize(supplier);
public List<String> list(){
return memorized.get();
}
public void invalidate(){
memorized = Suppliers.memoize(supplier);
}

Thank you all, especially user "gid", who gave me the idea.
My goal was to optimize the performance of the get() operation, given that invalidate() will be called very rarely.
I wrote a test class that starts 16 threads, each calling the get() operation one million times. With this class I profiled several implementations on my 2-core machine.
Testing results:
Implementation            Time
no synchronisation        0.6 sec
normal synchronisation    7.5 sec
with MapMaker             26.3 sec
with Suppliers.memoize    8.2 sec
with optimized memoize    1.5 sec
1) "No synchronisation" is not thread-safe, but gives us the best performance that we can compare to.
@Override
public List<String> list() {
if (cache == null) {
cache = loadCountryList();
}
return cache;
}
@Override
public void invalidateCache() {
cache = null;
}
2) "Normal synchronisation" - pretty good performance, standard no-brainer implementation
@Override
public synchronized List<String> list() {
if (cache == null) {
cache = loadCountryList();
}
return cache;
}
@Override
public synchronized void invalidateCache() {
cache = null;
}
3) "with MapMaker" - very poor performance.
See my question at the top for the code.
4) "with Suppliers.memoize" - good performance. But as the performance is the same as "Normal synchronisation", we need to either optimize it or just use "Normal synchronisation".
See the answer of the user "gid" for code.
5) "with optimized memoize" - performance comparable to the "no sync" implementation, but thread-safe. This is the one we need.
The cache-class itself:
(The Supplier interfaces used here is from Google Collections Library and it has just one method get(). see http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/base/Supplier.html)
public class LazyCache<T> implements Supplier<T> {
private final Supplier<T> supplier;
private volatile Supplier<T> cache;
public LazyCache(Supplier<T> supplier) {
this.supplier = supplier;
reset();
}
private void reset() {
cache = new MemoizingSupplier<T>(supplier);
}
@Override
public T get() {
return cache.get();
}
public void invalidate() {
reset();
}
private static class MemoizingSupplier<T> implements Supplier<T> {
final Supplier<T> delegate;
volatile T value;
MemoizingSupplier(Supplier<T> delegate) {
this.delegate = delegate;
}
@Override
public T get() {
if (value == null) {
synchronized (this) {
if (value == null) {
value = delegate.get();
}
}
}
return value;
}
}
}
Example use:
public class BetterMemoizeCountryList implements ICountryList {
LazyCache<List<String>> cache = new LazyCache<List<String>>(new Supplier<List<String>>(){
@Override
public List<String> get() {
return loadCountryList();
}
});
@Override
public List<String> list(){
return cache.get();
}
@Override
public void invalidateCache(){
cache.invalidate();
}
private List<String> loadCountryList() {
// this should normally load a full list from the database,
// but just for this instance we mock it with:
return Arrays.asList("Germany", "Russia", "China");
}
}

Whenever I need to cache something, I like to use the Proxy pattern.
Doing it with this pattern offers separation of concerns. Your original
object can be concerned with lazy loading. Your proxy (or guardian) object
can be responsible for validation of the cache.
In detail:
Define an object CountryList class which is thread-safe, preferably using synchronization blocks or other semaphore locks.
Extract this class's interface into a CountryQueryable interface.
Define another object, CountryListProxy, that implements the CountryQueryable.
Only allow the CountryListProxy to be instantiated, and only allow it to be referenced
through its interface.
From here, you can insert your cache invalidation strategy into the proxy object. Save the time of the last load, and upon the next request to see the data, compare the current time to the cache time. Define a tolerance level, where, if too much time has passed, the data is reloaded.
As far as Lazy Load, refer here.
Now for some good down-home sample code:
public interface CountryQueryable {
public void operationA();
public String operationB();
}
public class CountryList implements CountryQueryable {
private boolean loaded;
public CountryList() {
loaded = false;
}
//This particular operation might be able to function without
//the extra loading.
@Override
public void operationA() {
//Do whatever.
}
//This operation may need to load the extra stuff.
@Override
public String operationB() {
if (!loaded) {
load();
loaded = true;
}
//Do whatever.
return whatever;
}
private void load() {
//Do the loading of the Lazy load here.
}
}
public class CountryListProxy implements CountryQueryable {
//In accordance with the Proxy pattern, we hide the target
//instance inside of our Proxy instance.
private CountryQueryable actualList;
//Keep track of the lazy time we cached.
private long lastCached;
//Define a tolerance time, 2000 milliseconds, before refreshing
//the cache.
private static final long TOLERANCE = 2000L;
public CountryListProxy() {
//You might even retrieve this object from a Registry.
actualList = new CountryList();
//Start at the epoch so the first call always refreshes.
//(Long.MIN_VALUE would overflow the subtraction below and never refresh.)
lastCached = 0L;
}
@Override
public synchronized void operationA() {
if ((System.currentTimeMillis() - lastCached) > TOLERANCE) {
//Refresh the cache.
lastCached = System.currentTimeMillis();
} else {
//Cache is okay.
}
}
@Override
public synchronized String operationB() {
if ((System.currentTimeMillis() - lastCached) > TOLERANCE) {
//Refresh the cache.
lastCached = System.currentTimeMillis();
} else {
//Cache is okay.
}
return whatever;
}
}
public class Client {
public static void main(String[] args) {
CountryQueryable queryable = new CountryListProxy();
//Do your thing.
}
}

Your needs seem pretty simple here. The use of MapMaker makes the implementation more complicated than it has to be. The whole double-checked locking idiom is tricky to get right, and only works on 1.5+. And to be honest, it's breaking one of the most important rules of programming:
Premature optimization is the root of all evil.
The double-checked locking idiom tries to avoid the cost of synchronization in the case where the cache is already loaded. But is that overhead really causing problems? Is it worth the cost of more complex code? I say assume it is not until profiling tells you otherwise.
Here's a very simple solution that requires no 3rd party code (ignoring the JCIP annotation). It does make the assumption that an empty list means the cache hasn't been loaded yet. It also prevents the contents of the country list from escaping to client code that could potentially modify the returned list. If this is not a concern for you, you could remove the call to Collections.unmodifiableList().
public class CountryList {
@GuardedBy("cache")
private final List<String> cache = new ArrayList<String>();
private List<String> loadCountryList() {
// HEAVY OPERATION TO LOAD DATA
}
public List<String> list() {
synchronized (cache) {
if( cache.isEmpty() ) {
cache.addAll(loadCountryList());
}
return Collections.unmodifiableList(cache);
}
}
public void invalidateCache() {
synchronized (cache) {
cache.clear();
}
}
}

I'm not sure what the map is for. When I need a lazy, cached object, I usually do it like this:
public class CountryList
{
private static List<Country> countryList;
public static synchronized List<Country> get()
{
if (countryList==null)
countryList=load();
return countryList;
}
private static List<Country> load()
{
... whatever ...
}
public static synchronized void forget()
{
countryList=null;
}
}
I think this is similar to what you're doing but a little simpler. If you have a need for the map and the ONE that you've simplified away for the question, okay.
If you want it thread-safe, you should synchronize the get and the forget.

What do you think about it? Do you see something bad about it?
Bleah - you are using a complex data structure, MapMaker, with several features (map access, concurrency-friendly access, deferred construction of values, etc) because of a single feature you are after (deferred creation of a single construction-expensive object).
While reusing code is a good goal, this approach adds additional overhead and complexity. In addition, it misleads future maintainers when they see a map data structure there into thinking that there's a map of keys/values in there when there is really only 1 thing (list of countries). Simplicity, readability, and clarity are key to future maintainability.
Is there another way to do it? How can I make it better? Should I look for a totally different solution in this case?
Seems like you are after lazy-loading. Look at solutions to other SO lazy-loading questions. For example, this one covers the classic double-check approach (make sure you are using Java 1.5 or later):
How to solve the "Double-Checked Locking is Broken" Declaration in Java?
Rather than simply repeat the solution code here, I think it is useful to read the discussion about lazy loading via double-check there to grow your knowledge base. (Sorry if that comes off as pompous - just trying to teach to fish rather than feed blah blah blah ...)

There is a library out there (from Atlassian) with a utility class called LazyReference. LazyReference is a reference to an object that can be lazily created (on first get). It is guaranteed to be thread-safe, and the initialization is also guaranteed to occur only once: if two threads call get() at the same time, one thread will compute and the other will block and wait.
Sample code:
final LazyReference<MyObject> ref = new LazyReference<MyObject>() {
protected MyObject create() throws Exception {
// Do some useful object construction here
return new MyObject();
}
};
//thread1
MyObject myObject = ref.get();
//thread2
MyObject myObject = ref.get();

This looks ok to me (I assume MapMaker is from google collections?) Ideally you wouldn't need to use a Map because you don't really have keys but as the implementation is hidden from any callers I don't see this as a big deal.

This is way too simple to need the ComputingMap stuff. You only need a dead simple implementation where all methods are synchronized, and you should be fine. This will obviously block the first thread hitting it (getting it), and any other thread hitting it while the first thread loads the cache (and the same again if anyone calls invalidateCache, where you should also decide whether invalidateCache should load the cache anew or just null it out, letting the first subsequent get block), but after that all threads should go through nicely.

Use the Initialization on demand holder idiom
public class CountryList {
private CountryList() {}
private static class CountryListHolder {
static final List<Country> INSTANCE = new ArrayList<Country>(); // List is an interface; load the countries here
}
public static List<Country> getInstance() {
return CountryListHolder.INSTANCE;
}
...
}

Follow up to Mike's solution above. My comment didn't format as expected... :(
Watch out for synchronization issues in operationB, especially since load() is slow:
public String operationB() {
if (!loaded) {
load();
loaded = true;
}
//Do whatever.
return whatever;
}
You could fix it this way:
public String operationB() {
synchronized (this) { // a primitive boolean cannot be used as a lock
if (!loaded) {
load();
loaded = true;
}
}
//Do whatever.
return whatever;
}
Make sure you ALWAYS synchronize on every access to the loaded variable.

fuzzy implementation for capturing specific strings

I am going to develop a web crawler in Java to capture hotel room prices from hotel websites.
I want to capture the room price together with the room type and the meal type, so my algorithm needs to be intelligent enough to handle that.
For example:
Room type: Deluxe
Meal type: HalfBoard
price : $20.00
The main problem is that room prices can be presented in different ways on different hotel sites, so my algorithm should be independent of any particular site.
I plan to use the above room types and meal types as fuzzy sets and compare the words on a web page with those fuzzy sets using a suitable membership function.
Does anyone have experience with this, or an idea for my problem?

There are two ways to approach this problem:
(1) You can customize your crawler to understand the formats used by different websites; or
(2) You can come up with a general ("fuzzy") solution.
(1) will, by far, be the easiest. Ideally you want to create some tools that make this easier so you can create a filter for any new site in minimal time. IMHO your time will be best spent with this approach.
(2) has lots of problems. Firstly it will be unreliable. You will come across formats you don't understand or (worse) get wrong. Second, it will require a substantial amount of development to get something working. This is the sort of thing you use when you're dealing with thousands or millions of sites.
With hundreds of sites you will get better and more predictable results with (1).
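As a rough illustration of the per-site approach, a small registry can map each host to its own extraction rule, so that supporting a new site means adding one entry. This is only a sketch: SitePriceFilters, the host name and the regular expression are all made up for the example, and real pages would usually need an HTML parser rather than a regex over plain text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical registry of per-site price patterns.
final class SitePriceFilters {
    private final Map<String, Pattern> patternsByHost = new HashMap<>();

    void register(String host, String priceRegex) {
        // Each regex must expose the price as capture group 1.
        patternsByHost.put(host, Pattern.compile(priceRegex));
    }

    Optional<String> extractPrice(String host, String pageText) {
        Pattern pattern = patternsByHost.get(host);
        if (pattern == null) {
            return Optional.empty(); // no filter written for this site yet
        }
        Matcher matcher = pattern.matcher(pageText);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }
}
```

A site whose pages look like the example in the question could be registered with `filters.register("hotel-a.example", "price\\s*:\\s*\\$(\\d+\\.\\d{2})")`.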

As with all problems, good design lets you deliver value and adapt to situations you haven't considered much more quickly than a general solution would.
Start by writing something that parses the data from one provider - the one with the simplest format to handle. Find a way to adapt that handler into your crawler. Be sure to encapsulate construction - you should always do this anyway...
public class RoomTypeExtractor
{
private RoomTypeExtractor() { }
public static RoomTypeExtractor GetInstance()
{
return new RoomTypeExtractor();
}
public String GetRoomType(String content)
{
// BEHAVIOR #1
}
}
The GetInstance() method lets you promote this to a Strategy pattern practically for free.
Then add your second provider type. Say, for instance, that you have a slightly more complex data format which is a little more prevalent than the first format. Start by refactoring what was your concrete room type extractor class into an abstraction with a single variation behind it and have the GetInstance() method return an instance of the concrete type:
public abstract class RoomTypeExtractor
{
public static RoomTypeExtractor GetInstance()
{
return SimpleRoomTypeExtractor.GetInstance();
}
public abstract String GetRoomType(String content);
}
public final class SimpleRoomTypeExtractor extends RoomTypeExtractor
{
private SimpleRoomTypeExtractor() { }
public static SimpleRoomTypeExtractor GetInstance()
{
return new SimpleRoomTypeExtractor();
}
public String GetRoomType(String content)
{
// BEHAVIOR #1
}
}
Create another variation that implements the Null Object pattern...
public class NullRoomTypeExtractor extends RoomTypeExtractor
{
private NullRoomTypeExtractor() { }
public static NullRoomTypeExtractor GetInstance()
{
return new NullRoomTypeExtractor();
}
public String GetRoomType(String content)
{
// whatever "no content" behavior you want... I chose returning null
return null;
}
}
Add a base class that will make it easier to work with the Chain of Responsibility pattern that is in this problem:
public abstract class ChainLinkRoomTypeExtractor extends RoomTypeExtractor
{
private final RoomTypeExtractor next_;
protected ChainLinkRoomTypeExtractor(RoomTypeExtractor next)
{
next_ = next;
}
public final String GetRoomType(String content)
{
if (CanHandleContent(content))
{
return GetRoomTypeFromUnderstoodFormat(content);
}
else
{
return next_.GetRoomType(content);
}
}
protected abstract boolean CanHandleContent(String content);
protected abstract String GetRoomTypeFromUnderstoodFormat(String content);
}
Now, refactor the original implementation to have a base class that joins it into a Chain of Responsibility...
public final class SimpleRoomTypeExtractor extends ChainLinkRoomTypeExtractor
{
private SimpleRoomTypeExtractor(RoomTypeExtractor next)
{
super(next);
}
public static SimpleRoomTypeExtractor GetInstance(RoomTypeExtractor next)
{
return new SimpleRoomTypeExtractor(next);
}
protected boolean CanHandleContent(String content)
{
// return whether or not content contains the right format
}
protected String GetRoomTypeFromUnderstoodFormat(String content)
{
// BEHAVIOR #1
}
}
Be sure to update RoomTypeExtractor.GetInstance():
public static RoomTypeExtractor GetInstance()
{
RoomTypeExtractor extractor = NullRoomTypeExtractor.GetInstance();
extractor = SimpleRoomTypeExtractor.GetInstance(extractor);
return extractor;
}
Once that's done, create a new link for the Chain of Responsibility...
public final class MoreComplexRoomTypeExtractor extends ChainLinkRoomTypeExtractor
{
private MoreComplexRoomTypeExtractor(RoomTypeExtractor next)
{
super(next);
}
public static MoreComplexRoomTypeExtractor GetInstance(RoomTypeExtractor next)
{
return new MoreComplexRoomTypeExtractor(next);
}
protected boolean CanHandleContent(String content)
{
// Check for presence of format #2
}
protected String GetRoomTypeFromUnderstoodFormat(String content)
{
// BEHAVIOR #2
}
}
Finally, add the new link to the chain. If this is a more common format, you might want to give it higher priority by putting it higher in the chain (the real forces that govern the order of the chain will become apparent when you do this):
public static RoomTypeExtractor GetInstance()
{
RoomTypeExtractor extractor = NullRoomTypeExtractor.GetInstance();
extractor = SimpleRoomTypeExtractor.GetInstance(extractor);
extractor = MoreComplexRoomTypeExtractor.GetInstance(extractor);
return extractor;
}
As time passes, you may want to add ways to dynamically add new links to the Chain of Responsibility, as pointed out by Cletus, but the fundamental principle here is Emergent Design. Start with high quality. Keep quality high. Drive with tests. Do those three things and you will be able to use the fuzzy logic engine between your ears to overcome almost any problem...
EDIT
Translated to Java. Hope I did that right; I'm a little rusty.
