Just wondering if there are a set of design patterns for complex string manipulation?
Basically the problem I am trying to solve is I need to be able to read in a string, like the following:
"[name_of_kicker] looks to make a clearance kick, but is under some real pressure from the [name_of_defending_team] players. He gets a [length_of_kick] kick away, but it drifts into touch on the full."
or
"[name_of_kicker] receives the ball from [name_of_passer] and launches the bomb. [name_of_kicker] has really made good contact, it's given a couple of [name_of_attacking_team] chasers ample time to get under the ball as it comes down."
And replace each "tag" with a possible value and check if the string is equal to another string.
So, for example, any tag that represents a player I need to be able to replace with any one of 22 string values that represent a player. But I also need to make sure I have looped through each combination of players for the various tags that I may find in a string. Note that the tags listed in the above 2 samples are not the only tags possible; there are countless other ones that could come up in any sentence.
I had tried to create a load of nested for loops to go through the collection of players, etc. and attempt to replace the tags each time, but with there being many possible tags I was just creating one nested for loop within another, and it has become unmanageable. I also suspect it is inefficient, since I need to loop through over 1,000 base strings like the samples above and replace different tags with players, etc. for each one.
So are there any string manipulation patterns I could look into, or does anyone have any possible solutions for a problem like this?
Firstly, to answer your question.
Just wondering if there are a set of design patterns for complex string manipulation?
Not really. There are some techniques, but they hardly qualify as design patterns. The two techniques that spring to mind are template expansion and pattern matching.
What you are currently doing / proposing to do is a form of template expansion. However, typical templating engines don't support the combinatorial expansion that you are trying to do, and as you anticipate, it would appear to be an inefficient way to solve your problem.
A better technique would appear to be pattern matching. Let's take your first example, and turn it into a pattern:
"(Ronaldino|Maradonna|Peter Shilton|Jackie Charlton) looks to make a clearance kick, but is under some real pressure from the (Everton|Real Madrid|Adelaide United) players. He gets a ([0-9]+ metre) kick away, but it drifts into touch on the full."
What I've done is insert all of the possible alternatives into the pseudo-template, to turn it into a regex. I can now compile this regex to a java.util.Pattern, and use it to match against your list of other strings.
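To illustrate, here is a minimal sketch of matching against such a compiled pattern; the player and team names, and the sample commentary line, are of course just placeholder data:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CommentaryMatcher {
    public static void main(String[] args) {
        // The pseudo-template from above, compiled as a regex
        Pattern p = Pattern.compile(
            "(Ronaldino|Maradonna|Peter Shilton|Jackie Charlton) looks to make a clearance kick, "
            + "but is under some real pressure from the (Everton|Real Madrid|Adelaide United) players\\. "
            + "He gets a ([0-9]+ metre) kick away, but it drifts into touch on the full\\.");

        String candidate = "Peter Shilton looks to make a clearance kick, "
            + "but is under some real pressure from the Everton players. "
            + "He gets a 40 metre kick away, but it drifts into touch on the full.";

        Matcher m = p.matcher(candidate);
        if (m.matches()) {
            // The capture groups tell you which alternatives actually matched
            System.out.println("kicker=" + m.group(1) + ", team=" + m.group(2)
                + ", kick=" + m.group(3));
        }
    }
}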
Having said that, if you are trying to do this to "analyse" text, I don't rate your chances of success. I think you would be better off going down the NLP route.
What you're describing looks a bit like what template engines are used for.
Two popular ones for Java are:
FreeMarker
StringTemplate
But there are many, many more, of course.
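For a flavour of the API, here is a tiny hedged sketch using StringTemplate; the tag and player names are made-up examples:

import org.stringtemplate.v4.ST;

public class TemplateDemo {
    public static void main(String[] args) {
        // Tags in the template are filled in by name at render time
        ST st = new ST("<kicker> looks to make a clearance kick, but is under "
            + "some real pressure from the <team> players.");
        st.add("kicker", "Peter Shilton");
        st.add("team", "Everton");
        System.out.println(st.render());
    }
}

Note, though, that a template engine fills in one combination of values at a time; the combinatorial enumeration from your question would still be your own loop around it.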
My two cents. As you stated, "I was just creating one nested for loop within another, and it has become unmanageable."
You are looking in the wrong direction, my friend. There is a whole universe of solutions to the problem you are facing, known simply as rule engines.
There are various types of rule engines (business rule engines, web template engines, etc.), but for the above requirement I suggest business rule engines.
I can't comment on which one to use, as it depends on:
Multi-threading.
Open Source/Commercial.
Load factor/Processing time etc.
Hope it helps.
http://ratakondas.blogspot.in/2012/06/business-rules-engines-white-paper.html
[Read the summary section; it gives the best advice.]
https://en.wikipedia.org/wiki/Business_rules_engine#Types_of_rule_engines
https://en.wikipedia.org/wiki/Comparison_of_web_template_engines
Welcome to the world of rule engines :)
Related
I want to create an algorithm that searches job descriptions for given words (like Java, Angular, Docker, etc.). My algorithm works, but it is rather naive. For example, it cannot detect the word Java if it is contained in another word (such as JavaEE). When I check for substrings, I have the problem that, for example, Java is recognized in the word JavaScript, which I want to avoid. I could of course make an explicit case distinction here, but I'm looking for a more general solution.
Are there any particular techniques or approaches that try to solve this problem?
Unfortunately, I don't have the amount of data necessary for data-driven approaches like machine learning.
Train a simple word2vec language model with your whole job description text data. Then use your own logic to find the keywords. When you find a match that is not an exact match, use your similar-words list.
For example, if you're searching for Java but also find JavaScript, use your word vectors to check whether there is any similarity between them (in other words, whether they have ever been used in a similar context). Java and JavaEE have probably been used in the same sentence before, but Java and JavaScript, or Angular and Angularentwicklung, have not.
It may seem a bit like over-engineering, but it's not :).
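If you are on the JVM, one way to sketch the training step is with DeepLearning4J's Word2Vec implementation; note that descriptions.txt and the hyperparameters below are invented for illustration:

import java.io.File;

import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class JobKeywords {
    public static void main(String[] args) throws Exception {
        // Hypothetical corpus: one job description per line
        SentenceIterator iter = new LineSentenceIterator(new File("descriptions.txt"));
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();

        Word2Vec vec = new Word2Vec.Builder()
            .minWordFrequency(5)
            .layerSize(100)
            .windowSize(5)
            .iterate(iter)
            .tokenizerFactory(tokenizer)
            .build();
        vec.fit();

        // Words used in similar contexts get a high cosine similarity
        System.out.println(vec.similarity("java", "javaee"));
        System.out.println(vec.wordsNearest("java", 10));
    }
}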
I spent some time researching my problem, and I found that identifying certain words, even if they don't match 1:1, is not a trivial problem. You could solve the problem by listing synonyms for the words you are looking for, or you could build a rule-based named entity recognition service. But that is both error-prone and maintenance-intensive.
Probably the best way to solve my problem is to build a named entity recognition service using machine learning. I am currently watching a video series that looks very promising for the given problem. --> https://www.youtube.com/playlist?list=PL2VXyKi-KpYs1bSnT8bfMFyGS-wMcjesM
I will comment on this answer when I am done with my work to give feedback to those who are facing the same problem.
I am planning to develop an adventure-like game.
For that I am going to have a lot of instances of classes with different texts (basically strings).
I don't want to hardcode that many texts, so I am looking for a better way to do it.
The guy in this video ( https://www.youtube.com/watch?v=8CDePunJlck ) is using JSON to write text files for each class instance manually and parse them automatically into instances. That goes in the right direction.
I'm looking for more information on that; what is this procedure called?
It's said in the video that this also works with databases?
Is there a way to design slightly more complex things with this approach?
E.g. I have the case that I would like to output different texts if, say, a local or global variable is over a threshold. Can I do this without hardcoding and without writing a separate class for each of my proposed instances?
Thank you!
Your question is quite broad, and it is hard to give a definitive answer. Here are some thoughts - hope you find it helpful.
You are right that you don't want to hardcode strings. The alternative to this is storing strings as external resources, and loading them into your game at start. There are numerous ways the resource can be organized; the choice depends on your programming platform, game architecture etc. For example, you can use simple name-value approach:
AREA_1_DESCRIPTION: You stand next to a small white house.
ITEM_22_DESTRUCTION: The nasty snake disappears with a loud "Bang!"
Using JSON or XML will give you more structured storage, which can be of great help, since you can organize your texts so that it is easier to use them in the code:
<item id="375" name="Great Sword">
<short_description>A Great Sword of Darkness</short_description>
<long_description>The sword has almost black blade with some unknown runes engraved</long_description>
</item>
If your programming system can access a database, then you can do something similar and store the texts in tables; this, however, might make it more difficult to edit the texts later. If you want to go this way, I would still recommend using XML or JSON to store the texts, and making the game import the texts into the DB on the first run.
You probably will also need some sort of simple template-handling engine to be able to re-use some strings. You can start by creating your own version of Java's String.format() method. Your method might take as its first argument an ID of a string in your string catalog, and use some simple placeholders for the parameters. Suppose you have the following entry in your catalog:
FIRE_GEM_ACTION: "The Fire Gem touches %% and in %% seconds it turns into ashes."
Then you can write a method that will do something like this:
int delaySeconds = 5;
String message = MyTemplateProcessor.process(FIRE_GEM_ACTION, "old map", delaySeconds);
The function will take the string from the catalog, search for the occurrences of the placeholders (%%) and replace them sequentially with the parameters, so in the message you will get: The Fire Gem touches old map and in 5 seconds it turns into ashes.
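One possible (purely illustrative) implementation of that process method; StringCatalog here is a hypothetical stand-in for however you actually load your catalog:

public class MyTemplateProcessor {
    // Replaces each %% placeholder with the next parameter, in order
    public static String process(String templateId, Object... params) {
        String template = StringCatalog.get(templateId); // hypothetical catalog lookup
        for (Object param : params) {
            // replaceFirst takes a regex; quote the replacement so "$" and "\" are safe
            template = template.replaceFirst("%%",
                java.util.regex.Matcher.quoteReplacement(String.valueOf(param)));
        }
        return template;
    }
}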
In general, I would recommend you to have a look at some systems specially designed for creation of adventure games. Inform 7 will be a good starting place: http://inform7.com/learn/
I have come across similar problems a few times in the past and want to know what language (methodology), if any, is used to solve similar problems (I am a J2EE/Java developer):
Problem: out of a probable set of words, with a given rule (say a word can be a combination of A and X and always starts with an X, and each word is delimited by a space), you have to read a sequence of words and parse through the input to decide which of the words are syntactically correct. In a nutshell, these are problems that involve parsing techniques. Say, simulate the logic of a vending machine in Java.
So what I want to know is: what are the techniques / best approaches for solving problems pertaining to parsing inputs, like the alien language processing problem in Google Code Jam?
Google code jam problem
Do we use something like ANTLR or some library in Java?
I know this question is slightly generic, but I had no other way of expressing it.
P.S.: I do not want a solution; I am looking for the best way to solve such recurring problems.
You can use JavaCC for complex parsing.
For relatively simple parsing and event processing I use enum(s) as a state machine, especially as a push parser (a sketch follows below).
For very simple parsing, you can use indexOf or split(" ") with equals, switch or startsWith
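To make the enum-as-state-machine idea concrete, here is a rough sketch for the rule from the question (words made of A's and X's that must start with an X, separated by spaces); the state names are invented:

enum State {
    START {
        State next(char c) { return c == 'X' ? IN_WORD : ERROR; }
    },
    IN_WORD {
        State next(char c) {
            if (c == ' ') return START; // word boundary, expect a new word
            return (c == 'A' || c == 'X') ? IN_WORD : ERROR;
        }
    },
    ERROR {
        State next(char c) { return ERROR; } // sink state: no recovery
    };

    abstract State next(char c);

    static boolean accepts(String input) {
        State s = START;
        for (char c : input.toCharArray()) {
            s = s.next(c);
        }
        return s == IN_WORD; // must end inside a valid word
    }
}

For example, State.accepts("XAX XA") is true, while State.accepts("AX") is false.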
If you want to simulate the logic of something that is essentially a finite state automaton, you can simply code the FSA by hand. This is a standard computer science solution. A less obvious way to do this is to use a lexer generator (there are lots of them) to generate the FSA from descriptions of the valid sequences of events (in lexer-generator speak, these are called "characters", but you can cheat and substitute event occurrences for characters).
If you have complex recursive rules about matching, you'll want a more traditional parser.
You can code these by hand, too, if the grammar isn't complicated; see my SO answer on "how to build a recursive descent parser". If your grammar is complex or changes quickly, you'll want to use a standard parser generator. Other answers here suggest specific ones, but there are many to choose from, all generally very capable.
[FWIW, I applied parser generators to recognizing valid transaction sequences in 1974, in TRW POS terminals for the May Company department store. Worked pretty well.]
You can use ANTLR, which is good; it will help with complex problems. But you can also use regular expressions, e.g. split("\\s+").
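For the toy rule in the question, a plain regular expression is already enough; a minimal sketch with an invented sample input:

public class WordChecker {
    public static void main(String[] args) {
        // Valid words consist of A's and X's and must start with an X
        for (String word : "XAX YA XA".split("\\s+")) {
            System.out.println(word + " -> " + word.matches("X[AX]*"));
        }
    }
}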
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: money amount, date, comment and section (e.g. Car, Salary, Computer, Food, ...).
The sections are created "on the fly". When the user enters a new section, it is added to the section JComboBox for further selection. The other point is that the comments could be in different languages, so the list of hard-coded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section?
My first thought was to do it with a neural network and learn from the input if the user selects another section.
But my problem is, I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...). But all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the Sourceforge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem and requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First, the obvious solution: build a list of words and synonyms for all your sections and parse for these synonyms. You can then collect comments online for synonym analysis, or parse comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80%-satisfying feature, while aiming for 90% could cost a week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
// Normalise the comment so "Fuel" and "FUEL" both match
String comment = paramInput.toUpperCase();
if (comment.contains("FUEL")) {
    // do the fuel functionality
}
In a simple app, if you will only ever have some specific sections, you can take the string from the comment, check whether it contains certain keywords, and set the value of Section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top-matched category as a "best guess".
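As a rough sketch of that approach, assuming a recent Lucene on the classpath; the category names and keywords are made up:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class CategoryGuesser {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory();

        // Index one document per Section, carrying its keywords
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            addCategory(writer, "Car", "fuel petrol diesel service garage tyre");
            addCategory(writer, "Food", "lunch dinner restaurant groceries");
        }

        // Treat the user's comment as a free-text query; the top hit is the best guess
        Query query = new QueryParser("keywords", analyzer).parse("Lunch at Joe's");
        try (IndexReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(query, 1);
            if (hits.scoreDocs.length > 0) {
                Document doc = searcher.doc(hits.scoreDocs[0].doc);
                System.out.println("Best guess: " + doc.get("name")); // prints "Food"
            }
        }
    }

    private static void addCategory(IndexWriter w, String name, String keywords) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("name", name, Field.Store.YES));
        doc.add(new TextField("keywords", keywords, Field.Store.NO));
        w.addDocument(doc);
    }
}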
P.S. Neural Network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement String matching I wouldn't even know where to start.
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch".
a list of most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so in theory a fit for a neural net, but why not compute the numerical best fit directly?
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'which of the provided pre-trained classes does the input represent?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available to produce better results and create a more accurate decision boundary / representation of the function that maps inputs to outputs. This also assumes that you know all possible classes before you start, and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections beforehand:
An easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is then to find a list of each Section that the typed keywords belong to, suggest multiple Sections, and let the user pick one. The feedback you get from user selections would enable better suggestions in the future. Another would be to calculate a Bayesian probability - the probability that this word belongs to Section X given the previous stored mappings - for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. Calculations of probabilities will need to be updated as you gather more information, of course; perhaps this could be done with every new addition in a background thread.
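As a rough illustration of the Bayesian option (the class and method names are invented), the sketch below estimates P(Section | word) from the stored comment-to-Section mappings and sums the per-word scores:

import java.util.*;

public class SectionSuggester {
    // counts.get(word).get(section) = how often 'word' appeared in a comment filed under 'section'
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private static final Set<String> STOP_WORDS = Set.of("and", "or", "the", "at", "a", "for");

    // Call this for every (comment, section) pair already stored in the database
    public void learn(String comment, String section) {
        for (String word : comment.toLowerCase().split("\\W+")) {
            if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
            counts.computeIfAbsent(word, k -> new HashMap<>())
                  .merge(section, 1, Integer::sum);
        }
    }

    // Sums P(section | word) over the words of a new comment; returns the best Section or null
    public String suggest(String comment) {
        Map<String, Double> scores = new HashMap<>();
        for (String word : comment.toLowerCase().split("\\W+")) {
            Map<String, Integer> bySection = counts.get(word);
            if (bySection == null) continue;
            int total = bySection.values().stream().mapToInt(Integer::intValue).sum();
            // P(section | word) estimated as count(word, section) / count(word)
            bySection.forEach((section, n) -> scores.merge(section, (double) n / total, Double::sum));
        }
        return scores.entrySet().stream()
                     .max(Map.Entry.comparingByValue())
                     .map(Map.Entry::getKey)
                     .orElse(null);
    }
}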
I'm looking to compare two documents to determine what percentage of their text matches based on keywords.
To do this I could easily chop them into a set of sanitised words and compare, but I would like something a bit smarter, something that can match words based on their root, i.e. even if their tense or plurality is different. This sort of technique seems to be used in full-text searches, but I have no idea what to look for.
Does such an engine (preferably applicable to Java) exist?
Yes, you want a stemmer. Lauri Karttunen did some work with finite state machines that was amazing, but sadly I don't think there's an available implementation to use. As mentioned, Lucene has stemmers for a variety of languages, and the OpenNLP and GATE projects might help you as well. Also, how were you planning to "chop them up"? This is a little trickier than most people think because of punctuation, possessives, and the like. And just splitting on white space doesn't work at all in many languages. Take a look at OpenNLP for that too.
Another thing to consider is that just comparing the non stop-words of the two documents might not be the best approach for good similarity depending on what you are actually trying to do because you lose locality information. For example, a common approach to plagiarism detection is to break the documents into chunks of n tokens and compare those. There are algorithms such that you can compare many documents at the same time in this way much more efficiently than doing a pairwise comparison between each document.
I don't know of a pre-built engine, but if you decide to roll your own (e.g., if you can't find pre-written code to do what you want), searching for "Porter Stemmer" should get you started on an algorithm to get rid of (most) suffixes reasonably well.
I think Lucene might be along the lines of what you're looking for. From my experience it's pretty easy to use.
EDIT: I just reread the question and thought about it some more. Lucene is a full-text search engine for Java. However, I'm not quite sure how hard it would be to repurpose it for what you're trying to do. Either way, it might be a good resource to start looking at and go from there.
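For the stemming side specifically, Lucene's English analysis chain includes a Porter stemmer; a minimal sketch (the field name "f" and the sample words are arbitrary):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StemDemo {
    public static void main(String[] args) throws Exception {
        // EnglishAnalyzer tokenizes, lowercases, drops stop words and applies Porter stemming
        try (Analyzer analyzer = new EnglishAnalyzer();
             TokenStream ts = analyzer.tokenStream("f", "running runs")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints "run" twice
            }
            ts.end();
        }
    }
}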