Java implementation of Aho-Corasick string matching algorithm?

I know there have been previous questions about this algorithm, but I honestly haven't come across a simple Java implementation. Many people have copied and pasted the same code into their GitHub profiles, and it's irritating me.
So, as an interview exercise, I set out to implement the algorithm using a different approach.
The algorithm turned out to be very challenging. I honestly am lost on how to go about it. The logic just doesn't make sense to me. I've spent nearly 4 days straight sketching the approach, but to no avail.
Therefore please enlighten us with your wisdom.
I'm primarily basing my implementation on this question: Intuition behind the Aho-Corasick string matching algorithm
It would be a big bonus if someone could post their own implementation here.
Here is my incomplete, non-working solution, which is where I got really stuck.
If you're overwhelmed by the code, the main problem lies in the core Aho-Corasick algorithm itself. The trie of dictionary words is already built correctly.
The issue is: now that we have the trie structure, how do we actually start implementing the matcher?
None of the resources I found were helpful.
public class DeterminingDNAHealth {
private Trie tree;
private String[] dictionary;
private Node FailedNode;
private DeterminingDNAHealth() {
}
private void buildMatchingMachine(String[] dictionary) {
this.tree = new Trie();
this.dictionary = dictionary;
Arrays.stream(dictionary).forEach(tree::insert);
}
private void searchWords(String word, String[] dictionary) {
buildMatchingMachine(dictionary);
HashMap < Character, Node > children = tree.parent.getChildren();
String matchedString = "";
for (int i = 0; i < 3; i++) {
char C = word.charAt(i);
matchedString += C;
matchedChar(C, matchedString);
}
}
private void matchedChar(char C, String matchedString) {
if (tree.parent.getChildren().containsKey(C) && dictionaryContains(matchedString)) {
tree.parent = tree.parent.getChildren().get(C);
} else {
char suffix = matchedString.charAt(matchedString.length() - 2);
if (!tree.parent.getParent().getChildren().containsKey(suffix)) {
tree.parent = tree.parent.getParent();
}
}
}
private boolean dictionaryContains(String word) {
return Arrays.asList(dictionary).contains(word);
}
public static void main(String[] args) {
DeterminingDNAHealth DNA = new DeterminingDNAHealth();
DNA.searchWords("abccab", new String[] {
"a",
"ab",
"bc",
"bca",
"c",
"caa"
});
}
}
I have set up a trie data structure which works fine, so there is no problem there.
trie.java
public class Trie {
public Node parent;
public Node fall;
public Trie() {
parent = new Node('⍜');
parent.setParent(new Node());
}
public void insert(String word) {...}
private boolean delete(String word) {...}
public boolean search(String word) {...}
public Node searchNode(String word) {...}
public void printLevelOrderDFS(Node root) {...}
public static void printLevel(Node node, int level) {...}
public static int maxHeight(Node root) {...}
public void printTrie() {...}
}
Same thing for Node.
Node.java
public class Node {
private char character;
private Node parent;
private HashMap<Character, Node> children = new HashMap<Character, Node>();
private boolean leaf;
// default case
public Node() {}
// constructor accepting the character
public Node(char character) {
this.character = character;
}
public void setCharacter(char character) {...}
public char getCharacter() {...}
public void setParent(Node parent) {...}
public Node getParent() {...}
public HashMap<Character, Node> getChildren() {...}
public void setChildren(HashMap<Character, Node> children) {...}
public void resetChildren() {...}
public boolean isLeaf() {...}
public void setLeaf(boolean leaf) {...}
}

I usually teach a course on advanced data structures every other year, and we cover Aho-Corasick automata when exploring string data structures. There are slides available here that show how to develop the algorithm by optimizing several slower ones.
Generally speaking, I’d break the implementation down into four steps:
1. Build the trie. At its core, an Aho-Corasick automaton is a trie with some extra arrows tossed in. The first step in the algorithm is to construct this trie, and the good news is that this proceeds just like a regular trie construction. In fact, I’d recommend implementing this step by pretending you’re just making a trie, without doing anything to anticipate the later steps.
2. Add suffix (failure) links. This step in the algorithm adds in the important failure links, which the matcher uses whenever it encounters a character that it can’t use to follow a trie edge. The best explanation I have for how these work is in the linked lecture slides. This step of the algorithm is implemented as a breadth-first search walk over the trie. Before you code this one up, I’d recommend working through a few examples by hand to make sure you get the general pattern. Once you do, this isn’t particularly tricky to code up. However, trying to code this up without fully getting how it works is going to make debugging a nightmare!
3. Add output links. This step is where you add in the links that are used to report all the strings that match at a given node in the trie. It’s implemented through a second breadth-first search over the trie, and again, the best explanation I have for how it works is in the slides. The good news is that this step is actually a lot easier to implement than suffix link construction, because by then you’ll be more familiar with how to do the BFS and how to walk up and down the trie. Again, don’t attempt to code this up unless you can comfortably do this by hand! You don’t need much code, but you don’t want to get stuck debugging code whose high-level behavior you don’t understand.
4. Implement the matcher. This step isn’t too bad! You just walk down the trie reading characters from the input, outputting all matches at each step and using the failure links whenever you get stuck and can’t advance downward.
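The four steps above can be sketched in Java roughly as follows. This is a minimal illustration, not the code from the slides or from the question: the class name AhoCorasickSketch, the State type, and the "word@startIndex" result format are all made up for this example, patterns are assumed to be non-empty, and the suffix-link and output-link passes are folded into a single breadth-first walk for brevity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class AhoCorasickSketch {
    // Hypothetical node type for this sketch (not the Node class from the question).
    static final class State {
        final Map<Character, State> next = new HashMap<>();
        State fail;    // suffix (failure) link
        State output;  // output link: nearest state on the failure chain that ends a pattern
        String word;   // the pattern ending at this state, or null
    }

    private final State root = new State();

    AhoCorasickSketch(List<String> patterns) {
        // Step 1: build an ordinary trie (patterns are assumed non-empty).
        for (String p : patterns) {
            State s = root;
            for (char c : p.toCharArray()) {
                s = s.next.computeIfAbsent(c, k -> new State());
            }
            s.word = p;
        }
        // Steps 2 and 3: breadth-first walk that sets failure and output links.
        root.fail = root;
        Queue<State> queue = new ArrayDeque<>();
        for (State child : root.next.values()) {
            child.fail = root;   // depth-1 states always fall back to the root
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            State s = queue.remove();
            for (Map.Entry<Character, State> e : s.next.entrySet()) {
                char c = e.getKey();
                State child = e.getValue();
                // Follow the parent's failure chain until some state has an edge on c.
                State f = s.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                State target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // The failure target is shallower, so its output link is already set.
                child.output = (child.fail.word != null) ? child.fail : child.fail.output;
                queue.add(child);
            }
        }
    }

    // Step 4: scan the text once, reporting every match as "word@startIndex".
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        State s = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (s != root && !s.next.containsKey(c)) {
                s = s.fail;   // stuck: fall back along failure links
            }
            State t = s.next.get(c);
            s = (t != null) ? t : root;
            if (s.word != null) {
                hits.add(s.word + "@" + (i - s.word.length() + 1));
            }
            for (State o = s.output; o != null; o = o.output) {
                hits.add(o.word + "@" + (i - o.word.length() + 1));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        AhoCorasickSketch ac =
                new AhoCorasickSketch(Arrays.asList("a", "ab", "bc", "bca", "c", "caa"));
        System.out.println(ac.findAll("abccab"));
    }
}
```

With the dictionary and text from the question, main prints [a@0, ab@0, bc@1, c@2, c@3, a@4, ab@4]. Keeping the trie construction separate from the link-building loop mirrors the advice above: the first loop is a plain trie insert and knows nothing about the later steps.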
I hope this gives you a more modular task breakdown and a reference about how the whole process is supposed to work. Good luck!

You're not going to get a good understanding of the Aho-Corasick string matching algorithm by reading some source code. And you won't find a trivial implementation because the algorithm is non-trivial.
The original paper, Efficient String Matching: An Aid to Bibliographic Search, is well written and quite approachable. I suggest you download that PDF, read it carefully, think about it a bit, and read it again. Study the paper.
You might also find it useful to read others' descriptions of the algorithm. There are many, many pages with text descriptions, diagrams, Powerpoint slides, etc.
You probably want to spend at least a day or two studying those resources before you try to implement it. Because if you try to implement it without fully understanding how it works, you're going to be lost, and your implementation will show it. The algorithm isn't exactly simple, but it's quite approachable.
If you just want some code, there's a good implementation here: https://codereview.stackexchange.com/questions/115624/aho-corasick-for-multiple-exact-string-matching-in-java.

Related

How to create a general tree with basic function like insert in java

//I have some basic code written down for the General Tree.
class GeneralTree {
public static class Node{
String data;
ArrayList<Node> link;
Node(){}
public void setValue(String data){
this.data = data;
}
public String getValue(){
return data;
}
}
Node root;
int degree;
String type; //shows tree type;
public GeneralTree(){
degree = 0;
root = null;
type = "";
}
public GeneralTree(Node root, int degree){
this.root = root;
this.degree = degree;
}
public Node getRoot(){return root;}
}
public class Hw5 {
}
I tried searching the internet for an explanation of general trees. I understand how they work on paper and can even convert a general tree to a binary tree on paper, but I do not know how a general tree would be implemented in code. A binary tree has left and right children, which are easy to deal with. A general tree, on the other hand, has an ArrayList that stores multiple children, which is the confusing part for me. I do not know what an insert function would look like for this, or how I would even traverse the tree.
Need Help With:
Code implementation for a general tree.
How an insert function will work for the general tree.
If you can direct me to some reading material, that would be amazing too.

I find that by looking for the solution myself I learn much more and it sticks better. I'm not saying that you have to do the same, but good searching skills are the fastest way to find what you need, because as a novice you often can't put what you are looking for into words (I speak from experience).
Use GitHub, Stack Overflow and Tabnine to find code samples; there are plenty out there (use keywords like "general tree java code example"). Have a look at a few examples and try to read the code and understand what is happening. You can run the code in debug mode, put some breakpoints in and step through the execution to see what happens. Debugging is a very useful skill.
You can find some examples of general trees here:
https://github.com/cmilliga/GeneralTreesCollegeWork
https://github.com/Faisal-AlDhuwayhi/Electric-Power-Grid/tree/master/Project-Code/src
Each node can have child Node(s), so the ArrayList is there to keep track of the children of that Node. If you want to add a new Node, you need to walk through the nodes and their children until you find the intended parent, then append the new Node to that parent's list.
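As a rough illustration of that idea, here is a minimal sketch of insert and traversal for a general tree. The names GeneralTreeSketch, TreeNode, find, insert and preorder are made up for this example, and the choice to identify the parent by its data value is an assumption; the linked repositories may do it differently.

```java
import java.util.ArrayList;
import java.util.List;

class GeneralTreeSketch {
    // Hypothetical node type: one value plus any number of children.
    static final class TreeNode {
        final String data;
        final List<TreeNode> children = new ArrayList<>();
        TreeNode(String data) { this.data = data; }
    }

    TreeNode root;

    // Depth-first search for the first node holding the given value.
    static TreeNode find(TreeNode node, String data) {
        if (node == null) return null;
        if (node.data.equals(data)) return node;
        for (TreeNode child : node.children) {
            TreeNode hit = find(child, data);
            if (hit != null) return hit;
        }
        return null;
    }

    // Inserts data under the node holding parentData.
    // The first insert into an empty tree becomes the root (parentData is ignored).
    boolean insert(String parentData, String data) {
        if (root == null) {
            root = new TreeNode(data);
            return true;
        }
        TreeNode parent = find(root, parentData);
        if (parent == null) return false;   // no such parent in the tree
        parent.children.add(new TreeNode(data));
        return true;
    }

    // Pre-order traversal: visit the node, then each child from left to right.
    static void preorder(TreeNode node, List<String> out) {
        if (node == null) return;
        out.add(node.data);
        for (TreeNode child : node.children) {
            preorder(child, out);
        }
    }
}
```

Compared with a binary tree, the only real change is that the fixed left/right pair becomes a loop over the children list, both when searching for the parent and when traversing.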
Hopefully this helps.

Using Visitor and Composite patterns to build a filtered stream

I am using a composite pattern with multiple leaf node classes which have specialist operations and a visitor pattern to allow those operations to be performed. In this example I've left out all the obvious accept methods for clarity.
interface Command {
public int getCost();
}
class SimpleCommand implements Command {
private int cost;
public int getCost() {
return cost;
}
}
class MultiCommand implements Command {
private Command subcommand;
private int repeated;
public int getCost() {
return repeated * subcommand.getCost();
}
public void decrement() {
if (repeated > 0)
repeated--;
}
}
class CommandList implements Command {
private List<Command> commands;
public int getCost() {
return commands.stream().mapToInt(Command::getCost).sum();
}
public void add(Command command) {
commands.add(command);
}
}
interface CommandVisitor {
default void visitSimpleCommand(SimpleCommand command) { }
default void visitMultiCommand(MultiCommand multiCommand) { }
default void visitCommandList(CommandList commandList) { }
}
It's now possible to build visitors to perform operations such as decrement. However I find it easier to create a general purpose visitor that streams objects of a certain class so that any operation can be performed on them:
class MultiCommandCollector implements CommandVisitor {
private final Stream.Builder<MultiCommand> streamBuilder = Stream.builder();
// Visits the whole hierarchy first, then returns every MultiCommand found as a stream.
public static Stream<MultiCommand> streamFor(Command command) {
MultiCommandCollector visitor = new MultiCommandCollector();
command.accept(visitor);
return visitor.streamBuilder.build();
}
@Override
public void visitMultiCommand(MultiCommand multiCommand) {
streamBuilder.accept(multiCommand);
}
}
This is used as you would expect. For example:
MultiCommandCollector.streamFor(command).forEach(MultiCommand::decrement);
This has one significant limitation: it can't be used to alter the hierarchy as the stream is processed. For example, the following fails:
CommandListCollector.streamFor(commandList).forEach(cl -> cl.add(command));
I can't think of an alternative elegant design that would allow this.
My question is: is there a natural extension to this design to allow a general purpose visitor that can also alter the hierarchy? In other words, is there a way that the visitor can visit one member then refresh the hierarchy before visiting the next? Is this compatible with the use of streams?

In my prior experience, Visitor pattern is useful to either query or to recreate the hierarchy. The querying part is obvious - you would simply listen for specific types of sub-objects and then build the query result as it fits. The other question, changing the hierarchy, is more difficult.
It can be really hard to change the hierarchy while iterating through it. Therefore, I know of two useful techniques that work fine in practice.
1. While visiting the hierarchy, build the list of objects to change. Do not change them until the visit is complete. The concrete visitor can build the list of objects of interest as its private member. Once it completes the visit, it exposes the list of objects as its result. Only then start iterating through the resulting list and make changes to the objects.
2. While visiting the hierarchy, as you visit an element, create a copy of the element. If the element needs to be changed, construct the changed version. Otherwise, if the element does not need to change, just return it as the new element. After the entire visit is done, you will have the new hierarchy with all modifications made as intended. The old hierarchy can then be dereferenced, and the garbage collector will collect the elements that have been replaced with new ones.
The first algorithm is applicable when elements are mutable. The second algorithm is applicable when elements are immutable.
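A minimal sketch of the first technique (collect during the visit, mutate afterwards) might look like the following. It is self-contained and uses simplified stand-ins for the question's types; the names DeferredMutationSketch, ListCollector and collectLists, and the way accept recurses into children, are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class DeferredMutationSketch {
    interface Command { void accept(Visitor v); }

    interface Visitor {
        default void visitSimple(SimpleCommand c) { }
        default void visitList(CommandList l) { }
    }

    static final class SimpleCommand implements Command {
        public void accept(Visitor v) { v.visitSimple(this); }
    }

    static final class CommandList implements Command {
        final List<Command> commands = new ArrayList<>();
        void add(Command c) { commands.add(c); }
        public void accept(Visitor v) {
            v.visitList(this);
            for (Command c : commands) {
                c.accept(v);   // recurse into the children
            }
        }
    }

    // Concrete visitor that only records the lists it sees; it changes nothing.
    static final class ListCollector implements Visitor {
        final List<CommandList> found = new ArrayList<>();
        @Override public void visitList(CommandList l) { found.add(l); }
    }

    // The traversal finishes completely before the caller gets the result.
    static List<CommandList> collectLists(Command root) {
        ListCollector collector = new ListCollector();
        root.accept(collector);
        return collector.found;
    }

    public static void main(String[] args) {
        CommandList root = new CommandList();
        CommandList inner = new CommandList();
        root.add(inner);
        root.add(new SimpleCommand());
        // Safe: the hierarchy is modified only after the visit is over.
        for (CommandList list : collectLists(root)) {
            list.add(new SimpleCommand());
        }
        System.out.println(root.commands.size() + " " + inner.commands.size());
    }
}
```

Because the collected list is a snapshot, elements added during the mutation phase are not themselves visited; if they should be, a second collect pass would be needed.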
Hope this helps.

Random Postorder traversal in neo4j

I'm trying to create an algorithm in Neo4j using the java API. The algorithm is called GRAIL (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.1656&rep=rep1&type=pdf) and it assigns labels to a graph for later answering reachability queries.
This algorithm uses postorder depth first search but with random traversal each time (each child of a node is visited randomly in each traversal).
In the Neo4j Java API there is an algorithm that does this (https://github.com/neo4j/neo4j/blob/7f47df8b5d61b0437e39298cd15e840aa1bcfed8/community/kernel/src/main/java/org/neo4j/graphdb/traversal/PostorderDepthFirstSelector.java) without the randomness, and I can't seem to find a way to add it.
My code has a traversal description to which I want to add a custom order (BranchOrderingPolicy) in order to achieve the aforementioned algorithm, like this:
.order(postorderDepthFirst())

The answer to my question turned out to be rather easy, but only after a lot of thinking. I just had to alter the path expander (I created my own), which returns the relationships that the traversal follows next, and add a single line of code there to randomize the relationships.
The code is:
public class customExpander implements PathExpander {
private final RelationshipType relationshipType;
private final Direction direction;
private final Integer times;
public customExpander (RelationshipType relationshipType, Direction direction ,Integer times)
{
this.relationshipType = relationshipType;
this.direction = direction;
this.times=times;
}
@Override
public Iterable<Relationship> expand(Path path, BranchState state)
{
List<Relationship> results = new ArrayList<Relationship>();
for ( Relationship r : path.endNode().getRelationships( relationshipType, direction ) )
{
results.add( r );
}
// Randomize the order in which the traversal follows the relationships.
Collections.shuffle(results);
return results;
}
@Override
public PathExpander<String> reverse()
{
return null;
}
}

There's no such ordering by default in neo4j, however it should be possible to write one. TraversalBranch#next gives the next branch and so your implementation could get all or some and pick at random. However state keeping will be slightly tricky and as memory hungry as a breadth first ordering I'd guess. Neo4j keeps relationships in linked lists per node, so there's no easy way to pick one at random without first gathering all of them.

Java OOP: Building Object Trees / Object Families

Been a while since I used Java and was wondering if this was a decent or even correct way of setting this up.
FYI, userResults refers to a JDBI variable that isn't present in the code below.
Feel free to suggest a better method, thanks.
public class Stat
{
private int current;
private int max;
public int getCurrent() {return current;}
public void setCurrent(int current) {this.current = current;}
public int getMax() {return max;}
public void setMax(int max) {this.max = max;}
}
public class Character
{
Stat hp = new Stat();
Stat mp = new Stat();
}
Character thisCharacter = new Character();
// Set the value of current & max HP according to db data.
thisCharacter.hp.setCurrent((Integer) userResults.get("hpColumn1"));
thisCharacter.hp.setMax((Integer) userResults.get("hpColumn2"));
// Print test values
System.out.println (thisCharacter.hp.getCurrent());
System.out.println (thisCharacter.hp.getMax());

Correct? Well, does it work? Then it probably is correct.
Whether or not it is a decent way to do it, the answer is "maybe". It is hard to tell without knowing the context this code lives in. But there are some things you could keep in mind:
In which class (or rather, object) are the Stat values set? Do you feel it is the responsibility of that class to do this and to know which database values to get them from? If not, consider making a separate class that does this.
Making chained calls such as thisCharacter.hp.setCurrent(...) is a violation of the principle of least knowledge. Sometimes you can't help it, but usually it leads to kludgy code. Consider having something that handles all the logic surrounding the stats. In your code you may need a HealthStatsHandler that has methods such as loadStats(), saveStats(), and mutator actions such as takeDamage(int dmg) and revive(int health).
If you have trouble figuring out whether the object design is correct, study up on the SOLID principles. They provide nice guidelines that any developer should follow if they want code that is extensible and "clean".
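To make the handler idea a bit more concrete, here is one possible sketch. The method names loadStats, saveStats, takeDamage and revive come from the suggestion above; everything else (the class name HealthStatsSketch, passing the database row as a Map, the column keys reused from the question, and the rule that revive only works at zero health) is an assumption for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class HealthStatsSketch {
    private int current;
    private int max;

    // Hypothetical loader: reads the values from a row map such as userResults.
    void loadStats(Map<String, Object> row) {
        current = (Integer) row.get("hpColumn1");
        max = (Integer) row.get("hpColumn2");
    }

    // Hypothetical counterpart: returns the values in the same column layout.
    Map<String, Object> saveStats() {
        Map<String, Object> row = new HashMap<>();
        row.put("hpColumn1", current);
        row.put("hpColumn2", max);
        return row;
    }

    // Health never drops below zero.
    void takeDamage(int dmg) {
        current = Math.max(0, current - dmg);
    }

    // Assumed rule: only a character at zero health can be revived, capped at max.
    void revive(int health) {
        if (current == 0) {
            current = Math.min(max, health);
        }
    }

    boolean isAlive() { return current > 0; }
    int getCurrent() { return current; }
    int getMax() { return max; }
}
```

Callers then write something like stats.takeDamage(10) instead of reaching through character.hp.setCurrent(...), so the clamping rules live in one place.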

This is not really a tree. It is not possible to have more than one layer of children.
Usually you define an interface, let's call it Node, which both Stat and Character implement, and the two children of Character would then have the type Node.

I would consider creating the Stat objects separately and passing them into Character, and making the character attributes private as follows:
public class Character
{
private Stat hp;
private Stat mp;
public Stat getHp() {return hp;}
public void setHp(Stat h) {this.hp = h;}
public Stat getMp() {return mp;}
public void setMp(Stat m) {this.mp = m;}
}
// Set the value of current & max HP according to db data.
Stat hp = new Stat();
hp.setCurrent((Integer) userResults.get("hpColumn1"));
hp.setMax((Integer) userResults.get("hpColumn2"));
Character thisCharacter = new Character();
thisCharacter.setHp(hp);
// do the same for mp
One additional simple step would be to create a Character constructor that would take an hp and an mp
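A minimal sketch of that constructor-based variant is shown below. The character class is named GameCharacter here only to avoid clashing with java.lang.Character, and the two-argument Stat constructor is likewise an assumption made for this example.

```java
import java.util.Objects;

class CharacterConstructorSketch {
    static final class Stat {
        private final int current;
        private final int max;
        Stat(int current, int max) {
            this.current = current;
            this.max = max;
        }
        int getCurrent() { return current; }
        int getMax() { return max; }
    }

    // Hypothetical name; the question calls this class Character.
    static final class GameCharacter {
        private final Stat hp;
        private final Stat mp;
        // Both stats are required up front, so a character is never half-initialized.
        GameCharacter(Stat hp, Stat mp) {
            this.hp = Objects.requireNonNull(hp, "hp");
            this.mp = Objects.requireNonNull(mp, "mp");
        }
        Stat getHp() { return hp; }
        Stat getMp() { return mp; }
    }
}
```

Usage would then be a single expression, for example new GameCharacter(new Stat(30, 50), new Stat(10, 20)), with the values read from the database beforehand.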

Storing the state of a complex object with Memento pattern (and Command)

I'm working on a small UML editor project, in Java, that I started a couple of months ago. After a few weeks, I got a working copy for a UML class diagram editor.
But now, I'm redesigning it completely to support other types of diagrams, such as sequence, state, class, etc. This is done by implementing a graph construction framework (I'm greatly inspired by Cay Horstmann's work on the subject with the Violet UML editor).
The redesign was going smoothly until one of my friends told me that I forgot to add undo/redo functionality to the project, which, in my opinion, is vital.
Remembering object oriented design courses, I immediately thought of Memento and Command pattern.
Here's the deal. I have an abstract class, AbstractDiagram, that contains two ArrayLists: one for storing nodes (called Elements in my project) and the other for storing edges (called Links in my project). The diagram will probably keep a stack of commands that can be undone/redone. Pretty standard.
How can I execute these commands in an efficient way? Say, for example, that I want to move a node (the node will be an interface type named INode, and there will be concrete nodes derived from it (ClassNode, InterfaceNode, NoteNode, etc.)).
The position information is held as an attribute in the node, so by modifying that attribute in the node itself, the state is changed. When the display is refreshed, the node will have moved. This is the Memento part of the pattern (I think), with the difference that the object is the state itself.
Moreover, if I keep a clone of the original node (before it moved), I can get back to its old version. The same technique applies to the information contained in the node (the class or interface name, the text for a note node, the attribute names, and so on).
The thing is, how do I replace, in the diagram, the node with its clone upon an undo/redo operation? If I clone the original object that is referenced by the diagram (being in the node list), the clone isn't referenced in the diagram, and the only thing that points to it is the Command itself! Should I include mechanisms in the diagram for finding a node according to an ID (for example) so I can replace, in the diagram, the node with its clone (and vice versa)? Is it up to the Memento and Command patterns to do that? What about links? They should be movable too, but I don't want to create a command just for links (and one just for nodes), and I should be able to modify the right list (nodes or links) according to the type of the object the command is referring to.
How would you proceed? In short, I am having trouble representing the state of an object in a command/memento pattern so that it can be efficiently recovered and the original object restored in the diagram list, depending on the object type (node or link).
Thanks a lot!
Guillaume.
P.S.: if I'm not clear, tell me and I will clarify my message (as always!).
Edit
Here's my actual solution, that I started implementing before posting this question.
First, I have an AbstractCommand class defined as follows:
public abstract class AbstractCommand {
public boolean blnComplete;
public void setComplete(boolean complete) {
this.blnComplete = complete;
}
public boolean isComplete() {
return this.blnComplete;
}
public abstract void execute();
public abstract void unexecute();
}
Then, each type of command is implemented using a concrete derivation of AbstractCommand.
So I have a command to move an object :
public class MoveCommand extends AbstractCommand {
Moveable movingObject;
Point2D startPos;
Point2D endPos;
public MoveCommand(Point2D start) {
this.startPos = start;
}
public void execute() {
if(this.movingObject != null && this.endPos != null)
this.movingObject.moveTo(this.endPos);
}
public void unexecute() {
if(this.movingObject != null && this.startPos != null)
this.movingObject.moveTo(this.startPos);
}
public void setStart(Point2D start) {
this.startPos = start;
}
public void setEnd(Point2D end) {
this.endPos = end;
}
}
I also have an AddRemoveCommand (to... add or remove an object/node). If I use the ID or instanceof method, I don't have to pass the diagram to the actual node or link so that it can remove itself from the diagram (which is a bad idea, I think).
AbstractDiagram diagram;
Addable obj;
AddRemoveType type;
@SuppressWarnings("unused")
private AddRemoveCommand() {}
public AddRemoveCommand(AbstractDiagram diagram, Addable obj, AddRemoveType type) {
this.diagram = diagram;
this.obj = obj;
this.type = type;
}
public void execute() {
if(obj != null && diagram != null) {
switch(type) {
case ADD:
this.obj.addToDiagram(diagram);
break;
case REMOVE:
this.obj.removeFromDiagram(diagram);
break;
}
}
}
public void unexecute() {
if(obj != null && diagram != null) {
switch(type) {
case ADD:
this.obj.removeFromDiagram(diagram);
break;
case REMOVE:
this.obj.addToDiagram(diagram);
break;
}
}
}
Finally, I have a ModificationCommand which is used to modify the info of a node or link (class name, etc.). This may be merged with the MoveCommand in the future. This class is empty for now. I will probably do the ID thing with a mechanism to determine whether the modified object is a node or an edge (via instanceof or a special marker in the ID).
Is this a good solution?

I think you just need to decompose your problem into smaller ones.
First problem:
Q: How to represent the steps in your app with the memento/command pattern?
First off, I have no idea exactly how your app works, but hopefully you will see where I am going with this. Say I want to place a ClassNode on the diagram with the following properties:
{ width:100, height:50, position:(10,25), content:"Am I certain?", edge-connections:null}
That would be wrapped up as a command object. Say that goes to a DiagramController. Then the diagram controller's responsibility can be to record that command (push onto a stack would be my bet) and pass the command to a DiagramBuilder for example. The DiagramBuilder would actually be responsible for updating the display.
DiagramController
{
public DiagramController(diagramBuilder:DiagramBuilder)
{
this._diagramBuilder = diagramBuilder;
this._commandStack = new Stack();
}
public void Add(node:ConditionalNode)
{
this._commandStack.push(node);
this._diagramBuilder.Draw(node);
}
public void Undo()
{
var node = this._commandStack.pop();
this._diagramBuilder.Undraw(node);
}
}
Something like that should do it, and of course there will be plenty of details to sort out. By the way, the more properties your nodes have, the more detailed Undraw is going to have to be.
Using an id to link the command in your stack to the element drawn might be a good idea. That might look like this:
DiagramController
{
public DiagramController(diagramBuilder:DiagramBuilder)
{
this._diagramBuilder = diagramBuilder;
this._commandStack = new Stack();
}
public void Add(node:ConditionalNode)
{
string graphicalRefId = this._diagramBuilder.Draw(node);
var nodePair = new KeyValuePair<string, ConditionalNode> (graphicalRefId, node);
this._commandStack.push(nodePair);
}
public void Undo()
{
var nodePair = this._commandStack.pop();
this._diagramBuilder.Undraw(nodePair.Key);
}
}
At this point you don't strictly need the object, since you have the ID, but keeping it will help should you decide to implement redo functionality as well. Implementing a hashCode method would be a convenient way to generate IDs for your nodes, except that two nodes with identical properties would produce the same hash code, so uniqueness isn't guaranteed.
The next part of the problem is within your DiagramBuilder because you're trying to figure out how the heck to deal with these commands. For that all I can say is to really just ensure you can create an inverse action for each type of component you can add. To handle the delinking you can look at the edge-connection property (links in your code I think) and notify each of the edge-connections that they are to disconnect from the specific node. I would assume that on disconnection they could redraw themselves appropriately.
To kinda summarize, I would recommend not keeping a reference to your nodes in the stack but instead just a kind of token that represents a given node's state at that point. This will allow you to represent the same node in your undo stack at multiple places without it referring to the same object.
Post if you've got Q's. This is a complex issue.

In my humble opinion, you're making this more complicated than it really is. In order to revert to the previous state, a clone of the whole node is not required at all. Rather, each Command class will have:
reference to the node it is acting upon,
memento object (having state variables just enough for the node to revert to)
execute() method
undo() method.
Since command classes have reference to the node, we do not need ID mechanism to refer to objects in the diagram.
In the example from your question, we want to move a node to a new position. For that, we have a NodePositionChangeCommand class.
public class NodePositionChangeCommand {
// This command will act upon this node
private Node node;
// Old state is stored here
private NodePositionMemento previousNodePosition;
NodePositionChangeCommand(Node node) {
this.node = node;
}
public void execute(NodePositionMemento newPosition) {
// Save current state in memento object previousNodePosition
// Act upon this.node
}
public void undo() {
// Update this.node object with values from this.previousNodePosition
}
}
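One way the skeleton above could be filled in is sketched below, together with a small undo/redo history. This is an illustration under assumptions, not the answer's actual implementation: the names MoveUndoSketch, DiagramNode, PositionMemento, MoveNodeCommand and History are made up, positions are plain doubles, and the target position is passed to the constructor rather than to execute() so that redo can simply re-run the command.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class MoveUndoSketch {
    // Hypothetical stand-in for a diagram node that knows its position.
    static final class DiagramNode {
        double x, y;
        void moveTo(double nx, double ny) { x = nx; y = ny; }
    }

    // Memento: just enough state to restore a position.
    static final class PositionMemento {
        final double x, y;
        PositionMemento(double x, double y) { this.x = x; this.y = y; }
    }

    static final class MoveNodeCommand {
        private final DiagramNode node;         // the node this command acts upon
        private final PositionMemento target;   // where the node should go
        private PositionMemento previous;       // saved on execute, used by undo

        MoveNodeCommand(DiagramNode node, double x, double y) {
            this.node = node;
            this.target = new PositionMemento(x, y);
        }
        void execute() {
            previous = new PositionMemento(node.x, node.y);  // save the old state
            node.moveTo(target.x, target.y);
        }
        void undo() {
            if (previous != null) {
                node.moveTo(previous.x, previous.y);         // restore the old state
            }
        }
    }

    // Two stacks: executed commands, and undone commands available for redo.
    static final class History {
        private final Deque<MoveNodeCommand> undoStack = new ArrayDeque<>();
        private final Deque<MoveNodeCommand> redoStack = new ArrayDeque<>();

        void run(MoveNodeCommand cmd) {
            cmd.execute();
            undoStack.push(cmd);
            redoStack.clear();   // a new action invalidates the redo chain
        }
        boolean undo() {
            if (undoStack.isEmpty()) return false;
            MoveNodeCommand cmd = undoStack.pop();
            cmd.undo();
            redoStack.push(cmd);
            return true;
        }
        boolean redo() {
            if (redoStack.isEmpty()) return false;
            MoveNodeCommand cmd = redoStack.pop();
            cmd.execute();
            undoStack.push(cmd);
            return true;
        }
    }
}
```

Since the command holds a direct reference to the live node, the diagram's node list never has to be touched on undo or redo, which is the point made above about not needing an ID lookup.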
What about links? They should be movable too but I don't want to create a command just for links (and one just for nodes).
I read in the GoF book (in the Memento pattern discussion) that moving links when the positions of nodes change is handled by some kind of constraint solver.
