Find arbitrary patterns common to a group of strings - Java

Background:
I am developing a program that iterates over all the movies & TV series episodes stored on my computer, rates them (using Rotten Tomatoes) and sorts them in order of rating.
I extract the movie name by removing all the unnecessary text, such as '.avi', '720p' etc., from the file name.
I am using Java.
Problem:
Some folders contain movie files such as:
Episode 301 Rainforest Schmainforest.avi
Episode 302 Spontaneous Combustion.avi
The word 'Episode' and numbers are valid and common in movie titles, so I can't simply remove them. However, it is clear from the repetitive nature of the names that 'Episode' and '3XX' should be removed.
Another folder might contain:
720p.S5.E1.cripple fight.avi
720p.S5.E2.towelie.avi
Many arbitrary patterns like these exist in different groups of files, and I need something to recognise these patterns so I can extract the keywords. It would be infeasible to write a regex for each case.
Summary:
Is there a tool or API that I can use to find complex repetitive patterns (it must be able to match sequences of numbers)? [something like a longest common subsequence library]

Well, you could simply take all the filtered names in your dir and do a simple word count. You could give extra weight to words that occur in (roughly) the same spot every time.
In the end you'd have a count and a weight for each word, and you need to decide where to draw the line. It probably won't be every file in the dir (because of images or samples), but if most files share a certain word, it isn't "the" or something like that, and it always appears at the start or in the second spot, you can filter it out.
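A rough sketch of that idea in Java (the 70% threshold and the positional bonus are arbitrary values of mine, to be tuned on real folders):

    import java.util.*;

    // Count how often each word appears across the file names in a directory,
    // give extra weight to words that always sit in the same spot, and return
    // the words that look like boilerplate to strip.
    public class CommonWordFilter {
        public static Set<String> wordsToStrip(List<String> fileNames) {
            Map<String, Integer> counts = new HashMap<>();
            Map<String, Set<Integer>> positions = new HashMap<>();
            for (String name : fileNames) {
                String[] words = name.toLowerCase().split("[\\s._-]+");
                for (int i = 0; i < words.length; i++) {
                    counts.merge(words[i], 1, Integer::sum);
                    positions.computeIfAbsent(words[i], w -> new HashSet<>()).add(i);
                }
            }
            Set<String> strip = new HashSet<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double share = (double) e.getValue() / fileNames.size();
                // a word that always appears at the same index is extra suspicious
                if (positions.get(e.getKey()).size() == 1) share += 0.2;
                if (share >= 0.7) strip.add(e.getKey());
            }
            return strip;
        }
    }

For "Episode 301 Rainforest Schmainforest.avi" and "Episode 302 Spontaneous Combustion.avi" this flags "episode" and "avi"; the "301"/"302" part would still need the positional heuristic extended to digit patterns.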
But this wouldn't work for, random example, Friends episodes. They're all called "The One Where...", and that would be filtered out by every sane version of your sought-after algorithm.
The bottom line is: I don't think you can, because of the Friends-episode problem. There's just not enough distinction between wanted repetition and unwanted repetition.
The only thing you can do is keep a blacklist of stuff you want to filter, like you already seem to do with the avi / 720p thing.

I believe that what you are asking for is not trivial. Pattern extraction, as opposed to mere recognition, is well within the fields of artificial intelligence and knowledge discovery. I have encountered several related libraries for Java, but most need a lot of additional code to define even the simplest task.
Since this is a rather hot research area, you might want to perform a cursory search in Google Scholar, using appropriate keywords.
Disclaimer: before you use any library or algorithm found via the Internet, you should investigate its legal status. Unfortunately, algorithms developed in active research areas are often encumbered by patents and such...

I have a kind-of answer posted here
http://pastebin.com/Eb0cQyKd
I wanted to remove non-unique parts of file names, such as '720dpi', 'Episode', 'xvid', 'ac3', without specifying in advance what they would be. But I wanted to keep information like S01E01. I had created a huge blacklist, but it wasn't convenient because the list kept changing.
The code linked above uses Python (not Java) to remove all non-unique words in a file name.
Basically, it builds a list of all the words used in the file names, and any word that comes up in most of the files goes into a dictionary. It then iterates through the files and deletes all these dictionary words from them.
The script also does some cleaning: some movies use underscores ('_') or periods ('.') to separate words in the filenames. I convert all these to spaces.
I have used it a lot recently and it works well.
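The linked script is Python; a rough Java equivalent of the steps described above might look like this (the "most of the files" threshold - more than half - is my assumption, not taken from the script):

    import java.util.*;
    import java.util.stream.Collectors;

    // Normalise '_' and '.' separators to spaces, collect the words that occur
    // in more than half of the file names, then delete those words from each name.
    public class NonUniqueWordRemover {
        public static List<String> clean(List<String> fileNames) {
            List<List<String>> tokenised = fileNames.stream()
                    .map(n -> Arrays.asList(n.replaceAll("[._]", " ").trim().split("\\s+")))
                    .collect(Collectors.toList());

            Map<String, Long> fileCounts = new HashMap<>();
            for (List<String> words : tokenised) {
                new HashSet<>(words).forEach(w -> fileCounts.merge(w, 1L, Long::sum));
            }

            Set<String> nonUnique = fileCounts.entrySet().stream()
                    .filter(e -> e.getValue() > fileNames.size() / 2)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toSet());

            return tokenised.stream()
                    .map(words -> words.stream()
                            .filter(w -> !nonUnique.contains(w))
                            .collect(Collectors.joining(" ")))
                    .collect(Collectors.toList());
        }
    }

Unique tokens like S01E01 differ per file, so they survive the filter, matching the behaviour described above.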

Related

Solution for selecting an appropriate category for something based on the words within its description (Java)

I'm trying to better organise the types of tasks regularly sent to my team, based on the titles and short comments people enter.
Our team only handles a handful (maybe 10 or so) of different types of tasks, so I've put together a list of common words used within the description of a particular type of task, and I've been using this to categorise the issues. For example, an issue might come through like "User x doesn't have access to the office after hours, please update their swipecard access level". What I've got so far: if the comments contain 'swipecard' or 'access', it's a building-access type request.
I've quickly found myself with code that's LOTS of ... if contains, and if !contains...
Is there a neater way of doing what I'm after?
If you want to make it complex, it sounds like you have a classification problem.
If you want to keep it simple, you're probably on the right track with your if statements and contains(). To get to a cleaner solution, I would approach it as follows:
Create a class to model your categories - give it two attributes: String categoryName, List<String> commonlyUsedWords.
Populate a list with instances of that class - one per type.
For each issue, loop through the list of categories and check how many words match, and store that as a percentage (e.g. 8 out of 10 words match, therefore 80% match).
Return the category with the highest match rate.
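A minimal sketch of this approach (the category names and word lists below are invented examples; it uses a record, so Java 16+ is assumed):

    import java.util.List;

    public class Categoriser {
        record Category(String categoryName, List<String> commonlyUsedWords) {}

        static final List<Category> CATEGORIES = List.of(
                new Category("Building access", List.of("swipecard", "access", "door", "office")),
                new Category("Password reset", List.of("password", "login", "reset", "locked")));

        // Returns the category whose word list matches the issue text best,
        // or null if no word of any category matched.
        static Category bestMatch(String issueText) {
            String lower = issueText.toLowerCase();
            Category best = null;
            double bestRate = 0;
            for (Category c : CATEGORIES) {
                long matches = c.commonlyUsedWords().stream().filter(lower::contains).count();
                double rate = (double) matches / c.commonlyUsedWords().size();
                if (rate > bestRate) { bestRate = rate; best = c; }
            }
            return best;
        }
    }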

Questions about using a Radix Tree in Android for English dictionary word lookup in a 240k word list

App Overview
In this game you append a letter to a growing chain of letters, while each player tries not to form a word. After your opponent has chosen a letter to append to the chain, you have the option to claim that the chain is a word, which then needs to be checked against some data structure. I need to implement this data structure.
Requirements of Data-structure
I need a data structure that can quickly tell whether a word exists in a list of 240,000 words, for a game on an Android device.
You should be able to play up to 20 games easily.
It should work within an Android app.
A nice extra feature would be to quickly show all possible words that continue a given chain of letters, but this is not necessary.
What I tried
A Radix Tree seemed like a good idea for this. Now I might regret the time I put into it, since I think it would require too many objects: in my diagram, every black dot as well as every numbered circle would be represented as a node object in my code.
A radix tree would require at the bare minimum 240k (240,000) nodes, and thus objects; the path to each word-terminal node spells one word, which yields the 240k word list. Each game would be represented by storing only a reference to its current node in the tree, meaning that an extra game requires little extra storage.
I also thought I could implement it as a HashMap with all possible words in it, looping through the words to narrow them down after each letter. This is a more computational approach, where the Radix Tree would require fewer computations but, I assumed, a lot more storage.
[EDIT] That storage assumption of mine turned out to be wrong; see below.
Questions I have
Is a Radix Tree one of the best data structures for the requirements, given most Android devices in use today? (Answers/comments seem to indicate it is.)
How does it work in memory when you have so many objects? Are they all stored in RAM, or also on disk? I found that an app can use a total of 16 MB/25 MB/32 MB of RAM. Is it likely I will exceed 16 MB of RAM when putting 240,000 objects in RAM?
You can store and retrieve the large Radix Tree object at runtime from a file, right? One that is stored on disk in the res/raw folder.
Would having (let's say) 50 games open with a hash map, where every game has to use its own copy of the map to narrow down the possible words, even be possible? How much additional storage can an application claim after installation?
Based on the comments:
It seems my assumption that a Radix Tree would require more space was wrong.
A trie/prefix tree/radix tree seems like a perfectly valid data structure for this application. If the dictionary is fixed (that is, no words get added/deleted during play), you can save memory by compressing shared branches.
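A minimal trie sketch for the word-existence check (compressing chains of single-child nodes would turn this into a proper radix tree):

    import java.util.HashMap;
    import java.util.Map;

    public class Trie {
        private static final class Node {
            final Map<Character, Node> children = new HashMap<>();
            boolean isWord;
        }

        private final Node root = new Node();

        public void add(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.isWord = true;
        }

        public boolean contains(String word) {
            Node node = root;
            for (char c : word.toCharArray()) {
                node = node.children.get(c);
                if (node == null) return false;
            }
            return node.isWord;
        }
    }

As noted in the question, each open game then only needs to hold a reference to its current node, so extra games cost almost nothing.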

String analysis and classification

I am developing a financial manager in my free time, with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: money amount, date, comment and section (e.g. Car, Salary, Computer, Food, ...).
The sections are created "on the fly": when the user enters a new section, it is added to the section JComboBox for further selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.
So my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section?
My first thought was to do it with a neural network that learns from the input whenever the user selects a different section.
But my problem is that I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the Sourceforge page.
Forget about neural networks. This is a highly technical and specialised field of artificial intelligence, which is probably not suitable for your problem and requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your sections and parse for those synonyms. You can then collect comments online for synonym analysis, or parse comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to decide whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could get you an 80% satisfying feature, while aiming for 90% would cost a week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, you can always go back and improve it as needed.
    String comment = paramInput;
    // case-insensitive keyword check for the Fuel section
    if (comment.toUpperCase().contains("FUEL")) {
        // do the fuel functionality
    }
In a simple app, if you will only have some specific sections, you can take the string from the comment, check whether it contains certain keywords, and set the value of Section accordingly, as above.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top-matched category as a "best guess".
P.S. Neural network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement string matching, I wouldn't even know where to start.
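For the Lucene suggestion above, a rough sketch might look like this (assuming a recent Lucene version; the section names and keyword lists are invented):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class SectionGuesser {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();   // in-memory index
            StandardAnalyzer analyzer = new StandardAnalyzer();

            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                addSection(writer, "Car", "fuel petrol diesel service tyre garage");
                addSection(writer, "Food", "lunch dinner restaurant groceries");
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // escape the comment so stray query syntax doesn't break parsing
                String comment = QueryParser.escape("Lunch at Joe's");
                ScoreDoc[] hits = searcher.search(
                        new QueryParser("keywords", analyzer).parse(comment), 1).scoreDocs;
                if (hits.length > 0) {
                    System.out.println("Best guess: " + searcher.doc(hits[0].doc).get("name"));
                }
            }
        }

        private static void addSection(IndexWriter w, String name, String keywords) throws Exception {
            Document doc = new Document();
            doc.add(new StringField("name", name, Field.Store.YES));
            doc.add(new TextField("keywords", keywords, Field.Store.NO));
            w.addDocument(doc);
        }
    }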
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch"
a list of the most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so in theory a fit for a neural net, but why not go straight for the numerical best fit?
A machine learning algorithm such as an artificial neural network (ANN) doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'which of the provided pre-trained classes does the input represent?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input your users will provide. This is impossible; in practice, ANNs are re-trained when more data is available, to produce better results and a more accurate decision boundary / representation of the function that maps inputs to outputs. It also assumes that you know all possible classes before you start, and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There is no higher-level information that describes the word's meaning. This means that a different word which hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: in light of your comment that you don't know the Sections beforehand,
an easy solution to program would be to keep a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, lets you filter out non-keywords (and, or, the, ...). One option is then to find the Sections that the typed keywords belong to, suggest several of them, and let the user pick one; the feedback you get from user selections enables better suggestions in the future. Another is to calculate a Bayesian probability - the probability that a word belongs to Section X given the previously stored mappings - for all keywords and Sections, and either take the modal Section, or normalise over each unique keyword and take the mean. The probabilities will of course need to be updated as you gather more information; perhaps this could be done with every new addition, in a background thread.
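A small sketch of this keyword-counting variant (the stop-word list and class names are placeholders; a full Bayesian treatment would also weigh priors):

    import java.util.*;

    public class SectionSuggester {
        private static final Set<String> STOP_WORDS = Set.of("and", "or", "the", "at", "a", "for");
        // keyword -> (section -> count), learned from previously stored entries
        private final Map<String, Map<String, Integer>> counts = new HashMap<>();

        public void learn(String comment, String section) {
            for (String word : tokenize(comment)) {
                counts.computeIfAbsent(word, w -> new HashMap<>())
                      .merge(section, 1, Integer::sum);
            }
        }

        // Sum P(section | word) over the comment's keywords and take the maximum.
        public Optional<String> suggest(String comment) {
            Map<String, Double> score = new HashMap<>();
            for (String word : tokenize(comment)) {
                Map<String, Integer> perSection = counts.get(word);
                if (perSection == null) continue;
                int total = perSection.values().stream().mapToInt(Integer::intValue).sum();
                perSection.forEach((section, n) ->
                        score.merge(section, (double) n / total, Double::sum));
            }
            return score.entrySet().stream()
                    .max(Map.Entry.comparingByValue())
                    .map(Map.Entry::getKey);
        }

        private static List<String> tokenize(String text) {
            List<String> words = new ArrayList<>();
            for (String w : text.toLowerCase().split("\\W+")) {
                if (!w.isEmpty() && !STOP_WORDS.contains(w)) words.add(w);
            }
            return words;
        }
    }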

String Manipulation Patterns

Just wondering if there are a set of design patterns for complex string manipulation?
Basically, the problem I am trying to solve is that I need to be able to read in a string like the following:
"[name_of_kicker] looks to make a clearance kick, but is under some real pressure from the [name_of_defending_team] players. He gets a [length_of_kick] kick away, but it drifts into touch on the full."
or
"[name_of_kicker] receives the ball from [name_of_passer] and launches the bomb. [name_of_kicker] has really made good contact, it's given a couple of [name_of_attacking_team] chasers ample time to get under the ball as it comes down."
And replace each "tag" with a possible value, then check whether the resulting string is equal to another string.
So, for example, any tag that represents a player needs to be replaceable with any one of 22 string values representing a player. But I also need to make sure I have looped through each combination of players for the various tags that may appear in a string. NOTE: the tags listed in the two samples above are not the only possible ones; there are countless others that could come up in any sentence.
I had tried to create a load of nested for loops to go through the collection of players, etc. and attempt to replace the tags each time, but with so many possible tags I was just creating one nested for loop within another, and it became unmanageable - and, I suspect, inefficient, since I need to loop through over 1,000 base strings like the samples above and replace different tags with players, etc. for each one...
So are there any string manipulation patterns I could look into, or does anyone have a possible solution to a problem like this?
Firstly, to answer your question.
Just wondering if there are a set of design patterns for complex string manipulation?
Not really. There are some techniques, but they hardly qualify as design patterns. The two techniques that spring to mind are template expansion and pattern matching.
What you are currently doing / proposing to do is a form of template expansion. However, typical templating engines don't support the combinatorial expansion that you are trying to do, and as you anticipate, it would appear to be an inefficient way to solve your problem.
A better technique would appear to be pattern matching. Let's take your first example, and turn it into a pattern:
"(Ronaldino|Maradonna|Peter Shilton|Jackie Charlton) looks to make a clearance kick, but is under some real pressure from the (Everton|Real Madrid|Adelaide United) players. He gets a ([0-9]+ metre) kick away, but it drifts into touch on the full."
What I've done is insert all of the possible alternatives into the pseudo-template, to turn it into a regex. I can now compile this regex to a java.util.regex.Pattern, and use it to match against your list of other strings.
Having said that, if you are trying to do this to "analyse" text, I don't rate your chances of success. I think you would be better off going down the NLP route.
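To make the pattern-matching technique concrete, here is a minimal sketch (the player/team lists are invented; a real version would build the alternations from your 22-player squads):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TemplateMatcher {
        public static void main(String[] args) {
            String template = "[name_of_kicker] looks to make a clearance kick, "
                    + "but is under some real pressure from the [name_of_defending_team] players. "
                    + "He gets a [length_of_kick] kick away, but it drifts into touch on the full.";

            // quote the literal text, then break out of the quoting for each tag
            String regex = Pattern.quote(template)
                    .replace("[name_of_kicker]", "\\E(Ronaldino|Maradonna|Peter Shilton)\\Q")
                    .replace("[name_of_defending_team]", "\\E(Everton|Real Madrid)\\Q")
                    .replace("[length_of_kick]", "\\E([0-9]+ metre)\\Q");

            Matcher m = Pattern.compile(regex).matcher(
                    "Maradonna looks to make a clearance kick, "
                    + "but is under some real pressure from the Everton players. "
                    + "He gets a 40 metre kick away, but it drifts into touch on the full.");
            if (m.matches()) {
                System.out.println("kicker=" + m.group(1) + ", team=" + m.group(2)
                        + ", kick=" + m.group(3));
            }
        }
    }

The capturing groups give you the values that were filled in, so there is no need to enumerate every combination up front.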
What you're describing looks a bit like what template engines are used for.
Two popular ones for Java are:
FreeMarker
StringTemplate
But there are many, many more, of course.
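For example, a minimal FreeMarker sketch (assuming the tags are rewritten to FreeMarker's ${...} syntax; the version constant and names here are illustrative):

    import freemarker.template.Configuration;
    import freemarker.template.Template;
    import java.io.StringReader;
    import java.io.StringWriter;
    import java.util.HashMap;
    import java.util.Map;

    public class TemplateDemo {
        public static void main(String[] args) throws Exception {
            Configuration cfg = new Configuration(Configuration.VERSION_2_3_31);
            String text = "${nameOfKicker} looks to make a clearance kick, but is under "
                    + "some real pressure from the ${nameOfDefendingTeam} players.";
            Template template = new Template("commentary", new StringReader(text), cfg);

            // fill one combination of values into the template
            Map<String, Object> model = new HashMap<>();
            model.put("nameOfKicker", "Peter Shilton");
            model.put("nameOfDefendingTeam", "Everton");

            StringWriter out = new StringWriter();
            template.process(model, out);
            System.out.println(out);
        }
    }

Note that this only expands one combination at a time; as the answer above points out, the combinatorial part is still on you.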
My two cents: as you stated, "I was just creating one nested for loop within another, and it has become unmanageable".
You are looking in the wrong direction, my friend; there is a whole universe of solutions to the problem you are facing, simply known as rule engines.
There are various types of rule engines (business rule engines, web template engines, etc.), but for the above requirement I suggest business rule engines.
I can't comment on which one to use, as it depends upon:
Multi-threading.
Open source/commercial.
Load factor/processing time, etc.
Hope it helps.
http://ratakondas.blogspot.in/2012/06/business-rules-engines-white-paper.html
[read the summary section; it gives the best advice]
https://en.wikipedia.org/wiki/Business_rules_engine#Types_of_rule_engines
https://en.wikipedia.org/wiki/Comparison_of_web_template_engines
Welcome to the world of rule engines :)

What is a fast and unsupervised way of checking the quality of PDF-extracted text?

I am working on a somewhat large corpus, with articles numbering in the tens of thousands. I am currently using PDFBox for extraction, with varying success, and I am looking for a way to programmatically check each file to see whether the extraction was moderately successful. I'm currently thinking of running a spellchecker on each of them, but the language can differ; I am not yet sure which languages I'm dealing with. Natural-language detection with scores may also be an idea.
Oh, and any method also has to play nice with Java, be fast, and be relatively quick to integrate.
Try an automatically learning spell checker. That's not as scary as it sounds: start with a big dictionary containing all the words you're likely to encounter; it can cover several languages.
When scanning a PDF, allow for a certain percentage of unknown words (say 5%). If any of these words is repeated often enough (say 5 times), add it to the dictionary. If the PDF contains more than 5% unknown words, it's very likely something that couldn't be processed.
The scanner will learn over time, allowing you to reduce the allowance for unknown words if that should become necessary. If that is too much hassle, a very big dictionary should work well too.
If you don't have a dictionary, manually process a couple of documents and have the scanner learn from them. After a dozen files or so, your new dictionary should be large enough to set a reasonable water level.
Of course no method will be perfect.
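A sketch of that self-learning check (the 5%/5-times thresholds are the ones suggested above; the dictionary file path is hypothetical):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.*;

    public class ExtractionChecker {
        private final Set<String> dictionary = new HashSet<>();

        public ExtractionChecker(Path dictionaryFile) throws IOException {
            Files.readAllLines(dictionaryFile).forEach(w -> dictionary.add(w.toLowerCase()));
        }

        // Returns true if at most 5% of the words are unknown; unknown words seen
        // at least 5 times are treated as real and learned into the dictionary.
        public boolean looksSane(String extractedText) {
            Map<String, Integer> unknown = new HashMap<>();
            int total = 0;
            for (String word : extractedText.toLowerCase().split("\\P{L}+")) {
                if (word.isEmpty()) continue;
                total++;
                if (!dictionary.contains(word)) unknown.merge(word, 1, Integer::sum);
            }
            int unknownCount = 0;
            for (Map.Entry<String, Integer> e : unknown.entrySet()) {
                if (e.getValue() >= 5) dictionary.add(e.getKey());   // learn it
                else unknownCount += e.getValue();
            }
            return total > 0 && unknownCount <= total * 0.05;
        }
    }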
There are usually two classes of text-extraction problems:
1 - Nothing gets extracted.
This can be because you have a scanned document or something is invalid in the PDF.
It is usually easy to detect; you should not need complicated code to check for it.
2 - You get garbage.
Most of the time this is because the PDF file is weirdly encoded.
This can be due to a homemade encoding that is not properly declared, or because the PDF author needed characters not recognised by PDF (for example, the Turkish S with cedilla was missing for some time from the Adobe Glyph List: you could not create a correctly encoded file with it inside, so you had to cheat to get it to show up visually on the page).
I use an n-gram-based method to detect the language of PDF files based on the extracted text (with different technologies, but the idea is the same). Files where the language is not recognised are usually good suspects for a problem...
As for spellchecking, I suspect it will give you tons of false positives, especially if you have multiple languages!
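A toy sketch of the n-gram idea, using character trigram profiles (real profiles need far more training text than shown here):

    import java.util.*;

    public class NgramLanguageGuesser {
        private final Map<String, Map<String, Integer>> profiles = new HashMap<>();

        public void train(String language, String sampleText) {
            profiles.put(language, trigrams(sampleText));
        }

        // Returns the language whose trigram profile overlaps the text best;
        // empty means no overlap at all, i.e. a good suspect for garbage.
        public Optional<String> guess(String text) {
            Map<String, Integer> target = trigrams(text);
            return profiles.entrySet().stream()
                    .max(Comparator.comparingInt(e -> overlap(e.getValue(), target)))
                    .filter(e -> overlap(e.getValue(), target) > 0)
                    .map(Map.Entry::getKey);
        }

        private static int overlap(Map<String, Integer> a, Map<String, Integer> b) {
            int score = 0;
            for (Map.Entry<String, Integer> e : a.entrySet()) {
                Integer other = b.get(e.getKey());
                if (other != null) score += Math.min(e.getValue(), other);
            }
            return score;
        }

        private static Map<String, Integer> trigrams(String text) {
            Map<String, Integer> grams = new HashMap<>();
            String clean = text.toLowerCase().replaceAll("\\P{L}+", " ");
            for (int i = 0; i + 3 <= clean.length(); i++) {
                grams.merge(clean.substring(i, i + 3), 1, Integer::sum);
            }
            return grams;
        }
    }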
You could just run the corpus against a list of stop words (the most frequent words, which search engines ignore, like "and" and "the"), but then you obviously need stop-word lists for all possible/probable languages first.
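As a quick sketch (tiny sample lists; the hit threshold is an arbitrary choice):

    import java.util.List;

    public class StopWordCheck {
        private static final List<List<String>> STOP_WORD_LISTS = List.of(
                List.of("the", "and", "of", "to", "in"),      // English
                List.of("der", "die", "das", "und", "ist"),   // German
                List.of("le", "la", "et", "les", "des"));     // French

        // Well-extracted text in a covered language should contain stop words.
        public static boolean looksLikeText(String extracted) {
            String lower = " " + extracted.toLowerCase().replaceAll("\\s+", " ") + " ";
            for (List<String> list : STOP_WORD_LISTS) {
                long hits = list.stream().filter(w -> lower.contains(" " + w + " ")).count();
                if (hits >= 3) return true;
            }
            return false;
        }
    }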
