I am getting a StackOverflowError when validating a CSV with a large string field.
Regex:
(?![^\",][^,]*\")(\"(\"\"|[^\"])*\"|[^\",]*),[0-9]*
TargetString:
"The Nuvi 1450LMT is a portable global positioning system receiver from Garmin that offers a step-up from the company's standard 1450 and 1450T models. Including free lifetime map and traffic updates, this model can be updated once every three months to ensure to the most-up-to-date location information. A built-in FM signal transmitter can provide up-to-the-minute traffic information concerning accidents, construction and other forms of road blockage, providing users with sufficient time to select an alternate route. A back-lighted 5-inch touchscreen TFT display is included that provides clear visual instruction, complete with "Lane Assist" technology that provides virtual first-person instruction on precisely what lanes to use. Comprehensive "City Navigator" maps are included for Canada, the US and Mexico, with two and three-dimensional support and over 6 million user-selected points of interest. Pedestrian navigation is also fully supported on the 1450LMT, with the "CityXplorer" service offering bus, rail, tram and other public transportation information for a wide variety of major cities. Fuel-effecient routes can be determined with the "EcoRoute" mode, while "HotFix" predictive satellite technology helps to maintain the most accurate locational information even when signal is temporarily lost. Photo navigation is supported through Garmin's "Photo Connect" service, and additional car marker and narration voices can be downloaded via the "Garmin Garage" website. Features 5-inch backlit TFT color touchscreen Free lifetime traffic updates Free maps MicroSD card support Voice prompts Lane assist function Auto Re-route Route avoidance FM traffic compatibility EcoRoute routing Custom Points Of Interest Garmin garage car marker and voice customization",9
Can someone help me optimize it?
Can it be optimized using possessive quantifiers?
I think the best advice would be to not try to use regexes to parse CSV files. However you formulate the regex, there is the possibility of an unbounded number of branch points ... and hence a stack overflow for pathological input strings.
A better approach is to select and use a decent CSV library for Java. Check the answers to this Question:
Can you recommend a Java library for reading (and possibly writing) CSV files?
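For illustration, here is a minimal sketch using OpenCSV, one of the libraries usually suggested in the answers there; the file name is made up, and the exact API varies a little between versions:

import com.opencsv.CSVReader;
import java.io.FileReader;

public class CsvDemo {
    public static void main(String[] args) throws Exception {
        // The parser handles quoted fields, embedded commas and escaped
        // quotes itself, so no hand-rolled regex is needed.
        try (CSVReader reader = new CSVReader(new FileReader("products.csv"))) {
            String[] fields;
            while ((fields = reader.readNext()) != null) {
                System.out.println(fields[0] + " -> " + fields[1]);
            }
        }
    }
}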
You can make that error go away by adding a few plus signs:
"(?![^\",][^,]*\")(\"(\"\"|[^\"]+)*\"|[^\",]+),[0-9]+"
^ ^ ^
Note that those are just regular plus signs, not possessive modifiers. The second and third plus signs replaced asterisks, but it's the first one that makes the real difference. That [^\"]+ is what consumes most of the text, and it was doing so one character at a time before I added that plus sign.
But it still won't match, it will just fail more quickly. That regex is for matching CSV fields with properly escaped quotes, and if I understand you correctly, your problem is that they're not escaped. That's a much more challenging problem, but I wonder if you really need to deal with those inner quotes at all. Won't this work?
".*?",\d+
...or as a Java string literal:
"\".*?\",\\d+"
Or are you trying to correct the string by escaping the quotes yourself?
I would like to know how practical it would be to create a program which takes handwritten characters in some form, analyzes them, and offers corrections to the user. The inspiration for this idea is to have elementary school students in other countries or University students in America learn how to write in languages such as Japanese or Chinese where there are a lot of characters and even the slightest mistake can make a big difference.
I am unsure how the program will analyze the character. My current idea is to get a single pixel width line to represent the stroke, compare how far each pixel is from the corresponding pixel in the example character loaded from a database, and output which area needs the most work. Endpoints will also be useful to know. I would also like to tell the user if their character could be interpreted as another character similar to the one they wanted to write.
I imagine I will need a library of some sort to complete this project in any sort of timely manner, but I have been unable to locate one which meets the standards I will need for the program. I looked into OpenCV, but it appears to be aimed more at computer vision than image processing. I would also prefer the library/module to be in Python or Java, but I can learn a new language if absolutely necessary.
Thank you for any help in this project.
Character Recognition is usually implemented using Artificial Neural Networks (ANNs). It is not a straightforward task to implement seeing that there are usually lots of ways in which different people write the same character.
The good thing about neural networks is that they can be trained. So, to change from one language to another all you need to change are the weights between the neurons, and leave your network intact. Neural networks are also able to generalize to a certain extent, so they are usually able to cope with minor variances of the same letter.
Tesseract is an open source OCR which was developed in the mid 90's. You might want to read about it to gain some pointers.
You can follow company links from this Wikipedia article:
http://en.wikipedia.org/wiki/Intelligent_character_recognition
I would not recommend that you attempt to implement a solution yourself, especially if you want to complete the task in less than a year or two of full-time work. It would be unfortunate if an incomplete solution provided poor guidance for students.
A word of caution: some companies that offer commercial ICR libraries may not wish to support you and/or may not provide a quote. That's their right. However, if you do not feel comfortable working with a particular vendor, either ask for a different sales contact and/or try a different vendor first.
My current idea is to get a single pixel width line to represent the stroke, compare how far each pixel is from the corresponding pixel in the example character loaded from a database, and output which area needs the most work.
The initial step of getting a stroke representation only a single pixel wide is much more difficult than you might guess. Although there are simple algorithms (e.g. Stentiford and Zhang-Suen) to perform thinning, stroke crossings and rough edges present serious problems. This is a classic (and unsolved) problem. Thinning works much of the time, but when it fails, it can fail miserably.
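To give a sense of what is involved, here is a minimal sketch of the Zhang-Suen pass over a binary image (foreground pixels are true); even this textbook version shows the fragility around crossings and rough edges described above:

public class Thinning {
    // Repeat both sub-iterations until the image stops changing.
    static void thin(boolean[][] img) {
        boolean changed = true;
        while (changed) {
            changed = subIteration(img, true);
            changed |= subIteration(img, false);
        }
    }

    static boolean subIteration(boolean[][] img, boolean firstPass) {
        int h = img.length, w = img[0].length;
        java.util.List<int[]> toClear = new java.util.ArrayList<>();
        for (int i = 1; i < h - 1; i++) {
            for (int j = 1; j < w - 1; j++) {
                if (!img[i][j]) continue;
                // Neighbors P2..P9, clockwise starting from the pixel above.
                boolean[] p = {
                    img[i - 1][j], img[i - 1][j + 1], img[i][j + 1], img[i + 1][j + 1],
                    img[i + 1][j], img[i + 1][j - 1], img[i][j - 1], img[i - 1][j - 1]
                };
                int black = 0, transitions = 0;
                for (int k = 0; k < 8; k++) {
                    if (p[k]) black++;
                    if (!p[k] && p[(k + 1) % 8]) transitions++;  // 0 -> 1 transitions
                }
                boolean cond = black >= 2 && black <= 6 && transitions == 1;
                if (firstPass) {
                    cond &= !(p[0] && p[2] && p[4]);  // P2 * P4 * P6 == 0
                    cond &= !(p[2] && p[4] && p[6]);  // P4 * P6 * P8 == 0
                } else {
                    cond &= !(p[0] && p[2] && p[6]);  // P2 * P4 * P8 == 0
                    cond &= !(p[0] && p[4] && p[6]);  // P2 * P6 * P8 == 0
                }
                if (cond) toClear.add(new int[]{i, j});
            }
        }
        for (int[] c : toClear) img[c[0]][c[1]] = false;
        return !toClear.isEmpty();
    }
}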
You could work with an open source library, and although that will help you learn algorithms and their uses, to develop a good solution you will almost certainly need to dig into the algorithms themselves and understand how they work. That requires quite a bit of study.
Here are some books that are useful as introductory textbooks:
Digital Image Processing by Gonzalez and Woods
Character Recognition Systems by Cheriet, Kharma, Siu, and Suen
Reading in the Brain by Stanislas Dehaene
Gonzalez and Woods is a standard textbook in image processing. Without some background knowledge of image processing it will be difficult for you to make progress.
The book by Cheriet, et al., touches on the state of the art in optical character recognition (OCR) and also covers handwriting recognition. The sooner you read this book, the sooner you can learn about techniques that have already been attempted.
The Dehaene book is a readable presentation of the mental processes involved in human reading, and could inspire development of interesting new algorithms.
Have you seen http://www.skritter.com? They do this in combination with spaced repetition scheduling.
I guess you want to classify features such as curves in your strokes (http://en.wikipedia.org/wiki/CJK_strokes), then as a next layer identify components, then estimate the most likely character, statistically weighting the candidates all the while. Where there are two likely matches, you will want to show both as likely to be confused. You will also need to create a database of probably 3,000 to 5,000 characters, or up to 10,000 for the ambitious.
See also http://www.tegaki.org/ for an open source program to do this.
I am developing a financial manager in my free time with Java and a Swing GUI. When the user adds a new entry, he is prompted to fill in: money amount, date, comment and Section (e.g. Car, Salary, Computer, Food, ...).
The Sections are created "on the fly". When the user enters a new Section, it is added to the Section JComboBox for further selection. The other point is that the comments could be in different languages, so a list of hard-coded words and synonyms would be enormous.
So, my question is: is it possible to analyse the comment (e.g. "Fuel", "Car service", "Lunch at **") and preselect a fitting Section?
My first thought was to do it with a neural network and have it learn from the input when the user selects another Section.
But my problem is, I don't know how to start at all. I tried "encog" with Eclipse and did some tutorials (XOR, ...), but all of them only use doubles as input/output.
Could anyone give me a hint on how to start, or any other possible solution for this?
Here is a runnable JAR (current development state, requires Java 7) and the SourceForge page.
Forget about neural networks. This is a highly technical and specialized field of artificial intelligence, which is probably not suitable for your problem and requires solid expertise. Besides, there are a lot of simpler and better solutions for your problem.
First obvious solution: build a list of words and synonyms for all your Sections and parse the comments for these synonyms. You can then collect comments online for synonym analysis, or parse the comments/sections provided by your users to statistically detect relations between words, etc.
There is an infinite number of possible solutions, ranging from the simplest to the most overkill. Now you need to define whether this feature of your system is critical (prefilling? probably not, then)... and what any development effort will bring you. One hour of work could bring you an 80%-satisfying feature, while aiming for 90% would cost one week of work. Is it really worth it?
Go for the simplest solution and tackle the real challenge of any dev project: delivering. Once your app is delivered, then you can always go back and improve as needed.
String myString = paramInput.toUpperCase(); // normalize case so "Fuel" also matches
if (myString.contains("FUEL")) {
    // do the fuel functionality
}
In a simple app, if you will only have a few specific sections, you can take the string from the comment, check whether it contains certain keywords, and set the value of Section accordingly.
If you have a lot of categories, I would use something like Apache Lucene, where you could index all the categories with their names and potential keywords/phrases that might appear in a user's description. Then you could simply run the description through Lucene and use the top-matched category as a "best guess".
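Roughly, that could look like the sketch below, assuming a recent Lucene version; the section names, field names and keyword lists are all invented for illustration:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SectionGuesser {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            addSection(writer, "Car", "fuel gas petrol service repair tires oil");
            addSection(writer, "Food", "lunch dinner restaurant groceries pizza");
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("keywords", analyzer);
            ScoreDoc[] hits = searcher.search(
                parser.parse(QueryParser.escape("Lunch at Joe's")), 1).scoreDocs;
            if (hits.length > 0) {
                // The top-scoring section is the "best guess".
                System.out.println(searcher.doc(hits[0].doc).get("section"));
            }
        }
    }

    private static void addSection(IndexWriter w, String name, String keywords)
            throws Exception {
        Document doc = new Document();
        doc.add(new StringField("section", name, Field.Store.YES));
        doc.add(new TextField("keywords", keywords, Field.Store.NO));
        w.addDocument(doc);
    }
}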
P.S. Neural network inputs and outputs will always be doubles or floats with a value between 0 and 1. As for how to implement string matching with one, I wouldn't even know where to start.
It seems to me that the following will do:
hard word statistics
maybe a stemming class (English/Spanish) which reduces a word like "lunches" to "lunch".
a list of most frequent non-words (the, at, a, for, ...)
The best fit is a linear problem, so theoretically a fit for a neural net, but why not go straight for the numerical best fit?
A machine learning algorithm such as an Artificial Neural Network doesn't seem like the best solution here. ANNs can be used for multi-class classification (i.e. 'which of the provided pre-trained classes does the input belong to?', not just 'does the input represent an X?'), which fits your use case. The problem is that they are supervised learning methods, and as such you need to provide a list of pairs of keywords and classes (Sections) that spans every possible input your users will provide. This is impossible, and in practice ANNs are re-trained when more data is available to produce better results and create a more accurate decision boundary / representation of the function that maps inputs to outputs. This also assumes that you know all possible classes before you start, and that each of those classes has training input values that you provide.
The issue is that the input to your ANN (a list of characters or a numerical hash of the string) provides no context by which to classify. There's no higher level information provided that describes the word's meaning. This means that a different word that hashes to a numerically close value can be misclassified if there was insufficient training data.
(As maclema said, the output from an ANN will always be floats with each value representing proximity to a class - or a class with a level of uncertainty.)
A better solution would be to employ some kind of word-relation or synonym graph. A Bag of words model might be useful here.
Edit: In light of your comment that you don't know the Sections beforehand:
An easy solution to program would be to provide a list of keywords in a file that gets updated as people use the program. Simply storing a mapping of provided comments -> Sections, which you will already have in your database, would allow you to filter out non-keywords (and, or, the, ...). One option is then to find the list of Sections each typed keyword belongs to, suggest several Sections, and let the user pick one. The feedback you get from user selections would enable improvements of the suggestions in the future. Another would be to calculate a Bayesian probability - the probability that this word belongs to Section X given the previously stored mappings - for all keywords and Sections, and either take the modal Section or normalise over each unique keyword and take the mean. Calculations of probabilities will need to be updated as you gather more information, of course; perhaps this could be done with every new addition in a background thread.
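A minimal sketch of that idea in plain Java follows; the stop-word list and tokenization are deliberately simplistic, and the scoring just sums the P(Section | word) estimates:

import java.util.*;

public class SectionSuggester {
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("and", "or", "the", "at", "a", "for"));

    // counts.get(word).get(section) = how often this word appeared in a
    // comment that the user filed under that Section.
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    public void learn(String comment, String section) {
        for (String word : tokenize(comment)) {
            counts.computeIfAbsent(word, w -> new HashMap<>())
                  .merge(section, 1, Integer::sum);
        }
    }

    public Optional<String> suggest(String comment) {
        Map<String, Double> scores = new HashMap<>();
        for (String word : tokenize(comment)) {
            Map<String, Integer> perSection = counts.get(word);
            if (perSection == null) continue;
            int total = perSection.values().stream().mapToInt(Integer::intValue).sum();
            // Accumulate P(section | word) over all sections seen for this word.
            perSection.forEach((section, n) ->
                scores.merge(section, (double) n / total, Double::sum));
        }
        return scores.entrySet().stream()
                     .max(Map.Entry.comparingByValue())
                     .map(Map.Entry::getKey);
    }

    private List<String> tokenize(String comment) {
        List<String> words = new ArrayList<>();
        for (String w : comment.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!w.isEmpty() && !STOP_WORDS.contains(w)) words.add(w);
        }
        return words;
    }
}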
I’m thinking of adding a feature to the TalkingPuffin Twitter client, where, after some training with the user, it can rank incoming tweets according to their predicted value. What solutions are there for the Java virtual machine (Scala or Java preferred) to do this sort of thing?
This is a classification problem, where you essentially want to learn a function y(x) which predicts whether 'x', an unlabeled tweet, belongs in the class 'valuable' or in the class 'not valuable'.
The trickiest bits here are not the algorithm (Naive Bayes is just counting and multiplying and is easy to code!) but:
Gathering the training data
Defining the optimal feature set
For the first, I suggest you track tweets that the user favorites, replies to, and retweets; for the second, look at qualities like who wrote the tweet, the words in it, and whether it contains a link.
Doing this well is not easy. Google would love to be able to do such things ("What links will the user value"), as would Netflix ("What movies will they value") and many others. In fact, you'd probably do well to read through the notes about the winning entry for the Netflix Prize.
Then you need to extract a bunch of features, as #hmason says. And then you need an appropriate machine learning algorithm; you either need a function approximator (where you try to use your features to predict a value between, say, 0 and 1, where 1 is "best tweet ever" and 0 is "omg who cares") or a classifier (where you use your features to try to predict whether it's a "good" or "bad" tweet).
If you go for the latter--which makes user-training easy, since they just have to score tweets with "like" (to mix social network metaphors)--then you typically do best with support vector machines, for which there exists a fairly comprehensive Java library.
In the former case, there are a variety of techniques that might be worth trying; if you decide to use the LIBSVM library, they have variants for regression (i.e. parameter estimation) as well.
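To give a flavour of the LIBSVM route, here is a rough sketch using its Java port; the three features (has link, follows author, mention count) and the toy training data are invented placeholders:

import libsvm.*;

public class TweetSvm {
    public static void main(String[] args) {
        double[][] features = {
            {1, 1, 0},   // tweets the user liked...
            {1, 0, 2},
            {0, 0, 0},   // ...and tweets the user ignored
            {0, 1, 5}
        };
        double[] labels = {1, 1, 0, 0};

        svm_problem prob = new svm_problem();
        prob.l = features.length;
        prob.y = labels;
        prob.x = new svm_node[prob.l][];
        for (int i = 0; i < prob.l; i++) {
            prob.x[i] = toNodes(features[i]);
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.C_SVC;
        param.kernel_type = svm_parameter.RBF;
        param.C = 1;
        param.gamma = 0.5;
        param.cache_size = 100;
        param.eps = 1e-3;

        svm_model model = svm.svm_train(prob, param);
        double prediction = svm.svm_predict(model, toNodes(new double[]{1, 1, 1}));
        System.out.println(prediction == 1 ? "probably good" : "probably bad");
    }

    private static svm_node[] toNodes(double[] values) {
        svm_node[] nodes = new svm_node[values.length];
        for (int i = 0; i < values.length; i++) {
            nodes[i] = new svm_node();
            nodes[i].index = i + 1;   // LIBSVM feature indices are 1-based
            nodes[i].value = values[i];
        }
        return nodes;
    }
}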
Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check for the presence of anything in the list? Should I try to use a regex to capture a bunch of variations on a single root?
If it makes it easier: I don't want to filter anything out, just to get a count. So if there are some false positives, it's not the end of the world, as long as the rate is exaggerated more or less uniformly.
A huge list, and think of the target audience. Is there a 3rd-party service that specialises in this that you could use rather than rolling your own?
Some quick thoughts:
The Scunthorpe problem (and follow the links to "Swear filter" for more)
British or American English? fanny, fag, etc.
Political correctness: "black" or "Afro-American"?
Edit:
Be very careful here. Normal words can offend, whether by choice or ignorance.
Is the phrase I want to stick my long-necked Giraffe up your fluffy white bunny obscene?
I'd make a huge list.
Regexes have a tendency to misfire when applied to natural language, especially given the number of exceptions English has.
Note that any NLP logic like this will be subject to attacks of "character replacement":
For example, I can write "hello" as "he11o", replacing L's with One's. Same with obscenities. So while there's no perfect answer, a "blacklist" approach of "bad words" might work. Watch out for false positives (I'd run my blacklist against a large book to see what comes up)
One problem with filters of this kind is their tendency to flag entirely proper English town names like Scunthorpe. While that can be reduced by checking the whole word rather than parts, you then find people taking advantage by merging their offensive words with adjacent text.
It depends what your text source is, but I'd go for some kind of established and proven pattern matching algorithm, using a Trie for example.
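A whole-word lookup against a trie might look like this sketch (the word list is a stand-in for a real one):

import java.util.HashMap;
import java.util.Map;

public class BlacklistTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;
    }

    private final Node root = new Node();

    public void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.terminal = true;
    }

    public boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.terminal;
    }

    public static void main(String[] args) {
        BlacklistTrie trie = new BlacklistTrie();
        trie.add("badword");
        int count = 0;
        // Whole-word matching avoids the Scunthorpe problem.
        for (String token : "Scunthorpe found one badword".toLowerCase().split("\\W+")) {
            if (trie.contains(token)) count++;
        }
        System.out.println(count);  // prints: 1
    }
}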
Use the morphy lemmatizer built into WordNet, and then determine whether the lemma is an obscenity. This will solve the problem of different verb forms, plurals, etc...
I would advocate a large list of simple regexes. Smaller than a list of the variants, but not trying to capture anything more than letter alternatives in any given expression: like "f[u#$%^&*._-]ck" (the dash goes last in the character class so it stays literal).
You want to use Bayesian Analysis to solve this problem. Bayesian probability is a powerful technique used by spam filters to detect spam/phishing messages in your email inbox. You can train your analysis engine so that it can improve over time. The ability to detect a legitimate email vs. a spam email sounds identical to the problem you are experiencing.
Here are a couple of useful links:
A Plan For Spam - The first proposal to use Bayesian analysis to combat spam.
Data Mining (ppt) - This was written by a colleague of mine.
Classifier4J - A text classifier library written in Java (they exist for every language, but you tagged this question with Java).
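The heart of the combination step Graham describes in "A Plan For Spam" is only a few lines; here each entry of wordProbs is your per-word estimate that the word marks the text as obscene (getting good estimates from training counts is the real work):

// Fold per-word probabilities into one score for the whole text.
static double combinedProbability(double[] wordProbs) {
    double prod = 1.0, invProd = 1.0;
    for (double p : wordProbs) {
        prod *= p;
        invProd *= (1.0 - p);
    }
    return prod / (prod + invProd);
}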
There are webservices that do this kind of thing in English.
I'm sure there are others, but I've used WebPurify in a project for precisely this reason before.
At Melissa Data, when my manager (the director of Massachusetts Research and Development) and I refactored a Data Profiler targeted at relational databases, we counted profanities by the number of Levenshtein distance matches, where the number of insertions, deletions and substitutions is tunable by the user to allow for spelling mistakes, Germanic equivalents of English words, plurals, and whitespace and non-whitespace punctuation. We sped up the Levenshtein distance calculation by looking only in the diagonal bands of the n-by-n matrix.
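For illustration, here is a sketch of that banded computation: a two-row Levenshtein distance that only fills cells within k of the diagonal and reports k + 1 whenever the true distance exceeds k (a reconstruction of the idea, not Melissa Data's code):

import java.util.Arrays;

public class BandedLevenshtein {
    // Edit distance if it is <= k, otherwise k + 1.
    static int distance(String a, String b, int k) {
        int n = a.length(), m = b.length();
        if (Math.abs(n - m) > k) return k + 1;
        final int BIG = k + 1;
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = (j <= k) ? j : BIG;
        for (int i = 1; i <= n; i++) {
            Arrays.fill(curr, BIG);
            if (i <= k) curr[0] = i;
            int lo = Math.max(1, i - k), hi = Math.min(m, i + k);
            for (int j = lo; j <= hi; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int best = Math.min(prev[j - 1] + cost,      // substitute or match
                           Math.min(prev[j] + 1,             // delete
                                    curr[j - 1] + 1));       // insert
                curr[j] = Math.min(best, BIG);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return Math.min(prev[m], BIG);
    }

    public static void main(String[] args) {
        System.out.println(distance("profanity", "pr0fanity", 2));  // prints: 1
    }
}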
I am looking to use a natural language parsing library for a simple chat bot. I can get the part-of-speech tags, but I always wonder: what do you do with the POS tags? If I know the parts of speech, what then?
I guess it would help with the responses. But what data structures and architecture could I use?
A part-of-speech tagger assigns labels to the words in the input text. For example, the popular Penn Treebank tagset has some 40 labels, such as "plural noun", "comparative adjective", "past tense verb", etc. The tagger also resolves some ambiguity. For example, many English word forms can be either nouns or verbs, but in the context of other words, their part of speech is unambiguous.
So, having annotated your text with POS tags, you can answer questions like: How many nouns do I have? How many sentences do not contain a verb? And so on.
For a chatbot, you obviously need much more than that. You need to figure out the subjects and objects in the text, and which verb (predicate) they attach to; you need to resolve anaphors (which individual does a he or she point to), what is the scope of negation and quantifiers (e.g. every, more than 3), etc.
Ideally, you need to map your input text into some logical representation (such as first-order logic), which would let you bring in reasoning to determine whether two sentences are equivalent in meaning, are in an entailment relationship, etc.
While a POS-tagger would map the sentence
Mary likes no man who owns a cat.
to such a structure
Mary/NNP likes/VBZ no/DT man/NN who/WP owns/VBZ a/DT cat/NN ./.
you would rather need something like this:
SubClassOf(
    ObjectIntersectionOf(
        Class(:man)
        ObjectSomeValuesFrom(
            ObjectProperty(:own)
            Class(:cat)
        )
    )
    ObjectComplementOf(
        ObjectSomeValuesFrom(
            ObjectInverseOf(ObjectProperty(:like))
            ObjectOneOf(
                NamedIndividual(:Mary)
            )
        )
    )
)
Of course, while POS-taggers get precision and recall values close to 100%, more complex automatic processing will perform much worse.
A good Java library for NLP is LingPipe. It doesn't, however, go much beyond POS-tagging, chunking, and named entity recognition.
Natural language processing is wide and deep, with roots going back at least to the 60s. You could start reading up on computational linguistics in general, natural language generation, generative grammars, Markov chains, chatterbots and so forth.
Wikipedia has a short list of libraries which I assume you might have seen. Java doesn't have a long tradition in NLP, though I haven't looked at the Stanford libraries.
I doubt you'll get very impressive results without diving fairly deeply into linguistics and grammar. Not everybody's favourite school subject (or so I've heard reported -- loved'em meself!).
I'll skip a lot of the details and keep this simple. Part-of-speech tagging helps you create a parse tree from a sentence. Once you have this, you try to work out a meaning as unambiguously as possible. The result of this parsing step will greatly aid you in framing a suitable response for your chatterbot.
Once you have part of speech tags you can extract, for example, all nouns, so you know roughly what things or objects someone is talking about.
To give you an example:
Someone says "you can open a new window." When you have the POS tags you know they are not talking about a can (as in container, jar etc., which would even make sense in the context of open), but a window. You'll also know that open is a verb.
With this information, your chat bot can generate a much better reply that will have nothing to do with can openers etc.
Note: You don't need a parser to get POS tags. A simple POS tagger is enough. A parser will give you even more information (e.g. what is the subject, what the object of the sentence?)
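As a concrete illustration, a standalone tagger such as Apache OpenNLP only takes a few lines; this sketch assumes you have downloaded its pretrained "en-pos-maxent.bin" model, and the API details vary a little between versions:

import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.SimpleTokenizer;

public class PosTagDemo {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("en-pos-maxent.bin")) {
            POSModel model = new POSModel(in);
            POSTaggerME tagger = new POSTaggerME(model);
            String[] tokens = SimpleTokenizer.INSTANCE.tokenize("you can open a new window");
            String[] tags = tagger.tag(tokens);
            // Prints token/tag pairs, e.g. can/MD ... window/NN
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }
}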