Select clause complex regex pattern - java

I am working on an application for my master's thesis, and as part of it I have to build a SQL parser. I've decided to lean heavily on regular expressions, since that seemed the best approach at the time.
The problem is that my regexes have some minor problems.
Consider some example queries:
select
RIC
from
(select
s.RIC, m.NAME
from
Stock s, Market m
where
s.LISTED_ON_EXCHANGE = m.RIC) t
where
RIC > 'G';
select *
from Stock
order by COMPANY
LIMIT 0,2;
select 1+2;
select now();
select
s.RIC, m.NAME
from
Stock s
INNER JOIN
Market m ON s.LISTED_ON_EXCHANGE = m.RIC;
select *
from Stock
order by COMPANY;
select *
from Stock
where RIC in ('GS.N' , 'INFY.BO');
select *
from Stock
where RIC LIKE 'V%';
select *
from Stock
where RIC BETWEEN 'G' AND 'I';
select count(*)
from STOCK
where LISTED_ON_EXCHANGE IS NOT NULL;
select na_me as n, price as p
from bla, blabla, blalalaa;
And given the following two regexes:
SELECT_FIELDS_PATTERN = "(?<=[SELECT]) [\\d\\w',.*() ]+ (?=FROM)";
That should match selection fields.
And:
SELECT_FROM_PATTERN = "(?<=[FROM]) [\\w, ]+ (?(?=(?:WHERE|INNER|ORDER)))";
That should match FROM clauses excluding any conditions or ordering etc.
All of the queries except
select 1+2;
select now();
should be valid, because I only want to parse select queries that contain information relevant to me.
The problem is that the two regexes I've created won't validate for example the last query:
select na_me as n, price as p from bla, blabla, blalalaa;
So I would like some help improving my regexes for select queries, and perhaps merging the two into one.
An example of a correct output for the first query:
select RIC from (select s.RIC, m.NAME from Stock s, Market m where s.LISTED_ON_EXCHANGE=m.RIC) t where RIC > 'G';
The output should be:
RIC
for the first part and
(select s.RIC, m.NAME from Stock s, Market m where s.LISTED_ON_EXCHANGE=m.RIC) t
for the second part

Character classes are not groups: [SELECT] matches a single character (S, E, L, C, or T), not the word SELECT. Remove the [ and ] around the keywords.
Don't use lookarounds where you don't need them; they can cause problems in some cases.
You probably want \b around keywords so that SELECT does not match inside FOOSELECT.
You can use (?i) to make the expression case-insensitive.
You could use something like:
(?i)\bSELECT\b\s+(.+)\s+\bFROM\b\s+([\w\s,]+?)(?:\s+\b(?:WHERE|INNER|ORDER)\b|;?$)
The parts of interest are captured in the first and second capturing groups.
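To see what the two groups capture, here is a minimal sketch that runs the pattern above over a few of the question's queries. It is written in Python rather than Java, on the assumption that nothing in this pattern behaves differently between Python's re module and java.util.regex; the function name split_select is made up for the illustration.

```python
import re

# Pattern copied from the answer above. In a Java string literal every
# backslash would have to be doubled.
SELECT_FROM = re.compile(
    r"(?i)\bSELECT\b\s+(.+)\s+\bFROM\b\s+([\w\s,]+?)"
    r"(?:\s+\b(?:WHERE|INNER|ORDER)\b|;?$)"
)


def split_select(query):
    """Return (fields, from_clause), or None when the pattern finds no match.

    split_select is a hypothetical helper name, not part of the thread.
    """
    match = SELECT_FROM.search(query)
    return match.groups() if match else None


for query in (
    "select na_me as n, price as p from bla, blabla, blalalaa;",
    "select * from Stock where RIC LIKE 'V%';",
    "select 1+2;",
):
    print(query, "->", split_select(query))
```

For single-line queries without subselects this gives the field list and the table list. On the nested first query the greedy (.+) runs through to the last FROM, so the split lands inside the subquery rather than around it.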
Note that this will not work correctly with string literals and in various other cases. SQL is also recursive (the first query contains a nested select), which is difficult to handle with Java regexes. I suggest using a proper parser if you want to parse SQL properly. (You can write a simple one yourself: use regexes for lexing to generate tokens, then Java code to parse the tokens and build a parse tree.)
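The lexer-plus-parser idea can be sketched roughly as follows. This is an illustration in Python rather than Java, and everything in it is an assumption: the token set, the names tokenize and split_top_level, and the choice to stop the FROM clause at WHERE, INNER, or ORDER (taken from the regex above). It only separates the outermost select list from the outermost FROM clause by tracking parenthesis depth; it does not build a parse tree.

```python
import re

# Hypothetical token set, just enough for the sample queries in the question.
TOKEN_RE = re.compile(r"""
    (?P<STRING>'(?:[^']|'')*')     # quoted literal; '' is an escaped quote
  | (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<WORD>[A-Za-z_][\w.]*)      # keywords, names, qualified names
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<OP><=|>=|<>|!=|[-+*/=<>,;])
  | (?P<WS>\s+)
""", re.VERBOSE)

# Keywords that end the FROM clause, same set as the regex in the answer.
END_WORDS = {"WHERE", "INNER", "ORDER"}


def tokenize(sql):
    """Turn the query into (kind, text) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(sql):
        match = TOKEN_RE.match(sql, pos)
        if match is None:
            raise ValueError(f"unexpected character {sql[pos]!r} at {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def split_top_level(sql):
    """Return (field_tokens, from_tokens), or None without a top-level FROM."""
    fields, sources = [], []
    depth, state = 0, "start"
    for kind, text in tokenize(sql):
        if kind == "LPAREN":
            depth += 1
        elif kind == "RPAREN":
            depth -= 1
        # Only words outside any parentheses can act as clause keywords.
        keyword = text.upper() if kind == "WORD" and depth == 0 else None
        if state == "start":
            if keyword == "SELECT":
                state = "fields"
        elif state == "fields":
            if keyword == "FROM":
                state = "sources"
            else:
                fields.append(text)
        else:  # state == "sources"
            if keyword in END_WORDS or (depth == 0 and text == ";"):
                break
            sources.append(text)
    return (fields, sources) if state == "sources" else None
```

Because string literals become single tokens and keywords inside parentheses are ignored, a FROM inside a subquery or inside a quoted string no longer confuses the split, which is the case the plain regex cannot handle.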

Related

Ordering Sql result based on number of token matches from RLIKE

I am trying to implement a simple search query: I split the search text into tokens and return every row that contains any of them, using RLIKE 'token1|token2|token3|...'. This works and returns all the matching rows, but now I want to order the results by the number of tokens from the RLIKE pattern that each row matches. Is that possible? Thanks in advance.
SELECT p.* FROM product p
WHERE p.title RLIKE 'token1|token2|token3';
You can use the operator LIKE for each of the tokens in the ORDER BY clause:
ORDER BY (p.title LIKE '%token1%') +
(p.title LIKE '%token2%') +
(p.title LIKE '%token3%') DESC
Each of the boolean expressions p.title LIKE '%tokenX%' evaluates to 1 for true or 0 for false.
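Here is a small runnable sketch of that ORDER BY, building the sum of LIKE terms for an arbitrary token list. It uses Python with an in-memory SQLite database as a stand-in for MySQL, so the WHERE clause is a chain of OR'ed LIKE terms instead of RLIKE. The product table contents and the names make_demo_db and rank_by_token_matches are invented for the example.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway product table (sample titles are made up)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE product (title TEXT)")
    conn.executemany(
        "INSERT INTO product VALUES (?)",
        [("red cotton shirt",), ("red shirt",), ("blue hat",), ("cotton",)],
    )
    return conn


def rank_by_token_matches(conn, tokens):
    """Return (title, hits) rows, most matched tokens first.

    Assumes at least one token, and that tokens contain no % or _,
    which LIKE would treat as wildcards.
    """
    patterns = [f"%{token}%" for token in tokens]
    # One "(title LIKE ?)" term per token; each evaluates to 1 or 0.
    score = " + ".join("(title LIKE ?)" for _ in patterns)
    where = " OR ".join("title LIKE ?" for _ in patterns)
    sql = (
        f"SELECT title, {score} AS hits FROM product "
        f"WHERE {where} ORDER BY hits DESC, title"
    )
    # Placeholders occur twice in the statement: in the score, then in WHERE.
    return conn.execute(sql, patterns + patterns).fetchall()


for row in rank_by_token_matches(make_demo_db(), ["red", "cotton", "shirt"]):
    print(row)
```

Binding the patterns as parameters, rather than pasting tokens into the SQL text, keeps user-supplied search terms from being interpreted as SQL.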

How to filter DBpedia results in SPARQL

I have a small problem.
If I run this simple SPARQL query:
SELECT ?abstract
WHERE {
<http://dbpedia.org/resource/Mitsubishi> <http://dbpedia.org/ontology/abstract> ?abstract.
FILTER langMatches( lang(?abstract), 'en')}
I have this result:
SPARQL Result
and it contains non-English characters.
Is there a way to remove them and retrieve just the English words?
You'll need to define exactly which characters you want and don't want in your result, but you can use replace to replace characters outside of a range with, e.g., empty strings. If you wanted to exclude everything except the Basic Latin, Latin-1 Supplement, Latin Extended-A, and Latin Extended-B ranges (which works out to \u0000–\u024f), you could do the following:
SELECT ?abstract ?cleanAbstract
WHERE {
dbpedia:Mitsubishi dbpedia-owl:abstract ?abstract
FILTER langMatches( lang(?abstract), 'en')
bind(replace(?abstract,"[^\\x{0000}-\\x{024f}]","") as ?cleanAbstract)
}
SPARQL results
Or even simpler:
SELECT (replace(?abstract_,"[^\\x{0000}-\\x{024f}]","") as ?abstract)
WHERE {
dbpedia:Mitsubishi dbpedia-owl:abstract ?abstract_
FILTER langMatches(lang(?abstract_), 'en')
}
SPARQL results
The Mitsubishi Group (, Mitsubishi Gurūpu) (also known as the
Mitsubishi Group of Companies or Mitsubishi Companies) is a group of
autonomous Japanese multinational companies covering a range of
businesses which share the Mitsubishi brand, trademark, and legacy.The
Mitsubishi group of companies form a loose entity, the Mitsubishi
Keiretsu, which is often referenced in Japanese and US media and
official reports; in general these companies all descend from the
zaibatsu of the same name. The top 25 companies are also members of
the Mitsubishi Kin'yōkai, or "Friday Club", and meet monthly. In
addition the Mitsubishi.com Committee exists to facilitate
communication and access of the Mitsubishi brand through a portal web
site.
You may find the Latin script in Unicode Wikipedia article useful.
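If you want to experiment with the character range before putting it in a query, the same filter is easy to try locally. This is only a sketch in Python, not SPARQL; the function name strip_non_latin and the sample string are assumptions made for the illustration.

```python
import re

# Anything outside U+0000..U+024F (Basic Latin through Latin Extended-B).
NON_LATIN = re.compile(r"[^\u0000-\u024f]")


def strip_non_latin(text):
    """Drop every character outside the Latin ranges listed above."""
    return NON_LATIN.sub("", text)


# Sample input written with escapes: three CJK characters plus a 'u' with macron.
sample = "The Mitsubishi Group (\u4e09\u83f1\u96fb, Mitsubishi Gur\u016bpu)"
print(strip_non_latin(sample))
```

Characters such as ū (U+016B) survive because they fall inside Latin Extended-A, which is why the cleaned abstract still contains "Gurūpu" and "Kin'yōkai".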

StringUtil indexOf() equivalent postgreSQL query

I need to implement the equivalent of the StringUtils class's indexOf() method in PostgreSQL.
Let's say I have a table in which url is one of the columns.
url : "http://paypal-info.com/home.webapps.cgi-bin-limit/webscr.cmd-login-submit"
My requirement is to find the index of the 3rd occurrence of '/' in the above URL, then take a substring so that I get only the host name, paypal-info.com, in a PostgreSQL query.
Any ideas on implementing this would be appreciated.
Thanks
Have you tried the split_part function?
SELECT split_part('http://paypal-info.com/home.webapps.cgi-bin-limit/webscr.cmd-login-submit', '/', 3)
Result:
split_part
paypal-info.com
For other string functions try this doc:
http://www.postgresql.org/docs/9.1/static/functions-string.html
Edit: as for indexOf itself, I don't know of any built-in PostgreSQL solution, but you can achieve it with two string functions like this:
SELECT strpos('http://paypal-info.com/home.webapps.cgi-bin-limit/webscr.cmd-login-submit', split_part('http://paypal-info.com/home.webapps.cgi-bin-limit/webscr.cmd-login-submit', '/', 4)) - 1 as index_of;
The string functions and operators section of the manual lists position, which is the equivalent of String.indexOf, e.g.
select position('/' in 'http://paypal-info.com/home.webapps.cgi-bin-limit/webscr.cmd-login-submit');
However, it doesn't offer a way to get the nth occurrence.
You're really approaching this all wrong. You should use proper URL parsing code to extract the host portion, not attempt to roll your own or use regex / splitting / string mangling.
PostgreSQL doesn't have a native URL/URI type, but its procedural languages do and it's trivial to wrap suitable functions. e.g. with PL/Python:
create language plpythonu;
create or replace function urlhost(url text) returns text
language plpythonu
immutable strict
as $$
import urlparse
return urlparse.urlparse(url).netloc
$$;
then:
regress=# select urlhost('http://paypal-info.com/home.webapps.cgi-bin-limit/webscr.cmd-login-submit');
urlhost
-----------------
paypal-info.com
(1 row)
If you'd prefer to use PL/Perl, PL/V8, or whatever, that's fine.
For best performance, you could write a simple C function and expose that as an extension.
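For reference, the parsing step inside that function can be tried on its own. The sketch below uses Python 3, where the urlparse function lives in urllib.parse instead of the Python 2 urlparse module used in the PL/Python body above; the wrapper names url_host and url_hostname are invented for the example.

```python
from urllib.parse import urlparse


def url_host(url):
    """Host plus optional port, as returned by .netloc in the function above."""
    return urlparse(url).netloc


def url_hostname(url):
    """Host only, lower-cased and without any port."""
    return urlparse(url).hostname


url = "http://paypal-info.com/home.webapps.cgi-bin-limit/webscr.cmd-login-submit"
print(url_host(url))
print(url_host("http://asd.com:234/qwe"), url_hostname("http://asd.com:234/qwe"))
```

Note that netloc keeps the port (and any user info) while hostname drops them, so which one you want depends on what "host name" should mean for your data.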
Just replace 3 with N to get the (0-based) index of the Nth '/' in a given string:
SELECT length(substring('http://asd/asd', '(([^/]*/){3})')) - 1
To extract the host name from a URL you can use:
SELECT substring('http://asd.com:234/qwe', 'http://([^:]+).*/')
Tested here: SQLFiddle
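As a cross-check of what the length(substring(...)) - 1 expression computes, here is the same "index of the Nth occurrence" logic written out in Python. The helper name index_of_nth is hypothetical; it returns a 0-based index like Java's String.indexOf, or -1 when there are fewer than n occurrences.

```python
def index_of_nth(text, needle, n):
    """0-based index of the n-th occurrence of needle in text, or -1."""
    pos = -1
    for _ in range(n):
        # Continue searching just past the previous hit.
        pos = text.find(needle, pos + 1)
        if pos == -1:
            return -1
    return pos


url = "http://paypal-info.com/home.webapps.cgi-bin-limit/webscr.cmd-login-submit"
second = index_of_nth(url, "/", 2)
third = index_of_nth(url, "/", 3)
# The host sits between the second and third slash.
print(third, url[second + 1:third])
```

For 'http://asd/asd' and n = 3 this gives 10, the same value the SQL expression above produces.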

Catch-all second alternative for my start rule

I'm trying to write an ANTLR grammar for a little query language. Queries are a list of search terms restricted to specific fields:
field1:a field2:b field3:c
That's supposed to return a list of entities where field1 matches a, field2 matches b, and so on. Queries can also be completely unrestricted:
abc
That's supposed to return entities with any field that matches abc. Here's the ANTLR grammar:
@members {
String unrestrictedQuery;
}
FIELD1_OPERATOR: 'field1:';
FIELD2_OPERATOR: 'field2:';
FIELD3_OPERATOR: 'field3:';
DIGIT: '0'..'9';
LETTER: 'A'..'Z' | 'a'..'z';
query: subquery (' ' subquery)*
| UNRESTRICTED_QUERY=.* {unrestrictedQuery = $UNRESTRICTED_QUERY.text;}
;
I want unrestricted queries to be any text that doesn't match the query rule's first alternative.
1) Is there a better way to grab the text that the second alternative matched?
2) When I plug this into my web server, the unrestrictedQuery parser field resolves to the last character of the query. It seems like the action gets called for every character of the query when I really want the whole string.
Thanks for reading!
"I want unrestricted queries to be any text that doesn't match the query rule's first alternative".
This is a bad design decision. What if, in the future, you want to add field4? Queries containing field4: that used to be unrestricted would then be parsed differently, which is an incompatibility. It is better to change the grammar so that unrestricted queries are easy to recognize: surround field values (a, b, c) with quotes, or start an unrestricted query with a colon:
field1:a :abc field2:b
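To show why the leading colon makes the two kinds of term easy to tell apart, here is a rough sketch of a splitter for that syntax. It is plain Python rather than an ANTLR grammar, and the function name parse_query and the rule that terms are separated by whitespace are assumptions for the illustration.

```python
def parse_query(query):
    """Split a query into ({field: value}, [unrestricted terms]).

    Assumes whitespace-separated terms of the form field:value or :value.
    """
    restricted, unrestricted = {}, []
    for term in query.split():
        field, colon, value = term.partition(":")
        if not colon:
            raise ValueError(f"term {term!r} has no colon")
        if field:
            restricted[field] = value
        else:
            # An empty field name (leading colon) marks an unrestricted term.
            unrestricted.append(value)
    return restricted, unrestricted


print(parse_query("field1:a :abc field2:b"))
```

Adding a new field later changes nothing for existing queries, because no term is classified by checking it against a list of known field names.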

Exact match with sql like and the bind

I have a bind parameter in this SQL query:
SELECT * FROM users WHERE name LIKE '%?%'
The bind sets the ?.
Searching with LIKE works, but I don't know how to search for an exact match without changing the SQL.
I tried some regex-like input in the text box, e.g.:
_jon \jon\ [jon] and some others, but nothing works properly.
Any ideas?
Change your query to
select * from users where name like '?'
If you want to do a wildcard match, put the wildcards as part of the string that you're binding to the variable. If you don't want to do a wildcard match, then don't.
Note that like and = perform the same except when the wildcard character comes first in the string (for example, '%bob'); in that case the query optimizer can't use indexes as effectively to find the row(s) that you're looking for.
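A small sketch of that approach, where the caller decides about wildcards by what it binds. It uses Python's sqlite3 module with an in-memory table as a stand-in for whatever database and driver the question uses, so the placeholder is written as a bare ? (no quotes around it), which is how this particular driver expects it. The table contents and the names make_users_db and find_users are invented for the example.

```python
import sqlite3


def make_users_db(names):
    """Build a throwaway users table holding the given names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [(n,) for n in names])
    return conn


def find_users(conn, pattern):
    """Run the same SQL every time; wildcards, if any, arrive in the bound value."""
    rows = conn.execute(
        "SELECT name FROM users WHERE name LIKE ? ORDER BY name", (pattern,)
    )
    return [name for (name,) in rows]


conn = make_users_db(["jon", "jonathan", "bob jones"])
print(find_users(conn, "%jon%"))  # substring search
print(find_users(conn, "jon"))    # no wildcards bound, so only the whole name matches
```

The SQL text never changes between the two calls, which is what the question asked for once the % signs are moved out of the statement.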
You can't search for an exact match if the SQL contains % symbols, as they are wildcards. You'll need to change the SQL to
select * from users where name = '?'
for an exact match.
(You can also use select * from users where name like '?', but that is less efficient.)
What is keeping you from changing the SQL?
The Like condition is for 'similar' matches, while the '=' is for exact matches.
