How to distinguish in quotes delimiter vs out of quotes delimiter - java

I have a txt file that contains the following
SELECT TOP 20 personid AS "testQu;otes"
FROM myTable
WHERE lname LIKE '%pi%' OR lname LIKE '%m;i%';
SELECT TOP 10 personid AS "testQu;otes"
FROM myTable2
WHERE lname LIKE '%ti%' OR lname LIKE '%h;i%';
............
The above query can be any legit SQl statement (on one or multiple lines , i.e. any way user wishes to type in )
I need to split this txt and put into an array
File file ... blah blah blah
..........................
String myArray [] = text.split(";");
But this does not work properly because it take into account ALL ; . I need to ignore those ; that are within ";" AND ';'. For example ; in here '%h;i%' does not count because it is inside ''. How can I split correctly ?

Assuming that each ; you want to split on is at the end of line you can try to split on each ; + line separator after it like
text.split(";"+System.lineSeparator())
If your file has other line separators then default ones you can try with
text.split(";\n")
text.split(";\r\n")
text.split(";\r")
BTW if you want to include ; in split result (if you don't want to get rid of it) you can use look-behind mechanism like
text.split("(?<=;)"+System.lineSeparator())
In case you are dynamically reading file line-by-line just check if line.endsWith(";").

I see a 'new line' after your ';' - It is generalizable to the whole text file ?
If you must/want use regular expression you could split with a regex of the form
;$
The $ means "end of line", depending of the regex implementation of Java (don't remember).
I will not use regex for this kind of task. Parsing the text and counting the number of ' or " to be able to recognize the reals ";" delimiters is sufficient.

Related

Unable to capture next line character in Java

I have a requirement of parsing through an python file which contains multiple sql queries and get the start and end positions of the query to get only the query part using JAVA
I am using .contains function to check for sql(''' as my opening character for the query and now for the closing character I have ''') but there are some cases where ''') comes in between the query when there is a variable involved which should not be detected as an end of the query.
Something like this :
spark.sql(''' SELECT .......
FROM.....
WHERE xxx IN ('''+ Variable +''')
''')
here the last but one line also gets detected as end of line if I use line.contains(" ''') ") which is wrong.
All I can think of is to check for next line character as the end of the query as each query is separated by two empty lines. So tried these if (line.contains(" ''')\n") & if (line.contains(" ''')\r\n") but none of them work for me.
Kindly let me know of any other way to do this.
Note that I do not have the privilege to change the query file.
Thanks
I believe simple contains won't solve this problem.
You will have to use Pattern if you are looking to match \n.
String query = "spark.sql(''' SELECT .......\n" +
"FROM..... \n" +
"WHERE xxx IN ('''+ Variable +''')\n" +
"''')";
Pattern pattern = Pattern.compile("^spark.sql\\('''(.*)'''\\)$", Pattern.DOTALL);
System.out.println(pattern.matcher(query).find());
Output:
true
Pattern.DOTALL tells Java to allow the dot to match newline characters, too.

TRIM replaces LTRIM/RTRIM

I am fairly new to Oracle.
Is it safe to say that LTRIM(RTRIM(<myVarchar>)) is totally replaceable by TRIM(<myVarchar>) if I want to replace both leading and trailing whitespaces in Oracle 11g?
Also, when I am trying to use this function in my query using JPA, I am getting error "org.hibernate.hql.internal.ast.QuerySyntaxException: unexpected AST node".
Here is the query that I am using:
#Query("Select p from OldPin p WHERE TRIM(p.eeNo) = :accNum and
TRIM(p.pinStatus) = 'A' and TRIM(p.memberType='E') and TRIM(p.sCode) in
('MSHK','MCMG')")
public OldPin findByAccountNum(#Param("accNum") String accNum);
Trim will remove both leading and trailing spaces by default
eg. Trim(' test ') output will be test
If we use Trim(both from ) then it will remove a character from both side
eg.,
Trim(both '1' from '111oracle111') output will be oracle
Trim(leading '1' from '111oracle111') output will be oracle111
Trim(trailing '1' from '111oracle111') output will be 111oracle
Trim(both 'ab' from 'abtechab') - it will throw error, because trim will support single character only
In RTrim and LTrim we can remove any number of character.
Too, you can use
select '%'||replace(' hello world ', ' ' , '')||'%' from dual;
and the output is
%helloworld%

How to tokenize, scan or split this string of email addresses

For Simple Java Mail I'm trying to deal with a somewhat free-format of delimited email addresses. Note that I'm specifically not validating, just getting the addresses out of a list of addresses. For this use case the addresses can be assumed to be valid.
Here is an example of a valid input:
"name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
So there are two basic forms "name#domain.com" and "Joe Sixpack ", which can appear in a comma / semicolon delimited string, ignoring white space padding. The problem is that the names can contains delimiters as valid characters.
The following array shows the data needed (trailing spaces or delimiters would not be a big problem):
["name#domain.com",
"Sixpack, Joe 1 <name#domain.com>",
"Sixpack, Joe 2 <name#domain.com>",
"Sixpack, Joe, 3<name#domain.com>",
"nameFoo#domain.com",
"nameBar#domain.com",
"nameBaz#domain.com"]
I can't think of a clean way to deal with this. Any suggestion how I can reliably recognize whether a comma is part of a name or is a delimiter?
Final solution (variation on the accepted answer):
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
// recognize value tails and replace the delimiters there, disambiguating delimiters
const result = string
.replace(/(#.*?>?)\s*[,;]/g, "$1<|>")
.replace(/<\|>$/,"") // remove trailing delimiter
.split(/\s*<\|>\s*/) // split on delimiter including surround space
console.log(result)
Or in Java:
public static String[] extractEmailAddresses(String emailAddressList) {
return emailAddressList
.replaceAll("(#.*?>?)\\s*[,;]", "$1<|>")
.replaceAll("<\\|>$", "")
.split("\\s*<\\|>\\s*");
}
since you are not validating, i assume that the email addresses are valid.
Based on this assumption, i will look up an email address followed by ; or , this way i know its valid.
var string = "name#domain.com,Sixpack, Joe 1 <name#domain.com>, Sixpack, Joe 2 <name#domain.com> ;Sixpack, Joe, 3<name#domain.com> , nameFoo#domain.com,nameBar#domain.com;nameBaz#domain.com;"
const result = string.match(/(.*?#.*?\..*?)[,;]/g)
console.log(result)
This pattern works for your provided examples:
([^#,;\s]+#[^#,;\s]+)|(?:$|\s*[,;])(?:\s*)(.*?)<([^#,;\s]+#[^#,;\s]+)>
([^#,;\s]+#[^#,;\s]+) # email defined by an # with connected chars except ',' ';' and white-space
| # OR
(?:$|\s*[,;])(?:\s*) # start of line OR 0 or more spaces followed by a separator, then 0 or more white-space chars
(.*?) # name
<([^#,;\s]+#[^#,;\s]+)> # email enclosed by lt-gt
PCRE Demo
Using Java's replaceAll and split functions (mimicked in javascript below), I would say lock onto what you know ends an item (the ".com"), replace separator characters with a unique temp (a uuid or something like <|>), and then split using your refactored delimiter.
Here is a javascript example, but Java's repalceAll and split can do the same job.
var string = "name#domain.com,Joe Sixpack <name#domain.com>, Sixpack, Joe <name#domain.com> ;Sixpack, Joe<name#domain.com> , name#domain.com,name#domain.com;name#domain.com;"
const result = string.replace(/(\.com>?)[\s,;]+/g, "$1<|>").replace(/<\|>$/,"").split("<|>")
console.log(result)

Trying to create a regexp

I have a string which I want a string to parse via Java or Python regexp:
something (\var1 \var2 \var3 $var4 #var5 $var6 *fdsfdsfd #uytuytuyt fdsgfdgfdgf aaabbccc)
The number of var is unknown. Their exact names are unknown. Their names may or may not start with "\" or "$", "*", "#" or "#" and there're delimited by whitespace.
I'd like to parse them separately, that is, in capture groups, if possible. How can I do that? The output I want is a list of:
[\var1 , \var2 , \var3 , $var4 , #var5 , $var6 , *fdsfdsfd , #uytuytuyt , fdsgfdgfdgf , aaabbccc]
I don't need the java or python code, I just need the regexp. My incomplete one is:
something\s\(.+\)
something\s\((.+)\)
In this regex you are capturing the string containing all the variables. split it based on whitespace since you are sure that they are delimited by whitespace.
m = re.search('something\s\((.+)\)', input_string)
if m:
list_of_vars = m.group(1).split()

What is the newline character in Oracle

Running in sqldeveloper, this prompts an empty cell in the results
select regexp_replace(chr(10) || chr(13), '\n', 'New Line')
from dual
I was expecting this to be '_New Line' or 'New Line_' but instead I get '__' (Note that underscore is visually an empty space).
This doesn't work either.
select regexp_replace(
UNISTR('\000A')
|| UNISTR('\000D')
, '\n', 'New Line')
from dual
Specifically I need this to work in java. There is a method that will invoque a stored procedure that returns a cursor with columns that may have line breakings. From the java side I need this to work:
String or = resultset.getString("some_column"); // null check left for simplicity
... = or.split("\\n");
... = or.split("\\r?\\n");
I was testing with regexp_replace because it is supposed to do the same.
You're already using the newline character, it's chr(10):
select regexp_replace('<'||chr(10) || chr(13)||'>', chr(10), 'New Line')
from dual;
To replace the line feed instead, use chr(13).
You don't need to use a regular expression here, you can use the normal replace() instead.
SQL Fiddle with angle brackets added to show the output more clearly.

Categories