Java - parsing text using delimiter for separating different arguments - java

How would you use multiple delimiters or a single delimiter to detect and separate out different string matches?
For example, I use a Scanner to parse in the following string:
MrsMarple=new Person(); MrsMarple.age=30;
I would like to separate out this string to determine, in sequence, when a new person is being created and when their age is being set. I also need to know what the age is being set to.
There can be anything between and/or either side of these arguments (there doesn't necessarily have to be a space between them, but the semi-colon is required). The "MrsMarple" could be any word. I would also prefer any arguments following a "//" (two slashes) on the same line to be ignored but that's optional.
If you can think of a simple alternative to using regex I'm more than willing to consider it.

I might try a simple split/loop approach.
Given String input = "MrsMarple=new Person(); MrsMarple.age=30;":
// Keep only the text before any "//" comment.
String[] noComments = input.split("//");
// Split what is left into individual statements.
String[] statements = noComments[0].split(";");
for (String statement : statements) {
    String[] varValue = statement.split("=");
    ...
    // Additional MrsMarple-SmartSense Technology (tm) here...
    ...
}
with judicious use of String.trim() and/or other simple tools.
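A slightly fuller sketch of the same split/loop idea follows. The class name StatementSplitter and the parse method are made up for illustration, and the sketch assumes that "//" never appears inside a value and that every statement has the form name=value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class StatementSplitter {

    // Returns one {name, value} pair per "name=value;" statement on the line.
    static List<String[]> parse(String line) {
        // Drop a trailing "// comment", if present.
        int comment = line.indexOf("//");
        String code = comment >= 0 ? line.substring(0, comment) : line;

        List<String[]> assignments = new ArrayList<>();
        for (String statement : code.split(";")) {
            // Split on the first '=' only, so the value may itself contain '='.
            String[] parts = statement.split("=", 2);
            if (parts.length == 2) {
                assignments.add(new String[] { parts[0].trim(), parts[1].trim() });
            }
        }
        return assignments;
    }

    public static void main(String[] args) {
        for (String[] a : parse("MrsMarple=new Person(); MrsMarple.age=30; // ignored")) {
            if (a[1].startsWith("new ")) {
                System.out.println("create " + a[0]);
            } else if (a[0].endsWith(".age")) {
                System.out.println("set age to " + Integer.parseInt(a[1]));
            }
        }
    }
}
```

The caller decides what each pair means, so new statement kinds only need another branch in the loop.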

Or, to make the matter more general (and avoid regexes), you could try scripting, since the input looks like script-language syntax: http://java.sun.com/developer/technicalArticles/J2SE/Desktop/scripting/ . Example:
ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine jsEngine = mgr.getEngineByName("JavaScript");
String input = "MrsMarple=new Person(); MrsMarple.age=30;";
try {
jsEngine.eval(input);
} catch (ScriptException ex) {
ex.printStackTrace();
}
In this case you'll need a Java class called Person with a public field called age. The code above has not been tested; you may need to add something like
jsEngine.eval("importPackage(my.package);");
to make it work. In any case, Oracle's tutorial should be helpful.

Related

Regex pattern to split colon char with a condition

I have a string like this:
http://schemas/identity/claims/usertype:External
My goal is to split that string into two parts at the colon delimiter, but the regex must not split on the colon in "http://". The result should be:
http://schemas/identity/claims/usertype
External
I have tried regex like this :
(http:\/\/+schemas\/identity\/claims\/usertype)
So it will be :
http://schemas/identity/claims/usertype
:External
After that I would replace the remaining colon with an empty string.
But I don't think this is best practice, and I rarely use regex.
Do you have any suggestions for simplifying the regex?
Thanks in advance.
This is an X/Y problem. Fortunately, you asked the question in a great way, by explaining the underlying problem you are trying to solve (namely: Pull some string out of a URL), and then describing the direction you've chosen to solve your problem (which is bad, see below), and then asking about a problem you have with this solution (which is irrelevant, as the entire solution is bad).
URLs aren't parsable like this. You shouldn't treat them as a string you can lop into pieces. For example, the server part can contain a colon too, for the port number. In front of the server part there can be authentication info, which can also contain a colon. It's rarely used, of course.
Try this one, which shows the problem with your approach:
https://joe:joe@google.com:443/
That link just works. Port 443 is the default anyway, and Google ignores the authentication header the browser ends up sending, but the point is that a URL may contain this stuff.
But rzwitserloot, it.. won't! I know!
That's a bad programming mindset, and it leads to security issues. Why go for a solution that burdens your codebase with unstated assumptions (assumption: the places that provide a URL to this code are under my control and will never send a port or auth info)? If the 'server' part is configurable in a config file, will you mention in said config file that you cannot add a port? Will you remember 4 years from now?
The solution that does it right isn't going to burden your code with all these unstated (or very unwieldy if stated) assumptions.
Okay, so what is the right way?
First, toss that string into the constructor of java.net.URI. Then, use the methods there to get what you actually want, which is the path part. That is a string you can pull apart:
URI uri = new URI("http://schemas/identity/claims/usertype:External");
String path = uri.getPath();
String newPath = path.replaceAll(":.*", "");
String type = path.replaceAll(".*?:", "");
URI newUri = uri.resolve(newPath);
System.out.println(newUri);
System.out.println(type);
prints:
http://schemas/identity/claims/usertype
External
NB: Toss some ports or auth stuff in there, or make it a relative URL - do whatever you like, this code is far more robust in the face of changing the base URL than any attempt to count colons is going to be.
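To illustrate the NB, here is a small sketch that runs the same idea against a URL with user info and a port. The class name ClaimSplitter, the credentials and the port are invented for the example; it uses getRawPath() and lastIndexOf instead of replaceAll, which is a variation on the answer's code rather than the code itself.

```java
import java.net.URI;

// Hypothetical helper illustrating the answer's java.net.URI approach.
class ClaimSplitter {

    // Returns { URI without the ":suffix", suffix }.
    static String[] split(String raw) {
        URI uri = URI.create(raw);
        // Raw path, so percent-escapes survive the round trip through resolve().
        String path = uri.getRawPath();
        int colon = path.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("no ':' in path: " + path);
        }
        // resolve() keeps scheme, user info, host and port from the original URI.
        URI base = uri.resolve(path.substring(0, colon));
        return new String[] { base.toString(), path.substring(colon + 1) };
    }

    public static void main(String[] args) {
        String[] parts = split("http://joe:secret@schemas:8080/identity/claims/usertype:External");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

Only colons inside the path are considered, so the colons in the scheme, the user info and the port never get in the way.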
Use Negative Lookbehind and split
Regex:
"(?<!(http|https)):"
Regex in context:
public static void main(String[] args) {
String input = "http://schemas/identity/claims/usertype:External";
validateURI(input);
List<String> result = Arrays.asList(input.split("(?<!(http|https)):"));
result.forEach(System.out::println);
}
private static void validateURI(String input) {
try {
new URI(input);
} catch (URISyntaxException e) {
System.out.println("Invalid URI!!!");
e.printStackTrace();
}
}
Output:
http://schemas/identity/claims/usertype
External
I think this might help you:
public class Separator{
public static void main(String[] args) {
String input = "http://schemas/identity/claims/usertype:External";
String[] splitted = input.split("\\:");
System.out.println(splitted[splitted.length-1]);
}
}
Output
External

Java Regexp to match domain of url

I would like to use Java regex to match a domain of a url, for example,
for www.table.google.com, I would like to get 'google' out of the url, namely, the second last word in this URL string.
Any help will be appreciated !!!
It really depends on the complexity of your inputs...
Here is a pretty simple regex:
.+\\.(.+)\\..+
It captures whatever sits between the last two dots (\\. is an escaped, literal dot).
And here are some examples for that pattern: https://regex101.com/r/L52oz6/1.
As you can see, it works for simple inputs but not for complex urls.
But why reinvent the wheel? There are plenty of really good libraries that correctly parse complex URLs. Still, for simple inputs a small regex is easily built. If this does not solve the problem for your inputs, let me know and I will adjust the pattern.
Note that you can also just use simple splitting like:
String[] elements = input.split("\\.");
String secondToLastElement = elements[elements.length - 2];
But don't forget the index-bound checking.
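As a sketch of the split approach with the bounds check in place, the snippet below lets java.net.URI isolate the host first, so paths, ports and query strings do not confuse the split. The class and method names are made up, and prepending "http://" when no scheme is present is an assumption of this example.

```java
import java.net.URI;
import java.util.Optional;

// Hypothetical helper, not part of the original answer.
class HostLabels {

    // Returns the second-to-last dot-separated label of the host, if there is one.
    static Optional<String> secondToLastLabel(String url) {
        // URI only reports a host when a scheme is present, so add one if missing.
        URI uri = URI.create(url.contains("://") ? url : "http://" + url);
        String host = uri.getHost();
        if (host == null) {
            return Optional.empty();
        }
        String[] labels = host.split("\\.");
        // Bounds check: a host such as "localhost" has only one label.
        return labels.length >= 2
                ? Optional.of(labels[labels.length - 2])
                : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(secondToLastLabel("www.table.google.com"));
        System.out.println(secondToLastLabel("http://localhost/index.html"));
    }
}
```

Note that for a host such as www.stackoverflow.co.uk this returns "co"; handling multi-part suffixes needs a list of known suffixes.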
Or, if you want a very fast solution, walk through the input starting from the last position until you find the first dot, then continue until you find the second dot. Then extract that part with input.substring(index1, index2);.
There is already a method for exactly that purpose, namely String#lastIndexOf (see the documentation).
Take a look at this code snippet:
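String input = ...
int indexLastDot = input.lastIndexOf('.');
// Start the backwards search one position before the last dot, otherwise the same dot is found again.
int indexSecondToLastDot = input.lastIndexOf('.', indexLastDot - 1);
// substring takes (begin, end): begin just after the second-to-last dot, end at the last dot.
String secondToLastWord = input.substring(indexSecondToLastDot + 1, indexLastDot);
Don't forget bounds checking: lastIndexOf returns -1 when no dot is found.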
The advantage of this approach is that it is really fast: it scans the string's characters directly and allocates nothing except the final substring.
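A bounds-checked version of the lastIndexOf walk might look like the sketch below. The class and method names are illustrative only, and returning an Optional for inputs without a dot is a design choice of this example.

```java
import java.util.Optional;

// Hypothetical helper, not part of the original answer.
class SecondToLast {

    static Optional<String> secondToLastWord(String input) {
        int lastDot = input.lastIndexOf('.');
        if (lastDot <= 0) {
            // No dot at all, or nothing in front of the only dot.
            return Optional.empty();
        }
        // Returns -1 when there is no earlier dot; the +1 below then starts at index 0.
        int previousDot = input.lastIndexOf('.', lastDot - 1);
        return Optional.of(input.substring(previousDot + 1, lastDot));
    }

    public static void main(String[] args) {
        System.out.println(secondToLastWord("www.table.google.com"));
        System.out.println(secondToLastWord("com"));
    }
}
```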
My attempt:
(?<scheme>https?:\/\/)?(?<subdomain>\S*?)(?<domainword>[^.\s]+)(?<tld>\.[a-z]+|\.[a-z]{2,3}\.[a-z]{2,3})(?=\/|$)
Demo. Works correctly for:
http://www.foo.stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.com/
http://stackoverflow.com
https://www.stackoverflow.com
www.stackoverflow.com
stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.co.uk
foo.www.stackoverflow.com
foo.www.stackoverflow.co.uk
foo.www.stackoverflow.co.uk/a/b/c
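For use from Java, the pattern above can be compiled as is, because java.util.regex supports the same named-group syntax. The following is only a sketch: the class name DomainWord is invented, and the slashes are left unescaped since Java does not need them escaped.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the regex from this answer.
class DomainWord {

    private static final Pattern HOST = Pattern.compile(
            "(?<scheme>https?://)?(?<subdomain>\\S*?)(?<domainword>[^.\\s]+)"
            + "(?<tld>\\.[a-z]+|\\.[a-z]{2,3}\\.[a-z]{2,3})(?=/|$)");

    // Returns the label directly in front of the TLD, if the pattern matches.
    static Optional<String> of(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? Optional.of(m.group("domainword")) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(of("www.table.google.com"));
        System.out.println(of("foo.www.stackoverflow.co.uk/a/b/c"));
    }
}
```

One limitation to be aware of: a short domain word can be mistaken for the first half of a two-part suffix, so www.ab.com yields "www" rather than "ab".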
private static final Pattern URL_MATCH_GET_SECOND_AND_LAST =
    Pattern.compile("www\\.(.*)\\.google\\.(.*)", Pattern.CASE_INSENSITIVE);
String sURL = "www.table.google.com";
// One matcher and one find() are enough; dots are escaped so they match literally.
Matcher matchURL = URL_MATCH_GET_SECOND_AND_LAST.matcher(sURL);
if (matchURL.find()) {
    String sFirst = matchURL.group(1);   // the part between "www." and ".google."
    String sSecond = matchURL.group(2);  // the part after ".google."
}

Creating string array from a delimited string.

I have a string of urls, example "domain.com/url1, domain.com/url2 etc". Sometimes they are comma, tab, or pipe delimited. What I'd like to do is split them up in a string array and automatically handle any potential use case. Does anybody know of a good way to handle this?
I started with something like this, but it doesn't function correctly nor does it handle all use cases.
Collection<String> newUrls = Arrays.asList(photoHolder.getPhotoURLs().replaceAll("\\|", ",").replaceAll("\\s+", "").split(","));
I believe this should be possible with only using the split method and providing a regex that will match any of your delimiters.
Collection<String> newUrls = Arrays.asList(photoHolder.getPhotoURLs().split("\\t|\\||,"));
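A slightly more forgiving variant of the single-regex split is sketched below. It treats any run of commas, pipes and whitespace as one delimiter and drops empty entries. The class name is made up, and the sketch assumes the URLs themselves contain no unencoded commas, pipes or whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original answer.
class UrlListSplitter {

    static List<String> split(String raw) {
        // Character class: comma, pipe, or any whitespace (space, tab, newline).
        // The trailing + collapses sequences such as ", " into one delimiter.
        return Arrays.stream(raw.split("[,|\\s]+"))
                .filter(s -> !s.isEmpty())   // a leading delimiter yields an empty first entry
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(split("domain.com/url1, domain.com/url2|domain.com/url3\tdomain.com/url4"));
    }
}
```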
You might use some "smart" regular expressions that are independent of the delimiter, but use the domain names (.com, .co.uk, IP addresses...) to separate the URLs.
I think you have to write several split methods to cover the different scenarios and finally put their results in one list. For unexpected input, use a try/catch and handle the exception in the catch block, because we cannot anticipate every scenario. I hope this helps.
Uri uri = Uri.parse("domain.com/url1/what ever_the_url");
String protocol = uri.getScheme();
String server = uri.getAuthority();
String path = uri.getPath();
Set<String> args = uri.getQueryParameterNames();
String limit = uri.getQueryParameter("limit");
Try this also.

Pulling e-mails from a string

So, I'm parsing a .mozeml file from Eudora and converting them into an mbox file (mbox got corrupted, and deleted but mozeml files were left over, but unable to import them). There's over 200,000 e-mails, and unsure of what's a good way to handle this properly.
I am thinking of creating a Java program that will read the .mozeml files (they are xml, utf-8 format) parse the data, and then write an mbox file in this format http://en.wikipedia.org/wiki/Mbox#Family.
The problem is just that the xml file didn't separate the To line and the message; it's just one entire string. I'm not entirely sure how to properly handle that.
For example, here is how the message looks:
"Joe 1" <joe1@gmail.com>joe2@gmail.comHello this is an e-mail...
or
"Joe 1" <joe1@gmail.com>"Joe 2" <joe2@gmail.com>Hello this is an e-mail...
There's a lot of test cases to check if it's a .com/.net/com.hk/.co.jp/etc. for the first one. The second one is a bit easier because the end of the to line is >. So, I'm unsure about the first case and ensuring that it's going to be accurate for the 200,000 emails.
Try antlr library for parsing strings.
My first thought for this problem is to use a regexp and a Scanner to find each email occurrence in a loop.
import java.io.File;
import java.util.Scanner;

class EmailScanner {
    public static void main(String[] args) {
        try {
            Scanner s = new Scanner(new File(/* Your file name here. */));
            String token;
            // findWithinHorizon(pattern, 0) searches the whole input, not only the current line.
            while ((token = s.findWithinHorizon(/* Put your email pattern here. */, 0)) != null) {
                /* Write your token where you need it. */
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Possible email patterns can be found easily. For example ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$ or ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.(?:[a-zA-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$; see http://www.regular-expressions.info/email.html. Drop the ^ and $ anchors when searching inside a longer text, because they only match a string that consists of nothing but the address.
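Put together, the Scanner loop might look like the sketch below. It reads from a String instead of a file so that it is self-contained, and it uses an unanchored variant of the first pattern quoted above, with @ as the separator. The class name EmailFinder is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class EmailFinder {

    // No ^ and $ anchors, so addresses are found anywhere in the text.
    private static final Pattern EMAIL =
            Pattern.compile("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,6}");

    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            String token;
            // Horizon 0 means "search to the end of the input", across line breaks.
            while ((token = scanner.findWithinHorizon(EMAIL, 0)) != null) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find("To: alice@example.com\nCc: bob@example.org\nHello"));
    }
}
```

This still cannot tell where an address ends when it runs straight into the message text, which is the hard part of the question.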
Here's a standard email regex modified for your format:
Pattern pattern = Pattern.compile(">[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}");
String text = "\"Joe 1\" <joe1@gmail.com>joe2@gmail.com Hello this is an e-mail...";
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    // Strip the leading delimiter that the pattern matched along with the address.
    System.out.println(matcher.group().replaceFirst(">", ""));
}
It's not going to work if, as in your first example, the email runs directly into the message (joe2@gmail.comHello this), and it assumes your email addresses always begin with >. You can put other delimiters in there, though.
If you know what all the domain suffixes are, you can do this with some regex-fu:
[a-zA-Z_\.0-9]+@[a-zA-Z_\.0-9]+\.(com|edu|org|net|us|tv|...)
You can find a list of top level domain names here: http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
The full regex, I believe, should be this:
[a-zA-Z_\.0-9\-]+@[a-zA-Z_\.0-9\-]+\.(aero|asia|biz|cat|com|coop|info|int|jobs|mobi|museum|name|net|org|pro|tel|travel|xxx|edu|gov|mil|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
Of course, I'm not sure if that's a complete list of TLDs, and I know ICANN recently started allowing custom TLDs, but this should catch the vast majority of the email addresses.
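The known-suffix idea can be sketched in Java as follows. The suffix list here is deliberately tiny and purely illustrative, the class name is made up, and the approach stays ambiguous whenever the message text happens to continue a suffix (an address followed directly by more letters that form another listed suffix, for instance).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class RecipientSplitter {

    // Multi-part suffixes come first so that "co.jp" is tried before shorter ones.
    // The lazy *? stops at the first label sequence that ends in a listed suffix.
    private static final Pattern ADDRESS = Pattern.compile(
            "[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\\.[A-Za-z0-9-]+)*?"
            + "\\.(?:com\\.hk|co\\.jp|com|net|org)");

    static List<String> addresses(String header) {
        List<String> result = new ArrayList<>();
        Matcher m = ADDRESS.matcher(header);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(addresses(
                "\"Joe 1\" <joe1@gmail.com>joe2@gmail.comHello this is an e-mail..."));
    }
}
```

Because the match ends at the suffix, whatever follows the last address can be taken as the start of the message body.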

How to match a word(String) in URL

This website contains many different URLs, but I want my application to visit only the URLs that contain a specific keyword such as "drugs". For example,
if the URLs are
http://website.com/countryname/drug/info/A
http://website.com/countryname/Browse/Alphabet/D?cat=company
it should visit only the first URL. So how can I match a specific keyword such as drug in a URL? I know it can be done with a regexp, but I am new to regexps.
I am using Java here
You can check if string contains a word with method contains().
if(myString.contains("drugs"))
If you need only URLs containing /drug/ try to do something like this:
Pattern p = Pattern.compile("/drug(/|$)");
Matcher m = p.matcher(myURLString);
if(m.find())
{
something_to_do
}
(/|$) means that /drug can be followed only by a slash (/) or by nothing at all ($ means end of the input). So this regex matches if your string looks like .../drug/... or .../drug
Use split() as such:
final String[] words = input.replaceFirst("https?://", "").split("/+");
for (final String word: words)
if ("whatyouwant".equals(word))
//do what is necessary since the word matches
If your code is called very often, you may want to make Patterns out of https?:// and /+ and use Matchers.
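A variant of the split idea that lets java.net.URI strip the scheme, host and query string first is sketched below. The class and method names are made up, and matching case-insensitively is an assumption of this example.

```java
import java.net.URI;
import java.util.Arrays;

// Hypothetical helper, not part of the original answer.
class KeywordFilter {

    // True if one of the path segments equals the keyword (ignoring case).
    static boolean hasPathSegment(String url, String keyword) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return false;
        }
        return Arrays.stream(path.split("/")).anyMatch(keyword::equalsIgnoreCase);
    }

    public static void main(String[] args) {
        System.out.println(hasPathSegment("http://website.com/countryname/drug/info/A", "drug"));
        System.out.println(hasPathSegment("http://website.com/countryname/Browse/Alphabet/D?cat=company", "drug"));
    }
}
```

Comparing whole segments avoids accidental hits on longer words that merely contain the keyword.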
