I'm parsing .mozeml files from Eudora and converting them into an mbox file (the mbox got corrupted and was deleted, but the .mozeml files were left over and I'm unable to import them). There are over 200,000 e-mails, and I'm unsure what a good way to handle this is.
I am thinking of creating a Java program that will read the .mozeml files (they are XML, UTF-8 encoded), parse the data, and then write an mbox file in this format: http://en.wikipedia.org/wiki/Mbox#Family.
The problem is that the XML file doesn't separate the To line from the message; it's one single string. I'm not entirely sure how to properly handle that.
For example, here is how the message looks:
"Joe 1" <joe1#gmail.com>joe2#gmail.comHello this is an e-mail...
or
"Joe 1" <joe1#gmail.com>"Joe 2" <joe2#gmail.com>Hello this is an e-mail...
There are a lot of cases to check for the first one (.com, .net, .com.hk, .co.jp, etc.). The second one is a bit easier because the To line ends with >. So I'm unsure how to handle the first case accurately across all 200,000 e-mails.
Try the ANTLR library for parsing strings.
My first thought for this problem is to use a regex and a Scanner to find each e-mail occurrence in a loop.
import java.io.File;
import java.util.Scanner;

class EmailScanner {
    public static void main(String[] args) {
        try {
            Scanner s = new Scanner(new File(/* Your file name here. */));
            String token;
            do {
                token = s.findInLine(/* Put your email pattern here. */);
                /* Write your token where you need it. */
            } while (token != null);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Possible email patterns are easy to find. For example ^[a-zA-Z0-9._%+-]+#[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}$ or ^[a-zA-Z0-9._%+-]+#[a-zA-Z0-9.-]+\.(?:[a-zA-Z]{2}|com|org|net|edu|gov|mil|biz|info|mobi|name|aero|asia|jobs|museum)$; see http://www.regular-expressions.info/email.html. The ^ and $ anchors make these patterns match a whole string only, so drop them when searching inside a longer line.
Here's a standard email regex modified for your format:
Pattern pattern = Pattern.compile(";[A-Za-z0-9._%+-]+#[A-Za-z0-9.-]+\\.[A-Za-z]{2,4}");
String text = "\"Joe 1\" <joe1#gmail.com>joe2#gmail.com Hello this is an e-mail...";
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
System.out.println(matcher.group().replaceFirst(";", ""));
}
It's not going to work if, as in your first example, the email runs directly into the message (joe2#gmail.comHello this), and it assumes your email addresses always begin with ;. You can put other delimiters in there, though.
If you know what all the domain suffixes are, you can do this with some regex-fu:
[a-zA-Z_\.0-9]+#[a-zA-Z_\.0-9]+\.(com|edu|org|net|us|tv|...)
You can find a list of top level domain names here: http://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
The full regex, I believe, should be this:
[a-zA-Z_\.0-9\-]+#[a-zA-Z_\.0-9\-]+\.(aero|asia|biz|cat|com|coop|info|int|jobs|mobi|museum|name|net|org|pro|tel|travel|xxx|edu|gov|mil|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|ss|st|su|sv|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)
Of course, I'm not sure if that's a complete list of TLDs, and I know ICANN recently started allowing custom TLDs, but this should catch the vast majority of the email addresses.
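As a rough sketch of how the TLD-list idea could be applied to the sample strings from the question: the class name AddressSplitter and the five-entry TLD list are placeholders for illustration only, the alternatives are written without a leading dot, and the # separator is copied from the examples as shown. The domain part is matched greedily and then backtracks to the last dot that is followed by a listed TLD, so an address that runs straight into the message text can still be cut off at the TLD.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressSplitter {
    // Shortened, hypothetical TLD list; a real run would need the full list.
    private static final Pattern ADDRESS = Pattern.compile(
            "[a-zA-Z_.0-9-]+#[a-zA-Z_.0-9-]+\\.(?:com|net|org|hk|jp)");

    // Returns every address found in the text, in order of appearance.
    static List<String> findAddresses(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ADDRESS.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "\"Joe 1\" <joe1#gmail.com>joe2#gmail.comHello this is an e-mail...";
        System.out.println(findAddresses(sample));
    }
}
```

Addresses that appear inside the message body are matched too, so the output still needs a sanity check before it is trusted for all 200,000 messages.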
Related
I have a string like this :
http://schemas/identity/claims/usertype:External
My goal is to split that string into two parts on the colon delimiter, but the regex must not split on the colon in "http://". So the string should be split into:
http://schemas/identity/claims/usertype
External
I have tried regex like this :
(http:\/\/+schemas\/identity\/claims\/usertype)
So it will be :
http://schemas/identity/claims/usertype
:External
After that I would replace the remaining colon with an empty string.
But I don't think that is best practice, and I rarely use regex.
Do you have any suggestions to simplify the regex?
Thanks in advance.
This is an X/Y problem. Fortunately, you asked the question in a great way, by explaining the underlying problem you are trying to solve (namely: Pull some string out of a URL), and then describing the direction you've chosen to solve your problem (which is bad, see below), and then asking about a problem you have with this solution (which is irrelevant, as the entire solution is bad).
URLs aren't parsable like this. You shouldn't treat them as a string you can lop into pieces. For example, the server part can contain a colon too, for the port number. In front of the server part there can be authentication information, which can also contain a colon. It's rarely used, of course.
Try this one, which shows the problem with your approach:
https://joe:joe@google.com:443/
That link just works. Port 443 is the default anyway, and Google ignores the authentication header the request ends up sending, but the point is, a URL may contain this stuff.
But rzwitserloot, it.. won't! I know!
That's a bad programming mindset. That mindset leads to security issues. Why go for a solution that burdens your codebase with unstated assumptions (assumption: the places that provide a URL to this code are under my control and will never send port or auth headers)? If the 'server' part is configurable in a config file, will you mention in said config file that you cannot add a port? Will you remember 4 years from now?
The solution that does it right isn't going to burden your code with all these unstated (or very unwieldy if stated) assumptions.
Okay, so what is the right way?
First, toss that string into the constructor of java.net.URI. Then, use the methods there to get what you actually want, which is the path part. That is a string you can pull apart:
URI uri = new URI("http://schemas/identity/claims/usertype:External");
String path = uri.getPath();
String newPath = path.replaceAll(":.*", "");
String type = path.replaceAll(".*?:", "");
URI newUri = uri.resolve(newPath);
System.out.println(newUri);
System.out.println(type);
prints:
http://schemas/identity/claims/usertype
External
NB: Toss some ports or auth stuff in there, or make it a relative URL - do whatever you like, this code is far more robust in the face of changing the base URL than any attempt to count colons is going to be.
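To illustrate that robustness claim, here is a minimal sketch that wraps the same idea in a method and feeds it a URL with user info and a port. The class name ClaimSplitter, the method name split, and the sample credentials are made up for this example; it uses URI.create, which throws an unchecked IllegalArgumentException instead of URISyntaxException, and it splits at the last colon of the path rather than with replaceAll.

```java
import java.net.URI;

class ClaimSplitter {
    // Splits the URL at the last colon of its path; returns {base URL, suffix}.
    static String[] split(String raw) {
        URI uri = URI.create(raw);
        String path = uri.getPath();
        int colon = (path == null) ? -1 : path.lastIndexOf(':');
        if (colon < 0) {
            // No colon in the path: nothing to split off.
            return new String[] { raw, "" };
        }
        // resolve() keeps scheme, user info, host and port, and swaps in the new path.
        String base = uri.resolve(path.substring(0, colon)).toString();
        return new String[] { base, path.substring(colon + 1) };
    }

    public static void main(String[] args) {
        String[] parts = split("http://joe:secret@schemas:8080/identity/claims/usertype:External");
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

The colons in the user info and in front of the port never reach the splitting logic, because only the path component is inspected.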
Use Negative Lookbehind and split
Regex:
"(?<!(http|https)):"
Regex in context:
public static void main(String[] args) {
String input = "http://schemas/identity/claims/usertype:External";
validateURI(input);
List<String> result = Arrays.asList(input.split("(?<!(http|https)):"));
result.forEach(System.out::println);
}
private static void validateURI(String input) {
try {
new URI(input);
} catch (URISyntaxException e) {
System.out.println("Invalid URI!!!");
e.printStackTrace();
}
}
Output:
http://schemas/identity/claims/usertype
External
I think this might help you:
public class Separator{
public static void main(String[] args) {
String input = "http://schemas/identity/claims/usertype:External";
String[] splitted = input.split("\\:");
System.out.println(splitted[splitted.length-1]);
}
}
Output
External
I have a piece of legacy software called Mixmeister that saved playlist files in an MMP format.
This format appears to contain binary data as well as file paths.
I am looking to extract the file paths, along with any additional information I can, from these files.
I see this has been done in Java (I do not know Java) here (see around line 56):
https://github.com/liesen/CueMeister/blob/master/src/mixmeister/mmp/MixmeisterPlaylist.java
and Haskell here:
https://github.com/larjo/MixView/blob/master/ListFiles.hs
So far, I have tried reading the file as binary (got stuck), using regular expressions (messy output, with moderate success), and attempting to write some code to read chunks (beyond my skill level).
The regex code I am using with moderate success is:
import re

file = 'C:\\Users\\xxx\\Desktop\\mixmeisterfile.mmp'
with open(file, 'r', encoding="Latin-1") as filehandle:
    # with open(file, 'rb') as filehandle:
    for text in filehandle:
        b = re.search('TRKF(.*)TKLYTRKM', text)
        if b:
            print(b.group())
Again, this gets me close but is messy (the data is not all intact and is surrounded by ASCII and binary characters). Basically, my logic just searches between two strings to try to extract the filenames. What I am really trying to do is get closer to what the Java code on GitHub does (the code below is sampled from the GitHub link):
List<Track> tracks = new ArrayList<Track>();
Marker trks = null;
for (Chunk chunk : trkl.getChunks()) {
    TrackHeader header = new TrackHeader();
    String file = "";
    List<Marker> meta = new LinkedList<Marker>();
    if (chunk.canContainSubchunks()) {
        for (Chunk chunk2 : ((ChunkContainer) chunk).getChunks()) {
            if ("TRKH".equals(chunk2.getIdentifier())) {
                header = readTrackHeader(chunk2);
            } else if ("TRKF".equals(chunk2.getIdentifier())) {
                file = readTrackFile(chunk2);
            } else {
                if (chunk2.canContainSubchunks()) {
                    for (Chunk chunk3 : ((ChunkContainer) chunk2).getChunks()) {
                        if ("TRKM".equals(chunk3.getIdentifier())) {
                            meta.add(readTrackMarker(chunk3));
                        } else if ("TRKS".equals(chunk3.getIdentifier())) {
                            trks = readTrackMarker(chunk3);
                        }
                    }
                }
            }
        }
    }
    Track tr = new Track(header, file, meta);
I am guessing this would either use RIFF or the chunk library in Python if not done using a Regex? Although I read the documentation at https://docs.python.org/2/library/chunk.html, I am not sure that I understand how to go about something like this - mainly I do not understand how to properly read the binary file which has the visible mixed in file paths.
I don't really know what's going on here but I'll try my best and if it doesn't work out then please excuse my stupidity. When I had a project for parsing weather data for a Metar, I realized that my main issue was that I was trying to turn everything into a String type, which wasn't suitable for all the data and so it would just come out as nothing. Your for loop should work just fine. However, when you traverse, have you tried making everything the same type, such as a Character/String type? Perhaps there are certain elements messed up simply because they don't match the type you are going for.
I would like to use Java regex to match a domain of a url, for example,
for www.table.google.com, I would like to get 'google' out of the url, namely, the second last word in this URL string.
Any help will be appreciated !!!
It really depends on the complexity of your inputs...
Here is a pretty simple regex:
.+\\.(.+)\\..+
It captures whatever sits between the last two dots (\\.), because the leading .+ is greedy.
And here are some examples for that pattern: https://regex101.com/r/L52oz6/1.
As you can see, it works for simple inputs but not for complex urls.
But why reinvent the wheel? There are plenty of really good libraries that correctly parse any complex URL. Sure, for simple inputs a small regex is easily built. So if this does not solve the problem for your inputs, please let me know and I will adjust the regex pattern.
Note that you can also just use simple splitting like:
String[] elements = input.split("\\.");
String secondToLastElement = elements[elements.length - 2];
But don't forget the index-bound checking.
Or, if you are looking for a very quick solution, walk through the input starting from the last position. Work your way back until you find the first dot, then continue until the second dot is found. Then extract that part with input.substring(index1, index2);.
There is also already a delegate method for exactly that purpose, namely String#lastIndexOf (see the documentation).
Take a look at this code snippet:
String input = ...
int indexLastDot = input.lastIndexOf('.');
// Start the backward search one position before the last dot, otherwise the same dot is found again.
int indexSecondToLastDot = input.lastIndexOf('.', indexLastDot - 1);
String secondToLastWord = input.substring(indexSecondToLastDot + 1, indexLastDot);
Don't forget bounds checking: lastIndexOf returns -1 when no dot is found.
The advantage of this approach is that it is really fast: it scans the string directly, without compiling a regex or building the intermediate array that split creates.
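For inputs that are full URLs rather than bare host names, the library route mentioned earlier can be combined with simple splitting. The following is only a sketch under stated assumptions: the class name HostLabel and the method secondToLastLabel are invented for this example, and a scheme is prepended when the input has none, because java.net.URI only reports a host for URLs that carry one.

```java
import java.net.URI;

class HostLabel {
    // Returns the second-to-last dot-separated label of the host, or null if there is none.
    static String secondToLastLabel(String url) {
        // Assumption: inputs without "://" are bare hosts or host/path strings.
        String withScheme = url.contains("://") ? url : "http://" + url;
        String host = URI.create(withScheme).getHost();
        if (host == null) {
            return null;
        }
        String[] labels = host.split("\\.");
        return labels.length >= 2 ? labels[labels.length - 2] : null;
    }

    public static void main(String[] args) {
        System.out.println(secondToLastLabel("www.table.google.com"));
        System.out.println(secondToLastLabel("https://www.stackoverflow.com/a/b?x=1.2"));
    }
}
```

Dots in the path or query no longer interfere, but two-part suffixes such as co.uk still yield "co", so this only covers the simple case asked about.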
My attempt:
(?<scheme>https?:\/\/)?(?<subdomain>\S*?)(?<domainword>[^.\s]+)(?<tld>\.[a-z]+|\.[a-z]{2,3}\.[a-z]{2,3})(?=\/|$)
Demo. Works correctly for:
http://www.foo.stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.com/
http://stackoverflow.com
https://www.stackoverflow.com
www.stackoverflow.com
stackoverflow.com
http://www.stackoverflow.com
http://www.stackoverflow.co.uk
foo.www.stackoverflow.com
foo.www.stackoverflow.co.uk
foo.www.stackoverflow.co.uk/a/b/c
private static final Pattern URL_MATCH_GET_SECOND_AND_LAST =
    Pattern.compile("www\\.(.*)\\.google\\.(.*)", Pattern.CASE_INSENSITIVE);

String sURL = "www.table.google.com";
Matcher matchURL = URL_MATCH_GET_SECOND_AND_LAST.matcher(sURL);
if (matchURL.find()) {
    String sFirst = matchURL.group(1);  // the part between "www." and ".google."
    String sSecond = matchURL.group(2); // the part after ".google."
}
I want to check whether a URL points to an image, a JavaScript file, a PDF, etc.
String url = "www.something.com/script/sample.js?xyz=xyz";
The regex below works fine, but only without xyz=xyz:
".*(mov|jpg|gif|pdf|js)$"
When I remove the $ at the end, so that .js no longer has to be at the end, it returns false.
.*(mov|jpg|gif|pdf|js).*$ allows you to have any optional text after the file extension. The capturing group captures the file extension. You can see this here.
Use the regex as below:
.*\\.(mov|jpg|gif|pdf|js)\\?
This matches a dot (.) followed by one of your extensions and terminated by ?.
The first dot (.) matches any character, while the second dot, prefixed by \\, matches a literal dot just before your extension list.
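A variation on this idea, sketched here rather than taken from the answer: a lookahead can accept either a ? or the end of the string after the extension, so URLs with and without a query string both match. The class name ExtensionRegex and the method extensionOf are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ExtensionRegex {
    // Literal dot, one of the extensions, then either '?' or end of input (not consumed).
    private static final Pattern EXTENSION = Pattern.compile(
            "\\.(mov|jpg|gif|pdf|js)(?=\\?|$)", Pattern.CASE_INSENSITIVE);

    // Returns the matched extension in lower case, or null if none is found.
    static String extensionOf(String url) {
        Matcher m = EXTENSION.matcher(url);
        return m.find() ? m.group(1).toLowerCase() : null;
    }

    public static void main(String[] args) {
        System.out.println(extensionOf("www.something.com/script/sample.js?xyz=xyz"));
        System.out.println(extensionOf("www.something.com/script/sample.js"));
    }
}
```

Because find() is used instead of matches(), no leading .* is needed. A query value that itself ends in one of the extensions would still produce a match, which is one reason to prefer a real URL parser.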
Why not use java.net.URL to parse the URL string? It avoids a lot of mismatching problems:
try {
    URL url = new URL(urlString);
    // getPath() excludes the query string; getFile() would include "?xyz=xyz".
    String path = url.getPath();
    // Now test whether the path ends with one of your desired extensions.
} catch (Exception e) {
    // In this case the URL cannot be parsed.
}
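A fuller sketch of the parse-then-test idea follows. It swaps in java.net.URI instead of java.net.URL, because the URL constructor rejects strings without a protocol such as the sample from the question, while URI accepts them as relative references. The names ExtensionCheck and hasKnownExtension are placeholders.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ExtensionCheck {
    private static final Set<String> EXTENSIONS =
            new HashSet<>(Arrays.asList("mov", "jpg", "gif", "pdf", "js"));

    // True if the path component (query string excluded) ends in a listed extension.
    static boolean hasKnownExtension(String url) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return false; // not parsable as a URI
        }
        if (path == null) {
            return false;
        }
        int dot = path.lastIndexOf('.');
        // Ignore dots that belong to a directory name rather than the last segment.
        if (dot < 0 || dot < path.lastIndexOf('/')) {
            return false;
        }
        return EXTENSIONS.contains(path.substring(dot + 1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(hasKnownExtension("www.something.com/script/sample.js?xyz=xyz"));
        System.out.println(hasKnownExtension("http://example.com/our.moving.day"));
    }
}
```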
I'm not a big fan of this, but try:
.*\\.(mov|jpg|gif|pdf|js).*$
The problem is that it will accept things like "our.moving.day".
Also, post your code. There is always more than one way to skin a cat, and perhaps something is wrong with your code, not the regex.
Also, try regex testers; there are a ton of them out there. I'm a big fan of:
http://rubular.com/ and http://gskinner.com/RegExr/ (but they are mostly for php/ruby)
How would you use multiple delimiters or a single delimiter to detect and separate out different string matches?
For example, I use a Scanner to parse in the following string:
MrsMarple=new Person(); MrsMarple.age=30;
I would like to separate out this string to determine, in sequence, when a new person is being created and when their age is being set. I also need to know what the age is being set to.
There can be anything between and/or either side of these arguments (there doesn't necessarily have to be a space between them, but the semi-colon is required). The "MrsMarple" could be any word. I would also prefer any arguments following a "//" (two slashes) on the same line to be ignored but that's optional.
If you can think of a simple alternative to using regex I'm more than willing to consider it.
I might try a simple split/loop approach.
Given String input = "MrsMarple=new Person(); MrsMarple.age=30;":
String[] noComments = input.split("//");
// Split the part before any comment into statements on the required semicolon.
String[] statements = noComments[0].split(";");
for (String statement : statements) {
    String[] varValue = statement.split("=");
    ...
    // Additional MrsMarple-SmartSense Technology (tm) here...
    ...
}
with judicious use of String.trim() and or other simple tools.
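One way the split/loop approach might be fleshed out is sketched below. The class name StatementParser, the method describe, and the "create ..." / "age ..." strings it returns are all invented for illustration; it assumes each statement has the form name=new Person() or name.age=value.

```java
import java.util.ArrayList;
import java.util.List;

class StatementParser {
    // Returns one description per recognised statement, in input order.
    static List<String> describe(String line) {
        // Drop everything after a "//" comment marker.
        String code = line.split("//", 2)[0];
        List<String> events = new ArrayList<>();
        for (String statement : code.split(";")) {
            String[] parts = statement.trim().split("=", 2);
            if (parts.length < 2) {
                continue; // blank or not an assignment
            }
            String left = parts[0].trim();
            String right = parts[1].trim();
            if (right.matches("new\\s+Person\\(\\s*\\)")) {
                events.add("create " + left);
            } else if (left.endsWith(".age")) {
                String name = left.substring(0, left.length() - ".age".length());
                events.add("age " + name + " " + right);
            }
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(describe("MrsMarple=new Person(); MrsMarple.age=30; // ignored"));
    }
}
```

The age value is kept as a string here; converting it with Integer.parseInt would be a natural next step once the input is known to be numeric.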
Or, to make the matter more general (and without regexes), you may try scripting, since this looks like script-language syntax: http://java.sun.com/developer/technicalArticles/J2SE/Desktop/scripting/ . Example:
ScriptEngineManager mgr = new ScriptEngineManager();
ScriptEngine jsEngine = mgr.getEngineByName("JavaScript");
String input = "MrsMarple=new Person(); MrsMarple.age=30;";
try {
    jsEngine.eval(input);
} catch (ScriptException ex) {
    ex.printStackTrace();
}
In this case you'll need a Java class called Person with a public field called age. The above code has not been tested; you may need to add something like
jsEngine.eval("importPackage(my.package);");
to make it work. Anyway, Oracle's tutorial should be helpful.