regex extract in hadoop with pipe (escape character) as delimiter

I have a string "Hadoop|regex|Issue" and want to split it using | as the delimiter. I used this code:
String[] afterSplit = string.split("\\|");
but afterSplit contains only 2 strings, "Hadoop" and "regex". I get an ArrayIndexOutOfBoundsException when I try to retrieve afterSplit[2]. I want "Issue" in afterSplit[2].
I also tried the code below:
String regex = Pattern.quote("|");
String[] parts = line.split(regex);
Note: both work in plain Java code, but I get the error when I run them in Hadoop. Please suggest a fix. Thanks.
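For reference, here is a minimal sketch of the two approaches in plain Java, where both yield three parts. The class and method names are made up for illustration, and nothing in it is Hadoop-specific. If the same call returns only two elements inside a job, logging the raw line and the array length may show whether the record that reaches the code really contains a second pipe.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Hadoop or of the original code.
class PipeSplitSketch {
    // Escaped pipe: in a regex, a bare | means alternation.
    static String[] splitEscaped(String line) {
        // limit -1 keeps trailing empty fields, e.g. "a|b|" -> ["a", "b", ""]
        return line.split("\\|", -1);
    }

    // Pattern.quote wraps the delimiter so it is treated literally.
    static String[] splitQuoted(String line) {
        return line.split(Pattern.quote("|"), -1);
    }

    public static void main(String[] args) {
        String line = "Hadoop|regex|Issue";
        System.out.println(Arrays.toString(splitEscaped(line)));
        System.out.println(Arrays.toString(splitQuoted(line)));
    }
}
```

The -1 limit is optional here. It only matters when a record ends with empty fields, which split() drops by default.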

Related

Java Pattern for regex

I have a two-column TSV file whose first column holds values such as
1-FN3Z1-206329557431
1-FN411-153115736976
I am trying to remove the first two parts of each value (i.e. to extract 206329557431 and 153115736976). I used an online regex tool to generate these patterns:
".*?\\d+.*?\\d+.*?(\\d+)" and ".*?\\d+.*?\\d+.*?\\d+.*?(\\d+)"
Independently they work fine, but I'm looking for a single combined regex pattern. Any pointers on how this can be done?

Why not use split? For example:
String spl = "1-FN411-153115736976".split("-")[2];
If you want a regex, you can use (.*?-){2}(.*), which means "get everything after the second -"; the result is in group 2.
Output
206329557431
153115736976
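Both ideas can be sketched side by side as follows. The class and method names are hypothetical, and the regex uses a non-capturing variant of the pattern above so that the wanted text lands in group 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class ThirdFieldSketch {
    // Lazily skip two "-"-terminated prefixes, then capture the remainder.
    private static final Pattern AFTER_SECOND_DASH = Pattern.compile("(?:.*?-){2}(.*)");

    static String viaRegex(String value) {
        Matcher m = AFTER_SECOND_DASH.matcher(value);
        // matches() requires the whole value to fit the pattern
        return m.matches() ? m.group(1) : null;
    }

    static String viaSplit(String value) {
        // limit 3 leaves any further dashes inside the last part
        String[] parts = value.split("-", 3);
        return parts.length == 3 ? parts[2] : null;
    }

    public static void main(String[] args) {
        for (String v : new String[] {"1-FN3Z1-206329557431", "1-FN411-153115736976"}) {
            System.out.println(viaRegex(v) + " " + viaSplit(v));
        }
    }
}
```

Returning null for values with fewer than two dashes is just one possible choice; throwing an exception would work as well.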

If the strings in your TSV file all have the same width and pattern, then you can just use substring here:
String tsv = "1-FN3Z1-206329557431";
System.out.println(tsv.substring(8));

How about this regexp: .-.{5}-? It looks like it matches all the values, but that depends on your format.
Here is a Java code example:
@Test
public void test() {
    String test = "1-FN3Z1-206329557431 1-FN411-153115736976";
    String result = test.replaceAll(".-.{5}-", "");
    assertEquals("206329557431 153115736976", result);
}

Java regular expression not working for MULTILINE entries

This is my Java code:
String patternParticipants = "([\\w\\.=-]+@[\\w\\.-]+\\.[\\w]{2,3}($|\n))*";
Pattern p = Pattern.compile(patternParticipants, Pattern.MULTILINE);
boolean matchesParticipants = p.matcher(reservation.getParticipants().trim()).matches();
And I want to match the following string:
john.wales@gmail.com
david.chrome@gmail.com
david.mika@gui.co
For some reason, the matcher returns true only if a single email address is given.
I've tried setting MULTILINE, but that does not seem to work either. Any ideas?

Strip the newline characters first, and then run your regex on the result.
Use code like this to strip the '\n':
String text = readFileAsString("textfile.txt");
text = text.replace("\n", ""); // strings are immutable, so keep the returned value
P.S.: This assumes your data is in textfile.txt.
Then use this regex:
"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+$"

This looks for a newline (\n) or the end of the string ($) after each address:
static boolean isValidEmailList(String str)
{
    return str.matches("([\\w\\.=-]+@[\\w\\.-]+\\.[\\w]{2,3}($|\n))+");
}
Your regular expression will erroneously reject many valid email addresses and allow some invalid ones. But this is one way to make it work on your multi-line input.
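An alternative that avoids encoding line endings in the pattern is to split the input into lines and check each one separately. This is only a sketch: it assumes the addresses use the usual @ separator, that the input may contain Windows (\r\n) line endings, and that Java 8 or later is available for \R. The class name is made up.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class EmailListSketch {
    // The question's address pattern, without the line-ending group.
    private static final Pattern ADDRESS = Pattern.compile("[\\w.=-]+@[\\w.-]+\\.\\w{2,3}");

    static boolean isValidEmailList(String input) {
        String trimmed = input.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        // \R (Java 8+) matches \n, \r\n and other line breaks
        for (String line : trimmed.split("\\R")) {
            if (!ADDRESS.matcher(line.trim()).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String list = "john.wales@gmail.com\r\ndavid.chrome@gmail.com\r\ndavid.mika@gui.co";
        System.out.println(isValidEmailList(list));
    }
}
```

Checking line by line also makes it easy to report which address failed, should that be needed later.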

You should use this one for email:
^[_A-Za-z0-9-\+]+(\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\.[A-Za-z0-9]+)*(\.[A-Za-z]{2,})$

Parsing QIF file - .NET ported to Java

I have code that parses a .qif file using .NET. I'm attempting to port this code to Java, but am having trouble with the Regular Expression that does part of the parsing. Here is a sample of the beginning of the file:
!Type:Tag
NAdam
DSon
^
NAllison
^
NAmber
DSabrina's Sister
^
NAnthony
^
In .NET, I can use this code to start the parsing:
// Read the entire file
string input = reader.ReadToEnd();
// Split the file by header types
string[] transactionTypes = Regex.Split(input, @"^(!.*)$", RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
When I debug the .NET parser, I see the following:
transactionTypes[0] = ""
transactionTypes[1] = "!Type:Tag\r"
transactionTypes[2] = "\nNAdam\r\nDSon\r\n^\r\nNAllison\r\n^NAmber\r\nDSabrina's Sister\r\nNAnthony\r\n^
In Java, it seems to always skip the !Type:Tag line, so I don't know the type being parsed. I tried various versions of the Regular Expression in Java, including the following:
String[] transactionTypes = dataToParse.split("!.*");
String[] transactionTypes = dataToParse.split("\\s*^(!.*)\\s*");
String[] transactionTypes = dataToParse.split("\\s*(?m)^(!.*)$\\s*");
When I say it skips the !Type:Tag line, I see the following while debugging:
transactionTypes[0] = ""
transactionTypes[1] = "\nNAdam\r\nDSon\r\n^\r\nNAllison\r\n^NAmber\r\nDSabrina's Sister\r\nNAnthony\r\n^
Any help is appreciated! Thank you in advance!

Are you sure regex is necessary for this? From what I gleaned about the .qif format, it looks more like it was made for reading line by line. Read a line, if it starts with "!" it's a header line, then the following lines are an object, with a line that consists of "^" being a separator between objects, etc. Lots of line-by-line file reading examples in this SO thread:
How to read a large text file line by line using Java?
http://en.wikipedia.org/wiki/Quicken_Interchange_Format
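The line-by-line idea could look roughly like the sketch below. It also sidesteps the difference behind the missing header: .NET's Regex.Split includes the text of capturing groups in its result, while Java's String.split never returns the delimiter text. The class name, the method name and the map-of-lists result shape are assumptions made for illustration, not part of any QIF library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only.
class QifSketch {
    // Groups records (lists of field lines) under the header line that precedes them.
    static Map<String, List<List<String>>> parse(String input) {
        Map<String, List<List<String>>> sections = new LinkedHashMap<>();
        List<List<String>> records = null;
        List<String> current = new ArrayList<>();
        for (String line : input.split("\\R")) {
            if (line.startsWith("!")) {
                // Header line such as "!Type:Tag" starts a new section
                records = sections.computeIfAbsent(line, k -> new ArrayList<>());
                current = new ArrayList<>();
            } else if (line.equals("^")) {
                // "^" closes the current record
                if (records != null && !current.isEmpty()) {
                    records.add(current);
                }
                current = new ArrayList<>();
            } else if (!line.isEmpty()) {
                current.add(line);
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String sample = "!Type:Tag\r\nNAdam\r\nDSon\r\n^\r\nNAllison\r\n^\r\n";
        System.out.println(parse(sample));
    }
}
```

For a large file, the same loop would work over a BufferedReader instead of a string that holds the whole file.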

Unable to split a string

I have a string
Mr praneel PIDIKITI
When I split it with this regular expression
String[] nameParts = name.split("\\s+");
instead of getting three parts I only get two: Mr and praneel PIDIKITI.
The second part is never split. Does anyone know what the problem could be?
I even tried split(" ");.
The string comes from HTML: I use replaceAll("\\<.*?>", " ").trim(); to strip the tags and then name.split("\\s+"); to get the name parts.
I think the separator must be something other than a space (some special character).

Your code should work, so I suspect your input. There could be a non-printable junk character between praneel and PIDIKITI. For example,
String name = "Mr praneel" + (char)1 + "PIDIKITI";
String[] nameParts = name.split("\\s+");
for (String s : nameParts)
    System.out.println(s);
Are you sure that there is no junk character between praneel and PIDIKITI?
Remove non-printable characters like this:
// remove non-printable characters, excluding whitespace characters
name = name.replaceAll("[^\\p{Print}\\s]", "");
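Since the text was extracted from HTML, one plausible culprit (an assumption, not something the question confirms) is a no-break space, U+00A0, which is what &nbsp; decodes to. Java's \s does not match that character by default, but it does when the UNICODE_CHARACTER_CLASS flag is enabled through (?U). A minimal sketch with a made-up class name:

```java
import java.util.Arrays;

// Hypothetical helper class for illustration only.
class WhitespaceSplitSketch {
    // (?U) turns on UNICODE_CHARACTER_CLASS, so \s also matches U+00A0 (no-break space).
    static String[] splitName(String name) {
        return name.trim().split("(?U)\\s+");
    }

    public static void main(String[] args) {
        // Assumed input: a no-break space between the last two words
        String name = "Mr praneel\u00A0PIDIKITI";
        // Plain \s does not match U+00A0, so this yields two parts
        System.out.println(name.split("\\s+").length);
        // With (?U) the name splits into three parts
        System.out.println(Arrays.toString(splitName(name)));
    }
}
```

Printing the code point of each character, for example with name.codePoints(), is a quick way to confirm which character is actually there before picking a fix.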

If you're parsing HTML, may I recommend JSoup? It's a good HTML parser for Java.

JAVA email message - clip quoted lines

Is there a Java library to clip the quoted text from an email message?
For HTML messages, I have so far used an HTML parser and removed the blockquotes from the DOM tree, but I have more trouble with the plain-text format.
I tried a regex:
emailBody = emailBody.replaceAll("\n>[^\n]*?\n", "\n");
but I'm far from mastering regexes, so I thought there must be an existing solution, since I guess this problem concerns many people.
The code above replaces every line that follows a newline, begins with > and ends with \n. I also think the replacement should start from the end of the message, and so on. It's a bit more complicated than just that line of code.
So any help is welcome!
Cheers,
Balázs

Do I understand correctly that you consider each line that starts with a > character a quoted line?
Here's a quick solution:
String[] lines = emailBody.split("\n");
StringBuilder clippedEmailBuilder = new StringBuilder();
for (String line : lines)
    if (!line.startsWith(">"))
        clippedEmailBuilder.append(line).append("\n"); // split() dropped the newline, so restore it
emailBody = clippedEmailBuilder.toString();

I'm not sure what you're trying to do with your regex, but if every line starting with '>' counts as quoted mail text, you can filter those lines out with the following:
emailBody = emailBody.replaceAll("(?m)^>.*\n?", "");
With (?m), ^ anchors at the start of each line, so this matches every line starting with '>' and replaces it (including the newline) with an empty string.
