Parsing QIF file - .NET ported to Java - java

I have code that parses a .qif file using .NET. I'm attempting to port this code to Java, but am having trouble with the Regular Expression that does part of the parsing. Here is a sample of the beginning of the file:
!Type:Tag
NAdam
DSon
^
NAllison
^
NAmber
DSabrina's Sister
^
NAnthony
^
In .NET, I can use this code to start the parsing:
// Read the entire file
string input = reader.ReadToEnd();
// Split the file by header types
string[] transactionTypes = Regex.Split(input, @"^(!.*)$", RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
When I debug the .NET parser, I see the following:
transactionTypes[0] = ""
transactionTypes[1] = "!Type:Tag\r"
transactionTypes[2] = "\nNAdam\r\nDSon\r\n^\r\nNAllison\r\n^NAmber\r\nDSabrina's Sister\r\nNAnthony\r\n^
In Java, it seems to always skip the !Type:Tag line, so I don't know the type being parsed. I tried various versions of the Regular Expression in Java, including the following:
String[] transactionTypes = dataToParse.split("!.*");
String[] transactionTypes = dataToParse.split("\\s*^(!.*)\\s*");
String[] transactionTypes = dataToParse.split("\\s*(?m)^(!.*)$\\s*");
When I say it skips the !Type:Tag line, I see the following while debugging:
transactionTypes[0] = ""
transactionTypes[1] = "\nNAdam\r\nDSon\r\n^\r\nNAllison\r\n^NAmber\r\nDSabrina's Sister\r\nNAnthony\r\n^
Any help is appreciated! Thank you in advance!

Are you sure regex is necessary for this? From what I gleaned about the .qif format, it looks more like it was made for reading line by line. Read a line, if it starts with "!" it's a header line, then the following lines are an object, with a line that consists of "^" being a separator between objects, etc. Lots of line-by-line file reading examples in this SO thread:
How to read a large text file line by line using Java?
http://en.wikipedia.org/wiki/Quicken_Interchange_Format
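The line-by-line approach described above could look roughly like the following. It is a minimal sketch, not a full QIF parser: the class name QifLineParser, the parse method, and the nested-list result shape are assumptions made for illustration, and it only applies the rules mentioned in this answer (a line starting with "!" is a header, a line consisting of "^" ends a record). For background on the difference the question ran into: .NET's Regex.Split includes the text of capture groups in its result, while Java's String.split discards whatever the pattern matched, so the header never shows up in the Java array.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class QifLineParser {

    /**
     * Maps each header line (for example "!Type:Tag") to its records.
     * Each record is the list of field lines found before a "^" separator.
     */
    public static Map<String, List<List<String>>> parse(BufferedReader reader) throws IOException {
        Map<String, List<List<String>>> sections = new LinkedHashMap<>();
        List<List<String>> records = null;        // records of the current section
        List<String> fields = new ArrayList<>();  // fields of the current record
        String line;
        // readLine() strips the line terminator, so \r\n and \n files both work
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("!")) {
                // Header line: start (or continue) a section for this type
                records = sections.computeIfAbsent(line, k -> new ArrayList<>());
                fields = new ArrayList<>();
            } else if (line.equals("^")) {
                // Separator line: the current record is complete
                if (records != null) {
                    records.add(fields);
                }
                fields = new ArrayList<>();
            } else if (!line.isEmpty()) {
                fields.add(line);
            }
        }
        return sections;
    }

    public static void main(String[] args) throws IOException {
        String sample = "!Type:Tag\r\nNAdam\r\nDSon\r\n^\r\nNAllison\r\n^\r\n";
        Map<String, List<List<String>>> sections =
                parse(new BufferedReader(new StringReader(sample)));
        System.out.println(sections); // {!Type:Tag=[[NAdam, DSon], [NAllison]]}
    }
}
```

Because the header line is used as the map key, the type being parsed is always available alongside its records.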

Related

Can't print newline character (\n) in strings returned by String.split in Java

I am writing a Java program in which a tab separated values (TSV) file containing two columns of information is read by a BufferedReader and then split into two components (which will serve as [key,value] pairs in a HashMap later in the program) using String.split("\t"). Let's say the first line of the TSV file is as follows:
Key1\tHello world\nProgramming is cool\nGoodbye
The code shown below would separate this line into "Key1" and "Hello world\nProgramming is cool\nGoodbye":
File file = new File("sample.tsv");
BufferedReader br = new BufferedReader(new FileReader(file));
String s = br.readLine();
String[] tokens = new String[2];
tokens = s.split("\t");
The problem now comes in trying to print the second string (i.e. tokens[1]).
System.out.println(tokens[1]);
The line of code above results in the second string being printed with the newline characters (\n) being ignored. In other words, this is printed...
Hello world\nProgramming is cool\nGoodbye
...instead of this...
Hello world
Programming is cool
Goodbye
If I create a new string with the same text as above and use the String.equals() method to compare the two, it returns false.
String str = "Hello world\nProgramming is cool\nGoodbye";
boolean sameString = str.equals(tokens[1]); // false
Why can't special characters in the strings returned by String.split() be printed properly?
BufferedReader.readLine() read your string as one line, as that's how it's represented in the file. BufferedReader didn't read "\n" as ASCII(10) 0x0A; it read two characters, ASCII(92) 0x5C (the backslash) followed by ASCII(110) 0x6E (the letter n).
If you type the input file the way you expect to see it with your text editor, it will print the way you expect.
On a Unix-like system:
echo -e "Hello world\nProgramming is cool\nGoodbye" > InputFile.result_you_want
echo "Hello world\nProgramming is cool\nGoodbye" > InputFile.result_you_get
You could use a program like echo to convert your TSV, but then you will need to split on the actual tab character, ASCII(9) 0x09, and not on a literal backslash followed by t.
String.split takes a regular expression, so escaping matters. In Java source, both "\t" (a real tab character) and "\\t" (the regex escape for a tab) match a tab.
If this is for work, you may want to use a tool or library rather than converting your file with echo.
The question String parsing in Java with delimeter tab "\t" using split has some suggestions.
Searching for Java CSV APIs could also be useful. Most let you set the delimiter character and the line-ending format.
Because, to the computer, the two-character text '\n' is not the same as the single newline character '\n'.
I think the first line of your file looks like key1 Hello world\nProgramming\ncool,
so split can separate it at the tab, but when you print it you only see the text
'\n' and not the newline character that would start a new line.
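The difference between the two-character text and the real newline can be seen in a few lines of Java. This is a small sketch under the assumption that the file contains a real tab but a literal backslash followed by n; the class name EscapedNewlineDemo and the helper unescapeNewlines are made up for illustration.

```java
public class EscapedNewlineDemo {

    /** Replaces the two-character sequence backslash + 'n' with a real line feed (0x0A). */
    public static String unescapeNewlines(String text) {
        // In Java source, "\\n" is a backslash followed by 'n'; "\n" is the line feed
        return text.replace("\\n", "\n");
    }

    public static void main(String[] args) {
        // What readLine() would return for the file line: real tab, literal \n sequences
        String fileLine = "Key1\tHello world\\nProgramming is cool\\nGoodbye";
        String[] tokens = fileLine.split("\t");

        // The string from the question, written in Java source with real newlines
        String fromSource = "Hello world\nProgramming is cool\nGoodbye";

        System.out.println(tokens[1].equals(fromSource));                   // false
        System.out.println(unescapeNewlines(tokens[1]).equals(fromSource)); // true
        System.out.println(unescapeNewlines(tokens[1]));                    // prints three lines
    }
}
```

String.replace works on literal text rather than a regular expression, so no extra regex escaping is needed here.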

How to print the symbols &nbsp; and &lt; and &gt; to a file as space < and > respectively

How do you print symbols in Java to a file when you have only the symbol description?
I received a string from DB2 which contains symbols.
Two samples:
1) &lt;0800&gt;
2) 51V&nbsp;3801Z
Such a string goes to two different places. One is a JSP rendering it as HTML. That is perfect; I get <0800> and 51V 3801Z, respectively. The other place is a CSV file created with java.io.FileWriter, and it does not convert to "<", ">", and " ". Instead, it is printed exactly as it came from DB2:
&lt;0800&gt;
and 51V&nbsp;3801Z.
Is there anything in the "new" nio library that could help me? I have tried org.apache.commons.lang3.StringEscapeUtils.escapeHtml4 without success.
I suggest looking into Apache's StringEscapeUtils, namely the unescapeHtml4() method.
Example:
String input = "&lt;0800&gt;";
String output = StringEscapeUtils.unescapeHtml4(input);
Ensure you are using the unescapeHtml4 method, and not the regular escapeHtml4 method!
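If adding the Apache library is not an option, the same idea can be sketched with plain string replacement. This is only an illustration, assuming the DB2 strings contain just the three entities &lt;, &gt; and &nbsp; mentioned in the question; the class BasicEntityUnescaper and its method are hypothetical and are not a substitute for a full HTML unescaper.

```java
public class BasicEntityUnescaper {

    /**
     * Converts the three entities from the question into plain characters.
     * &nbsp; becomes an ordinary space here, which is what the question asks for.
     */
    public static String unescapeBasic(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&nbsp;", " ");
    }

    public static void main(String[] args) {
        System.out.println(unescapeBasic("&lt;0800&gt;"));   // <0800>
        System.out.println(unescapeBasic("51V&nbsp;3801Z")); // 51V 3801Z
    }
}
```

The converted string can then be passed to the FileWriter as before.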

How does the following code in Pig (in hadoop ) work using Java regular expressions?

I have a CSV file containing the following data:
396124436476092000,Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse,Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
I have written the following code in PigLatin to input data to alias B using delimiters in REGEX_EXTRACT_ALL. This command outputs all data represented by (.*)
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
I want to know how the regex function works with the expression
'(.*)[,”:-](.*)[“,:-](.*)'
to split data into the schema (tweetid,msg,userid)
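One way to explore the expression is to run it through java.util.regex, since the question says Pig uses Java regular expressions. The sketch below assumes the whole line has to match, as Matcher.matches() requires; the class name GreedyGroupsDemo and the short sample line are invented for illustration. It shows that each (.*) is greedy, so the first group runs up to the second-to-last delimiter rather than stopping at the first comma.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyGroupsDemo {

    // The pattern from the question; \u201D and \u201C are the curly quotes,
    // written as escapes so the source file encoding does not matter
    private static final Pattern PATTERN =
            Pattern.compile("(.*)[,\u201D:-](.*)[\u201C,:-](.*)");

    /** Returns the three captured groups, or null if the whole line does not match. */
    public static String[] extract(String line) {
        Matcher m = PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        // Hypothetical line with a comma inside the message text
        String[] groups = extract("1001,first part, second part,some_user");
        for (int i = 0; i < groups.length; i++) {
            System.out.println("group " + (i + 1) + ": [" + groups[i] + "]");
        }
        // group 1: [1001,first part]
        // group 2: [ second part]
        // group 3: [some_user]
    }
}
```

With a line like this, the first group is not a plain number, so treating it as the tweetid would not give the intended value.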

regex extract in hadoop with pipe (escape character) as delimiter

I have a string "Hadoop|regex|Issue". I want to split it using | as the delimiter. I used this code:
String[] afterSplit = string.split("\\|");
but afterSplit contains only 2 strings, "Hadoop" and "regex". I get an ArrayIndexOutOfBoundsException when I try to retrieve afterSplit[2]. I want "Issue" in afterSplit[2].
I also tried the code below:
String regex = Pattern.quote("|");
String[] parts = line.split(regex);
Note: Both work in plain Java code, but I get an error when trying to use them in Hadoop. Please suggest a fix. Thanks.
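Both calls do return three parts for "Hadoop|regex|Issue" in plain Java. One possible explanation, and it is only an assumption because the question does not show the Hadoop input, is that some lines reaching the job have fewer fields or an empty last field, which split drops by default. The sketch below uses made-up names (PipeSplitDemo, splitFields, fieldOrNull) to show a split that keeps trailing empty fields and checks the length before indexing.

```java
import java.util.regex.Pattern;

public class PipeSplitDemo {

    // Pattern.quote turns "|" into a literal, so it is not read as regex alternation
    private static final String PIPE = Pattern.quote("|");

    /** Splits on '|' and keeps trailing empty fields (limit -1). */
    public static String[] splitFields(String line) {
        return line.split(PIPE, -1);
    }

    /** Returns the field at the given index, or null when the line has too few fields. */
    public static String fieldOrNull(String line, int index) {
        String[] parts = splitFields(line);
        return index < parts.length ? parts[index] : null;
    }

    public static void main(String[] args) {
        System.out.println(splitFields("Hadoop|regex|Issue")[2]);  // Issue
        System.out.println("Hadoop|regex|".split("\\|").length);   // 2 (trailing empty field dropped)
        System.out.println(splitFields("Hadoop|regex|").length);   // 3 (trailing empty field kept)
        System.out.println(fieldOrNull("Hadoop|regex", 2));        // null
    }
}
```

Logging the lines for which fieldOrNull returns null would show whether the input, rather than the regex, is the cause.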

JAVA email message - clip quoted lines

Is there a Java library to clip the quoted text from an email message?
For HTML messages, I have so far used an HTML parser and removed the blockquotes from the DOM tree, but I have more trouble with the plain text format.
I tried regex:
emailBody = emailBody.replaceAll("\n>[^\n]*?\n", "\n");
but I'm far from mastering it, so I thought there has to be an existing solution, since I guess this problem concerns many people.
The code above removes every line that starts right after a \n, begins with >, contains no other newline, and ends with \n. I also think the replacement should be done starting from the end of the message, and so on. It's a bit more complicated than just that line of code.
So any help is welcome!
Cheers,
Balázs
Do I understand correctly that you consider each line that starts with a > character a quoted line?
Here's a quick solution:
String[] lines = emailBody.split("\n");
StringBuilder clippedEmailBuilder = new StringBuilder();
for (String line : lines) {
    // Keep only unquoted lines, restoring the newline that split() removed
    if (!line.startsWith(">")) {
        clippedEmailBuilder.append(line).append("\n");
    }
}
emailBody = clippedEmailBuilder.toString();
I'm not sure what you're trying to do with your RE, but if you consider every line starting with '>' to be quoted mail text, you can filter them out with the following:
emailBody = emailBody.replaceAll("(?m)^>.*\n?", "");
The (?m) flag makes ^ match at the start of every line, so this matches each line starting with '>' and replaces it (including the newline) with an empty string. A '>' in the middle of a line is left alone.
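Both suggestions can be tried side by side on a small message. This is a sketch that assumes "\n" line endings and treats every line starting with '>' as quoted; the class QuoteClipper and its method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

public class QuoteClipper {

    /** Line-by-line: drop lines starting with '>' and keep the line breaks of the rest. */
    public static String clipByLines(String body) {
        // limit -1 keeps a trailing empty element, so a final newline survives the join
        return Arrays.stream(body.split("\n", -1))
                .filter(line -> !line.startsWith(">"))
                .collect(Collectors.joining("\n"));
    }

    /** Regex: (?m) makes ^ match at each line start, so only leading '>' counts. */
    public static String clipByRegex(String body) {
        return body.replaceAll("(?m)^>.*\n?", "");
    }

    public static void main(String[] args) {
        String body = "Hi\n> old\n> older\nBye a > b\n";
        System.out.print(clipByLines(body)); // prints "Hi" and "Bye a > b" on two lines
        System.out.print(clipByRegex(body)); // same output
    }
}
```

Neither version handles "\r\n" line endings or other quoting styles, so real messages may need more rules than this.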
