I am scraping a log and need three elements from each entry. An added difficulty is that I parse the log via readLine() in my Java program, i.e. one line at a time. (If it is possible to read multiple lines while parsing, let me know :) ) NOTE: I have no control over the log output format.
There are 2 possibilities for what I must extract. Either the log is nice and gives the following
NICE FORMAT
.text.rank 0x0000000000400b8f 0x351 is_x86.o
where I must grab .text.rank , 0x0000000000400b8f , and 0x351
Now the not so nice case: if the name is too long, it bumps everything else to the next line, as shown below. The only thing after the first element is then one blank space followed by a newline (\n), which gets clobbered by readLine() anyway.
EVIL FORMAT : Note each line is in a separate arraylist entry.
.text.__sfmoreglue
0x0000000000401d00 0x55 /mnt/drv2homelibc_popcorn.a(lib_a-findfp.o)
Therefore what the regex actually sees is:
.text.__sfmoreglue
CORNER CASE FORMAT that also occurs within the log but I DO NOT want
*(.text.unlikely)
Finally below is my Pattern line I am currently using for the first line and pline2 is what is used on the next line when group 2 of the first line is empty.
UPDATE: The pattern below works for the NICE FORMAT and EVIL FORMAT But now pattern pline2 has no matches, even though on regex101.com it is correct. Link: https://regex101.com/r/vS7vZ3/9
UPDATE2: I fixed it, I forgot to add m2.find() once I compiled the second line with Pattern pline2. Corrected code is below.
Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\#a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");
Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");
To give a little background: I first match the name .text.whatever to m.group(1), followed by the address 0x000012345 to m.group(2), and finally the size 0xa48 to m.group(3). This all assumes the log is in the NICE format. If it is in the EVIL format, I see that group(2) is empty, so I read the next line of the log into a temp buffer and apply the second pattern pline2 to that new line.
Can someone help me with the regex?
Is there a way I can make sure my current line (or even better, just the second grouping) is either the NICE FORMAT or is empty?
As requested my java code:
//1st line pattern
Pattern p = Pattern.compile("^[ \\s](\\.[tex]*\\.[\\._\\-\\#a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*).*");
//conditional 2nd line pattern
Pattern pline2 = Pattern.compile("^\\s*([x0-9a-f]*)[ \\s]*([x0-9a-f]*)\\s*[\\w\\(\\)\\.\\-]*");
while ((temp = br1.readLine()) != null) {
    Matcher m = p.matcher(temp);
    while (m.find()) {
        System.out.println("What regex finds: m1:" + m.group(1) + "# m2:" + m.group(2) + "# m3:" + m.group(3));
        if (!m.group(1).isEmpty() && m.group(2).isEmpty() && m.group(3).isEmpty()) {
            // means we probably hit a long symbol name and important stuff is on the next line
            // save the name at least
            name = m.group(1);
            // read and utilize the next line
            if ((temp = br1.readLine()) == null) {
                return;
            }
            System.out.println("EVILline2:" + temp); // sanity check the input
            System.out.println(pline2.toString()); // sanity check the regex
            Matcher m2 = pline2.matcher(temp);
            while (m2.find()) {
                System.out.println("regex line2 finds: m1:" + m2.group(1)); // +"# m2:"+m2.group(2));
                if (m2.group(2).isEmpty()) {
                    size = 0;
                } else {
                    size = Long.parseLong(m2.group(2).replaceFirst("0x", ""), 16);
                }
                addr = Long.parseLong(m2.group(1).replaceFirst("0x", ""), 16);
                System.out.println("#########LONG NAME: " + name + " addr:" + addr + " size:" + size);
            }
        } else { // assume NICE FORMAT
            // do nice format stuff.
        } // end if/else
    } // end while (m.find())
} // end outer while
An Aside, The output I currently get:
line: .text.c_print_results
What regex finds: m1:.text.c_print_results# m2:# m3:
EVIL FORMATline2: 0x00000000004001e6 0x231 c_print_results_x86.o
^\s*([x0-9a-f]*)[ \s]*([x0-9a-f]*)\s*[\w\(\)\.\-]*
Exception in thread "main" java.lang.IllegalStateException: No match found
at java.util.regex.Matcher.group(Matcher.java:536)
at java.util.regex.Matcher.group(Matcher.java:496)
at regexTest.regex.grabSymbolsInRange(regex.java:143)
at regexTest.regex.main(regex.java:489)
You have a few issues with your pattern.
First, the separation between the first and second groups (that's why group 2 comes back empty).
You have 4 groups and you need 3.
After capturing your 3 values you can stop matching, so the pattern after the last group isn't necessary.
On regex101 you need the global modifier /g so it returns all matches. Java has no such modifier; calling find() in a loop does the same job.
So, instead of your posted regex, you can try (written as a Java string literal):
(\\.[tex]*\\.[\\._\\-\\#a-zA-Z0-9]*)\\s*([x0-9a-f]*)[ \\s]+([x0-9a-f]*)
Tested on Regex101.com:
https://regex101.com/r/lM4bQ9/1
Other than that, a few suggestions:
If you know your text is going to start with text, just put it in the pattern. Don't use [tex]*, which matches any run of the letters t, e and x (not just "text") and makes the engine do extra work.
[ \s] is the same as \s.
[\._\-\#a-zA-Z0-9]*, from what I understood, is basically everything but whitespace, so why not just use [^\s]*?
So, with these in mind, I would suggest this pattern instead:
(\\.text\\.[^\\s]*)\\s*([x0-9a-f]*)\\s+([x0-9a-f]*)
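As a rough sketch of how the two-line case could be handled in Java, the code below makes the address and size an optional group on the first line and falls back to a second pattern on the following line when they are missing. The class name MapSymbolParser, the Symbol holder and the hex helper are made up for illustration, and the sketch assumes every address and size carries a lowercase 0x prefix as in the samples from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MapSymbolParser {
    /** Hypothetical holder for one parsed entry. */
    static final class Symbol {
        final String name;
        final long addr;
        final long size;

        Symbol(String name, long addr, long size) {
            this.name = name;
            this.addr = addr;
            this.size = size;
        }
    }

    // Name, optionally followed by address and size on the same line (NICE format).
    private static final Pattern FIRST = Pattern.compile(
            "^\\s*(\\.text\\.\\S+)(?:\\s+(0x[0-9a-fA-F]+)\\s+(0x[0-9a-fA-F]+))?");
    // Address and size at the start of the following line (EVIL format).
    private static final Pattern CONTINUATION = Pattern.compile(
            "^\\s*(0x[0-9a-fA-F]+)\\s+(0x[0-9a-fA-F]+)");

    static List<Symbol> parse(Iterator<String> lines) {
        List<Symbol> out = new ArrayList<>();
        while (lines.hasNext()) {
            Matcher m = FIRST.matcher(lines.next());
            if (!m.find()) {
                continue; // not a symbol line, e.g. "*(.text.unlikely)"
            }
            String addr = m.group(2); // null when the optional group did not match
            String size = m.group(3);
            if (addr == null) { // long name: the values should be on the next line
                if (!lines.hasNext()) {
                    break;
                }
                Matcher m2 = CONTINUATION.matcher(lines.next());
                if (!m2.find()) {
                    continue; // simplification: the unexpected line is dropped
                }
                addr = m2.group(1);
                size = m2.group(2);
            }
            out.add(new Symbol(m.group(1), hex(addr), hex(size)));
        }
        return out;
    }

    // Strips the "0x" prefix and parses the rest as an unsigned 64-bit value.
    private static long hex(String s) {
        return Long.parseUnsignedLong(s.substring(2), 16);
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                " .text.rank 0x0000000000400b8f 0x351 is_x86.o",
                " *(.text.unlikely)",
                " .text.__sfmoreglue ",
                " 0x0000000000401d00 0x55 /mnt/drv2homelibc_popcorn.a(lib_a-findfp.o)");
        for (Symbol s : parse(log.iterator())) {
            System.out.println(s.name + " addr:" + s.addr + " size:" + s.size);
        }
    }
}
```

With a BufferedReader, the iterator can come from br1.lines().iterator(), which keeps the one-line-at-a-time reading from the question.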
I'm having trouble concatenating pieces of text that mix Western and Arabic characters.
I've a list of tokens like this:
-LRB-
دریای
مازندران
-RRB-
,
I use the following procedure to concatenate this list of tokens:
String str = "";
for (String tok : tokens) {
str += tok + " ";
}
This is the output of my procedure:
-LRB- دریای مازندران -RRB- ,
As can be seen, the position of the Arabic words is inverted.
How can I solve this (maybe suggesting to Java to ignore the information about text direction)?
EDIT
Actually, it seems that my problem was a false problem.
Now I've a new one. I need to wrap each word inside a string like this (word *) so that my output will be like this:
(word1 *)(word2 *)(word3 *)...
The procedure that I use is the following:
String str = "";
for (String tok : tokens) {
str += "(" + tok + " *)";
}
However, the result that I got is this:
(-LRB- *)(دریای *)(مازندران *)(-RRB- *)(, *)
instead of:
(-LRB- *)(دریای)(* مازندران *)(-RRB- *)(, *)
** EDIT2 **
Actually, I've discovered that my problem is not a problem. I wrote my string on a file and I opened it with nano (in the console). And it was correctly concatenated.
So the problem was due to the Eclipse console (and also gedit) which --let's say-- incorrectly rendered the string.
Anyway, thanks for your help!
The output is correct, and if you are presenting this text to an Arabic-speaking user you should not override the directionality of the text. Arabic is written from right to left. When you concatenate two Arabic strings together, the first will appear to the right of the second. This is controlled by the BiDi algorithm, the details of which are covered in http://www.unicode.org/reports/tr9/.
First, I would suggest using StringBuilder instead of raw String concatenation. You will make your garbage collector a lot happier. Second, without seeing the input or how your StringTokenizer is set up, I would venture a guess that you are having problems tokenizing the string properly.
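A minimal sketch of the StringBuilder suggestion, applied to the (word *) wrapping from the question. The class name TokenWrapper is hypothetical, and the sample right-to-left token is written with Unicode escapes so the source does not depend on how an editor renders bidirectional text.

```java
import java.util.Arrays;
import java.util.List;

final class TokenWrapper {
    /** Wraps every token as "(token *)" and concatenates the results in list order. */
    static String wrap(List<String> tokens) {
        StringBuilder sb = new StringBuilder();
        for (String tok : tokens) {
            sb.append('(').append(tok).append(" *)");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "\u062F\u0631" is a two-letter right-to-left sample token.
        String out = wrap(Arrays.asList("-LRB-", "\u062F\u0631", "-RRB-", ","));
        // Print the code units in stored order, so the check does not rely on
        // how a console or editor lays out bidirectional text.
        out.chars().forEach(c -> System.out.printf("U+%04X ", c));
        System.out.println();
    }
}
```

The string is stored in logical order regardless of how it is displayed, which is why dumping the code units is a more reliable check than looking at a console.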
I have code that parses a .qif file using .NET. I'm attempting to port this code to Java, but am having trouble with the Regular Expression that does part of the parsing. Here is a sample of the beginning of the file:
!Type:Tag
NAdam
DSon
^
NAllison
^
NAmber
DSabrina's Sister
^
NAnthony
^
In .NET, I can use this code to start the parsing:
// Read the entire file
string input = reader.ReadToEnd();
// Split the file by header types
string[] transactionTypes = Regex.Split(input, @"^(!.*)$", RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.IgnorePatternWhitespace);
When I debug the .NET parser, I see the following:
transactionTypes[0] = ""
transactionTypes[1] = "!Type:Tag\r"
transactionTypes[2] = "\nNAdam\r\nDSon\r\n^\r\nNAllison\r\n^NAmber\r\nDSabrina's Sister\r\nNAnthony\r\n^
In Java, it seems to always skip the !Type:Tag line, so I don't know the type being parsed. I tried various versions of the Regular Expression in Java, including the following:
String[] transactionTypes = dataToParse.split("!.*");
String[] transactionTypes = dataToParse.split("\\s*^(!.*)\\s*");
String[] transactionTypes = dataToParse.split("\\s*(?m)^(!.*)$\\s*");
When I say it skips the !Type:Tag line, I see the following while debugging:
transactionTypes[0] = ""
transactionTypes[1] = "\nNAdam\r\nDSon\r\n^\r\nNAllison\r\n^NAmber\r\nDSabrina's Sister\r\nNAnthony\r\n^
Any help is appreciated! Thank you in advance!
Are you sure regex is necessary for this? From what I gleaned about the .qif format, it looks more like it was made for reading line by line. Read a line, if it starts with "!" it's a header line, then the following lines are an object, with a line that consists of "^" being a separator between objects, etc. Lots of line-by-line file reading examples in this SO thread:
How to read a large text file line by line using Java?
http://en.wikipedia.org/wiki/Quicken_Interchange_Format
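The line-by-line idea could look roughly like the sketch below. It also sidesteps the difference seen in the question: .NET's Regex.Split returns the text of capture groups, while Java's String.split discards everything the pattern matched, so the header line is lost. The class name QifSketch and the nested-list result shape are assumptions made for illustration, not part of any QIF library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class QifSketch {
    /** Groups records (lists of field lines) under the most recent "!" header line. */
    static Map<String, List<List<String>>> parse(Iterable<String> lines) {
        Map<String, List<List<String>>> sections = new LinkedHashMap<>();
        List<List<String>> records = null;
        List<String> current = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim(); // also drops a stray '\r' from Windows files
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("!")) { // header line, e.g. "!Type:Tag"
                records = sections.computeIfAbsent(line, k -> new ArrayList<>());
                current = new ArrayList<>();
            } else if (line.equals("^")) { // end of one record
                if (records != null) {
                    records.add(current);
                }
                current = new ArrayList<>();
            } else {
                current.add(line); // a field such as "NAdam" or "DSon"
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "!Type:Tag", "NAdam", "DSon", "^", "NAllison", "^");
        System.out.println(parse(sample));
    }
}
```

Lines read with BufferedReader.readLine() can be fed in one at a time in the same way, so the whole file never has to be held in a single string.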
I'm trying to read the text file below with a java.util.Scanner in a simple Java program.
0001;GUAJARA-MIRIM;RO
0002;ALTO ALEGRE DOS PARECIS;RO
0003;PORTO VELHO;RO
I read the text file using the code below:
scanner = new Scanner(filerader).useDelimiter("\\;|\\n");
while (scanner.hasNext()) {
int id= scanner.nextInt();
String name = scanner.next();
String code = scanner.next();
System.out.printf(".%s.%s.%d.\n", name, code, id);
}
The results are:
.GUAJARA-MIRIM.RO.1
.
.ALTO ALEGRE DOS PARECIS.RO.2
.
.PORTO VELHO.RO.3
.
But the third token of each line has an inconvenient '\r' character at the end (ASCII code 13). I have no idea why (I used the '.' character in the format string to make it clear where the '\r' is).
So,
Why is there a '\r' at the end of the third token?
How can I bypass it?
It is very simple to use a workaround like code.substring(0, 2), but I want to understand why there's a '\r' character there.
On some platforms (notably Windows), \r\n is used as the line terminator. You are using only \n as a delimiter, so the \r is left behind in the token. Add \r to your delimiters as well.
To make your code a little more robust, you can use System.lineSeparator() to get the platform's line terminator and build the delimiter from it. Note that this only helps when the file was written on the same platform that reads it.
You are using a Windows file, which uses \r\n as line delimiters (aka Carriage Return Line Feed). Unix uses only \n (Line Feed).
To fix this, add \r to your scanner delimiter.
The reason why it happens has already been given. Another way to avoid it is to use scanner.nextLine() and then split the line on ; .
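A small sketch of the delimiter fix, assuming the three-column layout from the question. The delimiter accepts a semicolon or a line break with an optional carriage return, so both Windows and Unix files are tokenized the same way. The class name CityReader and the "|"-joined output are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

final class CityReader {
    /** Reads "id;name;code" rows; ';' and "\n" or "\r\n" all act as delimiters. */
    static List<String> read(String data) {
        List<String> rows = new ArrayList<>();
        try (Scanner scanner = new Scanner(data).useDelimiter(";|\\r?\\n")) {
            while (scanner.hasNextInt()) {
                int id = scanner.nextInt();
                String name = scanner.next();
                String code = scanner.next();
                rows.add(id + "|" + name + "|" + code);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        String windowsFile = "0001;GUAJARA-MIRIM;RO\r\n0002;ALTO ALEGRE DOS PARECIS;RO\r\n";
        read(windowsFile).forEach(System.out::println);
    }
}
```

The same delimiter can be passed to a Scanner built on a file or reader instead of a string.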
I'm trying to send an email in Java but when I read the body of the email in Outlook, it's gotten rid of all my linebreaks. I'm putting \n at the ends of the lines but is there something special I need to do other than that? The receivers are always going to be using Outlook.
I found a page on microsoft.com that says there's a 'Remove line breaks' "feature" in Outlook so does this mean there's no solution to get around that other than un-checking that setting?
Thanks
I've just been fighting with this today. Let's call the behavior of removing the extra line breaks "continuation." A little experimenting finds the following behavior:
Every message starts with continuation off.
Lines less than 40 characters long do not trigger continuation, but if continuation is on, they will have their line breaks removed.
Lines 40 characters or longer turn continuation on. It remains on until an event occurs to turn it off.
Lines that end with a period, question mark, exclamation point or colon turn continuation off. (Outlook assumes it's the end of a sentence?)
Lines that turn continuation off will start with a line break, but will turn continuation back on if they are longer than 40 characters.
Lines that start or end with a tab turn continuation off.
Lines that start with 2 or more spaces turn continuation off.
Lines that end with 3 or more spaces turn continuation off.
Please note that I tried all of this with Outlook 2007. YMMV.
So if possible, end all bullet items with a sentence-terminating punctuation mark, a tab, or even three spaces.
You can force a line break in Outlook by appending one (or two?) tab characters (\t) just before the line break (CRLF).
Example:
This is my heading in the mail\t\n
Just here Outlook is forced to begin a new line.
It seems to work on Outlook 2010. Please test if this works on other versions.
See also Outlook autocleaning my line breaks and screwing up my email format
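A sketch of the tab trick in Java, for a plain-text body that is built as a single string. It rewrites every line break as a tab followed by CRLF. Whether a given Outlook version then keeps the breaks is exactly the behavior reported in the answers here, not something this code can guarantee; the class name OutlookBreaks is hypothetical.

```java
final class OutlookBreaks {
    /** Rewrites every line break in the body as TAB + CRLF. */
    static String tabBeforeBreaks(String body) {
        // \R matches any line break sequence (Java 8+), treating "\r\n" as one
        // unit, so bodies that mix "\n" and "\r\n" are handled uniformly.
        return body.replaceAll("\\R", "\t\r\n");
    }

    public static void main(String[] args) {
        String body = "This is my heading in the mail\nJust here Outlook is forced to begin a new line.";
        // Show the control characters explicitly instead of printing them raw.
        System.out.println(tabBeforeBreaks(body)
                .replace("\t", "\\t").replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```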
You need to use \r\n as a solution.
Microsoft Outlook 2002 and above removes "extra line breaks" from text messages by default (kb308319). That is, Outlook seems to simply ignore line feed and/or carriage return sequences in text messages, running all of the lines together.
This can cause problems if you're trying to write code that will automatically generate an email message to be read by someone using Outlook.
For example, suppose you want to supply separate pieces of information each on separate lines for clarity, like this:
Transaction needs attention!
PostedDate: 1/30/2009
Amount: $12,222.06
TransID: 8gk288g229g2kg89
PostalCode: 91543
Your Outlook recipient will see the information all smashed together, as follows:
Transaction needs attention! PostedDate: 1/30/2009 Amount: $12,222.06 TransID: 8gk288g229g2kg89 PostalCode: 91543
There doesn't seem to be an easy solution. Alternatives are:
You can supply two sets of line breaks between each line. That does stop Outlook from combining the lines onto one line, but it then displays an extra blank line between each line (creating the opposite problem). By "supply two sets of line breaks" I mean you should use "\r\n\r\n" or "\r\r" or "\n\n" but not "\r\n" or "\n\r".
You can supply two spaces at the beginning of every line in the body of your email message. That avoids introducing an extra blank line between each line. But this works best if each line in your message is fairly short, because the user may be previewing the text in a very narrow Outlook window that wraps the end of each line around to the first position on the next line, where it won't line up with your two-space-indented lines. This strategy has been used for some newsletters.
You can give up on using a plain text format, and use an html format.
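The two-space alternative can be sketched as a small helper that indents every line of a plain-text body and joins the lines with CRLF. The class name IndentedBody is made up, and the effect on Outlook is as described in the list above rather than verified here.

```java
final class IndentedBody {
    /** Prefixes every line with two spaces and joins the lines with CRLF. */
    static String indent(String body) {
        StringBuilder sb = new StringBuilder();
        // The limit -1 keeps trailing empty lines instead of dropping them.
        for (String line : body.split("\\R", -1)) {
            if (sb.length() > 0) {
                sb.append("\r\n");
            }
            sb.append("  ").append(line);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(indent(
                "Transaction needs attention!\nPostedDate: 1/30/2009\nAmount: $12,222.06"));
    }
}
```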
I had the same issue, and found a solution. Try this: %0D%0A to add a line break.
I have used html line break instead of "\n" . It worked fine.
Adding "\t\r\n" (\t for TAB) instead of "\r\n" worked for me on Outlook 2010. Note: adding 3 spaces at the end of each line also does the same thing, but that looks like a programming hack!
You need to send HTML emails. With <br />s in the email, you will always have your line breaks.
The trick is to use the encodeURIComponent() functionality from js:
var formattedBody = "FirstLine \n Second Line \n Third Line";
var mailToLink = "mailto:x@y.com?body=" + encodeURIComponent(formattedBody);
RESULT:
FirstLine
SecondLine
ThirdLine
I had been struggling with all of the above solutions and nothing helped here, because I used a String variable (plain text from a JTextPane) in combination with "text/html" formatting in my e-mail library.
So, the solution to this problem is to use "text/plain", instead of "text/html" and no need to replace return characters at all:
MimeBodyPart messageBodyPart = new MimeBodyPart();
messageBodyPart.setContent(message, "text/plain");
For Outlook 2010 and later versions, use \t\n rather than using \r\n.
If you can add in a '.' (dot) character at the end of each line, this seems to prevent Outlook ruining text formatting.
Try \r\c instead of \n.
EDIT: I think @Robert Wilkinson had it right. \r\n. Memory just isn't what it used to be.
The \n largely works for us, but Outlook does sometimes take it upon itself to remove the line breaks as you say.
I also had this issue with the text/plain mail type. Earlier, I used "\n\n", but then there were two line breaks. Then I used "\t\n" and it worked. I was using StringBuffer in Java to append content.
The content got printed in next line in Outlook 2010 mail.
Put the text in <pre> tags and Outlook will format and display the text correctly.
I defined it as inline CSS in the HTML body, like this:
CSS:
pre {
font-family: Verdana, Geneva, sans-serif;
}
I defined the font-family to have the font set.
HTML:
<td width="70%"><pre>Entry Date/Time: 2013-09-19 17:06:25
Entered By: Chris
worklog mania
____________________________________________________________________________________________________
Entry Date/Time: 2013-09-19 17:05:42
Entered By: Chris
this is a new Worklog Entry</pre></td>
Because the body is part of the query string, only percent-escaped characters work, which means %0A gives you a line break. For example,
<a href="mailto:someone@gmail.com?Subject=TEST&body=Hi there,%0A%0AHow are you?%0A%0AThanks">email to me</a>
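When the link is generated from Java rather than written by hand, the percent-escaping can be sketched with URLEncoder. URLEncoder targets form encoding and turns spaces into "+", so the sketch swaps those for %20, which is the form a mailto body expects. The class name MailtoLink and the example address are placeholders.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class MailtoLink {
    /** Builds a mailto: URI whose subject and body are percent-encoded. */
    static String build(String to, String subject, String body) {
        return "mailto:" + to + "?subject=" + encode(subject) + "&body=" + encode(body);
    }

    private static String encode(String s) {
        try {
            // A literal '+' in the input is already encoded as %2B at this point,
            // so every remaining '+' stands for a space.
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("someone@example.com", "TEST",
                "Hi there,\r\n\r\nHow are you?\r\n\r\nThanks"));
    }
}
```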
I also had this issue with the text/plain mail type. Form feed (\f) worked for me.
Sometimes you have to enter \r\n twice to force Outlook to do the break.
This will add one empty line, but all the lines will break.
\r\n will not work until you set body type as text.
message.setBody(MessageBody.getMessageBodyFromText(msg));
BodyType type = BodyType.Text;
message.getBody().setBodyType(type);
I was facing the same issue and here is the code that resolved it:
\t\n - for new line in Email service JavaMailSender
String mailMessage = JSONObject.toJSONString("Your message").replace(",", "\t\n").trim();
RESOLVED IN MY APPLICATION
In my application, I was trying to send an email whose message body was typed by the user in a text area. When the mail was sent, Outlook automatically removed the line breaks entered by the user.
e.g if user entered
Yadav
Mahesh
outlook displayed it as
YadavMahesh
Resolution: I replaced the line break characters "\r\n" with "\par " (remember the space at the end of the RTF code "\par") and the line breaks are restored.
Cheers,
Mahesh
Try this:
message.setContent(new String(body.getBytes(), "iso-8859-1"),
"text/html; charset=\"iso-8859-1\"");
Regards,
Mohammad Rasool Javeed
I have a good solution that I have tried: just add Chr(13) at the end of each line, as in the following example:
Dim S As String
S = "Some Text" & Chr(13)
S = S + "Some Text" & Chr(13)
If the message is text/plain, using \r\n should work;
if the message type is text/html, use <p/>.
If the work needs to be done with formatted text without HTML encoding,
it can easily be achieved with the following approach, which creates a div element on the fly and uses the <pre></pre> HTML element to keep the formatting.
var email_body = htmlEncode($("#Body").val());
function htmlEncode(value) {
return "<pre>" + $('<div/>').text(value).html() + "</pre>";
}
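On the Java side, the same idea can be sketched without a DOM: escape the characters that are special in HTML and wrap the result in a pre element, so the line breaks in the original text survive in an HTML mail. The class name PreWrapper is hypothetical, and only &, < and > are escaped, which is enough for element content but not for attribute values.

```java
final class PreWrapper {
    /** Escapes HTML-special characters and wraps the text in <pre> tags. */
    static String wrapPre(String text) {
        String escaped = text
                .replace("&", "&amp;") // must run first so later entities stay intact
                .replace("<", "&lt;")
                .replace(">", "&gt;");
        return "<pre>" + escaped + "</pre>";
    }

    public static void main(String[] args) {
        System.out.println(wrapPre("Entered By: Chris\nnotes: a < b & c"));
    }
}
```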
Not sure if it was mentioned above but Outlook has a checkbox setting called "Remove extra line breaks in plain text messages" and is checked by default. It is located in a different spot for different versions of Outlook but for 2010 go to the "File" tab. Select "Options => Mail" Scroll down to "Message format" Uncheck the checkbox.