Getting data from multiple-encoded file - java

I'm writing a parser for Thunderbird mail files.
Input:
I have a file containing a load of emails. The file as a whole is in ANSI (Windows-1250), but each message's content is in UTF-8 or ISO-8859-2, as declared in that message's Content-Type header.
Output:
A collection of message contents (bodies).
This is what I do:
Read the whole file into a byte[] variable (still ANSI).
Convert it to a String (UTF-16, but built from the ANSI bytes). I need a String at this point because the next step splits the bunch of messages into single messages.
Split the bunch of messages into separate messages and add each message to a Collection (UTF-16).
Check the Content-Type of each message.
Using the JavaMail API, I call mail.getContent() (UTF-16, I guess, but I'm not sure which encoding is used inside).
This is my problem: I have a String that is in UTF-16, I guess, while its content is e.g. ISO-8859-2, so what should I do now?
I tried Charset and new String(byte[], String charsetName), but none of my attempts worked.
My attempt:
Convert the final String from UTF-16 to UTF-8 (because it's the same number of bytes as in ISO-8859-2).
Get the bytes from UTF-8 and encode them as ANSI.
Decode ANSI to UTF-8.
Encode UTF-8 to ISO-8859-2 (or leave it, if it was already UTF-8).
Decode from ISO-8859-2.
But it's not giving me any good results.
How can I deal with this? Too many decodings for me, and I feel dizzy.
Input (this was stored as a cp1250 file, but I converted it to UTF-8):
From - Thu Dec 08 15:06:14 2011
(some mail header stuff....)
Content-Type: text/html; charset="iso-8859-2"
<table border="0" cellspacing="0" width="600"><tbody><tr><th class="ffield2"><span class="cald-word">clichéd</span> </th><td class="field1"><br>
banal; <b>banalny<b>
<br>
She made a <span class="cald-word">clichéd remark about the importance of friendship.</span>
<br>
<b>Wygԯsiԡ jakѶ banalnѠuwagꡯ wadze przyjaݮi . <br>
<b>
<b> <b><br>
</td></tr></tbody></table>
From - Thu Dec 08 15:42:09 2011
Content-Type: text/html; charset=utf-8
(some mail header stuff....)
<table border="0" cellspacing="0" width="600"><tbody><tr><th class="ffield2">nosiness</th><td class="field1"><br>
<br>
interest in somebody else's business; <b>wścibstwo<b>
<br>
Nosiness is something I can't stand, so stop asking such questions.
<br>
<b>Nie znoszę wścibstwa, więc przestań zadawać takie pytania. <b><b> <br>
<b>
</td></tr></tbody></table>
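One possible way around the chain of conversions described in the question is to never decode the file with the "wrong" charset in a lossy way. The sketch below is only an illustration under several assumptions: messages are separated by lines starting with "From - " as in the sample, the charset is taken from the first charset=... found in each message, and the bodies are not additionally quoted-printable or base64 encoded. The class and method names are hypothetical. ISO-8859-1 is used only as a byte-preserving carrier, so the original bytes of each message can be recovered and decoded with the charset that message declares.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MboxCharsetSketch {
    // Matches both charset=utf-8 and charset="iso-8859-2" in a Content-Type header.
    private static final Pattern CHARSET =
            Pattern.compile("charset=\"?([A-Za-z0-9_.:-]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Splits raw mbox bytes on "From - " lines and decodes each message with its declared charset. */
    static List<String> decodeMessages(byte[] raw) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost in this step.
        String carrier = new String(raw, StandardCharsets.ISO_8859_1);
        List<String> messages = new ArrayList<>();
        // Zero-width split: each chunk keeps its own "From - " separator line.
        for (String chunk : carrier.split("(?m)^(?=From - )")) {
            if (chunk.trim().isEmpty()) {
                continue;
            }
            Matcher m = CHARSET.matcher(chunk);
            // Assumption: fall back to ISO-8859-1 when no charset is declared.
            // Charset.forName throws if the declared name is unknown to the JVM.
            Charset declared = m.find() ? Charset.forName(m.group(1)) : StandardCharsets.ISO_8859_1;
            // Recover the original bytes, then decode them with the declared charset.
            byte[] original = chunk.getBytes(StandardCharsets.ISO_8859_1);
            messages.add(new String(original, declared));
        }
        return messages;
    }
}
```

The same idea applies if the split messages are handed to a mail library instead: pass it the recovered bytes of each message rather than an already decoded String, so that it can apply the declared charset itself.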

Related

Storing and displaying ₹ (rupee symbol) html (xslt) -> java -> Sql Server 2016

I'm having difficulties getting an html page to pick up a rupee symbol (₹), store it into an SQL Server 2016 database and then retrieve it for display.
Important to note here is that I need to enter the actual symbol not the html version.
The basic flow of the page is that an administrator can add a new currency to the application via a web interface. There is a text box where
they would enter the actual rupee symbol ₹ and hit submit. This then passes the parameters via an HttpServletRequest to a java back end.
The java backend just inserts/updates this value to a SQL Server 2016 table in a field nchar(10).
When the page is refreshed it runs a select against this table and displays all the valid currencies.
The problem is that when the Java application retrieves the symbol from the HttpServletRequest object, ₹ becomes â?¹. I can see this in the
debugger. I appreciate that this might just be my debugger being unable to display the character, so I carried on.
The Java (JDBC) code updates the field. When I view the field in SQL Server Management Studio, it displays â?¹ in both text and grid view.
I know that SSMS can display this symbol, because I can insert it directly and it works. So it looks like the information is lost in the HTML-to-Java request.
The web page itself is legacy and built using xslt. I have added some more details below of where I'm up to.
The website runs on tomcat 8 and the pages are built using xslt, the back end is java.
In the front end I have a text field in an EditCurrency page. I enter ₹ in the symbol field and hit submit.
The relevant fragments of the XSLT page that is used to build the front end are:
<!--header indicates page is utf8-->
<xsl:param name="csrfToken"/>
<xsl:param name="currencyFormatError"/>
...
<!-- on submission the EditCurrency java class is called. method=POST indicates it should allow UTF8 request URL's as is my understanding-->
<form id="cmanager" name="cmanager" onsubmit="return(vNewCurrency())" action="../servlet/webpay.website.admin.EditCurrency" method="POST">
<input name="csrfToken" type="hidden" value="{$csrfToken}"/>
<table width="100%" cellpadding="0" cellspacing="0" border="0">
The Tomcat 8 server's server.xml is set to encoding UTF-8. I understand this allows the request/response to handle UTF-8:
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8" />
Java class EditCurrency:
//Retrieves symbol from the HttpServletRequest req
//symbol returns â?¹
String symbol = (String) getParameter(PARAM_SYMBOL);
I've also tried setting the character encoding on the HttpServletRequest req as follows, but it does nothing:
try {
    req.setCharacterEncoding("UTF-8");
} catch (UnsupportedEncodingException ex) {
    java.util.logging.Logger.getLogger(EditCurrency.class.getName()).log(Level.SEVERE, null, ex);
}
Sql Server:
Value â?¹ appears in the nchar(10) field.
Display html:
â?¹ is displayed when the screen is refreshed with this updated value.
So the question is: how do I fix this?
I had considered some sort of reference table of all currencies and their display values etc., but it doesn't seem like the correct way of doing it.
Unless every tool in the chain, including whatever you use to look at the intermediate results, is UTF-8 capable you will see garbage at some point.
The Unicode code point of the Rupee symbol is U+20B9, which is encoded in UTF-8 as the three bytes 0xE2 0x82 0xB9. If you attempt to display those bytes in a tool that uses ISO-8859-1, you see:
0xE2 = â
0x82 = ? (there is no printable character in ISO-8859-1 for code 0x82, so you see a question mark)
0xB9 = ¹
So it appears the symbol is correct in the database, but you are displaying it incorrectly.
To "fix" this problem you must ensure that your tools are all set to UTF-8, and that the web server is configured to include the UTF-8 declaration in the HTML it sends.
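The byte arithmetic above can be checked with a few lines of plain Java. This is only a sketch to reproduce the effect, not a fix for the servlet. The class and method names are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

final class RupeeMojibakeSketch {
    /** Returns the UTF-8 bytes of the text as space-separated hex. */
    static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Simulates a tool that wrongly reads UTF-8 bytes as ISO-8859-1. */
    static String misreadAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Reverses the misreading. Works only if no byte was replaced (e.g. by '?') on the way. */
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String rupee = "\u20B9";
        System.out.println(utf8Hex(rupee));                 // E2 82 B9
        String garbled = misreadAsLatin1(rupee);
        System.out.println(garbled.length());               // 3 chars: U+00E2, U+0082, U+00B9
        System.out.println(repair(garbled).equals(rupee));  // true
    }
}
```

If the middle character has already been turned into a literal question mark somewhere in the chain, the original byte 0x82 is gone and the repair step can no longer recover the symbol.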

JavaMail - Quoted printable remove . (dot) on new line

I'm using JavaMail to create emails. It's almost working, but I'm facing a problem that I have no idea how to resolve.
The quoted-printable Content-Transfer-Encoding breaks my body into many lines of 77 characters each, and the problem happens when the next line starts with a . (dot).
An example of this:
<table border=3D"0" cellpadding=3D"0" cellspacing=3D"0" align=3D"center">
<tbody>
<tr>
<td><br /><font color=3D"#666666" face=3D"Arial, Helvetica, sans-serif=
" size=3D"1">Lala não leleler lala lalalaa, <a href=3D"http://t.laiu=
com.ar/TestsTrackings/op.aspx?Osa8Br5zxNpqrv0AtVqBIiGIGG0CPNrUoxbqY7WYcGhP7=
LrlPvlBijtUAlN+b07u4cgghR7erUuf
P9PWGu7YtTkb51txcLYb9+6jzjBtWhf/L8Ai/gdZjrXfmIamviwsffMsjXa8mtnQm8n/XXkWuDw=
8
gW6EpcofAgSMsqpqmqxv85MRVG2vIFuD9v6lFD1H+dMk0RtR/cMhg/zgtjdIym6pig8sSTDT">c=
lalal lala</a>.</font><br /></td>
</tr>
</tbody>
</table>
On one line I have a link that starts with http://t.laiu... and on the next line the dot is simply removed. When the user receives the email, they get a link like t.laiucom.ar... instead of t.laiu.com.ar.
Does anyone have an idea how I can avoid this?
Thanks in advance.
In the comments you confirmed that you use Message.writeTo to create a file, and that the periods are there in that file.
So the problem is not javamail or the quoted printable encoding here.
The pickup service which picks up the file seems to already expect it to be fit for SMTP transport, as per rfc5321 (or rfc2821/rfc821), which means that periods at the beginning of a line must be doubled. Message.writeTo won't do that directly, because it does not care about the used transport, it just writes the message to a stream.
Usually, when sent to SMTP through javax.mail.Transport javamail handles this by wrapping the output stream in a SMTPOutputStream, so everything works fine. But by using Message.writeTo directly, you're operating on a lower level and need to deal with correctly formatting the output so it is accepted by the pickup service yourself.
That means you need to replace dots at the beginning of a line with two dots yourself. To do so you could use the SMTPOutputStream wrapper class mentioned above (but it's not public/documented API), or write your own stream wrapper which does the same. Or any other way to modify the generated data you like...
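A minimal sketch of the "write your own stream wrapper" option, using only java.io. The class name DotStuffingOutputStream is hypothetical, and the sketch assumes that the pickup service wants exactly the SMTP-style doubling of a leading dot described above and nothing else.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper: doubles a '.' that is the first byte of a line. */
final class DotStuffingOutputStream extends FilterOutputStream {
    // Start as if a line break had just been written, so a '.' at the very start is doubled too.
    private int last = '\n';

    DotStuffingOutputStream(OutputStream out) {
        super(out);
    }

    // FilterOutputStream routes write(byte[]) and write(byte[], int, int) through this
    // method one byte at a time, so overriding it alone is enough (simple, not fast).
    @Override
    public void write(int b) throws IOException {
        if (b == '.' && (last == '\n' || last == '\r')) {
            out.write('.'); // extra dot in front of the real one
        }
        out.write(b);
        last = b;
    }

    /** Small helper for trying the wrapper on a string. */
    static String stuff(String message) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (OutputStream os = new DotStuffingOutputStream(sink)) {
            os.write(message.getBytes(StandardCharsets.ISO_8859_1));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(sink.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}
```

With JavaMail, the wrapper would sit between Message.writeTo and the file stream, i.e. the message is written to a DotStuffingOutputStream that wraps the FileOutputStream of the pickup file.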

Understanding working of enctype = multipart/form-data in spring mvc

In the book "Spring in Action" I read that the default content type of a POST submission is application/x-www-form-urlencoded, which takes the form of name-value pairs separated by ampersands. (I believe these all go in the body payload of the HTTP POST request.)
I further read that, with enctype set to multipart/form-data, each field will be submitted as a distinct part of the POST request and not as just another name-value pair.
Q1> I don't get this line. I come from a REST background and want to understand what has changed in the content of the HTTP POST request.
The server side code
@RequestMapping(method=RequestMethod.POST)
public String addSpitterFromForm(@Valid Spitter spitter,
        BindingResult bindingResult,
        @RequestParam(value="image", required=false)
        MultipartFile image) { // Accept file upload
    if(bindingResult.hasErrors()) {
        return "spitters/edit";
    }
    spitterService.saveSpitter(spitter);
    try {
        if(!image.isEmpty()) {
            validateImage(image); // Validate image
            saveImage(spitter.getId() + ".jpg", image);
        }
    } catch (ImageUploadException e) {
        bindingResult.reject(e.getMessage());
        return "spitters/edit";
    }
    return "redirect:/spitters/" + spitter.getUsername();
}
The client side code
<sf:form method="POST"
modelAttribute="spitter"
enctype="multipart/form-data">
//other stuff
<tr>
<th><sf:label path="fullName">Full name:</sf:label></th>
<td><sf:input path="fullName" size="15" /><br/>
<sf:errors path="fullName" cssClass="error" />
</td>
</tr><tr>
<th><label for="image">Profile image:</label></th>
<td><input name="image" type="file"/>
</tr>
//other stuff
</sf:form>
From the code I am tempted to think that only the input type="file" is sent in a new way, and all the rest are sent as key-value pairs. I think the book is saying the same: "When the form is submitted, it’ll be posted as a multipart form where one of the parts contains the image file’s binary data."
Q2> If what I am thinking is correct, how does the client know which input types to send as key-value pairs and which to send as individual parts?
First of all, the multipart/form-data enctype IS NOT a Spring MVC thing. It is an attribute of <form> in general web development, which means it can be present in your HTML form regardless of the server-side technology. You can read more about it in the HTML 5 Candidate Recommendation, section 4.10.22.7 "Multipart form data", and you can read specifically how the data is sent in RFC 2388. If you review it, you will see that data sent in a POST request with multipart/form-data is not a key/value pair anymore. Instead, it contains multiple parts (yes, it is multi part), where each part looks like this (example taken from RFC 2388):
--AaB03x
content-disposition: form-data; name="field1"
content-type: text/plain;charset=windows-1250
content-transfer-encoding: quoted-printable
Joe owes =80100.
--AaB03x
Note that Joe owes =80100. means Joe owes €100.
You can find another example in HTML 4 Specification where it shows a more concrete example when uploading two or more files (my comments are posted after <--):
Content-Type: multipart/form-data; boundary=AaB03x <-- mark for the whole request
--AaB03x <-- content of a part
Content-Disposition: form-data; name="submit-name" <-- field name
Larry <-- content of the field
--AaB03x <-- content of a part
Content-Disposition: form-data; name="files"
Content-Type: multipart/mixed; boundary=BbC04y
--BbC04y <-- content of a part containing a file
Content-Disposition: file; filename="file1.txt"
Content-Type: text/plain
... contents of file1.txt ...
--BbC04y <-- content of a part containing a file
Content-Disposition: file; filename="file2.gif"
Content-Type: image/gif
Content-Transfer-Encoding: binary
...contents of file2.gif...
--BbC04y-- <-- end of parts containing file
--AaB03x-- <-- end of whole request data
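To make the structure concrete, here is a rough sketch that assembles such a body by hand for plain text fields. It is an illustration only: the class name is hypothetical, file parts are left out, and no escaping of field names or values is attempted. It also shows that with multipart/form-data every field becomes its own part, not only the file input, so the client does not have to choose per field.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MultipartBodySketch {
    /** Builds a multipart/form-data body in which every text field is its own part. */
    static String build(String boundary, Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.append("--").append(boundary).append("\r\n");       // delimiter opens a part
            body.append("Content-Disposition: form-data; name=\"")
                .append(field.getKey()).append("\"\r\n");
            body.append("\r\n");                                     // blank line ends the part headers
            body.append(field.getValue()).append("\r\n");            // the field value is the part body
        }
        body.append("--").append(boundary).append("--\r\n");         // closing delimiter
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("fullName", "Larry");
        fields.put("username", "larry");
        // The request would also carry the header:
        // Content-Type: multipart/form-data; boundary=AaB03x
        System.out.print(build("AaB03x", fields));
    }
}
```

The boundary must not occur inside any field value, which is why real clients generate a long random one.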

Mime source string to object

I have a complete, standard MIME source text string that I need converted to either a Java or PHP object (or both if you want to show off!) so it can be manipulated on these platforms.
I have looked everywhere, but I only seem to find ways to create a message from scratch.
So the example below, for instance, would become an object whose headers or body parts I can change and then resend using the provided classes.
The application that requires this is a distributed one: I supply customers with a small Java program that their local email app can point its SMTP settings at. I have done that part and obtained the MIME string below.
I then want to be able to access and manipulate the various parts, like headers and individual body parts, before sending.
Surely there is some class or library that offers this? If necessary, I can simply send the string to a PHP script if there is a suitable solution in PHP. However, it's on a shared server, so I cannot simply add PHP extensions.
Return-path: <tim@domain_a.com>
Envelope-to: XXXXXXXXXXXX
Delivery-date: Thu, 19 Sep 2013 09:54:17 +0100
Received: from XXXXXXXXXX [61.125]:62344 helo=[192.168.1.10])
by leopard.host-ns.co.uk with esmtpsa (TLSv1:DHE-RSA-CAMELLIA256-SHA:256)
(Exim 4.80.1)
(envelope-from <tXgham@dXm>)
id 1VMa09-000MOc-4T
for tiXham@daXcs.com; Thu, 19 Sep 2013 09:54:17 +0100
Message-ID: <523ABBB6.1080105@datXics.com>
Date: Thu, 19 Sep 2013 09:54:14 +0100
From: Txgham <tiXam@datXics.com>
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:17.0) Gecko/20130801 Thunderbird/17.0.8
MIME-Version: 1.0
To: TiXham <tiXam@daXics.com>
Subject: Re: Example
References: <523ABB49.50403@daXnics.com>
In-Reply-To: <523ABB49.50403@daXhanics.com>
Content-Type: multipart/alternative;
boundary="------------000900010104080404030103"
This is a multi-part message in MIME format.
--------------000900010104080404030103
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Example showing reply subpart and HTML <apage.html>
On 19/09/2013 09:52, TiXgham wrote:
> Example email
--------------000900010104080404030103
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Example showing reply subpart and HTML <br>
<br>
<div class="moz-cite-prefix">On 19/09/2013 09:52, TiXam wrote:<br>
</div>
<blockquote cite="mid:523ABB49.50403@daXanics.com" type="cite">Example
email
<br>
</blockquote>
<br>
</body>
</html>
--------------000900010104080404030103--
This is out of date now, and I never did fully solve it. The issue was deemed platform specific (iPhone) and is no longer relevant after so much time.

MimeMessage Content-Transfer-Encoding issue

Greetings all...
I am hoping somebody can shed some light on the issue I am having.
Reading the Javadoc of MimeMessage's getInputStream(), it says "Return a decoded input stream for this Message's content".
However, this is not what I am experiencing. The output is not decoded. For instance, if I have a message as follows:
Date: Wed, 24 Feb 2010 11:29:13 +1100
From: xxxxxxxxx
To: xxxxxxxxxxxx
Message-ID: <4B8472D9.5050901@xxxxxxxxx>
Subject: xxxxxxxxxxxxxxxxxx
MIME-Version: 1.0
Content-Type: multipart/mixed;
boundary="------------000801030004000206000901"
Content-Transfer-Encoding: quoted-printable
Organization: xxxxxxxxxxxxxxxxxx
User-Agent: Thunderbird 2.0.0.23 (Windows/20090812)
This is a multi-part message in MIME format.
--------------000801030004000206000901
Content-Type: text/plain; charset=3DISO-8859-1; format=3Dflowed
Content-Transfer-Encoding: 7bit
!
--------------000801030004000206000901
Content-Type: text/plain;
name=3D"bla.bla"
Content-Transfer-Encoding: 8bit
Content-Disposition: inline;
filename=3D"bla.bla"
my.username =3D holly
my.host =3D molly
--------------000801030004000206000901--
Then, assuming that I have an object called 'm' constructed with the above content, calling m.getInputStream() and dumping the output to the screen shows those '=3D' sequences.
What did I do wrong?
If I use QPDecoderStream to decode the output of m.getInputStream(), then of course the result is correct. However, that defeats the purpose, because the Javadoc says getInputStream() returns a decoded input stream.
The issue here is that the message is malformed. You're not allowed to set Content-Transfer-Encoding to quoted-printable on a multipart part:
If a Content-Transfer-Encoding header field appears as part of a
message header, it applies to the entire body of that message. If a
Content-Transfer-Encoding header field appears as part of an entity's
headers, it applies only to the body of that entity. If an entity is
of type "multipart" the Content-Transfer-Encoding is not permitted to
have any value other than "7bit", "8bit" or "binary".
You could probably get the top-level MimeMessage's decoded content stream and instantiate a MimeMultipart from it, but that's just hacking around the fundamental problem of a broken message.
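For readers wondering what the '=3D' sequences in the question stand for, here is a small library-free sketch of quoted-printable decoding. It is not JavaMail's QPDecoderStream and not a complete implementation: the class name is hypothetical, the input is assumed to be ASCII, and the decoded bytes are assumed to be ISO-8859-1 text.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class QuotedPrintableSketch {
    /** Decodes "=XX" escapes and removes soft line breaks ("=" at the end of a line). */
    static String decode(String encoded) {
        byte[] in = encoded.getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            byte b = in[i];
            if (b != '=') {                       // ordinary byte, copy as is
                out.write(b);
                i++;
            } else if (i + 1 < in.length && in[i + 1] == '\n') {
                i += 2;                           // soft line break "=\n"
            } else if (i + 2 < in.length && in[i + 1] == '\r' && in[i + 2] == '\n') {
                i += 3;                           // soft line break "=\r\n"
            } else if (i + 2 < in.length
                    && Character.digit(in[i + 1], 16) >= 0
                    && Character.digit(in[i + 2], 16) >= 0) {
                int hi = Character.digit(in[i + 1], 16);
                int lo = Character.digit(in[i + 2], 16);
                out.write((hi << 4) | lo);        // "=3D" becomes the byte 0x3D, i.e. '='
                i += 3;
            } else {
                out.write(b);                     // malformed escape, keep the '=' literally
                i++;
            }
        }
        // Assumption: the decoded bytes are ISO-8859-1 text.
        return new String(out.toByteArray(), StandardCharsets.ISO_8859_1);
    }
}
```

Applied to the whole body of the message in the question, this turns name=3D"bla.bla" back into name="bla.bla", which is why the part headers only become parseable after the outer encoding has been undone.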
