PDFTextStripper parsing with wrong encoding - java

PDFTextStripper stripper = new PDFText2HTML(encoding);
String result = stripper.getText(document).trim();
result contains something like
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
SeLe EE rev</title> <meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </head> <body> <div
style="page-break-before:always;
page-break-after:always"><div><p>&#...
instead of
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd"> <html><head><title>Inserat
SeLe EE rev</title> <meta http-equiv="Content-Type"
content="text/html; charset=utf-8"> </head> <body> <div
style="page-break-before:always; page-break-after:always"><div><p>any
blablabla characters...
When I changed encoding to windows-1252 or utf-8 result not changed. Bad pdf url http://www.permaco.ch/fileadmin/user_upload/jobs/Inserat_SeLe_EE_rev.pdf
How to parse this pdf?

How to parse this pdf?
Short of OCR'ing it you don't.
The PDF in question does not contain the information required to extract text without doing at least some OCR (at least OCR'ing each character of the used font to find a mapping from glyph to character) which would require additional libraries and code.
As a requirement for text extraction the PDF specification ISO 32000-1:2008 correctly states in section 9.10.2 that the font used for the text to extract needs to
either contain a ToUnicode CMap — the font used in your document doesn't —
or be a composite font that uses one of the predefined CMaps listed in Table 118 (except Identity–H and Identity–V) or whose descendant CIDFont uses the Adobe-GB1, Adobe-CNS1, Adobe-Japan1, or Adobe-Korea1 character collection — the font used in your document isn't —
or be a simple font that uses one of the predefined encodings MacRomanEncoding,
MacExpertEncoding, or WinAnsiEncoding, or that has an encoding whose Differences array includes only character names taken from the Adobe standard Latin character set and the set of named characters in the Symbol font — the font used in your document neither uses one of those predefined encodings nor are the character names in its Differences array from those selections mentioned: the names used are /0, /1, ..., /155.
Generally a good first test is to try and copy&paste text using Adobe Reader as much text extraction experience is in the Reader's code. When trying to do so, you'll see that you only get garbage.

Related

Inserting and displaying mathematical symbols in j2ee web application

I have mathematical symbols e.g. alfa, beta,mu . When I copy these symbols in text area they are getting copied. I am copying them from word document. When I insert them into the database using prepared statement the symbols are getting inserted as code. for example the alfa is getting stored as β. This is fine I guess. But when I retrieve them from the database using java.sq.Statement and displaying them in the html page they are getting displayed as code instead of symbol. I mean "β" is displayed in html instead displaying alfa symbol. So how to deal with this situation? how can I store symbols and display them properly in html?
I am using mysql database, java1.7,struts2.0 and tomcat7.
The correct display of HTML characters is: β (Looks like: β) You need to add a semicolon.
1) How are you displaying the codes in HTML?
2) What is the char encoding of machine your are running your server/viewing your html
I had following code and it worked
<html>
<body>
This is alpha α<br/>
This is beta β <br/>
This is gamma Γ <br/>
<body>
</html> as shown below:
This is alpha α
This is beta β
This is gamma Γ
You may need to declare your charset:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
or see the encoding of your server (if its in jsp)
The following tag of struts helped me solving this to an extent.
<s:property value="name" escape="false" />
I hope you're using JSPs. Add this import on top of your JSP which is rendering the symbols:
<%# page contentType="text/html;charset=UTF-8" pageEncoding="UTF-8"%>

how to implement build a selector for HTML DOM elements by its class name using regexp

I have a question here. If I have a html file here.
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title> New Document </title>
<meta name="Generator" content="EditPlus">
<meta name="Author" content="">
<meta name="Keywords" content="">
<meta name="Description" content="">
</head>
<body>
<h1>Welcome to My Homepage</h1>
<p class="intro">My name is Donald.</p>
<h1 class="intro"><p class="important">Note that this is an important paragraph.</p>
</h1>
<div class="intro important"><p class="apple">I live in apple.</p></div>
<div class="intro important">I like apple.</p></div>
<p>I live in Duckburg.</p>
</body>
</html>
Right now I want to get html element by class name.
If the class name is ".intro", it should return:
My name is Donald.
<p class="important">Note that this is an important paragraph.</p>
If the class name is ".intro.important" it should return:
Note that this is an important paragraph.
If the class name is ".intro.important>.apple", it should return:
I live in apple.
I know jquery has class selector this function, but now I want to implement this function.
Can I use java regexp to do this? It seems like that the class name is single string is ok. But if the class name has a child class name, it will make it hard.
One more question, can java get the dom structure of the html?
You can't parse [x]HTML with RegEx
It's that simple, RegExp was not built to cover the full grammar of XML and different tools need to be used for different jobs.
CSS Selectors not readily available
Unfortunately CSS selector parsers are not yet (afaik) a part of DOM parsers so you would need to use an XPath parser to achieve the same things as with CSS selectors.
There are however some projects such as jquery4j.org which port jQuery (+ widgets) to Java, but they don't bring CSS selectors to the table, the bring a lot more and I'm not sure if you really need all that.
XPath Selectors as an alternative to CSS Selectors
DOM parser + XPath parser for Java are the best approach. The DOM parser reads and load the HTML structure as DOM objects while the XPath parser uses (its own different type of selectors) to find objects within the DOM.
But be careful, don't feed the DOM parser huge amounts of HTML code (entire pages) unless you really need it to sift through it all. If you have a smaller piece of string that isolates the targeted area in the HTML where your info is present then it's better to use DOM with that. This is because DOM parsers are memory hungry beasts.
Can I use java regexp to do this?
You can create regex that selects nested content within tag with specific class name.
I can give you regex that finds content within a tag but it doesn't care of class name:
<([a-z][a-z0-9]*+)[^>]*>.*?</\\1>
But if the class name has a child class name, it will make it hard.
In such case it is easier to use java string.
can java get the dom structure of the html?
Yes, it can be done with jsoup at jsoup.org.

URL Encoding Issue with "Œ"

I have read all of the Java URL encoding threads on here but still haven't found a solution to my problem: Google Chrome encodes "BŒUF" to "B%8CUF" POST data, awesome. How can I convince Java to do the same? (The website is <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="fr"> and <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> in case this is important.)
System.out.println(URLEncoder.encode("BŒUF", "utf-8"));
System.out.println(URLEncoder.encode("BŒUF", "iso-8859-1"));
System.out.println(URLEncoder.encode("BŒUF", "iso-8859-15"));
System.out.println(new URI("http","www.google.com","/ig/api","BŒUF", null).toASCIIString());
prints
B%C5%92UF
B%3FUF
B%BCUF
http://www.google.com/ig/api?B%C5%92UF
but not "B%8CUF"?
You are specifically looking for windows-1252 encoding not UTF-8:
System.out.println(URLEncoder.encode("BŒUF", "windows-1252"));
Gives,
B%8CUF

Encode-DecodeAdventure

I tried to encode special characters in javascript using encodeURI() and encodeURIComponent() functions and decode them using the java.net.URLDecoder.decode() method and this worked like a charm in firefox. but it doesn't seem to be working in Internet explorer. Is there any alternative code where the same code would work on both browsers?
Example:
when I pass $%^& as the value, after encoding, it becomes %24%25%5E%26. After decoding using java.net.URLDecoder.decode() method, it becomes $%%5E&
this is the actual value-
var str = "$%^&";
var valueJS = encodeURI(str);
var valueJS = encodeURIComponent(valueJS); // to encode even those chars in valueJS that were not encoded by encodeURI()
this is the encoded value-
String value = "%2524%2525%255E%2526";
while(value.matches(".*%25[A-Za-z0-9]*")) {
value = value.replace("%25", "%"); // manually trying to achieve %24%25%5E%26
}
value = java.net.URLDecoder(value, "UTF-8");
// I was expecting the decoded value to be $%^&, but it turns out to be $%%5E&
Fixed it. Added 2 <meta> tags
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
and used javascript escape() instead of encodeURI() and encodeURIComponent()

How to solve Flex utf-8 encoding

I develop a facebook application using flex' s XMLSocket and Java.
When i type 'ş' character in my client side, it prints, however when i send 'ş' character,
it is printed as ??? or any kind of unpredictable characters.
I tried to change my html file's meta tag to
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
but it did not work.
On the whole how can i get rid of this problem.
Thanks.
Use encodeURIComponent(yourstring), this might do the trick.

Categories