If the input is already plain text and I pass it to new HtmlToPlainText().getPlainText(), a newline character gets added to the result.
It looks like Jsoup is doing some formatting and adding a line break.
HtmlToPlainText htmlToPlainText = new HtmlToPlainText();
htmlToPlainText.getPlainText(Jsoup.parse(inputString));
I tried outputSettings.prettyPrint(false); but it does not help.
Input text can be HTML or plain text.
I want the text to be returned as is (no extra newline) if it is already plain text.
Input: This is the subject for test cnirnv cniornvo cojrpov nmcofrjpv mcprfjv mpcjfpv pvckfpv jvpfkvp cnirv
Output: This is the subject for test cnirnv cniornvo cojrpov nmcofrjpv mcprfjv mpcjfpv \npvckfpv jvpfkvp cnirv.
A newline character is added after mpcjfpv.
We can do string replacement but I am looking for a way to do it as part of the library itself.
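HtmlToPlainText resides in package org.jsoup.examples, which is not included in the library jar file on Maven Central. In other words, this class is not part of the jsoup API and is only meant for demonstration purposes. It also wraps its output at a fixed width (80 characters in the example source), which is where the extra line break comes from.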
If you want to output the plaintext of a parsed document, try something like this instead:
Document doc = Jsoup.parse("This is the subject for test cnirnv cniornvo cojrpov nmcofrjpv mcprfjv mpcjfpv pvckfpv jvpfkvp cnirv");
System.out.println(doc.text());
Related
In an XML file parsed to a Document, I want to get an XML attribute whose value has embedded tabs and newlines.
I've googled and found that the XML parsing spec says the attribute text is "normalized", replacing each whitespace character with a space.
I guess I have to replace the tabs and line breaks with an appropriate escaped character before I parse the XML.
In all of my googling I have not found a straightforward method to get from the File to a Document where the attribute text is returned with Tabs and Line breaks preserved.
The XML file is generated by a third-party application, so the issue cannot be addressed at the source.
I want to use the JDK parser.
My initial attempts at reading the File into a String and parsing the String fail with a parse error on the first byte.
Any suggestions for a straightforward approach?
An example element is on Pastebin: https://pastebin.com/pc9uGbSD
I parse the XML like this:
public ReadPlexExport(Path xmlPath, ExportType exType) throws Exception {
    this.xmlPath = xmlPath;
    this.type = exType;
    this.doc = DBF.newDocumentBuilder().parse(this.xmlPath.toFile());
}
The quick and dirty solution to my immediate problem was to read the XML file line by line as a text file, on each line replacing \t characters with the escaped tab value, writing the line to a new file, then appending an escaped line break.
The new XML files could be parsed. The original XML would always be in a form that allowed this hack as \t and line breaks would only ever occur in Attributes.
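The hack above can be sketched on an in-memory string instead of a file. This is only an illustration: the class name, method names, and sample element are hypothetical, and, as with the original hack, it assumes tabs and line breaks occur nowhere except inside attribute values.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class AttributeWhitespace {

    // Replace literal tabs and line breaks with numeric character references.
    // Assumption: these characters appear only inside attribute values.
    static String escapeWhitespace(String xml) {
        return xml.replace("\r\n", "&#10;")
                  .replace("\n", "&#10;")
                  .replace("\r", "&#10;")
                  .replace("\t", "&#9;");
    }

    // Parse an XML string with the JDK parser and read one attribute of the root element.
    static String rootAttribute(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute(name);
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical element with a literal tab and newline in the attribute value.
        String xml = "<item note=\"first\tsecond\nthird\"/>";
        // Parsed as is, the tab and newline are normalized to spaces.
        System.out.println(rootAttribute(xml, "note").equals("first second third"));
        // Parsed after escaping, the character references keep the original characters.
        System.out.println(rootAttribute(escapeWhitespace(xml), "note").equals("first\tsecond\nthird"));
    }
}
```

Character references such as &#9; and &#10; are not subject to attribute-value normalization, which is why the escaped form survives parsing. A file with a line break outside the root element (for example after the XML declaration) would need that part left untouched.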
I want to check whether a PDF file contains a long string, namely the full text of an XML document.
I can already open both files and extract the text. I've done that with the following code:
// Write the Base64-decoded PDF bytes to a temporary file.
File temp = File.createTempFile("temp-pdf", ".tmp");
try (OutputStream out = new FileOutputStream(temp)) {
    out.write(Base64.decodeBase64(testObject.getPdfAsDoc().getContent()));
}
// Extract the text of the PDF and close the document afterwards.
String pdfText;
try (PDDocument document = PDDocument.load(temp)) {
    pdfText = new PDFTextStripper().getText(document);
}
// Keep only the span from the XML declaration to the closing root tag.
String endTag = "</ServiceSpecificationSchema:serviceSpecification>";
int posS = pdfText.indexOf("<?xml version");
int posE = pdfText.lastIndexOf(endTag) + endTag.length();
pdfText = pdfText.substring(posS, posE);
String xmlText = testObject.getXmlAsDoc().getContent();
Now I have the problem that the lines of the two documents don't match because of formatting differences such as line breaks in the PDF file.
Example lines of TXT output from XML file:
<?xml version="1.0" encoding="UTF-8"?><ServiceSpecificationSchema:serviceSpecification xmlns:xs=" ..... >
Example lines of TXT output from PDF file:
<?xml version="1.0" encoding="UTF-8"?><ServiceSpecificationSchema:serviceSpecification
xmlns:xs=" ..... >
Second, I have page numbers between the XML tags from the PDF. Do you know a good way to remove these lines?
</operations>
Page 51 of 52
</consumerInterface>
What is the best approach to check whether the PDF contains the XML?
I've already tried removing all line breaks and whitespace from both texts and comparing them. But if I do that, I can no longer find the line that contains the difference.
It does not have to be a valid XML file at the end.
I just want to post my solution in case others need it.
My code is a little too large to post here.
Basically, I extract the text from the PDF and remove strings like "Page x" and headlines from it. After that I remove all whitespace, as mentioned above. Finally, I compare the extracted string with the XML character by character so I can tell my users where they went wrong in the text. This method works pretty well, even if the author does not care about formatting and just copies and pastes the whole XML document.
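A minimal sketch of that approach, using only the JDK. The class and method names are hypothetical, and the page-line pattern is an assumption based on the "Page 51 of 52" example in the question; headlines would need their own pattern.

```java
import java.util.regex.Pattern;

public class PdfXmlCompare {

    // Lines such as "Page 51 of 52" that end up between XML tags in the PDF text.
    private static final Pattern PAGE_LINE =
            Pattern.compile("(?m)^\\s*Page \\d+ of \\d+\\s*$");

    // Drop page-number lines, then remove every whitespace character.
    static String normalize(String text) {
        String withoutPages = PAGE_LINE.matcher(text).replaceAll("");
        return withoutPages.replaceAll("\\s+", "");
    }

    // Index of the first differing character, or -1 if the strings are equal.
    static int firstDifference(String a, String b) {
        int common = Math.min(a.length(), b.length());
        for (int i = 0; i < common; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                return i;
            }
        }
        return a.length() == b.length() ? -1 : common;
    }

    public static void main(String[] args) {
        String fromPdf = "</operations>\nPage 51 of 52\n</consumerInterface>";
        String fromXml = "</operations></consumerInterface>";
        // Prints the position of the first mismatch in the normalized strings.
        System.out.println(firstDifference(normalize(fromPdf), normalize(fromXml)));
    }
}
```

The returned index refers to the normalized strings, so showing a few characters around it is probably more useful to a reader than trying to map it back to a line in the original file.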
I have specific logical symbols such as ⇒, ∨, ∧, ¬ and I want to write text containing these symbols to a docx document. The short symbols ∨, ∧, ¬ all work fine, but the symbol ⇒ overlaps the character that follows it instead of being spaced normally.
My code looks like this:
MainDocumentPart mdp = wordMLPackage.getMainDocumentPart();
P p = factory.createP();
R run = factory.createR();
p.getContent().add(run);
Text text = factory.createText();
text.setValue("((q⇒p)∧q)⇒p");
run.getContent().add(text);
mdp.addObject(p);
How can I write wide symbols like ⇒ correctly?
You can use docx4j code generation to get what you want.
Create a document in Word which looks how you want it, then save as docx.
To generate code based on that docx, do one of the following:
1. upload it to http://webapp.docx4java.org/OnlineDemo/PartsList.html, or
2. install/use our Word AddIn; get it at http://www.docx4java.org/forums/docx4jhelper-addin-f30/docx4j-helper-addin-v1-final-available-t2253.html
If you are still having problems, post the XML you created in Word, or the code you generated following the above steps.
I am trying to pull data from Microsoft Word, translate it into SQL statements, and insert it into an Oracle database.
When the data in Word contains a line break created with [Shift-Enter] rather than plain Enter, the extracted text contains an icon that looks like a box with a question mark.
Here ET is a standard newline made with the Enter key and ST is a line break made with the Shift-Enter combination. When I generate the SQL and insert it into Oracle, Oracle treats that character not as text but as hex.
My question is: how do I convert line breaks created by [Shift-Enter] into a standard '\n'?
Thanks
Update
This is how I get the text:
POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(file));
HWPFDocument doc = new HWPFDocument(fs);
WordExtractor we = new WordExtractor(doc);
text = we.getText();
Update Answer:
This was a bug in poi-3.6. In poi-3.8 it shows as \r.
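For converting the remaining control characters to '\n', a small sketch follows. It assumes the soft line break arrives either as \r (as observed with poi-3.8 above) or as a vertical tab (U+000B), which is an assumption about how Word stores Shift-Enter breaks; the class and method names are hypothetical.

```java
public class LineBreaks {

    // Map carriage returns (alone or followed by \n) and vertical tabs to '\n'.
    static String normalizeLineBreaks(String text) {
        return text.replaceAll("\\r\\n?|\\x0B", "\n");
    }

    public static void main(String[] args) {
        // Sample text mixing the three break styles handled above.
        String extracted = "line one\rline two" + (char) 0x0B + "line three\r\nline four";
        System.out.println(normalizeLineBreaks(extracted));
    }
}
```

Applied to the string returned by we.getText(), this leaves ordinary '\n' characters untouched.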
What you're almost certainly seeing are "fields" in the Word document, which are special blocks of text such as links, macros, etc.
Option number one is to continue using WordExtractor, but call stripFields(String) on the resulting text before using it. That'll remove any of these fields from the text for you.
The other option is to use a different way of getting the text out. WordToTextConverter is part of Apache POI; it is more complex code that handles more of the format and should skip these for you (WordExtractor is pretty simple and low level). Alternatively, use Apache Tika, which provides a common way of extracting text from a number of file formats. That does have the proper code to deal with fields, and as an added bonus it'll be trivial for you to support .docx or .pdf when your requirements change!
Can I format a substring of some string (for example the string passed to a Paragraph, as in new Paragraph(someString)) using any kind of markers in iText? Is something like that possible?
For example:
new Paragraph("Congrats, you've [formatMarker]gained[/formatMarker] the privilege") ?
You can have iText parse HTML tags in order to format your text. Here is an example:
Reader reader = new StringReader("<b>Here is Some HTML</b><h1>Hello World</h1>");
HTMLWorker worker = new HTMLWorker(document);
worker.parse(reader);
When you parse, it adds your content to the document, so there is no need to store it in a Paragraph. If you want more functionality and control over the individual elements of the HTML, you can try using the static method HTMLWorker.parseToList().
The API for iText is here http://api.itextpdf.com/itext/ and it has lots of details on both methods used above.