In an XML file parsed into a Document I want to get an XML attribute that has embedded tabs and newlines.
I've googled and found that the XML spec says attribute text is "normalized", replacing whitespace characters with spaces.
I guess I have to replace the tabs and line breaks with appropriately escaped characters before I parse the XML.
In all of my googling I have not found a straightforward way to get from the File to a Document where the attribute text is returned with tabs and line breaks preserved.
The XML file is generated by a third-party application, so the problem may not be fixable there.
I want to use the JDK parser.
My initial attempts at reading the File into a String and parsing the String fail with a parse error on the first byte.
Any suggestions on a straightforward approach?
An example element is on pastebin: https://pastebin.com/pc9uGbSD
I perform the XML parse like this:
public ReadPlexExport(Path xmlPath, ExportType exType) throws Exception {
    this.xmlPath = xmlPath;
    this.type = exType;
    // DBF is a DocumentBuilderFactory held elsewhere in the class
    this.doc = DBF.newDocumentBuilder().parse(this.xmlPath.toFile());
}
The quick and dirty solution to my immediate problem was to read the XML file line by line as a text file, replacing \t characters on each line with the escaped tab value (the character reference &#9;), writing the line to a new file, and then appending an escaped line break (&#10;).
The new XML files could be parsed. The original XML would always be in a form that allows this hack, since \t and line breaks would only ever occur inside attributes.
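That hack works because attribute-value normalization only turns literal tab and line-break characters into spaces; the same characters written as character references (&#9;, &#10;) are kept as-is. A minimal standalone check with the JDK parser (the element and attribute names here are made up):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class AttributeNormalizationCheck {
    public static void main(String[] args) throws Exception {
        // Tab and newline written as character references survive normalization.
        String xml = "<item note=\"line one&#10;line two&#9;tabbed\"/>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        String note = doc.getDocumentElement().getAttribute("note");
        System.out.println(note.contains("\n")); // true
        System.out.println(note.contains("\t")); // true
    }
}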
Related
If the text is already plain text and is passed to new HtmlToPlainText().getPlainText(), then a newline character gets added to the result text.
It looks like Jsoup is doing some formatting and adding a line break.
HtmlToPlainText htmlToPlainText = new HtmlToPlainText();
htmlToPlainText.getPlainText(Jsoup.parse(inputString));
I tried outputSettings.prettyPrint(false); but it is not helping.
Input text can be HTML or plain text.
I want the text to be returned as it is (no extra newline) if it is already plain text.
Input: This is the subject for test cnirnv cniornvo cojrpov nmcofrjpv mcprfjv mpcjfpv pvckfpv jvpfkvp cnirv
Output: This is the subject for test cnirnv cniornvo cojrpov nmcofrjpv mcprfjv mpcjfpv \npvckfpv jvpfkvp cnirv.
A newline character is added after "mpcjfpv".
We can do a string replacement, but I am looking for a way to do it as part of the library itself.
HtmlToPlainText resides in package org.jsoup.examples, which is not included in the library jar file on Maven Central. In other words, this class is not part of the jsoup API and is only meant for demonstration purposes.
If you want to output the plaintext of a parsed document, try something like this instead:
Document doc = Jsoup.parse("This is the subject for test cnirnv cniornvo cojrpov nmcofrjpv mcprfjv mpcjfpv pvckfpv jvpfkvp cnirv");
System.out.println(doc.text());
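For reference, the same call works when the input really is HTML: text() strips the tags and normalizes whitespace either way (the two sample strings below are just illustrations):

String plain = "This is the subject for test";
String html = "<p>This is the <b>subject</b> for test</p>";
System.out.println(Jsoup.parse(plain).text()); // This is the subject for test
System.out.println(Jsoup.parse(html).text());  // This is the subject for test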
Currently, I'm using XMLInputFactory and XMLEventReader to parse XML from an RSS data feed. The description contains HTML tags escaped as &gt; and &lt;. Java reads these as actual tags and thinks it has reached the end of the description, so it cuts off and goes to the next element. How can I exclude the tags from parsing?
I don't use the pull parser (XMLEventReader) much, but I believe that, as with the SAX parser, it can report a text node as a sequence of Characters events rather than as a single event, and it's up to the application to concatenate them. The most likely place for the parser to split the content is at entity boundaries, to avoid bulk copying of character data when expanding entities.
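A sketch of both options (the feed file name and the description element name are assumptions): either ask the factory to coalesce adjacent character data, or concatenate the Characters events yourself.

import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;
import java.io.FileInputStream;

public class DescriptionReader {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Option 1: ask the parser to coalesce adjacent character data into a single event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

        try (FileInputStream in = new FileInputStream("feed.xml")) { // hypothetical feed file
            XMLEventReader reader = factory.createXMLEventReader(in);
            StringBuilder description = new StringBuilder();
            boolean inDescription = false;
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && "description".equals(event.asStartElement().getName().getLocalPart())) {
                    inDescription = true;
                    description.setLength(0);
                } else if (event.isEndElement()
                        && "description".equals(event.asEndElement().getName().getLocalPart())) {
                    inDescription = false;
                    System.out.println(description); // the full text, with &gt;/&lt; expanded to >/<
                } else if (inDescription && event.isCharacters()) {
                    // Option 2: concatenate, never assume the text arrives as one event.
                    description.append(event.asCharacters().getData());
                }
            }
        }
    }
}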
You could temporarily replace every &gt; and &lt; with a specific unique label you know, do your parsing, and then replace the labels with &gt; and &lt; again when you are done, as in the following code.
String original = "<container>&gt;This&lt; is a &gt;test&lt;</container>";
String newStr = original.replace("&gt;", "_TMP_CHARACTER_G_").replace("&lt;", "_TMP_CHARACTER_L_");
System.out.println(original + "\n" + newStr);
// Prints <container>&gt;This&lt; is a &gt;test&lt;</container>
// and <container>_TMP_CHARACTER_G_This_TMP_CHARACTER_L_ is a _TMP_CHARACTER_G_test_TMP_CHARACTER_L_</container>
// [Do your parsing here]
String theTagYouWant = newStr;
String theConvertedTag = theTagYouWant.replace("_TMP_CHARACTER_G_", "&gt;").replace("_TMP_CHARACTER_L_", "&lt;");
System.out.println(theConvertedTag);
// Prints the original String <container>&gt;This&lt; is a &gt;test&lt;</container>
I want to store a URL in a properties file. This is the URL:
jdbc\:sqlserver\://dummydata\\SHARED
When programming this in Java, I obviously need to escape the backslashes, so my code ends up looking like this:
properties.setProperty("db", "jdbc\\:sqlserver\\://dummydata\\\\SHARED");
The issue with this is that the properties file is saving the String URL and including the backslashes used for escaping, which is an incorrect URL. I was hoping that Java would interpret the backslashes used for escaping so that only the correct URL is saved. Is there a way to achieve this?
You're correct that a property value with : needs to escape the colons in a .properties text file, but you're not writing that text file directly.
You are giving the value to a Properties object using setProperty(), and presumably writing that to a text file using store(), and the store() method will escape the values as needed for you.
You should give the value you want to Properties and forget about the encoding rules of the text file; Properties will handle all needed encoding. Since the value you want to store is jdbc:sqlserver://dummydata\SHARED, you write the string literal "jdbc:sqlserver://dummydata\\SHARED".
Example
String db = "jdbc:sqlserver://dummydata\\SHARED";
System.out.println(db); // To see actual string value
Properties properties = new Properties();
properties.setProperty("db", db);
try (FileWriter out = new FileWriter("test.properties")) {
    properties.store(out, null);
}
Output
jdbc:sqlserver://dummydata\SHARED
Content of test.properties
#Tue Jun 11 11:54:24 EDT 2019
db=jdbc\:sqlserver\://dummydata\\SHARED
As you can see, the store() method has escaped the : and \ for you.
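And load() reverses that escaping when the file is read back; a quick check using the test.properties written above:

Properties loaded = new Properties();
try (FileReader in = new FileReader("test.properties")) {
    loaded.load(in);
}
System.out.println(loaded.getProperty("db")); // jdbc:sqlserver://dummydata\SHARED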
If you save the properties as an XML file instead, there's no need to escape anything, and Properties won't.
Example
try (FileOutputStream out = new FileOutputStream("test.xml")) {
    properties.storeToXML(out, null);
}
Content of test.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<entry key="db">jdbc:sqlserver://dummydata\SHARED</entry>
</properties>
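Reading it back works the same way, via loadFromXML():

Properties loadedXml = new Properties();
try (FileInputStream in = new FileInputStream("test.xml")) {
    loadedXml.loadFromXML(in);
}
System.out.println(loadedXml.getProperty("db")); // jdbc:sqlserver://dummydata\SHARED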
Properties.store() escapes backslashes; there is no way around it. I guess my first question is: why is this an issue? Are you reading the file in any way other than using Properties.load()? If not, then you don't need to worry about it, as the load function will remove the escape characters.
properties.load(file);
System.out.println(properties.get("db"));
// output: jdbc\:sqlserver\://dummydata\\SHARED
As an aside, are you sure the URL is correct? Shouldn't you be storing it as properties.setProperty("db", "jdbc:sqlserver://dummydata\\SHARED")?
In the documentation for load, it says the following:
The method does not treat a backslash character, \, before a non-valid escape character as an error; the backslash is silently dropped. For example, in a Java string the sequence "\z" would cause a compile time error. In contrast, this method silently drops the backslash. Therefore, this method treats the two character sequence "\b" as equivalent to the single character 'b'.
This means that a backslash before a character that does not form a valid escape sequence is silently dropped, so two backslashes are loaded as a single backslash. Loading this string should work just fine:
C:\\path\\to\\file
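A quick check of that rule, loading the two-backslash form from an in-memory string (the key name is arbitrary):

Properties p = new Properties();
p.load(new StringReader("path=C:\\\\path\\\\to\\\\file")); // file text: path=C:\\path\\to\\file
System.out.println(p.getProperty("path")); // C:\path\to\file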
I want to check whether a PDF file contains a long string, which is the text of a full XML document.
I can already open both files and extract the text; I've done that with the following code:
File temp = File.createTempFile("temp-pdf", ".tmp");
OutputStream out = new FileOutputStream(temp);
out.write(Base64.decodeBase64(testObject.getPdfAsDoc().getContent()));
out.close();
PDDocument document = PDDocument.load(temp);
PDFTextStripper pdfStripper = new PDFTextStripper();
String pdfText = pdfStripper.getText(document);
Integer posS = pdfText.indexOf("<?xml version");
Integer posE = pdfText.lastIndexOf("</ServiceSpecificationSchema:serviceSpecification>")
        + "</ServiceSpecificationSchema:serviceSpecification>".length();
pdfText = pdfText.substring(posS, posE);
String xmlText = testObject.getXmlAsDoc().getContent();
Now I have the problem that the lines of the two documents don't match because of formatting differences, such as the line breaks the PDF introduces.
Example lines of TXT output from XML file:
<?xml version="1.0" encoding="UTF-8"?><ServiceSpecificationSchema:serviceSpecification xmlns:xs=" ..... >
Example lines of TXT output from PDF file:
<?xml version="1.0" encoding="UTF-8"?><ServiceSpecificationSchema:serviceSpecification
xmlns:xs=" ..... >
Second, I have page numbers between the XML tags in the PDF output. Do you know a good way to remove these lines?
</operations>
Page 51 of 52
</consumerInterface>
What is the best approach to check whether the PDF contains the XML?
I've already tried removing all line breaks and whitespace from both files and comparing them. But if I do that, I can no longer point to the line where the difference is.
It does not have to be a valid XML file at the end.
Just want to post my solution in case others need it.
My code is a little too large to post here.
Basically, I extract the text from the PDF and remove strings like "Page x of y" and headlines from it. After that, I remove all whitespace, as pointed out above. Finally, I compare the extracted string character by character to tell my users where they have gone wrong in the text. This method works pretty well, even if the author does not care about formatting and just copies and pastes the whole XML document.
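A condensed sketch of that cleanup and comparison, reusing pdfText and xmlText from the code above (the "Page x of y" footer pattern is an assumption about this particular PDF):

// Drop page-footer lines, strip all whitespace, then compare character by character.
StringBuilder cleaned = new StringBuilder();
for (String line : pdfText.split("\\R")) {
    if (!line.trim().matches("Page \\d+ of \\d+")) {
        cleaned.append(line);
    }
}
String fromPdf = cleaned.toString().replaceAll("\\s+", "");
String fromXml = xmlText.replaceAll("\\s+", "");
int limit = Math.min(fromPdf.length(), fromXml.length());
for (int i = 0; i < limit; i++) {
    if (fromPdf.charAt(i) != fromXml.charAt(i)) {
        System.out.println("First difference at character " + i + ": '"
                + fromPdf.charAt(i) + "' vs '" + fromXml.charAt(i) + "'");
        break;
    }
}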
I have been trying to use XStreamMarshaller to generate XML output in my Java Spring project. The XML I am generating has CDATA values in the element text. I am manually creating this CDATA text in the command object like this:
f.setText("<![CDATA[cdata-text]]>");
The XStreamMarshaller generates the element (text-data below is an alias) as:
<text-data><![CDATA[cdata-text]]></text-data>
The above XML display is as expected. But when I do a View Source on the generated XML output, I see this for the element: <text-data>&lt;![CDATA[cdata-text]]&gt;</text-data>.
Issue:
As you can see, the less-than and greater-than characters have been replaced by &lt; and &gt; in the View Source. I need my client to read the source and identify the CDATA section in the XML output, which it will not be able to do in the above scenario.
Is there a way I can get the XStreamMarshaller to leave the CDATA markers in the text I provided unescaped?
I have set the encoding of the Marshaller to ISO-8859-1 but that does not work either. If the above cannot be done by XStreamMarshaller can you please suggest alternate marshallers/unmarshallers that can do this for me?
Edit: Displaying my XML and View Source as suggested by Paŭlo Ebermann below:
XML View (as displayed in IE):
An invalid character was found in text content. Error processing resource 'http://localhost:8080/file-service-framework/fil...
Los t
View Source:
<service id="file-text"><text-data><![CDATA[
Los túneles a través de las montañas hacen más fácil viajar por carretera.
]]></text-data></service>
Thank you very much.
Generating CDATA sections is the task of your XML-generating library, not of its client. So you should simply have to write
f.setText("cdata-text");
and then the library can decide whether to use <![CDATA[...]]> or &lt;-escaping for its contents. It should make no difference to the receiver.
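To illustrate the point (plain XStream is used here for brevity; TextData and the text-data alias are hypothetical stand-ins for the command class in the question, and the field layout will differ from the real one):

import com.thoughtworks.xstream.XStream;

public class CdataEscapeDemo {
    // Hypothetical stand-in for the command object from the question.
    static class TextData {
        String text;
        void setText(String text) { this.text = text; }
    }

    public static void main(String[] args) {
        TextData f = new TextData();
        // Raw text only; no manual <![CDATA[...]]> wrapping.
        f.setText("<b>Los túneles</b> a través de las montañas");

        XStream xstream = new XStream();
        xstream.alias("text-data", TextData.class);
        // The library escapes the special characters in the text itself;
        // the receiver decodes them back to the same character data.
        System.out.println(xstream.toXML(f));
    }
}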
Edit:
Looking at your output, it looks right (apart from the CDATA); here you must work on your input, as said above.
If IE throws an error here, most probably you have not declared the right encoding.
I don't really know much about the Spring framework, but the encoding used by the Marshaller should be the same as the encoding sent in either the HTTP header (Content-Type: ...;charset=...) or the <?xml version="1.0" encoding="..." ?> XML prologue (and these two should not differ either).
I would recommend UTF-8 as encoding everywhere, as this can represent all characters, not only the Latin-1 ones.