Thanks for the previous replies.
I am new to XML parsing in Java. Can anyone guide me on how to modify an XML document using a SAX parser? I have searched for a long time for a way to delete a tag, but I don't know how to do it. Please guide me.
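You can remove an element like this (SAXReader here is dom4j's org.dom4j.io.SAXReader, which builds an in-memory Document):
SAXReader reader = new SAXReader();
reader.setEncoding(CharEncoding.UTF_8);
Document customXmlDocument = reader.read(inputStream);
// Look up the element you want to remove, then detach it from its parent.
// "//tagToRemove" is a placeholder XPath; XPath lookups in dom4j need Jaxen on the classpath.
Node unwanted = customXmlDocument.selectSingleNode("//tagToRemove");
if (unwanted != null) {
    unwanted.detach();
}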
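If you would rather stay with plain SAX than load a tree, one option is to filter the event stream and write out what remains. The sketch below uses only JDK classes: an XMLFilterImpl subclass swallows the events of one element, and an identity Transformer serializes the rest. The class name DropElementFilter, the helper dropElement, and the sample tag names are made up for illustration, and matching is done on the local name only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: drops every element with the given local name, including its content.
class DropElementFilter extends XMLFilterImpl {
    private final String tagToDrop;
    private int skipDepth = 0; // greater than 0 while inside a dropped element

    DropElementFilter(XMLReader parent, String tagToDrop) {
        super(parent);
        this.tagToDrop = tagToDrop;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || localName.equals(tagToDrop)) {
            skipDepth++; // swallow this start tag and everything nested in it
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--; // swallow the matching end tag
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    // Parses xml, removes elements named tag, and returns the remaining document as text.
    static String dropElement(String xml, String tag) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // guarantees that localName is populated
        XMLReader reader = factory.newSAXParser().getXMLReader();

        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        identity.transform(
                new SAXSource(new DropElementFilter(reader, tag),
                        new InputSource(new StringReader(xml))),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(dropElement("<root><keep>a</keep><drop>b</drop></root>", "drop"));
    }
}
```

The depth counter is there so that nested elements inside the dropped tag, including a nested tag of the same name, are removed along with it.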
How do I get the text from an element that has no unique id or class? I tried an XPath copied from the web browser, but it does not work.
Try the code below:
driver.findElement(By.xpath("//*[@class='col-sm-4']//*[@class='text-grey'][1]")).getText();
This might help.
Document doc = Jsoup.parse(driver.getPageSource());
Elements content = doc.select("span[class^=text-grey]");
List<String> allTextInTextGreyClass = content.eachText();
You can then work with the list to get the text you want, or work with "content" directly. If this does not work, you can get the inner HTML and work with that to reach the right context. You can get the innerHTML as follows:
String inHTML = driver.findElement(By.className("text-grey")).getAttribute("innerHTML");
My use case: fetch HTML pages with jsoup and return a W3C DOM for further processing by XML transformations:
...
org.jsoup.nodes.Document document = connection.get();
org.w3c.dom.Document dom = new W3CDom().fromJsoup(document);
...
This works well for most documents, but for some it throws INVALID_CHARACTER_ERR without saying where.
It seems extremely difficult to find the error. I changed the code to first load the URL into a String and then check for bad characters with a regexp, but that does not help with bad attributes (e.g. ones without a value) and the like.
My current solution is to minimize the risk by removing elements by tag from the jsoup document (head, img, script ...).
Is there a more elegant solution?
Try setting the outputSettings to 'XML' for your document:
document.outputSettings().syntax(OutputSettings.Syntax.xml);
document.outputSettings().charset("UTF-8");
This should ensure that the resulting XML is valid.
Solution found by OP in reply to nyname00:
Thank you very much; this solved the problem:
// Note: newer jsoup releases rename Whitelist to Safelist.
Whitelist whiteList = Whitelist.relaxed();
Cleaner cleaner = new Cleaner(whiteList);
jsoupDom = cleaner.clean(jsoupDom);
"relaxed" indeed means a relaxed developer...
Kindly let me know of any API that can calculate the line count of an RTF document.
Apache POI and Aspose work for other document formats, but I could not get them to find the line count for RTF.
Thanks.
Java already has a built-in RTF-Parser: RTFEditorKit.
Take a look at its read method.
For example:
test.rtf file contents
hello
stackoverflow
users
So, it has 3 lines separated by \n.
Code:
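// Needs: java.io.FileInputStream, javax.swing.text.Document, javax.swing.text.rtf.RTFEditorKit
RTFEditorKit kit = new RTFEditorKit();
Document doc = kit.createDefaultDocument();
try (FileInputStream stream = new FileInputStream("test.rtf")) {
    kit.read(stream, doc, 0); // parse the RTF into the Swing document model
}
String plainText = doc.getText(0, doc.getLength());
// Lines are separated by \n, so the number of pieces is the line count
System.out.println(plainText.split("\\n").length);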
Output = 3
You can use Aspose.Words for Java to get the number of lines of an RTF document. Please do the following:
1. Read the RTF file using the Document class.
2. Get the BuiltInDocumentProperties object using the getBuiltInDocumentProperties method.
3. Get the number of lines using the getLines method of the BuiltInDocumentProperties object.
I hope this helps. Please note that I work as a developer evangelist at Aspose. If you need any help with Aspose, do let me know.
When traversing an XML document like so
while (streamReader.hasNext()) {
    streamReader.next();
    if (streamReader.getEventType() == XMLStreamReader.START_ELEMENT) {
        System.out.println(streamReader.getLocalName());
    }
}
Do I need to create a new streamReader if I need to traverse the XML document again, like so?
XMLStreamReader streamReader =
factory.createXMLStreamReader(reader);
I don't see a method like reset() that moves the cursor back to the start of the XML file.
Yes, you should create a new reader at that point.
If you need to traverse the document multiple times, do you definitely want to parse it in a streaming fashion in the first place, rather than loading it into a DOM of some description?
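As a minimal sketch of the "new reader per pass" approach: the example below keeps the XML in a String (an assumption made so the source can be reopened cheaply) and builds a fresh XMLStreamReader for each traversal. The class name TwoPassStax and the helper elementNames are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TwoPassStax {
    // Runs one complete pass over the XML and collects the local name of every start element.
    static List<String> elementNames(XMLInputFactory factory, String xml)
            throws XMLStreamException {
        List<String> names = new ArrayList<>();
        // A stream reader is forward-only, so each pass gets its own reader and its own input.
        XMLStreamReader streamReader = factory.createXMLStreamReader(new StringReader(xml));
        try {
            while (streamReader.hasNext()) {
                if (streamReader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(streamReader.getLocalName());
                }
            }
        } finally {
            streamReader.close();
        }
        return names;
    }

    public static void main(String[] args) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance(); // the factory can be reused
        String xml = "<a><b/><c/></a>";
        System.out.println(elementNames(factory, xml)); // prints [a, b, c]
        System.out.println(elementNames(factory, xml)); // second pass, prints [a, b, c] again
    }
}
```

Note that the underlying input has to be reopened as well: a Reader or InputStream that has already been consumed cannot be handed to a second XMLStreamReader.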
I'm having a very frustrating time extracting some elements from a JDOM document using an XPath expression. Here's a sample XML document - I'd like to remove the ItemCost elements from the document altogether, but I'm having trouble getting an XPath expression to evaluate to anything at the moment.
<srv:getPricebookByCompanyResponse xmlns:srv="http://ess.com/ws/srv">
  <srv:Pricebook>
    <srv:PricebookName>Demo Operator Pricebook</srv:PricebookName>
    <srv:PricebookItems>
      <srv:PricebookItem>
        <srv:ItemName>Demo Wifi</srv:ItemName>
        <srv:ProductCode>DemoWifi</srv:ProductCode>
        <srv:ItemPrice>15</srv:ItemPrice>
        <srv:ItemCost>10</srv:ItemCost>
      </srv:PricebookItem>
      <srv:PricebookItem>
        <srv:ItemName>1Mb DIA</srv:ItemName>
        <srv:ProductCode>Demo1MbDIA</srv:ProductCode>
        <srv:ItemPrice>20</srv:ItemPrice>
        <srv:ItemCost>15</srv:ItemCost>
      </srv:PricebookItem>
    </srv:PricebookItems>
  </srv:Pricebook>
</srv:getPricebookByCompanyResponse>
I would normally just use an expression such as //srv:ItemCost to identify these elements, which works fine on other documents, however here it continually returns 0 nodes in the List. Here's the code I've been using:
Namespace ns = Namespace.getNamespace("srv","http://ess.com/ws/srv");
XPath filterXpression = XPath.newInstance("//ItemCost");
filterXpression.addNamespace(ns);
List nodes = filterXpression.selectNodes(response);
Here response is a JDOM element containing the above XML snippet (verified with an XMLOutputter). nodes always has size()==0 when parsing this document. Using the XPath parser in Eclipse on the same document, this expression does not work either.
After some digging, I got the Eclipse evaluator to work with the following expression: //*[local-name() = 'ItemCost']. However, replacing the //srv:ItemCost in the Java code with this still produced no results. Another thing I noticed is that if I remove the namespace declaration from the XML, //srv:ItemCost resolves correctly in the Eclipse parser, but I can't remove it from the XML.
I've been scratching my head for hours on this one now, and would really appreciate a nudge in the right direction.
Many thanks
Edit : Fixed code -
Document build = new Document(response);
XPath filterXpression = XPath.newInstance("//srv:ItemCost");
List nodes = filterXpression.selectNodes(build);
Strange, indeed... I tested on my side with JDOM, and your snippet produced an empty list. The following works as intended:
public static void main(String[] args) throws JDOMException, IOException {
    File xmlFile = new File("sample.xml");
    SAXBuilder builder = new SAXBuilder();
    Document build = builder.build(xmlFile);
    XPath filterXpression = XPath.newInstance("//srv:ItemCost");
    System.out.println(filterXpression.getXPath());
    List nodes = filterXpression.selectNodes(build);
    System.out.println(nodes.size());
}
It produces the output:
//srv:ItemCost
2
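For comparison, here is a sketch of the same lookup plus the actual removal, done with the JDK's built-in javax.xml.xpath and W3C DOM rather than JDOM (a swapped-in API, not the one used in the question). The srv prefix is bound explicitly through a NamespaceContext, and the parser is made namespace-aware, which are the two things a prefixed XPath needs in order to match. The class name RemoveItemCost and its helper methods are made up for illustration.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RemoveItemCost {
    static final String SRV_NS = "http://ess.com/ws/srv";

    // Parses the XML with namespace awareness switched on (it is off by default).
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Removes every srv:ItemCost element and returns how many were removed.
    static int removeItemCosts(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "srv".equals(prefix) ? SRV_NS : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String uri) {
                return SRV_NS.equals(uri) ? "srv" : null;
            }

            @Override
            public Iterator<String> getPrefixes(String uri) {
                List<String> prefixes = SRV_NS.equals(uri)
                        ? Collections.singletonList("srv")
                        : Collections.<String>emptyList();
                return prefixes.iterator();
            }
        });

        NodeList hits = (NodeList) xpath.evaluate("//srv:ItemCost", doc, XPathConstants.NODESET);
        int count = hits.getLength();
        for (int i = 0; i < count; i++) {
            Node node = hits.item(i);
            node.getParentNode().removeChild(node); // detach the element from its parent
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<srv:Pricebook xmlns:srv=\"" + SRV_NS + "\">"
              + "<srv:ItemPrice>15</srv:ItemPrice><srv:ItemCost>10</srv:ItemCost>"
              + "</srv:Pricebook>");
        System.out.println(removeItemCosts(doc));
    }
}
```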