I am trying to use CSS Parser in a Java project to extract the CSS rules/DOM from a String containing the text input.
All the examples that I have come across take the CSS file as input. Is there a way to bypass the file reading and work with the string content of the CSS file directly?
The class that I am working on gets only the string content of the CSS file; all the reading has already been taken care of.
Right now I have this, where 'cssfile' is the file path of the CSS file being parsed:
InputStream stream = oParser.getClass().getResourceAsStream(cssfile);
InputSource source = new InputSource(new InputStreamReader(stream));
CSSOMParser parser = new CSSOMParser();
CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
System.out.println("Number of rules: " + ruleList.getLength());
A workaround that I found was to create a Reader using a StringReader with the contents and set it as the character stream of the InputSource, but there should be a better way to do this.
InputSource inputSource = new InputSource();
Reader characterStream = new StringReader(cssContent);
inputSource.setCharacterStream(characterStream);
CSSStyleSheet stylesheet = cssParserObj.parseStyleSheet(inputSource, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
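The SAC InputSource also has a constructor that takes a Reader directly, so the intermediate setter call isn't needed. A minimal sketch, assuming the same CSS Parser (com.steadystate.css) classes as in your snippet and that cssContent holds the stylesheet text:
import java.io.StringReader;
import org.w3c.css.sac.InputSource;
import org.w3c.dom.css.CSSRuleList;
import org.w3c.dom.css.CSSStyleSheet;
import com.steadystate.css.parser.CSSOMParser;

// Parse the stylesheet straight from the String, no file involved
InputSource source = new InputSource(new StringReader(cssContent));
CSSOMParser parser = new CSSOMParser();
CSSStyleSheet stylesheet = parser.parseStyleSheet(source, null, null);
CSSRuleList ruleList = stylesheet.getCssRules();
System.out.println("Number of rules: " + ruleList.getLength());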
I am trying to print out just the text between one specific element tag in an XML file. Here is my Java code:
SAXBuilder builder = new SAXBuilder();
byte[] requestFile = FileManager.getByteArray(args[0]);
byte[] responseFile = FileManager.getByteArray(args[1]);
InputStream request = new ByteArrayInputStream(requestFile);
InputStream response = new ByteArrayInputStream(responseFile);
Document requestDoc = builder.build(request);
Document responseDoc = builder.build(response);
String xpathResponseStr = "//status";
JDOMXPath xpath = new JDOMXPath(xpathResponseStr);
Element responseElem = (Element)xpath.selectSingleNode(requestDoc);
String statusRequestText = responseElem.getTextTrim();
System.out.println("RESPONSE: \n" + statusRequestText);
And here is my XML file that I am reading in:
<response>
    <status>success</status>
    <generatedDate>
        <date>2022-09-08</date>
        <time>12:03:23</time>
    </generatedDate>
    <filingInformation>
        <paymentInformation>
            <amount>0.00</amount>
        </paymentInformation>
    </filingInformation>
</response>
I am essentially trying to get my console to print the word "success" between the tags. But instead I am getting a NullPointerException. I'm not sure if this is because my XPath expression is incorrect or what exactly. Any input would help!
What I was doing wrong was passing the wrong Document object when calling
Element responseElem = (Element)xpath.selectSingleNode(requestDoc);
I should have been passing in the responseDoc Document object instead of the requestDoc Document object. Each of those objects held different XML, and in the requestDoc there was no element named <status>.
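With that change, the lookup resolves and "success" is printed:
// use the document built from the response XML, not the request
Element responseElem = (Element) xpath.selectSingleNode(responseDoc);
System.out.println("RESPONSE: \n" + responseElem.getTextTrim());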
I want to get an SVGDocument object to load into a JSVGCanvas, but I only have an SVG string and no file, so I cannot construct it from a URI.
You can read your SVG from a StringReader like this:
StringReader reader = new StringReader(svgString);
String uri = "file:make-something-up";
String parser = XMLResourceDescriptor.getXMLParserClassName();
SAXSVGDocumentFactory f = new SAXSVGDocumentFactory(parser);
SVGDocument doc = f.createSVGDocument(uri, reader);
You need to make up a valid URI, but its exact value isn't important unless your SVG makes relative references to other URIs.
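Once you have the document, it can go straight onto the canvas. A minimal usage sketch, assuming the doc variable from above:
// org.apache.batik.swing.JSVGCanvas
JSVGCanvas canvas = new JSVGCanvas();
canvas.setSVGDocument(doc);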
To convert an HTML String to an org.w3c.dom.Document I'm using jtidy-r938.jar. Here is my code:
public static Document getDoc(String html) {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE);
    // tidy.setPrintBodyOnly(true);
    tidy.setXmlOut(false);
    tidy.setShowErrors(0);
    tidy.setShowWarnings(false);
    // tidy.setForceOutput(true);
    tidy.setQuiet(true);
    Writer out = new StringWriter();
    PrintWriter dummyOut = new PrintWriter(out);
    tidy.setErrout(dummyOut);
    tidy.setSmartIndent(true);
    // encode explicitly as UTF-8 so the bytes match the input encoding set above
    ByteArrayInputStream inputStream = new ByteArrayInputStream(html.getBytes(java.nio.charset.StandardCharsets.UTF_8));
    Document doc = tidy.parseDOM(inputStream, null);
    return doc;
}
But sometimes the library works incorrectly and some tags are lost.
Please suggest a good open-source library for this task.
Thanks very much!
You don't say why the library sometimes doesn't give a good result.
Nevertheless, I work very regularly with HTML files that I must extract data from, and the main problem I encounter is that some tags are not valid, for example because they are not closed.
The best solution I found is the HtmlCleaner API (htmlCleaner website).
It allows you to make your HTML file well formed.
Then transforming it into a w3c Document, or another strict format, is easier.
With HtmlCleaner, you could do something like this:
HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(html);
DomSerializer ser = new DomSerializer(cleaner.getProperties());
Document myW3cDoc = ser.createDOM(node);
The DomSerializer here is the one from HtmlCleaner.
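Applied to the getDoc method from the question, an HtmlCleaner-based version could look roughly like this (a sketch; the class names are the standard org.htmlcleaner ones and may need adjusting to your HtmlCleaner version):
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.DomSerializer;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.w3c.dom.Document;

public static Document getDoc(String html) throws Exception {
    HtmlCleaner cleaner = new HtmlCleaner();
    CleanerProperties props = cleaner.getProperties();
    // clean() repairs unclosed or malformed tags and builds a tree
    TagNode node = cleaner.clean(html);
    // serialize the cleaned tree into a standard org.w3c.dom.Document
    return new DomSerializer(props).createDOM(node);
}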
I am parsing an XML document using dom4j as below:
SAXReader reader = new SAXReader();
document = reader.read("C:/test.xml");
But it does not keep the namespaces that were there when I write the XML back out as below:
FileOutputStream fos = new FileOutputStream("c:/test.xml");
OutputFormat format = OutputFormat.createPrettyPrint();
XMLWriter writer = new XMLWriter(fos, format);
writer.write(document);
writer.flush();
How do I do this using dom4j? I am using dom4j because of its ease of use.
I don't agree. This snippet
System.out.println(new SAXReader()
.read(new ByteArrayInputStream("<a:c xmlns:a='foo'/>"
.getBytes(Charset.forName("utf-8")))).getRootElement()
.getNamespaceURI());
will print
foo
Your problem is that the SAXReader#read(String) method takes a system ID argument, not a file name. Instead, try feeding the reader a File, an InputStream, or a URL:
reader.read(new File("C:/test.xml"))
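The other overloads behave the same way; for example (a quick sketch reusing the path from the question):
// from an InputStream
Document doc = reader.read(new FileInputStream("C:/test.xml"));
// or from a URL
Document doc2 = reader.read(new URL("file:///C:/test.xml"));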
I'm trying to parse an RTF file using Apache Tika. Inside the file there is a table with several columns.
The problem is that the parser writes out the result without any information about which column each value was in.
What I'm doing right now is:
AutoDetectParser adp = new AutoDetectParser(tc);
Metadata metadata = new Metadata();
String mimeType = new Tika().detect(file);
metadata.set(Metadata.CONTENT_TYPE, mimeType);
BodyContentHandler handler = new BodyContentHandler();
InputStream fis = new FileInputStream(file);
adp.parse(fis, handler, metadata, new ParseContext());
fis.close();
System.out.println(handler.toString());
It works, but I also need meta-information like which column a value came from.
Is there already a Handler which outputs something like HTML with the structure of the parsed RTF file?
I would suggest that rather than asking Tika for the plain text version and then wondering where all your nice HTML information has gone, you instead ask Tika for the document as XHTML. You'll then be able to process that to find the information you want about your RTF file.
If you look at the Tika examples or the Tika unit tests, you'll see this same pattern as an easy way to get the XHTML output:
Metadata metadata = new Metadata();
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)
SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "xml");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw));
parser.parse(input, handler, metadata, new ParseContext());
String xhtml = sw.toString();
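The snippet assumes a parser and an input stream are already in scope; a minimal way to set them up (these two lines are my assumption, not part of the original example) would be:
Parser parser = new AutoDetectParser();        // org.apache.tika.parser
InputStream input = new FileInputStream(file); // the RTF file from the question
The xhtml string then contains the structural markup that the plain-text BodyContentHandler discards, so you can inspect it to work out which column a value belongs to.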