Reading multiple xml files java - java

i have ~25000 XML files i need to read in java. This is my code:
private static void ProcessFile() {
try {
File fXmlFile = new File("C:/Users/Emolk/Desktop/000010.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
NodeList nList = doc.getElementsByTagName("sindex");
System.out.println("----------------------------");
for (int temp = 0; temp < nList.getLength(); temp++) {
Node nNode = nList.item(temp);
System.out.println("");
if (nNode.getNodeType() == Node.ELEMENT_NODE) {
Element eElement = (Element) nNode;
System.out.println("Name : " + eElement.getElementsByTagName("name").item(0).getTextContent());
System.out.println("Count : " + eElement.getElementsByTagName("count").item(0).getTextContent());
Entity CE = new Entity(eElement.getElementsByTagName("name").item(0).getTextContent(), Integer.parseInt(eElement.getElementsByTagName("count").item(0).getTextContent()));
Entities.add(CE);
System.out.println("Entity added! ");
}
}
System.out.println(Entities);
} catch (Exception e) {
e.printStackTrace();
}
}
How do i read 25000 files instead of just the one?
I tried joining all the xml files together using this: https://www.sobolsoft.com/howtouse/combine-xml-files.htm
But that gave me this error:
[Fatal Error] joined.xml:130:2: The markup in the document following the
root element must be well-formed.

If performance is not a concern, then you can do something like,
import java.io.File;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
public class ReadFiles {
public static void main(String[] args) {
File dir = new File("D:/Work"); //Directory where your file exists
File [] files = dir.listFiles();
for(File file : files) {
if(file.isFile() && file.getName().endsWith(".xml")) { //You can validate file name with extension if needed
ProcessFile(file, Entities); // Assumed you have declared Entities, may be list of other collection
}
}
System.out.println(Entities);
}
private static void ProcessFile(File fXmlFile, List<E> Entities) {
try {
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
System.out.println("Root element :" + doc.getDocumentElement().getNodeName());
NodeList nList = doc.getElementsByTagName("sindex");
System.out.println("----------------------------");
for (int temp = 0; temp < nList.getLength(); temp++) {
Node nNode = nList.item(temp);
System.out.println("");
if (nNode.getNodeType() == Node.ELEMENT_NODE) {
Element eElement = (Element) nNode;
System.out.println("Name : " + eElement.getElementsByTagName("name").item(0).getTextContent());
System.out.println("Count : " + eElement.getElementsByTagName("count").item(0).getTextContent());
Entity CE = new Entity(eElement.getElementsByTagName("name").item(0).getTextContent(), Integer.parseInt(eElement.getElementsByTagName("count").item(0).getTextContent()));
Entities.add(CE);
System.out.println("Entity added! ");
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
}

To read mutliple files you should use a loop of some kind for iteration. You can either scan for all valid files in a directory.
File folder = new File("path/to/directory");
File[] files = folder.listFiles();
for (int i = 0; i < files.length; i++) {
// you can also filter for .xml if needed
if (files[i].isFile()) {
// parse the file
}
}
Next, you need to decide how you want to parse your files: sequential or in parallel.
Parallel is a lot faster since you use multiple threads to parse files.
One Thread
You can reuse your code that you already wrote, and loop over the files:
for (File file : files) {
processFile(file, yourListOfEntities);
}
Multiple Threads:
Aquire a ScheduledExecutorService and submit multiple tasks.
ExecutorService service = Executors.newFixedThreadPool(5);
for (File file : files) {
service.execute(() -> processFile(file, yourListOfEntities));
}
An important note here: The default implementation of ArrayList is not thread safe, so you should (since the List is used by multiple threads) synchronize access to it:
List<Entity> synchronizedList = Collections.synchronizedList(yourListOfEntities);
Also, DocumentBuilder is not thread safe and should be created once per thread (you have it right if you just call your method). This note is just for the case if you think about optimizing it.

Related

Return values from xml, xml Iteration

I have an xml file as such
<?xml version="1.0" encoding="UTF-8"?>
<folder name="c">
<folder name="program files">
<folder name="uninstall information" />
</folder>
<folder name="users"/>
</folder>
I want to print out "c", "program files", "uninstall information" and "users" what i finally want to do is to print out only values of the name attribute with string starting from u , therefore users and uninsall information.
But i have not been able to print all the values out,
Below is my code where you can see i have tried to ways but no success so far.
public static Collection<String> folderNames(String xml, char startingLetter) throws Exception {
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
try {
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
FileInputStream fis = new FileInputStream("src/main/resources/test.xml");
org.xml.sax.InputSource is = new InputSource(fis);
Document doc = documentBuilder.parse(is);
NodeList nodeList = doc.getElementsByTagName("*");
for(int i =0; i < nodeList.getLength(); i++) {
Node node = nodeList.item(i);
/// Tried this
if(node.getNodeType() == Node.ELEMENT_NODE) {
String value = node.getTextContent();
System.out.println("value:::" +value);
}
/// tried this
// Element element = (Element)nodeList.item(i);
// NamedNodeMap attributes = element.getAttributes();
// Node nodeValue1 = nodeList.item(i);
// System.out.println(nodeValue1.getAttributes().item(i));
}
} catch (Exception e) {
e.getMessage();
}
return Collections.EMPTY_LIST;
}
for speedy test my imported classes looks like test
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
My approach without using getElementByTagsName
Document doc = documentBuilder.parse(is);
NodeList nodeList = doc.getElementsByTagName("folder");
for(int i =0; i < nodeList.getLength(); i++) {
if (nodeList.item(i).hasChildNodes()) {
for(int i1 = 0; i1 < nodeList.item(i).getChildNodes().getLength(); i1++) {
Node node = nodeList.item(i).getChildNodes().item(i);
System.out.println(node.getAttributes().item(i));
}
}
Node nodeValue1 = nodeList.item(i);
System.out.println(nodeValue1.getAttributes().item(i));
This isnt complete but it will require a recursive call, due to hierarchy in the xml
Example of printing all folder names starting with u:
String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<folder name=\"c\">\n" +
" <folder name=\"program files\">\n" +
" <folder name=\"uninstall information\" />\n" +
" </folder>\n" +
" <folder name=\"users\"/>\n" +
"</folder>";
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
Document doc = documentBuilder.parse(new InputSource(new StringReader(xml)));
NodeList nodeList = doc.getElementsByTagName("folder");
for (int i = 0; i < nodeList.getLength(); i++) {
Element element = (Element) nodeList.item(i);
String name = element.getAttribute("name");
if (name.startsWith("u"))
System.out.println(name);
}
Output
uninstall information
users
You almost had it. First you have to identify the XML element, which you did.
if(node.getNodeType() == Node.ELEMENT_NODE) {
String value = node.getTextContent();
System.out.println("value:::" +value);
}
but instead of getting invoking getTextContent(), you need to find the attribute in that element. Some variation of the below. Of course, if there is more than one attribute you will need to accomodate looking at them all (using node.getAttributes().getLength()):
if(node.getNodeType() == Node.ELEMENT_NODE) {
if (node.getAttributes() != null) {
String name = node.getAttributes().item(0).getNodeName();
String value = node.getAttributes().item(0).getNodeValue();
System.out.println("attribute name:::" +name + " value:::" +value);
}
}

Java XPath - find tags prefixed with

I have a following HTML
<data-my-tag>
<data-another-tag>
... content ...
</data-another-tag>
<data-my-tag>
... content ...
</data-my-tag>
</data-my-tag>
Now I need to find all tags starting with prefix <data-. I need to find their names and also their contents. I know this is not possible to achieve with regex, so I started to work with javax.xml.parsers. It is easy for me to find some tags according to a particular name, but I am unable to find tags starting with some prefix.
What is the expression or code to find tags starting with prefix?
You can use XPath's starts-with function:
public void findElements(InputSource source,
String prefix) {
try {
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList matches = (NodeList) xpath.evaluate(
"//*[starts-with(local-name(), '" + prefix + "')]",
source, XPathConstants.NODESET);
int count = matches.getLength();
for (int i = 0; i < count; i++) {
Node match = matches.item(i);
System.out.println("Element: " + match.getNodeName());
System.out.println("Text: " + match.getTextContent().trim());
System.out.println();
}
} catch (XPathException e) {
throw new RuntimeException(e);
}
}
Can we use something like this :
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;
public class Demo {
public static void main(String[] args) {
try {
File inputFile = new File("input.txt");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(inputFile);
doc.getDocumentElement().normalize();
NodeList nList = doc.getDocumentElement().getChildNodes();
for (int temp = 0; temp < nList.getLength(); temp++) {
Node nNode = nList.item(temp);
if (nNode.getNodeType() == Node.ELEMENT_NODE || nNode.getNodeName().startsWith("<data-")) {
System.out.println(nNode.getTextContent());
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
}

Parsing XML from webpage

If I copy and paste the xml from this site into a xml file I can parse it with java
http://api.indeed.com/ads/apisearch?publisher=8397709210207872&q=java&l=austin%2C+tx&sort&radius&st&jt&start&limit&fromage&filter&latlong=1&chnl&userip=1.2.3.4&v=2
However, I want to parse it directly from a webpage if possible!
Here's my current code:
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.dom.Node;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;
import java.io.File;
import java.io.IOException;
public class XMLParser {
public void readXML(String parse) {
File xml = new File(parse);
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder;
try {
dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(xml);
// System.out.println("Root element :"
// + doc.getDocumentElement().getNodeName());
NodeList nList = doc.getElementsByTagName("result");
System.out.println("----------------------------");
for (int temp = 0; temp < nList.getLength(); temp++) {
Node nNode = nList.item(temp);
// System.out.println("\nCurrent Element :" +
nNode.getNodeName());
if (nNode.getNodeType() == Node.ELEMENT_NODE) {
Element eElement = (Element) nNode;
System.out.println("job title : "
+
eElement.getElementsByTagName("jobtitle").item(0)
.getTextContent());;
System.out.println("Company: "
+
eElement.getElementsByTagName("company")
.item(0).getTextContent());
System.out.println("City : "
+
eElement.getElementsByTagName("city").item(0)
.getTextContent());
System.out.println("State : "
+
eElement.getElementsByTagName("state").item(0)
.getTextContent());
System.out.println("Country : "
+
eElement.getElementsByTagName("country").item(0)
.getTextContent());
System.out.println("Date posted : "
+
eElement.getElementsByTagName("date").item(0)
.getTextContent());
System.out.println("Job summary : "
+
eElement.getElementsByTagName("snippet").item(0)
.getTextContent());
System.out.println("Latitude : "
+
eElement.getElementsByTagName("latitude").item(0).getTextContent());
System.out.println("longitude : "
+
eElement.getElementsByTagName("longitude").item(0).getTextContent());
}
}
} catch (ParserConfigurationException | SAXException | IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
public static void main(String[] args) {
new XMLParser().readXML("test.xml");
}
}
any help would be appreciated.
Give it the URI instead of the XML. It will download it for you.
Document doc = dBuilder.parse(uriString)
Please find the code snippet like this
String url = "http://api.indeed.com/ads/apisearch?publisher=8397709210207872&q=java&l=austin%2C+tx&sort&radius&st&jt&start&limit&fromage&filter&latlong=1&chnl&userip=1.2.3.4&v=2";
try
{
DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
DocumentBuilder b = f.newDocumentBuilder();
Document doc = b.parse(url);
}
you need to have the element/nodes you want in a for loop. So it can scan through xml file, and find the right node you searching for.
reads the xml file as a string, and creates a xml structure
builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.parse(connection.getInputStream());
NodeList nodes = doc.getElementsByTagName("mode");
for (int i = 0; i < nodes.getLength(); i++)
Element element = (Element) nodes.item(i);
//Gets tag from XML and it´s content
NodeList nodeMode = element.getElementsByTagName("mode");
Element elemMode = (Element) nodeMode.item(0);
and after if you want to pick out a value and parse to an int or what you want you do like this:
int currentMode = Integer.parseInt(elemMode.getFirstChild().getTextContent());
That's how I parsed data directly from url http://www.nbp.pl/kursy/xml/+something
static class Kurs {
public float kurs_sprzedazy;
public float kurs_kupna;
}
private static DocumentBuilder dBuilder;
private static Kurs getData(String filename, String currency) throws Exception {
Document doc = dBuilder.parse("http://www.nbp.pl/kursy/xml/"+filename+".xml");
doc.getDocumentElement().normalize();
NodeList nList = doc.getElementsByTagName("pozycja");
for(int i = 0; i < nList.getLength(); i++) {
Element nNode = (Element)nList.item(i);
if(nNode.getElementsByTagName("kod_waluty").item(0).getTextContent().equals(currency)) {
Kurs kurs = new Kurs();
String data = nNode.getElementsByTagName("kurs_sprzedazy").item(0).getTextContent();
data = data.replace(',', '.');
kurs.kurs_sprzedazy = Float.parseFloat(data);
data = nNode.getElementsByTagName("kurs_kupna").item(0).getTextContent();
data = data.replace(',', '.');
kurs.kurs_kupna = Float.parseFloat(data);
return kurs;
}
}
return null;
}

Possible way to parse the text alone from an xml document using java dom

I need to receive all the text alone from an xml file for receiving the specific tag i use this code. But i am not sure how to parse all the text from the XML i the XML files are different i don't know their root node and child nodes but i need the text alone from the xml.
try {
DocumentBuilderFactory dbFactory = DocumentBuilderFactory
.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(streamLimiter.getFile());
doc.getDocumentElement().normalize();
System.out.println("Root element :"
+ doc.getDocumentElement().getNodeName());
NodeList nList = doc.getElementsByTagName("employee");
System.out.println("-----------------------");
for (int temp = 0; temp < nList.getLength(); temp++) {
Node nNode = nList.item(temp);
if (nNode.getNodeType() == Node.ELEMENT_NODE) {
Element eElement = (Element) nNode;
NodeList nlList = eElement.getElementsByTagName("firstname")
.item(0).getChildNodes();
Node nValue = (Node) nlList.item(0);
System.out.println("First Name : "
+ nValue.getNodeValue());
}
}
} catch (Exception e) {
e.printStackTrace();
}
Quoting jsight's reply in this post: Getting XML Node text value with Java DOM
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
class Test {
/**
* #param args the command line arguments
*/
public static void main(String[] args) throws Exception {
String xml = "<add job=\"351\">\n"
+ " <tag>foobar</tag>\n"
+ " <tag>foobar2</tag>\n"
+ "</add>";
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
ByteArrayInputStream bis = new ByteArrayInputStream(xml.getBytes());
org.w3c.dom.Document doc = db.parse(bis);
Node n = doc.getFirstChild();
NodeList nl = n.getChildNodes();
Node an, an2;
for (int i = 0; i < nl.getLength(); i++) {
an = nl.item(i);
if (an.getNodeType() == Node.ELEMENT_NODE) {
NodeList nl2 = an.getChildNodes();
for (int i2 = 0; i2 < nl2.getLength(); i2++) {
an2 = nl2.item(i2);
// DEBUG PRINTS
System.out.println(an2.getNodeName() + ": type (" + an2.getNodeType() + "):");
if (an2.hasChildNodes()) {
System.out.println(an2.getFirstChild().getTextContent());
}
if (an2.hasChildNodes()) {
System.out.println(an2.getFirstChild().getNodeValue());
}
System.out.println(an2.getTextContent());
System.out.println(an2.getNodeValue());
}
}
}
}
}
Output:
#text: type (3):
foobar
foobar
#text: type (3):
foobar2
Adapt this code to your problem and it should work.

Failing to get element values using Element.getAttribute()

I would like to read an xml file. I' ve found an example which is good until the xml element doesn't have any attributes. Of course i've tried to look after how could I read attributes, but it doesn't works.
XML for example
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<car>
<properties>
<test h="1.12" w="4.2">
<colour>red</colour>
</test>
</properties>
</car>
Java Code:
public void readXML(String file) {
try {
File fXmlFile = new File(file);
DocumentBuilderFactory dbFactory = DocumentBuilderFactory
.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
doc = dBuilder.parse(fXmlFile);
doc.getDocumentElement().normalize();
for (int temp = 0; temp < nList.getLength(); temp++) {
Node nNode = nList.item(temp);
if (nNode.getNodeType() == Node.ELEMENT_NODE) {
Element eElement = (Element) nNode;
System.out.println("test : "
+ getTagValue("test", eElement));
System.out.println("colour : " + getTagValue("colour", eElement));
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
public String getTagValue(String sTag, Element eElement) {
NodeList nlList = eElement.getElementsByTagName(sTag).item(0)
.getChildNodes();
Node nValue = (Node) nlList.item(0);
System.out.println(nValue.hasAttributes());
if (sTag.startsWith("test")) {
return eElement.getAttribute("w");
} else {
return nValue.getNodeValue();
}
}
Output:
false
test :
false
colour : red
My problem is, that i can't print out the attributes. How could i get the attributes?
There is alot wrong with your code; undeclared variables and a seemingly crazy algorithm. I rewrote it and it works:
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
public final class LearninXmlDoc
{
private static String getTagValue(final Element element)
{
System.out.println(element.getTagName() + " has attributes: " + element.hasAttributes());
if (element.getTagName().startsWith("test"))
{
return element.getAttribute("w");
}
else
{
return element.getNodeValue();
}
}
public static void main(String[] args)
{
final String fileName = "c:\\tmp\\test\\domXml.xml";
readXML(fileName);
}
private static void readXML(String fileName)
{
Document document;
DocumentBuilder documentBuilder;
DocumentBuilderFactory documentBuilderFactory;
NodeList nodeList;
File xmlInputFile;
try
{
xmlInputFile = new File(fileName);
documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilder = documentBuilderFactory.newDocumentBuilder();
document = documentBuilder.parse(xmlInputFile);
nodeList = document.getElementsByTagName("*");
document.getDocumentElement().normalize();
for (int index = 0; index < nodeList.getLength(); index++)
{
Node node = nodeList.item(index);
if (node.getNodeType() == Node.ELEMENT_NODE)
{
Element element = (Element) node;
System.out.println("\tcolour : " + getTagValue(element));
System.out.println("\ttest : " + getTagValue(element));
System.out.println("-----");
}
}
}
catch (Exception exception)
{
exception.printStackTrace();
}
}
}
If you have a schema for the file, or can make one, you can use XMLBeans. It makes Java beans out of the XML, as the name implies. Then you can just use getters to get the attributes.
Use dom4j library.
InputStream is = new FileInputStream(filePath);
SAXReader reader = new SAXReader();
org.dom4j.Document doc = reader.read(is);
is.close();
Element content = doc.getRootElement(); //this will return the root element in your xml file
List<Element> methodEls = content.elements("element"); // this will retun List of all Elements with name "element"
Attribute attrib = methodEls.get(0).attribute("attributeName"); // this is the "attributeName" attribute of first element with name "element"
If you're looking purely to obtain attributes (E.g. a config / ini file) I would recommend using a java properties file.
http://docs.oracle.com/javase/tutorial/essential/environment/properties.html
If you just want to read a file create a new fileReader and put it into a bufferedReader.
BufferedReader in = new BufferedReader(new FileReader("example.xml"));

Categories