I'm absolutely new to parsing and quite stumbled now. I've downloaded code from webpage with R function htmlParse and now want to extract some data from it. It has lots of elements, but i'd like to get data from exact one, starting with <script type="text/javascript">. How is it possible to do so and get the needed part of a code in a charachter format so it'll be possible than to extract other small elements from this bulk with, probably, regexp?
I am preparing to embark on a large solo project at my place of employment. First let me describe the project. I have been asked to create a Java program that can take a CamT54 file (which is just a xml file) and have java display the information in table form. Then users should be given the ability to remove certain components from the table and have it go back to xml format with the changes.
I'm not well versed in dealing with XML in Java so this is going to be a learn and work task. Before I begin investing time I would like to know that my approach is the best approach.
My plan is to use DOM4J to do the parsing and handling of the xml. I will use a JTable to display the data and incorporate some buttons to the GUI that allow the modifications of the data through the use of some action listeners.
Would this be a plausible plan? Can DOM4J effectively allow xml data to be displayed in a table format and furthermore could that data be easily modified or deleted then resaved to a new xml?
I thought I would go ahead and answer this as I finished the program and wanted to post what I thought was the easiest solution in case anyone else needed help.
It turned out the easiest approach (for me at least) was to use the standard DOM parser, here are the steps I took.
Parsed the entire XML into String array lists. XPath was required for this, I also had to convert the elements into Strings and remove the extra tag information from the string using substrings since I only wanted the actual value.
I populated a JTable with these arrays.
Once users finished editing and clicked a save button then another Dom parser would take the original XML and change each and every attribute using the values from the Arrays (that were deleted and repopulated with the JTable cell values when the user clicked "save").
I need to parse relatively big XML files on Android.
Some node internal structure contains HTML tags, for some other nodes I need to pull content from different depth levels. Therefore, instead of using XmlPullParser I plan to:
using XPath, find the proper node
using 'getElementsByTagName' find appropriate sub-node(s)
extract information and save it in my custom data objects.
The problem I have is performance. The way how I open file is following:
File file = new File(_path);
FileInputStream is = new FileInputStream(file);
XPath xPath = XPathFactory.newInstance().newXPath();
NamespaceContext context = new NamespaceContextMap("def", __URL__);
xPath.setNamespaceContext(context);
Object objs = xPath.evaluate("/def:ROOT_ELEMENT/*,
new InputSource(is), XPathConstants.NODESET);
Even though I need to get few strings that are in the very beginning of the XML file, it looks like XPath parses WHOLE xml file and put it in DOM structure.
In some cases I need access to full object and it is ok to have operation running few seconds for few megabyte file.
In other cases - I only need to get few nodes and don't want users to wait for my program to perform a redundant parsing.
Q1: What is the way to get some parts of XML file without parsing it in full?
Q2: Is there any way to restrict XPath from scanning/parsing WHOLE XML file? For instance: scan till 2nd level of depth?
Thank you.
P.S. In one particular case, XML file represents FB2 file format and if you have any specific tips that could resolve my problem for fb2-files parsing, please fill free to add additional comments.
I don't know too much about the XML toolset available for android, except to know that it's painfully limited!
Probably the best way to tackle this requirement is to write a streaming SAX filter that looks for the parts of the document you are interested in, and builds a DOM containing only those parts, which you can then query using XPath. I'm a bit reluctant to advise that, because it won't be easy if you haven't done such things before, but it seems the right approach.
So say you have a file that is written in XML or soe other coding language of that sort. Is it possible to just rewrite one line rather than getting the entire file into a string, then changing then line, then having to rewrite the whole string back to the file?
In general, no. File systems don't usually support the idea of inserting or modifying data in the middle of a file.
If your data file is in a fixed-size record format then you can edit a record without overwriting the rest of the file. For something like XML, you could in theory overwrite one value with a shorter one by inserting semantically-irrelevant whitespace, but you wouldn't be able to write a larger value.
In most cases it's simpler to just rewrite the whole file - either by reading and writing in a streaming fashion if you can (and if the file is too large to read into memory in one go) or just by loading the whole file into some in-memory data structure (e.g. XDocument), making the changes, and then saving the file again. (You may want to consider saving to a different file then moving the files around to avoid losing data if the save operation fails for some reason.)
If all of this ends up being too expensive, you should consider using multiple files (so each one is smaller) or a database.
If the line you want to replace is larger than the new line that you want to replace it with, then it is possible as long as it is acceptable to have some kind of padding (for example white-space characters ' ') which will not effect your application.
If on the other hand the new content are larger than the content to be replaced you will need to shift all the data downwards, so you need to rewrite the file, or at least from the replaced line onwards.
Since you mention XML, it might be you are approaching your problem in the wrong way. Could it be that what you need is to replace a specific XML node? In which case you might consider using DOM to read the XML into a hierarchy of nodes and adding/updating/removing in there before writing the XML tree back to the file.
I want to build an XML file as a datastore. It should look something like this:
<datastore>
<item>
<subitem></subitem>
...
<subitem></subitem>
</item>
....
<item>
<subitem></subitem>
...
<subitem></subitem>
</item>
</datastore>
At runtime I may need to add items to the datastore. The number of items may be high, so that I don't want to hold the whole document in memory and can't use DOM. I just want to write the part where a change occures. Or does DOM supports this?
I had a first look at StAX, but I am not sure if it does what I want.
Wouldn't it be the best to remember a cursor position at the end of the file just right before the root element is beeing closed? That is always the position where new items will be added. So if I remember that position and keep it up to date during changes, I could add an new item at the end, without iterating through the whole file .
Maybe a second cursor, could be used independendly from the first one, to iterate over the document just for reading purposes.
I can't see that StAX supports any of this, does it?
Isn't there a block based API for files instead of a stream bases one? Aren't files and filesystems typical examples for block "devices"? And if there is such an API, does it help me with my problem?
Thanks in advance.
Updating XML is basically impossible because there's no "cheap" way to insert data.
Appending XML is not so bad. All you need to do there is seek to the end of the file, then GO BACK over the "end tag" (</datastore> in this case), and then just start writing. This is a cheap operation all told, but none of the frameworks really support this as they're all mostly designed to work with well formed, full boat XML documents, as a whole, not in pieces.
You could use a StAX like thing, but in this case, StAX isn't aware of the <datastore> tag, rather it's just aware of the <item> tags and its elements. Then you create Items and start writing, over and over and over, to the same OutputStream that you have set up.
That's the best way to do this.
But if you need to delete or change data, then you get to rewrite stuff, or do hacks, such as marking elements as "inactive", hunting them down in the XML file, seeking to the 'active="Y"' attribute, and then inplace changing the Y to N. It can be done, it will be mostly efficient, but its far and away outside what the normal XML processing frameworks let you do. If I were to do that, I'd read the entire file and keep track of those entries and note their locations within it so later I could easily seek and change them efficiently.
Then when you update something, you "inactivate" the old one, and "append" the new one. Eventually get to GC the file by rewriting it all and throwing out the old, "inactive" entries.
As a rule of thumb, XML files aren't very efficient as datastores, not for the record-based data you seem to want to use them for.
But if you've already got the file and absolutely can't do anything about it, you can use StAX XMLEventReaders and XMLEventWriters to read through a file quickly and insert/modify elements in it.
But when I say quickly, what I mean is more quickly than DOM would be, but nowhere near as effective as any relational DB.
Update: Another option you can consider is vtd-xml, although I haven't tried it in real projects, it actually looks pretty decent.
If you always want to append items at the end, then the best way to handle this is to have two XML files. The outer one datstore.xml is simply a wrapper, and looks like this:
<!DOCTYPE datastore [
<!ENTITY e SYSTEM "items.xml">
]>
<datastore>&e;</datastore>
The file items.xml looks like this:
<item>....</item>
<item>....</item>
<item>....</item>
with no wrapper element.
When you want to append data, you can open items.xml and write to the end of it. When you want to read data, open datastore.xml with an XML parser.
Of course, once your data grows beyond 20Mb or so, it may well be better to use an XML database. But I've been using this approach for years for records of Saxon orders, with files that are currently about 8Mb, and it works fine.
It's not very easy or efficient to partially update an XML file so you won't find much support for it as a use case.
Really it sound like you need a proper database, perhaps with a tool to export the data as XML.
If you don't want to use a DB and insist on storing the data purely as XML you might consider keeping all your items in memory as objects. Whenever a new one is added you can write all of them out to XML. It might seem inefficient, but depending on your data size might still be good enough.
If you choose this path, you might want to check out the Xstream library to make this quite easy, see stream tutorial for a quick example.