Importing data from WARC files (WebArchive)

I'm dealing with an unusual use case where the data is stored in WARC files.
https://en.wikipedia.org/wiki/Web_ARChive
I want to import the data into Neo4j.
One solution I can think of is to parse the WARC file (with some Java code), then write the structured data to CSV so that it can be loaded with an import tool.
Is extracting to CSV the only option for loading data into Neo4j?
Could you give me some advice on how to go about implementing this use case?
Thanks,
Phaneendra

It depends.
It depends on what data you want to load from the Web Archive. If you're talking about loading the metadata, then you do not need the intermediate step: process the file and insert the data straight into the database. You could use a stored procedure for that (the APOC library is full of similar things) or a small server application written in your favorite language with the matching driver.
If you're talking about the content inside the Web Archive, it's a different story. Neo4j is not a blob/document store, so you would have to extract and interpret the archived files. That would probably be more efficient as an indirect process.
Hope this helps,
Tom
BTW, CSV is not the only format that can be loaded. There are procedures for loading XML, JSON, ...
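To illustrate the "metadata only" path from the answer above, here is a minimal sketch that reads just the header fields of each WARC record using the Java standard library. It assumes an uncompressed WARC stream whose records start with a `WARC/` version line, followed by `Name: value` header lines, a blank line, and a content block of `Content-Length` bytes. The class and method names (`WarcHeaderReader`, `readHeaders`) are made up for this example and are not part of any library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: reads only the header fields of each record in an uncompressed WARC stream. */
final class WarcHeaderReader {

    /** Returns one map of header fields (e.g. WARC-Type, WARC-Target-URI) per record. */
    static List<Map<String, String>> readHeaders(InputStream in) throws IOException {
        List<Map<String, String>> records = new ArrayList<>();
        String line;
        while ((line = readLine(in)) != null) {
            if (!line.startsWith("WARC/")) {
                continue; // skip the blank lines that separate records
            }
            Map<String, String> fields = new LinkedHashMap<>();
            while ((line = readLine(in)) != null && !line.isEmpty()) {
                int colon = line.indexOf(':');
                if (colon > 0) {
                    fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
                }
            }
            // Skip the content block; its size in bytes is given by Content-Length.
            long remaining = Long.parseLong(fields.getOrDefault("Content-Length", "0"));
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    if (in.read() < 0) {
                        break; // end of stream reached before the block was complete
                    }
                    skipped = 1;
                }
                remaining -= skipped;
            }
            records.add(fields);
        }
        return records;
    }

    /** Reads one CRLF- or LF-terminated line as UTF-8; returns null at end of stream. */
    private static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) >= 0) {
            sawAny = true;
            if (b == '\n') {
                break;
            }
            if (b != '\r') {
                buf.write(b);
            }
        }
        return sawAny ? new String(buf.toByteArray(), StandardCharsets.UTF_8) : null;
    }
}
```

Each returned map could then be passed as parameters to a Cypher statement through the Neo4j driver, for example something along the lines of `MERGE (p:Page {uri: $uri})`, where the `Page` label and `uri` property are only placeholders for whatever model you choose.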

Related

Map Excel workbook with multiple sheets to XSD

I have an Excel workbook with multiple sheets. Each sheet holds a table, the different tables have different formats.
I need to read the entire workbook into my Java program. The most convenient method IMHO is to export the entire data into a single XML and parse it (using simpleXML or some other compatible parser).
I have found no method for applying a schema to multiple sheets of a workbook, only to a single sheet. Is it possible? If so, how?

When it comes to convenience, there are many factors that influence or define it. For example, it depends if this is an ongoing thing, or if it needs to be integrated into a process, etc.
Before recommending a solution as suggested, I would try to convince you to take a look at Apache's POI (the Java API for Microsoft Documents), specifically the Excel API. It gives you a Java API for your Java program that should allow you to read what you need pretty easily. It would be a one stop shop kind of thing.
Another approach might be to use the JDBC-ODBC bridge and access the Excel file through the JDBC API. I can't tell from the details in your question whether your deployment model would allow this (e.g. you might run on a platform that doesn't have an ODBC provider for Excel files), but on Windows it is certainly an option; many places on the internet describe this approach in detail.
If you insist on going down the XML export way, QTAssistant (I am associated with it) has a comprehensive solution (XML Builder) for generating XML from any supported relational data source. It provides a GUI and a command line. In your case it would need the XLS, an XSD which describes the XML you want to get out and a mapping file (basically another XML file) to create the XML you need. In general this feature is largely used to convert test data into XML for Web service calls, so it is geared towards a certain interaction pattern between the user, the tool, and the XML generation activities. If you're interested in more details, let me know.

How to create .mpp file in java?

I am able to create an .mpx file using the MPXJ library in Java.
I need to write (create) an .mpp file in Java. Can anyone suggest how?

I maintain MPXJ, and the short answer to your enquiry is that, at present, MPXJ does not write MPP files.
The main reason for this is simply that despite the effort which has gone into understanding the MPP file structure, there is still a great deal of it which is not well understood, hence it is difficult to reliably generate. The other issue is that even if I was to produce some code which could generate an MPP file, the features it could write to that file are likely to lag behind what MPXJ supports in the MSPDI file format, again due to my incomplete understanding of the MPP format.
My suspicion is that the next version of MS Project (Project 15? Project 2013?) may offer a ".mppx" file format, similar to the ".docx" etc. formats used by other applications in the MS Office suite. This will be XML-based and will be more straightforward to generate than the binary MPP file format currently is... let's see what Microsoft come up with!
Jon

Visit http://www.mpxj.org/faq/
Can I use MPXJ to write MPP files?
Not at present. Although it is technically feasible to generate an MPP file, the knowledge we have of the file structure is still relatively incomplete, despite the amount of data we are able to correctly extract. It is therefore likely to take a considerable amount of development effort to make this work, and it is conceivable that we will not be able to write the full set of attributes that MPXJ supports back into the MPP file - simply because we don't understand the format well enough. You are therefore probably better off using MSPDI which does support the full range of data items present in an MPP file.
You can try this: http://www.aspose.com/java/project-management-component.aspx
It writes MPP and Microsoft Project XML.
But it is not free.

I think by "mpp" you probably mean "Microsoft PowerPoint", correct?
Q: Why do you think MPXJ (Microsoft Project Exchange/Java) can't do this?
http://www.mpxj.org/
Welcome to MPXJ! This library provides a set of facilities to allow
project information to be manipulated in Java and .Net. MPXJ supports
a range of data formats: Microsoft Project Exchange (MPX), Microsoft
Project (MPP,MPT), Microsoft Project Data Interchange (MSPDI XML),
Microsoft Project Database (MPD), Planner (XML), Primavera (PM XML,
XER, and database), and Asta Powerproject (PP, MDB).

Fast and easy data import tools/libraries

I'm looking for tools/libraries that allow fast (easy) data import into existing database tables. For example, phpMyAdmin allows data import from .csv, .xml etc. In Hadoop Hue, via Beeswax for Hive, we can create a table from a file. I'm looking for tools I can use with PostgreSQL, or libraries that allow doing such things quickly and easily. I'm looking for a way to avoid coding it manually, from reading the file to inserting into the db via JDBC.

You can do all that with standard tools in PostgreSQL, without additional libraries.
For .csv files you can use the built in COPY command. COPY is fast and simple. The source file has to lie on the same machine as the database for that. If not, you can use the very similar \copy meta-command of psql.
For .xml files (or any format really) you can use the built in pg_read_file() inside a plpgsql function. However, I quote:
Only files within the database cluster directory and the log_directory
can be accessed.
So you have to put your source file there or create a symbolic link to your actual file/directory. Then you can parse it with unnest() and xpath() and friends. You need at least PostgreSQL 8.4 for that.
A kick start on parsing XML in this blog post by Scott Bailey.
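As a rough illustration of the COPY approach for .csv files described above, the sketch below builds a server-side COPY statement as a Java string, which could then be run from any SQL client. The helper class `CopySql` and the table and file names are invented for the example. It uses the parenthesized option syntax, which I believe needs PostgreSQL 9.0 or later (older servers use the `WITH CSV HEADER` form), and the literal quoting assumes `standard_conforming_strings` is on. Note that quoting the table name makes it case-sensitive.

```java
/** Hypothetical helper that builds PostgreSQL COPY statements for CSV import. */
final class CopySql {

    /** Quotes an identifier: wrap in double quotes and double any embedded double quote. */
    static String quoteIdentifier(String name) {
        return "\"" + name.replace("\"", "\"\"") + "\"";
    }

    /** Quotes a string literal: wrap in single quotes and double any embedded single quote. */
    static String quoteLiteral(String value) {
        return "'" + value.replace("'", "''") + "'";
    }

    /** Server-side COPY: the path is resolved on the database server's file system. */
    static String copyFromCsv(String table, String serverPath) {
        return "COPY " + quoteIdentifier(table) + " FROM " + quoteLiteral(serverPath)
                + " WITH (FORMAT csv, HEADER true)";
    }
}
```

For a file that lives on the client machine instead, the psql equivalent would look something like `\copy people FROM 'people.csv' WITH (FORMAT csv, HEADER true)`, with `people` again being a placeholder table.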

Alternative to ZIP as a project file format. SQLite or Other?

My Java application is currently using ZIP as a project file format. The project files contain a few XML files and many image and sound files.
The project files are getting pretty big, and since I can't find a way with the java.util.zip classes to write to a ZIP file without recreating it, my file saves are becoming very slow. So for example, if I just want to update one XML file, I need to rewrite the entire ZIP.
Is there some other Java ZIP library that will allow me to do random writes to a ZIP file?
I know switching to something like SQLite solves the random write issue. Would using SQLite just to write XML, Sound and Images as blobs be an appropriate use?
I suppose I could come up with my own file format and use RandomAccessFile but then there would be a lot of bookkeeping I'd have to write.
Update...
My file format is very much like Office Open XML. It is a ZIP file containing XML and other resources.
Someone must have solved the problem of how to do random writes to update a ZIP file. Does anyone know how?

There are so-called single-file virtual file systems that let you create file-based containers and provide a file-system-like structure and APIs. One example is SolFS (it has a core written in C with a JNI wrapper); there are also other solutions written in C and Delphi (I don't remember their names at the moment). I guess similar native Java solutions exist as well.

First of all, I would separate your app's resources into those that are static (such as images) and those that can be changed (the XML files you mentioned).
Since the static files won't be rewritten, you can continue to store them in a ZIP file, which IMHO is a good approach for deploying any resources.
Now you have 2 options:
1. Since the non-static files are probably not too big (the XML files are likely to be smaller than images+sounds), you can stick with your current solution (ZIP file) and simply maintain 2 ZIP files, of which only one (the smaller one with the changeable files) can/will be rewritten.
2. You could use an in-memory database (such as HSQLDB) to store the changeable files and only persist them (transferring from the database to a file on the drive) when your application shuts down or when that operation is explicitly needed.
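The first option (two ZIP files) might look roughly like the sketch below, which uses only `java.util.zip`. The class name `SplitProjectStore` and the file names are hypothetical. The point is that a save rewrites only the small archive holding the XML, while the large archive with images and sounds is written once and afterwards only read.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

/** Hypothetical store that keeps changeable files in their own small ZIP archive. */
final class SplitProjectStore {

    /** Rewrites the small archive that holds only the changeable files (e.g. the XML). */
    static void writeMutableZip(Path zipPath, Map<String, byte[]> files) throws IOException {
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zipPath))) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                out.putNextEntry(new ZipEntry(file.getKey()));
                out.write(file.getValue());
                out.closeEntry();
            }
        }
    }

    /** Reads one entry from either archive; returns null if the entry is absent. */
    static byte[] readEntry(Path zipPath, String name) throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(name);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] chunk = new byte[8192];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    buf.write(chunk, 0, n);
                }
                return buf.toByteArray();
            }
        }
    }
}
```

The trade-off is that a project is then two files rather than one, so this only fits if that is acceptable for how projects are shared.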

SQLite is not always fast (at least in my experience). I would suggest compressing the XML files individually (you'll still get decent compression) and just using the file system to save them. You could experiment with btrfs, or just go with ext4. If you're not on Linux, this should still work okay, but it might not be as fast until things are cached in memory.
The idea is that if you do not have redundancy between the XML files, then you don't save that much by compressing them into one "solid" archive.
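A minimal sketch of the per-file compression suggested above, using `GZIPOutputStream` and `GZIPInputStream` from the standard library. The class name `GzipFileStore` is invented for the example, and UTF-8 is assumed for the XML text. Each XML document becomes its own small compressed file, so updating one of them touches nothing else.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper that stores each XML document as its own gzip-compressed file. */
final class GzipFileStore {

    /** Compresses the XML text and writes it to the given file, replacing any old content. */
    static void save(Path file, String xml) throws IOException {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
    }

    /** Reads the file back and returns the decompressed XML text. */
    static String load(Path file) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file))) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}
```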

Before offering another answer along the lines of using properly structured JARs, I have to ask -- why does the project need to be encapsulated in one file? How do you distribute the program to users to run?

If you must keep a project contained within a single file and be able to replace resources efficiently, yes I would say SQLite is a good choice.
If you do choose to use SQLite, also consider converting some of the XML schemas to one or more SQL tables rather than storing large XML documents as BLOBs.

Using Hibernate to work with Text Files

I am using Hibernate in a Java application to access my database, and it works pretty well with MS-SQL and MySQL. But some of the data I have to show on some forms has to come from text files, and by text files I mean human-readable files. They can be CSV, tab-delimited, or even one key/value pair per line, since my data is that simple, but my preference of course is XML files.
My question is: can I use Hibernate to read those files using HQL, Query, EntityManager and all the other resources Hibernate provides? Which file format should I use, and how do I configure my persistence.xml file to recognize files as tables?

Hibernate is written against the JDBC API. So, you need a JDBC driver that works with the file format you are interested in. Obviously, even for read-only access, this isn't going to perform well, but it might still be useful if that's not a high priority. On a Windows system, you can set up ODBC datasources for delimited text files, Excel files, etc. Then you can set up the JdbcOdbcDriver in your Java application to use this data source.
For most of the applications I work on, I would not consider this approach; I would use an import/export mechanism to convert between a real database (even if it's an in-process database like Berkeley DB or Derby) and the text files. Yes, it's an extra step, but it could be automated, and the performance isn't likely to be much worse than trying to use the text files directly (it will likely be much better overall), and it will be more robust and easier to develop.
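The import step mentioned above could be sketched as follows, assuming a tab-delimited file with one key/value pair per line. The class `TextFileImport`, the table `form_setting` and its columns are all made up for the example. Once the rows are in a real table, Hibernate can map that table like any other entity.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical import step: tab-delimited key/value lines into a database table. */
final class TextFileImport {

    /** Parses tab-delimited lines into two-column rows; blank or malformed lines are skipped. */
    static List<String[]> parse(Reader source) throws IOException {
        List<String[]> rows = new ArrayList<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] cols = line.split("\t", 2);
            if (cols.length == 2) {
                rows.add(cols);
            }
        }
        return rows;
    }

    /** Inserts all rows in one JDBC batch; the table and column names are assumptions. */
    static void importInto(Connection db, List<String[]> rows) throws SQLException {
        String sql = "INSERT INTO form_setting (setting_key, setting_value) VALUES (?, ?)";
        try (PreparedStatement insert = db.prepareStatement(sql)) {
            for (String[] row : rows) {
                insert.setString(1, row[0]);
                insert.setString(2, row[1]);
                insert.addBatch();
            }
            insert.executeBatch();
        }
    }
}
```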

A quick google came up with
JDBC driver for csv files
JDBC driver for XML files
Hope this might provide some inspiration?

Like erickson said, your only hope is finding a JDBC driver for that task. xlsql (a CSV, XML and Excel driver) might fit. After that, you just have to either find or write the simplest Hibernate Dialect that fits your driver.
