I have two large files gathered from Stack Overflow, posts.xml and questions.txt, with the following structure:
posts.xml:
<posts>
<row Id="4" PostTypeId="1" AcceptedAnswerId="7" CreationDate="2008-07-31T21:42:52.667" Score="322" ViewCount="21888" Body="..."/>
<row Id="6" PostTypeId="1" AcceptedAnswerId="31" CreationDate="2008-07-31T22:08:08.620" Score="140" ViewCount="10912" Body="..." />
...
</posts>
A post can be either a question or an answer.
questions.txt:
Id,CreationDate,CreationDatesk,Score
123,2008-08-01 16:08:52,20080801,48
126,2008-08-01 16:10:30,20080801,33
...
I want to make a single pass over posts.xml and index the selected rows (those whose Id appears in questions.txt) with Lucene. Since the XML file is very large (about 50 GB), querying and indexing time matters to me.
So the question is: how can I find all the rows in posts.xml whose Id also appears in questions.txt?
This is my approach so far:
SAXParserDemo.java:
public class SAXParserDemo {
public static void main(String[] args){
try {
File inputFile = new File("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Posts.xml");
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
Handler userhandler = new Handler();
saxParser.parse(inputFile, userhandler);
} catch (Exception e) {
e.printStackTrace();
}
}
}
Handler.java:
public class Handler extends DefaultHandler {
public void getQuestiondId() {
ArrayList<String> qIDs = new ArrayList<String>();
BufferedReader br = null;
try {
String qId;
br = new BufferedReader(new FileReader("D:\\University\\Information Retrieval 2\\Hws\\Hw1\\files\\Q.txt"));
while ((qId = br.readLine()) != null) {
qId = qId.split(",")[0]; //this is question id
findAndIndexOnPost(qId); //find this id on posts.xml then index it!
}
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
private void findAndIndexOnPost(String qID) {
}
@Override
public void startElement(String uri,
String localName, String qName, Attributes attributes)
throws SAXException {
if (qName.equalsIgnoreCase("row")) {
System.out.println(attributes.getValue("Id"));
switch (attributes.getValue("PostTypeId")) {
case "1":
String id = attributes.getValue("Id");
break;
case "2":
break;
default:
break;
}
}
}
}
UPDATE:
I need to keep a pointer into the XML file across iterations, but I don't know how to do this with SAX.
What you have to do is:
1. Read the TXT file (a simple stream will probably do).
2. Add all Id values to a List<Integer> questionIds, one by one (a HashSet<Integer> makes the later lookup much cheaper on large inputs). You will have to parse them manually (with a regex or String.indexOf()).
3. In your Handler implementation, simply check whether questionIds.contains(givenId).
4. Send the received object (from the XML) to Elasticsearch with a simple REST request (POST/PUT).
Ta-da! Your data is now indexed with lucene.
Also, change the way you pass data to the SAX parser. Instead of giving it a File, create an InputStream implementation that tracks its position and pass it to saxParser.parse(inputStream, userhandler). For getting the position in a stream, see: Given a Java InputStream, how can I determine the current offset in the stream?
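The steps above can be sketched roughly as follows. This is only a minimal illustration, not the full indexing pipeline: the class name QuestionFilter, the helper names loadIds and matchingRows, and the tiny in-memory sample data are all made up here, and the ids are kept as strings in a HashSet (rather than a List<Integer>) so that each lookup does not scan the whole collection. The place where a row would be handed to Lucene or Elasticsearch is only marked with a comment.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class QuestionFilter {

    /** Reads the first CSV column of every data line into a set of ids. */
    public static Set<String> loadIds(BufferedReader reader) throws IOException {
        Set<String> ids = new HashSet<>();
        String line = reader.readLine(); // assumption: the first line is the header
        while ((line = reader.readLine()) != null) {
            if (!line.isEmpty()) {
                ids.add(line.split(",", 2)[0].trim());
            }
        }
        return ids;
    }

    /** Streams the XML once and returns the ids of the rows found in the set. */
    public static List<String> matchingRows(InputStream xml, Set<String> ids) throws Exception {
        List<String> matched = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("row".equals(qName) && ids.contains(attrs.getValue("Id"))) {
                    // A real program would index the row here instead of collecting the id.
                    matched.add(attrs.getValue("Id"));
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return matched;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data in the same shape as questions.txt and posts.xml.
        String txt = "Id,CreationDate,CreationDatesk,Score\n4,2008-08-01 16:08:52,20080801,48\n";
        String xml = "<posts><row Id=\"4\" PostTypeId=\"1\"/><row Id=\"6\" PostTypeId=\"1\"/></posts>";
        Set<String> ids = loadIds(new BufferedReader(new StringReader(txt)));
        System.out.println(matchingRows(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), ids));
    }
}
```

Because the id set is loaded first and the XML is streamed exactly once, there is no need to reposition inside the 50 GB file for every id.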
I want to convert XML to JSON and, after some processing, convert it back to valid XML that conforms to my DTD.
I have this method that returns a JSONObject:
public JSONObject xml2JSON(InputStream xml) throws IOException, JDOMException {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
byte[] buffer = new byte[1024];
int len;
while ((len = xml.read(buffer)) > -1 ) {
baos.write(buffer, 0, len);
}
baos.flush();
InputStream is1 = new ByteArrayInputStream(baos.toByteArray());
InputStream is2 = new ByteArrayInputStream(baos.toByteArray());
String s = input2String(is1);
if(validationDTD(is2)) {
return XML.toJSONObject(s);
}
return null;
}
public Boolean validationDTD(InputStream xml) throws JDOMException, IOException {
try {
SAXBuilder builder = new SAXBuilder(XMLReaders.DTDVALIDATING);
Document validDocument = builder.build(xml);
validDocument.getDocType();
return true;
} catch (JDOMException e) {
return false;
} catch (IOException e) {
return false;
}
}
public String input2String(InputStream inputStream) throws IOException {
return IOUtils.toString(inputStream, Charset.defaultCharset());
}
And this method that returns the proper xml:
public String JSONtoXML(JSONObject jsonObject) {
String finalString = DOCTYPE.concat(XML.toString(jsonObject));
return finalString;
}
with a variable for adding the DTD:
private static final String DOCTYPE = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
"<!DOCTYPE ep-request SYSTEM \"myDtd.dtd\">";
I have these tests:
@Test
public void xml2JSONShouldReturnString() throws IOException, JDOMException {
InputStream xmlInputString = this.getClass().getClassLoader().getResourceAsStream("myXmlDtd.xml");
service.xml2JSON(xmlInputString);
}
@Test
public void validateDTDShouldReturnDocument() throws IOException, JDOMException {
InputStream xmlInputString = this.getClass().getClassLoader().getResourceAsStream("myXmlDtd.xml");
Assert.assertEquals(true, service.validationDTD(xmlInputString));
}
@Test
public void JSON2toxmlShouldReturnValidXML() throws IOException, JDOMException {
InputStream xmlInputString = this.getClass().getClassLoader().getResourceAsStream("myXmlDtd.xml");
JSONObject jsonObject = service.xml2JSON(xmlInputString);
String xmlOut = eblService.JSONtoXML(jsonObject);
Assert.assertEquals(true, service.validationDTD(new ByteArrayInputStream(xmlOut.getBytes())));
}
But the last one fails because the XML isn't in the format required by my DTD.
How can I make a valid XML (that matches the DTD)?
EDIT:
Now I'm parsing the XML to a POJO (generated with xjc -dtd mydtd.dtd) and the POJO to JSON, and vice versa.
But I'm having trouble with POJO-to-XML serialization because my POJO contains:
@XmlElements({
@XmlElement(name = "file-reference-id", required = true, type = FileReferenceId.class),
@XmlElement(name = "request-petition", required = true, type = RequestPetition.class)
})
protected List<Object> fileRefenceIdOrRequestPetition;
The problem appears when my POJO contains a List of LinkedHashMap: JAXB reports that LinkedHashMap isn't known to the JAXBContext. But if I change the declared type to LinkedHashMap.class, the context loses FileReferenceId.class, or whichever class is contained in the LinkedHashMap.
There are many different libraries for converting JSON to XML and they all produce different answers, with different strengths and weaknesses. (For example, they all have different solutions to the problem of handling JSON keys that aren't valid XML names.) Generally they don't give you much control over the format of the XML, which means that you typically have to transform the generated XML to the format you actually want, e.g. with an XSLT stylesheet. That will certainly be the case if you have a specific DTD that the XML has to conform to.
Note: The json-to-xml() function in XSLT 3.0 produces XML that directly reflects the JSON grammar, with constructs like
<map>
<string key="first">John</string>
<string key="last">Smith</string>
</map>
The idea here is that users will always want to transform this into their desired target format, and since you're in XSLT already, this transformation poses no problems.
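As a rough sketch of that "generate generic XML, then transform it" step, the following applies a small stylesheet from Java. Several things here are assumptions made for illustration: the class name GenericXmlToTarget, the target element name person, and the stylesheet itself are invented; the input uses the namespace-less form shown above (the real json-to-xml() output is in the http://www.w3.org/2005/xpath-functions namespace); and the stylesheet is written in XSLT 1.0 because that is what the JDK's built-in processor supports. A real DTD-driven target would need its own stylesheet.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class GenericXmlToTarget {

    // Hypothetical stylesheet: turns <map><string key="k">v</string></map>
    // into <person><k>v</k></person>. It assumes every key is a valid XML name.
    static final String XSLT =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
        + "<xsl:template match=\"/map\">"
        + "<person>"
        + "<xsl:for-each select=\"string\">"
        + "<xsl:element name=\"{@key}\"><xsl:value-of select=\".\"/></xsl:element>"
        + "</xsl:for-each>"
        + "</person>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet above to the generic XML and returns the result as a string. */
    public static String transform(String genericXml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(genericXml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String generic = "<map><string key=\"first\">John</string>"
                + "<string key=\"last\">Smith</string></map>";
        System.out.println(transform(generic));
    }
}
```

The point of the design is that the JSON library only has to produce some predictable XML; everything specific to the DTD lives in the stylesheet.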
I want to parse this xml:
<sparql xmlns="http://www.w3.org/2005/sparql-results#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/2001/sw/DataAccess/rf1/result2.xsd">
<head>
<variable name="uri"/>
<variable name="id"/>
<variable name="label"/>
</head>
<results distinct="false" ordered="true">
<result>
<binding name="uri"><uri>http://dbpedia.org/resource/Davis_&amp;_Weight_Motorsports</uri></binding>
<binding name="label"><literal xml:lang="en">Davis &amp; Weight Motorsports</literal></binding>
<binding name="id"><literal datatype="http://www.w3.org/2001/XMLSchema#integer">5918444</literal></binding>
<binding name="label"><literal xml:lang="en">Davis &amp; Weight Motorsports</literal></binding>
</result></results></sparql>
This is my handler:
public class DBpediaLookupClient extends DefaultHandler{
public DBpediaLookupClient(String query) throws Exception {
this.query = query;
HttpMethod method = new GetMethod("some_uri&query=" + query2);
try {
client.executeMethod(method);
InputStream ins = method.getResponseBodyAsStream();
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser sax = factory.newSAXParser();
sax.parse(ins, this);
} catch (HttpException he) {
System.err.println("Http error connecting to lookup.dbpedia.org");
} catch (IOException ioe) {
System.err.println("Unable to connect to lookup.dbpedia.org");
}
method.releaseConnection();
}
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
if (qName.equalsIgnoreCase("td") || qName.equalsIgnoreCase("uri") || qName.equalsIgnoreCase("literal")) {
tempBinding = new HashMap<String, String>();
}
lastElementName = qName;
}
public void endElement(String uri, String localName, String qName) throws SAXException {
if (qName.equalsIgnoreCase("uri") || qName.equalsIgnoreCase("literal") || qName.equalsIgnoreCase("td")) {
if (!variableBindings.contains(tempBinding))
variableBindings.add(tempBinding);
}
}
public void characters(char[] ch, int start, int length) throws SAXException {
String s = new String(ch, start, length).trim();
if (s.length() > 0) {
if ("td".equals(lastElementName)) {
if (tempBinding.get("td") == null) {
tempBinding.put("td", s);
}
}
else if ("uri".equals(lastElementName)) {
if (tempBinding.get("uri") == null) {
tempBinding.put("uri", s);
}
}
else if ("literal".equals(lastElementName)) {
if (tempBinding.get("literal") == null) {
tempBinding.put("literal", s);
}
}
//if ("URI".equals(lastElementName)) tempBinding.put("URI", s);
if ("URI".equals(lastElementName) && s.indexOf("Category")==-1 && tempBinding.get("URI") == null) {
tempBinding.put("URI", s);
}
if ("Label".equals(lastElementName)) tempBinding.put("Label", s);
}
}
}
And this is the result:
key: uri, value: http://dbpedia.org/resource/Davis_
key: literal, value: 5918444
key: literal, value: Davis
As you can see, the text gets cut off at the &.
When I trace through the characters() function, I see that the length is wrong: it runs only up to the & instead of to the end of the string I want.
I copied this part of the code and I don't know much about parsers and handlers; I only know what I learned from tracing the code. Everywhere I searched said that an XML document should contain &amp; instead of &, which is the case here.
What should I change in this code to get the complete string, without it being cut off at the & character?
This is a lesson everyone has to learn when using SAX: the parser can break up text nodes and report the content in multiple calls to characters(), and it's the application's job to reassemble it (e.g. by using a StringBuilder). It's very common for parsers to break the text at any point where it would otherwise have to shunt characters around in memory, e.g. where entity references occur or where it hits an I/O buffer boundary.
It was designed this way to make SAX parsers super-efficient by minimizing text copying, but I suspect there's no real benefit, because the text copying just has to be done by the application instead.
Don't try to second-guess the parser as @DavidWallace suggests. The parser is allowed to break the text up any way it likes, and your application should cater for that.
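A minimal sketch of the reassembly, assuming a stripped-down handler rather than the DBpediaLookupClient from the question (the class name TextCollector and the parse helper are made up for illustration): the buffer is cleared when an element starts, every characters() call appends to it, and the value is read only in endElement().

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private final List<String> values = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // start a fresh buffer for each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text node, so append instead of assigning.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("uri".equals(qName) || "literal".equals(qName)) {
            values.add(text.toString().trim()); // the text is complete only at this point
        }
    }

    /** Parses the XML string and returns the text of every uri and literal element. */
    public static List<String> parse(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<result><binding name=\"label\">"
                + "<literal>Davis &amp; Weight Motorsports</literal></binding></result>";
        System.out.println(parse(xml));
    }
}
```

Clearing the buffer in startElement() is enough here because the elements of interest contain only text; mixed content would need a stack of buffers instead.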
I am trying to write to a text file. I am able to write to the console; however, I am not able to write to my text file. One thing I have noticed is that the String data is empty when I print it to the console, which is probably why nothing appears in my text file. Does anyone know why that is and how I can fix it?
writeFile() method code:
public static void writeFile(String filename, String content) throws IOException
{
try
{
Files.write(Paths.get(filename), content.getBytes()); // write file
}
catch (IOException e)
{
System.out.println("Error writing file: " + e);
}
}
Test code:
public class QuickTest {
public static void main(String... p) throws IOException {
List<SensorInfo> readings = new ArrayList<>();
SensorInfo info = null;
String data = createStringFromInfo(readings);
writeFile("datastore.txt", data);
String filedata = readFile("client-temp.txt");
List<SensorInfo> temps = createInfoFromData(filedata);
System.out.println(header());
for (SensorInfo reading : temps) {
System.out.print(reading.display());
}
}
}
createStringFromInfo() method:
public static String createStringFromInfo(List<SensorInfo> infoList)
{
String data = "";
for (SensorInfo info : infoList)
{
data += info.asData();
}
return data;
}
createInfoFromData() method:
public static List<SensorInfo> createInfoFromData(String data)
{
List<SensorInfo> infoList = new ArrayList<>();
String[] lines = data.split("\n");
for (String line : lines)
{
SensorInfo info = new SensorInfo(line);
infoList.add(info);
}
return infoList;
}
That implementation of createStringFromInfo() confirms my guess: it will return an empty string when its argument is an empty list, as is the case in your program. I agree with you that that is why you get an empty file.
You fix it by filling the readings list with SensorInfo objects describing the information you want written to the file (before you invoke createStringFromInfo()). If the data for those SensorInfo objects should come from reading file "client-temp.txt" then you should read it in first, then pass that List to createStringFromInfo() to get the data to write.
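To make the difference concrete, here is a small sketch. The SensorInfo class below is only a hypothetical stand-in (the question shows just its String constructor and asData(), so the field and the newline-terminated format are assumptions), and the sample readings are made up. It shows that an empty list yields an empty string, while a populated list yields data worth writing.

```java
import java.util.ArrayList;
import java.util.List;

public class SensorDemo {

    // Hypothetical stand-in for the asker's SensorInfo class.
    static class SensorInfo {
        private final String line;

        SensorInfo(String line) {
            this.line = line;
        }

        String asData() {
            return line + "\n"; // assumption: one reading per line
        }
    }

    /** Same logic as in the question, using a StringBuilder instead of += in a loop. */
    public static String createStringFromInfo(List<SensorInfo> infoList) {
        StringBuilder data = new StringBuilder();
        for (SensorInfo info : infoList) {
            data.append(info.asData());
        }
        return data.toString();
    }

    /** Fills the list first, then builds the string to be written. */
    public static String demo(String... lines) {
        List<SensorInfo> readings = new ArrayList<>();
        for (String line : lines) {
            readings.add(new SensorInfo(line));
        }
        return createStringFromInfo(readings);
    }

    public static void main(String[] args) {
        System.out.println("empty list -> [" + demo() + "]");
        System.out.println("two readings -> [" + demo("kitchen,21.5", "attic,17.0") + "]");
    }
}
```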
I have a large XML file in the format below. I can read it line by line and do some string operations, since I only need to extract the values of a couple of fields. But in general, how do we process a file in this format? I found the Mahout XML parser, but I don't think it handles this format.
<?xml version="1.0" encoding="utf-8"?>
<posts>
<row Id="1" PostTypeId="1" AcceptedAnswerId="13" CreationDate="2010-09-13T19:16:26.763" Score="155" ViewCount="160162" Body="&lt;p&gt;This is a common question by those who have just rooted their phones. What apps, ROMs, benefits, etc. do I get from rooting? What should I be doing now?&lt;/p&gt;&#xA;" OwnerUserId="10" LastEditorUserId="16575" LastEditDate="2013-04-05T15:50:48.133" LastActivityDate="2013-09-03T05:57:21.440" Title="I've rooted my phone. Now what? What do I gain from rooting?" Tags="&lt;rooting&gt;&lt;root&gt;" AnswerCount="2" CommentCount="0" FavoriteCount="107" CommunityOwnedDate="2011-01-25T08:44:10.820" />
<row Id="2" PostTypeId="1" AcceptedAnswerId="4" CreationDate="2010-09-13T19:17:17.917" Score="10" ViewCount="966" Body="&lt;p&gt;I have a Google Nexus One with Android 2.2. I didn't like the default SMS-application so I installed Handcent-SMS. Now when I get an SMS, I get notified twice. How can I fix this?&lt;/p&gt;&#xA;" OwnerUserId="7" LastEditorUserId="981" LastEditDate="2011-11-01T18:30:32.300" LastActivityDate="2011-11-01T18:30:32.300" Title="I installed another SMS application, now I get notified twice" Tags="&lt;2.2-froyo&gt;&lt;sms&gt;&lt;notifications&gt;&lt;handcent-sms&gt;" AnswerCount="3" FavoriteCount="2" />
</posts>
The data you have posted is from the SO data dump (I know because I am currently playing with it on Hadoop). Following is the mapper I've written to create a tab-separated file out of it.
You essentially read the file line by line and use the JAXP API to parse each line and extract the required information:
public class StackoverflowDataWranglerMapper extends Mapper<LongWritable, Text, Text, Text>
{
private final Text outputKey = new Text();
private final Text outputValue = new Text();
private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
private DocumentBuilder builder;
private static final Joiner TAG_JOINER = Joiner.on(",").skipNulls();
// 2008-07-31T21:42:52.667
private static final DateFormat DATE_PARSER = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
private static final SimpleDateFormat DATE_BUILDER = new SimpleDateFormat("yyyy-MM-dd");
@Override
protected void setup(Context context) throws IOException, InterruptedException
{
try
{
builder = factory.newDocumentBuilder();
}
catch (ParserConfigurationException e)
{
throw new IOException(e);
}
}
@Override
protected void map(LongWritable inputKey, Text inputValue, Mapper<LongWritable, Text, Text, Text>.Context context)
throws IOException, InterruptedException
{
try
{
String entry = inputValue.toString();
if (entry.contains("<row "))
{
Document doc = builder.parse(new InputSource(new StringReader(entry)));
Element rootElem = doc.getDocumentElement();
String id = rootElem.getAttribute("Id");
String postedBy = rootElem.getAttribute("OwnerUserId").trim();
String viewCount = rootElem.getAttribute("ViewCount");
String postTypeId = rootElem.getAttribute("PostTypeId");
String score = rootElem.getAttribute("Score");
String title = rootElem.getAttribute("Title");
String tags = rootElem.getAttribute("Tags");
String answerCount = rootElem.getAttribute("AnswerCount");
String commentCount = rootElem.getAttribute("CommentCount");
String favoriteCount = rootElem.getAttribute("FavoriteCount");
String creationDate = rootElem.getAttribute("CreationDate");
Date parsedDate = null;
if (creationDate != null && creationDate.trim().length() > 0)
{
try
{
parsedDate = DATE_PARSER.parse(creationDate);
}
catch (ParseException e)
{
context.getCounter("Bad Record Counters", "Posts missing CreationDate").increment(1);
}
}
if (postedBy.length() == 0 || postedBy.trim().equals("-1"))
{
context.getCounter("Bad Record Counters", "Posts with either empty UserId or UserId contains '-1'")
.increment(1);
try
{
parsedDate = DATE_BUILDER.parse("2100-00-01");
}
catch (ParseException e)
{
// ignore
}
}
tags = tags.trim();
String tagTokens[] = null;
if (tags.length() > 1)
{
tagTokens = tags.substring(1, tags.length() - 1).split("><");
}
else
{
context.getCounter("Bad Record Counters", "Untagged Posts").increment(1);
}
outputKey.clear();
outputKey.set(id);
StringBuilder sb = new StringBuilder(postedBy).append("\t").append(parsedDate.getTime()).append("\t")
.append(postTypeId).append("\t").append(title).append("\t").append(viewCount).append("\t").append(score)
.append("\t");
if (tagTokens != null)
{
sb.append(TAG_JOINER.join(tagTokens)).append("\t");
}
else
{
sb.append("").append("\t");
}
sb.append(answerCount).append("\t").append(commentCount).append("\t").append(favoriteCount).toString();
outputValue.set(sb.toString());
context.write(outputKey, outputValue);
}
}
catch (SAXException e)
{
context.getCounter("Bad Record Counters", "Unparsable records").increment(1);
}
finally
{
builder.reset();
}
}
}
I have a method (getSingleNodeValue()) which, when passed an XPath expression, extracts the value of the specified element in the XML document referred to by 'doc'. Assume doc has been initialised as shown below and xmlInput is the buffer containing the XML content.
SAXBuilder builder = new SAXBuilder();
Document doc = null;
XPath xpathInstance = null;
doc = builder.build(new StringReader(xmlInput));
When I call the method, I pass the following XPath expression:
/TOP4A/PERLODSUMDEC/TINPLD1/text()
Here is the method. It basically just takes an xml buffer and uses xpath to extract the value:
public static String getSingleNodeValue(String xpathExpr) throws Exception{
Text list = null;
try {
xpathInstance = XPath.newInstance(xpathExpr);
list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
throw new Exception(e);
}catch (Exception e){
throw new Exception(e);
}
return list==null ? "?" : list.getText();
}
The above method always returns "?", i.e. nothing is found, so 'list' is null.
The XML document it looks at is:
<TOP4A xmlns="http://www.testurl.co.uk/enment/gqr/3232/1">
<HEAD>
<Doc>ABCDUK1234</Doc>
</HEAD>
<PERLODSUMDEC>
<TINPLD1>10109000000000000</TINPLD1>
</PERLODSUMDEC>
</TOP4A>
The same method works with other XML documents, so I am not sure what is special about this one. There is no exception, so the XML is valid. It's just that the method always sets 'list' to null. Any ideas?
Edit
OK, as suggested, here is a simple runnable program that demonstrates the above:
import org.jdom.*;
import org.jdom.input.*;
import org.jdom.xpath.*;
import java.io.IOException;
import java.io.StringReader;
public class XpathTest {
public static String getSingleNodeValue(String xpathExpr, String xmlInput) throws Exception{
Text list = null;
SAXBuilder builder = null;
Document doc = null;
XPath xpathInstance = null;
try {
builder = new SAXBuilder();
doc = builder.build(new StringReader(xmlInput));
xpathInstance = XPath.newInstance(xpathExpr);
list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
throw new Exception(e);
}catch (Exception e){
throw new Exception(e);
}
return list==null ? "Nothing Found" : list.getText();
}
public static void main(String[] args){
String xmlInput1 = "<TOP4A xmlns=\"http://www.testurl.co.uk/enment/gqr/3232/1\"><HEAD><Doc>ABCDUK1234</Doc></HEAD><PERLODSUMDEC><TINPLD1>10109000000000000</TINPLD1></PERLODSUMDEC></TOP4A>";
String xpathExpr = "/TOP4A/PERLODSUMDEC/TINPLD1/text()";
XpathTest xp = new XpathTest();
try {
System.out.println(xp.getSingleNodeValue(xpathExpr, xmlInput1));
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
When I run the above, the output is:
Nothing Found
Edit
I have run some further tests, and it appears that it does work if I remove the namespace URL. I'm not sure why yet. Is there any way I can tell it to ignore the namespace?
Edit
Please also note that the above is implemented on JDK 1.4.1, so I don't have the options available in later versions of the JDK. This is why I had to stick with JDOM.
The problem is with XML namespaces: your XPath query starts by selecting a 'TOP4A' element in no namespace (an unprefixed name in XPath 1.0 never picks up the document's default namespace). Your XML file, however, has a 'TOP4A' element in the 'http://www.testurl.co.uk/enment/gqr/3232/1' namespace instead.
Is it an option to remove the xmlns from the XML?
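If removing the xmlns is not possible, one way to make the expression ignore namespaces is to match on local-name() in every step. The sketch below demonstrates this with the JDK's own javax.xml.xpath API purely so that it runs without extra libraries; that API needs Java 5 or later, so it is not what you would use on JDK 1.4. The class name NamespaceAgnosticXPath and the evaluate helper are invented for the example. Since local-name() is plain XPath 1.0, the same expression string should also be usable with JDOM's XPath.newInstance(), although that is not tested here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespaceAgnosticXPath {

    /** Evaluates an XPath expression against the XML and returns its string value. */
    public static String evaluate(String xml, String expr) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // keep the namespace so its effect is visible
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<TOP4A xmlns=\"http://www.testurl.co.uk/enment/gqr/3232/1\">"
                + "<PERLODSUMDEC><TINPLD1>10109000000000000</TINPLD1></PERLODSUMDEC></TOP4A>";

        // Unprefixed names match only elements in no namespace, so this finds nothing.
        String plain = "/TOP4A/PERLODSUMDEC/TINPLD1/text()";
        // Matching on local-name() ignores whatever namespace the elements are in.
        String anyNs = "/*[local-name()='TOP4A']/*[local-name()='PERLODSUMDEC']"
                + "/*[local-name()='TINPLD1']/text()";

        System.out.println("plain: [" + evaluate(xml, plain) + "]");
        System.out.println("local-name(): [" + evaluate(xml, anyNs) + "]");
    }
}
```

The alternative is to keep the short path and bind a prefix to the namespace URI, writing the steps as prefix:TOP4A and so on; as far as I recall, JDOM 1.x's XPath class offers addNamespace(prefix, uri) for that, but check the Javadoc of your version.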