I have a web server which returns JSON data that I would like to load into an Apache Spark DataFrame. Right now I have a shell script that uses wget to write the JSON data to a file and then runs a Java program that looks something like this:
DataFrame df = sqlContext.read().json("example.json");
I have looked at the Apache Spark documentation and there doesn't seem to be a way to join these two steps together automatically. There must be a way of requesting JSON data in Java, storing it as an object and then converting it to a DataFrame, but I haven't been able to figure it out. Can anyone help?
You could store JSON data into a list of Strings like:
final String JSON_STR0 = "{\"name\":\"0\",\"address\":{\"city\":\"0\",\"region\":\"0\"}}";
final String JSON_STR1 = "{\"name\":\"1\",\"address\":{\"city\":\"1\",\"region\":\"1\"}}";
List<String> jsons = Arrays.asList(JSON_STR0, JSON_STR1);
where each String represents a JSON object.
Then you could transform the list into an RDD (sc is assumed to be a JavaSparkContext, so parallelize returns a JavaRDD):
JavaRDD<String> jsonRDD = sc.parallelize(jsons);
Once you have the RDD, creating a DataFrame is straightforward:
DataFrame data = sqlContext.read().json(jsonRDD);
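To close the gap between the web server and the list of Strings, the HTTP response can be read directly in Java instead of going through wget. The following is only a sketch: JsonLineFetcher and its methods are hypothetical helper names, not part of Spark, and it assumes the server returns one JSON object per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Spark): collects one JSON object per line.
final class JsonLineFetcher {

    // Reads every non-blank line from the reader.
    // Assumption: each line holds exactly one complete JSON object.
    static List<String> readJsonLines(Reader source) {
        List<String> jsons = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    jsons.add(trimmed);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return jsons;
    }

    // Opens the URL and reads the response body as UTF-8 text.
    static List<String> fetchJsonLines(String url) {
        try {
            return readJsonLines(
                    new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The returned list can then be passed to sc.parallelize(...) and sqlContext.read().json(...) in place of the hard-coded strings. If the server returns a single pretty-printed JSON document spread over several lines, the body would need to be read into one String instead of being split by line.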
Is it possible to get URLs into Nutch directly from a database or a service? I'm not interested in approaches where the data is taken from the database or service and written to seed.txt.
No. This cannot be done directly with the default Nutch codebase. You need to modify Injector.java to achieve that.
EDIT:
Try using DBInputFormat, an InputFormat that reads input data from an SQL table. You need to modify the Inject code here (line 3 in the snippet below), where the input is currently taken from a file path:
JobConf sortJob = new NutchJob(getConf());
sortJob.setJobName("inject " + urlDir);
FileInputFormat.addInputPath(sortJob, urlDir);
sortJob.setMapperClass(InjectMapper.class);
I have the following feed from my vendor:
http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule
I want to read the data from that XML feed into Java objects so that I can insert it into my database regularly.
The feed contains regular updates from the vendor, which I use to keep my website up to date.
Can you suggest what options are available to get this working?
Should I use a web service, or just XStream,
to get my final output? Please advise, as I am new to this.
The vendor has told me that he can provide the data in three formats: RSS, XML, or JSON. I am not sure which is the easiest and cheapest to consume.
I would suggest just writing a program that parses the XML and inserts the data directly into your database.
Example
This Groovy script inserts the data into an H2 database.
//
// Dependencies
// ============
import groovy.sql.Sql
@Grapes([
    @Grab(group='com.h2database', module='h2', version='1.3.163'),
    @GrabConfig(systemClassLoader=true)
])
//
// Main program
// ============
def sql = Sql.newInstance("jdbc:h2:db/cricket", "user", "pass", "org.h2.Driver")
def dataUrl = new URL("http://scores.cricandcric.com/cricket/getFeed?key=4333433434343&format=xml&tagsformat=long&type=schedule")
dataUrl.withReader { reader ->
    def feeds = new XmlSlurper().parse(reader)
    feeds.matches.match.each {
        def data = [
            it.id,
            it.name,
            it.type,
            it.tournamentId,
            it.location,
            it.date,
            it.GMTTime,
            it.localTime,
            it.description,
            it.team1,
            it.team2,
            it.teamId1,
            it.teamId2,
            it.tournamentName,
            it.logo
        ].collect {
            it.text()
        }
        sql.execute("INSERT INTO matches (id,name,type,tournamentId,location,date,GMTTime,localTime,description,team1,team2,teamId1,teamId2,tournamentName,logo) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)", data)
    }
}
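If you need the records as Java objects first, as the question asks, the same parsing step can be done with the DOM parser that ships with the JDK. This is a rough sketch only: the Match and MatchFeedParser classes are hypothetical, only four of the fields are mapped, and the layout (match elements nested under a matches element) is assumed from the script above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical value object holding a few of the feed's fields.
final class Match {
    final String id;
    final String name;
    final String team1;
    final String team2;

    Match(String id, String name, String team1, String team2) {
        this.id = id;
        this.name = name;
        this.team1 = team1;
        this.team2 = team2;
    }
}

// Hypothetical parser: turns the feed XML into a list of Match objects.
final class MatchFeedParser {

    static List<Match> parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Assumption: every record is a <match> element somewhere in the document.
            NodeList nodes = doc.getElementsByTagName("match");
            List<Match> matches = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                matches.add(new Match(text(e, "id"), text(e, "name"),
                        text(e, "team1"), text(e, "team2")));
            }
            return matches;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse feed", ex);
        }
    }

    // Returns the text of the first child element with the given tag, or "" if it is absent.
    private static String text(Element parent, String tag) {
        NodeList children = parent.getElementsByTagName(tag);
        return children.getLength() == 0 ? "" : children.item(0).getTextContent();
    }
}
```

The resulting list can be handed to whatever persistence code you already have, for example a JDBC PreparedStatement with one parameter per field.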
Well... you could use an XML parser (stream or DOM), or a JSON parser (again stream or 'DOM'), and build the objects on the fly. But with this data, which seems to consist of records of cricket matches, why not go with a CSV format?
This seems to be your basic 'datum':
<id>1263</id>
<name>Australia v India 3rd Test at Perth - Jan 13-17, 2012</name>
<type>TestMatch</type>
<tournamentId>137</tournamentId>
<location>Perth</location>
<date>2012-01-14</date>
<GMTTime>02:30:00</GMTTime>
<localTime>10:30:00</localTime>
<description>3rd Test day 2</description>
<team1>Australia</team1>
<team2>India</team2>
<teamId1>7</teamId1>
<teamId2>1</teamId2>
<tournamentName>India tour of Australia 2011-12</tournamentName>
<logo>/cricket/137/tournament.png</logo>
Of course you would still have to parse the CSV and deal with delimiter characters (such as a ' or a " inside a string), but it will reduce your network traffic quite substantially and will likely parse much faster on the client. Of course, this depends on what your client is.
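To illustrate the delimiting issue, here is a minimal sketch of parsing one CSV record in Java. CsvLine is a hypothetical helper, and it assumes comma separators, double-quoted fields, and a doubled "" as the escape for a literal quote; it does not handle records that span several lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a single CSV record into its fields.
final class CsvLine {

    static List<String> parse(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        // A doubled quote inside a quoted field is a literal quote.
                        current.append('"');
                        i++;
                    } else {
                        // A single quote closes the quoted field.
                        inQuotes = false;
                    }
                } else {
                    current.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                // An unquoted comma ends the current field.
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```

For anything beyond simple feeds, an established CSV library is likely a safer choice than hand-written parsing.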
Effectively, you have a RESTful data source that can return data in several formats; you only need to read from it, and no further interaction is needed.
So you can use any XML parser to parse the XML and put the extracted data into whatever data structure you want or already have.
I have not heard of XStream, but you can find more information about selecting the best parser for your situation at this StackOverflow question.
I want to save the result of an XSQL query to a file using Java.
Does anyone know a way to do this?
The Oracle document on Using XSQL in Java Programs has instructions on how to call an XSQL from Java and get the result as an XMLDocument or send it to a PrintWriter or OutputStream. There's a short example program there that sends the result to System.out, but it could be easily modified to send it to a file.
Is there a simple Java library or approach that will take a SQL query and load the data from a CSV file into an Oracle database? Please help.
You don't have to use Java to load a data file into a table unless it is absolutely necessary. Instead, I'd recommend Oracle's command-line SQL*Loader utility, which was designed specifically for this purpose.
For similar tasks I usually use Groovy scripts, as they are easy and quick to write and, of course, run on the JVM.
...an example:
import groovy.sql.Sql
def file1 = new File(/C:\Documents and Settings\USER\Desktop\Book1.csv/)
def reader = new FileReader(file1)
def sql = Sql.newInstance("jdbc:oracle:thin:@XXXXXX:XXXX:XXX", "SCHEMA",
    "USER", "oracle.jdbc.driver.OracleDriver")
reader.each { line ->
    fields = line.split(';')
    sql.executeInsert("insert into YOUR_TABLE values(${fields[0]},${fields[1]},${fields[2]})")
}
It's a basic example. If you have double quotes and semicolons inside the values in your CSV, you will probably want to use something like OpenCSV to handle that.
You could transform each line in the CSV with regular expressions, to make an insert query, and then send to Oracle (with JDBC).
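As a sketch of the JDBC route, the following swaps the regular-expression rewriting for a parameterized PreparedStatement with batching, which avoids quoting problems in the generated SQL. CsvBatchLoader is a hypothetical class; the table name, column count and delimiter are supplied by the caller, and the table name is assumed to come from trusted code because it is concatenated into the statement.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.regex.Pattern;

// Hypothetical helper: loads delimited text into a table with batched inserts.
final class CsvBatchLoader {

    // Builds e.g. "INSERT INTO YOUR_TABLE VALUES (?,?,?)" for three columns.
    static String buildInsertSql(String table, int columnCount) {
        String placeholders = String.join(",", Collections.nCopies(columnCount, "?"));
        return "INSERT INTO " + table + " VALUES (" + placeholders + ")";
    }

    // Reads the input line by line and inserts every non-empty line as one row.
    // Returns the number of rows added to the batch.
    static int load(Connection connection, Reader csv, String table,
                    int columnCount, String delimiter) throws IOException, SQLException {
        int rows = 0;
        try (BufferedReader reader = new BufferedReader(csv);
             PreparedStatement insert =
                     connection.prepareStatement(buildInsertSql(table, columnCount))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                // Plain split: quoted fields containing the delimiter are not handled here.
                String[] fields = line.split(Pattern.quote(delimiter), -1);
                for (int i = 0; i < columnCount; i++) {
                    // Missing trailing fields are bound as NULL.
                    insert.setString(i + 1, i < fields.length ? fields[i] : null);
                }
                insert.addBatch();
                rows++;
            }
            insert.executeBatch();
        }
        return rows;
    }
}
```

With an open Oracle JDBC connection this could be called as CsvBatchLoader.load(connection, new FileReader("Book1.csv"), "YOUR_TABLE", 3, ";"), where the file and table names are placeholders.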
I think this tool will help you for any type of database import-export problem.
http://www.dmc-fr.com/home_en.php
Do you have that CSV in a file on the database server or can you store it there? Then you may try to have Oracle open it by declaring a DIRECTORY object for the path the file is in and then create an EXTERNAL TABLE which you can query in SQL afterwards. Oracle does the parsing of the file for you.
If you are open to Python, you can do a bulk load using SQL*Loader:
import os
from subprocess import Popen, PIPE
# userid, datafile, ctlfile and fn (base name for the log files) are assumed to be defined earlier
loadConf=('sqlldr userid=%s DATA=%s control=%s LOG=%s.log BAD=%s.bad DISCARD=%s.dsc' % (userid,datafile, ctlfile,fn,fn,fn)).split(' ')
p = Popen(loadConf, stdin=PIPE, stdout=PIPE, stderr=PIPE, shell=False, env=os.environ)
output, err = p.communicate()
It will be much faster than row-by-row inserts.
I uploaded a basic working example here.