How to extract "infobox company" data from wiki dumps

I have downloaded a large Wikipedia XML dump from https://dumps.wikimedia.org/enwiki/20170520/
I want to extract the company name and parent company from this dump. All the company data is stored in a template like the one below:
{{Infobox company
| name =
| logo =
| type =
| industry =
| fate =
| predecessor = <!-- or: | predecessors = -->
| successor = <!-- or: | successors = -->
| founded = <!-- if known: {{Start date and age|YYYY|MM|DD}} in [[city]], [[state]], [[country]] -->
| founder = <!-- or: | founders = -->
| defunct = <!-- {{End date|YYYY|MM|DD}} -->
| hq_location_city =
| hq_location_country =
| area_served = <!-- or: | areas_served = -->
| key_people =
| products =
| owner = <!-- or: | owners = -->
| num_employees =
| num_employees_year = <!-- Year of num_employees data (if known) -->
| parent =
| website = <!-- {{URL|example.com}} -->
}}
I did some research and found the JWPL MediaWiki parser.
Reference: https://github.com/dkpro/dkpro-jwpl/blob/master/de.tudarmstadt.ukp.wikipedia.parser/src/main/java/de/tudarmstadt/ukp/wikipedia/parser/tutorial/T1_SimpleParserDemo.java
https://dkpro.github.io/dkpro-jwpl/JWPLParser/
I tried to use this parser, but it requires the input as a String. My wiki dump XML file is 60 GB, so I can't read it into a String and keep it in memory. Also, the MediaWiki parser has no documentation on how to find a specific element such as Infobox company and extract its name and other fields. Below is the sample code for the MediaWiki parser:
public static void main(String[] args) throws IOException {
File file = new File("C:/Users/njaiswal/Downloads/accenture_data_from_wikidumps.xml");
String str = FileUtils.readFileToString(file);
// get a ParsedPage object
MediaWikiParserFactory pf = new MediaWikiParserFactory();
MediaWikiParser parser = pf.createParser();
ParsedPage pp = parser.parse(str);
// get the sections
for (Section section : pp.getSections()) {
System.out.println("section : " + section.getTitle());
System.out.println(" nr of paragraphs : " + section.nrOfParagraphs());
System.out.println(" nr of tables : " + section.nrOfTables());
System.out.println(" nr of nested lists : " + section.nrOfNestedLists());
System.out.println(" nr of definition lists: " + section.nrOfDefinitionLists());
for (Link link : section.getLinks(Link.type.INTERNAL)) {
System.out.println(" " + link.getTarget());
}
}
}
Is there any other parser that can solve my problem? Or can I use the same MediaWiki parser to get to "Infobox company" and extract its fields? Any help is appreciated. Thanks.
Update: I tried the wikixmlj parser that Khalil suggested. I can get all the "Infobox" data, but I want to limit this to "Infobox company" data. Below is my code and output:
import edu.jhu.nlp.wikipedia.*;
public class Test {
public static void main(String[] args) throws Exception{
WikiXMLParser parser = WikiXMLParserFactory.getSAXParser("C:/Users/njaiswal/Downloads/enwiki-20170520-pages-articles-multistream.xml/enwiki-20170520-pages-articles-multistream.xml");
parser.setPageCallback(new PageCallbackHandler() {
public void process(WikiPage page) {
try {
InfoBox infobox=page.getInfoBox();
System.out.println(infobox.dumpRaw());
} catch (WikiTextParserException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
//do something with info box
}
});
parser.parse();
}
}
Output:
{{Infobox Monarch
| name = Attila
| title = [[List of Hunnic rulers|Ruler]] of the [[Hunnic Empire]]
| place of burial =
}}
{{Infobox sea
| name = Aegean Sea
| image = Aegean Sea map.png
| caption = Map of the Aegean Sea
| pushpin_map = World
| pushpin_map_alt = World
| pushpin_label_position = right
}}
{{Infobox company
| name = Audi AG
| logo = Audi-Logo 2016.svg
| logo_size = 235
| image = Audi Ingolstadt.jpg
| image_size = 265
}}

I have used wikixmlj before; it is a very simple parser. This should handle it:
// dumpPath should look like "C:/your/path/articles.xml.bz2"
WikiXMLParser wxsp = WikiXMLParserFactory.getSAXParser(dumpPath);
wxsp.setPageCallback(new PageCallbackHandler() {
@Override
public void process(WikiPage page) {
try {
InfoBox infoBox = page.getInfoBox();
// guard: the infobox may be null for pages that have none
if (infoBox == null) return;
// (?s) lets '.' match newlines, (?i) ignores case; both braces are escaped
Pattern pattern = Pattern.compile("(?is)\\{\\{\\s*Infobox company.+");
// match against the raw infobox text, not the InfoBox object
Matcher matcher = pattern.matcher(infoBox.dumpRaw());
while (matcher.find()) {
System.out.println(matcher.group(0));
}
} catch (WikiTextParserException e) {
e.printStackTrace();
}
}
});
wxsp.parse();
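Once the raw "Infobox company" text is available, the individual fields such as name and parent still have to be pulled out of it. The following is only a minimal sketch of one way to do that with plain java.util.regex. The class name InfoboxFields, the method companyFields, and the sample values are hypothetical. It assumes each field sits on its own "| key = value" line, so multi-line values and nested templates that span lines are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InfoboxFields {
    // One "| key = value" line; the value runs to the end of that line.
    private static final Pattern FIELD =
            Pattern.compile("(?m)^\\s*\\|\\s*([A-Za-z0-9_ ]+?)\\s*=[ \\t]*(.*)$");

    /** Returns the fields of a raw infobox, or an empty map if it is not "Infobox company". */
    static Map<String, String> companyFields(String rawInfobox) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (rawInfobox == null
                || !rawInfobox.trim().toLowerCase().startsWith("{{infobox company")) {
            return fields;
        }
        Matcher m = FIELD.matcher(rawInfobox);
        while (m.find()) {
            fields.put(m.group(1), m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample text in the shape of the template above.
        String raw = "{{Infobox company\n"
                + "| name = Example Corp\n"
                + "| parent = Example Holdings\n"
                + "| website = \n"
                + "}}";
        Map<String, String> fields = companyFields(raw);
        System.out.println(fields.get("name"));   // prints Example Corp
        System.out.println(fields.get("parent")); // prints Example Holdings
    }
}
```

Inside the page callback, the string returned by dumpRaw() could be passed to companyFields, and pages that yield an empty map skipped.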

Related

How to find duplicate elements in a Stream in Java

I'm trying to find duplicate entries in map values. But the thing is the list of values have multiple attributes/properties. Basically, if a title shows up more than once in a database, I would mark one entry as unique and mark the rest as duplicates.
Here's my current code:
// I have a Map that looks like...
host1 : id | title | host1 | url | state | duplicate
id | title | host1 | url | state | duplicate
host2 : id | title | host2 | url | state | duplicate
id | title | host2 | url | state | duplicate
for (Map.Entry<String, List<Record>> e : recordsByHost.entrySet()) {
boolean executed = false;
for (Record r : e.getValue()) {
int frequency = Collections.frequency(
e
.getValue()
.stream()
.map(Record::getTitle)
.collect(Collectors.toList()),
r.getTitle()
);
if ((frequency > 1) && (!executed)) {
markDuplicates(r.getId(), r.getTitle());
executed = true;
} else {
executed = false;
}
The issue is that when the frequency is more than 2 (three records with the same title), the condition evaluates to false for the third record, so the second duplicate is treated as "unique".
I've been trying to rework my logic but I'm afraid I'm stuck. Any help / suggestions to get me unstuck would be greatly appreciated.

Set.add (and in fact, Collection.add) returns true if and only if the value was actually added to the Set. Since a Set always enforces uniqueness, you can use this to find duplicates:
void markDuplicates(Iterable<? extends Record> records) {
Set<String> foundTitles = new HashSet<>();
for (Record r : records) {
String title = r.getTitle();
if (title != null && !foundTitles.add(title)) {
// title was not added, because it's already been found.
markAsDuplicate(r);
}
}
}
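To see the Set.add behaviour in isolation, here is a small self-contained sketch that works on plain title strings instead of the Record class from the question. The class name DuplicateTitles and the sample titles are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateTitles {
    /** Returns the indexes of titles that repeat an earlier title in the list. */
    static List<Integer> duplicateIndexes(List<String> titles) {
        Set<String> seen = new HashSet<>();
        List<Integer> duplicates = new ArrayList<>();
        for (int i = 0; i < titles.size(); i++) {
            String title = titles.get(i);
            // add() returns false when the title is already in the set
            if (title != null && !seen.add(title)) {
                duplicates.add(i);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList("a", "b", "a", "a", "c");
        System.out.println(duplicateIndexes(titles)); // prints [2, 3]
    }
}
```

The first "a" is kept as the unique entry and both later occurrences are reported, which covers the three-records case from the question.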

Better practice for custom rows and cells in a tableview JavaFX

I made a table view that looks like the following:
To do this, I did the following:
1.- Create an observable list of the POJO that represents the table "Modulo" in the MySQL database; with this list I create the columns of the table view using this code:
public ObservableList<Modulo> cargaTablaHabilitadosMasHoras(Connection conex){
ObservableList<Modulo> listaMod = FXCollections.observableArrayList();
String sql = "SELECT id_Mod, nombre_Mod, \n" +
" Turno_Mod, capacidad_Mod, id_Us, \n" +
" status_Mod \n"+
"FROM Modulo;";
//Static first column
listaMod.add(new Modulo("123", "Horario", "Rhut", 10, "123", true));
try(PreparedStatement stta = conex.prepareStatement(sql);
ResultSet res = stta.executeQuery()) {
while (res.next()) {
if (res.getBoolean("status_Mod")) {
listaMod.add(new Modulo( res.getString ("id_Mod"),
res.getString ("nombre_Mod"),
res.getString ("Turno_Mod"),
res.getInt ("capacidad_Mod"),
res.getString ("id_Us"),
res.getBoolean("status_Mod")));
}
}
}catch (SQLException ex) {
ex.printStackTrace();
}
return listaMod;
}
2.- Create a table with the custom data with this code:
public void otraTabla(Connection conex){
//loads the observable list of the POJO that represents the table Modulo
columns = modu.cargaTablaHabilitadosMasHoras(conex);
/*
creates and observable list that is going to be the base
of the tableview creating a grid of 8 x number of colums
obtained of the first list + 1 column that represents the hours
*/
ObservableList<String> row = FXCollections.observableArrayList();
row.addAll("1","2","3","4","5","6","7","8");
//for loop that iterates the tableview columns
for(Modulo columName : columns) {
//creates and column object to be integrated and manipulated
//whit the name of the column in the first list
TableColumn<String, String> col = new TableColumn(columName.getNombre_Mod());
//verify if is the first column with contains the hours
if (columName.getNombre_Mod().equals("Horario")) {
//if is the one creates the rows with the hours staring at 6 am
col.setCellValueFactory(cellData -> {
//star at 6 am
LocalTime lol = LocalTime.of(6, 0);
//get the value of ObservableList<String> row for for adding to LocalTime
Integer p = Integer.valueOf(cellData.getValue());
//adds the value to localtime
lol = lol.plusHours(p);
//Gives a format for the hour
DateTimeFormatter kk = DateTimeFormatter.ofPattern("hh:mm");
//returns the new String
return new SimpleStringProperty(lol.format(kk));
});
}else{
//if is a column load dinamically then gets
//the next date where there is space in the column at that time
col.setCellValueFactory(cellData -> {
String regresaFecha = "";
//The connection to the database has to be opened inside the lambda,
//or else the connection is lost
try(Connection localConnection = dbConn.conectarBD()) {
//get the level of the row in this case the hour
LocalTime lol = LocalTime.of(6, 0);
Integer p = Integer.valueOf(cellData.getValue());
lol = lol.plusHours(p);
//calls the method that calculed the next date where there is space in the table of the database
LocalDate fechaApunter = rehab.compararDiaADia(localConnection, Date.valueOf(LocalDate.now()),
Time.valueOf(lol), columName.getId_Mod(), columName.getCapacidad_Mod(), 30);
//date sent to the row of the tableview
regresaFecha = fechaApunter.toString();
} catch (SQLException e) {
e.printStackTrace();
}
return new SimpleStringProperty(regresaFecha);
});
}
//change color of the date depending of the
//distant relevant to the day is making the query to the database
if (!columName.getNombre_Mod().equals("Horario")) {
col.setCellFactory (coli -> {
TableCell<String, String> cell = new TableCell<String, String>() {
@Override
public void updateItem(String item, boolean empty) {
super.updateItem(item, empty);
if (item != null) {
LocalDate lol = LocalDate.parse(item);
Text text = new Text(item);
if (lol.isAfter(LocalDate.now())) {
if (lol.isAfter(LocalDate.now().plusDays(5))) {
text.setStyle(" -fx-fill: #990000;" +
" -fx-text-alignment:center;");
}else
text.setStyle(" -fx-fill: #cccc00;" +
" -fx-text-alignment:center;");
}
this.setGraphic(text);
}
}
};
return cell;
});
}
//add the column to the tableview
tvDisponivilidad.getColumns().addAll(col);
}
//add the Observable list place holder
tvDisponivilidad.setItems(row);
}
For loading the data I used this method:
public LocalDate compararDiaADia(Connection conex, Date fecha, Time hora,
String id_Mod, int capacidad, int dias){
LocalDate contador = fecha.toLocalDate();
LocalDate disDeHoy = LocalDate.now();
for (int i = 0; i < dias; i++) {
contador = fecha.toLocalDate();
contador = contador.plusDays(i);
String sttm = "SELECT COUNT(id_Reab) AS Resultado\n" +
"FROM Rehabilitacion\n" +
"WHERE '"+contador+"' BETWEEN inicio_Reab AND fin_Reab\n" +
"AND horario_Reab = '"+hora+"'\n" +
"AND id_Modulo = '"+id_Mod+"';";
try(PreparedStatement stta = conex.prepareStatement(sttm);
ResultSet res = stta.executeQuery(); ) {
if (res.next()) {
if (res.getInt("Resultado") < capacidad || res.getInt("Resultado") == 0) {
disDeHoy = contador;
break;
}else
disDeHoy = contador;
}
} catch (SQLException ex) {
ex.printStackTrace();
}
}
return disDeHoy;
}
For each column, this method finds the next day on which the module has fewer bookings than its capacity at a given hour (each module has a different capacity) and returns that day. The method is called with a different hour for each row to populate the whole table.
There are several problems with my approach. The first is the time it takes to load the table: about one minute to run the queries and populate it. Several factors contribute, but the main one is that I query the database once for every day. For example:
Here is my table where I made the queries:
mysql> SELECT * FROM imssRehab.Rehabilitacion;
+---------+-------------+------------+-----------------+---------+-----------+
| id_Reab | inicio_Reab | fin_Reab | horario_Reab | id_Prog | id_Modulo |
+---------+-------------+------------+-----------------+---------+-----------+
| 1 | 2016-06-01 | 2016-06-10 | 07:00:00.000000 | 1 | 215A3 |
| 2 | 2016-06-01 | 2016-06-10 | 07:00:00.000000 | 1 | 215A3 |
| 3 | 2016-06-01 | 2016-06-10 | 07:00:00.000000 | 1 | 215A3 |
| 4 | 2016-06-01 | 2016-06-10 | 07:00:00.000000 | 1 | 215A3 |
| 5 | 2016-06-01 | 2016-06-10 | 07:00:00.000000 | 1 | 215A3 |
| 6 | 2016-06-01 | 2016-06-10 | 07:00:00.000000 | 1 | 215A3 |
+---------+-------------+------------+-----------------+---------+-----------+
Here is my query:
SELECT COUNT(id_Reab) AS Resultado
FROM Rehabilitacion
WHERE '2016-06-01' BETWEEN inicio_Reab AND fin_Reab
AND horario_Reab = '07:00'
AND id_Modulo = '215A3';
The result is 6, but this module's capacity is 5, so I have to advance a day and ask again until I find a day with fewer than 5 bookings, which in this example is 2016-06-11. To get there I have to make 10 queries and open 10 connections. I use a connection pool and it's very efficient, but it gets overwhelmed, because these 10 queries cover only the first row of the first column. Normally there are between 15 and 20 columns, so even assuming only one query per cell it is still around 120-160 connections.
I try to reuse a connection whenever I can. My first instinct was to use the connection that gets passed to the method that loads the ObservableList of modules, but when I do this, the method that queries the dates receives a closed connection for no apparent reason. After many tests I concluded that it has something to do with the lambda passed to setCellValueFactory, so if I want a connection it has to be opened inside the lambda, creating more connections. I tried to alleviate this by loading the table in a different thread with a Task, but the results were similar.
One solution would be to make a POJO specifically for the table, but I don't think it's possible to create a class dynamically. I could have a POJO with 20 possible columns and only load the columns that I use, but what happens when there are more than 20 columns or the names of the modules change?
So my question is this: how can I make the table load faster? And is there a better way to build this table? I don't like my solution; the code is more complex than I would like, and I'm hoping for a better and cleaner way.
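The day-by-day search described above does not strictly need one query per day. As a rough sketch only, and assuming the bookings for one module and one hour have already been loaded with a single query, the same search can run in memory. The names FirstFreeDay, Booking, and firstFreeDay are hypothetical, and how the rows are loaded from Rehabilitacion is left out.

```java
import java.time.LocalDate;
import java.util.Collections;
import java.util.List;

public class FirstFreeDay {
    /** A booking that occupies one place on every day from start to end, inclusive. */
    static final class Booking {
        final LocalDate start;
        final LocalDate end;
        Booking(LocalDate start, LocalDate end) { this.start = start; this.end = end; }
        boolean covers(LocalDate day) { return !day.isBefore(start) && !day.isAfter(end); }
    }

    /**
     * Returns the first day in [from, from + days) with fewer bookings than capacity.
     * If no day is free, returns the last day checked, roughly as compararDiaADia does.
     */
    static LocalDate firstFreeDay(List<Booking> bookings, LocalDate from, int capacity, int days) {
        LocalDate last = from;
        for (int i = 0; i < days; i++) {
            LocalDate day = from.plusDays(i);
            long used = bookings.stream().filter(b -> b.covers(day)).count();
            if (used < capacity) {
                return day;
            }
            last = day;
        }
        return last;
    }

    public static void main(String[] args) {
        // Six bookings from 2016-06-01 to 2016-06-10, as in the sample table.
        List<Booking> bookings = Collections.nCopies(6,
                new Booking(LocalDate.of(2016, 6, 1), LocalDate.of(2016, 6, 10)));
        System.out.println(firstFreeDay(bookings, LocalDate.of(2016, 6, 1), 5, 30));
        // prints 2016-06-11
    }
}
```

With this shape, the number of database round trips depends on the number of modules and hours rather than on the number of days scanned.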

QuerySolution, keep the "< >"

I'm currently stuck on my project on creating a Fuseki Triple Store Browser. I need to visualize all the data from a TripleStore and make the app browsable. The only problem is that the QuerySolution leaves out the "< >" that are in the triplestore.
If I use the ResultSetFormatter.asText(ResultSet) it returns this:
-------------------------------------------------------------------------------------------------------------------------------------
| subject | predicate | object |
=====================================================================================================================================
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq> |
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1> | <urn:animals:lion> |
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2> | <urn:animals:tarantula> |
| <urn:animals:data> | <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3> | <urn:animals:hippopotamus> |
-------------------------------------------------------------------------------------------------------------------------------------
Notice that some of the data contains the less-than and greater-than signs "<" and ">". As soon as I try to parse the data from the ResultSet, those signs are removed, so the data looks like this:
-------------------------------------------------------------------------------------------------------------------------------
| subject | predicate | object |
===============================================================================================================================
| urn:animals:data | http://www.w3.org/1999/02/22-rdf-syntax-ns#type | http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq |
| urn:animals:data | http://www.w3.org/1999/02/22-rdf-syntax-ns#_1 | urn:animals:lion |
| urn:animals:data | http://www.w3.org/1999/02/22-rdf-syntax-ns#_2 | urn:animals:tarantula |
| urn:animals:data | http://www.w3.org/1999/02/22-rdf-syntax-ns#_3 | urn:animals:hippopotamus |
As you can see, the data doesn't contain the "<" and ">" signs.
This is how I parse the data from the ResultSet:
while (rs.hasNext()) {
// Moves onto the next result
QuerySolution sol = rs.next();
// Return the value of the named variable in this binding.
// A return of null indicates that the variable is not present in
// this solution
RDFNode object = sol.get("object");
RDFNode predicate = sol.get("predicate");
RDFNode subject = sol.get("subject");
// Fill the table with the data
DefaultTableModel modelTable = (DefaultTableModel) this.getModel();
modelTable.addRow(new Object[] { subject, predicate, object });
}
It's quite hard to explain this problem, but is there a way to keep the "< >" signs after parsing the data?

The '<>' are used by the formatter to indicate that the value is a URI rather than a string: so "http://example.com/" is a literal text value, whereas <http://example.com/> is a URI.
You can do the same yourself:
RDFNode node; // subject, predicate, or object
if (node.isURIResource()) {
return "<" + node.asResource().getURI() + ">";
} else {
...
}
But it's much easier to use FmtUtils:
String nodeAsString = FmtUtils.stringForRDFNode(subject); // or predicate, or object
What you need to do is get that code invoked when the table cell is rendered: currently the table is using Object::toString().
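In outline, register a renderer on the JTable itself (setDefaultRenderer is a JTable method, not a model method; in your code the table is "this"). A DefaultTableModel reports Object.class for every column unless getColumnClass is overridden, so register the renderer for Object.class:
table.setDefaultRenderer(Object.class, new MyRDFNodeRenderer());
Then see http://docs.oracle.com/javase/tutorial/uiswing/components/table.html#renderer about how to create a simple renderer. Note that value will be an RDFNode: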
static class MyRDFNodeRenderer extends DefaultTableCellRenderer {
public MyRDFNodeRenderer() { super(); }
public void setValue(Object value) {
setText((value == null) ? "" : FmtUtils.stringForRDFNode((RDFNode) value));
}
}

Java file name filter to match a search interval

I tried to find the answer to my problem in the question history but couldn't. So here is my problem.
Let's imagine that I have a directory structure like this:
project
| -- 20150201
| -- 20150202
| | -- 1423500700241.xml
| | -- 1423500720009.xml
| | -- 1423500760005.xml
| -- 20150203
| | -- 1423500780006.xml
| | -- 1423500800006.xml
| -- 20150204
| | -- 1423500820005.xml
| | -- 1423500840008.xml
| -- report
What I want is to process the files inside the directories that fall within a date period selected by the user.
Example:
When a user searches for 20150201 to 20150203, I need to process the files inside those directories.
This is what I have so far:
public class FileFilterDateIntervalUtil implements Serializable, FilenameFilter {
private static final long serialVersionUID = 226591338838691089L;
private static final SimpleDateFormat DATE_FORMAT = new SimpleDateFormat("yyyyMMdd");
private String initialDate;
private String endingDate;
public FileFilterDateIntervalUtil(String initialDate, String endingDate) {
this.initialDate = initialDate;
this.endingDate = endingDate;
}
@Override
public boolean accept(File dir, String name) {
String currentDate = DATE_FORMAT.format(new Date(new File(dir, name).lastModified()));
return ( (this.initialDate.compareTo(currentDate) < 0) && (this.endingDate.compareTo(currentDate) >= 0) );
}
}
But this filters by the files' last-modified dates, which is not what I want. I want to filter the directories by name against a date interval.
Can someone help me?
Thank you.

The problem seems to be with the way you have implemented the accept method. Instead of using a directory's last modified date and the current date, you should be using a directory's name alone to achieve your objective.
@Override
public boolean accept(File dir, String name) {
Date dirDate = null;
try {
dirDate = DATE_FORMAT.parse(name.trim());
} catch(Exception e) {
System.out.println("Cannot parse date "+name+" reason "+e.getMessage());
return false;
}
String dirDateStr = DATE_FORMAT.format(dirDate);
return initialDate.compareTo(dirDateStr) * dirDateStr.compareTo(endingDate) > 0;
}
The above accept method can be used to list all the directories whose names fall strictly between the initial and ending dates (both bounds are excluded).
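The question's example (20150201 to 20150203) reads as if both end dates should be included, whereas the accept method above excludes them. As an alternative sketch, not the answer's own code, the directory name can be parsed with java.time and compared inclusively. The class name DirNameDateRange and the method inRange are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class DirNameDateRange {
    // BASIC_ISO_DATE parses names in the yyyyMMdd form, such as "20150202".
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.BASIC_ISO_DATE;

    /** True if dirName is a yyyyMMdd date within [from, to], both bounds included. */
    static boolean inRange(String dirName, LocalDate from, LocalDate to) {
        try {
            LocalDate date = LocalDate.parse(dirName.trim(), FORMAT);
            return !date.isBefore(from) && !date.isAfter(to);
        } catch (DateTimeParseException e) {
            // Names such as "report" are not dates and are skipped.
            return false;
        }
    }

    public static void main(String[] args) {
        LocalDate from = LocalDate.of(2015, 2, 1);
        LocalDate to = LocalDate.of(2015, 2, 3);
        System.out.println(inRange("20150201", from, to)); // prints true
        System.out.println(inRange("20150204", from, to)); // prints false
        System.out.println(inRange("report", from, to));   // prints false
    }
}
```

Because FilenameFilter has a single method, the check can be passed as a lambda, for example new File("project").list((dir, name) -> inRange(name, from, to)).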

Getting specific data from a cursor

After running a query I have a data like below in a cursor
ID| TOPIC | TITLE | TYPE | NAME |
---------------------------------
1 | AB | BCD | ref | Ferari|
----------------------------------
1 | AB | BCD | ref | TOYOTA|
----------------------------------
2 | BC | ABC | notref| AUDI |
----------------------------------
2 | BC | ABC |notref| BMW |
How can I get the NAME

You can get the NAME by storing all the data in an ArrayList and then retrieving it based on the id value.
Try this:
ArrayList datas=new ArrayList();
ArrayList list=new ArrayList();
String StoreTitle = ""; // StoreName is declared inside the loop below
cursor.moveToFirst();
do{
int getID = cursor.getInt(cursor.getColumnIndexOrThrow("_id"));
String Title = cursor.getString(cursor.getColumnIndexOrThrow("TITLE"));
String StoreName= cursor.getString(cursor.getColumnIndexOrThrow("NAME"));
if(StoreTitle.equalsIgnoreCase(Title)){
list=new ArrayList();
String getTopic = cursor.getString(cursor.getColumnIndexOrThrow("TOPIC"));
String getTitle = cursor.getString(cursor.getColumnIndexOrThrow("TITLE"));
String getType = cursor.getString(cursor.getColumnIndexOrThrow("TYPE"));
String getName = cursor.getString(cursor.getColumnIndexOrThrow("NAME"));
String name=StoreName+" "+getName;
list.add(getTopic);
list.add(getTitle );
list.add(getType );
list.add(name);
datas.remove(getID);
datas.add(getID ,list);
}
else
{
list=new ArrayList();
String getTopic = cursor.getString(cursor.getColumnIndexOrThrow("TOPIC"));
String getTitle = cursor.getString(cursor.getColumnIndexOrThrow("TITLE"));
String getType = cursor.getString(cursor.getColumnIndexOrThrow("TYPE"));
String getName = cursor.getString(cursor.getColumnIndexOrThrow("NAME"));
String name=StoreName+" "+getName;
list.add(getTopic);
list.add(getTitle );
list.add(getType );
list.add(name);
datas.add(getID ,list);
}
StoreTitle = Title;
}while(cursor.moveToNext());

Just modify the else part and keep track of a position variable:
else
{
String name=arraylist[position].getName;
name=name+" "+cursor.getString(cursor.getColumnIndexOrThrow("Name"));
arraylist.setname(name);
}
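The idea of collecting the NAME values per ID can also be sketched without index bookkeeping. This is only an illustration in plain Java: the rows are modelled as String arrays in the column order shown in the question, standing in for the Android cursor, and the names NamesById and namesById are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NamesById {
    /** Each row is {ID, TOPIC, TITLE, TYPE, NAME}; returns the names grouped by ID. */
    static Map<String, List<String>> namesById(List<String[]> rows) {
        // LinkedHashMap keeps the IDs in the order they were first seen.
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String[] row : rows) {
            result.computeIfAbsent(row[0], id -> new ArrayList<>()).add(row[4]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[] {"1", "AB", "BCD", "ref", "Ferari"},
                new String[] {"1", "AB", "BCD", "ref", "TOYOTA"},
                new String[] {"2", "BC", "ABC", "notref", "AUDI"},
                new String[] {"2", "BC", "ABC", "notref", "BMW"});
        System.out.println(namesById(rows));
        // prints {1=[Ferari, TOYOTA], 2=[AUDI, BMW]}
    }
}
```

With a real cursor, the loop body would read the ID and NAME columns from the current row instead of indexing into an array.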
