Paragraph to Table in Different Columns - java

I want to create a program such that, when a paragraph is entered in a text area, certain parts of it are put into a table in different columns. For example, the statement is:
My name is James Olson. I am 21 years old. I am a doctor. I live in Canterville, Bacon Street, London.
Then the table should automatically look like:
| Name | Age | Profession | Area Name | Street Name | Area |
| James | 21 | Doctor | Canterville | Bacon Street | London |
I also want to know which language would work best - Python or Java.

Yes, it is certainly possible to do so, and I would personally prefer Python for the job.
I've written some code. It is not the best or most efficient, but it will do the job, with one catch: it only works if the sentence follows exactly the same sequence and pattern as the example you provided.
If you want the code to work for multiple sentences, a small change to the code with a loop will do the work.
import pandas as pd

my_sent = "My name is James Olson. I am 21 years old. I am a doctor. I live in Canterville, Bacon Street, London."
my_words = my_sent.split()

# Words that carry no information for the table
my_stopwords = ['My', 'name', 'is', 'I', 'am', 'years', 'old.', 'I', 'am', 'a', 'I', 'live', 'in']
cleaned_stopwords = []
useful_words = []

for temp in my_stopwords:
    cleaned_stopwords.append(temp.lower().strip())

# Keep only the informative words, title-cased and stripped of punctuation
for word in my_words:
    if word.lower().strip() not in cleaned_stopwords:
        useful_words.append(word.title().strip(".").strip(","))

# Join the first two words into the full name and the street words into the street name
name = useful_words[0] + " " + useful_words[1]
street = useful_words[5] + " " + useful_words[6]

useful_words.pop(0)
useful_words.pop(0)
useful_words.insert(0, name)
useful_words.pop(4)
useful_words.pop(4)
useful_words.insert(4, street)

all_columns = ["Name", "Age", "Profession", "Area Name", "Street Name", "Area"]
my_df = pd.DataFrame([useful_words], columns=all_columns)
print(my_df)
Output:
          Name  Age Profession    Area Name   Street Name    Area
0  James Olson   21     Doctor  Canterville  Bacon Street  London
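If you would rather do it in Java (as the question's tag suggests), here is a rough, untested sketch of the same idea using plain string handling. The class name ParagraphToRow and the hard-coded phrases are my own, and the code assumes the paragraph follows exactly the pattern from the example:

import java.util.Arrays;
import java.util.List;

public class ParagraphToRow {
    public static void main(String[] args) {
        String sent = "My name is James Olson. I am 21 years old. "
                + "I am a doctor. I live in Canterville, Bacon Street, London.";
        // Split the paragraph into its four sentences
        String[] sentences = sent.split("\\.\\s*");
        // Strip the fixed phrases, leaving only the useful parts
        String name = sentences[0].replaceFirst("My name is ", "");
        String age = sentences[1].replaceFirst("I am ", "").replaceFirst(" years old", "");
        String profession = sentences[2].replaceFirst("I am an? ", "");
        String[] address = sentences[3].replaceFirst("I live in ", "").split(",\\s*");
        // The six column values: Name, Age, Profession, Area Name, Street Name, Area
        List<String> row = Arrays.asList(name, age, profession,
                address[0], address[1], address[2]);
        System.out.println(row);
    }
}

As with the Python version, any paragraph that deviates from this exact sentence pattern will break the extraction.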


How to match a regex with words followed by date?

I'd like to write a regex to match sentences like these:
"I rated Minions (2015)..."
"I rated Beauty and the Beast (2015)..."
I've tried a regex like:
I rated \\w+ \\(\\b(18|19|20)\\d{2}\\b\\)
but it works only in the first case, when the title is a single word.
Between "I rated" and the year there is a title of a movie with no fixed length. Could you help me?
Try using a regex like:
[^.?!(]* \\((18|19|20)\\d{2}\\)
OR
\\w+ (?:\\w+ )*\\((?:1[89]|20)\\d{2}\\)
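As a quick illustration of how the second suggested pattern could be dropped into Java (the class name and sample text below are my own):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MovieYearDemo {
    public static void main(String[] args) {
        String text = "I rated Minions (2015)... I rated Beauty and the Beast (2015)...";
        // one or more words followed by a year beginning with 18, 19 or 20 in parentheses
        Pattern p = Pattern.compile("\\w+ (?:\\w+ )*\\((?:1[89]|20)\\d{2}\\)");
        Matcher m = p.matcher(text);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}

Note that each match still includes the leading "I rated"; to pull out just the title and the year you need capturing groups, as in the answer below.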
Assuming that:
you don't really need to validate the year
your text has mixed spurious sentences, as opposed to one-liner "I rated..."
you want to do something with the movie title and year separately
You can use:
String text = "I rated Minions (2015)... I like turtles. "
        + "I rated Beauty and the Beast (2015)... "
        + "I rated rare live footage of Louis XVI being beheaded (1789)";
// "I rated "  - the sentence starts with "I rated"
// (.+?)       - group 1 with the title
// \\(         - open parenthesis
// (\\d+)      - group 2 with the non-validated year
// \\)         - closing parenthesis
Pattern pattern = Pattern.compile("I rated (.+?) \\((\\d+)\\)");
Matcher matcher = pattern.matcher(text);
while (matcher.find()) {
    System.out.printf(
            "Title: %s - Year: %s%n",
            // title is back-referenced as group 1
            matcher.group(1),
            // year is back-referenced as group 2
            matcher.group(2)
    );
}
... which will return:
Title: Minions - Year: 2015
Title: Beauty and the Beast - Year: 2015
Title: rare live footage of Louis XVI being beheaded - Year: 1789

Attempting to count unique users between two categories in Spark

I have a Dataset in Spark with two columns, one called user and the other called category, such that the table looks somewhat like this:
+---------------+---------------+
| user| category|
+---------------+---------------+
| garrett| syncopy|
| garrison| musictheory|
| marta| sheetmusic|
| garrett| orchestration|
| harold| chopin|
| marta| russianmusic|
| niko| piano|
| james| sheetmusic|
| manny| violin|
| charles| gershwin|
| dawson| cello|
| bob| cello|
| george| cello|
| george| americanmusic|
| bob| personalcompos|
| george| sheetmusic|
| fred| sheetmusic|
| bob| sheetmusic|
| garrison| sheetmusic|
| george| musictheory|
+---------------+---------------+
only showing top 20 rows
Each row in the table is unique, but a user and category can appear multiple times. The objective is to count the number of users that two categories share. For example, cello and americanmusic share a user named george, and musictheory and sheetmusic share the users george and garrison. The goal is to get the number of distinct users between n categories, meaning that there are at most n squared edges between categories. I understand partially how to do this operation, but I am struggling a little bit converting my thoughts to Spark Java.
My thinking is that I need to do a self-join on user to get a table that would be structured like this:
+---------------+---------------+---------------+
| user| category| category|
+---------------+---------------+---------------+
| garrison| musictheory| sheetmusic|
| george| musictheory| sheetmusic|
| garrison| musictheory| musictheory|
| george| musictheory| musictheory|
| garrison| sheetmusic| musictheory|
| george| sheetmusic| musictheory|
+---------------+---------------+---------------+
The self join operation in Spark (Java code) is not difficult:
Dataset<Row> newDataset = allUsersToCategories.join(allUsersToCategories, "user");
This is getting somewhere; however, I get mappings to the same category, as in rows 3 and 4 of the above example, and I get backwards mappings where the categories are reversed, which essentially double counts each user interaction, as in rows 5 and 6 of the above example.
What I believe I need to do is add some sort of condition to my join, along the lines of X < Y, so that equal categories and duplicate pairs get thrown away. Finally, I then need to count the number of distinct rows for the n squared combinations, where n is the number of categories.
Could somebody please explain how to do this in Spark and specifically Spark Java since I am a little unfamiliar with the Scala syntax?
Thanks for the help.
I'm not sure if I understand your requirements correctly, but I will try to help.
According to my understanding, the expected result for the above data should look like the table below. If that's not right, please let me know and I will try to make the required modifications.
+--------------+--------------+-+
|_1            |_2            | |
+--------------+--------------+-+
|personalcompos|sheetmusic    |1|
|cello         |musictheory   |1|
|americanmusic |cello         |1|
|cello         |sheetmusic    |2|
|cello         |personalcompos|1|
|russianmusic  |sheetmusic    |1|
|americanmusic |sheetmusic    |1|
|americanmusic |musictheory   |1|
|musictheory   |sheetmusic    |2|
|orchestration |syncopy       |1|
+--------------+--------------+-+
In this case you can solve your problem with the Scala code below:
allUsersToCategories
  .groupByKey(_.user)
  .flatMapGroups { case (user, userCategories) =>
    val categories = userCategories.map(uc => uc.category).toSeq
    for {
      c1 <- categories
      c2 <- categories
      if c1 < c2
    } yield (c1, c2)
  }
  .groupByKey(x => x)
  .count()
  .show()
If you need a symmetric result, you can just change the if statement in the flatMapGroups transformation to if c1 != c2.
Please note that in the above example I used the Dataset API, which for test purposes was created with the code below:
case class UserCategory(user: String, category: String)

// needed for the implicit Encoder used by createDataset
import session.implicits._

val allUsersToCategories = session.createDataset(Seq(
  UserCategory("garrett", "syncopy"),
  UserCategory("garrison", "musictheory"),
  UserCategory("marta", "sheetmusic"),
  UserCategory("garrett", "orchestration"),
  UserCategory("harold", "chopin"),
  UserCategory("marta", "russianmusic"),
  UserCategory("niko", "piano"),
  UserCategory("james", "sheetmusic"),
  UserCategory("manny", "violin"),
  UserCategory("charles", "gershwin"),
  UserCategory("dawson", "cello"),
  UserCategory("bob", "cello"),
  UserCategory("george", "cello"),
  UserCategory("george", "americanmusic"),
  UserCategory("bob", "personalcompos"),
  UserCategory("george", "sheetmusic"),
  UserCategory("fred", "sheetmusic"),
  UserCategory("bob", "sheetmusic"),
  UserCategory("garrison", "sheetmusic"),
  UserCategory("george", "musictheory")
))
I was trying to provide an example in Java, but I don't have any experience with Java+Spark and it is too time consuming for me to migrate the above example from Scala to Java...
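For what it's worth, here is a rough, untested Java sketch of the same idea. It is not a direct port of the flatMapGroups version above, but a DataFrame self-join that should produce the same pair counts, assuming the user and category column names from the question:

// imports assumed:
//   import org.apache.spark.sql.Dataset;
//   import org.apache.spark.sql.Row;
//   import static org.apache.spark.sql.functions.col;
Dataset<Row> pairs = allUsersToCategories.as("a")
        .join(allUsersToCategories.as("b"),
              col("a.user").equalTo(col("b.user"))
                  .and(col("a.category").lt(col("b.category"))))
        .select(col("a.category").alias("c1"), col("b.category").alias("c2"));

// one row per shared user and category pair, so a plain count per pair
// equals the number of distinct shared users
pairs.groupBy("c1", "c2").count().show();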
I found the answer a couple of hours ago using Spark SQL:
Dataset<Row> connectionsPerSharedUser = spark.sql("SELECT a.user as user, "
        + "a.category as categoryOne, "
        + "b.category as categoryTwo "
        + "FROM allTable as a INNER JOIN allTable as b "
        + "ON a.user = b.user AND a.category < b.category");
This will then create a Dataset with three columns user, categoryOne, and categoryTwo. Each row will be unique and will indicate when the user exists in both categories.
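If you still need the per-pair user counts from that result, a grouping step along these lines should finish the job (a sketch, using the connectionsPerSharedUser Dataset from the query above):

// import assumed: import static org.apache.spark.sql.functions.countDistinct;
Dataset<Row> sharedUserCounts = connectionsPerSharedUser
        .groupBy("categoryOne", "categoryTwo")
        .agg(countDistinct("user").alias("sharedUsers"));
sharedUserCounts.show();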

Spark Sql, unable to query multiple possible values in a array

I have the data schema of a LinkedIn account as shown below. I need to query the skills field, which is an array, where the array may contain JAVA, java, Java, JAVA developer, or Java developer.
Dataset<Row> sqlDF = spark.sql("SELECT * FROM people"
        + " WHERE ARRAY_CONTAINS(skills,'Java') "
        + " OR ARRAY_CONTAINS(skills,'JAVA')"
        + " OR ARRAY_CONTAINS(skills,'Java developer') "
        + "AND ARRAY_CONTAINS(experience['description'],'Java developer')");
The above query is what I have tried; please suggest a better way, and also how do I make the query case-insensitive?
df.printSchema()
root
|-- skills: array (nullable = true)
| |-- element: string (containsNull = true)
df.show()
+--------------------+
| skills|
+--------------------+
| [Java, java]|
|[Java Developer, ...|
| [dev]|
+--------------------+
Now let's register it as a temp table:
>>> df.registerTempTable("t")
Now, we will explode the array, convert each element to lower case, and query using the LIKE operator:
>>> res = sqlContext.sql("select skills, lower(skill) as skill from (select skills, explode(skills) skill from t) a where lower(skill) like '%java%'")
>>> res.show()
+--------------------+--------------+
| skills| skill|
+--------------------+--------------+
| [Java, java]| java|
| [Java, java]| java|
|[Java Developer, ...|java developer|
|[Java Developer, ...| java dev|
+--------------------+--------------+
Now, you can do a distinct on skills field.
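Since the question itself is in Java, the same explode/lower/LIKE approach can be issued through Spark's Java API essentially unchanged. A hedged sketch, assuming the DataFrame has been registered as a temp view named people as in the original query:

Dataset<Row> res = spark.sql(
        "SELECT skills, lower(skill) AS skill "
      + "FROM (SELECT skills, explode(skills) AS skill FROM people) t "
      + "WHERE lower(skill) LIKE '%java%'");
// distinct skills arrays that contain at least one Java-like entry
res.select("skills").distinct().show();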

How to Create A HashMap using Key Values from a text file?

I have to create a hashmap with the names I have in this text file
relationships text file:
Susan Sarandon | Tom Hanks : Cloud Atlas
Tom Hanks | Kevin Bacon : Apollo 13
Leonardo Dicaprio | Kevin Bacon : This Boy's Life
Robert De Niro | Kevin Bacon : This Boy's Life
Barack Obama | Tom Hanks : the Road We've Traveled
Helen Keller | Katharine Cornell : Helen Keller in Her Story
Katharine Cornell | Helen Hayes : Stage Door Canteen
Helen Hayes | John Laughlin : Murder with Mirrors
John Laughlin | Kevin Bacon : Footloose
Mark Zuckerberg | Joe Lipari : Terms and Conditions May Apply
Joe Lipari | Welker White : Eat Pray Love
Welker White | Kevin Bacon : Lemon Sky
This is the program I have now:
public static void main(String[] args)
        throws FileNotFoundException
{
    Scanner input = new Scanner(new File("relationships"));
    HashMap<String, String> relationships = new HashMap<String, String>();

    while (input.hasNextLine()) {
        String[] columns = input.nextLine().split(" ");
        relationships.put(columns[0], columns[1]);
    }
    System.out.println(relationships);
}
This is the output:
{Leonardo=Dicaprio, Katharine=Cornell, Joe=Lipari, Tom=Hanks, Robert=De, Susan=Sarandon, John=Laughlin, Mark=Zuckerberg, Barack=Obama, Welker=White, Helen=Hayes}
Does anyone know how to fix this, please? Also, how do I separate them so it actually looks like a list?
I think you would just change your line:
String[] columns = input.nextLine().split(" ");
to:
String[] columns = input.nextLine().split(Pattern.quote(" | "));
Then column[0] would be the name on the left, and column[1] would be the name and movie title on the right.
Note that you'll need to import java.util.regex.Pattern; to do this
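Putting it together, and also answering the second part of the question (printing the map so it looks like a list), a minimal sketch might look like the following. It assumes each line uses " | " between the two names and " : " before the movie title, exactly as in the file shown above:

import java.io.File;
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.Scanner;
import java.util.regex.Pattern;

public class Relationships {
    public static void main(String[] args) throws FileNotFoundException {
        Scanner input = new Scanner(new File("relationships"));
        Map<String, String> relationships = new HashMap<>();
        while (input.hasNextLine()) {
            // e.g. "Susan Sarandon | Tom Hanks : Cloud Atlas"
            String[] namesAndMovie = input.nextLine().split(Pattern.quote(" : "));
            String[] names = namesAndMovie[0].split(Pattern.quote(" | "));
            relationships.put(names[0], names[1]);
        }
        // print one pair per line instead of the single-line toString()
        for (Map.Entry<String, String> e : relationships.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}

The movie title (namesAndMovie[1]) is simply dropped here; if you need it, store it in a second map or a small value class.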

Delete in sqlite certain rows with specific data that are relatively old in a table using java

I want to delete, for example, the first 3 (oldest) rows that have color1 set to blue.
Example data set:
_id | name  | surname | color1 | color2
  1 | mark  | jacobs  | blue   | green
  2 | tony  | hilo    | black  | red
 13 | lisa  | xyz     | blue   | green
  4 | andre | qwerty  | blue   | green
  9 | laura | abc     | black  | red
 14 | kerr  | jacobs  | blue   | green
I want to use execSQL rather than db.delete.
Which method is preferable, and what should my code look like?
I will be using this inside Eclipse in an Android app.
db.execSQL("DELETE FROM MyTable WHERE _id IN " +
"(SELECT _id FROM MyTable WHERE color1 = ? ORDER BY _id LIMIT 3)",
new Object[] { "blue" });
execSQL is perfectly fine to use, especially when the command is so complex that using delete would require even more complex code.
It is NOT advisable to use execSQL for this or any SELECT/INSERT/UPDATE/DELETE operation, as execSQL does not return anything, such as errors or the number of rows affected by the query.
Instead, although it takes a little longer to write out:
Cursor c = db.query(table, new String[]{"_id"}, "color1" + "=?",
        new String[]{"blue"}, null, null, "_id ASC", "3");
String ids = "";
String qs = "";
for (c.moveToFirst(); !c.isAfterLast(); c.moveToNext()) {
    ids += c.getInt(c.getColumnIndex("_id")) + ",";
    qs += "?,";
}
c.close();
// strip the trailing commas
ids = !ids.isEmpty() ? ids.substring(0, ids.length() - 1) : ids;
qs = !qs.isEmpty() ? qs.substring(0, qs.length() - 1) : qs;
db.delete(table, "_id IN (" + qs + ")", ids.split(","));
Here's the reference for why execSQL is bad for this situation:
http://developer.android.com/reference/android/database/sqlite/SQLiteDatabase.html#execSQL(java.lang.String, java.lang.Object[])
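That said, if the main reason to avoid execSQL is that it returns nothing, you can also pass the same subquery straight to delete(), which returns the number of rows removed. A minimal sketch, assuming the table and column names from the first answer:

// delete() returns the number of deleted rows, unlike execSQL
int deleted = db.delete("MyTable",
        "_id IN (SELECT _id FROM MyTable WHERE color1 = ? ORDER BY _id LIMIT 3)",
        new String[] { "blue" });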
DELETE FROM table WHERE _id IN
    (SELECT _id FROM table WHERE color1 = 'blue' ORDER BY _id ASC LIMIT 3);
