I have a word2vec model stored in a text file as follows:
also -0.036738 -0.062687 -0.104392 -0.178325 0.010501 0.049380....
one -0.089568 -0.191083 0.038558 0.156755 -0.037399 -0.013798....
The size of the text file is more than 8GB.
I want to read this file into mysql database using the first word as key (in a column) and the rest of the line as another column. Is it possible to do so without reading each line and splitting it?
I went through some related questions, but they didn't match what I want:
How to read a file and add its content to database?
read text file content and insert it into a mysql database
You can do it by:
making a simple for loop that iterates over the records in the model
aggregating about 100 records in an array
using MySQL's bulk insert feature to insert hundreds of records at once
Use a fast language like Go if you can.
What you are trying to do is very possible; let me know if you need code for it.
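If it helps, here is a minimal sketch of that loop in Java with JDBC (the same idea applies in Go). The table name, column names and connection details are assumptions, and the whole vector is kept as one text column, as described in the question:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class VectorLoader {
    public static void main(String[] args) throws Exception {
        // Assumed table: CREATE TABLE vectors (word VARCHAR(255) PRIMARY KEY, vector LONGTEXT)
        // rewriteBatchedStatements=true lets the MySQL driver send each batch as a multi-row insert.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/embeddings?rewriteBatchedStatements=true",
                     "user", "password");
             BufferedReader in = Files.newBufferedReader(Paths.get("model.txt"));
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO vectors (word, vector) VALUES (?, ?)")) {
            con.setAutoCommit(false);
            String line;
            int pending = 0;
            while ((line = in.readLine()) != null) {
                int split = line.indexOf(' ');      // first token is the word
                if (split < 0) continue;            // skip malformed lines
                ps.setString(1, line.substring(0, split));
                ps.setString(2, line.substring(split + 1));
                ps.addBatch();
                if (++pending % 100 == 0) {         // flush roughly every 100 records
                    ps.executeBatch();
                    con.commit();
                }
            }
            ps.executeBatch();                      // flush the remainder
            con.commit();
        }
    }
}
```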
I am trying to read data from a CSV file that is 332,462 KB, with 136 columns and 297,388 rows. I then want to insert it into an Oracle database table that has exactly the same column mapping, except that I add one more column at the end of the table to record today's date.
Everything looks fine and there are no exceptions, but I can only read a small part, about 7,619 rows, before the program stops. The part that does get inserted into the database is correct, but I don't know why it stops. I tried readNext(), readAll(), and passing an InputStreamReader to CSVReader; all of these give the same result.
What is the cause of this? One thing I suspect is that the CSV file has some empty rows that CSVReader reads as the end of the file?
I have batch steps where I read a file, process it, and save it to the DB. But as it is a full load, I have to clear the existing table in the DB before inserting the new records.
My question is: what is the best place in Spring Batch (reader, processor, writer) to put the code that deletes the existing table? Below are the scenarios I considered:
Do this in the open() method of ItemReader<> in the reader class. Problem: if the file I'm reading is somehow corrupt or blank, I will end up with an empty table.
Create a flag, set it once, and delete the table in the processor class based on that flag. This can be done, but is there a better way or a better place to do this?
Create another temp table, copy all the records from the file into this table, and then in an @AfterStep method delete all the records from the actual table and move all the records from the temp table into it.
Is there any method that gets called just once before the ItemProcessor, other than doing it with a flag? Please suggest.
It sounds to me like you have 2 steps here:
Truncate existing table.
Insert data from file.
You need to decide which step should execute first.
If you want to execute the 'truncate' step first, you can add checks to validate the file before performing the truncate. For example, you could use a @BeforeStep method to check that the file exists, is readable, and is not size 0.
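A rough sketch of such a check, assuming the file path arrives as a job parameter named input.file; register the class as a listener on the truncate step so the exception fails the job before anything is deleted:

```java
import java.io.File;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;

public class TruncateGuardListener {

    // Fail the truncate step early if the input file looks unusable.
    @BeforeStep
    public void checkInputFile(StepExecution stepExecution) {
        String path = stepExecution.getJobParameters().getString("input.file");
        if (path == null) {
            throw new IllegalStateException("Job parameter 'input.file' is missing");
        }
        File file = new File(path);
        if (!file.exists() || !file.canRead() || file.length() == 0) {
            throw new IllegalStateException(
                    "Input file is missing, unreadable or empty: " + path);
        }
    }
}
```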
If you need to guarantee that the entire file is parsed without error before loading the database table, then you will need to parse the data into some temporary location as you mention and then in a second step move the data from the temporary location to the final table. I see a few options there:
Create a temporary table to hold the data as you suggest. After reading the file, in a different step, truncate the target table and move from the temp table to the target table.
Add a 'createdDateTime' column or something similar to the existing table and an 'executionDate' job parameter. Then insert your new rows as they are parsed. In a second step, delete any rows that have a created time that is less than the executionDate (This assumes you are using a generated ID for a PK on the table).
Add a 'status' column to the existing table. Insert the new rows as 'pending'. Then in a second step, delete all rows that are 'active' and update rows that are 'pending' to 'active' (see the sketch after this list).
Store the parsed data in-memory. This is dangerous for a few reasons; especially if the file is large. This also removes the ability to restart a failed job as the data in memory would be lost on failure.
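For the 'status' option, the second step boils down to two statements executed in one transaction. A plain JDBC sketch with placeholder table and column names (in Spring Batch this would typically live in a Tasklet):

```java
import java.sql.Connection;
import java.sql.Statement;

public class ActivatePendingRows {

    // Second step of the 'status' approach: drop the old active rows and
    // promote the freshly loaded pending rows, all in one transaction.
    public static void activate(Connection con) throws Exception {
        con.setAutoCommit(false);
        try (Statement st = con.createStatement()) {
            st.executeUpdate("DELETE FROM target_table WHERE status = 'active'");
            st.executeUpdate("UPDATE target_table SET status = 'active' WHERE status = 'pending'");
            con.commit();
        } catch (Exception e) {
            con.rollback();
            throw e;
        }
    }
}
```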
You can:
Create 2 tables Data and DataTemp with the same schema
Have table Data with existing data
Copy the new data into DataTemp
If the new incoming data stored in DataTemp is valid, then you can transfer it to the Data table.
If you don't have much data, you can do this transfer in a single transaction in another Tasklet.
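A sketch of such a Tasklet using a JdbcTemplate (table names follow the answer; whether both statements run in a single transaction depends on the transaction manager configured for the step):

```java
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;

public class SwapTablesTasklet implements Tasklet {

    private final JdbcTemplate jdbcTemplate;

    public SwapTablesTasklet(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        // Replace the contents of Data with the validated rows from DataTemp.
        jdbcTemplate.update("DELETE FROM Data");
        jdbcTemplate.update("INSERT INTO Data SELECT * FROM DataTemp");
        return RepeatStatus.FINISHED;
    }
}
```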
If you have a lot of data you can use a chunk-processing tasklet. Since you have already inserted the data, you know it is not corrupted and respects the DB constraints, so if the transfer fails (probably because the DB is down), you can restart it later without losing any data.
You can also avoid that transfer step by using the 2 tables and recording in your configuration which of the two tables is currently the active one. Or you could use a table alias/view that you update to reference the active table.
Is there a way to perform SQL queries on CSV text held in memory and read in from a Java Reader, e.g. a StringReader?
org.h2.tools.Csv.read(Reader reader, String[] colNames) would allow me to retrieve a result set containing all the rows and columns. However, I actually want to perform a query on the CSV text read from the Reader.
The background: I receive a file containing multiple CSV sections for each entity (see Can H2 Database query a CSV file containing multiple sections of different record groups?), and whilst parsing the file I store each of the CSV sections I need in Strings (a String for each one). This shouldn't bog down memory, as I only keep the data in memory for a short time and each CSV section is relatively small. I need to perform queries on these CSV sections to build a document in a custom format.
I could write each CSV section to a file (as a set of files) and use CSVREAD, but I don't want to do that as I need my application to be as fast as possible and splitting and writing the sections to disk will thrash the hard drive to death.
You could write a user-defined function that returns a result set, and use that to generate the required rows. Within your user-defined function, you can use the Csv tool from H2 (actually any CSV tool).
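A sketch of such a function, reusing H2's Csv tool on the in-memory text (the class and alias names are made up, and depending on the H2 version Csv.read may be a static or an instance method):

```java
import java.io.StringReader;
import java.sql.ResultSet;
import org.h2.tools.Csv;

public class CsvFunctions {

    // Table function: exposes the in-memory CSV text as a result set H2 can query.
    // With colNames = null, the first CSV line is used as the column names.
    public static ResultSet csvText(String csv) throws Exception {
        return new Csv().read(new StringReader(csv), null);
    }
}
```

Register it once with CREATE ALIAS CSV_TEXT FOR "com.example.CsvFunctions.csvText" and then query it like a table: SELECT * FROM CSV_TEXT(...), passing the CSV text as the argument. If you bind the text as a prepared-statement parameter rather than a literal, H2 may invoke the function with a null argument while it determines the column list, so you may need to guard against that.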
This is not possible directly, since a DBMS can usually only query its own optimized data storage. You have to import the text with the mentioned org.h2.tools.Csv.read into a table and perform the queries on that table. The table may be a temporary one, to prevent any writes to disk, assuming memory is sufficient.
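A minimal sketch of that route, again based on the Csv tool from the question (the column layout of the temporary table is an assumption; NOT PERSISTENT is intended to keep the table data in memory even for a file-based H2 database):

```java
import java.io.StringReader;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import org.h2.tools.Csv;

public class CsvSectionLoader {

    // Imports one in-memory CSV section into a temporary table so it can be queried with SQL.
    public static void load(Connection con, String csvText) throws Exception {
        try (Statement ddl = con.createStatement()) {
            ddl.execute("CREATE LOCAL TEMPORARY TABLE section"
                      + " (col1 VARCHAR, col2 VARCHAR) NOT PERSISTENT");
        }
        try (ResultSet rows = new Csv().read(new StringReader(csvText), null);
             PreparedStatement insert = con.prepareStatement(
                     "INSERT INTO section VALUES (?, ?)")) {
            while (rows.next()) {
                insert.setString(1, rows.getString(1));
                insert.setString(2, rows.getString(2));
                insert.addBatch();
            }
            insert.executeBatch();
        }
    }
}
```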
I am uploading a CSV file using a Servlet and inserting it into an Oracle table using JDBC. I need to insert only the unique values from the CSV file. If a value in the CSV file is already in the database table, it should not be inserted again. So only the values from the CSV file that are not yet in the table should be inserted.
These options should help avoid an additional DB call to handle this situation.
Option 1: Very simple and needs the least coding, but it only works when there is no global transaction boundary.
Just fire all the inserts; in case of a constraint exception, catch it, do nothing, and loop on to the next value.
Option 2: Every time you read a row from the CSV, add it to a collection, but before adding, check whether the object already exists (e.g. arrayList.contains(objectInstance)) and add it only when there is no object with the same data. At the end, do a bulk insert.
Note: If the data is large, use a fixed batch size for the bulk insert.
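Option 1 could look roughly like this with JDBC (table and column names are placeholders; recent Oracle drivers report a violated unique constraint as SQLIntegrityConstraintViolationException, otherwise catch SQLException and check the error code):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;
import java.util.List;

public class UniqueInserter {

    // Inserts every CSV value, silently skipping the ones that already exist.
    public static void insertAll(Connection con, List<String> csvValues) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO my_table (my_column) VALUES (?)")) {
            for (String value : csvValues) {
                try {
                    ps.setString(1, value);
                    ps.executeUpdate();
                } catch (SQLIntegrityConstraintViolationException duplicate) {
                    // Row already present: ignore and move on to the next value.
                }
            }
        }
    }
}
```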
Consider these steps:
Read a value from the CSV file
Search for the value in the database, and if it is not found, insert it.
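In code, that lookup-then-insert per CSV value could look like this (table and column names are hypothetical):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class LookupThenInsert {

    // Checks whether the value is already in the table and inserts it only if it is not.
    public static void insertIfAbsent(Connection con, String value) throws SQLException {
        try (PreparedStatement check = con.prepareStatement(
                "SELECT COUNT(*) FROM my_table WHERE my_column = ?")) {
            check.setString(1, value);
            try (ResultSet rs = check.executeQuery()) {
                rs.next();
                if (rs.getInt(1) > 0) {
                    return; // already present, nothing to insert
                }
            }
        }
        try (PreparedStatement insert = con.prepareStatement(
                "INSERT INTO my_table (my_column) VALUES (?)")) {
            insert.setString(1, value);
            insert.executeUpdate();
        }
    }
}
```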
I'm guessing the way you are inserting data into the database is in a loop: you read from the CSV and insert into the DB. So why not simply do a SELECT statement to check whether the value already exists, and if it does, don't insert it?
Hi, I am new to MySQL and Java as well. I want to store JPEG files and hash values of small chunks of each file. I have stored the hash values of the small chunks (hundreds in number) and now want to store the JPEG file against these small chunks as well. My question is: do I need to store the file again and again for each record, or is it possible to save the file once and link it to the records related to it? If so, please also guide me on how I can do it.
You can save the file on the machine and store its path in the database.
Suppose you have one field (say imagePath) in the table which takes VARCHAR data. You can store the path of the image there and retrieve the image at runtime. By doing this you can avoid saving the same file multiple times. However, it will overwrite images that have the same name but different data; to handle that, append the primary key to the name of the file. I hope this helps you understand.
I must admit I'm not entirely sure what you want to do, but if I understand you correctly, this is what I would do.
One table (say tblJpgs) with data about the jpg-file: maybe the path and the filename as suggested by Naved, and additionally a description and whatever other information about the file is useful. In this table there will be one row per jpg-file, and each row will have a unique id.
Then you will have another table (say tblChunks) for all the chunks. There should be a column for connecting each row with a tblJpgs.id. Then there is of course a column with the chunk itself. In this table there will be one row for each chunk, but there will be many rows for one jpg-file.
In this way you only save the information in one place, which is a central idea in database design. Storing the jpg-file, or its path and filename, in every chunk row would go against this fundamental principle and should therefore be avoided.
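A rough sketch of that schema, created through JDBC (column names and types beyond the ones mentioned in the answer are assumptions):

```java
import java.sql.Connection;
import java.sql.Statement;

public class ChunkSchemaSetup {

    // One row per jpg-file in tblJpgs; many rows per file in tblChunks,
    // each chunk pointing back to its file via jpgId.
    public static void createTables(Connection con) throws Exception {
        try (Statement st = con.createStatement()) {
            st.execute("CREATE TABLE tblJpgs ("
                     + " id INT AUTO_INCREMENT PRIMARY KEY,"
                     + " filePath VARCHAR(500) NOT NULL,"
                     + " description VARCHAR(500))");
            st.execute("CREATE TABLE tblChunks ("
                     + " id INT AUTO_INCREMENT PRIMARY KEY,"
                     + " jpgId INT NOT NULL,"
                     + " chunkHash CHAR(64) NOT NULL,"
                     + " FOREIGN KEY (jpgId) REFERENCES tblJpgs(id))");
        }
    }
}
```

Each jpg-file is inserted once into tblJpgs, and its generated id is then reused for all of its chunk rows, so the file (or its path) is never stored more than once.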