I've read many posts about this topic, but none of them talk about an AWS RDS MySQL database. For the past three days I've been running a Python script on an AWS EC2 instance that writes rows to my AWS RDS MySQL database. I have to write 35 million rows, so I know this will take some time. I check the performance of the database periodically, and three days later (today) I realized that the database is slowing down. When it started, the first 100,000 rows were written in only 7 minutes (this is an example of the rows I'm working with):
0000002178-14-000056 AccountsPayableCurrent us-gaap/2014 20131231 0 USD 266099000.0000
After three days, 5,385,662 rows have been written to the database, but now it takes almost 3 hours to write 100,000 rows. What is happening?
The EC2 instance I'm running is a t2.small; you can check the specs here if you need them: EC2 SPECS. The RDS database I'm running is a db.t2.small; check the specs here: RDS SPECS.
I will attach here some charts showing the performance of the database and the EC2 instance: DB CPU / DB Memory / DB Write IOPS / DB Write Throughput / EC2 Network In (bytes) / EC2 Network Out (bytes)
It would be great if you could help me. Thanks a lot.
EDIT 1: How am I inserting the rows? As I said before, I have a Python script running on an EC2 instance that reads text files, makes some calculations with the values, and then writes every "new" row to the database. Here is a small piece of my code showing how I read the text files:
for i in path_list:
    notify("Uploading: " + i)
    num_path = "path/" + i + "/file.txt"
    sub_path = "path/" + i + "/file.txt"
    try:
        sub_dict = {}
        with open(sub_path) as sub_file:
            for line in sub_file:
                line = line.strip().split("\t")
                sub_dict[line[0]] = line[1]                  # Save cik for every accession number
                sub_dict[line[1] + "-report"] = line[25]     # Save report type for every CIK
                sub_dict[line[1] + "-frecuency"] = line[28]  # Save frequency for every CIK
        with open(num_path) as num_file:
            for line in num_file:
                num_row = line.strip().split("\t")
                # Reminder: sometimes in the very old reports the cik and accession number
                # do not match. For this reason I have to look up the real cik here.
                try:
                    cik = sub_dict[num_row[0]]
                except KeyError:
                    cik = num_row[0][0:10]
                try:  # If there is no value, pass
                    value = num_row[7]
                    values_dict = {
                        'cik': cik,
                        'accession': num_row[0][10::].replace("-", ""),
                        'tag': num_row[1],
                        'value': value,
                        'valueid': num_row[6],
                        'date': num_row[4]
                    }
                    sql = ("INSERT INTO table_name (id, tag, value_num, value_id, endtime, cik, report, period) "
                           "VALUES ('{}', '{}', '{}', '{}', '{}', '{}', '{}', '{}')".format(
                               values_dict['cik'] + values_dict['accession'] + values_dict['date'] + values_dict['value'].split(".")[0] + "-" + values_dict['tag'],
                               values_dict['tag'],
                               float(values_dict['value']),
                               values_dict['valueid'],
                               values_dict['date'],
                               int(values_dict['cik']),
                               sub_dict[values_dict['cik'] + "-report"],
                               sub_dict[values_dict['cik'] + "-frecuency"]
                           ))
                    cursor.execute(sql)
                    connection.commit()
I know there is no `except:` to catch the `try` statements, but this is only a piece of the script. I think the important part is how I insert every row. In cases where I don't need to make calculations with the values, I'll use `LOAD DATA INFILE` to write the text files to the database.
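For reference, this is roughly what I have in mind for that path; a minimal sketch only, assuming a tab-separated file, a stand-in `table_name`, and a client connection with `local_infile` enabled (on RDS only the LOCAL variant works, since you can't put files on the server's filesystem):

```python
# Sketch: bulk-load a tab-separated file in one statement instead of
# row-by-row INSERTs. table_name and the file path are stand-ins.
sql = (
    "LOAD DATA LOCAL INFILE 'path/file.txt' "
    "INTO TABLE table_name "
    "FIELDS TERMINATED BY '\\t' "
    "LINES TERMINATED BY '\\n' "
    "(id, tag, value_num, value_id, endtime, cik, report, period)"
)
cursor.execute(sql)
connection.commit()
```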
I just realized now that it's maybe not a good idea to `commit` every time I insert a row. I will try committing after every 10,000 rows or so.
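This is roughly how I imagine the batched version; a sketch only, assuming a DB-API driver such as PyMySQL or mysql-connector (which use `%s` placeholders), with `table_name` as a stand-in and `BATCH_SIZE` a value I'd still have to tune:

```python
# Sketch: collect rows in memory, then flush them with executemany()
# and a single commit per batch instead of one commit per row.
BATCH_SIZE = 10_000

INSERT_SQL = (
    "INSERT INTO table_name "
    "(id, tag, value_num, value_id, endtime, cik, report, period) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
)

def insert_rows(connection, rows):
    """rows is an iterable of 8-tuples matching the column list above."""
    batch = []
    with connection.cursor() as cursor:
        for row in rows:
            batch.append(row)
            if len(batch) >= BATCH_SIZE:
                cursor.executemany(INSERT_SQL, batch)
                connection.commit()  # one commit per 10,000 rows
                batch.clear()
        if batch:  # flush whatever is left after the loop
            cursor.executemany(INSERT_SQL, batch)
            connection.commit()
```

Passing the values as `%s` parameters would also sidestep the quoting problems of building the SQL string with `.format()`.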