View Full Version : Devastating days now over
ErosOlmi
29-05-2010, 19:39
I have just come through three devastating days that are now over.
One server in my company had a hardware failure on 4 physical disks at the same time. HP technicians said it was a very strange failure.
On that server we had 5 logical disks, each configured in RAID 5 with spare disks.
One of the logical disks was used for a big (almost 150 GB) MSSQL database DATA file and another for the database transaction LOG.
Can you imagine? Both of those logical disks had a hardware failure of 2 physical disks each, resulting in the complete loss of the day's transactions because RAID 5 had no chance to rebuild the logical disks.
I spent 48 hours without sleep trying to recover the situation, but the disks were gone. HP support replaced whatever could be replaced, but to no avail.
I had to send the company and everyone in charge of operations a mail stating that we had lost a complete day of work (many thousands of transactions: customer orders, delivery notes, accounting registrations, invoices, warehouse movements, production batches, ...) and that everyone had to redo the complete day. I explained to each company department what to check and how to act to manually recover the data.
There was a moment when I found myself completely lost, with a sense of personal failure.
Well, the company didn't say anything. Everyone started to work on the problem silently, without asking anything. Every single person touched by the failure started to organize local working teams. By the end of the day almost 80% of the manual recovery job was done without a word. The day after, the remaining 20% was done too, again without a word, apart from a confirmation from each group when it finished its job.
Do you know that feeling you sometimes get watching a movie at the cinema, when many individuals all start working at the same time toward a common goal? For a second I had the same feeling.
Yesterday I was really exhausted but very, very happy and honored to work in my current company and to have such great colleagues.
Eros
Eros:
I am very happy to hear that your ordeal is over now. You are indeed fortunate to be surrounded by such a dedicated group of fellow workers. I have worked with such dedicated people in the past and know the very good feeling you must have experienced.
GOOD LUCK!
Don
danbaron
30-05-2010, 06:44
I don't know much about this stuff, maybe I don't understand the situation. :roll:
But, you had a failure of physical disks, yes? :oops:
And, all of your storage, is on site? :?
No matter how redundant your storage system is, apparently, failures are still possible. :twisted:
What if there is a fire? :unguee: :twisted:
This time, the people were on your side. :P
But, how about if it happens again? :xyzw:
It seems, that if something really bad happens, then, blame will be allocated, --> not to the company president, but probably, - directly to you ("Sh*t rolls downhill."). :grrrr: :evil5: :x :diablo:
Maybe, it would be a good idea to back up your data, online (maybe even do it twice, from two providers). ;) 8)
Otherwise, nightmares?, sleepless nights? :shock:
(I think this incident shows that, minimally, you need to increase your on site storage redundancy: --> do a cost analysis, the cost of more storage redundancy, versus the cost of catastrophic data loss. Ideally, you would put the new storage at a different location within the building (fires)).
http://www.box.net/
http://www.carbonite.com
:oops:
Dan :x :P
Michael Hartlef
30-05-2010, 09:43
Ouch, that was a hard hit. But it is also something that should make you think about your backup procedures. Thank god they were cooperative and just did the work. At my company they would complain without end.
ErosOlmi
30-05-2010, 09:56
Dan, yes, you are right if you look at what happened that way.
This event must tell us something; we have to learn from what happened.
We already have some redundancy:
every server has RAID 5 + spare disks, dual power connections, dual network connections, triple fans, ...
every night the DBMS data is backed up to disk for fast restore
every night the DBMS data is backed up to tapes (both from the backup file and from the MSSQL agent). The tapes are stored in a different building
every night the DBMS data is copied to another, exactly identical server, ready to be used the day after
on the file system we have triple backup (every 4 hours) to virtual tapes during the day
What we missed was the daily transactions on the DBMS, that is, live replication/sync.
The event occurred at 18:00 and we have no DBMS backup of data from 06:00 to 20:00.
We have to study how to replicate DBMS transactions online to another machine on the fly.
Next week we will start a complete review of our procedures in order to identify holes and wrong strategies.
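To put numbers on that gap, here is a minimal sketch in Python. It assumes the last restorable state is the 06:00 one, uses a placeholder date (the thread does not say which day the disks failed), and compares the schedule described above with a purely hypothetical hourly transaction log backup. The function names are made up for illustration.

```python
from datetime import datetime, timedelta

def lost_window(backup_times, failure):
    """Time between the last backup taken at or before `failure` and the failure."""
    earlier = [t for t in backup_times if t <= failure]
    if not earlier:
        raise ValueError("no backup precedes the failure")
    return failure - max(earlier)

def worst_case_loss(backup_times):
    """Largest gap between two consecutive backups (needs at least two times)."""
    ordered = sorted(backup_times)
    return max(later - earlier for earlier, later in zip(ordered, ordered[1:]))

# Placeholder date: the actual day of the failure is not given in the thread.
day = datetime(2010, 1, 1)
failure = day + timedelta(hours=18)

# Schedule described in the post: nothing between 06:00 and 20:00.
nightly_only = [day + timedelta(hours=6), day + timedelta(hours=20)]

# Hypothetical alternative: one transaction log backup every hour, 06:00 to 20:00.
hourly = [day + timedelta(hours=h) for h in range(6, 21)]

print("lost in this incident:", lost_window(nightly_only, failure))
print("worst case, current schedule:", worst_case_loss(nightly_only))
print("worst case, hourly log backups:", worst_case_loss(hourly))
```

The point of the comparison is only that the exposure is bounded by the longest gap between restorable points, so shortening that gap (or replicating continuously) is what shrinks the possible loss.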
Petr Schreiber
30-05-2010, 10:00
Eros,
congratulations for solving such a complicated situation!
I agree it must have been a great feeling to see your colleagues cooperating without a single complaint.
It is evident to me that they respect you as a person and the work you have done so far. You can learn a lot about people's true nature in stressful situations, and knowing your team is able to manage such a tricky situation properly and without complaints is fantastic.
Regarding further improvements in data security, I cannot advise anything, as I am not experienced in this.
I will ask friends who are more into this topic about possible solutions.
Petr
danbaron
30-05-2010, 20:21
I understand that, not every procedure can be exactly scaled up.
But, how would I do it on my machine?
Say, I have the file, "c:\giantfile.dat".
I am worried that my internal drive (c) will fail, and I will lose, "giantfile.dat".
I have an external drive, "f".
I would copy, "c:\giantfile.dat", to, "f:\giantfile.dat".
Simultaneously, I would alter my code so that every write operation to, "c:\giantfile.dat", also produced the same write operation to, "f:\giantfile.dat".
In that case, I would never need to back up, "c:\giantfile.dat", to, "f:\giantfile.dat".
If, I still felt insecure, I have another external drive, "h".
I could do the same thing for, "h:\giantfile.dat".
And, if I had more drives, I could do it for them, too.
For me, instituting this method, would be very simple.
For you, and the complexity and limitations of your situation, I don't know. (Maybe you would need to buy a duplicate of the physical disk which now functions as the primary storage for your giant file.)
I assume that your giant file is not backed up between 6 AM and 8 PM, because that is when people are using it.
But then, you are always taking the risk of losing one day's transactions.
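A minimal Python sketch of the dual-write idea described above, for an ordinary file only. `MirroredFile` is a hypothetical helper, and temporary directories stand in for the drives c, f and h; it is not meant to represent how a database server file could be handled.

```python
import os
import tempfile

class MirroredFile:
    """Append every record to a primary file and to each mirror (hypothetical helper)."""

    def __init__(self, primary, mirrors):
        self.paths = [primary, *mirrors]

    def append(self, record: bytes) -> None:
        for path in self.paths:
            with open(path, "ab") as handle:
                handle.write(record)
                handle.flush()
                os.fsync(handle.fileno())  # ask the OS to push the bytes to the device

# Temporary directories stand in for the drives c, f and h.
with tempfile.TemporaryDirectory() as root:
    drives = [os.path.join(root, name) for name in ("c", "f", "h")]
    for drive in drives:
        os.mkdir(drive)
    paths = [os.path.join(drive, "giantfile.dat") for drive in drives]

    mirrored = MirroredFile(paths[0], paths[1:])
    mirrored.append(b"order 1\n")
    mirrored.append(b"order 2\n")

    # Read every copy back to check that they are identical.
    copies = []
    for path in paths:
        with open(path, "rb") as handle:
            copies.append(handle.read())
    print(copies[0] == copies[1] == copies[2])
```

Note the limitation of this design: the mirrors receive exactly what the program writes, so a wrong or corrupted write is faithfully repeated on every drive. Mirroring protects against a drive dying, not against bad data.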
Dan
John Spikowski
30-05-2010, 20:55
What we have missed was daily transactions on the DBMS, so live replication/sync.
Have you looked at MS SQL Mirroring?
http://technet.microsoft.com/en-us/library/cc917680.aspx
ErosOlmi
30-05-2010, 21:46
Dan,
it is not a matter of files. You cannot copy an MSSQL data or log file while it is in use by the MSSQL agent. You need to use the internal MSSQL backup agent or buy a backup agent for your backup software. And we have both.
And a simple copy would not secure anything if you copy over your backup copy, because if something happens in the middle you lose everything.
To ensure operations we relied on two facts:
RAID configuration on disks (3 disks plus 1 spare)
separation between the DBMS DATA file and the DBMS LOG file
RAID ensures that if one of your disks fails you can just replace it (in our case the controller did it automatically) and in a few minutes you will have the logical disk rebuilt.
Separation of DATA and LOG files ensures that you can always recover the transactions that occurred since the last full backup from one of the two files (DATA or LOG).
Of course we were wrong in having that confidence.
In our case 4 disks failed all at the same time: 2 in the logical disk where the DATA file was stored and 2 in the logical disk where the LOG file was stored. Even if you tried to cause that on purpose it would be quite hard to achieve, but it happened.
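Why RAID 5 survives one failed disk but not two can be shown with a toy model in Python. This is a simplified sketch of a single stripe on a 3-disk array (two data blocks plus one XOR parity block); a real controller rotates parity across disks and works on much larger blocks, and the function names here are made up for illustration.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

def rebuild(stripe):
    """Rebuild a stripe (data blocks + parity) in which at most one block is None."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError("RAID 5 cannot rebuild a stripe with more than one missing block")
    repaired = list(stripe)
    if missing:
        survivors = [block for block in stripe if block is not None]
        # XOR of all blocks in a stripe is zero, so the missing one is the XOR of the rest.
        repaired[missing[0]] = xor_blocks(*survivors)
    return repaired

# One stripe: two data blocks and their parity.
d0 = b"ORDERS01"
d1 = b"INVOICE7"
parity = xor_blocks(d0, d1)

# Single failure: the disk holding d1 is gone, the survivors rebuild it.
print(rebuild([d0, None, parity])[1])

# Double failure: two blocks of the same stripe are gone, nothing can be rebuilt.
try:
    rebuild([d0, None, None])
except ValueError as error:
    print(error)
```

A hot spare does not change this: it only gives the controller somewhere to rebuild onto after a single failure. If the second disk of the same array fails before that rebuild finishes, the stripe has two unknowns and one equation.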
John's suggestion is one possible road. We can send the transaction log data to a different MSSQL server so that in case of disaster we can recover using the last full backup and then apply the transactions from the other server.
In any case, next week we will study the problem in depth with someone more expert than us in MSSQL strategies.
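The recovery idea (last full backup, then the transactions kept on another server) can be sketched as a toy model in Python. This is not the SQL Server API: the record layout, the integer sequence numbers and the `restore` function are assumptions made only to illustrate the replay order.

```python
def restore(full_backup: dict, shipped_log: list, backup_seq: int) -> dict:
    """Rebuild the state from a full backup plus the log records newer than it.

    Each log record is (sequence_number, operation, key, value); this layout is
    hypothetical and only loosely inspired by a real transaction log.
    """
    state = dict(full_backup)
    for seq, operation, key, value in sorted(shipped_log, key=lambda record: record[0]):
        if seq <= backup_seq:
            continue  # already contained in the full backup
        if operation == "put":
            state[key] = value
        elif operation == "delete":
            state.pop(key, None)
    return state

# State captured by the nightly full backup, taken at sequence number 10.
full_backup = {"order:1": "shipped"}

# Records that were shipped to the second server during the day.
shipped_log = [
    (9, "put", "order:1", "shipped"),    # older than the backup, skipped
    (11, "put", "order:2", "new"),
    (12, "put", "order:1", "invoiced"),
    (13, "delete", "order:2", None),
]

recovered = restore(full_backup, shipped_log, backup_seq=10)
print(recovered)
```

The sketch makes the dependency visible: the day's work survives only if the log records live on hardware that does not fail together with the primary server.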
Michael Clease
30-05-2010, 23:45
But, how would I do it on my machine?
Say, I have the file, "c:\giantfile.dat".
I am worried that my internal drive (c) will fail, and I will lose, "giantfile.dat".
I have an external drive, "f".
I would copy, "c:\giantfile.dat", to, "f:\giantfile.dat".
Simultaneously, I would alter my code so that every write operation to, "c:\giantfile.dat", also produced the same write operation to, "f:\giantfile.dat".
This is a bad idea: if the main drive developed a fault, or Windows had an issue writing data and corrupted the file, hey presto, you have two corrupt files and no backup. The only solution is to create a new file each time.
@Eros I know my company has the same backup procedure as yours, i.e. it risks losing a complete day of work. A simple but not perfect solution could be to have a dedicated server with a task that snapshots the data to local disks every hour, or at whatever interval is acceptable.
danbaron
31-05-2010, 05:39
Maybe I understand what you mean, maybe not. If the file is stored on disk like a linked list, and you are trying to copy it while people are using it, then, I guess links would be deleted during the copying procedure, and the copy could not be completed.
In that case, maybe get a super-fast drive, and stop all transactions, say, for 5 minutes each hour, during which you make a complete copy. And, never delete one copy, before it is determined that the next one is OK. Maybe that is what Michael meant.
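The rule "a new file each time, and never delete one copy before the next is known to be good" could look roughly like this in Python. It is a sketch for an ordinary file, assuming all writes are paused while the copy runs; the `snapshot` function, the file naming and the `keep` parameter are hypothetical.

```python
import hashlib
import os
import shutil
import tempfile

def sha256_of(path):
    """Checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def snapshot(source, backup_dir, sequence, keep=2):
    """Copy `source` to a new numbered file, verify it, then prune older snapshots."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    target = os.path.join(backup_dir, f"snapshot-{sequence:06d}.dat")
    shutil.copyfile(source, target)
    if sha256_of(target) != sha256_of(source):
        os.remove(target)
        raise IOError("snapshot does not match the source; older copies were kept")
    # Only now, with a verified new copy, are the oldest snapshots removed.
    names = sorted(name for name in os.listdir(backup_dir) if name.startswith("snapshot-"))
    for name in names[:-keep]:
        os.remove(os.path.join(backup_dir, name))
    return target

with tempfile.TemporaryDirectory() as root:
    source = os.path.join(root, "giantfile.dat")
    backup_dir = os.path.join(root, "backups")
    os.mkdir(backup_dir)
    for sequence in range(1, 4):
        with open(source, "ab") as handle:
            handle.write(f"transaction {sequence}\n".encode())
        snapshot(source, backup_dir, sequence)
    remaining = sorted(os.listdir(backup_dir))
    print(remaining)
```

Because every snapshot is a separate file, a corrupted source can spoil at most the newest copy; the previous one is still there to fall back on.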
Charles Pegge
31-05-2010, 08:23
Relaying all the transactions to two remote machines, sounds the best solution to me. Assuming this is a small volume of data compared with the database itself, and the technology is readily available.
Anyway I am glad you were able to recover the situation Eros and hope you don't have to face too many of these critical events in your career.
Charles
ErosOlmi
31-05-2010, 11:50
Anyway I am glad you were able to recover the situation Eros and hope you don't have to face too many of these critical events in your career.
I have had a lot of hardware crashes without data loss.
This is the first time I have lost data, and be sure it will remain ... in my brain.
ErosOlmi
31-05-2010, 11:57
Maybe I understand what you mean, maybe not. If the file is stored on disk like a linked list, and you are trying to copy it while people are using it, then, I guess links would be deleted during the copying procedure, and the copy could not be completed.
In that case, maybe get a super-fast drive, and stop all transactions, say, for 5 minutes each hour, during which you make a complete copy. And, never delete one copy, before it is determined that the next one is OK. Maybe that is what Michael meant.
Microsoft SQL database files cannot be handled like standard files.
One of their purposes is to avoid downtime and work 24 hours a day, 7 days a week.
You cannot ask hundreds of users (in my case) or thousands of users (in other companies) to log off because you have to copy the file.
Also consider that in many companies the files are terabytes in size, and you cannot simply copy them: even if your hardware is very fast (in my case all hard disks are 15k disks, i.e. 15,000 RPM) it will take hours.
What you need are agents that connect to the data. Agents are special programs or services that connect to MS SQL Server in order to get/set data or transactions.
John Spikowski
31-05-2010, 19:51
Eros,
I think you will find SQL mirroring to be the best choice for redundancy.
About a year before MS came out with SQL Mirroring, I created a SQL mirroring interface for ProvideX Business Basic. ProvideX uses a proprietary keyed file system and was using an ODBC interface back to the PVX data for use with Crystal Reports. This was slow and painful to use. My solution mirrored all writes/removes (INSERT/UPDATE/DELETE) to the SQL server and used the ProvideX data files for reads, which were most of the activity. Crystal ran directly off the SQL server, which was a huge performance increase. I ended up selling the project to a large VAR and stepped away. When MS SQL Mirroring first came out it had some issues, but now, after 5 years, the bugs should have been worked out.
John
danbaron
31-05-2010, 21:41
I'm no expert. :unguee:
:oops:
Dan :x :P
http://en.wikipedia.org/wiki/Solid-state_drive
http://www.ramsan.com/products/ramsan-6200.htm
ErosOlmi
01-06-2010, 05:02
huuu ... that would be great but cost is still out of my budget
danbaron
01-06-2010, 05:40
Then, maybe what I think Charles and John meant.
Each transaction also goes to one or more remote machines.
In that case, if a day's transactions were lost on your machines, I guess you could still recover them, online.
:violent: