
You Can’t See the Forest Because of the Trees (A True ARCserve Story)

A few years ago, we had a very large customer who was experiencing major problems with their “Powder Horn Tape Library” running in a SAN environment: tapes would randomly have their slot assignments change. This caused backup failures because tapes were not in their assigned locations. If you are not familiar with this type of tape library, it consists of multiple 6,000-slot LSMs (Library Storage Modules), each module being big enough for a person to walk inside. The robotic arm moved tapes so fast that a person could be badly injured by putting an arm inside an operational unit. As the unit was totally sealed, this was very unlikely to happen by accident. The library also had 10 LTO-2 tape drives assigned to it. Here is a support log entry describing the problem:

MM/DD/YY 2:00 PM While adding tapes to the library, we encountered a problem where the library said it had empty slots that ARCserve could not use for importing. After analyzing the problem, it was discovered that the library thought it had 5 empty slots when in actuality it was full. The library was then instructed to re-inventory itself, which corrected the problem. There was NO PROBLEM with ARCserve (although I thought that it might be an ARCserve problem in the beginning), as ARCserve gets its tape barcode/slot information from the library.

CA SWAT became involved after support had been unable to solve the problem for several weeks. After a period of time, it was decided that CA would have a SWAT person on site 24 hours a day, 7 days a week. This was done because the tape slot problem was happening somewhat randomly (although mostly at night), and it was felt that someone had to get involved immediately, looking at the current mix of running jobs and at the ARCserve and operating system logs to try to see where the failures were occurring.

One night at about 2:30 AM, while I was monitoring the backup jobs, the customer’s night shift tape person came in and went into the server room to take out the used tapes and add new tapes to the library. The normal process for this operation would be to go to the tape library console and instruct the tape library to move the old tapes (up to 10 at a time) to the export CAP (Cartridge Access Port). Then the operator would remove these tapes and load in up to 10 new tapes. Next, the operator would instruct the tape library to move these 10 new tapes from the CAP to their assigned slots within the tape library LSM. This could take quite a while if one had 30 or 40 tapes to unload and reload. As I was curious about how this was actually done (and I was also very bored), I decided to go into the server room and watch the tape operator move the tapes. Normally, I would be in the datacenter watching the ARCserve job consoles, waiting for something to fail.

Here is what I observed (my support log entry):

MM/DD/YY 2:35 AM - I was talking to the customer’s night operations person and he stated that he had opened the side panel (not the front door or the CAP) of the library and moved 5 tapes early this morning. I asked him why he did this and he stated that everyone (tape library service technicians, customer IT people, etc.) sometimes does this to save a little time when they are in a rush. IF YOU DO THIS, NEITHER THE TAPE LIBRARY NOR ARCserve WILL KNOW ABOUT THESE CHANGES. ALSO, AS THE ROBOTIC ARM WAS OPERATIONAL AT THE TIME, SEVERE INJURY COULD OCCUR IF A PERSON MOVED TAPES WHILE THE ROBOTIC ARM WAS IN MOTION. This is the probable cause of the library problem we encountered at MM/DD/YY 2:00 PM. I have instructed EVERYONE that they are NEVER to move tapes or change anything on the tape library by taking off the side panels of the LSM. All changes must be done through the CAP or the front doors so that the tape library and ARCserve know of these changes. This could turn out to be the cause of many ARCserve problems in the past, as the current tape operator and the previous operators did this every so often.

As it turned out, the tape slot problem was not the fault of ARCserve; it was purely the result of poorly monitored tape operator procedures for changing tapes in the tape library. After new tape replacement procedures were put in place, the tape slot problem disappeared completely at this site. It was not caught earlier for two reasons: it was rare for any of the CA personnel to go into the server room, and any tape slot failures would occur hours later, after the next set of backup jobs started running and new tapes were needed for those backups.
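To make the failure mode concrete, here is a minimal Python sketch of the kind of check a re-inventory effectively performs: comparing the slot map the library has on record against a fresh physical scan. This is only an illustration under stated assumptions. It does not use any ARCserve or tape library API, and the function name, data layout, and barcodes are all hypothetical.

```python
def find_slot_drift(recorded, scanned):
    """Compare a recorded slot map against a fresh physical scan.

    Both arguments map slot number -> barcode, with None for an empty slot.
    Returns a list of (slot, recorded_barcode, scanned_barcode) mismatches.
    (Hypothetical helper; not part of ARCserve or any library firmware.)
    """
    drift = []
    for slot in sorted(set(recorded) | set(scanned)):
        before = recorded.get(slot)
        after = scanned.get(slot)
        if before != after:
            drift.append((slot, before, after))
    return drift


# Made-up example: the records say slots 3 and 4 are empty, but someone
# placed tapes there by hand through a side panel.
recorded = {1: "A00001", 2: "A00002", 3: None, 4: None}
scanned = {1: "A00001", 2: "A00002", 3: "A00007", 4: "A00008"}

for slot, before, after in find_slot_drift(recorded, scanned):
    print(f"slot {slot}: recorded={before} scanned={after}")
```

Any mismatch reported by a check like this means the recorded inventory can no longer be trusted, which is why changes made outside the CAP or the front doors surfaced only hours later as backup failures.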

1 Comment to You Can’t See the Forest Because of the Trees (A True ARCserve Story)

  1. Ravi
    October 21, 2010 at 1:15 pm

    Thanks a lot for sharing your experience on this issue.

