This document describes the steps taken to commission a new automated tape library, a 10,000-slot StorageTek (STK) SL8500, and a relatively new tape technology, LTO-3. The tests were performed with up to 15 STK/IBM LTO-3 tape drives and 100 Fujifilm LTO-3 tape cartridges. The model of operation is based on the document DRAFT SL8500 operation model, which also describes the tape library and how it operates. Following this model, the drives were uniformly distributed across all four rails of the library, and the library was configured to float the cartridges.
1. Enstore was run unmodified, so the tape drive selected for a volume mount request does not take the locations of the drive and tape into account. The implication is that the tape and drive may be located on different rails, in which case the tape must be passed via an elevator to the rail the drive is on, and arms on two rails participate in moving the cartridge to the drive. In float mode, a dismounted tape is placed in a slot on the same rail as the drive. This does not give optimal performance, but modifying Enstore to preferentially select a free drive on the same rail as the cartridge is thought to require a major Enstore code revision.
2. Though 100 cartridges were scattered along the length of the library, yielding fairly representative average response times, only a small portion of the slot space was accessed because of the manner in which the library returns cartridges. The library software has no functionality to move a cartridge from one location to another, which prevented us from arranging full slot coverage.
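The rail-aware selection mentioned in item 1 can be illustrated with a minimal sketch. This is not Enstore code; the `Drive` record and `select_drive` function are hypothetical names introduced only to show the idea of preferring a free drive on the cartridge's rail and falling back to any free drive (which then costs an elevator move).

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Drive:
    name: str           # hypothetical drive label
    rail: int           # rail (1-4) the drive is attached to
    busy: bool = False  # True while a tape is mounted


def select_drive(drives: Sequence[Drive], cartridge_rail: int) -> Optional[Drive]:
    """Prefer a free drive on the cartridge's rail; otherwise take any free drive."""
    free = [d for d in drives if not d.busy]
    if not free:
        return None  # nothing available; the request would have to wait
    same_rail = [d for d in free if d.rail == cartridge_rail]
    # A drive on another rail is still usable, but needs an elevator pass.
    return same_rail[0] if same_rail else free[0]


drives = [Drive("d1", 1), Drive("d2", 2, busy=True), Drive("d3", 3)]
print(select_drive(drives, 3).name)  # drive on the cartridge's own rail
print(select_drive(drives, 2).name)  # rail 2 drive is busy, so fall back
```

In the unmodified configuration used for these tests, the choice is effectively the fallback branch alone, which is why cross-rail mounts appear in the latency results.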
The goal was to simulate on the order of one month’s usage, which we decided would entail writing 100 Terabytes and reading back 300 TB. We also wanted to perform accelerated mount/dismount tests. The tests executed were:
1. Write about 100 4GB (average size) files to each tape in sets of 50 tapes with up to 15 movers and 10 worker nodes. Each tape had its own file family and each worker node rotated through the five file families, guaranteeing a mount/dismount for each file. Read-after-every-write mode was set on each of the movers so that each file was immediately read back and the CRC verified. When the tape filled, files were selected randomly and read from the tape until 3x the number of files on each tape were read back. Once this was accomplished, the 50 tapes were recycled and the test started again. 80TB of tape were written in this manner.
2. 50 additional tapes were entered for a total of 100 when a media failure turned up early on.
3. The last 20TB written to tape had read-after-every-write turned off. In addition, once a file was written, a randomly selected file was read back, causing the drive to reposition to a random location after the write.
4. One tape is having its first 10 files read over and over again, in a random order on each pass. We are tracking the number of passes.
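The write rotation and random read-back in test 1 can be sketched as follows. This is an illustration only, not the scripts actually used: the file-family names are hypothetical, and sampling with replacement is an assumption (reading 3x the number of files forces repeats in any case).

```python
import random
from itertools import cycle, islice


def family_rotation(node, families_per_node=5):
    """Endless round-robin over one worker node's file families (one tape each)."""
    # Hypothetical names: node 3 gets ff_03_0 .. ff_03_4.
    return cycle([f"ff_{node:02d}_{k}" for k in range(families_per_node)])


def readback_plan(files_on_tape, multiple=3, seed=None):
    """Random file indices to read: `multiple` times the number of files on the tape."""
    rng = random.Random(seed)
    # Sampling with replacement is an assumption of this sketch.
    return [rng.randrange(files_on_tape) for _ in range(multiple * files_on_tape)]


# Consecutive writes from one node always target a different family, hence a
# different tape, which is what the test relies on to force a mount per file.
first_six = list(islice(family_rotation(0), 6))
plan = readback_plan(100, seed=1)
print(first_six)
print(len(plan))
```

With 10 worker nodes and five families each, the rotation covers the 50 tapes of one set.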
The first figure below shows the total bytes transferred/day and the second only the bytes written/day.


Plot 1 shows the mount count distribution for all of the media. The mean is around 900 for 100 tapes, so about 90,000 mount/dismount cycles were performed. With 15 drives, this corresponds to about 6000 mount/dismount cycles for each drive.
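The arithmetic behind these totals is spelled out below as a short check. The mean of 900 is the approximate value read off the plot, and the per-drive figure assumes the load was spread evenly over all 15 drives.

```python
mean_mounts_per_tape = 900  # approximate mean from the mount count plot
tapes = 100
drives = 15                 # assumes all 15 drives shared the load evenly

total_cycles = mean_mounts_per_tape * tapes
cycles_per_drive = total_cycles / drives

print(total_cycles)      # 90000
print(cycles_per_drive)  # 6000.0
```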

There were no drive or tape cartridge failures.
In the first week of operation, two robot arms required replacement. There have been no arm problems since that incident.
The lower robot arm would squeal along a small segment of its track. The problem was determined to be track (rack gear) alignment that led to the arm being skewed on one side of the library, resulting in one of the arm’s rollers rubbing against the guide rail. Note that this is likely a result of the tighter tolerances required for the longer 10k SL8500. The robot aligns its skew against one side of the library, and its skew on the other side depends on the upper and lower rack gears not being shifted relative to the other side by gaps introduced when installing the rack gear segments. The problem was fixed by swapping the upper and lower racks for a segment and has not recurred.
Tape TEST01 is undergoing a read torture test. This test repeatedly reads the first 10 files on the tape (average size 4 GB) in a randomly selected order. The tape is kept mounted in the drive. At the time of this writing, the tape has had 1000 passes (10,000 reads) with no errors.
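The structure of the torture test can be sketched as a simple loop. This is not the script that was run; `read_file` is a hypothetical callable standing in for whatever actually reads file number `idx` from the mounted tape and reports success.

```python
import random
from typing import Callable, Optional, Tuple


def torture_passes(read_file: Callable[[int], bool], passes: int,
                   files: int = 10, seed: Optional[int] = None) -> Tuple[int, int]:
    """Read the first `files` files in a fresh random order on every pass.

    Returns (reads attempted, read errors).
    """
    rng = random.Random(seed)
    reads = errors = 0
    for _ in range(passes):
        order = list(range(files))
        rng.shuffle(order)  # new random ordering for each pass
        for idx in order:
            reads += 1
            if not read_file(idx):
                errors += 1
    return reads, errors


# A stand-in reader that always succeeds reproduces the reported counts.
print(torture_passes(lambda idx: True, passes=1000))  # (10000, 0)
```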
There was one incident of data loss on a tape. This tape was written in read-after-every-write mode (each file being immediately read back after write and the CRC verified), so it had been written successfully. During the read test cycle, a significant number of files failed to be read with media sense errors. The tape was sent to StorageTek for analysis. StorageTek verified from the onboard diagnostics that the files were written successfully but later failed, and sent the tape to IBM for further analysis. IBM found debris on the tape in the areas of failed reads, and performed a destructive analysis on the debris. They found the debris was a fibrous material consistent with plant or insect organic material. Around this same time it was noticed that there was a serious bug infiltration in the tape library room. Many moths, pill-bugs, and gnats were getting into the room. Insects can infiltrate the library, and the tape drive transports are exposed: they have no tape door but are open-mouthed. The accepted conclusion of this analysis is that a bug gained entry to a tape drive transport and was crushed against the tape. Steps are being taken to seal the room against insect infiltration.
Two other less significant read incidents occurred:
· A read failed with a media sense error on a tape. The tape was automatically mounted in another drive and the file was successfully read. This behavior occurs infrequently in our other libraries.
· A file written to a tape failed with a selective CRC error. This means that the file was immediately read back after the write and either the CRC check failed or the read failed. In this case the read failed with a media sense error. Enstore successfully wrote the file to another tape, and the tape was marked no access. As already mentioned, read failures happen infrequently. However, in the case of a selective CRC, which is associated with a write, the read cannot be tried in another drive. No further investigation was performed.
Mount latencies for the SL8500 and the 9310 complex (two 9310s connected via pass-thru) are shown below. The latency is of the same order of magnitude for both, and banding is visible in both. The lower band for the SL8500 is for mounts where the cartridge is on the same rail as the drive. The upper band is due to the additional latency introduced by the elevator when a cartridge has to be moved to a different rail to be mounted in a drive on that rail.
The upper band in the 9310 complex results from moving a cartridge through the pass-thru port when the cartridge and tape drives are in different silos.
When SL8500s are complexed with pass-thru ports, cartridges that need to be mounted in a drive in another ACS (i.e. another SL8500) will likely suffer the additional overhead of two elevator moves and a pass-thru move, which may significantly degrade the mount latency.

Mount Latency SL8500

Mount Latency stken 9310 complex
No attempt was made to stress the mount rate, but as the figure below indicates, we exercised the library up to a peak of about 2750 exchanges/day.

Back-to-back local transfers of five 2 GB files in a row completed in 3 minutes, an aggregate rate of about 55 MB/s. This is below the maximum 80 MB/s streaming rate of the drive, but is roughly consistent with it once the per-file overhead (~8 seconds/file) is taken into account.
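The rate estimate can be reproduced with a few lines of arithmetic. The sketch assumes 1 GB = 1000 MB and treats the quoted 3 minutes and 8 seconds/file as exact, although both are rounded figures in the text.

```python
files = 5
file_size_mb = 2000      # 2 GB per file, assuming 1 GB = 1000 MB
elapsed_s = 3 * 60       # total wall-clock time for the five transfers
overhead_per_file_s = 8  # approximate per-file overhead quoted in the text

total_mb = files * file_size_mb
aggregate_rate = total_mb / elapsed_s
streaming_time_s = elapsed_s - files * overhead_per_file_s
streaming_rate = total_mb / streaming_time_s

print(f"aggregate: {aggregate_rate:.1f} MB/s")        # aggregate: 55.6 MB/s
print(f"net of overhead: {streaming_rate:.1f} MB/s")  # net of overhead: 71.4 MB/s
```

With these rounded inputs the rate net of overhead comes out a little above 70 MB/s, which is in the neighborhood of the drive's 80 MB/s streaming rate; a slightly shorter elapsed time or a slightly larger per-file overhead would close the remaining gap.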
The commissioning was performed from un-optimized worker nodes across two network hubs. The plots below show the performance during the commissioning. These results should be considered in the light that no attempt was made to optimize the performance (e.g. by disk striping on the worker nodes) other than using a large average file size of 4 GB.
All rates are in MB/s. Note that the blue reads are plotted over the red writes, so writes are obscured wherever they are at the same rate as reads in the same time period.

Drive rate

Network rate

Overall Rate
The following items are considered of high importance for the tape library room readiness:
1. Maintaining and monitoring a tolerable temperature and relative humidity during summer and winter
2. Sealing against insect infiltration
3. Fire suppression and detection for the room
4. Fire suppression and detection for the library
5. Alternate/generator power source for the library and climate control
There is a CRAC unit in the tape library room which provides humidification, cooling, and some heating capability. This winter there will be two racks (16 each) of mover computers, the tape library, the 40 KVA UPS, and possibly the CRAC to heat the room. This is estimated to be <40,000 BTU/hr. Gerry Bellendir investigated heating and humidification for the winter:
“I looked up the original heat loss calculations for the space when it was the Wide Band Lab. The calculation dated 10/19/83 had a heat loss of 19,825 btu/hr (5.8kw) for the space. If the equipment in the space has an estimated heat gain of 40,000 btu/hr (11.7kw) then this would be enough to heat the space. In addition, the CRAC unit in the space does have a 30kw (102,390 btu/hr) heater if heating is required.”
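The unit conversions in the quoted estimate can be checked with a short calculation. The conversion factor of about 3412.14 BTU/hr per kW is a standard value, not taken from the quote; the quoted 102,390 btu/hr corresponds to a rounded factor of 3413.

```python
BTU_HR_PER_KW = 3412.14  # 1 kW is approximately 3412.14 BTU/hr


def btu_hr_to_kw(btu_hr):
    """Convert a heat flow from BTU/hr to kW."""
    return btu_hr / BTU_HR_PER_KW


heat_loss_btu_hr = 19825       # 1983 heat loss calculation for the space
equipment_gain_btu_hr = 40000  # estimated upper bound on equipment heat gain
crac_heater_kw = 30            # CRAC heater rating

print(round(btu_hr_to_kw(heat_loss_btu_hr), 1))       # 5.8
print(round(btu_hr_to_kw(equipment_gain_btu_hr), 1))  # 11.7
print(round(crac_heater_kw * BTU_HR_PER_KW))          # 102364
print(equipment_gain_btu_hr > heat_loss_btu_hr)       # True
```

Note that 40,000 BTU/hr is an upper estimate, so the margin over the calculated heat loss depends on how much equipment is actually running.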
The humidity and temperature are monitored by the CRAC unit and tied into the building alarms. It is desirable but not critical to have monitors at other locations in the room.
Work is in progress to seal the room against insects. The library will not be considered ready for production until the room is sealed.
The room is monitored by VESDA systems that will alert the fire department.
A tape library fire suppression system is installed. The tape library is monitored by its own VESDA system and two smoke detectors at either end of the library. The VESDA will alert the fire department. If both smoke detectors alarm, FM200 gas will be dumped into the library to extinguish the fire.
The library and mover computers have redundant power supplied by the tape library room’s 40 KVA UPS and computer room A’s 1000 KVA UPS, which are fed from different sources. The CRAC and fire suppression panel are powered separately. The 1000 KVA UPS is used to power down the computers in computer room A. It is unclear how much capacity will be left after the computer room A shutdown to power the tape library and mover computers. There are provisions for hooking up generators to the 1000 KVA source and to the CRAC and fire suppression panel.
Worker nodes were CDF farm nodes. encp requests from these nodes passed through the FCC hub router and the GCC tape library room router (see the network design note). encp version 3.6 was used.
enstore server and mover nodes (gccensrv1, gccensrv2, gccenmvr1-16) were all on the network local to the GCC tape library room.
SL8500:
- 10,000 slots
- Redundant AC power and drive power (front touch panel console is not redundant, but not necessary to operate)
- 15 IBM LTO-3 drives which stream at either 40MB/s or 80MB/s. 128MB internal buffer. Open mouthed.
- ACSLS Sun V240: Control LAN name/IP fntt-gcc/192.168.89.248
Enstore (temporary gccen instance):
- gccensrv1: SLF 3.0.5, Supermicro Dual Xeon 3.6GHz, 2MB L2 cache, 4GB RAM, HT on, 1G NIC
o Accounting, alarm, drivestat, event relay, file clerk, volume clerk, info server, inquisitor, log server, ratekeeper
- gccensrv2: SLF 3.0.5, Supermicro Dual Xeon 3.6GHz, 2MB L2 cache, 4GB RAM, HT on, 1G NIC
o media changer, library manager
Mover plant:
- 16 movers (15 active): SLF 3.0.5, Supermicro Dual Xeon 2.8GHz, 2MB L2 cache, HT off, 1G NIC, 2G FC