This document describes a vision of, as well as the steps needed to reach, a stable (primarily) Linux computing environment in two to three years' time. Relevant policies for software and hardware can also be found here.
The design of the proposed deployment bears a strong resemblance to NRAO's deployment, by which it was inspired; however, it is adjusted to serve the idiosyncrasies of our system, and to make the transition as smooth as possible for our users.
The deployment is divided into four stages, which are described below. The final goal is to have a Network Appliance (NetApp) as the core of our computing environment. In addition, we envision duplicate servers that will allow recovery from a disaster in less than an hour. We also envision distributing services across different servers to limit the impact of a possible disaster and to make better use of our network. In the last five years the departmental computing needs have quadrupled, and we expect this trend to continue; therefore, efficient network utilization is key to the deployment. The NetApp is a piece of hardware that resembles a RAID and has been used in many computing environments; it is a proven concept. It is essentially a data repository that can be accessed by workstations running diverse operating systems. As such, it is fairly secure on the network, consolidates discs into one location, is easy to back up, is fairly transparent to users, and is accessible from different operating systems (and is therefore a long-term investment). Since a considerable amount of money is involved, we decided to prototype the NetApp with a RAID that we already have, as a proof of concept. After that, we will consider the appropriate hardware that we need to purchase. In any case, the time invested in Stages 1 and 2 will not have been wasted, because they will have helped pave the way to Stage 3.
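To give a feel for the growth figure quoted above (a fourfold increase over five years), the sketch below converts it to an equivalent annual factor and projects it forward. It assumes constant compound growth, which this plan does not claim; the function names are illustrative only.

```python
def annual_growth_factor(total_factor, years):
    """Equivalent yearly growth factor for a total increase over a period."""
    if total_factor <= 0 or years <= 0:
        raise ValueError("total_factor and years must be positive")
    return total_factor ** (1.0 / years)


def projected_demand(current, total_factor, years, horizon):
    """Demand after `horizon` years, assuming the same compound rate holds."""
    return current * annual_growth_factor(total_factor, years) ** horizon


# Fourfold growth in five years is roughly 32% per year.
rate = annual_growth_factor(4, 5)
print(f"annual factor: {rate:.2f}")
# Relative demand at the end of a three-year deployment (today = 1.0).
print(f"after 3 years: {projected_demand(1.0, 4, 5, 3):.2f}")
```

Under this assumption, demand would a little more than double over the three-year horizon of the deployment, which is the scale that storage and network capacity would need to accommodate.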
During the prototyping phase, currently in progress, Luna, a dual 2.8 GHz Xeon machine with a 377 GB RAID level 5 array, is playing the role of the NetApp. Luna runs Fedora Core 2, downloaded from ITC's support distribution web site, and is an NIS client of Astsun. Luna exports its RAID to a select few Linux and Solaris workstations that comprise the test module. It has an SDLT-320 tape drive with 160 GB tapes for automated backups. Luna's RAID is slightly smaller than the RAID on Astsun, but it is large enough to hold all home directories. At this stage Luna will be limited to exporting the RAID and talking to a small number of machines. The Solaris home directories of a selected group of people (Howard, Kiriaki and a few grads) will be moved to Luna during this phase. A user in this group should now use the same password to log in to every machine within the test module. A login script should determine how to set up the appropriate environment for Solaris or Linux, with Astsun as the Solaris server and Selana as the Linux server. The password is kept on Astsun (the NIS server). In this way we do not compromise Astsun while we are testing the configuration.
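As a rough check on the backup arrangement, the number of tapes needed for one full dump of Luna's RAID follows from the capacities quoted above. The sketch assumes the RAID is completely full and that 160 GB is the usable capacity of each tape with no compression; the function name is illustrative.

```python
import math


def tapes_for_full_backup(data_gb, tape_gb):
    """Number of tapes needed to hold data_gb when each tape holds tape_gb."""
    if data_gb < 0 or tape_gb <= 0:
        raise ValueError("data size must be non-negative and tape size positive")
    return math.ceil(data_gb / tape_gb)


# Luna: 377 GB RAID, 160 GB tapes (figures from this plan).
print(tapes_for_full_backup(377, 160))  # 3
```

In practice the home directories are expected to occupy less than the full array, so a full dump may fit on fewer tapes.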
The Solaris workstations Carina, Ariel and Helios and the Linux workstations Pulsar, Solstice and Realos will be part of the test module. Users logging in from Carina and from Pulsar should both see their home directory on Luna, while the login script should determine the operating system, configure the environment variables accordingly, and link to user-specific directories. The selected users will verify the functionality of the configuration, while the system administrators will work to guarantee operations and security.
During this phase Astsun will still be serving mail, home directories for the rest of the users, web services, NIS, NFS and /astro8 (which contains astronomy-specific software). Selana, the Linux server, will be serving the current Linux home directories of the few users that do not migrate during the test phase, and /astro (the Linux binaries of the astronomy-specific software). NIS or NFS from Astsun will not be allowed, so that the network remains clean during this test phase. Calisto will be the printing and SAMBA server running Fedora. The RAID on Calisto will still be the repository for all Departmental PCs, including the backups of the secretarial work, as well as the connection to the Clark107 and ASTR265 PCs.
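The login-time switch described above can be sketched as follows. This is a minimal illustration in Python rather than the actual login script: the function name `select_environment`, the variable names `ASTRO_ROOT` and `USER_OS_DIR`, and the per-OS subdirectory under the home directory are assumptions made for the example. Only the /astro8 (Solaris) and /astro (Linux) software trees come from this plan.

```python
import os

# Software trees named in this plan: /astro8 holds the Solaris software
# served by Astsun, /astro holds the Linux binaries served by Selana.
SOFTWARE_ROOTS = {
    "SunOS": "/astro8",  # platform.system() reports "SunOS" on Solaris
    "Linux": "/astro",
}


def select_environment(os_name, home):
    """Return environment settings for a login on the given operating system.

    ASTRO_ROOT and USER_OS_DIR are hypothetical names chosen for this sketch.
    """
    if os_name not in SOFTWARE_ROOTS:
        raise ValueError(f"unsupported operating system: {os_name}")
    return {
        "ASTRO_ROOT": SOFTWARE_ROOTS[os_name],
        # OS-specific files live in a subdirectory of the shared home directory.
        "USER_OS_DIR": os.path.join(home, "." + os_name.lower()),
    }
```

A login wrapper could call `select_environment(platform.system(), os.path.expanduser("~"))` and export the returned values, so that the same home directory on Luna serves both a Carina (Solaris) and a Pulsar (Linux) login.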
The purpose of this experiment is to test how secure and viable the configuration is, how transparent it is to a user, and how easy it is for an administrator to maintain. We will explore issues of how to use our network beneficially. We will also explore how the prototype will react in case of a disaster, and establish recovery policies.
We are fairly confident that the prototype will work by the end of July. As previously stated, if the experiment works we have paved the way for the final steps. If difficulties occur we can fall back to plan B, in which Astsun and Selana are isolated servers, with Selana coming to life while Astsun is gradually phased out.
At the beginning of Aug 2004, if all works as scheduled, we will move the home directories of those users who are comfortable with the change onto Luna, with more workstations talking to Luna from both Solaris and Linux. The goal will be to have as many users as possible migrated twelve months later.
We will work with groups to ensure that their software needs are first met within the test module and then start migrating additional users. Groups that wish to purchase new equipment should contact us to work out hardware issues; they should anticipate that by December this equipment will be functioning reliably. We anticipate that by next summer a fair number of the graduate students, who compose our initial target group, will have been moved to the new system.
During this phase, web, mail, /astro8, and some home directories are served by Astsun, while Selana is serving /astro, but (hopefully) no home directories.
After Sep 2004, assuming that the prototyping phase has gone well, we will be looking into the kind of hardware we would like to get to replace Luna. NRAO's configuration costs on the order of $45K, with a $1K annual fee for updates to the software that the NetApp is running. We have heard mixed reviews, and it remains to be seen what we will choose in the end. We can certainly put it on the wish list for ETF funding. Assuming we can purchase such a device (or something similar), next summer we should be able (relatively painlessly) to replace Luna with Luna II, the final NetApp. In addition, we should be thinking about purchasing a chassis with more disc capacity, which will allow us to expand Luna to a larger RAID.
Starting in Sep 2005, this final stage will see Astsun still maintaining /astro8, mail, web services and NIS/NFS for the very few remaining Solaris workstations. At the same time Selana will be maintaining /astro, and mail services will be moving towards Selana. Web services will stay with Astsun (moving the web will be an entirely separate project). Calisto will be the SAMBA and CUPS server.
ITC will be backing up Luna, since it will be the site of the home directories. These home directories are not intended for large amounts of data storage, but rather for small amounts of critical file storage (as is the case with the current Astsun RAID). In our department, we have many groups that handle large amounts of data, as well as groups that handle small but numerous files. The data-oriented groups are strongly advised to purchase RAIDs on which to consolidate their data, and either have ITC back up the RAID or use other backup solutions. These groups are advised to have a server attached to their RAIDs to 1) manage the RAID, and 2) install group-specific software. It is less network intensive if groups reduce data on the server attached to the RAID. However, groups that wish to take advantage of the CPU power of many individual workstations will need to work with us to ensure a healthy network.
There is a recommended list of hardware (which will be updated every 6 months) that users are strongly encouraged to select from when purchasing equipment. This list includes CPU types, video/audio cards, SCSI and network cards. Users who purchase non-recommended hardware will be required to provide for the installation and maintenance from their own resources. All such arrangements must be approved by the departmental computing committee or system administrator, to ensure the continued safety and functionality of the departmental computing resources.
The adopted Linux flavor is Fedora. Users are (required) strongly encouraged to comply with the kick-start that we provide. Root access for users is strongly discouraged, but will be considered on a case-by-case basis if the situation warrants. Research groups should work with us to provide a homogeneous computing environment. Individuals who wish to explore something that suits their research needs better are welcome to do so AT THEIR OWN RISK and outside our network.
Groups that wish to purchase equipment are encouraged to do so, keeping in mind that the period from October through December is the time they are most likely to see their equipment functioning in a stable operating environment.
We wish to see the Beowulf cluster expand through the addition of more nodes; however, until the deployment is mature the cluster will remain fairly isolated from the rest of the network.
Since we envision duplicate servers, for safety reasons we also envision two physically different locations in which to deploy them, ideally one in this building and one in G26. The choice of location will be driven by 1) the ability to maintain an appropriate stable room temperature, 2) power availability, and 3) network speed and bandwidth.