CADAC
Adding Data
Client Configuration
This page assumes you have configured the SRB client for your system.
SRB Scommands
From the Scommands documentation:
Scommands are command line utilities that run in the Unix, Windows or Mac OS command shells. Most Scommand names are preceded by an "S". These are Unix-like commands for accessing SRB data and metadata.
The Scommands may conflict with other commands on non-case-sensitive file systems. In particular, the SRB command to remove a file in the SRB is Srm, which conflicts with srm.
Since srm works on your local file system, you may want to alias it to srm -i to avoid surprises.
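A minimal sketch of that alias, assuming a Bourne-style shell such as bash. Putting it in a startup file such as ~/.bashrc is an assumption about your setup, not something the SRB requires.

```shell
# Make the local srm ask before each removal, so that typing it by
# mistake (instead of the SRB's Srm) does not silently delete local files.
alias srm='srm -i'
```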
This means that many familiar commands, like cd, pwd and ls, also appear as Scommands, but in the form of Scd, Spwd and Sls. In addition, there are a few FTP-like commands, such as Sput and Sget, which handle data transfers, and more specialized commands for dealing with metadata (not covered here).
Here is a complete list of the Scommands, with links to their documentation.
Browsing
Browsing the SRB is like browsing a file system: you change directories (Scd), look at stuff (Sls), and every now and then forget where you are and check (Spwd).
Before we do anything, we need to connect to the SRB.
$ Sinit
When we first connect, we're placed in our home directory in the SRB. You can check this using Spwd.
$ Spwd
/home/rick.cadac
Now, I'd like to see what I've put there, so I use Sls.
$ Sls
/home/rick.cadac:
  IO_perf.0000
  collision_320.png
Navigating the SRB hierarchy works the same way as navigating a normal POSIX file system. To go up a level, use Scd .., and to go to the top of the tree, use Scd /.
$ Scd ..
$ Spwd
/home
$ Scd /
$ Sls
/:
  C-/home
  C-/container
  C-/CADAC
  C-/Archive
  C-/styles
  C-/trash
  C-/LCAzone
(The C- prefix in the listing indicates a directory, which is often called a collection in the SRB.)
Moving around in the SRB while navigating your local file system can be a little confusing. The thing to remember is that Scommands operate on the SRB space. As an example, after using Scd above, I am still in my home directory on my local computer.
cable:~ rpwagner$ Spwd
/
cable:~ rpwagner$ pwd
/Users/rpwagner
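One way to keep the two locations straight is to print them together. The following is only a sketch: the function name whereami is made up for this page, and it assumes an SRB session has already been started with Sinit.

```shell
# Print the local and SRB working directories together.
# "whereami" is a hypothetical helper name; pwd reports the local
# directory and Spwd reports the current SRB directory.
whereami() {
    printf 'local: %s\n' "$(pwd)"
    printf 'SRB:   %s\n' "$(Spwd)"
}
```

Calling whereami after the Scd commands above would show / for the SRB and the unchanged home directory for the local file system.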
Creating a Directory
Now that we have an idea how to move around in the SRB, we need to know how to create directories, so that we can add data. In the top level directory listing from the previous example, we see a CADAC directory in the SRB; this is the place to put shared data. Data in this area may be organized in one of three ways:
- by user;
- by a group of users;
- by a shared project.
We are leaving it to members to decide which method to choose. The only requirement is that data is added to the SRB only once. In other words, please do not put a dataset into a personal directory, and then put a second copy into a project directory. In this case, either let your collaborators know where to find your data, or use Smv to move it to the project directory.
Alright, let's look in the CADAC directory and see what's there.
$ Scd /CADAC
$ Sls
/CADAC:
  C-/CADAC/kritsuk
  C-/CADAC/mlnorman
Those two directories are personal directories for Mike Norman and Alexei Kritsuk. (Please note: personal does not mean private, just that the data is managed by a single user.)
I am going to create a third directory, but this time it will be a shared directory, for data produced by our group, the Laboratory for Computational Astrophysics. As you can probably surmise, the command for this is Smkdir.
$ Smkdir lca
Now I'll check to ensure we have the expected results.
$ Sls
/CADAC:
  C-/CADAC/lca
  C-/CADAC/kritsuk
  C-/CADAC/mlnorman
Putting a File
As a good first example of adding data, I am going to put a README as the first file in the new lca directory. This will help other CADAC members locate data, and serves as a reference for the data maintainer. The README is just a simple text file.
$ cat README
= LCA Data =
This is a shared directory for the Laboratory for Computational Astrophysics.
== Contents =
* README - This file
* DD0003.tar.gz - Sample data set
(The sample data set will be added by the end of this page.)
Like FTP, the SRB will put the file from your local machine into whatever your current directory in the SRB is. Spwd is very useful for keeping track of where you are.
$ Scd lca
$ Spwd
/CADAC/lca
To put in a single file, we call Sput, and tell it the name of the file to ingest (another SRB term).
$ Sput README
$ Sls
/CADAC/lca:
  README
Some useful flags to Sput include -r for recursive ingestion, and -b for bulk upload. Check out the Sput manpage for usage and examples.
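Because Sput ingests into whatever the current SRB directory happens to be, it can help to check Spwd before putting anything. The sketch below wraps the two steps together. The helper name put_checked is hypothetical, and it assumes Spwd prints only the current SRB path, as in the examples above.

```shell
# Ingest files only if the current SRB directory is the expected one.
# "put_checked" is a hypothetical helper name; Spwd and Sput are the
# Scommands used above.
# Usage: put_checked /CADAC/lca README
put_checked() {
    expected=$1
    shift
    current=$(Spwd)
    if [ "$current" != "$expected" ]; then
        echo "SRB directory is $current, expected $expected; nothing was put" >&2
        return 1
    fi
    # Remaining arguments (flags and file names) are passed to Sput as-is.
    Sput "$@"
}
```

Since all remaining arguments go straight to Sput, flags such as -r can be given after the expected directory.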
Getting a File
Getting files is handled as you might guess, by calling Sget. Like standard Unix commands, Scommands will accept absolute and relative paths. Like any file operation, you may want to ensure that the destination directory doesn't already contain a file with the same name as one you're getting.
$ ls
$ Sget /CADAC/lca/README
$ ls
README
Sget supports similar options to Sput, including a parallel I/O mode, and checksums. Again, check out the Sget manpage.
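The check for an existing local file of the same name can be scripted. This is a sketch only: get_new is a hypothetical helper name, and it assumes Sget writes the file into the current local directory under its SRB base name, as in the example above.

```shell
# Fetch a file from the SRB unless a local file of the same name exists.
# "get_new" is a hypothetical helper name; Sget is the Scommand used above.
# Usage: get_new /CADAC/lca/README
get_new() {
    src=$1
    name=$(basename "$src")
    if [ -e "$name" ]; then
        echo "$name already exists in $(pwd); not fetching" >&2
        return 1
    fi
    Sget "$src"
}
```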
Another way to access a file is to use Scat, which acts like cat, but reads the file from the SRB and prints the contents to stdout on the local computer.
$ Scat README
= LCA Data =
This is a shared directory for the Laboratory for Computational Astrophysics.
== Contents =
* README - This file
* DD0003.tar.gz - Sample data set
Where is It?
Files placed into the SRB are stored in physical resources (disks, tapes, databases, etc.) that are often called vaults. The reason for the name is that there is a database keeping track of the files, and where they are on each resource. Users do not normally have access to the file vaults, since moving, deleting or modifying a file would create an inconsistency in the database.
However, for the CADAC, we wanted to permit users to read data directly from the disk on GPFS-WAN, and be able to easily archive (replicate) data to tape. The result takes a minute to understand, but works well, and allows for growth in the future.
File locations in the CADAC's GPFS-WAN file vault are determined by two things:
- the SRB user who added the file, and
- the file's path in the SRB.
Files in the vault are organized first by the user who added them, and then by their path in the SRB. If SRB_VAULT_DIR is the location of our vault, and SRB_FILE_PATH is the path to some file in the SRB, then a sketch of a file's physical location is:
FILE_LOCATION = $SRB_VAULT_DIR + user_name.cadac/ + $SRB_FILE_PATH
You may notice some name mangling in the SRB file names in the file system. The file will keep its name in the SRB space, but things like underscores will be removed in the vault copy.
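The location rule above can be written as a small shell function. This is a sketch under stated assumptions: vault_path is a hypothetical helper name, and it does not model the name mangling, so for files whose names are mangled in the vault the result will not match the actual file name.

```shell
# Compose the expected vault location of an SRB file from the rule above.
# "vault_path" is a hypothetical helper; name mangling is NOT modeled.
# Usage: vault_path VAULT_DIR USER_NAME SRB_FILE_PATH
vault_path() {
    vault_dir=${1%/}        # drop one trailing slash, if present
    user_name=$2
    srb_file_path=${3#/}    # drop the leading slash of the SRB path
    printf '%s/%s.cadac/%s\n' "$vault_dir" "$user_name" "$srb_file_path"
}
```

With SRB_VAULT_DIR set to the vault location, vault_path "$SRB_VAULT_DIR" rick /CADAC/lca/README prints the expected location of the README from the Sput example.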
Finding a File
Suppose you wanted to do some analysis on the README from the Sput example. This is a thoroughly made-up example, but it becomes very realistic if you replace the README with a data set from a 1024^3 turbulence simulation.
First, the CADAC file vault is located in our project directory on GPFS-WAN.
/gpfs-wan/projects/cadac/srbVault
I (Rick Wagner) added the file, my username is rick, and the path in the SRB is
/CADAC/lca/README
Given all of this information, we expect to find the file at
/gpfs-wan/projects/cadac/srbVault/rick.cadac/CADAC/lca/README
Let's check.
ds004 % cd /gpfs-wan/projects/cadac/srbVault/
ds004 % cd rick.cadac/
ds004 % cd CADAC/lca/
ds004 % ls
README
ds004 % pwd
/gpfs-wan/projects/cadac/srbVault/rick.cadac/CADAC/lca
As an aside, while CADAC members have read and write permissions to the software directory on GPFS-WAN, we only have read access to the srbVault directory. This is for the reason given previously: modifying the files directly is a bad idea.
To find out who owns a file in the SRB, use the -l flag to Sls.
$ Sls -l
/CADAC/lca:
  rick 0 gpfs-wan-cadac-125 165 2007-08-09-21.41 % README
The first column is the file's owner.
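To pull the owner out of a long listing in a script, the first column can be selected with awk. A sketch only: srb_owner is a hypothetical helper name, and it assumes the column layout shown above (owner first, file name last) and file names without spaces.

```shell
# Print the owner of one file from an "Sls -l" listing read on stdin.
# "srb_owner" is a hypothetical helper name.
# Usage: Sls -l | srb_owner README
srb_owner() {
    # Skip the one-field collection header; match on the last field.
    awk -v name="$1" 'NF > 1 && $NF == name { print $1; exit }'
}
```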
Replicating Data
The final thing to know about placing data in the SRB is that data can reside on multiple physical resources through replication. In the particular case of the CADAC, our resource gpfs-wan-hpss ties GPFS-WAN and HPSS together, so that replication is done using a single flag, -a.
$ Sput -a DD0003.tar.gz
This tells the SRB to put the file onto GPFS-WAN, while simultaneously making a copy in HPSS. In a long listing of the /CADAC/lca SRB directory, the third column shows which physical resource each file resides in. For the data set I just placed there, we see it is in both GPFS-WAN and HPSS.
$ Sls -l
/CADAC/lca:
  rick 0 gpfs-wan-cadac-125 140450 2007-08-09-22.56 % DD0003.tar.gz
  rick 1 z-hpss-sdsc 140450 2007-08-09-22.56 % DD0003.tar.gz
  rick 0 gpfs-wan-cadac-125 165 2007-08-09-21.41 % README
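A quick way to see where a given file is replicated is to filter the long listing on its third column. This is a sketch: srb_resources is a hypothetical helper name, and it assumes the listing layout shown above (resource in the third column, file name last) and file names without spaces.

```shell
# List the physical resources holding a file, one per line, from an
# "Sls -l" listing read on stdin. "srb_resources" is a hypothetical
# helper name.
# Usage: Sls -l | srb_resources DD0003.tar.gz
srb_resources() {
    awk -v name="$1" 'NF > 3 && $NF == name { print $3 }'
}
```

For the listing above, this would print gpfs-wan-cadac-125 and z-hpss-sdsc for DD0003.tar.gz, and only gpfs-wan-cadac-125 for README.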
Changing the file (using Sput to replace it) will cause the SRB to update all of the replicas. Data can also be replicated between resources using the Sreplicate command, and the -S flag to Sput will cause a file to be sent to a specific resource.