Corral User Guide

Corral is a collection of storage and data management resources primarily located at TACC, with 11 petabytes (PB) of storage installed in the UT data centers at TACC and in Arlington, and an additional 5PB of unreplicated storage for low-latency applications. Approximately 32 server nodes are associated with the storage and provide a variety of data movement, access, and management services, including serving web and database applications. Corral services provide high-reliability, high-performance storage for research requiring persistent access to large quantities of structured or unstructured data. Such data could include data used in analysis or other computational and visualization tasks on other TACC resources, as well as data used in collaborations involving many researchers who need to share large amounts of data.

PIs may request any quantity of storage across multiple allocations. The first 5TB of storage for each PI, regardless of the number of allocations, is available to researchers at all UT System institutions at no cost. For storage needs larger than 5TB, access to Corral is available at a cost of $118/TB/year. This policy is subject to change and users are encouraged to plan for the costs of storing their data in future years.

There are two primary mechanisms for storing and managing data on Corral: simple file storage ("file system access") and the iRODS data management service. File system access is appropriate for users who will make use of their data on Lonestar or other TACC systems and are comfortable with the UNIX command line, while the iRODS service is appropriate for users who have complex metadata associated with their data, users who wish to manage their data using web and/or GUI interfaces, and users who wish to develop a data collection for open web sharing. Both services allow for both replicated and unreplicated storage, and both services should have similar reliability and performance characteristics. Replication in this context means that data is synchronously stored in both the TACC and Arlington storage installations, resulting in two copies of all data and metadata, while unreplicated means that only one storage site is used, and only one copy of the data and metadata is stored within the system. Users who will primarily be reading and writing data from Lonestar should choose the unreplicated TACC file system, /corral-tacc, as this will provide the best performance on Lonestar. Users with more general data storage and management needs should choose the replicated file system, /corral-repl.

In addition, there are a variety of more specialized data management and sharing applications that may be supported on Corral, including open-source databases such as Postgres and MySQL, discipline- or data-specific web applications, and tools for generating metadata or other derivative file formats. Users are encouraged to contact the Data Management and Collections group at TACC (data@tacc.utexas.edu) to discuss specialized needs or to ask about specific applications.

The Data Management and Collections group at TACC can provide specialized consulting services to help make the best use of Corral, either for existing projects or for planned research tasks. The group can also assist with developing data management plans for research that would incorporate the use of Corral and other TACC resources as part of research and data management workflows. For more information or to inquire about such consulting services, please contact us at data@tacc.utexas.edu.

Corral is available to researchers at all UT System campuses, including both academic and health institutions. Corral is intended to support research activities involving large quantities of data and/or complex data management requirements. There is no requirement that users have allocations on other TACC systems, and Corral can be utilized independently of TACC computational and visualization resources. All Corral users must have TACC accounts; if you do not yet have a TACC account, you can create one on the TACC User Portal.

You may request an allocation on Corral through the TACC User Portal. When requesting an allocation, indicate the quantity of storage you expect to utilize in terabytes, the nature of the research project that will be supported through the use of Corral, and the service or services you expect to utilize. It is also helpful if you provide a suggested name for the directory or a collection name under which your data will be stored on Corral. If you know whether you wish your data to be replicated, you may indicate that as well; unless you request otherwise, all data on Corral will be replicated. Once your allocation has been granted, you will receive an e-mail indicating the location of your data within iRODS and/or directly in the file system accessible from the Corral login/data movement nodes.

The access mechanism you will use will be based on the specific service or services you request, and could include either or both command-line and graphical tools. Basic command-line access through SSH is provided on the login node: data.tacc.utexas.edu. This node is not suitable for significant computational or analysis tasks but is provided for use in transferring and organizing data through command-line utilities.

Users with the basic file system allocation type can directly access their data on Lonestar and the Corral login node. The replicated file system is mounted as /corral-repl and the unreplicated TACC-only file system is mounted as /corral-tacc. The full path to your project directory will be provided to you when your allocation is granted. The file system may be mounted on other systems within TACC at the discretion of the system administrators. Please submit a help ticket if you have questions about whether a system has Corral mounted or wish to request that it be mounted.

Note that while the Corral file systems are mounted on Lonestar compute nodes, the local Lustre SCRATCH and WORK file systems may provide better performance for compute jobs, and users are encouraged to incorporate staging of their data to and from the SCRATCH file systems in particular as part of their job scripts.

Corral storage can be used for data subject to special security controls, such as HIPAA Protected Health Information (PHI) and data subject to FERPA controls, but only in controlled circumstances after appropriate review and approval by both TACC and the organization or PI that owns the data. Users with sensitive data are required to contact the administrators through the help system or by sending an e-mail to help@tacc.utexas.edu to discuss their needs before storing such data on Corral.

Corral group quotas will be set to the quantity each project has been allocated. Default quotas are set to 1TB for all groups without an allocation, so it is important to ensure that your data is owned by the correct project group. You can check the Unix group for your project by going to the TACC User Portal and viewing the project details in the Projects and Allocations tab. There are no limitations on the size of files stored on Corral, nor are there limits on the total number of files or the number of files per directory. Please note that some data transfer tools and most web browsers do have limitations on the size of files they can download or upload. Limitations on overall usage are set through quotas on the project group: a "soft quota" provides a warning when you are near your limit, and a "hard limit" prevents the creation of new files once you are over your group's quota, until data is deleted. Quotas can also be set on a per-user basis if project PIs wish to control the usage of individuals within a research group.
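
Because quota accounting follows the project group, it can help to check for files that ended up under a different group. The following is a minimal sketch, not an official TACC tool: the function name list_wrong_group is hypothetical, and the directory and group name are whatever you pass in.

```shell
# Sketch: list everything under a directory whose group is NOT the
# expected project group. The function name is illustrative only.
list_wrong_group() {
    dir=$1
    group=$2
    # "! -group" selects entries that do not belong to the given group.
    find "$dir" ! -group "$group" -print
}
```

For example, "list_wrong_group /path/to/project/directory G-802037" (with your own path and project group substituted) prints nothing when every file is in that group. Files that are listed can usually be corrected by their owner with the standard "chgrp -R" command.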

Users can monitor their Corral allocation usage via the TACC User Portal. Click on "Allocations->Projects and Allocations", then click on the project detail button to examine the storage used in your allocation. If the value differs from your expectations or you have significant amounts of data with an incorrect project group, please submit a support ticket.

Files on Corral are never "purged" using automated processes. Once an allocation has expired, data will typically be retained for 6 months; however, data may be deleted at any time at the discretion of the system administrators unless there is an allocation request pending. Important data should never be stored on only one system, and users are encouraged to maintain a second copy of their most important data on another system at TACC or elsewhere. Users who wish to replicate their data across multiple systems, including the Ranch tape archive, are encouraged to utilize the iRODS data management tool.

Corral now provides a cloud storage interface compatible with the S3 version 4 API, available only by request. If you wish to utilize the S3 interface to Corral, submit a request for the Corral resource as if you were requesting the usual file-based access, but add a note to your allocation request that you wish your storage to be accessible via the S3 API rather than the file system interface. If and when your allocation request is approved, a project-specific endpoint will be created and will be sent to you in lieu of a directory path.

The S3 interface is most suitable for programmatic interaction from within custom applications. However, for command-line client usage, we recommend the "minio" client, with documentation and download links available here: https://docs.minio.io/docs/minio-client-complete-guide.

Please direct any further questions you may have regarding the cloud storage interface to Corral through the TACC ticket system.

Data transfer mechanisms differ depending on whether you are using iRODS or basic file system access. For iRODS users, please see the iRODS user guide. Data transfer to and from the file system is described below.

Data transfer from any Unix/Linux system can be accomplished using the scp utility to copy data to and from the login node. A file can be copied from your local system to the remote server using the command:

localhost$ scp filename username@data.tacc.utexas.edu:/path/to/project/directory

where filename is the path to the file on your local system, username is your TACC user name, and the path is the one provided to you when your allocation was granted. A whole directory can be copied recursively using the "-r" switch:

localhost$ scp -r directory username@data.tacc.utexas.edu:/path/to/project/directory

The scp utility has many options and allows you to provide many defaults and other options in a configuration file to make transfers easier. Type "man scp" at the prompt to get extensive documentation of the options available for transferring files.
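
As one illustration of the configuration-file approach, scp reads the same client configuration file as ssh (by default ~/.ssh/config), so a host alias defined there can shorten transfer commands. The sketch below appends such an entry; the alias "corral" and the function name add_corral_alias are assumptions made for this example, not TACC conventions.

```shell
# Sketch: append a host alias for the Corral login node to an SSH client
# configuration file, which scp also reads. The alias "corral" and the
# function name are hypothetical; pass your own TACC user name.
add_corral_alias() {
    config_file=$1
    tacc_user=$2
    # Create the containing directory if it does not exist yet.
    mkdir -p "$(dirname "$config_file")"
    cat >> "$config_file" <<EOF
Host corral
    HostName data.tacc.utexas.edu
    User $tacc_user
EOF
}
```

After running "add_corral_alias ~/.ssh/config username" once, a transfer could be written as "scp filename corral:/path/to/project/directory".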

If you are performing computational and analysis tasks on Lonestar, and those tasks are I/O intensive, you may achieve improved performance by "staging" data to the Lonestar $WORK or $SCRATCH file systems before running a compute task. This is due to the use of the high-performance network of Lonestar for access to $WORK and $SCRATCH, as opposed to the slightly slower TCP/IP network used for access to Corral. The simplest way to stage a file is to copy it to your $SCRATCH directory before you submit your job:

login1$ cp /path/to/project/directory/filename $SCRATCH/
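
Staging can also be built into a job script so that inputs are copied in before the compute step and results are copied back afterwards. The following is a minimal sketch under stated assumptions: the function names stage_in and stage_out are hypothetical, and all paths are supplied by the caller.

```shell
# Sketch of staging around a compute step. In a real job script the
# scratch directory would normally live under $SCRATCH.

# Copy one input file from Corral into a scratch working directory.
stage_in() {
    corral_file=$1
    scratch_dir=$2
    mkdir -p "$scratch_dir"
    cp "$corral_file" "$scratch_dir/"
}

# Copy one result file from scratch back to a Corral directory.
stage_out() {
    result_file=$1
    corral_dir=$2
    cp "$result_file" "$corral_dir/"
}
```

A job script might call "stage_in /path/to/project/directory/input.dat $SCRATCH/job_directory" before the compute step and "stage_out $SCRATCH/job_directory/output.dat /path/to/project/directory" after it, where the file and directory names are placeholders.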

A wide variety of graphical tools are available that support the secure copy (SCP/SFTP) protocol; you may use whichever tool you prefer, but we recommend the open-source Cyberduck utility for both Mac and Windows users who do not already have a preferred tool. See examples below of configuring the Cyberduck utility for transferring data to TACC. You may use the same parameters in any tool with similar functionality.

Click on the "Open Connection" button in the top right corner of the Cyberduck window to open a connection configuration window (as shown below). Select the SFTP transfer mechanism, and type in the server name "data.tacc.utexas.edu". Add your username and password in the spaces provided, and if the "more options" area is not shown, click the small triangle or button to expand the window; this will allow you to enter the path to your project area so that when Cyberduck opens the connection you will immediately see your data. Then click the "Connect" button to open your connection.

Note that in addition to your account password, you will be prompted for your TACC token value and will need to have the MFA pairing step completed to connect to the system.

Once connected, you can navigate through your remote file hierarchy using familiar graphical navigation techniques. You may also drag-and-drop files into and out of the Cyberduck window to transfer files to and from Corral.

It is crucial that users understand and utilize the available access controls on Corral and other storage systems. If permissions are not explicitly set on files added to this and other systems, the default permissions may allow anyone (or no one) to view that data. This represents a potential threat to the security and confidentiality of users' data, and can lead to additional time and effort later on as changing permissions after the fact can be very time-consuming, particularly in complex hierarchies.

Both files and directories have permissions settings, and it is important to set permissions on both files and directories in order to secure the data and grant access to the right individuals. While TACC makes every effort to ensure the security of our systems and the data they store from unauthorized users, it is your responsibility to ensure that your data is protected from other authorized users of the system, and thus that only the right individuals have access to your data.

With this in mind, it is a good practice to explicitly set the permissions on new data at the end of each upload or data-generation session, using the "chmod" command or the permissions controls in a graphical client such as Cyberduck. This ensures that your data always has the right permissions, and that data is appropriately protected as soon as it is added to the system.

Permissions on files and directories have 3 important categories, for each of which there are 3 levels of access that can be provided. The 3 categories are the owner, the group, and "other" meaning all users of the system. The 3 levels of access are read, write, and "execute" which allows a program to be run in the case of files or allows a directory's contents to be accessed in the case of directories or folders. Typically, users outside of your project group will not be able to access your files unless you explicitly allow them to do so. You can view the permissions for each file in a given directory by typing:

login1$ ls -l

within that directory, or

login1$ ls -l /full/path/of/directory

at any time.

Permissions are shown as a set of three letters for each category (owner, group, and other), as in the following example line of output from "ls -l":

drwxrwxr-x 3 ctjordan G-802037 4096 Mar 6 10:08 mydirectory

In this example, the "d" at the front indicates that this is a directory, and it is readable and writable by the user and anyone in the user's group. Other users on the system can list the directory's contents but cannot write to it.
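
To make the layout of that string concrete, the sketch below splits a ten-character mode string into its type character and the three permission triplets. The function name describe_mode is hypothetical and only illustrates how to read the output; it is not part of any TACC tooling.

```shell
# Sketch: break a mode string from "ls -l" into its parts.
describe_mode() {
    mode=$1
    # Character 1 is the entry type.
    case $(printf '%s\n' "$mode" | cut -c1) in
        d) kind=directory ;;
        -) kind=file ;;
        l) kind=symlink ;;
        *) kind=other ;;
    esac
    # Characters 2-4, 5-7 and 8-10 are the owner, group and other triplets.
    owner=$(printf '%s\n' "$mode" | cut -c2-4)
    group=$(printf '%s\n' "$mode" | cut -c5-7)
    other=$(printf '%s\n' "$mode" | cut -c8-10)
    echo "type=$kind owner=$owner group=$group other=$other"
}
```

Calling "describe_mode drwxrwxr-x" prints "type=directory owner=rwx group=rwx other=r-x", matching the reading of the example line given above.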

For a more fine-grained approach to files and permissions, use Access Control Lists, or ACLs. With ACLs you can create customized groups of users with customized permissions. Please consult TACC's document "Manage Permissions with Access Control Lists" for detailed information.

File permissions can be managed using the "chmod" command from the command-line prompt, and from the permissions window in Cyberduck. The permissions window is shown below, and can be accessed by right-clicking on a file or folder in Cyberduck and selecting "Info ..." from the menu. Other graphical file transfer utilities may have a similar window or panel used to control permissions. The chmod command has a straightforward syntax:

chmod permissions-to-change filename

where permissions-to-change consists of any or all of "u" for user, "g" for group, and "o" for other, followed by a plus, "+", to add permissions or a minus, "-", to remove permissions, followed by the initials of the permissions to add or remove: "r" for read, "w" for write, and "x" for execute. For example, to add read access for all users of the system the command would be:

login1$ chmod o+r filename

There are various shortcuts one can use to apply specific permissions, and the user is encouraged to read the documentation for the chmod command by typing "man chmod" at the command-line prompt. The chown command may also be of interest in understanding permissions, and full documentation can be read using "man chown".
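
As one example of such a shortcut, permissions for a whole directory tree can be set in a single recursive command at the end of an upload session. The sketch below gives the owner read/write access, the group read-only access, and other users no access; whether that policy suits your project is an assumption you should check, and the function name restrict_to_group is hypothetical.

```shell
# Sketch: apply one permission policy to a directory tree.
restrict_to_group() {
    target=$1
    # u=rwX : owner may read and write
    # g=rX  : group may read only
    # o=    : other users get no access
    # Capital X grants execute (search) only on directories and on files
    # that already have an execute bit set, so plain data files are not
    # made executable.
    chmod -R u=rwX,g=rX,o= "$target"
}
```

Running "restrict_to_group /path/to/project/directory/upload" (a placeholder path) followed by "ls -l" shows the resulting permissions, so you can confirm the outcome before sharing the data.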