Katana

Katana is a shared computational cluster located on campus at UNSW that has been designed to provide easy access to computational resources. With over 3,000 CPU cores spread across a large number of compute nodes, each with up to 1 TB of memory, Katana provides a flexible compute environment where users can run jobs that wouldn't be possible or practical on their desktop or laptop. For details of the compute nodes, see the compute node information section below.

Katana can serve as a training or development area before moving on to a more powerful system such as Australia's peak HPC system, Raijin, located at the National Computational Infrastructure (NCI). It is also a good starting point if you are uncertain whether High Performance Computing is for you: with local support, a varied node mix, long job runtimes and a wide range of software (with a special focus on the biosciences), it may be the ideal home for your research pipeline.

Access

Katana has two levels of access:

General access

Anyone at UNSW can apply for an entry-level account on Katana. This is designed for groups that think Katana may suit their research needs or that will typically use less than 10,000 CPU hours a quarter. Those at this level still get the same level of support, including software installation and help getting started running their jobs. The only difference is the number of compute jobs that can be run at any time and how long they can run for.

Member access

Faculties, schools or research groups that have bought into Katana by purchasing one or more computing nodes get a higher level of access that is proportional to their investment in the system.

The current list is:

  • UNSW Business School (School of Economics)
  • UNSW Business School (School of Banking and Finance)
  • School of Mathematics and Statistics
  • School of Aviation
  • School of Chemical Sciences
  • School of Biological, Earth and Environmental Sciences (BEES)
  • School of Physics
  • School of Biotechnology and Biomolecular Sciences (BABS)
  • Centre of Excellence in Population Ageing Research (CEPAR)
  • Climate Change Research Centre (CCRC)
  • Connected Waters Initiative (CWI)
  • Faculty of Engineering
  • School of Mechanical and Manufacturing Engineering

 

Account Application Process

To apply for an account send an email to the UNSW IT Service Centre (ITServiceCentre@unsw.edu.au) giving your zID, your role within UNSW and the name of your supervisor or head of your research group.

Connecting to Katana

Once you have an account on Katana you can log in using the instructions in this section or those in your introductory email.

Note: When you connect to Katana via katana.restech.unsw.edu.au you are connecting to one of two login nodes, katana1.restech.unsw.edu.au and katana2.restech.unsw.edu.au. If it is important that you connect to the same login node each time, replace katana.restech.unsw.edu.au with one of those addresses in the instructions below.

Linux and Mac

From a Linux or Mac OS machine this can be done as follows:

desktop:~$ ssh z1234567@katana.restech.unsw.edu.au

Windows

From a Windows machine an SSH client such as PuTTY is required. Once you have downloaded PuTTY, open it, make sure that SSH is selected and enter the host name of the cluster so that it looks like the image below.

 

PuTTY settings for Katana

 

You should also type "Katana" under "Saved Sessions" and click "Save" so that your connection settings are kept for next time. Now click "Open" and you will get a security alert. Accept the host key and you will be asked for your username and password (zID and zPass).

SSH Issues

With all networks there is a limit to how long a connection between two computers will stay open if no data is travelling between them. This can cause problems when you are connected to Katana to run interactive jobs or even if you step away from your computer.

Windows

The solution to this problem is to set the SSH keepalive variable to 60 seconds as shown in the PuTTY configuration image below.

 

PuTTY keepalive settings for Katana

Linux / Mac

If you use a Linux or Mac computer you can achieve the same effect by creating a file ~/.ssh/config containing the following lines:

Host *
  ServerAliveInterval 60

Keeping things running while you disconnect

To make sure that your commands keep running even if you are disconnected, you should use the 'screen' command. Start a new screen session with screen -S followed by a name, for example by typing

z1234567@kdm.restech.unsw.edu.au:~$ screen -S zID

and then you can run the commands that you usually do.

At any time you can detach the screen by typing Control a then Control d and log out. When you log back in you can check your progress by typing

z1234567@kdm.restech.unsw.edu.au:~$ screen -R

to re-attach the screen.

When you are finished with the screen, close it in the same way that you would log out (for example, by typing exit).

Note: If you use the screen command on the login nodes you will need to take note of which login node you are connected to (katana1.restech.unsw.edu.au or katana2.restech.unsw.edu.au) so that you can return to the correct node after you have disconnected.

Copying files in and out

More information about file storage is available in the storage section of the web site, but the easiest way to copy files and data to and from Katana is to use FileZilla (https://filezilla-project.org), which provides a graphical way to copy files and even edit them in situ.
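
If you prefer the command line, standard tools such as scp also work. A minimal sketch from a Linux or Mac machine, where the file and directory names are placeholders (the Katana Data Mover described in the glossary may be a better host for large transfers):

desktop:~$ scp results.csv z1234567@katana.restech.unsw.edu.au:~/
desktop:~$ scp -r z1234567@katana.restech.unsw.edu.au:~/project .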

Graphical sessions

If you have connected from a Linux machine (or a Mac with X11 support via X11.app or XQuartz) then connecting via SSH will allow you to open graphical applications from the command line. To run these programs you should start an interactive job on one of the compute nodes so that none of the computational processing takes place on the head node.
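
Since Katana uses the Torque resource manager, an interactive session with X11 forwarding can normally be requested with qsub -I -X; the resource values below are illustrative only and may need adjusting for your work:

z1234567@katana1.restech.unsw.edu.au:~$ qsub -I -X -l nodes=1:ppn=1,walltime=2:00:00

Once the job starts you are given a shell on a compute node, and graphical programs launched there will display on your desktop.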

If you require an interactive graphical session to Katana then you can use X2Go which is available from http://wiki.x2go.org/doku.php. Download and install the version of the X2Go client that matches your operating system. Then start X2Go and create a session for Katana. The details that you need to enter for the session are:

  • Session name: Katana
  • Host: katana.restech.unsw.edu.au
  • Login: zID
  • Session type: Mate

X2Go settings for connecting to Katana

Note: If you want to be able to disconnect and return to your session you will need to use one of katana1.restech.unsw.edu.au or katana2.restech.unsw.edu.au as the host, to make sure that you return to the same server each time.

Once you have created the session you can then click on it to connect to Katana.

Note: If you use X2Go from a Mac then you may get the following errors:

  1. SSH daemon failed to open the application's public host key.
  2. Connection failed Cannot open file -

This happens because of missing SSH key files on the Mac client. To force the Mac to generate these keys, log in over SSH from a Windows computer using PuTTY (or a Linux computer using SSH), which will generate the missing SSH key files.

Note: The usability of a graphical connection to Katana is highly dependent on network latency and performance.

Katana Compute Node Information

For information about the current composition of Katana compute nodes please contact us.

Expanding Katana

Katana has significant potential for further expansion. It offers a simple and cost-effective way for research groups to invest in a powerful computing facility and take advantage of the economies that come with joining a system with existing infrastructure. A sophisticated job scheduler ensures that users always receive a fair share of the compute resources that is at least commensurate with their research group’s investment in the cluster. For more information please contact us.

Acknowledging Katana

If you use Katana for calculations that result in a publication then you should add the following text to your work.

This research includes computations using the computational cluster Katana supported by Research Technology Services at UNSW Sydney.


If you are using nodes that have been purchased using an external funding source you should also acknowledge the source of those funds.

For information about acknowledging ARC funding see http://www.arc.gov.au/about_arc/acknowledgementform.htm

Your School or Research Group may also have policies for compute nodes that they have purchased.

Facilities external to UNSW

If you are using facilities at Intersect and NCI in addition to Katana they may also require some form of acknowledgement:

In particular you should look at http://www.intersect.org.au/attribution-policy and http://nf.nci.org.au/policies/nf_usage_policy.php.

Katana Tips and FAQ

There are a number of things you can do to make the most of Katana. This section lists some of them, along with answers to some questions that have been asked about Katana in the past.

Keep your jobs under 12 hours if possible

If you request more than 12 hours of walltime, your job can only use the nodes bought by your school or research group, or by the Faculty of Science. Keeping your job's walltime request under 12 hours means that it can run on any node in the cluster.
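
In a Torque job script the walltime is requested with a directive like the one below; 11 hours 59 minutes is just an example of a request that stays under the 12 hour threshold:

#PBS -l walltime=11:59:00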

Two 10 hour jobs will probably finish sooner than one 20 hour job

In fact, if there is spare capacity on Katana, which there is most of the time, six 10 hour jobs will finish before a single 20 hour job will.

Requesting more resources for your job decreases the places that the job can run

The most obvious example is going over the 12 hour limit, which reduces the number of compute nodes your job can run on, but it is worth remembering that every extra requirement has the same effect. For example, specifying the CPU type in your job script restricts you to the nodes with that CPU. Similarly, a job that requests 20GB of memory can run on a 128GB node that already has a 100GB job on it, but a 30GB job cannot.
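
As a sketch, a modest resource request in a Torque job script might look like this (the figures are illustrative only):

#PBS -l nodes=1:ppn=1
#PBS -l mem=20gb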

Running your jobs interactively makes it hard to manage multiple concurrent jobs

If you are currently only running jobs interactively, consider moving to batch jobs, which allow you to submit many jobs that then start, run and finish automatically.
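
A minimal Torque batch job script and its submission might look like the sketch below; the job name, resource requests and the program being run are all placeholders:

#!/bin/bash
#PBS -N myjob
#PBS -l nodes=1:ppn=1
#PBS -l walltime=10:00:00
#PBS -l mem=8gb
#PBS -j oe

# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
./my_program input.dat

Save this as, say, myjob.pbs and submit it with qsub:

z1234567@katana1.restech.unsw.edu.au:~$ qsub myjob.pbs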

If you have multiple batch jobs that are almost identical then you should consider using array jobs

If your batch jobs are the same except for a change in file name or another variable then you should have a look at using array jobs.
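
With Torque an array job is requested with the -t option, and each sub-job receives its index in the PBS_ARRAYID environment variable. A sketch assuming input files named input1.dat, input2.dat and so on:

#PBS -t 1-10

# Each of the 10 sub-jobs processes a different input file
cd $PBS_O_WORKDIR
./my_program input${PBS_ARRAYID}.dat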

If you want to use software that is not already installed

Look at the software page to find out about the module command, as the software may only need to be loaded. If it is not already installed, send an email to the UNSW IT Service Centre (ITServiceCentre@unsw.edu.au) asking for it to be installed.
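
A quick way to check from the command line (the module name below is only an example; the names available on Katana may differ):

z1234567@katana1.restech.unsw.edu.au:~$ module avail         # list the software that is installed
z1234567@katana1.restech.unsw.edu.au:~$ module load python   # load a module for this session
z1234567@katana1.restech.unsw.edu.au:~$ module list          # show the modules currently loaded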

What is the best way of getting help or contacting the HPC team?

The best way to get help or ask a question of the UNSW Research Technology Services team is to email the UNSW IT Service Centre (ITServiceCentre@unsw.edu.au). It is never a good idea to contact a member of the team directly, as that person may be on leave or not the best person to deal with your request.

Where is the best place to store my code?

The best way to store source code is in a version control system. This means that you can keep every version of your code and revert to an earlier version if you need to.
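
For example, with git the workflow might look like the sketch below, assuming you have a repository on a hosting service such as GitHub (the URL and repository name are placeholders):

desktop:~$ git clone https://github.com/z1234567/my-analysis.git
desktop:~$ cd my-analysis
# edit your code, then record and publish the change
desktop:~$ git commit -am "Describe the change"
desktop:~$ git push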

I just got some money from a grant. What can I spend it on?

There are a number of different options for using research funding to improve your ability to run computationally intensive programs. The best starting point is to speak to the Research Technology Services team to figure out the different options.

Can I access Katana from outside UNSW?

If you have an account you can connect to the cluster from anywhere. If you are using Windows, download an SSH client such as PuTTY and then connect directly to the cluster. More details are in the Connecting to Katana section above.

Katana glossary

Whilst using this site you will come across a number of terms that you may not be familiar with. Here is a glossary to help you.


Cluster: A High Performance Computer composed of multiple computers where jobs are farmed out from the login or head node.
Head Node: The head node of the cluster is the computer that you log in to when you connect to the cluster. This node is used to compile software and submit jobs.
Storage Node: To reduce the load on the head node, cluster home directories and global scratch are attached to a storage node which is accessible from anywhere in the cluster. This means that compute jobs can talk directly to the storage node without consuming any resources on the head node.
Data Transfer Node: The Data Transfer Node, also known as the Katana Data Mover (KDM), is a server used for transferring files to, from and within the cluster. By using KDM, the network traffic is offloaded from the Katana head node.
Compute Node: The compute nodes are where the compute jobs run. Jobs are submitted from the head node and assigned to compute nodes by the job scheduler.
Blade: The compute nodes in Katana are blade servers, which allow a higher density of servers in the same space. Each blade consists of multiple CPUs, each with 6 or more cores.
CPU Core: Each node in the cluster has one or more CPUs, each of which has 6 or more cores. Each core is able to run one job at a time, so a node with 12 cores could have 12 jobs running in parallel.
MPI: MPI (Message Passing Interface) is a technology for running compute jobs on more than one node. It is designed for situations where parts of the job can run on independent nodes, with the results being transferred to other nodes for the next part of the job to be run.
Module: The module command is a means of providing access to different versions of software without risking version conflicts.
Job Script: A job script is a file containing all of the information needed to run a job, including the resource requirements and the actual commands to run the job.
Job Scheduler: The job scheduler monitors the jobs currently running on the cluster and assigns waiting jobs to nodes based on recent cluster usage, job resource requirements and the nodes available to the research group of the submitter. In summary, the job scheduler determines when and where a job should run. The job scheduler that we use is called Maui.
Resource Manager: A resource manager does everything to do with running jobs on a cluster apart from scheduling them. Amongst other tasks it receives and parses job submissions, starts jobs on compute nodes, monitors jobs, kills jobs, transfers files, etc. The resource manager that we use is called Torque and it is based on an older resource manager called PBS.
Scratch Space: Scratch space is a non backed up storage area where users can store transient data. It should not be used for job code as it is not backed up.
Local Scratch: Local scratch refers to the storage available internally on each compute node. Of all the different scratch directories this storage has the best performance; however, you will need to move your data into local scratch as part of your job script.
Global Scratch: Global scratch differs from local scratch in that it is available from every node, including the head node. If you have data files or working directories, this is where you should put them.
Network Drive: A network drive is a drive that is independent from the cluster. In our case the UNSW "H-Drive", the CCRC drives and some other shared drives are available by running the "network" command. Jobs should NEVER be run directly off the H-Drive, for performance and reliability reasons.
Interactive Job: An interactive job is a way of testing your program and data on a cluster. You request a terminal on one of the compute nodes and then load and run files from the command line. Once you have your job working well interactively you should turn it into a batch job.
Batch Job: A batch job is a job on a cluster that runs without any further input once it has been submitted. Almost all jobs on the cluster are batch jobs.
Array Job: If you want to run the same job multiple times with only a handful of variables (filename, etc.) changing, you can create an array job which will submit multiple jobs for you from the one job script.
Environment Variables: Environment variables are variables that are set in Linux to tell applications where to find programs and to set program options. Information on setting environment variables and adding modules to your .bashrc so that they are available each time you log in is available here.
Active Job: When you look at the job list using showq, active jobs are jobs that have been assigned to a compute node and are currently running.
Idle Job: When you look at the job list using showq, idle jobs are eligible to run but are waiting for a compute node that matches their requirements to become available. Which idle job will be assigned to a compute node next depends on the scheduler.
Blocked Job: When you look at the job list using showq, blocked jobs are jobs that cannot currently run due to a policy limitation on the system, such as a restriction on the number of cores that can be used by the same person. Jobs stay blocked until the limit is no longer exceeded, at which point the job will be reclassified as an idle job and will then wait for the scheduler to assign it to a compute node.

 
