For the past few years I have been involved with developing and maintaining a system for forecasting near-shore waves. Our team has just received a significant grant for further development and as a result we are taking the opportunity to refactor many components of the old system.

We will also be receiving a new server to run the model and so I am taking this opportunity to consider how we set up the system. Basically, the steps that need to happen are:

  1. Some standard packages and libraries such as compilers and databases need to be downloaded and installed.

  2. Some custom scientific models need to be downloaded and compiled from source as they are not commonly provided as packages.

  3. New users need to be created to manage the databases and run the models.

  4. A suite of scripts that manage model-database interaction needs to be checked out from source code control and installed.

  5. Crontabs need to be set up to run the scripts at regular intervals in order to generate forecasts.
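As a rough illustration only, the five steps above could be written down as an ordered, dry-run command plan before picking a tool. Everything specific here is an assumption and not part of our actual setup: a Debian-style `apt-get`, models that build with `make`, scripts kept in git, and all package, user and path names are hypothetical placeholders. The sketch only builds and prints commands; it does not execute anything.

```python
import shlex
from typing import List, Sequence


def build_plan(
    packages: Sequence[str],
    model_dirs: Sequence[str],
    users: Sequence[str],
    scripts_repo: str,
    scripts_dest: str,
    cron_user: str,
    crontab_file: str,
) -> List[List[str]]:
    """Return the setup steps, in order, as argument lists (nothing is run)."""
    plan: List[List[str]] = []
    # Step 1: standard packages (assumes a Debian-style package manager).
    if packages:
        plan.append(["apt-get", "install", "-y", *packages])
    # Step 2: custom models built from source (assumes a Makefile in each dir).
    for directory in model_dirs:
        plan.append(["make", "-C", directory])
        plan.append(["make", "-C", directory, "install"])
    # Step 3: users that own the databases and run the models.
    for user in users:
        plan.append(["useradd", "-m", user])
    # Step 4: management scripts from source control (assumes git).
    plan.append(["git", "clone", scripts_repo, scripts_dest])
    # Step 5: install a prepared crontab file for the forecast user.
    plan.append(["crontab", "-u", cron_user, crontab_file])
    return plan


def render(plan: List[List[str]]) -> str:
    """Format the plan as shell-quoted lines for review."""
    return "\n".join(shlex.join(command) for command in plan)


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    print(render(build_plan(
        packages=["gfortran", "postgresql"],
        model_dirs=["/opt/src/wave-model"],
        users=["forecast"],
        scripts_repo="/media/usb/forecast-scripts.git",
        scripts_dest="/opt/forecast-scripts",
        cron_user="forecast",
        crontab_file="/opt/forecast-scripts/forecast.crontab",
    )))
```

Keeping the plan as data makes it easy to review the whole sequence, or to hand the same list to whichever tool ends up running it.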

I have been pondering applying tools such as Puppet, Capistrano or Fabric to automate the above steps. Most of the above functionality seems perfectly possible to implement, but there are a couple of use cases that I am wondering about:

  • During my preliminary research, I have found few examples and little discussion on how to use these systems to abstract and automate the process of building custom components from source.

  • We may have to deploy on machines that are isolated from the Internet, i.e. all configuration and setup files will have to come in on a USB key that can be inserted into a terminal that can connect to the server that will run the models.

I see this as an opportunity to learn a new tool that will help me automate my workflow, but I am unsure which tool I should start with. If any member of the community could suggest a tool that would support the above workflow and the issues specific to scientific computing, I would be very grateful.

Our production server will be running Linux, but support for OS X would be a bonus as it would allow the development team to set up test installations outside of VirtualBox.

There might be good information from people on here, but you might also want to ask the scientific community, as there are typically talks and posters on this at the fall AGU meeting each year. Much of it deals with workflow management for earth science data, which might have some stricter requirements than forecasting efforts.

I know that I saw a few presentations on this at the last meeting, but the AGU's new abstract system absolutely sucks for trying to go through large numbers of abstracts and/or to browse by discipline. (and sub discipline? not a chance).

Some of the folks were using workflow management systems (e.g. Kepler and Taverna), but I don't think they got into the system aspects of provenance nearly as much as the grid and compute cluster folks did. Even the earth science folks, who seemed to be taking provenance more seriously than other fields, still seemed focused on data inputs more than what type of processor / OS / versions of libraries installed / etc.

The terms used to describe the field are all over the place -- I've seen it referred to as 'cyber infrastructure' (mostly an NSF thing), 'science informatics', etc. Sorry I can't be more specific, as this isn't quite my field. (complaining about the lack of documentation for this sort of thing, yes, but I deal with serving data well after it's been generated).

  • I've played with Kepler and Taverna a little bit, but I don't think they address the problem domain I am considering here. I'm looking for ways to use configuration management tools to automatically and repeatably set up servers to run my simulation- it seems like deploying a workflow management tool would follow after that step. Thanks for taking the time to answer though! – Sharpie Mar 30 '10 at 02:57
  • @Sharpie -- you're right, they don't, sorry if I made it sound like they do. If you're really looking for reproducibility and being able to swap things out, follow the cluster/grid folks, and use VMs so you can just pick up your config and move it. I don't have a good way of automating the setup of the environment, though. (and for some reason, people thought I was crazy when I insisted on being able to script the install of software ... until about the 6th time I had to reinstall it, and I knew I hadn't missed any steps.) – Joe H. Mar 30 '10 at 03:28
  • Aye, I suspect some of us will be using VMs as dev environments- but the production model will have to run on a true installation as we're actually targeting more of a "now-cast" than a forecast. Regardless, a VM would still have to be configured just the same as a true OS. And I hear you on the reinstalls- we mangled the system libraries on the old dev server a couple of times and had to reinstall. Cost us a few days each time- this is part of my motivation for trying to automate the configuration process. – Sharpie Mar 30 '10 at 03:32
  • @Sharpie -- I know there's overhead to VMs, but it'd mean that you could take multiple people's dev environments, and just throw them up on the hardware -- the only problem is if they're developing models that need to interact with each other. You might try asking the question on the AGU's Earth and Space Science Informatics mailing list, or check to see if GEONgrid has a mailing list, as you might manage to find other people who are dealing with this. – Joe H. Mar 30 '10 at 13:28

What Linux distribution are you using, and what software are you talking about that is "not commonly provided as packages"? It seems to me one way to make automating this step easy is to fix the root problem and get packages made up! This can take a little fiddling, but it makes future admin work much easier for yourself as well as the community.

Most package managers are just a series of scripts to sanitize the management of software. In this role they are very adept at scripting the compilation, distribution, installation and upgrading of software. Even if you are not interested in being involved enough to get your software moved upstream, most distributions have ways of making overlays or add-on repositories of your own software, and these should be portable.

Your other steps are really basic: each is just a few lines of script code, and they will all come together. The scripts for each step and the ones that string them all together shouldn't be more than a few dozen lines each. The scripts can easily be made to switch between local and Internet sources.
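To make the local-versus-Internet switch concrete, here is a minimal sketch in Python. The function name `resolve_source`, the file name, the mount point and the `example.invalid` URL are all made up for illustration. It assumes the offline copy (say, a mounted USB key) mirrors the file names used on the download server.

```python
from pathlib import Path
from typing import Optional


def resolve_source(filename: str, local_root: Optional[Path], base_url: str) -> str:
    """Prefer a file under local_root; otherwise fall back to a download URL."""
    if local_root is not None:
        candidate = Path(local_root) / filename
        if candidate.is_file():
            # Offline case: the file was shipped on local media.
            return str(candidate)
    # Online case: build the URL without doubling the slash.
    return base_url.rstrip("/") + "/" + filename


if __name__ == "__main__":
    # Hypothetical values: a USB key mounted at /media/usb and a placeholder mirror.
    print(resolve_source("wave-model.tar.gz", Path("/media/usb/sources"),
                         "http://example.invalid/sources"))
```

The rest of a build script can then treat the returned string as "where to get the tarball" without caring whether the machine has network access.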

