ADASS XII Conference


Data Management


O7.1 Data Management for the VO (Invited)

Patrick Dowler (CADC)

The Canadian Astronomy Data Centre has developed a general purpose scientific data warehouse system and an API for accessing it.

The Catalog API defines a general mechanism for exploring and querying scientific content using a constraint-based design. The API provides access to separate but related catalogs and allows for entries in one catalog to be related to (usually derived from) entries in another catalog. The purpose of the API is to provide storage-neutral and content-neutral access methods to scientific data. The API defines a network-accessible Jini service.
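
As an illustration of the constraint-based style described above, the following Java sketch shows how such a storage-neutral query interface might look; the type and method names here are hypothetical and are not taken from the CADC Catalog API.

    import java.util.List;

    // Hypothetical types, sketched only to illustrate the constraint-based idea.
    interface Entry {
        double getDouble(String column);                 // read one column of a catalog entry
    }

    interface Constraint {
        boolean matches(Entry e);                        // does this entry satisfy the constraint?
    }

    interface Catalog {
        List<Entry> query(List<Constraint> constraints); // storage- and content-neutral query
    }

    class ConstraintExamples {
        // A range constraint on right ascension, expressed as a lambda.
        static Constraint raBetween(double lo, double hi) {
            return e -> e.getDouble("ra") >= lo && e.getDouble("ra") <= hi;
        }
    }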

We have implemented several instances of the warehouse to store related catalogs: the pixel catalog provides uniform access to our many archival data holdings, the source catalog stores the results of image analysis, and the processing catalog stores metadata describing exactly how sources are extracted from pixel data so that all results are reproducible. Thus, entries in the source catalog are connected to entries in the processing and pixel catalogs from which they are derived.
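
The relationship between the catalogs can be pictured, very roughly, as provenance references carried by each source entry; the record names below are invented for illustration and do not reflect the actual warehouse schema.

    // Hypothetical records showing how a source entry could point back to
    // the pixel data and processing recipe it was derived from.
    record PixelRef(String archiveFileId) {}
    record ProcessingRef(String recipeId, String recipeVersion) {}
    record SourceEntry(String sourceId, PixelRef pixels, ProcessingRef processing) {}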

O7.2 The NRAO End-to-End (e2e) Project

Tim Cornwell, John Benson, Boyd Waters, Honglin Ye (NRAO)

The NRAO End-to-End (e2e) project has the goal of providing automated, streamlined handling of radio observations on NRAO telescopes all the way from proposal submission to archive access. Thus e2e will ease the use of NRAO telescopes for both expert radio astronomers and novices. The latter is particularly important in attracting new people to the use of NRAO telescopes. E2e must include new capabilities in the areas of proposal submission and handling, scripting of observations, scheduling (both conventional and dynamic), pipeline processing, and archive access. The project was initiated in July 2001 and has just completed the first cycle of development. To track and minimize the risk in our software development, we have chosen to adopt a spiral model, whereby a complete development cycle (from inception to testing and deployment) is completed in nine months and then repeated, applying the lessons of each cycle to the next. We expect to complete seven such cycles in the project, delivering new capabilities with each cycle.

The resources available are limited, placing a premium on careful costing, planning, and scheduling, as well as on reuse. We are endeavoring to reuse as much as possible, and so much of our work has been based on AIPS++. With this approach, a prototype archive has been completed with about 1 FTE-year of effort. We are placing an emphasis on early and frequent deployment, so the archive prototype will be deployed for use with the VLA later this year, with deployment for the GBT and VLBA planned for 2003. In the area of database access and presentation, we have developed a Calibrator Source tool that astronomers can use to find suitable calibrator sources for synthesis observations. This, too, will be deployed later this year.

O7.3 Data Organization in the SDSS Data Release 1

Ani Thakar, Alex Szalay (JHU), Jim Gray (Microsoft BARC), Chris Stoughton (FNAL)

The first official public data release from the Sloan Digital Sky Survey (Data Release 1 or DR1) is scheduled for January 2003. Due to the unprecedented size and complexity of the datasets involved, we face unique challenges in organizing and distributing the data to a large user community. We discuss the data organization, the archive loading and backup strategy, and the data mining tools that we plan to offer to the public and the astronomical community, in the overall context of large databases and the VO.

It was originally thought that the catalog data would be a fraction of the size of the raw data, which is expected to be several terabytes. However, with the multiple versions and data products of the catalog data that will be maintained and distributed simultaneously, it now appears that the size of the catalog data will be comparable to that of the raw data, and organizing and loading it will be a daunting task.

The DR1 archive will be organized in multiple Microsoft SQL Server relational databases residing on a Windows cluster and logically linked to each other. There will be two calibrations (reruns) of the primary dataset available at any given time: the "target" rerun, from which the spectroscopic targets were selected, and the "best" rerun, which is usually the latest-greatest rerun. The third dataset will be the spectra. In addition to the live datasets, there will be hot spares and offline backups, and a legacy database will preserve all versions of the data served to date.
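
One simple way to picture the logically linked layout is a registry that maps each dataset role to its own database; the enum values and connection strings below are assumptions for illustration only, not the actual DR1 configuration.

    import java.util.Map;

    // Hypothetical mapping of dataset roles to separate SQL Server databases.
    enum DatasetRole { TARGET, BEST, SPECTRO }

    class Dr1Registry {
        static final Map<DatasetRole, String> JDBC_URLS = Map.of(
            DatasetRole.TARGET,  "jdbc:sqlserver://host1;databaseName=TargDR1",
            DatasetRole.BEST,    "jdbc:sqlserver://host2;databaseName=BestDR1",
            DatasetRole.SPECTRO, "jdbc:sqlserver://host3;databaseName=SpecDR1"
        );
    }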

The raw data is stored at Fermilab on a Linux cluster, so it must be loaded across a Linux/Windows interface. We have attempted to automate the loading and validation process as much as possible using a combination of Perl scripts on the Linux side and VB scripts and DTS packages on the Windows side. Each step of the loading and validation process is logged in a log database. A separate poster discusses the SDSS DR1 storage configuration.
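
The step-by-step logging might look something like the following JDBC sketch; the table and column names are assumed for illustration and are not the actual load-log schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    class LoadLogger {
        // Record one loading/validation step in a (hypothetical) log table.
        static void logStep(String jdbcUrl, String step, String status) throws SQLException {
            String sql = "INSERT INTO LoadLog (step, status, logged_at) VALUES (?, ?, ?)";
            try (Connection c = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = c.prepareStatement(sql)) {
                ps.setString(1, step);
                ps.setString(2, status);
                ps.setTimestamp(3, new Timestamp(System.currentTimeMillis()));
                ps.executeUpdate();
            }
        }
    }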

To facilitate data mining in the DR1 archive, we have a variety of interfaces available that allow users to run sophisticated SQL queries on the datasets as well as browse the data using web-based explore and navigation tools. Additionally, we have built a Hierarchical Triangular Mesh (HTM) spatial index into the SQL Server databases for fast spatial lookups and constructed a neighbors table for fast nearest-neighbor searches.
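
A nearest-neighbor lookup against the precomputed neighbors table could then be phrased as a simple SQL query; the sketch below uses standard JDBC, with the table and column names assumed for illustration rather than taken from the DR1 schema.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    class NeighborSearch {
        // Find neighbors of one object within a given radius using a
        // precomputed neighbors table (names assumed for illustration).
        static void printNeighbors(String jdbcUrl, long objId, double radiusArcmin)
                throws SQLException {
            String sql = "SELECT neighborObjID, distance FROM Neighbors "
                       + "WHERE objID = ? AND distance < ?";
            try (Connection c = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = c.prepareStatement(sql)) {
                ps.setLong(1, objId);
                ps.setDouble(2, radiusArcmin);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getLong(1) + " " + rs.getDouble(2));
                    }
                }
            }
        }
    }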

O7.4 HDX Data Model - FITS, NDF and XML implementation

David Giaretta, Mark Taylor, Peter Draper, Norman Gray, Brian McIlwrath (Starlink)

A highly adaptable data model, HDX, based on the concepts embodied in FITS, various proposed XML-based formats, and Starlink's NDF and HDS, will be described, together with the Java software that has been developed to support it. This follows on from the presentation given at ADASS 2001.

The aim is to provide a flexible model which is compatible with FITS, can be extended to accommodate VO requirements, and yet maintains enough mandatory structure to make application-level interoperability relatively easy. The implementation provides HDX factories and lower-level data access classes that allow a great deal of flexibility; in particular, single FITS files can be regarded as HDX files, as can complex structures made up of XML, FITS, and HDS components. It can also deal with large, distributed datasets.
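
The factory idea can be sketched as follows; the interface and method names are invented for illustration and do not reproduce the Starlink HDX API.

    import java.net.URL;

    // Hypothetical factory: resolve a URL (plain FITS, XML wrapper, or HDS
    // component structure) into a uniform HDX view.
    interface HdxView {
        String getTitle();                 // example of a common, mandatory item
    }

    interface HdxFactory {
        // Returns an HDX view regardless of whether the resource is a single
        // FITS file or a composite XML/FITS/HDS structure.
        HdxView open(URL resource);
    }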

