A User's Guide for the On-the-fly Re-processing (OTFR) System

Document Conventions
Overview
Environment / Dependencies
OTFR Pipeline Stages

GETREQ process
POLxxx stage
LSTxxx stage
ARQxxx stage
WAKxxx stage
RSVxxx stage
MOVxxx stage
COLxxx stage
G2Fxxx stage
RETxxx stage
RSPxxx stage
CLNxxx stage

Changes to the Science Pipeline
Clearing-out the OTFR system
Path File Contents
OPUS login file Contents
Error Handling

Detection
Notification
Flushing
Possible Operator Action

Document Conventions

You will see several different colors of text used in this document. These were chosen to help the reader differentiate between the following types of parameter sources:

GREEN_ENTRIES - come from the process resource file for a particular pipeline process
ORANGE_ENTRIES - come from entries in the OTFR request and response files
TAN_ENTRIES - come from the OPUS_DEFINITIONS_DIR:opus_login.csh login script
RED_ENTRIES - come from the OPUS blackboard system

Hopefully the use of these different colors will make the source of each parameter more clear.

Overview

The On-The-Fly Re-processing (OTFR) system evolved from the On-The-Fly Calibration (OTFC) system, and like OTFC, it was constructed using the OPUS pipeline architecture. There are a number of online documents and a FAQ (start from the FAQ) describing the OPUS architecture which contain valuable information for the operator of the OTFR pipeline. That information will not be duplicated here.

The OTFR system consists of a single OTFR pipeline, which in the middle bridges off to an HST science pipeline (see figure 1). The OTFR pipeline consists of a set of stages (processes) that will accept OTFR requests from DADS and return OTFR responses. An OTFR request is a simple text file which includes the name of the dataset that is to be re-processed, and the directory location in DADS to which the resulting output products produced by OTFR will be returned. An OTFR response consists of a response file (again a simple text file, almost exactly matching the request file except for a status return message) along with calibrated and raw datasets and products, produced by the stages of the OTFR and HST science pipelines.

Each stage in the OTFR and HST science pipelines performs a part of the processing necessary to re-process observations. The processing is split up to allow multiple copies of pipeline stages to be running. This can improve the throughput of processing by keeping the CPU and I/O systems loaded. It also can improve the robustness in the event of hardware problems, by having different copies of processes running on different machines. If one machine goes down with a hardware failure, it is possible that other machines in the cluster could continue to process data.

Currently only STIS and WFPC-2 data will be processed in the OTFR system. Eventually other instruments will be added (NICMOS, ACS, etc).

Environment / Dependencies

The OTFR Pipeline system will operate on a UNIX-based set of workstations. Currently the Compaq Tru64 systems are supported.

The OTFR machines will need to put and get files to and from the existing DADS VMS-based distribution workstations. File transfers via FTP will occur between the VMS and UNIX workstations in several different pipeline stages, so network connectivity disruptions will likely cause problems in the pipeline processing.

Currently all pipeline stages are implemented as either UNIX shell scripts or Perl scripts. These scripts use a combination of system commands, Perl utilities, OPUS programs and STSDAS programs to perform the necessary processing, so software installations and updates to any of these software systems can cause changes in the pipeline behavior.

Database access to both OPUS and DADS databases occurs in many locations in the OTFR and HST science pipelines. The particular databases and tables are described in the details that follow. Disruptions in database connectivity will obviously cause problems in the pipeline processing.

The Pipeline Stages

The OTFR Pipeline will consist of the following stages (visible as columns on the OMG , the Observation Manager graphical display):

POL xxx: Poll for OTFR requests and convert them into OSFs (Observation Status Files) which send datasets through the OTFR pipeline
LST xxx: Produce a list of telemetry (POD) or EDT files that will be needed from the archive in order to reprocess the requested dataset
ARQ xxx: Retrieve the required POD or EDT files from the archive.
WAK xxx: "podwhack" (i.e. segment) each POD file into one dataset per POD file (for instruments with multiple exposures per POD).
RSVxxx: "Reserve" each needed science exposure and ASN name before entering the science pipeline
MOV xxx: transfer the segmented POD files or EDT files to the HST science pipeline
COL xxx: monitor the HST science pipeline for the products necessary to complete an OTFR request
G2F xxx: convert GEIS (and ASCII) files to FITS for all instruments (all instruments will have ASCII files, at least)
RET xxx: copy FITS files back to DADS disk
RSP xxx: create OTFR_response file and copy it back to DADS disk
CLN xxx: clean up (delete) files and directories in the OTFR and HST science pipelines

There is an additional process, GETREQ, that retrieves OTFR requests from the remote DADS system and copies them to the OTFR system to start them in the pipeline. This process feeds data to the OTFR pipeline, and therefore does not show up in the pipeline stages listed in the OMG. Consider it the "jump start" process for the OTFR pipeline.

The characters "xxx" above imply the stage will have an instrument specific OPUS process resource file , i.e. POLWF2, POLNIC, POLSTI, etc., which supplies instrument specific parameters and directory names to the process being run. This allows a single piece of code to run using different input parameters.

GETREQ (required process, does not show up in the pipeline stages)

Trigger: Delta-time poller (awakens at periodic time interval)

Summary:

Awakens periodically (DELTA_TIME) to check the OTFR request directory (REMOTE_REQUEST_DIR) on a disk local to a DADS VMS machine (RMT_MACHINE) for OTFR request files (REQUEST_FILEMASK). Any OTFR request files found will be copied back to a local directory (OUTPATH_HOLD) on the OTFR UNIX machine via a FTP process (RMT_LOGIN, FTP_PFILE). Once this copy is verified as successful, the OTFR request files will be deleted from the OTFR request directory on the DADS VMS machine, so that they are not picked up multiple times. Then the request files are renamed from the local incoming directory (OUTPATH_HOLD) to the local polling directory (OUTPATH) where they can be picked up by the next pipeline process.

This process serves OTFR request files to the OTFR UNIX system, by obtaining them from the DADS VMS system. Failures in this process will stop OTFR requests from entering the OTFR pipeline.

DADS will place OTFR request files in the OTFR request directory if the OTFR system has signalled that it is "online".

There will only be one copy of this process running. This should be enforced by a restrictions file in the OPUS PMG. This ensures that two separate GETREQ processes do not conflict over copying and deleting requests in the OTFR request directory.

OPUS Resource File Entries: OUTPATH_HOLD; OUTPATH; local directory to which OTFR request are renamed once they are completely copied over from DADS; DELTA_TIME; REMOTE_REQUEST_DIR; REQUEST_FILEMASK; RMT_MACHINE; RMT_LOGIN; FTP_PFILE

Success Conditions: No requests found on the DADS system OR; A copy of each OTFR request found on the DADS system will be made in the OUTPATH directory and; the original OTFR requests on the DADS system will be deleted.

Error Conditions
PROCESS GOES ABSENT (on the PMG , the Process Manager graphical display)

POLxxx

Trigger: File-poller (polls for files matching a certain filemask in a certain directory)

Summary

Checks for OTFR request files (FILE_OBJECT1) in the local request directory (FILE_DIRECTORY1) filled by GETREQ. For each request found, creates a directory under the instrument calibration directory (OUTPATH) in which the particular request will be worked, and copies the OTFR request there. The request will have a new file extension addition (FILE_PROCESSING). Reads the OTFR request and creates Observation Status Files (OSFs) for each DATASET entry in the request (current plan is only one DATASET entry per request). The OSFs will be used to control and track the processing of the request while it is in the OTFR pipeline. A trailer (log) file is started that will document the processing steps performed by OTFR on this request.

The OTFR Request File will be named with a unique pseudo-TIMESTAMP and a DATASET_NAME, e.g. 3112234912345_u2440101t.req . Since the DATASET_NAME could appear in more than one request, the TIMESTAMP must be guaranteed by DADS (who will be naming these files) to be unique for every OTFR request sent to the pipeline. The TIMESTAMP consists of several fields: DDHHMMSSnnnnn, where DD is the day of the month, HHMMSS are the hour, minute, and second at which the request was generated, and nnnnn is a unique number assigned by DADS that prevents duplicate timestamps. The OTFR request file will contain, at least,

DATASET_NAME=U2440101T

FILE_COUNT=0

TIMESTAMP=3112234912345

DIRECTORY=DISK$LEO:[LEO.MSWAM.OTFCAL.DADS]

END_FILE

When an OTFR request file is found in the polling directory (FILE_DIRECTORY1), a subdirectory named with the values of the TIMESTAMP_DATASET will be created under the calibration directory (OUTPATH) for this instrument. By grouping the subdirectories for each instrument under an instrument-specific directory, this allows segregation and disk management of processing by instrument.

If the subdirectory already exists, then this will be considered a duplicate request, and an OSF (Observation Status File) will be created for the request indicating the duplication ('d') in the PO column. The duplication OSF will also be given a special class (CLS = req) and a special DCF number (DCF = 999), so that these OSFs can be sorted together in the OMG. These entries on the OMG will hopefully catch the operator's attention. A message will also appear in the POLxxx process log file explaining the error. These duplications are likely only to be seen in the test environment, where the same OTFR requests can be run repeatedly using some tricks. DADS should not normally feed duplicate requests to the OTFR system.

The OTFR request file will be copied from the polling directory into this request-unique subdirectory. It is given a file extension addition (FILE_PROCESSING) at this time as an artifact of the OPUS architecture. This copy of the request will provide later pipeline stages access to needed information. For example, the DIRECTORY field in the request will need to be accessed in order to send the OTFR products back to DADS. The copy of the request in the local polling directory will be deleted automatically by the OPUS architecture to avoid processing the request multiple times.

The contents of the OTFR request are echoed to the process log file and the request trailer file, so that later searching of these log files can be used in request debugging. The request trailer file will be named TIMESTAMP_DATASET.trl, and will be created in the unique request-specific subdirectory. During processing, the OMG/ViewTrailer function should be able to display the contents of this file to indicate the steps performed so far in processing this request.

POLxxx will use the information in the OTFR request filename and the file contents to create OSFs on the OTFR blackboard. Here is an example OSF:

33C243B1-cw_____________________.31512014412345_u2440101t-wf2-000-____

The most relevant details here are the dataset identifier made up of the timestamp and datasetname (31512014412345_u2440101t) and the OSF status field (cw____________________). These indicate to the OPUS architecture that a particular dataset is waiting (w) for a particular kind of processing. Since the 'w' is in the second pipeline stage, the LS (LSTxxx) process will be the next process to peform work on this request. The OSF status field is filled from the OSF_START_STATUS resource entry. The data-id field of the OSF (in this case set to 'wf2') is filled from the POL_DATATYPE resource entry.

If a request does not contain a DATASET_NAME entry, then the request is considered corrupted, is renamed with a special file extension addition (FILE_ERROR), has an OSF created for the "bad" request (PO = 'b') to bring the error to the operator's attention, and will remain in the polling directory (FILE_DIRECTORY1) until further action is taken.

If OSF creation fails for a DATASET_NAME entry in the OTFR request, then the OTFR request will be renamed with a special file extension addition (FILE_ERROR) and this process will go ABSENT, indicating a problem to the OTFR operator in the PMG. This error is considered severe, because it likely indicates a problem in the OPUS architecture and would probably affect the processing of all datasets, so that is why the program is brought down.

A separate POLxxx OPUS resource file will exist for each instrument so that path-dependent locations for the instrument calibration directory can be used. This implies that a separate file polling process will need to be started in the OTFR pipeline for each instrument, though a single polling directory can be used to hold all requests if the file-polling mask is created with instrument-specific characters in it to differentiate between the instruments.

Once all OSFs are created, an entry is written to the ARCH_SERVER::ARCH_DB:otfr_request_stats database relation for keeping statistics on the key processing times for this OTFR request. Later pipeline tasks will update this database record with additional processing times, so that after the request has completed OTFR (successfully or not) this record will document how long processsing took, and whether or not processing problems were encountered.

OPUS Resource File Entries
FILE_DIRECTORY1 - directory searched for incoming request files
FILE_OBJECT1 - filemask used to search FILE_DIRECTORY1
FILE_PROCESSING - file extension addition appended to the request by the OPUS architecture
FILE_ERROR - special file extension addition appended to the request if an error occurs
OUTPATH - directory under which timestamp_dataset directories will be created for this instrument
POL_DATATYPE - 3-character instrument designator for use in OSF data_id field
OSF_START_STATUS - passed to the OSF_create utility to define the initial status column values in the OSF
ARCH_SERVER - name of DADS database server
ARCH_DB - name of DADS database

Success Conditions: TIMESTAMP_DATASET subdirectory created under instrument-specific CAL directory; A copy of the OTFR request will exist there with a new file extension.; A trailer file named TIMESTAMP_DATASET.trl will be created there, containing an echo of the OTFR request contents.; The original OTFR request in the polling directory will be deleted.; An OSF will be created for the DATASET_NAME listed in the OTFR request file with stages PO = c, LS = w.

Error Conditions
PROCESS GOES ABSENT

LSTxxx

Trigger: OSF - Requires 'w' in LS stage

Summary

Creates a file listing exposure names and POD file names needed to process this DATASET, including any member names if this is an association, or any special files (e.g. NIC SAA darks). If EDT files should be used for processing instead of POD files, an indication of this will appear in the generated file.

This process begins by testing DATASET (from the triggering OSF) to see if it is an association. This is performed by checking the last character of the name, and if it is a digit [0-9] or letters [a-i], and this instrument supports associations, then it is assumed to be an association. The dataset name is then converted to the name of the primary product by changing the last character to zero (0), and that name is searched for in the DSQUERY::OPUS_DB:asn_members database relation. Any collected members found for the association will be saved in a list of exposures. If the association has no collected members, this is considered an error, and the OSF is marked for failure.

An additional check is made for NICMOS datasets. If the dataset name appears in the DSQUERY::OPUS_DB:nic_saa_link database relation, then this dataset has a corresponding SAA dark association. The members of that association are then obtained by another query to DSQUERY::OPUS_DB:asn_members, and those names are added to the exposures list. If no collected members are found for the NIC SAA dark association, again the OSF is marked for error.

If the DATASET requested was not an association, then the name of that dataset is the only entry in the exposures list.

A file named like the OTFR request, but given a .podlist extension. (e.g. 1412020112345_n12345678.podlist) is created in the request-specific subdirectory to hold the data sources required for all exposures needed to re-process this DATASET. This file will indicate how to generate archive requests to retrieve the needed files for re-processing. There are currently two choices for the source of the exposure data: POD or EDT. Normally POD will be used, but there are some datasets for which valid POD files do not exist in the archive, and for these the archived EDT set will be used for re-processing.

Now the data source for each entry in the exposures list must be determined. A check is made of the ARCH_SERVER::ARCH_DB:otfr_special_processing relation, and if the exposure name appears there with osp_data_source = EDT, then this exposure is not supposed to be re-processed from POD files, but from archived EDT files instead. Usually this indicates there was a problem with the original POD files, and repaired EDT files were created and archived instead. In this case an EDT entry is made in the .podlist file for this exposure.

If no special processing records are found, then the exposure is searched for in the DSQUERY::OPUS_DB:podnames relation. This search is performed for the first 8 characters of the dataset name, with the last character wild-carded. This is done because the DATASET listed in the archive is not necessarily the same exact name found in the podnames relation. The last character of the dataset name can differ depending on how many times the dataset was sent from DDF/PACOR to STScI. All POD names found by this query are returned, and if different last characters in any exposure names were found a second query is made. This one goes to the DSQUERY::OPUS_DB:executed table to see what the prefered version of the exposure is. This name and the corresponding POD files are written to the .podlist file.

The .podlist file should now contain entries describing the names and data sources of all of the "raw material" files (PODs and EDT sets) needed to re-process DATASET.

OPUS Resource File Entries: OUTPATH - directory under which the calibration directory should exist for the dataset (named by TIMESTAMP_DATASET); DSQUERY - name of OPUS database server; OPUS_DB - name of OPUS database; ARCH_SERVER - name of DADS database server; ARCH_DB - name of DADS database

Success Conditions: A file named TIMESTAMP_DATASET.podlist will exist in the request-specific subdirectory, indicating which data sources must be retrieved from the archive to re-process this dataset.; OSF.LS = c; OSF.RQ = w

Error Conditions

ARQxxx

Trigger: OSF - 'w' in RQ stage

Summary:

This stage will perform an archive retrieval for each entry in the .podlist file. The files retrieved will be written back to the request-specific subdirectory.

This process will convert the list of POD and EDT entries in the .podlist file into a set of archive retrieval commands for retrieving the data files needed for re-processing. For this sample entry in a .podlist file

O4US01EVQ LZ_497C_048_0000155172 POD

the archive retrieval command would look something like this

java -DDSQUERY=CATLOG -DDBNAME=dadsops -DCONFIG=/store/archc/dadstools/config.props -jar /store/archc/dadstools/AcquireDataset.jar LZ_497C_048_0000155172 POD /info/devcl/pipe/otfr/stis/12109213928475_o4us01010/

This method of archive retrieval is called a "direct acquire". To accomplish this, an NFS disk must be shared between the archive system and the OTFR system, and the user-id performing the retrieval must be in the dads group (use UNIX command "id" to check). Several of the entries in the properties file $DADS_TOOL_DIR/config.props point to directories on this shared NFS disk

Each archive retrieval command contains the root name to search the archive for, the archive class of the files, and the destination directory for the retrieved files. If any EDT entries appear in the .podlist file, similar commands are used, substituting the EDT dataset name, the EDT archive class, and the same destination directory. Before beginning any retrieval commands, a number of set-up steps are performed.

In order for the archive to write the retrieved files back into the request-specific subdirectory, the permissions on that directory are opened up world-write.
In order for the archive retrieval command to write into the archive area in a way that the archive software can make use of its request message, the umask value of the ARQxxx process is changed to value = 01 (includes world read/write).
The archive request is given a time-stamp on the archive side, and this time-stamp must be in GMT format. To make this occur, the time-zone of the ARQxxx process is changed to GMT using the environment variable assignment TZ = GMT0.
The ARCH_SERVER::ARCH_DB:otfr_request_stats database relation is updated to indicate the start time of the archive retrievals.

Once the archive retrievals are successfully complete, the ARCH_SERVER::ARCH_DB:otfr_request_stats database relation is updated to indicate the end time of the archive retrieval process. The temporary changes to the umask and timezone values go away when ARQxxx finishes this stage of processing on this OTFR request, however the permissions change on the request-specific subdirectory stays in place throughout the lifetime of this OTFR request.

Files retrieved from the archive using direct acquire come back with this format:

lz_497c_048_0000155172.pod_030101190449_pod

The extra _030101190449_pod on the filename is the GMT timestamp and the archive class. These extra items will be removed before the file is passed on to the next pipeline stage.

If any of the archive retrieval commands for any of the .podlist entries fails, the OSF status will be set to 'e'.

OPUS Resource File Entries

         ARCH_SERVER - name of the archive database server
        ARCH_DB - name of the archive database
        DADS_TOOL_DIR - disk location of the direct acquire tool and config properties

Success Conditions: Setting directory permissions on the request-specific OTFR subdirectory to world-write; Retrieval of POD/EDT files from the archive, for each entry in the .podlist file; Rename of retrieved files from archive filenames to pipeline filenames.; Database updates for otfr_request_stats to set the archive retrieval start and end times.; OSF.RQ = c; OSF.WK = w

Error Conditions

WAKxxx

Trigger: OSF - 'w' in WK stage

Summary:

This stage will perform "pod-whacking", or segmentation of POD files into one exposure per POD file, for each POD entry in the .podlist file.

NOTE: This segmentation is only needed for instruments where a POD file can contain data for more than one exposure (e.g. STIS, NICMOS). It does NOT apply to instruments that have a one-to-one mapping from PODs to exposures (e.g. WFPC-2). For instruments that don't need pod-whacking, this stage does a simple copy of the archive-retrieved POD file to the POD file name required for submission to the science pipeline. A copy is done so that the file ownership is changed from the archive (which placed the files there) to the OTFR pipeline operator.

This process will read the contents of the .podlistfile, and for every POD entry found it will run the podwhacker OPUS utility to extract exposure data into one POD file per exposure. The newly created POD file will be named exposure_podname.pod, and will be written to the OTFR request-specific directory. Any EDT entries in the .podlist file will be skipped over.

Pod segmentation is needed for several reasons

This results in only the data requested by the user being fed into the HST science pipeline. Without pod-whacking, some POD files would contain extra science exposures not requested by the archive user, which would consume extra resources in the science pipeline, and clutter up the OMG with irrelevant entries.
This allows unique exposure-pod filenames to be created, which make it more clear which pod files pertain to which exposures. This can help in debugging problem requests.
This results in fewer potential collisions in the HST science pipeline when separate OTFR requests need different exposures from the same POD file.

Before running podwhacker, any pre-existing file matching the name of the output POD file will be deleted. After podwhacker is run, and its return code is checked for error, an additional check is made to be sure that the output POD file does now exist. If either a bad return code is received, or the intended output POD file is not found, the OSF is marked for error.

Success Conditions: A new set of POD files will appear in the OTFR request-specific directory, with names like exposure_podname.pod. There should be one for each POD entry in the .podlist file.; OSF.WK = c; OSF.MV = w

Error Conditions: Failure to open the .podlist file.; Failure to find a POD file listed in the .podlist file.; Failure during podwhacker execution.; Failure to find resulting podwhacker output file.; OSF.WK = e

RSVxxx

Trigger: OSF - 'w' in RV stage

Summary:

This stage will reserve exposure and association names on a separate blackboard to guarantee that no collisions will occur in the HST science pipeline between OTFR requests that happen to be processing the same files.

This task will first look for an already-existing reservation list file (.rsvlist) in the request-specific subdirectory. If found, this is an indication that a previous reservation attempt failed and this attempt is a re-try. The list of exposure and association names in the reservation list file is read in. If no such file is found, then one is created by reading all of the exposure names in the .podlist file, and performing a database look-up against DSQUERY::OPUS_DB:asn_members to find all association_id values listed for each exposure name in the .podlist file. The reservation list file is formed by combining the list of .podlist exposure names and any association ids found in the database.

Once the reservation list is either read in or created new, an attempt is made to create an OSF on a separate blackboard (RESERVATION_PATH) for each reservation list entry. If every OSF creation attempt succeeds, then this OTFR request can proceed, since the required names that will be used in the science pipeline have been reserved. If any one of the OSF creation attempts fails for either an exposure or association name, then a collision has occurred. This indicates that another OTFR request has already reserved one or more of these names for science processing, and that the current OTFR request will have to try again later to see if the existing reservations have been cleared. If any of the OSF creations happened to succeed for the current OTFR request before the failed creation occurred, the OSFs that were successfully created are deleted, so that the reservations needed for this request are not "partially" made.

The reservation OSFs will be deleted by the CLNxxx process once the OTFR request is finished processing.

OPUS Resource File Entries
Success Conditions: An existing .rsvlist file is read in or a new one is created.; OSFs are created on the RESERVATION_PATH for each .rsvlist entry.; OSF.RV = c; OSF.MV = w

Error Conditions: Failure to open .podlist file; Failure to read from or write a new .rsvlist file; Failed database connection to DSQUERY::OPUS_DB; OSF.RV = e; Failure to create reservation OSFs due to a collision on the reservation blackboard.; OSF.RV = z (try again later)

MOVxxx

Trigger: OSF - 'w' in MV stage

Summary:

This stage will attempt to move files for each entry in the .podlist file from the OTFR request-specific directory to the appropriate directory of the HST science pipeline.

This process opens and reads the .podlist file, and attempts to move files for each .podlist entry from the OTFR request-specific directory (where the files were placed by an archive retrieval), to one of the input directories for the HST science pipeline. Once moved, these exposures will be ready to begin science processing.

If a .podlist file entry specifies an EDT data source, then the entire set of files matching the EDT name (EDT_EXT_MASK) will be moved from the OTFR request-specific directory to the appropriate science pipeline directory (EDT_RENAME_DIRECTORY). Once the EDT files are successfully moved, an OSF is created in the science pipeline path (DATA_REDUCTION_PATH, OSF_DATA_TYPE, OSF_START_STATUS) that will cause science processing to begin for these data.

If a .podlist file entry specifies a POD data source, then the corresponding "whacked" POD file will be moved from the OTFR request-specific directory to the appropriate science pipeline directory (POD_RENAME_DIRECTORY). This move is slightly different from the EDT file move, in that POD files can be immediately picked up by the science pipeline for processing. Because of this, the POD file is first moved to the destination directory with an altered file extension to keep it from getting picked up right away, while its contents may not be complete. Once the file has been completely and successfully moved, it is renamed in place to remove the altered file extension, which makes it visible to the science pipeline.

One other change is made for old-style WFPC-2 POD files. These can be named starting with "w". These old pod file names do not follow the convention expected by the current science pipeline software. If one of these old filenames is detected, the file is renamed to prepend it with "lz_a2_", which will allow the current science pipeline software to process it.

For EDT and POD, the file move is performed so that it will fail if files with identical names are already found in the science pipeline area. This is by design so that two OTFR requests that happen to want the same files processed by the science pipeline do not interfere with each other. This type of interference is expected to be rare, except for the case of shared STIS wavecals. For POD files this is accomplished by setting the file permissions to owner:read-only, and using a special form of the move command (mv -i) that will fail when run in batch mode with already existing destination files. For EDT sets, the file permissions are not altered, but the special move command is still used. The end result of one of these collisions is that one of the requests will be unable to move files into the science pipeline right away, but the OSF for this failed move will be set so that the move is re-tried at a later time (see PODTRG). Once the first set of files clears the science pipeline, the second set of files with the same name will be allowed in.

Once all file moves are successful, the ARCH_SERVER::ARCH_DB:otfr_request_stats database relation is updated with a timestamp indicating when the datasets for this OTFR request entered the science pipeline.

OPUS Resource File Entries
Success Conditions: EDT or POD files moved to the science pipeline for processing.; For EDT, an OSF will be created in the science pipeline path.; OSF.MV = c; OSF.DC = w

Error Conditions
Failure to open .podlist file: Failure to find destination directory specified for file move.; Failure to move files to destination directory.; Failure to create an OSF in the HST science pipeline for a set of EDT files just moved.; Failure to set POD file permissions to owner:read-only.; OSF.MV = e; Failure to move files due to a collision in the science pipeline.; OSF.MV = z (try again later)

PODTRG

Trigger: Time trigger set to awaken every 10 minutes.

Summary:

This process will awaken and search for OSFs on the OTFR blackboard that have been sleeping in either the MOVxxx (MV) or COLxxx (DC) stages. If any are found, they are retriggered to attempt that pipeline stage again.
OPUS Resource File Entries
Success Conditions: OSFs found in need of retriggering according to SEARCH_OSF are retriggered as dictated by SET_OSF.
Error Conditions
Failure to find and open the pipeline stage file for this path.
Failure to test for OSF status values.

COLxxx

Trigger: OSF = 'w' in DC stage

Summary:

This stage attempts to collect products from the science pipeline in order to satisfy an OTFR request. When the needed products are ready in the science pipeline, they are copied back to the OTFR request-specific directory. A timeout scheme is implemented so that a collector will not wait indefinitely for science pipeline products to be ready before closing-out an OTFR request

There are two styles of collectors: one for the multi-file association instruments (NICMOS, ACS), and one for the others (WFPC-2, STIS, etc.). The multi-file association collector has the added complication of collecting science output for all association members and the requested product. The .podlist file contains the names of the exposures that drive the multi-file association collection. The triggering OSF supplies the name of the DATASET for the other type of collector.

Both collectors begin by checking for the presence of a timeout file. This file will appear in the OTFR request-specific subdirectory and be named TIMESTAMP_DATASET.timeout. It is created the first time the collector starts working on a particular OTFR request, and is used to limit the amount of time a collector will wait for needed output from the science pipeline. The age of this empty file is checked each time the collector is retriggered, and if the age of the file is greater than TIMEOUT_INTERVAL, then the collector picks up copies of any needed science output that is ready, and proceeds to flush this OTFR request from the pipeline. Any science output that is not ready by this time is considered lost.

If timeout has not occurred, then the status of the calibration stage in the science pipeline path (DATA_REDUCTION_PATH) is checked for each needed science output by testing the OSF. Only the DATASET is tested for WFPC-2 and STIS. All association members and the primary product are tested for NICMOS and ACS (OSFs are not created in the science pipeline for sub-products). If the status value(s) for calibration show COMPLETE (c), NOT PERFORMED (n), or FLUSHED (f), then science processing is considered done, and all needed science output files are copied back from the science pipeline directory (CAL_DIR) to the OTFR request-specific directory. The list of needed science output files is obtained from the .podlistcontents and the requested DATASET.

The ARCH_SERVER::ARCH_DB:otfr_request_stats database relation is updated with the timestamp when collection completed, regardless of whether normal completion or a timeout occurred.

Success Conditions
Science output files copied to the OTFR request-specific subdirectory.: OSF.DC = c; OSF.2F = w
Science output files not yet ready for this request (and timeout not yet expired): OSF.DC = z

Error Conditions
Failure to open the .podlist file
Failure to create the .timeout file
Failure to test an OSF in the science pipeline path
Multiple OSFs found for DATASET in the science pipeline path
Failure copying files from the science pipeline to the OTFR request-specific subdirectory
OSF.DC = e

G2Fxxx

Trigger: OSF = 'w' in 2F stage

Summary:

This stage performs a GEIS-to-FITS conversion for WF2 and other pre-SM97 instruments, converting all files that appear to be GEIS or ASCII files to FITS format. This stage will also be run for post-SM97 instruments (STIS, NICMOS, ACS) to perform conversion of text files (trailer file, etc.) to FITS format.

Success Conditions
GEIS and ASCII files in the working directory will be converted to FITS files.: OSF.2F = c; OSF.RE = w

Error Conditions: Error deleting existing FITS file before conversion.; Failed STWFITS.; OSF.2F = e; No datasets processed by STWFITS (nothing returned from the science pipeline, since all instruments at least have a trailer needing conversion); OSF.2F = n; OSF.RE = n; OSF.RS = w

RETxxx

Trigger: OSF - 'w' in RE stage

Summary:

This stage will be used to copy back FITS datasets to a directory on the DADS staging disks. All *.fits files in the working directory for this request will be copied back.

A FTP copy (RMT_MACHINE, RMT_MACHINE_TYPE, RMT_LOGIN, FTP_PFILE) of all FITS files in the OTFR request-specific directory will be made back to the DIRECTORY specified in the original OTFR request. Any files already existing there will be overwritten with new versions. The OTFR pipeline will delete the data (in a later pipeline stage, CLNxxx) from its disks without waiting for any kind of response from DADS, once a verified copy exists on the DADS disk.

OPUS Resource File Entries
OUTPATH - directory under which the calibration directory should exist for the dataset (named by TIMESTAMP_DATASET)
REQUEST_FILEEXT - file extension for OTFR request in the calibration directory
RMT_MACHINE - name of remote DADS machine
RMT_MACHINE_TYPE - type of remote DADS machine (UNIX or VMS)
RMT_LOGIN - login for remote DADS machine
FTP_PFILE - filename providing parameters for FTP access to DADS machine
Success Conditions: Copy of all FITS datasets to DADS staging disk as specified in the DIRECTORY parameter of the OTFR response file.; OSF.RE = c; OSF.RP = w

Error Conditions: Failure to copy dataset files to DADS staging directory (connection failure, permission problem, etc).; Failure in verifying successful copy to DADS staging directory.; OSF.RE = e; No files found to copy (just send a response file back); OSF.RE = n; OSF.RS = w

RSPxxx

Trigger: OSF - 'w' in RS stage

Summary:

This stage will be used to create an OTFR response file and copy it back to the DADS system via FTP.

The response file will be formed by renaming and editing the local copy of the OTFR request (REQUEST_FILEEXT). The filename extension of the OTFR request will be changed (RESPONSE_FILEEXT). The FILE_COUNT field in the response file will be updated to include all FITS files in the working directory.

The OSF for DATASET (or the primary product name, if DATASET is an association product) is searched for in the science pipeline path using the OPUS tool osf_test, to return the status column of the calibration stage (CA) of the pipeline. This indicates either success, failure, or non-action. If this status column = 'n' or 'f', then calibration was bypassed, and a message indicating this (RESPONSE_STATUS_NOCAL) will be placed in the STATUS entry in the OTFR response. If calibration occurred normally, a different status (RESPONSE_STATUS_OK) message will appear in the OTFR response.

The OTFR response will be copied, via FTP copy (RMT_MACHINE, RMT_MACHINE_TYPE, RMT_LOGIN, FTP_PFILE), back to a response directory (RESPONSE_DIRECTORY)on the DADS system.

The ARCH_SERVER::ARCH_DB:otfr_request_stats database table will be updated with the completion timestamp for this OTFR request.

OPUS Resource File Entries
OUTPATH - directory under which the calibration directory should exist for the dataset (named by TIMESTAMP_DATASET)
REQUEST_FILEEXT - file extension for OTFR request in the calibration directory
RESPONSE_FILEEXT - file extension for OTFR response file
RESPONSE_STATUS_NOCAL - status value placed in the OTFR response for a non-calibrated dataset
RESPONSE_STATUS_OK - status value placed in the OTFR response for a calibrated dataset
RESPONSE_DIRECTORY - directory on the remote DADS machine where the OTFR response is sent
RMT_MACHINE - name of remote DADS machine
RMT_MACHINE_TYPE - type of remote DADS machine (UNIX or VMS)
RMT_LOGIN - login for remote DADS machine
FTP_PFILE - filename providing parameters for FTP access to DADS machine
ARCH_SERVER - DADS database server
ARCH_DB - DADS database

Success Conditions: Creation of OTFR response file in local subdirectory by altering the original OTFR request.; Copy of OTFR response file to RESPONSE_DIRECTORY.; OSF.RS = c; OSF.CL = w

Error Conditions: Failure to create the OTFR response file.; Failure to count dataset files in local subdirectory for FILE_COUNT entry in OTFR response.; Failure to update the FILE_COUNT parameter in the OTFR response.; Failure to query the OSF for the value of the calibration pipeline stage (CA).; Failure to copy response file to DADS system (connection failure, permission problem, etc).

CLNxxx

Trigger: OSF - 'w' in CL stage

Summary:

This process cleans up a completed OTFR request by deleting any .podlist-specific science output in the science pipeline, any processing files left in the request-specific TIMESTAMP_DATASET subdirectory, the reservation OSFs on the reservation blackboard, and removing the subdirectory itself if no files are left in it.

This process begins by opening the .podlist file, and attempting to delete files from the calibration (CALDIR), EDT (SISDIR), and POD areas (POD_INPUT_DIR, POD_OUTPUT_DIR) of the science pipeline for any listed exposure/podname pairs. Files are removed using the first 8 characters of the exposure name and a wildcard to avoid potential trouble with differing last characters. The OSFs in the science pipeline for the exposure and the POD file are also deleted. This continues for every entry in the .podlistfile.

Once these deletions are complete, any OTFR requests that were stuck waiting to submit POD/EDT files to the science pipeline because of a name collision with the files used by this request should now be free to submit their datasets for science processing.

The DATASET is then used to delete product files from the science pipeline directories (CALDIR, SISDIR). The first 8 characters of the DATASET are used with a wildcard so that all sub-products and the main product will be found and deleted. The OSF for the primary product is also deleted (only the primary product has an OSF in the science pipeline). This will now permit science processing to begin for any other OTFR request that was stuck waiting because of a name collision.

The .rsvlist file is read and OSF deletions occur for any remaining science pipeline OSFs related to this request, and for the reservation OSFs that were created to allow this request to process in the science pipeline without colliding with another request. These reservation OSFs are deleted from the RESERVATION_PATH blackboard (they were created during the RV pipeline stage).

Working files are then removed from the OTFR request-specific directory, including those for the DATASET, the TIMESTAMP_DATASET, the .timeout file, and all .pod and .dan files. There shouldn't be any files left in the directory at this point, and if there are not, the directory itself is removed.

If files remain in the OTFR request-specific subdirectory after all of these deletions have been performed, this process will not remove the subdirectory (i.e. the clean will be incomplete), and the OSF will be marked with an error.
OPUS Resource file entries: OUTPATH - location of OTFR request processing directories; DATA_REDUCTION_PATH - name of the science pipeline path; RESERVATION_PATH - name of the reservation blackboard path; CALDIR - location of calibrated files in the science pipeline path; SISDIR - location of EDT set in the science pipeline path; POD_INPUT_DIR - initial location of POD files in the science pipeline path; POD_OUTPUT_DIR - final location of POD files in the science pipeline path
Success Conditions: Removal of data files and OSFs from the science pipeline path for any exposures listed in the .podlist file.; Removal of data files and OSFs from the science pipeline path for any POD files listed in the .podlist file.; Removal of science product data files and OSF from the science pipeline path for this DATASET.; Removal of OSFs from the science pipeline path for any exposure-related (i.e. ASN) files.; Removal of OSFs from the reservation blackboard.; Removal of DATASET, TIMESTAMP_DATASET, .timeout, .dan and .pod files from the OTFR request-specific subdirectory.; Removal of the OTFR request-specific subdirectory from calibration root directory (OUTPATH), if the subdirectory is empty.; OSF.CL = c

Error Conditions: Failed to open .podlist file.; Failed removal of empty directory.; OSF.CL = e

NOTE: Failure to delete files when cleaning will NOT necessarily cause a failure of the CLNxxx process unless the files are in the OTFR request-specific subdirectory. Problems cleaning files from the science pipeline are not considered critical.

Changes to the Science Pipeline

OTFR was designed to allow a single set of executable code (programs and scripts) to be used for both the pre-archive and OTFR pipelines. Separate sets of resource files and the ability to bring up different combinations of processes under OPUS facilitate this design. Here are some of the differences between the pre-archive and the OTFR science pipelines:

PSHWF2: This process is an OSF poller that runs in the OTFR science pipeline and will trigger on WFPC-2 datasets stuck in the 'm' state in the Data Partitioning pipeline stage. These datasets are signalling that they expect to be MERGED, but there are some cases where no more WFPC-2 datasets are to be expected. To make this decision, the process does a look-up in the DSQUERY::OPUS_DB:podnames relation and if only a single POD entry is found for this exposure, then it is assumed that no more datasets are expected, and the dataset is pushed on to the next pipeline stage. If multiple POD entries are found in the look-up, then the OSF is marked for error.
OCALxx: The OTFR science pipeline calibration script will make an extra check to see if calibration should be attempted for an exposure, before calling the calibration routine. This new check is made against the ARCH_SERVER::ARCH_DB:otfr_special_processing database relation, and if the osp_processing_steps field contains "BYPASS_CAL", then calibration is skipped for this exposure. This new check is in addition to the already existing check of the CALIBRAT header keyword, which can cause calibration to be skipped when its value is "N" or "F". The OTFR science pipeline calibration for WF2 will also convert the input FITS files from Generic Conversion to GEIS before calibrating.
DC_xxx: The science pipeline data collector has two "switches" that cause different behavior depending on how they are set. The OPUS_TESTING switch is used to allow an association that has already been collected (DSQUERY::OPUS_DB:asn_association.collect date filled and DSQUERY::OPUS_DB.asn_members.member status = C) to be collected again. This switch is set to TRUE for OTFR and also sometimes when OPUS testers want this behavior. The OTFR_WORLD switch is used to ignore the existence of an association OSF that has already counted down to "zero members outstanding before complete collection". Such a condition is considered an error during normal science processing, but because of STIS shared wavecals, this condition will be ignored under OTFR by setting the switch to TRUE.
FNDMRG: This process is a time poller that will periodically look for "Q/S split" datasets stuck in the Data Partitioning pipeline stage. A search is made of all OSFs in the science pipeline that are PROCESSING in the DP stage (DP_STAGE, DP_SEARCH_VALUE), and if that list contains datasets with the same IPPPSSOO but differing last characters, they are considered potential Q/S split candidates. A database look-up is made to ARCH_SERVER::ARCH_DB:otfr_special_processing, and if an entry for this IPPPSSOO exists with osp_processing_steps containing "MERGE_REQD", then this is an identified Q/S split. In this case the OSF with the higher DCF number will be triggered for automatic merging (DP_MERGE_VALUE).
MRGxxx: These processes are OSF pollers that automatically merge Q/S split datasets. There is one resource file for each instrument that can have splits (currently STIS and NICMOS). The Q/S split exposures are located by performing an OSF search, using DATASET for 8 characters and a wild-card. If any more or less than 2 OSFs are found, this is an error and the triggering OSF is marked for error. The catfilestool is used to merge the .PKX and .ULX files from the two exposures (SIS_DIR) in a temporary area (OPUS_TEMP_DIR). The .PKI and .ULI files are combined and the stored indicies are adjusted to account for the order of the split. The .ERX files should be identical, so one is chosen to use for the merged dataset. The .TRX files are combined. The merged files are then moved from the temporary area (OPUS_TEMP_DIR) back into the pipeline area (SIS_DIR), overwriting the files of the split dataset with the corresponding higher OSF DCF number. This OSF is then pushed on to the next pipeline stage (RESOURCE_FILE, RESOURCE_ITEM).
GC_xxx: The Generic Conversion science pipeline processes have been changed to check for and apply keyword repairs to FITS header keywords, after all keywords have been loaded from their sources. Keyword repairs are specified in the ARCH_SERVER::ARCH_DB:otfr_keyword_repair database table, and are meant to be used to fix telemetry errors, not software or database errors. An example of such a repair exists for WFPC-2, where the wrong shutter indicator appears in the telemetry, but the correct shutter was actually used for the exposure. Such keyword repair cases should be quite rare, however the Generic Conversion code for all active instruments has been changed to check for any repairs that should be applied to an exposure.

Clearing out the OTFR System

It has become a requirement that the OTFR operator have the capability of clearing out all requests that are currently being processed by the OTFR system (a restart of DADS is the usual trigger for such action).

The operator can use the facilities of the PMG to halt all OTFR processes, and the facilities of the OMG to delete all Observation Status Files for the currently processing requests in the OTFR pipelines. Note that each of these paths and pipelines must be halted and cleared using the OMG and PMG.

OTFR
OTFR science
Reservation (OMG clearing only; no PMG entries for this path)

A script has been created to clear out the processing directories. To run it type,

clear_otfr.sh otfr_pathname otfsci_pathname reserve_pathname

where otfr_pathname is the name of the path in which the OTFR processes are running (e.g. "otfr"), otfsci_pathname is the name of the path in which the OTFR science pipeline processes are running (e.g. otfsci), and reserve_pathname is the name of the path where science reservation OSFs are created. The script will search the path file for the locations of the processing directories (e.g. wf2_recalib, stis_recalib, etc.), and check for OTFR files in those directories. If files are found, a prompt is issued to the operator to be sure that they want to delete the files. The incoming request directories on the OTFR side (e.g. local_otfr_request_incoming, local_otfr_request_dir) are also searched. A separate deletion prompt is issued for each location where files are found.

The OTFR science pipeline will be cleared by searching the OTFR science path for entries matching DUDDC?, DUDDS?, and POD_ARCH_*. These are the typical locations of science files and POD files. Again, prompts are issued for any files found and the operator must confirm before each set of deletions will occur.

Additional prompts are then issued for cleaning out any OSFs remaining in the otfr, otfsci, and reserve blackboards. These could have been cleared using the OMG, as mentioned above, but these extra deletion checks are made in case any OSFs remain.

Path File Contents

The OPUS architecture provides a way for the same set of processes to be run against different sets of data and disk directories through the use of a path file. A path file (described in detail in the OPUSFAQ) can be thought of as a configuration set up for running a set of data. When you start a set of pipeline processes running under OPUS, you must tell it which path to use.

Process resource files are usually set up to "pull" certain parameter values "through a path file". This means that in the process resource file, instead of directly providing a value for a parameter, the value is given as the name of an entry in a path file. By using different path files, you can change the values of certain items used by the processes in a pipeline.

In the OTFR default path file, here are the parameters that are referenced by existing entries in the process resource files:

wf2_recalib - directory under which WFPC-2 processing directories are created

stis_recalib - directory under which STIS processing directories are created

nic_recalib - directory under which NICMOS processing directories are created

acs_recalib - directory under which ACS processing directories are created

pod_dir - science path mnemonic for pod input dir

acs_sis_dir - science path mnemonic link for ACS EDT dir

acs_cal_dir - science path mnemonic link for ACS CAL dir

nic_sis_dir - science path mnemonic link for NICMOS EDT dir

nic_cal_dir - science path mnemonic link for NICMOS CAL dir

stis_sis_dir - science path mnemonic link for STIS EDT dir

stis_cal_dir - science path mnemonic link STIS CAL dir

wf2_sis_dir - science path mnemonic link WFPC2 EDT dir

wf2_cal_dir - science path mnemonic link WFPC2 CAL dir

response_file_ext - file extension for the OTFR response file

request_file_ext - file extension for the OTFR request file, as it comes from DADS

request_work_ext - file extension for the OTFR request file as it progresses through the OTFR pipeline

data_reduction_path - name of the OTFR science pipeline path

reservation_path - name of the path in which exposure and association reservation OSFs are created

error_notify_interval - time in HOURS that a dataset will sit in an error state before DADS is notified

notify_polling_interval - polling time for checking for newly expired notify events

error_flush_interval - time in DAYS that a dataset will sit in an error state before the datatset is returned to DADS, as is

flush_polling_interval - polling time for checking for newly expired flush events

wf2_cal_file_ext = filespec for identifying WFPC-2 calibration products (for deletion during flushing)

stis_cal_file_ext = filespec for identifying STIS calibration products (for deletion during flushing)

nic_cal_file_ext = filespec for identifying NICMOS calibration products (for deletion during flushing)

acs_cal_file_ext = filespec for identifying ACS calibration products (for deletion during flushing)

edt_ext_mask - file extension mask for finding EDT files for a dataset

osf_start_status - OSF status column values for data reduction pipeline when EDT set is used

dads_tool_dir - NFS location of direct acquire tool

DADS_machine - remote DADS machine to use for FTP operations

DADS_machine_type - type of remote DADS machine (VMS or UNIX)

DADS_login - login id to use during FTP session on remote DADS machine

DADS_request_dir - directory on remote DADS machine where OTFR requests are picked up

DADS_response_dir - directory on remote DADS machine where OTFR responses are sent

local_otfr_request_incoming - directory on local OTFR system OTFR request are copied from the DADS system

local_otfr_request_dir - directory on local OTFR system where OTFR requests begin processing

OPUS Login File Contents

The OPUS architecture provides a way to define a set of environment variables across all processes that run within it. There is a special command file script, opus_login.csh, that is run prior to starting any process within OPUS. All variables defined in opus_login.csh will be available to processes run under OPUS.

These are the variables defined in the OTFR default opus_login.cshfile:

setenv DSQUERY - OPUS database server
setenv OPUS_DB - OPUS database
setenv SPSS_DB - SPSS database (assumed on DSQUERY server)
setenv ARCH_SERVER - DADS database server
setenv ARCH_DB - DADS database
setenv STSDASDISK - points to the STSDAS software tree
setenv TABLESDISK - points to the TABLES software tree
setenv STLOCAL - points to the STLOCAL software tree
setenv nref - location of NICMOS reference files
setenv ntab - location of NICMOS reference tables
setenv oref - location of STIS reference files
setenv otab - location of STIS reference tables
setenv jref - location of ACS reference files
setenv jtab - location of ACS reference tables
setenv uref - location of WFPC-2 reference files
setenv xref - location of FOC reference files
setenv yref - location of FOS reference files
setenv zref - location of HRS reference files
setenv xtab - location of FOC reference tables
setenv ytab - location of FOS reference tables
setenv ztab - location of HRS reference tables
setenv mtab - location of multiple-instrument reference tables
setenv crwfpc2comp - location of WFPC2 component lookup tables
setenv crotacomp - location of WFPC2 component lookup tables
setenv ttab - location of throughput tables

Error Handling

The OTFR pipeline will automatically detect pipeline errors and take action to first provide notification of the error, and then to "flush" the affected dataset from the OTFR pipeline, in whatever state the dataset happens to be. Both of these actions could be performed interactively by an operator using the OMG, however since operator attention to the OTFR pipeline is planned to be minimal, an automated system was deemed a requirement.

Each stage of the OTFR pipeline has four (4) error processes monitoring it:

Error Timer (_e)

This process will detect an error in a pipeline stage by searching for specific status codes (OSF_TRIGGERn) in the OSF column for that stage. If an error is detected, an OTFR response file is created with STATUS=STUCK. This puts this error "on the clock" and the dataset is pushed to the Error Notification process by setting the OSF column to '2'. If an OTFR response already exists for this dataset, then this is not the first error detected, and the dataset will be pushed to the Error Flushing process by setting the OSF column to '4'.

Error Notification (_n)

This process will check the age of the OTFR response file in the dataset directory against the process resource value ERROR_NOTIFY_INTERVAL. If the OTFR_response file is old enough, it will be copied back to DADS (RESPONSE_DIRECTORY) and the dataset will be pushed to the Error Flushing process by setting the OSF column to '4'. DADS will parse this response, and upon seeing the STATUS=STUCK value, will send e-mail notification to the end-user that the dataset has encountered a problem in OTFR processing and will be delayed. DADS should expect a second response from OTFR once the dataset is either repaired and completes normally, or is flushed from the OTFR pipeline. If the OTFR_response file is not old enough, then this dataset is pushed to the Error_Triggering process by setting the OSF column to '1'. The OTFR_response file will be checked again at a later time.

Error Triggering (_t)

This process will awaken periodically (DELTA_TIME) and check for datasets in need of retriggering in either the Error Notification (OSF column set to '1') or Error Flushing (OSF column set to '3') states. If any such datasets are found, their Observation Status Files (OSFs) are modified (set to '2' and '4' respectively) to send them back through the appropriate state, which will again check the age of the OTFR_response file.

Error Flushing (_f)

This process will check the age of the OTFR response file in the dataset directory against the process resource value ERROR_FLUSH_INTERVAL. If the response file is old enough, the dataset will be flushed by setting the OSF status columns as directed by the OSF_FLUSH entries in the process resource file. If a dataset hits an error in a pipeline stage that cannot be flushed (e.g. RETxxx, the stage that copies files back to DADS), the error will remain in the OTFR pipeline until it is attended to by the operator. These are considered cases in which the flush operation itself cannot be performed because of some problem (we can't "flush the flusher"). If the OTFR response file is not old enough, the dataset is pushed back to the Error Triggering process by setting the OSF column to '3', and the OTFR response will be checked again at a later time.

So the progression is

error detection; OSF goes from 'e' to '2'
error notification; OSF toggles between '2' and '1' until the notification timer (the age of the response file) expires and the OSF goes to '4'
error flushing; OSF toggles between '4' and '3' until the flushing timer (the age of the response file) expires and the OSF goes to 'f'

When the OSF goes to 'f', the dataset has been flushed.

The names of these error-handling processes follow this naming convention.

col_e - Error Timer process
col_n - Error Notification process
col_t - Error Triggering process
col_f - Error Flushing process

Note that the first three (3) characters reflect the name of the process for that stage. The underscore and letter indicate the particular error handling process. The error Timer (_e), Notification (_n), Triggering (_t), and Flushing (_f) processes are generic enough that they can be used for any instrument.

Note that the timer intervals (ERROR_NOTIFY_INTERVAL, ERROR_FLUSH_INTERVAL) in the process resource files are entered in different units. The ERROR_NOTIFY_INTERVAL is in units of HOURS, while the ERROR_FLUSH_INTERVAL is in units of DAYS. These seemed to be the most likely timescales on which these intervals would be set. Comments in the process resource file serve to remind the user of the units, if you ever want to change the values.

Errors in different OTFR pipeline stages will result in different flushing behavior.

    Error in Pipeline stage                 Action
-----------------------                 ------
PO - poll for OTFC_request              Send only OTFR_response back to DADS
LS - copy FITS files from DADS
RQ - submit an archive request
WK - segment a POD file
RV - reserve exposure names for science
MO - move POD/EDT to science pipeline

Errors in early pipeline stages result in only an OTFR_response file being returned to DADS, since no useful work has yet been done. The FILE_COUNT parameter in the response file will be set to zero (0) to indicate that no FITS files have been sent back by the OTFR pipeline.

    Error in Pipeline stage                 Action
-----------------------                 ------
CO - update FITS keywords               Send OTFR_response and any FITS files
2F - convert GEIS/ASCII to FITS           that have been created so far back
                                            to DADS.

Errors in intermediate stages result in a response file and any existing FITS files being returned to DADS.

    Error in Pipeline stage                 Action
-----------------------                 ------
RE - return FITS files to DADS          No action taken. Error requires
RS - create OTFC_response for DADS        operation intervention.
                                            (i.e. flushing has failed)

If a dataset hits an error in a pipeline stage that cannot be flushed (e.g. RE, the stage that copies files back to DADS), the error will remain in the OTFR pipeline until it is attended to by the operator. These are considered cases in which the flush operation itself cannot be performed because of some problem. No automated action seems possible.

Possible Operator Corrections of Problems

The OTFR operator will hopefully not need to attend to many pipeline problems. Here are a few known problems that could arise, and what actions can be taken by the operator to recover from them:

A User's Guide for the On-the-fly Re-processing (OTFR) System

Contents

Document Conventions

Overview

Environment / Dependencies

The Pipeline Stages

GETREQ (required process, does not show up in the pipeline stages)

POLxxx

LSTxxx

ARQxxx

WAKxxx

RSVxxx

MOVxxx

PODTRG

COLxxx

G2Fxxx

RETxxx

RSPxxx

CLNxxx

Changes to the Science Pipeline

Clearing out the OTFR System

Path File Contents

OPUS Login File Contents

Error Handling

Possible Operator Corrections of Problems

Back to Top of Document