Reingest Tools and Procedures

The reingest procedure transfers datasets into the reingest pipeline. This pipeline is a special ingest pipeline set up so that the operator can override the value used by the primary ingest pipeline for the archive_data_set_all.ads_data_source field, which lets the operator indicate the nature of the reprocessing. The reingest pipeline is also used for ingest and catalog regression tests, where the ads_data_source field is always REGR.

These tools were developed over multiple build releases under separate OPRs. Each OPR contains test plans that give examples of the use of these tools. A Web link will be provided for each OPR so that this additional information is easily available.

The procedure to reingest datasets usually involves three steps:

  1. Use a command-line tool to send datasets to the reingest pipeline.
  2. Run the reingest pipeline and wait until all datasets are ingested and all OSFs are deleted.
  3. Use another command-line tool to transfer virtual files and then completely catalog the datasets in the ARCH_DB database.

None of the tools will work if any OSFs are left in the reingest pipeline. This restriction ensures that there is no confusion about whether processing has completed for all datasets. In an ingest pipeline, the ingdel task removes any OSF that has a completion status (c) in its last column. To delete the OSF of a previous failure that requires reingesting, use the OMG for the reingest path to set its last column to c.
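The rule above can be illustrated with a small sketch. It assumes, purely for illustration, that each OSF status is available as a string with one character per pipeline stage. How an operator actually lists OSFs is not covered here, and the function names and the example status strings are hypothetical.

```python
from typing import Dict, List


def removable_by_ingdel(status: str) -> bool:
    """True if the last status column is 'c', the completion status
    that lets ingdel remove the OSF."""
    return bool(status) and status[-1] == "c"


def blocking_osfs(statuses: Dict[str, str]) -> List[str]:
    """Return the dataset names whose OSFs ingdel would NOT remove.

    `statuses` maps a dataset name to its OSF status string. Both are
    hypothetical representations used only for this sketch.
    """
    return sorted(name for name, status in statuses.items()
                  if not removable_by_ingdel(status))
```

An empty result from `blocking_osfs` corresponds to the state in which the tools are expected to work. Any name it returns would need its last column set to c through the OMG first.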

The data to be reingested must be segregated into directories that contain only the data allowed for each tool. Multiple datasets can be loaded into these directories. Usually a reingest task requires only one directory. The kinds of data that need a separate directory are shown in Tables 1 and 2. Association files can go into the same directory as the files for the members of that association.

OPUS 2010.4 changes should allow data to be ingested under either the name that comes out of the originating OPUS pipeline or the name that DADS delivers. The data format should not matter either. For example, SMS data can be in either yd5a11550.pod format or yt5813360_pod.fits format. PDQ data in a text file called o6n901fbq.pdq should work, and PDQ data that has already been converted to FITS and is called o6n901fbq_pdq.fits will also work. OMS FITS data called o6n901fbq.jif will work just as well as OMS FITS data called o6n901fbq_jif.fits.
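As a sketch of what accepting both naming styles means, the following helper reduces either style to the same (rootname, suffix) pair. It assumes names follow the two patterns seen in the examples above, `<rootname>.<suffix>` and `<rootname>_<suffix>.fits`, with a three-character suffix. The function name is hypothetical, and this is not the parsing logic used by the tools themselves.

```python
import re
from typing import Optional, Tuple

# <rootname>_<suffix>.fits, e.g. o6n901fbq_pdq.fits
_FITS_STYLE = re.compile(r"^(?P<root>[a-z0-9]+)_(?P<suffix>[a-z0-9]{3})\.fits$")
# <rootname>.<suffix>, e.g. o6n901fbq.pdq
_DOT_STYLE = re.compile(r"^(?P<root>[a-z0-9]+)\.(?P<suffix>[a-z0-9]{3})$")


def split_dataset_name(filename: str) -> Optional[Tuple[str, str]]:
    """Return (rootname, suffix) for either naming style, or None if
    the name matches neither pattern assumed in this sketch."""
    name = filename.lower()
    for pattern in (_FITS_STYLE, _DOT_STYLE):
        match = pattern.match(name)
        if match:
            return match.group("root"), match.group("suffix")
    return None
```

With this helper, `o6n901fbq.pdq` and `o6n901fbq_pdq.fits` both reduce to the same pair, which is a convenient way to check that a directory holds only one kind of data before running a tool on it.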

Two additional tools were intended for regression testing the ingest pipeline and the cataloging processes: ingest_regr_datasets.csh and catalog_regr_datasets.csh. It is not clear whether these tools have ever been used, whether they are still needed, or whether they would still work, but they were not deleted from the build tree. With the exception of those two scripts, AUTO regr_00202 now tests all these tools using real data, but skips DADS by using the nsa_bypass.pl tool discussed below.

Because each tool handles data with special requirements, the tools have slightly different interfaces. They share many common subroutines, and their output should look similar. Each tool can be called with no arguments to get a usage description. What follows is a combined description of all the tools.

Table 1. Tool usage by class or data_id

CLASS  Data ID   Tools                                          Notes
CAL    many      ingest_hst_cal_oms.pl, catalog_hst_cal_oms.pl  all CAL data can be in one directory in a single request
OMS    fgs, fas  ingest_hst_cal_oms.pl, catalog_hst_cal_oms.pl  all OMS data can be in one directory in a single request

Table 2. Tool usage by data_id

CLASS  Data ID                  Tools                                          Notes
EPC    epc                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
MSC    msc                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
MTL    mtl                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
ORB    orb                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
POD    pod                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
PRB    prb                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
PDQ    pdq                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
SMS    sms                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
TVI    tvi                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl  these types of data are waiting for PR 48609
TVL    tvl                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl  these types of data are waiting for PR 48609
ACC    oma                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl  these types of data are deprecated by PR 65747
ACM    acm                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
ANC    anc                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
AST    ast                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
CDB    cdb                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
CTB    ctb                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
DIA    adm, cdm, ndm, sdm, wdm  ingest_hst_non_cal.pl, catalog_hst_non_cal.pl  dia to mix and match
DMP    dmp                      ingest_hst_non_cal.pl, catalog_hst_non_cal.pl
EDT    edi, edl, edn, edo, edu  ingest_hst_non_cal.pl, catalog_hst_non_cal.pl  edt to mix and match
DLG    dlg                      ingest_dads_logs.pl                            no catalog required
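
The two tables can be condensed into a small lookup, sketched here for operators who script the choice of tool. The constant and function names are hypothetical. Only the tool names, archive classes, and data_ids come from the tables.

```python
from typing import Optional, Tuple

CAL_OMS_TOOLS = ("ingest_hst_cal_oms.pl", "catalog_hst_cal_oms.pl")
NON_CAL_TOOLS = ("ingest_hst_non_cal.pl", "catalog_hst_non_cal.pl")
DADS_LOG_TOOLS = ("ingest_dads_logs.pl", None)  # no catalog step required

# Table 1: keyed by archive class.
ARCHIVE_CLASSES = frozenset({"CAL", "OMS"})

# Table 2: keyed by data_id. tvi/tvl (waiting for PR 48609) and oma
# (deprecated by PR 65747) appear here only because the table lists them.
NON_CAL_DATA_IDS = frozenset("""
    epc msc mtl orb pod prb pdq sms tvi tvl oma acm anc ast cdb ctb
    adm cdm ndm sdm wdm dia dmp edi edl edn edo edu edt
""".split())


def tools_for(key: str) -> Tuple[str, Optional[str]]:
    """Return (ingest_tool, catalog_tool) for an archive class or data_id.

    catalog_tool is None when no catalog step is required.
    """
    if key.upper() in ARCHIVE_CLASSES:
        return CAL_OMS_TOOLS
    if key.lower() == "dlg":
        return DADS_LOG_TOOLS
    if key.lower() in NON_CAL_DATA_IDS:
        return NON_CAL_TOOLS
    raise KeyError(f"no reingest tool listed for {key!r}")
```

For example, `tools_for("pdq")` returns the non-CAL pair, while `tools_for("dlg")` returns `ingest_dads_logs.pl` with no catalog tool.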

Usage

The tools have similar but separate input parameters described below.

Input directory

Each tool now has a -i <input_dir> option that specifies the location of the input data, so the user no longer has to work from that directory. If this option is not provided, the data are assumed to be in the current working directory.

Tools that take archive_class of CAL or OMS as an input parameter

The CAL archive_class maps to many data_ids, but because the same ingest and catalog steps are done for all CAL data_ids, CAL data from multiple instruments can be mixed together in one input directory. The ingest and catalog steps for OMS data are similar to those for CAL data, so one set of tools serves both archive_classes:

>ingest_hst_cal_oms.pl -c <archive_class> -o <data_source> -i <input_dir>

>catalog_hst_cal_oms.pl -c <archive_class> -i <input_dir>
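
As a sketch of how the two usage lines above fit together, the helper below assembles both command lines from the same arguments. The function name and the example path are hypothetical, and REGR is borrowed from the regression-test description only as a placeholder data source. The helper builds strings and runs nothing, and the -i option is left off when no input directory is given so that the tool falls back to the current working directory.

```python
import shlex
from typing import List, Optional


def cal_oms_commands(archive_class: str, data_source: str,
                     input_dir: Optional[str] = None) -> List[str]:
    """Return the ingest and catalog command lines, in run order.

    The reingest pipeline must finish (all OSFs deleted) between the
    first command and the second.
    """
    if archive_class not in ("CAL", "OMS"):
        raise ValueError("archive_class must be CAL or OMS")
    tail = [] if input_dir is None else ["-i", input_dir]
    ingest = ["ingest_hst_cal_oms.pl", "-c", archive_class,
              "-o", data_source] + tail
    catalog = ["catalog_hst_cal_oms.pl", "-c", archive_class] + tail
    # shlex.quote protects paths that contain spaces or shell characters.
    return [" ".join(shlex.quote(part) for part in cmd)
            for cmd in (ingest, catalog)]


for line in cal_oms_commands("CAL", "REGR", "/data/reingest/cal"):
    print(line)
```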

Tools that take data_id as an input parameter

Most of the data_ids used as input to these tools correspond one-to-one to archive classes. The exceptions are the DIA and EDT archive classes, each of which maps to many data_ids. As of OPUS 2010.4, however, the strings dia and edt are accepted as the 'data_id' input for those types of data.

>ingest_hst_non_cal.pl -d <data_id> -o <data_source> -i <input_dir> (see original PR 51848)

>catalog_hst_non_cal.pl -d <data_id> -i <input_dir>

>ingest_dads_logs.pl -o <data_source> -i <input_dir> (see original PR 52338)
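
The choice of the -d value for a directory can be sketched as follows, using the DIA and EDT data_ids from Table 2. The function name is hypothetical, and the check that a directory must not mix unrelated data_ids reflects the segregation rule stated earlier rather than any behavior confirmed for the tools themselves.

```python
from typing import Iterable

DIA_IDS = frozenset({"adm", "cdm", "ndm", "sdm", "wdm"})
EDT_IDS = frozenset({"edi", "edl", "edn", "edo", "edu"})


def choose_data_id(ids_in_dir: Iterable[str]) -> str:
    """Pick the -d value for a directory holding the given data_ids."""
    ids = {data_id.lower() for data_id in ids_in_dir}
    if not ids:
        raise ValueError("no data_ids found")
    if len(ids) == 1:
        return next(iter(ids))  # a single data_id is passed as-is
    if ids <= DIA_IDS:
        return "dia"            # mixed DIA data_ids
    if ids <= EDT_IDS:
        return "edt"            # mixed EDT data_ids
    raise ValueError("mixed data_ids must be segregated into "
                     "separate directories: " + ", ".join(sorted(ids)))
```

A directory holding only pdq files yields `pdq`, one holding adm and wdm files yields `dia`, and one mixing pdq with adm is rejected.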

For all tools above:

Other related tools

>ingest_regr_datasets.csh <path> (see original OPR 51879)

>catalog_regr_datasets.csh <path>

The above two tools still exist, but it is not clear that they are ever used. PR 59963 modified them only to remove calls to tools that no longer exist.

Note that the FUSE tools were removed since we do not expect to reingest FUSE again.

More details

See also

Examples

Sara needs to reingest a broken NICMOS ASN. She may need to clean out the catalog if the ASN did not get in correctly the first time around.
(NOTE: This example only illustrates these tools; it is NOT a real broken NIC ASN!)
Last updated: OPUS 2010.4