Reingest Tools and Procedures

The reingest procedure transfers datasets into the reingest pipeline. This is a special ingest pipeline set up so that it can override the value the primary ingest pipeline uses for the archive_data_set_all.ads_data_source field, which lets the operator indicate the nature of the reprocessing. The reingest pipeline is also used for ingest and catalog regression tests, where the ads_data_source field is always REGR.

These tools were developed over multiple build releases under separate OPRs. Each OPR contains test plans that give examples of how the tools are used. A Web link will be provided for each OPR so that this additional information is easily available.

The procedure to reingest datasets usually involves three steps:

Use a command-line tool to send datasets to the reingest pipeline.
Run a reingest pipeline. Wait until all datasets are ingested and all OSFs are deleted.
Use another command-line tool to transfer virtual files and then completely catalog the datasets in the ARCH_DB database.

None of the tools will work if there are any OSFs left in the reingest pipeline. This is needed so that there is no confusion about the completion of the processing for all datasets. In an ingest pipeline, the ingdel task removes any OSF that has a completion status (c) in its last column. Use the OMG for the reingest path to set the last column to c, in order to delete an OSF for a previous failure that requires reingesting.
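
The ordering of the three steps, including the wait for an empty reingest pipeline, can be sketched as follows. This is only an illustration of the sequence, not part of the tool set: the three callables (send, osfs_remaining, catalog) are hypothetical hooks supplied by the caller, and how the OSF count is actually obtained is left open. The sketch also assumes the reingest pipeline has already been started by the operator.

```python
import time


def reingest(send, osfs_remaining, catalog, poll_seconds=60, sleep=time.sleep):
    """Run the three reingest steps in order.

    send, osfs_remaining, and catalog are caller-supplied callables
    (hypothetical hooks for illustration, not real OPUS interfaces).
    """
    send()                        # step 1: send datasets to the reingest pipeline
    while osfs_remaining() > 0:   # step 2: wait until every OSF has been deleted
        sleep(poll_seconds)
    catalog()                     # step 3: transfer virtual files and catalog
```

The catalog step is never reached while any OSF remains, which mirrors the rule that none of the tools will work with OSFs left in the reingest pipeline.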

The data that is to be reingested must be segregated into directories that contain only the data allowed for each tool. Multiple datasets can be loaded into these directories. Usually a reingest task requires only one directory. The kinds of data that need a separate directory are shown in Table 1. Association files can go into the same directory as the files for the members of that association.

OPUS 2010.4 changes should allow the data to be ingested in either the name that comes out of the originating OPUS pipeline, or the one that DADS delivers. The data format matters. In other words, SMS data can be in either yd5a11550.pod format, or yt5813360_pod.fits format. PDQ data that is a text file called o6n901fbq.pdq should work, and PDQ data that has already been converted to fits and is called o6n901fbq_pdq.fits will also work. OMS FITS data called o6n901fbq.jif will work just as well as OMS FITS data called o6n901fbq_jif.fits.
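
To make the two naming styles concrete, here is a minimal sketch that splits a filename of either style into a rootname and a suffix. It is based only on the example names above; the function name and the regular expressions are assumptions for illustration and are not how the tools themselves parse names.

```python
import re

# Style seen in names such as o6n901fbq_pdq.fits: <rootname>_<suffix>.fits
_FITS_STYLE = re.compile(r"^(?P<root>[^_.]+)_(?P<suffix>[a-z0-9]+)\.fits$")
# Style seen in names such as o6n901fbq.pdq: <rootname>.<suffix>
_PLAIN_STYLE = re.compile(r"^(?P<root>[^_.]+)\.(?P<suffix>[a-z0-9]+)$")


def split_dataset_name(filename):
    """Return (rootname, suffix) for either naming style, or None."""
    name = filename.lower()
    for pattern in (_FITS_STYLE, _PLAIN_STYLE):
        match = pattern.match(name)
        if match:
            return match.group("root"), match.group("suffix")
    return None
```

With this sketch, o6n901fbq.pdq and o6n901fbq_pdq.fits both yield the same rootname and suffix, which is the equivalence the paragraph above describes.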

There are two additional tools that were intended for regression testing the ingest pipeline and the cataloging processes: ingest_regr_datasets.csh and catalog_regr_datasets.csh. It is not clear whether these tools have ever been used, whether they are still needed, or even whether they would still work, but they were not deleted from the build tree. With the exception of those two scripts, AUTO regr_00202 now tests all these tools using real data, but skips DADS by using the nsa_bypass.pl tool discussed below.

Because each tool handles data with special requirements, the tools have slightly different interfaces. They share many common subroutines, and their output should look similar. Each tool can be called with no arguments to get a usage description. What follows is a combined description of all the tools.

Table 1. Tool usage by class or data_id
CLASS	Data ID	Tools
CAL	many	ingest_hst_cal_oms.pl catalog_hst_cal_oms.pl (all CAL data can be in one directory in a single request)
OMS	fgs, fas	ingest_hst_cal_oms.pl catalog_hst_cal_oms.pl (all OMS data can be in one directory in a single request)

Table 2. Tool usage by data_id
CLASS	Data ID	Tools
EPC	epc	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
MSC	msc	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
MTL	mtl	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
ORB	orb	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
POD	pod	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
PRB	prb	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
PDQ	pdq	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
SMS	sms	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
TVI	tvi	ingest_hst_non_cal.pl catalog_hst_non_cal.pl (this type of data is waiting for PR 48609)
TVL	tvl	ingest_hst_non_cal.pl catalog_hst_non_cal.pl (this type of data is waiting for PR 48609)
ACC	oma	ingest_hst_non_cal.pl catalog_hst_non_cal.pl (this type of data is deprecated by PR 65747)
ACM	acm	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
ANC	anc	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
AST	ast	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
CDB	cdb	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
CTB	ctb	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
DIA	adm, cdm, ndm, sdm, wdm (dia to mix and match)	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
DMP	dmp	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
EDT	edi, edl, edn, edo, edu (edt to mix and match)	ingest_hst_non_cal.pl catalog_hst_non_cal.pl
DLG	dlg	ingest_dads_logs.pl (no catalog required)
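
As a quick reference, the two tables can be expressed as a lookup that returns the ingest and catalog tool for a given archive class or data_id. This is a sketch transcribed from Tables 1 and 2; the function and constant names are made up for illustration, and the real tools do not expose such a lookup.

```python
# Tool pairs transcribed from Tables 1 and 2; None means no catalog step.
CAL_OMS_TOOLS = ("ingest_hst_cal_oms.pl", "catalog_hst_cal_oms.pl")
NON_CAL_TOOLS = ("ingest_hst_non_cal.pl", "catalog_hst_non_cal.pl")
DADS_LOG_TOOLS = ("ingest_dads_logs.pl", None)

ARCHIVE_CLASSES = {"CAL", "OMS"}  # Table 1: selected by archive class
NON_CAL_DATA_IDS = {              # Table 2: selected by data_id
    "epc", "msc", "mtl", "orb", "pod", "prb", "pdq", "sms",
    "tvi", "tvl",                 # waiting for PR 48609
    "oma",                        # deprecated by PR 65747
    "acm", "anc", "ast", "cdb", "ctb", "dmp",
    "adm", "cdm", "ndm", "sdm", "wdm", "dia",
    "edi", "edl", "edn", "edo", "edu", "edt",
}


def tools_for(key):
    """Return (ingest_tool, catalog_tool) for an archive class or data_id."""
    if key.upper() in ARCHIVE_CLASSES:
        return CAL_OMS_TOOLS
    if key.lower() == "dlg":
        return DADS_LOG_TOOLS
    if key.lower() in NON_CAL_DATA_IDS:
        return NON_CAL_TOOLS
    raise KeyError(f"no reingest tool listed for {key!r}")
```

For example, tools_for("sms") returns the non-CAL pair, while tools_for("dlg") returns ingest_dads_logs.pl with no catalog tool.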

Usage

The tools have similar but separate input parameters described below.

Input directory

Each tool now has a -i <input_dir> option, so the location of the input data can be specified instead of forcing the user to work from that directory. If this option is NOT provided, the data are assumed to be in the current working directory.

Tools that take archive_class of CAL or OMS as an input parameter

CAL archive_class data have a many-to-one relationship with data_ids, but since the same ingest and catalog steps are done for all CAL data_ids, CAL data from multiple instruments can be mixed together in one input directory. The ingest and catalog steps for OMS data are similar to CAL data, so there is one set of tools for both these archive_classes:

>ingest_hst_cal_oms.pl -c <archive_class> -o <data_source> -i <input_dir>

(see original PR 53339)

>catalog_hst_cal_oms.pl -c <archive_class> -i <input_dir>

<archive_class> is either CAL or OMS.

Tools that take data_id as an input parameter

Most of the data_ids used as input to these tools correspond to archive classes. The exceptions are the DIA and EDT archive classes, each of which maps to many data_ids. However, as of OPUS 2010.4, the strings dia or edt will work as 'data_id' input for those types of data.

>ingest_hst_non_cal.pl -d <data_id> -o <data_source> -i <input_dir> (see original PR 51848)

>catalog_hst_non_cal.pl -d <data_id> -i <input_dir>

The <data_id> is the OSF data_id used in normal processing. It is generally not equal to the archive class. If it is set to an invalid value, such as "xxx", then a list of valid values with the name of the related archive class is printed. This tool supports all HST data_ids except for classes CAL and OMS that have associations.

>ingest_dads_logs.pl -o <data_source> -i <input_dir> (see original PR 52338)

The data_id should NOT be specified on the command line, but is instead understood to be dlg.
At least one file in each dataset must have the extension ".log". There is no catalog tool for DADS logs.
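
The three ingest command lines above differ only in how the data type is selected. The following sketch assembles the argument list for each case using only the options shown above (-c, -d, -o, -i). The function name and its validation are assumptions for illustration; the four-character check reflects the description of <data_source> given under "For all tools above".

```python
def build_ingest_command(data_source, input_dir=None,
                         archive_class=None, data_id=None):
    """Assemble the argument list for one of the ingest tools.

    Give archive_class for CAL/OMS data, data_id for other HST data,
    or neither for DADS logs (the data_id dlg is implied).
    """
    if len(data_source) != 4:
        raise ValueError("data_source must be four characters")
    if archive_class is not None and data_id is not None:
        raise ValueError("give archive_class or data_id, not both")
    if archive_class is not None:
        if archive_class not in ("CAL", "OMS"):
            raise ValueError("archive_class must be CAL or OMS")
        argv = ["ingest_hst_cal_oms.pl", "-c", archive_class]
    elif data_id is not None:
        argv = ["ingest_hst_non_cal.pl", "-d", data_id]
    else:
        argv = ["ingest_dads_logs.pl"]
    argv += ["-o", data_source]
    if input_dir is not None:
        # Without -i the tools read from the current working directory.
        argv += ["-i", input_dir]
    return argv
```

Returning a list rather than a single string keeps each argument separate, so a directory name containing spaces would not be split if the list is later handed to a process launcher.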

For all tools above:

For verbose messaging, especially for debugging, use the standard OPUS environment variable MSG_REPORT_LEVEL. There is no longer a -v option. To turn messaging up for a run of the command-line tools, set MSG_REPORT_LEVEL on the command line before you run them. This affects only processes run in the same window thereafter, not the entire account you are running in.
E.g. setenv MSG_REPORT_LEVEL MSG_ALL
Note that the MSG_REPORT_LEVEL and STDB_REPORT_LEVEL settings are changed right before calls to genreq, InteractiveIngResponse, and update_db_tool. In the case of genreq and ingrsp, this was done to make debugging these command-line tools easier. It was thought we did not need to debug those processes since they were not changing. However, we may wish to reconsider this change in the future.
In the case of update_db_tool, no extra reporting is necessary, as this tool already does a wonderful job of making itself clear. Any extra reporting is redundant and makes it more difficult to see what the tool is doing.
The <data_source> is the four-character data source override value used as the value for ads_data_source in archive_data_set_all. This is a way for Ops to distinguish between test data, normally processed pipeline data, and reingested data.
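
When the tools are driven from a script rather than an interactive window, the same scoping of MSG_REPORT_LEVEL can be had by setting it only in the environment of the child process. The sketch below is an assumption about how one might do that in Python; the function name is hypothetical, and the demonstration uses a stand-in child process that merely prints the variable, not one of the real tools.

```python
import os
import subprocess
import sys


def run_with_report_level(argv, level="MSG_ALL"):
    """Run argv with MSG_REPORT_LEVEL set for that child process only."""
    child_env = dict(os.environ, MSG_REPORT_LEVEL=level)
    return subprocess.run(argv, env=child_env, capture_output=True, text=True)


# Stand-in child that echoes the variable; a real run would pass a tool's
# argument list instead.
demo = run_with_report_level(
    [sys.executable, "-c", "import os; print(os.environ['MSG_REPORT_LEVEL'])"]
)
```

Because the variable is set on a copy of the environment, the calling process and anything else in the account are left unchanged.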

Other related tools

>ingest_regr_datasets.csh <path> (see original OPR 51879)

>catalog_regr_datasets.csh <path>

The <path> is the path name of the regression Ingest pipeline.

The above two tools still exist, but it is not clear they are ever used. They were tweaked by PR 59963 only to remove calls to tools that no longer exist.

Note that the FUSE tools were removed since we do not expect to reingest FUSE again.

More details

Who would use this tool set?
- DADS Ops
- OPUS Ops
- OPUS testers
- OPUS developers
How does this tool differ from ARCINS?
From what I can glean, this tool has traditionally been used by DADS Ops personnel whereas ARCINS is traditionally used by OPUS Ops personnel to, for example, get failed data into the archive under the PRB archive class.
It needs to be tested in this context, but I believe ARCINS could be replaced by this set of tools.
Caveats
- It takes two scripts and a pipeline to reingest the data.
- You have to figure out which two scripts depending on your input data type.
- You have to segregate the input data by type. I am not sure I can disagree with this requirement, however. I suspect that it is not egregious since the tools are normally used either for small amounts of data at a time, or a block of all the same type of data.

Examples

Sara needs to reingest a broken NICMOS ASN. She may need to clean out the catalog if it did not get in correctly the first time around.
(NOTE: this is an example to illustrate these tools, this is NOT a real broken NIC ASN!)

In a directory of her choosing, which we will call /node/sara/data, she places these files:

n3uy01060.tra        n3uy01piq_raw.fits   n3uy01pkq_ima.fits
n3uy01060_asn.fits   n3uy01piq_spt.fits   n3uy01pkq_raw.fits
n3uy01061.tra        n3uy01piq_trl.fits   n3uy01pkq_spt.fits
n3uy01phq.tra        n3uy01piq_trl.txt    n3uy01pkq_trl.fits
n3uy01phq_cal.fits   n3uy01pjq.tra        n3uy01pkq_trl.txt
n3uy01phq_ima.fits   n3uy01pjq_cal.fits   n3uy01plq.tra
n3uy01phq_raw.fits   n3uy01pjq_ima.fits   n3uy01plq_cal.fits
n3uy01phq_spt.fits   n3uy01pjq_raw.fits   n3uy01plq_ima.fits
n3uy01phq_trl.fits   n3uy01pjq_spt.fits   n3uy01plq_raw.fits
n3uy01phq_trl.txt    n3uy01pjq_trl.fits   n3uy01plq_spt.fits
n3uy01piq.tra        n3uy01pjq_trl.txt    n3uy01plq_trl.fits
n3uy01piq_cal.fits   n3uy01pkq.tra

Optionally, Sara could set these environment variables on the command line:
- setenv MSG_REPORT_LEVEL MSG_ALL
- setenv STDB_REPORT_LEVEL STDB_INFO
This is a really good idea in case there are going to be questions about the results.
From /node/sara, Sara will first run this command:
ingest_hst_cal_oms.pl -c CAL -o DLSG -i ./data >& n3uy01060_ingest.out
Note: capturing the output is a really good idea in case there are going to be questions about the results.
Next, Sara would start an ingest pipeline in her reingest path (e.g. therappe), including the ingpol task.
When all the OSFs are gone from the reingest path, then Sara would run this command:
catalog_hst_cal_oms.pl -c CAL -o DLSG -i ./data >& n3uy01060_catalog.out
That's all there is to it.
Knowing Sara, she will verify that the catalog was filled as expected. Good idea :)

Last updated: OPUS 2010.4