ZDLRA, How to Maintain the Native Replication

The native replication for ZDLRA does not require a lot of maintenance or complicate tasks to keep it running. In my previous posts, I already wrote about an explanation about replication, how to configure the replication network between ZDLRA’s, how to configure the replication server, how to create the replication config (that links everything is done before), and how the replication protect the database. In this post, I will show some details that you need to monitor and to do maintain it running without errors.

RASYS.RA_REPLICATION_SERVER

Table RASYS.RA_REPLICATION_SERVER contains all the information about the replication servers that were created. From there we can check which backup policies are linked with the replication and the used streams:

SQL> col REPLICATION_SERVER_NAME format a30
SQL> col PROTECTION_POLICY format a20
SQL> col REP_SERVER_CONNECT_NAME format a20
SQL> set linesize 250
SQL> select REPLICATION_SERVER_NAME, REPLICATION_SERVER_STATE, PROTECTION_POLICY, REP_SERVER_CONNECT_NAME, MAX_STREAMS from RA_REPLICATION_SERVER;

REPLICATION_SERVER_NAME        REPLICATION_SERVER_ST PROTECTION_POLICY    REP_SERVER_CONNECT_N MAX_STREAMS
------------------------------ --------------------- -------------------- -------------------- -----------
ZDLRAS2_REP                    RUNNING               ZDLRAUP_BRONZE       RASYS                          4
ZDLRAS2_REP                    RUNNING               ZDLRA_BRONZE         RASYS                          4

SQL>

As you can see, we have the column with useful information about the configured replication for policy. And we will have one line per replication server (that we created using the ADD_REPLICATION_SERVER as I explained here).

MAX_STREAMS

Streams for replication servers defines the maximum of simultaneous replications tasks that the replication server can operate. This is a global definition for the entire server. So, these streams slots will be shared by all protection policies linked with the server.

The default value is 4, but can be defined when you use the CREATE_REPLICATION_SERVER or updating later using the UPDATE_REPLICATION_SERVER:

SQL> BEGIN
  2  DBMS_RA.UPDATE_REPLICATION_SERVER (
  3      replication_server_name => 'zdlras2_rep',
  4      max_streams => 8);
  5  END;
  6  /

PL/SQL procedure successfully completed.

SQL> select REPLICATION_SERVER_NAME, REPLICATION_SERVER_STATE, PROTECTION_POLICY, REP_SERVER_CONNECT_NAME, MAX_STREAMS from RA_REPLICATION_SERVER;

REPLICATION_SERVER_NAME        REPLICATION_SERVER_ST PROTECTION_POLICY    REP_SERVER_CONNECT_N MAX_STREAMS
------------------------------ --------------------- -------------------- -------------------- -----------
ZDLRAS2_REP                    RUNNING               ZDLRAUP_BRONZE       RASYS                          8
ZDLRAS2_REP                    RUNNING               ZDLRA_BRONZE         RASYS                          8

SQL>

As you can see, changing the server affected all. The way to tune this parameter is by checking the tasks that you have in the queue. If you see that tasks are increasing in the waitlist, you can think to increase the parameter. But please remember to verify your replication network (it is not at full usage), and the size of your backupset (incremental and archivelogs). Because if they are big, increase the concurrent stream task will not help.

TASKS and Library

Everything inside of ZDLRA is one task, INDEX_BACKUP, tape clones, validations, everything. And this is the same for replication. Each backupset that will be replication receives one task that is managed by upstream ZDLRA to send to downstream ZDLRA. And it is managed in the same way as a clone to tape tasks.

The main effort is to control and checks the tasks at the upstream side. From downstream the tasks will be just index_backup type.

RA_SBT_LIBRARY

This table store the information about the SBT library that you configured inside of ZDLRA. You will have one line per tape library and per server that you created for replication. This table is used to purely identify the LIB_KEY for your replication server (that will be used to track tasks later):

SQL> SELECT lib_key, lib_name FROM RA_SBT_LIBRARY;

   LIB_KEY LIB_NAME
---------- ------------------------------
     11234 OSBSBT
      9044 REP$LIB_9043_ZDLRAS2_REP

SQL>

RA_TASK

With the lib_key we can search inside of RA_TASK what are the tasks related to the replication. The type will be, most of the time the tasks will be BACKUP_SBT. The query bellow report all tasks running (I took after the backup was taken and the replication is running):

SQL> select TASK_ID, TASK_TYPE, STATE, WAITING_ON, DB_KEY, DB_UNIQUE_NAME, CREATION_TIME, ERROR_COUNT, INTERRUPT_COUNT, BP_KEY,BS_KEY,DF_KEY,VB_KEY from rasys.ra_task WHERE lib_key = 9044 and archived = 'N' order by 5,2,7,10,11,12,13;

   TASK_ID TASK_TYPE       STATE                WAITING_ON     DB_KEY DB_UNIQUE_NAME       CREATION_TIME                       ERROR_COUNT INTERRUPT_COUNT     BP_KEY     BS_KEY     DF_KEY     VB_KEY
---------- --------------- -------------------- ---------- ---------- -------------------- ----------------------------------- ----------- --------------- ---------- ---------- ---------- ----------
     50581 BACKUP_SBT      COMPLETED                            25140 ORCL19               23-JUL-20 12.24.52.632234 AM +02:00           0               0                 31181
     50584 BACKUP_SBT      COMPLETED                            25140 ORCL19               23-JUL-20 12.25.09.687280 AM +02:00           0               0                 31185
     50586 BACKUP_SBT      COMPLETED                            25140 ORCL19               23-JUL-20 12.25.22.223941 AM +02:00           0               0                 31190
     50592 BACKUP_SBT      COMPLETED                            25140 ORCL19               23-JUL-20 12.25.31.770802 AM +02:00           0               0                 31195
     50595 BACKUP_SBT      EXECUTABLE                           25140 ORCL19               23-JUL-20 12.25.43.271311 AM +02:00           0               0                 31213
     50596 SPAWN_SBT       EXECUTABLE                                                      23-JUL-20 12.25.43.379794 AM +02:00           0               0

6 rows selected.

SQL>

As you can see, you have one BACKUP_SBT per backupset. A similar query can be done using table RA_SBT_TASK.

As usual for every ZDLRA task, the most important is always check the STATE column. If is COMPLETED, RUNNING, or EXECUTABLE is ok, but if the state is RECONCILE_WAIT this means that something is wrong with the replication (can be paused or with error) and is important to check the table ra_incident_log:

SQL> select error_text from ra_incident_log where status = 'ACTIVE' and parameter = 'RECONCILE' order by last_seen desc;

ERROR_TEXT
--------------------------------------------------------------------------------------------------------------------------
ORA-64771: reconcile error during DO_PENDING_REP_SETUP; replication server: ZDLRAS2_REP; database: ORCL19
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 105
ORA-64772: downstream replication server:  is not accessible
ORA-06512: at "RASYS.DBMS_RCVCAT", line 3468
ORA-06512: at "RASYS.DBMS_RCVCAT", line 1809
ORA-06512: at "RASYS.DBMS_RCVCAT", line 1763
ORA-28002: the password will expire within 7 days
ORA-06512: at "SYS.KBRSI_ICD", line 517
ORA-06512: at "RASYS.DBMS_RAI_WALLET2URL", line 9
ORA-06512: at line 1
ORA-06512: at "RASYS.DBMS_RCVCAT", line 1723

ERROR_TEXT
--------------------------------------------------------------------------------------------------------------------------
ORA-06512: at "RASYS.DBMS_RCVCAT", line 3364


SQL>

PAUSE_REPLICATION_SERVER and RESUME_REPLICATION_SERVER

Sometimes is needed to pause the replication, it is a normal procedure and done using a well-documented procedure (to pause and resume the replication).

So, to pause the replication we use DBMS_RA.PAUSE_REPLICATION_SERVER and will affect the whole server (and not just one policy):

SQL> BEGIN
  2      DBMS_RA.PAUSE_REPLICATION_SERVER (
  3          replication_server_name => 'ZDLRAS2_REP'
  4      );
  5  END;
  6  /

PL/SQL procedure successfully completed.

SQL> select REPLICATION_SERVER_NAME, PROTECTION_POLICY , REPLICATION_SERVER_STATE FROM RA_REPLICATION_SERVER;

REPLICATION_SERVER_N PROTECTION_POLICY    REPLICATION_SERVER_ST
-------------------- -------------------- ---------------------
ZDLRAS2_REP          ZDLRAUP_BRONZE       PAUSED
ZDLRAS2_REP          ZDLRA_BRONZE         PAUSED

SQL>

As you see above, the REPLICATION_SERVER_STATE column at RASYS.RA_REPLICATION_SERVER table will be as PAUSED.

When pausing the replication, all the tasks will be put in wait and will stay in the queue until you resume the replication:

SQL> select REPLICATION_SERVER_NAME, PROTECTION_POLICY , REPLICATION_SERVER_STATE FROM RA_REPLICATION_SERVER;

REPLICATION_SERVER_N PROTECTION_POLICY    REPLICATION_SERVER_ST
-------------------- -------------------- ---------------------
ZDLRAS2_REP          ZDLRAUP_BRONZE       PAUSED
ZDLRAS2_REP          ZDLRA_BRONZE         PAUSED

SQL> select TASK_ID, TASK_TYPE, STATE, WAITING_ON, DB_KEY, DB_UNIQUE_NAME, CREATION_TIME, ERROR_COUNT, INTERRUPT_COUNT, BP_KEY,BS_KEY,DF_KEY,VB_KEY from rasys.ra_task WHERE lib_key = 9044 and archived = 'N' order by 5,2,7,10,11,12,13;

   TASK_ID TASK_TYPE       STATE           WAITING_ON     DB_KEY DB_UNIQUE_NAME       CREATION_TIME                       ERROR_COUNT INTERRUPT_COUNT     BP_KEY     BS_KEY     DF_KEY     VB_KEY
---------- --------------- --------------- ---------- ---------- -------------------- ----------------------------------- ----------- --------------- ---------- ---------- ---------- ----------
     50614 BACKUP_SBT      LIBRARY_WAIT                    25140 ORCL19               23-JUL-20 12.49.32.442643 AM +02:00           0               0                 31281
     50617 BACKUP_SBT      LIBRARY_WAIT                    25140 ORCL19               23-JUL-20 12.49.37.325911 AM +02:00           0               0                 31285
     50615 SPAWN_SBT       EXECUTABLE                                                 23-JUL-20 12.49.32.592927 AM +02:00           0               0

SQL>

To resume the replication is easy and done with the procedure DBMS_RA.RESUME_REPLICATION_SERVER:

SQL> BEGIN
  2      DBMS_RA.RESUME_REPLICATION_SERVER (
  3          replication_server_name => 'ZDLRAS2_REP'
  4      );
  5  END;
  6  /

PL/SQL procedure successfully completed.

SQL>

And after that, all tasks will be in EXECUTABLE mode:

SQL> select TASK_ID, TASK_TYPE, STATE, WAITING_ON, DB_KEY, DB_UNIQUE_NAME, CREATION_TIME, ERROR_COUNT, INTERRUPT_COUNT, BP_KEY,BS_KEY,DF_KEY,VB_KEY from rasys.ra_task WHERE lib_key = 9044 and archived = 'N' order by 5,2,7,10,11,12,13;

   TASK_ID TASK_TYPE       STATE           WAITING_ON     DB_KEY DB_UNIQUE_NAME       CREATION_TIME                       ERROR_COUNT INTERRUPT_COUNT     BP_KEY     BS_KEY     DF_KEY     VB_KEY
---------- --------------- --------------- ---------- ---------- -------------------- ----------------------------------- ----------- --------------- ---------- ---------- ---------- ----------
     50614 BACKUP_SBT      RUNNING                         25140 ORCL19               23-JUL-20 12.49.32.442643 AM +02:00           0               0                 31281
     50617 BACKUP_SBT      RUNNING                         25140 ORCL19               23-JUL-20 12.49.37.325911 AM +02:00           0               0                 31285
     50619 BACKUP_SBT      RUNNING                         25140 ORCL19               23-JUL-20 12.49.42.598054 AM +02:00           0               0                 31289
     50621 BACKUP_SBT      RUNNING                         25140 ORCL19               23-JUL-20 12.49.45.352626 AM +02:00           0               0                 31293
     50623 BACKUP_SBT      EXECUTABLE                      25140 ORCL19               23-JUL-20 12.49.50.271201 AM +02:00           0               0                 31313
     50625 BACKUP_SBT      EXECUTABLE                      25140 ORCL19               23-JUL-20 12.49.58.429705 AM +02:00           0               0                 31333
     50628 BACKUP_SBT      EXECUTABLE                      25140 ORCL19               23-JUL-20 12.50.00.074445 AM +02:00           0               0                 31341
     50633 BACKUP_SBT      EXECUTABLE                      25140 ORCL19               23-JUL-20 12.50.03.716264 AM +02:00           0               0                 31354
     50635 BACKUP_SBT      EXECUTABLE                      25140 ORCL19               23-JUL-20 12.50.05.911542 AM +02:00           0               0                 31358
     50638 BACKUP_SBT      EXECUTABLE                      25140 ORCL19               23-JUL-20 12.50.10.922916 AM +02:00           0               0                 31382

10 rows selected.

SQL>

RECONCILE

Reconcile is the process, at ZDLRA, that occurs when upstream and downstream exchange metadata and the “reconcile” the gaps. This means that upstream will check downstream what backupssets needs to be sent. The reconcile does not occur just when you resume the replication, it is done for every ingested backup at upstream.

If you receive errors about replication usually they will be noticed as reconcile:

SQL> select error_text from ra_incident_log where status = 'ACTIVE' and parameter = 'RECONCILE' order by last_seen desc;

ERROR_TEXT
--------------------------------------------------------------------------------------------------------------------------
ORA-64771: reconcile error during DO_PENDING_REP_SETUP; replication server: ZDLRAS2_REP; database: ORCL19
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 105
ORA-64772: downstream replication server:  is not accessible
ORA-06512: at "RASYS.DBMS_RCVCAT", line 3468
ORA-06512: at "RASYS.DBMS_RCVCAT", line 1809
ORA-06512: at "RASYS.DBMS_RCVCAT", line 1763
ORA-28002: the password will expire within 7 days
ORA-06512: at "SYS.KBRSI_ICD", line 517
ORA-06512: at "RASYS.DBMS_RAI_WALLET2URL", line 9
ORA-06512: at line 1
ORA-06512: at "RASYS.DBMS_RCVCAT", line 1723

ERROR_TEXT
--------------------------------------------------------------------------------------------------------------------------
ORA-06512: at "RASYS.DBMS_RCVCAT", line 3364


SQL>

After you solve the problem, you can call the DBMS_RA.RESUME_REPLICATION_SERVER to do a “force” restart of replication. Doing that the reconcile will be called for all replicated databases. There is a undocumented procedure (dbms_ra_int.replication_reconcile(‘<DBNAME>’,'<REP_SERVER_NAME>’)) that you can call to just one database, but is not needed to use it (normal resume is enough).

Replication and Patch

The replication for ZDLRA is basically a rman SBT tape copy (in a high, high abstraction) between two catalogs. So, when your patch is recommended that who received the replication will be at higher version.

So, if you are using replication (whatever the mode: hub, bi-directional, etc), the basic order is:

Pause replication
Patch downstream
Patch upstream
Resume replication

Of course, sometimes we can’t patch both at the same time (upstream and downstream). But always pause the replication before patch downstream, and resume after patch it. This guarantee that those who received will be always in a higher version than those who send. If you need, you can resume the replication before patch the upstream to allow the reconcile and reduce the amount of data that need to be transferred (just remember to stop it before patch the upstream)

When you operate in a bi-directional way, there is no change. You resume the replication after patch both sides (since all sides will be upstream and downstream).

Know issues

There is a very know issue for replication and it related to the username used to replicate. This user needs to be created to be used in the wallet for DBMS_RA.CREATE_REPLICATION_SERVER (I explained here). But after some time the password for this user can expire and all the replication needs to occurs any kind of warning.

So, if you see this kind of error:

SQL> select error_text from ra_incident_log where status = 'ACTIVE' and parameter = 'RECONCILE' order by last_seen desc;

ERROR_TEXT
--------------------------------------------------------------------------------------------------------------------------
ORA-64771: reconcile error during DO_PENDING_REP_SETUP; replication server: ZDLRAS2_REP; database: ORCL19
ORA-06512: at "SYS.DBMS_SYS_ERROR", line 105
ORA-64772: downstream replication server:  is not accessible
ORA-06512: at "RASYS.DBMS_RCVCAT", line 3468
ORA-06512: at "RASYS.DBMS_RCVCAT", line 1809
ORA-06512: at "RASYS.DBMS_RCVCAT", line 1763
ORA-28002: the password will expire within 7 days
ORA-06512: at "SYS.KBRSI_ICD", line 517
ORA-06512: at "RASYS.DBMS_RAI_WALLET2URL", line 9
ORA-06512: at line 1
ORA-06512: at "RASYS.DBMS_RCVCAT", line 1723

ERROR_TEXT
--------------------------------------------------------------------------------------------------------------------------
ORA-06512: at "RASYS.DBMS_RCVCAT", line 3364


SQL>

You just need to go to downstream and reset the password for the user. You can change the profile if want too.

MAINTAIN, SUSTAIN, KEEP the Replication

Whatever you call it, MAINTAIN, SUSTAIN, or KEEP the replication running is not difficult. The process is very similar to normal ZDLRA operation (the tasks are there and the concept is the same).

The most important is to take care of the queue, if you see a long queue at RA_TASK table maybe you need to tune the max_streams for the server. If you already have done this, check if not reach the maximum throughput for your replication network. If you are sharing ingest and replication at the same interface you can reach a race between them and will be better to isolate them (as recommended at best practices).

Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community. Post protected by copyright.”

Fernando Simon

Have you hugged your backup today?