The native replication for ZDLRA does not require a lot of maintenance or complicate tasks to keep it running. In my previous posts, I already wrote about an explanation about replication, how to configure the replication network between ZDLRA’s, how to configure the replication server, how to create the replication config (that links everything is done before), and how the replication protect the database. In this post, I will show some details that you need to monitor and to do maintain it running without errors.
RASYS.RA_REPLICATION_SERVER
Table RASYS.RA_REPLICATION_SERVER contains all the information about the replication servers that were created. From there we can check which backup policies are linked with the replication and the used streams:
SQL> col REPLICATION_SERVER_NAME format a30 SQL> col PROTECTION_POLICY format a20 SQL> col REP_SERVER_CONNECT_NAME format a20 SQL> set linesize 250 SQL> select REPLICATION_SERVER_NAME, REPLICATION_SERVER_STATE, PROTECTION_POLICY, REP_SERVER_CONNECT_NAME, MAX_STREAMS from RA_REPLICATION_SERVER; REPLICATION_SERVER_NAME REPLICATION_SERVER_ST PROTECTION_POLICY REP_SERVER_CONNECT_N MAX_STREAMS ------------------------------ --------------------- -------------------- -------------------- ----------- ZDLRAS2_REP RUNNING ZDLRAUP_BRONZE RASYS 4 ZDLRAS2_REP RUNNING ZDLRA_BRONZE RASYS 4 SQL>
As you can see, we have the column with useful information about the configured replication for policy. And we will have one line per replication server (that we created using the ADD_REPLICATION_SERVER as I explained here).
MAX_STREAMS
Streams for replication servers defines the maximum of simultaneous replications tasks that the replication server can operate. This is a global definition for the entire server. So, these streams slots will be shared by all protection policies linked with the server.
The default value is 4, but can be defined when you use the CREATE_REPLICATION_SERVER or updating later using the UPDATE_REPLICATION_SERVER:
SQL> BEGIN 2 DBMS_RA.UPDATE_REPLICATION_SERVER ( 3 replication_server_name => 'zdlras2_rep', 4 max_streams => 8); 5 END; 6 / PL/SQL procedure successfully completed. SQL> select REPLICATION_SERVER_NAME, REPLICATION_SERVER_STATE, PROTECTION_POLICY, REP_SERVER_CONNECT_NAME, MAX_STREAMS from RA_REPLICATION_SERVER; REPLICATION_SERVER_NAME REPLICATION_SERVER_ST PROTECTION_POLICY REP_SERVER_CONNECT_N MAX_STREAMS ------------------------------ --------------------- -------------------- -------------------- ----------- ZDLRAS2_REP RUNNING ZDLRAUP_BRONZE RASYS 8 ZDLRAS2_REP RUNNING ZDLRA_BRONZE RASYS 8 SQL>
As you can see, changing the server affected all. The way to tune this parameter is by checking the tasks that you have in the queue. If you see that tasks are increasing in the waitlist, you can think to increase the parameter. But please remember to verify your replication network (it is not at full usage), and the size of your backupset (incremental and archivelogs). Because if they are big, increase the concurrent stream task will not help.
TASKS and Library
Everything inside of ZDLRA is one task, INDEX_BACKUP, tape clones, validations, everything. And this is the same for replication. Each backupset that will be replication receives one task that is managed by upstream ZDLRA to send to downstream ZDLRA. And it is managed in the same way as a clone to tape tasks.
The main effort is to control and checks the tasks at the upstream side. From downstream the tasks will be just index_backup type.
RA_SBT_LIBRARY
This table store the information about the SBT library that you configured inside of ZDLRA. You will have one line per tape library and per server that you created for replication. This table is used to purely identify the LIB_KEY for your replication server (that will be used to track tasks later):
SQL> SELECT lib_key, lib_name FROM RA_SBT_LIBRARY; LIB_KEY LIB_NAME ---------- ------------------------------ 11234 OSBSBT 9044 REP$LIB_9043_ZDLRAS2_REP SQL>
RA_TASK
With the lib_key we can search inside of RA_TASK what are the tasks related to the replication. The type will be, most of the time the tasks will be BACKUP_SBT. The query bellow report all tasks running (I took after the backup was taken and the replication is running):
SQL> select TASK_ID, TASK_TYPE, STATE, WAITING_ON, DB_KEY, DB_UNIQUE_NAME, CREATION_TIME, ERROR_COUNT, INTERRUPT_COUNT, BP_KEY,BS_KEY,DF_KEY,VB_KEY from rasys.ra_task WHERE lib_key = 9044 and archived = 'N' order by 5,2,7,10,11,12,13; TASK_ID TASK_TYPE STATE WAITING_ON DB_KEY DB_UNIQUE_NAME CREATION_TIME ERROR_COUNT INTERRUPT_COUNT BP_KEY BS_KEY DF_KEY VB_KEY ---------- --------------- -------------------- ---------- ---------- -------------------- ----------------------------------- ----------- --------------- ---------- ---------- ---------- ---------- 50581 BACKUP_SBT COMPLETED 25140 ORCL19 23-JUL-20 12.24.52.632234 AM +02:00 0 0 31181 50584 BACKUP_SBT COMPLETED 25140 ORCL19 23-JUL-20 12.25.09.687280 AM +02:00 0 0 31185 50586 BACKUP_SBT COMPLETED 25140 ORCL19 23-JUL-20 12.25.22.223941 AM +02:00 0 0 31190 50592 BACKUP_SBT COMPLETED 25140 ORCL19 23-JUL-20 12.25.31.770802 AM +02:00 0 0 31195 50595 BACKUP_SBT EXECUTABLE 25140 ORCL19 23-JUL-20 12.25.43.271311 AM +02:00 0 0 31213 50596 SPAWN_SBT EXECUTABLE 23-JUL-20 12.25.43.379794 AM +02:00 0 0 6 rows selected. SQL>
As you can see, you have one BACKUP_SBT per backupset. A similar query can be done using table RA_SBT_TASK.
As usual for every ZDLRA task, the most important is always check the STATE column. If is COMPLETED, RUNNING, or EXECUTABLE is ok, but if the state is RECONCILE_WAIT this means that something is wrong with the replication (can be paused or with error) and is important to check the table ra_incident_log:
SQL> select error_text from ra_incident_log where status = 'ACTIVE' and parameter = 'RECONCILE' order by last_seen desc; ERROR_TEXT -------------------------------------------------------------------------------------------------------------------------- ORA-64771: reconcile error during DO_PENDING_REP_SETUP; replication server: ZDLRAS2_REP; database: ORCL19 ORA-06512: at "SYS.DBMS_SYS_ERROR", line 105 ORA-64772: downstream replication server: is not accessible ORA-06512: at "RASYS.DBMS_RCVCAT", line 3468 ORA-06512: at "RASYS.DBMS_RCVCAT", line 1809 ORA-06512: at "RASYS.DBMS_RCVCAT", line 1763 ORA-28002: the password will expire within 7 days ORA-06512: at "SYS.KBRSI_ICD", line 517 ORA-06512: at "RASYS.DBMS_RAI_WALLET2URL", line 9 ORA-06512: at line 1 ORA-06512: at "RASYS.DBMS_RCVCAT", line 1723 ERROR_TEXT -------------------------------------------------------------------------------------------------------------------------- ORA-06512: at "RASYS.DBMS_RCVCAT", line 3364 SQL>
PAUSE_REPLICATION_SERVER and RESUME_REPLICATION_SERVER
Sometimes is needed to pause the replication, it is a normal procedure and done using a well-documented procedure (to pause and resume the replication).
So, to pause the replication we use DBMS_RA.PAUSE_REPLICATION_SERVER and will affect the whole server (and not just one policy):
SQL> BEGIN 2 DBMS_RA.PAUSE_REPLICATION_SERVER ( 3 replication_server_name => 'ZDLRAS2_REP' 4 ); 5 END; 6 / PL/SQL procedure successfully completed. SQL> select REPLICATION_SERVER_NAME, PROTECTION_POLICY , REPLICATION_SERVER_STATE FROM RA_REPLICATION_SERVER; REPLICATION_SERVER_N PROTECTION_POLICY REPLICATION_SERVER_ST -------------------- -------------------- --------------------- ZDLRAS2_REP ZDLRAUP_BRONZE PAUSED ZDLRAS2_REP ZDLRA_BRONZE PAUSED SQL>
As you see above, the REPLICATION_SERVER_STATE column at RASYS.RA_REPLICATION_SERVER table will be as PAUSED.
When pausing the replication, all the tasks will be put in wait and will stay in the queue until you resume the replication:
SQL> select REPLICATION_SERVER_NAME, PROTECTION_POLICY , REPLICATION_SERVER_STATE FROM RA_REPLICATION_SERVER; REPLICATION_SERVER_N PROTECTION_POLICY REPLICATION_SERVER_ST -------------------- -------------------- --------------------- ZDLRAS2_REP ZDLRAUP_BRONZE PAUSED ZDLRAS2_REP ZDLRA_BRONZE PAUSED SQL> select TASK_ID, TASK_TYPE, STATE, WAITING_ON, DB_KEY, DB_UNIQUE_NAME, CREATION_TIME, ERROR_COUNT, INTERRUPT_COUNT, BP_KEY,BS_KEY,DF_KEY,VB_KEY from rasys.ra_task WHERE lib_key = 9044 and archived = 'N' order by 5,2,7,10,11,12,13; TASK_ID TASK_TYPE STATE WAITING_ON DB_KEY DB_UNIQUE_NAME CREATION_TIME ERROR_COUNT INTERRUPT_COUNT BP_KEY BS_KEY DF_KEY VB_KEY ---------- --------------- --------------- ---------- ---------- -------------------- ----------------------------------- ----------- --------------- ---------- ---------- ---------- ---------- 50614 BACKUP_SBT LIBRARY_WAIT 25140 ORCL19 23-JUL-20 12.49.32.442643 AM +02:00 0 0 31281 50617 BACKUP_SBT LIBRARY_WAIT 25140 ORCL19 23-JUL-20 12.49.37.325911 AM +02:00 0 0 31285 50615 SPAWN_SBT EXECUTABLE 23-JUL-20 12.49.32.592927 AM +02:00 0 0 SQL>
To resume the replication is easy and done with the procedure DBMS_RA.RESUME_REPLICATION_SERVER:
SQL> BEGIN 2 DBMS_RA.RESUME_REPLICATION_SERVER ( 3 replication_server_name => 'ZDLRAS2_REP' 4 ); 5 END; 6 / PL/SQL procedure successfully completed. SQL>
And after that, all tasks will be in EXECUTABLE mode:
SQL> select TASK_ID, TASK_TYPE, STATE, WAITING_ON, DB_KEY, DB_UNIQUE_NAME, CREATION_TIME, ERROR_COUNT, INTERRUPT_COUNT, BP_KEY,BS_KEY,DF_KEY,VB_KEY from rasys.ra_task WHERE lib_key = 9044 and archived = 'N' order by 5,2,7,10,11,12,13; TASK_ID TASK_TYPE STATE WAITING_ON DB_KEY DB_UNIQUE_NAME CREATION_TIME ERROR_COUNT INTERRUPT_COUNT BP_KEY BS_KEY DF_KEY VB_KEY ---------- --------------- --------------- ---------- ---------- -------------------- ----------------------------------- ----------- --------------- ---------- ---------- ---------- ---------- 50614 BACKUP_SBT RUNNING 25140 ORCL19 23-JUL-20 12.49.32.442643 AM +02:00 0 0 31281 50617 BACKUP_SBT RUNNING 25140 ORCL19 23-JUL-20 12.49.37.325911 AM +02:00 0 0 31285 50619 BACKUP_SBT RUNNING 25140 ORCL19 23-JUL-20 12.49.42.598054 AM +02:00 0 0 31289 50621 BACKUP_SBT RUNNING 25140 ORCL19 23-JUL-20 12.49.45.352626 AM +02:00 0 0 31293 50623 BACKUP_SBT EXECUTABLE 25140 ORCL19 23-JUL-20 12.49.50.271201 AM +02:00 0 0 31313 50625 BACKUP_SBT EXECUTABLE 25140 ORCL19 23-JUL-20 12.49.58.429705 AM +02:00 0 0 31333 50628 BACKUP_SBT EXECUTABLE 25140 ORCL19 23-JUL-20 12.50.00.074445 AM +02:00 0 0 31341 50633 BACKUP_SBT EXECUTABLE 25140 ORCL19 23-JUL-20 12.50.03.716264 AM +02:00 0 0 31354 50635 BACKUP_SBT EXECUTABLE 25140 ORCL19 23-JUL-20 12.50.05.911542 AM +02:00 0 0 31358 50638 BACKUP_SBT EXECUTABLE 25140 ORCL19 23-JUL-20 12.50.10.922916 AM +02:00 0 0 31382 10 rows selected. SQL>
RECONCILE
Reconcile is the process, at ZDLRA, that occurs when upstream and downstream exchange metadata and the “reconcile” the gaps. This means that upstream will check downstream what backupssets needs to be sent. The reconcile does not occur just when you resume the replication, it is done for every ingested backup at upstream.
If you receive errors about replication usually they will be noticed as reconcile:
SQL> select error_text from ra_incident_log where status = 'ACTIVE' and parameter = 'RECONCILE' order by last_seen desc; ERROR_TEXT -------------------------------------------------------------------------------------------------------------------------- ORA-64771: reconcile error during DO_PENDING_REP_SETUP; replication server: ZDLRAS2_REP; database: ORCL19 ORA-06512: at "SYS.DBMS_SYS_ERROR", line 105 ORA-64772: downstream replication server: is not accessible ORA-06512: at "RASYS.DBMS_RCVCAT", line 3468 ORA-06512: at "RASYS.DBMS_RCVCAT", line 1809 ORA-06512: at "RASYS.DBMS_RCVCAT", line 1763 ORA-28002: the password will expire within 7 days ORA-06512: at "SYS.KBRSI_ICD", line 517 ORA-06512: at "RASYS.DBMS_RAI_WALLET2URL", line 9 ORA-06512: at line 1 ORA-06512: at "RASYS.DBMS_RCVCAT", line 1723 ERROR_TEXT -------------------------------------------------------------------------------------------------------------------------- ORA-06512: at "RASYS.DBMS_RCVCAT", line 3364 SQL>
After you solve the problem, you can call the DBMS_RA.RESUME_REPLICATION_SERVER to do a “force” restart of replication. Doing that the reconcile will be called for all replicated databases. There is a undocumented procedure (dbms_ra_int.replication_reconcile(‘<DBNAME>’,'<REP_SERVER_NAME>’)) that you can call to just one database, but is not needed to use it (normal resume is enough).
Replication and Patch
The replication for ZDLRA is basically a rman SBT tape copy (in a high, high abstraction) between two catalogs. So, when your patch is recommended that who received the replication will be at higher version.
So, if you are using replication (whatever the mode: hub, bi-directional, etc), the basic order is:
- Pause replication
- Patch downstream
- Patch upstream
- Resume replication
Of course, sometimes we can’t patch both at the same time (upstream and downstream). But always pause the replication before patch downstream, and resume after patch it. This guarantee that those who received will be always in a higher version than those who send. If you need, you can resume the replication before patch the upstream to allow the reconcile and reduce the amount of data that need to be transferred (just remember to stop it before patch the upstream)
When you operate in a bi-directional way, there is no change. You resume the replication after patch both sides (since all sides will be upstream and downstream).
Know issues
There is a very know issue for replication and it related to the username used to replicate. This user needs to be created to be used in the wallet for DBMS_RA.CREATE_REPLICATION_SERVER (I explained here). But after some time the password for this user can expire and all the replication needs to occurs any kind of warning.
So, if you see this kind of error:
SQL> select error_text from ra_incident_log where status = 'ACTIVE' and parameter = 'RECONCILE' order by last_seen desc; ERROR_TEXT -------------------------------------------------------------------------------------------------------------------------- ORA-64771: reconcile error during DO_PENDING_REP_SETUP; replication server: ZDLRAS2_REP; database: ORCL19 ORA-06512: at "SYS.DBMS_SYS_ERROR", line 105 ORA-64772: downstream replication server: is not accessible ORA-06512: at "RASYS.DBMS_RCVCAT", line 3468 ORA-06512: at "RASYS.DBMS_RCVCAT", line 1809 ORA-06512: at "RASYS.DBMS_RCVCAT", line 1763 ORA-28002: the password will expire within 7 days ORA-06512: at "SYS.KBRSI_ICD", line 517 ORA-06512: at "RASYS.DBMS_RAI_WALLET2URL", line 9 ORA-06512: at line 1 ORA-06512: at "RASYS.DBMS_RCVCAT", line 1723 ERROR_TEXT -------------------------------------------------------------------------------------------------------------------------- ORA-06512: at "RASYS.DBMS_RCVCAT", line 3364 SQL>
You just need to go to downstream and reset the password for the user. You can change the profile if want too.
MAINTAIN, SUSTAIN, KEEP the Replication
Whatever you call it, MAINTAIN, SUSTAIN, or KEEP the replication running is not difficult. The process is very similar to normal ZDLRA operation (the tasks are there and the concept is the same).
The most important is to take care of the queue, if you see a long queue at RA_TASK table maybe you need to tune the max_streams for the server. If you already have done this, check if not reach the maximum throughput for your replication network. If you are sharing ingest and replication at the same interface you can reach a race between them and will be better to isolate them (as recommended at best practices).
Disclaimer: “The postings on this site are my own and don’t necessarily represent my actual employer positions, strategies or opinions. The information here was edited to be useful for general purpose, specific data and identifications were removed to allow reach the generic audience and to be useful for the community. Post protected by copyright.”