In this article we will see how to perform a failover and a reinstate in an environment where the Broker is configured. If you have been following the series so far, you will have seen that we have already covered manual failover, reinstate, and switchover (all manual, without the Broker), and in the last article we configured the Broker.
In any case, we have not yet reached an ideal Oracle MAA (Maximum Availability Architecture) environment with Data Guard and Oracle RAC. Fast-Start Failover has not been configured yet, so even with the Broker we still need manual intervention if the environment fails.
SIXTH ARTICLE
In this article we will see how to proceed when the environment fails while the Broker is in place: how to perform the failover and the reinstate through the Broker. As I said above, the MAA environment is not finished yet, but that does not prevent a failure from happening and forcing you to act.
I believe the series has already made it clear that many of these steps are not applied in a real environment in the same sequence presented here; you will not perform a failover just to test the Broker – you configure everything up front without having to "restore" the environment. Throughout these articles we have covered the various points involved in an MAA environment, and you must be prepared to act even if your environment is not fully configured. Are you going to tell your manager that you cannot perform a failover because the Broker is not configured?
ENVIRONMENT
In this article we have an environment with two Oracle RAC clusters synchronized through Data Guard. The primary database is "maa" and the standby is "maastb", both synchronized with DG operating in MAXIMUM AVAILABILITY and with real-time apply. Finally, the Broker is configured and "watching over" the environment.
Before continuing, let us briefly return to the point highlighted in the manual failover article: synchronization and the importance of the environment. Regarding synchronization, in a DG environment it must be one of your key concerns – you operate in MAA and do not keep your environment synchronized? As for importance, if you are operating in MAA with Oracle RAC on both sides, then your primary and standby environments are both important, aren't they?
So, you have a database that stores important data and you do not keep your standby synchronized with the primary? Have you already warned your manager that you are going to lose data?
This does not mean that every MAA environment with DG must never lose data. Perhaps, for design reasons, there is an agreement about what information may be lost; some of it can be recovered by other means. In any case, know your environment well and know all its details.
CONTROL
In previous articles the same trick was used to demonstrate DG synchronization, and we return to it here. Using the same table created before, we clean its data and insert rows from each instance of the primary. With this table we show that, in the current environment, the transactions are protected even without generating archivelogs.
On primary instance maa1:

SQL> SELECT database_role FROM v$database;

DATABASE_ROLE
----------------
PRIMARY

SQL> SELECT instance_name FROM v$instance;

INSTANCE_NAME
----------------
maa1

SQL> DELETE FROM testedg;

1 row deleted.

SQL> INSERT INTO testedg(c1, c2) VALUES(1, sysdate);

1 row created.

SQL> COMMIT;

Commit complete.

SQL>

On primary instance maa2:

SQL> SELECT instance_name FROM v$instance;

INSTANCE_NAME
----------------
maa2

SQL> INSERT INTO testedg(c1, c2) VALUES(2, sysdate);

1 row created.

SQL> COMMIT;

Commit complete.

SQL> SELECT c1, TO_CHAR(c2, 'DD/MM/RRRR HH24:MI:SS') AS momento FROM testedg;

        C1 MOMENTO
---------- -------------------
         1 11/05/2014 12:01:07
         2 11/05/2014 12:02:17

SQL>
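The exact DDL of testedg was shown in an earlier article of the series, not here; a minimal definition compatible with the inserts above would be something along these lines:

SQL> CREATE TABLE testedg (c1 NUMBER, c2 DATE);

Table created.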
FAILOVER
Creating the failure
Here the failure will be a bit more catastrophic than in the last failover: the servers of the primary Oracle RAC were physically powered off. Suppose something similar happens in your environment: the storage controller for the disks holding your database fails and your RAID was not there to save you? Imagine your Exadata loses more than two storage cells during the night shift.
So you arrive in the morning (if they have not already called you during the night) and there is already a committee "welcoming" you with scared faces, ready to hand you the bomb and say: good luck! Well, we will see that the scenario is not that dark and that it is possible to get out of it quite calmly.
FAILOVER TO
I already said above that the first concern is synchronization, but since you have the whole environment configured correctly you know that synchronization will not be a problem. If you want to check, connect to the standby and verify its state:
SQL> SELECT database_role FROM v$database;

DATABASE_ROLE
----------------
PHYSICAL STANDBY

SQL> SELECT protection_mode, protection_level FROM v$database;

PROTECTION_MODE      PROTECTION_LEVEL
-------------------- --------------------
MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY

SQL>
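If you want an extra check of the lag, v$dataguard_stats on the standby gives a quick view (output below is only illustrative of what a healthy environment looks like):

SQL> SELECT name, value FROM v$dataguard_stats WHERE name IN ('transport lag', 'apply lag');

NAME                 VALUE
-------------------- --------------------
transport lag        +00 00:00:00
apply lag            +00 00:00:00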
Since the Broker is configured, we will use it. One detail here: if you try to connect to the primary you will receive an error, so you have to use the standby connection in the Broker:
[oracle@rac11stb01 ~]$ dgmgrl
DGMGRL for Linux: Version 11.2.0.3.0 - 64bit Production

Copyright (c) 2000, 2009, Oracle. All rights reserved.

Welcome to DGMGRL, type "help" for information.
DGMGRL> CONNECT sys@maa
Password:
Unable to connect to database
ORA-12543: TNS:destination host unreachable

Failed.
DGMGRL> CONNECT sys@maastb
Password:
Connected.
DGMGRL>
As we saw above, the Broker cannot connect to the primary environment. We know it is down and that we need the failover. Of course, before performing the failover you talked to the infrastructure/system team and confirmed that the environment cannot be brought back quickly. Thus, the only option is the failover.
Once connected to the Broker you can check the state of your environment:
DGMGRL> SHOW CONFIGURATION VERBOSE;

Configuration - dgrac

  Protection Mode: MaxAvailability
  Databases:
    maa    - Primary database
    maastb - Physical standby database

  Properties:
    FastStartFailoverThreshold      = '30'
    OperationTimeout                = '30'
    FastStartFailoverLagLimit       = '30'
    CommunicationTimeout            = '180'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'

Fast-Start Failover: DISABLED

Configuration Status:
ORA-12543: TNS:destination host unreachable
ORA-16625: cannot reach database "maa"
DGM-17017: unable to determine configuration status

DGMGRL> SHOW DATABASE maa

Database - maa

  Role:            PRIMARY
  Intended State:  TRANSPORT-ON
  Instance(s):
    maa1
    maa2

Database Status:
DGM-17016: failed to retrieve status for database "maa"
ORA-12543: TNS:destination host unreachable
ORA-16625: cannot reach database "maa"

DGMGRL> SHOW DATABASE maastb;

Database - maastb

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   0 seconds
  Apply Lag:       0 seconds
  Real Time Query: OFF
  Instance(s):
    maastb1 (apply instance)
    maastb2

Database Status:
SUCCESS

DGMGRL>
As you saw above, the primary database (maa) is down and the Broker is waiting for you to make a decision. To perform the failover through the Broker, just type "FAILOVER TO <standby name>" as shown below. To perform the failover, execute:
DGMGRL> FAILOVER TO maastb;
Performing failover NOW, please wait...
Operation requires a connection to instance "maastb1" on database "maastb"
Connecting to instance "maastb1"...
Connected.
Failover succeeded, new primary is "maastb"
DGMGRL>

In the Broker log of instance maastb1:

05/11/2014 12:52:28
FAILOVER TO maastb
Beginning failover to database maastb
Notifying Oracle Clusterware to teardown database for FAILOVER
05/11/2014 12:53:18
Protection mode set to MAXIMUM AVAILABILITY
05/11/2014 12:53:29
Deferring associated archivelog destinations of sites permanently disabled due to Failover
Notifying Oracle Clusterware to buildup primary database after FAILOVER
Data Guard notifying Oracle Clusterware to start services and other instances change
05/11/2014 12:53:33
Posting DB_DOWN alert ...
... with reason Data Guard Failover
05/11/2014 12:53:37
Command FAILOVER TO maastb completed

In the Broker log of instance maastb2:

05/11/2014 18:52:26
FAILOVER TO maastb
Command FAILOVER TO maastb completed with warning ORA-16523
05/11/2014 18:53:18
Protection mode set to MAXIMUM AVAILABILITY
05/11/2014 18:53:33
Command FAILOVER TO maastb completed
Look carefully at the outputs above. In the first we have the execution of the command; note that, to carry it out, the Broker connected specifically to a single instance. In the Broker logs we have the information that the instance was restarted because of the failover (teardown) and that the archive destinations were "disabled".
Let us see what happened in the alert log of instance "maastb1" (correlate it with the timestamps of the previous logs). My apologies in advance, but the log below is long precisely to illustrate everything that happened:
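A side note before the alert log: DGMGRL also accepts an immediate variant of the command, which skips applying the remaining redo and can therefore lose data. In a healthy MAXIMUM AVAILABILITY setup like this one the plain form used above is the right choice; the variant would look like:

DGMGRL> FAILOVER TO maastb IMMEDIATE;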
No alertlog da standby maastb (nova primary): Sun May 11 12:52:28 2014 Data Guard Broker: Beginning failover Sun May 11 12:52:29 2014 ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL Sun May 11 12:52:29 2014 MRP0: Background Media Recovery cancelled with status 16037 Errors in file /u01/app/oracle/diag/rdbms/maastb/maastb1/trace/maastb1_pr00_10962.trc: ORA-16037: user requested cancel of managed recovery operation Sun May 11 12:52:29 2014 Managed Standby Recovery not using Real Time Apply Recovery interrupted! Recovered data files to a consistent state at change 8878906 Sun May 11 12:52:33 2014 MRP0: Background Media Recovery process shutdown (maastb1) Managed Standby Recovery Canceled (maastb1) Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE Attempt to do a Terminal Recovery (maastb1) Media Recovery Start: Managed Standby Recovery (maastb1) started logmerger process Managed Standby Recovery not using Real Time Apply Parallel Media Recovery started with 2 slaves Sun May 11 12:52:36 2014 Media Recovery Waiting for thread 2 sequence 99 (in transit) Sun May 11 12:52:37 2014 krsv_proc_kill: Killing 6 processes (all RFS, wait for I/O) Begin: Standby Redo Logfile archival End: Standby Redo Logfile archival Terminal Recovery timestamp is '05/11/2014 12:52:40' Terminal Recovery: applying standby redo logs. Terminal Recovery: thread 2 seq# 99 redo required Terminal Recovery: Recovery of Online Redo Log: Thread 2 Group 9 Seq 99 Reading mem 0 Mem# 0: +DG01/maastb/onlinelog/group_9.260.844716095 Mem# 1: +FRA/maastb/onlinelog/group_9.591.844716097 Identified End-Of-Redo (failover) for thread 2 sequence 99 at SCN 0xffff.ffffffff Terminal Recovery: thread 1 seq# 140 redo required Terminal Recovery: Recovery of Online Redo Log: Thread 1 Group 6 Seq 140 Reading mem 0 Mem# 0: +DG01/maastb/onlinelog/group_6.267.844716079 Mem# 1: +FRA/maastb/onlinelog/group_6.604.844716081 Identified End-Of-Redo (failover) for thread 1 sequence 140 at SCN 0xffff.ffffffff Incomplete Recovery applied until change 8878917 time 05/11/2014 18:08:24 Sun May 11 12:52:41 2014 Media Recovery Complete (maastb1) Terminal Recovery: successful completion Sun May 11 12:52:41 2014 ARC0: Archiving not possible: error count exceeded Sun May 11 12:52:41 2014 ARC3: Archiving not possible: error count exceeded Forcing ARSCN to IRSCN for TR 0:8878917 Attempt to set limbo arscn 0:8878917 irscn 0:8878917 Resetting standby activation ID 723321957 (0x2b1d0465) ARCH: Archival stopped, error occurred. Will continue retrying ORACLE Instance maastb1 - Archival Error ORA-16014: log 6 sequence# 140 not archived, no available destinations ORA-00312: online log 6 thread 1: '+DG01/maastb/onlinelog/group_6.267.844716079' ORA-00312: online log 6 thread 1: '+FRA/maastb/onlinelog/group_6.604.844716081' Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN ALTER DATABASE SWITCHOVER TO PRIMARY (maastb1) Maximum wait for role transition is 15 minutes. 
Backup controlfile written to trace file /u01/app/oracle/diag/rdbms/maastb/maastb1/trace/maastb1_rsm0_18329.trc Standby terminal recovery start SCN: 8878906 RESETLOGS after incomplete recovery UNTIL CHANGE 8878917 Online log +DG01/maastb/onlinelog/group_1.257.844716051: Thread 1 Group 1 was previously cleared Online log +FRA/maastb/onlinelog/group_1.568.844716053: Thread 1 Group 1 was previously cleared Online log +DG01/maastb/onlinelog/group_2.268.844716057: Thread 1 Group 2 was previously cleared Online log +FRA/maastb/onlinelog/group_2.564.844716059: Thread 1 Group 2 was previously cleared Online log +DG01/maastb/onlinelog/group_3.269.844716063: Thread 2 Group 3 was previously cleared Online log +FRA/maastb/onlinelog/group_3.562.844716065: Thread 2 Group 3 was previously cleared Online log +DG01/maastb/onlinelog/group_4.270.844716067: Thread 2 Group 4 was previously cleared Online log +FRA/maastb/onlinelog/group_4.559.844716071: Thread 2 Group 4 was previously cleared Standby became primary SCN: 8878905 Sun May 11 12:52:45 2014 Setting recovery target incarnation to 3 Switchover: Complete - Database mounted as primary Sun May 11 12:52:51 2014 Completed: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WAIT WITH SESSION SHUTDOWN ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY Completed: ALTER DATABASE SET STANDBY DATABASE TO MAXIMIZE AVAILABILITY ALTER DATABASE OPEN Data Guard Broker initializing... This instance was first to open Picked broadcast on commit scheme to generate SCNs Sun May 11 12:52:53 2014 Assigning activation ID 723379342 (0x2b1de48e) LGWR: Primary database is in MAXIMUM AVAILABILITY mode Destination LOG_ARCHIVE_DEST_2 is UNSYNCHRONIZED LGWR: Destination LOG_ARCHIVE_DEST_1 is not serviced by LGWR Thread 1 advanced to log sequence 2 (thread open) Sun May 11 12:52:54 2014 ARC3: Becoming the 'no SRL' ARCH Thread 1 opened at log sequence 2 Sun May 11 12:52:54 2014 Current log# 2 seq# 2 mem# 0: +DG01/maastb/onlinelog/group_2.268.844716057ARC0: Becoming the 'no SRL' ARCH Current log# 2 seq# 2 mem# 1: +FRA/maastb/onlinelog/group_2.564.844716059 Successful open of redo thread 1 MTTR advisory is disabled because FAST_START_MTTR_TARGET is not set Sun May 11 12:52:55 2014 SMON: enabling cache recovery ARC0: Becoming the 'no SRL' ARCH Archived Log entry 323 added for thread 1 sequence 1 ID 0x2b1de48e dest 1: Archiver process freed from errors. No longer stopped Instance recovery: looking for dead threads Instance recovery: lock domain invalid but no dead threads Sun May 11 12:53:04 2014 *********************************************************************** Fatal NI connect error 12543, connecting to: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle)))) VERSION INFORMATION: TNS for Linux: Version 11.2.0.3.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production Time: 11-MAY-2014 12:53:04 Tracing not turned on. Tns error struct: ns main err code: 12543 TNS-12543: TNS:destination host unreachable ns secondary err code: 12560 nt main err code: 513 TNS-00513: Destination host unreachable nt secondary err code: 113 nt OS err code: 0 Sun May 11 12:53:06 2014 [18329] Successfully onlined Undo Tablespace 2. Undo initialization finished serial:0 start:3723812044 end:3723813584 diff:1540 (15 seconds) Verifying file header compatibility for 11g tablespace encryption.. 
Verifying 11g file header compatibility for tablespace encryption completed Sun May 11 12:53:06 2014 SMON: enabling tx recovery Database Characterset is WE8MSWIN1252 Sun May 11 12:53:07 2014 minact-scn: Inst 1 is a slave inc#:4 mmon proc-id:10424 status:0x2 minact-scn status: grec-scn:0x0000.00000000 gmin-scn:0x0000.00000000 gcalc-scn:0x0000.00000000 No Resource Manager plan active *********************************************************************** Fatal NI connect error 12543, connecting to: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle)))) VERSION INFORMATION: TNS for Linux: Version 11.2.0.3.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production Time: 11-MAY-2014 12:53:13 Tracing not turned on. Tns error struct: ns main err code: 12543 TNS-12543: TNS:destination host unreachable ns secondary err code: 12560 nt main err code: 513 TNS-00513: Destination host unreachable nt secondary err code: 113 nt OS err code: 0 Starting background process GTX0 Sun May 11 12:53:14 2014 GTX0 started with pid=40, OS id=10940 Starting background process RCBG Sun May 11 12:53:14 2014 RCBG started with pid=41, OS id=10942 replication_dependency_tracking turned off (no async multimaster replication found) Sun May 11 12:53:16 2014 Starting background process QMNC Sun May 11 12:53:16 2014 QMNC started with pid=42, OS id=10944 Completed: ALTER DATABASE OPEN ALTER SYSTEM SET log_archive_trace=0 SCOPE=BOTH SID='maastb1'; ALTER SYSTEM SET log_archive_format='arch_%t_%s_%r.arc' SCOPE=SPFILE SID='maastb1'; Sun May 11 12:53:22 2014 Starting background process SMCO Sun May 11 12:53:22 2014 SMCO started with pid=44, OS id=10957 Sun May 11 12:53:22 2014 *********************************************************************** Fatal NI connect error 12543, connecting to: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle)))) VERSION INFORMATION: TNS for Linux: Version 11.2.0.3.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production Time: 11-MAY-2014 12:53:22 Tracing not turned on. Tns error struct: ns main err code: 12543 TNS-12543: TNS:destination host unreachable ns secondary err code: 12560 nt main err code: 513 TNS-00513: Destination host unreachable nt secondary err code: 113 nt OS err code: 0 Error 12543 received logging on to the standby PING[ARC2]: Heartbeat failed to connect to standby 'maa'. Error is 12543. 
Sun May 11 12:53:22 2014 ARC1: STARTING ARCH PROCESSES Sun May 11 12:53:22 2014 ARC0: Becoming the 'no SRL' ARCH Sun May 11 12:53:22 2014 ARC4 started with pid=48, OS id=10961 ARC4: Archival started ARC1: STARTING ARCH PROCESSES COMPLETE ARC1: Becoming the 'no SRL' ARCH Sun May 11 12:53:26 2014 ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH SID='*'; ALTER SYSTEM SET archive_lag_target=0 SCOPE=BOTH SID='*'; ALTER SYSTEM SET log_archive_max_processes=4 SCOPE=BOTH SID='*'; ALTER SYSTEM SET log_archive_min_succeed_dest=1 SCOPE=BOTH SID='*'; ALTER SYSTEM SET db_file_name_convert='+DATA/maa','+DG01/maastb','+FRA/maa','+FRA/maastb' SCOPE=SPFILE; ALTER SYSTEM SET log_file_name_convert='+DATA/maa','+DG01/maastb','+FRA/maa','+FRA/maastb' SCOPE=SPFILE; Sun May 11 12:53:29 2014 ALTER SYSTEM SET log_archive_dest_state_2='RESET' SCOPE=BOTH; Sun May 11 12:53:31 2014 Starting background process CJQ0 Sun May 11 12:53:31 2014 CJQ0 started with pid=52, OS id=10995 Failover succeeded. Primary database is now maastb. Sun May 11 12:53:33 2014 Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter Sun May 11 12:53:33 2014 Starting background process VKRM Sun May 11 12:53:33 2014 VKRM started with pid=60, OS id=11000 Setting Resource Manager plan SCHEDULER[0x318F]:DEFAULT_MAINTENANCE_PLAN via scheduler window Setting Resource Manager plan DEFAULT_MAINTENANCE_PLAN via parameter Sun May 11 12:53:38 2014 Active Session History (ASH) performed an emergency flush. This may mean that ASH is undersized. If emergency flushes are a recurring issue, you may consider increasing ASH size by setting the value of _ASH_SIZE to a sufficiently large value. Currently, ASH size is 4194304 bytes. Both ASH size and the total number of emergency flushes since instance startup can be monitored by running the following query: select total_size,awr_flush_emergency_count from v$ash_info;
Analyzing everything above calmly, we see that the first thing the Broker did was cancel the application of redo coming from the old primary. We also see that the standby became primary at SCN# 8878905 and that an End-Of-Redo marker was set on both threads. After that, all transactions and information were committed so that the database could become the primary. In addition, the database opened in MAXIMUM AVAILABILITY and already detected that log_archive_dest_2 is out of sync (hence the TNS errors that appear). I did not show the alert log of the second instance of the "maastb" database above, but the information there is simple (since everything was done by "maastb1").
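Stripped down to the SQL side, the sequence the Broker drove corresponds roughly to the statements visible in the alert log above, run on the standby; this is only a sketch and does not cover the CRS adjustments:

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE FINISH FORCE;
SQL> ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN;
SQL> ALTER DATABASE OPEN;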
If you did not have the Broker, all of that would be on you. Every command, from the recovery cancel to the failover itself, would be manual, including the adjustment of the CRS. To check how the environment ended up, we can run in the Broker:
DGMGRL> SHOW CONFIGURATION VERBOSE;

Configuration - dgrac

  Protection Mode: MaxAvailability
  Databases:
    maastb - Primary database
      Warning: ORA-16629: database reports a different protection level from the protection mode
    maa    - Physical standby database (disabled)
      ORA-16661: the standby database needs to be reinstated

  Properties:
    FastStartFailoverThreshold      = '30'
    OperationTimeout                = '30'
    FastStartFailoverLagLimit       = '30'
    CommunicationTimeout            = '180'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'

Fast-Start Failover: DISABLED

Configuration Status:
WARNING

DGMGRL> SHOW DATABASE VERBOSE maastb;

Database - maastb

  Role:            PRIMARY
  Intended State:  TRANSPORT-ON
  Instance(s):
    maastb1
    maastb2

  Database Warning(s):
    ORA-16629: database reports a different protection level from the protection mode

  Properties:
    DGConnectIdentifier             = 'maastb'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'SYNC'
    DelayMins                       = '0'
    Binding                         = 'OPTIONAL'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '30'
    RedoCompression                 = 'DISABLE'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'AUTO'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '4'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = '+DATA/maa, +DG01/maastb, +FRA/maa, +FRA/maastb'
    LogFileNameConvert              = '+DATA/maa, +DG01/maastb, +FRA/maa, +FRA/maastb'
    FastStartFailoverTarget         = ''
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    SidName(*)
    StaticConnectIdentifier(*)
    StandbyArchiveLocation(*)
    AlternateLocation(*)
    LogArchiveTrace(*)
    LogArchiveFormat(*)
    TopWaitEvents(*)
    (*) - Please check specific instance for the property value

Database Status:
WARNING

DGMGRL>
Notice above that there are some details to investigate. We have a nice "ORA-16629: database reports a different protection level from the protection mode", which occurs because the new standby is not available to receive redo (I will demonstrate this below). We also see that "maastb" is the new primary.
If we check the old primary we will see important information:
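If you want to see the same mismatch outside the Broker, compare the two columns in v$database on the new primary; until the new standby is reinstated, output along these lines is expected:

SQL> SELECT protection_mode, protection_level FROM v$database;

PROTECTION_MODE      PROTECTION_LEVEL
-------------------- --------------------
MAXIMUM AVAILABILITY RESYNCHRONIZATION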
DGMGRL> SHOW DATABASE VERBOSE maa;

Database - maa

  Role:            PHYSICAL STANDBY
  Intended State:  APPLY-ON
  Transport Lag:   (unknown)
  Apply Lag:       (unknown)
  Real Time Query: OFF
  Instance(s):
    maa1
    maa2

  Properties:
    DGConnectIdentifier             = 'maa'
    ObserverConnectIdentifier       = ''
    LogXptMode                      = 'SYNC'
    DelayMins                       = '0'
    Binding                         = 'optional'
    MaxFailure                      = '0'
    MaxConnections                  = '1'
    ReopenSecs                      = '300'
    NetTimeout                      = '30'
    RedoCompression                 = 'DISABLE'
    LogShipping                     = 'ON'
    PreferredApplyInstance          = ''
    ApplyInstanceTimeout            = '0'
    ApplyParallel                   = 'AUTO'
    StandbyFileManagement           = 'AUTO'
    ArchiveLagTarget                = '0'
    LogArchiveMaxProcesses          = '4'
    LogArchiveMinSucceedDest        = '1'
    DbFileNameConvert               = '+DG01/maastb, +DATA/maa, +FRA/maastb, +FRA/maa'
    LogFileNameConvert              = '+DG01/maastb, +DATA/maa, +FRA/maastb, +FRA/maa'
    FastStartFailoverTarget         = ''
    InconsistentProperties          = '(monitor)'
    InconsistentLogXptProps         = '(monitor)'
    SendQEntries                    = '(monitor)'
    LogXptStatus                    = '(monitor)'
    RecvQEntries                    = '(monitor)'
    SidName(*)
    StaticConnectIdentifier(*)
    StandbyArchiveLocation(*)
    AlternateLocation(*)
    LogArchiveTrace(*)
    LogArchiveFormat(*)
    TopWaitEvents(*)
    (*) - Please check specific instance for the property value

Database Status:
ORA-16661: the standby database needs to be reinstated

DGMGRL>
The main detail here is the information that it needs a reinstate: "ORA-16661: the standby database needs to be reinstated". Once you bring your old primary back, you will have to reinstate it.
CONTROL 2
So far everything has been nice and neat: we performed the failover through the Broker and it took care of everything for us. Of course, all of this only happened because we kept the environment healthy: the synchronization between primary and standby was correct and no information was lost. Let us look at the test table we used at the beginning of the article:
SQL> SELECT instance_name, status FROM gv$instance;

INSTANCE_NAME    STATUS
---------------- ------------
maastb2          OPEN
maastb1          OPEN

SQL> SELECT c1, TO_CHAR(c2, 'DD/MM/RRRR HH24:MI:SS') AS momento FROM testedg;

        C1 MOMENTO
---------- -------------------
         1 11/05/2014 12:01:07
         2 11/05/2014 12:02:17

SQL>
REINSTATE
Fair warning: the next logs are long. Even so, they are full of technical details that deserve attention and a careful look.
If you leave everything as it is now, you have the "maastb" database acting as primary and waiting for the "maa" database to take its place as standby. In the Broker log of the primary instances you will see the following information (as well as in the dgmgrl output I showed above):
05/11/2014 19:40:53
Error: The actual protection level 'RESYNCHRONIZATION' is different from the configured protection mode 'MAXIMUM AVAILABILITY'.
Data Guard Broker Status Summary:
  Type                         Name      Severity  Status
  Configuration                dgrac     Warning   ORA-16608
  Primary Database             maastb    Warning   ORA-16629
  Physical Standby Database    maa       Error     ORA-16661
Besides that, you are probably asking yourself: what happens if my old primary comes back and becomes available? Will it cause problems? Hmm... no.
The simple fact that we are using the Broker already helps us tremendously. When the old primary comes back, it will need a reinstate as soon as it returns. Look at the alert log of the old primary right after it was powered back on (this log was "cut", but the interesting parts were kept):
Mon May 12 05:14:46 2014 Starting ORACLE instance (normal) ****************** Large Pages Information ***************** Total Shared Global Region in Large Pages = 0 KB (0%) Large Pages used by this instance: 0 (0 KB) Large Pages unused system wide = 0 (0 KB) (alloc incr 4096 KB) Large Pages configured system wide = 0 (0 KB) Large Page size = 2048 KB ... ... SYS auditing is disabled Starting up: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production With the Partitioning and Real Application Clusters options. ORACLE_HOME = /u01/app/oracle/product/11.2.0.3/db_1 System name: Linux Node name: rac11pri01.tjsc.jus.br Release: 2.6.39-400.17.1.el6uek.x86_64 Version: #1 SMP Fri Feb 22 18:16:18 PST 2013 Machine: x86_64 VM name: VMWare Version: 6 ... ... fal_server = "MAASTB" log_archive_trace = 0 log_archive_config = "DG_CONFIG=(maa,maastb)" ... ... DBW0 started with pid=17, OS id=3509 Mon May 12 05:14:51 2014 LGWR started with pid=18, OS id=3516 Mon May 12 05:14:51 2014 CKPT started with pid=19, OS id=3519 Mon May 12 05:14:51 2014 SMON started with pid=20, OS id=3521 Mon May 12 05:14:51 2014 RECO started with pid=21, OS id=3523 ... ... Reconfiguration complete Mon May 12 05:14:52 2014 LCK0 started with pid=28, OS id=3558 Starting background process RSMN Mon May 12 05:14:52 2014 RSMN started with pid=29, OS id=3562 ORACLE_BASE not set in environment. It is recommended that ORACLE_BASE be set in the environment Reusing ORACLE_BASE from an earlier startup = /u01/app/oracle Mon May 12 05:14:53 2014 DMON started with pid=30, OS id=3565 Mon May 12 05:14:55 2014 ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.17.42.14)(PORT=1521))))' SCOPE=MEMORY SID='maa1'; ALTER DATABASE MOUNT /* db agent *//* {1:51382:2} */ This instance was first to mount NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so NOTE: Loaded library: System ... ... Database mounted in Shared Mode (CLUSTER_DATABASE=TRUE) Lost write protection disabled Completed: ALTER DATABASE MOUNT /* db agent *//* {1:51382:2} */ ALTER DATABASE OPEN /* db agent *//* {1:51382:2} */ Data Guard Broker initializing... Data Guard Broker initialization complete Data Guard: verifying database primary role... Mon May 12 05:15:06 2014 Starting Data Guard Broker (DMON) Mon May 12 05:15:06 2014 INSV started with pid=35, OS id=3657 Mon May 12 05:15:10 2014 NSV1 started with pid=36, OS id=3669 Mon May 12 05:15:16 2014 Data Guard: version check completed Data Guard determines a failover has occurred - this is no longer a primary database ORA-16649 signalled during: ALTER DATABASE OPEN /* db agent *//* {1:51382:2} */...
Notice above that during startup it opened the database, but right after the Broker process started it hit ORA-16649 while "verifying database primary role". It then detected that "this is no longer a primary database" and left the database in mount mode. In the Broker log of the old primary we have:
05/12/2014 05:15:06
>> Starting Data Guard Broker bootstrap <<
Broker Configuration File Locations:
  dg_broker_config_file1 = "+DATA/maa/dr1maa.dat"
  dg_broker_config_file2 = "+FRA/maa/dr2maa.dat"
05/12/2014 05:15:10
DMON Registering service maa_DGB with listener(s)
Broker Configuration:        "dgrac"
  Protection Mode:           Maximum Availability
  Fast-Start Failover (FSFO): Disabled, flags=0x0, version=0
  Primary Database:          maa (0x01010000)
  Standby Database:          maastb, Enabled Physical Standby (0x02010000)
05/12/2014 05:15:16
Site maastb returned ORA-16623.
Version Check Results:
  Database maastb returned ORA-16623
Disable redo transport to all standby databases
RSM0 failed to remove all destinations, error ORA-16718
Notice that we also have an "ORA-16623" and an "ORA-16718". This simply tells us that the current database cannot open and needs a reinstate to change its role. Another interesting detail worth observing is who signaled the ORA-16623: if you look closely, it was the new primary that did it, and the error tells us that a role change has occurred.
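Two quick checks on the returned old primary are worth doing at this point: confirm that it stayed in mount mode and, above all, confirm that Flashback Database is enabled, since the Broker relies on it for the reinstate (if it were off, the old primary would have to be re-created from a backup). Output along these lines is what you want to see:

SQL> SELECT open_mode, database_role FROM v$database;

OPEN_MODE            DATABASE_ROLE
-------------------- ----------------
MOUNTED              PRIMARY

SQL> SELECT flashback_on FROM v$database;

FLASHBACK_ON
------------------
YES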
To fix everything we need to reinstate the old primary so that it becomes the new standby. To do that, run the following command in dgmgrl (I am connecting through a standby instance):
DGMGRL> REINSTATE DATABASE maa;
Reinstating database "maa", please wait...
Operation requires shutdown of instance "maa1" on database "maa"
Shutting down instance "maa1"...
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.
Operation requires startup of instance "maa1" on database "maa"
Starting instance "maa1"...
ORACLE instance started.
Database mounted.
Continuing to reinstate database "maa" ...
Reinstatement of database "maa" succeeded
DGMGRL>
Notice above that when the reinstate was requested through the Broker, it connected to the old primary (maa), performed the shutdown and brought the instance back in mount mode. While the command above was running, the following information appeared in the Broker log of the primary instance:
Broker log of instance maastb1:

05/11/2014 20:11:54
REINSTATE DATABASE maa
Database maa can be reinstated
05/11/2014 20:13:02
Command REINSTATE DATABASE maa completed with warning ORA-16570
05/11/2014 20:13:46
REINSTATE DATABASE maa
Database maa can be reinstated
05/11/2014 20:14:06
Successfully completed reinstatement of database 0x01001000; removing ReinstateContextArray property
05/11/2014 20:14:47
Command REINSTATE DATABASE maa completed
EDIT DATABASE maa SET PROPERTY ActualApplyInstance = maa1
Apply Instance for database maa is maa1
Command EDIT DATABASE maa SET PROPERTY ActualApplyInstance = maa1 completed
The information is practically the same as what appeared in dgmgrl. Things get more interesting in the alert logs of the primary instances; see below:
Alertlog da instância maastb1: Sun May 11 20:12:28 2014 ALTER SYSTEM SET log_archive_dest_2='service="maa"','LGWR SYNC AFFIRM delay=0 optional compression=disable max_failure=0 max_connections=1 reopen=1 db_unique_name="maa" net_timeout=30','valid_for=(all_logfiles,primary_role)' SCOPE=BOTH; ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH; Sun May 11 20:12:30 2014 ****************************************************************** LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2 ****************************************************************** LGWR: Error 16009 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'maa' Error 16009 for archive log file 1 to 'maa' LGWR: Failed to archive log 1 thread 1 sequence 3 (16009) Thread 1 advanced to log sequence 3 (LGWR switch) Current log# 1 seq# 3 mem# 0: +DG01/maastb/onlinelog/group_1.257.844716051 Current log# 1 seq# 3 mem# 1: +FRA/maastb/onlinelog/group_1.568.844716053 Sun May 11 20:12:42 2014 Archived Log entry 328 added for thread 1 sequence 2 ID 0x2b1de48e dest 1: Sun May 11 20:12:51 2014 ALTER SYSTEM SET log_archive_dest_state_2='RESET' SCOPE=BOTH; Sun May 11 20:13:48 2014 *********************************************************************** Fatal NI connect error 12514, connecting to: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa_DGB)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle)))) VERSION INFORMATION: TNS for Linux: Version 11.2.0.3.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production Time: 11-MAY-2014 20:13:48 Tracing not turned on. Tns error struct: ns main err code: 12564 TNS-12564: TNS:connection refused ns secondary err code: 0 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0 *********************************************************************** Fatal NI connect error 12514, connecting to: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa_DGB)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle)))) VERSION INFORMATION: TNS for Linux: Version 11.2.0.3.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production Time: 11-MAY-2014 20:13:48 Tracing not turned on. Tns error struct: ns main err code: 12564 TNS-12564: TNS:connection refused ns secondary err code: 0 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0 *********************************************************************** Fatal NI connect error 12514, connecting to: (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=rac11pri-scan.tjsc.jus.br)(PORT=1521))(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=maa_DGB)(CID=(PROGRAM=oracle)(HOST=rac11stb01.tjsc.jus.br)(USER=oracle)))) VERSION INFORMATION: TNS for Linux: Version 11.2.0.3.0 - Production TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production Time: 11-MAY-2014 20:13:48 Tracing not turned on. 
Tns error struct: ns main err code: 12564 TNS-12564: TNS:connection refused ns secondary err code: 0 nt main err code: 0 nt secondary err code: 0 nt OS err code: 0 Sun May 11 20:13:53 2014 ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH; Sun May 11 20:13:54 2014 Thread 1 cannot allocate new log, sequence 4 Checkpoint not complete Current log# 1 seq# 3 mem# 0: +DG01/maastb/onlinelog/group_1.257.844716051 Current log# 1 seq# 3 mem# 1: +FRA/maastb/onlinelog/group_1.568.844716053 LGWR: Standby redo logfile selected to archive thread 1 sequence 4 LGWR: Standby redo logfile selected for thread 1 sequence 4 for destination LOG_ARCHIVE_DEST_2 Thread 1 advanced to log sequence 4 (LGWR switch) Current log# 2 seq# 4 mem# 0: +DG01/maastb/onlinelog/group_2.268.844716057 Current log# 2 seq# 4 mem# 1: +FRA/maastb/onlinelog/group_2.564.844716059 Sun May 11 20:13:59 2014 Archived Log entry 330 added for thread 1 sequence 3 ID 0x2b1de48e dest 1: ARC0: Standby redo logfile selected for thread 1 sequence 3 for destination LOG_ARCHIVE_DEST_2 Sun May 11 20:14:07 2014 ALTER SYSTEM SET log_archive_dest_2='service="maa"','LGWR SYNC AFFIRM delay=0 optional compression=disable max_failure=0 max_connections=1 reopen=300 db_unique_name="maa" net_timeout=30','valid_for=(all_logfiles,primary_role)' SCOPE=BOTH; ALTER SYSTEM ARCHIVE LOG Sun May 11 20:14:07 2014 Thread 1 cannot allocate new log, sequence 5 Checkpoint not complete Current log# 2 seq# 4 mem# 0: +DG01/maastb/onlinelog/group_2.268.844716057 Current log# 2 seq# 4 mem# 1: +FRA/maastb/onlinelog/group_2.564.844716059 ****************************************************************** LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2 ****************************************************************** LGWR: Standby redo logfile selected to archive thread 1 sequence 5 LGWR: Standby redo logfile selected for thread 1 sequence 5 for destination LOG_ARCHIVE_DEST_2 Thread 1 advanced to log sequence 5 (LGWR switch) Current log# 1 seq# 5 mem# 0: +DG01/maastb/onlinelog/group_1.257.844716051 Current log# 1 seq# 5 mem# 1: +FRA/maastb/onlinelog/group_1.568.844716053 Archived Log entry 339 added for thread 1 sequence 4 ID 0x2b1de48e dest 1: Archived Log entry 342 added for thread 2 sequence 4 ID 0x2b1de48e dest 1: Sun May 11 20:15:14 2014 Destination LOG_ARCHIVE_DEST_2 is SYNCHRONIZED LGWR: Standby redo logfile selected to archive thread 1 sequence 6 LGWR: Standby redo logfile selected for thread 1 sequence 6 for destination LOG_ARCHIVE_DEST_2 Thread 1 advanced to log sequence 6 (LGWR switch) Current log# 2 seq# 6 mem# 0: +DG01/maastb/onlinelog/group_2.268.844716057 Current log# 2 seq# 6 mem# 1: +FRA/maastb/onlinelog/group_2.564.844716059 Sun May 11 20:15:15 2014 Archived Log entry 344 added for thread 1 sequence 5 ID 0x2b1de48e dest 1: Alertlog instância maastb2: Mon May 12 02:12:33 2014 ****************************************************************** LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2 ****************************************************************** LGWR: Error 16009 disconnecting from destination LOG_ARCHIVE_DEST_2 standby host 'maa' Error 16009 for archive log file 3 to 'maa' LGWR: Failed to archive log 3 thread 2 sequence 3 (16009) Thread 2 advanced to log sequence 3 (LGWR switch) Current log# 3 seq# 3 mem# 0: +DG01/maastb/onlinelog/group_3.269.844716063 Current log# 3 seq# 3 mem# 1: +FRA/maastb/onlinelog/group_3.562.844716065 Mon May 12 02:12:36 2014 Deleted Oracle managed file 
+FRA/maastb/flashback/log_10.540.846646525 Archived Log entry 327 added for thread 2 sequence 2 ID 0x2b1de48e dest 1: Mon May 12 02:13:56 2014 LGWR: Standby redo logfile selected to archive thread 2 sequence 4 LGWR: Standby redo logfile selected for thread 2 sequence 4 for destination LOG_ARCHIVE_DEST_2 Thread 2 advanced to log sequence 4 (LGWR switch) Current log# 4 seq# 4 mem# 0: +DG01/maastb/onlinelog/group_4.270.844716067 Current log# 4 seq# 4 mem# 1: +FRA/maastb/onlinelog/group_4.559.844716071 Mon May 12 02:13:59 2014 Archived Log entry 329 added for thread 2 sequence 3 ID 0x2b1de48e dest 1: ARC3: Standby redo logfile selected for thread 2 sequence 3 for destination LOG_ARCHIVE_DEST_2 Mon May 12 02:14:09 2014 Thread 2 cannot allocate new log, sequence 5 Checkpoint not complete Current log# 4 seq# 4 mem# 0: +DG01/maastb/onlinelog/group_4.270.844716067 Current log# 4 seq# 4 mem# 1: +FRA/maastb/onlinelog/group_4.559.844716071 Mon May 12 02:14:10 2014 FAL[server, ARC0]: FAL archive failed, see trace file. Mon May 12 02:14:12 2014 ARC3: Archive log rejected (thread 1 sequence 1) at host 'maa' FAL[server, ARC3]: FAL archive failed, see trace file. ARCH: FAL archive failed. Archiver continuing ORACLE Instance maastb2 - Archival Error. Archiver continuing. ARC0: Archive log rejected (thread 2 sequence 2) at host 'maa' FAL[server, ARC0]: FAL archive failed, see trace file. ARCH: FAL archive failed. Archiver continuing ORACLE Instance maastb2 - Archival Error. Archiver continuing. ****************************************************************** LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2 ****************************************************************** LGWR: Standby redo logfile selected to archive thread 2 sequence 5 LGWR: Standby redo logfile selected for thread 2 sequence 5 for destination LOG_ARCHIVE_DEST_2 Thread 2 advanced to log sequence 5 (LGWR switch) Current log# 3 seq# 5 mem# 0: +DG01/maastb/onlinelog/group_3.269.844716063 Current log# 3 seq# 5 mem# 1: +FRA/maastb/onlinelog/group_3.562.844716065 Mon May 12 02:15:14 2014 Thread 2 cannot allocate new log, sequence 6 Checkpoint not complete Current log# 3 seq# 5 mem# 0: +DG01/maastb/onlinelog/group_3.269.844716063 Current log# 3 seq# 5 mem# 1: +FRA/maastb/onlinelog/group_3.562.844716065 Destination LOG_ARCHIVE_DEST_2 is SYNCHRONIZED LGWR: Standby redo logfile selected to archive thread 2 sequence 6 LGWR: Standby redo logfile selected for thread 2 sequence 6 for destination LOG_ARCHIVE_DEST_2 Thread 2 advanced to log sequence 6 (LGWR switch) Current log# 4 seq# 6 mem# 0: +DG01/maastb/onlinelog/group_4.270.844716067 Current log# 4 seq# 6 mem# 1: +FRA/maastb/onlinelog/group_4.559.844716071 Mon May 12 02:15:19 2014 Archived Log entry 346 added for thread 2 sequence 5 ID 0x2b1de48e dest 1:
If you look above you will see that log_archive_dest_2 was "re-enabled" and started synchronizing with its destination. Notice that it kept the same settings we had defined since the beginning of the DG configuration.
The most interesting logs are on the old primary (and new standby). First, look at the Broker log below:
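You can confirm the same thing from SQL on the new primary; a simple query against v$archive_dest_status (illustrative output) should show destination 2 valid again:

SQL> SELECT dest_id, status, protection_mode FROM v$archive_dest_status WHERE dest_id = 2;

   DEST_ID STATUS    PROTECTION_MODE
---------- --------- --------------------
         2 VALID     MAXIMUM AVAILABILITY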
Broker log on maa1:

05/12/2014 05:30:07
Data Guard notifying Oracle Clusterware of database role change
05/12/2014 05:30:29
Creating process RSM0
05/12/2014 05:30:47
Physical RSM: Reinstatement...
Converting old primary control file to physical standby control file
05/12/2014 05:30:52
Purging diverged redos on resetlogs branch 844762805, starting SCN 8878906
Purged 0 archived logs
05/12/2014 05:31:11
Data Guard Broker shutting down
RSM0 successfully terminated
05/12/2014 05:31:14
>> DMON Process Shutdown <<
05/12/2014 05:31:43
>> Starting Data Guard Broker bootstrap <<
Broker Configuration File Locations:
  dg_broker_config_file1 = "+DATA/maa/dr1maa.dat"
  dg_broker_config_file2 = "+FRA/maa/dr2maa.dat"
05/12/2014 05:31:47
Database needs to be reinstated or re-created, Data Guard broker ready
05/12/2014 05:31:49
DMON Registering service maa_DGB with listener(s)
05/12/2014 05:31:54
Apply Instance for Database maa set to maa1
Data Guard notifying Oracle Clusterware of database role change
Creating process RSM0
05/12/2014 05:32:23
Notifying Oracle Clusterware to buildup after REINSTATEMENT
Data Guard notifying Oracle Clusterware to start services and other instances change
05/12/2014 05:32:26
Command REINSTATE DATABASE maa completed
05/12/2014 05:32:50
Command EDIT DATABASE maa SET PROPERTY ActualApplyInstance = maa1 completed
If we analyze the log above calmly, we see that it determined which SCN was needed for the reinstate, performed the reinstate and also fixed the CRS. In the alert log of the old primary instance we have:
Alertlog da maa1: Mon May 12 05:30:29 2014 RSM0 started with pid=39, OS id=3988 Mon May 12 05:30:32 2014 Using STANDBY_ARCHIVE_DEST parameter default value as USE_DB_RECOVERY_FILE_DEST RFS[1]: Assigned to RFS process 3990 RFS[1]: Database mount ID mismatch [0x2b1de48e:0x2b44d48f] (723379342:725931151) FLASHBACK DATABASE TO SCN 8878905 Mon May 12 05:30:35 2014 RFS[2]: Assigned to RFS process 3993 RFS[2]: Database mount ID mismatch [0x2b1de48e:0x2b44d48f] (723379342:725931151) Flashback Restore Start Flashback Restore Complete Flashback Media Recovery Start started logmerger process Parallel Media Recovery started with 2 slaves Mon May 12 05:30:38 2014 Recovery of Online Redo Log: Thread 2 Group 3 Seq 99 Reading mem 0 Mem# 0: +DATA/maa/onlinelog/group_3.257.843489101 Mem# 1: +FRA/maa/onlinelog/group_3.284.843489101 Recovery of Online Redo Log: Thread 1 Group 1 Seq 139 Reading mem 0 Mem# 0: +DATA/maa/onlinelog/group_1.272.843488553 Mem# 1: +FRA/maa/onlinelog/group_1.286.843488555 Recovery of Online Redo Log: Thread 1 Group 2 Seq 140 Reading mem 0 Mem# 0: +DATA/maa/onlinelog/group_2.271.843488555 Mem# 1: +FRA/maa/onlinelog/group_2.285.843488555 Incomplete Recovery applied until change 8878906 time 05/11/2014 18:08:14 Mon May 12 05:30:47 2014 Flashback Media Recovery Complete Completed: FLASHBACK DATABASE TO SCN 8878905 alter database convert to physical standby ALTER DATABASE CONVERT TO PHYSICAL STANDBY (maa1) Flush standby redo logfile failed:1649 Clearing standby activation ID 723321957 (0x2b1d0465) The primary database controlfile was created using the 'MAXLOGFILES 192' clause. There is space for up to 188 standby redo logfiles Use the following SQL commands on the standby database to create standby redo logfiles that match the primary database: ALTER DATABASE ADD STANDBY LOGFILE 'srl1.f' SIZE 52428800; ALTER DATABASE ADD STANDBY LOGFILE 'srl2.f' SIZE 52428800; ALTER DATABASE ADD STANDBY LOGFILE 'srl3.f' SIZE 52428800; ALTER DATABASE ADD STANDBY LOGFILE 'srl4.f' SIZE 52428800; ALTER DATABASE ADD STANDBY LOGFILE 'srl5.f' SIZE 52428800; Waiting for all non-current ORLs to be archived... All non-current ORLs have been archived. Clearing online redo logfile 1 +DATA/maa/onlinelog/group_1.272.843488553 Clearing online log 1 of thread 1 sequence number 139 Clearing online redo logfile 1 complete Clearing online redo logfile 2 +DATA/maa/onlinelog/group_2.271.843488555 Clearing online log 2 of thread 1 sequence number 140 Clearing online redo logfile 2 complete Clearing online redo logfile 3 +DATA/maa/onlinelog/group_3.257.843489101 Clearing online log 3 of thread 2 sequence number 99 Clearing online redo logfile 3 complete Clearing online redo logfile 4 +DATA/maa/onlinelog/group_4.262.843489103 Clearing online log 4 of thread 2 sequence number 98 Clearing online redo logfile 4 complete Completed: alter database convert to physical standby Mon May 12 05:31:05 2014 Shutting down instance (immediate) Shutting down instance: further logons disabled Stopping background process MMNL Stopping background process MMON License high water mark = 6 alter database CLOSE NORMAL ORA-1109 signalled during: alter database CLOSE NORMAL... alter database DISMOUNT Shutting down archive processes Archiving is disabled ... ... 
Shutting down Data Guard Broker processes Mon May 12 05:31:14 2014 Completed: Data Guard Broker shutdown NOTE: force a map free for map id 2 Mon May 12 05:31:15 2014 ARCH: Archival disabled due to shutdown: 1089 Shutting down archive processes Archiving is disabled Mon May 12 05:31:15 2014 Stopping background process VKTM ... ... Instance shutdown complete Mon May 12 05:31:23 2014 Starting ORACLE instance (normal) ****************** Large Pages Information ***************** Total Shared Global Region in Large Pages = 0 KB (0%) ... ... Starting up: Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production With the Partitioning and Real Application Clusters options. ... ... DMON started with pid=30, OS id=4114 Mon May 12 05:31:31 2014 alter database mount This instance was first to mount Mon May 12 05:31:31 2014 ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=10.17.42.14)(PORT=1521))))' SCOPE=MEMORY SID='maa1'; NOTE: Loaded library: /opt/oracle/extapi/64/asm/orcl/1/libasm.so NOTE: Loaded library: System SUCCESS: diskgroup DATA was mounted SUCCESS: diskgroup FRA was mounted NOTE: dependency between database maa and diskgroup resource ora.DATA.dg is established NOTE: dependency between database maa and diskgroup resource ora.FRA.dg is established Mon May 12 05:31:37 2014 ARCH: STARTING ARCH PROCESSES NSS2 started with pid=33, OS id=4147 Mon May 12 05:31:37 2014 ARC0 started with pid=34, OS id=4149 ARC0: Archival started ... ... ARC2: Becoming the 'no FAL' ARCH ARC2: Becoming the 'no SRL' ARCH ARC1: Becoming the heartbeat ARCH Physical Standby Database mounted. Lost write protection disabled ARC1: Becoming the active heartbeat ARCH ARC1: Becoming the active heartbeat ARCH ARC3: Archival started ARC0: STARTING ARCH PROCESSES COMPLETE Completed: alter database mount Mon May 12 05:31:43 2014 Starting Data Guard Broker (DMON) Mon May 12 05:31:43 2014 ... ... 
Using STANDBY_ARCHIVE_DEST parameter default value as USE_DB_RECOVERY_FILE_DEST Primary database is in MAXIMUM AVAILABILITY mode Changing standby controlfile to RESYNCHRONIZATION level Standby controlfile consistent with primary RFS[1]: Assigned to RFS process 4190 RFS[1]: Selected log 8 for thread 2 sequence 4 dbid 722024964 branch 847284763 Mon May 12 05:31:59 2014 Primary database is in MAXIMUM AVAILABILITY mode Standby controlfile consistent with primary Standby controlfile consistent with primary RFS[2]: Assigned to RFS process 4192 RFS[2]: Selected log 5 for thread 1 sequence 4 dbid 722024964 branch 847284763 ALTER SYSTEM SET log_archive_trace=0 SCOPE=BOTH SID='maa1'; ALTER SYSTEM SET log_archive_format='arch_%t_%s_%r.arc' SCOPE=SPFILE SID='maa1'; ALTER SYSTEM SET standby_file_management='AUTO' SCOPE=BOTH SID='*'; ALTER SYSTEM SET archive_lag_target=0 SCOPE=BOTH SID='*'; ALTER SYSTEM SET log_archive_max_processes=4 SCOPE=BOTH SID='*'; ALTER SYSTEM SET log_archive_min_succeed_dest=1 SCOPE=BOTH SID='*'; ALTER SYSTEM SET db_file_name_convert='+DG01/maastb','+DATA/maa','+FRA/maastb','+FRA/maa' SCOPE=SPFILE; ALTER SYSTEM SET log_file_name_convert='+DG01/maastb','+DATA/maa','+FRA/maastb','+FRA/maa' SCOPE=SPFILE; ALTER SYSTEM SET fal_server='maastb' SCOPE=BOTH; ALTER DATABASE RECOVER MANAGED STANDBY DATABASE THROUGH ALL SWITCHOVER DISCONNECT USING CURRENT LOGFILE Attempt to start background Managed Standby Recovery process (maa1) Mon May 12 05:32:01 2014 MRP0 started with pid=45, OS id=4198 MRP0: Background Managed Standby Recovery process started (maa1) Mon May 12 05:32:01 2014 RFS[3]: Assigned to RFS process 4194 RFS[3]: Selected log 9 for thread 2 sequence 3 dbid 722024964 branch 847284763 Mon May 12 05:32:01 2014 RFS[4]: Assigned to RFS process 4196 RFS[4]: Selected log 6 for thread 1 sequence 3 dbid 722024964 branch 847284763 RFS[4]: New Archival REDO Branch(resetlogs_id): 847284763 Prior: 844762805 RFS[4]: Archival Activation ID: 0x2b1de48e Current: 0x0 RFS[4]: Effect of primary database OPEN RESETLOGS RFS[4]: Managed Standby Recovery process is active RFS[4]: Incarnation entry added for Branch(resetlogs_id): 847284763 (maa1) Mon May 12 05:32:01 2014 Setting recovery target incarnation to 3 Mon May 12 05:32:02 2014 Deleted Oracle managed file +FRA/maa/flashback/log_2.339.847184419 Mon May 12 05:32:02 2014 Archived Log entry 632 added for thread 1 sequence 3 ID 0x2b1de48e dest 1: Archived Log entry 633 added for thread 2 sequence 3 ID 0x2b1de48e dest 1: Mon May 12 05:32:04 2014 RFS[5]: Assigned to RFS process 4212 RFS[5]: Opened log for thread 2 sequence 99 dbid 722024964 branch 844762805 RFS[4]: Opened log for thread 1 sequence 140 dbid 722024964 branch 844762805 Archived Log entry 634 added for thread 2 sequence 99 rlc 844762805 ID 0x2b1d0465 dest 2: started logmerger process Managed Standby Recovery starting Real Time Apply RFS[3]: Opened log for thread 2 sequence 1 dbid 722024964 branch 847284763 Parallel Media Recovery started with 2 slaves Mon May 12 05:32:07 2014 Media Recovery start incarnation depth : 1, target inc# : 3, irscn : 8878917 Archived Log entry 635 added for thread 2 sequence 1 rlc 847284763 ID 0x2b1de48e dest 2: RFS[5]: Opened log for thread 2 sequence 2 dbid 722024964 branch 847284763 Waiting for all non-current ORLs to be archived... All non-current ORLs have been archived. 
Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_2_seq_99.603.847344725 Identified End-Of-Redo (failover) for thread 2 sequence 99 at SCN 0x0.877b45 Archived Log entry 636 added for thread 1 sequence 140 rlc 844762805 ID 0x2b1d0465 dest 2: Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_1_seq_140.605.847344725 Identified End-Of-Redo (failover) for thread 1 sequence 140 at SCN 0x0.877b45 Mon May 12 05:32:08 2014 Completed: ALTER DATABASE RECOVER MANAGED STANDBY DATABASE THROUGH ALL SWITCHOVER DISCONNECT USING CURRENT LOGFILE Resetting standby activation ID 723321957 (0x2b1d0465) Media Recovery End-Of-Redo indicator encountered Media Recovery Continuing Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_2_seq_1.592.847344727 Archived Log entry 637 added for thread 1 sequence 4 ID 0x2b1de48e dest 1: Media Recovery Waiting for thread 1 sequence 1 Fetching gap sequence in thread 1, gap sequence 1-2 Archived Log entry 638 added for thread 2 sequence 2 rlc 847284763 ID 0x2b1de48e dest 2: Mon May 12 05:32:10 2014 Primary database is in MAXIMUM AVAILABILITY mode Standby controlfile consistent with primary Standby controlfile consistent with primary RFS[6]: Assigned to RFS process 4242 RFS[6]: Selected log 5 for thread 1 sequence 5 dbid 722024964 branch 847284763 Mon May 12 05:32:11 2014 RFS[7]: Assigned to RFS process 4234 RFS[7]: Opened log for thread 1 sequence 1 dbid 722024964 branch 847284763 Mon May 12 05:32:11 2014 RFS[8]: Assigned to RFS process 4238 RFS[8]: Opened log for thread 1 sequence 2 dbid 722024964 branch 847284763 Archived Log entry 639 added for thread 1 sequence 1 rlc 847284763 ID 0x2b1de48e dest 2: Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_1_seq_1.595.847344731 Media Recovery Waiting for thread 1 sequence 2 (in transit) Mon May 12 05:32:14 2014 Archived Log entry 640 added for thread 2 sequence 4 ID 0x2b1de48e dest 1: Archived Log entry 641 added for thread 1 sequence 2 rlc 847284763 ID 0x2b1de48e dest 2: Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_1_seq_2.593.847344731 Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_2_seq_2.584.847344727 Mon May 12 05:32:15 2014 Primary database is in MAXIMUM AVAILABILITY mode Standby controlfile consistent with primary Standby controlfile consistent with primary RFS[9]: Assigned to RFS process 4250 RFS[9]: Selected log 8 for thread 2 sequence 5 dbid 722024964 branch 847284763 Mon May 12 05:32:25 2014 Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_1_seq_3.604.847344723 Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_2_seq_3.339.847344723 Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_2_seq_4.581.847344735 Media Recovery Log +FRA/maa/archivelog/2014_05_12/thread_1_seq_4.594.847344731 Media Recovery Waiting for thread 1 sequence 5 (in transit) Recovery of Online Redo Log: Thread 1 Group 5 Seq 5 Reading mem 0 Mem# 0: +DATA/maa/onlinelog/group_5.263.843615365 Mem# 1: +FRA/maa/onlinelog/group_5.289.843615367 Media Recovery Waiting for thread 2 sequence 5 (in transit) Recovery of Online Redo Log: Thread 2 Group 8 Seq 5 Reading mem 0 Mem# 0: +DATA/maa/onlinelog/group_8.259.843615383 Mem# 1: +FRA/maa/onlinelog/group_8.703.843615385 Mon May 12 05:33:16 2014 Changing standby controlfile to MAXIMUM AVAILABILITY level RFS[6]: Selected log 6 for thread 1 sequence 6 dbid 722024964 branch 847284763 Mon May 12 05:33:16 2014 Deleted Oracle managed file +FRA/maa/flashback/log_3.259.843502613 Mon May 12 05:33:16 2014 Media Recovery Waiting for thread 1 sequence 6 (in transit) 
Recovery of Online Redo Log: Thread 1 Group 6 Seq 6 Reading mem 0 Mem# 0: +DATA/maa/onlinelog/group_6.261.843615373 Mem# 1: +FRA/maa/onlinelog/group_6.670.843615373 Archived Log entry 642 added for thread 1 sequence 5 ID 0x2b1de48e dest 1: Mon May 12 05:33:19 2014 Standby controlfile consistent with primary RFS[9]: Selected log 9 for thread 2 sequence 6 dbid 722024964 branch 847284763 Mon May 12 05:33:19 2014 Archived Log entry 643 added for thread 2 sequence 5 ID 0x2b1de48e dest 1: Media Recovery Waiting for thread 2 sequence 6 (in transit) Recovery of Online Redo Log: Thread 2 Group 9 Seq 6 Reading mem 0 Mem# 0: +DATA/maa/onlinelog/group_9.256.843615389 Mem# 1: +FRA/maa/onlinelog/group_9.504.843615389
Doing a detailed analysis, we find that the "maa" database underwent a FLASHBACK to SCN 8878905, which completed successfully. After that, the database was converted to PHYSICAL STANDBY and, as part of the procedure, it was shut down. It was then restarted (automatically) and detected that it needed to synchronize archivelogs with its primary (maastb). It detected that the End-Of-Redo had occurred and, after that, kept applying the archivelogs generated after that event (via RFS – Media Recovery Continuing). In this case, depending on how long your old primary stayed offline, the number of archivelogs that need to be shipped can be large. It may even be necessary to restore from a backup (of the new primary) archivelogs that were already deleted but are needed to close the gap between the environments.
Afterwards, the database was configured to act as a standby and to use real-time apply (USING CURRENT LOGFILE). A detail here: the Broker configures real-time apply automatically if it detects that the standby database has standby redo log files configured. Notice that it detected the presence of a new incarnation because of the RESETLOGS caused by the flashback and by the End-Of-Redo. Finally, upon identifying that the primary operates in MAXIMUM AVAILABILITY, it ships the archivelogs and leaves everything synchronized.
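A simple way to confirm that the standby redo logs the Broker depends on for real-time apply are in place is to query v$standby_log on the new standby; the group numbers and 50 MB size below only illustrate what this environment uses:

SQL> SELECT group#, thread#, bytes/1024/1024 AS mb, status FROM v$standby_log;

    GROUP#    THREAD#         MB STATUS
---------- ---------- ---------- ----------
         5          1         50 ACTIVE
         6          1         50 UNASSIGNED
         8          2         50 ACTIVE
         9          2         50 UNASSIGNED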
DETAILS
If you look at all the logs up to this point (especially the last ones) you will see how they fit together; all this information, presented here serially, actually happens in parallel. The flow is basically a reinstate command issued through dgmgrl, which is communicated to the Broker on the primary and on the standby. During the reinstate process the old primary undergoes a flashback, is restarted, synchronized and configured as the new standby. Since the primary database is operating in MAXIMUM AVAILABILITY, everything ends up synchronized, primary and standby.
All of these commands, which before were manual (see the manual failover and reinstate articles), were now executed automatically by the Broker. We did not need to find the SCN at which the role change occurred, much less run the flashback and fix the CRS.
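Just to make the comparison concrete, the manual equivalent would look roughly like the sketch below, using the SCN taken from the logs above ("Standby became primary SCN: 8878905"); treat it as an outline of the SQL side only, run on the old primary, not a recipe:

SQL> SHUTDOWN IMMEDIATE;
SQL> STARTUP MOUNT;
SQL> FLASHBACK DATABASE TO SCN 8878905;
SQL> ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
SQL> SHUTDOWN IMMEDIATE;
SQL> STARTUP MOUNT;
SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE DISCONNECT;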
Of course, all of this required some configuration: do you think the Broker would be able to connect to and restart the instances without them being statically registered in the listener? Remember the StaticConnectIdentifier of the Broker that we set up at the end of the previous article? See how some earlier steps (which may have felt tedious) save time later?
One detail when powering the other instances of the old primary back on. When they come up they will hit "ORA-01153: an incompatible media recovery is active" and "ORA-10458: standby database requires recovery"; this is "normal" and will force a shutdown abort of the instance. Just bring the instance back up in mount mode and everything returns to normal. This is related to a CRS detail that I explain in the next paragraph.
Since not everything is rosy in a DBA's life, you will need to adjust the CRS of the databases. The Broker adjusts the CRS of the environment, but only partially. It "forgets" to change the start mode of the databases, so the old primary (new standby) will still be set to start in open mode and the new primary will still be set to mount mode (since it was the old standby). See this below, along with its correction:
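In practice that means something along these lines on the node of the aborted instance (instance maa2 here just as an example):

[oracle@rac11pri02 ~]$ srvctl start instance -d maa -i maa2 -o mount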
On the old primary (new standby):

[oracle@rac11pri02 ~]$ srvctl config database -d maa -v
Database unique name: maa
Database name: maa
Oracle home: /u01/app/oracle/product/11.2.0.3/db_1
Oracle user: oracle
Spfile: +DATA/maa/spfilemaa.ora
Domain:
Start options: open
Stop options: immediate
Database role: PHYSICAL_STANDBY
Management policy: AUTOMATIC
Server pools: maa
Database instances: maa1,maa2
Disk Groups: DATA,FRA
Mount point paths:
Services:
Type: RAC
Database is administrator managed
[oracle@rac11pri02 ~]$ srvctl modify database -d maa -s MOUNT
[oracle@rac11pri02 ~]$ srvctl config database -d maa -v
Database unique name: maa
Database name: maa
Oracle home: /u01/app/oracle/product/11.2.0.3/db_1
Oracle user: oracle
Spfile: +DATA/maa/spfilemaa.ora
Domain:
Start options: mount
Stop options: immediate
Database role: PHYSICAL_STANDBY
Management policy: AUTOMATIC
Server pools: maa
Database instances: maa1,maa2
Disk Groups: DATA,FRA
Mount point paths:
Services:
Type: RAC
Database is administrator managed
[oracle@rac11pri02 ~]$
On the new primary (old standby):
[oracle@rac11stb02 ~]$ srvctl config database -d maastb -v
Database unique name: maastb
Database name:
Oracle home: /u01/app/oracle/product/11.2.0.3/db_1
Oracle user: oracle
Spfile:
Domain:
Start options: mount
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: maastb
Database instances: maastb1,maastb2
Disk Groups: DG01,FRA
Mount point paths:
Services:
Type: RAC
Database is administrator managed
[oracle@rac11stb02 ~]$ srvctl modify database -d maastb -s OPEN
[oracle@rac11stb02 ~]$ srvctl config database -d maastb -v
Database unique name: maastb
Database name:
Oracle home: /u01/app/oracle/product/11.2.0.3/db_1
Oracle user: oracle
Spfile:
Domain:
Start options: open
Stop options: immediate
Database role: PRIMARY
Management policy: AUTOMATIC
Server pools: maastb
Database instances: maastb1,maastb2
Disk Groups: DG01,FRA
Mount point paths:
Services:
Type: RAC
Database is administrator managed
[oracle@rac11stb02 ~]$
If you configure the CRS before starting the other standby instances, you will not hit the ORA errors listed above.
FINAL ENVIRONMENT
We have arrived at an environment with a DG configuration whose primary database is maastb, which took on that role after the failover of its old primary. The old primary, the maa database, went through the failover and a subsequent reinstate to be reintegrated into the environment as a physical standby. See the final environment:
DGMGRL> SHOW CONFIGURATION;

Configuration - dgrac

  Protection Mode: MaxAvailability
  Databases:
    maastb - Primary database
    maa    - Physical standby database

Fast-Start Failover: DISABLED

Configuration Status:
SUCCESS

DGMGRL>
All the steps we followed so far allowed us to do this through the Broker. Using it, compared with the "manual" way, saved us work and gave us fewer worries. We are still not in an ideal environment; we need to configure Fast-Start Failover to have a comfortable, self-managing environment.
Finally, I recommend starting the other instances of the old primary and taking a backup of the environment.
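Something as simple as the sketch below, run against the new primary with RMAN, already covers that last recommendation (adjust it to your own backup strategy and catalog):

[oracle@rac11stb01 ~]$ rman target /

RMAN> BACKUP DATABASE PLUS ARCHIVELOG;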