While having fun playing with HADR on AWS, we bumped into a bunch of interesting problems. We searched and experimented for a while, but found limited publicly available resources to refer to. We hope you don't mind sharing some insights/suggestions with us covering the following aspects:
1. Recommendations on using a virtual IP (vip) or not:
1.1 On AWS we realize IBM.ServiceIP may not be able to manage the vip, for two reasons:
1.1.1 a single subnet can't span multiple AZs;
1.1.2 TSA doesn't check whether the vip is really assigned to the host.
1.2 IBM.Application may be able to manage vip failover in an AWS multi-AZ deployment through customized scripting, but the failover operation itself bounds the RTO and could be a SPOF in its own right. Alternatively, if we don't use a vip, we are trying to cover as many caveats as possible. Is there anything else we should watch out for, besides split-brain and client re-route complexity? We are aware a vip could help avoid split-brain and reduce client re-route complexity, but the same goals could be accomplished through careful orchestration without one. Btw, we may still want to use ROS (yes, we have been warned about ROS limitations, but since we are paying for it anyway, we can't resist the temptation to use it...), so ACR/ClientAffinity is probably not the final solution.
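To make the "careful orchestration without vip" idea concrete, here is a minimal sketch of client-side re-route against a static member list: try each HADR host in order with a few retries, and connect to the first one that answers. The hostnames and the `connect` callable are hypothetical placeholders (in a real client, `connect` would wrap something like `ibm_db.connect`); this is not ACR, just an illustration of the retry loop a client-side approach implies.

```python
def connect_with_reroute(hosts, connect, retries_per_host=2):
    """Try each HADR member in order; return (host, connection) for the
    first successful attempt.

    `connect` is any callable taking a hostname and returning a connection
    object, raising OSError on failure (hypothetical stand-in for a real
    driver call such as ibm_db.connect).
    """
    last_err = None
    for host in hosts:
        for _ in range(retries_per_host):
            try:
                return host, connect(host)
            except OSError as err:
                last_err = err  # remember the failure, fall through to next host
    raise ConnectionError(f"no HADR member reachable: {last_err}")


if __name__ == "__main__":
    # Simulated connector: the primary in AZ1 is down, the standby answers.
    def fake_connect(host):
        if host != "standby-az2.example.com":
            raise OSError("primary down")
        return f"conn->{host}"

    host, conn = connect_with_reroute(
        ["primary-az1.example.com", "standby-az2.example.com"], fake_connect)
    print(host)  # standby-az2.example.com
```

The obvious caveat, matching the split-brain concern above: this loop happily connects to whichever member answers, so the orchestration must guarantee only one member accepts writes at a time.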
2. Recommendations on log file management, assuming a cheap and performant shared storage solution is not available in a multi-AZ AWS deployment:
2.1 ARCHIVE LOG protection:
What's the best practice for making archive logs always available to all HADR nodes, in a way that survives a single AZ-wide EBS failure? We can think of pointing logarchmeth1/2 at EFS or a third-party solution, or even using filesystem/block-storage replication (e.g. DRBD), but that's quite a big change for us. We could set up customized log-shipping, but that still doesn't protect archive logs that haven't been shipped yet. In our current setup, archive logs may be required by the QCapture program, or when adding an extra Standby.
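For reference, the "customized log-shipping" option mentioned above can be as simple as a periodic copy of newly closed archive logs from the LOGARCHMETH1 target to a second location in another AZ (a cross-AZ mount, or a staging directory for an `aws s3 sync`). The paths here are hypothetical; this sketch only illustrates the incremental-copy step, and it shares the stated weakness that logs not yet shipped are unprotected.

```python
import shutil
from pathlib import Path


def ship_new_archive_logs(src_dir, dst_dir):
    """Copy archive log files present in src_dir but missing from dst_dir.

    Returns the list of file names shipped. In a real deployment, src_dir
    would be the LOGARCHMETH1 target directory and dst_dir a location in a
    different AZ (both paths are assumptions for this sketch). Db2 archive
    logs are named like S0000001.LOG, hence the glob pattern.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    shipped = []
    for log in sorted(src.glob("*.LOG")):
        target = dst / log.name
        if not target.exists():          # ship each closed log exactly once
            shutil.copy2(log, target)    # copy2 preserves timestamps
            shipped.append(log.name)
    return shipped
```

Run from cron on the primary (and, after takeover, on the new primary); idempotent, so re-runs only copy what is missing.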
2.2 ACTIVE LOG protection:
MIRRORLOGPATH is an option, but we probably won't go for it due to the performance penalty. We understand active log entries should be available on the Principal Standby in PEER state anyway, but just in case HADR falls out of PEER state...
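For completeness, enabling the MIRRORLOGPATH option discussed above is a one-line configuration change (the database name and mirror path below are placeholders; the mirror path should sit on a separate device, which is also where the write-amplification penalty comes from):

```shell
# Hypothetical example: mirror active logs to a second volume.
# SAMPLE and /db2mirror/logs are placeholder names.
db2 update db cfg for SAMPLE using MIRRORLOGPATH /db2mirror/logs
```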
3. Best practice references on multi-AZ HADR deployment in general.
Really appreciate your inputs!
Btw, we use DB2 V10.5 FP8 on Linux.
Rui Chen[Organization Members] @ Jan 22, 2018 - 04:27 PM (America/Eastern)