Solving Elusive Problems – Oracle Connectivity Timeout

Hopefully this will help others who come across a similar issue and serve as a guide on how to approach hard-to-solve problems.

The error causes the client application to time out. There is no apparent pattern or specific time of day when it is most likely to occur.

The error:

Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for 32-bit Windows: Version 10.2.0.1.0 – Production
Windows NT TCP/IP NT Protocol Adapter for 32-bit Windows: Version 10.2.0.1.0 – Production
Time: 09-JUL-2012 22:12:23
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 60
nt OS err code: 0
Client address: <unknown>

A Google search points to a wide variety of issues, but none of these discussions identify a specific solution (root cause) for the error.

Links below:

http://www.experts-exchange.com/Database/Oracle/Q_23523923.html

http://www.unix.com/red-hat/187125-tns-timeout-error-when-connecting-sqlplus-through-scripts-only.html

http://blockdump.blogspot.com/2012/07/connection-problems-inbound-connection.html

https://johnpjeffries.wordpress.com/tag/oracle-streams/

http://pavandba.com/category/networking-with-oracle/

http://oracle.veryoo.com/2012/03/tns-12535-tnsoperation-timed-out.html

These problems take an inordinate amount of resources and money to solve because they involve multiple disciplines. It generally starts with the Application Team working on the client error, but soon ends up at a dead end. Months pass with no clear solution in sight, as the sporadic nature of the errors makes them very time consuming to troubleshoot.

In this particular case, a case opened with Oracle resulted in finger-pointing.

More resources are assigned to the issue from the network and security teams, but each is an expert only in their own domain. A problem that spans multiple domains requires these teams to build bridges to identify and define the issue in pursuit of a solution.

Troubleshooting Methodology:

Investigation

  • Problem Statement: Create a clear, concise statement of the problem.
  • Problem Description: Identify the symptoms. What works? What doesn’t?
  • Identify Differences and Changes: What has changed recently? What is unique about this system?

Analysis

  • Brainstorm: Gather Hypotheses: What might have caused the problem?
  • Identify Likely Causes: Which hypotheses are most likely?
  • Test Possible Causes: Schedule the testing for the most likely hypotheses. Perform any non-disruptive testing immediately.

Implementation

  • Implement the Fix: Complete the repair.
  • Verify the Fix: Is the problem really fixed?
  • Document the Resolution: What did we do? Get a sign-off from the business owner.

The process:

A complete understanding, from A to Z, of the technology at play is fundamental to tackling such a problem, which is why tight team integration and coordination is paramount.

Understanding the Oracle RAC environment is the first step and this video does a pretty good job at laying the foundation.

http://www.youtube.com/watch?v=dS9uUXXTTko

 

We need to reduce the variables, leaving the client and a single host to communicate, so we can compare a normal communication with an abnormal one.

We need to remove the RAC elements, either by shutting down all nodes beyond a single server or by removing entries from the tnsnames.ora file, so that we connect to a single node and not the whole RAC.

Additionally, we should use IP addresses in the file or, if names are used, make sure they are defined in the hosts file so we can rule out any DNS issues.
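
As a rough sketch, a single-node entry in tnsnames.ora could look like the following (the alias and service name MYDB are placeholders; the host IP and port match the captures further down):

MYDB_NODE1 =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = 192.168.0.10)(PORT = 4568))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = MYDB)
    )
  )

A quick "tnsping MYDB_NODE1" from the client confirms the entry resolves and the listener answers before any application-level testing.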

At this point we can bring in an admin's friendly tool, Wireshark, and mirror traffic from the client to the sniffer.
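
If port mirroring is not available, an equivalent capture can be taken directly on the database server (or with an equivalent tool on the Windows client) and opened in Wireshark. A minimal sketch, assuming the capture interface is eth0 and using the client and server addresses from the traces below:

# Capture all traffic between client and server, including any ICMP errors
tcpdump -i eth0 -w oracle_client.pcap host 192.168.0.101 and host 192.168.0.10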

A normal communication:

1001 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19592 Ack=19023 Win=65535 Len=52
1002 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [PSH, ACK] Seq=19023 Ack=19644 Win=62780 Len=22
1003 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19644 Ack=19045 Win=65513 Len=156
1004 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [PSH, ACK] Seq=19045 Ack=19800 Win=62780 Len=22
1005 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19800 Ack=19067 Win=65491 Len=13
1006 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [PSH, ACK] Seq=19067 Ack=19813 Win=62780 Len=17
1007 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19813 Ack=19084 Win=65474 Len=10
1008 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [FIN, ACK] Seq=19823 Ack=19084 Win=65474 Len=0
1009 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [FIN, ACK] Seq=19084 Ack=19824 Win=62780 Len=0
1010 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [ACK] Seq=19824 Ack=19085 Win=65474 Len=0

We can see above the host 192.168.0.101 communicating from an arbitrary port with the server on port 4568, which is the port the listener is configured on for this database (SID). This snippet is the end of a communication: the host sends data with the TCP PSH flag set, and the server acknowledges (ACK) and answers once the data is received.

Finally, we see the client (192.168.0.101) send a TCP FIN flag, signaling that it has no more data and asking the server to acknowledge; the server replies with its own FIN/ACK, and the exchange ends with a final ACK from the client.

An abnormal communication:

1011 9:45:09 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1012 9:45:09 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1013 9:45:11 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1014 9:45:11 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1015 9:45:18 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1016 9:45:18 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1017 9:45:31 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1018 9:45:31 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1019 9:45:34 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1020 9:45:34 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1021 9:45:40 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1022 9:45:40 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)

Above we see what a failed communication, which caused the timeout error in the application, looks like.

We see the client use an arbitrary port and send a TCP packet with the SYN flag set, trying to synchronize sequence numbers to begin communications, and the server replies with an ICMP Destination Unreachable (Port Unreachable).

We see the client try three times before changing the source TCP port (incrementing it by one) and trying unsuccessfully three more times, after which the application gives up and times out.

Initial Conclusion:

We can conclude that, contrary to Oracle's assertion, this is not a network issue.

The frame was successfully routed across the network: the router ARP'ed for the host, got the response, and sent the frame. Furthermore, the intended destination host was online and accepted the frame into its communication buffer. The frame was then handed to TCP, which tried to deliver the data to the destination port (4568); the port process either didn't exist or did not reply expeditiously, so the protocol handler reported Destination Unreachable – Port Unreachable.

The solution:

So it's either kick it back to Oracle or find the solution ourselves.

A list of possibilities emerged from troubleshooting and online forums, but these mostly patch over the issue by increasing timeout parameters at the application or OS layer rather than addressing the root cause.

  1. Change the database SID
  2. Disable iptables
  3. Set SQLNET.INBOUND_CONNECT_TIMEOUT=0 in the sqlnet.ora and listener.ora files (see the sketch after this list)
  4. Kernel level changes to the OS to increase TCP timeout parameters.
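
As an illustration of option 3, the workaround usually amounts to the following entries (LISTENER is assumed to be the default listener name; in listener.ora the parameter carries the listener name as a suffix):

# sqlnet.ora
SQLNET.INBOUND_CONNECT_TIMEOUT = 0

# listener.ora
INBOUND_CONNECT_TIMEOUT_LISTENER = 0

Setting the value to 0 disables the inbound connect timeout entirely, which hides the symptom but, as noted above, does not address the root cause.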

Taking a closer look and comparing the two packet captures, we see that the only difference between them is the source port. The source port is not something you would generally look at when putting security in place, because you would normally lock down your host based on the port it is listening on and restrict who has access to that port.

It turns out that an automatically generated "iptables" rule set blocked a range of ports (4660-4678) used for peer-to-peer (P2P) traffic.

Every time the client picked an arbitrary source TCP port to communicate with the server and it happened to fall within the range 4660-4678, the connection would be rejected by "iptables" with an icmp-port-unreachable.
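
As a rough sketch of how to confirm and fix this on the database server (the exact chain, rule position, and rule text are assumptions; only the general form of an auto-generated P2P reject rule is shown):

# List the INPUT chain with rule numbers and look for the 4660-4678 range
iptables -L INPUT -n --line-numbers | grep 4660

# The offending rule was of this general form, matching the client's source port:
#   iptables -A INPUT -p tcp --sport 4660:4678 -j REJECT --reject-with icmp-port-unreachable

# Once located, delete it by its rule number (here assumed to be 5)
iptables -D INPUT 5

Removing (or correcting) the rule addresses the root cause without touching any timeout parameters.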

 

Enterprise Backup Network with ANIRA

One of the most critical, if not the most critical, components of the IT infrastructure is the network, although it is often taken for granted. In today's client-server environment, and even more so with the cloud computing model, offices without network connectivity are unable to carry out their daily business.

If your business, or part of it, is disconnected from the rest, the impact will be significant, not to mention the angry customers.

This post goes into what it takes to implement a cost-effective backup network, should the primary network link fail.

The scenario described includes multiple remote offices or field locations connected via bonded T1 circuits to an MPLS network. All major services are provided to these remote offices from a central location, as is almost always the case, making an outage fatal to the remote office.

Despite redundant T1 circuits providing an aggregate of 3 Mbps to the remote office, CRC or physical errors on one of the circuits will bring the bonded circuit down, so relying on the second active circuit as a backup is a flawed approach.

The router performs only WAN functionality, leaving all other routing and VLAN based-network segmentation and security within the office to a layer-3 capable switch.

The routing protocol of choice is BGP as it is natively used by the MPLS network.

The backup link we are looking for needs to be cost effective, meaning it should not add significantly to the bottom line until it is needed. It also requires sufficient bandwidth for data and voice applications to run at an acceptable level from the remote office.

AT&T provides a product that fits this description called ANIRA (AT&T Netgate). There is a minimal monthly rate, a cap of 1Mpbs aggregate bandwidth and additional charge for usage.

This could be done with off-the-shelf equipment in lieu of the ANIRA product, but that approach brings additional challenges, such as creating the VPN tunnels to equipment at the main office and correctly propagating routes when the main circuit at the remote office goes down. The AT&T service provides management of the backup devices as well as connectivity through a VPN tunnel into the MPLS cloud.

The image above illustrates the network topology.

Should the remote office lose network connectivity, traffic will start to flow through the Netgate, which will trigger the device to connect, initiate a VPN tunnel, and advertise all routes belonging to that office into the MPLS network.

The protocol used to determine which path traffic will take is VRRP, the Virtual Router Redundancy Protocol. This allows the default route used by the switch to point to a virtual address that floats between the main router and the backup device.

Cisco configuration outlined below:

track 1 interface Multilink ip routing

interface FastEthernet0/0
description Internal Network
ip address 192.168.0.2 255.255.255.0
duplex auto
speed auto
vrrp 1 description LAN
vrrp 1 ip 192.168.0.1
vrrp 1 preempt delay minimum 60
vrrp 1 priority 110
vrrp 1 track 1 decrement 100
arp timeout 60
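
Once the configuration is in place, a quick sanity check confirms which device holds the virtual address and whether the tracked interface is up (these are standard IOS show commands; exact output fields vary by version):

show vrrp brief
show track 1

With the Multilink interface up, the main router should report itself as Master for group 1 with priority 110; if the tracked object goes down, its priority drops to 10 and the Netgate takes over.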

The Netgate device has an IP address of 192.168.0.3 and a VRRP IP address of 192.168.0.1.

A brief description of relevant configuration below:

The VRRP IP address 192.168.0.1 floats between the routers (main router/Netgate) depending on which one has the highest priority. The Netgate has a default priority (weight) of 50, plus an additional 25 when the VPN is connected. In the normal state we want the main router to handle traffic, so we set its priority higher than 75, the maximum for the Netgate.

vrrp 1 priority 110

To decide whether the default route should move to the Netgate, we need to know if the T1s are down. In this example, having one T1 down should not be the deciding factor, because the remaining T1 can handle the traffic, so we choose to monitor the bonded interface at the IP layer.

track 1 interface Multilink ip routing

In the event of an outage, the main router will need to lower its priority (weight) below that of the Netgate, so that the Netgate becomes the new default router at 192.168.0.1.

vrrp 1 track 1 decrement 100

This event brings the main router's priority down to 10, well below the Netgate's minimum of 50.

When the main circuit comes back online we want to switch back to it and bring down the VPN tunnel. We accomplish this using the following command: vrrp 1 preempt

However, when a T1 comes back up it's usually not a clean process, and the telco might also be performing intrusive testing, so it's important that we allow some time before switching traffic back to the main circuit.

vrrp 1 preempt delay minimum 60

This configuration should provide an automatic, redundant backup network link for remote offices at an affordable price.