Solving Elusive Problems – Oracle Connectivity Timeout

Hopefully this will help others who come across an issue like this and as a guide on how to approach hard to solve problems.

The error causes the client application to timeout. There is no apparent pattern or specific time of day when its most likely to occur.

The error:

Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for 32-bit Windows: Version 10.2.0.1.0 – Production
Windows NT TCP/IP NT Protocol Adapter for 32-bit Windows: Version 10.2.0.1.0 – Production
Time: 09-JUL-2012 22:12:23
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 60
nt OS err code: 0
Client address: <unknown>

A Google search points to a wide variety of issues but no specific solution (root cause) was found in any of these discussions to address the error.

Links below:

http://www.experts-exchange.com/Database/Oracle/Q_23523923.html

http://www.unix.com/red-hat/187125-tns-timeout-error-when-connecting-sqlplus-through-scripts-only.html

http://blockdump.blogspot.com/2012/07/connection-problems-inbound-connection.html

https://johnpjeffries.wordpress.com/tag/oracle-streams/

http://pavandba.com/category/networking-with-oracle/

http://oracle.veryoo.com/2012/03/tns-12535-tnsoperation-timed-out.html

These problems take an inordinate amount of resources and money to solve because it involves multiple disciplines. It generally starts with the Application Team working on the client error, but soon ends up at a dead end. Months pass with no clear solution in sight as the sporadic nature of the errors make it very time consuming to troubleshoot.

In this particular case an opened case with Oracle resulted in finger pointing.

More resources are assigned to solve the issue, from the network and security teams but each an expert in their own domain. A problem that spans across multiple domains, requires these teams to build bridges to identify and define the issue in pursuit of finding a solution.

Troubleshooting Methodology:

Investigation

  • Problem Statement: Create a clear, concise statement of the problem.
  • Problem Description: Identify the symptoms. What works? What doesn’t?
  • Identify Differences and Changes: What has changed recently? What is unique about this system?

Analysis

  • Brainstorm: Gather Hypotheses: What might have caused the problem?
  • Identify Likely Causes: Which hypotheses are most likely?
  • Test Possible Causes: Schedule the testing for the most likely hypotheses. Perform any non-disruptive testing immediately.

Implementation

  • Implement the Fix: Complete the repair.
  • Verify the Fix: Is the problem really fixed?
  • Document the Resolution: What did we do? Get a sign-off from the business owner.

The process:

A complete understanding from A to Z of the technology at play is fundamental to tackle such a problem, which is why tight team integration and coordination is paramount.

Understanding the Oracle RAC environment is the first step and this video does a pretty good job at laying the foundation.

httpv://www.youtube.com/watch?v=dS9uUXXTTko

 

We need to reduce all variables leaving the client and a single host to communicate with, so we can compare what a normal communication with an abnormal communication.

We need to remove the RAC elements by either shutting down nodes beyond a single server or by removing tnsnames entries in the tnsnames.ora file, so that we connect to a single node and not the whole RAC.

Additionally we should use IP addresses in the file or if names are used, that they are defined in the hosts file so we can rule out any DNS issues.

At this point we can connect an admins friendly tool, Wireshark and mirror traffic from the client to the sniffer.

A normal communication:

1001 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19592 Ack=19023 Win=65535 Len=52
1002 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [PSH, ACK] Seq=19023 Ack=19644 Win=62780 Len=22
1003 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19644 Ack=19045 Win=65513 Len=156
1004 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [PSH, ACK] Seq=19045 Ack=19800 Win=62780 Len=22
1005 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19800 Ack=19067 Win=65491 Len=13
1006 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [PSH, ACK] Seq=19067 Ack=19813 Win=62780 Len=17
1007 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19813 Ack=19084 Win=65474 Len=10
1008 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [FIN, ACK] Seq=19823 Ack=19084 Win=65474 Len=0
1009 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [FIN, ACK] Seq=19084 Ack=19824 Win=62780 Len=0
1010 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [ACK] Seq=19824 Ack=19085 Win=65474 Len=0

We can see above the host 192.168.0.101 communicating on an arbitrary port with the server on port 4568, which is actually the SID/listener configured for the database. This snippet is the end of a communication as we see the host sending data with TCP flag PSH with an ACK once data is received by the server and answer from the server.

Finally we see the client (192.168.0.101) sending an TCP FIN flag, signaling no more data and asking the server to acknowledge, to which the server replies to, ending for an ACK from the client.

An abnormal communication:

1011 9:45:09 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1012 9:45:09 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1013 9:45:11 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1014 9:45:11 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1015 9:45:18 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1016 9:45:18 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1017 9:45:31 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1018 9:45:31 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1019 9:45:34 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1020 9:45:34 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1021 9:45:40 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1022 9:45:40 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)

Above we see what a failed communication which caused the timeout error on the application looks like.

We see the client use an arbitrary port and send a TCP packet with a SYN flag trying to synchronize sequence numbers to begin communications, and the server replies with an ICMP destination unreachable (port unreachable).

We see the client try three times before changing the source TCP port by adding one to the number and trying unsuccessfully three more times, before the application gives up and times out.

Initial Conclusion:

We can conclude that contrary to Oracle’s assertion that it was a network issue, it is not.

The frame was successfully routed across the network, the router ARP’ed for the host, got the response and sent the frame. Furthermore, the intended destination host was on-line and willing to accept the frame into its communication buffer. The frame was then processed by TCP. The protocol TCP tries to send the data up to the destination port number (4568) and the port process didn’t exist or did not reply expeditiously. The protocol handler then reports Destination Unreachable – Port Unreachable.

The solution:

So it’s kick it back to Oracle or find the solution.

A list of possibilities emerged from troubleshooting and forums online, but all patch the issue by increasing timeout parameters either at the application layer or the OS layer; not really addressing the root cause.

  1. Change the database SID
  2. Disable iptables
  3. SQLNET.INBOUND_CONNECT_TIMEOUT=0 change to listener.ora and sqlnet.ora files
  4. Kernel level changes to the OS to increase TCP timeout parameters.

Taking a closer look and comparing the two packet captures, we see that the only difference between them is the source port. The source port is not something you would generally look at when putting in place security because you would lock down your host by whatever port it happens to be listening on and restricting who has access to that port.

Turns out that an automatically generated “iptables” blocked a range of 18 ports (4660-4678) used for (P2P).

Every time the client picked an arbitrary source TCP port to communicate with the server, and it happened to fall within the range of (4660-4678), it would be rejected by “iptables” with an icmp-port-unreachable.

 

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s