Hopefully this will help others who come across an issue like this and as a guide on how to approach hard to solve problems.
The error causes the client application to timeout. There is no apparent pattern or specific time of day when its most likely to occur.
Fatal NI connect error 12170.
TNS for 32-bit Windows: Version 10.2.0.1.0 – Production
Windows NT TCP/IP NT Protocol Adapter for 32-bit Windows: Version 10.2.0.1.0 – Production
Time: 09-JUL-2012 22:12:23
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 60
nt OS err code: 0
Client address: <unknown>
A Google search points to a wide variety of issues but no specific solution (root cause) was found in any of these discussions to address the error.
These problems take an inordinate amount of resources and money to solve because it involves multiple disciplines. It generally starts with the Application Team working on the client error, but soon ends up at a dead end. Months pass with no clear solution in sight as the sporadic nature of the errors make it very time consuming to troubleshoot.
In this particular case an opened case with Oracle resulted in finger pointing.
More resources are assigned to solve the issue, from the network and security teams but each an expert in their own domain. A problem that spans across multiple domains, requires these teams to build bridges to identify and define the issue in pursuit of finding a solution.
- Problem Statement: Create a clear, concise statement of the problem.
- Problem Description: Identify the symptoms. What works? What doesn’t?
- Identify Differences and Changes: What has changed recently? What is unique about this system?
- Brainstorm: Gather Hypotheses: What might have caused the problem?
- Identify Likely Causes: Which hypotheses are most likely?
- Test Possible Causes: Schedule the testing for the most likely hypotheses. Perform any non-disruptive testing immediately.
- Implement the Fix: Complete the repair.
- Verify the Fix: Is the problem really fixed?
- Document the Resolution: What did we do? Get a sign-off from the business owner.
A complete understanding from A to Z of the technology at play is fundamental to tackle such a problem, which is why tight team integration and coordination is paramount.
Understanding the Oracle RAC environment is the first step and this video does a pretty good job at laying the foundation.
We need to reduce all variables leaving the client and a single host to communicate with, so we can compare what a normal communication with an abnormal communication.
We need to remove the RAC elements by either shutting down nodes beyond a single server or by removing tnsnames entries in the tnsnames.ora file, so that we connect to a single node and not the whole RAC.
Additionally we should use IP addresses in the file or if names are used, that they are defined in the hosts file so we can rule out any DNS issues.
At this point we can connect an admins friendly tool, Wireshark and mirror traffic from the client to the sniffer.
A normal communication:
|1001||11:40:10||192.168.0.101||192.168.0.10||TCP||4655 > 4568 [PSH, ACK] Seq=19592 Ack=19023 Win=65535 Len=52|
|1002||11:40:10||192.168.0.10||192.168.0.101||TCP||4568 > 4655 [PSH, ACK] Seq=19023 Ack=19644 Win=62780 Len=22|
|1003||11:40:10||192.168.0.101||192.168.0.10||TCP||4655 > 4568 [PSH, ACK] Seq=19644 Ack=19045 Win=65513 Len=156|
|1004||11:40:10||192.168.0.10||192.168.0.101||TCP||4568 > 4655 [PSH, ACK] Seq=19045 Ack=19800 Win=62780 Len=22|
|1005||11:40:10||192.168.0.101||192.168.0.10||TCP||4655 > 4568 [PSH, ACK] Seq=19800 Ack=19067 Win=65491 Len=13|
|1006||11:40:10||192.168.0.10||192.168.0.101||TCP||4568 > 4655 [PSH, ACK] Seq=19067 Ack=19813 Win=62780 Len=17|
|1007||11:40:10||192.168.0.101||192.168.0.10||TCP||4655 > 4568 [PSH, ACK] Seq=19813 Ack=19084 Win=65474 Len=10|
|1008||11:40:10||192.168.0.101||192.168.0.10||TCP||4655 > 4568 [FIN, ACK] Seq=19823 Ack=19084 Win=65474 Len=0|
|1009||11:40:10||192.168.0.10||192.168.0.101||TCP||4568 > 4655 [FIN, ACK] Seq=19084 Ack=19824 Win=62780 Len=0|
|1010||11:40:10||192.168.0.101||192.168.0.10||TCP||4655 > 4568 [ACK] Seq=19824 Ack=19085 Win=65474 Len=0|
We can see above the host 192.168.0.101 communicating on an arbitrary port with the server on port 4568, which is actually the SID/listener configured for the database. This snippet is the end of a communication as we see the host sending data with TCP flag PSH with an ACK once data is received by the server and answer from the server.
Finally we see the client (192.168.0.101) sending an TCP FIN flag, signaling no more data and asking the server to acknowledge, to which the server replies to, ending for an ACK from the client.
An abnormal communication:
|1011||9:45:09||192.168.0.101||192.168.0.10||TCP||4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460|
|1012||9:45:09||192.168.0.10||192.168.0.101||ICMP||Destination unreachable (Port unreachable)|
|1013||9:45:11||192.168.0.101||192.168.0.10||TCP||4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460|
|1014||9:45:11||192.168.0.10||192.168.0.101||ICMP||Destination unreachable (Port unreachable)|
|1015||9:45:18||192.168.0.101||192.168.0.10||TCP||4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460|
|1016||9:45:18||192.168.0.10||192.168.0.101||ICMP||Destination unreachable (Port unreachable)|
|1017||9:45:31||192.168.0.101||192.168.0.10||TCP||4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460|
|1018||9:45:31||192.168.0.10||192.168.0.101||ICMP||Destination unreachable (Port unreachable)|
|1019||9:45:34||192.168.0.101||192.168.0.10||TCP||4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460|
|1020||9:45:34||192.168.0.10||192.168.0.101||ICMP||Destination unreachable (Port unreachable)|
|1021||9:45:40||192.168.0.101||192.168.0.10||TCP||4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460|
|1022||9:45:40||192.168.0.10||192.168.0.101||ICMP||Destination unreachable (Port unreachable)|
Above we see what a failed communication which caused the timeout error on the application looks like.
We see the client use an arbitrary port and send a TCP packet with a SYN flag trying to synchronize sequence numbers to begin communications, and the server replies with an ICMP destination unreachable (port unreachable).
We see the client try three times before changing the source TCP port by adding one to the number and trying unsuccessfully three more times, before the application gives up and times out.
We can conclude that contrary to Oracle’s assertion that it was a network issue, it is not.
The frame was successfully routed across the network, the router ARP’ed for the host, got the response and sent the frame. Furthermore, the intended destination host was on-line and willing to accept the frame into its communication buffer. The frame was then processed by TCP. The protocol TCP tries to send the data up to the destination port number (4568) and the port process didn’t exist or did not reply expeditiously. The protocol handler then reports Destination Unreachable – Port Unreachable.
So it’s kick it back to Oracle or find the solution.
A list of possibilities emerged from troubleshooting and forums online, but all patch the issue by increasing timeout parameters either at the application layer or the OS layer; not really addressing the root cause.
- Change the database SID
- Disable iptables
- SQLNET.INBOUND_CONNECT_TIMEOUT=0 change to listener.ora and sqlnet.ora files
- Kernel level changes to the OS to increase TCP timeout parameters.
Taking a closer look and comparing the two packet captures, we see that the only difference between them is the source port. The source port is not something you would generally look at when putting in place security because you would lock down your host by whatever port it happens to be listening on and restricting who has access to that port.
Turns out that an automatically generated “iptables” blocked a range of 18 ports (4660-4678) used for (P2P).
Every time the client picked an arbitrary source TCP port to communicate with the server, and it happened to fall within the range of (4660-4678), it would be rejected by “iptables” with an icmp-port-unreachable.