Solving Elusive Problems – Oracle Connectivity Timeout

Hopefully this will help others who come across a similar issue, and serve as a guide on how to approach hard-to-solve problems.

The error causes the client application to time out. There is no apparent pattern or specific time of day when it is most likely to occur.

The error:

Fatal NI connect error 12170.
VERSION INFORMATION:
TNS for 32-bit Windows: Version 10.2.0.1.0 – Production
Windows NT TCP/IP NT Protocol Adapter for 32-bit Windows: Version 10.2.0.1.0 – Production
Time: 09-JUL-2012 22:12:23
Tracing not turned on.
Tns error struct:
ns main err code: 12535
TNS-12535: TNS:operation timed out
ns secondary err code: 12560
nt main err code: 505
TNS-00505: Operation timed out
nt secondary err code: 60
nt OS err code: 0
Client address: <unknown>

A Google search points to a wide variety of issues but no specific solution (root cause) was found in any of these discussions to address the error.

Links below:

http://www.experts-exchange.com/Database/Oracle/Q_23523923.html

http://www.unix.com/red-hat/187125-tns-timeout-error-when-connecting-sqlplus-through-scripts-only.html

http://blockdump.blogspot.com/2012/07/connection-problems-inbound-connection.html

https://johnpjeffries.wordpress.com/tag/oracle-streams/

http://pavandba.com/category/networking-with-oracle/

http://oracle.veryoo.com/2012/03/tns-12535-tnsoperation-timed-out.html

These problems take an inordinate amount of resources and money to solve because they involve multiple disciplines. It generally starts with the application team working on the client error, but that soon reaches a dead end. Months pass with no clear solution in sight, as the sporadic nature of the errors makes troubleshooting very time consuming.

In this particular case, a support case opened with Oracle resulted in finger pointing.

More resources are assigned to solve the issue, from the network and security teams, but each is an expert only in their own domain. A problem that spans multiple domains requires these teams to build bridges in order to identify and define the issue and find a solution.

Troubleshooting Methodology:

Investigation

  • Problem Statement: Create a clear, concise statement of the problem.
  • Problem Description: Identify the symptoms. What works? What doesn’t?
  • Identify Differences and Changes: What has changed recently? What is unique about this system?

Analysis

  • Brainstorm: Gather Hypotheses: What might have caused the problem?
  • Identify Likely Causes: Which hypotheses are most likely?
  • Test Possible Causes: Schedule the testing for the most likely hypotheses. Perform any non-disruptive testing immediately.

Implementation

  • Implement the Fix: Complete the repair.
  • Verify the Fix: Is the problem really fixed?
  • Document the Resolution: What did we do? Get a sign-off from the business owner.

The process:

A complete understanding, from A to Z, of the technology at play is fundamental to tackling such a problem, which is why tight team integration and coordination are paramount.

Understanding the Oracle RAC environment is the first step, and this video does a pretty good job of laying the foundation.

http://www.youtube.com/watch?v=dS9uUXXTTko

 

We need to reduce the variables until only the client and a single host are left communicating, so that we can compare a normal communication with an abnormal one.

We need to remove the RAC elements, either by shutting down all nodes but one or by removing entries from the tnsnames.ora file, so that we connect to a single node and not the whole RAC.

Additionally, we should use IP addresses in the file or, if names are used, make sure they are defined in the hosts file, so we can rule out any DNS issues.

At this point we can connect an admin-friendly tool, Wireshark, and mirror traffic from the client to the sniffer.

A normal communication:

1001 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19592 Ack=19023 Win=65535 Len=52
1002 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [PSH, ACK] Seq=19023 Ack=19644 Win=62780 Len=22
1003 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19644 Ack=19045 Win=65513 Len=156
1004 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [PSH, ACK] Seq=19045 Ack=19800 Win=62780 Len=22
1005 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19800 Ack=19067 Win=65491 Len=13
1006 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [PSH, ACK] Seq=19067 Ack=19813 Win=62780 Len=17
1007 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [PSH, ACK] Seq=19813 Ack=19084 Win=65474 Len=10
1008 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [FIN, ACK] Seq=19823 Ack=19084 Win=65474 Len=0
1009 11:40:10 192.168.0.10 192.168.0.101 TCP 4568 > 4655 [FIN, ACK] Seq=19084 Ack=19824 Win=62780 Len=0
1010 11:40:10 192.168.0.101 192.168.0.10 TCP 4655 > 4568 [ACK] Seq=19824 Ack=19085 Win=65474 Len=0

We can see above the host 192.168.0.101 communicating from an arbitrary source port with the server on port 4568, which is the port the database listener is configured on. This snippet is the end of a communication: the client and the server exchange data in segments carrying the PSH and ACK flags, each one acknowledging the data received from the other side.

Finally, we see the client (192.168.0.101) send a segment with the TCP FIN flag, signaling that it has no more data to send. The server replies with its own FIN, ACK, and the client's final ACK closes the connection.

An abnormal communication:

1011 9:45:09 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1012 9:45:09 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1013 9:45:11 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1014 9:45:11 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1015 9:45:18 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1016 9:45:18 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1017 9:45:31 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1018 9:45:31 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1019 9:45:34 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1020 9:45:34 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1021 9:45:40 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1022 9:45:40 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)

Above we see what a failed communication, the kind that caused the timeout error in the application, looks like.

We see the client use an arbitrary source port and send a TCP segment with the SYN flag, trying to synchronize sequence numbers to begin communications, and the server reply with an ICMP destination unreachable (port unreachable).

We see the client try three times before moving to the next source port (4663 to 4664) and trying unsuccessfully three more times, before the application gives up and times out.
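As a rough aid for reading captures like the one above, the retry pattern can be tallied with a few lines of Python. This is only a sketch: the `summarize` function and the regular expression are illustrative names, and it assumes the capture was exported as the plain summary lines shown above.

```python
import re
from collections import Counter

# Summary lines copied from the abnormal capture above.
CAPTURE = """\
1011 9:45:09 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1012 9:45:09 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1013 9:45:11 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1014 9:45:11 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1015 9:45:18 192.168.0.101 192.168.0.10 TCP 4663 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1016 9:45:18 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1017 9:45:31 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1018 9:45:31 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1019 9:45:34 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1020 9:45:34 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
1021 9:45:40 192.168.0.101 192.168.0.10 TCP 4664 > 4568 [SYN] Seq=0 Win=65535 Len=0 MSS=1460
1022 9:45:40 192.168.0.10 192.168.0.101 ICMP Destination unreachable (Port unreachable)
"""

# Matches "TCP <source port> > <destination port> [SYN]" in a summary line.
SYN_RE = re.compile(r"TCP (\d+) > (\d+) \[SYN\]")


def summarize(capture):
    """Count SYN attempts per source port and ICMP port-unreachable replies."""
    syn_per_source_port = Counter()
    port_unreachable = 0
    for line in capture.splitlines():
        match = SYN_RE.search(line)
        if match:
            syn_per_source_port[int(match.group(1))] += 1
        elif "ICMP Destination unreachable (Port unreachable)" in line:
            port_unreachable += 1
    return dict(syn_per_source_port), port_unreachable


syn_counts, rejects = summarize(CAPTURE)
print(syn_counts, rejects)  # {4663: 3, 4664: 3} 6
```

Every SYN is answered by a port-unreachable message, and no SYN, ACK ever comes back from the server.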

Initial Conclusion:

We can conclude that, contrary to Oracle's assertion, this is not a network issue.

The frame was successfully routed across the network: the router ARP'ed for the host, got the response and sent the frame. Furthermore, the intended destination host was online and accepted the frame into its communication buffer. The packet was then handed up the stack toward the destination port (4568), where either no process was listening, the process did not reply in time, or something on the host refused the packet. The host then reported Destination Unreachable – Port Unreachable. Note that a TCP stack with no listener on a port usually answers a SYN with a RST rather than an ICMP message, so an ICMP port unreachable here hints at a packet filter on the host.

The solution:

So it’s kick it back to Oracle or find the solution.

A list of possibilities emerged from troubleshooting and online forums, but most only patch the issue by increasing timeout parameters at the application or OS layer, without really addressing the root cause.

  1. Change the database SID
  2. Disable iptables
  3. Set SQLNET.INBOUND_CONNECT_TIMEOUT=0 in the listener.ora and sqlnet.ora files
  4. Kernel-level changes to the OS to increase TCP timeout parameters.

Taking a closer look and comparing the two packet captures, we see that the key difference between them is the source port. The source port is not something you would generally look at when putting security in place, because you normally lock down a host by the port it is listening on and restrict who has access to that port.

It turns out that an automatically generated iptables rule set blocked a range of 19 ports (4660-4678) associated with peer-to-peer (P2P) traffic.

Every time the client picked an arbitrary source TCP port to communicate with the server and that port happened to fall within the range 4660-4678, the connection was rejected by iptables with an icmp-port-unreachable.
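The source ports from the two captures can be checked against the blocked range with a short sketch like the one below. The function name is illustrative, and the ephemeral port range of 1025-5000 is an assumption (the default dynamic range on older Windows clients), not something taken from this environment, so treat the resulting percentage as a rough indication only.

```python
# Source ports rejected by the iptables rule: 4660-4678 inclusive (19 ports).
BLOCKED_SOURCE_PORTS = range(4660, 4679)

# ASSUMPTION: the client picks source ports from 1025-5000, the default
# dynamic range on older Windows versions. Adjust for your own client.
EPHEMERAL_PORTS = range(1025, 5001)


def is_rejected(source_port):
    """Return True if a connection from this source port would be rejected."""
    return source_port in BLOCKED_SOURCE_PORTS


# 4655 is the source port from the normal capture, 4663 and 4664 from the
# abnormal one.
for port in (4655, 4663, 4664):
    print(port, "rejected" if is_rejected(port) else "allowed")

# Share of the assumed ephemeral range that falls inside the blocked range.
blocked_share = len(BLOCKED_SOURCE_PORTS) / len(EPHEMERAL_PORTS)
print(f"{blocked_share:.2%} of source ports are affected")
```

Under that assumption only a small fraction of connection attempts would hit the blocked range, which would be consistent with how sporadic the error was.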

 

Enterprise Backup Network with ANIRA

The network is one of the most critical components of the IT infrastructure, if not the most critical, although it is often taken for granted. In today’s client-server environment, and even more so with the cloud computing model, offices without network connectivity cannot carry out their daily business.

If your business or part of your business is disconnected from the rest, it will impact your business significantly, not to mention make your customers angry.

This post goes into what it takes to implement a cost-effective backup network, should the primary network link fail.

The scenario described includes multiple remote offices or field locations connected via bonded T1 circuits to an MPLS network. All major services are provided to these remote offices from a central location, as is almost always the case, which makes an outage fatal to the remote office.

Despite redundant T1 circuits providing an aggregate of 3 Mbps to the remote office, CRC errors or physical errors on one of the circuits will bring the bonded circuit down, so relying on the second active circuit as a backup is a flawed approach.

The router performs only WAN functions, leaving all other routing, VLAN-based network segmentation and security within the office to a layer-3 capable switch.

The routing protocol of choice is BGP as it is natively used by the MPLS network.

The backup link we are looking for needs to be cost effective, meaning it should not add significantly to costs until it is needed. It also needs sufficient bandwidth for data and voice applications to run at an acceptable level from the remote office.

AT&T provides a product that fits this description called ANIRA (AT&T Netgate). There is a minimal monthly rate, a cap of 1 Mbps aggregate bandwidth and an additional charge for usage.

This could be done with off-the-shelf equipment in lieu of the ANIRA product, but that approach brings additional challenges, such as creating the VPN tunnels to equipment at the main office and correctly propagating routes when the main circuit at the remote office goes down. The AT&T service provides management of the backup devices as well as connectivity through a VPN tunnel into the MPLS cloud.

The image above illustrates the network topology.

Should the remote office lose network connectivity, traffic will start to flow through the Netgate, which will trigger the device to connect and initiate a VPN tunnel, advertising all routes belonging to that office into the MPLS network.

The protocol used to determine which path traffic will take is VRRP (Virtual Router Redundancy Protocol). This allows the default gateway address used by the switch to float between the main router and the backup device.

Cisco configuration outlined below:

track 1 interface Multilink ip routing

interface FastEthernet0/0
description Internal Network
ip address 192.168.0.2 255.255.255.0
duplex auto
speed auto
vrrp 1 description LAN
vrrp 1 ip 192.168.0.1
vrrp 1 preempt delay minimum 60
vrrp 1 priority 110
vrrp 1 track 1 decrement 100
arp timeout 60

The Netgate device has an IP address of 192.168.0.3 and a VRRP IP address of 192.168.0.1.

A brief description of relevant configuration below:

The VRRP IP address 192.168.0.1 floats between the routers (main router/Netgate) depending on which one has the highest priority. The Netgate has a default priority, or weight, of 50, plus an additional 25 when the VPN is connected. In a normal state we want the main router to handle traffic, so we set its priority to a value higher than 75, which is the maximum for the Netgate.

vrrp 1 priority 110

To decide whether the default route should move to the Netgate, we need to know if the T1s are down. In this example a single T1 going down should not be the deciding factor, because the remaining T1 can handle the traffic, so we chose to monitor the bonded interface at the IP layer.

track 1 interface Multilink ip routing

In the event of an outage, the main router needs to lower its priority, or weight, below that of the Netgate, so that the Netgate becomes the new default router with IP address 192.168.0.1.

vrrp 1 track 1 decrement 100

This event will bring the main router’s priority to 10, well below the minimum for the Netgate.
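The priority arithmetic described above can be written out as a small Python model. This is a simplified sketch with illustrative function names: it uses only the priority values quoted in this post, and it does not model VRRP advertisements, tie-breaking or the preempt delay.

```python
MAIN_ROUTER_PRIORITY = 110   # vrrp 1 priority 110
TRACK_DECREMENT = 100        # vrrp 1 track 1 decrement 100
NETGATE_BASE_PRIORITY = 50   # Netgate default priority
NETGATE_VPN_BONUS = 25       # added by the Netgate when the VPN is connected


def main_router_priority(multilink_up):
    """Priority of the main router, reduced when the tracked interface is down."""
    if multilink_up:
        return MAIN_ROUTER_PRIORITY
    return MAIN_ROUTER_PRIORITY - TRACK_DECREMENT


def netgate_priority(vpn_connected):
    """Priority of the Netgate, raised when its VPN tunnel is connected."""
    if vpn_connected:
        return NETGATE_BASE_PRIORITY + NETGATE_VPN_BONUS
    return NETGATE_BASE_PRIORITY


def active_gateway(multilink_up, vpn_connected):
    """Return which device holds the virtual IP 192.168.0.1 (highest priority wins)."""
    main = main_router_priority(multilink_up)
    netgate = netgate_priority(vpn_connected)
    return "main router" if main > netgate else "Netgate"


for multilink_up in (True, False):
    for vpn_connected in (False, True):
        print(multilink_up, vpn_connected, active_gateway(multilink_up, vpn_connected))
```

With these numbers the main router wins whenever the Multilink interface is up (110 against at most 75), and the Netgate wins whenever it is down (10 against at least 50), regardless of the VPN state.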

When the main circuit comes back online we want to switch back to it and bring down the VPN tunnel. We accomplish this using the following command: vrrp 1 preempt

However, when a T1 comes back up it is usually not a clean process, and the telco might also be performing intrusive testing, so it’s important that we allow some time before we switch traffic back to the main circuit.

vrrp 1 preempt delay minimum 60

This configuration should provide an automatic, redundant backup network link for remote offices at an affordable price.

Mobile is going to crush Facebook – NOT

Mark Cuban’s recent article, a post mortem on the Facebook IPO, makes the point that mobile will crush Facebook. He has several valid points, but in my view he misses that mobile is a challenge for more than just Facebook.

Facebook is valuable because it makes possible targeted advertising to consumers to a greater extent than has been possible with Google.

Google has been mining data for years, and not only what we search. They mine what we share with others over email (Gmail users), and then correlate that information with trends based on our geographic location. Does Google know my age? Probably, since I have used other services where I have given up this information, and if not, it could fairly easily put me in a range. Does Google know if I’m single or married or have kids? They can probably guess by scraping content from my emails. Does Google know what I buy? Yes, if I have used Google to pay for it, or again by scraping any invoice or receipt that arrives in my inbox. Does Google know what I like to watch? Sure, they own YouTube.

I agree that Facebook has a problem with mobile, but the value proposition it presents to brands is greater than what Google has to offer. Furthermore, mobile is not a problem just for Facebook, but for all the players whose revenue models rely on income from advertisements.

Advertising has always relied on passive users, so the strategy was to capture the largest number of eyeballs using delivery channels such as TV, radio or print media. Google and Facebook make it possible to deliver relevant advertising in a way that was not possible before.

Mobile will change how brands interact with consumers. From the consumer’s perspective it offers just-in-time access to on-demand content, where people reach out for information when they want, from where they want. This presents a challenge for the traditional marketing strategy that has perfected the “couch potato” pitch. People using mobile devices will not be sitting passively in front of the regular channel, so if a brand expects to get in front of consumers it will have to find a different way.

So what’s different about the mobile platform? Smaller screens and less bandwidth, or bandwidth caps. The tablet will remove these limitations, but it’s a totally different device whose use and demographic are also totally different.

Trying to force mobile to become an additional marketing channel in the same way TV or the Web is used is a mistake, because the way people interact with mobile is different from the way they interact with TV, radio, print and the personal computer. Recognizing this, along with the handicap of a smaller screen and bandwidth limitations, and making use of relevant, targeted advertising and mobile-specific features such as location, will necessarily spawn new ways for brands to connect with consumers.
Data Mining: How Companies Now Know Everything About You

How Our Decisions are Shaped

Dan Ariely, a professor of behavioral economics at Duke University, presents examples of cognitive illusions that help illustrate why humans make predictably irrational decisions.

EG is the celebration of the American entertainment industry. Since 1984, Richard Saul Wurman has created extraordinary gatherings about learning and understanding. EG is a rich extension of these ideas – a conference that explores the attitude of understanding in music, film, television, radio, technology, advertising, gaming, interactivity and the web – The Entertainment Gathering

Dan Ariely is the Alfred P. Sloan Professor of Behavioral Economics at MIT Sloan School of Management. He also holds an appointment at the MIT Media Lab, where he is the head of the eRationality research group. He is considered to be one of the leading behavioral economists. Currently, Ariely is serving as a Visiting Professor at Duke University, Fuqua School of Business, where he is teaching a course based upon his findings in Predictably Irrational.

Ariely was an undergraduate at Tel Aviv University and received a Ph.D. and M.A. in cognitive psychology from the University of North Carolina at Chapel Hill, and a Ph.D. in business from Duke University. His research focuses on discovering and measuring how people make decisions. He models the human decision making process and in particular the irrational decisions that we all make every day.

Ariely is the author of the book, Predictably Irrational: The Hidden Forces That Shape Our Decisions, which was published on February 21, 2008 by HarperCollins.

Railway Bridge Health Monitoring System

In my last post I put forth the idea of using the unique capabilities and UX of the iPhone to help track defects in railways, which came about after my initial conversations with a friend from the railroad industry.

Hours into our conversation I was perplexed at the lack of proactive monitoring of today’s bridges used by trains for transport.

Should a uniquely located bridge collapse, an energy crisis could ensue as a result of the coal fields in the northeast/midwest being severed from the southwest.

I looked at several existing methods and solutions used today to address this issue, and drew from each to arrive at a refined approach to monitoring the health of railway bridges.

There were basically three design considerations that needed to be met:

  1. Easy to deploy
  2. Low Maintenance
  3. Long Term

The system had to be easily deployable, so that an electrician in the field could install the components of the solution. Low maintenance is also key, since it reduces the total cost of ownership, and a long-term design reduces the need for personnel to visit these bridges.

Application Requirements:

In order to monitor the health of a structure, vibrations of the structure need to be gathered and analyzed to develop a baseline under normal conditions. Subsequent measurements of vibrations can then be compared to the baseline to determine if an anomaly exists.
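A minimal sketch of the baseline comparison might look like the following. It is an illustration under stated assumptions rather than a real structural analysis method: it reduces each train passage to a single RMS vibration level and flags a passage that deviates from the baseline by more than a chosen number of standard deviations. The function names, the three-sigma threshold and the sample values are all hypothetical.

```python
import math
import statistics


def rms(samples):
    """Root-mean-square level of one series of acceleration samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def is_anomalous(baseline_levels, new_level, threshold_sigmas=3.0):
    """Flag a measurement that deviates too far from the baseline.

    baseline_levels: RMS levels recorded under normal conditions.
    threshold_sigmas: hypothetical tolerance, in standard deviations.
    """
    mean = statistics.mean(baseline_levels)
    spread = statistics.stdev(baseline_levels)
    return abs(new_level - mean) > threshold_sigmas * spread


# Made-up RMS levels from earlier train passages (illustrative values only).
baseline = [1.00, 1.02, 0.98, 1.01, 0.99]

print(is_anomalous(baseline, 1.03))  # False: within the normal spread
print(is_anomalous(baseline, 1.20))  # True: well outside the baseline
```

A real system would more likely compare frequency content than a single level, but the idea of comparing new measurements against a baseline is the same.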

To accomplish this, sensors (3-axis accelerometers) are placed throughout the span of the bridge to collect data. The frequency components of interest range from 0.25 to 20 Hz, the measurements need to cover 40 seconds before and after the passage of the train, and time synchronization between the sensors also has to be taken into account.
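These figures translate into a rough data budget per sensor. The sketch below is back-of-the-envelope arithmetic only: the 50 Hz sample rate and the 16-bit sample size are assumptions chosen for illustration, and the time the train itself spends on the bridge is left as a parameter because it is not specified here.

```python
HIGHEST_FREQUENCY_HZ = 20.0   # upper end of the band of interest
LOWEST_FREQUENCY_HZ = 0.25    # lower end of the band of interest
WINDOW_BEFORE_S = 40          # seconds recorded before the train passes
WINDOW_AFTER_S = 40           # seconds recorded after the train passes
AXES = 3                      # 3-axis accelerometer

SAMPLE_RATE_HZ = 50           # ASSUMPTION: any rate above the Nyquist rate
BYTES_PER_SAMPLE = 2          # ASSUMPTION: 16-bit samples


def nyquist_rate(highest_frequency_hz):
    """The sample rate must exceed twice the highest frequency of interest."""
    return 2.0 * highest_frequency_hz


def samples_per_passage(sample_rate_hz, train_passage_s=0):
    """Samples recorded across all axes for one train passage."""
    duration_s = WINDOW_BEFORE_S + train_passage_s + WINDOW_AFTER_S
    return int(duration_s * sample_rate_hz) * AXES


print(nyquist_rate(HIGHEST_FREQUENCY_HZ))                      # 40.0
print(samples_per_passage(SAMPLE_RATE_HZ))                     # 12000
print(samples_per_passage(SAMPLE_RATE_HZ) * BYTES_PER_SAMPLE)  # 24000
print(WINDOW_BEFORE_S * LOWEST_FREQUENCY_HZ)                   # 10.0
```

The last line shows that a 40 second window holds ten full cycles of the slowest 0.25 Hz component, and the total of roughly 24 KB per passage (excluding the passage itself) is small enough to buffer on the sensor.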

Existing approaches use technology such as Solar panels to supply power in remote areas, GSM for data transmission, GPS for time synchronization and a star topology for the sensors to communicate to a head node which would collect and transmit the data for analysis.

There are multiple problems here: solar panels are expensive and prone to theft, vandalism and damage; GSM data transmission isn’t viable where there is no network coverage in remote areas; and relying on a head node to collect and transmit the data is like putting all your eggs in one basket. If the head node fails, the system stops working.

The techniques I came across basically fell into two categories: existing bridges and new bridges.

I focused on existing bridges since there are very sophisticated things being done with new bridges. Today engineers are embedding sensors and fiber in the concrete while the bridges are being built in order to take measurements, but this approach is obviously not viable for existing bridges.

The methods in use for existing bridges included visual inspection; wired solutions, which were bulky, expensive and time consuming to set up; and a few wireless solutions, some of which were proprietary or not scalable, along with some interesting work from India.

In summary, there are several challenges in deploying such a solution at sometimes remote and hostile locations: a lack of power, which calls for alternate sources of energy; the need to effectively and reliably collect and transmit the data for analysis; and keeping installation and maintenance costs low.

Since the train comes and goes, so can the data collected by the sensors. The train would activate the standby sensors as it approaches the bridge and then collect the data buffered by the sensors after passing the bridge. This approach would deal with the data transmission limitations while at the same time eliminating the power needed for this component of the system. The train would carry the data and upload it to a collection station.

http://www.youtube.com/watch?v=PVH1K1Eocz0

To deal with reliability and power requirements, the star topology would be dropped in favor of a mesh network, which provides true self-organizing and self-healing properties. On top of the mesh network, TSMP (Time Synchronized Mesh Protocol) would be used, providing more than 99.9% reliability and the lowest power consumption per delivered packet.

The key for achieving maximum reliability is to use channel hopping, in which each packet is sent on a different channel. In this case, transient failures in links on a given channel are handled gracefully, and persistent link failures that develop after the site survey do not destabilize the network.
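The idea of channel hopping can be illustrated with a simplified, slot-based scheme. This sketch is not TSMP's actual hopping sequence, which is not described here. It assumes a 2.4 GHz IEEE 802.15.4 radio with 16 channels (11 to 26) and derives the channel from a time slot number plus a per-link offset; all names are hypothetical.

```python
FIRST_CHANNEL = 11   # ASSUMPTION: 2.4 GHz IEEE 802.15.4 channels 11-26
NUM_CHANNELS = 16


def channel_for(slot_number, channel_offset=0):
    """Channel used in a given time slot for a link with a given offset."""
    return FIRST_CHANNEL + (slot_number + channel_offset) % NUM_CHANNELS


def retry_channels(first_slot, attempts, channel_offset=0):
    """Channels used when a packet is retried in consecutive time slots."""
    return [channel_for(first_slot + i, channel_offset) for i in range(attempts)]


# A packet first sent in slot 14 and retried three times is sent on four
# different channels, wrapping around after channel 26.
print(retry_channels(14, 4))  # [25, 26, 11, 12]
```

Because each retry lands on a different channel, interference that blocks one channel does not block the retries as well.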

Sensors of this type using this approach can last 7-10 years on a small battery, meeting the application requirements.

Now to raise some money, build a working prototype and demo it to the Railway companies.

Healthcare Electronic Clipboard iPad Application

Just after finishing up my first iPhone application, I got involved in the healthcare industry with the implementation of an Electronic Health Records (EHR) system for a three-location practice. Additionally, I came across the video below of a doctor from Croatia putting the iPhone to use in the field, including remote diagnostic procedures and CPR with his own invention.

http://www.youtube.com/watch?v=Q-E-B3Pc8mk

This brought about the idea of building applications for the healthcare industry for the iPhone platform. The iPhone, though, was not the ideal device for doctors to use because of its size and the difficulty of entering information.

After the announcement of the Apple iPad on January 27, 2010, the idea of an electronic clipboard didn’t seem too far-fetched, so I put together a mock-up of what the app on the iPad would look like, which can be seen below.
