Conversation with #gubug at 2005-01-12 11:32:46 on macnewbold@irc.freenode.net (irc)
(11:32:46) : The topic for #gubug is: roBust, Solid, Dependable -- http://www.anthonychavez.org/images/haxorpc.jpg
(15:26:57) jlp: no one has any ideas? remember, the problem is that AS537 has lost contact with its preferred site B, how can I, at foo.com, detect that this has happened?
(15:28:28) jlp: because once I can detect it, I can start serving site A's A record for www.foo.com to AS537 hosts (even though it's normally sub-optimal)
(15:49:36) macnewbold: Does your DNS server have access to a BGP feed? :) You could use that to determine when routing was broken perhaps, though really a pain.
(15:50:43) macnewbold: it seems one way to tell when they can't talk to site B is if they contact you about every TTL to get an update, rather than contacting you approx. once every _two_ ttls, like they would if they were picking randomly between two working sites
(15:51:03) jlp: not really, no... think more in terms of what you DO know already based on traffic you're receiving... think not only of dns traffic
(15:51:31) jlp: we're assuming foo.com gets a LOT of traffic all the time, btw
(15:51:53) macnewbold: if your load is extremely high (double normal?) you could assume site B was probably down
(15:52:01) macnewbold: but that's not specific to AS537
(15:52:22) jlp: but you're on the right track
(15:52:33) jlp: just pointed in the wrong direction :-)
(15:53:15) jlp: let's say AS537 is comcast, for the sake of the discussion
(15:55:31) macnewbold: if you're getting very little comcast traffic, point them to the other site (where all the other comcast traffic is going), but if you're getting lots of comcast traffic, point it to yourself
(15:56:36) macnewbold: (This is kind of cool, since it is very likely that something like the right answer to this question is how google.com actually works... :) )
(16:00:37) jlp: it's actually how sites like akamai work (akamai handles a lot of front-end load for big sites)
(16:01:21) jlp: the idea is you have a good feel for how much traffic you're normally getting from, say, comcast (or AS537). when it drops off drastically it means one of three things:
(16:02:11) jlp: 1. all those users suddenly decided to go elsewhere for their stuff (unlikely), 2. all those users just decided to stop using the network (also unlikely), or 3. something is broken either at the remote end or between you and them
(16:03:09) jlp: also notable, on your dns servers, you were probably normally seeing a fairly even distribution of dns hits from e.g. comcast
(16:03:38) jlp: when comcast can suddenly no longer see site B, site B's dns server will be seeing a dearth of requests, while site A will be seeing twice as many
(16:04:00) jlp: (or, in google's case, sites A, B, C, D, E, F, and G :-)
(16:04:23) jlp: it's a very interesting problem
(16:10:02) macnewbold: well they messed something up today... there was about a 10 minute outage of gmail...
(16:10:35) macnewbold: though it wasn't the servers being unreachable, at least not just before it came back, cause I was getting a "please try again in a minute" message
(16:13:22) jlp: hm, must have been on the servers themselves, then
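
The heuristic jlp and macnewbold converge on could be sketched roughly as below, for a single site looking at its own traffic from one AS. This is only an illustration of the idea in the log, not how Akamai or Google actually implement it. The function names (`classify`, `record_to_serve`), the state labels, and the 0.2 / 1.8 thresholds are all assumptions made up for the example; the log only says "drops off drastically" and "twice as many".

```python
from typing import Optional


def classify(baseline: float, current: float,
             low: float = 0.2, high: float = 1.8) -> str:
    """Compare current traffic from one AS against its usual level.

    baseline and current are request rates (any consistent unit).
    low and high are hypothetical thresholds, not values from the log.
    """
    if baseline <= 0:
        return "unknown"      # no history, so a change cannot be judged
    ratio = current / baseline
    if ratio < low:
        return "lost"         # drastic drop: likely broken between us and them
    if ratio > high:
        return "absorbing"    # roughly double: the other site is likely unreachable
    return "normal"


def record_to_serve(local_site: str, other_site: str,
                    state: str) -> Optional[str]:
    """Pick which site's A record to hand to resolvers in this AS.

    Follows macnewbold's rule: very little traffic from the AS means
    point them at the other site, lots of traffic means point them here.
    Returns None when the normal policy should stay in place.
    """
    if state == "lost":
        return other_site
    if state == "absorbing":
        return local_site
    return None


if __name__ == "__main__":
    # Site A normally sees 1000 req/s from AS537 and now sees 2100.
    state = classify(baseline=1000, current=2100)
    print(state, record_to_serve("A", "B", state))
```

In practice the baseline would have to track time-of-day and weekly patterns, which the log does not discuss, so a fixed ratio like this is only a starting point.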