Active Geolocation Project

Research Questions

As we’ve already mentioned, active geolocation determines where in the world an Internet host is located, by measuring packet round-trip times between the host we want to locate and a number of other hosts (“landmarks”) in known locations. Packets travel through the network at a predictable speed, roughly half the speed of light in vacuum. In principle, it’s as simple as converting each round-trip time to a distance, drawing a bunch of circles on a map, and finding their intersection. In practice, there are complications, but algorithms to deal with them have been floating around the literature for over a decade: Octant is probably the best-known, and there are dozens of refinements.

We want to address two problems with the existing algorithms, and a problem with the databases that most people use to do passive geolocation (just look the IP address up in a table — much more convenient, if you're willing to rely on the company that compiled the database).

The algorithms have mostly only been tested in Europe and the USA, which have denser and faster data networks than almost anywhere else. They may be making assumptions that don't hold up in other parts of the world. It's also unclear how reliable they are when the host to be located is very far away from all the landmarks. They can probably get the continent right, but are they accurate within the continent?

The algorithms are also not designed to handle proxy servers. If you are funneling all your network traffic through a proxy, you probably want to be sure exactly which country the proxy is in; this affects which laws apply, which online merchants will do business with you, all sorts of things. So you might like to apply active geolocation, to be very sure where the proxy is. But you can't directly measure round-trip times to the proxy, only through the proxy and back to your own machine, which might be very far away. Abstractly, you would expect this to come out as all the round-trip times being slower by a constant factor; is it enough to just subtract that off? And what is the right way to estimate the constant factor?

The databases, meanwhile, seem accurate enough, usually, but they are known to have problems [1] [2], and the companies that compile them will not share their methodology. By cross-checking the databases with self-reported and measured locations, we hope to quantify how likely the databases are to be in error, and by how much.