class: center, middle

# How to Catch when Proxies Lie

## Verifying the Physical Locations of Network Proxies with Active Geolocation
.authors[ .agroup[
Zachary Weinberg
·
Nicolas Christin
·
Vyas Sekar
.affil[Carnegie Mellon University] ] .agroup[
Shinyoung Cho
.affil[SUNY Stony Brook] ] .agroup[
Phillipa Gill
.affil[UMass Amherst] ] ]
invisible
italic
and
bold
text to force fonts to load
.institutions[
]

???

Hello everyone. I’m going to talk about what you can do when you suspect your VPN servers aren’t where the VPN company says they are. I’m a PhD student at Carnegie Mellon’s CyLab. This is joint work with two of the CyLab faculty, Nicolas Christin and Vyas Sekar, and also with Shinyoung Cho at SUNY Stony Brook and Phillipa Gill at the University of Massachusetts.

---

# Implausible claims
???

This is a verbatim quote from a major commercial VPN service’s website. They claim to have servers in “190+ countries” and this is their list just for Asia and the Pacific. I marked in red several countries that seem really unlikely: North Korea for obvious political reasons, and the rest of them because they’re tiny islands with fewer than 5000 inhabitants. Once you notice this, you wonder, do we have any reason to believe _any_ of these locations are true?

(By the way, whenever I say “country” in this talk, I mean a region with its own ISO 3166 country code. That includes both sovereign states and dependent territories.)

---

# Implausible claims, audited
Claim: 218 countries · No more than 40 true countries

???

To spoil my own punch line, this map shows all the countries where that service said they have servers, and which of those claims are true. Green is true, orange is false, light tan is not claimed in the first place. They said they have 218 countries, but the servers are really in fewer than 40 countries.

The rest of this talk is about how we know that. I’m going to show you how you _can_ locate a server anywhere in the world, without trusting operator claims or IP-to-location databases. Then I’m going to show you how to apply that technique to VPN servers specifically, and then I’ll come back to this, and the same for six other providers, and what it means.

---

# Active geolocation
Same principle as GPS, but use packet round-trip time (RTT)
CBG: linear estimate of maximum packet travel distance
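The disk-intersection idea can be sketched in a few lines of Python. This is a toy illustration, not the paper’s code: the 100 km/ms calibration constant and the handful of candidate points stand in for CBG’s fitted per-landmark speed line and a proper map grid.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0
# Assumed calibration constant: packets cover at most ~100 km per ms of
# RTT (an RTT is a round trip, so this is about half of two-thirds the
# speed of light in fiber).  Real CBG fits this slope per landmark.
KM_PER_MS = 100.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def cbg_feasible(landmarks, candidates):
    """Keep the candidate points lying inside every landmark's disk.

    landmarks:  list of ((lat, lon), rtt_ms) pairs
    candidates: list of (lat, lon) grid points to test
    """
    return [p for p in candidates
            if all(haversine_km(p, loc) <= rtt * KM_PER_MS
                   for loc, rtt in landmarks)]
```

With landmarks in Paris, London, and Copenhagen and plausible RTTs, a candidate in Brussels survives all three disks while Madrid does not.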
???

How can we find out where proxy servers really are, without trusting any information that could be faked? The basic idea is called active geolocation. It works on the same principle as the Global Positioning System, but instead of radio waves we use ping packets. People have been studying how to do this for more than twenty years; one of the simplest techniques is called CBG, Constraint-Based Geolocation.

It goes like this: We have _landmark_ hosts in, say, France, the UK, and Denmark; we ping the _target_ host from each; we assume the relationship between travel time and travel distance is linear, and we find out it can be only so many kilometers from each; we draw disks on the map and we find out it’s gotta be in Belgium. Or maybe a couple places in southeastern England. We’re assuming it’s not on a disused anti-aircraft platform in this wedge of the North Sea here.

The problem is, radio waves travel in straight lines at a constant velocity, but packets don’t. There are always routing delays, and also “circuitous” routes, major detours from the great-circle distance—often packets get routed from Australia to Japan by way of California, because that’s how the peering goes. That’s 21 thousand kilometers’ worth of extra latency.

On the right I’ve plotted the relationship between delay and distance for pings from one of the landmarks we used to all of the others, and you can see there _is_ a relationship, but it’s messy. The black line is CBG’s linear speed estimate; it’s as steep as possible without going above any of the points. There are much fancier models in the literature.

---

# (Quasi-)Octant
Minimum as well as maximum distance

Piecewise-linear travel time estimate, using convex hull of points
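The ring constraint can be sketched as follows, assuming the piecewise-linear min/max envelopes have already been fitted. The breakpoint lists in the test are invented for illustration; the real Octant derives them from the convex hull of the calibration scatterplot.

```python
import bisect

def interp(breakpoints, x):
    """Piecewise-linear interpolation over sorted (x, y) breakpoints,
    clamped at both ends."""
    xs = [bx for bx, _ in breakpoints]
    if x <= xs[0]:
        return float(breakpoints[0][1])
    if x >= xs[-1]:
        return float(breakpoints[-1][1])
    i = bisect.bisect_right(xs, x)
    (x0, y0), (x1, y1) = breakpoints[i - 1], breakpoints[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def octant_ring(min_curve, max_curve, rtt_ms):
    """The [min, max] distance ring one landmark's RTT implies.

    A candidate point is feasible only if its distance to the landmark
    falls inside this ring; intersecting rings from all landmarks gives
    the prediction region (rings instead of CBG's disks).
    """
    return interp(min_curve, rtt_ms), interp(max_curve, rtt_ms)
```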
???

For instance, Octant uses piecewise-linear models, based on the convex hull of the points on the scatterplot, to estimate the minimum as well as the maximum travel distance. Rings instead of disks. In this example, that lets it rule out England.

Octant also does things with hop-by-hop travel times for some additional accuracy, but we had to remove that part of the algorithm because we couldn’t collect traceroutes through most of the VPN servers; they black-hole ICMP time-exceeded packets. I’m going to be calling our implementation “Quasi-Octant” from now on because of that.

---

# Spotter
Probabilistic combination of Gaussian rings

Cubic polynomial estimates of μ and σ
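Spotter’s combination step can be sketched like this. The coefficients in the test are invented; the real system fits μ and σ by cubic regression on calibration data and evaluates the product over a full map grid.

```python
from math import exp, pi, sqrt

def poly3(c, x):
    """Evaluate the cubic c[0] + c[1]*x + c[2]*x^2 + c[3]*x^3."""
    return c[0] + c[1] * x + c[2] * x ** 2 + c[3] * x ** 3

def gauss_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma) distribution at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def spotter_score(per_landmark, mu_coeffs, sigma_coeffs):
    """Joint likelihood of one candidate point on the map.

    per_landmark: (distance_km_to_candidate, rtt_ms) pairs, one per
    landmark.  Each landmark contributes a Gaussian ring whose mean and
    spread are cubic polynomials of the RTT; Spotter multiplies the
    densities, treating the landmarks as independent.
    """
    score = 1.0
    for dist, rtt in per_landmark:
        mu = poly3(mu_coeffs, rtt)
        sigma = max(poly3(sigma_coeffs, rtt), 1.0)  # avoid zero width
        score *= gauss_pdf(dist, mu, sigma)
    return score
```

A candidate whose distance matches the RTT-implied mean scores higher than one far off the ring.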
???

And Spotter draws probability density functions instead of flat shapes on the map, based on cubic polynomial regression on the delay-distance relationship. Comes out basically the same in this example.

The papers describing the fancier algorithms often compare back to CBG and claim some percentage reduction in the uncertainty of the estimate, over the same test set. But the catch is they’re all tested on North America or Europe, and often only on PlanetLab nodes, which may have better connectivity than average for that area. There are reports that Octant’s minimum distance estimates are unsound in China, because its network is always congested, so it’s not safe to say “this packet must have traveled at least this distance.” And hardly anyone has tested active geolocation on hosts that could be anywhere in the world.

---

# Testing active geolocation around the world
RIPE Atlas anchors and stable probes

Global population density as of 2015
https://atlas.ripe.net
GPWv4, CIESIN/SEDAC
http://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-density-rev10/maps
???

So we did that test. We measured the accuracy of CBG, Quasi-Octant, Spotter, and a hybrid—cubic regression but geometric intersection—on test targets all around the world.

We used landmarks from the RIPE Atlas measurement constellation. They have two classes of measurement hosts, anchors and probes. Anchors work better as landmarks, mainly because they’re guaranteed to have stable IP addresses, but you can also use probes if you’re careful. There are about 300 overall. We don’t use them all for every measurement, but that’s just a performance hack.

Their coverage outside of Europe could be better. For comparison, on the right is an estimate of world population density as of 2015. Even if you scale that by Internet access, there’s still a huge discrepancy. There are measurement constellations with more hosts in North America, like CAIDA Ark, but I haven’t found one with many more hosts in Latin America, Africa, or Asia. But there’s enough worldwide coverage to make this worth trying, at least.

---

# Testing active geolocation around the world
RIPE Atlas anchors and stable probes

Crowdsourced test hosts (40 volunteer, 150 MTurk)
???

We calibrate all our algorithms on ping times from landmarks to landmarks, so we need a second set of hosts to be testing targets. We crowdsourced these: 40 from volunteers, 150 paid workers from Amazon’s Mechanical Turk micro-task service.

I was complaining about RIPE Atlas not having enough hosts in Latin America, Africa, and Asia, but it’s hard for a researcher based in the USA to get volunteers from there, too. Mechanical Turk lets you request workers from specific countries, which we used to prevent India and the USA from consuming my whole budget for this, but in many countries we didn’t get any workers at all. But, again, there’s enough to tell us something.

---

# Measuring RTT with a Web app
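The command-line measurement we use later for the VPN servers works on the same principle as the Web app: time a TCP handshake. Here is a minimal sketch in Python, not our actual tool; the `connect()` call returns when the SYN/SYN-ACK exchange completes, so its elapsed time approximates a single RTT plus local stack overhead.

```python
import socket
import time

def tcp_rtt_ms(host, port=443, timeout=2.0):
    """Approximate one round-trip time by timing a TCP handshake.

    Returns the elapsed connect() time in milliseconds, or None if the
    host is unreachable or refuses the connection.  A browser doing the
    same thing cannot tell whether it measured one round trip or two,
    which is the uncertainty discussed in the notes below.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connection established: handshake complete
    except OSError:
        return None
    return (time.monotonic() - start) * 1000.0
```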
???

We couldn’t measure round-trip times with ordinary ping packets, because the proxies we ultimately want to investigate are behind aggressive ingress filters. Also, we couldn’t ask our volunteers or MTurk workers to download, compile, and run a command-line program, so we had to do those measurements with a Web application, which has lots of restrictions on how it can access the network.

The short version is, we have to use TCP handshakes, on a well-known port, and when we are using a Web application we can’t be sure whether we’re measuring one round trip or two, which means the distance estimates can unpredictably come out twice as big as they should be. If the browser’s running on Windows it can even be three or four round trips; I don’t know why.

But this is a useful problem to have, because it’s giving us unpredictable extra latency, which is the same thing we have to worry about because of congested regional networks and circuitous routes.

---

# Algorithm comparison
Algorithms using minimum distance estimates do not cope with extra latency

???

So we didn’t try to compensate for the extra round trips at all. We tested unmodified CBG, Quasi-Octant, Spotter, and Hybrid on the crowdsourced data, and here’s how well they did.

The most important criterion is on the left. We want the true location always to be _inside_ the prediction region. None of the algorithms managed that, and the left plot shows how badly they failed: how far away was the edge of the prediction from the true location? Turns out the simplest algorithm, CBG, is least likely to fail this way.

Then we look at _why_ they’re failing with the other two plots, and what we find is, the problem is minimum distance estimates. Quasi-Octant, Spotter, and Hybrid are all failing because they assume the packets must have gotten some distance away in a given time, but that’s not true because of all the systematic errors.

---

# Avoiding underestimation
Underestimating travel distance can cause empty prediction

Underestimates observed for ~1% of all disks
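The two changes to CBG described in the notes can be sketched as a toy version: the satellite floor of 20,000 km in 240 ms comes from the talk, while the grid-based “largest agreeing subset” search below is a simplification of the actual geometric computation.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0
# Change 1: a packet could go ~20,000 km via a communications satellite
# in about 240 ms, so no fitted speed may be slower than this floor.
FLOOR_KM_PER_MS = 20000.0 / 240.0

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))

def modified_cbg(disks, candidates):
    """disks: ((lat, lon), rtt_ms, km_per_ms) per landmark.

    Change 2: instead of requiring a point inside *all* disks, keep the
    candidate points covered by the largest number of disks, which
    discards inconsistent (underestimated) disks automatically.
    """
    radii = [(loc, rtt * max(speed, FLOOR_KM_PER_MS))
             for loc, rtt, speed in disks]
    best, region = -1, []
    for p in candidates:
        n = sum(haversine_km(p, loc) <= r for loc, r in radii)
        if n > best:
            best, region = n, [p]
        elif n == best:
            region.append(p)
    return region
```

In the test, a third landmark with an underestimated (too-slow) speed fails to cover the true location, but the majority of disks still agree on it.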
???

CBG only fails when some of its disks _underestimate_ the distance. Here’s the example from the beginning again: if we add this pink disk, which is an underestimate, we get no prediction region at all, because there’s no overlap among all four disks. If it were just a little bit bigger, but still too small, we’d think the server had to be farther to the southeast than it really is.

Since we know the true locations of all the crowdsourced test hosts, we can calculate how often underestimates happen. It comes out to about 1% of all the disks. Anything to the left of this red line.

So we make two changes to CBG. We know it’s physically possible for a packet to go twenty thousand kilometers in 240 milliseconds, using a communications satellite, so we say that CBG’s speed estimate has to be at least fast enough to allow that. And, if there’s no overlap among all the disks, we discard down to the largest subset that does have an overlap. In this example, there are two possibilities, so we take the bigger overlap, which means throwing out the pink disk and giving the same answer we originally had.

We retested, and sure enough, those two changes eliminated all of the misses. So we used this modified CBG for the main study, geolocating VPN proxies.

---

# Seven VPN providers
.caption[ VPN commercial landscape data collected by [VPN.com](https://www.vpn.com) ]

???

I’m not going to name the VPN companies we tested, because there are many more companies we haven’t tested. I don’t want you to think the companies in this study are unusually misleading about their advertised server locations. I suspect this is an industry-wide problem, and if we tested all of the companies we could find, we’d discover at least some falsehoods for most of them.

What I will tell you is that this slide shows 157 VPN providers, with the set of countries that each advertises servers in, and the lettered ones are the ones we tested. The data comes from the comparison site VPN dot com. Providers A through E are all in the top 20 by number of countries advertised; F and G are much more typical.

---

# Location databases agree with providers…
???

We checked the providers’ claims against five major IP-to-location databases, and you can see that the databases mostly agree with them. Eighty percent agreement or better for most. IP2Location and IPInfo are down around 50% for provider A, which is curious, but it could just mean they’re out of date.

We’ve all heard that IP-to-location databases are notoriously full of errors, but also, a lot of the sources they use could be faked pretty easily: whois, address registry allocations, airport codes in routers’ DNS names, that sort of thing.

So, suppose a VPN company has a way to fake server locations in IP-to-location databases. Most of their customers, what they probably want is for websites to _think_ they’re surfing from Ruritania. They want to watch Ruritanian TV, but the website will only stream TV to people in the country. And the website enforces that by looking up client addresses in one of these databases, so if the company fakes its server locations, it can give its customers what they want, without needing to have servers in lots of different countries. Saves money. Economically rational.

But maybe you’re not subscribing to VPNs to watch TV. Maybe you have a reason why you really need your packets routed through Ruritania. Then a faked server location is no good to you.

---

# Measurement through VPN servers
Cannot measure _A_

Can measure _B_ and _C_

_A_ = _B_ − 0.49 _C_
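The calibration against the few pingable servers amounts to a one-parameter least-squares fit; here is a sketch with synthetic numbers, not our measurements.

```python
def fit_k(samples):
    """Least-squares fit of k in A = B - k*C.

    samples: (A, B, C) triples from servers that answer pings, so A was
    measured directly.  Minimizing sum((B - A - k*C)^2) over k gives
    the closed form k = sum(C*(B - A)) / sum(C^2); ideally k = 0.5,
    and the talk's regression found 0.49.
    """
    num = sum(c * (b - a) for a, b, c in samples)
    den = sum(c * c for _, _, c in samples)
    return num / den

def estimate_rtt_a(b, c, k=0.49):
    """Estimate the server-to-landmark RTT from client-side B and C."""
    return b - k * c
```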
???

I need to go over a couple more technical wrinkles before we get to the results. We’re not using the Web app for the VPN servers; we’re using a command-line program that can reliably measure a single round-trip time. But it’s the wrong round trip.

To geolocate VPN servers, we need the round-trip time between the server and each landmark. That’s _A_ on the left diagram. But we can’t measure that directly, because we can’t run code on the server itself, and most of them don’t respond to pings. We _can_ measure _B_, the round-trip time from our client _through_ the server to each landmark, and also _C_, the round-trip time from our client through the server, back to the client, and around again.

In an ideal world, _A_ would be equal to _B_ minus half of _C_, because _C_ goes back and forth between the client and the server twice. A few of the servers _can_ be pinged, so we use those to check this equation, and it holds up: linear regression says 0.49 _C_, with R-squared greater than 0.99.

---

# Disambiguation with external knowledge
All these targets belong to the same AS and /24

All the data centers inside the oval are in Chile
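The same-AS-and-/24 heuristic amounts to a majority vote among servers assumed to be co-located. A sketch follows; the IP addresses (from the documentation range) and the tie-breaking rules are illustrative, not our actual pipeline.

```python
from collections import Counter

def disambiguate(predictions):
    """predictions: {server_ip: set of candidate country codes} for a
    group of servers sharing one AS and /24, assumed co-located.

    If some members' prediction regions stay inside a single country,
    treat the border-crossing regions as noise and assign the whole
    group the country that the unambiguous members agree on most often.
    """
    votes = Counter()
    for countries in predictions.values():
        if len(countries) == 1:
            votes[next(iter(countries))] += 1
    if votes:
        return votes.most_common(1)[0][0]
    # No unambiguous member: fall back to a country common to all
    # regions (alphabetical tie-break, purely for determinism).
    common = set.intersection(*predictions.values())
    return min(common) if common else None
```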
???

Also, sometimes CBG by itself gives us an ambiguous answer, but we can resolve it with outside information. For instance, if we have a group of servers whose IP addresses all belong to the same Autonomous System and the same /24, probably they’re all in the same location. If we get some prediction regions that cross a national border, and some that don’t, for a group like that, we assume the crossings are a mistake. In this example on the left, we say all these servers are in Canada and not the USA. This is often useful when a big city is near a border, like Toronto here, and for small countries like Singapore or even Belgium, where the prediction has to be really tight not to cross into any neighbors.

Also, we’re geolocating _servers_. Servers live in data centers. For instance, all the data centers inside the prediction region on the right are in Chile, not Argentina, so we can assume that this proxy is in Chile.

Incidentally, all the providers use DNS-based load balancing, so we look up all the IP addresses for their servers and test each one independently. Those 20 servers on the left correspond to five DNS names, all belonging to one provider.

---

# Provider A
Claim: 218 countries · No more than 40 true countries

???

So now we come back to this slide I showed you at the beginning. Provider A claims to have VPN servers in 218 countries. Almost every ISO country code there is; they’re missing a few places in Africa and South America. Nobody we tested says they have a server in Antarctica, by the way.

The green countries are the ones where they really do have servers, and the orange countries are the ones where they said they do, but they don’t. Almost nothing is really in South America, or Africa, or Central Asia, or Oceania. Fewer than advertised in several other places.

And this isn’t just a matter of its being difficult to operate servers in certain locations. There would be no problem getting hosting in Norway, or New Zealand, or Egypt, or Argentina, but they don’t; conversely, getting hosting in Russia is a hassle, but they do.

I can’t show it to you on this map, but there is very little relationship between the claimed location and the actual location. Claimed locations from all over the world turn out to be concentrated into data centers in Florida, the UK, and the Czech Republic.

---

# Provider B
Claim: 109 countries · No more than 30 true countries

???

Provider B isn’t making claims quite as grandiose as A’s, but there are still quite a lot of lies, especially relating to South America, Africa, and Central Asia.

---

# Provider C
Claim: 84 countries · No more than 50 true countries

???

I’m going to go quickly through the rest of these; the overall patterns are much the same. I can’t show it on this map, but this provider had servers that were supposed to be in the USA but that we measure as being in Saudi Arabia, Iran, and China, which is precisely backward from what you would expect. If you’re going to go to the trouble of getting data center space in those countries, why wouldn’t you advertise it?

---

# Provider D
Claim: 52 countries · No more than 45 true countries

???

This provider’s servers are slow and overloaded, which makes all the predictions come out more uncertain. It’s possible that they have fewer countries than this map says they do.

---

# Provider E
Claim: 53 countries · No more than 35 true countries

???

Nobody seems to want to put servers in southeastern Europe, which seems odd to me. But hey, at least these people aren’t lying about Italy!

---

# Provider F
Claim: 19 countries · Could be as many as 31 countries

???

This provider also has slow, overloaded servers producing position uncertainty and rendering us unable to tell how much lying is going on. But at least we know that we don’t know.

---

# Provider G
Claim: 20 countries · No more than 18 true countries

???

The servers this provider said were in France and Italy are actually in Germany. A hundred years ago, someone might have started a war over that.

---

# Summary
Dishonest claims are more likely to occur in the “long tail” of countries.

???

To sum up: provider claims are fully credible for a little less than half of the tested IP addresses, and _could_ be true for nearly two-thirds. But which countries account for the bulk of the credible claims? The USA, Australia, the UK, the Netherlands, Germany, Canada, France, and so on. The places where bandwidth is cheap and business is easy to do. The dishonesty happens in the long tail of countries — not by population, but by ease of access to hosting. There are some odd exceptions; I don’t know why they tend to lie about Sweden and tell the truth about Russia.

---

# Either we’re wrong or the databases are
???

Now let’s look again at the provider claims and the location databases, adding some rows at the bottom for how much _we_ agree with the claims. Depending on how much we give the providers the benefit of the doubt, we agree with their claims anywhere from 30% to 90% of the time. Except for provider D, we always agree less than the databases do. (The maps I showed match the “generous” row.) Either the databases are wrong or I’m wrong, and I’m pretty sure I’m not wrong.

---

# Questions raised

* Is other research using VPNs invalidated?
* How easy is it to fake IP-to-location records?
* What if the VPN actively interferes with these measurements?
* What do people think they’re buying?
* Should Web apps be able to measure precise network timing?

???

Obviously this is not the last word on VPN server locations; there are plenty of ways our results could be improved. I want to end, though, with some questions raised by just the work so far.

We started this project because we weren’t seeing known cases of Internet censorship through Provider A’s VPNs. I wonder if anyone else may have done measurement studies using these providers and didn’t measure what they thought they were measuring.

How easy _is_ it to tamper with IP-to-location databases, the way we think they’re doing? We know there are tons of _errors_ in IP-to-location databases, but I haven’t seen anyone looking for active falsification.

Some studies say that if the target delays its responses to pings, it can foul all the distance estimates. With the measurements we’re doing, the VPN could be even more aggressive than that, and respond early to some of our SYNs. Is there anything we could do to prevent that kind of interference?
I can think of one way, but it needs a custom protocol and synchronized clocks on all the landmarks…

On a policy note, to know if this is a clear or a fuzzy case of false advertising, we need to understand what VPN customers think they’re buying: whether it’s just access to Ruritanian streaming TV, or whether they truly expect their packets to get routed through Ruritania.

And finally, remember I said we’d built a Web application that runs an active geolocation measurement? That could be used by a malicious website to locate a human without their permission. Maybe Web apps shouldn’t be allowed to measure precise network timings.