Building a Billion User Load Balancer [video] (usenix.org)
242 points by phodo on Jan 9, 2017 | 50 comments


Map of Google data centers: http://imgur.com/l1dDdQe

Map of Facebook data centers and PoP: http://imgur.com/dek8ESX


I'm surprised that neither has any datacenters in India. Does anyone know why this might be?


Cost was a smaller factor than politics; the Indian government wanted the private keys for our certs in order to let FB put a POP there. That was an absolute dealbreaker, so we served India from Singapore and other POPs in nearby countries.

Regarding building a datacenter, that's much higher risk since it's a $100M-1B capital investment. I'm guessing both Facebook and Google see too much volatility to make the risk worth it. It could be an expensive paperweight if the Indian government changed their minds.


Wow. Thank you for not bending over on the private key issue. That's terrible.


My intuition tells me it's a combination of:

* They will eventually build one, but Indian consumers are (as of right now) less valuable to them from an ad-targeting perspective than other geographies, i.e. the ROI on the money/time/effort invested is better elsewhere in the short term

* SE Asia (and their PoPs there) has reasonable connectivity to India, so it might be good enough for the time being


That's not quite the answer he gave to this exact question in the video: geopolitics and cost were the major reasons; they do want PoPs there.


Google has announced a region for Mumbai (India) coming this year: https://cloud.google.com/about/locations/

(Can't/won't speculate as to why not earlier.)


Google and Facebook both have datacenters/PoPs in Taiwan? I can't imagine the 24 million people there justify it, and both services are either largely unused or blocked in the nearby PRC...

I'm more surprised that Google has no presence in Japan, though.


They do. The above map is wholly owned DCs, but there are many more PoPs, and an enormous number of edge cache locations: https://peering.google.com/#/infrastructure


They fixed that, at least for GCP. How the other Google bits fit into that, I do not know. They're planning to expand to a bunch of new regions this year.

https://www.blog.google/topics/google-cloud/google-cloud-pla...


There's a major Asia-Pacific undersea cable connected to Taiwan, so Taiwan is a useful Asia-Pacific gateway.


More precisely, Taiwan is well connected by direct cables to Japan, China, Philippines, Viet Nam, Malaysia, Thailand, Korea, Singapore and the US. As far as I'm aware Google owns a cable connecting Taiwan with Japan (part of [1]).

[1] https://en.wikipedia.org/wiki/FASTER_(cable_system)


In my experience, having a PoP in Taiwan means that some data requests from South Korea, the Philippines, and Australia are routed there.

I've never understood how that works, since Singapore and Tokyo should be closer, but it still happens.


Load and network conditions (paths from you to Tokyo are experiencing packet loss for $REASONS, Singapore is overloaded/broken/full, etc.).


Google has an office in Tokyo but I guess not a datacenter.


But there are plenty of Google Global Caches, so performance (latency to cached objects and DNS queries) is still quite good.


I wonder what the reasons behind their similar distribution are.

One is surely the density of users, but maybe there are others related to the actual hosting, like cost, law or infrastructure.


Infrastructure: electricity, roads, price of land, availability of workforce, water, temperature, closeness to the backbones.

Add to that the density of users and the timing they expect. Once you factor all that in, you end up with a limited number of possibilities.


Both companies have looked for optimal choices, taking all factors into account, so it is logical that they ended up with the same answers.


Too bad for Africa and Australia.


Yeah, it was a massive surprise for me to see Australia/NZ not have anything dedicated. Network reliability over the whole region must be pretty poor, and although in population terms it's probably too small, you'd think in purchasing-power terms it would still be worth it.


Interesting that Linux kernel performance (IPVS) is acceptable at L4 vs. something like DPDK. I guess you just overcome the limitation by increasing the number of L4 instances load-balanced by ECMP.

Fun to see DSR in use.

Also interesting to see that the inherent problems with geolocation via GSLB (the DNS resolver's IP is not the same as the real client's IP) apparently don't wind up being a big deal. This seems to be a growing concern in my experience: users aren't located where their ISP DNS servers are.


It's mostly because the point of DPDK and similar is to go around a lot of the processing in the kernel, and IPVS does exactly this. I'm surprised IPVS isn't more popular; it's built into the kernel and extremely fast.

HTTP-proxy-type load balancers are slugs in comparison.

Scaling app servers to nearly unlimited size is easy to explain but really hard in practice. It basically amounts to this:

1) Balance requests using DNS anycast so you can spread load before it hits your servers

2) Set up "Head End" machines with pipes as large as possible (40Gbps?) and load balance at the lowest layer you can. Balance at the IP level using IPVS and direct server return (a rough ipvsadm sketch follows after this list). A single reasonable machine can handle a 40Gbps pipe. I guess you could set up a bunch of these, but I doubt many people are over 40Gbps. Oh, and don't use cloud services for these. The virtualization overhead is high on the network plane, and even with SR-IOV you don't get access to all hardware NIC queues. Also, I don't know of any cloud provider that's compatible with direct server return, since they typically virtualize your "private cloud" at layer 3, whereas IPVS actually touches layer 2 a little. Do yourself a favor and get a few colos for your load balancers.

3) Set up a ton of HTTP-proxy-type load balancers: Nginx, Varnish, HAProxy, etc. One of these machines can probably handle 1-5 Gbps of traffic, so expect 20 or so behind each layer-3 balancer. These NEED to be hardened substantially, because most attacks will be layer 4 and up once an adversary realizes they can't just flood you out (due to the powerful IPVS balancers above). SYN cookies are extremely important here since you're dealing with TCP... just try to set everything up to avoid storing TCP state at all costs. This also means no NAT. You might want to keep these in the colo with your L3 load balancers.

4) Now for your app servers. Depending on whether you're using a dog-slow language or not, you'll want between 3 and 300 app servers behind each HTTP proxy. You don't really need to harden these as much, since the traffic is lower and anything that reaches here is clean HTTP. Go ahead and throw these on the cloud if you want.
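
To make step 2 concrete, here's a minimal sketch of an IPVS direct-server-return setup. The VIP 203.0.113.10 and backends 10.0.0.11/.12 are hypothetical, and the commands aren't from the talk, just the usual ipvsadm pattern:

  # On the IPVS box: create the virtual service, then add real servers
  # in direct-routing mode (-g) so replies bypass the load balancer.
  ipvsadm -A -t 203.0.113.10:80 -s wlc
  ipvsadm -a -t 203.0.113.10:80 -r 10.0.0.11:80 -g
  ipvsadm -a -t 203.0.113.10:80 -r 10.0.0.12:80 -g

  # On each real server: hold the VIP on loopback but never ARP for it,
  # so only the balancer answers ARP on the shared L2 segment.
  ip addr add 203.0.113.10/32 dev lo
  sysctl -w net.ipv4.conf.all.arp_ignore=1
  sysctl -w net.ipv4.conf.all.arp_announce=2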


>"'Im surprised IPVS isn't more popular, it's built into the kernel and extremely fast."

I feel it actually is popular at places that do tens of gigabits of traffic and up, usually in combination with a routing daemon (BIRD, Quagga, etc.). I have worked in a couple of shops now that used a similar architecture. I also recently read about a Google LB that leveraged IPVS, and now this, of course.
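
For illustration, a minimal BIRD 1.x sketch of that pattern, with hypothetical ASNs and addresses (not from the talk): each IPVS box holds the service VIP locally and announces it upstream, and the router ECMPs across every box announcing the same /32.

  # bird.conf (sketch): originate the VIP and advertise it over BGP
  protocol static {
    route 203.0.113.10/32 via "lo";   # the service VIP held on loopback
  }

  protocol bgp upstream {
    local as 64512;
    neighbor 192.0.2.1 as 64511;
    import none;
    export where net = 203.0.113.10/32;
  }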


I saw IPVS was implemented in the kernel, but didn't realize it bypassed the stack. Thanks for clarifying.


Not all of it. IPVS runs between layers 2 and 3 if using direct server return. It does bypass quite a bit though...


What if you're not dealing with millions of connections, but instead only a few thousand from whitelisted IPs, and you need to optimise for high availability and latency? Could it be done with just anycast -> IPVS layer -> app servers?


If it's stateless traffic, then yes.

The ECMP/anycast just gets you beyond the limit of a single pair of IPVS boxes, which are kept in sync with keepalived/VRRP for HA.

But a pair of boxes with IPVS + keepalived + iptables should be able to handle a few thousand connections, no problem. Your concern would then likely be the bandwidth going through the box. But if your clients pull rather than push, using direct server return should get you past the bandwidth limitations of a single box.
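
A minimal keepalived.conf sketch of such a pair, with hypothetical addresses (not from this thread): the vrrp_instance fails the VIP over between the two boxes, and the virtual_server block fills the IPVS table in DR mode with health checks.

  vrrp_instance VI_1 {
    state MASTER              # BACKUP on the second box
    interface eth0
    virtual_router_id 51
    priority 100              # lower on the backup
    virtual_ipaddress {
      203.0.113.10
    }
  }

  virtual_server 203.0.113.10 80 {
    delay_loop 5
    lb_algo wlc
    lb_kind DR                # direct server return
    protocol TCP
    real_server 10.0.0.11 80 {
      TCP_CHECK {
        connect_timeout 3
      }
    }
    real_server 10.0.0.12 80 {
      TCP_CHECK {
        connect_timeout 3
      }
    }
  }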


Yeah, it works pretty much the same. If your clients aren't geographically dispersed, replace anycast with DNS round-robin, or use both like most huge sites do.

Also, there are three layers :) DNS -> IPVS -> HTTP proxy -> app servers.

You could ditch the HTTP proxy layer if your app servers are extremely fast, like Netty/Go/Grizzly.


Out-of-kernel offloads (DPDK, Solarflare's OpenOnload, and Mellanox's VMA) are good for two primary use cases:

* Reducing context switches at exceptionally high packet rates

* Massively reducing latency with tricks like busy polling (which the kernel's native stack is gaining)

LVS is pretty much the undisputed king for serious business load balancing. I've heard (anecdotally) that Uber uses gorb [1], and Google has released seesaw [2], which are both fancy wrappers on top of LVS for load balancing.

Source: Almost 10 years optimizing Linux and hardware for low latency in a trading firm.

[1] https://github.com/kobolog/gorb

[2] https://github.com/google/seesaw


Mellanox and Solarflare certainly have carved out a nice market for themselves. They are not cheap though, which I guess is why they are mostly found in trading shops, where latency likely equates to money being left on the table.


High frequency trading is Solarflare's original market (and Mellanox has their InfiniBand market), but both of them are becoming more and more common in the commodity server market as well. Particularly since Intel dropped the ball on 40G (and REALLY dropped the ball on 25/100G), and Broadcom is out of the adapter market, there is a void that other vendors are filling.

When you're spending $30,000 on a server, it doesn't really matter if you spend $1200 on a network card. Those CPU cycles and storage bytes have to go somewhere to make money.


I totally agree. It sounds like their price points (around $1,200) have come down as well.

Did Intel just never get a 25/100G card to market?


Not yet. They launched 40G a while ago, but some issues have kept them from the same kind of dominance they had with 10G.

25G is supposedly coming soon-ish, but 100G is still 1-2 years away. It's going to be hard to compete with vendors who shipped their first products in 2015/2016.

Meanwhile they have a 100G OmniPath adapter, but who cares?


>"but some issues have kept them from the same kind of dominance they had with 10G."

Oh, interesting. Can you elaborate on what those issues are with the 40 Gig cards?


> DNS client IP is not the same as the real client IP

see https://developers.google.com/speed/public-dns/faq

"I've read claims that Google Public DNS can slow down certain multimedia applications or websites. Are these true?

...

To help reduce the distance between DNS servers and users, Google Public DNS has deployed its servers all over the world. In particular, users in Europe should be directed to CDN content servers in Europe, users in Asia should be directed to CDN servers in Asia, and users in the eastern, central and western U.S. should be directed to CDN servers in those respective regions. We have also published this information to help CDNs provide good DNS results for multimedia users.

In addition, Google Public DNS engineers have proposed a technical solution called EDNS Client Subnet. This proposal allows resolvers to pass in part of the client's IP address (the first 24/64 bits or less for IPv4/IPv6 respectively) as the source IP in the DNS message, so that name servers can return optimized results based on the user's location rather than that of the resolver. To date, we have deployed an implementation of the proposal for many large CDNs (including Akamai) and Google properties. The majority of geo-sensitive domain names are already covered.

"


But that's Google; I can tell you that, practically speaking, many ISPs don't do this.

The client subnet extension would be very nice.



For those who are interested in replicating this same architecture in your own environment, I am working on a similar solution. Please check it out:

https://github.com/luizbafilho/fusis

It is a control plane for IPVS and adds distribution, fault tolerance, self-configuration and a nice JSON API to it.

It is almost done, but it needs documentation on how to use it.

Moreover, for those who like numbers, I did some benchmarks using a 16-core machine with two bonded 10Gbit interfaces.

Scenarios:

1 request per connection with 1 byte: 115k connections/s

20 requests per connection with 1 byte: 670k requests/s

20 requests per connection with 1 megabyte: 14Gbps

Each scenario tests one specific aspect of the load balancer.


In case anyone else is looking for a transcript: I can't check right now whether this talk is much different from the one he did in 2015, but just in case, here's a link to the YouTube video for the older one:

https://www.youtube.com/watch?v=MKgJeqF1DHw

That's the only way I found to get any sort of text.


Good thing this is out. I'm gonna put one on my blog, in expectation of the traffic I'd like to have!


The "cartographer" doesn't use end user related information to select the best PoP? Or does "Sonar" measure the latency/throughput by looking at existing connections to users?

OK, I watched the presentation now; Sonar apparently does that indeed.


Sonar is the system we use to measure latency from the client devices to all of our PoPs.

Cartographer is the system that consumes all of these Sonar measurements along with several other real-time data sources (BGP routes, link capacity, PoP health, PoP capacity, etc.) and continually generates a GLB map for the most optimal targeting of requests to our PoPs.


I am curious: does Cartographer also adjust iBGP preferences to balance egress traffic amongst transit providers?


This also gives an interesting overview of SDN, NFV, and scale inside Google's data centers.

https://www.youtube.com/watch?v=vMgZ_BdipYw


Wondering how tinydns and Cartographer work together? I didn't see any dynamic response capability in tinydns. Did I miss it in the tinydns docs?


Cartographer ingests BGP topology, PoP and datacenter health, and other things, and then pushes what is basically a lookup table to tinydns. None of the dynamism is inside the DNS server itself, which lets it focus on what it is good at: responding to a crap ton of DNS requests per second.


Is this audio/video only, or am I missing something? If it is, maybe it should be marked in the title.


There is an embedded video on the page.


Then maybe the title should have [video] at the end?

I was unable to find anything to read.



