Building a Billion User Load Balancer [video] (usenix.org)
242 points by phodo on Jan 9, 2017 | 50 comments


Map of Google data centers: http://imgur.com/l1dDdQe

Map of Facebook data centers and PoP: http://imgur.com/dek8ESX


I'm surprised that neither has any datacenters in India. Does anyone know why this might be?


Cost was a smaller factor than politics; the Indian government wanted the private keys for our certs in order to let FB put a POP there. That was an absolute dealbreaker, so we served India from Singapore and other POPs in nearby countries.

Regarding building a datacenter, that's much higher risk since it's a $100M-1B capital investment. I'm guessing both Facebook and Google see too much volatility to make the risk worth it. It could be an expensive paperweight if the Indian government changed their minds.


Wow. Thank you for not bending over on the private key issue. That's terrible.


My intuition tells me it's a combination of:

* They will eventually build one, but Indian consumers are (as of right now) less valuable to them from an ad-targeting perspective than other geographies, i.e. the ROI on the money/time/effort invested is better elsewhere in the short term

* SE Asia (and their PoPs there) has reasonable connectivity to India, so it might be good enough for the time being


That's not quite the answer he gave to this exact question in the video: geopolitics and cost were the major reasons; they do want PoPs there.


Google has announced a region for Mumbai (India) coming this year: https://cloud.google.com/about/locations/

(Can't/won't speculate as to why not earlier.)


Google and Facebook both have datacenters/PoPs in Taiwan? I can't imagine the 24 million people there justify it, and both services are either largely unused or blocked in the nearby PRC...

I'm more surprised that Google has no presence in Japan, though.


They do. The above map is wholly owned DCs, but there are many more PoPs, and an enormous number of edge cache locations: https://peering.google.com/#/infrastructure


They fixed that, at least for GCP. How the other Google bits fit into that, I do not know. They're planning to expand to a bunch of new regions this year.

https://www.blog.google/topics/google-cloud/google-cloud-pla...


There's a major Asia-Pacific undersea cable connected to Taiwan, so Taiwan is a useful Asia-Pacific gateway.


More precisely, Taiwan is well connected by direct cables to Japan, China, Philippines, Viet Nam, Malaysia, Thailand, Korea, Singapore and the US. As far as I'm aware Google owns a cable connecting Taiwan with Japan (part of [1]).

[1] https://en.wikipedia.org/wiki/FASTER_(cable_system)


In my experience, having a PoP in Taiwan means that some data requests from South Korea, the Philippines, and Australia are routed there.

I've never understood how that works, since Singapore and Tokyo should be closer, but it still happens.


Load and network conditions (paths from you to Tokyo are experiencing packet loss for $REASONS, Singapore is overloaded/broken/full, etc.).


Google has an office in Tokyo but I guess not a datacenter.


But there are plenty of Google Global Caches, so performance (latency to cached objects and DNS queries) is still quite good.


I wonder what the reasons behind their similar distribution are.

One is surely the density of users, but maybe there are others related to the actual hosting, like cost, law or infrastructure.


Infrastructure: electricity, roads, price of land, availability of workforce, water, temperature, closeness to the backbones.

Add to that the density of users and the timing they expect. Once you factor all that in, you end up with a limited number of possibilities.


Both companies have looked for optimal choices, taking all factors into account, so it is logical that they ended up with the same answers.


Too bad for Africa and Australia.


Yeah, it was a massive surprise for me to see Australia/NZ not have anything dedicated. Network reliability over the whole region must be pretty poor, and although in population terms it's probably too small, you'd think in purchasing-power terms it would still be worth it.


Interesting that Linux kernel performance (IPVS) is acceptable at L4 vs. something like DPDK. I guess you just overcome the limitation by increasing the number of L4 instances load-balanced by ECMP.

Fun to see DSR in use.

Also interesting to see that the inherent problems with geolocation via GSLB (the DNS resolver's IP is not the same as the real client's IP) apparently don't wind up being a big deal. This seems to be a growing concern in my experience: users aren't located where their ISP DNS servers are.


It's mostly because the point of DPDK and similar is to go around a lot of the processing in the kernel, and IPVS does exactly this. I'm surprised IPVS isn't more popular; it's built into the kernel and extremely fast.

HTTP-proxy-type load balancers are slugs in comparison.

Scaling app servers to nearly unlimited size is easy to explain but really hard in practice. It basically amounts to this:

1) Balance requests using DNS anycast so you can spread load before it hits your servers

2) Set up "Head End" machines with pipes as large as possible (40Gbps?) and load balance at the lowest layer you can. Balance at the IP level using IPVS and direct server return (a rough ipvsadm sketch follows after this list). A single reasonable machine can handle a 40Gbps pipe. I guess you could set up a bunch of these, but I doubt many people are over 40Gbps. Oh, and don't use cloud services for these. The virtualization overhead is high on the network plane, and even with SR-IOV you don't get access to all hardware NIC queues. Also, I don't know of any cloud provider that's compatible with direct server return, since they typically virtualize your "private cloud" at layer 3, whereas IPVS actually touches layer 2 a little. Do yourself a favor and get a few colos for your load balancers.

3) Set up a ton of HTTP-proxy-type load balancers: Nginx, Varnish, HAProxy, etc. One of these machines can probably handle 1-5 Gbps of traffic, so expect 20 or so behind each layer-3 balancer. These NEED to be hardened substantially, because most attacks will be layer 4 and up once an adversary realizes they can't just flood you out (due to the powerful IPVS balancers above). SYN cookies are extremely important here since you're dealing with TCP... just try to set everything up to avoid storing TCP state at all costs. This also means no NAT. You might want to keep these in the colo with your L3 load balancers.

4) Now for your app servers. Depending on whether you're using a dog-slow language or not, you'll want between 3 and 300 app servers behind each HTTP proxy. You don't really need to harden these as much, since the traffic is lower and anything that reaches here is clean HTTP. Go ahead and throw these on the cloud if you want.
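
To make step 2 concrete, here's a minimal sketch of an IPVS direct-server-return setup. The VIP 203.0.113.10 and backends 10.0.0.11/.12 are hypothetical, and the commands aren't from the talk, just the usual ipvsadm pattern:

  # On the IPVS box: create the virtual service, then add real servers
  # in direct-routing mode (-g) so replies bypass the load balancer.
  ipvsadm -A -t 203.0.113.10:80 -s wlc
  ipvsadm -a -t 203.0.113.10:80 -r 10.0.0.11:80 -g
  ipvsadm -a -t 203.0.113.10:80 -r 10.0.0.12:80 -g

  # On each real server: hold the VIP on loopback but never ARP for it,
  # so only the balancer answers ARP on the shared L2 segment.
  ip addr add 203.0.113.10/32 dev lo
  sysctl -w net.ipv4.conf.all.arp_ignore=1
  sysctl -w net.ipv4.conf.all.arp_announce=2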


>"'Im surprised IPVS isn't more popular, it's built into the kernel and extremely fast."

I feel it actually is popular at places that do tens of gigabits of traffic and up, usually in combination with a routing daemon (BIRD, Quagga, etc.). I have worked in a couple of shops now that used a similar architecture. I also recently read about a Google LB that leveraged IPVS, and now this, of course.
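
For illustration, a minimal BIRD 1.x sketch of that pattern, with hypothetical ASNs and addresses (not from the talk): each IPVS box holds the service VIP locally and announces it upstream, and the router ECMPs across every box announcing the same /32.

  # bird.conf (sketch): originate the VIP and advertise it over BGP
  protocol static {
    route 203.0.113.10/32 via "lo";   # the service VIP held on loopback
  }

  protocol bgp upstream {
    local as 64512;
    neighbor 192.0.2.1 as 64511;
    import none;
    export where net = 203.0.113.10/32;
  }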


I saw IPVS was implemented in the kernel, but didn't realize it bypassed the stack. Thanks for clarifying.


Not all of it. IPVS runs between layers 2 and 3 if using direct server return. It does bypass quite a bit though...


What if you're not dealing with millions of connections, but instead only a few thousand from whitelisted IPs, and you need to optimise for high availability and latency? Could it be done with just anycast -> IPVS layer -> app servers?


If it's stateless traffic, then yes.

The ECMP/anycast just gets you beyond the limit of a single pair of IPVS boxes, which are kept in sync with keepalived/VRRP for HA.

But a pair of boxes with IPVS + keepalived + iptables should be able to handle a few thousand connections, no problem. Your concern would then likely be the bandwidth going through the box. But if your clients pull rather than push, using direct server return should get you past the bandwidth limitations of a single box.
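
A minimal keepalived.conf sketch of such a pair, with hypothetical addresses (not from this thread): the vrrp_instance fails the VIP over between the two boxes, and the virtual_server block fills the IPVS table in DR mode with health checks.

  vrrp_instance VI_1 {
    state MASTER              # BACKUP on the second box
    interface eth0
    virtual_router_id 51
    priority 100              # lower on the backup
    virtual_ipaddress {
      203.0.113.10
    }
  }

  virtual_server 203.0.113.10 80 {
    delay_loop 5
    lb_algo wlc
    lb_kind DR                # direct server return
    protocol TCP
    real_server 10.0.0.11 80 {
      TCP_CHECK {
        connect_timeout 3
      }
    }
    real_server 10.0.0.12 80 {
      TCP_CHECK {
        connect_timeout 3
      }
    }
  }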


Yeah, it works pretty much the same. If your clients aren't geographically dispersed, replace anycast with DNS round-robin, or use both like most huge sites do.

Also, there are three layers :) DNS -> IPVS -> HTTP proxy -> app servers.

You could ditch the HTTP proxy layer if your app servers are extremely fast, like Netty/Go/Grizzly.


Out-of-kernel offloads (DPDK, Solarflare's OpenOnload, and Mellanox's VMA) are good for two primary use cases:

* Reducing context switches at exceptionally high packet rates

* Massively reducing latency with tricks like busy polling (which the kernel's native stack is gaining)

LVS is pretty much the undisputed king for serious business load balancing. I've heard (anecdotally) that Uber uses gorb [1], and Google has released seesaw [2], which are both fancy wrappers on top of LVS for load balancing.

Source: Almost 10 years optimizing Linux and hardware for low latency in a trading firm.

[1] https://github.com/kobolog/gorb

[2] https://github.com/google/seesaw


Mellanox and Solarflare certainly have carved out a nice market for themselves. They are not cheap though, which I guess is why they are mostly found in trading shops, where latency likely equates to money being left on the table.


High frequency trading is Solarflare's original market (and Mellanox has their InfiniBand market), but both of them are becoming more and more common in the commodity server market as well. Particularly since Intel dropped the ball on 40G (and REALLY dropped the ball on 25/100G), and Broadcom is out of the adapter market, there is a void that other vendors are filling.

When you're spending $30,000 on a server, it doesn't really matter if you spend $1200 on a network card. Those CPU cycles and storage bytes have to go somewhere to make money.


I totally agree. It sounds like their price points (around $1,200) have come down as well.

Did Intel just never get a 25/100G card to market?


Not yet. They launched 40G a while ago, but some issues have kept them from the same kind of dominance they had with 10G.

25G is supposedly coming soon-ish, but 100G is still 1-2 years away. It's going to be hard to compete with vendors who shipped their first products in 2015/2016.

Meanwhile they have a 100G OmniPath adapter, but who cares?


>"but some issues have kept them from the same kind of dominance they had with 10G."

Oh, interesting. Can you elaborate on what those issues are with the 40 Gig cards?


> DNS client IP is not the same as the real client IP

see https://developers.google.com/speed/public-dns/faq

"I've read claims that Google Public DNS can slow down certain multimedia applications or websites. Are these true?

...

To help reduce the distance between DNS servers and users, Google Public DNS has deployed its servers all over the world. In particular, users in Europe should be directed to CDN content servers in Europe, users in Asia should be directed to CDN servers in Asia, and users in the eastern, central and western U.S. should be directed to CDN servers in those respective regions. We have also published this information to help CDNs provide good DNS results for multimedia users.

In addition, Google Public DNS engineers have proposed a technical solution called EDNS Client Subnet. This proposal allows resolvers to pass in part of the client's IP address (the first 24/64 bits or less for IPv4/IPv6 respectively) as the source IP in the DNS message, so that name servers can return optimized results based on the user's location rather than that of the resolver. To date, we have deployed an implementation of the proposal for many large CDNs (including Akamai) and Google properties. The majority of geo-sensitive domain names are already covered.

"


But that's Google; I can tell you that, practically speaking, many ISPs don't do this.

The client subnet extension would be very nice.



For those who are interested in replicating this same architecture in your own environment, I am working on a similar solution. Please check it out:

https://github.com/luizbafilho/fusis

It is a control plane for IPVS and adds distribution, fault tolerance, self-configuration and a nice JSON API to it.

It is almost done, but it needs documentation on how to use it.

Moreover, for those who like numbers, I did some benchmarks using a 16-core machine with two bonded 10Gbit interfaces.

Scenarios:

1 request per connection with 1 byte: 115k connections/s

20 requests per connection with 1 byte: 670k requests/s

20 requests per connection with 1 megabyte: 14Gbps

Each scenario tests one specific aspect of the load balancer.


In case anyone else is looking for a transcript: I can't check right now whether this talk is much different from the one he did in 2015, but just in case, here's a link to the YouTube video for the older one:

https://www.youtube.com/watch?v=MKgJeqF1DHw

That's the only way I found to get any sort of text.


Good thing this is out. I'm gonna put one on my blog, in expectation of the traffic I'd like to have!


The "cartographer" doesn't use end user related information to select the best PoP? Or does "Sonar" measure the latency/throughput by looking at existing connections to users?

OK, I watched the presentation now; Sonar apparently does that indeed.


Sonar is the system we use to measure latency from the client devices to all of our PoPs.

Cartographer is the system that consumes all of these Sonar measurements along with several other real-time data sources (BGP routes, link capacity, PoP health, PoP capacity, etc.) and continually generates a GLB map for the most optimal targeting of requests to our PoPs.


I am curious: does Cartographer also adjust iBGP preferences to balance egress traffic amongst transit providers?


This also gives an interesting overview of SDN, NFV, and scale inside Google's data centers.

https://www.youtube.com/watch?v=vMgZ_BdipYw


Wondering how tinydns and Cartographer work together? I didn't see any dynamic response capability in tinydns. Did I miss it in the tinydns docs?


Cartographer ingests BGP topology, PoP and datacenter health, and other things, and then pushes what is basically a lookup table to tinydns. None of the dynamism is inside the DNS server itself, which lets it focus on what it is good at: responding to a crap ton of DNS requests per second.


Is this audio/video only, or am I missing something? If it is, maybe it should be marked in the title.


There is an embedded video on the page.


Then maybe the title should have [video] at the end?

I was unable to find anything to read.



