Herding Code 217: Nick Craver on Stack Overflow Engineering

The guys talk to Nick Craver about all the magic behind the scenes at StackExchange global headquarters. Download / Listen: Herding Code 217: Nick Craver on Stack Overflow Engineering

Show Notes:

Hello
- (00:15) Nick explains his job: software development, sysadmin (site reliablity engineer), and sometime DBA. Just not devops.
- (01:30) Jon introduces Nick's recent blog post series and the Trello board where people can recommend and vote on new topics:
Architecture / Network
- (01:58) Jon notes that they're using Windows and CentOS and asks why CentOS as opposed to other Linux flavors. Nick says they tried Ubuntu first, but it's more tuned for clients. CentOS is a variant of Red Hat Enterprise Linux, so all the packages work.
- (02:53) Nick says that they use whatever's the best tool for each job - factoring in the costs of each new tool (training, migrating, supporting, vendor overhead, etc.). They run Elastic Search, Redis, HAProxy and Logstash.
- (03:50) Jon asks about how they're using protobuf to serialize the information they're persisting Redis. Nick talks about some specifics, including the different levels of caching that they're doing. They're using pub / sub with Redis via websockets. They're not clustering, partly because there's one Redis cache per SE site. He and Jon discuss how multitenant scenarios often require custom implementations.
- (07:20) Jon asks about how they're using websockets. Nick says that they're used in a lot of places for optional updates. Their problems come from running on very few servers - they end up with huge connection tables per server.
- (08:23) Jon asks about one of their Redis instances that's handling machine learning with Providence. Nick says they log some metadata and performance information about about every single request. Providence is a system their data team wrote that analyzes the data, figures out locations, suggested user tags, etc. As a user, you have control over the personalization and can download the data or disable recommendations if you want. There's also a mobile feed, for mobile apps as well. There are 40k ops/sec all day long against Redis. Scott K asks if he can manipulate his feed.
- (11:51) Scott K asks about the L1 and L2 cache that Nick's talked about. Nick clarifies that he's been referring to HTTP caching on the web server and Redis caching.
Tag Engine
- (12:36) Nick talks about the tag engine and explains how it's a lot more complex than people think since it handles some complex queries, needs to sort the results, has minimum question score filters, etc. Every two minutes, the tag engine refreshes all the deltas based on row versions in the SQL Server database. Their previous implementations were saturating the CPU L3 cache. They wrote some affinity code to split things out so Providence and Stack Server run on different processors. Marc Gravell is now working on getting the tag engine working on GPU's (demo code). Nick said that the difference between a $500 to a $5000 card is only about a 40% improvement. They're really excited about upcoming graphics cards with 8GB memory, since that will allow them to fit the entire tag engine in memory on the graphics card; with two cards they'll be able to run active-active and active-passive configurations (for live upgrades) for only about $600 per card. Jon says he saw a question on Hacker News asking if it could just be handled by a bitmap index, but Nick explains how more complicated queries and sorting make that not a good solution. Jon asks about the GPGPU split between C# and C++ code. Nick explains how it's almost all handled via C# code, but it's missing a few newer CUDA commands.
- (19:48) Jon asks about the Elastic Search implementation. Elastic Search doesn't really support types, it has field groupings, which makes the upgrade more difficult. Nick explains that things are pretty vanilla now, but they'd like to make some customizations to support nested search results when time permits.
Data and SQL Server
- (21:03) Jon asks about their SQL Server implementation. Nick talks about the clustering setup.
- (21:49) Jon asks what version of SQL Server they're using. Nick says they're currently running the latest version of SQL Server 2014 and will move to SQL Server 2016 as soon as it releases. [Note: SQL Server 2016 has since released and they've upgraded.]
- (22:11) Nick talks about some of the top reasons they're looking forward to SQL Server 2016: string_split and JSON parsing. These are both useful for queries that take a list as a parameter. Jon reminisces about a time long ago when he used XML to pass lists to SQL queries.
- (23:32) Kevin asks if they're able to do a piecemeal migration without downtime. Nick explains how they do upgrades using replicas. They can test on other replicas, then fail over to them, or roll back to the previous master. They hate Windows clustering, and Windows Server 2016 and SQL Server 2016 should soon support distributed affinity groups which would allow them to do simple affinity group based upgrades.
- (26:15) Jon asks about SQL Server on Linux. Nick says he can't really talk about it.
- (26:35) Jon asks if they use SQL Server Hekaton / In-Memory OLTP. Nick says they don't, they run enough memory in their database servers that it's not needed. Nick says it's more for high-frequency no-lock access.
- (27:27) Jon is a little horrified that they don't use stored procedures and do all database queries using inline SQL. Nick explains that it's much simpler to modify queries and app code that consumes the query together rather than managing separate deployments. There's no performance benefit to using stored procedures over inline queries.
- (29:27) Jon asks about how they handle migrations. Nick explains how changing tables is handled via a migration file and a ten line bot.
Source Control, Localization, Build
- (30:47) Jon asks about their use of GitLab. Nick says it works okay, but they're testing GitHub Enterprise internally due to performance. GitHub is significantly faster, search works a lot better (due to using Postgres search rather than Elastic Search), and there are some nice new features in GitHub like squashing commits.
- (32:23) Jon asks about the localization features and is educated about the ja, ru, pt and es versions of stackoverflow. Nick explains some of the different localization issues that you run into in localization. Most localization solution work by string replacement, which requires string allocations. That doesn't work at scale. They've written a system called Moonspeak which uses Roslyn to precompile view. This allows a direct response.write of the localized string via switch statements, which is a lot more efficient. They haven't had time to open source it yet, but would like to.
- (36:36) Jon asks about their build process using MSBuild. They use it mostly because it's what the tooling uses. They could customize it more with PowerShell, but that would tie them more to TeamCity and Nick's not sure there's a benefit to making that move. Nick's waiting to see where csproj is going - he's got some big doubts it'll be as terse as project.json, but he's interested to see. Nick says historically MSBuild has been optimized for three-way merge generation; Jon says that was technically Visual Studio's fault since MSBuild actually has had glob support for a while.
Upcoming Technologies, Visual Studio
- (40:35) Nick complains about how slow Visual Studio is to reload projects. Their developers have scripts that just kill and restart Visual Studio, because that's faster than handling project reloads.
- (41:16) Jon asks if Nick's played with Visual Studio "15". Nick wonders about the technology used in the installer. He says it's generally good, but they're running into some issues with solution files changing when moving between versions.
- (42:33) Nick says that they generally don't ever use File / New, they copy an existing project and rename things. There's a discussion about whether it's possible to customize project templates. Jon says you can export a project as a solution template; K Scott mentions that SideWaffle has some capabilities there, too, but there was some "wonkiness". And what's the deal with GUIDs in project and solution files?
- (46:09) Scott K mentions a command line base project generator that he started on years ago called ProjectStarter. He wishes that it was possible to configure Visual Studio to define a custom build tool rather than assuming everything's in csproj. He gets that Visual Studio features like IntelliSense depend on controlling the build, but doesn't like that Visual Studio has to "know everything about everything".
- (48:55) Jon says he sees two ways that cross-platform can work: either make the frameworks able to work without the tools knowing and controlling everything, or updating Visual Studio Code so it's able to know and control everything. He hopes it's the first way.
- (49:50) Nick complains about how sometimes in-memory builds don't reflect changes, or csproj doesn't save before a build. He'd like everything to save before builds. Scott K calls out the Save All The Time extension for Visual Studio that Paul Betts made.
- (50:40) Jon asks Nick if they've looked at ASP.NET Core. Nick says that they'll mostly be starting with their internal tools. They have several libs that they'll need to port, and they've got some difficult problems with libraries like MiniProfiler that need to support both .NET 4.x and .NET Core because the underlying APIs have some significant differences. You can't just multi-target code that targets things like HttpContext. Other libraries like dapper and stackexchange-redis haven't been as bad, and they've been working on them because lots of other developers are depending on them.
- (55:03) Jon calls out some of Nick's recent C# 6 tweets. Nick says he likes null coalescing and ternaries - they see more terse code as a lot more readable, but it of course varies by team. Roslyn has been really big for them, things like Moonspeak rely on it.
Questions from Twitter
- (56:23) Matt Warren asks "What performance issues have you had the most fun finding and fixing." Nick mentions the tag engine and a fun debugging issue they ran into where the TimeSpan int constructor uses Ticks rather than seconds or milliseconds, so their cache code was only caching values for tiny fractions of a second rather than thirty seconds. They find out so many issues using MiniProfiler; he wishes more developers would use MiniProfiler (or another tool like Glimpse) in their applications. They run MiniProfiler for every single request on StackOverflow and the overhead is minimal - if they can do it, you can do it.
- (58:17) Matt Warren asks "What the craziest thing they've done to increase performance." Nick talks about the IL related work they've done - sometimes instead of conditional code, it's faster to just swap out the method body. They're pragmatic, they only do this for extreme cases like things that run for every request and have real performance implications. What's the trick for StackOverflow? Keep it simple.