Monday, June 6, 2016

Wix - 11/11/2014

Nifty Architecture Tricks From Wix – Building A Publishing Platform At Scale

Wix operates websites in the long tail. As an HTML5-based WYSIWYG web publishing platform, they have created over 54 million websites, most of which receive under 100 page views per day. So traditional caching strategies don’t apply, yet it only takes four web servers to handle all the traffic. That takes some smart work.
Aviran Mordo, Head of Back-End Engineering at Wix, has described their solution in an excellent talk: Wix Architecture at Scale. What they’ve developed is in the best tradition of scaling: specialization. They’ve carefully analyzed their system and figured out how to meet their aggressive high availability and high performance goals in some very interesting ways.
Wix uses multiple datacenters and clouds. Something I haven’t seen before is that they replicate data to multiple datacenters, to Google Compute Engine, and to Amazon. And they have fallback strategies between them in case of failure.
Wix doesn’t use transactions. Instead, all data is immutable and they use a simple eventual consistency strategy that perfectly matches their use case.
Wix doesn’t cache (as in a big caching layer). Instead, they pay great attention to optimizing the rendering path so that every page displays in under 100ms.
Wix started small, with a monolithic architecture, and has consciously moved to a service architecture, using a very deliberate process for identifying services that can help anyone thinking about the same move.
This is not your traditional LAMP stack or native cloud anything. Wix is a little different and there’s something here you can learn from. Let’s see how they do it…

Stats

  • 54+ million websites, 1 million new websites per month.
  • 800+ terabytes of static data, 1.5 terabytes of new files per day
  • 3 data centers + 2 clouds (Google, Amazon)
  • 300 servers
  • 700 million HTTP requests per day
  • 600 people total, 200 people in R&D
  • About 50 services.
  • 4 public servers are needed to serve 45 million websites

Platform

  • MySQL
  • Google and Amazon clouds
  • CDN
  • Chef

Evolution

  • Simple initial monolithic architecture. Started with one app server. That’s the simplest way to get started. Make quick changes and deploy. It gets you to a particular point.
    • Tomcat, Hibernate, custom web framework
    • Used stateful logins.
    • Disregarded any notion of performance and scaling.
  • Fast forward two years.
    • Still one monolithic server that did everything.
    • At a certain scale of developers and customers it held them back.
    • Problems with dependencies between features. Changes in one place caused deployment of the whole system. Failure in unrelated areas caused system wide downtime.
  • Time to break the system apart.
    • Went with a services approach, but it’s not that easy. How are you going to break functionality apart and into services?
    • Looked at what users are doing in the system and identified three main parts: editing websites, viewing sites created by Wix, and serving media.
    • Editing web sites includes data validation of data from the server, security and authentication, data consistency, and lots of data modification requests.
    • Once finished with the web site users will view it. There are 10x more viewers than editors. So the concerns are now:
      • high availability. HA is the most important feature because it’s the user’s business.
      • high performance
      • high traffic volume
      • the long tail. There are a lot of websites, but they are very small. Every site gets maybe 10 or 100 page views a day. The long tail makes caching a poor go-to scalability strategy: with so few repeat requests per site, caching becomes very inefficient.
    • Media serving is the next big service. Includes HTML, JavaScript, CSS, images. Needed a way to serve the 800TB of files under a high volume of requests. The win is that static content is highly cacheable.
    • The new system looks like a networking layer that sits below three segment services: editor segment (anything that edits data), media segment (handles static files, read-only), public segment (first place a file is viewed, read-only).

Guidelines For How To Build Services

  • Each service has its own database and only one service can write to a database.
  • Access to a database is only through service APIs. This supports a separation of concerns and hiding the data model from other services.
  • For performance reasons read-only access is granted to other services, but only one service can write. (yes, this contradicts what was said before)
  • Services are stateless. This makes horizontal scaling easy. Just add more servers.
  • No transactions. With the exception of billing/financial transactions, all other services do not use transactions. The idea is to increase database performance by removing transaction overhead. This makes you think about how the data is modeled to have logical transactions, avoiding inconsistent states, without using database transactions.
  • When designing a new service, caching is not part of the architecture. First make the service as performant as possible, deploy it to production, and see how it performs. Only if there are performance issues that can’t be fixed by optimizing the code (or other layers) do you add caching.

Editor Segment

  • Editor server must handle lots of files.
  • Data stored as immutable JSON pages (~2.5 million per day) in MySQL.
  • MySQL is a great key-value store. Key is based on a hash function of the file so the key is immutable. Accessing MySQL by primary key is very fast and efficient.
  • Scalability is about tradeoffs. What tradeoffs are we going to make? Didn’t want to use NoSQL because they sacrifice consistency and most developers do not know how to deal with that. So stick with MySQL.
  • Active database. Found that after a site has been built, only 6% were still being updated. Given this, the active sites can be stored in one database that is really fast and relatively small in terms of storage (2TB).
  • Archive database. All the stale site data, for sites that are infrequently accessed, is moved over into another database that is relatively slow, but has huge amounts of storage. After three months data is pushed to this database if accesses are low. (One could argue this is an implicit caching strategy.)
  • Gives a lot of breathing room to grow. The large archive database is slow, but it doesn’t matter because the data isn’t used that often. On first access the data comes from the archive database, but then it is moved to the active database so later accesses are fast.
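
The content-hash key and the active/archive split described above can be sketched roughly as follows. This is only an illustration, not Wix's code: the class name `TieredPageStore`, the use of SHA-256, and the in-memory dictionaries standing in for the two MySQL databases are all assumptions.

```python
import hashlib
import json


def content_key(page: dict) -> str:
    """Derive an immutable key from the page content (SHA-256 is an assumption)."""
    canonical = json.dumps(page, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TieredPageStore:
    """Hypothetical two-tier store: small fast 'active' tier, large slow 'archive' tier."""

    def __init__(self):
        self.active = {}   # stands in for the fast active database
        self.archive = {}  # stands in for the slow archive database

    def save(self, page: dict) -> str:
        key = content_key(page)
        # Pages are immutable: identical content yields the same key,
        # so saving the same page twice changes nothing.
        self.active.setdefault(key, page)
        return key

    def demote(self, key: str) -> None:
        """Move a stale page to the archive tier."""
        if key in self.active:
            self.archive[key] = self.active.pop(key)

    def load(self, key: str) -> dict:
        if key in self.active:
            return self.active[key]
        # First access after archiving hits the slow tier; move the page
        # back so later reads are fast. Raises KeyError if the key is unknown.
        page = self.archive.pop(key)
        self.active[key] = page
        return page
```

Because the key depends only on the content, two dictionaries with the same fields in a different order map to the same row, which is what makes the primary-key lookup safe to treat as a key-value read.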

High Availability For Editor Segment

  • With a lot of data it’s hard to provide high availability for everything. So look at the critical path, which for a website is the content of the website. If a widget has problems most of the website will still work. Invested a lot in protecting the critical path.
  • Protect against database crashes. Want to recover quickly. Replicate databases and failover to the secondary database.
  • Protect against data corruption and data poisoning. It doesn’t have to be malicious; a bug is enough to spoil the barrel. All data is immutable. Revisions are stored for everything. The worst case, if corruption can’t be fixed, is to revert to a version where the data was fine.
  • Protect against unavailability. A website has to work all the time. This drove an investment in replicating data across different geographical locations and multiple clouds. This makes the system very resilient.
    • Clicking save on a website editing session sends a JSON file to the editor server.
    • The server sends the page to the active MySQL server which is replicated to another datacenter.
    • After the page is saved locally, an asynchronous process is kicked off to upload the data to a static grid, which is the Media Segment.
    • After data is uploaded to the static grid, a notification is sent to an archive service running on the Google Compute Engine. The archive goes to the grid, downloads a page, and stores a copy on the Google cloud.
    • Then a notification is sent back to the editor saying the page was saved to GCE.
    • Another copy is saved to Amazon from GCE.
    • Once the final notification is received it means there are three copies of the current revision of data: one in the database, one in the static grid, and one on GCE.
    • For the current revision there are three copies. For old revisions there are two copies (static grid, GCE).
    • The process is self-healing. If there’s a failure the next time a user updates their website everything that wasn’t uploaded will be uploaded again.
    • Orphan files are garbage collected.
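
The self-healing behaviour described above could look something like the sketch below. It is a simplification under stated assumptions: the real pipeline is chained (database, then static grid, then GCE, then Amazon), whereas this hypothetical `ReplicationTracker` treats each destination independently, and the uploader callables are placeholders rather than real storage clients.

```python
class ReplicationTracker:
    """Hypothetical tracker of which page revisions reached each copy location."""

    DESTINATIONS = ("static_grid", "gce", "amazon")

    def __init__(self, uploaders):
        # uploaders maps a destination name to a callable(key, data) that
        # returns True on success. These callables are placeholders.
        self.uploaders = uploaders
        self.saved = {}  # key -> data, stands in for the active database
        self.confirmed = {dest: set() for dest in self.DESTINATIONS}

    def save(self, key, data):
        """Record the new revision, then retry anything still outstanding."""
        self.saved[key] = data
        self.heal()

    def heal(self):
        """Attempt every upload that has not been confirmed yet."""
        for dest in self.DESTINATIONS:
            for key, data in self.saved.items():
                if key in self.confirmed[dest]:
                    continue
                try:
                    ok = self.uploaders[dest](key, data)
                except Exception:
                    ok = False  # a failed upload is simply retried on the next save
                if ok:
                    self.confirmed[dest].add(key)

    def missing(self, dest):
        """Keys saved locally but not yet confirmed at the given destination."""
        return set(self.saved) - self.confirmed[dest]
```

The point of the design is that no separate repair job is needed for the common case: the next save by the same user carries the retry.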

Modeling Data With No Database Transactions

  • Don’t want a situation where a user edits two pages and only one page is saved in the database, which is an inconsistent state.
  • Take all the JSON files and stick them in the database one after the other. When all the files are saved, another save command is issued which contains a manifest of the IDs of all the saved pages that were uploaded to the static servers. Each ID is a hash of the page content, which is also the file name on the static server.
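
A minimal sketch of the manifest idea, assuming an in-memory store: pages are written first, and a single manifest write at the end makes the new revision visible. The `PageStore` class, its method names, and the choice of SHA-256 are hypothetical and only illustrate the ordering.

```python
import hashlib
import json


class PageStore:
    """Hypothetical store: immutable pages keyed by content hash, one manifest per site."""

    def __init__(self):
        self.pages = {}      # page_id -> serialized page
        self.manifests = {}  # site_id -> list of page_ids

    def save_page(self, page: dict) -> str:
        body = json.dumps(page, sort_keys=True)
        page_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self.pages[page_id] = body  # write-once: same content, same ID
        return page_id

    def save_site(self, site_id: str, pages: list) -> list:
        # Step 1: write every page. A crash here leaves orphan pages, but
        # the old manifest still points at a complete, consistent set.
        page_ids = [self.save_page(p) for p in pages]
        # Step 2: one write switches the site to the new revision.
        self.manifests[site_id] = page_ids
        return page_ids

    def load_site(self, site_id: str) -> list:
        return [json.loads(self.pages[pid]) for pid in self.manifests[site_id]]
```

Readers only ever follow the manifest, so a half-finished save is invisible to them, and the leftover pages are the kind of orphan files that garbage collection can remove later.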

Media Segment

  • Stores lots of files. 800TB of user media files, 3M files uploaded daily, and 500M metadata records.
  • Images are modified. They are resized for different devices and sharpened. Watermarks can be inserted and there’s also audio format conversion.
  • Built an eventually consistent distributed file system that is multi datacenter aware with automatic fallback across DCs. This is before Amazon.
  • A pain to run. 32 servers, doubling the number every 9 months.
  • Plan to push stuff to the cloud to help scale.
  • Vendor lock-in is a myth. It’s all APIs. Just change the implementation and you can move to different clouds in weeks.
  • What really locks you down is data. Moving 800TB of data to a different cloud is really hard.
  • They broke Google Compute Engine when they moved all their data into GCE. They reached the limits of the Google cloud. After some changes by Google it now works.
  • Files are immutable so they are highly cacheable.
  • Image requests first go to a CDN. If the image isn’t in the CDN the request goes to their primary datacenter in Austin. If the image isn’t in Austin the request then goes to Google Cloud. If it’s not in Google cloud it goes to a datacenter in Tampa.
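
The lookup order for images can be sketched as an ordered fallback chain. The function name `fetch_with_fallback` and the dictionaries standing in for the CDN and the datacenters are assumptions made for illustration; a real fetcher would make a network call.

```python
from typing import Callable, Optional, Sequence, Tuple

Fetcher = Callable[[str], Optional[bytes]]


def fetch_with_fallback(path: str, sources: Sequence[Tuple[str, Fetcher]]) -> Tuple[str, bytes]:
    """Try each source in order; return (source_name, data) from the first hit."""
    for name, fetch in sources:
        try:
            data = fetch(path)
        except Exception:
            continue  # treat an unreachable source like a miss
        if data is not None:
            return name, data
    raise LookupError(f"{path} not found in any source")


# Placeholder stores: the file is missing from the CDN and from Austin.
cdn = {}
austin = {}
google_cloud = {"/img/logo.png": b"png-bytes"}
tampa = {"/img/logo.png": b"png-bytes"}

sources = [
    ("cdn", cdn.get),
    ("austin", austin.get),
    ("google_cloud", google_cloud.get),
    ("tampa", tampa.get),
]

name, data = fetch_with_fallback("/img/logo.png", sources)
```

Since files are immutable, any source that has the file has the right version of it, so the chain never needs to compare copies.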

Public Segment

  • Resolve URLs (45 million of them), dispatch to the appropriate renderer, and then render into HTML, sitemap XML, or robots TXT, etc.
  • Public SLA is that response time is < 100ms at peak traffic. Websites have to be available, but also fast. Remember, no caching.
  • When a user clicks publish after editing a page, the manifest, which contains references to pages, is pushed to Public. The routing table is also published.
  • Minimize out-of-service hops. Requires 1 database call to resolve the route. 1 RPC call to dispatch the request to the renderer. 1 database call to get the site manifest.
  • Lookup tables are cached in memory and are updated every 5 minutes.
  • Data is not stored in the same format as it is for the editor. It is stored in a denormalized format, optimized for read by primary key. Everything that is needed is returned in a single request.
  • Minimize business logic. The data is denormalized and precalculated. When you handle large scale every operation, every millisecond you add, it’s times 45 million, so every operation that happens on the public server has to be justified.
  • Page rendering.
    • The html returned by the public server is bootstrap html. It’s a shell with JavaScript imports and JSON data with references to site manifest and dynamic data.
    • Rendering is offloaded to the client. Laptops and mobile devices are very fast and can handle the rendering.
    • JSON was chosen because it’s easy to parse and compressible.
    • It’s easier to fix bugs on the client. Just redeploy new client code. When rendering is done on the server the html will be cached, so fixing a bug requires re-rendering millions of websites again.
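
The in-memory lookup tables with a five-minute refresh mentioned above might be approximated like this. It is a sketch under assumptions: `RefreshingTable` is a hypothetical name, the loader is a placeholder for the database query, and the refresh happens lazily on read rather than on a background timer, which the talk does not specify.

```python
import time


class RefreshingTable:
    """Hypothetical in-memory lookup table reloaded once it is older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds=300.0, clock=time.monotonic):
        self._loader = loader    # callable returning a fresh dict
        self._ttl = ttl_seconds  # 300 s corresponds to the 5-minute refresh
        self._clock = clock      # injectable so the refresh can be tested
        self._table = loader()
        self._loaded_at = clock()

    def get(self, key, default=None):
        now = self._clock()
        if now - self._loaded_at >= self._ttl:
            # Table is stale: reload it in full before answering.
            self._table = self._loader()
            self._loaded_at = now
        return self._table.get(key, default)
```

Between refreshes every lookup is a plain dictionary read, which keeps route resolution off the network for all but one request every five minutes.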

High Availability For Public Segment

  • Goal is to be always available, but stuff happens.
  • On a good day: a browser makes a request, the request goes to a datacenter, through a load balancer, goes to a public server, resolves the route, goes to the renderer, the html goes back to the browser, and the browser runs the JavaScript. The JavaScript fetches all media files and the JSON data and renders a very beautiful web site. The browser then makes a request to the Archive service. The Archive service replays the request in the same way the browser does and stores the data in a cache.
  • On a bad day a datacenter is lost, which did happen. All the UPSs died and the datacenter was down. The DNS was changed and then all the requests went to the secondary datacenter.
  • On a bad day Public is lost. This happened once when a load balancer got half of a configuration so all the Public servers were gone. Or a bad version can be deployed that starts returning errors. Custom code in the load balancer handles this problem by routing to the Archive service to fetch the cached copy if the Public servers are not available. This approach meant customers were not affected when Public went down, even though the system was reverberating with alarms at the time.
  • On a bad day the Internet sucks. The browser makes a request, goes to the datacenter, goes to the load balancer, gets the html back. Now the JavaScript code has to fetch all the pages and JSON data. It goes to the CDN, it goes to the static grid and fetches all the JSON files to render the site. In these processes Internet problems can prevent files from being returned. Code in JavaScript says if you can’t get to the primary location, try and get it from the archive service, if that fails try the editor database.
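
The load balancer fallback to the Archive service can be illustrated with a small wrapper. This is a loose sketch, not the actual load balancer code: `ArchiveFallback` is a hypothetical name, the snapshot dictionary stands in for the Archive service cache, and archiving happens inline here instead of through a separate replayed request.

```python
class ArchiveFallback:
    """Hypothetical front door: serve live responses, fall back to the last archived copy."""

    def __init__(self, public_handler):
        self._public = public_handler  # callable(url) -> html, may raise
        self._snapshots = {}           # url -> last good html (Archive stand-in)

    def handle(self, url):
        try:
            html = self._public(url)
        except Exception:
            if url in self._snapshots:
                # Public is unavailable: serve the cached copy instead.
                return self._snapshots[url]
            raise  # nothing archived for this URL, so the error surfaces
        self._snapshots[url] = html  # remember the good response
        return html
```

A site that has been viewed at least once keeps rendering while Public is down; only URLs that were never archived see the failure.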

Lessons Learned

  • Identify your critical path and concerns. Think through how your product works. Develop usage scenarios. Focus your efforts on these as they give the biggest bang for the buck.
  • Go multi-datacenter and multi-cloud. Build redundancy on the critical path (for availability).
  • De-normalize data and minimize out-of-process hops (for performance). Precalculate and do everything possible to minimize network chatter.
  • Take advantage of client’s CPU power. It saves on your server count and it’s also easier to fix bugs in the client.
  • Start small, get it done, then figure out where to go next. Wix did what they needed to do to get their product working. Then they methodically moved to a sophisticated services architecture.
  • The long tail requires a different approach. Rather than cache everything, Wix chose to optimize the heck out of the render path and keep data in both active and archive databases.
  • Go immutable. Immutability has far reaching consequences for an architecture. It affects everything from the client through the back-end. It’s an elegant solution to a lot of problems.
  • Vendor lock-in is a myth. It’s all APIs. Just change the implementation and you can move to different clouds in weeks.
  • What really locks you down is data. Moving lots of data to a different cloud is really hard.
