
AppLovin: Marketing To Mobile Consumers Worldwide By Processing 30 Billion Requests A Day

This is a guest post from AppLovin's VP of engineering, Basil Shikin, on the infrastructure of its mobile marketing platform. Major brands like Uber, Disney, Yelp and Hotels.com use AppLovin's mobile marketing platform, which processes 30 billion requests and 60 terabytes of data a day.
AppLovin’s marketing platform provides marketing automation and analytics for brands who want to reach their consumers on mobile. The platform enables brands to use real-time data signals to make effective marketing decisions across one billion mobile consumers worldwide.

Core Stats

  • 30 Billion ad requests per day
  • 300,000 ad requests per second, peaking at 500,000 ad requests per second
  • 5ms average response latency
  • 3 Million events per second
  • 60TB of data processed daily
  • ~1000 servers
  • 9 data centers
  • ~40 reporting dimensions
  • 500,000 metrics data points per minute
  • 1 PB Spark cluster
  • 15GB/s peak disk writes across all servers
  • 9GB/s peak disk reads across all servers
  • Founded in 2012, AppLovin is headquartered in Palo Alto, with offices in San Francisco, New York, London and Berlin.

Technology Stack


Third Party Services

Data Storage

  • Aerospike for user profile storage (a profile-lookup sketch follows this list)
  • Vertica for aggregated statistics and real-time reporting
  • Aggregating 350,000 rows per second and writing to Vertica at 34,000 rows per second
  • Peak 12,000 user profiles per second written to Aerospike
  • MySQL for ad data
  • Spark for offline processing and deep data analysis
  • Redis for basic caching
  • Thrift for all data storage and transfers
  • Each data point replicated in 4 data centers
  • Each service is replicated at least in 2 data centers (at most in 8)
  • Amazon Web Services used for long term data storage and backups
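
For a sense of what the Aerospike path looks like, here is a minimal sketch of a profile read and write using the Aerospike Java client. The namespace, set, and bin names are made up for illustration; the post does not specify them.

```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.Record;

public class ProfileStore {
    // Host, namespace, and set names are hypothetical; the post does not specify them.
    private final AerospikeClient client = new AerospikeClient("aerospike-host", 3000);

    public Record fetchProfile(String userId) {
        Key key = new Key("profiles", "users", userId);
        return client.get(null, key); // single-record read; typically sub-millisecond
    }

    public void writeProfile(String userId, String segment) {
        Key key = new Key("profiles", "users", userId);
        client.put(null, key, new Bin("segment", segment));
    }
}
```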

Core App And Services

  • Custom C/C++ Nginx module for high performance ad serving
  • Java for data processing and auxiliary services
  • PHP / Javascript for UI
  • Jenkins for continuous integration and deployment
  • Zookeeper for distributed locking (see the locking sketch after this list)
  • HAProxy and IPVS for high availability
  • Coverity for Java/C++ static code analysis
  • Checkstyle and PMD for PHP static code analysis
  • Syslog for a centralized log server in each data center
  • Hibernate for transaction-based services
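
A minimal sketch of the ZooKeeper distributed-locking pattern, here using the Apache Curator recipes library (an assumption; the post only names ZooKeeper). The connection string and lock path are hypothetical.

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ConfigLock {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory
                .newClient("zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        // One mutex per resource; only one process in the cluster holds it at a time
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/ad-config");
        lock.acquire();
        try {
            // critical section: e.g., regenerate and publish shared data
        } finally {
            lock.release();
        }
        client.close();
    }
}
```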

Servers And Provisioning

  • Ubuntu
  • Cobbler for bare metal provisioning
  • Chef for configuring servers
  • Berkshelf for Chef dependencies
  • Docker with Test Kitchen for running infrastructure tests

Monitoring Stack


Server Monitoring

  • Icinga for all servers
  • ~100 custom Nagios plugins for deep server monitoring
  • 550 various probes per server
  • Graphite as the data format (a metric-sender sketch follows this list)
  • Grafana for displaying all graphs
  • PagerDuty for issue escalation
  • Smokeping for network mesh monitoring
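
Graphite's Carbon daemon accepts metrics over a simple plaintext TCP protocol, which is presumably how the probe data gets in. A minimal sender sketch; the host and metric name are hypothetical, while port 2003 is Carbon's default plaintext port.

```java
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class GraphiteSender {
    public static void send(String metric, double value) throws Exception {
        long epochSeconds = System.currentTimeMillis() / 1000;
        try (Socket socket = new Socket("graphite-host", 2003);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.US_ASCII)) {
            // Carbon's plaintext format: "<path> <value> <timestamp>\n"
            out.write(metric + " " + value + " " + epochSeconds + "\n");
        }
    }

    public static void main(String[] args) throws Exception {
        send("servers.web01.ads.requests_per_sec", 301250.0);
    }
}
```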

Application Monitoring

  • VividCortex for MySQL monitoring
  • JSON /health endpoint on each service (see the sketch after this list)
  • Cross-DC database consistency monitoring
  • Nine 65” 4K TVs for showing all graphs across the office
  • Statistical deviation monitoring
  • Fraudulent users monitoring
  • Third-party systems monitoring
  • Deployments are recorded in all graphs
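
A minimal sketch of a JSON /health endpoint using the JDK's built-in HTTP server. The payload fields and port are illustrative; the post does not document them.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class HealthEndpoint {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/health", exchange -> {
            // Fields here are hypothetical; report whatever the service knows about itself
            String body = "{\"status\":\"ok\",\"queueDepth\":0,\"uptimeSec\":12345}";
            byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(bytes);
            }
        });
        server.start();
    }
}
```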

Intelligent Monitoring

  • Intelligent alerting system with a feedback loop: a system that can introspect anything can learn anything (a deviation-alert sketch follows this list)
  • Third-party stats about AppLovin are also monitored
  • Alerting is a cross-team exercise: developers, ops, business, data scientists are involved
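
The post doesn't describe the alerting math, but a common building block for the statistical deviation monitoring mentioned above is an exponentially weighted moving average: track a running mean and variance of each metric and alert when a sample lands too many standard deviations away. A generic sketch, not AppLovin's actual algorithm:

```java
/** Flags samples that deviate more than k standard deviations from an
 *  exponentially weighted running mean. Alpha controls how fast the
 *  baseline adapts to new data. */
public class DeviationAlert {
    private final double alpha;   // smoothing factor, e.g. 0.05
    private final double k;       // alert threshold in standard deviations
    private double mean, variance;
    private boolean warmedUp;

    public DeviationAlert(double alpha, double k) { this.alpha = alpha; this.k = k; }

    public boolean sample(double value) {
        if (!warmedUp) { mean = value; warmedUp = true; return false; }
        double diff = value - mean;
        boolean alert = variance > 0 && Math.abs(diff) > k * Math.sqrt(variance);
        mean += alpha * diff;                                      // update EWMA mean
        variance = (1 - alpha) * (variance + alpha * diff * diff); // update EWM variance
        return alert;
    }
}
```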

Architecture Overview


General Considerations

  • Store everything in RAM
  • If it does not fit, save it to SSD
  • L2 cache level optimizations matter
  • Use right tool for the right job
  • Architecture allows swapping any component
  • Upgrade only if an alternative is 10x better
  • Write your own components if there is nothing suitable out there
  • Replicate important data at least 3x
  • Make sure every message can be re-played without data corruption
  • Automate everything
  • Zero-copy message passing (see the sketch below)
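
On the zero-copy point: one way to get it on the JVM is FileChannel.transferTo, which hands bytes from a file to a socket inside the kernel (sendfile on Linux) without copying them through user space. A sketch, with the log-segment layout and peer connection assumed:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    /** Streams a file to a peer kernel-to-kernel; no user-space buffer is involved. */
    public static void send(Path segment, SocketChannel peer) throws IOException {
        try (FileChannel ch = FileChannel.open(segment, StandardOpenOption.READ)) {
            long position = 0, size = ch.size();
            while (position < size) {
                position += ch.transferTo(position, size - position, peer);
            }
        }
    }
}
```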

Message Processing

  • Custom message processing system that guarantees message delivery
  • 3x replication for each message
  • Sending a message = writing to disk (see the append-log sketch after this list)
  • Any service may fail, but no data are lost
  • Message dispatching system connects all components together, provides isolation and extensibility of the system
  • Cross-DC failure tolerance
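
The custom message system itself isn't public, but the core guarantee above can be sketched as an append-only, length-prefixed log that fsyncs before acknowledging. Length prefixes also make replay after a crash straightforward. A minimal sketch, assuming this structure:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Append-only message log: a message counts as "sent" only once it is on disk. */
public class MessageLog {
    private final FileChannel channel;

    public MessageLog(Path file) throws IOException {
        channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
    }

    public void send(byte[] payload) throws IOException {
        // Length-prefixed record so a reader can replay the log after a crash
        ByteBuffer record = ByteBuffer.allocate(4 + payload.length);
        record.putInt(payload.length).put(payload).flip();
        while (record.hasRemaining()) {
            channel.write(record);
        }
        channel.force(true); // fsync: only now is the message considered sent
    }
}
```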

Ad Serving

  • Nginx is really fast: can serve an ad in less than a millisecond
  • Keep all ad serving data in memory: read only
  • jemalloc gave a 30% speed improvement
  • Use Aerospike for user profiles: less than 1ms to fetch a profile
  • Pre-compute all ad serving data on one box and dispatch across all servers
  • Torrents are used to propagate serving data across all servers. Using torrents resulted in an 83% drop in network load on the originating server compared to HTTP-based distribution.
  • mmap is used to share ad serving data across nginx processes
  • XXHash is an extremely fast hash function with a low collision rate: 75x faster than SHA-1 for computing checksums
  • 5% of real traffic goes to staging environment
  • Ability to run 3 A/B tests at once (20%/20%/10% of traffic for three separate tests, 50% for control; a bucketing sketch follows this list)
  • A/B test results are available in regular reporting
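
The 20%/20%/10%/50% split can be made deterministic by hashing each user into one of 100 buckets, so a given user always sees the same variant. A sketch; CRC32 as the hash is my choice for illustration, not necessarily theirs:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Deterministic traffic split for three concurrent A/B tests plus control. */
public class TrafficSplitter {
    public static String assign(String userId) {
        CRC32 crc = new CRC32();
        crc.update(userId.getBytes(StandardCharsets.UTF_8));
        long bucket = crc.getValue() % 100; // stable bucket in [0, 100)
        if (bucket < 20) return "test-A";   // 20% of traffic
        if (bucket < 40) return "test-B";   // 20%
        if (bucket < 50) return "test-C";   // 10%
        return "control";                   // 50%
    }
}
```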

Data Warehouse

  • All data are replicated
  • Running most reports takes under 2 seconds
  • Aggregation is key to allow fast reports on large amounts of data (see the rollup sketch after this list)
  • Non-aggregated data for the last 48 hours is usually enough to resolve most issues
  • 7 days of raw logs is usually enough for debug
  • Some reports must be pre-computed
  • Always think multiple data centers: every data point goes to multiple locations
  • Backup in S3 for catastrophic failures
  • All raw data are stored in Spark cluster
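
To make the aggregation point concrete: raw events are collapsed onto their reporting dimensions as they arrive, so reports scan small rollup tables instead of billions of raw rows. A toy in-memory version; the dimension names are hypothetical, and in production the rollups land in Vertica:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Collapses raw events onto their dimension tuple as they arrive. */
public class Rollup {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    public void record(String country, String appId, String adType) {
        String key = country + "|" + appId + "|" + adType; // one dimension combination
        counters.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    public long count(String country, String appId, String adType) {
        LongAdder adder = counters.get(country + "|" + appId + "|" + adType);
        return adder == null ? 0 : adder.sum();
    }
}
```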

Team


Structure

  • 70 full-time employees
  • 15 developers (platform, ad serving, frontend, mobile)
  • 4 data scientists
  • 5 DevOps engineers
  • Engineers in Palo Alto, CA
  • Business in San Francisco, CA
  • Offices in New York, London and Berlin

Interaction

  • HipChat to discuss most issues
  • Asana for project-based communication
  • All code is reviewed
  • Frequent group code reviews
  • Quarterly company outings
  • Regular town hall meetings with CEO
  • All engineers (junior to CTO) write code
  • Interviews are tough: offers are really rare

Development Cycle

  • A developer, the business side, or the data science team comes up with an idea
  • The idea is reviewed and scheduled for execution at a Monday meeting
  • Feature is implemented in a branch; development environment is used for basic testing
  • A pull request is created
  • Code is reviewed and iterated upon
  • For big features group code reviews are common
  • The feature gets merged to master
  • The feature gets deployed to staging with the next build
  • The feature gets tested on 5% real traffic
  • Statistics are examined
  • If the feature is successful it graduates to production
  • The feature is closely monitored for a couple of days

Avoiding Issues

  • The system is designed to handle failure of any component
  • No failure of a single component can harm ad serving or data consistency
  • Omniscient monitoring
  • Engineers watch and analyze key business reports
  • High quality of code is essential
  • Some features take multiple code reviews and iterations before graduating
  • Alarms are triggered when:
    • Stats for staging are different from production (a drift-check sketch follows this list)
    • FATAL errors on critical services
    • Error rate exceeds threshold
    • Any irregular activity is detected
  • Data are never dropped
  • Most log lines can be easily parsed
  • Rolling back of any change is easy by design
  • After every failure: fix, make sure same thing does not happen in the future, and add monitoring
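
A sketch of the staging-vs-production alarm: since staging takes roughly 5% of real traffic, its metric rates should track production's at that ratio, and the alarm fires when the drift exceeds some tolerance. The 5% share comes from the post; the 10% tolerance is an assumed parameter:

```java
/** Alarms when a staging metric rate drifts away from the rate implied by
 *  production traffic. Assumes staging serves 5% of traffic, production 95%. */
public class StagingDriftAlarm {
    private static final double STAGING_SHARE = 0.05;
    private static final double TOLERANCE = 0.10; // assumed, not from the post

    public static boolean drifted(double stagingRate, double productionRate) {
        if (productionRate == 0) return stagingRate > 0;
        // Expected staging rate = production rate scaled by the 5/95 traffic ratio
        double expected = productionRate * STAGING_SHARE / (1 - STAGING_SHARE);
        return Math.abs(stagingRate - expected) / expected > TOLERANCE;
    }
}
```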

Lessons Learned


Product Development

  • Being able to swap any component easily is key to growth
  • Failures drive innovative solutions
  • Staging environment is essential: always be ready to lose 5%
  • A/B testing is essential
  • Monitor everything
  • Build intelligent alerting system
  • Engineers should be aware of business goals
  • Business people should be aware of limitations of engineering
  • Make builds and continuous integration fast. Jenkins runs on two bare-metal servers with 32 CPUs, 128GB RAM and SSD drives

Infrastructure

  • Monitoring all data points is critical
  • Automation is important
  • Every component should support HA by design
  • Kernel optimizations can yield up to a 25% performance improvement
  • Process and IRQ balancing led to another 20% performance improvement
  • Power saving features impact performance
  • Use SSDs as much as possible
  • When optimizing, profile everything. Flame graphs are great!
