CRED’s scaling up strategy during IPL

IPL 2021, one of our biggest campaigns, tested our systems on the three parameters that make up scale: stress (force applied to the system over an extended period), impulse (a rapid shock to the system), and surprises (traversal of unexpected code paths and traffic patterns).

The Wall, Jammy, Mr Dependable and…Indiranagar ka Gunda! Rahul Dravid’s soft-spoken image got a quirky spin after the unforgettable CRED ad. Since then, the company has come up with several innovative advertisements that offer a light-hearted break between nail-biting IPL matches. As the official sponsor of IPL, CRED has developed several interesting initiatives, which have paid rich dividends by routing many potential customers to its platform. Handling such a large influx of customer visits is no mean feat, and CRED has managed it well.

Analytics India Magazine caught up with Srinivas Iyengar, who heads Backend Technology & Platforms at CRED.

AIM: The engineering department at CRED has pulled off some great feats in the past. Tell us about your team.

Srinivas Iyengar: Engineering at CRED encompasses various functions, including frontend, backend and QA across all lines of business. It also includes data, which comprises data platform, data science, and analytics. Frontend functions at CRED include mobile web, Android and iOS, and backend functions comprise development and testing teams across all lines of business. 

When it comes to our tech stack, on the frontend we work with Kotlin, Swift and Flutter for mobile and React-Redux for web; on the backend and data platform we work with Java and Golang. This applies to both developers and testers. For those interested in details, we rely on both proprietary and open source technologies, some of which are:

  • Programming languages – Java, Golang, Python (backend, data platform)
  • Data stores – MySQL, DynamoDB, Elasticsearch, Redis
  • Message brokers & queues – Kafka, SQS
  • Deployments – We use AWS ECS for orchestrating container-based deployments on EC2 & Fargate.
  • Other AWS services – We heavily leverage core offerings like S3 (for object storage), Lambda, SNS, CodeBuild, CloudWatch, Step Functions (for workflow orchestration), EMR, etc.
  • Observability – Datadog for APM (Application Performance Monitoring), PagerDuty for alerts, Coralogix for logging, and SignalFx and AWS CloudWatch for metrics instrumentation.

AIM: What are the mechanisms in place to scale up, especially during events like IPL?

Srinivas Iyengar: At CRED, scale comes from the impulse created by IPL campaigns and rewards like CRED Powerplay and CRED Jackpot, from TV ads and commentator mentions during the game, and from month-end payment cycles.

IPL 2021, one of our biggest campaigns, tested our systems on the three parameters that make up scale: stress (force applied to the system over an extended period), impulse (a rapid shock to the system), and surprises (traversal of unexpected code paths and traffic patterns).

First, we created a ready reckoner for large-scale event planning, mapped against our product, business and technology goals, and laid out traffic projections for the campaign. Then, we created a blueprint document detailing the goals, objectives and key customer flows, along with the nuances of scale: availability, performance, fault tolerance, observability and resilience. It also mapped out external systems and integrations, monitoring systems, dashboards, alerts and war rooms. This set the tone and structure for tech planning for a large-scale campaign like IPL. We also defined our tenets of scale: availability, performance, observability, fault tolerance, traffic control and load management measures, and testing.

AIM: How do you ensure zero downtime of critical resources?

Srinivas Iyengar: Mapping fault tolerance and creating strategies to strengthen it is key to ensuring zero downtime of critical resources. We had a dedicated cross-functional team focusing on readiness, and we ran chaos tests using Toxiproxy to simulate network and system failures, as well as failures of databases and AWS services. In addition, we use Kong as an API gateway, and that’s where we do traffic control. We built a quality of service (QoS) plugin or, as we loosely call it, a graceful degradation plugin. The plugin adds some metadata to our configuration service, which our application teams consume and act upon.
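
To make the chaos-testing idea concrete, here is a minimal in-process sketch. It does not use Toxiproxy or any CRED code; `FaultInjector`, `fetch_rewards` and `rewards_with_fallback` are hypothetical names chosen for illustration. It shows the general pattern of deliberately failing a dependency and checking that the caller falls back gracefully.

```python
import random


class DependencyDown(Exception):
    """Raised by the fault injector to mimic a failed downstream call."""


class FaultInjector:
    """Wraps a callable and fails a configurable fraction of calls."""

    def __init__(self, func, failure_rate, seed=None):
        self.func = func
        self.failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self._rng = random.Random(seed)

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self.failure_rate:
            raise DependencyDown("injected failure")
        return self.func(*args, **kwargs)


def fetch_rewards(user_id):
    # Hypothetical stand-in for a real downstream call (database, AWS service).
    return {"user_id": user_id, "rewards": ["cashback", "coins"]}


def rewards_with_fallback(fetch, user_id):
    """Return live rewards, or an empty default if the dependency fails."""
    try:
        return fetch(user_id)
    except DependencyDown:
        return {"user_id": user_id, "rewards": []}


if __name__ == "__main__":
    always_down = FaultInjector(fetch_rewards, failure_rate=1.0)
    print(rewards_with_fallback(always_down, 42))
```

A real chaos test would inject faults at the network layer rather than in process, but the assertion is the same: the customer-facing path still returns a usable response when the dependency is down.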

For instance, it can indicate the level of traffic and the intervention needed with the help of colours: green for smooth, yellow for some stress, and red for critical. The first version of the plugin was manual: we anticipated the traffic and enabled the plugin at the start of the match. When we set the plugin to a lower quality of service, latencies go down because the downstream systems gracefully degrade the data they return and process only the information that is critical for the customer. It hides all the information that the customer doesn’t care about. The plugin protected us from the spike, as we had anticipated, and it turned out to be a major learning for us in the first leg of the IPL. In the second leg of the IPL, we built version two of the QoS plugin, which was fully automated.
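
A rough sketch of how a backend service might act on such a colour-coded level is shown below. This is an assumption-laden illustration, not CRED's plugin: the field names and the `FIELD_MAX_LEVEL` mapping are invented, and in practice the current level would be read from the configuration service rather than passed in directly.

```python
from enum import Enum


class QoSLevel(Enum):
    GREEN = "green"    # smooth: serve everything
    YELLOW = "yellow"  # some stress: drop nice-to-have data
    RED = "red"        # critical: serve only what the customer must see


_SEVERITY = {QoSLevel.GREEN: 0, QoSLevel.YELLOW: 1, QoSLevel.RED: 2}

# Hypothetical mapping: the most severe level at which each field is still served.
FIELD_MAX_LEVEL = {
    "amount_due": QoSLevel.RED,
    "due_date": QoSLevel.RED,
    "reward_balance": QoSLevel.YELLOW,
    "recommended_offers": QoSLevel.GREEN,
}


def degrade(response, level):
    """Keep only the fields that are still served at the given QoS level.

    Fields missing from FIELD_MAX_LEVEL are treated as non-critical
    and are served only at GREEN.
    """
    return {
        field: value
        for field, value in response.items()
        if _SEVERITY[FIELD_MAX_LEVEL.get(field, QoSLevel.GREEN)] >= _SEVERITY[level]
    }


if __name__ == "__main__":
    full = {
        "amount_due": 1200,
        "due_date": "2021-10-05",
        "reward_balance": 350,
        "recommended_offers": ["offer-a", "offer-b"],
    }
    for level in QoSLevel:
        print(level.value, sorted(degrade(full, level)))
```

Declaring criticality per field keeps the degradation rule in one place, so moving back to green restores every field without code changes.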

Depending on the incoming traffic numbers, we were able to predict traffic rates and control the QoS level automatically instead of operating it manually. The plugin automatically figured out when it had to lower the quality of service, and the backend services would accordingly enable or disable some functionalities for the customer. Once the QoS level moves back to green, we enable them all. Our key learning was to anticipate: one can plan for 5x or 10x stress, but having planned for 5x and then experiencing 20x can only be handled if the team has anticipated all situations.
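
One simple way to derive a level from incoming traffic is a sliding-window request rate compared against thresholds. The sketch below assumes exactly that; the class name `AutoQoS`, the thresholds and the window size are hypothetical, and the source does not describe how CRED's automated plugin actually computes or predicts the level.

```python
from collections import deque


class AutoQoS:
    """Derives a QoS colour from the request rate over a sliding window."""

    def __init__(self, yellow_rps, red_rps, window_seconds=10):
        self.yellow_rps = yellow_rps
        self.red_rps = red_rps
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_request(self, now):
        """Record one request seen at time `now` (seconds)."""
        self._timestamps.append(now)

    def level(self, now):
        """Return 'green', 'yellow' or 'red' for the current window."""
        # Drop requests that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        rate = len(self._timestamps) / self.window_seconds
        if rate >= self.red_rps:
            return "red"
        if rate >= self.yellow_rps:
            return "yellow"
        return "green"


if __name__ == "__main__":
    qos = AutoQoS(yellow_rps=5, red_rps=10, window_seconds=10)
    for _ in range(60):
        qos.record_request(now=1.0)
    print(qos.level(now=1.0))
    print(qos.level(now=20.0))
```

A production controller would likely add hysteresis so the level does not flap near a threshold; that is left out here for brevity.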

AIM: How do you improve the performance of your data store?

Srinivas Iyengar: For our data stores, primarily MySQL, Elasticsearch and Redis, we analysed the access patterns, the top queries and resource utilisation (CPU, memory, disk/IOPS), along with data growth. These initiatives helped identify the right capacity, both current and future, based on the projected estimates.

Several things helped improve performance. We analysed slow queries, identified optimal indexes and improved access patterns by leveraging partitions. For heavy read flows, we leveraged replicas (bulkheading reads from writes) and effective caching strategies to reduce the read path workload. We also created event-sourced materialised views for some of our core flows, which reduced read amplification and fan-out to multiple downstream services and their associated data stores. These views reduced latencies and lowered the throughput we sent to the downstream services.
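
The idea behind an event-sourced materialised view can be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the event types (`bill_generated`, `payment_made`), the fields and the `CardSummaryView` class are hypothetical, and a real system would consume events from a broker such as Kafka and persist the view in a data store.

```python
class CardSummaryView:
    """Read model kept current by applying events, so reads need no fan-out."""

    def __init__(self):
        self._summary = {}

    def apply(self, event):
        """Fold one event into the per-user summary row."""
        row = self._summary.setdefault(event["user_id"], {"due": 0, "paid": 0})
        if event["type"] == "bill_generated":
            row["due"] += event["amount"]
        elif event["type"] == "payment_made":
            row["due"] -= event["amount"]
            row["paid"] += event["amount"]
        # Other event types leave the row unchanged.

    def get(self, user_id):
        """Single lookup on the read path; returns a copy of the row."""
        return dict(self._summary.get(user_id, {"due": 0, "paid": 0}))


if __name__ == "__main__":
    view = CardSummaryView()
    events = [
        {"type": "bill_generated", "user_id": 7, "amount": 1000},
        {"type": "payment_made", "user_id": 7, "amount": 400},
    ]
    for event in events:
        view.apply(event)
    print(view.get(7))
```

The write path pays a small cost per event, and in exchange the read path becomes one lookup instead of calls to several downstream services.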

AIM: How do you acquire talent for your team?

Srinivas Iyengar: When hiring for our diverse teams, we look for people who share our passion and vision and who resonate with what we’re trying to build at CRED. Curiosity, willingness to learn, and openness to feedback are important traits we consider when hiring new team members. We encourage a shareholder mindset and don’t micromanage at CRED. Our work culture requires people to map their own tasks, thrive and become successful. Presently, we’re looking to hire team members for our frontend and backend teams across all lines of business, and there are certain skills required for our frontend team. At CRED, we look for talent density, high performance, and, most importantly, trust and value alignment. While hiring, we gauge criteria like learnability, problem-solving skills, trust (measured through one’s credit score), obsession with good quality work, a need for growth, and a shareholder mindset.

CRED’s competency framework fosters progressive learning as engineers and data scientists advance their careers with increasing scale, complexity, ambiguity, and interdependency within their streams. Our focus is to enable opportunities for growth and provide team members with the experience they desire. We have soft boundaries in terms of job roles and descriptions – enabling team members to learn and gain experience in various fields. For instance, a frontend engineer can work on the backend in a project and vice versa. 

Shraddha Goled
I am a technology journalist with AIM. I write stories focused on the AI landscape in India and around the world with a special interest in analysing its long term impact on individuals and societies. Reach out to me at shraddha.goled@analyticsindiamag.com.
