The Wall, Jammy, Mr Dependable and…Indiranagar ka Gunda! Rahul Dravid's soft personality got a quirky spin after the unforgettable CRED ad. Since then, the company has come up with several innovative advertisements that offer a light-hearted break between nail-biting IPL matches. As an official sponsor of the IPL, CRED has developed several interesting initiatives, which have paid rich dividends by routing many potential customers to its platform. Operating and managing such a large influx of customer visits is no mean feat, and CRED has done a good job of it.
AIM: The engineering department at CRED has pulled off some great feats in the past. Tell us about your team.
Srinivas Iyengar: Engineering at CRED encompasses various functions, including frontend, backend and QA across all lines of business. It also includes data, which comprises data platform, data science, and analytics. Frontend functions at CRED include mobile web, Android and iOS, and backend functions comprise development and testing teams across all lines of business.
When it comes to our tech stack, for frontend we work with Kotlin, Swift and Flutter for mobile and React-Redux for web, while our backend and data platform are built in Java and Go. This applies to both developers and testers. For those interested in the details, we rely on both proprietary and open-source technologies, some of which are:
- Programming languages – Java, Golang, Python (backend, data platform)
- Data stores – MySQL, DynamoDB, Elasticsearch, Redis
- Message brokers & queues – Kafka, SQS
- Deployments – We use AWS ECS for orchestrating container-based deployments on EC2 & Fargate.
- We heavily leverage other core AWS services like S3 (object storage), Lambda, SNS, CodeBuild, CloudWatch, Step Functions (workflow orchestration), EMR, etc.
- Our observability stack consists of Datadog for APM (Application Performance Monitoring), PagerDuty for alerts, Coralogix for logging, and SignalFx and AWS CloudWatch for metrics instrumentation.
AIM: What are the mechanisms in place to scale up, especially during events like IPL?
Srinivas Iyengar: At CRED, scale manifests itself from the impulse created due to the IPL campaigns, rewards like CRED Powerplay & CRED Jackpot, along with the TV ads & commentator mentions during the game and month-end payment cycles.
IPL 2021, one of our biggest campaigns, tested our systems on three parameters of scale: stress (force applied to the system over an extended period), impulse (a rapid shock to the system) and surprises (traversal of unexpected code paths and traffic patterns).
First, we created a ready reckoner for large-scale event planning, mapped against our product, business and technology goals, and laid out traffic projections for the campaign. Then, we created a blueprint document detailing the goals, objectives and key customer flows; the nuances of scale such as availability, performance, fault tolerance, observability and resilience; external systems and integrations; and monitoring systems, dashboards, alerts and war rooms. This set the tone and structure for tech planning for a large-scale campaign like IPL. We also defined our tenets of scale: availability, performance, observability, fault tolerance, traffic control and load-management measures, and testing.
AIM: How do you ensure zero downtime of critical resources?
Srinivas Iyengar: Mapping fault tolerance and creating strategies to strengthen it is key to ensuring zero downtime of critical resources. We had a dedicated cross-functional team focusing on readiness, and chaos testing was done using Toxiproxy to simulate network and system failures, as well as failures of databases and AWS services. In addition, we use Kong as an API gateway; that's where we do traffic control. We built a quality of service (QoS) plugin, or as we loosely call it, the graceful degradation plugin. The plugin adds metadata to our configuration service, which our application teams consume and act upon.
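Conceptually, the kind of fault injection Toxiproxy provides can be sketched with a small wrapper that adds latency and random failures to a dependency call. This is only an application-level illustration of the idea, not Toxiproxy's actual API; the names `with_chaos`, `latency_s` and `failure_rate` are assumptions for the sketch.

```python
import random
import time

def with_chaos(call, latency_s=0.0, failure_rate=0.0):
    """Wrap a dependency call with injected latency and random failures,
    mimicking at the application level what Toxiproxy does on the network."""
    def chaotic(*args, **kwargs):
        time.sleep(latency_s)               # simulate a slow network link
        if random.random() < failure_rate:  # simulate a dropped connection
            raise ConnectionError("injected fault")
        return call(*args, **kwargs)
    return chaotic

# Example: verify the fallback path fires when the "database" call fails.
def fetch_balance(user_id):
    return {"user": user_id, "balance": 100}

flaky_fetch = with_chaos(fetch_balance, latency_s=0.05, failure_rate=1.0)
try:
    result = flaky_fetch("u42")
except ConnectionError:
    result = {"user": "u42", "balance": None}  # degraded fallback value
```

The point of such a drill is to confirm that every critical flow has a fallback that actually executes, before real traffic exercises it.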
The QoS plugin indicates the level of traffic and the intervention needed with the help of colours: green – smooth, yellow – some stress, red – critical. The first version was manual: anticipating traffic, we enabled the plugin at the start of a match. When we lower the QoS level, latencies drop because downstream systems gracefully degrade, returning a reduced pool of data or processing only the information critical to the customer and hiding everything the customer doesn't care about. The plugin protected us from the spike we had anticipated and turned out to be a massive learning for us in the first leg of IPL. In the second leg, we built version two of the QoS plugin, which was fully automated.
Depending on the incoming traffic numbers, we could now control the QoS level automatically rather than operating it manually. The plugin predicted traffic rates and figured out on its own when to lower the quality of service; accordingly, the backend services would disable some functionalities for the customer and re-enable them all once the QoS level moved back to green. The key learning was to anticipate: one can plan for 5x or 10x stress, but having planned for 5x and experiencing 20x can only be handled if the team has anticipated such situations.
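The automated behaviour described above might be sketched as follows, assuming simple request-rate thresholds. The threshold values and payload field names are illustrative assumptions, not CRED's actual configuration.

```python
# Illustrative thresholds: pick the highest level whose floor is exceeded.
QOS_THRESHOLDS = [(5000, "red"), (2000, "yellow"), (0, "green")]  # req/sec

def qos_level(requests_per_sec):
    """Derive the QoS colour from the observed traffic rate."""
    for floor, level in QOS_THRESHOLDS:
        if requests_per_sec >= floor:
            return level

def render_response(full_payload, level):
    """Gracefully degrade: under stress, keep only the fields the customer
    critically needs and hide the rest."""
    if level == "green":
        return full_payload
    critical_fields = {"balance", "due_date"}  # assumed critical subset
    return {k: v for k, v in full_payload.items() if k in critical_fields}

payload = {
    "balance": 1200,
    "due_date": "2021-10-05",
    "offers": ["cashback"],        # non-critical, dropped under stress
    "spend_insights": {"food": 300},
}
```

Version one of the plugin flipped the level by hand at match start; version two computed it continuously from live traffic metrics.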
AIM: How do you improve the performance of your data store?
Srinivas Iyengar: For our data stores – primarily MySQL, Elasticsearch and Redis – we analysed access patterns, top queries and resource utilisation (CPU, memory, disk/IOPS), along with data growth. These analyses helped identify the right capacity, both current and future, based on projected estimates.
The key practices that improved performance were:
- analysing slow queries, identifying optimal indexes and improving access patterns by leveraging partitions;
- leveraging replicas for heavy read flows (bulkheading reads vs writes) and effective caching strategies to reduce the read-path workload;
- creating event-sourced materialised views for some of our core flows, which reduced read amplification and fan-out to multiple downstream services (and their associated data stores).
These views led to a reduction in latencies, which helped improve performance, and also reduced the fan-out and throughput to downstream services.
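The event-sourced materialised view idea can be sketched as a consumer that folds domain events into a denormalised read model, so the read path hits one precomputed view instead of fanning out to several services. The event shapes, field names and the `RewardsSummaryView` class here are assumptions for illustration, not CRED's actual schema.

```python
from collections import defaultdict

class RewardsSummaryView:
    """Denormalised read model maintained from a stream of events
    (e.g. consumed from Kafka) and queried directly on the read path."""

    def __init__(self):
        self.summary = defaultdict(lambda: {"coins": 0, "payments": 0})

    def apply(self, event):
        """Fold one domain event into the view."""
        row = self.summary[event["user_id"]]
        if event["type"] == "payment_completed":
            row["payments"] += 1
            row["coins"] += event["coins_earned"]
        elif event["type"] == "coins_redeemed":
            row["coins"] -= event["coins"]

    def get(self, user_id):
        # Single lookup: no fan-out to payment and rewards services.
        return self.summary[user_id]

view = RewardsSummaryView()
for e in [
    {"type": "payment_completed", "user_id": "u1", "coins_earned": 500},
    {"type": "payment_completed", "user_id": "u1", "coins_earned": 250},
    {"type": "coins_redeemed", "user_id": "u1", "coins": 100},
]:
    view.apply(e)
```

The trade-off is eventual consistency on the read path in exchange for one cheap lookup instead of several downstream calls per request.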
AIM: How do you acquire talent for your team?
Srinivas Iyengar: From a talent perspective, while hiring for our diverse teams, we look for people who share our passion and vision and resonate with what we're trying to build at CRED. Curiosity, willingness to learn and openness to feedback are important traits in new team members. We encourage a shareholder mindset and don't micromanage at CRED; our work culture requires people to map their own tasks, thrive and become successful. Presently, we're looking to hire for our frontend and backend teams across all lines of business, each with its own required skill set. At CRED, we look for talent density, high performance and, most importantly, trust and value alignment. While hiring, we gauge criteria like learnability, problem-solving skills, trust (measured through one's credit score), obsession with good-quality work, a need for growth and a shareholder mindset.
CRED’s competency framework fosters progressive learning as engineers and data scientists advance their careers with increasing scale, complexity, ambiguity, and interdependency within their streams. Our focus is to enable opportunities for growth and provide team members with the experience they desire. We have soft boundaries in terms of job roles and descriptions – enabling team members to learn and gain experience in various fields. For instance, a frontend engineer can work on the backend in a project and vice versa.