As a seasoned Technical Leader with over 17 years of experience, I have consistently delivered innovative and transformative solutions for financial institutions and diverse industries, including Banking, Insurance, Payments, Healthcare, and Retail. My expertise spans Site Reliability Engineering, Non-functional Engineering, Solution Architecting, JVM Tuning, Capacity Planning, Resilience, Observability, and Cloud Adaptation. I specialize in designing and implementing scalable, resilient, and highly available distributed systems using tools such as Datadog, Dynatrace, New Relic APM, and SQL, along with deep expertise in cloud-native architectures (AWS, GCP, Azure) and Kubernetes. In my role as an SRE Architect, I focus on driving enterprise-wide reliability strategies by implementing self-healing and fault-tolerant infrastructures. Leveraging SLOs, SLIs, error budgets, and cutting-edge technologies like AIOps, chaos engineering, and predictive analytics, I ensure system performance and operational excellence. My passion lies in architecting next-generation reliability frameworks that prevent failures before they occur, aligning engineering efforts with business goals.
Additionally, I bring extensive experience managing high-performing teams, including mentoring senior engineers and fostering a culture of collaboration, learning, and proactive reliability. With over seven years of experience in backend engineering, primarily with Java and distributed systems, I have a deep understanding of asynchronous microservices and data consistency. My work with the Kafka ecosystem, including Apache Flink, has equipped me to design and scale complex systems that meet the needs of customer-first environments. As a strategic thought leader at the intersection of SRE, DevOps, and software engineering, I thrive in fast-paced environments, ensuring challenging projects are delivered with precision and impact. I am passionate about inspiring teams, advocating for innovation, and driving the future of scalable, reliable systems.
Client Work Experience
SRE Architect (Performance and Security) | High Performance Application Design
Major Retirement System of NY
2018 - Present
IT Architect - Performance and Security
Major NYC Insurance Giant
2017 - 2018
Non-Functional Testing Manager [Performance and Security] - Simplification Program
Largest UK Bank
2013 - 2017
Performance Engineer
Major Singapore Bank
2010 - 2012
Software Engineer
Persistent System Inc
2008 - 2009
Certifications
AWS Certified Cloud Practitioner
Issued by: Amazon Web Services (AWS)
Issued on: March 2021
Salesforce Certified AI Associate
Issued by: Salesforce
Issued on: April 2025
GitHub Copilot Certified
Issued by: GitHub
Issued on: May 2025
Sun Certified Java Programmer
Issued by: Sun/Oracle
Issued on: February 2007
Publications
- Why Up to 70% of SRE Initiatives Stall Before They Scale—and How to Break the Plateau
- AI-Driven Performance Testing: A New Era for Software Quality
- From Cloud to Cognitive Infrastructure: How AI Is Redefining the Next Frontier of SRE
- Bridging the Gap Between SRE and Security: A Unified Framework for Modern Reliability
Connect
Feel free to contact me at aksahthakur34@gmail.com
