Héctor Bautista Flores
SRE Lead @ FICO · Building AI-augmented operations with Claude Code · AWS · Kubernetes · Grafana
📍 Aguascalientes, Mexico · 📩 hbautista [at] gmail.com · hbautista.com · LinkedIn
Summary
Site Reliability Engineer Lead with 10+ years owning production reliability across AWS and Azure environments at enterprise scale. Experienced in driving incident management, observability pipelines, and operational automation for highly distributed, multi-tenant financial decisioning platforms.
Skilled in Kubernetes, Grafana, Splunk, and AWS — focused on eliminating toil, enforcing SLA/SLO/SLI frameworks, and evangelizing SRE culture across engineering teams.
Skills
| Domain | Technologies |
|---|---|
| Cloud & Infrastructure | AWS (EC2, EKS, CloudWatch, S3), Azure (PaaS, IaaS, DevOps Pipelines) |
| Containers & Orchestration | Kubernetes, Docker, Amazon EKS, Helm |
| Observability | Grafana, AppDynamics, Splunk, CloudWatch, Zabbix, Kibana |
| CI/CD & Automation | Jenkins, Ansible, Puppet, Git, Azure DevOps, Bash, Python, Go |
| Networking | F5 GTM/LTM (BigIP), Load Balancing, SSL/TLS management |
| ITSM & Collaboration | ServiceNow (ITIL), Jira, Confluence |
| AI-Assisted Operations | Claude Code (Anthropic) — agentic SRE workflows, runbook automation, incident triage |
| Operating Systems | RHEL, Amazon Linux, Windows Server, AIX, Debian/Ubuntu |
Experience
Site Reliability Engineer Lead · FICO · Feb 2023 – Present
Aguascalientes Metropolitan Area · Remote
Leading end-to-end production reliability for enterprise-scale financial decisioning platforms serving millions of credit and risk decisions per day — across AWS, EKS, and private cloud. Responsible for incident management, observability, automation, and SRE culture across a highly distributed, multi-tenant architecture.
3 Key Results:
- Achieved minimal-downtime Kubernetes/EKS rolling restarts across all production financial platforms, eliminating deployment-related SLA risk.
- Built full observability coverage across all 4 golden signals (saturation, traffic, latency, error rate) using Grafana, AppDynamics, CloudWatch, and Splunk — closing the blind spots that previously caused SLA breaches, cutting mean time to detect (MTTD) by ~40%.
- Introduced Claude Code (Anthropic) for agentic SRE workflows — automating runbook execution and incident triage, reducing manual toil in production response.
How I do it: Own the full SRE lifecycle, from incident management through blameless postmortems and permanent corrective actions (PCAs) to continuous SLI/SLO/SLA tuning. Enforce ITIL processes through ServiceNow for change control and problem management. Evangelize SRE best practices across the engineering org.
Stack:
AWS·EKS·Kubernetes·Docker·Grafana·AppDynamics·Splunk·CloudWatch·Ansible·Puppet·Bash·ServiceNow·Jenkins·Jira·Claude Code
Technology Specialist · Amdocs · Jun 2018 – Mar 2023
Zapopan, Jalisco, Mexico · 4 yrs 10 mos
Led middleware infrastructure operations and cloud migration for multi-client enterprise environments — coordinating cross-functional engineering teams across Mexico, USA, and India. Responsible for platform stability, deployment pipelines, and cloud transformation across multiple concurrent enterprise accounts.
3 Key Results:
- Cut operational toil by 50% with Bash automation scripts for recurring service workflows, freeing the team to focus on higher-value engineering work.
- Led end-to-end Azure cloud migration (PaaS/IaaS/SaaS) for on-prem enterprise applications — aligning stakeholders across geographies and business units to deliver on time.
- Achieved zero SSL certificate expiration incidents across all accounts by designing and owning a proactive certificate lifecycle management process spanning multiple environments.
How I worked: Managed middleware performance and drove resolution of escalated incidents for WebLogic, JBoss, WebSphere, and Tomcat environments. Led and coordinated application deployments via Azure DevOps pipelines across non-prod and prod. Mentored team members on CI/CD practices, Git workflows, and Docker/Kubernetes container fundamentals.
Stack:
Azure DevOps·WebLogic·JBoss·WebSphere·Tomcat·Docker·Kubernetes·Bash·Git·ServiceNow·Jira
Software Engineer · Softtek · Sep 2016 – Mar 2018
Aguascalientes, Mexico · 1 yr 7 mos
Middleware engineer for Java enterprise applications supporting global financial services and telecom clients — operating IBM WebSphere environments in high-availability, multi-geography production settings.
3 Key Results:
- Resolved escalated production incidents for global financial and telecom clients by performing root cause analysis and coordinating cross-geography distributed teams — preventing repeat failures.
- Operated IBM PureApplication System across enterprise-scale environments — managing WebSphere Application Server (WAS) and WebSphere Process Server (WPS) for mission-critical Java workloads.
- Delivered middleware design reviews and performance analysis in collaboration with application development teams — bridging the gap between dev and ops before DevOps was the norm.
How I worked: Supported high-stakes middleware environments for clients where downtime was not an option. Coordinated incident resolution across distributed teams spanning multiple geographies and time zones — building the cross-functional collaboration and escalation skills that would define my later SRE career.
Stack:
IBM WebSphere (WAS/WPS)·IBM PureApplication·Java·ServiceNow·Linux
Technical Lead & Backup Administrator · IBM · Sep 2013 – Sep 2016
Guadalajara / El Salto, Jalisco, Mexico · 3 yrs 1 mo
Technical Lead for middleware infrastructure, coordinating a cross-geography team across Mexico, USA, and India supporting enterprise accounts.
- Led technical best practices and onboarding for new team members
- Administered IBM Tivoli Storage Manager (TSM) and R1Soft backup solutions for on-prem and SoftLayer cloud servers
- Coordinated middleware deployments and change control across geographies
Stack:
IBM TSM·R1Soft·SoftLayer·Middleware·Linux·AIX
IT Specialist, WW DST Backups · IBM · Sep 2013 – Sep 2016
Guadalajara / El Salto, Jalisco, Mexico
Sysadmin and backup engineer for IBM's worldwide DST operations — part of a global team managing 7,000+ servers hosting hundreds of enterprise projects.
- Administered IBM AIX systems (LPAR/HMC/NIM), Linux (RHEL, SUSE), and Windows Server (2008/2012) environments on-prem and on SoftLayer cloud
- Managed IBM System z (zLinux/SUSE): disk cloning, DASD administration, LVM filesystem management
- Installed and configured IBM WebSphere (WAS/WPS), InfoSphere, and DB2 databases (v9.5–10.1) across AIX, Linux, and Windows
- Investigated and resolved OS and middleware security vulnerabilities at global account level
- Developed Bash/ksh automation scripts for server administration and operational efficiency
Stack:
AIX·RHEL·SUSE·IBM TSM·WebSphere·DB2·SoftLayer·Bash/ksh
Education
Computer Systems Engineer (I.S.C.) · Universidad Privada del Sur de México · 2007 – 2010
Higher Technical Degree in Computer Science (T.S.U.I.) · Universidad Tecnológica de la Selva · 1999 – 2001
Certifications
| Certification | Issuer | Issued – Expires |
|---|---|---|
| AWS Certified Cloud Practitioner | Amazon Web Services | Mar 2026 – Mar 2029 |
| Microsoft Certified: Azure Fundamentals | Microsoft | Nov 2020 |
| Claude Code in Action | Anthropic | May 2026 |
Training & Badges
| Badge | Issuer | Earned |
|---|---|---|
| AWS Cloud Quest: Cloud Practitioner | Amazon Web Services | Jan 2025 |
Languages
- Spanish — Native or bilingual proficiency
- English — Professional working proficiency