Scott O’Brien

Network Engineer / Production Engineer (SRE)
Working on Efficiency through Automation since 2011

Incident Response

  • Ability to manage network and software incidents of various sizes. This involves bringing the affected teams together on video bridges or incident chats, finding and planning a mitigation path during outages, and root causing. Troubleshooting often involves working with multiple parties to combine multiple datasets into timelines that clearly show cause and effect.
  • Help teams come up with recovery and prevention plans to make the infrastructure more resilient.

Network Engineering

  • Experience operating large-scale data center and edge networks running both traditional routing protocols (BGP, IS-IS) and home-grown ones (OpenR) across various network operating systems (FBOSS, Arista EOS, Junos).
  • Experience building and operating IPv6 networks, including IPv4 over IPv6 transports (RFC5549, RFC8950, RFC4884, RFC5837)
  • Ability to use big data (Hive and Presto) to data-mine loss characteristics across topologies and server/NIC generations to find common denominators of issues.

Software Development

  • Experience building software and tooling to monitor and control routers in large-scale environments.
  • Ability to adapt across programming languages and code bases for full-stack features. Most comfortable with Python, TypeScript and React. Experience with C++ and Go.
  • Collaborate across teams to plan large pipelines for managing network deployment automation.
  • Write across the full stack, from backend services to network control plane features to interactive React frontends.

Linux

  • Ansible for configuration management, Docker for deploying services and containers.
  • htop and below for historical troubleshooting.
  • cgroup for shared resource control.

I love the challenge of seeing a problem and coming up with product-centric, self-serve solutions. I enjoy operations, working with my team to ensure our alarming and processes are scalable. I’ve always strived to automate myself out of a job. I hate unanswered questions. Building and collecting metrics to deep dive into problems has always been a passion of mine.

Facebook (Meta) Production Network Engineer

July 2015 - June 2024 (9 years), Menlo Park, CA, USA

Easily the pinnacle of my career so far. Spent 9 years building and exercising my skills in Network Engineering, Software Development and Data Analytics, working across multiple teams to help build and scale Meta’s Network Infrastructure.

Network Infrastructure Engineer

  • Built alarms and remediation to scale the automated management across Facebook’s Datacenter, Backbone and Edge Networks.
  • Built the Drain and audit frameworks to ensure the safe removal and insertion of network devices in all device roles across the production network.
  • Built data pipelines to show issues with the convergence timing of our MPLS/RSVP network after fiber cut events.
  • Ran incident response for a number of issues across both network and software tools.
  • Deployed and managed multiple Terragraph instances to help operationalize and harden the product.

Network Deployment Integration

  • Partnered with edge and backbone teams to build tooling and workflows that made their turnup and migration processes more consistent and reliable.
  • Worked with partner Software Development Teams developing the workflow orchestration systems to help productionize their service. Added monitoring, found bottlenecks, and integrated with wider Facebook tooling to increase the responsiveness and reliability of their service.

Datacenter Network Engineer

  • Built tooling to integrate different teams' databases to detect physical cable faults and create the appropriate follow-up actions.
  • Designed and implemented RFC4884 and RFC5837 on FBOSS network OS to retain IPv4 traceroute functionality across newer V6-only deployments.
  • Built tooling to parse verbose FBOSS switch ASIC state logs, extracting millisecond-granular data on resource usage, convergence timing, and routing micro-loops. This reduced time to triage hard-to-root-cause incidents, aided qualification efforts, and helped identify bottlenecks to inform future network design roadmaps.
  • Helped the team take control of its on-call burden by building an action plan with partner teams to bring alarming down to acceptable levels.
  • Ran multiple datacenter-related incident investigations and mitigations.

Independent Contractor

October 2014 - June 2015 (9 months), Sydney, Australia

Contracted primarily for two companies: Cinenet Systems (since acquired by Superloop) and Rise.ph, a Philippine ISP that was just starting up.

Cinenet Systems

  • Deploying new services across a passive DWDM network across Sydney and Melbourne.
  • Troubleshooting MPLS (VLL and VPLS) issues with existing customers.

Rise

  • Build and provision backend systems (Chef, RADIUS and BIND) to support initial deployments.
  • Design BGP Communities and policies to influence traffic through network and peers.
  • Manage initial ASN and IP allocations through APNIC.

University of Wollongong Network Engineer

June 2012 - September 2014 (2 years 4 months), Wollongong, NSW, Australia

Worked primarily as a Network Engineer and Software Developer to keep the university campus and datacenter networks operating smoothly and help improve processes in the organization through the development of software. Work here involved:

  • Manage, design and implement upgrades to the multi-campus MPLS VPN core.
  • Deploy open source tools such as NetDisco and Rancid with custom scripting to improve change management processes.
  • Implement a new quota and proxy “Free Internet” deployment, using BGP community-based shaping rules on network appliances to meet the business's financial needs, along with Traffic Attribution tools that let the business understand usage and costs down to a per-subscriber breakdown.
  • Design network migration and lab changes to move to newer Palo Alto-based firewalls that allow inter-VRF routing.

University of Wollongong Academic Tutor

Assisted students with lab tasks and provided in-class assistance and demonstrations to help teach course material and develop students' programming skills. Marked exams and provided feedback. Subjects tutored were:

  • Procedural Programming (March 2009 - June 2010)
  • Interacting Systems (March - June 2010)
  • Systems Administration (March - June 2013)

UTBox Systems Engineer

January 2011 - June 2012 (1 year 6 months), Sydney, NSW, Australia

Upgraded the infrastructure so that everything from the network stack to the database and web servers was highly available and managed through configuration management, ensuring that a failure would not result in a loss of revenue. Day-to-day operations also included supporting clients and fixing bugs or developing new features in the product codebase.

University of Wollongong

Graduated with Distinction July 2009

Bachelor of Computer Science, majoring in Software Development, Multimedia and Game Development