Scott O’Brien

Network Engineer / Production Engineer (SRE)
Working on Efficiency through Automation since 2011

Incident Response

  • Bring affected teams together to find and plan a mitigation path during network and software incidents of various sizes, combining multiple datasets into timelines that clearly show cause and effect and determine the root cause.
  • Direct teams in developing prevention and recovery plans to make the infrastructure more resilient.

Network Engineering

  • Experience operating large-scale data center and edge networks, running both traditional routing protocols (BGP, IS-IS) and home-grown ones (OpenR) across various network operating systems (FBOSS, Arista EOS, Junos).
  • Experience building and operating IPv6 networks, including IPv4 over IPv6 transports (RFC5549, RFC8950, RFC4884, RFC5837).
  • Ability to use big data (Hive and Presto) to data-mine loss characteristics across topologies and server/NIC generations to find common denominators of issues.

Software Development

  • Building software and tooling to monitor and control routers in large-scale environments.
  • Ability to adapt across programming languages and codebases to deliver full-stack features. Most comfortable with Python, TypeScript and React. Experience with C++ and Go.
  • Collaborate across teams to plan large pipelines for managing network deployment automation.
  • Write across the full stack, from backend services to network control-plane features to interactive React frontends.

Linux

  • Ansible for configuration management, Docker for deploying services and containers.
  • htop and below for historical troubleshooting.
  • cgroup for shared resource control.

I love the challenge of seeing a problem and coming up with product-centric, self-serve solutions. I enjoy operations and working with my team to ensure our alarming and processes are scalable. I’ve always strived to automate myself out of a job. I hate unanswered questions. Building and collecting metrics to deep dive into problems has always been a passion of mine.

In my spare time, I’m passionate about paragliding, web development, remote control aircraft, and embedded systems. I contribute to open-source projects including Leaf, a variometer for paragliding, and an IoT chessboard, combining my interests in hardware and software.

Facebook (Meta) Production Network Engineer

July 2015 - June 2024 (9 years), Menlo Park, CA, USA

Easily the pinnacle of my career so far. Spent 9 years building and exercising my skills in Network Engineering, Software Development and Data Analytics, working across multiple teams to help build and scale Meta’s Network Infrastructure.

Network Infrastructure Engineer

  • Built alarms and remediation to scale automated management across Facebook’s Datacenter, Backbone and Edge Networks.
  • Built the Drain and audit frameworks to ensure the safe removal and insertion of network devices in all device roles across the production network.
  • Built data pipelines to show issues with the convergence timing of our MPLS/RSVP network after fiber cut events.
  • Ran incident response for issues across both network and software tools.
  • Deployed and managed multiple Terragraph instances to operationalize and harden the product.

Network Deployment Integration

  • Partnered with edge and backbone teams to build tooling and workflows that made their turn-up and migration processes more consistent and reliable.
  • Worked with partner Software Development Teams developing the workflow orchestration systems to help productionize their service: added monitoring, found bottlenecks, and integrated with wider Facebook tooling to increase the responsiveness and reliability of their service.

Datacenter Network Engineer

  • Built tooling to integrate different teams’ databases to detect physical cable faults and create the appropriate follow-up actions.
  • Designed and implemented RFC4884 and RFC5837 on the FBOSS network OS to retain IPv4 traceroute functionality across newer IPv6-only deployments.
  • Built tooling to parse verbose FBOSS switch ASIC state logs, extracting millisecond-granular data on resource usage, convergence timing, and routing micro-loops. This reduced time to triage hard-to-root-cause incidents, aided qualification efforts, and helped identify bottlenecks to inform future network design roadmaps.
  • Implemented an action plan to bring on-call alarming down to acceptable levels.
  • Ran multiple datacenter-related incident investigations and mitigations.

Independent Contractor

October 2014 - June 2015 (9 months), Sydney, Australia

Contracted primarily for two companies: Cinenet Systems (since acquired by Superloop) and Rise.ph, a new Philippine ISP starting up.

Cinenet Systems

  • Deploying new services across a passive DWDM network across Sydney and Melbourne.
  • Troubleshooting MPLS issues (including VLL) with existing customers.

Rise

  • Build and provision backend systems (Chef, RADIUS and BIND) as the infrastructure supporting initial deployments.
  • Design BGP communities and policies to influence traffic through the network and its peers.
  • Manage initial ASN and IP allocations through APNIC.

University of Wollongong Network Engineer

June 2012 - September 2014 (2 years 4 months), Wollongong, NSW, Australia

Worked primarily as a Network Engineer and Software Developer to keep the university campus and datacenter networks operating smoothly.

  • Manage, design and implement upgrades to the multi-campus MPLS VPN core.
  • Deploy open-source tools such as NetDisco and Rancid with custom scripting to improve change management processes.
  • Implement a new quota and proxy “Free Internet” deployment, applying BGP-community-based shaping rules on network appliances to meet the business’s financial requirements, along with traffic attribution tools that let the business understand usage and costs down to a per-subscriber breakdown.
  • Design the network migration, and lab the changes, to move to newer Palo Alto-based firewalls that allow inter-VRF routing.

University of Wollongong Academic Tutor

March 2009 - June 2013, Wollongong, NSW, Australia

Assisted students with lab tasks and provided class assistance and demonstrations to help teach course material in procedural programming, interacting systems, and systems administration.

UTBox Systems Engineer

January 2011 - June 2012 (1 year 6 months), Sydney, NSW, Australia

Upgraded the infrastructure so that everything, from the network stack to the database and web servers, was highly available and managed through configuration management, ensuring a failure would not result in a loss of revenue. Supported clients, fixed bugs, and developed new features in the product codebase.

University of Wollongong

Graduated with Distinction July 2009

Bachelor of Computer Science - Software Development, Multimedia and Game Development.