Executive Summary
This document is a scenario-based technical work sample for the Network Engineer 3 role at BMC: a campus network infrastructure assessment and modernization proposal for a multi-site hospital environment.
It covers network design, security segmentation, SD-WAN, wireless, monitoring, change management, and automation, with every section mapped to a direct requirement in the job description.
Scenario
| Organization | Mid-to-large hospital system, 3 campuses, ~4,500 users, 1,200 beds |
| Problem Statement | Aging Cisco 3750/6500 core infrastructure; flat network with no clinical/IoT segmentation; unreliable Silver Peak SD-WAN; no consistent NAC policy. Symptoms include VoIP call drops and slow EHR response times. DR failover has never been tested. |
| My Role | Lead Network Engineer responsible for the full assessment, design, and rollout plan |
Current State Assessment
Current Network Topology
Key Issues Found
| Issue | Impact | Risk |
|---|---|---|
| Flat network (VLAN 1 everywhere) | No clinical/IoT segmentation; lateral movement risk | CRITICAL |
| Cisco 6500 single supervisor | Core failure = full campus outage | CRITICAL |
| Legacy ASA firewall (no App-ID) | No application visibility or Layer 7 control | HIGH |
| Silver Peak SD-WAN misconfigured | No QoS policies; voice/video not prioritized | HIGH |
| No 802.1X / NAC enforcement | Any device can connect to any port | HIGH |
| Aruba APs on old firmware, no WIPS | Rogue AP risk; compliance gap | MEDIUM |
| No automated config backup | DR runbooks untested; configs undocumented | MEDIUM |
| SolarWinds polling interval too long (10 min) | Misses short-duration outages | MEDIUM |
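To put a rough number on the polling issue in the last row, here is a minimal sketch. It assumes fixed-interval polling, an outage that starts at a random point relative to the poll schedule, and detection only when a poll lands inside the outage window. The function name and parameters are illustrative and do not correspond to any SolarWinds setting.

```python
def detection_probability(outage_s: float, poll_interval_s: float) -> float:
    """Chance that at least one poll lands inside the outage window.

    Assumes evenly spaced polls and an outage start time that is uniformly
    random relative to the polling schedule.
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return min(1.0, outage_s / poll_interval_s)


if __name__ == "__main__":
    # A 2-minute outage against the current 10-minute polling interval.
    print(detection_probability(outage_s=120, poll_interval_s=600))
    # The same outage against a hypothetical 60-second interval.
    print(detection_probability(outage_s=120, poll_interval_s=60))
```

Under these assumptions, any outage shorter than the polling interval has a real chance of going unseen, which is the gap the table flags.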
Proposed Architecture
Target Topology
Network Segmentation Plan
| Zone | VLAN Range | VRF | Access Policy |
|---|---|---|---|
| Clinical Workstations | 100–149 | VRF-CLINICAL | 802.1X; ClearPass; EHR/PACS access only |
| VoIP | 200–209 | VRF-VOICE | Trusted; DSCP EF; CDP auto-VLAN |
| Medical IoT | 300–349 | VRF-IOT | MAC-Auth Bypass; no internet; tightly ACL'd |
| Server / Datacenter | 400–449 | VRF-DC | Firewall-enforced; no direct user access |
| Guest / Visitor Wi-Fi | 500 | VRF-GUEST | Internet-only; isolated; ClearPass sponsored |
| Management | 999 | VRF-MGMT | Jump host access only; OOB preferred |
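The segmentation plan above can also be kept as data, so scripts can answer "which zone owns this VLAN?" and check that ranges do not overlap. The following is a minimal sketch: the VLAN ranges and VRF names are copied from the table, while the `Zone` type and both helper functions are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class Zone(NamedTuple):
    name: str
    vlan_first: int
    vlan_last: int
    vrf: str


# Ranges and VRF names taken from the segmentation table.
ZONES = (
    Zone("Clinical Workstations", 100, 149, "VRF-CLINICAL"),
    Zone("VoIP", 200, 209, "VRF-VOICE"),
    Zone("Medical IoT", 300, 349, "VRF-IOT"),
    Zone("Server / Datacenter", 400, 449, "VRF-DC"),
    Zone("Guest / Visitor Wi-Fi", 500, 500, "VRF-GUEST"),
    Zone("Management", 999, 999, "VRF-MGMT"),
)


def zone_for_vlan(vlan_id: int, zones=ZONES) -> Optional[Zone]:
    """Return the zone that owns vlan_id, or None if the VLAN is unassigned."""
    for zone in zones:
        if zone.vlan_first <= vlan_id <= zone.vlan_last:
            return zone
    return None


def overlapping_zones(zones=ZONES):
    """Return adjacent pairs (sorted by first VLAN) whose ranges overlap.

    An empty list means no two zones share a VLAN ID.
    """
    ordered = sorted(zones, key=lambda z: z.vlan_first)
    return [
        (a.name, b.name)
        for a, b in zip(ordered, ordered[1:])
        if b.vlan_first <= a.vlan_last
    ]


if __name__ == "__main__":
    print(zone_for_vlan(120))   # falls in the clinical range
    print(zone_for_vlan(1))     # VLAN 1 is intentionally unassigned
    print(overlapping_zones())  # expected to be empty for this plan
```

A lookup like this makes it easy to flag any port still sitting in VLAN 1 during the migration, since VLAN 1 maps to no zone.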
Security Design: Palo Alto Zero Trust
Zone Architecture
Firewall Policy Approach
- Default Deny All between zones; explicit permit only.
- Use App-ID instead of port-based rules: allow the epic-ehr application, not just TCP/443.
- User-ID integration with Active Directory: policies apply to user groups, not IPs.
- Threat Prevention on all inter-zone rules; WildFire for unknown file inspection.
# Clinical-to-Datacenter Rule
Rule: Allow-EHR-Access
Source Zone: CLINICAL
Destination Zone: DC
Application: epic-ehr, ssl
Service: application-default
Source User: domain\clinical-staff
Action: Allow
Profile: Threat-Prevention-Strict
Log: Yes (forward to Panorama)
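The default-deny, explicit-permit approach can be illustrated with a small model of rule evaluation. This is a sketch of the logic only, not PAN-OS syntax or any Palo Alto API: the `Rule` class and `evaluate` function are hypothetical, and the single rule mirrors the Allow-EHR-Access example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    name: str
    src_zone: str
    dst_zone: str
    applications: frozenset
    user_groups: frozenset


# Mirrors the Allow-EHR-Access rule; a real rulebase would hold many more.
RULES = (
    Rule(
        name="Allow-EHR-Access",
        src_zone="CLINICAL",
        dst_zone="DC",
        applications=frozenset({"epic-ehr", "ssl"}),
        user_groups=frozenset({"clinical-staff"}),
    ),
)


def evaluate(src_zone, dst_zone, application, user_group, rules=RULES):
    """Return the name of the first matching allow rule, else 'default-deny'."""
    for rule in rules:
        if (
            rule.src_zone == src_zone
            and rule.dst_zone == dst_zone
            and application in rule.applications
            and user_group in rule.user_groups
        ):
            return rule.name
    # Nothing matched, so the implicit inter-zone deny applies.
    return "default-deny"


if __name__ == "__main__":
    print(evaluate("CLINICAL", "DC", "epic-ehr", "clinical-staff"))
    print(evaluate("IOT", "DC", "epic-ehr", "clinical-staff"))
```

Matching on application and user group rather than port and IP means that an IoT device presenting TCP/443 traffic still fails the zone and user checks.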
High Availability Config
set deviceconfig high-availability enabled yes
set deviceconfig high-availability group 1 mode active-passive
set deviceconfig high-availability group 1 peer-ip 10.0.99.2
set deviceconfig high-availability group 1 election-option heartbeat-backup enabled yes
set deviceconfig high-availability group 1 state-synchronization enabled yes
SD-WAN: Aruba Silver Peak
What Was Wrong
Silver Peak was running on default settings with no smart routing. VoIP calls were going over the slower broadband link even when the faster MPLS link was available, which meant jitter over 30ms and dropped calls.
Traffic Routing Policies
| Overlay | Applications | Primary Path | Failover | SLA Threshold |
|---|---|---|---|---|
| Realtime-Voice | SIP, RTP, H.323 | MPLS | LTE | Latency <20ms, Jitter <5ms, Loss <0.1% |
| Clinical-Data | Epic EHR, PACS, HL7 | MPLS | Broadband | Latency <50ms, Loss <0.5% |
| General-Business | HTTP/S, Email, DNS | Broadband | MPLS | Best-effort |
| Guest | Internet browsing | Broadband | — | Best-effort; throttled 10Mbps |
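The overlay table above amounts to SLA-driven path selection: stay on the primary link while it meets the thresholds, and move to the failover link when it does not. The sketch below models that decision using the thresholds from the table. The class and function names are hypothetical, and keeping traffic on the primary when neither link qualifies is an assumption, not documented Silver Peak behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sla:
    max_latency_ms: float
    max_jitter_ms: Optional[float] = None  # None: no threshold given in the table
    max_loss_pct: Optional[float] = None


@dataclass(frozen=True)
class LinkStats:
    latency_ms: float
    jitter_ms: float
    loss_pct: float


# Thresholds copied from the overlay table.
REALTIME_VOICE = Sla(max_latency_ms=20, max_jitter_ms=5, max_loss_pct=0.1)
CLINICAL_DATA = Sla(max_latency_ms=50, max_loss_pct=0.5)


def meets_sla(stats: LinkStats, sla: Sla) -> bool:
    """True if every threshold defined in the SLA is strictly satisfied."""
    if stats.latency_ms >= sla.max_latency_ms:
        return False
    if sla.max_jitter_ms is not None and stats.jitter_ms >= sla.max_jitter_ms:
        return False
    if sla.max_loss_pct is not None and stats.loss_pct >= sla.max_loss_pct:
        return False
    return True


def choose_path(primary, failover, measurements, sla):
    """Prefer the primary link; use failover only if primary misses the SLA.

    Assumption: if neither link meets the SLA, traffic stays on the primary.
    """
    if meets_sla(measurements[primary], sla):
        return primary
    if failover in measurements and meets_sla(measurements[failover], sla):
        return failover
    return primary


if __name__ == "__main__":
    # Hypothetical measurements: MPLS is degraded, LTE is healthy.
    stats = {
        "MPLS": LinkStats(latency_ms=35, jitter_ms=8, loss_pct=0.2),
        "LTE": LinkStats(latency_ms=15, jitter_ms=3, loss_pct=0.05),
    }
    print(choose_path("MPLS", "LTE", stats, REALTIME_VOICE))
```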
Results After the Fix
Wireless: Aruba Wi-Fi 6
RF and AP Design
- Conducted Ekahau site survey for each floor/wing before AP placement.
- Deployed Aruba AP-635 (Wi-Fi 6E, tri-radio) in high-density clinical areas; AP-515 (Wi-Fi 6) in corridors.
- Configured band steering to push capable clients to 5 GHz; airtime fairness enabled.
- Transmit power set to auto with guard rails (7 dBm min, 18 dBm max) to prevent co-channel interference.
SSID Design
| SSID | Band | Security | VLAN | NAC Policy |
|---|---|---|---|---|
| BMC-Clinical | 5 GHz preferred | WPA3-Enterprise / 802.1X | 100 | ClearPass: domain device + user cert |
| BMC-Voice | 5 GHz | WPA2-Enterprise | 200 | ClearPass: MAC-Auth for handsets |
| BMC-IoT | 2.4 / 5 GHz | WPA2-PSK (per-device) | 300 | ClearPass: MAC-Auth Bypass |
| BMC-Guest | 5 GHz | Captive Portal | 500 | ClearPass: sponsored / self-register |
Access Control Policy
Authentication:
1. EAP-TLS (device certificate) → AD computer object check
2. PEAP-MSCHAPv2 fallback (user credentials) → AD group membership check
Authorization Rules:
IF [AD-Group = "Clinical-Staff"] AND [Device-Cert = Valid]
→ VLAN 100, dACL: permit-clinical-apps
IF [AD-Group = "Contractor"] AND [Device-Cert = None]
→ VLAN 500 (Guest), redirect to IT approval portal
IF [MAC = known-IoT-device-list]
→ VLAN 300, dACL: permit-dst-only 10.40.0.0/16
Default:
→ DENY / quarantine VLAN
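The authorization rules above follow first-match semantics with a default deny. A minimal sketch of that evaluation order follows. It is plain Python rather than ClearPass policy syntax. The `Session` class, the sample MAC address, and the treatment of a missing certificate as "not valid" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Session:
    mac: str
    ad_group: Optional[str] = None
    device_cert_valid: bool = False  # a missing cert is treated as not valid


# Hypothetical stand-in for the known IoT device list (lowercase MACs).
KNOWN_IOT_MACS = frozenset({"00:11:22:33:44:55"})


def authorize(session: Session, known_iot_macs=KNOWN_IOT_MACS):
    """Return (vlan, enforcement) for the first rule that matches the session."""
    if session.ad_group == "Clinical-Staff" and session.device_cert_valid:
        return 100, "dACL: permit-clinical-apps"
    if session.ad_group == "Contractor" and not session.device_cert_valid:
        return 500, "redirect to IT approval portal"
    if session.mac.lower() in known_iot_macs:
        return 300, "dACL: permit-dst-only 10.40.0.0/16"
    # Default rule: nothing matched.
    return None, "deny / quarantine VLAN"


if __name__ == "__main__":
    print(authorize(Session("aa:bb:cc:00:00:01", "Clinical-Staff", True)))
    print(authorize(Session("00:11:22:33:44:55")))
    print(authorize(Session("de:ad:be:ef:00:01")))
```

Note that under the rules as written, a Clinical-Staff user without a valid device certificate falls through to the default deny rather than landing in VLAN 100.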
Monitoring: SolarWinds
What We Monitor
- SolarWinds NPM: All network nodes, SNMP v3 only (v1/v2c disabled for security).
- SolarWinds NTA (NetFlow): NetFlow v9 from all distribution switches; top-talker analysis per VLAN.
- SolarWinds NCM: Automated nightly config backups; compliance checking (no telnet, SSH v2 enforced).
- AirWave / Aruba Central: Wireless health dashboards; client roaming analysis; rogue AP alerting.
Alert Thresholds
| Metric | Warning | Critical | Action |
|---|---|---|---|
| Core switch CPU | 60% | 80% | Page on-call + auto-create ServiceNow incident |
| WAN circuit utilization | 70% | 90% | Capacity review ticket auto-created |
| VoIP VLAN packet loss | 0.1% | 0.5% | Immediate escalation to network on-call |
| AP client count / radio | 25 | 40 | RF team notified for load balancing |
| Firewall session table | 75% | 90% | Palo Alto TAC + change request for scale-out |
| Config drift detected | — | Any | Auto-restore from NCM and create a ServiceNow alert |
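The numeric rows of the threshold table can be expressed as a small lookup that maps a reading to a severity. This is a sketch under assumptions: the metric keys are hypothetical names, a reading equal to a threshold is treated as having crossed it, and the config drift row is left out because it has no numeric threshold.

```python
# (warning, critical) pairs copied from the alert threshold table.
# Metric key names are hypothetical, not SolarWinds identifiers.
THRESHOLDS = {
    "core_switch_cpu_pct": (60, 80),
    "wan_utilization_pct": (70, 90),
    "voip_packet_loss_pct": (0.1, 0.5),
    "ap_clients_per_radio": (25, 40),
    "firewall_session_table_pct": (75, 90),
}


def classify(metric: str, value: float) -> str:
    """Return 'ok', 'warning', or 'critical' for a single reading."""
    warning, critical = THRESHOLDS[metric]
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # 3.2% VoIP packet loss is well past the 0.5% critical threshold.
    print(classify("voip_packet_loss_pct", 3.2))
    print(classify("core_switch_cpu_pct", 65))
```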
Change Management
| Change Title | Core Network Upgrade, Replace Cisco 6500 with Nexus 9508 (Campus A) |
| Change Type | Normal (CAB approval required) |
| Risk Level | HIGH |
| Change Window | Saturday 01:00–05:00 (4-hour window) |
Pre-Change Checklist
- Nexus 9508 staged and tested in lab with production config
- All interface configs migrated and peer-reviewed by second engineer
- SolarWinds NCM config backup of 6500 captured (timestamped)
- Rollback tested in lab: old 6500 reconnected in <15 minutes
- On-call clinician IT liaison notified; Epic team on standby
- Maintenance mode set in SolarWinds (suppress alerts during window)
- Change approved by CAB; Network Lead sign-off confirmed
Implementation Steps
- Confirm no active clinical procedures on affected segment (with charge nurse)
- Enable maintenance mode in SolarWinds NPM
- Gracefully drain traffic from the 6500: shift routing to the backup path (for example, by raising the OSPF cost on its links) before disconnecting
- Disconnect 6500; patch fiber to Nexus 9508
- Bring up Nexus 9508; verify BGP/OSPF adjacencies
- Validate: ping all gateway IPs, verify EHR connectivity from test workstation
- Monitor SolarWinds for 30 min; verify no alerts
- Disable maintenance mode; then notify stakeholders once confirmed
Rollback Plan
Root Cause Analysis: VoIP Outage
| Incident | P2: Intermittent VoIP call drops, Campus B, ~200 users |
| Duration | 4 hours |
| Detection | SolarWinds NTA alert: VoIP VLAN packet loss spiked to 3.2% |
Incident Timeline
Corrective Actions
- Immediate: Enabled BPDU Guard on all access ports campus-wide via automated NCM script.
- Short-term: ClearPass policy updated so that non-802.1X ports auto-shut after 30 seconds.
- Long-term: Physical security audit of all IDF closets; keycard access required.
- Process: Added "unmanaged switch connected" to NOC Level 1 troubleshooting runbook.
Disaster Recovery, Network DR Runbook
| Scenario | Campus A Core Switch Failure |
| RTO | 30 minutes |
| RPO | Last nightly config backup (SolarWinds NCM) |
1. DETECT
- SolarWinds NPM critical alert: Core-A-NEXUS-01 unreachable
- Confirm via physical check or OOB console (MGMT network)
2. ISOLATE
- Confirm hardware (supervisor) vs software (process crash)
- Check: "show system resources" and "show module" via OOB console
3. FAILOVER (hardware failure confirmed)
- Redundant supervisor: auto-failover (<30 sec with NSF/SSO)
- Chassis failure: activate pre-staged spare Nexus 9508 in DR rack
a. Load config from NCM backup (SCP from jump host)
b. Reconnect fiber uplinks in IDF patch panel
c. Verify: "show ip ospf neighbor" / "show bgp summary"
d. Confirm EHR reachability from clinical test workstation
4. VALIDATE
- Ping all 20 critical server IPs (/scripts/validate-core.sh)
- Confirm SolarWinds shows all nodes green
- Call charge nurse on each floor to confirm clinical access
5. COMMUNICATE
- Update ServiceNow P1 incident every 15 minutes
- Notify Network Lead and IT leadership
- Post-incident RCA within 48 hours
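Step 4 of the runbook references /scripts/validate-core.sh, whose contents are not shown here. As an illustration of what such a reachability check could look like, the sketch below probes a list of hosts and reports the ones that fail. It is an assumption, not a transcription of that script: the function names are hypothetical, the ping flags assume Linux iputils, and the addresses in the usage example are placeholders.

```python
import subprocess


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one ICMP echo with the system ping (Linux iputils flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def validate_hosts(hosts, probe=ping_once):
    """Probe each host and return the list of hosts that did not respond."""
    return [host for host in hosts if not probe(host)]


if __name__ == "__main__":
    # Placeholder addresses; the real list of critical servers is not in this document.
    critical_servers = ["10.40.0.10", "10.40.0.11"]
    failed = validate_hosts(critical_servers)
    if failed:
        print("UNREACHABLE:", ", ".join(failed))
    else:
        print("All critical servers reachable")
```

Passing the probe in as a parameter lets the same function be exercised in a lab or a unit test with a fake probe, without touching production hosts.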
Automation & Scripting (Value-Add)
The job description doesn't require scripting, but it's something I do naturally. Automating tedious, error-prone tasks at scale is how I think about infrastructure operations:
Nightly Compliance Script (Python + Netmiko)
from netmiko import ConnectHandler
import re

# Ensure all devices comply: SSH v2 only, no telnet, BPDU Guard on access ports.
# Patterns are anchored to whole lines so "transport input ssh telnet" does not pass.
compliance_checks = [
    (r"^\s*transport input ssh\s*$", "Telnet disabled"),
    (r"^\s*ip ssh version 2\s*$", "SSH v2 enforced"),
    (r"^\s*spanning-tree portfast bpduguard default\s*$", "BPDU Guard global"),
]

devices = [...]  # placeholder: Netmiko device dicts loaded from IPAM / SolarWinds asset list

for device in devices:
    conn = ConnectHandler(**device)
    try:
        config = conn.send_command("show running-config")
    finally:
        conn.disconnect()  # always release the session, even if the command fails
    for pattern, description in compliance_checks:
        status = "PASS" if re.search(pattern, config, re.MULTILINE) else "FAIL"
        print(f"{device['host']} | {description}: {status}")
Why This Approach Stands Out
Healthcare Context Awareness
Every decision in this document treats network uptime as a patient safety issue, not just an IT metric. HIPAA segmentation and IoT isolation aren't afterthoughts; they're the starting point.
Full Lifecycle, Not Just One Layer
From drawing the architecture to writing the DR runbook to scripting the nightly audit, this sample covers the entire job, not just the parts that look good on a resume.
Root Cause, Not Band-Aids
The VoIP outage RCA goes beyond "we fixed it." It shows why it happened, what was done immediately, and what was changed permanently so it can't happen again.
Changes Done Right
Every change has a pre-flight checklist, a rollback plan good enough to actually use under pressure, and clear communication to clinical stakeholders before anyone touches production.
Monitoring That Helps People
Monitoring is built so the Help Desk can answer its own questions without calling the network team. That reduced the escalation load by 30% and improved first-call resolution for clinical staff.
Automate the Boring Stuff
A nightly compliance check that runs itself and raises a ticket if something's wrong. Consistent standards across every device, without anyone remembering to check.
About Me & How I Map to This Role
Praveendhra Rajkumar (he/him)
How My Background Connects
My title is DevOps Engineer, but a lot of what I've spent the last 6 years doing maps directly to this role. I've worked directly with F5 load balancers, run quarterly DR site failovers, built automated config drift detection, and handled on-call production deployments with a 99%+ on-time rate. Below is how each of those experiences connects to something specific in this job.
F5 NGINX Load Balancer Optimization
Refactored F5 NGINX routing logic based on custom request headers, improving traffic distribution efficiency and reducing latency by 10–15% for high-traffic services.
Quarterly Disaster Recovery Switchovers
Coordinated scheduled DR site switchovers quarterly, contributing to 99.9%+ uptime targets. Validated failover procedures, documented runbooks, and ensured business continuity across systems.
Config Drift Detection Framework
Built a tool that automatically compared production and DR configs and flagged anything out of sync. Cut config discrepancies by about 80% and meant no one had to do manual side-by-side reviews anymore.
Quarterly Resilience Testing
Deliberately broke things in a controlled way each quarter to find DR and failover gaps before users ever saw them. The same philosophy drives the DR runbook in Section 9.
Workflow Automation
Automated deployment pipelines and introduced AI-assisted tooling across a team of 20+ engineers, cutting manual effort by ~40% and deployment lead time from days to under a day.
On-Call and Production Ownership
I coordinate weekly production deployments with a 99%+ on-time rate and rotate on-call for critical systems. I know what it's like to get paged at 2am and have to make fast decisions.