Network Process Flows, Praveendhra Rajkumar

How SD-WAN Routes Traffic

Every packet that reaches the Silver Peak appliance is classified by application type, then sent down the best available WAN path for that application, with automatic failover (under 2 seconds for voice and clinical traffic) if that path degrades. This is what fixed the VoIP call-drop problem.

🔗 From the work sample: Aruba Silver Peak SD-WAN · VoIP MOS improved from 2.8 to 4.2

Entry point

Process / Classification

Decision / Policy Check

Primary / Success path

Failover / Degraded path

Destination reached

📡

Incoming WAN Traffic

Branch office → Silver Peak appliance

🔍

Deep Packet Inspection (DPI)

App-ID classification · layer 7 fingerprinting

⚖️

Business Intent Overlay Assignment

Which overlay does this traffic belong to?

🎙 Voice / Realtime

SIP · RTP · H.323

Realtime-Voice Overlay

Latency <20ms · Jitter <5ms · Loss <0.1%

SLA Check on MPLS

Meets latency / loss threshold?

Route via MPLS

FEC + Packet Order Correction

SLA Fail → Failover to LTE

Auto switchover <2s

🏥 Clinical Data

Epic EHR · PACS · HL7

Clinical-Data Overlay

Latency <50ms · Loss <0.5%

SLA Check on MPLS

Meets threshold?

Route via MPLS

Priority queuing for EHR

SLA Fail → Failover Broadband

Auto switchover <2s

💼 General Business

HTTP/S · Email · DNS

General-Business Overlay

Best-effort · Broadband preferred

Route via Broadband

Offloads MPLS capacity

Broadband Healthy

Traffic flows normally

Broadband Fail → MPLS Fallback

👥 Guest / Visitor

Internet Browsing

Guest Overlay

Best-effort · 10 Mbps cap

Route via Broadband

Rate limited · no MPLS access

Internet Delivered

Through captive portal VLAN

No failover, guest-only service
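The four overlay lanes above can be sketched as a small path-selection routine. This is a minimal illustration, not Silver Peak's implementation or API: the class names, the `select_path` function, and the `LinkStats` fields are all hypothetical, and only the thresholds and the primary/fallback links come from the flow itself.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Sla:
    # None means "no threshold" (best-effort overlays).
    max_latency_ms: Optional[float] = None
    max_jitter_ms: Optional[float] = None
    max_loss_pct: Optional[float] = None


@dataclass(frozen=True)
class Overlay:
    name: str
    sla: Sla
    primary: str
    fallback: Optional[str]  # None means no failover (guest traffic)


@dataclass(frozen=True)
class LinkStats:
    latency_ms: float
    jitter_ms: float
    loss_pct: float
    up: bool = True


# Thresholds and link preferences taken from the flow above.
OVERLAYS: Dict[str, Overlay] = {
    "voice": Overlay("Realtime-Voice", Sla(20, 5, 0.1), "MPLS", "LTE"),
    "clinical": Overlay("Clinical-Data", Sla(50, None, 0.5), "MPLS", "Broadband"),
    "general": Overlay("General-Business", Sla(), "Broadband", "MPLS"),
    "guest": Overlay("Guest", Sla(), "Broadband", None),
}


def meets_sla(stats: LinkStats, sla: Sla) -> bool:
    """A link passes if it is up and every configured threshold is satisfied."""
    if not stats.up:
        return False
    checks = [
        (stats.latency_ms, sla.max_latency_ms),
        (stats.jitter_ms, sla.max_jitter_ms),
        (stats.loss_pct, sla.max_loss_pct),
    ]
    return all(limit is None or value < limit for value, limit in checks)


def select_path(traffic_class: str, links: Dict[str, LinkStats]) -> Optional[str]:
    """Return the primary link if it meets the overlay SLA, else the fallback."""
    overlay = OVERLAYS[traffic_class]
    if meets_sla(links[overlay.primary], overlay.sla):
        return overlay.primary
    return overlay.fallback
```

With MPLS latency at 35 ms, for example, `select_path("voice", links)` would return `"LTE"` while `select_path("clinical", links)` would still return `"MPLS"`, because the clinical overlay tolerates up to 50 ms.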

📊

SolarWinds VoIP Quality Manager

Continuous MOS monitoring post-delivery

✅

Application SLA Met · Traffic Delivered
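To put the MOS figures in context, the sketch below converts an E-model R-factor into an estimated MOS using the standard ITU-T G.107 mapping. The source does not say how SolarWinds derives its MOS values, so treat this only as a way to read the numbers; the function name `r_to_mos` is hypothetical.

```python
def r_to_mos(r: float) -> float:
    """Map an E-model R-factor to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


if __name__ == "__main__":
    # Under this mapping, MOS 2.8 corresponds to R of roughly 54,
    # and MOS 4.2 corresponds to R of roughly 85.
    for r in (54, 85):
        print(r, round(r_to_mos(r), 2))
```

Read this way, the improvement from 2.8 to 4.2 corresponds to moving from an R-factor in the mid-50s to one in the mid-80s.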

How Devices Get onto the Network

Every device that plugs in or connects to Wi-Fi goes through this flow. ClearPass checks who you are and what device you're using, then places you in the right network segment: clinical staff, visitor, IoT device, or contractor. Nothing gets in by default.

🔗 From the work sample: Aruba ClearPass NAC · 802.1X · HIPAA Access Control

Entry

Process

Policy decision

Access granted

Access denied / quarantine

AD lookup

🔌

Client Device Connects to Port

Wired or wireless, any campus segment

📡

Port Detects Connection

CDP / LLDP profiling · device fingerprinting begins

🔏

802.1X EAP Request Sent

Does device respond to EAP?

Path A: EAP-TLS

Device presents certificate

Machine cert from internal CA

Cert Valid?

Check against PKI trust store

AD Computer Object Lookup

Is device joined to domain?

AD Group?

Clinical Staff

→ VLAN 100 · dACL: permit-clinical-apps

Path B: PEAP Fallback

No machine cert

PEAP-MSCHAPv2 user credentials

AD User Auth

Username + password validated

AD Group?

Contractor

→ VLAN 500 Guest · redirect to IT portal

Path C: MAC Auth Bypass

No 802.1X response

IoT / headless device

MAC in known IoT list?

ClearPass device database

Known IoT Device

→ VLAN 300 · dACL: healthcare servers only

Unknown Device

→ DENY · Quarantine VLAN · alert fired

📝

ClearPass Logs Access Event

Username · device · VLAN · timestamp · policy matched

🛡️

Device on correct VLAN · Least-privilege enforced
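The three authentication paths can be condensed into one decision function. This is a rough sketch of the policy logic only, not ClearPass configuration or its API: the `Client` fields, the `authorize` function, and the example MAC address are hypothetical. The flow shows only the Clinical Staff and Contractor outcomes, so every other case is assumed here to fall through to quarantine, in line with "nothing gets in by default".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Client:
    mac: str
    responds_to_eap: bool = False
    cert_valid: bool = False          # machine cert validates against the PKI trust store
    domain_joined: bool = False       # AD computer object exists
    user_authenticated: bool = False  # PEAP-MSCHAPv2 credentials validated in AD
    ad_group: Optional[str] = None


@dataclass(frozen=True)
class Decision:
    vlan: str
    policy: str


# Hypothetical stand-in for the ClearPass device database.
KNOWN_IOT_MACS = {"00:11:22:33:44:55"}

QUARANTINE = Decision("quarantine", "deny, alert fired")


def authorize(client: Client) -> Decision:
    if client.responds_to_eap:
        # Path A: EAP-TLS with a valid machine certificate on a domain-joined device.
        if client.cert_valid and client.domain_joined and client.ad_group == "Clinical Staff":
            return Decision("100", "dACL: permit-clinical-apps")
        # Path B: PEAP fallback with AD user credentials.
        if client.user_authenticated and client.ad_group == "Contractor":
            return Decision("500", "redirect to IT portal")
        return QUARANTINE  # assumption: unmatched 802.1X clients are denied
    # Path C: MAC Auth Bypass for devices that never answer the EAP request.
    if client.mac.lower() in KNOWN_IOT_MACS:
        return Decision("300", "dACL: healthcare servers only")
    return QUARANTINE
```

Ordering the checks from strongest credential (certificate) to weakest (MAC address) mirrors the flow: a device only reaches MAC Auth Bypass when it cannot speak 802.1X at all.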

How a Network Change Gets Made

No one just logs in and makes changes to production. Every change has a risk review, a checklist, a rollback plan, and approval gates before anyone touches anything, and a post-review afterward. In a hospital, this process protects patients.

🔗 From the work sample: ServiceNow CHG · CAB approval · Reflects Bright Horizons production deployment discipline

Start

Process step

Decision / gate

Success path

Rollback path

📥

Change Request Submitted

ServiceNow · engineer creates CHG record

🔎

Risk Assessment

Engineer documents: scope, impact, rollback plan, test plan

⚖️

Risk Classification?

Low / Normal / High

Low Risk

Standard Change

Pre-approved template · no CAB needed

Normal / High Risk

CAB Review

Change Advisory Board approval required

CAB Approved?

Rejected → Rework

Update plan, resubmit

✅

Pre-Change Checklist (7 items)

Lab test · config backup (NCM) · rollback verified · stakeholders notified · maintenance mode set

🕐

Change Window Opens

Saturday 01:00–05:00 · clinical systems on standby

⚙️

Step-by-Step Implementation

Follow documented steps · update ServiceNow in real-time

🧪

Validation Tests

Ping gateways · EHR connectivity · SolarWinds all-green?

✅ Validation Pass

Maintenance Mode Off

SolarWinds alerts re-enabled

Stakeholders Notified

Email + ServiceNow update

CHG Record Closed

❌ Validation Fail (within SLA)

Rollback Initiated

Reconnect old hardware · restore config

P1 Incident Created

ServiceNow P1 · notify on-call

CAB Post-Mortem

Within 48 hours · updated risk plan

📊

Post-Change Review (1 week)

SolarWinds trends · any unintended impact?

📁

Change Knowledge Base Updated · Lessons Learned Captured
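The approval gate and the two validation outcomes can be expressed as a pair of small functions. This is only a sketch of the flow's branching, not a ServiceNow integration; the `Risk` enum and both function names are hypothetical.

```python
from enum import Enum
from typing import List, Optional


class Risk(Enum):
    LOW = "low"
    NORMAL = "normal"
    HIGH = "high"


def next_step(risk: Risk, cab_approved: Optional[bool] = None) -> str:
    """Return the step that follows risk classification.

    Low-risk changes use a pre-approved template and skip CAB;
    normal and high-risk changes wait for a CAB decision.
    """
    if risk is Risk.LOW:
        return "pre-change checklist"
    if cab_approved is None:
        return "CAB review"
    return "pre-change checklist" if cab_approved else "rework and resubmit"


def after_validation(passed: bool) -> List[str]:
    """Return the ordered steps that follow the validation tests."""
    if passed:
        return ["maintenance mode off", "stakeholders notified", "CHG record closed"]
    return ["rollback initiated", "P1 incident created", "CAB post-mortem"]
```

For example, `next_step(Risk.HIGH, cab_approved=False)` returns `"rework and resubmit"`, and `after_validation(False)` begins with the rollback before any incident paperwork.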

How an Incident Gets Resolved for Good

From the first alert to the last corrective action. This is the real BPDU storm incident from the work sample, showing how the network team detected it, found the root cause, applied an immediate fix, and then put controls in place to prevent a recurrence.

🔗 From the work sample: RCA section · SolarWinds NTA alert · BPDU Guard remediation

Trigger

Response action

Classification / decision

Recovery / success

Escalation

Documentation / post-incident

🚨

Trigger: Incident Detected

SolarWinds threshold breach · user report · NOC alert

📋

Real Example: BPDU Storm

09:15 · SolarWinds NTA: VLAN 200 packet loss 3.2% ← VoIP call drops begin

🏷️

Severity Classification

P1 (all-hands) / P2 (on-call) / P3-P4 (queue)

P1 / P2: Immediate

On-Call Engineer Paged

ServiceNow alert · 5 min response SLA

Leadership Notified (P1)

Network Lead + IT Director

P3 / P4: Standard

Team Queue

Next business day · standard ticket

🔬

Problem Isolation

SolarWinds · NetFlow top-talker · CLI (show interface / show log) · Wireshark if needed

🔦

Root Cause Identified

09:35 · Unmanaged switch from facilities created STP loop. BPDU Guard not enforced on port. Flooded VLAN 200.

🔧

Immediate Fix Applied

09:50 · Port shut · BPDU Guard triggered · STP stabilized

✅

Validate: Service Restored?

Packet loss < 0.05% · VoIP calls working · SolarWinds green

Restored ✓

Incident Closed in ServiceNow

10:00 · resolution documented

Not Restored

Escalate / Alternate Fix

Engage TAC · bridge call · DR if needed

📄

RCA Document Written (P1/P2)

5-Why analysis · timeline · contributing factors · impact

🛠️

Corrective Actions (4-tier)

Immediate (same day) · Short-term (1 week) · Long-term (quarter) · Process update

📢

Real Example: Corrective Actions

BPDU Guard enabled globally via NCM script · ClearPass port-auto-shut policy · IDF physical security audit · NOC runbook updated

🛡️

Monitoring Updated · Runbook Updated · Problem Cannot Recur
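The timestamps in the BPDU storm example give the elapsed time at each stage. The snippet below works through that arithmetic; the timestamps are from the flow, while the stage labels and function names are illustrative.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Timestamps from the BPDU storm example above.
TIMELINE: List[Tuple[str, str]] = [
    ("detected", "09:15"),
    ("root cause identified", "09:35"),
    ("fix applied", "09:50"),
    ("incident closed", "10:00"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_from_detection(timeline: List[Tuple[str, str]]) -> Dict[str, int]:
    """Minutes from the first event to each later event."""
    t0 = timeline[0][1]
    return {label: minutes_between(t0, t) for label, t in timeline[1:]}


if __name__ == "__main__":
    print(elapsed_from_detection(TIMELINE))
    # {'root cause identified': 20, 'fix applied': 35, 'incident closed': 45}
```

So the root cause was found 20 minutes after detection, the fix landed at 35 minutes, and the incident was closed at 45 minutes.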

What Happens When the Core Switch Goes Down

This flow covers the exact steps from the moment SolarWinds fires a critical alert to the moment clinical systems are confirmed working and the P1 is closed. No guessing, no "we'll figure it out": a tested runbook ready to execute at 2am.

🔗 From the work sample: DR Runbook section · Zoho quarterly DR switchovers (99.9% uptime) · Bright Horizons resilience testing

Alert / trigger

Runbook action

Diagnosis decision

Recovery path

Communication / documentation

🚨

SolarWinds NPM: CRITICAL

Core-A-NEXUS-01 unreachable · multiple dependent nodes down

📟

On-Call Engineer Paged

ServiceNow P1 incident auto-created · 5-minute response target

👁️

DETECT: Physical + OOB Check

Console cable via OOB MGMT network · check LEDs · "show system resources" · "show version"

🔎

ISOLATE: Failure Type?

Hardware failure vs Software/process crash

Software / Process Crash

Identify Failed Process

"show processes cpu sort" · syslog check

Restart Service / Reload Module

In-service restart if supported

Restore from Backup Image

SolarWinds NCM last-known-good config

Hardware Failure

Redundant Supervisor?

Is NSF/SSO configured?

Yes: Auto-Failover

NSF/SSO Kicks In

<30 sec · zero downtime · log confirms sup switchover

No: Chassis Failure

Stage DR Spare

Pre-racked Nexus 9508 in DR rack

Load NCM Config

SCP from jump host · last nightly backup

Reconnect Fiber Uplinks

IDF patch panel · label map in Visio

🔁

Verify Routing Adjacencies

"show ip ospf neighbor" · "show bgp summary" · all neighbors should re-establish within 5 min

🧪

VALIDATE

/scripts/validate-core.sh: ping all 20 critical server IPs · SolarWinds all-green?

📞

Confirm Clinical Access

Call charge nurses on each floor · "Is Epic working?"

📣

COMMUNICATE: Every 15 min

Update ServiceNow P1 · notify Network Lead + IT Leadership · estimated restore time

✅

Service Restored

RTO met (<30 min target) · P1 closed in ServiceNow

📄

RCA & Post-Incident Review

Within 48 hours · how to prevent recurrence · runbook updated

🏁

Disaster Recovery Complete · Uptime Preserved
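The VALIDATE step and the RTO check can be sketched in Python as follows. This is not the contents of /scripts/validate-core.sh, which the source does not show. The reachability probe is passed in as a function so the logic can be exercised without a network; the function names and the 192.0.2.x addresses (a documentation range) are placeholders.

```python
from typing import Callable, Dict, List

RTO_TARGET_MIN = 30  # from the flow: RTO met means under 30 minutes


def validate_core(targets: List[str], probe: Callable[[str], bool]) -> Dict[str, object]:
    """Probe every critical IP and summarize which ones failed."""
    failed = [ip for ip in targets if not probe(ip)]
    return {"checked": len(targets), "failed": failed, "passed": not failed}


def rto_met(minutes_elapsed: float) -> bool:
    """True if service was restored inside the RTO target."""
    return minutes_elapsed < RTO_TARGET_MIN


if __name__ == "__main__":
    # Placeholder list of 20 critical servers and a stub probe that
    # pretends a single host is still unreachable.
    targets = [f"192.0.2.{i}" for i in range(1, 21)]
    result = validate_core(targets, probe=lambda ip: ip != "192.0.2.7")
    print(result["passed"], result["failed"])
```

In a real run the probe would wrap an ICMP ping or a TCP connect to each server. Returning the list of failed addresses, not just a pass/fail flag, tells the on-call engineer which uplink or VLAN to check next.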