Data Center and Cloud Cheat Sheet

Flow table management

TCAM

 
Compares data against a predefined ruleset in one operation
Returns the action or address of the first match
Rules consist of 1s, 0s, and Xs (don't-care bits)
OF table entries contain Match, Action, Counter, Priority, Timeout
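
A minimal software sketch of the ternary 1/0/X matching a TCAM performs in hardware (rules, key and actions below are made-up examples):

# Sketch of TCAM-style ternary matching (1/0/X) in software.
# Rules are (pattern, action) pairs listed in priority order; 'X' is a wildcard bit.
def matches(pattern, key):
    return all(p == 'X' or p == k for p, k in zip(pattern, key))

def tcam_lookup(rules, key):
    # Hardware checks all rules in parallel; here we emulate
    # "first match wins" by scanning in priority order.
    for pattern, action in rules:
        if matches(pattern, key):
            return action
    return "default"

rules = [("10XX", "drop"), ("1XXX", "fwd_port1"), ("XXXX", "to_controller")]
print(tcam_lookup(rules, "1011"))  # -> "drop" (first, i.e. highest-priority, match)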

Geometric Representation of Rules

 
Rules can be specified by prefix/length pairs or operator/number (range)
A rule with d fields -> a d-dimensional hyper-rectangle
Matching means finding the highest-priority hyper-rectangle enclosing point P
When rectangles overlap, the smallest rectangle "wins"
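
The match condition can be sketched directly from this geometric view; ranges and priorities below are illustrative only:

# Each rule: (priority, [(lo, hi), ...]) with one range per field,
# i.e. a d-dimensional hyper-rectangle. Highest priority enclosing P wins;
# with "smallest rectangle wins", more specific rules get higher priority.
def encloses(rect, point):
    return all(lo <= x <= hi for (lo, hi), x in zip(rect, point))

def classify(rules, point):
    best = None
    for prio, rect in rules:
        if encloses(rect, point) and (best is None or prio > best):
            best = prio
    return best

rules = [(2, [(0, 7), (5, 6)]),    # more specific rule, higher priority
         (1, [(6, 7), (0, 15)])]
print(classify(rules, (6, 5)))     # -> 2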

Edge Network Mgmt

 
Mgmt is 80% of IT budget and responsible for 62% of outages
Networks should be truly transparent
Challenges:
-Large network scalability
-Flexible policies: custom routing, measurement and diagnosis, access control
-Commodity switches: small memory, expensive and power hungry, higher link speeds, storing lots of state, monitoring flows, QoS
DC networks:
-VM migration, load balancing, task scheduling, anomaly detection/isolation

DIFANE

 
DIFANE Design goals:
-Scale with network growth
-Improve per-packet performance: always keep packets in the data plane
-Minimal switch modification: no change to data-plane hardware
DIFANE stages:
-Controller proactively generates rules and distributes them to authority switches
-Controller proactively partitions the rule space into wildcard regions and distributes the partition to all switches
-Ingress switches receive unknown flows and contact the authority switch responsible for the matching wildcard region

DIFANE 2

 
-Authority switch forwards the packet to the correct destination and caches the corresponding rule in the ingress switch for future packets
Caching wildcard rules:
-Controller creates new rules for lower-priority rules that overlap with higher-priority rules (e.g. R1 covers 0-7, 5-6 and R3 covers 6-7, 0-15; they overlap in 6-7, 5-6, so the controller creates R3 rules for 6-7, 0-4 and 6-7, 7-15; see sketch below)
-Rules must be partitioned correctly by the controller to ensure optimal TCAM usage; some cuts are better than others
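
The splitting step is just range subtraction on the overlapping field; a rough one-dimensional sketch using the ranges from the example above:

# Split a low-priority rule's range so it no longer overlaps a
# higher-priority rule's range (one field shown).
def split_range(low_rule, high_rule):
    lo, hi = low_rule
    hlo, hhi = high_rule
    pieces = []
    if lo < hlo:
        pieces.append((lo, hlo - 1))   # part below the overlap
    if hi > hhi:
        pieces.append((hhi + 1, hi))   # part above the overlap
    return pieces

# R3 spans 0-15 on the overlapping field, R1 spans 5-6 -> R3 is split.
print(split_range((0, 15), (5, 6)))    # -> [(0, 4), (7, 15)]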

DIFANE 3

 
Network dynamics:
-Policy change at controller: timeout cache rules, change authority rules, no change to partitions
-Topology change at switch: no change to cache rules, no change to authority rules, change partition rules
-Host mobility: timeout cache rules, no change to authority or partition rules

Caching in Buckets

 
Partition the rule space into a grid of buckets
-Larger buckets mean more rules are cached each time
-Smaller buckets mean more buckets need to be cached
-Partition until the number of associated rules per bucket is bounded (see sketch below)
-The sweet spot for bucket size is a region; going smaller or larger leads to memory overflow
-CAB reduces control-network BW, flow-setup latency and controller load
-Fully compatible with the OF standard; resolves dependencies in wildcard rule caching
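
A rough sketch of "partition until the number of associated rules is bounded" as a recursive split of a 2-D rule space (threshold and geometry are illustrative, not the CAB algorithm verbatim):

# Recursively split the rule space into buckets until each bucket is
# associated with at most MAX_RULES rules.
MAX_RULES = 4

def overlaps(rule, bucket):
    (rx1, rx2, ry1, ry2), (bx1, bx2, by1, by2) = rule, bucket
    return rx1 <= bx2 and bx1 <= rx2 and ry1 <= by2 and by1 <= ry2

def partition(bucket, rules):
    hit = [r for r in rules if overlaps(r, bucket)]
    x1, x2, y1, y2 = bucket
    # Stop when few enough rules remain or the bucket cannot be split further.
    if len(hit) <= MAX_RULES or (x2 - x1 <= 1 and y2 - y1 <= 1):
        return [(bucket, hit)]
    xm, ym = (x1 + x2) // 2, (y1 + y2) // 2
    quads = [(x1, xm, y1, ym), (xm + 1, x2, y1, ym),
             (x1, xm, ym + 1, y2), (xm + 1, x2, ym + 1, y2)]
    out = []
    for q in quads:
        out.extend(partition(q, hit))
    return out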
 

Cloud Security

 
Typical practices:
-Reinforce application security, strong network perimeter security
-Access control inside the cloud for app/service/tenant isolation
-Gauge risk control when using public cloud
Problems:
-Placing new security hardware is not easy
-Security devices are typically shared; a misconfiguration in one compromises many services, apps and hosts
-Tight coupling between network and security teams: high cost and low efficiency

Policy Aware Switch

 
-Makes forwarding decisions based on various factors, such as previous hop, input port, source/dest address

Cloud NaaS

 
Features:
-Virtual network isolation
-Custom addressing
-Service differentiation
-Flexible middlebox interposition
Cloud controller: provides VM instance management, self-service provisioning, host virtual switch interconnection
Network controller: provides VM placement directives to the cloud controller, generates the virtual network between VMs, configures physical and virtual switches

Hybrid Security Architecture

 
-Tenants everywhere -> middleboxes anywhere
-Flexible traversal: traffic-specific, by middlebox type, arbitrary number and order
-Decouple networking from security by creating an appliance layer
App layer: app VMs with security groups
Appliance layer: traversal path of middleboxes
Network layer: only cares about packet delivery
-Forwarding: MAC rewrite for L2, IP-in-IP for L3

HSA Benefits

 
-Scalable and flexible provisioning
-Facilitates virtualization; simplifies service development, testing, deployment and troubleshooting
-Enables dynamic and heterogeneous service provisioning
-Minimizes misconfiguration impact

SDN Security

 
Bottlenecks: weak OF agent CPU, limited message processing capabilities, limited TCAM/SRAM resources -> table overflows
Solution: leverage NFV to build a software-based defense line
NFV in edge clouds:
-Elastic resource allocation
-Network function as a service
-Rapid innovation

SDN Shield

 
-Controller monitors the packet-in message rate from each switch
-When one switch's rate approaches saturation: apply countermeasures
-Use a second Attack Mitigation Unit
How to identify legitimate flows:
-Use statistical filtering
Conditional Legitimate Probability:
-Analyze header field distributions
-Compare the most recent measurement to a reference profile
-Build a scoreboard to calculate a new flow's legitimacy probability
-Threshold controls the rate of passed flows

PacketScore

 
1: Detect attack -> monitor key parameters of traffic destined to protected targets; contain by limiting resource consumption
2: Differentiate attacking packets from legitimate ones in suspicious traffic: compare against a baseline and use CLP to compute the likelihood that each suspicious packet is legitimate
3: Selectively discard suspicious packets by comparing CLP against a dynamic threshold

Attack types

 
Endpoint: overload a victim or stub network -> easily isolated by upstream routers, since attacking packets carry the victim IP/subnet
-Monitor traffic rate and flow rate towards each host/stub -> large number of targets to monitor
-Use a Bloom filter to catch targets under attack; use a DDoS control server to aggregate and correlate
Infrastructure: overload some choke point (e.g. a router uplink) -> hard to isolate unless packet-traceback infrastructure is in place
-Monitor traffic parameters on router links
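
A minimal Bloom-filter sketch of "catch targets under attack" (filter size, hash count and addresses are arbitrary):

import hashlib

# Insert destination IPs flagged by the rate monitor, then test
# membership cheaply per packet. False positives possible, no false negatives.
class Bloom:
    def __init__(self, m=1 << 16, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m // 8)

    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h // 8] |= 1 << (h % 8)

    def __contains__(self, item):
        return all(self.bits[h // 8] & (1 << (h % 8)) for h in self._hashes(item))

suspects = Bloom()
suspects.add("10.0.0.5")        # flagged as a target under attack
print("10.0.0.5" in suspects)   # True
print("10.0.0.6" in suspects)   # almost certainly False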

CLP 1

If packet attributes are independent, the joint probability mass function factorizes: P(A=a, B=b, ...) = P(A=a) * P(B=b) * ...
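
A toy CLP-style score built on that factorization, comparing a reference (baseline) profile with the current measurement attribute by attribute (profiles, fields and threshold below are made up):

# Independence lets per-attribute ratios be multiplied into one score.
baseline = {"ttl":   {64: 0.7, 128: 0.3},
            "proto": {"tcp": 0.9, "udp": 0.1}}
current  = {"ttl":   {64: 0.2, 128: 0.8},
            "proto": {"tcp": 0.3, "udp": 0.7}}

def clp_score(pkt):
    score = 1.0
    for field, value in pkt.items():
        ref = baseline[field].get(value, 1e-6)
        cur = current[field].get(value, 1e-6)
        score *= ref / cur      # values over-represented in the attack score low
    return score

THRESHOLD = 1.0                 # dynamic in PacketScore; fixed here for brevity
pkt = {"ttl": 128, "proto": "udp"}
print("pass" if clp_score(pkt) >= THRESHOLD else "drop")   # -> drop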
 

IRB

Asymmetric IRB:
-Ingress VTEP does L2 and L3 lookup; egress VTEP does only L2 lookup
-All VTEPs need all VNIs
-Ingress VTEP routes from the source VNI to the destination VNI
-Not scalable
Symmetric IRB:
-Both VTEPs perform L2 and L3 lookup
-Inter-VXLAN traffic is encapsulated in an L3 VNI, which identifies the VRF
-Ingress VTEP does not need to know the destination VNI
-Scalable

VXLAN with MPBGP

 
-Improves scalability
-Enables control-plane learning of L2 end hosts and L3 reachability
-Reduced network flooding
-Optimal east-west and north-south forwarding
-VTEP discovery and authentication

MP BGP VXLAN

 
L2 traffic cannot traverse VNI boundaries
L3 traffic from one VRF is mapped to a L3 VNI
L3 traffic from different VRFs cannot traverse L3 VNI boundaries
A BGP update carries the Host MAC, Host IP, L3 VNI and VTEP
Remote VTEPs put the Host MAC into the MAC table and the Host IP into the VRF (L3 VNI) IP table
Local host information is learned through conventional L2 learning and GARP, or through mgmt-plane integration between VTEP and hosts
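
A rough sketch of what a remote VTEP does with a Type 2 update, per the description above (data structures and field names are illustrative, not a vendor API):

# MAC goes into the L2 MAC table, IP into the VRF (L3 VNI) IP table,
# both pointing at the advertising VTEP.
mac_table = {}   # (l2_vni, mac) -> remote VTEP IP
vrf_table = {}   # (l3_vni, host_ip) -> remote VTEP IP

def on_type2_update(host_mac, host_ip, l2_vni, l3_vni, vtep_ip):
    mac_table[(l2_vni, host_mac)] = vtep_ip
    if host_ip is not None:                     # MAC-IP route
        vrf_table[(l3_vni, host_ip)] = vtep_ip

on_type2_update("00:11:22:33:44:55", "10.1.1.10", 10100, 50001, "192.0.2.1")
print(mac_table, vrf_table)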

VXLAN BGP EVPN

 
Asymmetric IRB: different path from source to dest and back; each VTEP must be configured with both source and dest VNIs for both L2 and L3
Symmetric IRB: same path to destination and back; the ingress VTEP routes from the source VNI to the L3 VNI and changes the inner dest MAC to the egress VTEP router MAC
Route types:
Type 2: MAC advertisement -> L2 VNI MAC/MAC-IP -> MAC and ARP resolution
Type 5: IP prefix route -> L3 VNI route -> advertise prefix

VXLAN EVPN

 
1 L3 VNI per VRF per VTEP
1 L2 VNI per L2 segment, multiple L2 VNIs per tenant
BGP minimizes network flooding and allows VTEP peer discovery and authentication
All VTEPs keep the same IP address for L2 VNIs
Process:
-Hosts send out a GARP when they come online
-Local VTEP creates a local ARP cache entry and advertises it through BGP as a Route Type 2
-Remote VTEPs put the IP-MAC info into their remote ARP cache and suppress ARP for this IP
-A VTEP floods if no match is found in the cache

VXLAN BGP-EVPN

 
VXLAN Overlay is an L2 broadcast domain identified by a VNI
VXLAN encap:
-Outer header -> IP source and dest from VTEP endpoints, L2 source from VTEP source, L2 dest from next L3 hop, UDP port dest 4789
Gateway types:
L2-> VLAN to VXLAN bridging
L3-> VXLAN to VXLAN routing

Arista VXLAN 2

 
VTEP: Tunnel endpoint
VXLAN GW: bridges VXLAN to non-VXLAN environments (HW or SW)
VNI: identifies VXLANs
VTI: terminates a VTEP
VXLAN segment: L2 overlay network over which VMs communicate; only VMs within the same VXLAN segment can communicate
OVSDB: Allows management of Open vSwitches, create or delete ports, tunnels, and queues

ARISTA VXLAN

 
Challenges: oversubscription, scalability, cost, mobility and latency
Network virtualization: create overlay networks on top of the physical network infrastructure
VXLAN 24-bit ID -> 16M networks
-Can cross L3, 50 bytes of overhead
-VMs don't see the tag
-L2 broadcast is replaced by IP multicast
Benefits:
-Avoids VLAN sprawl
-Avoids single fault domains
-Scalability beyond 4096 segments
-Non-proprietary fabric
-IP mobility
-Physical cluster size and locality improve
-Better multitenancy

VXLAN

 
-Network virtualization technology to address scalability problems in large cloud deployments
-VLAN-like encapsulation: encapsulates L2 frames in UDP packets on port 4789, tagged with a VNI
-Endpoints are called VTEPs and may be virtual switches, hypervisors or NVGRE endpoints
-The overlay network is usually a multicast cloud
-NVGRE uses GRE to encapsulate L2 frames in L3 packets across L3 networks

VXLAN Flood and Learn

 
VNI is mapped to a multicast group on a VTEP
Broadcast, Unknown Unicast and Multicast traffic is flooded to the multicast group of the VNI
Remote VTEPs of the group learn host MAC, VNI and source VTEP IP from flooded multicast traffic
Unicast packets for the host are sent directly to the source VTEP IP
Encapsulated packet:
Outer: UDP dest 4789, IP dest: remote VTEP/multicast group, IP src: source VTEP, MAC dest: remote VTEP/multicast MAC; Inner: IP dest: remote host, IP src: source host, MAC dest: remote host/broadcast, MAC src: source host
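
The VXLAN part of that encapsulation is an 8-byte header inside the UDP payload; a minimal packing sketch (outer Ethernet/IP/UDP headers are left to the underlay here):

import struct

VXLAN_PORT = 4789

def vxlan_header(vni):
    # 8-byte VXLAN header: flags byte with the I bit set (VNI valid),
    # then the 24-bit VNI followed by a reserved byte.
    flags = 0x08 << 24
    return struct.pack("!II", flags, vni << 8)

print(vxlan_header(10100).hex())   # -> '0800000000277400' (VNI 10100 = 0x002774)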
 

CARPO

 
Considers BW demand variation over time
Elastic Tree may overestimate demand, wasting power (uses average or peak, while real demand is less)
-Uses flow correlation (on 90th-percentile data) to consolidate flows with low correlation at their non-peak rates (low probability of peaking together)
-Minimizes total power within a consolidation period based on traffic correlation and non-peak data rates
-Link rate adaptation for remaining links
Result: lowest power consumption and most savings, with minor delay and drop degradation
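
The consolidation test itself boils down to pairwise correlation of per-flow rate samples; a small sketch with made-up numbers (requires Python 3.10+ for statistics.correlation):

from statistics import correlation

# Rate samples per flow over one consolidation period.
flows = {"f1": [10, 80, 15, 70, 12],
         "f2": [75, 12, 70, 10, 65],
         "f3": [11, 78, 14, 72, 13]}

CORR_LIMIT = 0.3   # only weakly/negatively correlated flows share a link

def can_consolidate(a, b):
    # Low correlation -> low probability of peaking together, so their
    # non-peak (e.g. 90th-percentile) rates can be summed on one link.
    return correlation(flows[a], flows[b]) < CORR_LIMIT

print(can_consolidate("f1", "f2"))  # True: they peak at different times
print(can_consolidate("f1", "f3"))  # False: they peak together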

Elastic Tree

 
Power Knobs: vary link speed, disable links, disable switches, move workload
Goal:
-Turn off unneeded links and switches
-Create an energy-proportional DC network
Optimizer:
-Takes topology, routing restrictions, power models, traffic matrix
-Produces a network subset and flow routes
Models:
-Formal: best quality, any topology, not scalable, input: traffic matrix
-Greedy: good quality, any topology, scalable, input: traffic matrix
-Topo-aware: OK quality, structured topologies, best scalability, input: port counters

Green DC 2

 
Intra-DC: dispatch loads to a minimal set of servers and to cooler areas
Inter-DC: dispatch loads to DCs with lower energy cost or with renewable energy
JEC (joint inter and intra):
-Considers the variation of electricity prices and the effect of workload distribution on the efficiency of cooling systems
Performance: Random LB < Electricity-aware InterDC < Cooling-aware IntraDC < EIR+CIA < JEC

Green DC

 
Minimize energy consumed by servers and cooling
-70-80% of total
-Consolidate workload onto a minimal set of servers and turn off unnecessary ones
-Consolidate workload by location to maximize cooling efficiency
Minimize energy consumed by the DC network (switches)
-10-20% of total
-Consolidate traffic onto a minimal set of paths and turn off switches/links

Baraat features

 
Keep 3 counters per task: total demand, total bytes reserved so far, number of flows in the task
Also a single aggregate counter per link to track BW allocations
Features:
-Schedule tasks, not flows
-FIFO-LM algorithm
-No need to know flow size
-New transport protocol
-Modifies switches and hosts
-Does not meet deadlines
-Reduces task completion time for partition-aggregate workflows compared to fair share

Traffic Scheduling - Baraat

 
-Flow scheduling -> inefficient
-Priority scheduling -> does not meet deadlines
Idea: task-aware scheduling
-Schedule tasks in smart priority classes
-Switch maps flows to classes and handles heavy tasks
-Flows mapped to a higher-priority class get preference
-Flows in the same priority class fair-share
-Task ID is used as the priority (FIFO)
-Heavy tasks are identified on the fly by byte count; upon exceeding the threshold, the task and the immediately subsequent task are assigned the same priority (see sketch below)
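
A rough sketch of that FIFO-LM-style assignment: priority follows task arrival order, and a task that crosses the byte threshold shares its class with the next task (threshold and structure are illustrative):

HEAVY_THRESHOLD = 10 * 1024 * 1024   # bytes; illustrative value

class TaskScheduler:
    def __init__(self):
        self.next_prio = 0
        self.prio = {}          # task_id -> priority class
        self.bytes = {}         # task_id -> bytes seen so far
        self.last_task = None

    def new_task(self, task_id):
        last = self.last_task
        if last is not None and self.bytes.get(last, 0) > HEAVY_THRESHOLD:
            self.prio[task_id] = self.prio[last]   # share class with the heavy task
        else:
            self.prio[task_id] = self.next_prio
            self.next_prio += 1
        self.last_task = task_id

    def on_packet(self, task_id, size):
        self.bytes[task_id] = self.bytes.get(task_id, 0) + size
        return self.prio[task_id]   # every flow inherits its task's priority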

Traffic Scheduling - pFabric 2

 
minTCP:
-Start at line rate, no RTO estimation, reduce window on packet drop, increase the same as TCP (SS, CA)
Conclusions:
-Simple, yet near-optimal
-Requires new switches and minor host changes (clean-slate)
-Does not meet deadline requirements

Traffic Scheduling - pFabric

 
-Prioritize packets based on remaining flow size
-pFabric switch: schedule based on priority (send highest priority first, drop lowest priority first)
-pFabric host: send/retransmit aggressively, use simple flow control (minTCP)
-Very small buffers: 2 x LinkSpeed x RTT
-Worst case: small packets (64B) leave 51.2 ns (64*8/10Gbps) to find the min/max of 600 numbers; a binary comparison tree needs ~10 clock cycles, ~1 ns with current ASICs
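
The switch logic reduces to two rules on a tiny buffer: dequeue the highest priority, drop the lowest priority when full; a minimal sketch where priority is the flow's remaining size (smaller = higher priority):

BUFFER_PKTS = 24   # illustrative buffer size

class PFabricQueue:
    def __init__(self):
        self.q = []                    # list of (remaining_flow_size, packet)

    def enqueue(self, remaining, pkt):
        self.q.append((remaining, pkt))
        if len(self.q) > BUFFER_PKTS:
            # Drop the packet belonging to the flow with most data left.
            self.q.remove(max(self.q, key=lambda e: e[0]))

    def dequeue(self):
        if not self.q:
            return None
        best = min(self.q, key=lambda e: e[0])   # highest priority first
        self.q.remove(best)
        return best[1]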

Traffic Scheduling: D3

 
Make network aware of flow deadlines
Prioritize based on deadlines
When capacity exceeds the desired rates: deadline flows get their desired rate + a fair share, non-deadline flows get only the fair share
When capacity is not enough: greedily satisfy as many flows as possible according to requested rates, in order of arrival
-Needs to modify hosts and switches, not backward compatible, no incremental deployment
-Not friendly with legacy transport protocols; running in parallel degrades their performance
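
A simplified sketch of that allocation rule at one link (numbers and structure are illustrative; non-deadline flows are ignored in the shortage case for brevity):

def allocate(capacity, flows):
    # flows: dicts in arrival order, e.g. {"id": "a", "deadline": True, "desired": 3.0}
    desired_total = sum(f["desired"] for f in flows if f["deadline"])
    alloc = {}
    if desired_total <= capacity:
        fair = (capacity - desired_total) / len(flows)
        for f in flows:
            alloc[f["id"]] = (f["desired"] if f["deadline"] else 0.0) + fair
    else:
        left = capacity
        for f in flows:                 # greedy, in order of arrival
            give = min(f["desired"], left) if f["deadline"] else 0.0
            alloc[f["id"]] = give
            left -= give
    return alloc

print(allocate(10.0, [{"id": "a", "deadline": True,  "desired": 3.0},
                      {"id": "b", "deadline": False, "desired": 0.0},
                      {"id": "c", "deadline": True,  "desired": 4.0}]))
# -> {'a': 4.0, 'b': 1.0, 'c': 5.0}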

TCP Losses

 
Block loss: lose a whole window of packets
Double loss: lose a retransmitted packet; the protocol can't tell
-Solution: timestamps
Tail loss: one of the last packets of the stream is lost, and there are not enough DUP ACKs to trigger retransmission
-Solution: send dummy data (e.g. a reiterated FIN)
PLATO: send heartbeats interleaved with data to avoid RTO and infer loss from 3 DUP ACKs; heartbeats are rarely dropped

DCTCP

 
A single flow needs C*RTT of buffering for 100% throughput
For large N flows, C*RTT/sqrt(N) is enough
-Idea: react to ECN marks in proportion to the fraction of marked packets (TCP cuts the window by half regardless of the number of marks)
-At the switch: mark packets when queue length > K
-At the sender: keep F = #markedACKs/#totalACKs, alpha = (1-g)*alpha + g*F
-cwnd = (1 - alpha/2)*cwnd
Benefit: keeps queue length short and throughput high
Tradeoff: convergence time is longer for new flows
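
A minimal sketch of the sender-side bookkeeping above (g is the EWMA gain, commonly 1/16; numbers are illustrative):

g = 1.0 / 16
alpha = 0.0
cwnd = 100.0   # packets

def on_window_of_acks(marked_acks, total_acks):
    global alpha, cwnd
    F = marked_acks / total_acks          # fraction of ECN-marked ACKs
    alpha = (1 - g) * alpha + g * F       # running estimate of congestion extent
    if marked_acks:
        cwnd = (1 - alpha / 2) * cwnd     # cut in proportion to congestion

on_window_of_acks(10, 100)
print(round(alpha, 4), round(cwnd, 2))    # small cut for few marks (TCP would halve)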

New Reno

 
Remember the last segment sent before Fast Retransmit
-Deal with partial ACKs (a new ACK that does not cover the last remembered segment, i.e. more packets were lost before entering FR)
-Retransmit the newly lost packet too and remain in Fast Recovery; exit when an ACK that covers the last segment sent before FR is received
-ssthresh = max(flightsize/2, 2*MSS)
-cwnd = ssthresh + 3*MSS
-Each new DUP ACK: cwnd = cwnd + MSS
-When a partial ACK is received: cwnd = cwnd - (currACK - prevACK)*MSS + MSS
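
The window arithmetic above as a small sketch (units are segments; awnd and retransmission logic omitted):

MSS = 1

def enter_fast_retransmit(flightsize):
    ssthresh = max(flightsize / 2, 2 * MSS)
    cwnd = ssthresh + 3 * MSS            # inflate by the 3 DUP ACKs
    return ssthresh, cwnd

def on_dup_ack(cwnd):
    return cwnd + MSS                    # each further DUP ACK inflates cwnd

def on_partial_ack(cwnd, newly_acked):
    # Deflate by what the partial ACK covered, add back one MSS,
    # retransmit the next hole and stay in Fast Recovery.
    return cwnd - newly_acked * MSS + MSS

ssthresh, cwnd = enter_fast_retransmit(flightsize=10)
cwnd = on_dup_ack(cwnd)
cwnd = on_partial_ack(cwnd, newly_acked=2)
print(ssthresh, cwnd)   # -> 5.0 8.0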

TCP

 
Slow start: start with cwnd = 1; each ACK: cwnd <- cwnd + 1; each RTT: cwnd <- 2*cwnd (exponential)
CA: enter when cwnd >= ssthresh; each ACK: cwnd <- cwnd + 1/cwnd
-Each RTT: cwnd <- cwnd + 1
Fast Retransmit: flightsize = min(awnd, cwnd), ssthresh = max(flightsize/2, 2)
-Enter slow start with cwnd = 1

TCP Tahoe and Reno

Tahoe:
-3 DUP ACKs -> Fast Retransmit, set ssthresh to cwnd/2, reduce cwnd to 1 MSS, reset to slow start
-ACK timeout (RTO) -> slow start, cwnd -> 1 MSS
Reno:
-3 DUP ACKs -> Fast Retransmit and skip slow start, set cwnd to cwnd/2, enter fast recovery
-ACK timeout (RTO) -> slow start, cwnd -> 1 MSS
-Fast recovery: wait for an ACK for the entire window before returning to CA; if no ACK, enter slow start

DC Transport Requirements

 
-High burst tolerance
-Low latency
-High throughput
Traditional TCP:
-Window flow control: lost packets are detected by missing ACKs
-W = BW x RTT -> awnd (receiver), cwnd (network), W = min(awnd, cwnd)
Algorithms to calculate cwnd: Tahoe, Reno, NewReno, DCTCP

TCP in the DC

 
Not good for DC
-Adds latency
-Wastes buffer space
-Performs badly with shallow-buffer switches
DC workloads:
-Partition/aggregate (delay-sensitive, bursty)
-Short messages (delay-sensitive)
-Large flows (throughput-sensitive)
Incast: synchronized congestion from partition-aggregate workloads
-Seemingly underutilized links become overutilized in short bursts, causing unseen drops
 
