πŸŽ“ VIT Vellore Β· MTech CSE Β· One-Day Revision Bible

OS & Linux
Interview Mastery

By Shivam Kachhadiya

Real interview questions from FAANG, Goldman Sachs, Citadel, Razorpay & HFT firms β€” curated from Glassdoor, GFG, LeetCode Discuss & verified interview reports. Covers everything from basics to advanced.

Google Β· Meta Β· Amazon Β· Microsoft Β· Goldman Sachs Β· JP Morgan Β· Morgan Stanley Β· Citadel Β· Jane Street Β· Razorpay Β· Stripe
01

OS Fundamentals

OS CORE
What is an OS? What are its 5 core functions?
All Companies
β˜…β˜…β˜…

An OS is system software that acts as the interface between hardware and user programs. It manages hardware and provides services to applications.

Process Mgmt: Create, schedule, pause, terminate processes. Allocate CPU time fairly.
Memory Mgmt: Allocate/free RAM, virtual memory, paging, swapping, isolation.
File System: Organize storage as files/dirs. Inodes, permissions, access control.
I/O Mgmt: Abstract hardware via device drivers. Buffering, caching, spooling.
Security: Authentication, authorization, system call gating, process isolation.
πŸ’‘ How to Answer at FAANG
Don't just list β€” explain the why. "The OS abstracts hardware so devs don't care if disk is SSD or HDD. It enforces isolation so a buggy process can't corrupt another process's memory."
What is a Kernel? Difference between Monolithic, Microkernel, and Hybrid kernel?
Amazon Β· Microsoft
β˜…β˜…β˜…

The kernel is the core of the OS β€” runs in privileged mode (Ring 0), manages hardware directly, handles syscalls.

Type | What runs in kernel? | Speed | Stability | Example
Monolithic | Everything: FS, drivers, networking, scheduler | Fastest (no IPC) | One bug = crash | Linux, Unix
Microkernel | Only: IPC, basic scheduling, memory | Slower (IPC overhead) | Very stable | Mach, MINIX, QNX
Hybrid | Core in kernel, some drivers outside | Good balance | Good | Windows NT, macOS XNU

Linux = Monolithic but Modular: Load/unload drivers at runtime via insmod / rmmod / lsmod without rebooting β€” combines monolithic performance with microkernel flexibility.

Amazon Follow-up
"Why is Linux monolithic if microkernel is theoretically safer?" β†’ IPC overhead in microkernel makes it impractical for general purpose OSes. Every driver call = context switch + message passing = latency. Linux compensates with good code review and module isolation.
Kernel Space vs User Space β€” what is a System Call and how does it work internally?
Google Β· Amazon Β· Goldman
β˜…β˜…β˜…

User Space (Ring 3): Apps run here. Cannot touch hardware directly. One crash = only that process dies.

Kernel Space (Ring 0): Full hardware access. OS runs here. A bug here crashes the whole system.

System Call β€” internal flow on x86_64:

# e.g. calling read(fd, buf, 100) in C
1. glibc wrapper loads syscall number into RAX register (read = 0)
2. Arguments: RDI=fd, RSI=buf, RDX=count
3. Executes SYSCALL instruction β†’ CPU hardware switches Ring 3 β†’ Ring 0
4. CPU saves user registers to kernel stack, jumps to syscall_handler
5. Kernel validates args, performs the actual work (e.g. reads from disk)
6. Return value placed in RAX
7. SYSRET instruction β†’ Ring 0 β†’ Ring 3, app continues
# Typical cost: 100-500 ns per syscall
⚑ Fintech Angle β€” Goldman / Citadel ask this
Each syscall costs ~100–500ns. HFT systems use kernel bypass (DPDK), memory-mapped I/O, and busy-polling to eliminate syscalls entirely. Even saving 10 syscalls per trade = microseconds of edge.
What is a Context Switch? True cost, what triggers it, how to measure it?
Google Β· Goldman Β· Citadel
β˜…β˜…β˜…

A context switch = CPU saves state of running process A into its PCB (Process Control Block), loads state of process B, runs B.

PCB stores: CPU registers (PC, SP, flags, GPRs), page table pointer (CR3), file descriptor table, signal masks, scheduling state.

Triggers: time quantum expiry (CFS preemption), I/O block (read/write syscall), higher priority task arrives (SCHED_FIFO), voluntary yield (sleep, mutex lock).

# True cost β€” direct + indirect
Save/restore registers:       ~200-500 ns
TLB flush (if no PCID):      ~500 ns - 5 Β΅s
CPU cache invalidation:       ~1-100 Β΅s (process's working set is now cold!)
Branch predictor flush:       misprediction spike after switch
# Total observable cost: 1 Β΅s to 100 Β΅s depending on cache pollution

# Measure context switches
vmstat 1             # 'cs' column = context switches/sec
pidstat -w 1         # per-process voluntary/involuntary switches
perf stat ./app      # 'context-switches' hardware counter
HFT Fix
Use SCHED_FIFO + CPU isolation (isolcpus=2,3 kernel param) + taskset -c 2 ./app. Trading threads never get preempted β€” zero involuntary context switches.
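The pidstat counters above can also be read from inside the process itself via getrusage(). A small sketch (Python here for brevity; RUSAGE_SELF works identically from C) showing that every blocking sleep costs a voluntary context switch:

```python
# Reading a process's own context-switch counters, as a cross-check for
# pidstat -w. ru_nvcsw = voluntary (blocked/slept), ru_nivcsw = involuntary
# (preempted by the scheduler). Unix-only.
import resource
import time

def voluntary_switches() -> int:
    return resource.getrusage(resource.RUSAGE_SELF).ru_nvcsw

before = voluntary_switches()
for _ in range(50):
    time.sleep(0.001)    # each sleep blocks -> kernel switches us out
after = voluntary_switches()
assert after > before    # 50 sleeps = at least 50 voluntary switches
```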
02

Process, States & Lifecycle

OS CORE
What are the 5 states of a process? Draw the state transition diagram.
All Companies
β˜…β˜…β˜…
                 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                 β”‚                  Process States                         β”‚
                 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

   New ──admitted──► Ready ◄── I/O complete / event done ── Waiting (Blocked)
                      β”‚  β–²                                       β–²
            dispatchedβ”‚  β”‚preempted                              β”‚
                      β–Ό  β”‚                                       β”‚
                    Running ──────────── I/O request ────────────┘
                       β”‚
                     exit()
                       β–Ό
                  Terminated
New: Process created, PCB allocated. Not yet in ready queue.
Ready: Loaded in RAM, waiting for CPU. Scheduler picks from here.
Running: Currently executing on CPU. Only 1 per core at a time.
Waiting: Blocked on I/O (disk, network) or event. CPU goes to another process.
Terminated: Execution complete. PCB stays until parent calls wait() (zombie if not).
Linux: check process state
ps aux β†’ STAT column: R=Running/Runnable, S=Sleeping (interruptible), D=Uninterruptible sleep (I/O wait), Z=Zombie, T=Stopped. D state processes are stuck waiting for disk β€” if you have many, your disk is the bottleneck.
Process vs Thread β€” ALL differences. What's shared, what's private?
Google Β· Amazon Β· Microsoft
β˜…β˜…β˜…
Aspect | Process | Thread
Address Space | Separate: own virtual memory map | Shared: same virtual address space
Heap | Private heap | Shared heap (race conditions possible!)
Stack | Own stack | Each thread has its own stack
File Descriptors | Own table (copied at fork) | Shared: close in one affects all!
Code/Data segments | Own copy (CoW after fork) | Shared: changes visible immediately
Creation cost | Heavy: fork() copies page tables | Light: clone() with CLONE_VM flag
Context switch | Expensive: TLB flush, new page tables | Cheaper: same address space, less TLB pressure
Crash impact | Other processes survive | One crash = entire process dies
Communication | IPC required (pipes, sockets, shm) | Direct via shared memory (fast, needs sync)
Linux kernel impl | fork() β†’ new task_struct | clone(CLONE_VM|CLONE_FILES|...) β†’ task_struct sharing resources
Key Insight
In Linux, threads and processes are the same kernel structure (task_struct). The difference is only which resources are shared β€” controlled by flags in clone(). pthread_create() calls clone() with sharing flags internally.
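The sharing rules in the table can be demonstrated directly (Python sketch; threading maps onto clone()/pthreads and os.fork() onto fork() underneath):

```python
# Threads see each other's writes to the shared heap immediately;
# a forked child gets a CoW copy, so its writes are invisible to
# the parent. Unix-only.
import os
import threading

def thread_demo() -> list:
    data = []                                  # lives on the shared heap
    workers = [threading.Thread(target=data.append, args=(i,))
               for i in range(4)]
    for t in workers: t.start()
    for t in workers: t.join()
    return data                                # all 4 writes visible here

def fork_demo() -> int:
    value = [42]
    pid = os.fork()
    if pid == 0:                               # CHILD: private CoW copy
        value[0] = 999
        os._exit(0)
    os.waitpid(pid, 0)                         # PARENT reaps the child
    return value[0]                            # still 42: child's write unseen

assert sorted(thread_demo()) == [0, 1, 2, 3]
assert fork_demo() == 42
```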
Explain fork(), exec(), wait(). What is Copy-on-Write? Zombie vs Orphan process?
Amazon Β· Google Β· Microsoft
β˜…β˜…β˜…

fork(): Creates exact copy of calling process. Returns 0 to child, child's PID to parent. Uses Copy-on-Write (CoW).

Copy-on-Write: After fork(), parent and child share the same physical pages marked read-only. When either writes β†’ page fault β†’ kernel copies just that page privately. Only pages actually written get duplicated. Makes fork() fast even for large processes.

exec(): Replaces current process image with a new program. Code, data, heap, stack all replaced. FDs stay open (unless FD_CLOEXEC set).

wait(): Parent blocks until child exits and collects exit status. Essential to prevent zombies.

# Classic shell fork-exec-wait pattern (C)
pid_t pid = fork();
if (pid == 0) {                    /* CHILD */
    execv("/bin/ls", args);        /* replace child image with ls */
    _exit(127);                    /* reached only if exec fails */
} else {                           /* PARENT */
    wait(&status);                 /* collect exit status, prevent zombie */
}
Zombie 🧟: Child exited but parent hasn't called wait(). PCB stays in /proc. Fix: parent must call wait() or handle SIGCHLD. Shows as Z in ps.
Orphan πŸ‘Ά: Parent exits while child still running. Linux re-parents the orphan to init/systemd (PID 1), which periodically calls wait().
# Find zombies
ps aux | grep ' Z '
ps -eo pid,ppid,stat,cmd | awk '$3~/Z/{print}'
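The zombie lifecycle can be reproduced in a few lines (Python sketch, Linux-only since it reads /proc): the child exits, and until the parent calls waitpid() its Z entry stays visible.

```python
# Fork a child that exits immediately, then observe its Z state in
# /proc before reaping it -- the same thing `ps aux | grep ' Z '` shows.
import os
import time

def child_state(pid: int) -> str:
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    return stat.rsplit(")", 1)[1].split()[0]   # state field after "(comm)"

pid = os.fork()
if pid == 0:
    os._exit(0)                    # child dies immediately

time.sleep(0.2)                    # parent hasn't reaped yet...
assert child_state(pid) == "Z"     # ...so the child shows as a zombie
os.waitpid(pid, 0)                 # reap: PCB released, /proc entry gone
```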
Linux Signals β€” SIGKILL vs SIGTERM vs SIGHUP. How to gracefully shutdown a service?
Google Β· Amazon Β· Goldman
β˜…β˜…β˜…
Signal | Num | Catchable? | Default Action | Use case
SIGTERM | 15 | βœ… Yes | Terminate | Graceful shutdown: app cleans up, closes DB, flushes buffers
SIGKILL | 9 | ❌ Never | Instant kill | Force kill. Data loss possible. Last resort only.
SIGHUP | 1 | βœ… Yes | Terminate | Daemons reload config without restart: kill -HUP nginx_pid
SIGINT | 2 | βœ… Yes | Terminate | Ctrl+C in terminal
SIGSEGV | 11 | βœ… Yes | Core dump | Invalid memory access (null pointer dereference, buffer overflow)
SIGCHLD | 17 | βœ… Yes | Ignore | Child state changed: parent should call wait() here
SIGSTOP | 19 | ❌ Never | Pause | Suspend process. Ctrl+Z sends SIGTSTP (catchable version)
SIGUSR1/2 | 10, 12 | βœ… Yes | Terminate | User-defined. Apps use for custom triggers (log rotation, stats dump)
# Graceful shutdown pattern (used in Kubernetes pod termination)
kill -15 <pid>             # Step 1: SIGTERM β€” request graceful stop
sleep 30                   # Step 2: Give it 30 seconds to finish
kill -9 <pid> 2>/dev/null  # Step 3: Force if still running

# In shell scripts β€” trap signals
trap 'echo Cleaning up; rm -f /tmp/lock; exit 0' SIGTERM SIGINT
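The same graceful-shutdown idea from inside an app: catch SIGTERM, flip a flag, let the main loop drain, then exit. A minimal Python sketch (signal.signal wraps sigaction):

```python
# Install a SIGTERM handler that requests shutdown instead of dying
# mid-transaction -- this is what "graceful" means at the app level.
import os
import signal

shutting_down = False

def on_sigterm(signum, frame):
    global shutting_down
    shutting_down = True            # main loop sees this, flushes + closes

signal.signal(signal.SIGTERM, on_sigterm)
os.kill(os.getpid(), signal.SIGTERM)   # simulate `kill -15 <pid>`
assert shutting_down                   # handler ran; clean exit can proceed
```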
03

Threads, Synchronization & Race Conditions

OS CORE
Mutex vs Semaphore vs Spinlock vs Monitor β€” explain each and when to use which.
Google Β· Amazon Β· Goldman
β˜…β˜…β˜…

Race Condition: Two threads read-modify-write shared data concurrently. counter++ is 3 ops: LOAD, ADD, STORE β€” not atomic. Result is unpredictable.

Mutex: Binary lock. Only 1 thread holds it. Others block (sleep). OS reschedules. No wasted CPU. Use for: most shared data protection.
Semaphore: Counting lock. Value N = N threads can hold simultaneously. Counting semaphore (N>1): limit concurrent resource access (e.g. max 10 DB connections).
Spinlock: Busy-waits in a loop. No sleep. Uses 100% CPU while waiting. Only use for extremely short critical sections (<100ns) where sleep overhead exceeds wait time. HFT loves it.
RW Lock: Multiple readers OR one writer. Perfect for read-heavy workloads (order book reads: many; price updates: rare).
Monitor: High-level: mutex + condition variable bundled. Java synchronized block = monitor. Cleaner API than raw pthreads.
# Mutex (C pthreads)
pthread_mutex_lock(&m);
shared_counter++;         # critical section
pthread_mutex_unlock(&m);

# Semaphore β€” binary (works like mutex) or counting
sem_init(&s, 0, 5);       # allows 5 concurrent holders
sem_wait(&s);             # acquire (decrement, block if 0)
sem_post(&s);             # release (increment)

# Atomic (lock-free, no OS involvement)
std::atomic<int> cnt{0};
cnt.fetch_add(1, std::memory_order_seq_cst);  # thread-safe increment
Mutex vs Binary Semaphore β€” Key Difference
A mutex has ownership β€” only the thread that locked it can unlock it. A semaphore has no owner β€” any thread can signal it. This matters for priority inversion protocols and debugging.
What is lock-free programming? Explain CAS (Compare-And-Swap) and the ABA Problem.
Citadel Β· Jane Street Β· Goldman
β˜…β˜…β˜…

Lock-free: Thread-safe without mutexes. Uses hardware atomic instructions. No blocking = no deadlock, no priority inversion, better cache performance.

CAS β€” Compare-And-Swap: Single hardware instruction: "If *ptr == expected, atomically set *ptr = new_val, return true. Otherwise return false." Foundation of ALL lock-free algorithms.

// C++ lock-free counter using a CAS loop
std::atomic<int> counter{0};
int old_val, next;
do {
    old_val = counter.load(std::memory_order_relaxed);
    next = old_val + 1;
} while (!counter.compare_exchange_weak(old_val, next));
// If counter changed between load() and the CAS, the CAS fails β†’ retry

// Java equivalent
AtomicInteger cnt = new AtomicInteger(0);
cnt.incrementAndGet();  // internally uses CAS loop

ABA Problem: Thread 1 reads value A. Thread 2 changes A→B→A. Thread 1 does CAS, sees A, thinks nothing changed — but state actually changed! Classic in lock-free linked lists: node removed and re-inserted.

Fix: Pair value with a version counter. CAS on (value, version) together. Version always increments β†’ impossible to have ABA. Java: AtomicStampedReference. C++: tagged pointer.
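The ABA slip and the versioned fix can be replayed deterministically with a toy CAS model (Python sketch: the "hardware" CAS is a plain function and the thread interleaving is simulated step by step, not actually concurrent — the real thing is a single LOCK CMPXCHG instruction):

```python
# Model of CAS, the ABA problem, and the version-counter fix
# (same idea as Java's AtomicStampedReference).

class AtomicCell:
    def __init__(self, value):
        self.value = value
    def cas(self, expected, new):          # models LOCK CMPXCHG
        if self.value == expected:
            self.value = new
            return True
        return False

# ABA: Thread 1 reads A; Thread 2 does A -> B -> A; Thread 1's CAS
# then succeeds even though the world changed underneath it.
cell = AtomicCell("A")
seen = cell.value              # Thread 1 reads A
cell.cas("A", "B")             # Thread 2: A -> B
cell.cas("B", "A")             # Thread 2: B -> A
assert cell.cas(seen, "C")     # Thread 1: succeeds -- ABA undetected!

# Fix: CAS on (value, version) together; version only ever increments.
stamped = AtomicCell(("A", 0))
seen = stamped.value                     # ("A", 0)
stamped.cas(("A", 0), ("B", 1))          # Thread 2: A -> B
stamped.cas(("B", 1), ("A", 2))          # Thread 2: B -> A, version moved on
assert not stamped.cas(seen, ("C", 3))   # Thread 1: stale version -> rejected
```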

Memory Ordering β€” Fintech Deep Dive
relaxed: no ordering, fastest. acquire/release: one-way barriers for producer-consumer. seq_cst: full fence, safest, default. HFT uses relaxed where correct, seq_cst at key synchronization points.
Classic Synchronization Problems β€” Producer-Consumer, Dining Philosophers, Reader-Writer.
Amazon Β· Microsoft Β· Razorpay
β˜…β˜…β˜…

1. Producer-Consumer (Bounded Buffer): Producer adds items to a fixed-size buffer. Consumer removes. Need to: block producer when full, block consumer when empty, mutual exclusion on buffer access.

# Solution: 3 semaphores
mutex = Semaphore(1)     # mutual exclusion on buffer
empty = Semaphore(N)     # count of empty slots (start=N)
full  = Semaphore(0)     # count of filled slots (start=0)

Producer: wait(empty) β†’ wait(mutex) β†’ add item β†’ signal(mutex) β†’ signal(full)
Consumer: wait(full)  β†’ wait(mutex) β†’ remove   β†’ signal(mutex) β†’ signal(empty)

2. Dining Philosophers: 5 philosophers, 5 forks. Each needs 2 forks to eat. Risk: all pick up left fork simultaneously β†’ deadlock.

Fix options: Allow at most 4 philosophers to sit simultaneously. OR odd philosopher picks left first, even picks right first (breaks circular wait). OR use a monitor/arbitrator.

3. Reader-Writer Problem: Multiple readers can read simultaneously. Only one writer at a time (exclusive), no readers while writing.

# Solution: RW Lock (pthread_rwlock_t)
pthread_rwlock_rdlock(&rw);   # acquire read lock (multiple allowed)
pthread_rwlock_wrlock(&rw);   # acquire write lock (exclusive)
pthread_rwlock_unlock(&rw);
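The 3-semaphore recipe from problem 1, runnable (Python threading sketch; acquire/release mirror sem_wait/sem_post): `empty` blocks the producer at N, `full` blocks the consumer at 0, `mutex` guards the buffer itself.

```python
# Bounded-buffer producer-consumer with three semaphores.
import threading
from collections import deque

N = 4
buffer = deque()
mutex = threading.Semaphore(1)      # mutual exclusion on buffer
empty = threading.Semaphore(N)      # count of empty slots (start = N)
full  = threading.Semaphore(0)      # count of filled slots (start = 0)
consumed = []

def producer():
    for item in range(20):
        empty.acquire()                                   # wait(empty)
        mutex.acquire(); buffer.append(item); mutex.release()
        full.release()                                    # signal(full)

def consumer():
    for _ in range(20):
        full.acquire()                                    # wait(full)
        mutex.acquire(); consumed.append(buffer.popleft()); mutex.release()
        empty.release()                                   # signal(empty)

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start(); p.join(); c.join()
assert consumed == list(range(20))   # nothing lost, nothing duplicated
```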
04

CPU Scheduling Algorithms

OS CORE
FCFS, SJF, Round Robin, Priority Scheduling β€” explain ALL with worked examples and tradeoffs.
Amazon Β· Microsoft Β· Razorpay
β˜…β˜…β˜…

Key Metrics: Waiting Time = time spent in the ready queue. Turnaround Time = total time from arrival to completion. Throughput = processes completed per second. Response Time = time from arrival until first CPU allocation.

# Example: 3 processes P1(burst=10ms), P2(burst=5ms), P3(burst=8ms) arriving t=0

# 1. FCFS (First Come First Served) β€” non-preemptive
Order: P1 β†’ P2 β†’ P3
Timeline: |---P1 10ms---|--P2 5ms--|----P3 8ms----|
Waiting:  P1=0, P2=10, P3=15  β†’  Avg WT = 8.33ms
Problem:  Convoy effect β€” short jobs wait behind long ones!

# 2. SJF (Shortest Job First) β€” non-preemptive
Order: P2(5ms) β†’ P3(8ms) β†’ P1(10ms)
Timeline: |-P2 5ms--|---P3 8ms---|-----P1 10ms-----|
Waiting:  P2=0, P3=5, P1=13  β†’  Avg WT = 6ms
Optimal!  Minimum average waiting time. But: needs future knowledge of burst time.

# 3. SRTF (Shortest Remaining Time First) β€” preemptive SJF
Preempts running process if new arrival has shorter remaining time.
Better response time but more context switches and starvation possible.

# 4. Round Robin (RR) β€” preemptive, time quantum q
q = 4ms. Cycle: P1(4) β†’ P2(4) β†’ P3(4) β†’ P1(4) β†’ P2(1) β†’ P3(4) β†’ P1(2)
Timeline: |P1 4|P2 4|P3 4|P1 4|P2 1|P3 4|P1 2|
Good for: time-sharing, fairness. q too small = too many context switches.
q too large = degenerates to FCFS. Typically q = 10-100ms.

# 5. Priority Scheduling
Each process has a priority. Highest priority runs first.
Problem: Starvation β€” low priority processes may never run.
Fix: Aging β€” gradually increase priority of waiting processes.
Algorithm | Preemptive? | Starvation? | Best For | Real OS
FCFS | No | No | Batch systems | Mainframes
SJF | No | Yes | Theoretical optimal | Benchmark
SRTF | Yes | Yes | Minimizing wait time | Some batch
Round Robin | Yes | No | Time-sharing, interactive | Linux CFS basis
Priority | Both | Yes (fix: aging) | Real-time systems | Linux SCHED_FIFO
Multilevel Queue | Yes | Yes | Mixed workloads | Windows, Linux
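The FCFS and SJF averages above can be re-derived with a tiny simulator (Python sketch; assumes all jobs arrive at t=0, non-preemptive):

```python
# Average waiting time for a non-preemptive run order: each job waits
# for the total burst time of everything scheduled before it.

def avg_waiting_time(bursts):
    waits, clock = [], 0
    for b in bursts:
        waits.append(clock)      # time this job sat in the ready queue
        clock += b
    return sum(waits) / len(waits)

fcfs = avg_waiting_time([10, 5, 8])          # P1, P2, P3 in arrival order
sjf  = avg_waiting_time(sorted([10, 5, 8]))  # shortest burst first

assert round(fcfs, 2) == 8.33    # matches the FCFS worked example
assert sjf == 6.0                # matches the SJF worked example
```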
How does Linux CFS (Completely Fair Scheduler) work? vruntime, Red-Black Tree, nice values.
Google Β· Goldman Β· Citadel
β˜…β˜…β˜…

CFS (since Linux 2.6.23) β€” core idea: give every process a fair share of CPU proportional to its weight. No fixed time quanta.

vruntime: Each task tracks virtual runtime β€” how much CPU time it has received, weighted by priority. Lower nice value = higher weight = vruntime grows slower = more CPU time.

Red-Black Tree: All runnable tasks in an RB-tree sorted by vruntime. O(log n) insert/delete. Leftmost node = smallest vruntime = runs next. O(1) to get next task.

# nice values: -20 (highest priority) to +19 (lowest)
nice -n -20 ./trading-engine   # highest CFS priority (each nice step β‰ˆ 1.25Γ— scheduling weight)
nice -n 19 ./batch-report      # lowest priority, runs only when CPU idle
renice -5 -p <pid>            # change priority of running process
Scheduler Class | Type | Use Case | Priority
SCHED_FIFO | Real-time, no preemption by same level | HFT trading threads, audio drivers | 1-99 (RT)
SCHED_RR | Real-time with time quantum | RT with fairness among same-prio tasks | 1-99 (RT)
SCHED_OTHER | Normal CFS | Regular applications | nice -20 to 19
SCHED_DEADLINE | EDF with hard deadlines | Periodic RT tasks with guaranteed latency | Above RT
SCHED_IDLE | Lowest | Background housekeeping tasks | Below all
# Set real-time scheduling (root required)
chrt -f 99 ./hft-engine        # SCHED_FIFO priority 99 (max)
chrt -p <pid>                  # check current scheduling policy
05

Deadlock β€” Prevention, Avoidance, Detection

OS CORE
What is Deadlock? 4 Coffman conditions. Prevention vs Avoidance vs Detection strategies.
Amazon Β· Google Β· Goldman
β˜…β˜…β˜…

Deadlock: Process A holds Resource 1, waits for Resource 2. Process B holds Resource 2, waits for Resource 1. Both wait forever.

4 Coffman Conditions β€” ALL must hold simultaneously:

1. Mutual Exclusion: Resource can only be held by one process at a time
2. Hold & Wait: Process holds β‰₯1 resource AND waits for more
3. No Preemption: Resources can't be forcibly taken; only voluntarily released
4. Circular Wait: Circular chain: P1 waits for P2, P2 waits for P3, P3 waits for P1

Breaking each condition (Prevention):

Break #4 (Lock Ordering): Always acquire locks in a globally consistent order. Most practical. If every thread acquires lock A before lock B, circular wait is impossible.
Break #2 (Hold & Wait): Request all resources at once before starting. Or release all before requesting more.
Try-Lock: pthread_mutex_trylock(). If you can't acquire, back off and retry. Or use a timeout: if not acquired in N ms, release and retry.
Lock-Free: Use atomic CAS operations. No locks = impossible to deadlock.
# Detect deadlocks in production Linux
pstack <pid>                              # all thread stacks
gdb -p <pid> -ex "thread apply all bt"   # backtraces
cat /proc/<pid>/wchan                    # which kernel function thread waits on
jstack <java_pid> | grep -A20 BLOCKED   # Java thread dump
valgrind --tool=helgrind ./app           # detect data races + lock-order violations
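Lock ordering from the "Break #4" row, in code (Python sketch): the lock pair is always acquired in one consistent global order — here by id() — so two threads grabbing the same pair in opposite argument order can never form a circular wait.

```python
# Two threads lock the same pair of locks in opposite argument order.
# Without a consistent acquisition order this can deadlock; with it,
# both threads always finish.
import threading

def lock_pair(m1, m2):
    first, second = sorted((m1, m2), key=id)   # global, consistent order
    first.acquire(); second.acquire()
    return first, second

a, b = threading.Lock(), threading.Lock()
done = []

def worker(x, y):
    for _ in range(1000):
        first, second = lock_pair(x, y)
        second.release(); first.release()
    done.append(True)

t1 = threading.Thread(target=worker, args=(a, b))
t2 = threading.Thread(target=worker, args=(b, a))   # opposite order!
t1.start(); t2.start(); t1.join(); t2.join()
assert done == [True, True]    # both finished: no circular wait possible
```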
Banker's Algorithm β€” what is it, how does it work, and why is it called "Banker's"?
Amazon Β· Microsoft Β· Razorpay
β˜…β˜…β˜…

Banker's Algorithm is a deadlock avoidance algorithm. Named after bank lending: a bank never lends money that would prevent it from satisfying all customers' needs.

Before granting a resource request, OS simulates the allocation and checks if the resulting state is "safe" β€” i.e., there exists a sequence in which all processes can complete using available resources.

# Example: 3 processes, 1 resource type (12 units total)
       Max   Allocated   Need (= Max - Allocated)
P1:     10       5          5
P2:      4       2          2
P3:      9       3          6
Available = 12 - 5 - 2 - 3 = 2 units free

# Is this state safe? Try to find a safe sequence:
# Available=2. P2 needs 2 β†’ grant. P2 finishes, releases its 2 β†’ Available=4
# P1 needs 5 > 4. P3 needs 6 > 4. No remaining process can finish.
# No safe sequence exists β†’ state is UNSAFE β†’ deny the request that led here.
# (If P3 held only 2 units, Available=3 and P2 β†’ P1 β†’ P3 would be a safe sequence.)
Safe State: There EXISTS a sequence in which all processes can finish. No deadlock possible.
Unsafe State: No such sequence exists. Deadlock POSSIBLE (not guaranteed, but risky).
Practical Limitation
Banker's algorithm requires knowing maximum resource needs in advance β€” impossible in general-purpose OS. Used in real-time systems (medical devices, aerospace) where processes declare resource needs upfront. General OSes use simpler prevention (lock ordering) instead.
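The safety check at the heart of the algorithm fits in a few lines (Python sketch, single resource type as in the example): repeatedly find a process whose remaining Need fits in Available, let it finish, and reclaim its allocation.

```python
# Banker's safety check: is there an order in which every process
# can obtain its remaining Need and complete?

def is_safe(available, allocated, need):
    finished = [False] * len(need)
    while True:
        progressed = False
        for i, done in enumerate(finished):
            if not done and need[i] <= available:
                available += allocated[i]   # process finishes, releases all
                finished[i] = True
                progressed = True
        if not progressed:
            return all(finished)

# The state above: Available=2, P1/P2/P3 hold 5/2/3 and still need 5/2/6
assert is_safe(2, [5, 2, 3], [5, 2, 6]) is False    # UNSAFE, as worked out
# Variant where P3 holds only 2 (Available=3, P3 still needs 7):
assert is_safe(3, [5, 2, 2], [5, 2, 7]) is True     # safe via P2 -> P1 -> P3
```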
06

Memory Management, Paging & Virtual Memory

OS CORE
Internal vs External Fragmentation β€” what is it, which causes it, how to fix?
Amazon Β· Microsoft Β· Razorpay
β˜…β˜…β˜…
Internal Fragmentation: Allocated block is LARGER than requested. Wasted space INSIDE the block. Example: process needs 18KB, OS allocates a 20KB frame β†’ 2KB wasted inside. Caused by: paging, fixed-size allocation.
External Fragmentation: Total free memory is sufficient but sits in NON-CONTIGUOUS small chunks; no single chunk is large enough. Example: 50KB free but split as 10+15+25KB, process needs 40KB β†’ fails. Caused by: variable-size allocation, segmentation.
# Fixes
Internal: Use smaller page size (less waste per frame), or slab allocator (kernel uses this)
External: Compaction (move processes together β€” expensive!), Paging (avoids it entirely),
          Best-fit / First-fit allocation strategies

Paging vs Segmentation:

Aspect | Paging | Segmentation
Division | Fixed-size pages (4KB typical) | Variable-size logical segments (code, data, stack)
Fragmentation | Internal (last page may be partially used) | External (segments of varying sizes)
User visibility | Transparent to user/programmer | Visible (programmer manages segments)
Used by | Modern OSes (Linux, Windows) | Old systems, Intel x86 in protected mode
Virtual Memory and Paging β€” address translation, page table, TLB, page fault.
Google Β· Amazon Β· Goldman
β˜…β˜…β˜…

Virtual memory gives each process an illusion of its own large, private address space. The MMU (Memory Management Unit) translates virtual β†’ physical addresses using page tables.

# x86_64: 4-level page table (48-bit virtual address)
VA: [PML4 9 bits][PDPT 9 bits][PD 9 bits][PT 9 bits][Offset 12 bits]
        ↓            ↓            ↓           ↓
   PML4 table β†’ PDPT table β†’ Page Directory β†’ Page Table β†’ Physical Frame + Offset

# TLB = Translation Lookaside Buffer
TLB hit:  1-2 cycles (virtual β†’ physical from cache)
TLB miss: 4+ memory accesses for full page table walk = ~100 cycles
# Context switch may flush TLB! (unless CPU has PCID extension)

# Huge Pages = 2MB or 1GB pages
# Fewer TLB entries needed β†’ fewer misses β†’ better perf for large workloads
echo 1024 > /proc/sys/vm/nr_hugepages   # allocate huge pages
cat /proc/meminfo | grep Huge           # check usage

Page Fault: CPU accesses virtual address not currently mapped to physical frame β†’ hardware triggers page fault exception β†’ OS fault handler runs:

Minor Fault: Page in memory but not mapped (e.g. shared lib first use). No disk I/O. Fast.
Major Fault: Page must be loaded from disk (swap). Slow: ~1-100ms. Kills latency.
Invalid Fault: Access to unmapped/forbidden address β†’ SIGSEGV (segfault).
CoW Fault: Write to a Copy-on-Write shared page β†’ kernel copies the page privately.
What is Demand Paging? What is Thrashing and how do you fix it?
Amazon Β· Microsoft Β· Razorpay
β˜…β˜…β˜…

Demand Paging: Pages are loaded into RAM only when accessed (on demand), not all at program start. Pages not yet accessed = not in memory. When accessed β†’ page fault β†’ load from disk. Saves RAM, allows running programs larger than physical memory.

Thrashing: System spends MORE time swapping pages than executing processes. Happens when sum of all working sets exceeds physical RAM. CPU utilization collapses to near 0% while disk I/O is 100%.

# Detect thrashing
vmstat 1          # 'si' (swap-in) and 'so' (swap-out) > 0 = swapping
free -h           # available memory near 0
cat /proc/vmstat | grep pswp  # pswpin/pswpout = swap activity
sar -B 1          # page fault rate
Fix 1: Reduce multiprogramming degree β€” kill some processes or lower limits
Fix 2: Add more physical RAM
Fix 3: Working Set Model β€” track which pages a process actually uses, ensure they're kept in RAM
Fix 4: vm.swappiness=1 (Linux) β€” discourage swapping for fintech workloads
07

Page Replacement Algorithms

OS CORE
FIFO, LRU, Optimal page replacement β€” explain with worked examples showing page faults.
Amazon Β· Microsoft Β· Razorpay
β˜…β˜…β˜…

Reference String: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2    Frames = 3

# 1. FIFO β€” evict the page that was loaded EARLIEST
Ref: 7  0  1  2  0  3  0  4  2  3  0  3  2
F1:  7  7  7  2  2  2  2  4  4  4  0  0  0
F2:  -  0  0  0  0  3  3  3  2  2  2  2  2
F3:  -  -  1  1  1  1  0  0  0  3  3  3  3
PF:  βœ—  βœ—  βœ—  βœ—  -  βœ—  βœ—  βœ—  βœ—  βœ—  βœ—  -  -
Total Page Faults = 10

# 2. LRU (Least Recently Used) β€” evict page unused for longest time
Ref: 7  0  1  2  0  3  0  4  2  3  0  3  2
F1:  7  7  7  2  2  2  2  4  4  4  0  0  0
F2:  -  0  0  0  0  0  0  0  0  3  3  3  3
F3:  -  -  1  1  1  3  3  3  2  2  2  2  2
PF:  βœ—  βœ—  βœ—  βœ—  -  βœ—  -  βœ—  βœ—  βœ—  βœ—  -  -
Total Page Faults = 9  (better than FIFO!)

# 3. Optimal β€” evict page not needed for longest time IN THE FUTURE
Ref: 7  0  1  2  0  3  0  4  2  3  0  3  2
F1:  7  7  7  2  2  2  2  2  2  2  2  2  2
F2:  -  0  0  0  0  0  0  4  4  4  0  0  0
F3:  -  -  1  1  1  3  3  3  3  3  3  3  3
PF:  βœ—  βœ—  βœ—  βœ—  -  βœ—  -  βœ—  -  -  βœ—  -  -
Total Page Faults = 7  (MINIMUM possible β€” theoretical ideal!)
FIFO: Simple queue. Easy. But ignores usage pattern. Suffers Belady's Anomaly. OpenVMS uses it.
LRU: Good approximation of Optimal. Expensive to implement perfectly (timestamps). Linux uses the Clock algorithm as an approximation. Best practical choice.
Optimal (OPT): Minimum page faults. Needs future knowledge β€” impossible in practice! Used only as a theoretical benchmark.
Linux's Actual Algorithm
Linux uses a modified Clock (Second Chance) algorithm β€” approximates LRU using a reference bit in the PTE. Each page gets a "second chance" before eviction. Combined with multiple LRU lists (active_list, inactive_list) managed by kswapd.
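All three worked examples can be checked with small simulators (Python sketch; each function returns the fault count for a reference string and frame count — the totals 10, 9, and 7 match the tables above):

```python
# Page replacement simulators: FIFO, LRU, and Optimal.

def fifo(refs, frames):
    mem, queue, faults = set(), [], 0
    for p in refs:
        if p not in mem:
            faults += 1
            if len(mem) == frames:
                mem.remove(queue.pop(0))   # evict page loaded EARLIEST
            mem.add(p); queue.append(p)
    return faults

def lru(refs, frames):
    mem, faults = [], 0                    # list ordered by recency
    for p in refs:
        if p in mem:
            mem.remove(p)
        else:
            faults += 1
            if len(mem) == frames:
                mem.pop(0)                 # evict LEAST recently used
        mem.append(p)                      # most recent goes to the end
    return faults

def optimal(refs, frames):
    mem, faults = set(), 0
    for i, p in enumerate(refs):
        if p in mem:
            continue
        faults += 1
        if len(mem) == frames:
            future = refs[i + 1:]
            # evict the page whose next use is farthest away (or never)
            victim = max(mem, key=lambda q: future.index(q)
                         if q in future else len(future))
            mem.remove(victim)
        mem.add(p)
    return faults

REFS = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2]
assert fifo(REFS, 3) == 10
assert lru(REFS, 3) == 9
assert optimal(REFS, 3) == 7
```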
What is Belady's Anomaly? Which algorithms suffer from it? Prove with an example.
Amazon Β· Microsoft Β· Google
β˜…β˜…β˜…

Belady's Anomaly: In some page replacement algorithms, increasing the number of page frames can actually INCREASE page faults. Counter-intuitive β€” more RAM = worse performance!

Only FIFO suffers from it. LRU and Optimal do NOT (they are "stack algorithms" β€” set of pages with N frames is always a subset of N+1 frames).

# Reference String: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5

# With 3 frames (FIFO) β€” 9 page faults:
Ref: 1  2  3  4  1  2  5  1  2  3  4  5
F1:  1  1  1  4  4  4  5  5  5  5  5  5
F2:  -  2  2  2  1  1  1  1  1  3  3  3
F3:  -  -  3  3  3  2  2  2  2  2  4  4
PF:  βœ—  βœ—  βœ—  βœ—  βœ—  βœ—  βœ—  -  -  βœ—  βœ—  -  β†’ 9 faults

# With 4 frames (FIFO) β€” 10 page faults! (MORE with more RAM!)
Ref: 1  2  3  4  1  2  5  1  2  3  4  5
F1:  1  1  1  1  1  1  5  5  5  5  4  4
F2:  -  2  2  2  2  2  2  1  1  1  1  5
F3:  -  -  3  3  3  3  3  3  2  2  2  2
F4:  -  -  -  4  4  4  4  4  4  3  3  3
PF:  βœ—  βœ—  βœ—  βœ—  -  -  βœ—  βœ—  βœ—  βœ—  βœ—  βœ—  β†’ 10 faults 😱
Why it Happens
FIFO replaces the OLDEST page, not the LEAST USEFUL. Adding a frame changes the eviction order in a way that happens to cause more misses. Stack algorithms (LRU, OPT) always benefit from more frames because adding a frame only adds a page to the working set, never removes one.
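The anomaly is easy to reproduce with a FIFO simulator (Python sketch): same reference string, one extra frame, one extra fault.

```python
# Belady's Anomaly: FIFO fault counts for 3 vs 4 frames.

def fifo_faults(refs, frames):
    mem, queue, faults = set(), [], 0
    for p in refs:
        if p not in mem:
            faults += 1
            if len(mem) == frames:
                mem.remove(queue.pop(0))   # evict oldest-loaded page
            mem.add(p); queue.append(p)
    return faults

REFS = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
assert fifo_faults(REFS, 3) == 9
assert fifo_faults(REFS, 4) == 10          # more frames, MORE faults!
```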
08

File System, IPC & Storage

LINUX
What is an inode? Hard link vs Soft link β€” difference with examples.
Amazon Β· Google Β· Microsoft
β˜…β˜…β˜…

An inode (index node) stores ALL metadata about a file β€” except its name and data. Every file has exactly one inode number.

Inode stores: File type, permissions (rwx), UID/GID, size, timestamps (atime/mtime/ctime), hard link count, pointers to data blocks
NOT in inode: Filename (lives in the directory entry/dentry) and actual file data (in data blocks)
# Hard Link β€” another directory entry pointing to SAME inode
ln file.txt hardlink.txt
ls -i file.txt hardlink.txt   # both show same inode number!
# File exists until ALL hard links deleted (link count = 0)
# Cannot cross filesystems. Cannot link directories.

# Soft/Symbolic Link β€” a file containing a PATH string
ln -s /opt/java/bin/java /usr/bin/java  # symlink
ls -la /usr/bin/java   # shows: /usr/bin/java -> /opt/java/bin/java
# Different inode. CAN cross filesystems. BREAKS if target deleted.

stat file.txt          # show all inode info
ls -i file.txt         # show inode number
df -i /var/log         # inode usage (can be full even with disk space!)
Classic Trap Question
"My disk shows free space but I can't create files!" β†’ Inode table is full. Check: df -i. Happens on mail/log servers with millions of tiny files. Each file consumes one inode regardless of file size.
Compare ALL IPC mechanisms β€” when to use pipes vs shared memory vs sockets?
Amazon Β· Google Β· Goldman
β˜…β˜…β˜…
Mechanism | Speed | Scope | Persistence | When to Use
Unnamed Pipe | Fast | Parent-child only | No | Shell pipelines: ls | grep. Unidirectional, byte stream.
Named Pipe (FIFO) | Fast | Same machine | Until deleted | Unrelated processes on same host. File in /tmp.
POSIX Msg Queue | Medium | Same machine | Until removed | Typed, priority messages. Structured communication.
Shared Memory | Fastest | Same machine | Until unmapped | High-throughput: market feeds, game engines. Needs separate lock!
Unix Domain Socket | Very Fast | Same machine | No | Docker ↔ host, nginx ↔ php-fpm. Full duplex, permissions.
TCP Socket | Medium | Any host | No | Distributed systems, microservices. Reliable, ordered.
mmap'd file | Fastest | Same machine | File lifetime | Zero-copy DB files (LMDB), persistent shared state, HFT ring buffers.
LMAX Disruptor β€” HFT IPC Gold Standard
Lock-free ring buffer in shared memory. Sequence numbers + CAS for coordination. 25M+ events/sec sub-microsecond latency. No locks, no queues, no GC pressure. Used at Goldman, LMAX Exchange, many fintech firms.
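Shared memory plus fork in a dozen lines (Python sketch; mmap with fd -1 gives an anonymous MAP_SHARED region, the same primitive a C program gets from mmap with MAP_SHARED|MAP_ANONYMOUS):

```python
# Parent and child share the same physical pages through an anonymous
# shared mapping -- no pipe, no socket, no serialization. Unix-only.
import mmap
import os
import struct

shm = mmap.mmap(-1, 8)               # anonymous, MAP_SHARED by default

pid = os.fork()
if pid == 0:                         # CHILD: write into the shared region
    shm.seek(0)
    shm.write(struct.pack("q", 12345))
    os._exit(0)

os.waitpid(pid, 0)                   # PARENT: wait, then read child's data
shm.seek(0)
value, = struct.unpack("q", shm.read(8))
assert value == 12345                # child's write visible: same pages
```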
Blocking I/O vs Non-blocking vs Async I/O β€” explain epoll and why it's O(1).
Google Β· Amazon Β· Goldman
β˜…β˜…β˜…
Model | Thread blocks? | How | Used in
Blocking I/O | Yes | Sleeps until data ready. Simple but wastes threads. | Traditional servers
Non-blocking | No | Returns EAGAIN immediately if not ready. App polls. | With epoll/select
select/poll | Yes (in call) | Check N FDs, return when any ready. O(n) scan. | Old servers (<1000 connections)
epoll | Yes (in call) | Kernel callback on ready FDs. O(1). Scales to 10M connections. | Nginx, Redis, Node.js
Async (io_uring) | Never | Submit + completion ring buffers in shared memory. Zero syscalls. | Modern high-perf servers
# Why epoll is O(1):
epoll_create()         # Creates: RB-tree (all monitored FDs) + Ready list
epoll_ctl(ADD, fd)     # Adds FD to RB-tree: O(log n). Registers callback in kernel.
# When NIC gets data β†’ IRQ β†’ kernel callback fires β†’ FD added to ready list: O(1)
n = epoll_wait(events) # Returns ONLY ready FDs: O(k) where k = ready count
# Total: O(1) regardless of total monitored connections!
# select: O(n) scan all FDs each call. 100k connections = 100k checks every call.
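The epoll flow above, exercised on a pipe via Python's select module (Linux-only sketch): register the FD once, then each poll() returns only the FDs that are actually ready.

```python
# epoll on a pipe: no events while the pipe is empty, exactly one
# (fd, EPOLLIN) event after data arrives.
import os
import select

r, w = os.pipe()
ep = select.epoll()                  # epoll_create()
ep.register(r, select.EPOLLIN)       # epoll_ctl(ADD, r): O(log n), once

assert ep.poll(timeout=0) == []      # nothing readable yet

os.write(w, b"tick")                 # data arrives -> kernel marks r ready
events = ep.poll(timeout=1)          # epoll_wait: returns ready FDs only
assert events == [(r, select.EPOLLIN)]

ep.close(); os.close(r); os.close(w)
```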
09

Linux Commands β€” Basic to Intermediate

COMMANDS
🟒 BASIC β€” Navigation, file creation, viewing: ls, cd, pwd, mkdir, touch, cat, head, tail
All Companies
β˜…β˜…β˜…β–Ό
# Navigation
pwd                      # print current directory
cd /var/log              # go to absolute path
cd ..                    # one level up
cd -                     # toggle to previous directory
cd ~                     # go home

# Listing
ls -la                   # long format + hidden files (most used combo)
ls -ltr                  # sort by time, oldest first β†’ great for logs!
ls -lhS                  # human-readable sizes, largest first

# Create files and directories
touch file.txt           # create empty file (or update timestamp)
mkdir -p a/b/c/d         # create nested dirs in one shot
mkdir -p /opt/{logs,config,data}  # create multiple subdirs with brace expansion

# View files
cat file.txt             # print entire file
cat -n file.txt          # print with line numbers
head -50 file.txt        # first 50 lines
tail -100 app.log        # last 100 lines
tail -f /var/log/syslog  # FOLLOW live β€” most used for monitoring!
tail -f app.log | grep ERROR  # follow + filter in real time
less app.log             # navigate: /pattern=search n=next G=end q=quit
🟒 BASIC β€” Copy, move, delete: cp, mv, rm. Pipes and I/O redirection.
All Companies
β˜…β˜…β˜…β–Ό
# Copy / Move / Delete
cp -r src/ dest/         # recursive copy (required for directories)
cp -p file backup        # preserve timestamps + permissions
mv old.txt new.txt       # rename file
mv file.txt /tmp/        # move to directory
rm -rf dirname/          # force-delete directory tree (NO prompt! Be careful!)
rm -i file.txt           # interactive β€” asks before delete

# Pipes β€” send stdout of cmd1 as stdin to cmd2
cat app.log | grep ERROR | wc -l     # count error lines
ps aux | grep java | grep -v grep    # find java processes
ls -l | sort -k5 -rn | head -10      # top 10 largest files

# I/O Redirection
cmd > file.txt           # stdout to file (OVERWRITE)
cmd >> file.txt          # stdout to file (APPEND)
cmd 2> err.log           # stderr to file
cmd > out.log 2>&1       # both stdout AND stderr to file
cmd > /dev/null 2>&1     # run silently β€” discard everything
./script.sh | tee output.log  # see AND save output simultaneously
β˜…β˜…β˜… Most Asked Linux Interview Q
"What does command > /dev/null 2>&1 mean?" β€” /dev/null discards everything written to it. > redirects stdout there. 2>&1 redirects fd 2 (stderr) to wherever fd 1 (stdout) is pointing β€” also /dev/null. Result: completely silent. Standard in cron jobs.
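Under the hood, `2>&1` is a `dup2(1, 2)` call — fd 2 becomes a copy of wherever fd 1 currently points, which is why order matters. A small sketch of what the shell does for `cmd > out.log 2>&1` (the target file name is illustrative):

```python
# Shell redirection in miniature:
#   1) open the file and dup2() it onto fd 1 (that's `> out.log`)
#   2) dup2() fd 1 onto fd 2            (that's `2>&1`)
import os, tempfile

tf = tempfile.NamedTemporaryFile(delete=False); tf.close()
path = tf.name                                  # stand-in for out.log
saved_out, saved_err = os.dup(1), os.dup(2)     # remember the real stdout/stderr

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.dup2(fd, 1)                   # > out.log
os.dup2(1, 2)                    # 2>&1 — stderr now follows stdout
os.write(1, b"to stdout\n")
os.write(2, b"to stderr\n")

os.dup2(saved_out, 1); os.dup2(saved_err, 2)    # restore
os.close(fd); os.close(saved_out); os.close(saved_err)

captured = open(path).read()
os.remove(path)
print(captured, end="")          # both lines landed in the same file
```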
🟑 INTERMEDIATE β€” grep, find, awk, sed β€” text processing with examples.
AmazonRazorpay
β˜…β˜…β˜…β–Ό
# grep β€” search inside files
grep -r "ERROR" /var/log/       # recursive search
grep -in "error" app.log        # case-insensitive + line numbers
grep -v "DEBUG" app.log         # invert β€” show lines NOT matching
grep -E "ERROR|WARN|FATAL"      # OR pattern (extended regex)
grep -A3 -B3 "exception"        # 3 lines before and after match
grep -c "timeout" app.log       # count matches only

# find β€” locate files by any attribute
find . -name "*.log" -mtime -7          # .log files modified in last 7 days
find . -type f -size +100M              # regular files > 100MB
find /tmp -mtime +30 -delete            # delete files older than 30 days
find . -name "*.java" -exec grep -l "TODO" {} \;  # grep inside found files

# awk β€” column/field processing
awk '{print $1, $3}' file               # print column 1 and 3
awk -F: '{print $1}' /etc/passwd        # colon-separated β†’ print usernames
ps aux | awk '$3 > 50 {print $1,$3}'   # processes >50% CPU
awk '{sum+=$2} END{print "Total:",sum}' file

# sed β€” stream find & replace
sed 's/old/new/g' file                  # replace all occurrences
sed -i 's/localhost/prod-db/g' cfg.yml  # in-place edit (modifies file!)
sed -n '5,15p' file                     # print only lines 5-15
sed '/^#/d' config.conf                 # delete comment lines
🟑 INTERMEDIATE β€” Linux permissions: chmod, chown, SUID, SGID, sticky bit.
AmazonRazorpay
β˜…β˜…β–Ό
# Permission format: -rwxr-xr-x
#  Type: - file  d dir  l symlink
#  User|Group|Other  β†’  r=4 w=2 x=1
#  755 = rwxr-xr-x | 644 = rw-r--r-- | 600 = rw-------

chmod 755 script.sh         # owner=rwx, group+others=rx
chmod 644 config.yml        # owner=rw, others=read only
chmod +x deploy.sh          # add execute bit for everyone
chmod o-w file.txt          # remove write for others
chown appuser:appgroup file  # change owner and group
chown -R www-data:www-data /var/www/  # recursive

# Special bits
chmod 4755 /usr/bin/sudo    # SUID: executable runs as FILE OWNER (root)
                             # e.g. passwd is SUID root β†’ any user can change pw
chmod 2755 /shared/         # SGID dir: new files inherit the DIRECTORY's group, not creator's
chmod 1777 /tmp             # Sticky dir: only a file's OWNER (or root/dir owner) can delete it

# Find SUID files (security audit!)
find / -perm -4000 -type f 2>/dev/null   # all SUID binaries
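The octal arithmetic above (r=4, w=2, x=1) can be checked programmatically — a quick sketch on a throwaway file:

```python
# 640 = rw-r----- : owner 6 = 4+2 (rw), group 4 (r), other 0 (none).
import os, stat, tempfile

tf = tempfile.NamedTemporaryFile(delete=False); tf.close()
os.chmod(tf.name, 0o640)                 # same as: chmod 640 file

mode = stat.S_IMODE(os.stat(tf.name).st_mode)
print(oct(mode))                         # 0o640

rwx = stat.filemode(os.stat(tf.name).st_mode)  # same decoding `ls -l` does
print(rwx)                               # -rw-r-----
os.remove(tf.name)
```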
SCENARIO: "99% CPU on a Java production server β€” walk me through your diagnosis."
AmazonRazorpayGoldman
β˜…β˜…β˜…β–Ό
# Step 1: Which process?
top -c               # press P to sort by CPU. Note PID.
ps aux --sort=-%cpu | head -5

# Step 2: Which THREAD? (critical for Java β€” GC threads, hot path threads)
top -H -p <pid>     # H = show threads. Note TID of hottest thread.

# Step 3: User CPU vs System CPU?
# 'us' high β†’ hot code in app (algorithm, hot loop)
# 'sy' high β†’ too many syscalls or kernel work
# 'wa' high β†’ I/O bottleneck (disk, network)

# Step 4: Java thread dump β€” match TID to thread
printf "%x\n" <TID>                     # convert TID to hex: e.g. 1234 β†’ 0x4d2
jstack <pid> | grep -A 20 "0x4d2"     # find thread in dump, see stack trace

# Step 5: Profile to find hot functions
perf top -p <pid>                       # live function-level CPU profile
perf record -F 99 -p <pid> -- sleep 30
perf report                              # interactive flamegraph-like report

# Step 6: Check GC pressure
jstat -gcutil <pid> 1000               # GC stats every 1 second
# If FGC (full GC) is frequent β†’ heap too small, memory leak
Common Root Causes in Fintech
JSON serialization on hot path Β· HashMap resize under load Β· GC thrashing (heap >90%) Β· DB connection pool timeout loop Β· Thread pool exhaustion causing spin-wait Β· Infinite retry loop on failed external call
10

Linux Internals β€” Boot, /proc, Containers

LINUX
Linux boot process end-to-end: BIOS β†’ GRUB β†’ Kernel β†’ systemd β†’ your app.
AmazonGoogleMicrosoft
β˜…β˜…β–Ό
1. BIOS / UEFI
   └─ POST: Power-On Self Test, detect CPU, RAM, PCI devices
   └─ UEFI: finds EFI System Partition β†’ loads bootloader

2. GRUB2 Bootloader
   └─ Reads /boot/grub/grub.cfg
   └─ Loads kernel image (/boot/vmlinuz) + initrd into RAM
   └─ Passes kernel cmdline: root=/dev/sda1 ro quiet isolcpus=2,3

3. Kernel Initialization
   └─ Decompresses itself (vmlinuz = compressed kernel image: gzip, xz or zstd)
   └─ Sets up CPU (GDT, IDT), enables paging/MMU, detects hardware
   └─ Initializes memory manager, scheduler, device subsystems
   └─ Mounts initramfs (initial RAM filesystem) as temporary /

4. initramfs
   └─ Minimal environment: busybox, storage drivers (NVMe, LVM, RAID)
   └─ Mounts real root filesystem, decrypts if LUKS
   └─ Pivots root, exec's /sbin/init

5. systemd (PID 1)
   └─ Reads unit files in /lib/systemd/system/
   └─ Resolves dependency graph, starts services in PARALLEL
   └─ Mounts /proc /sys /dev /run
   └─ Reaches target: multi-user.target β†’ networking, SSH, your app starts

6. Your service (via systemd unit file)
   └─ Applies cgroup limits (CPU, memory)
   └─ Sets environment, working directory, user
   └─ App is ready to serve traffic
# Debug boot
systemd-analyze          # total boot time
systemd-analyze blame    # slowest services
dmesg | grep -E "error|fail" # kernel boot errors
Linux namespaces and cgroups β€” how do they form Docker/containers?
AmazonGoogle
β˜…β˜…β˜…β–Ό

Namespaces control what a process can SEE. Each namespace type isolates a different resource:

PID nsContainer has its own PID tree starting at 1. Its PID 1 maps to some PID like 8432 on the host.
NET nsOwn network interfaces, IP addresses, routing tables, iptables rules. Container eth0 β‰  host eth0.
MNT nsOwn mount table. Container sees its own / root filesystem.
UTS nsOwn hostname and domain name.
USER nsMaps container UID 0 (root) to unprivileged host UID 100000. Rootless containers.
IPC nsOwn POSIX message queues, semaphores β€” containers can't interfere with each other's IPC.

cgroups (Control Groups) control what a process can USE:

cpuCPU time quota: max 50% of 1 core. cpu.shares for relative weight.
memoryHard RAM limit. OOM behavior when exceeded.
blkioDisk I/O rate limiting (reads/writes per second)
net_clsTag packets for QoS / traffic shaping

Container = namespace isolation + cgroup resource limits + OverlayFS (union filesystem for image layers). Docker and Kubernetes simply automate creating these Linux primitives.
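Namespace membership is directly visible under /proc — two processes are in the same namespace iff their /proc/&lt;pid&gt;/ns symlinks point at the same inode, which is what tools like `lsns` read. A Linux-only sketch:

```python
# Each /proc/<pid>/ns/* entry is a symlink to "<type>:[inode]".
# A containerized process shows DIFFERENT inodes from the host here.
import os

for ns in ("pid", "net", "mnt", "uts", "user", "ipc"):
    link = os.readlink(f"/proc/self/ns/{ns}")
    print(ns, "->", link)        # e.g. pid -> pid:[4026531836]

# And cgroup membership (the "what can it USE" half) for this process:
print(open("/proc/self/cgroup").read().strip())
```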

11

Performance Tuning & Debugging

LINUX
SCENARIO: "Server is slow β€” walk through your systematic diagnosis."
AmazonRazorpayGoldman
β˜…β˜…β˜…β–Ό

Use the USE Method β€” Utilization, Saturation, Errors for every resource:

# 1. CPU
top / htop                     # CPU% per process (P=sort by CPU)
vmstat 1                       # r=runqueue (>nCPUs=saturated) b=blocked wa=IO-wait
# us%=app, sy%=kernel, wa%=IO-wait, id%=idle

# 2. Memory
free -h                        # is 'available' near zero?
vmstat 1 | awk '{print $7,$8}' # si/so > 0 = SWAPPING β†’ major problem!

# 3. Disk I/O
iostat -x 1                    # %util > 80% = disk saturated
iotop                          # which process is hammering disk
# await (ms) = average I/O latency. >10ms for SSD = bad.

# 4. Network
ss -s                          # socket summary stats
netstat -s | grep -E "retransmit|failed" # TCP errors
sar -n DEV 1                   # bandwidth per interface

# 5. Recent errors
dmesg | tail -50               # kernel messages
journalctl -p err --since "1 hour ago"
tail -f /var/log/syslog
Process monitoring: ps, top, lsof, strace β€” key flags and what they tell you.
AmazonRazorpay
β˜…β˜…β–Ό
# ps β€” process snapshot
ps aux                        # all processes, all users
ps aux --sort=-%mem | head    # top memory consumers
ps -eo pid,ppid,cmd,%cpu,%mem # custom columns
# VSZ = virtual memory size (includes not-yet-used areas)
# RSS = Resident Set Size = actual physical RAM used (this is what matters!)

# lsof β€” everything is a file in Linux!
lsof -p <pid>                # all FDs opened by process
lsof -i :8080                # what's using port 8080
lsof +D /var/log             # who has files open in directory
lsof | grep deleted          # deleted files still held open β†’ disk leak!

# strace β€” most powerful: trace every syscall
strace ./app                 # trace all syscalls of new process
strace -p <pid>             # attach to running process (non-invasive)
strace -e trace=openat,read,write,close ./app  # filter syscalls
strace -c ./app              # count + time per syscall (profiling)
strace -T ./app              # show time spent in each syscall
# Use for: "permission denied", hung process, mystery slow I/O
OOM Killer β€” how does Linux decide which process to kill? How to protect a service?
AmazonGoogleGoldman
β˜…β˜…β–Ό

When physical RAM + swap is exhausted, the Linux OOM Killer selects and kills a process to free memory. Goal: maximize freed memory, minimize damage.

oom_score: 0–1000. Higher = more likely killed. In modern kernels it is essentially proportional to the process's memory footprint (RSS + swap + page tables), shifted by oom_score_adj; root-owned processes get a small discount.

# View and adjust OOM scores
cat /proc/<pid>/oom_score        # current score 0-1000
cat /proc/<pid>/oom_score_adj    # adjustment -1000 to +1000

echo -1000 > /proc/<pid>/oom_score_adj  # NEVER kill this process
echo 1000  > /proc/<pid>/oom_score_adj  # kill this first

# In systemd unit file β€” persists across restarts:
OOMScoreAdjust=-500

# Detect OOM kills
dmesg | grep -i "oom killer\|killed process"
journalctl -k --since "1 hour ago" | grep -i oom
Fintech: Never OOM Kill the Trading Engine
Set OOMScoreAdjust=-1000 in the systemd unit for any critical financial process. Set cgroup memory limits on non-critical services so the OOM killer targets those first. Use vm.overcommit_memory=2 to disable overcommit entirely.
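The scores discussed above are plain files under /proc, so a health-check can read them without shelling out — a tiny Linux-only sketch:

```python
# oom_score is recomputed by the kernel on every read;
# oom_score_adj shifts it by up to +/-1000 (-1000 = never kill).
score = int(open("/proc/self/oom_score").read())
adj   = int(open("/proc/self/oom_score_adj").read())
print("oom_score:", score, "adj:", adj)
```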
12

Fintech / HFT β€” Goldman, Citadel, Jane Street Level

FINTECH
How would you tune a Linux server for sub-100Β΅s trade execution latency?
CitadelGoldmanJane St
β˜…β˜…β˜…β–Ό
# 1. CPU Isolation β€” keep trading cores free from OS interrupts
# /etc/default/grub: GRUB_CMDLINE_LINUX=
isolcpus=2,3,4,5 nohz_full=2,3,4,5 rcu_nocbs=2,3,4,5
# isolcpus: scheduler won't place any task here unless explicitly asked
# nohz_full: disable the periodic scheduler tick on these cores (tick jitter eliminated)
# rcu_nocbs: no RCU callbacks on these cores

# 2. CPU Frequency β€” lock to max, disable C-states
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$g"; done
# C-state exit latency: 1-100Β΅s! Keep CPUs in C0 always.

# 3. NUMA-aware placement
numactl --cpunodebind=0 --membind=0 ./trading-engine
# Remote NUMA access: 2-3x slower. Must bind to same socket as NIC.

# 4. Real-time scheduling
chrt -f 99 taskset -c 2 ./order-matching-engine
echo -1000 > /proc/$(pgrep engine)/oom_score_adj

# 5. Memory β€” eliminate page faults entirely
sysctl -w vm.swappiness=0           # no swapping
# In application code:
mlockall(MCL_CURRENT | MCL_FUTURE); # lock ALL pages in RAM forever
# Pre-fault all memory at startup to eliminate runtime faults

# 6. Huge Pages β€” fewer TLB misses
echo 2048 > /proc/sys/vm/nr_hugepages  # 2048 Γ— 2MB = 4GB huge pages

# 7. Network tuning
# TCP_NODELAY (disable Nagle's algo) is per-socket, not a sysctl — set it in code:
# setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
sysctl -w net.core.busy_poll=50         # busy poll for 50Β΅s
# Pin NIC IRQs to non-trading cores
echo 1 > /proc/irq/<eth_irq>/smp_affinity
Impact Priority (biggest first)
1. CPU isolation + SCHED_FIFO (~70% of benefit) Β· 2. mlockall + huge pages (TLB misses) Β· 3. NUMA binding Β· 4. TCP tuning Β· 5. C-state + frequency governor
DPDK and kernel bypass β€” what's wrong with standard Linux networking for HFT?
CitadelGoldman
β˜…β˜…β–Ό
# Standard Linux path β€” every step adds latency:
NIC β†’ IRQ (~1Β΅s) β†’ kernel driver β†’ NAPI softirq
    β†’ kernel TCP/IP stack (~10-30Β΅s) β†’ socket buffer
    β†’ epoll wakeup β†’ syscall (~1Β΅s) β†’ copy to userspace (~1Β΅s)
    β†’ app processes
Total: 30–100Β΅s minimum β€” unacceptable for Β΅s trading

# DPDK (Data Plane Development Kit) β€” kernel bypass:
# Moves NIC driver entirely to user-space
# App busy-polls NIC ring buffer directly β€” NO interrupts
# NIC DMA directly to huge-page user memory β€” zero copies
# Zero syscalls: reads from memory-mapped ring buffer
Result: ~1-5Β΅s end-to-end latency

# RDMA (Remote Direct Memory Access):
# NIC transfers data between machines WITHOUT CPU involvement
# Bypasses kernel on BOTH sender and receiver
Result: <1Β΅s remote memory access
Used for: inter-datacenter order routing, market data distribution
io_uring β€” Middle Ground (Linux 5.1+)
Shared ring buffers between kernel and userspace. Submit I/O requests and reap completions with few or no syscalls (zero in SQPOLL mode). Not as fast as DPDK but far better than traditional sockets. Used in RocksDB and other modern high-performance servers. Good for latency-sensitive work without full kernel-bypass complexity.
What happens step-by-step when you type `curl https://api.razorpay.com/v1/payments`?
RazorpayGoldman
β˜…β˜…β˜…β–Ό
# 1. Shell parsing
Shell tokenizes β†’ checks aliases/builtins β†’ PATH lookup: /usr/bin/curl
fork() child β†’ exec("/usr/bin/curl", args) β†’ ELF loaded by kernel

# 2. DNS Resolution
getaddrinfo("api.razorpay.com") β†’ checks /etc/hosts
β†’ queries nameserver from /etc/resolv.conf (UDP port 53)
β†’ Returns IP: 104.18.x.x

# 3. TCP 3-way handshake
socket(AF_INET, SOCK_STREAM) β†’ allocates fd
connect() β†’ SYN β†’ SYN-ACK β†’ ACK
OS assigns ephemeral port (32768-60999 range)

# 4. TLS Handshake (HTTPS)
ClientHello (cipher suites, TLS 1.3)
β†’ ServerHello + Certificate
β†’ Verify cert against /etc/ssl/certs/ CA bundle
β†’ ECDHE key exchange β†’ session keys derived
β†’ Encrypted channel established (1 RTT for TLS 1.3; 2 for TLS 1.2)

# 5. HTTP Request
"GET /v1/payments HTTP/1.1\r\nHost: api.razorpay.com\r\n..."
write() syscall β†’ kernel TCP buffer β†’ NIC sends

# 6. Response + cleanup
read() β†’ TLS decrypt β†’ HTTP parse β†’ stdout
close() β†’ TCP FIN β†’ connection in TIME_WAIT for ~60s
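Step 1 above — the shell's fork() + exec() dance — in miniature, with /bin/echo standing in for /usr/bin/curl:

```python
# The fork()+exec() pattern every shell uses to launch a command:
# fork() clones the shell; exec() replaces the child's image with the
# binary found via PATH lookup; the parent wait()s for the exit status.
import os

pid = os.fork()
if pid == 0:                          # child: become the command
    os.execv("/bin/echo", ["echo", "hello from exec"])
    os._exit(127)                     # reached only if exec failed
else:                                 # parent (the shell): reap the child
    _, status = os.waitpid(pid, 0)
    code = os.waitstatus_to_exitcode(status)
    print("child exited with", code)  # child exited with 0
```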
