OS Fundamentals
OS CORE

An OS is system software that acts as the interface between hardware and user programs. It manages hardware and provides services to applications.
The kernel is the core of the OS: it runs in privileged mode (Ring 0), manages hardware directly, and handles syscalls.
| Type | What runs in kernel? | Speed | Stability | Example |
|---|---|---|---|---|
| Monolithic | Everything: FS, drivers, networking, scheduler | Fastest (no IPC) | One bug = crash | Linux, Unix |
| Microkernel | Only: IPC, basic scheduling, memory | Slower (IPC overhead) | Very stable | Mach, MINIX, QNX |
| Hybrid | Core in kernel, some drivers outside | Good balance | Good | Windows NT, macOS XNU |
Linux = monolithic but modular: drivers can be loaded/unloaded at runtime via insmod / rmmod (lsmod lists loaded modules) without rebooting, combining monolithic performance with microkernel flexibility.
User Space (Ring 3): Apps run here. Cannot touch hardware directly. One crash = only that process dies.
Kernel Space (Ring 0): Full hardware access. OS runs here. A bug here crashes the whole system.
System call internal flow on x86_64:

```text
# e.g. calling read(fd, buf, 100) in C
1. glibc wrapper loads the syscall number into the RAX register (read = 0 on x86_64)
2. Arguments: RDI=fd, RSI=buf, RDX=count
3. Executes the SYSCALL instruction -> CPU hardware switches Ring 3 -> Ring 0
4. CPU saves user registers to the kernel stack, jumps to the syscall handler
5. Kernel validates args, performs the actual work (e.g. reads from disk)
6. Return value placed in RAX
7. SYSRET instruction -> Ring 0 -> Ring 3, app continues

# Typical cost: 100-500 ns per syscall
```
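The register convention above can be poked at from user space: libc exposes a generic syscall(2) wrapper that loads the number into RAX for you. A minimal sketch (syscall numbers are per-architecture, and 39 = getpid is an x86_64 Linux assumption, so the demo guards on the platform):

```python
import ctypes
import os
import platform

matches = True
if platform.system() == "Linux" and platform.machine() == "x86_64":
    libc = ctypes.CDLL(None, use_errno=True)
    SYS_getpid = 39                  # x86_64 Linux syscall number (goes into RAX)
    pid = libc.syscall(SYS_getpid)   # SYSCALL -> Ring 0 -> SYSRET, result back in RAX
    matches = (pid == os.getpid())   # same answer as the glibc getpid() wrapper
```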
A context switch = CPU saves state of running process A into its PCB (Process Control Block), loads state of process B, runs B.
PCB stores: CPU registers (PC, SP, flags, GPRs), page table pointer (CR3), file descriptor table, signal masks, scheduling state.
Triggers: time quantum expiry (CFS preemption), I/O block (read/write syscall), higher priority task arrives (SCHED_FIFO), voluntary yield (sleep, mutex lock).
```text
# True cost = direct + indirect
Save/restore registers:  ~200-500 ns
TLB flush (if no PCID):  ~500 ns - 5 µs
CPU cache invalidation:  ~1-100 µs (the process's working set is now cold!)
Branch predictor flush:  misprediction spike after the switch
# Total observable cost: 1 µs to 100 µs depending on cache pollution
```

```bash
# Measure context switches
vmstat 1         # 'cs' column = context switches/sec
pidstat -w 1     # per-process voluntary/involuntary switches
perf stat ./app  # 'context-switches' counter
```
HFT pattern: SCHED_FIFO + CPU isolation (isolcpus=2,3 kernel param) + taskset -c 2 ./app. Trading threads never get preempted: zero involuntary context switches.

Process, States & Lifecycle
OS CORE

```text
                     Process States

 New --admitted--> Ready <----I/O complete / event done---- Waiting (Blocked)
                   |   ^                                          ^
        dispatched |   | preempted                                | I/O request
                   v   |                                          |
                  Running ----------------------------------------+
                   |
                 exit()
                   v
               Terminated
```
ps aux -> STAT column: R=Running/Runnable, S=Sleeping (interruptible), D=Uninterruptible sleep (I/O wait), Z=Zombie, T=Stopped. D-state processes are stuck waiting for disk; if you have many, your disk is the bottleneck.

| Aspect | Process | Thread |
|---|---|---|
| Address Space | Separate β own virtual memory map | Shared β same virtual address space |
| Heap | Private heap | Shared heap (race conditions possible!) |
| Stack | Own stack | Each thread has its own stack |
| File Descriptors | Own table (after fork) | Shared β close in one affects all! |
| Code/Data segments | Own copy (CoW after fork) | Shared β changes visible immediately |
| Creation cost | Heavy β fork() copies page tables | Light β clone() with CLONE_VM flag |
| Context switch | Expensive β TLB flush, new page tables | Cheaper β same address space, less TLB pressure |
| Crash impact | Other processes survive | One crash = entire process dies |
| Communication | IPC required (pipes, sockets, shm) | Direct via shared memory (fast, needs sync) |
| Linux kernel impl | fork() β new task_struct | clone(CLONE_VM|CLONE_FILES|...) β task_struct shared |
Both are created via clone(): pthread_create() calls clone() with sharing flags internally.

fork(): Creates an exact copy of the calling process. Returns 0 to the child and the child's PID to the parent. Uses Copy-on-Write (CoW).
Copy-on-Write: After fork(), parent and child share the same physical pages marked read-only. When either writes β page fault β kernel copies just that page privately. Only pages actually written get duplicated. Makes fork() fast even for large processes.
exec(): Replaces current process image with a new program. Code, data, heap, stack all replaced. FDs stay open (unless FD_CLOEXEC set).
wait(): Parent blocks until child exits and collects exit status. Essential to prevent zombies.
```text
# Classic shell fork-exec-wait pattern (pseudocode)
pid = fork()
if pid == 0:                 # CHILD
    execv("/bin/ls", args)   # replace child with ls
else:                        # PARENT
    wait(&status)            # collect exit status, prevent zombie
```

```bash
# Find zombies
ps aux | grep ' Z '
ps -eo pid,ppid,stat,cmd | awk '$3~/Z/{print}'
```
| Signal | Num | Catchable? | Default Action | Use case |
|---|---|---|---|---|
| SIGTERM | 15 | Yes | Terminate | Graceful shutdown: app cleans up, closes DB, flushes buffers |
| SIGKILL | 9 | Never | Instant kill | Force kill. Data loss possible. Last resort only. |
| SIGHUP | 1 | Yes | Terminate | Daemons reload config without restart: kill -HUP nginx_pid |
| SIGINT | 2 | Yes | Terminate | Ctrl+C in terminal |
| SIGSEGV | 11 | Yes | Core dump | Invalid memory access (null pointer dereference, buffer overflow) |
| SIGCHLD | 17 | Yes | Ignore | Child state changed: parent should call wait() here |
| SIGSTOP | 19 | Never | Pause | Suspend process. Ctrl+Z sends SIGTSTP (the catchable version) |
| SIGUSR1/2 | 10,12 | Yes | Terminate | User-defined. Apps use them for custom triggers (log rotation, stats dump) |
```bash
# Graceful shutdown pattern (used in Kubernetes pod termination)
kill -15 <pid>             # Step 1: SIGTERM - request graceful stop
sleep 30                   # Step 2: give it 30 seconds to finish
kill -9 <pid> 2>/dev/null  # Step 3: force if still running

# In shell scripts: trap signals
trap 'echo Cleaning up; rm -f /tmp/lock; exit 0' SIGTERM SIGINT
```
Threads, Synchronization & Race Conditions
OS CORE

Race Condition: Two threads read-modify-write shared data concurrently. counter++ is 3 ops (LOAD, ADD, STORE), so it is not atomic, and the result is unpredictable.
In Java, a synchronized block is a monitor: a cleaner API than raw pthreads.

```c
/* Mutex (C pthreads) */
pthread_mutex_lock(&m);
shared_counter++;   /* critical section */
pthread_mutex_unlock(&m);

/* Semaphore - binary (works like a mutex) or counting */
sem_init(&s, 0, 5); /* allows 5 concurrent holders */
sem_wait(&s);       /* acquire (decrement, block if 0) */
sem_post(&s);       /* release (increment) */
```

```cpp
// Atomic (lock-free, no OS involvement)
std::atomic<int> cnt{0};
cnt.fetch_add(1, std::memory_order_seq_cst);  // thread-safe increment
```
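A self-contained illustration using Python threading (my choice of language, not the document's pthreads): without the lock, the `counter += 1` read-modify-write can interleave and lose updates; with the lock every increment is atomic, so the final count is exact:

```python
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:       # serializes the LOAD / ADD / STORE triple
            counter += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 400000 - no lost updates
```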
Lock-free: Thread-safe without mutexes. Uses hardware atomic instructions. No blocking = no deadlock, no priority inversion, better cache performance.
CAS (Compare-And-Swap): A single hardware instruction: "If *ptr == expected, atomically set *ptr = new_val and return true; otherwise return false." The foundation of ALL lock-free algorithms.
```cpp
// C++ lock-free counter using a CAS loop
std::atomic<int> counter{0};
int old_val, next;
do {
    old_val = counter.load(std::memory_order_relaxed);
    next = old_val + 1;
} while (!counter.compare_exchange_weak(old_val, next));
// If counter changed between load() and the CAS -> retry
```

```java
// Java equivalent
AtomicInteger cnt = new AtomicInteger(0);
cnt.incrementAndGet(); // internally uses a CAS loop
```
ABA Problem: Thread 1 reads value A. Thread 2 changes A -> B -> A. Thread 1's CAS then sees A and thinks nothing changed, but the state actually did. Classic in lock-free linked lists: a node is removed and re-inserted at the same address.
Fix: Pair the value with a version counter and CAS on (value, version) together. The version only ever increments, so ABA becomes impossible. Java: AtomicStampedReference. C++: tagged pointer.
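A toy analogue of the versioned-CAS idea (class and method names are mine, and a Python lock stands in for the hardware CAS instruction, which Python does not expose): the stale CAS fails even though the value looks unchanged.

```python
import threading

class StampedRef:
    """CAS on a (value, version) pair; the version makes ABA visible."""
    def __init__(self, value):
        self._lock = threading.Lock()  # stand-in for the atomic hardware instruction
        self.value, self.version = value, 0

    def load(self):
        with self._lock:
            return self.value, self.version

    def cas(self, exp_value, exp_version, new_value):
        with self._lock:
            if (self.value, self.version) == (exp_value, exp_version):
                self.value, self.version = new_value, self.version + 1
                return True
            return False

ref = StampedRef("A")
val, ver = ref.load()        # thread 1 snapshots (A, version 0)
ref.cas("A", 0, "B")         # thread 2: A -> B  (version 1)
ref.cas("B", 1, "A")         # thread 2: B -> A  (version 2) - classic ABA
ok = ref.cas(val, ver, "C")  # thread 1's stale CAS: value matches, version does not
print(ok)  # False - the ABA change was detected
```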
C++ memory orders: relaxed = no ordering, fastest. acquire/release = one-way barriers for producer-consumer. seq_cst = full fence, safest, the default. HFT code uses relaxed where provably correct and seq_cst at key synchronization points.

1. Producer-Consumer (Bounded Buffer): A producer adds items to a fixed-size buffer; a consumer removes them. Requirements: block the producer when full, block the consumer when empty, mutual exclusion on buffer access.
```text
# Solution: 3 semaphores
mutex = Semaphore(1)   # mutual exclusion on the buffer
empty = Semaphore(N)   # count of empty slots  (start = N)
full  = Semaphore(0)   # count of filled slots (start = 0)

Producer: wait(empty) -> wait(mutex) -> add item -> signal(mutex) -> signal(full)
Consumer: wait(full)  -> wait(mutex) -> remove   -> signal(mutex) -> signal(empty)
```
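The three-semaphore recipe translates directly to Python's threading module. A runnable sketch (buffer size 4 and 20 items are arbitrary choices of mine):

```python
import threading
from collections import deque

N = 4                           # buffer capacity
buf = deque()
mutex = threading.Semaphore(1)  # mutual exclusion on the buffer
empty = threading.Semaphore(N)  # empty slots
full = threading.Semaphore(0)   # filled slots
consumed = []

def producer(items):
    for x in items:
        empty.acquire()         # wait for a free slot
        with mutex:
            buf.append(x)
        full.release()          # announce a filled slot

def consumer(n):
    for _ in range(n):
        full.acquire()          # wait for an item
        with mutex:
            consumed.append(buf.popleft())
        empty.release()         # announce a free slot

p = threading.Thread(target=producer, args=(list(range(20)),))
c = threading.Thread(target=consumer, args=(20,))
p.start(); c.start(); p.join(); c.join()
print(consumed == list(range(20)))  # True - nothing lost, order preserved
```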
2. Dining Philosophers: 5 philosophers, 5 forks. Each needs 2 forks to eat. Risk: all pick up their left fork simultaneously -> deadlock.
Fix options: Allow at most 4 philosophers to sit simultaneously. OR odd philosopher picks left first, even picks right first (breaks circular wait). OR use a monitor/arbitrator.
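The odd/even trick generalizes to a global lock ordering: always grab the lower-numbered fork first, which breaks circular wait. A runnable sketch (round counts are arbitrary choices of mine):

```python
import threading

N = 5
forks = [threading.Lock() for _ in range(N)]
meals = [0] * N

def philosopher(i, rounds):
    # Global ordering: lower-numbered fork first -> no circular wait possible
    first, second = sorted((i, (i + 1) % N))
    for _ in range(rounds):
        with forks[first]:
            with forks[second]:
                meals[i] += 1

threads = [threading.Thread(target=philosopher, args=(i, 100)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(meals)  # every philosopher finished all 100 rounds - no deadlock
```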
3. Reader-Writer Problem: Multiple readers can read simultaneously. Only one writer at a time (exclusive), no readers while writing.
```c
/* Solution: RW lock (pthread_rwlock_t) */
pthread_rwlock_rdlock(&rw);  /* acquire read lock (multiple readers allowed) */
pthread_rwlock_wrlock(&rw);  /* acquire write lock (exclusive) */
pthread_rwlock_unlock(&rw);
```
CPU Scheduling Algorithms
OS CORE

Key Metrics: Waiting Time = time spent in the ready queue. Turnaround Time = total time from arrival to completion. Throughput = processes completed per second. Response Time = time from arrival to first CPU allocation.
```text
# Example: 3 processes P1(burst=10ms), P2(burst=5ms), P3(burst=8ms), all arriving at t=0

# 1. FCFS (First Come First Served) - non-preemptive
Order: P1 -> P2 -> P3
Timeline: |---P1 10ms---|--P2 5ms--|----P3 8ms----|
Waiting: P1=0, P2=10, P3=15  ->  Avg WT = 8.33ms
Problem: Convoy effect - short jobs wait behind long ones!

# 2. SJF (Shortest Job First) - non-preemptive
Order: P2(5ms) -> P3(8ms) -> P1(10ms)
Timeline: |-P2 5ms--|---P3 8ms---|-----P1 10ms-----|
Waiting: P2=0, P3=5, P1=13  ->  Avg WT = 6ms
Optimal: minimum average waiting time. But it needs future knowledge of burst times.

# 3. SRTF (Shortest Remaining Time First) - preemptive SJF
Preempts the running process if a new arrival has shorter remaining time.
Better response time, but more context switches; starvation possible.

# 4. Round Robin (RR) - preemptive, time quantum q
q = 4ms. Cycle: P1(4) -> P2(4) -> P3(4) -> P1(4) -> P2(1) -> P3(4) -> P1(2)
Timeline: |P1 4|P2 4|P3 4|P1 4|P2 1|P3 4|P1 2|
Good for time-sharing and fairness.
q too small = too many context switches; q too large = degenerates to FCFS.
Typically q = 10-100ms.

# 5. Priority Scheduling
Each process has a priority; the highest-priority process runs first.
Problem: Starvation - low-priority processes may never run.
Fix: Aging - gradually increase the priority of waiting processes.
```
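The FCFS and SJF numbers above are easy to verify mechanically: with all arrivals at t=0, each job's waiting time is just the sum of the bursts scheduled before it (helper name is mine):

```python
def waiting_times(bursts):
    """Waiting time per job when run in the given order, all arriving at t=0."""
    waits, clock = [], 0
    for burst in bursts:
        waits.append(clock)  # the job waits for everything scheduled before it
        clock += burst
    return waits

fcfs = waiting_times([10, 5, 8])         # P1, P2, P3 in arrival order
sjf = waiting_times(sorted([10, 5, 8]))  # shortest job first: P2, P3, P1

print(fcfs, sum(fcfs) / 3)  # [0, 10, 15] -> avg 8.33ms
print(sjf, sum(sjf) / 3)    # [0, 5, 13]  -> avg 6.0ms
```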
| Algorithm | Preemptive? | Starvation? | Best For | Real OS |
|---|---|---|---|---|
| FCFS | No | No | Batch systems | Mainframes |
| SJF | No | Yes | Theoretical optimal | Benchmark |
| SRTF | Yes | Yes | Minimizing wait time | Some batch |
| Round Robin | Yes | No | Time-sharing, interactive | Linux CFS basis |
| Priority | Both | Yes (fix: aging) | Real-time systems | Linux SCHED_FIFO |
| Multilevel Queue | Yes | Yes | Mixed workloads | Windows, Linux |
CFS (since Linux 2.6.23) core idea: give every process a fair share of CPU proportional to its weight. No fixed time quanta.
vruntime: Each task tracks virtual runtime: how much CPU time it has received, weighted by priority. Lower nice value = higher weight = vruntime grows more slowly = more CPU time.
Red-Black Tree: All runnable tasks sit in an RB-tree sorted by vruntime. O(log n) insert/delete. The leftmost node = smallest vruntime = runs next, and it is cached, so picking the next task is O(1).
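CFS's pick-next rule can be sketched with a plain min-heap in place of the RB-tree (same ordered-by-vruntime property). The weights below are made-up illustrative numbers, not the kernel's real nice-to-weight table:

```python
import heapq

tasks = [("engine", 4), ("report", 1)]  # (name, weight): higher weight = more CPU
queue = [(0.0, name, w) for name, w in tasks]
heapq.heapify(queue)
cpu_ms = {name: 0 for name, _ in tasks}

for _ in range(1000):                    # 1000 scheduler ticks of 1ms each
    vrt, name, w = heapq.heappop(queue)  # smallest vruntime runs next
    cpu_ms[name] += 1
    # vruntime is charged inversely to weight: heavy tasks age more slowly
    heapq.heappush(queue, (vrt + 1.0 / w, name, w))

print(cpu_ms)  # CPU split 4:1 in favor of the heavier task
```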
```bash
# nice values: -20 (highest priority) to +19 (lowest)
nice -n -20 ./trading-engine  # max priority for a normal (CFS) task
nice -n 19 ./batch-report     # lowest priority, runs mostly when the CPU is otherwise idle
renice -5 -p <pid>            # change priority of a running process
```
| Scheduler Class | Type | Use Case | Priority |
|---|---|---|---|
| SCHED_FIFO | Real-time, no preemption by same level | HFT trading threads, audio drivers | 1-99 (RT) |
| SCHED_RR | Real-time with time quantum | RT with fairness among same-prio tasks | 1-99 (RT) |
| SCHED_OTHER | Normal CFS | Regular applications | nice -20 to 19 |
| SCHED_DEADLINE | EDF with hard deadlines | Periodic RT tasks with guaranteed latency | Above RT |
| SCHED_IDLE | Lowest | Background housekeeping tasks | Below all |
```bash
# Set real-time scheduling (root required)
chrt -f 99 ./hft-engine  # SCHED_FIFO priority 99 (max)
chrt -p <pid>            # check current scheduling policy
```
Deadlock: Prevention, Avoidance, Detection
OS CORE

Deadlock: Process A holds Resource 1 and waits for Resource 2. Process B holds Resource 2 and waits for Resource 1. Both wait forever.
4 Coffman conditions, ALL of which must hold simultaneously:

1. Mutual exclusion: at least one resource is non-shareable
2. Hold and wait: a process holds one resource while waiting for another
3. No preemption: resources cannot be forcibly taken away
4. Circular wait: a cycle of processes, each waiting for the next one's resource

Prevention = break any one condition: acquire all locks upfront (no hold-and-wait), impose a global lock ordering (no circular wait), or allow locks to be stolen or timed out (adds preemption).
Timeout-based prevention: pthread_mutex_trylock(): if the lock can't be acquired, back off and retry. Or use a timeout: if not acquired in N ms, release everything and retry.

```bash
# Detect deadlocks in production Linux
pstack <pid>                            # all thread stacks
gdb -p <pid> -ex "thread apply all bt"  # backtraces
cat /proc/<pid>/wchan                   # which kernel function the thread waits on
jstack <java_pid> | grep -A20 BLOCKED   # Java thread dump
valgrind --tool=helgrind ./app          # detect deadlocks + data races
```
Banker's Algorithm is a deadlock avoidance algorithm. Named after bank lending: a bank never lends money that would prevent it from satisfying all customers' needs.
Before granting a resource request, OS simulates the allocation and checks if the resulting state is "safe" β i.e., there exists a sequence in which all processes can complete using available resources.
```text
# Example: 3 processes, 1 resource type (12 units total)
      Max   Allocated   Need      Available
P1:   10        5        5        12-5-2-3 = 2 units free
P2:    4        2        2
P3:    9        3        6

# Is this state safe? Try to find a safe sequence:
# Available=2. P2 needs 2 -> grant it. P2 finishes -> releases its 2. Available=4.
# P1 needs 5 -> not enough. P3 needs 6 -> not enough.
# No remaining process can finish -> state is UNSAFE.
# The banker refuses any request that would lead into this state.
# A safe sequence looks like: P2 -> P1 -> P3 (if resources allow)
```
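The safety check is short enough to code directly for the single-resource case above (function and variable names are mine):

```python
def is_safe(available, max_need, allocated):
    """Banker's safety check: can every process finish in some order?"""
    need = [m - a for m, a in zip(max_need, allocated)]
    done = [False] * len(need)
    free = available
    progressed = True
    while progressed:
        progressed = False
        for i, n in enumerate(need):
            if not done[i] and n <= free:
                free += allocated[i]  # process i finishes, returns its allocation
                done[i] = True
                progressed = True
    return all(done)

# The example state: 12 units total, Available = 12 - 5 - 2 - 3 = 2
print(is_safe(2, [10, 4, 9], [5, 2, 3]))  # False - only P2 can finish, then stuck
# A hypothetical variant with more headroom is safe (P2 -> P3 -> P1):
print(is_safe(4, [10, 4, 9], [5, 2, 3]))  # True
```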
Memory Management, Paging & Virtual Memory
OS CORE

Fragmentation: internal = wasted space inside an allocated unit (e.g. a 4KB page holding only a few bytes); external = free memory exists but is split into small non-contiguous holes.

Fixes:
- Internal: use a smaller page size (less waste per frame), or a slab allocator (the kernel uses this for its own objects)
- External: compaction (move processes together, expensive!), paging (avoids it entirely), best-fit / first-fit allocation strategies
Paging vs Segmentation:
| Aspect | Paging | Segmentation |
|---|---|---|
| Division | Fixed-size pages (4KB typical) | Variable-size logical segments (code, data, stack) |
| Fragmentation | Internal (last page may be partially used) | External (segments of varying sizes) |
| User visibility | Transparent to user/programmer | Visible (programmer manages segments) |
| Used by | Modern OSes (Linux, Windows) | Old systems, Intel x86 in protected mode |
Virtual memory gives each process an illusion of its own large, private address space. The MMU (Memory Management Unit) translates virtual β physical addresses using page tables.
```text
# x86_64: 4-level page table walk (48-bit virtual address)
VA: [PML4 9 bits][PDPT 9 bits][PD 9 bits][PT 9 bits][Offset 12 bits]
Each 9-bit index selects an entry at one level of the table tree;
the final entry points to the physical frame, plus the 12-bit offset.

# TLB = Translation Lookaside Buffer
TLB hit:  1-2 cycles (virtual -> physical from cache)
TLB miss: 4 extra memory accesses for a full page table walk = ~100 cycles
# A context switch may flush the TLB! (unless the CPU has the PCID extension)

# Huge Pages = 2MB or 1GB pages
# Fewer TLB entries needed -> fewer misses -> better perf for large workloads
echo 1024 > /proc/sys/vm/nr_hugepages  # allocate huge pages
cat /proc/meminfo | grep Huge          # check usage
```
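The 9/9/9/9/12 split is pure bit arithmetic, so it can be checked directly (function name is mine):

```python
def split_va(va):
    """Decompose a 48-bit x86_64 virtual address into its page-walk indices."""
    offset = va & 0xFFF          # low 12 bits: byte offset within the 4KB page
    pt     = (va >> 12) & 0x1FF  # each table level indexes with 9 bits (512 entries)
    pd     = (va >> 21) & 0x1FF
    pdpt   = (va >> 30) & 0x1FF
    pml4   = (va >> 39) & 0x1FF
    return pml4, pdpt, pd, pt, offset

# Build an address from known indices, then recover them
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_va(va))  # (1, 2, 3, 4, 5)
```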
Page Fault: The CPU accesses a virtual address not currently mapped to a physical frame -> hardware raises a page-fault exception -> the OS fault handler runs: if the access is valid (demand-paged, swapped out, or CoW), it allocates a frame, loads or copies the page, updates the page table, and retries the instruction; if invalid, the process gets SIGSEGV.
Demand Paging: Pages are loaded into RAM only when accessed (on demand), not all at program start. Pages not yet accessed = not in memory. When accessed β page fault β load from disk. Saves RAM, allows running programs larger than physical memory.
Thrashing: System spends MORE time swapping pages than executing processes. Happens when sum of all working sets exceeds physical RAM. CPU utilization collapses to near 0% while disk I/O is 100%.
```bash
# Detect thrashing
vmstat 1                      # 'si' (swap-in) and 'so' (swap-out) > 0 = swapping
free -h                       # available memory near 0
cat /proc/vmstat | grep pswp  # pswpin/pswpout = swap activity
sar -B 1                      # page fault rate
```
vm.swappiness=1 (Linux): discourage swapping for fintech workloads.

Page Replacement Algorithms
OS CORE

Reference String: 7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2 with Frames = 3
```text
# 1. FIFO - evict the page that was loaded EARLIEST
Ref: 7 0 1 2 0 3 0 4 2 3 0 3 2
F1:  7 7 7 2 2 2 2 4 4 4 0 0 0
F2:  - 0 0 0 0 3 3 3 2 2 2 2 2
F3:  - - 1 1 1 1 0 0 0 3 3 3 3
PF:  x x x x - x x x x x x - -
Total Page Faults = 10

# 2. LRU (Least Recently Used) - evict the page unused for the longest time
Ref: 7 0 1 2 0 3 0 4 2 3 0 3 2
F1:  7 7 7 2 2 2 2 4 4 4 0 0 0
F2:  - 0 0 0 0 0 0 0 0 3 3 3 3
F3:  - - 1 1 1 3 3 3 2 2 2 2 2
PF:  x x x x - x - x x x x - -
Total Page Faults = 9 (better than FIFO!)

# 3. Optimal - evict the page not needed for the longest time IN THE FUTURE
Ref: 7 0 1 2 0 3 0 4 2 3 0 3 2
F1:  7 7 7 2 2 2 2 2 2 2 2 2 2
F2:  - 0 0 0 0 0 0 4 4 4 0 0 0
F3:  - - 1 1 1 3 3 3 3 3 3 3 3
PF:  x x x x - x - x - - x - -
Total Page Faults = 7 (the minimum possible - a theoretical ideal, needs future knowledge)
```
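The fault counts above can be reproduced with a few lines of simulation (helper names are mine):

```python
from collections import OrderedDict

def fifo_faults(refs, frames):
    queue, faults = [], 0
    for page in refs:
        if page not in queue:
            faults += 1
            if len(queue) == frames:
                queue.pop(0)       # evict the page loaded earliest
            queue.append(page)
    return faults

def lru_faults(refs, frames):
    recent, faults = OrderedDict(), 0
    for page in refs:
        if page in recent:
            recent.move_to_end(page)        # a hit refreshes recency
        else:
            faults += 1
            if len(recent) == frames:
                recent.popitem(last=False)  # evict least-recently used
            recent[page] = True
    return faults

refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2]
print(fifo_faults(refs, 3), lru_faults(refs, 3))  # 10 9
```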
Belady's Anomaly: In some page replacement algorithms, increasing the number of page frames can actually INCREASE page faults. Counter-intuitive: more RAM, worse performance!
FIFO suffers from it. LRU and Optimal do NOT: they are "stack algorithms", meaning the set of resident pages with N frames is always a subset of the set with N+1 frames.
```text
# Reference String: 1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5

# With 3 frames (FIFO) -> 9 page faults:
Ref: 1 2 3 4 1 2 5 1 2 3 4 5
F1:  1 1 1 4 4 4 5 5 5 5 5 5
F2:  - 2 2 2 1 1 1 1 1 3 3 3
F3:  - - 3 3 3 2 2 2 2 2 4 4
PF:  x x x x x x x - - x x -   -> 9 faults

# With 4 frames (FIFO) -> 10 page faults! (MORE faults with more RAM!)
Ref: 1 2 3 4 1 2 5 1 2 3 4 5
F1:  1 1 1 1 1 1 5 5 5 5 4 4
F2:  - 2 2 2 2 2 2 1 1 1 1 5
F3:  - - 3 3 3 3 3 3 2 2 2 2
F4:  - - - 4 4 4 4 4 4 3 3 3
PF:  x x x x - - x x x x x x   -> 10 faults
```
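Belady's anomaly reproduces with the same FIFO simulation run at two frame counts:

```python
def fifo_faults(refs, frames):
    """Count page faults for FIFO replacement with the given frame count."""
    queue, faults = [], 0
    for page in refs:
        if page not in queue:
            faults += 1
            if len(queue) == frames:
                queue.pop(0)  # FIFO eviction: oldest-loaded page goes first
            queue.append(page)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(refs, 3))  # 9
print(fifo_faults(refs, 4))  # 10 - more frames, MORE faults
```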
File System, IPC & Storage
LINUX

An inode (index node) stores ALL metadata about a file except its name and its data. Every file has exactly one inode number.
```bash
# Hard Link - another directory entry pointing to the SAME inode
ln file.txt hardlink.txt
ls -i file.txt hardlink.txt  # both show the same inode number!
# File exists until ALL hard links are deleted (link count = 0)
# Cannot cross filesystems. Cannot link directories.

# Soft/Symbolic Link - a file containing a PATH string
ln -s /opt/java/bin/java /usr/bin/java  # symlink
ls -la /usr/bin/java                    # shows: /usr/bin/java -> /opt/java/bin/java
# Different inode. CAN cross filesystems. BREAKS if the target is deleted.

stat file.txt   # show all inode info
ls -i file.txt  # show inode number
df -i /var/log  # inode usage (can be full even with disk space left!)
```
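The inode-sharing behavior is observable from Python's os module; a sketch that runs in a throwaway temp directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "file.txt")
    hard = os.path.join(d, "hardlink.txt")
    with open(original, "w") as f:
        f.write("data")

    os.link(original, hard)  # second directory entry for the SAME inode
    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    nlink = os.stat(original).st_nlink  # link count is now 2

    os.remove(original)      # drops one link; the data survives via the other
    with open(hard) as f:
        survives = f.read()

print(same_inode, nlink, survives)  # True 2 data
```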
Inode exhaustion: a disk can be "full" even with free space remaining; check df -i. Happens on mail/log servers with millions of tiny files, since each file consumes one inode regardless of its size.

| Mechanism | Speed | Scope | Persistence | When to Use |
|---|---|---|---|---|
| Unnamed Pipe | Fast | Parent-child only | No | Shell pipelines: ls \| grep. Unidirectional, byte stream. |
| Named Pipe (FIFO) | Fast | Same machine | Until deleted | Unrelated processes on same host. File in /tmp. |
| POSIX Msg Queue | Medium | Same machine | Until removed | Typed, priority messages. Structured communication. |
| Shared Memory | Fastest | Same machine | Until unmapped | High-throughput: market feeds, game engines. Needs separate lock! |
| Unix Domain Socket | Very Fast | Same machine | No | Docker <-> host, nginx <-> php-fpm. Full duplex, permissions. |
| TCP Socket | Medium | Any host | No | Distributed systems, microservices. Reliable, ordered. |
| mmap'd file | Fastest | Same machine | File lifetime | Zero-copy DB files (LMDB), persistent shared state, HFT ring buffers. |
| Model | Thread blocks? | How | Used in |
|---|---|---|---|
| Blocking I/O | Yes | Sleeps until data ready. Simple but wastes threads. | Traditional servers |
| Non-blocking | No | Returns EAGAIN immediately if not ready. App polls. | With epoll/select |
| select/poll | Yes (in call) | Check N FDs, return when any ready. O(n) scan. | Old servers (<1000 connections) |
| epoll | Yes (in call) | Kernel callback on ready FDs. O(1). Scales to 10M connections. | Nginx, Redis, Node.js |
| Async (io_uring) | Never | Submit + completion ring buffers in shared memory. Near-zero syscalls. | Modern high-perf servers |
```text
# Why epoll is O(1):
epoll_create()      # Creates: RB-tree (all monitored FDs) + ready list
epoll_ctl(ADD, fd)  # Adds FD to the RB-tree: O(log n). Registers a callback in the kernel.
# When the NIC gets data -> IRQ -> kernel callback fires -> FD added to ready list: O(1)
n = epoll_wait(events)  # Returns ONLY ready FDs: O(k) where k = ready count
# Total: cost independent of the number of monitored connections!

# select: O(n) scan of all FDs on each call. 100k connections = 100k checks every call.
```
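Python's selectors module picks epoll on Linux (and kqueue/poll elsewhere), so the readiness model can be demoed portably over a socketpair:

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # EpollSelector on Linux
a, b = socket.socketpair()
sel.register(b, selectors.EVENT_READ, data="peer")

a.send(b"hello")  # makes b readable -> b lands on the ready list
ready = [(key.data, key.fileobj.recv(16)) for key, _ in sel.select(timeout=1)]
print(ready)  # [('peer', b'hello')]

sel.unregister(b)
a.close()
b.close()
```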
Linux Commands β Basic to Intermediate
COMMANDS

```bash
# Navigation
pwd          # print current directory
cd /var/log  # go to absolute path
cd ..        # one level up
cd -         # toggle to previous directory
cd ~         # go home

# Listing
ls -la   # long format + hidden files (most used combo)
ls -ltr  # sort by time, oldest first - great for logs!
ls -lhS  # human-readable sizes, largest first

# Create files and directories
touch file.txt                    # create empty file (or update timestamp)
mkdir -p a/b/c/d                  # create nested dirs in one shot
mkdir -p /opt/{logs,config,data}  # multiple subdirs via brace expansion

# View files
cat file.txt                  # print entire file
cat -n file.txt               # print with line numbers
head -50 file.txt             # first 50 lines
tail -100 app.log             # last 100 lines
tail -f /var/log/syslog       # FOLLOW live - most used for monitoring!
tail -f app.log | grep ERROR  # follow + filter in real time
less app.log                  # navigate: /pattern=search  n=next  G=end  q=quit
```
```bash
# Copy / Move / Delete
cp -r src/ dest/    # recursive copy (required for directories)
cp -p file backup   # preserve timestamps + permissions
mv old.txt new.txt  # rename file
mv file.txt /tmp/   # move to directory
rm -rf dirname/     # force-delete directory tree (NO prompt! Be careful!)
rm -i file.txt      # interactive - asks before delete

# Pipes - send stdout of cmd1 as stdin to cmd2
cat app.log | grep ERROR | wc -l   # count error lines
ps aux | grep java | grep -v grep  # find java processes
ls -l | sort -k5 -rn | head -10    # top 10 largest files

# I/O Redirection
cmd > file.txt                # stdout to file (OVERWRITE)
cmd >> file.txt               # stdout to file (APPEND)
cmd 2> err.log                # stderr to file
cmd > out.log 2>&1            # both stdout AND stderr to file
cmd > /dev/null 2>&1          # run silently - discard everything
./script.sh | tee output.log  # see AND save output simultaneously
```
What does command > /dev/null 2>&1 mean? /dev/null discards everything written to it. > redirects stdout there. 2>&1 redirects fd 2 (stderr) to wherever fd 1 (stdout) currently points, which is also /dev/null. Result: completely silent. Standard in cron jobs.

```bash
# grep - search inside files
grep -r "ERROR" /var/log/   # recursive search
grep -in "error" app.log    # case-insensitive + line numbers
grep -v "DEBUG" app.log     # invert - show lines NOT matching
grep -E "ERROR|WARN|FATAL"  # OR pattern (extended regex)
grep -A3 -B3 "exception"    # 3 lines before and after each match
grep -c "timeout" app.log   # count matching lines only

# find - locate files by any attribute
find . -name "*.log" -mtime -7  # .log files modified in last 7 days
find . -type f -size +100M      # regular files > 100MB
find /tmp -mtime +30 -delete    # delete files older than 30 days
find . -name "*.java" -exec grep -l "TODO" {} \;  # grep inside found files

# awk - column/field processing
awk '{print $1, $3}' file         # print columns 1 and 3
awk -F: '{print $1}' /etc/passwd  # colon-separated - print usernames
ps aux | awk '$3 > 50 {print $1,$3}'          # processes >50% CPU
awk '{sum+=$2} END{print "Total:",sum}' file  # sum a column

# sed - stream find & replace
sed 's/old/new/g' file                  # replace all occurrences
sed -i 's/localhost/prod-db/g' cfg.yml  # in-place edit (modifies the file!)
sed -n '5,15p' file                     # print only lines 5-15
sed '/^#/d' config.conf                 # delete comment lines
```
```bash
# Permission format: -rwxr-xr-x
# Type: - file  d dir  l symlink
# User|Group|Other, with r=4 w=2 x=1
# 755 = rwxr-xr-x | 644 = rw-r--r-- | 600 = rw-------
chmod 755 script.sh   # owner=rwx, group+others=rx
chmod 644 config.yml  # owner=rw, others=read only
chmod +x deploy.sh    # add execute bit for everyone
chmod o-w file.txt    # remove write for others
chown appuser:appgroup file           # change owner and group
chown -R www-data:www-data /var/www/  # recursive

# Special bits
chmod 4755 /usr/bin/sudo  # SUID: executable runs as the FILE OWNER (root)
                          # e.g. passwd is SUID root so any user can change their pw
chmod 2755 /shared/       # SGID dir: new files inherit the dir's group, not the creator's
chmod 1777 /tmp           # Sticky: only a file's owner can delete it

# Find SUID files (security audit!)
find / -perm -4000 -type f 2>/dev/null  # all SUID binaries
```
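The r=4 / w=2 / x=1 arithmetic in the comments can be captured in a tiny converter (function name is mine; it ignores the SUID/SGID/sticky bits):

```python
def mode_to_octal(rwx):
    """'rwxr-xr-x' -> '755': each user/group/other triplet sums r=4, w=2, x=1."""
    bits = {"r": 4, "w": 2, "x": 1, "-": 0}
    return "".join(str(sum(bits[c] for c in rwx[i:i + 3])) for i in (0, 3, 6))

print(mode_to_octal("rwxr-xr-x"))  # 755
print(mode_to_octal("rw-r--r--"))  # 644
print(mode_to_octal("rw-------"))  # 600
```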
```bash
# Step 1: Which process?
top -c  # press P to sort by CPU. Note the PID.
ps aux --sort=-%cpu | head -5

# Step 2: Which THREAD? (critical for Java - GC threads, hot-path threads)
top -H -p <pid>  # H = show threads. Note the TID of the hottest thread.

# Step 3: User CPU vs System CPU?
# 'us' high -> hot code in the app (algorithm, hot loop)
# 'sy' high -> too many syscalls or kernel work
# 'wa' high -> I/O bottleneck (disk, network)

# Step 4: Java thread dump - match TID to thread
printf "%x\n" <TID>                # convert TID to hex: e.g. 1234 -> 4d2
jstack <pid> | grep -A 20 "0x4d2"  # find the thread in the dump, read its stack

# Step 5: Profile to find hot functions
perf top -p <pid>  # live function-level CPU profile
perf record -F 99 -p <pid> -- sleep 30
perf report        # interactive flamegraph-like report

# Step 6: Check GC pressure
jstat -gcutil <pid> 1000  # GC stats every 1 second
# If FGC (full GC) is frequent -> heap too small or memory leak
```
Linux Internals β Boot, /proc, Containers
LINUX

```text
1. BIOS / UEFI
   - POST: Power-On Self Test; detect CPU, RAM, PCI devices
   - UEFI: finds the EFI System Partition -> loads the bootloader
2. GRUB2 Bootloader
   - Reads /boot/grub/grub.cfg
   - Loads kernel image (/boot/vmlinuz) + initrd into RAM
   - Passes kernel cmdline: root=/dev/sda1 ro quiet isolcpus=2,3
3. Kernel Initialization
   - Decompresses itself (vmlinuz = compressed ELF)
   - Sets up the CPU (GDT, IDT), enables paging/MMU, detects hardware
   - Initializes memory manager, scheduler, device subsystems
   - Mounts initramfs (initial RAM filesystem) as a temporary /
4. initramfs
   - Minimal environment: busybox, storage drivers (NVMe, LVM, RAID)
   - Mounts the real root filesystem, decrypts it if LUKS
   - Pivots root, exec's /sbin/init
5. systemd (PID 1)
   - Reads unit files in /lib/systemd/system/
   - Resolves the dependency graph, starts services in PARALLEL
   - Mounts /proc /sys /dev /run
   - Reaches target: multi-user.target -> networking, SSH, your app starts
6. Your service (via a systemd unit file)
   - Applies cgroup limits (CPU, memory)
   - Sets environment, working directory, user
   - App is ready to serve traffic
```
```bash
# Debug boot
systemd-analyze               # total boot time
systemd-analyze blame         # slowest services
dmesg | grep -E "error|fail"  # kernel boot errors
```
Namespaces control what a process can SEE. Each namespace type isolates a different resource: pid (process IDs), net (network stack), mnt (mount points), uts (hostname), ipc (System V IPC), user (UID/GID mappings), cgroup (cgroup root).
cgroups (Control Groups) control what a process can USE: CPU shares, memory limits, block-I/O bandwidth, device access.
Container = namespace isolation + cgroup resource limits + OverlayFS (union filesystem for layers). Docker, Kubernetes just automate creating these Linux primitives.
Performance Tuning & Debugging
LINUX

Use the USE Method (Utilization, Saturation, Errors) for every resource:
```bash
# 1. CPU
top / htop  # CPU% per process (P = sort by CPU)
vmstat 1    # r=runqueue (> nCPUs = saturated)  b=blocked  wa=IO-wait
            # us%=app, sy%=kernel, wa%=IO-wait, id%=idle

# 2. Memory
free -h                         # is 'available' near zero?
vmstat 1 | awk '{print $7,$8}'  # si/so > 0 = SWAPPING - a major problem!

# 3. Disk I/O
iostat -x 1  # %util > 80% = disk saturated
iotop        # which process is hammering the disk
# await (ms) = average I/O latency. >10ms on an SSD = bad.

# 4. Network
ss -s                                     # socket summary stats
netstat -s | grep -E "retransmit|failed"  # TCP errors
sar -n DEV 1                              # bandwidth per interface

# 5. Recent errors
dmesg | tail -50  # kernel messages
journalctl -p err --since "1 hour ago"
tail -f /var/log/syslog
```
```bash
# ps - process snapshot
ps aux                         # all processes, all users
ps aux --sort=-%mem | head     # top memory consumers
ps -eo pid,ppid,cmd,%cpu,%mem  # custom columns
# VSZ = virtual memory size (includes not-yet-used areas)
# RSS = Resident Set Size = actual physical RAM used (this is what matters!)

# lsof - everything is a file in Linux!
lsof -p <pid>        # all FDs opened by a process
lsof -i :8080        # what's using port 8080
lsof +D /var/log     # who has files open in this directory
lsof | grep deleted  # deleted files still held open - a disk-space leak!

# strace - most powerful: trace every syscall
strace ./app     # trace all syscalls of a new process
strace -p <pid>  # attach to a running process
strace -e trace=openat,read,write,close ./app  # filter syscalls
strace -c ./app  # count + time per syscall (profiling)
strace -T ./app  # show time spent in each syscall
# Use for: "permission denied", hung processes, mysteriously slow I/O
```
When physical RAM + swap is exhausted, the Linux OOM Killer selects and kills a process to free memory. Goal: maximize freed memory, minimize damage.
oom_score: 0-1000; higher = more likely to be killed. Based primarily on the process's memory footprint (RSS + swap usage), shifted by oom_score_adj.
```bash
# View and adjust OOM scores
cat /proc/<pid>/oom_score               # current score 0-1000
cat /proc/<pid>/oom_score_adj           # adjustment -1000 to +1000
echo -1000 > /proc/<pid>/oom_score_adj  # NEVER kill this process
echo 1000 > /proc/<pid>/oom_score_adj   # kill this first

# In a systemd unit file - persists across restarts:
OOMScoreAdjust=-500

# Detect OOM kills
dmesg | grep -i "oom killer\|killed process"
journalctl -k --since "1 hour ago" | grep -i oom
```
Set OOMScoreAdjust=-1000 in the systemd unit for any critical financial process. Set cgroup memory limits on non-critical services so the OOM killer targets those first. Use vm.overcommit_memory=2 in fintech to prevent overcommit entirely.

Fintech / HFT: Goldman, Citadel, Jane Street Level
FINTECH

```bash
# 1. CPU isolation - keep trading cores free from OS interference
# /etc/default/grub:
GRUB_CMDLINE_LINUX="isolcpus=2,3,4,5 nohz_full=2,3,4,5 rcu_nocbs=2,3,4,5"
# isolcpus:  scheduler won't place any task here unless explicitly asked
# nohz_full: disable the scheduler tick (100Hz jitter eliminated)
# rcu_nocbs: no RCU callbacks on these cores

# 2. CPU frequency - lock to max, disable deep C-states
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
# C-state exit latency can be 1-100µs! Keep CPUs in C0 at all times.

# 3. NUMA-aware placement
numactl --cpunodebind=0 --membind=0 ./trading-engine
# Remote NUMA access is 2-3x slower. Bind to the same socket as the NIC.

# 4. Real-time scheduling
chrt -f 99 taskset -c 2 ./order-matching-engine
echo -1000 > /proc/$(pgrep engine)/oom_score_adj

# 5. Memory - eliminate page faults entirely
sysctl -w vm.swappiness=0  # strongly discourage swapping
# In application code:
#   mlockall(MCL_CURRENT | MCL_FUTURE);  /* lock ALL pages in RAM */
# Pre-fault all memory at startup to avoid runtime faults

# 6. Huge pages - fewer TLB misses
echo 2048 > /proc/sys/vm/nr_hugepages  # 2048 x 2MB = 4GB of huge pages

# 7. Network tuning
sysctl -w net.core.busy_poll=50  # busy-poll sockets for 50µs
# TCP_NODELAY (disable Nagle's algo) is set per-socket via setsockopt in the app
# Pin NIC IRQs to non-trading cores:
echo 1 > /proc/irq/<eth_irq>/smp_affinity
```
```text
# Standard Linux path - every step adds latency:
NIC -> IRQ (~1µs) -> kernel driver -> NAPI softirq -> kernel TCP/IP stack (~10-30µs)
    -> socket buffer -> epoll wakeup -> syscall (~1µs) -> copy to userspace (~1µs)
    -> app processes
Total: 30-100µs minimum - unacceptable for µs-scale trading

# DPDK (Data Plane Development Kit) - kernel bypass:
# Moves the NIC driver entirely into user space
# App busy-polls the NIC ring buffer directly - NO interrupts
# NIC DMAs directly into huge-page user memory - zero copies
# Zero syscalls: reads from a memory-mapped ring buffer
Result: ~1-5µs end-to-end latency

# RDMA (Remote Direct Memory Access):
# NIC transfers data between machines WITHOUT CPU involvement
# Bypasses the kernel on BOTH sender and receiver
Result: <1µs remote memory access
Used for: inter-datacenter order routing, market data distribution
```
What happens when you run curl against an HTTPS API (here api.razorpay.com):

```text
# 1. Shell parsing
Shell tokenizes -> checks aliases/builtins -> PATH lookup: /usr/bin/curl
fork() child -> exec("/usr/bin/curl", args) -> ELF loaded by the kernel

# 2. DNS resolution
getaddrinfo("api.razorpay.com") -> checks /etc/hosts
  -> queries the nameserver from /etc/resolv.conf (UDP port 53)
  -> returns IP: 104.18.x.x

# 3. TCP 3-way handshake
socket(AF_INET, SOCK_STREAM) -> allocates an fd
connect() -> SYN -> SYN-ACK -> ACK
OS assigns an ephemeral port (32768-60999 range)

# 4. TLS handshake (HTTPS)
ClientHello (cipher suites, TLS 1.3) -> ServerHello + Certificate
  -> verify cert against the /etc/ssl/certs/ CA bundle
  -> ECDHE key exchange -> session keys derived
  -> encrypted channel established (~1 RTT for TLS 1.3)

# 5. HTTP request
"GET /v1/payments HTTP/1.1\r\nHost: api.razorpay.com\r\n..."
write() syscall -> kernel TCP buffer -> NIC sends

# 6. Response + cleanup
read() -> TLS decrypt -> HTTP parse -> stdout
close() -> TCP FIN -> connection sits in TIME_WAIT for ~60s
```