The Data Science Starter Kit

The Data Science Interview Cheat Sheet: Math & Code

Essential Statistics & Python Templates

By Tanmoy Chowdhury

How to use this guide: Data Science interviews often bounce between heavy statistics and rapid-fire coding. This cheat sheet bridges the gap. Use Part 1 to refresh your statistical definitions before a screening call, and use Part 2 to memorize the code patterns that solve 80% of algorithmic interview questions.

Part 1: Statistics for Data Science

1. Core Terminology

  • Population vs. Sample: A population is the entire group you want to draw conclusions about. A sample is the specific group from which you actually collect data.
  • Parameter vs. Statistic: A parameter is a numerical value that describes a population (e.g., Population Mean μ). A statistic is a numerical value that describes a sample (e.g., Sample Mean x̄).
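
The distinction can be sketched in a few lines using only the standard library. The population (the integers 1 to 100), the sample size of 10, and the random seed are made-up values chosen purely for illustration.

```python
import random
import statistics

# Hypothetical population: the integers 1..100 (an assumption for this sketch).
population = list(range(1, 101))

# Parameter: computed from the entire population.
mu = statistics.mean(population)

# Statistic: computed from a random sample (seeded so the run is repeatable).
rng = random.Random(0)
sample = rng.sample(population, 10)
x_bar = statistics.mean(sample)

print(f"Population mean (mu): {mu}")
print(f"Sample mean (x-bar):  {x_bar}")
```

The sample mean will usually land near, but not exactly on, the population mean; that gap is sampling error.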

2. Hypothesis Testing

The Concept: A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.

  • Null Hypothesis (H0): Typically states that there is no effect or no difference between experimental treatments.
  • Alternative Hypothesis (Ha): A statement that is mutually exclusive with the null hypothesis (the opposite of H0).

The Confusion Matrix of Errors:

                    | H0 is True                           | H0 is False
Reject H0           | Type I Error (α), False Positive     | Correct Outcome (Power: 1 - β)
Fail to Reject H0   | Correct Outcome (Confidence: 1 - α)  | Type II Error (β), False Negative

  • Type I Error (α): Rejecting a true null hypothesis (False Positive).
  • Type II Error (β): Failing to reject a false null hypothesis (False Negative).
  • Significance Level (α): The probability of making a Type I error. Commonly set at 0.05 (5%).
  • P-value: The probability of obtaining a result equal to or "more extreme" than what was observed, assuming the null hypothesis is true.
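
To make the p-value and significance level concrete, here is a minimal sketch of a two-sided, one-sample z-test using only the standard library. The function name z_test_p_value and all of the numbers (sample mean 103, hypothesized mean 100, known population SD 15, n = 100) are illustrative assumptions, not values from a real study.

```python
import math
from statistics import NormalDist


def z_test_p_value(sample_mean, pop_mean, pop_sd, n):
    """Two-sided p-value for a one-sample z-test (population SD assumed known)."""
    z = (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))
    # Probability of a result at least this extreme in either tail, assuming H0.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


alpha = 0.05  # significance level
z, p = z_test_p_value(sample_mean=103.0, pop_mean=100.0, pop_sd=15.0, n=100)
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p:.4f}: {decision}")
```

With these numbers z is 2.0 and the p-value is just under 0.05, so H0 is rejected at the 5% level but would not be at the 1% level.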

3. Common Statistical Tests

  • Z-Test: Used to check if the means of two large samples (rule of thumb: n > 30) are different when the population variance is known.
  • T-Test: Used to check if the means of two data sets are different when the population variance is not known or the sample size is small (n < 30).
  • ANOVA: Analysis of Variance. Used to compare means across three or more groups.
  • Chi-Square Test: Used to determine if there is a significant association between two categorical variables.
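
As one worked instance of the tests above, the following sketch computes a Pearson chi-square test of independence for a 2x2 table by hand, using only the standard library. The function name chi_square_2x2 and the observed counts are hypothetical, and no continuity (Yates) correction is applied. In practice a statistics library would normally be used instead.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (df = 1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For df = 1, the chi-square upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = group A / group B, columns = outcome yes / no.
observed = [[30, 10], [20, 40]]
chi2, p_value = chi_square_2x2(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.6f}")
```

A small p-value here suggests the row and column variables are associated; a table whose rows have identical proportions gives a statistic of 0 and a p-value of 1.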

Part 2: Python Code Snippets for Data Science

1. List Operations & Slicing

Python lists are dynamic arrays. Slicing is a powerful feature for data manipulation.

A = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Basic Slicing A[start:stop:step]
print(A[2:5])    # [2, 3, 4]
print(A[:3])     # [0, 1, 2] (First 3 items)
print(A[-3:])    # [7, 8, 9] (Last 3 items)
print(A[::-1])   # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0] (Reverse list)

# List Comprehension
# Syntax: [expression for item in list if condition]
squared_evens = [x**2 for x in range(10) if x % 2 == 0]
# Result: [0, 4, 16, 36, 64]

2. Essential Bit Manipulation Tricks

Bitwise operations run in constant time (O(1)) on fixed-width integers and come up often in technical interviews.

x = 10  # Binary: 1010
n = 1   # Bit position to work on (0 = least significant bit)

# Check if nth bit is set
is_set = (x & (1 << n)) != 0

# Set the nth bit
x_new = x | (1 << n)

# Clear the nth bit
x_new = x & ~(1 << n)

# Toggle the nth bit
x_new = x ^ (1 << n)

# Check if number is power of 2
# (x & (x-1)) clears the lowest set bit. If the result is 0 and x > 0, x is a power of 2.
# The x > 0 guard matters: without it, 0 would be reported as a power of 2.
is_power_2 = x > 0 and (x & (x - 1)) == 0
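
The same "clear the lowest set bit" trick also gives a compact way to count 1-bits. The helper below is a sketch with a hypothetical name (count_set_bits) and assumes a non-negative integer input.

```python
def count_set_bits(x: int) -> int:
    """Count the 1-bits of a non-negative integer by clearing the lowest set bit each pass."""
    if x < 0:
        raise ValueError("x must be non-negative")
    count = 0
    while x:
        x &= x - 1  # clears the lowest set bit
        count += 1
    return count


print(count_set_bits(10))  # 10 is 0b1010
```

The loop runs once per set bit rather than once per bit position, which is why this version is often preferred over shifting through every bit.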

3. Common Algorithm Templates

Memorize these patterns for coding interviews.

Two Pointers (Opposite Ends)

Used for: Sorted arrays, finding pairs, checking palindromes.

def two_pointers(arr):
    left = 0
    right = len(arr) - 1
    ans = 0
    while left < right:
        # Logic here (e.g., check sum of arr[left] + arr[right])
        if CONDITION:
            left += 1
        else:
            right -= 1
    return ans
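
One concrete way to fill in the template above is the classic "pair with target sum in a sorted array" problem. The function name pair_with_sum and the sample input are illustrative choices, and the sketch assumes the input is already sorted in ascending order.

```python
def pair_with_sum(arr, target):
    """Return indices (left, right) of two values in sorted `arr` that add up to `target`, or None."""
    left, right = 0, len(arr) - 1
    while left < right:
        current = arr[left] + arr[right]
        if current == target:
            return left, right
        if current < target:
            left += 1   # sum too small: move toward larger values
        else:
            right -= 1  # sum too large: move toward smaller values
    return None


print(pair_with_sum([1, 3, 4, 6, 8, 11], 10))
```

Each step discards one element, so the scan is O(n) time and O(1) extra space, compared with O(n^2) for checking every pair.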

Sliding Window

Used for: Subarray problems, finding longest/shortest substrings.

def sliding_window(arr):
    left = 0
    curr = 0
    ans = 0
    for right in range(len(arr)):
        # Logic to add arr[right] to current window
        
        while WINDOW_CONDITION_BROKEN:
            # Logic to remove arr[left] from current window
            left += 1
            
        ans = max(ans, right - left + 1) # Update answer
    return ans
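
As a concrete instance of the template above, this sketch finds the length of the longest contiguous subarray whose sum stays at or below a limit. The function name and sample data are made up for illustration, and the approach assumes all numbers are non-negative (with negatives, shrinking the window no longer reliably lowers the sum).

```python
def longest_subarray_within(nums, limit):
    """Length of the longest contiguous subarray with sum <= limit (non-negative nums)."""
    left = 0
    curr = 0  # sum of the current window nums[left..right]
    ans = 0
    for right in range(len(nums)):
        curr += nums[right]            # add nums[right] to the window
        while curr > limit:            # window condition broken
            curr -= nums[left]         # remove nums[left] from the window
            left += 1
        ans = max(ans, right - left + 1)
    return ans


print(longest_subarray_within([3, 1, 2, 7, 4, 2, 1, 1, 5], 8))
```

Both pointers only move forward, so every element is added and removed at most once and the whole scan is O(n).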

4. Critical Helper Functions

Don't reinvent the wheel; use Python's built-in power.

import heapq
import collections

# Heap Operations (Min-Heap by default)
heap = []
heapq.heappush(heap, 10)
smallest = heapq.heappop(heap) # Returns smallest element
# For Max-Heap, push negated values and negate again on pop:
# heapq.heappush(heap, -10); largest = -heapq.heappop(heap)

# Counting elements efficiently
counts = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
print(counts.most_common(1)) # [('b', 3)]

# Default Dictionary (Handling missing keys automatically)
adj_list = collections.defaultdict(list)
adj_list['node1'].append('node2') # No KeyError even if 'node1' doesn't exist yet
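
These helpers combine naturally. The sketch below shows two common interview uses: the k most frequent items via Counter plus heapq.nlargest, and a max-heap built by negating values. The function name top_k_frequent and the sample data are illustrative only.

```python
import heapq
from collections import Counter


def top_k_frequent(items, k):
    """Return the k most frequent items, most frequent first."""
    counts = Counter(items)
    # nlargest iterates over the keys of `counts` and ranks them by their count.
    return heapq.nlargest(k, counts, key=counts.get)


words = ['a', 'b', 'c', 'a', 'b', 'b']
print(top_k_frequent(words, 2))

# Max-heap via negation: store -value, then negate again when popping.
max_heap = []
for value in [5, 1, 8, 3]:
    heapq.heappush(max_heap, -value)
largest = -heapq.heappop(max_heap)
print(largest)
```

For small inputs counts.most_common(k) gives the same ranking; the heap version is shown because interviewers often ask for the O(n log k) approach explicitly.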
