The Data Science Starter Kit
The Data Science Interview Cheat Sheet: Math & Code
Essential Statistics & Python Templates
By Tanmoy Chowdhury
How to use this guide: Data science interviews often switch between heavy statistics and rapid-fire coding, and this cheat sheet covers both. Use Part 1 to refresh your statistical definitions before a screening call, and use Part 2 to memorize code patterns that recur in many algorithmic interview questions.
Part 1: Statistics for Data Science
1. Core Terminology
- Population vs. Sample: A population is the entire group you want to draw conclusions about. A sample is the specific group that you will collect data from.
- Parameter vs. Statistic: A parameter is a numerical value that describes a population (e.g., Population Mean μ). A statistic is a numerical value that describes a sample (e.g., Sample Mean x̄).
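As a minimal sketch of the parameter/statistic distinction, the snippet below treats a small made-up list as the entire population and draws a random sample from it. The data values, the seed, and the sample size of 4 are illustrative assumptions, not part of the original guide.

```python
import random
import statistics

# Hypothetical population: pretend these 10 values are every member of the group.
population = [12, 15, 11, 18, 20, 14, 16, 13, 19, 17]

# Parameters describe the population.
mu = statistics.mean(population)            # population mean
sigma2 = statistics.pvariance(population)   # population variance (divides by N)

# A sample is the subset we actually observe.
random.seed(0)  # fixed seed so the sketch is repeatable
sample = random.sample(population, 4)

# Statistics describe the sample and serve as estimates of the parameters.
x_bar = statistics.mean(sample)     # sample mean
s2 = statistics.variance(sample)    # sample variance (divides by n - 1)

print("population:", mu, sigma2)
print("sample:", sample, x_bar, s2)
```

The sample mean will usually differ a little from the population mean; that gap is the sampling error that hypothesis tests account for.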
2. Hypothesis Testing
The Concept: A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is better supported by the sample data.
- Null Hypothesis (H0): Typically states that there is no effect or no difference between experimental treatments.
- Alternative Hypothesis (Ha): A statement that is mutually exclusive with the null hypothesis (the opposite of H0).
The Confusion Matrix of Errors:
| | Null Hypothesis (H0) is True | Null Hypothesis (H0) is False |
|---|---|---|
| Reject H0 | Type I Error (α) (False Positive) | Correct Outcome (Power: 1 - β) |
| Fail to Reject H0 | Correct Outcome (Confidence: 1 - α) | Type II Error (β) (False Negative) |
- Type I Error (α): Rejecting a true null hypothesis (False Positive).
- Type II Error (β): Failing to reject a false null hypothesis (False Negative).
- Significance Level (α): The probability of making a Type I error. Commonly set at 0.05 (5%).
- P-value: The probability of obtaining a result equal to or "more extreme" than what was observed, assuming the null hypothesis is true.
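To make the p-value and significance level concrete, here is a minimal sketch of a two-sided one-sample z-test using only the standard library. The sample mean, the null mean `mu0`, the known standard deviation `sigma`, and the sample size are made-up numbers, and the function name `z_test_p_value` is hypothetical.

```python
import math
from statistics import NormalDist

def z_test_p_value(sample_mean, mu0, sigma, n):
    """Two-sided p-value for H0: population mean == mu0, with known sigma."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Probability of a z at least this extreme in either tail, assuming H0 is true.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p

alpha = 0.05  # significance level: the Type I error rate we accept
z, p = z_test_p_value(sample_mean=52.0, mu0=50.0, sigma=8.0, n=64)
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"z = {z:.2f}, p = {p:.4f}, {decision}")
```

Comparing `p` against `alpha` is the decision rule: a p-value below the significance level leads to rejecting H0.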
3. Common Statistical Tests
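- Z-Test: Used to check whether two means differ (or whether one mean differs from a reference value) when the population variance is known. With large samples (n > 30) it is also commonly applied using the sample variance as an estimate.
- T-Test: Used to check whether two means differ when the population variance is unknown and must be estimated from the data, especially when the sample size is small (n < 30).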
- ANOVA: Analysis of Variance. Used to compare means across three or more groups.
- Chi-Square Test: Used to determine if there is a significant association between two categorical variables.
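As an illustration of one of these tests, the sketch below computes a chi-square test of independence by hand for a 2x2 table, using only the standard library. The observed counts are invented, the function name `chi_square_2x2` is hypothetical, and no continuity correction is applied. The p-value shortcut relies on the fact that a chi-square variable with 1 degree of freedom is a squared standard normal.

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: rows = group A / group B, columns = converted / did not convert.
observed = [[30, 70],
            [50, 50]]
chi2, p_value = chi_square_2x2(observed)
print(f"chi2 = {chi2:.3f}, p = {p_value:.4f}")
```

In practice a statistics library would handle larger tables and corrections; the manual version is only meant to show where the statistic comes from.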
Part 2: Python Code Snippets for Data Science
1. List Operations & Slicing
Python lists are dynamic arrays. Slicing is a powerful feature for data manipulation.
A = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# Basic Slicing A[start:stop:step]
print(A[2:5]) # [2, 3, 4]
print(A[:3]) # [0, 1, 2] (First 3 items)
print(A[-3:]) # [7, 8, 9] (Last 3 items)
print(A[::-1]) # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0] (Reverse list)
# List Comprehension
# Syntax: [expression for item in list if condition]
squared_evens = [x**2 for x in range(10) if x % 2 == 0]
# Result: [0, 4, 16, 36, 64]
2. Essential Bit Manipulation Tricks
Bitwise operations on fixed-size integers run in constant time (O(1)) and come up often in technical interviews.
x = 10 # Binary: 1010
n = 1 # Bit position to work on, 0-indexed from the least significant bit
# Check if nth bit is set
is_set = (x & (1 << n)) != 0
# Set the nth bit
x_new = x | (1 << n)
# Clear the nth bit
x_new = x & ~(1 << n)
# Toggle the nth bit
x_new = x ^ (1 << n)
# Check if number is power of 2
# (x & (x-1)) clears the lowest set bit. If result is 0, exactly one bit was set.
# The x > 0 guard is needed because 0 would otherwise pass the check.
is_power_2 = x > 0 and (x & (x - 1)) == 0
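The "clear the lowest set bit" trick can also count how many bits are set. The following is a small sketch that wraps both ideas in functions; the names `count_set_bits` and `is_power_of_two` are hypothetical, and the counter assumes a non-negative integer.

```python
def count_set_bits(x):
    """Count 1-bits in a non-negative integer by clearing the lowest set bit each pass."""
    count = 0
    while x:
        x &= x - 1  # drops the lowest set bit
        count += 1
    return count

def is_power_of_two(x):
    """A positive integer is a power of two when exactly one bit is set."""
    return x > 0 and (x & (x - 1)) == 0

print(count_set_bits(10))
print([v for v in range(1, 20) if is_power_of_two(v)])
```

The loop runs once per set bit rather than once per bit position, which is why this pattern is a common interview answer.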
3. Common Algorithm Templates
Memorize these patterns for coding interviews.
Two Pointers (Opposite Ends)
Used for: Sorted arrays, finding pairs, checking palindromes.
def two_pointers(arr):
    left = 0
    right = len(arr) - 1
    ans = 0
    while left < right:
        # Logic here (e.g., check sum of arr[left] + arr[right])
        if CONDITION:
            left += 1
        else:
            right -= 1
    return ans
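One way to fill in this template is the classic "pair with a target sum in a sorted array" problem. This is a sketch of a single instance of the pattern; the function name `pair_with_sum` and the sample input are assumptions for illustration.

```python
def pair_with_sum(sorted_arr, target):
    """Return indices (left, right) of two values summing to target, or None."""
    left, right = 0, len(sorted_arr) - 1
    while left < right:
        total = sorted_arr[left] + sorted_arr[right]
        if total == target:
            return left, right
        if total < target:
            left += 1   # sum too small: move to a larger value
        else:
            right -= 1  # sum too large: move to a smaller value
    return None

print(pair_with_sum([1, 3, 4, 6, 8, 11], 10))
```

Because the array is sorted, each comparison rules out one element for good, so the search takes a single O(n) pass instead of checking every pair.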
Sliding Window
Used for: Subarray problems, finding longest/shortest substrings.
def sliding_window(arr):
    left = 0
    curr = 0
    ans = 0
    for right in range(len(arr)):
        # Logic to add arr[right] to current window
        while WINDOW_CONDITION_BROKEN:
            # Logic to remove arr[left] from current window
            left += 1
        ans = max(ans, right - left + 1) # Update answer
    return ans
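A concrete instance of the template is finding the longest contiguous subarray whose sum stays within a limit. The sketch below assumes all numbers and the limit are non-negative (otherwise shrinking the window is not guaranteed to restore the condition); the function name `longest_subarray_within` and the sample data are hypothetical.

```python
def longest_subarray_within(nums, limit):
    """Length of the longest contiguous run whose sum is <= limit.

    Assumes every number and the limit are non-negative.
    """
    left = 0
    curr = 0  # sum of nums[left:right + 1]
    ans = 0
    for right in range(len(nums)):
        curr += nums[right]        # add arr[right] to the window
        while curr > limit:        # window condition broken
            curr -= nums[left]     # remove arr[left] from the window
            left += 1
        ans = max(ans, right - left + 1)
    return ans

print(longest_subarray_within([3, 1, 2, 7, 4, 2, 1, 1, 5], 8))
```

Each index enters and leaves the window at most once, so the nested loop still runs in O(n) overall.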
4. Critical Helper Functions
Don't reinvent the wheel; use Python's built-in power.
import heapq
import collections
# Heap Operations (Min-Heap by default)
heap = []
heapq.heappush(heap, 10)
smallest = heapq.heappop(heap) # Returns smallest element
# For Max-Heap, push negated values and negate again when popping: heapq.heappush(heap, -10)
# Counting elements efficiently
counts = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
print(counts.most_common(1)) # [('b', 3)]
# Default Dictionary (Handling missing keys automatically)
adj_list = collections.defaultdict(list)
adj_list['node1'].append('node2') # No KeyError even if 'node1' doesn't exist yet
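These helpers combine naturally in interview problems. As a rough sketch, the first function below uses the negation trick to pull the k largest values from a min-heap, and the second uses `defaultdict` to group items without checking for missing keys. The function names `k_largest` and `group_by_length` and the sample inputs are assumptions for illustration.

```python
import collections
import heapq

def k_largest(nums, k):
    """Return the k largest values in descending order using a simulated max-heap."""
    # heapq is a min-heap, so store negated values to pop the largest first.
    heap = [-v for v in nums]
    heapq.heapify(heap)
    return [-heapq.heappop(heap) for _ in range(min(k, len(heap)))]

def group_by_length(words):
    """Group words by length; defaultdict creates each list on first access."""
    groups = collections.defaultdict(list)
    for word in words:
        groups[len(word)].append(word)
    return dict(groups)

print(k_largest([5, 1, 9, 3, 7], 2))
print(group_by_length(["a", "to", "be", "cat"]))
```

For the common case of simply needing the top k items, `heapq.nlargest(k, nums)` does the same job in one call.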