Learning Data Science: From Practical Foundations to Theoretical Insights
Authored by Sam Lau, Joseph Gonzalez, and Deborah Nolan. Mapped to DATA 100 assignment problems by our tutors.
UC BERKELEY
Project walkthroughs in Python with Pandas, NumPy, scikit-learn, and matplotlib, with Otter-Grader pass guarantees on every homework from EDA through classification. The number-one DATA 100 grading deduction is on Project A1 housing-price regression, where students log-transform a target that contains 0-valued entries and introduce -inf into the target vector, the exact failure mode our tutors annotate inline. Verified CS and statistics graduates from Berkeley and Stanford, starting at $20 per homework, 12-hour average turnaround.
Course Overview
DATA 100 teaches the principles and techniques of data science across 14 weeks under Joseph Gonzalez, Narges Norouzi, and Lisa Yan (recent semesters), with co-development between EECS and the Division of Computing, Data Science, and Society. The course covers 6 modules: (1) Pandas and data wrangling, (2) exploratory data analysis (EDA) and visualization, (3) sampling and experimentation, (4) modeling and least-squares regression, (5) gradient descent and feature engineering, (6) classification with logistic regression, decision trees, and cross-validation. Languages and libraries: Python 3.11 with Pandas 2.x, NumPy 1.26, scikit-learn 1.4, matplotlib, seaborn, plotly, and statsmodels in select labs.
The course assesses through 12 weekly Jupyter notebook homework assignments graded by Otter-Grader (an open-source autograder originally written for DATA 8 and extended for DATA 100), 2 projects (Project A1 housing-price regression, Project A2 spam-classification with logistic regression), a midterm at week 8, and a final at week 14. Lectures Monday and Wednesday at 5 PM in Wheeler Hall (or Pimentel for larger semesters), discussion Friday at varied times. Grading: 30 percent homework, 30 percent projects (15-15), 15 percent midterm, 20 percent final, 5 percent discussion attendance.
The course is the second course in the Data Science major after DATA 8 (Foundations of Data Science) and is a prerequisite for DATA 101 (Data Engineering) and CS 189 (Machine Learning).
Course Reading
Authored by Sam Lau, Joseph Gonzalez, and Deborah Nolan. Mapped to DATA 100 assignment problems by our tutors.
Authored by Ani Adhikari, John DeNero, and David Wagner. Mapped to DATA 100 assignment problems by our tutors.
Authored by Wes McKinney. Mapped to DATA 100 assignment problems by our tutors.
Assignments
Verifies the Python 3.11 toolchain, datahub.berkeley.edu Jupyter environment, and Pandas plus matplotlib baseline skills. Tasks: index a Pandas DataFrame by label (loc) and position (iloc), compute groupby-aggregate-rank pipelines on a babynames-style dataset, produce a 2-axis matplotlib subplot. Otter-Grader autograder runs 8 cells with 3 hidden tests per cell, completing in under 30 seconds on the datahub kernel.
Exploratory analysis of the 1880 through 2020 US Social Security Administration babynames dataset (12 million rows). Tasks: compute most-popular name per decade with idxmax, plot name popularity over time, detect anomalies in name-frequency time series. The hidden tests verify vectorized Pandas operations (df.groupby().agg() not df.apply(lambda row)) by timing each cell at under 5 seconds on the 12-million-row dataset.
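The most-popular-name-per-decade task can be sketched on a toy table. The column names (`Name`, `Year`, `Count`) and the counts below are made up for illustration and are not taken from the assignment; the point is only the groupby-then-idxmax pattern.

```python
import pandas as pd

# Toy stand-in for the babynames table; names and counts are invented.
babynames = pd.DataFrame({
    "Name":  ["Mary", "Anna", "Mary", "Linda", "Mary", "Lisa", "Lisa", "Mary"],
    "Year":  [1880, 1885, 1891, 1947, 1949, 1962, 1965, 1968],
    "Count": [70, 40, 110, 990, 300, 460, 600, 260],
})

# Bucket each year into its decade with vectorized integer arithmetic.
babynames["Decade"] = babynames["Year"] // 10 * 10

# Total count per (decade, name) pair.
totals = babynames.groupby(["Decade", "Name"], as_index=False)["Count"].sum()

# idxmax returns the row label of the largest total within each decade;
# .loc then pulls those rows out of the totals table.
top_per_decade = totals.loc[
    totals.groupby("Decade")["Count"].idxmax(), ["Decade", "Name"]
]
print(top_per_decade)
```

No Python-level loop or row-wise lambda appears here, which is the property the timing tests described above are said to check.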
Generate 7 plots covering bar charts, histograms, scatter plots with overlaid regression lines, heatmaps via sns.heatmap, and a faceted plot via sns.relplot or DataFrame.plot. Otter tests assert axis labels, title strings, and series colors via matplotlib.pyplot.gca() inspection. The visualization principles graded match the Tufte and Cleveland guidelines covered in lecture: data-ink ratio, perceptual ordering, and a lie factor between 0.95 and 1.05.
Fit a 1-variable and 2-variable least-squares linear regression by both closed-form (theta = (X^T X)^-1 X^T y) and gradient descent. Tasks: derive the gradient of the squared-error loss, implement batch gradient descent in NumPy with vectorized matrix operations, and benchmark against sklearn.linear_model.LinearRegression. Expected runtime on the staff datahub: under 10 seconds for 100,000 gradient-descent iterations on a 5-feature design matrix.
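A minimal sketch of the two fitting routes on synthetic data. The data, the learning rate of 0.1, and the 1,000-step budget are assumptions chosen for this illustration, not the staff configuration. The loss is the mean squared error, so the gradient is (2/n) X^T (X theta - y).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_raw = rng.normal(size=(n, 2))
true_theta = np.array([1.0, 2.0, -3.0])      # made-up intercept and two slopes
X = np.column_stack([np.ones(n), X_raw])     # design matrix with a bias column
y = X @ true_theta + rng.normal(scale=0.1, size=n)

# Closed form: solve the normal equations (X^T X) theta = X^T y.
# np.linalg.solve avoids forming the explicit inverse.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)


def mse_gradient(theta, X, y):
    """Gradient of the mean squared error (1/n) * ||X theta - y||^2."""
    return (2.0 / len(y)) * (X.T @ (X @ theta - y))


# Batch gradient descent: every step uses the full design matrix.
theta_gd = np.zeros(X.shape[1])
learning_rate = 0.1
for _ in range(1000):
    theta_gd -= learning_rate * mse_gradient(theta_gd, X, y)

print(theta_closed, theta_gd)
```

On well-scaled features like these, the two estimates agree to many decimal places, which is a useful self-check before comparing against sklearn.linear_model.LinearRegression.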
Engineer polynomial and interaction features from a continuous design matrix, then evaluate via 5-fold cross-validation with sklearn.model_selection.KFold. Tasks: one-hot encode categorical columns with pd.get_dummies, scale features to zero-mean unit-variance with sklearn.preprocessing.StandardScaler, and plot training versus validation RMSE across polynomial degree 1 to 10 to find the degree where validation RMSE stops falling and starts rising (the bias-variance tradeoff).
Fit a binary logistic regression on the breast-cancer dataset (sklearn.datasets.load_breast_cancer, 569 samples, 30 features). Tasks: derive the cross-entropy loss gradient, implement batch gradient descent for logistic regression in NumPy, and benchmark against sklearn.linear_model.LogisticRegression with the lbfgs solver. Evaluate via accuracy, precision, recall, and ROC AUC with sklearn.metrics.roc_auc_score. Plot the ROC curve via matplotlib.
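The gradient derivation and the NumPy training loop can be sketched as follows. To stay self-contained, this uses a small synthetic dataset rather than load_breast_cancer, and the learning rate and step count are assumptions picked for the toy data. For the mean cross-entropy loss, the gradient works out to (1/n) X^T (sigmoid(X theta) - y).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cross_entropy_gradient(theta, X, y):
    """Gradient of the mean cross-entropy loss for logistic regression."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


rng = np.random.default_rng(0)
n = 400
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])   # bias column plus two features

# Synthetic labels from a noisy linear rule, so the classes overlap.
noise = rng.normal(scale=0.5, size=n)
y = (features[:, 0] + features[:, 1] + noise > 0).astype(float)

# Batch gradient descent on the full dataset.
theta = np.zeros(X.shape[1])
for _ in range(2000):
    theta -= 0.5 * cross_entropy_gradient(theta, X, y)

# Classify at the 0.5 probability threshold and measure training accuracy.
predicted = sigmoid(X @ theta) >= 0.5
accuracy = np.mean(predicted == (y == 1))
print(theta, accuracy)
```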
A 2-week project predicting Ames, Iowa housing sale prices (sales from 2006 through 2010) from an 81-column table. Required pipeline: data cleaning with Pandas (handling missing values across 19 columns of the 1,460-row training set), feature engineering (log-transform SalePrice and 1stFlrSF, one-hot-encode Neighborhood and HouseStyle, derive TotalSF from 1stFlrSF + 2ndFlrSF + TotalBsmtSF), and fit Ridge regression with sklearn.linear_model.Ridge. The leaderboard ranks submissions by held-out test RMSE, with a 25,000-dollar threshold for full credit.
A 2-week project classifying email as spam vs ham on the Enron public corpus (5,172 emails after cleaning). Required pipeline: text preprocessing (lowercase, remove punctuation, strip HTML tags via BeautifulSoup, tokenize on whitespace), feature engineering (bag-of-words via sklearn.feature_extraction.text.CountVectorizer with min_df=5, then TF-IDF transform), and fit Logistic Regression with L2 regularization. The leaderboard ranks by accuracy on the held-out test set with a 0.92 threshold for full credit.
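The lowercase, punctuation-removal, and whitespace-tokenization steps can be sketched with the standard library alone. The `preprocess` helper is a hypothetical name for this illustration; HTML stripping with BeautifulSoup and the CountVectorizer stage are deliberately left out so the block has no third-party dependencies.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TABLE).split()


tokens = preprocess("WIN a FREE prize!!! Click here, now.")
print(tokens)
```

Note that string.punctuation covers ASCII only, so curly quotes and other Unicode punctuation would pass through this sketch untouched.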
Common Pitfalls
Students new to Pandas reach for `df.groupby("year").apply(lambda group: group.sort_values("count", ascending=False).head(10))`, which takes about 60 seconds on the 12-million-row dataset and exceeds the 30-second Otter test budget (a TLE). The fix: the vectorized `df.sort_values(["year", "count"], ascending=[True, False]).groupby("year").head(10)` runs in about 2 seconds. The principle: groupby-apply with a Python lambda invokes the lambda once per group with full Pandas overhead; vectorized sort-then-groupby uses C-level operations throughout.
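A small check that the two approaches select the same rows, using a made-up 250-row frame in place of the real dataset. Timings are not reproduced here since they depend on data size and hardware.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "year": np.repeat(np.arange(2000, 2005), 50),
    "name": [f"name_{i}" for i in range(250)],
    "count": rng.permutation(250),   # distinct counts, so there are no ties
})

# Slow pattern: a Python lambda is called once per group.
slow = (
    df.groupby("year", group_keys=False)[["name", "count"]]
      .apply(lambda group: group.sort_values("count", ascending=False).head(10))
)

# Fast pattern: one global sort, then take the first 10 rows of each group.
fast = (
    df.sort_values(["year", "count"], ascending=[True, False])
      .groupby("year")
      .head(10)
)

# Both keep the original row labels, so the selections can be compared directly.
print(set(slow.index) == set(fast.index))
```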
A learning rate above 0.01 on a non-standardized design matrix causes gradient descent to diverge (theta grows by 10x per iteration and overflows to inf within 20 steps). Students see RMSE = inf in the autograder output and fail the hidden tests. The fix: standardize the design matrix to zero-mean unit-variance with StandardScaler before gradient descent, then a learning rate of 0.01 to 0.1 converges in 100 to 500 iterations for the staff test inputs.
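The divergence and the fix can be reproduced on a single made-up feature. The square-footage range, noise level, and step counts below are assumptions for illustration; standardization is done by hand with NumPy, which matches StandardScaler's default behavior of subtracting the mean and dividing by the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
sqft = rng.uniform(500, 4000, size=n)          # feature on a large raw scale
y = 50.0 + 0.1 * sqft + rng.normal(scale=5.0, size=n)


def run_gd(x, y, learning_rate, steps):
    """Batch gradient descent on mean squared error with a bias column."""
    X = np.column_stack([np.ones(len(x)), x])
    theta = np.zeros(2)
    for _ in range(steps):
        gradient = (2.0 / len(y)) * (X.T @ (X @ theta - y))
        theta -= learning_rate * gradient
    return theta, X


# Raw feature: the step overshoots on every iteration and theta blows up.
with np.errstate(over="ignore", invalid="ignore"):
    theta_raw, _ = run_gd(sqft, y, learning_rate=0.01, steps=200)

# Standardized feature: zero mean and unit variance, so the same loop converges.
x_std = (sqft - sqft.mean()) / sqft.std()
theta_std, X_std = run_gd(x_std, y, learning_rate=0.1, steps=200)
rmse_std = np.sqrt(np.mean((X_std @ theta_std - y) ** 2))

print(theta_raw, theta_std, rmse_std)
```

The coefficients from the standardized fit are on the standardized scale; they need to be mapped back through the stored mean and standard deviation before being reported in original units.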
Project A1 recommends log-transforming SalePrice to stabilize the residual variance. Students who apply `np.log(df["SalePrice"])` without checking for zeros introduce -inf into the target vector for any row with SalePrice = 0 (rare in this dataset but present in the test fold for some semester variants); NumPy reports this with at most a RuntimeWarning, which is easy to miss in notebook output. scikit-learn's Ridge then refuses to fit and raises a ValueError on the non-finite target, while a hand-written NumPy solver propagates inf and nan into theta and the leaderboard submission fails with RMSE = nan. The fix: `np.log1p(df["SalePrice"])` (which computes log(1 + x) and is safe for x = 0) plus `np.expm1(predictions)` to invert at inference.
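A short sketch of the failure and the round trip, using three made-up prices:

```python
import numpy as np

sale_price = np.array([0.0, 120000.0, 250000.0])   # illustrative values only

# Plain log maps 0 to -inf; the warning is suppressed here to keep output clean.
with np.errstate(divide="ignore"):
    naive = np.log(sale_price)

# log1p computes log(1 + x), which is finite at x = 0.
safe = np.log1p(sale_price)

# expm1 computes exp(x) - 1 and inverts log1p at prediction time.
recovered = np.expm1(safe)

print(naive, safe, recovered)
```

Because the model is then trained on log(1 + price), predictions must go through expm1 (not exp) to land back in dollars.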
BeautifulSoup with the default html.parser silently treats `<body>` text as a top-level NavigableString in some Python 3.11 patch versions, dropping body content from `soup.get_text()`. Students see a 0.85 test-set accuracy that suddenly drops to 0.72 after rerunning the notebook with a different Python patch. The fix: use `BeautifulSoup(text, "html.parser").get_text(separator=" ")` plus a fallback for empty results that re-parses with lxml.
DATA 100 standardizes on random_state=42 across all sklearn calls (train_test_split, KFold, RandomForestRegressor) for grading reproducibility. Students who use random_state=None or a different seed see different cross-validation splits and different held-out scores than the autograder expects, causing hidden tests to fail with "expected 0.876 +/- 0.005 but got 0.891". The fix: explicit `random_state=42` on every randomization call.
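The effect of a fixed seed can be checked directly on KFold. The `fold_indices` helper and the 10-row array are hypothetical, for illustration only. Note that KFold only uses random_state when shuffle=True.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy rows


def fold_indices(seed):
    """Return the held-out row indices of each fold for a given seed."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in kf.split(X)]


same_a = fold_indices(42)
same_b = fold_indices(42)

# Identical seeds give identical folds on every run.
print(same_a == same_b)
```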
A naive for-loop computing X.dot(theta) for a 100,000-row 5-column design matrix runs in 800 ms in Python; the equivalent `X @ theta` (BLAS-backed) runs in 0.5 ms (1600x speedup). DATA 100 hidden tests budget the vectorized version. Students who write `predictions = [sum(X[i][j] * theta[j] for j in range(5)) for i in range(100000)]` TLE at the 5-second budget. The autograder includes a benchmark cell that flags any cell over the budget.
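The two versions compute the same predictions, which can be confirmed on a smaller made-up matrix (1,000 rows here to keep the loop quick; timings are omitted since they vary by machine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
theta = rng.normal(size=5)

# Loop version: one Python-level multiply-add per matrix entry.
loop_predictions = np.array([
    sum(X[i][j] * theta[j] for j in range(5)) for i in range(1000)
])

# Vectorized version: a single matrix-vector product.
vectorized_predictions = X @ theta

print(np.allclose(loop_predictions, vectorized_predictions))
```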
Code like `df[df["price"] > 100]["sale_status"] = "active"` triggers a SettingWithCopyWarning and fails to modify the original DataFrame: boolean-mask indexing returns a copy, so the assignment lands on a temporary object that is immediately discarded. The fix: `df.loc[df["price"] > 100, "sale_status"] = "active"`, which selects the rows and the column in a single indexing call on the original frame. When the goal is an independent subset rather than an in-place update, take it explicitly with `subset = df[df["price"] > 100].copy()` and assign to `subset`.
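A minimal sketch of both safe patterns on a three-row made-up frame; the `flagged` column is a hypothetical addition for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [50, 150, 300],
    "sale_status": ["pending", "pending", "pending"],
})

# In-place update: row mask and column label go in one .loc call.
df.loc[df["price"] > 100, "sale_status"] = "active"

# Independent subset: copy explicitly, then modify the copy freely.
expensive = df[df["price"] > 100].copy()
expensive["flagged"] = True

print(df)
print(expensive)
```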
Otter-Grader runs a subset of tests publicly when students click "Run Otter Tests" in the notebook, but the full hidden test suite runs on Gradescope after submission. Students who pass all visible tests sometimes see 60 percent of hidden tests fail because the hidden tests cover edge cases (empty DataFrames, single-row inputs, all-NaN columns) that the public tests skip. The fix: write defensive code (check len(df) > 0 before groupby, handle pd.isna() in numerical operations) and test with adversarial inputs locally before submitting.
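One way to exercise those edge cases locally is to call the function under test with an empty frame, a single row, and an all-NaN column. The `mean_count_by_year` helper and its column names are hypothetical; the actual hidden tests are not public.

```python
import numpy as np
import pandas as pd


def mean_count_by_year(df):
    """Mean of 'count' per 'year'; returns an empty Series for empty input."""
    if len(df) == 0:
        return pd.Series(dtype="float64")
    # groupby-mean skips NaN values; a group with only NaN yields NaN.
    return df.groupby("year")["count"].mean()


empty = pd.DataFrame({
    "year": pd.Series(dtype="int64"),
    "count": pd.Series(dtype="float64"),
})
single = pd.DataFrame({"year": [2000], "count": [3.0]})
all_nan = pd.DataFrame({"year": [2000, 2000], "count": [np.nan, np.nan]})

print(mean_count_by_year(empty))
print(mean_count_by_year(single))
print(mean_count_by_year(all_nan))
```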
Sample Work
Every DATA 100 deliverable ships with annotated code, an autograder transcript, and a line-by-line walkthrough. Browse anonymized samples to see what a delivered pset looks like before you submit.
Sample-work archive includes code, comments, autograder output, and the design-decision notes our tutors leave for each pset.
Browse sample work
Related Coverage
Annotated Jupyter notebooks and pytest-passing scripts for ML, pandas, and algorithm assignments, with PEP 8 formatting and type hints throughout. The most common failure in Data Science labs (Berkeley DATA 100, U of T STA130, Edinburgh INFR11125, NUS DSA1101, IIT Bombay DS203) and Intro Programming psets (Berkeley CS61A, U of T CSC108, Manchester COMP16321, Sydney COMP1531, NUS CS1010E) is silent NumPy broadcasting that produces the wrong output shape without raising, the exact failure mode our tutors catch with assert statements inline. Verified CS graduates from Georgia Tech, BITS Pilani, U of Toronto, Manchester, NUS, and IIT, starting at $20 per task, 12-hour average turnaround.
Regression, classification, clustering, neural networks, gradient descent, and evaluation pipelines with annotated Jupyter notebooks. The hardest CS229 final-project grading deduction is data leakage from incorrect cross-validation splits, the failure mode our tutors catch with stratified k-fold and explicit train-test isolation. Verified CS graduates from Georgia Tech, Purdue, and BITS Pilani with PyTorch and TensorFlow depth, starting at $20 per task, 12-hour average turnaround.
Relational schema design, SQL queries through window functions, normalization to BCNF, index tuning, and transaction isolation analysis. The hardest CMU 15-445 query optimization step is reading the PostgreSQL EXPLAIN ANALYZE output and identifying a missing index that drops a sequential scan to an index scan, the move our tutors annotate inline. Verified CS graduates with PostgreSQL, MySQL, MongoDB depth, starting at $20 per task, 12-hour average turnaround.
DataFrame operations, groupby, merge, time series, and the SettingWithCopyWarning explained for university data science coursework. The top failure mode in data wrangling assignments is chained indexing that triggers SettingWithCopyWarning then silently fails to mutate, the bug our tutors patch with explicit .loc assignment. Verified CS graduates from Georgia Tech, Purdue, and BITS Pilani, starting at $20 per task, 12-hour average turnaround.
FAQ
Reviewed By
Submit your DATA 100 assignment and get a verified CS tutor on it within 12 hours. Every delivery passes the autograder, ships with line-by-line comments, and includes a design-decision walkthrough so you can defend the work in office hours.
Submit DATA 100 Assignment