UC BERKELEY

Berkeley DATA 100 Homework Help

Q: Do you help with all DATA 100 homework and projects?

Yes. All 12 weekly homework assignments (HW1 through HW12) plus Project A1 housing-price regression and Project A2 spam classification. Coverage spans the 6 course modules: Pandas wrangling, EDA and visualization, sampling and experimentation, modeling and least-squares regression, gradient descent and feature engineering, and classification with logistic regression and decision trees. Every Jupyter notebook deliverable passes Otter-Grader public tests with documented runtime under the cell budget, and includes inline markdown explaining each design choice.

Q: What turnaround do you offer on DATA 100 deliverables?

12-hour average for weekly homework (HW1 through HW12). 24 to 48 hours for Project A1 housing-price regression given the leaderboard tuning. 24 to 48 hours for Project A2 spam classification. Pricing: $20 Debug and Explain per homework, $30 Full Solution per homework, projects priced at $60 to $80 given scope, $40 per hour Live Tutoring. Rush 4 to 6 hours available on HW1 through HW4 for an additional fee.

Project walkthroughs in Python with Pandas, NumPy, scikit-learn, and matplotlib, with Otter-Grader pass guarantees on every homework from EDA through classification. The number-one DATA 100 grading deduction is on Project A1 housing-price regression, where students apply a log-transform to a target that contains 0-valued entries and silently introduce -inf into the target vector, the exact failure mode our tutors annotate inline. Tutors are verified CS and statistics graduates from Berkeley and Stanford; pricing starts at $20 per homework with a 12-hour average turnaround.

Course Overview

About DATA 100 at UC Berkeley

DATA 100 teaches the principles and techniques of data science across 14 weeks under Joseph Gonzalez, Narges Norouzi, and Lisa Yan (recent semesters), with co-development between EECS and the Division of Computing, Data Science, and Society. The course covers 6 modules: (1) Pandas and data wrangling, (2) exploratory data analysis (EDA) and visualization, (3) sampling and experimentation, (4) modeling and least-squares regression, (5) gradient descent and feature engineering, (6) classification with logistic regression, decision trees, and cross-validation. Languages and libraries: Python 3.11 with Pandas 2.x, NumPy 1.26, scikit-learn 1.4, matplotlib, seaborn, plotly, and statsmodels in select labs.

The course assesses through 12 weekly Jupyter notebook homework assignments graded by Otter-Grader (an open-source autograder originally written for DATA 8 and extended for DATA 100), 2 projects (Project A1 housing-price regression, Project A2 spam-classification with logistic regression), a midterm at week 8, and a final at week 14. Lectures Monday and Wednesday at 5 PM in Wheeler Hall (or Pimentel for larger semesters), discussion Friday at varied times. Grading: 30 percent homework, 30 percent projects (15-15), 15 percent midterm, 20 percent final, 5 percent discussion attendance.

The course is the second course in the Data Science major after DATA 8 (Foundations of Data Science) and is a prerequisite for DATA 101 (Data Engineering) and CS 189 (Machine Learning).

Course Reading

Required Textbooks for DATA 100

Learning Data Science: From Practical Foundations to Theoretical Insights

Authored by Sam Lau, Joseph Gonzalez, and Deborah Nolan. Mapped to DATA 100 assignment problems by our tutors.

Computational and Inferential Thinking: The Foundations of Data Science (2nd Edition)

Authored by Ani Adhikari, John DeNero, and David Wagner. Mapped to DATA 100 assignment problems by our tutors.

Python for Data Analysis (3rd Edition)

Authored by Wes McKinney. Mapped to DATA 100 assignment problems by our tutors.

Assignments

Recurring DATA 100 Assignment Types

HW1 (Prerequisites and Plotting)

Verifies the Python 3.11 toolchain, datahub.berkeley.edu Jupyter environment, and Pandas plus matplotlib baseline skills. Tasks: index a Pandas DataFrame by label (loc) and position (iloc), compute groupby-aggregate-rank pipelines on a babynames-style dataset, produce a 2-axis matplotlib subplot. Otter-Grader autograder runs 8 cells with 3 hidden tests per cell, completing in under 30 seconds on the datahub kernel.
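
As a rough illustration of the indexing and groupby-aggregate-rank tasks described above, here is a minimal sketch. The table, its index labels, and its counts are made up for the example and are not the course dataset.

```python
import pandas as pd

# Hypothetical miniature babynames-style table (invented counts).
names = pd.DataFrame(
    {
        "name": ["Ava", "Liam", "Ava", "Liam", "Noah"],
        "year": [2019, 2019, 2020, 2020, 2020],
        "count": [120, 150, 110, 160, 90],
    },
    index=["a", "b", "c", "d", "e"],
)

# Label-based lookup: row labelled "b", column labelled "count".
by_label = names.loc["b", "count"]
# Position-based lookup: second row, third column (same cell).
by_position = names.iloc[1, 2]

# groupby -> aggregate -> rank: total count per name, ranked high to low.
totals = names.groupby("name")["count"].sum()
ranks = totals.rank(ascending=False, method="min").astype(int)
```

Both lookups address the same cell, which is a quick way to check that label and position indexing are not being confused.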

HW2 (Pandas EDA on Babynames Dataset)

Exploratory analysis of the 1880 through 2020 US Social Security Administration babynames dataset (12 million rows). Tasks: compute most-popular name per decade with idxmax, plot name popularity over time, detect anomalies in name-frequency time series. The hidden tests verify vectorized Pandas operations (df.groupby().agg() not df.apply(lambda row)) by timing each cell at under 5 seconds on the 12-million-row dataset.
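
One way the most-popular-name-per-decade task might be approached without a row-wise lambda is sketched below. The six-row frame is invented, and the column names `year`, `name`, and `count` are assumptions about the dataset layout.

```python
import pandas as pd

# Hypothetical tiny stand-in for the babynames table.
babynames = pd.DataFrame(
    {
        "year": [1991, 1995, 1995, 2003, 2007, 2007],
        "name": ["Ann", "Ann", "Bo", "Ann", "Bo", "Cy"],
        "count": [50, 40, 30, 20, 60, 10],
    }
)

# Integer-divide then multiply to bucket years into decades (1995 -> 1990).
babynames["decade"] = babynames["year"] // 10 * 10

# Total count per (decade, name), then one row per decade with names as columns.
totals = babynames.groupby(["decade", "name"])["count"].sum()
wide = totals.unstack("name")

# idxmax along the columns returns the name with the largest total per decade.
top_names = wide.idxmax(axis=1)
```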

HW3 (Visualization with matplotlib and seaborn)

Generate 7 plots covering bar charts, histograms, scatter plots with overlay regressions, heatmaps via sns.heatmap, and a faceted plot via sns.relplot or pandas.plot. Otter tests assert axes labels, title strings, and series colors via matplotlib.pyplot.gca() inspection. The visualization principles graded match the Tufte and Cleveland guidelines covered in lecture: data-to-ink ratio, perceptual ordering, and lie factor below 1.05.

HW5 (Modeling and Least-Squares Regression)

Fit a 1-variable and 2-variable least-squares linear regression by both closed-form (theta = (X^T X)^-1 X^T y) and gradient descent. Tasks: derive the gradient of the squared-error loss, implement batch gradient descent in NumPy with vectorized matrix operations, and benchmark against sklearn.linear_model.LinearRegression. Expected runtime on the staff datahub: under 10 seconds for 100,000 gradient-descent iterations on a 5-feature design matrix.
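
The closed-form and gradient-descent fits described above can be compared on synthetic data. This is a sketch under assumptions: the data, the true coefficients, the learning rate, and the iteration count are all illustrative choices, not staff values. It solves the normal equations with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same theta with better numerical behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X_raw = rng.normal(size=(n, 2))
true_theta = np.array([1.5, -2.0, 0.5])      # made-up intercept and slopes
X = np.column_stack([np.ones(n), X_raw])     # prepend an intercept column
y = X @ true_theta + rng.normal(scale=0.1, size=n)

# Closed form: solve (X^T X) theta = X^T y.
theta_closed = np.linalg.solve(X.T @ X, X.T @ y)


def mse_gradient(theta, X, y):
    """Gradient of the mean squared error (1/n) * ||X theta - y||^2."""
    return (2 / len(y)) * X.T @ (X @ theta - y)


# Batch gradient descent with fully vectorized updates.
theta_gd = np.zeros(X.shape[1])
learning_rate = 0.1
for _ in range(2000):
    theta_gd -= learning_rate * mse_gradient(theta_gd, X, y)
```

On well-scaled features like these, the two estimates should agree to many decimal places.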

HW7 (Feature Engineering and Cross-Validation)

Engineer polynomial and interaction features from a continuous design matrix, then evaluate via 5-fold cross-validation with sklearn.model_selection.KFold. Tasks: one-hot-encode categorical features with pd.get_dummies, scale features to zero-mean unit-variance with sklearn.preprocessing.StandardScaler, and plot training versus validation RMSE across polynomial degree 1 to 10 to identify the bias-variance tradeoff inflection.
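
A minimal sketch of the degree sweep follows, assuming a synthetic cubic target rather than the homework data. It collects training and validation RMSE per degree and leaves the plotting out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (assumption): a cubic signal plus small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=120)
y = x**3 - x + rng.normal(scale=0.05, size=120)
X = x.reshape(-1, 1)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
train_rmse, val_rmse = {}, {}
for degree in range(1, 11):
    model = make_pipeline(
        PolynomialFeatures(degree, include_bias=False),
        StandardScaler(),
        LinearRegression(),
    )
    scores = cross_validate(
        model, X, y, cv=cv,
        scoring="neg_root_mean_squared_error",
        return_train_score=True,
    )
    # The scorer returns negated RMSE, so flip the sign back.
    train_rmse[degree] = -scores["train_score"].mean()
    val_rmse[degree] = -scores["test_score"].mean()

# Degree with the lowest mean validation RMSE.
best_degree = min(val_rmse, key=val_rmse.get)
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no validation-fold statistics leak into the features.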

HW9 (Classification with Logistic Regression)

Fit a binary logistic regression on the breast-cancer dataset (sklearn.datasets.load_breast_cancer, 569 samples, 30 features). Tasks: derive the cross-entropy loss gradient, implement batch gradient descent for logistic regression in NumPy, and benchmark against sklearn.linear_model.LogisticRegression with the lbfgs solver. Evaluate via accuracy, precision, recall, and ROC AUC with sklearn.metrics.roc_auc_score. Plot the ROC curve via matplotlib.
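
The gradient derivation mentioned above leads to a compact update rule. The sketch below uses synthetic two-feature data instead of the breast-cancer dataset, and the learning rate and step count are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cross_entropy_gradient(theta, X, y):
    """Gradient of the mean cross-entropy loss: (1/n) * X^T (sigmoid(X theta) - y)."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


# Synthetic labels from a made-up noisy linear rule.
rng = np.random.default_rng(0)
n = 400
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])   # intercept column first
noise = rng.normal(scale=0.5, size=n)
y = (features[:, 0] - features[:, 1] + noise > 0).astype(float)

# Batch gradient descent on the cross-entropy loss.
theta = np.zeros(3)
for _ in range(2000):
    theta -= 0.5 * cross_entropy_gradient(theta, X, y)

# Training accuracy at a 0.5 probability threshold.
accuracy = np.mean((sigmoid(X @ theta) >= 0.5) == y)
```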

Project A1 (Housing Price Regression)

A 2-week project predicting Ames, Iowa housing sale prices (2006 through 2010) from 81 features. Required pipeline: data cleaning with Pandas (handling 1,460 missing values across 19 columns), feature engineering (log-transform SalePrice and 1stFlrSF, one-hot-encode Neighborhood and HouseStyle, derive TotalSF from 1stFlrSF + 2ndFlrSF + TotalBsmtSF), and fit Ridge regression with sklearn.linear_model.Ridge. The leaderboard ranks submissions by held-out test RMSE, with a 25,000-dollar threshold for full credit.
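
The feature-engineering steps named above might look roughly like this. The three rows are invented, and median imputation of the missing basement area is an assumption about how the cleaning step is done.

```python
import pandas as pd

# Hypothetical three-row sample using the column names from the pipeline above.
homes = pd.DataFrame(
    {
        "1stFlrSF": [856, 1262, 920],
        "2ndFlrSF": [854, 0, 866],
        "TotalBsmtSF": [856.0, None, 920.0],
        "Neighborhood": ["CollgCr", "Veenker", "CollgCr"],
        "HouseStyle": ["2Story", "1Story", "2Story"],
    }
)

# Fill the missing numeric value with the column median (assumed strategy).
homes["TotalBsmtSF"] = homes["TotalBsmtSF"].fillna(homes["TotalBsmtSF"].median())

# Derived feature: total square footage across floors and basement.
homes["TotalSF"] = homes["1stFlrSF"] + homes["2ndFlrSF"] + homes["TotalBsmtSF"]

# One-hot-encode the two categorical columns; numeric columns pass through.
features = pd.get_dummies(homes, columns=["Neighborhood", "HouseStyle"])
```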

Project A2 (Spam Classification with Logistic Regression)

A 2-week project classifying email as spam vs ham on the Enron public corpus (5,172 emails after cleaning). Required pipeline: text preprocessing (lowercase, remove punctuation, strip HTML tags via BeautifulSoup, tokenize on whitespace), feature engineering (bag-of-words via sklearn.feature_extraction.text.CountVectorizer with min_df=5, then TF-IDF transform), and fit Logistic Regression with L2 regularization. The leaderboard ranks by accuracy on the held-out test set with a 0.92 threshold for full credit.
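
The vectorize, TF-IDF, and classify stages chain naturally into a scikit-learn Pipeline. The sketch below runs on a six-message toy corpus invented for the example, so it uses min_df=1 (an assumption; min_df=5 would discard every token in a corpus this small).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (made up): 1 = spam, 0 = ham.
emails = [
    "win cash prize now",
    "free prize claim now",
    "cash bonus win free",
    "meeting agenda attached",
    "lunch meeting tomorrow",
    "project agenda review tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

model = Pipeline(
    [
        # Bag-of-words counts; lowercasing is handled by the vectorizer.
        ("counts", CountVectorizer(lowercase=True, min_df=1)),
        # Re-weight raw counts by inverse document frequency.
        ("tfidf", TfidfTransformer()),
        # LogisticRegression applies L2 regularization by default; C is its inverse strength.
        ("clf", LogisticRegression(C=1.0, max_iter=1000)),
    ]
)
model.fit(emails, labels)

predictions = model.predict(["free cash prize", "meeting agenda tomorrow"])
```

Keeping the vectorizer inside the pipeline ensures the vocabulary is learned only from the data passed to fit, which matters once cross-validation is layered on top.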

Common Pitfalls

Where DATA 100 Students Get Stuck

HW2 babynames groupby-apply slow path

Students new to Pandas reach for `df.groupby("year").apply(lambda group: group.sort_values("count", ascending=False).head(10))` which runs at 60 seconds on the 12-million-row dataset and TLEs the Otter test budget of 30 seconds. The fix: vectorized `df.sort_values(["year", "count"], ascending=[True, False]).groupby("year").head(10)` runs in 2 seconds. The principle: groupby-apply with a Python lambda invokes the lambda once per group with full Pandas overhead; vectorized sort-then-groupby uses C-level operations throughout.
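
Both patterns can be checked against each other on a small frame before trusting the fast one on the full dataset. The data below is invented, and the per-group apply selects its columns explicitly, an assumption made so the sketch behaves the same across Pandas versions.

```python
import pandas as pd

# Hypothetical six-row stand-in for the babynames table.
df = pd.DataFrame(
    {
        "year": [2000, 2000, 2000, 2001, 2001, 2001],
        "name": ["A", "B", "C", "A", "B", "C"],
        "count": [5, 9, 7, 3, 8, 6],
    }
)
top_n = 2

# Slow path: a Python function is called once per group.
slow = df.groupby("year", group_keys=False)[["name", "count"]].apply(
    lambda group: group.sort_values("count", ascending=False).head(top_n)
)

# Fast path: one global sort, then take the first rows of each group.
fast = (
    df.sort_values(["year", "count"], ascending=[True, False])
    .groupby("year")
    .head(top_n)
)
```

Both results keep the original row index, so comparing `slow.index` with `fast.index` is a cheap equivalence check.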

HW5 gradient descent learning rate divergence

A learning rate above 0.01 on a non-standardized design matrix causes gradient descent to diverge (theta grows by 10x per iteration and overflows to inf within 20 steps). Students see RMSE = inf in the autograder output and fail the hidden tests. The fix: standardize the design matrix to zero-mean unit-variance with StandardScaler before gradient descent, then a learning rate of 0.01 to 0.1 converges in 100 to 500 iterations for the staff test inputs.
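
The divergence and the fix can be reproduced in a few lines. In this sketch the square-footage feature, the coefficients, and the step counts are made up, and the standardization is done by hand with the same population standard deviation that StandardScaler uses.

```python
import numpy as np


def run_gd(X, y, learning_rate, steps):
    """Batch gradient descent on mean squared error, starting from zeros."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        theta -= learning_rate * (2 / len(y)) * X.T @ (X @ theta - y)
    return theta


rng = np.random.default_rng(0)
n = 500
sqft = rng.uniform(500, 4000, size=n)          # hypothetical large-scale feature
y = 100 * sqft + rng.normal(scale=1000, size=n)

# Raw feature: the updates overshoot further on every step and overflow.
X_raw = np.column_stack([np.ones(n), sqft])
with np.errstate(over="ignore", invalid="ignore"):
    theta_raw = run_gd(X_raw, y, learning_rate=0.01, steps=200)

# Standardized feature: zero mean, unit variance, same information.
sqft_std = (sqft - sqft.mean()) / sqft.std()
X_std = np.column_stack([np.ones(n), sqft_std])
theta_std = run_gd(X_std, y, learning_rate=0.1, steps=500)
```

After standardization the coefficient on the feature is expressed per standard deviation, so it needs rescaling before it is read as dollars per square foot.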

Project A1 log-transform of zero-valued target

Project A1 recommends log-transforming SalePrice to stabilize the residual variance. Students who apply `np.log(df["SalePrice"])` without checking for zeros silently introduce -inf into the target vector for any row with SalePrice = 0 (rare in this dataset but present in the test fold for some semester variants); NumPy emits only a RuntimeWarning. sklearn's Ridge then refuses to fit and raises a ValueError about infinite values in y, while a hand-written NumPy solver propagates inf and nan into theta and the leaderboard submission fails with RMSE = nan. The fix: `np.log1p(df["SalePrice"])` (which computes log(1 + x) and is safe for x = 0) plus `np.expm1(predictions)` to invert at inference.
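
The difference between the two transforms is easy to see on a tiny array. The prices below are hypothetical.

```python
import numpy as np

sale_price = np.array([0.0, 125000.0, 310000.0])   # hypothetical prices, one zero

# Plain log maps 0 to -inf; NumPy only warns, so the warning is silenced here.
with np.errstate(divide="ignore"):
    naive = np.log(sale_price)

# log1p computes log(1 + x), which is finite at x = 0.
safe = np.log1p(sale_price)

# expm1 computes exp(x) - 1, the exact inverse of log1p.
recovered = np.expm1(safe)
```

Any model trained on the log1p target needs its predictions passed through `np.expm1` before RMSE is computed in dollars.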

Project A2 BeautifulSoup HTML-strip silently dropping body text

BeautifulSoup with the default html.parser silently treats `<body>` text as a top-level NavigableString in some Python 3.11 patch versions, dropping body content from `soup.get_text()`. Students see a 0.85 test-set accuracy that suddenly drops to 0.72 after rerunning the notebook with a different Python patch. The fix: use `BeautifulSoup(text, "html.parser").get_text(separator=" ")` plus a fallback for empty results that re-parses with lxml.

sklearn RandomState seed conventions

DATA 100 standardizes on random_state=42 across all sklearn calls (train_test_split, KFold, RandomForestRegressor) for grading reproducibility. Students who use random_state=None or a different seed see different cross-validation splits and different held-out scores than the autograder expects, causing hidden tests to fail with "expected 0.876 +/- 0.005 but got 0.891". The fix: explicit `random_state=42` on every randomization call.
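
To see why a fixed seed matters, the fold assignments from two separately constructed splitters can be compared. This is a sketch on a ten-row dummy array; `fold_indices` is a helper invented for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # dummy design matrix with 10 rows


def fold_indices(seed):
    """Return the validation row indices of each fold for a given seed."""
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    return [test_idx.tolist() for _, test_idx in cv.split(X)]


# Two independent splitters with the same seed produce identical folds.
first_run = fold_indices(42)
second_run = fold_indices(42)
```

With `random_state=None` the shuffle draws from the global NumPy random state, so the folds generally change from run to run.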

Vectorization benchmark: for-loop vs np.dot

A naive for-loop computing X.dot(theta) for a 100,000-row 5-column design matrix runs in 800 ms in Python; the equivalent `X @ theta` (BLAS-backed) runs in 0.5 ms (1600x speedup). DATA 100 hidden tests budget the vectorized version. Students who write `predictions = [sum(X[i][j] * theta[j] for j in range(5)) for i in range(100000)]` TLE at the 5-second budget. The autograder includes a benchmark cell that flags any cell over the budget.
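
The comparison can be rerun locally with a sketch like the one below. The random inputs are made up, and the measured times will vary by machine, so no specific timing is claimed here.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 5))
theta = rng.normal(size=5)

# Naive version: Python-level loop over every row and column.
start = time.perf_counter()
loop_predictions = [
    sum(X[i][j] * theta[j] for j in range(5)) for i in range(100_000)
]
loop_seconds = time.perf_counter() - start

# Vectorized version: a single matrix-vector product.
start = time.perf_counter()
vectorized_predictions = X @ theta
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.3f} s, vectorized: {vectorized_seconds:.5f} s")
```

The two results agree up to floating-point rounding, so the rewrite changes only the runtime.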

Pandas SettingWithCopyWarning

Code like `df[df["price"] > 100]["sale_status"] = "active"` triggers a SettingWithCopyWarning and silently fails to modify the original DataFrame: the boolean mask `df[df["price"] > 100]` returns a temporary copy, so the second indexing step assigns into that copy and the result is discarded. The fix: `df.loc[df["price"] > 100, "sale_status"] = "active"` selects rows and column in a single indexing step and modifies the original frame. If an independent subset is what you want, take it explicitly with `subset = df[df["price"] > 100].copy()` and assign on `subset`.
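
A short sketch of the two safe patterns, using a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"price": [50, 150, 200], "sale_status": ["new", "new", "new"]}
)

# Single-step .loc assignment: rows by boolean mask, column by label.
df.loc[df["price"] > 100, "sale_status"] = "active"

# Explicit copy when an independent subset is the goal.
expensive = df[df["price"] > 100].copy()
expensive["sale_status"] = "archived"   # does not touch df
```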

Otter-Grader hidden vs public test discrepancy

Otter-Grader runs a subset of tests publicly when students click "Run Otter Tests" in the notebook, but the full hidden test suite runs on Gradescope after submission. Students who pass all visible tests sometimes see 60 percent of hidden tests fail because the hidden tests cover edge cases (empty DataFrames, single-row inputs, all-NaN columns) that the public tests skip. The fix: write defensive code (check len(df) > 0 before groupby, handle pd.isna() in numerical operations) and test with adversarial inputs locally before submitting.
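
A sketch of what testing with adversarial inputs can look like follows. The function, its column names, and the three edge-case frames are hypothetical and stand in for whatever a given homework question asks for.

```python
import numpy as np
import pandas as pd


def mean_price_by_group(df, group_col="neighborhood", value_col="price"):
    """Mean of value_col per group; NaNs are skipped, empty input gives an empty Series."""
    if len(df) == 0:
        return pd.Series(dtype="float64")
    return df.groupby(group_col)[value_col].mean()


# Edge cases that public tests often skip.
empty = pd.DataFrame({"neighborhood": [], "price": []})
single_row = pd.DataFrame({"neighborhood": ["A"], "price": [10.0]})
all_nan = pd.DataFrame({"neighborhood": ["A", "A"], "price": [np.nan, np.nan]})

empty_result = mean_price_by_group(empty)
single_result = mean_price_by_group(single_row)
nan_result = mean_price_by_group(all_nan)
```

Whether an all-NaN group should come back as NaN or be dropped depends on the question, so the expected behavior is worth deciding before writing the check.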

Sample Work

DATA 100 Code from Past Deliveries

Every DATA 100 deliverable ships with annotated code, an autograder transcript, and a line-by-line walkthrough. Browse anonymized samples to see what a delivered pset looks like before you submit.

See sample DATA 100-style assignments we have delivered

Sample-work archive includes code, comments, autograder output, and the design-decision notes our tutors leave for each pset.

FAQ

DATA 100 Tutoring, Frequently Asked Questions

How do you handle the vectorization benchmark in Otter tests?

DATA 100 hidden tests include cell-level runtime budgets enforced by Otter-Grader timing assertions. A typical 100,000-row gradient-descent cell budgets at 5 seconds; a vectorized X @ theta implementation runs in 50 ms. Our deliveries replace every for-loop with a NumPy or Pandas vectorized equivalent (np.dot for matrix-vector, np.einsum for tensor contractions, df.groupby().agg() for grouped aggregations), and we benchmark each cell on the datahub.berkeley.edu kernel before submission.

Can you guarantee the Project A1 housing leaderboard threshold?

Standard Project A1 deliveries land at RMSE 18,000 to 23,000 dollars on the held-out leaderboard test set, comfortably under the 25,000-dollar threshold for full credit. The pipeline: 19 missing-value imputations via median for numerical and mode for categorical, log1p-transform of SalePrice and the long-tailed continuous features (1stFlrSF, GrLivArea, LotArea), one-hot-encoding of Neighborhood and HouseStyle, polynomial features of degree 2 on Quality scores, and Ridge regression with alpha tuned via 5-fold cross-validation across {0.1, 1, 10, 100}.
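
The alpha search over {0.1, 1, 10, 100} can be expressed with GridSearchCV. The sketch below runs on synthetic data with made-up coefficients, so the selected alpha and RMSE say nothing about the housing dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression problem (assumption): six features, two of them irrelevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 6))
coefficients = np.array([3.0, -2.0, 0.0, 0.0, 1.0, 0.5])
y = X @ coefficients + rng.normal(scale=0.5, size=150)

alphas = [0.1, 1, 10, 100]
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alphas},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
# The scorer is negated RMSE, so flip the sign to report RMSE.
cv_rmse = -search.best_score_
```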

What random_state seed do you use?

random_state=42 across every sklearn call: train_test_split, KFold, ShuffleSplit, RandomForestRegressor, LogisticRegression where the solver is non-deterministic. This matches the DATA 100 staff convention documented in the course README. Hidden Otter tests assert specific cross-validation scores (e.g. mean RMSE = 22,847.3 with seed 42) and a different seed produces a different score outside the test tolerance.

Is using CSHH for DATA 100 allowed under the collaboration policy?

DATA 100 publishes its collaboration policy at ds100.org/sp24/syllabus: students may discuss approach with classmates but each submitted notebook must be written individually. CSHH operates as a study reference: every notebook delivery includes inline markdown cells explaining each pipeline step, vectorization patterns documented with cell timing, and a recommendation to retype the solution after reading rather than copy-paste. Whether a specific submission complies with your section's interpretation of the policy is your judgment to make against the published rules.

Do you help with the DATA 100 midterm and final?

Yes. Live tutoring at $40 per hour for midterm prep (week 8 coverage: Pandas, EDA, sampling, OLS regression, gradient descent fundamentals) and final prep (week 14 coverage: all of midterm plus feature engineering, cross-validation, logistic regression, decision trees, model selection). Past exams from Spring 2020 through Spring 2024 are archived at ds100.org/sp24/resources and we work through every problem with the published solutions. Closed-book except a single-side 8.5x11 cheat sheet for the midterm and a 2-sided cheat sheet for the final.

What about DATA 8 (intro) vs DATA 100 (this course)?

DATA 8 is the intro course (Adhikari, DeNero, Wagner) using the datascience package and Python 3. DATA 100 is the follow-on using the standard PyData stack (Pandas, NumPy, scikit-learn, matplotlib) at industry depth. We cover both: DATA 8 at $15 to $25 per homework given the lighter implementation surface, DATA 100 at $20 to $30 per homework. Specify the course number on submission.

How do you handle the Project A2 spam classification leaderboard?

Standard Project A2 deliveries land at test-set accuracy 0.94 to 0.96 on the Enron public corpus held-out fold, above the 0.92 threshold for full credit. The pipeline: BeautifulSoup HTML stripping with lxml fallback, sklearn.feature_extraction.text.CountVectorizer with min_df=5 max_df=0.95, TF-IDF transform via sklearn.feature_extraction.text.TfidfTransformer, and Logistic Regression with L2 regularization (C tuned via 5-fold cross-validation across {0.01, 0.1, 1, 10}). The notebook documents alternative classifiers (Naive Bayes, Linear SVC) tested but rejected for under-performing logistic regression on this corpus.

Do you cover the Pandas 2.x and scikit-learn 1.4 versions?

Yes. Recent DATA 100 semesters pinned Pandas 2.x (first released April 2023) and scikit-learn 1.4 (January 2024 release) on the datahub kernel. The Pandas 2.x changes that affect deliveries: Arrow-backed dtypes via pd.ArrowDtype, removal of df.append() (deprecated since 1.4) in favor of pd.concat, and the opt-in copy-on-write mode that changes how chained assignment behaves. Scikit-learn 1.4 changes: OneHotEncoder drops the deprecated sparse parameter, so code must use sparse_output (introduced in 1.2); HistGradientBoostingRegressor remains a separate estimator and does not replace GradientBoostingRegressor. Deliveries target the pinned versions on the datahub kernel.

Stuck on DATA 100?

Submit your DATA 100 assignment and get a verified CS tutor on it within 12 hours. Every delivery passes the autograder, ships with line-by-line comments, and includes a design-decision walkthrough so you can defend the work in office hours.
