The Titanic dataset is one of the most widely studied beginner machine learning challenges. The task is to predict whether a passenger survived the disaster based on attributes like class, sex, age, family size, fare, and port of embarkation. This summary explains how the dataset is structured, what preprocessing is done, what model is used, and how to interpret the results.

The dataset's columns and how this project uses them:

Feature | Meaning | Used? |
---|---|---|
Survived | Target variable (1=survived, 0=not) | Target |
Pclass | Ticket class (1=First, 2=Second, 3=Third) | Yes |
Sex | Male or Female | Yes |
Age | Age in years | Yes |
SibSp | # of siblings/spouses aboard | Yes |
Parch | # of parents/children aboard | Yes |
Fare | Ticket fare | Yes |
Embarked | Port of embarkation (S, C, Q) | Yes |
Name, Cabin, Ticket, PassengerId | Identifiers/non-predictive | No |

Preprocessing steps:

Step | Action | Purpose |
---|---|---|
Drop unused fields | Name, Cabin, Ticket, PassengerId | Not useful for prediction |
Handle missing values | Drop rows with missing values, or impute the missing entries | Clean data for training
Encode categories | Sex → 0/1; Embarked → numeric codes | Convert to numeric |
Scale features | Standardize values (optional) | Stabilizes training |
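
The steps in the table can be sketched without any third-party libraries. This is a minimal illustration, not the project's actual pipeline: the four rows are made-up values that only mimic the Titanic column layout, and the specific choices (median for `Age`, most common port for `Embarked`, the `SEX_CODES` and `EMBARKED_CODES` mappings) are assumptions picked for the example.

```python
from collections import Counter
from statistics import mean, median, pstdev

# Made-up rows that only mimic the Titanic column layout (not real passengers).
rows = [
    {"PassengerId": 1, "Survived": 0, "Pclass": 3, "Name": "A", "Sex": "male",
     "Age": 22.0, "SibSp": 1, "Parch": 0, "Ticket": "T1", "Fare": 7.25,
     "Cabin": None, "Embarked": "S"},
    {"PassengerId": 2, "Survived": 1, "Pclass": 1, "Name": "B", "Sex": "female",
     "Age": 38.0, "SibSp": 1, "Parch": 0, "Ticket": "T2", "Fare": 71.28,
     "Cabin": "C85", "Embarked": "C"},
    {"PassengerId": 3, "Survived": 1, "Pclass": 3, "Name": "C", "Sex": "female",
     "Age": None, "SibSp": 0, "Parch": 0, "Ticket": "T3", "Fare": 7.92,
     "Cabin": None, "Embarked": "S"},
    {"PassengerId": 4, "Survived": 0, "Pclass": 3, "Name": "D", "Sex": "male",
     "Age": 35.0, "SibSp": 0, "Parch": 0, "Ticket": "T4", "Fare": 8.05,
     "Cabin": None, "Embarked": None},
]

UNUSED_FIELDS = {"Name", "Cabin", "Ticket", "PassengerId"}
SEX_CODES = {"male": 0, "female": 1}        # assumed encoding
EMBARKED_CODES = {"S": 0, "C": 1, "Q": 2}   # assumed encoding


def preprocess(rows):
    """Return cleaned copies of the rows; the input rows are left untouched."""
    # Step 1: drop the fields the model does not use.
    kept = [{k: v for k, v in row.items() if k not in UNUSED_FIELDS}
            for row in rows]

    # Step 2: impute missing values (median Age, most common Embarked).
    age_fill = median(r["Age"] for r in kept if r["Age"] is not None)
    ports = [r["Embarked"] for r in kept if r["Embarked"] is not None]
    port_fill = Counter(ports).most_common(1)[0][0]

    for r in kept:
        if r["Age"] is None:
            r["Age"] = age_fill
        if r["Embarked"] is None:
            r["Embarked"] = port_fill
        # Step 3: encode the categorical columns as numbers.
        r["Sex"] = SEX_CODES[r["Sex"]]
        r["Embarked"] = EMBARKED_CODES[r["Embarked"]]
    return kept


def standardize(rows, column):
    """Optional step 4: rescale one column in place to mean 0, std 1."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    for r in rows:
        r[column] = (r[column] - mu) / sigma if sigma else 0.0


clean = preprocess(rows)
standardize(clean, "Fare")
```

Numeric codes for `Embarked` follow the table, but they impose an artificial ordering on the ports; one indicator column per port is a common alternative.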

This project uses a logistic regression model, a simple but effective classification algorithm. It computes a linear combination of the input features (a weighted sum plus an intercept) and applies the sigmoid function to turn that value into a probability between 0 and 1.
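
As a rough sketch of that prediction step, the function below computes the weighted sum and passes it through the sigmoid. The weights, bias, and the two-feature layout are hypothetical values chosen only to show the mechanics; they are not the project's trained parameters.

```python
import math


def sigmoid(z: float) -> float:
    """Squash any real number into the range 0 to 1 (numerically stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(features, weights, bias):
    """Survival probability: sigmoid of the weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical parameters for the features [Pclass, Sex (1 = female)].
weights = [-1.0, 2.5]
bias = 0.5

# Third-class female passenger: z = 0.5 - 3.0 + 2.5 = 0.0, so p = 0.5.
p = predict_proba([3, 1], weights, bias)
```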
Training adjusts the weights so that the predicted probabilities match the actual survival outcomes as closely as possible. Logistic regression is chosen here because it is simple and its results are easy to interpret.
More advanced models such as random forests or neural networks can also be used, but logistic regression remains popular for explaining results to new learners because of its clarity.
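
To make the training idea concrete, here is a minimal sketch that fits the weights by batch gradient descent on the log-loss. The source does not say which optimizer the project uses, so gradient descent is an assumption, and the eight training rows are made up for illustration. Libraries such as scikit-learn provide a ready-made, regularized implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi                      # gradient of log-loss w.r.t. z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        # Step against the average gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Made-up training rows: [Sex (1 = female), Pclass] -> Survived.
X = [[1, 1], [1, 2], [1, 3], [0, 1], [0, 2], [0, 3], [1, 3], [0, 1]]
y = [1, 1, 1, 0, 0, 0, 0, 1]

w, b = train(X, y)
print("weights:", w, "bias:", b)
```

On this toy data the learned weight for `Sex` comes out positive and the one for `Pclass` negative, the same directions as in the coefficient table.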

Illustrative model coefficients, in log-odds units:

Feature | Coefficient | Interpretation |
---|---|---|
Pclass | -1.10 per class step | Each step down in class (higher Pclass number) reduces survival odds |
Sex = female | +2.70 | Being female strongly increases odds |
Age | -0.03 per year | Older age slightly reduces odds |
SibSp | -0.35 per person | Each additional sibling or spouse aboard reduces odds |
Parch | -0.10 per person | Each additional parent or child aboard slightly reduces odds |
Fare | +0.015 per unit | Higher fare slightly increases odds |
Embarked=C | +0.25 | Cherbourg passengers had higher odds |

The coefficients above are illustrative; actual values depend on how the data is split and preprocessed for training.
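
Because logistic regression coefficients are on the log-odds scale, exponentiating one gives an odds ratio: the factor by which the odds of survival are multiplied for a one-unit increase in that feature, with the others held fixed. The sketch below applies this to the illustrative values from the table; the dictionary keys are names invented for the example.

```python
import math

# Illustrative coefficients copied from the table above (log-odds units).
coefficients = {
    "Pclass": -1.10,
    "Sex_female": 2.70,
    "Age": -0.03,
    "SibSp": -0.35,
    "Parch": -0.10,
    "Fare": 0.015,
    "Embarked_C": 0.25,
}

# exp(coefficient) is the odds ratio for a one-unit increase in the feature.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:>10}: odds multiplied by {ratio:.2f} per unit increase")
```

With these numbers, being female multiplies the odds of survival by roughly 15 (exp(2.70) ≈ 14.9), while each step down in class multiplies them by about 0.33.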

Key factors behind survival:

Factor | Effect | Why |
---|---|---|
Sex | Women had much higher survival | “Women and children first” policy |
Passenger Class | 1st > 2nd > 3rd | Location/access to lifeboats |
Age | Younger passengers fared better | Children prioritized |
Fare | Higher fare helped | Proxy for cabin location/status |
Family Size | Large families lower odds | Harder to evacuate together |
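
Patterns like those in the table are usually checked by comparing survival rates across groups. The helper below is one way to do that, under stated assumptions: the rows are made up purely to exercise the code and are not taken from the dataset, and family size is computed as `SibSp + Parch + 1`, a common convention that the source does not spell out.

```python
from collections import defaultdict


def survival_rate_by(rows, key):
    """Return {group: fraction survived}, with groups given by key(row)."""
    totals = defaultdict(int)
    survived = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        survived[group] += row["Survived"]
    return {group: survived[group] / totals[group] for group in totals}


# Made-up rows, only to exercise the helper (not real passengers).
toy_rows = [
    {"Sex": "female", "SibSp": 0, "Parch": 0, "Survived": 1},
    {"Sex": "female", "SibSp": 1, "Parch": 0, "Survived": 1},
    {"Sex": "male", "SibSp": 0, "Parch": 0, "Survived": 0},
    {"Sex": "male", "SibSp": 1, "Parch": 2, "Survived": 0},
    {"Sex": "male", "SibSp": 0, "Parch": 0, "Survived": 1},
]

by_sex = survival_rate_by(toy_rows, lambda r: r["Sex"])
# Assumed definition: the passenger plus siblings/spouses plus parents/children.
by_family_size = survival_rate_by(toy_rows, lambda r: r["SibSp"] + r["Parch"] + 1)

print(by_sex)
print(by_family_size)
```

Run against the real training data, the same helper would give the per-group rates that the table summarizes.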