Titanic Survival Research

Overview

The Titanic dataset is one of the most widely studied beginner machine learning challenges. The task is to predict whether a passenger survived the disaster based on attributes like class, sex, age, family size, fare, and port of embarkation. This summary explains how the dataset is structured, what preprocessing is done, what model is used, and how to interpret the results.

Dataset Features

| Feature | Meaning | Used? |
| --- | --- | --- |
| Survived | Target variable (1=survived, 0=not) | Target |
| Pclass | Ticket class (1=First, 3=Third) | Yes |
| Sex | Male or Female | Yes |
| Age | Age in years | Yes |
| SibSp | # of siblings/spouses aboard | Yes |
| Parch | # of parents/children aboard | Yes |
| Fare | Ticket fare | Yes |
| Embarked | Port of embarkation (S, C, Q) | Yes |
| Name, Cabin, Ticket, PassengerId | Identifiers/non-predictive | No |

Preprocessing Steps

| Step | Action | Purpose |
| --- | --- | --- |
| Drop unused fields | Name, Cabin, Ticket, PassengerId | Not useful for prediction |
| Handle missing | Drop or impute rows with missing values | Clean data for training |
| Encode categories | Sex → 0/1; Embarked → numeric codes | Convert to numeric |
| Scale features | Standardize values (optional) | Stabilizes training |
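
The four steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses pandas (the summary does not name a library), imputes Age with the median, drops rows with no port of embarkation, maps male to 0 and female to 1, and encodes Embarked as one indicator column per port. The function name `preprocess` and the constant `UNUSED_COLUMNS` are hypothetical.

```python
import pandas as pd

# Columns the summary marks as identifiers / non-predictive.
UNUSED_COLUMNS = ["Name", "Cabin", "Ticket", "PassengerId"]


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the preprocessing steps from the table to a raw Titanic frame."""
    # 1. Drop unused fields (silently skip any that are already absent).
    out = df.drop(columns=UNUSED_COLUMNS, errors="ignore")

    # 2. Handle missing values. Assumption: impute Age with the median and
    #    drop rows that have no port of embarkation.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out = out.dropna(subset=["Embarked"])

    # 3. Encode categories. Assumption: male=0, female=1, and one 0/1
    #    indicator column per port instead of a single ordinal code.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out = pd.get_dummies(out, columns=["Embarked"], prefix="Embarked", dtype=int)

    # 4. Optional: standardize continuous columns to zero mean, unit variance.
    for col in ["Age", "Fare"]:
        std = out[col].std(ddof=0)
        if std > 0:
            out[col] = (out[col] - out[col].mean()) / std

    return out.reset_index(drop=True)
```

For brevity the sketch standardizes the whole frame at once. When a held-out test set is used, the scaling statistics would normally be computed on the training rows only.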

Model Used

This project uses a Logistic Regression model. Logistic regression is a simple but powerful classification algorithm. It works by fitting a linear combination of the input features and applying the sigmoid function to output a probability between 0 and 1.
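
To make the "linear combination plus sigmoid" description concrete, here is a minimal sketch in plain Python. The helper names `sigmoid` and `predict_proba` and the dictionary-based weights are assumptions chosen for readability, not part of the project.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(features: dict[str, float], weights: dict[str, float], bias: float) -> float:
    """Linear combination of the input features, passed through the sigmoid."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return sigmoid(z)
```

A score of zero maps to a probability of 0.5, large positive scores approach 1, and large negative scores approach 0.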

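The training process adjusts the weights so that the predicted probabilities best match the actual survival outcomes. Logistic regression is chosen here because it is simple and because each coefficient can be read directly as the effect of one feature on the odds of survival.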

More advanced models like Random Forests or Neural Networks can also be used, but logistic regression remains popular for explaining results to new learners because of its clarity.
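
A training run might look something like the sketch below. It assumes scikit-learn, which the summary does not name, and an already preprocessed numeric feature matrix `X` with a 0/1 target `y`. The function name `train_and_score`, the 80/20 split, and `max_iter=1000` are illustrative choices rather than the project's settings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_and_score(X, y, seed: int = 0):
    """Fit logistic regression on a training split; return the model and held-out accuracy."""
    # Hold out 20% of the rows, keeping the survived/not-survived ratio similar.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    # Fitting adjusts the weights so predicted probabilities match the outcomes.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)
```

After fitting, `model.coef_` holds one weight per feature column and `model.predict_proba` returns the estimated class probabilities.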

Example Coefficients (Illustrative)

| Feature | Coefficient | Interpretation |
| --- | --- | --- |
| Pclass | -1.10 | Lower classes decrease survival odds |
| Sex = female | +2.70 | Being female strongly increases odds |
| Age | -0.03 per year | Older age slightly reduces odds |
| SibSp | -0.35 per person | Large groups reduce odds |
| Parch | -0.10 per person | More dependents reduce odds |
| Fare | +0.015 per unit | Higher fare slightly increases odds |
| Embarked=C | +0.25 | Cherbourg passengers had higher odds |

The coefficients above are illustrative; actual values depend on the train/test split used and on how the features are encoded and scaled.
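
Because logistic regression coefficients act on the log-odds, exponentiating a coefficient gives the factor by which the odds are multiplied for a one-unit increase in that feature. The short sketch below applies this to the illustrative values from the table; the dictionary keys are labels chosen here, and the numbers are not fitted results.

```python
import math

# Illustrative coefficients copied from the table above (not fitted values).
coefficients = {
    "Pclass": -1.10,
    "Sex=female": 2.70,
    "Age": -0.03,
    "SibSp": -0.35,
    "Parch": -0.10,
    "Fare": 0.015,
    "Embarked=C": 0.25,
}

# exp(coefficient) is the multiplicative change in the odds per one-unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:<12} odds ratio = {ratio:.2f}")
```

Read this way, the illustrative +2.70 for female passengers corresponds to odds roughly 15 times higher, and the -1.10 for Pclass cuts the odds to about one third for each step from first toward third class.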

Key Insights

| Factor | Effect | Why |
| --- | --- | --- |
| Sex | Women had much higher survival | “Women and children first” policy |
| Passenger Class | 1st > 2nd > 3rd | Location/access to lifeboats |
| Age | Younger better | Children prioritized |
| Fare | Higher fare helped | Proxy for cabin location/status |
| Family Size | Large families lower odds | Harder to evacuate together |
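
One way to check insights like these against the data is to compare raw survival rates by group. The sketch below assumes pandas and a frame with the columns listed under Dataset Features. The helper names are hypothetical, and defining family size as SibSp + Parch + 1 is a common convention rather than something the summary specifies.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Share of passengers who survived within each value of `column`."""
    return df.groupby(column)["Survived"].mean().sort_values(ascending=False)


def add_family_size(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with FamilySize = SibSp + Parch + 1 (the passenger)."""
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    return out
```

For example, `survival_rate_by(df, "Sex")` and `survival_rate_by(add_family_size(df), "FamilySize")` give the per-group rates behind the Sex and Family Size rows.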