Description

The presented pipeline solves a classical binary classification problem: predicting the survival of Titanic passengers. It demonstrates how data processing and machine learning can be applied to assess which passengers were more likely to survive the Titanic shipwreck.

Problem Statement

Based on the information about the Titanic's passengers, predict if a particular passenger survived the Titanic shipwreck or not.

Dataset

Titanic

Modeling scenario

The general schema of the Binary Classification: Titanic pipeline can be described as a sequence of steps:

  1. Prepare the initial dataset with target variable (survived) and potentially useful explanatory variables
  2. Extract features that may be relevant for model training
  3. Split initial data into train/test sets
  4. Train the prediction models using the train set
  5. Evaluate built models on the test set and select the best one for further use
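The five steps above can be sketched in plain Python, outside of Datrics. This is a minimal illustration under stated assumptions, not the pipeline's actual implementation: the passenger rows are invented for the example (they are not taken from titanic.csv), the feature names `is_female` and `is_first_class` are hypothetical, and the single-feature rule "models" are simple stand-ins for whatever prediction models the pipeline trains.

```python
import random
from collections import Counter

# Step 1: a tiny hand-made dataset with the target variable "survived".
# These rows are invented for illustration, NOT real Titanic records.
passengers = [
    {"sex": "female", "pclass": 1, "survived": 1},
    {"sex": "female", "pclass": 2, "survived": 1},
    {"sex": "female", "pclass": 3, "survived": 0},
    {"sex": "male", "pclass": 1, "survived": 1},
    {"sex": "male", "pclass": 2, "survived": 0},
    {"sex": "male", "pclass": 3, "survived": 0},
    {"sex": "female", "pclass": 1, "survived": 1},
    {"sex": "male", "pclass": 3, "survived": 0},
]


def extract_features(p):
    """Step 2: derive (hypothetical) binary features from a raw row."""
    return {
        "is_female": int(p["sex"] == "female"),
        "is_first_class": int(p["pclass"] == 1),
        "survived": p["survived"],
    }


def split(rows, test_share=0.25, seed=0):
    """Step 3: shuffle a copy of the rows and cut off a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_share))
    return shuffled[n_test:], shuffled[:n_test]


def train_rule(train, feature):
    """Step 4: learn the majority target for each value of one feature."""
    default = Counter(r["survived"] for r in train).most_common(1)[0][0]
    by_value = {}
    for r in train:
        by_value.setdefault(r[feature], Counter())[r["survived"]] += 1
    table = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    # Fall back to the overall majority for feature values unseen in training.
    return lambda r: table.get(r[feature], default)


def accuracy(model, rows):
    """Step 5: share of rows where the prediction matches the target."""
    return sum(model(r) == r["survived"] for r in rows) / len(rows)


features = [extract_features(p) for p in passengers]
train, test = split(features)
models = {f: train_rule(train, f) for f in ("is_female", "is_first_class")}
scores = {name: accuracy(m, test) for name, m in models.items()}
best_name = max(scores, key=scores.get)  # model selected for further use
```

With so few rows the scores carry no meaning; the point is only the order of operations: features are extracted before the split, models see the train set only, and selection is based on the test set.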

Datrics Pipeline

Pipeline Schema

The full pipeline is presented in the following way:

Pipeline Scenario

Overall, the pipeline can be split into the following groups: dataset preprocessing, feature engineering, data splitting, model training, and model testing. Let us consider each group in detail below.

Dataset preprocessing

BRICKS:

Missing Values Treatment

Filter Columns


First, we load the data from Storage → Samples → titanic.csv and check the number of missing values in the dataset.
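Outside of Datrics, the same check can be approximated by counting empty cells per column. The sketch below is only an illustration: the helper name `count_missing` and the toy CSV text are assumptions made for the example, and the toy rows are not taken from titanic.csv.

```python
import csv
import io


def count_missing(csv_text):
    """Count empty cells per column in CSV text that has a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in counts:
            # Treat absent, empty, or whitespace-only cells as missing.
            if (row[name] or "").strip() == "":
                counts[name] += 1
    return counts


# Toy data invented for illustration.
toy_csv = """age,fare,cabin,embarked
22,7.25,,S
,71.28,C85,C
30,,,
"""
missing = count_missing(toy_csv)
```

A column whose missing count is close to the number of rows is a candidate for removal, while columns with only a few gaps are better candidates for imputation.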

Here we fill the empty values of the age and fare columns with the median calculated on the input sample, and delete the cabin column because of its very high share of missing values. We also apply autosuggestion to the embarked column, which fills its missing values with the S category.
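The treatment described above can be sketched in plain Python as follows. This is a hedged illustration rather than the Missing Values Treatment brick itself: the helper `treat_missing_values` is hypothetical, the sample rows are invented, and filling embarked with the most frequent category is an assumption about what the autosuggestion does (it happens to yield S on this toy sample).

```python
from collections import Counter
from statistics import median


def treat_missing_values(rows):
    """Return cleaned copies: median-fill age/fare, mode-fill embarked, drop cabin."""
    # Medians are computed over the non-missing values of the input sample.
    medians = {
        col: median(r[col] for r in rows if r.get(col) is not None)
        for col in ("age", "fare")
    }
    # Assumption: the suggested category is the most frequent non-missing one.
    embarked_mode = Counter(
        r["embarked"] for r in rows if r.get("embarked") is not None
    ).most_common(1)[0][0]

    cleaned = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != "cabin"}  # drop the cabin column
        for col in ("age", "fare"):
            if new.get(col) is None:
                new[col] = medians[col]
        if new.get("embarked") is None:
            new["embarked"] = embarked_mode
        cleaned.append(new)
    return cleaned


# Toy rows invented for illustration; None marks a missing value.
sample = [
    {"age": 22.0, "fare": 7.25, "cabin": None, "embarked": "S"},
    {"age": None, "fare": 71.28, "cabin": "C85", "embarked": "C"},
    {"age": 30.0, "fare": None, "cabin": None, "embarked": None},
    {"age": 40.0, "fare": 8.05, "cabin": None, "embarked": "S"},
]
cleaned = treat_missing_values(sample)
```

The median is used instead of the mean because it is less sensitive to outliers, such as a few very expensive tickets in the fare column. The function builds new rows and leaves the input unchanged.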