The presented pipeline solves a classical binary classification problem: predicting the survival of Titanic passengers. It demonstrates a data processing and machine learning scenario that assesses which passengers were more likely to survive the Titanic shipwreck.
Based on the information about the Titanic's passengers, predict whether a particular passenger survived the shipwreck.
The general schema of the Binary Classification: Titanic pipeline can be depicted as a sequence:
The full pipeline is presented in the following way:
Overall, the pipeline can be split into the following groups: dataset preprocessing, feature engineering, data splitting, model training, and model testing. Let us consider every group in detail below.
BRICKS:
First, we upload the data from Storage → Samples → titanic.csv and check the number of missing values in the dataset.
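As a rough illustration of what the missing-value check computes, the sketch below counts empty cells per column using only the Python standard library. The inline sample rows and the `count_missing` helper are hypothetical stand-ins; they are not the contents of titanic.csv or the platform's own implementation.

```python
import csv
import io

# Hypothetical sample in the spirit of titanic.csv; the real file has more rows and columns.
SAMPLE_CSV = """survived,age,fare,cabin,embarked
1,29.0,211.34,B5,S
0,,7.25,,S
1,35.0,,,C
0,54.0,51.86,,
"""


def count_missing(rows, columns):
    """Return {column: number of empty cells} for the given rows."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            # Treat empty strings and whitespace-only cells as missing.
            if (row.get(column) or "").strip() == "":
                counts[column] += 1
    return counts


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
missing = count_missing(rows, reader.fieldnames)

for column, n in missing.items():
    print(f"{column}: {n} missing of {len(rows)}")
```

A per-column count like this makes it easy to see which columns can be repaired by filling values and which are too sparse to keep.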
Here we fill the empty values of the age and fare columns with the median calculated on the input sample, and delete the cabin column because of its very high share of missing values. We also apply autosuggestion to the embarked column, which fills its missing values with the S category.
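The treatment of missing values described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the sample rows, the `fill_with_median` helper, and the `EMBARKED_FILL` constant are hypothetical, and the S category is simply hard-coded rather than derived the way the platform's autosuggestion derives it.

```python
from statistics import median

# Hypothetical rows; None marks a missing value.
rows = [
    {"age": 22.0, "fare": 7.25, "cabin": None, "embarked": "S"},
    {"age": None, "fare": 71.28, "cabin": "C85", "embarked": "C"},
    {"age": 26.0, "fare": None, "cabin": None, "embarked": "S"},
    {"age": 35.0, "fare": 8.05, "cabin": None, "embarked": None},
]

# Assumption: the category taken from the text above, fixed by hand here.
EMBARKED_FILL = "S"


def fill_with_median(rows, column):
    """Replace missing values in a numeric column with the median of the observed ones."""
    fill_value = median(r[column] for r in rows if r[column] is not None)
    for r in rows:
        if r[column] is None:
            r[column] = fill_value
    return fill_value


# Numeric columns: median computed on the input sample itself.
for column in ("age", "fare"):
    fill_with_median(rows, column)

for r in rows:
    # Categorical column: fill with the chosen category.
    if r["embarked"] is None:
        r["embarked"] = EMBARKED_FILL
    # Drop the cabin column, which is mostly empty.
    r.pop("cabin", None)

print(rows)
```

The median is less sensitive than the mean to extreme values, which is a common reason to prefer it for skewed columns such as fare.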