Titanic Survival Prediction


1.Business Understanding

2.Data Understanding

  • Data collecting
  • Data importing
  • View data set information

3.Data Preparation

  • Data preprocessing
  • Feature Engineering
  • Feature selection


  • Build training and test data sets
  • Choose a machine learning algorithm
  • Training model



Submit the results to Kaggle

Report writing

1.Raise a question

What kind of people were more likely to survive on the Titanic?

2.Data understanding

2.1 Data collecting

Download the data from the Kaggle Titanic project page:

2.2 Data importing

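# Ignore warning prompts
import warnings
warnings.filterwarnings('ignore')
# Import data processing packages
import numpy as np
import pandas as pd
# Training data set
train = pd.read_csv('E:\\liaoyuanhao\\train.csv')
# Test data set
test = pd.read_csv('E:\\liaoyuanhao\\test.csv')
# Keep in mind that the training data set has 891 rows.
# This makes it easy to split the test data set back out later when submitting results to Kaggle.
print('Training data set:', train.shape, 'Test data set:', test.shape)
Training data set: (891, 12) Test data set: (418, 11)
# Merge the data sets so that both can be cleaned at the same time.
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
# sort=True orders the columns alphabetically, matching the output shown below.
full = pd.concat([train, test], ignore_index=True, sort=True)
print('Merged data set:', full.shape)
Merged data set: (1309, 12)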
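The merge-then-split pattern described in the comments above can be sketched as follows. This is a minimal illustration, not the notebook's own code: the two small frames stand in for train.csv and test.csv, their values are made up, and the names train_rows, train_part and test_part are hypothetical.

```python
import pandas as pd

# Tiny stand-ins for train.csv and test.csv (values are illustrative only)
train = pd.DataFrame({'PassengerId': [1, 2, 3],
                      'Age': [22.0, 38.0, None],
                      'Survived': [0, 1, 1]})
test = pd.DataFrame({'PassengerId': [4, 5],
                     'Age': [34.5, None]})

# Remember how many rows came from the training set
train_rows = train.shape[0]

# Stack both sets; Survived becomes NaN for the rows that came from test
full = pd.concat([train, test], ignore_index=True)

# After cleaning, split the merged frame back by row position
train_part = full.iloc[:train_rows]
test_part = full.iloc[train_rows:].drop(columns='Survived')

print(full.shape, train_part.shape, test_part.shape)
```

Splitting by position only works if no rows are dropped or reordered during cleaning, which is why the training row count is worth keeping in mind.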

2.3 View data set information

   Age   Cabin  Embarked  Fare     Name                                               Parch  PassengerId  Pclass  Sex     SibSp  Survived  Ticket
0  22.0  NaN    S         7.2500   Braund, Mr. Owen Harris                            0      1            3       male    1      0.0       A/5 21171
1  38.0  C85    C         71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th...  0      2            1       female  1      1.0       PC 17599
2  26.0  NaN    S         7.9250   Heikkinen, Miss. Laina                             0      3            3       female  0      1.0       STON/O2. 3101282
3  35.0  C123   S         53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)       0      4            1       female  1      1.0       113803
4  35.0  NaN    S         8.0500   Allen, Mr. William Henry                           0      5            3       male    0      0.0       373450
describe() only shows descriptive statistics for numeric columns; columns of other types are not shown
       Age          Fare         Parch        PassengerId  Pclass       SibSp        Survived
count  1046.000000  1308.000000  1309.000000  1309.000000  1309.000000  1309.000000  891.000000
mean   29.881138    33.295479    0.385027     655.000000   2.294882     0.498854     0.383838
std    14.413493    51.758668    0.865560     378.020061   0.837836     1.041658     0.486592
min    0.170000     0.000000     0.000000     1.000000     1.000000     0.000000     0.000000
25%    21.000000    7.895800     0.000000     328.000000   2.000000     0.000000     0.000000
50%    28.000000    14.454200    0.000000     655.000000   3.000000     0.000000     0.000000
75%    39.000000    31.275000    0.000000     982.000000   3.000000     1.000000     1.000000
max    80.000000    512.329200   9.000000     1309.000000  3.000000     8.000000     1.000000
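To illustrate why only seven columns appear in the summary, here is a small sketch on a made-up three-row frame (the values are illustrative, not the full data set). It assumes the default behaviour of describe() on a mixed frame and uses include='all' as one way to bring the non-numeric columns back in.

```python
import pandas as pd

# Small illustrative frame with two numeric columns and one string column
df = pd.DataFrame({'Age': [22.0, 38.0, 26.0],
                   'Fare': [7.25, 71.2833, 7.925],
                   'Sex': ['male', 'female', 'female']})

# Default: only numeric columns are summarised
numeric_summary = df.describe()

# include='all' also summarises non-numeric columns (count, unique, top, freq)
all_summary = df.describe(include='all')

print(list(numeric_summary.columns))
print(list(all_summary.columns))
```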
# View the data type of each column and the number of non-null values
The data set has 1309 rows in total.
Numeric columns: Age and Fare have missing data:
(1) Age has 1046 non-null values, so 1309-1046=263 are missing, a missing rate of 263/1309=20.1%
(2) Fare has 1308 non-null values, so 1 value is missing
String columns:
(1) Embarked has 1307 non-null values, so only 2 are missing, which is few.
(2) Cabin has 295 non-null values, so 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%, which is relatively large
This sets the direction for the next data cleaning step: only by knowing which data is missing can we deal with it in a targeted manner
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
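The missing counts and missing rates quoted above follow directly from the non-null counts in the info() output. The sketch below reproduces that arithmetic in plain Python; the dictionary is copied by hand from the output, and the variable names are chosen here for illustration.

```python
# Non-null counts copied from the info() output above
total_rows = 1309
non_null = {'Age': 1046, 'Fare': 1308, 'Embarked': 1307, 'Cabin': 295}

# Missing count = total rows minus non-null count
missing = {col: total_rows - n for col, n in non_null.items()}

# Missing rate as a percentage, rounded to one decimal place
missing_rate = {col: round(100 * m / total_rows, 1) for col, m in missing.items()}

for col in non_null:
    print(col, missing[col], str(missing_rate[col]) + '%')
```

On the real DataFrame the same counts can be obtained with full.isnull().sum().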

3.Data preparation

3.1 Data preprocessing

Missing value processing

Many machine learning algorithms require that the features passed in contain no null values, so missing values must be handled before a model can be trained. Common approaches:

1. If the column is numeric, replace missing values with the mean

2. If the column is categorical, replace missing values with the most common category

3. Use a model to predict the missing values, for example K-NN

For numeric columns, the simplest way to handle missing values is to fill them with the mean
# Age
# Fare
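A minimal sketch of mean filling for these two columns, assuming the merged frame is named full as earlier. The three-row frame here is a made-up stand-in, and the exact statements used in the original notebook may differ.

```python
import pandas as pd

# Illustrative stand-in for the merged data set (values are made up)
full = pd.DataFrame({'Age': [22.0, None, 26.0],
                     'Fare': [7.25, 71.2833, None]})

# Replace missing Age values with the column mean (mean() skips NaN)
full['Age'] = full['Age'].fillna(full['Age'].mean())

# Replace missing Fare values with the column mean
full['Fare'] = full['Fare'].fillna(full['Fare'].mean())

print(full)
```

The two info() listings that follow show the effect on the real data: Age and Fare go from 1046 and 1308 non-null values to 1309 each.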
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
# Check if data processing is normal
   Age   Cabin  Embarked  Fare     Name                                               Parch  PassengerId  Pclass  Sex     SibSp  Survived  Ticket
0  22.0  NaN    S         7.2500   Braund, Mr. Owen Harris                            0      1            3       male    1      0.0       A/5 21171
1  38.0  C85    C         71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th...  0      2            1       female  1      1.0       PC 17599
2  26.0  NaN    S         7.9250   Heikkinen, Miss. Laina                             0      3            3       female  0      1.0       STON/O2. 3101282
3  35.0  C123   S         53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)       0      4            1       female  1      1.0       113803
4  35.0  NaN    S         8.0500   Allen, Mr. William Henry                           0      5            3       male    0      0.0       373450
The data set has 1309 rows in total.
String columns:
(1) Embarked has 1307 non-null values, so only 2 are missing, which is relatively few.
(2) Cabin has 295 non-null values, so 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%, which is relatively large
# Embarked: see what the data looks like
Port of departure: S = Southampton, UK
Port of call 1: C = Cherbourg, France
Port of call 2: Q = Queenstown, Ireland
0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object
Embarked is a categorical variable, so look at the most common category and fill the missing values with it
S    914
C    270
Q    123
Name: Embarked, dtype: int64
From the results, S is the most common category, so we fill the missing values with the most frequent value: S = Southampton
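The count-then-fill step for a categorical column can be sketched like this. The short Series is a made-up stand-in for the Embarked column, and most_common and filled are names chosen here for illustration.

```python
import pandas as pd

# Illustrative stand-in for the Embarked column, with two missing entries
embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S', None], name='Embarked')

# Count each category; missing values are excluded by default
counts = embarked.value_counts()

# The category with the highest count
most_common = counts.idxmax()

# Fill the missing entries with that category
filled = embarked.fillna(most_common)

print(counts)
print(filled.tolist())
```

On the real data the equivalent fill would be applied to full['Embarked'] with the value 'S'.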
# Cabin: See what the data looks like
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
Many values are missing. Missing cabin numbers are filled with U, indicating Unknown.
full['Cabin'] = full['Cabin'].fillna('U')
# Check if data processing is normal
   Age   Cabin  Embarked  Fare     Name                                               Parch  PassengerId  Pclass  Sex     SibSp  Survived  Ticket
0  22.0  U      S         7.2500   Braund, Mr. Owen Harris                            0      1            3       male    1      0.0       A/5 21171
1  38.0  C85    C         71.2833  Cumings, Mrs. John Bradley (Florence Briggs Th...  0      2            1       female  1      1.0       PC 17599
2  26.0  U      S         7.9250   Heikkinen, Miss. Laina                             0      3            3       female  0      1.0       STON/O2. 3101282
3  35.0  C123   S         53.1000  Futrelle, Mrs. Jacques Heath (Lily May Peel)       0      4            1       female  1      1.0       113803
4  35.0  U      S         8.0500   Allen, Mr. William Henry                           0      5            3       male    0      0.0       373450
Finally, check the result of the missing value processing. Note the survival column (Survived):
it is our label, used for machine learning prediction,
so there is no need to process it
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            13
