Contents:
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Modeling
5. Evaluation
6. Deployment
Submit the results to Kaggle
Report writing
What kind of people were more likely to survive on the Titanic?
Download the data from the Kaggle Titanic project page: https://www.kaggle.com/c/titanic
# Suppress warning messages
import warnings
warnings.filterwarnings('ignore')
# Import the data processing packages
import numpy as np
import pandas as pd
# Training data set
train = pd.read_csv('E:\\liaoyuanhao\\train.csv')
# Test data set
test = pd.read_csv('E:\\liaoyuanhao\\test.csv')
# Keep in mind that the training data set has 891 rows;
# this makes it easy to split the test data set back out later when submitting results to Kaggle
print('Training data set:', train.shape, 'Test data set:', test.shape)
Training data set: (891, 12) Test data set: (418, 11)
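# Merge the data sets so that both can be cleaned at the same time
# (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call)
full = pd.concat([train, test], ignore_index=True)
print('Combined data set:', full.shape)
Combined data set: (1309, 12)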
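The comment above about remembering the 891 training rows can be made concrete. The following is a minimal sketch, assuming toy stand-in frames (`train_demo`, `test_demo` and their values are hypothetical, not the Kaggle data): the training row count is stored before merging, and the same count is used to slice the cleaned data back into its two parts.

```python
import pandas as pd

# Toy stand-ins for train/test (hypothetical values, not the Kaggle data)
train_demo = pd.DataFrame({"PassengerId": [1, 2, 3],
                           "Age": [22.0, 38.0, None],
                           "Survived": [0, 1, 1]})
test_demo = pd.DataFrame({"PassengerId": [4, 5],
                          "Age": [35.0, None]})

# Remember how many rows belong to the training set before merging
n_train = len(train_demo)
full_demo = pd.concat([train_demo, test_demo], ignore_index=True)

# Clean both parts in one pass (here: fill Age with the mean)
full_demo["Age"] = full_demo["Age"].fillna(full_demo["Age"].mean())

# Split back: the first n_train rows are training data, the rest are test data
train_clean = full_demo.iloc[:n_train]
test_clean = full_demo.iloc[n_train:].drop(columns="Survived")
print(train_clean.shape, test_clean.shape)
```

With the real data, `n_train` would be 891, so the test rows are the ones from position 891 onward.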
full.head()
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
'''
describe() only reports descriptive statistics for numeric columns; columns of other types are not shown
'''
full.describe()
| | Age | Fare | Parch | PassengerId | Pclass | SibSp | Survived |
---|---|---|---|---|---|---|---|
count | 1046.000000 | 1308.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 1309.000000 | 891.000000 |
mean | 29.881138 | 33.295479 | 0.385027 | 655.000000 | 2.294882 | 0.498854 | 0.383838 |
std | 14.413493 | 51.758668 | 0.865560 | 378.020061 | 0.837836 | 1.041658 | 0.486592 |
min | 0.170000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 21.000000 | 7.895800 | 0.000000 | 328.000000 | 2.000000 | 0.000000 | 0.000000 |
50% | 28.000000 | 14.454200 | 0.000000 | 655.000000 | 3.000000 | 0.000000 | 0.000000 |
75% | 39.000000 | 31.275000 | 0.000000 | 982.000000 | 3.000000 | 1.000000 | 1.000000 |
max | 80.000000 | 512.329200 | 9.000000 | 1309.000000 | 3.000000 | 8.000000 | 1.000000 |
# View the data type of each column and the number of non-null values
full.info()
'''
The data has 1309 rows in total.
Numeric columns with missing data (Age, Fare):
(1) Age has 1046 values, so 1309-1046=263 are missing, a missing rate of 263/1309=20.1%
(2) Fare has 1308 values, so 1 is missing
String columns with missing data (Embarked, Cabin):
(1) Embarked has 1307 values, so only 2 are missing, which is few
(2) Cabin has 295 values, so 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%,
which is relatively large
This tells us where to focus the data cleaning that follows: only by knowing which data is missing
can we deal with it in a targeted way
'''
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
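The missing counts and rates can also be computed rather than worked out by hand. This is a minimal sketch that starts from the non-null counts printed by `full.info()` above; the names `non_null`, `missing`, `missing_rate` and `summary` are illustrative only.

```python
import pandas as pd

total_rows = 1309
# Non-null counts copied from the full.info() output above
non_null = pd.Series({"Age": 1046, "Cabin": 295, "Embarked": 1307, "Fare": 1308})

missing = total_rows - non_null                      # number of missing values per column
missing_rate = (missing / total_rows * 100).round(1) # missing rate in percent

summary = pd.DataFrame({"missing": missing, "rate_%": missing_rate})
print(summary)

# On the real DataFrame the same counts come from: full.isnull().sum()
```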
To train a model, many machine learning algorithms require that the features
passed in contain no null values. Common ways to handle missing values:
1. If the column is numeric, replace missing values with the mean
2. If the column is categorical, replace missing values with the most common category
3. Use a model to predict the missing values, for example K-NN
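The first two strategies in the list above can be sketched as a small helper. This is an illustration under stated assumptions, not the method used in this notebook: `fill_missing` and the `demo` frame are hypothetical names, and the helper should only be applied to feature columns (on the merged data it would also fill the label column `Survived`, which must be left alone). Model-based imputation (strategy 3) is not shown.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def fill_missing(df):
    """Return a copy: numeric NaNs become the column mean, other NaNs the column mode."""
    out = df.copy()
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() ignores NaN by default; take the first value if there is a tie
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Toy data (hypothetical values)
demo = pd.DataFrame({"Age": [20.0, None, 40.0],
                     "Embarked": ["S", "S", None]})
filled = fill_missing(demo)
print(filled)
```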
'''
For numeric columns, the simplest way to handle missing values is to fill them with the mean
'''
print('Before processing:')
full.info()
# Age
full['Age']=full['Age'].fillna(full['Age'].mean())
# Fare
full['Fare']=full['Fare'].fillna(full['Fare'].mean())
print('After processing:')
full.info()
Before processing:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
After processing:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 1309 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1309 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
# Check if data processing is normal
full.head(5)
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | NaN | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | NaN | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
'''
The data has 1309 rows in total.
String columns:
(1) Embarked has 1307 values, so only 2 are missing, which is relatively few.
(2) Cabin has 295 values, so 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%, which is
relatively large
'''
# Embarked: see what the data looks like
'''
Port of departure: S = Southampton, UK
Port of call 1: C = Cherbourg, France
Port of call 2: Q = Queenstown, Ireland
'''
full['Embarked'].head()
0 S
1 C
2 S
3 S
4 S
Name: Embarked, dtype: object
'''
Embarked is a categorical variable: find the most common category and fill the missing values with it
'''
full['Embarked'].value_counts()
S 914
C 270
Q 123
Name: Embarked, dtype: int64
'''
From the results, S is the most common category, so we fill the missing values with
the most frequent value: S = Southampton
'''
full['Embarked']=full['Embarked'].fillna('S')
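The fill value does not have to be hardcoded. As a minimal sketch, the most common category can be taken from the column itself; the series below is rebuilt from the `value_counts()` figures shown above plus the 2 missing entries, and `embarked_demo` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Rebuild a series with the counts reported above: S 914, C 270, Q 123, and 2 missing
embarked_demo = pd.Series(["S"] * 914 + ["C"] * 270 + ["Q"] * 123 + [np.nan] * 2,
                          name="Embarked")

# mode() ignores NaN and returns the most frequent value(s); take the first
most_common = embarked_demo.mode()[0]
embarked_filled = embarked_demo.fillna(most_common)

print(most_common, embarked_filled.value_counts().to_dict())
```

On the real data this would read `full['Embarked'].fillna(full['Embarked'].mode()[0])`, which gives the same result as filling with 'S'.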
# Cabin: see what the data looks like
full['Cabin'].head()
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
'''
Many cabin numbers are missing. The missing values are filled with U,
meaning Unknown.
'''
full['Cabin'] = full['Cabin'].fillna('U')
# Check if data processing is normal
full.head()
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22.0 | U | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
2 | 26.0 | U | S | 7.9250 | Heikkinen, Miss. Laina | 0 | 3 | 3 | female | 0 | 1.0 | STON/O2. 3101282 |
3 | 35.0 | C123 | S | 53.1000 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 4 | 1 | female | 1 | 1.0 | 113803 |
4 | 35.0 | U | S | 8.0500 | Allen, Mr. William Henry | 0 | 5 | 3 | male | 0 | 0.0 | 373450 |
'''
Review the final result of handling the missing values. Note that the Survived column
is our label, which is what the machine learning model predicts,
so there is no need to process this column
'''
full.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age 13