Python Data Analysis 006, Example Project: Titanic

Tags: python, kaggle, machine learning, data mining

Survival prediction on the Titanic

Directory:

1.Business Understanding

2.Data Understanding

Data collecting
Data importing
View data set information

3.Data Preparation

Data preprocessing
Feature Engineering
Feature selection

4.Modeling

Build training and test data sets
Choose a machine learning algorithm
Training model

5.Evaluation

6.Deployment

Submit the results to kaggle

Report writing

1.Raise a question

What kinds of people were more likely to survive on the Titanic?

2.Data understanding

2.1 Data collecting

Download the data from the Kaggle Titanic project page: https://www.kaggle.com/c/titanic

2.2 Data importing

# Ignore warning prompts
import warnings
warnings.filterwarnings('ignore')
# Import the data processing packages
import numpy as np
import pandas as pd

# Training data set
train = pd.read_csv('E:\\liaoyuanhao\\train.csv')
# Test data set
test = pd.read_csv('E:\\liaoyuanhao\\test.csv')
# Keep in mind that the training data set has 891 rows.
# That makes it easy to split the test data set back out later when submitting results to Kaggle.
print('Training data set:', train.shape, 'Test data set:', test.shape)

Training data set: (891, 12) Test data set: (418, 11)

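# Merge the two data sets so that both can be cleaned at the same time.
# sort=True orders the columns alphabetically, matching the outputs shown below.
full = pd.concat([train, test], ignore_index=True, sort=True)
print('Merged data set:', full.shape)

Merged data set: (1309, 12)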
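The merge-then-split idea mentioned in the comments can be sketched on toy data. This is only an illustration: the frames `train_demo` and `test_demo` and their values are made up, and the names `n_train`, `train_part`, and `test_part` are hypothetical, not taken from the original notebook.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (made-up values, not the real data)
train_demo = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, None, 26.0],
    "Survived": [0, 1, 1],
})
test_demo = pd.DataFrame({
    "PassengerId": [4, 5],
    "Age": [35.0, None],
})

# Remember how many rows came from the training set (891 in the real data)
n_train = len(train_demo)

# Stack the rows; Survived is absent from the test frame, so it becomes NaN there
full_demo = pd.concat([train_demo, test_demo], ignore_index=True)

# ... cleaning of full_demo would happen here ...

# Split back by position: the first n_train rows are the training rows
train_part = full_demo.iloc[:n_train]
test_part = full_demo.iloc[n_train:].drop(columns="Survived")

print(full_demo.shape)
print(test_part.shape)
```

Splitting by position only works because `ignore_index=True` keeps the training rows first and in order, so any step that reorders or drops rows would need a different bookkeeping scheme.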

2.3 View data set information

full.head()

	Age	Cabin	Embarked	Fare	Name	PassengerId	Pclass	Sex	SibSp	Survived	Ticket
0	22.0	NaN	S	7.2500	Braund, Mr. Owen Harris	1	3	male	1	0.0	A/5 21171
1	38.0	C85	C	71.2833	Cumings, Mrs. John Bradley (Florence Briggs Th...	2	1	female	1	1.0	PC 17599
2	26.0	NaN	S	7.9250	Heikkinen, Miss. Laina	3	3	female	0	1.0	STON/O2. 3101282
3	35.0	C123	S	53.1000	Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	1	female	1	1.0	113803
4	35.0	NaN	S	8.0500	Allen, Mr. William Henry	5	3	male	0	0.0	373450

'''
describe() only reports descriptive statistics for numeric columns;
columns of other types are not shown
'''
full.describe()

	Age	Fare	Parch	PassengerId	Pclass	SibSp	Survived
count	1046.000000	1308.000000	1309.000000	1309.000000	1309.000000	1309.000000	891.000000
mean	29.881138	33.295479	0.385027	655.000000	2.294882	0.498854	0.383838
std	14.413493	51.758668	0.865560	378.020061	0.837836	1.041658	0.486592
min	0.170000	0.000000	0.000000	1.000000	1.000000	0.000000	0.000000
25%	21.000000	7.895800	0.000000	328.000000	2.000000	0.000000	0.000000
50%	28.000000	14.454200	0.000000	655.000000	3.000000	0.000000	0.000000
75%	39.000000	31.275000	0.000000	982.000000	3.000000	1.000000	1.000000
max	80.000000	512.329200	9.000000	1309.000000	3.000000	8.000000	1.000000
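The non-numeric columns can be summarised as well by telling `describe` which dtypes to leave out. A minimal sketch on a made-up frame (the `demo` values are illustrative only, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Small made-up frame with one numeric and two string columns
demo = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": ["male", "female", "female", "female"],
    "Embarked": ["S", "C", "S", "S"],
})

# Default call: only the numeric column Age is summarised
numeric_summary = demo.describe()

# Excluding numeric dtypes summarises the remaining (string) columns,
# reporting count, unique, top (most frequent value) and freq
text_summary = demo.describe(exclude=[np.number])

print(numeric_summary)
print(text_summary)
```

For string columns the summary rows differ from the numeric ones, which is why the two kinds of columns are usually inspected separately.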

# View the data type of each column and its count of non-null values
full.info()
'''
The data has 1309 rows in total.
Numeric columns with missing data: Age and Fare
(1) Age has 1046 values, so 1309-1046=263 are missing, a missing rate of 263/1309=20%
(2) Fare has 1308 values, so 1 is missing
String columns with missing data: Embarked and Cabin
(1) Embarked has 1307 values, so only 2 are missing, which is few
(2) Cabin has 295 values, so 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%,
which is large
Knowing which columns have missing data tells us where to direct the data cleaning that follows.
'''

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
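The missing counts and rates worked out by hand from `info()` can also be computed directly. A minimal sketch, assuming a small made-up frame in place of `full` (the names `demo` and `report` are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two columns
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# Count missing values per column, then divide by the number of rows
missing_count = demo.isna().sum()
missing_rate = missing_count / len(demo)

# Keep only the columns that have gaps, worst first
report = pd.DataFrame({"missing": missing_count, "rate": missing_rate})
report = report[report["missing"] > 0].sort_values("rate", ascending=False)

print(report)
```

Applied to the merged data set, the same two lines would reproduce the 263 and 1014 counts derived above.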






3.Data preparation

3.1 Data preprocessing

Missing value processing

To train a model, many machine learning algorithms require that the features passed in contain no null values. Common ways to handle missing values:

1. If the column is numeric, replace missing values with the mean.

2. If the column is categorical, replace missing values with the most common category.

3. Use a model to predict the missing values, for example K-NN.
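The third option is not used in this walkthrough, but it can be sketched with scikit-learn's `KNNImputer`. This is an assumption about tooling, not the author's method, and the `demo` values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Made-up numeric frame; the Age of the third passenger is missing
demo = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Pclass": [3, 1, 3, 1, 1],
})

# Each missing value is replaced by the mean of that column over the
# 2 nearest rows, with distance measured on the columns both rows have
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)

print(filled)
```

The features here are left unscaled, so Fare dominates the distance. In practice the columns would usually be standardised first, and string columns would have to be encoded as numbers before the imputer can use them.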

'''
For numeric columns, the easiest way to deal with missing values is to fill them with the mean
'''
print('Before:')
full.info()
# Age
full['Age']=full['Age'].fillna(full['Age'].mean())
# Fare
full['Fare']=full['Fare'].fillna(full['Fare'].mean())
print('After:')
full.info()

Before:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
After:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1309 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
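When several numeric columns need the same treatment, the per-column calls can be folded into one. A minimal sketch on made-up data (the frame `demo` is hypothetical), relying on the fact that `fillna` accepts a Series keyed by column name:

```python
import numpy as np
import pandas as pd

# Made-up frame: two numeric columns and one string column, all with gaps
demo = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 30.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan],
})

# Column means for numeric columns only (Age -> 30.0, Fare -> 20.0)
means = demo.mean(numeric_only=True)

# Each numeric column is filled with its own mean; Cabin has no entry
# in `means`, so its missing values are left untouched
filled = demo.fillna(means)

print(filled)
```

Leaving the string column alone is intentional here, since categorical gaps call for a different fill value.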

# Check that the data was processed correctly
full.head(5)

	Age	Cabin	Embarked	Fare	Name	PassengerId	Pclass	Sex	SibSp	Survived	Ticket
0	22.0	NaN	S	7.2500	Braund, Mr. Owen Harris	1	3	male	1	0.0	A/5 21171
1	38.0	C85	C	71.2833	Cumings, Mrs. John Bradley (Florence Briggs Th...	2	1	female	1	1.0	PC 17599
2	26.0	NaN	S	7.9250	Heikkinen, Miss. Laina	3	3	female	0	1.0	STON/O2. 3101282
3	35.0	C123	S	53.1000	Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	1	female	1	1.0	113803
4	35.0	NaN	S	8.0500	Allen, Mr. William Henry	5	3	male	0	0.0	373450

'''
The data has 1309 rows in total.
String columns:
(1) Embarked has 1307 values, so only 2 are missing, which is few
(2) Cabin has 295 values, so 1309-295=1014 are missing, a missing rate of 1014/1309=77.5%,
which is large
'''
# Embarked: see what the data looks like
'''
Port of departure: S = Southampton, UK
Stop 1: C = Cherbourg, France
Stop 2: Q = Queenstown, Ireland
'''
full['Embarked'].head()

0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object

'''
Embarked is a categorical variable: find the most common category and use it to fill the missing values
'''
full['Embarked'].value_counts()

S    914
C    270
Q    123
Name: Embarked, dtype: int64

'''
The results show that S is the most common category, so we fill the missing values
with the most frequent value: S = Southampton
'''
full['Embarked']=full['Embarked'].fillna('S')
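The most common category can also be looked up in code instead of being read off `value_counts()` and typed in by hand. A minimal sketch on a made-up Series (the values are illustrative, not the real column):

```python
import numpy as np
import pandas as pd

# Made-up Embarked column with two gaps
embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan], name="Embarked")

# mode() ignores NaN and returns the most frequent value(s) in sorted order;
# taking element 0 picks one value even if there is a tie
most_common = embarked.mode()[0]

# Fill the gaps with that category
filled = embarked.fillna(most_common)

print(most_common)
print(filled.tolist())
```

This keeps the fill value in step with the data if the input file changes, at the cost of hiding which category was chosen unless it is printed.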

# Cabin: See what the data looks like
full['Cabin'].head()

0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object

'''
Most of the cabin numbers are missing. Fill the missing values with U,
meaning the cabin is unknown.
'''
full['Cabin'] = full['Cabin'].fillna('U')

# Check that the data was processed correctly
full.head()

	Age	Cabin	Embarked	Fare	Name	PassengerId	Pclass	Sex	SibSp	Survived	Ticket
0	22.0	U	S	7.2500	Braund, Mr. Owen Harris	1	3	male	1	0.0	A/5 21171
1	38.0	C85	C	71.2833	Cumings, Mrs. John Bradley (Florence Briggs Th...	2	1	female	1	1.0	PC 17599
2	26.0	U	S	7.9250	Heikkinen, Miss. Laina	3	3	female	0	1.0	STON/O2. 3101282
3	35.0	C123	S	53.1000	Futrelle, Mrs. Jacques Heath (Lily May Peel)	4	1	female	1	1.0	113803
4	35.0	U	S	8.0500	Allen, Mr. William Henry	5	3	male	0	0.0	373450

'''
Look at the final result of the missing value handling. Note the survival column (Survived):
it holds our labels, which the machine learning model will predict,
so there is no need to fill in this column
'''
full.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            13

Original article: https://blog.csdn.net/qq_41680326/article/details/104316019
