(Machine) Learning from the Titanic
This is my first attempt at the RMS Titanic dataset on Kaggle. It has been described many times, with many tutorials: the dataset is a list of passengers from the Titanic, and the purpose of this long-running Kaggle competition is to predict which passengers survived the shipwreck. To start, we can load in the data and take a peek at what is inside. I am going to work through it in R.
train.col <- c('integer', # PassengerId
'factor', # Survived
'factor', # Pclass
'character', # Name
'factor', # Sex
'numeric', # Age
'integer', # SibSp
'integer', # Parch
'character', # Ticket
'numeric', # Fare
'character', # Cabin
'factor' # Embarked
)
test.col <- train.col[-2] # no Survived column in test.csv
train <- read.csv("train.csv", colClasses = train.col, na.strings = c(""))
test <- read.csv( "test.csv", colClasses = test.col, na.strings = c(""))
all <- merge(train, test, all = TRUE)
str(all)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
summary(all)
## PassengerId Pclass Name Sex Age
## Min. : 1 1:323 Length:1309 female:466 Min. : 0.17
## 1st Qu.: 328 2:277 Class :character male :843 1st Qu.:21.00
## Median : 655 3:709 Mode :character Median :28.00
## Mean : 655 Mean :29.88
## 3rd Qu.: 982 3rd Qu.:39.00
## Max. :1309 Max. :80.00
## NA's :263
## SibSp Parch Ticket Fare
## Min. :0.0000 Min. :0.000 Length:1309 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.000 Class :character 1st Qu.: 7.896
## Median :0.0000 Median :0.000 Mode :character Median : 14.454
## Mean :0.4989 Mean :0.385 Mean : 33.295
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.: 31.275
## Max. :8.0000 Max. :9.000 Max. :512.329
## NA's :1
## Cabin Embarked Survived
## Length:1309 C :270 0 :549
## Class :character Q :123 1 :342
## Mode :character S :914 NA's:418
## NA's: 2
##
##
##
We merge the training and test sets so that we can look at everything at once. The first thing to check is what data is actually present. We can do that below in tabular and graphical forms.
sapply(all, function(x) sum(is.na(x)))
## PassengerId Pclass Name Sex Age SibSp
## 0 0 0 0 263 0
## Parch Ticket Fare Cabin Embarked Survived
## 0 0 1 1014 2 418
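For the graphical form, a quick bar chart of these counts is enough. Here is a minimal sketch in base R of one way to draw it (my construction):
na.counts <- sapply(all, function(x) sum(is.na(x))) # same counts as above
barplot(na.counts, las = 2, main = "Missing values per column") # las = 2 turns the labels vertical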
Interestingly, the dataset is mostly complete. It is missing about 20% of the Age column and nearly 80% of the Cabin column. Of course there is a huge chunk of Survived missing, but those are the test rows we need to predict. We might be able to estimate the missing Age values using the information we have from all the passengers; the Cabin data is so sparse that it will be tough to use that variable. Part of this kind of estimation (imputation) will require some feature engineering: that is, we can try to learn things from the existing columns to extract some additional knowledge.
Feature engineering
Titles
So let’s take a look again at the data. Recall the point of this exercise is to predict the passengers who survived in the test dataset.
head(all)
## PassengerId Pclass Name
## 1 1 3 Braund, Mr. Owen Harris
## 2 2 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3 3 3 Heikkinen, Miss. Laina
## 4 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5 5 3 Allen, Mr. William Henry
## 6 6 3 Moran, Mr. James
## Sex Age SibSp Parch Ticket Fare Cabin Embarked Survived
## 1 male 22 1 0 A/5 21171 7.2500 <NA> S 0
## 2 female 38 1 0 PC 17599 71.2833 C85 C 1
## 3 female 26 0 0 STON/O2. 3101282 7.9250 <NA> S 1
## 4 female 35 1 0 113803 53.1000 C123 S 1
## 5 male 35 0 0 373450 8.0500 <NA> S 0
## 6 male NA 0 0 330877 8.4583 <NA> Q 0
Notice that the Name field actually includes the title of each passenger. Considering the time period, these titles can provide some useful information, so let's extract them with a regular expression.
all$Title <- gsub("^.*, (.*?)\\..*$", "\\1", all$Name) # keep the text between the last comma and the following period
table(all$Sex, all$Title)
##
## Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme
## female 0 0 0 1 1 0 1 0 0 260 2 1
## male 1 4 1 0 7 1 0 2 61 0 0 0
##
## Mr Mrs Ms Rev Sir the Countess
## female 0 197 2 0 0 1
## male 757 0 0 8 1 0
I used the `table` function to show how many Titles there are (18) and to which gender each belongs. The vast majority of passengers were addressed as Mr., Mrs., or Miss. A few Titles appear only a handful of times; let's take a closer look at those:
- Captain - Military rank (one below Major)
- Don - Spanish, Portuguese, and Italian male honorific
- Dona - Female version of Don
- Jonkheer - Dutch male honorific
- Lady - English female honorific
- Major - Military rank (one above Captain)
- Mademoiselle (Mlle) - French form of (polite) address for unmarried females, similar to the English ‘Miss’
- Madame (Mme) - French form of (polite) address for females, similar to the English ‘Mrs.’
- Ms - English form of address for females (regardless of marital status)
- Sir - English male honorific
- Countess - English title of nobility for females
It becomes obvious that some of these Titles can be combined into one. We can again use `gsub` to accomplish this task.
all$Title <- gsub("Dona|Lady|the Countess", "Lady", all$Title)
all$Title <- gsub("Mlle", "Miss", all$Title)
all$Title <- gsub("Mme", "Mrs", all$Title)
all$Title <- gsub("Don|Jonkheer|Sir", "Sir", all$Title)
So, we’re able to reduce the number of Titles to 12. I also want to take a look at the 7 military officers on board the ship.
subset(all, Title == "Capt" | Title == "Major" | Title == "Col")
## PassengerId Pclass Name Sex Age SibSp
## 450 450 1 Peuchen, Major. Arthur Godfrey male 52 0
## 537 537 1 Butt, Major. Archibald Willingham male 45 0
## 648 648 1 Simonius-Blumer, Col. Oberst Alfons male 56 0
## 695 695 1 Weir, Col. John male 60 0
## 746 746 1 Crosby, Capt. Edward Gifford male 70 1
## 1023 1023 1 Gracie, Col. Archibald IV male 53 0
## 1094 1094 1 Astor, Col. John Jacob male 47 1
## Parch Ticket Fare Cabin Embarked Survived Title
## 450 0 113786 30.500 C104 S 1 Major
## 537 0 113050 26.550 B38 S 0 Major
## 648 0 13213 35.500 A26 C 1 Col
## 695 0 113800 26.550 <NA> S 0 Col
## 746 1 WE/P 5735 71.000 B22 S 0 Capt
## 1023 0 113780 28.500 C51 C <NA> Col
## 1094 0 PC 17757 227.525 C62 C64 C <NA> Col
We can see they are generally around the same Age, all travelling first class in the upper cabin decks (A, B, and C), and most paid similar fares. I think it is fair to lump them into one Title, so we'll promote some of them to the rank of Colonel.
all$Title <- gsub("Capt|Major|Col", "Col", all$Title)
Since there are only two passengers with the Title ‘Ms,’ I want to try to reclassify those Titles as well.
subset(all, Title == "Ms")
## PassengerId Pclass Name Sex Age SibSp Parch
## 444 444 2 Reynaldo, Ms. Encarnacion female 28 0 0
## 980 980 3 O'Donoghue, Ms. Bridget female NA 0 0
## Ticket Fare Cabin Embarked Survived Title
## 444 230434 13.00 <NA> S 1 Ms
## 980 364856 7.75 <NA> Q <NA> Ms
If we look up the history of both these passengers at the Encyclopedia Titanica, we find that Ms. Reynaldo was a widow and Ms. O'Donoghue was never married. So I will change their Titles to Miss.
all$Title <- gsub("Ms", "Miss", all$Title)
all$Title <- as.factor(all$Title) # make sure it's a factor
table(all$Sex, all$Title)
##
## Col Dr Lady Master Miss Mr Mrs Rev Sir
## female 0 1 3 0 264 0 198 0 0
## male 7 7 0 61 0 757 0 8 3
str(all)
## 'data.frame': 1309 obs. of 13 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Pclass : Factor w/ 3 levels "1","2","3": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr NA "C85" NA "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Survived : Factor w/ 2 levels "0","1": 1 2 2 2 1 1 1 1 2 2 ...
## $ Title : Factor w/ 9 levels "Col","Dr","Lady",..: 6 7 5 7 6 6 6 4 7 7 ...
Group Sizes
So I have cut the number of Titles in half, to 9 from the original 18. I think this retains a lot of the extra information from the Name column, but in a compact way. Next we can focus on the Ticket column, which holds the ticket number for each passenger. From the previous printouts we know that some tickets include a character prefix; we can strip this and keep just the ticket number itself. However, if we look carefully at the Ticket info, there are four entries that contain only the string 'LINE'.
subset(all, Ticket=="LINE")
## PassengerId Pclass Name Sex Age SibSp
## 180 180 3 Leonard, Mr. Lionel male 36 0
## 272 272 3 Tornquist, Mr. William Henry male 25 0
## 303 303 3 Johnson, Mr. William Cahoone Jr male 19 0
## 598 598 3 Johnson, Mr. Alfred male 49 0
## Parch Ticket Fare Cabin Embarked Survived Title
## 180 0 LINE 0 <NA> S 0 Mr
## 272 0 LINE 0 <NA> S 1 Mr
## 303 0 LINE 0 <NA> S 0 Mr
## 598 0 LINE 0 <NA> S 0 Mr
For whatever reason, these four Tickets are not correctly listed. Further, these four passengers do not seem to be related, so I will insert an arbitrary set of strings to replace the 'LINE' entries.
all[subset(all, Ticket=="LINE")$PassengerId, "Ticket"] <- c("99991","99992","99993","99994") # row index equals PassengerId here, since 'all' is sorted by PassengerId
With that done, we can finally get the TicketNo extracted.
all$TicketNo <- gsub("[^0-9]", "", all$Ticket) # drop everything that is not a digit
head(all)
## PassengerId Pclass Name
## 1 1 3 Braund, Mr. Owen Harris
## 2 2 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
## 3 3 3 Heikkinen, Miss. Laina
## 4 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
## 5 5 3 Allen, Mr. William Henry
## 6 6 3 Moran, Mr. James
## Sex Age SibSp Parch Ticket Fare Cabin Embarked Survived
## 1 male 22 1 0 A/5 21171 7.2500 <NA> S 0
## 2 female 38 1 0 PC 17599 71.2833 C85 C 1
## 3 female 26 0 0 STON/O2. 3101282 7.9250 <NA> S 1
## 4 female 35 1 0 113803 53.1000 C123 S 1
## 5 male 35 0 0 373450 8.0500 <NA> S 0
## 6 male NA 0 0 330877 8.4583 <NA> Q 0
## Title TicketNo
## 1 Mr 521171
## 2 Mrs 17599
## 3 Miss 23101282
## 4 Mrs 113803
## 5 Mr 373450
## 6 Mr 330877
This lets us see that some people travelled together on the same ticket. We also have two fields, SibSp and Parch, which give the sibling-plus-spouse and parent-plus-children counts on board, respectively. We can combine these two and compare them to our TicketNo info. The first step is to count every passenger travelling on the same ticket, which we can do with the `ddply` function alongside `transform`.
library(plyr) # for ddply and transform-style grouping (and revalue, used later)
all$FamilySize <- all$SibSp + all$Parch
all <- ddply(all, ~ TicketNo, transform, TravelSize = length(TicketNo)-1 ) # companions on the same ticket, excluding the passenger themselves
head(all)
## PassengerId Pclass
## 1 258 1
## 2 505 1
## 3 760 1
## 4 263 1
## 5 559 1
## 6 586 1
## Name Sex Age
## 1 Cherry, Miss. Gladys female 30
## 2 Maioni, Miss. Roberta female 16
## 3 Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards) female 33
## 4 Taussig, Mr. Emil male 52
## 5 Taussig, Mrs. Emil (Tillie Mandelbaum) female 39
## 6 Taussig, Miss. Ruth female 18
## SibSp Parch Ticket Fare Cabin Embarked Survived Title TicketNo
## 1 0 0 110152 86.50 B77 S 1 Miss 110152
## 2 0 0 110152 86.50 B79 S 1 Miss 110152
## 3 0 0 110152 86.50 B77 S 1 Lady 110152
## 4 1 1 110413 79.65 E67 S 0 Mr 110413
## 5 1 1 110413 79.65 E67 S 1 Mrs 110413
## 6 0 2 110413 79.65 E68 S 1 Miss 110413
## FamilySize TravelSize
## 1 0 2
## 2 0 2
## 3 0 2
## 4 2 2
## 5 2 2
## 6 2 2
Now we can plot TravelSize against FamilySize to see how the two compare. Because FamilySize only captures certain familial relations, people travelling on the same ticket could be other relatives, friends, or neighbors. This means our calculated TravelSize could capture some information missing from FamilySize, though of course it need not. A sketch of the plot follows.
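Here is a minimal sketch in base R of one way to draw this comparison (the published figure may have been produced differently):
plot(jitter(all$FamilySize), jitter(all$TravelSize),
     xlab = "FamilySize", ylab = "TravelSize") # jitter so overlapping passengers stay visible
abline(0, 1, lty = 2) # points off the diagonal are where the two measures disagree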
The plot shows a correlation between the two. However, it also shows differences between them, reinforcing our hypothesis that TravelSize might be capturing additional information.
Missing data
We also remember from the start that some pieces of data are missing, some gaps much larger than others. Let's take another look at what we do not have.
summary(all)
## PassengerId Pclass Name Sex Age
## Min. : 1 1:323 Length:1309 female:466 Min. : 0.17
## 1st Qu.: 328 2:277 Class :character male :843 1st Qu.:21.00
## Median : 655 3:709 Mode :character Median :28.00
## Mean : 655 Mean :29.88
## 3rd Qu.: 982 3rd Qu.:39.00
## Max. :1309 Max. :80.00
## NA's :263
## SibSp Parch Ticket Fare
## Min. :0.0000 Min. :0.000 Length:1309 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.:0.000 Class :character 1st Qu.: 7.896
## Median :0.0000 Median :0.000 Mode :character Median : 14.454
## Mean :0.4989 Mean :0.385 Mean : 33.295
## 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.: 31.275
## Max. :8.0000 Max. :9.000 Max. :512.329
## NA's :1
## Cabin Embarked Survived Title TicketNo
## Length:1309 C :270 0 :549 Mr :757 Length:1309
## Class :character Q :123 1 :342 Miss :264 Class :character
## Mode :character S :914 NA's:418 Mrs :198 Mode :character
## NA's: 2 Master : 61
## Dr : 8
## Rev : 8
## (Other): 13
## FamilySize TravelSize
## Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 0.0000 Median : 0.000
## Mean : 0.8839 Mean : 1.105
## 3rd Qu.: 1.0000 3rd Qu.: 2.000
## Max. :10.0000 Max. :10.000
##
We are missing Age, Fare, and Embarked info, as the `NA` counts show. We also know most of the Cabin information is missing. Let's start with the missing Fares.
Fare
As usual, we start with a distribution and some statistics.
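The figure and the statistics below presumably come from something as simple as:
hist(all$Fare, breaks = 50, main = "Fare distribution") # the distribution
summary(all$Fare)                                       # the statistics quoted below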
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 7.896 14.450 33.300 31.280 512.300 1
We see an interesting bin: there are seventeen Fares of exactly zero (the Fare is in British pounds). Let's take a look at these passengers; they could be infants.
subset(all, Fare == 0)
## PassengerId Pclass Name Sex Age
## 23 807 1 Andrews, Mr. Thomas Jr male 39
## 24 1158 1 Chisholm, Mr. Roderick Robert Crispin male NA
## 25 634 1 Parr, Mr. William Henry Marsh male NA
## 27 816 1 Fry, Mr. Richard male NA
## 28 1264 1 Ismay, Mr. Joseph Bruce male 49
## 29 264 1 Harrison, Mr. William male 40
## 334 823 1 Reuchlin, Jonkheer. John George male 38
## 476 278 2 Parkes, Mr. Francis "Frank" male NA
## 477 414 2 Cunningham, Mr. Alfred Fleming male NA
## 478 467 2 Campbell, Mr. William male NA
## 479 482 2 Frost, Mr. Anthony Wood "Archie" male NA
## 480 733 2 Knight, Mr. Robert J male NA
## 481 675 2 Watson, Mr. Ennis Hastings male NA
## 1306 180 3 Leonard, Mr. Lionel male 36
## 1307 272 3 Tornquist, Mr. William Henry male 25
## 1308 303 3 Johnson, Mr. William Cahoone Jr male 19
## 1309 598 3 Johnson, Mr. Alfred male 49
## SibSp Parch Ticket Fare Cabin Embarked Survived Title TicketNo
## 23 0 0 112050 0 A36 S 0 Mr 112050
## 24 0 0 112051 0 <NA> S <NA> Mr 112051
## 25 0 0 112052 0 <NA> S 0 Mr 112052
## 27 0 0 112058 0 B102 S 0 Mr 112058
## 28 0 0 112058 0 B52 B54 B56 S <NA> Mr 112058
## 29 0 0 112059 0 B94 S 0 Mr 112059
## 334 0 0 19972 0 <NA> S 0 Sir 19972
## 476 0 0 239853 0 <NA> S 0 Mr 239853
## 477 0 0 239853 0 <NA> S 0 Mr 239853
## 478 0 0 239853 0 <NA> S 0 Mr 239853
## 479 0 0 239854 0 <NA> S 0 Mr 239854
## 480 0 0 239855 0 <NA> S 0 Mr 239855
## 481 0 0 239856 0 <NA> S 0 Mr 239856
## 1306 0 0 99991 0 <NA> S 0 Mr 99991
## 1307 0 0 99992 0 <NA> S 1 Mr 99992
## 1308 0 0 99993 0 <NA> S 0 Mr 99993
## 1309 0 0 99994 0 <NA> S 0 Mr 99994
## FamilySize TravelSize
## 23 0 0
## 24 0 0
## 25 0 0
## 27 0 1
## 28 0 1
## 29 0 0
## 334 0 0
## 476 0 2
## 477 0 2
## 478 0 2
## 479 0 0
## 480 0 0
## 481 0 0
## 1306 0 0
## 1307 0 0
## 1308 0 0
## 1309 0 0
That does not appear to be the case at all: they are all men who embarked at Southampton ('S'), spanning all three Pclasses. Let's redefine these Fares as `NA` and look at every passenger missing a Fare.
all$Fare[all$Fare == 0] <- NA
all[(which(is.na(all$Fare))),]
## PassengerId Pclass Name Sex Age
## 23 807 1 Andrews, Mr. Thomas Jr male 39.0
## 24 1158 1 Chisholm, Mr. Roderick Robert Crispin male NA
## 25 634 1 Parr, Mr. William Henry Marsh male NA
## 27 816 1 Fry, Mr. Richard male NA
## 28 1264 1 Ismay, Mr. Joseph Bruce male 49.0
## 29 264 1 Harrison, Mr. William male 40.0
## 334 823 1 Reuchlin, Jonkheer. John George male 38.0
## 476 278 2 Parkes, Mr. Francis "Frank" male NA
## 477 414 2 Cunningham, Mr. Alfred Fleming male NA
## 478 467 2 Campbell, Mr. William male NA
## 479 482 2 Frost, Mr. Anthony Wood "Archie" male NA
## 480 733 2 Knight, Mr. Robert J male NA
## 481 675 2 Watson, Mr. Ennis Hastings male NA
## 1123 1044 3 Storey, Mr. Thomas male 60.5
## 1306 180 3 Leonard, Mr. Lionel male 36.0
## 1307 272 3 Tornquist, Mr. William Henry male 25.0
## 1308 303 3 Johnson, Mr. William Cahoone Jr male 19.0
## 1309 598 3 Johnson, Mr. Alfred male 49.0
## SibSp Parch Ticket Fare Cabin Embarked Survived Title TicketNo
## 23 0 0 112050 NA A36 S 0 Mr 112050
## 24 0 0 112051 NA <NA> S <NA> Mr 112051
## 25 0 0 112052 NA <NA> S 0 Mr 112052
## 27 0 0 112058 NA B102 S 0 Mr 112058
## 28 0 0 112058 NA B52 B54 B56 S <NA> Mr 112058
## 29 0 0 112059 NA B94 S 0 Mr 112059
## 334 0 0 19972 NA <NA> S 0 Sir 19972
## 476 0 0 239853 NA <NA> S 0 Mr 239853
## 477 0 0 239853 NA <NA> S 0 Mr 239853
## 478 0 0 239853 NA <NA> S 0 Mr 239853
## 479 0 0 239854 NA <NA> S 0 Mr 239854
## 480 0 0 239855 NA <NA> S 0 Mr 239855
## 481 0 0 239856 NA <NA> S 0 Mr 239856
## 1123 0 0 3701 NA <NA> S <NA> Mr 3701
## 1306 0 0 99991 NA <NA> S 0 Mr 99991
## 1307 0 0 99992 NA <NA> S 1 Mr 99992
## 1308 0 0 99993 NA <NA> S 0 Mr 99993
## 1309 0 0 99994 NA <NA> S 0 Mr 99994
## FamilySize TravelSize
## 23 0 0
## 24 0 0
## 25 0 0
## 27 0 1
## 28 0 1
## 29 0 0
## 334 0 0
## 476 0 2
## 477 0 2
## 478 0 2
## 479 0 0
## 480 0 0
## 481 0 0
## 1123 0 0
## 1306 0 0
## 1307 0 0
## 1308 0 0
## 1309 0 0
To come up with the Fare imputation, I want to calculate the mean Fare for each combination of Pclass and Title among Southampton passengers. This works because the passengers we care about are all 'Mr' (save one). The averaging process is shown below.
df.temp <- subset(all, Embarked == 'S') #Only get Southampton passengers
ddply(df.temp, c("Pclass","Title"), summarize, AveFare = mean(Fare, na.rm = TRUE))
## Pclass Title AveFare
## 1 1 Col 38.65000
## 2 1 Dr 80.47917
## 3 1 Lady 86.50000
## 4 1 Master 117.80277
## 5 1 Miss 127.37830
## 6 1 Mr 55.14049
## 7 1 Mrs 83.35187
## 8 1 Sir NaN
## 9 2 Dr 12.25000
## 10 2 Master 26.42500
## 11 2 Miss 22.59091
## 12 2 Mr 20.69073
## 13 2 Mrs 23.41122
## 14 2 Rev 19.50357
## 15 3 Master 27.78151
## 16 3 Miss 17.50096
## 17 3 Mr 11.88229
## 18 3 Mrs 19.13560
Now, we can assign those values to those passengers.
df.temp <- ddply(df.temp, c("Pclass","Title"), transform, AveFare = mean(Fare, na.rm = TRUE)) # compute the group means
all[(which(is.na(all$Fare))), "Fare"] <- df.temp[(which(is.na(df.temp$Fare))), "AveFare"] # assign; relies on the NA rows appearing in the same order in both frames
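Note that the assignment above relies on the fare-less rows lining up in the same order in all and df.temp. A more defensive variant, matching rows by PassengerId, is sketched here (my own alternative, not the original code):
idx <- match(all$PassengerId[is.na(all$Fare)], df.temp$PassengerId) # locate each fare-less passenger
all$Fare[is.na(all$Fare)] <- df.temp$AveFare[idx]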
There is still one passenger without a Fare; his group's average was the only one we could not compute. Instead, we will treat that passenger as a 'Mr' for the purposes of the cost of his ticket, and simply assign the value we calculated before (roughly 55).
all[all$PassengerId == 823, "Fare"] <- 55
That takes care of all the Fare data. Next we turn to the Embarked data.
Embarkation
all[(which(is.na(all$Embarked))),]
## PassengerId Pclass Name Sex Age
## 59 62 1 Icard, Miss. Amelie female 38
## 60 830 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62
## SibSp Parch Ticket Fare Cabin Embarked Survived Title TicketNo
## 59 0 0 113572 80 B28 <NA> 1 Miss 113572
## 60 0 0 113572 80 B28 <NA> 1 Mrs 113572
## FamilySize TravelSize
## 59 0 1
## 60 0 1
These two passengers, who travelled on the same ticket and stayed in the same cabin, have no embarkation information. We also know they each paid 80 for their ticket. We can look at the usual first-class Fare paid by each sex at the three ports and compare, to guess which port is the likeliest.
df.temp <- subset(all, Pclass == 1)
ddply(df.temp, c("Sex","Embarked"), summarize,
AveFare = mean(Fare, na.rm=TRUE), MedFare = median(Fare, na.rm=TRUE), NumPas = length(Pclass))
## Sex Embarked AveFare MedFare NumPas
## 1 female C 118.89595 83.15830 71
## 2 female Q 90.00000 90.00000 2
## 3 female S 101.06914 78.85000 69
## 4 female <NA> 80.00000 80.00000 2
## 5 male C 94.62256 62.66875 70
## 6 male Q 90.00000 90.00000 1
## 7 male S 57.24338 44.00000 108
Doing this, we see that Southampton is the most likely embarkation point, since both the median and mean Fares for first-class women there are closest to 80. Let's set this before continuing.
all[(which(is.na(all$Embarked))), "Embarked"] <- "S"
Age
Now we can finally tackle the Age data. Unlike the previous columns with missing data, this one is only about 80% complete: some 263 passengers have no Age listed. To estimate these ages we will have to dig a little deeper into the data. But first, let's take a cursory look.
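The summary below is presumably just:
summary(all$Age)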
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.17 21.00 28.00 29.88 39.00 80.00 263
There are no surprises in the Age distribution, which peaks among passengers in their twenties. There are many ways to estimate the missing values, the easiest likely being a linear model. While I did go through the motions of doing that, I wanted something a little more sophisticated, so I (somewhat arbitrarily) chose a gradient boosted model to fit the Age, using the R packages `gbm` and `caret` for the training and fitting. Gradient boosting requires some tuning and I tried a few different options; the model shown below is the one that performed best.
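For reference, the linear-model baseline alluded to above might look something like this; the predictor set is my guess, not necessarily the one the author tried:
age.lm <- lm(Age ~ Pclass + Sex + Fare + Embarked + Title + SibSp + Parch + TravelSize,
             data = all[!is.na(all$Age), ]) # fit only on rows where Age is known
summary(age.lm) # inspect coefficients and R-squared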
library(caret) # train() and trainControl(); caret will load gbm for method = 'gbm'
gbmControl <- trainControl(method='repeatedcv', number = 10, repeats = 10, verboseIter = FALSE,
returnResamp = 'all',
summaryFunction=defaultSummary)
gbmGrid <- expand.grid(interaction.depth = c(2, 3),
n.trees = (1:20)*20,
shrinkage = 0.1, # or something like seq(.001, .1, .01)
n.minobsinnode = 10) # or something like c(5, 10, 15, 20)
age.model <- train(Age ~ Pclass + Sex + Fare + Embarked + Title + SibSp + Parch + TravelSize,
data = all[!is.na(all$Age), ],
method = 'gbm',
trControl = gbmControl,
tuneGrid = gbmGrid,
verbose = FALSE)
This trains a model of Age on all the rows where Age is known; the fit does a 10-fold CV repeated 10 times. Also notice I did not include the FamilySize variable I created: it turned out to perform slightly worse than the two variables (SibSp and Parch) it replaces.
With a model in hand, we can take a look at some of its characteristics. First we can use the `plot` method to see how the model changed with the varying tuning parameters.
plot(age.model)
The fit minimizes the Root Mean Square Error (RMSE), which we can see bottoms out relatively quickly. The model can also tell us which factors were most important; the very handy `varImp` function allows us to visualize this easily.
plot(varImp(age.model))
It seems some of the Titles are important for estimating Age, so it is good we extracted that information earlier. This type of plot helps with understanding the relationships between the predictors and the variable we want to predict. Knowing the model a little better, we can predict the missing Ages. Another check we can perform is to make sure the distribution of predicted values is not too different from that of the actual values. The two plots below show the Age density (top) and histogram (bottom) for the actual (dark gray) and predicted (red) Age values.
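A minimal sketch of how such a comparison could be drawn, assuming the ggplot2 package (my construction; colors and layout are guesses):
library(ggplot2)
age.pred <- predict(age.model, all[is.na(all$Age), ]) # predictions for the missing Ages
df.ages <- rbind(data.frame(Age = all$Age[!is.na(all$Age)], Source = "actual"),
                 data.frame(Age = age.pred, Source = "predicted"))
ggplot(df.ages, aes(x = Age, fill = Source)) +
  geom_density(alpha = 0.5) # swap in geom_histogram(position = "identity", alpha = 0.5) for the histogram view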
Looking at these, the model does not appear to make any abnormal predictions. Peculiar outputs from a model can be a red flag that it is not behaving as desired, so it is important to make these kinds of sanity checks. A further check is to graph the residuals of the fit, to ensure there is no obvious bias.
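A sketch of that residual plot (again my construction):
known <- all[!is.na(all$Age), ]
age.fit <- predict(age.model, known)
plot(age.fit, known$Age - age.fit, xlab = "Predicted Age", ylab = "Residual")
abline(h = 0, lty = 2) # residuals should scatter evenly about zero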
The residuals look to be distributed as expected: roughly rectangular, with a concentration along zero. We seem to have a good model and estimate for the Age, so let's fill the `NA` values with our predictions.
all$Age[is.na(all$Age)] <- predict(age.model, all[is.na(all$Age),])
That takes care of all the missing data except the Cabin values. That column is over three-quarters empty, so any imputation would be very difficult, and I elected not to attempt it.
Modeling survival
With the data ready for modeling, let's split the data frame up again. Recall that we combined the training and test data at the beginning of this exercise. Also note that, to help R with the modeling, we will rename the two levels of the Survived variable.
all$Survived <- revalue(all$Survived, c("1" = "Survived", "0" = "Perished"))
train.all <- all[!is.na(all$Survived), ]
test.all <- all[ is.na(all$Survived), ]
Because I intend to use a few different methods to model whether a passenger survived, I will further split the training data. This keeps a validation set for comparing the performance of the various models we will develop. We will use the `createDataPartition` function to make an 80/20 split of the data.
set.seed(8020)
train.rows <- createDataPartition(train.all$Survived, p = 0.8, list = FALSE)
train.fit <- train.all[ train.rows, ]
test.fit <- train.all[-train.rows, ]
With a training set ready, we can begin the fitting process, starting with the simplest fit: a generalized linear model. We will use a common `trainControl` object in every call to `train`; it performs a 10-fold CV repeated 10 times, as before.
fitControl <- trainControl(method = "repeatedcv", repeats = 10, number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE)
Linear models
We will use the `glm` method. Note the metric we pass to `train` is 'ROC', the area under the Receiver Operating Characteristic (ROC) curve, which we can use to compare models later. The fit formula is similar to the one used to fit passenger Ages, except that Age now becomes an independent variable used to fit passenger survival.
set.seed(82)
model.glm.1 <- train(Survived ~ Pclass + Sex + Fare + Age + Embarked + Title + SibSp + Parch + TravelSize,
data = train.fit,
method = "glm",
metric = "ROC",
trControl = fitControl)
model.glm.1
## Generalized Linear Model
##
## 714 samples
## 9 predictor
## 2 classes: 'Perished', 'Survived'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 643, 642, 642, 643, 642, 643, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8727354 0.865 0.755463
##
##
summary(model.glm.1)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5611 -0.5627 -0.3335 0.5184 2.5708
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.736e+01 1.455e+03 0.012 0.990485
## Pclass2 -1.396e+00 3.895e-01 -3.583 0.000339 ***
## Pclass3 -2.691e+00 3.946e-01 -6.820 9.13e-12 ***
## Sexmale -1.599e+01 1.455e+03 -0.011 0.991234
## Fare -5.332e-05 3.352e-03 -0.016 0.987310
## Age -3.684e-02 1.123e-02 -3.281 0.001034 **
## EmbarkedQ 1.088e-01 4.508e-01 0.241 0.809265
## EmbarkedS -3.121e-01 2.893e-01 -1.079 0.280520
## TitleDr 3.290e-01 1.532e+00 0.215 0.829942
## TitleLady 1.606e-01 1.762e+03 0.000 0.999927
## TitleMaster 3.257e+00 1.375e+00 2.369 0.017816 *
## TitleMiss -1.309e+01 1.455e+03 -0.009 0.992826
## TitleMr 2.835e-02 1.226e+00 0.023 0.981557
## TitleMrs -1.209e+01 1.455e+03 -0.008 0.993374
## TitleRev -1.374e+01 6.366e+02 -0.022 0.982779
## TitleSir -1.522e+01 1.455e+03 -0.010 0.991658
## SibSp -7.126e-01 1.670e-01 -4.267 1.98e-05 ***
## Parch -4.620e-01 1.830e-01 -2.525 0.011586 *
## TravelSize 2.005e-01 1.143e-01 1.755 0.079332 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 950.86 on 713 degrees of freedom
## Residual deviance: 566.32 on 695 degrees of freedom
## AIC: 604.32
##
## Number of Fisher Scoring iterations: 14
Looking at the significance of the variables, it seems that Fare is not actually contributing much. So let's set up another model with it taken out.
set.seed(82)
model.glm.2 <- train(Survived ~ Pclass + Sex + Age + Embarked + Title + SibSp + Parch + TravelSize,
data = train.fit,
method = "glm",
metric = "ROC",
trControl = fitControl)
model.glm.2
## Generalized Linear Model
##
## 714 samples
## 8 predictor
## 2 classes: 'Perished', 'Survived'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 643, 642, 642, 643, 642, 643, ...
## Resampling results:
##
## ROC Sens Spec
## 0.8740179 0.8668182 0.7583598
##
##
summary(model.glm.2)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5615 -0.5628 -0.3336 0.5185 2.5705
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 17.35473 1455.39851 0.012 0.99049
## Pclass2 -1.39292 0.35171 -3.960 7.48e-05 ***
## Pclass3 -2.68788 0.33325 -8.066 7.29e-16 ***
## Sexmale -15.99095 1455.39785 -0.011 0.99123
## Age -0.03684 0.01122 -3.283 0.00103 **
## EmbarkedQ 0.10907 0.45055 0.242 0.80872
## EmbarkedS -0.31154 0.28676 -1.086 0.27730
## TitleDr 0.32784 1.52991 0.214 0.83032
## TitleLady 0.15886 1762.37398 0.000 0.99993
## TitleMaster 3.25663 1.37434 2.370 0.01781 *
## TitleMiss -13.08967 1455.39842 -0.009 0.99282
## TitleMr 0.02705 1.22350 0.022 0.98236
## TitleMrs -12.08938 1455.39841 -0.008 0.99337
## TitleRev -13.74228 636.58700 -0.022 0.98278
## TitleSir -15.21855 1455.39804 -0.010 0.99166
## SibSp -0.71230 0.16593 -4.293 1.77e-05 ***
## Parch -0.46182 0.18256 -2.530 0.01141 *
## TravelSize 0.19968 0.10157 1.966 0.04930 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 950.86 on 713 degrees of freedom
## Residual deviance: 566.32 on 696 degrees of freedom
## AIC: 602.32
##
## Number of Fisher Scoring iterations: 14
It performs slightly better on the AIC score, but the residual deviance remains the same. At this point I cannot say which formula is the better one, so we will keep both for now and compare later. Next we will use the `glmnet` package, which fits a regularized model with elastic-net penalties.
set.seed(82)
model.glmnet.1 <- train(Survived ~ Pclass + Sex + Fare + Age + Embarked + Title + SibSp + Parch + TravelSize,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0.05,1.0,.05), lambda = seq(0.0001, .1, length = 20)),
data = train.fit,
metric = "ROC",
trControl = fitControl) #No output, on purpose
plot(model.glmnet.1)
set.seed(82)
model.glmnet.2 <- train(Survived ~ Pclass + Sex + Age + Embarked + Title + SibSp + Parch + TravelSize,
method = "glmnet",
tuneGrid = expand.grid(alpha = seq(0.05,1.0,.05), lambda = seq(0.0001, .01, length = 20)),
data = train.fit,
metric = "ROC",
trControl = fitControl) #No output, on purpose
plot(model.glmnet.2)
As before, I set up two different glmnet models, with Fare omitted from one of them. I played around with the tuning a little to maximize the ROC value, and plotted the models to see the tuning parameters' effect on it. The outcomes are relatively similar, but again we will use our validation sample later to determine which model is best.
Boosted models
Next we can also test boosted methods. We used `gbm` earlier to estimate the missing Ages, and we can use it again here to model which passengers survived. As before, I will develop two models whose only difference is whether Fare is included.
set.seed(82)
model.gbm.1 <- train(Survived ~ Pclass + Sex + Fare + Age + Embarked + Title + SibSp + Parch + TravelSize,
method = "gbm",
tuneGrid = expand.grid(interaction.depth = c(2, 3), n.trees = (1:30)*5, shrinkage = 0.1,
n.minobsinnode = 10),
data = train.fit,
metric = "ROC",
trControl = fitControl) #No output, on purpose
plot(model.gbm.1)
set.seed(82)
model.gbm.2 <- train(Survived ~ Pclass + Sex + Age + Embarked + Title + SibSp + Parch + TravelSize,
method = "gbm",
tuneGrid = expand.grid(interaction.depth = c(2, 3), n.trees = (5:40)*10, shrinkage = 0.1,
n.minobsinnode = 10),
data = train.fit,
metric = "ROC",
trControl = fitControl) #No output, on purpose
plot(model.gbm.2)
As before, I spent a little time tuning the models separately. There seems to be a small improvement in both gbm models compared to the previous ones. Another popular learning method is adaptive boosting, via the `ada` package; the two boosted methods (gbm and ada) are quite similar. Let's try it out.
set.seed(82)
model.ada.1 <- train(Survived ~ Pclass + Sex + Fare + Age + Embarked + Title + SibSp + Parch + TravelSize,
method = "ada",
tuneGrid = expand.grid(.iter = seq(50, 200, 50), .maxdepth = c(4, 8), .nu = c(0.1)),
data = train.fit,
metric = "ROC",
trControl = fitControl) #No output, on purpose
plot(model.ada.1)
set.seed(82)
model.ada.2 <- train(Survived ~ Pclass + Sex + Age + Embarked + Title + SibSp + Parch + TravelSize,
method = "ada",
tuneGrid = expand.grid(.iter = c(50, 100), .maxdepth = c(4, 8), .nu = c(0.1, 1)),
data = train.fit,
metric = "ROC",
trControl = fitControl) #No output, on purpose
plot(model.ada.2)
Not too surprisingly, the ada method performs roughly like gbm. Ultimately, of course, test.fit will be used to discriminate between all the models we are creating, so we will see which one is best then. Let's move on to the last few models.
Other models
We will give a support vector machine (SVM) a shot here. Using the `kernlab` package, we can fit the `svmRadial` method, which has only two tunable parameters (C and sigma) to play around with.
set.seed(82)
model.svm.1 <- train(Survived ~ Pclass + Sex + Fare + Age + Embarked + Title + SibSp + Parch + TravelSize,
method = "svmRadial",
preProcess = c("center", "scale"),
tuneGrid = expand.grid(C = seq(0.5, 5.0, 0.5), sigma = c(0.005, 0.01, 0.04, 0.07, 0.10)),
data = train.fit,
metric = "ROC",
trControl = fitControl) #No output, on purpose
plot(model.svm.1)
set.seed(82)
model.svm.2 <- train(Survived ~ Pclass + Sex + Age + Embarked + Title + SibSp + Parch + TravelSize,
method = "svmRadial",
preProcess = c("center", "scale"),
tuneGrid = expand.grid(C = seq(0.5, 5.0, 0.25), sigma = c(0.005, 0.01, 0.04, 0.07)),
data = train.fit,
metric = "ROC",
trControl = fitControl) #No output, on purpose
plot(model.svm.2)
It seems the SVM models do not perform quite as well as the boosted models, at least in terms of the maximized ROC value. This could be down to a number of reasons, but again we will have to wait to see how they actually perform later. The last method I wanted to use is one of the most popular and easiest to use: a random forest (RF). There is only one parameter to tune in `train`: mtry, the number of predictors randomly sampled as candidates at each split. The general consensus is to use something close to the square root of the number of predictors. Since we have fewer than ten variables, there is not much room for tuning, but let's see what our last fit can do.
set.seed(82)
model.rf.1 <- train(Survived ~ Pclass + Sex + Fare + Age + Embarked + Title + SibSp + Parch + TravelSize,
method = "rf",
tuneGrid = data.frame(.mtry = c(3, 4, 5, 7)),
data = train.fit,
metric = "ROC",
trControl = fitControl) #No output, on purpose
plot(model.rf.1)
set.seed(82)
model.rf.2 <- train(Survived ~ Pclass + Sex + Age + Embarked + Title + SibSp + Parch + TravelSize,
method = "rf",
tuneGrid = data.frame(.mtry = c(3, 4, 5, 7)),
data = train.fit,
metric = "ROC",
trControl = fitControl) #No output, on purpose
plot(model.rf.2)
Both RF models perform as well as any of the other models tested. Since this is the last model I will test, we can now turn our attention at evaluating the models and choosing the best performer.
Model evaluation
With all our models trained and ready, we can start comparing their performance. One way is the very handy `confusionMatrix` function, which computes statistics from a model's predictions on held-out data. This was a reason we kept some of the training data as a validation (test) set. So let's check out the output for one of our models.
pred.glm.1 <- predict(model.glm.1, test.fit)
confusionMatrix(pred.glm.1, test.fit$Survived)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Perished Survived
## Perished 98 18
## Survived 11 50
##
## Accuracy : 0.8362
## 95% CI : (0.7732, 0.8874)
## No Information Rate : 0.6158
## P-Value [Acc > NIR] : 1.38e-10
##
## Kappa : 0.6469
## Mcnemar's Test P-Value : 0.2652
##
## Sensitivity : 0.8991
## Specificity : 0.7353
## Pos Pred Value : 0.8448
## Neg Pred Value : 0.8197
## Prevalence : 0.6158
## Detection Rate : 0.5537
## Detection Prevalence : 0.6554
## Balanced Accuracy : 0.8172
##
## 'Positive' Class : Perished
##
For brevity, I will not show the `confusionMatrix` output for each model. Instead, I will plot the accuracy of each model (along with the 95% confidence intervals) so that we can compare them; a sketch of how to compute these follows the resampling code below.
It is also useful to check the other statistics available to us. We can use the `resamples` function (again from the `caret` package) to collect the resampling distributions and compare them. In fact, the output can easily be plotted with `dotplot` from `lattice`, though it does sort the data.
library(lattice) # for dotplot
models <- list(glm1 = model.glm.1 , glm2 = model.glm.2,
glmnet1 = model.glmnet.1, glmnet2 = model.glmnet.2,
gbm1 = model.gbm.1 , gbm2 = model.gbm.2,
ada1 = model.ada.1 , ada2 = model.ada.2,
svm1 = model.svm.1 , svm2 = model.svm.2,
rf1 = model.rf.1 , rf2 = model.rf.2)
model.val <- resamples(models)
dotplot(model.val)
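The accuracy comparison mentioned earlier can be pulled straight out of confusionMatrix for each model; a sketch:
acc <- sapply(models, function(m) {
  cm <- confusionMatrix(predict(m, test.fit), test.fit$Survived)
  cm$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")] # point estimate and 95% CI bounds
})
round(t(acc), 4) # one row per model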
Sensitivity is the probability of correctly identifying a true positive, while Specificity is the probability of correctly identifying a true negative; both should be as high as possible. Now that we have statistics with which to compare the models, we can try to determine the best one. Given what we have, and the fact that I have to pick one, I am going to go with the first SVM model: by eye, it seems to have the best overall measures of the bunch.
Predictions
Having selected the SVM model, all that is left is to make a prediction and submit it to Kaggle! Thankfully, this is also by far the easiest part.
the.prediction <- predict(model.svm.1, test.all)
the.solution <- data.frame(PassengerID = test.all$PassengerId, Survived = the.prediction)
the.solution$Survived <- revalue(the.solution$Survived, c("Survived" = 1, "Perished" = 0)) #Don't forget to revalue
write.csv(the.solution, file = "solution.csv", row.names = FALSE, quote = FALSE) #And write it out
After submitting this, we got a score of 0.78947: not too shabby. I am pretty happy with that score, as it currently places in the top third of all entrants.