More From: David Langer

Introduction to Data Science with R - Data Analysis Part 1

6621 ratings | 879753 views
Part 1 of an in-depth, hands-on tutorial introducing the viewer to Data Science with R programming. The video provides end-to-end data science training, including data exploration, data wrangling, data analysis, data visualization, feature engineering, and machine learning. All source code from the videos is available on GitHub. NOTE: The data for the competition has changed since this video series was started. You can find the applicable .CSVs in the GitHub repo. Blog: http://daveondata.com GitHub: https://github.com/EasyD/IntroToDataScience I do Data Science training as a Bootcamp: https://goo.gl/OhIHSc
Text Comments (957)
working at microsoft yet he uses Mac operating system
I am getting this error and would appreciate any help: "Error: `mapping` must be created by `aes()`". I am using ggplot2 3.1.0.
Jenith Nakrani (14 days ago)
Sir, I am getting an error when combining the rows; please help me out. The error is:
    Error in match.names(clabs, names(xi)) : names do not match previous names
My code:
    train <- read.csv("train.csv", header = TRUE)
    test <- read.csv("test.csv", header = TRUE)
    test_survived <- data.frame(survived = rep("None", nrow(test)), test[,])
    data_combined <- rbind(train, test_survived)
fewpoundcory (16 days ago)
lol at 1:09:40
Gabriel Ty (19 days ago)
Literally useless to attempt now, so outdated and not working.. you'll have troubles trying to import files and struggle for hours trying to figure out how to make this work.. CSV files aren't being read by RStudio for some reason.. literally made us download 2 RStudio files for no reason
Jayjay F (19 days ago)
What method or technique are you using in r to predict? Logistic regression?
david kess (24 days ago)
https://www.kaggle.com ... WOW.... great amazing data use... thanks for posting.! data science..#1
Emma Nuel (26 days ago)
how can we use switch to extract those name prefixes? Helped me a lot. Thank you!
ziqi (1 month ago)
Great explanation. Congratulations help me a lot
Zurab N (1 month ago)
Looks like LISP in new shiny wrapping paper :)
Pa (1 month ago)
working for microsoft but using an apple xD ! dmn but great video
Lucas Black (1 month ago)
I'm only 30min into the series and I've had to correct almost every line of the github code. The dataset has changed to include extra variables and while it may be worthwhile watching the video, it's no longer valuable to follow along.
samet öztürk (1 month ago)
Since David created this tutorial, Kaggle has changed the data sets. So before diving into the tutorial, you need to make some adjustments. Otherwise this tutorial is of little use to beginners.
Abhay Patthey (1 month ago)
where is the second part of this video
Brennan Cooke (1 month ago)
I know I am a couple of years late to the game, but I got stuck around 1:14:00 with the extractTitle function. I finally got around it by using:
    # create a utility function to help with title extraction
    extractTitle <- function(name) {
      name <- as.character(name)
      if (length(grep("Miss.", data.combined[i,"Name"])) > 0) {
        return ("Miss.")
      } else if (length(grep("Master.", data.combined[i,"Name"])) > 0) {
        return ("Master.")
      } else if (length(grep("Mrs.", data.combined[i,"Name"])) > 0) {
        return ("Mrs.")
      } else if (length(grep("Mr.", data.combined[i,"Name"])) > 0) {
        return ("Mr.")
      } else {
        return ("Other")
      }
    }
-ultimate beginner
Stavros pantermarakis (1 month ago)
data.combined[which(data.combined$Name %in% dup.Names),] returns this for me:
    # A tibble: 0 x 12
    # ... with 12 variables: Survived <chr>, PassengerId <int>, Pclass <fct>, Name <chr>, Sex <chr>, Age <dbl>, SibSp <int>, Parch <int>, Ticket <chr>, Fare <dbl>, Cabin <chr>, Embarked <chr>
Why is this happening and how can I solve it? Please help :)
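One possible explanation for the empty result above, offered as a guess rather than a diagnosis: the output says "A tibble", and indexing a tibble with [rows, "Name"] returns a one-column table rather than a character vector. Calling as.character() on that gives a single deparsed string, so %in% matches nothing. The sketch below imitates the effect in base R with drop = FALSE on an invented data frame; whether it applies depends on how the CSV was read.

```r
# Toy stand-in for data.combined; the rows are invented for illustration.
data.combined <- data.frame(
  Name = c("Kelly, Mr. James", "Connolly, Miss. Kate",
           "Kelly, Mr. James", "Connolly, Miss. Kate", "Smith, Mr. John"),
  stringsAsFactors = FALSE
)

is.dup <- duplicated(data.combined$Name)

# drop = FALSE keeps a one-column data frame, mimicking what a tibble returns.
one.col <- data.combined[is.dup, "Name", drop = FALSE]
dup.names.bad <- as.character(one.col)  # one deparsed string, not two names
matches.bad <- data.combined[data.combined$Name %in% dup.names.bad, , drop = FALSE]

# Extracting the column with $ first gives a plain character vector.
dup.names <- data.combined$Name[is.dup]
matches <- data.combined[data.combined$Name %in% dup.names, , drop = FALSE]
```

Here matches.bad has no rows while matches contains every row whose name occurs more than once.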
Araldor123 (1 month ago)
Did you pre-write all of the lines 13:10 ? When I open the data none of that is there, I can type it out but feel I may have done something wrong already.
lenin christ (1 month ago)
Wow it is really wonderful and awesome thus it is very much useful for me to understand many concepts and helped me a lot. it is really explainable very well and i got more information from your Video. Visit a Website https://bit.ly/2ILiobL https://bit.ly/2ILtxJD
Maxwell Holmes (2 months ago)
Maybe I am wrong, or maybe the program has changed significantly since 2014, but geom_histogram doesn't work for me and I had to use geom_bar instead.
Emma Nuel (2 months ago)
how do i fix the error message saying the number of columns for the arguments are different.
akanksha mathur (2 months ago)
Its really great to learn from your videos. Thanks for posting !
polarbear60 (2 months ago)
I can tell you are a data pirate because you use the R language...
Rahul soni (2 months ago)
@david langer: Is this video helpful for non programmers or people from NON IT background?
Monde Nyawose (2 months ago)
This video is a collection of dad Jokes (See ThisOldTony) sprinkled with actual technical stuff.
alberto medina (3 months ago)
You are a really good teacher. Thanks for your time!
beezer524 (3 months ago)
Excellent video. Not only helpful in learning R, but I learned a bit about looking at data prior to analyzing it too.
Devyani Acharya (3 months ago)
When I run the third piece of code, which creates data.combined, I receive an error. This is how it looks in the RStudio console:
    # Combine data sets
    > data.combined <- rbind(train, test.survived)
    Error in match.names(clabs, names(xi)) : names do not match previous names
I tried a number of ways to rectify it. I tried running:
    > names(test.survived[[1]]) <- names(train[[2]])
    > identical(names(test.survived[[1]]), names(train[[2]]))
    [1] TRUE
But the mismatch error is still there. Can anyone help me debug it?
Gaurav Deshmukh (1 month ago)
I faced the same problem today with the datasets from Kaggle. There is an extra PassengerId column, which I removed with train <- train[, -<colnumber>]. I also made sure the column headings are exactly the same, since they are case-sensitive.
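A minimal sketch of how the mismatch in this thread could be located before calling rbind(). The two data frames are invented miniatures, not the Kaggle files, and the lowercase "survived" column is assumed to be the culprit, as other comments report.

```r
# Invented miniature versions of the two data frames.
train <- data.frame(PassengerId = 1:2, Survived = c(0L, 1L), Pclass = c(3L, 1L))
test  <- data.frame(PassengerId = 3:4, Pclass = c(2L, 3L))
test.survived <- data.frame(survived = rep("None", nrow(test)), test)

# Names present in one data frame but not the other (comparison is case-sensitive).
only.in.train <- setdiff(names(train), names(test.survived))
only.in.test  <- setdiff(names(test.survived), names(train))

# Rename the offending column by name rather than by position.
names(test.survived)[names(test.survived) == "survived"] <- "Survived"
same.names <- setequal(names(train), names(test.survived))
```

Renaming by name avoids depending on column positions, which differ between versions of the data set.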
TheHdog101 (3 months ago)
at 55:37 i am not getting the same output when plugging in the code dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$Name))), "Name"]) as well as the following code data.combined[which(data.combined$Name %in% dup.names),] Anyone who can help ?
MrDamien9445 (2 months ago)
dup.names <- as.character(data.combined[duplicated(as.character(data.combined$Name)),"Name"]) and then data.combined[data.combined$Name %in% dup.names,]
Muradean (3 months ago)
great video and very cool lecture!This Titanic example is way more curious and interesting than measuring temperature or raining rates
Chandrakant Tiwari (3 months ago)
for drawing ggplot use code- "ggplot(train, aes(x=Pclass, fill=factor(Survived)))+geom_bar(width=0.5)+labs(fill="Survived")".
ruslan smirnov (3 months ago)
'Women and children first' is only legitimate if there are guaranteed seats in lifeboats for everyone, which was not the case with Titanic. So, it was illegal to save more females than men just of this gentle principle. It was not designed to save more women, it meant to provide them comfort as soon as possible, and men patiently wait. What a tragic misconception for men. What a lucky hypocrisy for women.
Rajnish Gaur (3 months ago)
Thanks David !! Really great video !!
Albert Lazar (3 months ago)
R is a flea on the back of SAS. That's why I use flea control collars. Yikes!!
Filius Flitwick (3 months ago)
You are very helpful. I just started getting into data science myself, and I also study Statistics. You are explaining everything I'm curious about. Thank you very much for putting time and effort into this. ^^
Raj Sheth (4 months ago)
Thank you so much. You're amazing.!!
Abir Soni (4 months ago)
For those who have a problem with the plot: Use the recent qplot function qplot(Pclass, data = train, geom = "auto", fill = Survived, width = 0.5) i.e as per the syntax qplot(x,y,geom,fill,width)
Arief Anbiya (4 months ago)
BEPEC (4 months ago)
Great series, let me know there is a sessions on python too?
张家墅 (4 months ago)
If you have trouble combining data frames, just change the survived into Survived in the 'data.frame(survived =...' part.
Philip Pizzo (4 months ago)
Okay cool.
Stefano Tusini (5 months ago)
I once believed in Vasco Rossi. Now I find this wretch disgusting.
Stephanie, Belle And mum (5 months ago)
Very genuine lecture thanks a lot !
Kevin (5 months ago)
Great job making these videos! I am a long-time S-PLUS user trying to convert to RStudio. I see some redundancy in your R code. A couple of things: 1) In a lot of places, "which" is not necessary. For example, to get "dup.names" you can simply use "as.character(data.combined$Name[duplicated(as.character(data.combined$Name))])". 2) Try to avoid loops as much as possible in R; using R's apply functions makes your code simpler and often faster. For example, to get the titles you used a for loop, but you could instead use "data.combined$title <- sapply(data.combined$Name, extractTitle)".
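The two suggestions above can be sketched as follows. This is an illustration, not the video's code: the names are sample strings in the dataset's "Surname, Title. Given names" format, and extractTitle is rewritten here from the versions quoted in the comments. vapply() is used because it returns a character vector, whereas lapply() would return a list.

```r
# Sample names in the dataset's format, with one duplicate.
name <- c("Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley",
          "Heikkinen, Miss. Laina", "Palsson, Master. Gosta Leonard",
          "Uruchurtu, Don. Manuel E", "Braund, Mr. Owen Harris")

# duplicated() never returns NA, so which() adds nothing here.
dup.logical <- name[duplicated(name)]
dup.which   <- name[which(duplicated(name))]

# Title extraction in the style used in the comments ("Mrs." is tested before "Mr.").
extractTitle <- function(name) {
  name <- as.character(name)
  if (length(grep("Miss.", name)) > 0) {
    "Miss."
  } else if (length(grep("Master.", name)) > 0) {
    "Master."
  } else if (length(grep("Mrs.", name)) > 0) {
    "Mrs."
  } else if (length(grep("Mr.", name)) > 0) {
    "Mr."
  } else {
    "Other"
  }
}

# One call replaces the for loop and the repeated c(titles, ...) growth.
titles <- vapply(name, extractTitle, character(1), USE.NAMES = FALSE)
title.factor <- as.factor(titles)
```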
Bibhuti Bhusan Panda (5 months ago)
+David Langer For a guy like me you are just a life saver. I know a good amount of theory but never done any practical in my life. I always wanted good examples before starting to do something of my own. Actually good example builds my confidence. I don't know how to thank you but one thing I could say, I am fully satisfied with it.
Ryan Buchanan (5 months ago)
I typed the code exactly as you did and it did not fill the histogram with color.
Laura - Youtube (5 months ago)
Hi there! I'm level "0" on R. Just wondering from where you get the upper left screen on RStudio (Titanic DataAnalysis R". I just see my Console. Thanks.
Nura BinTTrax (5 months ago)
Interesting video but my head is going to explode - I think I'll stick with SPSS. So much easier.
demudu naganaidu (5 months ago)
Very usefull
William Chen (6 months ago)
For anyone getting a 'Perhaps you want stat="count"?' error at the ggplot part (around 41:00) with newer versions of ggplot2, replace geom_histogram with geom_bar. The two now behave differently: geom_bar is for discrete variables like the Pclass factor, and geom_histogram only works with continuous variables. I was getting this problem and this change fixed it.
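A small sketch of the geom_bar replacement described above, assuming the ggplot2 package is installed. The data frame is an invented miniature of the training data, so the counts are not the real ones.

```r
library(ggplot2)

# Invented miniature of the Titanic training data.
train <- data.frame(
  Pclass   = factor(c(1, 1, 2, 3, 3, 3)),
  Survived = factor(c(1, 0, 1, 0, 0, 1))
)

# geom_bar() counts rows per discrete x value (its default stat is "count").
p <- ggplot(train, aes(x = Pclass, fill = Survived)) +
  geom_bar(width = 0.5) +
  xlab("Pclass") +
  ylab("Total Count") +
  labs(fill = "Survived")
```

Printing p (or typing p at the console) draws the stacked bar chart.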
Anthony Foster (6 months ago)
Hi, this has been the best video I've found on R and it has been absolutely amazing. I've worked through it very carefully and I'm so glad you made it! If you are still answering questions, I have a small problem relating to the very end of the video, and I can't find a solution on GitHub or in the comments. I've downloaded the original datasets from your GitHub. Basically, I can't get the 'Mrs' to show up in the final plot. The weird thing is, I can delete her code, copy the working code from any of the others, paste it in and change it, but she still won't appear. Here is my code:
    extractTitle <- function(Name) {
      Name <- as.character(Name)
      if (length(grep("Miss.", Name)) > 0) {
        return ("Miss.")
      } else if (length(grep("Master.", Name)) > 0) {
        return ("Master.")
      } else if (length(grep("Mrs.", Name)) > 0) {
        return ("Mrs.")
      } else if (length(grep("Mr.", Name)) > 0) {
        return ("Mr.")
      } else {
        return ("Other")
      }
    }
    ggplot(data.combined[1:891,], aes(x = title, fill = Survived)) +
      geom_bar(stat = "count") +
      facet_wrap(~Pclass) +
      ggtitle("Pclass") +
      xlab("Title") +
      ylab("Total Count") +
      labs(fill = "Survived")
Any advice gratefully received!! :)
Gilium 117 (6 months ago)
when I did the command for finding duplicates, my dup.names show character(empty) for the Unique names I got 1307/1309 still
Yung Gud (6 months ago)
1:11:55 it's free real estate
Rohit Ramesh (6 months ago)
Thanks a lot for this course! The way you have explained the concepts is just too good! This is a very good place to start Machine Learning. The series is pretty long but really informative. I'd REALLY RECOMMEND THIS TO EVERYONE who is checking the comments section for some inspiration.
SheenSenseBulldozer (6 months ago)
Can anybody help me out? I am trying to combine data sets using the code: data.combined <- rbind(train, test.survived) I'm getting this error, and I've checked my file names and am confused. data.combined <- rbind(train, test.survived) Error in match.names(clabs, names(xi)) : names do not match previous names Thanks!
aperxmim (6 months ago)
i do not know if you ever mentioned that the passengerid variable was not included in both datasets(test & train). The users must remove passengerid variable from the test.csv and train.csv, but I would make a copy of both datasets before proceeding.
Gregory Laskaris (6 months ago)
Greetings from Greece. Congratulations, very nice lectures. Allow me to point that you don't need "which" in front of str_detect. It is redundant. Actually you don't need str_detect. Just use grep.
rahul chaudhary (6 months ago)
Thanks for the very informative series. I am a complete newbie to R and data analytics and am learning purely for my own pleasure. I need some help, if you could: I have extracted the titles (Miss., Mr., etc.) but they are not right. In the combined data the first 3 should be Mr., Mrs., Miss. but I am getting Mr., Mr., Miss. I used this code:
    NameTitles <- function(Name){
      name <- as.character(Name)
      if (length(grep("Miss.",Name)) > 0){
        return("Miss.")
      } else if (length(grep("Master.",Name)) > 0){
        return("Master.")
      } else if (length(grep("Mr.",Name)) > 0){
        return("Mr.")
      } else if (length(grep("Mrs.",Name)) > 0){
        return("Mrs.")
      } else {
        return("Other")
      }
    }
    for(i in 1:nrow(data.combined)){
      titles <- c(titles, NameTitles(data.combined[i, "Name"]))
    }
    data.combined$title <- as.factor(titles)
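A likely reason the code above labels "Mrs." passengers as "Mr.": grep() treats its pattern as a regular expression, where "." matches any character, so "Mr." also matches the "Mrs" in "Mrs.". The sketch below shows this on two sample names and one possible workaround (fixed = TRUE); testing "Mrs." before "Mr." is another. The function name titleOf is made up for this illustration.

```r
mrs <- "Cumings, Mrs. John Bradley"

regex.hit   <- grepl("Mr.", mrs)                # TRUE: "." matches the "s" in "Mrs"
fixed.hit   <- grepl("Mr.", mrs, fixed = TRUE)  # FALSE: literal "Mr." is absent
escaped.hit <- grepl("Mr\\.", mrs)              # FALSE: the dot is escaped

# Hypothetical helper: same test order as the comment, but with literal matching.
titleOf <- function(name) {
  name <- as.character(name)
  if (grepl("Miss.", name, fixed = TRUE)) {
    "Miss."
  } else if (grepl("Master.", name, fixed = TRUE)) {
    "Master."
  } else if (grepl("Mr.", name, fixed = TRUE)) {
    "Mr."
  } else if (grepl("Mrs.", name, fixed = TRUE)) {
    "Mrs."
  } else {
    "Other"
  }
}

mrs.title <- titleOf(mrs)
mr.title  <- titleOf("Braund, Mr. Owen Harris")
```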
Gregory Laskaris (6 months ago)
Something's wrong with both of the data sets. I get NAs
Shishir Pandey (6 months ago)
You have shared an awesome stuff! The way you have explained was just exceptional!I wonder if you can share more use cases and different scenarios which will help much! Thanks again and super show ✓
Rohit Ramesh (6 months ago)
Please note that you will run into an error given below: data.combined <- rbind(train, test.survived) Error in match.names(clabs, names(xi)) : names do not match previous names So use test.survived <- data.frame(Survived = rep("None", nrow(test)), test[,]) instead of test.survived <- data.frame(survived = rep("None", nrow(test)), test[,])
Paulo Henrique (6 months ago)
Thank you for that!
Rohit Ramesh (6 months ago)
Use test.survived <- data.frame(Survived = rep("None", nrow(test)), test[,]) instead of test.survived <- data.frame(survived = rep("None", nrow(test)), test[,]) Basically, all the feature names (column names) are supposed to be starting with an uppercase letter. At the time of making this video the data set had it starting with lowercase but it got changed later.
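The fix described above can be reproduced on invented miniature data frames (not the Kaggle files). rbind() on data frames matches columns by name, so a lowercase "survived" column fails while "Survived" works, even though the column order differs.

```r
# Invented miniature versions of train and test.
train <- data.frame(PassengerId = 1:2, Survived = c(0L, 1L), Pclass = c(3L, 1L))
test  <- data.frame(PassengerId = 3:4, Pclass = c(2L, 3L))

# Lowercase name: rbind() stops because the names do not match.
bad <- data.frame(survived = rep("None", nrow(test)), test)
bad.result <- tryCatch(rbind(train, bad), error = function(e) conditionMessage(e))

# Matching name: the rows combine, and Survived becomes a character column.
test.survived <- data.frame(Survived = rep("None", nrow(test)), test)
data.combined <- rbind(train, test.survived)
```

After the fix, data.combined has four rows, with "None" in Survived for the rows that came from test.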
SheenSenseBulldozer (6 months ago)
How do I solve this issue?
kevin john (6 months ago)
Its very nice and good video... https://bit.ly/2FZKOMg
Andrew McLaughlin (6 months ago)
Nice video. Check this out - https://flow-analytics.com/blog/flow-analytics-crash-course-part-1
Brian Perkins (6 months ago)
I am learning R from SAS. They are very similar. Excellent explanation, great video!
James Weldon (6 months ago)
When I plot the table table(train$Survived, train$Title, train$Pclass) it shows no people in 3rd class with the title 'Other', as expected. However, when I plot the same data via ggplot (with facet_wrap(~Pclass)) it shows approximately 20 people in 3rd class with title 'Other'. This is true of a number of variables. It plots accurately when I don't include the facet_wrap. Is anybody else having this issue? Any idea what may be causing it?
Tayyab Clean (7 months ago)
priyanka (7 months ago)
Hi David, I was getting 'StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat="count"?' at 44:29, i.e. when plotting the graph, but using geom_bar resolved my problem.
kevin george (7 months ago)
its very nice and good... https://bit.ly/2FZKOMg
christian Martin (7 months ago)
May ask why did you remove the Excel and r video course?
Prajita Sawant (7 months ago)
Great tutorial ......Thank you......
Mario Ma (7 months ago)
Sorry, see all videos now, tx much !!!
Mario Ma (7 months ago)
Tx much David, do you have part 2 where we do model and process test ds? I undestand this is part of some learning program, right ? Would be nice to get cont. even with pay. Tx again. May 2018
Ken Tobin (7 months ago)
Thank you! Looking forward to applying R some day soon. Nice Job!
Nyamaa Bayarma (7 months ago)
Hi David, i'm just wondering, since randomforest algorithm already does feature selection, is it possible that you get almost the same accuracy while not doing data exploratory analysis?
Imran Naseem (7 months ago)
Hi David, I hope you are well. Why you have stopped posting new videos. You also have removed other useful videos. thanks, Imran
Danny waz (7 months ago)
I am having an issue with ggplot: the plot is not loading. In Titanic.r:
    # load ggplot2
    train$Pclass <- as.factor(train$Pclass)
    ggplot(train, aes(x=Pclass, fill=factor(Survived)) +
      geom_histogram(width=0.5)+
      xlab("pclass")+
      ylab("total count")+
      labs(fill= "Survival")
Console:
    ggplot(train, aes(x=Pclass, fill=factor(Survived)) +
    + geom_histogram(width=0.5)+
    + xlab("pclass")+
    + ylab("total count")+
    + labs(fill= "Survival")
    +
I am unable to view the plot.
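A possible reading of the console output above: the trailing "+" prompt means R is still waiting for more input, which is what happens when a parenthesis is left open. In the posted call, ggplot( is never closed. The sketch below checks this with parse(), which only reads the code and does not need ggplot2; the helper name parsesCleanly is made up for this illustration.

```r
# The first line of the call as posted: ggplot( is opened but never closed.
posted <- 'ggplot(train, aes(x=Pclass, fill=factor(Survived)) + geom_histogram(width=0.5)'

# The same line with the missing parenthesis added after aes(...).
balanced <- 'ggplot(train, aes(x=Pclass, fill=factor(Survived))) + geom_histogram(width=0.5)'

# Hypothetical helper: TRUE if the text is a complete R expression.
parsesCleanly <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

posted.ok   <- parsesCleanly(posted)    # FALSE: unexpected end of input
balanced.ok <- parsesCleanly(balanced)  # TRUE
```

Pressing Esc in the console cancels the pending input. Separately, other comments on this page report that geom_histogram has to be swapped for geom_bar when x is a factor.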
Technogeekscs Pune (7 months ago)
I really enjoyed this tutorial. Very good effort and excellent presentation. https://bit.ly/2pXSuIW
Ali Safa (7 months ago)
Thanks a lot for such a valuable video!
engrakas (8 months ago)
I am trying the code but I get this error: data.combined <- rbind(train, test.survived) Error in match.names(clabs, names(xi)) : names do not match previous names. Could you please clarify that?
Steven Rhodes (8 months ago)
Works for Microsoft... hmm cool...Clicks the Mac link... NOOOOOOO!!!
Ryan Seitz (8 months ago)
As someone evaluating an MBA focused around data analytics, this video was a tremendous help. Thanks, David!
Tarik Jake (8 months ago)
Can you please share the URL of the correct data?? I have read the comments and I ve searched on github, but the data set is still not working. I am having issues with opening the data sets. The ones that I can open have the Passenger ID in first column so when I add column to create the test.Survived, it is creating to the first column. So my test and train have 12 columns but the names of the columns do not match...
Mooni Boo (8 months ago)
still like that video in 2018! very helpful and very "comforting" voice :D
asdfasdfadsfasdfasdfa (8 months ago)
Very useful video, thanks
Renato Gomes (8 months ago)
very very good!
Zoe Salinas (8 months ago)
quite informative!! getting back to this video as soon as I finish work
Charles Pilgrim (8 months ago)
This guy sounds like the fella in charge of the party in the Leeroy Jenkins video
Daryl Goh (8 months ago)
Hi anyone encounter this error? Code: ggplot(train, aes(x = Pclass, fill = factor(Survived))) + stat_count(width = .5) + xlab("Pclass") + ylab("Total Count") + labs(fill = "Survived") Error log: Warning messages: 1: In grid.newpage() : this function not yet implemented 2: In grDevices::recordGraphics(requireNamespace("ggplot2", qui... : this function not yet implemented 3: In grobName(grob, prefix) : this function not yet implemented (4 times) 4: In rectGrob(coords$xmin, coords$ymax, width = coords$xmax - ... : this function not yet implemented 5: In unit(scale_details$x.major, "native") : this function not yet implemented 6: In unit(scale_details$y.major, "native") : this function not yet implemented 7: In unit(scale_details$y.minor, "native") : this function not yet implemented 8: In grobTree(element_render(theme, "panel.background"), if(le... : this function not yet implemented 9: In gTree(children = do.call("gList", panel)) : this function not yet implemented 10: In gpar(fontsize = size, col = colour, fontfamily = family, ... : this function not yet implemented 11: In gpar(fontsize = element$size, col = element$colour, fontf... : this function not yet implemented 12: In unit(rep(xp, n), "npc") : this function not yet implemented 13: In unit(rep(yp, n), "npc") : this function not yet implemented Error in titleGrob(label, x, y, hjust = hj, vjust = vj, angle = angle : could not find function "descentDetails"
Hailah AlArifi (8 months ago)
Hello, I work on a project about knowledge base constructed from text using NLP and IE. And I have some difficulties finding a data set and the process of how to work on it. If you have any information send me an email. [email protected] Thank you.
john bake (8 months ago)
how can you be doing predictive modelling and be IT, do you have a math degree?
ENTROPY (9 months ago)
Git Gud
Ravi Rajput (9 months ago)
Nice one; this will also add some info https://youtu.be/q2czpYm81dY
Amit Baderia (9 months ago)
Very good Explanation
AirborneLRRP (9 months ago)
You are a rock star. YES
Florian Wicher (9 months ago)
Thank you Dave, i'm learning so much :D
kablamopow (9 months ago)
Thanks for the video. Great Info
