
Introduction to Data Science with R - Data Analysis Part 1

6161 ratings | 832543 views
Part 1 of an in-depth, hands-on tutorial introducing the viewer to data science with R programming. The video provides end-to-end data science training, including data exploration, data wrangling, data analysis, data visualization, feature engineering, and machine learning. All source code from the videos is available on GitHub. NOTE: The data for the competition has changed since this video series was started. You can find the applicable .CSVs in the GitHub repo. Blog: http://daveondata.com GitHub: https://github.com/EasyD/IntroToDataScience I do Data Science training as a Bootcamp: https://goo.gl/OhIHSc
Text Comments (923)
lenin christ (1 day ago)
Wow it is really wonderful and awesome thus it is very much useful for me to understand many concepts and helped me a lot. it is really explainable very well and i got more information from your Video. Visit a Website https://bit.ly/2ILiobL https://bit.ly/2ILtxJD
Maxwell Holmes (3 days ago)
Maybe I am wrong, or maybe the program has changed significantly since 2014, but geom_histogram doesn't work for me and I had to use geom_bar instead.
Liza Harper (7 days ago)
I recommend fixed version lastest working need a rule for firewall.All options may cause a problem https://yadi.sk/d/CT0Q13JH3KF5Lv
Kenneth Leung (10 days ago)
I'm getting an error when trying to plot the histogram. Any suggestions?
    ggplot(train, aes(x = Pclass, fill = factor(Survived))) +
      geom_histogram(width = 0.5) +
      xlab("Pclass") +
      ylab("Total Count") +
      labs(fill = "Survived")
Error: StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?
Kenneth Leung (10 days ago)
fyi for anyone in the future, changing geom_histogram(width = 0.5) to geom_bar(width = 0.5) fixed my problem. Apparently it was updated that the histogram graph is for continuous data and bar graph is for discrete
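The fix described in the two comments above can be sketched as follows. This is a minimal example, assuming ggplot2 is installed; the small `train` data frame is a made-up stand-in for the Titanic training set, not the real data.

```r
library(ggplot2)

# Tiny made-up stand-in for the Titanic training data (illustration only)
train <- data.frame(
  Pclass   = factor(c(1, 1, 2, 3, 3, 3)),
  Survived = c(1, 0, 1, 0, 0, 1)
)

# Pclass is a discrete factor, so count rows per class with geom_bar()
# instead of binning a continuous variable with geom_histogram()
p <- ggplot(train, aes(x = Pclass, fill = factor(Survived))) +
  geom_bar(width = 0.5) +
  xlab("Pclass") +
  ylab("Total Count") +
  labs(fill = "Survived")

print(p)
```

geom_histogram() is still the right choice for a continuous column such as Age or Fare.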
Emma Nuel (14 days ago)
how do i fix the error message saying the number of columns for the arguments are different.
akanksha mathur (14 days ago)
Its really great to learn from your videos. Thanks for posting !
polarbear60 (15 days ago)
I can tell you are a data pirate because you use the R language...
Rahul soni (25 days ago)
@david langer: Is this video helpful for non programmers or people from NON IT background?
Nila shri (1 month ago)
Wow, it is really wonderful and awesome thus it is very much useful for me to understand many concepts and helped me a lot. it is really explainable very well and I got more information from your video. https://bit.ly/2NQaDXb Read my Blog:https://bit.ly/2PPVanx
Monde Nyawose (1 month ago)
This video is a collection of dad Jokes (See ThisOldTony) sprinkled with actual technical stuff.
Shalini Priya (1 month ago)
Wow, it is really wonderful and awesome thus it is very much useful for me to understand many concepts and helped me a lot. it is really explainable very well and I got more information from your video. https://bit.ly/2JE1ZVO Read my Blog:https://bit.ly/2OoW2iz
alberto medina (1 month ago)
You are a really good teacher. Thanks for your time!
beezer524 (1 month ago)
Excellent video. Not only helpful in learning R, but I learned a bit about looking at data prior to analyzing it too.
Devyani Acharya (1 month ago)
When I run the third piece of code, the one that creates data.combined, I get an error. This is how it looks in the RStudio console:
    > data.combined <- rbind(train, test.survived)
    Error in match.names(clabs, names(xi)) : names do not match previous names
I tried a number of ways to fix it, for example:
    > names(test.survived[[1]]) <- names(train[[2]])
    > identical(names(test.survived[[1]]), names(train[[2]]))
    [1] TRUE
But the mismatch error is still there. Can anyone help me debug it?
TheHdog101 (1 month ago)
at 55:37 i am not getting the same output when plugging in the code dup.names <- as.character(data.combined[which(duplicated(as.character(data.combined$Name))), "Name"]) as well as the following code data.combined[which(data.combined$Name %in% dup.names),] Anyone who can help ?
MrDamien9445 (1 month ago)
dup.names <- as.character(data.combined[duplicated(as.character(data.combined$Name)), "Name"]) and then data.combined[data.combined$Name %in% dup.names, ]
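A small base R sketch of the duplicate-name lookup discussed above. The names and the `data.combined` frame here are made up for illustration; with the real data the output depends on which version of the CSV files was loaded.

```r
# Made-up rows for illustration; not the real combined Titanic data
data.combined <- data.frame(
  Name = factor(c("Kelly, Mr. James", "Connolly, Miss. Kate",
                  "Kelly, Mr. James", "Wilkes, Mrs. James",
                  "Connolly, Miss. Kate")),
  Pclass = c(3, 3, 3, 3, 3)
)

# duplicated() flags the second and later copies of each value
name.chr  <- as.character(data.combined$Name)
dup.names <- unique(name.chr[duplicated(name.chr)])

# All rows that carry one of the repeated names
dup.rows <- data.combined[name.chr %in% dup.names, ]
print(dup.rows)
```

If `dup.names` comes back empty, it may be worth checking `length(unique(name.chr))` against `nrow(data.combined)` to confirm whether the loaded data contains repeated names at all.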
Muradean (1 month ago)
great video and very cool lecture!This Titanic example is way more curious and interesting than measuring temperature or raining rates
Chandrakant Tiwari (1 month ago)
for drawing ggplot use code- "ggplot(train, aes(x=Pclass, fill=factor(Survived)))+geom_bar(width=0.5)+labs(fill="Survived")".
ruslan smirnov (1 month ago)
'Women and children first' is only legitimate if there are guaranteed seats in lifeboats for everyone, which was not the case with Titanic. So, it was illegal to save more females than men just of this gentle principle. It was not designed to save more women, it meant to provide them comfort as soon as possible, and men patiently wait. What a tragic misconception for men. What a lucky hypocrisy for women.
Rajnish Gaur (1 month ago)
Thanks David !! Really great video !!
Albert Lazar (1 month ago)
R is a flea on the back of SAS. That's why I use flea control collars. Yikes!!
Filius Flitwick (2 months ago)
You are very helpful. I just started getting into data science myself, and I also study Statistics. You are explaining everything I'm curious about. Thank you very much for putting time and effort into this. ^^
Raj Sheth (2 months ago)
Thank you so much. You're amazing.!!
Abir Soni (2 months ago)
For those who have a problem with the plot: Use the recent qplot function qplot(Pclass, data = train, geom = "auto", fill = Survived, width = 0.5) i.e as per the syntax qplot(x,y,geom,fill,width)
Arief (2 months ago)
BEPEC (2 months ago)
Great series, let me know there is a sessions on python too?
张家墅 (2 months ago)
If you have trouble combining data frames, just change the survived into Survived in the 'data.frame(survived =...' part.
Philip Pizzo (2 months ago)
Okay cool.
Stefano Tusini (3 months ago)
I once believed in Vasco Rossi. Now this whore disgusts me.
Stephanie, Belle And mum (3 months ago)
Very genuine lecture thanks a lot !
Kevin (3 months ago)
Great job making these videos! I am a long-time S-PLUS user trying to convert to RStudio. I see some redundancy in your R coding. A couple of things: 1) In a lot of places "which" is not necessary. For example, to get "dup.names" you can simply use "as.character(data.combined$Name[duplicated(as.character(data.combined$Name))])". 2) Try to avoid loops as much as possible in R; using R's apply-family functions will make your code faster and simpler. For example, you used a for loop to get the titles, but you could instead use "data.combined$title <- sapply(data.combined$Name, extractTitle)" (sapply rather than lapply, so that the result is a vector and not a list).
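A hedged sketch of the loop-free approach suggested above. The `extractTitle` function here is a hypothetical, shortened version written for this example (the one in the video has more branches), and the names are sample rows only.

```r
# Hypothetical, shortened extractTitle(); not the exact function from the video
extractTitle <- function(name) {
  name <- as.character(name)
  if (grepl("Miss.", name, fixed = TRUE)) {
    "Miss."
  } else if (grepl("Mrs.", name, fixed = TRUE)) {
    "Mrs."
  } else if (grepl("Mr.", name, fixed = TRUE)) {
    "Mr."
  } else {
    "Other"
  }
}

# Sample rows for illustration
data.combined <- data.frame(
  Name = c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina",
           "Futrelle, Mrs. Jacques Heath"),
  stringsAsFactors = FALSE
)

# lapply() would return a list; vapply() returns a plain character vector
# and checks that every call yields exactly one string
titles <- vapply(data.combined$Name, extractTitle, character(1),
                 USE.NAMES = FALSE)
data.combined$title <- as.factor(titles)
print(data.combined)
```

vapply() is used here instead of sapply() only because it makes the expected return type explicit; either avoids growing a vector inside a for loop.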
Bibhuti Bhusan Panda (3 months ago)
+David Langer For a guy like me you are just a life saver. I know a good amount of theory but never done any practical in my life. I always wanted good examples before starting to do something of my own. Actually good example builds my confidence. I don't know how to thank you but one thing I could say, I am fully satisfied with it.
Ryan Buchanan (3 months ago)
I typed the code exactly as you did and it did not fill the histogram with color.
Laura - Youtube (3 months ago)
Hi there! I'm level "0" on R. Just wondering from where you get the upper left screen on RStudio (Titanic DataAnalysis R". I just see my Console. Thanks.
Nura BinTTrax (4 months ago)
Interesting video but my head is going to explode - I think I'll stick with SPSS. So much easier.
demudu naganaidu (4 months ago)
Very useful
William Chen (4 months ago)
For anyone getting a "perhaps you want stat = count?" error at the ggplot part (around 41:00) using newer versions of R, replace geom_histogram with geom_bar. geom_histogram and geom_bar have been split into two functions now, where geom_bar is for discrete variables like the Pclass factor, and geom_histogram only works with continuous variables. I was getting this problem and doing so fixed it.
Anthony Foster (4 months ago)
Hi, this has been the best video I've found on R and it has been absolutely amazing. I've worked through it very carefully and I'm so glad you made it! If you are still answering questions, I have a small problem relating to the very end of the video and I can't find a solution on GitHub or in the comments. I've downloaded the original datasets from your GitHub. Basically, I can't get the 'Mrs' to show up in the final plot. The weird thing is, I can delete her code, copy the working code from any of the others, paste it in and change it, but still she won't appear. Here is my code:
    extractTitle <- function(Name) {
      Name <- as.character(Name)
      if (length(grep("Miss.", Name)) > 0) {
        return("Miss.")
      } else if (length(grep("Master.", Name)) > 0) {
        return("Master.")
      } else if (length(grep("Mrs.", Name)) > 0) {
        return("Mrs.")
      } else if (length(grep("Mr.", Name)) > 0) {
        return("Mr.")
      } else {
        return("Other")
      }
    }
    ggplot(data.combined[1:891,], aes(x = title, fill = Survived)) +
      geom_bar(stat = "count") +
      facet_wrap(~Pclass) +
      ggtitle("Pclass") +
      xlab("Title") +
      ylab("Total Count") +
      labs(fill = "Survived")
Any advice gratefully received!! :)
Gilium 117 (4 months ago)
when I did the command for finding duplicates, my dup.names show character(empty) for the Unique names I got 1307/1309 still
Yung Gud (4 months ago)
1:11:55 it's free real estate
Rohit Ramesh (4 months ago)
Thanks a lot for this course! The way you have explained the concepts is just too good! This is a very good place to start Machine Learning. The series is pretty long but really informative. I'd REALLY RECOMMEND THIS TO EVERYONE who is checking the comments section for some inspiration.
SheenSenseBulldozer (4 months ago)
Can anybody help me out? I am trying to combine data sets using the code: data.combined <- rbind(train, test.survived) I'm getting this error, and I've checked my file names and am confused. data.combined <- rbind(train, test.survived) Error in match.names(clabs, names(xi)) : names do not match previous names Thanks!
aperxmim (4 months ago)
i do not know if you ever mentioned that the passengerid variable was not included in both datasets(test & train). The users must remove passengerid variable from the test.csv and train.csv, but I would make a copy of both datasets before proceeding.
Gregory Laskaris (4 months ago)
Greetings from Greece. Congratulations, very nice lectures. Allow me to point that you don't need "which" in front of str_detect. It is redundant. Actually you don't need str_detect. Just use grep.
rahul chaudhary (4 months ago)
Thanks for the very informative series. I am a complete newbie to R and data analytics and am learning purely for my own pleasure. I need some help, if you could: I have extracted the titles (Miss., Mr., etc.) but they are not coming out right. In the combined data the first three should be Mr., Mrs., Miss. but I am getting Mr., Mr., Miss. I used this code:
    NameTitles <- function(Name){
      name <- as.character(Name)
      if (length(grep("Miss.", Name)) > 0){
        return("Miss.")
      } else if (length(grep("Master.", Name)) > 0){
        return("Master.")
      } else if (length(grep("Mr.", Name)) > 0){
        return("Mr.")
      } else if (length(grep("Mrs.", Name)) > 0){
        return("Mrs.")
      } else {
        return("Other")
      }
    }
    for(i in 1:nrow(data.combined)){
      titles <- c(titles, NameTitles(data.combined[i, "Name"]))
    }
    data.combined$title <- as.factor(titles)
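One likely cause of "Mrs." turning into "Mr." in the question above: grep() treats its pattern as a regular expression, where "." matches any character, so the pattern "Mr." also matches the text "Mrs". A minimal base R sketch of the effect and one possible fix follows; the `extractTitle` function is an illustrative rewrite, not the code from the video.

```r
name <- "Cumings, Mrs. John Bradley"

# "." is a regex wildcard, so the pattern "Mr." also matches "Mrs"
grepl("Mr.", name)                 # TRUE
# fixed = TRUE treats the pattern as a literal string
grepl("Mr.", name, fixed = TRUE)   # FALSE
grepl("Mrs.", name, fixed = TRUE)  # TRUE

# Illustrative rewrite: literal matching, with "Mrs." tested before "Mr."
extractTitle <- function(name) {
  name <- as.character(name)
  if (grepl("Miss.", name, fixed = TRUE)) {
    "Miss."
  } else if (grepl("Master.", name, fixed = TRUE)) {
    "Master."
  } else if (grepl("Mrs.", name, fixed = TRUE)) {
    "Mrs."
  } else if (grepl("Mr.", name, fixed = TRUE)) {
    "Mr."
  } else {
    "Other"
  }
}

print(extractTitle(name))
```

With regex matching, simply testing "Mrs." before "Mr." is enough; fixed = TRUE makes the result independent of the order.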
Gregory Laskaris (4 months ago)
Something's wrong with both of the data sets. I get NAs
Shishir Pandey (4 months ago)
You have shared an awesome stuff! The way you have explained was just exceptional!I wonder if you can share more use cases and different scenarios which will help much! Thanks again and super show ✓
Rohit Ramesh (4 months ago)
Please note that you will run into an error given below: data.combined <- rbind(train, test.survived) Error in match.names(clabs, names(xi)) : names do not match previous names So use test.survived <- data.frame(Survived = rep("None", nrow(test)), test[,]) instead of test.survived <- data.frame(survived = rep("None", nrow(test)), test[,])
Paulo Henrique (4 months ago)
Thank you for that!
Rohit Ramesh (4 months ago)
Use test.survived <- data.frame(Survived = rep("None", nrow(test)), test[,]) instead of test.survived <- data.frame(survived = rep("None", nrow(test)), test[,]) Basically, all the feature names (column names) are supposed to be starting with an uppercase letter. At the time of making this video the data set had it starting with lowercase but it got changed later.
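The column-name mismatch described above can be reproduced and checked with a few lines of base R. The `train` and `test` frames below are made-up stand-ins for the CSV files, so this is only a sketch of the mechanism.

```r
# Made-up stand-ins for train.csv and test.csv
train <- data.frame(Survived = c(0, 1), Pclass = c(3, 1),
                    Name = c("A", "B"), stringsAsFactors = FALSE)
test  <- data.frame(Pclass = c(2, 3),
                    Name = c("C", "D"), stringsAsFactors = FALSE)

# Lower-case "survived" does not match train's "Survived", so rbind() fails
bad <- data.frame(survived = rep("None", nrow(test)), test[, ],
                  stringsAsFactors = FALSE)
result <- tryCatch(rbind(train, bad),
                   error = function(e) conditionMessage(e))
print(result)

# Show which column names in train have no match in the other frame
print(setdiff(names(train), names(bad)))

# With the capitalised name the two frames combine
test.survived <- data.frame(Survived = rep("None", nrow(test)), test[, ],
                            stringsAsFactors = FALSE)
data.combined <- rbind(train, test.survived)
print(data.combined)
```

rbind() on data frames matches columns by name rather than by position, so comparing `names(train)` with `names(test.survived)` using setdiff() is a quick way to find the offending column.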
SheenSenseBulldozer (4 months ago)
How do I solve this issue?
john kevin (4 months ago)
Its very nice and good video... https://bit.ly/2FZKOMg
Andrew McLaughlin (4 months ago)
Nice video. Check this out - https://flow-analytics.com/blog/flow-analytics-crash-course-part-1
Brian Perkins (4 months ago)
I am learning R from SAS. They are very similar. Excellent explanation, great video!
James Weldon (5 months ago)
When I plot the table table(train$Survived, train$Title, train$Pclass) it shows no people in 3rd class with the title 'Other', as expected. However, when I plot the same data via ggplot (with facet_wrap(~Pclass)) it shows approximately 20 people in 3rd class with title 'Other'. This is true of a number of variables. It plots accurately when I don't include the facet_wrap. Is anybody else having this issue? Any idea what may be causing it?
Tayyab Clean (5 months ago)
priyanka (5 months ago)
Hi David, I was getting "StatBin requires a continuous x variable: the x variable is discrete. Perhaps you want stat="count"?" at 44:29, i.e. when plotting the graph, but using geom_bar resolved my problem.
kevin george (5 months ago)
its very nice and good... https://bit.ly/2FZKOMg
christian Martin (5 months ago)
May ask why did you remove the Excel and r video course?
Prajita Sawant (5 months ago)
Great tutorial ......Thank you......
Mario Ma (5 months ago)
Sorry, see all videos now, tx much !!!
Mario Ma (5 months ago)
Tx much David, do you have part 2 where we do model and process test ds? I undestand this is part of some learning program, right ? Would be nice to get cont. even with pay. Tx again. May 2018
Ken Tobin (5 months ago)
Thank you! Looking forward to applying R some day soon. Nice Job!
Nyamaa Bayarma (5 months ago)
Hi David, i'm just wondering, since randomforest algorithm already does feature selection, is it possible that you get almost the same accuracy while not doing data exploratory analysis?
Imran Naseem (5 months ago)
Hi David, I hope you are well. Why you have stopped posting new videos. You also have removed other useful videos. thanks, Imran
Danny waz (5 months ago)
I am having an issue with ggplot. It is not loading.
Titanic.r:
    #load ggplot2
    train$Pclass <- as.factor(train$Pclass)
    ggplot(train, aes(x=Pclass, fill=factor(Survived)) +
      geom_histogram(width=0.5)+
      xlab("pclass")+
      ylab("total count")+
      labs(fill= "Survival")
Console:
    ggplot(train, aes(x=Pclass, fill=factor(Survived)) +
    + geom_histogram(width=0.5)+
    + xlab("pclass")+
    + ylab("total count")+
    + labs(fill= "Survival")
    +
I am unable to view the plot.
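The console output in the comment above ends with a lone `+`, which is R's continuation prompt: the interpreter is still waiting for the expression to finish. Counting the brackets, the first line opens three parentheses but closes only two. A small base R sketch that checks whether a piece of code is a complete expression (the helper `is_complete` is a hypothetical name introduced here; nothing is plotted):

```r
# The call as typed in the comment (shortened) and with the missing ")" added
broken <- 'ggplot(train, aes(x = Pclass, fill = factor(Survived)) + geom_bar(width = 0.5)'
fixed  <- 'ggplot(train, aes(x = Pclass, fill = factor(Survived))) + geom_bar(width = 0.5)'

# Hypothetical helper: TRUE if the text parses as complete R code
is_complete <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

print(is_complete(broken))  # FALSE
print(is_complete(fixed))   # TRUE
```

Pressing Esc in the RStudio console cancels the pending input so the corrected call can be run.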
Technogeekscs Pune (5 months ago)
I really enjoyed this tutorial. Very good effort and excellent presentation. https://bit.ly/2pXSuIW
Ali Safa (6 months ago)
Thanks a lot for such a valuable video!
engrakas (6 months ago)
I am trying the code but an error is coming up: data.combined <- rbind(train, test.survived) Error in match.names(clabs, names(xi)) : names do not match previous names Will you please clarify that?
Steven Rhodes (6 months ago)
Works for Microsoft... hmm cool...Clicks the Mac link... NOOOOOOO!!!
Ryan Seitz (6 months ago)
As someone evaluating an MBA focused around data analytics, this video was a tremendous help. Thanks, David!
Tarik Jake (6 months ago)
Can you please share the URL of the correct data?? I have read the comments and I ve searched on github, but the data set is still not working. I am having issues with opening the data sets. The ones that I can open have the Passenger ID in first column so when I add column to create the test.Survived, it is creating to the first column. So my test and train have 12 columns but the names of the columns do not match...
Mooni Boo (6 months ago)
still like that video in 2018! very helpful and very "comforting" voice :D
asdfasdfadsfasdfasdfa (6 months ago)
Very useful video, thanks
Renato Gomes (6 months ago)
very very good!
Zoe Salinas (6 months ago)
quite informative!! getting back to this video as soon as I finish work
Charles Pilgrim (6 months ago)
This guy sounds like the fella in charge of the party in the Leeroy Jenkins video
Daryl Goh (6 months ago)
Hi, has anyone encountered this error?
Code:
    ggplot(train, aes(x = Pclass, fill = factor(Survived))) +
      stat_count(width = .5) +
      xlab("Pclass") +
      ylab("Total Count") +
      labs(fill = "Survived")
Error log:
    Warning messages:
    1: In grid.newpage() : this function not yet implemented
    2: In grDevices::recordGraphics(requireNamespace("ggplot2", qui... : this function not yet implemented
    3: In grobName(grob, prefix) : this function not yet implemented (4 times)
    4: In rectGrob(coords$xmin, coords$ymax, width = coords$xmax - ... : this function not yet implemented
    5: In unit(scale_details$x.major, "native") : this function not yet implemented
    6: In unit(scale_details$y.major, "native") : this function not yet implemented
    7: In unit(scale_details$y.minor, "native") : this function not yet implemented
    8: In grobTree(element_render(theme, "panel.background"), if(le... : this function not yet implemented
    9: In gTree(children = do.call("gList", panel)) : this function not yet implemented
    10: In gpar(fontsize = size, col = colour, fontfamily = family, ... : this function not yet implemented
    11: In gpar(fontsize = element$size, col = element$colour, fontf... : this function not yet implemented
    12: In unit(rep(xp, n), "npc") : this function not yet implemented
    13: In unit(rep(yp, n), "npc") : this function not yet implemented
    Error in titleGrob(label, x, y, hjust = hj, vjust = vj, angle = angle : could not find function "descentDetails"
Hailah AlArifi (7 months ago)
Hello, I work on a project about knowledge base constructed from text using NLP and IE. And I have some difficulties finding a data set and the process of how to work on it. If you have any information send me an email. [email protected] Thank you.
john bake (7 months ago)
how can you be doing predictive modelling and be IT, do you have a math degree?
ENTROPY (7 months ago)
Git Gud
Ravi Rajput (7 months ago)
Nice one; this will also add some info https://youtu.be/q2czpYm81dY
Amit Baderia (7 months ago)
Very good Explanation
AirborneLRRP (7 months ago)
You are a rock star. YES
Florian Wicher (7 months ago)
Thank you Dave, i'm learning so much :D
kablamopow (7 months ago)
Thanks for the video. Great Info
Sasha Films Reviews (7 months ago)
I can't get the str_detect function to work. I get this error:
    > library(stringr)
    Error in value[[3L]](cond) : Package ‘stringr’ version 1.2.0 cannot be unloaded:
    Error in unloadNamespace(package) : namespace ‘stringr’ is imported by ‘apa’, ‘ez’, ‘evaluate’, ‘reshape2’ so cannot be unloaded
    In addition: Warning message: package ‘stringr’ was built under R version 3.4.3
Then when I try to run the code to assign misses:
    > misses <- data.combined[which(str_detect(data.combined$Name, "Miss.")),]
    Error in str_detect(data.combined$Name, "Miss.") : could not find function "str_detect"
So bizarre. If it's a problem with the latest updates, is there an alternative to str_detect?
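As a possible alternative when stringr will not load, base R's grepl() returns the same kind of logical vector as str_detect(), so no extra package is needed. A minimal sketch with made-up sample rows:

```r
# Sample rows for illustration; not the full combined data set
data.combined <- data.frame(
  Name = c("Heikkinen, Miss. Laina", "Braund, Mr. Owen Harris",
           "Sandstrom, Miss. Marguerite Rut"),
  stringsAsFactors = FALSE
)

# grepl() gives TRUE/FALSE per row; fixed = TRUE matches "Miss." literally.
# The logical vector can index rows directly, so which() is optional.
is.miss <- grepl("Miss.", data.combined$Name, fixed = TRUE)
misses  <- data.combined[is.miss, , drop = FALSE]
print(misses)
```

Note the argument order differs: grepl(pattern, x) takes the pattern first, whereas str_detect(string, pattern) takes the string first.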
Dinesh Kumar (7 months ago)
work at microsoft and using mac. irony haha
Andrew Feist (7 months ago)
they let you use a macbook?
Patrick Logan (7 months ago)
Just wanted to add my $0.02 and say thank you as well. This really short circuits learning the application of R to analytics. Very much appreciate your time and contribution.
Ronen Slonim (7 months ago)
The geom_histogram plot doesn't work anymore; you must use the geom_bar function.
Petru Radu (8 months ago)
A VERY GOOD TUTORIAL for data analysis in R can be found at: www.picsag.com/2018/02/12/r-tutorial-data-analysis-project/
Alexander Radev (8 months ago)
Hi, thanks for the videos! I ran into an error with geom_histogram, which was solved by using geom_bar. I did the same as you did. How is it possible that I got an error and you didn't? Is it because I'm on a newer version, or have I just missed something?
Tyler Durden (8 months ago)
The tutorial is not really for beginners
Gaël Latouche (8 months ago)
Hello David and thank you for your video! I am a newbie in the big data world and i don't have any experience in R. Should i go across this video tutorial to understand or should i go through another video first? Really passionate by the topic but i need to catch up quickly with R for some exams stuff. Thanks for your help and advise.
Luke Ruffner Robinson (8 months ago)
If you're having issues combining data, make sure the S in "test.Survived" is capitalized. Press run and it should work then.
Martin Monsalvo (8 months ago)
Great video! Thank you so much.
dsv dsv (8 months ago)
1 hour 5 minutes in *probobly biologically impossible
Noe Lomeli (8 months ago)
Thank you for making this video! You are truly gifted at explaining R in such a simple and eloquent manner.
James O'Connor (8 months ago)
This was my first exposure to R, or any programming language for that matter, and was very easy to follow and enjoyable. Thank you!
Raghu Raj Rai (8 months ago)
Wow! That was great. I downloaded this offline and took a lot of time to study this. It was great. I like how you went about explaining things as we progressed. Thanx a lot!
Marli Richmond (8 months ago)
Incredibly helpful video, thank you David!
Hakuna Matata (9 months ago)
Great videos on the subject :) , I have a question will I be able to do Data Analysis on iPad ? I mean can I run python,R, SQL and Tableau efficiently on it or I will need a laptop?
Gurkaran Singh (9 months ago)
Great Video
