---
title: "STAT 19000 Project 7"
output:
  word_document: default
  pdf_document:
    includes:
      in_header: "box.tex"
  html_document:
    css: box.css
---

## Topics: xml, rvest, scraping data

**Motivation:** Many websites are loaded with data. However, the data online that we want to analyze is often not available to download in a convenient format. Knowing how to extract data from websites in a structured way is a valuable skill, and it will enable you to explore new sources of information.

**Context:** In this project, we will continue to use the principles we learned about XML files, with a focus on scraping and parsing web pages. To do so, we will explore and familiarize ourselves with `rvest`. `rvest` is an R package for web scraping, inspired by a Python library called `beautifulsoup`.

**Scope:** Understand how to scrape and parse web pages using the `rvest` package.

You can find useful examples that walk you through relevant material [here](https://datamine.purdue.edu/seminars/spring2020/stat19000project07examples.R) or on scholar: `/class/datamine/data/spring2020/stat19000project07examples.R`. It is highly recommended to read through these to help solve problems.

Use the template found [here](https://datamine.purdue.edu/seminars/spring2020/stat19000project07template.ipynb) or on scholar: `/class/datamine/data/spring2020/stat19000project07template.ipynb` to submit your solutions.

After each problem, we've provided you with a list of keywords. These keywords could be package names, functions, or important terms that will point you in the right direction when trying to solve the problem, and give you accurate terminology that you can look into further online. You are not required to use all of the given keywords. You will receive full points as long as your code gives the correct result.

Don't forget the very useful documentation shortcut `?`. To use it, simply type `?` in the console, followed by the name of the function you are interested in.
You can also look for package documentation by using `help(package=PACKAGENAME)`.

Sometimes it can be helpful to see the source code of a defined function. To do so, type the function's name without the `()`.

## Question 1:

Prior to beginning, please take the time to install and load `rvest`. To do so, simply add the following to the top of your notebook:

```{r eval=FALSE, include=TRUE}
install.packages("rvest")
library(rvest)
```

**1a.** *(1 pt)* There are two popular text editors: [emacs and vim](https://en.wikipedia.org/wiki/Editor_war). Your friend wants your help choosing one of them for an upcoming coding project. Dr. Ward told your friend to use emacs, but you are feeling brave and have decided to scrape a little data to see whether other people agree with him. Dr. Ward kindly provided you with the code below to scrape reviews for 'vim' from TrustRadius, a popular website where programmers voice their opinions about programming tools. Use his code to scrape the same information for emacs. You can find emacs reviews using the following link 'https://www.trustradius.com/products/gnu-emacs/reviews'. Make sure to print out the first two results just like we do below.
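
Before reading the scraping code for vim, it may help to see the same three calls (`html_nodes`, a relative XPath with `not`, and `html_text`) on a tiny page. The following is only a sketch: the HTML fragment, its class names, and the variable names are made up for illustration and are not taken from TrustRadius.

```r
library(rvest)

# A made-up page fragment (hypothetical markup, NOT copied from TrustRadius).
toy_html <- read_html('
<html><body>
  <div class="score"><span class="label">Score</span><span>8 out of 10</span></div>
  <div class="score"><span class="label">Score</span><span>9 out of 10</span></div>
  <div class="other"><span>ignore me</span></div>
</body></html>')

# Step 1: select every <div> whose class attribute is exactly "score".
score_divs <- html_nodes(toy_html, xpath = "//div[@class='score']")

# Step 2: inside those divs, keep only the <span> elements that have no class attribute.
# The leading "." makes the search relative to each selected div.
score_spans <- html_nodes(score_divs, xpath = ".//span[not(@class)]")

# Step 3: extract the text content as a character vector.
score_texts <- html_text(score_spans)
print(score_texts)
```

The `<span>` inside the `other` div is skipped because step 2 only searches within the divs selected in step 1.
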

```{r eval=TRUE, include=TRUE}
library(rvest)

# Download and parse the vim reviews page
vim_html <- read_html('https://www.trustradius.com/products/vim/reviews')

# Review texts
reviews_vim <- html_nodes(vim_html, xpath="//div[@class='review-questions']")
reviews_texts_vim <- html_text(html_nodes(reviews_vim, xpath=".//div[@class='not-edited question-text']/div[@class='ugc']"))
# print(reviews_texts_vim[1:2])

# Scores
scores_vim <- html_nodes(vim_html, xpath="//div[@class='trust-score__score']")
scores_texts_vim <- html_text(html_nodes(scores_vim, xpath=".//span[not(@class)]"))
# print(scores_texts_vim[1:2])

# "Likely to recommend" texts
likely_to_recommend_vim <- html_nodes(vim_html, xpath="//div[@class='review-questions']")
likely_to_recommend_vim_texts <- html_text(html_nodes(likely_to_recommend_vim, xpath="//div[@class='ugc followup']"))
# print(likely_to_recommend_vim_texts[1:2])
```

Make sure to run similar code for emacs as well (we want to scrape the data for both editors, not just vim).

**Keywords:** *html_node, html_nodes, html_text, not*

:::: {.bbox data-latex=""}
::: {data-latex=""}
**Item(s) to submit:**

- The code used for the problem.
- The output from printing the first 2 values of the resulting `reviews_texts_emacs` variable.
- The output from printing the first 2 values of the resulting `scores_texts_emacs` variable.
- The output from printing the first 2 values of the resulting `likely_to_recommend_emacs_texts` variable.
:::
::::

**1b.** *(2 pts)* For each of the following variables: `reviews_texts_vim`, `scores_texts_vim`, `likely_to_recommend_vim_texts`, `reviews_texts_emacs`, `scores_texts_emacs`, `likely_to_recommend_emacs_texts`, extract one relevant piece of information. For example, for `reviews_texts_vim`, `reviews_texts_emacs`, `likely_to_recommend_vim_texts`, and `likely_to_recommend_emacs_texts`, you could create a new variable that contains the number of characters in each review, the word count of each review, whether each review contains one or more of a certain word or set of words, etc.
Make sure that you extract the same piece of information for each matching pair of variables (i.e. `reviews_texts_vim`/`reviews_texts_emacs`, `scores_texts_vim`/`scores_texts_emacs`, etc.). Keep in mind that the end goal is to help your friend choose a text editor.

**Important note:** You should extract the first score from `scores_texts_emacs` and `scores_texts_vim` using `strsplit` and convert it to a number using `as.numeric`.

The following is an example of extracting the word count from the `reviews_texts_vim` and `reviews_texts_emacs` variables:

```{r eval=FALSE, include=TRUE}
# This is an example of extracting a relevant piece of information.
# Please note that the solution should have a total of 6 "new" sets of data.

# This is a possible solution for `reviews_vim`
count_words <- function(x) {length(strsplit(x, " ")[[1]])}
reviews_vim_word_count <- unname(sapply(reviews_texts_vim, count_words))
# print(reviews_vim_word_count)

# This is a possible matching solution for `reviews_emacs`
reviews_emacs_word_count <- unname(sapply(reviews_texts_emacs, count_words))
# print(reviews_emacs_word_count)

# A solution for `scores_texts_vim` should go under this

# A solution for `scores_texts_emacs` should go under this

# A solution for `likely_to_recommend_vim_texts` should go under this

# A solution for `likely_to_recommend_emacs_texts` should go under this
```

**Hint:** *Take a look at the examples, and think about how you would use each variable to make a recommendation.*

**Note:** *Keywords will vary depending on what you decide to extract from the variables.*

**Potential keywords:** *as.numeric, substr, grep*

:::: {.bbox data-latex=""}
::: {data-latex=""}
**Item(s) to submit:**

- The code used for the problem.
- The output from printing the "new" sets of data for each of: `reviews_texts_vim`, `scores_texts_vim`, `likely_to_recommend_vim_texts`, `reviews_texts_emacs`, `scores_texts_emacs`, and `likely_to_recommend_emacs_texts`.
:::
::::

**1c.** *(2 pts)* In this question, we will combine all of the data from (1a) and (1b) into a single `data.frame`. This makes it easy for us to thoroughly explore the data, plot the data, and use it in functions. The end result will resemble the following:

```{r eval=TRUE, include=TRUE}
df <- data.frame(editor=c('vim', 'vim', 'emacs', 'vim', 'emacs'),
                 reviews=c("this is the first review for vim",
                           "this is the second review for vim",
                           "this is the first review for emacs",
                           "this is the third review for vim",
                           "this is the second review for emacs"),
                 scores_texts=c("4 of 5", "5 of 5", "3 of 5", "4 of 5", "5 of 5"),
                 word_count=c(7, 7, 7, 7, 7))
print(df)
```

As you can see, the `editor` column indicates which editor the row's data is for: vim or emacs. Please note that the end result will have more columns; this is just an example.

To combine all of your data, first create two data.frames: one called `df_vim` and another called `df_emacs`. Each data.frame should have the `editor` column filled with the name of the editor. For example, `df_vim` should have the value "vim" in every row of the data.frame. Add the remaining variables from (1a) and (1b) to each respective data.frame as new columns, and make sure the column names match exactly.

```{r eval=TRUE, include=TRUE}
df_vim <- data.frame(editor="vim", review_word_count=c(1,2,3,4,5))
print(df_vim)

df_emacs <- data.frame(editor="emacs", review_word_count=c(5,4,4,4,3,2,1))
print(df_emacs)
```

Notice how each data.frame has the exact same column names? Continue this pattern and add the rest of your variables to each data.frame. Once completed, run the following to combine the data.frames:

```{r eval=TRUE, include=TRUE}
# Note that rbind relies on the column names in order to combine the data.
final_df <- rbind(df_vim, df_emacs)
print(final_df)

# What that means is that as long as you have all of the columns named exactly the same, rbind
# will properly combine your data!
# Column order does not matter: here the columns of df_vim are reversed before combining.
vim_reversed <- df_vim[, c(2,1)]
still_good_df <- rbind(vim_reversed, df_emacs)
print(still_good_df)
```

Great, now you have your data.frame, `final_df`, to use in Question 2!

**Keywords:** *data.frame*

:::: {.bbox data-latex=""}
::: {data-latex=""}
**Item(s) to submit:**

- The code used for the problem.
- The output from printing the `head` of the `final_df`. This output should have at least 7 columns: `editor`, `reviews_texts`, `scores_texts`, `likely_to_recommend_texts`, as well as the "new" sets of data from (1b), which, when combined, should result in 3 additional columns.
:::
::::

## Question 2:

We have now scraped, parsed, and organized data on vim and emacs into a single data.frame called `final_df`. Although it is not a lot of data, you can imagine a situation where a website contains thousands of entries. Let's do some data exploration and see if we can find any interesting relationships in our data.

**Hypothesis:** *(1 pt)* Propose a hypothesis for how you think two variables are related (if at all). Make the numeric score you extracted in (1b) one of the two variables in your hypothesis.

Let's say, for example, that you have the variable `word_count` that has the number of words in each review, and another variable called `character_count` that has the number of characters (letters/numbers/symbols) in each review. A good hypothesis would be: "I hypothesize that `word_count` and `character_count` are positively correlated." (meaning the more words a review has, the more characters it will have). This one is obvious, as typically the more words a review has, the more characters it will have.

Note that we could have hypothesized: "I hypothesize that `word_count` and `character_count` are negatively correlated." In our exploratory analysis step, however, we would discover that this is most likely not true. A wrong hypothesis is still valid.

**Exploratory analysis:** *(3 pts)* Poke through the information like we did in the examples.
Include short comments about what you are doing and why for each step. Make sure you explore your data in at least two different ways. The following are some examples of ways to explore your data:

- getting a correlation coefficient using `cor`
- plotting a scatterplot using `plot`
- plotting a lineplot using `plot(..., type='l')`
- making a barplot using `barplot`
- evaluating the median values of columns in our data using `sapply`
- breaking our data into groups and calculating statistics on those groups using `tapply`
- etc.

You are by no means limited to just these methods. You may do anything you'd like; just make sure to mention *what* and *why* in your comments.

**Final recommendation:** *(1 pt)* Write down your final recommendation (vim or emacs), 1-2 sentences about whether or not you noticed anything interesting in your exploratory analysis, and any other comment you may have. Don't worry, your conclusion will not affect your grade. We have a very small sample here, and the important thing is to learn something new and understand how scraping data is a useful tool to have in your toolbox! Have fun and be creative!

**Hint:** *Take a look at the examples for some ideas.*

:::: {.bbox data-latex=""}
::: {data-latex=""}
**Item(s) to submit:**

- A markdown cell with your hypothesis.
- The code used for the exploratory analysis where you explored the relationship between the numeric score from (1b) and another variable. Each line of code should have a comment explaining what is going on.
- A markdown cell with your final recommendation (vim or emacs), 1-2 sentences about whether or not you noticed anything interesting in your exploratory analysis, and any other comment you may have.
:::
::::

## Project Submission:

Submit your solutions for the project at this URL: https://classroom.github.com/a/PCJeRdng using the instructions found in the GitHub Classroom instructions folder on Blackboard.

**Important note:** Make sure you submit your solutions in both .ipynb and .pdf formats. We've updated [our instructions](https://datamine.purdue.edu/seminars/spring2020/jupyter.pdf) to include multiple ways to convert your .ipynb file to a .pdf on scholar. You can find a copy of the instructions on scholar as well: `/class/datamine/data/spring2020/jupyter.pdf`. If for some reason the script does not work, just submit the .ipynb. Make sure you have your output calculated and displayed inside your notebook prior to submission.