dplyr
One of the core packages of the tidyverse in the R programming language, dplyr is primarily a set of functions designed to enable dataframe manipulation in an intuitive, user-friendly way. Data analysts typically use dplyr to transform existing datasets into a format better suited to a particular type of analysis or data visualization.[1][2]
Original author(s) | Hadley Wickham |
---|---|
Initial release | January 7, 2014 |
Stable release | 1.0.0 / June 1, 2020 |
Written in | R |
License | GPLv2 |
Website | dplyr.tidyverse.org |
For instance, someone analyzing a very large dataset may wish to view only a smaller subset of the data. Alternatively, a user may wish to rearrange the data so that the rows are ranked by some numerical value, or by a combination of values from the original dataset.
Authored primarily by Hadley Wickham, dplyr was launched in 2014.[3] On the dplyr web page, the package is described as "a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges."[4]
The five core verbs
While dplyr actually includes several dozen functions that enable various forms of data manipulation, the package features five primary verbs:[5]
filter(), which is used to extract rows from a dataframe, based on conditions specified by a user;
select(), which is used to subset a dataframe by its columns;
arrange(), which is used to sort rows in a dataframe based on the values in one or more columns;
mutate(), which is used to create new variables, by altering and/or combining values from existing columns; and
summarize(), also spelled summarise(), which is used to collapse values from a dataframe into a single summary.
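The five verbs above can be sketched on a small data frame. This is a minimal illustration, not an excerpt from the package documentation: the `sales` data frame, its column names, and its values are all invented for the example.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
sales <- data.frame(
  region = c("North", "South", "North", "East", "South"),
  units  = c(10, 4, 7, 12, 9),
  price  = c(2.5, 3.0, 2.5, 4.0, 3.0)
)

# filter(): keep only the rows that satisfy a condition
big_orders <- filter(sales, units > 8)

# select(): keep only the named columns
region_units <- select(sales, region, units)

# arrange(): sort rows, here by units in descending order
by_units <- arrange(sales, desc(units))

# mutate(): add a new column computed from existing columns
with_revenue <- mutate(sales, revenue = units * price)

# summarize(): collapse all rows into a single summary row
totals <- summarize(with_revenue, total_revenue = sum(revenue))

print(totals)
```

Each verb takes a dataframe as its first argument and returns a new dataframe, so `sales` itself is left unchanged by every call.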
Additional functions
In addition to its five main verbs, dplyr also includes several other functions that enable exploration and manipulation of dataframes. Included among these are:
count(), which is used to count the number of observations for each unique value (or combination of values) of one or more variables;
rename(), which enables a user to alter the column names for variables, often to improve ease of use and intuitive understanding of a dataset;
slice_max(), which returns a data subset that contains the rows with the highest values of some particular variable; and
slice_min(), which returns a data subset that contains the rows with the lowest values of some particular variable.
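A short sketch of these four functions follows. The `pets` data frame and its columns are hypothetical and exist only to make the example self-contained; `slice_max()` and `slice_min()` assume dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
pets <- data.frame(
  species = c("cat", "dog", "cat", "bird", "dog", "cat"),
  wt      = c(4.1, 20.5, 3.8, 0.3, 12.0, 5.2)
)

# count(): number of rows for each unique value of species
species_counts <- count(pets, species)

# rename(): the syntax is new_name = old_name
pets_renamed <- rename(pets, weight_kg = wt)

# slice_max(): the n rows with the largest values of a variable
heaviest <- slice_max(pets_renamed, weight_kg, n = 2)

# slice_min(): the n rows with the smallest values of a variable
lightest <- slice_min(pets_renamed, weight_kg, n = 1)

print(species_counts)
print(heaviest)
print(lightest)
```

Note that `count()` stores its result in a column named `n` by default.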
Built-in datasets
The dplyr package comes with five datasets: band_instruments, band_instruments2, band_members, starwars, and storms.
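As a rough sketch, the bundled datasets can be listed and inspected once the package is attached. The exact list returned depends on the installed dplyr version, so it may differ from the five names given above.

```r
library(dplyr)

# List the names of the datasets shipped with the installed dplyr version
available <- data(package = "dplyr")$results[, "Item"]
print(available)

# The datasets are lazily loaded, so they can be referred to by name
print(band_members)

# glimpse() gives a compact, column-wise overview of a wider dataset
glimpse(starwars)
```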
References
- Yadav, Rohit (2019-10-29). "Python's Pandas vs R's Tidyverse: Who Comes Out On Top?". Analytics India Magazine. Retrieved 2021-02-06.
- Krill, Paul (2015-06-30). "Why R? The pros and cons of the R language". InfoWorld. Retrieved 2021-02-06.
- "Introducing dplyr". blog.rstudio.com. Retrieved 2020-09-02.
- "Function reference". dplyr.tidyverse.org. Retrieved 2021-02-06.
- Grolemund, Garrett; Wickham, Hadley. 5 Data transformation | R for Data Science.