R is an open-source programming language and environment for statistical computing and graphics. Its large collection of packages can be used for a wide range of tasks such as data mining, time series analysis, machine learning, multivariate statistical analysis, analysis of spatial data and graphical plotting.
Origin of R
R is an alternative implementation of the statistical programming language S. S-PLUS was developed after S as its commercial version. Ross Ihaka and Robert Gentleman began developing R later, in 1991. Though R is independent of S-PLUS, much of the code written for S-PLUS works in R without any alteration. R was released as open-source software under the GNU General Public License in 1995.
Fundamental operations and concepts
Here, we briefly explain some basic yet essential concepts and functions that a beginner in R programming should know. Each of the following sub-topics is demonstrated with a code snippet run in the R Console (RGui (32-bit)).
help.start(): opens R’s official documentation for general help on available functionalities.
?sum: opens documentation for the sum() function
Note: If no function with the given name exists, a message is displayed on the console informing the user that there is no documentation for it in the specified packages and libraries. E.g. help("add") reports that there is no documentation for "add" and suggests trying ??add.
??sum: searches the help system to find instances of the string “sum”
apropos("sum", mode="function"): lists all the available functions with the string "sum" present in their name
data(): lists all the example datasets available in the currently loaded packages. (A new window named ‘R data sets’ gets opened in the console in which the output appears)
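As a small sketch of the lookup utilities above, the following uses apropos() in a way that also works in a non-interactive session. The calls that open a browser or a help window are left commented out, and the variable names are only illustrative.

```r
# help.start()   # opens the HTML documentation in a browser (interactive use)
# ?sum           # opens the help page for sum()

# Names of all functions on the search path that contain "sum"
sum_functions <- apropos("sum", mode = "function")
print(head(sum_functions))

# The first argument is a regular expression, so "^sum" keeps only
# names that start with "sum" (matching ignores case by default)
starts_with_sum <- apropos("^sum", mode = "function")
print(starts_with_sum)
```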
Some general purpose functions
getwd(): returns the current working directory
setwd(PATH): sets the specified path as the current working directory (changes done can be verified using getwd())
ls(): lists the objects in the current workspace.
rm(objects): removes the object(s) specified as parameters from the current workspace
For example, after creating objects x, y and z, ls() displays the objects’ names. Executing rm(x,y) removes objects x and y, so calling ls() again gives only "z" as output.
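A minimal sketch of this sequence, assuming it is run at the top level of a fresh session (the values assigned to x, y and z are arbitrary):

```r
# Create three objects in the current workspace
x <- 1
y <- "two"
z <- TRUE

print(ls())   # includes "x", "y" and "z"

# Remove x and y, then list the workspace again
rm(x, y)
print(ls())   # "z" remains; "x" and "y" are gone
```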
history(num): opens a new window named ‘R History’, which shows the last ‘num’ executed commands. If no argument is specified, the last 25 commands are displayed by default.
savehistory("fname") saves the command history in a file named ‘fname’, which can be loaded back into the current session using the loadhistory("fname") command.
save.image("my_workspace") saves the current workspace to a file named ‘my_workspace’, which can later be loaded using the load("my_workspace") command.
q(): quits R; a dialog box first asks whether you want to save the current workspace.
Packages in R
A package in R is a collection of data, functions and compiled code in a well-defined format. Installed packages are stored in a directory called a library.
.libPaths() command shows the path location where the library is located
library() command displays the list of all the packages saved in the library.
- Package installation:
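install.packages() command installs a package from CRAN, e.g. install.packages("package_name"); when no CRAN mirror has been set, it first displays a list of CRAN mirror websites to choose from.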
update.packages() can be used to get the changes/updates made to each package in the library
installed.packages() displays the list of all the installed packages along with some additional information such as version number, dependencies etc.
- A particular package can be loaded in the current session using library(package_name).
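The package functions above can be tried out with the short sketch below. It attaches ‘tools’ only because that package ships with R and is not attached by default; the install.packages() call is commented out since it needs internet access, and "package_name" is a placeholder rather than a real package.

```r
# Directories (libraries) in which R looks for installed packages
print(.libPaths())

# installed.packages() returns a character matrix with one row per package
installed <- installed.packages()
print(head(installed[, c("Package", "Version")]))

# Attach a package for the current session
library(tools)
print("package:tools" %in% search())

# install.packages("package_name")   # placeholder name; downloads from CRAN
```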
Objects in R
An object refers to anything that can be assigned to a variable. Each object has two attributes:
- length: number of elements in the object
- mode: denotes type of the object’s data (numeric, character, complex or logical)
Note: The ‘numeric’ data type in R by default means a decimal (double-precision) value and not an integer. E.g. if we assign x=10 and then check is.integer(x), it will return FALSE. It can be converted to integer type using as.integer(), e.g. x <- as.integer(x).
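A short sketch of the two attributes and of the numeric-versus-integer distinction described above (the variable names are arbitrary):

```r
x <- 10
print(mode(x))         # "numeric"
print(length(x))       # 1
print(is.integer(x))   # FALSE: 10 is stored as a double

y <- as.integer(x)
print(is.integer(y))   # TRUE

z <- 10L               # the L suffix creates an integer directly
print(is.integer(z))   # TRUE
```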
There are six types of R objects as follows:
- Vector: a 1D array which is a collection of fixed-sized cells having the same type of data.
Ways to create a vector:
- Use the colon operator
e.g. vector1 <- 1:10 (has elements from 1 to 10)
- Use ‘seq’ to create a vector of sequence
e.g. seq(from=1,to=10, by=2) (chooses elements from 1 to 10 in steps of 2, i.e. 1, 3, 5, 7, 9)
- Use ‘rep’ to create a vector by repeating an element or another vector
- Use the c() function, where ‘c’ stands for ‘combine’
e.g. vector1 <- c(1,2,3,4,5)
Element(s) of a vector can be accessed using indexing, with indices starting at 1, e.g. vector1[2] gives the second element.
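The following sketch puts the four ways of creating a vector side by side and then shows a few common indexing forms. The names odd_values, repeated and combined are illustrative only.

```r
vector1    <- 1:10
odd_values <- seq(from = 1, to = 10, by = 2)   # 1 3 5 7 9
repeated   <- rep(c(1, 2), times = 3)          # 1 2 1 2 1 2
combined   <- c(1, 2, 3, 4, 5)

print(vector1[3])             # third element: 3
print(vector1[c(1, 10)])      # first and last elements: 1 10
print(vector1[2:4])           # elements 2 to 4: 2 3 4
print(vector1[-1])            # everything except the first element
print(vector1[vector1 > 7])   # logical indexing: 8 9 10
```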
- Matrix: It is a 2D vector with fixed-sized cells having the same type of data.
A matrix is created with the matrix() function, where the nrow and ncol arguments denote the number of rows and columns respectively; byrow=TRUE means the matrix will be filled row-by-row (the default is column-by-column).
Ways to access element(s) of a matrix:
- M[n] : nth element of matrix M (counting occurs column-wise, with n=1 denoting the first element)
- M[n,] : nth row of matrix M (n=1 denotes first row)
- M[,n] : nth column of matrix M (n=1 denotes first column)
- M[x,y] : element at xth row and yth column
- M[,c(x,y)] : extract xth and yth columns at a time
- M[c(x,y),] : extract xth and yth rows at a time
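As an illustration of the creation and access patterns listed above, here is a minimal sketch using a 2 x 3 matrix filled row-by-row (the data 1:6 is arbitrary):

```r
M <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
print(M)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6

print(M[2])            # 4: single-index counting runs down the columns
print(M[1, ])          # first row: 1 2 3
print(M[, 2])          # second column: 2 5
print(M[2, 3])         # element at row 2, column 3: 6
print(M[, c(1, 3)])    # columns 1 and 3 together
print(M[c(1, 2), ])    # rows 1 and 2 together
```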
- Array : It is an array with one or more dimensions. So a 1D array and a 2D array are (almost) the same as a vector and a matrix respectively. An array with 3 or more dimensions is said to be a multidimensional array.
- List : It is a collection of elements which can be of different data types. Also, the size of a list can be expanded on the fly.
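A brief sketch of both object types, assuming a 2 x 3 x 4 array and a small list whose element names (name, scores, passed, comment) are made up for illustration:

```r
# A 3D array with dimensions 2 x 3 x 4, filled column-wise with 1..24
A <- array(1:24, dim = c(2, 3, 4))
print(dim(A))        # 2 3 4
print(A[1, 2, 3])    # 15

# A list can hold elements of different types
record <- list(name = "sample", scores = c(90, 85), passed = TRUE)
print(length(record))   # 3

# Lists grow on the fly: assigning to a new name appends an element
record$comment <- "added later"
print(length(record))   # 4
print(record[["scores"]])
```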
- Factor : A factor in R is a data object which deals with categorical variables (i.e. those having some fixed possible values, e.g. ‘gender’ and ‘months’ variable). Each factor has a levels attribute that denotes the permitted values of the variable. The usefulness of a factor can be understood from the following short example.
e.g. Suppose there is a character vector x1 having some of the months’ names as its elements. We create a factor y1 with the data of x1 and initialize the ‘levels’ attribute with a vector named ‘months’ which contains the names of all the months in a year. If we simply sort x1, it will be sorted in alphabetical order, but if the factor y1 with well-defined levels is sorted, we get x1’s elements sorted in the order in which those months occur in a year.
Now suppose there is a value in the vector which does not match any of the values in ‘levels’; it will be converted to NA in the factor, and the wrong element will be missing from the output if the factor is sorted.
If we do not define the levels explicitly, they will be taken as the data’s unique values sorted in alphabetical order.
The levels of a factor can be obtained using the levels() function by passing the factor’s name as its argument.
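The month example can be sketched as follows. It assumes abbreviated month names and takes them from R's built-in month.abb constant; the particular months placed in x1 and the misspelt "Jam" in x2 are illustrative choices.

```r
months <- month.abb   # built-in constant: "Jan", "Feb", ..., "Dec"

x1 <- c("Dec", "Apr", "Jan", "Mar")
print(sort(x1))       # alphabetical order: "Apr" "Dec" "Jan" "Mar"

y1 <- factor(x1, levels = months)
print(sort(y1))       # calendar order: Jan Mar Apr Dec

x2 <- c("Dec", "Apr", "Jam", "Mar")   # "Jam" is a deliberate typo
y2 <- factor(x2, levels = months)
print(y2)             # the unmatched value appears as <NA>
print(sort(y2))       # Mar Apr Dec: the NA is dropped when sorting

y3 <- factor(x1)      # no explicit levels
print(levels(y3))     # "Apr" "Dec" "Jan" "Mar"
```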
- Data frame : A data frame in R refers to a data table in which the columns can be of different types but each particular column holds the same type of data.
Some inbuilt datasets such as the Iris Flower dataset can be loaded by loading the ‘datasets’ package and then loading the dataset using data(), e.g. data(iris), after which iris is available as a data frame.
We can also create a custom data frame with data.frame(), passing each column as a named vector of equal length.
The number of rows and columns of a data frame can be obtained using the nrow() and ncol() functions respectively.
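A minimal sketch of both cases: loading the built-in iris dataset and building a custom data frame. The students table and its column names are made up for illustration.

```r
library(datasets)   # attached by default in a standard R session
data(iris)
print(head(iris, 3))
print(nrow(iris))   # 150
print(ncol(iris))   # 5

# A custom data frame: each column is a vector of a single type
students <- data.frame(
  name   = c("A", "B", "C"),
  age    = c(21, 23, 22),
  passed = c(TRUE, FALSE, TRUE)
)
str(students)             # compact view of the column types
print(nrow(students))     # 3
print(ncol(students))     # 3
```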
We have covered some fundamental R concepts and functions that an R programmer needs to know in order to leverage R’s functionalities. However, there are numerous other details of the topics covered in this article, as well as several other concepts of the language that an R programmer needs to deal with. For an in-depth understanding of such topics, refer to the following sources: