R Data Frame Basics
An R data frame is a tabular data structure very similar to an Excel spreadsheet, or a table in a relational database. Actually, the name "Data Frame" is a bit confusing if you ask me. I would have preferred calling it a "Data Table". This R data frame tutorial will explain the basics of working with R data frames, such as getting the number of rows, columns, performing simple functions on R data frames etc.
Loading Data Frames
As you may already have seen in the tutorial about loading data in R, R makes it easy to load data into memory so you can work with it. In fact, when you load a CSV file into memory, the data in the CSV file is loaded into an R data frame.
For the remaining part of this R data frame tutorial I will assume that you have loaded a data frame into memory using the following command (adjusting the file path to point to where the file is stored on your computer):
data <- read.table(file="data.csv", header=TRUE, sep=";", stringsAsFactors=TRUE)
This command loads the CSV file into an R data frame and assigns the data frame to the variable named data. Using the data variable you can now access the loaded data frame. The stringsAsFactors=TRUE argument makes R read the text column as a factor, which was the default before R 4.0; the summary() output shown later in this tutorial relies on it.
The file should contain the following data:
name;age;salary
John;35;99000
Joe;42;120000
Cindy;55;150000
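If you do not have such a file at hand, the following sketch creates one and loads it. The temporary file location and the variable name csv_path are assumptions made for illustration; the rest of this tutorial assumes a file named data.csv.

```r
# Write the sample data to a temporary file (hypothetical location).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name;age;salary",
             "John;35;99000",
             "Joe;42;120000",
             "Cindy;55;150000"), csv_path)

# Load the file with read.table(). stringsAsFactors = TRUE reads the
# name column as a factor, which was the default before R 4.0.
data <- read.table(file = csv_path, header = TRUE, sep = ";",
                   stringsAsFactors = TRUE)
print(data)
```

Writing to a temporary file keeps the example from overwriting any real data.csv you may already have.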
summary()
The summary() data frame function prints a summary of a given R data frame. You refer to the data frame using its variable name. If you want to print a summary of the data frame loaded using the R command shown earlier in this tutorial, write this into the R console in RStudio:
summary(data)
When executed inside RStudio on the data frame with the data shown earlier, you will get the following output:
    name        age           salary
 Cindy:1   Min.   :35.0   Min.   : 99000
 Joe  :1   1st Qu.:38.5   1st Qu.:109500
 John :1   Median :42.0   Median :120000
           Mean   :44.0   Mean   :123000
           3rd Qu.:48.5   3rd Qu.:135000
           Max.   :55.0   Max.   :150000
As you can see, the summary() function gives you several kinds of information about the data frame. I will briefly explain what that information is.
The left-most column printed corresponds to the first column of the data frame, the column called name. It shows each distinct value of the name column, along with a counter telling how many times each value occurred. Since the name column from the data example shown earlier contains only 3 names, all distinct from each other, the counter after each name shows 1. If a name had occurred more than once in the name column, its counter would have shown how many times it occurred. This is how summary() treats a factor column; if name is read as plain character strings, which is the read.table() default from R 4.0 onward, summary() instead shows only the column's length, class and mode.
The second column printed shows a summary of the data from the column called age. The printed column shows the minimum, mean and maximum values found in the age column, along with the median, 1st quartile and 3rd quartile values (statistical values).
The reason the second column printed contains different information than the first is that the columns in the R data frame from which they show a summary contain different types of data. The name column contains textual data. The age column contains numerical data. It doesn't make sense to calculate the mean value of names.
The third printed column contains the same type of summary data as the second printed column. This is because the third column in the data frame, salary, contains numerical data too.
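To make the link between column type and summary concrete, here is a minimal sketch that reproduces the summary values by hand with base R functions. It rebuilds the sample data in code rather than loading the CSV file, which is an assumption made to keep the example self-contained; the names name_counts and age_stats are likewise only illustrative.

```r
# Rebuild the sample data in code instead of loading data.csv (assumption).
data <- data.frame(
  name   = factor(c("John", "Joe", "Cindy")),
  age    = c(35, 42, 55),
  salary = c(99000, 120000, 150000)
)

# A factor column is summarized as a count per distinct value.
name_counts <- table(data$name)
print(name_counts)

# A numeric column is summarized with minimum, quartiles, mean and maximum.
age_stats <- c(min(data$age),
               quantile(data$age, 0.25),
               median(data$age),
               mean(data$age),
               quantile(data$age, 0.75),
               max(data$age))
print(unname(age_stats))
```

The six numbers printed for age should match the age column of the summary() output, in the same order from Min. to Max.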
dim()
The dim() function returns the dimensions of a data structure in R. When called with the name of an R data frame as parameter, dim() returns the number of rows and columns (in that order) of the data frame.
For instance, if you have the data set shown earlier loaded into memory and assigned to a variable data, you could obtain the dimensions of the loaded data frame with this command:
dim(data)
The output from this command would be:
[1] 3 3
The first 3 is the number of rows in the data frame, and the last 3 is the number of columns. The data set shown earlier in this R data frame tutorial contains 3 rows and 3 columns, remember?
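If you only need one of the two numbers, base R also provides nrow() and ncol(). The sketch below rebuilds the sample data in code (an assumption, so the example runs without the CSV file) and shows all three calls side by side; the variable names dims, rows and cols are only illustrative.

```r
# Rebuild the sample data in code instead of loading data.csv (assumption).
data <- data.frame(
  name   = c("John", "Joe", "Cindy"),
  age    = c(35, 42, 55),
  salary = c(99000, 120000, 150000)
)

dims <- dim(data)   # number of rows, then number of columns
rows <- nrow(data)  # number of rows only
cols <- ncol(data)  # number of columns only

print(dims)
print(rows)
print(cols)
```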
Referencing Data Frame Columns
You can reference a column of an R data frame via the column name. If the data was loaded from a CSV file, the column name is the name given to that column in the first line (the header line) of the CSV file. If you look at the data set shown earlier in this tutorial, you will see that the CSV file contains three columns. The first line of the data set specifies the column names, which are name, age and salary.
To reference one of the name, age or salary columns you write the name of the variable pointing to the data frame, a dollar sign ($) and then the name of the column to reference. Here is how referencing the age column looks:
data$age
If you type the above R command into the R console in RStudio and press enter, R will print out the values of the age column to the console. Here is how the output looks:
[1] 35 42 55
The three values 35, 42 and 55 are the values listed in the age column of the data frame.
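The dollar sign is not the only way to reach a column. As a hedged sketch, the example below rebuilds the sample data in code (an assumption, so it runs without the CSV file) and compares the dollar-sign form with two bracket forms from base R. The bracket forms are useful when the column name is stored in a variable; col_name here is a hypothetical variable introduced for illustration.

```r
# Rebuild the sample data in code instead of loading data.csv (assumption).
data <- data.frame(
  name   = c("John", "Joe", "Cindy"),
  age    = c(35, 42, 55),
  salary = c(99000, 120000, 150000)
)

by_dollar <- data$age              # column name written literally
col_name  <- "age"                 # hypothetical variable holding the name
by_double <- data[[col_name]]      # double brackets accept a string
by_matrix <- data[, col_name]      # [row, column] indexing, all rows

print(by_dollar)
print(identical(by_dollar, by_double))
print(identical(by_dollar, by_matrix))
```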
Simply printing out the values of a column is not always that useful. However, you can pass a data frame column to some R functions and thus have the given R function perform its calculation on the values in that column. For instance, here is how you can calculate the mean age of the data set shown earlier:
mean(data$age)
The output from this R command would be:
[1] 44
The mean age of the persons listed in the data set is 44.
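The same pattern works with other base R functions. The sketch below rebuilds the sample data in code (an assumption, so it runs without the CSV file), applies a few functions to single columns, and then uses sapply() to compute the mean of every numeric column at once. The variable names are only illustrative.

```r
# Rebuild the sample data in code instead of loading data.csv (assumption).
data <- data.frame(
  name   = c("John", "Joe", "Cindy"),
  age    = c(35, 42, 55),
  salary = c(99000, 120000, 150000)
)

mean_age   <- mean(data$age)       # average age
max_salary <- max(data$salary)     # highest salary
median_age <- median(data$age)     # middle age value

# Find the numeric columns, then take the mean of each of them.
numeric_cols <- sapply(data, is.numeric)
col_means    <- sapply(data[numeric_cols], mean)

print(mean_age)
print(max_salary)
print(median_age)
print(col_means)
```

Filtering on is.numeric first avoids calling mean() on the name column, for which a mean is not meaningful.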