Data Manipulation
The easiest way to import and export data in R is with dataframes, using read.table() and write.table(). The user has control over the separator between columns (the most common are the comma (",") and the tab ("\t")), the presence or absence of a header row, and whether rows have names and, if so, where those names are found. By default, missing data is output as NA. These commands have many options with reasonable defaults, so most formats can be easily accommodated. The following simple example works on a small sample file you can download, and is easily extended to import most data.
After you have moved the datafile to your home directory, start R, and then run this simple command:
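A minimal sketch of such a command follows. Because the sample file itself is not reproduced here, the sketch first writes a small stand-in file to a temporary location; the column names (clip, temp, humidity, wind), the values and the variable names obs, path and dat are all hypothetical.

```r
# Hypothetical stand-in for the downloaded sample file: six timestamped rows.
obs <- data.frame(
  clip     = c(0, 2, 0, 1, 3, 0),
  temp     = c(11.5, 12.1, 12.8, 13.0, 12.4, 11.9),
  humidity = c(81, 79, NA, 74, 76, 80),   # one missing value
  wind     = c(3.2, 4.1, 2.7, 5.0, 4.4, 3.8),
  row.names = paste("2020-01-01", sprintf("%02d:00", 0:5))
)

# Export as tab-delimited text. col.names = NA writes a blank header cell
# above the row names; the missing value is written as NA by default.
path <- tempfile(fileext = ".txt")
write.table(obs, file = path, sep = "\t", quote = FALSE, col.names = NA)

# Import: header row present, row labels in column 1, tab-separated fields.
dat <- read.table(path, header = TRUE, row.names = 1, sep = "\t")
str(dat)
```

With a real file, only the read.table() call is needed, with path replaced by the name of the downloaded file.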
The arguments to read.table() tell the interpreter that the data file has a header row identifying each column (header=TRUE), that the labels for each row are in column 1, comprising the date and time (row.names=1), and that the data in the file are tab-delimited (sep="\t").
Quoting from the R Manual, "statistical systems like R are not particularly well suited to manipulations of large scale data". In some cases it may be preferable to perform manipulations in other frameworks, such as relational databases, and pass the results to R for further analysis. In addition, all data being manipulated must reside in memory, so running out of memory can be a problem in R. However, a number of simple manipulations can come in handy, and most users will not run into these problems in their own use cases.
Dataframes can be reduced to a subset of rows and columns using subscripts. For a dataframe, the first subscript describes which rows are selected and the second, separated from it by a comma, describes which columns are selected; leaving a subscript empty selects all rows or all columns. Each subscript is either a single object (an index, or a sequence generated with the colon operator, :) or a vector. To select certain contiguous columns for all rows use the following, which will produce a new dataframe with the id for each row and the first two columns:
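As a sketch of this kind of column selection, assuming a hypothetical dataframe obs (its column names and values are invented for illustration and stand in for the imported sample data):

```r
# Hypothetical data frame standing in for the imported sample data.
obs <- data.frame(
  clip     = c(0, 2, 0, 1, 3, 0),
  temp     = c(11.5, 12.1, 12.8, 13.0, 12.4, 11.9),
  humidity = c(81, 79, 77, 74, 76, 80),
  wind     = c(3.2, 4.1, 2.7, 5.0, 4.4, 3.8),
  row.names = paste("2020-01-01", sprintf("%02d:00", 0:5))
)

# Empty row subscript = all rows; 1:2 = the first two columns.
first_two <- obs[, 1:2]
print(first_two)
```

The row names travel with the selection, so each row of first_two is still identified by its date and time.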
Note that if a subscript leaves a dataframe with only one column (or a matrix with only one row or column), the result will be a vector rather than a dataframe or matrix. You may not want this, and can avoid it by adding drop=FALSE as a further argument after the second subscript, before the closing square bracket.
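The difference can be seen in a short sketch, again using a hypothetical dataframe obs with invented column names and values:

```r
# Hypothetical data frame standing in for the imported sample data.
obs <- data.frame(
  clip = c(0, 2, 0, 1, 3, 0),
  temp = c(11.5, 12.1, 12.8, 13.0, 12.4, 11.9),
  row.names = paste("2020-01-01", sprintf("%02d:00", 0:5))
)

as_vector <- obs[, 1]                 # collapses to a plain numeric vector
as_frame  <- obs[, 1, drop = FALSE]   # stays a one-column data frame

class(as_vector)
class(as_frame)
```

Keeping the dataframe structure also keeps the row names, which the plain vector loses.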
Here we select rows 1, 4, 5 and 6, and columns 1, 3 and 4, using the combine function c() to produce each vector of indices:
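A sketch of that selection, assuming a hypothetical dataframe obs with at least six rows and four columns (names and values invented for illustration):

```r
# Hypothetical data frame standing in for the imported sample data.
obs <- data.frame(
  clip     = c(0, 2, 0, 1, 3, 0),
  temp     = c(11.5, 12.1, 12.8, 13.0, 12.4, 11.9),
  humidity = c(81, 79, 77, 74, 76, 80),
  wind     = c(3.2, 4.1, 2.7, 5.0, 4.4, 3.8),
  row.names = paste("2020-01-01", sprintf("%02d:00", 0:5))
)

# Rows 1, 4, 5, 6 and columns 1, 3, 4, each given as a vector built with c().
picked <- obs[c(1, 4, 5, 6), c(1, 3, 4)]
print(picked)
```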
Filters can be used to obtain only those rows that meet certain criteria. Before using filters, you can attach the names of the dataframe using attach() if you want to avoid having to reference columns using the dataframe$column syntax, but this is often not advisable, as it can cause naming confusion or collisions, particularly when you are using more than one dataframe. Here we want all columns for those rows where the value in the "clip" column is greater than zero:
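A sketch of such a filter, using the $ syntax rather than attach() and assuming a hypothetical dataframe obs whose values are invented for illustration:

```r
# Hypothetical data frame standing in for the imported sample data.
obs <- data.frame(
  clip     = c(0, 2, 0, 1, 3, 0),
  temp     = c(11.5, 12.1, 12.8, 13.0, 12.4, 11.9),
  humidity = c(81, 79, 77, 74, 76, 80),
  wind     = c(3.2, 4.1, 2.7, 5.0, 4.4, 3.8),
  row.names = paste("2020-01-01", sprintf("%02d:00", 0:5))
)

# obs$clip > 0 is a logical vector with one entry per row;
# rows where it is TRUE are kept, and the empty column subscript keeps all columns.
positive <- obs[obs$clip > 0, ]
print(positive)
```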
We could also just request columns 2 to 4 (i.e., excluding the "clip" column altogether, although it is still used in the filter):
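A sketch combining the filter with a column selection, assuming a hypothetical dataframe obs in which "clip" is the first column (names and values invented for illustration):

```r
# Hypothetical data frame standing in for the imported sample data.
obs <- data.frame(
  clip     = c(0, 2, 0, 1, 3, 0),
  temp     = c(11.5, 12.1, 12.8, 13.0, 12.4, 11.9),
  humidity = c(81, 79, 77, 74, 76, 80),
  wind     = c(3.2, 4.1, 2.7, 5.0, 4.4, 3.8),
  row.names = paste("2020-01-01", sprintf("%02d:00", 0:5))
)

# Filter on clip, but return only columns 2 to 4 (clip itself is dropped).
positive_rest <- obs[obs$clip > 0, 2:4]
print(positive_rest)
```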
Although beyond the scope of this introductory module, additional packages exist for data manipulation, such as SparkR; Spark is a multi-language cluster-computing framework targeted at data analysis. A SparkDataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood.