Subset a `data.frame` by Completeness of Rows or Columns

An alternative to stats::complete.cases() that lets you specify the percentage of completeness desired.

Usage

almostComplete(dataset, rowPct, colPct = rowPct, n = 1)

Arguments

dataset	The input `data.frame`
rowPct	The maximum percent of `NA` values in rows, as a decimal.
colPct	The maximum percent of `NA` values in columns, as a decimal.
n	When `rowPct` and `colPct` are `NULL`, the function will drop at least the number of rows and columns specified here, by "rank", if any contain `NA`. See "Details".

Value

A data.frame

Details

When `n` is specified and `rowPct` and `colPct` are both `NULL`, the function counts the `NA` values in each row and each column. With the default `n = 1`, it drops the rows and columns that have the highest count. With a larger `n`, it drops every row and column whose count is at least the n-th highest distinct count. With the dataset in the Examples section, `n = 2` gives a cutoff of 3 for both rows and columns, so the function removes rows 1, 3, and 6 and columns A, B, C, and F. Compare this behavior with the results of `rowSums(is.na(mydf))` and `colSums(is.na(mydf))`.
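The rank-based rule can be sketched in base R. This is a minimal illustration, not the package's source: `rank_drop_sketch()` and `nth_highest()` are hypothetical names, and leaving the data untouched when it contains no `NA` values is an assumption.

```r
# Hypothetical helper: the n-th highest distinct, non-zero NA count.
# Returns Inf when there are no NAs, so that nothing is dropped (assumption).
nth_highest <- function(counts, n) {
  distinct <- sort(unique(counts[counts > 0]), decreasing = TRUE)
  if (length(distinct) == 0) return(Inf)
  distinct[min(n, length(distinct))]
}

# Hypothetical sketch of the rank-based drop: remove every row and column
# whose NA count is at least the n-th highest distinct count.
rank_drop_sketch <- function(dataset, n = 1) {
  row_na <- rowSums(is.na(dataset))
  col_na <- colSums(is.na(dataset))
  dataset[row_na < nth_highest(row_na, n),
          col_na < nth_highest(col_na, n),
          drop = FALSE]
}

# Same data as in the Examples section
mydf <- read.csv(text = "
SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")

res <- rank_drop_sketch(mydf, n = 2)
names(res)     # "SampleID" "D" "E"
rownames(res)  # "2" "4" "5" "7" "8" "9" "10" "11"
```

Ranking on distinct counts matters here: columns A and F both have 5 missing values, so the second-highest distinct count is 3 rather than 5, which is why B and C are also removed.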

References

http://stackoverflow.com/a/20475029/1270695

Author

Ananda Mahto

Examples


mydf <- read.csv(text="
SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")

## What do the data look like?
## How many NAs are there per column and row?
mydf
#>    SampleID    A    B    C    D    E    F
#> 1        x1 <NA>    x <NA>    x <NA>    x
#> 2        x2    x    x <NA>    x    x <NA>
#> 3        x3 <NA> <NA>    x    x    x <NA>
#> 4        x4    x    x    x <NA>    x    x
#> 5        x5    x    x    x    x    x    x
#> 6        x6 <NA> <NA> <NA>    x <NA> <NA>
#> 7        x7    x    x    x <NA>    x    x
#> 8        x8 <NA> <NA>    x    x    x    x
#> 9        x9    x    x    x    x    x <NA>
#> 10      x10    x    x    x    x    x    x
#> 11      x11 <NA>    x    x    x    x <NA>
colSums(is.na(mydf))
#> SampleID        A        B        C        D        E        F 
#>        0        5        3        3        2        2        5 
rowSums(is.na(mydf))
#>  [1] 3 2 3 1 0 5 1 2 1 0 2

## What does complete.cases do?
mydf[complete.cases(mydf), ]
#>    SampleID A B C D E F
#> 5        x5 x x x x x x
#> 10      x10 x x x x x x

## Drop the rows and columns that have
## the highest number of NA values (the default, n = 1)
almostComplete(mydf, NULL, NULL)
#>    SampleID    B    C    D    E
#> 1        x1    x <NA>    x <NA>
#> 2        x2    x <NA>    x    x
#> 3        x3 <NA>    x    x    x
#> 4        x4    x    x <NA>    x
#> 5        x5    x    x    x    x
#> 7        x7    x    x <NA>    x
#> 8        x8 <NA>    x    x    x
#> 9        x9    x    x    x    x
#> 10      x10    x    x    x    x
#> 11      x11    x    x    x    x

## Drop the rows and columns that have at least
## the second-highest number of NA values
almostComplete(mydf, NULL, NULL, n = 2)
#>    SampleID    D E
#> 2        x2    x x
#> 4        x4 <NA> x
#> 5        x5    x x
#> 7        x7 <NA> x
#> 8        x8    x x
#> 9        x9    x x
#> 10      x10    x x
#> 11      x11    x x

## Set one threshold value for both rows and columns.
almostComplete(mydf, .7)
#>    SampleID    B    C    D    E
#> 1        x1    x <NA>    x <NA>
#> 2        x2    x <NA>    x    x
#> 3        x3 <NA>    x    x    x
#> 4        x4    x    x <NA>    x
#> 5        x5    x    x    x    x
#> 6        x6 <NA> <NA>    x <NA>
#> 7        x7    x    x <NA>    x
#> 8        x8 <NA>    x    x    x
#> 9        x9    x    x    x    x
#> 10      x10    x    x    x    x
#> 11      x11    x    x    x    x

## Specify row and column threshold values separately.
almostComplete(mydf, rowPct = .2, colPct = .5)
#>    SampleID    B    C    D E
#> 2        x2    x <NA>    x x
#> 4        x4    x    x <NA> x
#> 5        x5    x    x    x x
#> 7        x7    x    x <NA> x
#> 8        x8 <NA>    x    x x
#> 9        x9    x    x    x x
#> 10      x10    x    x    x x
#> 11      x11    x    x    x x