An alternative to stats::complete.cases() that lets you specify the percentage of completeness desired.

almostComplete(dataset, rowPct, colPct = rowPct, n = 1)

Arguments

dataset

The input data.frame

rowPct

The maximum proportion of NA values allowed in a row, expressed as a decimal (for example, 0.2 for 20%). Set to NULL to use n instead.

colPct

The maximum proportion of NA values allowed in a column, expressed as a decimal. Defaults to the value of rowPct.

n

Used only when rowPct and colPct are both NULL. The function then drops at least this number of rows and columns, chosen by "rank" of their NA counts, if any contain NA. Defaults to 1. See "Details".

Value

A data.frame

Details

When rowPct and colPct are both NULL, the function counts the NA values in each row and each column and uses n to decide how many "ranks" of those counts to drop. With the default, n = 1, it drops only the rows and columns with the highest number of missing values. With the dataset in the Examples section, n = 2 removes rows 1, 3, and 6 and columns A, B, C, and F, because these hold the two highest NA counts (5 and 3) among the rows and among the columns. Compare this behavior with the results of rowSums(is.na(mydf)) and colSums(is.na(mydf)).
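The ranking described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper name top_na is hypothetical, and the sketch assumes that "rank" means the n highest distinct non-zero NA counts, which is what the n = 2 result in the Examples section suggests.

```r
# Example data, as defined in the Examples section.
mydf <- read.csv(text = "
SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")

# Hypothetical helper (an assumption, not part of the package):
# flags the entries whose NA count is among the n highest
# distinct non-zero counts.
top_na <- function(counts, n) {
  ranked <- sort(unique(counts[counts > 0]), decreasing = TRUE)
  counts %in% head(ranked, n)
}

row_na <- rowSums(is.na(mydf))  # NA count per row
col_na <- colSums(is.na(mydf))  # NA count per column

drop_rows <- top_na(row_na, n = 2)
drop_cols <- top_na(col_na, n = 2)

# Keep everything that was not flagged.
result <- mydf[!drop_rows, !drop_cols]
result
```

For this dataset the sketch flags rows 1, 3, and 6 and columns A, B, C, and F, so it keeps the same rows and columns as the n = 2 call in the Examples section. Logical indexing is used rather than negative indices so that nothing is dropped by accident when no entry is flagged.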

References

http://stackoverflow.com/a/20475029/1270695

Author

Ananda Mahto

Examples

mydf <- read.csv(text="
SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")

## What do the data look like?
## How many NAs are there per column and row?
mydf
#>    SampleID    A    B    C    D    E    F
#> 1        x1 <NA>    x <NA>    x <NA>    x
#> 2        x2    x    x <NA>    x    x <NA>
#> 3        x3 <NA> <NA>    x    x    x <NA>
#> 4        x4    x    x    x <NA>    x    x
#> 5        x5    x    x    x    x    x    x
#> 6        x6 <NA> <NA> <NA>    x <NA> <NA>
#> 7        x7    x    x    x <NA>    x    x
#> 8        x8 <NA> <NA>    x    x    x    x
#> 9        x9    x    x    x    x    x <NA>
#> 10      x10    x    x    x    x    x    x
#> 11      x11 <NA>    x    x    x    x <NA>
colSums(is.na(mydf))
#> SampleID        A        B        C        D        E        F
#>        0        5        3        3        2        2        5
rowSums(is.na(mydf))
#> [1] 3 2 3 1 0 5 1 2 1 0 2

## What does complete.cases do?
mydf[complete.cases(mydf), ]
#>    SampleID A B C D E F
#> 5        x5 x x x x x x
#> 10      x10 x x x x x x

## Drop whichever row and column have
## the highest percentage of NA values
almostComplete(mydf, NULL, NULL)
#>    SampleID    B    C    D    E
#> 1        x1    x <NA>    x <NA>
#> 2        x2    x <NA>    x    x
#> 3        x3 <NA>    x    x    x
#> 4        x4    x    x <NA>    x
#> 5        x5    x    x    x    x
#> 7        x7    x    x <NA>    x
#> 8        x8 <NA>    x    x    x
#> 9        x9    x    x    x    x
#> 10      x10    x    x    x    x
#> 11      x11    x    x    x    x

## Drop the rows and columns which have
## at least the second highest percentage of NA values
almostComplete(mydf, NULL, NULL, n = 2)
#>    SampleID    D E
#> 2        x2    x x
#> 4        x4 <NA> x
#> 5        x5    x x
#> 7        x7 <NA> x
#> 8        x8    x x
#> 9        x9    x x
#> 10      x10    x x
#> 11      x11    x x

## Set one threshold value for both rows and columns.
almostComplete(mydf, .7)
#>    SampleID    B    C    D    E
#> 1        x1    x <NA>    x <NA>
#> 2        x2    x <NA>    x    x
#> 3        x3 <NA>    x    x    x
#> 4        x4    x    x <NA>    x
#> 5        x5    x    x    x    x
#> 6        x6 <NA> <NA>    x <NA>
#> 7        x7    x    x <NA>    x
#> 8        x8 <NA>    x    x    x
#> 9        x9    x    x    x    x
#> 10      x10    x    x    x    x
#> 11      x11    x    x    x    x

## Specify row and column threshold values separately.
almostComplete(mydf, rowPct = .2, colPct = .5)
#>    SampleID    B    C    D E
#> 2        x2    x <NA>    x x
#> 4        x4    x    x <NA> x
#> 5        x5    x    x    x x
#> 7        x7    x    x <NA> x
#> 8        x8 <NA>    x    x x
#> 9        x9    x    x    x x
#> 10      x10    x    x    x x
#> 11      x11    x    x    x x