An alternative to stats::complete.cases() that lets you specify the percentage of completeness desired.
almostComplete(dataset, rowPct, colPct = rowPct, n = 1)
Argument | Description |
---|---|
dataset | The input |
rowPct | The maximum percent of |
colPct | The maximum percent of |
n | When |
A data.frame
When n is specified and rowPct and colPct are NULL, the function calculates the number of NA values by row and column. By default, it then drops the rows and columns with the highest number of missing values. With the dataset in the Examples section, if you use n = 2, the function will remove rows 1, 3, and 6 and columns A, B, C, and F. Compare this behavior with the results of rowSums(is.na(mydf)) and colSums(is.na(mydf)).
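The ranking described above can be approximated in base R. This is only a sketch of the idea, not the package's implementation: drop_top_na and keep_below_nth are hypothetical helper names, and the rule they apply (drop every row and column whose NA count is at or above the n-th highest distinct non-zero count) is an assumption inferred from the n = 2 result described above.

```r
# Example data from the Examples section.
mydf <- read.csv(text = "SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")

# Hypothetical helper (not part of the package): TRUE for entries whose
# NA count is below the n-th highest distinct non-zero count.
keep_below_nth <- function(counts, n) {
  top <- sort(unique(counts[counts > 0]), decreasing = TRUE)
  if (length(top) == 0) {
    return(rep(TRUE, length(counts)))  # nothing is missing, keep everything
  }
  cutoff <- top[min(n, length(top))]
  counts < cutoff
}

# Hypothetical helper (not part of the package): apply the same rule to
# rows and columns, using NA counts taken from the full dataset.
drop_top_na <- function(dataset, n = 1) {
  row_na <- rowSums(is.na(dataset))
  col_na <- colSums(is.na(dataset))
  dataset[keep_below_nth(row_na, n), keep_below_nth(col_na, n), drop = FALSE]
}

# With n = 2, rows 1, 3, and 6 and columns A, B, C, and F are dropped.
drop_top_na(mydf, n = 2)
```

Row and column counts are both computed before anything is removed, so dropping a column does not change which rows are selected.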
http://stackoverflow.com/a/20475029/1270695
Ananda Mahto
mydf <- read.csv(text="
SampleID,A,B,C,D,E,F
x1,NA,x,NA,x,NA,x
x2,x,x,NA,x,x,NA
x3,NA,NA,x,x,x,NA
x4,x,x,x,NA,x,x
x5,x,x,x,x,x,x
x6,NA,NA,NA,x,NA,NA
x7,x,x,x,NA,x,x
x8,NA,NA,x,x,x,x
x9,x,x,x,x,x,NA
x10,x,x,x,x,x,x
x11,NA,x,x,x,x,NA")

## What do the data look like?
## How many NAs are there per column and row?
mydf
#>    SampleID    A    B    C    D    E    F
#> 1        x1 <NA>    x <NA>    x <NA>    x
#> 2        x2    x    x <NA>    x    x <NA>
#> 3        x3 <NA> <NA>    x    x    x <NA>
#> 4        x4    x    x    x <NA>    x    x
#> 5        x5    x    x    x    x    x    x
#> 6        x6 <NA> <NA> <NA>    x <NA> <NA>
#> 7        x7    x    x    x <NA>    x    x
#> 8        x8 <NA> <NA>    x    x    x    x
#> 9        x9    x    x    x    x    x <NA>
#> 10      x10    x    x    x    x    x    x
#> 11      x11 <NA>    x    x    x    x <NA>
colSums(is.na(mydf))
#> SampleID        A        B        C        D        E        F
#>        0        5        3        3        2        2        5
rowSums(is.na(mydf))
#> [1] 3 2 3 1 0 5 1 2 1 0 2

#>    SampleID A B C D E F
#> 5        x5 x x x x x x
#> 10      x10 x x x x x x

## Drop whichever row and column have
## the highest percentage of NA values
almostComplete(mydf, NULL, NULL)
#>    SampleID    B    C    D    E
#> 1        x1    x <NA>    x <NA>
#> 2        x2    x <NA>    x    x
#> 3        x3 <NA>    x    x    x
#> 4        x4    x    x <NA>    x
#> 5        x5    x    x    x    x
#> 7        x7    x    x <NA>    x
#> 8        x8 <NA>    x    x    x
#> 9        x9    x    x    x    x
#> 10      x10    x    x    x    x
#> 11      x11    x    x    x    x

## Drop the rows and columns which have
## more than the second highest percentage of NA values
almostComplete(mydf, NULL, NULL, n = 2)
#>    SampleID    D E
#> 2        x2    x x
#> 4        x4 <NA> x
#> 5        x5    x x
#> 7        x7 <NA> x
#> 8        x8    x x
#> 9        x9    x x
#> 10      x10    x x
#> 11      x11    x x

## Set one threshold value for both rows and columns.
almostComplete(mydf, .7)
#>    SampleID    B    C    D    E
#> 1        x1    x <NA>    x <NA>
#> 2        x2    x <NA>    x    x
#> 3        x3 <NA>    x    x    x
#> 4        x4    x    x <NA>    x
#> 5        x5    x    x    x    x
#> 6        x6 <NA> <NA>    x <NA>
#> 7        x7    x    x <NA>    x
#> 8        x8 <NA>    x    x    x
#> 9        x9    x    x    x    x
#> 10      x10    x    x    x    x
#> 11      x11    x    x    x    x

## Specify row and column threshold values separately.
almostComplete(mydf, rowPct = .2, colPct = .5)
#>    SampleID    B    C    D    E
#> 2        x2    x <NA>    x    x
#> 4        x4    x    x <NA>    x
#> 5        x5    x    x    x    x
#> 7        x7    x    x <NA>    x
#> 8        x8 <NA>    x    x    x
#> 9        x9    x    x    x    x
#> 10      x10    x    x    x    x
#> 11      x11    x    x    x    x