Back to the main page.

Bug 2400 - FT randsample samples with replacement, while the statistics toolbox does not

Status ASSIGNED
Reported 2013-11-29 19:39:00 +0100
Modified 2014-04-01 16:42:22 +0200
Product: FieldTrip
Component: external
Version: unspecified
Hardware: All
Operating System: All
Importance: P3 normal
Assigned to: Robert Oostenveld
URL:
Tags:
Depends on:
Blocks:
See also: http://bugzilla.fcdonders.nl/show_bug.cgi?id=2497

Craig Richter - 2013-11-29 19:39:41 +0100

This can cause unexpected behaviour when the user assumes that the call to randsample is invoking the statistics toolbox.


Philipp Ruhnau - 2014-03-12 09:02:18 +0100

is this being worked on? I just encountered it as well. at least the help should be changed!


Robert Oostenveld - 2014-03-12 09:13:19 +0100

Hi Craig and Philipp, thanks for notifying, I'll get to work on it. As I understand it, it pertains to external/stats/randsample.m which serves as drop-in replacement for the stats toolbox version. >> which randsample.m /Volumes/Data/roboos/matlab/fieldtrip/external/stats/randsample.m >> restoredefaultpath >> which randsample.m /Applications/MATLAB_R2012b.app/toolbox/stats/stats/randsample.m The drop-in replacement functions are supposed to behave identical. I see in the help of the Mathworks version that there are 5 use cases: % Y = RANDSAMPLE(N,K) returns Y as a column vector of K values sampled.. % Y = RANDSAMPLE(POPULATION,K) returns K values sampled uniformly at random,.. % Y = RANDSAMPLE(N,K,REPLACE) or RANDSAMPLE(POPULATION,K,REPLACE) returns a.. % Y = RANDSAMPLE(N,K,true,W) or RANDSAMPLE(POPULATION,K,true,W) returns a.. % Y = RANDSAMPLE(S,...) uses the random number stream S for random number.. of which only the first two are matched by the FT implementation. I will start by implementing a quick fix for some of the potential issues by checking on the number of input arguments.


Robert Oostenveld - 2014-03-12 09:13:47 +0100

(In reply to Robert Oostenveld from comment #2) Oh, I see that there is already a check if nargin>2 error('only two input variables are supported'); end


Robert Oostenveld - 2014-03-12 09:16:57 +0100

looking at http://code.google.com/p/fieldtrip/source/list?path=/trunk/external/stats/randsample.m&start=8410 I see that the function was renamed by Marcel (marvger) in revision 622 from rndsample to randsample, although he was aware of the incompatibility.


Philipp Ruhnau - 2014-03-12 09:25:19 +0100

huh, in any case the help needs to be changed as currently the output is with replacement


Robert Oostenveld - 2014-03-12 09:38:35 +0100

(In reply to Robert Oostenveld from comment #4) and prior to that it has been called randsample http://code.google.com/p/fieldtrip/source/list?path=/trunk/private/rndsample.m&start=621 I.e. the history of the function name is randsample -> rndsample -> randsample and at a certain point it moved from fieldtrip/private (where people could not use it directly) to external/stats. Looking at the code, I see it presently at use in mac001> find . -name \*.m -exec grep -H randsample {} \; ./external/dmlt/+dml/crossvalidator.m: tmp = [tmp; randsample(F{f}(iidx),maxsmp - length(iidx),true)]; ./external/dmlt/external/glmnet/cvglmnet.m:% g2=randsample(2,100,true); ./external/dmlt/external/glmnet/cvglmnet.m: foldid = randsample([repmat(1:nfolds,1,floor(N/nfolds)) 1:mod(N,nfolds)],N); ./external/dmlt/external/glmnet/glmnet.m:% g2=randsample(2,100,true); ./external/dmlt/external/glmnet/glmnet.m:% g4=randsample(4,100,true); ./external/dmlt/external/glmnet/glmnetPlot.m:% g2=randsample(2,100,true); ./external/dmlt/external/glmnet/glmnetPlot.m:% g4=randsample(4,100,true); ./external/dmlt/external/glmnet/glmnetPredict.m:% g2=randsample(2,100,true); ./external/dmlt/external/glmnet/glmnetPredict.m:% g4=randsample(4,100,true); ./private/resampledesign.m: resample(i,:) = randsample(1:Nrepl, Nrepl); ./private/resampledesign.m: tmp = randsample(1:Nrepl/Nrep(1), Nrepl/Nrep(1)); ./private/resampledesign.m: tmp(i,:) = sort(randsample(1:Nrepl/Nrep(1), Nrepl/Nrep(1))); ./external/stats/randsample.m:function [y] = randsample(x, k) ./external/stats/randsample.m:% $Id: randsample.m 8410 2013-08-21 14:16:44Z eelspa $ The use in dmlt requires the version from the matlab/stats toolbox to be used, as it has (mostly) 3 input args. The use in resampledesign is restricted to cfg.resampling='bootstrap', which is not used by default. The comments in the code make clear that there it is meant to resample with replacement, which is what is implemented in the external/stats version (although with incorrect help). So the issue can be addressed by removing the version from external/stats and replacing the call in the resampledesign to make correct use of the matlab/stats version (i.e. add the "true" option). mac001> svn commit external/stats/randsample.m private/resampledesign.m Deleting external/stats/randsample.m Sending private/resampledesign.m Transmitting file data . Committed revision 9284.


Robert Oostenveld - 2014-03-12 09:46:49 +0100

(In reply to Philipp Ruhnau from comment #5) The functions in external/stats function should be proper drop-in replacements to reduce the number of stats toolbox licenses that are needed when running it on a compute cluster or in training workshops, not functions with the same name but slightly different functionality. Hence changing the documentation would not address the actual problem. Removing the function does address the core problem, although does increase toolbox license requirements in specific cases. It still would be nice to have a drop-in replacement function for randsample, but it should then be implemented from scratch (not looking at the Mathworks code, only at its help) to ensure consistent behavior. See also http://fieldtrip.fcdonders.nl/faq/matlab_requirements?s[]=drop#replacements_for_functions_from_mathworks_toolboxes although I now realize that that documentation is outdated. It would be good to have a set of test scripts for all fieldtrip/external/stats functions (and idem for fieldtrip/external/signal). We do have http://code.google.com/p/fieldtrip/source/browse/trunk/test/test_nanstat.m but that tests only a specific subset (which are implemented as mex files).


Craig Richter - 2014-04-01 16:31:47 +0200

I've written a replacement for randsample. It performs the functions of the statistics toolbox randsample except for the RandStream option, and weighted resampling, but should satisfy the demands of FT listed above by Robert. It performs on par with the Mathworks code. What is the best way to test it, since a number of functions depend upon it? function [rnd] = randsample(varargin) % Randsample replacement: Craig Richter 2014 if nargin==4 error('The Fieltrip replacement of randsample does not support weighted resampling') elseif nargin==3 if varargin{3} ~= 0 && varargin{3} ~= 1 error('Replacement input must be zero, one or ''true/false'' ') else replace_opt = varargin{3}; end elseif nargin==2 replace_opt = 0; end n = varargin{1}; k = varargin{2}; if k>n error('population must be larger than selection') end if replace_opt % sample with replacement if length(n)==1 rnd = randi(n,k,1); % sample k from 1:n elseif length(n)>1 rnd = n(randi(length(n),k,1))'; % sample k from population n end else % sample without replacement if length(n)==1 % sample k from 1:n pop = 1:n; elseif length(n)>1 pop = n; end [~,indx] = sort(rand(length(pop),1)); rnd = pop(indx(1:k))'; end


Craig Richter - 2014-04-01 16:42:22 +0200

Well, I found the first bug (in one of the error checks) :-/, here is the updated code: function [rnd] = randsample(varargin) % Replacement for randsample: Craig Richter 2014 if nargin==4 error('The Fieltrip replacement of randsample does not support weighted resampling') elseif nargin==3 if varargin{3} ~= 0 && varargin{3} ~= 1 error('Replacement input must be zero, one or ''true/false'' ') else replace_opt = varargin{3}; end elseif nargin==2 replace_opt = 0; end n = varargin{1}; k = varargin{2}; if ~replace_opt && k>n error('population must be larger than selection') end if replace_opt % sample with replacement if length(n)==1 rnd = randi(n,k,1); % sample k from 1:n elseif length(n)>1 rnd = n(randi(length(n),k,1))'; % sample k from population n end else % sample without replacement if length(n)==1 % sample k from 1:n pop = 1:n; elseif length(n)>1 pop = n; end [~,indx] = sort(rand(length(pop),1)); rnd = pop(indx(1:k))'; end