I have analyzed the scaling properties of the code rather carefully,
 and my conclusion is that using a smaller cutoff for deltapsi
 does not help.

 This will be summarized in my notes and in the paper. Here I just report
 the major conclusions:

 * The most expensive part of this story is the FFT (forward and backward)
   needed to calculate V*deltapsi in h_psi_c, called by chpsi_all_eta
   (within the CG minimizer).

   The computational cost of an FFTW transform is 39/4*N*log2(N), where N
   is the number of elements of the transform. This number is equal to the
   number of real-space points (in the example 12*12*12=1728; typically we
   want 0.1 A spacing, corresponding to 24*24*24=13824 in diamond at 60Ry).

   We can try to use "pruned FFTWs" (http://www.fftw.org/pruned.html),
   where the input (output) array is sparse and the other one is dense.
   For instance we could start from deltapsi on a small cutoff in G-space,
   Ns, and transform to real space on the large cutoff N.
   However, the computational cost would be about 39/4*N*log2(Ns) =
   39/4*N*log2(Ns/N*N) = 39/4*N*[log2(N) - log2(N/Ns)].
   Therefore the cost relative to a full FFT on the large cutoff (the
   "gain" below) would be
   [log2(N) - log2(N/Ns)]/log2(N) = 1 - log2(N/Ns)/log2(N).
   For the example I am using (Ns=51, N=411) this gain is 0.65
   (i.e. 65% of the time employed in a full FFT).
   With realistic cutoffs (Ns=137, N=13824) the gain is 0.52.
   For very large systems N/Ns is a fixed constant, and log2(N) increases,
   therefore the gain goes to 1 and the two implementations become
   equivalent.

   The previous estimates become even worse if we consider that in the
   V*deltapsi calculation we also need the backward FFT to G-space.
   Since we want V*deltapsi on the large cutoff, we cannot prune this
   second transform. Therefore we gain a little on the first transform
   and nothing on the second one. The two example gains above would
   then become (1+0.65)/2 = 0.83 and (1+0.52)/2 = 0.76.
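
The estimates above can be reproduced with a few lines of Python. This is
only a sketch of the operation-count model used in these notes, not a
timing of the actual code; the function names pruned_ratio and
round_trip_ratio are made up for illustration.

```python
from math import log2


def pruned_ratio(ns: int, n: int) -> float:
    """Cost of one pruned transform relative to one full transform.

    Model from the notes: pruned cost ~ N*log2(Ns), full cost ~ N*log2(N),
    so the common prefactor cancels in the ratio.
    """
    return log2(ns) / log2(n)


def round_trip_ratio(ns: int, n: int) -> float:
    """Pruned forward transform plus full backward transform,
    relative to two full transforms."""
    return (pruned_ratio(ns, n) + 1.0) / 2.0


if __name__ == "__main__":
    # (Ns, N) pairs quoted in the notes
    for ns, n in [(51, 411), (137, 13824)]:
        print(ns, n,
              round(pruned_ratio(ns, n), 2),
              round(round_trip_ratio(ns, n), 2))

    # Fixed N/Ns = 100 with growing N: the ratio creeps towards 1
    for n in [10**4, 10**6, 10**8]:
        print(n, round(pruned_ratio(n // 100, n), 2))
```

For the two (Ns, N) pairs this gives 0.65 and 0.83, then 0.52 and 0.76,
matching the numbers quoted above.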

 * Another possibility is to avoid the FFT from the outset and use the
   convolution between V and psi. This calculation requires Ns * N
   multiplications, to be compared with 2 * 39/4 * N * log2(N) in the FFT
   case (factor 2 because we have 2 FFTs).
   Therefore the gain in this case would be Ns/(2 * 39/4 * log2(N)).
   In the examples above:
   Ns=51, N=1728: 0.24 (N=1728 is the 12*12*12 grid; with N=411 the value is 0.30)
   Ns=137, N=13824: 0.51
   If we go to a much larger system (supercell with large N), then the
   FFT approach is far more convenient.
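
The convolution estimate can be checked in the same way. The sketch below
evaluates Ns/(2 * 39/4 * log2(N)) under the cost model of these notes; the
name convolution_ratio and the factor-8 supercell are assumptions made
only for illustration.

```python
from math import log2

FFT_PREFACTOR = 39 / 4  # operation-count prefactor quoted in the notes


def convolution_ratio(ns: int, n: int) -> float:
    """Cost of the direct convolution (Ns*N multiplications) relative to
    two full FFTs (2 * 39/4 * N * log2(N)). Values below 1 favour the
    convolution, values above 1 favour the FFT."""
    return ns / (2 * FFT_PREFACTOR * log2(n))


if __name__ == "__main__":
    # N=1728 is the 12*12*12 grid and reproduces the quoted 0.24;
    # N=411 is shown for comparison with the pruned-FFT example.
    for ns, n in [(51, 1728), (51, 411), (137, 13824)]:
        print(ns, n, round(convolution_ratio(ns, n), 2))

    # Hypothetical supercell 8 times larger: Ns and N both scale by 8,
    # so the ratio grows roughly linearly and the FFT wins.
    print(round(convolution_ratio(8 * 137, 8 * 13824), 2))
```

In this model the convolution stops paying off once Ns exceeds
2 * 39/4 * log2(N), which is about 268 for N=13824.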