:::::::::::::: gwdfpt.note.6.11.08 ::::::::::::::

I don't think we can achieve linear scaling as suggested in Baroni&Giannozzi92. In their case the kinetic energy was calculated by finite differences, so it is strictly localized in R-space. Using psinc functions instead, the kinetic energy is still localized, but it is non-vanishing on further neighbors. Mostofi et al. mention that the application of the kinetic energy remains a global operation to be performed with FFTs (http://www.tcm.phy.cam.ac.uk/onetep/intro/node6.html). In other words, if we want to keep the G-space accuracy, we NEED to use the full representation of the kinetic energy WITHOUT truncation. This means that it is useless to compute the matrix elements of the kinetic energy in the psinc basis, since they do not decay fast enough. For Haydock, we still need a localized object to start with, and the psinc choice still seems the most convenient for switching to/from G-space.

:::::::::::::: gwdfpt.note.6.12.08 ::::::::::::::

In order to have a complete mapping of the psinc basis into G-space, we need to keep the entire G-box, and not truncate it within the kinetic-energy cutoff!! All the nice properties of the psincs rest on this full mapping; if we truncate, the orthogonality and all the rest vanish. I think we need to define a grid spacing such that the corresponding G-box is CONTAINED within the sphere of G-vectors smaller than the cutoff. In practice the calculation HAS TO BE RUN with a cutoff smaller than the one used for the wavefunctions. I think this is still fine, since calculations of epsilon are usually done with 5 or 10 Ry.

:::::::::::::: gwdfpt.note.6.17.08 ::::::::::::::

OK, I think I am able to calculate the eigenvalues of H with the recursion method for the special case k=0 and psinc(1) (diagonal element, Hermitian Hamiltonian). The eigenvalues coincide with those of the CG method to within 1 meV after about 150 iterations.
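A minimal Python sketch of this recursion scheme, with a random Hermitian matrix standing in for H and the first basis vector standing in for psinc(1) (all names and sizes are illustrative, not the actual code): the recursion builds tridiagonal coefficients a_n, b_n, whose matrix gives the eigenvalues, and the same coefficients give the diagonal Green's function element as a continued fraction.

```python
# Haydock/Lanczos recursion sketch: a random Hermitian matrix stands in for H.
import numpy as np

def haydock(H, u1, niter):
    # recursion coefficients a_n, b_n from H and a normalized start vector u1
    a, b = [], [0.0]                      # b[0] is a dummy for the first step
    q_prev = np.zeros_like(u1)
    q = u1 / np.linalg.norm(u1)
    for _ in range(niter):
        w = H @ q
        a.append(np.vdot(q, w).real)
        w = w - a[-1] * q - b[-1] * q_prev
        b.append(np.linalg.norm(w))
        q_prev, q = q, w / b[-1]
    return np.array(a), np.array(b[1:])

def green_cf(a, b, z):
    # G(z) = <u1|(z - H)^(-1)|u1> = 1/(z - a1 - b1^2/(z - a2 - b2^2/(...)))
    g = 1.0 / (z - a[-1])
    for an, bn in zip(a[-2::-1], b[-2::-1]):
        g = 1.0 / (z - an - bn**2 * g)
    return g

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
H = (X + X.conj().T) / 2                  # random Hermitian stand-in for H
u1 = np.zeros(n, complex); u1[0] = 1.0    # start vector, playing the role of psinc(1)

a, b = haydock(H, u1, 150)

# eigenvalues from the tridiagonal matrix of recursion coefficients; without
# reorthogonalization some converged eigenvalues show up more than once ("ghosts")
T = np.diag(a) + np.diag(b[:-1], 1) + np.diag(b[:-1], -1)
ev_tri = np.linalg.eigvalsh(T)
ev_full = np.linalg.eigvalsh(H)
print(ev_tri[0], ev_full[0])              # extremal eigenvalues converge first

# diagonal Green's function element vs the direct ("sum over states") evaluation
z = 0.5 + 1.0j                            # w + i*delta, inside the spectrum
g_cf = green_cf(a, b, z)
g_exact = np.vdot(u1, np.linalg.solve(z * np.eye(n) - H, u1))
print(g_cf, g_exact)
```

The ghost duplicates produced by plain (unreorthogonalized) recursion are harmless for the Green's function itself, since the continued fraction only uses the coefficients.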
I obtained the eigenvalues by diagonalizing the tridiagonal matrix built from the recursion coefficients. Some nondegenerate eigenvalues appear multiple times; I don't know at this stage whether this is a problem for the computation of the Green's function or not. I would say no, since the recursion method makes no assumption about the multiplicity of the calculated spectrum. The Green's function has poles at SOME OF the eigenvalues; the strength is nonzero when the initial state has components on that eigenstate. Now I need to try the nondiagonal elements and the nonzero k-vectors.

:::::::::::::: gwdfpt.note.6.18.08 ::::::::::::::

I tested the Green's function G(r,r',k,w) against the sum-over-states expression. The diagonal element on psinc(1) at k=0 gives EXACTLY the same result as the sum over states (I compared the imaginary parts to avoid calculating many empty states). This is great; now I need to do the off-diagonal matrix elements and the case k/=0.

:::::::::::::: gwdfpt.note.6.19.08 ::::::::::::::

I tested the Green's function G(r,r',k,w) against the sum-over-states expression. OK, the off-diagonal elements work perfectly, and so does the case k/=0. I need about 100 iterations to converge the Green's function with delta = 0.1 eV. It looks like the Green's function converges from the bottom, i.e. the more iterations we do, the larger the energy range over which it matches the full calculation based on the sum over states.

:::::::::::::: gwdfpt.note.7.30.08 ::::::::::::::

The screened Coulomb interaction seems to work now. I tried three things as a sanity check: (1) the inverse dielectric matrix eps^{-1} as defined by W/v_c using the SCF linear response; (2) the dielectric matrix eps = delta - v_c * chi_0 using the non-SCF linear response; (3) the dielectric matrix eps = delta - v_c * chi_0 using the expansion over empty states.
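As an aside on relating eps and eps^{-1}: a small linear-algebra sketch (random symmetric positive-definite matrix standing in for eps; toy numbers, not a real dielectric matrix) of why the head of the inverse matrix is not the reciprocal of the head. By block (Schur-complement) inversion, 1/[eps^-1]_{00} = eps_{00} - w^T B^{-1} w, where w is the wing (first row/column) and B the body, so the two agree only when the wings vanish.

```python
# Toy demonstration: head of the inverse vs inverse of the head.
import numpy as np

rng = np.random.default_rng(1)
n = 6
G = rng.standard_normal((n, n))
E = G @ G.T + n * np.eye(n)     # symmetric positive-definite stand-in for eps(q)

head = E[0, 0]                  # "eps_{00}"
wing = E[0, 1:]                 # "eps_{0G'}", G' /= 0
body = E[1:, 1:]                # "eps_{GG'}", G,G' /= 0

head_of_inv = np.linalg.inv(E)[0, 0]                 # "[eps^-1]_{00}"
schur = head - wing @ np.linalg.solve(body, wing)    # Schur complement of the body

print(1.0 / head_of_inv, schur)   # these two coincide...
print(head)                       # ...and differ from eps_{00} when the wings are nonzero

E0 = E.copy()
E0[0, 1:] = 0.0
E0[1:, 0] = 0.0                   # wings set to zero
print(1.0 / np.linalg.inv(E0)[0, 0], head)   # now 1/[eps^-1]_{00} = eps_{00}
```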
(3) and (2) give identical results (see results/comparison_eps.eps for G=G'=0 and varying q) and agree well with Walter&Cohen: I obtain 11.4 for q=0, W&C obtain 11.3. The comparison for eps^{-1} is more delicate, and I have been struggling with it for a while. The main point is that, if you take eps^{-1} and invert it for q->0, you don't get eps... After some reading, it appears that this is related to the non-analyticity of the dielectric matrix as q approaches 0. See Baldereschi&Tosatti as well as Hybertsen&Louie for a detailed discussion. In particular, in B&T the 10-point local result of Table II gives eps = 12.1, while the Local column of Table VIII gives eps^{-1} = 0.092 (both for q close to 0). Now: 1/0.092 = 10.9, in disagreement with the value 12.1. However, Table VII of B&T gives 0.083, and 1/0.083 = 12.04, in better agreement with 12.1. In practice there is no simple way to relate eps^{-1}(q->0) to eps(q->0) [it looks like one has to set the wings to zero, but I did not try that]. The good news is that in my case I also obtain eps^{-1} = 0.098, with 1/0.098 = 10.2, while eps = 11.4. In particular, the other elements of my IDM also agree with the column "Local" in Table VIII of B&T. Note also in Tables VI and VII of H&L that the RPA result for eps^{-1}, 0.082 (1/0.082 = 12.2), does not correspond to the RPA result for eps, 13.6. So everything is fine and we can proceed with the self-energy. I need to check whether we can run with \omega /= 0 and with \omega + i*\delta...

:::::::::::::: gwdfpt.note.21.12.08 ::::::::::::::

I am putting together G and W. The debug tests on G are successful, and the debug tests on W now seem to work. Since I moved to a uniform, Gamma-centered grid for the purpose of folding, I was obtaining eps0 values around 15.7 for 555-000. After some tests I realized that the code seems correct, but my previous tests giving 11.65 at 555-111 corresponded to 19 inequivalent points, while 555-000 corresponds to only 10 inequivalent points.
By simply bringing 555-000 up to 10,10,10-000 I obtain 11.69, and the frequency dependence is OK. For the q=0, G=0 case I perform a separate calculation in coulomb_q0G0.f90 where the initial {k} grid is kept, the {k+q} grid is generated for a very small xq0, and the eigenstates are calculated explicitly (no folding). Everything seems OK up to this point. The question now is whether I NEED q=0 in the sum, or whether I can skip it in the integration for Sigma. In Steve's code I think q=0 is included explicitly, but I don't remember why...

:::::::::::::: gwdfpt.note.04.01.09 ::::::::::::::

I was able to substantially optimize solve_linter_dyn by using the following scheme: 1) run a full bcgsolve minimization for the first dvscf iteration, and 2) run a "few-steps" minimization (ideally 1 step, but I use 5 for safety) of dpsi in bcgsolve and update the dvscf in solve_linter. The convergence takes place within the same number of dvscf iterations (sometimes fewer). The rationale is that the bcgsolve part runs over k-points and is far more expensive than the update of the potential. This idea is similar to the direct minimization of the energy functional used in Paratec. For the case q=0, G=0 I need to perform the original minimization (separate relaxations of dpsi and dvscf); the combined minimization does not seem to converge. Now I think it's time to seriously optimize bcgsolve, which is the #1 bottleneck of the entire story.

:::::::::::::: gwdfpt.note.05.01.09 ::::::::::::::

Some profiling of bcgsolve_all shows that the most expensive part (by a proportion of 9:1) is the application of the Hamiltonian (A*x, i.e. ch_psi_all_eta), and in particular the FFTs forth and back to compute V*psi. There is not much to optimize here, so I reconsidered the possibility of including some preconditioning.
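The combined-minimization idea of the 04.01.09 note can be sketched on a toy problem (Python; the real code relaxes dpsi/dvscf with bcgsolve over k-points, while here a generic SPD system, a weak coupling kernel, and all names are made up for illustration): a full inner solve at the first outer iteration, then only a few inner steps per potential update, with linear mixing.

```python
# Toy sketch of the combined minimization: partial inner relaxation + mixing.
import numpy as np

def cg(A, rhs, x0, max_steps, tol=1e-12):
    # plain conjugate-gradient relaxation, capped at max_steps iterations
    x = x0.copy()
    r = rhs - A @ x
    p = r.copy()
    for _ in range(max_steps):
        rr = r @ r
        if np.sqrt(rr) < tol:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        p = r + (r @ r / rr) * p
    return x

rng = np.random.default_rng(2)
n = 40
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)              # SPD stand-in for the linear-response operator
b = rng.standard_normal(n)               # stand-in for the bare-perturbation rhs
K = 0.1 * rng.standard_normal((n, n))    # weak kernel mapping "dpsi" to "dvscf"
alpha_mix = 0.3

def outer_loop(steps_first, steps_later, n_outer=200):
    v = np.zeros(n)                      # "dvscf"
    x = np.zeros(n)                      # "dpsi" (kept between outer iterations)
    for it in range(n_outer):
        steps = steps_first if it == 0 else steps_later
        x = cg(A, b - K @ v, x, steps)   # (partial) relaxation of "dpsi"
        v_new = K.T @ x                  # potential generated by the response
        if np.linalg.norm(v_new - v) < 1e-10:
            return it, v
        v = v + alpha_mix * (v_new - v)  # linear mixing
    return n_outer, v

it_full, v_full = outer_loop(200, 200)   # fully relax "dpsi" at every update
it_few, v_few = outer_loop(200, 5)       # full relaxation once, then 5 steps per update
print(it_full, it_few, np.linalg.norm(v_full - v_few))
```

On this toy problem both variants reach the same fixed point, while the few-steps variant does far less inner work per outer iteration, which is the point of the scheme.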
In order to add the preconditioner I simply modified the linear system from Ax = b to (MA)x = Mb, with M given by h_diag, and replaced A by MA and b by Mb everywhere in the procedure. The new matrix MA is still complex, and it does not need to be symmetric or Hermitian in the complex-BCG algorithm of Jacobs - so no problem here. In practice the preconditioner is applied in only three locations. A comparison of the convergence with and without preconditioning is in the file results/preconditioning.eps. It is clear that preconditioning reduces the number of iterations required for a given accuracy by a factor of 3 to 4. The preconditioner 1/max(1,x) of Giannozzi (?, PH) and the one by Payne-Teter-Allan (PRB 1988) give similar results. WARNING: it seems that preconditioning does not work for several k-points.

:::::::::::::: gwdfpt.note.17.01.09 ::::::::::::::

Trying to parallelize on hbar: I had to declare plan in fftw.f90 as integer*8, otherwise segfault problems occurred. I realized that some q-vectors could not be converged in coulomb for the versions devel-7 and devel-8. This has to do with the combined minimization. By playing a bit with the numbers, it seems that the whole process is quite sensitive to the choice of parameters. For now I stick to maxscf = 1, maxbcgsolve = 8, alpha_mix = 0.3, and I *disabled* the preconditioning.

:::::::::::::: gwdfpt.note.19.01.09 ::::::::::::::

I was able to parallelize over the G-perturbations on hbar and slater. The results look identical (apart from some tiny differences in the minimization steps). The Makefile can now handle both the serial (x61) and the parallel (hbar and slater) cases. Hbar seems hugely faster than slater (at 8 procs) - why is that? Next I need to parallelize the Green's function (currently done only by proc 1) and see whether I can speed up solve_linter by using an ngms cutoff on dpsi. This is reasonable in principle and could lead to an extremely efficient code. I need to work a bit on that.
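Going back to the preconditioning of the 05.01.09 note: that code uses Jacobs' complex BCG with M = h_diag, but the iteration-count effect can be illustrated on a simpler stand-in (Python sketch, all names and numbers made up): preconditioned CG on a Hermitian toy system whose diagonal has a wide, kinetic-energy-like spread, with a diagonal (Jacobi-style) preconditioner.

```python
# Effect of a diagonal preconditioner on iteration count (toy SPD system).
import numpy as np

def pcg(A, b, Minv, tol=1e-8, max_steps=5000):
    # preconditioned conjugate gradient with a diagonal preconditioner Minv;
    # returns the solution and the number of iterations used
    x = np.zeros_like(b)
    r = b.copy()
    z = Minv * r
    p = z.copy()
    rz = r @ z
    for it in range(max_steps):
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            return x, it
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = Minv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_steps

rng = np.random.default_rng(3)
n = 400
d = np.linspace(1.0, 1000.0, n)          # wide spread, like the kinetic term G^2
S = 0.01 * rng.standard_normal((n, n))
A = np.diag(d) + (S + S.T)               # SPD, strongly diagonal-dominant
b = rng.standard_normal(n)

x_plain, it_plain = pcg(A, b, np.ones(n))        # no preconditioning (M = identity)
x_prec, it_prec = pcg(A, b, 1.0 / np.diag(A))    # diagonal preconditioner
print(it_plain, it_prec)                         # preconditioning needs far fewer steps
```

The observed factor (3 to 4 in results/preconditioning.eps) depends on the spectrum of the operator; the toy problem only shows the qualitative effect.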