Author Topic: Problems with parallel dscf in the case of large number (256) of CPU-cores  (Read 3534 times)

evgeniy

  • Sr. Member
  • ****
  • Posts: 102
  • Karma: +0/-0
Dear All,

I have encountered problems when running TM (6.3.1) with a
large number of CPU cores, namely 256. The calculation I am
trying to run is large, ~4000 basis functions, hence so many
cores. The problem occurs in the dscf module. It starts fine and reaches
the following line in the output:

" DSCF restart information will be dumped onto file mos"

and then it hangs. I noticed that there is an extra process
on one node (the first) out of the 32 nodes. That is, I request nodes=32:ppn=8,
but for some reason there were 9 processes on just the first node.
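One way to spot where the extra process lands is to compare the slots PBS assigned per node (from the machine file) with the processes actually running. A small sketch of the counting step, using a simulated nodefile since the real `$PBS_NODEFILE` only exists inside a job (the file name here is an assumption):

```shell
# Simulate a PBS nodefile: one line per assigned MPI slot.
# On the cluster, replace nodefile.txt with "$PBS_NODEFILE".
printf 'node01\nnode01\nnode02\nnode02\nnode02\n' > nodefile.txt

# Count slots per node; an unexpected extra entry would show up here.
sort nodefile.txt | uniq -c
```

On the real machine you would then compare these counts against `ps`/`pgrep` output on each node to find the node running one process too many.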

When I reduced the number of CPU cores to 128, i.e. nodes=16:ppn=8,
everything was fine: the job runs, and there are exactly 8 processes on each
node, as there should be.
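For reference, the working 128-core case corresponds to a submission script along these lines. Only the nodes=16:ppn=8 resource line comes from the post; the walltime, the TURBOMOLE environment variables, and the output redirection are assumptions about a typical setup and may need adjusting for your installation:

```shell
#!/bin/bash
#PBS -l nodes=16:ppn=8        # 16 nodes x 8 cores = 128 MPI processes (the working case)
#PBS -l walltime=24:00:00     # walltime is an assumption; adjust to your queue

cd "$PBS_O_WORKDIR"           # run in the directory the job was submitted from

export PARA_ARCH=MPI          # select TURBOMOLE's MPI-parallel binaries (assumed setup)
export PARNODES=128           # number of parallel processes

dscf > dscf.out 2>&1
```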

Any comments on this problem would be greatly appreciated.

Best regards,
Evgeniy


tjmustard

  • Newbie
  • *
  • Posts: 2
  • Karma: +0/-0
I had a similar issue when I was using the dscf scratch space settings in the control file. Once I removed these, the job would work.

Hope this helps,
TJ Mustard

evgeniy

  • Sr. Member
  • ****
  • Posts: 102
  • Karma: +0/-0
Quote from: tjmustard
> I had a similar issue when I was using the dscf scratch space settings in the control file. Once I removed these, the job would work.
>
> Hope this helps,
> TJ Mustard

Hi,

I didn't quite get your point. If you mean the $tmpdir setting, I have no such setting
in the control file. And if the problem had to do with storing the two-electron integrals
on the scratch disk, the 128-CPU job would behave the same way as the 256-CPU job.


Best,
Evgeniy

PS: I've just tried running it with 192 CPUs and it runs fine, with no extra process.
I found that the largest CPU count that still works is 200; above 200, the job
hangs. I have no idea what this could be related to; it may be hardware
dependent.
« Last Edit: January 15, 2012, 11:52:47 am by evgeniy »