
pT+H compiling

Dear all,

 

We are trying to compile the parallel version of TOUGH+HYDRATE (pT+H). So far we have been able to build the pT+H executables on the supercomputer. The supercomputer provides MPICH rather than OpenMPI, and the compiler wrappers are cc and ftn instead of mpicc, mpifort, etc. However, when we run the test files (i.e., Test_2DX in the attachment), we get a runtime error.

 

Runtime error on the supercomputer:

At line 294 of file Parallel_subs.f
Fortran runtime error: Allocatable actual argument 'part' is not allocated
srun: error: nid00198: task 1: Exited with exit code 2
srun: Terminating job step 3624227.0
slurmstepd: error: *** STEP 3624227.0 ON nid00197 CANCELLED AT 2022-10-17T18:15:08 ***
srun: error: nid00197: task 0: Terminated
srun: Force Terminated job step 3624227.0

 

We get the same error on other supercomputers and on our Dell tower desktop (Linux) with OpenMPI. When we start the run, the executable reads the input file and generates the INCON, OUT, Plot Coord, and other related files, but the simulation then aborts without writing any data to Hydrate Status, etc. We tried different compilers and other options with our HPC team at USGS, but got the same error as above.

 

Could you please help us with this runtime error?

11 replies

    • kenny
    • 2 yrs ago

    The error is pretty strange. I am pretty sure that "part" has been allocated. Did you accidentally delete the line of source code that allocates "part", or are you linking to Metis correctly? The error occurs in the section that partitions the mesh using Metis.

    By the way, are you compiling the code in debug mode? (I see source code line numbers in the error message.) Debug mode is significantly slower than release mode.

      • toughtt
      • 2 yrs ago

      Kenny, thank you for the reply. I didn't delete anything in the source code. Metis 4.0 itself has a problem when compiling: I need to change log2 to ilog2 to build it and obtain libmetis.a. In Metis 4.0.3 this problem is already fixed, and I also tried the libmetis.a built from Metis 4.0.3.

    • kenny
    • 2 yrs ago

    You can add the following line to the file Parallel_subs.f right before line 294, recompile the source code, and try again:

     if (myid .NE. iMaster) allocate(part(1))
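The idea behind this one-line workaround can be illustrated with a minimal sketch that does not use MPI. Only the names part, myid, and iMaster come from the thread; the program name, the subroutine use_part, the integer type of part, the array sizes, and the hard-coded rank values are assumptions made for illustration and are not taken from Parallel_subs.f. The sketch is written in free-form Fortran (for example, saved as a .f90 file), whereas Parallel_subs.f is fixed-form.

```fortran
program guard_alloc
  implicit none
  ! Hypothetical stand-ins: in the real code myid and iMaster would
  ! come from the MPI setup; here they are hard-coded.
  integer, allocatable :: part(:)
  integer :: myid, iMaster

  myid    = 1   ! pretend this process is NOT the master
  iMaster = 0

  ! Assumption: only the master rank allocates the full array.
  if (myid == iMaster) allocate(part(100))

  ! Guard: give every other rank a small dummy allocation so that
  ! passing "part" as an actual argument is valid everywhere.
  if (.not. allocated(part)) allocate(part(1))

  call use_part(part)
  deallocate(part)

contains

  ! Hypothetical consumer taking a non-allocatable dummy argument.
  subroutine use_part(p)
    integer, intent(inout) :: p(:)
    p = 0
    print *, 'size of part on this rank:', size(p)
  end subroutine use_part

end program guard_alloc
```

Testing with allocated() rather than comparing rank numbers keeps the guard safe even if the array was already allocated elsewhere, since allocating an already allocated array is itself an error.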

      • toughtt
      • 2 yrs ago

      Kenny 

      You can find the makefile we use in the attachment.

       

      After adding the line if (myid .NE. iMaster) allocate(part(1)), I now get this error:

      At line 299 of file Parallel_subs.f

      Fortran runtime error: Allocatable actual argument 'grpntr' is not allocated

       

      Error termination. Backtrace:

      At line 299 of file Parallel_subs.f

      Fortran runtime error: Allocatable actual argument 'grpntr' is not allocated

       

      Error termination. Backtrace:

      A column referenced on one processor was not

      found on any other processors. This means that

      no processor is assigned to the row equal to

      this column number.

      Note: matrices must be square.

      The following columns were not found:

       34  37  40      

    • kenny
    • 2 yrs ago

    Can you remove the options "-fcheck=all -Wall" from the makefile and try again? The error message refers to arrays that are not used on certain CPUs and are therefore not allocated on those CPUs. The executable should not be forced to perform this check at run time.

      • toughtt
      • 2 yrs ago

      Kenny, thank you for your quick replies. I compiled the code without -fcheck=all and got the executable, with several warnings. The Test 1 file I used for the run is attached.

      I got the following fault:

      mpirun -np 3 ./tough-mp3

      Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

      Backtrace for this error:

      Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

      Backtrace for this error:

      Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

      Backtrace for this error:

      #0  0x7ff891a8fad0 in ???

      #1  0x7ff891a8ec35 in ???

      #2  0x7ff89177f51f in ???

      at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0

      #3  0x5586ff55eb0d in allreplicom_

      at /home/sukru/Desktop/New Folder/src/Parallel_subs.f:4057

      #4  0x5586ff562b85 in do_parallel_

      at /home/sukru/Desktop/New Folder/src/Parallel_subs.f:148

      #0  0x7fc912df0ad0 in ???

      #1  0x7fc912defc35 in ???

      #2  0x7fc912ae051f in ???

      at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0

      #3  0x55dbbf8a8b0d in allreplicom_

      at /home/sukru/Desktop/New Folder/src/Parallel_subs.f:4057

      #5  0x5586ff51685f in tough_plus

      at /home/sukru/Desktop/New Folder/src/Main.f:286

      #6  0x5586ff50463e in main

      at /home/sukru/Desktop/New Folder/src/Main.f:21

      #4  0x55dbbf8acb85 in do_parallel_

      at /home/sukru/Desktop/New Folder/src/Parallel_subs.f:148

      #5  0x55dbbf86085f in tough_plus

      at /home/sukru/Desktop/New Folder/src/Main.f:286

      #6  0x55dbbf84e63e in main

      at /home/sukru/Desktop/New Folder/src/Main.f:21

      #0  0x7f0460b17ad0 in ???

      #1  0x7f0460b16c35 in ???

      #2  0x7f046080751f in ???

      at ./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0

      #3  0x55895487fb0d in allreplicom_

      at /home/sukru/Desktop/New Folder/src/Parallel_subs.f:4057

      #4  0x558954883b85 in do_parallel_

      at /home/sukru/Desktop/New Folder/src/Parallel_subs.f:148

      #5  0x55895483785f in tough_plus

      at /home/sukru/Desktop/New Folder/src/Main.f:286

      #6  0x55895482563e in main

       

      If I increase MaxNum_SS in the input file to 999, this fault disappears and I get the following errors instead:

      mpirun -np 3 ./tough-mp3

      A column referenced on one processor was not

      found on any other processors. This means that

      no processor is assigned to the row equal to

      this column number.

      Note: matrices must be square.

      The following columns were not found:

       34  37  40

      A column referenced on one processor was not

      found on any other processors. This means that

      no processor is assigned to the row equal to

      this column number.

      Note: matrices must be square.

      The following columns were not found:

       30  33  65

      A column referenced on one processor was not

      found on any other processors. This means that

      no processor is assigned to the row equal to

      this column number.

      Note: matrices must be square.

      The following columns were not found:

       24  28  54

    • kenny
    • 2 yrs ago

    I can run your example with my own version of the code on Google Cloud. I wonder which version of the code you are using.

      • toughtt
      • 2 yrs ago

      Kenny, this example is actually the test example that LBL ships with the code. We received the code last month: pT+HYDRATE_Source_v1.5

    • kenny
    • 2 yrs ago

    I am not aware of a V1.5. The TOUGH website shows that the latest parallel version is V1.0.

      • toughtt
      • 2 yrs ago

      Kenny, yes, I guess there is only one version, V1.0, as you said. So we have version 1.0 as well, even though the zip file is named pT+HYDRATE_Source_v1.5.

    • kenny
    • 2 yrs ago

    If you like, you can send me the package by email and I will do the comparison. I am the original developer of pTH, but I have not touched it for a while. My email is kzhang at LBL dot gov.
