Error while executing condor batch jobs

Dear experts,
I am trying to run preSel.py using runDataFrameBatch; the job output files are in

this CERNBox folder

It seems that no analysis.py file was found, and that no output.root file was available for the copy.

Hopefully useful info:

  • I am producing the output root ntuples in a folder on /eos (outdir=’/eos/…’)
  • I am running the script directly within the folder that contains the analysis.py script (i.e. FCCAnalyses/examples/FCCee/flavour/generic-analysis/)

Do you have any suggestion on how to solve this issue?

Thanks

Giovanni

Dear @gguerrie,

could you try to run it not from within the folder but like this:
python examples/FCCee/flavour/generic-analysis/preSel.py
If that changes nothing, please provide some more details so that we can reproduce the problem.
Clement

Hi @clement.helsens,
thanks for your feedback. Before running from within the generic-analysis folder, I had already run python examples/FCCee/flavour/generic-analysis/preSel.py from inside FCCAnalyses; the outcome was the same.

In the same CERNBox folder I linked above, I put:

  • preSel.py (please note that outdir is set to my folder on /eos)
    Here I just replaced rdf.runDataFrame with rdf.runDataFrameBatch (see the sketch below).
  • analysis.py
  • Algorithm.cc, since I slightly changed one of the thrust-computation algorithms in analyzers/dataframe (and recompiled)
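
For reference, the relevant part of my preSel.py looks roughly like this (a minimal sketch: apart from the rdf.runDataFrame → rdf.runDataFrameBatch swap, the import path, paths and call arguments are placeholders, not copied from my actual file):

```python
# Sketch only: everything except the runDataFrame -> runDataFrameBatch swap
# is an assumption about the usual preSel pattern, not the real file.
import config.runDataFrame as rdf          # assumed import path

basedir      = "/eos/experiment/fcc/..."   # placeholder input directory
process_list = ["p8_ee_Zbb_ecm91"]
outdir       = "/eos/user/g/gguerrie/..."  # placeholder output directory on /eos

# local running:  myana = rdf.runDataFrame(basedir, process_list)
myana = rdf.runDataFrameBatch(basedir, process_list)
myana.run(ncpu=8, fraction=1, outDir=outdir)  # argument names assumed
```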

Please let me know if something else is necessary.
Cheers

G

Looking at the Bc analysis, I see I have the absolute path:

That might be the reason: the batch job needs to know where to find the analysis.
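In your case it would be something along these lines (the variable name and path below are just placeholders to show the idea):

```python
# Illustrative only: point the batch configuration to analysis.py with an
# absolute path rather than a relative one; variable name and path are made up.
analysis_file = "/afs/cern.ch/user/<u>/<username>/FCCAnalyses/examples/FCCee/flavour/generic-analysis/analysis.py"
```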
Let me know
Clement

Dear @clement.helsens, thanks for the insights.
I now get this error:

Traceback (most recent call last):
  File "/afs/cern.ch/user/g/gguerrie/FCCAnalyses/examples/FCCee/flavour/generic-analysis/analysis.py", line 257, in <module>
    analysis.run()
  File "/afs/cern.ch/user/g/gguerrie/FCCAnalyses/examples/FCCee/flavour/generic-analysis/analysis.py", line 39, in run
    .Alias("MCRecoAssociations0", "MCRecoAssociations#0.index")
cppyy.gbl.std.runtime_error: ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Alias(basic_string_view<char,char_traits<char> > alias, basic_string_view<char,char_traits<char> > columnName) =>
    runtime_error: GetBranchNames: error in opening the tree events

This does not happen when I run the analysis locally. Maybe some environment variable gets lost?

cheers

Giovanni

Hello @gguerrie,

it looks like it cannot find the tree named events. Does it work locally?

Yes. Just by changing that line in preSel.py back (rdf.runDataFrameBatch --> rdf.runDataFrame) it works.
I do not modify anything else…

Could you paste the call to rdf.runDataFrameBatch? I’m wondering what condor queue you are using

Here is the generated .cfg file:

executable     = $(filename)
Log            = /afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91/condor_job.p8_ee_Zbb_ecm91.$(ClusterId).$(ProcId).log
Output         = /afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91/condor_job.p8_ee_Zbb_ecm91.$(ClusterId).$(ProcId).out
Error          = /afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91/condor_job.p8_ee_Zbb_ecm91.$(ClusterId).$(ProcId).error
getenv         = True
environment    = "LS_SUBCWD=/afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91"
requirements   = ( (OpSysAndVer =?= "CentOS7") && (Machine =!= LastRemoteHost) && (TARGET.has_avx2 =?= True) )
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
max_retries    = 3
+JobFlavour    = "espresso"
+AccountingGroup = "group_u_FCC.local_gen"
RequestCpus = 8
queue filename matching files /afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91/jobp8_ee_Zbb_ecm91_chunk0.sh ...  /afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91/jobp8_ee_Zbb_ecm91_chunk100.sh

And I can see group_u_FCC.local_gen when launching condor_userprio.

If you need any other information, let me know.

I do not see your name in the e-group that has access to group_u_FCC.local_gen.
I have just added you. In the meantime, could you try with another group that you are sure works?

Hello @clement.helsens
I tried to submit some jobs as a normal user (which normally works fine) and the error is the same.
If I am the only one experiencing this issue, I definitely have to look into my condor configuration.

Hello @gguerrie,

I have not run condor jobs for some time, but if you give me a full reproducer (a link to your repository with the commands to run) I can give it a try.

Also, in the last few days there was something wrong with the latest stack that led to some weird errors. Have you tried, from a fresh shell, recompiling before sending the jobs?

Cheers,
Clement

Hi @clement.helsens,
here’s the repo (branch batch_jobs): https://gitlab.cern.ch/gguerrie/FCCAnalyses/-/tree/batch_jobs

The only thing to change is the output path outdir="..." in the FCCAnalyses/examples/FCCee/flavour/Afb/analysis/preSel.py file.

Then, from within the FCCAnalyses folder, do

  • source setup.sh
  • python examples/FCCee/flavour/Afb/analysis/preSel.py

Hi @clement.helsens, any news on this topic? Is anyone else using this way of submitting jobs? Thanks

Hello,

Yes, sorry for taking so long to reply. I know this is not optimal, but not every analysis.py can be processed on batch: you need to adapt the input arguments like it is done here.
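
The gist (only a rough sketch below, not the actual example linked above; the argument order, names and class layout are my assumptions) is that the script has to take its input and output files from the command line, so that the batch wrapper can steer it chunk by chunk:

```python
# Rough sketch of a batch-friendly analysis.py; argument order, names and
# the class layout are assumptions, not the actual FCCAnalyses example.
import sys
import ROOT

class Analysis:
    def __init__(self, inputlist, outname, ncpu=8):
        self.outname = outname
        ROOT.ROOT.EnableImplicitMT(ncpu)
        self.df = ROOT.RDataFrame("events", inputlist)

    def run(self):
        df2 = (self.df
               .Alias("MCRecoAssociations0", "MCRecoAssociations#0.index")
               # ... further Alias/Define/Filter calls ...
               )
        # write out only the columns produced above
        df2.Snapshot("events", self.outname, ["MCRecoAssociations0"])

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: analysis.py <outputfile> <inputfile> [<inputfile> ...]")
        sys.exit(3)
    outfile = sys.argv[1]   # copied back by the job, e.g. to /eos
    infiles = sys.argv[2:]  # input ntuples for this chunk
    Analysis(infiles, outfile).run()
```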

I should rewrite all the analysis.py files at some point so that they can be processed on batch, or find a generic interface so that the config modules can be shared.


Hi @gguerrie, have you been able to run a test with a modified analysis.py that complies with runDataFrameBatch?
Cheers,
Clement

Hello @clement.helsens,
I did try to re-run the analysis with the modified script; the problem seems to persist.
Additionally, since the problem is with eoscopy.py, I tried putting the copy script, the dictionaries and the Delphes outputs in a local directory and sending some jobs with this setup.
The issue then is that the batch node has no access to the sample (note that I am considering a single input sample, contained in my personal AFS directory, from which I send the job).

This is peculiar, since everything works when I submit a classic condor job with a simple .sub file listing just the executable to run.
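
For reference, a minimal .sub of the kind I mean, extended with HTCondor's standard file-transfer directives to make a locally stored sample visible on the worker node, would look roughly like this (executable and file names are placeholders):

```
# Illustrative plain HTCondor submit file; executable and file names are placeholders.
executable              = run_chunk.sh
arguments               = events_sample.root output.root
# ship the script and the local sample to the worker node
transfer_input_files    = analysis.py, events_sample.root
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = job.$(ClusterId).$(ProcId).out
error                   = job.$(ClusterId).$(ProcId).err
log                     = job.$(ClusterId).$(ProcId).log
+JobFlavour             = "espresso"
queue
```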

Thanks for the comments @gguerrie, I realise this batch implementation was not generic enough. I am rewriting it and it will hopefully work for everybody. I hope that you at least now have a stable way to process large amounts of data.
