Error while executing condor batch jobs

Dear experts,
I am trying to run preSel.py using runDataFrameBatch; the job output files are in

this CERNBox folder

It seems that no analysis.py file was found, and that no output.root file was available for the copy.

Hopefully useful info:

  • I am producing the output root ntuples in a folder on /eos (outdir=’/eos/…’)
  • I am running the script directly within the folder that contains the analysis.py script (i.e. FCCAnalyses/examples/FCCee/flavour/generic-analysis/)

Do you have any suggestion on how to solve this issue?

Thanks

Giovanni

Dear @gguerrie,

could you try to run it not from within the folder but like this:
python examples/FCCee/flavour/generic-analysis/preSel.py
If that changes nothing, please provide some more details so that we can reproduce the problem.
Clement

Hi @clement.helsens,
thanks for your feedback. Before running from within the generic-analysis folder, I had already run python examples/FCCee/flavour/generic-analysis/preSel.py from inside FCCAnalyses; the outcome was the same.

In the same CERNBox folder I linked above, I put:

  • preSel.py (please note that outdir is set to my folder on /eos)
    Here I just replaced rdf.runDataFrame with rdf.runDataFrameBatch (see the sketch below).
  • analysis.py
  • Algorithm.cc, since I slightly changed one of the thrust-computation algorithms in analyzers/dataframe (and recompiled)
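
For reference, the relevant part of my preSel.py looks roughly like this (a minimal sketch: apart from the rdf.runDataFrame → rdf.runDataFrameBatch swap, the import path, paths and call arguments are placeholders, not copied from my actual file):

```python
# Sketch only: everything except the runDataFrame -> runDataFrameBatch swap
# is an assumption about the usual preSel pattern, not the real file.
import config.runDataFrame as rdf          # assumed import path

basedir      = "/eos/experiment/fcc/..."   # placeholder input directory
process_list = ["p8_ee_Zbb_ecm91"]
outdir       = "/eos/user/g/gguerrie/..."  # placeholder output directory on /eos

# local running:  myana = rdf.runDataFrame(basedir, process_list)
myana = rdf.runDataFrameBatch(basedir, process_list)
myana.run(ncpu=8, fraction=1, outDir=outdir)  # argument names assumed
```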

Please let me know if something else is necessary.
Cheers

G

Looking at the Bc analysis, I see I have the absolute path:

That might be the reason: the batch job needs to know where to find the analysis.
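In your case it would be something along these lines (the variable name and path below are just placeholders to show the idea):

```python
# Illustrative only: point the batch configuration to analysis.py with an
# absolute path rather than a relative one; variable name and path are made up.
analysis_file = "/afs/cern.ch/user/<u>/<username>/FCCAnalyses/examples/FCCee/flavour/generic-analysis/analysis.py"
```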
Let me know
Clement

Dear @clement.helsens, thanks for the insights.
I now get this error:

Traceback (most recent call last):
  File "/afs/cern.ch/user/g/gguerrie/FCCAnalyses/examples/FCCee/flavour/generic-analysis/analysis.py", line 257, in <module>
    analysis.run()
  File "/afs/cern.ch/user/g/gguerrie/FCCAnalyses/examples/FCCee/flavour/generic-analysis/analysis.py", line 39, in run
    .Alias("MCRecoAssociations0", "MCRecoAssociations#0.index")
cppyy.gbl.std.runtime_error: ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Alias(basic_string_view<char,char_traits<char> > alias, basic_string_view<char,char_traits<char> > columnName) =>
    runtime_error: GetBranchNames: error in opening the tree events

This does not happen when I run the analysis locally. Maybe some environment variable gets lost?

cheers

Giovanni

Hello @gguerrie,

it looks like it cannot find the tree named events. Does it work locally?

Yes. Just by changing that line in preSel.py back (rdf.runDataFrameBatch --> rdf.runDataFrame) it works.
I do not modify anything else…

Could you paste the call to rdf.runDataFrameBatch? I’m wondering what condor queue you are using

Here is the generated .cfg file:

executable     = $(filename)
Log            = /afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91/condor_job.p8_ee_Zbb_ecm91.$(ClusterId).$(ProcId).log
Output         = /afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91/condor_job.p8_ee_Zbb_ecm91.$(ClusterId).$(ProcId).out
Error          = /afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91/condor_job.p8_ee_Zbb_ecm91.$(ClusterId).$(ProcId).error
getenv         = True
environment    = "LS_SUBCWD=/afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91"
requirements   = ( (OpSysAndVer =?= "CentOS7") && (Machine =!= LastRemoteHost) && (TARGET.has_avx2 =?= True) )
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
max_retries    = 3
+JobFlavour    = "espresso"
+AccountingGroup = "group_u_FCC.local_gen"
RequestCpus = 8
queue filename matching files /afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91/jobp8_ee_Zbb_ecm91_chunk0.sh ...  /afs/cern.ch/user/g/gguerrie/FCCAnalyses/BatchOutputs/p8_ee_Zbb_ecm91/jobp8_ee_Zbb_ecm91_chunk100.sh

And I can see group_u_FCC.local_gen when launching condor_userprio.

If you need any other information, let me know.

I do not see your name in the e-group that has access to group_u_FCC.local_gen.
I have just added you. In the meantime, could you try with another group that you are sure works?

Hello @clement.helsens
I tried to submit some jobs as a normal user (which normally works fine) and the error is the same.
If I am the only one experiencing this issue, I definitely have to look into my condor configuration.

Hello @gguerrie,

I have not run condor jobs for some time, but if you give me a full reproducer (a link to your repository with the commands to run) I can give it a try.

Also, in the last few days there was something wrong with the latest stack that led to some weird errors. Have you tried, from a fresh shell, recompiling before sending the jobs?

Cheers,
Clement

Hi @clement.helsens,
here’s the repo (branch batch_jobs): https://gitlab.cern.ch/gguerrie/FCCAnalyses/-/tree/batch_jobs

The only thing to change is the output path outdir="..." in the FCCAnalyses/examples/FCCee/flavour/Afb/analysis/preSel.py file.

Then, from within the FCCAnalyses folder, do

  • source setup.sh
  • python examples/FCCee/flavour/Afb/analysis/preSel.py

Hi @clement.helsens, any news on this topic? Is anyone else using this way of submitting jobs? Thanks

Hello,

Yes, sorry for taking so long to reply. I know this is not optimal, but not every analysis.py can be processed on batch: you need to adapt the input arguments like it is done here.
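
The gist (only a rough sketch below, not the actual example linked above; the argument order, names and class layout are my assumptions) is that the script has to take its input and output files from the command line, so that the batch wrapper can steer it chunk by chunk:

```python
# Rough sketch of a batch-friendly analysis.py; argument order, names and
# the class layout are assumptions, not the actual FCCAnalyses example.
import sys
import ROOT

class Analysis:
    def __init__(self, inputlist, outname, ncpu=8):
        self.outname = outname
        ROOT.ROOT.EnableImplicitMT(ncpu)
        self.df = ROOT.RDataFrame("events", inputlist)

    def run(self):
        df2 = (self.df
               .Alias("MCRecoAssociations0", "MCRecoAssociations#0.index")
               # ... further Alias/Define/Filter calls ...
               )
        # write out only the columns produced above
        df2.Snapshot("events", self.outname, ["MCRecoAssociations0"])

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("usage: analysis.py <outputfile> <inputfile> [<inputfile> ...]")
        sys.exit(3)
    outfile = sys.argv[1]   # copied back by the job, e.g. to /eos
    infiles = sys.argv[2:]  # input ntuples for this chunk
    Analysis(infiles, outfile).run()
```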

I should rewrite all the analysis.py files at some point so that they can be processed on batch, or find a generic interface so that the config modules can be shared.


Hi @gguerrie, have you been able to run a test with a modified analysis.py that complies with runDataFrameBatch?
Cheers,
Clement

Hello @clement.helsens,
I did try to re-run the analysis with the modified script; the problem seems to persist.
Additionally, since the problem is with eoscopy.py, I tried putting the copy script, the dictionaries and the Delphes outputs in a local directory and sending some jobs with this setup.
The issue then is that the batch node has no access to the sample (note that I am considering a single input sample, contained in my personal AFS directory, from which I send the job).

This is peculiar, since everything works when I submit a classic condor job with a simple .sub file listing just the executable to run.
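
For reference, a minimal .sub of the kind I mean, extended with HTCondor's standard file-transfer directives to make a locally stored sample visible on the worker node, would look roughly like this (executable and file names are placeholders):

```
# Illustrative plain HTCondor submit file; executable and file names are placeholders.
executable              = run_chunk.sh
arguments               = events_sample.root output.root
# ship the script and the local sample to the worker node
transfer_input_files    = analysis.py, events_sample.root
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = job.$(ClusterId).$(ProcId).out
error                   = job.$(ClusterId).$(ProcId).err
log                     = job.$(ClusterId).$(ProcId).log
+JobFlavour             = "espresso"
queue
```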

Thanks for the comments @gguerrie, I realise this batch implementation was not generic enough. I am rewriting it and it will hopefully work for everybody. I hope that you at least now have a stable way to process large amounts of data.
