← Go back

Private MC sample generation

Local MC generation setup


In CMS, the SM processes are generated and simulated centrally, so we don't have to worry about them. But for a specific BSM search, a physicist has to take care of the BSM samples themselves, if they are not being handled already. For this section, I am using vector-like leptons as a reference BSM signal (arXiv:1510.03556). First, we need to integrate this new physics model into the existing standard model in an event generator like MadGraph. Typically, theorists share their BSM model in the Universal FeynRules Output (UFO) format, which encodes the particles, parameters, and vertices needed by automated matrix-element generators. A UFO model is a set of python modules that can be easily included as an extension to MadGraph. This lets us play with the new particles and the Feynman rules associated with them. I have taken the latest VLL UFO file from the repository: github.com/prudhvibhattiprolu.

This is the order of installation for this setup.

  1. Setting up MadGraph along with the BSM models.
  2. Building HepMC3.
  3. Installing and configuring pythia for hadronization.
  4. Building Delphes for detector simulation.

Here are some important packages required before installation. I am listing the versions that I have in my setup, but any recent versions should also work fine. I recommend using conda to install everything. It's important that ROOT is built with the same Python version that is used here. For detailed instructions on how to handle Python environments and install ROOT properly, visit here.

Packages Versions
python 3.10.9
cmake 3.22.1
git 2.34.1
g++, gcc 11.4.0
ROOT 6.26/10

I would also recommend keeping all of these MC-generation tools inside the same directory. For me, the working directory is /home/phazarik/mcgeneration.
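Before moving on, a quick way to confirm these tools are visible on PATH is the following Python sketch. It only checks availability, not versions; compare against the table above by hand.

```python
import shutil

# Check that the basic build tools are visible on PATH.
# This does not verify versions; see the table above for what I used.
tools = {t: shutil.which(t) is not None for t in ["python3", "cmake", "git", "g++"]}
for tool, found in tools.items():
    print(f"{tool}: {'found' if found else 'MISSING'}")
```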

Setting up MadGraph, including the BSM models

MadGraph can be downloaded from the official website. Building is not needed; the binary is available at mg5amcnlo/bin/mg5_aMC. I have added the full path to MadGraph in my .bashrc file as follows.

alias mg5="/home/phazarik/mcgeneration/mg5amcnlo/bin/mg5_aMC"

If you are not importing any BSM model, the MadGraph setup is done. For a BSM model (like VLL in my case), the model files are unpacked into the MadGraph models directory as follows.

                /home/phazarik/mcgeneration/mg5amcnlo/models/VLLS_NLO
                /home/phazarik/mcgeneration/mg5amcnlo/models/VLLD_NLO
              

The model files in my example were written for MadGraph v2, which ran on python2, but they should also work with the latest MadGraph v3, which uses python3. To use them with MadGraph v3, the model files are first auto-converted to python3 and then imported into MadGraph as follows.

                shell>> mg5 #Takes me into MadGraph interface.
                MG5_aMC> set auto_convert_model T #For compatibility with python3. 
                MG5_aMC> import model VLLD_NLO
              

If output like the following pops up, then the setup is ready.

                INFO: Change particles name to pass to MG5 convention
Pass the definition of 'j' and 'p' to 5 flavour scheme.
definitions of multiparticles l+ / l- / vl / vl~ unchanged
multiparticle all = g ghg ghg~ u c d s b u~ c~ d~ s~ b~ a gha gha~ ve vm vt e- mu- ta- ve~ vm~ vt~ e+ mu+ ta+ t t~ z w+ ghz ghwp ghwm h w- ghz~ ghwp~ ghwm~ taup nup taup~ nup~
INFO: Change particles name to pass to MG5 convention
definitions of multiparticles p / j / l+ / l- / vl / vl~ unchanged
multiparticle all = g ghg ghg~ u c d s b u~ c~ d~ s~ b~ a gha gha~ ve vm vt e- mu- ta- ve~ vm~ vt~ e+ mu+ ta+ t t~ z w+ ghz ghwp ghwm h w- ghz~ ghwp~ ghwm~ taup nup taup~ nup~

Setting up HepMC3

HepMC is widely used for exchanging event information between event generators and detector simulation tools. For this exercise, HepMC3 can be downloaded from GitLab. Some usage instructions are available here.

Note: Building HepMC3 is required before installing pythia, because PYTHIA configuration is done with HepMC3.

HepMC3 can be cloned from GitLab and built as follows.

                git clone https://gitlab.cern.ch/hepmc/HepMC3.git # This will create a source directory called 'HepMC3'.
                mkdir hepmc3_build  hepmc3_install                # This will create two directories where hepmc3 is built and installed.
                cd hepmc3_build                                   # This is where building hepmc3 happens.
                cmake -DCMAKE_INSTALL_PREFIX=../hepmc3_install -Dmomentum:STRING=GEV -Dlength:STRING=MM ../HepMC3
                make                                              # This requires the C++ compilers (as checked by the previous command), and takes some time.
                make install                                      # This will transfer files to the install directory.
              
Note: HepMC3 is built with a certain Python version. Changing the Python version (or the conda environment) while using HepMC3 might cause problems later.

Setting up PYTHIA

PYTHIA is a program for simulating particle interactions as well as hadronization. It can be downloaded from the official website, where installation instructions are also provided. The following steps install pythia with HepMC3 support. Make sure to give the full path to HepMC3 during configuration. I am using version 8.312, but these instructions should work for other recent versions as well.

                wget https://www.pythia.org/download/pythia83/pythia8312.tgz  # Downloading pythia.
                tar xvfz pythia8312.tgz                                       # Unzipping the tarball.
                cd pythia8312
              

In the configuration, I am including the HepMC3 library. For this, the full path to HepMC3 is passed to the ./configure command as follows.


                ./configure --with-hepmc3=/home/phazarik/mcgeneration/hepmc3_install
 
                # Alternative build commands:
                #./configure --with-hepmc3=/home/phazarik/mcgeneration/hepmc3_install \
                #            --with-python-include=/home/phazarik/miniconda3_backup_2024_10_09/envs/analysis/include/python3.10 \
                #            --with-python-bin=/home/phazarik/miniconda3_backup_2024_10_09/envs/analysis/bin
                # ./configure
                # ./configure --with-python

                #If everything goes right, the following report should pop up.
                #---------------------------------------------------------------------
                #|                    PYTHIA Configuration Summary                   |
                #---------------------------------------------------------------------
                #  Architecture                = LINUX
                #  C++ compiler     CXX        = g++
                #  C++ dynamic tags CXX_DTAGS  = -Wl,--disable-new-dtags
                #  C++ flags        CXX_COMMON = -O2 -std=c++11 -pedantic -W -Wall -Wshadow -fPIC -pthread
                #  C++ shared flag  CXX_SHARED = -shared
                #  Further options             =

                #The following optional external packages will be used:
                #+ HEPMC3 (-I/home/phazarik/mcgeneration/hepmc3_install/include)

                make clean                       # Removes temporary files from previous attempts, if any.
                make                             # This takes a couple of minutes.
              

Hadronization happens in the pythiaXXXX/examples directory, which is why I have built pythia in my work-area for convenience. It can also be kept along with the other tools, with the output files exported to the work-area for the next steps. In any case, once pythia is built, I added the following paths to my .bashrc file; these are needed for C++ compilation of the hadronizer codes.

                export PYTHIA8=/mnt/d/work/temp/mcgeneration/pythia8312
                export PYTHIA8DATA=$PYTHIA8/share/Pythia8/xmldoc
                export PATH=$PYTHIA8/bin:$PATH
                export LD_LIBRARY_PATH=$PYTHIA8/lib:$LD_LIBRARY_PATH
              

If you are a big fan of Python ... [optional]

In some examples, I have seen PYTHIA used through a Python-based interface. For this, one can easily install PYTHIA and HepMC3 using conda-forge. However, this is not needed for the toy example in the next section, and for a new user I would not recommend it, because managing multiple versions of the same tools gets messy.

                conda install -c conda-forge pythia8        # not important
                conda install -c conda-forge hepmc3         # not important
              

I also found a nice YouTube video on this Python based PYTHIA interface, which is a part of an HSF workshop. This video broadly covers the basics of a lot of things that can be done with PYTHIA, starting from event generation.

Setting up Delphes

Delphes is a fast-simulation framework for high-energy physics detectors. The Delphes outputs are roughly analogous to NanoAOD, but the information is structured differently in the ROOT files. It can be downloaded from the GitHub repository, where installation instructions are available as well. Building delphes is pretty straightforward.

                git clone https://github.com/delphes/delphes.git
                cd delphes
                make             # This takes a couple of minutes
              

Wrapping up ... [optional]

In some examples, PYTHIA, HepMC3 etc. can be used from the MadGraph interface itself (I'm exploring this, not an expert yet). For this, the paths to PYTHIA and Delphes should be included in the MadGraph configuration. This has to be done only once. I also added my fastjet setup, in case I need it later.

                set pythia8_path /home/phazarik/mcgeneration/pythia8312
                set delphes_path /home/phazarik/mcgeneration/delphes
                set fastjet /home/phazarik/fastjet-3.4.2/bin/fastjet-config          #optional
              

These paths are needed only when the hadronization etc. is done inside MadGraph itself, by importing these modules. But in the toy example, I am doing each step of the sample generation individually. These paths can also be set manually by editing the input/mg5_configuration.txt file in the MadGraph directory. Also, MadGraph may ask for something called LHAPDF, a standard tool for evaluating parton distribution functions (PDFs). It can be installed using conda as follows.

                conda install -c conda-forge lhapdf
              

After these installations, here is the list of MC-generation tools in my setup.

Packages Versions Sources
mg5amcnlo 3.5.4 GitHub
hepmc3 3.3.0 GitLab
pythia 8.312 pythia.org
delphes 3.5.0 GitHub
lhapdf 6.5.4 conda-forge
Note: Pythia, HepMC and Delphes can also be installed by simply going to the MadGraph prompt and running install pythia8 (which installs pythia and HepMC together) and install Delphes (for Delphes alone). These end up in mg5amcnlo/HEPTools. But I don't trust it yet, and there is no control over the versions. I prefer to do it manually.

↑ back to top

Toy example


The flowchart below illustrates the event-simulation workflow I am going to use. The process begins with generation in MadGraph, where parton-level events are produced in the LHE format. The LHE file is then fed into PYTHIA for hadronization, where partons are converted into physical hadrons, resulting in a DAT file. Finally, detector simulation is done using Delphes with a CMS card, which simulates how CMS would record these events, producing a ROOT file that can be used for further analysis.

graph LR; A[Generation in MadGraph, Input: interaction parameters, Output: LHE] --> B[Hadronization in pythia, Input: LHE file, Output: DAT file]; B --> C[Simulation in delphes, Input: DAT (HepMC) file, Output: ROOT file];

Generation (QFT → LHE through MadGraph)

In the generation phase, tools like MadGraph are used to simulate hard scattering events based on quantum field theory. MadGraph takes in a process definition and a set of parameters from the run_card and param_card. The input file typically consists of these cards and configuration files, which define the physics process and the collider setup. The output of this phase is in the LHE format (Les Houches Event), a standardized text file that contains detailed information about the parton-level event, such as particle IDs, momenta, and event weights. The LHE file serves as the bridge to further event processing.

Let's try to generate a simple Drell-Yan process: pp → Z → ll. For this, be in your work area, go to MadGraph prompt, and define the process.

                mg5                       # my shortcut for opening MadGraph prompt.
                
                #Now inside MadGraph prompt.
                
                display particle          # displays all the available individual particles.
                display multiparticles    # displays all the labels for groups of particles.

                generate p p > z > l+ l-
                output ppToZToLL

                exit
              

The output line creates a new directory for the pp → Z → ll process, including the run and param cards. In principle, all of this can be done in a single step in the MadGraph prompt, but it's convenient to customize the parameters later. The cards are kept in ppToZToLL/Cards. Let's edit the run_card.dat file, change/add some important parameters as follows, and keep everything else as it is.

                #Edited line:
                10  = nevents ! Number of unweighted events requested (changed from the default)
                
                #Newly added lines:
                # -- customize according to your need --
              
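The nevents edit can also be done programmatically, which helps when generating many samples. A minimal sketch of the idea, operating on a toy excerpt of a run_card rather than the real ppToZToLL/Cards/run_card.dat:

```python
import re

# Toy excerpt of a run_card.dat; a real run_card has many more lines.
card = """  100 = nevents ! Number of unweighted events requested
  6500.0 = ebeam1 ! beam 1 total energy in GeV
"""

# Rewrite whatever nevents value is present to 10, keeping the rest intact.
patched = re.sub(r"^(\s*)\d+(\s*=\s*nevents\b)", r"\g<1>10\g<2>", card, flags=re.M)
print(patched.splitlines()[0])
```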

The ppToZToLL/bin directory contains the binaries used to run the event generator, and ppToZToLL/Events contains the outputs for each run. Let's run a test and see what it does.

                # Be inside ppToZToLL directory.

                ./bin/generate_events testrun -f > /dev/null 2>&1
                
                # testrun          = name of the directory to be generated inside ppToZToLL/Events
                # -f               = suppresses the MadGraph CLI prompts and takes the parameters from run_card.
                # > /dev/null 2>&1 = Suppresses any GUI related issues (in WSL) 
              

This creates a ppToZToLL/Events/testrun directory containing a .gz file which can be unzipped and used for later purposes.

                cd Events/testrun
                gunzip unweighted_events.lhe.gz
              

This unzipped LHE file, unweighted_events.lhe, contains all the generated parton-level information, including particle IDs, momenta, event weights, and other metadata.
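The LHE event blocks are plain text and easy to inspect by hand. As a minimal sketch, here is how one parton-level event could be parsed in Python; the toy event below is hand-written (a Z decaying to e+ e-), not taken from an actual run.

```python
# Minimal sketch of reading one <event> block from an LHE file.
# A real unweighted_events.lhe contains many such blocks plus header metadata.
toy_event = """<event>
 2   1  1.0  91.18  7.8e-03  1.18e-01
  11  1  1  2  0  0  12.0   3.0   40.0   41.9  0.000  0. 1.
 -11  1  1  2  0  0 -12.0  -3.0   30.0   32.5  0.000  0. 1.
</event>"""

particles = []
lines = toy_event.strip().splitlines()[1:-1]  # drop <event> and </event>
for line in lines[1:]:                        # the first line is event-level info
    fields = line.split()
    pdg_id = int(fields[0])
    px, py, pz, energy = map(float, fields[6:10])
    particles.append((pdg_id, px, py, pz, energy))

for pdg_id, px, py, pz, energy in particles:
    print(f"PDG {pdg_id:4d}: pz = {pz:.1f} GeV, E = {energy:.1f} GeV")
```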

Hadronization (LHE → DAT through PYTHIA and HepMC3)

The hadronization step happens in the pythiaXXXX/examples directory. For this, we need a code that reads the LHE file and hadronizes the events, and we edit pythiaXXXX/examples/Makefile to include this code so that it is compiled along with the rest of the stuff.

  • The hadronizer file is named hadlhe.cc, and can be found here.
  • The updated Makefile can be found here, and should replace the old one.

These files can also be directly downloaded into the correct directory as follows.

                  cd [path to pythia]/examples
                  wget -O hadlhe.cc https://raw.githubusercontent.com/phazarik/phazarik.github.io/main/mypages/files/codes/hadlhe.cc
                  wget -O Makefile https://raw.githubusercontent.com/phazarik/phazarik.github.io/main/mypages/files/codes/Makefile_for_pythia.txt
              

The hadlhe.cc file should be edited to give the correct path to the input LHE file generated by MadGraph, the number of events in the input file, and the desired output path for the dat file containing the hadronized output. This dat file is later fed to Delphes for detector simulation. The hadronizer is compiled and executed as follows.

                #inside the examples directory

                make clean               # to get rid of any previous compiled files
                make hadlhe              # this looks for hadlhe.cc, compiles it, and creates an executable
                ./hadlhe                 # execution
              

Simulation (DAT → ROOT through Delphes)

Delphes simulation can be run using the executable DelphesHepMC3 followed by the arguments: card-name, output-file and input-file. The first two arguments are the same in every case, so I added them in my bashrc as follows.

                alias delphes="/home/phazarik/mcgeneration/delphes/DelphesHepMC3 /home/phazarik/mcgeneration/delphes/cards/delphes_card_CMS.tcl"
              

For this toy example, I ran the following from my work-directory to produce a delphes tree.

                delphes testroot.root pythia8312/examples/ppToZToLL_10.dat
              

Hadronization and Simulation [combined step]

So far, the whole MC production has been done step by step: event generation at the LHE level, then hadronized gen-level information using PYTHIA, and then detector simulation using Delphes. This exercise is useful for understanding what happens in the back-end. However, the hadronization step produces large dat files, which need to be processed by Delphes before being useful. This intermediate step can be avoided by letting Delphes run PYTHIA internally, as illustrated in the flowchart below.

graph LR; A[Generation in MadGraph, Input: interaction parameters, Output: LHE] --> B[Hadronization and simulation in Delphes, Input: LHE + PYTHIA config file, Output: ROOT file];

For this, the following things need to be done.

  • Make sure that the PYTHIA paths are available. The following lines can be added to .bashrc.
                        export PYTHIA8=[path to pythia8312] #Give the full path here
                        export PYTHIA8DATA=$PYTHIA8/share/Pythia8/xmldoc
                        export PATH=$PYTHIA8/bin:$PATH
                        export LD_LIBRARY_PATH=$PYTHIA8/lib:$LD_LIBRARY_PATH
                      
  • Configure Delphes with PYTHIA8.
                        cd /home/phazarik/mcgeneration/delphes
                        make clean
                        make HAS_PYTHIA8=true   # This will take a while
                      
  • Write a PYTHIA config file for the process as pythia8_ppToZToLL.cfg, and keep it in the process directory. This contains the specific instructions needed during hadronization, including the path to the input file, how many events to process, etc. This file is fed as an input to Delphes. If the number of events specified in the configuration does not match the number available in the LHE file, PYTHIA will still proceed with however many events it can read.
                      ! Pythia8 configuration for pp -> Z -> LL process
    
                      Main:numberOfEvents = 10     ! Number of events to simulate
                      Main:timesAllowErrors = 10   ! How many errors Pythia will allow before stopping
    
                      ! Set up beam parameters
                      Beams:idA = 2212              ! Proton beam
                      Beams:idB = 2212              ! Proton beam
                      Beams:eCM = 13000.0           ! Center-of-mass energy in GeV
    
                      ! Load the LHEF events generated by MadGraph
                      Beams:LHEF = /mnt/d/work/temp/mcgeneration/ppToZToLL/Events/testrun/unweighted_events.lhe
    
                      ! Pythia8-specific physics settings
                      WeakSingleBoson:ffbar2gmZ = on   ! Enable Z boson production
                      23:onMode = off                  ! Turn off all Z decays
                      23:onIfAny = 11 13               ! Allow only decays to leptons (e+ e- and mu+ mu-)
    
                      ! Hadronization and jet clustering
                      HadronLevel:all = on
                    
  • Instead of running DelphesHepMC3, run DelphesPythia8 along with the CMS card. For this, I created an alias in the .bashrc file for my convenience.
                        alias delphespythia="/home/phazarik/mcgeneration/delphes/DelphesPythia8 /home/phazarik/mcgeneration/delphes/cards/delphes_card_CMS.tcl"
                      
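The key = value settings in such a PYTHIA config are simple to read back programmatically, which is handy for sanity checks before a long run. A minimal Python sketch, using a toy excerpt of the config above:

```python
# Minimal sketch: read "key = value" settings from a PYTHIA .cfg,
# skipping comments (!) and blank lines.
cfg_text = """! Pythia8 configuration for pp -> Z -> LL process
Main:numberOfEvents = 10     ! Number of events to simulate
Beams:eCM = 13000.0          ! Center-of-mass energy in GeV
"""

settings = {}
for line in cfg_text.splitlines():
    line = line.split("!", 1)[0].strip()  # strip trailing comments
    if not line:
        continue
    key, value = (part.strip() for part in line.split("=", 1))
    settings[key] = value

print(settings)
```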

Once these changes are made, the following can be run from the work-directory.

                delphespythia ppToZToLL/pythia8_ppToZToLL.cfg test_output.root
              

I find this convenient because I don't have to go into the pythia8312/examples directory, create a mess there with temporary folders, and deal with the DAT files. However, this method complains about some missing libraries related to ExRootAnalysis at the beginning, and ROOT prints similar warnings while loading the output files. As long as those features are not used, it's fine.


Summary

  1. Generate LHE events in MadGraph.
    • Define the processes carefully in the MadGraph prompt.
    • Save the process by doing output processname.
    • Edit run_card.dat and specify the number of events.
    • Run ./bin/generate_events testrun.
    • Unzip the output file in Events/testrun.
  2. Run hadronization and detector simulation.
    • Write a PYTHIA config file mentioning the path to the LHE file along with hadronization parameters and how many events to run on.
    • Run delphes/DelphesPythia8 by providing the CMS card, the PYTHIA config file, and the output file as arguments.

For analyzing Delphes trees and writing histograms, I created a MakeSelector()-based setup, with instructions available in this GitHub repository.

↑ back to top

Gridpack production in lxplus


The previous sections describe how to generate MC completely outside the CMSSW framework, using Delphes to approximate detector simulation. This section focuses on generating gridpacks, which are required to produce MC samples within the CMSSW framework, where full detector simulation is performed with GEANT4. This section is completely independent of the earlier ones.

MC generation in CMS follows a standard sequence: Gridpack > LHEGS > Premix > AODSIM > MINIAODSIM > NANOAODSIM. These steps include event generation (e.g., MadGraph), parton showering and hadronization (Pythia), detector simulation (GEANT4 via CMSSW), and full reconstruction. Each step is configured through CMSSW fragments, and production is aligned with centrally defined campaign configurations. Users generate gridpacks using the gridpack_generation.sh script from the GitHub:genproductions/MadGraph5_aMCatNLO tool, specifying the model and process. For central production, the gridpack, Pythia fragment, number of events, and other metadata are provided to the NPS MC contact. For local validation, the gridpack is processed with cmsDriver.py to create GEN-SIM or full AOD workflows.

Prerequisites

  • Access to CMS computing grid (lxplus).
  • Access to CRAB client and gridpack tools.
  • CMS VOMS proxy, in case any external file is used during production (such as a pile-up profile).
  • A BSM model UFO file (contact a theorist for this).
  • Required cards/fragments in correct format (discuss this with the NPS MC contact).

Setting up the work area

As an example, I am producing a VLL doublet (electron type) sample with the masses of both new leptons set to 600 GeV. I prepared some helper scripts to manage and organize the outputs. These are to be brought into the work area as follows.

  1. modeldict.yaml: YAML file containing information on the signal models. This is used to auto-generate the cards from a template.
  2. Template cards which can be customized according to the needs of each model:
  3. Python scripts to generate the cards and gridpacks:
  4. Python script for organization:

Next, the genproductions tool is cloned inside a directory named Run3.

                mkdir Run3 && cd Run3
                git clone https://github.com/cms-sw/genproductions.git --depth=1
              

For Run 2, a specific branch, mg265UL, is recommended, which is why it should be kept in a separate clone to avoid conflicts.

                mkdir Run2 && cd Run2
                git clone https://github.com/cms-sw/genproductions.git --depth=1 -b mg265UL
              

At this point, the work area should look like this:

                .
                ├── Run3
                │   └── genproductions
                ├── generate_cards.py
                ├── generate_one_gridpack.py
                ├── modeldict.yaml
                ├── move_cards.py
                └── templates
                    ├── customizecards.dat
                    ├── extramodels.dat
                    ├── proc_card_doublet.dat
                    ├── proc_card_singlet.dat
                    └── run_card.dat
              

Generating cards

The card generation is automatically handled by generate_cards.py which loops over the different mass points described in the YAML file, fills up the parameters (such as beam energy, decay leptons, masses of the VLLs etc.) in the template cards and creates a set of cards for each mass point. After this, move_cards.py is used to move the generated cards to the genproductions/bin/MadGraph5_aMCatNLO/cards/VLL directory. Now the setup is ready to generate gridpacks. At this point, relevant files/directories in the genproductions setup should look like this:

                .
                └── Run3
                    └── genproductions
                        └──bin
                            └──MadGraph5_aMCatNLO
                                ├── PLUGIN
                                ├── Utilities
                                ├── cards
                                │   └── VLL
                                └── gridpack_generation.sh
              
Note: For gridpack production, the CMSSW environment is not needed, but the CMSSW release and the SCRAM architecture can be specified while running the tool. This is important for ensuring compatibility with the downstream workflow. Also, make sure that the BSM model (VLL.tgz in this case) is available in the central repository (cms-project-generators).
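The template-filling idea behind generate_cards.py can be sketched as follows. Everything here is hypothetical (the mass list, the MVLL parameter name); the real script reads modeldict.yaml and fills the actual template .dat files.

```python
# Hypothetical mass points; the real list lives in modeldict.yaml.
models = [
    {"name": "VLLD_ele_M600", "mass": 600},
    {"name": "VLLD_ele_M800", "mass": 800},
]

# Hypothetical template line; the real templates sit in templates/*.dat.
template = "{mass} = MVLL  ! vector-like lepton mass in GeV"

# Fill the template once per mass point, keyed by the sample name.
cards = {m["name"]: template.format(**m) for m in models}
for name, line in cards.items():
    print(name, "->", line)
```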


A note on VLL M-100: In the cards, the logic for producing the vector-like charged lepton L, and the neutral lepton N (both of which couple to muons in this example) is the following.

                define lepton = mu+ mu- vm vm~                                                          # Taken from YAML
                generate p p > L L, (L > z lepton), (L > h lepton)                                      # Pair production
                add process p p > N N, (N > w+ lepton), (N > w- lepton)                                 # Pair production
                add process p p > L N, (L > z lepton), (L > h lepton), (N > w+ lepton), (N > w- lepton) # Associated production
              

However, there is an issue with these at the low mass point. For M = 100 GeV, some of the decay chains (e.g., L > h + lepton with mH = 125 GeV) are not kinematically allowed if the Higgs is on-shell. MadGraph tries to include these subprocesses but fails midway, and the same is true for the associated production. The gridpack generation tries to compile all 24 subprocesses regardless of whether they are viable, and if even one fails, the entire gridpack generation fails. Since MadGraph is designed to produce on-shell bosons with minimal instructions, it is safest to skip this mass point if it's not too important.
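The kinematic closure is easy to verify numerically: an on-shell two-body decay L → h + μ requires the parent mass to exceed the sum of the daughter masses.

```python
# Why the M = 100 GeV point fails: the on-shell decay L -> h + mu
# requires m_L > m_h + m_mu.
m_L, m_h, m_mu = 100.0, 125.0, 0.106  # masses in GeV
open_channel = m_L > m_h + m_mu
print(f"L -> h mu open at m_L = {m_L} GeV: {open_channel}")

# At a higher mass point the channel opens up.
print(f"L -> h mu open at m_L = 200 GeV: {200.0 > m_h + m_mu}")
```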

Generating one gridpack

Just to generate one gridpack, the following command can be run from the genproductions/bin/MadGraph5_aMCatNLO directory.

                chmod +x gridpack_generation.sh
                ./gridpack_generation.sh VLLD_ele_M600 cards/VLL/VLLD_ele_M600 local ALL el8_amd64_gcc10 CMSSW_12_4_8
              

Where,

  • VLLD_ele_M600: Name of the gridpack to be generated.
  • cards/VLL/VLLD_ele_M600: Path to the process card directory.
  • local: Run mode (condor can be used here).
  • ALL: Indicates that all available cores should be used. Number of cores can also be mentioned here.
  • el8_amd64_gcc10: SCRAM architecture (must be compatible with the CMSSW release).
  • CMSSW_12_4_8: Version of CMSSW to be used for the subsequent steps.

To simplify the process, I have automated gridpack generation using generate_one_gridpack.py. This script automatically sets the run mode, SCRAM architecture, and CMSSW version - requiring only the sample name as input. After generating the gridpack, it also transfers the output to EOS to conserve space in the AFS workspace. The script can be executed from the base working directory as shown below:

 python3 generate_one_gridpack.py --name VLLD_ele_M600

Full automation has not been implemented for this setup, as gridpack generation is a one-time task and the process cards are rarely modified. The resulting gridpacks will be available in the EOS directory.

Note: The genproductions/bin/MadGraph5_aMCatNLO directory also contains several submit_*.sh scripts, such as submit_cmsconnect_gridpack_generation.sh and submit_condor_gridpack_generation.sh. These are wrapper scripts around gridpack_generation.sh, originally created to simplify submission in specific environments like CMSConnect or HTCondor. While they can still be useful for batch submission, site-specific configurations, or legacy workflows, they are not strictly necessary. The current setup uses this main script directly, without relying on the wrappers.

↑ back to top

Gridpack validation in lxplus


This section describes the local validation of the gridpacks after they are successfully generated, i.e., running the full production chain to produce NanoAOD. I am taking VLLD_ele_M600 as an example. First, pick a separate work area for validating the test gridpack and unpack it as follows.

                mkdir VLLD_ele_M600
                tar -xf VLLD_ele_M600_el8_amd64_gcc10_CMSSW_12_4_8_tarball.tar.xz -C VLLD_ele_M600
              

This creates a predefined structure required for event generation. A breakdown of the key contents is as follows:

                VLLD_ele_M600
                ├── InputCards
                ├── gridpack_generation.log
                ├── merge.pl
                ├── mgbasedir
                ├── process
                │   ├── madevent
                │   └── run.sh
                └── runcmsgrid.sh
              
  • InputCards/: Contains the cards used to generate the gridpack.
  • gridpack_generation.log: Full log of the gridpack creation process for debugging.
  • mgbasedir/: Contains the full MadGraph installation with model files, binaries etc. needed to reproduce the generation.
  • process/: Includes process-specific configuration and scripts, notably run.sh for launching the internal generation step.
  • runcmsgrid.sh: The main script used to generate LHE events.

Gridpack ≫ LHE

Once the gridpack has been extracted, parton-level events in LHE (Les Houches Event) format can be produced using the runcmsgrid.sh script. This script is included inside the unpacked gridpack directory and serves as the interface to MadGraph, MadSpin (if applicable), and Pythia (if configured). It prepares the runtime environment, handles all necessary steps of event generation, and produces a cmsgrid_final.lhe file containing the events. To run the script for a small test sample of events:

                cd VLLD_ele_M600
                ./runcmsgrid.sh 1000 12345 4
              

Where:

  • 1000 is the number of events to generate
  • 12345 is the random seed
  • 4 is the number of threads or parallel jobs to use

This also creates a temporary CMSSW environment under the hood and ensures all dependencies are properly set up for the run. The output file cmsgrid_final.lhe contains all the generated parton-level four-vectors, with each event uniquely defined. Make sure to generate enough events here to support the statistics needed for downstream steps such as full simulation and analysis.
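A quick way to check how many events actually landed in the output is to count the <event> tags. A minimal sketch, with a toy string standing in for the real file (which would be read with open("cmsgrid_final.lhe").read()):

```python
# Minimal sketch: count the events in an LHE file by counting <event> tags.
# The toy string below stands in for cmsgrid_final.lhe.
toy_lhe = """<LesHouchesEvents version="3.0">
<event>
 ...
</event>
<event>
 ...
</event>
</LesHouchesEvents>"""

n_events = toy_lhe.count("<event>")
print(f"events found: {n_events}")
```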

LHE ≫ GENSIM

This step takes the parton-level LHE events, runs hadronization and parton showering with Pythia8, and simulates the interaction of the resulting particles with the detector using GEANT4, producing GEN-SIM files. A CMSSW configuration is required for this step. The gridpack was produced with CMSSW_12_4_8; however, I will proceed with CMSSW_13_0_13 to match the 2022 (preEE) MC conditions. First, set up the CMSSW environment and directory structure as follows.

Note: The exact CMSSW version used for this step typically does not matter much, as long as it is compatible with the target era (e.g., Run 3). For Run 3, versions like CMSSW_12_4_X or CMSSW_13_0_X can be used interchangeably for GEN-SIM. However, Run 2 productions require a separate branch of the genproductions repository and use an older version of MadGraph, along with older CMSSW versions executed inside a cmssw-el7 container. These setups are not interchangeable with the Run 3 setup.

                echo $SCRAM_ARCH   # It has to be compatible with the target CMSSW release
                cmsrel CMSSW_13_0_13
                cd CMSSW_13_0_13/src/
                mkdir -p Configuration/GenProduction/python/
              

The Configuration/GenProduction/python directory stores the fragment (configuration script) describing how to hadronize the LHE events. The directory structure matters because cmsDriver.py (the configuration-builder tool) expects fragments to be importable under this namespace. Here is an example of a hadronizer that does not use externalLHEProducer (since we are testing a local LHE file).

                import FWCore.ParameterSet.Config as cms

                from Configuration.Generator.Pythia8CommonSettings_cfi import *
                from Configuration.Generator.MCTunes2017.PythiaCP5Settings_cfi import *            ## Used till 2022
                #from Configuration.Generator.MCTunesRun3ECM13p6TeV.PythiaCP5Settings_cfi import * ## From 2023 onwards
                from Configuration.Generator.PSweightsPythia.PythiaPSweightsSettings_cfi import *

                generator = cms.EDFilter("Pythia8ConcurrentHadronizerFilter",
                    comEnergy = cms.double(13600.),
                    maxEventsToPrint = cms.untracked.int32(1),
                    pythiaPylistVerbosity = cms.untracked.int32(1),
                    pythiaHepMCVerbosity = cms.untracked.bool(False),
                    nAttempts = cms.uint32(1),
                    PythiaParameters = cms.PSet(
                        pythia8CommonSettingsBlock,
                        pythia8CP5SettingsBlock,
                        pythia8PSweightsSettingsBlock,
                        parameterSets = cms.vstring(
                            'pythia8CommonSettings',
                            'pythia8CP5Settings',
                            'pythia8PSweightsSettings'
                        )
                    )
                )

                ProductionFilterSequence = cms.Sequence(generator)
              
Note: Make sure that the hadronization energy (in this case, comEnergy = cms.double(13600.)) matches the center-of-mass energy of the LHE events. This is crucial for correct hadronization and particle showering.
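One way to verify this is to read the beam energies directly from the LHE <init> block, whose first line lists the two beam IDs followed by the two beam energies (this field layout follows the Les Houches accord). A minimal sketch:

```python
# Check that the beam energies recorded in the LHE <init> block add up to the
# comEnergy set in the hadronizer fragment (13600. GeV in this example).
# The first line after <init> is: idbm1 idbm2 ebm1 ebm2 <PDF/weight info...>

def lhe_com_energy(path):
    with open(path) as f:
        for line in f:
            if line.strip() == "<init>":
                fields = next(f).split()
                return float(fields[2]) + float(fields[3])
    raise ValueError("no <init> block found")

# Usage: lhe_com_energy("cmsgrid_final.lhe") should return 13600.0
# for the 13.6 TeV gridpack used here.
```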

Save it as Configuration/GenProduction/python/myfragment.py. Once the fragment is in place, compile the setup.

                scram b -j8
                which cmsDriver.py
              

This should display the path to the cmsDriver.py available in the current CMSSW environment. This tool converts the generated LHE file into a GEN-SIM file. For debugging purposes, it is often useful to first generate the configuration file with the --no_exec option; the config can then be inspected or modified before executing. Adjust filenames, number of events, and detector-specific settings such as the beamspot and global conditions as needed. The following example uses Run3Summer22 (preEE) conditions.

                cmsDriver.py Configuration/GenProduction/python/myfragment.py \
                  --filein file:../../cmsgrid_final.lhe \
                  --fileout file:VLLD_ele_M600_GENSIM.root \
                  --mc \
                  --eventcontent RAWSIM \
                  --datatier GEN-SIM \
                  --beamspot Realistic25ns13p6TeVEarly2022Collision \
                  --step GEN,SIM \
                  --nThreads 8 \
                  --geometry DB:Extended \
                  --era Run3 \
                  --conditions 130X_mcRun3_2022_realistic_v5 \
                  --customise Configuration/DataProcessing/Utils.addMonitoring \
                  --python_filename cfg_1_GENSIM.py \
                  --no_exec \
                  -n 100
              
Note: Copy-pasting the command directly from the webpage into the terminal may introduce formatting issues (e.g. invisible line breaks or special characters). It is recommended to paste it into a text editor first, review the parameters, and adjust them to match the intended MC campaign.

This creates a configuration file named cfg_1_GENSIM.py without running the job immediately. Matching the parameters (--conditions, --beamspot, and --era) to the intended simulation campaign and CMSSW release is important. Finally, run the generation step.

cmsRun cfg_1_GENSIM.py

This produces a GEN-SIM ROOT file named VLLD_ele_M600_GENSIM.root from the LHE input, suitable for the next simulation steps. This file contains both the generator-level information (i.e. final-state particles from MadGraph/Pythia) and the full detector simulation output from GEANT4, which emulates how those particles would interact with the CMS detector. This includes simulated energy deposits, tracking hits, and digitized detector responses. The GEN-SIM step is notably slow and computationally heavy because of the detailed physics and geometry involved in the simulation. While crucial for realistic MC production, such jobs are typically done on grid resources rather than locally, except for small-scale validation like this. During the validation, it also runs the GenXsecAnalyzer, which computes the cross-section of the generated events, and displays the time taken per event. Estimation of time-per-event and size-per-event is needed so that PdmV can estimate how long the central production might take.
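The cross-section can be pulled out of the job log automatically. The sketch below assumes the "After filter: final cross section = X +- Y pb" line format that GenXsecAnalyzer prints in recent releases; adjust the pattern if your release formats it differently.

```python
# Extract the GenXsecAnalyzer cross-section summary from a GEN-SIM log.
# Assumed line format: "After filter: final cross section = X +- Y pb".
import re

XSEC_RE = re.compile(r"final cross section\s*=\s*([\d.eE+-]+)\s*\+-\s*([\d.eE+-]+)\s*pb")

def parse_xsec(log_text):
    """Return (xsec, uncertainty) in pb, or None if no summary line is found."""
    m = XSEC_RE.search(log_text)
    if not m:
        return None
    return float(m.group(1)), float(m.group(2))

# Usage: parse_xsec(open("gensim.log").read())
```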

GENSIM ≫ DIGIRAW

In this step, the simulated detector hits are converted into the RAW detector data format, and this is where pileup interactions can be added by overlaying minimum-bias events on top of the hard scattering. Access to pileup datasets from DBS requires a valid VOMS proxy. In this example, I am not providing a pileup dataset.

                cmsDriver.py step1 \
                  --filein file:VLLD_ele_M600_GENSIM.root \
                  --fileout file:VLLD_ele_M600_DIGIRAW.root \
                  --eventcontent FEVTDEBUGHLT \
                  --datatier GEN-SIM-DIGI-RAW \
                  --step DIGI,L1,DIGI2RAW,HLT:@fake2 \
                  --nThreads 8 \
                  --geometry DB:Extended \
                  --era Run3 \
                  --conditions 130X_mcRun3_2022_realistic_v5 \
                  --beamspot Realistic25ns13p6TeVEarly2022Collision \
                  --customise Configuration/DataProcessing/Utils.addMonitoring \
                  --python_filename cfg_2_DIGIRAW.py \
                  --no_exec \
                  --mc \
                  -n 100
              
cmsRun cfg_2_DIGIRAW.py
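For reference, pileup would be mixed in at this step by adding the --pileup and --pileup_input options to the cmsDriver.py command above. Both are standard cmsDriver.py options, but the scenario name and premix dataset below are placeholders to illustrate the shape of the command, not a recommendation for any campaign; reading a "dbs:" dataset also requires the VOMS proxy mentioned above.

```shell
# Sketch of the extra pileup arguments for the DIGI-RAW cmsDriver.py call.
# Scenario and dataset names are placeholders; pick the ones matching
# your MC campaign from the production tools.
PU_ARGS='--pileup Run3_Flat55To75_PoissonOOTPU --pileup_input "dbs:/MinBias_TuneCP5_13p6TeV-pythia8/EXAMPLE-CAMPAIGN/GEN-SIM"'
echo "cmsDriver.py step1 ... ${PU_ARGS}"
```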

DIGIRAW ≫ AOD

In this step, the full detector reconstruction is performed on the DIGI-RAW data, producing reconstructed physics objects such as tracks, jets, electrons, and muons. The resulting AODSIM file contains the complete set of reconstructed objects from which the slimmer analysis formats below are derived.

                cmsDriver.py step2 \
                  --filein file:VLLD_ele_M600_DIGIRAW.root \
                  --fileout file:VLLD_ele_M600_AOD.root \
                  --eventcontent AODSIM \
                  --datatier AODSIM \
                  --step RAW2DIGI,L1Reco,RECO,RECOSIM \
                  --nThreads 8 \
                  --geometry DB:Extended \
                  --era Run3 \
                  --conditions 130X_mcRun3_2022_realistic_v5 \
                  --beamspot Realistic25ns13p6TeVEarly2022Collision \
                  --customise Configuration/DataProcessing/Utils.addMonitoring \
                  --python_filename cfg_3_AOD.py \
                  --no_exec \
                  --mc \
                  -n 100
              
cmsRun cfg_3_AOD.py

AOD ≫ MINIAOD

MINIAOD is a reduced format derived from AOD: reconstruction is not redone; instead, the data are skimmed and slimmed to keep only the essential reconstructed objects and variables. Some high-level corrections and DNN-based identification variables can be added at this stage.

                cmsDriver.py step3 \
                  --filein file:VLLD_ele_M600_AOD.root \
                  --fileout file:VLLD_ele_M600_MINIAOD.root \
                  --eventcontent MINIAODSIM \
                  --datatier MINIAODSIM \
                  --step PAT \
                  --nThreads 8 \
                  --geometry DB:Extended \
                  --era Run3 \
                  --conditions 130X_mcRun3_2022_realistic_v5 \
                  --beamspot Realistic25ns13p6TeVEarly2022Collision \
                  --customise Configuration/DataProcessing/Utils.addMonitoring \
                  --python_filename cfg_4_MINIAOD.py \
                  --no_exec \
                  --mc \
                  -n 100
              
cmsRun cfg_4_MINIAOD.py

MINIAOD ≫ NANOAOD

NanoAOD further reduces the data size for fast physics analysis, containing selected reconstructed objects and variables, often including DNN outputs for particle identification or event classification. No reconstruction is performed here; it uses the objects produced in previous steps.

                cmsDriver.py step4 \
                  --filein file:VLLD_ele_M600_MINIAOD.root \
                  --fileout file:VLLD_ele_M600_NANOAOD.root \
                  --eventcontent NANOAODSIM \
                  --datatier NANOAODSIM \
                  --step NANO \
                  --nThreads 8 \
                  --geometry DB:Extended \
                  --era Run3 \
                  --conditions 130X_mcRun3_2022_realistic_v5 \
                  --beamspot Realistic25ns13p6TeVEarly2022Collision \
                  --customise Configuration/DataProcessing/Utils.addMonitoring \
                  --python_filename cfg_5_NANOAOD.py \
                  --no_exec \
                  --mc \
                  -n 100
              
cmsRun cfg_5_NANOAOD.py > nanoaod.log 2>&1
Note: If events fail during CMS production steps (e.g., GEN-SIM, DIGI-RAW, AOD, MiniAOD, NanoAOD), check the log for errors or failed event numbers. To avoid crashes in the subsequent steps, you can reduce the number of events using -n or skip specific events with process.source.skipEvents. To skip events that raise ProductNotFound errors automatically, add process.options = cms.untracked.PSet(SkipEvent = cms.untracked.vstring('ProductNotFound')) to the config. Always validate the outputs with edmDumpEventContent to ensure key branches are populated.
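To triage such failures quickly, the log can be scanned for the CMSSW exception banner and the offending categories collected. The "An exception of category '...' occurred" wording below is the typical cmsRun banner; adjust the pattern if your release prints it differently.

```python
# Scan a cmsRun log for fatal exceptions and report their categories
# (e.g. 'ProductNotFound'), to decide which events or steps to skip.
import re

CATEGORY_RE = re.compile(r"An exception of category '([^']+)' occurred")

def find_exception_categories(log_text):
    """Return the list of exception categories found in the log, in order."""
    return CATEGORY_RE.findall(log_text)

# Usage: find_exception_categories(open("nanoaod.log").read())
```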