How To Pass Python Package To Spark Job And Invoke Main File From Package With Arguments
Solution 1:
Specific to your question, you need to use --py-files to include Python files that should be made available on the PYTHONPATH.
I just ran into a similar problem where I wanted to run a module's main function from a module inside an egg file.
The wrapper code below can be used to run main for any module via spark-submit. For this to work, you need to drop it into a Python file whose filename is the fully qualified package and module name. The filename is then used inside the wrapper to identify which module to run. This makes for a more natural way of executing packaged modules without needing to add extra arguments (which can get messy).
Here's the script:
"""
Wrapper script to use when running Python packages via egg file through spark-submit.
Rename this script to the fully qualified package and module name you want to run.
The module should provide a ``main`` function.
Pass any additional arguments to the script.
Usage:
spark-submit --py-files <LIST-OF-EGGS> <PACKAGE>.<MODULE>.py <MODULE_ARGS>
"""import os
import importlib
defmain():
filename = os.path.basename(__file__)
module = os.path.splitext(filename)[0]
module = importlib.import_module(module)
module.main()
if __name__ == '__main__':
main()
You won't need to modify any of this code. It's all dynamic and driven from the filename.
As an example, if you drop this into mypackage.mymodule.py and use spark-submit to run it, then the wrapper will import mypackage.mymodule and run main() on that module. All command line arguments are left intact, and will be naturally picked up by the module being executed.
You will need to include any egg files and other supporting files in the command. Here's an example:
spark-submit --py-files mypackage.egg mypackage.mymodule.py --module-arg1 value1
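For completeness, the module inside the egg only needs to expose a main function; everything else is up to you. A minimal sketch of what mypackage/mymodule.py might contain (the argument handling here is purely illustrative):

    import sys

    def main():
        # Arguments passed after the wrapper filename arrive untouched in sys.argv.
        print(f"Running mymodule with args: {sys.argv[1:]}")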
Solution 2:
There are a few basic steps:

- Create a Python package.
- Either build an egg file or create a simple zip archive.
- Add the package as a dependency using --py-files / pyFiles.
- Create a thin main.py which invokes functions from the package (see the sketch below) and submit it to the Spark cluster.
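A minimal sketch of such a thin main.py, assuming the package is called mypackage and exposes a run_job entry point (both names are hypothetical):

    import sys
    from pyspark.sql import SparkSession

    # mypackage is shipped to the executors via --py-files; run_job is a
    # hypothetical function defined inside it.
    from mypackage import run_job

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("my-job").getOrCreate()
        run_job(spark, sys.argv[1:])
        spark.stop()

It could then be submitted with something like spark-submit --py-files mypackage.zip main.py arg1 arg2.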
Solution 3:
A convenient solution in general, IMO, is to use setuptools. It comes in handy when your project has many dependent packages (Python files).

- First, put an empty __init__.py file in each directory that contains code to be imported. This makes the package visible to the import statement.
- After installing setuptools, create a simple set_up.py file (sample below). Remember to place it in the outermost directory, which is your src folder.

    from setuptools import setup, find_packages

    setup(
        name="name_of_lib",
        version="0.0.1",
        author="your name",
        packages=find_packages()
    )

  The find_packages() call will collect all your packages within that folder recursively.
- Run python set_up.py bdist_egg, and finally check out the .egg file in the dist folder.
- Submit your job using either --py-files or addPyFile of SparkContext (a sketch of the latter is shown below).
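A minimal sketch of the addPyFile route; the egg filename below is an assumption, so use whatever bdist_egg actually produced in your dist folder:

    from pyspark import SparkContext

    sc = SparkContext(appName="my-job")
    # Ship the egg to every node so its packages can be imported inside tasks.
    # The filename is an example; use the one bdist_egg produced in dist/.
    sc.addPyFile("dist/name_of_lib-0.0.1-py3.7.egg")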
And it works like a charm!
Solution 4:
Add this to your PYTHONPATH environment variable: /path-to-your-spark-directory/python.
Also, your PATH variable should include the location of spark/bin.
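If you would rather do the PYTHONPATH part from inside a script, a rough equivalent is the sketch below; the Spark directory is a placeholder you must adjust, and the py4j zip bundled under python/lib also has to be on the path for pyspark to import:

    import glob
    import sys

    spark_home = "/path-to-your-spark-directory"  # placeholder; point at your Spark install
    sys.path.append(spark_home + "/python")
    # The py4j zip name varies by Spark version, hence the glob.
    sys.path.extend(glob.glob(spark_home + "/python/lib/py4j-*-src.zip"))

    import pyspark  # should now resolve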
Solution 5:
If you want a bit more flexibility, i.e., to run files, modules or even a script specified on the command line, you can use something like the following launcher script:
launcher.py
import runpy
import sys
from argparse import ArgumentParser


def split_passthrough_args():
    args = sys.argv[1:]
    try:
        sep = args.index('--')
        return args[:sep], args[sep + 1:]
    except ValueError:
        return args, []


def main():
    parser = ArgumentParser(description='Launch a python module, file or script')
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument('-m', type=str, help='Module to run', dest='module')
    source.add_argument('-f', type=str, help="File to run", dest='file')
    source.add_argument('-c', type=str, help='Script to run', dest='script')
    parser.add_argument('--', nargs='*', help='Arguments', dest="arg")

    self_args, child_args = split_passthrough_args()
    args = parser.parse_args(self_args)

    # Everything after '--' is handed to the launched module/file/script
    # as if it had been invoked directly.
    sys.argv = [sys.argv[0], *child_args]

    if args.file:
        runpy.run_path(args.file, {}, "__main__")
    elif args.module:
        runpy.run_module(f'{args.module}.__main__', {}, "__main__")
    else:
        runpy._run_code(args.script, {}, {}, "__main__")


if __name__ == "__main__":
    main()
It tries to emulate the Python interpreter's behavior, so when you have a package with the following module hierarchy
mypackage
    mymodule
        __init__.py
        __main__.py
where __main__.py
contains the following:
import sys
if __name__ == "__main__":
    print(f"Hello {sys.argv[1]}!")
which you built and packaged as mypackage.whl
; you can run it with
spark-submit --py-files mypackage.whl launcher.py -m mypackage.mymodule -- World
Supposing the package is preinstalled and available at /my/path/mypackage on the driver:
spark-submit launcher.py -f /my/path/mypackage/mymodule/__main__.py -- World
You could even submit a script:
spark-submit launcher.py -c "import sys; print(f'Hello {sys.argv[1]}')" -- World