How To Pass Python Package To Spark Job And Invoke Main File From Package With Arguments
Solution 1:
Specific to your question, you need to use --py-files to include Python files that should be made available on the PYTHONPATH.
I just ran into a similar problem where I wanted to run a module's main function from a module inside an egg file.
The wrapper code below can be used to run main for any module via spark-submit. For this to work, you need to drop it into a Python file whose filename is the fully qualified package and module name. The filename is then used inside the wrapper to identify which module to run. This makes for a more natural way of executing packaged modules without needing to add extra arguments (which can get messy).
Here's the script:
"""
Wrapper script to use when running Python packages via egg file through spark-submit.
Rename this script to the fully qualified package and module name you want to run.
The module should provide a ``main`` function.
Pass any additional arguments to the script.
Usage:
spark-submit --py-files <LIST-OF-EGGS> <PACKAGE>.<MODULE>.py <MODULE_ARGS>
"""import os
import importlib
defmain():
filename = os.path.basename(__file__)
module = os.path.splitext(filename)[0]
module = importlib.import_module(module)
module.main()
if __name__ == '__main__':
main()
You won't need to modify any of this code. It's all dynamic and driven from the filename.
As an example, if you drop this into mypackage.mymodule.py and use spark-submit to run it, then the wrapper will import mypackage.mymodule and run main() on that module. All command line arguments are left intact, and will be naturally picked up by the module being executed.
You will need to include any egg files and other supporting files in the command. Here's an example:
spark-submit --py-files mypackage.egg mypackage.mymodule.py --module-arg1 value1
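For completeness, the module inside the egg only needs to expose a main function; everything else is up to you. A minimal sketch of what mypackage/mymodule.py might contain (the argument handling here is purely illustrative):

    import sys

    def main():
        # Arguments passed after the wrapper filename arrive untouched in sys.argv.
        print(f"Running mymodule with args: {sys.argv[1:]}")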
Solution 2:
There are a few basic steps:

- Create a Python package.
- Either build an egg file or create a simple zip archive.
- Add the package as a dependency using --py-files / pyFiles.
- Create a thin main.py which invokes functions from the package (see the sketch below) and submit it to the Spark cluster.
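A minimal sketch of such a thin main.py, assuming the package is called mypackage and exposes a run_job entry point (both names are hypothetical):

    import sys
    from pyspark.sql import SparkSession

    # mypackage is shipped to the executors via --py-files; run_job is a
    # hypothetical function defined inside it.
    from mypackage import run_job

    if __name__ == "__main__":
        spark = SparkSession.builder.appName("my-job").getOrCreate()
        run_job(spark, sys.argv[1:])
        spark.stop()

It could then be submitted with something like spark-submit --py-files mypackage.zip main.py arg1 arg2.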
Solution 3:
A convenient solution in general, IMO, is to use setuptools. It comes in handy when your project has many dependent packages (Python files).

- First, put an empty __init__.py file in each directory that contains code to be imported. This makes the package visible to the import statement.
- After installing setuptools, create a simple set_up.py file (sample below). Remember to place it in the outermost directory, which is your src folder.

    from setuptools import setup, find_packages

    setup(
        name="name_of_lib",
        version="0.0.1",
        author="your name",
        packages=find_packages()
    )

  The find_packages() call will collect all your packages within that folder recursively.
- Run python set_up.py bdist_egg, and finally check out the .egg file in the dist folder.
- Submit your job using either --py-files or addPyFile of SparkContext (a sketch of the latter is shown below).
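A minimal sketch of the addPyFile route; the egg filename below is an assumption, so use whatever bdist_egg actually produced in your dist folder:

    from pyspark import SparkContext

    sc = SparkContext(appName="my-job")
    # Ship the egg to every node so its packages can be imported inside tasks.
    # The filename is an example; use the one bdist_egg produced in dist/.
    sc.addPyFile("dist/name_of_lib-0.0.1-py3.7.egg")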
And it works like a charm!
Solution 4:
Add this to your PYTHONPATH environment variable: /path-to-your-spark-directory/python.
Also, your PATH variable should include the location of spark/bin.
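If you would rather do the PYTHONPATH part from inside a script, a rough equivalent is the sketch below; the Spark directory is a placeholder you must adjust, and the py4j zip bundled under python/lib also has to be on the path for pyspark to import:

    import glob
    import sys

    spark_home = "/path-to-your-spark-directory"  # placeholder; point at your Spark install
    sys.path.append(spark_home + "/python")
    # The py4j zip name varies by Spark version, hence the glob.
    sys.path.extend(glob.glob(spark_home + "/python/lib/py4j-*-src.zip"))

    import pyspark  # should now resolve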
Solution 5:
If you want a bit more flexibility, i.e., to run files, modules or even a script specified on the command line, you can use something like the following launcher script:
launcher.py
import runpy
import sys
from argparse import ArgumentParser


def split_passthrough_args():
    args = sys.argv[1:]
    try:
        sep = args.index('--')
        return args[:sep], args[sep + 1:]
    except ValueError:
        return args, []


def main():
    parser = ArgumentParser(description='Launch a python module, file or script')
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument('-m', type=str, help='Module to run', dest='module')
    source.add_argument('-f', type=str, help="File to run", dest='file')
    source.add_argument('-c', type=str, help='Script to run', dest='script')
    parser.add_argument('--', nargs='*', help='Arguments', dest="arg")

    self_args, child_args = split_passthrough_args()
    args = parser.parse_args(self_args)

    # Everything after '--' is handed to the launched module/file/script
    # as if it had been invoked directly.
    sys.argv = [sys.argv[0], *child_args]

    if args.file:
        runpy.run_path(args.file, {}, "__main__")
    elif args.module:
        runpy.run_module(f'{args.module}.__main__', {}, "__main__")
    else:
        runpy._run_code(args.script, {}, {}, "__main__")


if __name__ == "__main__":
    main()
It tries to emulate the Python interpreter's behavior, so when you have a package with the following module hierarchy
mypackage
    mymodule
        __init__.py
        __main__.py
where __main__.py
contains the following:
import sys
if __name__ == "__main__":
    print(f"Hello {sys.argv[1]}!")
which you built and packaged as mypackage.whl
; you can run it with
spark-submit --py-files mypackage.whl launcher.py -m mypackage.mymodule -- World
Supposing the package is preinstalled and available at /my/path/mypackage on the driver:
spark-submit launcher.py -f /my/path/mypackage/mymodule/__main__.py -- World
You could even submit a script:
spark-submit launcher.py -c "import sys; print(f'Hello {sys.argv[1]}')" -- World