diff options
| author | CoprDistGit <infra@openeuler.org> | 2023-05-10 06:05:29 +0000 |
|---|---|---|
| committer | CoprDistGit <infra@openeuler.org> | 2023-05-10 06:05:29 +0000 |
| commit | 27d87af662311d2e85111e1248e1f64611f996e4 (patch) | |
| tree | d67e1c847ff22ab29e5956dc25c3a2482f1702ff | |
| parent | d7f340d9c3306c73000a10996e76ba72f9e4a977 (diff) | |
automatic import of python-simeon
| -rw-r--r-- | .gitignore | 1 | ||||
| -rw-r--r-- | python-simeon.spec | 1061 | ||||
| -rw-r--r-- | sources | 1 |
3 files changed, 1063 insertions, 0 deletions
@@ -0,0 +1 @@ +/simeon-0.0.24.tar.gz diff --git a/python-simeon.spec b/python-simeon.spec new file mode 100644 index 0000000..7d011ad --- /dev/null +++ b/python-simeon.spec @@ -0,0 +1,1061 @@ +%global _empty_manifest_terminate_build 0 +Name: python-simeon +Version: 0.0.24 +Release: 1 +Summary: A CLI tool to help process research data from edX +License: MIT LICENSE +URL: https://github.com/MIT-IR/simeon +Source0: https://mirrors.nju.edu.cn/pypi/web/packages/41/1c/71f8c37b3b2a2b791e5bb96a6a81475783fb491e96fa976f55a038864c9a/simeon-0.0.24.tar.gz +BuildArch: noarch + +Requires: python3-boto3 +Requires: python3-google-cloud-bigquery +Requires: python3-google-cloud-storage +Requires: python3-jinja2 +Requires: python3-dateutil +Requires: python3-geoip2 +Requires: python3-sphinx +Requires: python3-tox + +%description +simeon +~~~~~~ + +``simeon`` is a CLI tool to help with the processing of edx Research +data. It can ``list``, ``download``, and ``split`` edX data packages. It +can also ``push`` the output of the ``split`` subcommand to both GCS and +BigQuery. It is heavily inspired by the +`edx2bigquery <https://github.com/mitodl/edx2bigquery>`__ package. If +you’ve used that tool, you should be able to navigate the quirks that +may come with this one. + +Installing with pip +~~~~~~~~~~~~~~~~~~~ + +.. code:: sh + + python3 -m pip install simeon + # Or with geoip + python3 -m pip install simeon[geoip] + # Then invoke the CLI tool with + simeon --help + +Installing with git clone +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: sh + + git clone git@github.com:MIT-IR/simeon.git + cd simeon && python -m pip install . + # Or with geoip + cd simeon && python -m pip install .[geoip] + # Then invoke the CLI tool with + simeon --help + +Using Docker +~~~~~~~~~~~~ + +.. code:: sh + + docker run -it mitir/simeon:latest + simeon --help + +Developing +~~~~~~~~~~ + +.. code:: sh + + git clone git@github.com:MIT-IR/simeon.git + cd simeon + # Set up a virtual environment if you don't already have on + python3 -m venv venv + . venv/bin/activate + # pip install the package in an editable way + python3 -m pip install -e .[test,geoip] + # Invoke the executable + simeon --help + # Run the tests + tox + # Write code and tests and submit PR's + +Setups and configurations +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``simeon`` is a glorified downloader and uploader set of scripts. Much +of the downloading and uploading that it does makes the assumptions that +you have your AWS credentials configured properly and that you’ve got a +service account file for GCP services available on your machine. If the +latter is missing, you may have to authenticate to GCP services through +the SDK. However, both we and Google recommend you not do that. + +Every downloaded file is decrypted either during the download process or +while it gets split by the ``simeon split`` command. So, this tool +assumes that you’ve installed and configured ``gpg`` to be able to +decrypt files from edX. + +The following steps may be useful to someone just getting started with +the edX data package: + +1. Credentials from edX + + - Reach out to edX to get your data czar credentials + - Configure both AWS and gpg, so your credentials can access the S3 + buckets and your ``gpg`` key can decrypt the files there + +2. Setup a GCP project + + - Create a GCP project + - Setup a BigQuery workspace + - Create a GCS bucket + - Create a service account and download the associated file + - Give the service account Admin Role access to both the BigQuery + project and the GCS bucket + +If the above steps are carried out successfully, then you should be able +to use ``simeon`` without any issues. + +However, if you’ve taken care of the above steps but are still unable to +get ``simeon`` to work, please open an issue. + +Further, ``simeon`` can parse INI formatted configuration files. It, by +default, looks for files in the user’s home directory, or in the current +working directory of the running process. The base names that are +targeted when config files are looked up are: ``simeon.cfg`` or +``.simeon.cfg`` or ``simeon.ini`` or ``.simeon.ini``. You can also +provide ``simeon`` with a config file by using the global option +``--config-file`` or ``-C`` and giving it a path to the file with the +corresponding configurations. + +The following is a sample file content: + +.. code:: sh + + # Default section for things like the organization whose data package is processed + # You can also set a default site as one of the following: edx, edge, patches + [DEFAULT] + site = edx + org = yourorganizationx + clistings_file = /path/to/file/with/course_ids + + # Section related to Google Cloud (project, bucket, service account) + [GCP] + project = your-gcp-project-id + bucket = your-gcs-bucket + service_account_file = /path/to/a/service_account_file.json + wait_for_loads = True + geo_table = your-gcp-project.geocode_latest.geoip + youtube_table = your-gcp-project.videos.youtube + youtube_token = your-YouTube-API-token + + # Section related to the AWS credentials needed to download data from S3 + [AWS] + aws_cred_file = ~/.aws/credentials + profile_name = default + +The options in the config file(s) should match the optional arguments of +the CLI tool. For instance, the ``--service-account-file``, +``--project`` and ``--bucket`` options can be provided under the ``GCP`` +section of the config file as ``service_account_file``, ``project`` and +``bucket``, respectively. Similarly, the ``--site`` and ``--org`` +options can be provided under the ``DEFAULT`` section as ``site`` and +``org``, respectively. + +List files +~~~~~~~~~~ + +``simeon`` can list files on S3 for your organization based on criteria +like file type (``sql`` or ``log`` or ``email``), time intervals (begin +and end dates), and site (``edx`` or ``edge`` or ``patches``). + +- Example: List the latest data packages for file types ``sql``, + ``email``, and ``log`` + + .. code:: sh + + # List the latest SQL bundle + simeon list -s edx -o mitx -f sql -L + # List the laetst email data dump + simeon list -s edx -o mitx -f email -L + # List the latest tracking log file + simeon list -s edx -o mitx -f log -L + +Download and split files +~~~~~~~~~~~~~~~~~~~~~~~~ + +``simeon`` can download, decrypt and split up files into folders +belonging to specific courses. + +- Example 1: Download, split and push SQL bundles to both GCS and + BigQuery + + .. code:: sh + + # Download the latest SQL bundle + simeon download -s edx -o mitx -f sql -L -d data/ + + # Download SQL bundles dumped any time since 2021-01-01 and + # extract the contents for course ID MITx/12.3x/1T2021. + # Place the downloaded files in data/ and the output of the split operation + # in data/SQL + simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f sql \ + -b 2021-01-01 -d data -S -D data/SQL/ + + # Push to GCS the split up SQL files inside data/SQL/MITx__12_3x__1T2021 + simeon push gcs -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/SQL/MITx__12_3x__1T2021 + + # Push the files to BigQuery and wait for the jobs to finish + # Using -s or --use-storage tells BigQuery to extract the files + # to be loaded from Google Cloud Storage. + # So, use the option when you've already called simeon push gcs + simeon push bq -w -s -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/SQL/MITx__12_3x__1T2021 + +- Example 2: Download, split and push tracking logs to both GCS and + BigQuery + + .. code:: sh + + # Download the latest tracking log file + simeon download -s edx -o mitx -f log -L -d data/ + + # Download tracking logs dumped any time since 2021-01-01 + # and extract the contents for course ID MITx/12.3x/1T2021 + # Place the downloaded files in data/ and the output of the split operation + # in data/TRACKING_LOGS + simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f log \ + -b 2021-01-01 -d data -S -D data/TRACKING_LOGS/ + + # Push to GCS the split up tracking log files inside + # data/TRACKING_LOGS/MITx__12_3x__1T2021 + simeon push gcs -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021 + + # Push the files to BigQuery and wait for the jobs to finish + # Using -s or --use-storage tells BigQuery to extract the files + # to be loaded from Google Cloud Storage. + # So, use the option when you've already called simeon push gcs + simeon push bq -w -s -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021 + +- If you already have downloaded SQL bundles or tracking log files, you + can use ``simeon split`` them up. + +Make secondary/aggregated tables +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``simeon`` can generate secondary tables based on already loaded data. +Call ``simeon report --help`` for the expected positional and optional +arguments. + +- Example: Make ``person_course`` for course ID ``MITx/12.3x/1T2021`` + + .. code:: sh + + # Make a person course table for course ID MITx/12.3x/1T2021 + # Provide the -g option to give a geolocation BigQuery table + # to fill the ip-to-location details in the generated person course table + COURSE=MITx/12.3x/1T2021 + simeon report -w -g "${GCP_PROJECT_ID}.geocode.geoip" -t "person_course" \ + -p ${GCP_PROJECT_ID} -S ${SAFILE} ${COURSE} + +Notes: +~~~~~~ + +1. Please note that SQL bundles are quite large when split up, so + consider using the ``-c`` or ``--courses`` option when invoking + ``simeon download -S`` or ``simeon split`` to make sure that you + limit the splitting to a set of course IDs. You may also use the + ``--clistings-file`` option, which expects a txt file of course IDs; + one ID per line. If the aforementioned options are not used, + ``simeon`` may end up failing to complete the split operation due to + exhausted system resources (storage to be specific). + +2. ``simeon download`` with file types ``log`` and ``email`` will both + download and decrypt the files matching the given criteria. If the + latter operations are successful, then the encrypted files are + deleted by default. This is to make sure that you don’t exhaust + storage resources. If you wish to keep those files, you can always + use the ``--keep-encrypted`` option that comes with + ``simeon download`` and ``simeon split``. SQL bundles are only + downloaded (not decrypted). Their decryption is done during a + ``split`` operation. + +3. Unless there is an unhandled exception (which should be reported as a + bug), ``simeon`` should, by default, print to the standard output + both information and errors encountered while processing your files. + You can capture those logs in a file by using the global option + ``--log-file`` and providing a destination file for the logs. + +4. When using multi argument options like ``--tables`` or ``--courses``, + you should try not to place them right before the expected positional + arguments. This will help the CLI parser not confuse your positional + arguments with table names (in the case of ``--tables``) or course + IDs (when ``--courses`` is used). + +5. Splitting tracking logs is a resource intensive process. The routine + that splits the logs generates a file for each course ID encountered. + If you happen to have more course IDs in your logs than the running + process can open operating system file descriptors, then ``simeon`` + will put away records it can’t save to disk for a second pass. + Putting away the records involves using more memory than normally + required. The second pass will only require one file descriptor at a + time, so it should be safe in terms of file descriptor limits. To + help ``simeon`` not have to do a second pass, you may increase the + file descriptor limits of processes from your shell by running + something like ``ulimit -n 2000`` before calling ``simeon split`` on + Unix machines. For Windows users, you may have to dig into the + Windows Registries for a corresponding setting. This should tell your + OS kernel to allow OS processes to open up to 2000 file handles. + +6. Care must be taken when using ``simeon split`` and ``simeon push`` to + make sure that the number of positional arguments passed does not + lead to the invoked command exceeding the maximum command-line length + allowed for arguments in a command. To avoid errors along those + lines, please consider passing the positional arguments as UNIX glob + patterns. For instance, + ``simeon split --file-type log 'data/TRACKING-LOGS/*/*.log.gz'`` + tells ``simeon`` to expand the given glob pattern, instead of relying + on the shell to do it. + +7. The ``report`` subcommand relies on the presence of SQL query files + to parse and send to BigQuery to execute. Any errors arising from + executing the parsed queries will be shown to the end user through + the given log stream. While the ``simeon`` tool ships with query + files for most secondary/reporting tables that are based on the + ``edx2bigquery`` tool, an end user should be able to point ``simeon`` + to a different location with SQL query files by using the + ``--query-dir`` option that comes with ``simeon report``. + Additionally, these query files can contain + ```jinja2 templated`` <https://jinja.palletsprojects.com/en/latest/>`__ + SQL code. Any mentioned variables within these templated queries can + be passed to ``simeon report`` by using the ``--extra-args`` option + and passing key-value pair items in the format + ``var1=value1,var2=value2,var3=value3,...,varn=valuen``. Further, + these key-value pair items can also be typed by using the format + ``var1:i=value1,var2:s=value2,var3:f=value3,...,varn:s=valuen``. In + this format, the type is append to the key, separated by a colon. The + only supported scalar types, so far, are ``s`` for ``str``, ``i`` for + ``int``, and ``f`` for ``float``. If any conversion errors occur + during value parsing, then those are shown to the end user, and the + query won’t get executed. Finally, if you wish to pass an ``array`` + or ``list`` to the template, you will need to repeat a key multiple + times. For instance, if you want to pass a list named ``mylist`` + containing the integers, you could write something like + ``--extra-args mylist:i=1,mylist:i=2,mylist:i=3``. This means that + you’ll have a python ``list`` named ``mylist`` within your template, + and it should contain ``[1, 2, 3]``. + + +%package -n python3-simeon +Summary: A CLI tool to help process research data from edX +Provides: python-simeon +BuildRequires: python3-devel +BuildRequires: python3-setuptools +BuildRequires: python3-pip +%description -n python3-simeon +simeon +~~~~~~ + +``simeon`` is a CLI tool to help with the processing of edx Research +data. It can ``list``, ``download``, and ``split`` edX data packages. It +can also ``push`` the output of the ``split`` subcommand to both GCS and +BigQuery. It is heavily inspired by the +`edx2bigquery <https://github.com/mitodl/edx2bigquery>`__ package. If +you’ve used that tool, you should be able to navigate the quirks that +may come with this one. + +Installing with pip +~~~~~~~~~~~~~~~~~~~ + +.. code:: sh + + python3 -m pip install simeon + # Or with geoip + python3 -m pip install simeon[geoip] + # Then invoke the CLI tool with + simeon --help + +Installing with git clone +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: sh + + git clone git@github.com:MIT-IR/simeon.git + cd simeon && python -m pip install . + # Or with geoip + cd simeon && python -m pip install .[geoip] + # Then invoke the CLI tool with + simeon --help + +Using Docker +~~~~~~~~~~~~ + +.. code:: sh + + docker run -it mitir/simeon:latest + simeon --help + +Developing +~~~~~~~~~~ + +.. code:: sh + + git clone git@github.com:MIT-IR/simeon.git + cd simeon + # Set up a virtual environment if you don't already have on + python3 -m venv venv + . venv/bin/activate + # pip install the package in an editable way + python3 -m pip install -e .[test,geoip] + # Invoke the executable + simeon --help + # Run the tests + tox + # Write code and tests and submit PR's + +Setups and configurations +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``simeon`` is a glorified downloader and uploader set of scripts. Much +of the downloading and uploading that it does makes the assumptions that +you have your AWS credentials configured properly and that you’ve got a +service account file for GCP services available on your machine. If the +latter is missing, you may have to authenticate to GCP services through +the SDK. However, both we and Google recommend you not do that. + +Every downloaded file is decrypted either during the download process or +while it gets split by the ``simeon split`` command. So, this tool +assumes that you’ve installed and configured ``gpg`` to be able to +decrypt files from edX. + +The following steps may be useful to someone just getting started with +the edX data package: + +1. Credentials from edX + + - Reach out to edX to get your data czar credentials + - Configure both AWS and gpg, so your credentials can access the S3 + buckets and your ``gpg`` key can decrypt the files there + +2. Setup a GCP project + + - Create a GCP project + - Setup a BigQuery workspace + - Create a GCS bucket + - Create a service account and download the associated file + - Give the service account Admin Role access to both the BigQuery + project and the GCS bucket + +If the above steps are carried out successfully, then you should be able +to use ``simeon`` without any issues. + +However, if you’ve taken care of the above steps but are still unable to +get ``simeon`` to work, please open an issue. + +Further, ``simeon`` can parse INI formatted configuration files. It, by +default, looks for files in the user’s home directory, or in the current +working directory of the running process. The base names that are +targeted when config files are looked up are: ``simeon.cfg`` or +``.simeon.cfg`` or ``simeon.ini`` or ``.simeon.ini``. You can also +provide ``simeon`` with a config file by using the global option +``--config-file`` or ``-C`` and giving it a path to the file with the +corresponding configurations. + +The following is a sample file content: + +.. code:: sh + + # Default section for things like the organization whose data package is processed + # You can also set a default site as one of the following: edx, edge, patches + [DEFAULT] + site = edx + org = yourorganizationx + clistings_file = /path/to/file/with/course_ids + + # Section related to Google Cloud (project, bucket, service account) + [GCP] + project = your-gcp-project-id + bucket = your-gcs-bucket + service_account_file = /path/to/a/service_account_file.json + wait_for_loads = True + geo_table = your-gcp-project.geocode_latest.geoip + youtube_table = your-gcp-project.videos.youtube + youtube_token = your-YouTube-API-token + + # Section related to the AWS credentials needed to download data from S3 + [AWS] + aws_cred_file = ~/.aws/credentials + profile_name = default + +The options in the config file(s) should match the optional arguments of +the CLI tool. For instance, the ``--service-account-file``, +``--project`` and ``--bucket`` options can be provided under the ``GCP`` +section of the config file as ``service_account_file``, ``project`` and +``bucket``, respectively. Similarly, the ``--site`` and ``--org`` +options can be provided under the ``DEFAULT`` section as ``site`` and +``org``, respectively. + +List files +~~~~~~~~~~ + +``simeon`` can list files on S3 for your organization based on criteria +like file type (``sql`` or ``log`` or ``email``), time intervals (begin +and end dates), and site (``edx`` or ``edge`` or ``patches``). + +- Example: List the latest data packages for file types ``sql``, + ``email``, and ``log`` + + .. code:: sh + + # List the latest SQL bundle + simeon list -s edx -o mitx -f sql -L + # List the laetst email data dump + simeon list -s edx -o mitx -f email -L + # List the latest tracking log file + simeon list -s edx -o mitx -f log -L + +Download and split files +~~~~~~~~~~~~~~~~~~~~~~~~ + +``simeon`` can download, decrypt and split up files into folders +belonging to specific courses. + +- Example 1: Download, split and push SQL bundles to both GCS and + BigQuery + + .. code:: sh + + # Download the latest SQL bundle + simeon download -s edx -o mitx -f sql -L -d data/ + + # Download SQL bundles dumped any time since 2021-01-01 and + # extract the contents for course ID MITx/12.3x/1T2021. + # Place the downloaded files in data/ and the output of the split operation + # in data/SQL + simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f sql \ + -b 2021-01-01 -d data -S -D data/SQL/ + + # Push to GCS the split up SQL files inside data/SQL/MITx__12_3x__1T2021 + simeon push gcs -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/SQL/MITx__12_3x__1T2021 + + # Push the files to BigQuery and wait for the jobs to finish + # Using -s or --use-storage tells BigQuery to extract the files + # to be loaded from Google Cloud Storage. + # So, use the option when you've already called simeon push gcs + simeon push bq -w -s -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/SQL/MITx__12_3x__1T2021 + +- Example 2: Download, split and push tracking logs to both GCS and + BigQuery + + .. code:: sh + + # Download the latest tracking log file + simeon download -s edx -o mitx -f log -L -d data/ + + # Download tracking logs dumped any time since 2021-01-01 + # and extract the contents for course ID MITx/12.3x/1T2021 + # Place the downloaded files in data/ and the output of the split operation + # in data/TRACKING_LOGS + simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f log \ + -b 2021-01-01 -d data -S -D data/TRACKING_LOGS/ + + # Push to GCS the split up tracking log files inside + # data/TRACKING_LOGS/MITx__12_3x__1T2021 + simeon push gcs -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021 + + # Push the files to BigQuery and wait for the jobs to finish + # Using -s or --use-storage tells BigQuery to extract the files + # to be loaded from Google Cloud Storage. + # So, use the option when you've already called simeon push gcs + simeon push bq -w -s -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021 + +- If you already have downloaded SQL bundles or tracking log files, you + can use ``simeon split`` them up. + +Make secondary/aggregated tables +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``simeon`` can generate secondary tables based on already loaded data. +Call ``simeon report --help`` for the expected positional and optional +arguments. + +- Example: Make ``person_course`` for course ID ``MITx/12.3x/1T2021`` + + .. code:: sh + + # Make a person course table for course ID MITx/12.3x/1T2021 + # Provide the -g option to give a geolocation BigQuery table + # to fill the ip-to-location details in the generated person course table + COURSE=MITx/12.3x/1T2021 + simeon report -w -g "${GCP_PROJECT_ID}.geocode.geoip" -t "person_course" \ + -p ${GCP_PROJECT_ID} -S ${SAFILE} ${COURSE} + +Notes: +~~~~~~ + +1. Please note that SQL bundles are quite large when split up, so + consider using the ``-c`` or ``--courses`` option when invoking + ``simeon download -S`` or ``simeon split`` to make sure that you + limit the splitting to a set of course IDs. You may also use the + ``--clistings-file`` option, which expects a txt file of course IDs; + one ID per line. If the aforementioned options are not used, + ``simeon`` may end up failing to complete the split operation due to + exhausted system resources (storage to be specific). + +2. ``simeon download`` with file types ``log`` and ``email`` will both + download and decrypt the files matching the given criteria. If the + latter operations are successful, then the encrypted files are + deleted by default. This is to make sure that you don’t exhaust + storage resources. If you wish to keep those files, you can always + use the ``--keep-encrypted`` option that comes with + ``simeon download`` and ``simeon split``. SQL bundles are only + downloaded (not decrypted). Their decryption is done during a + ``split`` operation. + +3. Unless there is an unhandled exception (which should be reported as a + bug), ``simeon`` should, by default, print to the standard output + both information and errors encountered while processing your files. + You can capture those logs in a file by using the global option + ``--log-file`` and providing a destination file for the logs. + +4. When using multi argument options like ``--tables`` or ``--courses``, + you should try not to place them right before the expected positional + arguments. This will help the CLI parser not confuse your positional + arguments with table names (in the case of ``--tables``) or course + IDs (when ``--courses`` is used). + +5. Splitting tracking logs is a resource intensive process. The routine + that splits the logs generates a file for each course ID encountered. + If you happen to have more course IDs in your logs than the running + process can open operating system file descriptors, then ``simeon`` + will put away records it can’t save to disk for a second pass. + Putting away the records involves using more memory than normally + required. The second pass will only require one file descriptor at a + time, so it should be safe in terms of file descriptor limits. To + help ``simeon`` not have to do a second pass, you may increase the + file descriptor limits of processes from your shell by running + something like ``ulimit -n 2000`` before calling ``simeon split`` on + Unix machines. For Windows users, you may have to dig into the + Windows Registries for a corresponding setting. This should tell your + OS kernel to allow OS processes to open up to 2000 file handles. + +6. Care must be taken when using ``simeon split`` and ``simeon push`` to + make sure that the number of positional arguments passed does not + lead to the invoked command exceeding the maximum command-line length + allowed for arguments in a command. To avoid errors along those + lines, please consider passing the positional arguments as UNIX glob + patterns. For instance, + ``simeon split --file-type log 'data/TRACKING-LOGS/*/*.log.gz'`` + tells ``simeon`` to expand the given glob pattern, instead of relying + on the shell to do it. + +7. The ``report`` subcommand relies on the presence of SQL query files + to parse and send to BigQuery to execute. Any errors arising from + executing the parsed queries will be shown to the end user through + the given log stream. While the ``simeon`` tool ships with query + files for most secondary/reporting tables that are based on the + ``edx2bigquery`` tool, an end user should be able to point ``simeon`` + to a different location with SQL query files by using the + ``--query-dir`` option that comes with ``simeon report``. + Additionally, these query files can contain + ```jinja2 templated`` <https://jinja.palletsprojects.com/en/latest/>`__ + SQL code. Any mentioned variables within these templated queries can + be passed to ``simeon report`` by using the ``--extra-args`` option + and passing key-value pair items in the format + ``var1=value1,var2=value2,var3=value3,...,varn=valuen``. Further, + these key-value pair items can also be typed by using the format + ``var1:i=value1,var2:s=value2,var3:f=value3,...,varn:s=valuen``. In + this format, the type is append to the key, separated by a colon. The + only supported scalar types, so far, are ``s`` for ``str``, ``i`` for + ``int``, and ``f`` for ``float``. If any conversion errors occur + during value parsing, then those are shown to the end user, and the + query won’t get executed. Finally, if you wish to pass an ``array`` + or ``list`` to the template, you will need to repeat a key multiple + times. For instance, if you want to pass a list named ``mylist`` + containing the integers, you could write something like + ``--extra-args mylist:i=1,mylist:i=2,mylist:i=3``. This means that + you’ll have a python ``list`` named ``mylist`` within your template, + and it should contain ``[1, 2, 3]``. + + +%package help +Summary: Development documents and examples for simeon +Provides: python3-simeon-doc +%description help +simeon +~~~~~~ + +``simeon`` is a CLI tool to help with the processing of edx Research +data. It can ``list``, ``download``, and ``split`` edX data packages. It +can also ``push`` the output of the ``split`` subcommand to both GCS and +BigQuery. It is heavily inspired by the +`edx2bigquery <https://github.com/mitodl/edx2bigquery>`__ package. If +you’ve used that tool, you should be able to navigate the quirks that +may come with this one. + +Installing with pip +~~~~~~~~~~~~~~~~~~~ + +.. code:: sh + + python3 -m pip install simeon + # Or with geoip + python3 -m pip install simeon[geoip] + # Then invoke the CLI tool with + simeon --help + +Installing with git clone +~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code:: sh + + git clone git@github.com:MIT-IR/simeon.git + cd simeon && python -m pip install . + # Or with geoip + cd simeon && python -m pip install .[geoip] + # Then invoke the CLI tool with + simeon --help + +Using Docker +~~~~~~~~~~~~ + +.. code:: sh + + docker run -it mitir/simeon:latest + simeon --help + +Developing +~~~~~~~~~~ + +.. code:: sh + + git clone git@github.com:MIT-IR/simeon.git + cd simeon + # Set up a virtual environment if you don't already have on + python3 -m venv venv + . venv/bin/activate + # pip install the package in an editable way + python3 -m pip install -e .[test,geoip] + # Invoke the executable + simeon --help + # Run the tests + tox + # Write code and tests and submit PR's + +Setups and configurations +~~~~~~~~~~~~~~~~~~~~~~~~~ + +``simeon`` is a glorified downloader and uploader set of scripts. Much +of the downloading and uploading that it does makes the assumptions that +you have your AWS credentials configured properly and that you’ve got a +service account file for GCP services available on your machine. If the +latter is missing, you may have to authenticate to GCP services through +the SDK. However, both we and Google recommend you not do that. + +Every downloaded file is decrypted either during the download process or +while it gets split by the ``simeon split`` command. So, this tool +assumes that you’ve installed and configured ``gpg`` to be able to +decrypt files from edX. + +The following steps may be useful to someone just getting started with +the edX data package: + +1. Credentials from edX + + - Reach out to edX to get your data czar credentials + - Configure both AWS and gpg, so your credentials can access the S3 + buckets and your ``gpg`` key can decrypt the files there + +2. Setup a GCP project + + - Create a GCP project + - Setup a BigQuery workspace + - Create a GCS bucket + - Create a service account and download the associated file + - Give the service account Admin Role access to both the BigQuery + project and the GCS bucket + +If the above steps are carried out successfully, then you should be able +to use ``simeon`` without any issues. + +However, if you’ve taken care of the above steps but are still unable to +get ``simeon`` to work, please open an issue. + +Further, ``simeon`` can parse INI formatted configuration files. It, by +default, looks for files in the user’s home directory, or in the current +working directory of the running process. The base names that are +targeted when config files are looked up are: ``simeon.cfg`` or +``.simeon.cfg`` or ``simeon.ini`` or ``.simeon.ini``. You can also +provide ``simeon`` with a config file by using the global option +``--config-file`` or ``-C`` and giving it a path to the file with the +corresponding configurations. + +The following is a sample file content: + +.. code:: sh + + # Default section for things like the organization whose data package is processed + # You can also set a default site as one of the following: edx, edge, patches + [DEFAULT] + site = edx + org = yourorganizationx + clistings_file = /path/to/file/with/course_ids + + # Section related to Google Cloud (project, bucket, service account) + [GCP] + project = your-gcp-project-id + bucket = your-gcs-bucket + service_account_file = /path/to/a/service_account_file.json + wait_for_loads = True + geo_table = your-gcp-project.geocode_latest.geoip + youtube_table = your-gcp-project.videos.youtube + youtube_token = your-YouTube-API-token + + # Section related to the AWS credentials needed to download data from S3 + [AWS] + aws_cred_file = ~/.aws/credentials + profile_name = default + +The options in the config file(s) should match the optional arguments of +the CLI tool. For instance, the ``--service-account-file``, +``--project`` and ``--bucket`` options can be provided under the ``GCP`` +section of the config file as ``service_account_file``, ``project`` and +``bucket``, respectively. Similarly, the ``--site`` and ``--org`` +options can be provided under the ``DEFAULT`` section as ``site`` and +``org``, respectively. + +List files +~~~~~~~~~~ + +``simeon`` can list files on S3 for your organization based on criteria +like file type (``sql`` or ``log`` or ``email``), time intervals (begin +and end dates), and site (``edx`` or ``edge`` or ``patches``). + +- Example: List the latest data packages for file types ``sql``, + ``email``, and ``log`` + + .. code:: sh + + # List the latest SQL bundle + simeon list -s edx -o mitx -f sql -L + # List the laetst email data dump + simeon list -s edx -o mitx -f email -L + # List the latest tracking log file + simeon list -s edx -o mitx -f log -L + +Download and split files +~~~~~~~~~~~~~~~~~~~~~~~~ + +``simeon`` can download, decrypt and split up files into folders +belonging to specific courses. + +- Example 1: Download, split and push SQL bundles to both GCS and + BigQuery + + .. code:: sh + + # Download the latest SQL bundle + simeon download -s edx -o mitx -f sql -L -d data/ + + # Download SQL bundles dumped any time since 2021-01-01 and + # extract the contents for course ID MITx/12.3x/1T2021. + # Place the downloaded files in data/ and the output of the split operation + # in data/SQL + simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f sql \ + -b 2021-01-01 -d data -S -D data/SQL/ + + # Push to GCS the split up SQL files inside data/SQL/MITx__12_3x__1T2021 + simeon push gcs -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/SQL/MITx__12_3x__1T2021 + + # Push the files to BigQuery and wait for the jobs to finish + # Using -s or --use-storage tells BigQuery to extract the files + # to be loaded from Google Cloud Storage. + # So, use the option when you've already called simeon push gcs + simeon push bq -w -s -f sql -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/SQL/MITx__12_3x__1T2021 + +- Example 2: Download, split and push tracking logs to both GCS and + BigQuery + + .. code:: sh + + # Download the latest tracking log file + simeon download -s edx -o mitx -f log -L -d data/ + + # Download tracking logs dumped any time since 2021-01-01 + # and extract the contents for course ID MITx/12.3x/1T2021 + # Place the downloaded files in data/ and the output of the split operation + # in data/TRACKING_LOGS + simeon download -s edx -o mitx -c "MITx/12.3x/1T2021" -f log \ + -b 2021-01-01 -d data -S -D data/TRACKING_LOGS/ + + # Push to GCS the split up tracking log files inside + # data/TRACKING_LOGS/MITx__12_3x__1T2021 + simeon push gcs -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021 + + # Push the files to BigQuery and wait for the jobs to finish + # Using -s or --use-storage tells BigQuery to extract the files + # to be loaded from Google Cloud Storage. + # So, use the option when you've already called simeon push gcs + simeon push bq -w -s -f log -p ${GCP_PROJECT_ID} -b ${GCS_BUCKET} \ + -S ${SAFILE} data/TRACKING_LOGS/MITx__12_3x__1T2021 + +- If you already have downloaded SQL bundles or tracking log files, you + can use ``simeon split`` them up. + +Make secondary/aggregated tables +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +``simeon`` can generate secondary tables based on already loaded data. +Call ``simeon report --help`` for the expected positional and optional +arguments. + +- Example: Make ``person_course`` for course ID ``MITx/12.3x/1T2021`` + + .. code:: sh + + # Make a person course table for course ID MITx/12.3x/1T2021 + # Provide the -g option to give a geolocation BigQuery table + # to fill the ip-to-location details in the generated person course table + COURSE=MITx/12.3x/1T2021 + simeon report -w -g "${GCP_PROJECT_ID}.geocode.geoip" -t "person_course" \ + -p ${GCP_PROJECT_ID} -S ${SAFILE} ${COURSE} + +Notes: +~~~~~~ + +1. Please note that SQL bundles are quite large when split up, so + consider using the ``-c`` or ``--courses`` option when invoking + ``simeon download -S`` or ``simeon split`` to make sure that you + limit the splitting to a set of course IDs. You may also use the + ``--clistings-file`` option, which expects a txt file of course IDs; + one ID per line. If the aforementioned options are not used, + ``simeon`` may end up failing to complete the split operation due to + exhausted system resources (storage to be specific). + +2. ``simeon download`` with file types ``log`` and ``email`` will both + download and decrypt the files matching the given criteria. If the + latter operations are successful, then the encrypted files are + deleted by default. This is to make sure that you don’t exhaust + storage resources. If you wish to keep those files, you can always + use the ``--keep-encrypted`` option that comes with + ``simeon download`` and ``simeon split``. SQL bundles are only + downloaded (not decrypted). Their decryption is done during a + ``split`` operation. + +3. Unless there is an unhandled exception (which should be reported as a + bug), ``simeon`` should, by default, print to the standard output + both information and errors encountered while processing your files. + You can capture those logs in a file by using the global option + ``--log-file`` and providing a destination file for the logs. + +4. When using multi argument options like ``--tables`` or ``--courses``, + you should try not to place them right before the expected positional + arguments. This will help the CLI parser not confuse your positional + arguments with table names (in the case of ``--tables``) or course + IDs (when ``--courses`` is used). + +5. Splitting tracking logs is a resource intensive process. The routine + that splits the logs generates a file for each course ID encountered. + If you happen to have more course IDs in your logs than the running + process can open operating system file descriptors, then ``simeon`` + will put away records it can’t save to disk for a second pass. + Putting away the records involves using more memory than normally + required. The second pass will only require one file descriptor at a + time, so it should be safe in terms of file descriptor limits. To + help ``simeon`` not have to do a second pass, you may increase the + file descriptor limits of processes from your shell by running + something like ``ulimit -n 2000`` before calling ``simeon split`` on + Unix machines. For Windows users, you may have to dig into the + Windows Registries for a corresponding setting. This should tell your + OS kernel to allow OS processes to open up to 2000 file handles. + +6. Care must be taken when using ``simeon split`` and ``simeon push`` to + make sure that the number of positional arguments passed does not + lead to the invoked command exceeding the maximum command-line length + allowed for arguments in a command. To avoid errors along those + lines, please consider passing the positional arguments as UNIX glob + patterns. For instance, + ``simeon split --file-type log 'data/TRACKING-LOGS/*/*.log.gz'`` + tells ``simeon`` to expand the given glob pattern, instead of relying + on the shell to do it. + +7. The ``report`` subcommand relies on the presence of SQL query files + to parse and send to BigQuery to execute. Any errors arising from + executing the parsed queries will be shown to the end user through + the given log stream. While the ``simeon`` tool ships with query + files for most secondary/reporting tables that are based on the + ``edx2bigquery`` tool, an end user should be able to point ``simeon`` + to a different location with SQL query files by using the + ``--query-dir`` option that comes with ``simeon report``. + Additionally, these query files can contain + ```jinja2 templated`` <https://jinja.palletsprojects.com/en/latest/>`__ + SQL code. Any mentioned variables within these templated queries can + be passed to ``simeon report`` by using the ``--extra-args`` option + and passing key-value pair items in the format + ``var1=value1,var2=value2,var3=value3,...,varn=valuen``. Further, + these key-value pair items can also be typed by using the format + ``var1:i=value1,var2:s=value2,var3:f=value3,...,varn:s=valuen``. In + this format, the type is append to the key, separated by a colon. The + only supported scalar types, so far, are ``s`` for ``str``, ``i`` for + ``int``, and ``f`` for ``float``. If any conversion errors occur + during value parsing, then those are shown to the end user, and the + query won’t get executed. Finally, if you wish to pass an ``array`` + or ``list`` to the template, you will need to repeat a key multiple + times. For instance, if you want to pass a list named ``mylist`` + containing the integers, you could write something like + ``--extra-args mylist:i=1,mylist:i=2,mylist:i=3``. This means that + you’ll have a python ``list`` named ``mylist`` within your template, + and it should contain ``[1, 2, 3]``. + + +%prep +%autosetup -n simeon-0.0.24 + +%build +%py3_build + +%install +%py3_install +install -d -m755 %{buildroot}/%{_pkgdocdir} +if [ -d doc ]; then cp -arf doc %{buildroot}/%{_pkgdocdir}; fi +if [ -d docs ]; then cp -arf docs %{buildroot}/%{_pkgdocdir}; fi +if [ -d example ]; then cp -arf example %{buildroot}/%{_pkgdocdir}; fi +if [ -d examples ]; then cp -arf examples %{buildroot}/%{_pkgdocdir}; fi +pushd %{buildroot} +if [ -d usr/lib ]; then + find usr/lib -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/lib64 ]; then + find usr/lib64 -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/bin ]; then + find usr/bin -type f -printf "/%h/%f\n" >> filelist.lst +fi +if [ -d usr/sbin ]; then + find usr/sbin -type f -printf "/%h/%f\n" >> filelist.lst +fi +touch doclist.lst +if [ -d usr/share/man ]; then + find usr/share/man -type f -printf "/%h/%f.gz\n" >> doclist.lst +fi +popd +mv %{buildroot}/filelist.lst . +mv %{buildroot}/doclist.lst . + +%files -n python3-simeon -f filelist.lst +%dir %{python3_sitelib}/* + +%files help -f doclist.lst +%{_docdir}/* + +%changelog +* Wed May 10 2023 Python_Bot <Python_Bot@openeuler.org> - 0.0.24-1 +- Package Spec generated @@ -0,0 +1 @@ +7ef8c4f2ff33b8653c162a638edfd091 simeon-0.0.24.tar.gz |
