Issues encountered when building a slurm image with docker

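  1. The ius repository does not support the arm64 architecture

    When the image is built with a GitHub Actions workflow, the build fails if arm64 is among the target platforms. The workflow is as follows: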

    name: Docker Image CI
    
    on:
      push:
        branches: [ "main" ]
      pull_request:
        branches: [ "main" ]
    
    jobs:
      build:
        runs-on: ubuntu-latest
        steps:
          - name: Checkout
            uses: actions/checkout@v3
          - name: Login to Docker Hub
            uses: docker/login-action@v2
            with:
              username: ${{ secrets.DOCKERHUB_USERNAME }}
              password: ${{ secrets.DOCKERHUB_TOKEN }}
          - name: Set up QEMU
            uses: docker/setup-qemu-action@v2
          - name: Set up Docker Buildx
            uses: docker/setup-buildx-action@v2
          - name: Build and push
            uses: docker/build-push-action@v4
            with:
              context: .
              file: ./Dockerfile
              platforms: |
                linux/amd64
                linux/arm64
              push: true
              tags: ${{ secrets.DOCKERHUB_USERNAME }}/centos7.9-slurm22:latest
    

    When platforms includes linux/arm64, the image build fails:

    > [linux/arm64 2/7] RUN set -ex     && yum makecache fast     && yum -y update     && yum -y install https://repo.ius.io/ius-release-el7.rpm     && yum -y install         munge         munge-devel         mariadb-server         mariadb-devel         mysql-devel         gcc         gcc-c++         python3         readline-devel         perl-ExtUtils-MakeMaker         pam-devel         http-parser-devel         json-c-devel         libyaml-devel         libjwt-devel         wget         git         vim         bzip2         make automake libtool         supervisor         psmisc         openldap openldap-servers openldap-clients nss-pam-ldapd authconfig         kde-l10n-Chinese glibc-common         bash-completion         openssh-server     && yum clean all     && rm -rf /var/cache/yum:
    588.7      5. Configure the failing repository to be skipped, if it is unavailable.
    588.7         Note that yum will try to contact the repo. when it runs most commands,
    588.7         so will have to try and fail each time (and thus. yum will be be much
    588.7         slower). If it is a very temporary problem though, this is often a nice
    588.7         compromise:
    588.7 
    588.7             yum-config-manager --save --setopt=ius.skip_if_unavailable=true
    588.7 
    588.7 failure: repodata/repomd.xml from ius: [Errno 256] No more mirrors to try.
    588.7 https://repo.ius.io/7/aarch64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found
    ------
    Dockerfile:6
    --------------------
    

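    Cause: the ius repository (https://repo.ius.io/ius-release-el7.rpm) does not provide packages for arm64 (aarch64). Fix: delete linux/arm64 on line 31 of the workflow, so that platforms only lists linux/amd64.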
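
A quick way to see which URL yum will request for a given build platform is sketched below. The function names platform_to_basearch and ius_repomd_url are hypothetical, and the URL layout is assumed from the single aarch64 URL in the error log above; only the pure string mapping is exercised, and the commented curl line needs network access.

```shell
#!/bin/sh
# Map a Docker platform string to the basearch name yum substitutes into repo URLs.
platform_to_basearch() {
    case "$1" in
        linux/amd64) echo "x86_64" ;;
        linux/arm64) echo "aarch64" ;;
        *) echo "unknown platform: $1" >&2; return 1 ;;
    esac
}

# Build the repomd.xml URL for a platform.
# Assumption: the layout follows the aarch64 URL shown in the error log.
ius_repomd_url() {
    arch=$(platform_to_basearch "$1") || return 1
    echo "https://repo.ius.io/7/${arch}/repodata/repomd.xml"
}

# Example (needs network access): a failure here suggests the platform is not served.
# curl -fsSI "$(ius_repomd_url linux/arm64)" >/dev/null || echo "not available"
```

Running such a check before the build step would surface the missing architecture earlier than a failing yum transaction inside QEMU.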

  2. ERROR: failed to solve: circular dependency detected on stage: build

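    Cause: a mistake in the Dockerfile. In the multi-stage build, the instruction just before the second FROM ended with a trailing \, so the FROM line was parsed as a continuation of that instruction.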
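
This kind of mistake can be caught before building. The sketch below scans a Dockerfile for a FROM line that directly follows a line ending in \ (blank lines and comment lines in between are skipped, since they do not end a continuation). The function name find_dangling_continuation is hypothetical, and the check is a heuristic rather than a full Dockerfile parser.

```shell
#!/bin/sh
# Print "LINE: text" for every FROM that follows a line ending in a backslash.
# Returns 0 if at least one such line was found, 1 otherwise.
find_dangling_continuation() {
    awk '
        # Blank lines and comments do not terminate a continued instruction.
        /^[[:space:]]*$/ || /^[[:space:]]*#/ { next }
        continued && toupper($1) == "FROM" {
            printf "%d: %s\n", NR, $0
            found = 1
        }
        # Remember whether this line ends with a line continuation.
        { continued = ($0 ~ /\\[[:space:]]*$/) }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# Example: find_dangling_continuation Dockerfile
```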

  3. command not found during the Dockerfile build

    $ cat Dockerfile
    FROM hekai/centos7.9-jdk8u202
    
    
    RUN set -ex \
        && yum makecache fast \
        && yum -y update \
        && yum -y install \
            mariadb-server \
            mariadb-devel \
            mysql-devel \
        && yum clean all \
        && rm -rf /var/cache/yum \
        # config mariadb
        && sed -i '/\[mysqld\]/a\innodb_buffer_pool_size=1024M\ninnodb_log_file_size=64M\ninnodb_lock_wait_timeout=900' /etc/my.cnf \
        && /usr/bin/mysql_install_db --user=mysql &>/dev/null \
        && /usr/bin/mysqld_safe --user=mysql & &>/dev/null \
        && sleep 3s \
        && mysql -e "CREATE USER 'slurm'@'localhost' identified by 'password'" \
        && mysql -e "GRANT ALL ON slurm_acct_db.* to 'slurm'@'localhost' identified by 'password' with GRANT option" \
        && mysql -e "CREATE DATABASE slurm_acct_db"
    
    CMD ["/bin/bash"]
    

    The image build fails with the following error:

    $ docker build -t mariadb:v1 -f Dockerfile.b  .
    [+] Building 4.0s (5/5) FINISHED                                                                                                         
    => [internal] load build definition from Dockerfile.b                                                                              0.1s
    => => transferring dockerfile: 832B                                                                                                0.0s
    => [internal] load .dockerignore                                                                                                   0.1s
    => => transferring context: 109B                                                                                                   0.0s
    => [internal] load metadata for docker.io/hekai/centos7.9-jdk8u202:latest                                                          0.0s
    => [1/2] FROM docker.io/hekai/centos7.9-jdk8u202                                                                                   0.1s
    => ERROR [2/2] RUN set -ex     && yum makecache fast     && yum -y update     && yum -y install         mariadb-server         ma  3.7s
    ------                                                                                                                                   
    > [2/2] RUN set -ex     && yum makecache fast     && yum -y update     && yum -y install         mariadb-server         mariadb-devel         mysql-devel     && yum clean all     && rm -rf /var/cache/yum     && sed -i '/\[mysqld\]/a\innodb_buffer_pool_size=1024M\ninnodb_log_file_size=64M\ninnodb_lock_wait_timeout=900' /etc/my.cnf     && /usr/bin/mysql_install_db --user=mysql &>/dev/null     && /usr/bin/mysqld_safe --user=mysql & &>/dev/null     && sleep 3s     && mysql -e "CREATE USER 'slurm'@'localhost' identified by 'password'"     && mysql -e "GRANT ALL ON slurm_acct_db.* to 'slurm'@'localhost' identified by 'password' with GRANT option"     && mysql -e "CREATE DATABASE slurm_acct_db":
    #0 0.519 + yum makecache fast
    #0 1.193 Loaded plugins: fastestmirror, ovl
    #0 1.567 Determining fastest mirrors
    #0 2.744  * base: mirrors.huaweicloud.com
    #0 2.745  * extras: mirrors.huaweicloud.com
    #0 2.745  * updates: mirrors.huaweicloud.com
    #0 3.524 /bin/sh: mysql: command not found
    ------
    Dockerfile.b:4
    --------------------
    

    Cause: possibly docker optimizes the RUN instruction, so that mysql had not yet been installed when the mysql command was reached.
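
Another possible explanation, offered as an assumption rather than a confirmed diagnosis: in the Dockerfile above, the lone & after mysqld_safe ends the entire && chain in front of it (including the yum steps) and runs that chain in the background, so sleep 3s and mysql start right away. That would match the log, where mysql fails at about 3.5 s while yum is still selecting mirrors. The sketch below reproduces the shell behaviour in isolation; demo_background_chain is a hypothetical name.

```shell
#!/bin/sh
# Show that "a && b & c" backgrounds the whole "a && b" chain, so c runs first.
demo_background_chain() (
    tmp=$(mktemp)
    # The single & terminates the "sleep && echo" list and runs it asynchronously.
    sleep 1 && echo "chain finished" >>"$tmp" &
    # This command starts immediately, without waiting for the chain above.
    echo "next command" >>"$tmp"
    wait
    cat "$tmp"
    rm -f "$tmp"
)

demo_background_chain
```

The line "next command" is written before "chain finished", even though it appears later in the script.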

    Fix: put the installation of mysql and its configuration in separate RUN instructions.

    The modified Dockerfile:

    $ cat Dockerfile
    FROM hekai/centos7.9-jdk8u202
    
    
    RUN set -ex \
        && yum makecache fast \
        && yum -y update \
        && yum -y install \
            mariadb-server \
            mariadb-devel \
            mysql-devel \
        && yum clean all \
        && rm -rf /var/cache/yum
    
    RUN set -ex \
        # config mariadb
        && sed -i '/\[mysqld\]/a\innodb_buffer_pool_size=1024M\ninnodb_log_file_size=64M\ninnodb_lock_wait_timeout=900' /etc/my.cnf \
        && /usr/bin/mysql_install_db --user=mysql &>/dev/null \
        && /usr/bin/mysqld_safe --user=mysql & &>/dev/null \
        && sleep 3s \
        && mysql -e "CREATE USER 'slurm'@'localhost' identified by 'password'" \
        && mysql -e "GRANT ALL ON slurm_acct_db.* to 'slurm'@'localhost' identified by 'password' with GRANT option" \
        && mysql -e "CREATE DATABASE slurm_acct_db"
    
    CMD ["/bin/bash"]
    
  4. Build error: cannot connect to mariadb

    ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
    

    The corresponding RUN instruction in the Dockerfile:

    RUN set -ex \
        ... ...
        # config mariadb
        && sed -i '/\[mysqld\]/a\innodb_buffer_pool_size=1024M\ninnodb_log_file_size=64M\ninnodb_lock_wait_timeout=900' /etc/my.cnf \
        && /usr/bin/mysql_install_db --user=mysql &>/dev/null \
        && /usr/bin/mysqld_safe --user=mysql & &>/dev/null \
        && sleep 3s \
        && mysql -e "CREATE USER 'slurm'@'localhost' identified by 'password'" \
        && mysql -e "GRANT ALL ON slurm_acct_db.* to 'slurm'@'localhost' identified by 'password' with GRANT option" \
        && mysql -e "CREATE DATABASE slurm_acct_db" \
        ... ...
        # clean
        && rm -rf /var/log/* /var/cache/* /tmp/*
    

    Raising the sleep to 100s did not help either.

    Fix: move the mariadb steps out into a RUN instruction of their own.

    Cause: unknown.

    After the change:

    RUN set -ex \
        ... ...
        # clean
        && rm -rf /var/log/* /var/cache/* /tmp/*
    
    RUN set -ex \
        # config mariadb
        && sed -i '/\[mysqld\]/a\innodb_buffer_pool_size=1024M\ninnodb_log_file_size=64M\ninnodb_lock_wait_timeout=900' /etc/my.cnf \
        && /usr/bin/mysql_install_db --user=mysql &>/dev/null \
        && /usr/bin/mysqld_safe --user=mysql & &>/dev/null \
        && sleep 3s \
        && mysql -e "CREATE USER 'slurm'@'localhost' identified by 'password'" \
        && mysql -e "GRANT ALL ON slurm_acct_db.* to 'slurm'@'localhost' identified by 'password' with GRANT option" \
        && mysql -e "CREATE DATABASE slurm_acct_db"
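
This is not the fix used above, but a fixed sleep can be replaced by polling, which at least makes a startup failure explicit instead of timing-dependent. The sketch below retries a command once per second until it succeeds or a timeout expires. The name wait_until is hypothetical; mysqladmin ping is the MariaDB client's liveness check, and using it here is an assumption about what "ready" should mean.

```shell
#!/bin/sh
# Retry a command once per second until it succeeds or the timeout is reached.
# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
wait_until() (
    timeout_s=$1
    shift
    elapsed=0
    until "$@" >/dev/null 2>&1; do
        if [ "$elapsed" -ge "$timeout_s" ]; then
            exit 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    exit 0
)

# In the RUN instruction this could stand in for "sleep 3s", for example:
#   wait_until 30 mysqladmin ping
```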
    
  5. Build error: mysql fails to start

    > [stage-1 5/7] RUN set -ex     && sed -i '/\[mysqld\]/a\innodb_buffer_pool_size=1024M\ninnodb_log_file_size=64M\ninnodb_lock_wait_timeout=900' /etc/my.cnf     && /usr/bin/mysql_install_db --user=mysql &>/dev/null     && /usr/bin/mysqld_safe --user=mysql & &>/dev/null     && sleep 3s     && mysql -e "CREATE USER 'slurm'@'localhost' identified by 'password'"     && mysql -e "GRANT ALL ON slurm_acct_db.* to 'slurm'@'localhost' identified by 'password' with GRANT option"     && mysql -e "CREATE DATABASE slurm_acct_db":
    0.064 + sed -i '/\[mysqld\]/a\innodb_buffer_pool_size=1024M\ninnodb_log_file_size=64M\ninnodb_lock_wait_timeout=900' /etc/my.cnf
    0.068 + /usr/bin/mysql_install_db --user=mysql
    0.251 + /usr/bin/mysqld_safe --user=mysql
    0.355 230905 03:18:57 mysqld_safe Logging to '/var/log/mariadb/mariadb.log'.
    0.386 230905 03:18:57 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
    0.390 /usr/bin/mysqld_safe_helper: Can't create/write to file '/var/log/mariadb/mariadb.log' (Errcode: 2)
    3.072 ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
    ------
    Dockerfile.2:141
    --------------------
    

    Cause: the /var/log/mariadb directory had been deleted by an earlier cleanup step. When cleaning up log files, keep the mariadb directory:

    find /var/log/* | grep -v -e mariadb | xargs rm -rf
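
The pipeline above drops every path that contains the string mariadb anywhere. A stricter variant, sketched below, deletes only the top-level entries of a directory whose name is not exactly the one to keep. The function name clean_dir_except is hypothetical, and -mindepth/-maxdepth assume GNU or BSD find.

```shell
#!/bin/sh
# Remove every top-level entry of DIR except the one named KEEP.
# Usage: clean_dir_except DIR KEEP
clean_dir_except() (
    dir=$1
    keep=$2
    # -mindepth 1 excludes DIR itself; -maxdepth 1 stops find from descending.
    find "$dir" -mindepth 1 -maxdepth 1 ! -name "$keep" -exec rm -rf {} +
)

# Example for the case above:
#   clean_dir_except /var/log mariadb
```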
    
  6. Configuring slurm

    The slurm configure command is as follows:

    ./configure --prefix=/opt/slurm --sysconfdir=/opt/slurm/etc --enable-slurmrestd \
        --with-mysql_config=/usr/bin --libdir=/usr/lib64
    

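    With this configuration, make install places slurm's shared libraries under /usr/lib64/slurm.

    If --libdir is not specified at configure time, the shared libraries are installed under the directory given by prefix: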

    ./configure --prefix=/opt/slurm --sysconfdir=/opt/slurm/etc --enable-slurmrestd \
        --with-mysql_config=/usr/bin
    

    In this case the shared libraries are installed under /opt/slurm/lib64.

    For the shared libraries to be found at run time, link /opt/slurm/lib64 into /usr/lib64:

    ln -s /opt/slurm/lib64 /usr/lib64/slurm
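
Run a second time, a plain ln -s either fails or creates a nested link inside the existing one. A minimal sketch of a repeatable variant follows; link_plugin_dir is a hypothetical name, and refusing to touch a real directory is a design choice of this sketch, not something the original setup requires.

```shell
#!/bin/sh
# Create or refresh a symlink LINK -> TARGET, without ever replacing a real directory.
# Usage: link_plugin_dir TARGET LINK
link_plugin_dir() (
    target=$1   # e.g. /opt/slurm/lib64
    link=$2     # e.g. /usr/lib64/slurm
    if [ ! -d "$target" ]; then
        echo "missing target: $target" >&2
        exit 1
    fi
    # An existing path that is not a symlink is left alone.
    if [ -e "$link" ] && [ ! -L "$link" ]; then
        echo "refusing to replace non-symlink: $link" >&2
        exit 1
    fi
    # -n treats an existing symlink to a directory as a file, so it is replaced in place.
    ln -sfn "$target" "$link"
)

# Example: link_plugin_dir /opt/slurm/lib64 /usr/lib64/slurm
```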
    
  7. Supporting the killall command

    /usr/local/bin/docker-entrypoint.sh: line 77: killall: command not found
    

    Fix: install psmisc

    yum install -y psmisc
    

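    On centos7 minimal, running killall reports command not found because psmisc is not installed.

    The psmisc package contains three programs that help manage the /proc directory: fuser, killall, and pstree, plus pstree.x11 (a link to pstree).

    • fuser shows the PIDs of the processes using the specified files or file systems.
    • killall kills processes by name: it sends a signal to all processes running the specified command.
    • pstree displays the running processes as a tree.
    • pstree.x11 does the same as pstree, except that it waits for confirmation before exiting.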
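
Missing tools such as killall are easier to catch at the top of an entrypoint script than at line 77 of it. The sketch below lists which of the given commands are not on PATH; missing_commands is a hypothetical name, and which commands to check is up to the script.

```shell
#!/bin/sh
# Print the names of the given commands that are not found on PATH.
# Returns 0 when nothing is missing, 1 otherwise.
missing_commands() (
    missing=""
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
    done
    # Strip the leading space before printing the list.
    echo "${missing# }"
    [ -z "$missing" ]
)

# Example for the entrypoint above:
#   missing_commands killall pstree || echo "install psmisc first" >&2
```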
  8. Installing git. When IDEA connects to the docker container for remote development, it reports:

    Unsupported Git Version 1.8.3.1
    At least 2.17.0 is required
    

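    The git available from the default yum repositories has to be upgraded.

    The centos7.9 docker image does not include git, so there are two options:

    1. Build and install a newer git from source
    2. Install git 2.17.0 or newer from a yum repository
      • The IUS yum repository provides git 2.36.6
        • Note: installing the IUS yum repository also installs the epel repository automatically
        • Install the IUS repository: yum -y install https://repo.ius.io/ius-release-el7.rpm
        • List the git packages: yum search git|grep -E "^git"
        • Show the git version: yum info git236
        • Install git 2.36: yum -y install git236
      • The endpoint yum repository provides the latest git
        • Install the endpoint repository: yum install https://packages.endpointdev.com/rhel/7/os/x86_64/endpoint-repo.x86_64.rpm
        • Install git: yum -y install git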
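
Whether an installed git satisfies the 2.17.0 requirement can be checked in a script. The sketch below compares two dotted version strings; version_ge is a hypothetical name, and it relies on sort -V, which is assumed to be available (GNU coreutils).

```shell
#!/bin/sh
# Succeed when version $1 is greater than or equal to version $2.
version_ge() {
    # sort -V orders version strings; if $2 sorts first (or both are equal), $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example with the versions from the message above:
#   version_ge 1.8.3.1 2.17.0 || echo "git is too old"
# Against the installed git (the third field of "git --version" is the version):
#   version_ge "$(git --version | awk '{print $3}')" 2.17.0
```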
  9. Building and installing python3 from source (reference: https://www.jianshu.com/p/2cad40bc9e1b)

    curl -O https://www.python.org/ftp/python/3.7.14/Python-3.7.14.tar.xz
    tar -xf Python-3.7.14.tar.xz
    cd Python-3.7.14
    yum -y install make
    yum-builddep -y python
    ./configure --prefix=/opt/tools/python-3.7.14
    make && make install
    
  10. Mounting the /root directory with docker compose makes environment variables ineffective

    $ cat docker-compose.yml 
    version: '3'
    
    services:
      slurm-master:
        image: hekai/centos7.9-slurm22
        hostname: linux0
        privileged: true
        stdin_open: true
        restart: always
        tty: true
        ports:
          - 389:389
          - 6820:6820
        environment:
          role: "master"
          TZ: Asia/Shanghai
        volumes:
          - .root:/root
          - .data/log:/var/log/slurm
          - etc_munge:/etc/munge
          - etc_slurm:/etc/slurm
          - spool_slurm:/var/spool/slurm
          - mysql:/var/lib/mysql
    
      slurm-compute-1:
        image: hekai/centos7.9-slurm22
        hostname: linux1
        privileged: true
        stdin_open: true
        restart: always
        tty: true
        environment:
          role: "compute"
          TZ: Asia/Shanghai
        volumes:
          - .data/log:/var/log/slurm
          - etc_munge:/etc/munge
          - etc_slurm:/etc/slurm
        depends_on:
          - "slurm-master"
    
    volumes:
      etc_munge:
      etc_slurm:
      spool_slurm:
      mysql:
    

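    As shown above, the slurm-master node mounts the container's /root to a local directory. After entering the container, the environment variables set when the image was built were gone. Running source /etc/profile inside the container loads the variables defined in /etc/profile.d/jdk.sh again.

    1. Adding source /etc/profile to the Dockerfile does not solve it
    2. Adding source /etc/profile to entrypoint.sh does not solve it
    3. Not mounting the container's /root directly, and instead mounting only the needed files under /root individually, solves the problem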
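
A possible explanation, stated as an assumption that the original does not verify: on centos, an interactive bash reads ~/.bashrc, which sources /etc/bashrc, which in turn loads /etc/profile.d/*.sh. Bind-mounting an empty host directory over /root hides the image's /root/.bashrc, so that chain never runs. If mounting all of /root is still wanted, one option is to seed the mounted directory with the skeleton dotfiles at startup, as sketched below. The function name seed_home is hypothetical.

```shell
#!/bin/sh
# Copy skeleton dotfiles into a home directory, without overwriting existing files.
# Usage: seed_home SKEL_DIR HOME_DIR
seed_home() (
    skel=$1   # e.g. /etc/skel
    home=$2   # e.g. /root
    for f in .bashrc .bash_profile; do
        # Only copy when the skeleton file exists and the home has no such file yet.
        if [ -f "$skel/$f" ] && [ ! -e "$home/$f" ]; then
            cp "$skel/$f" "$home/$f"
        fi
    done
)

# Example, e.g. near the top of entrypoint.sh:
#   seed_home /etc/skel /root
```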
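  11. Looking up users in ldap first, and falling back to linux system users only if they are not found in ldap

    To do this, change the order of the values for the passwd, shadow, and group entries in /etc/nsswitch.conf.

    How can these values be changed cleanly?
    After some searching, linux does not appear to provide a command for reordering these values, so the file has to be edited as text. The sed commands below change the user lookup order in nsswitch.conf: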

    sed -i 's/passwd:     files ldap/passwd:     ldap files/g' ./nsswitch.conf 
    sed -i 's/shadow:     files ldap/shadow:     ldap files/g' ./nsswitch.conf 
    sed -i 's/group:      files ldap/group:      ldap files/g' ./nsswitch.conf
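
The three commands above match only when the spacing in the file is exactly as written. A whitespace-tolerant variant is sketched below as a filter from stdin to stdout; ldap_first is a hypothetical name, sed -E (extended regular expressions) is assumed to be available, and entries with other sources between files and ldap are deliberately left unchanged.

```shell
#!/bin/sh
# Rewrite "files ldap" to "ldap files" on the passwd, shadow and group lines,
# keeping the original whitespace after the colon.
ldap_first() {
    sed -E 's/^(passwd|shadow|group):([[:space:]]+)files[[:space:]]+ldap/\1:\2ldap files/'
}

# Example: write the result to a new file and review it before replacing the original.
#   ldap_first </etc/nsswitch.conf >/tmp/nsswitch.conf.new
```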
    

    References:
    https://serverfault.com/questions/972401/try-ldap-authentication-before-local-authentication
    https://documents.uow.edu.au/~blane/netapp/ontap/nag/networking/concept/c_oc_netw_maintaining_host_name_search.html#c_oc_netw_maintaining_host_name_search
    https://superuser.com/questions/1417190/why-do-i-need-to-change-the-order-of-hosts-in-nsswitch-conf
    https://man7.org/linux/man-pages/man5/nsswitch.conf.5.html
    https://unix.stackexchange.com/questions/140378/editing-nsswitch-conf-file-safely
