Software Installation

Dev Account

Summary

These tickets cover OS installation, driver and CUDA setup, package conflicts, storage and RAID bring-up, cluster configuration, and application-stack activation on otherwise functional Exxact systems. The recurring problem is not hardware failure but getting a supported, working software baseline onto the machine.

Frequency

  • 483 tickets

Common Causes

  1. GPU driver, CUDA, or kernel-module mismatches
    Driver installation, DKMS breakage, nvidia-smi errors, and CUDA environment setup together form the most common software-install theme. Examples: #10415, #11012, #11108, #22648, #40529, and 80+ more.
  2. OS install or reinstall on Ubuntu, Rocky, or CentOS
    Many tickets involve initial imaging, reinstall after corruption or security events, or choosing the right OS baseline. Examples: #10356, #11750, #18175, #41989, #32080, and 70+ more.
  3. RAID, GRUB, storage, or encryption setup
    A smaller group of tickets centers on bootloader repair, RAID assembly, LVM, or disk-encryption guidance during install. Examples: #12523, #13584, #18175, #32392, #40017, and 25+ more.
  4. Cluster, network, or remote-access configuration
    Some tickets filed as installation issues are really Slurm, SSH, static-IP, or cluster-service bring-up problems on newly provisioned systems. Examples: #10304, #12308, #15770, #20865, #29846, and 20+ more.
  5. Application-stack or environment activation questions
    Customers often need help with Conda, PyTorch, Docker, CryoSPARC, or vendor-provided environments after the base OS is already running. Examples: #10788, #14272, #18898, #22648, #27505, and 30+ more.
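
The first cause above, a driver and kernel-module mismatch, can often be confirmed by comparing the version reported by the loaded kernel module with the version of the installed user-space driver. The sketch below is one possible way to do that, not an Exxact tool. It assumes the kernel module exposes its version on an "NVRM version:" line in /proc/driver/nvidia/version and that the user-space version comes from `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. The function names and the sample text are hypothetical and for illustration only.

```python
import re
from typing import Optional

# Matches a dotted version such as 535.129.03 (the value here is illustrative).
VERSION_RE = re.compile(r"(\d+\.\d+(?:\.\d+)?)")

# Made-up sample of /proc/driver/nvidia/version content, not taken from a ticket.
SAMPLE_PROC_TEXT = (
    "NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.129.03  "
    "Thu Oct 19 18:56:32 UTC 2023\n"
    "GCC version:  gcc version 11.4.0\n"
)


def kernel_module_version(proc_text: str) -> Optional[str]:
    """Return the driver version found on the 'NVRM version:' line, if any."""
    for line in proc_text.splitlines():
        if line.startswith("NVRM version:"):
            match = VERSION_RE.search(line)
            return match.group(1) if match else None
    return None


def versions_match(proc_text: str, userspace_version: str) -> bool:
    """True only when a kernel-module version was found and equals user space."""
    kernel = kernel_module_version(proc_text)
    return kernel is not None and kernel == userspace_version.strip()
```

On a live system, `proc_text` would be the contents of /proc/driver/nvidia/version and `userspace_version` the first line printed by the nvidia-smi query. A False result suggests the module loaded at boot no longer matches the installed packages, which is commonly cleared by rebuilding the module or rebooting after a driver update.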

Diagnostic Steps

  1. Identify the failing layer first
    Separate base OS install, bootloader, driver, package, cluster, and app-environment failures before suggesting fixes. Representative tickets: #10356, #11750, #18175, #22648, #41989.
  2. Capture exact command output and version state
    Kernel version, OS release, package errors, nvidia-smi output, and service logs were repeatedly needed before support could give specific rather than speculative advice. Representative tickets: #10415, #11012, #14953, #19686, #40529.
  3. Check storage and boot configuration during reinstall work
    RAID mode, GRUB target, disk layout, LVM, and encryption choices often explain why installs fail or boot incorrectly. Representative tickets: #12523, #13584, #18175, #32392, #40017.
  4. Use Exxact docs or validated references when possible
    Many successful tickets relied on KB articles, README corrections, or vendor docs rather than ad hoc instructions. Representative tickets: #12308, #12821, #13684, #22648, #41989.
  5. Escalate to live help when the customer is stuck mid-install
    Remote sessions or calls were especially effective for RAID, boot, and cluster setup. Representative tickets: #12446, #14303, #17019, #17826, #18175.
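
The version state described in step 2 can be gathered in one pass so the customer sends a single report instead of several partial pastes. The following is a minimal sketch under stated assumptions: the host is Linux with /etc/os-release present, and `nvidia-smi` and `dkms status` are the commands of interest. The helper names `run_or_note` and `collect_baseline` are hypothetical. Commands that are not installed are recorded as such rather than raising an error.

```python
import platform
import shutil
import subprocess
from pathlib import Path
from typing import Dict, List


def run_or_note(cmd: List[str], timeout: int = 20) -> str:
    """Run a command and return its output, or a bracketed note on failure."""
    if shutil.which(cmd[0]) is None:
        return f"[not found: {cmd[0]}]"
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError) as exc:
        return f"[failed: {exc}]"
    output = (proc.stdout + proc.stderr).strip()
    return output or f"[no output, exit {proc.returncode}]"


def collect_baseline() -> Dict[str, str]:
    """Collect kernel, OS release, GPU driver, and DKMS state as plain text."""
    os_release = Path("/etc/os-release")
    return {
        "kernel": platform.release(),
        "os_release": (
            os_release.read_text()
            if os_release.is_file()
            else "[missing /etc/os-release]"
        ),
        "nvidia_smi": run_or_note(["nvidia-smi"]),
        "dkms_status": run_or_note(["dkms", "status"]),
    }
```

Printing each key and value from `collect_baseline()` gives a report that can be pasted into a ticket. Service logs are left out of this sketch because the relevant unit names vary by system.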

Solutions

  1. Install or correct the right driver and CUDA stack
    The most common durable fix is aligning GPU driver, CUDA, kernel, and related packages to the actual platform. Examples: #10415, #11012, #11108, #40529, #41003, and 70+ more.
  2. Provide a validated reinstall procedure
    Clear OS-baseline guidance, including firmware prerequisites and install order, resolves many otherwise open-ended requests. Examples: #11750, #18175, #32080, #41989, #41219.
  3. Repair storage and boot configuration
    Correcting RAID, GRUB, partitioning, or LVM choices repeatedly gets self-managed reinstalls booting again. Examples: #12523, #13584, #18175, #30249, #32392.
  4. Fix documentation gaps with concrete instructions
    Several tickets were resolved because support rewrote incomplete README or doc steps into usable guidance. Examples: #12821, #14953, #22648, #41989, #32411.
  5. Set best-effort boundaries clearly when the request is advisory
    Boundary-setting works well for encryption, custom app stacks, or unsupported software choices, as long as the customer still gets a practical next step. Examples: #10356, #14272, #14953, #15729, #27489.
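
Before repairing storage and boot configuration on a system that uses Linux software RAID (md), it can help to confirm whether an array is degraded. The sketch below parses text in the /proc/mdstat layout and reports arrays with missing members. It assumes the usual status form, such as `[2/1] [_U]`, where the first number is the expected member count, the second is the active count, and an underscore marks a missing member. It does not cover hardware RAID controllers. The function name and the sample text are hypothetical.

```python
import re
from typing import Dict, Optional

ARRAY_RE = re.compile(r"^(md\S+)\s*:")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")

# Made-up sample in /proc/mdstat layout, not taken from a ticket.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sdb2[1]
      976630464 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""


def degraded_arrays(mdstat_text: str) -> Dict[str, str]:
    """Map each degraded array name to its member flags, for example '_U'."""
    degraded: Dict[str, str] = {}
    current: Optional[str] = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            # Remember which array the following status line belongs to.
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            expected, active, flags = status.groups()
            if int(active) < int(expected) or "_" in flags:
                degraded[current] = flags
            current = None
    return degraded
```

Calling `degraded_arrays(SAMPLE_MDSTAT)` on the sample flags only `md1`. On a real host the input would be the contents of /proc/mdstat. An empty result only means that no md array reported a missing member, so it does not rule out GRUB, LVM, or partitioning problems.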

Edge Cases

  • Software symptom hiding firmware dependency: some install failures only cleared after BIOS, BMC, or PCIe power-setting changes. See #11750, #40529, #41003.
  • Preinstalled environment existed but activation instructions were wrong or incomplete: the software was present, but the handoff documentation was insufficient. See #22648, #32411.
  • Security or recovery-driven reinstall: some tickets ask for a known-good baseline after compromise or corruption rather than ordinary setup help. See #41989, #32103.
  • Best-effort advisory rather than break-fix: encryption, app recommendations, or custom stack questions often close with guidance rather than a single technical fix. See #10356, #14272, #27489.

Related Issues

Referenced by

  • CryoSPARC Integration — co-occurs with this issue (×11)
  • OS Boot Failure — co-occurs with this issue (×38)
  • Matt — handled tickets on this issue (×71)
  • David Nguyen — handled tickets on this issue (×7)
  • Andrew Rodriguez — handled tickets on this issue (×117)
  • H200 — product affected by this issue (×3)
  • RTX 6000 Ada — product affected by this issue (×6)
  • H100 — product affected by this issue (×9)
  • Duc Bui — handled tickets on this issue (×25)
  • RTX A5000 — product affected by this issue (×3)
