This article covers how to manage Iconik Storage Gateway Pro (ISG Pro) clusters and nodes.

Node roles

Each node has one or more roles that determine what kinds of jobs it will handle:

Role        Handles
MAIN        Scanning, ingest, deletion, event handling. Polls Iconik for events.
CHECKSUM    File checksum calculation
TRANSCODE   Transcoding jobs
TRANSFER    Uploads, downloads, archives, exports

A node can hold any combination of roles. Roles can also be split: one node could be dedicated to transcoding while another handles transfers. This is how customers tune the cluster for their workload: a customer with heavy transcoding demand can spin up multiple TRANSCODE-role nodes; a customer with heavy ingest can scale CHECKSUM-role nodes.

Each role must be held by at least one node for the cluster to be fully operational. The default cluster setup gives the first node every role.
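
The coverage rule can be checked mechanically. The following is a minimal Python sketch, not ISG Pro code: the node names and the dictionary layout are hypothetical, and only the four role names come from this article.

```python
# Illustrative only: node names and the mapping layout are hypothetical.
REQUIRED_ROLES = {"MAIN", "CHECKSUM", "TRANSCODE", "TRANSFER"}


def missing_roles(nodes: dict[str, set[str]]) -> set[str]:
    """Return the roles that no node in the cluster currently holds."""
    held = set().union(*nodes.values())
    return REQUIRED_ROLES - held


# A cluster tuned for heavy transcoding that forgot the TRANSFER role.
cluster = {
    "node-a": {"MAIN", "CHECKSUM"},
    "node-b": {"TRANSCODE"},
    "node-c": {"TRANSCODE"},
}
print(sorted(missing_roles(cluster)))  # ['TRANSFER']
```

An empty result means every role is held by at least one node.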

ISG Pro high availability, Leader election & failure handling

Only one MAIN-role node is the Leader at any time. The Leader runs MAIN-role jobs (scanning, ingest). Other MAIN-role nodes are Followers, ready to take over if the Leader fails. Election is PostgreSQL-based.

If the Leader goes offline — the machine dies, loses network access, loses Iconik access, or loses shared storage access — a Follower wins the next election and picks up MAIN-role work. CHECKSUM, TRANSCODE, and TRANSFER work continues uninterrupted on healthy nodes throughout, because that work doesn't require Leader status.

Each node runs continuous health checks: database connectivity, Iconik API connectivity, shared storage access. A node that fails health checks steps back; another node takes over.

Failure scenarios and how the cluster handles them:

  1. Leader machine dies — a Follower wins the next election after passing all health checks; the new Leader updates the old Leader's availability_status since the old node can't.
  2. Leader loses iconik access (proxy / external gateway down) — Leader stops MAIN work, releases the lock, becomes Follower. Another healthy node takes over.
  3. Leader loses shared storage access — same outcome: drop leadership, release lock.
  4. Leader loses all network — equivalent to "machine died" from the cluster's perspective.
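
The scenarios above can be modeled with a small in-memory simulation. This is a sketch under stated assumptions, not the actual ISG Pro implementation: the real election is PostgreSQL-based, whereas here a plain attribute stands in for the lock, the class and field names are hypothetical, and "first healthy MAIN-role node wins" is a simplification of whichever node acquires the lock first. A dead or fully disconnected machine would be modeled as a node that fails every check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    has_main_role: bool = True
    db_ok: bool = True
    iconik_ok: bool = True
    storage_ok: bool = True

    def healthy(self) -> bool:
        # All three health checks named in the article must pass.
        return self.db_ok and self.iconik_ok and self.storage_ok


class Cluster:
    """In-memory stand-in for the PostgreSQL-backed Leader lock."""

    def __init__(self, nodes: list[Node]) -> None:
        self.nodes = nodes
        self.leader: Optional[Node] = None

    def run_election(self) -> Optional[Node]:
        # An unhealthy Leader steps back and the lock becomes free.
        if self.leader is not None and not self.leader.healthy():
            self.leader = None
        # The first healthy MAIN-role node takes the free lock.
        if self.leader is None:
            for node in self.nodes:
                if node.has_main_role and node.healthy():
                    self.leader = node
                    break
        return self.leader


a, b = Node("node-a"), Node("node-b")
cluster = Cluster([a, b])
print(cluster.run_election().name)  # node-a
a.iconik_ok = False                 # scenario 2: Leader loses iconik access
print(cluster.run_election().name)  # node-b
```

In this model a recovered node does not reclaim leadership on its own; it stays a Follower until the current Leader fails a health check.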

Health checks each node performs:

  • Database connectivity
  • iconik API connectivity
  • Shared storage access (per configured storage)

Work that does not require Leader status (it continues on Followers as long as the node is healthy):

  • Checksum calculation
  • Transcoding
  • Transfers / archives (uploads, downloads)

Settings & precedence

Settings can live in three places. Higher overrides lower:

config.ini (highest) → ISG (node) settings → ISG cluster settings (lowest)
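
The lookup order can be illustrated with Python's collections.ChainMap, which searches its mappings from left to right. This is only a sketch of the precedence rule: the values are made up, and how ISG Pro actually merges the three sources is not described here.

```python
from collections import ChainMap

# Hypothetical values; only the setting names and the precedence order
# come from this article.
config_ini = {"checksum_max_workers": 8}
node_settings = {"checksum_max_workers": 4, "max_transcoding_jobs": 2}
cluster_settings = {
    "checksum_max_workers": 2,
    "max_transcoding_jobs": 1,
    "scanner_concurrency_value": 16,
}

# Highest-precedence source first: config.ini, then node, then cluster.
effective = ChainMap(config_ini, node_settings, cluster_settings)

print(effective["checksum_max_workers"])       # 8  (from config.ini)
print(effective["max_transcoding_jobs"])       # 2  (from node settings)
print(effective["scanner_concurrency_value"])  # 16 (from cluster settings)
```

A value set at the cluster level therefore only takes effect on a node when neither that node's settings nor its config.ini define the same key.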

Settings that can currently be defined in the admin panel:

  • checksum_max_workers - checksum calculation concurrency (applies to nodes with the CHECKSUM role)
  • scanner_concurrency_value - scanner concurrency (applies to the Leader MAIN node)
  • file_download_parallel_downloads_num - maximum number of parallel download jobs per node (applies to nodes with the TRANSFER role)
  • file_upload_parallel_uploads_num - maximum number of parallel upload jobs per node (applies to nodes with the TRANSFER role)
  • max_transcoding_jobs - maximum number of transcoding jobs per transcoder profile per node (applies to nodes with the TRANSCODE role)

Cluster-only settings:

  • db_connection_uri - connection string a node uses for opening a database connection.
  • visibility_timeout - how long a node holds a lease on a queued job before another node can pick it up.
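
The lease behavior behind visibility_timeout can be sketched as follows. This is an illustration of the general lease pattern, not ISG Pro's queue code: the Job fields, the try_claim helper, and the use of seconds as the unit are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    job_id: str
    leased_by: Optional[str] = None
    lease_expires_at: float = 0.0


def try_claim(job: Job, node: str, now: float, visibility_timeout: float) -> bool:
    """Claim the job if it is unleased or its previous lease has expired."""
    if job.leased_by is not None and now < job.lease_expires_at:
        return False  # another node still holds a live lease
    job.leased_by = node
    job.lease_expires_at = now + visibility_timeout
    return True


job = Job("transcode-1")
print(try_claim(job, "node-a", now=0.0, visibility_timeout=300.0))    # True
print(try_claim(job, "node-b", now=120.0, visibility_timeout=300.0))  # False
print(try_claim(job, "node-b", now=300.0, visibility_timeout=300.0))  # True
```

The trade-off is visible here: a short timeout lets another node recover a job quickly after a crash, while a timeout shorter than a typical job risks a second node picking up work that is still running.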

Monitoring & telemetry

What admins should watch for in the UI:

Upgrades

  • Recommended order:
    1. Disable the cluster (via Web);
    2. Upgrade the MAIN-role nodes;
    3. Upgrade the remaining nodes;
    4. Enable the cluster.

ISG Pro troubleshooting

  • No node becomes Leader: check DB connectivity from every node; check shared storage reachability; check iconik connectivity.
  • Jobs stuck "in progress" forever: check that visibility_timeout is sane; check that at least one node holds the required role; look for nodes that disappeared mid-job (the lease will expire).
  • Two nodes claim the same storage_gateway_id: visible in telemetry as multiple worker_ids for one gateway. Stop the duplicate.
  • PostgreSQL TLS errors: verify sslmode, certificate paths, and that PgBouncer (if used) is reachable.
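
If telemetry can be exported, the duplicate storage_gateway_id case can be spotted with a short script. This is a sketch only: the article does not describe the telemetry format, so the (storage_gateway_id, worker_id) pair layout and the sample values below are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def duplicate_gateways(records: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each storage_gateway_id seen with more than one worker_id to those ids."""
    workers: dict[str, set[str]] = defaultdict(set)
    for storage_gateway_id, worker_id in records:
        workers[storage_gateway_id].add(worker_id)
    return {gw: sorted(ids) for gw, ids in workers.items() if len(ids) > 1}


# Hypothetical telemetry rows: (storage_gateway_id, worker_id).
telemetry = [
    ("gw-1", "worker-a"),
    ("gw-1", "worker-a"),
    ("gw-2", "worker-b"),
    ("gw-2", "worker-c"),
]
print(duplicate_gateways(telemetry))  # {'gw-2': ['worker-b', 'worker-c']}
```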