Bash & Shell Scripting for Engineers

Common Production Failures

Bash scripts pass local tests. They pass CI. They run fine on your laptop. They fail in production in ways that look impossible. Almost every "impossible" failure traces back to one of a handful of environmental differences between the author's machine and the machine the script ran on.

This lesson is a catalog of those failures: the shell that runs the script isn't the one you wrote for, the arg list grew too long, the locale changed case-comparison behavior, a Windows editor added \r\n. Recognize them once, diagnose them in seconds forever after.

KEY CONCEPT

Most "Bash worked on my machine" bugs are environmental. Different shell, different Bash version, different locale, different filesystem. Know what to check.


Failure 1: /bin/sh is not Bash

You wrote:

#!/bin/sh
arr=(a b c)
echo "${arr[0]}"

On Debian/Ubuntu /bin/sh is dash, not Bash. Arrays don't exist in POSIX sh. You get:

script.sh: 2: Syntax error: "(" unexpected

Fix: use the right shebang.

#!/usr/bin/env bash

#!/usr/bin/env bash finds Bash in $PATH. #!/bin/bash hardcodes a path (which may not exist on Alpine, BusyBox, etc.). The env form is more portable across distros.

POSIX-vs-Bash distinctions to remember

Things only Bash has:

  • [[ ... ]] (POSIX only has [ ... ])
  • Arrays (both indexed and associative)
  • ${var^^}, ${var,,} (case conversion)
  • =~ regex matching
  • (( )) arithmetic
  • local (not in POSIX, although dash and several other sh implementations accept it as an extension)
  • $'...' C-style escape strings
  • Process substitution <(cmd), >(cmd)

If your script uses any of these, the shebang must be bash, not sh.

Detecting which shell is actually running

echo "$BASH_VERSION"    # empty if not bash
echo "$ZSH_VERSION"     # empty if not zsh
echo "$0"               # script path when run as a script; the shell's name only in an interactive shell

Useful during debugging a "works on my machine" report.
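As a rough sketch, the checks above can be wrapped in one helper that is safe to paste into any script under investigation. The function name detect_shell is made up for this example, and it is written in plain POSIX syntax so that it still parses when the interpreter turns out not to be Bash.

```shell
# Hypothetical helper: report which shell is interpreting this code.
# Uses only POSIX syntax so dash, bash, and zsh can all parse it.
detect_shell() {
  if [ -n "${BASH_VERSION:-}" ]; then
    echo "bash ${BASH_VERSION}"
  elif [ -n "${ZSH_VERSION:-}" ]; then
    echo "zsh ${ZSH_VERSION}"
  else
    echo "other (possibly a POSIX sh such as dash)"
  fi
}

detect_shell
```

Calling it as the first line of a failing script shows quickly whether the shebang was honored.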


Failure 2: macOS Bash is 3.2

macOS ships Bash 3.2 as /bin/bash for licensing reasons: Bash 4 and later are GPLv3, which Apple does not ship. Your Linux server has Bash 5+. Features that work on the server fail on developer laptops.

Missing from Bash 3.2:

  • Associative arrays (declare -A)
  • ${var^^}, ${var,,}
  • mapfile / readarray
  • ;& and ;;& in case
  • BASH_COMPAT
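When requiring Bash 4+ is not an option, two of the missing features have simple stand-ins. The sketch below is one possible approach, not the only one; the helper names to_upper and read_lines and the array name file_lines are invented for illustration.

```shell
# Stand-ins that avoid Bash 4+ syntax. Helper names are hypothetical.

# Replacement for ${var^^}: upper-case a string with tr.
to_upper() {
  printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

# Replacement for `mapfile -t file_lines < FILE`.
# Fills the global array file_lines, one element per line.
read_lines() {
  file_lines=()
  local line
  while IFS= read -r line || [[ -n "$line" ]]; do
    file_lines+=("$line")
  done < "$1"
}

to_upper "hello"    # HELLO
```

The trade-off is speed: to_upper forks a tr process per call, which matters in tight loops but not for occasional use.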

The fix

#!/usr/bin/env bash
# Require Bash 4+
if (( BASH_VERSINFO[0] < 4 )); then
  echo "error: requires Bash 4+ (you have $BASH_VERSION)" >&2
  echo "on macOS: brew install bash" >&2
  exit 1
fi

Or, for developers who use macOS, install a newer Bash via Homebrew:

brew install bash
# then use #!/usr/bin/env bash, which picks up the Homebrew one if its bin directory comes first in PATH

Failure 3: argument list too long

You try:

rm /tmp/*.tmp

And get:

/bin/rm: Argument list too long

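The reason: the glob expands to more filenames than the kernel will accept in a single exec call. On modern Linux the combined size of the arguments and the environment is capped at a quarter of the stack limit (typically about 2 MB; check with getconf ARG_MAX), and any single argument is limited to 128 KB. A large enough directory blows past that.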

Fixes

Fix 1: xargs. Let xargs split the list into multiple invocations:

# Find files, pass to rm in batches
find /tmp -maxdepth 1 -name '*.tmp' -print0 | xargs -0 rm

-print0 and -0 use null-delimited input, which is safe for any filename.

Fix 2: find -delete. No subprocess overhead at all:

find /tmp -maxdepth 1 -name '*.tmp' -delete

Fix 3: -exec with +. find batches invocations:

find /tmp -maxdepth 1 -name '*.tmp' -exec rm -- {} +

The + at the end (instead of \;) tells find to batch many args per rm call.
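The limit applies when the kernel starts a new program, so shell builtins are not subject to it. As a sketch of a fourth option that keeps the glob, the builtin printf can emit the names NUL-delimited for xargs. The directory and file names below are made up for the demo.

```shell
# Demo in a throwaway directory (names are illustrative only).
workdir=$(mktemp -d)
touch "$workdir/a.tmp" "$workdir/b c.tmp" "$workdir/keep.txt"

# The limit for argv plus environment, in bytes, if getconf is available.
getconf ARG_MAX 2>/dev/null || echo "getconf not available"

# printf is a builtin: the glob expands inside the shell, with no exec
# and therefore no argument-size limit. xargs -0 then splits the
# NUL-delimited names across as many rm calls as needed.
printf '%s\0' "$workdir"/*.tmp | xargs -0 rm --

remaining=$(ls "$workdir")
echo "$remaining"    # keep.txt
```

This assumes at least one file matches; with no match the literal pattern is passed to rm, which then reports a missing file.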

WARNING

Never "fix" arg-list-too-long by piping ls through xargs:

ls *.tmp | xargs rm        # BROKEN: ls *.tmp hits the same limit, and xargs splits names on whitespace
ls *.tmp | xargs -I{} rm {}   # same limit, and slow: one rm call per file

Use find -print0 | xargs -0 or find -delete.


Failure 4: locale-dependent behavior

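# With LANG=C, [A-Z] is exactly the 26 ASCII uppercase letters
[[ "$name" =~ ^[A-Z]+$ ]]    # matches "HELLO", rejects "hello"

# With LANG=en_US.UTF-8, ranges follow the locale's collation order;
# depending on the system, [A-Z] can also match lowercase or accented letters
[[ "$name" =~ ^[A-Z]+$ ]]    # matches "HELLO", may match more

# With LANG=en_US.UTF-8
sort < file.txt    # interleaves cases, e.g. a, A, b, B (tie-break varies by libc)

# With LANG=C
sort < file.txt    # sorts A < B < a < b (ASCII order)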

Different locales change:

  • Sort order (case-insensitive vs ASCII).
  • Bracket ranges ([a-z] may or may not match uppercase or accented letters).
  • Number formatting (1,234.56 vs 1.234,56).
  • Month/day names (date output).

The fix: force a locale

For scripts that need deterministic behavior:

#!/usr/bin/env bash
export LC_ALL=C    # ASCII-only, locale-independent

# Now sort, grep, regex, date — all behave the same everywhere

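LC_ALL=C is the "safe" locale: byte order, no locale-dependent transformations. Use it for any script that compares, sorts, or pattern-matches ASCII data. The trade-off is that tools then treat UTF-8 text as raw bytes, so character classes and case conversion stop recognizing non-ASCII letters.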
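A small sketch of the effect on sort, using made-up input. The assignment prefix scopes the override to that single sort call, which is an alternative to exporting LC_ALL for the whole script.

```shell
# Byte-order sort: every uppercase ASCII letter sorts before every
# lowercase one, regardless of the caller's locale.
c_sorted=$(printf '%s\n' b A a B | LC_ALL=C sort)
printf '%s\n' "$c_sorted"
# A
# B
# a
# b
```

Running the same pipeline without the LC_ALL=C prefix on two machines is a quick way to confirm whether locale explains a difference in output.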

PRO TIP

In CI/CD, always export LC_ALL=C at the top of scripts. Otherwise your script can work on the developer's laptop but fail in a different container that uses a different locale.


Failure 5: CRLF line endings

You wrote the script on Windows (or edited it with a Windows editor that auto-converted line endings). On Linux:

$ ./script.sh
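/usr/bin/env: 'bash\r': No such file or directory

The \r (displayed as ^M by some tools) is a carriage return appended to every line. The kernel hands "bash\r" to env, which looks for a program with that exact name and doesn't find one. With a #!/bin/bash shebang the symptom is "/bin/bash^M: bad interpreter" instead.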

Detecting

file script.sh
# script.sh: Bourne-Again shell script, ASCII text executable, with CRLF line terminators

Or see the \r:

cat -A script.sh | head
# #!/usr/bin/env bash^M$
# set -euo pipefail^M$

^M at line ends is \r. $ is the end of line.

Fix

# Remove \r from a file (GNU sed syntax; BSD/macOS sed differs)
sed -i 's/\r$//' script.sh

# Or with dos2unix if installed
dos2unix script.sh

# Prevent it at source: in .gitattributes
# *.sh text eol=lf

Prevention

.editorconfig in your repo:

[*.sh]
end_of_line = lf

And .gitattributes:

*.sh text eol=lf

These coerce editors and Git to use LF for shell scripts, regardless of the developer's OS.
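For repositories where those settings arrive too late, a check in a pre-commit hook or CI step can catch stray carriage returns. The following is a minimal sketch; has_crlf and strip_cr are hypothetical helper names, and strip_cr removes every carriage return in the file, not only those at line ends.

```shell
# Hypothetical helpers for a pre-commit hook or CI step.

# Succeeds if the file contains at least one carriage return.
has_crlf() {
  grep -q "$(printf '\r')" "$1"
}

# Removes every carriage return, rewriting the file in place so that
# its permissions (including the executable bit) are preserved.
strip_cr() {
  local tmp status
  tmp=$(mktemp) || return 1
  tr -d '\r' < "$1" > "$tmp" && cat "$tmp" > "$1"
  status=$?
  rm -f "$tmp"
  return "$status"
}

# Demo on a throwaway file
demo=$(mktemp)
printf 'echo hi\r\n' > "$demo"
has_crlf "$demo" && echo "CRLF found"
strip_cr "$demo"
has_crlf "$demo" || echo "clean"
```

Unlike sed -i, this uses only tr and cat, so it behaves the same on GNU and BSD systems.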


Failure 6: PATH differences between environments

A script works in a developer's shell but not in cron:

$ crontab -e
* * * * * /home/alice/scripts/backup.sh

# backup.sh uses `aws` from ~/.local/bin/aws (not in cron's default PATH)

Cron typically has a minimal PATH (/usr/bin:/bin). ~/.local/bin isn't there, so aws isn't found.

Fix 1: set PATH explicitly

#!/usr/bin/env bash
export PATH="/usr/local/bin:/usr/bin:/bin:$HOME/.local/bin:$PATH"
# ... use tools ...

Fix 2: use full paths in scripts meant for cron

/home/alice/.local/bin/aws s3 cp ...

Less portable, but dependable.

Fix 3: check tools are available at script start

require_command() {
  local cmd
  # Check every argument, not just the first
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null || { echo "missing: $cmd" >&2; exit 1; }
  done
}

require_command aws jq kubectl

Fails fast with a clear error rather than a cryptic "command not found" later.
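To reproduce the problem before it reaches cron, a script can be run locally with a stripped environment. This sketch assumes the minimal PATH mentioned above; real cron configurations vary, and run_like_cron is a made-up name.

```shell
# Run a command with an empty environment except a cron-like PATH.
# env -i discards every inherited variable before setting the ones listed.
run_like_cron() {
  env -i PATH=/usr/bin:/bin "$@"
}

# Example: what PATH does a child process see?
run_like_cron sh -c 'echo "$PATH"'    # /usr/bin:/bin
```

A typical use would be run_like_cron /home/alice/scripts/backup.sh, which should fail the same way the cron job does.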


Failure 7: glob matching empty sets

for file in *.log; do
  gzip "$file"
done

If no .log files exist, the loop runs once with file literally equal to *.log. gzip *.log fails because the file doesn't exist.

Fix

shopt -s nullglob

for file in *.log; do
  gzip "$file"
done

nullglob makes unmatched globs expand to nothing (the loop body is skipped entirely).

Covered in the Parsing Order lesson; worth repeating because it bites everyone.
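Where changing shell options is undesirable (nullglob affects every glob in the script), an existence test inside the loop gives the same protection. A minimal sketch, using an empty temporary directory as the demo input:

```shell
# An empty directory: the glob below matches nothing.
workdir=$(mktemp -d)

count=0
for file in "$workdir"/*.log; do
  # With no match, $file is the literal pattern; skip it.
  [ -e "$file" ] || continue
  count=$((count + 1))
done

echo "$count"    # 0
```

This form also works in POSIX sh, where nullglob is not available.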


Failure 8: unset $HOME or $USER

Cron jobs, container init scripts, and systemd units often run with minimal environment:

$ sudo -u myapp /opt/myapp/start.sh
/opt/myapp/start.sh: line 5: cd: HOME not set

sudo doesn't always preserve $HOME. Cron has its own set of env vars. systemd services start with almost nothing.

Fix: set explicitly or derive

# Set HOME from passwd if needed
if [[ -z "${HOME:-}" ]]; then
  HOME=$(getent passwd "$(id -u)" | cut -d: -f6)
fi
export HOME

Or pass via systemd:

[Service]
Environment="HOME=/home/myapp"
Environment="PATH=/usr/local/bin:/usr/bin:/bin"
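The same defensive idea extends to other variables a script relies on. As a sketch, the := expansion assigns a fallback only when the variable is unset or empty; the fallback values here are assumptions that suit most Linux systems, not universal defaults.

```shell
# Fill in only what is missing; values already set by the caller win.
# id -un can fail when the UID has no passwd entry, so fall back to the number.
: "${USER:=$(id -un 2>/dev/null || id -u)}"
: "${TMPDIR:=/tmp}"
export USER TMPDIR

echo "USER=$USER TMPDIR=$TMPDIR"
```

The leading colon is the no-op builtin; it exists only so the expansion is evaluated without running its result as a command.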

Failure 9: integer overflow

big=$((1 << 62))
echo $(( big * 2 ))    # -9223372036854775808 (wrapped to negative)
echo $(( big * 4 ))    # 0 (wrapped all the way around)
# Bash wraps silently past 2^63 - 1: no error, no warning.

Bash uses signed 64-bit integers. Overflow gives wrong results, not an error. For financial calculations, timestamps-in-nanoseconds, or anything else near the limit, be aware.

Fix

# Use bc for arbitrary-precision arithmetic
echo "2^100" | bc

# Or awk for a larger range (double-precision floats, so only about 15-16 significant digits are exact)
awk 'BEGIN { print 2^100 }'

Or just: don't do big math in Bash.
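If a multiplication has to stay in Bash, the overflow can be detected before it happens by dividing the maximum by one operand. The sketch below handles non-negative operands only, which is an assumption made to keep it short; checked_mul is a hypothetical name.

```shell
INT64_MAX=9223372036854775807

# Multiply two non-negative integers, failing instead of wrapping.
# Negative operands are NOT handled by this sketch.
checked_mul() {
  local a=$1 b=$2
  if (( a > 0 && b > 0 && a > INT64_MAX / b )); then
    echo "overflow: $a * $b" >&2
    return 1
  fi
  echo $(( a * b ))
}

checked_mul 6 7                      # 42
checked_mul $((1 << 62)) 4 || echo "refused"
```

The division runs first, so no intermediate value ever exceeds the 64-bit range.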


Failure 10: subshell state loss (revisited)

You've seen this one. It's worth a reminder:

count=0
find . -name '*.txt' | while read -r f; do
  count=$((count + 1))
done
echo "count: $count"    # always 0

The while runs in a subshell; the increment is lost.

Fixes:

# Fix 1: process substitution
count=0
while read -r f; do
  count=$((count + 1))
done < <(find . -name '*.txt')

# Fix 2: shopt lastpipe (Bash 4.2+; only takes effect when job control is off, as in scripts)
shopt -s lastpipe
count=0
find . -name '*.txt' | while read -r f; do
  count=$((count + 1))
done

# Fix 3: don't loop at all; collect the results into an array (Bash 4+)
mapfile -t files < <(find . -name '*.txt')
count=${#files[@]}

Failure 11: pipes hiding errors

# Without pipefail
cat nonexistent.log | grep ERROR | wc -l
echo $?    # 0  — because wc succeeded

The pipeline succeeds even though cat failed. The outputs of grep (zero-length) and wc (printing "0") look normal.

Always set -o pipefail. Covered in the set-flags lesson.
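A short sketch of the difference, using false and true as stand-ins for a failing and a succeeding command. PIPESTATUS is a Bash array holding the exit status of each stage of the most recent pipeline.

```shell
# Without pipefail, only the last command's status counts.
false | true
without=$?
echo "without pipefail: $without"    # 0

# With pipefail, the rightmost non-zero status wins.
set -o pipefail
false | true
with=$?
echo "with pipefail: $with"          # 1

# PIPESTATUS shows which stage failed; read it immediately after the pipeline.
false | true | true
statuses="${PIPESTATUS[*]}"
echo "per-stage: $statuses"          # 1 0 0
```

One side effect to expect: grep exits 1 when it finds no match, so under pipefail a pipeline containing grep reports failure on empty results.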


Failure 12: Trailing newlines in command substitution

expected="hello"
got=$(echo hello)
[[ "$expected" == "$got" ]] && echo "match" || echo "no match"
# match — because $() strips trailing newlines

# But:
expected="hello"
got=$(printf 'hello\n\n\n')
[[ "$expected" == "$got" ]] && echo "match" || echo "no match"
# Still match. $() strips ALL trailing newlines, not just one.

Usually what you want. Occasionally a surprise when the captured value was supposed to keep its trailing newlines.
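When the trailing newlines do matter, one common workaround is to print a sentinel character after the real output and strip it afterwards. A minimal sketch, with x as an arbitrary sentinel:

```shell
# The sentinel is the last character, so the newlines before it survive.
got=$(printf 'hello\n\n'; printf 'x')

# Remove the sentinel, leaving the original output intact.
got=${got%x}

echo "${#got}"    # 7: five letters plus two newlines
```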


Failure 13: read loop ends early on files missing final newline

while read -r line; do
  process "$line"
done < file.txt

If file.txt doesn't end with \n, the last line is not processed.

Fix

while IFS= read -r line || [[ -n "$line" ]]; do
  process "$line"
done < file.txt

|| [[ -n "$line" ]] runs the body one more time if read hit EOF but the partial line is non-empty.
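The difference is easy to verify on a three-line file whose last line lacks a newline. This sketch counts iterations for both loop forms; the file contents are made up for the demo.

```shell
# Three lines, no newline after the last one.
tmp=$(mktemp)
printf 'one\ntwo\nthree' > "$tmp"

# Naive loop: read returns non-zero on the final partial line.
naive=0
while IFS= read -r line; do
  naive=$((naive + 1))
done < "$tmp"

# Guarded loop: the partial line is still in $line, so process it.
fixed=0
while IFS= read -r line || [ -n "$line" ]; do
  fixed=$((fixed + 1))
done < "$tmp"

echo "naive=$naive fixed=$fixed"    # naive=2 fixed=3
rm -f "$tmp"
```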


A debugging checklist

When a script fails in production but not locally:

1. Shebang: Is /bin/sh or /bin/bash what you expect on the target?
2. Bash version: Is it 3.2 (macOS)? Are you using 4+ features?
3. PATH: Is every tool your script uses actually in the target's PATH?
4. Locale: LANG, LC_ALL same as dev? Export LC_ALL=C for determinism.
5. Line endings: file script.sh. Does it say CRLF?
6. Environment: HOME, USER, TMPDIR all set as expected? env output reveals surprises.
7. Run under set -x: When in doubt, DEBUG=true and see exactly what Bash ran.
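Several of these checks can be gathered into one report printed at the start of a failing script. This is a sketch only: env_report is a hypothetical name, and it assumes DEBUG=true is the convention for turning on set -x.

```shell
# Print the environment facts from the checklist in one place.
env_report() {
  echo "shell:   ${BASH_VERSION:-not bash}"
  echo "PATH:    $PATH"
  echo "locale:  LANG=${LANG:-unset} LC_ALL=${LC_ALL:-unset}"
  echo "HOME:    ${HOME:-unset}"
  echo "USER:    ${USER:-unset}"
  echo "TMPDIR:  ${TMPDIR:-unset}"
}

# Assumed convention: DEBUG=true enables command tracing.
if [ "${DEBUG:-false}" = true ]; then
  env_report >&2
  set -x
fi
```

Comparing the report from the developer's machine with the one from the target usually narrows the search to one or two lines.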

Quiz

KNOWLEDGE CHECK

Your Bash script works on your Mac but fails with mysterious errors on a Linux CI runner. Both have Bash installed. Which single issue is the most likely cause?


What to take away

  • Use #!/usr/bin/env bash. Never assume /bin/sh is Bash. Handle macOS's Bash 3.2 gotcha with an explicit version check or with Homebrew bash.
  • argument list too long — use find -print0 | xargs -0 or find -delete.
  • export LC_ALL=C at the top of scripts for locale-independent sort, regex, comparison.
  • CRLF line endings: .editorconfig + .gitattributes + dos2unix. Prevent at source.
  • PATH in cron/systemd is minimal. Set explicitly or use full paths.
  • Glob zero-match: shopt -s nullglob at the top.
  • set -o pipefail to catch mid-pipeline failures.
  • Files without trailing newline: while IFS= read -r line || [[ -n "$line" ]].
  • When something fails mysteriously, run with DEBUG=true and trace — usually the cause becomes obvious.

Next lesson: ShellCheck — the linter that catches these bugs before you ship them.