
Lib/shlex.py

cpython 3.14 @ ab2d84fe1023/Lib/shlex.py

shlex provides a class and a set of module-level helpers for splitting strings using shell-like tokenization rules. The shlex class is a stateful lexer that reads from any file-like stream or a string, tracks line numbers, supports an inclusion stack (so one source can push another), and produces tokens one at a time via get_token(). The module also exposes split(), join(), and quote() as convenient top-level functions for the common case of splitting or safely quoting a single string.
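As a quick orientation, the sketch below exercises the three module-level helpers and the class form side by side. The command string is an arbitrary illustration, not something taken from the module itself.

```python
import shlex

# Split a command line using POSIX shell rules (quotes are removed).
command = "cp 'my file.txt' /tmp/backup"
args = shlex.split(command)
print(args)      # ['cp', 'my file.txt', '/tmp/backup']

# join() quotes each argument only where needed, so this round trip is lossless.
rejoined = shlex.join(args)
print(rejoined)  # cp 'my file.txt' /tmp/backup

# The class form hands out tokens one at a time; in POSIX mode EOF is None.
lexer = shlex.shlex("alpha 'beta gamma'", posix=True)
tokens = []
while (tok := lexer.get_token()) is not None:
    tokens.append(tok)
print(tokens)    # ['alpha', 'beta gamma']
```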

Map

| Lines | Symbol | Role |
|---|---|---|
| 1-12 | module header | Docstring, contributor credits, imports |
| 13 | __all__ | Exports shlex, split, quote, join |
| 15-303 | shlex | Stateful lexer class |
| 17-64 | shlex.__init__ | Initializes stream, character-class attributes, state machine, and punctuation mode |
| 66-68 | shlex.punctuation_chars | Read-only property exposing the frozen punctuation set |
| 70-74 | shlex.push_token | Pushes a pre-formed token onto the pushback deque |
| 76-88 | shlex.push_source | Saves current stream state and switches to a new input stream |
| 90-97 | shlex.pop_source | Restores the previous stream from the file stack |
| 99-129 | shlex.get_token | Returns next token, handling pushback, source inclusions, and EOF |
| 131-275 | shlex.read_token | Core state-machine loop that reads characters and transitions between states |
| 277-285 | shlex.sourcehook | Resolves a filename for source inclusion (cpp-like relative paths) |
| 287-293 | shlex.error_leader | Formats an Emacs-friendly error prefix with filename and line number |
| 295-302 | shlex.__iter__, shlex.__next__ | Iterator protocol wrapping get_token() |
| 304-312 | split | Module-level convenience: splits a string in POSIX mode with whitespace splitting |
| 315-317 | join | Module-level convenience: joins a list of strings into a safely-quoted shell command |
| 320-338 | quote | Returns a shell-safe quoted version of a single string |
| 341-351 | _print_tokens, __main__ | Debug helper and self-test entry point |

Reading

The state machine in read_token

read_token() is the heart of the module. It drives a large while True loop from a single state value, self.state. The key states are: None (past end of input), ' ' (whitespace, between tokens), 'a' (accumulating a word), 'c' (accumulating a punctuation-char token), a quote character such as '"' or "'" (inside a quoted string), and an escape character such as '\\' (escape sequence). Transitions happen one character at a time. When POSIX mode is off, quote characters are kept in the returned token; in POSIX mode they are stripped and the quoted content is merged into the surrounding word.

# CPython: Lib/shlex.py:131 shlex.read_token
def read_token(self):
    quoted = False
    escapedstate = ' '
    while True:
        if self.punctuation_chars and self._pushback_chars:
            nextchar = self._pushback_chars.pop()
        else:
            nextchar = self.instream.read(1)
        if nextchar == '\n':
            self.lineno += 1
        ...
        if self.state is None:
            self.token = ''  # past end of file
            break
        elif self.state == ' ':
            ...
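The effect of the quote states is easiest to see by running the same input through both modes. The following is a small sketch with made-up input strings; it only uses the public shlex class and split().

```python
import shlex

text = 'say "hello world" now'

# Non-POSIX mode (the class default): quote characters stay in the token.
non_posix = list(shlex.shlex(text))
print(non_posix)  # ['say', '"hello world"', 'now']

# POSIX mode: quotes are stripped from the token.
posix = list(shlex.shlex(text, posix=True))
print(posix)      # ['say', 'hello world', 'now']

# POSIX mode also merges quoted content into the surrounding word,
# because a closing quote returns the machine to the word state 'a'.
merged = shlex.split('a"b c"d')
print(merged)     # ['ab cd']
```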

Source inclusion stack

The push_source() and pop_source() methods implement a depth-first inclusion mechanism analogous to the C preprocessor #include. When the lexer sees a token equal to self.source (a user-configurable string, not set by default), get_token() reads the next token as a filename, calls sourcehook() to open it, and pushes it. At EOF of the inner stream, pop_source() closes it and restores the outer stream. This allows scripts that parse config files with an include directive to use shlex directly without writing their own stack.

# CPython: Lib/shlex.py:76 shlex.push_source
def push_source(self, newstream, newfile=None):
    "Push an input source onto the lexer's input source stack."
    if isinstance(newstream, str):
        newstream = StringIO(newstream)
    self.filestack.appendleft((self.infile, self.instream, self.lineno))
    self.infile = newfile
    self.instream = newstream
    self.lineno = 1
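A minimal sketch of the inclusion stack in use follows. The first part calls push_source() by hand; the second sets the source attribute and overrides sourcehook() so that included "files" come from memory. InMemoryLexer, its FILES table, and the name extra.conf are hypothetical names invented for this illustration, not part of shlex.

```python
import io
import shlex

# Part 1: push a second source manually, midway through the outer one.
lexer = shlex.shlex("outer1 outer2", posix=True)
first = lexer.get_token()                       # 'outer1'
lexer.push_source("inner1 inner2", newfile="<inner>")
rest = list(lexer)                              # inner tokens, then the outer resumes
print(first, rest)  # outer1 ['inner1', 'inner2', 'outer2']


# Part 2: let get_token() drive the inclusion via the `source` keyword.
class InMemoryLexer(shlex.shlex):
    """Hypothetical subclass: resolves include names from a dict, not disk."""

    FILES = {"extra.conf": "x y"}               # assumed sample content

    def sourcehook(self, newfile):
        # Must return a (name, stream) pair, like the stock sourcehook().
        return newfile, io.StringIO(self.FILES[newfile])


inc = InMemoryLexer("a include extra.conf b", posix=True)
inc.whitespace_split = True                     # keep 'extra.conf' as one token
inc.source = "include"                          # token that triggers inclusion
included = list(inc)
print(included)     # ['a', 'x', 'y', 'b']
```

Overriding sourcehook() keeps the example free of filesystem access; the stock method opens the named file instead.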

quote and the safe-char fast path

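quote() must handle arbitrary Python strings, including those with Unicode characters. In the revision shown here, the fast path first checks s.isascii(), then encodes the string to bytes and calls bytes.translate(None, delete=safe_chars) to delete every safe character; if nothing is left, the string contains only safe ASCII characters and is returned unchanged. Otherwise the function wraps the string in single quotes and rewrites each embedded single quote as '"'"' (close the quoted run, emit a double-quoted single quote, reopen the quoted run). The isinstance guard raises TypeError for non-string input, but only for truthy values: the emptiness check runs first, so a falsy value such as None or 0 returns two single quotes without reaching the guard.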

# CPython: Lib/shlex.py:320 quote
def quote(s):
    """Return a shell-escaped version of the string *s*."""
    if not s:
        return "''"

    if not isinstance(s, str):
        raise TypeError(f"expected string object, got {type(s).__name__!r}")

    safe_chars = (b'%+,-./0123456789:=@'
                  b'ABCDEFGHIJKLMNOPQRSTUVWXYZ_'
                  b'abcdefghijklmnopqrstuvwxyz')
    if s.isascii() and not s.encode().translate(None, delete=safe_chars):
        return s
    return "'" + s.replace("'", "'\"'\"'") + "'"
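The sketch below shows what quote() returns for a few representative inputs, and isolates the deletion-table check in a standalone helper. The helper only_safe_ascii and the constant SAFE_CHARS are names made up for this illustration; the byte string is copied from the listing above.

```python
import shlex

plain = shlex.quote("report.txt")     # only safe characters: returned as-is
spaced = shlex.quote("my file.txt")   # space is unsafe: wrapped in single quotes
apostrophe = shlex.quote("it's")      # embedded ' becomes '"'"'
empty = shlex.quote("")               # empty string becomes two single quotes
accented = shlex.quote("café")        # non-ASCII input is always quoted
print(plain, spaced, apostrophe, empty, accented)

# Quoting is designed to survive a round trip through split().
round_trip = shlex.split(shlex.quote("it's"))

# Standalone version of the fast-path test (hypothetical helper name).
SAFE_CHARS = (b'%+,-./0123456789:=@'
              b'ABCDEFGHIJKLMNOPQRSTUVWXYZ_'
              b'abcdefghijklmnopqrstuvwxyz')


def only_safe_ascii(s: str) -> bool:
    # Delete every safe byte; an empty result means nothing unsafe remained.
    return s.isascii() and not s.encode().translate(None, delete=SAFE_CHARS)


# Truthy non-strings are rejected with TypeError.
try:
    shlex.quote(42)
    rejected = False
except TypeError:
    rejected = True
```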

punctuation_chars mode

When punctuation_chars is truthy, the lexer activates an additional state 'c' for runs of punctuation characters. The parameter defaults to False; passing True selects the set ();<>|&, and a custom string can be passed instead. In this mode wordchars is first extended with ~-./*?= (characters common in file names, arguments, and wildcards), and the punctuation characters are then removed from wordchars so they accumulate separately and form their own tokens rather than being merged into adjacent words. A secondary _pushback_chars deque is used alongside the main pushback deque, because a punctuation token may consume a character that belongs to the next word token and must put it back at the character level rather than the token level.
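To make the difference concrete, the sketch below tokenizes the same strings with and without punctuation_chars=True. The sample command lines are invented for illustration.

```python
import shlex

# Without punctuation mode, each '&' comes back as its own one-character token.
default_tokens = list(shlex.shlex("a&&b", posix=True))
print(default_tokens)  # ['a', '&', '&', 'b']

# With punctuation mode, a run of punctuation characters forms one token.
punct_tokens = list(shlex.shlex("a&&b", posix=True, punctuation_chars=True))
print(punct_tokens)    # ['a', '&&', 'b']

# The mode also widens wordchars, so '-l' and 'out.txt' stay whole.
lexer = shlex.shlex("ls -l | grep py>>out.txt; echo done",
                    posix=True, punctuation_chars=True)
active_set = lexer.punctuation_chars   # read-only property
pipeline = list(lexer)
print(active_set)      # ();<>|&
print(pipeline)
```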

gopy notes

Porting shlex to Go requires representing the state machine's state variable as a rune (Go's character type). The filestack and pushback deques map to Go container/list or simple slices. StringIO wrapping for string inputs maps to strings.NewReader. The sourcehook method that opens files would need a configurable hook interface in Go rather than an overridable method. The quote() fast path using bytes.translate can be replicated in Go with a strings.IndexFunc over a rune predicate.

CPython 3.14 changes

No functional changes were made to shlex in 3.14. The module's public API (split, join, quote, shlex) and its state-machine behavior are stable. The quote() fast-path using bytes.translate() was introduced in 3.12 and carries over unchanged.