Strings and Regular Expressions in PHP, or
"PCRE, POSIX, and Bears, Oh My!"
UPHPU Meeting
January 18, 2005
Who am I?
Campaign Promises
- Intro to Strings in PHP
- (Feel free to tell me how fast or slow to go)
- Functions relating to HTML, SQL, etc.
- Regular Expressions
- Performance/Speed considerations
- Grab bag of cool string functions
Introducing: Strings in PHP
- Much like strings in any other language
- Major difference: Boundary between string, integer, float, and
boolean is very blurred
- Actually a benefit: if it's not a string, but should be, it will be
- Though this can lead to some unexpected results
- Info in PHP Manual:
String Syntax
- Single quotes: 'a string'
- No variable interpolation; \' and \\ are the only escape codes
- Double quotes: "a $better string\n"
- Variables work, standard escape codes work
- "Here-doc" syntax: $foo = <<<END ... END;
- Great for large multi-line blocks of text or html
- Variables are interpolated
- Gotchas: newline must follow <<<END
- END; must be the entire line, with no whitespace
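A quick sketch of the three syntaxes side by side. The variable $name and the text are made up for illustration.

```php
<?php
// Illustrative value only; $name is a made-up variable.
$name = 'Mac';

$single = 'Hello $name\n';   // no interpolation; \n stays as two characters
$double = "Hello $name\n";   // $name is replaced; \n becomes a newline

// Here-doc: a newline follows <<<END, and the closing END; sits alone
// at the very start of its own line.
$heredoc = <<<END
Dear $name,
Welcome to the meeting.
END;

echo $single, "\n", $double, $heredoc, "\n";
```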
String Operators
- Array-like character access:
- $str = "MyBigString" => $str{2} == "B" (offsets start at 0; $str[2] works too)
- Concatenation: the dot operator
- "This lets you join strings into ". "bigger ones"
- Note: Avoiding embedded newlines "in strings that wrap onto
multiple lines" is a good idea
- Concatenating Assignment : .=
- $str = "My name is"; $str .= " Mac.\n";
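The operators above in one small sketch. It uses square brackets for character access, since the curly-brace form was removed in PHP 8; the variable names are only for illustration.

```php
<?php
$str = "MyBigString";

// Offsets start at 0, so offset 2 is the "B".
$third = $str[2];

// The dot operator joins strings; splitting a long literal this way
// avoids an embedded newline in the middle of the text.
$joined = "This lets you join strings into " .
          "bigger ones";

// .= appends to an existing string.
$intro = "My name is";
$intro .= " Mac.\n";

echo $third, "\n", $joined, "\n", $intro;
```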
Variables in Strings
- "Simple string with a $var in it\n"
- "You can use $an_array[$var] too\n"
- "Sometimes you need ${curl}ies to mark where the {$var}iable ends"
- "Curlies help on {$big['fancy'][$stuff]} too"
- "Where it's confusing to embed ". $big['ugly'][$var].
"iables, break it up as needed with concatenation."
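A small sketch of the interpolation forms above. All variable names and values are hypothetical, and it sticks to the {$var} form because "${var}" is deprecated as of PHP 8.2.

```php
<?php
// Hypothetical data, for illustration only.
$var      = 'key';
$an_array = ['key' => 'value'];
$fruit    = 'apple';
$big      = ['fancy' => ['key' => 'nested value']];

$simple = "Simple string with a $var in it";
$array  = "You can use $an_array[$var] too";

// Braces mark where the variable name ends.
$plural = "Two {$fruit}s";

// Braces are required for multi-level array access.
$nested = "Curlies help on {$big['fancy'][$var]} too";

// When it gets hard to read, fall back to concatenation.
$concat = "Or break it up: " . $big['fancy'][$var] . ".";

echo $simple, "\n", $array, "\n", $plural, "\n", $nested, "\n", $concat, "\n";
```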
Must-Have String Functions
- www.php.net/strings
- echo/print - (print $foo)==1, echo "can", $take,"more than one","argument";
- Echo shortcut: <b><?=$foo?></b>
- trim, ltrim, rtrim/chop - remove whitespace
- explode, implode/join
- $arr = explode(" ", "List of words");
- $str = implode(",",$arr);
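Putting trim, explode, and implode together, as a minimal sketch with made-up input:

```php
<?php
$raw = "  List of words \n";

$clean = trim($raw);            // strips whitespace from both ends
$arr   = explode(" ", $clean);  // split on single spaces
$str   = implode(",", $arr);    // glue back together with commas

$printed = print "";            // print always returns 1

printf("%d words: %s\n", count($arr), $str);
```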
Obligatory C-like Functions
- All your old favorites are in there:
- printf, sprintf, sscanf, fprintf
- strcmp, strlen, strpos, strtok
- They all do just what you expect, though many of them have
easier alternatives
- Gotcha: Some of them (like strpos and friends) return boolean
false when nothing is found, because 0 is a valid position. Always
test with "=== false".
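The strpos gotcha in a few lines. This is only a sketch; the variable names are invented for the example.

```php
<?php
$haystack = "foobar";

$pos     = strpos($haystack, "foo");  // found at offset 0
$missing = strpos($haystack, "xyz");  // not found

// Wrong: 0 == false is true, so a match at offset 0 looks like a miss.
$looseSaysMissing = ($pos == false);

// Right: only a real boolean false means "not found".
$strictSaysMissing = ($pos === false);

var_dump($pos, $missing, $looseSaysMissing, $strictSaysMissing);
```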
Basic String Manipulation
- Any of this can be done with regular expressions as well...
- and in more complex cases, can only be done with regular expressions
- But regular expressions are slower (more later)
- str_replace("bar","baz","foobar");
- str_repeat("1234567890",8);
Formatting functions
- strtolower, strtoupper
- ucfirst, ucwords - uppercase first char, or first char of each word
- wordwrap
- wrap text to a given width
- str_pad("tooshort",15," ");
- vprintf, vfprintf, vsprintf - printf family, but the arguments come in an array
- number_format - add thousands grouping
- money_format - format as currency
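A few of the formatting functions in action, as a sketch with arbitrary sample strings:

```php
<?php
$title   = ucwords("strings and regular expressions");
$first   = ucfirst("hello world");
$upper   = strtoupper("php");

// Wrap at 10 columns, breaking lines with "\n".
$wrapped = wordwrap("The quick brown fox", 10, "\n");

// Pad on the right with spaces until the string is 15 characters long.
$padded  = str_pad("tooshort", 15, " ");

// Two decimals, with the default "." and "," separators.
$number  = number_format(1234567.891, 2);

// vsprintf is sprintf with the arguments passed as an array.
$line    = vsprintf("%s has %d items", ["cart", 3]);

echo $title, "\n", $wrapped, "\n", "[$padded]", "\n", $number, "\n", $line, "\n";
```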
Special-Purpose Functions
- One of PHP's strengths is the way it caters to the common things
people need
- Many string functions are specifically for use with things like
dates/times, URLs, HTML, and SQL databases
- Advice: When you need them, use them. "Rolling your own" doesn't
usually work out the way you plan it.
Date and Time Functions
- www.php.net/datetime
- A variety of functions to not only do calculations with dates,
but to convert dates to strings - date(), strftime()
- And more importantly, to convert strings to dates -
strtotime(), strptime()
- Great example of why not to "roll your own", even if it doesn't
seem that complex at first
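A short sketch of going from string to timestamp and back. The timezone is pinned to UTC here only so the result is predictable; the date strings are examples.

```php
<?php
date_default_timezone_set('UTC');   // pinned so the output is predictable

// String -> timestamp
$ts = strtotime("2005-01-18 19:00:00");

// Timestamp -> formatted string
$pretty = date("l, F j, Y", $ts);

// Relative formats are where strtotime really earns its keep.
$nextWeek = date("Y-m-d", strtotime("+1 week", $ts));

echo $pretty, "\n", $nextWeek, "\n";
```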
URL Functions
- www.php.net/url
- urlencode, urldecode
- Turn non-alphanumerics to %[hex] and ' '->'+'
- rawurl{en,de}code do the same except for '+'
- parse_url - break into host, path, query, etc.
- http_build_query - turn array to URL query
- base64_{en,de}code - base64 conversions for use with MIME, etc.
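The URL helpers side by side, as a sketch. The URL and the query values are placeholders, and the separator is passed to http_build_query explicitly so the result does not depend on ini settings.

```php
<?php
$encoded = urlencode("a b&c");      // space becomes +
$raw     = rawurlencode("a b&c");   // space becomes %20
$decoded = urldecode($encoded);

// Placeholder URL for illustration.
$parts = parse_url("http://www.example.com/path/page.php?x=1&y=2");

// Array -> query string, with "&" given explicitly as the separator.
$query = http_build_query(["x" => 1, "y" => "two words"], "", "&");

$b64 = base64_encode("Hello");

echo $encoded, "\n", $raw, "\n", $parts['host'], "\n", $query, "\n", $b64, "\n";
```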
HTML Functions
- htmlspecialchars - encode &, ", <, and >
with &amp;, &quot;, &lt;, and &gt;
- htmlentities is the same, but for every character that has an HTML entity
- html_entity_decode is the reverse
- nl2br - insert <br /> tags before newlines (\n)
- parse_str - parse GET query into variables or an array (see also: extract)
- strip_tags - strip html tags [selectively]
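A sketch of the HTML helpers on some made-up input:

```php
<?php
$input = '<b>Tom & "Jerry"</b>';

$safe     = htmlspecialchars($input);    // safe to echo into a page
$original = html_entity_decode($safe);   // and back again

// nl2br keeps the newline and puts a <br /> in front of it.
$withBreaks = nl2br("line one\nline two");

// Strip every tag except <b>.
$stripped = strip_tags('<p>Hi <b>there</b></p>', '<b>');

// Parse a query string into an array (second argument).
parse_str('name=Mac&lang=php', $vars);

echo $safe, "\n", $withBreaks, "\n", $stripped, "\n";
print_r($vars);
```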
SQL Functions
- "Magic Quotes" - on by default
- Misnamed - adds magic slashes, not quotes
- addslashes, stripslashes - escape ', ", and \
- Advice: do db queries first, then use $var =
htmlspecialchars(stripslashes($input)) for use in
<input value='$var'> tags
- quotemeta - escape . \ + * ? [ ^ ] ( $ )
- For commands (system() and `backticks`), use escapeshellarg and
escapeshellcmd instead
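To see what the slashes actually do to a string, here is a small sketch. The input value is invented, and it only shows the escaping itself, not a full database round trip.

```php
<?php
$input = "O'Reilly";   // made-up form input

$escaped  = addslashes($input);       // adds a backslash before the quote
$restored = stripslashes($escaped);   // removes it again

// Encoding for use inside <input value='...'>; ENT_QUOTES also
// encodes the single quote.
$forForm = htmlspecialchars($restored, ENT_QUOTES);

// quotemeta backslash-escapes its fixed list of metacharacters.
$meta = quotemeta("1+1=2?");

echo $escaped, "\n", $restored, "\n", $forForm, "\n", $meta, "\n";
```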
Now for the fun stuff...
Regular Expressions
- Extremely powerful tool for pattern matching - the same idea compilers
and interpreters use to pick apart the tokens in your programs
- Two flavors in PHP:
- PCRE - Perl-Compatible Regular Expressions
- POSIX Extended
- I favor PCRE - multiple languages, more features, faster, and binary-safe
Basics of RE's
- They match patterns - the magic is in the pattern you tell them to match
- They have to be precise, including and excluding exactly what you want
- People get scared of them because the details can be tricky
- But they're one of the best tools you have for doing some pretty
fancy string stuff
RE Patterns
- Start with strings and grouping: "abc(def)"
- Add alternative branches: "abc(def|123)"
- Wildcard: . matches any char but \n
- Quantifiers/Repeating:
- * = "0 or more", + = "1 or more", ? = "0 or 1"
- {n} = "n times", {n,m} = "n to m times"
- "(abc)+(def|123)*(.{2})*"
- At least one abc, maybe some def or 123 triplets, then an even number of characters
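Trying the example pattern out with preg_match. One assumption on my part: the pattern is wrapped in ^ and $ here so that the whole string has to fit it, otherwise it would happily match just the leading "abc". The test strings are made up.

```php
<?php
// The slide's pattern, anchored with ^...$ (an addition for this sketch)
// so the entire string must match.
$pattern = '/^(abc)+(def|123)*(.{2})*$/';

$r1 = preg_match($pattern, "abcdef12");   // abc, def, then 2 chars
$r2 = preg_match($pattern, "abcabc123");  // abc twice, then 123
$r3 = preg_match($pattern, "abcx");       // one leftover char: odd count
$r4 = preg_match($pattern, "xyz");        // no leading abc

var_dump($r1, $r2, $r3, $r4);
```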
Character Classes and Types
- [] makes character classes
- List of characters and ranges: [a-zA-Z0-9]
- If you want a literal -, put it at the beginning (or end)
- Escape any special chars with \ as usual
- If first char is ^, class is negated
- \d = [0-9], \D = [^0-9]
- \s = whitespace, \S = non-whitespace
- \w = [a-zA-Z0-9_], \W = [^a-zA-Z0-9_]
- \b = word boundary - "zero-width assertion"
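Character classes and types are easiest to see in replace operations. A sketch with invented sample strings:

```php
<?php
// \D = anything that is not a digit: strip it to keep only the digits.
$digits = preg_replace('/\D/', '', '(801) 555-1234');

// \s+ = a run of whitespace: collapse each run to one space.
$spaces = preg_replace('/\s+/', ' ', "too   many\t\tgaps");

// Negated class with a literal - placed first.
$slug = preg_replace('/[^-a-z0-9]/', '', 'my-page_1!');

// \b only matches where a word starts or ends, so "concat" does not count.
$cats = preg_match_all('/\bcat\b/', 'cat concat cat.', $m);

echo $digits, "\n", $spaces, "\n", $slug, "\n", $cats, "\n";
```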
Anchors
- What if you want to force it to match only at the beginning of
the string? Or to match the entire string?
- Use an anchor!
- ^ as the first char anchors the beginning
- $ as the last char anchors the end
- (Varies slightly in multi-line mode)
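The effect of anchors, sketched with a trivial pattern and made-up subjects:

```php
<?php
$anywhere  = preg_match('/foo/',   'a foo b');  // matches in the middle
$atStart   = preg_match('/^foo/',  'a foo b');  // must start the string
$atStartOk = preg_match('/^foo/',  'foo b');
$whole     = preg_match('/^foo$/', 'foo');      // must be the entire string
$wholeNo   = preg_match('/^foo$/', 'foo b');

// In multi-line mode (/m), ^ and $ also match at each line break.
$lines = preg_match_all('/^\w+$/m', "one\ntwo\nthree", $m);

var_dump($anywhere, $atStart, $atStartOk, $whole, $wholeNo, $lines);
```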
Greediness and Modifiers
- Regular Expressions are Greedy
- They'll keep eating characters as long as they can keep matching.
- Consider: "<.*>" vs. "<[^>]*>" when matching against "<b>Hi</b>"
- PCRE has modifiers: /<pattern>/<mods>
- /i = case insensitive
- /U = un-greedy
- /m = multi-line
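The greedy example from above, run for real. This is a sketch; the only thing added is capturing what each pattern actually grabbed.

```php
<?php
$html = "<b>Hi</b>";

// Greedy: .* runs to the last > it can find.
preg_match('/<.*>/', $html, $greedy);

// Negated class: stops at the first >.
preg_match('/<[^>]*>/', $html, $careful);

// /U flips the quantifiers to un-greedy.
preg_match('/<.*>/U', $html, $ungreedy);

// /i ignores case.
$caseless = preg_match('/HI/i', $html);

echo $greedy[0], "\n", $careful[0], "\n", $ungreedy[0], "\n";
```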
Back References
- Most commonly used in replace operations, but can be used in match
patterns as well
- Parentheses not only group, but capture too
- Use \ followed by the number of the capture
- "ab(.)\1(.)\2" will match abccdd or abxxyy, but not abcccd or abdcdc
- Can get tricky to count which backref goes where with nested parentheses
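Back references in a match pattern and in a replacement, as a sketch. Note the single-quoted pattern: inside double quotes PHP would read "\1" as an octal character escape before PCRE ever saw it.

```php
<?php
// Single quotes keep \1 and \2 intact for the regex engine.
$pattern = '/ab(.)\1(.)\2/';

$m1 = preg_match($pattern, 'abccdd');
$m2 = preg_match($pattern, 'abxxyy');
$m3 = preg_match($pattern, 'abcccd');
$m4 = preg_match($pattern, 'abdcdc');

// In the replacement string, captures are written $1, $2, ...
$swapped = preg_replace('/(\w+) (\w+)/', '$2 $1', 'hello world');

var_dump($m1, $m2, $m3, $m4, $swapped);
```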
Modifiers for Parentheses
- PCRE Only - makes some things possible that otherwise couldn't be done
- Non-capturing grouping: (?: )
- Can simplify back-reference counting
- Look-ahead Assertions:
- They don't advance the matching position
- Positive: (?= ), or Negative: (?! )
- Very powerful, but not always easy to understand. Trial and error
can be your friend!
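A sketch of non-capturing groups and look-aheads, with invented sample text:

```php
<?php
// (?: ) groups without capturing, so the name is still capture 1.
preg_match('/(?:Mr|Ms)\. (\w+)/', 'Ms. Smith', $name);

// Positive look-ahead: digits only if " dollars" follows,
// but " dollars" is not part of the match.
preg_match('/\d+(?= dollars)/', 'costs 100 dollars', $amount);

// Negative look-ahead: "foo" only if "bar" does not follow.
$ok  = preg_match('/^foo(?!bar)/', 'foobaz');
$bad = preg_match('/^foo(?!bar)/', 'foobar');

echo $name[1], "\n", $amount[0], "\n";
var_dump($ok, $bad);
```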
PCRE Specifics
- www.php.net/pcre
- preg_match, preg_match_all, preg_replace, preg_split,
preg_grep (filter an array)
- Perl RE's have a delimiter, usually /, but can be anything:
- preg_match("/foo/",$bar);
- preg_match("%/usr/local/bin/%",$path);
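One call to each of the preg_ functions, as a quick sketch. The subject strings are made up.

```php
<?php
$found = preg_match("/foo/", "seafood");

// A different delimiter saves escaping every slash in a path.
$inBin = preg_match("%/usr/local/bin/%", "/usr/local/bin/php");

// All matches, not just the first.
$count = preg_match_all('/\d+/', 'a1 b22 c333', $nums);

// Split on runs of whitespace and/or commas.
$parts = preg_split('/[\s,]+/', "a, b  c,d");

// Keep only the array elements that match (original keys are preserved).
$numeric = array_values(preg_grep('/^\d+$/', ['1', 'x', '22']));

$replaced = preg_replace('/o+/', '0', 'foo boo');

print_r($nums[0]);
print_r($parts);
print_r($numeric);
echo $replaced, "\n";
```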
POSIX Specifics
- www.php.net/regex
- ereg, ereg_replace, split, eregi, spliti, etc.
- [Only] Advantage over PCRE: It doesn't require the PCRE library to
be installed, so it's always there in any PHP installation
- Other regex engines support this specification, though the Perl
style seems to be more popular.
Almost there...
Performance/Speed
- Rule of thumb: use the simplest function that will get the job done right
- strpos instead of strstr (when you only need to know if it's there)
- str_replace instead of preg_replace
- And so forth...
- The PHP manual online usually includes notes about speed differences
- PCRE is faster than POSIX Regex
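The simple and the heavyweight version of the same job, next to each other. This sketch only shows that they give the same answer; it makes no timing claims, so measure on your own data if the difference matters.

```php
<?php
$subject = "foobar";

// Same result, but str_replace skips the regex engine entirely.
$simple = str_replace("bar", "baz", $subject);
$regex  = preg_replace('/bar/', 'baz', $subject);

// "Is it in there?" three ways, simplest first.
$viaStrpos = strpos($subject, "bar") !== false;
$viaStrstr = strstr($subject, "bar") !== false;   // also builds the substring
$viaPreg   = preg_match('/bar/', $subject) === 1;

var_dump($simple, $regex, $viaStrpos, $viaStrstr, $viaPreg);
```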
Grab Bag
- md5, md5_file - Calculate md5 hashes
- Handy for checksums and fingerprints; for passwords, prefer password_hash
- levenshtein, similar_text - calculate the "similarity" of two strings
- metaphone, soundex - calculate how similar two strings sound when
spoken out loud
- str_rot13 - ROT13 "encryption" (fun obfuscation, not security)
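A taste of the grab bag, as a sketch with arbitrary sample words:

```php
<?php
$hash = md5("hello");                        // 32 hex characters

$edits  = levenshtein("kitten", "sitting");  // number of single-char edits
$common = similar_text("World", "Word");     // number of matching chars

// Names that sound alike get the same soundex key.
$soundsAlike = soundex("Robert") === soundex("Rupert");
$meta = metaphone("Thompson");

// Applying ROT13 twice gets you back where you started.
$rot   = str_rot13("Hello");
$again = str_rot13($rot);

echo $hash, "\n", $edits, "\n", $common, "\n", $meta, "\n", $rot, "\n";
```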
Grab Bag 2
- str_shuffle - words are much more fun once they've been randomized
- count_chars, str_word_count - statistics about your strings
- strrev - if it doesn't make sense forward, try it backwards
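And the second grab bag, sketched the same way. The reversing function is spelled strrev in PHP; the sample strings are arbitrary.

```php
<?php
$word     = "words";
$shuffled = str_shuffle($word);   // same letters, random order

$wordCount = str_word_count("words are much more fun");

// Mode 1: byte value => count, only for bytes that occur.
$counts = count_chars("banana", 1);

// Mode 3: a string of the distinct characters used.
$unique = count_chars("banana", 3);

$backwards = strrev("stressed");

echo $shuffled, "\n", $wordCount, "\n", $unique, "\n", $backwards, "\n";
```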
Grand Finale
Group Practice
- 8.3 filenames - anything but zip files
- /^.{0,8}(\.[^z][^i]?[^p]?)?$/i - fails filename.ftp
- /^.{0,8}\.(?!zip$).{0,3}$/i - negative look-ahead, PCRE only
- Sometimes easier to match rejects rather than keepers
- Apache access log example:
- 4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] "GET /robots.txt
HTTP/1.0" 404 337 "-" "Holmes/1.0"
- preg_match("/^(\d{1,3}(?:\.\d{1,3}){3}) ". #IP
"- - \[(.+)\] \"\w+ (\S+) (\S+)\" (\d+) (\d+) ".
"\"-\" \"([^\"]*)\"$/",$row,$matches);
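Both practice problems as one runnable sketch. The patterns are single-quoted here so the double quotes in the log format need no escaping, the first one writes its non-capturing group as (?: ), and the look-ahead pattern is one possible way to reject zip files, not the only one. Variable names are made up.

```php
<?php
// --- 8.3 filenames, anything but zip ---
$naive = '/^.{0,8}(\.[^z][^i]?[^p]?)?$/i';   // character-by-character attempt
$look  = '/^.{0,8}\.(?!zip$).{0,3}$/i';      // negative look-ahead

$naiveFtp = preg_match($naive, 'filename.ftp');  // wrongly rejected
$lookFtp  = preg_match($look,  'filename.ftp');
$lookTxt  = preg_match($look,  'README.txt');
$lookZip  = preg_match($look,  'ARCHIVE.ZIP');

// --- Apache access log line ---
$row = '4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] ' .
       '"GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0"';

$pattern = '/^(\d{1,3}(?:\.\d{1,3}){3}) ' .      // 1: IP address
           '- - \[(.+)\] "\w+ (\S+) (\S+)" ' .    // 2: time, 3: path, 4: protocol
           '(\d+) (\d+) ' .                       // 5: status, 6: size
           '"-" "([^"]*)"$/';                     // 7: user agent

$ok = preg_match($pattern, $row, $matches);

var_dump($naiveFtp, $lookFtp, $lookTxt, $lookZip, $ok);
print_r($matches);
```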