PHP Introduction | Text and Date Handling | Regular Expressions

When working with text, one unavoidable topic is “regular expressions.” A regular expression describes a rule for how text is arranged, called a “pattern,” and uses that pattern to find or replace text.

Ordinary search and replacement can only find fixed text. For example, it can search for the word “PHP.” By contrast, regular expressions can check whether text is arranged according to a certain rule. For example, you can search for “three-digit numbers” or find “text that starts with a and ends with s.”
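
As a rough illustration of the difference, the sketch below compares a fixed-text search with two pattern searches. The sample sentences and variable names are made up for this example, and `preg_match_all` is explained later in this section.

```php
<?php
$text = "PHP 8 was released in 2020; room 101 is on floor 3.";

// Fixed-text search: finds only the literal string "PHP"
$hasPhp = strpos($text, "PHP") !== false;

// Pattern search: a run of exactly three digits standing alone
// (2020 is skipped because it has four digits)
preg_match_all('/\b\d{3}\b/', $text, $threeDigits);

// Pattern search: words that start with "a" and end with "s"
preg_match_all('/\ba\w*s\b/', "apples and bananas, always", $words);

var_dump($hasPhp);          // bool(true)
print_r($threeDigits[0]);   // one entry: 101
print_r($words[0]);         // two entries: apples, always
```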

Regular expression patterns are assembled by using special symbols called “meta characters.” Many symbols and rules are available. The following lists summarize them.

Basic Meta Characters

Symbol	Description
\	General escape character
^	Start of the search target, or start of a line in multiline mode
$	End of the search target, or end of a line in multiline mode
.	Matches any character except a newline
[]	Character class
|	Starts an alternative
()	Subpattern
?	Extends the meaning of `(`, repeats 0 or 1 time, or makes the preceding repetition match as little as possible
*	Repeat 0 or more times
+	Repeat 1 or more times
{}	Specify minimum and maximum repetitions
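
To get a feel for the symbols in the table above, here is a small sketch that tries one pattern per symbol. The sample strings are invented for illustration, and every check is expected to report a match.

```php
<?php
// Each entry is [pattern, subject]; all of them are illustrative samples.
$checks = [
    ['/^PHP/',      'PHP is fun'],  // ^ anchors at the start
    ['/fun$/',      'PHP is fun'],  // $ anchors at the end
    ['/c.t/',       'cat'],         // . matches any one character
    ['/gr[ae]y/',   'grey'],        // [] character class: a or e
    ['/cat|dog/',   'hot dog'],     // | alternative
    ['/(ab)+/',     'ababab'],      // () subpattern, repeated by +
    ['/colou?r/',   'color'],       // ? makes the "u" optional
    ['/go*gle/',    'ggle'],        // * allows zero or more "o"
    ['/^\d{2,4}$/', '123'],         // {} between 2 and 4 digits
    ['/^3\.14$/',   '3.14'],        // \ escapes the dot
];

$results = [];
foreach ($checks as [$pattern, $subject]) {
    $results[] = preg_match($pattern, $subject);
    printf("%-14s %-12s %s\n", $pattern, $subject, end($results) ? 'match' : 'no match');
}

// ? after a repetition makes it match as little as possible
preg_match('/<.+>/',  '<b>bold</b>', $greedy);  // $greedy[0] is "<b>bold</b>"
preg_match('/<.+?>/', '<b>bold</b>', $lazy);    // $lazy[0] is "<b>"
```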

Special Characters Available in Character Classes

Meta Character	Description
\	Escape character
\b	Backspace
\f	Form feed
\n	New line
\r	Carriage return
\t	Tab character
\d	One digit, same as `[0-9]`
\s	Whitespace character, such as a space, newline, carriage return, or tab
\w	Numbers, letters, and underscore, same as `[a-zA-Z0-9_]`
^	Negation
-	Range

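Except for ^, -, and \b, these have the same meaning outside character classes. Outside a class, ^ marks the start of the search target, - is an ordinary character, and \b means a word boundary instead of a backspace.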
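
The following sketch shows negation, ranges, and the shorthand classes at work. The input strings are made up, and `preg_split` is a function not otherwise covered in this section; it is used here only because it makes the effect of a class easy to see.

```php
<?php
// [^...] negates the class: remove everything that is not a digit
$digitsOnly = preg_replace('/[^0-9]/', '', 'Tel: 02-1234-5678');

// 0-9 and a-f are ranges; {6} repeats the class exactly six times
$isHexColor = preg_match('/^#[0-9a-f]{6}$/', '#1a2b3c');

// \s inside a class: split on any run of whitespace or commas
$parts = preg_split('/[\s,]+/', "red, green\tblue");

echo $digitsOnly, "\n";   // 0212345678
echo $isHexColor, "\n";   // 1
print_r($parts);          // red, green, blue
```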

Special Characters Used Outside Character Classes

Meta Character	Description
\a	Alert
\c	Control character, with the following character specified
\e	Escape
\D	One non-digit character, same as `[^0-9]`
\S	Any non-whitespace character, same as `[^ \f\n\r\t\v]`
\W	Anything other than numbers, letters, and underscore
\b	Word boundary
\B	Non-word boundary
\A	Start of text
\Z	End of text, or just before a newline at the very end of the text
\z	End of text
\ddd	Character represented by octal `ddd`
\xhh	Character represented by hexadecimal `hh`
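
A few of these are easier to understand by trying them. The sketch below uses invented sample strings to show the word boundary, the non-digit class, the difference between \Z and \z, and a hexadecimal character code.

```php
<?php
// \b is a word boundary here: "cat" inside "concat" is not counted
$wholeWords = preg_match_all('/\bcat\b/', 'cat concat cat.', $m);

// \D matches one non-digit, so removing it leaves only the digits
$digits = preg_replace('/\D/', '', 'a1b2c3');

// \Z tolerates one trailing newline, \z does not
$bigZ   = preg_match('/end\Z/', "the end\n");
$smallZ = preg_match('/end\z/', "the end\n");

// \x41 is the character with hexadecimal code 41, the letter "A"
$hex = preg_match('/\x41/', 'ABC');

var_dump($wholeWords, $digits, $bigZ, $smallZ, $hex);
// int(2), string(3) "123", int(1), int(0), int(1)
```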

Pattern Modifiers

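Modifier	Description
i	Match letters in the pattern regardless of uppercase or lowercase
m	Treat the target string as multiple lines, so `^` and `$` also match at each line break
s	Make `.` match every character, including newlines
x	Ignore whitespace characters in the pattern (unless escaped or inside a character class) and allow `#` comments
e	Evaluate the `preg_replace()` replacement as PHP code (deprecated in PHP 5.5 and removed in PHP 7.0; use `preg_replace_callback()` instead)
A	Match only at the beginning of the target string
D	Make `$` match only at the very end
S	Spend extra time analyzing the pattern so that repeated matching runs faster
U	Reverse shortest and longest match behavior of repetitions
X	Enable additional PCRE features that are incompatible with Perl
u	Treat the pattern and the target string as UTF-8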

These characters define the nature of the pattern and are not part of the pattern itself. They are written after the closing delimiter, as in `/php/i`.
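
The most commonly used modifiers can be tried as in the sketch below. The sample strings are invented, and the `"\u{E9}"` escape (the letter é) assumes PHP 7 or later.

```php
<?php
// i: ignore case
$i = preg_match('/php/i', 'I like PHP');

// m: ^ and $ also match at each line break, so all three lines match
$m = preg_match_all('/^\w+$/m', "one\ntwo\nthree", $lines);

// s: without it, . refuses to match the newline
$withoutS = preg_match('/a.b/',  "a\nb");
$withS    = preg_match('/a.b/s', "a\nb");

// x: whitespace and # comments inside the pattern are ignored
$x = preg_match('/
    (\d{4}) - (\d{2})   # year and month
/x', '2024-05', $ym);

// U: repetitions match as little as possible
preg_match('/a+/U', 'aaa', $short);

// u: one UTF-8 character counts as one character instead of two bytes
$bytes = preg_match('/^.$/',  "\u{E9}");
$utf8  = preg_match('/^.$/u', "\u{E9}");

var_dump($i, $m, $withoutS, $withS, $ym[1], $ym[2], $short[0], $bytes, $utf8);
// int(1), int(3), int(0), int(1), "2024", "05", "a", int(0), int(1)
```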

This does not mean that you must remember all of them. In practice, not many people use every symbol from memory. Even remembering a few commonly used symbols is enough to experience the power of regular expressions.

Several functions are available for using regular expressions. First, it is enough to understand the meanings of “pattern matching” and “replacement.”

Functions for Pattern Matching

$count = preg_match(pattern, text, $matches);
$count = preg_match_all(pattern, text, $matches);

These functions use the pattern in the first argument to inspect the text in the second argument. preg_match stops at the first match and returns 1 if the pattern matched and 0 if it did not, while preg_match_all finds every match and returns how many there were. Both return false if an error occurs.

The interesting part is the variable passed as the third argument. You do not put a value in it beforehand. Instead, the function fills it with the matching results. With preg_match it is an array holding the matched text followed by the text of each subpattern. With preg_match_all it is a multidimensional array that gathers the same information for every match.

You can also specify detailed flags for pattern matching as a fourth argument. Look into this when you begin using regular expressions seriously.
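
The difference between the two functions, and the effect of one of the fourth-argument flags, can be seen in a minimal sketch like the following. The sample text and variable names are assumptions made for this example.

```php
<?php
$text = 'Order 12 shipped, order 345 pending';

// preg_match stops at the first match and returns 1 or 0
$found = preg_match('/order (\d+)/i', $text, $first);
// $first[0] is the whole match, $first[1] is the first subpattern

// preg_match_all keeps going and returns the number of matches
$count = preg_match_all('/order (\d+)/i', $text, $all);
// $all[0] lists the whole matches, $all[1] lists the first subpattern

// Fourth argument: PREG_SET_ORDER groups the results per match instead
preg_match_all('/order (\d+)/i', $text, $sets, PREG_SET_ORDER);

print_r($first);  // "Order 12", "12"
print_r($all);    // ["Order 12", "order 345"], ["12", "345"]
print_r($sets);   // ["Order 12", "12"], ["order 345", "345"]
```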

Function for Replacement

$variable = preg_replace(pattern, replacement text, text);

This function performs replacement by using a regular expression. Pass the pattern as the first argument, the replacement text as the second argument, and the text to inspect as the third argument. The return value is the replaced text.
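
As a small sketch of what this looks like in practice, the example below cleans up whitespace and then rearranges a date. In the replacement text, `$1`, `$2`, and `$3` stand for the text captured by the first, second, and third subpatterns. The input strings are made up.

```php
<?php
// Collapse any run of whitespace into a single space
$clean = preg_replace('/\s+/', ' ', "too    many \t spaces");

// Reorder year-month-day into day/month/year using the subpatterns
$date = preg_replace('/(\d{4})-(\d{2})-(\d{2})/', '$3/$2/$1', 'Released 2024-05-17');

echo $clean, "\n";  // too many spaces
echo $date, "\n";   // Released 17/05/2024
```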

If you first learn these three functions, you can do many things. Pattern matching can select specific elements from text data, such as finding only <a> tag links or email addresses in HTML source code. If you can replace using regular expressions, you can perform complex replacement processing that ordinary replacement could never handle.
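
Both ideas can be sketched briefly. The first part pulls the link targets out of a made-up HTML fragment; a pattern this simple assumes double-quoted href attributes and will miss other forms, so treat it as an illustration and not a general HTML parser. The second part goes one step beyond the three functions above and uses `preg_replace_callback`, which passes each match to a function so the replacement can be computed.

```php
<?php
// Hypothetical HTML fragment used only for this example
$html = '<p><a href="https://example.com/">Home</a> and '
      . '<a class="x" href="/about">About</a></p>';

// Capture whatever sits between the quotes of each href attribute
preg_match_all('/<a\s[^>]*href="([^"]*)"/i', $html, $links);
print_r($links[1]);  // https://example.com/ and /about

// Computed replacement: double every number in the text
$doubled = preg_replace_callback(
    '/\d+/',
    function (array $m) {
        return (string) ((int) $m[0] * 2);
    },
    'buy 3 apples and 12 pears'
);
echo $doubled, "\n";  // buy 6 apples and 24 pears
```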

Regular Expression Example

Let us write and run an example that uses regular expressions.

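<?php
    // Defaults so the form renders cleanly before anything is submitted
    $url = "";
    $result = "";
    if ($_SERVER['REQUEST_METHOD'] === 'POST' && isset($_POST['text1'])){
        $url = $_POST['text1'];
        // Only fetch http(s) URLs; file() would otherwise read local files too
        $lines = preg_match('/^https?:\/\//i', $url) ? @file($url) : false;
        $data = ($lines !== false) ? implode("", $lines) : "";
        // Pattern matching
        $pattern = "/([\w-]+)@([\w\.-]+)\b/";
        $flg = preg_match_all($pattern, $data, $matchs);
        if ($flg > 0){
            // $matchs[0] holds every full match
            foreach($matchs[0] as $key => $val){
                $result .= $val . "\n";
            }
        } else {
            $result = "None.";
        }
    }
?>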
<!DOCTYPE html>
<html lang="en">
    <head> 
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> 
        <title>sample page</title>
    </head>
    <body>
        <h1>Hello PHP!</h1>
        <form method="post" action="./index.php">
            <input type="text" name="text1" size="40" value="<?php echo htmlspecialchars($url ?? ''); ?>"><br>
            <textarea name="area1" cols="30" rows="5"><?php echo htmlspecialchars($result ?? ''); ?></textarea><br>
            <input type="submit">
        </form>
        <hr>
    </body>
</html>

Here, when you enter a URL, the text of that page is obtained and only email addresses are found and displayed.

Enter the URL of the page you want to inspect, such as http://..., in the input field above and submit it. The email addresses written on that page will be found and displayed in the text area below.

Here, the file function first reads the page at the URL as an array of lines (this requires the allow_url_fopen setting to be enabled), and implode joins them into a single string. Then pattern matching is performed on that string.

$pattern = "/([\w-]+)@([\w\.-]+)\b/";

This is the prepared pattern. Many people see this and give up, thinking, “Regular expressions are too hard.” But honestly, there is no need to do that. If you search the Internet for regular expression patterns, many people publish commonly used patterns for email addresses, URLs, and similar data. You can copy and use those patterns, so even if you cannot create patterns yourself at first, you can still use regular expressions. Then you can study them little by little, and they may become useful later.
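
If you do want to read the pattern, one way is to spread it out with the x modifier so that each piece can carry a comment. The sketch below is the same pattern written that way, tested against two made-up strings. The second test shows a limit of this simple pattern: the part before the @ does not allow dots, so only the tail of such an address is found.

```php
<?php
// The same pattern, laid out with the x modifier for readability
$pattern = '/
    ([\w-]+)     # before the @: letters, digits, underscore, hyphen
    @            # a literal at sign
    ([\w\.-]+)   # after the @: the same characters plus dots
    \b           # stop at a word boundary
/x';

preg_match($pattern, 'Contact: info@example.com.', $m);
print_r($m);  // info@example.com, info, example.com

// Limitation: a dot before the @ cuts the address short
preg_match($pattern, 'first.last@example.com', $partial);
echo $partial[0], "\n";  // last@example.com
```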

$flg = preg_match_all($pattern, $data, $matchs);

This is the part that runs the pattern match. This line itself is not especially difficult. When it runs, the matching results are stored in the variable $matchs.

This variable has a fairly complex structure. It is an array of arrays. Index 0 holds a numerically indexed array of every full match, which in this case is the list of found email addresses. Index 1 and later hold the text captured by each subpattern in parentheses: here, index 1 lists the part before the @ and index 2 lists the part after it.

In short, with pattern matching, you can obtain all found text by looping over the array at index 0 of the result. If you keep just this in mind, you can use pattern matching to some extent.
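
To see the structure for yourself, a minimal sketch like the one below runs the same pattern on a short made-up string in place of a downloaded page and prints each address with its two subpattern captures.

```php
<?php
// Sample text standing in for the downloaded page
$data = "Mail alice@example.com or bob@example.org today.";
$pattern = "/([\w-]+)@([\w\.-]+)\b/";
preg_match_all($pattern, $data, $matchs);

// $matchs[0]: full matches, $matchs[1]: before the @, $matchs[2]: after the @
foreach ($matchs[0] as $i => $address) {
    echo $address, " (user: ", $matchs[1][$i], ", domain: ", $matchs[2][$i], ")\n";
}
// alice@example.com (user: alice, domain: example.com)
// bob@example.org (user: bob, domain: example.org)
```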
