Regex can be a powerful tool when dealing with dynamic data. That being said, it is difficult to craft a regular expression that does precisely what you want for every possible input. Unexpected edge cases or carefully crafted malicious payloads may bypass regex filtering, resulting in unexpected or unsafe behavior.
Regex Best Practices
Prioritize specialized packages or libraries that are designed to perform specific parsing or filtering functions, such as parsing HTML, validating email addresses, or sanitizing user input that may influence code functionality.
ex. Using Python's urllib.parse module over regex to filter out specific URL schemes
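For instance, a minimal sketch of scheme filtering with urllib.parse instead of a handwritten regex (the allowed-scheme set here is an illustrative assumption):

from urllib.parse import urlparse

# Illustrative allowlist; adjust to the schemes your application actually supports.
ALLOWED_SCHEMES = {"http", "https"}

def is_allowed_url(url: str) -> bool:
    # Returns True only when the URL parses cleanly and uses an allowed scheme.
    try:
        parsed = urlparse(url)
    except ValueError:
        return False
    return parsed.scheme.lower() in ALLOWED_SCHEMES and bool(parsed.netloc)

print(is_allowed_url("https://example.com/page"))  # True
print(is_allowed_url("javascript:alert(1)"))       # False
print(is_allowed_url("ftp://example.com/file"))    # False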
Use previously validated patterns for common use cases.
ex. OWASP's validation regex repository for URLs, IPs, usernames, and passwords.
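A sketch of how a vetted pattern might be applied; the IPv4 expression below is a commonly used one shown only for illustration, and the exact pattern should be copied from the OWASP repository (or another vetted source) rather than retyped:

import re

# Commonly used IPv4 pattern, shown for illustration; copy the exact expression
# from a vetted source such as OWASP's validation regex repository.
IPV4_PATTERN = re.compile(
    r"((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)\.){3}"
    r"(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
)

def is_ipv4(value: str) -> bool:
    # fullmatch anchors the pattern to the entire string.
    return IPV4_PATTERN.fullmatch(value) is not None

print(is_ipv4("192.168.0.1"))  # True
print(is_ipv4("256.1.1.1"))    # False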
Avoid using Evil Regex. These are regex patterns that get stuck in exponential backtracking on specially crafted inputs, causing excessive CPU usage and potential system downtime (a regular expression denial of service, or ReDoS).
ex. (a+)+, ([a-zA-Z]+)*, (.*a){x} for x > 10
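As a rough illustration of the danger, the sketch below times the (a+)+ pattern against inputs that can never match; the lengths are kept small on purpose, since matching time grows roughly exponentially with each extra character on CPython's backtracking engine:

import re
import time

# Evil pattern: the nested quantifiers let the engine split the run of 'a's
# between the inner and outer groups in exponentially many ways.
EVIL = re.compile(r"^(a+)+$")

for n in (20, 22, 24, 26):
    # A run of 'a' followed by '!' can never match, so every possible split
    # is tried before the engine gives up.
    payload = "a" * n + "!"
    start = time.perf_counter()
    EVIL.match(payload)
    print(f"n={n}: {time.perf_counter() - start:.3f}s")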
Use well-known and trusted tools for building, linting, validating, and testing regex.
ex. regex101, RegExr
Limit regex complexity. Overly complex regex can be difficult to create correctly and can lead to performance issues.
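One way to keep each expression small is to split the input first and validate the pieces with short, anchored patterns or plain string checks; a minimal sketch using a made-up "name:port" format:

import re

# Hypothetical "name:port" format, used purely for illustration.
NAME = re.compile(r"[a-z][a-z0-9-]{0,31}")

def is_valid_endpoint(value: str) -> bool:
    # Splitting first keeps each check small and easy to reason about,
    # compared with one monolithic pattern for the whole string.
    name, sep, port = value.partition(":")
    if not sep or not NAME.fullmatch(name):
        return False
    return port.isdigit() and 1 <= int(port) <= 65535

print(is_valid_endpoint("web-01:8080"))   # True
print(is_valid_endpoint("Web_01:99999"))  # False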
Use regex timeouts when available so that a runaway match cannot consume unbounded CPU time.
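Python's built-in re module has no timeout option, so one way to bound matching time is to run the match in a separate process and stop it at a deadline; a rough sketch (some third-party engines and other languages expose timeouts directly):

import multiprocessing
import re

def _match_worker(pattern, text, queue):
    # Runs in a child process so a runaway match can be terminated from outside.
    queue.put(re.fullmatch(pattern, text) is not None)

def match_with_timeout(pattern, text, seconds):
    # Returns True/False for the match, or None if the deadline was exceeded.
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_match_worker, args=(pattern, text, queue))
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()
        proc.join()
        return None
    return queue.get()

if __name__ == "__main__":
    print(match_with_timeout(r"(a+)+$", "a" * 40 + "!", 1.0))  # None (timed out)
    print(match_with_timeout(r"[a-z]+", "hello", 1.0))         # True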
Avoid using regex when the incoming data is unconstrained or from an unknown source.
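For untrusted input, a defensive sketch: cap the length up front and prefer simple string checks over regex where they suffice (the limit and character rules here are arbitrary placeholders):

MAX_INPUT_LENGTH = 256  # Arbitrary cap; tune it to the field being validated.

def safe_token_check(value: str) -> bool:
    # Reject oversized input outright so no pattern ever sees a huge payload.
    if len(value) > MAX_INPUT_LENGTH:
        return False
    # Plain string methods handle many cases without any regex at all.
    return value.isascii() and value.isalnum()

print(safe_token_check("abc123"))         # True
print(safe_token_check("a" * 10_000))     # False
print(safe_token_check("<script>alert"))  # False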