Text processing from a bird’s-eye view
This article was originally published on delphi3000.com, but the site did not survive.
Do you want to write a program that extracts weather forecasts, currency rates, e-mail addresses or anything else from HTML pages, e-mails or other unformatted sources? Do you need to import data into your database from an old database’s ugly export format? Or do you just want to make sure that an e-mail address the user entered is syntactically correct?
There are two ways.
The traditional way is to write a full-featured text parser. This is an awful piece of work! For example, try to implement the rules for recognizing an e-mail address. Simple code like
    p := Pos ('@', email);
    if (p > 1) and (p < length (email))
      then ...
does not catch many common errors. For example, users frequently forget to enter the domain part of the address. You would need much more complex code (just read the big article “Extended E-mail Address Verification and Correction”). Just think about writing and debugging that code.
The second way is to look at the text from a bird’s-eye view with the help of a regular expression engine. You don’t write the checking routine yourself; you just describe what the regexp engine must do for you. You can implement your application very quickly, and it will be robust and easy to change!
Unfortunately, the Delphi component palette contains no TRegularExpression component.
But there are some third-party implementations, for example my TRegExpr.
Example 1. How to check e-mail address syntax.
    if ExecRegExpr ('[\w\d\-\.]+@[\w\d\-]+(\.[\w\d\-]+)+', email)
      then ... gotcha! e-mail is valid ...
Do not forget to add the RegExpr unit (the home of TRegExpr) to the uses section of your unit.
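One detail worth checking is that ExecRegExpr reports a match found anywhere in the string, so the pattern above also accepts text that merely contains an address. The sketch below compares the unanchored pattern with a variant wrapped in ^ and $. It is written as a Free Pascal console program (an assumption; the article itself targets Delphi), and the helper names IsValidEmail and Report as well as the sample strings are made up for illustration.

```pascal
program EmailCheckDemo;

{$mode objfpc}{$H+}

uses
  RegExpr; // declares TRegExpr and ExecRegExpr

const
  // Pattern from the article, unchanged.
  EmailPattern = '[\w\d\-\.]+@[\w\d\-]+(\.[\w\d\-]+)+';

// Hypothetical helper: anchors the pattern so that the whole
// string, not just a part of it, must look like an address.
function IsValidEmail (const AEmail : string) : boolean;
begin
  Result := ExecRegExpr ('^' + EmailPattern + '$', AEmail);
end;

// Prints the result of both checks for one sample string.
procedure Report (const AText : string);
begin
  WriteLn ('"', AText, '": unanchored=', ExecRegExpr (EmailPattern, AText),
    ', anchored=', IsValidEmail (AText));
end;

begin
  Report ('john@example.com');
  Report ('john@example');              // domain without a dot
  Report ('write to john@example.com'); // address embedded in other text
end.
```

Which variant fits depends on the task: the anchored form suits validating an input field, while the unanchored form suits searching for addresses inside a larger text.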
Example 2. How to extract phone numbers from unformatted text (web pages, e-mails, etc.).
Suppose we need only St. Petersburg (Russia) phone numbers (city code 812).
    procedure ExtractPhones (const AText : string; APhones : TStrings);
    begin
      with TRegExpr.Create do
        try
          Expression := '(\+\d *)?(\((\d+)\) *)?(\d+(-\d*)*)';
          if Exec (AText) then
            repeat
              // Match [3] is the city code, Match [4] is the number itself
              if Match [3] = '812' then
                APhones.Add (Match [4]);
            until not ExecNext;
        finally
          Free;
        end;
    end;
For the input text
this procedure returns
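To try ExtractPhones outside a form, a small console program is enough. The sketch below assumes Free Pascal with its bundled RegExpr unit rather than Delphi, and the message text is an invented sample, not data from the article. It repeats the procedure so that the program is self-contained; in this expression, capture group 3 is the city code and group 4 is the local number.

```pascal
program ExtractPhonesDemo;

{$mode objfpc}{$H+}

uses
  Classes, RegExpr;

// Collects the local numbers of all phones whose city code is 812.
procedure ExtractPhones (const AText : string; APhones : TStrings);
begin
  with TRegExpr.Create do
    try
      Expression := '(\+\d *)?(\((\d+)\) *)?(\d+(-\d*)*)';
      if Exec (AText) then
        repeat
          if Match [3] = '812' then
            APhones.Add (Match [4]);
        until not ExecNext;
    finally
      Free;
    end;
end;

var
  Phones : TStringList;
  i : integer;
begin
  Phones := TStringList.Create;
  try
    // Made-up sample text: two numbers with city code 812, one with another code.
    ExtractPhones ('Call me at work (812)123-4567 or at home +7 (812) 12-345-67, '
      + 'the other office is (495)765-4321.', Phones);
    for i := 0 to Phones.Count - 1 do
      WriteLn (Phones [i]);
  finally
    Phones.Free;
  end;
end.
```

Passing a TStrings parameter keeps the procedure independent of where the results go: the same call works with a TStringList, as here, or with the Lines of a memo on a form.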
Example 3. Extracting a currency rate from a Russian bank web page.
Create a new project and place TBitBtn, TLabel and TNMHTTP components on the main form.
Add the following code as the BitBtn1 OnClick event handler (don’t mind the Russian letters; they are needed to parse the Russian web page):
    procedure TForm1.BitBtn1Click(Sender: TObject);
    const
      Template = '(?i)Официальный курс ЦБ по доллару'
        + '.*Дата\s*Курс\s*Курс пок.\s*Курс прод. [^<\d]*'
        + '(\d?\d)/(\d?\d)/(\d\d)\s*[\d.]+\s*([\d.]+)';
    begin
      NMHTTP1.Get ('http://win.www.citycat.ru/finance/finmarket/_CBR/');
      with TRegExpr.Create do
        try
          Expression := Template;
          if Exec (NMHTTP1.Body) then
            begin
              // Match [1]..[3] are the date parts, Match [4] is the rate
              Label1.Caption := Format ('Russian rouble rate %s.%s.%s: %s',
                [Match [2], Match [1], Match [3], Match [4]]);
            end;
        finally
          Free;
        end;
    end;
Now, when you click BitBtn1, the program connects to the specified web server and extracts the current rate.
Conclusion
“Free your mind” ((c) The Matrix ;)) and you’ll find many other tasks where regular expressions can save you an incredible amount of stupid coding work!