regular expressions usage



zak
22-04-2011, 08:56
I want to remind users about using regular expressions to search for patterns. Suppose you want to search pi for your birth year, 1902, followed by zero or more digits, then your wife's birth year, 1924, without overlapping matches. We will use this example as a template: VBRegExp_Test_MatchesAndCollections.tbasic in C:\thinBasic\SampleScripts\VBRegExp
The pattern is "1902.*?1924" and the text is the attached pi.txt (beware: it is one continuous run of up to 2 million digits with no line breaks, so Notepad or WordPad may hang on Windows XP. I am using the freeware Notepad++ from http://notepad-plus-plus.org/release/5.9 which can display such a file).
The search results are saved to a file, since the console can't display text that may be very large.
If the string to search is 34190242819244412441902234192456, then applying the regex 1902.*?1924 gives these matches:
19024281924
19022341924
The meaning of .*? in 1902.*?1924 is: . matches any single character (a digit or anything else), * means zero or more of the preceding item (here the .), and ? switches off the engine's greedy behaviour. Instead of stretching to the widest match possible, it stops at the shortest one.
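
To make the greedy/lazy difference concrete, here is a minimal sketch run on the sample string above. It is written in Python rather than thinBasic, on the assumption that Python's `re` module treats this particular pattern the same way the VBRegExp engine does; the variable names are only illustrative.

```python
import re

# Sample string from the post above.
text = "34190242819244412441902234192456"

# Lazy: .*? stops at the first 1924 it can reach after each 1902,
# so the scan finds two separate, non-overlapping matches.
lazy = re.findall(r"1902.*?1924", text)

# Greedy: .* runs on to the LAST 1924 in the text, giving one wide match.
greedy = re.findall(r"1902.*1924", text)

print(lazy)    # ['19024281924', '19022341924']
print(greedy)  # ['1902428192444124419022341924']
```

Dropping the ? is the only change between the two patterns, yet the greedy version swallows both years into a single match.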
Attached is the same VBRegExp_Test_MatchesAndCollections.tbasic, slightly modified, together with pi.txt. I attached pi.txt so you can experiment with a huge text, but you can use any text and any regex.

'---The following code illustrates how to run a regular expression over a large text
'---and how to access the individual members of the resulting Matches collection.
Uses "VBREGEXP", "file"

Dim lpRegExp As DWord
Dim lpMatches As DWord
Dim lpMatch As DWord
Dim strValue, sPi As String

'---Allocate a new regular expression instance
lpRegExp = VBREGEXP_New
'---Check if it was possible to allocate and if not stop the script
If IsFalse lpRegExp Then
  MsgBox 0, "Unable to create an instance of the RegExp object." & $CRLF & "Script terminated"
  Stop
End If

'---Load the text to search (pi.txt is expected next to the script)
sPi = FILE_Load(APP_SourcePath + "pi.txt")

'---Set pattern
VBRegExp_SetPattern lpRegExp, "1902.*?1924"
'---Set case insensitivity
VBREGEXP_SetIgnoreCase lpRegExp, -1
'---Set global applicability
VBRegExp_SetGlobal lpRegExp, -1
'---Execute search
lpMatches = VBRegExp_Execute(lpRegExp, sPi)
If IsFalse lpMatches Then
  MsgBox 0, "1. No match found"
Else

  Dim nCount As Long Value VBMatchCollection_GetCount(lpMatches)
  If nCount = 0 Then
    MsgBox 0, "2. No match found"
  Else
    '---Iterate the Matches collection
    Dim i As Long
    strValue += "Total matches found: " & nCount & $CRLF & String$(50, "-") & $CRLF
    For i = 1 To nCount
      lpMatch = VBMatchCollection_GetItem(lpMatches, i)
      If IsFalse lpMatch Then Exit For

      strValue += "Match number " & i & " found at position: " & VBMatch_GetFirstIndex(lpMatch) & " length: " & VBMatch_GetLength(lpMatch) & $CRLF
      strValue += "Value is: " & VBMatch_GetValue(lpMatch) & $CRLF
      strValue += "--------------" & $CRLF

      VBREGEXP_Release lpMatch

    Next

    '---The result can be too big for a message box or the console, so write it to a file
    'MsgBox 0, strValue
    'PrintL strValue
    FILE_Save(APP_SourcePath + "results.txt", strValue)
    MsgBox 0, "Results saved to results.txt"
  End If

End If

'---Release the objects that were allocated
If IsTrue lpMatches Then VBREGEXP_Release(lpMatches)
If IsTrue lpRegExp Then VBREGEXP_Release(lpRegExp)

JohnP
23-04-2011, 08:38
i want to remind the users about using regular expressions in searching for patterns...

...the pattern "1902.*?1924"...

...the meaning of .*? in 1902.*?1924 : . any char or digit, * zero or more of the previous (.) , and we put ? to suppress the greedy behaviour of the engine from searching to the widest pattern possible to search for the smallest patterns.



Hi Zak,

Thank you very much for the very useful example you gave.
It has re-awakened my interest in using RegEx for web-scraping (obtaining useful data from webpages). :D

I have re-written your example in the form of a simplified function, so that I can call it repeatedly (each time, with different start and ending strings, which frame the data of interest in the webpage) from the main section of the program.

However, I have realised that the .*? sequence will not let me match every character found in HTML code. For example, I think it won't match \ or " or * or & or < or several other symbols.

I have tried to read several RegEx textbooks on the web, but I find them very hard to understand. Could you possibly suggest a replacement for the .*? sequence which will match any symbol found in HTML, please?

regards
JohnP

zak
23-04-2011, 10:40
Hi JohnP
The previously attached example is not mine; it is from Eros, who included it in C:\thinBasic\SampleScripts\VBRegExp .
Regarding the special characters which can't be matched: there is an escape character, \ . When we insert it before a special character, that character is matched literally. As an example,
replace the text in pi.txt with the following:
eeeeeeexhttp://www.google.commmrtjjhttp://www.google.comewer

and in the code sample replace the corresponding line with:

VBRegExp_SetPattern lpRegExp, "http\:\/\/www\.google\.com"

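i.e. we want to use the pattern "http\:\/\/www\.google\.com"
Note that / : . are each preceded by \ so that they are treated as literal characters. (Of these, only . is actually a metacharacter; escaping / and : is harmless.)
When we run the code the result should be: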
Total matches found: 2
--------------------------------------------------
Match number 1 found at position: 9 length: 21
Value is: http://www.google.com
--------------
Match number 2 found at position: 36 length: 21
Value is: http://www.google.com
--------------
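
The effect of escaping can be checked with a small sketch. This is Python, not thinBasic, and it assumes Python's `re` module handles these escapes the same way VBRegExp does. `re.escape` is Python's own helper for inserting the backslashes. Positions are left out on purpose, since the two engines may not report them the same way.

```python
import re

# Test text from the post above.
text = "eeeeeeexhttp://www.google.commmrtjjhttp://www.google.comewer"

# re.escape inserts the backslashes for us, so each dot is matched literally.
pattern = re.escape("http://www.google.com")
matches = re.findall(pattern, text)
print(len(matches), matches[0])  # 2 http://www.google.com

# Without escaping, "." is a wildcard, so the pattern also accepts other characters there.
loose_hit = re.search("www.google.com", "wwwXgoogleYcom") is not None
strict_hit = re.search(r"www\.google\.com", "wwwXgoogleYcom") is not None
print(loose_hit, strict_hit)  # True False
```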

Yes, regular expressions can be hard at first. I know only a little about them, but once learned they are a very powerful tool.
The best introductory book is:
Sams Teach Yourself Regular Expressions in 10 Minutes
By Ben Forta
Note that most Perl regular expressions work in thinBasic, which uses the VBRegExp engine, with a few restrictions such as lookbehind (for example, matching "zxc" only if it is, or is not, preceded by a "y" character). So the tutorials available for Perl mostly apply here.
There is also a freeware program for testing regexes, Expresso, from:
http://www.ultrapico.com/Expresso.htm

PS: there is a long list of pattern meanings in the thinBasic help file. Put the cursor on VBREGEXP_New (line 13 of the original script) and press F1, then go back twice. That help page lists the pattern meanings, such as \d for a digit, \D for a non-digit, etc.
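
As a quick illustration of two of those pattern meanings, here is a Python sketch (again a stand-in for thinBasic, assuming \d and \D behave the same in both engines). The sample string "pi=3.14159" is made up for the demonstration.

```python
import re

text = "34190242819244412441902234192456"

# \d*? means "as few digits as possible", a stricter stand-in for .*? on digit-only text.
years = re.findall(r"1902\d*?1924", text)
print(years)  # ['19024281924', '19022341924']

# \d+ picks out runs of digits, \D+ picks out runs of non-digits.
digits = re.findall(r"\d+", "pi=3.14159")
non_digits = re.findall(r"\D+", "pi=3.14159")
print(digits)      # ['3', '14159']
print(non_digits)  # ['pi=', '.']
```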

JohnP
23-04-2011, 23:19
Zak,

Thank you for a really full response to my request for help.

Your explanation has revealed one of my misunderstandings and led me to a workable solution.

I now have a couple of functions which work well; one inserts escape characters into my selected Start and End marker strings (which bracket the wanted data), where needed, and the other uses RegEx, with those Start & End marker strings, to extract the wanted data from the webpage. The data is stored in a 2-D string array and can easily be displayed complete, or selected items extracted as required. As you say, it works very quickly.
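
The two functions themselves are not posted in this thread, so the following is only a guess at the general idea, sketched in Python rather than thinBasic. The names `escape_marker` and `extract_between`, the sample HTML, and the use of a plain list instead of a 2-D array are all assumptions made for illustration.

```python
import re

def escape_marker(marker: str) -> str:
    """Put a backslash in front of every regex metacharacter in marker."""
    return re.sub(r"([\\^$.|?*+()\[\]{}])", r"\\\1", marker)

def extract_between(page: str, start: str, end: str) -> list:
    """Return each shortest run of text found between start and end."""
    pattern = escape_marker(start) + "(.*?)" + escape_marker(end)
    # DOTALL lets "." cross line breaks, which real HTML usually contains.
    return re.findall(pattern, page, flags=re.DOTALL)

# Hypothetical page fragment: two table cells holding the wanted data.
html = '<td class="p">12.50</td>\n<td class="p">7.25</td>'
prices = extract_between(html, '<td class="p">', "</td>")
print(prices)  # ['12.50', '7.25']

# Markers that contain metacharacters are handled by the escaping step.
print(escape_marker("a.b*c"))  # a\.b\*c
```

The lazy (.*?) in the middle is what keeps each result limited to one cell instead of running from the first start marker to the last end marker.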

It's a really useful start to re-working some of my existing programs.

Thanks again for your time,

JohnP

zak
24-04-2011, 18:23
Testing the primality of numbers is the last thing I expected to be possible using regular expressions. I found out how to do it here
http://www.noulakaz.net/weblog/2007/03/18/a-regular-expression-to-check-for-prime-numbers/
and here
http://montreal.pm.org/tech/neil_kandalgaonkar.shtml
The method is described in the first link, with a program in Ruby. Attached is the thinBasic version. The program converts the number to a string of that many '1' characters, so 5 becomes "11111", and the regex to run is ^1?$|^(11+?)\1+$
I admit I still don't understand the regex and will try to relearn the subject.
When you try a number like 123479 it takes several seconds, so I guess it would take forever for bigger numbers.
Put your number in the yourNumber variable.
As a template I used the same thinBasic example found in C:\thinBasic\SampleScripts\VBRegExp
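
One way to read the regex: ^1?$ matches a string of zero or one '1', so it catches 0 and 1. In ^(11+?)\1+$ the group captures a block of two or more ones and \1+ demands one or more further copies of that block, so the whole string has length d*k with d >= 2 and k >= 2, which is exactly a composite number. A match therefore means "not prime". The sketch below is Python, not thinBasic, assuming the pattern behaves the same in both engines; `NOT_PRIME` and `is_prime` are illustrative names.

```python
import re

# Same pattern as in the post: a match means the number is NOT prime.
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")

def is_prime(n: int) -> bool:
    """Unary trick: n is prime when a string of n ones does not match."""
    return NOT_PRIME.match("1" * n) is None

primes = [n for n in range(20) if is_prime(n)]
print(primes)  # [2, 3, 5, 7, 11, 13, 17, 19]

# For a composite, group 1 holds its smallest factor written in unary.
m = NOT_PRIME.match("1" * 15)
smallest_factor = len(m.group(1))
print(smallest_factor)  # 3
```

The engine has to try block sizes 2, 3, 4, ... one after another and re-scan the string each time, which is a likely reason for the long run times on larger numbers.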


'---The following code tests a number for primality with a regular expression.
'---The number is written as a string of "1" characters; the pattern matches
'---only when that string has a composite length (or length 0 or 1).
Uses "VBREGEXP", "console"

Dim lpRegExp As DWord
Dim lpMatches As DWord
Dim yourNumber As Long
Dim sNumber As String

'---Allocate a new regular expression instance
lpRegExp = VBREGEXP_New
'---Check if it was possible to allocate and if not stop the script
If IsFalse lpRegExp Then
  MsgBox 0, "Unable to create an instance of the RegExp object." & $CRLF & "Script terminated"
  Stop
End If

'---Put your number here; it is converted to that many "1" characters
yourNumber = 12347
sNumber = String$(yourNumber, "1")

'---Set pattern
VBRegExp_SetPattern lpRegExp, "^1?$|^(11+?)\1+$"
'---Set case insensitivity
VBREGEXP_SetIgnoreCase lpRegExp, -1
'---Set global applicability
VBRegExp_SetGlobal lpRegExp, -1
'---Execute search
lpMatches = VBRegExp_Execute(lpRegExp, sNumber)
If IsFalse lpMatches Then
  MsgBox 0, "1. No match found"
Else

  Dim nCount As Long Value VBMatchCollection_GetCount(lpMatches)
  If nCount = 0 Then
    '---No match: the string of ones cannot be split into equal blocks
    PrintL yourNumber, "is prime -- press any key to continue "
  Else
    '---A match means the number is 0, 1 or composite
    PrintL yourNumber, "is not prime -- press any key to continue "
  End If

End If

'---Release the objects that were allocated
If IsTrue lpMatches Then VBREGEXP_Release(lpMatches)
If IsTrue lpRegExp Then VBREGEXP_Release(lpRegExp)

WaitKey