AppleScript Web Scraping Brain Teaser
Posted January 14th, 2005 @ 11:58am by Erik J. Barzeski
Here's something I'd like to do. It begins like this:
set rawSource to ""
tell application "Safari"
    set rawSource to source of front document
end tell
Now, at this point, I'd like to run a grep search on the source and grab every string of the form http:// ... &view=new. Then, I'd like to tell Safari to open all of the URLs it finds.
Can't quite figure out how to do that, though. I'd like to avoid round-tripping through BBEdit, if possible.
Posted 14 Jan 2005 at 12:04pm #
I forget exactly how to do this, but if you can get the script to print to stdout, you could run it from the command line, pipe the output through egrep, and then pipe the result back into AppleScript to tell Safari to open the URLs. Pipes are your friends.
I haven't done exactly this before, but I've gone from a shell script to AppleScript for several automated tasks.
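One way the pipeline idea might look, sketched in Python rather than a shell one-liner. It assumes macOS's `osascript -e` is used to print Safari's page source to stdout; the names `safari_source` and `grep_lines` are made up for this example, and the filter keeps whole matching lines the way plain egrep does.

```python
import re
import subprocess

# AppleScript one-liner that asks Safari for the front page's HTML.
SAFARI_SOURCE_SCRIPT = 'tell application "Safari" to return source of front document'


def safari_source():
    """Run osascript (macOS only) and return whatever it prints to stdout."""
    result = subprocess.run(
        ["osascript", "-e", SAFARI_SOURCE_SCRIPT],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def grep_lines(text, pattern):
    """Keep only the lines that match pattern, like `egrep pattern`."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]
```

On a Mac, `grep_lines(safari_source(), r"http://.*&view=new")` would return the matching lines of the page currently open in Safari. Because Safari itself supplies the source, this route should also work for pages that sit behind a login.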
Posted 14 Jan 2005 at 12:51pm #
Would a tool like wget not work better? It allows you to follow links through web pages to get everything you need.
Micheal
Posted 14 Jan 2005 at 1:02pm #
Try something like this:
1. From AppleScript, invoke curl or wget with the URL of the Safari window.
2. Pipe the output of curl/wget through a Perl script that writes the found URLs to stdout.
3. (Back in AppleScript) Open each of the returned URLs in Safari.
I don't know the exact syntax for executing a command-line app from AppleScript, nor do I know the flags to use with curl or wget.
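Steps 1 and 3 might be sketched like this in Python, as a rough illustration rather than a tested recipe for the forum in question. The curl flags used are `-s` (no progress meter) and `-L` (follow redirects); the function names are invented for the example, and the matching in step 2 is left out, so `open_all` simply takes a list of URLs.

```python
import subprocess
import webbrowser


def curl_command(url):
    """Build the curl call: -s hides the progress meter, -L follows redirects."""
    return ["curl", "-s", "-L", url]


def fetch(url):
    """Step 1: download the page with curl and return its HTML as text."""
    result = subprocess.run(
        curl_command(url), capture_output=True, text=True, check=True
    )
    return result.stdout


def open_all(urls, opener=webbrowser.open):
    """Step 3: hand each URL to the default browser; returns how many were opened."""
    count = 0
    for url in urls:
        opener(url)
        count += 1
    return count
```

The `opener` parameter exists only so the loop can be tried without launching a browser, for example by passing `print`.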
The regex to match the URLs would look something like this:
http:\/\/[^\s\t\n]+&view=new
In Perl, wrap the whole thing in parentheses; if it matches, $1 will be your URL.
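For comparison, here is the same capture-group idea as a small Python sketch, where `match.group(1)` plays the role of Perl's `$1`. The character class is tightened slightly as an assumption of this example: `\s` already covers tabs and newlines, and quotes are excluded so a match stops at the end of an `href` attribute. `find_urls` is a made-up helper name.

```python
import re

# The pattern from the comment, wrapped in parentheses so the URL is group 1.
# Assumption: quotes are excluded along with whitespace, so the match cannot
# run past the closing quote of an href attribute.
URL_PATTERN = re.compile(r"""(http://[^\s"']+&view=new)""")


def find_urls(html):
    """Return every captured URL in order of appearance."""
    return [match.group(1) for match in URL_PATTERN.finditer(html)]
```

Given a page containing `href="http://forum.example/t?id=1&view=new"`, `find_urls` would return the URL without the surrounding quotes.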
Hope this helps,
-JK
Posted 14 Jan 2005 at 1:09pm #
The site is a forum and requires a username and password (and not of the http://user:pass@ variety). I'm not sure that I'll be able to grab the proper page via curl or wget.
Posted 14 Jan 2005 at 1:17pm #
I don't know how badly this will be munged going through the comments system, so email me for a .scpt file if it fails. Replace `suffix` with anything you'd like (I used .gif for testing).
set prefix to "http://"
set suffix to ".gif"
tell application "Safari" to set rawSource to source of front document
-- All links, on their own line.
set rawSource to replaceText from "\"" to return for rawSource
set rawSource to replaceText from prefix to return & prefix for rawSource
-- Match.
set linkList to {}
repeat with theLine in paragraphs in rawSource
    if theLine begins with prefix and theLine ends with suffix then
        set linkList to linkList & theLine
    end if
end repeat
-- Open.
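tell application "Safari"
    repeat with theLink in linkList
        -- The property record was cut off when this was posted; URL is the presumed intent.
        make new document with properties {URL:(contents of theLink)}
    end repeat
end tell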
to replaceText from textToFind to replacementText for sourceText
    set AppleScript's text item delimiters to {textToFind}
    set sourceTextList to every text item of sourceText
    set AppleScript's text item delimiters to {replacementText}
    return sourceTextList as text
end replaceText
Posted 14 Jan 2005 at 1:21pm #
I do something like this using Python. I have to do a POST login, so it uses cookies and actually logs in to the remote site.
If the above script doesn't work, the Python code could easily be modified to do what you want. Email me if you need it.
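The commenter's script isn't posted, so the following is only a guess at the general shape of a cookie-based POST login using Python's standard library. The login URL and the form field names are hypothetical placeholders; the real forum's values would have to be read from its login form.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Hypothetical values: the forum's real login URL and form field names are
# not given in this thread.
LOGIN_URL = "http://forum.example/login.php"
FIELD_USER = "username"
FIELD_PASS = "password"


def make_opener():
    """Build an opener that stores cookies, so the session survives the login."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(user, password):
    """Build the POST request that submits the login form."""
    body = urllib.parse.urlencode({FIELD_USER: user, FIELD_PASS: password})
    return urllib.request.Request(LOGIN_URL, data=body.encode("utf-8"))


def fetch_logged_in(opener, user, password, page_url):
    """Log in, then fetch page_url with the session cookie attached."""
    opener.open(login_request(user, password)).close()
    with opener.open(page_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Supplying `data` is what turns the request into a POST; the cookie jar then carries whatever session cookie the site sets into the second request.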
Posted 14 Jan 2005 at 2:17pm #
I made some minor changes to rentzsch's code so that it uses your default browser instead. Lines starting with '--' are commented out; I usually leave them in place in case I need them later.
-- tell application "Safari"
repeat with theLink in linkList
    -- make document with properties {
    open location theLink
end repeat
-- end tell
Cheers,
Dave
Posted 14 Jan 2005 at 2:26pm #
The final version. Works fine here and relatively quickly.
set prefix to "http://"
set suffix to "view=getnewpost"
tell application "Safari" to set rawSource to source of front document
-- All links, on their own line.
set rawSource to replaceText from "href='" to return for rawSource
set rawSource to replaceText from ("&" & suffix) to ("&" & suffix & return) for rawSource
-- set the clipboard to rawSource
-- Match.
set linkList to {}
set rawSourceParagraphs to paragraphs in rawSource
repeat with theLine in rawSourceParagraphs
    if theLine begins with prefix and theLine ends with suffix then
        set linkList to linkList & theLine
    end if
end repeat
repeat with theLink in linkList
    open location theLink
end repeat
to replaceText from textToFind to replacementText for sourceText
    set AppleScript's text item delimiters to {textToFind}
    set sourceTextList to every text item of sourceText
    set AppleScript's text item delimiters to {replacementText}
    return sourceTextList as text
end replaceText
Posted 14 Jan 2005 at 2:28pm #
Incidentally, I've also modified this script to put the links in a very simple XML file on my disk via a cron job. PulpFiction checks the XML file regularly and lets me know what new topics exist (the AppleScript titles them properly and so forth, yes).
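The XML format isn't described here. Since PulpFiction is a feed reader, one plausible guess is a bare-bones RSS 2.0 file, which might be generated along these lines in Python. The channel title, link, and description defaults are placeholders, and `build_feed` and `write_feed` are names invented for the sketch.

```python
import xml.etree.ElementTree as ET


def build_feed(links, feed_title="New forum topics",
               feed_link="http://forum.example/",
               feed_description="Unread topics"):
    """Return a small RSS 2.0 document with one <item> per (title, url) pair."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_description
    for title, url in links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = url
    # ElementTree escapes characters such as "&" in the text for us.
    return ET.tostring(rss, encoding="unicode")


def write_feed(path, links):
    """Write the feed to disk; a cron job could call this on a schedule."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_feed(links))
```

Letting the XML library do the escaping matters here, because the forum URLs contain `&`, which is not legal as a bare character in XML text.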
Posted 24 Nov 2022 at 1:31pm #
set urlLink to "https://reason.com/latest/"
tell application "Safari"
    set the URL of the front document to urlLink
    delay 2
    set pageText to source of document 1
end tell
set mysplit to thesplit(pageText, "<")
on thesplit(thestring, thechar)
    set AppleScript's text item delimiters to thechar
    set mysplit to every text item of thestring
    return mysplit
end thesplit
set mylist to {}
repeat with anitem in mysplit
    if anitem contains "HREF" then
        set end of mylist to anitem
    end if
end repeat
set thedata to ""
repeat with theitem in mylist
    set thedata to thedata & theitem & linefeed
end repeat
tell application "TextEdit"
    set location to alias "Macintosh HD:Users:JasonMagnuson:Desktop:" as text
    make new document with properties {text:thedata}
    save document 1 in file (location & "thescrape.txt")
    close document 1
    quit
end tell
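The script above finds links by splitting the source on "<" and keeping the pieces that contain HREF. As an alternative technique (not what the script does), an HTML parser can pull out just the link targets. A minimal Python sketch using the standard library's `html.parser` follows; `HrefCollector` and `extract_hrefs` are names chosen for the example.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so "HREF" matches too.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Parse html and return the list of link targets."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs
```

Joining the result with newlines, as in `"\n".join(extract_hrefs(page_text))`, gives one URL per line, which is roughly what the TextEdit step saves, minus the surrounding tag text.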