NSLog();

AppleScript Web Scraping Brain Teaser

Here's something I'd like to do. It begins like this:

set rawSource to ""
tell application "Safari"
    set rawSource to source of front document
end tell

Now, at this point, I'd like to run a grep search on the source and grab every string of the form http:// ... &view=new. Then, I'd like to tell Safari to open all of the URLs it finds.

Can't quite figure out how to do that, though. I'd like to avoid round-tripping through BBEdit, if possible.

10 Responses to "AppleScript Web Scraping Brain Teaser"

  1. I forget exactly how to do this, but if you can get the script to print to stdout, you could run it from the command line, pipe the output through egrep, and then pass the result back into AppleScript to tell Safari to open it. Pipes are your friends.

    I haven't done this particular thing before, but I've gone from a shell script to AppleScript for several automated tasks.

  2. Would a tool like wget not work better? It allows you to follow links through web pages to get everything you need.

    Micheal

  3. Try something like this:

    1. From AppleScript, invoke curl or wget with the URL of the Safari window.

    2. Pipe the output of curl/wget through a Perl script that writes the found URLs to stdout.

    3. Back in AppleScript, open each of the returned URLs in Safari.

    I don't know the exact syntax for executing a command-line app from AppleScript, nor the flags to use with curl or wget.

    The regex to match the URLs would look something like this:

    http:\/\/[^\s\t\n]+&view=new

    In Perl, wrap the whole thing in parentheses; if it matches, $1 will be your URL.

    Hope this helps,

    -JK
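As a rough sketch of the matching step described above, here is the same regex idea in Python rather than Perl. The function name `find_urls` is made up for illustration, and the character class also excludes quotes and angle brackets (an assumption beyond the original pattern) so a match cannot run past the end of an HTML attribute.

```python
import re

# "http://", then a run of characters that are not whitespace, quotes, or
# angle brackets, ending at the first "&view=new". The extra exclusions are
# an assumption; the pattern in the comment only excluded whitespace.
URL_PATTERN = re.compile(r"http://[^\s\"'<>]+?&view=new")


def find_urls(source):
    """Return every matching URL in source, in order, without duplicates."""
    found = []
    for url in URL_PATTERN.findall(source):
        if url not in found:
            found.append(url)
    return found
```

If the page source writes the ampersand as `&amp;`, the pattern would need to allow for that as well; this sketch assumes a literal `&`.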

  4. The site is a forum and requires a username and password (and not of the http://user:pass@ variety). I'm not sure that I'll be able to grab the proper page via curl or wget.

  5. I don't know how badly this will be munged going through the comments system, so email me for a .scpt file if it fails. Replace `suffix` with anything you'd like (I used .gif for testing).

    set prefix to "http://"
    set suffix to ".gif"
    tell application "Safari" to set rawSource to source of front document

    -- All links, on their own line.
    set rawSource to replaceText from "\"" to return for rawSource
    set rawSource to replaceText from prefix to return & prefix for rawSource

    -- Match.
    set linkList to {}
    repeat with theLine in paragraphs in rawSource
        if theLine begins with prefix and theLine ends with suffix then
            set linkList to linkList & theLine
        end if
    end repeat

    -- Open.
    tell application "Safari"
        repeat with theLink in linkList
            -- The comment system cut this line off after "{"; setting URL to the link is the presumed intent.
            make new document with properties {URL:(contents of theLink)}
        end repeat
    end tell

    to replaceText from textToFind to replacementText for sourceText
        set AppleScript's text item delimiters to {textToFind}
        set sourceTextList to every text item of sourceText
        set AppleScript's text item delimiters to {replacementText}
        return sourceTextList as text
    end replaceText
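For comparison, the split-and-filter trick in the script above (put every candidate link on its own line, then keep lines that start with the prefix and end with the suffix) might look like this in Python. The function name `links_between` is hypothetical, and unlike AppleScript's `begins with`, these string tests are case-sensitive.

```python
def links_between(source, prefix="http://", suffix=".gif"):
    """Keep the lines that begin with prefix and end with suffix."""
    # Break the text at every double quote and just before every prefix,
    # so each candidate link ends up alone on a line.
    text = source.replace('"', "\n").replace(prefix, "\n" + prefix)
    return [
        line
        for line in text.splitlines()
        if line.startswith(prefix) and line.endswith(suffix)
    ]
```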

  6. I do something like this using Python. I have to do a POST login, so it uses cookies and actually logs in to the remote site.

    If the above script doesn't work, the Python code could easily be modified to do what you want. Email me if you need it.
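The commenter's code isn't shown, so the following is only a guess at the general shape of a cookie-based POST login using Python's standard library. The form field names (`username`, `password`), the URLs passed in, and all three function names are hypothetical; a real forum will use its own field names and may require extra hidden fields.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Build an opener that stores cookies and sends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(login_url, username, password):
    """Build the POST request. The field names are assumptions."""
    form = urllib.parse.urlencode({"username": username, "password": password})
    # Supplying data makes urllib send the request as a POST.
    return urllib.request.Request(login_url, data=form.encode("ascii"))


def fetch_after_login(login_url, page_url, username, password):
    """Log in, then fetch page_url with the session cookies attached."""
    opener, _jar = make_opener()
    opener.open(login_request(login_url, username, password)).close()
    with opener.open(page_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

The text returned by `fetch_after_login` could then be searched for links in the same way as the Safari page source.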

  7. I made some minor changes to rentzsch's code so that it uses your default browser instead. Lines with '--' are commented out; I usually leave them behind in case I need them later.

    -- tell application "Safari"
    repeat with theLink in linkList
        -- make document with properties {
        open location theLink
    end repeat
    -- end tell

    Cheers,

    Dave

  8. The final version. Works fine here and relatively quickly.

    set prefix to "http://"
    set suffix to "view=getnewpost"
    tell application "Safari" to set rawSource to source of front document

    -- All links, on their own line.
    set rawSource to replaceText from "href='" to return for rawSource
    set rawSource to replaceText from ("&" & suffix) to ("&" & suffix & return) for rawSource
    -- set the clipboard to rawSource

    -- Match.
    set linkList to {}
    set rawSourceParagraphs to paragraphs in rawSource
    repeat with theLine in rawSourceParagraphs
        if theLine begins with prefix and theLine ends with suffix then
            set linkList to linkList & theLine
        end if
    end repeat

    repeat with theLink in linkList
        open location theLink
    end repeat

    to replaceText from textToFind to replacementText for sourceText
        set AppleScript's text item delimiters to {textToFind}
        set sourceTextList to every text item of sourceText
        set AppleScript's text item delimiters to {replacementText}
        return sourceTextList as text
    end replaceText

  9. Incidentally, I've also modified this script to put the links in a very simple XML file on my disk via a cron job. PulpFiction checks the XML file regularly and lets me know what new topics exist (the AppleScript titles them properly and so forth, yes).
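The XML layout used there isn't shown, so this is only an illustration of writing title/link pairs to a small XML document in Python. The element names (`topics`, `topic`, `title`, `link`) and the function name are invented for the example; a feed reader would most likely expect a real feed format such as RSS.

```python
import xml.etree.ElementTree as ET


def links_to_xml(links):
    """Return a small XML document, as a string, from (title, url) pairs."""
    root = ET.Element("topics")  # hypothetical element names throughout
    for title, url in links:
        topic = ET.SubElement(root, "topic")
        ET.SubElement(topic, "title").text = title
        ET.SubElement(topic, "link").text = url
    # ElementTree escapes "&" in the URLs when serializing.
    return ET.tostring(root, encoding="unicode")
```

Writing the returned string to a file from a scheduled job would then be a matter of opening the file and writing the text.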

  10. set urlLink to "https://reason.com/latest/"
    tell application "Safari"
        set the URL of the front document to urlLink
        delay 2
        set pageText to source of document 1
    end tell
    set mysplit to thesplit(pageText, "<")
    on thesplit(thestring, thechar)
        set AppleScript's text item delimiters to thechar
        set mysplit to every text item of thestring
        return mysplit
    end thesplit
    set mylist to {}
    repeat with anitem in mysplit
        if anitem contains "HREF" then
            set end of mylist to anitem
        end if
    end repeat
    set thedata to ""
    repeat with theitem in mylist
        set thedata to thedata & theitem & linefeed
    end repeat
    tell application "TextEdit"
        set location to alias "Macintosh HD:Users:JasonMagnuson:Desktop:" as text
        make new document with properties {text:thedata}
        save document 1 in file (location & "thescrape.txt")
        close document 1
        quit
    end tell
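The script above splits the source on "<" and keeps the pieces that mention HREF. A sketch of the same goal in Python, using the standard library's HTML parser to return just the href values, could look like the following. The class and function names are made up for the example, and it assumes the page source has already been fetched into a string.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the value of every href attribute seen while parsing."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases attribute names, so HREF= and href= both match.
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def collect_hrefs(source):
    """Return the href values found in source, in document order."""
    parser = HrefCollector()
    parser.feed(source)
    parser.close()
    return parser.hrefs
```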

