A PHP battle scraper

Discussion in 'API' started by Farbs, Dec 29, 2013.

  1. Farbs

    Farbs Blue Manchu Staff Member

    Hi,

    I've been working on a PHP script that reads all battles from the Card Hunter API and sends them to handler functions. The idea is to run this regularly via a cron job, which would ensure that each battle is sent to the handlers once and only once. The handler functions can then do things like count wins, respond to matches between rivals, notify you when particular players are online, or even record wins for each side on particular maps to see whether they're balanced. There's huge scope for what this could do, as evidenced by me making half of those things up as I typed them. I'm pretty excited about it.

    So, the point of this thread is to share what I've built thus far. I have very little PHP experience, and my MySQL-fu is weak, so by all means please suggest changes and improvements. Nonetheless, this does seem to work, and I figured it might be a good starting point for anyone wanting to run something like it. Without further ado, here are the files:

    PHP:
    <?php

    include 'config.php';

    // set up battle handlers
    $battleHandlers = array();
    include 'echobattlehandler.php';

    // hook up to the db (mysqli takes the port as a separate argument)
    $dbConn = new mysqli( DB_IP, DB_USER, DB_PASSWORD, DB_DEFAULT_DATABASE, (int) DB_PORT );
    if( $dbConn->connect_errno )
    {
        die( "Database connection error: " . $dbConn->connect_error );
    }

    // grab the last battle id
    // or -1 if we haven't run before
    $result = $dbConn->query( "SELECT * FROM `last_battle_id`;" );
    if( $result == false )
    {
        die( "Error in battle id query: " . $dbConn->error );
    }
    $lastID = -1;
    if( $result->num_rows == 1 )
    {
        $row = $result->fetch_assoc();
        $lastID = $row["battle_id"];
    }
    else if( $result->num_rows > 1 )
    {
        die( "Too many rows returned by last battle id query" );
    }
    $result->close();

    // set up battle id in db if not there already
    // (IGNORE avoids a duplicate key error if a -1 row is already stored)
    if( $lastID == -1 )
    {
        $dbConn->query( "INSERT IGNORE INTO `last_battle_id` VALUES ( -1 );" );
    }

    $battlesRemain = true;
    $battlesRequestCount = 0;
    $curl = curl_init();
    curl_setopt( $curl, CURLOPT_RETURNTRANSFER, true );
    while( $battlesRemain && $battlesRequestCount < MAX_BATTLES_REQUESTS )
    {
        // assemble url
        $url = "api.cardhunter.com/battles?count=" . REQUEST_DATA_COUNT;
        if( $lastID >= 0 )
        {
            // if we know where to start from, start from there. Otherwise defaults to the end.
            // use the demarc page system to ensure we start from the next battle after lastID.
            $url .= "&page=prev&demarc=" . $lastID;
        }

        // grab data from cardhunter api service
        curl_setopt( $curl, CURLOPT_URL, $url );
        $battles = json_decode( curl_exec( $curl ) );

        // did we get any battles? (json_decode gives null if the request or parse failed)
        if( is_object( $battles ) && property_exists( $battles, "battles" ) )
        {
            // flip the battles so they're ordered oldest-to-newest
            $battles->battles = array_reverse( $battles->battles );

            // handle each battle
            $count = 0;
            foreach( $battles->battles as $battle )
            {
                // reformat time string so it can be used with MySQL DATETIME
                // (assumes an ISO 8601 style string, e.g. 2013-12-29T10:20:30...)
                $battle->start = substr( $battle->start, 0, 10 ) . " " . substr( $battle->start, 11, 8 );

                // run each handler for it
                foreach( $battleHandlers as $battleHandler )
                {
                    call_user_func( $battleHandler, $battle );
                }
                $lastID = max( $lastID, $battle->id );
                ++$count;
            }
            if( $count < REQUEST_DATA_COUNT )
            {
                // well, that's all of 'em
                $battlesRemain = false;
            }
        }
        else
        {
            // no battles array. What happened?
            echo "Error: No battles object in data";
            var_dump( $battles );
            $battlesRemain = false;
        }
        ++$battlesRequestCount;
    }

    // store last ID
    $statement = $dbConn->prepare( "UPDATE `last_battle_id` SET `battle_id`=?;" );
    $statement->bind_param( "s", $lastID );
    $statement->execute();
    $statement->close();

    // tidy up
    curl_close( $curl );
    $dbConn->close();
    This is the main pump. I ended up using the undocumented "demarc" API parameter, which is used to paginate the data for large requests. This allowed me to get the next batch of battles from a particular point onward, which unfortunately didn't seem possible using the "before" and "after" parameters since they always started at the most recent end of the list.
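
    For anyone wanting to reuse the paging logic outside the pump, here is a minimal sketch of the URL assembly pulled out into its own function. The name buildBattlesUrl is made up for this example; the URL and the count, page and demarc parameters are simply the ones the pump above uses, and since demarc is undocumented its behaviour may change.

```php
<?php
// Hypothetical helper: builds the request URL the same way the pump does.
// A negative $lastID means "we haven't run before", so no paging parameters
// are added and the API defaults to the most recent battles.
function buildBattlesUrl( $count, $lastID )
{
    $url = "api.cardhunter.com/battles?count=" . $count;
    if( $lastID >= 0 )
    {
        // undocumented demarc parameter: continue from the battle after $lastID
        $url .= "&page=prev&demarc=" . $lastID;
    }
    return $url;
}

// first run, then a later run that resumes from a stored id
echo buildBattlesUrl( 25, -1 ), "\n";
echo buildBattlesUrl( 25, 12345 ), "\n";
```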

    PHP:
    <?php
    // general
    define( "REQUEST_DATA_COUNT", 25 );        // maximum number of battles in each request
    define( "MAX_BATTLES_REQUESTS", 10 );    // maximum number of times to request battles

    // database config
    define( "DB_IP", '127.0.0.1' );
    define( "DB_PORT", '3306' );
    define( "DB_USER", 'your db user name here' );
    define( "DB_PASSWORD", "your db user's password here" );
    define( "DB_DEFAULT_DATABASE", "cardhunter_api_playground" );
    ?>
    This is an example of config.php. It contains options for the battle pump. Splitting this out into its own file should make it easier to maintain live and development versions of the code, I think. We'll see. Obviously you'll need to put your own details in the db fields.
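
    One possible way to switch between live and development settings is to pick the config file from an environment variable. This is only a sketch: SCRAPER_ENV, config.dev.php and configFileFor are names invented for the example, not anything the pump currently uses.

```php
<?php
// Hypothetical: choose a config file based on an environment name.
// Anything other than "dev" (including an unset variable) falls back to live.
function configFileFor( $env )
{
    return ( $env === "dev" ) ? "config.dev.php" : "config.php";
}

// getenv() returns false when the variable isn't set
$configFile = configFileFor( getenv( "SCRAPER_ENV" ) );

// the pump would then use: include $configFile;
echo $configFile, "\n";
```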

    PHP:
    <?php
     
    // define and add main handler
    function handleBattle( $battle )
    {
        var_dump( $battle );
    }
    $battleHandlers[] = "handleBattle";
     
    ?>
    This is echobattlehandler.php. It's just a sample handler that dumps the battles out on screen. You'll want to write your own handlers to do clever things with the battle data.
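
    As a slightly more useful starting point, here is a rough sketch of a handler that tallies battles per day. It only touches the start field, which the pump has already reformatted to "YYYY-MM-DD HH:MM:SS" by the time handlers run. The $battlesPerDay array is invented for this example, and the handler is registered as a closure rather than a function name (call_user_func accepts either). The tally lives in memory only; a real handler would presumably write to the database.

```php
<?php
// Hypothetical handler: counts how many battles started on each calendar day.
// Assumes $battle->start is already "YYYY-MM-DD HH:MM:SS", so the first
// 10 characters are the date.
$battleHandlers = isset( $battleHandlers ) ? $battleHandlers : array();
$battlesPerDay = array();

$battleHandlers[] = function( $battle ) use ( &$battlesPerDay )
{
    $day = substr( $battle->start, 0, 10 );
    if( !isset( $battlesPerDay[$day] ) )
    {
        $battlesPerDay[$day] = 0;
    }
    ++$battlesPerDay[$day];
};

// quick check with made-up battles, dispatched the same way the pump does
foreach( array( "2013-12-29 10:00:00", "2013-12-29 11:30:00", "2013-12-30 09:15:00" ) as $start )
{
    $fakeBattle = new stdClass();
    $fakeBattle->start = $start;
    foreach( $battleHandlers as $battleHandler )
    {
        call_user_func( $battleHandler, $fakeBattle );
    }
}
print_r( $battlesPerDay );
```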

    Code:
    CREATE SCHEMA IF NOT EXISTS `cardhunter_api_playground` DEFAULT CHARACTER SET latin1 ;
    USE `cardhunter_api_playground` ;
     
    -- -----------------------------------------------------
    -- Table `cardhunter_api_playground`.`last_battle_id`
    -- -----------------------------------------------------
    CREATE  TABLE IF NOT EXISTS `cardhunter_api_playground`.`last_battle_id` (
      `battle_id` BIGINT(20) NOT NULL ,
      PRIMARY KEY (`battle_id`) )
    ENGINE = InnoDB
    DEFAULT CHARACTER SET = latin1;
    This chunk of SQL sets up your db to store the most recently accessed battle id. I recommend emptying this table if you haven't run the pump for a while, as that'll instruct the scraper to just start from the most recent set of battles. Obviously if you want to change the db name you'll need to do this both here and in config.php.

    So, uh, yeah. That's it so far. I have a handler which collects more interesting data, but I still need to write some new PHP pages to browse that data for it to be at all useful. I'm also considering running the scraper every time the site I'm building is accessed, but throttling it so it runs no more often than once a minute. Between that and 15-minute cron scheduling I think I'll get a good balance of load and timeliness. If I could just schedule it to run every minute I'd do that instead, but my host doesn't support it.
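
    A minimal sketch of that kind of throttle, assuming the modification time of a marker file is good enough as a "last run" timestamp. The function name, the marker file path and the 60-second interval are all placeholders, and storing the timestamp in the database alongside last_battle_id would work just as well. It is not safe against two page hits arriving at the same instant; that would need some kind of lock.

```php
<?php
// Hypothetical throttle: returns true (and records the run) only if at least
// $minInterval seconds have passed since the marker file was last touched.
function shouldRunScraper( $markerFile, $minInterval, $now )
{
    clearstatcache( true, $markerFile );
    if( file_exists( $markerFile ) && ( $now - filemtime( $markerFile ) ) < $minInterval )
    {
        return false;    // ran too recently, skip this time
    }
    touch( $markerFile, $now );    // remember when we last ran
    return true;
}

// usage: at the top of a page, before pulling in the pump
$marker = sys_get_temp_dir() . "/battle_scraper_last_run";
if( shouldRunScraper( $marker, 60, time() ) )
{
    // include 'battlepump.php';    // hypothetical file name for the main pump
}
```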

    Anyhow, since this is currently reasonably simple and self-contained, I figured I'd share it now, because from here on out my dev copy is going to become less and less generically useful.

    Enjoy!
     
