Our company is actually quite automated. We have many databases: Oracle 10g, SQL Server 2005, SSAS 2008, Teradata, and so on. There are several ways to extract data from them. Some can be accessed through a custom desktop application, others through VBA code embedded in an Excel file (such as the OLAP cubes), but most are accessed through a web-based GUI. Unfortunately for me as a PHP noob, most of these sites use ASP.NET, which makes automating data extraction from them (also called scraping) rather difficult, if not impossible.
You may ask why I bother automating the website interaction when I could go directly to the database tables behind it. Well... because I don't have direct access to the database. Our IT department deliberately allows data gathering only through (in this case) the website, to prevent people from bogging down the database with too many queries. So I really had no choice. Besides, even if I did have access, I probably would not be able to understand the structure of the database, since that information is controlled as well.
Now, if the site used the GET method of form submission, I would probably be able to get by with file_get_contents. But the site uses the POST method and relies on cookies, so cURL was the practical choice. (file_get_contents can send a POST through a stream context, but cURL makes cookies and redirects much easier to handle.) One tool I am truly grateful for is Wireshark, a network protocol analyzer for Unix and Windows that lets me see all the nitty-gritty details of the HTTP transactions my web browser creates as I interact with the website. By logging the whole session, I can dissect the POST transactions to get at the exact POST variables being sent. These variables are then supplied to cURL to simulate the exact operation performed by the web browser. In this case, what is being simulated is a PostBack operation triggered by selecting an option in a drop-down list box.
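To illustrate the dissecting step, a captured POST body can be split into its variables with PHP's parse_str. This is only a sketch: the body string below is a made-up stand-in for what a Wireshark capture might show, not real traffic from the site.

```php
<?php
// Hypothetical POST body, shaped like one copied out of a Wireshark capture.
$captured = '__EVENTTARGET=ddl_name&__EVENTARGUMENT=&__VIEWSTATE=dDwy%2BMDE%3D&ddl_name=Adam';

// parse_str() splits the body on & and = and URL-decodes every value.
parse_str($captured, $fields);

// List each variable the browser sent, one per line.
foreach ($fields as $name => $value) {
    echo $name, ' = ', $value, PHP_EOL;
}
```

Note that the decoded __VIEWSTATE contains + and = characters, so it has to be URL-encoded again before it is sent back.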
The PostBack JavaScript code looks like this:
function __doPostBack(eventTarget, eventArgument) {
    var theform;
    if (window.navigator.appName.toLowerCase().indexOf("microsoft") > -1) {
        theform = document.Form1;
    }
    else {
        theform = document.forms["Form1"];
    }
    theform.__EVENTTARGET.value = eventTarget.split("$").join(":");
    theform.__EVENTARGUMENT.value = eventArgument;
    theform.submit();
}

For those unfamiliar, what this does is generate a form POST transaction, just as would happen if you clicked on a form's Submit button.
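On the PHP side, the two values that __doPostBack fills in can be reproduced with a small helper. This is a minimal sketch; the function name postback_fields and the control name ctl00$ddl_name are made up for illustration.

```php
<?php
// Hypothetical helper mirroring what the JavaScript __doPostBack() puts in the form.
function postback_fields(string $eventTarget, string $eventArgument = ''): array
{
    return array(
        // Same transformation as eventTarget.split("$").join(":") in the JavaScript
        '__EVENTTARGET'   => str_replace('$', ':', $eventTarget),
        '__EVENTARGUMENT' => $eventArgument,
    );
}

// Single quotes keep PHP from treating $ddl_name as a variable.
print_r(postback_fields('ctl00$ddl_name'));
```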
This function is called every time you select an option in a drop-down list box:
HTML Code:
<select name="ddl_name" onchange="__doPostBack('ddl_name','')" language="javascript" id="ddl_name">
    <option value="- Choose Name -">- Choose Name -</option>
    <option value="Adam">Adam</option>
    <option value="Nathan">Nathan</option>
    <option value="George">George</option>
</select>

In a normal POST request, the POST would contain, at the minimum, the Select form element name (for this example it would be ddl_name) and the corresponding Option value you selected (for example Adam). But ASP.NET automatically generates some hidden form variables that need to be included in the POST:
<input name="__EVENTTARGET" value="" type="hidden">
<input name="__EVENTARGUMENT" value="" type="hidden">
<input name="__VIEWSTATE" value="dDwyMDE1Nz..." type="hidden">

The __EVENTTARGET and __EVENTARGUMENT values are pretty straightforward: you just need to decode the __doPostBack function. The last hidden variable, __VIEWSTATE, is more interesting. ViewState is a feature built into ASP.NET and is defined by Scott Mitchell as:

"Microsoft® ASP.NET view state, in a nutshell, is the technique used by an ASP.NET Web page to persist changes to the state of a Web Form across postbacks."

Basically, it is a serialized representation of all the form variables and their states when the page was sent out from the Web server. You must send this exact value back in the POST request, or the Web server will reject it.
So, to complete one full cycle, you need to go through a three-step process:
- Retrieve the initial page
- Parse the page for the ViewState value
- Send a POST request with all the needed form variables including the ViewState
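The three steps above can be sketched as a single function. To keep the example independent of any particular site, the HTTP transport is passed in as a callable; that $fetch parameter, and the function names, are assumptions made for this illustration and not part of the original script.

```php
<?php
// Step 2 on its own: pull the __VIEWSTATE value out of the page (null if absent).
function extract_viewstate(string $html): ?string
{
    if (preg_match('/name="__VIEWSTATE"\s+value="(.*?)"/', $html, $m)) {
        return $m[1];
    }
    return null;
}

// Steps 1 to 3 in order. $fetch is a hypothetical callable: $fetch($url, $postBody)
// returns the response HTML, and a null $postBody means a plain GET.
function run_postback_cycle(callable $fetch, string $url, string $target, array $formValues): string
{
    $page = $fetch($url, null);                  // 1. retrieve the initial page
    $viewstate = extract_viewstate($page);       // 2. parse the page for the ViewState
    if ($viewstate === null) {
        throw new RuntimeException('No __VIEWSTATE field found');
    }
    // 3. build the POST body; http_build_query() URL-encodes every value
    $body = http_build_query(array(
        '__EVENTTARGET'   => $target,
        '__EVENTARGUMENT' => '',
        '__VIEWSTATE'     => $viewstate,
    ) + $formValues, '', '&');
    return $fetch($url, $body);
}
```

In real use, $fetch would wrap the cURL calls; in a test it can be a closure that returns canned HTML.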
First, to retrieve the initial page you could probably get by with a simple file_get_contents. Try it, but if it doesn't work then you will probably need the cURL extension (make sure that it is enabled in your php.ini), which has a few tricks up its sleeve. The site I was interacting with required cookies, so I went ahead and enabled cookies in my cURL call. Another feature of cURL is the ability to perform user agent spoofing. For more secure sites, you might need to specify a username and password.
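As a rough sketch of those tricks, the options for the first request might be assembled like this. The helper name, the user agent string, and the choice of CURLAUTH_ANY are assumptions for illustration; the site's actual authentication scheme is not stated here.

```php
<?php
// Hypothetical helper: builds the cURL options for the initial GET request.
function initial_request_options(string $cookieFile, ?string $user = null, ?string $pass = null): array
{
    $options = array(
        CURLOPT_RETURNTRANSFER => true,         // return the page instead of printing it
        CURLOPT_FOLLOWLOCATION => true,         // follow redirects
        CURLOPT_COOKIEJAR      => $cookieFile,  // file where cURL saves cookies when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile,  // file cookies are read from; also turns cookie handling on
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-scraper)', // made-up browser-like string
    );
    if ($user !== null) {
        $options[CURLOPT_HTTPAUTH] = CURLAUTH_ANY;            // let cURL pick the scheme the server offers
        $options[CURLOPT_USERPWD]  = $user . ':' . $pass;     // credentials in user:password form
    }
    return $options;
}

// Typical use (needs a real $url, so it is left commented out):
// $ch = curl_init($url);
// curl_setopt_array($ch, initial_request_options('/tmp/cookies.txt'));
// $result = curl_exec($ch);
```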
For the second part, you can use the DOM extension or go with the simpler regex approach. Below is an example of how I did it:
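$result = curl_exec($ch);
// Capture the value attribute of the hidden __VIEWSTATE input
preg_match_all("/name=\"__VIEWSTATE\" value=\"(.*?)\"/", $result, $arr_viewstate);
// URL-encode it so that characters such as + and = survive the POST body
$viewstate = urlencode($arr_viewstate[1][0]);

Lastly, you plug in the values for the POST variables in the cURL options array like so: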
$options = array(
    CURLOPT_RETURNTRANSFER => true,     // return web page
    CURLOPT_HEADER         => false,    // don't return headers
    CURLOPT_FOLLOWLOCATION => true,     // follow redirects
    CURLOPT_ENCODING       => "",       // handle all encodings
    CURLOPT_USERAGENT      => "spider", // who am i
    CURLOPT_AUTOREFERER    => true,     // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120,      // timeout on connect
    CURLOPT_TIMEOUT        => 120,      // timeout on response
    CURLOPT_MAXREDIRS      => 10,       // stop after 10 redirects
    CURLOPT_POST           => true,     // send a POST request
    // $viewstate was already URL-encoded above
    CURLOPT_POSTFIELDS     => '__EVENTTARGET='.urlencode('ddl_name').'&__EVENTARGUMENT='.urlencode('').'&__VIEWSTATE='.$viewstate.'&ddl_name=Adam',
);
$ch = curl_init($url);                  // $url is the address of the ASP.NET page
curl_setopt_array($ch, $options);
$result = curl_exec($ch);

That's it! You can now parse the data you need out of the $result variable. I used the DOM extension for this since it is more tolerant of malformed HTML markup than regex.
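Parsing the returned page with the DOM extension might look something like the sketch below. It assumes the data sits in an HTML table with a known id; the function name and that assumption are mine, since the actual page layout is not shown.

```php
<?php
// Hypothetical helper: returns the cell text of every row in the table with the given id.
// $tableId is placed straight into the XPath, so it should be a trusted, simple id.
function extract_table_rows(string $html, string $tableId): array
{
    $previous = libxml_use_internal_errors(true); // collect markup warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = array();
    foreach ($xpath->query('//table[@id="' . $tableId . '"]//tr') as $tr) {
        $cells = array();
        // Header and data cells of this row, in document order
        foreach ($xpath->query('./td|./th', $tr) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        if ($cells) {
            $rows[] = $cells;
        }
    }
    return $rows;
}

// Typical use, with 'results' standing in for the real table id:
// $rows = extract_table_rows($result, 'results');
```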