Our company is actually quite automated. We have many databases: Oracle 10g, SQL Server 2005, SSAS 2008, Teradata, etc. There are several ways to extract data from them. Some can be accessed through a custom desktop application, others through VBA code embedded in an Excel file (such as OLAP). Most, however, use a web-based GUI. Unfortunately for me as a PHP noob, most of these sites use ASP.NET, which makes automating data extraction from them (also called scraping) rather difficult, if not impossible.
You may ask why I bother automating the website interaction when I could go directly to the database tables behind it. Well... because I don't have direct access to the database. Our IT department deliberately allows data gathering only through (in this case) the website, to prevent people from bogging down the database with too many queries. So I really had no choice. Besides, even if I did have access, I probably would not be able to understand the structure of the database, since that information is controlled as well.
Now, if the site used the GET method of form submission, I would probably be able to get by using file_get_contents. But the site uses the POST method, so Curl is the more practical way to go. One tool I am truly grateful for is Wireshark, a network protocol analyzer for Unix and Windows that lets me see all the nitty-gritty details of the HTTP transactions my web browser creates as I interact with the website. By logging the whole session, I am able to dissect the POST transactions and get at the exact POST variables being sent. These POST variables are then supplied to the Curl function to simulate the exact operation performed by the web browser. In this case, what is being simulated is a PostBack operation triggered by selecting an option in a drop-down list box.
The PostBack JavaScript code looks like this:
function __doPostBack(eventTarget, eventArgument) {
    var theform;
    if (window.navigator.appName.toLowerCase().indexOf("microsoft") > -1) {
        theform = document.Form1;
    }
    else {
        theform = document.forms["Form1"];
    }
    theform.__EVENTTARGET.value = eventTarget.split("$").join(":");
    theform.__EVENTARGUMENT.value = eventArgument;
    theform.submit();
}
For those unfamiliar, what this does is generate a form POST transaction just like what would happen if you clicked on a form Submit button.
This function is called every time you select an option in a drop-down list box:
HTML Code:
<select name="ddl_name" onchange="__doPostBack('ddl_name','')" language="javascript" id="ddl_name" >
    <option value="- Choose Name -">- Choose Name -</option>
    <option value="Adam">Adam</option>
    <option value="Nathan">Nathan</option>
    <option value="George">George</option>
</select>
In a normal POST request, the POST would contain, at the minimum, the Select form element name (for this example it would be ddl_name) and corresponding Option value you selected (for example Adam). But ASP.NET automatically generates some hidden form variables that need to be included in the POST:
<input name="__EVENTTARGET" value="" type="hidden">
<input name="__EVENTARGUMENT" value="" type="hidden">
<input name="__VIEWSTATE" value="dDwyMDE1Nz..." type="hidden">
The __EVENTTARGET and __EVENTARGUMENT values are pretty straightforward; you just need to read the __doPostBack function to see how they are set. The last hidden variable is more interesting. ViewState (__VIEWSTATE) is a feature built into ASP.NET and is defined by Scott Mitchell as:
"Microsoft® ASP.NET view state, in a nutshell, is the technique used by an ASP.NET Web page to persist changes to the state of a Web Form across postbacks."
Basically, it is a serialized representation of all the form variables and their states when the page was sent out from the Web server. You must send this exact value back in the POST request or the Web server will reject it.
So to complete 1 full cycle you will need to undertake a 3-step process:
- Retrieve the initial page
- Parse the page for the ViewState value
- Send a POST request with all the needed form variables including the ViewState
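For the third step, here is a minimal sketch of how the request body might be assembled. The helper name build_postback_body is my own invention, not part of any library, and the "$" to ":" swap simply mirrors what this particular page's __doPostBack does; other ASP.NET pages may not need it.

```php
<?php
// Hypothetical helper (not from any library): builds the urlencoded
// POST body for an ASP.NET postback.
function build_postback_body(string $eventTarget, string $eventArgument, string $viewState, array $fields = []): string
{
    $hidden = [
        // Mirror this page's __doPostBack, which swaps "$" for ":".
        '__EVENTTARGET'   => str_replace('$', ':', $eventTarget),
        '__EVENTARGUMENT' => $eventArgument,
        '__VIEWSTATE'     => $viewState,
    ];
    // http_build_query() urlencodes every key and value, so the
    // ViewState does not need a separate urlencode() call.
    return http_build_query($hidden + $fields, '', '&');
}

// 'dDwyMDE1Nz...' is the truncated placeholder from the page source above,
// not a real ViewState.
$body = build_postback_body('ddl_name', '', 'dDwyMDE1Nz...', ['ddl_name' => 'Adam']);
echo $body, "\n";
```

The resulting string can be passed as CURLOPT_POSTFIELDS. Letting http_build_query do the encoding avoids forgetting to encode one of the values by hand.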
First, to retrieve the initial page you could probably get by with a simple file_get_contents. Try it, but if it doesn't work you will probably need the Curl extension (make sure it is enabled in your php.ini), which has a few tricks up its sleeve. The site I was interacting with required cookies, so I enabled cookies in my Curl function call. Another feature of Curl is the ability to perform user agent spoofing. For more secure sites, you might need to specify a username and password.
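As a rough sketch of those Curl features, the options for the first request might look like the following. The function names, the cookie file path and the user agent string are all placeholders of mine, and CURLOPT_USERPWD only helps if the site uses HTTP authentication rather than a login form.

```php
<?php
// Hypothetical helper: Curl options for the initial GET of the page.
function initial_request_options(string $cookieFile, ?string $user = null, ?string $pass = null): array
{
    $options = [
        CURLOPT_RETURNTRANSFER => true,        // return the page as a string
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_COOKIEFILE     => $cookieFile, // turn on cookie handling, read saved cookies
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies when the handle is released
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-scraper)', // placeholder UA
    ];
    if ($user !== null) {
        // Assumption: the site uses HTTP authentication.
        $options[CURLOPT_USERPWD] = $user . ':' . $pass;
    }
    return $options;
}

// Hypothetical wrapper: returns the page HTML, or false on failure.
function fetch_initial_page(string $url, string $cookieFile)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, initial_request_options($cookieFile));
    return curl_exec($ch);
}
```

You would call fetch_initial_page with your own URL and a writable cookie file path. Reusing the same cookie file (or the same Curl handle) for the later POST lets the server see the session cookie it issued on the first request.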
For the 2nd part, you can probably use the DOM extension or go with the simpler RegEx approach. Below is an example of how I did it:
$result = curl_exec ($ch); // $ch is the handle used to retrieve the initial page
preg_match_all("/name=\"__VIEWSTATE\" value=\"(.*?)\"/", $result, $arr_viewstate);
$viewstate = urlencode($arr_viewstate[1][0]);
Lastly, you plug in the values for the POST variables in the Curl options array like so:
$options = array(
    CURLOPT_RETURNTRANSFER => true, // return web page
    CURLOPT_HEADER => false, // don't return headers
    CURLOPT_FOLLOWLOCATION => true, // follow redirects
    CURLOPT_ENCODING => "", // handle all encodings
    CURLOPT_USERAGENT => "spider", // who am i
    CURLOPT_AUTOREFERER => true, // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
    CURLOPT_TIMEOUT => 120, // timeout on response
    CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => '__EVENTTARGET='.urlencode('ddl_name').'&__EVENTARGUMENT='.urlencode('').'&__VIEWSTATE='.$viewstate.'&ddl_name=Adam'
);
$ch = curl_init( $url );
curl_setopt_array( $ch, $options );
$result = curl_exec ($ch);
That's it! You can now parse your needed data contained in the $result variable. I used the DOM extension for this since it is more tolerant to malformed HTML markup than RegEx.
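The DOM extension can also handle the ViewState parsing step. The following is only a sketch: extract_hidden_fields is a made-up helper name and the HTML string is a cut-down stand-in for the real page. It collects every hidden input rather than just __VIEWSTATE, so it does not depend on the order of the name and value attributes.

```php
<?php
// Hypothetical helper: collect every hidden <input> as name => value.
function extract_hidden_fields(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally so sloppy markup stays quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//input[@type="hidden"][@name]') as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

// Stand-in for the page returned by the first request.
$html = '<form name="Form1"><input name="__EVENTTARGET" value="" type="hidden">'
      . '<input name="__EVENTARGUMENT" value="" type="hidden">'
      . '<input name="__VIEWSTATE" value="dDwyMDE1Nz..." type="hidden"></form>';

$fields = extract_hidden_fields($html);
$fields['__EVENTTARGET'] = 'ddl_name'; // the control that triggers the postback
$fields['ddl_name'] = 'Adam';          // the option being selected
echo http_build_query($fields, '', '&'), "\n";
```

Because all hidden inputs are carried over, any additional hidden field the page emits (for example __EVENTVALIDATION on sites that use it) is sent back along with the ViewState.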
16 comments:
Many thanks, you ended a few hours of hair-tearing frustration.
Oh Microsoft, where would we be without your proprietary bullshit... oh yeah, in a much, much better place.
I have to say I was excited to find this post after working on scraping one of these horrible ASP pages for hours, but unfortunately your method did not work for me. Maybe the page I'm looking at is implemented differently than the one you worked on. I got this error:
Server Error in '/' Application.
Validation of viewstate MAC failed. If this application is hosted by a Web Farm or cluster, ensure that <machineKey> configuration specifies the same validationKey and validation algorithm. AutoGenerate cannot be used in a cluster.
I looked through the data being posted with Wireshark (Firebug works fine also), and when I interacted with the page, I saw that there was info in the cookie that seemed like it might be important. So I tried saving the headers from the original page retrieval and then using CURLOPT_COOKIEFILE in the subsequent POST request. This definitely did something, because I got a different error:
Invalid postback or callback argument. Event validation is enabled using <pages enableEventValidation="true"/> in configuration or <%@ Page EnableEventValidation="true" %> in a page. For security purposes, this feature verifies that arguments to postback or callback events originate from the server control that originally rendered them. If the data is valid and expected, use the ClientScriptManager.RegisterForEventValidation method in order to register the postback or callback data for validation.
However, it seemed like the cookie data I was sending with cURL was not the same as the cookie data from the browser.
I was about to give up when I decided to try a perl module I had seen before, WWW::Mechanize. It simulates browser interaction. I used the submit_form method and it worked like a charm. So far it's the only way I've been able to get the results back for the page I'm working on. This page functions just as your example does, with a select pull-down menu and a postback on clicking an option.
That said, cURL is *way* faster; this takes a few seconds per page I want to get. The module was also a pain to install -- you have to force install of a dependency, HTTP::Response::Encoding, and then force install of WWW::Mechanize, because both fail some of their tests. So I wish the cURL way worked for me.
Hope this will help someone else out there :)
That saved me so much time and frustration. I am in a very similar situation as yours, and this is day #3 I was fighting with it. Thanks for posting your solution!
Hi, this regular expression doesn't always work; try getting the viewstate some other way. Thanks for the post anyway =D
Fantastic post. This REALLY helped me.
Thanks very much for this post. Got me through a tough spot when scraping from an .aspx page that could only be reached after several clicks and selections.
I also ran into the "Invalid postback or callback argument" error and worked past it by emulating the callback posts. Used browser F12 tools to observe the request body. Intermediate curl posts used __EVENTTARGET to ensure the server thought curl was a human at a browser.
Can I get the full source code and an example target website? I am a newbie.. thanks
I am using this code and it's not working for me. Please, any help?
$url = "http://www.riogrande.com/Category/Findings-and-Finished-Jewelry/132/Bails-and-Enhancers/472";
$file=file_get_contents($url);
preg_match("#.*?#mis", $file, $arr_viewstate);
$viewstate = urlencode($arr_viewstate[1]);
$eventvalidation = urlencode($arr_viewstate[2]);
$options = array(
CURLOPT_RETURNTRANSFER => true, // return web page
CURLOPT_HEADER => false, // don't return headers
CURLOPT_ENCODING => "", // handle all encodings
CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7'", // who am i
CURLOPT_AUTOREFERER => true, // set referer on redirect
CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
CURLOPT_TIMEOUT => 1120, // timeout on response
CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
CURLOPT_POST => true,
CURLOPT_VERBOSE => true,
CURLOPT_POSTFIELDS => '__EVENTTARGET='.urlencode('ctl00$ContentPlaceHolderBody$SearchPageNavigationTop$rptPager$ctl01').'&__EVENTARGUMENT='.urlencode('').'&__VIEWSTATE='.$viewstate.'&__EVENTVALIDATION='.$eventvalidation.'&__LASTFOCUS='.urlencode(''));
$ch = curl_init($url);
curl_setopt_array($ch,$options);
$result = curl_exec($ch);
curl_close($ch);
echo $result;
Nice post. I'm working on a forms library to make doPostBack actions easier to deal with. If you like it, leave me a comment.
Nice Post...
but it won't work for me..
please help me regarding scraping of aspx page using curl.
My target site (.aspx pages).
The exact error message is:
Invalid postback or callback argument. Event validation is enabled using <pages enableEventValidation="true"/> in configuration or <%@ Page EnableEventValidation="true" %> in a page. For security purposes, this feature verifies that arguments to postback or callback events originate from the server control that originally rendered them. If the data is valid and expected, use the ClientScriptManager.RegisterForEventValidation method in order to register the postback or callback data for validation.
The data I'm scraping is from county records and is public information
Any insight on how to fix this would be greatly appreciated.
Thanks a lot, it has helped!