Crawl Google Play App Comments Using requests.post with Python
Last updated: 2024-12-16
When we develop a web crawler, we usually start by checking the request URL, headers, and payload data in DevTools. However, the Google Play website is an interesting case in which we can’t get the comment data that easily.
Let me show you what I’m talking about. The comment data is on the left-hand side, and on the right-hand side we can see that it comes from the batchexecute?rpcidsxxxxxxx URL.
We can then check the request URL, headers, and payload in the Headers (標題) and Payload (承載) tabs in DevTools.
I copied this information and sent a request, but it returned a 400 status code.
Does it only accept the payload data in URL-encoded format? Let’s try converting our payload data from a dictionary to URL-encoded format with the urllib.parse.urlencode function, and then send a POST request to get the comment data again. But it still fails.
So let’s check how the two payloads differ. There is one slight difference: Method 2 has no %5C. In URL encoding, %5C stands for \, but in an ordinary Python string literal \ is the escape character, so \" is stored as just " and the backslash is lost before the payload is encoded. That is why the request fails.
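To make the conversion concrete, here is a minimal sketch of what urllib.parse.urlencode does to a dictionary payload. The keys mirror the real payload, but the values are shortened placeholders chosen for illustration, not real request data.

```python
import urllib.parse

# Placeholder payload: the keys mirror the real request, the values are
# shortened stand-ins (assumptions), not real request data.
payload = {
    'f.req': '[["a"]]',
    'at': 'token:123',
}

# urlencode percent-encodes each key and value and joins the pairs with '&'.
encoded = urllib.parse.urlencode(payload)
print(encoded)
```

Brackets, quotes, and colons are percent-encoded, so the result has the same shape as the payload string shown in DevTools.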
# Method 1
data = 'f.req=%5B%5B%5B%22oCPfdb%22%2C%22%5Bnull%2C%5B2%2C1%2C%5B20%2Cnull%2C%5C%22Cl8KXTAsMTAwMDAwMC4zNTI2NDU3NTQ4LDg3MjcyMjcyOTI5MSwiaHR0cDovL21hcmtldC5hbmRyb2lkLmNvbS9kZXRhaWxzP2lkPXYyOmNvbS5mYnM6MSIsMSxmYWxzZQ%5C%22%5D%2Cnull%2C%5Bnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2C2%5D%5D%2C%5B%5C%22com.fbs%5C%22%2C7%5D%5D%22%2Cnull%2C%22generic%22%5D%5D%5D&at=AI1C5iqPa-CPYK1xNKnGXTLdNUuY%3A1733652497521&'
print(data)
# Method 2
import urllib.parse

data = {
    'f.req': '[[["oCPfdb","[null,[2,1,[20,null,\"Cl8KXTAsMTAwMDAwMC4zNTI2NDU3NTQ4LDg3MjcyMjcyOTI5MSwiaHR0cDovL21hcmtldC5hbmRyb2lkLmNvbS9kZXRhaWxzP2lkPXYyOmNvbS5mYnM6MSIsMSxmYWxzZQ\"],null,[null,null,null,null,null,null,null,null,2]],[\"com.fbs\",7]]",null,"generic"]]]',
    'at': 'AI1C5iqPa-CPYK1xNKnGXTLdNUuY:1733652497521'
}
data = urllib.parse.urlencode(data)
print(data)

# Output of Method 1 (contains %5C):
>> 'f.req=%5B%5B%5B%22oCPfdb%22%2C%22%5Bnull%2C%5B2%2C1%2C%5B20%2Cnull%2C%5C%22Cl8KXTAsMTAwMDAwMC4zNTI2NDU3NTQ4LDg3MjcyMjcyOTI5MSwiaHR0cDovL21hcmtldC5hbmRyb2lkLmNvbS9kZXRhaWxzP2lkPXYyOmNvbS5mYnM6MSIsMSxmYWxzZQ%5C%22%5D%2Cnull%2C%5Bnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2C2%5D%5D%2C%5B%5C%22com.fbs%5C%22%2C7%5D%5D%22%2Cnull%2C%22generic%22%5D%5D%5D&at=AI1C5iqPa-CPYK1xNKnGXTLdNUuY%3A1733652497521&'
# Output of Method 2 (no %5C):
>> 'f.req=%5B%5B%5B%22oCPfdb%22%2C%22%5Bnull%2C%5B2%2C1%2C%5B20%2Cnull%2C%22Cl8KXTAsMTAwMDAwMC4zNTI2NDU3NTQ4LDg3MjcyMjcyOTI5MSwiaHR0cDovL21hcmtldC5hbmRyb2lkLmNvbS9kZXRhaWxzP2lkPXYyOmNvbS5mYnM6MSIsMSxmYWxzZQ%22%5D%2Cnull%2C%5Bnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2Cnull%2C2%5D%5D%2C%5B%22com.fbs%22%2C7%5D%5D%22%2Cnull%2C%22generic%22%5D%5D%5D&at=AI1C5iqPa-CPYK1xNKnGXTLdNUuY%3A1733652497521'
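The missing %5C can be reproduced on a much smaller string. The sketch below takes only the ["com.fbs",7] fragment of the payload and shows how Python's escape handling decides whether the backslash survives; the variable names are just for illustration.

```python
import urllib.parse

# In an ordinary string literal, \" is an escape sequence for a plain quote,
# so the backslash never makes it into the string.
plain = '[\"com.fbs\",7]'

# Writing \\ escapes the backslash itself, so the string really contains \".
escaped = '[\\"com.fbs\\",7]'

print(plain)
print(escaped)

# Percent-encode both; only the second one can produce %5C.
plain_encoded = urllib.parse.quote(plain, safe='')
escaped_encoded = urllib.parse.quote(escaped, safe='')
print(plain_encoded)
print(escaped_encoded)
```

Only the second encoded string contains %5C%22, which matches the fragment found in the Method 1 payload.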
As discussed above, we need to write this string as a raw string literal (prefix it with r) so that Python keeps the \ characters in the string. And it finally works!
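A minimal sketch of that fix, reusing the f.req and at values from Method 2 above. The only change is the r prefix on the f.req literal; the variable names payload and encoded are chosen here for illustration.

```python
import urllib.parse

payload = {
    # The r prefix makes this a raw string literal, so every \" keeps its backslash.
    'f.req': r'[[["oCPfdb","[null,[2,1,[20,null,\"Cl8KXTAsMTAwMDAwMC4zNTI2NDU3NTQ4LDg3MjcyMjcyOTI5MSwiaHR0cDovL21hcmtldC5hbmRyb2lkLmNvbS9kZXRhaWxzP2lkPXYyOmNvbS5mYnM6MSIsMSxmYWxzZQ\"],null,[null,null,null,null,null,null,null,null,2]],[\"com.fbs\",7]]",null,"generic"]]]',
    'at': 'AI1C5iqPa-CPYK1xNKnGXTLdNUuY:1733652497521',
}

encoded = urllib.parse.urlencode(payload)

# The backslashes are now encoded as %5C, as in the payload copied from DevTools.
print('%5C%22com.fbs%5C%22' in encoded)
print(encoded)
```

The encoded string can then be passed as the data argument of requests.post, together with the URL and headers copied from DevTools. The at value is copied from one captured request, so it may not be accepted when reused later.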