While wandering around the internet, I ended up on an old, heavily visited site. To download a file from it, you have to solve a captcha like this one:
Seeing that picture with digits yet again, I made up my mind. The idea of breaking some captcha had been in my head for a long time :)
I set myself a task: write a script that recognizes the displayed captcha and spits out the precious digits.
I am deliberately not naming the site; you can guess :)
Here we go!
Analyzing the image
First you need to look at as many of these captchas as possible to identify similarities, differences, and patterns. For this purpose I downloaded about 50 captchas. Among them you can pick out the key ones, which show the most variation:
In general, I like peering at numbers, since I once spent a lot of time studying mathematics :)
Looking them over, I worked out the following:
- the picture is black and white (shades of gray), in GIF format
- the size of the picture may vary, but the digits are always centered horizontally (vertically they are not centered)
- a gradient is used, and its direction can go either of 2 ways
- besides the gradient there is an "angular gradient" (that is what I called it, don't kick me :) ), which comes out of a corner at an angle of 45 degrees (again, don't kick me :) ); in my view it is simply a diagonal line
- I found just 6 different fonts (or rather 3; the other 3 are their slanted versions)
- the pixels of all digits are no darker than #606060, but they are not all the same color
- the captcha contains 3 to 5 digits, each no more than 14px tall
Looking for a solution
After half an hour of turning over options in my head, I understood one thing: the image should be cropped, and since the same fonts are always used and they do not change, you can use "prints". By this term I mean that the digits are already stored somewhere in our database, and we need to match them against the picture.
I arrived at this plan:
- set up an array of prints
- crop the picture on all sides, throwing away the excess
- remove the extra colors, that is, the gradient and the angular gradient
- walk through all the pixels from left to right, top to bottom, and if a pixel's color matches the color of the digits (>= #606060), compare it against all the prints in turn
Implementation
- Preparing the prints
In total there are 6 * 10 = 60 of them, and I put them into an array. I made the prints from the digits in the captchas, one set for each font. A print is simply an array of strings, where in each string the letter "x" marks the pixels of the digit.
For example, here is what the digit 2 looks like in the first font:
- Opening the picture
This is done simply with imagecreatefromgif($filename);
- Determining the direction of the gradient
We need to determine which way the gradient faces; this is needed in the following steps. It is simple enough: look at the color of the first pixel (0, 0): $color = imagecolorat($image, 0, 0) < 0x20 ? 'black' : 'white';
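The check above can be sketched as follows. This is a minimal sketch, not the code from the archive: the helper names grayAt() and gradientStart() are my own, and the demo uses a generated palette image instead of a real captcha file. One detail worth knowing: for palette images such as GIFs, imagecolorat() returns a palette index, so the sketch resolves it to RGB with imagecolorsforindex() before comparing.

```php
<?php
// Hypothetical helper: gray level (0-255) of one pixel.
// For palette images imagecolorat() returns a palette index,
// so the index is resolved to RGB components first.
function grayAt($image, int $x, int $y): int
{
    $rgb = imagecolorsforindex($image, imagecolorat($image, $x, $y));
    return intdiv($rgb['red'] + $rgb['green'] + $rgb['blue'], 3);
}

// Same rule as in the text: a dark pixel at (0, 0) means the gradient
// starts from black, otherwise from white.
function gradientStart($image): string
{
    return grayAt($image, 0, 0) < 0x20 ? 'black' : 'white';
}

// Demo on a generated 4x4 palette image (stand-in for a captcha).
$image = imagecreate(4, 4);
$dark  = imagecolorallocate($image, 0x10, 0x10, 0x10); // first color fills the background
$light = imagecolorallocate($image, 0xF0, 0xF0, 0xF0);
$before = gradientStart($image);

imagesetpixel($image, 0, 0, $light);
$after = gradientStart($image);

echo $before, ' -> ', $after, "\n";
```

For a real file, the image would come from imagecreatefromgif($filename) instead of imagecreate().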
- Cleaning the angular gradients
Here we need to clean out the diagonal lines (the angular gradients), and it is best to do this before cropping the captcha. This is where we need to know the direction of the gradient, so that we clean from the correct side. Analysis shows that the color difference between the pixels (1, 1), (2, 2), etc. cannot be more than #202020. To clean here means to paint over with black, because no digit has a color lower than #606060.
We get the following picture:
You can see the PHP code in the attachment (see the link below).
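To make the diagonal cleaning more concrete, here is a rough sketch of how I read the rule. It works on a plain [y][x] array of gray levels rather than on a GD image, and the function name cleanDiagonal() and its parameters are assumptions of this sketch. In particular, the text does not say how the gradient direction maps to a corner, so the starting corner is passed in explicitly.

```php
<?php
// $px is a hypothetical [y][x] grid of gray levels (0-255).
// $fromLeft selects the top corner the diagonal starts from (assumption:
// the caller derives it from the gradient direction).
function cleanDiagonal(array $px, bool $fromLeft, int $maxStep = 0x20): array
{
    $height = count($px);
    $width  = count($px[0]);
    $x      = $fromLeft ? 1 : $width - 2;
    $dx     = $fromLeft ? 1 : -1;
    $prev   = null;

    // Walk (1, 1), (2, 2), ... (or the mirrored path from the right corner).
    for ($y = 1; $y < $height && $x >= 0 && $x < $width; $y++, $x += $dx) {
        $gray = $px[$y][$x];
        // A jump larger than $maxStep means we have left the diagonal line.
        if ($prev !== null && abs($gray - $prev) > $maxStep) {
            break;
        }
        $px[$y][$x] = 0; // paint black: digits are never darker than 0x60
        $prev = $gray;
    }
    return $px;
}

// Demo: a 5x5 grid with a short diagonal, then a much brighter pixel.
$grid = array_fill(0, 5, array_fill(0, 5, 0x40));
$grid[1][1] = 0x80;
$grid[2][2] = 0x90;
$grid[3][3] = 0xF0;
$clean = cleanDiagonal($grid, true);
```

The walk stops at the first pixel whose level jumps by more than 0x20, so anything past that point is left untouched.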
- Cropping the captcha
At this step I cut 12px off the left and the right. Since the digits are no taller than 14px, the excess at the top and bottom is cut off too, by an amount that depends on the height of the whole captcha.
We get:
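The cropping step can be sketched on the same kind of [y][x] gray-level array. The function name cropGrid() is hypothetical, and since the exact rule for the top and bottom margins is not given in the text, the sketch simply takes all four margins as parameters.

```php
<?php
// Cut the given number of pixels from each side of a [y][x] grid.
// The 12px side margins come from the text; the vertical margins are
// left to the caller because the text only says they depend on the height.
function cropGrid(array $px, int $left, int $right, int $top, int $bottom): array
{
    $rows = array_slice($px, $top, count($px) - $top - $bottom);
    return array_map(
        fn(array $row) => array_slice($row, $left, count($row) - $left - $right),
        $rows
    );
}

// Demo: a 40x20 grid cropped to the 14px band that can hold the digits.
$grid    = array_fill(0, 20, array_fill(0, 40, 0));
$cropped = cropGrid($grid, 12, 12, 3, 3);

echo count($cropped[0]), 'x', count($cropped), "\n";
```

When working directly with a GD image, imagecrop() does the same job with an ['x', 'y', 'width', 'height'] rectangle.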
- Cleaning the gradient
Extra gradient stripes still remain on all sides. They need to be cleaned as well. I pass first top to bottom, then left to right, take the color of a stripe, and if it is continuous (length > 10px) and a single color, I treat it as a gradient stripe and clean it.
In the end we get:
In some cases (5%), noise like this may still remain: But it does not bother us :) since its color does not match the color of the digits.
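Here is one possible reading of the stripe cleaning, again on a [y][x] gray-level array. The names cleanRuns(), transpose() and cleanStripes() are made up for this sketch. It blanks every run of one identical gray level longer than 10 pixels, first along rows and then along columns; the real script may order or detect the stripes differently.

```php
<?php
// Blank every horizontal run of one identical gray level longer than $minRun.
function cleanRuns(array $px, int $minRun = 10): array
{
    foreach ($px as $y => $row) {
        $width = count($row);
        $start = 0;
        for ($x = 1; $x <= $width; $x++) {
            if ($x < $width && $row[$x] === $row[$start]) {
                continue; // the run is still going
            }
            if ($x - $start > $minRun) {
                for ($i = $start; $i < $x; $i++) {
                    $px[$y][$i] = 0; // paint the stripe black
                }
            }
            $start = $x;
        }
    }
    return $px;
}

// Swap rows and columns (needs at least two rows to behave as a transpose).
function transpose(array $px): array
{
    return array_map(null, ...$px);
}

function cleanStripes(array $px): array
{
    $px = cleanRuns($px);                        // stripes along rows
    return transpose(cleanRuns(transpose($px))); // stripes along columns
}

// Demo: row 0 is a 12px stripe, row 1 holds short runs that must survive.
$grid = [
    array_fill(0, 12, 0x30),
    [0x30, 0x30, 0x30, 0x30, 0x30, 0xA0, 0xA0, 0x30, 0x30, 0x30, 0x30, 0x30],
    array_fill(0, 12, 0),
];
$out = cleanStripes($grid);
```

Short runs are left alone, which is why leftover noise of a non-digit color can survive this step.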
- Comparing with the prints
I walk through all the pixels from top to bottom, left to right, and for each pixel whose color matches the color of the digits, I compare against all the prints in order.
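The matching idea can be sketched like this. Everything here is illustrative: the two 3x3 prints are invented toy glyphs, not the real prints from the archive, and the functions gridFromArt(), matchesAt() and readDigits() are my own names. For simplicity the sketch tries every position as the top-left corner of a print, instead of only positions where a digit-colored pixel was found, and requires an exact match of the whole rectangle.

```php
<?php
// Build a [y][x] gray-level grid from ASCII art ('x' = digit pixel).
function gridFromArt(array $art): array
{
    return array_map(
        fn(string $line) => array_map(fn(string $c) => $c === 'x' ? 0xC0 : 0, str_split($line)),
        $art
    );
}

// True if the print fits exactly with its top-left corner at ($x0, $y0).
function matchesAt(array $px, array $print, int $x0, int $y0): bool
{
    foreach ($print as $dy => $line) {
        for ($dx = 0; $dx < strlen($line); $dx++) {
            if (!isset($px[$y0 + $dy][$x0 + $dx])) {
                return false; // the print would stick out of the picture
            }
            $isDigit = $px[$y0 + $dy][$x0 + $dx] >= 0x60; // digit color threshold from the text
            if ($isDigit !== ($line[$dx] === 'x')) {
                return false;
            }
        }
    }
    return true;
}

// Scan column by column so that digits come out in reading order.
function readDigits(array $px, array $prints): string
{
    $result = '';
    $height = count($px);
    $width  = count($px[0]);
    for ($x = 0; $x < $width; $x++) {
        for ($y = 0; $y < $height; $y++) {
            foreach ($prints as $digit => $print) {
                if (matchesAt($px, $print, $x, $y)) {
                    $result .= $digit;
                    $x += strlen($print[0]) - 1; // skip past the matched glyph
                    continue 3;                  // next column
                }
            }
        }
    }
    return $result;
}

// Toy prints (hypothetical shapes, only for the demo).
$prints = [
    '7' => ['xxx', '..x', '..x'],
    '1' => ['.x.', '.x.', '.x.'],
];
$grid = gridFromArt([
    '.........',
    '.xxx..x..',
    '...x..x..',
    '...x..x..',
    '.........',
]);
$digits = readDigits($grid, $prints);
echo $digits, "\n";
```

With 60 real prints the structure stays the same; only the $prints array grows.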
Results
Test
To test, I downloaded 200 of these captchas; on my home computer the script analyzed them in ~19 seconds. That is about 10 captchas per second.
Out of 200 there was not a single error; the script worked perfectly :)
Result
I wrote a class, CapCrack, which parses the captcha.
If you want to understand the algorithm in more detail, or test it on your own PC, you can look at the code: cap_crack.zip
I did not stop at this success and decided to try writing a script that downloads files from the site automatically, but that is another story :) worthy of a separate article...
P.S. This is my first post on Habr, so please do not judge too harshly :)