Extracting Web Table Data with an Interface Crawler

I was recently asked to scrape some data shown in a table on a web page, a few hundred records in total. Opening the browser's debugging tools showed that the interface returns an HTML page, so the response can simply be treated as a string. (Parsing the HTML with an XPath-based crawler felt like more trouble than it was worth.) The approach was to use regular expressions to match every table row and then extract the content of each cell, which ran into a few other problems:
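The row-and-cell matching described above might look roughly like the following. This is only a minimal sketch: the function name `extract_rows`, the sample markup, and the choice to strip nested tags and unescape entities are assumptions for illustration, not details taken from the original task.

```python
import html
import re

# Non-greedy patterns; DOTALL lets a row or cell span several lines.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def extract_rows(page: str) -> list[list[str]]:
    """Return the text of every cell, grouped by table row."""
    rows = []
    for row_html in ROW_RE.findall(page):
        cells = []
        for cell_html in CELL_RE.findall(row_html):
            text = TAG_RE.sub("", cell_html)  # drop nested tags such as <b>
            cells.append(html.unescape(text).strip())  # &amp; -> &
        if cells:
            rows.append(cells)
    return rows


# Hypothetical response body, used only to exercise the function.
sample = """
<table>
  <tr><th>Name</th><th>City</th></tr>
  <tr><td>Alice</td><td>Paris &amp; Lyon</td></tr>
  <tr><td><b>Bob</b></td><td>Oslo</td></tr>
</table>
"""
table = extract_rows(sample)
```

Regular expressions like these work for a small, regular table; they are not a general HTML parser and would break on nested tables.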

At first I extracted the content directly, only to find that it contained languages and characters from many different countries, which turned out to be a bit of a pitfall.
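The post does not say exactly what went wrong, so the following only illustrates one plausible version of the pitfall: a cell pattern restricted to ASCII letters silently skips cells written in other scripts, while a character-agnostic pattern applied to properly decoded text keeps them. The sample row and both pattern names are made up for the illustration.

```python
import re

# Matches any cell content, whatever script it is written in.
CELL_RE = re.compile(r"<td[^>]*>(.*?)</td>", re.DOTALL)
# A narrower pattern that only accepts ASCII letters, digits and spaces.
ASCII_ONLY_RE = re.compile(r"<td[^>]*>([A-Za-z0-9 ]*)</td>")

# Hypothetical response bytes; decoding as UTF-8 is an assumption here.
raw = "<tr><td>東京</td><td>Zürich</td><td>Москва</td></tr>".encode("utf-8")
page = raw.decode("utf-8")

all_cells = CELL_RE.findall(page)          # every cell is captured
ascii_cells = ASCII_ONLY_RE.findall(page)  # non-ASCII cells are missed
```

Decoding the response once, with the encoding the server actually uses, and then matching on the decoded string avoids most of this class of problem.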

After capturing the cell rows, I found that the contents of two fields were separated by spaces, and the number of spaces varied. Using the split method with a limit on the size of the resulting array takes care of this.
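In Python, the limited split could be sketched as below. The post does not name the language it used, so treat this as an assumption; the helper `split_fields` and the sample lines are hypothetical. Passing `None` as the separator makes `str.split` treat any run of whitespace as one delimiter, and the second argument caps the number of splits so that spaces inside the last field are left alone.

```python
def split_fields(line: str, field_count: int) -> list[str]:
    """Split a line into at most field_count fields on runs of whitespace."""
    # strip() first: with a split limit, trailing spaces would otherwise
    # stay attached to the last field.
    return line.strip().split(None, field_count - 1)


# Two fields separated by an unknown number of spaces; the second field
# itself contains single spaces that must be preserved.
two = split_fields("  1024     New York City  ", 2)
three = split_fields("a   b   c d", 3)
```

Here `two` holds the id and the full city name as two items, and `three` keeps `c d` together as the final field.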