Saturday, March 14, 2026

How S&P uses deep web scraping, ensemble learning and a Snowflake architecture to collect 5X more data on SMEs





The world of investing has a significant problem when it comes to data on small and medium-sized enterprises (SMEs). It has nothing to do with data quality or accuracy; there simply is no data at all.

Assessing SME creditworthiness has been notoriously difficult because small-business financial data is not public and therefore very hard to obtain.

S&P Global Market Intelligence, a division of S&P Global and a leading provider of ratings and benchmarks, claims to have solved this long-standing problem. The company's technical team built RiskGauge, an AI-powered platform that crawls otherwise hard-to-find data from over 200 million websites, processes it with numerous algorithms and generates risk scores.

Built on a Snowflake architecture, the platform has increased S&P's SME coverage 5X.

"Our goal was expansion and efficiency," explained Moody Hadi, S&P Global's head of new product development. "The project has improved the accuracy and coverage of the data, benefiting clients."

RiskGauge's underlying architecture

Counterparty credit management essentially assesses a company's creditworthiness and risk based on several factors, including financials, probability of default and risk appetite. S&P Global Market Intelligence provides these insights to institutional investors, banks, insurance companies, wealth managers and others.

"Larger corporations typically extend credit to their suppliers, so they need to be able to monitor their exposure frequently," explained Hadi. "They can rely on third parties to help inform them during credit evaluation."

But SME coverage has long had a gap. Hadi pointed out that while large public companies such as IBM, Microsoft, Amazon, Google and the rest are required to disclose their quarterly financials, private SMEs in the U.S. have no such obligation, which limits financial transparency. From an investor's perspective, consider that there are roughly 10 million SMEs in the U.S., compared to about 60,000 public companies.

S&P Global Market Intelligence claims it now has these covered: the company has significantly expanded its coverage to 10 million active, privately held SMEs in the U.S., excluding sole proprietorships.

The platform, which went into production in January, is built on a system developed by Hadi's team that pulls company data from unstructured web content, combines it with anonymized third-party datasets, and applies machine learning (ML) and advanced algorithms to generate credit scores.

The company uses Snowflake (along with other technology providers) to mine company pages and process them into firmographic drivers (market segmenters), which are then fed into RiskGauge.

The platform's data pipeline consists of the following stages (a minimal sketch follows the list):

  • Crawlers/web scrapers
  • A pre-processing layer
  • Miners
  • Curators
  • RiskGauge scoring
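As a purely illustrative skeleton, the stages could chain together as below. All function names and bodies are placeholder stubs invented for this sketch; they are not S&P's actual code.

```python
# Hypothetical skeleton of the five pipeline stages described above.
# Names and stub bodies are illustrative only, not S&P's implementation.

def crawl(domain: str) -> list[str]:
    """Crawlers/scrapers: fetch raw pages for a company domain."""
    return [f"<html>placeholder page for {domain}</html>"]

def preprocess(pages: list[str]) -> str:
    """Pre-processing layer: strip markup, keep human-readable text."""
    return " ".join(pages)

def mine(text: str) -> dict:
    """Miners: extract firmographic fields (name, sector, location, ...)."""
    return {"description": text[:80]}

def curate(facts: dict) -> dict:
    """Curators: validate and reconcile the mined fields."""
    return facts

def score_risk(profile: dict) -> int:
    """RiskGauge scoring: 1 = highest risk, 100 = lowest risk."""
    return 50  # placeholder score

def run_pipeline(domains: list[str]) -> dict[str, int]:
    return {d: score_risk(curate(mine(preprocess(crawl(d))))) for d in domains}

print(run_pipeline(["example.com"]))
```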

Specifically, Hadi's team uses Snowflake's data warehouse and Snowpark Container Services in the pre-processing, mining and curation stages.
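As an illustration only, reading cleaned page text out of Snowflake with Snowpark for the mining stage might look roughly like this. The connection parameters, table and column names are assumptions for the sketch, not S&P's schema.

```python
# Illustrative Snowpark usage; credentials, table and column names are hypothetical.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<db>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# Read cleaned page text staged by the pre-processing layer...
pages = session.table("CLEAN_PAGES").filter(col("DOMAIN") == "example.com")

# ...and hand the rows to downstream mining logic.
for row in pages.collect():
    print(row.DOMAIN, len(row.PAGE_TEXT))
```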

At the end of the process, SMEs are scored based on a combination of financial, business and market risk; 1 is the highest risk, 100 the lowest. Investors also receive detailed risk reports covering financials, firmographics, business credit reports, historical performance and key developments. They can also compare companies to their peers.
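The article does not disclose the scoring formula. As a purely illustrative sketch, blending the three risk dimensions into a 1-to-100 score could look like the weighted combination below; the weights and mapping are invented for demonstration.

```python
# Purely illustrative: blend financial, business and market risk into a 1-100 score.
# Weights and the mapping are invented; S&P's actual model is not public.

def risk_gauge_style_score(financial: float, business: float, market: float) -> int:
    """Each input is a risk level in [0, 1], where 1.0 means highest risk.
    Returns an integer score where 1 = highest risk and 100 = lowest risk."""
    weights = (0.5, 0.3, 0.2)  # hypothetical weighting of the three dimensions
    blended = weights[0] * financial + weights[1] * business + weights[2] * market
    return max(1, min(100, round(100 * (1 - blended))))

print(risk_gauge_style_score(financial=0.7, business=0.4, market=0.3))  # -> 47
```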

How S&P collects valuable company data

Hadi explained that RiskGauge uses a multi-layered scraping process that pulls various details from a company's web domain, such as the basic "contact us" and landing pages as well as news-related information. The miners go down several layers of URLs to scrape the relevant data.
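A hedged sketch of what "going several URL layers deep" can mean in practice is shown below, using the requests and BeautifulSoup libraries; the depth limit and same-domain restriction are assumptions, not details from S&P.

```python
# Illustrative depth-limited crawler; not S&P's implementation.
# Assumes the `requests` and `beautifulsoup4` packages are installed.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl_domain(start_url: str, max_depth: int = 2) -> dict[str, str]:
    """Fetch pages up to `max_depth` link-layers below the start URL,
    staying on the same domain (e.g. landing and 'contact us' pages)."""
    domain = urlparse(start_url).netloc
    seen, pages = set(), {}
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop()
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages and keep crawling
        pages[url] = resp.text
        soup = BeautifulSoup(resp.text, "html.parser")
        for link in soup.find_all("a", href=True):
            nxt = urljoin(url, link["href"])
            if urlparse(nxt).netloc == domain:
                frontier.append((nxt, depth + 1))
    return pages
```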

"As you can imagine, a human can't do this," said Hadi. "It would be extremely time-consuming for a person, especially when you're dealing with 200 million web pages." He noted that this results in several terabytes of website information.

Once the data is collected, the next step is to run algorithms that remove anything that isn't text; Hadi noted that the system is not interested in JavaScript or even HTML tags. The data is cleaned so that it becomes human-readable rather than code. It is then loaded into Snowflake, and several data miners are run against the pages.
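A minimal sketch of that cleaning step, assuming BeautifulSoup (the article names no specific library): discard script and style blocks and markup, keep only the visible text.

```python
# Illustrative pre-processing step: strip scripts, styles and markup, keep readable text.
from bs4 import BeautifulSoup

def html_to_text(raw_html: str) -> str:
    soup = BeautifulSoup(raw_html, "html.parser")
    # Remove JavaScript and CSS blocks entirely; the pipeline only wants prose.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # Collapse whitespace so the result reads as plain text.
    return " ".join(soup.get_text(separator=" ").split())

print(html_to_text("<html><script>var x=1;</script><p>Acme Corp builds widgets.</p></html>"))
# -> "Acme Corp builds widgets."
```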

Ensemble algorithms are critical to the prediction process; these combine predictions from several individual models (base models, or "weak learners," which are essentially only slightly better than random guessing) to validate company information such as name, business description, sector, location and operational activity. The system also factors in any polarity in sentiment around announcements disclosed on the site.
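A hedged illustration of the ensemble idea, with simple learners voting on a label such as a company's sector, using scikit-learn's VotingClassifier; the synthetic data, features and specific models are invented for the sketch, since S&P does not detail its ensembles.

```python
# Illustrative ensemble voting over weak learners; data and models are synthetic.
# Assumes scikit-learn is installed.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Stand-in for features mined from website text (e.g. keyword counts)
# and a label such as the company's sector.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Several deliberately simple learners vote on the final answer.
ensemble = VotingClassifier(
    estimators=[
        ("stump", DecisionTreeClassifier(max_depth=1, random_state=0)),
        ("nb", GaussianNB()),
        ("logreg", LogisticRegression(max_iter=200)),
    ],
    voting="hard",  # majority vote, no human in the loop
)
ensemble.fit(X, y)
print("ensemble prediction for one company:", ensemble.predict(X[:1]))
```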

"After a site is crawled, the algorithms hit different components of the returned pages, then vote and come back with a recommendation," Hadi explained. "There is no human in the loop in this process; the algorithms are basically competing with each other. That helps with efficiency to increase our coverage."

After that initial load, the system monitors site activity, automatically running weekly scans. It doesn't update information every week, only when it detects a change, Hadi added. When performing subsequent scans, a hash key tracks the landing page from the previous crawl and the system generates a new key; if they are identical, no changes were made and no action is required. If the hash keys don't match, the system is triggered to update the company's information.
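A minimal sketch of that hash-key check, assuming SHA-256 over the cleaned page text (the hash function and storage details are assumptions):

```python
# Illustrative change detection for weekly re-scans; SHA-256 is an assumed choice.
import hashlib

def page_hash(page_text: str) -> str:
    """Fingerprint the cleaned page text from a crawl."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def needs_update(previous_hash: str, current_page_text: str) -> bool:
    """Compare this week's hash to the stored one; identical keys mean no change."""
    return page_hash(current_page_text) != previous_hash

stored = page_hash("Acme Corp builds widgets.")
print(needs_update(stored, "Acme Corp builds widgets."))            # False: no action required
print(needs_update(stored, "Acme Corp builds widgets and gears."))  # True: refresh company info
```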

This continuous scraping is important so the system stays as current as possible. "If they update the site often, that tells us they're alive, right?" Hadi noted.

Challenges with processing speed, massive datasets and messy websites

Of course, there were challenges to overcome in building the system, particularly given the size of the datasets and the need for quick processing. Hadi's team had to make trade-offs to balance accuracy and speed.

"We kept optimizing different algorithms to run faster," he explained. "And tuning; some algorithms we had were really good, with high accuracy, high precision and high recall, but they were too computationally expensive to run."

Websites don't always conform to standard formats, which requires flexible scraping methods.

"You hear a lot about website design with an exercise like this, because when we originally started, we thought, 'Hey, every website should conform to a sitemap or XML,'" said Hadi. "And guess what? Nobody follows that."

Hadi said they didn't want to hard-code or incorporate robotic process automation (RPA) into the system, because websites vary so widely and they knew the most important information was in the text. This led to a system that pulls only the necessary components of a site, then cleans it down to the actual text, discarding code and any JavaScript or TypeScript.

As Hadi put it: "The biggest challenges were around performance and tuning, and the fact that websites, by design, are not clean."
