Basically, the main resource you need is indeed computing power.
On the data side, there are three cases to distinguish:
- training a model,
- simply running an existing model (inference),
- using an existing model with access to additional data.
For training, you need a huge stock of data, possibly enriched with metadata for learning (for example, an image recognition model needs a large stock of images, each with annotations describing its content, generally written by a human, or by another AI and then verified by a human). This data often comes from the Internet, but it is much preferable to copy it locally, both for performance and for reproducibility: a remote image identified only by its URL may change or disappear over time.
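As an illustration, here is a minimal sketch of such a local mirror, assuming a hypothetical manifest of (URL, caption) pairs; content-addressed filenames keep the snapshot reproducible even if the remote originals later change:

```python
import hashlib
import json
import pathlib
import urllib.request

DATA_DIR = pathlib.Path("dataset")   # local mirror directory (assumed layout)
DATA_DIR.mkdir(exist_ok=True)

# Hypothetical manifest: in practice the captions come from humans, or from
# another AI and are then verified by a human.
manifest = [
    {"url": "https://example.org/cat.jpg", "caption": "a cat on a sofa"},
]

records = []
for entry in manifest:
    raw = urllib.request.urlopen(entry["url"]).read()
    digest = hashlib.sha256(raw).hexdigest()
    # Content-addressed filename: a later change to the remote image cannot
    # silently alter the training snapshot.
    (DATA_DIR / f"{digest}.jpg").write_bytes(raw)
    records.append({"file": f"{digest}.jpg", "caption": entry["caption"]})

(DATA_DIR / "annotations.json").write_text(json.dumps(records, indent=2))
```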
To simply run a model, the only data needed is the model itself: a few tens or hundreds of GB depending on its size, to be replicated on every node that serves it.
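Those orders of magnitude follow directly from the parameter count; a quick back-of-the-envelope check, assuming half-precision weights (2 bytes per parameter):

```python
def model_size_gb(n_params: float, bytes_per_param: int = 2) -> float:
    # Weights dominate storage: parameter count times bytes per parameter.
    return n_params * bytes_per_param / 1e9

print(model_size_gb(7e9))    # ~14 GB for a 7B-parameter model
print(model_size_gb(70e9))   # ~140 GB for a 70B-parameter model
```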
To use a model with enrichment, you also have to provide the enrichment data, but it does not necessarily have to be stored in the datacenter: it can be fetched through an external search engine. The idea is that when you ask the model something, instead of simply passing it the raw question and letting it answer alone, a preprocessing step takes the question, fetches documents likely to be relevant to the topic, and sends those documents to the model along with the question, for use in constructing the answer (this is commonly called retrieval-augmented generation, or RAG).

So in theory, you can absolutely make a simple query on a public search engine to fetch documents and use them, without having to store them in the DC. In practice, results are generally better if the service builds its own index, the way a search engine does, because the index can then be optimized for this use. In particular, it allows a custom semantic proximity search over natural language, which can be derived from the AI model itself, rather than being "limited" by the capabilities of a public search engine, which is not always very good with natural-language queries.
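A schematic sketch of that retrieval step, assuming the documents were embedded ahead of time; embed() here is a toy bag-of-words stand-in for the model-derived semantic embedding described above:

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    # Toy bag-of-words hashing; a real service would derive this from the AI model.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v

def build_prompt(question: str, docs: list[str], doc_vecs: np.ndarray, k: int = 3) -> str:
    q = embed(question)
    # Rank indexed documents by cosine similarity to the question.
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    top = np.argsort(sims)[::-1][:k]
    context = "\n\n".join(docs[i] for i in top)
    # The augmented prompt, not the raw question, is what the model receives.
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

docs = ["The Eiffel Tower is in Paris.", "Python is a programming language."]
doc_vecs = np.stack([embed(d) for d in docs])
print(build_prompt("Where is the Eiffel Tower?", docs, doc_vecs, k=1))
```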
In the latter case, in addition to the model, we will therefore store at least the index, and possibly some of the documents used to build it, but not all of them: that would quickly get too big. Instead, we keep a reference, and too bad if a document ends up no longer being accessible; we simply remove it from the index.
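A sketch of what such an index could look like, with field names that are purely illustrative: each entry holds a vector and a reference rather than the document body, and entries whose reference has gone dead are dropped:

```python
import urllib.error
import urllib.request

# Each entry keeps an embedding and a reference, not the full document.
index = [
    {"url": "https://example.org/doc1", "vector": [0.1, 0.7], "snippet": "..."},
]

def still_reachable(url: str) -> bool:
    # HEAD request: cheap liveness check without downloading the document.
    try:
        req = urllib.request.Request(url, method="HEAD")
        return urllib.request.urlopen(req, timeout=5).status < 400
    except (urllib.error.URLError, TimeoutError):
        return False

# Periodically drop entries whose source document is no longer accessible.
index = [entry for entry in index if still_reachable(entry["url"])]
```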