In this post we will look at some architectural decisions across the stack when building AI applications.
Building and using AI models is part of a larger system, an architecture stack comprising data, infrastructure, the user interface, application frameworks, integration, maintenance and monitoring, and the people collaborating across all of them.
Data is the foundation of the stack, as it directly impacts how well the models are trained and fine-tuned.
Foundation models can take months to train, so their knowledge cutoff will never be current; the application will need access to newer data, or to proprietary data that was never used for training. If the application embeds and searches proprietary data, that data must be carefully curated, embedded, indexed, and stored in a vector database.
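As a minimal sketch of that retrieval layer, assuming the sentence-transformers and faiss-cpu packages and a tiny in-memory corpus (a production system would typically persist these vectors in a managed vector database):

```python
# Minimal sketch: embed proprietary documents and index them for similarity search.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Q3 revenue grew 12% driven by the new subscription tier.",
    "The returns policy allows refunds within 30 days of purchase.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(documents, normalize_embeddings=True)

# Cosine similarity via inner product on normalized vectors.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

query = model.encode(["What is the refund window?"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query, dtype="float32"), k=1)
print(documents[ids[0][0]], scores[0][0])
```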
Data processing starts with pre-processing activities such as cleansing, normalization, and feature engineering. The cleansed data is then used to train the model, and there will also be separate pipelines for inferencing as users interact with the application.
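As an illustrative sketch, a pre-processing step with pandas and scikit-learn might look like the following; the file and column names here are hypothetical:

```python
# Illustrative pre-processing: cleansing, feature engineering, normalization.
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("transactions.csv")  # hypothetical raw dataset

# Cleansing: drop duplicates and rows missing the target column.
df = df.drop_duplicates().dropna(subset=["amount"])

# Feature engineering: derive a ratio feature from raw columns.
df["amount_per_item"] = df["amount"] / df["item_count"].clip(lower=1)

# Normalization: scale numeric features to zero mean and unit variance.
scaler = StandardScaler()
df[["amount", "amount_per_item"]] = scaler.fit_transform(df[["amount", "amount_per_item"]])
```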
Training and fine-tuning pipelines typically use batch processing, since those jobs can be scheduled without concern for latency, whereas inferencing requests from multiple users usually need near-real-time streaming, as users will not tolerate long latencies. There are still scenarios where batch inferencing makes sense, such as tagging a large collection of images or translating videos.
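A batch inferencing job can be as simple as a scheduled script that walks a collection; in this hedged sketch, `classify_image` stands in for whatever model call the application actually uses:

```python
# Sketch of batch inferencing: tag a folder of images on a schedule,
# where per-item latency does not matter.
from pathlib import Path

def classify_image(path: Path) -> str:
    # Placeholder: a real pipeline would invoke the trained model here.
    return "cat" if "cat" in path.name else "unknown"

results = {p.name: classify_image(p) for p in Path("images").glob("*.jpg")}
print(results)
```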
Infrastructure is typically on the cloud, as AI applications need massive compute resources for training and inferencing. Here one can choose managed serverless instances, where you do not pay for idle time, for certain workloads, or reserved capacity where real-time criticality is high and cold-start delays cannot be tolerated. Most real-world solutions will likely be partially serverless, as GPU instance costs can be optimized that way. If training or fine-tuning is incremental and runs in small batches, serverless can be very efficient; a large batch training job, however, tends to be long running, so serverless may not be suitable, and dedicated CPU/GPU machines on the cloud will be more reliable. For intermittent user traffic with real-time response needs, serverless is again a good option, as instances can spin up on demand and scale dynamically.
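These tradeoffs can be condensed into a rough decision heuristic; the sketch below is illustrative only, and its rules are simplifications of the reasoning in this section, not recommendations:

```python
# Illustrative heuristic encoding the compute tradeoffs described above.
def choose_compute(workload: str, long_running: bool, traffic_is_intermittent: bool) -> str:
    if workload == "training":
        # Long batch training jobs favor dedicated machines;
        # small incremental fine-tuning can run serverless.
        return "dedicated CPU/GPU instances" if long_running else "serverless"
    if workload == "inference":
        # Spiky traffic benefits from scale-to-zero serverless, while
        # latency-critical steady traffic favors reserved capacity to
        # avoid cold starts.
        return "serverless" if traffic_is_intermittent else "reserved instances"
    return "evaluate case by case"

print(choose_compute("training", long_running=True, traffic_is_intermittent=False))
```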
UI design has to be crafted more carefully for Gen AI apps, as we covered in a detailed post previously. Beyond the details in that post, users need ways to deal with the variability of Gen AI responses, interactions that build trust, and nudges from the AI system that put them in the driver's seat: prompts to improve their inputs and to give feedback on AI responses. The UI should incorporate micro-interactions to regenerate answers, tweak prompts, see transparently how the system is generating outputs, and submit feedback. Frameworks like Streamlit and Gradio can be used for prototypes, as they make it very easy to build data-driven web apps, while mature frameworks like React, Vue, and Angular can be leveraged for production-grade AI stacks.
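Here is a minimal Streamlit sketch of those micro-interactions, with `generate_answer` as a placeholder for the real model call:

```python
# Minimal Streamlit sketch: editable prompt, regenerate, and simple feedback.
import streamlit as st

def generate_answer(prompt: str) -> str:
    return f"(model response to: {prompt})"  # placeholder for the real model

prompt = st.text_area("Prompt", "Summarize our returns policy.")
generate = st.button("Generate")
regenerate = st.button("Regenerate")
if generate or regenerate:
    st.session_state["answer"] = generate_answer(prompt)

if "answer" in st.session_state:
    st.write(st.session_state["answer"])
    col1, col2 = st.columns(2)
    if col1.button("👍 Helpful"):
        st.toast("Feedback recorded")  # feed into evaluation data
    if col2.button("👎 Not helpful"):
        st.toast("Feedback recorded")
```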
Apart from traditional AI frameworks such as PyTorch, Keras, and TensorFlow, there are also good Generative AI application frameworks such as LangChain and LlamaIndex. LlamaIndex also offers reusable templates called Llama Packs that can significantly accelerate full-stack Generative AI development by providing ready-to-use, community-tested architecture templates.
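For instance, the canonical LlamaIndex starter pattern looks roughly like this; note that these APIs evolve quickly across versions, and this sketch assumes the `llama_index.core` layout with a default LLM and embedding model (such as OpenAI) configured via environment variables:

```python
# Sketch of the canonical LlamaIndex ingestion-and-query pattern.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # ingest local files
index = VectorStoreIndex.from_documents(documents)     # embed and index them
query_engine = index.as_query_engine()

response = query_engine.query("What does our returns policy say?")
print(response)
```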
AI components may be built in Python with frameworks such as Keras, and these need to integrate with the rest of the application, which in most enterprises is typically built in Java or .Net. If the surrounding system is a legacy application written many years ago in a language that is no longer supported, this integration becomes even more challenging.
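A common way to bridge that gap is a language-neutral HTTP boundary: the Python model is wrapped as a REST service that Java or .Net components call. A minimal sketch with FastAPI, where `predict` is a placeholder for the trained model:

```python
# Sketch of a language-neutral integration boundary around a Python model.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Request(BaseModel):
    text: str

def predict(text: str) -> float:
    return float(len(text) % 2)  # placeholder for the real model

@app.post("/predict")
def predict_endpoint(req: Request) -> dict:
    return {"score": predict(req.text)}

# Run with: uvicorn service:app --port 8000  (assuming this file is service.py)
```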
Apart from the regular software maintenance activities that happen in traditional apps, AI models need to be monitored continuously because the backdrop can change, a phenomenon known as model drift. For example, a healthcare application predicting diseases may need retraining when new diseases and symptoms are identified. Observability tools, and dedicated developers to monitor and maintain the models, should be part of the architecture decisions.
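As one hedged example of such monitoring, a two-sample Kolmogorov-Smirnov test can flag when a live feature distribution drifts away from the training distribution; the data and threshold below are stand-ins:

```python
# Illustrative drift check on a single feature using scipy.
import numpy as np
from scipy.stats import ks_2samp

training_sample = np.random.normal(0.0, 1.0, size=5_000)  # stand-in for logged training data
live_sample = np.random.normal(0.4, 1.0, size=1_000)      # stand-in for recent production inputs

stat, p_value = ks_2samp(training_sample, live_sample)
if p_value < 0.01:  # illustrative threshold
    print(f"Possible drift detected (KS={stat:.3f}); consider retraining.")
```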
Monitoring also includes adherence to regulatory compliance, data privacy, transparency, explainability, and ethical execution, without bias that affects some demographic groups unfairly.
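A simple illustration of such a bias check is comparing positive-outcome rates across demographic groups (a demographic parity gap); the data and threshold here are hypothetical:

```python
# Illustrative bias check: demographic parity gap across groups.
import pandas as pd

preds = pd.DataFrame({
    "group": ["A", "A", "B", "B", "B", "A"],
    "approved": [1, 0, 0, 0, 1, 1],
})

rates = preds.groupby("group")["approved"].mean()
parity_gap = rates.max() - rates.min()
print(rates.to_dict(), f"parity gap = {parity_gap:.2f}")
if parity_gap > 0.2:  # hypothetical review threshold
    print("Flag for review: outcomes differ notably across groups.")
```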
Lastly, people collaboration is also more complex in full-stack AI: data scientists, data engineers, software developers, business sponsors, and regulators all need to work together to ensure the solution is exactly what users need and is delivered in a responsible manner.
It’s a great time to be an enterprise architect: such a person can see the forest view with systems thinking, rather than just one part of the puzzle.